This notebook takes an iterable of numbers and plots the frequency of their leading digits. According to Benford's Law (also called the first-digit law), a "natural" dataset should show the following distribution of leading digits:

| d | P(d) |
|---|-----:|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |

In data science, this pattern is used to flag possible fraud, primarily for tax purposes. It can also be used to detect deepfakes or altered images.
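
The percentages in the table come from the standard statement of Benford's Law, P(d) = log10(1 + 1/d). As a minimal sketch (the helper name `benford_expected` is hypothetical and not part of the original notebook), they can be reproduced like this:

```python
import math

def benford_expected(digit):
    """Expected share of values whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

# Expected share for each possible leading digit
expected = {d: benford_expected(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The nine shares sum to 1, so they can be compared directly with the observed proportions in a dataset.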
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
world = pd.read_csv('world_population_data.csv')
world.head()
| | Country | Population_2020 | Yearly_Change | Net_Change | Density | Land_Area | Migrants | Fert_Rate | Med_Age | Urban_Pop | World_Share |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | China | 1,439,323,776 | 0.39% | 5,540,090 | 153 | 9,388,211 | -348,399 | 1.7 | 38 | 61% | 18.47% |
| 1 | India | 1,380,004,385 | 0.99% | 13,586,631 | 464 | 2,973,190 | -532,687 | 2.2 | 28 | 35% | 17.70% |
| 2 | United States | 331,002,651 | 0.59% | 1,937,734 | 36 | 9,147,420 | 954,806 | 1.8 | 38 | 83% | 4.25% |
| 3 | Indonesia | 273,523,615 | 1.07% | 2,898,047 | 151 | 1,811,570 | -98,955 | 2.3 | 30 | 56% | 3.51% |
| 4 | Pakistan | 220,892,340 | 2.00% | 4,327,022 | 287 | 770,880 | -233,379 | 3.6 | 23 | 35% | 2.83% |
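def digit_widget(values):
    """Plot the share of each leading digit (1-9) in an iterable of numbers."""
    number_stash = []
    for num in values:
        # Drop a leading minus sign or dollar sign, e.g. '-348,399' -> '348,399'
        text = str(num).lstrip('-$')
        # Skip missing values ('nan'), zeros, and anything not starting with 1-9
        if not text or text[0] not in '123456789':
            continue
        number_stash.append(text[0])

    # Count each digit explicitly so the bar heights are true proportions
    # and digits that never occur still get a (zero-height) bar
    digits = list('123456789')
    total = len(number_stash)
    shares = [number_stash.count(d) / total if total else 0 for d in digits]

    fig, ax = plt.subplots()
    ax.bar(digits, shares)
    ax.set_yticks([0.10, 0.20, 0.30])
    ax.set_xlabel('Leading digit')
    ax.set_ylabel('Share of values')
    plt.show()
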
digit_widget(world['Population_2020'])
digit_widget(world['Migrants'])
digit_widget(world['Net_Change'])
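
Reading the plots by eye only goes so far. One possible way to compare a column against Benford's Law numerically is sketched below. The function `leading_digit_shares` and the `sample` values are hypothetical illustrations, not part of the original notebook. The cleaning rules mirror `digit_widget`: strip a leading `-` or `$`, and skip anything that does not start with 1-9.

```python
import math
from collections import Counter

def leading_digit_shares(values):
    """Return {digit: share} for the leading digits 1-9 found in `values`."""
    counts = Counter()
    for value in values:
        text = str(value).lstrip('-$')
        if text and text[0] in '123456789':
            counts[int(text[0])] += 1
    total = sum(counts.values())
    return {d: (counts[d] / total if total else 0.0) for d in range(1, 10)}

# Made-up sample mixing the formats the notebook handles
sample = ['1,439,323,776', '-348,399', '$120', 'nan', '0', 17, 9.5]
observed = leading_digit_shares(sample)

for d in range(1, 10):
    expected = math.log10(1 + 1 / d)  # Benford's expected share
    print(f"{d}: observed {observed[d]:.1%}, expected {expected:.1%}")
```

With a real column, `leading_digit_shares(world['Population_2020'])` would give the observed shares to set beside the expected ones. A sample this small says nothing about conformity; it only shows the mechanics.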