Benford's Law

Purpose

This notebook takes an iterable of numbers and plots the frequency of their leading digits. According to Benford's Law (also called the first-digit law), a "natural" dataset, especially one that spans several orders of magnitude, should show the following distribution of leading digits:

| d | P(d) |
|---|------:|
| 1 | 30.1% |
| 2 | 17.6% |
| 3 | 12.5% |
| 4 | 9.7% |
| 5 | 7.9% |
| 6 | 6.7% |
| 7 | 5.8% |
| 8 | 5.1% |
| 9 | 4.6% |
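
The percentages in the table come from the standard Benford formula, P(d) = log10(1 + 1/d). The following is a minimal sketch that reproduces them; the function name `benford_expected` is a hypothetical helper and not part of the notebook.

```python
import math

def benford_expected():
    """Return {digit: probability} for leading digits 1-9 under Benford's Law."""
    # P(d) = log10(1 + 1/d); the nine probabilities sum to 1.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

expected = benford_expected()
for d, p in expected.items():
    # Print each probability as a percentage with one decimal place.
    print(f"{d}: {p:.1%}")
```

Keeping the expected values in a dictionary keyed by digit makes it straightforward to compare them with observed frequencies later.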

Application

In data science, deviations from this pattern are used to flag possible fraud, primarily in tax data. The same idea can also be used to detect deepfakes or altered images.
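
One way to turn "deviation from the pattern" into a number is a chi-square goodness-of-fit statistic on the leading-digit counts. The sketch below is an assumption about how such a check might look, not the method used in this notebook; `benford_chi_square` is a hypothetical name, and the inputs are assumed to be nonzero integers.

```python
import math
from collections import Counter

def benford_chi_square(values):
    """Chi-square statistic of leading-digit counts against Benford's Law.

    `values` is assumed to hold integers; zeros are ignored. Larger
    results mean a larger departure from the expected distribution.
    """
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    observed = Counter(digits)
    total = len(digits)
    stat = 0.0
    for d in range(1, 10):
        # Expected count for digit d under Benford's Law.
        expected = total * math.log10(1 + 1 / d)
        stat += (observed[d] - expected) ** 2 / expected
    return stat

# Powers of 2 are a well-known sequence whose leading digits follow Benford's Law closely.
benford_like = [2 ** n for n in range(200)]
# The integers 1..999 have exactly uniform leading digits, so they should score far worse.
uniform_like = list(range(1, 1000))

print(benford_chi_square(benford_like))
print(benford_chi_square(uniform_like))
```

A large statistic only signals that a dataset deserves a closer look; many legitimate datasets, such as those confined to a narrow range, do not follow Benford's Law either.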

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:

world = pd.read_csv('world_population_data.csv')
world.head()

Out[2]:

	Country	Population_2020	Yearly_Change	Net_Change	Density	Land_Area	Migrants	Fert_Rate	Med_Age	Urban_Pop	World_Share
0	China	1,439,323,776	0.39%	5,540,090	153	9,388,211	-348,399	1.7	38	61%	18.47%
1	India	1,380,004,385	0.99%	13,586,631	464	2,973,190	-532,687	2.2	28	35%	17.70%
2	United States	331,002,651	0.59%	1,937,734	36	9,147,420	954,806	1.8	38	83%	4.25%
3	Indonesia	273,523,615	1.07%	2,898,047	151	1,811,570	-98,955	2.3	30	56%	3.51%
4	Pakistan	220,892,340	2.00%	4,327,022	287	770,880	-233,379	3.6	23	35%	2.83%

In [3]:

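def digit_widget(values):
    """Plot the relative frequency of leading digits (1-9) in an iterable."""
    number_stash = []
    for num in values:
        # Use the first character that is a digit 1-9. This skips signs,
        # currency symbols, leading zeros, and non-numeric entries such as 'nan'.
        leading_digit = next((ch for ch in str(num) if ch in '123456789'), None)
        if leading_digit is not None:
            number_stash.append(int(leading_digit))
    if not number_stash:
        print('No leading digits found.')
        return
    # Share of values starting with each digit, so the bars sum to 1.
    # (A 9-bin density histogram over the digit labels scales the bars by
    # the bin width and shifts them when a digit is absent from the data.)
    shares = [number_stash.count(d) / len(number_stash) for d in range(1, 10)]
    fig, ax = plt.subplots()
    ax.bar(range(1, 10), shares)
    ax.set_xticks(range(1, 10))
    ax.set_yticks([0.10, 0.20, 0.30])
    ax.set_xlabel('Leading digit')
    ax.set_ylabel('Share of values')
    plt.show()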

In [4]:

digit_widget(world['Population_2020'])

In [5]:

digit_widget(world['Migrants'])

In [6]:

digit_widget(world['Net_Change'])