Fitting data to univariate distributions with `distfit`¶

What's our goal?¶

You have some data points. Numeric, preferably.

And you want to find out which statistical distribution they might have come from. Classic statistical inference problem.

There are, of course, rigorous statistical methods for this. But maybe you are a busy data scientist. Or a busier software engineer who has been handed this dataset and asked to quickly write an application endpoint, so that another machine learning app can consume synthetic data generated from the distribution that best matches the data.

In short, you don't have a lot of time on hand and want to find a quick method to discover the best-matching distribution that the data could have come from.

Basically, you want to run an automated batch of goodness-of-fit (GOF) tests against a number of candidate distributions and get a quick summary of the results.

You can, of course, write code from scratch that runs the data through standard GOF tests, using, say, the SciPy library, one distribution at a time.

Or you can let a small but useful Python library, `distfit`, do the heavy lifting for you.
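For comparison, the from-scratch route might look roughly like the sketch below. It is not how `distfit` works internally; it only shows the manual loop of fitting each candidate with SciPy and running a Kolmogorov-Smirnov test. The three-distribution shortlist and the seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy import stats

# Reproducible stand-in data (seed chosen arbitrarily for this sketch).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=10.0, size=1000)

# Hypothetical shortlist; any scipy.stats continuous distribution could be added.
candidates = {"norm": stats.norm, "expon": stats.expon, "uniform": stats.uniform}

results = []
for name, dist in candidates.items():
    params = dist.fit(data)  # maximum-likelihood estimates: (shape params..., loc, scale)
    ks_stat, p_value = stats.kstest(data, dist.cdf, args=params)
    results.append((name, ks_stat, p_value))

# A smaller KS statistic means the fitted CDF tracks the empirical CDF more closely.
results.sort(key=lambda r: r[1])
for name, ks_stat, p_value in results:
    print(f"{name:8s} KS={ks_stat:.4f} p={p_value:.4g}")
```

Because the parameters are estimated from the same data that is then tested, the p-values are optimistic. Treat them as a ranking aid rather than as a formal test.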

In [1]:

from distfit import distfit
import numpy as np
import matplotlib.pyplot as plt

Generate test data¶

In [2]:

# Generate test data
data1 = np.random.normal(loc=5.0, scale=10, size=1000)

Initialize model¶

In [3]:

# Initialize model
dist1 = distfit(bins=25,alpha=0.02,stats='ks')

Fit to the data¶

In [4]:

dist1.fit_transform(data1,verbose=1)

Out[4]:

{'model': {'distr': <scipy.stats._continuous_distns.norm_gen at 0x2493c0c5370>,
  'stats': 'ks',
  'params': (5.168141032320424, 10.297680831713478),
  'name': 'norm',
  'model': <scipy.stats._distn_infrastructure.rv_frozen at 0x2493d248ca0>,
  'score': 1.1527914473738086e-07,
  'loc': 5.168141032320424,
  'scale': 10.297680831713478,
  'arg': (),
  'CII_min_alpha': -15.980709757845336,
  'CII_max_alpha': 26.31699182248618},
 'summary':          distr     score  LLE               loc            scale  \
 0         norm       0.0  NaN          5.168141        10.297681   
 1            t       0.0  NaN          5.168624        10.297493   
 2   genextreme       0.0  NaN          1.345935        10.208551   
 3        gamma       0.0  NaN       -804.696532         0.130926   
 4      lognorm       0.0  NaN       -442.600479        447.61915   
 5         beta       0.0  NaN        -56.534935       129.443957   
 6     loggamma       0.0  NaN      -2197.020027       320.587092   
 7     dweibull  0.001945  NaN          5.968244         8.919706   
 8        expon  1.108472  NaN        -25.240552        30.408693   
 9       pareto  1.108472  NaN -710722358.209082  710722332.96853   
 10     uniform  3.228419  NaN        -25.240552        64.035316   
 
                                          arg  
 0                                         ()  
 1                       (5964061.469961431,)  
 2                     (0.24650762422035927,)  
 3                       (6185.654208686727,)  
 4                     (0.02303400880174134,)  
 5   (18.311101389857235, 20.103025094062506)  
 6                       (962.7002556656262,)  
 7                      (1.2805598141142776,)  
 8                                         ()  
 9                       (21661180.97529108,)  
 10                                        ()  ,
 'histdata': (array([0.00117123, 0.00156164, 0.00234246, 0.0039041 , 0.00741778,
         0.01132188, 0.01795884, 0.01991089, 0.03006154, 0.02810949,
         0.03865055, 0.040993  , 0.03474645, 0.0327944 , 0.02928072,
         0.02967113, 0.02147253, 0.01288352, 0.01015065, 0.00936983,
         0.0039041 , 0.00078082, 0.00039041, 0.00078082, 0.00078082]),
  array([-23.95984595, -21.39843333, -18.8370207 , -16.27560808,
         -13.71419545, -11.15278283,  -8.59137021,  -6.02995758,
          -3.46854496,  -0.90713234,   1.65428029,   4.21569291,
           6.77710554,   9.33851816,  11.89993078,  14.46134341,
          17.02275603,  19.58416866,  22.14558128,  24.7069939 ,
          27.26840653,  29.82981915,  32.39123177,  34.9526444 ,
          37.51405702])),
 'size': 1000,
 'alpha': 0.02,
 'stats': 'ks',
 'bins': 25,
 'bound': 'both',
 'distr': 'popular',
 'method': 'parametric',
 'multtest': 'fdr_bh',
 'n_perm': 10000,
 'smooth': None,
 'weighted': True,
 'f': 1.5}
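The `CII_min_alpha` and `CII_max_alpha` entries can be sanity-checked by hand. In this particular output they coincide with the `alpha` and `1 - alpha` quantiles of the fitted normal distribution. The sketch below reproduces them with SciPy from the `loc` and `scale` printed above. That this is the general definition used by `distfit` is an inference from these numbers, not something the output itself states.

```python
from scipy import stats

# Values copied from the 'model' entry in the output above.
loc, scale, alpha = 5.168141032320424, 10.297680831713478, 0.02

fitted = stats.norm(loc=loc, scale=scale)

# Quantiles at alpha and 1 - alpha (bound='both' in the output above).
lower, upper = fitted.ppf([alpha, 1 - alpha])
print(lower, upper)
```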

Plot¶

In [5]:

dist1.plot(verbose=1)

Out[5]:

(<Figure size 720x576 with 1 Axes>,
 <AxesSubplot:title={'center':'\nnorm\nloc=5.17, scale=10.30'}, xlabel='Values', ylabel='Frequency'>)

Summary table of fitted distributions¶

In [6]:

dist1.summary

Out[6]:

	distr	score	LLE	loc	scale	arg
0	norm	0.0	NaN	5.168141	10.297681	()
1	t	0.0	NaN	5.168624	10.297493	(5964061.469961431,)
2	genextreme	0.0	NaN	1.345935	10.208551	(0.24650762422035927,)
3	gamma	0.0	NaN	-804.696532	0.130926	(6185.654208686727,)
4	lognorm	0.0	NaN	-442.600479	447.61915	(0.02303400880174134,)
5	beta	0.0	NaN	-56.534935	129.443957	(18.311101389857235, 20.103025094062506)
6	loggamma	0.0	NaN	-2197.020027	320.587092	(962.7002556656262,)
7	dweibull	0.001945	NaN	5.968244	8.919706	(1.2805598141142776,)
8	expon	1.108472	NaN	-25.240552	30.408693	()
9	pareto	1.108472	NaN	-710722358.209082	710722332.96853	(21661180.97529108,)
10	uniform	3.228419	NaN	-25.240552	64.035316	()
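Several rows tie at a score of 0.0 even though their parameters look nothing like `loc=5, scale=10`. A likely reason is that flexible families can collapse into an almost normal shape: a `t` distribution with millions of degrees of freedom is practically normal, and so is a gamma distribution with a very large shape parameter. The sketch below checks this for the gamma row using the standard moment formulas for SciPy's `(a, loc, scale)` parameterization. The inputs are the rounded values as printed in the table.

```python
import math

# (shape, loc, scale) for the gamma row, copied from the summary table (rounded as printed).
gamma_a, gamma_loc, gamma_scale = 6185.654208686727, -804.696532, 0.130926

# Gamma moments in SciPy's parameterization:
#   mean = loc + a * scale,  std = sqrt(a) * scale,  skewness = 2 / sqrt(a)
gamma_mean = gamma_loc + gamma_a * gamma_scale
gamma_std = math.sqrt(gamma_a) * gamma_scale
gamma_skew = 2 / math.sqrt(gamma_a)

print(f"gamma: mean={gamma_mean:.2f}, std={gamma_std:.2f}, skew={gamma_skew:.3f}")
```

The mean and standard deviation land close to the `norm` row, and the skewness is near zero, so on this dataset the fitted gamma is hard to tell apart from the fitted normal.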

Using `SciPy` distributions internally¶

In [7]:

dist1.distributions

Out[7]:

[<scipy.stats._continuous_distns.norm_gen at 0x2493c0c5370>,
 <scipy.stats._continuous_distns.expon_gen at 0x2493c0c8f10>,
 <scipy.stats._continuous_distns.pareto_gen at 0x2493c147280>,
 <scipy.stats._continuous_distns.dweibull_gen at 0x2493c0e4310>,
 <scipy.stats._continuous_distns.t_gen at 0x2493c147100>,
 <scipy.stats._continuous_distns.genextreme_gen at 0x2493c0ee970>,
 <scipy.stats._continuous_distns.gamma_gen at 0x2493c0fa9d0>,
 <scipy.stats._continuous_distns.lognorm_gen at 0x2493c11a5b0>,
 <scipy.stats._continuous_distns.beta_gen at 0x2493c0c59a0>,
 <scipy.stats._continuous_distns.uniform_gen at 0x2493c144700>,
 <scipy.stats._continuous_distns.loggamma_gen at 0x2493c11ad90>]

Generate synthetic data too¶

In [8]:

dist1.generate(10,verbose=1)

Out[8]:

array([15.67374831, 13.91320123,  2.56643945, 22.87063749, -1.26082996,
        8.06140318, 12.93001425,  4.26721857, -3.99528754, 10.23653794])
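The same kind of sample can be drawn without `distfit` once the fitted parameters are known. The sketch below freezes a SciPy normal distribution with the `loc` and `scale` reported earlier and samples from it. It makes no claim about how `generate` is implemented, and the `random_state` value is an arbitrary choice.

```python
from scipy import stats

# Fitted parameters copied from the 'model' entry shown earlier.
loc, scale = 5.168141032320424, 10.297680831713478
fitted = stats.norm(loc=loc, scale=scale)

# random_state makes the draw repeatable; 42 is arbitrary.
samples = fitted.rvs(size=10, random_state=42)
print(samples)

# With many draws, the sample moments should approach the fitted ones.
big = fitted.rvs(size=100_000, random_state=42)
print(big.mean(), big.std())
```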

Could be tricky if the shapes are close, especially with a small dataset¶

In [9]:

data2 = np.random.beta(a=2.2,b=2.0,size=500)
dist2 = distfit(bins=50,alpha=0.02,stats='ks')
dist2.fit_transform(data2,verbose=1)
dist2.plot(title="Best-fitted with 500 data points",verbose=1)

Out[9]:

(<Figure size 720x576 with 1 Axes>,
 <AxesSubplot:title={'center':'Best-fitted with 500 data points\ngenextreme\nc=0.37, loc=0.46, scale=0.22'}, xlabel='Values', ylabel='Frequency'>)

In [10]:

data2 = np.random.beta(a=2.2,b=2.0,size=5000)
dist2 = distfit(bins=50,alpha=0.02,stats='ks')
dist2.fit_transform(data2,verbose=1)
dist2.plot(title="Best-fitted with 5000 data points",verbose=1)

Out[10]:

(<Figure size 720x576 with 1 Axes>,
 <AxesSubplot:title={'center':'Best-fitted with 5000 data points\nbeta\na=2.17, b=1.97, loc=0.01, scale=0.99'}, xlabel='Values', ylabel='Frequency'>)
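One rough way to see why sample size matters is to compare how far apart the two candidate CDFs are with how fine a difference a KS test can resolve at a given `n`. The sketch below makes two assumptions. It uses the rounded `genextreme` parameters from the 500-point plot title as the rival. It also uses the textbook asymptotic 5% critical value of about `1.358 / sqrt(n)`, which is only approximate when parameters are estimated from the data.

```python
import math

import numpy as np
from scipy import stats

true_dist = stats.beta(a=2.2, b=2.0)  # distribution the data was drawn from
# Rival: rounded values from the 500-point plot title (an approximation).
rival = stats.genextreme(c=0.37, loc=0.46, scale=0.22)

# Largest vertical gap between the two CDFs on a fine grid over [0, 1].
x = np.linspace(0.0, 1.0, 2001)
cdf_gap = float(np.max(np.abs(true_dist.cdf(x) - rival.cdf(x))))
print(f"max CDF gap: {cdf_gap:.4f}")

# Approximate 5% KS critical value shrinks like 1 / sqrt(n).
for n in (500, 5000):
    print(f"n={n}: approx. KS critical value {1.358 / math.sqrt(n):.4f}")
```

If the gap falls below the threshold for `n = 500` but above the one for `n = 5000`, that would match the behaviour seen in the two runs above: the smaller sample cannot separate the two shapes, while the larger one can.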

Predict¶

In [11]:

dist2.predict(0.2)

[distfit] >predict..
[distfit] >Multiple test correction..[fdr_bh]

Out[11]:

{'y': array([0.2]),
 'y_proba': array([0.07816971]),
 'y_pred': array(['none'], dtype='<U4'),
 'P': array([0.07816971])}
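The `'none'` label appears to follow from the probability being larger than `alpha=0.02`, so the value 0.2 is not flagged as unusual under the fitted model. The sketch below approximates that reasoning with SciPy, using the rounded beta parameters from the 5000-point plot title. Both the smaller-tail rule and the idea that `y_proba` is a tail probability are assumptions made for this illustration, and the multiple-test correction is skipped because only one value is tested.

```python
from scipy import stats

# Rounded parameters from the 5000-point plot title; alpha as passed to distfit.
fitted = stats.beta(a=2.17, b=1.97, loc=0.01, scale=0.99)
alpha = 0.02
y = 0.2

lower_tail = fitted.cdf(y)  # P(X <= y) under the fitted model
upper_tail = fitted.sf(y)   # P(X > y) under the fitted model

# Assumed two-sided rule for this sketch: use the smaller of the two tails.
p = min(lower_tail, upper_tail)
print(f"p={p:.4f}, flagged={p <= alpha}")
```

Because the inputs are rounded, the result will only be close to the reported `y_proba`, not identical.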
