England Pharmacy Dispensing Data EDA

Imports

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import skew 
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Data import

In [2]:
%run -i ../scripts/prepare_data_from_csv.py
en_dispensing_numbers, dispensers, *_= prepare_en_dispensing_numbers()
In [3]:
df_endispno = pd.merge(en_dispensing_numbers, dispensers, how="inner", on="DispenserCode")
In [4]:
# Final df for analyis
df_endispno = df_endispno[['Date', 'DispenserCode', 'DispenserName', 'DispenserPostcode', 
                           'NumberofForms', 'NumberofItems', 'NumberofFormsEPS', 'NumberofItemsEPS', 'NumberofCOVID19HomeDeliveryFees']]

First impression

In [5]:
df_endispno.shape
Out[5]:
(906134, 9)
In [6]:
df_endispno.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 906134 entries, 0 to 906133
Data columns (total 9 columns):
 #   Column                           Non-Null Count   Dtype         
---  ------                           --------------   -----         
 0   Date                             906134 non-null  datetime64[ns]
 1   DispenserCode                    906134 non-null  object        
 2   DispenserName                    906134 non-null  object        
 3   DispenserPostcode                906134 non-null  object        
 4   NumberofForms                    906134 non-null  int64         
 5   NumberofItems                    906134 non-null  int64         
 6   NumberofFormsEPS                 906134 non-null  int64         
 7   NumberofItemsEPS                 906134 non-null  int64         
 8   NumberofCOVID19HomeDeliveryFees  906134 non-null  int64         
dtypes: datetime64[ns](1), int64(5), object(3)
memory usage: 69.1+ MB
In [7]:
df_endispno.describe(include=np.number).T
Out[7]:
count mean std min 25% 50% 75% max
NumberofForms 906134.0 3519.546865 2782.062629 0.0 2158.0 3151.0 4490.0 536176.0
NumberofItems 906134.0 7139.139368 5594.793345 0.0 4331.0 6359.0 9098.0 1038528.0
NumberofFormsEPS 906134.0 1789.062447 2459.829582 0.0 622.0 1541.0 2562.0 533987.0
NumberofItemsEPS 906134.0 3943.677000 5037.041187 0.0 1378.0 3395.0 5629.0 1034933.0
NumberofCOVID19HomeDeliveryFees 906134.0 2.764578 32.440843 0.0 0.0 0.0 0.0 8450.0
In [8]:
for i in ['NumberofForms', 'NumberofItems', 'NumberofFormsEPS', 'NumberofItemsEPS', 'NumberofCOVID19HomeDeliveryFees']:
    print(f'Skewness for {i : <35}: {skew(df_endispno[i])}')
Skewness for NumberofForms                      : 62.245500720700484
Skewness for NumberofItems                      : 55.94727641397043
Skewness for NumberofFormsEPS                   : 88.07204006433865
Skewness for NumberofItemsEPS                   : 74.99998156932371
Skewness for NumberofCOVID19HomeDeliveryFees    : 59.70565648271784

-Findings and TODOs

  • There are some pharmacies with "0" items and forms.
    • [X] I should check those zero item rows to find out more.
  • The number of forms and items are highly skewed. This may be due to "0" items and forms.
    • [X] I should run the describe and skewness functions after removing zero entries.
  • EPS forms and items are more skewed. This may be due to online pharmacies which almost always dispense EPS forms.
    • [X] I should run the skewness function for manual forms and items to compare.
  • [X] I should analyse multiples against independents, and also against online pharmacies.
  • [X] I should check top pharmacies for items.
  • [X] I should make another analysis after removing outliers.

Rows with zero items

In [9]:
zeros = df_endispno[df_endispno['NumberofItems'] == 0]
In [10]:
print(f"zeros.shape: {zeros.shape} \n\
zeros to df_endispno ratio: {zeros.shape[0] / df_endispno.shape[0]}")
zeros.shape: (11422, 9) 
zeros to df_endispno ratio: 0.012605199672454626
In [11]:
print(f"zeros distinct dispensers: {zeros['DispenserCode'].nunique()} \n\
dispensers count: {dispensers.shape[0]} \n\
ratio: {zeros['DispenserCode'].nunique() / df_endispno['DispenserCode'].nunique()}")
zeros distinct dispensers: 2961 
dispensers count: 13231 
ratio: 0.22411444141689374
In [12]:
# moving dates to columns
zerospivoted = zeros.pivot(index='DispenserCode', columns='Date', values='NumberofItems')
In [13]:
# Non-zero cells becomes "1"
zerospivoted.fillna(1, inplace=True)
In [14]:
print(f"df_endispno distinct dates: {df_endispno['Date'].nunique()}")
print(f"number of zerospivoted columns: {len(zerospivoted.columns)}")
df_endispno distinct dates: 77
number of zerospivoted columns: 77
In [15]:
# summing "0" and "1"s across columns
zerospivoted['sum'] = zerospivoted.sum(axis=1)
In [16]:
fig = px.histogram(zerospivoted, x='sum')
fig.show()