I was doing my usual weekly ego-search for Gramener when I found Gramener's employee provident fund records.
What's interesting is that the page lets you download the full list for the entire city of Hyderabad. Which got me wondering... could we do this for other cities? It turns out that there's a full list of offices that we can pull data from.
A simple set of curl
statements like
curl -s -o DELHINORTH.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=DELHINORTH
curl -s -o DEHRADUN.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=DEHRADUN
curl -s -o HALDWANI.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=HALDWANI
...
gets you 146 MB of EPF data for every single company. Or, you could scrape it in Python.
import pandas as pd
dfs = pd.read_html('http://esewa.epfoservices.in/ecr_dashboard/summary_of_reg_estt.php', infer_types=False, skiprows=2)
dfs[0].head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | 1 | 1 | RO | DL | CPM | DELHINORTH | 26752 | 157184 | 17 | 0.06 | 15713 | 58.74 | 209 |
3 | 2 | 1 | RO | UK | DDN | DEHRADUN | 4546 | 36406 | 2 | 0.04 | 3192 | 70.22 | 26 |
4 | 3 | 1 | SRO | UK | HLD | HALDWANI | 2231 | 21458 | 2 | 0.09 | 1676 | 75.12 | 18 |
5 | 4 | 1 | SRO | DS | SHD | LAXMINAGAR | 6559 | 36083 | 3 | 0.05 | 3811 | 58.1 | 70 |
6 | 5 | 1 | RO | DS | NHP | DELHISOUTH | 13915 | 102883 | 10 | 0.07 | 8282 | 59.52 | 166 |
import os.path
from urllib import urlretrieve
def get_epf(offices, path):
data = []
base = 'http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name='
# Weird workaround because pd.read_excel does not support encoding_override
xl = pd.ExcelFile.__new__(pd.ExcelFile)
for office in offices:
excel_file = os.path.join(path, office) + '.xls'
if not os.path.exists(excel_file):
urlretrieve(base + office, excel_file)
try:
book = xlrd.open_workbook(excel_file, encoding_override='cp1252')
except UnicodeDecodeError:
book = xlrd.open_workbook(excel_file, encoding_override='utf8')
setattr(xl, 'book', book)
df = xl.parse('Sheet 1', skiprows=2)
df['Office'] = office
data.append(df)
return pd.concat(data, ignore_index=True)
path = 'd:/site/gramener.com/viz/epf/establishments/'
datafile = os.path.join(path, 'data.csv')
if os.path.exists(datafile):
data = pd.read_csv(datafile, encoding='cp1252')
else:
data = get_epf(offices=dfs[0][5][:120], path=path)
data.to_csv(datafile, index=False, encoding='cp1252')
data.sort('On the basis of ECR (wage month Feb-2013 onwards)', ascending=False).head(20)[['Office', 'Estt. Name', 'On the basis of ECR (wage month Feb-2013 onwards)']]
Office | Estt. Name | On the basis of ECR (wage month Feb-2013 onwards) | |
---|---|---|---|
315758 | BANDRAMUMBAI-I | TATA CONSULTANCY SERVICES LIMITED | 249660 |
265395 | PARKSTREET | LARSEN - TOUBRO LTD. (ECC CONSTRUCTION GROUP) | 157831 |
0 | DELHINORTH | LARSEN - TOUBRO LTD | 149374 |
203666 | CHENNAI | COGNIZANT TECHNOLOGY SOLUTIONS INDIA PRIVATE LTD | 138108 |
315759 | BANDRAMUMBAI-I | LARSEN - TURBO LTD ECC | 136442 |
129643 | HYDERABAD | A P S R T C | 131587 |
187848 | BOMMASANDRA | ADECCO INDIA PVT. LTD. | 130769 |
187849 | BOMMASANDRA | M/S INFOSYS LIMITED | 126704 |
173884 | BANGALORE | TEAM LEASE SERVICES PVT LTD | 107921 |
315760 | BANDRAMUMBAI-I | ACCENTURE SERVICES PVT. LTD. | 101822 |
315761 | BANDRAMUMBAI-I | SHAPOORJI PALLONJI - CO LTD. | 99531 |
95233 | PATNA | SECURITY AND INTELLIGENCE SERVICE (INIDA) LTD. | 95299 |
406613 | AHMEDABAD | LARSEN - TOUBRO LIMITED | 94941 |
198662 | TAMBARAM | M/S.LARSEN - TOUBRO LIMITED, | 94087 |
190766 | KRPURAMWHITEFIELD | IBM INDIA PVT.LTD | 93927 |
129644 | HYDERABAD | LARSEN TOUBRO LTD.,[ECC-CG] | 90508 |
315762 | BANDRAMUMBAI-I | H D F C BANK LTD. | 87726 |
315763 | BANDRAMUMBAI-I | M.S.E.B. | 85844 |
187850 | BOMMASANDRA | M/S WIPRO LIMITED - TECHNOLOGIES GROUP | 85099 |
41293 | GURGAON | HINDUSTAN INSTRUMENT | 83170 |
Now, a dataset like this requires a nice exploratory interface. Here's what I sketched quickly, and a working implementation is at https://gramener.com/companysize/