Employee Provident Fund company data¶

I was doing my usual weekly ego-search for Gramener when I found Gramener's employee provident fund records.

What's interesting is that the page lets you download the full list for the entire city of Hyderabad. Which got me wondering... could we do this for other cities? It turns out that there's a full list of offices that we can pull data from.

A simple set of curl statements like

curl -s -o DELHINORTH.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=DELHINORTH
curl -s -o DEHRADUN.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=DEHRADUN
curl -s -o HALDWANI.xls http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name=HALDWANI
...

gets you 146 MB of EPF data for every single company. Or, you could scrape it in Python.

In [30]:

import pandas as pd

dfs = pd.read_html('http://esewa.epfoservices.in/ecr_dashboard/summary_of_reg_estt.php', infer_types=False, skiprows=2)
dfs[0].head()

Out[30]:

	0	1	2	3	4	5	6	7	8	9	10	11	12
2	1	1	RO	DL	CPM	DELHINORTH	26752	157184	17	0.06	15713	58.74	209
3	2	1	RO	UK	DDN	DEHRADUN	4546	36406	2	0.04	3192	70.22	26
4	3	1	SRO	UK	HLD	HALDWANI	2231	21458	2	0.09	1676	75.12	18
5	4	1	SRO	DS	SHD	LAXMINAGAR	6559	36083	3	0.05	3811	58.1	70
6	5	1	RO	DS	NHP	DELHISOUTH	13915	102883	10	0.07	8282	59.52	166

In [146]:

import os.path
from urllib import urlretrieve

def get_epf(offices, path):
    data = []
    base = 'http://esewa.epfoservices.in/ecr_dashboard/export_establishment_wise_missing_detail.php?office_name='
    
    # Weird workaround because pd.read_excel does not support encoding_override
    xl = pd.ExcelFile.__new__(pd.ExcelFile)
    
    for office in offices:
        excel_file = os.path.join(path, office) + '.xls'
        if not os.path.exists(excel_file):
            urlretrieve(base + office, excel_file)
        try:
            book = xlrd.open_workbook(excel_file, encoding_override='cp1252')
        except UnicodeDecodeError:
            book = xlrd.open_workbook(excel_file, encoding_override='utf8')
        setattr(xl, 'book', book)
        df = xl.parse('Sheet 1', skiprows=2)
        df['Office'] = office
        data.append(df)
    
    return pd.concat(data, ignore_index=True)

path = 'd:/site/gramener.com/viz/epf/establishments/'
datafile = os.path.join(path, 'data.csv')
if os.path.exists(datafile):
    data = pd.read_csv(datafile, encoding='cp1252')
else:
    data = get_epf(offices=dfs[0][5][:120], path=path)
    data.to_csv(datafile, index=False, encoding='cp1252')

Presenting India's largest companies, by employee count¶

In [119]:

data.sort('On the basis of ECR (wage month Feb-2013 onwards)', ascending=False).head(20)[['Office', 'Estt. Name', 'On the basis of ECR (wage month Feb-2013 onwards)']]

Out[119]:

	Office	Estt. Name	On the basis of ECR (wage month Feb-2013 onwards)
315758	BANDRAMUMBAI-I	TATA CONSULTANCY SERVICES LIMITED	249660
265395	PARKSTREET	LARSEN - TOUBRO LTD. (ECC CONSTRUCTION GROUP)	157831
0	DELHINORTH	LARSEN - TOUBRO LTD	149374
203666	CHENNAI	COGNIZANT TECHNOLOGY SOLUTIONS INDIA PRIVATE LTD	138108
315759	BANDRAMUMBAI-I	LARSEN - TURBO LTD ECC	136442
129643	HYDERABAD	A P S R T C	131587
187848	BOMMASANDRA	ADECCO INDIA PVT. LTD.	130769
187849	BOMMASANDRA	M/S INFOSYS LIMITED	126704
173884	BANGALORE	TEAM LEASE SERVICES PVT LTD	107921
315760	BANDRAMUMBAI-I	ACCENTURE SERVICES PVT. LTD.	101822
315761	BANDRAMUMBAI-I	SHAPOORJI PALLONJI - CO LTD.	99531
95233	PATNA	SECURITY AND INTELLIGENCE SERVICE (INIDA) LTD.	95299
406613	AHMEDABAD	LARSEN - TOUBRO LIMITED	94941
198662	TAMBARAM	M/S.LARSEN - TOUBRO LIMITED,	94087
190766	KRPURAMWHITEFIELD	IBM INDIA PVT.LTD	93927
129644	HYDERABAD	LARSEN TOUBRO LTD.,[ECC-CG]	90508
315762	BANDRAMUMBAI-I	H D F C BANK LTD.	87726
315763	BANDRAMUMBAI-I	M.S.E.B.	85844
187850	BOMMASANDRA	M/S WIPRO LIMITED - TECHNOLOGIES GROUP	85099
41293	GURGAON	HINDUSTAN INSTRUMENT	83170

Now, a dataset like this requires a nice exploratory interface. Here's what I sketched quickly, and a working implementation is at https://gramener.com/companysize/