It's always fun to play around with marks distributions. Let's load some dependencies...

In [1]:
import numpy
import pandas as pd
from pylab import *

Let's also make sure that what Pandas prints is displayed in full, rather than truncated.

In [2]:
pd.set_option('display.max_columns',  50)
pd.set_option('display.line_width', 2000)

... and get started with Tamil Nadu SSLC results.

TN SSLC 2012

The data is provided in this structure:

In [3]:
print open('D:/site/gramener.com/viz/edu/TN-10-2012/README.TXT').read()[2600:3700]
---------------------------------------------------------------------------

Structure for S.S.L.C DATABASE : NET10M_S.DBF   Records : 967420
----------------------------------------------------------------------------------
Field  Field Name  Type       Width  
    1  DIST        Character      2 -> Edn.District Code           
    2  REGNO       Character      6 -> Registration No.           
    3  NAME        Character     29            
    4  MARK1       Character      3 -> Language Mark           
    5  MARK2       Character      3 -> English Mark           
    6  MARK3       Character      3 -> Mathematics Mark           
    7  MARK4       Character      3 -> Science Mark           
    8  MARK5       Character      3 -> Social Science Mark           
    9  TOTAL       Character      3 -> Total Mark           
   10  PASS        Character      1 -> Pass "P" indicates passed in examination           
   11  WITHHELD    Character      1 -> Withheld "W" indicates result is withheld                     
   12  BLANK       Character      2            
-------------------------

It's a CSV file without headers. So let's load it.

In [4]:
tn = pd.read_csv('D:/site/gramener.com/viz/edu/TN-10-2012/net10m_s.csv', names=['DIST', 'REGNO', 'NAME', 'LANG', 'ENG', 'MATH', 'SCI', 'SOC', 'MARKS', 'PASS', 'WITHHELD'], index_col=False)
tn.head(10)
Out[4]:
DIST REGNO NAME LANG ENG MATH SCI SOC MARKS PASS WITHHELD
0 1 10001 ANISH M 060 036 047 039 036 218 P NaN
1 1 10002 AJITHKUMAR K 070 048 076 050 061 305 P NaN
2 1 10003 ASHOK S 072 045 070 052 048 287 P NaN
3 1 10004 AJITHSINGH J 067 038 050 050 048 253 P NaN
4 1 10005 ARAVINTH A 056 031 051 026 041 205 NaN NaN
5 1 10006 ANTO S AAA AAA AAA AAA AAA AAA NaN A
6 1 10007 DAVIDRAJ T 053 028 036 037 022 176 NaN NaN
7 1 10008 JAGATHEES S 051 035 035 044 035 200 P NaN
8 1 10009 MANIKANDAN S 051 038 045 035 036 205 P NaN
9 1 10010 MATHEES N 077 045 080 083 057 342 P NaN

We have each student's marks in one row. The district and registration number together probably form the unique key. (It might be the registration number alone, too.) Their marks in Language, English, etc. are given. We also know whether they passed (the PASS column is "P") or failed (it's NaN, i.e. empty). I've no idea what the WITHHELD column holds. Let's ignore it.
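
To verify the unique-key guess, here's a quick sketch on a toy frame standing in for `tn` (only the column names are taken from the data above; two districts can plausibly reuse a registration number):

```python
import pandas as pd

# Toy stand-in for the tn DataFrame
toy = pd.DataFrame({
    'DIST':  [1, 1, 2],
    'REGNO': [10001, 10002, 10001],
})

# True if no (DIST, REGNO) pair appears twice
dist_regno_unique = not toy.duplicated(subset=['DIST', 'REGNO']).any()
# True only if REGNO alone never repeats
regno_unique = not toy.duplicated(subset=['REGNO']).any()

print(dist_regno_unique, regno_unique)  # True False
```

On the toy data, (DIST, REGNO) is unique but REGNO alone is not; running the same two lines on `tn` would settle the question for the real file.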

As always, this contains messy data. The marks are not always numbers. You can see that Anto S in row 5 was absent for the exams, and his marks are shown as "AAA". We'd rather have those as blanks, i.e. NaN. So let's clean it up.

First, let's check what non-numeric values appear besides "AAA".

In [5]:
numpy.sort(tn['LANG'].unique())
Out[5]:
array([nan, '001', '002', '003', '004', '005', '006', '007', '008', '009',
       '010', '011', '012', '013', '014', '015', '016', '017', '018',
       '019', '020', '021', '022', '023', '024', '025', '026', '027',
       '028', '029', '030', '031', '032', '033', '034', '035', '036',
       '037', '038', '039', '040', '041', '042', '043', '044', '045',
       '046', '047', '048', '049', '050', '051', '052', '053', '054',
       '055', '056', '057', '058', '059', '060', '061', '062', '063',
       '064', '065', '066', '067', '068', '069', '070', '071', '072',
       '073', '074', '075', '076', '077', '078', '079', '080', '081',
       '082', '083', '084', '085', '086', '087', '088', '089', '090',
       '091', '092', '093', '094', '095', '096', '097', '098', '099', 'AAA'], dtype=object)

This doesn't look too bad. Other than "AAA", everything else is a number. Let's convert the "AAA"s to NaN.

In [6]:
for column in ['LANG', 'ENG', 'MATH', 'SCI', 'SOC', 'MARKS']:
    tn[column] = tn[column].replace('AAA', numpy.nan).astype(float)
tn.head(10)
Out[6]:
DIST REGNO NAME LANG ENG MATH SCI SOC MARKS PASS WITHHELD
0 1 10001 ANISH M 60 36 47 39 36 218 P NaN
1 1 10002 AJITHKUMAR K 70 48 76 50 61 305 P NaN
2 1 10003 ASHOK S 72 45 70 52 48 287 P NaN
3 1 10004 AJITHSINGH J 67 38 50 50 48 253 P NaN
4 1 10005 ARAVINTH A 56 31 51 26 41 205 NaN NaN
5 1 10006 ANTO S NaN NaN NaN NaN NaN NaN NaN A
6 1 10007 DAVIDRAJ T 53 28 36 37 22 176 NaN NaN
7 1 10008 JAGATHEES S 51 35 35 44 35 200 P NaN
8 1 10009 MANIKANDAN S 51 38 45 35 36 205 P NaN
9 1 10010 MATHEES N 77 45 80 83 57 342 P NaN

Now, to get a quick sense of the failure rates by subject -- here, the share of students scoring at or below the pass mark of 35:

In [13]:
tn_marks = tn[['LANG', 'ENG', 'MATH', 'SCI', 'SOC']].dropna()
1 - (tn_marks > 35).sum().astype(float) / tn_marks.count()
Out[13]:
LANG    0.079699
ENG     0.167786
MATH    0.152373
SCI     0.231818
SOC     0.120569
dtype: float64

Before we start looking at the distribution, let's set some defaults:

In [7]:
rcParams['figure.figsize'] = 25, 10

def distribution(series):
    # Drop the nans, get the frequency, and sort it in ascending order
    marks = series.dropna().value_counts().sort_index()
    # Ensure that all marks from 0 to 100 are present, and draw a bar chart
    pd.Series(marks, index=range(101)).plot(kind='bar')
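
As a sanity check on that reindexing step: `value_counts` drops marks that nobody scored, and rebuilding the series over the index 0..100 puts them back as NaN, so the bar chart keeps a full, evenly spaced axis. A toy run (current pandas assumed, unlike the Py2-era stack above):

```python
import pandas as pd

marks = pd.Series([35, 35, 36, 90])        # toy marks
freq = marks.value_counts().sort_index()   # frequency per mark, ascending
full = pd.Series(freq, index=range(101))   # all marks 0..100; gaps become NaN

print(full[35], full[36], full[90])  # 2.0 1.0 1.0
print(full.isna().sum())             # 98 marks that nobody scored
```

Constructing a Series from another Series with an explicit index conforms it to that index, which is what keeps the 0-100 scale intact even when some marks never occur.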

English

In [8]:
distribution(tn['ENG'])

The English marks distribution looks like a good normal distribution overall, except for the spike at 35 marks. It's easy to figure out what's happening here: 35 is the pass mark. Either the teachers or the board are liberally scooping up students with "just-fail" marks of 34, 33 etc. and awarding them 35 marks.

(We're not saying that this is a good or bad thing. Just that it's happening.)

It's also important to note that not everyone who gets 34 is made to pass. Some do, some don't. So there does appear to be an element of luck in whether a student passes or fails. (Offering prayers for English might not be a bad idea.)
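
One way to put a number on this kind of scooping (a sketch on synthetic marks, not the real TN series -- swap in `tn['ENG']` to try it on the actual data): compare the count at the pass mark of 35 with the counts just below it. An unmoderated distribution would have similar values on both sides, so a large gap signals moderation.

```python
import pandas as pd

# Synthetic marks with a moderation spike at 35: scores of 33-34 were bumped up
marks = pd.Series([30, 31, 32, 35, 35, 35, 35, 35, 36, 37, 38, 40])

counts = marks.value_counts()
at_pass = counts.get(35, 0)                              # students at exactly 35
just_below = counts.reindex([33, 34], fill_value=0).sum()  # "just-fail" marks

print(at_pass, just_below)  # 5 0
```

A spike at 35 with an empty 33-34 band is exactly the English pattern above; the ratio of the two counts gives a rough index of how aggressively marks are being rounded up.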

Social science

In [9]:
distribution(tn['SOC'])

Now, social science marks tell a different story. Almost no one fails between 30 and 35. Further, students scoring between 25 and 30 are also promoted to the pass mark. This appears to be a fairly consistent policy, though it is unclear whether it is the teachers or the Board who decided that the social sciences shall behave this way.

Language

In [10]:
distribution(tn['LANG'])

In the languages (mostly Tamil in Tamil Nadu), there is not much moderation. But as you can see, there isn't much need for it in the first place: more students get 90% than nearly any other mark. Only a handful of students are being made to pass. As with English, this is down to luck. It may be worth praying for the languages as well.

Mathematics

In [11]:
distribution(tn['MATH'])

Unlike the earlier subjects, the extent of moderation in Maths is quite large. As with the social sciences, not many students score between 30 and 34. Further, those scoring between 26 and 30 are given the benefit of the doubt.

Science

In [12]:
distribution(tn['SCI'])

As with mathematics, the extent of moderation is quite high -- in fact, even higher than in mathematics. But does that actually lead to more students passing? Let's look at the percentage of students that fail in each subject.

In [13]:
rcParams['figure.figsize'] = 4, 3
marks = tn[['LANG', 'ENG', 'MATH', 'SCI', 'SOC']]
((marks < 35).sum().astype(float) / marks.count()).order().plot(kind='bar')
rcParams['figure.figsize'] = 25, 10

So despite the extensive moderation in science, the fail ratio is the highest here.

Now let's move on to the next state.

Karnataka marks

In [14]:
ka = pd.read_csv('d:/site/gramener.com/viz/autolyse/data/karnataka-marks-2011-subfields.csv')
ka
Out[14]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 855446 entries, 0 to 855445
Data columns (total 30 columns):
DIST_CODE                                    855446  non-null values
SCHOOL_CODE                                  855446  non-null values
DOB                                          855391  non-null values
NRC_CASTE_CODE                               855446  non-null values
NRC_GENDER_CODE                              855446  non-null values
NRC_MEDIUM                                   855446  non-null values
NRC_PHYSICAL_CONDITION                       855446  non-null values
TOTAL_MARKS                                  841629  non-null values
Hindi 3rd                                    757012  non-null values
Kannada 3rd                                  49814  non-null values
English 3rd                                  24183  non-null values
Arabic 3rd                                   134  non-null values
Urdu 3rd                                     58  non-null values
Sanskrit 3rd                                 5391  non-null values
Science                                      833730  non-null values
Mathematics                                  832720  non-null values
Social Science                               835894  non-null values
Kannada                                      706916  non-null values
Hindi                                        3693  non-null values
Telugu                                       701  non-null values
Marathi                                      16855  non-null values
Indian Social Studies                        81  non-null values
Indian Political Science and Urban Policy    195  non-null values
Indian Economics                             207  non-null values
Tamil                                        1899  non-null values
Urdu                                         33316  non-null values
English                                      56405  non-null values
Sanskrit                                     17661  non-null values
Kannada 2nd                                  80486  non-null values
English 2nd                                  753933  non-null values
dtypes: float64(23), object(7)

English 2nd

In [15]:
distribution(ka['English 2nd'])

Kannada

In [16]:
distribution(ka['Kannada'])

Mathematics

In [17]:
distribution(ka['Mathematics'])

Science

In [18]:
distribution(ka['Science'])

Social science

In [19]:
distribution(ka['Social Science'])

Who scores more? Boys or girls?

In [20]:
ka.groupby('NRC_GENDER_CODE')['TOTAL_MARKS'].mean()
Out[20]:
NRC_GENDER_CODE
B                  303.662583
G                  335.868927
Name: TOTAL_MARKS, dtype: float64

OK, how sure are we that girls score more? A two-sample t-test tells us how likely a gap this large would be if boys and girls actually scored the same on average.

In [21]:
import scipy.stats
boys = ka[ka['NRC_GENDER_CODE'] == 'B']['TOTAL_MARKS']
girls = ka[ka['NRC_GENDER_CODE'] == 'G']['TOTAL_MARKS']
t, prob = scipy.stats.ttest_ind(boys.dropna(), girls.dropna())
prob
Out[21]:
0.0
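
A note on interpretation: strictly, this p-value is the probability of seeing a gap this large if boys and girls scored the same on average, not "the probability boys score more". And with over 800,000 students, even a modest real difference pushes it below floating-point precision, which is why it prints as exactly 0.0. A self-contained sketch on synthetic totals (means chosen to roughly mirror the ones above; written for a current Python/SciPy rather than the Py2-era stack used here):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
# Synthetic totals: same spread, girls' mean 30 marks higher, big samples
boys = rng.normal(loc=305, scale=100, size=100_000)
girls = rng.normal(loc=335, scale=100, size=100_000)

t, prob = scipy.stats.ttest_ind(boys, girls)
print(prob < 1e-100)  # True: the p-value underflows toward zero
```

With samples this large, the t-statistic is enormous and the p-value underflows; the honest summary is "the difference is almost certainly real", not a literal probability of zero.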

ICSE 2013

The data is from https://github.com/octonion/CISCEResults2013. It caused a media ruckus.

In [22]:
icse = pd.read_csv('d:/site/gramener.com/viz/edu/CISCEResults2013/icse_scores.csv', names=['Student', 'Subject', 'Mark'])
icse
Out[22]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 844836 entries, 0 to 844835
Data columns (total 3 columns):
Student    844836  non-null values
Subject    844836  non-null values
Mark       844836  non-null values
dtypes: object(3)

We do have to do a bit of cleanup on the marks:

In [23]:
numpy.sort(icse['Mark'].unique())
Out[23]:
array(['09', '10', '100', '11', '12', '13', '14', '15', '16', '17', '18',
       '19', '20', '24', '25', '26', '27', '28', '29', '30', '35', '38',
       '40', '42', '44', '46', '48', '50', '52', '54', '58', '60', '62',
       '64', '66', '69', '72', '74', '76', '78', '80', '83', '86', '88',
       '90', '92', '94', '95', '96', '97', '98', '99', 'A', 'ABS  *', 'B',
       'C', 'D', 'PCA  *', 'PCNA *', 'SPCA *', 'SPCNA*', 'X', 'XXX'], dtype=object)
In [24]:
icse['Mark'] = icse['Mark'].replace({'A': '', 'ABS  *': '', 'B': '', 'C': '', 'D': '', 'PCA  *': '', 'PCNA *': '', 'SPCA *': '', 'SPCNA*': '', 'X': '', 'XXX': ''}).replace('', numpy.nan).astype(float)
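
As an aside, instead of enumerating every junk string, `pd.to_numeric` with `errors='coerce'` turns anything unparseable into NaN in one call. (A sketch on a toy series; `to_numeric` postdates the pandas this notebook was written against.)

```python
import pandas as pd

# Toy sample of the raw Mark column
raw = pd.Series(['09', '100', 'ABS  *', 'PCNA *', '87', 'XXX'])

# errors='coerce' turns every non-numeric string into NaN
mark = pd.to_numeric(raw, errors='coerce')

print(mark.tolist())  # [9.0, 100.0, nan, nan, 87.0, nan]
```

This has the added safety that any junk value we didn't spot in the `unique()` listing is coerced too, rather than raising on `astype(float)`.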

Now let's take a look at the overall mark distribution across subjects.

In [25]:
distribution(icse['Mark'])

That's definitely weird. Let's break it up by subject.

In [26]:
icse['Subject'].value_counts().head(20)
Out[26]:
Result    105644
ENG       105612
HCG       105608
SUPW      105592
MAT       102399
SCI        93756
HIN        69364
CTA        62564
EAS        13006
BEN        11997
CAS        11418
CST         8290
PED         7481
MAL         6350
ECO         5486
KAN         4580
HSC         4090
TEL         3255
ORI         3189
ART         2816
dtype: int64

English marks

In [27]:
distribution(icse[icse['Subject'] == 'ENG']['Mark'])

History, Civics, Geography

In [28]:
distribution(icse[icse['Subject'] == 'HCG']['Mark'])

Mathematics

In [29]:
distribution(icse[icse['Subject'] == 'MAT']['Mark'])

Science

In [30]:
distribution(icse[icse['Subject'] == 'SCI']['Mark'])

Hindi

In [31]:
distribution(icse[icse['Subject'] == 'HIN']['Mark'])

CBSE Class XII, 2013

The data is from https://github.com/octonion/CISCEResults2013.

In [32]:
cbse = pd.read_csv('d:/site/gramener.com/viz/edu/CISCEResults2013/cbse_scores_revised.csv', names=['ROLLNO', 'SUB_ID', 'SUBJECT', 'MARK', 'GRADE', 'SOMETHING'])
cbse.head()
Out[32]:
ROLLNO SUB_ID SUBJECT MARK GRADE SOMETHING
0 1699000 301 ENGLISH CORE 043 D1 False
1 1699000 302 HINDI CORE 072 B2 False
2 1699000 42 PHYSICS 047 D2 False
3 1699000 43 CHEMISTRY 056 D1 False
4 1699000 83 COMPUTER SCIENCE 048 D2 False

There are a variety of subjects in class XII that one could choose from. It's quite a long tail, actually, as you can see below.

In [33]:
pd.Series(cbse['SUBJECT'].value_counts()).head(50).plot(kind='bar')
Out[33]:
<matplotlib.axes.AxesSubplot at 0x29a303c8>

The marks are messed up, as always. We'll need to clean them up, but this is worse than usual.

In [34]:
cbse['MARK'].unique()
Out[34]:
array(['043', '072', '047', '056', '048', '---', '014 F', nan, '061',
       '065', '055', '053', '050', '016F', '022F', '002F', '011FT',
       '016FT', '023 FT', '073', '080', '052', '070', '008 F', '059',
       '035FT', '049', '013 F', '036', '076', '058', '057', '064', '046',
       '045', '031 FT', '067', '054', '060', '083', '077', '033', '043 FT',
       '041', '051', '040', '004F', '028FT', '028 FT', '069', '063', '075',
       '085', '081', '018FT', '044', '032 FT', '030 FT', '030FT', '042FT',
       '039FT', '002 F', '026FT', '078', '084', '005F', '033FT', '029FT',
       '066', '010F', '012F', '031FT', '043FT', '036FT', '068', '086',
       '062', '074', '017FT', '071', '034FT', '082', '079', '042', '034',
       '094', '037FT', '041FT', '092', '090', '088', '021F', '014F',
       '027FT', '095', '035', '020F', '098', '009F', '087', '007F',
       '025FT', '024FT', '089', '032FT', '099', '038FT', '006F', '003F',
       '001F', '023FT', '021FT', '000F', '009FT', '038', '014FT', '022FT',
       '091', '008F', '019FT', '011F', '013F', '040FT', '015FT', '013FT',
       '093', '096', '097', '037', '039', '019F', '020FT', '044FT', '015F',
       '018F', '007FT', '017F', '008FT', '005 F', '045FT', '100', '009 F',
       'ABA', '010FT', '024F', '012FT', '018 FT', '004 F', '064FT',
       '010 F', 'AB A', '011 F', '012 F', '049FT', '023F', '001 F',
       '000 F', '015 FT', '033 FT', '044 FT', '035 FT', '025 FT', '014 FT',
       '015 F', '047FT', '046FT', '037 FT', '051FT', '042 FT', '036 FT',
       '007 F', '016 F', '006 F', '034 FT', '040 FT', '021 FT', '026 FT',
       '027 FT', '007 FT', '025F', '048FT', '003 F', '047 FT', '046 FT',
       '038 FT', '066FT', '062FT', '057FT', '060FT', '050FP', '045 FP',
       '022 FT', '026F', '029 FT', '052 FT', '061FT', '024 FT', '045FP',
       '026FP', '029F', '028F', '024FP', '048FP', '023FP', '055FP',
       '050FT', '039 FT', '018 F', '017 F', '034FP', '020 FT', '042FP',
       '020 F', '027 FP', '040FP', '035FP', '031FP', '053FT', '047FP',
       '029FP', '069FT', '030FP', '036FP', '033FP', '048 FT', '045 FT',
       '041 FT', '073FT', '019 FT', '019 F', '039FP', '017 FT', '008 FT',
       '016 FT', '012 FT', '056FT', '022 F', '054FP', '006FT', '011 FT',
       '013 FT', '027F', '043FP', '058FT', '068FT', '037FP', '084FP',
       '028FP', '068FP', '025 FP', '051FP', '057FP', '062FP', '058FP',
       '027 F', '046FP', '070FT', '072FT', '074FT', '074 FT', '059 FT',
       '065FT', '049FP', '024 F', '067FT', '063 FT', '069 FT', '067 FT',
       '062 FT', '068 FT', '064 FT', '077 FT', '070 FT', '071 FT',
       '072 FT', '015 FP', '076 FT', '014 FP', '025FP', '071FT', '075FT',
       '065 FT', '028 FP', '050 FT', '004FT', '051 FP', '023 FP', '026 F',
       '058 FP', '023 F', '060 FT', '061 FT', '021 F', '025 F', '077FT',
       '010 FT', '005FT', '059FP', '059FT', '027FP', '057 FT', '053 FT',
       '032FP', '066 FT', '058 FT', '075 FT', '054FT', '063FT', '056 FT',
       '021 FP', '049 FT', '078FT', '034 FP', '030 FP', '010 FP', '052FT',
       '073 FT', '054 FT', '055 FT', '044 FP', '055FT', '013FP', '054 FP',
       '012 FP', '061FP', '060 FP', '076FT', '011 FP', '024 FP', '029 FP',
       '041FP', '032 FP', '010FP', '009 FT', '013 FP', '016FP', '038FP',
       '044FP', '062 FP', '031 FP', '017FP', '020FP', '019FP', '038 FP',
       '036 FP', '043 FP', '033 FP', '039 FP', '037 FP', '049 FP',
       '040 FP', '035 FP', '046 FP', '052 FP', '047 FP', '051 FT',
       '026 FP', '048 FP', '042 FP', '059 FP', '052FP', '066FP', '028 F',
       '022FP', '053FP', '050 FP', '017 FP', '020 FP', '018FP', '014FP',
       '009FP', '055 FP', '067FP', '016 FP', '018 FP', '064FP', '030F',
       '029 F', '060FP', '064 FP', '056FP', '065FP', '072FP', '041 FP',
       '021FP', '083FP', '012FP', '056 FP', '053 FP', '006 FT', '003FT',
       '004 FT', '070FP', '015FP', '022 FP', '005 FT', '057 FP'], dtype=object)

Let's just take the first 3 characters and process those. If we do that, the only non-numeric strings left to worry about are '---', 'ABA', and 'AB '. Let's make them all NaNs.

In [35]:
cbse['SCORE'] = cbse['MARK'].dropna().str.slice(0,3).replace({'---':numpy.nan, 'ABA':numpy.nan, 'AB ':numpy.nan}).astype(float)
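
An alternative sketch that avoids the fixed-width slice: capture the leading digits with a regex, so '014 F', '016FT', '---' and 'AB A' are all handled by one rule. (Again written against a current pandas; `str.extract` with `expand=False` is the only assumption beyond the column itself.)

```python
import pandas as pd

# Toy sample of the raw MARK column
raw = pd.Series(['043', '014 F', '016FT', '---', 'AB A', '100'])

# Capture a leading run of digits; rows without one become NaN
score = raw.str.extract(r'^(\d+)', expand=False).astype(float)

print(score.tolist())  # [43.0, 14.0, 16.0, nan, nan, 100.0]
```

Unlike the 3-character slice, this doesn't depend on the marks being zero-padded to a fixed width, so a hypothetical '7 F' would still parse correctly.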

Now it's time to look at the distribution of the top subjects.

English

In [36]:
distribution(cbse[cbse['SUBJECT'] == 'ENGLISH CORE']['SCORE'])

Mathematics

In [37]:
distribution(cbse[cbse['SUBJECT'] == 'MATHEMATICS']['SCORE'])