Suppose that the College Board, a nonprofit organization responsible for administering the SAT (Scholastic Aptitude Test), seeks to increase the rate of high-school graduates who participate in its exams. This project's aim is to make recommendations about which measures the College Board might take in order to achieve that.
The problem we need to solve is to how to make actionable suggestions to the College Board to help them increase the participation rates in their exams. For that we need to perform an exploratory data analysis (EDA) to find appropriate metrics that can be adjusted by the College Board accordingly.
Performing the EDA we must among other things:
The data is based on the SAT and the ACT (which stands for American College Testing and it is administered by another institution, namely, the ACT. Inc) exams from around the United States in 2017.
The data contains:
Average SAT and ACT scores by state (scores for each section of each exam);
Participation rates for both exams by state.
Both SAT and ACT are standardized tests for college admissions and are similar in content type but have some differences in structure. A few relevant differences are:
The ACT has a Science Test and the SAT does not;
There is a SAT Math Section for which the student is not allowed to use a calculator;
The SAT's College Board joins Reading and Writing into one score, the "Evidence-Based Reading and Writing" whereas in the ACT the tests are separated.
We first need to import Python libraries including:
Pandas
, for data manipulation and analysisSciPy
which is a Python-based ecosystem of software for mathematics, science, and engineering.NumPy
which is a library consisting of multidimensional array objects and a collection of routines for processing of array.Statsmodels
which is a Python package that allows users to explore data, estimate statistical models, and perform statistical tests and complements SciPy
;Matplotlib
is a plotting library for the Python and NumPy;Seaborn
is complimentary to Matplotlib and it specifically targets statistical data visualizationPylab
is embedded inside Matplotlib and provides a Matlab-like experience for the user. It imports portions of Matplotlib and NumPy.OBS: This information was taken directly from the documentation.
import scipy
import pandas as pd
import scipy.stats as stats
import numpy as np
import csv
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")
import pytest
import pylab as p
%matplotlib inline
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all" # so we can see the value of multiple statements at once.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
/Users/marcotavora/Applications/anaconda3/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead. from pandas.core import datetools
We can refer to the Pandas module using the "dot notation" to call its methods. To read our data (which is in the form of csv
files), into a so-called DataFrame structure, we use the method read_csv()
and pass in each file name as a string:
sat = pd.read_csv('sat.csv')
sat.head(10)
Unnamed: 0 | State | Participation | Evidence-Based Reading and Writing | Math | Total | |
---|---|---|---|---|---|---|
0 | 0 | Alabama | 5% | 593 | 572 | 1165 |
1 | 1 | Alaska | 38% | 547 | 533 | 1080 |
2 | 2 | Arizona | 30% | 563 | 553 | 1116 |
3 | 3 | Arkansas | 3% | 614 | 594 | 1208 |
4 | 4 | California | 53% | 531 | 524 | 1055 |
5 | 5 | Colorado | 11% | 606 | 595 | 1201 |
6 | 6 | Connecticut | 100% | 530 | 512 | 1041 |
7 | 7 | Delaware | 100% | 503 | 492 | 996 |
8 | 8 | District of Columbia | 100% | 482 | 468 | 950 |
9 | 9 | Florida | 83% | 520 | 497 | 1017 |
act = pd.read_csv('act.csv')
act.head(10)
Unnamed: 0 | State | Participation | English | Math | Reading | Science | Composite | |
---|---|---|---|---|---|---|---|---|
0 | 0 | National | 60% | 20.3 | 20.7 | 21.4 | 21.0 | 21.0 |
1 | 1 | Alabama | 100% | 18.9 | 18.4 | 19.7 | 19.4 | 19.2 |
2 | 2 | Alaska | 65% | 18.7 | 19.8 | 20.4 | 19.9 | 19.8 |
3 | 3 | Arizona | 62% | 18.6 | 19.8 | 20.1 | 19.8 | 19.7 |
4 | 4 | Arkansas | 100% | 18.9 | 19.0 | 19.7 | 19.5 | 19.4 |
5 | 5 | California | 31% | 22.5 | 22.7 | 23.1 | 22.2 | 22.8 |
6 | 6 | Colorado | 100% | 20.1 | 20.3 | 21.2 | 20.9 | 20.8 |
7 | 7 | Connecticut | 31% | 25.5 | 24.6 | 25.6 | 24.6 | 25.2 |
8 | 8 | Delaware | 18% | 24.1 | 23.4 | 24.8 | 23.6 | 24.1 |
9 | 9 | District of Columbia | 32% | 24.4 | 23.5 | 24.9 | 23.5 | 24.2 |
Note that the first columns of both tables seem to be identical to the DataFrame indexes. We can quickly confirm that using an assert
statement. When an assert
statement is encountered, Python evaluates it and if the expression is false raises an AssertionError
exception (from www.tutorialspoint.com).
assert (sat.index.tolist() == sat['Unnamed: 0'].tolist())
assert (act.index.tolist() == act['Unnamed: 0'].tolist())
From the the box above we are allowed to drop the first columns of both DataFrames.
Next, I renamed the column "Evidence-Based Reading and Writing" as "EBRW" to be able to use methods on it via the dot notation.
Dropping also the last column of the SAT table (which is just the sum of the two previous ones) and the last row and renaming the others we obtain the following DataFrame for the SAT scores:
cols_to_keep = ['State', 'Participation', 'Evidence-Based Reading and Writing', 'Math']
sat = sat[cols_to_keep]
sat.columns = ['State', 'Participation', 'EBRW', 'Math']
sat.head()
State | Participation | EBRW | Math | |
---|---|---|---|---|
0 | Alabama | 5% | 593 | 572 |
1 | Alaska | 38% | 547 | 533 |
2 | Arizona | 30% | 563 | 553 |
3 | Arkansas | 3% | 614 | 594 |
4 | California | 53% | 531 | 524 |
As with the SAT DataFrame, we drop last column of the ACT DataFrame (the Composite score is, according to the ACT website, just the average of the four other test scores, rounded to the nearest whole number).
Furthermore, I used the .iloc( )
method to exclude the first row of the ACT frame since it is just a summary row.
cols_to_keep = ['State', 'Participation', 'English', 'Math', 'Reading', 'Science']
act = act.iloc[1:][cols_to_keep]
sat.head()
act.head()
State | Participation | EBRW | Math | |
---|---|---|---|---|
0 | Alabama | 5% | 593 | 572 |
1 | Alaska | 38% | 547 | 533 |
2 | Arizona | 30% | 563 | 553 |
3 | Arkansas | 3% | 614 | 594 |
4 | California | 53% | 531 | 524 |
State | Participation | English | Math | Reading | Science | |
---|---|---|---|---|---|---|
1 | Alabama | 100% | 18.9 | 18.4 | 19.7 | 19.4 |
2 | Alaska | 65% | 18.7 | 19.8 | 20.4 | 19.9 |
3 | Arizona | 62% | 18.6 | 19.8 | 20.1 | 19.8 |
4 | Arkansas | 100% | 18.9 | 19.0 | 19.7 | 19.5 |
5 | California | 31% | 22.5 | 22.7 | 23.1 | 22.2 |
SAT
The table displays three different averages for each state:
ACT
The table displays the following averages for each State:
We can look for problems with the data for example:
info()
describe()
The third item was taken care of. There were no null values but there are outliers as we shall see when we perform the plotting.
df_sat = sat.copy() # making copies to keep original ones intact
df_act = act.copy() # making copies to keep original ones intact
print (df_sat.isnull().any())
print (df_act.isnull().any())
State False Participation False EBRW False Math False dtype: bool State False Participation False English False Math False Reading False Science False dtype: bool
print (sat.dtypes)
print (act.dtypes)
State object Participation object EBRW int64 Math int64 dtype: object State object Participation object English float64 Math float64 Reading float64 Science float64 dtype: object
I will convert the columns 'Participation' into floats
using a function to extract the $\%$ but keeping the scale between 0 and 100. The .replace( )
method takes the argument regex
=True
because type('%')
= str
. The function was based on http://pythonjourney.com/python-pandas-dataframe-convert-percent-to-float/
def perc_into_float(df,col):
return df[col].replace('%','',regex=True).astype('float')
df_sat['Participation'] = perc_into_float(df_sat,'Participation')
df_act['Participation'] = perc_into_float(df_act,'Participation')
df_sat.head(2)
State | Participation | EBRW | Math | |
---|---|---|---|---|
0 | Alabama | 5.0 | 593 | 572 |
1 | Alaska | 38.0 | 547 | 533 |
df_act.head(2)
State | Participation | English | Math | Reading | Science | |
---|---|---|---|---|---|---|
1 | Alabama | 100.0 | 18.9 | 18.4 | 19.7 | 19.4 |
2 | Alaska | 65.0 | 18.7 | 19.8 | 20.4 | 19.9 |
Checking types again:
print (df_sat.dtypes)
print (df_act.dtypes)
State object Participation float64 EBRW int64 Math int64 dtype: object State object Participation float64 English float64 Math float64 Reading float64 Science float64 dtype: object
Answer version 1
Using a list comprehension, the command:
df.set_index('State').to_dict()
creates a dictionary with column names as keys. For example:
print (df_sat.set_index('State').to_dict())
{'Participation': {'Alabama': 5.0, 'Alaska': 38.0, 'Arizona': 30.0, 'Arkansas': 3.0, 'California': 53.0, 'Colorado': 11.0, 'Connecticut': 100.0, 'Delaware': 100.0, 'District of Columbia': 100.0, 'Florida': 83.0, 'Georgia': 61.0, 'Hawaii': 55.0, 'Idaho': 93.0, 'Illinois': 9.0, 'Indiana': 63.0, 'Iowa': 2.0, 'Kansas': 4.0, 'Kentucky': 4.0, 'Louisiana': 4.0, 'Maine': 95.0, 'Maryland': 69.0, 'Massachusetts': 76.0, 'Michigan': 100.0, 'Minnesota': 3.0, 'Mississippi': 2.0, 'Missouri': 3.0, 'Montana': 10.0, 'Nebraska': 3.0, 'Nevada': 26.0, 'New Hampshire': 96.0, 'New Jersey': 70.0, 'New Mexico': 11.0, 'New York': 67.0, 'North Carolina': 49.0, 'North Dakota': 2.0, 'Ohio': 12.0, 'Oklahoma': 7.0, 'Oregon': 43.0, 'Pennsylvania': 65.0, 'Rhode Island': 71.0, 'South Carolina': 50.0, 'South Dakota': 3.0, 'Tennessee': 5.0, 'Texas': 62.0, 'Utah': 3.0, 'Vermont': 60.0, 'Virginia': 65.0, 'Washington': 64.0, 'West Virginia': 14.0, 'Wisconsin': 3.0, 'Wyoming': 3.0}, 'EBRW': {'Alabama': 593, 'Alaska': 547, 'Arizona': 563, 'Arkansas': 614, 'California': 531, 'Colorado': 606, 'Connecticut': 530, 'Delaware': 503, 'District of Columbia': 482, 'Florida': 520, 'Georgia': 535, 'Hawaii': 544, 'Idaho': 513, 'Illinois': 559, 'Indiana': 542, 'Iowa': 641, 'Kansas': 632, 'Kentucky': 631, 'Louisiana': 611, 'Maine': 513, 'Maryland': 536, 'Massachusetts': 555, 'Michigan': 509, 'Minnesota': 644, 'Mississippi': 634, 'Missouri': 640, 'Montana': 605, 'Nebraska': 629, 'Nevada': 563, 'New Hampshire': 532, 'New Jersey': 530, 'New Mexico': 577, 'New York': 528, 'North Carolina': 546, 'North Dakota': 635, 'Ohio': 578, 'Oklahoma': 530, 'Oregon': 560, 'Pennsylvania': 540, 'Rhode Island': 539, 'South Carolina': 543, 'South Dakota': 612, 'Tennessee': 623, 'Texas': 513, 'Utah': 624, 'Vermont': 562, 'Virginia': 561, 'Washington': 541, 'West Virginia': 558, 'Wisconsin': 642, 'Wyoming': 626}, 'Math': {'Alabama': 572, 'Alaska': 533, 'Arizona': 553, 'Arkansas': 594, 'California': 524, 'Colorado': 595, 'Connecticut': 512, 'Delaware': 492, 'District of Columbia': 468, 'Florida': 497, 'Georgia': 515, 'Hawaii': 541, 'Idaho': 493, 'Illinois': 556, 'Indiana': 532, 'Iowa': 635, 'Kansas': 628, 'Kentucky': 616, 'Louisiana': 586, 'Maine': 499, 'Maryland': 52, 'Massachusetts': 551, 'Michigan': 495, 'Minnesota': 651, 'Mississippi': 607, 'Missouri': 631, 'Montana': 591, 'Nebraska': 625, 'Nevada': 553, 'New Hampshire': 520, 'New Jersey': 526, 'New Mexico': 561, 'New York': 523, 'North Carolina': 535, 'North Dakota': 621, 'Ohio': 570, 'Oklahoma': 517, 'Oregon': 548, 'Pennsylvania': 531, 'Rhode Island': 524, 'South Carolina': 521, 'South Dakota': 603, 'Tennessee': 604, 'Texas': 507, 'Utah': 614, 'Vermont': 551, 'Virginia': 541, 'Washington': 534, 'West Virginia': 528, 'Wisconsin': 649, 'Wyoming': 604}}
We obtain the dictionary values using
[cols[i]],
where cols
are the column names and cols[i]
are the keys for $i$ within the range of the number of columns. We then return the elements of this list of dictionaries.
def dict_all(df,cols,n):
return [df.set_index('State').to_dict()[cols[i]] for i in range(1,len(cols))][n]
dsat_part = dict_all(df_sat,df_sat.columns.tolist(),0)
dsat_EBRW = dict_all(df_sat,df_sat.columns.tolist(),1)
dsat_math = dict_all(df_sat,df_sat.columns.tolist(),2)
Example:
print( dsat_EBRW)
{'Alabama': 593, 'Alaska': 547, 'Arizona': 563, 'Arkansas': 614, 'California': 531, 'Colorado': 606, 'Connecticut': 530, 'Delaware': 503, 'District of Columbia': 482, 'Florida': 520, 'Georgia': 535, 'Hawaii': 544, 'Idaho': 513, 'Illinois': 559, 'Indiana': 542, 'Iowa': 641, 'Kansas': 632, 'Kentucky': 631, 'Louisiana': 611, 'Maine': 513, 'Maryland': 536, 'Massachusetts': 555, 'Michigan': 509, 'Minnesota': 644, 'Mississippi': 634, 'Missouri': 640, 'Montana': 605, 'Nebraska': 629, 'Nevada': 563, 'New Hampshire': 532, 'New Jersey': 530, 'New Mexico': 577, 'New York': 528, 'North Carolina': 546, 'North Dakota': 635, 'Ohio': 578, 'Oklahoma': 530, 'Oregon': 560, 'Pennsylvania': 540, 'Rhode Island': 539, 'South Carolina': 543, 'South Dakota': 612, 'Tennessee': 623, 'Texas': 513, 'Utah': 624, 'Vermont': 562, 'Virginia': 561, 'Washington': 541, 'West Virginia': 558, 'Wisconsin': 642, 'Wyoming': 626}
dact_part = dict_all(df_sat,df_sat.columns.tolist(),0)
dact_eng = dict_all(df_act,df_act.columns.tolist(),1)
dact_math = dict_all(df_act,df_act.columns.tolist(),2)
dact_read = dict_all(df_act,df_act.columns.tolist(),3)
dact_sci = dict_all(df_act,df_act.columns.tolist(),4)
Example:
print( dact_part)
{'Alabama': 5.0, 'Alaska': 38.0, 'Arizona': 30.0, 'Arkansas': 3.0, 'California': 53.0, 'Colorado': 11.0, 'Connecticut': 100.0, 'Delaware': 100.0, 'District of Columbia': 100.0, 'Florida': 83.0, 'Georgia': 61.0, 'Hawaii': 55.0, 'Idaho': 93.0, 'Illinois': 9.0, 'Indiana': 63.0, 'Iowa': 2.0, 'Kansas': 4.0, 'Kentucky': 4.0, 'Louisiana': 4.0, 'Maine': 95.0, 'Maryland': 69.0, 'Massachusetts': 76.0, 'Michigan': 100.0, 'Minnesota': 3.0, 'Mississippi': 2.0, 'Missouri': 3.0, 'Montana': 10.0, 'Nebraska': 3.0, 'Nevada': 26.0, 'New Hampshire': 96.0, 'New Jersey': 70.0, 'New Mexico': 11.0, 'New York': 67.0, 'North Carolina': 49.0, 'North Dakota': 2.0, 'Ohio': 12.0, 'Oklahoma': 7.0, 'Oregon': 43.0, 'Pennsylvania': 65.0, 'Rhode Island': 71.0, 'South Carolina': 50.0, 'South Dakota': 3.0, 'Tennessee': 5.0, 'Texas': 62.0, 'Utah': 3.0, 'Vermont': 60.0, 'Virginia': 65.0, 'Washington': 64.0, 'West Virginia': 14.0, 'Wisconsin': 3.0, 'Wyoming': 3.0}
Answer version 2
def func_dict_v2(df,col_name):
return {df['State'][i]:df[col_name][i] for i in range(df.shape[0])}
dsat_part_v2 = func_dict_v2(df_sat,'Participation')
dsat_EBRW_v2 = func_dict_v2(df_sat,'EBRW')
dsat_math_v2 = func_dict_v2(df_sat,'Math')
assert(dsat_math == dsat_math_v2)
assert(dsat_part == dsat_part_v2)
assert(dsat_math == dsat_math_v2)
But let us use Version 1.
Using a simple dictionary comprehension.
def dict_col(df):
return {col:df[col].tolist() for col in df.columns}
print (dict_col(df_sat))
print ("")
print (dict_col(df_act))
{'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'], 'Participation': [5.0, 38.0, 30.0, 3.0, 53.0, 11.0, 100.0, 100.0, 100.0, 83.0, 61.0, 55.0, 93.0, 9.0, 63.0, 2.0, 4.0, 4.0, 4.0, 95.0, 69.0, 76.0, 100.0, 3.0, 2.0, 3.0, 10.0, 3.0, 26.0, 96.0, 70.0, 11.0, 67.0, 49.0, 2.0, 12.0, 7.0, 43.0, 65.0, 71.0, 50.0, 3.0, 5.0, 62.0, 3.0, 60.0, 65.0, 64.0, 14.0, 3.0, 3.0], 'EBRW': [593, 547, 563, 614, 531, 606, 530, 503, 482, 520, 535, 544, 513, 559, 542, 641, 632, 631, 611, 513, 536, 555, 509, 644, 634, 640, 605, 629, 563, 532, 530, 577, 528, 546, 635, 578, 530, 560, 540, 539, 543, 612, 623, 513, 624, 562, 561, 541, 558, 642, 626], 'Math': [572, 533, 553, 594, 524, 595, 512, 492, 468, 497, 515, 541, 493, 556, 532, 635, 628, 616, 586, 499, 52, 551, 495, 651, 607, 631, 591, 625, 553, 520, 526, 561, 523, 535, 621, 570, 517, 548, 531, 524, 521, 603, 604, 507, 614, 551, 541, 534, 528, 649, 604]} {'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming'], 'Participation': [100.0, 65.0, 62.0, 100.0, 31.0, 100.0, 31.0, 18.0, 32.0, 73.0, 55.0, 90.0, 38.0, 93.0, 35.0, 67.0, 73.0, 100.0, 100.0, 8.0, 28.0, 29.0, 29.0, 100.0, 100.0, 100.0, 100.0, 84.0, 100.0, 18.0, 34.0, 66.0, 31.0, 100.0, 98.0, 75.0, 100.0, 40.0, 23.0, 21.0, 100.0, 80.0, 100.0, 45.0, 100.0, 29.0, 29.0, 29.0, 69.0, 100.0, 100.0], 'English': [18.9, 18.7, 18.6, 18.9, 22.5, 20.1, 25.5, 24.1, 24.4, 19.0, 21.0, 17.8, 21.9, 21.0, 22.0, 21.2, 21.1, 19.6, 19.4, 24.2, 23.3, 25.4, 24.1, 20.4, 18.2, 19.8, 19.0, 20.9, 16.3, 25.4, 23.8, 18.6, 23.8, 17.8, 19.0, 21.2, 18.5, 21.2, 23.4, 24.0, 17.5, 20.7, 19.5, 19.5, 19.5, 23.3, 23.5, 20.9, 20.0, 19.7, 19.4], 'Math': [18.4, 19.8, 19.8, 19.0, 22.7, 20.3, 24.6, 23.4, 23.5, 19.4, 20.9, 19.2, 21.8, 21.2, 22.4, 21.3, 21.3, 19.4, 18.8, 24.0, 23.1, 25.3, 23.7, 21.5, 18.1, 19.9, 20.2, 20.9, 18.0, 25.1, 23.8, 19.4, 24.0, 19.3, 20.4, 21.6, 18.8, 21.5, 23.4, 23.3, 18.6, 21.5, 19.2, 20.7, 19.9, 23.1, 23.3, 21.9, 19.4, 20.4, 19.8], 'Reading': [19.7, 20.4, 20.1, 19.7, 23.1, 21.2, 25.6, 24.8, 24.9, 21.0, 22.0, 19.2, 23.0, 21.6, 23.2, 22.6, 22.3, 20.5, 19.8, 24.8, 24.2, 25.9, 24.5, 21.8, 18.8, 20.8, 21.0, 21.9, 18.1, 26.0, 24.1, 20.4, 24.6, 19.6, 20.5, 22.5, 20.1, 22.4, 24.2, 24.7, 19.1, 22.3, 20.1, 21.1, 20.8, 24.4, 24.6, 22.1, 21.2, 20.6, 20.8], 'Science': [19.4, 19.9, 19.8, 19.5, 22.2, 20.9, 24.6, 23.6, 23.5, 19.4, 21.3, 19.3, 22.1, 21.3, 22.3, 22.1, 21.7, 20.1, 19.6, 23.7, 2.3, 24.7, 23.8, 21.6, 18.8, 20.5, 20.5, 21.5, 18.2, 24.9, 23.2, 20.0, 23.9, 19.3, 20.6, 22.0, 19.6, 21.7, 23.3, 23.4, 18.9, 22.0, 19.9, 20.9, 20.6, 23.2, 23.5, 22.0, 20.5, 20.9, 20.6]}
df_total = pd.merge(df_sat, df_act, on='State')
df_total.head()
State | Participation_x | EBRW | Math_x | Participation_y | English | Math_y | Reading | Science | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 5.0 | 593 | 572 | 100.0 | 18.9 | 18.4 | 19.7 | 19.4 |
1 | Alaska | 38.0 | 547 | 533 | 65.0 | 18.7 | 19.8 | 20.4 | 19.9 |
2 | Arizona | 30.0 | 563 | 553 | 62.0 | 18.6 | 19.8 | 20.1 | 19.8 |
3 | Arkansas | 3.0 | 614 | 594 | 100.0 | 18.9 | 19.0 | 19.7 | 19.5 |
4 | California | 53.0 | 531 | 524 | 31.0 | 22.5 | 22.7 | 23.1 | 22.2 |
df_total = pd.merge(df_sat, df_act, on='State')
df_total.columns = ['State','Participation_SAT (%)','EBRW_SAT','Math_SAT',\
'Participation_ACT (%)','English_ACT','Math_ACT','Reading_ACT','Science_ACT']
df_total.head(3)
State | Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | |
---|---|---|---|---|---|---|---|---|---|
0 | Alabama | 5.0 | 593 | 572 | 100.0 | 18.9 | 18.4 | 19.7 | 19.4 |
1 | Alaska | 38.0 | 547 | 533 | 65.0 | 18.7 | 19.8 | 20.4 | 19.9 |
2 | Arizona | 30.0 | 563 | 553 | 62.0 | 18.6 | 19.8 | 20.1 | 19.8 |
df_total.describe().loc[['min','max']]
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | |
---|---|---|---|---|---|---|---|---|
min | 2.0 | 482.0 | 52.0 | 8.0 | 16.3 | 18.0 | 18.1 | 2.3 |
max | 100.0 | 644.0 | 651.0 | 100.0 | 25.5 | 25.3 | 26.0 | 24.9 |
sd
.¶def stdev(X):
n = len(X)
return ((1.0/n)*np.sum([(x-np.mean(X))**2 for x in X]))**(0.5)
cols = df_total.columns[1:].tolist()
sd = [round( stdev([df_total[col].tolist() for col in cols][i] ) ,3) for i in range(0,len(cols))]
print (sd)
[34.929, 45.217, 84.073, 31.824, 2.33, 1.962, 2.047, 3.151]
Note that Pandas calculates std
DataFrame using $n-1$ as denominator instead of $n$:
df_total.describe().loc[['std']]
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | |
---|---|---|---|---|---|---|---|---|
std | 35.276632 | 45.666901 | 84.909119 | 32.140842 | 2.353677 | 1.981989 | 2.067271 | 3.182463 |
Setting the number of ${\rm{ddof}}=0$ solves this issue and we obtain the same values as the list sd
.
df_total.std(ddof=0)
Participation_SAT (%) 34.929071 EBRW_SAT 45.216970 Math_SAT 84.072555 Participation_ACT (%) 31.824176 English_ACT 2.330488 Math_ACT 1.962462 Reading_ACT 2.046903 Science_ACT 3.151108 dtype: float64
sd
into a new observation in your dataset.¶I first put State
as index and then concatenate the new row, renaming it.
df_total_new = df_total.copy()
df_total_new = df_total_new.set_index('State')
df_total_new.head()
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | |
---|---|---|---|---|---|---|---|---|
State | ||||||||
Alabama | 5.0 | 593 | 572 | 100.0 | 18.9 | 18.4 | 19.7 | 19.4 |
Alaska | 38.0 | 547 | 533 | 65.0 | 18.7 | 19.8 | 20.4 | 19.9 |
Arizona | 30.0 | 563 | 553 | 62.0 | 18.6 | 19.8 | 20.1 | 19.8 |
Arkansas | 3.0 | 614 | 594 | 100.0 | 18.9 | 19.0 | 19.7 | 19.5 |
California | 53.0 | 531 | 524 | 31.0 | 22.5 | 22.7 | 23.1 | 22.2 |
df2 = pd.DataFrame([[34.929, 45.217, 84.073, 31.824, 2.33, 1.962, 2.047, 3.151]],columns=df_total_new.columns)
df_total_new = pd.concat([df2,df_total_new])
df_total_new = df_total_new.rename(index={df_total_new.index[0]: 'sd'})
df_total_new.head()
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | |
---|---|---|---|---|---|---|---|---|
sd | 34.929 | 45.217 | 84.073 | 31.824 | 2.33 | 1.962 | 2.047 | 3.151 |
Alabama | 5.000 | 593.000 | 572.000 | 100.000 | 18.90 | 18.400 | 19.700 | 19.400 |
Alaska | 38.000 | 547.000 | 533.000 | 65.000 | 18.70 | 19.800 | 20.400 | 19.900 |
Arizona | 30.000 | 563.000 | 553.000 | 62.000 | 18.60 | 19.800 | 20.100 | 19.800 |
Arkansas | 3.000 | 614.000 | 594.000 | 100.000 | 18.90 | 19.000 | 19.700 | 19.500 |
I will start from the DataFrame without the sd
and include it afterwards:
df_total_new = df_total.copy()
df_total_new = df_total_new.set_index('State').sort_values("Participation_SAT (%)",ascending=False)
df_total_new = pd.concat([df2,df_total_new])
df_total_new = df_total_new.rename(index={df_total_new.index[0]: 'sd'})
df_total_new.head()
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | |
---|---|---|---|---|---|---|---|---|
sd | 34.929 | 45.217 | 84.073 | 31.824 | 2.33 | 1.962 | 2.047 | 3.151 |
District of Columbia | 100.000 | 482.000 | 468.000 | 32.000 | 24.40 | 23.500 | 24.900 | 23.500 |
Michigan | 100.000 | 509.000 | 495.000 | 29.000 | 24.10 | 23.700 | 24.500 | 23.800 |
Connecticut | 100.000 | 530.000 | 512.000 | 31.000 | 25.50 | 24.600 | 25.600 | 24.600 |
Delaware | 100.000 | 503.000 | 492.000 | 18.000 | 24.10 | 23.400 | 24.800 | 23.600 |
df_total_new[['Participation_SAT (%)']].head(20)
Participation_SAT (%) | |
---|---|
sd | 34.929 |
District of Columbia | 100.000 |
Michigan | 100.000 |
Connecticut | 100.000 |
Delaware | 100.000 |
New Hampshire | 96.000 |
Maine | 95.000 |
Idaho | 93.000 |
Florida | 83.000 |
Massachusetts | 76.000 |
Rhode Island | 71.000 |
New Jersey | 70.000 |
Maryland | 69.000 |
New York | 67.000 |
Virginia | 65.000 |
Pennsylvania | 65.000 |
Washington | 64.000 |
Indiana | 63.000 |
Texas | 62.000 |
Georgia | 61.000 |
I printed out the tail to check.
df_total_new = df_total_new[df_total_new['Participation_SAT (%)']>50]
df_total_new.tail()
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | |
---|---|---|---|---|---|---|---|---|
Texas | 62.0 | 513.0 | 507.0 | 45.0 | 19.5 | 20.7 | 21.1 | 20.9 |
Georgia | 61.0 | 535.0 | 515.0 | 55.0 | 21.0 | 20.9 | 22.0 | 21.3 |
Vermont | 60.0 | 562.0 | 551.0 | 29.0 | 23.3 | 23.1 | 24.4 | 23.2 |
Hawaii | 55.0 | 544.0 | 541.0 | 90.0 | 17.8 | 19.2 | 19.2 | 19.3 |
California | 53.0 | 531.0 | 524.0 | 31.0 | 22.5 | 22.7 | 23.1 | 22.2 |
There are certain technical criteria for the optimal choice of number of bins which will not be used here but are briefly described in the end of the notebook. We will follow the heuristical "not too large, not too little" bin size criterion.
There is an outlier which we will exclude now for good measure:
df_total_new = df_total_new[df_total_new.index != 'Maryland']
p1 = df_total_new[['Participation_SAT (%)']]
p2 = df_total_new[['Participation_ACT (%)']]
fig, axes = plt.subplots(1, 2)
p1.hist('Participation_SAT (%)', bins=10, ax=axes[0])
p2.hist('Participation_ACT (%)', bins=10, ax=axes[1])
array([<matplotlib.axes._subplots.AxesSubplot object at 0x1139db320>], dtype=object)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x113a3b400>], dtype=object)
p1 = df_total_new[['Math_SAT']]
p2 = df_total_new[['Math_ACT']]
fig, axes = plt.subplots(1, 2)
p1.hist('Math_SAT', bins=10, ax=axes[0])
p2.hist('Math_ACT', bins=10, ax=axes[1])
array([<matplotlib.axes._subplots.AxesSubplot object at 0x113a8d780>], dtype=object)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x113b94da0>], dtype=object)
In the ACT scores I joined English and reading into one Verbal section following SAT's strategy:
df_total_new['Verbal_ACT'] = (df_total_new['English_ACT'] + df_total_new['Reading_ACT'])/2
df_total_new.head()
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | Verbal_ACT | |
---|---|---|---|---|---|---|---|---|---|
District of Columbia | 100.0 | 482.0 | 468.0 | 32.0 | 24.4 | 23.5 | 24.9 | 23.5 | 24.65 |
Michigan | 100.0 | 509.0 | 495.0 | 29.0 | 24.1 | 23.7 | 24.5 | 23.8 | 24.30 |
Connecticut | 100.0 | 530.0 | 512.0 | 31.0 | 25.5 | 24.6 | 25.6 | 24.6 | 25.55 |
Delaware | 100.0 | 503.0 | 492.0 | 18.0 | 24.1 | 23.4 | 24.8 | 23.6 | 24.45 |
New Hampshire | 96.0 | 532.0 | 520.0 | 18.0 | 25.4 | 25.1 | 26.0 | 24.9 | 25.70 |
Next I change syntax and disposition of plots to make the x-axis labels more legible
columns = ['EBRW_SAT','Verbal_ACT']
df_total_new.hist(column=columns, layout=(2,1))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x11463d908>], [<matplotlib.axes._subplots.AxesSubplot object at 0x118fa8438>]], dtype=object)
We can do other plots using seaborn as well.
print( df_total_new.columns.tolist())
['Participation_SAT (%)', 'EBRW_SAT', 'Math_SAT', 'Participation_ACT (%)', 'English_ACT', 'Math_ACT', 'Reading_ACT', 'Science_ACT', 'Verbal_ACT']
sns.pairplot(df_total_new[['Participation_SAT (%)', 'EBRW_SAT', 'Math_SAT']], size=5)
<seaborn.axisgrid.PairGrid at 0x119044b38>
sns.pairplot(df_total_new[['Participation_ACT (%)', 'Math_ACT', 'Science_ACT','Verbal_ACT',]], size=5)
# point are too small though (don't know yet how to increase them)
<seaborn.axisgrid.PairGrid at 0x11aae5400>
I plotted just for the SAT to avoid excess repetition.
import seaborn as sns
fig, axs = plt.subplots(1, 3, figsize=(12,6))
axs[0] = sns.distplot(df_total_new['Participation_SAT (%)'], kde=False, color='blue', ax=axs[0], bins=10)
axs[1] = sns.distplot(df_total_new['Math_SAT'], kde=False, color='red', ax=axs[1], bins=10)
axs[2] = sns.distplot(df_total_new['EBRW_SAT'], kde=False, color='green', ax=axs[2], bins=10)
axs[0].set_title('Participation_SAT (%)', fontsize=8)
axs[1].set_title('SAT Math', fontsize=8)
axs[2].set_title('SAT Evidence-Based Reading and Writing', fontsize=8)
plt.show()
Text(0.5,1,'Participation_SAT (%)')
Text(0.5,1,'SAT Math')
Text(0.5,1,'SAT Evidence-Based Reading and Writing')
That they are normally distributed.
No. In fact some of the distributions seem to be bimodal (two peaks in the distribution).
I will make several plots and make the commentaries after them.
df_total_new.plot(kind='scatter', x='Participation_SAT (%)', y='EBRW_SAT', c='blue', title='Participation_SAT (%) vs EBRW')
<matplotlib.axes._subplots.AxesSubplot at 0x11be64048>
df_total_new.plot(kind='scatter', x='Participation_ACT (%)', y='Verbal_ACT', c='red', title='Participation_SAT (%) vs Verbal')
<matplotlib.axes._subplots.AxesSubplot at 0x11bdd17b8>
df_total_new.plot(kind='scatter', x='Participation_SAT (%)', y='Math_SAT', c='red', title='Participation_SAT (%) vs Math')
<matplotlib.axes._subplots.AxesSubplot at 0x11c9286d8>
df_total_new.plot(kind='scatter', x='Participation_ACT (%)', y='Math_ACT', c='red', title='Participation_ACT (%) vs Math')
<matplotlib.axes._subplots.AxesSubplot at 0x11cae9240>
df_aux = df_total_new.copy() # To backup
plt.scatter(df_aux['Participation_SAT (%)'],df_aux['Math_SAT'], color = 'r', label='Math')
plt.scatter(df_aux['Participation_SAT (%)'],df_aux['EBRW_SAT'], color = 'b', label='EBRW')
plt.xlabel('Rate')
plt.title('SAT correlation between Math and EBRW scores')
plt.grid(True)
plt.xlim([50,105])
plt.ylim([450,600])
plt.legend(loc='upper right')
plt.show()
<matplotlib.collections.PathCollection at 0x11cc544e0>
<matplotlib.collections.PathCollection at 0x11cc54978>
Text(0.5,0,'Rate')
Text(0.5,1,'SAT correlation between Math and EBRW scores')
(50, 105)
(450, 600)
<matplotlib.legend.Legend at 0x11cc548d0>
Same plot as above with different $y$-range show the negative correlation more clearly:
plt.scatter(df_aux['Participation_SAT (%)'],df_aux['Math_SAT'], color = 'r', label='Math')
plt.scatter(df_aux['Participation_SAT (%)'],df_aux['EBRW_SAT'], color = 'b', label='EBRW')
plt.xlabel('Rate')
plt.title('SAT correlation between Math and EBRW scores')
plt.grid(True)
plt.xlim([50,105])
plt.ylim([400,650])
plt.legend(loc='upper right')
plt.show()
<matplotlib.collections.PathCollection at 0x11cd28748>
<matplotlib.collections.PathCollection at 0x11cd28be0>
Text(0.5,0,'Rate')
Text(0.5,1,'SAT correlation between Math and EBRW scores')
(50, 105)
(400, 650)
<matplotlib.legend.Legend at 0x11cd28eb8>
plt.scatter(df_aux['Participation_ACT (%)'],df_aux['Math_ACT'], color = 'r', label='Math')
plt.scatter(df_aux['Participation_ACT (%)'],df_aux['Verbal_ACT'], color = 'b', label='Verbal')
plt.xlabel('Rate')
plt.title('ACT correlation between Math and Verbal')
plt.grid(True)
plt.legend(loc='upper right')
plt.show()
<matplotlib.collections.PathCollection at 0x11cdeaa58>
<matplotlib.collections.PathCollection at 0x11cdeaeb8>
Text(0.5,0,'Rate')
Text(0.5,1,'ACT correlation between Math and Verbal')
<matplotlib.legend.Legend at 0x11cdf52e8>
Participation rate and scores seem to be "universally" negatively correlated. A larger fraction of students taking the tests is correlated with lower average grades for all tests in both SAT and ACT. This may have several explanations. One of them could be lack of incentive from public schools, in some states, for students to take the exams. Tesk takes would then come mostly from private schools which tend to score better. This issue must be investigated in much more detail and there are a multitude of articles online discussing the topics.
Good or bad performance in Math and Verbal tests seem to occur together i.e. student on average do well in both or in neither.
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(131)
plt.boxplot(df_aux['Participation_SAT (%)'])
ax1.set_title('Box plot for Participation_SAT (%)')
ax2 = fig.add_subplot(132)
plt.boxplot(df_aux['EBRW_SAT'])
ax2.set_title('Box plot for EBRW_SAT')
ax3 = fig.add_subplot(133)
plt.boxplot(df_aux['Math_SAT'])
ax3.set_title('Box plot for Math_SAT')
plt.show()
{'boxes': [<matplotlib.lines.Line2D at 0x11ceb2f60>], 'caps': [<matplotlib.lines.Line2D at 0x11cebca58>, <matplotlib.lines.Line2D at 0x11cebceb8>], 'fliers': [<matplotlib.lines.Line2D at 0x11cec47b8>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0x11cec4358>], 'whiskers': [<matplotlib.lines.Line2D at 0x11cebc128>, <matplotlib.lines.Line2D at 0x11cebc5f8>]}
Text(0.5,1,'Box plot for Participation_SAT (%)')
{'boxes': [<matplotlib.lines.Line2D at 0x11cf067b8>], 'caps': [<matplotlib.lines.Line2D at 0x11cf0e2b0>, <matplotlib.lines.Line2D at 0x11cf0e710>], 'fliers': [<matplotlib.lines.Line2D at 0x11cf0efd0>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0x11cf0eb70>], 'whiskers': [<matplotlib.lines.Line2D at 0x11cf06940>, <matplotlib.lines.Line2D at 0x11cf06e10>]}
Text(0.5,1,'Box plot for EBRW_SAT')
{'boxes': [<matplotlib.lines.Line2D at 0x11cf552e8>], 'caps': [<matplotlib.lines.Line2D at 0x11cf55da0>, <matplotlib.lines.Line2D at 0x11cf5c240>], 'fliers': [<matplotlib.lines.Line2D at 0x11cf5cb00>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0x11cf5c6a0>], 'whiskers': [<matplotlib.lines.Line2D at 0x11cf55470>, <matplotlib.lines.Line2D at 0x11cf55940>]}
Text(0.5,1,'Box plot for Math_SAT')
df_aux = df_total_new.copy()
df_aux['Verbal_ACT'] = (df_aux['English_ACT'] + df_aux['Reading_ACT'])/2
fig = plt.figure(figsize=(12,4))
ax1 = fig.add_subplot(131)
plt.boxplot(df_aux['Participation_ACT (%)'])
ax1.set_title('Box plot for Participation_ACT (%)')
ax2 = fig.add_subplot(132)
plt.boxplot(df_aux['Verbal_ACT'])
ax2.set_title('Box plot for Verbal_ACT')
ax3 = fig.add_subplot(133)
plt.boxplot(df_aux['Math_ACT'])
ax3.set_title('Box plot for Math_ACT')
plt.show()
{'boxes': [<matplotlib.lines.Line2D at 0x11d077c50>], 'caps': [<matplotlib.lines.Line2D at 0x11d081748>, <matplotlib.lines.Line2D at 0x11d081ba8>], 'fliers': [<matplotlib.lines.Line2D at 0x11d08a4a8>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0x11d08a048>], 'whiskers': [<matplotlib.lines.Line2D at 0x11d077dd8>, <matplotlib.lines.Line2D at 0x11d0812e8>]}
Text(0.5,1,'Box plot for Participation_ACT (%)')
{'boxes': [<matplotlib.lines.Line2D at 0x11d0ca4e0>], 'caps': [<matplotlib.lines.Line2D at 0x11d0caf98>, <matplotlib.lines.Line2D at 0x11d0d2438>], 'fliers': [<matplotlib.lines.Line2D at 0x11d0d2cf8>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0x11d0d2898>], 'whiskers': [<matplotlib.lines.Line2D at 0x11d0ca668>, <matplotlib.lines.Line2D at 0x11d0cab38>]}
Text(0.5,1,'Box plot for Verbal_ACT')
{'boxes': [<matplotlib.lines.Line2D at 0x11d116198>], 'caps': [<matplotlib.lines.Line2D at 0x11d116c50>, <matplotlib.lines.Line2D at 0x11d11f0f0>], 'fliers': [<matplotlib.lines.Line2D at 0x11d11f9b0>], 'means': [], 'medians': [<matplotlib.lines.Line2D at 0x11d11f550>], 'whiskers': [<matplotlib.lines.Line2D at 0x11d116320>, <matplotlib.lines.Line2D at 0x11d1167f0>]}
Text(0.5,1,'Box plot for Math_ACT')
We look into the mean, standard deviation and degree of skewness. The median and the $50\%$ percentil values are the same by definition.
df_aux[cols].describe().loc[['mean','std','50%']]
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | |
---|---|---|---|---|---|---|---|---|
mean | 76.142857 | 529.666667 | 517.904762 | 34.666667 | 22.833333 | 22.833333 | 23.704762 | 22.800000 |
std | 16.950769 | 19.652820 | 21.586813 | 18.521159 | 2.144139 | 1.669830 | 1.769315 | 1.553061 |
50% | 70.000000 | 531.000000 | 523.000000 | 31.000000 | 23.500000 | 23.300000 | 24.400000 | 23.300000 |
For example, magnitude of percentual differences between means and medians (values are in percentage so they are indeed low for the scores!):
for col in df_aux.columns.tolist():
print (col,":",np.abs(100*(df_aux[col].describe().loc['mean']/df_aux[col].describe().loc['50%']-1)))
Participation_SAT (%) : 8.775510204081627 EBRW_SAT : 0.25109855618330457 Math_SAT : 0.9742329053992527 Participation_ACT (%) : 11.827956989247301 English_ACT : 2.8368794326241176 Math_ACT : 2.002861230329045 Reading_ACT : 2.849336455893814 Science_ACT : 2.145922746781126 Verbal_ACT : 2.8432249726612957
25.1) Let us look at correlations:
cols_SAT = ['Participation_SAT (%)', 'EBRW_SAT', 'Math_SAT']
cols_ACT = [ 'Participation_ACT (%)', 'Math_ACT', 'Science_ACT', 'Verbal_ACT']
SAT
corr_SAT = df_aux[cols_SAT].corr()
corr_SAT
Participation_SAT (%) | EBRW_SAT | Math_SAT | |
---|---|---|---|
Participation_SAT (%) | 1.000000 | -0.656950 | -0.718164 |
EBRW_SAT | -0.656950 | 1.000000 | 0.961519 |
Math_SAT | -0.718164 | 0.961519 | 1.000000 |
ACT
corr_ACT = df_aux[cols_ACT].corr()
corr_ACT
Participation_ACT (%) | Math_ACT | Science_ACT | Verbal_ACT | |
---|---|---|---|---|
Participation_ACT (%) | 1.000000 | -0.839660 | -0.854699 | -0.835211 |
Math_ACT | -0.839660 | 1.000000 | 0.987912 | 0.974774 |
Science_ACT | -0.854699 | 0.987912 | 1.000000 | 0.980839 |
Verbal_ACT | -0.835211 | 0.974774 | 0.980839 | 1.000000 |
25.2) Skewness general observations
The skewness is an important metric and is related to the mean and median. It can be measured using Pearson's coefficient, namely:
$${g_1} = \frac{{{\mu _3}}}{{{\sigma ^3}}}$$where the numerator is the third momentum. One way to relate the difference between mean and median to the skewness is to use (which is valid for some particular distributions):
$$ g_1 = 2(\mu-{\rm{mode}})/\sigma$$or
$$ \mu-{\rm{mode}} = (1/2) g_1 \sigma$$Though this equality is not valid in general, an estimation of skewness as an indirect measure of the difference between mode and median and vice-versa may be used heuristically.
25.3) In our case
In our case, consider the Math scores for example. The median is larger so $g_1<0$. If the left tail is more pronounced than the right one the function has negative skewness. As we see below (where I reproduced the plots), the left tail is indeed more pronounced.
g1_SAT = 2*(np.mean(df_aux['Math_SAT'])-np.median(df_aux['Math_SAT']))/ np.std(df_aux['Math_SAT'])
print ("g1 from SAT Math is:", g1_SAT)
g1 from SAT Math is: -0.4837273105976173
g1_ACT = 2*(np.mean(df_aux['Math_ACT'])-np.median(df_aux['Math_ACT']))/ np.std(df_aux['Math_ACT'])
print ("g1 from ACT is:", g1_ACT)
g1 from ACT is: -0.5727420648104604
p1 = df_total_new[['Math_SAT']]
p2 = df_total_new[['Math_ACT']]
fig, axes = plt.subplots(1, 2,figsize=(6,2))
p1.hist('Math_SAT', bins=10, ax=axes[0])
p2.hist('Math_ACT', bins=10, ax=axes[1])
array([<matplotlib.axes._subplots.AxesSubplot object at 0x11caff390>], dtype=object)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x11d256f98>], dtype=object)
The full population of high-school graduates is divided into two groups namely, SAT test takers and ACT test takers.
Suppose I want to test whether the true population mean of the participation rates are different between the SAT and ACT. Our hypotheses have the form:
$${H_0}:\,{\mu _{{\rm{PR,SAT}}}} = \,{\mu _{{\rm{PR,ACT}}}}\\ {H_{\rm{A}}}:\,{\mu _{{\rm{PR,SAT}}}} \ne {\mu _{{\rm{PR,ACT}}}}$$To obtain the $p$-value we use:
"p-value is:",stats.ttest_ind(df_aux['Participation_SAT (%)'],df_aux['Participation_ACT (%)'])[1]
('p-value is:', 3.0499821970067098e-09)
This is much smaller than $\alpha$ so we reject the null hypothesis and conclude that both means differ at this level of confidence.
From below I conclude that I am 95$\%$ confident that the true mean rate for the SAT participation is within (26.261598995475964, 43.071734337857364) and that the true mean rate for the ACT participation is within (68.450446510874016, 83.835267774840261).
#print df_aux['Participation_ACT (%)'].mean()
#print df_aux['Participation_SAT (%)'].mean()
df_aux.head()
Participation_SAT (%) | EBRW_SAT | Math_SAT | Participation_ACT (%) | English_ACT | Math_ACT | Reading_ACT | Science_ACT | Verbal_ACT | |
---|---|---|---|---|---|---|---|---|---|
District of Columbia | 100.0 | 482.0 | 468.0 | 32.0 | 24.4 | 23.5 | 24.9 | 23.5 | 24.65 |
Michigan | 100.0 | 509.0 | 495.0 | 29.0 | 24.1 | 23.7 | 24.5 | 23.8 | 24.30 |
Connecticut | 100.0 | 530.0 | 512.0 | 31.0 | 25.5 | 24.6 | 25.6 | 24.6 | 25.55 |
Delaware | 100.0 | 503.0 | 492.0 | 18.0 | 24.1 | 23.4 | 24.8 | 23.6 | 24.45 |
New Hampshire | 96.0 | 532.0 | 520.0 | 18.0 | 25.4 | 25.1 | 26.0 | 24.9 | 25.70 |
stats.t.interval(0.95,df_aux['Participation_ACT (%)'].shape[0],loc = df_aux['Participation_ACT (%)'].mean(),
scale = (np.std(df_aux['Participation_ACT (%)'], ddof = 1)) / df_aux['Participation_ACT (%)'].shape[0] ** 0.5)
(26.261598995475964, 43.071734337857364)
stats.t.interval(0.95,df_aux['Participation_SAT (%)'].shape[0],loc = df_aux['Participation_SAT (%)'].mean(),
scale = (np.std(df_aux['Participation_SAT (%)'], ddof = 1)) / df_aux['Participation_SAT (%)'].shape[0] ** 0.5)
(68.45044651087402, 83.83526777484026)
Not surprising since confidence that they differ must be associated with lack of overlap between confidence intervals. If we are confident that they are likely to belong to different intervals, it is intuitive that they should differ.
Probably yes, since they have the same units and compare reasonably similar quantities (details about the test takers and some other factors for example may change that rationale).
astropy.visualization
contains this options and will be used in future projects.