(An NYC High School Data Analysis)
Exploring the relationships between SAT scores and demographic factors (e.g., race, income, gender) in New York City public schools.
New York City has a significant immigrant population and is very diverse, so comparing demographic factors such as race, income, and gender with SAT scores is a good way to determine whether the SAT is a fair test. For example, if certain racial groups consistently perform better on the SAT, we would have some evidence that the SAT is unfair.
**What is the SAT?**
The SAT (Scholastic Assessment Test) consists of three sections, each of which is worth a maximum of 800 points, for a maximum total score of 2,400. Colleges use the SAT to determine which students to admit, and high average SAT scores are usually indicative of a good school.
New York City has published data on student SAT scores by high school, along with additional demographic data sets. Over the course of this project we will combine the following data sets into a single, clean pandas dataframe:
Dataset | Description
---|---
SAT scores by school | SAT scores for each high school in New York City
School attendance | Attendance information for each school in New York City
Class size | Information on class size for each school
AP test results | Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
Graduation outcomes | The percentage of students who graduated, and other outcome information
Demographics | Demographic information for each school
School survey | Surveys of parents, teachers, and students at each school
Let's first read in all our datasets:
## Importing all the required libraries:
import pandas as pd
import numpy as np
import re
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
# creating a dictionary to keep the datasets in an orderly fashion:
data = {}
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d
# We see that the surveys have a different encoding, so let's read them in with the encoding they need:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
# combining the surveys:
survey = pd.concat([all_survey, d75_survey], axis=0)
# Copying the 'dbn' column to 'DBN' to make it uniform across all the datasets:
survey["DBN"] = survey["dbn"]
# Locating all the survey fields we need for the analysis:
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
# Assigning all the cleansed data back to the dictionary:
data["survey"] = survey
# let's check the result:
data["survey"].head()
DBN | rr_s | rr_t | rr_p | N_s | N_t | N_p | saf_p_11 | com_p_11 | eng_p_11 | ... | eng_t_11 | aca_t_11 | saf_s_11 | com_s_11 | eng_s_11 | aca_s_11 | saf_tot_11 | com_tot_11 | eng_tot_11 | aca_tot_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | NaN | 88 | 60 | NaN | 22.0 | 90.0 | 8.5 | 7.6 | 7.5 | ... | 7.6 | 7.9 | NaN | NaN | NaN | NaN | 8.0 | 7.7 | 7.5 | 7.9 |
1 | 01M019 | NaN | 100 | 60 | NaN | 34.0 | 161.0 | 8.4 | 7.6 | 7.6 | ... | 8.9 | 9.1 | NaN | NaN | NaN | NaN | 8.5 | 8.1 | 8.2 | 8.4 |
2 | 01M020 | NaN | 88 | 73 | NaN | 42.0 | 367.0 | 8.9 | 8.3 | 8.3 | ... | 6.8 | 7.5 | NaN | NaN | NaN | NaN | 8.2 | 7.3 | 7.5 | 8.0 |
3 | 01M034 | 89.0 | 73 | 50 | 145.0 | 29.0 | 151.0 | 8.8 | 8.2 | 8.0 | ... | 6.8 | 7.8 | 6.2 | 5.9 | 6.5 | 7.4 | 7.3 | 6.7 | 7.1 | 7.9 |
4 | 01M063 | NaN | 100 | 60 | NaN | 23.0 | 90.0 | 8.7 | 7.9 | 8.1 | ... | 7.8 | 8.1 | NaN | NaN | NaN | NaN | 8.5 | 7.6 | 7.9 | 8.0 |
5 rows × 23 columns
We see that `DBN` is the unique identification code for each school across the datasets. Let's make sure that each of the datasets has this column:
for key, value in data.items():
    if 'DBN' not in value.columns:
        print(key)
        print([value.columns])
        print('\n')
class_size [Index(['CSD', 'BOROUGH', 'SCHOOL CODE', 'SCHOOL NAME', 'GRADE ', 'PROGRAM TYPE', 'CORE SUBJECT (MS CORE and 9-12 ONLY)', 'CORE COURSE (MS CORE and 9-12 ONLY)', 'SERVICE CATEGORY(K-9* ONLY)', 'NUMBER OF STUDENTS / SEATS FILLED', 'NUMBER OF SECTIONS', 'AVERAGE CLASS SIZE', 'SIZE OF SMALLEST CLASS', 'SIZE OF LARGEST CLASS', 'DATA SOURCE', 'SCHOOLWIDE PUPIL-TEACHER RATIO'], dtype='object')]

hs_directory [Index(['dbn', 'school_name', 'boro', 'building_code', 'phone_number', 'fax_number', 'grade_span_min', 'grade_span_max', 'expgrade_span_min', 'expgrade_span_max', 'bus', 'subway', 'primary_address_line_1', 'city', 'state_code', 'zip', 'website', 'total_students', 'campus_name', 'school_type', 'overview_paragraph', 'program_highlights', 'language_classes', 'advancedplacement_courses', 'online_ap_courses', 'online_language_courses', 'extracurricular_activities', 'psal_sports_boys', 'psal_sports_girls', 'psal_sports_coed', 'school_sports', 'partner_cbo', 'partner_hospital', 'partner_highered', 'partner_cultural', 'partner_nonprofit', 'partner_corporate', 'partner_financial', 'partner_other', 'addtl_info1', 'addtl_info2', 'start_time', 'end_time', 'se_services', 'ell_programs', 'school_accessibility_description', 'number_programs', 'priority01', 'priority02', 'priority03', 'priority04', 'priority05', 'priority06', 'priority07', 'priority08', 'priority09', 'priority10', 'Location 1'], dtype='object')]
We see that there are two datasets that don't have a defined `DBN` column (`hs_directory` has a lowercase `dbn` instead). Let's fix those datasets:
# Changing the 'dbn' column in hs_directory to 'DBN':
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
# Now let's check the `class_size` dataset:
data['class_size'].head()
CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | GEN ED | - | - | - | 19.0 | 1.0 | 19.0 | 19.0 | 19.0 | ATS | NaN |
1 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | CTT | - | - | - | 21.0 | 1.0 | 21.0 | 21.0 | 21.0 | ATS | NaN |
2 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | GEN ED | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
3 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | CTT | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
4 | 1 | M | M015 | P.S. 015 Roberto Clemente | 02 | GEN ED | - | - | - | 15.0 | 1.0 | 15.0 | 15.0 | 15.0 | ATS | NaN |
**Creating the `DBN` column:**
From looking at these rows, we can tell that the `DBN` in the `sat_results` data is just a combination of the `CSD` and `SCHOOL CODE` columns in the `class_size` data. The main difference is that the `DBN` is padded, so that the `CSD` portion of it always consists of two digits. That means we'll need to add a leading 0 to the `CSD` if the `CSD` is less than two digits long.
## Let's create a function to build the padded CSD, and then the 'DBN' column:
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
# checking the function:
print (pad_csd(2))
# applying the function and combining the required columns for creating the `DBN` column:
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
02
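As an aside, pandas can do the same padding without a custom function. A one-line equivalent (a sketch, not part of the original notebook) using the built-in string accessor:
# Equivalent padding with str.zfill, which pads with leading zeros to width 2:
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].astype(str).str.zfill(2)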
# Let's check the columns we need to calculate the total SAT score:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
print ('without conversion: ', data['sat_results']['SAT Math Avg. Score'].dtype)
## We need to convert these columns into numeric type to be able to add the columns:
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
print ('after conversion: ', data['sat_results']['SAT Math Avg. Score'].dtype)
without conversion:  object
after conversion:  float64
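`errors="coerce"` turns any entry that can't be parsed as a number into NaN. A quick sanity check (a sketch, not in the original notebook) counts how many scores failed to parse:
# Counting the NaN values introduced by the coercion:
print(data['sat_results'][cols].isnull().sum())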
## Adding the columns to get the final SAT score:
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
print(data['sat_results']['sat_score'].head())
0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64
Let's now check the `hs_directory` dataset to parse the latitude and longitude coordinates for each school. This will enable us to map the schools and uncover any geographic patterns in the data. The coordinates are currently in the text field `Location 1` in the `hs_directory` data set.
## Checking the `Location 1` column:
print (data['hs_directory']['Location 1'][1])
1110 Boston Road Bronx, NY 10456 (40.8276026690005, -73.90447525699966)
## Creating functions to separate the latitude and longitude values of the schools from the column:
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon
# Checking the functions:
print('The latitude value is : ',find_lat(data['hs_directory']['Location 1'][1]))
print('The longitude value is : ',find_lon(data['hs_directory']['Location 1'][1]))
# Applying the functions to create columns that we need :
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
# Converting the string values of the locations to numeric datatypes:
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
The latitude value is :  40.8276026690005
The longitude value is :  -73.90447525699966
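A more pandas-idiomatic alternative (a sketch, assuming every `Location 1` value ends with a "(lat, lon)" pair) extracts both coordinates in one pass instead of two apply calls:
# Extracting both coordinates at once; the regex's named groups become columns:
coords = data["hs_directory"]["Location 1"].str.extract(r"\((?P<lat>[^,]+), (?P<lon>[^)]+)\)")
print(coords.head())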
Since the datasets together contain a huge amount of information, let's condense them for better analysis.
## Let's look at the `class_size`:
class_size = data["class_size"]
class_size.head(2)
CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | padded_csd | DBN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | GEN ED | - | - | - | 19.0 | 1.0 | 19.0 | 19.0 | 19.0 | ATS | NaN | 01 | 01M015 |
1 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | CTT | - | - | - | 21.0 | 1.0 | 21.0 | 21.0 | 21.0 | ATS | NaN | 01 | 01M015 |
class_size["GRADE "].value_counts()
09-12      10644
MS Core     4762
0K-09       1384
0K          1237
01          1185
02          1167
03          1143
04          1140
05          1086
06           846
07           778
08           735
09            20
Name: GRADE , dtype: int64
class_size["PROGRAM TYPE"].value_counts()
GEN ED     14545
CTT         7460
SPEC ED     3653
G&T          469
Name: PROGRAM TYPE, dtype: int64
## Let's condense the dataset to include only the most frequently occurring values:
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
print(class_size["DBN"].value_counts())
# Condensing the dataset by taking the mean of the numeric columns for each `DBN`:
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
# Assigning the dataset back to the dictionary:
data["class_size"] = class_size
20K505    21
09X505    21
24Q485    20
26Q435    20
25Q425    20
30Q445    20
21K525    20
20K490    20
21K540    20
22K495    20
25Q460    19
03M470    19
10X445    19
10X696    19
21K690    19
18K563    19
11X545    19
08X405    19
30Q450    19
27Q480    19
26Q495    19
28Q470    19
31R455    19
30Q575    19
12X278    19
26Q415    19
22K405    19
27Q410    19
25Q285    19
13K670    19
          ..
14K126     1
09X303     1
20K187     1
11X180     1
29Q295     1
10X206     1
29Q059     1
03M402     1
28Q217     1
32K377     1
21K209     1
13K596     1
23K041     1
17K394     1
07X203     1
03M247     1
20K030     1
10X254     1
27Q043     1
03M334     1
11X083     1
16K308     1
06M326     1
22K078     1
31R027     1
10X391     1
09X327     1
31R063     1
10X390     1
18K588     1
Name: DBN, Length: 583, dtype: int64
## Let's check the `demographics` dataset:
data["demographics"].head(2)
DBN | Name | schoolyear | fl_percent | frl_percent | total_enrollment | prek | k | grade1 | grade2 | ... | black_num | black_per | hispanic_num | hispanic_per | white_num | white_per | male_num | male_per | female_num | female_per | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20052006 | 89.4 | NaN | 281 | 15 | 36 | 40 | 33 | ... | 74 | 26.3 | 189 | 67.3 | 5 | 1.8 | 158.0 | 56.2 | 123.0 | 43.8 |
1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20062007 | 89.4 | NaN | 243 | 15 | 29 | 39 | 38 | ... | 68 | 28.0 | 153 | 63.0 | 4 | 1.6 | 140.0 | 57.6 | 103.0 | 42.4 |
2 rows × 38 columns
## Let's condense the `demographics` dataset to aid our findings:
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
# Converting AP scores to numeric for better analysis:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
Since we have now cleaned the datasets, let's merge them together into one.
We'll use the `DBN` column as the merge key for all the datasets, and we need to choose the type of each merge carefully so as to preserve the maximum amount of data for our analysis.
We know that the `sat_results` dataset forms the base of our combined data, so let's use it as the base and add the other datasets to it.
combined = data["sat_results"]
# Using LEFT merge to preserve all the 'DBN' columns in the `sat_results` dataset:
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
# Using the INNER merge to get the common `DBN` values:
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
# Let's fill the empty values with their respective column's mean:
combined = combined.fillna(combined.mean())
# For the columns with no discernible mean:
combined = combined.fillna(0)
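A quick sanity check (a sketch, not in the original notebook) on how many schools and columns survived the merges:
# The left merges preserve the sat_results rows; the inner merges keep only
# schools present in every remaining dataset:
print(combined.shape)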
# checking the results:
combined.head()
DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score | sat_score | SchoolName | AP Test Takers | Total Exams Taken | ... | priority04 | priority05 | priority06 | priority07 | priority08 | priority09 | priority10 | Location 1 | lat | lon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355.0 | 404.0 | 363.0 | 1122.0 | 0 | 129.028846 | 197.038462 | ... | Then to Manhattan students or residents | Then to New York City residents | 0 | 0 | 0 | 0 | 0 | 220 Henry Street\nNew York, NY 10002\n(40.7137... | 40.713764 | -73.985260 |
1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383.0 | 423.0 | 366.0 | 1172.0 | UNIVERSITY NEIGHBORHOOD H.S. | 39.000000 | 49.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 200 Monroe Street\nNew York, NY 10002\n(40.712... | 40.712332 | -73.984797 |
2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377.0 | 402.0 | 370.0 | 1149.0 | EAST SIDE COMMUNITY HS | 19.000000 | 21.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 420 East 12 Street\nNew York, NY 10009\n(40.72... | 40.729783 | -73.983041 |
3 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390.0 | 433.0 | 384.0 | 1207.0 | 0 | 129.028846 | 197.038462 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 145 Stanton Street\nNew York, NY 10002\n(40.72... | 40.720569 | -73.985673 |
4 | 01M539 | NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... | 159 | 522.0 | 574.0 | 525.0 | 1621.0 | NEW EXPLORATIONS SCI,TECH,MATH | 255.000000 | 377.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 111 Columbia Street\nNew York, NY 10002\n(40.7... | 40.718725 | -73.979426 |
5 rows × 159 columns
We see that the first two characters in the `DBN` column represent the school district codes. Let's extract the codes so we can map the data by district.
## Creating a function to extract the codes:
def get_first_two_chars(dbn):
    return dbn[0:2]
# Applying the function to the `DBN` column to get the `school_dist` values:
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
Now that we have a clean dataset, let's perform some preliminary analysis to draw inferences.
## To find "Pearson's" correlation between the columns:
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
SAT Critical Reading Avg. Score          0.986820
SAT Math Avg. Score                      0.972643
SAT Writing Avg. Score                   0.987771
sat_score                                1.000000
AP Test Takers                           0.523140
Total Exams Taken                        0.514333
Number of Exams with scores 3 4 or 5     0.463245
Total Cohort                             0.325144
CSD                                      0.042948
NUMBER OF STUDENTS / SEATS FILLED        0.394626
NUMBER OF SECTIONS                       0.362673
AVERAGE CLASS SIZE                       0.381014
SIZE OF SMALLEST CLASS                   0.249949
SIZE OF LARGEST CLASS                    0.314434
SCHOOLWIDE PUPIL-TEACHER RATIO                NaN
schoolyear                                    NaN
fl_percent                                    NaN
frl_percent                             -0.722225
total_enrollment                         0.367857
ell_num                                 -0.153778
ell_percent                             -0.398750
sped_num                                 0.034933
sped_percent                            -0.448170
asian_num                                0.475445
asian_per                                0.570730
black_num                                0.027979
black_per                               -0.284139
hispanic_num                             0.025744
hispanic_per                            -0.396985
white_num                                0.449559
                                          ...
rr_p                                     0.047925
N_s                                      0.423463
N_t                                      0.291463
N_p                                      0.421530
saf_p_11                                 0.122913
com_p_11                                -0.115073
eng_p_11                                 0.020254
aca_p_11                                 0.035155
saf_t_11                                 0.313810
com_t_11                                 0.082419
eng_t_11                                 0.036906
aca_t_11                                 0.132348
saf_s_11                                 0.337639
com_s_11                                 0.187370
eng_s_11                                 0.213822
aca_s_11                                 0.339435
saf_tot_11                               0.318753
com_tot_11                               0.077310
eng_tot_11                               0.100102
aca_tot_11                               0.190966
grade_span_max                                NaN
expgrade_span_max                             NaN
zip                                     -0.063977
total_students                           0.407827
number_programs                          0.117012
priority08                                    NaN
priority09                                    NaN
priority10                                    NaN
lat                                     -0.121029
lon                                     -0.132222
Name: sat_score, Length: 67, dtype: float64
## We generally consider values between -0.25 and 0.25 to be weak correlations, so let's filter those out and examine the rest of the data:
neg_corr = correlations[correlations < -0.25 ]
neg_corr
frl_percent    -0.722225
ell_percent    -0.398750
sped_percent   -0.448170
black_per      -0.284139
hispanic_per   -0.396985
Name: sat_score, dtype: float64
pos_corr = correlations[correlations > 0.25 ]
pos_corr
SAT Critical Reading Avg. Score          0.986820
SAT Math Avg. Score                      0.972643
SAT Writing Avg. Score                   0.987771
sat_score                                1.000000
AP Test Takers                           0.523140
Total Exams Taken                        0.514333
Number of Exams with scores 3 4 or 5     0.463245
Total Cohort                             0.325144
NUMBER OF STUDENTS / SEATS FILLED        0.394626
NUMBER OF SECTIONS                       0.362673
AVERAGE CLASS SIZE                       0.381014
SIZE OF LARGEST CLASS                    0.314434
total_enrollment                         0.367857
asian_num                                0.475445
asian_per                                0.570730
white_num                                0.449559
white_per                                0.620718
male_num                                 0.325520
female_num                               0.388631
N_s                                      0.423463
N_t                                      0.291463
N_p                                      0.421530
saf_t_11                                 0.313810
saf_s_11                                 0.337639
aca_s_11                                 0.339435
saf_tot_11                               0.318753
total_students                           0.407827
Name: sat_score, dtype: float64
- `total_enrollment` has a strong positive correlation with `sat_score`. This is surprising, because we'd expect smaller schools, where students receive more attention, to have higher scores. However, it looks like the opposite is true -- larger schools tend to do better on the SAT. Other columns that act as proxies for enrollment correlate similarly, such as `total_students`, `N_s`, `N_p`, `N_t`, `AP Test Takers `, `Total Exams Taken`, and `NUMBER OF SECTIONS` (see the sketch after this list).
- Both the percentage of females (`female_per`) and number of females (`female_num`) at a school correlate positively with SAT score, whereas the percentage of males (`male_per`) and the number of males (`male_num`) correlate negatively. This could indicate that women do better on the SAT than men.
- Teacher and student ratings of school safety (`saf_t_11` and `saf_s_11`) correlate with `sat_score`.
- Student ratings of school academic standards (`aca_s_11`) correlate with `sat_score`, but the same does not hold for ratings from teachers and parents (`aca_p_11` and `aca_t_11`).
- There is significant racial inequality in SAT scores (`white_per`, `asian_per`, `black_per`, `hispanic_per`).
- The percentage of English language learners (`ell_percent`) and of students on free or reduced-price lunch (`frl_percent`) both have strong negative correlations with SAT scores.
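To eyeball the enrollment relationship from the first bullet, a quick scatter plot (a sketch, not part of the original notebook):
# Plotting total enrollment against SAT score; a cluster of small, low-scoring
# schools would explain the overall positive correlation:
combined.plot.scatter('total_enrollment', 'sat_score')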
In order to understand the data better, let's plot the data:
## Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
## Importing the library for plotting:
import matplotlib.pyplot as plt
%matplotlib inline
## Let's plot a bar plot to see the correlation between `sat_score` and the fields that originally came from a survey of parents, teachers, and students:
combined.corr()["sat_score"][survey_fields].plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb8f43518>
## Let's see the column tags to understand the plot better:
link = 'https://data.cityofnewyork.us/api/views/mnz3-dyi8/files/aa68d821-4dbb-4eb2-9448-3d8cbbad5044?download=true&filename=Survey%20Data%20Dictionary.xls'
data_dict = pd.read_excel(link)
data_dict.head()
2011 NYC School Survey Data Dictionary | Unnamed: 1 | |
---|---|---|
0 | This data dictionary can be used with the scho... | NaN |
1 | NaN | NaN |
2 | Field Name | Field Description |
3 | dbn | School identification code (district borough n... |
4 | sch_type | School type (Elementary, Middle, High, etc) |
## Let's clean up the dataframe:
columns = ['Field Name', 'Field Description'] # headers
data_dict = data_dict[3:] # removing the top 3 rows
data_dict.columns = columns # assigning the headers
data_dict.reset_index(inplace=True) # resetting and assigning a new index
data_dict = data_dict.drop('index', axis=1)
## reviewing the rows to find the ones we don't need:
data_dict
Field Name | Field Description | |
---|---|---|
0 | dbn | School identification code (district borough n... |
1 | sch_type | School type (Elementary, Middle, High, etc) |
2 | location | School name |
3 | enrollment | Enrollment size |
4 | borough | Borough |
5 | principal | Principal name |
6 | studentsurvey | Only students in grades 6-12 partipate in the ... |
7 | rr_s | Student Response Rate |
8 | rr_t | Teacher Response Rate |
9 | rr_p | Parent Response Rate |
10 | N_s | Number of student respondents |
11 | N_t | Number of teacher respondents |
12 | N_p | Number of parent respondents |
13 | nr_s | Number of eligible students |
14 | nr_t | Number of eligible teachers |
15 | nr_p | Number of eligible parents |
16 | saf_p_10 | Safety and Respect score based on parent respo... |
17 | com_p_10 | Communication score based on parent responses |
18 | eng_p_10 | Engagement score based on parent responses |
19 | aca_p_10 | Academic expectations score based on parent re... |
20 | saf_t_10 | Safety and Respect score based on teacher resp... |
21 | com_t_10 | Communication score based on teacher responses |
22 | eng_t_10 | Engagement score based on teacher responses |
23 | aca_t_10 | Academic expectations score based on teacher r... |
24 | saf_s_10 | Safety and Respect score based on student resp... |
25 | com_s_10 | Communication score based on student responses |
26 | eng_s_10 | Engagement score based on student responses |
27 | aca_s_10 | Academic expectations score based on student r... |
28 | saf_tot_10 | Safety and Respect total score |
29 | com_tot_10 | Communication total score |
30 | eng_tot_10 | Engagement total score |
31 | aca_tot_10 | Academic Expectations total score |
32 | Field Series | Field Series Description |
33 | Column AG through Column CA | These fields provide scores determined for eac... |
34 | Column CB through Column LV | These fields provide percentages of responses ... |
35 | Column LW through Column VQ | These fields provide counts of responses from ... |
36 | Column VR through Column YS | These fields provide scores determined for eac... |
37 | Column YT through Column AMY | These fields provide percentages of responses ... |
38 | Column AMZ through Column BBE | These fields provide counts of responses from ... |
39 | Column BBF through Column BDD | These fields provide scores determined for eac... |
40 | Column BDE through Column BNB | These fields provide percentages of responses ... |
41 | Column BNC through BWZ | These fields provide counts of responses from ... |
42 | Field Convention | Field Convention Description |
43 | p_q1 | Indicates parent_question 1 |
44 | p_q1a | Indicates parent_question 1a |
45 | p_q1a_1 | Indicates parent question_1a_response option 1 |
46 | p_N_q1_1 | Indicates parent_Number of responses_question ... |
47 | t_q1 | Indicates teacher_question 1 |
48 | t_q1a | Indicates teacher_question 1a |
49 | t_q1a_1 | Indicates teacher question_1a_response option 1 |
50 | t_N_q1_1 | Indicates teacher_Number of responses_question... |
51 | s_q1 | Indicates student_question 1 |
52 | s_q1a | Indicates student_question 1a |
53 | s_q1a_1 | Indicates student question_1a_response option 1 |
54 | s_N_q1_1 | Indicates student_Number of responses_question... |
# Keeping only the single-field rows, and updating the year suffix from _10 to _11
# to match our survey columns (the only '0' in these field names is in the year):
data_dict = data_dict[:32]
pd.set_option('mode.chained_assignment', None)  # silencing the chained-assignment warning
data_dict['Field Name'] = data_dict['Field Name'].astype('str').str.replace('0','1')
pd.reset_option('mode.chained_assignment')
data_dict
Field Name | Field Description | |
---|---|---|
0 | dbn | School identification code (district borough n... |
1 | sch_type | School type (Elementary, Middle, High, etc) |
2 | location | School name |
3 | enrollment | Enrollment size |
4 | borough | Borough |
5 | principal | Principal name |
6 | studentsurvey | Only students in grades 6-12 partipate in the ... |
7 | rr_s | Student Response Rate |
8 | rr_t | Teacher Response Rate |
9 | rr_p | Parent Response Rate |
10 | N_s | Number of student respondents |
11 | N_t | Number of teacher respondents |
12 | N_p | Number of parent respondents |
13 | nr_s | Number of eligible students |
14 | nr_t | Number of eligible teachers |
15 | nr_p | Number of eligible parents |
16 | saf_p_11 | Safety and Respect score based on parent respo... |
17 | com_p_11 | Communication score based on parent responses |
18 | eng_p_11 | Engagement score based on parent responses |
19 | aca_p_11 | Academic expectations score based on parent re... |
20 | saf_t_11 | Safety and Respect score based on teacher resp... |
21 | com_t_11 | Communication score based on teacher responses |
22 | eng_t_11 | Engagement score based on teacher responses |
23 | aca_t_11 | Academic expectations score based on teacher r... |
24 | saf_s_11 | Safety and Respect score based on student resp... |
25 | com_s_11 | Communication score based on student responses |
26 | eng_s_11 | Engagement score based on student responses |
27 | aca_s_11 | Academic expectations score based on student r... |
28 | saf_tot_11 | Safety and Respect total score |
29 | com_tot_11 | Communication total score |
30 | eng_tot_11 | Engagement total score |
31 | aca_tot_11 | Academic Expectations total score |
## To check the values of the tags in the dictionary:
for val in survey_fields:
    ind = data_dict[data_dict['Field Name'] == val].index.values
    correlation = combined.corr()['sat_score'][combined.corr().index == val]
    print(data_dict.loc[ind, 'Field Description'])
    print(' ', correlation)
    print('\n')
Field | Field Description | Correlation with sat_score
---|---|---
rr_s | Student Response Rate | 0.232199
rr_t | Teacher Response Rate | -0.023386
rr_p | Parent Response Rate | 0.047925
N_s | Number of student respondents | 0.423463
N_t | Number of teacher respondents | 0.291463
N_p | Number of parent respondents | 0.421530
saf_p_11 | Safety and Respect score based on parent responses | 0.122913
com_p_11 | Communication score based on parent responses | -0.115073
eng_p_11 | Engagement score based on parent responses | 0.020254
aca_p_11 | Academic expectations score based on parent responses | 0.035155
saf_t_11 | Safety and Respect score based on teacher responses | 0.313810
com_t_11 | Communication score based on teacher responses | 0.082419
eng_t_11 | Engagement score based on teacher responses | 0.036906
aca_t_11 | Academic expectations score based on teacher responses | 0.132348
saf_s_11 | Safety and Respect score based on student responses | 0.337639
com_s_11 | Communication score based on student responses | 0.187370
eng_s_11 | Engagement score based on student responses | 0.213822
aca_s_11 | Academic expectations score based on student responses | 0.339435
saf_tot_11 | Safety and Respect total score | 0.318753
com_tot_11 | Communication total score | 0.077310
eng_tot_11 | Engagement total score | 0.100102
aca_tot_11 | Academic Expectations total score | 0.190966
Observations from the survey-field correlations:
- `N_s`, `N_t`, and `N_p` correlate with `sat_score`. This makes sense, as these columns count survey respondents and so act as a proxy for school size, which we already saw correlates with `sat_score`.
- `rr_s`: schools whose students responded to the survey at higher rates seem to have performed better on the SATs.
- `saf_tot_11`: the safety and respect total score (from teachers, students, and parents) correlates positively with `sat_score`, indicating that students need a safe and respectful environment to study properly.
- `aca_tot_11`: the overall academic expectations score correlates only weakly with `sat_score`, suggesting that parents' and teachers' academic expectations matter less here, whereas the student score `aca_s_11` correlates noticeably more strongly, showing how much students' own expectations track their SAT performance.

## Let's try to dive a bit deeper and plot our observations:
print(combined['saf_s_11'].corr(combined['sat_score']))
combined.plot.scatter('saf_s_11','sat_score', title='Safety plot relation', c='g')
0.3376387443608813
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb61366a0>
From the plot we see that the two columns are positively related: schools with higher student safety scores tend to have higher SAT scores, although the relationship is fairly noisy.
To relate this to our observations above, let's look at the safety score for each borough.
combined.groupby('boro')['saf_s_11'].agg(np.mean).sort_values(ascending=True)
boro
Brooklyn         6.370755
Staten Island    6.530000
Bronx            6.606577
Queens           6.721875
Manhattan        6.831370
Name: saf_s_11, dtype: float64
On average, all the boroughs have similar safety scores, with Brooklyn having the lowest (~6.37) and Manhattan the highest (~6.83).
rac_cols = ['white_per', 'asian_per', 'black_per', 'hispanic_per']
correlations[rac_cols].plot(kind='bar', title = 'Racial differences in SAT scores', rot=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb639d470>
Schools with higher percentages of White and Asian students have higher SAT scores, as indicated by the strong positive correlation values, while schools with higher percentages of Black and Hispanic students have lower SAT scores, as indicated by the negative correlation values.
## Let's find all the schools with 'hispanic_per' greater than 95%:
combined['SCHOOL NAME'][combined['hispanic_per'] > 95 ]
44                         MANHATTAN BRIDGES HIGH SCHOOL
82      WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL
89      GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...
125                  ACADEMY FOR LANGUAGE AND TECHNOLOGY
141                INTERNATIONAL SCHOOL FOR LIBERAL ARTS
176     PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE
253                            MULTICULTURAL HIGH SCHOOL
286               PAN AMERICAN INTERNATIONAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object
The schools listed above are primarily focused towards the recent immigrant population. The lower SAT scores could indicate that many of these students are not yet fluent in English.
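A quick way to probe that hypothesis (a sketch, not in the original notebook) is to look at the English-language-learner percentage for the same schools:
# High ell_percent values would support the English-fluency explanation:
mask = combined['hispanic_per'] > 95
print(combined.loc[mask, ['SCHOOL NAME', 'ell_percent', 'sat_score']])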
## Let's find all the schools with 'hispanic_per' < 10% and avg SAT score > 1800:
combined['SCHOOL NAME'][(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)]
37                               STUYVESANT HIGH SCHOOL
151                        BRONX HIGH SCHOOL OF SCIENCE
187                      BROOKLYN TECHNICAL HIGH SCHOOL
327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
356                STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object
All of the above are **'Specialized Schools'** with very low acceptance rates, where admission is decided by a student's score on an entrance exam. Accepting only high performers makes the average SAT scores of these schools very high.
Does a particular gender affect SAT scores?
gender = ['male_per', 'female_per']
correlations[gender].plot(kind='bar', title = 'Gender differences in SAT scores', rot=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb242ae80>
Though the correlation values are not strong, it can be seen that the female percentage (`female_per`) correlates positively with SAT scores while the male percentage (`male_per`) correlates negatively:
## Let's plot 'female_per' and 'sat_score' to aid our observations above:
combined.plot.scatter('female_per','sat_score', c='b')
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb23606a0>
## Let's plot 'male_per' and 'sat_score' to aid our observations above:
combined.plot.scatter('male_per','sat_score', c='b')
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb242a8d0>
Both scatter plots confirm that the correlations are very weak, and most schools have a roughly equal percentage of males and females. It is interesting, though, that schools where the gender split is fairly balanced (~40-60%) tend to perform better.
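One way to quantify that impression (a sketch with illustrative, hypothetical band edges) is to bucket schools by female percentage and compare the mean SAT scores per band:
# Bucketing schools into female-percentage bands and comparing average scores:
bands = pd.cut(combined['female_per'], bins=[0, 40, 60, 100])
print(combined.groupby(bands)['sat_score'].mean())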
## Let's find schools with a 'female_per' greater than 10% and an average SAT score greater than 1700:
combined['SCHOOL NAME'][(combined['female_per'] > 10) & (combined['sat_score'] > 1700)]
5                         BARD HIGH SCHOOL EARLY COLLEGE
26                         ELEANOR ROOSEVELT HIGH SCHOOL
37                                STUYVESANT HIGH SCHOOL
60                                    BEACON HIGH SCHOOL
61     FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...
79     HIGH SCHOOL FOR MATHEMATICS, SCIENCE AND ENGIN...
151                         BRONX HIGH SCHOOL OF SCIENCE
155    HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE
187                       BROOKLYN TECHNICAL HIGH SCHOOL
198                           BROOKLYN LATIN SCHOOL, THE
302                          TOWNSEND HARRIS HIGH SCHOOL
327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
356                 STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object
Many of these schools, again, belong to the category of **'Specialized Schools'**.
**Advanced Placement and SAT scores:**
In the U.S., high school students take Advanced Placement (AP) exams to earn college credit. It makes sense that the number of students at a school who took AP exams would be highly correlated with the school's SAT scores.
## Let's look at the percentage of students in each school who took at least one AP exam:
combined['ap_per'] = combined['AP Test Takers '] / combined['total_enrollment']
# let's plot the relationship:
print(combined['ap_per'].corr(combined['sat_score']))
combined.plot.scatter('ap_per', 'sat_score', title = 'Advance Placement vs SAT scores', xlim = 0)
0.05717081390766969
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb22e7fd0>
With a correlation of only ~0.06, there seems to be almost no relationship between `sat_score` and `ap_per`.
**Class size and SAT scores:**
# Let's plot the relationship:
print(combined['AVERAGE CLASS SIZE'].corr(combined['sat_score']))
combined.plot.scatter('AVERAGE CLASS SIZE', 'sat_score', title = 'Class size vs SAT scores')
0.3810143308095524
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb22d3400>
We see a positive correlation between the columns:
- Schools with smaller average class sizes tend to have lower SAT scores (roughly 800-1400).
- Schools with larger average class sizes tend to have higher SAT scores (roughly 1400-2100).

**Finding the `boro` with the best schools:**
## Let's look at the boroughs with the best schools:
combined.groupby('boro')['sat_score'].agg(np.mean).sort_values(ascending=False)
boro
Staten Island    1382.500000
Queens           1286.753032
Manhattan        1278.331410
Brooklyn         1181.364461
Bronx            1157.598203
Name: sat_score, dtype: float64
combined.groupby('boro')['sat_score'].agg(np.max).sort_values(ascending=False)
boro
Manhattan        2096.0
Bronx            1969.0
Staten Island    1953.0
Queens           1910.0
Brooklyn         1833.0
Name: sat_score, dtype: float64
combined.groupby('boro')['sat_score'].agg(np.min).sort_values(ascending=False)
boro
Staten Island    1195.0
Manhattan        1014.0
Queens            951.0
Bronx             934.0
Brooklyn          887.0
Name: sat_score, dtype: float64
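The three groupbys above can be collapsed into a single call (a sketch) by passing a list of aggregations:
# Mean, max, and min SAT score per borough in one table:
print(combined.groupby('boro')['sat_score'].agg(['mean', 'max', 'min']))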
We see that the schools with the best average SAT scores are located in Staten Island (average ~1382.5), while schools in the Bronx score the worst on average (~1157.6).
Let's condense our findings in one place:

**Survey responses:**
- Schools whose students responded to the survey at higher rates seemed to have performed better on the SATs.
- The safety and respect total score (from teachers, students, and parents) correlates positively with SAT scores, indicating that students need a safe and respectful environment to study properly.
- The overall academic expectations score correlates only weakly with SAT scores, whereas the student score `aca_s_11` correlates more strongly, showing how much students' own expectations track their results.

**Safety:** Student safety ratings correlate positively with SAT scores. Brooklyn has the lowest average safety score (~6.37) and Manhattan the highest (~6.83).

**Racial disparity:** Schools with higher percentages of White and Asian students score higher, while schools with higher percentages of Black and Hispanic students score lower. The schools with the highest Hispanic percentages primarily serve recent immigrants, while the low-Hispanic, high-SAT schools are selective Specialized Schools.

**Gender inequality:** The female percentage correlates positively and the male percentage negatively with SAT scores, though both correlations are weak.

**Class size:** Average class size correlates positively with SAT scores; larger schools and classes tend to do better on the SAT.

**By neighbourhood:** Staten Island has the highest average SAT score (~1382.5) and the Bronx the lowest (~1157.6).