(An NYC High School Data Analysis)
Exploring the relationships between SAT scores and demographic factors (e.g., race, income, gender) in New York City public schools.
New York City has a significant immigrant population and is very diverse, so comparing demographic factors such as race, income, and gender with SAT scores is a good way to determine whether the SAT is a fair test. For example, if certain racial groups consistently perform better on the SAT, we would have some evidence that the SAT is unfair.
**What is the SAT?**
The SAT (Scholastic Assessment Test) consists of three sections, each of which is worth a maximum of 800 points, for a maximum total score of 2,400. Colleges use the SAT to determine which students to admit, and high average SAT scores are usually indicative of a good school.
New York City has published data on student SAT scores by high school, along with additional demographic data sets. Over the course of this project we will combine the following data sets into a single, clean pandas dataframe:
Dataset | Description
---|---
SAT scores by school | SAT scores for each high school in New York City
School attendance | Attendance information for each school in New York City
Class size | Information on class size for each school
AP test results | Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
Graduation outcomes | The percentage of students who graduated, and other outcome information
Demographics | Demographic information for each school
School survey | Surveys of parents, teachers, and students at each school
Let's first read in all our datasets:
## Importing all the required libraries:
import pandas as pd
import numpy as np
import re
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
# creating a dictionary to keep the datasets in an orderly fashion:
data = {}
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d
# We see that the surveys have a different encoding, so let's read them in with the encoding they need:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
# combining the surveys:
survey = pd.concat([all_survey, d75_survey], axis=0)
# Copying the 'dbn' column to 'DBN' to make it uniform across all the datasets:
survey["DBN"] = survey["dbn"]
# Locating all the survey fields we need for the analysis:
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
# Assigning all the cleansed data back to the dictionary:
data["survey"] = survey
# let's check the result:
data["survey"].head()
DBN | rr_s | rr_t | rr_p | N_s | N_t | N_p | saf_p_11 | com_p_11 | eng_p_11 | ... | eng_t_11 | aca_t_11 | saf_s_11 | com_s_11 | eng_s_11 | aca_s_11 | saf_tot_11 | com_tot_11 | eng_tot_11 | aca_tot_11 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | NaN | 88 | 60 | NaN | 22.0 | 90.0 | 8.5 | 7.6 | 7.5 | ... | 7.6 | 7.9 | NaN | NaN | NaN | NaN | 8.0 | 7.7 | 7.5 | 7.9 |
1 | 01M019 | NaN | 100 | 60 | NaN | 34.0 | 161.0 | 8.4 | 7.6 | 7.6 | ... | 8.9 | 9.1 | NaN | NaN | NaN | NaN | 8.5 | 8.1 | 8.2 | 8.4 |
2 | 01M020 | NaN | 88 | 73 | NaN | 42.0 | 367.0 | 8.9 | 8.3 | 8.3 | ... | 6.8 | 7.5 | NaN | NaN | NaN | NaN | 8.2 | 7.3 | 7.5 | 8.0 |
3 | 01M034 | 89.0 | 73 | 50 | 145.0 | 29.0 | 151.0 | 8.8 | 8.2 | 8.0 | ... | 6.8 | 7.8 | 6.2 | 5.9 | 6.5 | 7.4 | 7.3 | 6.7 | 7.1 | 7.9 |
4 | 01M063 | NaN | 100 | 60 | NaN | 23.0 | 90.0 | 8.7 | 7.9 | 8.1 | ... | 7.8 | 8.1 | NaN | NaN | NaN | NaN | 8.5 | 7.6 | 7.9 | 8.0 |
5 rows × 23 columns
We see that `DBN` is the unique identification code for each school across the datasets. Let's make sure that each of the datasets has this column:
for key, value in data.items():
    if 'DBN' not in value.columns:
        print(key)
        print([value.columns])
        print('\n')
class_size [Index(['CSD', 'BOROUGH', 'SCHOOL CODE', 'SCHOOL NAME', 'GRADE ', 'PROGRAM TYPE', 'CORE SUBJECT (MS CORE and 9-12 ONLY)', 'CORE COURSE (MS CORE and 9-12 ONLY)', 'SERVICE CATEGORY(K-9* ONLY)', 'NUMBER OF STUDENTS / SEATS FILLED', 'NUMBER OF SECTIONS', 'AVERAGE CLASS SIZE', 'SIZE OF SMALLEST CLASS', 'SIZE OF LARGEST CLASS', 'DATA SOURCE', 'SCHOOLWIDE PUPIL-TEACHER RATIO'], dtype='object')]

hs_directory [Index(['dbn', 'school_name', 'boro', 'building_code', 'phone_number', 'fax_number', 'grade_span_min', 'grade_span_max', 'expgrade_span_min', 'expgrade_span_max', 'bus', 'subway', 'primary_address_line_1', 'city', 'state_code', 'zip', 'website', 'total_students', 'campus_name', 'school_type', 'overview_paragraph', 'program_highlights', 'language_classes', 'advancedplacement_courses', 'online_ap_courses', 'online_language_courses', 'extracurricular_activities', 'psal_sports_boys', 'psal_sports_girls', 'psal_sports_coed', 'school_sports', 'partner_cbo', 'partner_hospital', 'partner_highered', 'partner_cultural', 'partner_nonprofit', 'partner_corporate', 'partner_financial', 'partner_other', 'addtl_info1', 'addtl_info2', 'start_time', 'end_time', 'se_services', 'ell_programs', 'school_accessibility_description', 'number_programs', 'priority01', 'priority02', 'priority03', 'priority04', 'priority05', 'priority06', 'priority07', 'priority08', 'priority09', 'priority10', 'Location 1'], dtype='object')]
We see that there are two datasets that don't have a defined `DBN` column (`hs_directory` has a lowercase `dbn` instead). Let's fix those datasets:
# Changing the 'dbn' column in hs_directory to 'DBN':
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
# Now let's check the `class_size` dataset:
data['class_size'].head()
CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | GEN ED | - | - | - | 19.0 | 1.0 | 19.0 | 19.0 | 19.0 | ATS | NaN |
1 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | CTT | - | - | - | 21.0 | 1.0 | 21.0 | 21.0 | 21.0 | ATS | NaN |
2 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | GEN ED | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
3 | 1 | M | M015 | P.S. 015 Roberto Clemente | 01 | CTT | - | - | - | 17.0 | 1.0 | 17.0 | 17.0 | 17.0 | ATS | NaN |
4 | 1 | M | M015 | P.S. 015 Roberto Clemente | 02 | GEN ED | - | - | - | 15.0 | 1.0 | 15.0 | 15.0 | 15.0 | ATS | NaN |
**Creating the `DBN` column:**
From looking at these rows, we can tell that the `DBN` in the `sat_results` data is just a combination of the `CSD` and `SCHOOL CODE` columns in the `class_size` data. The main difference is that the `DBN` is padded, so that the `CSD` portion of it always consists of two digits. That means we'll need to add a leading 0 to the `CSD` if the `CSD` is less than two digits long.
## Let's create a function to build the padded CSD, and then the 'DBN' column:
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
# checking the function:
print (pad_csd(2))
# applying the function and combining the required columns for creating the `DBN` column:
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
02
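As an aside, pandas can do the same padding without a custom function. A one-line equivalent (a sketch, not part of the original notebook) using the built-in string accessor:
# Equivalent padding with str.zfill, which pads with leading zeros to width 2:
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].astype(str).str.zfill(2)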
# Let's check the columns we need to calculate the total SAT score:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
print ('without conversion: ', data['sat_results']['SAT Math Avg. Score'].dtype)
## We need to convert these columns into numeric type to be able to add the columns:
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
print ('after conversion: ', data['sat_results']['SAT Math Avg. Score'].dtype)
without conversion:  object
after conversion:  float64
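`errors="coerce"` turns any entry that can't be parsed as a number into NaN. A quick sanity check (a sketch, not in the original notebook) counts how many scores failed to parse:
# Counting the NaN values introduced by the coercion:
print(data['sat_results'][cols].isnull().sum())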
## Adding the columns to get the final SAT score:
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
print(data['sat_results']['sat_score'].head())
0    1122.0
1    1172.0
2    1149.0
3    1174.0
4    1207.0
Name: sat_score, dtype: float64
Let's now check the `hs_directory` dataset to parse the latitude and longitude coordinates for each school. This will enable us to map the schools and uncover any geographic patterns in the data. The coordinates are currently in the text field `Location 1` in the `hs_directory` data set.
## Checking the `Location 1` column:
print (data['hs_directory']['Location 1'][1])
1110 Boston Road Bronx, NY 10456 (40.8276026690005, -73.90447525699966)
## Creating functions to separate the latitude and longitude values of the schools from the column:
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon
# Checking the functions:
print('The latitude value is : ',find_lat(data['hs_directory']['Location 1'][1]))
print('The longitude value is : ',find_lon(data['hs_directory']['Location 1'][1]))
# Applying the functions to create columns that we need :
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
# Converting the string values of the locations to numeric datatypes:
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
The latitude value is :  40.8276026690005
The longitude value is :  -73.90447525699966
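A more pandas-idiomatic alternative (a sketch, assuming every `Location 1` value ends with a "(lat, lon)" pair) extracts both coordinates in one pass instead of two apply calls:
# Extracting both coordinates at once; the regex's named groups become columns:
coords = data["hs_directory"]["Location 1"].str.extract(r"\((?P<lat>[^,]+), (?P<lon>[^)]+)\)")
print(coords.head())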
Since the datasets together contain a huge amount of information, let's condense them for better analysis.
## Let's look at the `class_size`:
class_size = data["class_size"]
class_size.head(2)
CSD | BOROUGH | SCHOOL CODE | SCHOOL NAME | GRADE | PROGRAM TYPE | CORE SUBJECT (MS CORE and 9-12 ONLY) | CORE COURSE (MS CORE and 9-12 ONLY) | SERVICE CATEGORY(K-9* ONLY) | NUMBER OF STUDENTS / SEATS FILLED | NUMBER OF SECTIONS | AVERAGE CLASS SIZE | SIZE OF SMALLEST CLASS | SIZE OF LARGEST CLASS | DATA SOURCE | SCHOOLWIDE PUPIL-TEACHER RATIO | padded_csd | DBN | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | GEN ED | - | - | - | 19.0 | 1.0 | 19.0 | 19.0 | 19.0 | ATS | NaN | 01 | 01M015 |
1 | 1 | M | M015 | P.S. 015 Roberto Clemente | 0K | CTT | - | - | - | 21.0 | 1.0 | 21.0 | 21.0 | 21.0 | ATS | NaN | 01 | 01M015 |
class_size["GRADE "].value_counts()
09-12      10644
MS Core     4762
0K-09       1384
0K          1237
01          1185
02          1167
03          1143
04          1140
05          1086
06           846
07           778
08           735
09            20
Name: GRADE , dtype: int64
class_size["PROGRAM TYPE"].value_counts()
GEN ED     14545
CTT         7460
SPEC ED     3653
G&T          469
Name: PROGRAM TYPE, dtype: int64
## Let's condense the dataset to include only the most frequently occurring values:
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
print(class_size["DBN"].value_counts())
# Condensing the dataset by taking the mean of the numeric columns for each `DBN`:
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
# Assigning the dataset back to the dictionary:
data["class_size"] = class_size
20K505    21
09X505    21
24Q485    20
26Q435    20
25Q425    20
30Q445    20
21K525    20
20K490    20
21K540    20
22K495    20
25Q460    19
03M470    19
10X445    19
10X696    19
21K690    19
18K563    19
11X545    19
08X405    19
30Q450    19
27Q480    19
26Q495    19
28Q470    19
31R455    19
30Q575    19
12X278    19
26Q415    19
22K405    19
27Q410    19
25Q285    19
13K670    19
          ..
14K126     1
09X303     1
20K187     1
11X180     1
29Q295     1
10X206     1
29Q059     1
03M402     1
28Q217     1
32K377     1
21K209     1
13K596     1
23K041     1
17K394     1
07X203     1
03M247     1
20K030     1
10X254     1
27Q043     1
03M334     1
11X083     1
16K308     1
06M326     1
22K078     1
31R027     1
10X391     1
09X327     1
31R063     1
10X390     1
18K588     1
Name: DBN, Length: 583, dtype: int64
## Let's check the `demographics` dataset:
data["demographics"].head(2)
DBN | Name | schoolyear | fl_percent | frl_percent | total_enrollment | prek | k | grade1 | grade2 | ... | black_num | black_per | hispanic_num | hispanic_per | white_num | white_per | male_num | male_per | female_num | female_per | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20052006 | 89.4 | NaN | 281 | 15 | 36 | 40 | 33 | ... | 74 | 26.3 | 189 | 67.3 | 5 | 1.8 | 158.0 | 56.2 | 123.0 | 43.8 |
1 | 01M015 | P.S. 015 ROBERTO CLEMENTE | 20062007 | 89.4 | NaN | 243 | 15 | 29 | 39 | 38 | ... | 68 | 28.0 | 153 | 63.0 | 4 | 1.6 | 140.0 | 57.6 | 103.0 | 42.4 |
2 rows × 38 columns
## Let's condense the `demographics` dataset to aid our findings:
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
# Converting AP scores to numeric for better analysis:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
Since we have now cleaned the datasets, let's merge them together into one.
We'll use the `DBN` column as the merge key for all the datasets, and we need to choose the type of each merge carefully so as to preserve the maximum amount of data for our analysis.
We know that the `sat_results` dataset forms the base of our combined data, so let's use it as the base and add the other datasets to it.
combined = data["sat_results"]
# Using LEFT merge to preserve all the 'DBN' columns in the `sat_results` dataset:
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
# Using the INNER merge to get the common `DBN` values:
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
# Let's fill the empty values with their respective column's mean:
combined = combined.fillna(combined.mean())
# For the columns with no discernible mean:
combined = combined.fillna(0)
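A quick sanity check (a sketch, not in the original notebook) on how many schools and columns survived the merges:
# The left merges preserve the sat_results rows; the inner merges keep only
# schools present in every remaining dataset:
print(combined.shape)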
# checking the results:
combined.head()
DBN | SCHOOL NAME | Num of SAT Test Takers | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score | sat_score | SchoolName | AP Test Takers | Total Exams Taken | ... | priority04 | priority05 | priority06 | priority07 | priority08 | priority09 | priority10 | Location 1 | lat | lon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01M292 | HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES | 29 | 355.0 | 404.0 | 363.0 | 1122.0 | 0 | 129.028846 | 197.038462 | ... | Then to Manhattan students or residents | Then to New York City residents | 0 | 0 | 0 | 0 | 0 | 220 Henry Street\nNew York, NY 10002\n(40.7137... | 40.713764 | -73.985260 |
1 | 01M448 | UNIVERSITY NEIGHBORHOOD HIGH SCHOOL | 91 | 383.0 | 423.0 | 366.0 | 1172.0 | UNIVERSITY NEIGHBORHOOD H.S. | 39.000000 | 49.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 200 Monroe Street\nNew York, NY 10002\n(40.712... | 40.712332 | -73.984797 |
2 | 01M450 | EAST SIDE COMMUNITY SCHOOL | 70 | 377.0 | 402.0 | 370.0 | 1149.0 | EAST SIDE COMMUNITY HS | 19.000000 | 21.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 420 East 12 Street\nNew York, NY 10009\n(40.72... | 40.729783 | -73.983041 |
3 | 01M509 | MARTA VALLE HIGH SCHOOL | 44 | 390.0 | 433.0 | 384.0 | 1207.0 | 0 | 129.028846 | 197.038462 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 145 Stanton Street\nNew York, NY 10002\n(40.72... | 40.720569 | -73.985673 |
4 | 01M539 | NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND ... | 159 | 522.0 | 574.0 | 525.0 | 1621.0 | NEW EXPLORATIONS SCI,TECH,MATH | 255.000000 | 377.000000 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 111 Columbia Street\nNew York, NY 10002\n(40.7... | 40.718725 | -73.979426 |
5 rows × 159 columns
We see that the first two characters in the `DBN` column represent the school district codes. Let's extract the codes so we can map the data by district.
## Creating a function to extract the codes:
def get_first_two_chars(dbn):
    return dbn[0:2]
# Applying the function to the `DBN` column to get the `school_dist` values:
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
Now that we have a clean dataset, let's perform some preliminary analysis to draw inferences.
## To find "Pearson's" correlation between the columns:
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
SAT Critical Reading Avg. Score          0.986820
SAT Math Avg. Score                      0.972643
SAT Writing Avg. Score                   0.987771
sat_score                                1.000000
AP Test Takers                           0.523140
Total Exams Taken                        0.514333
Number of Exams with scores 3 4 or 5     0.463245
Total Cohort                             0.325144
CSD                                      0.042948
NUMBER OF STUDENTS / SEATS FILLED        0.394626
NUMBER OF SECTIONS                       0.362673
AVERAGE CLASS SIZE                       0.381014
SIZE OF SMALLEST CLASS                   0.249949
SIZE OF LARGEST CLASS                    0.314434
SCHOOLWIDE PUPIL-TEACHER RATIO                NaN
schoolyear                                    NaN
fl_percent                                    NaN
frl_percent                             -0.722225
total_enrollment                         0.367857
ell_num                                 -0.153778
ell_percent                             -0.398750
sped_num                                 0.034933
sped_percent                            -0.448170
asian_num                                0.475445
asian_per                                0.570730
black_num                                0.027979
black_per                               -0.284139
hispanic_num                             0.025744
hispanic_per                            -0.396985
white_num                                0.449559
                                          ...
rr_p                                     0.047925
N_s                                      0.423463
N_t                                      0.291463
N_p                                      0.421530
saf_p_11                                 0.122913
com_p_11                                -0.115073
eng_p_11                                 0.020254
aca_p_11                                 0.035155
saf_t_11                                 0.313810
com_t_11                                 0.082419
eng_t_11                                 0.036906
aca_t_11                                 0.132348
saf_s_11                                 0.337639
com_s_11                                 0.187370
eng_s_11                                 0.213822
aca_s_11                                 0.339435
saf_tot_11                               0.318753
com_tot_11                               0.077310
eng_tot_11                               0.100102
aca_tot_11                               0.190966
grade_span_max                                NaN
expgrade_span_max                             NaN
zip                                     -0.063977
total_students                           0.407827
number_programs                          0.117012
priority08                                    NaN
priority09                                    NaN
priority10                                    NaN
lat                                     -0.121029
lon                                     -0.132222
Name: sat_score, Length: 67, dtype: float64
## We generally consider values between -0.25 and 0.25 to be weak correlations, so let's filter those out and examine the rest of the data:
neg_corr = correlations[correlations < -0.25 ]
neg_corr
frl_percent    -0.722225
ell_percent    -0.398750
sped_percent   -0.448170
black_per      -0.284139
hispanic_per   -0.396985
Name: sat_score, dtype: float64
pos_corr = correlations[correlations > 0.25 ]
pos_corr
SAT Critical Reading Avg. Score          0.986820
SAT Math Avg. Score                      0.972643
SAT Writing Avg. Score                   0.987771
sat_score                                1.000000
AP Test Takers                           0.523140
Total Exams Taken                        0.514333
Number of Exams with scores 3 4 or 5     0.463245
Total Cohort                             0.325144
NUMBER OF STUDENTS / SEATS FILLED        0.394626
NUMBER OF SECTIONS                       0.362673
AVERAGE CLASS SIZE                       0.381014
SIZE OF LARGEST CLASS                    0.314434
total_enrollment                         0.367857
asian_num                                0.475445
asian_per                                0.570730
white_num                                0.449559
white_per                                0.620718
male_num                                 0.325520
female_num                               0.388631
N_s                                      0.423463
N_t                                      0.291463
N_p                                      0.421530
saf_t_11                                 0.313810
saf_s_11                                 0.337639
aca_s_11                                 0.339435
saf_tot_11                               0.318753
total_students                           0.407827
Name: sat_score, dtype: float64
- `total_enrollment` has a strong positive correlation with `sat_score`. This is surprising, because we'd expect smaller schools, where students receive more attention, to have higher scores. However, it looks like the opposite is true -- larger schools tend to do better on the SAT. Other columns that act as proxies for enrollment correlate similarly, such as `total_students`, `N_s`, `N_p`, `N_t`, `AP Test Takers `, `Total Exams Taken`, and `NUMBER OF SECTIONS` (see the sketch after this list).
- Both the percentage of females (`female_per`) and number of females (`female_num`) at a school correlate positively with SAT score, whereas the percentage of males (`male_per`) and the number of males (`male_num`) correlate negatively. This could indicate that women do better on the SAT than men.
- Teacher and student ratings of school safety (`saf_t_11` and `saf_s_11`) correlate with `sat_score`.
- Student ratings of school academic standards (`aca_s_11`) correlate with `sat_score`, but the same does not hold for ratings from teachers and parents (`aca_p_11` and `aca_t_11`).
- There is significant racial inequality in SAT scores (`white_per`, `asian_per`, `black_per`, `hispanic_per`).
- The percentage of English language learners (`ell_percent`) and of students on free or reduced-price lunch (`frl_percent`) both have strong negative correlations with SAT scores.
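To eyeball the enrollment relationship from the first bullet, a quick scatter plot (a sketch, not part of the original notebook):
# Plotting total enrollment against SAT score; a cluster of small, low-scoring
# schools would explain the overall positive correlation:
combined.plot.scatter('total_enrollment', 'sat_score')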
In order to understand the data better, let's plot the data:
## Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
## Importing the library for plotting:
import matplotlib.pyplot as plt
%matplotlib inline
## Let's plot a bar plot to see the correlation between `sat_score` and the fields that originally came from a survey of parents, teachers, and students:
combined.corr()["sat_score"][survey_fields].plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb8f43518>
## Let's see the column tags to understand the plot better:
link = 'https://data.cityofnewyork.us/api/views/mnz3-dyi8/files/aa68d821-4dbb-4eb2-9448-3d8cbbad5044?download=true&filename=Survey%20Data%20Dictionary.xls'
data_dict = pd.read_excel(link)
data_dict.head()
2011 NYC School Survey Data Dictionary | Unnamed: 1 | |
---|---|---|
0 | This data dictionary can be used with the scho... | NaN |
1 | NaN | NaN |
2 | Field Name | Field Description |
3 | dbn | School identification code (district borough n... |
4 | sch_type | School type (Elementary, Middle, High, etc) |
## Let's clean up the dataframe:
columns = ['Field Name', 'Field Description'] # headers
data_dict = data_dict[3:] # removing the top 3 rows
data_dict.columns = columns # assigning the headers
data_dict.reset_index(inplace=True) # resetting and assigning a new index
data_dict = data_dict.drop('index', axis=1)
## reviewing the rows to find the ones we don't need:
data_dict
Field Name | Field Description | |
---|---|---|
0 | dbn | School identification code (district borough n... |
1 | sch_type | School type (Elementary, Middle, High, etc) |
2 | location | School name |
3 | enrollment | Enrollment size |
4 | borough | Borough |
5 | principal | Principal name |
6 | studentsurvey | Only students in grades 6-12 partipate in the ... |
7 | rr_s | Student Response Rate |
8 | rr_t | Teacher Response Rate |
9 | rr_p | Parent Response Rate |
10 | N_s | Number of student respondents |
11 | N_t | Number of teacher respondents |
12 | N_p | Number of parent respondents |
13 | nr_s | Number of eligible students |
14 | nr_t | Number of eligible teachers |
15 | nr_p | Number of eligible parents |
16 | saf_p_10 | Safety and Respect score based on parent respo... |
17 | com_p_10 | Communication score based on parent responses |
18 | eng_p_10 | Engagement score based on parent responses |
19 | aca_p_10 | Academic expectations score based on parent re... |
20 | saf_t_10 | Safety and Respect score based on teacher resp... |
21 | com_t_10 | Communication score based on teacher responses |
22 | eng_t_10 | Engagement score based on teacher responses |
23 | aca_t_10 | Academic expectations score based on teacher r... |
24 | saf_s_10 | Safety and Respect score based on student resp... |
25 | com_s_10 | Communication score based on student responses |
26 | eng_s_10 | Engagement score based on student responses |
27 | aca_s_10 | Academic expectations score based on student r... |
28 | saf_tot_10 | Safety and Respect total score |
29 | com_tot_10 | Communication total score |
30 | eng_tot_10 | Engagement total score |
31 | aca_tot_10 | Academic Expectations total score |
32 | Field Series | Field Series Description |
33 | Column AG through Column CA | These fields provide scores determined for eac... |
34 | Column CB through Column LV | These fields provide percentages of responses ... |
35 | Column LW through Column VQ | These fields provide counts of responses from ... |
36 | Column VR through Column YS | These fields provide scores determined for eac... |
37 | Column YT through Column AMY | These fields provide percentages of responses ... |
38 | Column AMZ through Column BBE | These fields provide counts of responses from ... |
39 | Column BBF through Column BDD | These fields provide scores determined for eac... |
40 | Column BDE through Column BNB | These fields provide percentages of responses ... |
41 | Column BNC through BWZ | These fields provide counts of responses from ... |
42 | Field Convention | Field Convention Description |
43 | p_q1 | Indicates parent_question 1 |
44 | p_q1a | Indicates parent_question 1a |
45 | p_q1a_1 | Indicates parent question_1a_response option 1 |
46 | p_N_q1_1 | Indicates parent_Number of responses_question ... |
47 | t_q1 | Indicates teacher_question 1 |
48 | t_q1a | Indicates teacher_question 1a |
49 | t_q1a_1 | Indicates teacher question_1a_response option 1 |
50 | t_N_q1_1 | Indicates teacher_Number of responses_question... |
51 | s_q1 | Indicates student_question 1 |
52 | s_q1a | Indicates student_question 1a |
53 | s_q1a_1 | Indicates student question_1a_response option 1 |
54 | s_N_q1_1 | Indicates student_Number of responses_question... |
# Keeping only the single-field rows, and updating the year suffix from _10 to _11
# to match our survey columns (the only '0' in these field names is in the year):
data_dict = data_dict[:32]
pd.set_option('mode.chained_assignment', None)  # silencing the chained-assignment warning
data_dict['Field Name'] = data_dict['Field Name'].astype('str').str.replace('0','1')
pd.reset_option('mode.chained_assignment')
data_dict
Field Name | Field Description | |
---|---|---|
0 | dbn | School identification code (district borough n... |
1 | sch_type | School type (Elementary, Middle, High, etc) |
2 | location | School name |
3 | enrollment | Enrollment size |
4 | borough | Borough |
5 | principal | Principal name |
6 | studentsurvey | Only students in grades 6-12 partipate in the ... |
7 | rr_s | Student Response Rate |
8 | rr_t | Teacher Response Rate |
9 | rr_p | Parent Response Rate |
10 | N_s | Number of student respondents |
11 | N_t | Number of teacher respondents |
12 | N_p | Number of parent respondents |
13 | nr_s | Number of eligible students |
14 | nr_t | Number of eligible teachers |
15 | nr_p | Number of eligible parents |
16 | saf_p_11 | Safety and Respect score based on parent respo... |
17 | com_p_11 | Communication score based on parent responses |
18 | eng_p_11 | Engagement score based on parent responses |
19 | aca_p_11 | Academic expectations score based on parent re... |
20 | saf_t_11 | Safety and Respect score based on teacher resp... |
21 | com_t_11 | Communication score based on teacher responses |
22 | eng_t_11 | Engagement score based on teacher responses |
23 | aca_t_11 | Academic expectations score based on teacher r... |
24 | saf_s_11 | Safety and Respect score based on student resp... |
25 | com_s_11 | Communication score based on student responses |
26 | eng_s_11 | Engagement score based on student responses |
27 | aca_s_11 | Academic expectations score based on student r... |
28 | saf_tot_11 | Safety and Respect total score |
29 | com_tot_11 | Communication total score |
30 | eng_tot_11 | Engagement total score |
31 | aca_tot_11 | Academic Expectations total score |
## To check the values of the tags in the dictionary:
for val in survey_fields:
    ind = data_dict[data_dict['Field Name'] == val].index.values
    correlation = combined.corr()['sat_score'][combined.corr().index == val]
    print(data_dict.loc[ind, 'Field Description'])
    print(' ', correlation)
    print('\n')
Field | Field Description | Correlation with sat_score
---|---|---
rr_s | Student Response Rate | 0.232199
rr_t | Teacher Response Rate | -0.023386
rr_p | Parent Response Rate | 0.047925
N_s | Number of student respondents | 0.423463
N_t | Number of teacher respondents | 0.291463
N_p | Number of parent respondents | 0.421530
saf_p_11 | Safety and Respect score based on parent responses | 0.122913
com_p_11 | Communication score based on parent responses | -0.115073
eng_p_11 | Engagement score based on parent responses | 0.020254
aca_p_11 | Academic expectations score based on parent responses | 0.035155
saf_t_11 | Safety and Respect score based on teacher responses | 0.313810
com_t_11 | Communication score based on teacher responses | 0.082419
eng_t_11 | Engagement score based on teacher responses | 0.036906
aca_t_11 | Academic expectations score based on teacher responses | 0.132348
saf_s_11 | Safety and Respect score based on student responses | 0.337639
com_s_11 | Communication score based on student responses | 0.187370
eng_s_11 | Engagement score based on student responses | 0.213822
aca_s_11 | Academic expectations score based on student responses | 0.339435
saf_tot_11 | Safety and Respect total score | 0.318753
com_tot_11 | Communication total score | 0.077310
eng_tot_11 | Engagement total score | 0.100102
aca_tot_11 | Academic Expectations total score | 0.190966
Observations from the survey-field correlations:
- `N_s`, `N_t`, and `N_p` correlate with `sat_score`. This makes sense, as these columns count survey respondents and so act as a proxy for school size, which we already saw correlates with `sat_score`.
- `rr_s`: schools whose students responded to the survey at higher rates seem to have performed better on the SATs.
- `saf_tot_11`: the safety and respect total score (from teachers, students, and parents) correlates positively with `sat_score`, indicating that students need a safe and respectful environment to study properly.
- `aca_tot_11`: the overall academic expectations score correlates only weakly with `sat_score`, suggesting that parents' and teachers' academic expectations matter less here, whereas the student score `aca_s_11` correlates noticeably more strongly, showing how much students' own expectations track their SAT performance.

## Let's try to dive a bit deeper and plot our observations:
print(combined['saf_s_11'].corr(combined['sat_score']))
combined.plot.scatter('saf_s_11','sat_score', title='Safety plot relation', c='g')
0.3376387443608813
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb61366a0>
From the plot we see that the two columns are positively related: schools with higher student safety scores tend to have higher SAT scores, although the relationship is fairly noisy.
To relate this to our observations above, let's look at the safety score for each borough.
combined.groupby('boro')['saf_s_11'].agg(np.mean).sort_values(ascending=True)
boro
Brooklyn         6.370755
Staten Island    6.530000
Bronx            6.606577
Queens           6.721875
Manhattan        6.831370
Name: saf_s_11, dtype: float64
On average, all the boroughs have similar safety scores, with Brooklyn having the lowest (~6.37) and Manhattan the highest (~6.83).
rac_cols = ['white_per', 'asian_per', 'black_per', 'hispanic_per']
correlations[rac_cols].plot(kind='bar', title = 'Racial differences in SAT scores', rot=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb639d470>
Schools with higher percentages of White and Asian students have higher SAT scores, as indicated by the strong positive correlation values, while schools with higher percentages of Black and Hispanic students have lower SAT scores, as indicated by the negative correlation values.
## Let's find all the schools with 'hispanic_per' greater than 95%:
combined['SCHOOL NAME'][combined['hispanic_per'] > 95 ]
44                         MANHATTAN BRIDGES HIGH SCHOOL
82      WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL
89      GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...
125                  ACADEMY FOR LANGUAGE AND TECHNOLOGY
141                INTERNATIONAL SCHOOL FOR LIBERAL ARTS
176     PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE
253                            MULTICULTURAL HIGH SCHOOL
286               PAN AMERICAN INTERNATIONAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object
The schools listed above are primarily focused towards the recent immigrant population. The lower SAT scores could indicate that many of these students are not yet fluent in English.
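A quick way to probe that hypothesis (a sketch, not in the original notebook) is to look at the English-language-learner percentage for the same schools:
# High ell_percent values would support the English-fluency explanation:
mask = combined['hispanic_per'] > 95
print(combined.loc[mask, ['SCHOOL NAME', 'ell_percent', 'sat_score']])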
## Let's find all the schools with 'hispanic_per' < 10% and avg SAT score > 1800:
combined['SCHOOL NAME'][(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800)]
37                               STUYVESANT HIGH SCHOOL
151                        BRONX HIGH SCHOOL OF SCIENCE
187                      BROOKLYN TECHNICAL HIGH SCHOOL
327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
356                STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object
All of the above are **'Specialized Schools'** with very low acceptance rates, where admission is decided by a student's score on an entrance exam. Accepting only high performers makes the average SAT scores of these schools very high.
Does a particular gender affect SAT scores?
gender = ['male_per', 'female_per']
correlations[gender].plot(kind='bar', title = 'Gender differences in SAT scores', rot=0)
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb242ae80>
Though the correlation values are not strong, it can be seen that the female percentage (`female_per`) correlates positively with SAT scores while the male percentage (`male_per`) correlates negatively:
## Let's plot 'female_per' and 'sat_score' to aid our observations above:
combined.plot.scatter('female_per','sat_score', c='b')
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb23606a0>
## Let's plot 'male_per' and 'sat_score' to aid our observations above:
combined.plot.scatter('male_per','sat_score', c='b')
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb242a8d0>
Both scatter plots confirm that the correlations are very weak, and most schools have a roughly equal percentage of males and females. It is interesting, though, that schools where the gender split is fairly balanced (~40-60%) tend to perform better.
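One way to quantify that impression (a sketch with illustrative, hypothetical band edges) is to bucket schools by female percentage and compare the mean SAT scores per band:
# Bucketing schools into female-percentage bands and comparing average scores:
bands = pd.cut(combined['female_per'], bins=[0, 40, 60, 100])
print(combined.groupby(bands)['sat_score'].mean())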
## Let's find schools with a 'female_per' greater than 10% and an average SAT score greater than 1700:
combined['SCHOOL NAME'][(combined['female_per'] > 10) & (combined['sat_score'] > 1700)]
5                         BARD HIGH SCHOOL EARLY COLLEGE
26                         ELEANOR ROOSEVELT HIGH SCHOOL
37                                STUYVESANT HIGH SCHOOL
60                                    BEACON HIGH SCHOOL
61     FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...
79     HIGH SCHOOL FOR MATHEMATICS, SCIENCE AND ENGIN...
151                         BRONX HIGH SCHOOL OF SCIENCE
155    HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE
187                       BROOKLYN TECHNICAL HIGH SCHOOL
198                           BROOKLYN LATIN SCHOOL, THE
302                          TOWNSEND HARRIS HIGH SCHOOL
327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
356                 STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object
Many of these schools, again, belong to the category of **'Specialized Schools'**.
**Advanced Placement and SAT scores:**
In the U.S., high school students take Advanced Placement (AP) exams to earn college credit. It makes sense that the number of students at a school who took AP exams would be highly correlated with the school's SAT scores.
## Let's look at the percentage of students in each school who took at least one AP exam:
combined['ap_per'] = combined['AP Test Takers '] / combined['total_enrollment']
# let's plot the relationship:
print(combined['ap_per'].corr(combined['sat_score']))
combined.plot.scatter('ap_per', 'sat_score', title = 'Advance Placement vs SAT scores', xlim = 0)
0.05717081390766969
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb22e7fd0>
With a correlation of only ~0.06, there seems to be almost no relationship between `sat_score` and `ap_per`.
**Class size and SAT scores:**
# Let's plot the relationship:
print(combined['AVERAGE CLASS SIZE'].corr(combined['sat_score']))
combined.plot.scatter('AVERAGE CLASS SIZE', 'sat_score', title = 'Class size vs SAT scores')
0.3810143308095524
<matplotlib.axes._subplots.AxesSubplot at 0x7f7bb22d3400>
We see a positive correlation between the columns:
- Schools with smaller average class sizes tend to have lower SAT scores (roughly 800-1400).
- Schools with larger average class sizes tend to have higher SAT scores (roughly 1400-2100).

**Finding the `boro` with the best schools:**
## Let's look at the boroughs with the best schools:
combined.groupby('boro')['sat_score'].agg(np.mean).sort_values(ascending=False)
boro
Staten Island    1382.500000
Queens           1286.753032
Manhattan        1278.331410
Brooklyn         1181.364461
Bronx            1157.598203
Name: sat_score, dtype: float64
combined.groupby('boro')['sat_score'].agg(np.max).sort_values(ascending=False)
boro
Manhattan        2096.0
Bronx            1969.0
Staten Island    1953.0
Queens           1910.0
Brooklyn         1833.0
Name: sat_score, dtype: float64
combined.groupby('boro')['sat_score'].agg(np.min).sort_values(ascending=False)
boro
Staten Island    1195.0
Manhattan        1014.0
Queens            951.0
Bronx             934.0
Brooklyn          887.0
Name: sat_score, dtype: float64
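The three groupbys above can be collapsed into a single call (a sketch) by passing a list of aggregations:
# Mean, max, and min SAT score per borough in one table:
print(combined.groupby('boro')['sat_score'].agg(['mean', 'max', 'min']))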
We see that the schools with the best average SAT scores are located in Staten Island (average ~1382.5), while schools in the Bronx score the worst on average (~1157.6).
Let's condense our findings in one place:

**Survey responses:**
- Schools whose students responded to the survey at higher rates seemed to have performed better on the SATs.
- The safety and respect total score (from teachers, students, and parents) correlates positively with SAT scores, indicating that students need a safe and respectful environment to study properly.
- The overall academic expectations score correlates only weakly with SAT scores, whereas the student score `aca_s_11` correlates more strongly, showing how much students' own expectations track their results.

**Safety:** Student safety ratings correlate positively with SAT scores. Brooklyn has the lowest average safety score (~6.37) and Manhattan the highest (~6.83).

**Racial disparity:** Schools with higher percentages of White and Asian students score higher, while schools with higher percentages of Black and Hispanic students score lower. The schools with the highest Hispanic percentages primarily serve recent immigrants, while the low-Hispanic, high-SAT schools are selective Specialized Schools.

**Gender inequality:** The female percentage correlates positively and the male percentage negatively with SAT scores, though both correlations are weak.

**Class size:** Average class size correlates positively with SAT scores; larger schools and classes tend to do better on the SAT.

**By neighbourhood:** Staten Island has the highest average SAT score (~1382.5) and the Bronx the lowest (~1157.6).