In the United States, the SAT is a standardised test designed to assess a student's readiness for college education by measuring aptitude in literacy, numeracy, and writing. Scores on the SAT range from a minimum of 400 points to a maximum of 1600 points* by combining the results of two 200-800 point sections.
The SAT was designed to equalise academic opportunity across the gamut of socioeconomic status in the USA; however, it is not difficult to find literature questioning the fairness of the SAT when viewed through a demographic lens.[1][2][3][4]
This project will attempt to answer: are SAT tests a fair measure of academic competence for all students?
New York City is an ethnically and culturally diverse metropolis with a large immigrant population. This makes it an ideal location to examine the possible influence of demographic factors on the SAT.
In this project, we will take an in-depth look at the 2012 SAT results dataset published on the NYC Open Data website, the most recent available year of results at the time of writing. We will attempt to combine this dataset with several other datasets containing demographic information, which will allow us to explore the relationships between SAT scores and demographic factors.
The following is a list of all the individual datasets we will be using in this project:
* Between the years 2005-2016, the maximum score was 2400
To make it easier to manage and reference our datasets, we'll store them in a dictionary called `data`.
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d
The survey files use a different format and encoding, so we need to import them separately. They also contain a vast number of columns, most of which we do not need, so we will filter the data down to what's necessary for our analysis.
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
# combine both surveys into a single dataframe
survey = pd.concat([all_survey, d75_survey], axis=0)
# see next cell for details on 'DBN'
survey["DBN"] = survey["dbn"]
# select columns necessary for our analysis and add to our data dictionary
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
Each school in New York City has a unique code called a DBN, or district borough number. We'll use this to combine our datasets. Our datasets either have a DBN column already (like the surveys we read in above), or the information needed to create one.
In the case of our `class_size` dataset, we need to combine the data from the `CSD` column and the `SCHOOL CODE` column to create the DBN code. One caveat: any school with a single-digit `CSD` value will need padding with an extra `0` to make a valid DBN code, so we'll create a function to handle this.
# standardise DBN column in hs_directory
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
# create DBN column in class_size dataset #
# function adds a leading 0 to any single-digit value
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
# apply function and create new DBN column
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
Our `sat_results` dataframe has three columns that contain SAT score data:

* `SAT Math Avg. Score`
* `SAT Critical Reading Avg. Score`
* `SAT Writing Avg. Score`

To make it easier to correlate this data with demographic factors, we'll sum the three columns to create a total score so that we can work with a single score, rather than three different ones. Before we can do this, we'll need to convert the data in these columns from the object (string) datatype to a numeric one.

The `hs_directory` dataframe contains latitude and longitude coordinates for each school; these columns also need cleaning and converting to a numeric datatype.

Lastly, there are several columns in the `ap_2010` dataset that will need converting to numeric datatypes so we can use them for analysis.
## convert SAT scores to numeric and create total ##
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    # set errors parameter to 'coerce' to treat invalid strings as missing values
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
## clean latitude and longitude data and convert to numeric datatype ##
def find_lat(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
## convert AP scores to numeric ##
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
Recall that we plan on using the `DBN` column to join each individual dataframe into one combined dataframe. For that to work, the `DBN` column in each dataset must contain unique values only. We know that the `class_size`, `demographics`, and `graduation` datasets contain duplicate values in their `DBN` columns, so we'll need to handle this before any merging can take place.
Each dataset will require a different strategy to condense down to unique DBN values.
## condense `class_size` ##
# select only grades that take the SAT and the standard 'program type' for all NYC public schools
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
# group class_size by DBN column and take average of each group
# this will leave us with unique DBN values
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
## condense `demographics` ##
# select year that matches our SAT result data
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
## condense `graduation` ##
# select most recent available year in `Cohort` and select the total cohort from 'Demographic'
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
Now it's time to combine our datasets. Let's run through our merging strategy:

* The `sat_results` dataset is the basis of this project, so we'll want to preserve as many of its rows as possible; therefore we'll start our joins from this dataset
* `ap_2010` and `graduation` contain large numbers of missing values in their respective DBN columns, so a left join is most appropriate to avoid losing a significant number of rows
* As we will be left with some missing data, we can attempt to fill this with an average value
combined = data["sat_results"]
# create left joins
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
# create inner joins
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
combined = combined.merge(data[m], on="DBN", how="inner")
# impute NaN values with column average
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
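To see why the imputation happens in two steps, here is a quick sanity check on a toy frame (hypothetical values, standing in for `combined`). Note that recent pandas versions need `numeric_only=True` when taking the mean of a frame that mixes text and numbers:

```python
import numpy as np
import pandas as pd

# toy frame standing in for `combined` (values are hypothetical)
toy = pd.DataFrame({"sat_score": [1200.0, np.nan, 1500.0],
                    "name": ["A", "B", None]})

# step 1: numeric NaNs take the column mean
toy = toy.fillna(toy.mean(numeric_only=True))
# step 2: any remaining (non-numeric) NaNs become 0
toy = toy.fillna(0)

print(toy["sat_score"].tolist())      # [1200.0, 1350.0, 1500.0]
print(int(toy.isnull().sum().sum()))  # 0
```

The first pass only fills numeric columns; the second catches everything else, so no missing values survive.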
Mapping the statistics out at the school district level might be an interesting way to analyse them. Adding a column to the dataset that specifies the school district will help us accomplish this.

The school district is just the first two characters of the DBN, so we can apply a function over the `DBN` column of `combined` that pulls out those two characters.
def get_first_two_chars(dbn):
    return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
combined.shape
(363, 160)
We now have a very large combined dataframe with 160 columns. To make it easier to refer back to all the columns, we'll print out the column names from each individual dataset.
# loop over data dictionary and print out column names for each dataset
for key in data:
    print(key)
    print(75*"-")
    print(data[key].columns)
    print("\n")
ap_2010 --------------------------------------------------------------------------- Index(['DBN', 'SchoolName', 'AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5'], dtype='object') class_size --------------------------------------------------------------------------- Index(['DBN', 'CSD', 'NUMBER OF STUDENTS / SEATS FILLED', 'NUMBER OF SECTIONS', 'AVERAGE CLASS SIZE', 'SIZE OF SMALLEST CLASS', 'SIZE OF LARGEST CLASS', 'SCHOOLWIDE PUPIL-TEACHER RATIO'], dtype='object') demographics --------------------------------------------------------------------------- Index(['DBN', 'Name', 'schoolyear', 'fl_percent', 'frl_percent', 'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4', 'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent', 'ctt_num', 'selfcontained_num', 'asian_num', 'asian_per', 'black_num', 'black_per', 'hispanic_num', 'hispanic_per', 'white_num', 'white_per', 'male_num', 'male_per', 'female_num', 'female_per'], dtype='object') graduation --------------------------------------------------------------------------- Index(['Demographic', 'DBN', 'School Name', 'Cohort', 'Total Cohort', 'Total Grads - n', 'Total Grads - % of cohort', 'Total Regents - n', 'Total Regents - % of cohort', 'Total Regents - % of grads', 'Advanced Regents - n', 'Advanced Regents - % of cohort', 'Advanced Regents - % of grads', 'Regents w/o Advanced - n', 'Regents w/o Advanced - % of cohort', 'Regents w/o Advanced - % of grads', 'Local - n', 'Local - % of cohort', 'Local - % of grads', 'Still Enrolled - n', 'Still Enrolled - % of cohort', 'Dropped Out - n', 'Dropped Out - % of cohort'], dtype='object') hs_directory --------------------------------------------------------------------------- Index(['dbn', 'school_name', 'boro', 'building_code', 'phone_number', 'fax_number', 'grade_span_min', 'grade_span_max', 'expgrade_span_min', 'expgrade_span_max', 'bus', 
'subway', 'primary_address_line_1', 'city', 'state_code', 'zip', 'website', 'total_students', 'campus_name', 'school_type', 'overview_paragraph', 'program_highlights', 'language_classes', 'advancedplacement_courses', 'online_ap_courses', 'online_language_courses', 'extracurricular_activities', 'psal_sports_boys', 'psal_sports_girls', 'psal_sports_coed', 'school_sports', 'partner_cbo', 'partner_hospital', 'partner_highered', 'partner_cultural', 'partner_nonprofit', 'partner_corporate', 'partner_financial', 'partner_other', 'addtl_info1', 'addtl_info2', 'start_time', 'end_time', 'se_services', 'ell_programs', 'school_accessibility_description', 'number_programs', 'priority01', 'priority02', 'priority03', 'priority04', 'priority05', 'priority06', 'priority07', 'priority08', 'priority09', 'priority10', 'Location 1', 'DBN', 'lat', 'lon'], dtype='object') sat_results --------------------------------------------------------------------------- Index(['DBN', 'SCHOOL NAME', 'Num of SAT Test Takers', 'SAT Critical Reading Avg. Score', 'SAT Math Avg. Score', 'SAT Writing Avg. Score', 'sat_score'], dtype='object') survey --------------------------------------------------------------------------- Index(['DBN', 'rr_s', 'rr_t', 'rr_p', 'N_s', 'N_t', 'N_p', 'saf_p_11', 'com_p_11', 'eng_p_11', 'aca_p_11', 'saf_t_11', 'com_t_11', 'eng_t_11', 'aca_t_11', 'saf_s_11', 'com_s_11', 'eng_s_11', 'aca_s_11', 'saf_tot_11', 'com_tot_11', 'eng_tot_11', 'aca_tot_11'], dtype='object')
# get correlations for all columns in dataframe
correlations = combined.corr()
# filter to correlations for sat_score col only
correlations = correlations["sat_score"]
print(correlations[:34])
print(correlations[34:])
SAT Critical Reading Avg. Score 0.986820 SAT Math Avg. Score 0.972643 SAT Writing Avg. Score 0.987771 sat_score 1.000000 AP Test Takers 0.523140 Total Exams Taken 0.514333 Number of Exams with scores 3 4 or 5 0.463245 Total Cohort 0.325144 CSD 0.042948 NUMBER OF STUDENTS / SEATS FILLED 0.394626 NUMBER OF SECTIONS 0.362673 AVERAGE CLASS SIZE 0.381014 SIZE OF SMALLEST CLASS 0.249949 SIZE OF LARGEST CLASS 0.314434 SCHOOLWIDE PUPIL-TEACHER RATIO NaN schoolyear NaN fl_percent NaN frl_percent -0.722225 total_enrollment 0.367857 ell_num -0.153778 ell_percent -0.398750 sped_num 0.034933 sped_percent -0.448170 asian_num 0.475445 asian_per 0.570730 black_num 0.027979 black_per -0.284139 hispanic_num 0.025744 hispanic_per -0.396985 white_num 0.449559 white_per 0.620718 male_num 0.325520 male_per -0.112062 female_num 0.388631 Name: sat_score, dtype: float64 female_per 0.112108 rr_s 0.232199 rr_t -0.023386 rr_p 0.047925 N_s 0.423463 N_t 0.291463 N_p 0.421530 saf_p_11 0.122913 com_p_11 -0.115073 eng_p_11 0.020254 aca_p_11 0.035155 saf_t_11 0.313810 com_t_11 0.082419 eng_t_11 0.036906 aca_t_11 0.132348 saf_s_11 0.337639 com_s_11 0.187370 eng_s_11 0.213822 aca_s_11 0.339435 saf_tot_11 0.318753 com_tot_11 0.077310 eng_tot_11 0.100102 aca_tot_11 0.190966 grade_span_max NaN expgrade_span_max NaN zip -0.063977 total_students 0.407827 number_programs 0.117012 priority08 NaN priority09 NaN priority10 NaN lat -0.121029 lon -0.132222 Name: sat_score, dtype: float64
Typically, r values above `0.25` or below `-0.25` are enough to qualify a correlation as interesting. It looks like there are a number of columns we could explore in more depth; we'll start by investigating the fields brought in from the survey data.
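That rule of thumb can be applied mechanically by filtering the correlation series on absolute value. A small sketch with hypothetical r values standing in for the real `correlations` series:

```python
import pandas as pd

# toy stand-in for the `correlations` Series (values are hypothetical)
corrs = pd.Series({"frl_percent": -0.72, "white_per": 0.62,
                   "rr_p": 0.05, "lat": -0.12})

# keep only fields whose absolute r exceeds the 0.25 rule of thumb
interesting = corrs[corrs.abs() > 0.25]
print(sorted(interesting.index))  # ['frl_percent', 'white_per']
```

Running the same filter on the full series would shortlist the fields worth a closer look.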
Every year, all parents, all teachers, and students in grades 6-12 take the NYC School Survey, which helps to provide an insight into a school's learning environment. The questions assess the community's opinions on academic expectations, communication, engagement, and safety and respect.
Let's take a look at the data dictionary provided with the survey files so we understand what each field name means.
The table below lists all the field names in the survey along with a description of what they mean.
| Field name | Field description |
|------------|-------------------|
| rr_s | Student Response Rate |
| rr_t | Teacher Response Rate |
| rr_p | Parent Response Rate |
| N_s | Number of student respondents |
| N_t | Number of teacher respondents |
| N_p | Number of parent respondents |
| nr_s | Number of eligible students |
| nr_t | Number of eligible teachers |
| nr_p | Number of eligible parents |
| saf_p_10 | Safety and Respect score based on parent responses |
| com_p_10 | Communication score based on parent responses |
| eng_p_10 | Engagement score based on parent responses |
| aca_p_10 | Academic expectations score based on parent responses |
| saf_t_10 | Safety and Respect score based on teacher responses |
| com_t_10 | Communication score based on teacher responses |
| eng_t_10 | Engagement score based on teacher responses |
| aca_t_10 | Academic expectations score based on teacher responses |
| saf_s_10 | Safety and Respect score based on student responses |
| com_s_10 | Communication score based on student responses |
| eng_s_10 | Engagement score based on student responses |
| aca_s_10 | Academic expectations score based on student responses |
| saf_tot_10 | Safety and Respect total score |
| com_tot_10 | Communication total score |
| eng_tot_10 | Engagement total score |
| aca_tot_10 | Academic Expectations total score |
Note that:
Now let's plot out the correlation between each survey field and SAT score.
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
# filter the `correlations` series to only the survey field rows
survey_corr = correlations[survey_fields]
# assign different colours to positive and negative r values
color = (survey_corr > 0).apply(lambda x: 'tab:blue' if x else 'tab:red')
# plot a bar chart to identify correlations between these fields and the sat_score
survey_corr.plot.barh(figsize=(6,9),
color=color,
title="Correlations between survey results and SAT score")
plt.show()
In the bar chart above we can spot several strong positive correlations between the survey responses and `sat_score`:

* Number of respondents (`N_p`, `N_t`, `N_s`) - there is a correlation between the number of respondents to the survey at the school and SAT performance. There's no immediately obvious reason why this might be the case, though perhaps participating in the survey suggests an increased level of engagement with the school
* Safety and respect (`saf_s_11`, `saf_t_11`) - the scores given by students and teachers had the strongest correlation; it seems reasonable to expect that a safe learning environment will facilitate academic performance
* Academic expectations (`aca_s_11`) - students who expected themselves to perform well in academics generally did. Interestingly, there is little correlation between the teachers' or parents' expectations of students' academic performance and SAT score.

Let's dive a little deeper into the safety and respect scores given by students, `saf_s_11`. We can try to identify which schools have low scores for this metric.
# create scatter plot of saf_s_11 vs. sat_score
combined.plot(x="saf_s_11", y="sat_score", kind="scatter", title="saf_s_11 vs. sat_score")
plt.show()
This plot doesn't reveal a particularly strong correlation between safety score and SAT score; arguably there are too many datapoints, making the graph difficult to read. It would probably be a good idea to aggregate schools together in some manner. One aggregation we could try is grouping the schools by district.
combined["sat_score"].mean()
1223.438805970149
# group the data by school district
grouped_district = combined.groupby("school_dist").agg(np.mean)
# generate the same scatter plot again this time using the grouped data
grouped_district.plot(x="saf_s_11", y="sat_score", kind="scatter", title="saf_s_11 vs. sat_score grouped by school district")
plt.show()
Although the correlation is not strong, we can see that while districts with a high average safety score perform both well and poorly on SAT scores, districts with a low average safety score (less than `6.4`) only perform poorly.
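That `6.4` cut-off can be turned into a simple filter to list the low-safety districts. A sketch with hypothetical district averages standing in for `grouped_district`:

```python
import pandas as pd

# toy stand-in for `grouped_district` (district averages are hypothetical)
toy = pd.DataFrame({"saf_s_11": [6.1, 6.6, 7.2],
                    "sat_score": [1100.0, 1300.0, 1250.0]},
                   index=["05", "02", "31"])

# districts whose average student safety score falls below the 6.4 cut-off
low_safety = toy[toy["saf_s_11"] < 6.4]
print(low_safety.index.tolist())  # ['05']
```

Applying the same mask to the real grouped frame would name the districts driving the lower-left of the scatter plot.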
It could be interesting to compare how teachers, parents, and students responded to the survey questions. Let's recap on the four different scores we can analyse:
Let's create a boxplot for all four survey responses split by each cohort. This will make it easy to compare the distribution of the data between cohorts.
# dictionary containing field description as key and field name as value
fields = {"Academic expecations score": "aca_",
"Communication score": "com_",
"Engagement score": "eng_",
"Safety and Respect score": "saf_"}
# generate box plots using dictionary
fig = plt.figure(figsize=(10,15))
for i, key in enumerate(fields):
    ax = fig.add_subplot(2,2,i+1)
    combined.boxplot([fields[key] + "p_11", fields[key] + "s_11", fields[key] + "t_11"],
                     grid=False,
                     ax=ax,  # draw each box plot on its own subplot
                     )
    ax.set_title(key)
plt.show()
Observations
There are several columns in our dataset that indicate the percentage of each race at a given school:
* `asian_per`
* `black_per`
* `hispanic_per`
* `white_per`
We can plot out the correlations between these columns and `sat_score` to determine whether there are any racial differences in SAT performance.
# get correlations between race columns and sat_score
race_corr = correlations[["asian_per", "black_per", "hispanic_per", "white_per"]]
# assign different colours to positive and negative r values
color = (race_corr > 0).apply(lambda x: 'tab:blue' if x else 'tab:red')
# plot bar chart
race_corr.plot(kind="bar",
rot=45,
color=color,
title="Correlations between race and SAT score")
plt.show()
There is a notable positive correlation between the percentage of both white and asian students at a school and SAT performance. Meanwhile, a negative correlation exists between SAT performance and the percentage of black students and, in particular, hispanic students at a school.
What factors could be contributing to this racial disparity in SAT performance? Let's explore the negative correlation between `hispanic_per` and `sat_score` further.

We'll create a scatter plot of `hispanic_per` vs. `sat_score` to try to identify schools with low SAT scores and high values for hispanic percentage.
combined.plot(x="hispanic_per", y="sat_score", kind="scatter", title="hispanic_per vs. sat_score")
plt.show()
Here we can see that no schools with a percentage of hispanic students greater than ~40% have an average SAT performance of roughly 1500 and above. There is a cluster of schools with a `hispanic_per` of approximately 100%, all of which perform poorly on the SAT.
Earlier we found that the safety and respect score from our survey data was correlated with SAT score. Do hispanic students typically feel less safe at school, resulting in poor performance on the SAT?
combined.plot(x="hispanic_per", y="saf_s_11", kind="scatter", title="hispanic_per vs. saf_s_11")
plt.show()
There is no correlation between `saf_s_11` and `hispanic_per`.

Let's take a closer look at any schools with a `hispanic_per` greater than 95%.
high_his = combined.loc[combined["hispanic_per"] > 95, ["hispanic_per", "SCHOOL NAME"]]
high_his.sort_values("hispanic_per", ascending=False)
|     | hispanic_per | SCHOOL NAME |
|-----|--------------|-------------|
| 286 | 100.0 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL |
| 44 | 99.8 | MANHATTAN BRIDGES HIGH SCHOOL |
| 89 | 99.8 | GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M... |
| 141 | 99.8 | INTERNATIONAL SCHOOL FOR LIBERAL ARTS |
| 176 | 99.8 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE |
| 253 | 99.8 | MULTICULTURAL HIGH SCHOOL |
| 125 | 99.4 | ACADEMY FOR LANGUAGE AND TECHNOLOGY |
| 82 | 96.7 | WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL |
A cursory search indicates that these schools are typically geared towards educating migrant populations who do not natively speak English. It would seem logical that students who are trying to learn the English language would typically perform worse on the SAT than native speakers, given that two-thirds of the SAT is based on English language comprehension. These schools will have to allocate a significant amount of their budget to teaching English; funds that could otherwise be spent on high-quality educational materials that help prepare students for the SAT.
Let's check the relationship between the percentage of English language learners at a school, `ell_percent`, and SAT score.
combined.plot.scatter(x="ell_percent", y="sat_score")
plt.show()
There is a distinct cluster of schools with high `ell_percent` and poor SAT performance. Let's see if it includes any of the schools we identified above.
# identify schools with both high ELL percent and hispanic percent
high_his_ell = combined.loc[(combined["ell_percent"] > 75) & (combined["hispanic_per"] > 95), "SCHOOL NAME"]
# merge with dataframe containing only high hispanic percent
high_his.merge(high_his_ell, how="outer", indicator=True) # set indicator to flag source of each row in merge
|   | hispanic_per | SCHOOL NAME | _merge |
|---|--------------|-------------|--------|
| 0 | 99.8 | MANHATTAN BRIDGES HIGH SCHOOL | left_only |
| 1 | 96.7 | WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL | left_only |
| 2 | 99.8 | GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M... | both |
| 3 | 99.4 | ACADEMY FOR LANGUAGE AND TECHNOLOGY | both |
| 4 | 99.8 | INTERNATIONAL SCHOOL FOR LIBERAL ARTS | both |
| 5 | 99.8 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE | both |
| 6 | 99.8 | MULTICULTURAL HIGH SCHOOL | both |
| 7 | 100.0 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL | both |
Six of the eight schools we identified as having a very high percentage of hispanic students appear in our list of schools with a high percentage of English language learners.
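Rather than counting rows by eye, the `_merge` indicator can be tallied directly. A sketch reproducing just the indicator column from the merge result above:

```python
import pandas as pd

# toy frame reproducing the `_merge` indicator column from the table above
merged = pd.DataFrame({"_merge": ["left_only", "left_only"] + ["both"] * 6})

# tallying the indicator confirms the six-of-eight overlap
print(merged["_merge"].value_counts().to_dict())  # {'both': 6, 'left_only': 2}
```

On the real merge result, `value_counts` gives the same six-of-eight split at a glance.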
Let's explore and research any schools with a `hispanic_per` less than 10% and an average SAT score greater than 1800.
low_his_high_SAT = combined.loc[(combined["hispanic_per"] < 10) & (combined["sat_score"] > 1800), ["hispanic_per", "SCHOOL NAME"]]
low_his_high_SAT.sort_values("hispanic_per")
|     | hispanic_per | SCHOOL NAME |
|-----|--------------|-------------|
| 37 | 2.4 | STUYVESANT HIGH SCHOOL |
| 356 | 5.3 | STATEN ISLAND TECHNICAL HIGH SCHOOL |
| 151 | 7.2 | BRONX HIGH SCHOOL OF SCIENCE |
| 187 | 7.9 | BROOKLYN TECHNICAL HIGH SCHOOL |
| 327 | 7.9 | QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO... |
Each of these schools belongs to a group of selective schools in New York City that require admission via an examination: the Specialized High Schools Admissions Test, also known as the SHSAT.

The SHSAT is a controversial exam. This New York Times article explores how the elite public schools that use this admissions test have seen their populations of black and hispanic students decrease significantly, likely due to an increased focus on preparatory work prior to the exam, where historically students did very little preparation.
We have two columns in our dataset that indicate the percentage of each gender at a school:

* `male_per`
* `female_per`

Let's plot the correlations between each gender percentage and `sat_score`.
# get correlations between gender columns and sat_score
gender_corr = correlations[["male_per", "female_per"]]
# assign different colours to positive and negative r values
color = (gender_corr > 0).apply(lambda x: 'tab:blue' if x else 'tab:red')
# plot bar chart
gender_corr.plot(kind="bar",
rot=45,
color=color,
title="Correlations between gender and SAT score")
plt.show()
We can see that the percentage of male students at a school correlates negatively with SAT score, while the inverse is true for the percentage of female students. However, neither correlation is strong.
Let's take a closer look at the correlation between `female_per` and `sat_score`.
combined.plot(x="female_per", y="sat_score", kind="scatter", title="female_per vs. sat_score")
plt.show()
As we could expect from the weak correlation identified in the bar plot, there isn't really any observable correlation in this scatter plot. However, there does appear to be a cluster of schools with a high female percentage and high SAT score; let's try to identify these schools.
mask = (combined["female_per"] > 60) & (combined["sat_score"] > 1700)
combined.loc[mask, ["female_per", "SCHOOL NAME"]]
|     | female_per | SCHOOL NAME |
|-----|------------|-------------|
| 5 | 68.7 | BARD HIGH SCHOOL EARLY COLLEGE |
| 26 | 67.5 | ELEANOR ROOSEVELT HIGH SCHOOL |
| 60 | 61.0 | BEACON HIGH SCHOOL |
| 61 | 73.6 | FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A... |
| 302 | 71.1 | TOWNSEND HARRIS HIGH SCHOOL |
Research suggests that the curriculum at these schools is generally orientated around the liberal arts, and they may be considered an alternative for students seeking something less STEM-focused than the curriculum typically offered by the elite specialised schools we investigated earlier. Each school requires an admissions test for entry, which explains the above-average SAT performance.
High school students can take optional Advanced Placement (AP) exams to earn university credit. There are AP exams for many different subjects, allowing students to demonstrate an aptitude for a particular subject. AP exams are graded on a 1 to 5 scale; a grade of 3 or higher is considered a passing score.
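Given the 3-or-higher passing threshold, a per-school pass rate could be derived from the `ap_2010` columns. A sketch with hypothetical exam counts, using the real column names:

```python
import pandas as pd

# toy AP data (values are hypothetical; column names match ap_2010)
ap = pd.DataFrame({"Total Exams Taken": [200.0, 50.0],
                   "Number of Exams with scores 3 4 or 5": [150.0, 10.0]})

# pass rate = exams scoring 3 or higher divided by total exams taken
ap["pass_rate"] = (ap["Number of Exams with scores 3 4 or 5"]
                   / ap["Total Exams Taken"])
print(ap["pass_rate"].tolist())  # [0.75, 0.2]
```

This isn't computed in the analysis below, but it is one way the AP columns we converted to numeric earlier could be put to use.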
AP exams are more commonly taken by students who attend more affluent and academically challenging institutions; conversely, they are much rarer in schools that lack funding or academic rigour. If we investigate the relationship between the percentage of students at a school who take AP exams and SAT score, we can assess the impact that these exams might have on SAT performance.
# calculate the percentage of students who took the AP test
combined["ap_per"] = combined["AP Test Takers "] / combined["total_enrollment"]
# generate scatter plot between AP score and SAT score
combined.plot.scatter(x="ap_per", y="sat_score", title="ap_per vs. sat_score")
<matplotlib.axes._subplots.AxesSubplot at 0x1b8b843be08>
The plot above shows that there is a positive correlation between the percentage of students who take AP exams and SAT score; however, it is fairly weak. The relationship is most likely due to differences in the funding that each school receives, which can aid or inhibit access to AP exams. A wealthier school with greater resources and facilities at its disposal will be able to put more funding towards preparing students for non-mandatory exams than a poorer school.
Let's turn our attention to identifying a possible correlation between the size of a class and SAT score. We might expect that a larger class size could correlate negatively with SAT score, as the teacher's time would be divided among more students.
We can use the `AVERAGE CLASS SIZE` column to get the information we need for each school.
combined.plot.scatter(x="AVERAGE CLASS SIZE",
y="sat_score",
title="Correlation between class size and SAT score")
plt.show()
Unexpectedly, there seems to be a weak positive correlation between the average class size at a school and SAT performance. It's likely that there are other factors influencing this relationship. We know from earlier exploration that there was a positive correlation between the total number of students enrolled at the school and SAT performance, which was in turn explained by a cluster of English-second-language schools with low enrollment and poor SAT performance. Perhaps the same root cause could be behind the relationship we see here.
# Explore cluster of schools with low average class size and low SAT score #
mask = (combined["AVERAGE CLASS SIZE"] < 18) & (combined["sat_score"] < 1400)
combined.loc[mask, ["SCHOOL NAME", "AVERAGE CLASS SIZE", "sat_score"]]
|     | SCHOOL NAME | AVERAGE CLASS SIZE | sat_score |
|-----|-------------|--------------------|-----------|
| 6 | 47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECO... | 16.072727 | 1182.000000 |
| 8 | ESSEX STREET ACADEMY | 14.900000 | 1156.000000 |
| 20 | THE HIGH SCHOOL FOR LANGUAGE AND DIPLOMACY | 17.854545 | 1223.438806 |
| 56 | THE URBAN ASSEMBLY SCHOOL FOR GREEN CAREERS | 10.000000 | 1223.438806 |
| 83 | HIGH SCHOOL FOR EXCELLENCE AND INNOVATION | 17.000000 | 1223.438806 |
| 91 | INTERNATIONAL COMMUNITY HIGH SCHOOL | 16.200000 | 945.000000 |
| 126 | BRONX INTERNATIONAL HIGH SCHOOL | 15.611111 | 965.000000 |
| 176 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE | 15.361538 | 970.000000 |
| 208 | FRANCES PERKINS ACADEMY | 17.337500 | 1122.000000 |
| 264 | THE URBAN ASSEMBLY SCHOOL FOR CRIMINAL JUSTICE | 12.575000 | 1223.438806 |
| 265 | INTERNATIONAL HIGH SCHOOL AT LAFAYETTE | 17.206667 | 1026.000000 |
| 291 | INTERNATIONAL HIGH SCHOOL AT LAGUARDIA COMMUNI... | 17.007692 | 1064.000000 |
| 361 | ALL CITY LEADERSHIP SECONDARY SCHOOL | 17.793750 | 1315.000000 |
Again we can see a similar set of high schools geared towards educating migrant populations whose first language is not English. This cluster is likely responsible for the positive r value of `0.38`.
Extracurricular activities are often viewed as an important facet of a child's education, and it is generally perceived that taking part in them can help boost academic prospects.
combined["extra_curricular_count"] = (combined["extracurricular_activities"]
    .str.split(r"[,;]")  # activities are separated by , or ;
    .apply(len)          # count the number of activities
)
print(combined["extra_curricular_count"].corr(combined["sat_score"]))
combined.plot.scatter(x="extra_curricular_count", y="sat_score")
plt.show()
0.41223556699524444
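One caveat with the split-and-count approach above: if a school has no activities listed, `str.split` propagates NaN and `apply(len)` would raise a TypeError. A defensive variant (a minimal sketch on made-up data) uses `str.len()` with `fillna` instead:

```python
import pandas as pd

# Made-up sample: activity lists separated by , or ; with one missing value
s = pd.Series(["Chess Club, Debate; Soccer", "Band", None])

# str.len() keeps NaN as NaN, so missing counts can be filled with 0
counts = s.str.split(r"[,;]").str.len().fillna(0).astype(int)
print(counts.tolist())  # [3, 1, 0]
```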
low_sat_extcur = combined.nsmallest(10, ['sat_score'])["extra_curricular_count"].mean()
high_sat_extcur = combined.nlargest(10, ['sat_score'])["extra_curricular_count"].mean()
print("Average number of extracurricular activities offered")
print("Bottom 10 schools by SAT performance: ", low_sat_extcur)
print("Top 10 schools by SAT performance: ", high_sat_extcur)
Average number of extracurricular activities offered
Bottom 10 schools by SAT performance:  12.4
Top 10 schools by SAT performance:  33.8
The top ten schools by SAT performance typically offer nearly three times as many extracurricular activities as the bottom ten schools.
One of the strongest correlations with `sat_score` we could see from our exploratory analysis earlier was `frl_percent`: the percentage of students eligible for free or reduced-cost lunch based on household income.
combined.plot.scatter(x="frl_percent", y="sat_score")
plt.show()
There is a clear negative correlation between the percentage of students eligible for free/reduced-cost lunches and SAT performance. This gives us a strong indication that socioeconomic status influences quality of education.
It is difficult for schools in low-income areas to provide students with the same standard of education that schools in wealthier neighbourhoods deliver. Despite the NYC Fair Student Funding (FSF) budget, which aims to allocate funding and resources to the schools that need them most, disparities clearly remain. This could be attributed to a number of things:
Let's explore `frl_percent` by race.
# create list of columns with race data
race = ["asian_per", "black_per", "hispanic_per", "white_per"]

# generate scatter plots in a 2x2 grid
fig = plt.figure(figsize=(10, 10))
fig.subplots_adjust(hspace=.3)  # allow space for xlabels
for i in range(4):
    ax = fig.add_subplot(2, 2, i + 1)
    # call matplotlib directly rather than using the pandas plot() method wrapper
    ax.scatter(x=combined["frl_percent"], y=combined[race[i]])
    ax.set_title("frl_percent vs. " + race[i])
    ax.set_xlabel("frl_percent")
    ax.set_ylabel(race[i])
    ax.set_ylim([0, 100])  # force y-axis limits
plt.show()
We can see that:
This would suggest that black and Hispanic students typically come from poorer backgrounds than white students.
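The visual impression from the scatter plots can be quantified with `corrwith`, which correlates each race column with `frl_percent` in one call; on the real data this would be `combined[race].corrwith(combined["frl_percent"])`. The sketch below runs on synthetic data generated to mimic the observed pattern, purely to illustrate the technique.

```python
import numpy as np
import pandas as pd

# Synthetic data mimicking the observed pattern: black_per rises with
# frl_percent while white_per falls. Numbers are illustrative only.
rng = np.random.default_rng(0)
frl = rng.uniform(0, 100, 200)
df = pd.DataFrame({
    "frl_percent": frl,
    "black_per": np.clip(0.5 * frl + rng.normal(0, 10, 200), 0, 100),
    "white_per": np.clip(50 - 0.4 * frl + rng.normal(0, 10, 200), 0, 100),
})

# correlate every listed column with frl_percent at once
print(df[["black_per", "white_per"]].corrwith(df["frl_percent"]).round(2))
```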
The table below shows household income data for each of the five boroughs, along with the percentage of the population living in poverty.
| Area | Median Household Income (USD) | Mean Household Income (USD) | Percentage in Poverty |
|---|---|---|---|
| The Bronx | 34,156 | 46,298 | 27.1% |
| Brooklyn | 41,406 | 60,020 | 21.9% |
| Manhattan | 64,217 | 121,549 | 17.6% |
| Queens | 53,171 | 67,027 | 12.0% |
| Staten Island | 66,985 | 81,498 | 9.8% |
Data from Wikipedia - Demographics of New York City
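One quick way to quantify income inequality from the table above is the mean-to-median ratio: the further the mean sits above the median, the more right-skewed (unequal) the income distribution. A sketch using the figures from the table:

```python
import pandas as pd

# Income figures copied from the table above (USD)
income = pd.DataFrame({
    "median": [34156, 41406, 64217, 53171, 66985],
    "mean": [46298, 60020, 121549, 67027, 81498],
}, index=["The Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"])

# a mean well above the median suggests a right-skewed distribution
income["mean_over_median"] = (income["mean"] / income["median"]).round(2)
print(income["mean_over_median"])
```

Manhattan's ratio (about 1.89) is by far the largest of the five boroughs, consistent with high income inequality there.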
We can see that:
Earlier in the project we investigated a survey response from students on how safe they felt at school. Let's try aggregating this data by borough.
# aggregate data by borough and summarise with mean
boro_safety = combined.groupby("boro").agg(np.mean)["saf_s_11"]
boro_safety
boro
Bronx            6.606577
Brooklyn         6.370755
Manhattan        6.831370
Queens           6.721875
Staten Island    6.530000
Name: saf_s_11, dtype: float64
The average safety scores between boroughs are very similar; however, we can see that Brooklyn has the lowest average safety score while Manhattan has the highest. Perhaps this could be linked to the education budget made available to each borough, or possibly to the affluence of the neighbourhoods the schools are situated in.
We'll continue our exploration of racial demographics and investigate the proportion of each race by borough.
## plot racial makeup of schools by borough ##
# get average percent of each race grouped by borough
boro_race = combined[["asian_per", "black_per", "hispanic_per", "white_per", "boro"]]
boro_race = boro_race.groupby("boro").agg(np.mean)
# plot bar chart
boro_race.plot.bar(figsize=(10,7), rot=0, title="Racial makeup of schools by borough")
plt.ylabel("Average %")
plt.xlabel("")
plt.show()
Observations
Earlier we discovered a strong negative correlation between the percentage of students eligible for free/reduced-cost lunches and SAT score, indicating that students from poor backgrounds face an academic disadvantage. Let's see how this relates to geographic location by exploring `frl_percent` in each borough.
# generate boxplot of free/reduced-cost lunches grouped by borough
combined.boxplot("frl_percent",
                 by="boro",
                 grid=False,
                 figsize=(10, 7)
                 )
plt.title("Free/reduced cost lunch eligibility by borough")
plt.suptitle("") # remove default boxplot title
plt.xlabel("")
plt.ylabel("% eligible")
plt.show()
Just as we would expect, the boroughs with the highest levels of poverty, Brooklyn and The Bronx, have the most students eligible for free/reduced-cost lunches. Meanwhile, one of the wealthiest boroughs, Staten Island, has the lowest level.

Interestingly, the wealthiest borough, Manhattan, has a median `frl_percent` on par with that of Brooklyn, yet a much wider distribution and a high number of low-percentage outliers. This wide distribution could be a symptom of significant income inequality.
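The "wider distribution" observation can be made concrete with the interquartile range (IQR) per borough. The sketch below uses two tiny made-up groups; with the real data you would group `combined["frl_percent"]` by `"boro"` in the same way.

```python
import pandas as pd

# Made-up example: group A is tightly clustered, group B widely spread
df = pd.DataFrame({
    "boro": ["A"] * 5 + ["B"] * 5,
    "frl_percent": [48, 50, 52, 51, 49, 10, 35, 60, 85, 95],
})

# IQR = 75th percentile - 25th percentile, computed per group
grouped = df.groupby("boro")["frl_percent"]
iqr = grouped.quantile(0.75) - grouped.quantile(0.25)
print(iqr)  # B's IQR dwarfs A's
```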
Lastly, let's take a look at how SAT performance varies by borough.
# generate boxplot of SAT score grouped by borough
combined.boxplot("sat_score",
                 by="boro",
                 grid=False,
                 figsize=(10, 7)
                 )
plt.title("SAT performance by borough")
plt.suptitle("")  # remove default boxplot title
plt.xlabel("")
plt.ylabel("SAT score")
plt.show()
The box plots indicate:
The goal of this project was to identify whether demographic factors could influence a student's performance on the SAT, and by doing so give an indication as to how 'fair' the SAT examinations are.
We've gained the following insights:
To summarise, though the SAT examination itself is not 'unfair', the socioeconomic circumstances students find themselves in have an influence on their likelihood of performing well on the SAT. Students who live above the poverty line in affluent neighbourhoods are more likely to academically outperform their peers from poorer families living in less affluent neighbourhoods.
Further investigations
There are a few more things we could look at: