import pandas as pd
import numpy
import re
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
d = pd.read_csv("schools/{0}".format(f))
data[f.replace(".csv", "")] = d
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
string_representation = str(num)
if len(string_representation) > 1:
return string_representation
else:
return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
def find_lat(loc):
coords = re.findall("\(.+, .+\)", loc)
lat = coords[0].split(",")[0].replace("(", "")
return lat
def find_lon(loc):
coords = re.findall("\(.+, .+\)", loc)
lon = coords[0].split(",")[1].replace(")", "").strip()
return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
combined = combined.merge(data[m], on="DBN", how="inner")
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
def get_first_two_chars(dbn):
return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
SAT Critical Reading Avg. Score 0.986820 SAT Math Avg. Score 0.972643 SAT Writing Avg. Score 0.987771 sat_score 1.000000 AP Test Takers 0.523140 ... priority08 NaN priority09 NaN priority10 NaN lat -0.121029 lon -0.132222 Name: sat_score, Length: 67, dtype: float64
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
combined.corr()['sat_score'][survey_fields].plot(kind='bar');
There are high correlations between N_s, N_t, N_p and sat_score. Since these columns are correlated with total_enrollment, it makes sense that they would be high.
It is more interesting that rr_s, the student response rate, or the percentage of students that completed the survey, correlates with sat_score. This might make sense because students who are more likely to fill out surveys may be more likely to also be doing well academically.
How students and teachers percieved safety (saf_t_11 and saf_s_11) correlate with sat_score. This make sense, as it's hard to teach or learn in an unsafe environment.
The last interesting correlation is the aca_s_11, which indicates how the student perceives academic standards, correlates with sat_score, but this is not true for aca_t_11, how teachers perceive academic standards, or aca_p_11, how parents perceive academic standards.
combined.plot.scatter(x='saf_s_11',y='sat_score')
plt.show();
There appears to be a correlation between SAT scores and safety, although it isn't thatstrong. It looks like there are a few schools with extremely high SAT scores and high safety scores. There are a few schools with low safety scores and low SAT scores. No school with a safety score lower than 6.5 has an average SAT score higher than 1500 or so.
combined.groupby('boro').agg(numpy.mean)['saf_s_11']
boro Bronx 6.606577 Brooklyn 6.370755 Manhattan 6.831370 Queens 6.721875 Staten Island 6.530000 Name: saf_s_11, dtype: float64
Looks like Manhattan and Queens have higher safety score whereas Brooklyn and Staten Island have lowest mean safety score
race_fields = ['white_per','asian_per','black_per','hispanic_per']
combined.corr()['sat_score'][race_fields].plot.bar()
plt.show();
It looks like a higher percentage of white or asian students at a school correlates positively with sat score, whereas a higher percentage of black or hispanic students correlates negatively with sat score. This may be due to a lack of funding for schools in certain areas, which are more likely to have a higher percentage of black or hispanic students.
combined.plot.scatter(x='hispanic_per',y='sat_score')
plt.show();
combined[combined['hispanic_per'] > 95][['SCHOOL NAME','hispanic_per']]
SCHOOL NAME | hispanic_per | |
---|---|---|
44 | MANHATTAN BRIDGES HIGH SCHOOL | 99.8 |
82 | WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL | 96.7 |
89 | GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M... | 99.8 |
125 | ACADEMY FOR LANGUAGE AND TECHNOLOGY | 99.4 |
141 | INTERNATIONAL SCHOOL FOR LIBERAL ARTS | 99.8 |
176 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE | 99.8 |
253 | MULTICULTURAL HIGH SCHOOL | 99.8 |
286 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL | 100.0 |
The schools listed above appear to primarily be geared towards recent immigrants to the US. These schools have a lot of students who are learning English, which would explain the lower SAT scores.
combined[combined['hispanic_per'] < 10][['SCHOOL NAME','hispanic_per']]
SCHOOL NAME | hispanic_per | |
---|---|---|
37 | STUYVESANT HIGH SCHOOL | 2.4 |
46 | HIGH SCHOOL FOR DUAL LANGUAGE AND ASIAN STUDIES | 4.0 |
151 | BRONX HIGH SCHOOL OF SCIENCE | 7.2 |
187 | BROOKLYN TECHNICAL HIGH SCHOOL | 7.9 |
192 | BEDFORD ACADEMY HIGH SCHOOL | 7.1 |
194 | BENJAMIN BANNEKER ACADEMY | 8.8 |
220 | BOYS AND GIRLS HIGH SCHOOL | 7.8 |
223 | ACADEMY FOR COLLEGE PREPARATION AND CAREER EXP... | 6.7 |
226 | THE HIGH SCHOOL FOR GLOBAL CITIZENSHIP | 7.3 |
227 | SCHOOL FOR HUMAN RIGHTS, THE | 8.3 |
228 | SCHOOL FOR DEMOCRACY AND LEADERSHIP | 8.4 |
229 | HIGH SCHOOL FOR YOUTH AND COMMUNITY DEVELOPMEN... | 9.1 |
230 | HIGH SCHOOL FOR SERVICE & LEARNING AT ERASMUS | 9.0 |
231 | SCIENCE, TECHNOLOGY AND RESEARCH EARLY COLLEGE... | 9.4 |
235 | MEDGAR EVERS COLLEGE PREPARATORY SCHOOL | 2.8 |
236 | CLARA BARTON HIGH SCHOOL | 5.8 |
237 | IT TAKES A VILLAGE ACADEMY | 4.5 |
238 | BROOKLYN GENERATION SCHOOL | 8.5 |
240 | KURT HAHN EXPEDITIONARY LEARNING SCHOOL | 8.8 |
241 | VICTORY COLLEGIATE HIGH SCHOOL | 5.8 |
242 | ARTS & MEDIA PREPARATORY ACADEMY | 9.6 |
280 | BROOKLYN COLLEGIATE: A COLLEGE BOARD SCHOOL | 8.3 |
327 | QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO... | 7.9 |
330 | QUEENS PREPARATORY ACADEMY | 8.2 |
331 | PATHWAYS COLLEGE PREPARATORY SCHOOL: A COLLEGE... | 2.9 |
332 | EXCELSIOR PREPARATORY HIGH SCHOOL | 9.7 |
335 | CAMBRIA HEIGHTS ACADEMY | 7.5 |
337 | HUMANITIES & ARTS MAGNET HIGH SCHOOL | 8.5 |
353 | TOTTENVILLE HIGH SCHOOL | 9.9 |
356 | STATEN ISLAND TECHNICAL HIGH SCHOOL | 5.3 |
Many of the schools above appear to be specialized science and technology schools that receive extra funding, and only admit students who pass an entrance exam. This doesn't explain the low hispanic_per, but it does explain why their students tend to do better on the SAT -- they are students from all over New York City who did well on a standardized test.
gender_fields = ['male_per','female_per']
combined.corr()['sat_score'][gender_fields].plot.bar()
plt.show();
In the plot above, we can see that a high percentage of females at a school positively correlates with SAT score, whereas a high percentage of males at a school negatively correlates with SAT score. Neither correlation is extremely strong.
combined.plot.scatter(x='female_per',y='sat_score');
Based on the scatterplot, there doesn't seem to be any real correlation between sat_score and female_per. However, there is a cluster of schools with a high percentage of females (60 to 80), and high SAT scores.
combined[(combined['female_per'] > 60) & (combined['sat_score'] > 1700)]['School Name']
5 BARD HIGH SCHOOL EARLY COLLEGE 26 ELEANOR ROOSEVELT HIGH SCHOOL 60 BEACON HIGH SCHOOL 61 FIORELLO H. LAGUARDIA HIGH SCHOOL OF 302 TOWNSEND HARRIS HIGH SCHOOL Name: School Name, dtype: object
These schools appears to be very selective liberal arts schools that have high academic standards.
combined['ap_per'] = combined['AP Test Takers ']/combined['total_enrollment']
combined.plot.scatter(x='ap_per',y='sat_score')
plt.show();
It looks like there is a relationship between the percentage of students in a school who take the AP exam, and their average SAT scores. It's not an extremely strong correlation, though.