Read in the data¶

In [1]:

import pandas as pd
import numpy
import re

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

Read in the surveys¶

In [2]:

all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey

Add DBN columns¶

In [3]:

data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

Convert columns to numeric¶

In [4]:

cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

Condense datasets¶

In [5]:

class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

Convert AP scores to numeric¶

In [6]:

cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

Combine the datasets¶

In [7]:

combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

combined = combined.fillna(combined.mean())
combined = combined.fillna(0)

Add a school district column for mapping¶

In [8]:

def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

Find correlations¶

In [9]:

correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)

SAT Critical Reading Avg. Score    0.986820
SAT Math Avg. Score                0.972643
SAT Writing Avg. Score             0.987771
sat_score                          1.000000
AP Test Takers                     0.523140
                                     ...   
priority08                              NaN
priority09                              NaN
priority10                              NaN
lat                               -0.121029
lon                               -0.132222
Name: sat_score, Length: 67, dtype: float64

Plotting survey correlations¶

In [10]:

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [11]:

# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")

In [12]:

combined.corr()['sat_score'][survey_fields].plot(kind='bar');

There are high correlations between N_s, N_t, N_p and sat_score. Since these columns are correlated with total_enrollment, it makes sense that they would be high.

It is more interesting that rr_s, the student response rate, or the percentage of students that completed the survey, correlates with sat_score. This might make sense because students who are more likely to fill out surveys may be more likely to also be doing well academically.

How students and teachers percieved safety (saf_t_11 and saf_s_11) correlate with sat_score. This make sense, as it's hard to teach or learn in an unsafe environment.

The last interesting correlation is the aca_s_11, which indicates how the student perceives academic standards, correlates with sat_score, but this is not true for aca_t_11, how teachers perceive academic standards, or aca_p_11, how parents perceive academic standards.

Exploring Safety¶

In [13]:

combined.plot.scatter(x='saf_s_11',y='sat_score')
plt.show();

There appears to be a correlation between SAT scores and safety, although it isn't thatstrong. It looks like there are a few schools with extremely high SAT scores and high safety scores. There are a few schools with low safety scores and low SAT scores. No school with a safety score lower than 6.5 has an average SAT score higher than 1500 or so.

Borough Safety¶

In [14]:

combined.groupby('boro').agg(numpy.mean)['saf_s_11']

Out[14]:

boro
Bronx            6.606577
Brooklyn         6.370755
Manhattan        6.831370
Queens           6.721875
Staten Island    6.530000
Name: saf_s_11, dtype: float64

Looks like Manhattan and Queens have higher safety score whereas Brooklyn and Staten Island have lowest mean safety score

Racial differences in SAT scores¶

In [15]:

race_fields = ['white_per','asian_per','black_per','hispanic_per']
combined.corr()['sat_score'][race_fields].plot.bar()
plt.show();

It looks like a higher percentage of white or asian students at a school correlates positively with sat score, whereas a higher percentage of black or hispanic students correlates negatively with sat score. This may be due to a lack of funding for schools in certain areas, which are more likely to have a higher percentage of black or hispanic students.

In [16]:

combined.plot.scatter(x='hispanic_per',y='sat_score')
plt.show();

In [17]:

combined[combined['hispanic_per'] > 95][['SCHOOL NAME','hispanic_per']]

Out[17]:

	SCHOOL NAME	hispanic_per
44	MANHATTAN BRIDGES HIGH SCHOOL	99.8
82	WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL	96.7
89	GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...	99.8
125	ACADEMY FOR LANGUAGE AND TECHNOLOGY	99.4
141	INTERNATIONAL SCHOOL FOR LIBERAL ARTS	99.8
176	PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE	99.8
253	MULTICULTURAL HIGH SCHOOL	99.8
286	PAN AMERICAN INTERNATIONAL HIGH SCHOOL	100.0

The schools listed above appear to primarily be geared towards recent immigrants to the US. These schools have a lot of students who are learning English, which would explain the lower SAT scores.

In [18]:

combined[combined['hispanic_per'] < 10][['SCHOOL NAME','hispanic_per']]

Out[18]:

	SCHOOL NAME	hispanic_per
37	STUYVESANT HIGH SCHOOL	2.4
46	HIGH SCHOOL FOR DUAL LANGUAGE AND ASIAN STUDIES	4.0
151	BRONX HIGH SCHOOL OF SCIENCE	7.2
187	BROOKLYN TECHNICAL HIGH SCHOOL	7.9
192	BEDFORD ACADEMY HIGH SCHOOL	7.1
194	BENJAMIN BANNEKER ACADEMY	8.8
220	BOYS AND GIRLS HIGH SCHOOL	7.8
223	ACADEMY FOR COLLEGE PREPARATION AND CAREER EXP...	6.7
226	THE HIGH SCHOOL FOR GLOBAL CITIZENSHIP	7.3
227	SCHOOL FOR HUMAN RIGHTS, THE	8.3
228	SCHOOL FOR DEMOCRACY AND LEADERSHIP	8.4
229	HIGH SCHOOL FOR YOUTH AND COMMUNITY DEVELOPMEN...	9.1
230	HIGH SCHOOL FOR SERVICE & LEARNING AT ERASMUS	9.0
231	SCIENCE, TECHNOLOGY AND RESEARCH EARLY COLLEGE...	9.4
235	MEDGAR EVERS COLLEGE PREPARATORY SCHOOL	2.8
236	CLARA BARTON HIGH SCHOOL	5.8
237	IT TAKES A VILLAGE ACADEMY	4.5
238	BROOKLYN GENERATION SCHOOL	8.5
240	KURT HAHN EXPEDITIONARY LEARNING SCHOOL	8.8
241	VICTORY COLLEGIATE HIGH SCHOOL	5.8
242	ARTS & MEDIA PREPARATORY ACADEMY	9.6
280	BROOKLYN COLLEGIATE: A COLLEGE BOARD SCHOOL	8.3
327	QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...	7.9
330	QUEENS PREPARATORY ACADEMY	8.2
331	PATHWAYS COLLEGE PREPARATORY SCHOOL: A COLLEGE...	2.9
332	EXCELSIOR PREPARATORY HIGH SCHOOL	9.7
335	CAMBRIA HEIGHTS ACADEMY	7.5
337	HUMANITIES & ARTS MAGNET HIGH SCHOOL	8.5
353	TOTTENVILLE HIGH SCHOOL	9.9
356	STATEN ISLAND TECHNICAL HIGH SCHOOL	5.3

Many of the schools above appear to be specialized science and technology schools that receive extra funding, and only admit students who pass an entrance exam. This doesn't explain the low hispanic_per, but it does explain why their students tend to do better on the SAT -- they are students from all over New York City who did well on a standardized test.

Gender differences in SAT scores¶

In [27]:

gender_fields = ['male_per','female_per']
combined.corr()['sat_score'][gender_fields].plot.bar()
plt.show();

In the plot above, we can see that a high percentage of females at a school positively correlates with SAT score, whereas a high percentage of males at a school negatively correlates with SAT score. Neither correlation is extremely strong.

In [28]:

combined.plot.scatter(x='female_per',y='sat_score');

Based on the scatterplot, there doesn't seem to be any real correlation between sat_score and female_per. However, there is a cluster of schools with a high percentage of females (60 to 80), and high SAT scores.

In [34]:

combined[(combined['female_per'] > 60) & (combined['sat_score'] > 1700)]['School Name']

Out[34]:

5            BARD HIGH SCHOOL EARLY COLLEGE
26            ELEANOR ROOSEVELT HIGH SCHOOL
60                       BEACON HIGH SCHOOL
61     FIORELLO H. LAGUARDIA HIGH SCHOOL OF
302             TOWNSEND HARRIS HIGH SCHOOL
Name: School Name, dtype: object

These schools appears to be very selective liberal arts schools that have high academic standards.

AP Exam Scores vs SAT Scores¶

In [46]:

combined['ap_per'] = combined['AP Test Takers ']/combined['total_enrollment']

In [49]:

combined.plot.scatter(x='ap_per',y='sat_score')
plt.show();

It looks like there is a relationship between the percentage of students in a school who take the AP exam, and their average SAT scores. It's not an extremely strong correlation, though.

In [ ]: