#!/usr/bin/env python
# coding: utf-8

# In[1]:


get_ipython().run_cell_magic('javascript', '', 'IPython.OutputArea.prototype._should_scroll = function(lines) {\n    return false;\n}\n')


# # Introduction
# Public schools in the US represent a broad cross section of the regional population. They also tend to have robust and consistent collections of data. This data set is from New York City (NYC) public school surveys of the 2011-2012 school year. It consists of 161 columns of information for 363 schools.
# 
# # Summary
# * Schools with below average assessment of safety and respect have below average results on the SAT.
# * Initial analysis shows a negative correlation between the percentage of Hispanic students and SAT results. This relationship is not seen in results on the AP exam. Furthermore, when magnet schools and international schools are removed from the data set, the remaining 90% of schools show a fairly uniform distribution and the correlation is lost.  
# * A small group of schools share SAT scores and AP exam values. This could be due to a cohort of students that is shared among the schools. Their influence on the original data set needs to be considered before further analysis is made.

# # Data Cleaning

# ## Read in the data

# In[2]:


import pandas as pd
import numpy
import re
import pprint
pd.options.display.float_format = '{:20,.4f}'.format
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 3000)
pd.set_option('display.colheader_justify', 'center')
pd.set_option('display.precision', 3)

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d


# ## Read in the surveys

# In[3]:


all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_11", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey


# ## Add DBN columns

# In[4]:


data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]


# ## Convert columns to numeric

# In[5]:


cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")


# ## Condense datasets

# In[6]:


class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]


# ## Convert AP scores to numeric

# In[7]:


cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")


# ## Combine the datasets

# In[8]:


combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

combined = combined.fillna(combined.mean())
combined = combined.fillna(0)


# ## Add a school district column for mapping

# In[9]:


def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)


# # Correlations & Analysis

# ## Plotting survey correlations

# In[10]:


# determine correlations
correlations = combined.corr()
correlations = correlations["sat_score"]

# bar plot for correlations between survey_fields and sat_score
import matplotlib.pyplot as plt
correlations.loc["rr_s":"aca_tot_11"].plot.barh()
plt.show()


# ### Observations
# It looks like the strongest correlations are between the SAT scores and the number of parents or students that respond to the survey. The response rate, on the other hand, is not correlated. This discrepancy might be worth looking at.
# <br>
# <br>
# Additional correlations are between SAT scores and two other factors, the students academic expectations and the perception of safety and respect by students, teachers, and parents. The next two sections look at each.
# ## SAT Scores & Safety Scores

# In[11]:


# create melted df from combined preserving school borough & DBN, sat_score,  and safety scores
boroughs_long = combined[["boro", "DBN", "sat_score", "saf_s_11", "saf_t_11", "saf_p_11", "saf_tot_11"]]
boroughs_columns = ["borough", "DBN", "sat_score", "safety_student", "safety_teacher", "safety_parent", "safety_total"]
boroughs_long.columns = boroughs_columns
boroughs_long = boroughs_long.melt(id_vars=["borough", "DBN", "sat_score"])

# scatter plot with borough, sat_score, and safety scores (0-10)
# plots are for student, teacher, parent, and total safety scores
import seaborn as sns
sns.relplot(data=boroughs_long, x="value", y="sat_score", hue="borough", col="variable", col_wrap=2)
plt.show()


# In[12]:


# mean value for sat_score and safety scores, grouped by borough
boroughs_means = combined.groupby("boro").mean()
boroughs_means = boroughs_means[["sat_score", "saf_s_11", "saf_t_11", "saf_p_11", "saf_tot_11"]]
boroughs_means.reset_index(inplace=True)
boroughs_columns = ["borough", "sat_score", "safety_student", "safety_teacher", "safety_parent", "safety_total"]
boroughs_means.columns = boroughs_columns
display(boroughs_means)


#  ### Observations
# The scatter plots show two areas influencing the correlation between SAT scores and perceptions of safety and respect. First, the highest SAT scores come from schools with above average perceptions of safety and respect. Second, it looks like most schools have both SAT and safety and respect scores below the median values. It is also interesting that parents have a noticeably higher perception of safety and respect than either teachers or students.
# <br>
# <br>
# Grouping schools by borough and averaging scores paints a different picture. This seems to indicate that SAT scores and perception of safety do not have a strong positive correlation at all. For instance, Staten Island has the highest average SAT scores and Bronx the lowest. However, the student and parent perception of safety and respect scores are higher in the Bronx than in Staten Island. And while Manhattan has the highest total safety and respect scores, it ranks in the middle for SAT scores. So while using mean for comparison is a start, the scatter plot charts show a much more granular view that is closer to reality.
#  ## SAT Scores & Student Academic Expectations

# In[13]:


# scatter plot with borough, sat_score, and academic expectations (0-10)
sns.relplot(data=combined, x="aca_s_11", y="sat_score", hue="boro")
plt.show()


# ### Observations
# The academic expectation scatter plot shows similar results to those for perception of safety and respect. Again, there are very few schools in the upper half of the academic expectation scores, and among them about half have better than average SAT scores. Also, we see that most schools have both SAT scores and academic expectations that are below the median value.
# 
# ## SAT Scores & Demographics on Race/Ethnicity
# Taking an objective look at the relationship between race & ethnicity and SAT scores in NYC schools can help in creating an egalitarian public school system that delivers success to all students.

# In[14]:


# make bar plot for correlations between race & ethnicity and sat_score
correlations.loc[["asian_per", "black_per", "hispanic_per", "white_per"]].plot.barh()
plt.show()


# In[15]:


# scatter plot for sat_scores and percent Hispanic
sns.relplot(data=combined, x="hispanic_per", y="sat_score", hue="boro")
plt.show()


# ### Observations
# The bar chart for race & ethnicity paints a stark picture, showing a moderately strong negative correlation between SAT scores and the percentage of Hispanic students in the school. That being said, the scatter plot suggests something a little more nuanced.
# <br>
# <br>
# The highest SAT scores (over 1500) occur in schools with a low percentage of Hispanic students, but this is only about 6% of the schools in the data set. These seem to be magnet schools, which draw the best students from any demographic. There are about 5-10 schools that show close to 100% Hispanic student populations. These schools have low SAT scores, a closer look might show relevant factors. It looks like most of the schools (i.e. school-wide average of less than 1500 on the SAT and student population less than 95% percent Hispanic) have a more even distribution of SAT scores.
# <br>
# <br>
# As a side note, almost all the schools in the Bronx have more than half the student population identifying as Hispanic. There are just a few with less than 40% Hispanic, and two of these schools score the highest SAT averages. 
# <br>
# <br>
# That creates three groups of schools, with one group being divided again in two. The following sections take a look at each.
# ## Group 1, High SAT scores

# In[16]:


# create high_sat data set
high_sat = combined["Name"][combined["sat_score"] > 1500]
high_sat = high_sat.str.strip()
high_sat_bronx = combined["Name"][(combined["sat_score"] > 1500) & (combined["boro"]=="Bronx")]
display(high_sat)
print(len(high_sat))
print(len(combined))
display(high_sat_bronx)


# ### Observations
# Sure enough, most of these schools sound pretty brainy. The two in the Bronx included.
# ## Group 2, Hispanic Student Population Over 95%

# In[17]:


# create hispanic_per_95 data set
import numpy as np
columns = combined.iloc[:,[0,1,6,34,8,9,10,60,70]]
hispanic_per_95 = columns[combined["hispanic_per"] > 95]
display(hispanic_per_95)


# ### Observations
# This set of schools has the highest percentage of students identifying as Hispanic. The SAT scores all fall within the bottom quartile, with five of them scoring in the lowest tenth all the schools. The schools are further differentiated by having high percentages of students where English is a second language. The exception to this is the Washington Heights Expeditionary Learning School. All of the schools show a good percentage of passing scores on AP Exams.
# 
# Some people would consider AP Exam pass rate as a proxy for academic achievement. This combined with the large number of ELL students suggests that these SAT scores may be non-representational for some groups of NYC public school students.
# ## Group 3, The Majority of the Schools

# In[18]:


# create majority data set
majority = combined[(combined["sat_score"] < 1500) & (combined["hispanic_per"] < 95)]

# scatter plot for sat_scores and majority of schools
sns.relplot(data=majority, x="hispanic_per", y="sat_score", hue="boro")
plt.show()


# In[19]:


# correlations between race & ethnicity and sat_score in the majority of schools
majority_correlations = majority.corr()
majority_correlations = majority_correlations["sat_score"]
majority_correlations.loc[["asian_per", "black_per", "hispanic_per", "white_per"]].plot.barh()
plt.show()


# ### Observations
# Removing the schools with the highest percentage of Hispanic students (this includes many international and ELL students) and the schools with SAT scores over 1500 shows a more uniform distribution of SAT scores but still accounts for 91% of the schools in the original data set. The correlation between SAT scores and percentage of students identifying as Hispanic drops to less than -0.2.
# ## SAT Scores & Gender
# Information on student gender percentages is also available.

# In[20]:


# bar plot for correlations between gender percentage and sat_score
correlations.loc[["female_per", "male_per"]].plot.barh()
plt.show()


# In[21]:


# scatter plots for sat_scores and gender
import matplotlib.pyplot as plt
genders = combined[["boro", "DBN", "SCHOOL NAME", "sat_score", "female_per", "male_per"]]
genders_melt = genders.melt(id_vars=["boro", "DBN", "SCHOOL NAME", "sat_score"])
sns.relplot(data=genders_melt, x="value", y="sat_score", hue="boro", col="variable")
plt.show()


# In[22]:


# caluculate percentage female to male
all_the_kids = combined["NUMBER OF STUDENTS / SEATS FILLED"]
per_female = combined["female_per"].mean()
per_male = combined["male_per"].mean()
print("Total students =", all_the_kids.sum())
print("Percent female =", per_female)
print("Percent male =", per_male)


# In[23]:


# create sat_score over 1700 data set with female and male percentages
smart_kids = genders.iloc[:,2:6][genders["sat_score"] > 1700]
display(smart_kids)


# ### Observations
# Three things from the results stand out.
# <br>
# <br>
# First, the ratio of male to female students seems to have little to no effect on SAT scores. 
# <br>
# <br>
# Second, among the schools with SAT scores over 1700, there are five with over 60% female students and two with over 70%. There is only one with over 60% male students and none with over 70%. The two schools with the highest number of females are Townsend Harris High School (consistently ranked among the 100 best high schools in America) and LaGuardia High School. Both are magnet schools. The two schools with the highest average SAT scores are Stuyvesant High School (59% male) and Bronx High School of Science (58% male). Both of these are also magnet schools.
# <br>
# <br>
# Third, particularly interesting from a stats point of view, is that the gender break down among all the schools is so close to fifty-fifty. About 50.62% of the students are female and 49.37% male.
# ## SAT Scores & AP Exams
# Advanced Placement (AP) courses are college preparatory classes in the US, and are scored on a 1-5 scale. Scores of 3 and above often receive credit in US colleges and universities. AP Exam scores can be seen as a proxy for academic success, similar to SAT scores. This data set provides a good look at how the two are related.

# In[24]:


# create a column with percent of total sudents who took ap exams
combined["ap_per"] = combined["AP Test Takers "]/combined["total_enrollment"]
# scatter plot for ap_per and gender
sns.relplot(data=combined, x="ap_per", y="sat_score", hue="boro")
plt.show()


# In[25]:


# examine schools with high percentage of students taking ap exam
all_ap = combined[["SCHOOL NAME", "sat_score", "ap_per"]][combined["ap_per"] > 0.5]
display(all_ap)


# In[26]:


# schools with sat_score == 1223.4388
twelvetwentythree = combined[(1223 < combined["sat_score"]) & (combined["sat_score"] < 1224)]
display(twelvetwentythree)


# In[27]:


combined["Num of SAT Test Takers"].value_counts().head()


# ### Observations
# The scatter plot comparing SAT scores to the percent of students taking AP exams showed some interesting patterns.
# <br>
# <br>
# First, to answer the primary question, there does not appear to be a strong correlation between the two values. The schools with the highest SAT scores do have a large number of students taking AP exams. But there are many more schools with the same number of students taking the AP exam that have average SAT scores.
# <br>
# <br>
# Second, there are two large groups of schools with similar SAT scores but notably different numbers of AP exam takers. There is a group that has about 5-15% of the students taking the AP exam and another with about 20-30%. This is a little interesting.
# <br>
# <br>
# Third is a grouping of schools with identical SAT scores. Looking at the data set shows 28 schools with a SAT score of 1223.4388. This line shows up in the other scatter plots but is very obvious in this one. These schools also have the same numbers of AP exam test takers, AP exams taken, and AP scores of 3,4, or 5. They all also have the only non numeric value in the `Num of SAT Test Takers` column, namely `s`. Each of the groups of students does represent a different percentage of students in the school.

# # Conclusions
# 
# This initial look at academic and societal data for NYC schools suggests that the relationship between subgroups can be more nuanced than might first appear.
# * Schools that have high average SAT scores also have high safety and respect scores. While many schools with high safety and respect scores have average SAT scores, none of the schools in the bottom half of this group score above average SAT scores. This is a very important finding.
# * There doesn't seem to be a very strong connection in most high schools between high SAT scores and common demographic groups. Removing magnet schools and schools where the SAT has less emphasis or relevance (such as locations with very high percentages of international students) normalizes the distribution. It would be interesting to see what would happen to the overall data set if these students were re-introduced to schools in the same geographic area.
# * About 7% of the schools in the data set appear to share a group of students taking the SAT and AP exams. These schools represent key demographic data but their influence on the overall data set needs to be considered before further analysis is made.