#!/usr/bin/env python
# coding: utf-8

# # *Analyzing NYC High School Data*
#
# One of the most controversial issues in the U.S. educational system is the effectiveness of standardized tests, and whether they're unfair to certain groups.
#
# One such exam is the SAT. The SAT, or Scholastic Aptitude Test, is an exam that U.S. high school students take before applying to college. Colleges take the test scores into account when deciding whom to admit, so it's fairly important to perform well on it.
#
# In this project, we will collect data from various sources and determine whether SAT scores are unfair to certain groups.
#
# **New York City** has a significant immigrant population and is very diverse, so comparing demographic factors such as race, income, and gender with SAT scores is a good way to determine whether the SAT is a fair test. For example, if certain racial groups consistently perform better on the SAT, we would have some evidence that the SAT is unfair.
#
# **For the purposes of this project, we'll be using data about New York City public schools, which can be found here.
# Below are the datasets that we will be using:**
#
# SAT scores by school - SAT scores for each high school in New York City
# School attendance - Attendance information for each school in New York City
# Class size - Information on class size for each school
# AP test results - Advanced Placement (AP) exam results for each high school (passing an optional AP exam in a particular subject can earn a student college credit in that subject)
# Graduation outcomes - The percentage of students who graduated, and other outcome information
# Demographics - Demographic information for each school
# School survey - Surveys of parents, teachers, and students at each school
#
# **All of these data sets are interrelated. We'll combine all of them into a single data set, *combined*, which we'll use in our analysis to find correlations.**

# ### Read in the data

# In[1]:

import pandas as pd
import numpy
import re

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

# ### Print Data Keys

# In[2]:

for item in data:
    print(item)

# ### Print Data Values (Datasets)

# In[3]:

data['ap_2010'].head()

# In[4]:

data['class_size'].head()

# In[5]:

data['demographics'].head()

# In[6]:

data['graduation'].head()

# In[7]:

data['hs_directory'].head()

# In[8]:

data['sat_results'].head()

# Each data set appears to either have a DBN column, or the information we need to create one. After some processing to create a `DBN` column in each data set, we can combine them into a single data set.

# ### Read in the surveys

# In[9]:

all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0, sort=True)
survey.head()

# In[10]:

survey['dbn'].head()

# There are two immediate facts that we can see in the data:
# - There are over 2000 columns, nearly all of which we don't need. We'll have to filter the data to remove the unnecessary ones. Working with fewer columns will make it easier to print the dataframe out and find correlations within it.
# > We'll need to filter the columns to remove the ones we don't need. Luckily, there's a data dictionary at the [original data download location](https://data.cityofnewyork.us/Education/NYC-School-Survey-2011/mnz3-dyi8). The dictionary tells us what each column represents. We'll pick columns containing aggregate survey data about how parents, teachers, and students feel about school safety, academic performance, and more.
# - The survey data has a dbn column that we'll want to convert to uppercase (DBN). The conversion will make the column name consistent with the other data sets.

# In[11]:

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN",
    "rr_s",
    "rr_t",
    "rr_p",
    "N_s",
    "N_t",
    "N_p",
    "saf_p_11",
    "com_p_11",
    "eng_p_11",
    "aca_p_11",
    "saf_t_11",
    "com_t_11",
    "eng_t_11",
    "aca_t_11",
    "saf_s_11",
    "com_s_11",
    "eng_s_11",
    "aca_s_11",
    "saf_tot_11",
    "com_tot_11",
    "eng_tot_11",
    "aca_tot_11",
]
survey = survey.loc[:, survey_fields]
data["survey"] = survey
data['survey'].head()

# ### Add/Modify DBN columns in class_size and hs_directory
#
# When we explored all of the data sets, we noticed that some of them, like `class_size` and `hs_directory`, don't have a `DBN` column.
# - `hs_directory` does have a `dbn` column, though, so we can just rename it.
# - However, `class_size` doesn't appear to have the column at all. To construct its `DBN`, we will concatenate the `CSD` and `SCHOOL CODE` columns, adding a leading 0 to the `CSD` whenever it is less than two digits long.

# In[12]:

data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    # Pad the community school district (CSD) number to two characters.
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation

data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

print(data['hs_directory']['DBN'].head())
data["class_size"]['DBN'].head()
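# As a side note, the same zero padding could be done with pandas' built-in string methods; the sketch below should be equivalent to `pad_csd` above, assuming `CSD` only ever holds one- or two-digit district numbers (the `padded_csd_alt` name is just for illustration).

# In[ ]:

# A minimal sketch: str.zfill(2) left-pads each CSD with zeros to a width of two,
# so "1" becomes "01" while "10" stays "10".
padded_csd_alt = data["class_size"]["CSD"].astype(str).str.zfill(2)
padded_csd_alt.head()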
# ### Convert numeric columns from Object type to numeric and Add Longitude and Latitude Columns
#
# - Before proceeding further, we need to change several columns in `ap_2010` from the object (string) data type to a numeric data type.
# - The same needs to be done with the SAT score columns in `sat_results`. We'll then create a new column that totals the three SAT section scores.
# - We also want to parse the latitude and longitude coordinates for each school. This will enable us to map the schools and uncover any geographic patterns in the data. The coordinates are currently embedded in the text field `Location 1` in the `hs_directory` data set. We will need some regex along with string manipulation to extract them into two new columns, and then convert those columns to a numeric format.

# In[13]:

cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

def find_lat(loc):
    # Pull the "(lat, lon)" portion out of the Location 1 field and return the latitude.
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    # Pull the "(lat, lon)" portion out of the Location 1 field and return the longitude.
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
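# As an aside, pandas can also pull both coordinates out in a single vectorized pass. The sketch below uses `str.extract` with named capture groups and assumes the coordinates always appear as "(lat, lon)" inside `Location 1`; the `coords`, `lat_alt`, and `lon_alt` names are just for illustration.

# In[ ]:

# A minimal sketch: str.extract applies the regex to every row and returns
# one column per named capture group.
coords = data["hs_directory"]["Location 1"].str.extract(r"\((?P<lat_alt>.+),\s*(?P<lon_alt>.+)\)")
# After pd.to_numeric(..., errors="coerce"), coords["lat_alt"] and coords["lon_alt"]
# should match the lat/lon columns created above.
coords.head()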

# ### Condense datasets
#
# - Next, we will clean the data sets further so that each has a single row per `DBN` value. This will help us when combining the data sets.
# - We'll perform this cleaning for three data sets:
#
# > **`class_size`**: each school has multiple values for `GRADE`, `PROGRAM TYPE`, `CORE SUBJECT (MS CORE and 9-12 ONLY)`, and `CORE COURSE (MS CORE and 9-12 ONLY)`.
# - We will filter on `GRADE ('09-12')` and `PROGRAM TYPE ('GEN ED')` to keep only the classes relevant to students who take the SAT.
# - Then we will group the rows by `DBN` and take the average of each group with the agg() method, which leaves a single row per `DBN`.
# > **`demographics`**: here, the only column that prevents a given DBN from being unique is `schoolyear`. We only want to select rows where `schoolyear` is `20112012`. This will give us the most recent year of data and also match our SAT results data.
# > **`graduation`**: the `Demographic` and `Cohort` columns are what prevent `DBN` from being unique in the graduation data. A `Cohort` appears to refer to the year the data represents, and the `Demographic` appears to refer to a specific demographic group. In this case, we want to pick data from the most recent `Cohort` available, which is `2006`. We also want data from the full cohort, so we'll only pick rows where `Demographic` is `Total Cohort`.

# In[14]:

class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

# Collapse multiple rows per school into one by averaging the numeric columns.
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

# ### Combine the datasets
#
# - We'll merge two data sets at a time. For example, we'll merge `sat_results` with one data set, then merge the result with another, and so on until we've merged all of them. Afterwards, we'll have roughly the same number of rows, but each row will have columns from all of the data sets.
# - Because we're concerned with determining which demographic factors correlate with SAT score, we want to preserve as many rows as possible from `sat_results` while minimizing null values. Therefore, we will use different merge strategies:
# - Data sets with a lot of missing DBN values will be merged using a left join.
# - Data sets whose DBN values closely match those in sat_results will be merged using an inner join. These data sets also have information we need to keep. Most of our analysis would be impossible if a significant number of rows were missing from `demographics`, for example.

# In[15]:

combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

# Fill remaining missing values with column means first, then with 0.
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)

# In[16]:

pd.set_option('display.max_columns', 500)
combined.head()

# ### Add a school district column for mapping
#
# Now we have a clean data set on which we can base our analysis. Mapping the statistics out at the school district level might be an interesting way to analyze them. Adding a column to the data set that specifies the school district will help us accomplish this.

# In[17]:

def get_first_two_chars(dbn):
    # The first two characters of a DBN are the district number.
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
combined.head(3)
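# As a design note, the same district extraction can be done without a helper function; the sketch below uses pandas' vectorized string slicing and should be equivalent to the apply() call above (the `school_dist_alt` name is just for illustration).

# In[ ]:

# A minimal sketch: .str[0:2] slices the first two characters of every DBN at once.
school_dist_alt = combined["DBN"].str[0:2]
school_dist_alt.head()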
# ### Find correlations

# In[18]:

correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)

# ### Plotting survey correlations
#
# - Several fields in combined originally came from a survey of parents, teachers, and students. We will make a bar plot of the correlations between these fields and sat_score.

# In[19]:

# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")

import matplotlib.pyplot as plt

combined.corr()['sat_score'][survey_fields].plot.bar()
plt.show()

# **Key Observations:**
#
# - `N_s`, `N_t`, and `N_p` correlate highly with `sat_score`. This is understandable, since these fields count survey respondents and are therefore closely tied to total enrollment.
# - `rr_s` (the student response rate) is a more interesting result. One explanation is that students who excel academically are more likely to respond to the survey.
# - `saf_t_11`, `saf_s_11`, and `saf_tot_11` give us another good perspective on the relationship between a safe environment and academic performance: schools where students and teachers feel safe are more likely to have a good `sat_score`.
# - `aca_s_11`, i.e. how students perceive academic standards, correlates highly with `sat_score`. However, the same is not true for `aca_p_11` and `aca_t_11`, i.e. how parents and teachers perceive academic standards.

# ### Exploring Safety and SAT Scores
#
# - We will investigate safety scores by making a scatter plot of the saf_s_11 column vs. the sat_score column in combined.
# - Map out safety scores:
#   - Compute the average safety score for each district.
#   - Make a map that shows safety scores by district.

# In[20]:

combined.plot.scatter('saf_s_11', 'sat_score')
plt.show()

# **From the scatter plot, we can observe that there seems to be a positive correlation between `sat_score` and `saf_s_11`, although it's not that strong.**

# In[21]:

districts = combined.groupby('school_dist').agg(numpy.mean)
districts.reset_index(inplace=True)
districts

# In[61]:

import numpy as np
from mpl_toolkits.basemap import Basemap

m = Basemap(
    projection='merc',
    llcrnrlat=40.496044,
    urcrnrlat=40.915256,
    llcrnrlon=-74.255735,
    urcrnrlon=-73.700272,
    resolution='f'
)

m.drawmapboundary(fill_color='#ABD1FF')
m.drawcoastlines(color='black', linewidth=.4)
m.drawrivers(color='#ABD1FF', linewidth=.4)
m.fillcontinents(color='#F2F1EF', lake_color='#ABD1FF')

longitudes = districts['lon'].tolist()
latitudes = districts['lat'].tolist()
m.scatter(longitudes, latitudes, s=50, zorder=2, latlon=True, c=districts['saf_tot_11'], cmap='viridis')

plt.axis(aspect='equal')
plt.xlabel('longitude')
plt.ylabel('latitude')
plt.colorbar(label='Safety Score')
plt.title('New York City: Schools Safety Score')
plt.savefig('new_york_schools.png')
plt.show()

# **From the map, we can observe that parts of _Manhattan, the Bronx, and Queens_ have relatively higher safety scores, while _Brooklyn_ mostly has lower safety scores.**

# ***

# ### Racial differences in SAT Scores
#
# There are a few columns that indicate the percentage of each race at a given school:
# - `white_per`
# - `asian_per`
# - `black_per`
# - `hispanic_per`
#
# By plotting out the correlations between these columns and sat_score, we can determine whether there are any racial differences in SAT performance. We will investigate this by making a bar plot of the correlations between the columns above and `sat_score`.

# In[23]:

races_per = ['white_per', 'asian_per', 'black_per', 'hispanic_per']
combined.corr()['sat_score'][races_per].plot.bar()
plt.show()

# **From the bar plot above, we can see that a higher percentage of white or Asian students at a school correlates with higher SAT scores, and the reverse is true for black and Hispanic students. This could be due to various external factors, like the economic background of students, school funding, etc.**
#
# *We will continue exploring race and SAT scores by making a scatter plot of hispanic_per vs. sat_score.*
# In[24]:

combined.plot.scatter(x='hispanic_per', y='sat_score')
plt.show()

# **There seems to be a negative correlation between the percentage of Hispanic students and SAT score. This may be due to a lack of funding for schools in certain areas, which are more likely to have a higher percentage of black or Hispanic students.**
#
# *Narrowing down on the Hispanic population by researching schools with a hispanic_per greater than 95%.*

# In[25]:

combined[combined['hispanic_per'] > 95]['SCHOOL NAME']

# **From our findings, we can observe that the above schools cater mostly to immigrant communities that have recently migrated to the USA from Spanish-speaking countries and are still learning English, which can be associated with lower SAT scores.**
#
# *Researching schools with a hispanic_per less than 10% and an average SAT score greater than 1800.*

# In[26]:

combined[(combined['hispanic_per'] < 10) & (combined['sat_score'] > 1800)]['SCHOOL NAME']

# **From our findings, we observe that the above schools specialize in science and technology and have admission criteria (passing specific entrance exams), which explains the high SAT scores. However, there is no obvious reason for the low proportion of Hispanic students.**

# ### Gender differences in SAT Scores
#
# There are two columns that indicate the percentage of each gender at a school:
#
# - male_per
# - female_per
#
# We can plot out the correlations between each percentage and sat_score by making a bar plot.

# In[27]:

gender_per = ['male_per', 'female_per']
combined.corr()['sat_score'][gender_per].plot.bar()
plt.show()

# **Here, from the bar plot, we can observe that a higher percentage of female students correlates positively with SAT scores, and the reverse is true for male students. However, neither correlation is strong.**
#
# *Investigating schools with high SAT scores and a high female_per by making a scatter plot of female_per vs. sat_score.*

# In[28]:

combined.plot.scatter(x='female_per', y='sat_score')
plt.show()

# **From the scatter plot above, there doesn't seem to be any strong correlation between SAT scores and female percentage.**
#
# *Researching schools with a female_per greater than 60% and an average SAT score greater than 1700.*

# In[29]:

combined[(combined['sat_score'] > 1700) & (combined['female_per'] > 60)]['SCHOOL NAME']

# **From our findings, we notice that the above schools focus primarily on two things:**
# > - *Arts*
# > - *Preparing students for college*
#
# These appear to be very selective liberal arts schools with high academic standards.

# ### Conclusion:
#
# There isn't enough evidence to claim that SAT scores are unfair to certain groups. On the surface, the SAT scores look unfair, but on further investigation we can see that there are multiple other factors (like immigration, funding, etc.) that may influence academic performance.

# In[ ]: