#!/usr/bin/env python
# coding: utf-8

# # Guided Project No. 07: Analyzing NYC SAT Results and High School Data

# ## Introduction

# We will be exploring the education sector in this project, particularly SAT scores and how they relate to school characteristics. We will limit our analysis to New York City, where there is plenty of data to work with.
#
# [New York City (NYC)](https://en.wikipedia.org/wiki/New_York_City) is the most populous and most densely populated major city in the United States, with an estimated population of over 8 million people. NYC's [metropolitan statistical area](https://en.wikipedia.org/wiki/Metropolitan_statistical_area) is estimated to have a population of around 20 million. It is composed of five boroughs: Brooklyn, Queens, Manhattan, the Bronx, and Staten Island.

# ![new_york.jpg](attachment:new_york.jpg)

# Primary and secondary education in NYC is provided by both public and private institutions. The New York public school system, managed by the New York City [Department of Education](https://www.schools.nyc.gov/), is the largest in the US, serving over 1.1 million students across more than 1,700 public schools. There are also over 900 privately run secular and religious schools.
#
# Basic education outcomes in New York are interesting to examine given the city's large number of students and the substantial resources spent to support its education system. We will be looking at patterns in the SAT results of different schools.
#
# **What is the SAT?**
#
# The SAT is a standardized test widely used by US colleges and universities to make admission decisions. It is a multiple-choice, pencil-and-paper test owned, developed, published, and administered by the [College Board](https://www.collegeboard.org/).
#
# The SAT contains four main sections and one optional essay section:
# - Reading
# - Writing & Language
# - Mathematics (No Calculator)
# - Mathematics (Calculator)
# - Essay (Optional)
#
# It takes three hours to finish, plus an additional 50 minutes for the optional essay section. Taking the SAT yields two section scores: (1) evidence-based reading and writing, and (2) math. Each section is scored on a 200-800 scale, so the total SAT score ranges from 400 (minimum) to 1600 (maximum).
#
# Most high school students take the SAT during their junior or senior year, as the results help determine their options for higher education. Overall, higher SAT scores mean more college and university options and more scholarship/financing opportunities.

# ## The Data
#
# For this project, we will use [data](https://data.cityofnewyork.us/browse?category=Education) provided by the New York City Department of Education (DOE). Below are the data sets we will be using, along with brief details for each.
#
# **[2012 SAT Results](https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4)**
# The most recent school-level results for New York City on the SAT. Results are available at the school level for the graduating seniors of 2012. Records contain 2012 college-bound seniors' mean SAT scores taken during SY 2012.
# *Updated: April 25, 2019*
# *Data Provided by: Department of Education (DOE)*

# **[2016 DOE High School Directory](https://data.cityofnewyork.us/Education/2016-DOE-High-School-Directory/7crd-d9xh)**
# Directory of NYC high schools, including information such as each school's location and programs.
# *Updated: April 27, 2019*
# *Data Provided by: Department of Education (DOE)*

# **[2010-2011 Class Size - School-level detail](https://data.cityofnewyork.us/Education/2010-2011-Class-Size-School-level-detail/urz7-pzb3)**
# Average class sizes for each school, by grade and program type (General Education, Self-Contained Special Education, Collaborative Team Teaching (CTT)).
# Class size figures are based on data as of January 28, 2011.
# *Updated: April 27, 2019*
# *Data Provided by: Department of Education (DOE)*

# **[2010 AP (College Board) School Level Results](https://data.cityofnewyork.us/Education/2010-AP-College-Board-School-Level-Results/itfs-ms3e)**
# New York City school-level College Board AP results for 2010.
# Records with 5 or fewer students are suppressed.
# Data collected and processed by the College Board.
# *Updated: April 27, 2019*
# *Data Provided by: Department of Education (DOE)*

# **[2005-2010 Graduation Outcomes - School Level](https://data.cityofnewyork.us/Education/2005-2010-Graduation-Outcomes-School-Level/vh2h-md7a)**
# Graduation Outcomes - Cohorts of 2001 through 2006 (Classes of 2005 through 2010).
# The New York State calculation method was first adopted for the Cohort of 2001 (Class of 2005). The cohort consists of all students who first entered 9th grade in a given school year (e.g., the Cohort of 2006 entered 9th grade in the 2006-2007 school year). Graduates are defined as those students earning either a Local or Regents diploma, excluding those earning either a special education (IEP) diploma or a GED. For the most recent cohort, graduation rates as of both June and August (including summer graduates) are reported.
# *Updated: April 27, 2019*
# *Data Provided by: Department of Education (DOE)*

# **[2006 - 2012 School Demographics and Accountability Snapshot](https://data.cityofnewyork.us/Education/2006-2012-School-Demographics-and-Accountability-S/ihfw-zy9j)**
# Annual school accounts of NYC public school student populations served by grade, special programs, ethnicity, gender and Title I funded programs.
# *Updated: April 27, 2019*
# *Data Provided by: Department of Education (DOE)*

# **[2011 NYC School Survey](https://data.cityofnewyork.us/Education/2011-NYC-School-Survey/mnz3-dyi8)**
# Every year, all parents, all teachers, and students in grades 6 - 12 take the NYC School Survey. The survey ranks among the largest surveys of any kind ever conducted nationally. Survey results provide insight into a school's learning environment and contribute a measure of diversification that goes beyond test scores on the Progress Report.
# The NYC School Survey helps school leaders understand what key members of the school community say about the learning environment at each school. The information captured by the survey is designed to support a dialogue among all members of the school community about how to make the school a better place to learn.
# *Updated: November 21, 2018*
# *Data Provided by: Department of Education (DOE)*
# *Data dictionary can be downloaded [here](https://data.cityofnewyork.us/api/views/mnz3-dyi8/files/aa68d821-4dbb-4eb2-9448-3d8cbbad5044?download=true&filename=Survey%20Data%20Dictionary.xls)*

# To summarize, we will be using information on:
# - SAT results
# - high school directory
# - class sizes
# - AP results
# - graduation outcomes
# - school demographics
# - school survey results
#
# This is of great interest to us, especially since New York's population is very diverse, with many immigrants from different sociocultural and economic backgrounds. We will try to identify whether there are patterns and school characteristics that consistently correlate strongly with SAT scores. It may be the case that SATs contribute to systemic discrimination and inequality. The information we obtain from this project may be used by policymakers to improve or modify the design of standardized tests, or even overhaul how college admission decisions are made.
#
# Note, however, that the scope of this project (and its corresponding analysis) is limited to public schools. This is because the School Demographics and Accountability data does not include information on private schools.

# ### Read in the data
#
# We begin by loading the first six data sets (SAT results, the high school directory, class sizes, AP results, graduation outcomes, and school demographics).
#
# We will load the school survey data separately because it uses a different encoding format (Windows-1252) and its structure is a little different, so it will need a bit of pre-processing before we can use it.

# In[1]:

# Loading modules
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
import matplotlib as mpl
from matplotlib import cm
import seaborn as sns

sns.set_style('white')
get_ipython().run_line_magic('matplotlib', 'inline')

# Reading in the data sets
data_files = ["sat_results.csv",
              "hs_directory.csv",
              "class_size.csv",
              "ap_2010.csv",
              "graduation.csv",
              "demographics.csv"
              ]

data = {}

for file in data_files:
    # Note that the CSV files are in the "schools" folder
    df = pd.read_csv("schools/{}".format(file))
    # We use the file names (minus the csv extension) as dictionary keys
    key_name = file[:-4]
    data[key_name] = df

# Displaying the first few rows of each dataframe
for key in data:
    print("dataframe: ", key)
    print("shape:", data[key].shape)
    print(data[key].head(3))
    print("\n")

# The data sets look okay so far. The dimensions vary across the different dataframes, with some having as few as 258 rows (`ap_2010`) while others have over 27,000 rows (`class_size`). We'll clean this up later so we can combine the dataframes.
#
# For now, let's read in the survey data and add it to our dictionary of dataframes. The school survey data is divided into two data sets, one for all community schools (`survey_all.txt`) and another for all District 75 schools (`survey_d75.txt`). In loading these data sets, we note that they are encoded in Windows-1252 (unlike the previous six data sets, which were encoded in the more common UTF-8 format). For more information on encoding formats, please check this [documentation](https://docs.python.org/2.4/lib/standard-encodings.html).
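# As a quick aside on why the `encoding` argument matters: the same raw bytes decode differently (or fail outright) under different codecs. Below is a minimal illustration using a hypothetical byte string, not taken from the actual survey files.

# In[ ]:

# Byte 0xE9 encodes "é" in Windows-1252
sample = b"caf\xe9"
print(sample.decode("windows-1252"))

# The same bytes are invalid UTF-8, which is why reading the survey files
# without encoding="windows-1252" would raise a UnicodeDecodeError
try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    print("utf-8 decoding failed:", err)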
# In[2]:

# Loading the two survey data sets
all_survey = pd.read_csv("schools/survey_all.txt",
                         delimiter="\t",
                         encoding="windows-1252"
                         )
d75_survey = pd.read_csv("schools/survey_d75.txt",
                         delimiter="\t",
                         encoding="windows-1252"
                         )

# Checking the dimensions
surveys = [all_survey, d75_survey]
survey_names = ["all_survey", "d75_survey"]

for r in range(2):
    print("Dimensions of {}\nNumber of Rows: {}\nNumber of Columns: {}\n".
          format(survey_names[r], surveys[r].shape[0], surveys[r].shape[1])
          )

# Notice that the surveys have over 1,700 columns each. We won't print those columns in this notebook to avoid excessive clutter. We will concatenate the two dataframes into one along `axis=0`.
#
# Before we combine the survey dataframes, let's check whether we can minimize the number of columns based on what we can use. Most columns are individual survey responses, so we will only need the few columns that aggregate the information contained in the survey or help identify the respective schools. To do that, we check the survey data dictionary ([download link](https://data.cityofnewyork.us/api/views/mnz3-dyi8/files/aa68d821-4dbb-4eb2-9448-3d8cbbad5044?download=true&filename=Survey%20Data%20Dictionary.xls)), which is partially reproduced below. Note that the dictionary lists the 2010 field names; the 2011 survey in our data uses the same fields with an `_11` suffix instead (e.g. `saf_s_11`).
#
# | Field Name | Field Description |
# | --- | --- |
# | dbn | School identification code (district borough number) |
# | sch_type | School type (Elementary, Middle, High, etc) |
# | location | School name |
# | enrollment | Enrollment size |
# | borough | Borough |
# | principal | Principal name |
# | studentsurvey | Only students in grades 6-12 participate in the student survey. This field indicates whether or not this school serves any students in grades 6-12. |
# | rr_s | Student Response Rate |
# | rr_t | Teacher Response Rate |
# | rr_p | Parent Response Rate |
# | N_s | Number of student respondents |
# | N_t | Number of teacher respondents |
# | N_p | Number of parent respondents |
# | nr_s | Number of eligible students |
# | nr_t | Number of eligible teachers |
# | nr_p | Number of eligible parents |
# | saf_p_10 | Safety and Respect score based on parent responses |
# | com_p_10 | Communication score based on parent responses |
# | eng_p_10 | Engagement score based on parent responses |
# | aca_p_10 | Academic expectations score based on parent responses |
# | saf_t_10 | Safety and Respect score based on teacher responses |
# | com_t_10 | Communication score based on teacher responses |
# | eng_t_10 | Engagement score based on teacher responses |
# | aca_t_10 | Academic expectations score based on teacher responses |
# | saf_s_10 | Safety and Respect score based on student responses |
# | com_s_10 | Communication score based on student responses |
# | eng_s_10 | Engagement score based on student responses |
# | aca_s_10 | Academic expectations score based on student responses |
# | saf_tot_10 | Safety and Respect total score |
# | com_tot_10 | Communication total score |
# | eng_tot_10 | Engagement total score |
# | aca_tot_10 | Academic Expectations total score |

# In[3]:

# Combining the two surveys into one dataframe
survey = pd.concat([all_survey, d75_survey], axis=0)

# Displaying dimensions of survey dataframe before dropping some columns
print("Before filtering columns")
print("Number of rows: {a}\nNumber of columns: {b}".format(a=survey.shape[0],
                                                           b=survey.shape[1]))
print("\n")

# Creating a copy of the dbn column with a properly capitalized column name
survey["DBN"] = survey["dbn"].copy()

# Creating list of columns to be retained
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s",
"N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11" ] # Filtering survey dataframe based on column list survey = survey[survey_fields] # Reassigning updated survey dataframe to our dictionary data["survey"] = survey # Displaying dimensions of survey dataframe after filtering print("After filtering columns") print("Number of rows: {a}\nNumber of columns: {b}".format(a=survey.shape[0], b=survey.shape[1])) # Our `survey` dataframe now only contains 23 columns from over 2,700. The 1,700 rows have been retained for now. # # We have finished reading in all the data sets into a single dictionary we can easily access. Let's now proceed to data cleaning tasks. # ## Cleaning and Pre-Processing the Data # # Before we can proceed with the analysis, we need to clean and combine our data first. This will involve several steps including: # - renaming and creating new columns # - converting entries to the proper data type (e.g. strings to numeric) # - filtering and aggregating some dataframes # # ### Adding DBN Column # # A quick glance at the data gives us enough information to identify the `"DBN"` (District Borough Number) column as something we can use to combine our different data sets. # # We will make a copy of the`"dbn:` column of the `hs_directory` dataframe and name it `"DBN"` to make combining easier later. # # We also note that the `class_size` dataframe does not have a `"DBN"` column. We can, however, recreate this after taking note that the values in the `"csd"` (Community School District) column, when padded with a leading zero and combined with the value in the `"SCHOOL CODE"` column will be equivalent to the DBN. # In[4]: # Creating a DBN column with correct case for hs_directory data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"].copy() # Creating a DBN column for class size by combining csd and school code for class_size # padding csd with a leading zero data["class_size"]["padded_csd"] = data["class_size"]["CSD"].astype(str).str.zfill(2) # combining padded csd and school code to create DBN identifier data["class_size"]["DBN"] = (data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]).copy() # ### Creating a sat_score column # # To streamline our analysis, we will create a new column that represents the a SAT score (`"sat_score"`) column which is just the sum of the Math, Critical Reading, and Writing average scores. Before we can do mathematical operations, we need to convert them into numeric type and set the argument `errors="coerce"` such that values that can't be converted to numeric format will be set as `NaN`. # # Note that the SAT score we compute here is not necessarily equal to the total SAT score described earlier in this notebook. This is because we have not combined the Reading and Writing sections to form a max score of 800. The analysis should remain valid, however, as the patterns and trends will remain consistent anyway. # In[5]: # Listing the columns containing the score components sat_cols = ["SAT Math Avg. Score", "SAT Critical Reading Avg. Score", "SAT Writing Avg. 
Score" ] # Converting columns to numeric for c in sat_cols: data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce") # Creating sat_score column by adding the three score components data["sat_results"]["sat_score"] = (data["sat_results"][sat_cols[0]] + data["sat_results"][sat_cols[1]] + data["sat_results"][sat_cols[2]] ) # Taking a first look at the distribution of SAT scores data["sat_results"]["sat_score"].plot(kind='box', figsize=(3,8)) plt.ylabel("SAT Score") plt.title("Box Plot of SAT Average SAT Score") plt.show() data["sat_results"]["sat_score"].describe() # We see that the median SAT score is 1,169 with the interquartile range being (1,102 to 1,257). We see a very large disparity in some scores, where the lowest average SAT score in our data set is 887 while the highest is 2,096. # ### Extracting location data # # The high school directory data set contains location data in its `"Location 1"` column that may prove to be very useful if we decide to map out the locations of these schools, where we might find some interesting patterns. # # Let's extract these values and add latitude and longitude columns. # In[6]: # Setting up the regex pattern for capturing latitude and longitude values pattern = r"\((.+),(.+)\)" # Extracting and creating latitude and longitude information using vectorized operations data["hs_directory"][["lat","lon"]] = data["hs_directory"]["Location 1"].str.extract(pattern).astype(float) # ### Condensing data sets # # When we merge the data sets, we will use the `"DBN"` column as a unique identifier. However, we note that some of our data sets have repeating or non-unique DBN columns. We will filter these on a case-to-case basis such that `"DBN"` is unique and can be used as the key for merging. # # **Class Size** #
# **Class Size**
# Let's check the first few rows of `class_size`.

# In[7]:

print("class_size")
data["class_size"].head()

# For `class_size`, we see that DBN is not unique because a single school can occupy multiple rows representing different grade levels and program types. Let's check the values of these columns.

# In[8]:

print("GRADE")
print(data["class_size"]["GRADE "].value_counts())
print("\n")
print("PROGRAM TYPE")
print(data["class_size"]["PROGRAM TYPE"].value_counts())

# Since we know that SATs are taken only by students in grades 11 to 12, we will filter out all other grade levels in `class_size`, keeping only rows with the value `"09-12"`.
#
# The most common program type is `"GEN ED"`, so keeping those rows will allow us to retain the greatest number of observations.
#
# We will now proceed with condensing the `class_size` dataframe.

# In[9]:

class_size = data["class_size"]

# Condensing class_size to have unique DBN values
# Filtering to include only data from grades 9-12 and general education program types
grade_and_program_bool = (class_size["GRADE "] == "09-12") & (class_size["PROGRAM TYPE"] == "GEN ED")
class_size = class_size[grade_and_program_bool]

# Grouping by DBN and getting the mean class sizes per school
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)

# Reassigning to dictionary
data["class_size"] = class_size

# **Demographics**
# Let's check the first few rows of `demographics`.

# In[10]:

data["demographics"].head()

# The non-uniqueness of DBN in the `demographics` dataframe is caused by the same schools reappearing in multiple rows representing different school years. Let's check for the most recent school year available in this data set and retain those rows.

# In[11]:

print(data["demographics"]["schoolyear"].value_counts())

# The most recent school year is 2011-2012. Note that the values (school years) are stored as integers.
#
# Let's proceed with condensing the `demographics` dataframe.

# In[12]:

# Condensing demographics to have unique DBN values
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

# **Graduation**
# Let's check the first few rows of `graduation`.

# In[13]:

print("graduation")
data["graduation"].head()

# Once again, we see non-unique DBN values. The culprits for this non-uniqueness are the `"Demographic"` and `"Cohort"` columns.
#
# Recall that a ***cohort*** is defined as "consisting of all students who first entered 9th grade in a given school year".
#
# Let's check the values for these two columns.

# In[14]:

print("Demographic")
print(data["graduation"]["Demographic"].value_counts())
print("\n")
print("Cohort")
print(data["graduation"]["Cohort"].value_counts())

# The latest cohort is `"2006"`, and we're more interested in analyzing the data at the school level, so we will keep only observations for `"Total Cohort"`. Other demographic information should be available in the `demographics` dataframe.
#
# Let's condense our `graduation` dataframe.

# In[15]:

# Condensing graduation to have unique DBN values
cohort_and_demo_bool = (data["graduation"]["Cohort"] == "2006") & (data["graduation"]["Demographic"] == "Total Cohort")
data["graduation"] = data["graduation"][cohort_and_demo_bool]

# To summarize, we condensed the three dataframes as follows:
# - For `class_size`, we included only information for grades 9 to 12 under the general education program. We then aggregated the values by school to get unique DBN values.
# - For `demographics`, we retained only the latest school year in the data set (2011 - 2012).
# - For `graduation`, we retained only the latest total cohort of 2006.
#
# This process allowed us to make sure that DBN values are unique in each dataframe.

# ### Converting AP scores to numeric
#
# Some columns in the `ap_2010` dataframe are stored as strings instead of numeric types. We'll reformat those and set the argument `errors="coerce"` so that values that can't be converted to numeric will be set as `NaN`.

# In[16]:

# Setting up list of columns to be converted to numeric
cols = ["AP Test Takers ", "Total Exams Taken", "Number of Exams with scores 3 4 or 5"]

# Extracting the dataframe from the dictionary
ap_2010 = data["ap_2010"]

# Converting the selected columns to numeric
for col in cols:
    ap_2010[col] = pd.to_numeric(ap_2010[col], errors="coerce")

# Reassigning modified dataframe back to dictionary
data["ap_2010"] = ap_2010

# ### Combining the data sets
#
# We can now combine all data sets into one dataframe. To perform this task, we will use `pandas.DataFrame.merge()`, which supports different kinds of joins (`left`, `right`, `inner`, and `outer`). For more information on the `merge()` function, you may refer to its [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).

# **Combining `sat_results`, `ap_2010`, and `graduation`**
#
# For the first set of merges, we will use a `left` join via the `merge()` function. This means that we will retain all the rows of the left dataframe, which we will set as `sat_results`, since we want to retain as many rows from this dataframe as we can. We'll merge `sat_results` and `ap_2010` into the dataframe `combined`, then merge `combined` with `graduation`. A toy illustration of the difference between join types follows below.
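# To make the difference between join types concrete, here is a minimal sketch on two toy dataframes (hypothetical values, unrelated to our school data):

# In[ ]:

left = pd.DataFrame({"DBN": ["01M015", "02M047"], "sat": [1200, 1400]})
right = pd.DataFrame({"DBN": ["01M015", "03M299"], "ap": [10, 25]})

# Left join: every row of `left` survives; unmatched rows get NaN for `ap`
print(left.merge(right, on="DBN", how="left"))

# Inner join: only keys present in BOTH dataframes survive
print(left.merge(right, on="DBN", how="inner"))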
# In[17]:

# Extracting sat_results
combined = data["sat_results"]

# Merging sat_results with ap_2010, retaining sat_results rows
combined = combined.merge(data["ap_2010"], on="DBN", how="left")

# Merging combined with graduation, retaining combined rows
combined = combined.merge(data["graduation"], on="DBN", how="left")

# **Combining `class_size`, `demographics`, `survey`, and `hs_directory`**
#
# For the second set of merges, we will use an `inner` join, where only rows that exist in both of the dataframes being merged are retained.

# In[18]:

# Merging combined with the rest of the dataframes
# Using an inner join: retain only rows that exist in both dataframes being merged

# List of dataframes
to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

# **Filling in empty rows**
#
# We will also fill in missing values with their column means. Any values that remain missing after that will be filled with zero.

# In[19]:

# Filling in missing values using column means
means = combined.mean()
combined = combined.fillna(means)
combined = combined.fillna(0)

# ### Adding a school district column to our combined dataframe
#
# Analyzing patterns by school district may be an interesting way to look at our data. Since the school district is equivalent to the first two characters of the DBN column, let's extract that and add a `"school_dist"` column to our `combined` dataframe.

# In[20]:

# Adding a school district column extracted from the DBN column
combined["school_dist"] = combined["DBN"].str[:2]

# ### Adding a borough column to our combined dataframe
#
# In the same way as the school districts, we may also want to look at patterns in larger groups (boroughs). Let's add the column based on the school districts. A directory detailing which borough each school district belongs to is available [here](http://www.newyorkschools.com/nyc-schools/).

# In[21]:

# Setting up reference for district numbers and boroughs
bronx = pd.Series([7, 8, 9, 10, 11, 12])
brooklyn = pd.Series([13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 32])
manhattan = pd.Series([1, 2, 3, 4, 5, 6, 75])
queens = pd.Series([24, 25, 26, 27, 28, 29, 30])
staten = pd.Series([31])

# Padding the values with leading zeros
bronx = bronx.astype(str).str.zfill(2).to_list()
brooklyn = brooklyn.astype(str).str.zfill(2).to_list()
manhattan = manhattan.astype(str).str.zfill(2).to_list()
queens = queens.astype(str).str.zfill(2).to_list()
staten = staten.astype(str).str.zfill(2).to_list()

# Initializing empty list
borough = []

# Populating the list
for r in range(len(combined)):
    if combined["school_dist"][r] in bronx:
        borough.append("The Bronx")
    elif combined["school_dist"][r] in brooklyn:
        borough.append("Brooklyn")
    elif combined["school_dist"][r] in manhattan:
        borough.append("Manhattan")
    elif combined["school_dist"][r] in queens:
        borough.append("Queens")
    elif combined["school_dist"][r] in staten:
        borough.append("Staten Island")

# Assigning borough series as a column in the combined dataframe
borough = pd.Series(borough)
combined["borough"] = borough.copy()

# ## Exploratory Data Analysis
#
# Now that we've finished cleaning our data and can access most of the information from the single dataframe `combined`, we can begin our analysis.

# ### Mapping SAT scores by school district
#
# Let's look at how different school districts perform by drawing them on a map. We'll generate a bubble map where the size of each bubble represents the total number of students in that district.
# Let's also make it easier to identify which borough a school district belongs to by color-coding the edges of the bubbles. We will fill the bubbles with a color that intensifies with the average SAT score of that school district.
#
# For mapping, we will use the `cartopy` package, which extends `matplotlib`'s capability to plot geographical data.

# In[22]:

# Importing modules and libraries
import cartopy
import cartopy.crs as ccrs
import cartopy.feature as cf
import matplotlib.patches as mpatches

# Setting up map boundaries
central_lon, central_lat = np.mean(combined["lon"]), np.mean(combined["lat"])
extent = [-74.20, -73.70, 40.55, 40.90]

# Creating district-level grouping
districts = (combined.
             groupby(["borough", "school_dist"])
             [["lon", "lat", "sat_score", "total_students"]].
             agg({"lon": np.mean,
                  "lat": np.mean,
                  "sat_score": np.mean,
                  "total_students": np.sum}
                 ).
             reset_index()
             )

# Setting up edgecolor mapping
colors = {"The Bronx": "blue",
          "Brooklyn": "orange",
          "Manhattan": "green",
          "Queens": "red",
          "Staten Island": "purple"
          }

# Drawing the map
figure = plt.figure(figsize=(12,12))
ax = figure.add_subplot(1,1,1, projection=ccrs.PlateCarree())
ax.set_extent(extent)

# Adding features
ax.add_feature(cf.BORDERS)
ax.add_feature(cf.OCEAN, color="steelblue", alpha=0.3)
ax.add_feature(cf.LAND, color="green", alpha=0.2)
ax.add_feature(cf.STATES, alpha=0.5)

# Plotting the school districts
plt.scatter(x=districts["lon"],
            y=districts["lat"],
            c=districts["sat_score"],
            edgecolors=districts['borough'].map(colors),
            linewidth=3,
            s=districts["total_students"] / 5,
            alpha=0.8,
            cmap=cm.Reds,
            transform=ccrs.PlateCarree()
            )

# Adding custom legend for edgecolors
bronx_patch = mpatches.Patch(color="blue", label="The Bronx", alpha=0.5, lw=0.05)
brooklyn_patch = mpatches.Patch(color="orange", label="Brooklyn", alpha=0.5, lw=0.05)
manhattan_patch = mpatches.Patch(color="green", label="Manhattan", alpha=0.5, lw=0.05)
queens_patch = mpatches.Patch(color="red", label="Queens", alpha=0.5, lw=0.05)
staten_patch = mpatches.Patch(color="purple", label="Staten Island", alpha=0.5, lw=0.05)
plt.legend(handles=[bronx_patch, brooklyn_patch, manhattan_patch, queens_patch, staten_patch],
           loc='upper left'
           )

# Adding a few more map elements
plt.colorbar(shrink=0.55)
plt.title("Average SAT Scores by NYC School District (Bubble Size = Total Number of Students)", size=18)
plt.tight_layout()
plt.show()

# Poor-performing schools appear to be clustered together (school districts in The Bronx and Brooklyn).
#
# We can observe a wide discrepancy in average SAT scores even within the same borough. For example, one district in Brooklyn registers a very high average SAT score (dark red) while smaller districts within the same borough are some of the worst-performing in terms of SAT scores.
#
# Most school districts in The Bronx have relatively lower average SAT scores compared to school districts in other boroughs. Meanwhile, schools in Queens and Manhattan have relatively higher average SAT scores. The single school district in Staten Island also registers a relatively high average SAT score.

# ### Check correlations
#
# Since our analysis is focused on how SAT scores vary with school characteristics, we will first check how all of the columns correlate with average SAT scores. There is no general rule of thumb as to what correlation values are interesting, but for this project let's set an arbitrary threshold of 0.30 (negative and positive).
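# As a quick refresher before we compute anything: the Pearson coefficient measures the strength of the *linear* association between two variables. Below is a minimal sketch (toy values, not from our data) computing it by hand and with pandas:

# In[ ]:

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r: covariance of x and y divided by the product of their standard deviations
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)

# pandas computes the same quantity
print(r_manual, x.corr(y))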
# For more information on the Pearson correlation coefficient, you may check this [link](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).

# In[23]:

# Generating correlation table
correlations = combined.corr()

# Filtering to only include correlations with sat_score
correlations = correlations["sat_score"]

# Filtering to only include correlation values greater than |0.30|
cols_corr_filtered = correlations[(correlations > 0.30) | (correlations < -0.30)]
cols_corr_filtered.sort_values()

# Below is a table presenting information on the columns that are highly correlated with SAT scores. The columns describing SAT scores and their components are not included, since those are self-explanatory.
#
# | source dataframe | column | column info |
# | --- | --- | --- |
# | **`hs_directory`** | `total_students` | total number of students enrolled |
# | **`class_size`** | `NUMBER OF STUDENTS / SEATS FILLED` | number of students enrolled / seats filled |
# | **`class_size`** | `NUMBER OF SECTIONS` | number of sections per program/subject |
# | **`class_size`** | `AVERAGE CLASS SIZE` | average class size for school |
# | **`class_size`** | `SIZE OF LARGEST CLASS` | largest class size for school |
# | **`ap_2010`** | `AP Test Takers` | number of students who have taken an AP test per school |
# | **`ap_2010`** | `Total Exams Taken` | number of AP exams taken in school |
# | **`ap_2010`** | `Number of Exams with scores 3 4 or 5` | number of exams taken with a score of 3, 4, or 5 |
# | **`graduation`** | `Total Cohort` | total number of students in the graduation cohort |
# | **`demographics`** | `frl_percent` | percentage free and reduced lunch |
# | **`demographics`** | `total_enrollment` | total enrollment count of students |
# | **`demographics`** | `ell_percent` | percentage English language learners |
# | **`demographics`** | `sped_percent` | percentage special education students |
# | **`demographics`** | `asian_num` | total number of Asian students |
# | **`demographics`** | `asian_per` | percentage Asian students |
# | **`demographics`** | `hispanic_per` | percentage Hispanic students |
# | **`demographics`** | `white_num` | total number of white students |
# | **`demographics`** | `white_per` | percentage white students |
# | **`demographics`** | `male_num` | total number of male students |
# | **`demographics`** | `female_num` | total number of female students |
# | **`survey`** | `N_s` | number of student respondents |
# | **`survey`** | `N_p` | number of parent respondents |
# | **`survey`** | `saf_t_11` | safety and respect score - teacher responses |
# | **`survey`** | `saf_s_11` | safety and respect score - student responses |
# | **`survey`** | `aca_s_11` | academic expectations score - student responses |
# | **`survey`** | `saf_tot_11` | safety and respect total score |

# Let's graph the filtered correlations with SAT score in a horizontal bar chart so they're easier to see.
# In[24]:

# Setting up a mask for mapping colors
negative = cols_corr_filtered < 0

# Plotting horizontal bar chart for correlations
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(1,1,1)
ax = cols_corr_filtered.plot.barh(fontsize=15,
                                  width=0.8,
                                  color=negative.map({True: "indianred", False: "steelblue"})
                                  )
ax.set_title("Strong Correlations with average SAT score", size=20)
ax.set_xlim(-1,1)
ax.axvline(0.5, color="dimgray", alpha=0.6, linestyle="--")
ax.axvline(-0.5, color="dimgray", alpha=0.6, linestyle="--")
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
plt.show()

# In the step above, we looked at the correlation of different columns with average SAT score. To minimize clutter, we filtered to include only columns with correlation coefficients greater than 0.3 or less than -0.3.
#
# Here are some initial observations:
# - The high correlation coefficients of `SAT Math Avg. Score`, `SAT Critical Reading Avg. Score`, and `SAT Writing Avg. Score` are trivial, since `sat_score` was constructed as the sum of these three components.
# - The percentages of Asian and white students in a school (`asian_per`, `white_per`) have a strong positive correlation with SAT scores. Conversely, the percentage of Hispanic students in a school (`hispanic_per`) has a negative correlation with SAT scores.
# - The percentage of students who avail of free and reduced-cost school lunch (`frl_percent`) has a strong negative correlation with SAT scores. This implies that schools with a higher percentage of students from low-income households perform worse, on average, on SATs.
# - Other demographic characteristics, such as the percentage of English language learners (`ell_percent`) and the percentage of special education students (`sped_percent`), also have pronounced negative correlations with SAT scores.
# - Somewhat surprisingly, schools with larger populations appear to perform better on SATs. This is supported by the positive correlations of measures of school size such as AP test takers, number of parent and student survey respondents, and total students, among others.
# - Higher safety and respect scores, as measured by survey responses (`saf_s_11`, `saf_t_11`, `saf_tot_11`), appear to be positively correlated with SAT scores.

# ### Plotting survey correlations
#
# Let's look more closely at the correlations of survey responses with SAT scores. Survey responses give us an idea of how stakeholders (students, teachers, and parents) perceive their respective schools. The numbers from the surveys are more of a perception measure, which is arguably important since schools exist to educate students while keeping teachers and parents satisfied.
#
# This exercise will be similar to the previous one. However, we won't filter out weak correlations this time around.

# In[25]:

# Removing DBN since it's a unique identifier, not a useful numerical value for correlation
survey_fields.remove("DBN")

# Creating new map for color-coding positive and negative values
negative = combined.corr()["sat_score"][survey_fields] < 0

fig = plt.figure(figsize=(15,15))

# Plotting bar chart
ax = fig.add_subplot(1,1,1)
ax = combined.corr()["sat_score"][survey_fields].plot.barh(fontsize=15,
                                                           width=0.6,
                                                           color=negative.map({True: "indianred", False: "steelblue"})
                                                           )
ax.set_title("Survey Response Statistics Correlation with SAT Scores", size=20)
ax.set_xlim(-0.5, 0.5)
ax.axvline(0.2, color="dimgray", alpha=0.6, linestyle="--")
ax.spines["right"].set_visible(False)
ax.spines["left"].set_visible(False)
ax.spines["top"].set_visible(False)
ax.spines["bottom"].set_visible(False)
plt.show()

# Only a few columns have positive correlations (with SAT scores) greater than 0.2. These are:
#
# | Field Name | Field Description |
# | --- | --- |
# | `rr_s` | Student Response Rate |
# | `N_s` | Number of student respondents |
# | `N_t` | Number of teacher respondents |
# | `N_p` | Number of parent respondents |
# | `saf_t_11` | Safety and Respect score based on teacher responses |
# | `saf_s_11` | Safety and Respect score based on student responses |
# | `eng_s_11` | Engagement score based on student responses |
# | `aca_s_11` | Academic expectations score based on student responses |
# | `saf_tot_11` | Safety and Respect total score |
#
# As we've already noted, safety and respect perceptions seem to be important in achieving higher SAT scores. The size of the school, as proxied by the number of student, teacher, and parent respondents, is also positively correlated with SAT scores. We are not sure why this is the case: our first guess would have been that bigger schools and classes mean less attention from teachers for each student, resulting in worse SAT performance. It is possible that bigger schools have bigger budgets and better facilities, but verifying these hypotheses is beyond the scope of our analysis for now.
#
# Interestingly, the substantial positive correlation of response rates can only be seen for the student response rate, not for teachers and parents. This may be because high-performing students are more motivated or have more spare time to answer surveys.
#
# There are two columns showing negative correlations with SAT scores: teacher response rate (`rr_t`) and communication score based on parent responses (`com_p_11`). These negative correlations are small, so we can't really draw any conclusions from them.

# ### Higher safety and respect perceptions are weakly associated with higher average SAT scores
#
# **Plotting Safety and Respect Scores by School District**
#
# Let's create a plot similar to the one we made earlier for SAT scores. This time, let's check how safety and respect scores (as represented by `saf_tot_11`) vary across school districts and boroughs.

# In[26]:

# List of columns of interest
survey_sat_loc = ["saf_tot_11", "lon", "lat", "sat_score", "total_students"]

# Creating district-level grouping
districts_safety = (combined.
                    groupby(["borough", "school_dist"])
                    [survey_sat_loc].
                    agg({"lon": np.mean,
                         "lat": np.mean,
                         "sat_score": np.mean,
                         "saf_tot_11": np.mean,
                         "total_students": np.sum}
                        ).
                    reset_index()
                    )

# Setting up edgecolor mapping
colors = {"The Bronx": "blue",
          "Brooklyn": "orange",
          "Manhattan": "green",
          "Queens": "red",
          "Staten Island": "purple"
          }

# Drawing the map
figure = plt.figure(figsize=(12,12))
ax = figure.add_subplot(1,1,1, projection=ccrs.PlateCarree())
ax.set_extent(extent)

# Adding features
ax.add_feature(cf.BORDERS)
ax.add_feature(cf.OCEAN, color="steelblue", alpha=0.3)
ax.add_feature(cf.LAND, color="green", alpha=0.2)
ax.add_feature(cf.STATES, alpha=0.5)

# Plotting the school districts
plt.scatter(x=districts_safety["lon"],
            y=districts_safety["lat"],
            c=districts_safety["saf_tot_11"],
            edgecolors=districts_safety['borough'].map(colors),
            linewidth=3,
            s=districts_safety["total_students"] / 5,
            alpha=0.8,
            cmap=cm.coolwarm_r,
            transform=ccrs.PlateCarree()
            )

# Adding custom legend for edgecolors
bronx_patch = mpatches.Patch(color="blue", label="The Bronx", alpha=0.5, lw=0.05)
brooklyn_patch = mpatches.Patch(color="orange", label="Brooklyn", alpha=0.5, lw=0.05)
manhattan_patch = mpatches.Patch(color="green", label="Manhattan", alpha=0.5, lw=0.05)
queens_patch = mpatches.Patch(color="red", label="Queens", alpha=0.5, lw=0.05)
staten_patch = mpatches.Patch(color="purple", label="Staten Island", alpha=0.5, lw=0.05)
plt.legend(handles=[bronx_patch, brooklyn_patch, manhattan_patch, queens_patch, staten_patch],
           loc='upper left'
           )

# Adding a few more map elements
plt.colorbar(shrink=0.55)
plt.title("Safety and Respect Perception by NYC School District (Bubble Size = Total Number of Students)", size=15)
plt.tight_layout()
plt.show()

# The map above shows safety and respect scores by school district. "Hot spots" where safety and respect are perceived to be low have warmer (redder) colors, while districts perceived to be safer have cooler (bluer) colors. There is one school district in Brooklyn with a very low safety and respect perception.
#
# The safety and respect score map shares many similarities with the SAT score map. Brooklyn needs closer inspection, as it has a cluster of school districts with low safety and respect scores. These school districts also coincide with low average SAT scores.

# **Safety Scores by Borough**
#
# Due to the wide range of safety scores across school districts, even within the same boroughs, let's check whether any pattern is more evident when we analyze by borough.

# In[27]:

# List of columns of interest
survey_sat_loc = ["saf_tot_11", "lon", "lat", "sat_score", "total_students"]

# Creating borough-level grouping
borough_safety = (combined.
                  groupby("borough")
                  [survey_sat_loc].
                  agg({"lon": np.mean,
                       "lat": np.mean,
                       "sat_score": np.mean,
                       "saf_tot_11": np.mean,
                       "total_students": np.sum}
                      ).
                  reset_index()
                  )

# Setting up edgecolor mapping
colors = {"The Bronx": "blue",
          "Brooklyn": "orange",
          "Manhattan": "green",
          "Queens": "red",
          "Staten Island": "purple"
          }

# Drawing the safety map for boroughs
figure = plt.figure(figsize=(12,12))
ax = figure.add_subplot(1,1,1, projection=ccrs.PlateCarree())
ax.set_extent(extent)

# Adding features
ax.add_feature(cf.BORDERS)
ax.add_feature(cf.OCEAN, color="steelblue", alpha=0.3)
ax.add_feature(cf.LAND, color="green", alpha=0.2)
ax.add_feature(cf.STATES, alpha=0.5)

# Plotting the borough safety score bubbles
plt.scatter(x=borough_safety["lon"],
            y=borough_safety["lat"],
            c=borough_safety["saf_tot_11"],
            edgecolors=borough_safety['borough'].map(colors),
            linewidth=3,
            s=borough_safety["total_students"] / 5,
            alpha=0.8,
            cmap=cm.coolwarm_r,
            transform=ccrs.PlateCarree()
            )

# Adding custom legend for edgecolors
bronx_patch = mpatches.Patch(color="blue", label="The Bronx", alpha=0.5, lw=0.05)
brooklyn_patch = mpatches.Patch(color="orange", label="Brooklyn", alpha=0.5, lw=0.05)
manhattan_patch = mpatches.Patch(color="green", label="Manhattan", alpha=0.5, lw=0.05)
queens_patch = mpatches.Patch(color="red", label="Queens", alpha=0.5, lw=0.05)
staten_patch = mpatches.Patch(color="purple", label="Staten Island", alpha=0.5, lw=0.05)
plt.legend(handles=[bronx_patch, brooklyn_patch, manhattan_patch, queens_patch, staten_patch],
           loc='upper left'
           )

# Adding a few more map elements
plt.colorbar(shrink=0.55)
plt.title("Safety and Respect Perception by NYC Borough (Bubble Size = Total Number of Students)", size=16)
plt.tight_layout()
plt.show()

# Drawing the background map for boroughs
figure = plt.figure(figsize=(12,12))
ax = figure.add_subplot(1,1,1, projection=ccrs.PlateCarree())
ax.set_extent(extent)

# Adding features
ax.add_feature(cf.BORDERS)
ax.add_feature(cf.OCEAN, color="steelblue", alpha=0.3)
ax.add_feature(cf.LAND, color="green", alpha=0.2)
ax.add_feature(cf.STATES, alpha=0.5)

# Plotting the borough SAT score bubbles
plt.scatter(x=borough_safety["lon"],
            y=borough_safety["lat"],
            c=borough_safety["sat_score"],
            edgecolors=borough_safety['borough'].map(colors),
            linewidth=3,
            s=borough_safety["total_students"] / 5,
            alpha=0.8,
            cmap=cm.coolwarm_r,
            transform=ccrs.PlateCarree()
            )
plt.legend(handles=[bronx_patch, brooklyn_patch, manhattan_patch, queens_patch, staten_patch],
           loc='upper left'
           )

# Adding a few more map elements
plt.colorbar(shrink=0.55)
plt.title("Average SAT Score by NYC Borough (Bubble Size = Total Number of Students)", size=16)
plt.tight_layout()
plt.show()

# Perceptions of safety and respect do not exhibit a very strong correlation with average SAT scores even when we inspect them at the borough level. Some notes:
# - The Bronx has a relatively decent safety score but is the worst-performing borough in terms of SAT scores.
# - On the other hand, Staten Island does not fare very well in terms of safety perception but actually has the highest average SAT score among the boroughs.
# - Manhattan gets the highest safety and respect score but is middling in SAT performance relative to the other four boroughs.

# **Scatter Plot for SAT Scores and Safety and Respect Scores**
#
# Let's take a closer look and check whether there is a robust or apparent pattern in the relationship between safety perception and SAT scores.
# In[28]:

# Plotting sat_score vs. saf_tot_11
combined.plot.scatter("sat_score",
                      "saf_tot_11",
                      figsize=(12,8),
                      s=150,
                      alpha=0.6,
                      edgecolors='darkslategrey',
                      fontsize=15
                      )
plt.title('SAT Score vs Safety and Respect Score', fontsize=22)
plt.xlabel("Average SAT Score", fontsize=15)
plt.ylabel("Safety and Respect Score (Total)", fontsize=15)
plt.show()

# Upon inspecting the scatter plot, we seem to have some evidence of a positive correlation between SAT scores and safety and respect perception, illustrated by the gentle upward-sloping pattern traced by the dots. This is not very surprising, since students who go to schools that are perceived to be safer and more respectful can focus more on their lessons, allowing them to perform better in standardized tests. What's actually more surprising is that the correlation is not more pronounced.

# ### Race Demographics in Schools are Correlated with Average SAT Scores
#
# Let's revisit our earlier note that the percentage of students in a school who come from specific races has a substantial correlation with the school's SAT performance. We can examine this visually using scatter plots for white, Asian, and Hispanic representation.

# In[29]:

# Initiating figure object
fig = plt.figure(figsize=(15, 5))

# Setting up lists to be used for plotting
race_cols = ["white_per", "asian_per", "hispanic_per"]
titles = ["Average SAT Score vs White Representation",
          "Average SAT Score vs Asian Representation",
          "Average SAT Score vs Hispanic Representation"]

# Generating scatter plots
for r in range(3):
    ax = fig.add_subplot(1,3,r+1)
    ax.scatter(combined["sat_score"],
               combined[race_cols[r]],
               alpha=0.6,
               edgecolors="darkslategrey"
               )
    ax.set_xlim(800, 2100)
    ax.set_ylim(-5,110)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.tick_params(labelleft=False)
    ax.set_title(titles[r], fontsize=15)
    ax.axhline(50,
               color='red',
               linestyle="--",
               alpha=0.5
               )
    if r == 0:
        ax.spines["left"].set_visible(True)
        ax.tick_params(labelleft=True)
        ax.set_ylabel("Percent Student Population", fontsize=15)
    if r == 1:
        ax.set_xlabel("Average SAT Score", fontsize=15)

fig.tight_layout()
plt.show()

# Both white and Asian representation are associated with higher average SAT scores, although the correlation is not very pronounced. What is very pronounced, however, is the negative correlation between SAT scores and Hispanic representation. This is a major red flag, as it could indicate that the SAT is systematically "unfair".
#
# Another angle from which we could examine these findings is to check whether diversity has an effect on a school's average SAT performance. We mention this because there are plenty of low-scoring schools with more than 50% Hispanic students. Also, if we look at the points above the red dotted line (50%) in all three graphs, the positive correlation is no longer evident for white and Asian representation, although the negative correlation for Hispanic representation persists.
#
# Further examination would require us to construct a measure of diversity and then explore whether diversity is correlated with average SAT score.

# ### Gender demographics in a school don't appear to have a strong correlation with SAT performance
#
# Let us examine how the gender make-up of schools correlates with SAT performance.
# In[30]:

# Initiating figure object
fig = plt.figure(figsize=(8, 12))

# Setting up lists to be referred to when plotting
gender_cols = ["female_per", "female_num"]
titles = ["Average SAT Score vs Percentage Female",
          "Average SAT Score vs Total Number of Female Students"
          ]
y_labels = ["Percent Students - Female",
            "Total Number of Female Students"]

# Generating scatter plots
for r in range(2):
    ax = fig.add_subplot(2,1,r+1)
    ax.scatter(combined["sat_score"],
               combined[gender_cols[r]],
               alpha=0.6,
               edgecolors="darkslategrey",
               s=100
               )
    ax.set_xlim(800, 2100)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.set_title(titles[r], fontsize=15)
    ax.set_ylabel(y_labels[r], fontsize=15)
    ax.tick_params(labelbottom=False)
    if r == 1:
        ax.tick_params(labelbottom=True)
        ax.set_xlabel("Average SAT Score", fontsize=15)
        ax.spines["bottom"].set_visible(True)

fig.tight_layout()
plt.show()

# No clear pattern pops out in the scatter plot of percentage of female students vs average SAT score. There is a subtle hint of an upward-sloping pattern, but it isn't very clear.
#
# While there is some form of positive relationship between the total number of female students and average SAT score, this relationship may simply be driven by the relationship between school size and average SAT score.

# In[31]:

# Initiating figure object
fig = plt.figure(figsize=(8, 12))

# Setting up lists to be used for plotting
gender_cols = ["male_per", "male_num"]
titles = ["Average SAT Score vs Percentage Male",
          "Average SAT Score vs Total Number of Male Students"
          ]
y_labels = ["Percent Students - Male",
            "Total Number of Male Students"]

# Generating scatter plots
for r in range(2):
    ax = fig.add_subplot(2,1,r+1)
    ax.scatter(combined["sat_score"],
               combined[gender_cols[r]],
               alpha=0.6,
               edgecolors="darkslategrey",
               s=100
               )
    ax.set_xlim(800, 2100)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.set_title(titles[r], fontsize=15)
    ax.set_ylabel(y_labels[r], fontsize=15)
    ax.tick_params(labelbottom=False)
    if r == 1:
        ax.tick_params(labelbottom=True)
        ax.set_xlabel("Average SAT Score", fontsize=15)
        ax.spines["bottom"].set_visible(True)

fig.tight_layout()
plt.show()

# Once again, no pronounced pattern can be seen between the percentage of male students and average SAT scores. There are some hints of a downward pattern, but it is not clear enough for us to make even preliminary conclusions.

# ### Schools with more students who take AP exams and do well also tend to perform better in terms of SAT scores
#
# Advanced Placement (AP) courses offer students who take them and their corresponding exams the opportunity to earn university or college credits even before entering a higher education institution.
#
# The description of AP exams from the College Board website states:
# >AP Exams are standardized exams designed to measure how well you've mastered the content and skills of a specific AP course. Most AP courses have an end-of-year paper-and-pencil exam, but a few courses have different ways to assess what you've learned—for example, AP Art and Design students submit a portfolio of work for scoring.
#
# For more information on AP exams and their mechanics, click this [link](https://apstudents.collegeboard.org/about-ap-exams).
#
# Before we can analyze how AP scores relate to SAT scores, we need to do some extra processing.
# The columns `AP Test Takers `, `Total Exams Taken`, and `Number of Exams with scores 3 4 or 5` are correlated with SAT scores, but they are all measures of school size (which is itself positively correlated with SAT scores).
#
# We will create a couple of new columns that measure performance instead: `ap_per`, the percentage of students who took an AP exam, and `over3_ap_per`, the percentage of exams with a score of 3 or higher.
#
# We will first check whether the numbers in the columns we will use to create the new columns make sense. The number of exams with scores of 3 or higher should never be greater than the number of exams taken.

# In[32]:

# Creating scatter plot
plt.scatter(x=combined["Number of Exams with scores 3 4 or 5"],
            y=combined["Total Exams Taken"],
            alpha=0.7,
            marker=".",
            s=30
            )
plt.xlabel("exams 3 or higher")
plt.ylabel("total exams taken")
plt.xlim(0,500)
plt.ylim(0,500)

# Adding a 45-degree diagonal line
plt.plot([0,3800], [0,3800], c='red', alpha=0.5)
plt.show()

# There are clearly cases (dots to the right of and below the red diagonal line) where there are more exams with scores of 3 or higher than total AP exams taken. We will rectify this and then proceed with our analysis.

# In[33]:

# Correcting values for exams 3 or higher
combined.loc[combined["Number of Exams with scores 3 4 or 5"] > combined["Total Exams Taken"],
             "Number of Exams with scores 3 4 or 5"] = combined["Total Exams Taken"]

# Creating new columns to measure AP performance per school
combined["ap_per"] = 100 * combined["AP Test Takers "] / combined["total_enrollment"]
combined["over3_ap_per"] = 100 * combined["Number of Exams with scores 3 4 or 5"] / combined["Total Exams Taken"]

# Initiating figure
fig = plt.figure(figsize=(15, 8))

# Setting up lists to be used for plotting
AP_cols = ["ap_per", "over3_ap_per"]
titles = ["Average SAT Score vs Percentage AP Test Takers",
          "Average SAT Score vs Percentage AP Exams with Scores 3 or Higher"
          ]
y_labels = ["Percentage Students - AP Test Takers",
            "Percentage AP Exams with Scores 3 or Higher"]

# Generating scatter plots
for r in range(2):
    ax = fig.add_subplot(1,2,r+1)
    ax.scatter(combined["sat_score"],
               combined[AP_cols[r]],
               alpha=0.6,
               edgecolors="darkslategrey"
               )
    ax.set_xlim(800, 2200)
    ax.set_ylim(-5, 110)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(True)
    ax.spines["left"].set_visible(False)
    ax.set_title(titles[r], fontsize=15)
    ax.set_ylabel(y_labels[r], fontsize=12)
    ax.set_xlabel("Average SAT Score", fontsize=12)

fig.tight_layout()
plt.show()

# Unsurprisingly, the larger the percentage of AP exams with scores of 3 or higher, the better the SAT scores. This correlation is very pronounced but not really very interesting, since schools where students do well on AP exams are likely to do well on other tests such as the SAT.
#
# There is also a positive correlation between the percentage of students who take AP exams in a school and average SAT scores, although this relationship is not as strong as that of AP exam performance vs SAT score. It may be that students who take the AP exams are those who assess themselves to be qualified for advanced placement, and hence also tend to do well on their SATs.

# ### Larger schools with larger classes tend to perform better on SATs
#
# Let's now look at different measures of class and school size and how these correlate with average SAT scores.
# In[34]:

# Initiating figure
fig = plt.figure(figsize=(15, 15))

# Initiating lists to be used for plotting
size_cols = ["NUMBER OF STUDENTS / SEATS FILLED",
             "NUMBER OF SECTIONS",
             "AVERAGE CLASS SIZE",
             "SIZE OF LARGEST CLASS"
             ]
titles = ["Average SAT Score vs Number of Classes Per Grade",
          "Average SAT Score vs Number of Sections Per Program",
          "Average SAT Score vs Average Class Size",
          "Average SAT Score vs Size of Largest Class"
          ]
y_labels = ["Number of Classes per Grade",
            "Number of Sections per Program",
            "Average Class Size",
            "Size of Largest Class"
            ]

# Generating scatter plots
for r in range(4):
    ax = fig.add_subplot(2,2,r+1)
    ax.scatter(combined["sat_score"],
               combined[size_cols[r]],
               alpha=0.6,
               edgecolors="darkslategrey"
               )
    ax.set_xlim(800, 2200)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.set_title(titles[r], fontsize=15)
    ax.set_ylabel(y_labels[r], fontsize=12)
    ax.tick_params(labelbottom=False)
    if r >= 2:
        ax.spines["bottom"].set_visible(True)
        ax.set_xlabel("Average SAT Score", fontsize=12)
        ax.tick_params(labelbottom=True)

fig.tight_layout()
plt.show()

# The scatter plots verify what the correlation coefficients suggested earlier in our analysis: schools with larger classes and larger student populations tend to perform better on SATs (although the correlation is not particularly high). In all of the scatter plots we can see a somewhat upward-sloping pattern traced by the dots. This is most pronounced in the "Average SAT Score vs Average Class Size" plot.

# ### Average SAT scores are highly correlated with certain school demographics
#
# For the last part of our analysis, at least for this particular project, we will look at how certain characteristics of a school's student population correlate with SAT performance.
#
# We will look at the following indicators:
# - percentage of students who receive free/reduced-cost lunches
# - percentage of students who are learning the English language
# - percentage of students who are under the special education program
#
# The percentage of students who receive free lunches can be treated as a good proxy for income level.
#
# The percentage of students who are learning the English language reflects two things: the number of immigrants and income levels. Immigrants who do not yet know English are more likely to come from low-income households (as opposed to immigrants who have good prior education and are already proficient in English).
#
# The special education program is designed for students who have some form of disability and require specialized services. More information on special education can be found at this [link](https://www.schools.nyc.gov/learning/special-education/preschool-to-age-21/special-education-in-nyc).
# In[35]:

# Generating scatter plots
fig = plt.figure(figsize=(15, 5))

demog_cols = ["frl_percent",
              "ell_percent",
              "sped_percent",
              ]
titles = ["Average SAT Score vs Percentage Free and Reduced Lunch",
          "Average SAT Score vs Percentage English Language Learners",
          "Average SAT Score vs Percentage Special Education"
          ]
y_labels = ["percentage students receiving free/reduced-cost lunch",
            "percentage students learning english language",
            "percentage students under special education program"
            ]

for r in range(3):
    ax = fig.add_subplot(1, 3, r+1)
    ax.scatter(combined["sat_score"],
               combined[demog_cols[r]],
               alpha=0.6,
               edgecolors="darkslategrey"
               )
    ax.set_xlim(800, 2200)
    ax.set_ylim(-5, 110)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ax.spines["bottom"].set_visible(True)
    ax.spines["left"].set_visible(False)
    ax.set_title(titles[r], fontsize=12)
    ax.set_ylabel(y_labels[r], fontsize=10)
    ax.set_xlabel("Average SAT Score", fontsize=10)
    ax.tick_params(labelleft=False)
    ax.tick_params(labelbottom=True)
    ax.axhline(50, c="red", linestyle="--", alpha=0.6)
    if r == 0:
        ax.tick_params(labelleft=True)
        ax.spines["left"].set_visible(True)
fig.tight_layout()
plt.show()

# As can be seen in the left-most scatter plot, schools where many students receive free/reduced-cost lunches tend to do much worse on SATs. The downward-sloping shape is very prominent, corroborating the strongly negative correlation coefficient we saw earlier.
#
# The same downward-sloping pattern can be seen in the other two scatter plots, although it is not as pronounced there.
#
# The three scatter plots provide some evidence that SATs may unintentionally discriminate against certain demographics, particularly students from low-income households and students whose first language is not English.

# ## Conclusions and Take-Aways

# We have reached the end of this data cleaning and preliminary analysis project. We started with a general question about what patterns emerge when we look at SAT scores alongside school characteristics. Of particular interest was finding evidence on whether the SAT is inherently fair.
#
# ![exam.jpg](attachment:exam.jpg)
#
# We summarize our main findings below:
# - Average SAT scores vary greatly within boroughs. Some of the best-performing school districts are located in Queens.
# - There are clusters of school districts in both Brooklyn and the Bronx that perform poorly on SATs.
# - Most indicators with a high correlation (whether negative or positive) have to do with demographics. This suggests that the SAT, in its current iteration, may leave some groups or demographics at a disadvantage.
# - Higher safety and respect perceptions (by students, teachers, and parents) are weakly associated with higher average SAT scores: in schools where people feel safe and respected, students tend to do slightly better on SATs. Note, however, that this relationship is neither strong nor robust; the bubble maps of safety/respect scores alongside SAT scores by school district and by borough showed that it is not consistent across the entirety of NYC.
# - Queens has (subjectively) the best combination of safety score and average SAT performance. Its schools are perceived as relatively safe and respectful environments, and it is the second best-performing borough in terms of average SAT scores.
# - Brooklyn, on the other hand, is the worst-performing borough in terms of safety/respect perception and the second worst in terms of average SAT scores. Note, however, the large disparity in outcomes and safety perception within Brooklyn: the best-performing school district in terms of SAT scores is actually located there.
# - Gender demographics do not appear to be strongly correlated with average SAT scores. This gives us some evidence to dismiss suspicions that the SAT is gender-biased.
# - The percentages of students who are white or Asian are positively correlated with average SAT scores, while the percentage of Hispanic students in a school is negatively correlated. This is something to take note of and investigate further.
# - Schools where more students take AP courses and the corresponding exams also tend to perform better on SATs.
# - Surprisingly, larger schools with more sections and larger classes tend to perform better on SATs. At first glance this is counter-intuitive, since we would assume that classes with more students are harder to teach effectively. This also requires more investigation.
# - While the SAT does not appear to be gender-biased, it does appear to be strongly income-biased: schools with a large proportion of students from immigrant or low-income households register lower average SAT scores. Policy-makers and test designers should take note of this so they can improve how the SAT is designed, minimizing the impact of low income or being a non-native English speaker on a student's performance.
#
# There are still many angles from which the available data can be explored and analyzed. We would like to note that the New York City Department of Education's transparency in publishing these data allows it to crowdsource the analysis.
#
# One hypothesis we formulated while doing the analysis above is that diversity may have a positive impact on academic performance (including SAT scores). Diversity may be the underlying reason why larger schools with larger classes perform better; its absence may also be one of the reasons why schools with relatively homogeneous, largely Hispanic student populations performed poorly on the SAT.
#
# Interested researchers may want to construct a new multidimensional index that combines measures of diversity and income, and investigate how it correlates not only with SAT scores but also with broader academic performance and, eventually, socioeconomic mobility and success. A minimal starting sketch follows.
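# As a rough starting point, the sketch below builds a simple Herfindahl-style diversity score and correlates it with average SAT scores. This is only an illustration, not the proposed multidimensional index: it assumes `combined` contains racial share columns named `white_per`, `asian_per`, `black_per`, and `hispanic_per` (the columns behind the race correlations noted above), expressed as percentages.

# In[ ]:

# Sketch: 1 minus the sum of squared racial shares per school;
# 0 means fully homogeneous, values near 0.75 mean evenly mixed
race_cols = ["white_per", "asian_per", "black_per", "hispanic_per"]
shares = combined[race_cols] / 100
combined["diversity_score"] = 1 - (shares ** 2).sum(axis=1)

# Correlating the crude diversity score with average SAT score
print(combined["diversity_score"].corr(combined["sat_score"]))

# A positive coefficient here would be consistent with the diversity hypothesis above, though a serious investigation would also need to control for income, school size, and the other factors discussed in this project.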