by Raghav_A
New York City has a large immigrant population and is very diverse, so comparing SAT scores against demographic factors such as race, income, and gender is a good way to examine whether the SAT is a fair test. Survey data from NYC schools also lets us compare SAT scores with school-safety scores, average class size, and the number of AP test takers. Let's see if we can find some useful correlations.
import pandas as pd
import numpy
import re
import matplotlib.pyplot as plt
data_files = [
"ap_2010.csv",
"class_size.csv",
"demographics.csv",
"graduation.csv",
"hs_directory.csv",
"sat_results.csv"
]
data = {}
for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)
survey["DBN"] = survey["dbn"]
survey_fields = [
"DBN",
"rr_s",
"rr_t",
"rr_p",
"N_s",
"N_t",
"N_p",
"saf_p_11",
"com_p_11",
"eng_p_11",
"aca_p_11",
"saf_t_11",
"com_t_11",
"eng_t_11",
"aca_t_11",
"saf_s_11",
"com_s_11",
"eng_s_11",
"aca_s_11",
"saf_tot_11",
"com_tot_11",
"eng_tot_11",
"aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
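The DBN built above is the zero-padded district number followed by the school code. As a minimal sketch of the same padding step, assuming CSD values are non-negative integers, Python's built-in `str.zfill` gives the same result as `pad_csd`. The CSD and school-code values below are made up for illustration.

```python
def pad_csd(num):
    """Left-pad a district number to two characters, e.g. 1 -> "01"."""
    return str(num).zfill(2)

# Hypothetical CSD and school-code values, for illustration only
example_csd = 1
example_school_code = "M015"

example_dbn = pad_csd(example_csd) + example_school_code
print(example_dbn)  # 01M015
```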
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
def find_lat(loc):
    # Raw string so the backslashes reach the regex engine unchanged
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat
def find_lon(loc):
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(numpy.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
def get_first_two_chars(dbn):
    return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
correlations = combined.corr()
correlations['sat_score']
SAT Critical Reading Avg. Score   0.986820
SAT Math Avg. Score               0.972643
SAT Writing Avg. Score            0.987771
sat_score                         1.000000
AP Test Takers                    0.523140
...
priority09                             NaN
priority10                             NaN
lat                              -0.121029
lon                              -0.132222
ap_per                            0.057171
Name: sat_score, Length: 68, dtype: float64
correlations[abs(correlations['sat_score'])>0.25]['sat_score'].sort_values()
frl_percent                           -0.722225
sped_percent                          -0.448170
ell_percent                           -0.398750
hispanic_per                          -0.396985
black_per                             -0.284139
N_t                                    0.291463
saf_t_11                               0.313810
SIZE OF LARGEST CLASS                  0.314434
saf_tot_11                             0.318753
Total Cohort                           0.325144
male_num                               0.325520
saf_s_11                               0.337639
aca_s_11                               0.339435
NUMBER OF SECTIONS                     0.362673
total_enrollment                       0.367857
AVERAGE CLASS SIZE                     0.381014
female_num                             0.388631
NUMBER OF STUDENTS / SEATS FILLED      0.394626
total_students                         0.407827
N_p                                    0.421530
N_s                                    0.423463
white_num                              0.449559
Number of Exams with scores 3 4 or 5   0.463245
asian_num                              0.475445
Total Exams Taken                      0.514333
AP Test Takers                         0.523140
asian_per                              0.570730
white_per                              0.620718
SAT Math Avg. Score                    0.972643
SAT Critical Reading Avg. Score        0.986820
SAT Writing Avg. Score                 0.987771
sat_score                              1.000000
Name: sat_score, dtype: float64
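For reference, `DataFrame.corr()` computes Pearson's r by default. For one column x and sat_score y, measured across n schools, each value listed above is:

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

Here x̄ and ȳ are the column means. The value ranges from -1 to 1 and only measures linear association.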
Several fields in the combined dataset originally came from a survey of parents, teachers, and students. I will make a bar plot of the correlations between these fields and sat_score, which lets me look more closely at the fields that correlate most strongly with sat_score.
combined.corr()[survey_fields].loc['sat_score',:].sort_values().plot.bar()
Immediately, it can be observed that the correlation of the survey fields with sat_score (in absolute terms) varies from 0.02 (almost no correlation) to 0.4 (medium correlation). Any correlation with an absolute value above 0.25 can be termed 'interesting' and deserves a deeper dive. Next, I isolate only those survey fields whose absolute correlation with sat_score exceeds 0.25.
Rather than rely on the r (correlation) value alone, it is better to plot the two fields being compared as a scatterplot. This shows whether there is a genuine relationship or whether the r value is an artifact of a few influential outliers.
# A function that makes multiple scatterplots in a single figure
def scatter_plots_multiple(colnames_list):
    numcols = len(colnames_list)
    fig = plt.figure(figsize = (15,(int(numcols/3)+1)*5))
    for i in range(1,numcols+1):
        ax = fig.add_subplot(3,3,i)
        ax.scatter(combined[colnames_list[i-1]],combined['sat_score'])
        ax.set_title(colnames_list[i-1]+' vs "sat_score"')
    plt.show()
# Compute the survey-field correlations once, then keep those with |r| > 0.25
survey_corr = combined.corr()[survey_fields].loc['sat_score',:]
survey_sat_corr = survey_corr[survey_corr.abs()>0.25]
scatter_plots_multiple(survey_sat_corr.index)
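To illustrate why the scatterplots matter, here is a minimal, dependency-free sketch of how a single influential point can manufacture a high r. The data is made up and `pearson_r` is a hypothetical helper written for this illustration, not part of the analysis above.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up data: eight points with no linear trend at all
xs = [1, 2, 3, 4, 1, 2, 3, 4]
ys = [1, 1, 1, 1, 2, 2, 2, 2]
r_without = pearson_r(xs, ys)

# The same points plus one extreme outlier far from the cluster
r_with = pearson_r(xs + [20], ys + [20])

print(round(r_without, 3), round(r_with, 3))
```

The first value is zero, while the second is well above 0.9 even though only one point changed. A scatterplot makes this kind of outlier obvious at a glance.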
Safety and Respect scores based on student responses (the saf_s_11 field) seem to correlate with sat_score more clearly than the other survey fields. While most schools are clustered around the 6-7 range of safety scores, schools with a higher student-rated safety score seem to have higher mean sat_score values as well.
Let's dive deeper into this survey field, and try to visualise which districts have higher average safety scores and which ones don't.
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
saf_s_11 safety scores by District
# district-wise average scores
district = combined.groupby('school_dist').agg(numpy.mean)
district
school_dist | SAT Critical Reading Avg. Score | SAT Math Avg. Score | SAT Writing Avg. Score | sat_score | AP Test Takers | Total Exams Taken | Number of Exams with scores 3 4 or 5 | Total Cohort | CSD | NUMBER OF STUDENTS / SEATS FILLED | ... | expgrade_span_max | zip | total_students | number_programs | priority08 | priority09 | priority10 | lat | lon | ap_per |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01 | 441.833333 | 473.333333 | 439.333333 | 1354.500000 | 116.681090 | 173.019231 | 135.800000 | 93.500000 | 1.0 | 115.244241 | ... | 12.0 | 10003.166667 | 659.500000 | 1.333333 | 0.0 | 0.0 | 0.0 | 40.719022 | -73.982377 | 0.192551 |
02 | 426.619092 | 444.186256 | 424.832836 | 1295.638184 | 128.908454 | 201.516827 | 157.495833 | 158.647849 | 2.0 | 149.818949 | ... | 12.0 | 10023.770833 | 621.395833 | 1.416667 | 0.0 | 0.0 | 0.0 | 40.739699 | -73.991386 | 0.265711 |
03 | 428.529851 | 437.997512 | 426.915672 | 1293.443035 | 156.183494 | 244.522436 | 193.087500 | 183.384409 | 3.0 | 156.005994 | ... | 12.0 | 10023.750000 | 717.916667 | 2.000000 | 0.0 | 0.0 | 0.0 | 40.781574 | -73.977370 | 0.267818 |
04 | 402.142857 | 416.285714 | 405.714286 | 1224.142857 | 129.016484 | 183.879121 | 151.035714 | 113.857143 | 4.0 | 132.362265 | ... | 12.0 | 10029.857143 | 580.857143 | 1.142857 | 0.0 | 0.0 | 0.0 | 40.793449 | -73.943215 | 0.246798 |
05 | 427.159915 | 438.236674 | 419.666098 | 1285.062687 | 85.722527 | 115.725275 | 142.464286 | 143.677419 | 5.0 | 120.623901 | ... | 12.0 | 10030.142857 | 609.857143 | 1.142857 | 0.0 | 0.0 | 0.0 | 40.817077 | -73.949251 | 0.161767 |
06 | 382.011940 | 400.565672 | 382.066269 | 1164.643881 | 108.711538 | 159.715385 | 105.425000 | 180.848387 | 6.0 | 139.041709 | ... | 12.0 | 10036.200000 | 628.900000 | 1.300000 | 0.0 | 0.0 | 0.0 | 40.848970 | -73.932502 | 0.220879 |
07 | 376.461538 | 380.461538 | 371.923077 | 1128.846154 | 73.703402 | 112.476331 | 105.276923 | 105.605459 | 7.0 | 97.597416 | ... | 12.0 | 10452.692308 | 465.846154 | 1.461538 | 0.0 | 0.0 | 0.0 | 40.816815 | -73.919971 | 0.170719 |
08 | 386.214383 | 395.542741 | 377.908005 | 1159.665129 | 118.379371 | 168.020979 | 144.731818 | 215.510264 | 8.0 | 129.765099 | ... | 12.0 | 10467.000000 | 547.636364 | 1.272727 | 0.0 | 0.0 | 0.0 | 40.823803 | -73.866087 | 0.249342 |
09 | 373.755970 | 383.582836 | 374.633134 | 1131.971940 | 71.411538 | 104.265385 | 98.470000 | 113.330645 | 9.0 | 100.118588 | ... | 12.0 | 10456.100000 | 449.700000 | 1.150000 | 0.0 | 0.0 | 0.0 | 40.836349 | -73.906240 | 0.175797 |
10 | 403.363636 | 418.000000 | 400.863636 | 1222.227273 | 132.231206 | 226.914336 | 191.618182 | 161.318182 | 10.0 | 168.876526 | ... | 12.0 | 10463.181818 | 757.863636 | 1.500000 | 0.0 | 0.0 | 0.0 | 40.870345 | -73.898360 | 0.153976 |
11 | 389.866667 | 394.533333 | 380.600000 | 1165.000000 | 83.813462 | 122.484615 | 108.833333 | 122.866667 | 11.0 | 129.031031 | ... | 12.0 | 10467.933333 | 563.666667 | 1.533333 | 0.0 | 0.0 | 0.0 | 40.873138 | -73.856120 | 0.170508 |
12 | 364.769900 | 379.109453 | 357.943781 | 1101.823134 | 93.102564 | 139.442308 | 153.450000 | 110.467742 | 12.0 | 91.684504 | ... | 12.0 | 10463.166667 | 409.000000 | 1.083333 | 0.0 | 0.0 | 0.0 | 40.831412 | -73.886946 | 0.265387 |
13 | 409.393800 | 424.127440 | 403.666361 | 1237.187600 | 232.931953 | 382.704142 | 320.773077 | 224.595533 | 13.0 | 218.306055 | ... | 12.0 | 11207.153846 | 895.153846 | 2.076923 | 0.0 | 0.0 | 0.0 | 40.692865 | -73.977016 | 0.180886 |
14 | 395.937100 | 398.189765 | 385.333049 | 1179.459915 | 77.798077 | 114.873626 | 123.282143 | 112.347926 | 14.0 | 123.643728 | ... | 12.0 | 11210.785714 | 545.357143 | 2.000000 | 0.0 | 0.0 | 0.0 | 40.711599 | -73.948360 | 0.217193 |
15 | 395.679934 | 404.628524 | 390.295854 | 1190.604312 | 94.574786 | 141.581197 | 153.450000 | 104.207885 | 15.0 | 135.707319 | ... | 12.0 | 11214.222222 | 573.111111 | 1.666667 | 0.0 | 0.0 | 0.0 | 40.675972 | -73.989255 | 0.181893 |
16 | 371.529851 | 379.164179 | 369.415672 | 1120.109701 | 82.264423 | 126.519231 | 153.450000 | 247.185484 | 16.0 | 177.501282 | ... | 12.0 | 11219.000000 | 440.250000 | 1.750000 | 0.0 | 0.0 | 0.0 | 40.688008 | -73.929686 | 0.309973 |
17 | 386.571429 | 394.071429 | 380.785714 | 1161.428571 | 105.583791 | 163.087912 | 111.360714 | 121.357143 | 17.0 | 130.246192 | ... | 12.0 | 11220.642857 | 547.071429 | 1.642857 | 0.0 | 0.0 | 0.0 | 40.660313 | -73.955636 | 0.209731 |
18 | 373.454545 | 373.090909 | 371.454545 | 1118.000000 | 129.028846 | 197.038462 | 153.450000 | 72.771261 | 18.0 | 72.209438 | ... | 12.0 | 11224.000000 | 344.000000 | 1.090909 | 0.0 | 0.0 | 0.0 | 40.641863 | -73.914726 | 0.396711 |
19 | 367.083333 | 377.583333 | 359.166667 | 1103.833333 | 88.097756 | 124.769231 | 120.670833 | 114.322581 | 19.0 | 105.752625 | ... | 12.0 | 11207.500000 | 440.416667 | 1.916667 | 0.0 | 0.0 | 0.0 | 40.676547 | -73.882158 | 0.200646 |
20 | 406.223881 | 465.731343 | 401.732537 | 1273.687761 | 227.805769 | 359.407692 | 177.690000 | 591.374194 | 20.0 | 420.029766 | ... | 12.0 | 11210.200000 | 2521.400000 | 3.800000 | 0.0 | 0.0 | 0.0 | 40.626751 | -74.006191 | 0.150214 |
21 | 395.283582 | 421.786974 | 389.242062 | 1206.312619 | 135.467657 | 203.835664 | 142.377273 | 275.351906 | 21.0 | 224.702989 | ... | 12.0 | 11221.000000 | 1098.272727 | 3.272727 | 0.0 | 0.0 | 0.0 | 40.593596 | -73.978465 | 0.206245 |
22 | 473.500000 | 502.750000 | 474.250000 | 1450.500000 | 391.007212 | 614.509615 | 370.362500 | 580.250000 | 22.0 | 495.279369 | ... | 12.0 | 11223.000000 | 2149.000000 | 2.250000 | 0.0 | 0.0 | 0.0 | 40.618285 | -73.952288 | 0.215706 |
23 | 380.666667 | 398.666667 | 378.000000 | 1157.333333 | 29.000000 | 31.000000 | 153.450000 | 87.000000 | 23.0 | 120.113095 | ... | 12.0 | 11219.000000 | 391.000000 | 1.333333 | 0.0 | 0.0 | 0.0 | 40.668586 | -73.912298 | 0.063672 |
24 | 405.846154 | 434.000000 | 402.153846 | 1242.000000 | 126.474852 | 179.094675 | 115.165385 | 234.682382 | 24.0 | 213.471903 | ... | 12.0 | 11206.153846 | 962.461538 | 2.230769 | 0.0 | 0.0 | 0.0 | 40.740621 | -73.911518 | 0.185186 |
25 | 437.250000 | 483.500000 | 436.250000 | 1357.000000 | 205.260817 | 279.889423 | 174.793750 | 268.733871 | 25.0 | 280.576007 | ... | 12.0 | 11361.000000 | 1288.875000 | 1.875000 | 0.0 | 0.0 | 0.0 | 40.745414 | -73.815558 | 0.205119 |
26 | 445.200000 | 487.600000 | 444.800000 | 1377.600000 | 410.605769 | 632.407692 | 392.090000 | 825.600000 | 26.0 | 595.953216 | ... | 12.0 | 11388.600000 | 2837.400000 | 4.600000 | 0.0 | 0.0 | 0.0 | 40.748507 | -73.759176 | 0.124673 |
27 | 407.800000 | 422.200000 | 394.300000 | 1224.300000 | 100.611538 | 145.315385 | 95.125000 | 288.961290 | 27.0 | 249.324536 | ... | 12.0 | 11556.300000 | 1072.000000 | 2.500000 | 0.0 | 0.0 | 0.0 | 40.638828 | -73.807823 | 0.150687 |
28 | 445.941655 | 465.997286 | 435.908005 | 1347.846947 | 182.010490 | 273.559441 | 175.336364 | 351.214076 | 28.0 | 255.381164 | ... | 12.0 | 11422.000000 | 1304.272727 | 2.545455 | 0.0 | 0.0 | 0.0 | 40.709344 | -73.806367 | 0.215716 |
29 | 395.764925 | 399.457090 | 386.707836 | 1181.929851 | 63.385817 | 96.514423 | 135.268750 | 98.108871 | 29.0 | 88.372155 | ... | 12.0 | 11413.625000 | 474.125000 | 1.250000 | 0.0 | 0.0 | 0.0 | 40.685276 | -73.752740 | 0.211378 |
30 | 430.679934 | 465.961857 | 429.740299 | 1326.382090 | 157.231838 | 252.123932 | 115.150000 | 310.526882 | 30.0 | 251.803744 | ... | 12.0 | 11103.000000 | 1123.333333 | 2.555556 | 0.0 | 0.0 | 0.0 | 40.755398 | -73.932306 | 0.170433 |
31 | 457.500000 | 472.500000 | 452.500000 | 1382.500000 | 228.908654 | 355.111538 | 194.435000 | 450.787097 | 31.0 | 380.528319 | ... | 12.0 | 10307.100000 | 1847.500000 | 5.000000 | 0.0 | 0.0 | 0.0 | 40.595680 | -74.125726 | 0.176337 |
32 | 371.500000 | 385.833333 | 362.166667 | 1119.500000 | 70.342949 | 100.179487 | 83.558333 | 105.333333 | 32.0 | 100.525613 | ... | 12.0 | 11231.666667 | 381.500000 | 1.000000 | 0.0 | 0.0 | 0.0 | 40.696295 | -73.917124 | 0.170409 |
32 rows × 68 columns
# plotting safety scores district wise on NYC map (using Basemap library)
from mpl_toolkits.basemap import Basemap
def nyc_plot_district(fieldname):
    fig,ax = plt.subplots(figsize = (6,6))
    m = Basemap(projection = 'merc', llcrnrlat = 40.496044, urcrnrlat = 40.915256,
                llcrnrlon = -74.255735, urcrnrlon = -73.700272, resolution = 'h')
    m.drawcoastlines(color = 'black', linewidth = 1)
    m.drawmapboundary(fill_color = '#85A6D9')
    # Creating scatterplot
    m.scatter(district['lon'].tolist(),
              district['lat'].tolist(),
              zorder = 2, s=20,
              latlon = True,
              c=district[fieldname],
              cmap = 'summer')
    if fieldname == 'saf_s_11':
        ax.set_title('Heat-Map: District Wise Safety Scores for NYC Schools')
    plt.show()
def nyc_plot_school(df):
    fig,ax = plt.subplots(figsize = (6,6))
    m = Basemap(projection = 'merc', llcrnrlat = 40.496044, urcrnrlat = 40.915256,
                llcrnrlon = -74.255735, urcrnrlon = -73.700272, resolution = 'i')
    m.drawcoastlines(color = 'black', linewidth = 1)
    m.drawmapboundary(fill_color = '#85A6D9')
    # Creating scatterplot
    m.scatter(df['lon'].tolist(),
              df['lat'].tolist(),
              zorder = 2, s=20,
              latlon = True,
              c='black')
    ax.set_title('Scatter Plot: NYC Schools')
    plt.show()
nyc_plot_district('saf_s_11')
From the NYC map above, we can see that most districts in Manhattan and Queens seem to have lower average student-rated safety scores (yellow dots), while Brooklyn has relatively higher safety scores.
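Each dot on the map is one district: the schools are grouped by the first two characters of their DBN, and the dot's position and color come from the district means of lat, lon, and saf_s_11. A minimal sketch of that grouping step in plain Python follows. The DBN and score values are made up, and `district_means` is a hypothetical helper, not the pandas `groupby` call used above.

```python
from collections import defaultdict

# Hypothetical (DBN, saf_s_11) pairs, for illustration only
schools = [
    ("01M001", 6.0),
    ("01M002", 7.0),
    ("02M003", 8.0),
]

def district_means(rows):
    """Average a per-school value within each district (first two DBN characters)."""
    grouped = defaultdict(list)
    for dbn, value in rows:
        grouped[dbn[0:2]].append(value)
    return {dist: sum(vals) / len(vals) for dist, vals in grouped.items()}

means = district_means(schools)
print(means)  # {'01': 6.5, '02': 8.0}
```

Because every district collapses to a single mean, a district with a few very safe or very unsafe schools can look quite different from its typical school.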
sat_score correlation
By plotting the correlations between the race columns (white_per, asian_per, black_per, hispanic_per) and sat_score, we can determine whether there are any racial differences in SAT performance.
combined.corr().loc['sat_score',['white_per','asian_per','black_per','hispanic_per']].plot.bar()
scatter_plots_multiple(['white_per','asian_per','black_per','hispanic_per'])
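Immediately we can see that white_per and asian_per correlate positively with sat_score, while black_per and hispanic_per correlate negatively with sat_score.
This suggests that the SAT is not favourable to students from Hispanic/Black communities.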
Another interesting thing to investigate further is the cluster of schools with low sat_score values in the hispanic_per plot, where 99%+ of students are Hispanic. On diving deeper, we find that all of these schools take in newly arrived immigrants from Hispanic countries who have been in the USA for fewer than 2-3 years. This may contribute to lower English scores on the SAT and, in turn, lower overall SAT scores.
# investigating schools with > 95% hispanic percentage
combined[combined['hispanic_per']>95][['SCHOOL NAME','hispanic_per','sat_score']]
| | SCHOOL NAME | hispanic_per | sat_score |
---|---|---|---|
44 | MANHATTAN BRIDGES HIGH SCHOOL | 99.8 | 1058.0 |
82 | WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL | 96.7 | 1174.0 |
89 | GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M... | 99.8 | 1014.0 |
125 | ACADEMY FOR LANGUAGE AND TECHNOLOGY | 99.4 | 951.0 |
141 | INTERNATIONAL SCHOOL FOR LIBERAL ARTS | 99.8 | 934.0 |
176 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE | 99.8 | 970.0 |
253 | MULTICULTURAL HIGH SCHOOL | 99.8 | 887.0 |
286 | PAN AMERICAN INTERNATIONAL HIGH SCHOOL | 100.0 | 951.0 |
We see that gender has a negligible correlation with SAT scores. But on a deeper dive using the scatterplots, we can see that schools with a more balanced gender mix (a ratio close to 1:1 boys to girls) have higher SAT scores on average.
Also, the cluster of dots near 100% on the female_per plot shows that all-girls schools seem to have very low SAT scores on average. The same goes for schools with more than 80% male students.
combined.corr().loc['sat_score',['male_per','female_per']].plot.bar()
scatter_plots_multiple(['female_per','male_per'])
combined[combined['female_per']>80][['SCHOOL NAME','sat_score','priority01']]
| | SCHOOL NAME | sat_score | priority01 |
---|---|---|---|
15 | URBAN ASSEMBLY SCHOOL OF BUSINESS FOR YOUNG WO... | 1127.000000 | Open only to female students |
49 | THE HIGH SCHOOL OF FASHION INDUSTRIES | 1257.000000 | Open to New York City residents |
70 | YOUNG WOMEN'S LEADERSHIP SCHOOL | 1326.000000 | Open only to female students |
71 | YOUNG WOMEN'S LEADERSHIP SCHOOL | 1326.000000 | Open only to female students |
104 | WOMEN'S ACADEMY OF EXCELLENCE | 1171.000000 | Open only to female students |
133 | HIGH SCHOOL FOR VIOLIN AND DANCE | 1039.000000 | Priority to Bronx students or residents who at... |
137 | THE MARIE CURIE SCHOOL FOR MEDICINE, NURSING, ... | 1157.000000 | Priority to Bronx students or residents who at... |
191 | URBAN ASSEMBLY INSTITUTE OF MATH AND SCIENCE F... | 1223.438806 | Open only to female students |
264 | THE URBAN ASSEMBLY SCHOOL FOR CRIMINAL JUSTICE | 1223.438806 | Open only to female students |
329 | YOUNG WOMEN'S LEADERSHIP SCHOOL, QUEENS | 1316.000000 | Open only to female students |
338 | YOUNG WOMEN'S LEADERSHIP SCHOOL, ASTORIA | 1223.438806 | Open only to female students |
combined[combined['male_per']>80][['SCHOOL NAME','sat_score']]
| | SCHOOL NAME | sat_score |
---|---|---|
99 | URBAN ASSEMBLY SCHOOL FOR CAREERS IN SPORTS | 1181.0 |
101 | ALFRED E. SMITH CAREER AND TECHNICAL EDUCATION... | 1158.0 |
115 | EAGLE ACADEMY FOR YOUNG MEN | 1134.0 |
135 | BRONX ENGINEERING AND TECHNOLOGY ACADEMY | 1150.0 |
160 | HIGH SCHOOL OF COMPUTERS AND TECHNOLOGY | 1111.0 |
170 | BRONX AEROSPACE HIGH SCHOOL | 1163.0 |
207 | AUTOMOTIVE HIGH SCHOOL | 1093.0 |
249 | FDNY HIGH SCHOOL FOR FIRE AND LIFE SAFETY | 1023.0 |
254 | TRANSIT TECH CAREER AND TECHNICAL EDUCATION HI... | 1193.0 |
267 | HIGH SCHOOL OF SPORTS MANAGEMENT | 1164.0 |
295 | AVIATION CAREER & TECHNICAL EDUCATION HIGH SCHOOL | 1364.0 |
combined[(combined['female_per']>60)&(combined['sat_score']>1700)][['SCHOOL NAME','sat_score','female_per']]
| | SCHOOL NAME | sat_score | female_per |
---|---|---|---|
5 | BARD HIGH SCHOOL EARLY COLLEGE | 1856.0 | 68.7 |
26 | ELEANOR ROOSEVELT HIGH SCHOOL | 1758.0 | 67.5 |
60 | BEACON HIGH SCHOOL | 1744.0 | 61.0 |
61 | FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A... | 1707.0 | 73.6 |
302 | TOWNSEND HARRIS HIGH SCHOOL | 1910.0 | 71.1 |
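The gender observation above (a negligible linear correlation, yet higher scores near a 1:1 ratio) is what a non-monotonic relationship looks like to Pearson's r. A minimal sketch with made-up numbers follows. The `imbalance` feature (distance from a 50/50 split) and the `pearson_r` helper are assumptions introduced for illustration and are not columns or functions from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up schools: scores peak at a 50/50 split and fall toward either extreme
female_per = [10, 30, 50, 70, 90]
sat_score = [1100, 1300, 1500, 1300, 1100]

# Hypothetical derived feature: how far each school is from a 1:1 ratio
imbalance = [abs(p - 50) for p in female_per]

r_linear = pearson_r(female_per, sat_score)
r_imbalance = pearson_r(imbalance, sat_score)

print(round(r_linear, 3), round(r_imbalance, 3))
```

In this toy data the correlation with female_per is zero, while the correlation with imbalance is perfectly negative. Applying the same derived feature to the real data would be one way to check whether the pattern seen in the scatterplots holds up numerically.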
In the U.S., high school students take Advanced Placement (AP) exams to earn college credit. There are AP exams for many different subjects.
It makes sense that the number of students at a school who took AP exams would be highly correlated with the school's SAT scores. Let's explore this relationship.
combined['ap_per'] = combined['AP Test Takers ']/combined['total_enrollment']
combined.plot.scatter('ap_per','sat_score')
combined[combined['sat_score'] > 1800][['SCHOOL NAME','sat_score','ap_per']]
| | SCHOOL NAME | sat_score | ap_per |
---|---|---|---|
5 | BARD HIGH SCHOOL EARLY COLLEGE | 1856.0 | 0.209123 |
37 | STUYVESANT HIGH SCHOOL | 2096.0 | 0.457992 |
79 | HIGH SCHOOL FOR MATHEMATICS, SCIENCE AND ENGIN... | 1847.0 | 0.280788 |
151 | BRONX HIGH SCHOOL OF SCIENCE | 1969.0 | 0.394955 |
155 | HIGH SCHOOL OF AMERICAN STUDIES AT LEHMAN COLLEGE | 1920.0 | 0.514589 |
187 | BROOKLYN TECHNICAL HIGH SCHOOL | 1833.0 | 0.397037 |
302 | TOWNSEND HARRIS HIGH SCHOOL | 1910.0 | 0.537719 |
327 | QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO... | 1868.0 | 0.514354 |
356 | STATEN ISLAND TECHNICAL HIGH SCHOOL | 1953.0 | 0.478261 |
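Dividing AP Test Takers by total_enrollment matters because raw counts grow with school size, and total_enrollment is itself positively correlated with sat_score (0.37 in the correlation listing earlier). A minimal sketch with made-up numbers shows how the share can reverse the ranking that raw counts suggest.

```python
# Made-up schools: (AP test takers, total enrollment), for illustration only
schools = {
    "large school": (300, 3000),
    "small school": (100, 400),
}

# Same normalization as ap_per: share of enrolled students who took an AP exam
ap_per = {
    name: takers / enrollment
    for name, (takers, enrollment) in schools.items()
}

print(ap_per)  # {'large school': 0.1, 'small school': 0.25}
```

The large school has three times as many AP test takers, but the small school has the higher participation rate.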
Here we see a moderate positive correlation (r ≈ 0.38 for AVERAGE CLASS SIZE in the correlation listing earlier): schools with a higher average class size seem to have higher average SAT scores, and vice versa.
combined.plot.scatter('AVERAGE CLASS SIZE', 'sat_score')