This project is guided by the big question: What makes a good public high school in NYC area?
Have you ever thought about dropping out from high school? Have you ever worried about your scores in high shcool? No matter what your answer is, graduation from high school is a bare minimum for students’ later academic and socioeconomic success. Also, prior researches indicated that high school grades are a stronger incremental predictor of college outcomes—including what is arguably the most important outcome of all: graduation. (Galla, Brian M., et al., 2019)
However, in the United States, Low academic performance in high school and increasing drop-out rates become major educational issues. it is estimated that Nearly one-third of all public high school students in the U.S. fail to graduate in four years (Chapman et al., 2011; Stillwell and Sable, 2013).
Note: please use "pip install -r requirements.txt" in your terminal to install the following packages first:
pandas os csv ast numpy textblob seaborn matplotlib wordcloud scipy geopy re fractions requests geopandas folium sklearn selenium time zipfile urllib getpass bs4
We are colleting and processing the data from Niche
When doing the data collection, we didn't find any official API from the website. So we use Selenium to access to the website, locate elements in a page, and then collect relative variables. There are three steps to collect our data.
Step 1: Collect all the schools we need with their name and the url of detailed information. Below are the columns.
name
– the school nameurl
– the url of its detailed informationStep 2: Collect relative feeatures from the school page in the school list.
Average Graduation Rate
—— The percentage of 12th graders who graduPercent Proficient - Math
– The percentage of students that scored at or above proficiency levels on their state math assessment testPercent Proficient - Reading
– This is the percentage of students that scored at or above proficiency levels on their state reading/language arts assessment test.Median Household Income
– The median combined gross income of all members of a household who are 15 years or older in the neighborhood.Median Rent
– The median rent in the neighborhood.Median Home Value
– The median amount for which a house would likely sell in the neighborhood.Student Number
– The student number in the school.Free or Reduced Lunch
– The percentage of free or reduced lunch in lunch fee.Safe
– The percentage of students who feel safe in the school survey.Happy
– The percentage of students who feel happy in the school survey.Student Teacher Ratio
– The student teacher ratio in the school.Address
– The school address in the school listStep 3: Collect 60 latest reviews of each school in the school list.
School
– The school name.Reviews
– The latest review list of that school.Rate
– The star of the specific review.Note: To improve on the readability of the Notebook, we have added the code for our scrappers in the Appendix at the end of this notebook.
First, let's analyze the reviews and try to answer our first question: what do parents/students usually focus on when they write reviews about high schools?
import pandas as pd
import os
import csv
import ast
import numpy as np
from textblob import TextBlob
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from scipy.stats.stats import pearsonr
First we read in data from several .csv
file we scraped from the website. It contains the review text data and their respective rating stars, but it's still in a very primitive and unprocessed format. So we did some cleaning and preprocessing later.
def read_allcsv(fname_list):
df = pd.DataFrame()
for fname in fname_list:
with open('review/review/{}'.format(fname), newline='', encoding="utf8") as f:
df_i = pd.read_csv(f)
df = df.append(df_i)
return df
all_f = os.listdir("review/review")
df = read_allcsv(all_f)
df.head()
Unnamed: 0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
---|---|---|---|---|---|---|---|---|---|
0 | East Meadow High School | ["The teacher's are pretty pog and are always ... | ['review__stars__icon', 'review__stars__icon--... | ["The teacher's are pretty pog and are always ... | ['review__stars__icon', 'review__stars__icon--... | ['Very strict faculty. Kida are getting wilder... | ['review__stars__icon', 'review__stars__icon--... | ['It is a decent school, 3 stars pretty much e... | ['review__stars__icon', 'review__stars__icon--... |
1 | Glen Rock High School | ["Glen Rock High School will hold some of my f... | ['review__stars__icon', 'review__stars__icon--... | ["Glen Rock High School will hold some of my f... | ['review__stars__icon', 'review__stars__icon--... | ['Glen Rock High School had many teachers who ... | ['review__stars__icon', 'review__stars__icon--... | ["I don't eat school food so I can't comment. ... | ['review__stars__icon', 'review__stars__icon--... |
2 | Millennium High School | ['cool and nice. academics are good and there ... | ['review__stars__icon', 'review__stars__icon--... | ['cool and nice. academics are good and there ... | ['review__stars__icon', 'review__stars__icon--... | ["Aspects of life at Millennium High School th... | ['review__stars__icon', 'review__stars__icon--... | ["Aspects of life at Millennium High School th... | ['review__stars__icon', 'review__stars__icon--... |
3 | Croton-Harmon High School | ['Croton Harmon High School has a small town f... | ['review__stars__icon', 'review__stars__icon--... | ['Croton Harmon High School has a small town f... | ['review__stars__icon', 'review__stars__icon--... | ['School has many opportunities to succeed', '... | ['review__stars__icon', 'review__stars__icon--... | ['School has many opportunities to succeed', '... | ['review__stars__icon', 'review__stars__icon--... |
4 | Scotch Plains-Fanwood High School | ['Scotch Plains-Fanwood High School gives a un... | ['review__stars__icon', 'review__stars__icon--... | ['Scotch Plains-Fanwood High School gives a un... | ['review__stars__icon', 'review__stars__icon--... | ['What I liked about my high school experience... | ['review__stars__icon', 'review__stars__icon--... | ['What I liked about my high school experience... | ['review__stars__icon', 'review__stars__icon--... |
As you can see, the dataframe has a column of school names. At the same row are the lists of review texts and their respective star ratings. However, each rating is encoded as a set of'review__stars__icon', 'review__stars__icon--40', 'review__stars__number__reviews'
for an instance to indicate a 4-star review. So we need to extract that infromation from the text. As the data readed are all just of type str
that looks like a list, we use ast
package, which helps Python applications to process trees of the Python abstract syntax grammar, to convert them to lists for more convenient processing. The following function process_rating
and process_review
did just the above.
def process_rating(rating_text):
ra = ast.literal_eval(rating_text)
outra = []
for i in range(1, len(ra), 3):
outra.append(int(ra[i][-2]))
return outra
def process_review(review_text):
return ast.literal_eval(review_text)
While we pre-process the raw text data, we also deal with the missing values where the reviews are just not enough to fill in all the columns for some certain schools and give them an empty list as default value.
df['rating_1'] = df['1'].apply(lambda x: process_rating(x) if(np.all(pd.notnull(x))) else [])
df['rating_2'] = df['5'].apply(lambda x: process_rating(x) if(np.all(pd.notnull(x))) else [])
df['rating_3'] = df['7'].apply(lambda x: process_rating(x) if(np.all(pd.notnull(x))) else [])
df['review_1'] = df['0'].apply(lambda x: process_review(x) if(np.all(pd.notnull(x))) else [])
df['review_2'] = df['4'].apply(lambda x: process_review(x) if(np.all(pd.notnull(x))) else [])
df['review_3'] = df['6'].apply(lambda x: process_review(x) if(np.all(pd.notnull(x))) else [])
df["ratings"] = df.apply(lambda row: row["rating_1"] + row["rating_2"] + row["rating_3"], axis = 1)
df["reviews"] = df.apply(lambda row: row["review_1"] + row["review_2"] + row["review_3"], axis = 1)
df = df[['Unnamed: 0', 'ratings', 'reviews']]
df.columns = ['school_name','ratings', 'reviews']
df = df.reset_index(drop = True)
Now we have a clean and well-formated dataframe that contains all the review and rating data of 982 schools that we collected from niche.
df.head()
school_name | ratings | reviews | |
---|---|---|---|
0 | East Meadow High School | [4, 4, 3, 3, 4, 4, 5, 4, 4, 4, 3, 4, 5, 4, 4, ... | [The teacher's are pretty pog and are always s... |
1 | Glen Rock High School | [4, 4, 4, 3, 3, 5, 3, 3, 4, 3, 2, 3, 5, 4, 5, ... | [Glen Rock High School will hold some of my fo... |
2 | Millennium High School | [5, 2, 4, 5, 4, 4, 4, 4, 5, 3, 4, 3, 3, 4, 4, ... | [cool and nice. academics are good and there a... |
3 | Croton-Harmon High School | [5, 3, 3, 5, 4, 5, 4, 5, 1, 4, 4, 4, 4, 4, 4, ... | [Croton Harmon High School has a small town fe... |
4 | Scotch Plains-Fanwood High School | [4, 4, 5, 5, 5, 4, 4, 4, 4, 4, 3, 5, 3, 5, 5, ... | [Scotch Plains-Fanwood High School gives a uni... |
We first have a look at the distribution of the ratings. It can be seen in the plot below that people mostly give 4-star reviews. The number of 3 and 5 star reviews are identical, whereas people seldom give 1 and 2 star reviews.
all_review = df.reviews.sum()
all_review_text = ' '.join(all_review)
all_rating = df['ratings'].sum()
# all_rating = [str(i) for i in all_rating]
# count = [all_rating.count(i+1) for i in range(5)]
# print(count)
plt.hist(all_rating, bins = 5)
plt.xticks(np.arange(1, 6), ('1-star', '2-star', '3-star', '4-star', '5-star'))
plt.show()
Here we utilize the sentiment analysis of Textblob
to acquire the sentiment polarity(negative -1.0 -> positive 1.0) of each review and visualize how it correlates with the star rating and finally found a correlation coefficient of 0.300 between sentiments and star rating. It looks like the low-star reviews indeed convey much more negative emotions.
def get_sentiment(text):
#getting the polarity of reviews
tb = TextBlob(text)
return tb.sentiment.polarity
# List of sentiment polarity for all reviews
all_senti = [get_sentiment(t) for t in all_review]
df_senti = pd.DataFrame(list(zip(all_rating, all_senti)),
columns =['rating', 'sentiment'])
x = 'rating'
y = 'sentiment'
plt.figure(figsize=(7,7))
sns.boxplot(x = x , y = y, data = df_senti)
plt.show()
pearsonr(all_senti,all_rating)
(0.2997920033706547, 0.0)
Here we made word loud images to plot reviews of different star ratings and also of schools that have low or high graduation rate, in order to see if they differ and in what way. We made several attempts to find the extra stopwords that don't give us much information about the difference but not in the default stopwords to make our wordcloud more insightful.
stopwords_a = set(STOPWORDS)
stopwords_a.update(['high', 'school', 'people', 'think', 'student', 'class', 'lot', 'much', 'one', \
'classes', 'great', 'good','amazing','even', 'come', 'make', 'overall', 'grade',\
'small', 'give', 'able', 'make', 'going', 'thing', 'year', 'other', 'feel', 'although', 'definitely'\
'made', 'average', 'say', 'go', 'way', 'now', 'well', 'will', 'teacher', 'place', 'always', 'many', \
'thing', 'though', 'really', 'anything', 'everything', 'still'])
def create_wc(text):
wc = WordCloud(stopwords = stopwords_a, max_font_size=180, background_color="white", \
width=2000, height=1000).generate(text)
plt.figure(figsize=(30, 15))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
Below is the function to get the reviews of a certain star rating and we seperately assign the reviews divided by stars in different columns.
def get_star_review(star, star_list, review_list):
assert star in [1,2,3,4,5]
assert len(star_list) == len(review_list)
out = []
for i in range(len(star_list)):
if star_list[i] == star:
out.append(review_list[i])
return out
df["1_star_review"] = df.apply(lambda row: get_star_review(1, row['ratings'], row['reviews']), axis = 1)
df["2_star_review"] = df.apply(lambda row: get_star_review(2, row['ratings'], row['reviews']), axis = 1)
df["3_star_review"] = df.apply(lambda row: get_star_review(3, row['ratings'], row['reviews']), axis = 1)
df["4_star_review"] = df.apply(lambda row: get_star_review(4, row['ratings'], row['reviews']), axis = 1)
df["5_star_review"] = df.apply(lambda row: get_star_review(5, row['ratings'], row['reviews']), axis = 1)
one_star_review = ' '.join(df['1_star_review'].sum())
two_star_review = ' '.join(df['2_star_review'].sum())
three_star_review = ' '.join(df['3_star_review'].sum())
four_star_review = ' '.join(df['4_star_review'].sum())
five_star_review = ' '.join(df['5_star_review'].sum())
df.head()
school_name | ratings | reviews | 1_star_review | 2_star_review | 3_star_review | 4_star_review | 5_star_review | |
---|---|---|---|---|---|---|---|---|
0 | Long Island City High School | [3, 3, 4, 4, 3, 3, 5, 5, 3, 4, 2, 3, 4, 4, 4, ... | [I dont like this school as much as i thought ... | [do not go to this school if your english is n... | [Long island city has many opportunities but t... | [I dont like this school as much as i thought ... | [This school is what you make of it. In additi... | [As a incoming senior, Long Island City has we... |
1 | Liberty High School Academy for Newcomers | [4, 5, 5, 5, 3, 3, 5, 4] | [I remember the first time I stepped in Libert... | [] | [] | [My school is a good school as a foreign schoo... | [I remember the first time I stepped in Libert... | [I really like how the system works, everyone ... |
2 | Eximius College Preparatory Academy | [3, 5, 3, 4, 3, 3, 4, 4, 5, 1, 5, 5, 3, 1, 3, ... | [Eximius was a school I wont forget, Best four... | [I like how my school allows you to take many ... | [] | [Eximius was a school I wont forget, Best four... | [I love how there is the staff and teachers be... | [With my now 4 years worth of experience at EC... |
3 | Innovation Diploma Plus | [5, 5, 5, 4, 3, 3, 1, 3, 1, 1] | [Great school, great experience. Thank you IDP... | [kids do not what to do nothing, I liked the s... | [] | [none had kids in the school that didnt want t... | [What I like about innovation ....is the way o... | [Great school, great experience. Thank you IDP... |
4 | Urban Assembly Academy of Government & Law | [3, 4, 3, 4, 3, 5, 5, 5, 4, 3, 3, 3, 3, 5, 3, ... | [The is an average school. Most of the student... | [] | [The food was just plain cafeteria food, not a... | [The is an average school. Most of the student... | [My school, Urban Assembly Academy of Governme... | [even though we share a building with 5 other ... |
5 | High School for Law & Public Service | [4, 5, 4, 3, 5, 3, 4, 4, 3, 3, 4, 3, 4, 4, 5, ... | [My overall expo at this school is pretty aver... | [There arent many food options. Sometimes the ... | [] | [It's a pretty great school, even for an anti ... | [My overall expo at this school is pretty aver... | [My high school has many clubs and sports that... |
6 | Middlesex County Vocational School - East Brun... | [3, 3] | [My experience is decent. I'd like to see an h... | [] | [] | [My experience is decent. I'd like to see an h... | [] | [] |
7 | Urban Assembly School of Design & Construction | [4, 4, 4, 1, 5, 3, 3, 5, 5, 4, 4, 3, 4, 3, 4, ... | [I love the staff members. There was this one ... | [I would like to see more motivational teacher... | [They could do much better, and it makes me sa... | [I selected this school because I wanted to sp... | [I love the staff members. There was this one ... | [The school is a great place for students to t... |
8 | Plainfield Academy for the Arts & Advanced Stu... | [3, 2, 4, 4, 4, 5, 3, 4, 4, 4, 4, 4, 5, 3, 4, ... | [I liked how everything was so simple. You kne... | [] | [There are little to no clubs and/or after sch... | [I liked how everything was so simple. You kne... | [My experience about PAAAS is that I’ve had ma... | [I’ve been attending PAAAS since the 7th grade... |
9 | Essex Street Academy | [3, 3, 4, 4, 4, 4, 3, 4, 4, 5, 5, 3, 4, 4, 4, ... | [Very small school, doesn't prepare students f... | [] | [] | [Very small school, doesn't prepare students f... | [There are many ways to get everyone in the sc... | [I was given a lot of opportunties, The teache... |
10 | Repertory Company High School for Theatre Arts | [3, 5, 5, 3, 4, 5, 4, 3, 5, 4, 4, 3, 3, 3, 4, ... | [Well this school basically has opportunities ... | [I hate this School. The teachers don't care a... | [] | [Well this school basically has opportunities ... | [I love how close the community of the school ... | [The community found at Repertory Company High... |
11 | Ramapo High School | [3, 4, 4, 3, 3, 4, 5, 4, 3, 4, 4, 3, 3, 3, 3, ... | [Ramapo high school is a school that has a hig... | [] | [the schools student body is made up of approx... | [Ramapo high school is a school that has a hig... | [Very great school helped me with a lot of lov... | [GreatThis is a an interesting GreatThis is a ... |
12 | Academy for Conservation & the Environment | [4, 4, 4, 3, 2, 5, 5, 3, 4, 3, 3, 5, 4, 3, 3, ... | [The school isn't too bad, i went in with a ba... | [] | [I like that the teachers focuses on each stud... | [When it comes to ACE , teachers care a lot ab... | [The school isn't too bad, i went in with a ba... | [I absolutely love this school. I transferred ... |
13 | Sunset Park High School | [4, 4, 4, 1, 4, 4, 4, 3, 4, 5, 4, 5, 3, 5, 4, ... | [It's a pretty good high school, the teachers ... | [Horrible! I worked there as a teacher for a y... | [] | [Sunset Park high school is a very good school... | [It's a pretty good high school, the teachers ... | [Everyone is friendly in the school. Counselor... |
14 | Great Oaks Legacy Charter School | [3, 5, 2, 1, 2, 5, 4, 5, 5, 5, 4, 3, 1, 5, 4, ... | [Coming from a Public school to a Charter Scho... | [Overall, this high school could do better wit... | [what i like about great oaks was that they ca... | [Coming from a Public school to a Charter Scho... | [Something i like about Great oaks is how invo... | [I love how the school is a second family to m... |
15 | Young Women's Leadership School of Brooklyn | [5, 3, 4, 4, 4, 5, 5, 3, 5, 4, 4, 4, 3, 4, 3, ... | [I absolutely recommend this school!. From the... | [] | [] | [TYWLS is for students in the neighborhood who... | [I've been attending this school since 6th gra... | [I absolutely recommend this school!. From the... |
16 | Hudson High School of Learning Technologies | [3, 2, 5, 4, 3, 5, 4, 5, 5, 3, 5, 2, 3, 3, 5, ... | [My school added more AP classes for students ... | [] | [Some of the teachers are willing to help you ... | [My school added more AP classes for students ... | [Hudson’s the best high school you can get in ... | [This is a very good school. The teachers care... |
17 | Lincoln High School | [5, 5, 5, 4, 3, 3, 4, 5, 3, 4, 5, 5, 5, 3, 4, ... | [I really like my school and this school help ... | [this school makes me want to die. the staff i... | [] | [I like how the ninth grade teachers are dedic... | [Lincoln is filled with opportunities and is a... | [I really like my school and this school help ... |
18 | Dreamyard Preparatory School | [5, 5, 5, 4, 4, 5, 4, 4, 4, 5, 5, 5, 4, 4, 3, ... | [As a high school senior at Dreamyard Preparat... | [] | [] | [It is very challenging because some students ... | [My Experience in Dreamyard wasn’t the best bu... | [As a high school senior at Dreamyard Preparat... |
19 | John Adams High School | [5, 1, 4, 2, 5, 4, 5, 3, 4, 4, 4, 3, 5, 4, 5, ... | [As a student who attended John Adams High Sch... | [This school is rrally bad. The STEM program i... | [It's alright, but it gets very boring. Add a ... | [John Adams has a long way to go, but if stude... | [The teachers are friendly and will help you w... | [As a student who attended John Adams High Sch... |
20 | Multicultural High School | [3, 3] | [It was ok, it was really work intensive in mo... | [] | [] | [It was ok, it was really work intensive in mo... | [] | [] |
21 | Hyde Leadership Charter School | [5, 4, 2, 4, 4, 2, 5, 4, 4, 4, 4, 4, 5, 4, 1, ... | [Hyde Leadership Charter School offers AP Clas... | [Not good school. Terrible principalcelia sosa... | [The high school is starting to feel like a pu... | [The teachers in my school are very well educa... | [It is a good school that helps you prepare fo... | [Hyde Leadership Charter School offers AP Clas... |
22 | East New York Family Academy | [5, 4, 5, 5, 5, 4, 3, 3, 3, 3, 4, 3, 3, 4, 3, ... | [I've learned many things from this school it ... | [] | [] | [When you first start school here, you'll be j... | [I’d like them to change many key factors abou... | [I've learned many things from this school it ... |
23 | Unity Center for Urban Technologies | [5, 2, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3] | [Very good school I'm currently a senior and i... | [] | [Terrible school because it a very racist scho... | [I believe that Unity Center For Urban Technol... | [] | [Very good school I'm currently a senior and i... |
24 | Mott Haven Community High School | [5, 5, 5, 5, 4, 5, 5] | [I transferred to this school to finish my hig... | [] | [] | [] | [There are so many programs and clubs availabl... | [I transferred to this school to finish my hig... |
25 | Academy for Language & Technology | [3, 5, 5, 4, 5, 3, 4, 4, 5, 4, 5, 3, 4, 5, 5, ... | [It was safe, we did not see a lot of fights i... | [] | [] | [It was safe, we did not see a lot of fights i... | [The Academy for Language and Technology is a ... | [The most thing I like about academy for langu... |
26 | Urban Assembly School for Collaborative Health... | [3, 3, 3, 4, 4, 4] | [I like how the school is always concerned abo... | [] | [] | [I like how the school is always concerned abo... | [There were plenty of supportive teachers and ... | [] |
27 | International High School at Lafayette | [4, 4, 4, 5, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, ... | [It's a very good school because you get to ex... | [] | [] | [We don't chance to get involved in different ... | [It's a very good school because you get to ex... | [Always felt safe and secure, I've only had go... |
28 | High School for Energy & Technology | [3, 3, 5, 4, 4, 2, 3, 3, 3, 3, 5, 3, 5, 3, 4] | [Most of the teachers are very nice, I enyoy s... | [] | [High School for Energy and technology is by f... | [Most of the teachers are very nice, I enyoy s... | [In this school you'll experience the ways of ... | [High School for Energy and Technology (HSET) ... |
29 | High School for Enterprise, Business and Techn... | [5, 3, 3, 4, 3, 4, 3, 3, 5, 3, 4, 3, 2, 3, 3, ... | [I had a great experience attending this schoo... | [] | [I'm a senior now, and this school at first wa... | [I had a solid experience at EBT. The classes ... | [This school was very good. I enjoyed how much... | [I had a great experience attending this schoo... |
... | ... | ... | ... | ... | ... | ... | ... | ... |
952 | W. Tresper Clarke High School | [5, 4, 5, 4, 4, 5, 5, 4, 5, 4, 4, 4, 3, 4, 4, ... | [My experience at W.Tresper Clarke High School... | [] | [] | [Going to this high school has been an okay ex... | [I love the teachers and supportive environmen... | [My experience at W.Tresper Clarke High School... |
953 | Midwood High School at Brooklyn College | [5, 4, 4, 2, 4, 4, 5, 5, 3, 4, 5, 5, 4, 4, 3, ... | [Midwood High school is a great school for stu... | [] | [Overall, was a pretty poor experience. Some ... | [One thing I like about Midwood is the diversi... | [The students are really diverse and the class... | [Midwood High school is a great school for stu... |
954 | Morris Hills High School | [5, 5, 4, 2, 4, 4, 5, 5, 4, 5, 5, 3, 5, 3, 4, ... | [This is a wonderful school, which as many dif... | [] | [The administration constantly boasts that Mor... | [I would like to see more school spirit, schoo... | [Academics are really good, a lot of scholarsh... | [This is a wonderful school, which as many dif... |
955 | Oceanside High School | [4, 5, 4, 5, 3, 3, 2, 5, 4, 3, 5, 4, 4, 4, 3, ... | [The school provide a vast list of elective co... | [] | [Starting off with drug/health regulations, Oc... | [Decent. You could do anything you want if you... | [The school provide a vast list of elective co... | [My four years at Oceanside High School have b... |
956 | Millennium Brooklyn High School | [4, 5, 5, 4, 3, 5, 5, 4, 5, 5, 3, 5, 4, 5, 3, ... | [I really like that at Millennium Brooklyn Hig... | [] | [] | [The teachers in Millennium Brooklyn High Scho... | [I really like that at Millennium Brooklyn Hig... | [Millennium Brooklyn High School offered me a ... |
957 | Francis Lewis High School | [4, 3, 5, 4, 3, 4, 3, 5, 4, 4, 4, 4, 5, 4, 4, ... | [I liked that Francis Lewis has cool teacher a... | [What can you say about the most overcrowded h... | [] | [I like Francis Lewis High School because of t... | [I liked that Francis Lewis has cool teacher a... | [I like the diversity of the school and the va... |
958 | Governor Livingston High School | [4, 5, 4, 4, 4, 3, 4, 2, 5, 4, 3, 4, 5, 5, 4, ... | [Some of the teachers do not care about the st... | [] | [Teachers kinda suck and the school itself is ... | [School is good academically, there is a lack... | [Some of the teachers do not care about the st... | [I feel very luck to have been able to attend ... |
959 | Wayne Hills High School | [3, 5, 3, 5, 4, 5, 4, 4, 5, 5, 2, 4, 4, 5, 4, ... | [Wayne Hills High School is located in a subur... | [] | [I went to Wayne Hills for about the year. I d... | [Wayne Hills High School is located in a subur... | [Wayne Hills is a very positive atmosphere and... | [I really enjoy this school and have liked my ... |
960 | Marlboro High School | [3, 4, 5, 4, 3, 5, 4, 4, 4, 4, 4, 5, 5, 4, 4, ... | [Good academics but teachers do not care much ... | [] | [] | [Good academics but teachers do not care much ... | [Marlboro High School has many opportunities f... | [Marlboro High School is a wonderful school wi... |
961 | Parsippany Hills High School | [5, 4, 4, 5, 5, 3, 5, 5, 5, 4, 4, 1, 4, 5, 4, ... | [Parsippany Hills High school is a safe and en... | [There seems to be way too much homework, no s... | [I think the schools is okay but the main issu... | [Good school, good class selection and mostly ... | [As a current senior, I have gone through almo... | [Parsippany Hills High school is a safe and en... |
962 | Hendrick Hudson High School | [4, 4, 5, 4, 3, 5, 1, 1, 3, 5, 5, 5, 5, 4, 3, ... | [HHHS is a great place to learn and grow as a ... | [The administration is horrible, the food is b... | [No one really notices that people do drugs af... | [The school itself, and the education *general... | [HHHS is a great place to learn and grow as a ... | [Support for the arts is amazing. Facilities ... |
963 | Wantagh Senior High School | [4, 5, 5, 5, 4, 5, 4, 5, 5, 5, 5, 4, 5, 5, 3, ... | [The teachers were heavily engaging and extrem... | [] | [] | [Wantagh provides a good education however the... | [The teachers were heavily engaging and extrem... | [Overall I believe that Wantagh Senior High Sc... |
964 | Pompton Lakes High School | [5, 5, 5, 5, 4, 5, 4, 5, 5, 5, 5, 5, 5, 5, 3, ... | [It was a great experience being in this schoo... | [] | [] | [For its size, there are quite a few AP classe... | [Tradition is what makes Pompton Lakes High Sc... | [It was a great experience being in this schoo... |
965 | John H. Glenn High School | [5, 3, 5, 4, 5, 5, 5, 5, 4, 5, 4, 3, 3, 4, 3, ... | [The interactions between the students and the... | [] | [Bullying is a major issue in this school. Som... | [It was an alright experience. There are defin... | [John Glenn is considered a small school. Our... | [The interactions between the students and the... |
966 | Bay Shore Senior High School | [4, 4, 3, 4, 5, 4, 3, 5, 5, 5, 4, 5, 4, 5, 5, ... | [My experience at Bay Shore high School was am... | [] | [] | [I loved the high range of opportunities and l... | [My experience at Bay Shore high School was am... | [I loved the diversity in this town and school... |
967 | Baruch College Campus High School | [5, 4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 5, 4, 5, 4, ... | [BCCHS is a wonderful place for flourishing mi... | [] | [] | [There are many great teachers at Baruch, and ... | [Baruch is great. Amazing teachers who are ded... | [BCCHS is a wonderful place for flourishing mi... |
968 | Fort Lee High School | [4, 4, 4, 5, 5, 5, 4, 4, 4, 4, 3, 4, 5, 4, 5, ... | [Fort Lee High School's strengths are in its d... | [] | [While there are hidden gems of teachers at FL... | [Fort Lee High School's small student populati... | [Fort Lee High School's strengths are in its d... | [My experience at Fort lee high school has bee... |
969 | Kings Collegiate Charter School | [4, 5, 4, 4, 3, 3, 4, 4, 4, 4, 5] | [I liked that the teachers did everything in t... | [] | [] | [It's a college. Not much for parents to do., ... | [I liked that the teachers did everything in t... | [Kings Colliagaite truly helped me become a be... |
970 | Locust Valley High School | [3, 5, 4, 4, 5, 5, 1, 5, 4, 3, 5, 4, 5, 5, 5, ... | [LVHS academic preparation isn't the best beca... | [The education and the teachers in this school... | [There isn't a great selection, and most of th... | [LVHS academic preparation isn't the best beca... | [The school is overall very good. The distric... | [Locust Valley is a great high school for many... |
971 | County Prep High School | [1, 5, 5, 5, 4, 3, 3, 4, 4, 4, 5, 4, 3, 3, 5, ... | [The HCST foundation is very crooked and broke... | [The HCST foundation is very crooked and broke... | [] | [I would love to see a change in the prices fo... | [The academics and teachers are very focused o... | [Amazing education in a safety environment, wi... |
972 | Northport Senior High School | [4, 4, 4, 4, 4, 5, 5, 5, 3, 3, 5, 5, 3, 3, 4, ... | [I had a great experience in the marching band... | [] | [] | [The teachers are great. There are a few bad a... | [I had a great experience in the marching band... | [Northport High is a great school with a wide ... |
973 | Dobbs Ferry High School | [3, 5, 4, 4, 1, 5, 4, 3, 4, 4, 5, 4, 5, 4, 4, ... | [What I would like to see change is in the inv... | [The school is absolutely tiny and focuses on ... | [I would say that there's almost something for... | [What I would like to see change is in the inv... | [I liked how involved the teachers were with t... | [This is an International Baccalaureate (IB) s... |
974 | Madison High School | [5, 4, 3, 5, 4, 5, 4, 3, 5, 5, 4, 4, 4, 5, 3, ... | [You get what you put into it at Madison High ... | [] | [] | [Madison High School had its pros and cons. It... | [This is a very friendly school. It has the sa... | [You get what you put into it at Madison High ... |
975 | Pascack Valley High School | [1, 4, 5, 5, 4, 4, 2, 4, 5, 5, 4, 4, 5, 5, 4, ... | [PV high school is the breeding ground for rac... | [PV high school is the breeding ground for rac... | [Where do I start. Pascack Valley is a great p... | [Academically, it is a very good school and so... | [Overall the school is pretty good. There are ... | [Very great school to attend for education. Th... |
976 | Freehold Borough High School | [5, 5, 4, 4, 3, 4, 5, 4, 3, 3, 4, 4, 4, 4, 5, ... | [Amazing school, people and great atmosphere. ... | [] | [] | [My experience at Freehold Borough High School... | [Very good school. Many excellent teachers and... | [Amazing school, people and great atmosphere. ... |
977 | Gen. Douglas MacArthur Senior High School | [5, 4, 4, 5, 4, 5, 4, 5, 5, 5, 4, 4, 4, 5, 4, ... | [Teachers and other staff members are very hel... | [] | [One could easily walk into this building with... | [This is an average school. The classes that ... | [The teachers are very nice and students are v... | [Teachers and other staff members are very hel... |
978 | Smithtown High School East | [4, 5, 3, 4, 4, 4, 3, 5, 4, 4, 5, 4, 4, 4, 5, ... | [Smithtown High School East is a great school.... | [] | [East is beast. West is the best. There is cel... | [Smithtown HS East as a school is great. The ... | [Smithtown High School East is a great school.... | [High School East provided me a wonderful educ... |
979 | NYC iSchool | [5, 5, 5, 5, 4, 4, 5, 4, 4, 4, 4, 5, 4, 5, 5, ... | [It's basically a liberal arts institute witho... | [] | [] | [It's alright. The environment and atmosphere ... | [I believe that this school is a very good sch... | [It's basically a liberal arts institute witho... |
980 | Mount Olive High School | [4, 4, 5, 5, 3, 3, 3, 4, 5, 4, 4, 5, 5, 5, 5, ... | [I enjoyed the higher-level classes and teache... | [] | [] | [Mount Olive High is a very good school with m... | [I enjoyed the higher-level classes and teache... | [The teachers were very caring. The school too... |
981 | Colts Neck High School | [4, 4, 4, 4, 4, 5, 3, 2, 4, 4, 5, 4, 5, 5, 5, ... | [Overall, Colts Neck High School is a great sc... | [] | [If your child is a minority, I recommend you ... | [Provided a good high school education and a d... | [Overall, Colts Neck High School is a great sc... | [I thought Colts Neck high school was a great ... |
982 rows × 8 columns
Below is the plot of 1 and 2 star reviews.
create_wc(' '.join([one_star_review, two_star_review]))
Here is the plot of 3, 4 and 5 star reviews.
create_wc(' '.join([three_star_review, four_star_review, five_star_review]))
From the two plots above, we can see 3, 4 and 5 star reviews have a stronger emphasis on positive words like diversity, caring students and so on, while in 1 and 2 star reviews, negative words like bullying and rude are mentioned more.
High star reviews:
Low star reviews:
Now we read in the graduation rate data scraped and classify the reviews based the graduation rate of the schools. A graduation rate above the 75th percentile (>0.92) is considered to be a high one and below the 25th percentile (<0.75) is considered to be a low one.
with open('df_full.csv', newline='', encoding="utf8") as f:
df_full = pd.read_csv(f)
low = df_full.graduation.quantile([0.25,0.75]).iloc[0]
high = df_full.graduation.quantile([0.25,0.75]).iloc[1]
# df_full[df_full.graduation<low].school_name
low_school = df_full[df_full.graduation<low].school_name.tolist()
# df[df['school_name'] == df_full['school_name']]
low_df = df.loc[df['school_name'].isin(low_school)]
low_review = ' '.join(low_df['5_star_review'].sum())
create_wc(low_review)
high_school = df_full[df_full.graduation>high].school_name.tolist()
# df[df['school_name'] == df_full['school_name']]
high_df = df.loc[df['school_name'].isin(high_school)]
high_review = ' '.join(high_df['5_star_review'].sum())
create_wc(high_review)
From the two plots above, we can see that words like diversity, support are mentioned more in high graduation rate schools, whereas more realistic and utility words like job and internship are mentioned more.
High graduation rate schools:
Low graduation rate schools:
Now let's analyze our feature data to answer the second question: How are public high schools in New York distributed based on different factors?
from geopy.geocoders import Nominatim
import re
import fractions
import requests
import matplotlib.pyplot as plt
%matplotlib inline
import geopandas as gpd
import seaborn as sns
In this part, we want to explore the geographical distribution of school features to find out the social factors which may influence the graduation rate of schools. First, we need to clean the clean the address data.
# Read and clean the data
# Read the four address files as dataframes
ad_file1 = pd.read_csv('school_address_174.csv')
ad_file2 = pd.read_csv('school_address_175_400.csv')
ad_file3 = pd.read_csv('school_address_400_764.csv')
ad_file4 = pd.read_csv('school_address_764_1006.csv')
# concatenate the four dataframes
frames = [ad_file1, ad_file2, ad_file3, ad_file4]
df_ad = pd.concat(frames)
# Rename the columns
df_ad.rename(columns={'Unnamed: 0':'school_name',
'0':'school_address'},
inplace=True)
#pd_ad.columns
# Clean the address column
df_ad['school_address'] = df_ad['school_address'].str.replace('Address', '')
df_ad['school_address'] = df_ad['school_address'].str.replace('\n', '')
# add whitespace between words starting with capital letters
def capital_words_spaces(address):
return re.sub(r"([a-z])([A-Z])", r"\1 \2", address)
df_ad['school_address'] = df_ad['school_address'].apply(capital_words_spaces)
# add whitespace between numbers and capital letters
def digital_space_capital(address):
return re.sub(r"(\d)([A-Z])", r"\1 \2", address)
df_ad['school_address'] = df_ad['school_address'].apply(digital_space_capital)
address = df_ad.to_csv (r'address.csv', index = None, header=True)
print(len(df_ad))
print(df_ad.head())
1006 school_name \ 0 Stuyvesant High School 1 Bergen County Academies 2 High Technology High School 3 Union County Magnet High School 4 Staten Island Technical High School school_address 0 345 Chambers St New York, NY 10282 1 200 Hackensack Avenue Hackensack, NJ 07601 2 765 Newman Springs Road Lincroft, NJ 07738 3 1776 Raritan Road Scotch Plains, NJ 07076 4 485 Clawson St Staten Island, NY 10306
Now we get a dataframe including addresses for all 1006 public high schools in New York Area. To plot the schools on the map, we use Geopy to convert the school addresses to coordinates. Geopy can be easily service timed out. So we break all schools into 10 dataframes to run separately.
# Geopy Address, Longtitude, Latitude
# get the correct address from Geopy
# easily service timed out so break into 100 one time
# a little bit slow, need to be patient
nom = Nominatim()
# ignore invalid addresses for Geopy
def eval_results(x):
try:
return (x.latitude, x.longitude)
except:
return (None, None)
/Users/ruwenyou/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:7: DeprecationWarning: Using Nominatim with the default "geopy/1.20.0" `user_agent` is strongly discouraged, as it violates Nominatim's ToS https://operations.osmfoundation.org/policies/nominatim/ and may possibly cause 403 and 429 HTTP errors. Please specify a custom `user_agent` with `Nominatim(user_agent="my-application")` or by overriding the default `user_agent`: `geopy.geocoders.options.default_user_agent = "my-application"`. In geopy 2.0 this will become an exception. import sys
df_ad['latitude'] = df_ad['school_address'][0:100].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address1.csv', index = None, header=True)
df_ad = pd.read_csv('address1.csv')
df_ad['latitude'] = df_ad['school_address'][100:200].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address2.csv', index = None, header=True)
df_ad = pd.read_csv('address2.csv')
df_ad['latitude'] = df_ad['school_address'][200:300].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address3.csv', index = None, header=True)
df_ad = pd.read_csv('address3.csv')
df_ad['latitude'] = df_ad['school_address'][300:400].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address4.csv', index = None, header=True)
df_ad = pd.read_csv('address4.csv')
df_ad['latitude'] = df_ad['school_address'][400:500].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address5.csv', index = None, header=True)
df_ad = pd.read_csv('address5.csv')
df_ad['latitude'] = df_ad['school_address'][500:600].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address6.csv', index = None, header=True)
df_ad = pd.read_csv('address6.csv')
df_ad['latitude'] = df_ad['school_address'][600:700].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address7.csv', index = None, header=True)
df_ad = pd.read_csv('address7.csv')
df_ad['latitude'] = df_ad['school_address'][700:800].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address8.csv', index = None, header=True)
df_ad = pd.read_csv('address8.csv')
df_ad['latitude'] = df_ad['school_address'][800:900].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address9.csv', index = None, header=True)
df_ad = pd.read_csv('address9.csv')
df_ad['latitude'] = df_ad['school_address'][900:1000].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address10.csv', index = None, header=True)
df_ad = pd.read_csv('address10.csv')
df_ad['latitude'] = df_ad['school_address'][1000:1100].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address11.csv', index = None, header=True)
Now we need to concatenate all the addresses into one dataframe.
# read all address csv into dataframe
# If you can't run the Geopy code above (can't request Geopy service)
# please read the csv files we already saved in the folders before.
df_ad1 = pd.read_csv('address1.csv')
df_ad2 = pd.read_csv('address2.csv')
df_ad3 = pd.read_csv('address3.csv')
df_ad4 = pd.read_csv('address4.csv')
df_ad5 = pd.read_csv('address5.csv')
df_ad6 = pd.read_csv('address6.csv')
df_ad7 = pd.read_csv('address7.csv')
df_ad8 = pd.read_csv('address8.csv')
df_ad9 = pd.read_csv('address9.csv')
df_ad10 = pd.read_csv('address10.csv')
df_ad11 = pd.read_csv('address11.csv')
# get the rows which contain the coordinates in each dataframe
df_ad1 = df_ad1[0:100]
df_ad2 = df_ad2[100:200]
df_ad3 = df_ad3[200:300]
df_ad4 = df_ad4[300:400]
df_ad5 = df_ad5[400:500]
df_ad6 = df_ad6[500:600]
df_ad7 = df_ad7[600:700]
df_ad8 = df_ad8[700:800]
df_ad9 = df_ad9[800:900]
df_ad10 = df_ad10[900:1000]
df_ad11 = df_ad11[1000:1050]
# Concatenate the dataframes
frames = [df_ad1, df_ad2, df_ad3, df_ad4, df_ad5, df_ad6, df_ad7, df_ad8, df_ad9, df_ad10, df_ad11]
df_ad = pd.concat(frames)
df_ad.to_csv (r'address_latitude.csv', index = None, header=True)
df_ad.head()
school_name | school_address | latitude | |
---|---|---|---|
0 | Stuyvesant High School | 345 Chambers St New York, NY 10282 | (40.7179876, -74.0138420443421) |
1 | Bergen County Academies | 200 Hackensack Avenue Hackensack, NJ 07601 | (40.9012393225806, -74.0353447741935) |
2 | High Technology High School | 765 Newman Springs Road Lincroft, NJ 07738 | (40.3262704, -74.1314886709229) |
3 | Union County Magnet High School | 1776 Raritan Road Scotch Plains, NJ 07076 | (40.6181661, -74.3568340380298) |
4 | Staten Island Technical High School | 485 Clawson St Staten Island, NY 10306 | (40.567911, -74.115454) |
We get a row of coordinates ("latitude") for each school now. Then we need to merge the school feature data with the coordinates data. Let's look at the school feature data first.
# read csv file of school features into dataframe
df_sch = pd.read_csv('first.csv')
print(df_sch.dtypes)
print(df_sch.head())
Unnamed: 0 object reading object math object graduation object Median Household Income object Median Rent object Median Home Value object Student_Number object Free_or_Reduced_Lunch object safe object happy object Student_Teacher_Ratio object dtype: object Unnamed: 0 reading math graduation \ 0 A. Hamilton Preparatory Academy 69% 26% 96% 1 A. Philip Randolph Campus High School 95% 94% 75% 2 Abraham Lincoln High School 89% 91% 67% 3 Academy Charter High School 8% 12% 85% 4 Academy for Allied Health Science 99% 92% 95% Median Household Income Median Rent Median Home Value Student_Number \ 0 45,186\n 1,110\n 268,700\n 983 1 46,449\n 1,126\n 348,669\n 1,385 2 38,620\n 1,081\n 309,222\n 1,979 3 74,861\n 1,368\n 408,500\n 176 4 117,765\n 1,481\n 481,500\n 291 Free_or_Reduced_Lunch safe happy Student_Teacher_Ratio 0 82% 88% 72% 14:01 1 83% 70% 73% 17:01 2 78% 66% 59% 16:01 3 87% -- -- 10:01 4 11% 91% 82% 16:01
From the overview above, we need to convert most of the columns to float first and clean some invalid values and unnecessary strings.
# rename the column
df_sch.rename(columns={'Unnamed: 0':'school_name',
'Median Household Income':'Median_Household_Income',
'Median Rent':'Median_Rent',
'Median Home Value':'Median_Home_Value'},
inplace=True)
# clean unnecessary string
df_sch['Median_Household_Income'] = df_sch['Median_Household_Income'].str.replace('\n', '')
df_sch['Median_Rent'] = df_sch['Median_Rent'].str.replace('\n', '')
df_sch['Median_Home_Value'] = df_sch['Median_Home_Value'].str.replace('\n', '')
df_sch['Median_Household_Income'] = df_sch['Median_Household_Income'].str.replace(',', '')
df_sch['Median_Rent'] = df_sch['Median_Rent'].str.replace(',', '')
df_sch['Median_Home_Value'] = df_sch['Median_Home_Value'].str.replace(',', '')
df_sch['Student_Number'] = df_sch['Student_Number'].str.replace(',', '')
df_sch['Student_Teacher_Ratio'] = df_sch['Student_Teacher_Ratio'].str.replace(':00', '')
df_sch['Student_Teacher_Ratio'] = df_sch['Student_Teacher_Ratio'].str.replace(':', '/')
# replace invalid value to NaN
df_sch.replace('_', np.nan, inplace = True)
df_sch.replace('--', np.nan, inplace = True)
df_sch.replace('970\n55 responses', np.nan, inplace = True)
# convert the types of the column to string/float
df_sch["school_name"] = df_sch['school_name'].astype(str)
df_sch['reading'] = df_sch['reading'].str.rstrip('%').astype('float') / 100.0
df_sch['math'] = df_sch['math'].str.rstrip('%').astype('float') / 100.0
df_sch['graduation'] = df_sch['graduation'].str.rstrip('%').astype('float') / 100.0
df_sch['Median_Household_Income'] = df_sch['Median_Household_Income'].astype('float')
df_sch['Median_Rent'] = df_sch['Median_Rent'].astype('float')
df_sch['Median_Home_Value'] = df_sch['Median_Home_Value'].astype('float')
df_sch['Student_Number'] = df_sch['Student_Number'].astype('float')
df_sch['Free_or_Reduced_Lunch'] = df_sch['Free_or_Reduced_Lunch'].str.rstrip('%').astype('float') / 100.0
df_sch['safe'] = df_sch['safe'].str.rstrip('%').astype('float') / 100.0
df_sch['happy'] = df_sch['happy'].str.rstrip('%').astype('float') / 100.0
df_sch['Student_Teacher_Ratio'] = df_sch['Student_Teacher_Ratio'].apply(lambda frac: float(sum([fractions.Fraction(x) for x in frac.split()])))
print(df_sch.dtypes)
print(df_sch.head())
school_name object reading float64 math float64 graduation float64 Median_Household_Income float64 Median_Rent float64 Median_Home_Value float64 Student_Number float64 Free_or_Reduced_Lunch float64 safe float64 happy float64 Student_Teacher_Ratio float64 dtype: object school_name reading math graduation \ 0 A. Hamilton Preparatory Academy 0.69 0.26 0.96 1 A. Philip Randolph Campus High School 0.95 0.94 0.75 2 Abraham Lincoln High School 0.89 0.91 0.67 3 Academy Charter High School 0.08 0.12 0.85 4 Academy for Allied Health Science 0.99 0.92 0.95 Median_Household_Income Median_Rent Median_Home_Value Student_Number \ 0 45186.0 1110.0 268700.0 983.0 1 46449.0 1126.0 348669.0 1385.0 2 38620.0 1081.0 309222.0 1979.0 3 74861.0 1368.0 408500.0 176.0 4 117765.0 1481.0 481500.0 291.0 Free_or_Reduced_Lunch safe happy Student_Teacher_Ratio 0 0.82 0.88 0.72 14.0 1 0.83 0.70 0.73 17.0 2 0.78 0.66 0.59 16.0 3 0.87 NaN NaN 10.0 4 0.11 0.91 0.82 16.0
Let's merge the coordinate address dataframe with the school feature dataframe and separate coordinates into latitude and longitude.
# merge the two dataframes
df_full = df_ad.merge(df_sch, how="inner", left_on = "school_name", right_on="school_name")
# rename the 'latitude' to 'coordinate'
df_full.rename(columns={'latitude':'coordinate'}, inplace=True)
# separate coordinates into latitude and longitude
df_full['coordinate'] = df_full['coordinate'].str.replace('(', '')
df_full['coordinate'] = df_full['coordinate'].str.replace(')', '')
from ast import literal_eval
df_full['coordinate'] = [literal_eval(x) for x in df_full['coordinate']]
df_full[['lat','lon']] = pd.DataFrame(df_full['coordinate'].values.tolist(), index=df_full.index)
df_full.to_csv (r'df_full.csv', index = None, header=True)
# check the df_full dataframe
df_full.head()
school_name | school_address | coordinate | reading | math | graduation | Median_Household_Income | Median_Rent | Median_Home_Value | Student_Number | Free_or_Reduced_Lunch | safe | happy | Student_Teacher_Ratio | lat | lon | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Stuyvesant High School | 345 Chambers St New York, NY 10282 | (40.7179876, -74.0138420443421) | 0.99 | 0.99 | 0.98 | 190216.0 | 3084.0 | 1143983.0 | 3368.0 | 0.45 | 0.93 | 0.80 | 21.0 | 40.717988 | -74.013842 |
1 | Bergen County Academies | 200 Hackensack Avenue Hackensack, NJ 07601 | (40.9012393225806, -74.0353447741935) | NaN | NaN | NaN | NaN | NaN | NaN | 1069.0 | 0.05 | 0.90 | 0.75 | 11.0 | 40.901239 | -74.035345 |
2 | High Technology High School | 765 Newman Springs Road Lincroft, NJ 07738 | (40.3262704, -74.1314886709229) | 0.99 | 0.95 | 0.95 | 132135.0 | 392.0 | 495800.0 | 291.0 | 0.02 | 0.82 | 0.64 | 13.0 | 40.326270 | -74.131489 |
3 | Union County Magnet High School | 1776 Raritan Road Scotch Plains, NJ 07076 | (40.6181661, -74.3568340380298) | 0.99 | 0.95 | 0.95 | 117765.0 | 1481.0 | 481500.0 | 298.0 | 0.10 | 0.83 | 0.67 | 13.0 | 40.618166 | -74.356834 |
4 | Staten Island Technical High School | 485 Clawson St Staten Island, NY 10306 | (40.567911, -74.115454) | 0.99 | 0.99 | 0.99 | 93479.0 | 1441.0 | 539760.0 | 1313.0 | 0.39 | 0.97 | 0.89 | 23.0 | 40.567911 | -74.115454 |
# drop the rows which don't have invalid latitudes and longtitudes
df_full = df_full[np.isfinite(df_full['lat'])]
We have finished all the prior work. Let's use Folium to make the interactive data map for school features.
import folium
from folium import plugins
Let's first look at the geographical distribution of schools. We use heatmap to show it. The red color means schools gather in this area. From the maps below, we can see most of the schools gather in Bronx, Brooklyn and Mahattan.
# create the heatmap of school distribution
schools_heatmap = folium.Map(location=[df_full['lat'].mean(), df_full['lon'].mean()], zoom_start=10)
schools_heatmap.add_children(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in df_full.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap
/Users/ruwenyou/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: FutureWarning: Method `add_children` is deprecated. Please use `add_child` instead. This is separate from the ipykernel package so we can avoid doing imports until
# If the html doesn't show, please see the folloing picture.
Now let's explore the geographic distribution of graduation rate for the high schools. We use four colors to show four quartile ranges of the graduation rate. The lighter the color is, the higher the graduation is.
# drop the rows with invalid graduation data.
df_graduation = df_full[np.isfinite(df_full['graduation'])]
# create the column of four quartile ranges of the graduation rate
df_graduation['graduation_q'] = pd.qcut(df_graduation['graduation'], 4, labels=False)
/Users/ruwenyou/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy """
# check the column
df_graduation['graduation_q'].head()
0 3 2 2 3 2 4 3 5 3 Name: graduation_q, dtype: int64
# set the color of each range of graduation rate
# lightblue: 75-100%; lightgreen: 50-75%; orange: 25-50%; red: 0-25%.
colordict = {3: 'lightblue', 2: 'lightgreen', 1: 'orange', 0: 'red'}
# create the center of the map based on the mean of coordinates
graduation_map = folium.Map(location=[df_graduation['lat'].mean(), df_graduation['lon'].mean()], zoom_start=10)
# create the graduation distribution map
for lat, lon, school_name, graduation, graduation_q in zip(df_graduation['lat'], df_graduation['lon'], df_graduation['school_name'], df_graduation['graduation'], df_graduation['graduation_q']):
folium.CircleMarker(
[lat, lon],
radius=graduation*10,
popup = ('School: ' + str(school_name) + '<br>'
'Graduation: ' + str(graduation) + '<br>'
),
color='b',
key_on = graduation_q,
threshold_scale=[0,1,2,3],
fill_color=colordict[graduation_q],
fill=True,
fill_opacity=0.7
).add_to(graduation_map)
graduation_map.save("graduation_heatmap.html")
graduation_map
# If the html doesn't show, please see the following picture
As you can see from the above map, most of the lower graduation school gathers in the Bronx, Mahattan and Brooklyn. We hypothesize that some factors related to the districts may influence the graduation rate. We pick "Median_Household_Income" as a factor and make a geographic map for it. We want to see if the income level is related to the graduation rate.
We still divide the income level based on the quantile ranges of "Median_Household_Income" and use the same way to plot the household income level of each school. The lighter the color is, the higher the income level is.
# drop the rows with invalid 'Median_Household_Income'
df_income = df_full[np.isfinite(df_full['Median_Household_Income'])]
# set the color of each range of graduation rate
# lightblue: 75-100%; lightgreen: 50-75%; orange: 25-50%; red: 0-25%.
df_income['income_q'] = pd.qcut(df_income['Median_Household_Income'], 4, labels=False)
colordict = {3: 'lightblue', 2: 'lightgreen', 1: 'orange', 0: 'red'}
# create the center of the map based on the mean of coordinates
income_map = folium.Map(location=[df_full['lat'].mean(), df_full['lon'].mean()], zoom_start=9)
# create the graduation distribution map
for lat, lon, school_name, income, income_q in zip(df_income['lat'], df_income['lon'], df_income['school_name'], df_income['Median_Household_Income'], df_income['income_q']):
folium.CircleMarker(
[lat, lon],
radius=income*0.00015,
popup = ('School: ' + str(school_name) + '<br>'
'Median Income: ' + str(income) + '<br>'
),
color='b',
key_on = income_q,
threshold_scale=[0,1,2,3],
fill_color=colordict[income_q],
fill=True,
fill_opacity=0.7
).add_to(income_map)
income_map.save("income_heatmap.html")
income_map
/Users/ruwenyou/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy This is separate from the ipykernel package so we can avoid doing imports until
# If the html can't show, please see the following picture
Comparing the income map above with the prior graduation map, you can see that there are similar patterns between them. For example, the Bronx and Brooklyn both have lower graduation rate and lower income level. But we can also see exceptions. For example, Manhatton has higher level income level but lower graduation rate.
Now, let's explore more to ask the third question: What factors might influence the average graduation rate of public high schools in NY?
We use T-test to find out the difference between the patterns in the map. We find the clusters in the pictures below and use the coordinates of boundary schools in the clusters to locate them.
# find the coordinates of the boundary schools
# create the dataframe for the two clusters
lon_l_index = df_graduation.loc[df_graduation['school_name']=='Marta Valle High School'].index[0]
lon_l = df_graduation['lon'][lon_l_index]
lon_r_index = df_graduation.loc[df_graduation['school_name']=='Pan American International High School'].index[0]
lon_r = df_graduation['lon'][lon_r_index]
lat_d_index = df_graduation.loc[df_graduation['school_name']=='Academy of Medical Technology - A College Board School'].index[0]
lat_d = df_graduation['lat'][lat_d_index]
lat_u_index = df_graduation.loc[df_graduation['school_name']=='International Leadership Charter School'].index[0]
lat_u = df_graduation['lat'][lat_u_index]
df_l_gra = df_graduation[df_graduation['lon'] > lon_l]
df_r_gra = df_l_gra[df_l_gra['lon'] < lon_r]
df_d_gra = df_r_gra[df_r_gra['lat'] > lat_d]
df_gra_low = df_d_gra[df_d_gra['lat'] < lat_u]
lon2_index = df_graduation.loc[df_graduation['school_name']=='Queens High School of Teaching'].index[0]
lon2 = df_graduation['lon'][lon2_index]
lon3_index = df_graduation.loc[df_graduation['school_name']=='Sewanhaka High School'].index[0]
lon3 = df_graduation['lon'][lon3_index]
lat2_index = df_graduation.loc[df_graduation['school_name']=='Oceanside High School'].index[0]
lat2 = df_graduation['lat'][lat2_index]
lat3_index = df_graduation.loc[df_graduation['school_name']=='Earl L. Vandermeulen High School'].index[0]
lat3 = df_graduation['lat'][lat3_index]
df1_gra = df_graduation[df_graduation['lon'] > lon2]
df2 = df1_gra[df1_gra['lon'] < lon3]
df3 = df2[df2['lat'] > lat2]
df_gra_high = df3[df3['lat'] < lat3]
# T-test on all the features of the schools
from scipy.stats import ttest_ind
Reading = ttest_ind(df_gra_low['reading'], df_gra_high['reading'])
df_gra_low = df_gra_low[np.isfinite(df_gra_low['math'])]
df_gra_high = df_gra_high[np.isfinite(df_gra_high['math'])]
Math = ttest_ind(df_gra_low['math'], df_gra_high['math'])
Median_Household_Income = ttest_ind(df_gra_low['Median_Household_Income'], df_gra_high['Median_Household_Income'])
Median_Rent = ttest_ind(df_gra_low['Median_Rent'], df_gra_high['Median_Rent'])
Median_Home_Value = ttest_ind(df_gra_low['Median_Home_Value'], df_gra_high['Median_Home_Value'])
Student_Number = ttest_ind(df_gra_low['Student_Number'], df_gra_high['Student_Number'])
Free_or_Reduced_Lunch = ttest_ind(df_gra_low['Free_or_Reduced_Lunch'], df_gra_high['Free_or_Reduced_Lunch'])
df_gra_low = df_gra_low[np.isfinite(df_gra_low['safe'])]
df_gra_high = df_gra_high[np.isfinite(df_gra_high['safe'])]
Safe = ttest_ind(df_gra_low['safe'], df_gra_high['safe'])
df_gra_low = df_gra_low[np.isfinite(df_gra_low['happy'])]
df_gra_high = df_gra_high[np.isfinite(df_gra_high['happy'])]
Happy = ttest_ind(df_gra_low['happy'], df_gra_high['happy'])
Student_Teacher_Ratio = ttest_ind(df_gra_low['Student_Teacher_Ratio'], df_gra_high['Student_Teacher_Ratio'])
# convert the T-test results to a dataframe
df_ttest = pd.DataFrame()
df_ttest['Reading'] = Reading
df_ttest['Math'] = Math
df_ttest['Median_Household_Income'] = Median_Household_Income
df_ttest['Median_Rent'] = Median_Rent
df_ttest['Median_Home_Value'] = Median_Home_Value
df_ttest['Student_Number'] = Student_Number
df_ttest['Free_or_Reduced_Lunch'] = Free_or_Reduced_Lunch
df_ttest['Safe'] = Safe
df_ttest['Happy'] = Happy
df_ttest['Student_Teacher_Ratio'] = Student_Teacher_Ratio
df_ttest_transposed = df_ttest.T.round(3)
df_ttest_transposed.rename(columns={0: "T-Value", 1: "p-value"})
T-Value | p-value | |
---|---|---|
Reading | -1.654 | 0.099 |
Math | -1.969 | 0.050 |
Median_Household_Income | -4.594 | 0.000 |
Median_Rent | -1.705 | 0.089 |
Median_Home_Value | -1.401 | 0.162 |
Student_Number | -2.263 | 0.024 |
Free_or_Reduced_Lunch | 8.770 | 0.000 |
Safe | 0.177 | 0.860 |
Happy | 0.505 | 0.614 |
Student_Teacher_Ratio | 1.820 | 0.071 |
From the T-test results above, we can see that Math, Median_Household_Income, Student_Number and Free_or_Reduced_Lunch have significant difference between the two clusters. All these features are higher in schools with higher graduation rates except Free_or_Reduced_Lunch. The biggest difference also happens on Free_or_Reduced_Lunch. We can make assumption that study (Math), social economic status (Median_Household_Income), Student_Number all have positve influence on the graduation rate. But Low graduation rate school may have more needs on free_or_Reduced_Lunch.
Let's go back to the full sample and use correlation analysis to test our assumptions.
# read the full sample data
df_full = pd.read_csv('df_full.csv')
# remove the coordinates data
df_corr = df_full[['reading', 'math', 'graduation', 'Median_Household_Income',
'Median_Rent', 'Median_Home_Value', 'Student_Number', 'Free_or_Reduced_Lunch',
'safe', 'happy', 'Student_Teacher_Ratio']]
# correlation Analysis
cm = df_corr.corr()
cm
reading | math | graduation | Median_Household_Income | Median_Rent | Median_Home_Value | Student_Number | Free_or_Reduced_Lunch | safe | happy | Student_Teacher_Ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|
reading | 1.000000 | 0.933167 | 0.055198 | 0.181728 | 0.256807 | 0.296964 | 0.053216 | -0.051341 | 0.251841 | 0.229925 | 0.366205 |
math | 0.933167 | 1.000000 | 0.013754 | 0.116594 | 0.214791 | 0.262026 | 0.042157 | 0.028645 | 0.224151 | 0.214398 | 0.371856 |
graduation | 0.055198 | 0.013754 | 1.000000 | 0.400917 | 0.234131 | 0.118879 | 0.313467 | -0.526028 | 0.374493 | 0.188272 | -0.031906 |
Median_Household_Income | 0.181728 | 0.116594 | 0.400917 | 1.000000 | 0.749960 | 0.731328 | 0.209494 | -0.679098 | 0.318926 | 0.113441 | -0.119405 |
Median_Rent | 0.256807 | 0.214791 | 0.234131 | 0.749960 | 1.000000 | 0.721526 | 0.109859 | -0.353262 | 0.197032 | 0.038397 | 0.004937 |
Median_Home_Value | 0.296964 | 0.262026 | 0.118879 | 0.731328 | 0.721526 | 1.000000 | 0.072537 | -0.258025 | 0.200933 | 0.052857 | 0.051414 |
Student_Number | 0.053216 | 0.042157 | 0.313467 | 0.209494 | 0.109859 | 0.072537 | 1.000000 | -0.265592 | -0.018810 | 0.114802 | 0.175205 |
Free_or_Reduced_Lunch | -0.051341 | 0.028645 | -0.526028 | -0.679098 | -0.353262 | -0.258025 | -0.265592 | 1.000000 | -0.373909 | -0.165836 | 0.193799 |
safe | 0.251841 | 0.224151 | 0.374493 | 0.318926 | 0.197032 | 0.200933 | -0.018810 | -0.373909 | 1.000000 | 0.618294 | 0.002469 |
happy | 0.229925 | 0.214398 | 0.188272 | 0.113441 | 0.038397 | 0.052857 | 0.114802 | -0.165836 | 0.618294 | 1.000000 | 0.091594 |
Student_Teacher_Ratio | 0.366205 | 0.371856 | -0.031906 | -0.119405 | 0.004937 | 0.051414 | 0.175205 | 0.193799 | 0.002469 | 0.091594 | 1.000000 |
# visualize the correlation matrix
plt.figure(figsize = (10, 7))
sns.heatmap(cm, annot = True)
plt.xlabel('correlation')
plt.ylabel('correlation')
Text(68.09374999999999, 0.5, 'correlation')
From the correlation matrix above, we can see that Free_or_Reduced_Lunch has bigger negative correlation with both graduation and Median_Household_Income. Schools with high free or reduced lunch needs also need more economic support. From the correlation coeffients between graduation and social economic status factors (Median_Household_Income, Median_Rent, Median_Home_Value), we can see that high social economic status will have positive influence on the graduation rate.
It's also suprised that math and reading scores don't have strong correlation with graduation compared to other features. From this we can speculate that if we define the school in terms of graduation rate and academic performance, we may get completely different results. Good grades may not lead to high graduation rates.
In the final part, we want to use the features we have to build a predictive model for graduation rate. We use the average graduation rate to classify the graduation rate into two categories: high graduation rate (>average) and low graduation rate (<average). We choose Decision Tree, Random Forest and KNeighbors to build the model, and compare the accuracy.
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier as DTree
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report as cls_rep
from sklearn.metrics import accuracy_score as ac
# read the full data set
df_full = pd.read_csv('df_full.csv')
# use the mean of each column to replace the invalid values
df_full['reading'].fillna((df_full['reading'].mean()), inplace=True)
df_full['math'].fillna((df_full['math'].mean()), inplace=True)
df_full['graduation'].fillna((df_full['graduation'].mean()), inplace=True)
df_full['Median_Household_Income'].fillna((df_full['Median_Household_Income'].mean()), inplace=True)
df_full['Median_Rent'].fillna((df_full['Median_Rent'].mean()), inplace=True)
df_full['Median_Home_Value'].fillna((df_full['Median_Home_Value'].mean()), inplace=True)
df_full['Student_Number'].fillna((df_full['Student_Number'].mean()), inplace=True)
df_full['Free_or_Reduced_Lunch'].fillna((df_full['Free_or_Reduced_Lunch'].mean()), inplace=True)
df_full['safe'].fillna((df_full['safe'].mean()), inplace=True)
df_full['happy'].fillna((df_full['happy'].mean()), inplace=True)
df_full['Student_Teacher_Ratio'].fillna((df_full['Student_Teacher_Ratio'].mean()), inplace=True)
# recode the graduation rate with 1 (higher than average), 0 (lower than average)
gra_ave = df_full['graduation'].mean()
print(gra_ave)
df_full["gra_high"] = df_full['graduation'].apply(lambda x: 1 if x > gra_ave else 0)
df = df_full[['reading', 'math', 'gra_high', 'Median_Household_Income', 'Median_Rent',
'Median_Home_Value', 'Student_Number', 'Free_or_Reduced_Lunch', 'safe',
'happy', 'Student_Teacher_Ratio']]
df.head()
0.8129700854700852
reading | math | gra_high | Median_Household_Income | Median_Rent | Median_Home_Value | Student_Number | Free_or_Reduced_Lunch | safe | happy | Student_Teacher_Ratio | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.990000 | 0.990000 | 1 | 190216.000000 | 3084.000000 | 1.143983e+06 | 3368.0 | 0.45 | 0.93 | 0.80 | 21.0 |
1 | 0.736527 | 0.646735 | 1 | 76707.738659 | 1443.290927 | 4.366174e+05 | 1069.0 | 0.05 | 0.90 | 0.75 | 11.0 |
2 | 0.990000 | 0.950000 | 1 | 132135.000000 | 392.000000 | 4.958000e+05 | 291.0 | 0.02 | 0.82 | 0.64 | 13.0 |
3 | 0.990000 | 0.950000 | 1 | 117765.000000 | 1481.000000 | 4.815000e+05 | 298.0 | 0.10 | 0.83 | 0.67 | 13.0 |
4 | 0.990000 | 0.990000 | 1 | 93479.000000 | 1441.000000 | 5.397600e+05 | 1313.0 | 0.39 | 0.97 | 0.89 | 23.0 |
# functions for model fit
def run_clf(X_train,X_test,y_train,class_type,**kwargs):
clf = class_type(**kwargs)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
return y_pred
def run_RF(X_train,X_test,y_train,RF):
clf = RF(n_estimators=100)
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
return y_pred
# split train and test data
y = df.gra_high
X = df[['reading','math','Median_Household_Income','Median_Rent','Median_Home_Value','Student_Number',
'Free_or_Reduced_Lunch','safe','happy','Student_Teacher_Ratio']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
(812, 10) (812,) (204, 10) (204,)
# result of the model
target_names=["High Graduation","Low Graduation"]
i=0
method=['DecisionTree','K-NN']
cls=[DTree,KNN]
while i<=1:
print(method[i])
print(cls_rep(y_test, run_clf(X_train,X_test,y_train,cls[i]),target_names=target_names))
print('Accuracy:',ac(y_test, run_clf(X_train,X_test,y_train,cls[i])))
i+=1
print('Randomforest')
print(cls_rep(y_test, run_RF(X_train,X_test,y_train,RF),target_names=target_names))
print('Accuracy:',ac(y_test, run_RF(X_train,X_test,y_train,RF)))
DecisionTree precision recall f1-score support High Graduation 0.61 0.69 0.65 67 Low Graduation 0.84 0.79 0.81 137 accuracy 0.75 204 macro avg 0.73 0.74 0.73 204 weighted avg 0.76 0.75 0.76 204 Accuracy: 0.7696078431372549 K-NN precision recall f1-score support High Graduation 0.56 0.49 0.52 67 Low Graduation 0.77 0.81 0.79 137 accuracy 0.71 204 macro avg 0.66 0.65 0.66 204 weighted avg 0.70 0.71 0.70 204 Accuracy: 0.7058823529411765 Randomforest precision recall f1-score support High Graduation 0.79 0.78 0.78 67 Low Graduation 0.89 0.90 0.89 137 accuracy 0.86 204 macro avg 0.84 0.84 0.84 204 weighted avg 0.86 0.86 0.86 204 Accuracy: 0.8333333333333334
We evaluate the performance of the classifiers by using AUC and ROC. Below are the way we build our AUC - ROC Curves.
from sklearn.metrics import roc_curve,auc
# Compute ROC curve and ROC area for each class
def acu_curve(y_test,y_pred,cls_me,color):
fpr, tpr, thresholds = roc_curve(y_test, y_pred) #fpr and tpr
roc_auc =auc(fpr, tpr) # auc
lw = 2
plt.plot(fpr, tpr, color=color,
lw=lw, label='ROC curve of {} '.format(cls_me)+'(area = %0.3f)' % roc_auc) ###x-axis: fpr,y-axis:tpr
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
ax.set_xlim([0.0, 1.0])
ax.set_ylim([0.0, 1.0])
ax.set_xlabel('False Positive Rate')
ax.set_xlabel('True Positive Rate')
ax.set_title('ROC curve of different classification methods')
ax.legend(loc="lower right")
fig, ax = plt.subplots(figsize = (6, 6))
acu_curve(y_test, run_clf(X_train,X_test,y_train,DTree),'DecisionTree','gold')
acu_curve(y_test, run_clf(X_train,X_test,y_train,KNN),'K-NN','darkorange')
acu_curve(y_test, run_RF(X_train,X_test,y_train,RF),'RandomForest','purple')
plt.show()
As we can see from the curves, Random Forest has the best performance. Let's use it to predict the graduation rate for test dataset.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred
array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0])
%pip install selenium
import os
import sys
import time
import zipfile
import re
import csv
from urllib.request import urlretrieve
from selenium import webdriver
from getpass import getpass
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import ElementClickInterceptedException
import pandas as pd
from bs4 import BeautifulSoup
import re
%%capture
""" ********** ********** """
try:
from configparser import SafeConfigParser
except ImportError:
from ConfigParser import SafeConfigParser
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome('./chromedriver')
driver.set_window_size(1080,800) #Required, removes the "element not found" bug
try:
input = raw_input
except NameError:
pass
def clear():
if os.name == 'nt':
os.system('cls')
else :
os.system('clear')
""" ********************************************************************** """
school_lists = []
school_info = {}
school_academic = {}
school_family = {}
school_stu_num = {}
school_heal_safe = {}
school_stu_tea = {}
school_stu_club = {}
def search():
# go to Search link
for n in range(0,46):
driver.get("https://www.niche.com/k12/search/best-public-high-schools/m/new-york-city-metro-area/?page="+format(n))
# get expected school from search link
for school_anchor in driver.find_elements_by_css_selector("a.search-result__link"):
try:
sponsor = school_anchor.find_element_by_class_name("search-result__sponsored-bar")
except NoSuchElementException:
link = school_anchor.get_attribute("href")
title = school_anchor.find_element_by_class_name("search-result__title").get_attribute("innerText")
school_lists.append([title, link])
if __name__ == '__main__':
clear()
credentials_from_file = False
credentials = SafeConfigParser();
search()
""" ********** ********** """
try:
from configparser import SafeConfigParser
except ImportError:
from ConfigParser import SafeConfigParser
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome('./chromedriver')
driver.set_window_size(1080,800) #Required, removes the "element not found" bug
try:
input = raw_input
except NameError:
pass
def clear():
if os.name == 'nt':
os.system('cls')
else :
os.system('clear')
""" ********************************************************************** """
school_info = {}
school_academic = {}
school_family = {}
school_stu_num = {}
school_heal_safe = {}
school_stu_tea = {}
def school_details(school_lists):
#[:5] to get the first five
for school in school_lists[:1009]:
print("Visiting", school[0])
#wait for 10 milliseconds before clicking on a link
driver.get(school[1])
driver.implicitly_wait(10)
#academic
academic_dic = {}
for value in driver.find_elements_by_css_selector('.block--two-two .profile__buckets'):
try:
academic_dic["reading"] = value.find_element_by_css_selector('.profile__bucket--1 .scalar__value').get_attribute("innerText")
except NoSuchElementException:
academic_dic["reading"] = "_"
try:
academic_dic["math"] = value.find_element_by_css_selector('.profile__bucket--2 .scalar__value').get_attribute("innerText")
except NoSuchElementException:
academic_dic["math"] = "_"
try:
academic_dic["graduation"] = value.find_element_by_css_selector('.profile__bucket--3 .scalar__value').get_attribute("innerText")
except NoSuchElementException:
academic_dic["graduation"] = "_"
school_academic[school[0]]=academic_dic
#family
family = []
for value in driver.find_elements_by_css_selector('.block--location .scalar__value'):
family.append(value.get_attribute("innerText"))
school_family[school[0]]=family
#students-health&safety
heal_safe = []
for value in driver.find_elements_by_css_selector('.block--two .poll__single__percent__label'):
heal_safe.append(value.get_attribute("innerText"))
school_heal_safe[school[0]]=heal_safe
stu_num = []
for value in driver.find_element_by_id('students').find_elements_by_css_selector('.scalar__value'):
stu_num.append(value.get_attribute("innerText"))
school_stu_num[school[0]]=stu_num
stu_tea = []
stu_tea.append(driver.find_element_by_id('teachers').find_element_by_css_selector('.scalar__value').get_attribute("innerText").split('\n')[0])
school_stu_tea[school[0]]=stu_tea
#reviews
#print("percent proficient-reading:", reading)
if __name__ == '__main__':
clear()
credentials_from_file = False
credentials = SafeConfigParser();
school_details(school_lists)
""" ********** ********** """
try:
from configparser import SafeConfigParser
except ImportError:
from ConfigParser import SafeConfigParser
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome('./chromedriver')
driver.set_window_size(1080,800) #Required, removes the "element not found" bug
try:
input = raw_input
except NameError:
pass
def clear():
if os.name == 'nt':
os.system('cls')
else :
os.system('clear')
""" ********************************************************************** """
school_review={}
def review_body(html_doc):
soup = BeautifulSoup(html_doc, 'lxml')
return [item.getText() for item in soup.find_all("div","overflow-text__content")]
def review_star(html_doc):
classes = []
each_cls = []
soup = BeautifulSoup(html_doc, 'lxml')
star = soup.find_all("div","review__stars")
for item in star:
for element in item.find_all(class_=True):
#print(element["class"])
each_cls.extend(element["class"])
#print(classes)
return each_cls
def get_reviews():
for school in school_lists[0:60]:
print("Visiting",school[0])
driver.implicitly_wait(10)
driver.get(school[1]+"reviews/")
# wait a bit before extracting reviews
driver.implicitly_wait(20)
all_reviews = []
# loop until we found all reviews
page_num = 0
while page_num < 4:
try:
review_htmls = driver.find_element_by_css_selector('section.reviews-expansion-bucket').get_attribute('innerHTML')
reviews_in_page = review_body(review_htmls)
star_in_page = review_star(review_htmls)
if reviews_in_page[0] not in all_reviews:
all_reviews.extend([reviews_in_page,star_in_page])
#print(page_num)
page_num += 1
# otherwise, click Next button
driver.implicitly_wait(20)
next_button = driver.find_element_by_css_selector('.pagination__next')
next_button.click()
driver.implicitly_wait(100)
# if page still refreshing after clicking Next button, wait for a bit and then repeat
except StaleElementReferenceException:
driver.implicitly_wait(20)
except ElementClickInterceptedException:
driver.implicitly_wait(20)
except NoSuchElementException:
break
school_review[school[0]]=all_reviews
print("last:",len(all_reviews),"Next")
return all_reviews
if __name__ == '__main__':
clear()
credentials_from_file = False
credentials = SafeConfigParser();
get_reviews()
""" ********** ********** """
try:
from configparser import SafeConfigParser
except ImportError:
from ConfigParser import SafeConfigParser
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome('./chromedriver')
driver.set_window_size(1080,800) #Required, removes the "element not found" bug
try:
input = raw_input
except NameError:
pass
def clear():
if os.name == 'nt':
os.system('cls')
else :
os.system('clear')
""" ********************************************************************** """
school_address = {}
def address(school_lists):
#[:5] to get the first five
for school in school_lists[:1]:
print("Visiting", school[0])
#wait for 10 milliseconds before clicking on a link
driver.get("https://www.niche.com/k12/professional-performing-arts-high-school-new-york-ny/")
driver.implicitly_wait(10)
address = []
address.append(driver.find_element_by_css_selector('.profile__address').get_attribute("innerText"))
print(address)
school_address[school[0]]=address
if __name__ == '__main__':
clear()
credentials_from_file = False
credentials = SafeConfigParser();
address(school_lists)