What makes a good public high school in NYC area?¶

0 Problem and Motivation¶

This project is guided by the big question: What makes a good public high school in NYC area?

0-1 WHY PUBLIC HIGH SCHOOL SO IMPORTANT?¶

Have you ever thought about dropping out from high school? Have you ever worried about your scores in high shcool? No matter what your answer is, graduation from high school is a bare minimum for students’ later academic and socioeconomic success. Also, prior researches indicated that high school grades are a stronger incremental predictor of college outcomes—including what is arguably the most important outcome of all: graduation. (Galla, Brian M., et al., 2019)

However, in the United States, Low academic performance in high school and increasing drop-out rates become major educational issues. it is estimated that Nearly one-third of all public high school students in the U.S. fail to graduate in four years (Chapman et al., 2011; Stillwell and Sable, 2013).

0-2 OUR QUESTIONS¶

What do parents/students usually focus on when they write reviews about high schools?
How are public high schools in New York distributed based on different factors?
What factors might influence the average graduation rate of public high schools in NY?

Note: please use "pip install -r requirements.txt" in your terminal to install the following packages first:
pandas os csv ast numpy textblob seaborn matplotlib wordcloud scipy geopy re fractions requests geopandas folium sklearn selenium time zipfile urllib getpass bs4

1- Relevant data and Collection¶

We are colleting and processing the data from Niche

When doing the data collection, we didn't find any official API from the website. So we use Selenium to access to the website, locate elements in a page, and then collect relative variables. There are three steps to collect our data.

Step 1: Collect all the schools we need with their name and the url of detailed information. Below are the columns.

name – the school name
url – the url of its detailed information

Step 2: Collect relative feeatures from the school page in the school list.

Average Graduation Rate —— The percentage of 12th graders who gradu
Percent Proficient - Math – The percentage of students that scored at or above proficiency levels on their state math assessment test
Percent Proficient - Reading – This is the percentage of students that scored at or above proficiency levels on their state reading/language arts assessment test.
Median Household Income – The median combined gross income of all members of a household who are 15 years or older in the neighborhood.
Median Rent – The median rent in the neighborhood.
Median Home Value – The median amount for which a house would likely sell in the neighborhood.
Student Number – The student number in the school.
Free or Reduced Lunch – The percentage of free or reduced lunch in lunch fee.
Safe – The percentage of students who feel safe in the school survey.
Happy – The percentage of students who feel happy in the school survey.
Student Teacher Ratio – The student teacher ratio in the school.
Address – The school address in the school list

Step 3: Collect 60 latest reviews of each school in the school list.

School – The school name.
Reviews – The latest review list of that school.
Rate – The star of the specific review.

Note: To improve on the readability of the Notebook, we have added the code for our scrappers in the Appendix at the end of this notebook.

2- Review Analysis¶

First, let's analyze the reviews and try to answer our first question: what do parents/students usually focus on when they write reviews about high schools?

In [1]:

import pandas as pd
import os
import csv
import ast
import numpy as np
from textblob import TextBlob
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from scipy.stats.stats import pearsonr   

First we read in data from several .csv file we scraped from the website. It contains the review text data and their respective rating stars, but it's still in a very primitive and unprocessed format. So we did some cleaning and preprocessing later.

In [2]:

def read_allcsv(fname_list):
    df = pd.DataFrame()
    for fname in fname_list:
        with open('review/review/{}'.format(fname), newline='', encoding="utf8") as f:
            df_i = pd.read_csv(f)
        df = df.append(df_i)
    return df

all_f = os.listdir("review/review")
df = read_allcsv(all_f)

In [3]:

df.head()

Out[3]:

	Unnamed: 0	0	1	2	3	4	5	6	7
0	East Meadow High School	["The teacher's are pretty pog and are always ...	['review__stars__icon', 'review__stars__icon--...	["The teacher's are pretty pog and are always ...	['review__stars__icon', 'review__stars__icon--...	['Very strict faculty. Kida are getting wilder...	['review__stars__icon', 'review__stars__icon--...	['It is a decent school, 3 stars pretty much e...	['review__stars__icon', 'review__stars__icon--...
1	Glen Rock High School	["Glen Rock High School will hold some of my f...	['review__stars__icon', 'review__stars__icon--...	["Glen Rock High School will hold some of my f...	['review__stars__icon', 'review__stars__icon--...	['Glen Rock High School had many teachers who ...	['review__stars__icon', 'review__stars__icon--...	["I don't eat school food so I can't comment. ...	['review__stars__icon', 'review__stars__icon--...
2	Millennium High School	['cool and nice. academics are good and there ...	['review__stars__icon', 'review__stars__icon--...	['cool and nice. academics are good and there ...	['review__stars__icon', 'review__stars__icon--...	["Aspects of life at Millennium High School th...	['review__stars__icon', 'review__stars__icon--...	["Aspects of life at Millennium High School th...	['review__stars__icon', 'review__stars__icon--...
3	Croton-Harmon High School	['Croton Harmon High School has a small town f...	['review__stars__icon', 'review__stars__icon--...	['Croton Harmon High School has a small town f...	['review__stars__icon', 'review__stars__icon--...	['School has many opportunities to succeed', '...	['review__stars__icon', 'review__stars__icon--...	['School has many opportunities to succeed', '...	['review__stars__icon', 'review__stars__icon--...
4	Scotch Plains-Fanwood High School	['Scotch Plains-Fanwood High School gives a un...	['review__stars__icon', 'review__stars__icon--...	['Scotch Plains-Fanwood High School gives a un...	['review__stars__icon', 'review__stars__icon--...	['What I liked about my high school experience...	['review__stars__icon', 'review__stars__icon--...	['What I liked about my high school experience...	['review__stars__icon', 'review__stars__icon--...

As you can see, the dataframe has a column of school names. At the same row are the lists of review texts and their respective star ratings. However, each rating is encoded as a set of'review__stars__icon', 'review__stars__icon--40', 'review__stars__number__reviews' for an instance to indicate a 4-star review. So we need to extract that infromation from the text. As the data readed are all just of type str that looks like a list, we use ast package, which helps Python applications to process trees of the Python abstract syntax grammar, to convert them to lists for more convenient processing. The following function process_rating and process_review did just the above.

In [5]:

def process_rating(rating_text):
    ra = ast.literal_eval(rating_text)
    outra = []
    for i in range(1, len(ra), 3):
            outra.append(int(ra[i][-2]))
            
    return outra

def process_review(review_text):
    return ast.literal_eval(review_text)
    

2- 1 Data cleaning¶

While we pre-process the raw text data, we also deal with the missing values where the reviews are just not enough to fill in all the columns for some certain schools and give them an empty list as default value.

In [6]:

df['rating_1'] = df['1'].apply(lambda x: process_rating(x) if(np.all(pd.notnull(x))) else [])
df['rating_2'] = df['5'].apply(lambda x: process_rating(x) if(np.all(pd.notnull(x))) else [])
df['rating_3'] = df['7'].apply(lambda x: process_rating(x) if(np.all(pd.notnull(x))) else [])
df['review_1'] = df['0'].apply(lambda x: process_review(x) if(np.all(pd.notnull(x))) else [])
df['review_2'] = df['4'].apply(lambda x: process_review(x) if(np.all(pd.notnull(x))) else [])
df['review_3'] = df['6'].apply(lambda x: process_review(x) if(np.all(pd.notnull(x))) else [])


df["ratings"] = df.apply(lambda row: row["rating_1"] + row["rating_2"] + row["rating_3"], axis = 1)
df["reviews"] = df.apply(lambda row: row["review_1"] + row["review_2"] + row["review_3"], axis = 1)
df = df[['Unnamed: 0', 'ratings', 'reviews']]

In [7]:

df.columns = ['school_name','ratings', 'reviews']
df = df.reset_index(drop = True)

Now we have a clean and well-formated dataframe that contains all the review and rating data of 982 schools that we collected from niche.

In [8]:

df.head()

Out[8]:

	school_name	ratings	reviews
0	East Meadow High School	[4, 4, 3, 3, 4, 4, 5, 4, 4, 4, 3, 4, 5, 4, 4, ...	[The teacher's are pretty pog and are always s...
1	Glen Rock High School	[4, 4, 4, 3, 3, 5, 3, 3, 4, 3, 2, 3, 5, 4, 5, ...	[Glen Rock High School will hold some of my fo...
2	Millennium High School	[5, 2, 4, 5, 4, 4, 4, 4, 5, 3, 4, 3, 3, 4, 4, ...	[cool and nice. academics are good and there a...
3	Croton-Harmon High School	[5, 3, 3, 5, 4, 5, 4, 5, 1, 4, 4, 4, 4, 4, 4, ...	[Croton Harmon High School has a small town fe...
4	Scotch Plains-Fanwood High School	[4, 4, 5, 5, 5, 4, 4, 4, 4, 4, 3, 5, 3, 5, 5, ...	[Scotch Plains-Fanwood High School gives a uni...

2- 2 Exploratory and Sentiment Analysis¶

We first have a look at the distribution of the ratings. It can be seen in the plot below that people mostly give 4-star reviews. The number of 3 and 5 star reviews are identical, whereas people seldom give 1 and 2 star reviews.

In [8]:

all_review = df.reviews.sum()
all_review_text = ' '.join(all_review)
all_rating = df['ratings'].sum()
# all_rating = [str(i) for i in all_rating]
# count = [all_rating.count(i+1) for i in range(5)]
# print(count)
plt.hist(all_rating, bins = 5)
plt.xticks(np.arange(1, 6), ('1-star', '2-star', '3-star', '4-star', '5-star'))
plt.show()

Here we utilize the sentiment analysis of Textblob to acquire the sentiment polarity(negative -1.0 -> positive 1.0) of each review and visualize how it correlates with the star rating and finally found a correlation coefficient of 0.300 between sentiments and star rating. It looks like the low-star reviews indeed convey much more negative emotions.

In [9]:

def get_sentiment(text):
    #getting the polarity of reviews
    
    tb = TextBlob(text)
    return tb.sentiment.polarity

# List of sentiment polarity for all reviews
all_senti = [get_sentiment(t) for t in all_review]

In [10]:

df_senti = pd.DataFrame(list(zip(all_rating, all_senti)), 
               columns =['rating', 'sentiment'])
x = 'rating'
y = 'sentiment'

plt.figure(figsize=(7,7))
sns.boxplot(x = x , y = y, data = df_senti)
plt.show()

In [11]:

pearsonr(all_senti,all_rating)

Out[11]:

(0.2997920033706547, 0.0)

2-3 Word Cloud¶

Here we made word loud images to plot reviews of different star ratings and also of schools that have low or high graduation rate, in order to see if they differ and in what way. We made several attempts to find the extra stopwords that don't give us much information about the difference but not in the default stopwords to make our wordcloud more insightful.

In [12]:

stopwords_a = set(STOPWORDS)
stopwords_a.update(['high', 'school', 'people', 'think', 'student', 'class', 'lot', 'much', 'one', \
                  'classes', 'great', 'good','amazing','even', 'come', 'make', 'overall', 'grade',\
                  'small', 'give', 'able', 'make', 'going', 'thing', 'year', 'other', 'feel', 'although', 'definitely'\
                 'made', 'average', 'say', 'go', 'way', 'now', 'well', 'will', 'teacher', 'place', 'always', 'many', \
                  'thing', 'though', 'really', 'anything', 'everything', 'still'])

def create_wc(text):
    wc = WordCloud(stopwords = stopwords_a, max_font_size=180, background_color="white", \
                   width=2000, height=1000).generate(text)

    plt.figure(figsize=(30, 15))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.show()

Comparison between high-star and low-star reviews¶

Below is the function to get the reviews of a certain star rating and we seperately assign the reviews divided by stars in different columns.

In [13]:

def get_star_review(star, star_list, review_list):
    assert star in [1,2,3,4,5]
    assert len(star_list) == len(review_list)
    out = []
    for i in range(len(star_list)):
        if star_list[i] == star:
            out.append(review_list[i])
    return out

df["1_star_review"] = df.apply(lambda row: get_star_review(1, row['ratings'], row['reviews']), axis = 1)
df["2_star_review"] = df.apply(lambda row: get_star_review(2, row['ratings'], row['reviews']), axis = 1)
df["3_star_review"] = df.apply(lambda row: get_star_review(3, row['ratings'], row['reviews']), axis = 1)
df["4_star_review"] = df.apply(lambda row: get_star_review(4, row['ratings'], row['reviews']), axis = 1)
df["5_star_review"] = df.apply(lambda row: get_star_review(5, row['ratings'], row['reviews']), axis = 1)
one_star_review = ' '.join(df['1_star_review'].sum())
two_star_review = ' '.join(df['2_star_review'].sum())
three_star_review = ' '.join(df['3_star_review'].sum())
four_star_review = ' '.join(df['4_star_review'].sum())
five_star_review = ' '.join(df['5_star_review'].sum())

In [14]:

df.head()

Out[14]:

	school_name	ratings	reviews	1_star_review	2_star_review	3_star_review	4_star_review	5_star_review
0	Long Island City High School	[3, 3, 4, 4, 3, 3, 5, 5, 3, 4, 2, 3, 4, 4, 4, ...	[I dont like this school as much as i thought ...	[do not go to this school if your english is n...	[Long island city has many opportunities but t...	[I dont like this school as much as i thought ...	[This school is what you make of it. In additi...	[As a incoming senior, Long Island City has we...
1	Liberty High School Academy for Newcomers	[4, 5, 5, 5, 3, 3, 5, 4]	[I remember the first time I stepped in Libert...	[]	[]	[My school is a good school as a foreign schoo...	[I remember the first time I stepped in Libert...	[I really like how the system works, everyone ...
2	Eximius College Preparatory Academy	[3, 5, 3, 4, 3, 3, 4, 4, 5, 1, 5, 5, 3, 1, 3, ...	[Eximius was a school I wont forget, Best four...	[I like how my school allows you to take many ...	[]	[Eximius was a school I wont forget, Best four...	[I love how there is the staff and teachers be...	[With my now 4 years worth of experience at EC...
3	Innovation Diploma Plus	[5, 5, 5, 4, 3, 3, 1, 3, 1, 1]	[Great school, great experience. Thank you IDP...	[kids do not what to do nothing, I liked the s...	[]	[none had kids in the school that didnt want t...	[What I like about innovation ....is the way o...	[Great school, great experience. Thank you IDP...
4	Urban Assembly Academy of Government & Law	[3, 4, 3, 4, 3, 5, 5, 5, 4, 3, 3, 3, 3, 5, 3, ...	[The is an average school. Most of the student...	[]	[The food was just plain cafeteria food, not a...	[The is an average school. Most of the student...	[My school, Urban Assembly Academy of Governme...	[even though we share a building with 5 other ...
5	High School for Law & Public Service	[4, 5, 4, 3, 5, 3, 4, 4, 3, 3, 4, 3, 4, 4, 5, ...	[My overall expo at this school is pretty aver...	[There arent many food options. Sometimes the ...	[]	[It's a pretty great school, even for an anti ...	[My overall expo at this school is pretty aver...	[My high school has many clubs and sports that...
6	Middlesex County Vocational School - East Brun...	[3, 3]	[My experience is decent. I'd like to see an h...	[]	[]	[My experience is decent. I'd like to see an h...	[]	[]
7	Urban Assembly School of Design & Construction	[4, 4, 4, 1, 5, 3, 3, 5, 5, 4, 4, 3, 4, 3, 4, ...	[I love the staff members. There was this one ...	[I would like to see more motivational teacher...	[They could do much better, and it makes me sa...	[I selected this school because I wanted to sp...	[I love the staff members. There was this one ...	[The school is a great place for students to t...
8	Plainfield Academy for the Arts & Advanced Stu...	[3, 2, 4, 4, 4, 5, 3, 4, 4, 4, 4, 4, 5, 3, 4, ...	[I liked how everything was so simple. You kne...	[]	[There are little to no clubs and/or after sch...	[I liked how everything was so simple. You kne...	[My experience about PAAAS is that I’ve had ma...	[I’ve been attending PAAAS since the 7th grade...
9	Essex Street Academy	[3, 3, 4, 4, 4, 4, 3, 4, 4, 5, 5, 3, 4, 4, 4, ...	[Very small school, doesn't prepare students f...	[]	[]	[Very small school, doesn't prepare students f...	[There are many ways to get everyone in the sc...	[I was given a lot of opportunties, The teache...
10	Repertory Company High School for Theatre Arts	[3, 5, 5, 3, 4, 5, 4, 3, 5, 4, 4, 3, 3, 3, 4, ...	[Well this school basically has opportunities ...	[I hate this School. The teachers don't care a...	[]	[Well this school basically has opportunities ...	[I love how close the community of the school ...	[The community found at Repertory Company High...
11	Ramapo High School	[3, 4, 4, 3, 3, 4, 5, 4, 3, 4, 4, 3, 3, 3, 3, ...	[Ramapo high school is a school that has a hig...	[]	[the schools student body is made up of approx...	[Ramapo high school is a school that has a hig...	[Very great school helped me with a lot of lov...	[GreatThis is a an interesting GreatThis is a ...
12	Academy for Conservation & the Environment	[4, 4, 4, 3, 2, 5, 5, 3, 4, 3, 3, 5, 4, 3, 3, ...	[The school isn't too bad, i went in with a ba...	[]	[I like that the teachers focuses on each stud...	[When it comes to ACE , teachers care a lot ab...	[The school isn't too bad, i went in with a ba...	[I absolutely love this school. I transferred ...
13	Sunset Park High School	[4, 4, 4, 1, 4, 4, 4, 3, 4, 5, 4, 5, 3, 5, 4, ...	[It's a pretty good high school, the teachers ...	[Horrible! I worked there as a teacher for a y...	[]	[Sunset Park high school is a very good school...	[It's a pretty good high school, the teachers ...	[Everyone is friendly in the school. Counselor...
14	Great Oaks Legacy Charter School	[3, 5, 2, 1, 2, 5, 4, 5, 5, 5, 4, 3, 1, 5, 4, ...	[Coming from a Public school to a Charter Scho...	[Overall, this high school could do better wit...	[what i like about great oaks was that they ca...	[Coming from a Public school to a Charter Scho...	[Something i like about Great oaks is how invo...	[I love how the school is a second family to m...
15	Young Women's Leadership School of Brooklyn	[5, 3, 4, 4, 4, 5, 5, 3, 5, 4, 4, 4, 3, 4, 3, ...	[I absolutely recommend this school!. From the...	[]	[]	[TYWLS is for students in the neighborhood who...	[I've been attending this school since 6th gra...	[I absolutely recommend this school!. From the...
16	Hudson High School of Learning Technologies	[3, 2, 5, 4, 3, 5, 4, 5, 5, 3, 5, 2, 3, 3, 5, ...	[My school added more AP classes for students ...	[]	[Some of the teachers are willing to help you ...	[My school added more AP classes for students ...	[Hudson’s the best high school you can get in ...	[This is a very good school. The teachers care...
17	Lincoln High School	[5, 5, 5, 4, 3, 3, 4, 5, 3, 4, 5, 5, 5, 3, 4, ...	[I really like my school and this school help ...	[this school makes me want to die. the staff i...	[]	[I like how the ninth grade teachers are dedic...	[Lincoln is filled with opportunities and is a...	[I really like my school and this school help ...
18	Dreamyard Preparatory School	[5, 5, 5, 4, 4, 5, 4, 4, 4, 5, 5, 5, 4, 4, 3, ...	[As a high school senior at Dreamyard Preparat...	[]	[]	[It is very challenging because some students ...	[My Experience in Dreamyard wasn’t the best bu...	[As a high school senior at Dreamyard Preparat...
19	John Adams High School	[5, 1, 4, 2, 5, 4, 5, 3, 4, 4, 4, 3, 5, 4, 5, ...	[As a student who attended John Adams High Sch...	[This school is rrally bad. The STEM program i...	[It's alright, but it gets very boring. Add a ...	[John Adams has a long way to go, but if stude...	[The teachers are friendly and will help you w...	[As a student who attended John Adams High Sch...
20	Multicultural High School	[3, 3]	[It was ok, it was really work intensive in mo...	[]	[]	[It was ok, it was really work intensive in mo...	[]	[]
21	Hyde Leadership Charter School	[5, 4, 2, 4, 4, 2, 5, 4, 4, 4, 4, 4, 5, 4, 1, ...	[Hyde Leadership Charter School offers AP Clas...	[Not good school. Terrible principalcelia sosa...	[The high school is starting to feel like a pu...	[The teachers in my school are very well educa...	[It is a good school that helps you prepare fo...	[Hyde Leadership Charter School offers AP Clas...
22	East New York Family Academy	[5, 4, 5, 5, 5, 4, 3, 3, 3, 3, 4, 3, 3, 4, 3, ...	[I've learned many things from this school it ...	[]	[]	[When you first start school here, you'll be j...	[I’d like them to change many key factors abou...	[I've learned many things from this school it ...
23	Unity Center for Urban Technologies	[5, 2, 5, 3, 3, 3, 3, 3, 3, 3, 3, 3]	[Very good school I'm currently a senior and i...	[]	[Terrible school because it a very racist scho...	[I believe that Unity Center For Urban Technol...	[]	[Very good school I'm currently a senior and i...
24	Mott Haven Community High School	[5, 5, 5, 5, 4, 5, 5]	[I transferred to this school to finish my hig...	[]	[]	[]	[There are so many programs and clubs availabl...	[I transferred to this school to finish my hig...
25	Academy for Language & Technology	[3, 5, 5, 4, 5, 3, 4, 4, 5, 4, 5, 3, 4, 5, 5, ...	[It was safe, we did not see a lot of fights i...	[]	[]	[It was safe, we did not see a lot of fights i...	[The Academy for Language and Technology is a ...	[The most thing I like about academy for langu...
26	Urban Assembly School for Collaborative Health...	[3, 3, 3, 4, 4, 4]	[I like how the school is always concerned abo...	[]	[]	[I like how the school is always concerned abo...	[There were plenty of supportive teachers and ...	[]
27	International High School at Lafayette	[4, 4, 4, 5, 4, 4, 5, 5, 5, 5, 4, 5, 5, 5, 5, ...	[It's a very good school because you get to ex...	[]	[]	[We don't chance to get involved in different ...	[It's a very good school because you get to ex...	[Always felt safe and secure, I've only had go...
28	High School for Energy & Technology	[3, 3, 5, 4, 4, 2, 3, 3, 3, 3, 5, 3, 5, 3, 4]	[Most of the teachers are very nice, I enyoy s...	[]	[High School for Energy and technology is by f...	[Most of the teachers are very nice, I enyoy s...	[In this school you'll experience the ways of ...	[High School for Energy and Technology (HSET) ...
29	High School for Enterprise, Business and Techn...	[5, 3, 3, 4, 3, 4, 3, 3, 5, 3, 4, 3, 2, 3, 3, ...	[I had a great experience attending this schoo...	[]	[I'm a senior now, and this school at first wa...	[I had a solid experience at EBT. The classes ...	[This school was very good. I enjoyed how much...	[I had a great experience attending this schoo...
...	...	...	...	...	...	...	...	...
952	W. Tresper Clarke High School	[5, 4, 5, 4, 4, 5, 5, 4, 5, 4, 4, 4, 3, 4, 4, ...	[My experience at W.Tresper Clarke High School...	[]	[]	[Going to this high school has been an okay ex...	[I love the teachers and supportive environmen...	[My experience at W.Tresper Clarke High School...
953	Midwood High School at Brooklyn College	[5, 4, 4, 2, 4, 4, 5, 5, 3, 4, 5, 5, 4, 4, 3, ...	[Midwood High school is a great school for stu...	[]	[Overall, was a pretty poor experience. Some ...	[One thing I like about Midwood is the diversi...	[The students are really diverse and the class...	[Midwood High school is a great school for stu...
954	Morris Hills High School	[5, 5, 4, 2, 4, 4, 5, 5, 4, 5, 5, 3, 5, 3, 4, ...	[This is a wonderful school, which as many dif...	[]	[The administration constantly boasts that Mor...	[I would like to see more school spirit, schoo...	[Academics are really good, a lot of scholarsh...	[This is a wonderful school, which as many dif...
955	Oceanside High School	[4, 5, 4, 5, 3, 3, 2, 5, 4, 3, 5, 4, 4, 4, 3, ...	[The school provide a vast list of elective co...	[]	[Starting off with drug/health regulations, Oc...	[Decent. You could do anything you want if you...	[The school provide a vast list of elective co...	[My four years at Oceanside High School have b...
956	Millennium Brooklyn High School	[4, 5, 5, 4, 3, 5, 5, 4, 5, 5, 3, 5, 4, 5, 3, ...	[I really like that at Millennium Brooklyn Hig...	[]	[]	[The teachers in Millennium Brooklyn High Scho...	[I really like that at Millennium Brooklyn Hig...	[Millennium Brooklyn High School offered me a ...
957	Francis Lewis High School	[4, 3, 5, 4, 3, 4, 3, 5, 4, 4, 4, 4, 5, 4, 4, ...	[I liked that Francis Lewis has cool teacher a...	[What can you say about the most overcrowded h...	[]	[I like Francis Lewis High School because of t...	[I liked that Francis Lewis has cool teacher a...	[I like the diversity of the school and the va...
958	Governor Livingston High School	[4, 5, 4, 4, 4, 3, 4, 2, 5, 4, 3, 4, 5, 5, 4, ...	[Some of the teachers do not care about the st...	[]	[Teachers kinda suck and the school itself is ...	[School is good academically, there is a lack...	[Some of the teachers do not care about the st...	[I feel very luck to have been able to attend ...
959	Wayne Hills High School	[3, 5, 3, 5, 4, 5, 4, 4, 5, 5, 2, 4, 4, 5, 4, ...	[Wayne Hills High School is located in a subur...	[]	[I went to Wayne Hills for about the year. I d...	[Wayne Hills High School is located in a subur...	[Wayne Hills is a very positive atmosphere and...	[I really enjoy this school and have liked my ...
960	Marlboro High School	[3, 4, 5, 4, 3, 5, 4, 4, 4, 4, 4, 5, 5, 4, 4, ...	[Good academics but teachers do not care much ...	[]	[]	[Good academics but teachers do not care much ...	[Marlboro High School has many opportunities f...	[Marlboro High School is a wonderful school wi...
961	Parsippany Hills High School	[5, 4, 4, 5, 5, 3, 5, 5, 5, 4, 4, 1, 4, 5, 4, ...	[Parsippany Hills High school is a safe and en...	[There seems to be way too much homework, no s...	[I think the schools is okay but the main issu...	[Good school, good class selection and mostly ...	[As a current senior, I have gone through almo...	[Parsippany Hills High school is a safe and en...
962	Hendrick Hudson High School	[4, 4, 5, 4, 3, 5, 1, 1, 3, 5, 5, 5, 5, 4, 3, ...	[HHHS is a great place to learn and grow as a ...	[The administration is horrible, the food is b...	[No one really notices that people do drugs af...	[The school itself, and the education *general...	[HHHS is a great place to learn and grow as a ...	[Support for the arts is amazing. Facilities ...
963	Wantagh Senior High School	[4, 5, 5, 5, 4, 5, 4, 5, 5, 5, 5, 4, 5, 5, 3, ...	[The teachers were heavily engaging and extrem...	[]	[]	[Wantagh provides a good education however the...	[The teachers were heavily engaging and extrem...	[Overall I believe that Wantagh Senior High Sc...
964	Pompton Lakes High School	[5, 5, 5, 5, 4, 5, 4, 5, 5, 5, 5, 5, 5, 5, 3, ...	[It was a great experience being in this schoo...	[]	[]	[For its size, there are quite a few AP classe...	[Tradition is what makes Pompton Lakes High Sc...	[It was a great experience being in this schoo...
965	John H. Glenn High School	[5, 3, 5, 4, 5, 5, 5, 5, 4, 5, 4, 3, 3, 4, 3, ...	[The interactions between the students and the...	[]	[Bullying is a major issue in this school. Som...	[It was an alright experience. There are defin...	[John Glenn is considered a small school. Our...	[The interactions between the students and the...
966	Bay Shore Senior High School	[4, 4, 3, 4, 5, 4, 3, 5, 5, 5, 4, 5, 4, 5, 5, ...	[My experience at Bay Shore high School was am...	[]	[]	[I loved the high range of opportunities and l...	[My experience at Bay Shore high School was am...	[I loved the diversity in this town and school...
967	Baruch College Campus High School	[5, 4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 5, 4, 5, 4, ...	[BCCHS is a wonderful place for flourishing mi...	[]	[]	[There are many great teachers at Baruch, and ...	[Baruch is great. Amazing teachers who are ded...	[BCCHS is a wonderful place for flourishing mi...
968	Fort Lee High School	[4, 4, 4, 5, 5, 5, 4, 4, 4, 4, 3, 4, 5, 4, 5, ...	[Fort Lee High School's strengths are in its d...	[]	[While there are hidden gems of teachers at FL...	[Fort Lee High School's small student populati...	[Fort Lee High School's strengths are in its d...	[My experience at Fort lee high school has bee...
969	Kings Collegiate Charter School	[4, 5, 4, 4, 3, 3, 4, 4, 4, 4, 5]	[I liked that the teachers did everything in t...	[]	[]	[It's a college. Not much for parents to do., ...	[I liked that the teachers did everything in t...	[Kings Colliagaite truly helped me become a be...
970	Locust Valley High School	[3, 5, 4, 4, 5, 5, 1, 5, 4, 3, 5, 4, 5, 5, 5, ...	[LVHS academic preparation isn't the best beca...	[The education and the teachers in this school...	[There isn't a great selection, and most of th...	[LVHS academic preparation isn't the best beca...	[The school is overall very good. The distric...	[Locust Valley is a great high school for many...
971	County Prep High School	[1, 5, 5, 5, 4, 3, 3, 4, 4, 4, 5, 4, 3, 3, 5, ...	[The HCST foundation is very crooked and broke...	[The HCST foundation is very crooked and broke...	[]	[I would love to see a change in the prices fo...	[The academics and teachers are very focused o...	[Amazing education in a safety environment, wi...
972	Northport Senior High School	[4, 4, 4, 4, 4, 5, 5, 5, 3, 3, 5, 5, 3, 3, 4, ...	[I had a great experience in the marching band...	[]	[]	[The teachers are great. There are a few bad a...	[I had a great experience in the marching band...	[Northport High is a great school with a wide ...
973	Dobbs Ferry High School	[3, 5, 4, 4, 1, 5, 4, 3, 4, 4, 5, 4, 5, 4, 4, ...	[What I would like to see change is in the inv...	[The school is absolutely tiny and focuses on ...	[I would say that there's almost something for...	[What I would like to see change is in the inv...	[I liked how involved the teachers were with t...	[This is an International Baccalaureate (IB) s...
974	Madison High School	[5, 4, 3, 5, 4, 5, 4, 3, 5, 5, 4, 4, 4, 5, 3, ...	[You get what you put into it at Madison High ...	[]	[]	[Madison High School had its pros and cons. It...	[This is a very friendly school. It has the sa...	[You get what you put into it at Madison High ...
975	Pascack Valley High School	[1, 4, 5, 5, 4, 4, 2, 4, 5, 5, 4, 4, 5, 5, 4, ...	[PV high school is the breeding ground for rac...	[PV high school is the breeding ground for rac...	[Where do I start. Pascack Valley is a great p...	[Academically, it is a very good school and so...	[Overall the school is pretty good. There are ...	[Very great school to attend for education. Th...
976	Freehold Borough High School	[5, 5, 4, 4, 3, 4, 5, 4, 3, 3, 4, 4, 4, 4, 5, ...	[Amazing school, people and great atmosphere. ...	[]	[]	[My experience at Freehold Borough High School...	[Very good school. Many excellent teachers and...	[Amazing school, people and great atmosphere. ...
977	Gen. Douglas MacArthur Senior High School	[5, 4, 4, 5, 4, 5, 4, 5, 5, 5, 4, 4, 4, 5, 4, ...	[Teachers and other staff members are very hel...	[]	[One could easily walk into this building with...	[This is an average school. The classes that ...	[The teachers are very nice and students are v...	[Teachers and other staff members are very hel...
978	Smithtown High School East	[4, 5, 3, 4, 4, 4, 3, 5, 4, 4, 5, 4, 4, 4, 5, ...	[Smithtown High School East is a great school....	[]	[East is beast. West is the best. There is cel...	[Smithtown HS East as a school is great. The ...	[Smithtown High School East is a great school....	[High School East provided me a wonderful educ...
979	NYC iSchool	[5, 5, 5, 5, 4, 4, 5, 4, 4, 4, 4, 5, 4, 5, 5, ...	[It's basically a liberal arts institute witho...	[]	[]	[It's alright. The environment and atmosphere ...	[I believe that this school is a very good sch...	[It's basically a liberal arts institute witho...
980	Mount Olive High School	[4, 4, 5, 5, 3, 3, 3, 4, 5, 4, 4, 5, 5, 5, 5, ...	[I enjoyed the higher-level classes and teache...	[]	[]	[Mount Olive High is a very good school with m...	[I enjoyed the higher-level classes and teache...	[The teachers were very caring. The school too...
981	Colts Neck High School	[4, 4, 4, 4, 4, 5, 3, 2, 4, 4, 5, 4, 5, 5, 5, ...	[Overall, Colts Neck High School is a great sc...	[]	[If your child is a minority, I recommend you ...	[Provided a good high school education and a d...	[Overall, Colts Neck High School is a great sc...	[I thought Colts Neck high school was a great ...

982 rows × 8 columns

Below is the plot of 1 and 2 star reviews.

In [15]:

create_wc(' '.join([one_star_review, two_star_review]))

Here is the plot of 3, 4 and 5 star reviews.

In [16]:

create_wc(' '.join([three_star_review, four_star_review, five_star_review]))

From the two plots above, we can see 3, 4 and 5 star reviews have a stronger emphasis on positive words like diversity, caring students and so on, while in 1 and 2 star reviews, negative words like bullying and rude are mentioned more.

High star reviews:

Diversity
Made learn
Opportunities
Care student / Help student / Support
Offered
Best

Low star reviews:

Lunch
Drug
Bullying
Rude / Worst
Nothing
Bathroom

Comparison between reviews of high-graduation-rate and low-graduation-rate schools¶

Now we read in the graduation rate data scraped and classify the reviews based the graduation rate of the schools. A graduation rate above the 75th percentile (>0.92) is considered to be a high one and below the 25th percentile (<0.75) is considered to be a low one.

In [17]:

with open('df_full.csv', newline='', encoding="utf8") as f:
    df_full = pd.read_csv(f)

In [18]:

low = df_full.graduation.quantile([0.25,0.75]).iloc[0]
high = df_full.graduation.quantile([0.25,0.75]).iloc[1]

# df_full[df_full.graduation<low].school_name
low_school = df_full[df_full.graduation<low].school_name.tolist()
# df[df['school_name'] == df_full['school_name']]
low_df = df.loc[df['school_name'].isin(low_school)]
low_review = ' '.join(low_df['5_star_review'].sum())
create_wc(low_review)

In [19]:

high_school = df_full[df_full.graduation>high].school_name.tolist()
# df[df['school_name'] == df_full['school_name']]
high_df = df.loc[df['school_name'].isin(high_school)]
high_review = ' '.join(high_df['5_star_review'].sum())
create_wc(high_review)

From the two plots above, we can see that words like diversity, support are mentioned more in high graduation rate schools, whereas more realistic and utility words like job and internship are mentioned more.

High graduation rate schools:

Diversity
Support
Spirit
Prepare college

Low graduation rate schools:

Job
Internship
Graduate
Taught

3 Geographic Distribution¶

Now let's analyze our feature data to answer the second question: How are public high schools in New York distributed based on different factors?

In [20]:

from geopy.geocoders import Nominatim
import re
import fractions
import requests
import matplotlib.pyplot as plt
%matplotlib inline  
import geopandas as gpd
import seaborn as sns

3- 1 Data Clean and Geographic Data Request¶

In this part, we want to explore the geographical distribution of school features to find out the social factors which may influence the graduation rate of schools. First, we need to clean the clean the address data.

In [21]:

# Read and clean the data

# Read the four address files as dataframes
ad_file1 = pd.read_csv('school_address_174.csv')
ad_file2 = pd.read_csv('school_address_175_400.csv')
ad_file3 = pd.read_csv('school_address_400_764.csv')
ad_file4 = pd.read_csv('school_address_764_1006.csv')

# concatenate the four dataframes
frames = [ad_file1, ad_file2, ad_file3, ad_file4]
df_ad = pd.concat(frames)

# Rename the columns
df_ad.rename(columns={'Unnamed: 0':'school_name',
                          '0':'school_address'}, 
                 inplace=True)
#pd_ad.columns

# Clean the address column
df_ad['school_address'] = df_ad['school_address'].str.replace('Address', '')
df_ad['school_address'] = df_ad['school_address'].str.replace('\n', '')

# add whitespace between words starting with capital letters
def capital_words_spaces(address):
    return re.sub(r"([a-z])([A-Z])", r"\1 \2", address)
df_ad['school_address'] = df_ad['school_address'].apply(capital_words_spaces)

# add whitespace between numbers and capital letters
def digital_space_capital(address):
    return re.sub(r"(\d)([A-Z])", r"\1 \2", address)
df_ad['school_address'] = df_ad['school_address'].apply(digital_space_capital)

address = df_ad.to_csv (r'address.csv', index = None, header=True)

In [22]:

print(len(df_ad))
print(df_ad.head())

1006
                           school_name  \
0               Stuyvesant High School   
1              Bergen County Academies   
2          High Technology High School   
3      Union County Magnet High School   
4  Staten Island Technical High School   

                               school_address  
0          345 Chambers St New York, NY 10282  
1  200 Hackensack Avenue Hackensack, NJ 07601  
2  765 Newman Springs Road Lincroft, NJ 07738  
3   1776 Raritan Road Scotch Plains, NJ 07076  
4      485 Clawson St Staten Island, NY 10306

Now we get a dataframe including addresses for all 1006 public high schools in New York Area. To plot the schools on the map, we use Geopy to convert the school addresses to coordinates. Geopy can be easily service timed out. So we break all schools into 10 dataframes to run separately.

In [23]:

# Geopy Address, Longtitude, Latitude

# get the correct address from Geopy
# easily service timed out so break into 100 one time
# a little bit slow, need to be patient

nom = Nominatim()

# ignore invalid addresses for Geopy
def eval_results(x):
    try:
        return (x.latitude, x.longitude)
    except:
        return (None, None)

/Users/ruwenyou/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:7: DeprecationWarning: Using Nominatim with the default "geopy/1.20.0" `user_agent` is strongly discouraged, as it violates Nominatim's ToS https://operations.osmfoundation.org/policies/nominatim/ and may possibly cause 403 and 429 HTTP errors. Please specify a custom `user_agent` with `Nominatim(user_agent="my-application")` or by overriding the default `user_agent`: `geopy.geocoders.options.default_user_agent = "my-application"`. In geopy 2.0 this will become an exception.
  import sys

In [ ]:

df_ad['latitude'] = df_ad['school_address'][0:100].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address1.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address1.csv')
df_ad['latitude'] = df_ad['school_address'][100:200].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address2.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address2.csv')
df_ad['latitude'] = df_ad['school_address'][200:300].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address3.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address3.csv')
df_ad['latitude'] = df_ad['school_address'][300:400].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address4.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address4.csv')
df_ad['latitude'] = df_ad['school_address'][400:500].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address5.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address5.csv')
df_ad['latitude'] = df_ad['school_address'][500:600].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address6.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address6.csv')
df_ad['latitude'] = df_ad['school_address'][600:700].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address7.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address7.csv')
df_ad['latitude'] = df_ad['school_address'][700:800].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address8.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address8.csv')
df_ad['latitude'] = df_ad['school_address'][800:900].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address9.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address9.csv')
df_ad['latitude'] = df_ad['school_address'][900:1000].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address10.csv', index = None, header=True)

In [ ]:

df_ad = pd.read_csv('address10.csv')
df_ad['latitude'] = df_ad['school_address'][1000:1100].apply(nom.geocode).apply(lambda x: eval_results(x))
address = df_ad.to_csv (r'address11.csv', index = None, header=True)

Now we need to concatenate all the addresses into one dataframe.

In [23]:

# read all address csv into dataframe
# If you can't run the Geopy code above (can't request Geopy service)
# please read the csv files we already saved in the folders before.

df_ad1 = pd.read_csv('address1.csv')
df_ad2 = pd.read_csv('address2.csv')
df_ad3 = pd.read_csv('address3.csv')
df_ad4 = pd.read_csv('address4.csv')
df_ad5 = pd.read_csv('address5.csv')
df_ad6 = pd.read_csv('address6.csv')
df_ad7 = pd.read_csv('address7.csv')
df_ad8 = pd.read_csv('address8.csv')
df_ad9 = pd.read_csv('address9.csv')
df_ad10 = pd.read_csv('address10.csv')
df_ad11 = pd.read_csv('address11.csv')

In [24]:

# get the rows which contain the coordinates in each dataframe

df_ad1 = df_ad1[0:100]
df_ad2 = df_ad2[100:200]
df_ad3 = df_ad3[200:300]
df_ad4 = df_ad4[300:400]
df_ad5 = df_ad5[400:500]
df_ad6 = df_ad6[500:600]
df_ad7 = df_ad7[600:700]
df_ad8 = df_ad8[700:800]
df_ad9 = df_ad9[800:900]
df_ad10 = df_ad10[900:1000]
df_ad11 = df_ad11[1000:1050]

In [25]:

# Concatenate the dataframes

frames = [df_ad1, df_ad2, df_ad3, df_ad4, df_ad5, df_ad6, df_ad7, df_ad8, df_ad9, df_ad10, df_ad11]
df_ad = pd.concat(frames)
df_ad.to_csv (r'address_latitude.csv', index = None, header=True)
df_ad.head()

Out[25]:

	school_name	school_address	latitude
0	Stuyvesant High School	345 Chambers St New York, NY 10282	(40.7179876, -74.0138420443421)
1	Bergen County Academies	200 Hackensack Avenue Hackensack, NJ 07601	(40.9012393225806, -74.0353447741935)
2	High Technology High School	765 Newman Springs Road Lincroft, NJ 07738	(40.3262704, -74.1314886709229)
3	Union County Magnet High School	1776 Raritan Road Scotch Plains, NJ 07076	(40.6181661, -74.3568340380298)
4	Staten Island Technical High School	485 Clawson St Staten Island, NY 10306	(40.567911, -74.115454)

We get a row of coordinates ("latitude") for each school now. Then we need to merge the school feature data with the coordinates data. Let's look at the school feature data first.

In [26]:

# read csv file of school features into dataframe

df_sch = pd.read_csv('first.csv')
print(df_sch.dtypes)
print(df_sch.head())

Unnamed: 0                 object
reading                    object
math                       object
graduation                 object
Median Household Income    object
Median Rent                object
Median Home Value          object
Student_Number             object
Free_or_Reduced_Lunch      object
safe                       object
happy                      object
Student_Teacher_Ratio      object
dtype: object
                              Unnamed: 0 reading math graduation  \
0        A. Hamilton Preparatory Academy     69%  26%        96%   
1  A. Philip Randolph Campus High School     95%  94%        75%   
2            Abraham Lincoln High School     89%  91%        67%   
3            Academy Charter High School      8%  12%        85%   
4      Academy for Allied Health Science     99%  92%        95%   

  Median Household Income Median Rent Median Home Value Student_Number  \
0                45,186\n     1,110\n         268,700\n            983   
1                46,449\n     1,126\n         348,669\n          1,385   
2                38,620\n     1,081\n         309,222\n          1,979   
3                74,861\n     1,368\n         408,500\n            176   
4               117,765\n     1,481\n         481,500\n            291   

  Free_or_Reduced_Lunch safe happy Student_Teacher_Ratio  
0                   82%  88%   72%                 14:01  
1                   83%  70%   73%                 17:01  
2                   78%  66%   59%                 16:01  
3                   87%   --    --                 10:01  
4                   11%  91%   82%                 16:01

From the overview above, we need to convert most of the columns to float first and clean some invalid values and unnecessary strings.

In [27]:

# rename the column
df_sch.rename(columns={'Unnamed: 0':'school_name',
                       'Median Household Income':'Median_Household_Income',
                       'Median Rent':'Median_Rent',
                       'Median Home Value':'Median_Home_Value'}, 
              inplace=True)

# clean unnecessary string
df_sch['Median_Household_Income'] = df_sch['Median_Household_Income'].str.replace('\n', '')
df_sch['Median_Rent'] = df_sch['Median_Rent'].str.replace('\n', '')
df_sch['Median_Home_Value'] = df_sch['Median_Home_Value'].str.replace('\n', '')
df_sch['Median_Household_Income'] = df_sch['Median_Household_Income'].str.replace(',', '')
df_sch['Median_Rent'] = df_sch['Median_Rent'].str.replace(',', '')
df_sch['Median_Home_Value'] = df_sch['Median_Home_Value'].str.replace(',', '')
df_sch['Student_Number'] = df_sch['Student_Number'].str.replace(',', '')
df_sch['Student_Teacher_Ratio'] = df_sch['Student_Teacher_Ratio'].str.replace(':00', '')
df_sch['Student_Teacher_Ratio'] = df_sch['Student_Teacher_Ratio'].str.replace(':', '/')

# replace invalid value to NaN
df_sch.replace('_', np.nan, inplace = True)
df_sch.replace('--', np.nan, inplace = True)
df_sch.replace('970\n55 responses', np.nan, inplace = True)

# convert the types of the column to string/float
df_sch["school_name"] = df_sch['school_name'].astype(str)
df_sch['reading'] = df_sch['reading'].str.rstrip('%').astype('float') / 100.0
df_sch['math'] = df_sch['math'].str.rstrip('%').astype('float') / 100.0
df_sch['graduation'] = df_sch['graduation'].str.rstrip('%').astype('float') / 100.0
df_sch['Median_Household_Income'] = df_sch['Median_Household_Income'].astype('float')
df_sch['Median_Rent'] = df_sch['Median_Rent'].astype('float')
df_sch['Median_Home_Value'] = df_sch['Median_Home_Value'].astype('float')
df_sch['Student_Number'] = df_sch['Student_Number'].astype('float')
df_sch['Free_or_Reduced_Lunch'] = df_sch['Free_or_Reduced_Lunch'].str.rstrip('%').astype('float') / 100.0
df_sch['safe'] = df_sch['safe'].str.rstrip('%').astype('float') / 100.0
df_sch['happy'] = df_sch['happy'].str.rstrip('%').astype('float') / 100.0
df_sch['Student_Teacher_Ratio'] = df_sch['Student_Teacher_Ratio'].apply(lambda frac: float(sum([fractions.Fraction(x) for x in frac.split()])))
print(df_sch.dtypes)
print(df_sch.head())

school_name                 object
reading                    float64
math                       float64
graduation                 float64
Median_Household_Income    float64
Median_Rent                float64
Median_Home_Value          float64
Student_Number             float64
Free_or_Reduced_Lunch      float64
safe                       float64
happy                      float64
Student_Teacher_Ratio      float64
dtype: object
                             school_name  reading  math  graduation  \
0        A. Hamilton Preparatory Academy     0.69  0.26        0.96   
1  A. Philip Randolph Campus High School     0.95  0.94        0.75   
2            Abraham Lincoln High School     0.89  0.91        0.67   
3            Academy Charter High School     0.08  0.12        0.85   
4      Academy for Allied Health Science     0.99  0.92        0.95   

   Median_Household_Income  Median_Rent  Median_Home_Value  Student_Number  \
0                  45186.0       1110.0           268700.0           983.0   
1                  46449.0       1126.0           348669.0          1385.0   
2                  38620.0       1081.0           309222.0          1979.0   
3                  74861.0       1368.0           408500.0           176.0   
4                 117765.0       1481.0           481500.0           291.0   

   Free_or_Reduced_Lunch  safe  happy  Student_Teacher_Ratio  
0                   0.82  0.88   0.72                   14.0  
1                   0.83  0.70   0.73                   17.0  
2                   0.78  0.66   0.59                   16.0  
3                   0.87   NaN    NaN                   10.0  
4                   0.11  0.91   0.82                   16.0

Let's merge the coordinate address dataframe with the school feature dataframe and separate coordinates into latitude and longitude.

In [28]:

# merge the two dataframes
df_full = df_ad.merge(df_sch, how="inner", left_on = "school_name", right_on="school_name")

# rename the 'latitude' to 'coordinate'
df_full.rename(columns={'latitude':'coordinate'}, inplace=True)

# separate coordinates into latitude and longitude
df_full['coordinate'] = df_full['coordinate'].str.replace('(', '')
df_full['coordinate'] = df_full['coordinate'].str.replace(')', '')

from ast import literal_eval

df_full['coordinate'] = [literal_eval(x) for x in df_full['coordinate']]
df_full[['lat','lon']] = pd.DataFrame(df_full['coordinate'].values.tolist(), index=df_full.index)

df_full.to_csv (r'df_full.csv', index = None, header=True)

In [29]:

# check the df_full dataframe
df_full.head()

Out[29]:

	school_name	school_address	coordinate	reading	math	graduation	Median_Household_Income	Median_Rent	Median_Home_Value	Student_Number	Free_or_Reduced_Lunch	safe	happy	Student_Teacher_Ratio	lat	lon
0	Stuyvesant High School	345 Chambers St New York, NY 10282	(40.7179876, -74.0138420443421)	0.99	0.99	0.98	190216.0	3084.0	1143983.0	3368.0	0.45	0.93	0.80	21.0	40.717988	-74.013842
1	Bergen County Academies	200 Hackensack Avenue Hackensack, NJ 07601	(40.9012393225806, -74.0353447741935)	NaN	NaN	NaN	NaN	NaN	NaN	1069.0	0.05	0.90	0.75	11.0	40.901239	-74.035345
2	High Technology High School	765 Newman Springs Road Lincroft, NJ 07738	(40.3262704, -74.1314886709229)	0.99	0.95	0.95	132135.0	392.0	495800.0	291.0	0.02	0.82	0.64	13.0	40.326270	-74.131489
3	Union County Magnet High School	1776 Raritan Road Scotch Plains, NJ 07076	(40.6181661, -74.3568340380298)	0.99	0.95	0.95	117765.0	1481.0	481500.0	298.0	0.10	0.83	0.67	13.0	40.618166	-74.356834
4	Staten Island Technical High School	485 Clawson St Staten Island, NY 10306	(40.567911, -74.115454)	0.99	0.99	0.99	93479.0	1441.0	539760.0	1313.0	0.39	0.97	0.89	23.0	40.567911	-74.115454

In [30]:

# drop the rows which don't have invalid latitudes and longtitudes
df_full = df_full[np.isfinite(df_full['lat'])]

We have finished all the prior work. Let's use Folium to make the interactive data map for school features.

3- 2 Distribution Map¶

In [31]:

import folium
from folium import plugins

Let's first look at the geographical distribution of schools. We use heatmap to show it. The red color means schools gather in this area. From the maps below, we can see most of the schools gather in Bronx, Brooklyn and Mahattan.

In [32]:

# create the heatmap of school distribution
schools_heatmap = folium.Map(location=[df_full['lat'].mean(), df_full['lon'].mean()], zoom_start=10)
schools_heatmap.add_children(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in df_full.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap

/Users/ruwenyou/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: FutureWarning: Method `add_children` is deprecated. Please use `add_child` instead.
  This is separate from the ipykernel package so we can avoid doing imports until

Out[32]:

In [ ]:

# If the html doesn't show, please see the folloing picture.

$distribution.png$

Now let's explore the geographic distribution of graduation rate for the high schools. We use four colors to show four quartile ranges of the graduation rate. The lighter the color is, the higher the graduation is.

In [33]:

# drop the rows with invalid graduation data.
df_graduation = df_full[np.isfinite(df_full['graduation'])]

# create the column of four quartile ranges of the graduation rate
df_graduation['graduation_q'] = pd.qcut(df_graduation['graduation'], 4, labels=False)

/Users/ruwenyou/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """

In [34]:

# check the column
df_graduation['graduation_q'].head()

Out[34]:

0    3
2    2
3    2
4    3
5    3
Name: graduation_q, dtype: int64

In [35]:

# set the color of each range of graduation rate
# lightblue: 75-100%; lightgreen: 50-75%; orange: 25-50%; red: 0-25%.
colordict = {3: 'lightblue', 2: 'lightgreen', 1: 'orange', 0: 'red'}

# create the center of the map based on the mean of coordinates
graduation_map = folium.Map(location=[df_graduation['lat'].mean(), df_graduation['lon'].mean()], zoom_start=10)

# create the graduation distribution map
for lat, lon, school_name, graduation, graduation_q in zip(df_graduation['lat'], df_graduation['lon'], df_graduation['school_name'], df_graduation['graduation'], df_graduation['graduation_q']):
    folium.CircleMarker(
        [lat, lon],
        radius=graduation*10,
        popup = ('School: ' + str(school_name) + '<br>'
                 'Graduation: ' + str(graduation) + '<br>'
                ),
        color='b',
        key_on = graduation_q,
        threshold_scale=[0,1,2,3],
        fill_color=colordict[graduation_q],
        fill=True,
        fill_opacity=0.7
        ).add_to(graduation_map)
graduation_map.save("graduation_heatmap.html")
graduation_map

Out[35]:

In [53]:

# If the html doesn't show, please see the following picture

As you can see from the above map, most of the lower graduation school gathers in the Bronx, Mahattan and Brooklyn. We hypothesize that some factors related to the districts may influence the graduation rate. We pick "Median_Household_Income" as a factor and make a geographic map for it. We want to see if the income level is related to the graduation rate.

We still divide the income level based on the quantile ranges of "Median_Household_Income" and use the same way to plot the household income level of each school. The lighter the color is, the higher the income level is.

In [36]:

# drop the rows with invalid 'Median_Household_Income'
df_income = df_full[np.isfinite(df_full['Median_Household_Income'])]

In [37]:

# set the color of each range of graduation rate
# lightblue: 75-100%; lightgreen: 50-75%; orange: 25-50%; red: 0-25%.
df_income['income_q'] = pd.qcut(df_income['Median_Household_Income'], 4, labels=False)
colordict = {3: 'lightblue', 2: 'lightgreen', 1: 'orange', 0: 'red'}

# create the center of the map based on the mean of coordinates
income_map = folium.Map(location=[df_full['lat'].mean(), df_full['lon'].mean()], zoom_start=9)

# create the graduation distribution map
for lat, lon, school_name, income, income_q in zip(df_income['lat'], df_income['lon'], df_income['school_name'], df_income['Median_Household_Income'], df_income['income_q']):
    folium.CircleMarker(
        [lat, lon],
        radius=income*0.00015,
        popup = ('School: ' + str(school_name) + '<br>'
                 'Median Income: ' + str(income) + '<br>'
                ),
        color='b',
        key_on = income_q,
        threshold_scale=[0,1,2,3],
        fill_color=colordict[income_q],
        fill=True,
        fill_opacity=0.7
        ).add_to(income_map)
income_map.save("income_heatmap.html")
income_map

/Users/ruwenyou/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until

Out[37]:

In [54]:

# If the html can't show, please see the following picture

Comparing the income map above with the prior graduation map, you can see that there are similar patterns between them. For example, the Bronx and Brooklyn both have lower graduation rate and lower income level. But we can also see exceptions. For example, Manhatton has higher level income level but lower graduation rate.

4 Factor influence¶

Now, let's explore more to ask the third question: What factors might influence the average graduation rate of public high schools in NY?

4- 1 T-test¶

We use T-test to find out the difference between the patterns in the map. We find the clusters in the pictures below and use the coordinates of boundary schools in the clusters to locate them.

In [38]:

# find the coordinates of the boundary schools
# create the dataframe for the two clusters

lon_l_index = df_graduation.loc[df_graduation['school_name']=='Marta Valle High School'].index[0]
lon_l = df_graduation['lon'][lon_l_index]
lon_r_index = df_graduation.loc[df_graduation['school_name']=='Pan American International High School'].index[0]
lon_r = df_graduation['lon'][lon_r_index]
lat_d_index = df_graduation.loc[df_graduation['school_name']=='Academy of Medical Technology - A College Board School'].index[0]
lat_d = df_graduation['lat'][lat_d_index]
lat_u_index = df_graduation.loc[df_graduation['school_name']=='International Leadership Charter School'].index[0]
lat_u = df_graduation['lat'][lat_u_index]

df_l_gra = df_graduation[df_graduation['lon'] > lon_l]
df_r_gra = df_l_gra[df_l_gra['lon'] < lon_r]
df_d_gra = df_r_gra[df_r_gra['lat'] > lat_d]
df_gra_low = df_d_gra[df_d_gra['lat'] < lat_u]

lon2_index = df_graduation.loc[df_graduation['school_name']=='Queens High School of Teaching'].index[0]
lon2 = df_graduation['lon'][lon2_index]
lon3_index = df_graduation.loc[df_graduation['school_name']=='Sewanhaka High School'].index[0]
lon3 = df_graduation['lon'][lon3_index]
lat2_index = df_graduation.loc[df_graduation['school_name']=='Oceanside High School'].index[0]
lat2 = df_graduation['lat'][lat2_index]
lat3_index = df_graduation.loc[df_graduation['school_name']=='Earl L. Vandermeulen High School'].index[0]
lat3 = df_graduation['lat'][lat3_index]

df1_gra = df_graduation[df_graduation['lon'] > lon2]
df2 = df1_gra[df1_gra['lon'] < lon3]
df3 = df2[df2['lat'] > lat2]
df_gra_high = df3[df3['lat'] < lat3]

In [39]:

# T-test on all the features of the schools

from scipy.stats import ttest_ind
Reading = ttest_ind(df_gra_low['reading'], df_gra_high['reading'])

df_gra_low = df_gra_low[np.isfinite(df_gra_low['math'])]
df_gra_high = df_gra_high[np.isfinite(df_gra_high['math'])]

Math = ttest_ind(df_gra_low['math'], df_gra_high['math'])
Median_Household_Income = ttest_ind(df_gra_low['Median_Household_Income'], df_gra_high['Median_Household_Income'])
Median_Rent = ttest_ind(df_gra_low['Median_Rent'], df_gra_high['Median_Rent'])
Median_Home_Value = ttest_ind(df_gra_low['Median_Home_Value'], df_gra_high['Median_Home_Value'])
Student_Number = ttest_ind(df_gra_low['Student_Number'], df_gra_high['Student_Number'])
Free_or_Reduced_Lunch = ttest_ind(df_gra_low['Free_or_Reduced_Lunch'], df_gra_high['Free_or_Reduced_Lunch'])
df_gra_low = df_gra_low[np.isfinite(df_gra_low['safe'])]
df_gra_high = df_gra_high[np.isfinite(df_gra_high['safe'])]
Safe = ttest_ind(df_gra_low['safe'], df_gra_high['safe'])
df_gra_low = df_gra_low[np.isfinite(df_gra_low['happy'])]
df_gra_high = df_gra_high[np.isfinite(df_gra_high['happy'])]
Happy = ttest_ind(df_gra_low['happy'], df_gra_high['happy'])
Student_Teacher_Ratio = ttest_ind(df_gra_low['Student_Teacher_Ratio'], df_gra_high['Student_Teacher_Ratio'])

In [40]:

# convert the T-test results to a dataframe

df_ttest = pd.DataFrame()
df_ttest['Reading'] = Reading
df_ttest['Math'] = Math
df_ttest['Median_Household_Income'] = Median_Household_Income
df_ttest['Median_Rent'] = Median_Rent
df_ttest['Median_Home_Value'] = Median_Home_Value
df_ttest['Student_Number'] = Student_Number
df_ttest['Free_or_Reduced_Lunch'] = Free_or_Reduced_Lunch
df_ttest['Safe'] = Safe
df_ttest['Happy'] = Happy
df_ttest['Student_Teacher_Ratio'] = Student_Teacher_Ratio
df_ttest_transposed = df_ttest.T.round(3)
df_ttest_transposed.rename(columns={0: "T-Value", 1: "p-value"})

Out[40]:

	T-Value	p-value
Reading	-1.654	0.099
Math	-1.969	0.050
Median_Household_Income	-4.594	0.000
Median_Rent	-1.705	0.089
Median_Home_Value	-1.401	0.162
Student_Number	-2.263	0.024
Free_or_Reduced_Lunch	8.770	0.000
Safe	0.177	0.860
Happy	0.505	0.614
Student_Teacher_Ratio	1.820	0.071

From the T-test results above, we can see that Math, Median_Household_Income, Student_Number and Free_or_Reduced_Lunch have significant difference between the two clusters. All these features are higher in schools with higher graduation rates except Free_or_Reduced_Lunch. The biggest difference also happens on Free_or_Reduced_Lunch. We can make assumption that study (Math), social economic status (Median_Household_Income), Student_Number all have positve influence on the graduation rate. But Low graduation rate school may have more needs on free_or_Reduced_Lunch.

4- 2 Correlation¶

Let's go back to the full sample and use correlation analysis to test our assumptions.

In [41]:

# read the full sample data
df_full = pd.read_csv('df_full.csv')

# remove the coordinates data
df_corr = df_full[['reading', 'math', 'graduation', 'Median_Household_Income', 
                   'Median_Rent', 'Median_Home_Value', 'Student_Number', 'Free_or_Reduced_Lunch',
                   'safe', 'happy', 'Student_Teacher_Ratio']]

In [42]:

# correlation Analysis

cm = df_corr.corr()
cm

Out[42]:

	reading	math	graduation	Median_Household_Income	Median_Rent	Median_Home_Value	Student_Number	Free_or_Reduced_Lunch	safe	happy	Student_Teacher_Ratio
reading	1.000000	0.933167	0.055198	0.181728	0.256807	0.296964	0.053216	-0.051341	0.251841	0.229925	0.366205
math	0.933167	1.000000	0.013754	0.116594	0.214791	0.262026	0.042157	0.028645	0.224151	0.214398	0.371856
graduation	0.055198	0.013754	1.000000	0.400917	0.234131	0.118879	0.313467	-0.526028	0.374493	0.188272	-0.031906
Median_Household_Income	0.181728	0.116594	0.400917	1.000000	0.749960	0.731328	0.209494	-0.679098	0.318926	0.113441	-0.119405
Median_Rent	0.256807	0.214791	0.234131	0.749960	1.000000	0.721526	0.109859	-0.353262	0.197032	0.038397	0.004937
Median_Home_Value	0.296964	0.262026	0.118879	0.731328	0.721526	1.000000	0.072537	-0.258025	0.200933	0.052857	0.051414
Student_Number	0.053216	0.042157	0.313467	0.209494	0.109859	0.072537	1.000000	-0.265592	-0.018810	0.114802	0.175205
Free_or_Reduced_Lunch	-0.051341	0.028645	-0.526028	-0.679098	-0.353262	-0.258025	-0.265592	1.000000	-0.373909	-0.165836	0.193799
safe	0.251841	0.224151	0.374493	0.318926	0.197032	0.200933	-0.018810	-0.373909	1.000000	0.618294	0.002469
happy	0.229925	0.214398	0.188272	0.113441	0.038397	0.052857	0.114802	-0.165836	0.618294	1.000000	0.091594
Student_Teacher_Ratio	0.366205	0.371856	-0.031906	-0.119405	0.004937	0.051414	0.175205	0.193799	0.002469	0.091594	1.000000

In [43]:

# visualize the correlation matrix

plt.figure(figsize = (10, 7))
sns.heatmap(cm, annot = True)
plt.xlabel('correlation')
plt.ylabel('correlation')

Out[43]:

Text(68.09374999999999, 0.5, 'correlation')

From the correlation matrix above, we can see that Free_or_Reduced_Lunch has bigger negative correlation with both graduation and Median_Household_Income. Schools with high free or reduced lunch needs also need more economic support. From the correlation coeffients between graduation and social economic status factors (Median_Household_Income, Median_Rent, Median_Home_Value), we can see that high social economic status will have positive influence on the graduation rate.

It's also suprised that math and reading scores don't have strong correlation with graduation compared to other features. From this we can speculate that if we define the school in terms of graduation rate and academic performance, we may get completely different results. Good grades may not lead to high graduation rates.

5 Predictive Model¶

In the final part, we want to use the features we have to build a predictive model for graduation rate. We use the average graduation rate to classify the graduation rate into two categories: high graduation rate (>average) and low graduation rate (<average). We choose Decision Tree, Random Forest and KNeighbors to build the model, and compare the accuracy.

In [44]:

from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier as DTree
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report as cls_rep
from sklearn.metrics import accuracy_score as ac

In [45]:

# read the full data set

df_full = pd.read_csv('df_full.csv')

In [46]:

# use the mean of each column to replace the invalid values

df_full['reading'].fillna((df_full['reading'].mean()), inplace=True)
df_full['math'].fillna((df_full['math'].mean()), inplace=True)
df_full['graduation'].fillna((df_full['graduation'].mean()), inplace=True)
df_full['Median_Household_Income'].fillna((df_full['Median_Household_Income'].mean()), inplace=True)
df_full['Median_Rent'].fillna((df_full['Median_Rent'].mean()), inplace=True)
df_full['Median_Home_Value'].fillna((df_full['Median_Home_Value'].mean()), inplace=True)
df_full['Student_Number'].fillna((df_full['Student_Number'].mean()), inplace=True)
df_full['Free_or_Reduced_Lunch'].fillna((df_full['Free_or_Reduced_Lunch'].mean()), inplace=True)
df_full['safe'].fillna((df_full['safe'].mean()), inplace=True)
df_full['happy'].fillna((df_full['happy'].mean()), inplace=True)
df_full['Student_Teacher_Ratio'].fillna((df_full['Student_Teacher_Ratio'].mean()), inplace=True)

In [47]:

# recode the graduation rate with 1 (higher than average), 0 (lower than average)

gra_ave = df_full['graduation'].mean()
print(gra_ave)
df_full["gra_high"] = df_full['graduation'].apply(lambda x: 1 if x > gra_ave else 0)
df = df_full[['reading', 'math', 'gra_high', 'Median_Household_Income', 'Median_Rent', 
              'Median_Home_Value', 'Student_Number', 'Free_or_Reduced_Lunch', 'safe', 
              'happy', 'Student_Teacher_Ratio']]
df.head()

0.8129700854700852

Out[47]:

	reading	math	gra_high	Median_Household_Income	Median_Rent	Median_Home_Value	Student_Number	Free_or_Reduced_Lunch	safe	happy	Student_Teacher_Ratio
0	0.990000	0.990000	1	190216.000000	3084.000000	1.143983e+06	3368.0	0.45	0.93	0.80	21.0
1	0.736527	0.646735	1	76707.738659	1443.290927	4.366174e+05	1069.0	0.05	0.90	0.75	11.0
2	0.990000	0.950000	1	132135.000000	392.000000	4.958000e+05	291.0	0.02	0.82	0.64	13.0
3	0.990000	0.950000	1	117765.000000	1481.000000	4.815000e+05	298.0	0.10	0.83	0.67	13.0
4	0.990000	0.990000	1	93479.000000	1441.000000	5.397600e+05	1313.0	0.39	0.97	0.89	23.0

In [48]:

# functions for model fit

def run_clf(X_train,X_test,y_train,class_type,**kwargs):
    clf = class_type(**kwargs)
    clf.fit(X_train,y_train)  
    y_pred = clf.predict(X_test)
    return y_pred
def run_RF(X_train,X_test,y_train,RF):
    clf = RF(n_estimators=100)
    clf.fit(X_train,y_train)  
    y_pred = clf.predict(X_test)
    return y_pred

In [49]:

# split train and test data

y = df.gra_high
X = df[['reading','math','Median_Household_Income','Median_Rent','Median_Home_Value','Student_Number',
        'Free_or_Reduced_Lunch','safe','happy','Student_Teacher_Ratio']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 1)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(812, 10) (812,)
(204, 10) (204,)

In [50]:

# result of the model

target_names=["High Graduation","Low Graduation"]
i=0
method=['DecisionTree','K-NN']
cls=[DTree,KNN]
while i<=1:
    print(method[i])
    print(cls_rep(y_test, run_clf(X_train,X_test,y_train,cls[i]),target_names=target_names))
    print('Accuracy:',ac(y_test, run_clf(X_train,X_test,y_train,cls[i])))
    i+=1     
print('Randomforest')
print(cls_rep(y_test, run_RF(X_train,X_test,y_train,RF),target_names=target_names))
print('Accuracy:',ac(y_test, run_RF(X_train,X_test,y_train,RF)))

DecisionTree
                 precision    recall  f1-score   support

High Graduation       0.61      0.69      0.65        67
 Low Graduation       0.84      0.79      0.81       137

       accuracy                           0.75       204
      macro avg       0.73      0.74      0.73       204
   weighted avg       0.76      0.75      0.76       204

Accuracy: 0.7696078431372549
K-NN
                 precision    recall  f1-score   support

High Graduation       0.56      0.49      0.52        67
 Low Graduation       0.77      0.81      0.79       137

       accuracy                           0.71       204
      macro avg       0.66      0.65      0.66       204
   weighted avg       0.70      0.71      0.70       204

Accuracy: 0.7058823529411765
Randomforest
                 precision    recall  f1-score   support

High Graduation       0.79      0.78      0.78        67
 Low Graduation       0.89      0.90      0.89       137

       accuracy                           0.86       204
      macro avg       0.84      0.84      0.84       204
   weighted avg       0.86      0.86      0.86       204

Accuracy: 0.8333333333333334

We evaluate the performance of the classifiers by using AUC and ROC. Below are the way we build our AUC - ROC Curves.

In [51]:

from sklearn.metrics import roc_curve,auc
# Compute ROC curve and ROC area for each class
def acu_curve(y_test,y_pred,cls_me,color):
    fpr, tpr, thresholds  =  roc_curve(y_test, y_pred) #fpr and tpr
    roc_auc =auc(fpr, tpr) # auc
    lw = 2
    plt.plot(fpr, tpr, color=color,
             lw=lw, label='ROC curve of {} '.format(cls_me)+'(area = %0.3f)' % roc_auc) ###x-axis: fpr，y-axis:tpr
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.0])
    ax.set_xlabel('False Positive Rate')
    ax.set_xlabel('True Positive Rate')
    ax.set_title('ROC curve of different classification methods')
    ax.legend(loc="lower right")
fig, ax = plt.subplots(figsize = (6, 6))
acu_curve(y_test, run_clf(X_train,X_test,y_train,DTree),'DecisionTree','gold')
acu_curve(y_test, run_clf(X_train,X_test,y_train,KNN),'K-NN','darkorange')
acu_curve(y_test, run_RF(X_train,X_test,y_train,RF),'RandomForest','purple')
plt.show()

As we can see from the curves, Random Forest has the best performance. Let's use it to predict the graduation rate for test dataset.

In [52]:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators = 100)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred

Out[52]:

array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 0, 1, 0])

Appendix¶

In [ ]:

%pip install selenium
import os
import sys
import time
import zipfile
import re
import csv
from urllib.request import urlretrieve
from selenium import webdriver
from getpass import getpass
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.common.exceptions import ElementClickInterceptedException
import pandas as pd 
from bs4 import BeautifulSoup
import re

Step 1: Get the school list and save it for next steps¶

In [ ]:

%%capture

""" **********                                                ********** """
try:
	from configparser import SafeConfigParser
except ImportError:
	from ConfigParser import SafeConfigParser

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome('./chromedriver')
driver.set_window_size(1080,800)  #Required, removes the "element not found" bug

try:
    input = raw_input
except NameError:
    pass

def clear():
    if os.name == 'nt':
        os.system('cls')
    else : 
        os.system('clear')
""" ********************************************************************** """
school_lists = []
school_info = {}
school_academic = {}
school_family = {}
school_stu_num = {}
school_heal_safe = {}
school_stu_tea = {}
school_stu_club = {}
def search():
	# go to Search link
    for n in range(0,46):
        driver.get("https://www.niche.com/k12/search/best-public-high-schools/m/new-york-city-metro-area/?page="+format(n))
        # get expected school from search link
        for school_anchor in driver.find_elements_by_css_selector("a.search-result__link"):
            try:
                sponsor = school_anchor.find_element_by_class_name("search-result__sponsored-bar")
            except NoSuchElementException:
                link = school_anchor.get_attribute("href")
                title = school_anchor.find_element_by_class_name("search-result__title").get_attribute("innerText")
                school_lists.append([title, link])
        
if __name__ == '__main__':
    clear()

    credentials_from_file = False

    credentials = SafeConfigParser();

    search()

Step 2: Get the datailed school information¶

In [ ]:

""" **********                                                ********** """
try:
	from configparser import SafeConfigParser
except ImportError:
	from ConfigParser import SafeConfigParser

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome('./chromedriver')
driver.set_window_size(1080,800)  #Required, removes the "element not found" bug

try:
    input = raw_input
except NameError:
    pass

def clear():
    if os.name == 'nt':
        os.system('cls')
    else : 
        os.system('clear')
""" ********************************************************************** """

school_info = {}
school_academic = {}
school_family = {}
school_stu_num = {}
school_heal_safe = {}
school_stu_tea = {}
def school_details(school_lists):
    #[:5] to get the first five
    for school in school_lists[:1009]:
        print("Visiting", school[0])
        #wait for 10 milliseconds before clicking on a link
        driver.get(school[1])
        driver.implicitly_wait(10)

        #academic
        academic_dic = {}
        for value in driver.find_elements_by_css_selector('.block--two-two .profile__buckets'):
            try:
                academic_dic["reading"] = value.find_element_by_css_selector('.profile__bucket--1 .scalar__value').get_attribute("innerText")
            except NoSuchElementException:
                academic_dic["reading"] = "_"
            try:
                academic_dic["math"] = value.find_element_by_css_selector('.profile__bucket--2 .scalar__value').get_attribute("innerText")
            except NoSuchElementException:
                academic_dic["math"] = "_"
            try:
                academic_dic["graduation"] = value.find_element_by_css_selector('.profile__bucket--3 .scalar__value').get_attribute("innerText")
            except NoSuchElementException:
                academic_dic["graduation"] = "_"
        school_academic[school[0]]=academic_dic  

        #family
        family = []
        for value in driver.find_elements_by_css_selector('.block--location .scalar__value'):
            family.append(value.get_attribute("innerText"))
        school_family[school[0]]=family

        #students-health&safety
        heal_safe = []
        for value in driver.find_elements_by_css_selector('.block--two .poll__single__percent__label'):
            heal_safe.append(value.get_attribute("innerText"))
        school_heal_safe[school[0]]=heal_safe

        stu_num = [] 
        for value in driver.find_element_by_id('students').find_elements_by_css_selector('.scalar__value'):
            stu_num.append(value.get_attribute("innerText"))
        school_stu_num[school[0]]=stu_num

        stu_tea = [] 
        stu_tea.append(driver.find_element_by_id('teachers').find_element_by_css_selector('.scalar__value').get_attribute("innerText").split('\n')[0])
        school_stu_tea[school[0]]=stu_tea

            #reviews
    #print("percent proficient-reading:", reading)
if __name__ == '__main__':
    clear()

    credentials_from_file = False

    credentials = SafeConfigParser();

    school_details(school_lists)

Step 3: Get the 60 latest reviews from each school¶

In [ ]:

""" **********                                                ********** """
try:
	from configparser import SafeConfigParser
except ImportError:
	from ConfigParser import SafeConfigParser

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome('./chromedriver')
driver.set_window_size(1080,800)  #Required, removes the "element not found" bug

try:
    input = raw_input
except NameError:
    pass

def clear():
    if os.name == 'nt':
        os.system('cls')
    else : 
        os.system('clear')
""" ********************************************************************** """

school_review={}
def review_body(html_doc):
    soup = BeautifulSoup(html_doc, 'lxml')
    return [item.getText() for item in soup.find_all("div","overflow-text__content")]

def review_star(html_doc):
    classes = []
    each_cls = []
    soup = BeautifulSoup(html_doc, 'lxml')
    star = soup.find_all("div","review__stars")
    for item in star:
        for element in item.find_all(class_=True):
            #print(element["class"])
            each_cls.extend(element["class"])
    #print(classes)
    return each_cls
def get_reviews():
    for school in school_lists[0:60]:
        print("Visiting",school[0])
        driver.implicitly_wait(10)
        driver.get(school[1]+"reviews/")

     # wait a bit before extracting reviews
        driver.implicitly_wait(20)
        all_reviews = []

     # loop until we found all reviews
        page_num = 0
        while page_num < 4:
            try:
                review_htmls = driver.find_element_by_css_selector('section.reviews-expansion-bucket').get_attribute('innerHTML')
                reviews_in_page = review_body(review_htmls)
                star_in_page = review_star(review_htmls)
                if reviews_in_page[0] not in all_reviews:
                    all_reviews.extend([reviews_in_page,star_in_page])
                    #print(page_num)
                    page_num += 1

       # otherwise, click Next button
                driver.implicitly_wait(20)
                next_button = driver.find_element_by_css_selector('.pagination__next')
                next_button.click()
                driver.implicitly_wait(100)

      # if page still refreshing after clicking Next button, wait for a bit and then repeat
            except StaleElementReferenceException:
                driver.implicitly_wait(20)
            except ElementClickInterceptedException:
                driver.implicitly_wait(20)
            except NoSuchElementException:
                break
                
        school_review[school[0]]=all_reviews
        print("last:",len(all_reviews),"Next")
    return all_reviews

if __name__ == '__main__':
    clear()

    credentials_from_file = False

    credentials = SafeConfigParser();
    
    get_reviews()

Step 4: Get the school address¶

In [ ]:

""" **********                                                 ********** """
try:
	from configparser import SafeConfigParser
except ImportError:
	from ConfigParser import SafeConfigParser

chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications" : 2}
chrome_options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome('./chromedriver')
driver.set_window_size(1080,800)  #Required, removes the "element not found" bug

try:
    input = raw_input
except NameError:
    pass

def clear():
    if os.name == 'nt':
        os.system('cls')
    else : 
        os.system('clear')
""" ********************************************************************** """
school_address = {}
def address(school_lists):
    #[:5] to get the first five
    for school in school_lists[:1]:
        print("Visiting", school[0])
        #wait for 10 milliseconds before clicking on a link
        driver.get("https://www.niche.com/k12/professional-performing-arts-high-school-new-york-ny/")
        driver.implicitly_wait(10)
        
        address = []
        address.append(driver.find_element_by_css_selector('.profile__address').get_attribute("innerText"))
        print(address)
        school_address[school[0]]=address
        
        
if __name__ == '__main__':
    clear()

    credentials_from_file = False

    credentials = SafeConfigParser();

    address(school_lists)