Introduction - Education First

Team Member Names: Carl Shan Bharathkumar Gunasekaran Haroon Rasheed Paul Mohammed Sumedh Sawant

How can we help parents choose elementary schools for their children? We're interested in visualizing statistical data from the National Center for Education Statistics alongside geographic data to help parents decide where to live and where to send their children to school.

We will help parents assess the optimal location by analyzing a variety of factors, including:

Crime Rates
HS Graduation Rate
Population
Average Income
Average House Price
Average County Test Scores
County Schools Total Funding
Health Care Costs
Percent of College Degree Holders
Literacy Rate

After the user ranks each of these factors from 1 to 10 (1: most important), we generate a heat map of the state they are interested in living in and color each county according to how closely it matches their preferences.

The heatmap will be displayed within a web browser.

Example: http://nbviewer.ipython.org/5490713

Step 0: Data Extraction & Cleaning

Below is our initial code, which extracts and cleans the data and loads it into pandas DataFrames.

In [219]:

## Setting up tools
import csv
import codecs
import pandas as pd
import numpy as np
import os
from pandas import DataFrame, Series
from itertools import islice
from bs4 import BeautifulSoup

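# Read FIPS codes as strings so that leading zeros (e.g. '06' for California) are preserved.
# The builtin `object` is used because the `np.object` alias was removed in NumPy 1.24.
dtype = {'FIPS': object}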

Extracting Literacy rates, and College degree rates

In [220]:

## Extracting Literacy rates, and College degree rates
f1 = codecs.open(os.path.abspath("data/LiteracyCollegeDegreeData.csv"), encoding='iso-8859-1')
df1 = pd.read_csv(f1, dtype=dtype)
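
The `dtype` mapping matters because FIPS codes carry leading zeros. Below is a minimal sketch of the difference, using a made-up two-row CSV (an assumption for illustration; the real LiteracyCollegeDegreeData.csv has more columns):

```python
import io
import pandas as pd

# Hypothetical sample data, not the project's actual file.
sample = u"FIPS,County\n06001,Alameda\n32001,Churchill\n"

# Without a dtype, pandas infers integers and the leading zero is lost.
as_numbers = pd.read_csv(io.StringIO(sample))

# With dtype=object, the codes stay as five-character strings.
as_strings = pd.read_csv(io.StringIO(sample), dtype={'FIPS': object})

print(as_numbers['FIPS'].tolist())
print(as_strings['FIPS'].tolist())
```

Keeping the codes as strings is what lets the state be selected later by comparing the first two characters.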

Extracting Crime, Motor Vehicle Mortality, High School Graduation and Smoking Rates

In [221]:

## Extracting crime, motor vehicle mortality, high school graduation and smoking rates
f2 = codecs.open(os.path.abspath("data/CrimeRate.csv"), encoding='iso-8859-1')
header = list()
data = list()
for n, row in enumerate(f2):
    fields = row.rstrip('\r\n').split(',')   # drop the line ending so the last field is clean
    if n == 0:
        continue                 # skip the first line of the file
    elif n == 1:
        header.append(fields)    # the second line holds the column names
    else:
        data.append(fields)      # remaining lines are data rows (values stay strings)
f2.close()

df2 = DataFrame(data, columns = header[0])
df2 = df2[['FIPS', 'County', 'State', 'Violent Crime Rate', '% AFGR', 'MV Mortality Rate', '% Smokers']]
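
The same "skip one line, read the header, read the rows" pattern recurs in the next two cells. As a sketch only, it could be factored into a helper built on the standard `csv` module, which also copes with quoted fields that contain commas. The helper name `read_table` and the sample text are hypothetical:

```python
import csv
import io

def read_table(fileobj, skip=1):
    """Skip `skip` leading records, then return (header, rows) parsed by csv.reader."""
    reader = csv.reader(fileobj)
    for _ in range(skip):
        next(reader, None)                   # discard lines before the header
    header = next(reader)                    # column names
    rows = [row for row in reader if row]    # ignore blank lines
    return header, rows

# Hypothetical sample: a title line, a header line, then one data row.
sample = io.StringIO(
    u"Some title line\n"
    u"FIPS,County,State\n"
    u'06001,"Alameda, East Bay",CA\n'
)
header, rows = read_table(sample)
print(header)
print(rows)
```

A plain `row.split(',')` would break the quoted county name above into two fields, whereas `csv.reader` keeps it whole.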

Extracting Population and Household Income

In [222]:

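## Extracting population and household income
f2 = codecs.open(os.path.abspath("data/PopIncome.csv"), encoding='iso-8859-1')
header1 = list()
data1 = list()
for n, row in enumerate(f2):
    fields = row.rstrip('\r\n').split(',')   # drop the line ending so the last field is clean
    if n == 0:
        continue                 # skip the first line of the file
    elif n == 1:
        header1.append(fields)   # the second line holds the column names
    else:
        data1.append(fields)     # remaining lines are data rows (values stay strings)
f2.close()

df3 = DataFrame(data1, columns = header1[0])
df3 = df3[['FIPS', 'County', 'State', 'Population', 'Household Income']]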

Extracting county health rankings

In [223]:

## Extracting county health rankings
ff = codecs.open(os.path.abspath("data/CountyHealthRankings.csv"), encoding='iso-8859-1')
header = list()
data = list()
for n, row in enumerate(ff):
    fields = row.rstrip('\r\n').split(',')   # drop the line ending so the last field is clean
    if n == 0:
        continue                 # skip the first line of the file
    elif n == 1:
        header.append(fields)    # the second line holds the column names
    else:
        data.append(fields)      # remaining lines are data rows (values stay strings)
ff.close()
health = DataFrame(data, columns = header[0])

Merge all DataFrames into a final DataFrame

In [224]:

# df2 and df3 share the FIPS, County and State columns, so they are merged on all three.
dataframe = pd.merge(df2, df3)
# Columns present on both sides (other than FIPS) come back with _x / _y suffixes.
df_final = pd.merge(df1, dataframe, on = 'FIPS')

# Deleting redundant or unnecessary columns, then cleaning up the dataframe by renaming the columns
for col in ('State_y', 'County_y', 'Population_x', '95%CI-Low(College)', '95%CI-Low(Illiterate)', '95%CI-High(Illiterate)', 'Quartile'):
    del df_final[col]

df_final = df_final.rename(columns={'Population_y': 'Population', 'State_x': 'State', 'County_x': 'County'})
# Note: these two assignments align on the row index, not on FIPS.
df_final['# of Ranked Counties'] = health['# of Ranked Counties']
df_final['Health Outcome Rank'] = health['Health Outcome Rank']
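
The `_x` and `_y` column names cleaned up above come from `pd.merge` itself. A small sketch with two made-up frames (the values are placeholders, not project data) shows where they originate and how they can be tidied:

```python
import pandas as pd

# Hypothetical frames that share FIPS (the join key) and State (a duplicated column).
left = pd.DataFrame({'FIPS': ['06001', '06003'], 'State': ['CA', 'CA'], 'Illiterate': [10, 8]})
right = pd.DataFrame({'FIPS': ['06001', '06003'], 'State': ['CA', 'CA'], 'Population': [100, 200]})

merged = pd.merge(left, right, on='FIPS')
print(list(merged.columns))   # State appears twice, as State_x and State_y

# Keep one copy of the duplicated column and restore its plain name.
tidy = merged.drop(columns=['State_y']).rename(columns={'State_x': 'State'})
print(list(tidy.columns))
```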

Function to return the subset of the final DataFrame (df_final) containing the counties of a specific state

In [225]:

def get_state(state):
    '''
    Returns the subset of df_final containing only the counties in the given state.
    state should be a state's FIPS code, e.g., '06' for California.
    Longer codes such as '06000' are truncated to their first two characters.
    '''
    fip = state[:2]
    return df_final[df_final['FIPS'].map(lambda x: x[:2] == fip)]

Rank all counties of a state with respect to each of the factors

In [226]:

# Value passed to Series.rank(ascending=...) for each metric.
# False: the highest value gets rank 1 (used for 'good' metrics, where higher is better).
# True: the lowest value gets rank 1 (used for 'bad' metrics, where lower is better).
ascending = {
    'NumberWentToSomeCollege': False,
    'PercentWithCollegeDegree': False,
    '95%CI-High(College)': False,
    'Illiterate': True,
    '% Smokers': True,
    'MV Mortality Rate': True,
    '% AFGR': False,
    'Violent Crime Rate': True,
    'Population': False,
    'Household Income': False,
    '# of Ranked Counties': False,
    'Health Outcome Rank': False,
}


def getStateCountyMetricRankings(stateFips):
    '''
    Returns a DataFrame with every metric ranked across all the counties in this state.
    stateFips should be a state FIPS code; only the first two characters are used.
    '''
    stateCounties = get_state(stateFips[:2])   # fetch the state's rows once
    finalDataFrame = DataFrame(index = stateCounties.index)
    for metric in ascending:
        finalDataFrame[metric] = stateCounties[metric].rank(ascending=ascending[metric])
    return finalDataFrame
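
To make the effect of the `ascending` flag concrete, here is a small sketch on a made-up series. The columns parsed with `row.split(',')` above hold strings, so the sketch converts them with `pd.to_numeric` before ranking; that conversion is our assumption about the intended behavior, not something the cells above do.

```python
import pandas as pd

# Hypothetical crime rates for three counties, held as strings as after row.split(',').
crime = pd.Series(['100', '20', '3'], index=['A', 'B', 'C'])

numeric = pd.to_numeric(crime)                 # '100' -> 100, and so on

low_is_good = numeric.rank(ascending=True)     # lowest value gets rank 1
high_is_good = numeric.rank(ascending=False)   # highest value gets rank 1

print(low_is_good.to_dict())
print(high_is_good.to_dict())
```

For a 'bad' metric such as crime, county C (the lowest rate) ends up with rank 1 under `ascending=True`.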

Enter your Name and State

Name:

<div style="margin-left: 100px; display: inline"> State:</div>
<select id="state" name="state"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select a state</option>
    <option value="01000">Alabama</option>
    <option value="02000">Alaska</option>
    <option value="04000">Arizona</option>
    <option value="05000">Arkansas</option>
    <option value="06000">California</option>
    <option value="08000">Colorado</option>
    <option value="09000">Connecticut</option>
    <option value="10000">Delaware</option>
    <option value="11000">District Of Columbia</option>
    <option value="12000">Florida</option>
    <option value="13000">Georgia</option>
    <option value="15000">Hawaii</option>
    <option value="16000">Idaho</option>
    <option value="17000">Illinois</option>
    <option value="18000">Indiana</option>
    <option value="19000">Iowa</option>
    <option value="20000">Kansas</option>
    <option value="21000">Kentucky</option>
    <option value="22000">Louisiana</option>
    <option value="23000">Maine</option>
    <option value="24000">Maryland</option>
    <option value="25000">Massachusetts</option>
    <option value="26000">Michigan</option>
    <option value="27000">Minnesota</option>
    <option value="28000">Mississippi</option>
    <option value="29000">Missouri</option>
    <option value="30000">Montana</option>
    <option value="31000">Nebraska</option>
    <option value="32000">Nevada</option>
    <option value="33000">New Hampshire</option>
    <option value="34000">New Jersey</option>
    <option value="35000">New Mexico</option>
    <option value="36000">New York</option>
    <option value="37000">North Carolina</option>
    <option value="38000">North Dakota</option>
    <option value="39000">Ohio</option>
    <option value="40000">Oklahoma</option>
    <option value="41000">Oregon</option>
    <option value="42000">Pennsylvania</option>
    <option value="44000">Rhode Island</option>
    <option value="45000">South Carolina</option>
    <option value="46000">South Dakota</option>
    <option value="47000">Tennessee</option>
    <option value="48000">Texas</option>
    <option value="49000">Utah</option>
    <option value="50000">Vermont</option>
    <option value="51000">Virginia</option>
    <option value="53000">Washington</option>
    <option value="54000">West Virginia</option>
    <option value="55000">Wisconsin</option>
    <option value="56000">Wyoming</option>
    <option value="72000">Puerto Rico</option>
</select>	

<script>
    $("select[name='state']").change(function(){
        // Push the selected state code and the entered name into the Python kernel.
        var kernel = IPython.notebook.kernel;
        kernel.execute("state = '"+$(this).val()+"'");
        kernel.execute("name = '"+$('#name').val()+"'");
    });
</script>

Step 1: Rank your options on a scale of 1 to 10 based on your priorities.

Ranking a county characteristic as number 1 signifies that it is very important to you in choosing where to live. Ranking something as number 10 signifies that it is the least important factor to you.

<div style="margin-left: 100px; display: inline"> Rank 1:</div>
<select id="rank1" name="rank1"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<div style="margin-left: 100px; display: inline"> Rank 2:</div>
<select id="rank2" name="rank2"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<br>
<br>
<div style="margin-left: 100px; display: inline"> Rank 3:</div>
<select id="rank3" name="rank3"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<div style="margin-left: 100px; display: inline"> Rank 4:</div>
<select id="rank4" name="rank4"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<br>
<br>
<div style="margin-left: 100px; display: inline"> Rank 5:</div>
<select id="rank5" name="rank5"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<div style="margin-left: 100px; display: inline"> Rank 6:</div>
<select id="rank6" name="rank6"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<br>
<br>
<div style="margin-left: 100px; display: inline"> Rank 7:</div>
<select id="rank7" name="rank7"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<div style="margin-left: 100px; display: inline"> Rank 8:</div>
<select id="rank8" name="rank8"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<br>
<br>
<div style="margin-left: 100px; display: inline"> Rank 9:</div>
<select id="rank9" name="rank9"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<div style="margin-left: 100px; display: inline"> Rank 10:</div>
<select id="rank10" name="rank10"  style="border:1px solid #cccccc;padding: 4px 6px;margin-bottom: 10px;width: 200px;border-radius: 4px">
    <option>Select an option</option>
</select>

<script>
    // Factors that have not been assigned a rank yet.
    var options = ['Crime Rate', 'HS Grad Rate', 'Healthcare Cost', 'College Degree Holders', 'Population', 'Income', 'Literacy Rate', 'MV Mortality Rate', 'Percent of Smokers', 'Number attended College'];

    // Fill the dropdown for rank n with the remaining factors (index 0 is the placeholder).
    function fillSelect(n) {
        var select = document.forms['userinput']['rank' + n];
        for (var i = 1; i <= options.length; i++) {
            select.options[i] = new Option(options[i-1]);
        }
    }

    // When rank n is chosen: send it to the kernel, remove it from the remaining
    // factors, and populate the dropdown for rank n + 1.
    function bindRank(n) {
        $("select[name='rank" + n + "']").change(function(){
            var kernel = IPython.notebook.kernel;
            var choice = $(this).val();
            kernel.execute("rank" + n + " = '" + choice + "'");

            var index = options.indexOf(choice);
            if (index !== -1) {
                options.splice(index, 1);
            }

            if (n < 10) {
                fillSelect(n + 1);
            }
        });
    }

    fillSelect(1);
    for (var n = 1; n <= 10; n++) {
        bindRank(n);
    }
</script>

In [227]:

# Maps the labels shown in the dropdowns to the column names in df_final.
translationMap = {
    "Crime Rate": "Violent Crime Rate",
    "HS Grad Rate": "% AFGR",
    "Healthcare Cost": "Health Outcome Rank",
    "College Degree Holders": "PercentWithCollegeDegree",
    "Population": "Population",
    "Income": "Household Income",
    "Literacy Rate": "Illiterate",
    "MV Mortality Rate": "MV Mortality Rate",
    "Percent of Smokers": "% Smokers",
    "Number attended College": "NumberWentToSomeCollege",
}

ranks = [rank1, rank2, rank3, rank4, rank5, rank6, rank7, rank8, rank9, rank10]

weights = {}

# Populate the weights of the metrics: the metric chosen as rank N gets weight N.
for i, rank in enumerate(ranks):
    translatedDFMetric = translationMap[rank]
    weights[translatedDFMetric] = i + 1

print(weights)

{'Household Income': 3, 'Illiterate': 5, 'NumberWentToSomeCollege': 9, 'Violent Crime Rate': 1, '% Smokers': 6, 'MV Mortality Rate': 8, 'PercentWithCollegeDegree': 2, '% AFGR': 10, 'Population': 7, 'Health Outcome Rank': 4}
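
With this scheme the first choice receives weight 1 and the last choice weight 10, so in a weighted sum the lowest priority contributes the most. If the intent is for the top priority to dominate instead, one possible variant is to invert the weights. The sketch below uses three made-up choices and is an assumption on our part, not the weighting used for the results in this notebook:

```python
# Hypothetical choices, most important first.
ranks = ['Crime Rate', 'Income', 'Population']

# Scheme used above: rank N gets weight N.
weights_as_above = {name: i + 1 for i, name in enumerate(ranks)}

# Inverted scheme: the first choice gets the largest weight.
weights_inverted = {name: len(ranks) - i for i, name in enumerate(ranks)}

print(weights_as_above)
print(weights_inverted)
```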

Step 2: Generate Weighted Sums per County

In [228]:

stateDF = get_state(state)
stateCountyMetrics = getStateCountyMetricRankings(state)

# Key is County FIPS, Value is the score:
# the sum over all metrics of (weight of the metric) * (the county's rank on that metric).
rankScores = {}

# .loc replaces the .ix indexer, which was removed in pandas 1.0.
for index in stateDF.index:
    countyFip = stateDF.loc[index, 'FIPS']
    for metric in weights:
        weight = weights[metric]
        metricValue = stateCountyMetrics.loc[index, metric]
        rankScores[countyFip] = rankScores.get(countyFip, 0) + weight*metricValue
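
As a worked example of the weighted sum, consider two counties and two metrics. All numbers below are made up for illustration:

```python
# Hypothetical weights and per-county metric ranks.
weights = {'Violent Crime Rate': 1, 'Household Income': 2}
county_ranks = {
    '06001': {'Violent Crime Rate': 2.0, 'Household Income': 1.0},
    '06003': {'Violent Crime Rate': 1.0, 'Household Income': 2.0},
}

# score = sum over metrics of weight * rank
scores = {}
for fips, metric_ranks in county_ranks.items():
    scores[fips] = sum(weights[m] * metric_ranks[m] for m in weights)

# Lower scores come first, matching the ascending sort used in the next step.
best_first = sorted(scores, key=scores.get)

print(scores)
print(best_first)
```

County 06001 scores 1*2 + 2*1 = 4 and county 06003 scores 1*1 + 2*2 = 5, so 06001 is placed first.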

Step 3: Assign Colors to Counties

Function to convert numbers between 0 and num_counties to a color range from green to red

In [229]:

num_counties = len(stateDF)

def rgb_to_hex(rgb):
    # rgb is a (red, green, blue) tuple of ints in the range 0-255
    return '#%02x%02x%02x' % rgb

def generate_rgb(val, maxval):
    # val = 0 maps to pure green, val = maxval maps to pure red
    f = float(val) / (maxval)
    red, green, blue = int(f*255), int((1-f)*255), 0
    return rgb_to_hex((red, green, blue))

colors = []
for val in range(0, num_counties+1):
    colors.append(generate_rgb(val, num_counties))
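
To see what the gradient looks like, here is a self-contained sketch of the same linear green-to-red interpolation for a state with four counties. The helper name `gradient` is hypothetical:

```python
def rgb_to_hex(rgb):
    """Format a (red, green, blue) tuple of 0-255 ints as '#rrggbb'."""
    return '#%02x%02x%02x' % rgb

def gradient(n):
    """Return n + 1 hex colors running from green (index 0) to red (index n)."""
    out = []
    for val in range(n + 1):
        f = float(val) / n                     # 0.0 at the green end, 1.0 at the red end
        out.append(rgb_to_hex((int(f * 255), int((1 - f) * 255), 0)))
    return out

palette = gradient(4)
print(palette)
```

The middle entry is an even mix of red and green, and the two ends are pure green and pure red.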

Sort the counties and assign a color to each county. The highest ranked county gets green and the lowest ranked county gets red.

In [230]:

from collections import defaultdict

counties_rating = rankScores

# Lowest weighted score first: position 0 is the top ranked county.
counties_sorted = sorted(counties_rating, key=counties_rating.get, reverse=False)
counties_colors = defaultdict(int)

# The i-th county in the sorted order gets the i-th color of the green-to-red range.
for i, fips in enumerate(counties_sorted[:num_counties]):
    counties_colors[fips] = colors[i]
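
The pairing of sorted counties with colors can also be written with `zip`. A brief sketch on made-up scores and a three-color palette (both are placeholders):

```python
# Hypothetical weighted sums per county FIPS code.
scores = {'06001': 12.0, '06003': 7.5, '06005': 20.0}

# Hypothetical palette running from green to red.
palette = ['#00ff00', '#7f7f00', '#ff0000']

ordered = sorted(scores, key=scores.get)   # lowest (best) score first
color_of = dict(zip(ordered, palette))     # best county gets green, worst gets red

print(color_of)
```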

SVG file for all counties in the USA

SVG file source: https://commons.wikimedia.org/wiki/File:USA_Counties.svg

In [236]:

from IPython.display import SVG
SVG(filename="counties.svg")

Out[236]:

Read the counties.svg file. Set the color for each county based on its rank

Using BeautifulSoup, get all the 'path' elements from the counties.svg file.
Each path element corresponds to a county in the US and has the following attributes:

style - style information
id - FIPS code of the county
inkscape:label - name of the county
d - path data (the points that outline the county)

Read all the path elements and select the ones that correspond to counties of the selected state
Set the fill color (in the style attribute) based on the county's rank
Create a new SVG file with the modified information
Write JavaScript (in the SVG) to display the county name and rank while hovering over the county.
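
The recoloring step can be sketched on a tiny inline SVG. This version uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it is self-contained; the helper name `color_counties`, the sample SVG and the colors are all hypothetical. Unlike the cell below, it also blanks in-state counties that have no color.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)   # keep the default SVG namespace on output

# Hypothetical three-county SVG; the path ids are county FIPS codes.
svg_text = '''<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10">
  <path id="06001" style="fill:#d0d0d0" d="M0 0 L1 1"/>
  <path id="06003" style="fill:#d0d0d0" d="M1 1 L2 2"/>
  <path id="32001" style="fill:#d0d0d0" d="M2 2 L3 3"/>
</svg>'''

def color_counties(svg_text, state_prefix, colors_by_fips, blank="fill:#FFFFFF"):
    """Fill counties of one state with their assigned colors and blank all other paths."""
    root = ET.fromstring(svg_text)
    for path in root.iter("{%s}path" % SVG_NS):
        fips = path.get("id", "")
        if fips.startswith(state_prefix) and fips in colors_by_fips:
            path.set("style", "fill:" + colors_by_fips[fips])
        else:
            path.set("style", blank)
    return ET.tostring(root, encoding="unicode")

colored = color_counties(svg_text, "06", {"06001": "#00ff00", "06003": "#ff0000"})
print(colored)
```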

In [232]:

svg = codecs.open(os.path.abspath("data/counties.svg"), encoding='iso-8859-1').read()

soup = BeautifulSoup(svg, selfClosingTags=['defs','sodipodi:namedview'])
paths = soup.findAll('path')
svgs = soup.findAll('svg')
viewbox = "5 90 650 300"
style = 'font-size:12px;fill-rule:nonzero;stroke:#FFFFFF;stroke-opacity:1; stroke-width:0.1;stroke-miterlimit:4;stroke-dasharray:none;stroke-linecap:butt; marker-start:none;stroke-linejoin:bevel;fill:'
blank = 'fill:#FFFFFF'
path_onmouseover_open = "displayName('"
path_onmouseover_close = "')"

legends = []
countyNames = defaultdict(int)

for p in paths:
    # County paths use the FIPS code as their id; its first two characters identify the state.
    if p['id'][0:2] == state[:2] and p['id'] not in ["State_Lines", "separator"]:
        try:
            color = counties_colors[p['id']]
            p['style'] = style + color
            rank = counties_sorted.index(p['id'])
            countyNames[p['id']] = p['inkscape:label']
            p['inkscape:label'] = p['inkscape:label'] + '. Rank: ' + str(rank+1)
            name = p['inkscape:label']
            legends.append(name)
            p['onmouseover'] = path_onmouseover_open + name + path_onmouseover_close
        except (KeyError, TypeError, ValueError):
            # Skip paths that have no label or no score for this county.
            continue
    else:
        p['style'] = blank

for s in svgs:
    s['viewBox']= viewbox

# Write the modified SVG as UTF-8 text.
with codecs.open("output.svg", "w", encoding="utf-8") as out:
    out.write(soup.prettify(formatter=None))

Top Ranked Counties

In [233]:

# counties_sorted is ordered best first, so the position (starting at 1) is the rank.
for rank, fips in enumerate(counties_sorted[:10], 1):
    print("Rank" + str(rank) + " " + str(countyNames[fips]))

Rank1 Nevada, CA
Rank2 El Dorado, CA
Rank3 San Mateo, CA
Rank4 San Luis Obispo, CA
Rank5 Placer, CA
Rank6 Orange, CA
Rank7 Marin, CA
Rank8 Santa Barbara, CA
Rank9 San Diego, CA
Rank10 Sonoma, CA

Step 4: Displaying the generated Heat Map

In [234]:

from IPython.display import SVG
SVG(filename="output.svg")

Out[234]:

Sources:

http://www.countyhealthrankings.org/rankings/data

http://nces.ed.gov/ccd/drpcompstatelvl.asp
