Working with Open Data Midterm (March 18, 2014)
There are 94 points in this exam: 2 each for the 47 questions. The questions are either multiple choice or short answers. For multiple choice, just write the number of the choice selected.
Name: ______________________________________
Consider this code to construct a DataFrame of populations of countries.
import json
import requests
from pandas import DataFrame
# read population in from JSON-formatted data derived from Wikipedia
pop_json_url = "https://gist.github.com/rdhyee/8511607/" + \
    "raw/f16257434352916574473e63612fcea55a0c1b1c/population_of_countries.json"
pop_list = requests.get(pop_json_url).json()
df = DataFrame(pop_list)
df[:5]
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | China | 1385566537 |
| 1 | 2 | India | 1252139596 |
| 2 | 3 | United States | 320050716 |
| 3 | 4 | Indonesia | 249865631 |
| 4 | 5 | Brazil | 200361925 |
5 rows × 3 columns
Note the dtypes of the columns.
df.dtypes
0    float64
1     object
2      int64
dtype: object
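Before the questions, here is a minimal sketch (toy data, not the country table above) of the boolean-mask filtering pattern the questions use: build a boolean Series with str.startswith, index the DataFrame with it, then sum a numeric column over the matching rows.

```python
from pandas import DataFrame

# hypothetical toy frame, just to illustrate the pattern
toy = DataFrame({'name': ['Canada', 'Chile', 'Brazil'],
                 'pop':  [35, 18, 200]})
mask = toy['name'].str.startswith('C')   # boolean Series: True, True, False
c_total = toy[mask]['pop'].sum()         # sum over the matching rows only
print(c_total)                           # 53
```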
Q1: What is the relationship between s and the population of China, where s is defined as

s = sum(df[df[1].str.startswith('C')][2])

1. s is greater than the population of China
2. s is the same as the population of China
3. s is less than the population of China
4. s is not a number.

A1:
Q2: What is the relationship between s2 and the population of China, where s2 is defined by:

s2 = sum(df[df[1].str.startswith('X')][2])

1. s2 is greater than the population of China
2. s2 is the same as the population of China
3. s2 is less than the population of China
4. s2 is not a number.

A2:
Q3: What happens when the following statement is run?

df.columns = ['Number','Country','Population']

1. df gets a new attribute called columns
2. df's columns are renamed based on the list

A3:
Q4: This statement does the following:

df.columns = ['Number','Country']

1. df gets a new attribute called columns
2. df's columns are renamed based on the list

A4:
Q5: How would you rewrite the following statement to get the same result as
s = sum(df[df[1].str.startswith('C')][2])
after running:
df.columns = ['Number','Country','Population']
A5:
Q6. What is
len(df[df["Population"] > 1000000000])
A6:
Q7. What is
";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0]))
A7:
Q8. What is
len(";".join(df[df['Population']>1000000000]['Country'].apply(lambda s: s[0])))
A8:
from pandas import DataFrame, Series
import numpy as np
s1 = Series(np.arange(-1,4))
s1
0   -1
1    0
2    1
3    2
4    3
dtype: int64
Q9: What is
s1 + 1
A9:
Q10: What is
s1.apply(lambda k: 2*k).sum()
A10:
Q11: What is
s1.cumsum()[3]
A11:
Q12: What is
s1.cumsum() - s1.cumsum()
A12:
Q13. What is
len(s1.cumsum() - s1.cumsum())
A13:
Q14: What is
np.any(s1 > 2)
A14:
Q15. What is
np.all(s1<3)
A15:
Consider the following code to load population(s) from the Census API.
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
c.sf1.get(('NAME', 'P0010001'), {'for': 'state:%s' % states.CA.fips})
[{u'NAME': u'California', u'P0010001': u'37253956', u'state': u'06'}]
Q16: What is the purpose of settings.CENSUS_KEY?
A16:
Q17. When we run pip install census, we are:
A17:
Consider r1 and r2:

r1 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:%s' % states.CA.fips})
r2 = c.sf1.get(('NAME', 'P0010001'), {'for': 'county:*', 'in': 'state:*'})

Q18: What is the difference between r1 and r2?
A18:
Q19. What's the relationship between len(r1) and len(r2)?

1. len(r1) is less than len(r2)
2. len(r1) equals len(r2)
3. len(r1) is greater than len(r2)
A19:
Q20: Which is a correct geographic hierarchy? (Here "Nation > States" means that Nation is subdivided into States.)
A20:
from pandas import DataFrame
import numpy as np
from census import Census
from us import states
import settings
c = Census(settings.CENSUS_KEY)
r = c.sf1.get(('NAME', 'P0010001'), {'for': 'state:*'})
df1 = DataFrame(r)
df1.head()
NAME | P0010001 | state | |
---|---|---|---|
0 | Alabama | 4779736 | 01 |
1 | Alaska | 710231 | 02 |
2 | Arizona | 6392017 | 04 |
3 | Arkansas | 2915918 | 05 |
4 | California | 37253956 | 06 |
5 rows × 3 columns
len(df1)
52
Q21: Why does df1 have 52 items? Please explain.
A21:
Consider the following two expressions:
print df1.P0010001.sum()
print
print df1.P0010001.astype(int).sum()
477973671023163920172915918372539565029196357409789793460172318801310968765313603011567582128306326483802304635528531184339367453337213283615773552654762998836405303925296729759889279894151826341270055113164708791894205917919378102953548367259111536504375135138310741270237910525674625364814180634610525145561276388562574180010246724540185299456869865636263725789

312471327
Q22: Why is df1.P0010001.sum()
different from df1.P0010001.astype(int).sum()
?
A22:
Q23: Describe the output of the following:
df1.P0010001 = df1.P0010001.astype(int)
df1[['NAME','P0010001']].sort('P0010001', ascending=True).head()
A23:
Q24: After running:

df1.set_index('NAME', inplace=True)

how would you access the Series for the state of Nebraska?

1. df1['Nebraska']
2. df1[1]
3. df1.ix['Nebraska']
4. df1[df1['NAME'] == 'Nebraska']

A24:
Q25. What is len(states.STATES)?
A25:
Q26. What is
len(df1[np.in1d(df1.state, [s.fips for s in states.STATES])])
A26:
In the next question, we will make use of the negation operator ~. Take a look at a specific example:

~Series([True, True, False, True])

0    False
1    False
2     True
3    False
dtype: bool
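As a hedged sketch (toy codes, not the census frame), here is how np.in1d and ~ combine to select rows whose value is NOT in a given list, the pattern the next question applies to state FIPS codes:

```python
import numpy as np
from pandas import DataFrame

# hypothetical toy frame with made-up codes
toy = DataFrame({'code': ['01', '02', '72'], 'pop': [10, 20, 5]})
mask = np.in1d(toy['code'], ['01', '02'])   # array([ True,  True, False])
non_matches = toy[~mask]                    # rows whose code is outside the list
print(list(non_matches['code']))            # ['72']
```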
Q27. What is
list(df1[~np.in1d(df1.state, [s.fips for s in states.STATES])].index)[0]
A27:
Consider pop1 and pop2:
pop1 = df1['P0010001'].astype('int').sum()
pop2 = df1[np.in1d(df1.state, [s.fips for s in states.STATES])]['P0010001'].astype('int').sum()
pop1-pop2
3725789
Q28. What does pop1 - pop2
represent?
A28:
Q29. Given that range(10) is

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

how would you get the total of every integer from 1 to 100?

1. sum(range(1, 101))
2. sum(range(100))
3. sum(range(1, 100))

A29:
Q30. What output is produced from
# itertools is a great library
# http://docs.python.org/2/library/itertools.html#itertools.count
# itertools.count(start=0, step=1):
# "Make an iterator that returns evenly spaced values starting with start."
from itertools import islice, count
c = count(0, 1)
print c.next()
print c.next()
A30:
Q31. Recalling that
1+2+3+...+100 = 5050
what is:
(2*Series(np.arange(101))).sum()
A31:
Consider the following generator that we used to query for census places.
import pandas as pd
from pandas import DataFrame
import census
import settings
import us
from itertools import islice
c=census.Census(settings.CENSUS_KEY)
def places(variables="NAME"):
    for state in us.states.STATES:
        geo = {'for':'place:*', 'in':'state:{s_fips}'.format(s_fips=state.fips)}
        for place in c.sf1.get(variables, geo=geo):
            yield place
Now we compute a DataFrame for the places: places_df
r = list(islice(places("NAME,P0010001"), None))
places_df = DataFrame(r)
places_df.P0010001 = places_df.P0010001.astype('int')
print "number of places", len(places_df)
print "total pop", places_df.P0010001.sum()
places_df.head()
number of places 29261
total pop 228457238
| | NAME | P0010001 | place | state |
|---|---|---|---|---|
| 0 | Abanda CDP | 192 | 00100 | 01 |
| 1 | Abbeville city | 2688 | 00124 | 01 |
| 2 | Adamsville city | 4522 | 00460 | 01 |
| 3 | Addison town | 758 | 00484 | 01 |
| 4 | Akron town | 356 | 00676 | 01 |
5 rows × 4 columns
We display the most populous places from California
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
| | NAME | P0010001 | place | state |
|---|---|---|---|---|
| 2714 | Los Angeles city | 3792621 | 44000 | 06 |
| 3112 | San Diego city | 1307402 | 66000 | 06 |
| 3122 | San Jose city | 945942 | 68000 | 06 |
| 3116 | San Francisco city | 805235 | 67000 | 06 |
| 2425 | Fresno city | 494665 | 27000 | 06 |
5 rows × 4 columns
Q32. Given
places_df[places_df.state=='06'].sort_index(by='P0010001', ascending=False).head()
is
| | NAME | P0010001 | place | state |
|---|---|---|---|---|
| 2714 | Los Angeles city | 3792621 | 44000 | 06 |
| 3112 | San Diego city | 1307402 | 66000 | 06 |
| 3122 | San Jose city | 945942 | 68000 | 06 |
| 3116 | San Francisco city | 805235 | 67000 | 06 |
| 2425 | Fresno city | 494665 | 27000 | 06 |
5 rows × 4 columns
what is

places_df.ix[3122]['label']

after we add the label column with:

places_df['label'] = places_df.apply(lambda s: s['state']+s['place'], axis=1)
A32:
Q33. What is
places_df["NAME"][3122]
A33:
Now let's set up a DataFrame with some letters and properties of letters.
# numpy and pandas related imports
import numpy as np
from pandas import Series, DataFrame
import pandas as pd
# for example, using lower and uppercase English letters
import string
lower = Series(list(string.lowercase), name='lower')
upper = Series(list(string.uppercase), name='upper')
df2 = pd.concat((lower, upper), axis=1)
df2['ord'] = df2['lower'].apply(ord)
df2.head()
| | lower | upper | ord |
|---|---|---|---|
| 0 | a | A | 97 |
| 1 | b | B | 98 |
| 2 | c | C | 99 |
| 3 | d | D | 100 |
| 4 | e | E | 101 |
5 rows × 3 columns
Note that string.upper takes a letter and returns its uppercase version. For example:
string.upper('b')
'B'
Q34. What is
np.all(df2['lower'].apply(string.upper) == df2['upper'])
A34:
Q35. What is
df2.apply(lambda s: s['lower'] + s['upper'], axis=1)[6]
A35:
Please remind yourself what enumerate does.
words = ['Berkeley', 'I', 'School']
for (i, word) in islice(enumerate(words), 1):
    print (i, word)

(0, 'Berkeley')
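A quick sketch with toy data (a hypothetical letters list, not the exam's words): enumerate pairs each item with its index, and islice limits how many pairs we draw from the iterator.

```python
from itertools import islice

letters = ['x', 'y', 'z']                    # hypothetical toy list
pairs = list(enumerate(letters))             # index/item pairs
print(pairs)                                 # [(0, 'x'), (1, 'y'), (2, 'z')]
print(list(islice(enumerate(letters), 2)))   # [(0, 'x'), (1, 'y')]
```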
Q36. What is
list(enumerate(words))[2][1]
A36:
Now consider the generator g2:

def g2():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        if word != 'I':
            for letter in list(word):
                yield letter

my_g2 = g2()
Q37. What is
len(list(my_g2))
A37:
def g3():
    words = ['Berkeley', 'I', 'School']
    for word in words:
        yield words
Q38. What is
len(list(g3()))
A38:
Consider using groupby with a DataFrame of states.
import us
import census
import settings
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
from itertools import islice
c = census.Census(settings.CENSUS_KEY)
def states(variables='NAME'):
    geo = {'for':'state:*'}
    states_fips = set([state.fips for state in us.states.STATES])
    # need to filter out non-states
    for r in c.sf1.get(variables, geo=geo, year=2010):
        if r['state'] in states_fips:
            yield r
# make a dataframe from the total populations of states in the 2010 Census
df = DataFrame(states('NAME,P0010001'))
df.P0010001 = df.P0010001.astype('int')
df['first_letter'] = df.NAME.apply(lambda s:s[0])
df.head()
| | NAME | P0010001 | state | first_letter |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 01 | A |
| 1 | Alaska | 710231 | 02 | A |
| 2 | Arizona | 6392017 | 04 | A |
| 3 | Arkansas | 2915918 | 05 | A |
| 4 | California | 37253956 | 06 | C |
5 rows × 4 columns
For reference, here's a list of all the states:
print list(df.NAME)
[u'Alabama', u'Alaska', u'Arizona', u'Arkansas', u'California', u'Colorado', u'Connecticut', u'Delaware', u'District of Columbia', u'Florida', u'Georgia', u'Hawaii', u'Idaho', u'Illinois', u'Indiana', u'Iowa', u'Kansas', u'Kentucky', u'Louisiana', u'Maine', u'Maryland', u'Massachusetts', u'Michigan', u'Minnesota', u'Mississippi', u'Missouri', u'Montana', u'Nebraska', u'Nevada', u'New Hampshire', u'New Jersey', u'New Mexico', u'New York', u'North Carolina', u'North Dakota', u'Ohio', u'Oklahoma', u'Oregon', u'Pennsylvania', u'Rhode Island', u'South Carolina', u'South Dakota', u'Tennessee', u'Texas', u'Utah', u'Vermont', u'Virginia', u'Washington', u'West Virginia', u'Wisconsin', u'Wyoming']
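Here is a hedged sketch of the groupby pattern on toy data (hypothetical fruit names, not the states frame): group rows by first letter, then apply a function to each group to count the names in it.

```python
from pandas import DataFrame

# hypothetical toy frame mirroring the NAME/first_letter structure above
toy = DataFrame({'NAME': ['apple', 'apricot', 'banana'],
                 'P0010001': [1, 2, 3]})
toy['first_letter'] = toy['NAME'].apply(lambda s: s[0])
per_letter = toy.groupby('first_letter').apply(lambda g: len(g['NAME']))
print(per_letter['a'])   # 2
print(per_letter['b'])   # 1
```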
Q39. What is
df.groupby('first_letter').apply(lambda g:list(g.NAME))['C']
A39:
Q40. What is
df.groupby('first_letter').apply(lambda g:len(g.NAME))['A']
A40:
Q41. What is
df.groupby('first_letter').agg('count')['first_letter']['P']
A41:
Q42. What is
len(df.groupby('NAME'))
A42:
Recall the code from the diversity calculations:

def normalize(s):
    """take a Series and divide each item by the sum so that the new series adds up to 1.0"""
    total = np.sum(s)
    return s.astype('float') / total

def entropy(series):
    """Normalized Shannon Index"""
    # a series in which all the entries are equal should result in normalized entropy of 1.0
    # eliminate 0s
    series1 = series[series!=0]
    # if len(series) < 2 (i.e., 0 or 1) then return 0
    if len(series1) > 1:
        # calculate the maximum possible entropy for given length of input series
        max_s = -np.log(1.0/len(series))
        total = float(sum(series1))
        p = series1.astype('float')/float(total)
        return sum(-p*np.log(p))/max_s
    else:
        return 0.0

def gini_simpson(s):
    # https://en.wikipedia.org/wiki/Diversity_index#Gini.E2.80.93Simpson_index
    s1 = normalize(s)
    return 1-np.sum(s1*s1)
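As a worked check of the formulas above, with toy values of our own choosing (two equally sized categories), the arithmetic goes as follows:

```python
import numpy as np
from pandas import Series

s = Series([4, 4])                    # two equally sized categories (toy input)
p = s / float(s.sum())                # probabilities: 0.5, 0.5
max_s = -np.log(1.0 / len(s))         # maximum entropy for 2 categories: ln 2
h = (-p * np.log(p)).sum() / max_s    # normalized Shannon index
gs = 1 - (p * p).sum()                # Gini-Simpson index: 1 - (0.25 + 0.25)
print(h)                              # 1.0
print(gs)                             # 0.5
```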
Q43. Suppose you have 10 people and 5 categories; how would you maximize the Shannon entropy?
A43:
Q44. What is
entropy(Series([0,0,10,0,0]))
A44:
Q45. What is
entropy(Series([10,0,0,0,0]))
A45:
Q46. What is
entropy(Series([1,1,1,1,1]))
A46:
Q47. What is
gini_simpson(Series([2,2,2,2,2]))
A47: