import pandas as pd
Parsing functions in pandas:
read_csv: Load delimited data from a file, URL, or file-like object. Comma is the default delimiter.
read_table: Load delimited data from a file, URL, or file-like object. Tab ('\t') is the default delimiter.
read_fwf: Read data in fixed-width column format (that is, no delimiters).
read_clipboard: Version of read_table that reads data from the clipboard. Useful for converting tables from web pages.
!cat /Users/sergulaydore/pydata-book/ch06/ex1.csv
a,b,c,d,message 1,2,3,4,hello 5,6,7,8,world 9,10,11,12,foo
Since this is comma-delimited, we can use read_csv to read it into a DataFrame:
df = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex1.csv')
df
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
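read_csv accepts any file-like object, not just a path, so the example above can be reproduced without the book's files; a minimal self-contained sketch using io.StringIO as a stand-in for ex1.csv:

```python
import io
import pandas as pd

# In-memory stand-in for ex1.csv, so the example runs without the book's files
csv_text = "a,b,c,d,message\n1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

df = pd.read_csv(io.StringIO(csv_text))
```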
We could also have used read_table, specifying the delimiter:
pd.read_table('/Users/sergulaydore/pydata-book/ch06/ex1.csv', sep = ',')
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
A file will not always have a header row:
!cat /Users/sergulaydore/pydata-book/ch06/ex2.csv
1,2,3,4,hello 5,6,7,8,world 9,10,11,12,foo
You can let pandas assign default column names:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', header = None)
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
or you can specify names yourself:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', names = ['a','b', 'c', 'd', 'message'])
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Suppose you wanted the message column to be the index of the returned DataFrame:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', names = names, index_col = 'message')
a | b | c | d | |
---|---|---|---|---|
message | ||||
hello | 1 | 2 | 3 | 4 |
world | 5 | 6 | 7 | 8 |
foo | 9 | 10 | 11 | 12 |
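The three header options above can be sketched together with an in-memory stand-in for ex2.csv:

```python
import io
import pandas as pd

raw = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"

# Default integer column names (0, 1, 2, ...)
df_default = pd.read_csv(io.StringIO(raw), header=None)

# Explicit names, with 'message' promoted to the index
names = ['a', 'b', 'c', 'd', 'message']
df_indexed = pd.read_csv(io.StringIO(raw), names=names, index_col='message')
```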
To form a hierarchical index from multiple columns, pass a list of column names or numbers:
!cat /Users/sergulaydore/pydata-book/ch06/csv_mindex.csv
key1,key2,value1,value2 one,a,1,2 one,b,3,4 one,c,5,6 one,d,7,8 two,a,9,10 two,b,11,12 two,c,13,14 two,d,15,16
parsed = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/csv_mindex.csv', index_col = ['key1','key2'])
parsed
value1 | value2 | ||
---|---|---|---|
key1 | key2 | ||
one | a | 1 | 2 |
b | 3 | 4 | |
c | 5 | 6 | |
d | 7 | 8 | |
two | a | 9 | 10 |
b | 11 | 12 | |
c | 13 | 14 | |
d | 15 | 16 |
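The same hierarchical-index result can be reproduced in memory (a shortened stand-in for csv_mindex.csv):

```python
import io
import pandas as pd

raw = ("key1,key2,value1,value2\n"
       "one,a,1,2\none,b,3,4\ntwo,a,9,10\ntwo,b,11,12\n")

# Passing a list to index_col builds a MultiIndex from those columns
parsed = pd.read_csv(io.StringIO(raw), index_col=['key1', 'key2'])
```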
In some cases, a table might not have a fixed delimiter:
list(open('/Users/sergulaydore/pydata-book/ch06/ex3.txt'))
[' A B C\n', 'aaa -0.264438 -1.026059 -0.619500\n', 'bbb 0.927272 0.302904 -0.032399\n', 'ccc -0.264273 -0.386314 -0.217601\n', 'ddd -0.871858 -0.348382 1.100491\n']
In this case fields are separated by a variable amount of white space. This can be expressed by the regular expression \s+.
result = pd.read_table('/Users/sergulaydore/pydata-book/ch06/ex3.txt', sep = '\s+')
result
A | B | C | |
---|---|---|---|
aaa | -0.264438 | -1.026059 | -0.619500 |
bbb | 0.927272 | 0.302904 | -0.032399 |
ccc | -0.264273 | -0.386314 | -0.217601 |
ddd | -0.871858 | -0.348382 | 1.100491 |
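A self-contained version of the whitespace-delimited case (shortened, invented numbers; read_csv with sep=r'\s+' is equivalent to the read_table call above):

```python
import io
import pandas as pd

raw = "A B C\naaa -0.26 -1.02 -0.62\nbbb 0.93 0.30 -0.03\n"

# With one fewer header field than data fields, pandas infers that the
# first column is the index
result = pd.read_csv(io.StringIO(raw), sep=r'\s+')
```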
You can skip the first, third and fourth rows of a file with skiprows:
!cat /Users/sergulaydore/pydata-book/ch06/ex4.csv
# hey! a,b,c,d,message # just wanted to make things more difficult for you # who reads CSV files with computers, anyway? 1,2,3,4,hello 5,6,7,8,world 9,10,11,12,foo
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex4.csv', skiprows = [0,2,3])
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
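The skiprows example as a runnable sketch, with the commented lines of ex4.csv rebuilt in memory:

```python
import io
import pandas as pd

raw = ("# hey!\n"
       "a,b,c,d,message\n"
       "# just wanted to make things more difficult for you\n"
       "# who reads CSV files with computers, anyway?\n"
       "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n")

# Skip the first, third, and fourth lines (0-based positions 0, 2, 3)
df = pd.read_csv(io.StringIO(raw), skiprows=[0, 2, 3])
```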
Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:
!cat /Users/sergulaydore/pydata-book/ch06/ex5.csv
something,a,b,c,d,message one,1,2,3,4,NA two,5,6,,8,world three,9,10,11,12,foo
result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv')
result
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11 | 12 | foo |
pd.isnull(result)
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | False | False | False | False | False | True |
1 | False | False | False | True | False | False |
2 | False | False | False | False | False | False |
The na_values option can take either a list or set of strings to consider missing values:
result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv', na_values = ['NULL'])
result
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11 | 12 | foo |
Different NA sentinels can be specified for each column in a dict:
sentinels = {'message':['foo', 'NA'], 'something': ['two']}
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv', na_values = sentinels)
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3 | 4 | NaN |
1 | NaN | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11 | 12 | NaN |
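The per-column sentinel dict can be sketched against an in-memory copy of ex5.csv:

```python
import io
import pandas as pd

raw = ("something,a,b,c,d,message\n"
       "one,1,2,3,4,NA\ntwo,5,6,,8,world\nthree,9,10,11,12,foo\n")

# 'foo' and 'NA' are treated as missing only in 'message';
# 'two' only in 'something'
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv(io.StringIO(raw), na_values=sentinels)
```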
result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv')
result
one | two | three | four | key | |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
5 | 1.817480 | 0.742273 | 0.419395 | -2.251035 | Q |
6 | -0.776764 | 0.935518 | -0.332872 | -1.875641 | U |
7 | -0.913135 | 1.530624 | -0.572657 | 0.477252 | K |
8 | 0.358480 | -0.497572 | -0.367016 | 0.507702 | S |
9 | -1.740877 | -1.160417 | -1.637830 | 2.172201 | G |
10 | 0.240564 | -0.328249 | 1.252155 | 1.072796 | 8 |
11 | 0.764018 | 1.165476 | -0.639544 | 1.495258 | R |
12 | 0.571035 | -0.310537 | 0.582437 | -0.298765 | 1 |
13 | 2.317658 | 0.430710 | -1.334216 | 0.199679 | P |
14 | 1.547771 | -1.119753 | -2.277634 | 0.329586 | J |
15 | -1.310608 | 0.401719 | -1.000987 | 1.156708 | E |
16 | -0.088496 | 0.634712 | 0.153324 | 0.415335 | B |
17 | -0.018663 | -0.247487 | -1.446522 | 0.750938 | A |
18 | -0.070127 | -1.579097 | 0.120892 | 0.671432 | F |
19 | -0.194678 | -0.492039 | 2.359605 | 0.319810 | H |
20 | -0.248618 | 0.868707 | -0.492226 | -0.717959 | W |
21 | -1.091549 | -0.867110 | -0.647760 | -0.832562 | C |
22 | 0.641404 | -0.138822 | -0.621963 | -0.284839 | C |
23 | 1.216408 | 0.992687 | 0.165162 | -0.069619 | V |
24 | -0.564474 | 0.792832 | 0.747053 | 0.571675 | I |
25 | 1.759879 | -0.515666 | -0.230481 | 1.362317 | S |
26 | 0.126266 | 0.309281 | 0.382820 | -0.239199 | L |
27 | 1.334360 | -0.100152 | -0.840731 | -0.643967 | 6 |
28 | -0.737620 | 0.278087 | -0.053235 | -0.950972 | J |
29 | -1.148486 | -0.986292 | -0.144963 | 0.124362 | Y |
... | ... | ... | ... | ... | ... |
9970 | 0.633495 | -0.186524 | 0.927627 | 0.143164 | 4 |
9971 | 0.308636 | -0.112857 | 0.762842 | -1.072977 | 1 |
9972 | -1.627051 | -0.978151 | 0.154745 | -1.229037 | Z |
9973 | 0.314847 | 0.097989 | 0.199608 | 0.955193 | P |
9974 | 1.666907 | 0.992005 | 0.496128 | -0.686391 | S |
9975 | 0.010603 | 0.708540 | -1.258711 | 0.226541 | K |
9976 | 0.118693 | -0.714455 | -0.501342 | -0.254764 | K |
9977 | 0.302616 | -2.011527 | -0.628085 | 0.768827 | H |
9978 | -0.098572 | 1.769086 | -0.215027 | -0.053076 | A |
9979 | -0.019058 | 1.964994 | 0.738538 | -0.883776 | F |
9980 | -0.595349 | 0.001781 | -1.423355 | -1.458477 | M |
9981 | 1.392170 | -1.396560 | -1.425306 | -0.847535 | H |
9982 | -0.896029 | -0.152287 | 1.924483 | 0.365184 | 6 |
9983 | -2.274642 | -0.901874 | 1.500352 | 0.996541 | N |
9984 | -0.301898 | 1.019906 | 1.102160 | 2.624526 | I |
9985 | -2.548389 | -0.585374 | 1.496201 | -0.718815 | D |
9986 | -0.064588 | 0.759292 | -1.568415 | -0.420933 | E |
9987 | -0.143365 | -1.111760 | -1.815581 | 0.435274 | 2 |
9988 | -0.070412 | -1.055921 | 0.338017 | -0.440763 | X |
9989 | 0.649148 | 0.994273 | -1.384227 | 0.485120 | Q |
9990 | -0.370769 | 0.404356 | -1.051628 | -1.050899 | 8 |
9991 | -0.409980 | 0.155627 | -0.818990 | 1.277350 | W |
9992 | 0.301214 | -1.111203 | 0.668258 | 0.671922 | A |
9993 | 1.821117 | 0.416445 | 0.173874 | 0.505118 | X |
9994 | 0.068804 | 1.322759 | 0.802346 | 0.223618 | H |
9995 | 2.311896 | -0.417070 | -1.409599 | -0.515821 | L |
9996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | E |
9997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | K |
9998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | G |
9999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | 0 |
10000 rows × 5 columns
Read only a small number of rows:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv', nrows = 5)
one | two | three | four | key | |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
Read the file in pieces:
chunker = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv', chunksize = 1000)
chunker
<pandas.io.parsers.TextFileReader at 0x109ad6bd0>
next(iter(chunker))
one | two | three | four | key | |
---|---|---|---|---|---|
0 | 0.467976 | -0.038649 | -0.295344 | -1.824726 | L |
1 | -0.358893 | 1.404453 | 0.704965 | -0.200638 | B |
2 | -0.501840 | 0.659254 | -0.421691 | -0.057688 | G |
3 | 0.204886 | 1.074134 | 1.388361 | -0.982404 | R |
4 | 0.354628 | -0.133116 | 0.283763 | -0.837063 | Q |
5 | 1.817480 | 0.742273 | 0.419395 | -2.251035 | Q |
6 | -0.776764 | 0.935518 | -0.332872 | -1.875641 | U |
7 | -0.913135 | 1.530624 | -0.572657 | 0.477252 | K |
8 | 0.358480 | -0.497572 | -0.367016 | 0.507702 | S |
9 | -1.740877 | -1.160417 | -1.637830 | 2.172201 | G |
10 | 0.240564 | -0.328249 | 1.252155 | 1.072796 | 8 |
11 | 0.764018 | 1.165476 | -0.639544 | 1.495258 | R |
12 | 0.571035 | -0.310537 | 0.582437 | -0.298765 | 1 |
13 | 2.317658 | 0.430710 | -1.334216 | 0.199679 | P |
14 | 1.547771 | -1.119753 | -2.277634 | 0.329586 | J |
15 | -1.310608 | 0.401719 | -1.000987 | 1.156708 | E |
16 | -0.088496 | 0.634712 | 0.153324 | 0.415335 | B |
17 | -0.018663 | -0.247487 | -1.446522 | 0.750938 | A |
18 | -0.070127 | -1.579097 | 0.120892 | 0.671432 | F |
19 | -0.194678 | -0.492039 | 2.359605 | 0.319810 | H |
20 | -0.248618 | 0.868707 | -0.492226 | -0.717959 | W |
21 | -1.091549 | -0.867110 | -0.647760 | -0.832562 | C |
22 | 0.641404 | -0.138822 | -0.621963 | -0.284839 | C |
23 | 1.216408 | 0.992687 | 0.165162 | -0.069619 | V |
24 | -0.564474 | 0.792832 | 0.747053 | 0.571675 | I |
25 | 1.759879 | -0.515666 | -0.230481 | 1.362317 | S |
26 | 0.126266 | 0.309281 | 0.382820 | -0.239199 | L |
27 | 1.334360 | -0.100152 | -0.840731 | -0.643967 | 6 |
28 | -0.737620 | 0.278087 | -0.053235 | -0.950972 | J |
29 | -1.148486 | -0.986292 | -0.144963 | 0.124362 | Y |
... | ... | ... | ... | ... | ... |
970 | 0.633495 | -0.186524 | 0.927627 | 0.143164 | P |
971 | 0.308636 | -0.112857 | 0.762842 | -1.072977 | X |
972 | -1.627051 | -0.978151 | 0.154745 | -1.229037 | O |
973 | 0.314847 | 0.097989 | 0.199608 | 0.955193 | R |
974 | 1.666907 | 0.992005 | 0.496128 | -0.686391 | W |
975 | 0.010603 | 0.708540 | -1.258711 | 0.226541 | O |
976 | 0.118693 | -0.714455 | -0.501342 | -0.254764 | S |
977 | 0.302616 | -2.011527 | -0.628085 | 0.768827 | R |
978 | -0.098572 | 1.769086 | -0.215027 | -0.053076 | G |
979 | -0.019058 | 1.964994 | 0.738538 | -0.883776 | X |
980 | -0.595349 | 0.001781 | -1.423355 | -1.458477 | O |
981 | 1.392170 | -1.396560 | -1.425306 | -0.847535 | K |
982 | -0.896029 | -0.152287 | 1.924483 | 0.365184 | M |
983 | -2.274642 | -0.901874 | 1.500352 | 0.996541 | Q |
984 | -0.301898 | 1.019906 | 1.102160 | 2.624526 | Q |
985 | -2.548389 | -0.585374 | 1.496201 | -0.718815 | R |
986 | -0.064588 | 0.759292 | -1.568415 | -0.420933 | N |
987 | -0.143365 | -1.111760 | -1.815581 | 0.435274 | C |
988 | -0.070412 | -1.055921 | 0.338017 | -0.440763 | X |
989 | 0.649148 | 0.994273 | -1.384227 | 0.485120 | I |
990 | -0.370769 | 0.404356 | -1.051628 | -1.050899 | V |
991 | -0.409980 | 0.155627 | -0.818990 | 1.277350 | 4 |
992 | 0.301214 | -1.111203 | 0.668258 | 0.671922 | Q |
993 | 1.821117 | 0.416445 | 0.173874 | 0.505118 | O |
994 | 0.068804 | 1.322759 | 0.802346 | 0.223618 | W |
995 | 2.311896 | -0.417070 | -1.409599 | -0.515821 | M |
996 | -0.479893 | -0.650419 | 0.745152 | -0.646038 | H |
997 | 0.523331 | 0.787112 | 0.486066 | 1.093156 | D |
998 | -0.362559 | 0.598894 | -1.843201 | 0.887292 | W |
999 | -0.096376 | -1.012999 | -0.657431 | -0.573315 | K |
1000 rows × 5 columns
The TextParser object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:
tot = pd.Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value = 0)
tot = tot.order(ascending = False)
tot[:10]
E 336 X 327 L 315 M 309 K 303 Q 301 O 299 P 299 J 298 F 295 dtype: float64
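The same chunked aggregation as a self-contained sketch on a small invented stand-in for ex6.csv (note: Series.order from the book's pandas was later renamed; sort_values is the modern equivalent):

```python
import io
import pandas as pd

# Small invented stand-in for ex6.csv
raw = "one,key\n" + "".join("%d,%s\n" % (i, k)
                            for i, k in enumerate(['E', 'X', 'E', 'L', 'E', 'X', 'L', 'E']))

chunker = pd.read_csv(io.StringIO(raw), chunksize=3)

tot = pd.Series(dtype=float)
for piece in chunker:
    # Accumulate value counts across chunks; fill_value=0 handles keys
    # missing from a given chunk
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)
```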
Data can be exported to delimited format.
data = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv')
data
something | a | b | c | d | message | |
---|---|---|---|---|---|---|
0 | one | 1 | 2 | 3 | 4 | NaN |
1 | two | 5 | 6 | NaN | 8 | world |
2 | three | 9 | 10 | 11 | 12 | foo |
We can write data out to a comma-separated file:
data.to_csv('/Users/sergulaydore/pydata-book/ch06/out.csv')
!cat /Users/sergulaydore/pydata-book/ch06/out.csv
,something,a,b,c,d,message 0,one,1,2,3.0,4, 1,two,5,6,,8,world 2,three,9,10,11.0,12,foo
Other delimiters can be used, of course (writing to sys.stdout so it just prints the text result).
import sys
data.to_csv(sys.stdout, sep = '|')
|something|a|b|c|d|message 0|one|1|2|3.0|4| 1|two|5|6||8|world 2|three|9|10|11.0|12|foo
Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:
data.to_csv(sys.stdout, na_rep = 'NULL')
,something,a,b,c,d,message 0,one,1,2,3.0,4,NULL 1,two,5,6,NULL,8,world 2,three,9,10,11.0,12,foo
With no other options specified, both the row and column labels are written. Both of these can be disabled:
data.to_csv(sys.stdout, index=False, header = False)
one,1,2,3.0,4, two,5,6,,8,world three,9,10,11.0,12,foo
You can also write only a subset of the columns, and in an order of your choosing:
data.to_csv(sys.stdout, index = False, columns = ['a', 'b', 'c'])
a,b,c 1,2,3.0 5,6, 9,10,11.0
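The to_csv options above (na_rep, index/header toggles, column subsetting) can be combined in one self-contained sketch, writing to an in-memory buffer instead of sys.stdout:

```python
import io
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [1, 5], 'b': [2.0, np.nan], 'message': ['hello', None]})

buf = io.StringIO()
# na_rep fills missing values, index=False drops row labels,
# columns selects and orders the output columns
data.to_csv(buf, na_rep='NULL', index=False, columns=['a', 'b'])
```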
Series also has a to_csv method:
dates = pd.date_range('1/1/2000', periods=7)
import numpy as np
ts = pd.Series(np.arange(7), index=dates)
ts.to_csv('/Users/sergulaydore/pydata-book/ch06/tseries.csv')
!cat /Users/sergulaydore/pydata-book/ch06/tseries.csv
2000-01-01,0 2000-01-02,1 2000-01-03,2 2000-01-04,3 2000-01-05,4 2000-01-06,5 2000-01-07,6
You can read a CSV version of a Series with read_csv, but there is also a from_csv convenience method that makes it a bit simpler:
pd.Series.from_csv('/Users/sergulaydore/pydata-book/ch06/tseries.csv', parse_dates = True)
2000-01-01 0 2000-01-02 1 2000-01-03 2 2000-01-04 3 2000-01-05 4 2000-01-06 5 2000-01-07 6 dtype: int64
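A round-trip sketch that does not touch disk (note: Series.from_csv was removed in later pandas; read_csv with index_col and parse_dates, taking the single data column, is the equivalent):

```python
import io
import numpy as np
import pandas as pd

dates = pd.date_range('1/1/2000', periods=3)
ts = pd.Series(np.arange(3), index=dates)

buf = io.StringIO()
ts.to_csv(buf, header=False)
buf.seek(0)

# Read back: first column is the (parsed) date index, second is the values
ts2 = pd.read_csv(buf, header=None, index_col=0, parse_dates=True).iloc[:, 0]
```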
help(pd.Series.from_csv)
Help on method from_csv in module pandas.core.series: from_csv(cls, path, sep=',', parse_dates=True, header=None, index_col=0, encoding=None, infer_datetime_format=False) method of __builtin__.type instance Read delimited file into Series Parameters ---------- path : string file path or file handle / StringIO sep : string, default ',' Field delimiter parse_dates : boolean, default True Parse dates. Different default from read_table header : int, default 0 Row to use at header (skip prior rows) index_col : int or sequence, default 0 Column to use for index. If a sequence is given, a MultiIndex is used. Different default from read_table encoding : string, optional a string representing the encoding to use if the contents are non-ascii, for python versions prior to 3 infer_datetime_format: boolean, default False If True and `parse_dates` is True for a column, try to infer the datetime format based on the first datetime string. If the format can be inferred, there often will be a large parsing speed-up. Returns ------- y : Series
!cat /Users/sergulaydore/pydata-book/ch06/ex7.csv
"a","b","c" "1","2","3" "1","2","3","4"
For any file with a single-character delimiter, you can use Python's built-in csv module.
import csv
f = open('/Users/sergulaydore/pydata-book/ch06/ex7.csv')
reader = csv.reader(f)
Iterating through the reader like a file yields lists of values in each line, with any quote characters removed:
for line in reader:
    print line
['a', 'b', 'c'] ['1', '2', '3'] ['1', '2', '3', '4']
From there, it's up to you to do the wrangling necessary to put the data in the form that you need it. For example:
lines = list(csv.reader(open('/Users/sergulaydore/pydata-book/ch06/ex7.csv')))
lines
[['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3', '4']]
header, values = lines[0], lines[1:]
header
['a', 'b', 'c']
values
[['1', '2', '3'], ['1', '2', '3', '4']]
data_dict = {h: v for h, v in zip(header, zip(*values))}
data_dict
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
CSV files come in many different flavors. Defining a new format with a different delimiter, string quoting convention, or line terminator is done by defining a simple subclass of csv.Dialect:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_ALL  # same as quoting=1
reader = csv.reader(f, dialect = my_dialect)
Individual CSV dialect parameters can also be given as keywords to csv.reader without having to define a subclass:
reader = csv.reader(f, delimiter='|')
To write delimited files manually, you can use csv.writer.
with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))
    writer.writerow(('4', '5', '6'))
    writer.writerow(('7', '8', '9'))
!cat mydata.csv
"one";"two";"three" "1";"2";"3" "4";"5";"6" "7";"8";"9"
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data via HTTP between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV. Here is an example:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings":[{"name": "Scott", "age": 25, "pet":"Zuko"},
{"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""
To convert a JSON string to Python form, use json.loads:
import json
result = json.loads(obj)
result
{u'name': u'Wes', u'pet': None, u'places_lived': [u'United States', u'Spain', u'Germany'], u'siblings': [{u'age': 25, u'name': u'Scott', u'pet': u'Zuko'}, {u'age': 33, u'name': u'Katie', u'pet': u'Cisco'}]}
json.dumps converts a Python object back to JSON:
asjson = json.dumps(result)
You can pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields:
siblings = pd.DataFrame(result['siblings'], columns = ['name','age'])
siblings
name | age | |
---|---|---|
0 | Scott | 25 |
1 | Katie | 33 |
To get started, find the URL you want to extract data from, open it with urllib2 and parse the stream with lxml like so:
from lxml.html import parse
from urllib2 import urlopen
help(parse)
Help on function parse in module lxml.html: parse(filename_or_url, parser=None, base_url=None, **kw) Parse a filename, URL, or file-like object into an HTML document tree. Note: this returns a tree, not an element. Use ``parse(...).getroot()`` to get the document root. You can override the base URL with the ``base_url`` keyword. This is most useful when parsing from a file-like object.
parsed = parse((urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options')))
doc = parsed.getroot()
doc
<Element html at 0x109b62ba8>
Using this object, you can extract all HTML tags of a particular type, such as table tags containing the data of interest. As a simple example, suppose you wanted to get a list of every URL linked to in the document; links are a tags in HTML. Using the document root's findall method along with an XPath (a means of expressing "queries" on the document):
links = doc.findall('.//a')
links[15:20]
[<Element a at 0x106518470>, <Element a at 0x1065184c8>, <Element a at 0x106518520>, <Element a at 0x106518578>, <Element a at 0x1065185d0>]
But these are objects representing HTML elements; to get the URL and link text you have to use each element's get method (for the URL) and text_content method (for the display text):
lnk = links[28]
lnk.get('href')
'https://homes.yahoo.com/own-rent/'
lnk.text_content()
'Homes'
Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
urls[10:]
['https://www.flickr.com/', 'https://mobile.yahoo.com/', 'http://everything.yahoo.com/', 'https://celebrity.yahoo.com/', 'https://www.yahoo.com/movies', 'https://www.yahoo.com/music', 'https://www.yahoo.com/tv', 'https://groups.yahoo.com/', 'https://www.yahoo.com/health', 'https://www.yahoo.com/style', 'https://www.yahoo.com/beauty', 'https://www.yahoo.com/food', 'https://www.yahoo.com/parenting', 'https://www.yahoo.com/makers', 'https://www.yahoo.com/tech', 'http://shopping.yahoo.com/', 'https://www.yahoo.com/travel', 'https://autos.yahoo.com/', 'https://homes.yahoo.com/own-rent/', 'https://www.mozilla.org/firefox/new/?utm_source=yahoo&utm_medium=referral&utm_campaign=y-uh&utm_content=y-install-new-firefox', 'http://finance.yahoo.com', 'https://login.yahoo.com/config/login?.src=quote&.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options', 'https://login.yahoo.com/config/login?.src=quote&.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options', 'https://mail.yahoo.com/?.intl=us&.lang=en-US&.src=ym', 'javascript:void(0);', 'https://edit.yahoo.com/mc2.0/eval_profile?.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options&.src=quote&.intl=us&.lang=en-US', 'https://help.yahoo.com/l/us/yahoo/finance/', 'http://feedback.yahoo.com/forums/207809', 'https://www.yahoo.com/', '/', '/portfolios.html', '/portfolios/manage', '/portfolio/new', '/my-quotes-news/', '/market-overview/', '/stock-center/', '/funds/', '/options/', '/etf/', '/bonds', '/futures', '/currency-investing', 'http://biz.yahoo.com/research/earncal/today.html', '/yahoofinance/', '/yahoofinance/business/', '/yahoofinance/investing', '/yahoofinance/personalfinance', '/blogs/breakout/', '/blogs/cost-of-living/', '/blogs/daily-ticker/', '/blogs/driven/', '/blogs/hot-stock-minute/', '/blogs/just-explain-it/', 'http://finance.yahoo.com/blogs/author/aaron-task/', '/blogs/author/michael-santoli/', '/blogs/author/jeff-macke/', 
'/blogs/author/aaron-pressman/', '/blogs/author/rick-newman/', '/blogs/author/mandi-woodruff/', '/blogs/author/chris-nichols/', '/blogs/the-exchange/', '/blogs/michael-santoli/', 'http://finance.yahoo.com/blogs/author/philip-pearlman/', '/news/', '/corporate-news/', '/economic-policy-news/', '/investing-news/', '/personal-finance/', '/career-education/', '/real-estate/', '/retirement/', '/credit-debt/', '/taxes/', '/autos/', '/lifestyle/', '/videos/', '/rates/', '/calculator/index/', '/personal-finance/tools/', '/cnbc/', '/blogs/big-data-download/', '/blogs/off-the-cuff/', '/blogs/power-pitch/', '/blogs/talking-numbers/', '/blogs/the-biz-fix/', '/blogs/top-best-most/', '/contributors/', 'http://finance.search.yahoo.com?fr=fin-v1', '/q?s=^DJI', '/q?s=^IXIC', '/q?s=AAPL', '/q/ecn?s=AAPL+Order+Book', '/q/op?s=AAPL+Options', '/q/hp?s=AAPL+Historical+Prices', '/echarts?s=AAPL+Interactive', '/q/h?s=AAPL+Headlines', '/q/p?s=AAPL+Press+Releases', '/q/ce?s=AAPL+Company+Events', '/mb?s=AAPL', '/marketpulse/?s=AAPL', '/q/pr?s=AAPL+Profile', '/q/ks?s=AAPL+Key+Statistics', '/q/sec?s=AAPL+SEC+Filings', '/q/co?s=AAPL+Competitors', '/q/in?s=AAPL+Industry', '/q/ct?s=AAPL+Components', '/q/ao?s=AAPL+Analyst+Opinion', '/q/ae?s=AAPL+Analyst+Estimates', '/q/mh?s=AAPL+Major+Holders', '/q/it?s=AAPL+Insider+Transactions', '/q/ir?s=AAPL+Insider+Roster', '/q/is?s=AAPL+Income+Statement', '/q/bs?s=AAPL+Balance+Sheet', '/q/cf?s=AAPL+Cash+Flow', 'https://mobile.yahoo.com/finance/?src=gta', '/q/op?s=AAPL&date=1426204800', '/q/op?s=AAPL&straddle=true&date=1426204800', None, None, '/q/op?s=AAPL&strike=85.00', '/q?s=AAPL150313C00085000', '/q/op?s=AAPL&strike=90.00', '/q?s=AAPL150313C00090000', '/q/op?s=AAPL&strike=95.00', '/q?s=AAPL150313C00095000', '/q/op?s=AAPL&strike=100.00', '/q?s=AAPL150313C00100000', '/q/op?s=AAPL&strike=101.00', '/q?s=AAPL150313C00101000', '/q/op?s=AAPL&strike=102.00', '/q?s=AAPL150313C00102000', '/q/op?s=AAPL&strike=103.00', '/q?s=AAPL150313C00103000', 
'/q/op?s=AAPL&strike=105.00', '/q?s=AAPL150313C00105000', '/q/op?s=AAPL&strike=106.00', '/q?s=AAPL150313C00106000', '/q/op?s=AAPL&strike=107.00', '/q?s=AAPL150313C00107000', '/q/op?s=AAPL&strike=108.00', '/q?s=AAPL150313C00108000', '/q/op?s=AAPL&strike=109.00', '/q?s=AAPL150313C00109000', '/q/op?s=AAPL&strike=110.00', '/q?s=AAPL150313C00110000', '/q/op?s=AAPL&strike=111.00', '/q?s=AAPL150313C00111000', '/q/op?s=AAPL&strike=112.00', '/q?s=AAPL150313C00112000', '/q/op?s=AAPL&strike=113.00', '/q?s=AAPL150313C00113000', '/q/op?s=AAPL&strike=114.00', '/q?s=AAPL150313C00114000', '/q/op?s=AAPL&strike=115.00', '/q?s=AAPL150313C00115000', '/q/op?s=AAPL&strike=116.00', '/q?s=AAPL150313C00116000', '/q/op?s=AAPL&strike=117.00', '/q?s=AAPL150313C00117000', '/q/op?s=AAPL&strike=118.00', '/q?s=AAPL150313C00118000', '/q/op?s=AAPL&strike=119.00', '/q?s=AAPL150313C00119000', '/q/op?s=AAPL&strike=120.00', '/q?s=AAPL150313C00120000', '/q/op?s=AAPL&strike=121.00', '/q?s=AAPL150313C00121000', '/q/op?s=AAPL&strike=122.00', '/q?s=AAPL150313C00122000', '/q/op?s=AAPL&strike=123.00', '/q?s=AAPL150313C00123000', '/q/op?s=AAPL&strike=124.00', '/q?s=AAPL150313C00124000', '/q/op?s=AAPL&strike=125.00', '/q?s=AAPL150313C00125000', '/q/op?s=AAPL&strike=126.00', '/q?s=AAPL150313C00126000', '/q/op?s=AAPL&strike=127.00', '/q?s=AAPL150313C00127000', '/q/op?s=AAPL&strike=128.00', '/q?s=AAPL150313C00128000', '/q/op?s=AAPL&strike=129.00', '/q?s=AAPL150313C00129000', '/q/op?s=AAPL&strike=130.00', '/q?s=AAPL150313C00130000', '/q/op?s=AAPL&strike=131.00', '/q?s=AAPL150313C00131000', '/q/op?s=AAPL&strike=132.00', '/q?s=AAPL150313C00132000', '/q/op?s=AAPL&strike=133.00', '/q?s=AAPL150313C00133000', '/q/op?s=AAPL&strike=134.00', '/q?s=AAPL150313C00134000', '/q/op?s=AAPL&strike=135.00', '/q?s=AAPL150313C00135000', '/q/op?s=AAPL&strike=136.00', '/q?s=AAPL150313C00136000', '/q/op?s=AAPL&strike=137.00', '/q?s=AAPL150313C00137000', '/q/op?s=AAPL&strike=138.00', '/q?s=AAPL150313C00138000', 
'/q/op?s=AAPL&strike=139.00', '/q?s=AAPL150313C00139000', '/q/op?s=AAPL&strike=140.00', '/q?s=AAPL150313C00140000', '/q/op?s=AAPL&strike=141.00', '/q?s=AAPL150313C00141000', '/q/op?s=AAPL&strike=142.00', '/q?s=AAPL150313C00142000', '/q/op?s=AAPL&strike=143.00', '/q?s=AAPL150313C00143000', '/q/op?s=AAPL&strike=144.00', '/q?s=AAPL150313C00144000', '/q/op?s=AAPL&strike=145.00', '/q?s=AAPL150313C00145000', '/q/op?s=AAPL&strike=146.00', '/q?s=AAPL150313C00146000', '/q/op?s=AAPL&strike=150.00', '/q?s=AAPL150313C00150000', None, None, '/q/op?s=AAPL&strike=85.00', '/q?s=AAPL150313P00085000', '/q/op?s=AAPL&strike=90.00', '/q?s=AAPL150313P00090000', '/q/op?s=AAPL&strike=95.00', '/q?s=AAPL150313P00095000', '/q/op?s=AAPL&strike=100.00', '/q?s=AAPL150313P00100000', '/q/op?s=AAPL&strike=101.00', '/q?s=AAPL150313P00101000', '/q/op?s=AAPL&strike=102.00', '/q?s=AAPL150313P00102000', '/q/op?s=AAPL&strike=103.00', '/q?s=AAPL150313P00103000', '/q/op?s=AAPL&strike=104.00', '/q?s=AAPL150313P00104000', '/q/op?s=AAPL&strike=105.00', '/q?s=AAPL150313P00105000', '/q/op?s=AAPL&strike=106.00', '/q?s=AAPL150313P00106000', '/q/op?s=AAPL&strike=107.00', '/q?s=AAPL150313P00107000', '/q/op?s=AAPL&strike=108.00', '/q?s=AAPL150313P00108000', '/q/op?s=AAPL&strike=109.00', '/q?s=AAPL150313P00109000', '/q/op?s=AAPL&strike=110.00', '/q?s=AAPL150313P00110000', '/q/op?s=AAPL&strike=111.00', '/q?s=AAPL150313P00111000', '/q/op?s=AAPL&strike=112.00', '/q?s=AAPL150313P00112000', '/q/op?s=AAPL&strike=113.00', '/q?s=AAPL150313P00113000', '/q/op?s=AAPL&strike=114.00', '/q?s=AAPL150313P00114000', '/q/op?s=AAPL&strike=115.00', '/q?s=AAPL150313P00115000', '/q/op?s=AAPL&strike=116.00', '/q?s=AAPL150313P00116000', '/q/op?s=AAPL&strike=117.00', '/q?s=AAPL150313P00117000', '/q/op?s=AAPL&strike=118.00', '/q?s=AAPL150313P00118000', '/q/op?s=AAPL&strike=119.00', '/q?s=AAPL150313P00119000', '/q/op?s=AAPL&strike=120.00', '/q?s=AAPL150313P00120000', '/q/op?s=AAPL&strike=121.00', '/q?s=AAPL150313P00121000', 
'/q/op?s=AAPL&strike=122.00', '/q?s=AAPL150313P00122000', '/q/op?s=AAPL&strike=123.00', '/q?s=AAPL150313P00123000', '/q/op?s=AAPL&strike=124.00', '/q?s=AAPL150313P00124000', '/q/op?s=AAPL&strike=125.00', '/q?s=AAPL150313P00125000', '/q/op?s=AAPL&strike=126.00', '/q?s=AAPL150313P00126000', '/q/op?s=AAPL&strike=127.00', '/q?s=AAPL150313P00127000', '/q/op?s=AAPL&strike=128.00', '/q?s=AAPL150313P00128000', '/q/op?s=AAPL&strike=129.00', '/q?s=AAPL150313P00129000', '/q/op?s=AAPL&strike=130.00', '/q?s=AAPL150313P00130000', '/q/op?s=AAPL&strike=131.00', '/q?s=AAPL150313P00131000', '/q/op?s=AAPL&strike=132.00', '/q?s=AAPL150313P00132000', '/q/op?s=AAPL&strike=133.00', '/q?s=AAPL150313P00133000', '/q/op?s=AAPL&strike=134.00', '/q?s=AAPL150313P00134000', '/q/op?s=AAPL&strike=135.00', '/q?s=AAPL150313P00135000', '/q/op?s=AAPL&strike=136.00', '/q?s=AAPL150313P00136000', '/q/op?s=AAPL&strike=137.00', '/q?s=AAPL150313P00137000', '/q/op?s=AAPL&strike=138.00', '/q?s=AAPL150313P00138000', '/q/op?s=AAPL&strike=139.00', '/q?s=AAPL150313P00139000', '/q/op?s=AAPL&strike=140.00', '/q?s=AAPL150313P00140000', '/q/op?s=AAPL&strike=141.00', '/q?s=AAPL150313P00141000', '/q/op?s=AAPL&strike=142.00', '/q?s=AAPL150313P00142000', '/q/op?s=AAPL&strike=143.00', '/q?s=AAPL150313P00143000', '/q/op?s=AAPL&strike=144.00', '/q?s=AAPL150313P00144000', '/q/op?s=AAPL&strike=145.00', '/q?s=AAPL150313P00145000', '/q/op?s=AAPL&strike=150.00', '/q?s=AAPL150313P00150000']
Now, finding the right tables in the document can be a matter of trial and error; some websites make it easier by giving a table of interest an id attribute. These were the two tables containing the call data and put data, respectively:
tables = doc.findall('.//table')
calls = tables[9]
puts = tables[13]
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) <ipython-input-129-c2cad2b3938b> in <module>() ----> 1 calls = tables[9] 2 puts = tables[13] IndexError: list index out of range
The page now has fewer tables than when the book was written, so we fall back to inspecting the first few:
t0 = tables[0]
t1 = tables[1]
t2 = tables[2]
Each table has a header row followed by each of the data rows.
rows0 = t0.findall('.//tr')
rows1 = t1.findall('.//tr')
rows2 = t2.findall('.//tr')
For the header as well as the data rows, we want to extract the text from each cell; for the header these are the th cells, and for the data rows the td cells.
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]
_unpack(rows2[2], kind='td')
['\r\n 85.00\r\n ', '\r\n AAPL150313P00085000\r\n ', '\r\n 0.01\r\n ', '\r\n 0.00\r\n ', '\r\n 0.01\r\n ', '\r\n 0.00\r\n ', '\r\n \r\n \r\n 0.00%\r\n \r\n \r\n \r\n ', '\r\n 3\r\n ', '\r\n 3\r\n ', '\r\n 237.50%\r\n ']
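The cell-extraction idea can also be sketched with only the standard library's xml.etree.ElementTree, on a tiny well-formed stand-in table (the Yahoo page above is long gone; lxml's text_content becomes plain .text here):

```python
import xml.etree.ElementTree as ET

# Tiny invented stand-in for one of the option tables above
html = """
<table>
  <tr><th>Strike</th><th>Last</th></tr>
  <tr><td>85.00</td><td>0.01</td></tr>
  <tr><td>90.00</td><td>0.01</td></tr>
</table>
"""

table = ET.fromstring(html)

def _unpack(row, kind='td'):
    # Gather the text of every <th> or <td> cell in a row
    return [cell.text for cell in row.findall('.//%s' % kind)]

rows = table.findall('.//tr')
header = _unpack(rows[0], kind='th')
data = [_unpack(r) for r in rows[1:]]
```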
Now it is a matter of combining all these steps to convert this data into a DataFrame. Since the numerical data is still in string format, we want to convert some, but perhaps not all, of the columns to floating point. pandas has a class TextParser that is used internally in read_csv and the other parsing functions to do the appropriate automatic type conversion:
from pandas.io.parsers import TextParser
def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()
Finally, we invoke this parsing function on the lxml table objects and get DataFrame results:
call_data = parse_options_data(t2)
call_data.head()
Strike | Contract Name | Last | Bid | Ask | Change | %Change | Volume | Open Interest | Implied Volatility | |
---|---|---|---|---|---|---|---|---|---|---|
0 | \r\n \r\n ... | None | NaN | NaN | NaN | NaN | None | NaN | NaN | None |
1 | \r\n 85.00\r\n | \r\n AAPL150313P00085000\r\n | 0.01 | 0 | 0.01 | 0 | \r\n \r\n \r\n ... | 3 | 3 | \r\n 237.50%\r\n |
2 | \r\n 90.00\r\n | \r\n AAPL150313P00090000\r\n | 0.01 | 0 | 0.01 | 0 | \r\n \r\n \r\n ... | 2 | 2 | \r\n 206.25%\r\n |
3 | \r\n 95.00\r\n | \r\n AAPL150313P00095000\r\n | 0.01 | 0 | 0.01 | 0 | \r\n \r\n \r\n ... | 10 | 6061 | \r\n 168.75%\r\n |
4 | \r\n 100.00\r\n | \r\n AAPL150313P00100000\r\n | 0.01 | 0 | 0.01 | 0 | \r\n \r\n \r\n ... | 2 | 674 | \r\n 140.63%\r\n |
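To see the type conversion TextParser performs in isolation, here is a minimal, self-contained sketch on invented toy rows (the header and values below are made up; the import location is the one used above):

```python
from pandas.io.parsers import TextParser

# Toy rows mimicking text scraped from an HTML table: everything is a string
header = ['Strike', 'Symbol', 'Last']
data = [['85.00', 'AAPL150313P00085000', '0.01'],
        ['90.00', 'AAPL150313P00090000', '0.02']]

# TextParser applies the same automatic type inference as read_csv:
# numeric-looking columns become floats, the rest stay as strings
parsed = TextParser(data, names=header).get_chunk()
print(parsed.dtypes)
```

Strike and Last come out as float64, while Symbol remains an object (string) column.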
XML (extensible markup language) is another common structured data format supporting hierarchical, nested data with metadata.
The New York MTA publishes a number of data series about its bus and train services. Here we look at the performance data, which is contained in a set of XML files. Each train or bus service has a different file containing monthly data as a series of XML records.
Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:
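The MTA file itself isn't reproduced here, but the parsing pattern can be sketched with the standard library's xml.etree.ElementTree on an invented sample record (lxml.objectify exposes a similar getroot interface, with attribute-style access on top):

```python
from io import StringIO
import xml.etree.ElementTree as ET

# Invented XML mimicking a single INDICATOR record from the MTA performance files
xml_text = """<PERFORMANCE>
  <INDICATOR>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <INDICATOR_NAME>On-Time Performance (West of Hudson)</INDICATOR_NAME>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
    <MONTHLY_ACTUAL>96.9</MONTHLY_ACTUAL>
  </INDICATOR>
</PERFORMANCE>"""

root = ET.parse(StringIO(xml_text)).getroot()

# Collect each record's child tags into a dict, one dict per INDICATOR element
records = []
for ind in root.findall('INDICATOR'):
    records.append({child.tag: child.text for child in ind})
print(records[0]['AGENCY_NAME'])  # Metro-North Railroad
```

A list of dicts like this can be passed straight to the DataFrame constructor.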
One of the easiest ways to store data efficiently in binary format is using Python's built-in pickle serialization. pandas objects have a to_pickle method which writes the data to disk as a pickle:
!pwd
/Users/sergulaydore/Documents/Face_Car/python_scripts/Python_for_Data_Analysis
%cd /Users/sergulaydore/pydata-book/ch06
/Users/sergulaydore/pydata-book/ch06
frame = pd.read_csv('./ex1.csv')
frame
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
frame.to_pickle('./frame_pickle')
pd.read_pickle('./frame_pickle')
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
Note: pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.
There are a number of tools that facilitate efficiently reading and writing large amounts of scientific data in binary format on disk. A popular industry-grade library for this is HDF5, a C library with interfaces in many other languages such as Java, Python, and MATLAB. The "HDF" in HDF5 stands for hierarchical data format. Each HDF5 file contains an internal filesystem-like node structure, enabling you to store multiple datasets along with supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compressors, enabling data with repeated patterns to be stored more efficiently. For very large datasets that don't fit into memory, HDF5 is a good choice, as you can efficiently read and write small sections of much larger arrays.
There are not one but two interfaces to the HDF5 library in Python, PyTables and h5py, each of which takes a different approach to the problem. h5py provides a direct, but high-level interface to the HDF5 API, while PyTables abstracts many of the details of HDF5 to provide multiple flexible data containers, table indexing, querying capability, and some support for out-of-core computations.
pandas has a minimal dict-like HDFStore class, which uses PyTables to store pandas objects:
store = pd.HDFStore('./mydata.h5')
store['obj1'] = frame
store['obj1_col'] = frame['a']
store
<class 'pandas.io.pytables.HDFStore'> File path: ./mydata.h5 /obj1 frame (shape->[3,5]) /obj1_col series (shape->[3])
Objects contained in the HDF5 file can be retrieved in a dict-like fashion:
store['obj1']
a | b | c | d | message | |
---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | hello |
1 | 5 | 6 | 7 | 8 | world |
2 | 9 | 10 | 11 | 12 | foo |
pandas also supports reading tabular data stored in Excel files via the ExcelFile class; internally it relies on external packages such as xlrd or openpyxl, which you may need to install separately. Create an ExcelFile instance by passing a file path, then parse a sheet into a DataFrame:
xls_file = pd.ExcelFile('./data.xls')
table = xls_file.parse('Sheet1')
Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method is the requests package. To run a Twitter search, we can make an HTTP GET request like so:
import requests
url = 'https://api.twitter.com/1.1/search/tweets.json?q=%23freebandnames&since_id=24012619984051000&max_id=250126199840518145&result_type=mixed&count=4'
resp = requests.get(url)
resp
<Response [400]>
The response object's text attribute contains the content of the GET query. Many web APIs will return a JSON string that must be loaded into a Python object:
import json
data = json.loads(resp.text)
data.keys()
[u'errors']
The 400 status and the errors key indicate that this Twitter API endpoint requires authenticated requests; the requests_oauthlib package adds OAuth 1 support on top of requests:
import requests
from requests_oauthlib import OAuth1
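Since the live request above fails without credentials, the JSON-to-DataFrame step can be illustrated offline with a canned response string (the payload and field names below are invented for illustration):

```python
import json
import pandas as pd

# A canned JSON payload standing in for an API response body (resp.text)
resp_text = """{"results": [
  {"created_at": "2015-03-01", "text": "python pandas is handy"},
  {"created_at": "2015-03-02", "text": "loading JSON into a DataFrame"}
]}"""

data = json.loads(resp_text)  # JSON string -> Python dict
# A list of dicts converts directly; columns selects and orders the fields
tweets = pd.DataFrame(data['results'], columns=['created_at', 'text'])
print(tweets.shape)  # (2, 2)
```

The same pattern applies to a real response: pass resp.text to json.loads, then hand the relevant list of records to the DataFrame constructor.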
In many applications, data rarely comes from text files, which are a fairly inefficient way to store large amounts of data. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative NoSQL databases have become quite popular. The choice of database usually depends on the performance, data integrity, and scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward.
import sqlite3
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
); """
con = sqlite3.connect(':memory:')
con.execute(query)
<sqlite3.Cursor at 0x10f5845e0>
con.commit()
Then, insert a few rows of data:
data = [ ('Atlanta', 'Georgia', 1.25, 6),
('Tallahassee', 'Florida', 2.6, 3),
('Sacramento','California', 1.7, 5)
]
stmt = "INSERT INTO test VALUES(?,?,?,?)"
con.executemany(stmt, data)
<sqlite3.Cursor at 0x10f5af960>
con.commit()
Most Python SQL drivers return a list of tuples when selecting data from a table:
cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows
[(u'Atlanta', u'Georgia', 1.25, 6), (u'Tallahassee', u'Florida', 2.6, 3), (u'Sacramento', u'California', 1.7, 5)]
You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor's description attribute:
cursor.description
(('a', None, None, None, None, None, None), ('b', None, None, None, None, None, None), ('c', None, None, None, None, None, None), ('d', None, None, None, None, None, None))
pd.DataFrame(rows, columns = list(zip(*cursor.description))[0])
a | b | c | d | |
---|---|---|---|---|
0 | Atlanta | Georgia | 1.25 | 6 |
1 | Tallahassee | Florida | 2.60 | 3 |
2 | Sacramento | California | 1.70 | 5 |
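Extracting the column names from cursor.description can also be written as a plain comprehension, which avoids the zip indexing trick (on Python 3, zip returns an iterator, so the trick additionally needs a list call). A self-contained sketch using an in-memory SQLite table with the same schema as above:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')
con.execute('CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)')
con.executemany('INSERT INTO test VALUES (?,?,?,?)',
                [('Atlanta', 'Georgia', 1.25, 6),
                 ('Tallahassee', 'Florida', 2.6, 3)])

cursor = con.execute('select * from test')
# description is a sequence of 7-tuples; the first item of each is the column name
cols = [desc[0] for desc in cursor.description]
frame = pd.DataFrame(cursor.fetchall(), columns=cols)
print(cols)  # ['a', 'b', 'c', 'd']
```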
This is quite a bit of munging that you'd rather not repeat each time you query the database. pandas has a read_sql function in its pandas.io.sql module (also exposed as pd.read_sql) that simplifies the process: just pass the SELECT statement and the connection object:
import pandas.io.sql as sql
sql.read_sql('select * from test', con)
a | b | c | d | |
---|---|---|---|---|
0 | Atlanta | Georgia | 1.25 | 6 |
1 | Tallahassee | Florida | 2.60 | 3 |
2 | Sacramento | California | 1.70 | 5 |