In [2]:
import pandas as pd

Reading and Writing Data in Text Format

Parsing functions in pandas:

read_csv: Load delimited data from a file, URL, or file-like object. Uses a comma as the default delimiter.

read_table: Load delimited data from a file, URL, or file-like object. Uses tab ('\t') as the default delimiter.

read_fwf: Read data in fixed-width column format (that is, with no delimiters); see the sketch after this list.

read_clipboard: Version of read_table that reads data from the clipboard. Useful for converting tables from web pages.
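
read_fwf is the only one of these not demonstrated below; here is a minimal sketch, assuming a small fixed-width file we create ourselves (the file name and column positions are purely illustrative):

In [ ]:
# write a tiny fixed-width file so the example is self-contained
with open('fwf_example.txt', 'w') as f:
    f.write('id  name    value\n'
            '1   alpha   10.5\n'
            '2   beta     3.2\n')

# colspecs lists the (start, end) character positions of each column
pd.read_fwf('fwf_example.txt', colspecs=[(0, 4), (4, 12), (12, 17)])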

In [1]:
!cat /Users/sergulaydore/pydata-book/ch06/ex1.csv
a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

Since this is comma-delimited, we can use read_csv to read it into a DataFrame:

In [3]:
df = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex1.csv')
In [4]:
df
Out[4]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

We could also have used read_table, specifying the delimiter:

In [5]:
pd.read_table('/Users/sergulaydore/pydata-book/ch06/ex1.csv', sep = ',')
Out[5]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

A file will not always have a header row:

In [6]:
!cat /Users/sergulaydore/pydata-book/ch06/ex2.csv
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

You can allow pandas to assign default column names:

In [7]:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', header = None)
Out[7]:
0 1 2 3 4
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

or you can specify names yourself:

In [8]:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', names = ['a','b', 'c', 'd', 'message'])
Out[8]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Suppose you wanted the message column to be the index of the returned DataFrame:

In [9]:
names = ['a', 'b', 'c', 'd', 'message']
In [10]:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex2.csv', names = names, index_col = 'message')
Out[10]:
a b c d
message
hello 1 2 3 4
world 5 6 7 8
foo 9 10 11 12

To form a hierarchical index from multiple columns, pass a list of column names or numbers:

In [13]:
!cat /Users/sergulaydore/pydata-book/ch06/csv_mindex.csv
key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
In [14]:
parsed = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/csv_mindex.csv', index_col = ['key1','key2'])
In [15]:
parsed
Out[15]:
value1 value2
key1 key2
one a 1 2
b 3 4
c 5 6
d 7 8
two a 9 10
b 11 12
c 13 14
d 15 16

In some cases, a table might not have a fixed delimiter, using whitespace or some other pattern to separate fields:

In [17]:
list(open('/Users/sergulaydore/pydata-book/ch06/ex3.txt')) 
Out[17]:
['            A         B         C\n',
 'aaa -0.264438 -1.026059 -0.619500\n',
 'bbb  0.927272  0.302904 -0.032399\n',
 'ccc -0.264273 -0.386314 -0.217601\n',
 'ddd -0.871858 -0.348382  1.100491\n']

In this case, fields are separated by a variable amount of whitespace. This can be expressed by the regular expression \s+.

In [18]:
result = pd.read_table('/Users/sergulaydore/pydata-book/ch06/ex3.txt', sep = '\s+')
In [19]:
result
Out[19]:
A B C
aaa -0.264438 -1.026059 -0.619500
bbb 0.927272 0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382 1.100491

You can skip the first, third and fourth rows of a file with skiprows:

In [20]:
!cat /Users/sergulaydore/pydata-book/ch06/ex4.csv
# hey!
a,b,c,d,message
# just wanted to make things more difficult for you
# who reads CSV files with computers, anyway?
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo
In [21]:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex4.csv', skiprows = [0,2,3])
Out[21]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Missing data is usually either not present (empty string) or marked by some sentinel value. By default, pandas uses a set of commonly occurring sentinels, such as NA, -1.#IND, and NULL:

In [23]:
!cat /Users/sergulaydore/pydata-book/ch06/ex5.csv
something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo
In [24]:
result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv')
In [25]:
result
Out[25]:
something a b c d message
0 one 1 2 3 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11 12 foo
In [26]:
pd.isnull(result)
Out[26]:
something a b c d message
0 False False False False False True
1 False False False True False False
2 False False False False False False

The na_values option can take either a list or set of strings to consider missing values:

In [28]:
result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv', na_values = ['NULL'])
In [29]:
result
Out[29]:
something a b c d message
0 one 1 2 3 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11 12 foo

Different NA sentinels can be specified for each column in a dict:

In [30]:
sentinels = {'message':['foo', 'NA'], 'something': ['two']}
In [31]:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv', na_values = sentinels)
Out[31]:
something a b c d message
0 one 1 2 3 4 NaN
1 NaN 5 6 NaN 8 world
2 three 9 10 11 12 NaN

Reading Text Files in Pieces

In [32]:
result = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv')
In [33]:
result
Out[33]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
5 1.817480 0.742273 0.419395 -2.251035 Q
6 -0.776764 0.935518 -0.332872 -1.875641 U
7 -0.913135 1.530624 -0.572657 0.477252 K
8 0.358480 -0.497572 -0.367016 0.507702 S
9 -1.740877 -1.160417 -1.637830 2.172201 G
10 0.240564 -0.328249 1.252155 1.072796 8
11 0.764018 1.165476 -0.639544 1.495258 R
12 0.571035 -0.310537 0.582437 -0.298765 1
13 2.317658 0.430710 -1.334216 0.199679 P
14 1.547771 -1.119753 -2.277634 0.329586 J
15 -1.310608 0.401719 -1.000987 1.156708 E
16 -0.088496 0.634712 0.153324 0.415335 B
17 -0.018663 -0.247487 -1.446522 0.750938 A
18 -0.070127 -1.579097 0.120892 0.671432 F
19 -0.194678 -0.492039 2.359605 0.319810 H
20 -0.248618 0.868707 -0.492226 -0.717959 W
21 -1.091549 -0.867110 -0.647760 -0.832562 C
22 0.641404 -0.138822 -0.621963 -0.284839 C
23 1.216408 0.992687 0.165162 -0.069619 V
24 -0.564474 0.792832 0.747053 0.571675 I
25 1.759879 -0.515666 -0.230481 1.362317 S
26 0.126266 0.309281 0.382820 -0.239199 L
27 1.334360 -0.100152 -0.840731 -0.643967 6
28 -0.737620 0.278087 -0.053235 -0.950972 J
29 -1.148486 -0.986292 -0.144963 0.124362 Y
... ... ... ... ... ...
9970 0.633495 -0.186524 0.927627 0.143164 4
9971 0.308636 -0.112857 0.762842 -1.072977 1
9972 -1.627051 -0.978151 0.154745 -1.229037 Z
9973 0.314847 0.097989 0.199608 0.955193 P
9974 1.666907 0.992005 0.496128 -0.686391 S
9975 0.010603 0.708540 -1.258711 0.226541 K
9976 0.118693 -0.714455 -0.501342 -0.254764 K
9977 0.302616 -2.011527 -0.628085 0.768827 H
9978 -0.098572 1.769086 -0.215027 -0.053076 A
9979 -0.019058 1.964994 0.738538 -0.883776 F
9980 -0.595349 0.001781 -1.423355 -1.458477 M
9981 1.392170 -1.396560 -1.425306 -0.847535 H
9982 -0.896029 -0.152287 1.924483 0.365184 6
9983 -2.274642 -0.901874 1.500352 0.996541 N
9984 -0.301898 1.019906 1.102160 2.624526 I
9985 -2.548389 -0.585374 1.496201 -0.718815 D
9986 -0.064588 0.759292 -1.568415 -0.420933 E
9987 -0.143365 -1.111760 -1.815581 0.435274 2
9988 -0.070412 -1.055921 0.338017 -0.440763 X
9989 0.649148 0.994273 -1.384227 0.485120 Q
9990 -0.370769 0.404356 -1.051628 -1.050899 8
9991 -0.409980 0.155627 -0.818990 1.277350 W
9992 0.301214 -1.111203 0.668258 0.671922 A
9993 1.821117 0.416445 0.173874 0.505118 X
9994 0.068804 1.322759 0.802346 0.223618 H
9995 2.311896 -0.417070 -1.409599 -0.515821 L
9996 -0.479893 -0.650419 0.745152 -0.646038 E
9997 0.523331 0.787112 0.486066 1.093156 K
9998 -0.362559 0.598894 -1.843201 0.887292 G
9999 -0.096376 -1.012999 -0.657431 -0.573315 0

10000 rows × 5 columns

If you want to read only a small number of rows (avoiding reading the entire file), specify that with nrows:

In [34]:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv', nrows = 5)
Out[34]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q

To read a file in pieces, specify a chunksize as a number of rows:

In [35]:
chunker = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv', chunksize = 1000)
In [36]:
chunker
Out[36]:
<pandas.io.parsers.TextFileReader at 0x109ad6bd0>
In [38]:
next(iter(chunker))
Out[38]:
one two three four key
0 0.467976 -0.038649 -0.295344 -1.824726 L
1 -0.358893 1.404453 0.704965 -0.200638 B
2 -0.501840 0.659254 -0.421691 -0.057688 G
3 0.204886 1.074134 1.388361 -0.982404 R
4 0.354628 -0.133116 0.283763 -0.837063 Q
5 1.817480 0.742273 0.419395 -2.251035 Q
6 -0.776764 0.935518 -0.332872 -1.875641 U
7 -0.913135 1.530624 -0.572657 0.477252 K
8 0.358480 -0.497572 -0.367016 0.507702 S
9 -1.740877 -1.160417 -1.637830 2.172201 G
10 0.240564 -0.328249 1.252155 1.072796 8
11 0.764018 1.165476 -0.639544 1.495258 R
12 0.571035 -0.310537 0.582437 -0.298765 1
13 2.317658 0.430710 -1.334216 0.199679 P
14 1.547771 -1.119753 -2.277634 0.329586 J
15 -1.310608 0.401719 -1.000987 1.156708 E
16 -0.088496 0.634712 0.153324 0.415335 B
17 -0.018663 -0.247487 -1.446522 0.750938 A
18 -0.070127 -1.579097 0.120892 0.671432 F
19 -0.194678 -0.492039 2.359605 0.319810 H
20 -0.248618 0.868707 -0.492226 -0.717959 W
21 -1.091549 -0.867110 -0.647760 -0.832562 C
22 0.641404 -0.138822 -0.621963 -0.284839 C
23 1.216408 0.992687 0.165162 -0.069619 V
24 -0.564474 0.792832 0.747053 0.571675 I
25 1.759879 -0.515666 -0.230481 1.362317 S
26 0.126266 0.309281 0.382820 -0.239199 L
27 1.334360 -0.100152 -0.840731 -0.643967 6
28 -0.737620 0.278087 -0.053235 -0.950972 J
29 -1.148486 -0.986292 -0.144963 0.124362 Y
... ... ... ... ... ...
970 0.633495 -0.186524 0.927627 0.143164 P
971 0.308636 -0.112857 0.762842 -1.072977 X
972 -1.627051 -0.978151 0.154745 -1.229037 O
973 0.314847 0.097989 0.199608 0.955193 R
974 1.666907 0.992005 0.496128 -0.686391 W
975 0.010603 0.708540 -1.258711 0.226541 O
976 0.118693 -0.714455 -0.501342 -0.254764 S
977 0.302616 -2.011527 -0.628085 0.768827 R
978 -0.098572 1.769086 -0.215027 -0.053076 G
979 -0.019058 1.964994 0.738538 -0.883776 X
980 -0.595349 0.001781 -1.423355 -1.458477 O
981 1.392170 -1.396560 -1.425306 -0.847535 K
982 -0.896029 -0.152287 1.924483 0.365184 M
983 -2.274642 -0.901874 1.500352 0.996541 Q
984 -0.301898 1.019906 1.102160 2.624526 Q
985 -2.548389 -0.585374 1.496201 -0.718815 R
986 -0.064588 0.759292 -1.568415 -0.420933 N
987 -0.143365 -1.111760 -1.815581 0.435274 C
988 -0.070412 -1.055921 0.338017 -0.440763 X
989 0.649148 0.994273 -1.384227 0.485120 I
990 -0.370769 0.404356 -1.051628 -1.050899 V
991 -0.409980 0.155627 -0.818990 1.277350 4
992 0.301214 -1.111203 0.668258 0.671922 Q
993 1.821117 0.416445 0.173874 0.505118 O
994 0.068804 1.322759 0.802346 0.223618 W
995 2.311896 -0.417070 -1.409599 -0.515821 M
996 -0.479893 -0.650419 0.745152 -0.646038 H
997 0.523331 0.787112 0.486066 1.093156 D
998 -0.362559 0.598894 -1.843201 0.887292 W
999 -0.096376 -1.012999 -0.657431 -0.573315 K

1000 rows × 5 columns

The TextFileReader object returned by read_csv allows you to iterate over the parts of the file according to the chunksize. For example, we can iterate over ex6.csv, aggregating the value counts in the 'key' column like so:

In [41]:
tot = pd.Series([])
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value = 0)
In [42]:
tot = tot.order(ascending = False)
In [43]:
tot[:10]
Out[43]:
E    336
X    327
L    315
M    309
K    303
Q    301
O    299
P    299
J    298
F    295
dtype: float64
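
Note that Series.order was deprecated in pandas 0.17 in favor of sort_values; on newer versions the same aggregation would look like this sketch:

In [ ]:
chunker = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex6.csv',
                      chunksize=1000)
tot = pd.Series([], dtype=float)  # explicit dtype avoids a warning on recent pandas
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
tot = tot.sort_values(ascending=False)  # replaces tot.order(ascending=False)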

Writing Data Out to Text Format

Data can also be exported to a delimited format.

In [47]:
data = pd.read_csv('/Users/sergulaydore/pydata-book/ch06/ex5.csv')
In [48]:
data
Out[48]:
something a b c d message
0 one 1 2 3 4 NaN
1 two 5 6 NaN 8 world
2 three 9 10 11 12 foo

We can write data out to a comma-separated file:

In [49]:
data.to_csv('/Users/sergulaydore/pydata-book/ch06/out.csv')
In [50]:
!cat /Users/sergulaydore/pydata-book/ch06/out.csv
,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo

Other delimiters can be used, of course (here writing to sys.stdout so it just prints the text result):

In [52]:
import sys
data.to_csv(sys.stdout, sep = '|')
|something|a|b|c|d|message
0|one|1|2|3.0|4|
1|two|5|6||8|world
2|three|9|10|11.0|12|foo

Missing values appear as empty strings in the output. You might want to denote them by some other sentinel value:

In [53]:
data.to_csv(sys.stdout, na_rep = 'NULL')
,something,a,b,c,d,message
0,one,1,2,3.0,4,NULL
1,two,5,6,NULL,8,world
2,three,9,10,11.0,12,foo

With no other options specified, both the row and column labels are written. Both of these can be disabled:

In [54]:
data.to_csv(sys.stdout, index=False, header = False)
one,1,2,3.0,4,
two,5,6,,8,world
three,9,10,11.0,12,foo

You can also write only a subset of the columns, and in an order of your choosing:

In [56]:
data.to_csv(sys.stdout, index = False, columns = ['a', 'b', 'c'])
a,b,c
1,2,3.0
5,6,
9,10,11.0

Series also has a to_csv method:

In [57]:
dates = pd.date_range('1/1/2000', periods=7)
In [60]:
import numpy as np
ts = pd.Series(np.arange(7), index=dates)
In [61]:
ts.to_csv('/Users/sergulaydore/pydata-book/ch06/tseries.csv')
In [62]:
!cat /Users/sergulaydore/pydata-book/ch06/tseries.csv
2000-01-01,0
2000-01-02,1
2000-01-03,2
2000-01-04,3
2000-01-05,4
2000-01-06,5
2000-01-07,6

You can read a CSV version of a Series with read_csv, but there is also a from_csv convenience method that makes it a bit simpler:

In [68]:
pd.Series.from_csv('/Users/sergulaydore/pydata-book/ch06/tseries.csv', parse_dates = True)
Out[68]:
2000-01-01    0
2000-01-02    1
2000-01-03    2
2000-01-04    3
2000-01-05    4
2000-01-06    5
2000-01-07    6
dtype: int64
In [66]:
help(pd.Series.from_csv)
Help on method from_csv in module pandas.core.series:

from_csv(cls, path, sep=',', parse_dates=True, header=None, index_col=0, encoding=None, infer_datetime_format=False) method of __builtin__.type instance
    Read delimited file into Series
    
    Parameters
    ----------
    path : string file path or file handle / StringIO
    sep : string, default ','
        Field delimiter
    parse_dates : boolean, default True
        Parse dates. Different default from read_table
    header : int, default 0
        Row to use at header (skip prior rows)
    index_col : int or sequence, default 0
        Column to use for index. If a sequence is given, a MultiIndex
        is used. Different default from read_table
    encoding : string, optional
        a string representing the encoding to use if the contents are
        non-ascii, for python versions prior to 3
    infer_datetime_format: boolean, default False
        If True and `parse_dates` is True for a column, try to infer the
        datetime format based on the first datetime string. If the format
        can be inferred, there often will be a large parsing speed-up.
    
    Returns
    -------
    y : Series
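
from_csv was itself later deprecated; a sketch of the equivalent read_csv call, assuming a pandas version where read_csv still accepts squeeze (it was accepted through the 1.x series):

In [ ]:
pd.read_csv('/Users/sergulaydore/pydata-book/ch06/tseries.csv',
            header=None, index_col=0, parse_dates=True,
            squeeze=True)  # squeeze=True returns a Series instead of a DataFrame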

Manually Working with Delimited Formats

In [69]:
!cat /Users/sergulaydore/pydata-book/ch06/ex7.csv
"a","b","c"
"1","2","3"
"1","2","3","4"

For any file with a single-character delimiter, you can use Python's built-in csv module.

In [72]:
import csv
f = open('/Users/sergulaydore/pydata-book/ch06/ex7.csv')
reader = csv.reader(f)

Iterating through the reader like a file yields lists of values for each line, with any quote characters removed:

In [73]:
for line in reader:
    print line
['a', 'b', 'c']
['1', '2', '3']
['1', '2', '3', '4']

From there, it's up to you to do the wrangling necessary to put the data in the form that you need it. For example:

In [76]:
lines = list(csv.reader(open('/Users/sergulaydore/pydata-book/ch06/ex7.csv')))
In [77]:
lines
Out[77]:
[['a', 'b', 'c'], ['1', '2', '3'], ['1', '2', '3', '4']]
In [78]:
header, values = lines[0], lines[1:]
In [79]:
header
Out[79]:
['a', 'b', 'c']
In [80]:
values
Out[80]:
[['1', '2', '3'], ['1', '2', '3', '4']]
In [83]:
data_dict = {h: v for h, v in zip(header, zip(*values))}
In [84]:
data_dict
Out[84]:
{'a': ('1', '1'), 'b': ('2', '2'), 'c': ('3', '3')}
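
For simple cases like this, csv.DictReader can pair the header with each data row in one step; a sketch (note that the extra field on the ragged last line lands under the dict's restkey of None rather than being truncated):

In [ ]:
with open('/Users/sergulaydore/pydata-book/ch06/ex7.csv') as f:
    dict_rows = list(csv.DictReader(f))  # header row becomes the dict keys
dict_rows[0]['a']  # -> '1'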

CSV files come in many different flavors. Defining a new format with a different delimiter, string quoting convention, or line terminator is done by defining a simple subclass of csv.Dialect:

In [90]:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_ALL  # same as the integer 1
In [91]:
reader = csv.reader(f, dialect = my_dialect)

Individual CSV dialect parameters can also be given as keywords to csv.reader without having to define a subclass:

In [98]:
reader = csv.reader(f, delimiter='|')

To write delimited files manually, you can use csv.writer.

In [99]:
with open('mydata.csv','w') as f:
    writer = csv.writer(f, dialect=my_dialect)
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1','2','3'))
    writer.writerow(('4','5','6'))
    writer.writerow(('7','8','9'))
In [100]:
!cat mydata.csv
"one";"two";"three"
"1";"2";"3"
"4";"5";"6"
"7";"8";"9"

JSON data

JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV. Here is an example:

In [104]:
obj = """
{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings":[{"name": "Scott", "age": 25, "pet":"Zuko"},
            {"name": "Katie", "age": 33, "pet": "Cisco"}]
}
"""

To convert a JSON string to Python form, use json.loads:

In [105]:
import json
In [106]:
result = json.loads(obj)
In [107]:
result
Out[107]:
{u'name': u'Wes',
 u'pet': None,
 u'places_lived': [u'United States', u'Spain', u'Germany'],
 u'siblings': [{u'age': 25, u'name': u'Scott', u'pet': u'Zuko'},
  {u'age': 33, u'name': u'Katie', u'pet': u'Cisco'}]}

json.dumps converts a Python object back to JSON:

In [108]:
asjson = json.dumps(result)

You can pass a list of JSON objects to the DataFrame constructor and select a subset of the data fields:

In [109]:
siblings = pd.DataFrame(result['siblings'], columns = ['name','age'])
In [110]:
siblings
Out[110]:
name age
0 Scott 25
1 Katie 33
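
Later pandas versions also added to_json and read_json for round-tripping pandas objects directly; a minimal sketch:

In [ ]:
json_str = siblings.to_json()  # serialize the DataFrame to a JSON string
pd.read_json(json_str)         # and parse it back into a DataFrame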

XML and HTML: Web Scraping (note: this part failed when attempting to read tables[9] and tables[13])

To get started, find the URL you want to extract data from, open it with urllib2, and parse the stream with lxml like so:

In [111]:
from lxml.html import parse
from urllib2 import urlopen
In [112]:
help(parse)
Help on function parse in module lxml.html:

parse(filename_or_url, parser=None, base_url=None, **kw)
    Parse a filename, URL, or file-like object into an HTML document
    tree.  Note: this returns a tree, not an element.  Use
    ``parse(...).getroot()`` to get the document root.
    
    You can override the base URL with the ``base_url`` keyword.  This
    is most useful when parsing from a file-like object.

In [113]:
parsed = parse((urlopen('http://finance.yahoo.com/q/op?s=AAPL+Options')))
In [117]:
doc = parsed.getroot()
In [118]:
doc
Out[118]:
<Element html at 0x109b62ba8>

Using this object, you can extract all HTML tags of a particular type, such as table tags containing the data of interest. As a simple example, suppose you wanted to get a list of every URL linked to in the document; links are a tags in HTML. Using the document root's findall method along with an XPath (a means of expressing "queries" on the document):

In [119]:
links = doc.findall('.//a')
In [121]:
links[15:20]
Out[121]:
[<Element a at 0x106518470>,
 <Element a at 0x1065184c8>,
 <Element a at 0x106518520>,
 <Element a at 0x106518578>,
 <Element a at 0x1065185d0>]

But these are objects representing HTML elements; to get the URL and link text, you have to use each element's get method (for the URL) and text_content method (for the display text):

In [122]:
lnk = links[28]
In [123]:
lnk.get('href')
Out[123]:
'https://homes.yahoo.com/own-rent/'
In [124]:
lnk.text_content()
Out[124]:
'Homes'

Thus, getting a list of all URLs in the document is a matter of writing this list comprehension:

In [125]:
urls = [lnk.get('href') for lnk in doc.findall('.//a')]
In [127]:
urls[10:]
Out[127]:
['https://www.flickr.com/',
 'https://mobile.yahoo.com/',
 'http://everything.yahoo.com/',
 'https://celebrity.yahoo.com/',
 'https://www.yahoo.com/movies',
 'https://www.yahoo.com/music',
 'https://www.yahoo.com/tv',
 'https://groups.yahoo.com/',
 'https://www.yahoo.com/health',
 'https://www.yahoo.com/style',
 'https://www.yahoo.com/beauty',
 'https://www.yahoo.com/food',
 'https://www.yahoo.com/parenting',
 'https://www.yahoo.com/makers',
 'https://www.yahoo.com/tech',
 'http://shopping.yahoo.com/',
 'https://www.yahoo.com/travel',
 'https://autos.yahoo.com/',
 'https://homes.yahoo.com/own-rent/',
 'https://www.mozilla.org/firefox/new/?utm_source=yahoo&utm_medium=referral&utm_campaign=y-uh&utm_content=y-install-new-firefox',
 'http://finance.yahoo.com',
 'https://login.yahoo.com/config/login?.src=quote&.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options',
 'https://login.yahoo.com/config/login?.src=quote&.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options',
 'https://mail.yahoo.com/?.intl=us&.lang=en-US&.src=ym',
 'javascript:void(0);',
 'https://edit.yahoo.com/mc2.0/eval_profile?.intl=us&.lang=en-US&.done=http://finance.yahoo.com/q/op%3fs=AAPL%2520Options&.src=quote&.intl=us&.lang=en-US',
 'https://help.yahoo.com/l/us/yahoo/finance/',
 'http://feedback.yahoo.com/forums/207809',
 'https://www.yahoo.com/',
 '/',
 '/portfolios.html',
 '/portfolios/manage',
 '/portfolio/new',
 '/my-quotes-news/',
 '/market-overview/',
 '/stock-center/',
 '/funds/',
 '/options/',
 '/etf/',
 '/bonds',
 '/futures',
 '/currency-investing',
 'http://biz.yahoo.com/research/earncal/today.html',
 '/yahoofinance/',
 '/yahoofinance/business/',
 '/yahoofinance/investing',
 '/yahoofinance/personalfinance',
 '/blogs/breakout/',
 '/blogs/cost-of-living/',
 '/blogs/daily-ticker/',
 '/blogs/driven/',
 '/blogs/hot-stock-minute/',
 '/blogs/just-explain-it/',
 'http://finance.yahoo.com/blogs/author/aaron-task/',
 '/blogs/author/michael-santoli/',
 '/blogs/author/jeff-macke/',
 '/blogs/author/aaron-pressman/',
 '/blogs/author/rick-newman/',
 '/blogs/author/mandi-woodruff/',
 '/blogs/author/chris-nichols/',
 '/blogs/the-exchange/',
 '/blogs/michael-santoli/',
 'http://finance.yahoo.com/blogs/author/philip-pearlman/',
 '/news/',
 '/corporate-news/',
 '/economic-policy-news/',
 '/investing-news/',
 '/personal-finance/',
 '/career-education/',
 '/real-estate/',
 '/retirement/',
 '/credit-debt/',
 '/taxes/',
 '/autos/',
 '/lifestyle/',
 '/videos/',
 '/rates/',
 '/calculator/index/',
 '/personal-finance/tools/',
 '/cnbc/',
 '/blogs/big-data-download/',
 '/blogs/off-the-cuff/',
 '/blogs/power-pitch/',
 '/blogs/talking-numbers/',
 '/blogs/the-biz-fix/',
 '/blogs/top-best-most/',
 '/contributors/',
 'http://finance.search.yahoo.com?fr=fin-v1',
 '/q?s=^DJI',
 '/q?s=^IXIC',
 '/q?s=AAPL',
 '/q/ecn?s=AAPL+Order+Book',
 '/q/op?s=AAPL+Options',
 '/q/hp?s=AAPL+Historical+Prices',
 '/echarts?s=AAPL+Interactive',
 '/q/h?s=AAPL+Headlines',
 '/q/p?s=AAPL+Press+Releases',
 '/q/ce?s=AAPL+Company+Events',
 '/mb?s=AAPL',
 '/marketpulse/?s=AAPL',
 '/q/pr?s=AAPL+Profile',
 '/q/ks?s=AAPL+Key+Statistics',
 '/q/sec?s=AAPL+SEC+Filings',
 '/q/co?s=AAPL+Competitors',
 '/q/in?s=AAPL+Industry',
 '/q/ct?s=AAPL+Components',
 '/q/ao?s=AAPL+Analyst+Opinion',
 '/q/ae?s=AAPL+Analyst+Estimates',
 '/q/mh?s=AAPL+Major+Holders',
 '/q/it?s=AAPL+Insider+Transactions',
 '/q/ir?s=AAPL+Insider+Roster',
 '/q/is?s=AAPL+Income+Statement',
 '/q/bs?s=AAPL+Balance+Sheet',
 '/q/cf?s=AAPL+Cash+Flow',
 'https://mobile.yahoo.com/finance/?src=gta',
 '/q/op?s=AAPL&date=1426204800',
 '/q/op?s=AAPL&straddle=true&date=1426204800',
 None,
 None,
 '/q/op?s=AAPL&strike=85.00',
 '/q?s=AAPL150313C00085000',
 '/q/op?s=AAPL&strike=90.00',
 '/q?s=AAPL150313C00090000',
 '/q/op?s=AAPL&strike=95.00',
 '/q?s=AAPL150313C00095000',
 '/q/op?s=AAPL&strike=100.00',
 '/q?s=AAPL150313C00100000',
 '/q/op?s=AAPL&strike=101.00',
 '/q?s=AAPL150313C00101000',
 '/q/op?s=AAPL&strike=102.00',
 '/q?s=AAPL150313C00102000',
 '/q/op?s=AAPL&strike=103.00',
 '/q?s=AAPL150313C00103000',
 '/q/op?s=AAPL&strike=105.00',
 '/q?s=AAPL150313C00105000',
 '/q/op?s=AAPL&strike=106.00',
 '/q?s=AAPL150313C00106000',
 '/q/op?s=AAPL&strike=107.00',
 '/q?s=AAPL150313C00107000',
 '/q/op?s=AAPL&strike=108.00',
 '/q?s=AAPL150313C00108000',
 '/q/op?s=AAPL&strike=109.00',
 '/q?s=AAPL150313C00109000',
 '/q/op?s=AAPL&strike=110.00',
 '/q?s=AAPL150313C00110000',
 '/q/op?s=AAPL&strike=111.00',
 '/q?s=AAPL150313C00111000',
 '/q/op?s=AAPL&strike=112.00',
 '/q?s=AAPL150313C00112000',
 '/q/op?s=AAPL&strike=113.00',
 '/q?s=AAPL150313C00113000',
 '/q/op?s=AAPL&strike=114.00',
 '/q?s=AAPL150313C00114000',
 '/q/op?s=AAPL&strike=115.00',
 '/q?s=AAPL150313C00115000',
 '/q/op?s=AAPL&strike=116.00',
 '/q?s=AAPL150313C00116000',
 '/q/op?s=AAPL&strike=117.00',
 '/q?s=AAPL150313C00117000',
 '/q/op?s=AAPL&strike=118.00',
 '/q?s=AAPL150313C00118000',
 '/q/op?s=AAPL&strike=119.00',
 '/q?s=AAPL150313C00119000',
 '/q/op?s=AAPL&strike=120.00',
 '/q?s=AAPL150313C00120000',
 '/q/op?s=AAPL&strike=121.00',
 '/q?s=AAPL150313C00121000',
 '/q/op?s=AAPL&strike=122.00',
 '/q?s=AAPL150313C00122000',
 '/q/op?s=AAPL&strike=123.00',
 '/q?s=AAPL150313C00123000',
 '/q/op?s=AAPL&strike=124.00',
 '/q?s=AAPL150313C00124000',
 '/q/op?s=AAPL&strike=125.00',
 '/q?s=AAPL150313C00125000',
 '/q/op?s=AAPL&strike=126.00',
 '/q?s=AAPL150313C00126000',
 '/q/op?s=AAPL&strike=127.00',
 '/q?s=AAPL150313C00127000',
 '/q/op?s=AAPL&strike=128.00',
 '/q?s=AAPL150313C00128000',
 '/q/op?s=AAPL&strike=129.00',
 '/q?s=AAPL150313C00129000',
 '/q/op?s=AAPL&strike=130.00',
 '/q?s=AAPL150313C00130000',
 '/q/op?s=AAPL&strike=131.00',
 '/q?s=AAPL150313C00131000',
 '/q/op?s=AAPL&strike=132.00',
 '/q?s=AAPL150313C00132000',
 '/q/op?s=AAPL&strike=133.00',
 '/q?s=AAPL150313C00133000',
 '/q/op?s=AAPL&strike=134.00',
 '/q?s=AAPL150313C00134000',
 '/q/op?s=AAPL&strike=135.00',
 '/q?s=AAPL150313C00135000',
 '/q/op?s=AAPL&strike=136.00',
 '/q?s=AAPL150313C00136000',
 '/q/op?s=AAPL&strike=137.00',
 '/q?s=AAPL150313C00137000',
 '/q/op?s=AAPL&strike=138.00',
 '/q?s=AAPL150313C00138000',
 '/q/op?s=AAPL&strike=139.00',
 '/q?s=AAPL150313C00139000',
 '/q/op?s=AAPL&strike=140.00',
 '/q?s=AAPL150313C00140000',
 '/q/op?s=AAPL&strike=141.00',
 '/q?s=AAPL150313C00141000',
 '/q/op?s=AAPL&strike=142.00',
 '/q?s=AAPL150313C00142000',
 '/q/op?s=AAPL&strike=143.00',
 '/q?s=AAPL150313C00143000',
 '/q/op?s=AAPL&strike=144.00',
 '/q?s=AAPL150313C00144000',
 '/q/op?s=AAPL&strike=145.00',
 '/q?s=AAPL150313C00145000',
 '/q/op?s=AAPL&strike=146.00',
 '/q?s=AAPL150313C00146000',
 '/q/op?s=AAPL&strike=150.00',
 '/q?s=AAPL150313C00150000',
 None,
 None,
 '/q/op?s=AAPL&strike=85.00',
 '/q?s=AAPL150313P00085000',
 '/q/op?s=AAPL&strike=90.00',
 '/q?s=AAPL150313P00090000',
 '/q/op?s=AAPL&strike=95.00',
 '/q?s=AAPL150313P00095000',
 '/q/op?s=AAPL&strike=100.00',
 '/q?s=AAPL150313P00100000',
 '/q/op?s=AAPL&strike=101.00',
 '/q?s=AAPL150313P00101000',
 '/q/op?s=AAPL&strike=102.00',
 '/q?s=AAPL150313P00102000',
 '/q/op?s=AAPL&strike=103.00',
 '/q?s=AAPL150313P00103000',
 '/q/op?s=AAPL&strike=104.00',
 '/q?s=AAPL150313P00104000',
 '/q/op?s=AAPL&strike=105.00',
 '/q?s=AAPL150313P00105000',
 '/q/op?s=AAPL&strike=106.00',
 '/q?s=AAPL150313P00106000',
 '/q/op?s=AAPL&strike=107.00',
 '/q?s=AAPL150313P00107000',
 '/q/op?s=AAPL&strike=108.00',
 '/q?s=AAPL150313P00108000',
 '/q/op?s=AAPL&strike=109.00',
 '/q?s=AAPL150313P00109000',
 '/q/op?s=AAPL&strike=110.00',
 '/q?s=AAPL150313P00110000',
 '/q/op?s=AAPL&strike=111.00',
 '/q?s=AAPL150313P00111000',
 '/q/op?s=AAPL&strike=112.00',
 '/q?s=AAPL150313P00112000',
 '/q/op?s=AAPL&strike=113.00',
 '/q?s=AAPL150313P00113000',
 '/q/op?s=AAPL&strike=114.00',
 '/q?s=AAPL150313P00114000',
 '/q/op?s=AAPL&strike=115.00',
 '/q?s=AAPL150313P00115000',
 '/q/op?s=AAPL&strike=116.00',
 '/q?s=AAPL150313P00116000',
 '/q/op?s=AAPL&strike=117.00',
 '/q?s=AAPL150313P00117000',
 '/q/op?s=AAPL&strike=118.00',
 '/q?s=AAPL150313P00118000',
 '/q/op?s=AAPL&strike=119.00',
 '/q?s=AAPL150313P00119000',
 '/q/op?s=AAPL&strike=120.00',
 '/q?s=AAPL150313P00120000',
 '/q/op?s=AAPL&strike=121.00',
 '/q?s=AAPL150313P00121000',
 '/q/op?s=AAPL&strike=122.00',
 '/q?s=AAPL150313P00122000',
 '/q/op?s=AAPL&strike=123.00',
 '/q?s=AAPL150313P00123000',
 '/q/op?s=AAPL&strike=124.00',
 '/q?s=AAPL150313P00124000',
 '/q/op?s=AAPL&strike=125.00',
 '/q?s=AAPL150313P00125000',
 '/q/op?s=AAPL&strike=126.00',
 '/q?s=AAPL150313P00126000',
 '/q/op?s=AAPL&strike=127.00',
 '/q?s=AAPL150313P00127000',
 '/q/op?s=AAPL&strike=128.00',
 '/q?s=AAPL150313P00128000',
 '/q/op?s=AAPL&strike=129.00',
 '/q?s=AAPL150313P00129000',
 '/q/op?s=AAPL&strike=130.00',
 '/q?s=AAPL150313P00130000',
 '/q/op?s=AAPL&strike=131.00',
 '/q?s=AAPL150313P00131000',
 '/q/op?s=AAPL&strike=132.00',
 '/q?s=AAPL150313P00132000',
 '/q/op?s=AAPL&strike=133.00',
 '/q?s=AAPL150313P00133000',
 '/q/op?s=AAPL&strike=134.00',
 '/q?s=AAPL150313P00134000',
 '/q/op?s=AAPL&strike=135.00',
 '/q?s=AAPL150313P00135000',
 '/q/op?s=AAPL&strike=136.00',
 '/q?s=AAPL150313P00136000',
 '/q/op?s=AAPL&strike=137.00',
 '/q?s=AAPL150313P00137000',
 '/q/op?s=AAPL&strike=138.00',
 '/q?s=AAPL150313P00138000',
 '/q/op?s=AAPL&strike=139.00',
 '/q?s=AAPL150313P00139000',
 '/q/op?s=AAPL&strike=140.00',
 '/q?s=AAPL150313P00140000',
 '/q/op?s=AAPL&strike=141.00',
 '/q?s=AAPL150313P00141000',
 '/q/op?s=AAPL&strike=142.00',
 '/q?s=AAPL150313P00142000',
 '/q/op?s=AAPL&strike=143.00',
 '/q?s=AAPL150313P00143000',
 '/q/op?s=AAPL&strike=144.00',
 '/q?s=AAPL150313P00144000',
 '/q/op?s=AAPL&strike=145.00',
 '/q?s=AAPL150313P00145000',
 '/q/op?s=AAPL&strike=150.00',
 '/q?s=AAPL150313P00150000']

Now, finding the right tables in the document can be a matter of trial and error; some websites make it easier by giving a table of interest an id attribute. In the book, tables[9] and tables[13] contained the call data and put data, respectively, but the page layout has since changed and those indices no longer exist:

In [128]:
tables = doc.findall('.//table')
In [129]:
calls = tables[9]
puts = tables[13]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-129-c2cad2b3938b> in <module>()
----> 1 calls = tables[9]
      2 puts = tables[13]

IndexError: list index out of range
In [136]:
t0 = tables[0]
t1 = tables[1]
t2 = tables[2]
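
Hard-coded indices like tables[9] break whenever the page layout changes; a more defensive sketch filters tables on an attribute instead (the class name here is purely illustrative, not guaranteed to exist in Yahoo's current markup):

In [ ]:
# keep only tables whose class attribute mentions a hypothetical marker
option_tables = [t for t in tables
                 if 'quote-table' in (t.get('class') or '')]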

Each table has a header row followed by each of the data rows.

In [139]:
rows0 = t0.findall('.//tr')
rows1 = t1.findall('.//tr')
rows2 = t2.findall('.//tr')

For the header as well as the data rows, we want to extract the text from each cell; in the case of the header these are th cells, while for the data they are td cells:

In [140]:
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    return [val.text_content() for val in elts]
In [147]:
_unpack(rows2[2], kind='td')
Out[147]:
['\r\n            85.00\r\n        ',
 '\r\n            AAPL150313P00085000\r\n        ',
 '\r\n            0.01\r\n        ',
 '\r\n            0.00\r\n        ',
 '\r\n            0.01\r\n        ',
 '\r\n            0.00\r\n        ',
 '\r\n            \r\n                \r\n                    0.00%\r\n                \r\n                \r\n            \r\n        ',
 '\r\n            3\r\n        ',
 '\r\n            3\r\n        ',
 '\r\n            237.50%\r\n        ']

Now it is a matter of combining all these steps to convert this data into a DataFrame. Since the numerical data is still in string format, we want to convert some, but perhaps not all, of the columns to floating-point format. pandas has a class TextParser that is used internally in read_csv and the other parsing functions to do the appropriate automatic type conversion:

In [148]:
from pandas.io.parsers import TextParser
In [158]:
def parse_options_data(table):
    rows = table.findall('.//tr')
    header = _unpack(rows[0], kind='th')
    data = [_unpack(r) for r in rows[1:]]
    return TextParser(data, names=header).get_chunk()

Finally, we invoke this parsing function on one of the lxml table objects and get a DataFrame result (here t2, which on this page actually held put data despite the variable name below):

In [160]:
call_data = parse_options_data(t2)
In [161]:
call_data.head()
Out[161]:
Strike   ∵ Filter Contract Name Last   Bid   Ask   Change   %Change   Volume   Open Interest   Implied Volatility  
0 \r\n \r\n ... None NaN NaN NaN NaN None NaN NaN None
1 \r\n 85.00\r\n \r\n AAPL150313P00085000\r\n 0.01 0 0.01 0 \r\n \r\n \r\n ... 3 3 \r\n 237.50%\r\n
2 \r\n 90.00\r\n \r\n AAPL150313P00090000\r\n 0.01 0 0.01 0 \r\n \r\n \r\n ... 2 2 \r\n 206.25%\r\n
3 \r\n 95.00\r\n \r\n AAPL150313P00095000\r\n 0.01 0 0.01 0 \r\n \r\n \r\n ... 10 6061 \r\n 168.75%\r\n
4 \r\n 100.00\r\n \r\n AAPL150313P00100000\r\n 0.01 0 0.01 0 \r\n \r\n \r\n ... 2 674 \r\n 140.63%\r\n
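
The \r\n padding above comes straight from Yahoo's markup; a small, hedged tweak to _unpack strips it before the rows reach TextParser:

In [ ]:
def _unpack(row, kind='td'):
    elts = row.findall('.//%s' % kind)
    # strip the surrounding whitespace each cell carries in the raw HTML
    return [val.text_content().strip() for val in elts]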

Parsing XML with lxml.objectify (note: the data files were not found locally)

XML (extensible markup language) is another common structured data format supporting hierarchical, nested data with metadata.

The New York MTA publishes a number of data series about its bus and train services. Here we look at the performance data, which is contained in a set of XML files. Each train or bus service has a different file containing monthly data as a series of XML records.

Using lxml.objectify, we parse the file and get a reference to the root node of the XML file with getroot:
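
Since the MTA files were not found locally, here is a self-contained sketch of the same pattern on a tiny inline document (the tag names mimic the book's Performance_MNR.xml but are assumptions):

In [ ]:
from lxml import objectify
from StringIO import StringIO  # use io.BytesIO on Python 3

xml = """<PERFORMANCE>
  <INDICATOR>
    <AGENCY_NAME>Metro-North</AGENCY_NAME>
    <YTD_ACTUAL>96.9</YTD_ACTUAL>
  </INDICATOR>
</PERFORMANCE>"""

root = objectify.parse(StringIO(xml)).getroot()
# iterating an objectified child walks all siblings with the same tag
records = [{'agency': str(ind.AGENCY_NAME), 'ytd_actual': float(ind.YTD_ACTUAL)}
           for ind in root.INDICATOR]
pd.DataFrame(records)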

BINARY DATA FORMATS

One of the easiest ways to store data efficiently in binary format is using Python's built-in pickle serialization. Conveniently, pandas objects all have a to_pickle method that writes the data to disk as a pickle:

In [162]:
!pwd
/Users/sergulaydore/Documents/Face_Car/python_scripts/Python_for_Data_Analysis
In [166]:
%cd /Users/sergulaydore/pydata-book/ch06
/Users/sergulaydore/pydata-book/ch06
In [169]:
frame = pd.read_csv('./ex1.csv')
In [170]:
frame
Out[170]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
In [172]:
frame.to_pickle('./frame_pickle')
In [174]:
pd.read_pickle('./frame_pickle')
Out[174]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo

Note: pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.

Using HDF5 format

There are a number of tools that facilitate efficiently reading and writing large amounts of scientific data in binary format on disk. A popular industry-grade library for this is HDF5, which is a C library with interfaces in many other languages like Java, Python, and MATLAB. The "HDF" in HDF5 stands for hierarchical data format. Each HDF5 file contains an internal file system-like node structure, enabling you to store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compressors, enabling data with repeated patterns to be stored more efficiently. For very large datasets that don't fit into memory, HDF5 is a good choice, as you can efficiently read and write small sections of much larger arrays.

There are not one but two interfaces to the HDF5 library in Python, PyTables and h5py, each of which takes a different approach to the problem. h5py provides a direct but high-level interface to the HDF5 API, while PyTables abstracts many of the details of HDF5 to provide multiple flexible data containers, table indexing, querying capability, and some support for out-of-core computations.

pandas has a minimal dict-like HDFStore class, which uses PyTables to store pandas objects:

In [175]:
store = pd.HDFStore('./mydata.h5')
In [177]:
store['obj1'] = frame
In [178]:
store['obj1_col'] = frame['a']
In [179]:
store
Out[179]:
<class 'pandas.io.pytables.HDFStore'>
File path: ./mydata.h5
/obj1                frame        (shape->[3,5])
/obj1_col            series       (shape->[3])  

Objects contained in the HDF5 file can be retrieved in a dict-like fashion:

In [180]:
store['obj1']
Out[180]:
a b c d message
0 1 2 3 4 hello
1 5 6 7 8 world
2 9 10 11 12 foo
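
When you are done, close the store so everything is flushed to disk; newer pandas can then read a stored object back in a single call with read_hdf:

In [ ]:
store.close()
pd.read_hdf('./mydata.h5', 'obj1')  # read a stored object back by key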

Reading Microsoft Excel Files

In [ ]:
xls_file = pd.ExcelFile('./data.xls')
table = xls_file.parse('Sheet1')
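
Newer pandas versions collapse this into a single read_excel call; a sketch assuming the same hypothetical data.xls:

In [ ]:
# one-step equivalent (sheet_name is the keyword on pandas >= 0.21)
table = pd.read_excel('./data.xls', sheet_name='Sheet1')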

Interacting with HTML and Web APIs (note: Twitter now requires OAuth authentication)

Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method is the requests package. To run a search against Twitter's API, we can make an HTTP GET request like so:

In [182]:
import requests
In [215]:
url = 'https://api.twitter.com/1.1/search/tweets.json?q=%23freebandnames&since_id=24012619984051000&max_id=250126199840518145&result_type=mixed&count=4'
In [216]:
resp = requests.get(url)
In [217]:
resp
Out[217]:
<Response [400]>

The response object's text attribute contains the content of the GET query. Many web APIs will return a JSON string that must be loaded into a Python object. Here the 400 status means the unauthenticated request was rejected, which is why the parsed result contains only an 'errors' key:

In [218]:
import json
In [219]:
data = json.loads(resp.text)
In [220]:
data.keys()
Out[220]:
[u'errors']
In [224]:
import requests
from requests_oauthlib import OAuth1
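
A sketch of the authenticated request; the four credential strings are placeholders you would obtain from Twitter's developer console, not real values:

In [ ]:
# all four arguments below are placeholder credentials
auth = OAuth1('API_KEY', 'API_SECRET', 'ACCESS_TOKEN', 'ACCESS_SECRET')
resp = requests.get(url, auth=auth)
data = json.loads(resp.text)  # should now contain 'statuses' rather than 'errors'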

Interacting with DataBases

In many applications, data rarely comes from text files, that being a fairly inefficient way to store large amounts of data. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative non-SQL databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.

Loading data from SQL into a DataFrame is fairly straightforward.

In [225]:
import sqlite3
In [226]:
query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
c REAL, d INTEGER
); """
In [227]:
con = sqlite3.connect(':memory:')
In [228]:
con.execute(query)
Out[228]:
<sqlite3.Cursor at 0x10f5845e0>
In [229]:
con.commit()

Then, insert a few rows of data:

In [230]:
data = [ ('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento','California', 1.7, 5)
        ]
In [231]:
stmt = "INSERT INTO test VALUES(?,?,?,?)"
In [232]:
con.executemany(stmt, data)
Out[232]:
<sqlite3.Cursor at 0x10f5af960>
In [233]:
con.commit()

Most Python SQL drivers return a list of tuples when selecting data from a table:

In [234]:
cursor = con.execute('select * from test')
In [235]:
rows = cursor.fetchall()
In [236]:
rows
Out[236]:
[(u'Atlanta', u'Georgia', 1.25, 6),
 (u'Tallahassee', u'Florida', 2.6, 3),
 (u'Sacramento', u'California', 1.7, 5)]

You can pass the list of tuples to the DataFrame constructor, but you also need the column names, contained in the cursor's description attribute:

In [237]:
cursor.description
Out[237]:
(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))
In [238]:
pd.DataFrame(rows, columns = zip(*cursor.description)[0])
Out[238]:
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5

This is quite a bit of munging that you'd rather not repeat each time you query the database. pandas has a read_sql function in its pandas.io.sql module that simplifies the process. Just pass the select statement and the connection object:

In [239]:
import pandas.io.sql as sql
In [241]:
sql.read_sql('select * from test', con)
Out[241]:
a b c d
0 Atlanta Georgia 1.25 6
1 Tallahassee Florida 2.60 3
2 Sacramento California 1.70 5
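
The reverse direction works too: DataFrame.to_sql writes a frame into a database table. A sketch against the same in-memory connection (pandas falls back to a basic SQLite writer when given a plain sqlite3 connection):

In [ ]:
frame = sql.read_sql('select * from test', con)
frame.to_sql('test_copy', con, index=False)  # creates and populates a new table
sql.read_sql('select * from test_copy', con)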