Introduction

In this short tutorial, I want to show how you can read software data in various formats with Python and pandas. We use the read_csv and read_excel functions to accomplish our tasks.

In [ ]:

# Reading CSV

Reading files with mixed separators

In this section, we read a less structured data set: Git log output in the following format.

<timestamp><whitespace><timezone><tabulator><author>

It contains two different separators: a space and a tab. Here is the content of the file datasets/mixed_separators.txt:

1514531161 -0800	Linus Torvalds
1514489303 -0500	David S. Miller
1514487644 -0800	Tom Herbert
1514487643 -0800	Tom Herbert
1514482693 -0500	Willem de Bruijn
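To make the two separators visible, here is a minimal sketch in plain Python that takes one line from the listing above apart. The variable names are my own choice for illustration; the line itself is copied from the sample data.

```python
# One sample line from the listing above: a space separates timestamp and
# timezone, a tab separates timezone and author.
line = "1514531161 -0800\tLinus Torvalds"

# Split at the tab first, so spaces inside the author name are left alone.
head, _, author = line.partition("\t")

# Then split the remaining part at the single space.
timestamp, _, timezone = head.partition(" ")

print(timestamp, timezone, author, sep=" | ")
```

Splitting at the tab before the space matters here, because author names such as "David S. Miller" contain spaces themselves.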

We can read in this kind of data by passing a regular expression as the separator, which requires the Python parsing engine:

In [54]:

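import pandas as pd

# The regular expression captures timestamp, timezone and author as groups;
# regex separators like this one need the Python parsing engine.
pd.read_csv(
    "datasets/mixed_separators.txt",
    sep=r"^([0-9]*?) (.*?)\t(.*?)$",
    engine='python',
    names=['timestamp', 'timezone', 'author'],
    header=None)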

Out[54]:

		timestamp	timezone	author
NaN	1514531161	-800	Linus Torvalds	NaN
	1514489303	-500	David S. Miller	NaN
	1514487644	-800	Tom Herbert	NaN
	1514487643	-800	Tom Herbert	NaN
	1514482693	-500	Willem de Bruijn	NaN
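In the output above, the pattern matches the whole line, so the split most likely yields an empty field before and after the three captured groups. That would explain the NaN entries and why the values appear shifted relative to the column names. As an alternative, here is a sketch that avoids the regular expression: it splits at the tab only and then separates timestamp and timezone in a second step. It is not the approach used in the cell above. The in-memory sample and the temporary column name "when" are assumptions made so the example runs without the file.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the file, using sample rows from above.
sample = (
    "1514531161 -0800\tLinus Torvalds\n"
    "1514489303 -0500\tDavid S. Miller\n"
    "1514487644 -0800\tTom Herbert\n"
)

# Step 1: split only at the tab, so author names keep their spaces.
raw = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["when", "author"],  # "when" is a temporary, made-up column name
    dtype=str,                 # keep "-0800" as text instead of the integer -800
)

# Step 2: split the first column at the single space into two new columns.
raw[["timestamp", "timezone"]] = raw["when"].str.split(" ", n=1, expand=True)

# Keep the three columns in the order of the original format.
log = raw[["timestamp", "timezone", "author"]]
print(log)
```

Reading everything as strings is a deliberate choice in this sketch: it preserves the leading zero of the timezone offset, which is lost when the column is parsed as a number (as in the -800 values above).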