In this short tutorial, I want to show how you can read differently formatted software data with Python and pandas. We use the `read_csv` and `read_excel` functions to accomplish our tasks.
# Reading CSV
In this section we read a less structured data set: Git log output in the following format.

`<timestamp><whitespace><timezone><tabulator><author>`

It contains two different separators: a space and a tab. Here is the content of the file `datasets/mixed_separators.txt`:
1514531161 -0800 Linus Torvalds
1514489303 -0500 David S. Miller
1514487644 -0800 Tom Herbert
1514487643 -0800 Tom Herbert
1514482693 -0500 Willem de Bruijn
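To make the two separators concrete, here is a minimal sketch that takes a single line apart by hand. It assumes, as the format description states, that a tab sits between the timezone and the author; the variable names are only illustrative.

```python
# One line of the log, with the tab written explicitly as \t
line = "1514531161 -0800\tLinus Torvalds"

# Split at the tab first: the author name may itself contain spaces
stamp_and_zone, author = line.split("\t", 1)

# Then split the remainder at the single space
timestamp, timezone = stamp_and_zone.split(" ", 1)

print(timestamp, timezone, author, sep=" | ")
```

Splitting at the tab before the space matters, because a plain split on whitespace would also cut names such as "David S. Miller" into pieces.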
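We can read in this kind of data by passing a regular expression as the separator:

import pandas as pd

# A regex separator needs the slower but more flexible Python parsing engine
pd.read_csv(
    "datasets/mixed_separators.txt",
    # one capture group per field: timestamp, timezone, author
    sep=r"^([0-9]*?) (.*?)\t(.*?)$",
    engine='python',
    names=['timestamp', 'timezone', 'author'],
    header=None)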
| | | timestamp | timezone | author |
|---|---|---|---|---|
| NaN | 1514531161 | -800 | Linus Torvalds | NaN |
| | 1514489303 | -500 | David S. Miller | NaN |
| | 1514487644 | -800 | Tom Herbert | NaN |
| | 1514487643 | -800 | Tom Herbert | NaN |
| | 1514482693 | -500 | Willem de Bruijn | NaN |
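The NaN entries in the output above most likely come from the way a regular expression with capture groups splits a line: when the pattern matches the whole line, the split yields an empty string before and after the three captured fields, so five fields meet three column names and the values appear shifted relative to the labels. The sketch below first shows that effect with Python's `re.split`, then tries one possible alternative: read the file with the tab as the only separator and split the first column at the space afterwards. The inline `sample` string and the names `raw`, `parts`, `log` and `when` are assumptions made for illustration, not part of the original data set.

```python
import io
import re

import pandas as pd

# Two sample lines in the format described above (tab written as \t)
sample = (
    "1514531161 -0800\tLinus Torvalds\n"
    "1514489303 -0500\tDavid S. Miller\n"
)

# The capture groups match the whole line, so re.split returns
# an empty string on each side of the three captured fields.
pattern = r"^([0-9]*?) (.*?)\t(.*?)$"
fields = re.split(pattern, "1514531161 -0800\tLinus Torvalds")
print(fields)

# Alternative sketch: use the tab as the only separator ...
raw = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    header=None,
    names=["when", "author"],
)

# ... then split "<timestamp> <timezone>" at the first space
parts = raw["when"].str.split(" ", n=1, expand=True)

log = pd.DataFrame({
    "timestamp": parts[0].astype("int64"),
    "timezone": parts[1],  # kept as text, so "-0800" keeps its leading zero
    "author": raw["author"],
})
print(log)
```

Keeping the timezone as text is a deliberate choice in this sketch: parsed as a number, `-0800` would become `-800`, as seen in the output above.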