Importing delimited text files

Importing delimited text files can be accomplished using a variety of modules in Python. This notebook will cover pure python, the csv module, the numpy module and pandas.

Pure Python

In pure Python we:

  1. Open the datafile in read ('r') mode
  2. Create an empty list to store the data
  3. Loop over the rows in the file
  4. Getting rid of whitespace (.strip()) and splitting the row into a list based on the delimiter, in this case a comma (.split(',')).

The types of the resulting items in the list of list will always be strings, so if you need something else it will have to be converted later.

In [1]:
datafile = open('./data/examp_data.txt', 'r')
data = []
for row in datafile:
    data.append(row.strip().split(','))
data
Out[1]:
[['x', ' y', ' z'],
 ['1', ' 2', ' 3'],
 ['2', ' 2.4', ' 6'],
 ['3', ' 1.9', ' 8']]

CSV Module

In the csv module we:

  1. Open the datafile in read ('r') mode
  2. Create a reader object with that file, specifying the delimiter (the default is a comma, but it is explicitly specified here for clarity).
  3. Create an empty list to store the data
  4. Loop over the rows in the reader appending each row to the list

The types of the resulting items in the list of list will always be strings, so if you need something else it will have to be converted later.

In [2]:
import csv
datafile = open('./data/examp_data.txt', 'r')
datareader = csv.reader(datafile, delimiter=',')
data = []
for row in datareader:
    data.append(row)
data
Out[2]:
[['x', ' y', ' z'],
 ['1', ' 2', ' 3'],
 ['2', ' 2.4', ' 6'],
 ['3', ' 1.9', ' 8']]

Using Numpy

Importing using Numpy will import the data into a Numpy array, a commonly used data structure for scientific programming.

Using Numpy we simply use the genfromtxt() function to directly import the data. genfromtxt() has a lot of options for controlling what and how gets imported. See the docs page for details: http://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

Numpy will autodetect the data type, so we'll often want to leave off the header row(s) (skip_header=1). We could keep them using the names=True argument, and also columns with different data types, but that creates a structured array and if we want to work with that type of data we're typically better off using pandas (see below).

In [3]:
import numpy
data = numpy.genfromtxt('./data/examp_data.txt', delimiter = ',', skip_header=1)
data
Out[3]:
array([[ 1. ,  2. ,  3. ],
       [ 2. ,  2.4,  6. ],
       [ 3. ,  1.9,  8. ]])

Pandas

Pandas is a powerful data management library that produces data structures and associated tools that are ideal for scientific computing tasks related to data. In particular, it produces a dataframe object that is much like R's dataframe and is designed to hold data with the standard structure of one row per record and one column per type of data (or field).

In Pandas we just use the read_csv() function to import text files. It has a lot of options, but will do most things automatically including detecting delimiters and detecting data types.

In [4]:
import pandas as pd
data = pd.read_csv('./data/examp_data.txt')
data
Out[4]:
   x   y    z
0  1  2.0  3 
1  2  2.4  6 
2  3  1.9  8 

Pandas dataframes do behave a bit differently than a lot of list based structures in Python, but we'll learn how to work with them soon. If you just want to pull the core data out of a dataframe you can do this using the values member (a member is just a variable associated with an object).

In [5]:
data.values
Out[5]:
array([[ 1. ,  2. ,  3. ],
       [ 2. ,  2.4,  6. ],
       [ 3. ,  1.9,  8. ]])