Reading and using numerical data from text files

Reading in data from text files is a common task in data analysis and plotting. Python has many options for dealing with text files that contain numerical data (as well as text-based data if that is your thing). These examples assume that you have a text file containing a list of data, possibly spanning multiple columns. We will go through different ways that you can import and manipulate this data with Python and NumPy.

The sample data file we are working with is a plain text file containing radar-derived rainfall rates. Each column corresponds to a different region, and each line holds the rainfall rates for one 60-minute interval.

Numpy's np.loadtxt

This is the simplest method, and probably the most limited. It uses a NumPy function called loadtxt to read in text file data, as long as the data is regularly formatted (an MxN grid of values). It also assumes you have no missing values (see np.genfromtxt if you do). Some more advanced methods are covered towards the end of this tutorial if you have more complex data to work with.

The function returns a NumPy array whose shape matches the layout of your data file.

In [2]:
import numpy as np

myData = np.loadtxt("test_rainfile_hourly.txt")
In [3]:
myData
Out[3]:
array([[  3.2,   0.6,   0.6, ...,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       ..., 
       [  0.1,   1.1,   1. , ...,  12.5,  12.5,   3.6],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ]])

As you can see, the object myData is a 2-Dimensional array, filled with the values in the sample text data file. The loadtxt method automatically creates for us an array with all of our data parsed into it from the text file.
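If you want to try np.loadtxt without the sample file, here is a minimal sketch. The three-line dataset is made up for illustration, and io.StringIO stands in for a file on disk (loadtxt accepts file-like objects as well as filenames).

```python
import io
import numpy as np

# Made-up stand-in for a whitespace-separated data file (3 rows, 4 columns).
sample_text = """3.2 0.6 0.6 0.0
0.0 0.0 0.0 0.0
0.1 1.1 1.0 12.5
"""

demo = np.loadtxt(io.StringIO(sample_text))

print(demo.shape)  # (number of rows, number of columns)
print(demo.dtype)  # loadtxt parses values as floats by default
```

Checking the shape straight after loading is a cheap way to confirm that the file was parsed into the grid you expected.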

Extracting subsets of data

Now we are going to extract certain subsets of this (or these...) data. We can do this with something called array slicing - a built-in feature of numpy arrays. Suppose we want the first column of our data:

In [4]:
myData[:,0]
Out[4]:
array([  3.2,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,   0. ,
         0. ,   0. ,   0. ,   0. ,   1.5,  15.8,   0. ,   0.3,   1. ,
         0. ,   0. ,   0. ,   0.1,   0. ,   0. ])

Explanation: NumPy stores array data in row-major order by default. So when you want to access a certain data element within a 2-D array, you specify the row first, and then the column. (Note that Python starts counting array positions from zero!)

someArray[row_index, column_index]

The colon symbol is used to specify a range of values, but without any bounds specified it effectively means "Give me the full range of values in that dimension". So here we are using the : symbol as a wildcard to say we want all elements from the row dimension, and 0 to say we want the "zeroth-column". This gives us a view of the array's first column.

If we wanted the first row, we could just specify the following:

In [5]:
myData[0,:]
# (give me the "zero-th"  row, and then all the columns in that row)
Out[5]:
array([ 3.2,  0.6,  0.6,  0.6,  0.6,  0.6,  0.3,  0.3,  0.3,  0.3,  0.3,
        0.1,  0.1,  0.1,  0.1,  0.1,  0. ,  4.3,  2. ,  2. ,  2. ,  2. ,
        2. ,  2.3,  2.3,  2.3,  2.3,  2.3,  0.1,  0.1,  0.1,  0.1,  0.1,
        0. ,  4.3,  2. ,  2. ,  2. ,  2. ,  2. ,  2.3,  2.3,  2.3,  2.3,
        2.3,  0.1,  0.1,  0.1,  0.1,  0.1,  0. ,  4.3,  2. ,  2. ,  2. ,
        2. ,  2. ,  2.3,  2.3,  2.3,  2.3,  2.3,  0.1,  0.1,  0.1,  0.1,
        0.1,  0. ,  4.3,  2. ,  2. ,  2. ,  2. ,  2. ,  2.3,  2.3,  2.3,
        2.3,  2.3,  0.1,  0.1,  0.1,  0.1,  0.1,  0. ,  4.3,  2. ,  2. ,
        2. ,  2. ,  2. ,  2.3,  2.3,  2.3,  2.3,  2.3,  0.1,  0.1,  0.1,
        0.1,  0.1,  0. ,  5.5,  2. ,  2. ,  2. ,  2. ,  2. ,  0.3,  0.3,
        0.3,  0.3,  0.3,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  5.5,  2. ,
        2. ,  2. ,  2. ,  2. ,  0.3,  0.3,  0.3,  0.3,  0.3,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  5.5,  2. ,  2. ,  2. ,  2. ,  2. ,  0.3,
        0.3,  0.3,  0.3,  0.3,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  5.5,
        2. ,  2. ,  2. ,  2. ,  2. ,  0.3,  0.3,  0.3,  0.3,  0.3,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  5.5,  2. ,  2. ,  2. ,  2. ,  2. ,
        0.3,  0.3,  0.3,  0.3,  0.3,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        7.5,  0.6,  0.6,  0.6,  0.6,  0.6,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  7.5,  0.6,  0.6,  0.6,  0.6,
        0.6,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  7.5,  0.6,  0.6,  0.6,  0.6,  0.6,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  7.5,  0.6,  0.6,  0.6,
        0.6,  0.6,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  7.5,  0.6,  0.6,  0.6,  0.6,  0.6,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0.6,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0.6,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0.6,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0.6,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0.6,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0.4,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0.4,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0.4,  0. ,  0. ,  0. ,  0. ,
        0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,  0. ,
        0. ])

Similarly, if we want the second and third columns, we can do this:

In [6]:
myData[:,1:3]
# give me all the rows, and columns 1 and 2 (the upper bound, 3, is not included)
Out[6]:
array([[  0.6,   0.6],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  1.5,   0.4],
       [ 21.1,  21.1],
       [  1.7,   1.7],
       [  1.1,   1.1],
       [  1.3,   1. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  0. ,   0. ],
       [  1.1,   1. ],
       [  0. ,   0. ],
       [  0. ,   0. ]])

The slicing syntax can be slightly confusing at first. You may have thought (or forgotten, like I did when writing this) that a range in an array slice includes its upper bound, but the range actually runs up to, but does not include, the upper bound. For example, myData[:,1:2] would give you only one column of data.

Summary

  1. The colon syntax (:) specifies a range. However, omitting any bounds on the range will return the whole set of data from that dimension.
  2. Range bounds are not inclusive of the upper bound.
  3. Array indexing is row-major. (Like C, but not like Fortran)
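The three points above can be checked on a small array. This is only a sketch: the array grid is hypothetical, built so that each value encodes its own position (row*10 + column), which makes it easy to see what each slice picked out.

```python
import numpy as np

# Hypothetical 4x5 array; each value is row*10 + column.
grid = np.array([[r * 10 + c for c in range(5)] for r in range(4)])

whole_column = grid[:, 0]    # bare colon: every row, column 0
one_column = grid[:, 1:2]    # upper bound excluded: only column 1 (still 2-D)
two_columns = grid[:, 1:3]   # columns 1 and 2
element = grid[2, 3]         # row index first, then column index

print(whole_column)
print(one_column.shape)
print(two_columns.shape)
print(element)
```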

More slicing

You can extract a rectangular subset of data using the same notation: Here we are going to ask for four rows of data, and from those rows, two columns of data.

In [24]:
myData[2:6,1:3]
Out[24]:
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

Lo and behold, we now have a 4x2 subset of the original myData array. (Don't worry that they are all zeros - that's just part of the dataset and is correct)

Array views vs Array copies

By design, slicing a NumPy array does not create a copy of that array. (This differs from slicing a standard Python list, which does make a copy.) It simply creates a "view" of the original array (or a "reference" to the original array, if you like the terminology of references or (god forbid) pointers and that kind of stuff). If you modify the original array in place, your view will change as well. This might be what you intended, but sometimes it is not.

If you want a separate array to work on, which is not just a view of the original array, use the copy function:

In [25]:
b = myData[2:6,1:3].copy()
b
Out[25]:
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

b is now a separate array object, with dimensions of 4x2. It takes up its own space in memory (which is why you should think about whether you really need a copy of an entire array if dealing with massive datasets).

Now look what happens if you don't use copy, but use a simple assignment operator:

In [26]:
c = myData[2:6,1:3]
c
Out[26]:
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

c is a view of a subset of myData. It is not a separate array object. Let's modify the original array by adding 10 to all the values:

In [27]:
myData += 10
In [28]:
c
Out[28]:
array([[ 10.,  10.],
       [ 10.,  10.],
       [ 10.,  10.],
       [ 10.,  10.]])

c reflects the change made to the original array, myData. The variable c only refers to a view of myData, so any in-place change to myData also shows up in c.

In contrast, when we created b, we explicitly requested the data from myData to be copied to a new array. Anything we subsequently do to myData will not affect the array b. Be sure to make copies of arrays when you need to.
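If you are ever unsure whether two arrays share data, you can ask NumPy directly. The sketch below uses a made-up array of zeros in place of myData and np.shares_memory to check; the variable names are just for illustration.

```python
import numpy as np

# Made-up stand-in for myData.
original = np.zeros((6, 4))

view = original[2:6, 1:3]            # plain slice: shares memory with original
copied = original[2:6, 1:3].copy()   # explicit copy: its own block of memory

original += 10                       # in-place change to the original array

print(view[0, 0])    # follows the original
print(copied[0, 0])  # unaffected by the change
print(np.shares_memory(original, view))
print(np.shares_memory(original, copied))
```

Note that the in-place operator matters here. Writing original = original + 10 instead would build a brand new array and rebind the name, leaving the view pointing at the old, unchanged data.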

Some other methods

np.loadtxt is not the only way of loading data. In fact, for large datasets, it is often very slow compared to other methods. I won't go into full detail here but there are other methods you can use such as:

csv.reader

Python has a native csv (comma-separated values) file reader. Note that the CSV file does not have to actually be "comma separated", as you can specify what the delimiter is. We can use it on our original (space separated) data file for example:

In [36]:
import csv

data = []  # create an empty list (note: not a numpy array, yet!)

with open("test_rainfile_hourly.txt", newline="") as csvfile:  # Python 3: open in text mode, not 'rb'
    csvreader = csv.reader(csvfile, delimiter=" ")
    for row in csvreader:
        data.append(row)  
        
dataArray = np.array(data, dtype=float)  # convert the standard python list into a swish numpy array
dataArray
Out[36]:
array([[  3.2,   0.6,   0.6, ...,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       ..., 
       [  0.1,   1.1,   1. , ...,  12.5,  12.5,   3.6],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. , ...,   0. ,   0. ,   0. ]])

There's an important detail in the last couple of lines above. We specified a dtype - the type of data that the text file contains. In our case, the data type is floating point numbers (float) - by default the csv reader will generate strings, which is probably not what you want in scientific data processing. We don't normally specify data types explicitly in Python, but here's an example of when it is necessary.
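To see what the dtype argument changes, here is a small sketch comparing the two conversions. The two rows of data are made up, and io.StringIO stands in for an open text file.

```python
import csv
import io
import numpy as np

# Made-up space-separated rows; io.StringIO stands in for an open text file.
text = "3.2 0.6 0.0\n0.1 1.1 12.5\n"

rows = list(csv.reader(io.StringIO(text), delimiter=" "))

as_strings = np.array(rows)              # no dtype given: elements stay strings
as_floats = np.array(rows, dtype=float)  # strings are parsed into floats

print(as_strings.dtype.kind)  # 'U' means unicode strings
print(as_floats.dtype)
print(as_floats.sum())        # arithmetic only makes sense on the float array
```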

np.genfromtxt

This method is similar to np.loadtxt but offers a few more options, such as how to deal with missing values. For example, if our data file contains NaNs, and we wanted to replace them with numbers, we could do:

In [41]:
import numpy as np

dataArray2 = np.genfromtxt("test_rainfile_hourly.txt", missing_values="NaN", filling_values=-1)
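Here is a self-contained sketch of the filling behaviour. It assumes a made-up comma-separated dataset (not our rainfall file) in which one field has been left empty, since an empty field is what genfromtxt treats as missing by default.

```python
import io
import numpy as np

# Made-up comma-separated data; the middle field of the second row is empty.
text = "1.0,2.0,3.0\n4.0,,6.0\n"

# Without filling_values, the missing float entry becomes nan.
with_nan = np.genfromtxt(io.StringIO(text), delimiter=",")

# With filling_values, the missing entry is replaced by the value given.
filled = np.genfromtxt(io.StringIO(text), delimiter=",", filling_values=-1)

print(with_nan)
print(filled)
```

A sentinel such as -1 is only safe if it can never occur as a real measurement. Otherwise keeping nan and using functions like np.nanmean may be the better choice.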

Some other options for np.loadtxt and np.genfromtxt

There are some other nifty options for loadtxt and genfromtxt that you can use to easily pick out specific columns from a textfile, or skip header rows for example.

usecols

Pick out specific columns (do not have to be next to each other). Note that the 'first' column is column 0.

In [45]:
dataArray3 = np.loadtxt("test_rainfile_hourly.txt", usecols=(0,4,5))
dataArray3
Out[45]:
array([[  3.2,   0.6,   0.6],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  1.5,   0.4,   0.4],
       [ 15.8,  21.1,  21.1],
       [  0. ,   1.7,   1.7],
       [  0.3,   1.1,   1.1],
       [  1. ,   1. ,   1. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0.1,   1. ,   1. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ]])

skiprows

Useful if you have header information in the text file, or if you just want to ignore a given number of rows at the start of the data.

In [46]:
dataArray4 = np.loadtxt("test_rainfile_hourly.txt", usecols=(0,4,5), skiprows=5)
dataArray4
Out[46]:
array([[  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  1.5,   0.4,   0.4],
       [ 15.8,  21.1,  21.1],
       [  0. ,   1.7,   1.7],
       [  0.3,   1.1,   1.1],
       [  1. ,   1. ,   1. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ],
       [  0.1,   1. ,   1. ],
       [  0. ,   0. ,   0. ],
       [  0. ,   0. ,   0. ]])
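Our sample file has no header, so the example above simply drops the first five rows of data. As a sketch of the header case, here is a made-up file with two header lines, again using io.StringIO in place of a file on disk.

```python
import io
import numpy as np

# Made-up file contents: two header lines followed by three rows of data.
text = """Rainfall rates (made-up example)
zone1 zone2 zone3 zone4
3.2 0.6 0.6 0.0
0.0 0.0 0.0 0.0
0.1 1.1 1.0 12.5
"""

# Skip the two header lines, then keep only the first and last columns.
subset = np.loadtxt(io.StringIO(text), skiprows=2, usecols=(0, 3))

print(subset)
```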

'Unpacking' columns into separate variables

This is a handy method for extracting columns into separate arrays. Instead of getting one massive array for the whole dataset, we can get back individual arrays for each column, by specifying unpack=True

In [54]:
x, y, z = np.loadtxt("test_rainfile_hourly.txt", usecols=(0,4,5), unpack=True)

print(x)
print(y)
print(z)
[  3.2   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    1.5  15.8   0.    0.3   1.    0.    0.    0.    0.1   0.    0. ]
[  0.6   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    0.4  21.1   1.7   1.1   1.    0.    0.    0.    1.    0.    0. ]
[  0.6   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
   0.    0.4  21.1   1.7   1.1   1.    0.    0.    0.    1.    0.    0. ]

pandas.read_csv

Pandas is a powerful Python module specifically aimed at data processing. It is separate from numpy, and probably best suited to situations when you want to do lots of loading and modifying of data tables. It has a level of functionality approaching database software. I won't go into it here but if you are interested in using it, there is a function called read_csv which has many, many options for reading text files into pandas data frames. Pandas dataframes can be easily converted into standard numpy arrays.

http://pandas.pydata.org/pandas-docs/stable/io.html#io-read-csv-table
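For a flavour of what that looks like, here is a minimal sketch. It assumes pandas is installed, uses made-up whitespace-separated data, and picks sep and header values to suit a file like ours with no header row.

```python
import io
import pandas as pd

# Made-up whitespace-separated data with no header row.
text = "3.2 0.6 0.6\n0.0 0.0 0.0\n0.1 1.1 1.0\n"

# sep=r"\s+" splits on runs of whitespace; header=None means "no header line".
frame = pd.read_csv(io.StringIO(text), sep=r"\s+", header=None)

# Convert the data frame into a standard numpy array.
array = frame.to_numpy()

print(frame.shape)
print(type(array))
```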

Doing it the hard way...

This is how I used to do it before I realised numpy had a loadtxt function (*facepalm*). This method could come in handy if you really need some fine-grained control over how the file was processed, or you have irregularly shaped data. For example, rows/columns of varying lengths. Here it is for completeness:

In [55]:
f = open("test_rainfile_hourly.txt",'r')  # open file
lines = f.readlines()[4:]   # read in the data using the readlines function (the last part skips 4 header lines)
no_lines = len(lines)   # get the number of lines (=number of data)

# data variables
rainfall_zone1 = np.zeros(no_lines, dtype=float)  # np.float was removed from recent NumPy versions
rainfall_zone2 = np.zeros(no_lines, dtype=float)
rainfall_zone3 = np.zeros(no_lines, dtype=float)
rainfall_zone4 = np.zeros(no_lines, dtype=float)

for i in range(no_lines):
    line = lines[i].strip().split(" ")
    # print(line)
    rainfall_zone1[i] = float(line[0])
    rainfall_zone2[i] = float(line[1])
    rainfall_zone3[i] = float(line[2])
    rainfall_zone4[i] = float(line[3])

f.close() 

Yes, it's clunky, but because you parse each line yourself, you can adapt the loop to cope with rows or columns of different lengths. Of course, for regular data like ours, I could just have done:

In [57]:
zone1, zone2, zone3, zone4 = np.loadtxt("test_rainfile_hourly.txt", usecols=(0,1,2,3), unpack=True)

In one line. But you live and learn. (This is the result of coming to Python after learning C++, which probably would require twice as many lines of code, and then you'd get a memory leak etc...)
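For the irregular case, where the hard way still earns its keep, here is a minimal sketch. The data is made up, with rows of varying lengths that np.loadtxt would refuse to load, and each row ends up as its own array.

```python
import io
import numpy as np

# Made-up irregular data: rows have 3, 1 and 2 values respectively.
text = "3.2 0.6 0.6\n0.0\n0.1 1.1\n"

ragged_rows = []
with io.StringIO(text) as f:     # stands in for open("some_file.txt")
    for line in f:
        fields = line.split()    # split() with no argument copes with repeated spaces
        if not fields:           # ignore blank lines
            continue
        ragged_rows.append(np.array(fields, dtype=float))

lengths = [len(row) for row in ragged_rows]
print(lengths)
```

The result is a plain Python list of arrays rather than one 2-D array, because a numpy array of floats needs every row to have the same length.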

Task

Work with your own data. Try loading in a textfile that contains data you use, with an appropriate method from the ones discussed above.