A brief introduction to Numpy

Numpy is the fundamental library for scientific computing in Python. It contains list-like objects that work like arrays, matrices, and data tables, which is how scientists typically expect data to behave. Numpy also provides linear algebra, Fourier transforms, random number generation, and tools for integrating C/C++ and Fortran code.

If you primarily want to work with tables of data, Pandas, which depends on Numpy, is probably the module that you want to start with.

Numpy Array Basics

Creating a Numpy array

In [1]:
import numpy as np

example_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
example_array
Out[1]:
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

Indexing an array

In [2]:
example_array[1, 1]
Out[2]:
5

Slicing an array

In [3]:
example_array[:, 0]
Out[3]:
array([1, 4, 7])
In [4]:
example_array[1, :]
Out[4]:
array([4, 5, 6])
In [5]:
example_array[1:3, 1:3]
Out[5]:
array([[5, 6],
       [8, 9]])

Subsetting an array

In [6]:
array1 = np.array([1, 1, 1, 2, 2, 2, 1])
array2 = np.array([1, 2, 3, 4, 5, 6, 7])
array2[array1==1]
Out[6]:
array([1, 2, 3, 7])
In [7]:
array3 = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'b'])
array2[(array1==1) & (array3=='a')]
Out[7]:
array([1, 2, 3])
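
The comparison inside the square brackets produces a boolean array (a mask), and indexing with that mask keeps only the positions where it is True. A minimal sketch reusing the arrays from above; the names mask and either are only illustrative.

```python
import numpy as np

array1 = np.array([1, 1, 1, 2, 2, 2, 1])
array2 = np.array([1, 2, 3, 4, 5, 6, 7])

# The comparison is evaluated element by element and yields booleans
mask = array1 == 1
print(mask)            # [ True  True  True False False False  True]

# Indexing with the mask keeps the elements where it is True
print(array2[mask])    # [1 2 3 7]

# Masks combine with & (and), | (or) and ~ (not); the parentheses are required
either = array2[(array1 == 2) | (array2 == 1)]
print(either)          # [1 4 5 6]
print(array2[~mask])   # [4 5 6]
```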

Math

Arrays

Math on arrays is vectorized: operations are applied element by element, which is how most scientists expect arrays to behave.

In [8]:
array1 = np.array([1, 1, 1, 2, 2, 2, 1])
array2 = np.array([1, 2, 3, 4, 5, 6, 7])

array1 * 2 + 1
Out[8]:
array([3, 3, 3, 5, 5, 5, 3])
In [9]:
array1 * array2
Out[9]:
array([ 1,  2,  3,  8, 10, 12,  7])
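
To make "vectorized" concrete, the sketch below compares the array expression with an explicit loop over the elements, and with what a plain Python list does for the same operator. The names vectorized and looped are only illustrative.

```python
import numpy as np

array1 = np.array([1, 1, 1, 2, 2, 2, 1])
array2 = np.array([1, 2, 3, 4, 5, 6, 7])

# Vectorized: one expression, applied to every pair of elements
vectorized = array1 * array2

# Equivalent explicit loop over the same pairs
looped = np.array([a * b for a, b in zip(array1, array2)])

print(vectorized)                          # [ 1  2  3  8 10 12  7]
print(np.array_equal(vectorized, looped))  # True

# A plain list repeats itself instead of multiplying its elements
print([1, 2, 3] * 2)                       # [1, 2, 3, 1, 2, 3]
```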

Linear algebra (matrices)

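Linear algebra can be done using a different data structure called a matrix, for which * means matrix multiplication rather than element-wise multiplication. Note that the current Numpy documentation recommends regular arrays over np.matrix for new code.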

In [10]:
matrix1 = np.matrix([[1, 2, 3], [4, 5, 6]])
matrix2 = np.matrix([1, 2, 3])
matrix1 * matrix2.transpose()
Out[10]:
matrix([[14],
        [32]])
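
The same product can also be computed with ordinary arrays. This is a sketch of that alternative, not the approach used in the cell above; the @ operator assumes Python 3.5 or later, and the names a, b, and product are only illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3])

# Matrix-vector product with ordinary arrays
product = np.dot(a, b)
print(product)                          # [14 32]

# The @ operator (Python 3.5+) computes the same product
print(np.array_equal(a @ b, product))   # True

# On arrays, * is still element-wise (b is applied to each row of a)
elementwise = a * b
print(elementwise)
```

Unlike the matrix version, the result here is a one-dimensional array rather than a 2 x 1 matrix.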

Importing and Exporting Data

The Numpy function genfromtxt is a powerful way to import text data. It can use different delimiters, skip header rows, control the type of imported data, give columns of data names, and handle a number of other useful cases. See the documentation for a full list of features, or run help(np.genfromtxt) from the Python shell (after importing the module, of course).
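
As a rough illustration of a few of these options, the sketch below reads a small semicolon-separated table from an in-memory string instead of a file. The text, its column names, and the variable names are made up for this example, and it assumes Python 3.

```python
import io
import numpy as np

# Hypothetical semicolon-separated text with a header row and one missing value
text = "id;length;weight\n1;2.0;3.0\n2;;6.0\n3;1.9;8.0\n"

table = np.genfromtxt(
    io.StringIO(text),   # any file-like object works in place of a filename
    delimiter=";",       # a different delimiter than the usual comma
    skip_header=1,       # ignore the line of column names
    usecols=(1, 2),      # keep only the second and third columns
)
print(table.shape)               # (3, 2)

# The empty field is read as nan rather than raising an error
print(np.isnan(table[1, 0]))     # True
```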

Basic Import and Export

Import

Basic imports using Numpy will treat all data as floats. If we're doing a basic import we'll typically want to skip the header row (since it's generally not composed of numbers).

In [11]:
data = np.genfromtxt('./data/examp_data.txt', delimiter=',', skip_header=1)
data
Out[11]:
array([[ 1. ,  2. ,  3. ],
       [ 2. ,  2.4,  6. ],
       [ 3. ,  1.9,  8. ]])

Export

In [12]:
np.savetxt('./data/examp_output.txt', data, delimiter=',')
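
A round trip is a quick way to check that export and import agree. The sketch below writes to a temporary file so that it does not depend on the ./data directory; the column names x, y, z and the file name are placeholders chosen for this example.

```python
import os
import tempfile
import numpy as np

data = np.array([[1.0, 2.0, 3.0], [2.0, 2.4, 6.0], [3.0, 1.9, 8.0]])

# Temporary location, so nothing in ./data is touched
path = os.path.join(tempfile.mkdtemp(), "roundtrip.txt")

# header adds a first line; comments="" stops savetxt prefixing it with "# "
np.savetxt(path, data, delimiter=",", header="x,y,z", comments="", fmt="%.1f")

# Read the file back, skipping the header line that was just written
reloaded = np.genfromtxt(path, delimiter=",", skip_header=1)
print(np.array_equal(data, reloaded))   # True
```

The fmt argument controls how each number is written. With one decimal place the values here survive the round trip exactly, but a format with too few digits would silently round the data.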

Importing Data Tables

Lots of scientific data comes in the form of tables, with one row per observation and one column per thing observed. Often the different columns have different types (including text). The best way to work with this type of data is in a Structured Array.

Import

To do this we let Numpy automatically detect the data types in each column using the optional argument dtype=None. We can also use an existing header row as the names for the columns using the optional argument names=True.

In [13]:
data = np.genfromtxt('./data/examp_data_species_mass.txt', dtype=None, names=True, delimiter=',')
data
Out[13]:
array([(1, 'DS', 125), (1, 'DM', 70), (2, 'DM', 55), (1, 'CB', 40),
       (2, 'DS', 110), (1, 'CB', 45)], 
      dtype=[('site', '<i8'), ('species', '|S2'), ('mass', '<i8')])

Export

The easiest way to export a structured array is to treat it like a list of lists and write it out with the csv module, using a function like this.

In [14]:
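import csv

def export_to_csv(data, filename):
    # Text mode with newline='' is what the csv module expects on Python 3
    # (on Python 2, open the file with mode 'wb' instead)
    with open(filename, 'w', newline='') as outputfile:
        datawriter = csv.writer(outputfile)
        datawriter.writerows(data)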
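
The function above writes only the rows. As a variation, the sketch below also writes the column names as a header row and converts the rows to plain Python tuples first. It assumes Python 3, builds a small structured array by hand rather than reading the example file, and writes to a temporary path chosen for this example.

```python
import csv
import os
import tempfile
import numpy as np

# A small structured array built by hand (values taken from the import example)
data = np.array(
    [(1, "DS", 125), (1, "DM", 70), (2, "DM", 55)],
    dtype=[("site", "i8"), ("species", "U2"), ("mass", "i8")],
)

path = os.path.join(tempfile.mkdtemp(), "species_mass.csv")
with open(path, "w", newline="") as outputfile:
    datawriter = csv.writer(outputfile)
    datawriter.writerow(data.dtype.names)   # header row from the column names
    datawriter.writerows(data.tolist())     # rows as plain Python tuples

# Read the file back as text to see what was written
with open(path) as f:
    written = f.read()
print(written)
```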

Structured Arrays

If we import data into a Structured Array we can do a lot of things that we often want to do with scientific data.

Selecting columns by name

In [15]:
data = np.genfromtxt('./data/examp_data_species_mass.txt', dtype=None, names=True, delimiter=',')
print(data)
data['species']
[(1, 'DS', 125) (1, 'DM', 70) (2, 'DM', 55) (1, 'CB', 40) (2, 'DS', 110)
 (1, 'CB', 45)]
Out[15]:
array(['DS', 'DM', 'DM', 'CB', 'DS', 'CB'], 
      dtype='|S2')

Subset columns based on the values in other columns

In [16]:
data['mass'][data['species'] == 'DM']
Out[16]:
array([70, 55])
In [17]:
data['mass'][(data['species'] == 'DM') & (data['site'] == 1)]
Out[17]:
array([70])
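
The same masks can select whole rows, and they combine with ordinary array math. The sketch below rebuilds the example table by hand so that it runs without the data file, then computes a mean mass per species; the names site1 and mean_mass are only illustrative.

```python
import numpy as np

data = np.array(
    [(1, "DS", 125), (1, "DM", 70), (2, "DM", 55),
     (1, "CB", 40), (2, "DS", 110), (1, "CB", 45)],
    dtype=[("site", "i8"), ("species", "U2"), ("mass", "i8")],
)

# Indexing the whole array with a mask returns complete rows
site1 = data[data["site"] == 1]
print(len(site1))   # 4

# Mean mass for each species, one mask per unique species code
mean_mass = {}
for species in np.unique(data["species"]):
    masses = data["mass"][data["species"] == species]
    mean_mass[str(species)] = float(masses.mean())
print(mean_mass)    # {'CB': 42.5, 'DM': 62.5, 'DS': 117.5}
```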

Random number generation

Random uniform (0 to 1)

In [18]:
np.random.rand(3, 5)
Out[18]:
array([[ 0.03414585,  0.83900235,  0.93206285,  0.06820967,  0.70145045],
       [ 0.552352  ,  0.76730225,  0.06316622,  0.71285231,  0.81976971],
       [ 0.39709379,  0.71772434,  0.21598482,  0.96412023,  0.69841293]])

Random normal (mean=0, stdev=1)

In [19]:
np.random.randn(4, 2)
Out[19]:
array([[ 0.04802043, -0.89025722],
       [ 0.46246887,  1.11994326],
       [-0.95655129,  0.76707094],
       [-1.61019706, -0.21933367]])

Random integers

In [20]:
# low is inclusive, high is exclusive; these names avoid shadowing the built-in min and max
low = 10
high = 20
np.random.randint(low, high, [10, 2])
Out[20]:
array([[13, 10],
       [18, 10],
       [10, 12],
       [10, 16],
       [17, 12],
       [17, 13],
       [19, 16],
       [11, 17],
       [11, 16],
       [16, 18]])
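
Because these functions return different numbers on every run, the outputs shown above will not match yours. If repeatable results are needed, one option is to seed the generator first. A minimal sketch; the seed value 42 is arbitrary.

```python
import numpy as np

# Seeding the generator makes the "random" draws repeatable
np.random.seed(42)
first = np.random.randint(10, 20, [10, 2])

# Resetting the same seed replays the same sequence
np.random.seed(42)
second = np.random.randint(10, 20, [10, 2])

print(np.array_equal(first, second))            # True

# The lower bound is inclusive and the upper bound is exclusive
print(first.min() >= 10 and first.max() < 20)   # True
```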