Python data structures

When working with data in Python, you will be using data structures from core Python as well as from several different libraries (especially Numpy and Pandas). Indexing and slicing operations for these data structures are fairly straightforward, but there are a few quirks to be aware of. This notebook provides a brief overview of core Python lists and dictionaries, Numpy ndarrays, and Pandas Series and DataFrame objects.

In [1]:
import numpy as np
import pandas as pd

One-dimensional data structures

Here is a core Python list. These lists only allow position-based indexing. Note that the first index is 0.

In [2]:
x = [1, 3, 2]
print(x[1])
3

Here is a core Python dictionary. Dictionaries only allow label-based indexing (the labels are called "keys" in this setting).

In [3]:
x = {"a": 25, "b": -1, "c": 0}
print(x["b"])
-1
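A missing key raises a KeyError; the get method returns a default instead. A minimal sketch using the same dictionary:

```python
x = {"a": 25, "b": -1, "c": 0}

# Direct indexing raises KeyError for a missing key.
try:
    x["z"]
except KeyError:
    print("no such key")

# get() returns None (or a supplied default) instead of raising.
print(x.get("z"))      # None
print(x.get("z", 99))  # 99
```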

Here is a Pandas Series (a one-dimensional homogeneous data structure).

In [4]:
x = pd.Series([3, 1, 7, 99, 0], index=["a", "b", "c", "d", "e"])
print(x)
a     3
b     1
c     7
d    99
e     0
dtype: int64

Pandas Series allow label-based indexing. Here are three equivalent approaches:

In [5]:
print(x["b"])
1
In [6]:
print(x.loc["b"])
1
In [7]:
print(x.b)
1

Pandas Series also allow position-based indexing:

In [8]:
print(x.iloc[1])
1
In [9]:
print(x.iloc[-2])
99

Another Pandas Series:

In [10]:
y = pd.Series([5, 13, 7], index=["b", "e", "f"])
print(y)
b     5
e    13
f     7
dtype: int64

Two Series can be added (and subtracted, etc.). Note that if a label is missing from either summand, the result at that label is NaN.

In [11]:
print(x + y)
a     NaN
b     6.0
c     NaN
d     NaN
e    13.0
f     NaN
dtype: float64
In [12]:
x.add(y)
Out[12]:
a     NaN
b     6.0
c     NaN
d     NaN
e    13.0
f     NaN
dtype: float64

Passing a fill_value substitutes that value for any missing element before the addition is performed:

In [13]:
x.add(y, fill_value=0)
Out[13]:
a     3.0
b     6.0
c     7.0
d    99.0
e    13.0
f     7.0
dtype: float64

Here is a Numpy ndarray. These are homogeneous data structures that allow only position-based indexing.

In [14]:
x = np.array([4, 2, 8, 5, 6, 1])
print(x[2])
print(x[-1])
8
1

Here we take a position-based slice of a Numpy ndarray. Note that the end point is not included in the slice.

In [15]:
x[2:4]
Out[15]:
array([8, 5])

Here is how we do position-based slicing for a Pandas Series (note that the end point of the range is not included in the slice):

In [16]:
x = pd.Series([5, 2, 3, 1, 4], index=["a", "b", "c", "d", "e"])
print(x.iloc[1:3])
b    2
c    3
dtype: int64

Here is how we do label-based slicing for a Pandas Series. Note that the end point of the range is included in the slice.

In [17]:
x.loc["b":"d"]
Out[17]:
b    2
c    3
d    1
dtype: int64

Two-dimensional data structures

You can create a two-dimensional core Python "list" by nesting two one-dimensional lists:

In [18]:
x = [[1, 3], [5, 2]]
print(x)
[[1, 3], [5, 2]]
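Indexing a nested list is done one level at a time, with one pair of brackets per level:

```python
x = [[1, 3], [5, 2]]

print(x[0])     # first inner list: [1, 3]
print(x[0][1])  # second element of the first inner list: 3
```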

The two-dimensional data structure in Pandas is called a DataFrame.

In [19]:
x = pd.DataFrame([[1, 3], [5, 8], [2, 1]], columns=["a", "b"], index=["row1", "row2", "row3"])
print(x)
      a  b
row1  1  3
row2  5  8
row3  2  1

Here is how we do label-based indexing of a DataFrame object.

In [20]:
x.loc["row1", "b"]
Out[20]:
3

Here is how we do position-based indexing of a DataFrame object.

In [21]:
x.iloc[0, 1]
Out[21]:
3

Mixing label-based and position-based indexing is a bit trickier.

In [22]:
x.loc["row1", :].iloc[1]
Out[22]:
3
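An equivalent approach is to convert a label to its integer position with Index.get_loc and then use iloc throughout; a sketch with the same data frame:

```python
import pandas as pd

x = pd.DataFrame([[1, 3], [5, 8], [2, 1]], columns=["a", "b"],
                 index=["row1", "row2", "row3"])

# Translate labels to integer positions, then index purely by position.
i = x.index.get_loc("row1")   # 0
j = x.columns.get_loc("b")    # 1
print(x.iloc[i, j])           # 3
```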

Here is how we do label-based slicing with a DataFrame (note that this is a Series):

In [23]:
x.loc["row1", :]
Out[23]:
a    1
b    3
Name: row1, dtype: int64
In [24]:
x.loc[:, "b"]
Out[24]:
row1    3
row2    8
row3    1
Name: b, dtype: int64

Here is how we do position-based slicing with a DataFrame:

In [26]:
x.iloc[1, :]
Out[26]:
a    5
b    8
Name: row2, dtype: int64
In [27]:
x.iloc[:, 0]
Out[27]:
row1    1
row2    5
row3    2
Name: a, dtype: int64

Here is how we do position-based slicing on two axes with a DataFrame:

In [28]:
x.iloc[0:2, :]
Out[28]:
      a  b
row1  1  3
row2  5  8
In [29]:
x.iloc[:, 0:2]
Out[29]:
      a  b
row1  1  3
row2  5  8
row3  2  1

If you want to select a non-contiguous set of rows, use a list of positions:

In [30]:
x.iloc[[0,2], :]
Out[30]:
      a  b
row1  1  3
row3  2  1

You can also index with a boolean vector:

In [31]:
x.loc[[True, False, True], :]
Out[31]:
      a  b
row1  1  3
row3  2  1

Boolean indexing is useful if you want to select rows (or columns) based on a query:

In [32]:
ix = x.a < 3
x.loc[ix, :]
Out[32]:
      a  b
row1  1  3
row3  2  1
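Conditions can be combined with & (and) and | (or); the parentheses are required because of operator precedence. A sketch using the same data frame:

```python
import pandas as pd

x = pd.DataFrame([[1, 3], [5, 8], [2, 1]], columns=["a", "b"],
                 index=["row1", "row2", "row3"])

# Select rows where a < 3 AND b > 2; only row1 satisfies both.
ix = (x.a < 3) & (x.b > 2)
print(x.loc[ix, :])
```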

A Numpy ndarray can be two-dimensional:

In [33]:
x = np.array([[1, 4], [-5, 3], [3, 4]])
print(x)
[[ 1  4]
 [-5  3]
 [ 3  4]]

Here is how we do position-based indexing and slicing with a two-dimensional ndarray:

In [34]:
x[0, 1]
Out[34]:
4
In [35]:
x[1, :]
Out[35]:
array([-5,  3])
In [36]:
x[:, 1]
Out[36]:
array([4, 3, 4])
In [37]:
x[[0, 2], :]
Out[37]:
array([[1, 4],
       [3, 4]])

Boolean masks used in indexing must be Numpy arrays, not Python lists. Here the Python list of booleans is interpreted as a list of integer positions (0 and 1), not as a mask; note that recent versions of Numpy do treat a list of booleans as a mask, so this behavior depends on the Numpy version.

In [38]:
print(x[:, np.r_[False, True]])
print(x[:, [False, True]])
[[4]
 [3]
 [4]]
[[ 1  4]
 [-5  3]
 [ 3  4]]
In [39]:
x[-1, :]
Out[39]:
array([3, 4])
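A comparison applied to an ndarray produces a Numpy boolean array directly, so query-style row selection works much as it does in Pandas:

```python
import numpy as np

x = np.array([[1, 4], [-5, 3], [3, 4]])

# Boolean mask: rows whose first column is positive.
mask = x[:, 0] > 0
print(mask)        # [ True False  True]
print(x[mask, :])  # rows 0 and 2
```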

Shortcuts

The safest and fastest way to index a Pandas data structure is using loc (for label-based indexing) and iloc (for position-based indexing). Next we look at some shortcuts that can also be used.

In [40]:
x = pd.DataFrame([[1,3], [2,0], [1,1]], index=("a", "b", "c"), columns=("v1", "v2"))
print(x)
   v1  v2
a   1   3
b   2   0
c   1   1

Here are three ways to access a column of a Pandas DataFrame:

In [41]:
x.loc[:, "v1"]
Out[41]:
a    1
b    2
c    1
Name: v1, dtype: int64
In [42]:
x["v1"]
Out[42]:
a    1
b    2
c    1
Name: v1, dtype: int64
In [43]:
x.v1
Out[43]:
a    1
b    2
c    1
Name: v1, dtype: int64

Here are two ways to access a row of a Pandas DataFrame.

In [44]:
x.loc["a", :]
Out[44]:
v1    1
v2    3
Name: a, dtype: int64
In [45]:
x.loc["a"]
Out[45]:
v1    1
v2    3
Name: a, dtype: int64
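The attribute shortcut (x.v1) fails when a column name collides with an existing DataFrame attribute or is not a valid Python identifier; bracket or loc indexing always works. A sketch using an assumed column named "mean":

```python
import pandas as pd

x = pd.DataFrame({"mean": [1, 2, 3], "v2": [4, 5, 6]})

# x.mean resolves to the DataFrame method, not the column:
print(callable(x.mean))  # True
# Bracket indexing retrieves the column unambiguously:
print(x["mean"])
```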

Data types

Core Python lists and dictionaries are heterogeneous containers, meaning they can hold values of any combination of types:

In [46]:
x = [3, 9., "apple"]
[type(v) for v in x]
Out[46]:
[int, float, str]

Numpy ndarray and Pandas Series objects are homogeneous and hence have a single type (the type can be 'object', which provides the behavior of a heterogeneous container, but we won't get into that here). Here is how you specify and discover the data type of a Numpy ndarray. Pandas Series objects behave exactly the same way.

In [47]:
w = np.array([3, 2, 1], dtype=np.int32)
x = np.array([3.5, 2, 1], dtype=np.float64)
y = np.array([5, 2, 3])
z = np.array([5., 2, 3])
print(w.dtype)
print(x.dtype)
print(y.dtype)
print(z.dtype)
int32
float64
int64
float64
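An existing array can be converted to a different type with the astype method, which returns a new array and leaves the original unchanged:

```python
import numpy as np

y = np.array([5, 2, 3])    # an integer dtype (int64 on most platforms)
z = y.astype(np.float64)   # new float64 array; y is not modified
print(y.dtype, z.dtype)
```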

Each column of a Pandas DataFrame has a single type, but different columns in the same data frame can have different types.

In [48]:
df = pd.DataFrame(index=range(3), columns=["a", "b", "c"])
df["a"] = [3, 2, 1]
df["b"] = ["x", "y", "z"]
df["c"] = [1.3, 2.5, 5.2]
print(df.dtypes)
a      int64
b     object
c    float64
dtype: object

Data structure types, conversions, and introspection

Here we create an ndarray from a core Python list:

In [49]:
w =  [3, 5, 2]
x = np.array(w)
z = np.array(w, dtype=np.float64)

You can use asarray to convert a sequence-like object to an ndarray while avoiding a copy of the data when possible.

In [50]:
w = [1, 2, 3]
x = np.asarray(w)
y = np.asarray(x)
print(id(w), id(x), id(y))
(139923570100200, 22326400, 22326400)
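np.shares_memory can confirm whether asarray copied the data; a list must always be copied, while passing in an existing ndarray (of a compatible dtype) returns the same object:

```python
import numpy as np

w = [1, 2, 3]
x = np.asarray(w)   # copies: a list is not an ndarray
y = np.asarray(x)   # no copy: x is already an ndarray
print(np.shares_memory(x, y))  # True
print(x is y)                  # True
```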

Here we create a DataFrame from an ndarray; note the default (integer) row and column labels:

In [51]:
x = np.array([[3, 2, 1, 6, 5], [3, 2, 7, 6, 6]])
y = pd.DataFrame(x)
print(y)
   0  1  2  3  4
0  3  2  1  6  5
1  3  2  7  6  6

You can also use np.asarray to convert a DataFrame or Series to an ndarray.

In [52]:
np.asarray(y)
Out[52]:
array([[3, 2, 1, 6, 5],
       [3, 2, 7, 6, 6]])

The dir function lists all the attributes and methods of an object. Names that begin and end with two underscores are special ("dunder") methods that support Python's operator syntax and are not usually called directly.

In [53]:
print(dir(z))
['T', '__abs__', '__add__', '__and__', '__array__', '__array_finalize__', '__array_interface__', '__array_prepare__', '__array_priority__', '__array_struct__', '__array_wrap__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__delslice__', '__div__', '__divmod__', '__doc__', '__eq__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__hex__', '__iadd__', '__iand__', '__idiv__', '__ifloordiv__', '__ilshift__', '__imod__', '__imul__', '__index__', '__init__', '__int__', '__invert__', '__ior__', '__ipow__', '__irshift__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__long__', '__lshift__', '__lt__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__oct__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rlshift__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rrshift__', '__rshift__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setslice__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__', 'all', 'any', 'argmax', 'argmin', 'argpartition', 'argsort', 'astype', 'base', 'byteswap', 'choose', 'clip', 'compress', 'conj', 'conjugate', 'copy', 'ctypes', 'cumprod', 'cumsum', 'data', 'diagonal', 'dot', 'dtype', 'dump', 'dumps', 'fill', 'flags', 'flat', 'flatten', 'getfield', 'imag', 'item', 'itemset', 'itemsize', 'max', 'mean', 'min', 'nbytes', 'ndim', 'newbyteorder', 'nonzero', 'partition', 'prod', 'ptp', 'put', 'ravel', 'real', 'repeat', 'reshape', 'resize', 'round', 'searchsorted', 'setfield', 'setflags', 'shape', 'size', 'sort', 'squeeze', 'std', 'strides', 'sum', 'swapaxes', 'take', 'tofile', 'tolist', 'tostring', 'trace', 'transpose', 'var', 'view']

Now we can get the docstring for a specific method:

In [54]:
help(z.trace)
Help on built-in function trace:

trace(...)
    a.trace(offset=0, axis1=0, axis2=1, dtype=None, out=None)
    
    Return the sum along diagonals of the array.
    
    Refer to `numpy.trace` for full documentation.
    
    See Also
    --------
    numpy.trace : equivalent function

Reference behavior

When you slice an ndarray, you often get a view of the array you sliced from, not a copy. This means that if you modify the slice, you also change the corresponding values in the parent array. To illustrate this, let's first create an array and take a slice from it.

In [55]:
x = np.array([[1, 4], [3,2], [5,6]])
y = x[1, :]
print(x)
print("\n")
print(y)       
[[1 4]
 [3 2]
 [5 6]]


[3 2]

Now we change a value in the slice, and check the state of the parent array (x):

In [56]:
y[1] = 88
print(y)
print("\n")
print(x)
[ 3 88]


[[ 1  4]
 [ 3 88]
 [ 5  6]]

If you want to avoid this behavior, create a copy.

In [57]:
y = x[1, :].copy()
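After taking a copy, modifying the slice no longer affects the parent array:

```python
import numpy as np

x = np.array([[1, 4], [3, 2], [5, 6]])
y = x[1, :].copy()  # independent copy, not a view
y[1] = 88
print(y)  # [ 3 88]
print(x)  # unchanged
```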

Merging

There are several ways to merge data frames in Pandas. One approach is to use the 'merge' function. To illustrate, first we construct two data frames:

In [58]:
df1 = pd.DataFrame(np.random.normal(size=(4, 2)), index=[1, 3, 4, 5], columns=("A", "B"))
df2 = pd.DataFrame(np.random.normal(size=(5, 3)), index=[1, 2, 5, 6, 7], columns=("C", "D", "E"))
print(df1)
print(df2)
          A         B
1  0.286263  0.299427
3  1.412048 -0.065437
4  0.728740  0.089141
5  0.604061  0.050104
          C         D         E
1 -0.533990  0.535573  0.173613
2  1.639590 -1.876353 -0.539917
5 -1.150531  0.534513  1.206041
6  0.241901 -0.297153 -0.594572
7  0.762719 -1.392482 -0.102770

We can merge these two data frames on their indices. The default merge is an "inner merge" that uses only the index values that are present in both data frames.

In [59]:
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print(df3)
          A         B         C         D         E
1  0.286263  0.299427 -0.533990  0.535573  0.173613
5  0.604061  0.050104 -1.150531  0.534513  1.206041

An "outer merge" uses the index values that are present in either of the two data frames being merged:

In [60]:
df4 = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
print(df4)
          A         B         C         D         E
1  0.286263  0.299427 -0.533990  0.535573  0.173613
2       NaN       NaN  1.639590 -1.876353 -0.539917
3  1.412048 -0.065437       NaN       NaN       NaN
4  0.728740  0.089141       NaN       NaN       NaN
5  0.604061  0.050104 -1.150531  0.534513  1.206041
6       NaN       NaN  0.241901 -0.297153 -0.594572
7       NaN       NaN  0.762719 -1.392482 -0.102770

We can also merge using only the keys from the first data frame:

In [61]:
df5 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
print(df5)
          A         B         C         D         E
1  0.286263  0.299427 -0.533990  0.535573  0.173613
3  1.412048 -0.065437       NaN       NaN       NaN
4  0.728740  0.089141       NaN       NaN       NaN
5  0.604061  0.050104 -1.150531  0.534513  1.206041

We can also merge on non-unique values. This can be done using an index with non-unique values, or using an arbitrary column of the data frame as the merge keys. Here we demonstrate the latter.

In [62]:
df1 = pd.DataFrame(index=range(5), columns=("A", "B"))
df1["A"] = [1, 2, 2, 3, 3]
df1["B"] = np.random.normal(size=5)
df2 = pd.DataFrame(index=range(4), columns=("C", "D"))
df2["C"] = [1, 1, 2, 2]
df2["D"] = np.random.normal(size=4)
print(df1)
print(df2)
   A         B
0  1  0.277073
1  2 -1.227416
2  2  0.627276
3  3  0.297944
4  3  0.677057
   C         D
0  1 -0.314053
1  1 -1.539383
2  2 -0.546884
3  2 -1.582331
In [63]:
df3 = pd.merge(df1, df2, left_on="A", right_on="C")
print(df3)
   A         B  C         D
0  1  0.277073  1 -0.314053
1  1  0.277073  1 -1.539383
2  2 -1.227416  2 -0.546884
3  2 -1.227416  2 -1.582331
4  2  0.627276  2 -0.546884
5  2  0.627276  2 -1.582331
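The indicator argument of merge adds a column showing, for each row of an outer merge, which data frame(s) it came from; a minimal sketch on small assumed frames:

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2], "A": [10, 20]})
df2 = pd.DataFrame({"key": [2, 3], "B": [30, 40]})

# The added _merge column is 'left_only', 'both', or 'right_only'.
m = pd.merge(df1, df2, on="key", how="outer", indicator=True)
print(m)
```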