Core Python, Pandas, and Numpy data structures

When working with data in Python, you will be using data structures from core Python as well as from several libraries, especially Numpy and Pandas. This notebook provides a brief overview of some useful data structures.

Core Python data structures

Numpy data structures

Pandas data structures

Core Python data structures

Here is a core Python list. The list uses position-based indexing and slicing. Note that the first position has index value 0.

In [1]:
x = [1, 3, 2, 5, 4]
print(x[1])
3

Next we take a slice from the list:

In [2]:
print(x[2:4])
[2, 5]

Here are some simple "slicing tricks":

In [3]:
print(x[2:])
print(x[:2])
print(x[::2])
print(x[-2:])
[2, 5, 4]
[1, 3]
[1, 2, 4]
[5, 4]

Core Python lists are "heterogeneous containers", meaning that they can hold arbitrary data, including other lists.

In [4]:
x = [1, "apple", [3, "cat"]]

We can access the type information with the type function:

In [5]:
[type(v) for v in x]
Out[5]:
[int, str, list]

A tuple is closely related to a list, but it is "frozen" (or "immutable"), meaning that it cannot be changed after it is created.

In [6]:
x = (1, 3, 2, 5, (4, 6), ("cat", "dog"))
print(x)
(1, 3, 2, 5, (4, 6), ('cat', 'dog'))
In [7]:
print(x[3])
5

If we try to change the contents of the tuple an exception results:

In [8]:
x[3] = 99
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-747886e126e4> in <module>()
----> 1 x[3] = 99

TypeError: 'tuple' object does not support item assignment
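For contrast, a list permits exactly the item assignment that fails for a tuple; a quick sketch:

```python
# Unlike tuples, lists are mutable, so item assignment succeeds.
x = [1, 3, 2, 5, 4]
x[3] = 99
print(x)  # [1, 3, 2, 99, 4]
```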

Here is a core Python dictionary. A dictionary is equivalent to data structures called "hashmaps", "associative arrays", and "key/value stores" in other languages.

In [9]:
x = {"a": 25, "b": -1, "c": 0}
print(x["b"])
-1

A "set" is basically a dictionary without values. It behaves like a mathematical set (an unordered collection with no duplicate values).

In [10]:
x = {3, 1, 2}
In [11]:
x.add(99)
In [12]:
3 in x
Out[12]:
True
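Since sets behave like mathematical sets, they also support the usual set algebra; a quick sketch:

```python
a = {1, 2, 3}
b = {2, 3, 4}
print(a | b)  # union: {1, 2, 3, 4}
print(a & b)  # intersection: {2, 3}
print(a - b)  # difference: {1}
```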

Python data structures support various forms of iteration. First, we simply iterate through a list:

In [13]:
x = ["a", "c", "d", "g", "b", "i"]
for v in x:
    print(v)
a
c
d
g
b
i

enumerate helps us keep track of the position:

In [14]:
for j,v in enumerate(x):
    print(j, v)
0 a
1 c
2 d
3 g
4 b
5 i

Sometimes we can use a list comprehension instead of a traditional loop:

In [15]:
y = [v + v for v in x]
print(y)
['aa', 'cc', 'dd', 'gg', 'bb', 'ii']
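The comprehension above is equivalent to this traditional loop:

```python
x = ["a", "c", "d", "g", "b", "i"]
y = []
for v in x:
    y.append(v + v)  # same result as [v + v for v in x]
print(y)  # ['aa', 'cc', 'dd', 'gg', 'bb', 'ii']
```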

We can also iterate over dictionaries:

In [16]:
x = {"a": 25, "b": -1, "c": 0}
for k in x:
    print(k, x[k])
a 25
b -1
c 0
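The items method yields key/value pairs directly, which is often tidier than indexing back into the dictionary:

```python
x = {"a": 25, "b": -1, "c": 0}
for k, v in x.items():  # iterate over (key, value) pairs
    print(k, v)
```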

There are many Python libraries that implement useful algorithms using core Python data structures, for example the toolz library.

Numpy data structures

The main numpy data structure is the ndarray, which stands for "n-dimensional array". First, since numpy is a library, we need to import it.

In [17]:
import numpy as np

An ndarray is a homogeneous rectangular data structure. Being "homogeneous" means that all data values in the container must have the same type. Numpy supports many data types. In the following cell, we create a 1-dimensional ndarray holding 8-byte floating point (double precision) values.

In [18]:
x = np.asarray([4, 1, 5, 4, 7, 3, 0], dtype=np.float64)

Since the data are all integers, we could have used an integer data type instead:

In [19]:
x = np.asarray([4, 1, 5, 4, 7, 3, 0], dtype=np.int64)

We can even store them as single byte values (since none of the values exceeds 255):

In [20]:
x = np.asarray([4, 1, 5, 4, 7, 3, 0], dtype=np.uint8)
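One caution when choosing a small data type: arithmetic that leaves the representable range wraps around silently rather than raising an error. A minimal sketch:

```python
import numpy as np

x = np.asarray([250, 5], dtype=np.uint8)
y = x + np.uint8(10)   # 260 does not fit in one byte; values wrap modulo 256
print(y)               # [ 4 15]
print(y.dtype)         # uint8
```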

We can index and slice an ndarray just like we index and slice a Python list:

In [21]:
print(x[2])
print(x[3:5])
5
[4 7]

In addition, ndarrays support two types of indexing that core Python lists do not. We can index with a Boolean array:

In [22]:
ii = np.asarray([False, False, True, False, True, False, False])
print(x[ii])
[5 7]
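In practice, the Boolean mask is usually produced by an elementwise comparison rather than written out by hand:

```python
import numpy as np

x = np.asarray([4, 1, 5, 4, 7, 3, 0])
ii = x > 3             # elementwise comparison gives a Boolean array
print(ii)              # [ True False  True  True  True False False]
print(x[ii])           # [4 5 4 7]
```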

We can also index using a list of positions:

In [23]:
ix = np.asarray([0, 3, 3, 5])
print(x[ix])
[4 4 4 3]

We can do elementwise arithmetic using numpy arrays as long as they are conformable (or can be broadcast to be conformable, but that is a more advanced topic). Note that numerical types are "upcast".

In [24]:
y = np.asarray([0, 1, 0, -1, 1, 1, -2], dtype=np.float64)
z = x + y
print(z)
print(z.dtype)
[ 4.  2.  5.  3.  8.  4. -2.]
float64
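As a taste of broadcasting, here is a common pattern: subtracting a 1-dimensional array of column means from a 2-dimensional array. The shape (4,) array is automatically stretched across the 3 rows.

```python
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)  # shape (3, 4)
m = x.mean(axis=0)                                 # column means, shape (4,)
centered = x - m                                   # m broadcasts across the rows
print(centered.mean(axis=0))                       # each column now has mean 0
```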

An ndarray can have multiple dimensions:

In [25]:
x = np.zeros((4, 3))
print(x)
[[ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]
 [ 0.  0.  0.]]
In [26]:
x = np.zeros((4, 3, 2))
print(x)
[[[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]

 [[ 0.  0.]
  [ 0.  0.]
  [ 0.  0.]]]

Slicing ndarrays with multiple dimensions is straightforward:

In [27]:
x = np.random.normal(size=(3, 4))
print(x)
print(x[1, :])
print(x[1:3, 2:4])
[[-1.42283865  1.24671798 -1.58036635  0.86115185]
 [ 1.25666217  2.47697181  0.32879628 -1.03057258]
 [ 0.8517466  -0.25932041 -1.58917537 -0.83187607]]
[ 1.25666217  2.47697181  0.32879628 -1.03057258]
[[ 0.32879628 -1.03057258]
 [-1.58917537 -0.83187607]]

Pandas data structures

Pandas is a library that provides data structures that can be used to manipulate heterogeneous data sets. It is a library, so we begin by importing it:

In [28]:
import pandas as pd

Below we create a Pandas Series (a one-dimensional homogeneous data structure). All Pandas data structures are "indexed", meaning that each axis is labeled with arbitrary keys. The index is backed by a hash table, so label-based element access is efficient.

In [29]:
x = pd.Series([3, 1, 7, 99, 0], index=["a", "b", "c", "d", "e"])
print(x)
a     3
b     1
c     7
d    99
e     0
dtype: int64

Pandas Series allow label-based indexing; here are three equivalent approaches:

In [30]:
print(x["b"])
1
In [31]:
print(x.loc["b"])
1
In [32]:
print(x.b)
1

Pandas Series objects also allow position-based indexing:

In [33]:
print(x.iloc[1])
1
In [34]:
print(x.iloc[-2])
99

Next we create another Pandas Series:

In [35]:
y = pd.Series([5, 13, 7], index=["b", "e", "f"])
print(y)
b     5
e    13
f     7
dtype: int64

Two Series can be added (and subtracted, etc.). Note that if a label is present in only one of the two summands, the result at that label is NaN.

In [36]:
print(x + y)
a   NaN
b     6
c   NaN
d   NaN
e    13
f   NaN
dtype: float64

Here is another way to do the same thing:

In [37]:
x.add(y)
Out[37]:
a   NaN
b     6
c   NaN
d   NaN
e    13
f   NaN
dtype: float64

A "fill value" is used in place of any missing value:

In [38]:
x.add(y, fill_value=0)
Out[38]:
a     3
b     6
c     7
d    99
e    13
f     7
dtype: float64

Here is how we do position-based slicing for a Pandas Series (note that the end point of the range is not included in the slice):

In [39]:
x = pd.Series([5, 2, 3, 1, 4], index=["a", "b", "c", "d", "e"])
print(x.iloc[1:3])
b    2
c    3
dtype: int64

Here is how we do label-based slicing for a Pandas Series. Note that the end point of the range is included in the slice.

In [40]:
x.loc["b":"d"]
Out[40]:
b    2
c    3
d    1
dtype: int64

The two-dimensional data structure in Pandas is called a DataFrame.

In [41]:
x = pd.DataFrame([[1, 3], [5, 8], [2, 1]], columns=["a", "b"], index=["row1", "row2", "row3"])
print(x)
      a  b
row1  1  3
row2  5  8
row3  2  1

Here is how we do label-based indexing of a DataFrame object.

In [42]:
x.loc["row1", "b"]
Out[42]:
3

Here is how we do position-based indexing of a DataFrame object.

In [43]:
x.iloc[0, 1]
Out[43]:
3

Mixing label-based and position-based indexing is a bit trickier.

In [44]:
x.loc["row1", :].iloc[1]
Out[44]:
3

Here is how we do label-based slicing with a DataFrame (note that the result is a Series):

In [45]:
x.loc["row1", :]
Out[45]:
a    1
b    3
Name: row1, dtype: int64
In [46]:
x.loc[:, "b"]
Out[46]:
row1    3
row2    8
row3    1
Name: b, dtype: int64

Here is how we do position-based slicing with a DataFrame:

In [48]:
x.iloc[1, :]
Out[48]:
a    5
b    8
Name: row2, dtype: int64
In [49]:
x.iloc[:, 0]
Out[49]:
row1    1
row2    5
row3    2
Name: a, dtype: int64

Here is how we do position-based slicing on two axes with a DataFrame:

In [50]:
x.iloc[0:2, :]
Out[50]:
      a  b
row1  1  3
row2  5  8
In [51]:
x.iloc[:, 0:2]
Out[51]:
      a  b
row1  1  3
row2  5  8
row3  2  1

If you want to select a non-contiguous set of rows, use a list of positions:

In [52]:
x.iloc[[0,2], :]
Out[52]:
      a  b
row1  1  3
row3  2  1

You can also index with a boolean vector:

In [53]:
x.loc[[True, False, True], :]
Out[53]:
      a  b
row1  1  3
row3  2  1

Boolean indexing is useful if you want to select rows (or columns) based on a query:

In [54]:
ix = x.a < 3
x.loc[ix, :]
Out[54]:
      a  b
row1  1  3
row3  2  1
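Conditions can be combined with the elementwise operators & (and), | (or), and ~ (not); note that each condition must be parenthesized:

```python
import pandas as pd

x = pd.DataFrame([[1, 3], [5, 8], [2, 1]],
                 columns=["a", "b"], index=["row1", "row2", "row3"])
ix = (x.a < 3) & (x.b >= 3)   # both conditions must hold
print(x.loc[ix, :])           # only row1 satisfies both
```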

Shortcuts

The safest and clearest way to index a Pandas data structure is to use loc for label-based indexing and iloc for position-based indexing. Next we look at some shortcuts that can also be used.

In [55]:
x = pd.DataFrame([[1,3], [2,0], [1,1]], index=("a", "b", "c"), columns=("v1", "v2"))
print(x)
   v1  v2
a   1   3
b   2   0
c   1   1

Here are three ways to access a column of a Pandas DataFrame:

In [56]:
x.loc[:, "v1"]
Out[56]:
a    1
b    2
c    1
Name: v1, dtype: int64
In [57]:
x["v1"]
Out[57]:
a    1
b    2
c    1
Name: v1, dtype: int64
In [58]:
x.v1
Out[58]:
a    1
b    2
c    1
Name: v1, dtype: int64

Here are two ways to access a row of a Pandas DataFrame.

In [59]:
x.loc["a", :]
Out[59]:
v1    1
v2    3
Name: a, dtype: int64
In [60]:
x.loc["a"]
Out[60]:
v1    1
v2    3
Name: a, dtype: int64

Each column of a Pandas DataFrame has a single type, but different columns in the same data frame can have different types.

In [61]:
df = pd.DataFrame(index=range(3), columns=["a", "b", "c"])
df["a"] = [3, 2, 1]
df["b"] = ["x", "y", "z"]
df["c"] = [1.3, 2.5, 5.2]
print(df)
print(df.dtypes)
   a  b    c
0  3  x  1.3
1  2  y  2.5
2  1  z  5.2
a      int64
b     object
c    float64
dtype: object
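Because each column carries its own type, we can also select columns by type; the select_dtypes method does this (here keeping only the numeric columns):

```python
import pandas as pd

df = pd.DataFrame({"a": [3, 2, 1], "b": ["x", "y", "z"], "c": [1.3, 2.5, 5.2]})
numeric = df.select_dtypes(include="number")  # drops the string column "b"
print(numeric.columns.tolist())               # ['a', 'c']
```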

Help and documentation

The dir function lists all the attributes and methods of an object. Names that begin and end with two underscores are special ("dunder") methods that implement operators and other protocols; names beginning with a single underscore are private by convention.

In [62]:
print(dir(z))
['T', '__abs__', '__add__', '__and__', '__array__', '__array_finalize__', '__array_interface__', '__array_prepare__', '__array_priority__', '__array_struct__', '__array_wrap__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__delslice__', '__div__', '__divmod__', '__doc__', '__eq__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getslice__', '__gt__', '__hash__', '__hex__', '__iadd__', '__iand__', '__idiv__', '__ifloordiv__', '__ilshift__', '__imod__', '__imul__', '__index__', '__init__', '__int__', '__invert__', '__ior__', '__ipow__', '__irshift__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__long__', '__lshift__', '__lt__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__oct__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdiv__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rlshift__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rrshift__', '__rshift__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setslice__', '__setstate__', '__sizeof__', '__str__', '__sub__', '__subclasshook__', '__truediv__', '__xor__', 'all', 'any', 'argmax', 'argmin', 'argpartition', 'argsort', 'astype', 'base', 'byteswap', 'choose', 'clip', 'compress', 'conj', 'conjugate', 'copy', 'ctypes', 'cumprod', 'cumsum', 'data', 'diagonal', 'dot', 'dtype', 'dump', 'dumps', 'fill', 'flags', 'flat', 'flatten', 'getfield', 'imag', 'item', 'itemset', 'itemsize', 'max', 'mean', 'min', 'nbytes', 'ndim', 'newbyteorder', 'nonzero', 'partition', 'prod', 'ptp', 'put', 'ravel', 'real', 'repeat', 'reshape', 'resize', 'round', 'searchsorted', 'setfield', 'setflags', 'shape', 'size', 'sort', 'squeeze', 'std', 'strides', 'sum', 'swapaxes', 'take', 'tobytes', 'tofile', 'tolist', 'tostring', 'trace', 'transpose', 'var', 'view']

Now we can get the docstring for a specific method:

In [63]:
help(z.trace)
Help on built-in function trace:

trace(...)
    a.trace(offset=0, axis1=0, axis2=1, dtype=None, out=None)
    
    Return the sum along diagonals of the array.
    
    Refer to `numpy.trace` for full documentation.
    
    See Also
    --------
    numpy.trace : equivalent function

Reference behavior

When you slice a Numpy array, you often get a view of the array that you sliced from, not a copy. This means that if you change the slice, you also change the values in the array it was sliced from. To illustrate this, let's first create an array and take a slice from it.

In [64]:
x = np.array([[1, 4], [3,2], [5,6]])
y = x[1, :]
print(x)
print("\n")
print(y)       
[[1 4]
 [3 2]
 [5 6]]


[3 2]

Now we change a value in the slice, and check the state of the parent array (x):

In [65]:
y[1] = 88
print(y)
print("\n")
print(x)
[ 3 88]


[[ 1  4]
 [ 3 88]
 [ 5  6]]

If you want to avoid this behavior, create a copy.

In [66]:
y = x[1, :].copy()
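Now changes to y no longer propagate back to x:

```python
import numpy as np

x = np.array([[1, 4], [3, 2], [5, 6]])
y = x[1, :].copy()   # independent copy, not a view
y[1] = 88
print(x[1, 1])       # still 2; the parent array is untouched
```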

If you aren't sure whether two names refer to the same object, the id function can help. In the example below, np.asarray returns its argument unchanged when it is already an ndarray, so x and y are the same object (same id), while the list w is distinct:

In [67]:
w = [1, 2, 3]
x = np.asarray(w)
y = np.asarray(x)
print(id(w), id(x), id(y))
140015432547936 140015425830304 140015425830304
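For Numpy arrays specifically, np.shares_memory gives a more direct answer than comparing ids, since two distinct array objects can still share one data buffer:

```python
import numpy as np

x = np.arange(6)
view = x[2:5]         # a view into x's buffer
dup = x[2:5].copy()   # an independent buffer
print(np.shares_memory(x, view))  # True
print(np.shares_memory(x, dup))   # False
```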

Merging

There are several ways to merge data frames in Pandas. One approach is to use the 'merge' function. To illustrate, first we construct two data frames:

In [68]:
df1 = pd.DataFrame(np.random.normal(size=(4, 2)), index=[1, 3, 4, 5], columns=("A", "B"))
df2 = pd.DataFrame(np.random.normal(size=(5, 3)), index=[1, 2, 5, 6, 7], columns=("C", "D", "E"))
print(df1)
print(df2)
          A         B
1 -1.874750  0.537500
3  0.190116 -0.177765
4 -1.780460  0.440861
5  0.700392 -0.513529
          C         D         E
1  0.060484 -1.476438  2.123858
2 -1.570789 -0.537953  0.268004
5 -0.171998  1.660688 -0.460527
6 -1.412172  1.137354 -0.470242
7  0.233659  0.387198 -2.396224

We can merge these two data frames on their indices. The default merge is an "inner merge" that uses only the index values that are present in both data frames.

In [69]:
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
print(df3)
          A         B         C         D         E
1 -1.874750  0.537500  0.060484 -1.476438  2.123858
5  0.700392 -0.513529 -0.171998  1.660688 -0.460527

An "outer merge" uses the index values that are present in either of the two data frames being merged:

In [70]:
df4 = pd.merge(df1, df2, left_index=True, right_index=True, how='outer')
print(df4)
          A         B         C         D         E
1 -1.874750  0.537500  0.060484 -1.476438  2.123858
2       NaN       NaN -1.570789 -0.537953  0.268004
3  0.190116 -0.177765       NaN       NaN       NaN
4 -1.780460  0.440861       NaN       NaN       NaN
5  0.700392 -0.513529 -0.171998  1.660688 -0.460527
6       NaN       NaN -1.412172  1.137354 -0.470242
7       NaN       NaN  0.233659  0.387198 -2.396224

We can also merge using only the keys from the first data frame:

In [71]:
df5 = pd.merge(df1, df2, left_index=True, right_index=True, how='left')
print(df5)
          A         B         C         D         E
1 -1.874750  0.537500  0.060484 -1.476438  2.123858
3  0.190116 -0.177765       NaN       NaN       NaN
4 -1.780460  0.440861       NaN       NaN       NaN
5  0.700392 -0.513529 -0.171998  1.660688 -0.460527
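For index-on-index merges like these, the DataFrame.join method is a convenient shorthand; this sketch reproduces the left merge above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.normal(size=(4, 2)), index=[1, 3, 4, 5], columns=["A", "B"])
df2 = pd.DataFrame(np.random.normal(size=(5, 3)), index=[1, 2, 5, 6, 7], columns=["C", "D", "E"])
df5 = df1.join(df2, how="left")   # align on the indexes, keep df1's keys
print(df5.index.tolist())         # [1, 3, 4, 5]
```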

We can also merge on non-unique values. This can be done using an index with non-unique values, or using an arbitrary column of the data frame as the merge keys. Here we demonstrate the latter.

In [72]:
df1 = pd.DataFrame(index=range(5), columns=("A", "B"))
df1["A"] = [1, 2, 2, 3, 3]
df1["B"] = np.random.normal(size=5)
df2 = pd.DataFrame(index=range(4), columns=("C", "D"))
df2["C"] = [1, 1, 2, 2]
df2["D"] = np.random.normal(size=4)
print(df1)
print(df2)
   A         B
0  1 -3.334844
1  2  1.521165
2  2  0.289833
3  3 -0.236966
4  3  1.156295
   C         D
0  1 -0.724522
1  1 -2.238793
2  2  0.270048
3  2 -0.096886
In [73]:
df3 = pd.merge(df1, df2, left_on="A", right_on="C")
print(df3)
   A         B  C         D
0  1 -3.334844  1 -0.724522
1  1 -3.334844  1 -2.238793
2  2  1.521165  2  0.270048
3  2  1.521165  2 -0.096886
4  2  0.289833  2  0.270048
5  2  0.289833  2 -0.096886
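One final detail worth knowing: when the two data frames share column names that are not merge keys, merge disambiguates them with suffixes (_x and _y by default, configurable via the suffixes argument):

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2], "val": [10, 20]})
right = pd.DataFrame({"key": [1, 2], "val": [30, 40]})
merged = pd.merge(left, right, on="key", suffixes=("_left", "_right"))
print(merged.columns.tolist())  # ['key', 'val_left', 'val_right']
```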