
Why Numpy and Pandas?

For a long time, Python programmers who worked with data had to use the core Python libraries for importing, organising and manipulating data. The modules Numpy and Pandas give us tools we can use to work with data quickly and efficiently in a nicer format. By the end of this guide you should feel comfortable with Numpy arrays and Pandas series and dataframes.

Introduction to Numpy

Numpy is a module that gives us arrays: a more efficient alternative to Python lists that can also be multi-dimensional (arrays nested inside each other). Before we look at what Numpy can do, we first have to import the module. As with matplotlib, we import it under a shorter name for brevity. Here we use "np".

In [2]:
import numpy as np

Almost all of Numpy's functionality comes from its multi-dimensional arrays, which mostly operate like Python lists, but use less memory, perform better and have some other cool features under the hood. To initialise an array, we use the function np.array:

In [3]:
a = np.array([1,2,3,4])
print(a)
[1 2 3 4]

A very important note is that np.array takes the data as a single argument, such as a list - passing the numbers as separate arguments will not work. For example - the command...

In [4]:
aError = np.array(1,2,3,4)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-bd13f32faa64> in <module>
----> 1 aError = np.array(1,2,3,4)

ValueError: only 2 non-keyword arguments accepted

... gives us an error because np.array expects one sequence, not four separate numbers. (The exact error type and message depend on your Numpy version.)
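
As a small illustrative sketch (the variable names are ours, not part of the original notebook), the data only has to arrive as one sequence-like object, so a tuple should work just as well as a list:

```python
import numpy as np

# One sequence argument: a list and a tuple both work
from_list = np.array([1, 2, 3, 4])
from_tuple = np.array((1, 2, 3, 4))

print(from_list)
print(from_tuple)
print(np.array_equal(from_list, from_tuple))  # True
```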

We can also make two dimensional arrays like so (note the nested brackets):

In [4]:
a = np.array([ [1,2,3,4] , [5,6,7,8] ])
print(a)
[[1 2 3 4]
 [5 6 7 8]]
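
To make the row and column structure of a two dimensional array concrete, here is a minimal sketch that inspects the same array using the shape and ndim attributes and indexes it with a row, column pair:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(a.shape)   # (2, 4): 2 rows, 4 columns
print(a.ndim)    # 2: the number of dimensions
print(a[1, 2])   # 7: row 1, column 2 (counting from 0)
print(a[:, 0])   # [1 5]: every row, first column
```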

Numpy also comes with functions that generate arrays. The function linspace(), for instance, gives us an array of equally spaced numbers between its first two arguments, with the third argument setting how many numbers we get:

In [6]:
b = np.linspace(0,10,21)
print(b)
[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5  5.   5.5  6.   6.5
  7.   7.5  8.   8.5  9.   9.5 10. ]

This gives us an array of 21 elements equally spaced from 0 to 10 (because both endpoints are included, we have to be careful about how many numbers we ask for!).
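
The endpoint caveat can be checked with a little arithmetic: with both ends included, 21 points leave 20 gaps, so the spacing is (10 - 0) / 20 = 0.5. The sketch below confirms this and, for contrast, shows np.arange, which takes a step size instead of a count and leaves out the stop value (the names step and no_endpoint are only for illustration):

```python
import numpy as np

start, stop, num = 0, 10, 21
b = np.linspace(start, stop, num)

# num points means num - 1 gaps between them
step = (stop - start) / (num - 1)
print(step)          # 0.5
print(b[1] - b[0])   # 0.5

# np.arange takes a step rather than a count and excludes the stop value
no_endpoint = np.arange(0, 10, 0.5)
print(len(no_endpoint))   # 20
print(no_endpoint[-1])    # 9.5
```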

Numpy arrays, unlike Python lists, can have arithmetic operations applied to them directly, element by element:

In [7]:
print(b**2)
[  0.     0.25   1.     2.25   4.     6.25   9.    12.25  16.    20.25
  25.    30.25  36.    42.25  49.    56.25  64.    72.25  81.    90.25
 100.  ]
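
To see why this matters, it may help to compare the same operators on a plain list and on an array. This is a small sketch with made-up values: on a list, * repeats the list, while on an array it multiplies each element.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

print(values * 2)                # [1, 2, 3, 1, 2, 3]: the list is repeated
print(arr * 2)                   # [2 4 6]: each element is doubled
print([v ** 2 for v in values])  # [1, 4, 9]: a list needs an explicit loop
print(arr ** 2)                  # [1 4 9]: the array does it in one step
```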

We can also use numpy arrays in the same way as Python lists for graphing (in fact this is recommended):

In [8]:
import matplotlib.pyplot as plt #We need to import our graphing library!

plt.plot(b,b**2)

plt.show()
<Figure size 640x480 with 1 Axes>

Numpy arrays can also be indexed, sliced and iterated over in the same way as Python lists can:

In [12]:
print(b[0])
0.0
In [13]:
print(b[0:11])
[0.  0.5 1.  1.5 2.  2.5 3.  3.5 4.  4.5 5. ]
In [14]:
for element in b:
    print(str(element) + " is an element of our array!")
0.0 is an element of our array!
0.5 is an element of our array!
1.0 is an element of our array!
1.5 is an element of our array!
2.0 is an element of our array!
2.5 is an element of our array!
3.0 is an element of our array!
3.5 is an element of our array!
4.0 is an element of our array!
4.5 is an element of our array!
5.0 is an element of our array!
5.5 is an element of our array!
6.0 is an element of our array!
6.5 is an element of our array!
7.0 is an element of our array!
7.5 is an element of our array!
8.0 is an element of our array!
8.5 is an element of our array!
9.0 is an element of our array!
9.5 is an element of our array!
10.0 is an element of our array!

Apart from direct operations on arrays, this may all seem a little redundant. So why do we use Numpy arrays over Python lists? Most scientific libraries build on Numpy for better performance. Better performance and a smaller memory footprint might not sound impressive now, but imagine working on a dataset with billions of entries; then these things become very important!
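
The memory claim can be roughly checked. The sketch below compares one million integers stored as a list and as an array. The list figure relies on sys.getsizeof, so it is specific to the standard CPython interpreter and should be read as an estimate rather than an exact measurement.

```python
import sys
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n, dtype=np.int64)

# The array stores raw 8-byte integers in one contiguous block
print(np_arr.nbytes)   # 8000000

# The list stores references, and every int is a separate Python object
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print(list_bytes)      # several times larger on CPython

# Same answer either way; the array version runs in compiled code
print(int(np_arr.sum()) == sum(py_list))  # True
```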

Pandas

Numpy is the backbone of most data-focused Python libraries, because it provides a solid foundation to build upon. One of the most important of these libraries is Pandas, which we use to create series and dataframes, i.e. tables.

As always, we first need to import the library. With pandas we use the alias "pd" by convention:

In [15]:
import pandas as pd

Pandas' functionality is built around two Python objects - the series and the dataframe. To make a series, we use pd.Series (note: this is case sensitive!):

In [16]:
c = pd.Series([1,1,2,3,5,8])
print(c)
0    1
1    1
2    2
3    3
4    5
5    8
dtype: int64

The numbers in the left column are our index - it can be helpful to change this, for example if our data is time based. To do this, we pass another argument, index, to pd.Series:

In [17]:
c = pd.Series([1,1,2,3,5,8], index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
print(c)
Monday       1
Tuesday      1
Wednesday    2
Thursday     3
Friday       5
Saturday     8
dtype: int64
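
With a labelled index, values can be looked up by label as well as by position. The following sketch rebuilds the same series and shows both styles; iloc is the position-based accessor in pandas:

```python
import pandas as pd

c = pd.Series([1, 1, 2, 3, 5, 8],
              index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])

print(c["Friday"])   # 5: look up by label
print(c.iloc[0])     # 1: look up by position

# Unlike positional slices, label slices include both endpoints
midweek = c["Tuesday":"Thursday"]
print(midweek.tolist())   # [1, 2, 3]
```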

A dataframe is essentially a collection of series that share the same index. To make a dataframe, we have a few options, all using pd.DataFrame (again notice the capitals!).

We can pass a two dimensional numpy array as an argument, along with column names. Here we use the random sublibrary of numpy: np.random.randn(x,y) gives us an x by y numpy array of random numbers, so np.random.randn(6,4) is a 6x4 array.

In [19]:
d = pd.DataFrame(np.random.randn(6,4), columns=['A','B','C','D'])
print(d) #Note, if you are using jupyter notebook, just outputting d here instead of printing it will give you a nicer format.
          A         B         C         D
0 -1.304458  0.128916  0.917312 -0.830153
1 -2.312278 -0.631357 -0.524241 -1.663592
2 -1.527491 -0.136707 -1.041927 -0.649112
3  0.621613  0.384033 -0.529702  1.046160
4  0.603338 -1.430888 -1.414519  0.308135
5  1.281230 -0.966232  0.618242  1.862749

Another way is to pass a dictionary as our argument, with one entry per column. The single string "hello" is repeated for every row, and we use the function "pd.date_range" to generate an array of four dates, starting from 30/03/2017 and ending at 02/04/2017:

In [20]:
e = pd.DataFrame({'A': [1,2,3,4], 'B':"hello", 'C': pd.date_range('20170330', periods=4)})
print(e)
   A      B          C
0  1  hello 2017-03-30
1  2  hello 2017-03-31
2  3  hello 2017-04-01
3  4  hello 2017-04-02
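
To connect this back to series: each column of a dataframe is itself a series, and each column has its own data type. A short sketch, rebuilding the same dataframe e:

```python
import pandas as pd

e = pd.DataFrame({'A': [1, 2, 3, 4], 'B': "hello", 'C': pd.date_range('20170330', periods=4)})

# A single column is a pandas Series
print(isinstance(e['A'], pd.Series))   # True

# One data type per column
print(e.dtypes)

# The scalar "hello" was repeated to fill all four rows
print((e['B'] == "hello").all())       # True
```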

Most of the time during data analysis, we will be looking at tables that are much larger than just 4 or 5 rows. It can be helpful to know some commands that give us summary information about our data without bringing up the whole frame.

The head and tail methods give us the first and last row(s) of the frame:

In [21]:
print(d.head(3)) #Looks at the first 3 rows of our "d" dataframe
          A         B         C         D
0 -1.304458  0.128916  0.917312 -0.830153
1 -2.312278 -0.631357 -0.524241 -1.663592
2 -1.527491 -0.136707 -1.041927 -0.649112
In [22]:
print(e.tail(1)) #Looks at the last row of our "e" dataframe
   A      B          C
3  4  hello 2017-04-02

The describe method can give us some summary statistics of our data:

In [23]:
print(d.describe())
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean  -0.439675 -0.442039 -0.329139  0.012365
std    1.456953  0.691383  0.918213  1.306626
min   -2.312278 -1.430888 -1.414519 -1.663592
25%   -1.471733 -0.882513 -0.913870 -0.784893
50%   -0.350560 -0.384032 -0.526971 -0.170488
75%    0.617044  0.062510  0.332621  0.861654
max    1.281230  0.384033  0.917312  1.862749
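
Each row of the describe output can also be computed on its own. Here is a minimal sketch on a small hand-made dataframe (the name small and its values are invented for illustration) so the numbers are easy to verify by hand:

```python
import pandas as pd

small = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0], "B": [10.0, 20.0, 30.0, 40.0]})

print(small["A"].mean())           # 2.5
print(small["A"].median())         # 2.5, the "50%" row of describe()
print(small["A"].quantile(0.25))   # 1.75, the "25%" row
print(small["A"].std())            # sample standard deviation (divides by n - 1)

summary = small.describe()
print(summary.loc["mean", "B"])    # 25.0
```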

Accessing columns and rows of a dataframe is similar to indexing lists and arrays - for columns we index with the column name:

In [24]:
print(d['A'])
0   -1.304458
1   -2.312278
2   -1.527491
3    0.621613
4    0.603338
5    1.281230
Name: A, dtype: float64

For rows, we use the loc indexer, which looks rows up by their index label:

In [25]:
print(d.loc[0])
A   -1.304458
B    0.128916
C    0.917312
D   -0.830153
Name: 0, dtype: float64
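
Since loc works with index labels, it behaves the same way when the index is not made of numbers. The sketch below uses a small invented dataframe (marks_demo is a hypothetical name) and also shows iloc, which selects rows by position instead:

```python
import pandas as pd

# Hypothetical data with string labels as the index
marks_demo = pd.DataFrame({"score": [70, 85, 90]}, index=["x", "y", "z"])

print(marks_demo.loc["y"])          # the row labelled "y"
print(marks_demo.iloc[1])           # the row at position 1 (the same row here)
print(marks_demo.loc[["x", "z"]])   # several rows at once, by label
```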

To get individual values, we can use a double-index, first selecting the column and then the row:

In [26]:
d['A'][0]
Out[26]:
-1.3044583932951868
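
An alternative worth knowing is to give loc both the row label and the column name at once, which does the lookup in a single step. This is generally the safer form when assigning values, since chained indexing can end up modifying a temporary copy. A sketch on a small invented dataframe (d_demo is a hypothetical name):

```python
import pandas as pd

d_demo = pd.DataFrame({"A": [1.5, 2.5], "B": [3.5, 4.5]})

print(d_demo.loc[0, "A"])   # 1.5: row label first, then column name
print(d_demo.at[0, "A"])    # 1.5: at is a fast accessor for a single value

# Assignment through loc changes the dataframe itself
d_demo.loc[0, "A"] = 9.0
print(d_demo.loc[0, "A"])   # 9.0
```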

Finally, we can filter our data using logical expressions. To access the row in our dataframe e where column A is equal to 3, we can use:

In [27]:
print(e[e['A'] == 3])
   A      B          C
2  3  hello 2017-04-01
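
Under the hood, the comparison produces a series of True/False values, and indexing with it keeps the True rows. Several conditions can be combined with & (and) and | (or), as long as each condition is wrapped in its own brackets. A sketch on the same dataframe e:

```python
import pandas as pd

e = pd.DataFrame({'A': [1, 2, 3, 4], 'B': "hello", 'C': pd.date_range('20170330', periods=4)})

# The comparison alone gives one True/False value per row
mask = e['A'] == 3
print(mask.tolist())   # [False, False, True, False]

# Rows where A is greater than 1 AND the date is before 2 April 2017
both = e[(e['A'] > 1) & (e['C'] < pd.Timestamp('2017-04-02'))]
print(both['A'].tolist())   # [2, 3]
```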

Worked Example

We're going to take a closer look at the random.randn function to see how this data is distributed:

In [28]:
import numpy as np
import pandas as pd

myData = pd.DataFrame(np.random.randn(100000))

print(myData.describe())
                   0
count  100000.000000
mean       -0.002194
std         1.000691
min        -4.194139
25%        -0.673841
50%         0.000094
75%         0.671483
max         4.440165

So our random data has a mean of about 0 and a standard deviation of around 1. This seems to be a standard normal distribution, and in fact that's true - the "n" of "randn" stands for normal. We can illustrate this using a graph:

In [29]:
import matplotlib.pyplot as plt

mySortedData = myData.sort_values(0) #sorts the data in ascending order
x = np.linspace(0, 1, len(mySortedData)) #fraction of the samples at or below each sorted value

plt.plot(mySortedData,x)
plt.show()

Here we can see the cumulative distribution function of the normal distribution!
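
Another quick numerical check, offered as a sketch rather than a formal test: for a standard normal distribution, about 68.3% of values should lie within 1 of the mean and about 95.4% within 2. Here we use a seeded generator (seed 0 is an arbitrary choice) so the result is repeatable.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.standard_normal(100_000)

# The mean of a True/False array is the fraction of True values
within_one = np.mean(np.abs(samples) < 1)
within_two = np.mean(np.abs(samples) < 2)

print(within_one)   # close to 0.683
print(within_two)   # close to 0.954
```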

Mini Project

Below we have a dataframe of marks in a class - can you find out:

  • The average mark for English?
  • Each student's average mark? (DON'T do this manually!!!)
  • The subject in which, on average, students scored the lowest marks?
  • The subject which each individual student did worst in?
  • The student who got the most marks in the class overall?
In [30]:
import numpy as np
import pandas as pd

subjects = ["Maths","English","Science","Geography","History","Languages"]
marks = pd.DataFrame({"Alice": [85, 86, 98, 94,  2, 39],"Billy": [55, 26, 69, 39, 47, 15],"Cameron": [80,  5, 28, 28, 44, 37],"David": [ 5, 22, 95, 71, 62,  6],"Ellie": [75, 93, 66, 18, 87, 60],"Faye": [72,  0, 63, 51, 65, 83],"Garry": [67, 92, 62, 35,  0, 79],"Harriet": [51, 17, 87, 31, 91, 99],"Izzy": [63, 37, 58, 26, 39, 51],"James": [17,  7, 88, 27,  6, 16],"Katie": [15, 77, 12, 54, 81,  0],"Liam": [25, 35, 80, 71, 71,  9],"Mason": [70, 78,  4, 19, 61, 77],"Noah": [78, 96, 86, 42, 73, 51],"Olivia": [75, 81, 23, 19, 76,  3],"Patrick": [43, 50, 87, 94, 33, 65],"Quinn": [72,  1, 80, 96, 76, 56],"Ross": [ 3, 25, 30, 49, 84,  7],"Sam": [67, 29, 91, 64, 11, 43],"Terri": [63, 36, 70, 73, 13, 25],"Umar": [70, 30, 47, 71, 25, 57],"Veronica": [88, 34, 29, 92, 82, 62],"Will": [89, 11, 14, 56, 78, 63]}, index=subjects)
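
As a hint rather than a solution, here is a sketch of the kind of tools that may help, shown on a tiny invented table (toy, Ann and Bob are hypothetical) laid out the same way as marks: one column per student, one row per subject. The key idea is the axis argument: axis=0 works down each column, axis=1 works across each row.

```python
import pandas as pd

toy = pd.DataFrame({"Ann": [50, 80], "Bob": [70, 40]}, index=["Maths", "English"])

print(toy.loc["Maths"].mean())   # 60.0: average of one row
print(toy.mean(axis=0))          # one average per column: Ann 65.0, Bob 55.0
print(toy.mean(axis=1))          # one average per row: Maths 60.0, English 60.0
print(toy.idxmin())              # row label of each column's minimum: Ann Maths, Bob English
print(toy.sum().idxmax())        # Ann: the column with the largest total
```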