
Why Numpy and Pandas?

For a long time, Python programmers working with data had to rely on the core Python libraries to manipulate it, which was a bit painful. The Numpy and Pandas modules give us the tools to inspect and manipulate data quickly and efficiently, in a nicer format. By the end of this guide you should feel comfortable with Numpy arrays and Pandas series and dataframes.

Introduction to Numpy

Numpy is a module that gives us arrays: a more efficient alternative to Python lists that can also be multi-dimensional. Before we look at what Numpy can do, we first have to import the module. As with matplotlib, we import it under a shorter name for brevity. Here we use "np".

In [2]:

import numpy as np

Almost all of Numpy's functionality comes from its multi-dimensional arrays, which mostly behave like Python lists, but use less memory and support some extra features, such as arithmetic on whole arrays at once. To initialise an array, we use the function np.array:

In [3]:

a = np.array([1,2,3,4])
print(a)

[1 2 3 4]

An important note is that the "np.array" function takes the data as a single argument, usually a list - Numpy won't understand you if you pass the numbers as separate arguments! For example - the command...

In [5]:

aError = np.array(1,2,3,4)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-e75c477c5330> in <module>()
----> 1 aError = np.array(1,2,3,4)

ValueError: only 2 non-keyword arguments accepted

... gives us an error because np.array expects a single list, not four separate numbers.
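To make this concrete, here is a minimal sketch of calls that do work next to the one that fails. Note that np.array also accepts other sequences, such as tuples, and that the exact exception raised by the failing call depends on the Numpy version. The names from_list, from_tuple and failed are just illustrative.

```python
import numpy as np

# The data goes in as ONE argument: a list (or another sequence, such as a tuple).
from_list = np.array([1, 2, 3, 4])
from_tuple = np.array((1, 2, 3, 4))
print(from_list, from_tuple)

# Passing four separate numbers fails. Older Numpy versions raise ValueError,
# newer ones raise TypeError, so we catch both here.
failed = False
try:
    np.array(1, 2, 3, 4)
except (TypeError, ValueError) as err:
    failed = True
    print("np.array(1, 2, 3, 4) failed:", err)
```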

We can also make two dimensional arrays like so:

In [4]:

a = np.array([ [1,2,3,4] , [5,6,7,8] ])
print(a)

[[1 2 3 4]
 [5 6 7 8]]
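As a short sketch of what we can ask a two dimensional array, the example below rebuilds the same array and looks at its shape and a few of its elements. With two dimensions, we index with a row and a column separated by a comma.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(a.ndim)    # number of dimensions
print(a.shape)   # (rows, columns)
print(a[1, 2])   # row 1, column 2 (counting from 0)
print(a[0])      # the whole first row
print(a[:, 1])   # the second column
```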

Numpy also comes with some functions to generate arrays - for example the function linspace() gives us an array of numbers equally spaced between its first two arguments, with the third argument setting how many numbers to generate. For example:

In [5]:

b = np.linspace(0,10,21)
print(b)

[  0.    0.5   1.    1.5   2.    2.5   3.    3.5   4.    4.5   5.    5.5
   6.    6.5   7.    7.5   8.    8.5   9.    9.5  10. ]

This gives us an array of 21 elements equally spaced from 0 to 10, with both endpoints included.
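The spacing follows from the arguments: 21 points leave 20 gaps, so each gap is (10 - 0) / 20 = 0.5. The sketch below checks this and, for comparison, uses np.arange, which takes a step size rather than a count (the name stepped is just illustrative).

```python
import numpy as np

b = np.linspace(0, 10, 21)   # 21 points, both endpoints included
step = b[1] - b[0]           # spacing is (10 - 0) / (21 - 1)
print(step)

# np.arange takes a step size instead of a count, and excludes the stop value.
stepped = np.arange(0, 10, 0.5)
print(len(b), len(stepped))
```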

Numpy arrays, unlike Python lists, can have arithmetic operations performed on them directly, element by element:

In [7]:

print(b**2)

[   0.      0.25    1.      2.25    4.      6.25    9.     12.25   16.
   20.25   25.     30.25   36.     42.25   49.     56.25   64.     72.25
   81.     90.25  100.  ]
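To see the difference from lists side by side, here is a small sketch using made-up numbers. It also shows that the same operator can behave differently on a list and on an array.

```python
import numpy as np

numbers = [1, 2, 3, 4]
arr = np.array(numbers)

# With a list we need a loop (here a list comprehension) to square each element.
squared_list = [n ** 2 for n in numbers]

# With an array the operation applies to every element at once.
squared_arr = arr ** 2

# The same operator can mean different things: * repeats a list,
# but multiplies each element of an array.
print(numbers * 2)   # [1, 2, 3, 4, 1, 2, 3, 4]
print(arr * 2)       # [2 4 6 8]
```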

We can also use Numpy arrays in the same way as Python lists for graphing:

In [8]:

import matplotlib.pyplot as plt #We need to import our graphing library!

plt.plot(b,b**2)

plt.show()

Numpy arrays can also be indexed, sliced and looped over in the same way as Python lists:

In [9]:

print(b[0])

0.0

In [10]:

print(b[0:11])

[ 0.   0.5  1.   1.5  2.   2.5  3.   3.5  4.   4.5  5. ]
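Slicing does differ from lists in one respect worth knowing: a slice of an array is a view onto the same data, not a copy. The sketch below shows a few more slicing forms and then demonstrates the view behaviour (first_three and independent are illustrative names).

```python
import numpy as np

b = np.linspace(0, 10, 21)

print(b[-1])     # last element
print(b[::5])    # every fifth element

# Unlike a list slice, an array slice is a view onto the same data,
# so changing the slice changes the original array too.
first_three = b[0:3]
first_three[0] = 99.0
print(b[0])

# Use .copy() when an independent array is wanted.
independent = b[0:3].copy()
independent[1] = -1.0
print(b[1])
```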

In [12]:

for element in b:
    print(str(element) + " is an element of our array!")

0.0 is an element of our array!
0.5 is an element of our array!
1.0 is an element of our array!
1.5 is an element of our array!
2.0 is an element of our array!
2.5 is an element of our array!
3.0 is an element of our array!
3.5 is an element of our array!
4.0 is an element of our array!
4.5 is an element of our array!
5.0 is an element of our array!
5.5 is an element of our array!
6.0 is an element of our array!
6.5 is an element of our array!
7.0 is an element of our array!
7.5 is an element of our array!
8.0 is an element of our array!
8.5 is an element of our array!
9.0 is an element of our array!
9.5 is an element of our array!
10.0 is an element of our array!

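Apart from direct operations on arrays, this may all seem a little redundant. So why do we use Numpy arrays over Python lists? An array stores values of a single type in one contiguous block of memory, so it uses less memory than a list, and operations on whole arrays run in compiled code rather than in a Python loop, which makes them much faster.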
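A rough way to check the memory and speed claims yourself is sketched below. The size n and the number of timing repeats are arbitrary choices, and the timings will vary from machine to machine, so treat the printed numbers as indicative only.

```python
import sys
import timeit
import numpy as np

n = 100_000
as_list = list(range(n))
as_array = np.arange(n, dtype=np.int64)

# A list stores references to separate Python int objects;
# an array stores the raw 8-byte values side by side.
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
array_bytes = as_array.nbytes
print(list_bytes, array_bytes)

# Time squaring every element; exact timings depend on the machine.
list_time = timeit.timeit(lambda: [x ** 2 for x in as_list], number=20)
array_time = timeit.timeit(lambda: as_array ** 2, number=20)
print(list_time, array_time)
```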

Pandas

Numpy is the backbone of most data-focused Python libraries, because it provides a solid foundation to build upon. One of the most important of these libraries is Pandas, which we use to create series and dataframes, i.e. tables.

As always, we first need to import the library. With pandas we use the alias "pd" by convention:

In [11]:

import pandas as pd

Pandas' functionality is built around two Python objects - the series and the dataframe. To make a series, we can use the Series function (note: this is case sensitive!):

In [12]:

c = pd.Series([1,1,2,3,5,8])
print(c)

0    1
1    1
2    2
3    3
4    5
5    8
dtype: int64

The numbers in the left column are our index - it can be helpful to change this, for example if our data is time based. To do this, we add another argument to the Series function:

In [15]:

c = pd.Series([1,1,2,3,5,8], index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
print(c)

Monday       1
Tuesday      1
Wednesday    2
Thursday     3
Friday       5
Saturday     8
dtype: int64
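Once a series has labels, we can use them to look values up. The sketch below rebuilds the same series and shows a single lookup and a slice by label. Note that, unlike slicing by position, a label slice includes both ends.

```python
import pandas as pd

c = pd.Series([1, 1, 2, 3, 5, 8],
              index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])

print(c["Friday"])               # look up one value by its label
print(c["Tuesday":"Thursday"])   # label slices include BOTH ends
print(c.index.tolist())          # the index itself
```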

A dataframe is a table whose columns are series sharing the same index. To make a dataframe, we have a few options, all using the DataFrame function (again notice the capitals!).

We can pass a two-dimensional Numpy array as an argument, along with column names (here we use the random submodule of Numpy to give us a 6x4 array of random numbers):

In [20]:

d = pd.DataFrame(np.random.randn(6,4), columns=['A','B','C','D'])
print(d) #Note, if you are using jupyter notebook, just outputting d here instead of printing it will give you a nicer format.

          A         B         C         D
0  1.137413  0.339099 -1.003439  1.316108
1  1.038507 -0.599110 -0.272934  0.832083
2  1.480407  0.875420  0.817394 -0.216612
3  0.626307  0.015586  1.337851  1.690318
4 -1.011163 -1.766528 -1.056720  0.307648
5  0.235622 -1.453060 -0.998353  1.570108
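Because the numbers are random, running the cell above again gives a different table. If you want the same numbers on every run, one option is a seeded generator, sketched below. This assumes Numpy 1.17 or later, where np.random.default_rng is available; the seed 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # seeded generator; 42 is an arbitrary choice
d = pd.DataFrame(rng.standard_normal((6, 4)), columns=['A', 'B', 'C', 'D'])

print(d.shape)
print(d.columns.tolist())
```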

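Another way is to pass a dictionary as our argument to the function, with each key becoming a column name. A single value such as "hello" is repeated down the whole column. Here we also use the function "pd.date_range" to generate an array of dates, starting from 30/03/2017 and ending at 02/04/2017: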

In [22]:

e = pd.DataFrame({'A': [1,2,3,4], 'B':"hello", 'C': pd.date_range('20170330', periods=4)})
print(e)

   A      B          C
0  1  hello 2017-03-30
1  2  hello 2017-03-31
2  3  hello 2017-04-01
3  4  hello 2017-04-02
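Each column of a dataframe can hold a different type of data. As a quick sketch, we can rebuild the same frame and inspect the column types and a couple of columns:

```python
import pandas as pd

e = pd.DataFrame({'A': [1, 2, 3, 4],
                  'B': "hello",
                  'C': pd.date_range('20170330', periods=4)})

print(e.dtypes)          # one data type per column
print(e['B'].tolist())   # the single string was repeated for every row
print(e['C'].max())      # the last date in the range
```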

Most of the time during data analysis, we will be looking at tables that are much larger than just 4 or 5 rows. It can be helpful to know some commands to give us summary information about our data without bringing up the whole frame.

The head and tail methods give us the first and last row(s) of the frame:

In [26]:

print(d.head(3)) #Looks at the first 3 rows of our "d" dataframe

          A         B         C         D
0  1.137413  0.339099 -1.003439  1.316108
1  1.038507 -0.599110 -0.272934  0.832083
2  1.480407  0.875420  0.817394 -0.216612

In [27]:

print(e.tail(1)) #Looks at the last row of our "e" dataframe

   A      B          C
3  4  hello 2017-04-02
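Both methods also work without an argument, in which case they return 5 rows. The sketch below uses a made-up 20-row frame (the name big and its columns are hypothetical) to show this, along with the shape attribute, which gives the size of a frame without printing any of its data.

```python
import numpy as np
import pandas as pd

# A hypothetical 20-row frame, just to have something longer than 5 rows.
big = pd.DataFrame({'n': np.arange(20), 'n_squared': np.arange(20) ** 2})

print(len(big.head()))   # with no argument, head returns 5 rows
print(big.tail(2))       # the last two rows
print(big.shape)         # (rows, columns) without printing any data
```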

The describe method can give us some summary statistics of our data:

In [28]:

print(d.describe())

              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.584516 -0.431432 -0.196033  0.916609
std    0.892647  1.034962  1.041360  0.754646
min   -1.011163 -1.766528 -1.056720 -0.216612
25%    0.333293 -1.239573 -1.002167  0.438757
50%    0.832407 -0.291762 -0.635643  1.074095
75%    1.112687  0.258221  0.544812  1.506608
max    1.480407  0.875420  1.337851  1.690318
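Each row of this summary can also be computed on its own. The sketch below uses a small made-up column (scores is a hypothetical name) to show how the rows relate to the individual methods. In particular the "50%" row is the median, and "std" is the sample standard deviation.

```python
import pandas as pd

# A small hypothetical column with easy numbers.
scores = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0]})
summary = scores.describe()

print(summary.loc['mean', 'x'], scores['x'].mean())    # same value
print(summary.loc['50%', 'x'], scores['x'].median())   # 50% is the median
print(summary.loc['std', 'x'], scores['x'].std())      # sample standard deviation
```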

Accessing columns and rows of a dataframe is similar to indexing lists and arrays - for columns we index by the column name:

In [30]:

print(d['A'])

0    1.137413
1    1.038507
2    1.480407
3    0.626307
4   -1.011163
5    0.235622
Name: A, dtype: float64

For rows, we need to use loc, which takes the row's index label in square brackets:

In [33]:

print(d.loc[0])

A    1.137413
B    0.339099
C   -1.003439
D    1.316108
Name: 0, dtype: float64
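In the example above the row labels happen to be the numbers 0 to 5, so a label and a position look the same. The sketch below uses a made-up frame with text labels (f is a hypothetical name) to show the difference between loc, which works with labels, and its companion iloc, which works with positions.

```python
import pandas as pd

# Hypothetical frame with text labels as its index.
f = pd.DataFrame({'A': [10, 20, 30], 'B': [1.5, 2.5, 3.5]},
                 index=['x', 'y', 'z'])

print(f.loc['y'])       # row with the label 'y'
print(f.iloc[1])        # row at position 1 (the same row here)
print(f.loc['x':'y'])   # label slices include both ends
print(f.iloc[0:2])      # position slices exclude the end, like lists
```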

And finally, to get individual values, we can use a double-index, first by column and then by row:

In [34]:

d['A'][0]

Out[34]:

1.1374132761829281
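The double-index is fine for reading a value. As a hedged sketch of some alternatives, loc, at and iloc can all fetch a single value in one step, and a single loc call is the usual choice when we want to change a value. The frame here uses made-up numbers.

```python
import pandas as pd

d = pd.DataFrame({'A': [1.5, 2.5], 'B': [3.5, 4.5]})   # hypothetical values

print(d['A'][0])       # double-index: column, then row
print(d.loc[0, 'A'])   # loc: row label, then column name
print(d.at[0, 'A'])    # at: like loc, but for a single value only
print(d.iloc[0, 0])    # iloc: row position, then column position

# For assignment, a single loc call changes the frame itself.
d.loc[0, 'A'] = 9.0
print(d)
```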

Worked Example

We're going to take a closer look at the np.random.randn function to see how the data it generates is distributed:

In [54]:

import numpy as np
import pandas as pd

myData = pd.DataFrame(np.random.randn(100000))

print(myData.describe())

                   0
count  100000.000000
mean        0.005408
std         0.996759
min        -5.012640
25%        -0.666295
50%         0.004657
75%         0.677136
max         4.175438

So our random data has a mean of about 0 and a standard deviation of around 1. This looks like a standard normal distribution, and in fact it is - the "n" of "randn" stands for normal. We can illustrate this using a graph:

In [55]:

import matplotlib.pyplot as plt

mySortedData = myData.sort_values(0) #sorts the data in ascending order
x = np.linspace(-10, 10, 100000) #setting up a dummy array

plt.plot(mySortedData,x)
plt.show()

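Here we can see the shape of the cumulative distribution function of the normal distribution! (The vertical scale here is arbitrary, since x is just a dummy array; a true cumulative distribution function runs from 0 to 1.)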
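To go one step beyond the picture, we can compare the sorted data with the exact formula for the standard normal cumulative distribution function, 0.5 * (1 + erf(x / sqrt(2))). The sketch below is one way to do this, not part of the original example: it uses a seeded generator (assuming Numpy 1.17 or later) so the run is repeatable, and the names samples, ecdf, normal_cdf and max_gap are illustrative.

```python
import math
import numpy as np

rng = np.random.default_rng(0)   # seeded so the run is repeatable
samples = np.sort(rng.standard_normal(100_000))

# Empirical CDF: the fraction of samples at or below each sorted value.
ecdf = np.arange(1, len(samples) + 1) / len(samples)

# Standard normal CDF written with the error function.
normal_cdf = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in samples])

max_gap = np.max(np.abs(ecdf - normal_cdf))
print(max_gap)   # small for a sample this large
```

Plotting ecdf and normal_cdf against samples on the same axes should give two curves that are hard to tell apart.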

Mini Project

Below we have a dataframe of marks in a class - can you find out:

The average mark for English?
Each student's average mark? (DON'T do this manually!!!)
The subject in which students scored the lowest marks?
The student who got the most marks in the class overall?

In [69]:

import numpy as np
import pandas as pd

subjects = ["Maths","English","Science","Geography","History","Languages"]
marks = pd.DataFrame({
    "Alice": [85, 86, 98, 94,  2, 39], "Billy": [55, 26, 69, 39, 47, 15], "Cameron": [80,  5, 28, 28, 44, 37],
    "David": [ 5, 22, 95, 71, 62,  6], "Ellie": [75, 93, 66, 18, 87, 60], "Faye": [72,  0, 63, 51, 65, 83],
    "Garry": [67, 92, 62, 35,  0, 79], "Harriet": [51, 17, 87, 31, 91, 99], "Izzy": [63, 37, 58, 26, 39, 51],
    "James": [17,  7, 88, 27,  6, 16], "Katie": [15, 77, 12, 54, 81,  0], "Liam": [25, 35, 80, 71, 71,  9],
    "Mason": [70, 78,  4, 19, 61, 77], "Noah": [78, 96, 86, 42, 73, 51], "Olivia": [75, 81, 23, 19, 76,  3],
    "Patrick": [43, 50, 87, 94, 33, 65], "Quinn": [72,  1, 80, 96, 76, 56], "Ross": [ 3, 25, 30, 49, 84,  7],
    "Sam": [67, 29, 91, 64, 11, 43], "Terri": [63, 36, 70, 73, 13, 25], "Umar": [70, 30, 47, 71, 25, 57],
    "Veronica": [88, 34, 29, 92, 82, 62], "Will": [89, 11, 14, 56, 78, 63],
}, index=subjects)
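As a hint rather than a solution, here is a sketch of some dataframe methods that were not covered above, shown on a tiny made-up table with the same layout (rows are subjects, columns are students). The frame toy and its contents are hypothetical. The axis argument chooses the direction: axis=0 gives one result per column, axis=1 gives one result per row.

```python
import pandas as pd

# Hypothetical toy data in the same layout: rows are subjects, columns are students.
toy = pd.DataFrame({"Ann": [50, 70], "Bob": [90, 10]}, index=["Art", "Music"])

print(toy.loc["Art"].mean())      # average of one row
print(toy.mean(axis=0))           # one average per column
print(toy.sum(axis=1))            # one total per row
print(toy.sum(axis=1).idxmin())   # label of the smallest row total
print(toy.sum(axis=0).idxmax())   # label of the largest column total
```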