For a long time, Python programmers who worked with data had to rely on the core Python libraries to manipulate it, which was a bit painful. The modules Numpy and Pandas give us the tools we need to look at data quickly and efficiently, in a nicer format. By the end of this guide you should feel comfortable with Numpy arrays and Pandas series and dataframes.
Numpy is a module that gives us arrays: a more efficient alternative to Python lists that can also be multi-dimensional. Before we look at what Numpy can do, we first have to import the module. As with matplotlib, we import it under a shorter name for brevity. Here we use "np".
import numpy as np
Almost all of Numpy's functionality comes from its multi-dimensional arrays, which mostly operate like Python lists, but use less memory and have some cool features. To initialise an array, we use the function np.array:
a = np.array([1,2,3,4])
print(a)
[1 2 3 4]
A very important note is that the data goes into "np.array" as a single argument, usually a list - Numpy won't understand you if you pass the numbers separately! For example - the command...
aError = np.array(1,2,3,4)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-5-e75c477c5330> in <module>()
----> 1 aError = np.array(1,2,3,4)

ValueError: only 2 non-keyword arguments accepted
... gives us an error because np.array expects a single list, not four separate numbers.
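As a small sketch of this point (the variable names are only for illustration), np.array takes the whole collection as one argument. A list is the usual choice, and a tuple works as well. Passing the numbers separately raises an exception, and the exact type and message can differ between Numpy versions, so the example catches both likely types:

```python
import numpy as np

from_list = np.array([1, 2, 3, 4])    # one argument: a list
from_tuple = np.array((1, 2, 3, 4))   # a tuple is also accepted

try:
    np.array(1, 2, 3, 4)              # four separate arguments
    failed = False
except (TypeError, ValueError) as err:
    # The exception type and wording depend on the Numpy version
    failed = True
    print("np.array rejected the call:", err)

print(from_list)
print(from_tuple)
```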
We can also make two dimensional arrays like so:
a = np.array([ [1,2,3,4] , [5,6,7,8] ])
print(a)
[[1 2 3 4]
 [5 6 7 8]]
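To get a feel for a two dimensional array, it can help to inspect its shape and pick out single entries. The following is a minimal sketch using the same array as above:

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(a.shape)   # (rows, columns)
print(a.ndim)    # number of dimensions
print(a[1, 2])   # row 1, column 2 (counting from 0)
print(a[:, 0])   # first column of every row
```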
Numpy also comes with some functions to generate arrays - for example, the function linspace() gives us an array of numbers equally spaced between its first two arguments, with the third argument setting how many numbers we get. For example:
b = np.linspace(0,10,21)
print(b)
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. 5.5 6. 6.5 7. 7.5 8. 8.5 9. 9.5 10. ]
This gives us an array of 21 elements equally spaced from 0 to 10.
Numpy arrays, unlike Python lists, can have operations performed on them directly:
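The spacing follows from the arguments: 21 points leave 20 gaps, so each gap is (10 - 0) / 20 = 0.5. A short sketch to check this (the names start, stop, num and step are just labels chosen here):

```python
import numpy as np

start, stop, num = 0, 10, 21
b = np.linspace(start, stop, num)

step = (stop - start) / (num - 1)   # 21 points means 20 gaps
print(step)
print(len(b), b[0], b[-1])          # both end points are included
print(np.allclose(np.diff(b), step))
```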
print(b**2)
[ 0. 0.25 1. 2.25 4. 6.25 9. 12.25 16. 20.25 25. 30.25 36. 42.25 49. 56.25 64. 72.25 81. 90.25 100. ]
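For comparison, here is a rough sketch of the same squaring done with a plain list and with an array. The list needs an explicit loop (written here as a list comprehension), while the array applies the operation to every element at once:

```python
import numpy as np

values = [0.0, 0.5, 1.0, 1.5]

# With a plain list, ** is not defined, so each element is squared in turn
squares_list = [v ** 2 for v in values]

# With an array, the operation applies to every element at once
arr = np.array(values)
squares_arr = arr ** 2

print(squares_list)
print(squares_arr)
print(arr * 2 + 1)   # arithmetic can be chained in the same way
```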
We can also use numpy arrays in the same way as Python lists for graphing:
import matplotlib.pyplot as plt #We need to import our graphing library!
plt.plot(b,b**2)
plt.show()
Numpy arrays can also be indexed and sliced in the same way as Python lists:
print(b[0])
0.0
print(b[0:11])
[ 0. 0.5 1. 1.5 2. 2.5 3. 3.5 4. 4.5 5. ]
for element in b:
    print(str(element) + " is an element of our array!")
0.0 is an element of our array!
0.5 is an element of our array!
1.0 is an element of our array!
1.5 is an element of our array!
2.0 is an element of our array!
2.5 is an element of our array!
3.0 is an element of our array!
3.5 is an element of our array!
4.0 is an element of our array!
4.5 is an element of our array!
5.0 is an element of our array!
5.5 is an element of our array!
6.0 is an element of our array!
6.5 is an element of our array!
7.0 is an element of our array!
7.5 is an element of our array!
8.0 is an element of our array!
8.5 is an element of our array!
9.0 is an element of our array!
9.5 is an element of our array!
10.0 is an element of our array!
Apart from direct operations on arrays, this may seem a little redundant. So why do we use Numpy arrays over Python lists? Well, an array stores all of its values as one type in a single block of memory, so it uses less memory, and operations on it run in fast compiled code rather than in a Python loop.
Numpy is the backbone of most data-focused Python libraries, because it provides a solid foundation to build upon. One of the most important libraries is Pandas, which we use to create series and dataframes, i.e. tables.
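One way to see the memory difference is to count bytes. This is only a rough sketch: the list figure depends on the Python implementation (the numbers below assume standard CPython), and the array size of 100,000 is an arbitrary choice:

```python
import sys
import numpy as np

n = 100_000
py_list = [float(i) for i in range(n)]
arr = np.array(py_list)

# A list stores references to separate float objects, so count both parts
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
array_bytes = arr.nbytes   # raw data only: n values * 8 bytes for float64

print("list :", list_bytes, "bytes")
print("array:", array_bytes, "bytes")
```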
As always, we first need to import the library. With pandas we use the alias "pd" by convention:
import pandas as pd
Pandas' functionality is built around two objects - the series and the dataframe. To make a series, we use the Series function (note: this is case sensitive!):
c = pd.Series([1,1,2,3,5,8])
print(c)
0    1
1    1
2    2
3    3
4    5
5    8
dtype: int64
The numbers in the left column are our index - it can be helpful to change this, for example if our data is time based. To do this, we add another argument to the Series function:
c = pd.Series([1,1,2,3,5,8], index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
print(c)
Monday       1
Tuesday      1
Wednesday    2
Thursday     3
Friday       5
Saturday     8
dtype: int64
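With a labelled index, values can be looked up by name as well as by position. A brief sketch using the same series (the days list is just a helper name introduced here):

```python
import pandas as pd

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
c = pd.Series([1, 1, 2, 3, 5, 8], index=days)

print(c["Friday"])              # look up by label
print(c.iloc[0])                # look up by position
print(c["Tuesday":"Thursday"])  # label slices include both ends
print(c.sum())
```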
Dataframes are just a collection of series. To make a dataframe, we have a few options, all using the DataFrame function (again notice the capitals!).
We can pass a two dimensional numpy array as an argument, along with column names (here we use the random sublibrary of numpy to give us a 6x4 array of random numbers):
d = pd.DataFrame(np.random.randn(6,4), columns=['A','B','C','D'])
print(d) #Note, if you are using jupyter notebook, just outputting d here instead of printing it will give you a nicer format.
          A         B         C         D
0  1.137413  0.339099 -1.003439  1.316108
1  1.038507 -0.599110 -0.272934  0.832083
2  1.480407  0.875420  0.817394 -0.216612
3  0.626307  0.015586  1.337851  1.690318
4 -1.011163 -1.766528 -1.056720  0.307648
5  0.235622 -1.453060 -0.998353  1.570108
Another way is to pass a dictionary as our argument to the function - here we use the function "pd.date_range" to generate an array of dates, starting from 30/03/2017 and ending at 02/04/2017:
e = pd.DataFrame({'A': [1,2,3,4], 'B':"hello", 'C': pd.date_range('20170330', periods=4)})
print(e)
   A      B          C
0  1  hello 2017-03-30
1  2  hello 2017-03-31
2  3  hello 2017-04-01
3  4  hello 2017-04-02
Most of the time during data analysis, we will be looking at tables that are much larger than just 4 or 5 rows. It can be helpful to know some commands to give us summary information about our data without bringing up the whole frame.
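Two details of the dictionary form are worth a quick look: a single value such as "hello" is repeated for every row, and each column keeps its own data type. A minimal sketch with the same frame:

```python
import pandas as pd

e = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": "hello",                                # a single value is repeated for every row
    "C": pd.date_range("20170330", periods=4),   # four consecutive days
})

print(e.shape)     # (rows, columns)
print(e.dtypes)    # each column keeps its own type
```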
The head and tail methods give us the first and last row(s) of the frame:
print(d.head(3)) #Looks at the first 3 rows of our "d" dataframe
          A         B         C         D
0  1.137413  0.339099 -1.003439  1.316108
1  1.038507 -0.599110 -0.272934  0.832083
2  1.480407  0.875420  0.817394 -0.216612
print(e.tail(1)) #Looks at the last row of our "e" dataframe
   A      B          C
3  4  hello 2017-04-02
The describe method can give us some summary statistics of our data:
print(d.describe())
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.584516 -0.431432 -0.196033  0.916609
std    0.892647  1.034962  1.041360  0.754646
min   -1.011163 -1.766528 -1.056720 -0.216612
25%    0.333293 -1.239573 -1.002167  0.438757
50%    0.832407 -0.291762 -0.635643  1.074095
75%    1.112687  0.258221  0.544812  1.506608
max    1.480407  0.875420  1.337851  1.690318
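Each row of the describe table can also be computed on its own. The sketch below builds a fresh 6x4 frame with a seeded generator (np.random.default_rng, used here instead of randn only so that repeated runs give the same numbers; the seed 0 is arbitrary) and compares a couple of rows with the direct methods:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed (an arbitrary choice) so runs repeat
d = pd.DataFrame(rng.standard_normal((6, 4)), columns=["A", "B", "C", "D"])

summary = d.describe()
print(summary.loc["mean"])       # one row of the summary table
print(d.mean())                  # the same numbers, computed directly
print(d["A"].median(), summary.loc["50%", "A"])   # "50%" is the median
```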
Accessing columns and rows of a dataframe is similar to lists and arrays - for columns we index as normal:
print(d['A'])
0    1.137413
1    1.038507
2    1.480407
3    0.626307
4   -1.011163
5    0.235622
Name: A, dtype: float64
For rows, we need to use loc, which looks rows up by their index label:
print(d.loc[0])
A    1.137413
B    0.339099
C   -1.003439
D    1.316108
Name: 0, dtype: float64
And finally, to get individual values, we can use a double-index (first the column, then the row):
d['A'][0]
1.1374132761829281
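The double index works well for reading a value. As a sketch of some alternatives, pandas also offers loc, iloc and at, which take the row and column together; going through loc is the more dependable route when assigning a value. The small table below is made up for illustration:

```python
import pandas as pd

d = pd.DataFrame({"A": [1.5, 2.5, 3.5], "B": [10, 20, 30]})   # small made-up table

print(d.loc[0, "A"])     # row label 0, column "A"
print(d.iloc[0, 0])      # first row, first column, by position
print(d.at[2, "B"])      # fast access to a single value by label

d.loc[0, "A"] = 9.0      # assigning through .loc updates the table itself
print(d["A"][0])         # the double index reads the new value back
```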
We're going to take a closer look at the random.randn function to see how this data is distributed:
import numpy as np
import pandas as pd
myData = pd.DataFrame(np.random.randn(100000))
print(myData.describe())
                   0
count  100000.000000
mean        0.005408
std         0.996759
min        -5.012640
25%        -0.666295
50%         0.004657
75%         0.677136
max         4.175438
So our random data has a mean of about 0 and a standard deviation of around 1. This seems to be a standard normal distribution, and in fact that's true - the "n" of "randn" stands for normal. We can illustrate this using a graph:
import matplotlib.pyplot as plt
mySortedData = myData.sort_values(0) #sorts the data in ascending order (by column 0)
x = np.linspace(0, 1, 100000) #evenly spaced fractions from 0 to 1, one per sorted value
plt.plot(mySortedData,x)
plt.show()
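Here we can see the shape of the cumulative distribution function of the normal distribution: with the sorted values along the horizontal axis, the curve rises in the characteristic S-shape, steepest around 0 where most of the data sits.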
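The same idea can be checked with numbers instead of a graph. This is a sketch under a couple of assumptions: it swaps randn for a seeded np.random.default_rng generator (seed 42 is arbitrary) so that runs repeat, and the helper names are made up here. For a standard normal distribution, about 68% of values lie within one standard deviation of zero, and about half lie below zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)   # fixed seed, chosen arbitrarily
myData = pd.DataFrame(rng.standard_normal(100000))

sorted_values = myData[0].sort_values().to_numpy()
fractions = np.arange(1, len(sorted_values) + 1) / len(sorted_values)

# Fraction of values within one standard deviation of zero
within_one = ((sorted_values > -1) & (sorted_values < 1)).mean()
print(within_one)                 # close to 0.68 for a standard normal

# Empirical CDF at 0: the fraction of values that are <= 0
cdf_at_zero = fractions[np.searchsorted(sorted_values, 0, side="right") - 1]
print(cdf_at_zero)                # close to 0.5
```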
Below we have a dataframe of marks in a class - can you find out:
import numpy as np
import pandas as pd
subjects = ["Maths","English","Science","Geography","History","Languages"]
marks = pd.DataFrame({
    "Alice": [85, 86, 98, 94, 2, 39],
    "Billy": [55, 26, 69, 39, 47, 15],
    "Cameron": [80, 5, 28, 28, 44, 37],
    "David": [5, 22, 95, 71, 62, 6],
    "Ellie": [75, 93, 66, 18, 87, 60],
    "Faye": [72, 0, 63, 51, 65, 83],
    "Garry": [67, 92, 62, 35, 0, 79],
    "Harriet": [51, 17, 87, 31, 91, 99],
    "Izzy": [63, 37, 58, 26, 39, 51],
    "James": [17, 7, 88, 27, 6, 16],
    "Katie": [15, 77, 12, 54, 81, 0],
    "Liam": [25, 35, 80, 71, 71, 9],
    "Mason": [70, 78, 4, 19, 61, 77],
    "Noah": [78, 96, 86, 42, 73, 51],
    "Olivia": [75, 81, 23, 19, 76, 3],
    "Patrick": [43, 50, 87, 94, 33, 65],
    "Quinn": [72, 1, 80, 96, 76, 56],
    "Ross": [3, 25, 30, 49, 84, 7],
    "Sam": [67, 29, 91, 64, 11, 43],
    "Terri": [63, 36, 70, 73, 13, 25],
    "Umar": [70, 30, 47, 71, 25, 57],
    "Veronica": [88, 34, 29, 92, 82, 62],
    "Will": [89, 11, 14, 56, 78, 63],
}, index=subjects)