#!/usr/bin/env python
# coding: utf-8

# ![title](header.png)

# ## Why NumPy and Pandas?
# For a long time, Python programmers working with data had to rely on the core Python libraries to manipulate it, which was a bit painful. The modules NumPy and Pandas give us the tools we need to look at data quickly and efficiently, in a nicer format. By the end of this guide you should feel comfortable with NumPy arrays and Pandas series and dataframes.

# ## Introduction to NumPy
# NumPy is a module that gives us arrays: a more efficient alternative to lists, with the option to be multi-dimensional. Before we look at what NumPy can do, we first have to import the module. As with matplotlib, we will be importing it under a shorter name for brevity. Here we use "np".

# In[2]:


import numpy as np


# Almost all of NumPy's functionality comes from its multi-dimensional arrays, which mostly operate like Python lists, but use less memory and have some cool features. To initialise an array, we use the function np.array:

# In[3]:


a = np.array([1, 2, 3, 4])
print(a)


# A very important note is that the data passed to "np.array" must be a single sequence, such as a list - NumPy won't understand you if you give it the numbers as separate arguments! For example - the command...

# In[5]:


try:
    aError = np.array(1, 2, 3, 4)
except (TypeError, ValueError) as error:
    print(error)  # caught here so that the rest of the script keeps running


# ... gives us an error because NumPy expects one list, not four separate numbers.
#
# We can also make two dimensional arrays like so:

# In[4]:


a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
print(a)


# NumPy also comes with some functions to generate arrays - for example, the function linspace() gives us an array of numbers evenly spaced between its first two arguments, with the third argument setting how many numbers we get. For example:

# In[5]:


b = np.linspace(0, 10, 21)
print(b)


# This gives us an array of 21 elements equally spaced from 0 to 10, both ends included.

# NumPy arrays, unlike Python lists, can have arithmetic performed on them directly, element by element:

# In[7]:


print(b**2)


# We can also use NumPy arrays in the same way as Python lists for graphing:

# In[8]:


import matplotlib.pyplot as plt  # We need to import our graphing library!
plt.plot(b, b**2)
plt.show()


# NumPy arrays can also be indexed and sliced in the same way as Python lists:

# In[9]:


print(b[0])


# In[10]:


print(b[0:11])


# In[12]:


for element in b:
    print(str(element) + " is an element of our array!")


# Apart from direct operations on arrays, this may seem a little redundant. So why do we use NumPy arrays over Python lists? An array stores all of its elements as a single type in one block of memory, so it uses less memory than a list, and operations on it run faster because the looping happens in compiled code behind the scenes.

# ## Pandas
# NumPy is the backbone of most data-focused Python libraries, because it provides a solid foundation to build upon. One of the most important of these libraries is Pandas, which we use to create series and dataframes, i.e. tables.
#
# As always, we first need to import the library. With Pandas we use the alias "pd" by convention:

# In[11]:


import pandas as pd


# Pandas' functionality centres on two Python objects - the series, and the dataframe. To make a series, we can use the Series function (note: this is case sensitive!):

# In[12]:


c = pd.Series([1, 1, 2, 3, 5, 8])
print(c)


# The numbers in the left column are our index - it can be helpful to change this, for example if our data is time based. To do this, we add another argument to the Series function:

# In[15]:


c = pd.Series([1, 1, 2, 3, 5, 8], index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
print(c)


# A dataframe is essentially a collection of series that share the same index. To make a dataframe, we have a few options, all using the DataFrame function (again notice the capitals!).
#
# We can pass a two dimensional NumPy array as an argument, along with column names (here we use the random sublibrary of NumPy to give us a 6x4 array of random numbers):

# In[20]:


d = pd.DataFrame(np.random.randn(6, 4), columns=['A', 'B', 'C', 'D'])
print(d)  # Note: if you are using Jupyter Notebook, just outputting d here instead of printing it will give you a nicer format.
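

# The statement above that a dataframe is a collection of series can be checked directly. The cell below is a minimal sketch rather than part of the original guide: the names "rng", "frame", "columnA" and "rebuilt" are made up for illustration, the seeded generator is only there so the numbers are the same on every run, and frame['A'] simply picks out the column named A.

# In[ ]:


import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # hypothetical seeded generator, for repeatable numbers
frame = pd.DataFrame(rng.standard_normal((6, 4)), columns=['A', 'B', 'C', 'D'])

columnA = frame['A']  # pulling out one column...
print(type(columnA))  # ...gives back a pandas Series
print(columnA.index.equals(frame.index))  # the series carries the frame's row index

# Going the other way, a dictionary of series can be assembled back into a dataframe
rebuilt = pd.DataFrame({name: frame[name] for name in frame.columns})
print(rebuilt.equals(frame))
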


# Another way is to pass a dictionary as our argument to the function - here we use the function "pd.date_range" to generate an array of dates, starting from 30/03/2017 and ending at 02/04/2017:

# In[22]:


e = pd.DataFrame({'A': [1, 2, 3, 4], 'B': "hello", 'C': pd.date_range('20170330', periods=4)})
print(e)


# Most of the time during data analysis, we will be looking at tables that are much larger than just 4 or 5 rows. It can be helpful to know some commands that give us summary information about our data without bringing up the whole frame.
#
# The head and tail methods give us the first and last row(s) of the frame:

# In[26]:


print(d.head(3))  # Looks at the first 3 rows of our "d" dataframe


# In[27]:


print(e.tail(1))  # Looks at the last row of our "e" dataframe


# The describe method gives us some summary statistics of our data:

# In[28]:


print(d.describe())


# Accessing columns and rows of a dataframe is similar to lists and arrays - for columns we index by column name:

# In[30]:


print(d['A'])


# For rows, we need to use loc, which takes the row's index label in square brackets:

# In[33]:


print(d.loc[0])


# And finally, to get individual values, we can use a double-index (d.loc[0, 'A'] does the same lookup in one step):

# In[34]:


print(d['A'][0])


# ## Worked Example
# We're going to take a closer look at the random.randn function to see how the data it produces is distributed:

# In[54]:


import numpy as np
import pandas as pd

myData = pd.DataFrame(np.random.randn(100000))
print(myData.describe())


# So our random data has a mean of about 0 and a standard deviation of around 1. This is consistent with a standard normal distribution, and in fact that's what it is - the "n" of "randn" stands for normal. We can illustrate this using a graph:

# In[55]:


import matplotlib.pyplot as plt

mySortedData = myData.sort_values(0)  # sorts the data in ascending order (the single column is labelled 0)
x = np.linspace(0, 1, 100000)  # evenly spaced values from 0 to 1, standing in for the cumulative proportion
plt.plot(mySortedData[0], x)
plt.show()


# Here we can see the shape of the cumulative distribution function of the normal distribution!
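

# To back up the claim that the curve above is the normal cumulative distribution function, we can compare sorted samples against the exact formula. This is a hedged sketch, not part of the original guide: it assumes the standard normal CDF written with the error function, 0.5 * (1 + erf(x / sqrt(2))), uses Python's built-in math.erf, draws its own seeded samples instead of reusing the data above, and the names "samples", "empirical", "theoretical" and "largestGap" are made up for illustration.

# In[ ]:


import math
import numpy as np

rng = np.random.default_rng(0)  # seeded so the comparison is repeatable
samples = np.sort(rng.standard_normal(100000))  # standard normal samples in ascending order

# Empirical CDF: the fraction of samples less than or equal to each sorted value
empirical = np.arange(1, samples.size + 1) / samples.size

# Theoretical standard normal CDF evaluated at the same points
theoretical = np.array([0.5 * (1 + math.erf(value / math.sqrt(2))) for value in samples])

largestGap = np.max(np.abs(empirical - theoretical))
print("Largest gap between the two curves:", largestGap)


# With this many samples the gap should be small, which is what we would expect if the samples really come from a standard normal distribution.
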


# ## Mini Project
# Below we have a dataframe of marks in a class - can you find out:
#
# * The average mark for English?
# * Each student's average mark? (DON'T do this manually!!!)
# * The subject in which students scored the fewest marks?
# * The student who got the most marks in the class overall?

# In[69]:


import numpy as np
import pandas as pd

subjects = ["Maths", "English", "Science", "Geography", "History", "Languages"]

# One column per student, one row per subject (in the same order as "subjects")
marks = pd.DataFrame({
    "Alice": [85, 86, 98, 94, 2, 39],
    "Billy": [55, 26, 69, 39, 47, 15],
    "Cameron": [80, 5, 28, 28, 44, 37],
    "David": [5, 22, 95, 71, 62, 6],
    "Ellie": [75, 93, 66, 18, 87, 60],
    "Faye": [72, 0, 63, 51, 65, 83],
    "Garry": [67, 92, 62, 35, 0, 79],
    "Harriet": [51, 17, 87, 31, 91, 99],
    "Izzy": [63, 37, 58, 26, 39, 51],
    "James": [17, 7, 88, 27, 6, 16],
    "Katie": [15, 77, 12, 54, 81, 0],
    "Liam": [25, 35, 80, 71, 71, 9],
    "Mason": [70, 78, 4, 19, 61, 77],
    "Noah": [78, 96, 86, 42, 73, 51],
    "Olivia": [75, 81, 23, 19, 76, 3],
    "Patrick": [43, 50, 87, 94, 33, 65],
    "Quinn": [72, 1, 80, 96, 76, 56],
    "Ross": [3, 25, 30, 49, 84, 7],
    "Sam": [67, 29, 91, 64, 11, 43],
    "Terri": [63, 36, 70, 73, 13, 25],
    "Umar": [70, 30, 47, 71, 25, 57],
    "Veronica": [88, 34, 29, 92, 82, 62],
    "Will": [89, 11, 14, 56, 78, 63],
}, index=subjects)
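

# As a hint for the questions above, here is a hedged sketch on a made-up two-student table (the names "toy", "Xena" and "Yusuf" and all of the marks are invented for illustration). It relies on the pandas methods mean, sum, idxmax and idxmin, which this guide has not covered: with axis=0 a calculation runs down each column, and with axis=1 it runs across each row.

# In[ ]:


import pandas as pd

toy = pd.DataFrame({"Xena": [10, 20], "Yusuf": [30, 50]}, index=["Maths", "English"])

print(toy.loc["English"].mean())  # average of one row (one subject)
print(toy.mean(axis=0))  # one average per column (per student)
print(toy.sum(axis=1).idxmin())  # row label with the smallest total
print(toy.sum(axis=0).idxmax())  # column label with the largest total
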