#!/usr/bin/env python
# coding: utf-8
# ![title](header.png)
# ## Why Numpy and Pandas?
# For a long time, when working with data, programmers who used Python had to use the core Python libraries to manipulate data, which was a bit painful. The modules Numpy and Pandas give us the tools we need to look at data quickly and efficiently, in a nicer format. By the end of this guide you should feel comfortable with Numpy arrays and Pandas series and dataframes.
# ## Introduction to Numpy
# Numpy is a module that lets us generate more efficient lists that have the option to be multi-dimensional. Before we look at what Numpy can do, we have to first import the module. As with matplotlib, we will be importing it under a different name for brevity. Here we use "np".
# In[2]:
import numpy as np
# Almost all of Numpy's functionality comes from its multi-dimensional arrays, which mostly behave like Python lists, but use less memory and have some cool features. To initialise an array, we use the function np.array:
# In[3]:
a = np.array([1,2,3,4])
print(a)
# A very important note is that the values must be passed to "np.array" as a single argument, such as a list - Numpy won't understand you if you give it the numbers one by one! For example - the command...
# In[5]:
aError = np.array(1,2,3,4)
# ... gives us an error because np.array expects a single list, not four separate numbers.
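# A minimal sketch of that failure, caught so the script keeps running. The exact error type and message depend on the Numpy version, so both TypeError and ValueError are caught here; the names "bad", "good" and "failed" are only illustrative.

```python
import numpy as np

try:
    bad = np.array(1, 2, 3, 4)  # four separate arguments instead of one list
    failed = False
except (TypeError, ValueError) as err:
    failed = True
    print("np.array rejected the call:", err)

good = np.array([1, 2, 3, 4])  # the same numbers wrapped in a single list
print(good)
```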
#
# We can also make two dimensional arrays like so:
# In[4]:
a = np.array([ [1,2,3,4] , [5,6,7,8] ])
print(a)
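# As a quick check of what "two dimensional" means here, the sketch below inspects the array's shape and picks out a single element. The name "grid" is just an illustrative stand-in for the array above.

```python
import numpy as np

grid = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

print(grid.shape)  # (2, 4): two rows, four columns
print(grid.ndim)   # 2: the number of dimensions
print(grid[1, 2])  # 7: row 1, column 2 (counting from 0)
```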
# Numpy also comes with some functions to generate arrays. The function linspace(), for instance, gives us an array of evenly spaced numbers between its first two arguments, with the third argument setting how many numbers we get:
# In[5]:
b = np.linspace(0,10,21)
print(b)
# This gives us an array of 21 elements equally spaced from 0 to 10, with both endpoints included.
# Numpy arrays, unlike Python lists, can have arithmetic performed on them directly, element by element:
# In[7]:
print(b**2)
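# To see the difference from plain lists, here is a small side-by-side sketch. The variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3]
arr = np.array(values)

repeated_list = values * 2               # a list is repeated: [1, 2, 3, 1, 2, 3]
doubled_arr = arr * 2                    # an array is doubled element by element: [2 4 6]
squared_list = [v ** 2 for v in values]  # a list needs a loop or comprehension
squared_arr = arr ** 2                   # an array does it in one step: [1 4 9]

print(repeated_list, doubled_arr, squared_list, squared_arr)
```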
# We can also use numpy arrays in the same way as Python lists for graphing:
# In[8]:
import matplotlib.pyplot as plt #We need to import our graphing library!
plt.plot(b,b**2)
plt.show()
# Numpy arrays can also be indexed and sliced in the same way as Python lists:
# In[9]:
print(b[0])
# In[10]:
print(b[0:11])
# In[12]:
for element in b:
    print(str(element) + " is an element of our array!")
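# A few more slicing patterns, sketched on the same kind of array. Negative indices and steps work as they do for lists; the last line uses a boolean mask, which goes slightly beyond what is covered above and is one of the extra features arrays offer.

```python
import numpy as np

b = np.linspace(0, 10, 21)

last = b[-1]             # 10.0, the final element
every_fourth = b[::4]    # elements 0, 4, 8, ...: [0. 2. 4. 6. 8. 10.]
above_eight = b[b > 8]   # only the elements greater than 8: [8.5 9. 9.5 10.]

print(last, every_fourth, above_eight)
```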
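# Apart from direct operations on arrays, this may seem a little redundant. So why do we use Numpy arrays over Python lists? Well, a Numpy array stores elements of a single type in one contiguous block of memory, so it uses less memory, and its operations run in compiled code behind the scenes, which makes them faster.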
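# A rough way to see the memory difference is sketched below. It is only an estimate: sys.getsizeof reports the size of the list object and of each integer separately, and the array figure (nbytes) counts the element data but not the array's small fixed overhead. The size n and the int64 type are arbitrary choices for illustration.

```python
import sys
import numpy as np

n = 1000
as_list = list(range(n))
as_array = np.arange(n, dtype=np.int64)

# The list stores references to separate integer objects, so count both
list_bytes = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
# The array stores the raw numbers side by side, 8 bytes per int64
array_bytes = as_array.nbytes

print("list: ", list_bytes, "bytes")
print("array:", array_bytes, "bytes")
```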
# ## Pandas
# Numpy is the backbone of most data-focused Python libraries, because it provides a solid foundation to build upon. One of the most important of these libraries is Pandas, which we use to create series and dataframes, i.e. tables.
#
# As always, we first need to import the library. With pandas we use the alias "pd" by convention:
# In[11]:
import pandas as pd
# Most of Pandas' functionality is built around two Python objects - the series, and the dataframe. For making series, we can use the Series function (note: this is case sensitive!):
# In[12]:
c = pd.Series([1,1,2,3,5,8])
print(c)
# The numbers in the left column are our index - it can be helpful to change this, for example if our data is time based. To do this, we add another argument to the Series function:
# In[15]:
c = pd.Series([1,1,2,3,5,8], index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
print(c)
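# With a custom index in place, values can be looked up by label as well as by position. A minimal sketch using the same series (note that slicing by label includes both ends):

```python
import pandas as pd

c = pd.Series([1, 1, 2, 3, 5, 8],
              index=["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])

friday = c["Friday"]                    # look up by label: 5
first = c.iloc[0]                       # look up by position: 1
early_week = c["Monday":"Wednesday"]    # label slice, both ends included

print(friday, first)
print(early_week)
```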
# Dataframes are just a collection of series. To make a dataframe, we have a few options, all using the DataFrame function (again notice the capitals!).
#
# We can pass a two dimensional numpy array as an argument, along with column names (here we use the random sublibrary of numpy to give us a 6x4 array of random numbers):
# In[20]:
d = pd.DataFrame(np.random.randn(6,4), columns=['A','B','C','D'])
print(d) #Note, if you are using jupyter notebook, just outputting d here instead of printing it will give you a nicer format.
# Another way is to pass a dictionary as our argument to the function - here we use the function "pd.date_range" to generate a sequence of four dates, starting from 30/03/2017 and ending at 02/04/2017:
# In[22]:
e = pd.DataFrame({'A': [1,2,3,4], 'B':"hello", 'C': pd.date_range('20170330', periods=4)})
print(e)
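# A short sketch to confirm what the dictionary form produces: each key becomes a column, the single string "hello" is repeated down every row, and the dates run over four consecutive days.

```python
import pandas as pd

e = pd.DataFrame({'A': [1, 2, 3, 4],
                  'B': "hello",
                  'C': pd.date_range('20170330', periods=4)})

print(e.shape)           # (4, 3): four rows, three columns
print(list(e.columns))   # ['A', 'B', 'C']
print(e['B'].tolist())   # the single string fills every row
print(e['C'].iloc[-1])   # the last date in the range
```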
# Most of the time during data analysis, we will be looking at tables that are much larger than just 4 or 5 rows. It can be helpful to know some commands to give us summary information about our data without bringing up the whole frame.
#
# The head and tail methods give us the first and last row(s) of the frame:
# In[26]:
print(d.head(3)) #Looks at the first 3 rows of our "d" dataframe
# In[27]:
print(e.tail(1)) #Looks at the last row of our "e" dataframe
# The describe method can give us some summary statistics of our data:
# In[28]:
print(d.describe())
# Accessing columns and rows of a dataframe is similar to lists and arrays - for columns we index by the column name:
# In[30]:
print(d['A'])
# For rows, we need to use the loc indexer, which selects a row by its index label:
# In[33]:
print(d.loc[0])
# And finally, to get individual values, we can use a double-index, giving the column first and then the row:
# In[34]:
d['A'][0]
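# The sketch below puts these access patterns side by side on a small made-up frame with text labels, so that labels and positions cannot be confused. The frame and its values are hypothetical; iloc, which selects by position rather than label, is included for comparison.

```python
import pandas as pd

frame = pd.DataFrame({"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
                     index=["x", "y", "z"])

column_a = frame["A"]          # a whole column, as a series
row_y = frame.loc["y"]         # a whole row, selected by label
row_1 = frame.iloc[1]          # the same row, selected by position
value = frame["A"]["y"]        # double-index: column first, then row
same_value = frame.loc["y", "A"]  # the same value in one step: row, then column

print(value, same_value)
```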
# ## Worked Example
# We're going to take a closer look at the random.randn function to see how this data is distributed:
# In[54]:
import numpy as np
import pandas as pd
myData = pd.DataFrame(np.random.randn(100000))
print(myData.describe())
# So our random data has a mean of about 0 and a standard deviation of around 1. This seems to be a standard normal distribution, and in fact that's true - the "n" of "randn" stands for normal. We can illustrate this using a graph:
# In[55]:
import matplotlib.pyplot as plt
mySortedData = myData.sort_values(0) #sorts the data in ascending order
y = np.linspace(0, 1, 100000) #the proportion of points at or below each sorted value
plt.plot(mySortedData,y)
plt.show()
# Here we can see the cumulative distribution function of the normal distribution!
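# We can also check this numerically, without a graph. The sketch below assumes a fixed seed (0, chosen arbitrarily) so the run is repeatable, and compares two proportions against the well-known values for a standard normal distribution: half of the data lies below 0, and roughly 68% lies within one standard deviation of the mean.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only so the result is repeatable
samples = np.random.randn(100000)

frac_below_zero = np.mean(samples < 0)           # expected to be close to 0.5
frac_within_one = np.mean(np.abs(samples) < 1)   # expected to be close to 0.68

print(frac_below_zero, frac_within_one)
```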
# ## Mini Project
# Below we have a dataframe of marks in a class - can you find out:
#
# * The average mark for English?
# * Each student's average mark? (DON'T do this manually!!!)
# * The subject which students scored the least marks in?
# * The student who got the most marks in the class overall?
# In[69]:
import numpy as np
import pandas as pd
subjects = ["Maths","English","Science","Geography","History","Languages"]
marks = pd.DataFrame({
    "Alice": [85, 86, 98, 94, 2, 39],
    "Billy": [55, 26, 69, 39, 47, 15],
    "Cameron": [80, 5, 28, 28, 44, 37],
    "David": [5, 22, 95, 71, 62, 6],
    "Ellie": [75, 93, 66, 18, 87, 60],
    "Faye": [72, 0, 63, 51, 65, 83],
    "Garry": [67, 92, 62, 35, 0, 79],
    "Harriet": [51, 17, 87, 31, 91, 99],
    "Izzy": [63, 37, 58, 26, 39, 51],
    "James": [17, 7, 88, 27, 6, 16],
    "Katie": [15, 77, 12, 54, 81, 0],
    "Liam": [25, 35, 80, 71, 71, 9],
    "Mason": [70, 78, 4, 19, 61, 77],
    "Noah": [78, 96, 86, 42, 73, 51],
    "Olivia": [75, 81, 23, 19, 76, 3],
    "Patrick": [43, 50, 87, 94, 33, 65],
    "Quinn": [72, 1, 80, 96, 76, 56],
    "Ross": [3, 25, 30, 49, 84, 7],
    "Sam": [67, 29, 91, 64, 11, 43],
    "Terri": [63, 36, 70, 73, 13, 25],
    "Umar": [70, 30, 47, 71, 25, 57],
    "Veronica": [88, 34, 29, 92, 82, 62],
    "Will": [89, 11, 14, 56, 78, 63],
}, index=subjects)
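# As a hint rather than a solution, here is a sketch on a tiny made-up frame with the same layout (subjects as rows, students as columns). The names and marks are hypothetical. It uses the dataframe methods mean, sum, idxmin and idxmax, which are not covered above; the "axis=1" argument makes a method work across each row instead of down each column.

```python
import pandas as pd

toy = pd.DataFrame({"Ann": [50, 70], "Bob": [90, 40]},
                   index=["Maths", "English"])

english_mean = toy.loc["English"].mean()   # average of one row: 55.0
student_means = toy.mean()                 # one average per column (student)
subject_totals = toy.sum(axis=1)           # one total per row (subject)
lowest_subject = subject_totals.idxmin()   # label of the smallest total
top_student = toy.sum().idxmax()           # label of the largest column total

print(english_mean, lowest_subject, top_student)
print(student_means)
```

# The same calls should carry over to the "marks" dataframe, since it has the same orientation.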