A module in Python is a code file (with a .py extension) that defines functions and classes.
Modules exist to support code reuse: instead of copy-pasting code between files, we can import the class and function definitions made in another code file.
The code from a module is imported using the import keyword followed by the module name. This runs the module's code and makes its names available in your program.
import modulename
This imports a specific module.
from modulename import classname
This imports the specified class from the module.
We can import all classes from a module using another syntax as well
from modulename import *
The difference between the two is that with plain import modulename, we refer to a class inside the module as modulename.classname. The prefix avoids ambiguity when multiple modules define the same name.
When we use from modulename import *, we import all the classes into the current namespace, so we can use a classname directly, without the modulename prefix.
Sometimes module names are really long, and it is cumbersome to type them out every time we want to access a class defined in the module. In such cases, we can provide a shorter alias for our program like so:
import modulename as modulealias
from modulename import classname as classalias
Thus, we can alias (provide another name for) whatever we are importing, be it a class or a module.
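As a concrete illustration, here is how these forms look with the standard library's math module (any module behaves the same way):
import math
print(math.sqrt(16))                   # refer to the name via the module: 4.0
from math import sqrt
print(sqrt(16))                        # refer to the name directly: 4.0
import math as m                       # alias the module
from math import sqrt as square_root   # alias the imported name
print(m.pi, square_root(2))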
A Python Package is a directory of Python modules. Using them, we can organize modules in a hierarchical fashion.
They are very similar to Modules.
Because of the organizational structure, to access subpackages/submodules, we have to add a dot between the names of the modules which are in the path.
Imagine the package as a tree. The modules are leaves. The path to the leaf has to be mentioned using the names of the subpackages/submodules in the way, separated by a '.'.
import packagename.modulename
import packagename.subpackagename.modulename
from packagename.subpackagename import modulename
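For example, a hypothetical package laid out on disk as in the comment below (the names are made up for illustration) would be imported with dotted paths:
# Hypothetical package layout, for illustration only:
#   mypackage/
#       __init__.py
#       utils.py
#       stats/
#           __init__.py
#           core.py
import mypackage.utils                  # a module inside the package
from mypackage.stats import core        # a module inside a subpackage
from mypackage.stats.core import mean   # a single name from that module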
Libraries do not have a specific meaning in Python. Modules and packages do the job of general libraries in other languages.
A Library generically means either a module or a package.
Python also comes with the Python Standard Library.
The term 'standard library' refers to the large collection of modules that ships with every core Python distribution, as opposed to the syntax and semantics of the language itself.
In CPython, many of these modules are implemented in C; they handle standard functionality like file I/O and the other core facilities that make Python what it is. The standard library lists more than 200 such core modules.
We will be studying NumPy and Pandas, and having a look at scikit-learn.
Python has a ton of ML libraries. We will go over the basics, and see how one library is used by the other.
NumPy is the fundamental package for scientific computing with Python.
NumPy, short for Numerical Python, provides support for numeric operations including a host of inbuilt functions, and support for multi-dimensional arrays, which form the basis of most computing.
NumPy is extremely fast and takes up much less space than storing data in Python lists. Virtually every ML library uses NumPy in the background.
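To see the speed claim for yourself, a quick (and unscientific) timing comparison like the sketch below typically shows the vectorized NumPy version running an order of magnitude or more faster than the pure-Python loop; exact numbers depend on your machine:
import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

start = time.time()
squares_list = [x * x for x in py_list]   # pure-Python loop
print("list comprehension:", time.time() - start)

start = time.time()
squares_arr = np_arr * np_arr             # vectorized NumPy operation
print("numpy multiply:   ", time.time() - start)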
NumPy's main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy, dimensions are called axes.
A two-dimensional array is the NumPy counterpart of the mathematical matrix.
NumPy’s array class is called ndarray. It is also known by the alias array.
To import numpy into your computing environment, we have to use the import statement.
import numpy
Since numpy is so commonly used, we generally use the alias np to refer to numpy.
import numpy as np
Thus, we have imported NumPy. Now, let's declare an array in numpy.
We can create an array from a Python list:
numpy_arr = np.array([1,2,3])
Thus, we have created an array with 3 elements. This is a one-dimensional array.
To check the number of dimensions, we use the ndim attribute.
arr_name.ndim
numpy_arr.ndim
1
To see the exact shape of the array, we use the shape attribute.
arr_name.shape
numpy_arr.shape
(3,)
Note that the np.array() method takes as an argument a list, or a list of lists for a multi-dimensional array, and converts it to a numpy ndarray.
type(numpy_arr)
numpy.ndarray
A list of lists creates a multidimensional array.
arr_2d = np.array([[1,2],[3,4]])
print(arr_2d)
[[1 2]
 [3 4]]
arr_2d.ndim
2
arr_2d.shape
(2, 2)
We can see the total number of elements in an array using the size attribute.
arr_name.size
numpy_arr.size
3
arr_2d.size
4
NumPy has its own datatypes as well, such as int32 and float64.
The number refers to the number of bits an object of that type occupies in memory.
To find out the datatype, we use the dtype attribute.
arr_name.dtype
numpy_arr.dtype
dtype('int32')
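We can also choose the dtype explicitly when creating an array, or convert an existing array with astype. A small sketch:
arr_f = np.array([1, 2, 3], dtype=np.float64)   # request 64-bit floats
print(arr_f.dtype)                              # float64
arr_i = arr_f.astype(np.int8)                   # convert to 8-bit integers
print(arr_i.dtype)                              # int8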
To create an array of equally spaced numbers, we can use the arange function.
np.arange(stop)
np.arange(start, stop)
np.arange(start, stop, step)
Its arguments are similar to those of the built-in range function. It returns a numpy.ndarray object.
arr = np.arange(5)
print(arr)
[0 1 2 3 4]
arr2 = np.arange(1,5)
print(arr2)
[1 2 3 4]
arr = np.arange(1,10,2)
print(arr)
[1 3 5 7 9]
Thus, the arange() function creates a one-dimensional array.
To allocate an array with a specific shape without initializing its values, we can use the ndarray constructor. The entries are whatever happened to be in memory, which is why the output below looks like garbage; np.empty is the more idiomatic way to do this.
shape = (m,n)
np.ndarray(shape)
rand_arr = np.ndarray((5,5))
print(rand_arr)
[[4.67296746e-307 1.69121096e-306 1.11260483e-306 4.45055939e-308 1.33511018e-306]
 [2.04721462e-306 1.89146896e-307 1.37961302e-306 1.05699242e-307 8.01097889e-307]
 [1.78020169e-306 7.56601165e-307 1.02359984e-306 1.33510679e-306 2.22522597e-306]
 [1.20161390e-306 1.11261162e-306 1.42418172e-306 2.04712906e-306 7.56589622e-307]
 [1.11258277e-307 8.90111708e-307 2.11389826e-307 1.11260619e-306 6.01346930e-154]]
To generate an array of a specific shape and all elements of the same value, we can use functions like ones, zeros and eye.
shape = (m,n)
np.ones(shape)
ones = np.ones((3,3))
print(ones)
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
Note that the shape argument in these functions has to be a single tuple containing the dimensions, not the numbers passed as separate arguments.
shape = (m,n)
np.zeros(shape)
zeros = np.zeros((4,4))
print(zeros)
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
np.eye(n)
identity_mat = np.eye(5)
print(identity_mat)
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
This generates an identity matrix of n x n size. The diagonal elements are 1 and the rest of the elements are 0 in an identity matrix.
NumPy also defines special floating-point values:
np.nan
nan
np.inf
inf
Another method is linspace, which is used to get a certain number of equally spaced points between 2 values (including them).
np.linspace(start_val, stop_val, num_points)
nums = np.linspace(5,10,101)
print(nums)
[ 5. 5.05 5.1 5.15 5.2 5.25 5.3 5.35 5.4 5.45 5.5 5.55 5.6 5.65 5.7 5.75 5.8 5.85 5.9 5.95 6. 6.05 6.1 6.15 6.2 6.25 6.3 6.35 6.4 6.45 6.5 6.55 6.6 6.65 6.7 6.75 6.8 6.85 6.9 6.95 7. 7.05 7.1 7.15 7.2 7.25 7.3 7.35 7.4 7.45 7.5 7.55 7.6 7.65 7.7 7.75 7.8 7.85 7.9 7.95 8. 8.05 8.1 8.15 8.2 8.25 8.3 8.35 8.4 8.45 8.5 8.55 8.6 8.65 8.7 8.75 8.8 8.85 8.9 8.95 9. 9.05 9.1 9.15 9.2 9.25 9.3 9.35 9.4 9.45 9.5 9.55 9.6 9.65 9.7 9.75 9.8 9.85 9.9 9.95 10. ]
nums.shape
(101,)
When you print an array, NumPy displays it in a similar way to nested lists, but with the following layout: the last axis is printed from left to right, the second-to-last from top to bottom, and any remaining axes also from top to bottom, with each slice separated from the next by an empty line.
*reshape* allows you to change the shape of an array. The values are retained, and the original array is left untouched: reshape returns a new array of the requested shape (as the shape checks below confirm).
shape = (m,n)
arr_name.reshape(shape)
arr = np.linspace(2,10,100)
print(arr)
[ 2. 2.08080808 2.16161616 2.24242424 2.32323232 2.4040404 2.48484848 2.56565657 2.64646465 2.72727273 2.80808081 2.88888889 2.96969697 3.05050505 3.13131313 3.21212121 3.29292929 3.37373737 3.45454545 3.53535354 3.61616162 3.6969697 3.77777778 3.85858586 3.93939394 4.02020202 4.1010101 4.18181818 4.26262626 4.34343434 4.42424242 4.50505051 4.58585859 4.66666667 4.74747475 4.82828283 4.90909091 4.98989899 5.07070707 5.15151515 5.23232323 5.31313131 5.39393939 5.47474747 5.55555556 5.63636364 5.71717172 5.7979798 5.87878788 5.95959596 6.04040404 6.12121212 6.2020202 6.28282828 6.36363636 6.44444444 6.52525253 6.60606061 6.68686869 6.76767677 6.84848485 6.92929293 7.01010101 7.09090909 7.17171717 7.25252525 7.33333333 7.41414141 7.49494949 7.57575758 7.65656566 7.73737374 7.81818182 7.8989899 7.97979798 8.06060606 8.14141414 8.22222222 8.3030303 8.38383838 8.46464646 8.54545455 8.62626263 8.70707071 8.78787879 8.86868687 8.94949495 9.03030303 9.11111111 9.19191919 9.27272727 9.35353535 9.43434343 9.51515152 9.5959596 9.67676768 9.75757576 9.83838384 9.91919192 10. ]
arr_2 = arr.reshape((5,20))
arr_2.shape
(5, 20)
arr.shape
(100,)
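A convenience worth knowing: one dimension passed to reshape may be -1, and NumPy infers it from the total number of elements:
arr.reshape(5, -1).shape   # (5, 20); the -1 is inferred as 100 / 5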
*np.random* is a module that has the methods to generate arrays based on random sampling methods.
*np.random.rand* generates values from the uniform distribution over [0,1)
shape = (m,n,k)
dim0, dim1, dim2 = shape
np.random.rand(dim0, dim1, dim2)
Note that here, we do not pass the tuple as an argument, instead we pass the individual numbers as multiple arguments.
rand = np.random.rand(2,3,4)
print(rand)
[[[0.82426839 0.07336335 0.57915416 0.87855818]
  [0.06833727 0.00950488 0.80136408 0.39404887]
  [0.76765689 0.17094639 0.31216921 0.51843793]]

 [[0.04563833 0.02141629 0.74783006 0.02971763]
  [0.39192003 0.09709793 0.22048199 0.3812041 ]
  [0.04469955 0.32250243 0.10270034 0.99520816]]]
*np.random.randn* generates values from the standard normal distribution.
shape = (m,n,k)
dim0, dim1, dim2 = shape
np.random.randn(dim0, dim1, dim2)
randn = np.random.randn(2,3,4)
randn
array([[[ 0.79654234,  1.33910217,  0.64370065,  1.30430276],
        [ 0.20773016,  0.88646579, -0.86081857, -0.90012864],
        [-0.74372168,  0.45233432,  0.24048097,  0.3644812 ]],

       [[-0.94821417, -1.00813354,  0.47037682,  0.1451233 ],
        [ 0.96273584, -0.29439498, -0.02986965,  1.4514224 ],
        [ 0.48699151,  0.0301358 ,  0.32400916, -0.3680934 ]]])
*np.random.randint* generates random integers in range [low, high).
np.random.randint(low, high, number_of_integers_to_generate)
randint = np.random.randint(3,10)
print(randint)
8
randints = np.random.randint(5,100,10)
print(randints)
[63 88 58 51 18 56 94 43 79 86]
*np.max* returns the maximum value from the array along a given axis.
np.max(array, axis)
*np.argmax* returns the index of the maximum value from the array along a given axis.
np.argmax(array, axis)
arr = np.linspace(1,10,10).reshape(2,5)
print(arr)
[[ 1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10.]]
print(np.max(arr, 0))
print(np.argmax(arr, 0))
[ 6.  7.  8.  9. 10.]
[1 1 1 1 1]
print(np.max(arr, 1))
print(np.argmax(arr,1))
[ 5. 10.]
[4 4]
print(np.max(arr))
print(np.argmax(arr))
10.0
9
If no axis argument is provided, the array is treated as flattened: np.max returns the largest value in the entire array, and np.argmax returns the index of that value within the flattened array (9 above, for the value 10).
*np.min* returns the minimum value from the array along a given axis.
np.min(array, axis)
*np.argmin* returns the index of the minimum value from the array along a given axis.
np.argmin(array, axis)
print(arr)
[[ 1.  2.  3.  4.  5.]
 [ 6.  7.  8.  9. 10.]]
print(np.min(arr, 0))
print(np.argmin(arr, 0))
[1. 2. 3. 4. 5.]
[0 0 0 0 0]
print(np.min(arr, 1))
print(np.argmin(arr,1))
[1. 6.]
[0 0]
np.min(arr)
1.0
If no axis argument is provided, it will return the smallest value in the entire (flattened) array.
Arithmetic Operations
In NumPy, arithmetic operations are applied elementwise. In particular, the * operator performs elementwise multiplication, not the traditional matrix product.
arr1 = np.ones((3,3))
arr2 = np.linspace(3,11,9).reshape(3,3)
arr1
array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])
arr2
array([[ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.]])
arr2+arr1
array([[ 4.,  5.,  6.],
       [ 7.,  8.,  9.],
       [10., 11., 12.]])
arr2-arr1
array([[ 2.,  3.,  4.],
       [ 5.,  6.,  7.],
       [ 8.,  9., 10.]])
arr2*arr1
array([[ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.]])
arr2/arr1
array([[ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.]])
arr2**3
array([[  27.,   64.,  125.],
       [ 216.,  343.,  512.],
       [ 729., 1000., 1331.]])
To perform a matrix multiplication, we have to use the matmul function.
np.matmul(arr1,arr2)
array([[18., 21., 24.],
       [18., 21., 24.],
       [18., 21., 24.]])
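On Python 3.5 and later, the @ operator is shorthand for the same matrix product:
arr1 @ arr2   # equivalent to np.matmul(arr1, arr2)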
Methods like np.exp, np.sqrt also apply the specified operations to each element of the array.
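For example:
vals = np.array([1., 4., 9.])
print(np.sqrt(vals))   # [1. 2. 3.], the square root of each element
print(np.exp(vals))    # e raised to each element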
One-dimensional arrays can be indexed, sliced and iterated over, much like lists and other Python sequences.
arr_1d = np.arange(10)**2
arr_1d
array([ 0, 1, 4, 9, 16, 25, 36, 49, 64, 81], dtype=int32)
arr_1d[1:5]
array([ 1, 4, 9, 16], dtype=int32)
arr_1d[:3]
array([0, 1, 4], dtype=int32)
Multidimensional indexing
To index a multidimensional array, we separate the indices of each dimension by a ','
arr_2d = np.random.rand(3,4)
print(arr_2d)
[[0.09261106 0.78401964 0.10431284 0.31181089]
 [0.53867288 0.49331392 0.59958044 0.59758081]
 [0.13307665 0.3694777  0.42227996 0.28372874]]
arr_2d[0,0]
0.09261106145328935
arr_2d[1,2]
0.5995804368556347
To get a slice, we can use the ':' as well.
arr_2d[0,1:3]
array([0.78401964, 0.10431284])
arr_2d[1,1:]
array([0.49331392, 0.59958044, 0.59758081])
If we want all elements along a particular dimension, we use ':' by itself, with no number before or after it.
arr_2d[:,2].reshape(3,1)
array([[0.10431284],
       [0.59958044],
       [0.42227996]])
arr_2d[0,:]
array([0.09261106, 0.78401964, 0.10431284, 0.31181089])
NumPy supports broadcasting: in the assignment below, the scalar 10 is stretched (broadcast) to fill every element of the selected slice.
arr_1d[:5] = 10
print(arr_1d)
[10 10 10 10 10 25 36 49 64 81]
To access all elements of a particular array, we can use [:]
arr_1d[:] = 5
print(arr_1d)
[5 5 5 5 5 5 5 5 5 5]
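Broadcasting also works between arrays of compatible shapes, not just between an array and a scalar. As a small sketch, adding a length-3 row vector to a 2x3 matrix of zeros stretches the vector across both rows:
mat = np.zeros((2, 3))
row = np.array([1., 2., 3.])
print(mat + row)   # the row vector is broadcast across both rows of mat
# [[1. 2. 3.]
#  [1. 2. 3.]]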
In NumPy, when we want to make a copy of an array, we use the copy method. Simply assigning one variable to another does not create a separate array; it just adds another name (alias) for the same data, as the next example shows.
my_arr = arr_1d
print(my_arr)
print(arr_1d)
[5 5 5 5 5 5 5 5 5 5]
[5 5 5 5 5 5 5 5 5 5]
my_arr[:] = 10
print(my_arr)
print(arr_1d)
[10 10 10 10 10 10 10 10 10 10]
[10 10 10 10 10 10 10 10 10 10]
my_arr = arr_1d.copy()
print(my_arr)
print(arr_1d)
[10 10 10 10 10 10 10 10 10 10]
[10 10 10 10 10 10 10 10 10 10]
my_arr[:] = 5
print(my_arr)
print(arr_1d)
[5 5 5 5 5 5 5 5 5 5]
[10 10 10 10 10 10 10 10 10 10]
Selection
We can also broadcast relational operators.
arr = np.linspace(1,10,10)
print(arr)
[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]
print(arr>5)
[False False False False False True True True True True]
This will return a boolean array, which can be used to selectively filter elements from an array.
arr[arr>5]
array([ 6., 7., 8., 9., 10.])
arr[arr%2==0]
array([ 2., 4., 6., 8., 10.])
Thus, we can have an expression inside the [], that gives us as a result a boolean array of the same shape as the original array.
Time for a few exercises:
np.linspace(0.1,0.9,9)
array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
np.linspace(2,100,50).reshape(5,10)
array([[  2.,   4.,   6.,   8.,  10.,  12.,  14.,  16.,  18.,  20.],
       [ 22.,  24.,  26.,  28.,  30.,  32.,  34.,  36.,  38.,  40.],
       [ 42.,  44.,  46.,  48.,  50.,  52.,  54.,  56.,  58.,  60.],
       [ 62.,  64.,  66.,  68.,  70.,  72.,  74.,  76.,  78.,  80.],
       [ 82.,  84.,  86.,  88.,  90.,  92.,  94.,  96.,  98., 100.]])
np.random.randint(3,15,20).reshape(4,5)
array([[10,  9,  8, 14, 14],
       [12,  5,  4,  7,  4],
       [ 7,  7,  4, 11,  4],
       [14, 14,  9,  9,  3]])
Pandas is like Excel for Python: it allows a lot more extended functionality, and works on NumPy arrays in the background.
The basic Pandas datatype is a Series. It stores the kind of data that would live in a single Excel column.
It is built on top of a NumPy array. What differentiates a Series from a NumPy array is that a Series can have labels: it can be indexed by a label instead of just a numeric position. Think of it as a combination of a list and a dictionary: there is an order to the sequence, but you can also define arbitrary labels for the values.
A Series also doesn't need to hold numeric data; it can hold any arbitrary Python object.
import pandas as pd
We can convert a list, Numpy array or a dictionary to a series.
my_list = [3,4,10]
arr = np.array(my_list)
my_dict = {'a':1,'b':3,'c':5}
*pd.Series* is a class; its constructor takes as arguments the data (a list or NumPy array) and the labels (a list). If we pass a dictionary, the keys become the labels, so we do not need to pass them separately.
pd.Series(values, labels)
pd.Series(my_list)
0     3
1     4
2    10
dtype: int64
pd.Series(arr)
0     3
1     4
2    10
dtype: int32
ser_dict = pd.Series(my_dict)
print(ser_dict)
a    1
b    3
c    5
dtype: int64
ser_dict[0]
1
ser_dict['a']
1
labels = ['a','c','e']
pd.Series(my_list, labels)
a     3
c     4
e    10
dtype: int64
pd.Series(arr, labels)
a     3
c     4
e    10
dtype: int32
Pandas primarily works with DataFrames. DataFrames are like Excel sheets. An entire data table is stored in a Pandas DataFrame.
It is a collection of Pandas Series objects, which share the same index.
arr = np.random.rand(4,5)
Here we manually specify the index labels and column names (if omitted, Pandas defaults to integer positions).
df = pd.DataFrame(arr, index = 'A B C D'.split(), columns = 'V W X Y Z'.split())
df
| | V | W | X | Y | Z |
---|---|---|---|---|---|
A | 0.827443 | 0.616891 | 0.438627 | 0.006388 | 0.939388 |
B | 0.023023 | 0.906223 | 0.397198 | 0.536125 | 0.234519 |
C | 0.580901 | 0.999647 | 0.316373 | 0.813785 | 0.988438 |
D | 0.253369 | 0.074600 | 0.526365 | 0.790307 | 0.843681 |
Selection and Indexing
dataframe_name[Column_name]
df['W']
A    0.616891
B    0.906223
C    0.999647
D    0.074600
Name: W, dtype: float64
Thus, we can get a column by using the column name.
We can also get multiple columns by passing a list of column names.
df[['W','Z','X']]
| | W | Z | X |
---|---|---|---|
A | 0.616891 | 0.939388 | 0.438627 |
B | 0.906223 | 0.234519 | 0.397198 |
C | 0.999647 | 0.988438 | 0.316373 |
D | 0.074600 | 0.843681 | 0.526365 |
Each DataFrame column is a series!
type(df['W'])
pandas.core.series.Series
To select rows, we have to use the loc and iloc attributes.
dataframe.loc[index_name]
dataframe.iloc[index_number]
df.loc['A']
V    0.827443
W    0.616891
X    0.438627
Y    0.006388
Z    0.939388
Name: A, dtype: float64
df.iloc[0]
V    0.827443
W    0.616891
X    0.438627
Y    0.006388
Z    0.939388
Name: A, dtype: float64
Just like with NumPy arrays, we can get multiple rows and columns. But since we have index and column names, we pass lists of the individual names.
df.loc[['A','C'],['W','Y']]
| | W | Y |
---|---|---|
A | 0.616891 | 0.006388 |
C | 0.999647 | 0.813785 |
df.loc['A','W']
0.6168906072660919
The conditional selection is also similar to NumPy.
df[df>0.5]
| | V | W | X | Y | Z |
---|---|---|---|---|---|
A | 0.827443 | 0.616891 | NaN | NaN | 0.939388 |
B | NaN | 0.906223 | NaN | 0.536125 | NaN |
C | 0.580901 | 0.999647 | NaN | 0.813785 | 0.988438 |
D | NaN | NaN | 0.526365 | 0.790307 | 0.843681 |
For multiple conditions, we have to use & instead of and, | instead of or.
df[df['W']>0.5]
| | V | W | X | Y | Z |
---|---|---|---|---|---|
A | 0.827443 | 0.616891 | 0.438627 | 0.006388 | 0.939388 |
B | 0.023023 | 0.906223 | 0.397198 | 0.536125 | 0.234519 |
C | 0.580901 | 0.999647 | 0.316373 | 0.813785 | 0.988438 |
This translates to: give me the rows where the value of attribute W is greater than 0.5.
Thus, we can pass a condition on a specific attribute(column) value, and return the records(rows) that satisfy the given attribute condition.
This is based on the assumption that we always store tabular data, where rows are objects and columns are the respective attribute values.
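As mentioned above, multiple conditions are combined with & and |; the parentheses around each condition are required because of operator precedence. A sketch:
df[(df['W'] > 0.5) & (df['Y'] > 0.5)]   # rows where both conditions hold
df[(df['W'] > 0.9) | (df['Y'] < 0.1)]   # rows where either condition holds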
To remove a row/column from the dataframe, we use the drop method.
dataframe.drop(labels, axis)
df.drop('A')
| | V | W | X | Y | Z |
---|---|---|---|---|---|
B | 0.023023 | 0.906223 | 0.397198 | 0.536125 | 0.234519 |
C | 0.580901 | 0.999647 | 0.316373 | 0.813785 | 0.988438 |
D | 0.253369 | 0.074600 | 0.526365 | 0.790307 | 0.843681 |
df
| | V | W | X | Y | Z |
---|---|---|---|---|---|
A | 0.827443 | 0.616891 | 0.438627 | 0.006388 | 0.939388 |
B | 0.023023 | 0.906223 | 0.397198 | 0.536125 | 0.234519 |
C | 0.580901 | 0.999647 | 0.316373 | 0.813785 | 0.988438 |
D | 0.253369 | 0.074600 | 0.526365 | 0.790307 | 0.843681 |
By default, we remove rows. If we want to remove a specific column, we have to add the axis=1 argument.
df.drop('V', axis=1, inplace=True)
Note that when we perform changes to a dataframe using these methods, the dataframe itself is not changed; instead we get back a copy with the specified operation performed on it.
To make the change stick, we pass an argument called inplace with the Boolean value True.
df.drop('A', inplace=True)
df
| | W | X | Y | Z |
---|---|---|---|---|
B | 0.906223 | 0.397198 | 0.536125 | 0.234519 |
C | 0.999647 | 0.316373 | 0.813785 | 0.988438 |
D | 0.074600 | 0.526365 | 0.790307 | 0.843681 |
Adding a column/row in Pandas is very simple as well. We can assume that the column/row already exists and assign values to that column/row.
df['V'] = np.random.rand(3)
df
| | W | X | Y | Z | V |
---|---|---|---|---|---|
B | 0.906223 | 0.397198 | 0.536125 | 0.234519 | 0.263809 |
C | 0.999647 | 0.316373 | 0.813785 | 0.988438 | 0.487261 |
D | 0.074600 | 0.526365 | 0.790307 | 0.843681 | 0.097626 |
df.loc['A'] = np.random.rand(5)
df
| | W | X | Y | Z | V |
---|---|---|---|---|---|
B | 0.906223 | 0.397198 | 0.536125 | 0.234519 | 0.263809 |
C | 0.999647 | 0.316373 | 0.813785 | 0.988438 | 0.487261 |
D | 0.074600 | 0.526365 | 0.790307 | 0.843681 | 0.097626 |
A | 0.169272 | 0.540569 | 0.591681 | 0.872657 | 0.763574 |
If we want to remove the index and get it as a data column, we can use a function called *reset_index*
df.reset_index(inplace=True)
df
| | index | W | X | Y | Z | V |
---|---|---|---|---|---|---|
0 | B | 0.906223 | 0.397198 | 0.536125 | 0.234519 | 0.263809 |
1 | C | 0.999647 | 0.316373 | 0.813785 | 0.988438 | 0.487261 |
2 | D | 0.074600 | 0.526365 | 0.790307 | 0.843681 | 0.097626 |
3 | A | 0.169272 | 0.540569 | 0.591681 | 0.872657 | 0.763574 |
If we want to make a column as the index, we can use a function called *set_index*
df.set_index('index', inplace=True)
df
| | W | X | Y | Z | V |
---|---|---|---|---|---|
index | | | | | |
B | 0.906223 | 0.397198 | 0.536125 | 0.234519 | 0.263809 |
C | 0.999647 | 0.316373 | 0.813785 | 0.988438 | 0.487261 |
D | 0.074600 | 0.526365 | 0.790307 | 0.843681 | 0.097626 |
A | 0.169272 | 0.540569 | 0.591681 | 0.872657 | 0.763574 |
df = pd.DataFrame({'A':[10,np.nan,0],
'C':[np.nan,2,20],
'E':[1,5,3]})
To handle missing values (np.nan), Pandas provides inbuilt methods.
To remove records/attributes with missing values, we use *dropna*
dataframe.dropna(axis)
df.dropna()
| | A | C | E |
---|---|---|---|
2 | 0.0 | 20.0 | 3 |
df.dropna(axis=1)
| | E |
---|---|
0 | 1 |
1 | 5 |
2 | 3 |
To fill in missing values, we use *fillna*
dataframe.fillna(value)
df.fillna(1000, inplace=True)
df
| | A | C | E |
---|---|---|---|
0 | 10.0 | 1000.0 | 1 |
1 | 1000.0 | 2.0 | 5 |
2 | 0.0 | 20.0 | 3 |
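A common alternative to a fixed constant is to fill each column with a statistic of that column, such as its mean. A small sketch with a made-up frame (df2 is hypothetical):
df2 = pd.DataFrame({'A': [10, np.nan, 0], 'C': [np.nan, 2, 20]})
df2.fillna(df2.mean())   # NaN in A becomes 5.0; NaN in C becomes 11.0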
We can apply a function to each row/column of a dataframe, by using the apply method.
df.apply(function, axis)
def func(x):
print(x)
print(type(x))
return 0
df.apply(func)
0      10.0
1    1000.0
2       0.0
Name: A, dtype: float64
<class 'pandas.core.series.Series'>
0    1000.0
1       2.0
2      20.0
Name: C, dtype: float64
<class 'pandas.core.series.Series'>
0    1.0
1    5.0
2    3.0
Name: E, dtype: float64
<class 'pandas.core.series.Series'>
A    0
C    0
E    0
dtype: int64
df.apply(func, axis=1)
A      10.0
C    1000.0
E       1.0
Name: 0, dtype: float64
<class 'pandas.core.series.Series'>
A    1000.0
C       2.0
E       5.0
Name: 1, dtype: float64
<class 'pandas.core.series.Series'>
A     0.0
C    20.0
E     3.0
Name: 2, dtype: float64
<class 'pandas.core.series.Series'>
0    0
1    0
2    0
dtype: int64
In Python, we can also write anonymous one-line functions, using the lambda keyword.
lambda x: 1 if x>2 else 0
<function __main__.<lambda>(x)>
Thus, we can provide this as an argument to the apply function as well.
df['A'].apply((lambda x: 1 if x>2 else 0))
0    1
1    1
2    0
Name: A, dtype: int64
We can also get the counts of each value in a Series.
df.value_counts()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-143-986e25863b45> in <module>()
----> 1 df.value_counts()

~\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   3612             if name in self._info_axis:
   3613                 return self[name]
-> 3614             return object.__getattribute__(self, name)
   3615
   3616     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'value_counts'
df['C'].value_counts()
20.0      1
2.0       1
1000.0    1
Name: C, dtype: int64
Thus, in this version of Pandas, value_counts is applicable only to Series, not to DataFrames. (Newer Pandas releases, 1.1 and later, do add a DataFrame.value_counts method.)
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
'B': ['B4', 'B5', 'B6', 'B7'],
'C': ['C4', 'C5', 'C6', 'C7'],
'D': ['D4', 'D5', 'D6', 'D7']},
index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
'B': ['B8', 'B9', 'B10', 'B11'],
'C': ['C8', 'C9', 'C10', 'C11'],
'D': ['D8', 'D9', 'D10', 'D11']},
index=[8, 9, 10, 11])
df1
| | A | B | C | D |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
df2
| | A | B | C | D |
---|---|---|---|---|
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
df3
| | A | B | C | D |
---|---|---|---|---|
8 | A8 | B8 | C8 | D8 |
9 | A9 | B9 | C9 | D9 |
10 | A10 | B10 | C10 | D10 |
11 | A11 | B11 | C11 | D11 |
pd.concat glues a list of DataFrames together along an axis: with the default axis=0 they are stacked vertically, while with axis=1 they are placed side by side and the rows are aligned on the index, so non-overlapping indexes produce NaNs.
pd.concat(list_of_dataframes, axis)
pd.concat([df1, df2], axis=1)
| | A | B | C | D | A | B | C | D |
---|---|---|---|---|---|---|---|---|
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN |
3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 |
5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 |
6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 |
7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 |
pd.concat([df1,df2])
| | A | B | C | D |
---|---|---|---|---|
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
pd.merge(left_df, right_df, how, on)
merge performs a SQL-style join of two DataFrames on a shared column (here, the 'key' column).
left
| | A | B | key |
---|---|---|---|
0 | A0 | B0 | K0 |
1 | A1 | B1 | K1 |
2 | A2 | B2 | K2 |
3 | A3 | B3 | K3 |
right
| | C | D | key |
---|---|---|---|
0 | C0 | D0 | K0 |
1 | C1 | D1 | K1 |
2 | C2 | D2 | K2 |
3 | C3 | D3 | K3 |
pd.merge(left, right, how = 'inner',on = 'key')
| | A | B | key | C | D |
---|---|---|---|---|---|
0 | A0 | B0 | K0 | C0 | D0 |
1 | A1 | B1 | K1 | C1 | D1 |
2 | A2 | B2 | K2 | C2 | D2 |
3 | A3 | B3 | K3 | C3 | D3 |
dataframe.join(dataframe_2, how)
join is like merge, except that it combines the two DataFrames on their indexes rather than on a shared column.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
left
| | A | B |
---|---|---|
K0 | A0 | B0 |
K1 | A1 | B1 |
K2 | A2 | B2 |
right
| | C | D |
---|---|---|
K0 | C0 | D0 |
K2 | C2 | D2 |
K3 | C3 | D3 |
left.join(right)
| | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | NaN | NaN |
K2 | A2 | B2 | C2 | D2 |
The default join (shown above) is a left join. Inner join removes any non-overlapping records: it only keeps the records for which the indexes match in both DataFrames.
left.join(right, how='inner')
| | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K2 | A2 | B2 | C2 | D2 |
Outer join keeps any non-overlapping records, and fills in the missing values with NaNs.
left.join(right, how='outer')
| | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | NaN | NaN |
K2 | A2 | B2 | C2 | D2 |
K3 | NaN | NaN | C3 | D3 |
Left join keeps all records from the left (calling) dataframe. It fills the missing values with NaNs.
left.join(right, how='left')
| | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | NaN | NaN |
K2 | A2 | B2 | C2 | D2 |
Right join keeps all records from the right (argument) dataframe. It fills the missing values with NaNs.
left.join(right, how='right')
| | A | B | C | D |
---|---|---|---|---|
K0 | A0 | B0 | C0 | D0 |
K2 | A2 | B2 | C2 | D2 |
K3 | NaN | NaN | C3 | D3 |
In short, Pandas allows you to perform all database-style operations on relational-database-like tables.
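One more database-style operation worth a quick sketch is GROUP BY aggregation, available through the groupby method (the sales frame below is made up for illustration):
sales = pd.DataFrame({'store': ['X', 'X', 'Y', 'Y'],
                      'revenue': [10, 20, 5, 15]})
sales.groupby('store').sum()   # total revenue per store: X -> 30, Y -> 20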
Let us explore the use of Scikit Learn in an ML problem. This will also set the tone for our discussions on Machine Learning in the next few lectures.
from sklearn.datasets import load_iris
iris_data = load_iris()
The Iris Dataset is a famous dataset, the data for which has been included directly in the Scikit Learn library. Let's explore it further.
Let's look at the description
print(iris_data['DESCR'])
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken from Fisher's paper. Note that it's the same as in R, but not as in the UCI Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day. (See Duda & Hart, for example.) The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems" Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis. (Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System Structure and Classification Rule for Recognition in Partially Exposed Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
Number of instances is your training data size.
Number of attributes is your number of features for each example.
Class is your target variable.
print(iris_data['feature_names'])
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
iris_dataset = pd.DataFrame(data=iris_data['data'], columns=iris_data['feature_names'])
We make a dataframe object out of the data which was in a NumPy array. Now, we can use Pandas functions to get further insights into the dataset.
iris_dataset.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
3 | 4.6 | 3.1 | 1.5 | 0.2 |
4 | 5.0 | 3.6 | 1.4 | 0.2 |
The .head() method gives us the first few rows of the dataframe (five by default).
iris_dataset['target'] = iris_data['target']
Inserting a new column is very simple in pandas. We can refer to the column as if it existed, and then pass in data to be stored.
iris_dataset.head()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
iris_data['target_names']
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
These are the names for the classes:
iris_dataset['target_name'] = np.apply_along_axis(lambda x: iris_data['target_names'][x], 0, iris_data['target'])
NumPy has an 'apply along axis' function, using which you can apply a function along a particular axis of a given array.
iris_dataset.tail()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_name |
---|---|---|---|---|---|---|
145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | 2 | virginica |
Why convert an array to a dataframe?
Because now we can perform what is known as Exploratory Data Analysis, using only a few lines of code. Or use Pandas and Seaborn for what they're good at.
iris_dataset.describe(include='all')
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target | target_name |
---|---|---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150.000000 | 150 |
unique | NaN | NaN | NaN | NaN | NaN | 3 |
top | NaN | NaN | NaN | NaN | NaN | setosa |
freq | NaN | NaN | NaN | NaN | NaN | 50 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 | 1.000000 | NaN |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 | 0.819232 | NaN |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 | 0.000000 | NaN |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 | 0.000000 | NaN |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 | 1.000000 | NaN |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 | 2.000000 | NaN |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 | 2.000000 | NaN |
The describe() function provides statistics on each data-column in the dataframe. Thus, we can quickly understand our data distribution.
iris_dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
sepal length (cm)    150 non-null float64
sepal width (cm)     150 non-null float64
petal length (cm)    150 non-null float64
petal width (cm)     150 non-null float64
target               150 non-null int32
target_name          150 non-null object
dtypes: float64(4), int32(1), object(1)
memory usage: 6.5+ KB
The info() function tells us the number of non-null values in each column, along with the datatype of each column.
iris_dataset.corr()
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
---|---|---|---|---|---|
sepal length (cm) | 1.000000 | -0.117570 | 0.871754 | 0.817941 | 0.782561 |
sepal width (cm) | -0.117570 | 1.000000 | -0.428440 | -0.366126 | -0.426658 |
petal length (cm) | 0.871754 | -0.428440 | 1.000000 | 0.962865 | 0.949035 |
petal width (cm) | 0.817941 | -0.366126 | 0.962865 | 1.000000 | 0.956547 |
target | 0.782561 | -0.426658 | 0.949035 | 0.956547 | 1.000000 |
In this dataset, we see that there are no missing values, so we can skip that step. However, many linear algorithms suffer if the features are not all on the same scale, so we use normalization/scaling to bring all variables to a common scale.
from sklearn.preprocessing import MinMaxScaler, StandardScaler
MinMaxScaler scales the values using the minimum and maximum values of the data, to a given range provided by the user.
StandardScaler scales the data to have zero mean and unit variance.
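As a quick sketch of the difference, scaling a small single-column dataset with both scalers:
data = np.array([[1.], [2.], [3.], [4.]])
print(MinMaxScaler().fit_transform(data).ravel())     # [0. 0.33333333 0.66666667 1.], scaled into [0, 1]
print(StandardScaler().fit_transform(data).ravel())   # zero mean, unit variance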
X = iris_dataset.drop(['target', 'target_name'], axis=1)
Y = iris_dataset['target']
scaler = StandardScaler()
X_sc = scaler.fit_transform(X)
The fit method uses the data to get a few variables it needs for future use, and the transform method applies the transformation to the given input data.
fit_transform performs both in one function call.
print("Minimum:",X_sc.min())
print("Maximum:",X_sc.max())
print("Mean:",X_sc.mean())
print("Standard Devaition:",X_sc.std())
Minimum: -2.43394714190809 Maximum: 3.0907752482994253 Mean: -1.4684549872375404e-15 Standard Devaition: 1.0
from sklearn.model_selection import train_test_split
train_test_split performs a split of the data into training and test sets.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, shuffle=True)
We do not actually need feature scaling in this example, so we apply train_test_split to the original, unscaled data.
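Had we wanted scaled features, the correct pattern would be to fit the scaler on the training split only and then reuse the learned statistics on the test split; a minimal sketch:
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)   # learn mean and std from the training data only
X_test_sc = scaler.transform(X_test)         # apply the same transformation; never fit on test data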
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
Every ML algorithm (estimator) in Scikit Learn exposes fit and score methods.
We always first create an object of the class of the algorithm, and provide parameters to it during the creation of the object.
Next, we call the .fit() method to train the algorithm on the data we provide as arguments to this function.
Finally, we can call .score() to get the score of the algorithm.
lr = LogisticRegression(solver = 'lbfgs', multi_class='auto')
lr.fit(X_train, Y_train)
print(lr.score(X_train, Y_train))
print(lr.score(X_test, Y_test))
0.9910714285714286
0.9210526315789473
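The fitted model can also produce explicit predictions via its predict method, which score uses internally; the exact labels below depend on the random split:
preds = lr.predict(X_test)   # predicted class labels for the test set
print(preds[:5])             # e.g. [2 0 1 1 0]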
from sklearn.model_selection import cross_val_score
cross_val_score runs cross-validation with the given model and returns an array of scores obtained on the validation folds.
cv = cross_val_score(lr, X, Y, cv=5)
print(cv)
print(cv.mean())
[0.96666667 1.         0.93333333 0.96666667 1.        ]
0.9733333333333334
C:\Users\Aditya Khandelwal\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:757: ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations. "of iterations.", ConvergenceWarning)
dtc = DecisionTreeClassifier()
dtc.fit(X_train, Y_train)
print(dtc.score(X_train, Y_train))
print(dtc.score(X_test, Y_test))
1.0
0.9210526315789473
cv = cross_val_score(dtc, X, Y, cv=5)
print(cv)
print(cv.mean())
[0.96666667 0.96666667 0.9        1.         1.        ]
0.9666666666666668
mlp = MLPClassifier(hidden_layer_sizes=(10,10), max_iter=3000)
mlp.fit(X_train, Y_train)
print(mlp.score(X_train, Y_train))
print(mlp.score(X_test, Y_test))
0.9910714285714286
0.9473684210526315
cv = cross_val_score(mlp, X, Y, cv=5)
print(cv)
print(cv.mean())
[1.         1.         0.96666667 0.93333333 1.        ]
0.9800000000000001
All the above algorithms have their own hyperparameters that need tuning. Hyperparameters are parameters of the algorithm that are not learned from the data; we have to set them ourselves.
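For instance, max_depth of the decision tree or hidden_layer_sizes of the MLP above are hyperparameters. A quick hand-tuning sketch:
for depth in [1, 2, 3, None]:                     # None means grow the tree fully
    model = DecisionTreeClassifier(max_depth=depth)
    scores = cross_val_score(model, X, Y, cv=5)
    print(depth, scores.mean())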
Thus, we can observe how Pandas dataframes are directly used as inputs in Scikit Learn.
The next lectures will be focused on understanding the theory behind Machine Learning.