Lecture 1: Numpy

Before we start with Numpy...

Because Numpy is built on Python, we want to make sure you know the basics of Python3 so that you can follow the material we will discuss throughout the course without significant difficulty. Let's begin with basic Python3 syntax before we dig into the actual data science.

*If you are an experienced Python programmer, you can skip this section and start from the What is Numpy? section.*

When we program, we need to know what variables are, how to create them, and how to use them. A **variable** is a name that refers to a value or object stored in memory. Let's see an example:

In [ ]:

```
x = 3
y = 2
```

What did we just do above? We just created two variables, one called **x** which stores a numerical value of 3, and the other one called **y** which stores a numerical value of 2. We call it **assignment** because we **assigned** a value of 3 to x and a value of 2 to y. We can also do **multiple assignments** like:

In [ ]:

```
x, y, z = 1, 2, 3
```

This will assign x as 1, y as 2, and z as 3. Really intuitive isn't it? :)

Numbers are not the only type we can assign. See different types of variables we can create:

In [ ]:

```
# We can write any statement as "comment" if we put a hashtag in front of a line.
# These comments will be ignored by computer when it tries to read the code, so you can write anything here @#!#$
name = "David Gries" # any combination of characters called String
years_of_programming_experience = 56 # Integer
is_married = True # is it true or false? called Boolean
list_of_course_taught = ["CS1110", "CS2110"] # a list of things
```

The most important data type we will discuss is **list**, because it can be interpreted as a one-dimensional vector, which can be seen as a matrix with only one row (Numpy is a package that deals with matrix and matrix-related operations).

As shown above, a list is a data type that stores a sequence of items separated by commas and enclosed in square brackets []. One big advantage of a list is that its items can be of different types: for example, a list can contain a number, a string, and a boolean all at the same time.

In [2]:

```
cds_list = [1998, "cds", True]
bigger_list = ["list", 1234, cds_list] # list within a list
```

So far, we have only discussed initializing a list. Now, let's learn how to work with the items inside a list. Python is a **0-indexed language**, which means we start counting from 0.
We can access the values in a list by using these indexes:

In [4]:

```
bigger_list[1] # returns 1234, which is our "first" element (our "0th" element is "list" string)
```

Out[4]:

You can also set the values to a specific index:

In [7]:

```
my_list = ["A", "B", "C"]
my_list[0] = "D"
my_list # returns ['D', 'B', 'C']
```

Out[7]:

A more advanced technique for accessing the elements of a list is the slice operator ([start:end]). Indexes start at 0 at the beginning of the list and run up to the length of the list minus 1; a slice includes the start index and excludes the end index. You might also want to get used to the plus sign (+), which is the list concatenation operator, and the asterisk (*), which is the repetition operator. For more details, see the Python documentation on list operations.

In [16]:

```
my_list = ["Abby", "Ann", "Cameron", "Grace", "Ryan", "Shubhom"]
my_list[0:2] # returns ["Abby", "Ann"]
my_list + ["Jared", "Daewon"] # returns ['Abby', 'Ann', 'Cameron', 'Grace', 'Ryan', 'Shubhom', 'Jared', 'Daewon']
["Jared", "Daewon"] * 3 # returns ['Jared', 'Daewon', 'Jared', 'Daewon', 'Jared', 'Daewon']
```

Out[16]:
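As a small sketch of a few more slicing forms, the example below reuses the same list of names. Negative indexes and the step value are standard Python features, but they are not covered above, so treat this as optional extra material.

```python
my_list = ["Abby", "Ann", "Cameron", "Grace", "Ryan", "Shubhom"]

last_item = my_list[-1]      # negative indexes count from the end
from_third = my_list[2:]     # omit the end index to run to the end of the list
all_but_last = my_list[:-1]  # omit the start index to begin at index 0
every_other = my_list[::2]   # a third value sets the step

print(last_item)     # Shubhom
print(from_third)    # ['Cameron', 'Grace', 'Ryan', 'Shubhom']
print(all_but_last)  # ['Abby', 'Ann', 'Cameron', 'Grace', 'Ryan']
print(every_other)   # ['Abby', 'Cameron', 'Ryan']
```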

What is Numpy?

Numerical Python, or "Numpy" for short, is a foundational package on which many of the most common data science packages are built. Numpy provides us with multi-dimensional arrays, called ndarrays, which can be created as vectors or matrices. We can use numpy to manipulate datasets to make them easier to work with. Numpy also comes with a number of helpful statistical methods.

The key features of numpy are:

**ndarrays** are n-dimensional arrays of the same data type which are fast and space-efficient. There are many built-in methods for ndarrays which allow for rapid processing of data without using loops (e.g., compute the mean).

**Broadcasting** is a tool which dictates how operations between multi-dimensional arrays of different sizes will be carried out.

**Vectorization** allows for numeric operations on ndarrays.

**Input/Output** simplifies reading and writing of data from/to file.
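To make "without using loops" concrete, here is a minimal sketch that squares every element twice: once with an explicit Python loop and once with a single vectorized expression. The variable names are illustrative only.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0])

# Loop version: visit each element in Python and collect the results
squared_loop = []
for v in values:
    squared_loop.append(v * v)

# Vectorized version: one expression applied to the whole array at once
squared_vec = values * values

print(squared_loop)
print(squared_vec)
print(values.mean())  # a built-in ndarray method, no loop needed
```

Both versions produce the same numbers; the vectorized form is shorter and lets numpy do the work in compiled code.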

**Additional Recommended Resources:**

Numpy Documentation

*Python for Data Analysis* by Wes McKinney

*Python Data Science Handbook* by Jake VanderPlas

Intro to ndarrays

**ndarrays** are time and space-efficient multidimensional arrays at the core of numpy. One important thing to note is that all elements in an ndarray must be of the same type. Let's get started by creating ndarrays using the numpy package.
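The same-type rule can be observed through the `dtype` attribute. The sketch below assumes nothing beyond numpy itself: when the inputs are mixed, numpy converts them all to one common type rather than storing each item as-is.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])        # the integers are converted to floats
with_text = np.array([1, "two", 3])  # everything is converted to strings

print(ints.dtype)       # an integer type (exact width depends on the platform)
print(mixed.dtype)      # float64
print(with_text.dtype)  # a unicode string type
print(with_text)        # note the quotes around every element
```

This is the main difference from a Python list, which keeps each item in its original type.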

Creating and modifying Rank 1 ndarrays:

The "as" keyword in the import statement allows us to give a local name to the numpy package, so that we can refer to it as "np" rather than "numpy" in subsequent code. In the following lines of code, we use a couple of methods in the numpy package:

**np.array([comma-separated elements here])** creates a rank 1 array (like a vector) with the elements specified between the brackets

**nameOfArray.shape** is an attribute (no parentheses) that holds a tuple of integers representing the size of the array in each dimension

In [2]:

```
import numpy as np
arr = np.array([3, 2, 1]) # Create a rank 1 array
print(type(arr)) # The type of an ndarray is "<class 'numpy.ndarray'>"
```

In [3]:

```
# the shape of arr
print(arr.shape)
```

In [4]:

```
# access each element in the array using its index
print(arr[0], arr[1], arr[2])
```

Ndarrays are mutable, which means that the contents of an array can be changed after it is created.

In [5]:

```
arr[0] = 100 # change the first element of the array (the element at index 0)
print(arr)
```

Creating a Rank 2 ndarray:

A rank 2 **ndarray** has two dimensions. Notice the format below of [ [row] , [row] ]. 2 dimensional arrays are great for representing matrices which are often useful in data science. We use the same methods as before to analyze rank 2 arrays as well.

In [8]:

```
arr2 = np.array([[1,2,3],[6,5,4]]) # Create a rank 2 array
print(arr2) # print the array
print(arr2.shape) # print number of rows, columns
print(arr2[0, 0], arr2[0, 1], arr2[1, 0]) # print the elements at the specified indices [row, column]
```

Different ways to create ndarrays:

In the code below, we create a number of different sized arrays with different shapes and values. Numpy has some built in methods (listed below) which help us quickly and easily create multidimensional arrays with pre-filled values.

**np.zeros((dimensions))** creates an array of zeros with the specified dimensions

**np.full((dimensions), value)** creates an array with the specified dimensions where every element is the specified value

**np.eye(dimensions)** creates an array where the elements on the diagonal are 1s and all other elements are 0

**np.ones(dimensions)** creates an array of ones

**np.random.random(dimensions)** generates an array of random floating-point values between 0 and 1

In [9]:

```
import numpy as np
# create a 3x4 array of zeros
ex1 = np.zeros((3, 4))
print(ex1)
```

In [41]:

```
# create a 2x3 array filled with 6.0
ex2 = np.full((2,3), 6.0)
print(ex2)
```

In [42]:

```
# create a 3x3 matrix with 1s on the diagonal and all other values 0
ex3 = np.eye(3,3)
print(ex3)
```

In [47]:

```
# create a 4x2 array of ones
ex4 = np.ones((4,2))
print(ex4)
```

In [48]:

```
# create a 2x3 array of random floating-point numbers between 0 and 1
ex5 = np.random.random((2,3))
print(ex5)
```

Using Array Indexing

It's often more convenient to look only at specific sections of arrays. We can accomplish this using **array indexing**.

Slice indexing (slicing):

We can use slice indexing to pull out sub-regions of ndarrays. The general syntax for this is array[start index:end index]. Note that the start index is included in the slice, while the end index is not.

In [2]:

```
import numpy as np
# Rank 2 array of shape (4, 4)
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34], [41,42,43,44]])
print(an_array)
```

Use array slicing to get a subarray of the first 2 rows x the last 2 columns.

In [3]:

```
array_slice = an_array[:2, 2:]
print(array_slice)
```

When you modify a slice, you actually modify the underlying array. This is because array slicing does not create a new array; instead, it creates a view onto the section of the original array's data that you've selected. Also, note that the element at a given index in a slice often does not correspond to the element at that index in the original array, since a slice is a section of the original array. For example:

In [4]:

```
print('Initial element at 0, 2: ', an_array[0, 2]) # print the element at 0, 2
array_slice[0, 0] = 55 # array_slice[0, 0] is the same piece of data as an_array[0, 2]
print('After modification: ', an_array[0, 2])
```
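If you want a slice that can be changed without touching the original array, one option is to call `.copy()` on it. The sketch below uses a smaller 2x4 array for brevity; `view` and `independent` are illustrative names.

```python
import numpy as np

an_array = np.array([[11, 12, 13, 14], [21, 22, 23, 24]])

view = an_array[:2, 2:]                # shares data with an_array
independent = an_array[:2, 2:].copy()  # owns its own data

independent[0, 0] = -1   # does not affect an_array
print(an_array[0, 2])    # 13

view[0, 0] = 55          # writes through to an_array
print(an_array[0, 2])    # 55

print(np.shares_memory(an_array, view))         # True
print(np.shares_memory(an_array, independent))  # False
```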

Using integer indexing & slice indexing

Integer indexing, as the name implies, simply selects the elements of an array at the specified indices. We can use combinations of integer indexing and slice indexing to create different shaped matrices.

In [54]:

```
# create the same 4x4 array as above
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34], [41,42,43,44]])
print(an_array)
```

When slicing, [:] with no start or end indices selects all the elements in that row or column. In the following example, the combination of integer and slice indexing selects all elements in the third row (index 2) of the original array.

In [55]:

```
# Using integer indexing with slicing generates an array of lower rank
row_rank1 = an_array[2, :]
print(row_rank1) # notice the []
print(row_rank1.shape)
```

Note that when you try to do the same thing using slicing alone, the subarray that you create will be of the same rank as the original array, even though it may actually be of a smaller dimension.

In [56]:

```
# Using slicing alone generates an array of the same rank as the original array
row_rank2 = an_array[2:3, :]
print(row_rank2) # Notice the [[ ]]
print(row_rank2.shape)
```

We see the same thing when we work with the columns instead of the rows of the array:

In [57]:

```
col_rank1 = an_array[:, 3]
col_rank2 = an_array[:, 3:4]
print(col_rank1)
print(col_rank1.shape) # Rank 1
print()
print(col_rank2)
print(col_rank2.shape) # Rank 2
```

Using Array Indexing to change elements

Sometimes it's useful to use an array of indices to access or change elements in our larger matrix.

In [5]:

```
# Create a new 4x4 array
an_array = np.array([[11,12,13,14], [21,22,23,24], [31,32,33,34], [41,42,43,44]])
print('Starting Array:')
print(an_array)
```

In the following code, we create an array of indices, called col_indices, with the values zero, two, one, and three.
We then use the **np.arange** function to create an ndarray, called row_indices, with the values zero, one, two, and three.

In [6]:

```
# Create an array of indices
col_indices = np.array([0, 2, 1, 3])
print('Column indices picked : ', col_indices)
row_indices = np.arange(4)
print('Row indices picked : ', row_indices)
```

Using a for loop and the zip function, we can see how these values might pair up if used as row and column indices. The corresponding 2D indices are printed below.

In [7]:

```
# print the indices (row, column) in the arrays created above
for row, col in zip(row_indices, col_indices):
    print('(', row, ',', col, ')')
```

When we inspect the contents at those pairs of indices, we get back the values 11, 23, 32, and 44. This technique is very useful, as it allows us to access elements using arrays as indices.

In [8]:

```
# print the values in the array at the indices specified above
print('Values in the array at specified indices: ',an_array[row_indices, col_indices])
```

We can also change elements in the array the same way. Here, we add 1,000 to an_array for our row and column indices. Looking at the new array, we see that (0, 0), (1, 2), (2, 1), and (3, 3) have all been incremented.

In [9]:

```
# change the elements at the selected indices
an_array[row_indices, col_indices] += 1000
print('\nChanged Array:')
print(an_array)
print('\nNew values in the array at specified indices: ',an_array[row_indices,col_indices])
```

Using Boolean Indexing

We can also use boolean indexing to filter out elements of an array based on whether or not they fulfill some condition. This is very useful when we only want to look at a specific portion of a dataset, for example where some entries have a certain characteristic we want to explore.


In [11]:

```
# create a 2x3 array
an_array = np.array([[11,12,13], [21,22,23]])
print(an_array)
```

The following code creates a boolean array of the same dimensions as the original array, where True indicates that the corresponding element of the original array fulfills the condition, and False indicates that the corresponding element does not fulfill the condition.

In [12]:

```
# create a filter of boolean values indicating whether each element meets the condition
filter = (an_array < 15)
filter
```

Out[12]:

In [67]:

```
# using the filter, we can select just those elements which meet that criteria
print(an_array[filter])
```

The following code accomplishes the same thing without the intermediate step of creating the filter array:

In [68]:

```
an_array[(an_array < 15)]
```

Out[68]:

We can also change elements in the array using a filter. The following code adds 200 to all elements < 15 in the array:

In [69]:

```
an_array[an_array < 15] += 200
print(an_array)
```
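Conditions can also be combined. The following sketch, which rebuilds the 2x3 array so it stands on its own, uses the element-wise operators `&` (and), `|` (or), and `~` (not); the mask names are illustrative.

```python
import numpy as np

an_array = np.array([[11, 12, 13], [21, 22, 23]])

# Parentheses are required because & and | bind more tightly than comparisons
between = (an_array > 11) & (an_array < 22)
even_or_large = (an_array % 2 == 0) | (an_array > 22)

print(an_array[between])        # [12 13 21]
print(an_array[even_or_large])  # [12 22 23]
print(an_array[~between])       # [11 22 23]
```

Python's `and` and `or` keywords do not work element-wise on arrays, which is why the symbolic operators are used here.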

Array Operations

Arithmetic Array Operations:

There are also a number of useful arithmetic operations that may be used on numpy arrays. These include:

**add**, which adds the corresponding elements of different arrays. You can use **np.add(array1, array2)** or simply use the plus sign. Subtraction, multiplication, and division work the same way. See Numpy documentation for more details.

**np.sqrt(array)** returns an array where each element is the square root of the corresponding element in the original array.

**np.exp(array)** returns an array where each element is *e* raised to the power of the corresponding element in the original array.

In [34]:

```
a = np.array([[11,12],[21,22]], dtype=np.int64) # np.int is no longer available in recent numpy releases
b = np.array([[11.1,12.1],[21.1,22.1]], dtype=np.float64)
print(a)
print()
print(b)
```

In [35]:

```
# add
print(a + b)
print()
print(np.add(a, b))
# the plus sign does the same thing as the numpy function "add"
```

In [36]:

```
# subtract
print(a - b)
print()
print(np.subtract(a, b))
```

In [37]:

```
# multiply
print(a * b)
print()
print(np.multiply(a, b))
```

In [38]:

```
# divide
print(a / b)
print()
print(np.divide(a, b))
```

In [39]:

```
# square root
print(np.sqrt(a))
```

In [40]:

```
# exponent (e ** a)
print(np.exp(a))
```

Intro to Statistical Methods, Sorting, and Set Operations

Getting Started with Statistical Operations

There are many useful statistical operations for numpy arrays, some of which are:

**array.mean()** computes the mean of all elements in an array. **array.mean(axis = 1)** returns an array containing the mean values of each row, while **array.mean(axis = 0)** returns an array containing the mean values of each column.

**array.sum()** returns the sum of all of the elements in the array.

**np.median(array, axis=)** computes the median of the elements in an array. Similar to the mean function, the axis argument specifies whether the medians should be computed by row or by column.

There are many other statistical methods out there; check out the numpy reference documentation if you need a function that isn't listed here or if you're looking for more detailed information about the functions above.

In [13]:

```
# setup a random 3x3 matrix
arr = 10 * np.random.randn(3,3)
print(arr)
# compute the mean for all elements in the array
print('\n',arr.mean())
# set the axis value to 1 to compute the means for each row
print('\n',arr.mean(axis = 1))
# set the axis value to 0 to compute the means for each column
print('\n',arr.mean(axis = 0))
# sum all the elements in the array
print('\n',arr.sum())
# compute the median for all elements in the array
print('\n',np.median(arr))
# compute the medians for each row
print('\n',np.median(arr, axis = 1))
```
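Because the array above is random, its output changes on every run. As a complementary sketch, the same calls on a small fixed array make it easier to check by hand which direction each axis value refers to.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

print(arr.mean())              # 3.5, mean of all six elements
print(arr.mean(axis=1))        # [2. 5.], one mean per row
print(arr.mean(axis=0))        # [2.5 3.5 4.5], one mean per column
print(arr.sum())               # 21
print(np.median(arr, axis=0))  # [2.5 3.5 4.5], one median per column
```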

Using the Unique method

The NumPy method **unique** is very useful in data science. It allows us to pull out only the values that are unique in an array. Note that in the following example, the array has a number of duplicate 8s, 12s, and 13s. The output after calling **unique** on it is just 8, 12, and 13.

In [80]:

```
an_array = np.array([8,12,13,13,12,8,13,12])
print(np.unique(an_array))
```
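If you also need to know how often each value occurs, **np.unique** accepts a `return_counts` argument. This goes slightly beyond what is described above, so consider it an optional sketch.

```python
import numpy as np

an_array = np.array([8, 12, 13, 13, 12, 8, 13, 12])

# return_counts=True returns a second array with the number of occurrences
values, counts = np.unique(an_array, return_counts=True)

print(values)  # [ 8 12 13]
print(counts)  # [2 3 3]
```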

Set Operations on ndarrays

We can use set routines in numpy to perform operations on and compare two arrays. In the code below, we use the following set methods:

**np.intersect1d(array1, array2)** returns an array with the values that array1 and array2 both have.

**np.union1d(array1, array2)** returns an array with all the unique values from both array1 and array2.

**np.setdiff1d(array1, array2)** returns an array with elements in array1 that are not in array2.

**np.in1d(array1, array2)** returns a boolean array of whether each element of array1 is also present in array2.

*Note that the two arrays can have different numbers of elements, but must both be rank 1 arrays.*

In [91]:

```
ar1 = np.array(['dog','cat','bird','turtle'])
ar2 = np.array(['cat','bird','horse'])
print(ar1, ar2)
```

In [92]:

```
print( np.intersect1d(ar1, ar2) )
```

In [93]:

```
print( np.union1d(ar1, ar2) )
```

In [94]:

```
print( np.setdiff1d(ar1, ar2) )
```

In [95]:

```
print( np.in1d(ar1, ar2) )
```
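Recent numpy documentation recommends **np.isin** in place of **np.in1d** for new code. The sketch below assumes a numpy version that provides `np.isin` and shows that the resulting boolean array can be used directly as a filter.

```python
import numpy as np

ar1 = np.array(['dog', 'cat', 'bird', 'turtle'])
ar2 = np.array(['cat', 'bird', 'horse'])

# True where the element of ar1 also appears in ar2
mask = np.isin(ar1, ar2)

print(mask)       # [False  True  True False]
print(ar1[mask])  # ['cat' 'bird']
```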

Intro to Broadcasting:

Broadcasting is one of the more advanced features of NumPy, and it can help make array operations much more convenient. The term broadcasting describes how numpy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. During the operation, no copy is involved in the process and both arrays retain their original shapes, making broadcasting very memory and computationally efficient.

For more details on broadcasting, please see [this resource](https://docs.scipy.org/doc/numpy-1.10.1/user/basics.broadcasting.html).
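The "certain constraints" can be seen by looking at shapes. As a rough sketch: numpy compares shapes from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1. The array names below are illustrative.

```python
import numpy as np

grid = np.zeros((5, 3))
row = np.array([5, 3, 9])                       # shape (3,)
col = np.array([[10], [20], [30], [40], [50]])  # shape (5, 1)

print((grid + row).shape)  # (5, 3): row is applied to every row of grid
print((grid + col).shape)  # (5, 3): col is applied to every column of grid
print((row + col).shape)   # (5, 3): both operands are stretched

bad = np.array([1, 2, 3, 4, 5])                 # shape (5,)
try:
    grid + bad             # last dimensions are 3 and 5: not compatible
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("shapes (5, 3) and (5,) do not broadcast:", err)
```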

In [101]:

```
import numpy as np
array = np.zeros((5,3))
print(array)
```

In [102]:

```
# create a rank 1 ndarray with 3 values
add_rows = np.array([5, 3, 9])
print(add_rows)
```

In [107]:

```
x = array - add_rows # subtract add_rows from each row of 'array' using broadcasting
print(x)
```

In [108]:

```
# create a 5x1 ndarray to broadcast across columns
add_cols = np.array([[10,20,30,40,50]])
add_cols = add_cols.T
print(add_cols)
```

In [109]:

```
# add to each column of 'array' using broadcasting
x = array + add_cols
print(x)
```

In [112]:

```
# this will broadcast in both dimensions
scalar = np.array([3])
print(array+scalar)
```

Other Common ndarray Operations

Below, you'll find some other useful functions for ndarrays. There are a myriad of these, so we encourage you to go through these and explore the numpy documentation linked below.

Dot Product and Inner Product

**array1.dot(array2)** or **np.dot(array1, array2)** returns the dot product of two arrays.

*Note that if the two arrays are 2D (matrices), **dot** returns their matrix product, and if they are 1D (vectors), it returns their inner product.*

In [130]:

```
# determine the dot product of two matrices
arr1_2d = np.array([[2,2],[2,2]])
arr2_2d = np.array([[1,1],[1,1]])
print(arr1_2d.dot(arr2_2d))
print()
print(np.dot(arr1_2d, arr2_2d))
```

In [131]:

```
# determine the inner product of two vectors
arr1_1d = np.array([9 , 9 ])
arr2_1d = np.array([10, 10])
print(arr1_1d.dot(arr2_1d))
print()
print(np.dot(arr1_1d, arr2_1d))
```

In [132]:

```
# dot product on an array and vector
print(arr1_2d.dot(arr1_1d))
print()
print(np.dot(arr1_2d, arr1_1d))
```
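The examples above use square matrices filled with a single value, which hides how the shapes combine. The following sketch uses non-square inputs and also shows the `@` operator, which gives the same results as `dot` for 1D and 2D arrays. The names `A`, `B`, and `v` are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [2, 2]])      # shape (3, 2)
v = np.array([1, 1, 1])     # shape (3,)

print(A @ B)         # matrix product, shape (2, 2): rows [7, 8] and [16, 17]
print(np.dot(A, B))  # same result as A @ B
print(A @ v)         # matrix times vector: [ 6 15]
print(v @ v)         # inner product of two vectors: 3
```

The inner dimensions must match: a (2, 3) array can be multiplied by a (3, 2) array, and the result has shape (2, 2).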

Using sum():

In the following code, we explore the various uses of the **sum()** method.

In [136]:

```
# sum elements in the array
arr1 = np.array([[10,15],[20,25]])
print(np.sum(arr1)) # sum of all elements
```

In [137]:

```
print(np.sum(arr1, axis=0)) # sum of elements in each column
```

In [138]:

```
print(np.sum(arr1, axis=1)) # sum of elements in each row
```

Element-wise Functions:

**np.maximum(array1, array2)** compares two arrays and returns a new array containing the element-wise maxima. For more element-wise functions, see the numpy documentation.

In [146]:

```
# create a random array
a = np.random.randn(3,2)
print(a)
```

In [147]:

```
# create another random array
b = np.random.randn(3,2)
print(b)
```

In [148]:

```
# return the element wise maxima between two arrays
print(np.maximum(a, b))
```
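With random inputs the result is hard to verify by eye, so here is a small fixed sketch. It also contrasts **np.maximum**, which compares two arrays element by element, with the `max()` method, which reduces a single array to one value.

```python
import numpy as np

a = np.array([1, 5, 3])
b = np.array([4, 2, 3])

print(np.maximum(a, b))  # [4 5 3], the larger value at each position
print(np.minimum(a, b))  # [1 2 3], the smaller value at each position
print(a.max())           # 5, the largest value within a alone
```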

Reshaping arrays:

**array.reshape(dimensions)** gives a new shape to an array without changing its data.

In [161]:

```
# put values 0 through 14 in an array
arr = np.arange(15)
print(arr)
```

In [162]:

```
# reshape to be a 5 x 3 matrix
new_arr = arr.reshape(5,3)
print(new_arr)
```
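Two details are worth a short sketch: passing -1 for one dimension asks numpy to work out that dimension from the array's size, and a reshape fails when the requested shape does not match the number of elements.

```python
import numpy as np

arr = np.arange(15)

auto = arr.reshape(5, -1)  # numpy infers 3 columns from the 15 elements
flat = auto.reshape(-1)    # back to a rank 1 array of 15 elements

print(auto.shape)  # (5, 3)
print(flat.shape)  # (15,)

try:
    arr.reshape(4, 4)      # 16 slots cannot hold 15 elements
    reshape_failed = False
except ValueError as err:
    reshape_failed = True
    print("reshape failed:", err)
```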

Using transpose():

**np.transpose(array)** returns the transpose of an array with its dimensions permuted.

In [163]:

```
# transpose
arr = np.array([[11,12],[21,22]])
new_arr1 = np.transpose(arr)
print(new_arr1)
```

In [164]:

```
# another way to call the method
new_arr2 = arr.T
print(new_arr2)
```

Indexing using where():

**np.where(condition, array1, array2)** returns elements, either from array1 or array2, depending on the condition. The output array contains elements of array1 where the condition is True, and elements from array2 elsewhere.

In [168]:

```
array1 = np.array([1,2,3,4,5])
array2 = np.array([10,20,30,40,50])
filter = np.array([True, False, True, False, True])
```

In [169]:

```
out = np.where(filter, array1, array2)
print(out)
```

In [173]:

```
ran_arr = np.random.rand(3,3)
print(ran_arr)
```

In [175]:

```
new_arr = np.where( ran_arr > 0.5, 1000, -1)
print(new_arr)
```

Using any() and all()

**np.any()** tests whether any element in an array evaluates to True.

**np.all()** tests whether all elements in an array evaluate to True.
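In practice these two are often applied to the result of a comparison rather than to a hand-written boolean array. The sketch below uses a made-up `scores` array and also shows the optional `axis` argument.

```python
import numpy as np

scores = np.array([[70, 85],
                   [90, 95]])

print((scores > 80).any())        # True: at least one score is above 80
print((scores > 80).all())        # False: 70 is not above 80
print((scores > 80).all(axis=1))  # [False  True]: checked row by row
print(np.any(scores > 100))       # False: no score is above 100
```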

In [176]:

```
arr_bools = np.array([ True, False, True, True, False ])
```

In [177]:

```
arr_bools.any()
```

Out[177]:

In [178]:

```
arr_bools.all()
```

Out[178]:

Random Number Generation:

**np.random.normal(mean, standard deviation, dimensions)** draws random samples from a normal (Gaussian) distribution using information provided in parameters.

**np.random.randint(low, high, dimensions)** returns an array with specified dimensions of random integers from low (inclusive) to high (exclusive).

**np.random.permutation(array)** returns a new array with original array elements shuffled randomly.

**np.random.uniform(low, high, dimensions)** draws samples from a uniform distribution using information provided in parameters.
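Because these functions return different values on every call, results can be hard to reproduce. One common approach, shown here as a sketch, is to set a seed with **np.random.seed** before drawing; the seed value 0 is arbitrary.

```python
import numpy as np

np.random.seed(0)  # fix the starting state of the generator
first = np.random.randint(low=3, high=30, size=5)

np.random.seed(0)  # reset to the same state
second = np.random.randint(low=3, high=30, size=5)

print(first)
print(second)
print(np.array_equal(first, second))  # True: same seed, same draws
```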

In [179]:

```
arr1 = np.random.normal(size = (3,4))[0]
print(arr1)
```

In [180]:

```
arr2 = np.random.randint(low=3,high=30,size=5)
print(arr2)
```

In [182]:

```
np.random.permutation(arr2) # reorder elements in arr2
```

Out[182]:

In [183]:

```
np.random.uniform(size=3) # uniform distribution
```

Out[183]:

In [184]:

```
np.random.normal(size=3) # normal distribution
```

Out[184]:

Merging two data sets:

**np.vstack((array1, array2))** takes a sequence of arrays and stacks them vertically to make a single array.

**np.hstack((array1, array2))** takes a sequence of arrays and stacks them horizontally to make a single array.

**np.concatenate((array1, array2), axis)** joins a sequence of arrays along the specified axis.
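The arrays being merged must agree in every dimension except the one they are stacked along. A small sketch with fixed values (the names `top`, `bottom`, and `side` are illustrative) makes the resulting shapes easy to check:

```python
import numpy as np

top = np.array([[1, 2],
                [3, 4]])      # shape (2, 2)
bottom = np.array([[5, 6]])   # shape (1, 2): same number of columns as top
side = np.array([[7],
                 [8]])        # shape (2, 1): same number of rows as top

v = np.vstack((top, bottom))
h = np.hstack((top, side))

print(v.shape)  # (3, 2)
print(h.shape)  # (2, 3)

# concatenate gives the same results when given the matching axis
print(np.array_equal(v, np.concatenate((top, bottom), axis=0)))  # True
print(np.array_equal(h, np.concatenate((top, side), axis=1)))    # True
```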

In [185]:

```
arr1 = np.random.randint(low=5,high=30,size=(3,3))
print(arr1)
print()
arr2 = np.random.randint(low=5,high=30,size=(3,3))
print(arr2)
```

In [187]:

```
varr = np.vstack((arr1,arr2))
print(varr)
```

In [188]:

```
harr = np.hstack((arr1,arr2))
print(harr)
```

In [189]:

```
np.concatenate([arr1, arr2], axis = 0)
```

Out[189]:

In [190]:

```
np.concatenate([arr1, arr2.T], axis = 1)
```

Out[190]: