Introduction to Python

This Jupyter notebook is adapted from the free edX course of the same title.

1. Variables and Types

  • Specific, case-sensitive name
  • Call up value through variable name
In [5]:
height = 1.79
weight = 68.7
In [6]:
print(height)
1.79

Calculate BMI

  • Definition: $ \text{BMI} = \frac{\text{weight}}{\text{height}^2} $
In [8]:
height = 1.79
weight = 68.7
bmi = weight / height ** 2
print(bmi)
21.4412783621
In [9]:
# changing height or weight does not alter bmi
height = 1.34
print(bmi)
21.4412783621
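bmi was computed once, from the values height and weight held at that moment; Python keeps no link between the variables afterwards. One way to make the dependency explicit is to recompute on demand. This is a minimal sketch, and the helper name compute_bmi is hypothetical (it is not part of the course material):

```python
def compute_bmi(weight, height):
    """Return body mass index: weight (kg) divided by height (m) squared."""
    return weight / height ** 2

height = 1.79
weight = 68.7
bmi = compute_bmi(weight, height)

height = 1.34                            # changing height alone leaves bmi untouched
stale_bmi = bmi                          # still the value computed with height = 1.79
fresh_bmi = compute_bmi(weight, height)  # recomputed with the new height
print(stale_bmi, fresh_bmi)
```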
In [11]:
str = "abc"  # caution: this rebinds the built-in name str; fine for a demo, avoid in real code
str
Out[11]:
'abc'

Python Data Types

  • float - real numbers
  • int - integer numbers
  • str - string, text
  • bool - True, False
In [13]:
type(bmi)
Out[13]:
float
In [14]:
day_of_week = 5
type(day_of_week)
Out[14]:
int
In [7]:
x = "body mass index"
type(x)
Out[7]:
str
In [8]:
y = 'this works too'
type(y)
Out[8]:
str
In [16]:
z = False
type(z)
Out[16]:
bool
In [18]:
print("this is a variable", str)
('this is a variable', 'abc')
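The type of a value decides what an operator does with it. The short sketch below uses only the built-in types listed above. Run it in a fresh session: the notebook cell that assigns to str hides the built-in str() conversion used here.

```python
# The same operator behaves differently depending on the operand types.
int_sum = 2 + 3            # integer addition
str_sum = "ab" + "cd"      # string concatenation
bool_sum = True + False    # bools behave like 1 and 0 in arithmetic

# Converting between types is explicit.
as_text = "my height is " + str(1.79)   # float -> str before concatenating
as_float = float("1.79")                # str -> float

print(int_sum, str_sum, bool_sum)
print(as_text, as_float)
```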

2.1. Python List

Python Data Types

  • float - real numbers
  • int - integer numbers
  • str - string, text
  • bool - True, False
  • Each variable represents a single value

Problem

  • Many data points
  • Height of entire family
    In [3]: height1 = 1.73
    In [4]: height2 = 1.68
    In [5]: height3 = 1.71
    In [6]: height4 = 1.89
    
  • Inconvenient

Python List

  • Name a collection of values
  • Contain any type
  • Contain different types
In [11]:
# Basic list
fam = [1.73, 1.68, 1.71, 1.89]
type(fam)
Out[11]:
list
In [19]:
# List with multiple types
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
type(fam)
Out[19]:
list
In [13]:
# List of lists
fam2 = [["liz", 1.73], ["emma", 1.68], ["mom", 1.71], ["dad", 1.89]]
type(fam2)
Out[13]:
list
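A list of lists keeps each name next to its height. As a small sketch (it relies on the indexing rules covered in the next section), the inner values can be reached by indexing twice:

```python
fam2 = [["liz", 1.73], ["emma", 1.68], ["mom", 1.71], ["dad", 1.89]]

first_pair = fam2[0]                   # the inner list for liz
liz_height = fam2[0][1]                # second element of that inner list
names = [pair[0] for pair in fam2]     # collect every name
heights = [pair[1] for pair in fam2]   # collect every height
print(first_pair, liz_height)
print(names, heights)
```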

2.2. Subsetting lists

Zero-based indexing:

  • $0, 1, 2,\dots, N-2, N-1$
  • $-N, -(N-1), \dots , -2, -1$
In [14]:
fam[3]
Out[14]:
1.68
In [15]:
fam[6]
Out[15]:
'dad'
In [16]:
fam[-1]
Out[16]:
1.89
In [17]:
fam[-2]
Out[17]:
'dad'
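The two index ranges address the same elements: for a list of length n, index -k and index n - k refer to the same position. A minimal check, using the same fam list as above:

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
n = len(fam)   # 8 elements

last = fam[-1]                 # last element
same_as_last = fam[n - 1]      # the same element, counted from the front
first_via_negative = fam[-n]   # -n is the first element, like fam[0]
print(last, same_as_last, first_via_negative)
```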

List slicing

[ start (inclusive) : end (exclusive) ]

In [18]:
fam[3:5]
Out[18]:
[1.68, 'mom']
In [19]:
fam[1:4]
Out[19]:
[1.73, 'emma', 1.68]
In [20]:
fam[:4]
Out[20]:
['liz', 1.73, 'emma', 1.68]
In [21]:
fam[5:]
Out[21]:
[1.71, 'dad', 1.89]
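Slices also accept negative indices and an optional third field, the step. The step is not used in the cells above, so treat this as an extension of the notes rather than part of them. With the alternating name/height layout of fam, a step of 2 separates the two kinds of value:

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

names = fam[0::2]         # start at index 0, take every second element
heights = fam[1::2]       # start at index 1, take every second element
last_two = fam[-2:]       # negative start: the final two elements
reversed_fam = fam[::-1]  # step of -1 walks the list backwards
print(names)
print(heights)
print(last_two)
```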

2.3. Manipulating Lists

  • Change list elements
  • Add list elements
  • Remove list elements

Changing list elements

In [22]:
fam[7] = 1.86
In [23]:
fam
Out[23]:
['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.86]
In [24]:
fam[0:2] = ["lisa", 1.74]
In [25]:
fam
Out[25]:
['lisa', 1.74, 'emma', 1.68, 'mom', 1.71, 'dad', 1.86]

Adding and removing elements

In [26]:
fam_ext = fam + ["me", 1.79]
fam_ext
Out[26]:
['lisa', 1.74, 'emma', 1.68, 'mom', 1.71, 'dad', 1.86, 'me', 1.79]
In [27]:
del fam[2]
fam
Out[27]:
['lisa', 1.74, 1.68, 'mom', 1.71, 'dad', 1.86]
In [28]:
del fam[2]
fam
Out[28]:
['lisa', 1.74, 'mom', 1.71, 'dad', 1.86]
In [20]:
# Run right after fam was first defined (In [19]), so the output shows the original values
fam.append('test')
In [21]:
fam
Out[21]:
['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89, 'test']
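The operations above differ in whether they build a new list or change the existing one. The sketch below contrasts them on a shortened fam; pop() and remove() are standard list methods that the cells above do not use.

```python
fam = ["liz", 1.73, "emma", 1.68]

fam_ext = fam + ["me", 1.79]   # + builds a new list; fam is unchanged
fam.append("mom")              # append changes fam in place and returns None
removed = fam.pop()            # pop removes and returns the last element
fam.remove("emma")             # remove deletes the first matching value
print(fam_ext)
print(fam, removed)
```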

Copying Lists

  • copying by reference
  • copying by value
In [29]:
# Copy by reference (change the copy, change the original)
x = ["a", "b", "c"]
y = x
y[1] = "z"
print(x)
print(y)
['a', 'z', 'c']
['a', 'z', 'c']
In [30]:
# Copy by value (change the copy, no change to the original)
x = ["a", "b", "c"]
y = x[:] # or y = list(x)
y[1] = "z"
print(x)
print(y)
['a', 'b', 'c']
['a', 'z', 'c']
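x[:] and list(x) copy only the outer list. If the list contains other lists, the inner lists are still shared between original and copy. A short sketch of the difference, using the standard-library copy module (not mentioned in the notes above):

```python
import copy

nested = [["liz", 1.73], ["emma", 1.68]]
shallow = nested[:]              # new outer list, same inner lists
deep = copy.deepcopy(nested)     # new outer list and new inner lists

shallow[0][1] = 1.75             # visible through nested as well
deep[1][1] = 1.70                # nested is unaffected
print(nested)
print(deep)
```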

3.1. Functions

  • Nothing new!
  • type()
  • Piece of reusable code
  • Solves particular task
  • Call function instead of writing code yourself

Example 1: max( ) function

Find the largest element in a list.

In [25]:
fam = [1.73, 1.68, 1.71, 1.89]
tallest = max(fam)
print(tallest)
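# Mixed numbers and strings: Python 2 orders numbers before strings, so max() returns 'e';
# Python 3 raises a TypeError here instead
str_l = [1.73, 1.68, 1.71, 1.89,'ab','e']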
print(max(str_l))
1.89
e

Example 2: round( ) function

Round floating point number.

In [32]:
help(round)
Help on built-in function round in module __builtin__:

round(...)
    round(number[, ndigits]) -> floating point number
    
    Round a number to a given precision in decimal digits (default 0 digits).
    This always returns a floating point number.  Precision may be negative.

In [33]:
round(1.68,1)
Out[33]:
1.7
In [34]:
round(1.68)
Out[34]:
2.0
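The help text and outputs above come from Python 2, where round() always returns a float. Assuming a Python 3 interpreter, round() without ndigits returns an int, and exact ties go to the nearest even integer:

```python
whole = round(1.68)          # Python 3: the int 2 (Python 2 gave the float 2.0)
one_digit = round(1.68, 1)   # one decimal digit
tie_low = round(2.5)         # tie: rounds to the even neighbour, 2
tie_high = round(3.5)        # tie: rounds to the even neighbour, 4
print(whole, type(whole).__name__, one_digit, tie_low, tie_high)
```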

Finding functions

  • Standard task $\rightarrow$ probably function exists!
  • The internet is your friend (google it!)

3.2 Methods: Functions that belong to objects

  • str : capitalize(), replace(), etc.
  • list : index(), count(), etc.

list Methods

In [35]:
fam = ['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]
fam.index("mom")
Out[35]:
4
In [36]:
fam.count(1.73)
Out[36]:
1

str Methods

In [28]:
sister = "liz"
sister
Out[28]:
'liz'
In [29]:
sister.capitalize()
Out[29]:
'Liz'
In [39]:
sister.replace("z", "sa")
Out[39]:
'lisa'

Methods

  • Everything = object
  • Objects have methods associated with them, depending on their type
In [31]:
sister.index("z")
Out[31]:
2
In [42]:
fam.append("me")
fam
Out[42]:
['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89, 'me']
In [43]:
fam.append("1.79")  # note: the quotes make this a string, not the float 1.79
fam
Out[43]:
['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89, 'me', '1.79']

Summary

  • Functions
    • type(fam)
  • Methods: call functions on objects
    • fam.index("dad")
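One practical difference between the str and list methods shown above: string methods return a new string and leave the original alone, while list methods such as append() change the list itself. A brief sketch:

```python
sister = "liz"
capitalized = sister.capitalize()   # returns a new string
# sister itself is unchanged because strings are immutable

fam = ["liz", 1.73]
result = fam.append("me")           # changes fam in place, returns None
print(sister, capitalized)
print(fam, result)
```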

3.3 Packages

Motivation

  • Functions and methods are powerful
  • All code in Python distribution?
    • Huge code base: messy
    • Lots of code you won’t use
    • Maintenance problem

Packages

  • Directory of Python Scripts
  • Each script = module
  • Specify functions, methods, types
  • Thousands of packages available
    • Numpy
    • Matplotlib
    • Scikit-learn

Install package

  • In the terminal: pip3 install numpy

Import package

In [105]:
import numpy
numpy.array([1, 2, 3])
Out[105]:
array([1, 2, 3])
In [106]:
import numpy as np
np.array([1, 2, 3])
Out[106]:
array([1, 2, 3])
In [107]:
from numpy import array
array([1, 2, 3])
Out[107]:
array([1, 2, 3])
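The three import styles differ only in the name you type afterwards. The sketch below uses the standard-library math module instead of numpy so that it runs without installing anything; the same pattern applies to any package.

```python
import math                 # whole module, full name as prefix
import math as m            # whole module, short alias
from math import sqrt       # a single name, no prefix

a = math.sqrt(16)
b = m.sqrt(16)
c = sqrt(16)
print(a, b, c)              # all three call the same function
```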

4.1 Numpy

Lists Recap

  • Powerful
  • Collection of values
  • Hold different types
  • Change, add, remove
  • Need for Data Science
    • Mathematical operations over collections
    • Speed

Illustration

In [108]:
height = [1.73, 1.68, 1.71, 1.89, 1.79]
height
Out[108]:
[1.73, 1.68, 1.71, 1.89, 1.79]
In [109]:
weight = [65.4, 59.2, 63.6, 88.4, 68.7]
weight
Out[109]:
[65.4, 59.2, 63.6, 88.4, 68.7]
In [110]:
weight / height ** 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-110-6a4c0c70e3b9> in <module>()
----> 1 weight / height ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'

Solution: Numpy

  • Numeric Python
  • Alternative to Python List: Numpy Array
  • Calculations over entire arrays
  • Easy and Fast
  • Installation
    • In the terminal: pip3 install numpy
In [111]:
import numpy as np
np_height = np.array(height)
np_height
Out[111]:
array([1.73, 1.68, 1.71, 1.89, 1.79])
In [112]:
np_weight = np.array(weight)
np_weight
Out[112]:
array([65.4, 59.2, 63.6, 88.4, 68.7])
In [113]:
# Element-wise calculations
bmi = np_weight / np_height ** 2
bmi
Out[113]:
array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])
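For comparison, the same element-wise calculation can be spelled out with plain lists. This is only a sketch of what "element-wise" means; it says nothing about how Numpy implements the operation internally.

```python
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

# Pair each weight with its height and apply the formula one element at a time.
bmi = [w / h ** 2 for w, h in zip(weight, height)]
print([round(b, 2) for b in bmi])
```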

Numpy: remarks

  • Numpy arrays: contain only one type
  • Different types: different behavior!
In [114]:
np.array([1.0, "is", True])
Out[114]:
array(['1.0', 'is', 'True'], dtype='|S32')
In [115]:
python_list = [1, 2, 3]
python_list + python_list
Out[115]:
[1, 2, 3, 1, 2, 3]
In [34]:
numpy_array = np.array([1, 2, 3])
numpy_array + numpy_array
Out[34]:
array([2, 4, 6])
In [35]:
np.add(numpy_array,numpy_array)
Out[35]:
array([2, 4, 6])

Numpy Subsetting

In [117]:
bmi
Out[117]:
array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])
In [118]:
bmi[1]
Out[118]:
20.97505668934241
In [119]:
bmi > 23
Out[119]:
array([False, False, False,  True, False])
In [120]:
bmi[bmi > 23]
Out[120]:
array([24.7473475])
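The comparison produces a boolean array (a mask), and indexing with that mask keeps the elements where it is True. A small sketch with rounded BMI values; the thresholds are arbitrary, and combining conditions with & goes slightly beyond the cells above:

```python
import numpy as np

bmi = np.array([21.85, 20.98, 21.75, 24.75, 21.44])   # rounded values from above

mask = bmi > 21.5                         # one boolean per element
selected = bmi[mask]                      # keeps elements where mask is True
count = int(mask.sum())                   # True counts as 1
in_range = bmi[(bmi > 21) & (bmi < 22)]   # combine conditions with &, not `and`
print(selected, count, in_range)
```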

4.2. 2D Numpy Arrays

Type of Numpy Arrays

  • ndarray = N-dimensional array
In [121]:
import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
In [122]:
type(np_height)
Out[122]:
numpy.ndarray
In [123]:
type(np_weight)
Out[123]:
numpy.ndarray

2D Numpy Arrays

In [124]:
np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],[65.4, 59.2, 63.6, 88.4, 68.7]])
np_2d
Out[124]:
array([[ 1.73,  1.68,  1.71,  1.89,  1.79],
       [65.4 , 59.2 , 63.6 , 88.4 , 68.7 ]])
In [125]:
# 2 rows, 5 columns
np_2d.shape
Out[125]:
(2, 5)
In [126]:
# numpy arrays must have a single type
np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
[65.4, 59.2, 63.6, 88.4, "68.7"]])
Out[126]:
array([['1.73', '1.68', '1.71', '1.89', '1.79'],
       ['65.4', '59.2', '63.6', '88.4', '68.7']], dtype='|S32')
In [4]:
# Defining a diagonal matrix (here the 3x3 identity)
diag = np.diag([1,1,1])
print(diag)
[[1 0 0]
 [0 1 0]
 [0 0 1]]

Subsetting

In [127]:
# Get the first row
np_2d[0]
Out[127]:
array([1.73, 1.68, 1.71, 1.89, 1.79])
In [128]:
# First row, third column
np_2d[0][2]
Out[128]:
1.71
In [129]:
# First row, third column
np_2d[0,2]
Out[129]:
1.71
In [130]:
# All rows, second and third columns
np_2d[:,1:3]
Out[130]:
array([[ 1.68,  1.71],
       [59.2 , 63.6 ]])
In [131]:
# Second row, all columns
np_2d[1,:]
Out[131]:
array([65.4, 59.2, 63.6, 88.4, 68.7])
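Row and column subsetting combine naturally with element-wise arithmetic. As a sketch, the BMI values from section 4.1 can be recomputed directly from the rows of np_2d:

```python
import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

heights = np_2d[0, :]             # first row
weights = np_2d[1, :]             # second row
bmi = weights / heights ** 2      # element-wise, as with 1D arrays
third_person = np_2d[:, 2]        # height and weight of the third person
print(bmi.round(2), third_person)
```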

4.3. Numpy: Basic Statistics

Data analysis

  • Get to know your data
  • Little data $\rightarrow$ simply look at it
  • Big data $\rightarrow$ ?

City-wide survey

In [132]:
import numpy as np

# Generate 5000 normally distributed random values
height = np.round(np.random.normal(1.75, 0.20, 5000), 2) # mean = 1.75, std dev = 0.2
weight = np.round(np.random.normal(60.32, 15, 5000), 2) # mean = 60.32, std dev = 15
np_city = np.column_stack((height, weight))
np_city
Out[132]:
array([[ 1.6 , 35.86],
       [ 1.61, 54.38],
       [ 2.04, 76.64],
       ...,
       [ 1.65, 51.07],
       [ 2.1 , 57.08],
       [ 1.67, 58.33]])
In [133]:
np.mean(np_city[:,0])
Out[133]:
1.7477559999999999
In [134]:
np.median(np_city[:,0])
Out[134]:
1.74
In [136]:
np.std(np_city[:,0])
Out[136]:
0.1981012984914536
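np.std divides by n by default (the population standard deviation), not by n - 1. The standard-library statistics module offers both variants, which makes the distinction easy to see on a small sample. The five heights below are the ones from section 4.1, not the simulated survey:

```python
import statistics

heights = [1.73, 1.68, 1.71, 1.89, 1.79]

mean = statistics.mean(heights)
median = statistics.median(heights)
pop_std = statistics.pstdev(heights)     # divides by n, like np.std's default
sample_std = statistics.stdev(heights)   # divides by n - 1
print(mean, median, pop_std, sample_std)
```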

5.1. Basic Plots with Matplotlib

Data Visualization

  • Very important in Data Analysis
    • Explore data
    • Report insights

Matplotlib

In [137]:
import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop)
plt.show()

Scatter plot

In [138]:
plt.scatter(year, pop)
plt.show()

5.2. Histograms

Histogram

  • Explore dataset
  • Get idea about distribution

Matplotlib

In [139]:
import matplotlib.pyplot as plt
help(plt.hist)
Help on function hist in module matplotlib.pyplot:

hist(x, bins=None, range=None, density=None, weights=None, cumulative=False, bottom=None, histtype=u'bar', align=u'mid', orientation=u'vertical', rwidth=None, log=False, color=None, label=None, stacked=False, normed=None, hold=None, data=None, **kwargs)
    Plot a histogram.
    
    Compute and draw the histogram of *x*. The return value is a
    tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
    [*patches0*, *patches1*,...]) if the input contains multiple
    data.
    
    Multiple data can be provided via *x* as a list of datasets
    of potentially different length ([*x0*, *x1*, ...]), or as
    a 2-D ndarray in which each column is a dataset.  Note that
    the ndarray form is transposed relative to the list form.
    
    Masked arrays are not supported at present.
    
    Parameters
    ----------
    x : (n,) array or sequence of (n,) arrays
        Input values, this takes either a single array or a sequency of
        arrays which are not required to be of the same length
    
    bins : integer or array_like or 'auto', optional
        If an integer is given, ``bins + 1`` bin edges are returned,
        consistently with :func:`numpy.histogram` for numpy version >=
        1.3.
    
        Unequally spaced bins are supported if *bins* is a sequence.
    
        If Numpy 1.11 is installed, may also be ``'auto'``.
    
        Default is taken from the rcParam ``hist.bins``.
    
    range : tuple or None, optional
        The lower and upper range of the bins. Lower and upper outliers
        are ignored. If not provided, *range* is ``(x.min(), x.max())``.
        Range has no effect if *bins* is a sequence.
    
        If *bins* is a sequence or *range* is specified, autoscaling
        is based on the specified bin range instead of the
        range of x.
    
        Default is ``None``
    
    density : boolean, optional
        If ``True``, the first element of the return tuple will
        be the counts normalized to form a probability density, i.e.,
        the area (or integral) under the histogram will sum to 1.
        This is achieved by dividing the count by the number of
        observations times the bin width and not dividing by the total
        number of observations. If *stacked* is also ``True``, the sum of
        the histograms is normalized to 1.
    
        Default is ``None`` for both *normed* and *density*. If either is
        set, then that value will be used. If neither are set, then the
        args will be treated as ``False``.
    
        If both *density* and *normed* are set an error is raised.
    
    weights : (n, ) array_like or None, optional
        An array of weights, of the same shape as *x*.  Each value in *x*
        only contributes its associated weight towards the bin count
        (instead of 1).  If *normed* or *density* is ``True``,
        the weights are normalized, so that the integral of the density
        over the range remains 1.
    
        Default is ``None``
    
    cumulative : boolean, optional
        If ``True``, then a histogram is computed where each bin gives the
        counts in that bin plus all bins for smaller values. The last bin
        gives the total number of datapoints. If *normed* or *density*
        is also ``True`` then the histogram is normalized such that the
        last bin equals 1. If *cumulative* evaluates to less than 0
        (e.g., -1), the direction of accumulation is reversed.
        In this case, if *normed* and/or *density* is also ``True``, then
        the histogram is normalized such that the first bin equals 1.
    
        Default is ``False``
    
    bottom : array_like, scalar, or None
        Location of the bottom baseline of each bin.  If a scalar,
        the base line for each bin is shifted by the same amount.
        If an array, each bin is shifted independently and the length
        of bottom must match the number of bins.  If None, defaults to 0.
    
        Default is ``None``
    
    histtype : {'bar', 'barstacked', 'step',  'stepfilled'}, optional
        The type of histogram to draw.
    
        - 'bar' is a traditional bar-type histogram.  If multiple data
          are given the bars are aranged side by side.
    
        - 'barstacked' is a bar-type histogram where multiple
          data are stacked on top of each other.
    
        - 'step' generates a lineplot that is by default
          unfilled.
    
        - 'stepfilled' generates a lineplot that is by default
          filled.
    
        Default is 'bar'
    
    align : {'left', 'mid', 'right'}, optional
        Controls how the histogram is plotted.
    
            - 'left': bars are centered on the left bin edges.
    
            - 'mid': bars are centered between the bin edges.
    
            - 'right': bars are centered on the right bin edges.
    
        Default is 'mid'
    
    orientation : {'horizontal', 'vertical'}, optional
        If 'horizontal', `~matplotlib.pyplot.barh` will be used for
        bar-type histograms and the *bottom* kwarg will be the left edges.
    
    rwidth : scalar or None, optional
        The relative width of the bars as a fraction of the bin width.  If
        ``None``, automatically compute the width.
    
        Ignored if *histtype* is 'step' or 'stepfilled'.
    
        Default is ``None``
    
    log : boolean, optional
        If ``True``, the histogram axis will be set to a log scale. If
        *log* is ``True`` and *x* is a 1D array, empty bins will be
        filtered out and only the non-empty ``(n, bins, patches)``
        will be returned.
    
        Default is ``False``
    
    color : color or array_like of colors or None, optional
        Color spec or sequence of color specs, one per dataset.  Default
        (``None``) uses the standard line color sequence.
    
        Default is ``None``
    
    label : string or None, optional
        String, or sequence of strings to match multiple datasets.  Bar
        charts yield multiple patches per dataset, but only the first gets
        the label, so that the legend command will work as expected.
    
        default is ``None``
    
    stacked : boolean, optional
        If ``True``, multiple data are stacked on top of each other If
        ``False`` multiple data are aranged side by side if histtype is
        'bar' or on top of each other if histtype is 'step'
    
        Default is ``False``
    
    Returns
    -------
    n : array or list of arrays
        The values of the histogram bins. See *normed* or *density*
        and *weights* for a description of the possible semantics.
        If input *x* is an array, then this is an array of length
        *nbins*. If input is a sequence arrays
        ``[data1, data2,..]``, then this is a list of arrays with
        the values of the histograms for each of the arrays in the
        same order.
    
    bins : array
        The edges of the bins. Length nbins + 1 (nbins left edges and right
        edge of last bin).  Always a single array even when multiple data
        sets are passed in.
    
    patches : list or list of lists
        Silent list of individual patches used to create the histogram
        or list of such list if multiple input datasets.
    
    Other Parameters
    ----------------
    **kwargs : `~matplotlib.patches.Patch` properties
    
    See also
    --------
    hist2d : 2D histograms
    
    Notes
    -----
    Until numpy release 1.5, the underlying numpy histogram function was
    incorrect with ``normed=True`` if bin sizes were unequal.  MPL
    inherited that error. It is now corrected within MPL when using
    earlier numpy versions.
    
    .. note::
        In addition to the above described arguments, this function can take a
        **data** keyword argument. If such a **data** argument is given, the
        following arguments are replaced by **data[<arg>]**:
    
        * All arguments with the following names: 'weights', 'x'.

Matplotlib Example

In [140]:
values = [0,0.6,1.4,1.6,2.2,2.5,2.6,3.2,3.5,3.9,4.2,6]
plt.hist(values, bins = 3)
plt.show()
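To see what the three bars represent, the counts can be reproduced by hand. This sketch assumes equal-width bins spanning the minimum to the maximum of the data, with the last bin closed on the right, which is the behavior an integer bins argument gives here:

```python
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]
n_bins = 3

low, high = min(values), max(values)
width = (high - low) / n_bins            # 2.0 for this data
counts = [0] * n_bins
for v in values:
    # The last bin is closed on the right, so the maximum falls into it.
    index = min(int((v - low) / width), n_bins - 1)
    counts[index] += 1
print(counts)
```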

5.3. Customization

Data Visualization

  • Science & Art
  • Many options
    • Different plot types
    • Many customizations
  • Choice depends on:
    • Data
    • Story you want to tell

Basic Plot

In [141]:
year = np.linspace(1950.,2100.,num=50)

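# Logistic growth parameters (population in billions)
K = 11.    # carrying capacity
P0 = 2.6   # population in the first year (1950)
r = 0.03   # growth rate per year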
population = K*P0*np.exp(r*(year-year[0])) / (K + P0*(np.exp(r*(year-year[0]))-1.))

plt.plot(year, population)
plt.show()
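The expression for population is a logistic growth curve: it starts at P0 and levels off at K. A plain-Python sketch of the same formula for a single year makes the two limits easy to check. The function name logistic_population and its default arguments are illustrative, taken from the constants in the cell above:

```python
import math

def logistic_population(year, start_year=1950.0, K=11.0, P0=2.6, r=0.03):
    """Logistic growth: starts at P0 in start_year and levels off at K."""
    growth = math.exp(r * (year - start_year))
    return K * P0 * growth / (K + P0 * (growth - 1.0))

at_start = logistic_population(1950.0)    # equals P0
far_future = logistic_population(3000.0)  # approaches K
print(at_start, far_future)
```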

Axis Labels, Title, Ticks, Axis Limits

In [142]:
plt.plot(year, population)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0,2,4,6,8,10])
plt.xlim(1950, 2100)
plt.ylim(0, 11)

plt.show()

Tick Labels

In [143]:
plt.plot(year, population)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0,2,4,6,8,10],['0','2B','4B','6B','8B','10B'])
plt.xlim(1950, 2100)
plt.ylim(0, 11)

plt.show()
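The tick labels do not have to be typed by hand. As a small sketch, they can be generated from the tick positions and then passed as the second argument to plt.yticks, as in the cell above:

```python
ticks = [0, 2, 4, 6, 8, 10]
# Append 'B' (billions) to every tick except zero.
labels = ['0' if t == 0 else '{}B'.format(t) for t in ticks]
print(labels)
```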

6.1. Boolean Logic & Control Flow

Booleans

In [145]:
2 < 3
Out[145]:
True
In [146]:
2 == 3
Out[146]:
False
In [147]:
x = 2
y = 3
x < y
Out[147]:
True
In [148]:
x == y
Out[148]:
False

Relational Operators

operator   meaning
<          strictly less than
<=         less than or equal
>          strictly greater than
>=         greater than or equal
==         equal
!=         not equal

Logical Operators

  • and
  • or
  • not
In [149]:
print(True and True)
print(True and False)
print(False and True)
print(False and False)
True
False
False
False
In [150]:
print(True or True)
print(True or False)
print(False or True)
print(False or False)
True
True
True
False
In [151]:
print(not True)
print(not False)
False
True
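In practice the logical operators are mostly used to combine comparisons. A short sketch; the value of x and the bounds are arbitrary:

```python
x = 12

in_range = x > 5 and x < 15   # both comparisons must hold
chained = 5 < x < 15          # chained form with the same meaning
outside = x < 5 or x > 15     # at least one comparison must hold
not_in_range = not in_range   # flips the result
print(in_range, chained, outside, not_in_range)
```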

Conditional Statements

if condition :
    expression

Note the indentation of expression and the colon after the condition.

In [152]:
z = 4
if z % 2 == 0 :
    print("z is even")
z is even
In [153]:
z = 5
if z % 2 == 0 :
    print("z is even")
else :
    print("z is odd")
z is odd
In [154]:
z = 3
if z % 2 == 0 :
    print("z is divisible by 2")
elif z % 3 == 0 :
    print("z is divisible by 3")
else :
    print("z is neither divisible by 2 nor by 3")
z is divisible by 3
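Only the first branch whose condition is true runs; later branches are skipped even if their conditions would also hold. Wrapping the chain in a function (the name describe is hypothetical) makes this easy to try with several values:

```python
def describe(z):
    """Return a message for the first matching divisibility rule."""
    if z % 2 == 0:
        return "z is divisible by 2"
    elif z % 3 == 0:
        return "z is divisible by 3"
    else:
        return "z is neither divisible by 2 nor by 3"

print(describe(3))
print(describe(6))   # divisible by both, but only the first branch runs
print(describe(7))
```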