Introduction to Python¶

This Jupyter notebook is adapted from the free edX course of the same title.

1. Variables and Types¶

Specific, case-sensitive name
Call up value through variable name

In [5]:

height = 1.79
weight = 68.7

In [6]:

print(height)

1.79

Calculate BMI¶

Definition: $ \text{BMI} = \frac{\text{weight}}{\text{height}^2} $

In [8]:

height = 1.79
weight = 68.7
bmi = weight / height ** 2
print(bmi)

21.4412783621

In [9]:

# changing height or weight does not alter bmi
height = 1.34
print(bmi)

21.4412783621
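The cell above shows that `bmi` keeps the value computed when it was assigned. A minimal sketch of how to pick up a new height by recomputing; the name `bmi_before` is introduced here only for illustration:

```python
height = 1.79
weight = 68.7
bmi = weight / height ** 2

height = 1.34                # reassigning height does not touch bmi
bmi_before = bmi             # still the value computed with height = 1.79

bmi = weight / height ** 2   # recompute to use the new height
print(bmi_before)
print(bmi)
```

The second printed value is larger because the same weight is now divided by a smaller squared height.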

In [11]:

# note: naming a variable str hides the built-in str type for the rest of the session
str = "abc"
str

Out[11]:

'abc'

Python Data Types¶

float - real numbers
int - integer numbers
str - string, text
bool - True, False

In [13]:

type(bmi)

Out[13]:

float

In [14]:

day_of_week = 5
type(day_of_week)

Out[14]:

int

In [7]:

x = "body mass index"
type(x)

Out[7]:

str

In [8]:

y = 'this works too'
type(y)

Out[8]:

str

In [16]:

z = False
type(z)

Out[16]:

bool

In [18]:

print("this is a variable", str)

('this is a variable', 'abc')

2.1. Python `List`¶

Python Data Types¶

float - real numbers
int - integer numbers
str - string, text
bool - True, False
Each variable represents a *single* value

Problem¶

Many data points
Height of entire family

In [3]: height1 = 1.73
In [4]: height2 = 1.68
In [5]: height3 = 1.71
In [6]: height4 = 1.89

Inconvenient

Python `List`¶

Name a collection of values
Contain any type
Contain different types

In [11]:

# Basic list
fam = [1.73, 1.68, 1.71, 1.89]
type(fam)

Out[11]:

list

In [19]:

# List with multiple types
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
type(fam)

Out[19]:

list

In [13]:

# List of lists
fam2 = [["liz", 1.73], ["emma", 1.68], ["mom", 1.71], ["dad", 1.89]]
type(fam2)

Out[13]:

list

2.2. Subsetting lists¶

Zero-based indexing:

$0, 1, 2,\dots, N-2, N-1$
$-N, -(N-1), -(N-2), \dots , -2, -1$

In [14]:

fam[3]

Out[14]:

1.68

In [15]:

fam[6]

Out[15]:

'dad'

In [16]:

fam[-1]

Out[16]:

1.89

In [17]:

fam[-2]

Out[17]:

'dad'
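Indexing also applies to a list of lists such as `fam2`: the first index selects an inner list and a second index selects an element inside it. A small sketch, with variable names chosen here for illustration:

```python
fam2 = [["liz", 1.73], ["emma", 1.68], ["mom", 1.71], ["dad", 1.89]]

first_person = fam2[0]     # the inner list ['liz', 1.73]
first_name = fam2[0][0]    # 'liz'
last_height = fam2[-1][1]  # 1.89, negative indices work on the outer list too

print(first_person)
print(first_name, last_height)
print(len(fam2))           # 4 inner lists
```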

List slicing¶

[ start (inclusive) : end (exclusive) ]

In [18]:

fam[3:5]

Out[18]:

[1.68, 'mom']

In [19]:

fam[1:4]

Out[19]:

[1.73, 'emma', 1.68]

In [20]:

fam[:4]

Out[20]:

['liz', 1.73, 'emma', 1.68]

In [21]:

fam[5:]

Out[21]:

[1.71, 'dad', 1.89]
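Slices accept an optional third value, the step, which the cells above do not use. As a sketch under that addition, a step of 2 separates the names from the heights in the mixed list:

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

names = fam[0::2]     # start at index 0, take every second element
heights = fam[1::2]   # start at index 1, take every second element
last_pair = fam[-2:]  # negative start: the last two elements

print(names)      # ['liz', 'emma', 'mom', 'dad']
print(heights)    # [1.73, 1.68, 1.71, 1.89]
print(last_pair)  # ['dad', 1.89]
```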

2.3. Manipulating Lists¶

Change list elements
Add list elements
Remove list elements

Changing list elements¶

In [22]:

fam[7] = 1.86

In [23]:

fam

Out[23]:

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.86]

In [24]:

fam[0:2] = ["lisa", 1.74]

In [25]:

fam

Out[25]:

['lisa', 1.74, 'emma', 1.68, 'mom', 1.71, 'dad', 1.86]

Adding and removing elements¶

In [26]:

fam_ext = fam + ["me", 1.79]
fam_ext

Out[26]:

['lisa', 1.74, 'emma', 1.68, 'mom', 1.71, 'dad', 1.86, 'me', 1.79]

In [27]:

del(fam[2])
fam

Out[27]:

['lisa', 1.74, 1.68, 'mom', 1.71, 'dad', 1.86]

In [28]:

del(fam[2])
fam

Out[28]:

['lisa', 1.74, 'mom', 1.71, 'dad', 1.86]

In [20]:

fam.append('test')

In [21]:

fam

Out[21]:

['lisa', 1.74, 'mom', 1.71, 'dad', 1.86, 'test']

Copying Lists¶

copying by reference
copying by value

In [29]:

# Copy by reference (change the copy, change the original)
x = ["a", "b", "c"]
y = x
y[1] = "z"
print(x)
print(y)

['a', 'z', 'c']
['a', 'z', 'c']

In [30]:

# Copy by value (change the copy, no change to the original)
x = ["a", "b", "c"]
y = x[:] # or y = list(x)
y[1] = "z"
print(x)
print(y)

['a', 'b', 'c']
['a', 'z', 'c']
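The difference between the two cells can be checked with the `is` operator, which tests whether two names refer to the same object. The sketch below also shows a caveat: `x[:]` and `list(x)` make a shallow copy, so inner lists are still shared. The names `alias`, `duplicate`, `nested`, and `shallow` are illustrative.

```python
x = ["a", "b", "c"]
alias = x           # second name for the same list object
duplicate = x[:]    # new list object with the same elements

print(alias is x)       # True
print(duplicate is x)   # False
print(duplicate == x)   # True: equal contents, different objects

# Shallow copy: the outer list is new, the inner lists are shared.
nested = [["liz", 1.73], ["emma", 1.68]]
shallow = nested[:]
shallow[0][1] = 1.75
print(nested[0][1])     # 1.75, the change is visible through the original
```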

3.1. Functions¶

Nothing new!
type()
Piece of reusable code
Solves particular task
Call function instead of writing code yourself

Example 1: max( ) function¶

Find the largest element in a list.

In [25]:

fam = [1.73, 1.68, 1.71, 1.89]
tallest = max(fam)
print(tallest)
# Python 2 only: comparing floats with strings raises TypeError in Python 3
str_l = [1.73, 1.68, 1.71, 1.89,'ab','e']
print(max(str_l))

1.89
e

Example 2: round( ) function¶

Round floating point number.

In [32]:

help(round)

Help on built-in function round in module __builtin__:

round(...)
    round(number[, ndigits]) -> floating point number
    
    Round a number to a given precision in decimal digits (default 0 digits).
    This always returns a floating point number.  Precision may be negative.

In [33]:

round(1.68,1)

Out[33]:

1.7

In [34]:

round(1.68)

Out[34]:

2.0
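The help text and outputs above come from Python 2 (note the `__builtin__` module name). Assuming the notebook is rerun on Python 3, `round` behaves slightly differently: without `ndigits` it returns an `int`, and ties are rounded to the nearest even integer. A short sketch:

```python
print(round(1.68, 1))      # 1.7
print(round(1.68))         # 2 (an int in Python 3, not 2.0)
print(type(round(1.68)))   # <class 'int'>
print(round(2.5))          # 2: ties go to the nearest even integer
print(round(3.5))          # 4
```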

Finding functions¶

Standard task $\rightarrow$ probably function exists!
The internet is your friend (google it!)

3.2 Methods: Functions that belong to objects¶

str : capitalize(), replace(), etc.
list : index(), count(), etc.

`list` Methods¶

In [35]:

fam = ['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]
fam.index("mom")

Out[35]:

4

In [36]:

fam.count(1.73)

Out[36]:

1

`str` Methods¶

In [28]:

sister = "liz "
sister

Out[28]:

'liz '

In [29]:

sister.capitalize()

Out[29]:

'Liz '

In [39]:

sister.replace("z", "sa")

Out[39]:

'lisa '

Methods¶

Everything = object
Objects have methods associated, depending on type

In [31]:

sister.index("z")

Out[31]:

2

In [42]:

fam.append("me")
fam

Out[42]:

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89, 'me']

In [43]:

fam.append("1.79")
fam

Out[43]:

['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89, 'me', '1.79']

Summary¶

Functions
- type(fam)
Methods: call functions on objects
- fam.index("dad")
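To make the summary concrete, the sketch below puts a function call and several method calls side by side. It also illustrates one difference worth knowing: `str` methods return a new string, while `list.append` changes the list in place and returns `None`. The variable names are illustrative.

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
sister = "liz"

n_items = len(fam)           # function: the object is passed as an argument
dad_pos = fam.index("dad")   # method: called on the object with a dot
shout = sister.upper()       # str method: returns a new string
result = fam.append("me")    # list method: modifies fam in place, returns None

print(n_items, dad_pos)      # 8 6
print(shout, sister)         # LIZ liz (the original string is unchanged)
print(result, fam[-1])       # None me
```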

3.3 Packages¶

Motivation¶

Functions and methods are powerful
All code in Python distribution?
- Huge code base: messy
- Lots of code you won’t use
- Maintenance problem

Packages¶

Directory of Python Scripts
Each script = module
Specify functions, methods, types
Thousands of packages available
- Numpy
- Matplotlib
- Scikit-learn

Install package¶

http://pip.readthedocs.org/en/stable/installing/
Download get-pip.py
Terminal:
- python get-pip.py
- pip install numpy

Import package¶

In [32]:

import numpy as np
np.array([1, 2, 3])

Out[32]:

array([1, 2, 3])

In [105]:

import numpy
numpy.array([1, 2, 3])

Out[105]:

array([1, 2, 3])

In [106]:

import numpy as np
np.array([1, 2, 3])

Out[106]:

array([1, 2, 3])

In [107]:

from numpy import array
array([1, 2, 3])

Out[107]:

array([1, 2, 3])
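The three import forms above work the same way for any package. The sketch below repeats them with `math` from the standard library, chosen here only so that it runs without installing anything:

```python
import math              # full name: math.sqrt
import math as m         # alias: m.sqrt
from math import sqrt    # single name: sqrt

a = math.sqrt(16)
b = m.sqrt(16)
c = sqrt(16)

print(a, b, c)           # 4.0 4.0 4.0
```

The `from ... import ...` form is the shortest to type, but it hides where a name came from; the aliased form (`np`, `plt`) is the usual compromise in data science code.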

4.1 Numpy¶

Lists Recap¶

Powerful
Collection of values
Hold different types
Change, add, remove
Need for Data Science
- Mathematical operations over collections
- Speed

Illustration¶

In [108]:

height = [1.73, 1.68, 1.71, 1.89, 1.79]
height

Out[108]:

[1.73, 1.68, 1.71, 1.89, 1.79]

In [109]:

weight = [65.4, 59.2, 63.6, 88.4, 68.7]
weight

Out[109]:

[65.4, 59.2, 63.6, 88.4, 68.7]

In [110]:

weight / height ** 2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-110-6a4c0c70e3b9> in <module>()
----> 1 weight / height ** 2

TypeError: unsupported operand type(s) for ** or pow(): 'list' and 'int'
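With plain lists, the element-wise calculation has to be written out explicitly. One way to do it, sketched here with `zip` and a list comprehension (neither is introduced in the notebook itself):

```python
height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

# Pair each weight with the matching height and apply the BMI formula.
bmi = [w / h ** 2 for w, h in zip(weight, height)]

print([round(b, 2) for b in bmi])   # [21.85, 20.98, 21.75, 24.75, 21.44]
```

This works, but it is more verbose and slower on large collections than the array-based version.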

Solution: Numpy¶

Numeric Python
Alternative to Python List: Numpy Array
Calculations over entire arrays
Easy and Fast
Installation
- In the terminal: pip3 install numpy

In [111]:

import numpy as np
np_height = np.array(height)
np_height

Out[111]:

array([1.73, 1.68, 1.71, 1.89, 1.79])

In [112]:

np_weight = np.array(weight)
np_weight

Out[112]:

array([65.4, 59.2, 63.6, 88.4, 68.7])

In [113]:

# Element-wise calculations
bmi = np_weight / np_height ** 2
bmi

Out[113]:

array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])

Numpy: remarks¶

Numpy arrays: contain only one type
Different types: different behavior!

In [114]:

np.array([1.0, "is", True])

Out[114]:

array(['1.0', 'is', 'True'], dtype='|S32')

In [115]:

python_list = [1, 2, 3]
python_list + python_list

Out[115]:

[1, 2, 3, 1, 2, 3]

In [34]:

numpy_array = np.array([1, 2, 3])
numpy_array + numpy_array

Out[34]:

array([2, 4, 6])

In [35]:

np.add(numpy_array,numpy_array)

Out[35]:

array([2, 4, 6])

Numpy Subsetting¶

In [117]:

bmi

Out[117]:

array([21.85171573, 20.97505669, 21.75028214, 24.7473475 , 21.44127836])

In [118]:

bmi[1]

Out[118]:

20.97505668934241

In [119]:

bmi > 23

Out[119]:

array([False, False, False,  True, False])

In [120]:

bmi[bmi > 23]

Out[120]:

array([24.7473475])
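Boolean masks can be combined, and a mask built from one array can subset another array of the same length. A sketch assuming NumPy is installed; note that arrays use `&` rather than `and`, with parentheses around each comparison:

```python
import numpy as np

np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])
bmi = np_weight / np_height ** 2

mask = (bmi > 21) & (bmi < 22)   # element-wise "and"
print(mask)                      # [ True False  True False  True]
print(bmi[mask])

tall_bmi_heights = np_height[bmi > 23]   # heights of people with bmi above 23
print(tall_bmi_heights)                  # [1.89]
```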

4.2. 2D Numpy Arrays¶

Type of Numpy Arrays¶

ndarray = N-dimensional array

In [121]:

import numpy as np
np_height = np.array([1.73, 1.68, 1.71, 1.89, 1.79])
np_weight = np.array([65.4, 59.2, 63.6, 88.4, 68.7])

In [122]:

type(np_height)

Out[122]:

numpy.ndarray

In [123]:

type(np_weight)

Out[123]:

numpy.ndarray

2D Numpy Arrays¶

In [124]:

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],[65.4, 59.2, 63.6, 88.4, 68.7]])
np_2d

Out[124]:

array([[ 1.73,  1.68,  1.71,  1.89,  1.79],
       [65.4 , 59.2 , 63.6 , 88.4 , 68.7 ]])

In [125]:

# 2 rows, 5 columns
np_2d.shape

Out[125]:

(2, 5)

In [126]:

# numpy arrays must have a single type
np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
[65.4, 59.2, 63.6, 88.4, "68.7"]])

Out[126]:

array([['1.73', '1.68', '1.71', '1.89', '1.79'],
       ['65.4', '59.2', '63.6', '88.4', '68.7']], dtype='|S32')

In [4]:

# Define a diagonal matrix (here the 3x3 identity)
diag = np.diag([1,1,1])
print(diag)

[[1 0 0]
 [0 1 0]
 [0 0 1]]

Subsetting¶

In [127]:

# Get the first row
np_2d[0]

Out[127]:

array([1.73, 1.68, 1.71, 1.89, 1.79])

In [128]:

# First row, third column
np_2d[0][2]

Out[128]:

1.71

In [129]:

# First row, third column
np_2d[0,2]

Out[129]:

1.71

In [130]:

# All rows, second and third columns
np_2d[:,1:3]

Out[130]:

array([[ 1.68,  1.71],
       [59.2 , 63.6 ]])

In [131]:

# Second row, all columns
np_2d[1,:]

Out[131]:

array([65.4, 59.2, 63.6, 88.4, 68.7])
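Putting the subsetting rules together, the rows of `np_2d` can be pulled out and reused for the BMI calculation from section 4.1. A sketch assuming NumPy is installed; the names `heights`, `weights`, and `fourth_person` are illustrative:

```python
import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

heights = np_2d[0, :]          # first row
weights = np_2d[1, :]          # second row
bmi = weights / heights ** 2   # element-wise, one value per person

fourth_person = np_2d[:, 3]    # all rows, fourth column: height and weight
print(fourth_person)           # [ 1.89 88.4 ]
print(bmi.shape)               # (5,)
print(np_2d.T.shape)           # (5, 2): transposing swaps rows and columns
```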

4.3. Numpy: Basic Statistics¶

Data analysis¶

Get to know your data
Little data $\rightarrow$ simply look at it
Big data $\rightarrow$ ?

City-wide survey¶

In [132]:

import numpy as np

# Generate 5000 normally distributed random values
height = np.round(np.random.normal(1.75, 0.20, 5000), 2) # mean = 1.75, std dev = 0.2
weight = np.round(np.random.normal(60.32, 15, 5000), 2) # mean = 60.32, std dev = 15
np_city = np.column_stack((height, weight))
np_city

Out[132]:

array([[ 1.6 , 35.86],
       [ 1.61, 54.38],
       [ 2.04, 76.64],
       ...,
       [ 1.65, 51.07],
       [ 2.1 , 57.08],
       [ 1.67, 58.33]])

In [133]:

np.mean(np_city[:,0])

Out[133]:

1.7477559999999999

In [134]:

np.median(np_city[:,0])

Out[134]:

1.74

In [136]:

np.std(np_city[:,0])

Out[136]:

0.1981012984914536
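By default `np.std` computes the population standard deviation (it divides by N); passing `ddof=1` gives the sample version (dividing by N - 1). The sketch below shows the same distinction with the standard library's `statistics` module on the five heights used earlier, so it runs without NumPy:

```python
import statistics

heights = [1.73, 1.68, 1.71, 1.89, 1.79]

mean = statistics.mean(heights)
median = statistics.median(heights)
pop_std = statistics.pstdev(heights)     # divides by N, like np.std(heights)
sample_std = statistics.stdev(heights)   # divides by N - 1, like np.std(heights, ddof=1)

print(round(mean, 2), median)            # 1.76 1.73
print(pop_std < sample_std)              # True
```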

5.1. Basic Plots with Matplotlib¶

Data Visualization¶

Very important in Data Analysis
- Explore data
- Report insights

Matplotlib¶

In [137]:

import matplotlib.pyplot as plt
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop)
plt.show()

Scatter plot¶

In [138]:

plt.scatter(year, pop)
plt.show()

5.2. Histograms¶

Histogram¶

Explore dataset
Get idea about distribution

Matplotlib¶

In [139]:

import matplotlib.pyplot as plt
help(plt.hist)

Help on function hist in module matplotlib.pyplot:

hist(x, bins=None, range=None, density=None, weights=None, cumulative=False, bottom=None, histtype=u'bar', align=u'mid', orientation=u'vertical', rwidth=None, log=False, color=None, label=None, stacked=False, normed=None, hold=None, data=None, **kwargs)
Plot a histogram.

Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.

Multiple data can be provided via *x* as a list of datasets
of potentially different length ([*x0*, *x1*, ...]), or as
a 2-D ndarray in which each column is a dataset. Note that
the ndarray form is transposed relative to the list form.

Masked arrays are not supported at present.

Parameters
----------
x : (n,) array or sequence of (n,) arrays
Input values, this takes either a single array or a sequency of
arrays which are not required to be of the same length

bins : integer or array_like or 'auto', optional
If an integer is given, ``bins + 1`` bin edges are returned,
consistently with :func:`numpy.histogram` for numpy version >=
1.3.

Unequally spaced bins are supported if *bins* is a sequence.

If Numpy 1.11 is installed, may also be ``'auto'``.

Default is taken from the rcParam ``hist.bins``.

range : tuple or None, optional
The lower and upper range of the bins. Lower and upper outliers
are ignored. If not provided, *range* is ``(x.min(), x.max())``.
Range has no effect if *bins* is a sequence.

If *bins* is a sequence or *range* is specified, autoscaling
is based on the specified bin range instead of the
range of x.

Default is ``None``

density : boolean, optional
If ``True``, the first element of the return tuple will
be the counts normalized to form a probability density, i.e.,
the area (or integral) under the histogram will sum to 1.
This is achieved by dividing the count by the number of
observations times the bin width and not dividing by the total
number of observations. If *stacked* is also ``True``, the sum of
the histograms is normalized to 1.

Default is ``None`` for both *normed* and *density*. If either is
set, then that value will be used. If neither are set, then the
args will be treated as ``False``.

If both *density* and *normed* are set an error is raised.

weights : (n, ) array_like or None, optional
An array of weights, of the same shape as *x*. Each value in *x*
only contributes its associated weight towards the bin count
(instead of 1). If *normed* or *density* is ``True``,
the weights are normalized, so that the integral of the density
over the range remains 1.

Default is ``None``

cumulative : boolean, optional
If ``True``, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If *normed* or *density*
is also ``True`` then the histogram is normalized such that the
last bin equals 1. If *cumulative* evaluates to less than 0
(e.g., -1), the direction of accumulation is reversed.
In this case, if *normed* and/or *density* is also ``True``, then
the histogram is normalized such that the first bin equals 1.

Default is ``False``

bottom : array_like, scalar, or None
Location of the bottom baseline of each bin. If a scalar,
the base line for each bin is shifted by the same amount.
If an array, each bin is shifted independently and the length
of bottom must match the number of bins. If None, defaults to 0.

Default is ``None``

histtype : {'bar', 'barstacked', 'step', 'stepfilled'}, optional
The type of histogram to draw.

- 'bar' is a traditional bar-type histogram. If multiple data
are given the bars are aranged side by side.

- 'barstacked' is a bar-type histogram where multiple
data are stacked on top of each other.

- 'step' generates a lineplot that is by default
unfilled.

- 'stepfilled' generates a lineplot that is by default
filled.

Default is 'bar'

align : {'left', 'mid', 'right'}, optional
Controls how the histogram is plotted.

- 'left': bars are centered on the left bin edges.

- 'mid': bars are centered between the bin edges.

- 'right': bars are centered on the right bin edges.

Default is 'mid'

orientation : {'horizontal', 'vertical'}, optional
If 'horizontal', `~matplotlib.pyplot.barh` will be used for
bar-type histograms and the *bottom* kwarg will be the left edges.

rwidth : scalar or None, optional
The relative width of the bars as a fraction of the bin width. If
``None``, automatically compute the width.

Ignored if *histtype* is 'step' or 'stepfilled'.

Default is ``None``

log : boolean, optional
If ``True``, the histogram axis will be set to a log scale. If
*log* is ``True`` and *x* is a 1D array, empty bins will be
filtered out and only the non-empty ``(n, bins, patches)``
will be returned.

Default is ``False``

color : color or array_like of colors or None, optional
Color spec or sequence of color specs, one per dataset. Default
(``None``) uses the standard line color sequence.

Default is ``None``

label : string or None, optional
String, or sequence of strings to match multiple datasets. Bar
charts yield multiple patches per dataset, but only the first gets
the label, so that the legend command will work as expected.

default is ``None``

stacked : boolean, optional
If ``True``, multiple data are stacked on top of each other If
``False`` multiple data are aranged side by side if histtype is
'bar' or on top of each other if histtype is 'step'

Default is ``False``

Returns
-------
n : array or list of arrays
The values of the histogram bins. See *normed* or *density*
and *weights* for a description of the possible semantics.
If input *x* is an array, then this is an array of length
*nbins*. If input is a sequence arrays
``[data1, data2,..]``, then this is a list of arrays with
the values of the histograms for each of the arrays in the
same order.

bins : array
The edges of the bins. Length nbins + 1 (nbins left edges and right
edge of last bin). Always a single array even when multiple data
sets are passed in.

patches : list or list of lists
Silent list of individual patches used to create the histogram
or list of such list if multiple input datasets.

Other Parameters
----------------
**kwargs : `~matplotlib.patches.Patch` properties

See also
--------
hist2d : 2D histograms

Notes
-----
Until numpy release 1.5, the underlying numpy histogram function was
incorrect with ``normed=True`` if bin sizes were unequal. MPL
inherited that error. It is now corrected within MPL when using
earlier numpy versions.

.. note::
In addition to the above described arguments, this function can take a
**data** keyword argument. If such a **data** argument is given, the
following arguments are replaced by **data[<arg>]**:

* All arguments with the following names: 'weights', 'x'.

Matplotlib Example¶

In [140]:

values = [0,0.6,1.4,1.6,2.2,2.5,2.6,3.2,3.5,3.9,4.2,6]
plt.hist(values, bins = 3)
plt.show()
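To see what `bins = 3` means for these values, the counts can be reproduced by hand. The sketch below is a plain-Python approximation of equal-width binning, not Matplotlib's actual implementation; it assumes, as `numpy.histogram` does, that the last bin includes its right edge.

```python
values = [0, 0.6, 1.4, 1.6, 2.2, 2.5, 2.6, 3.2, 3.5, 3.9, 4.2, 6]
n_bins = 3

lo, hi = min(values), max(values)
width = (hi - lo) / n_bins
edges = [lo + i * width for i in range(n_bins + 1)]

counts = [0] * n_bins
for v in values:
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    i = min(int((v - lo) / width), n_bins - 1)
    counts[i] += 1

print(edges)    # [0.0, 2.0, 4.0, 6.0]
print(counts)   # [4, 6, 2]
```

The three bars in the plot therefore have heights 4, 6, and 2.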

5.3. Customization¶

Data Visualization¶

Science & Art
Many options
- Different plot types
- Many customizations
Choice depends on:
- Data
- Story you want to tell

Basic Plot¶

In [141]:

year = np.linspace(1950.,2100.,num=50)

K = 11.
P0 = 2.6
r = 0.03
population = K*P0*np.exp(r*(year-year[0])) / (K + P0*(np.exp(r*(year-year[0]))-1.))

plt.plot(year, population)
plt.show()
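The `population` expression above has the form of a logistic growth curve, $P(t) = \frac{K P_0 e^{rt}}{K + P_0 (e^{rt} - 1)}$ with $t$ measured in years since 1950. Reading `K` as the level the curve saturates at and the unit as billions is an interpretation based on the tick labels used later, not something the notebook states. A sketch that checks the two ends of the curve with the standard library; the function name `logistic` is hypothetical:

```python
import math

K = 11.0    # saturation level (assumed unit: billions)
P0 = 2.6    # value at the start year
r = 0.03    # growth rate per year

def logistic(year, start=1950.0):
    """Evaluate the logistic curve from the cell above for a single year."""
    growth = math.exp(r * (year - start))
    return K * P0 * growth / (K + P0 * (growth - 1.0))

print(logistic(1950))        # equals P0 at the start year
print(logistic(2100) < K)    # True: the curve approaches K from below
```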

Axis Labels, Title, Ticks, Axis Limits¶

In [142]:

plt.plot(year, population)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0,2,4,6,8,10])
plt.xlim(1950, 2100)
plt.ylim(0, 11)

plt.show()

Tick Labels¶

In [143]:

plt.plot(year, population)

plt.xlabel('Year')
plt.ylabel('Population')
plt.title('World Population Projections')
plt.yticks([0,2,4,6,8,10],['0','2B','4B','6B','8B','10B'])
plt.xlim(1950, 2100)
plt.ylim(0, 11)

plt.show()

6.1. Boolean Logic & Control Flow¶

Booleans¶

In [145]:

2 < 3

Out[145]:

True

In [146]:

2 == 3

Out[146]:

False

In [147]:

x = 2
y = 3
x < y

Out[147]:

True

In [148]:

x == y

Out[148]:

False

Relational Operators¶

operator	meaning
<	strictly less than
<=	less than or equal
>	strictly greater than
>=	greater than or equal
==	equal
!=	not equal

Logical Operators¶

and
or
not

In [149]:

print(True and True)
print(True and False)
print(False and True)
print(False and False)

True
False
False
False

In [150]:

print(True or True)
print(True or False)
print(False or True)
print(False or False)

True
True
True
False

In [151]:

print(not True)
print(not False)

False
True
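Relational and logical operators are usually combined. A short sketch using the `x` and `y` values from the cells above; the chained form `x < y < 5` is standard Python shorthand for `x < y and y < 5`:

```python
x = 2
y = 3

both = x < y and y < 5     # True: both comparisons hold
chained = x < y < 5        # True: same test, written as a chain
either = x == y or x < y   # True: the second comparison holds
negated = not x == y       # True: x and y differ

print(both, chained, either, negated)
```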

Conditional Statements¶

if condition :
    expression

Note the indentation of expression and the colon after the condition.

In [152]:

z = 4
if z % 2 == 0 :
    print("z is even")

z is even

In [153]:

z = 5
if z % 2 == 0 :
    print("z is even")
else :
    print("z is odd")

z is odd

In [154]:

z = 3
if z % 2 == 0 :
    print("z is divisible by 2")
elif z % 3 == 0 :
    print("z is divisible by 3")
else :
    print("z is neither divisible by 2 nor by 3")

z is divisible by 3
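In an `if` / `elif` / `else` chain only the first branch whose condition is true runs, even when later conditions would also hold. A sketch with `z = 6`, which is divisible by both 2 and 3; the variable `message` is introduced here for illustration:

```python
z = 6
if z % 2 == 0:
    message = "z is divisible by 2"
elif z % 3 == 0:
    message = "z is divisible by 3"   # skipped: an earlier branch already matched
else:
    message = "z is neither divisible by 2 nor by 3"

print(message)   # z is divisible by 2
```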
