A Tutorial for Data Science in Python

Python is a versatile language, and this tutorial walks through how to use it for data science: installing packages, importing and inspecting data, summarizing, visualizing, and fitting a simple model.

Markdown Tip within Jupyter

This text was written inside Jupyter by changing the cell type to Markdown in the dropdown. In Markdown, a heading is made by prefixing a line with #, ##, or ### (the more # signs, the smaller the font), and an unnumbered list item is made by prefixing a line with -

Installation

Packages are installed with pip or easy_install (from setuptools). Here we install the Pandas package from within the Jupyter Notebook itself, using the --upgrade flag to upgrade any existing copy, and then install Bokeh using easy_install. Pandas is the Python library for data analysis, and Bokeh helps make data analysis interactive. Note the ! sign before the sudo command: it runs the command in the terminal without leaving the Jupyter Notebook.

In [2]:
! sudo pip install pandas --upgrade
The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
You are using pip version 7.1.0, however version 7.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting pandas
  Downloading pandas-0.17.1.tar.gz (6.7MB)
    100% |████████████████████████████████| 6.7MB 40kB/s 
Collecting python-dateutil (from pandas)
  Downloading python_dateutil-2.4.2-py2.py3-none-any.whl (188kB)
    100% |████████████████████████████████| 192kB 1.3MB/s 
Collecting pytz>=2011k (from pandas)
  Downloading pytz-2015.7-py2.py3-none-any.whl (476kB)
    100% |████████████████████████████████| 479kB 92kB/s 
Collecting numpy>=1.7.0 (from pandas)
  Downloading numpy-1.10.1.tar.gz (4.0MB)
    100% |████████████████████████████████| 4.1MB 75kB/s 
Collecting six>=1.5 (from python-dateutil->pandas)
  Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: six, python-dateutil, pytz, numpy, pandas
  Found existing installation: six 1.9.0
    Uninstalling six-1.9.0:
      Successfully uninstalled six-1.9.0
  Found existing installation: python-dateutil 1.5
    Uninstalling python-dateutil-1.5:
      Successfully uninstalled python-dateutil-1.5
  Found existing installation: pytz 2015.2
    Uninstalling pytz-2015.2:
      Successfully uninstalled pytz-2015.2
  Found existing installation: numpy 1.9.2
    Uninstalling numpy-1.9.2:
      Successfully uninstalled numpy-1.9.2
  Running setup.py install for numpy
  Found existing installation: pandas 0.16.0
    Uninstalling pandas-0.16.0:
      Successfully uninstalled pandas-0.16.0
  Running setup.py install for pandas
Successfully installed numpy-1.10.1 pandas-0.17.1 python-dateutil-2.4.2 pytz-2015.7 six-1.10.0
In [3]:
! sudo easy_install bokeh
Searching for bokeh
Reading https://pypi.python.org/simple/bokeh/
Best match: bokeh 0.10.0
Downloading https://pypi.python.org/packages/source/b/bokeh/bokeh-0.10.0.zip#md5=1432ed7d3034ce0c16c9f3c6388ad10d
Processing bokeh-0.10.0.zip
Writing /tmp/easy_install-CSs4Vk/bokeh-0.10.0/setup.cfg
Running bokeh-0.10.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-CSs4Vk/bokeh-0.10.0/egg-dist-tmp-45CSzc


package init file 'bokeh/models/tests/__init__.py' not found (or not a regular file)
package init file 'bokeh/charts/builder/tests/__init__.py' not found (or not a regular file)
package init file 'bokeh/charts/tests/__init__.py' not found (or not a regular file)
package init file 'bokeh/_legacy_charts/tests/__init__.py' not found (or not a regular file)
package init file 'bokeh/server/tests/__init__.py' not found (or not a regular file)
package init file 'bokeh/tests/__init__.py' not found (or not a regular file)
package init file 'bokeh/util/tests/__init__.py' not found (or not a regular file)
creating /usr/local/lib/python2.7/dist-packages/bokeh-0.10.0-py2.7.egg
Extracting bokeh-0.10.0-py2.7.egg to /usr/local/lib/python2.7/dist-packages
Adding bokeh 0.10.0 to easy-install.pth file
Installing bokeh-server script to /usr/local/bin
Installing websocket_worker.py script to /usr/local/bin

Installed /usr/local/lib/python2.7/dist-packages/bokeh-0.10.0-py2.7.egg
Processing dependencies for bokeh
Finished processing dependencies for bokeh
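
After an install it can help to confirm that the notebook's own Python can actually see the package. The following is a minimal sketch using only the standard library (Python 3.4 or later); the helper name is_installed is made up for this example and is not part of pip or Jupyter.

```python
import importlib.util


def is_installed(package_name):
    """Hypothetical helper: True if the running interpreter can locate the package."""
    # find_spec returns None when no importable top-level module has that name
    return importlib.util.find_spec(package_name) is not None


# "json" ships with Python, so it should always be reported as available
for name in ["pandas", "bokeh", "json"]:
    print(name, "is available" if is_installed(name) else "is missing")
```

If a package shows as missing right after a successful install, the install probably went to a different Python than the one running the notebook.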

Loading a Python Package

You can load a Python package in the following ways:

  • import PACKAGE
  • import PACKAGE as PK
  • from PACKAGE import FUN

    You can then invoke the function as PACKAGE.FUN, PK.FUN, and FUN respectively.
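
The three import styles can be sketched with the standard-library math package standing in for PACKAGE and sqrt standing in for FUN (both chosen only for illustration):

```python
# Style 1: import the package and use its full name as a prefix
import math
print(math.sqrt(16))      # PACKAGE.FUN

# Style 2: import the package under a shorter alias
import math as m
print(m.sqrt(16))         # PK.FUN

# Style 3: import a single function into the current namespace
from math import sqrt
print(sqrt(16))           # FUN
```

All three calls run the same function; the styles differ only in how much of the name you type and how easy it is to see where a function came from.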

In [4]:
from datetime import datetime
Starttime =datetime.now()
Starttime
Out[4]:
datetime.datetime(2015, 12, 2, 22, 30, 1, 850119)
In [6]:
import pandas as pd

Import Data

Let's import a dataset. We will use one of the datasets bundled with the R language, available from https://vincentarelbundock.github.io/Rdatasets/datasets.html

In [9]:
diamonds =pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv")

Data Inspection

In [14]:
diamonds.columns #Single Line Comment starts with # 
# The variable names are given by columns. In R we would use the command names(object)
# Note also that R uses the FUNCTION(OBJECTNAME) syntax, while pandas typically uses OBJECTNAME.METHOD
Out[14]:
Index(['Unnamed: 0', 'carat', 'cut', 'color', 'clarity', 'depth', 'table',
       'price', 'x', 'y', 'z'],
      dtype='object')
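
The 'Unnamed: 0' column in this output appears because the first header cell of the CSV file is empty: it holds the row numbers written out by R. One way to keep it out of the data columns is the index_col argument of read_csv. The sketch below assumes a tiny inline CSV with the same layout (the rows are a shortened, illustrative copy, not the real file) so that it runs without a network connection.

```python
import io
import pandas as pd

# A few rows in the same layout as the Rdatasets file: the first header cell is empty
csv_text = '''"","carat","cut","price"
"1",0.23,"Ideal",326
"2",0.21,"Premium",326
"3",0.23,"Good",327
'''

# Default behaviour: the unnamed first column is kept as a data column
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# index_col=0 uses that column as the row index instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```

Note that with index_col=0 the row labels start at 1, as in R, rather than at 0.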
In [41]:
len(diamonds) #gives the number of rows
Out[41]:
53940
In [46]:
0.0001*len(diamonds)
Out[46]:
5.394
In [47]:
round(0.0001*len(diamonds))
Out[47]:
5
In [15]:
'''Let's get some information on the object.
In R we would get this with the str command (for structure).
In Python, str converts an object to a string.
This is a multi-line comment written as a string in three single quote marks.
'''
diamonds.info() 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 53940 entries, 0 to 53939
Data columns (total 11 columns):
Unnamed: 0    53940 non-null int64
carat         53940 non-null float64
cut           53940 non-null object
color         53940 non-null object
clarity       53940 non-null object
depth         53940 non-null float64
table         53940 non-null float64
price         53940 non-null int64
x             53940 non-null float64
y             53940 non-null float64
z             53940 non-null float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.3+ MB
In [8]:
diamonds.head(10) #we check the first 10 rows in the dataset
Out[8]:
   Unnamed: 0  carat        cut color clarity  depth  table  price     x     y     z
0           1   0.23      Ideal     E     SI2   61.5     55    326  3.95  3.98  2.43
1           2   0.21    Premium     E     SI1   59.8     61    326  3.89  3.84  2.31
2           3   0.23       Good     E     VS1   56.9     65    327  4.05  4.07  2.31
3           4   0.29    Premium     I     VS2   62.4     58    334  4.20  4.23  2.63
4           5   0.31       Good     J     SI2   63.3     58    335  4.34  4.35  2.75
5           6   0.24  Very Good     J    VVS2   62.8     57    336  3.94  3.96  2.48
6           7   0.24  Very Good     I    VVS1   62.3     57    336  3.95  3.98  2.47
7           8   0.26  Very Good     H     SI1   61.9     55    337  4.07  4.11  2.53
8           9   0.22       Fair     E     VS2   65.1     61    337  3.87  3.78  2.49
9          10   0.23  Very Good     H     VS1   59.4     61    338  4.00  4.05  2.39
  • To refer to a particular row in Python, I can use its index.
  • In R, I refer to the element in the ith row and jth column with OBJECTNAME[i,j]
  • In R, I refer to a column by name with OBJECTNAME$ColumnName
  • Note that indexing starts at 0 in Python, while in R it starts at 1.
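
The .ix indexer used in this notebook was deprecated in pandas 0.20 and removed in pandas 1.0. In current pandas, .loc selects by label and .iloc selects by position. The sketch below uses a small made-up frame (toy is not the diamonds data) to show the difference and the zero-based numbering.

```python
import pandas as pd

toy = pd.DataFrame(
    {"carat": [0.23, 0.21, 0.23, 0.29, 0.31],
     "cut": ["Ideal", "Premium", "Good", "Premium", "Good"]}
)

# Label-based selection: rows labelled 1 through 3, end label included
print(toy.loc[1:3])

# Position-based selection: positions 1 and 2, end position excluded
print(toy.iloc[1:3])

# Single cell: the row labelled 0 (the first row) and the column "cut"
print(toy.loc[0, "cut"])
```

For a frame with the default integer index, diamonds.loc[20:30] should return the same rows as the diamonds.ix[20:30] call used here.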
In [36]:
diamonds.ix[20:30]
Out[36]:
    Unnamed: 0  carat        cut color clarity  depth  table  price     x     y     z
20          21   0.30       Good     I     SI2   63.3     56    351  4.26  4.30  2.71
21          22   0.23  Very Good     E     VS2   63.8     55    352  3.85  3.92  2.48
22          23   0.23  Very Good     H     VS1   61.0     57    353  3.94  3.96  2.41
23          24   0.31  Very Good     J     SI1   59.4     62    353  4.39  4.43  2.62
24          25   0.31  Very Good     J     SI1   58.1     62    353  4.44  4.47  2.59
25          26   0.23  Very Good     G    VVS2   60.4     58    354  3.97  4.01  2.41
26          27   0.24    Premium     I     VS1   62.5     57    355  3.97  3.94  2.47
27          28   0.30  Very Good     J     VS2   62.2     57    357  4.28  4.30  2.67
28          29   0.23  Very Good     D     VS2   60.5     61    357  3.96  3.97  2.40
29          30   0.23  Very Good     F     VS1   60.9     57    357  3.96  3.99  2.42
30          31   0.23  Very Good     F     VS1   60.0     57    402  4.00  4.03  2.41
In [88]:
# To refer to a particular column I use its name
# I can also chain the commands
diamonds.ix[20:25].cut
Out[88]:
20         Good
21    Very Good
22    Very Good
23    Very Good
24    Very Good
25    Very Good
Name: cut, dtype: object
In [34]:
diamonds.ix[20:25]["color"]
Out[34]:
20    I
21    E
22    H
23    J
24    J
25    G
Name: color, dtype: object

Random Sample

In [87]:
import numpy as np
In [90]:
rows = np.random.choice(diamonds.index.values, round(0.0001*len(diamonds)))
print(rows)
[42122 21399 40554 36399 50336]
In [91]:
diamonds.ix[rows]
Out[91]:
       Unnamed: 0  carat    cut color clarity  depth  table  price     x     y     z
42122       42123   0.58  Ideal     I     VS2   62.0     54   1279  5.35  5.39  3.33
21399       21400   1.51  Ideal     E     SI2   62.9     57   9513  7.29  7.23  4.57
40554       40555   0.41  Ideal     G    VVS1   61.7     55   1151  4.77  4.79  2.95
36399       36400   0.31  Ideal     E     VS1   61.8     56    942  4.37  4.34  2.69
50336       50337   0.70   Good     D     SI2   58.3     60   2242  5.81  5.89  3.41
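
One caveat about the sampling call: np.random.choice draws with replacement by default, so the same row label can be picked more than once. The sketch below shows two ways to draw without replacement. It assumes a made-up frame and an arbitrary seed, and DataFrame.sample requires pandas 0.16.1 or later.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": np.arange(100, 200)})
n = 5  # sample size chosen for illustration

# NumPy: replace=False guarantees that no row label is drawn twice
rows = np.random.choice(toy.index.values, size=n, replace=False)
sample_np = toy.loc[rows]

# pandas: DataFrame.sample draws without replacement by default;
# random_state (an arbitrary seed here) makes the draw repeatable
sample_pd = toy.sample(n=n, random_state=0)

print(sample_np)
print(sample_pd)
```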
In [92]:
## Missing values: drop every row that contains at least one missing value

diamonds= diamonds.dropna(how='any') 
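
The info() output earlier reports 53940 non-null values in every column, so this dropna call leaves the diamonds data unchanged. To show what the how argument does when values really are missing, here is a small sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"carat": [0.23, np.nan, 0.31, np.nan],
     "price": [326.0, 334.0, np.nan, np.nan]}
)

# Count the missing values per column before dropping anything
print(toy.isnull().sum())

# how='any' drops a row if at least one value is missing
any_dropped = toy.dropna(how="any")

# how='all' drops a row only if every value is missing
all_dropped = toy.dropna(how="all")

print(len(any_dropped), len(all_dropped))
```

Counting missing values first is a cheap way to see how many rows a dropna call is about to discard.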

Summaries

We now do summaries for numerical and categorical data.

In [18]:
diamonds.describe()
Out[18]:
         Unnamed: 0         carat         depth         table         price             x             y             z
count  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000  53940.000000
mean   26970.500000      0.797940     61.749405     57.457184   3932.799722      5.731157      5.734526      3.538734
std    15571.281097      0.474011      1.432621      2.234491   3989.439738      1.121761      1.142135      0.705699
min        1.000000      0.200000     43.000000     43.000000    326.000000      0.000000      0.000000      0.000000
25%    13485.750000      0.400000     61.000000     56.000000    950.000000      4.710000      4.720000      2.910000
50%    26970.500000      0.700000     61.800000     57.000000   2401.000000      5.700000      5.710000      3.530000
75%    40455.250000      1.040000     62.500000     59.000000   5324.250000      6.540000      6.540000      4.040000
max    53940.000000      5.010000     79.000000     95.000000  18823.000000     10.740000     58.900000     31.800000
In [30]:
diamonds.price.describe()
Out[30]:
count    53940.000000
mean      3932.799722
std       3989.439738
min        326.000000
25%        950.000000
50%       2401.000000
75%       5324.250000
max      18823.000000
Name: price, dtype: float64
In [56]:
diamonds.corr() # Numerical correlations
Out[56]:
            Unnamed: 0     carat     depth     table     price         x         y         z
Unnamed: 0    1.000000 -0.377983 -0.034800 -0.100830 -0.306873 -0.405440 -0.395843 -0.399208
carat        -0.377983  1.000000  0.028224  0.181618  0.921591  0.975094  0.951722  0.953387
depth        -0.034800  0.028224  1.000000 -0.295779 -0.010647 -0.025289 -0.029341  0.094924
table        -0.100830  0.181618 -0.295779  1.000000  0.127134  0.195344  0.183760  0.150929
price        -0.306873  0.921591 -0.010647  0.127134  1.000000  0.884435  0.865421  0.861249
x            -0.405440  0.975094 -0.025289  0.195344  0.884435  1.000000  0.974701  0.970772
y            -0.395843  0.951722 -0.029341  0.183760  0.865421  0.974701  1.000000  0.952006
z            -0.399208  0.953387  0.094924  0.150929  0.861249  0.970772  0.952006  1.000000
In [58]:
diamonds.corr()>0.5
Out[58]:
            Unnamed: 0  carat  depth  table  price      x      y      z
Unnamed: 0        True  False  False  False  False  False  False  False
carat            False   True  False  False   True   True   True   True
depth            False  False   True  False  False  False  False  False
table            False  False  False   True  False  False  False  False
price            False   True  False  False   True   True   True   True
x                False   True  False  False   True   True   True   True
y                False   True  False  False   True   True   True   True
z                False   True  False  False   True   True   True   True
In [60]:
diamonds['cut'].unique() #To get unique values
Out[60]:
array(['Ideal', 'Premium', 'Good', 'Very Good', 'Fair'], dtype=object)
In [59]:
diamonds['clarity'].unique()
Out[59]:
array(['SI2', 'SI1', 'VS1', 'VS2', 'VVS2', 'VVS1', 'I1', 'IF'], dtype=object)
In [50]:
pd.value_counts(diamonds.cut)
Out[50]:
Ideal        21551
Premium      13791
Very Good    12082
Good          4906
Fair          1610
dtype: int64
In [51]:
pd.value_counts(diamonds.color)
Out[51]:
G    11292
E     9797
F     9542
H     8304
D     6775
I     5422
J     2808
dtype: int64
In [52]:
pd.crosstab(diamonds.cut,diamonds.color)
Out[52]:
color         D     E     F     G     H     I    J
cut
Fair        163   224   312   314   303   175  119
Good        662   933   909   871   702   522  307
Ideal      2834  3903  3826  4884  3115  2093  896
Premium    1603  2337  2331  2924  2360  1428  808
Very Good  1513  2400  2164  2299  1824  1204  678
In [64]:
pd.crosstab(diamonds.cut,diamonds.color,margins=True) # margins takes a boolean; it adds the "All" row and column
Out[64]:
color         D     E     F      G     H     I     J    All
cut
Fair        163   224   312    314   303   175   119   1610
Good        662   933   909    871   702   522   307   4906
Ideal      2834  3903  3826   4884  3115  2093   896  21551
Premium    1603  2337  2331   2924  2360  1428   808  13791
Very Good  1513  2400  2164   2299  1824  1204   678  12082
All        6775  9797  9542  11292  8304  5422  2808  53940
In [61]:
cutgroup = diamonds.groupby('cut') # group the rows by cut, using the DataFrame's own groupby method
In [25]:
cutgroup
Out[25]:
<pandas.core.groupby.DataFrameGroupBy object at 0xae00d54c>
In [28]:
cutgroup.price.median()
Out[28]:
cut
Fair         3282.0
Good         3050.5
Ideal        1810.0
Premium      3185.0
Very Good    2648.0
Name: price, dtype: float64
In [67]:
cutgroup.price.median().reset_index()
Out[67]:
         cut   price
0       Fair  3282.0
1       Good  3050.5
2      Ideal  1810.0
3    Premium  3185.0
4  Very Good  2648.0
In [77]:
d=cutgroup.price.median().reset_index()
d.transpose()
Out[77]:
          0       1      2        3          4
cut    Fair    Good  Ideal  Premium  Very Good
price  3282  3050.5   1810     3185       2648
In [71]:
diamonds.groupby(['cut', "color"])
Out[71]:
<pandas.core.groupby.DataFrameGroupBy object at 0xad6845ac>
In [72]:
diamonds.groupby(['cut', "color"]).price.median().reset_index()
Out[72]:
          cut color   price
0        Fair     D  3730.0
1        Fair     E  2956.0
2        Fair     F  3035.0
3        Fair     G  3057.0
4        Fair     H  3816.0
5        Fair     I  3246.0
6        Fair     J  3302.0
7        Good     D  2728.5
8        Good     E  2420.0
9        Good     F  2647.0
10       Good     G  3340.0
11       Good     H  3468.5
12       Good     I  3639.5
13       Good     J  3733.0
14      Ideal     D  1576.0
15      Ideal     E  1437.0
16      Ideal     F  1775.0
17      Ideal     G  1857.5
18      Ideal     H  2278.0
19      Ideal     I  2659.0
20      Ideal     J  4096.0
21    Premium     D  2009.0
22    Premium     E  1928.0
23    Premium     F  2841.0
24    Premium     G  2745.0
25    Premium     H  4511.0
26    Premium     I  4640.0
27    Premium     J  5063.0
28  Very Good     D  2310.0
29  Very Good     E  1989.5
30  Very Good     F  2471.0
31  Very Good     G  2437.0
32  Very Good     H  3734.0
33  Very Good     I  3888.0
34  Very Good     J  4113.0
In [78]:
e=diamonds.groupby(['cut', "color"]).price.median().reset_index()
e.pivot(index='cut', columns='color', values='price')
Out[78]:
color           D       E     F       G       H       I     J
cut
Fair       3730.0  2956.0  3035  3057.0  3816.0  3246.0  3302
Good       2728.5  2420.0  2647  3340.0  3468.5  3639.5  3733
Ideal      1576.0  1437.0  1775  1857.5  2278.0  2659.0  4096
Premium    2009.0  1928.0  2841  2745.0  4511.0  4640.0  5063
Very Good  2310.0  1989.5  2471  2437.0  3734.0  3888.0  4113
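
The group, summarize, and pivot sequence can be sketched on a small made-up frame, which makes it easy to check the medians by hand. The data and the names toy, long_form, wide, and wide_alt are invented for this illustration. The sketch also shows unstack(), which gives the same wide layout without going through reset_index and pivot.

```python
import pandas as pd

toy = pd.DataFrame(
    {"cut": ["Fair", "Fair", "Fair", "Good", "Good", "Good"],
     "color": ["D", "D", "E", "D", "E", "E"],
     "price": [100, 300, 500, 200, 400, 800]}
)

# Long format: one row per (cut, color) pair with its median price
long_form = toy.groupby(["cut", "color"]).price.median().reset_index()

# Wide format: cuts as rows, colors as columns
wide = long_form.pivot(index="cut", columns="color", values="price")
print(wide)

# unstack() reaches the same wide table directly from the grouped result
wide_alt = toy.groupby(["cut", "color"]).price.median().unstack()
print(wide_alt)
```

Here the Fair/D cell is the median of 100 and 300, and the Good/E cell is the median of 400 and 800.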
In [81]:
f=e.pivot(index='cut', columns='color', values='price')
In [83]:
f>4000
Out[83]:
color          D      E      F      G      H      I      J
cut
Fair       False  False  False  False  False  False  False
Good       False  False  False  False  False  False  False
Ideal      False  False  False  False  False  False   True
Premium    False  False  False  False   True   True   True
Very Good  False  False  False  False  False  False   True

Data Visualization

In [123]:
import matplotlib.pyplot as plt
%matplotlib inline
pd.options.display.mpl_style = 'default'
plt.style.use('ggplot')
In [101]:
!sudo pip install seaborn --upgrade
The directory '/home/ajay/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
You are using pip version 7.1.0, however version 7.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting seaborn
  Downloading seaborn-0.6.0.tar.gz (145kB)
    100% |████████████████████████████████| 147kB 123kB/s 
Installing collected packages: seaborn
  Found existing installation: seaborn 0.5.1
    Uninstalling seaborn-0.5.1:
      Successfully uninstalled seaborn-0.5.1
  Running setup.py install for seaborn
Successfully installed seaborn-0.6.0
In [116]:
diamonds['price'].plot()
Out[116]:
<matplotlib.axes._subplots.AxesSubplot at 0xa75290ec>
In [119]:
plt.hist(diamonds.price)
Out[119]:
(array([ 25335.,   9328.,   7393.,   3878.,   2364.,   1745.,   1306.,
          1002.,    863.,    726.]),
 array([   326. ,   2175.7,   4025.4,   5875.1,   7724.8,   9574.5,
         11424.2,  13273.9,  15123.6,  16973.3,  18823. ]),
 <a list of 10 Patch objects>)
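
The tuple returned by plt.hist holds the bin counts, the bin edges, and the drawn patches. With the default of 10 bins, each bin above is (18823 - 326) / 10 = 1849.7 wide, which is why the second edge is 2175.7. The same counts and edges can be computed without drawing anything by using numpy.histogram. The sketch below does this for a handful of illustrative prices, not the full dataset.

```python
import numpy as np

prices = np.array([326, 400, 950, 2401, 2500, 5324, 9000, 18823])

# 10 equal-width bins between the minimum and the maximum value
counts, edges = np.histogram(prices, bins=10)

print(counts)  # number of values falling in each bin
print(edges)   # 11 edges delimit 10 bins
```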
In [122]:
plt.figure();
diamonds['price'].plot(kind='hist', stacked=True, bins=20)
Out[122]:
<matplotlib.axes._subplots.AxesSubplot at 0x98f2bcac>
In [105]:
plt.boxplot(diamonds.price)
Out[105]:
{'boxes': [<matplotlib.lines.Line2D at 0xa75082cc>],
 'caps': [<matplotlib.lines.Line2D at 0xa750a22c>,
  <matplotlib.lines.Line2D at 0xa750abcc>],
 'fliers': [<matplotlib.lines.Line2D at 0xa750ff2c>],
 'means': [],
 'medians': [<matplotlib.lines.Line2D at 0xa750f58c>],
 'whiskers': [<matplotlib.lines.Line2D at 0xa7508eec>,
  <matplotlib.lines.Line2D at 0xa750986c>]}
In [125]:
plt.figure();
diamonds['price'].plot(kind='box')
Out[125]:
<matplotlib.axes._subplots.AxesSubplot at 0x961d1a0c>
In [131]:
diamonds.plot(kind='hexbin', x='price', y='carat', gridsize=8)
Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0x9552b92c>
In [132]:
from ggplot import *
In [133]:
p = ggplot(aes(x='price', y='carat',color="clarity"), data=diamonds)
p + geom_point()
Out[133]:
<ggplot: (-918646034)>
In [134]:
p = ggplot(aes(x='price', y='carat',color="cut"), data=diamonds)
p + geom_point()
Out[134]:
<ggplot: (-918646060)>

Modeling

Let's do some basic regression modeling.

In [93]:
import statsmodels.formula.api as sm
In [94]:
result = sm.ols(formula="price ~ carat + color", data=diamonds).fit()
In [97]:
result.summary()
Out[97]:
                     OLS Regression Results

Dep. Variable:              price   R-squared:                  0.864
Model:                        OLS   Adj. R-squared:             0.864
Method:             Least Squares   F-statistic:            4.893e+04
Date:            Wed, 02 Dec 2015   Prob (F-statistic):          0.00
Time:                    23:44:07   Log-Likelihood:       -4.6998e+05
No. Observations:           53940   AIC:                    9.400e+05
Df Residuals:               53932   BIC:                    9.400e+05
Df Model:                       7
Covariance Type:        nonrobust

                  coef   std err         t   P>|t|   [95.0% Conf. Int.]
Intercept   -2136.2289    20.122  -106.162   0.000  -2175.669 -2096.789
color[T.E]    -93.7813    23.252    -4.033   0.000   -139.355   -48.208
color[T.F]    -80.2629    23.405    -3.429   0.001   -126.136   -34.390
color[T.G]    -85.5363    22.670    -3.773   0.000   -129.969   -41.103
color[T.H]   -732.2418    24.354   -30.067   0.000   -779.975  -684.508
color[T.I]  -1055.7319    27.310   -38.657   0.000  -1109.260 -1002.203
color[T.J]  -1914.4722    33.777   -56.679   0.000  -1980.676 -1848.268
carat        8066.6230    14.040   574.558   0.000   8039.105  8094.141

Omnibus:          12266.990   Durbin-Watson:            0.948
Prob(Omnibus):        0.000   Jarque-Bera (JB):    165317.069
Skew:                 0.719   Prob(JB):                  0.00
Kurtosis:            11.455   Cond. No.                  11.0
In [96]:
result.params
Out[96]:
Intercept    -2136.228853
color[T.E]     -93.781288
color[T.F]     -80.262858
color[T.G]     -85.536282
color[T.H]    -732.241826
color[T.I]   -1055.731857
color[T.J]   -1914.472203
carat         8066.623019
dtype: float64
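
To see what these coefficients mean, the fitted equation can be applied by hand. Color D has no coefficient of its own, which suggests it is the reference level that the other colors are measured against. The sketch below copies the values from result.params; the helper predict_price is made up for this illustration and is not part of statsmodels.

```python
# Coefficients copied from the result.params output
params = {
    "Intercept": -2136.228853,
    "color[T.E]": -93.781288,
    "color[T.F]": -80.262858,
    "color[T.G]": -85.536282,
    "color[T.H]": -732.241826,
    "color[T.I]": -1055.731857,
    "color[T.J]": -1914.472203,
    "carat": 8066.623019,
}


def predict_price(carat, color):
    """Hypothetical helper: apply the fitted equation by hand."""
    # Color D is assumed to be the reference level, so it contributes 0.
    # Caution: any color without a coefficient is treated the same way.
    color_effect = params.get("color[T.%s]" % color, 0.0)
    return params["Intercept"] + color_effect + params["carat"] * carat


print(round(predict_price(1.0, "D"), 2))  # 5930.39
print(round(predict_price(1.0, "E"), 2))  # 5836.61
```

Under this model, a one-carat diamond of color E is predicted to cost about 94 less than a one-carat diamond of color D. Note that the negative intercept means the equation gives implausible prices for very small stones.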