Machine Learning Lunch


Tom Brander


June 28, 2017

Many Thanks to Compose for the space and lunch!

http://oswco.com @dartdog

The PyData Stack

Source: [Jake VanderPlas: State of the Tools](https://www.youtube.com/watch?v=5GlNDD7qbP4) and [Thomas Wiecki](https://quantopian.github.io/pyfolio/)

To the uninitiated the whole pile of Python stuff looks terribly complicated.
To some extent it is.
But there has been a ton of work done to bring order out of the apparent chaos!

The Libraries (just a starting point)

  • Python, of course (https://www.python.org/)
    • A few years ago there was a change from the the Python 2, series to the Python 3 series
    • Now the recomendation is just go with Python 3.6
  • Pandas (http://pandas.pydata.org/)
    • Main data manipulation library, mostly using DataFrames (think Excel on steroids)
    • Many IO capabilities GBQ, S-3, Parquet, SQL, CSV, JSON, And web data (stock prices and financial data)
    • Built on top of Numpy
  • Numpy (http://www.numpy.org/)
    • High performance numerical library particularly array and matrix oriented
  • Matplotlib (https://matplotlib.org/)
    • the grand daddy of Python Plotting libraries, many other libraries build on it to simplify and or stylize it
  • Sci-kit Learn (http://scikit-learn.org/stable/)
    • A collection of libraries for almost all types of machine learning with consitant API's and supporting libraries

Jupyter

  • (http://jupyter.org/)
  • What this notebook is done with
  • Has become a common format for "open data Science"
  • Has also become a great method for shaing code and documentation throughout the Python community
  • Supports many other languages, or Kernels R, Julia (a newer stats language) Go, Ruby ++ In many cases allows easier interoperabilitty between them

Anaconda

  • (https://www.continuum.io/anaconda-overview)
  • All of the above,(150+ libraries), (except TensorFlow/Keras) and much more is auto installed for you using the Anaconda distribution including a nice IDE, Spyder
  • As a bonus you get a faster than "Normal" version of Python with Intel MKL extensions built in
    • Speed-boosted NumPy, SciPy, scikit-learn, and NumExpr
    • The packaging of MKL with redistributable binaries in Anaconda for easy access to the MKL runtime library.
    • Python bindings to the low level MKL service functions, which allow for the modification of the number of threads being used during runtime.

Books

Machine Learning:


In [1]:
from tpot import TPOTClassifier
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25)

tpot = TPOTClassifier(generations=7, population_size=100, verbosity=2, random_state=2)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')
/home/tom/anaconda3/envs/py36n/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Optimization Progress:  23%|██▎       | 182/800 [00:23<01:10,  8.77pipeline/s]
Generation 1 - Current best internal CV score: 0.9730848861283643
Optimization Progress:  35%|███▍      | 278/800 [00:47<01:26,  6.01pipeline/s]
Generation 2 - Current best internal CV score: 0.9821428571428571
Optimization Progress:  46%|████▋     | 372/800 [00:58<00:25, 16.66pipeline/s]
Generation 3 - Current best internal CV score: 0.9821428571428571
Optimization Progress:  58%|█████▊    | 464/800 [01:15<01:00,  5.51pipeline/s]
Generation 4 - Current best internal CV score: 0.9821428571428571
Optimization Progress:  70%|██████▉   | 556/800 [01:37<00:22, 10.80pipeline/s]
Generation 5 - Current best internal CV score: 0.9904761904761905
Optimization Progress:  80%|████████  | 642/800 [01:48<00:16,  9.72pipeline/s]
Generation 6 - Current best internal CV score: 0.9904761904761905
                                                                              
Generation 7 - Current best internal CV score: 0.9904761904761905

Best pipeline: DecisionTreeClassifier(RBFSampler(XGBClassifier(input_matrix, XGBClassifier__learning_rate=1.0, XGBClassifier__max_depth=DEFAULT, XGBClassifier__min_child_weight=20, XGBClassifier__n_estimators=100, XGBClassifier__nthread=1, XGBClassifier__subsample=0.95), RBFSampler__gamma=0.35), DecisionTreeClassifier__criterion=entropy, DecisionTreeClassifier__max_depth=DEFAULT, DecisionTreeClassifier__min_samples_leaf=15, DecisionTreeClassifier__min_samples_split=10)
0.894736842105

In [2]:
proc=pd.DataFrame(tpot.evaluated_individuals_)
proc.head()
Out[2]:
BernoulliNB(BernoulliNB(input_matrix, BernoulliNB__alpha=10.0, BernoulliNB__fit_prior=DEFAULT), BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=DEFAULT) BernoulliNB(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=5, DecisionTreeClassifier__min_samples_leaf=20, DecisionTreeClassifier__min_samples_split=10), BernoulliNB__alpha=100.0, BernoulliNB__fit_prior=True) BernoulliNB(DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=7, DecisionTreeClassifier__min_samples_leaf=2, DecisionTreeClassifier__min_samples_split=3), BernoulliNB__alpha=0.01, BernoulliNB__fit_prior=DEFAULT) BernoulliNB(GaussianNB(input_matrix), BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=DEFAULT) BernoulliNB(LogisticRegression(input_matrix, LogisticRegression__C=DEFAULT, LogisticRegression__dual=DEFAULT, LogisticRegression__penalty=l1), BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=False) BernoulliNB(Normalizer(input_matrix, Normalizer__norm=l2), BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=DEFAULT) BernoulliNB(Normalizer(input_matrix, Normalizer__norm=max), BernoulliNB__alpha=0.001, BernoulliNB__fit_prior=True) BernoulliNB(RobustScaler(input_matrix), BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=False) BernoulliNB(RobustScaler(input_matrix), BernoulliNB__alpha=100.0, BernoulliNB__fit_prior=DEFAULT) BernoulliNB(SelectFromModel(input_matrix, SelectFromModel__ExtraTreesClassifier__criterion=DEFAULT, SelectFromModel__ExtraTreesClassifier__max_features=DEFAULT, SelectFromModel__ExtraTreesClassifier__n_estimators=100, SelectFromModel__threshold=0.2), BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=False) ... XGBClassifier(input_matrix, XGBClassifier__learning_rate=0.5, XGBClassifier__max_depth=2, XGBClassifier__min_child_weight=4, XGBClassifier__n_estimators=DEFAULT, XGBClassifier__nthread=1, XGBClassifier__subsample=0.95) XGBClassifier(input_matrix, XGBClassifier__learning_rate=0.5, XGBClassifier__max_depth=2, XGBClassifier__min_child_weight=4, XGBClassifier__n_estimators=DEFAULT, XGBClassifier__nthread=1, XGBClassifier__subsample=DEFAULT) XGBClassifier(input_matrix, XGBClassifier__learning_rate=0.5, XGBClassifier__max_depth=3, XGBClassifier__min_child_weight=18, XGBClassifier__n_estimators=100, XGBClassifier__nthread=1, XGBClassifier__subsample=0.7) XGBClassifier(input_matrix, XGBClassifier__learning_rate=0.5, XGBClassifier__max_depth=5, XGBClassifier__min_child_weight=17, XGBClassifier__n_estimators=DEFAULT, XGBClassifier__nthread=1, XGBClassifier__subsample=0.25) XGBClassifier(input_matrix, XGBClassifier__learning_rate=1.0, XGBClassifier__max_depth=1, XGBClassifier__min_child_weight=19, XGBClassifier__n_estimators=DEFAULT, XGBClassifier__nthread=1, XGBClassifier__subsample=0.8) XGBClassifier(input_matrix, XGBClassifier__learning_rate=1.0, XGBClassifier__max_depth=1, XGBClassifier__min_child_weight=6, XGBClassifier__n_estimators=100, XGBClassifier__nthread=1, XGBClassifier__subsample=1.0) XGBClassifier(input_matrix, XGBClassifier__learning_rate=1.0, XGBClassifier__max_depth=2, XGBClassifier__min_child_weight=4, XGBClassifier__n_estimators=DEFAULT, XGBClassifier__nthread=1, XGBClassifier__subsample=0.95) XGBClassifier(input_matrix, XGBClassifier__learning_rate=1.0, XGBClassifier__max_depth=2, XGBClassifier__min_child_weight=6, XGBClassifier__n_estimators=DEFAULT, XGBClassifier__nthread=1, XGBClassifier__subsample=0.95) XGBClassifier(input_matrix, XGBClassifier__learning_rate=DEFAULT, XGBClassifier__max_depth=5, XGBClassifier__min_child_weight=17, XGBClassifier__n_estimators=DEFAULT, XGBClassifier__nthread=1, XGBClassifier__subsample=0.25) XGBClassifier(input_matrix, XGBClassifier__learning_rate=DEFAULT, XGBClassifier__max_depth=DEFAULT, XGBClassifier__min_child_weight=19, XGBClassifier__n_estimators=100, XGBClassifier__nthread=1, XGBClassifier__subsample=0.45)
0 2.000000 2.000000 2.00000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 2.000000 ... 1.000000 1.000000 1.000000 1.00000 1.000000 1.000000 1.00000 1.00000 1.00000 1.000000
1 0.366511 0.705642 0.93618 0.705642 0.705642 0.366511 0.366511 0.777433 0.705642 0.366511 ... 0.944876 0.944876 0.358178 0.33913 0.433178 0.944876 0.93618 0.93618 0.33913 0.366511

2 rows × 626 columns

+Initial data explore http://localhost:8889/notebooks/Documents/InfluenceH/Working_copies/Cond_fcast_wkg/ccsProfileInitialanalyis.ipynb
+current model http://localhost:8889/notebooks/Documents/InfluenceH/Working_copies/Cond_fcast_wkg/WIPNNModelonehottarget2.ipynb#

In [2]:
%load_ext watermark
In [3]:
%watermark -a "Tom Brander" -u -n -t -z -v -m -p pandas,numpy,scipy,sklearn,tpot,tensorflow -g
Tom Brander 
last updated: Wed Jun 28 2017 09:41:56 CDT

CPython 3.6.1
IPython 5.3.0

pandas 0.20.2
numpy 1.12.1
scipy 0.19.0
sklearn 0.18.1
tpot 0.8.3
tensorflow 1.2.0

compiler   : GCC 4.8.2 20140120 (Red Hat 4.8.2-15)
system     : Linux
release    : 4.4.0-81-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit
Git hash   : 1116fa037187d38cec52144d0bebde5ae0a0e484
In [8]:
!nvidia-smi
Tue Jun 27 07:47:13 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:01:00.0      On |                  N/A |
| N/A   45C    P8    10W /  N/A |    623MiB /  8105MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1098    G   /usr/lib/xorg/Xorg                             282MiB |
|    0      2345    G   compiz                                          65MiB |
|    0      2767    G   ...anced GL_KHR_blend_equation_advanced_cohe   222MiB |
|    0     10292    G   ...s-passed-by-fd --v8-snapshot-passed-by-fd    50MiB |
+-----------------------------------------------------------------------------+
In [10]:
!nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Wed_May__4_21:01:56_CDT_2016
Cuda compilation tools, release 8.0, V8.0.26
In [11]:
!cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module  375.66  Mon May  1 15:29:16 PDT 2017
GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) 

Best basic book Mainly SciKit Learn https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning Very useful for all Python ML stuff and algorithims and what to use when and where..

This still early release but the best for Keras (written by the guy who also conceived and wrote the library itself) https://www.manning.com/books/deep-learning-with-python

Best book for TensorFlow http://shop.oreilly.com/product/0636920052289.do also conversely uses SciKit learn, as a method to explain some of the concepts in TF.. Highly recommended..

Most accessible code can be found with Jupyter examples so you want to get that set up on your machine http://jupyter.org/

Easiest way to get everything you need and keep up to date is Anaconda https://www.continuum.io/downloads Includes Jupyter mentioned above as well as Spyder a Python IDE (Win Linux And Mac) Don't even think of doing another way.. (you will thank me!)

up front data exploration CSV kit https://csvkit.readthedocs.io/en/1.0.2/ specifically csvstat (lots more there though) including some god transition and report stuff..

I like https://github.com/JosPolfliet/pandas-profiling

Oh yes now a days just start with python 3.6 not 2.7