#!/usr/bin/env python # coding: utf-8 #

Machine Learning Lunch

#
#

Tom Brander

June 28, 2017

# # # # # # # ## Many Thanks to [Compose](https://compose.com/) for the space and lunch! # # ### http://oswco.com [@dartdog](https://twitter.com/dartdog) # # #

The PyData Stack

# Source: [Jake VanderPlas: State of the Tools](https://www.youtube.com/watch?v=5GlNDD7qbP4) # and [Thomas Wiecki](https://quantopian.github.io/pyfolio/) #

# To the uninitiated the whole pile of Python stuff looks terribly complicated. # To some extent it is. # But there has been a ton of work done to bring order out of the apparent chaos! # # The Libraries (just a starting point) # # + Python, of course (https://www.python.org/) # - A few years ago there was a change from the the Python 2, series to the Python 3 series # - Now the recomendation is just go with Python 3.6 # + Pandas (http://pandas.pydata.org/) # - Main data manipulation library, mostly using DataFrames (think Excel on steroids) # - Many IO capabilities GBQ, S-3, Parquet, SQL, CSV, JSON, And web data (stock prices and financial data) # - Built on top of Numpy # # + Numpy (http://www.numpy.org/) # - High performance numerical library particularly array and matrix oriented # + Matplotlib (https://matplotlib.org/) # - the grand daddy of Python Plotting libraries, many other libraries build on it to simplify and or stylize it # + Sci-kit Learn (http://scikit-learn.org/stable/) # - A collection of libraries for almost all types of machine learning with consitant API's and supporting libraries # + TensorFlow (https://www.tensorflow.org/) # - Google's open source numerical computing library # - on which they have built and released a large number of machine learning components # - along with a number of supporting components (I/O, encoding, serving etc) # + Keras (https://keras.io/)(https://www.tensorflow.org/api_docs/python/tf/contrib/keras) # - a simplified interface to many Machine learning libraries, also incorporated into TensorFlow # - supports Theano, Cntk, Pytorch (and more on the way) # + StatsModels (http://www.statsmodels.org/stable/index.html)(https://patsy.readthedocs.io/en/latest/) # - Many statistical techniques and the Patsy statistical language (much like R) # + PyMC3 (https://pymc-devs.github.io/pymc3/index.html) # - Baysian Modeling library (in many ways comparable to Stan but newer) # ## Jupyter # + (http://jupyter.org/) # + What this notebook is done with # + Has become a common format for "open data Science" # + Has also become a great method for shaing code and documentation throughout the Python community # + Supports many other languages, or Kernels R, Julia (a newer stats language) Go, Ruby ++ In many cases allows easier interoperabilitty between them # # Anaconda # + (https://www.continuum.io/anaconda-overview) # + All of the above,(150+ libraries), (except TensorFlow/Keras) and much more is auto installed for you using the Anaconda distribution including a nice IDE, Spyder # + As a bonus you get a faster than "Normal" version of Python with Intel MKL extensions built in # - Speed-boosted NumPy, SciPy, scikit-learn, and NumExpr # - The packaging of MKL with redistributable binaries in Anaconda for easy access to the MKL runtime library. # - Python bindings to the low level MKL service functions, which allow for the modification of the number of threads being used during runtime. # # # Books # + Data Science Handbook, freely available on GitHub, excellent resource https://github.com/jakevdp/PythonDataScienceHandbook # + Python Machine Learning by Sebastian Raschka https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning Primarily Sci-Kit learn Highly Recommeded! # + Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron http://shop.oreilly.com/product/0636920052289.do # + Deep Learning with Python by Francois Chollet https://www.manning.com/books/deep-learning-with-python # # Five More Tips # + Jupyter Notebook Gallery, Awesome https://github.com/jupyter/jupyter/wiki/A-gallery-of-interesting-Jupyter-Notebooks # + CSVKIT (https://csvkit.readthedocs.io/en/1.0.2/) # + Pandas profiling (https://github.com/JosPolfliet/pandas-profiling) # + Kaggle (https://www.kaggle.com/) # + What type of algorithim? (http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) # # Course # + https://github.com/amueller/scipy-2016-sklearn Videos and notebooks # # Machine Learning: # #
#

# # # # Skills # #
#

# # From: https://opendatascience.com/blog/what-is-data-science-and-what-does-a-data-scientist-do/ # In[1]: from tpot import TPOTClassifier import pandas as pd from sklearn.datasets import load_iris from sklearn.model_selection import train_test_split import numpy as np iris = load_iris() X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64), iris.target.astype(np.float64), train_size=0.75, test_size=0.25) tpot = TPOTClassifier(generations=7, population_size=100, verbosity=2, random_state=2) tpot.fit(X_train, y_train) print(tpot.score(X_test, y_test)) tpot.export('tpot_iris_pipeline.py') # In[2]: proc=pd.DataFrame(tpot.evaluated_individuals_) proc.head() # # Other links # + Tpot http://rhiever.github.io/tpot/ # + Some new Nvidia developments https://devblogs.nvidia.com/parallelforall/goai-open-gpu-accelerated-data-analytics/ # + State of the art Medical example https://fluforecaster.herokuapp.com/ # +Initial data explore http://localhost:8889/notebooks/Documents/InfluenceH/Working_copies/Cond_fcast_wkg/ccsProfileInitialanalyis.ipynb # +current model # http://localhost:8889/notebooks/Documents/InfluenceH/Working_copies/Cond_fcast_wkg/WIPNNModelonehottarget2.ipynb# # + RE forecast https://docs.google.com/spreadsheets/d/1HJxK82QYeYO13hQGaAg3hw4a4j8s0lSGdfk4BBjzh38/edit#gid=3 # + RE survey http://d1ambw9zjiu0uw.cloudfront.net/custom_reports3/21.pdf?1491572271 # + stock example http://localhost:8888/notebooks/Documents/pyfolio_wkng/examples/single_stock_example.ipynb# BUT! see issues https://github.com/quantopian/empyrical/issues/52 # + Old CCS propensity http://localhost:8888/notebooks/Documents/InfluenceH/Working_copies/CCS_wking/MultiCCS_PropModel_Div.ipynb # + beginning CCS NN http://localhost:8888/notebooks/Documents/InfluenceH/influence/USF_elu_downsample_1.1.ipynb # + Zillow initial explore http://localhost:8888/notebooks/Documents/Zillow_w/Notebooks/frkagnotebook.ipynb # + Zillow Bayes initial look http://localhost:8888/notebooks/Documents/Zillow_w/Notebooks/zillow_bayes.ipynb From: http://willwolf.io/2017/06/15/random-effects-neural-networks/ # + Zillow initial Profile http://localhost:8888/notebooks/Documents/Zillow_w/Notebooks/ProfileInitialanalyis.ipynb # + Zillow R initial profile https://www.kaggle.com/philippsp/exploratory-analysis-zillow # + Stock trading https://github.com/Kacawi/datacamp-community and https://medium.com/datacamp/python-for-finance-algorithmic-trading-60fdfb9bb20d # + Large collection of NN/NLP resourcs https://unsupervisedmethods.com/over-150-of-the-best-machine-learning-nlp-and-python-tutorials-ive-found-ffce2939bd78 # This notebook on Jupyter hub http://nbviewer.jupyter.org/github/dartdog/ML-lunch/blob/master/ML_resources.ipynb # R vs Python (2 pages ) http://www.kdnuggets.com/2017/06/ecosystem-data-science-machine-learning-software.html # In[2]: get_ipython().run_line_magic('load_ext', 'watermark') # In[3]: get_ipython().run_line_magic('watermark', '-a "Tom Brander" -u -n -t -z -v -m -p pandas,numpy,scipy,sklearn,tpot,tensorflow -g') # In[8]: get_ipython().system('nvidia-smi') # In[10]: get_ipython().system('nvcc --version') # In[11]: get_ipython().system('cat /proc/driver/nvidia/version') # Best basic book Mainly SciKit Learn https://www.packtpub.com/big-data-and-business-intelligence/python-machine-learning Very useful for all Python ML stuff and algorithims and what to use when and where.. # # This still early release but the best for Keras (written by the guy who also conceived and wrote the library itself) https://www.manning.com/books/deep-learning-with-python # # Best book for TensorFlow http://shop.oreilly.com/product/0636920052289.do also conversely uses SciKit learn, as a method to explain some of the concepts in TF.. Highly recommended.. # # Most accessible code can be found with Jupyter examples so you want to get that set up on your machine http://jupyter.org/ # # Easiest way to get everything you need and keep up to date is Anaconda https://www.continuum.io/downloads Includes Jupyter mentioned above as well as Spyder a Python IDE (Win Linux And Mac) Don't even think of doing another way.. (you will thank me!) # # up front data exploration CSV kit https://csvkit.readthedocs.io/en/1.0.2/ specifically csvstat (lots more there though) including some god transition and report stuff.. # # I like https://github.com/JosPolfliet/pandas-profiling # # Oh yes now a days just start with python 3.6 not 2.7 # In[ ]: