Developing Python Backends for Machine Learning Applications

by Jaidev Deshpande

Data Scientist @ Cube26 Software Pvt Ltd

@jaidevd

From Peter Norvig's Q&A session on Quora:

I think that it will be important for machine learning experts and software engineers to come together to develop best practices for software development of machine learning systems. Currently we have a software testing regime where you define unit tests... We will need new testing processes that involve running experiments, analyzing the results... This is a great area for software engineers and machine learning people to work together to build something new and better.

Example: How we started building ReosMessage

  • Classification of SMS into personal, notifications and spam

  • Get a dump of a table from postgres

  • Use simple Python scripts and some pandas magic to clean the data

  • Use regular expressions to label the data

  • Train sklearn estimators on the labeled data (see the sketch after this list)

  • Crowdsource the evaluation of the predictions

  • Dump the model coefficients to JSON

  • Hand over the JSON to Android developers
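
The list above compresses into very little code. Here is a minimal sketch of that first workflow; the regexes, column and file names, and the choice of estimator are illustrative placeholders, not the actual ReosMessage rules.

import json
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical dump of the postgres table
df = pd.read_csv("sms_dump.csv")

# Label with hand-written regexes (two shown here; the real set had hundreds)
rules = {"spam": re.compile(r"(?i)win|free|offer"),
         "notification": re.compile(r"(?i)otp|balance|recharge")}

def label(text):
    for name, pattern in rules.items():
        if pattern.search(text):
            return name
    return "personal"

df["label"] = df["message"].apply(label)

# Train a simple linear model on the labeled data
vec = TfidfVectorizer()
X = vec.fit_transform(df["message"])
clf = LogisticRegression().fit(X, df["label"])

# Dump the coefficients so the Android app can score messages natively
with open("model.json", "w") as fout:
    json.dump({"classes": clf.classes_.tolist(),
               "coef": clf.coef_.tolist(),
               "intercept": clf.intercept_.tolist()}, fout)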

Typical Data Processing Pipeline

Managing Raw Data

And Data Ingest as a Service

  • Raw data is an integral part of ETL and therefore of your software

  • Working off local flat files is bad!

  • Learn to work from remote storage to remote storage. Use the "cloud" (sketched after this list).

  • What about experimental / exploratory work? Use things like sqlite!

  • Only use local files when:

    • doing EDA or any interactive studies.

    • debugging a larger application.

    • prototyping (resist the temptation to deploy).
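
As referenced above, a minimal sketch of working remote-to-remote with pandas and SQLAlchemy; the connection strings, table names and the cleaning step are placeholders.

import pandas as pd
from sqlalchemy import create_engine

src = create_engine("postgresql://bar@127.0.0.1/foo")        # raw data
dst = create_engine("postgresql://bar@127.0.0.1/warehouse")  # processed data

# Stream the table in chunks so nothing large ever touches the local disk
for chunk in pd.read_sql_table("messages_db", src, chunksize=100000):
    cleaned = chunk.dropna(subset=["message"])
    cleaned.to_sql("messages_clean", dst, if_exists="append", index=False)

# For exploratory work, a throwaway sqlite file beats a pile of CSVs
eda = create_engine("sqlite:///scratch.db")
cleaned.sample(frac=0.1).to_sql("sample", eda, if_exists="replace", index=False)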

Using Pandas for Data Ingest

  • A few million PSQL rows randomly sampled from over 15M rows

  • Preprocessing with:

    • Removing unicode characters and emoji

    • Converting to lowercase

    • Dropping stopwords

    • Cleaning any other malformed input

  • Using a few hundred regular expressions to produce a labeled dataset
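
A rough sketch of the preprocessing above using plain pandas string methods; the stopword list, column names and regexes are illustrative, not the production set.

import pandas as pd

STOPWORDS = {"a", "an", "the", "is", "to", "for"}  # assumed; not the real list

def preprocess(messages):
    s = messages.str.encode("ascii", errors="ignore").str.decode("ascii")  # drop unicode / emoji
    s = s.str.lower()                                                      # lowercase
    s = s.str.replace(r"\d+", " ", regex=True)                             # drop digits
    return s.apply(lambda t: " ".join(w for w in t.split() if w not in STOPWORDS))

df = pd.read_csv("sms_sample.csv")  # hypothetical random sample of the PSQL rows
df["clean"] = preprocess(df["message"])

# One of the (few hundred) labeling regexes, shown here as a boolean feature
df["looks_like_spam"] = df["clean"].str.contains(r"win|free|offer", regex=True)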

Using PySemantic to Wrap Pandas

smsdata:
  source: psql
  table_name: messages_db
  config:
    hostname: 127.0.0.1
    db_name: foo
    username: bar
  chunksize: 100000
  sampling:
    factor: 0.1
    kind: random
  dtypes:
    Message: &string !!python/name:__builtin__.str
    Number: *string
    person: *string
  postprocessors:
    - !!python/name:jeeves.preprocessing.text.remove_unicode
    - !!python/name:jeeves.preprocessing.text.remove_tabs
    - !!python/name:jeeves.preprocessing.text.remove_digits
    - !!python/name:jeeves.preprocessing.text.remove_stopwords
    - !!python/name:jeeves.preprocessing.text.to_lowercase
    - !!python/name:jeeves.feature_extraction.text.get_regex_features
    - !!python/name:jeeves.feature_extraction.text.get_tfidf_features
>>> from pysemantic import Project
>>> smartsms = Project("smartsms")
>>> X = smartsms.load_dataset("smsdata")

A Note about the AutoML project

"If machine learning is so ubiquitous, one should just be able to use it without understanding libraries."

- Andreas Mueller @ SciPy US 2016

Automating Model Selection

import luigi
from sklearn.externals import joblib  # plain `import joblib` with recent scikit-learn
from sklearn.model_selection import GridSearchCV


class CrossValidationTask(luigi.Task):

    estimator = luigi.Parameter()  # or a luigi.Target pointing at a pickled estimator

    def run(self):
        # Run the CV loop and write the scores to self.output()
        ...


class GridSearchTask(luigi.Task):

    grid = luigi.Parameter()       # or Target
    estimator = luigi.Parameter()  # or Target

    def run(self):
        X, y = self.input()  # load features / labels from the upstream task's target
        clf = GridSearchCV(self.estimator, param_grid=self.grid)  # plus scoring, cv, n_jobs...
        clf.fit(X, y)
        joblib.dump(clf.best_estimator_, self.output().path)

Data Processing Pipeline as a Luigi Graph
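
A minimal sketch of how the tasks above chain into a graph: each task declares its upstream dependency in requires() and luigi resolves the DAG. The task and target names here are placeholders; in practice GridSearchTask from the previous slide sits at the end of the chain.

import luigi

class CleanData(luigi.Task):
    """Ingest + preprocessing from the earlier slides."""

    def output(self):
        return luigi.LocalTarget("clean.csv")

    def run(self):
        ...  # load from postgres, clean, write to self.output().path

class TrainModel(luigi.Task):
    """Grid search / cross-validation, as in GridSearchTask above."""

    def requires(self):
        return CleanData()  # the edges of the graph come from requires()

    def output(self):
        return luigi.LocalTarget("best_model.pkl")

    def run(self):
        ...  # read self.input().path, fit, dump the best estimator

if __name__ == "__main__":
    # Build the whole graph; luigi skips tasks whose outputs already exist
    luigi.build([TrainModel()], local_scheduler=True)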

Visualizing (Data & ML performance)

  • Bokeh server for dashboards

  • Chaco / Traits-based visualizations - for interactive exploration

  • Use libs like Seaborn for stats - resist the temptation to write them yourself
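
For instance, a one-liner with seaborn replaces a hand-rolled box plot of cross-validation scores; the dataframe below is hypothetical sample data.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical CV scores: one row per (estimator, fold)
cv_scores = pd.DataFrame({
    "estimator": ["logreg", "logreg", "svm", "svm"],
    "accuracy": [0.91, 0.89, 0.87, 0.90],
})

sns.boxplot(x="estimator", y="accuracy", data=cv_scores)
plt.show()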

Exposing Trained Models

  • Simple serialization methods (a sketch follows this list)

  • sklearn-compiledtrees

  • Don't use nonlinear models where linear models will do

  • The serverless paradigm - AWS Lambda / Amazon API Gateway, etc.
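
As mentioned above, a hedged sketch of the "simple serialization" route: the linear model is shipped as the JSON produced in the earlier training sketch and scored with a plain dot product. A handler like this fits comfortably behind AWS Lambda / Amazon API Gateway, or can be ported to the client; file names and the feature vector are placeholders.

import json

import numpy as np

with open("model.json") as fin:
    model = json.load(fin)

coef = np.array(model["coef"])            # shape: (n_classes, n_features)
intercept = np.array(model["intercept"])  # shape: (n_classes,)
classes = model["classes"]

def predict(features):
    # `features` must be built with the same vectorizer vocabulary used in
    # training, so the vocabulary / idf weights have to be shipped alongside.
    scores = coef @ np.asarray(features) + intercept
    return classes[int(np.argmax(scores))]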

Example: How we built & scaled ReosMessage

  • Get a dump of a table from postgres

  • Use simple Python scripts and some pandas magic to clean the data

  • Spark streaming API connected to Kafka consumers (see the sketch at the end)

  • Use regular expressions and user feedback to label the data

  • Use Luigi to:

    • Continuously run grid search and cross validation benchmarks

    • Train sklearn estimators on the labeled data

    • Dump the model coefficients to JSON

    • Hand over the JSON to Android developers

  • Use Jenkins to drive everything
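
As referenced above, a rough sketch of the streaming ingest: a PySpark Streaming job reading SMS events from Kafka. The topic name, broker address and the classify() helper are placeholders, not the production setup.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="sms-classifier")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc, ["sms-events"], {"metadata.broker.list": "localhost:9092"})

# Each record is a (key, value) pair; the value holds the message text
messages = stream.map(lambda kv: kv[1])
labelled = messages.map(lambda text: (text, classify(text)))  # classify() is assumed
labelled.pprint()

ssc.start()
ssc.awaitTermination()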