#!/usr/bin/env python # coding: utf-8 # # Definitions and Advices # In[1]: import addutils.toc ; addutils.toc.js(ipy_notebook=True) # In[2]: from addutils import css_notebook css_notebook() # ## 1 What is Machine Learning ? # Today we see an explosion of applications that are wide and connected with an emphasis on storage and processing. *Most companies are storing a lot of data but not solving the problem of what to do with it*. Yet most of the information is stored in raw form: There a huge amound of information locked-up in databases: information that is potentially important but has not yet been discovered. The objective of these tutorials is to show the foundamental techniques to **Discover Meaningful Information in Data**. # # **Data Mining** is the extraction of implicit, previously unknown and potentially useful information from data. # # **Machine Learning (ML)** and **Deep Learning (DL)** provide the technical basis of Data Mining. **ML** is about building programs with **tunable parameters** (typically an array of floating point values) that are adjusted automatically so as to improve their behavior by **adapting to previously seen data.** # # **DL** is about modeling high-level abstractions in data by using model architectures composed of **multiple non-linear transformations.** # # **ML and DL** can be considered a subfield of **Artificial Intelligence (AI)** since those algorithms can be seen as building blocks to make computers learn to behave more intelligently by somehow **generalizing** rather that just storing and retrieving data items like a database system would do. # # Most of the examples of these tutorials are taken from the [scikit-learn documentation](http://scikit-learn.org/stable/index.html): check the original documentation for further information. # # If you are a MATLAB^® user we recommend to read [Numpy for MATLAB Users](http://www.scipy.org/NumPy_for_Matlab_Users) and [Python vs Matlab](http://www.pyzo.org/python_vs_matlab.html). # # For an idea of the **Open Source Approach to science**, we suggest the [Science Code Manifesto](http://sciencecodemanifesto.org/) # ### 1.1 Documentation and reference: # * [Numpy Reference guide](http://docs.scipy.org/doc/numpy/reference/) # * [SciPy Reference](http://docs.scipy.org/doc/scipy/reference/) # * [scikit-learn User Guide](http://scikit-learn.org/stable/user_guide.html) # * [scikit-learn External Resources](http://scikit-learn.org/stable/presentations.html) # ## 2 Supervised and Unsupervised Learning # In general, a learning problem uses a set of n data samples to predict properties of unknown data. Usually data are organized in tables where rows (first axis) represent the **samples** (or **instances**) and colums represent **attributes** (or **features**), for Supervised Learning, another array of **classes** or **target variables** is provided. # # We can separate learning problems in a few large categories: # # In **SUPERVISED LEARNING**, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. # # * We have a **CLASSIFICATION** task when the **target variable is nominal (discrete)** - examples: # # * predicting the species of iris given a set of measurements of its flower # * given a multicolor image of an object through a telescope, determine whether that object is a star, a quasar, or a galaxy. # # * We have a **REGRESSION** task when the **target variable is continuous** - examples: # # * given a set of attributes, determine the selling price of an house # # # In **UNSUPERVISED LEARNING** the data has no labels, and we are interested in finding similarities between the samples. # # Unsupervised learning comprises tasks such as *dimensionality reduction*, *clustering*, and *density estimation*. # Some unsupervised learning problems are: # # * **CLUSTERING** is the task that **group similar items together** - examples: # # * given observations of distant galaxies, determine which features are most important in distinguishing between them. # # * **DENSITY ESTIMATION** is a task were we want to **find statistical values that describe the data** # # * **DIMENSIONALITY REDUCTION** is for **reduce the number of the features while keeping most of the information** # # **UNSUPERVISED / SUPERVISED LEARNING** in DL usally the two approach are combined, in fact the DL layers (Restricted Boltzmann Machines, Autoencoders, Convolutional Neural Networks) are used to learn the most significative features of the data. Those features are then used with standard ML regressors or classificators. # ## 3 Cheat Sheet - scikit-learn Algorithm Selection # ML is a general concept involving a huge number of algorithms. This is a tentative Cheat Sheet to help finding the correct approach. # # Basically, the principle is to start simple first. If this doesn't work out, try something more complicated. # # Red Links point to algorithms NOT included in scikit-learn. # # To make any of the algorithms actually work, you need to do the right preprocessing. # # Generally every ML algorithm needs a minimum number of samples. All the methods listed below are applicable to datasets with at least 50 samples. For tasks involving less than 50 samples most of the following methods are not suitable: # * **To predict a QUANTITY:** # # * **Regression:** these methods give back a numerical outcome. # # * **LESS than 100k samples with all features important (dense data):** # # * TRY: [Ridge Regression](http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression) *(see Generalized Linear Models)* # # * If Ridge Regression doesn't work, TRY: [Support Vector Regression (svm.SVR)](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) with *linear kernel* *(see Support Vector Machines)* # # * If SVR with *linear kernel* doesn't work, TRY: [Support Vector Regression (svm.SVR)](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html) with *rbf kernel* *(see Support Vector Machines)* # * If none of the above methods work, USE: [Ensemble Regressors](http://scikit-learn.org/stable/modules/ensemble.html) *(RF, Extremely Randomized Trees, GBRT)* *(see Support Ensemble Methods)* # # * **LESS than 100k samples with few features important (sparse data):** # # * USE: [Elastic Net Lasso](http://scikit-learn.org/stable/modules/linear_model.html#elastic-net) *(see Generalized Linear Models)* # # * **MORE than 100k samples:** # * USE: [SGD Regressor](http://scikit-learn.org/stable/modules/sgd.html#regression) *(see Stochastic Gradient Descent)* # # * **Alternatively, for every problem size:** # # * CALL US

# # * **Dimensionality Reduction (NOT for predicting the structure of the data):** these methods are suitable for data visualization and human interpretation. # # * TRY: [RandomizedPCA](http://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca) # # * If RandomizedPCA dont work, and you have *LESS than 10k samples*, USE: [t-distributed Stochastic Neighbor Embedding (t-SNE)](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE) # # * If RandomizedPCA dont work, and you have *MORE than 10k samples*, CALL US: most probably you need a more efficient version of t-SNE # # * **For Prediction of multivariate or structured outputs:** # # * TRY: [SVM struct](http://www.cs.cornell.edu/people/tj/svm_light/svm_struct.html). This algorithm is free for non-commercial use # # * TRY: [pystruct](https://github.com/amueller/pystruct). (Under development) # * **Predict a CATEGORY for LABELED Data (Classification):** # # * **LESS than 100k samples**, TRY: [Linear SVC](http://scikit-learn.org/stable/modules/svm.html#svc) # # * If Linear SVC dont work, and you have *NUMERICAL DATA*, TRY: [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) # # * If KNeighborsClassifier doesn't work, USE: [SVC](http://scikit-learn.org/stable/modules/svm.html#svc) # # * If SVC doesn't work, USE: [Ensemble Classifiers](http://scikit-learn.org/stable/modules/ensemble.html) *(RF, Extremely Randomized Trees, GBRT)* # # * If Linear SVC dont work, and you have *TEXTUAL DATA*, USE: [Naive Bayes](http://scikit-learn.org/stable/modules/naive_bayes.html) # # * **MORE than 100k samples**, TRY: [SGD Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) # # * If SGD Classifier dont work, , USE: [Kernel Approximation](http://scikit-learn.org/stable/modules/kernel_approximation.html) # * **Predict a CATEGORY for UNLABELED Data (Clustering):** # # * **LESS than 10k samples and KNOWN number of categories** # * USE: [Mini Batch K-Means](http://scikit-learn.org/stable/modules/clustering.html#mini-batch-k-means) # # * **MORE than 10k samples and KNOWN number of categories** # * TRY: [K-Means Clustering](http://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html#k-means-clustering) # # * If K-Means Clustering doesn't work, TRY: [Gaussian Mixture Models](http://scikit-learn.org/stable/modules/mixture.html#gmm-classifier) # # * If GMM doesn't work, USE: [Spectral Clustering](http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering) # # * **LESS than 10k samples and UNKNOWN number of categories** # * TRY: [Mean Shift](http://scikit-learn.org/stable/modules/clustering.html#mean-shift) # # * If Mean Shift doesn't work, USE: [Variational Gaussian Mixtures](http://scikit-learn.org/stable/modules/mixture.html#vbgmm-classifier-variational-gaussian-mixtures) # ## 4 Machine Learning Wisdom # Here is a list of things to take in great consideration while developing ML systems: # # 1. **No Free Lunch:** A wide variety of techniques exist for modeling. An important theorem in statistical machine learning essentially states that no one technique will outperform all other techniques on all problems *(Wolpert & MacReady, 1997)*. This theorem is sometimes referred to as *No Free Lunch*. Often, a modeling group will specialize in one particular technique, and will tout that technique as the being intrinsically superior to others. Such a claim should be regarded with extreme suspicion. Furthermore, the field of statistical machine learning is evolving rapidly, and new algorithms are developed at a regular pace, this determines a very fast aging for ML approaches. This is the reason why in Addfor we rely on Open Source, Lean and Data-Driven Development and **Combinatorial Innovation**. # # 2. **Beware of False Predictors:** In selecting input variables for a model, one must be careful not to include false predictors. A false predictor is a variable that is strongly correlated with the output class, but that is not available in a realistic prediction scenario. This step is stricktly data-dependent and can be accomplished by paying attention to the choice of the validation dataset. **Correlation does not imply causation:** ice-cream sales is a strong predictor for drowning deaths. # # 3. **Mind Data Balancing:** Always check if your algorithm is suitable to handle Data Asymmetricity. # # 4. **Correctly Define Output Classes:** If the model's task is to predict a system failure, it seems natural for the output classes to be "fail" and "not fail". However, characterizing the exact conditions under which failure occurs is not straightforward. For example two failures for different reasons could represent very different classes. # # 5. **Data Preparation:** this is maybe the most important task in any predictive algorithm, we dedicate a whole notebook to it! # # 6. **Model Selection:** Any modeling technique can be used to construct of a continuum of models, from simple to complex. One of the key issues in modeling is model selection, which involves picking the appropriate level of complexity for a model given a data set. Although model selection methods can be automated to some degree, model selection cannot be avoided. If someone claims otherwise, or does not emphasize their expertise in model selection, one should be suspicious of his abilities. # # 7. **Segmentation:** Often, a data set can be broken into several smaller, more homogenous data sets, which is referred to as segmentation. For example, a customer data base might be split into business and residential customers. Although domain experts can readily propose segmentations, enforcing a segmentation suggested by domain experts is generally not the most prudent approach to modeling, because the data itself provides clues to how the segmentation should be performed. Consequently, one should be concerned if a modeler claims to utilize a priori segmentation. # # 8. **Model Evaluation:** Once a model has been built, the natural question to ask is how accurate it is. Here we describe common sorts of deception that can occur in assessing and evaluating a model: # # a) *Failing to use an independent test set:* To obtain a fair estimate of performance, the model must be evaluated on examples that were not contained in the training set. The available data must be split into nonoverlapping subsets, with the test set reserved only for evaluation. # # b) *Assuming stationarity of the test environment:* For many difficult problems, a model built based on historical data will become a poorer and poorer predictor as time goes on, because the environment is nonstationary--the rules and behaviors of individuals change over time. Consequently, the best measure of a model's true performance will be obtained if it is tested on data from a different point in time relative to the training data. # # c) *Incomplete reports of results:* An accurate model will correctly discriminate examples of one output class from examples of another output class. Discrimination performance is best reported with an ROC curve, a lift curve, or a precision-recall curve. Any report of accuracy using only a single number is suspect. # # d) *Filtering data to bias results:* In a large data set, one segment of the population may be easier to predict than another. If a model is trained and tested just on this segment of the population, it will be more accurate than a model that must handle the entire population. Selective filtering can turn a hard problem into an easier problem. # # e) *Selective sampling of test cases:* A fair evaluation of a model will utilize a test set that is drawn from the same population as the model will eventually encounter in actual usage. # # f) *Failing to assess statistical reliability:* When comparing the accuracy of two models, it is not sufficient to report that one model performed better than the other, because the difference might not be statistically reliable. "Statistical reliability" means, among other things, that if the comparison were repeated using a different sample of the population, the same result would be achieved. # --- # # Visit [www.add-for.com]() for more tutorials and updates. # # This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.