#!/usr/bin/env python
# coding: utf-8

# # 01 - Introduction to Machine Learning
# 
# by [Alejandro Correa Bahnsen](albahnsen.com/)
# 
# version 0.4, Feb 2017
# 
# ## Part of the Tutorial [Practical Machine Learning](https://github.com/albahnsen/Tutorial_PracticalMachineLearning_Pycon)
# 
# 
# This notebook is licensed under a [Creative Commons Attribution-ShareAlike 3.0 Unported License](http://creativecommons.org/licenses/by-sa/3.0/deed.en_US). Special thanks go to [Jake Vanderplas](http://www.vanderplas.com)

# ## What is Machine Learning?
# 
# In this section we will begin to explore the basic principles of machine learning.
# Machine Learning is about building programs with **tunable parameters** (typically an
# array of floating point values) that are adjusted automatically so as to improve
# their behavior by **adapting to previously seen data.**
# 
# Machine Learning can be considered a subfield of **Artificial Intelligence** since those
# algorithms can be seen as building blocks to make computers learn to behave more
# intelligently by somehow **generalizing** rather than just storing and retrieving data items
# like a database system would do.
# 
# We'll take a look at two very simple machine learning tasks here.
# The first is a **classification** task: the figure shows a
# collection of two-dimensional data, colored according to two different class
# labels.

# In[1]:

# Import libraries
get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
cmap = mpl.colors.ListedColormap(['r', 'g', 'b', 'c'])


# In[2]:

# Create a random set of examples
from sklearn.datasets import make_blobs

X, Y = make_blobs(n_samples=50, centers=2, random_state=23, cluster_std=2.90)

plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=cmap)
plt.show()


# A classification algorithm may be used to draw a dividing boundary
# between the two clusters of points:

# In[3]:

from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="hinge", alpha=0.01, max_iter=200, fit_intercept=True)
clf.fit(X, Y)


# In[4]:

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, .05), np.arange(y_min, y_max, .05))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)


# In[5]:

plt.contour(xx, yy, Z)
plt.scatter(X[:, 0], X[:, 1], c=Y, cmap=cmap)
plt.show()


# This may seem like a trivial task, but it is a simple version of a very important concept.
# By drawing this separating line, we have learned a model which can **generalize** to new
# data: if you were to drop another point onto the plane which is unlabeled, this algorithm
# could now **predict** whether it's a blue or a red point.
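# For example, we can ask the fitted classifier for the label of a new, unlabeled point
# (a minimal added sketch; the query coordinates below are arbitrary, chosen only for illustration):

# In[ ]:

# A hypothetical unlabeled observation (arbitrary coordinates)
new_point = np.array([[0.0, 5.0]])
print(clf.predict(new_point))  # predicted class label: 0 or 1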
# The next simple task we'll look at is a **regression** task: a simple best-fit line
# to a set of data:

# In[6]:

a = 0.5
b = 1.0

# x from 0 to 30
x = 30 * np.random.random(20)

# y = a*x + b with noise
y = a * x + b + np.random.normal(size=x.shape)

plt.scatter(x, y)


# In[7]:

from sklearn.linear_model import LinearRegression
clf = LinearRegression()
clf.fit(x[:, None], y)


# In[8]:

# underscore at the end indicates a fit parameter
print(clf.coef_)
print(clf.intercept_)


# In[9]:

x_new = np.linspace(0, 30, 100)
y_new = clf.predict(x_new[:, None])
plt.scatter(x, y)
plt.plot(x_new, y_new)


# Again, this is an example of fitting a model to data, such that the model can make
# generalizations about new data. The model has been **learned** from the training
# data, and can be used to predict the result of test data:
# here, we might be given an x-value, and the model would
# allow us to predict the y-value. Again, this might seem like a trivial problem,
# but it is a basic example of a type of operation that is fundamental to
# machine learning tasks.

# ## Representation of Data in Scikit-learn
# 
# Machine learning is about creating models from data: for that reason, we'll start by
# discussing how data can be represented in order to be understood by the computer. Along
# with this, we'll build on our matplotlib examples from the previous section and show some
# examples of how to visualize data.
# 
# Most machine learning algorithms implemented in scikit-learn expect data to be stored in a
# **two-dimensional array or matrix**. The arrays can be
# either ``numpy`` arrays, or in some cases ``scipy.sparse`` matrices.
# The size of the array is expected to be `[n_samples, n_features]`
# 
# - **n_samples:** The number of samples: each sample is an item to process (e.g. classify).
#   A sample can be a document, a picture, a sound, a video, an astronomical object,
#   a row in a database or CSV file,
#   or whatever you can describe with a fixed set of quantitative traits.
# - **n_features:** The number of features or distinct traits that can be used to describe each
#   item in a quantitative manner. Features are generally real-valued, but may be boolean or
#   discrete-valued in some cases.
# 
# The number of features must be fixed in advance. However, it can be very high-dimensional
# (e.g. millions of features) with most of them being zeros for a given sample. This is a case
# where `scipy.sparse` matrices can be useful, in that they are
# much more memory-efficient than numpy arrays.
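# For instance (a tiny added sketch with made-up numbers), a dataset of 3 samples described by
# 2 features is simply a 3 x 2 array:

# In[ ]:

# 3 samples (rows) x 2 features (columns) -- the [n_samples, n_features] layout
toy_data = np.array([[5.1, 3.5],
                     [4.9, 3.0],
                     [6.2, 2.8]])
print(toy_data.shape)  # (3, 2)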
# ## A Simple Example: the Iris Dataset
# 
# As an example of a simple dataset, we're going to take a look at the
# iris data stored by scikit-learn.
# The data consists of measurements of three different species of irises.
# There are three species of iris in the dataset, which we can picture here:

# In[10]:

from IPython.display import Image, display

img_path = 'https://raw.githubusercontent.com/jakevdp/sklearn_pycon2015/master/notebooks/images/'
display(Image(url=img_path + 'iris_setosa.jpg'))
print("Iris Setosa\n")
display(Image(url=img_path + 'iris_versicolor.jpg'))
print("Iris Versicolor\n")
display(Image(url=img_path + 'iris_virginica.jpg'))
print("Iris Virginica")

display(Image(url='https://raw.githubusercontent.com/albahnsen/PracticalMachineLearningClass/6160065e1e574a20edddc47116a0512d20656e26/notebooks/iris_with_length.png'))
print('Iris versicolor and the petal and sepal width and length')
print('From: Python Data Analytics, Apress, 2015.')


# ### Quick Question:
# 
# **If we want to design an algorithm to recognize iris species, what might the data be?**
# 
# Remember: we need a 2D array of size `[n_samples x n_features]`.
# 
# - What would the `n_samples` refer to?
# 
# - What might the `n_features` refer to?
# 
# Remember that there must be a **fixed** number of features for each sample, and feature
# number ``i`` must be a similar kind of quantity for each sample.

# ### Loading the Iris Data with Scikit-Learn
# 
# Scikit-learn has a very straightforward set of data on these iris species. The data consist of
# the following:
# 
# - Features in the Iris dataset:
# 
#   1. sepal length in cm
#   2. sepal width in cm
#   3. petal length in cm
#   4. petal width in cm
# 
# - Target classes to predict:
# 
#   1. Iris Setosa
#   2. Iris Versicolour
#   3. Iris Virginica
# 
# ``scikit-learn`` embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:

# In[11]:

from sklearn.datasets import load_iris
iris = load_iris()
iris.keys()


# In[12]:

n_samples, n_features = iris.data.shape
print((n_samples, n_features))
print(iris.data[0])


# In[13]:

print(iris.data.shape)
print(iris.target.shape)


# In[14]:

print(iris.target)
print(iris.target_names)


# This data is four-dimensional, but we can visualize two of the dimensions
# at a time using a simple scatter-plot:

# In[15]:

data_temp = pd.DataFrame(iris.data, columns=iris.feature_names)
data_temp['target'] = iris.target
data_temp['target'] = data_temp['target'].astype('category')
data_temp['target'] = data_temp['target'].cat.rename_categories(iris.target_names)

pd.plotting.scatter_matrix(data_temp, figsize=(15, 15))
plt.show()
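# As a quick added check (not part of the original notebook), we can confirm that the samples
# are evenly split across the three species:

# In[ ]:

# Number of samples per class: 50 of each
print(dict(zip(iris.target_names, np.bincount(iris.target))))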
# ### Dimensionality Reduction: PCA
# 
# Principal Component Analysis (PCA) is a dimensionality reduction technique that can find the combinations of variables that explain the most variance.
# 
# Consider the iris dataset. It cannot be visualized in a single 2D plot, as it has 4 features.
# We are going to extract three principal components (combinations of the sepal and petal dimensions) and plot the first two to visualize it:

# In[16]:

X, y = iris.data, iris.target

from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(X)
X_reduced = pca.transform(X)
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap=cmap)


# In[17]:

# For comparison, the same data reduced with Isomap, a nonlinear manifold-learning method
from sklearn.manifold import Isomap
iso = Isomap(n_components=3)
iso.fit(X)
X_iso = iso.transform(X)
plt.scatter(X_iso[:, 0], X_iso[:, 1], c=y, cmap=cmap)


# In[18]:

X_reduced.shape


# In[19]:

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y, cmap=cmap)


# In[20]:

from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(projection='3d')

ax.set_title('Iris Dataset by PCA', size=14)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y, cmap=cmap)
ax.set_xlabel('First eigenvector')
ax.set_ylabel('Second eigenvector')
ax.set_zlabel('Third eigenvector')
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
plt.show()


# ### Clustering: K-means
# 
# Clustering groups together observations that are homogeneous with respect to a given criterion, finding "clusters" in the data.
# 
# Note that these clusters will uncover relevant hidden structure in the data only if the criterion used highlights it.

# In[21]:

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)  # Fixing the RNG in kmeans
k_means.fit(X)
y_pred = k_means.predict(X)

# Cluster indices returned by k-means are arbitrary; swap labels 0 and 1
# so that the cluster numbering lines up with the true class labels
y_pred[y_pred == 0] = -1
y_pred[y_pred == 1] = 0
y_pred[y_pred == -1] = 1

plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=y_pred, cmap=cmap);


# Let's now evaluate the performance of the clustering against the ground truth

# In[22]:

from sklearn.metrics import confusion_matrix

# Compute confusion matrix
cm = confusion_matrix(y, y_pred)
np.set_printoptions(precision=2)
print(cm)


# In[23]:

def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(iris.target_names))
    plt.xticks(tick_marks, iris.target_names, rotation=45)
    plt.yticks(tick_marks, iris.target_names)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()


# In[24]:

plot_confusion_matrix(cm)
print('Accuracy ', np.diag(cm).sum() / cm.sum())


# ### Classification: Logistic Regression and Random Forest

# In[25]:

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X, y)
y_pred = clf.predict(X)

# Note: accuracy here is measured on the same data used for training
cm = confusion_matrix(y, y_pred)
plot_confusion_matrix(cm)
print('Accuracy ', np.diag(cm).sum() / cm.sum())


# In[26]:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X, y)
y_pred = clf.predict(X)

cm = confusion_matrix(y, y_pred)
plot_confusion_matrix(cm)
print('Accuracy ', np.diag(cm).sum() / cm.sum())
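# Both classifiers above also expose per-class probabilities through `predict_proba`
# (a small added sketch; the three-row slice is arbitrary, just for illustration):

# In[ ]:

# Predicted class probabilities for the first three samples (one column per class)
print(clf.predict_proba(X[:3]))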
# ### Recap: Scikit-learn's estimator interface
# 
# Scikit-learn strives to have a uniform interface across all methods,
# and we'll see examples of these below. Given a scikit-learn *estimator*
# object named `model`, the following methods are available:
# 
# - Available in **all Estimators**
#   + `model.fit()` : fit training data. For supervised learning applications,
#     this accepts two arguments: the data `X` and the labels `y` (e.g. `model.fit(X, y)`).
#     For unsupervised learning applications, this accepts only a single argument,
#     the data `X` (e.g. `model.fit(X)`).
# - Available in **supervised estimators**
#   + `model.predict()` : given a trained model, predict the label of a new set of data.
#     This method accepts one argument, the new data `X_new` (e.g. `model.predict(X_new)`),
#     and returns the learned label for each object in the array.
#   + `model.predict_proba()` : for classification problems, some estimators also provide
#     this method, which returns the probability that a new observation has each categorical label.
#     In this case, the label with the highest probability is returned by `model.predict()`.
#   + `model.score()` : for classification or regression problems, most (all?) estimators implement
#     a score method. Scores are between 0 and 1, with a larger score indicating a better fit.
# - Available in **unsupervised estimators**
#   + `model.predict()` : predict labels in clustering algorithms.
#   + `model.transform()` : given an unsupervised model, transform new data into the new basis.
#     This also accepts one argument `X_new`, and returns the new representation of the data based
#     on the unsupervised model.
#   + `model.fit_transform()` : some estimators implement this method,
#     which more efficiently performs a fit and a transform on the same input data.

# ## Flow Chart: How to Choose your Estimator
# 
# This is a flow chart created by scikit-learn super-contributor [Andreas Mueller](https://github.com/amueller) which gives a nice summary of which algorithms to choose in various situations. Keep it around as a handy reference!

# In[27]:

from IPython.display import Image
Image(url="http://scikit-learn.org/dev/_static/ml_map.png")


# Original source on the [scikit-learn website](http://scikit-learn.org/stable/tutorial/machine_learning_map/)
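# To close, a minimal added sketch of the uniform estimator interface described in the recap above,
# reusing the iris arrays `X` and `y` and the classifiers imported earlier:

# In[ ]:

# The same fit / score calls work regardless of the underlying algorithm
for model in (LogisticRegression(), RandomForestClassifier()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))  # training-set accuracy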