#!/usr/bin/env python
# coding: utf-8

# # Data Analysis and Machine Learning Applications for Physicists
# 
# *Material for a* [*University of Illinois*](http://illinois.edu) *course offered by the* [*Physics Department*](https://physics.illinois.edu). *This content is maintained on* [*GitHub*](https://github.com/illinois-mla) *and is distributed under a* [*BSD3 license*](https://opensource.org/licenses/BSD-3-Clause).
# 
# [Table of contents](Contents.ipynb)

# In[ ]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')


# ## Load Data

# In[ ]:


from mls import locate_data


# In[ ]:


line_data = pd.read_csv(locate_data('line_data.csv'))


# In[ ]:


pong_data = pd.read_hdf(locate_data('pong_data.hf5'))
pong_targets = pd.read_hdf(locate_data('pong_targets.hf5'))
pong_both = pd.concat([pong_data, pong_targets], axis='columns')


# ## Low-level Plots

# The low-level historical foundation of python visualization is [matplotlib](https://matplotlib.org/), which was originally created to emulate the successful MATLAB graphics beloved by generations of engineers. Matplotlib is still relevant today, and often the best choice for simple tasks or to produce publication-quality plots. The downside of matplotlib is that relatively simple tasks often require many low-level steps.

# Use `plt.hist`, `plt.xlabel`, `plt.ylabel` and `plt.legend` to reproduce the plot below that shows the 1D distributions ("histograms") of x and y values in `line_data`. Note how this goes one step further than the summary statistics by showing the full distribution of a single column.
# 
# ![hist example](img/Visualization/hist.png)

# In[ ]:


plt.hist(line_data['x'], bins=20, range=(-2.5, +2.5), label='x', density=True, alpha=0.5)
plt.hist(line_data['y'], bins=20, range=(-2.5, +2.5), label='y', density=True, alpha=0.5)
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.legend();
#plt.savefig('img/Visualization/hist.png')


# In[ ]:


# Add your solution here...


# Use `plt.errorbar`, `plt.xlabel`, `plt.ylabel` and `plt.legend` to reproduce the plot below that shows the first 12 rows of `line_data`:
# 
# ![errorbar example](img/Visualization/errorbar.png)

# In[ ]:


plt.errorbar(line_data['x'][:12], line_data['y'][:12], line_data['dy'][:12], fmt='o', color='blue', markersize=5, ecolor='red', label='y +/- dy')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc='upper left');
# plt.savefig('img/Visualization/errorbar.png')


# In[ ]:


# Add your solution here...


# Use `plt.scatter`, `plt.xlabel`, `plt.ylabel` and `plt.colorbar` to reproduce the plot below that shows the 2D distribution of x and y in `line_data`, color coded by each row's dy value. Note how this type of plot reveals correlations between a pair of columns that cannot be deduced from their 1D histograms.
# 
# ![scatter example](img/Visualization/scatter.png)

# In[ ]:


plt.scatter(line_data['x'], line_data['y'], c=line_data['dy'], s=10)
plt.xlabel('x')
plt.ylabel('y')
plt.colorbar().set_label('dy')
#plt.savefig('img/Visualization/scatter.png')


# In[ ]:


# Add your solution here...


# ## Higher-level Plots

# Visualization tools are rapidly evolving in python today (watch [this PyCon2017 talk](https://www.youtube.com/watch?v=FytuB8nFHPQ) for a recent overview), and there is no single best choice for all applications.
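# Before moving on, here is one way to combine the three matplotlib plots above into a single figure using the object-oriented `plt.subplots` interface. The number of explicit steps required makes the "many low-level steps" trade-off mentioned above concrete:

# In[ ]:


# A minimal sketch combining the three plots above into one figure.
# Every label, range, legend and colorbar must be requested explicitly.
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 4))
# 1D histograms of x and y.
ax1.hist(line_data['x'], bins=20, range=(-2.5, +2.5), label='x', density=True, alpha=0.5)
ax1.hist(line_data['y'], bins=20, range=(-2.5, +2.5), label='y', density=True, alpha=0.5)
ax1.set_xlabel('Value')
ax1.set_ylabel('Probability Density')
ax1.legend()
# Error bars for the first 12 rows.
ax2.errorbar(line_data['x'][:12], line_data['y'][:12], line_data['dy'][:12],
             fmt='o', color='blue', markersize=5, ecolor='red', label='y +/- dy')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.legend(loc='upper left')
# Scatter plot of x vs y, color coded by dy.
points = ax3.scatter(line_data['x'], line_data['y'], c=line_data['dy'], s=10)
ax3.set_xlabel('x')
ax3.set_ylabel('y')
fig.colorbar(points, ax=ax3).set_label('dy')
fig.tight_layout()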
# For this course, we will use [seaborn](https://seaborn.pydata.org/index.html) since it is specifically designed for statistical data visualization and is reasonably mature. It also works well with matplotlib, which it uses under the hood to produce its plots.

# The Seaborn version of `plt.hist` is `sns.distplot`, which automatically calculates optimized bins (using the [Freedman-Diaconis rule](https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule)) and superimposes a smooth estimate of the probability density (using the [KDE method](https://en.wikipedia.org/wiki/Kernel_density_estimation) - we will learn more about density estimation soon).

# In[ ]:


sns.distplot(line_data['x'], label='x')
sns.distplot(line_data['y'], label='y')
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.legend();


# The Seaborn version of `plt.scatter` is `sns.jointplot`, which shows 1D histograms along the sides and comes in several kinds for displaying the central 2D information, e.g.

# In[ ]:


sns.jointplot('x', 'y', line_data, kind='hex');


# Note how the axis labels are generated automatically. The "pearsonr" value measures the [amount of correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between x and y, and "p" estimates the probability of seeing at least this much correlation in a sample where x and y are actually uncorrelated.

# **EXERCISE:** Read about the other kinds of `sns.jointplot` and try each one for x and y. Which seems most useful for a small data set? For a large data set?

# I recommend 'scatter' for a small dataset, but it is not a good choice for a large dataset since it is relatively slow and your eye is inevitably drawn to the outliers that are least representative of the general trends.
# 
# For a large dataset, I would start with 'hex' and then proceed to 'kde' if the 'hex' distribution looks sufficiently smooth.
# 
# The 'reg' and 'resid' kinds are intended for cases where there is an obvious linear relationship between the pair of columns.

# A "jointplot" gives a lot of insight into a pair of columns in a dataset. A "pairplot" takes this one step further with a grid of jointplots. (There is also a dedicated python package for pairplots called [corner](http://corner.readthedocs.io/en/latest/)).

# In[ ]:


sns.pairplot(line_data.dropna());


# The resulting grid shows a 2D scatter plot for each pair (i,j) of columns, with 1D histograms on the diagonal (i=j). Note that the (j,i) scatter plot is just (i,j) transposed, so this format is not very space efficient.

# A pairplot is less practical for a dataset with lots of columns, but you can zoom in on any submatrix, for example:

# In[ ]:


sns.pairplot(pong_both, x_vars=('x0', 'x1', 'x2', 'x3'), y_vars=('y6', 'y7', 'y8'));


# The three clusters that appear in each scatter plot suggest that there are three types of observations (rows) in the dataset. We can confirm this using the combined data+targets dataset and color-coding each row according to its target "grp" value:

# In[ ]:


sns.pairplot(pong_both, x_vars=('x0', 'x1', 'x2', 'x3'), y_vars=('y6', 'y7', 'y8'), hue='grp');


# We will soon learn how to automatically assign rows to the clusters you discover during visualization, i.e., without "cheating" with the target 'grp' column.

# ## Beyond 2D

# There are several options to go beyond 2D visualization of pairs:
# - Use attributes such as color and size to display additional dimensions.
# - Use 3D visualizations (see the sketch below).
# - Use an embedding transformation to map high dimensional data to 2D or 3D.
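# As a minimal sketch of the second option, the cell below makes a 3D scatter plot of three of the `pong_both` columns plotted earlier, using matplotlib's built-in `mplot3d` toolkit:

# In[ ]:


from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection (automatic in recent matplotlib)

# 3D scatter plot of three of the columns shown in the pairplots above.
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(pong_both['x0'], pong_both['x1'], pong_both['y6'], s=5)
ax.set_xlabel('x0')
ax.set_ylabel('x1')
ax.set_zlabel('y6');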
# We have already seen examples above of using color to add a third dimension to a 2D scatter plot.

# **EXERCISE:** Try out the embedding approach by uploading `pong_data` to http://projector.tensorflow.org/. You will first need to save the data in the "tab-separated value" (TSV) format:

# In[ ]:


pong_data.to_csv('pong_data.tsv', sep='\t', index=False, header=False)


# Look at the 3D PCA visualization of this dataset. What is the raw dimension (number of columns) of this data? What is the effective number of dimensions revealed by the visualization? We will learn more about PCA and counting effective dimensions soon.
# 
# You are welcome to also experiment with the [tSNE](https://distill.pub/2016/misread-tsne/) visualizations, but I have not found them to be very enlightening for scientific data.

# ## How to use visualization

# What should you be looking for when you visualize your data?
# 1. Is there visible clustering in some pairs?
# 2. Which pairs of columns are most correlated?
# 3. When a correlation exists:
#    - Is it roughly linear or strongly non-linear?
#    - How much variance (spread) is there about the average trend?
# 
# There is usually variance around the average trend, and it can arise from two different sources (in general, it is some mixture of both; see the sketch below):
# - Correlations with additional variables (which might not even be in your dataset).
# - Random noise due to the measurement process, etc.
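# As a rough sketch of that last point, the cell below fits a straight line to the `line_data` columns with `np.polyfit` and compares the spread of the residuals about this average trend with the quoted measurement errors `dy`. If the residual spread were much larger than a typical `dy`, the excess would suggest correlations with additional variables rather than pure measurement noise:

# In[ ]:


# Fit a straight line y = slope * x + intercept to (x, y).
fit_data = line_data.dropna()
slope, intercept = np.polyfit(fit_data['x'], fit_data['y'], deg=1)
# Spread of the residuals about the average (linear) trend.
residuals = fit_data['y'] - (slope * fit_data['x'] + intercept)
print('Residual spread about the linear trend: {:.3f}'.format(residuals.std()))
print('Typical quoted measurement error dy:    {:.3f}'.format(fit_data['dy'].mean()))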