#!/usr/bin/env python # coding: utf-8 # A few weeks ago, the R community went through some handwringing about plotting packages. # For outsiders (like me) the details aren't that important, but some brief background might be useful so we can transfer the takeaways to Python. # The competing systems are "base R", which is the plotting system built into the language, and ggplot2, Hadley Wickham's implemntation of the grammar of graphics. # For those interested in more details, checkout # # - http://simplystatistics.org/2016/02/11/why-i-dont-use-ggplot2/ # - http://varianceexplained.org/r/why-I-use-ggplot2/ # - http://flowingdata.com/2016/03/22/comparing-ggplot2-and-r-base-graphics/ # # The most important takeaways, are that # # 1. Either system is capable of producing anything the other can # 2. ggplot is usually better for exploratory analysis # # Item 2 is not universally agreed upon, and it certainly isn't true for every type of chart, but I'm going to use it as fact for now. # # I'm not foolish enough to attempt a formal analogy here, like matplotlib is python's base R. # But there's at least a rough comparison: # like ggplot2, the combination of pandas and seaborn allows for fast iteration and exploration. You can quickly explore a dataset and transformations of that dataset. # When you need to, you can "drop down" into matplotlib for further refinement. # # Overview # # Here's a brief sketch of the plotting landscape as of April 2016. # For some reason, plotting tools feel a bit more personal than other parts of this series so far, so I feel the need to blanket this who discussion in a cavet: this is my personal take, shaped by my personal background and tastes, on how to handle plotting in Python. # ## [Matplotlib](http://matplotlib.org/) # # Matplotlib is an amazing project, and is the foundation of pandas' built-in plotting and Seaborn. # Matplotlib handles everything from the actual drawing to the screen, to several APIs of various levels. # I've found knowing the [pyplot api](http://matplotlib.org/api/pyplot_api.html) useful. # You're less likely to need things like [Transforms](http://matplotlib.org/users/transforms_tutorial.html) or [artists](http://matplotlib.org/api/artist_api.html), but when you do the documentation is there. # I'll typically start with a pandas or seaborn plot, and then make adjustments with the pyplot API. # # ## [Pandas' builtin-plotting](http://pandas.pydata.org/pandas-docs/version/0.18.0/visualization.html) # # `DataFrame` and `Series` have a `.plot` namespace, with various chart types available (`line`, `hist`, `scatter`, etc.). # Pandas objects additional metadata available that can be used to enhance plots (the Index for a better automatic x-axis then `range(n)` or Index names as axis labels for example). # # And since pandas had fewer backwards compatability constraints, it had a bit better default aesthetics, though matplotlib is addressing this in [matplotlib 2.0](http://matplotlib.org/style_changes.html). # # At this point, I see pandas `DataFrame.plot` as a useful exploratory tool for quick throwaway plots. # # ## [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/) # # [Seaborn](https://stanford.edu/~mwaskom/software/seaborn/), created by Michael Waskom, "provides a high-level interface for drawing attractive statistical graphics." Seaborn gives a great API for quickly exploring different visual representations of your data. We'll be focusing on that today # # ## [Bokeh](http://bokeh.pydata.org/en/latest/) # # [Bokeh](http://bokeh.pydata.org/en/latest/) is a (still under heavy development) visualiztion library that targets the browser. # # Like matplotlib, Bokeh has a few APIs at various levels of abstraction. # They have a glyph API, which I suppose is most similar to matplotlib's Artists API, for drawing single or arrays of glpyhs (circles, rectangles, polygons, etc.). # More recently they introduced a Charts API, for producing canned charts from data structures like dicts or DataFrames. # # ## Other Libraries # # This is a (probably incomplete) list of other visualization libraries that I don't know enough about to comment on # # - [Lightning](http://lightning-viz.org/) # - [HoloViews](http://holoviews.org/) # - [Glueviz](http://www.glueviz.org/en/stable/) # - [vispy](http://vispy.org/) # - [bqplot](https://github.com/bloomberg/bqplot) # # Examples # We'll use the `diamonds` dataset from ggplot2. # You could use Vincent Arelbundock's RDatasets to find it (`pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv')`), but I wanted to checkout [feather](https://github.com/wesm/feather). # In[1]: get_ipython().run_line_magic('load_ext', 'rpy2.ipython') # In[2]: get_ipython().run_cell_magic('R', '', "suppressPackageStartupMessages(library(ggplot2))\nlibrary(feather)\nwrite_feather(diamonds, 'diamonds.fthr')\n") # In[66]: import feather df = feather.read_dataframe('diamonds.fthr') df.head() # In[4]: df.info() # In[5]: import bokeh.charts as bc import bokeh.plotting as bk # In[6]: from bokeh.plotting import figure from bokeh.embed import components # Bokeh provides two APIs, a low-level glyph API and a higher-level Charts API. # In[7]: fig = (df.assign(xy = df.x / df.y) .sample(n=500) .pipe(bc.Scatter, "xy", "price")) bk.show(fig) # In[52]: script, div = components(fig) # In[53]: with open('../content/images/script.js', 'w') as f: f.write(script) with open('../content/images/div.js', 'w') as f: f.write(div) # It's not clear to me where the scientific community will come down on Bokeh for exploratory analysis. # The ability to share interactive graphics is compelling. # Personally, I have a lot of intertia in matplotlib that I haven't switched to Bokeh for day-to-day exploratory analysis. # # I have greatly enjoyed Bokeh for building dashboards and [webapps](http://bokeh.pydata.org/en/latest/docs/user_guide/interaction.html) with bokeh server. # It's still young, and I've hit [some rough edges](http://stackoverflow.com/questions/36610328/control-bokeh-plot-state-with-http-request). # The Bokeh team is trying to bridge a tough space. # In[8]: sns.set(context='talk', style='ticks') get_ipython().run_line_magic('matplotlib', 'inline') # # Matplotlib # Since it's relatively new, I should point out that matplotlib 1.5 added support for plotting labeled data. # In[12]: fig, ax = plt.subplots() ax.scatter(x='carat', y='depth', data=df, c='k', alpha=.15) plt.savefig('../content/images/mpl-scatter.png', transparent=True) # This isn't limited to just `DataFrame`s. # It supports anything that uses `__getitem__` (square-brackets) with string keys. # ## Pandas Built-in Plotting # The metadata in DataFrames gives a bit better defaults on plots. # In[67]: df.plot.scatter(x='carat', y='depth', c='k', alpha=.15) plt.tight_layout() plt.savefig('../content/images/pd-scatter.png', transparent=True) # We get axis labels from the column names. # Nothing major, just nice. # # Pandas can be more convienient for plotting a bunch of columns with a shared x-axis (the index). # In[68]: from pandas_datareader import fred gdp = fred.FredReader(['GCEC96', 'GPDIC96'], start='2000-01-01').read() gdp.rename(columns={"GCEC96": "Government Expenditure", "GPDIC96": "Private Investment"}).plot(figsize=(12, 6)) plt.tight_layout() plt.savefig('../content/images/vis-gdp.svg', transparent=True) # ## Seaborn # The rest of this post will focus on seaborn, and why I think it's especially great for exploratory analysis. # # I would encourage you to read Seaborn's [introductory notes](https://stanford.edu/~mwaskom/software/seaborn/introduction.html#introduction) that lay its design philosophy and attempted goals. Some highlights: # # > Seaborn aims to make visualization a central part of exploring and understanding data. # # It does this through a consistent, understandable API. # # > The plotting functions try to do something useful when called with a minimal set of arguments, and they expose a number of customizable options through additional parameters. # # Which works great for exploratory analysis, with the option to turn that into something more complex if it looks promising. # # > Some of the functions plot directly into a matplotlib axes object, while others operate on an entire figure and produce plots with several panels. # # The fact that seaborn is built on matplotlib means that if you are familiar with the pyplot API, you're knowledge will still be useful. # Most seaborn plotting functions (one per chart-type) take a `x`, `y`, `hue`, and `data` arguments (not all are required or used, depending on the plot type). If you're working with DataFrames, you'll pass in strings referring to column names for `x` and `y`, and the DataFrame for `data`. # In[69]: sns.countplot(x='cut', data=df) sns.despine() plt.tight_layout() plt.savefig('../content/images/vis-countplot.svg', transparent=True) # In[70]: sns.barplot(x='cut', y='price', data=df) sns.despine() plt.tight_layout() plt.savefig('../content/images/vis-barplot.svg', transparent=True) # Bivariate relationships can easily be explored, either one at a time: # In[71]: sns.jointplot(x='carat', y='price', data=df, size=8, alpha=.25, color='k', marker='.') plt.tight_layout() plt.savefig('../content/images/vis-joinplot.png', transparent=True) # Or many at once # In[15]: g = sns.pairplot(df, hue='cut') plt.savefig('../content/images/vis-pairplot.png', transparent=True) # `pairplot` is a concenince wrapper around `PairGrid`, and offers our first look at an important seaborn abstraction, the `Grid`. Seaborn `Grids` provide a link between matplolib `Figure`s with multiple `axes`, and features in your dataset. # # There are two main ways of interacting with grids. First, seaborn provides convience-wrapper functions like `pairplot`, that have good defaults for common tasks. If you need more flexibility, you can work with the `Grid` directly by mapping plotting functions over each axes. # In[78]: x = df.select_dtypes(include=[np.number]) # In[85]: x[(x > x.quantile(.05)).all(1) & (x < x.quantile(.95)).all(1)] # In[88]: def core(df, α=.05): mask = (df > df.quantile(α)).all(1) & (x < df.quantile(1 - α)).all(1) return df[mask] # In[92]: cmap = sns.cubehelix_palette(as_cmap=True, dark=0, light=1, reverse=True) (df.select_dtypes(include=[np.number]) .pipe(core) .pipe(sns.PairGrid) .map_upper(plt.scatter, marker='.', alpha=.25) .map_diag(sns.kdeplot) .map_lower(plt.hexbin, cmap=cmap, gridsize=20) ) plt.savefig('../content/images/vis-pairgrid.png', transparent=True) # `FacetGrid` is another class for producing `Grid`s, with control over how each facet (individual axes) gets determined. `PairGrid` is a special case of faceting by each `(x, y)` combination. In this next example, we'll facet by `cut`. # In[22]: g = sns.FacetGrid(df, row='cut', aspect=4, size=1.76, margin_titles=True) g.map(sns.kdeplot, 'price', shade=True, color='k') for ax in g.axes.flat: ax.yaxis.set_visible(False) sns.despine(left=True) g.fig.subplots_adjust(hspace=0.1) g.set(xlim=(0, 15000)) plt.savefig('../content/images/vis-kde-facet.svg', transparent=True); # This last example shows the tight integration with matplotlib. `g.axes` is an array of `matplotlib.Axes` and `g.fig` is a `matplotlib.Figure`. # This is a pretty common pattern when using seaborn: use a seaborn plotting method (or grid) to get a good start, and then adjust with matplotlib as needed. # I *think* (not an expert on this at all) that one thing people like about the grammar of graphics is its flexibility. # You aren't limited to a fixed set of chart types defined by the library author. # Instead, you construct your chart by layering scales, aesthetics and geometries. # # That said, I wouldn't really call what seaborn / matplotlib offer that limited. # You can create pretty complex charts suited to your needs. # In[74]: agged = df.groupby(['cut', 'color']).mean().sort_index().reset_index() g = sns.PairGrid(agged, x_vars=agged.columns[2:], y_vars=['cut', 'color'], size=5, aspect=.65) g.map(sns.stripplot, orient="h", size=10, palette='Blues_d') plt.tight_layout() plt.savefig('../content/images/facet-stripplot.svg', transparent=True) # In[61]: g = sns.FacetGrid(df, col='color', hue='color', col_wrap=4) g.map(sns.regplot, 'carat', 'price') plt.savefig('../content/images/vis-factet-rag.png', transparent=True) # Initially I had many more examples showing off seaborn, but I'll spare you. # Seaborn's [documentation](https://stanford.edu/~mwaskom/software/seaborn/) is thorough (and just beautiful to look at). # # We'll end with a nice scikit-learn integration for exploring the parameter-space on a GridSearch object. # In[62]: from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import GridSearchCV # In[63]: df = sns.load_dataset('titanic') clf = RandomForestClassifier() param_grid = dict(max_depth=[1, 2, 5, 10, 20, 30, 40], min_samples_split=[2, 5, 10], min_samples_leaf=[2, 3, 5]) est = GridSearchCV(clf, param_grid=param_grid, n_jobs=4) y = df['survived'] X = df.drop(['survived', 'who', 'alive'], axis=1) X = pd.get_dummies(X, drop_first=True) X = X.fillna(value=X.median()) est.fit(X, y); # In[64]: scores = est.grid_scores_ rows = [] params = sorted(scores[0].parameters) for row in scores: mean = row.mean_validation_score std = row.cv_validation_scores.std() rows.append([mean, std] + [row.parameters[k] for k in params]) scores = pd.DataFrame(rows, columns=['mean_', 'std_'] + params) # In[65]: sns.factorplot(x='max_depth', y='mean_', data=scores, col='min_samples_split', hue='min_samples_leaf') plt.savefig('../content/images/vis-grid-search.svg', transparent=True)