Quite often, it's most insightful to visualize not "raw" data points, but rather some distributional information about the points. Histograms and box plots are two of the most common statistical graphics for visualizing distributional properties of data.
import numpy as np
from matplotlib import pyplot as plt
x = np.random.randn(1000) # 1000 random normal variables
fig, ax = plt.subplots(1)
h = ax.hist(x) # uses 10 equal-width bins by default
ax.hist(x, bins = 40) # same data, in finer bins
fig
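To see what the bins argument controls, the counts behind a histogram can be computed without drawing anything. The following is a minimal sketch using np.histogram; the fixed seed and the "auto" binning rule are assumptions added for illustration and are not part of the example above.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
x = rng.standard_normal(1000)   # 1000 random normal variables

# np.histogram returns the counts per bin and the bin edges
counts_default, edges_default = np.histogram(x)         # 10 equal-width bins by default
counts_fine, edges_fine = np.histogram(x, bins=40)      # 40 equal-width bins
counts_auto, edges_auto = np.histogram(x, bins="auto")  # number of bins chosen from the data

# There is always one more edge than there are bins
print(len(counts_default), len(counts_fine), len(counts_auto))
```

The same values of bins (an integer or a rule name such as "auto") can be passed to ax.hist.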
When comparing distributions of several data sets, it's often useful to use transparency via the keyword alpha. Setting alpha = 1 generates fully opaque data markings, while alpha = 0 is fully transparent (and therefore invisible).
x1 = np.random.normal(0, 0.8, 1000)
x2 = np.random.normal(-2, 1, 1000)
x3 = np.random.normal(3, 2, 1000)
kwargs = dict(bins=40, alpha = 0.3)
plt.hist(x1, **kwargs)
plt.hist(x2, **kwargs)
plt.hist(x3, **kwargs);
Another excellent way to visualize distributions of data is via boxplots. The orange line gives the median, the bottom and top of the box give the first and third quartiles, and the "whiskers" show the range of the bulk of the distribution: by default they extend to the most extreme data points within 1.5 times the interquartile range of the box. Individual points beyond the whiskers are alleged to be "outliers". To create multiple boxplots (almost always a good idea), pass a list of 1d arrays to the function call.
fig, ax = plt.subplots(1)
box = ax.boxplot([x1, x2, x3])
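The quantities a boxplot draws can also be computed by hand, which makes the picture easier to interpret. This is a rough sketch assuming the default whisker length of 1.5 times the interquartile range; the fixed seed and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed: an assumption, for reproducibility
x = rng.normal(0, 0.8, 1000)

# The box: first quartile, median, third quartile
q1, median, q3 = np.percentile(x, [25, 50, 75])
iqr = q3 - q1

# The whiskers end at the most extreme data points within 1.5 * IQR of the box
lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = x[(x >= lo_fence) & (x <= hi_fence)]
whisker_lo, whisker_hi = inside.min(), inside.max()

# Everything beyond the whiskers is drawn as an individual point
fliers = x[(x < lo_fence) | (x > hi_fence)]

print(f"box: [{q1:.2f}, {q3:.2f}], median: {median:.2f}")
print(f"whiskers: [{whisker_lo:.2f}, {whisker_hi:.2f}], points beyond: {len(fliers)}")
```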
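Two-dimensional histograms are also an excellent way to visualize relationships between variables. With many points, a scatter plot overplots heavily and hides where the data are densest.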
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.scatter(x, y)
fig, ax = plt.subplots(1)
h = ax.hist2d(x, y, bins=50, cmap = "Blues")
plt.colorbar(h[3], label = "counts in bin") # h[3] is the image object that hist2d returns after the counts and bin edges
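The call to ax.hist2d returns four items: the counts, the x bin edges, the y bin edges, and the drawn image, which is why the colorbar is attached to h[3]. The first three can be sketched without plotting via np.histogram2d; the fixed seed below is an assumption added for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = rng.multivariate_normal(mean, cov, 10000).T

# counts[i, j] is the number of points in x-bin i and y-bin j
counts, xedges, yedges = np.histogram2d(x, y, bins=50)

print(counts.shape)              # one entry per cell of the 50 x 50 grid
print(len(xedges), len(yedges))  # one more edge than bins along each axis
```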
While there's often no problem with square grids, hexagonal grids often look a bit nicer. Note that the relevant keyword for hexagonal grids is gridsize rather than bins.
fig, ax = plt.subplots(1)
h = ax.hexbin(x, y, gridsize = 50, cmap = "Blues")
plt.colorbar(h, label = "counts in bin")