When a measurement is made numerous times, it is often useful to bin (or group) the data
and make a histogram. For example, if the time that it takes a sphere to roll down a ramp
was measured one hundred times, then a histogram of the times would show how they are
distributed. The hist function from the pylab library is useful for making histograms. The example below makes a histogram from a list of 24 numbers.
You can add labels to the histogram just as for other graphs.
import pylab as pl
t = pl.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
pl.figure()
pl.hist(t)
pl.show()
The first line imports the pylab library, which makes the hist function available.
As with other plotting commands, the figure and show functions are also needed.
By default, the histogram will have 10 bins. If no additional arguments are given, the hist function spaces the boundaries of the bins evenly between the smallest and largest values in the data.
The color argument can be used to set the color of the bars in the histogram.
Alternatively, the edgecolor and facecolor arguments separately set the colors of the edges and middle of the bars in the histogram, respectively. Colors can be given as single-letter strings such as 'b' (blue), 'g' (green), 'r' (red), and 'k' (black).
The default is for the edgecolor to be the same as the facecolor. The bins stand out better if the edgecolor is black.
The facecolor argument can also be set to "None" so that the bars only have outlines. Alternatively, you can set fill to False. This is useful if you want to plot data on top of the histograms as shown further below.
import pylab as pl
t = pl.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
pl.figure()
pl.hist(t, facecolor='b', edgecolor='k')
pl.show()
The hist function returns the number of events in each bin, the edges of the bins, and the patches (the objects that draw the bars, which will not be discussed further). These values can be captured by providing three variable names for them as follows.
import pylab as pl
t = pl.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
pl.figure()
events, edges, patches = pl.hist(t, edgecolor='k')
print(events)
print(edges)
pl.show()
[2. 1. 1. 2. 4. 6. 5. 1. 1. 1.]
[2.74 3.206 3.672 4.138 4.604 5.07 5.536 6.002 6.468 6.934 7.4 ]
The array events contains the numbers of occurrences in the 10 bins. The array edges contains 11 elements. (The first 10 elements are the lower edges of the bins and the final element is the upper edge of the final bin.) The bins are the same width, but the edges may end up in unusual places. A number is included in a bin if it is greater than or equal to its lower edge and less than its upper edge. The final bin is the exception: it also includes numbers equal to its upper edge, which is why the largest value (7.40) is counted.
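The bin-membership rule can be checked without making a plot. The sketch below assumes numpy's histogram function, which bins data without drawing anything, and uses a small made-up array (values) chosen so that several numbers fall exactly on bin edges. It is only an illustration of the rule, not part of the example above.

```python
import numpy as np

# Hypothetical data: several values sit exactly on bin edges.
values = np.array([0.0, 1.0, 1.0, 2.0, 3.0])

# Three equal-width bins between 0 and 3, with edges at 0, 1, 2, 3.
counts, edges = np.histogram(values, bins=3, range=(0.0, 3.0))

print(edges)
# A value equal to an interior edge goes into the bin on its right,
# but 3.0 (the upper edge of the final bin) is still counted in the final bin.
print(counts)
```

Here the two values equal to 1.0 are counted in the second bin, while 2.0 and 3.0 both land in the final bin.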
If you set the density argument to True, the function will make an area-normalized histogram. For each bin, the height on the histogram is the probability density, which is the number of events in the bin divided by the product of the total number of events and the width of the bin. The area of each bin in the histogram is the probability of an event being in that bin, so the total area is one. With this option, the probability density is returned instead of the number of events. Compare the example below with the previous example.
import pylab as pl
t = pl.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
pl.figure()
events, edges, patches = pl.hist(t, density=True, edgecolor='k')
print(events)
print(edges)
pl.show()
[0.1788269 0.08941345 0.08941345 0.1788269 0.35765379 0.53648069 0.44706724 0.08941345 0.08941345 0.08941345]
[2.74 3.206 3.672 4.138 4.604 5.07 5.536 6.002 6.468 6.934 7.4 ]
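The normalization can be verified numerically. The following sketch assumes numpy's histogram function bins the data the same way as hist; the names density, widths, prob, and manual are chosen here for illustration. It checks that the bin areas add up to one and that the density equals the count divided by the total count times the bin width.

```python
import numpy as np

t = np.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
              2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
              3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])

# Probability density in each of the 10 default bins
density, edges = np.histogram(t, density=True)

# Width of each bin, then the area (probability) of each bin
widths = np.diff(edges)
prob = density * widths
print(prob.sum())   # total area, expected to be 1 up to rounding

# The same density computed by hand from the raw counts
counts, _ = np.histogram(t)
manual = counts / (counts.sum() * widths)
print(np.allclose(density, manual))
```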
You can control the number of bins by setting the bins argument to an integer, but this doesn’t control the locations of the edges. Choosing an appropriate number of bins is important. If there are too few or too many bins, the histogram won’t show how the events are distributed very well. For example, the same example data is histogrammed below with 3 and 30 bins.
import pylab as pl
t = pl.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
pl.figure()
pl.hist(t, bins=3, edgecolor='k')
pl.figure()
pl.hist(t, bins=30, edgecolor='k')
pl.show()
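The effect of the bin count can also be seen in the counts themselves. This sketch is an assumption-laden aside rather than part of the plotting example: it uses numpy's histogram function to bin without plotting, and the names counts3 and counts30 are made up for illustration.

```python
import numpy as np

t = np.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
              2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
              3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])

# Too few bins: the shape of the distribution is mostly lost.
counts3, edges3 = np.histogram(t, bins=3)
print(counts3)

# Too many bins: with only 24 events, many bins hold 0 or 1 events.
counts30, edges30 = np.histogram(t, bins=30)
print(counts30)
print("empty bins:", int((counts30 == 0).sum()))
```

With 3 bins, half of the events end up in the middle bin, and with 30 bins a number of bins are empty, so neither choice shows the distribution well.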
If you want to have control over the number and location of the bins, you can make the bins argument an array. If you want N bins, the array will have (N + 1) elements. The first N elements are the lower edges of the bins and the final element is the upper edge of the final bin. Usually the bins have equal widths, but they can be made unequal. The array can be made with the linspace function, which comes from the numpy library and is also available through pylab, so no additional import is needed here.
You must specify the first element of the array (the lower edge of the first bin), the last element of the array (the upper edge of the final bin), and the number of elements in the array (one more than the number of bins). The example below would produce 10 bins (not 11) starting at 0 and ending at 10. For the example data, some of the bins are empty and aren't displayed.
import pylab as pl
t = pl.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
bins = pl.linspace(0, 10, 11)
pl.figure()
pl.hist(t, bins, edgecolor='k')
pl.show()
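Bins of unequal width work the same way. In the sketch below the edge values in unequal_edges are hypothetical, picked only to give wide bins in the tails and narrow bins near the peak, and numpy's histogram function is assumed so that the counts can be inspected without a plot.

```python
import numpy as np

t = np.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
              2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
              3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])

# Hypothetical edges: 6 edges define 5 bins of unequal width.
unequal_edges = np.array([2.0, 4.0, 5.0, 5.5, 6.0, 8.0])

counts, edges = np.histogram(t, bins=unequal_edges)
print(counts)
print(len(edges) - len(counts))   # always one more edge than bins
```

When the widths differ, the raw counts of wide and narrow bins are not directly comparable, so an area-normalized histogram (density=True) may be the better choice.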
It is also possible to set the lower and upper limits of the bins using the range argument.
Values outside of the specified range are ignored. The following example produces the same bins as the previous example because the default number of bins is 10.
import pylab as pl
t = pl.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
pl.figure()
events, edges, patches = pl.hist(t, range=(0.0,10.0), edgecolor='k')
print(edges)
pl.show()
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.]
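To see that values outside the range really are dropped, the range can be made narrower than the data. This is a minimal sketch assuming numpy's histogram function; the limits 4.0 and 6.0 and the choice of 4 bins are arbitrary values picked for illustration.

```python
import numpy as np

t = np.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
              2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
              3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])

# Only values between 4.0 and 6.0 are binned; the rest are ignored.
counts, edges = np.histogram(t, bins=4, range=(4.0, 6.0))

print(edges)
print(counts)
print(counts.sum(), "of", len(t), "values were counted")
```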
Note that in all of the examples above each bar is drawn between the edges of its bin, so the center of the bar is midway between the edges, which define what values are counted in that bin. If the values being histogrammed are all integers, it makes more sense to shift the bars to the left so that they are centered over integers. Setting align to "left" will center each bar over the left edge of its bin, which centers the bars over the integers.
import pylab as pl
N = pl.array([4,5,5,6,5,5,5,5,2,2,4,6,4,5,5,5,3,5,4,5,5,5,7,3])
pl.figure()
pl.hist(N, range=(0.0,10.0), edgecolor='b', align='left')
pl.show()
If the bins aren't filled, you can graph points (using scatter) or curves (using plot) on the same figure. If the bins are filled, they can hide the points or curves.
import pylab as pl
N = pl.array([4,5,5,6,5,5,5,5,2,2,4,6,4,5,5,5,3,5,4,5,5,5,7,3])
bins = pl.linspace(0, 10, 11)
x = pl.array([2,3,4,5,6,7])
y = pl.array([1,2,3,12,3,2])
pl.figure()
pl.hist(N, bins, edgecolor='b', fill=False, align='left')
pl.scatter(x,y,c='g')
pl.show()
Sometimes data is binned before it is analyzed. For example, a set of decay times could be binned before fitting the data to an exponential function. The histogram function from the numpy library can be used to bin data without making a plot. The histogram function is similar to the hist function described in the previous section. The range and bins arguments can be used, but it doesn’t return patches. Associating the locations of the bins and the numbers of events in them is a little tricky because the edges array is one element longer than the events array.
If you're counting the occurrences of integers, the lower edges are the appropriate thing to use. In the example below, the resize function makes an array called lower which has a length one less than the length of the edges array, so it just contains the lower edges.
import numpy as np
N = np.array([4,5,5,6,5,5,5,5,2,2,4,6,4,5,5,5,3,5,4,5,5,5,7,3])
events, edges = np.histogram(N, range=(0.0,10.0))
lower = np.resize(edges, len(edges)-1)
print(lower)
print(events)
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[ 0 0 2 2 4 13 2 1 0 0]
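As an alternative to resize, the lower edges can be taken with a slice that drops the final element. The sketch below is one possible way to pair each integer with its count; the loop and the decision to skip empty bins are choices made here for illustration.

```python
import numpy as np

N = np.array([4,5,5,6,5,5,5,5,2,2,4,6,4,5,5,5,3,5,4,5,5,5,7,3])
events, edges = np.histogram(N, range=(0.0, 10.0))

# Slicing off the last element leaves only the lower edges.
lower = edges[:-1]

# Pair each lower edge (an integer value here) with its count.
for value, count in zip(lower, events):
    if count > 0:
        print(int(value), "occurred", count, "times")
```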
For non-integer data, it makes more sense to associate the number of events with the center of the bin. For example, the number of events with values of t between 0 and 1 should be associated with 0.5. The example below will make an array called tmid which is the same length as events and contains the values of t in the middle of the bins. Again, the resize function makes an array called lower which contains the locations of the lower edges of the bins because the final element is dropped.
An array containing the differences between consecutive elements of the edges array is returned by the numpy function diff.
Adding half of the difference between the edges to the lower edge gives the value in the middle of a bin. Note that "np.diff(edges)" is the same length as lower.
import numpy as np
t = np.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
events, edges = np.histogram(t,range=(0.0,10.0))
lower = np.resize(edges, len(edges)-1)
tmid = lower + 0.5*np.diff(edges)
print(tmid)
print(events)
[0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5]
[ 0 0 2 2 4 13 2 1 0 0]
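The bin centers can also be computed as the average of each pair of neighboring edges. This is an equivalent sketch rather than a replacement for the approach above; the name tmid_alt is introduced here only to compare the two results.

```python
import numpy as np

t = np.array([4.94,5.98,5.00,6.06,5.94,5.17,5.12,5.06,
              2.74,2.91,4.24,6.68,4.89,5.88,5.41,5.53,
              3.73,5.80,4.26,5.50,5.73,5.29,7.40,3.55])
events, edges = np.histogram(t, range=(0.0, 10.0))

# Lower edges are edges[:-1] and upper edges are edges[1:],
# so their average is the center of each bin.
tmid_alt = 0.5 * (edges[:-1] + edges[1:])

# Same result as adding half of the bin width to the lower edge.
tmid = edges[:-1] + 0.5 * np.diff(edges)
print(np.allclose(tmid, tmid_alt))
print(tmid_alt)
```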
Further information is available at:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html
http://docs.scipy.org/doc/numpy/reference/generated/numpy.histogram.html