Notebook

Descriptive statistics of data¶

Marcos Duarte

Here is a function for the calculation of descriptive statistics which might be useful for the initial characterization and visualization of numerical data.
The function signature is:

stats = statdesc(data, missing='NaN', labels=[], alpha=.05, show=2)

And the function help:

In [1]:

import sys
sys.path.insert(1, r'./functions')  # add to pythonpath
from statdesc import statdesc

help(statdesc)

Help on function statdesc in module statdesc:

statdesc(data, missing='NaN', labels=[], alpha=0.05, show=2)
    Descriptive statistics of data.
    
    This function calculates the following statistics for each column
    (variable) of the input: mean and unbiased standard deviation [1]_, 95%
    confidence interval (confidence limits for the mean) with unknown
    population STD [2]_, minimum and maximum, median, 25th and 75th percentiles
    [3]_, test for normality (Shapiro-Wilk's test) [4]_, and a test for
    equality of variances for all columns (Levene's or Bartlett's test) [5]_.
    
    This function also generates plots (if matplotlib is available) to
    visualize the data and shows the calculated statistics on screen.
    
    Parameters
    ----------
    data : array_like
        1D or 2D (column oriented) numerical data with possible missing values
    
    missing : string ('nan') or number (int or float), optional
        option to enter a number representing missing values (default = 'nan')
    
    labels : list of strings, optional
        labels for each column (variable) in data
    
    alpha : float, optional
        statistical significance level (to decide which test for equality of
        variances to use)
    
    show : integer (0 or 1 or 2), optional
        option to show plots with some descritive statistics (0: don't show
        any plot; 1: show plots only for the grouped data; 2: show plots for
        individual data as well as for the grouped data (default))
    
    Returns
    -------
    m_sd : array
        mean and unbiased standard deviation of each column (variable) in data
    
    ci : array
        95% confidence interval (confidence limits for the mean) with unknown
        population STD for each column (variable) in data
    
    min_max : array
        minimum and maximum of each column (variable) in data
    
    quartiles : array
        median, 25th and 75th percentiles of each column (variable) in data
    
    normality : array
        test for normality of each column (variable) in data (Shapiro-Wilk's
        test)
    
    eq_var : array
        test for equality of variances for all columns (variables) in data
        (Levene's or Bartlett's test)
    
    References
    ----------
    .. [1] http://www.itl.nist.gov/div898/handbook/eda/section3/eda356.htm
    .. [2] http://www.itl.nist.gov/div898/handbook/prc/section1/prc14.htm.
    .. [3] http://www.itl.nist.gov/div898/handbook/prc/section2/prc252.htm.
    .. [4] http://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm.
    .. [5] http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm.
    
    Examples
    --------
    >>> import numpy as np
    >>> from statdesc import statdesc
    >>> y = np.random.randn(20,3)
    >>> statdesc(y)                # use the default options
    >>> y[8:12,1] = np.NaN         # add a missing value
    >>> y[12,1] = 2                # add another missing value
    >>> statdesc(y, False, 2, ['A','B'], .01) # set arguments
    >>> m_sd,ci,minmax,quartiles,normality,eq_var = statdesc(y)
    
    See Also
    --------
    scipy.stats.describe : Computes several descriptive statistics using Scipy
    pandas.DataFrame.describe : Computes several descriptive statistics using Pandas

Let's test statdesc.py:

In [2]:

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:

stats = statdesc(np.random.randn(100, 1))

-----------------------------------------------------------
Descriptive statistics for data (100 rows, 1 column)
0 missing values
-----------------------------------------------------------
Variable              Mean             STD
-----------------------------------------------------------
1                -0.031199        1.096605
-----------------------------------------------------------
95% confidence interval with unknown population STD
Variable             Lower           Upper
-----------------------------------------------------------
1                -0.058260       -0.004138
-----------------------------------------------------------
Variable           Minimum         Maximum
-----------------------------------------------------------
1                -2.604322        3.143039
-----------------------------------------------------------
Variable            Median   25th percent.   75th percent.
-----------------------------------------------------------
1                -0.137626       -0.775762        0.788823
-----------------------------------------------------------
Shapiro-Wilk's test for normality
Variable       W statistic         p value
-----------------------------------------------------------
1                 0.986654        0.414805
-----------------------------------------------------------

In [4]:

stats = statdesc(np.random.randn(100, 2), show=1)

-----------------------------------------------------------
Descriptive statistics for data (100 rows, 2 columns)
0 missing values
-----------------------------------------------------------
Variable              Mean             STD
-----------------------------------------------------------
1                 0.200145        1.036444
2                 0.015605        0.945703
-----------------------------------------------------------
95% confidence interval with unknown population STD
Variable             Lower           Upper
-----------------------------------------------------------
1                 0.174568        0.225721
2                -0.007732        0.038943
-----------------------------------------------------------
Variable           Minimum         Maximum
-----------------------------------------------------------
1                -2.688702        2.489992
2                -2.123743        2.438873
-----------------------------------------------------------
Variable            Median   25th percent.   75th percent.
-----------------------------------------------------------
1                 0.208278       -0.359966        0.940224
2                -0.127119       -0.640241        0.749677
-----------------------------------------------------------
Shapiro-Wilk's test for normality
Variable       W statistic         p value
-----------------------------------------------------------
1                 0.990782        0.727672
2                 0.985118        0.323867
-----------------------------------------------------------
Barlett's test for equality of variances
               t statistic         p value
-----------------------------------------------------------
                  0.825739        0.363507
-----------------------------------------------------------

Function cogve.py¶

In [ ]:

# %load statdesc.py
#!/usr/bin/env python
"""Descriptive statistics of data."""

from __future__ import division, print_function

__author__ = 'Marcos Duarte, https://github.com/demotu/BMC'
__version__ = "1.0.2"
__license__ = "MIT"

import numpy as np
import scipy.stats as stats
try:
    import matplotlib.pyplot as plt
    import matplotlib.ticker as ticker
except ImportError:
    plt = None


def statdesc(data, missing='NaN', labels=[], alpha=.05, show=2):
    """
    Descriptive statistics of data.

    This function calculates the following statistics for each column
    (variable) of the input: mean and unbiased standard deviation [1]_, 95%
    confidence interval (confidence limits for the mean) with unknown
    population STD [2]_, minimum and maximum, median, 25th and 75th percentiles
    [3]_, test for normality (Shapiro-Wilk's test) [4]_, and a test for
    equality of variances for all columns (Levene's or Bartlett's test) [5]_.

    This function also generates plots (if matplotlib is available) to
    visualize the data and shows the calculated statistics on screen.

    Parameters
    ----------
    data : array_like
        1D or 2D (column oriented) numerical data with possible missing values

    missing : string ('nan') or number (int or float), optional
        option to enter a number representing missing values (default = 'nan')

    labels : list of strings, optional
        labels for each column (variable) in data

    alpha : float, optional
        statistical significance level (to decide which test for equality of
        variances to use)

    show : integer (0 or 1 or 2), optional
        option to show plots with some descritive statistics (0: don't show
        any plot; 1: show plots only for the grouped data; 2: show plots for
        individual data as well as for the grouped data (default))

    Returns
    -------
    m_sd : array
        mean and unbiased standard deviation of each column (variable) in data

    ci : array
        95% confidence interval (confidence limits for the mean) with unknown
        population STD for each column (variable) in data

    min_max : array
        minimum and maximum of each column (variable) in data

    quartiles : array
        median, 25th and 75th percentiles of each column (variable) in data

    normality : array
        test for normality of each column (variable) in data (Shapiro-Wilk's
        test)

    eq_var : array
        test for equality of variances for all columns (variables) in data
        (Levene's or Bartlett's test)

    References
    ----------
    .. [1] http://www.itl.nist.gov/div898/handbook/eda/section3/eda356.htm
    .. [2] http://www.itl.nist.gov/div898/handbook/prc/section1/prc14.htm.
    .. [3] http://www.itl.nist.gov/div898/handbook/prc/section2/prc252.htm.
    .. [4] http://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm.
    .. [5] http://www.itl.nist.gov/div898/handbook/eda/section3/eda35a.htm.

    Examples
    --------
    >>> import numpy as np
    >>> from statdesc import statdesc
    >>> y = np.random.randn(20,3)
    >>> statdesc(y)                # use the default options
    >>> y[8:12,1] = np.NaN         # add a missing value
    >>> y[12,1] = 2                # add another missing value
    >>> statdesc(y, False, 2, ['A','B'], .01) # set arguments
    >>> m_sd,ci,minmax,quartiles,normality,eq_var = statdesc(y)

    See Also
    --------
    scipy.stats.describe : Computes several descriptive statistics using Scipy
    pandas.DataFrame.describe : Computes several descriptive statistics using Pandas

    """

    data = np.asarray(data)  # convert the input to array
    if len(data.shape) == 1:
        data = data.reshape(data.shape[0], 1)
    # missing data: don't use masked arrray, some functions don't handle that
    if isinstance(missing, (int, float)) and ~np.isnan(missing):
        # if missing option is string, must be 'NaN', then data has already NaN
        data[data == missing] = np.NaN

    m_sd = np.zeros((data.shape[1], 2)) * np.NaN
    ci = np.zeros((data.shape[1], 2)) * np.NaN
    min_max = np.zeros((data.shape[1], 2)) * np.NaN
    quartiles = np.zeros((data.shape[1], 3)) * np.NaN
    normality = np.zeros((data.shape[1], 2)) * np.NaN
    eq_var = np.zeros((1, 2)) * np.NaN
    x = []
    nmiss = 0
    min_len = 0

    for i in range(data.shape[1]):
        # due to missing data, each column can have different length;
        # use list of arrays
        x.append(data[~np.isnan(data[:, i]), i])
        nmiss += data.shape[0] - x[i].shape[0]  # total number of missing value
        # skip empty array (data column with missing data only)
        if x[i].shape[0] == 0:
            print('Skipping column %d, only missing data' % (i + 1))
            continue
        # at least 2 sets with 3 points to test for equality of variances
        if x[i].shape[0] > 2:
            min_len += 1
        # handle labels
        if len(labels) > i and labels[i]:
            pass
        else:
            if len(labels) > i:
                labels[i] = str(i+1)
            else:
                labels.append(str(i+1))
        # summary statistics
        m_sd[i], ci[i], min_max[i], quartiles[i], normality[i] = summary(x[i])
        if show > 1 and plt:  # PLOT
            #plot for each variable
            plot1var(data[:, i], x[i], m_sd[i], min_max[i], normality[i],
                     labels[i], alpha, data.shape[1])

    # remove empty arrays (data columns with missing data only)
    i = 0
    while i < len(x):
        if x[i].size == 0:
            x.pop(i)
        else:
            i += 1

    # test for equality of variances
    if len(x) > 1 and min_len > 1:
        # at least 2 sets with 3 points to run this function
        # Levene's test is an alternative to the Bartlett test. The Levene test
        # is less sensitive than the Bartlett test to departures from normality
        # For data with nornal distribution, Bartlett's test has better
        # performance.
        if np.all(normality[:, 1] > .05):
            eq_var[0] = stats.bartlett(*x)
        else:
            eq_var[0] = stats.levene(*x, center='median')

    if show and plt:  # PLOT
        if data.shape[1] > 1:
            #summary plot
            plotallvar(data, x, min_max, eq_var, min_len, alpha, labels)
            #scatterplot matrix
            scatterplot(data, x, label=labels)

    #print results on screen
    statprint(m_sd, ci, min_max, quartiles, normality, eq_var,
              labels, alpha, data.shape[0], data.shape[1], nmiss, len(x))

    return m_sd, ci, min_max, quartiles, normality, eq_var


def summary(x):
    """summary statistics"""

    # mean and standard deviation (unbiased)
    m_sd = np.mean(x), np.std(x, ddof=1)
    # 95% confidence interval (confidence limits for the mean)
    ci = np.zeros((1, 2)) * np.NaN
    if x.shape[0] > 1:  # at least 2 points to run this function
        ci = stats.t._pdf(.975, x.size - 1) * m_sd[1] / np.sqrt(x.size) * \
            np.array([-1, 1]) + m_sd[0]
    # minimum and maximum
    min_max = x.min(), x.max()
    # median, and 25th and 75th percentiles
    quartiles = np.median(x), np.percentile(x, 25), np.percentile(x, 75)
    # test for normality
    # Shapiro-Wilk function is nicer (returns an exact p value) and simpler
    normality = np.zeros((1, 2)) * np.NaN
    if x.shape[0] > 2:  # at least 3 points to run this function
        normality = stats.shapiro(x)  # Shapiro-Wilk's test
        #A2,critical,sig = stats.anderson(x,dist='norm') #Anderson-Darling test
        #sig2 = sig[A2>critical]
        #normality =  A2, ( sig2[-1] if sig2.size else sig[0] )/100

    return m_sd, ci, min_max, quartiles, normality


def plot1var(data, x, m_sd, min_max, normality, labels, alpha, ncol):
    """Summary plot for each variable"""

    plt.figure(figsize=(7, 5))
    ax1 = plt.subplot(211)
    ax1.plot(data, 'bo', alpha=0.75)
    ax1.plot([0, data.shape[0] - 1], [m_sd[0], m_sd[0]], 'r', linewidth=2)
    ax1.plot([0, data.shape[0] - 1], [m_sd[0] + m_sd[1], m_sd[0] + m_sd[1]],
             'r--', linewidth=2)
    ax1.plot([0, data.shape[0] - 1], [m_sd[0] - m_sd[1], m_sd[0] - m_sd[1]],
             'r--', linewidth=2)
    ax1.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)
    title = 'Variable: Mean= %f, STD= %f' % (m_sd[0], m_sd[1]) if ncol == 1 \
            else 'Variable %s: Mean= %f, STD= %f' % (labels, m_sd[0], m_sd[1])
    ax1.set_title(title)
    #ax1.set_xlabel('Index')
    ax1.set_ylabel('Value')
    if x.shape[0] > 1:
        plt.xlim(xmin=-.5, xmax=data.shape[0] - .5)
        plt.ylim(ymin=min_max[0] - .05*(min_max[1] - min_max[0]),
                 ymax=min_max[1] + .05 * (min_max[1] - min_max[0]))
    ax2 = plt.subplot(223)
    h2 = ax2.boxplot(x, notch=1)
    plt.setp(h2['boxes'], color='r', linewidth=2)
    plt.setp(h2['medians'], color='r', linewidth=2)
    plt.xticks([1], [labels])
    ax2.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)
    ax2.set_title('Boxplot')
    ax2.set_xlabel('Variable')
    ax2.set_ylabel('Value')
    ax3 = plt.subplot(224)
    nbins = 2 * np.sqrt(x.size) if x.size < 100 else np.sqrt(x.size)
    n, bins, patches = ax3.hist(x, nbins, normed=1, fc='blue', alpha=0.75)
    bincenters = np.linspace((bins[0] + bins[1]) / 2,
                             (bins[-2] + bins[-1]) / 2, 100)
    # curve for the normal PDF
    y = stats.norm.pdf(bincenters, loc=m_sd[0], scale=m_sd[1])
    ax3.plot(bincenters, y, 'r-', linewidth=2)
    ax3.set_xlabel('Value')
    #ax3.set_ylabel('Probability')
    distribution = 'normal' if normality[1] > alpha else 'not normal'
    ax3.set_title('Histogram (%s, p=%1.3f)' % (distribution, normality[1]))
    ax3.xaxis.set_major_locator(ticker.MaxNLocator(nbins=5, prune=None))
    ax3.yaxis.set_major_locator(ticker.MaxNLocator(nbins=5, prune=None))
    plt.tight_layout()
    plt.show()


def plotallvar(data, x, min_max, eq_var, min_len, alpha, labels):
    """Summary plot for all variables"""

    plt.figure(figsize=(7, 5))
    ax1 = plt.subplot(211)
    h1 = ax1.plot(data)
    ax1.grid(True)
    ax1.set_title('All variables')
    #ax1.set_xlabel('Index')
    ax1.set_ylabel('Value')
    #ax1.legend(labels[0:data.shape[1]])
    plt.xlim(xmin=-.5, xmax=data.shape[0] - .5)
    if min_max.max()-min_max.min() > 0:
        plt.ylim(ymin=min_max.min() - .05 * (min_max.max() - min_max.min()),
                 ymax=min_max.max() + .05 * (min_max.max() - min_max.min()))
    ax2 = plt.subplot(212)
    h2 = ax2.boxplot(x, notch=1)
    ax2.grid(True, linestyle='-', which='major', color='lightgrey', alpha=0.5)
    ax2.set_title('Boxplot')
    if min_len > 1:
        if eq_var[0, 1] > alpha:
            tit = 'Boxplot (equality of variances, p=%f)' % eq_var[0, 1]
        else:
            tit = 'Boxplot (no equality of variances, p=%f)' % eq_var[0, 1]
        ax2.set_title(tit)
    ax2.set_xlabel('Variable')
    ax2.set_ylabel('Value')
    rot = 0 if len(''.join(labels)) < 50 else 45
    plt.xticks(range(1, data.shape[1] + 1), labels[0: data.shape[1]],
               rotation=rot)
    #Set boxplot color based on color of line plot
    for i in range(len(h1)):
        plt.setp(h2['boxes'][i], color=h1[i].get_c(), linewidth=2)
        plt.setp(h2['medians'][i], color=h1[i].get_c(), linewidth=2)
    plt.tight_layout()
    plt.show()


def scatterplot(data, x, label=None):
    """Scatterplot matrix for array data
    data have all the data (inlcuding missing data)
    x is a list of arrays without the missing data (for histogram and fitting)
    """

    fig, ax = plt.subplots(data.shape[1], data.shape[1], figsize=(8, 8))
    fig.suptitle('Scatterplot Matrix', fontsize=12)
    fig.subplots_adjust(hspace=0.04, wspace=0.04)
    nbins2 = 4 if data.shape[1] > 3 else 5
    for i in range(data.shape[1]):
        for j in range(data.shape[1]):
            #ax1 = plt.subplot(data.shape[1],data.shape[1],data.shape[1]*i+j+1)
            if i == j:
                nbins = 2 * np.sqrt(x[i].size) \
                    if x[i].size < 100 else np.sqrt(x[i].size)
                n, bins, patches = ax[i, j].hist(x[i], nbins, normed=1,
                                                 facecolor='blue', alpha=0.75)
                bincenters = np.linspace((bins[0] + bins[1]) / 2,
                                         (bins[-2] + bins[-1]) / 2, 100)
                y = stats.norm.pdf(bincenters, x[i].mean(), scale=x[i].std())
                ax[i, j].plot(bincenters, y, 'r-', linewidth=2)
                #ax[i, j].annotate(label[j], (0.05, 0.85),
                #    xycoords='axes fraction',fontweight='bold')
            else:
                ax[i, j].plot(data[:, i], data[:, j], 'bo', alpha=0.75)
                ax[i, j].grid(True, linestyle='-', which='major',
                              color='lightgrey', alpha=0.5)
            ax[i, j].xaxis.set_visible(False)
            ax[i, j].yaxis.set_visible(False)
            ax[i, j].xaxis.set_major_locator(ticker.MaxNLocator(nbins=nbins2,
                                             prune='both'))
            ax[i, j].yaxis.set_major_locator(ticker.MaxNLocator(nbins=nbins2,
                                             prune='both'))
            if ax[i, j].is_first_col():
                ax[i, j].yaxis.set_ticks_position('left')
                ax[i, j].yaxis.set_visible(True)
                ax[i, j].set_ylabel(label[i])
            if ax[i, j].is_last_col():
                ax[i, j].yaxis.set_ticks_position('right')
                ax[i, j].yaxis.set_visible(True)
            if ax[i, j].is_first_row():
                ax[i, j].xaxis.set_ticks_position('top')
                ax[i, j].xaxis.set_visible(True)
            if ax[i, j].is_last_row():
                ax[i, j].xaxis.set_ticks_position('bottom')
                ax[i, j].xaxis.set_visible(True)
                ax[i, j].set_xlabel(label[j])
    plt.show()


def statprint(m_sd, ci, min_max, quartiles, normality, eq_var, labels, alpha,
              nrow, ncol, nmiss, nx):
    """print results on screen"""

    print('-----------------------------------------------------------')
    str_row = 'rows' if nrow > 1 else 'row'
    str_col = 'columns' if ncol > 1 else 'column'
    print('Descriptive statistics for data (%d %s, %d %s)' \
          % (nrow, str_row, ncol, str_col))
    print('%d missing values' % nmiss)
    print('-----------------------------------------------------------')
    print('%-10s %15s %15s' % ('Variable', 'Mean', 'STD'))
    print('-----------------------------------------------------------')
    for i in range(ncol):
        print('%-10s %15f %15f' % (labels[i], m_sd[i, 0], m_sd[i, 1]))
    print('-----------------------------------------------------------')
    print('%s' % ('95% confidence interval with unknown population STD'))
    print('%-10s %15s %15s' % ('Variable', 'Lower', 'Upper'))
    print('-----------------------------------------------------------')
    for i in range(ncol):
        print('%-10s %15f %15f' % (labels[i], ci[i, 0], ci[i, 1]))
    print('-----------------------------------------------------------')
    print('%-10s %15s %15s' % ('Variable', 'Minimum', 'Maximum'))
    print('-----------------------------------------------------------')
    for i in range(ncol):
        print('%-10s %15f %15f' % (labels[i], min_max[i, 0], min_max[i, 1]))
    print('-----------------------------------------------------------')
    print('%-10s %15s %15s %15s' % ('Variable', 'Median', '25th percent.',
                                    '75th percent.'))
    print('-----------------------------------------------------------')
    for i in range(ncol):
        print('%-10s %15f %15f %15f' % (labels[i], quartiles[i, 0],
                                        quartiles[i, 1], quartiles[i, 2]))
    print('-----------------------------------------------------------')
    print('%s' % ("Shapiro-Wilk's test for normality"))
    print('%-10s %15s %15s' % ('Variable', 'W statistic', 'p value'))
    print('-----------------------------------------------------------')
    for i in range(ncol):
        print('%-10s %15f %15f' % (labels[i], normality[i, 0], normality[i, 1]))
    print('-----------------------------------------------------------')
    if nx > 1:
        if np.all(normality[:, 1] > alpha):
            print("Barlett's test for equality of variances")
        else:
            print("Levene's test for equality of variances")
        print('%26s %15s' % ('t statistic', 'p value'))
        print('-----------------------------------------------------------')
        print('%26f %15f' % (eq_var[0, 0], eq_var[0, 1]))
        print('-----------------------------------------------------------')


if __name__ == '__main__':
    #import sys
    #statdesc(sys.argv[1:])
    y = np.random.randn(100, 3)  # ; y[5:10,1] = np.nan
    statdesc(y, 1, [], ['A', 'B', 'C', 'D'], .05)