• Title: Exploring line lengths in FluidDyn packages
  • Authors: Ashwin Vishnu
  • Date: 2018-02-22
  • Modified: 2018-02-22
  • Tags: fluiddyn
  • Category: News

This analysis was inspired by an interesting study of tweet lengths published on Twitter’s engineering blog. The gist of that analysis was this:

English language tweets display a roughly log-normal distribution of character counts, except near the 140-character limit - evidence of people cramming their thoughts to fit within 140 characters.

This prompts a similar question about the 79-character line limit suggested by Python’s PEP8 style guide:

Limit all lines to a maximum of 79 characters.

Jake VanderPlas investigated this for scientific Python packages. Let’s see how it plays out in the FluidDyn packages!

Counting Lines in fluiddyn

To take a look at this, we first need a way to access all the raw lines of code in any Python package. Conveniently, if you have installed a package, you already have its Python source stored somewhere on your system.

In [1]:
import fluiddyn
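
Since the source lives on disk, the package's __path__ attribute tells us where the installed source tree is (a quick check; the exact location will vary from system to system):

# Show the directory containing the package's source files
print(fluiddyn.__path__[0])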

With this in mind, we can use the os.walk function to write a quick generator function that will iterate over all lines of Python code in a given package:

In [2]:
# Python 3.X
import os

def iter_lines(module):
    """Iterate over all lines of Python in module"""
    for root, dirs, files in os.walk(module.__path__[0]):
        for filename in files:
            if filename.endswith('.py'):
                with open(os.path.join(root, filename)) as f:
                    yield from f

Calling this function returns a generator that iterates over all lines in order; let's see how many lines of Python code fluiddyn contains:

In [3]:
lines = iter_lines(fluiddyn)
len(list(lines))
Out[3]:
11340
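
Note that a generator is exhausted after a single pass, which is why the cells below call iter_lines afresh each time. As an aside, we could get the same count without materializing the whole list (a minor variation, not part of the original cell):

# Count lines lazily, without building an intermediate list
sum(1 for _ in iter_lines(fluiddyn))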

Given this we can find the lengths of all the lines and plot a histogram:

In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('ggplot')
In [5]:
lengths = [len(line) for line in iter_lines(fluiddyn)]
plt.hist(lengths, bins=np.arange(max(lengths)), histtype='step', linewidth=1);

The distribution is dominated by lines of length 1 (i.e. containing only a newline character): let's go ahead and remove these, and also narrow the x limits:

In [6]:
lengths = [len(line) for line in iter_lines(fluiddyn) if len(line) > 1]
plt.hist(lengths, bins=np.arange(125), histtype='step', linewidth=1);

Now this is looking interesting!

Cleaning the Distribution

The first feature that pops out is the series of large spikes among the shorter lines (below about 30 characters). To explore these, let's look at the lines that make up the largest spike:

In [7]:
np.argmax(np.bincount(lengths))
Out[7]:
27

We can use a list comprehension to extract all lines of length 27:

In [8]:
lines27 = [line for line in iter_lines(fluiddyn) if len(line) == 27]
len(lines27)
Out[8]:
237

There are 237 lines in fluiddyn with a length of 27 characters. It's interesting to explore how many of these short lines are identical. The pandas value_counts function is useful for this:

In [9]:
import pandas as pd
pd.value_counts(lines27).head(10)
Out[9]:
if __name__ == '__main__':\n    48
from builtins import range\n    15
    nb_cores_per_node = 16\n     3
        sys.stdout.flush()\n     3
        except ValueError:\n     3
    print(*args, **kwargs)\n     2
        for serie in self:\n     2
        return self.shapeK\n     2
    except AttributeError:\n     2
        self.coef_norm = n\n     2
dtype: int64

It’s clear that many of these duplicates are just a manifestation of the if __name__ == '__main__': line and of imports from the python-future package, recurring at the beginning and end of many modules.
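
If you prefer to stay within the standard library, collections.Counter gives the same tally (a sketch equivalent to the pandas call above):

from collections import Counter

# Tally identical lines and show the five most common ones
Counter(lines27).most_common(5)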

In [10]:
lengths = [len(line) for line in set(iter_lines(fluiddyn))]
plt.hist(lengths, bins=np.arange(125), histtype='step', linewidth=1);

That's a much cleaner distribution!

Comparison Between Packages

To aid comparison between packages, let's quickly refactor the above histogram code into a function that we can re-use. In addition, we'll add a vertical line at PEP8's maximum line length:

In [11]:
def hist_linelengths(module, ax):
    """Plot a histogram of lengths of unique lines in the given module"""
    lengths = [len(line.rstrip('\n')) for line in set(iter_lines(module))]
    h = ax.hist(lengths, bins=np.arange(125) + 0.5, histtype='step', linewidth=1.5)
    ax.axvline(x=79.5, linestyle=':', color='black')
    ax.set(title="{0} {1}".format(module.__name__, module.__version__),
           xlim=(1, 100),
           ylim=(0, None),
           xlabel='characters in line',
           ylabel='number of lines')
    return h

Now we can use that function to compare the distributions for a number of well-known scientific Python packages:

In [12]:
%%capture --no-display
import fluiddyn, fluidsim, fluidfft, fluidimage, fluidlab, fluidcoriolis
modules = [fluiddyn, fluidsim, fluidfft, fluidimage, fluidlab, fluidcoriolis]

fig, ax = plt.subplots(2, 3, figsize=(14, 6), sharex=True)
fig.subplots_adjust(hspace=0.2, wspace=0.2)

for axi, module in zip(ax.flat, modules):
    hist_linelengths(module, ax=axi)

for axi in ax[0]:
    axi.set_xlabel('')
for axi in ax[:, 1:].flat:
    axi.set_ylabel('')

The results here are quite interesting: similar to Twitter’s tweet-length analysis, each of these packages has a somewhat smooth distribution of line lengths, with a “cutoff” at or near the 79-character PEP8 limit! Adherence to the PEP8 standard is reasonable and, unlike the results from Jake’s analysis of other scientific packages, there are no recognizable “bumps” just below the 79-character limit. fluidfft has the cleanest tail, with not a single line above 79 characters - a luxury of being the newest package.

But one package stands out: fluidsim displays some noticeable spikes at a few intermediate line lengths; let’s take a look at these:

In [13]:
lines = set(iter_lines(fluidsim))
counts = np.bincount([len(line) for line in lines])
np.argmax(counts)
Out[13]:
50
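
We can also read off the height of that spike directly (a quick check; the exact count depends on the package version):

# Number of unique lines that are exactly 50 characters long (including the newline)
counts[50]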

The large spike reflects the approximately 300 lines with exactly 50 characters; printing a few of these is interesting:

In [14]:
[line for line in lines if len(line) == 50][:10]
Out[14]:
["def load_bench(path_dir, solver, hostname='any'):\n",
 '        subcommand = compute_name_command(module)\n',
 "                it = group_state_phys.attrs['it']\n",
 '        freq_complex = np.empty_like(state_spect)\n',
 '                         shape_variable=(33, 18),\n',
 '                           sim.oper.KX[ikx, iky],\n',
 "        z_fft = self.state_spect.get_var('z_fft')\n",
 '        Fy_reu = -urx * px_etauy - ury * py_etauy\n',
 '=================================================\n',
 '            PZ2_fft = (abs(Frot_fft)**2)*deltat/2\n']

Maybe not so interesting after all, just a coincidence! :)

Modeling the Line Length Distribution

Following the Twitter character analysis, let's see if we can fit a log-normal distribution to the number of lines with each character count. As a reminder, the log-normal can be parametrized like this:

$$ LogNorm(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} \exp\left(-\frac{[\log(x) - \mu]^2}{2\sigma^2}\right) $$

Here $x$ is the line length in characters, $\exp(\mu)$ is the median of the distribution, and $\sigma$ controls the distribution's width. We can implement this using the lognorm distribution available in scipy:

In [15]:
from scipy import stats

def lognorm_model(x, amplitude, mu, sigma):
    return amplitude * stats.lognorm.pdf(x, scale=np.exp(mu), s=sigma)

x = np.linspace(0, 100, 1000)
plt.plot(x, lognorm_model(x, 1000, 3.5, 0.7));

That seems like an appropriate shape for the left portion of our datasets; let's try optimizing the parameters to fit the counts of lines up to length 50:

In [16]:
from scipy import optimize

counts, bins, _ = hist_linelengths(fluiddyn, ax=plt.axes())
lengths = 0.5 * (bins[:-1] + bins[1:])

def minfunc(theta, lengths, counts):
    return np.sum((counts - lognorm_model(lengths, *theta)) ** 2)

opt = optimize.minimize(minfunc, x0=[10000, 4, 1],
                        args=(lengths[:50], counts[:50]),
                        method='Nelder-Mead')
print("optimal parameters:", opt.x)

plt.fill_between(lengths, lognorm_model(lengths, *opt.x), alpha=0.3, color='gray');
optimal parameters: [6.53825230e+03 3.68332005e+00 4.59205976e-01]

Seems like a reasonable fit! From this, you could argue (as the Twitter engineering team did) that the line lengths might "naturally" follow a log-normal distribution, if it weren't for the artificial imposition of the PEP8 maximum line length.

For convenience, let's create a function that will plot this lognormal fit for any given module:

In [17]:
def lognorm_model(x, theta):
    amp, mu, sigma = theta
    return amp * stats.lognorm.pdf(x, scale=np.exp(mu), s=sigma)

def minfunc(theta, lengths, freqs):
    return np.sum((freqs - lognorm_model(lengths, theta)) ** 2)

def lognorm_mode(amp, mu, sigma):
    return np.exp(mu - sigma ** 2)

def lognorm_std(amp, mu, sigma):
    var = (np.exp(sigma ** 2) - 1) * np.exp(2 * mu + sigma ** 2)
    return np.sqrt(var)
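
For reference, these helpers use the standard log-normal identities (standard results, stated here without derivation):

$$ \mathrm{mode} = \exp(\mu - \sigma^2), \qquad \mathrm{std} = \sqrt{\left(e^{\sigma^2} - 1\right) e^{2\mu + \sigma^2}} $$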
In [18]:
def hist_linelengths_with_fit(module, ax, indices=slice(50)):
    counts, bins, _ = hist_linelengths(module, ax)
    lengths = 0.5 * (bins[:-1] + bins[1:])
    opt = optimize.minimize(minfunc, x0=[1E5, 4, 0.5],
                            args=(lengths[indices], counts[indices]),
                            method='Nelder-Mead')
    model_counts = lognorm_model(lengths, opt.x)
    ax.fill_between(lengths, model_counts, alpha=0.3, color='gray')
    
    # Annotate the plot with the mode and standard deviation of the fit
    ax.text(0.22, 0.15, 'mode = {0:.1f}'.format(lognorm_mode(*opt.x)),
            transform=ax.transAxes, size=14)
    ax.text(0.22, 0.05, 'stdev = {0:.1f}'.format(lognorm_std(*opt.x)),
            transform=ax.transAxes, size=14)
    
    return opt.x

Now we can compare the models for each package:

In [19]:
modules = [fluiddyn, fluidsim, fluidfft, fluidimage, fluidlab, fluidcoriolis]

fig, ax = plt.subplots(2, 3, figsize=(14, 6), sharex=True)
fig.subplots_adjust(hspace=0.2, wspace=0.2)

fits = {}

for axi, module in zip(ax.flat, modules):
    fits[module.__name__] = hist_linelengths_with_fit(module, ax=axi)
    

for axi in ax[0]:
    axi.set_xlabel('')
for axi in ax[:, 1:].flat:
    axi.set_ylabel('')

It is also interesting to use these summary statistics as a way of describing the "essence" of each package, in order to compare them directly:

In [20]:
ha = {'fluidfft': 'right', 'fluidlab': 'right'}
va = {'fluidfft': 'top'}
for name, fit in sorted(fits.items()):
    mode = lognorm_mode(*fit)
    std = lognorm_std(*fit)
    plt.plot(mode, std, 'ok')
    plt.text(mode, std, ' ' + name + ' ', size=14,
             ha=ha.get(name, 'left'), va=va.get(name, 'bottom'))
plt.xlabel('mode of model', size=14)
plt.ylabel('std. deviation of model', size=14);
plt.xlim(27, 45);
plt.ylim(21, 41);

The original version of this notebook appeared as a post on the blog Pythonic Perambulations. The code content is MIT licensed.

This modified post was written entirely in a Jupyter notebook. You can download the notebook, or see a static view on nbviewer.