#!/usr/bin/env python
# coding: utf-8

# # Group-by From Scratch

# *This notebook originally appeared as a post on*
# [*Pythonic Perambulations*](http://jakevdp.github.io/blog/2017/03/22/group-by-from-scratch/)
# *by Jake Vanderplas. The code is MIT-licensed.*

# I've found that one of the best ways to grow in my scientific coding is to spend time comparing the efficiency of various approaches to implementing particular algorithms that I find useful, in order to build an intuition of the performance of the building blocks of the scientific Python ecosystem.
# 
# In this vein, today I want to take a look at an operation that is in many ways fundamental to data-driven exploration: the group-by, otherwise known as the [split-apply-combine](https://www.jstatsoft.org/article/view/v040i01) pattern.
# An archetypal example of a summation group-by is shown in this figure, borrowed from the [Aggregation and Grouping](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/03.08-Aggregation-and-Grouping.ipynb) section of the [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook):

# ![split-apply-combine diagram](https://jakevdp.github.io/figures/split-apply-combine.svg)

# The basic idea is to split the data into groups based on some value, apply a particular operation to the subset of data within each group (often an aggregation), and then combine the results into an output dataframe.
# Python users generally turn to the [Pandas](http://pandas.pydata.org) library for this type of operation, where it is implemented efficiently via a concise object-oriented API:

# In[1]:

keys = ['A', 'B', 'C', 'A', 'B', 'C']
vals = [ 1,   2,   3,   4,   5,   6 ]

# In[2]:

import pandas as pd
pd.Series(vals).groupby(keys).sum()

# An oft-repeated claim in the Python world is that doing this sort of operation efficiently requires Pandas — that the tools provided by NumPy and SciPy are useful for element-wise operations, but not group-wise operations.
# This is a claim I've repeated often in my own teaching and writing, but I found myself recently wondering whether it's really true.
# 
# Below I'll explore this question by approaching the group-by summation from a number of angles, taking advantage of both built-in Python operations and operations provided by NumPy and SciPy.
# I hope that through this, you will gain some intuition for the performance of the various algorithmic building blocks available in the PyData ecosystem, and how you might apply them effectively to solve your own problems.

# ## Timing Functions
# 
# For convenience, let's start by defining some utilities that will let us quickly profile and visualize the group-by implementations we develop.
# We'll assume that each operation to be profiled comes in the form of a function that takes a list of keys and a list of values, and returns a dictionary of summed groups.
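# To make that contract concrete, here is a small sanity check (an illustrative sketch added for this writeup; the name ``check_groupby_sum`` is hypothetical and not part of the original benchmarks) that any of the implementations developed below should pass on the toy data defined above:

# In[ ]:

def check_groupby_sum(groupby_sum):
    """Sanity-check a candidate group-by-sum implementation on the toy data."""
    # 'A' -> 1 + 4, 'B' -> 2 + 5, 'C' -> 3 + 6
    assert groupby_sum(keys, vals) == {'A': 5, 'B': 7, 'C': 9}

# Once ``pandas_groupby`` is defined below, for instance, ``check_groupby_sum(pandas_groupby)`` should pass silently.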
# For example:

# In[3]:

def pandas_groupby(keys, vals):
    return pd.Series(vals).groupby(keys).sum().to_dict()

pandas_groupby(keys, vals)

# To avoid rerunning identical benchmarks within the notebook, I'll make use of Python's ``functools.lru_cache`` decorator, which can cache the results of long-running functions, returning the cached value if the function is called again with the same inputs:

# In[4]:

get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')

from functools import lru_cache

# timeit magic requires global variables, so define them here
timeitfunc = None
timeitkeys = None
timeitvals = None


@lru_cache(maxsize=256)
def _bench_one(func, n_group, size, rseed=754389):
    """Compute a timing and cache it for later use"""
    global timeitfunc, timeitkeys, timeitvals
    rand = np.random.RandomState(rseed)
    timeitkeys = rand.randint(0, n_group, size)
    timeitvals = rand.rand(size)
    timeitfunc = func
    # The following is a magic function; it only works in IPython/Jupyter
    t = get_ipython().run_line_magic('timeit', '-oq timeitfunc(timeitkeys, timeitvals)')
    return t.best


def bench(funcs, n_groups, sizes):
    """Run a set of benchmarks and return the results as a dataframe"""
    n_groups, sizes = np.broadcast_arrays(n_groups, sizes)
    names = [func.__name__.split('_')[0] for func in funcs]
    timings = {name: [_bench_one(func, n_group, size)
                      for n_group, size in zip(n_groups, sizes)]
               for name, func in zip(names, funcs)}
    ind = pd.MultiIndex.from_arrays([n_groups, sizes],
                                    names=['num_groups', 'size'])
    return pd.DataFrame(timings, index=ind)[names]


def plot_benchmarks(funcs, n_groups=None, sizes=None):
    """Plot the benchmarks as a function of data size and number of groups"""
    if n_groups is None:
        n_groups = (10 ** np.linspace(0, 4, 10)).astype(int)
    if sizes is None:
        sizes = (10 ** np.linspace(2, 6, 10)).astype(int)
    timings_sizes = bench(funcs, 10, sizes).reset_index('num_groups', drop=True)
    timings_groups = bench(funcs, n_groups, 10000).reset_index('size', drop=True)

    fig, ax = plt.subplots(1, 2, figsize=(10, 3.5), sharey=True)
    timings_sizes.plot(ax=ax[0], logx=True, logy=True)
    ax[0].set(xlabel='size of dataset', ylabel='seconds',
              title='Varying dataset size with 10 groups')
    timings_groups.plot(ax=ax[1], logx=True, logy=True)
    ax[1].set(xlabel='number of groups', ylabel='seconds',
              title='Varying number of groups with 10000 elements')
    return fig

# With this setup in place, we can plot a benchmark of the Pandas approach with a single concise command:

# In[5]:

plot_benchmarks([pandas_groupby]);

# For the Pandas groupby operation, there is some non-trivial scaling behavior for small datasets, while as the data grows large its execution time becomes approximately linear in the number of data points.
# The Pandas groupby is implemented in [highly-optimized Cython code](https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/algos_groupby_helper.pxi.in), and it provides a nice baseline of comparison for our exploration.
# 
# Let's now take a look at some approaches by which we can implement this operation from scratch.

# ## Group-by with Python dicts

# A simple pure-Python implementation might use straightforward loops over a Python dictionary:

# In[6]:

result = {}
for key, val in zip(keys, vals):
    if key not in result:
        result[key] = 0
    result[key] += val
print(result)

# This is simple enough, but we can remove the conditional by using the built-in ``defaultdict`` collection, which automatically initializes values the first time they are referenced.
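# As a brief aside (a minimal illustration added here, not part of the original post), this is the behavior we will rely on: a ``defaultdict(int)`` creates any missing entry by calling ``int()``, which returns ``0``:

# In[ ]:

from collections import defaultdict

d = defaultdict(int)
d['A'] += 1   # no KeyError: d['A'] is initialized to 0 before the addition
d['A'] += 4
dict(d)       # {'A': 5}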
# Wrapping this in a function gives:

# In[7]:

from collections import defaultdict

def dict_groupby(keys, vals):
    count = defaultdict(int)
    for key, val in zip(keys, vals):
        count[key] += val
    return dict(count)

dict_groupby(keys, vals)

# With this function defined, we can compare its performance to the Pandas function:

# In[8]:

plot_benchmarks([pandas_groupby, dict_groupby]);

# The dictionary implementation has very predictable linear scaling with the size of the data, because it relies on a single pass through the dataset.
# Evidently, up to a couple thousand elements this simple Python approach can even out-perform the Pandas operation, but as the number of points grows, the dict-based implementation becomes far slower than the optimized Pandas code.

# ## Itertools approach

# As another approach to the pure-Python group-by, you might be tempted to turn to the ``itertools.groupby`` function that is included in the standard library.
# Upon closer inspection, this is somewhat complicated by the fact that the operation requires groups to appear sequentially in the input, which necessitates pre-sorting the data in order to properly group all keys.
# 
# Here's one possible implementation:

# In[9]:

from itertools import groupby
from operator import itemgetter

def itertools_groupby(keys, vals):
    get_key, get_val = itemgetter(0), itemgetter(1)
    sorted_pairs = sorted(zip(keys, vals), key=get_key)
    return {key: sum(map(get_val, values))
            for key, values in groupby(sorted_pairs, get_key)}

itertools_groupby(keys, vals)

# Let's see how this compares:

# In[10]:

plot_benchmarks([pandas_groupby, dict_groupby, itertools_groupby]);

# We see that the itertools approach has similar scaling to the dict-based approach, but unsurprisingly it is quite a bit slower, due to the required multiple passes over the data, one of which is a relatively expensive sorting operation.

# ## NumPy-based approaches

# When working with large arrays, one way to speed things up is to use NumPy's element-wise operations to push loops from the interpreted Python layer into the compiled C layer.
# For example, selections in NumPy are most often done via [masking](http://nbviewer.jupyter.org/github/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/02.06-Boolean-Arrays-and-Masks.ipynb) operations, which can be combined with a Python dict comprehension for a nice, concise implementation of the group-by:

# In[11]:

import numpy as np

def masking_groupby(keys, vals):
    keys = np.asarray(keys)
    vals = np.asarray(vals)
    return {key: vals[keys == key].sum()
            for key in np.unique(keys)}

masking_groupby(keys, vals)

# Built into each masking operation is a full pass over the data and the construction of an associated Boolean array.
# These passes happen within the compiled layer of NumPy, and so in practice they end up being faster than the pure-Python loops we saw previously:

# In[12]:

plot_benchmarks([pandas_groupby, dict_groupby, itertools_groupby,
                 masking_groupby]);

# Interestingly, when the number of groups is small, this approach can beat even the specialized code in Pandas.
# When the number of groups is large, however, the one-pass-per-group requirement negates that gain.
# NumPy will not be able to match the scaling of Pandas unless it can identify all groups in a single pass over the data.
# 
# Single-pass identification of all groups is actually possible with the low-level routine ``np.bincount``, which performs a weighted count of the number of occurrences of values in an array of integers.
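# As a quick illustration of that routine (a small sketch added for clarity; the toy arrays here are not part of the original benchmarks), ``np.bincount`` with the ``weights`` argument sums the weights that fall into each integer bin:

# In[ ]:

import numpy as np

x = np.array([0, 1, 1, 2])           # integer "group" labels
w = np.array([10., 20., 30., 40.])   # values to be summed within each group
np.bincount(x, weights=w)            # array([10., 50., 40.])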
# To use it, we must first map all our values to unique integers, which can be done with the ``np.unique`` function.
# The result is a concise and efficient two-pass group-by implementation:

# In[13]:

import numpy as np

def bincount_groupby(keys, vals):
    unique_keys, group = np.unique(keys, return_inverse=True)
    counts = np.bincount(group, weights=vals)
    return dict(zip(unique_keys, counts))

bincount_groupby(keys, vals)

# Let's add this to the benchmark plot:

# In[14]:

plot_benchmarks([pandas_groupby, dict_groupby, itertools_groupby,
                 masking_groupby, bincount_groupby]);

# Compared to our masking-based approach, the result has similar scaling with dataset size, while avoiding the poor behavior for large numbers of groups.
# 
# While it is interesting that an implementation based on generic building blocks in NumPy can approach the performance of purpose-optimized code in Pandas, there remains the problem that the bincount approach is not particularly flexible: for example, a ``min`` or ``max`` aggregation cannot be implemented as a weighted sum.

# ## Sparse matrix approach

# For a more flexible approach still based on generic building blocks available in the standard stack, I'll consider something that requires a bit of out-of-the-box thinking.
# In my experience, some of the most useful low-level routines in the PyData stack are found within the SciPy sparse matrix module: while they are designed for fast manipulation of sparse matrices, they can often be re-purposed and used to solve conceptually distinct problems.
# 
# In this case, we can leverage these tools by constructing a sparse matrix such that each group occupies a single row of the matrix, and then perform a standard aggregation over that matrix to find our desired result.
# We'll again take advantage of ``np.unique`` to identify the appropriate row/group for each value, and use this to build the matrix:

# In[15]:

from scipy import sparse

def sparse_groupby(keys, vals):
    unique_keys, row = np.unique(keys, return_inverse=True)
    col = np.arange(len(keys))
    mat = sparse.coo_matrix((vals, (row, col)))
    return dict(zip(unique_keys, mat.sum(1).flat))

sparse_groupby(keys, vals)

# Let's add this to our benchmarks:

# In[16]:

plot_benchmarks([pandas_groupby, dict_groupby, itertools_groupby,
                 masking_groupby, bincount_groupby, sparse_groupby]);

# Even for larger arrays, this sparse approach comes surprisingly close (within a factor of a few) to the purpose-built group-by implementation within Pandas, and it also provides the wide range of efficient aggregation options available to sparse matrices.
# This exercise in repurposing the sparse matrix internals might seem a bit frivolous in this specific case, but in my own research work I've often found similar approaches useful for the efficient implementation of algorithms that might otherwise necessitate writing a custom C extension.
# 
# I hope you found this discussion useful, and please let me know in the comments if you can think of other clever ways to implement fast group-by functionality in Python!

# *This post was written in an IPython notebook, which can be downloaded*
# [*here*](http://jakevdp.github.io/downloads/notebooks/GroupBy.ipynb),
# *or viewed statically on*
# [*nbviewer*](http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/GroupBy.ipynb).