Towards pandas 1.0

Marc Garcia - @datapythonista

About me

  • 12 years working with Python
  • Master's degree in AI
  • pandas core developer
  • Python fellow
  • NumFOCUS ambassador / volunteer
  • Organiser of the London Python sprints group
  • Data scientist at Tesco

https://twitter.com/datapythonista

About pandas

  • Started by Wes McKinney in 2008 in his spare time
    • Mainly to have R's dataframe functionality in Python
  • Huge API
    • Series has 325 public methods/attributes
    • DataFrame has 224 public methods/attributes
    • Native support for 14 data formats (besides loading from Python objects)
    • More than 1,200 docstrings
  • Huge user base
    • Estimated to have between 5 and 10 million users
  • Developed by the community (contributors and maintainers rarely get paid for their work on pandas)
    • Supported by NumFOCUS

Quick overview

In [ ]:
import pandas

df = pandas.read_csv('data/titanic.csv.gz')
df.head()
In [ ]:
'{:.0%} passengers with unknown age'.format(
    df['Age'].isnull().mean())
In [ ]:
df['Age'].fillna(df.Age.median(), inplace=True)

'{:.0%} passengers with unknown age'.format(
    df['Age'].isnull().mean())
In [ ]:
%matplotlib inline

df[['Age']].boxplot();
In [ ]:
df = df[df.Age < df.Age.quantile(.99)]

df[['Age']].boxplot();
In [ ]:
df['Age'] = pandas.cut(df['Age'],
                       bins=[df.Age.min(), 18, 40, df.Age.max()],
                       labels=['Underage', 'Young', 'Experienced'],
                       include_lowest=True)  # keep the youngest passenger in the first bin
df['Age'].head()
In [ ]:
df['Sex'] = df['Sex'].replace({'female': 1, 'male': 0})

df = df.pivot_table(values='Sex', columns='Pclass', index='Age', aggfunc='mean')
df
In [ ]:
df = df.rename_axis('', axis='columns')
df = df.rename('Class {}'.format, axis='columns')
df.style.format('{:.2%}')

pandas update

Deprecated features

  • .ix method
    • Use .loc and .iloc instead
  • Deprecation of Panel (3-dimensional DataFrame)
    • Use DataFrame with a MultiIndex, or the xarray package instead
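
A minimal sketch of the replacements listed above. The column names, index labels and values are made up for illustration and do not come from the Titanic dataset.

```python
import pandas

df = pandas.DataFrame({'fare': [7.25, 71.28, 7.92]},
                      index=['a', 'b', 'c'])

# Label-based selection: .loc uses index labels (the slice end is inclusive).
by_label = df.loc['a':'b', 'fare']

# Position-based selection: .iloc uses integer positions (the slice end is exclusive).
by_position = df.iloc[0:2, 0]

# A 3-dimensional structure (item x row x column), as Panel used to hold,
# can be stored as a DataFrame with a two-level row index.
index = pandas.MultiIndex.from_product([['item1', 'item2'], ['r1', 'r2']],
                                       names=['item', 'row'])
panel_like = pandas.DataFrame({'x': [1, 2, 3, 4],
                               'y': [5, 6, 7, 8]}, index=index)

# Selecting one "item" returns an ordinary 2-dimensional DataFrame.
item1 = panel_like.loc['item1']
```

Unlike .ix, each accessor has a single meaning, so there is no guessing whether an integer is a label or a position.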

Future deprecations (TBC)

  • SparseDataFrame
    • Main use case is pandas.get_dummies(sparse=True)
    • Not sparse in the sense of the COO, CSC or CSR formats
    • Incomplete and buggy
  • inplace=True
    • Like in df.fillna(0., inplace=True)
In [ ]:
import pandas

df = pandas.read_csv('data/titanic.csv.gz')

df = df[df.Age < df.Age.quantile(.99)]

df['Age'].fillna(df.Age.median(), inplace=True)  # <- Using inplace

df['Age'] = pandas.cut(df['Age'],
                       bins=[df.Age.min(), 18, 40, df.Age.max()],
                       labels=['Underage', 'Young', 'Experienced'],
                       include_lowest=True)  # keep the youngest passenger in the first bin

df['Sex'] = df['Sex'].replace({'female': 1, 'male': 0})

df = df.pivot_table(values='Sex', columns='Pclass', index='Age', aggfunc='mean')

df = df.rename_axis('', axis='columns')

df = df.rename('Class {}'.format, axis='columns')

df.style.format('{:.2%}')
In [ ]:
import pandas

(pandas.read_csv('data/titanic.csv.gz')
       .query('Age < Age.quantile(.99)')
       .assign(Sex=lambda df: df['Sex'].replace({'female': 1, 'male': 0}),
               Age=lambda df: pandas.cut(df['Age'].fillna(df.Age.median()),
                                         bins=[df.Age.min(), 18, 40, df.Age.max()],
                                         labels=['Underage', 'Young', 'Experienced'],
                                         include_lowest=True))
       .pivot_table(values='Sex', columns='Pclass', index='Age', aggfunc='mean')
       .rename_axis('', axis='columns')
       .rename('Class {}'.format, axis='columns')
       .style.format('{:.2%}'))

Some reasons

  • Readability counts
  • Future ability to have lazy evaluation
  • Misleading: inplace=True can still make a copy

Readability counts

In [ ]:
import this

Lazy evaluation

In [ ]:
%time map(lambda x: x ** 2, range(100_000_000_000))

In [ ]:
import numpy
import pandas

data = pandas.Series(numpy.random.random(1_000_000))

%time data.sort_values(ascending=False).head(3)

In [ ]:
%time data.nlargest(3)

Distribution of the subtasks

Memory copies

>>> df = pandas.DataFrame({'foo': [1., 2.],
...                        'bar': [numpy.nan, 4.]})
>>> df
   foo  bar
0  1.0  NaN
1  2.0  4.0

In place (df itself is modified):

>>> df.fillna(0., inplace=True)
>>> df.isnull().any(axis=None)
False

Return a copy (starting again from the original df, which is left untouched):

>>> df_copy = df.fillna(0.)
>>> df.isnull().any(axis=None)
True

Data representation example: Number 79

  • uint8: 01001111
  • int32: 00000000000000000000000001001111
  • float64: 0100000001010011110000000000000000000000000000000000000000000000
  • object: <memory_address> -> <pyobject (refcount, type, size, digit)>
  • string (ascii): 0011011100111001
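
The bit patterns above can be reproduced with the standard library. This is a small sketch, assuming big-endian byte order so that the output reads left to right like the slide; `bits` is a helper name introduced here for illustration.

```python
import struct


def bits(raw):
    """Render a bytes object as a string of 0s and 1s, 8 per byte."""
    return ''.join(f'{byte:08b}' for byte in raw)


uint8_bits = bits(struct.pack('>B', 79))      # unsigned 8-bit integer
int32_bits = bits(struct.pack('>i', 79))      # signed 32-bit integer
float64_bits = bits(struct.pack('>d', 79.0))  # IEEE 754 double precision
ascii_bits = bits('79'.encode('ascii'))       # the characters '7' and '9'

for name, value in [('uint8', uint8_bits), ('int32', int32_bits),
                    ('float64', float64_bits), ('ascii', ascii_bits)]:
    print(f'{name:>8}: {value}')
```

The same number takes 1, 4, 8 or 2 bytes depending on the type, which is why changing the type of a column cannot reuse the existing memory.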

Changing the size or type of a block requires:

  • Allocate a block of different size or type
  • Copy all content to the new block
  • Deallocate old block
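
A quick way to see these steps is at the NumPy level. This sketch only illustrates the principle on a plain array; it is not how the pandas BlockManager is implemented.

```python
import numpy

small = numpy.array([1, 2, 3], dtype=numpy.uint8)  # 1 byte per value
view = small[:2]                                   # a view on the same memory block
wide = small.astype(numpy.float64)                 # 8 bytes per value: a new block

# A slice reuses the original buffer, a type change cannot.
shares_view = numpy.shares_memory(small, view)
shares_wide = numpy.shares_memory(small, wide)

print(small.nbytes, wide.nbytes)  # size of each buffer in bytes
print(shares_view, shares_wide)
```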

Why is it important whether we copy data or not?

In [ ]:
df = pandas.DataFrame({'foo': [1., 2.],
                       'bar': [numpy.nan, 4.]})

print(df.dtypes)
In [ ]:
df.fillna('val', inplace=True)

print(df.dtypes)

Notes on memory copies:

  • With the BlockManager memory copies are frequent

  • inplace=True does not mean memory is not copied

  • df = df.something() syntax does not imply memory copies

"The more you know about the internals of pandas DataFrame,
the more horrified you are."

Wes McKinney @ SciPy 2018

What is the solution?

Apache Arrow - https://arrow.apache.org/

  • Open standard for columnar data representation
  • Optimised to avoid CPU cache misses
  • Native representation of complex formats (e.g. string, categorical, lists...)
  • Built-in parallelism (multiple cores support)
  • Chunking
  • Much more

A back-end designed for pandas needs (and R and others)

Already used in pandas and other projects

  • pandas.read_parquet
  • pyspark.sql.DataFrame.toPandas()
  • turbodbc

But still work in progress to be the pandas backend 😒

Wes is hiring to work full-time on Arrow at Ursa Labs (C++, remote) 😃

What else is happening in pandas?

  • Lots of maintenance
    • Bugfixes, automation, cleaning...
  • Update to new Python versions
    • Python 3.7 compatibility
    • Use of insertion-ordered dicts (DataFrame creation, assign...)
In [ ]:
import pandas

fib = pandas.DataFrame({'fib1': [1, 1, 1], 'fib2': [1, 1, 1]})
fib
In [ ]:
fib = fib.assign(fib3=lambda fib: fib['fib1'] + fib['fib2'],
                 fib4=lambda fib: fib['fib2'] + fib['fib3'])
fib

Dropping Python 2 support

  • In January 2019 (yes, in 4.5 months)
    • Not only pandas, also numpy, matplotlib and others

Some Python 3 features

Old:

samples = 100000000

New:

samples = 100_000_000

Some Python 3 features

Old:

print('samples: %s' % samples)
print('samples: {samples}'.format(samples=samples))

New:

print(f'samples: {samples}')

Some Python 3 features

data = 'My hovercraft is full of eels.'.split()

Old:

first, second, last = data[0], data[1], data[-1]

New:

first, second, *discard, last = data

Cost of supporting Python 2

Supporting the latest version only:

def length(value):
    if isinstance(value, str):
        return len(value)

Supporting Python 2:

def length(value):
    if isinstance(value, compat.string_types):
        return compat.strlen(value)

Cost of supporting Python 2

Supporting the latest version only:

def sorted_apply(func, items):
    return {x: func(x) for x in items}

Supporting Python 2:

def sorted_apply(func, items):
    if compat.PY36:
        return {x: func(x) for x in items}
    else:
        result = collections.OrderedDict()
        for x in items:
            result[x] = func(x)
        return result

Extension arrays

In [ ]:
import numpy
import pandas

s = pandas.Series([1, 2, 3], dtype=numpy.uint8)

s
In [ ]:
s.values

Categories

In [ ]:
s = pandas.Series(['dog', 'cat', 'dog'], dtype='category')

s
In [ ]:
s.values
In [ ]:
s.cat.codes
In [ ]:
s.cat.categories

Cyberpandas

In [ ]:
import cyberpandas

arr = cyberpandas.IPArray(['127.0.0.1', '255.255.255.0'])

s = pandas.Series(arr)

s

Integer with NaN

In [ ]:
import numpy
import pandas

# NumPy integers cannot represent NaN, so the values are upcast to float64
pandas.Series([1, 2, numpy.nan, 4])
In [ ]:
import numpy
import pandas

# Forcing a NumPy integer dtype raises an error: NaN cannot be converted to an integer
pandas.Series([1, 2, numpy.nan, 4], dtype=numpy.uint8)
In [ ]:
import numpy
import pandas

# The nullable 'UInt8' extension dtype keeps the integers and stores the missing value
pandas.Series([1, 2, numpy.nan, 4], dtype='UInt8')

Fletcher

In [ ]:
import random

spam_list = [random.choice(['spam', 'spam,', 'spam!'])
             for i in range(1_000_000)]

spam_list[:10]
In [ ]:
import pandas

s = pandas.Series(spam_list)

s.str.endswith('!').mean()
In [ ]:
%timeit s.str.endswith('!').mean()
In [ ]:
import fletcher

spam_fletcher = fletcher.FletcherArray(spam_list)

s = pandas.Series(spam_fletcher)
s.head()
In [ ]:
%timeit s.text.endswith('!').mean()

Documentation

scikit-learn integration

In [ ]:
import sklearn.pipeline
import sklearn.compose
import sklearn.preprocessing
import sklearn.linear_model

# Each tuple is (transformer, columns), the order current scikit-learn releases expect
preprocess = sklearn.compose.make_column_transformer(
    (sklearn.preprocessing.StandardScaler(), ['Fare']),
    (sklearn.preprocessing.OneHotEncoder(), ['Sex', 'Pclass']))

model = sklearn.pipeline.Pipeline([
    ('preprocess', preprocess),
    ('classifier', sklearn.linear_model.LogisticRegression()),
])
In [ ]:
import pandas

df = pandas.read_csv('data/titanic.csv.gz')
x, y = df[['Fare', 'Sex', 'Pclass']], df['Survived']

model.fit(x, y)

preprocess.fit_transform(x)

pandas 1.0 Roadmap

  • 0.24: September 2018
  • 0.25: December 2018
    • Warnings for all deprecations
  • 1.0
    • Same as 0.25, but with all deprecated features removed
    • If released in 2019, Python 3 only
  • No backward-incompatible changes until 2.0
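
Since 0.25 is meant to warn about every deprecation, one way to prepare for 1.0 is to run existing code with those warnings promoted to errors. This is a minimal sketch using only the standard library `warnings` module. `legacy_call` is a hypothetical stand-in for any call into a deprecated API, and the assumption that deprecations surface as `FutureWarning` or `DeprecationWarning` should be checked against the release notes.

```python
import warnings


def legacy_call():
    """Hypothetical stand-in for a call into a deprecated API."""
    warnings.warn('this feature is deprecated', FutureWarning)
    return 42


def uses_deprecated_api(func):
    """Return True if calling func emits a FutureWarning or DeprecationWarning."""
    with warnings.catch_warnings():
        # Promote deprecation warnings to exceptions inside this block only.
        warnings.simplefilter('error', FutureWarning)
        warnings.simplefilter('error', DeprecationWarning)
        try:
            func()
        except (FutureWarning, DeprecationWarning):
            return True
    return False


found = uses_deprecated_api(legacy_call)
clean = uses_deprecated_api(lambda: 42)
print(found, clean)
```

The same filters can be enabled for a whole test suite, so deprecated calls fail the build before the features are removed.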

https://github.com/pandas-dev/pandas/wiki/Pandas-Sprint-(July,-2018)

Get involved

Thank you!