import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris_init = iris.copy() # keep an untouched copy to measure the reduction against
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
iris.dtypes
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
iris.memory_usage(deep=True)
Index            128
sepal_length    1200
sepal_width     1200
petal_length    1200
petal_width     1200
species         9800
dtype: int64
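Note that deep=True matters here: without it, pandas counts only the 8-byte object pointers for the species column, not the Python strings they point to. A quick sanity check (exact byte counts can vary with platform and pandas version):

iris.memory_usage()  # shallow: species reported as ~1200 bytes (150 rows * 8-byte pointers)
iris.memory_usage(deep=True)  # deep: species reported as ~9800 bytes, strings included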
category

This usually gives the biggest memory savings. Instead of storing whole strings/objects, pandas keeps each unique value once and stores only small integer codes per row, which significantly reduces memory usage.
iris.species = iris.species.astype("category")
"Column memory usage is reduced by {:.1%}".format(1 - iris.species.memory_usage(deep=True) / iris_init.species.memory_usage(deep=True))
'Column memory usage is reduced by 94.4%'
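The codes-plus-lookup-table structure is easy to inspect with the standard .cat accessor, which shows where the savings come from:

iris.species.cat.categories  # Index(['setosa', 'versicolor', 'virginica'], dtype='object')
iris.species.cat.codes.head()  # per-row integer codes; int8 suffices for 3 categories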
pd.to_numeric and the downcast argument

By default pandas uses the float64 numeric type, which is one of the heaviest. Some optimization can be done here given that we do not need such high precision in this case (in fact, this dataset has only one digit after the decimal point).
columns = iris.columns.drop("species")
iris[columns] = iris[columns].apply(pd.to_numeric, downcast="float")
iris.memory_usage(deep=True)
Index           128
sepal_length    600
sepal_width     600
petal_length    600
petal_width     600
species         426
dtype: int64
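As a quick check that the downcast took effect (downcast="float" goes at most down to float32, never float16), the dtypes can be inspected again; the expected output is:

iris.dtypes
sepal_length     float32
sepal_width      float32
petal_length     float32
petal_width      float32
species         category
dtype: object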
"In total memory usage is reduced by {:.1%}".format(1 - iris.memory_usage(deep=True).sum() / iris_init.memory_usage(deep=True).sum())
'In total memory usage is reduced by 79.9%'
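Both steps can be wrapped into a small reusable helper. This is just a sketch, not part of the original notebook: shrink_dataframe is a hypothetical name, and it only covers the float and object cases shown above.

def shrink_dataframe(df):
    # Hypothetical helper: downcast float columns and categorize object columns.
    # It does not handle integer downcasting, and for high-cardinality string
    # columns category can actually cost more memory than object.
    out = df.copy()
    float_cols = out.select_dtypes(include="float").columns
    out[float_cols] = out[float_cols].apply(pd.to_numeric, downcast="float")
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].astype("category")
    return out

shrink_dataframe(iris_init).memory_usage(deep=True).sum()  # should match the optimized total above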