import pandas as pd
import seaborn as sns
iris = sns.load_dataset('iris')
iris_init = iris.copy() # keep an untouched copy to measure the reduction against
iris.head()
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
iris.dtypes
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object
iris.memory_usage(deep=True)
Index            128
sepal_length    1200
sepal_width     1200
petal_length    1200
petal_width     1200
species         9800
dtype: int64
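Note that deep=True matters here: without it, pandas counts only the 8-byte object pointers for the species column, not the Python strings they point to. A quick sanity check (exact byte counts can vary with platform and pandas version):

iris.memory_usage()  # shallow: species reported as ~1200 bytes (150 rows * 8-byte pointers)
iris.memory_usage(deep=True)  # deep: species reported as ~9800 bytes, strings included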
category

This usually gives the biggest memory savings. Instead of storing whole strings/objects, pandas keeps each unique value once and stores only small integer codes per row, which significantly reduces memory usage.
iris.species = iris.species.astype("category")
"Column memory usage is reduced by {:.1%}".format(1 - iris.species.memory_usage(deep=True) / iris_init.species.memory_usage(deep=True))
'Column memory usage is reduced by 94.4%'
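The codes-plus-lookup-table structure is easy to inspect with the standard .cat accessor, which shows where the savings come from:

iris.species.cat.categories  # Index(['setosa', 'versicolor', 'virginica'], dtype='object')
iris.species.cat.codes.head()  # per-row integer codes; int8 suffices for 3 categories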
pd.to_numeric and the downcast argument

By default pandas uses the float64 numeric type, which is one of the heaviest. Some optimization can be done here given that we do not need such high precision in this case (in fact, this dataset has only one digit after the decimal point).
columns = iris.columns.drop("species")
iris[columns] = iris[columns].apply(pd.to_numeric, downcast="float")
iris.memory_usage(deep=True)
Index           128
sepal_length    600
sepal_width     600
petal_length    600
petal_width     600
species         426
dtype: int64
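As a quick check that the downcast took effect (downcast="float" goes at most down to float32, never float16), the dtypes can be inspected again; the expected output is:

iris.dtypes
sepal_length     float32
sepal_width      float32
petal_length     float32
petal_width      float32
species         category
dtype: object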
"In total memory usage is reduced by {:.1%}".format(1 - iris.memory_usage(deep=True).sum() / iris_init.memory_usage(deep=True).sum())
'In total memory usage is reduced by 79.9%'
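Both steps can be wrapped into a small reusable helper. This is just a sketch, not part of the original notebook: shrink_dataframe is a hypothetical name, and it only covers the float and object cases shown above.

def shrink_dataframe(df):
    # Hypothetical helper: downcast float columns and categorize object columns.
    # It does not handle integer downcasting, and for high-cardinality string
    # columns category can actually cost more memory than object.
    out = df.copy()
    float_cols = out.select_dtypes(include="float").columns
    out[float_cols] = out[float_cols].apply(pd.to_numeric, downcast="float")
    for col in out.select_dtypes(include="object").columns:
        out[col] = out[col].astype("category")
    return out

shrink_dataframe(iris_init).memory_usage(deep=True).sum()  # should match the optimized total above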