Tricks to Drop Memory Usage

In [22]:
import numpy as np
import pandas as pd
import nbconvert
import warnings
warnings.filterwarnings("ignore")

1. Reduce DataFrame size

1.1 Change in int datatype

Situation: Let's say you have an Age column with minimum value 1 and maximum value 150, in a dataframe with 10 million rows
Task: Reduce the memory usage of the Age column given the above constraints
Action: Change the original dtype from int32 to uint8
Result: Memory usage drops from 38.1 MB to 9.5 MB, i.e. a 75% reduction

In [2]:
## Initializing minimum and maximum value of age
min_age_value, max_age_value = 1, 150
## Number of rows in dataframe
nrows = int(np.power(10,7))
## creation of Age dataframe (np.random.randint excludes the high endpoint, hence +1)
df_age = pd.DataFrame({'Age':np.random.randint(low=min_age_value,high=max_age_value+1,size=nrows)})
In [3]:
## check memory usage before action
df_age.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
Age    int32
dtypes: int32(1)
memory usage: 38.1 MB
In [4]:
## Range of "uint8"; satisfies range constraint of Age column 
np.iinfo('uint8')
Out[4]:
iinfo(min=0, max=255, dtype=uint8)
In [5]:
## Action: conversion of dtype from "int32" to "uint8"
converted_df_age = df_age.astype(np.uint8)
In [6]:
## check memory usage after action
converted_df_age.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 1 columns):
Age    uint8
dtypes: uint8(1)
memory usage: 9.5 MB
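
As a side note, pandas can also pick the smallest safe integer dtype for you. A minimal sketch (not part of the original run, using the same df_age as above):

## let pandas downcast automatically: with downcast='unsigned',
## pd.to_numeric returns the smallest unsigned integer dtype that fits the values
auto_downcast_age = pd.to_numeric(df_age['Age'], downcast='unsigned')
print(auto_downcast_age.dtype)  ## expected: uint8, since all values fit in 0..255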

1.2 Change in float datatype

Situation: Let's say you have 50,000 search queries and 5,000 documents, and you have computed the cosine similarity of each search query with every document, i.e. a 50,000 x 5,000 matrix. All similarity values are between 0 and 1 and need at least 2 decimal places of precision
Task: Reduce the memory usage of the cosine similarity dataframe given the above constraints
Action: Change the original dtype from float64 to float16
Result: Memory usage drops from 1.9 GB to 476.8 MB (0.46 GB), i.e. a 75% reduction

In [7]:
## no. of documents
ncols = int(5*np.power(10,3))
## no. of search queries
nrows = int(5*np.power(10,4))
## creation of cosine similarity dataframe
df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))
print("No. of search queries: {} and No. of documents: {}".format(df_query_doc.shape[0],df_query_doc.shape[1]))
No. of search queries: 50000 and No. of documents: 5000
In [8]:
## check memory usage before action
df_query_doc.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 5000 entries, 0 to 4999
dtypes: float64(5000)
memory usage: 1.9 GB
In [9]:
## Action: conversion of dtype from "float64" to "float16"
converted_df_query_doc = df_query_doc.astype('float16')
In [10]:
## check memory usage after action
converted_df_query_doc.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Columns: 5000 entries, 0 to 4999
dtypes: float16(5000)
memory usage: 476.8 MB
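
For reference, float16 has a resolution of about 0.001 (roughly 3 decimal digits), which comfortably satisfies the 2-decimal requirement. A quick sanity check on a small slice (a sketch added for illustration, not part of the original run):

## float16 precision reference: resolution is 0.001
print(np.finfo('float16').resolution)
## worst-case rounding error introduced by float16 on a 1000-row sample
sample_original = df_query_doc.iloc[:1000]
sample_converted = converted_df_query_doc.iloc[:1000].astype('float64')
max_abs_error = (sample_original - sample_converted).abs().values.max()
print("Max absolute error after float16 conversion: {:.6f}".format(max_abs_error))
## for values in [0, 1] the error stays well below 0.005, so 2-decimal precision holds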

1.3 Change from object to category datatype

Situation: Let's say you have a Day of Week column with 7 unique values, in a dataframe with 49 million rows
Task: Reduce the memory usage of the Day of Week column given that only 7 unique values exist
Action: Change the dtype from object to category, since the ratio of unique values to number of rows is almost zero
Result: Memory usage drops from 2.9 GB to 46.7 MB (0.045 GB), i.e. a 98% reduction

In [11]:
## unique values of "days of week"
day_of_week = ["monday","tuesday","wednesday","thursday","friday","saturday","sunday"]
## Number of times day_of_week repeats
repeat_times = 7*np.power(10,6)
## creation of days of week dataframe
df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})
print("No of rows in days of week dataframe {}".format(df_day_of_week.shape[0]))
No of rows in days of week dataframe 49000000
In [12]:
## check memory usage before action
df_day_of_week.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49000000 entries, 0 to 48999999
Data columns (total 1 columns):
day_of_week    object
dtypes: object(1)
memory usage: 2.9 GB
In [13]:
## Action: conversion of dtype from "object" to "category"
converted_df_day_of_week = df_day_of_week.astype('category')
In [14]:
## check memory usage after action
converted_df_day_of_week.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49000000 entries, 0 to 48999999
Data columns (total 1 columns):
day_of_week    category
dtypes: category(1)
memory usage: 46.7 MB
In [15]:
## check first two rows of dataframe
converted_df_day_of_week.head(2)
Out[15]:
day_of_week
0 monday
1 monday
In [16]:
## check how mapping of day_of_week is created in category dtype
converted_df_day_of_week.head(2)['day_of_week'].cat.codes
Out[16]:
0    1
1    1
dtype: int8
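
Note that astype('category') assigns integer codes in alphabetical order of the labels, which is why monday maps to code 1 above. If the natural weekday order matters (for sorting or comparisons), an explicit ordered CategoricalDtype can be used instead; a small sketch (added for illustration, not part of the original run):

## ordered categorical dtype with the weekdays in calendar order
day_dtype = pd.api.types.CategoricalDtype(categories=day_of_week, ordered=True)
ordered_df_day_of_week = df_day_of_week.astype({'day_of_week': day_dtype})
## codes now follow calendar order: monday -> 0, tuesday -> 1, ...
print(ordered_df_day_of_week['day_of_week'].cat.codes.head(2))

The memory saving is the same; only the code assignment and ordering behaviour change.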

1.4 Convert to Sparse DataFrame

Situation: Let's say you have a dataframe with a large share of zero or missing values (66%), which is common in NLP tasks such as Count/TF-IDF encoding and in Recommender Systems [2]
Task: Reduce the memory usage of the dataframe
Action: Convert the DataFrame to a SparseDataFrame, since the percentage of non-zero, non-NaN values is very low
Result: Memory usage drops from 228.9 MB to 152.6 MB, i.e. a 33% reduction

In [17]:
## number of rows in dataframe
nrows = np.power(10,7)
## creation of dataframe
df_dense = pd.DataFrame([[0,0.23,np.nan]]*nrows)
In [18]:
## check memory usage before action
df_dense.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
0    int64
1    float64
2    float64
dtypes: float64(2), int64(1)
memory usage: 228.9 MB
In [19]:
## Percentage of Non-zero and Non-NaN values in dataframe
## (np.count_nonzero treats NaN as non-zero, so the NaN count is subtracted)
non_zero_non_nan = np.count_nonzero(df_dense) - df_dense.isnull().sum().sum()
non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)
print("Percentage of Non-Zero Non-NaN values in dataframe {} %".format(non_zero_non_nan_percentage))
Percentage of Non-Zero Non-NaN values in dataframe 33.33 %
In [20]:
## Action: Change of DataFrame type to SparseDataFrame
df_sparse = df_dense.to_sparse()
In [21]:
## check memory usage after action
df_sparse.info(memory_usage='deep')
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 3 columns):
0    Sparse[int64, nan]
1    Sparse[float64, nan]
2    Sparse[float64, nan]
dtypes: Sparse[float64, nan](2), Sparse[int64, nan](1)
memory usage: 152.6 MB
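
Note that DataFrame.to_sparse() was deprecated in pandas 0.25 and removed in 1.0. On newer pandas versions the same result is obtained by casting to a SparseDtype; a minimal sketch of the replacement (variable names are illustrative):

## pandas >= 1.0 equivalent of to_sparse(): cast columns to a sparse dtype
## fill_value=np.nan compresses the NaN entries, matching the to_sparse() default;
## using fill_value=0 instead would compress the zero entries
df_sparse_new = df_dense.astype(pd.SparseDtype("float64", fill_value=np.nan))
df_sparse_new.info(memory_usage='deep')
## fraction of stored (non-fill) values, via the sparse accessor
print(df_sparse_new.sparse.density)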