Pandas parallelization

In [1]:

import pandas as pd
import numpy as np
import swifter

Data

Shape: 100,000 rows × 10,000 columns

In [2]:

# 10**9 consecutive integers arranged as 10**5 rows by 10**4 columns
df = pd.DataFrame(np.arange(10**9).reshape(10**5, 10**4))
df.head()

Out[2]:

	0	1	2	3	4	5	6	7	8	9	...	9990	9991	9992	9993	9994	9995	9996	9997	9998	9999
0	0	1	2	3	4	5	6	7	8	9	...	9990	9991	9992	9993	9994	9995	9996	9997	9998	9999
1	10000	10001	10002	10003	10004	10005	10006	10007	10008	10009	...	19990	19991	19992	19993	19994	19995	19996	19997	19998	19999
2	20000	20001	20002	20003	20004	20005	20006	20007	20008	20009	...	29990	29991	29992	29993	29994	29995	29996	29997	29998	29999
3	30000	30001	30002	30003	30004	30005	30006	30007	30008	30009	...	39990	39991	39992	39993	39994	39995	39996	39997	39998	39999
4	40000	40001	40002	40003	40004	40005	40006	40007	40008	40009	...	49990	49991	49992	49993	49994	49995	49996	49997	49998	49999

5 rows × 10000 columns
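A frame of this size is large, so it may help to estimate its memory footprint before building it. The sketch below is a rough calculation that assumes 64-bit integers; the integer width that `np.arange` picks by default can differ by platform and NumPy version, so treat the result as an estimate.

```python
import numpy as np

rows, cols = 10**5, 10**4

# Assumption: each element is a 64-bit integer (8 bytes).
itemsize = np.dtype(np.int64).itemsize

# Raw size of the underlying array, ignoring index and other overhead.
total_bytes = rows * cols * itemsize
print(f"{total_bytes / 10**9:.0f} GB")  # → 8 GB
```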

Performance test

In [3]:

%timeit -n1 -r1 df.apply(np.mean)

1min 6s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
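To make clear what is being timed: with the default `axis=0`, `df.apply(np.mean)` calls the function once per column and returns one value per column. The minimal sketch below shows this on a tiny illustrative frame (the name `small` and its 3 x 4 shape are assumptions chosen for readability, not part of the benchmark).

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame: 3 rows x 4 columns holding 0..11.
small = pd.DataFrame(np.arange(12).reshape(3, 4))

# apply() with the default axis=0 passes each column to np.mean.
col_means = small.apply(np.mean)
print(col_means.tolist())  # → [4.0, 5.0, 6.0, 7.0]

# Sanity check against the built-in column mean.
assert np.allclose(col_means, small.mean())
```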

Swifter is a package that efficiently applies any function to a pandas DataFrame or Series in the fastest available manner. Reference: https://github.com/jmcarpenter2/swifter

In [4]:

%timeit -n1 -r1 df.swifter.apply(np.mean)

7.65 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

In this specific case swifter is more than 8x faster (7.65 s versus 1 min 6 s). We use only one loop and one run because swifter optimizes further when the same function is run more than once and becomes even faster; repeated runs, however, are an unlikely use case.
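Outside IPython, a single-run measurement like `%timeit -n1 -r1` can be approximated with the standard-library `timeit` module. The sketch below is only illustrative: it times plain `apply` on a much smaller stand-in frame (`df_small` is an assumed name and size, chosen so the example runs quickly), then recomputes the speedup from the two timings reported above.

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a small stand-in frame, not the 10**9-element frame used above.
df_small = pd.DataFrame(np.arange(10**4).reshape(10**2, 10**2))

# number=1 and repeat=1 give a single timed call, like `%timeit -n1 -r1`.
elapsed = timeit.repeat(lambda: df_small.apply(np.mean), number=1, repeat=1)[0]
print(f"{elapsed:.4f} s")

# Speedup computed from the timings reported in this notebook.
baseline_s = 66.0  # 1 min 6 s for df.apply(np.mean)
swifter_s = 7.65   # df.swifter.apply(np.mean)
speedup = baseline_s / swifter_s
print(f"{speedup:.1f}x")  # → 8.6x
```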