import pandas as pd
import numpy as np
import swifter  # importing registers the .swifter accessor on pandas objects
Shape: (100,000 x 10,000)
df = pd.DataFrame(np.arange(10**9).reshape(10**5, 10**4))
df.head()
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 9990 | 9991 | 9992 | 9993 | 9994 | 9995 | 9996 | 9997 | 9998 | 9999 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 9990 | 9991 | 9992 | 9993 | 9994 | 9995 | 9996 | 9997 | 9998 | 9999 |
1 | 10000 | 10001 | 10002 | 10003 | 10004 | 10005 | 10006 | 10007 | 10008 | 10009 | ... | 19990 | 19991 | 19992 | 19993 | 19994 | 19995 | 19996 | 19997 | 19998 | 19999 |
2 | 20000 | 20001 | 20002 | 20003 | 20004 | 20005 | 20006 | 20007 | 20008 | 20009 | ... | 29990 | 29991 | 29992 | 29993 | 29994 | 29995 | 29996 | 29997 | 29998 | 29999 |
3 | 30000 | 30001 | 30002 | 30003 | 30004 | 30005 | 30006 | 30007 | 30008 | 30009 | ... | 39990 | 39991 | 39992 | 39993 | 39994 | 39995 | 39996 | 39997 | 39998 | 39999 |
4 | 40000 | 40001 | 40002 | 40003 | 40004 | 40005 | 40006 | 40007 | 40008 | 40009 | ... | 49990 | 49991 | 49992 | 49993 | 49994 | 49995 | 49996 | 49997 | 49998 | 49999 |
5 rows × 10000 columns
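A frame of this shape holds 10**9 integers, so it can be worth estimating the memory footprint before building it. A rough sketch, assuming the elements are 64-bit integers (NumPy's default on most Linux and macOS installs; other platforms may differ) and using the illustrative names `n_rows`, `n_cols` and `total_bytes`:

```python
import numpy as np

n_rows, n_cols = 10**5, 10**4            # same shape as the frame above
itemsize = np.dtype(np.int64).itemsize   # assumption: int64 elements, 8 bytes each
total_bytes = n_rows * n_cols * itemsize

# Raw array data only; the index and any temporary copies come on top.
print(f"{total_bytes / 1e9:.1f} GB")     # prints "8.0 GB"
```

On a machine with less memory, a smaller shape shows the same behaviour at a smaller scale.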
%timeit -n1 -r1 df.apply(np.mean)
1min 6s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
Swifter is a package that efficiently applies any function to a pandas DataFrame or Series in the fastest available manner. Reference: https://github.com/jmcarpenter2/swifter
%timeit -n1 -r1 df.swifter.apply(np.mean)
7.65 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
For this specific case, swifter gives a speedup of over 8x (66 s down to 7.65 s). We use only 1 loop and 1 run because swifter further optimizes repeated calls of the same function and becomes even faster, which is an unlikely use case in practice.
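To repeat the single-run measurement outside IPython, or on a machine that cannot hold the full frame, a scaled-down sketch along these lines may help. It uses plain pandas only. The helper `time_once` and the smaller shape are assumptions for illustration, and the swifter call is left commented out because it requires swifter to be installed.

```python
import time

import numpy as np
import pandas as pd


def time_once(func):
    """Run func a single time and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Much smaller than the frame above, so it runs quickly on most machines.
small = pd.DataFrame(np.arange(10**6).reshape(10**3, 10**3))

# Column-wise mean via apply, timed exactly once (like %timeit -n1 -r1).
apply_result, apply_secs = time_once(lambda: small.apply(np.mean))

# With swifter installed and imported, the equivalent call would be:
# swifter_result, swifter_secs = time_once(lambda: small.swifter.apply(np.mean))

print(f"pandas apply: {apply_secs:.3f} s")
```

Timings on a frame this small will not match the ratio reported above. The point is only to keep the measurement to one run, so that any warm-up effects do not flatter the second call.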