Escaping Excel Hell with Python and Pandas

Minnebar 11, April 23 2016

First, import necessary libraries

In [1]:
import pandas as pd

We can issue shell commands too

In [2]:
!dir
Combining-Multiple-Excel-File-with-Pandas.ipynb
Common-Excel-Part-2.ipynb
Google-Forms-Data-Analysis.ipynb
Ipython-pandas-tips-and-tricks.ipynb
Learn_Pandas-Weighted_Average.ipynb
Minnebar\ 11\ Demo\ -\ Master.ipynb
Pandas-Pivot-Table-Explained.ipynb
test.xlsx

Read in the excel file (from the github repo) and turn into a DataFrame

In [3]:
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")

Let's look at the file

In [4]:
df.head()
Out[4]:
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55
In [5]:
df.tail()
Out[5]:
account number name sku quantity unit price ext price date
1495 737550 Fritsch, Russel and Anderson S1-06532 12 55.80 669.60 2014-12-30 13:38:13
1496 786968 Frami, Hills and Schmidt S1-06532 37 13.14 486.18 2014-12-30 21:42:17
1497 239344 Stokes LLC S2-10342 14 38.75 542.50 2014-12-30 22:45:19
1498 642753 Pollich LLC S2-82423 3 65.97 197.91 2014-12-31 10:36:24
1499 383080 Will LLC S2-00301 38 46.44 1764.72 2014-12-31 12:48:35
In [6]:
df.describe()
Out[6]:
account number quantity unit price ext price
count 1500.000000 1500.000000 1500.000000 1500.000000
mean 485957.841333 24.308667 55.007527 1345.856213
std 223974.044572 14.439265 25.903267 1084.914881
min 141962.000000 -1.000000 10.030000 -97.160000
25% 257198.000000 12.000000 32.500000 472.177500
50% 527099.000000 25.000000 55.465000 1050.390000
75% 714466.000000 37.000000 77.075000 2068.330000
max 786968.000000 49.000000 99.850000 4824.540000
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
account number    1500 non-null int64
name              1500 non-null object
sku               1500 non-null object
quantity          1500 non-null int64
unit price        1500 non-null float64
ext price         1500 non-null float64
date              1500 non-null object
dtypes: float64(2), int64(2), object(3)
memory usage: 82.1+ KB

We can easily perform math operations on the data

In [8]:
df.quantity.sum()
Out[8]:
36463
In [9]:
df["unit price"].sum()
Out[9]:
82511.289999999994

Now we can show how to aggregate some data

In [10]:
df.groupby('name')["unit price"].sum()
Out[10]:
name
Barton LLC                         4409.06
Cronin, Oberbrunner and Spencer    3336.99
Frami, Hills and Schmidt           3942.49
Fritsch, Russel and Anderson       4350.41
Halvorson, Crona and Champlin      3244.92
Herman LLC                         3259.15
Jerde-Hilpert                      4635.54
Kassulke, Ondricka and Metz        3266.76
Keeling LLC                        4223.63
Kiehn-Spinka                       4389.32
Koepp Ltd                          4459.96
Kuhn-Gusikowski                    4075.87
Kulas Inc                          5608.19
Pollich LLC                        4126.92
Purdy-Kunde                        2668.07
Sanford and Sons                   4142.25
Stokes LLC                         3711.26
Trantow-Barrows                    5280.93
White-Trantow                      5040.73
Will LLC                           4338.84
Name: unit price, dtype: float64

Do multiple groupbys

In [11]:
df.groupby(["name", "sku"])["quantity", "unit price"].mean().head()
Out[11]:
quantity unit price
name sku
Barton LLC B1-04202 35.000000 69.5200
B1-05914 24.000000 13.5300
B1-20000 22.000000 78.9225
B1-33087 22.000000 96.4150
B1-33364 27.666667 40.0800

Time series analysis

Make sure to set the date as a date column

In [12]:
df["date"] = pd.to_datetime(df['date'])
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
account number    1500 non-null int64
name              1500 non-null object
sku               1500 non-null object
quantity          1500 non-null int64
unit price        1500 non-null float64
ext price         1500 non-null float64
date              1500 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(2), object(2)
memory usage: 82.1+ KB

Now we have a date time object

In [13]:
df.head()
Out[13]:
account number name sku quantity unit price ext price date
0 740150 Barton LLC B1-20000 39 86.69 3380.91 2014-01-01 07:21:51
1 714466 Trantow-Barrows S2-77896 -1 63.16 -63.16 2014-01-01 10:00:47
2 218895 Kulas Inc B1-69924 23 90.70 2086.10 2014-01-01 13:24:58
3 307599 Kassulke, Ondricka and Metz S1-65481 41 21.05 863.05 2014-01-01 15:05:22
4 412290 Jerde-Hilpert S2-34077 6 83.21 499.26 2014-01-01 23:26:55

Let's see how much we sold by month by customer

In [14]:
df.set_index('date').groupby('name')["ext price"].resample("M").sum().head()
Out[14]:
name        date      
Barton LLC  2014-01-31     6177.57
            2014-02-28    12218.03
            2014-03-31     3513.53
            2014-04-30    11474.20
            2014-05-31    10220.17
Name: ext price, dtype: float64

What about every 3 Months?

In [15]:
df.set_index('date').groupby('name')["ext price"].resample("3M").sum().head()
Out[15]:
name        date      
Barton LLC  2014-01-31     6177.57
            2014-04-30    27205.76
            2014-07-31    27434.38
            2014-10-31    40946.75
            2015-01-31     7674.04
Name: ext price, dtype: float64

What about quarterly ending in Dec?

In [16]:
df.set_index('date').groupby('name')["ext price"].resample('Q-DEC').sum().head()
Out[16]:
name                             date      
Barton LLC                       2014-03-31    21909.13
                                 2014-06-30    32158.10
                                 2014-09-30    38345.55
                                 2014-12-31    17025.72
Cronin, Oberbrunner and Spencer  2014-03-31    26809.63
Name: ext price, dtype: float64

Annual Number

In [17]:
df.set_index('date').groupby('name')["ext price"].resample('A-DEC').sum()
Out[17]:
name                             date      
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64

DataFrame vs. Series

In [18]:
df.set_index('date').groupby('name')["ext price",].resample('A-DEC').sum()
Out[18]:
ext price
name date
Barton LLC 2014-12-31 109438.50
Cronin, Oberbrunner and Spencer 2014-12-31 89734.55
Frami, Hills and Schmidt 2014-12-31 103569.59
Fritsch, Russel and Anderson 2014-12-31 112214.71
Halvorson, Crona and Champlin 2014-12-31 70004.36
Herman LLC 2014-12-31 82865.00
Jerde-Hilpert 2014-12-31 112591.43
Kassulke, Ondricka and Metz 2014-12-31 86451.07
Keeling LLC 2014-12-31 100934.30
Kiehn-Spinka 2014-12-31 99608.77
Koepp Ltd 2014-12-31 103660.54
Kuhn-Gusikowski 2014-12-31 91094.28
Kulas Inc 2014-12-31 137351.96
Pollich LLC 2014-12-31 87347.18
Purdy-Kunde 2014-12-31 77898.21
Sanford and Sons 2014-12-31 98822.98
Stokes LLC 2014-12-31 91535.92
Trantow-Barrows 2014-12-31 123381.38
White-Trantow 2014-12-31 135841.99
Will LLC 2014-12-31 104437.60
In [19]:
df.set_index('date').groupby('name')["ext price",].resample('A-DEC').sum().to_excel("test.xlsx")
In [ ]: