Escaping Excel Hell with Python and Pandas¶

Minnebar 11, April 23 2016¶

First, import necessary libraries

In [1]:

import pandas as pd

We can issue shell commands too

In [2]:

!dir

Combining-Multiple-Excel-File-with-Pandas.ipynb
Common-Excel-Part-2.ipynb
Google-Forms-Data-Analysis.ipynb
Ipython-pandas-tips-and-tricks.ipynb
Learn_Pandas-Weighted_Average.ipynb
Minnebar\ 11\ Demo\ -\ Master.ipynb
Pandas-Pivot-Table-Explained.ipynb
test.xlsx

Read in the excel file (from the github repo) and turn into a DataFrame

In [3]:

df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")

Let's look at the file

In [4]:

df.head()

Out[4]:

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

In [5]:

df.tail()

Out[5]:

	account number	name	sku	quantity	unit price	ext price	date
1495	737550	Fritsch, Russel and Anderson	S1-06532	12	55.80	669.60	2014-12-30 13:38:13
1496	786968	Frami, Hills and Schmidt	S1-06532	37	13.14	486.18	2014-12-30 21:42:17
1497	239344	Stokes LLC	S2-10342	14	38.75	542.50	2014-12-30 22:45:19
1498	642753	Pollich LLC	S2-82423	3	65.97	197.91	2014-12-31 10:36:24
1499	383080	Will LLC	S2-00301	38	46.44	1764.72	2014-12-31 12:48:35

In [6]:

df.describe()

Out[6]:

	account number	quantity	unit price	ext price
count	1500.000000	1500.000000	1500.000000	1500.000000
mean	485957.841333	24.308667	55.007527	1345.856213
std	223974.044572	14.439265	25.903267	1084.914881
min	141962.000000	-1.000000	10.030000	-97.160000
25%	257198.000000	12.000000	32.500000	472.177500
50%	527099.000000	25.000000	55.465000	1050.390000
75%	714466.000000	37.000000	77.075000	2068.330000
max	786968.000000	49.000000	99.850000	4824.540000

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
account number    1500 non-null int64
name              1500 non-null object
sku               1500 non-null object
quantity          1500 non-null int64
unit price        1500 non-null float64
ext price         1500 non-null float64
date              1500 non-null object
dtypes: float64(2), int64(2), object(3)
memory usage: 82.1+ KB

We can easily perform math operations on the data

In [8]:

df.quantity.sum()

Out[8]:

In [9]:

df["unit price"].sum()

Out[9]:

82511.289999999994

Now we can show how to aggregate some data

In [10]:

df.groupby('name')["unit price"].sum()

Out[10]:

name
Barton LLC                         4409.06
Cronin, Oberbrunner and Spencer    3336.99
Frami, Hills and Schmidt           3942.49
Fritsch, Russel and Anderson       4350.41
Halvorson, Crona and Champlin      3244.92
Herman LLC                         3259.15
Jerde-Hilpert                      4635.54
Kassulke, Ondricka and Metz        3266.76
Keeling LLC                        4223.63
Kiehn-Spinka                       4389.32
Koepp Ltd                          4459.96
Kuhn-Gusikowski                    4075.87
Kulas Inc                          5608.19
Pollich LLC                        4126.92
Purdy-Kunde                        2668.07
Sanford and Sons                   4142.25
Stokes LLC                         3711.26
Trantow-Barrows                    5280.93
White-Trantow                      5040.73
Will LLC                           4338.84
Name: unit price, dtype: float64

Do multiple groupbys

In [11]:

df.groupby(["name", "sku"])["quantity", "unit price"].mean().head()

Out[11]:

		quantity	unit price
name	sku
Barton LLC	B1-04202	35.000000	69.5200
	B1-05914	24.000000	13.5300
	B1-20000	22.000000	78.9225
	B1-33087	22.000000	96.4150
	B1-33364	27.666667	40.0800

Time series analysis¶

Make sure to set the date as a date column

In [12]:

df["date"] = pd.to_datetime(df['date'])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1500 entries, 0 to 1499
Data columns (total 7 columns):
account number    1500 non-null int64
name              1500 non-null object
sku               1500 non-null object
quantity          1500 non-null int64
unit price        1500 non-null float64
ext price         1500 non-null float64
date              1500 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(2), int64(2), object(2)
memory usage: 82.1+ KB

Now we have a date time object

In [13]:

df.head()

Out[13]:

	account number	name	sku	quantity	unit price	ext price	date
0	740150	Barton LLC	B1-20000	39	86.69	3380.91	2014-01-01 07:21:51
1	714466	Trantow-Barrows	S2-77896	-1	63.16	-63.16	2014-01-01 10:00:47
2	218895	Kulas Inc	B1-69924	23	90.70	2086.10	2014-01-01 13:24:58
3	307599	Kassulke, Ondricka and Metz	S1-65481	41	21.05	863.05	2014-01-01 15:05:22
4	412290	Jerde-Hilpert	S2-34077	6	83.21	499.26	2014-01-01 23:26:55

Let's see how much we sold by month by customer

In [14]:

df.set_index('date').groupby('name')["ext price"].resample("M").sum().head()

Out[14]:

name        date      
Barton LLC  2014-01-31     6177.57
            2014-02-28    12218.03
            2014-03-31     3513.53
            2014-04-30    11474.20
            2014-05-31    10220.17
Name: ext price, dtype: float64

What about every 3 Months?

In [15]:

df.set_index('date').groupby('name')["ext price"].resample("3M").sum().head()

Out[15]:

name        date      
Barton LLC  2014-01-31     6177.57
            2014-04-30    27205.76
            2014-07-31    27434.38
            2014-10-31    40946.75
            2015-01-31     7674.04
Name: ext price, dtype: float64

What about quarterly ending in Dec?

In [16]:

df.set_index('date').groupby('name')["ext price"].resample('Q-DEC').sum().head()

Out[16]:

name                             date      
Barton LLC                       2014-03-31    21909.13
                                 2014-06-30    32158.10
                                 2014-09-30    38345.55
                                 2014-12-31    17025.72
Cronin, Oberbrunner and Spencer  2014-03-31    26809.63
Name: ext price, dtype: float64

Annual Number

In [17]:

df.set_index('date').groupby('name')["ext price"].resample('A-DEC').sum()

Out[17]:

name                             date      
Barton LLC                       2014-12-31    109438.50
Cronin, Oberbrunner and Spencer  2014-12-31     89734.55
Frami, Hills and Schmidt         2014-12-31    103569.59
Fritsch, Russel and Anderson     2014-12-31    112214.71
Halvorson, Crona and Champlin    2014-12-31     70004.36
Herman LLC                       2014-12-31     82865.00
Jerde-Hilpert                    2014-12-31    112591.43
Kassulke, Ondricka and Metz      2014-12-31     86451.07
Keeling LLC                      2014-12-31    100934.30
Kiehn-Spinka                     2014-12-31     99608.77
Koepp Ltd                        2014-12-31    103660.54
Kuhn-Gusikowski                  2014-12-31     91094.28
Kulas Inc                        2014-12-31    137351.96
Pollich LLC                      2014-12-31     87347.18
Purdy-Kunde                      2014-12-31     77898.21
Sanford and Sons                 2014-12-31     98822.98
Stokes LLC                       2014-12-31     91535.92
Trantow-Barrows                  2014-12-31    123381.38
White-Trantow                    2014-12-31    135841.99
Will LLC                         2014-12-31    104437.60
Name: ext price, dtype: float64

DataFrame vs. Series

In [18]:

df.set_index('date').groupby('name')["ext price",].resample('A-DEC').sum()

Out[18]:

		ext price
name	date
Barton LLC	2014-12-31	109438.50
Cronin, Oberbrunner and Spencer	2014-12-31	89734.55
Frami, Hills and Schmidt	2014-12-31	103569.59
Fritsch, Russel and Anderson	2014-12-31	112214.71
Halvorson, Crona and Champlin	2014-12-31	70004.36
Herman LLC	2014-12-31	82865.00
Jerde-Hilpert	2014-12-31	112591.43
Kassulke, Ondricka and Metz	2014-12-31	86451.07
Keeling LLC	2014-12-31	100934.30
Kiehn-Spinka	2014-12-31	99608.77
Koepp Ltd	2014-12-31	103660.54
Kuhn-Gusikowski	2014-12-31	91094.28
Kulas Inc	2014-12-31	137351.96
Pollich LLC	2014-12-31	87347.18
Purdy-Kunde	2014-12-31	77898.21
Sanford and Sons	2014-12-31	98822.98
Stokes LLC	2014-12-31	91535.92
Trantow-Barrows	2014-12-31	123381.38
White-Trantow	2014-12-31	135841.99
Will LLC	2014-12-31	104437.60

In [19]:

df.set_index('date').groupby('name')["ext price",].resample('A-DEC').sum().to_excel("test.xlsx")

In [ ]: