First, import necessary libraries
import pandas as pd
We can issue shell commands too
!dir
Combining-Multiple-Excel-File-with-Pandas.ipynb Common-Excel-Part-2.ipynb Google-Forms-Data-Analysis.ipynb Ipython-pandas-tips-and-tricks.ipynb Learn_Pandas-Weighted_Average.ipynb Minnebar\ 11\ Demo\ -\ Master.ipynb Pandas-Pivot-Table-Explained.ipynb test.xlsx
Read in the excel file (from the github repo) and turn into a DataFrame
df = pd.read_excel("https://github.com/chris1610/pbpython/blob/master/data/sample-salesv3.xlsx?raw=True")
Let's look at the file
df.head()
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 |
df.tail()
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
1495 | 737550 | Fritsch, Russel and Anderson | S1-06532 | 12 | 55.80 | 669.60 | 2014-12-30 13:38:13 |
1496 | 786968 | Frami, Hills and Schmidt | S1-06532 | 37 | 13.14 | 486.18 | 2014-12-30 21:42:17 |
1497 | 239344 | Stokes LLC | S2-10342 | 14 | 38.75 | 542.50 | 2014-12-30 22:45:19 |
1498 | 642753 | Pollich LLC | S2-82423 | 3 | 65.97 | 197.91 | 2014-12-31 10:36:24 |
1499 | 383080 | Will LLC | S2-00301 | 38 | 46.44 | 1764.72 | 2014-12-31 12:48:35 |
df.describe()
account number | quantity | unit price | ext price | |
---|---|---|---|---|
count | 1500.000000 | 1500.000000 | 1500.000000 | 1500.000000 |
mean | 485957.841333 | 24.308667 | 55.007527 | 1345.856213 |
std | 223974.044572 | 14.439265 | 25.903267 | 1084.914881 |
min | 141962.000000 | -1.000000 | 10.030000 | -97.160000 |
25% | 257198.000000 | 12.000000 | 32.500000 | 472.177500 |
50% | 527099.000000 | 25.000000 | 55.465000 | 1050.390000 |
75% | 714466.000000 | 37.000000 | 77.075000 | 2068.330000 |
max | 786968.000000 | 49.000000 | 99.850000 | 4824.540000 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1500 entries, 0 to 1499 Data columns (total 7 columns): account number 1500 non-null int64 name 1500 non-null object sku 1500 non-null object quantity 1500 non-null int64 unit price 1500 non-null float64 ext price 1500 non-null float64 date 1500 non-null object dtypes: float64(2), int64(2), object(3) memory usage: 82.1+ KB
We can easily perform math operations on the data
df.quantity.sum()
36463
df["unit price"].sum()
82511.289999999994
Now we can show how to aggregate some data
df.groupby('name')["unit price"].sum()
name Barton LLC 4409.06 Cronin, Oberbrunner and Spencer 3336.99 Frami, Hills and Schmidt 3942.49 Fritsch, Russel and Anderson 4350.41 Halvorson, Crona and Champlin 3244.92 Herman LLC 3259.15 Jerde-Hilpert 4635.54 Kassulke, Ondricka and Metz 3266.76 Keeling LLC 4223.63 Kiehn-Spinka 4389.32 Koepp Ltd 4459.96 Kuhn-Gusikowski 4075.87 Kulas Inc 5608.19 Pollich LLC 4126.92 Purdy-Kunde 2668.07 Sanford and Sons 4142.25 Stokes LLC 3711.26 Trantow-Barrows 5280.93 White-Trantow 5040.73 Will LLC 4338.84 Name: unit price, dtype: float64
Do multiple groupbys
df.groupby(["name", "sku"])["quantity", "unit price"].mean().head()
quantity | unit price | ||
---|---|---|---|
name | sku | ||
Barton LLC | B1-04202 | 35.000000 | 69.5200 |
B1-05914 | 24.000000 | 13.5300 | |
B1-20000 | 22.000000 | 78.9225 | |
B1-33087 | 22.000000 | 96.4150 | |
B1-33364 | 27.666667 | 40.0800 |
Make sure to set the date as a date column
df["date"] = pd.to_datetime(df['date'])
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1500 entries, 0 to 1499 Data columns (total 7 columns): account number 1500 non-null int64 name 1500 non-null object sku 1500 non-null object quantity 1500 non-null int64 unit price 1500 non-null float64 ext price 1500 non-null float64 date 1500 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(2), int64(2), object(2) memory usage: 82.1+ KB
Now we have a date time object
df.head()
account number | name | sku | quantity | unit price | ext price | date | |
---|---|---|---|---|---|---|---|
0 | 740150 | Barton LLC | B1-20000 | 39 | 86.69 | 3380.91 | 2014-01-01 07:21:51 |
1 | 714466 | Trantow-Barrows | S2-77896 | -1 | 63.16 | -63.16 | 2014-01-01 10:00:47 |
2 | 218895 | Kulas Inc | B1-69924 | 23 | 90.70 | 2086.10 | 2014-01-01 13:24:58 |
3 | 307599 | Kassulke, Ondricka and Metz | S1-65481 | 41 | 21.05 | 863.05 | 2014-01-01 15:05:22 |
4 | 412290 | Jerde-Hilpert | S2-34077 | 6 | 83.21 | 499.26 | 2014-01-01 23:26:55 |
Let's see how much we sold by month by customer
df.set_index('date').groupby('name')["ext price"].resample("M").sum().head()
name date Barton LLC 2014-01-31 6177.57 2014-02-28 12218.03 2014-03-31 3513.53 2014-04-30 11474.20 2014-05-31 10220.17 Name: ext price, dtype: float64
What about every 3 Months?
df.set_index('date').groupby('name')["ext price"].resample("3M").sum().head()
name date Barton LLC 2014-01-31 6177.57 2014-04-30 27205.76 2014-07-31 27434.38 2014-10-31 40946.75 2015-01-31 7674.04 Name: ext price, dtype: float64
What about quarterly ending in Dec?
df.set_index('date').groupby('name')["ext price"].resample('Q-DEC').sum().head()
name date Barton LLC 2014-03-31 21909.13 2014-06-30 32158.10 2014-09-30 38345.55 2014-12-31 17025.72 Cronin, Oberbrunner and Spencer 2014-03-31 26809.63 Name: ext price, dtype: float64
Annual Number
df.set_index('date').groupby('name')["ext price"].resample('A-DEC').sum()
name date Barton LLC 2014-12-31 109438.50 Cronin, Oberbrunner and Spencer 2014-12-31 89734.55 Frami, Hills and Schmidt 2014-12-31 103569.59 Fritsch, Russel and Anderson 2014-12-31 112214.71 Halvorson, Crona and Champlin 2014-12-31 70004.36 Herman LLC 2014-12-31 82865.00 Jerde-Hilpert 2014-12-31 112591.43 Kassulke, Ondricka and Metz 2014-12-31 86451.07 Keeling LLC 2014-12-31 100934.30 Kiehn-Spinka 2014-12-31 99608.77 Koepp Ltd 2014-12-31 103660.54 Kuhn-Gusikowski 2014-12-31 91094.28 Kulas Inc 2014-12-31 137351.96 Pollich LLC 2014-12-31 87347.18 Purdy-Kunde 2014-12-31 77898.21 Sanford and Sons 2014-12-31 98822.98 Stokes LLC 2014-12-31 91535.92 Trantow-Barrows 2014-12-31 123381.38 White-Trantow 2014-12-31 135841.99 Will LLC 2014-12-31 104437.60 Name: ext price, dtype: float64
DataFrame vs. Series
df.set_index('date').groupby('name')["ext price",].resample('A-DEC').sum()
ext price | ||
---|---|---|
name | date | |
Barton LLC | 2014-12-31 | 109438.50 |
Cronin, Oberbrunner and Spencer | 2014-12-31 | 89734.55 |
Frami, Hills and Schmidt | 2014-12-31 | 103569.59 |
Fritsch, Russel and Anderson | 2014-12-31 | 112214.71 |
Halvorson, Crona and Champlin | 2014-12-31 | 70004.36 |
Herman LLC | 2014-12-31 | 82865.00 |
Jerde-Hilpert | 2014-12-31 | 112591.43 |
Kassulke, Ondricka and Metz | 2014-12-31 | 86451.07 |
Keeling LLC | 2014-12-31 | 100934.30 |
Kiehn-Spinka | 2014-12-31 | 99608.77 |
Koepp Ltd | 2014-12-31 | 103660.54 |
Kuhn-Gusikowski | 2014-12-31 | 91094.28 |
Kulas Inc | 2014-12-31 | 137351.96 |
Pollich LLC | 2014-12-31 | 87347.18 |
Purdy-Kunde | 2014-12-31 | 77898.21 |
Sanford and Sons | 2014-12-31 | 98822.98 |
Stokes LLC | 2014-12-31 | 91535.92 |
Trantow-Barrows | 2014-12-31 | 123381.38 |
White-Trantow | 2014-12-31 | 135841.99 |
Will LLC | 2014-12-31 | 104437.60 |
df.set_index('date').groupby('name')["ext price",].resample('A-DEC').sum().to_excel("test.xlsx")