Pandas

Modified by P. Zambelli (pietro.zambelli ing unitn)

Original work by Damiana Vila

In [1]:
import pandas as pd
import numpy as np

Series

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Out[2]:
0     1
1     3
2     5
3   NaN
4     6
5     8
dtype: float64

Data frames

In Python we can create a list of integers with range and an array with np.arange; with pandas we can create a range of dates with pd.date_range:

In [3]:
dates = pd.date_range('20130101',periods=6)
dates
Out[3]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-06 00:00:00]
Length: 6, Freq: D, Timezone: None
In [4]:
df = pd.DataFrame(np.random.randn(6, 4), # fill the dataframe with values
                  index=dates,           # define an index for the rows
                  columns=list('ABCD'))  # define the column names
In [5]:
df
Out[5]:
A B C D
2013-01-01 -0.183522 0.287747 0.437099 0.802198
2013-01-02 -0.556901 1.451457 -1.171944 0.273472
2013-01-03 0.949617 1.105997 0.865616 0.388877
2013-01-04 1.118095 0.387274 0.001193 1.642457
2013-01-05 0.528890 0.490560 1.294505 1.366227
2013-01-06 -0.589232 -1.531341 0.107699 0.885957

6 rows × 4 columns

Pandas is quite flexible and accepts many types of input; let's create another DataFrame using a dictionary.

In [6]:
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=range(4),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : 'foo' })
In [7]:
df2
Out[7]:
A B C D E
0 1 2013-01-02 00:00:00 1 3 foo
1 1 2013-01-02 00:00:00 1 3 foo
2 1 2013-01-02 00:00:00 1 3 foo
3 1 2013-01-02 00:00:00 1 3 foo

4 rows × 5 columns

A pandas DataFrame is quite close to a numpy.recarray, and it is possible to mix different types. The last DataFrame has:

In [8]:
df2.dtypes
Out[8]:
A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

Pandas includes several methods that are handy for working with data.

In [9]:
df.head(3)
Out[9]:
A B C D
2013-01-01 -0.183522 0.287747 0.437099 0.802198
2013-01-02 -0.556901 1.451457 -1.171944 0.273472
2013-01-03 0.949617 1.105997 0.865616 0.388877

3 rows × 4 columns

In [10]:
df.tail(3)
Out[10]:
A B C D
2013-01-04 1.118095 0.387274 0.001193 1.642457
2013-01-05 0.528890 0.490560 1.294505 1.366227
2013-01-06 -0.589232 -1.531341 0.107699 0.885957

3 rows × 4 columns

DataFrames support slices

In [11]:
df[2:4]
Out[11]:
A B C D
2013-01-03 0.949617 1.105997 0.865616 0.388877
2013-01-04 1.118095 0.387274 0.001193 1.642457

2 rows × 4 columns

Indexing and Selecting Data

The axis labeling information in pandas objects serves many purposes:

  • Identifies data (i.e. provides metadata) using known indicators, important for analysis, visualization, and interactive console display
  • Enables automatic and explicit data alignment
  • Allows intuitive getting and setting of subsets of the data set
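The second point, automatic data alignment, means that arithmetic between pandas objects matches values by label rather than by position. A minimal sketch, not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Two Series with partially overlapping indexes
a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0], index=['y', 'z'])

# Addition aligns on labels: 'y' and 'z' are summed,
# while 'x' has no counterpart in b and becomes NaN
c = a + b
print(c)
```

The same alignment happens for DataFrames, on both rows and columns.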

Index

In [12]:
df.index
Out[12]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-06 00:00:00]
Length: 6, Freq: D, Timezone: None
In [13]:
df.index[:2]
Out[13]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, 2013-01-02 00:00:00]
Length: 2, Freq: D, Timezone: None
In [14]:
df[0:3]
Out[14]:
A B C D
2013-01-01 -0.183522 0.287747 0.437099 0.802198
2013-01-02 -0.556901 1.451457 -1.171944 0.273472
2013-01-03 0.949617 1.105997 0.865616 0.388877

3 rows × 4 columns

In [15]:
df['20130102':'20130104']
Out[15]:
A B C D
2013-01-02 -0.556901 1.451457 -1.171944 0.273472
2013-01-03 0.949617 1.105997 0.865616 0.388877
2013-01-04 1.118095 0.387274 0.001193 1.642457

3 rows × 4 columns

Columns

In [16]:
df.columns
Out[16]:
Index(['A', 'B', 'C', 'D'], dtype='object')
In [17]:
df[df.columns[1:3]]
Out[17]:
B C
2013-01-01 0.287747 0.437099
2013-01-02 1.451457 -1.171944
2013-01-03 1.105997 0.865616
2013-01-04 0.387274 0.001193
2013-01-05 0.490560 1.294505
2013-01-06 -1.531341 0.107699

6 rows × 2 columns

In [18]:
df['A']
Out[18]:
2013-01-01   -0.183522
2013-01-02   -0.556901
2013-01-03    0.949617
2013-01-04    1.118095
2013-01-05    0.528890
2013-01-06   -0.589232
Freq: D, Name: A, dtype: float64
In [19]:
df.A
Out[19]:
2013-01-01   -0.183522
2013-01-02   -0.556901
2013-01-03    0.949617
2013-01-04    1.118095
2013-01-05    0.528890
2013-01-06   -0.589232
Freq: D, Name: A, dtype: float64

Different Choices for Indexing (loc, iloc, and ix)

Object selection has had a number of user-requested additions in order to support more explicit location based indexing. Pandas now supports three types of multi-axis indexing.

.loc is strictly label based and will raise a KeyError when the items are not found. Allowed inputs are:

  • A single label, e.g. '2013-01-03'
  • A list or array of labels
  • A slice object with labels, e.g. '2013-01-03':'2013-01-05' (note that both endpoints are included)

In [20]:
df.loc['2013-01-03':'2013-01-05']
Out[20]:
A B C D
2013-01-03 0.949617 1.105997 0.865616 0.388877
2013-01-04 1.118095 0.387274 0.001193 1.642457
2013-01-05 0.528890 0.490560 1.294505 1.366227

3 rows × 4 columns

In [21]:
df.loc['2013-01-03':'2013-01-05', 'C':]
Out[21]:
C D
2013-01-03 0.865616 0.388877
2013-01-04 0.001193 1.642457
2013-01-05 1.294505 1.366227

3 rows × 2 columns

In [17]:
df.loc['2013-01-03']
Out[17]:
A   -0.156294
B   -1.060698
C    0.037094
D    0.781460
Name: 2013-01-03 00:00:00, dtype: float64

All these selections behave like numpy arrays, so you can apply the same operations:

In [18]:
df.loc['2013-01-03':'2013-01-05', 'C':] > 1
Out[18]:
C D
2013-01-03 False False
2013-01-04 False False
2013-01-05 False False

3 rows × 2 columns

In [25]:
df.loc['2013-01-03'] > 1
Out[25]:
A    False
B     True
C    False
D    False
Name: 2013-01-03 00:00:00, dtype: bool
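A boolean Series like the one above can also be used to filter rows directly, in the same spirit as numpy boolean masks. A small sketch, assuming a df built as at the top of the notebook:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

# keep only the rows where column A is positive
positive_a = df[df['A'] > 0]
print(positive_a)
```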

We can assign values with:

In [26]:
df.loc['2013-01-03', 'D'] = np.nan
In [27]:
df.loc['2013-01-05', 'B'] = np.nan
In [28]:
df
Out[28]:
A B C D
2013-01-01 -0.183522 0.287747 0.437099 0.802198
2013-01-02 -0.556901 1.451457 -1.171944 0.273472
2013-01-03 0.949617 1.105997 0.865616 NaN
2013-01-04 1.118095 0.387274 0.001193 1.642457
2013-01-05 0.528890 NaN 1.294505 1.366227
2013-01-06 -0.589232 -1.531341 0.107699 0.885957

6 rows × 4 columns

Selection by Position

.iloc is strictly integer-position based (from 0 to length-1 of the axis) and will raise an IndexError when the requested indices are out of bounds. Allowed inputs are:

  • An integer e.g. 5
  • A list or array of integers [4, 3, 0]
  • A slice object with ints 1:7
In [31]:
df.iloc[2:4]
Out[31]:
A B C D
2013-01-03 0.949617 1.105997 0.865616 NaN
2013-01-04 1.118095 0.387274 0.001193 1.642457

2 rows × 4 columns

In [32]:
df.iloc[2:4, 2:]
Out[32]:
C D
2013-01-03 0.865616 NaN
2013-01-04 0.001193 1.642457

2 rows × 2 columns

In [33]:
df.iloc[[1, 4, -1], [2, 3]]
Out[33]:
C D
2013-01-02 -1.171944 0.273472
2013-01-05 1.294505 1.366227
2013-01-06 0.107699 0.885957

3 rows × 2 columns

In [34]:
df.iloc[1]
Out[34]:
A   -0.556901
B    1.451457
C   -1.171944
D    0.273472
Name: 2013-01-02 00:00:00, dtype: float64
In [35]:
df.iloc[1]['A']
Out[35]:
-0.55690086547882145
In [36]:
df.iloc[1, 0]
Out[36]:
-0.55690086547882145

Fast scalar value getting and setting

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.

Similarly to loc, at provides label-based scalar lookups, while iat provides integer-based lookups, analogously to iloc.

In [37]:
df.at[dates[1], 'A']
Out[37]:
-0.55690086547882145
In [38]:
df.iat[1, 0]
Out[38]:
-0.55690086547882145

Performance

In [87]:
%timeit df[0:3]
1000 loops, best of 3: 214 µs per loop
In [93]:
%timeit df['20130101':'20130104']
1000 loops, best of 3: 964 µs per loop
In [94]:
%timeit df.loc['20130101':'20130104']
1000 loops, best of 3: 1.42 ms per loop
In [90]:
%timeit df.iloc[:3]
1000 loops, best of 3: 204 µs per loop

Working with scalars

In [99]:
%timeit df.loc['20130101', 'A']
1000 loops, best of 3: 656 µs per loop
In [96]:
%timeit df.iloc[0, 0]
10000 loops, best of 3: 170 µs per loop
In [97]:
%timeit df.at[dates[0], 'A']
100000 loops, best of 3: 15.1 µs per loop
In [98]:
%timeit df.iat[0, 0]
10000 loops, best of 3: 26.5 µs per loop

Other properties and attributes of pandas objects

We can access the underlying values as an array with:

In [40]:
df.values
Out[40]:
array([[ -1.83522041e-01,   2.87747307e-01,   4.37098680e-01,
          8.02198149e-01],
       [ -5.56900865e-01,   1.45145695e+00,  -1.17194358e+00,
          2.73471585e-01],
       [  9.49616644e-01,   1.10599748e+00,   8.65615804e-01,
                     nan],
       [  1.11809475e+00,   3.87273699e-01,   1.19317480e-03,
          1.64245712e+00],
       [  5.28890024e-01,              nan,   1.29450518e+00,
          1.36622654e+00],
       [ -5.89232343e-01,  -1.53134128e+00,   1.07699160e-01,
          8.85956987e-01]])

As with a numpy array, we can perform simple tasks, like:

In [41]:
df.T
Out[41]:
2013-01-01 00:00:00 2013-01-02 00:00:00 2013-01-03 00:00:00 2013-01-04 00:00:00 2013-01-05 00:00:00 2013-01-06 00:00:00
A -0.183522 -0.556901 0.949617 1.118095 0.528890 -0.589232
B 0.287747 1.451457 1.105997 0.387274 NaN -1.531341
C 0.437099 -1.171944 0.865616 0.001193 1.294505 0.107699
D 0.802198 0.273472 NaN 1.642457 1.366227 0.885957

4 rows × 6 columns

Pandas objects are able to handle NaN values.

In [42]:
df.mean()
Out[42]:
A    0.211158
B    0.340227
C    0.255695
D    0.994062
dtype: float64
In [43]:
df.mean(1)
Out[43]:
2013-01-01    0.335881
2013-01-02   -0.000979
2013-01-03    0.973743
2013-01-04    0.787255
2013-01-05    1.063207
2013-01-06   -0.281729
Freq: D, dtype: float64
In [44]:
df.apply(np.cumsum)
Out[44]:
A B C D
2013-01-01 -0.183522 0.287747 0.437099 0.802198
2013-01-02 -0.740423 1.739204 -0.734845 1.075670
2013-01-03 0.209194 2.845202 0.130771 NaN
2013-01-04 1.327288 3.232475 0.131964 2.718127
2013-01-05 1.856179 NaN 1.426469 4.084353
2013-01-06 1.266946 1.701134 1.534168 4.970310

6 rows × 4 columns
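Besides skipping NaN in reductions like mean, pandas can drop or replace missing values explicitly with dropna and fillna. A short sketch with made-up data, not from the notebook:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                   'B': [4.0, 5.0, np.nan]})

dropped = df.dropna()     # keep only the rows without any NaN
filled = df.fillna(0.0)   # replace every NaN with a constant
print(dropped)
print(filled)
```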

Sorting

In [45]:
df.sort_index(axis=0, ascending=False)
Out[45]:
A B C D
2013-01-06 -0.589232 -1.531341 0.107699 0.885957
2013-01-05 0.528890 NaN 1.294505 1.366227
2013-01-04 1.118095 0.387274 0.001193 1.642457
2013-01-03 0.949617 1.105997 0.865616 NaN
2013-01-02 -0.556901 1.451457 -1.171944 0.273472
2013-01-01 -0.183522 0.287747 0.437099 0.802198

6 rows × 4 columns

In [46]:
df.sort_index(axis=1, ascending=False)
Out[46]:
D C B A
2013-01-01 0.802198 0.437099 0.287747 -0.183522
2013-01-02 0.273472 -1.171944 1.451457 -0.556901
2013-01-03 NaN 0.865616 1.105997 0.949617
2013-01-04 1.642457 0.001193 0.387274 1.118095
2013-01-05 1.366227 1.294505 NaN 0.528890
2013-01-06 0.885957 0.107699 -1.531341 -0.589232

6 rows × 4 columns

In [47]:
df.sort(columns='B')
Out[47]:
A B C D
2013-01-06 -0.589232 -1.531341 0.107699 0.885957
2013-01-01 -0.183522 0.287747 0.437099 0.802198
2013-01-04 1.118095 0.387274 0.001193 1.642457
2013-01-03 0.949617 1.105997 0.865616 NaN
2013-01-02 -0.556901 1.451457 -1.171944 0.273472
2013-01-05 0.528890 NaN 1.294505 1.366227

6 rows × 4 columns

Unlike numpy arrays, pandas objects can be enlarged/extended:

In [48]:
df['E'] = np.random.rand(6)
In [49]:
df
Out[49]:
A B C D E
2013-01-01 -0.183522 0.287747 0.437099 0.802198 0.795572
2013-01-02 -0.556901 1.451457 -1.171944 0.273472 0.714172
2013-01-03 0.949617 1.105997 0.865616 NaN 0.676403
2013-01-04 1.118095 0.387274 0.001193 1.642457 0.859019
2013-01-05 0.528890 NaN 1.294505 1.366227 0.358974
2013-01-06 -0.589232 -1.531341 0.107699 0.885957 0.002553

6 rows × 5 columns

In [51]:
df.append(pd.DataFrame({'A': np.random.random(), 
                        'C': np.random.random(),
                        'E': np.random.random()}, index=('2013-01-08 00:00:00', )))
Out[51]:
A B C D E
2013-01-01 00:00:00 -0.183522 0.287747 0.437099 0.802198 0.795572
2013-01-02 00:00:00 -0.556901 1.451457 -1.171944 0.273472 0.714172
2013-01-03 00:00:00 0.949617 1.105997 0.865616 NaN 0.676403
2013-01-04 00:00:00 1.118095 0.387274 0.001193 1.642457 0.859019
2013-01-05 00:00:00 0.528890 NaN 1.294505 1.366227 0.358974
2013-01-06 00:00:00 -0.589232 -1.531341 0.107699 0.885957 0.002553
2013-01-08 00:00:00 0.606448 NaN 0.841401 NaN 0.433929

7 rows × 5 columns

In [19]:
pd.DataFrame({'A': np.random.random(), 
              'C': np.random.random(),
              'E': np.random.random()}, index=('2013-01-08 00:00:00', ))
Out[19]:
A C E
2013-01-08 00:00:00 0.762141 0.133528 0.893769

1 rows × 3 columns

Group by

In [110]:
# create a DataFrame to demonstrate groupby
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df
Out[110]:
A B C D
0 foo one -0.176504 -1.006985
1 bar one 0.442536 1.034323
2 foo two 0.139310 0.866478
3 bar three -2.540913 -0.157726
4 foo two 0.706973 0.496195
5 bar two 0.521037 1.370522
6 foo one -2.236087 -0.914654
7 foo three 1.200121 -1.101082

8 rows × 4 columns

In [112]:
df.groupby('A').sum()
Out[112]:
C D
A
bar -1.577340 2.247119
foo -0.366187 -1.660047

2 rows × 2 columns

In [113]:
df.groupby(['A','B']).sum()
Out[113]:
C D
A B
bar one 0.442536 1.034323
three -2.540913 -0.157726
two 0.521037 1.370522
foo one -2.412592 -1.921638
three 1.200121 -1.101082
two 0.846283 1.362673

6 rows × 2 columns
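A GroupBy object is not limited to a single reduction: agg applies several aggregation functions at once. A quick sketch, not part of the original notebook:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# sum and mean of C within each group of A
stats = df.groupby('A')['C'].agg(['sum', 'mean'])
print(stats)
```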

Pivot

In [120]:
pvt = pd.DataFrame({'A': ['foo', ] * 5 + ['bar'] * 4,
                    'B': ['one', ] * 3 + ['two', ] * 2 + ['one', ] * 2 + ['two', ] * 2,
                    'C': ['small', ] + ['large', ] * 2 + ['small', ] * 2 + ['large', ] + ['small', ] * 2 + ['large', ],
                    'D': [1, 2, 2, 3, 3, 4, 5, 6, 7]})
pvt
Out[120]:
A B C D
0 foo one small 1
1 foo one large 2
2 foo one large 2
3 foo two small 3
4 foo two small 3
5 bar one large 4
6 bar one small 5
7 bar two small 6
8 bar two large 7

9 rows × 4 columns

In [123]:
pd.pivot_table(pvt, values='D', rows=['A', 'B'], cols=['C'], aggfunc=np.sum)
Out[123]:
C large small
A B
bar one 4 5
two 7 6
foo one 4 1
two NaN 6

4 rows × 2 columns

In [22]:
df[df==1.567247]
Out[22]:
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN
2013-01-03 NaN NaN NaN NaN
2013-01-04 NaN NaN NaN NaN
2013-01-05 NaN NaN NaN NaN
2013-01-06 NaN NaN NaN NaN

6 rows × 4 columns

IO Tools (Text, CSV, HDF5, ...)

The pandas I/O API is a set of top-level reader functions, accessed like pd.read_csv(), that generally return a pandas object.

  • read_csv & to_csv
  • read_excel & to_excel
  • read_hdf & to_hdf
  • read_sql & to_sql
  • read_json & to_json
  • read_msgpack & to_msgpack (experimental)
  • read_html & to_html
  • read_gbq & to_gbq (experimental)
  • read_stata & to_stata
  • read_clipboard & to_clipboard
  • read_pickle & to_pickle
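As a minimal illustration of the reader/writer pairs above, a round trip through CSV; the file name example.csv is arbitrary:

```python
import os

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [0.1, 0.2, 0.3]})
df.to_csv('example.csv')                        # the index is written as the first column
back = pd.read_csv('example.csv', index_col=0)  # restore the index when reading back
print(back)
os.remove('example.csv')
```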
In [4]:
%%sh
head -n 20 Klima_LT_N_daily_1981-2012_8320_Bozen-Bolzano.csv
                        AUTONOME PROVINZ           PROVINCI
                       26. Brand- und Ziv             26. P
                       26.4 - Hydrographi             26.4 
                                                           
     Station - stazioneBozen - Bolzano                     
      Nummer - codice :8320                                
     Rechtswert - X UTM677473                              
     Hochwert - Y UTM :5151945                             
         Höhe - quota :254                                 
     Zeitraum - periodo1981              2012              
                                                           
         Data
Datum    Precipitazione NieTemperatura
Temper
                                          massima Maximum    minima Minimum  
         01:01:1981           0,0               -6,0              10,0       
         02:01:1981           0,0               3,0               10,0       
         03:01:1981           0,0               -5,0              6,0        
         04:01:1981           0,4               -4,0              3,0        
         05:01:1981           0,0               0,0               6,0        
In [23]:
import datetime as dt

bz = pd.read_csv("Klima_LT_N_daily_1981-2012_8320_Bozen-Bolzano.csv", 
                 skipinitialspace=True, skiprows=15, na_values='---',
                 index_col=0, 
                 decimal=',',
                 delim_whitespace=True,
                 names=('rainfall', 'Tmin', 'Tmax'),
                 parse_dates=True,
                 date_parser=lambda bytes: dt.datetime.strptime(bytes, "%d:%m:%Y"))
                 
In [53]:
bz[:5]
Out[53]:
rainfall Tmin Tmax
1981-01-01 0.0 -6 10
1981-01-02 0.0 3 10
1981-01-03 0.0 -5 6
1981-01-04 0.4 -4 3
1981-01-05 0.0 0 6

5 rows × 3 columns

Plot

In [54]:
%pylab inline
df = pd.DataFrame(randn(1000, 4), index=pd.Index(pd.date_range('1/1/2000', periods=1000)),
                  columns=['A', 'B', 'C', 'D']).cumsum()

plt.figure(); 
df.plot(); 
plt.legend(loc='best')
Populating the interactive namespace from numpy and matplotlib
Out[54]:
<matplotlib.legend.Legend at 0x7f56161a92d0>
<matplotlib.figure.Figure at 0x7f5624b02bd0>
In [55]:
fig, ax0 = plt.subplots(figsize=(20,5))
ax1 = ax0.twinx()
ax1.plot(bz.index, bz['rainfall'], 'k-', label='rainfall', alpha=0.5)
ax1.axis('tight')
ax1.set_title('$Rainfall/T_{min}/T_{max}$ in Bozen/Bolzano')
ax1.set_xlabel('year')
ax1.set_ylabel('rainfall [mm]')

ax0.plot(bz.index, bz['Tmin'], 'b-', label='Tmin', alpha=0.7)
ax0.plot(bz.index, bz['Tmax'], 'r-', label='Tmin', alpha=0.7)
ax0.set_ylabel('Temperature [°C]')
ax0.grid()
In [105]:
from pandas.stats.moments import rolling_mean, rolling_min, rolling_max

ROLL=21

fig = plt.figure(figsize=(20, 5))
rolling_mean(bz.loc[:, ['Tmin', 'Tmax']], ROLL).plot(subplots=True)
Out[105]:
array([<matplotlib.axes.AxesSubplot object at 0x7f55f9fd3650>,
       <matplotlib.axes.AxesSubplot object at 0x7f55f8c4bb10>], dtype=object)
<matplotlib.figure.Figure at 0x7f55f9fc2d50>
In [106]:
fig, ax0 = plt.subplots(figsize=(20,5))

ax0.plot(bz.index, rolling_mean(bz['Tmin'],  ROLL), 'b-', label='Tmin', alpha=0.7)
ax0.plot(bz.index, rolling_mean(bz['Tmax'],  ROLL), 'r-', label='Tmin', alpha=0.7)
ax0.set_ylabel('Temperature [°C]')
ax0.grid()

Group by: Month

In [26]:
bz.index.month
Out[26]:
array([ 1,  1,  1, ..., 12, 12, 12], dtype=int32)
In [27]:
bz_month = bz.groupby(bz.index.month)
In [63]:
len(bz_month)
Out[63]:
12
In [66]:
Jan = bz_month.get_group(1)
In [67]:
Jan[:5]
Out[67]:
rainfall Tmin Tmax
1981-01-01 0.0 -6 10
1981-01-02 0.0 3 10
1981-01-03 0.0 -5 6
1981-01-04 0.4 -4 3
1981-01-05 0.0 0 6

5 rows × 3 columns

We can use the Seaborn library to get nicer plots.

In [28]:
import matplotlib.pyplot as plt
import seaborn as sns

fig = plt.figure(figsize=(10, 20))

ax2 = subplot(313)
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=90)
ax2.set_ylabel('Tdiff')
#sns.boxplot(Jan['rainfall'], groupby=Jan.index.year, color="pastel", widths=.5, ax=ax2)
sns.boxplot(Jan['Tmax'] - Jan['Tmin'], groupby=Jan.index.year, color="pastel", widths=.5, ax=ax2)

ax0 = plt.subplot(311, sharex=ax2)
plt.setp(ax0.get_xticklabels(), visible=False)
sns.boxplot(Jan['Tmax'], groupby=Jan.index.year, color="pastel", widths=.5, ax=ax0)

ax1 = subplot(312, sharex=ax2)
plt.setp(ax1.get_xticklabels(), visible=False)
sns.boxplot(Jan['Tmin'], groupby=Jan.index.year, color="pastel", widths=.5, ax=ax1)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-28-fe38870621fc> in <module>()
      8 ax2.set_ylabel('Tdiff')
      9 #sns.boxplot(Jan['rainfall'], groupby=Jan.index.year, color="pastel", widths=.5, ax=ax2)
---> 10 sns.boxplot(Jan['Tmax'] - Jan['Tmin'], groupby=Jan.index.year, color="pastel", widths=.5, ax=ax2)
     11 
     12 ax0 = plt.subplot(311, sharex=ax2)

NameError: name 'Jan' is not defined
In [31]:
def groupby_years(ymin, step=2):
    def get(x):
        return ((x.year - ymin) // step) * step + ymin
    return get

groupby_2y = groupby_years(Jan.index.year.min(), 2)
groupby_3y = groupby_years(Jan.index.year.min(), 3)
groupby_5y = groupby_years(Jan.index.year.min(), 5)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-31-40a8d4776ea9> in <module>()
      4     return get
      5 
----> 6 groupby_2y = groupby_years(Jan.index.year.min(), 2)
      7 groupby_3y = groupby_years(Jan.index.year.min(), 3)
      8 groupby_5y = groupby_years(Jan.index.year.min(), 5)

NameError: name 'Jan' is not defined
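Setting aside the execution-order hiccup above (Jan must be defined by running the earlier cells first), the groupby_years closure simply buckets years into fixed-width bins starting at ymin. A quick check with plain dates, not part of the original notebook:

```python
import datetime as dt

def groupby_years(ymin, step=2):
    def get(x):
        return ((x.year - ymin) // step) * step + ymin
    return get

groupby_5y = groupby_years(1981, 5)
# years 1981-1985 map to 1981, 1986-1990 map to 1986, and so on
print(groupby_5y(dt.date(1983, 6, 1)))
print(groupby_5y(dt.date(1987, 6, 1)))
```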
In [180]:
fig = plt.figure(figsize=(10, 20))

gby = groupby_5y

ax2 = subplot(313)
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=90)
ax2.set_ylabel('Tdiff')
#sns.boxplot(Jan['rainfall'], groupby=Jan.index.year, color="pastel", widths=.5, ax=ax2)
sns.boxplot(Jan['Tmax'] - Jan['Tmin'], groupby=gby, color="pastel", widths=.5, ax=ax2)

ax0 = plt.subplot(311, sharex=ax2)
plt.setp(ax0.get_xticklabels(), visible=False)
sns.boxplot(Jan['Tmax'], groupby=gby, color="pastel", widths=.5, ax=ax0)

ax1 = subplot(312, sharex=ax2)
plt.setp(ax1.get_xticklabels(), visible=False)
sns.boxplot(Jan['Tmin'], groupby=gby, color="pastel", widths=.5, ax=ax1)
Out[180]:
<matplotlib.axes.AxesSubplot at 0x7f55f606e750>
In [181]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 20))

gby = groupby_5y

ax2 = subplot(313)
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=90)
#sns.violinplot(Jan['rainfall'], groupby=Jan.index.year, color="pastel", widths=.5, ax=ax2)
ax2.set_ylabel('Tdiff')
sns.violinplot(Jan['Tmax'] - Jan['Tmin'], groupby=gby, color="pastel", widths=.5, ax=ax2)

ax0 = plt.subplot(311, sharex=ax2)
plt.setp(ax0.get_xticklabels(), visible=False)
sns.violinplot(Jan['Tmax'], groupby=gby, color="pastel", widths=.5, ax=ax0)

ax1 = subplot(312, sharex=ax2)
plt.setp(ax1.get_xticklabels(), visible=False)
sns.violinplot(Jan['Tmin'], groupby=gby, color="pastel", widths=.5, ax=ax1)
Out[181]:
<matplotlib.axes.AxesSubplot at 0x7f55f5dc5850>
Define a `groupby` function:
In [96]:
# Create a Seasonal Dictionary that will map months to seasons
seasons = {12: 'Winter', 1: 'Winter', 2: 'Winter',
           3: 'Spring', 4: 'Spring', 5: 'Spring',
           6: 'Summer', 7: 'Summer', 8: 'Summer',
           9: 'Autumn', 10: 'Autumn', 11: 'Autumn'}

# Write a function that will be used to group the data
def group_by_season(x):
    return seasons[x.month]

# Call the function with the groupby operation. 
bz_seasons = bz.groupby(group_by_season)
winter = bz_seasons.get_group('Winter')
In [98]:
winter[:5]
Out[98]:
rainfall Tmin Tmax
1981-01-01 0.0 -6 10
1981-01-02 0.0 3 10
1981-01-03 0.0 -5 6
1981-01-04 0.4 -4 3
1981-01-05 0.0 0 6

5 rows × 3 columns

In [182]:
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 20))

gby = groupby_years(winter.index.year.min(), 5)

ax2 = subplot(313)
plt.setp(ax2.xaxis.get_majorticklabels(), rotation=90)
#sns.violinplot(winter['rainfall'], groupby=winter.index.year, color="pastel", widths=.5, ax=ax2)
ax2.set_ylabel('Tdiff')
sns.violinplot(winter['Tmax'] - winter['Tmin'], groupby=gby, color="pastel", widths=.5, ax=ax2)

ax0 = plt.subplot(311, sharex=ax2)
plt.setp(ax0.get_xticklabels(), visible=False)
sns.violinplot(winter['Tmax'], groupby=gby, color="pastel", widths=.5, ax=ax0)

ax1 = subplot(312, sharex=ax2)
plt.setp(ax1.get_xticklabels(), visible=False)
sns.violinplot(winter['Tmin'], groupby=gby, color="pastel", widths=.5, ax=ax1)
Out[182]:
<matplotlib.axes.AxesSubplot at 0x7f55f5b4ee50>
In [108]:
bz_month['Tmin'].mean().plot(label='mean')
bz_month['Tmin'].std().plot(label='std')
bz_month['Tmin'].min().plot(label='min')
bz_month['Tmin'].max().plot(label='max')
plt.legend(loc='best')
Out[108]:
<matplotlib.legend.Legend at 0x7f55f8af7410>
In [29]:
def plot(data, method='mean', column='Tmax', groupby=lambda x: x.year):
    fig = plt.figure(figsize=(20, 10))
    ax = subplot(111)
    with pd.plot_params.use('x_compat', True):
        for m in data.groups.keys():
            mdata = data.get_group(m).groupby(groupby)
            getattr(mdata, method)().loc[:, column].plot(label=str(m))
    ax.set_title(method.title())
    ax.set_xlabel('year')
    ax.set_ylabel('Temperature [°C]')
    ax.legend(loc='best')
In [32]:
plot(bz_month, method='min', groupby=groupby_years(bz_month.get_group(1).index.year.min(), 2))