Kushal Keshavamurthy Raviprakash
kushalkr2992@gmail.com
This notebook is a part of the Python for Earth and Atmospheric Sciences workshop.
%matplotlib inline
Pandas
is a python package which provides fast, flexible and expressive data structures and is a package designed to make working with "relational" or "labelled" data easy and intuitive.
I find pandas very useful for timeseries analysis.
The pandas package is generally imported as :
import pandas as pd
The most common data structures in pandas
are:
Series
(1-Dimensional, labeled, homogeneous array)DataFrame
(2-Dimensional, labeled with potentially heterogenous columns)Panel
(3-Dimensional, size mutable array) [DEPRECATED and will be removed in the future]In this lecture, we will look at the Series
and DataFrame
data structures in some detail.
Let us first import the necessary packages.
# Import necessary packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.__version__
'0.20.2'
We will use the Arctic Oscillation (AO) and North Atlantic Oscillation (NAO) datasets as an example.
You can get the data from here. But, as a sample, you will find it being present in the data/
directory of my repository already.
pandas
has some very good I/O facilities. But, for now we will stick to what we have learnt so far.
ao = np.loadtxt('data/monthly.ao.index.b50.current.ascii')
print(ao[0:2])
[[ 1.95000000e+03 1.00000000e+00 -6.03100000e-02] [ 1.95000000e+03 2.00000000e+00 6.26810000e-01]]
The data has three columns: year, month and the AO index value.
Timeseries of a single variable are usually put into the Series
data structure provided by pandas
.
To create a timeseries, we need to first create a time range for which we have the data and later use it to index the data. This is done with the help of the date_range
function defined in the pandas
module.
The data we have loaded ranges from January-1950 until May-2017. You may have to change the range if you obtain a newer version of the data. We use freq = 'M'
which means the data is available every month.
dates = pd.date_range('1950-01','2017-01', freq='M') # Actual data is until 2017-06
dates
DatetimeIndex(['1950-01-31', '1950-02-28', '1950-03-31', '1950-04-30', '1950-05-31', '1950-06-30', '1950-07-31', '1950-08-31', '1950-09-30', '1950-10-31', ... '2016-03-31', '2016-04-30', '2016-05-31', '2016-06-30', '2016-07-31', '2016-08-31', '2016-09-30', '2016-10-31', '2016-11-30', '2016-12-31'], dtype='datetime64[ns]', length=804, freq='M')
dates.shape
(804,)
We will use the data only until Dec-2016.
AO = pd.Series(ao[:804,2], index=dates)
AO
1950-01-31 -0.060310 1950-02-28 0.626810 1950-03-31 -0.008127 1950-04-30 0.555100 1950-05-31 0.071577 1950-06-30 0.538570 1950-07-31 -0.802480 1950-08-31 -0.851010 1950-09-30 0.357970 1950-10-31 -0.378900 1950-11-30 -0.515110 1950-12-31 -1.928100 1951-01-31 -0.084969 1951-02-28 -0.399930 1951-03-31 -1.934100 1951-04-30 -0.776480 1951-05-31 -0.862780 1951-06-30 -0.917860 1951-07-31 0.090023 1951-08-31 -0.377410 1951-09-30 -0.817780 1951-10-31 -0.212910 1951-11-30 -0.068519 1951-12-31 1.987200 1952-01-31 0.368250 1952-02-29 -1.747200 1952-03-31 -1.859500 1952-04-30 0.538520 1952-05-31 -0.773510 1952-06-30 -0.440930 ... 2014-07-31 -0.488940 2014-08-31 -0.371540 2014-09-30 0.101910 2014-10-31 -1.134400 2014-11-30 -0.530300 2014-12-31 0.412920 2015-01-31 1.091600 2015-02-28 1.042600 2015-03-31 1.837400 2015-04-30 1.215700 2015-05-31 0.762760 2015-06-30 0.427040 2015-07-31 -1.107900 2015-08-31 -0.689020 2015-09-30 -0.164510 2015-10-31 -0.250060 2015-11-30 1.945000 2015-12-31 1.444100 2016-01-31 -1.448700 2016-02-29 -0.023521 2016-03-31 0.280240 2016-04-30 -1.051100 2016-05-31 -0.035739 2016-06-30 0.312880 2016-07-31 0.084758 2016-08-31 0.472380 2016-09-30 0.780970 2016-10-31 -1.917300 2016-11-30 -0.610910 2016-12-31 1.786400 Freq: M, Length: 804, dtype: float64
Let's see how the data looks.
AO.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1128e4908>
AO['1980':'1990'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x11295a470>
AO['1992-01':'1993-08'].plot()
<matplotlib.axes._subplots.AxesSubplot at 0x114c26f98>
Indexing data is very intuituve. For example:
AO['1992']
1992-01-31 0.55003 1992-02-29 1.12170 1992-03-31 0.98423 1992-04-30 -0.52053 1992-05-31 1.34140 1992-06-30 -0.30196 1992-07-31 0.19113 1992-08-31 0.53534 1992-09-30 -0.64030 1992-10-31 -0.36589 1992-11-30 0.71699 1992-12-31 1.62670 Freq: M, dtype: float64
AO['2016-07']
2016-07-31 0.084758 Freq: M, dtype: float64
You can also use fancy indexing as with arrays.
AO[np.logical_and((AO < 0.3),(AO > 0))]
1950-05-31 0.071577 1951-07-31 0.090023 1953-06-30 0.022535 1953-08-31 0.084960 1955-04-30 0.194230 1955-05-31 0.241610 1955-10-31 0.099053 1956-06-30 0.280140 1956-12-31 0.000891 1957-04-30 0.237660 1957-08-31 0.096898 1959-04-30 0.119240 1959-07-31 0.104720 1960-06-30 0.054978 1961-05-31 0.157480 1961-08-31 0.012693 1961-10-31 0.203080 1962-05-31 0.067997 1962-06-30 0.287070 1962-08-31 0.152070 1963-09-30 0.083110 1964-06-30 0.141590 1965-06-30 0.037800 1965-12-31 0.162970 1966-07-31 0.010944 1966-09-30 0.011063 1966-11-30 0.110560 1967-05-31 0.126510 1967-07-31 0.259260 1967-09-30 0.132930 ... 1996-11-30 0.136370 1997-08-31 0.120620 1997-09-30 0.194510 1998-10-31 0.294330 1999-01-31 0.110350 1999-04-30 0.284450 1999-05-31 0.225870 1999-09-30 0.059078 2000-08-31 0.143890 2003-02-28 0.127800 2003-07-31 0.075316 2003-12-31 0.265210 2005-08-31 0.026141 2005-10-31 0.029768 2005-11-30 0.227690 2006-04-30 0.138330 2006-05-31 0.155810 2006-07-31 0.102700 2007-09-30 0.178920 2008-11-30 0.092205 2009-03-31 0.121340 2012-05-31 0.168410 2012-07-31 0.167790 2012-08-31 0.013999 2013-08-31 0.154250 2013-10-31 0.262760 2014-02-28 0.043775 2014-09-30 0.101910 2016-03-31 0.280240 2016-07-31 0.084758 Length: 106, dtype: float64
2-Dimensional data are handled well by the data structure called Dataframe
. We will use the North Atlantic Oscillation (NAO) data as an example. This data is available here.
Create a Series
the same way as the previously done for the AO data.
nao = np.loadtxt('data/norm.nao.monthly.b5001.current.ascii.txt')
dates_nao = pd.date_range('1950-01', '2017-01',freq='M')
print(dates_nao.dtype)
NAO = pd.Series(nao[:804,2],index=dates_nao)
datetime64[ns]
aonao = pd.DataFrame({'AO' : AO, 'NAO' : NAO})
aonao.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x114e158d0>
aonao.head()
AO | NAO | |
---|---|---|
1950-01-31 | -0.060310 | 0.92 |
1950-02-28 | 0.626810 | 0.40 |
1950-03-31 | -0.008127 | -0.36 |
1950-04-30 | 0.555100 | 0.73 |
1950-05-31 | 0.071577 | -0.59 |
Slicing works for both data.
aonao['1980':'1985']
AO | NAO | |
---|---|---|
1980-01-31 | -2.065700 | -0.75 |
1980-02-29 | -0.933720 | 0.05 |
1980-03-31 | -1.433300 | -0.31 |
1980-04-30 | -0.419130 | 1.29 |
1980-05-31 | -1.154800 | -1.50 |
1980-06-30 | 0.721490 | -0.37 |
1980-07-31 | -0.622210 | -0.42 |
1980-08-31 | -0.185190 | -2.24 |
1980-09-30 | 0.312620 | 0.66 |
1980-10-31 | -0.521210 | -1.77 |
1980-11-30 | -1.361000 | -0.37 |
1980-12-31 | -0.057300 | 0.78 |
1981-01-31 | -0.116340 | 0.37 |
1981-02-28 | -0.331580 | 0.92 |
1981-03-31 | -1.644700 | -1.19 |
1981-04-30 | 0.430410 | 0.36 |
1981-05-31 | 0.179560 | 0.20 |
1981-06-30 | -0.437930 | -0.45 |
1981-07-31 | 0.560550 | 0.05 |
1981-08-31 | -0.244110 | 0.39 |
1981-09-30 | -1.040100 | -1.45 |
1981-10-31 | -1.167500 | -1.35 |
1981-11-30 | -0.187670 | -0.38 |
1981-12-31 | -1.215700 | -0.02 |
1982-01-31 | -0.883410 | -0.89 |
1982-02-28 | 0.973910 | 1.15 |
1982-03-31 | 1.074100 | 1.15 |
1982-04-30 | 1.453800 | 0.10 |
1982-05-31 | -0.208700 | -0.53 |
1982-06-30 | -1.180100 | -1.63 |
... | ... | ... |
1983-07-31 | 0.130550 | 1.19 |
1983-08-31 | 1.097800 | 1.61 |
1983-09-30 | 0.166890 | -1.12 |
1983-10-31 | 1.368900 | 0.65 |
1983-11-30 | -0.687930 | -0.98 |
1983-12-31 | 0.186250 | 0.29 |
1984-01-31 | 0.905040 | 1.66 |
1984-02-29 | -0.302720 | 0.72 |
1984-03-31 | -2.386000 | -0.37 |
1984-04-30 | -0.283590 | -0.28 |
1984-05-31 | 0.479180 | 0.54 |
1984-06-30 | 0.007289 | -0.42 |
1984-07-31 | 0.018871 | -0.07 |
1984-08-31 | 0.465690 | 1.15 |
1984-09-30 | -0.412760 | 0.17 |
1984-10-31 | -0.270260 | -0.07 |
1984-11-30 | -0.965870 | -0.06 |
1984-12-31 | 0.446000 | 0.00 |
1985-01-31 | -2.805700 | -1.61 |
1985-02-28 | -1.439800 | -0.49 |
1985-03-31 | 0.551440 | 0.20 |
1985-04-30 | 0.652420 | 0.32 |
1985-05-31 | -0.432210 | -0.49 |
1985-06-30 | -0.346560 | -0.80 |
1985-07-31 | -0.389580 | 1.22 |
1985-08-31 | -0.001429 | -0.48 |
1985-09-30 | 0.114410 | -0.52 |
1985-10-31 | 1.035100 | 0.90 |
1985-11-30 | -1.217500 | -0.67 |
1985-12-31 | -1.947600 | 0.22 |
72 rows × 2 columns
You can add more complexity to the slicing this way.
import datetime
aonao.loc[(aonao.AO > 0) & (aonao.NAO < 0)
& (aonao.index > datetime.datetime(1980,1,1))
& (aonao.index < datetime.datetime(1989,1,1)), 'NAO'].plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x114f2b748>
Here I have used the DataFrame.loc advanced indexing attribute. We choose all the necessary constraints such as, AO > 0, NAO < 0, date range between years 1980-1989 (I have used the datetime module for this) and then, voila! you have your sliced data according to your requirements.
aonao.min()
AO -4.2657 NAO -3.1800 dtype: float64
aonao.max()
AO 3.4953 NAO 3.0400 dtype: float64
aonao.mean()
AO -0.116709 NAO -0.018718 dtype: float64
aonao.median()
AO -0.044346 NAO 0.025000 dtype: float64
print(aonao.mean(0))# This is the regular mean (column-wise)
print("\n Now let's see the mean of both indices:\n")
print(aonao.mean(1)) # Mean along row
AO -0.116709 NAO -0.018718 dtype: float64 Now let's see the mean of both indices: 1950-01-31 0.429845 1950-02-28 0.513405 1950-03-31 -0.184064 1950-04-30 0.642550 1950-05-31 -0.259211 1950-06-30 0.239285 1950-07-31 -1.031240 1950-08-31 -0.450505 1950-09-30 0.303985 1950-10-31 0.235550 1950-11-30 -0.887555 1950-12-31 -1.474050 1951-01-31 -0.002485 1951-02-28 0.150035 1951-03-31 -1.477050 1951-04-30 -0.498240 1951-05-31 -0.726390 1951-06-30 -1.278930 1951-07-31 0.730012 1951-08-31 -0.298705 1951-09-30 -1.088890 1951-10-31 0.828545 1951-11-30 -0.229260 1951-12-31 1.653600 1952-01-31 0.649125 1952-02-29 -1.288600 1952-03-31 -1.674750 1952-04-30 0.774260 1952-05-31 -0.946755 1952-06-30 -0.420465 ... 2014-07-31 -0.156775 2014-08-31 -1.026520 2014-09-30 0.859105 2014-10-31 -1.202500 2014-11-30 0.073835 2014-12-31 1.135210 2015-01-31 1.440150 2015-02-28 1.182700 2015-03-31 1.643550 2015-04-30 0.970525 2015-05-31 0.454140 2015-06-30 0.180109 2015-07-31 -2.143450 2015-08-31 -0.724690 2015-09-30 -0.406485 2015-10-31 0.093585 2015-11-30 1.844400 2015-12-31 1.843850 2016-01-31 -0.666090 2016-02-29 0.778390 2016-03-31 0.507320 2016-04-30 -0.337980 2016-05-31 -0.404660 2016-06-30 -0.059815 2016-07-31 -0.837771 2016-08-31 -0.586460 2016-09-30 0.695965 2016-10-31 -0.753025 2016-11-30 -0.387285 2016-12-31 1.132365 Freq: M, Length: 804, dtype: float64
Or you can get all necessary statistics in one command.
aonao.describe()
AO | NAO | |
---|---|---|
count | 804.000000 | 804.000000 |
mean | -0.116709 | -0.018718 |
std | 1.007779 | 1.011623 |
min | -4.265700 | -3.180000 |
25% | -0.662915 | -0.760090 |
50% | -0.044346 | 0.025000 |
75% | 0.475240 | 0.681750 |
max | 3.495300 | 3.040000 |
Getting correlation coefficients is also very easy.
aonao.corr()
AO | NAO | |
---|---|---|
AO | 1.000000 | 0.605191 |
NAO | 0.605191 | 1.000000 |
Try out basic pandas on some irregular data like the Bloomington station temperature and dew point temperature.
import datetime as dt
data = pd.read_csv('data/KBMG-2013.csv')
data['date'] = data.apply(lambda x: dt.datetime(x['yyyy'], x['mm'],x['dd'],x['hh'], x['mm.1']), axis=1)
data.set_index(data['date'], inplace=True)
T = pd.Series(data['temp'], index=data.index)
TD = pd.Series(data['dwpt'], index=data.index)
TTD = pd.DataFrame({'$T$': T, '$T_d$' : TD})
TTD.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1158e88d0>