Retrieve some Data from Quandl and Play with Pandas

This notebook is related to the blog post Retrieve Quandl's Data and Play with Pandas

This IPython notebook shows some features of the Python data library Pandas, using numerical data from the Quandl website.

Short Intro

Three lines to import modules and configure the Matplotlib display options.

In [1]:
import numpy as np
import pandas as pd
pd.options.display.mpl_style = 'default'
In [2]:
print 'Pandas version', pd.__version__
print 'Numpy version', np.__version__
Pandas version 0.12.0
Numpy version 1.7.1

Just plot ten Brownian motions.

In [2]:
pd.DataFrame(np.random.randn(120, 10).cumsum(axis=0)).plot(title='Brownian Motion')
Out[2]:
<matplotlib.axes.AxesSubplot at 0x37a6fd0>

Retrieve some Data: Quandl's Datasets

See Quandl and its Python package, which depends on NumPy and Pandas.

It's possible to get a huge number of datasets with time series on different topics --- finance, education, health, etc.

In [3]:
import Quandl
TOKEN = None  # Replace None with your own Quandl token

A Single Quandl Dataset

Get a dataset about the murder rates per 100,000 persons in Great Britain (from OECD stats).

In [4]:
uk_df = Quandl.get('OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_GBR', authtoken=TOKEN)
Token YOUR_TOKEN activated and saved for later use.
Returning Dataframe for  OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_GBR
In [5]:
uk_df.head(8)
Out[5]:
Value
Date
1986-12-31 0.8
1987-12-31 1.0
1988-12-31 1.0
1989-12-31 1.1
1990-12-31 0.7
1991-12-31 0.7
1992-12-31 0.9
1993-12-31 1.2

Note that the Quandl website allows you to make fewer than 50 calls per day without authentication. You can sign up for a token if you want more. See http://www.quandl.com/help/authorization-token

In [11]:
uk_df.plot()
Out[11]:
<matplotlib.axes.AxesSubplot at 0x3b36fd0>

Murder Rate Data for 33 Countries

I would like to get this for the other countries too. The dataset codes match the pattern OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_[A-Z]+, where the trailing letters are the country code.

In [22]:
import re
import urllib2

def get_quandl_country_code(url, pattern):
    """Get the country code from an URL and a pattern.
    
    Suppose your pattern have the 'country' group name.
    """
    rawtext = urllib2.urlopen(url).read()
    regexp = re.compile(pattern)
    return [m.group('country') for m in regexp.finditer(rawtext)]
In [23]:
URL = 'http://www.quandl.com/society/oecd-murder-rates'
PATTERN = r'OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_(?P<country>[A-Z]+)'
country_code = get_quandl_country_code(URL, PATTERN)
print 'Nb countries', len(country_code), '--- first 5: ', country_code[:5]
Nb countries 33 --- first 5:  ['USA', 'AUT', 'BEL', 'DNK', 'FIN']

Now, let's retrieve the data for each country and concatenate them into a single DataFrame. Note that the time index is not the same for every country.

Retrieve two datasets and try the pd.concat function.

In [8]:
usa_df = Quandl.get('OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_USA', authtoken=TOKEN)
Token YOUR_TOKEN activated and saved for later use.
Returning Dataframe for  OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_USA
In [16]:
df = pd.concat([uk_df, usa_df], axis=1)
df
Out[16]:
Value Value
Date
1986-12-31 0.8 8.5
1987-12-31 1.0 8.2
1988-12-31 1.0 8.4
1989-12-31 1.1 8.7
1990-12-31 0.7 9.4
1991-12-31 0.7 9.9
1992-12-31 0.9 9.4
1993-12-31 1.2 9.6
1994-12-31 1.0 9.1
1995-12-31 1.0 8.4
1996-12-31 0.8 7.6
1997-12-31 0.7 7.2
1998-12-31 0.7 6.5
1999-12-31 0.7 6.1
2000-12-31 NaN 5.8
2001-12-31 0.4 7.0
2002-12-31 0.4 6.0
2003-12-31 0.5 5.9
2004-12-31 0.5 5.8
2005-12-31 0.4 6.0
2006-12-31 0.4 6.0
2007-12-31 0.4 6.0
2008-12-31 0.3 5.7
2009-12-31 0.3 5.4
2010-12-31 0.3 5.2

As you can see, there are a few NaN values. Moreover, both columns are labelled Value instead of the country names. Let's fix this and get the datasets for all 33 countries.

In [24]:
def get_quandl_dataset(code, name=None):
    """Get a Pandas DataFrame from the Quandl website according to the code.
    
    name: str or list
        The name(s) of the col label.
    """
    df = Quandl.get(code)
    if name:
        name = [name] if isinstance(name, str) else name
        df.columns = pd.Index(name)
    return df

Sorry, but I can't avoid the verbose printing from Quandl's get function...
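
If the messages really bother you, one possible workaround (a sketch, not part of the original notebook, assuming the messages are written to sys.stdout) is to redirect stdout temporarily around the call:

import sys
from contextlib import contextmanager
from StringIO import StringIO

@contextmanager
def silent_stdout():
    """Swallow anything printed to stdout inside the 'with' block."""
    saved = sys.stdout
    sys.stdout = StringIO()
    try:
        yield
    finally:
        sys.stdout = saved

# usage: hides the 'Returning Dataframe for ...' messages
# with silent_stdout():
#     df = Quandl.get(code)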

In [13]:
code = PATTERN.replace(r"(?P<country>[A-Z]+)", "%s")
murder = pd.concat([get_quandl_dataset(code % name, name) for name in country_code], axis=1)
Using cached token YOUR_TOKEN for authentication.
Returning Dataframe for  OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_USA
Using cached token YOUR_TOKEN for authentication.
Returning Dataframe for  OECD/HEALTH_STAT_CICDHOCD_TXCMILTX_AUT
[... the same pair of lines repeats for each of the 33 country codes ...]

OK! I've got a murder DataFrame with 33 columns labelled with the country codes, and 26 dates.

In [19]:
print murder.shape
print murder.dtypes[:5]
print murder[country_code[:5]].head()
(26, 33)
USA    float64
AUT    float64
BEL    float64
DNK    float64
FIN    float64
dtype: object
            USA  AUT  BEL  DNK  FIN
Date                               
1986-12-31  8.5  1.3  2.1  1.2  3.2
1987-12-31  8.2  1.3  1.6  1.0  2.7
1988-12-31  8.4  1.2  1.6  1.1  2.8
1989-12-31  8.7  1.1  1.3  1.2  3.1
1990-12-31  9.4  1.6  1.4  1.0  3.2

Here, you have some missing values represented by NaN, i.e. Not a Number. You can count the number of non-NaN values per country (i.e. for each column).

In [20]:
murder.count()
Out[20]:
USA    25
AUT    26
BEL    21
DNK    21
FIN    26
FRA    24
DEU    22
GRC    25
HUN    26
ISL    22
IRL    25
ITA    23
LUX    26
NLD    26
NOR    26
PRT    23
ESP    26
SWE    25
CHE    25
GBR    24
CZE    26
POL    24
SVK    17
SVN    25
EST    26
JPN    25
KOR    26
ISR    25
CAN    24
MEX    25
CHL    24
AUS    26
NZL    24
dtype: int64

You can plot all the data over time.

In [21]:
murder.plot()
Out[21]:
<matplotlib.axes.AxesSubplot at 0x4086c90>

Well... The legend is quite big.
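
One possible workaround (a sketch using plain Matplotlib calls, not in the original notebook) is to draw a smaller, two-column legend outside the axes:

ax = murder.plot(legend=False)  # draw the lines without the default legend
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5), ncol=2,
          prop={'size': 8})     # smaller two-column legend, outside the plot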

Unemployment Data

I would like to get another kind of dataset covering roughly the same countries, and try to join and compare the data. Take another look at Quandl's website: I found the "Registered Unemployment and Job Vacancies (MEI)" dataset from OECD stats. Let's retrieve another DataFrame.

Take a look at the dataset for France: http://www.quandl.com/OECD-Organisation-for-Economic-Co-operation-and-Development/MEI_LAB_REG1_FRA_EMECJV_ST_Q-France-Unfilled-Job-Vacancies-Level-Rate-Or-Quantity-Series-Quarterly

Loop over the country codes and concatenate the results.

In [25]:
PATTERN2 = r'OECD/MEI_LAB_REG1_%(country)s_EMECJV_ST_Q'
df_can = get_quandl_dataset(PATTERN2 % {'country': 'CAN'}, 'CAN')
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-25-091e11b7a758> in <module>()
      1 PATTERN2 = r'OECD/MEI_LAB_REG1_%(country)s_EMECJV_ST_Q'
----> 2 df_can = get_quandl_dataset(PATTERN2 % {'country': 'CAN'}, 'CAN')

<ipython-input-24-165e2948d6b0> in get_quandl_dataset(code, name)
      5         The name(s) of the col label.
      6     """
----> 7     df = Quandl.get(code)
      8     if name:
      9         name = [name] if isinstance(name, str) else name

/home/dag/doc/work/blogpost/2013/pandas/Quandl/Quandl.pyc in get(dataset, **kwargs)
    134             elif str(e) == 'HTTP Error 404: Not Found':
    135                 error = "Dataset not found. Check Quandl code: {} for errors".format(dataset)
--> 136                 raise Exception(error)
    137 
    138             #Catch all

Exception: Dataset not found. Check Quandl code: OECD/MEI_LAB_REG1_CAN_EMECJV_ST_Q for errors
No authentication tokens found: usage will be limited.
See www.quandl.com/api for more information.

Too bad. You can see that this kind of dataset does not exist for every country from the previous list. Thus, I loop over my country codes with a try..except block. A more specific exception type would have been appreciated here. "If you want, make a patch"... you're right, but for now:

In [26]:
dflist = []
for country in country_code:
    try:
        data = get_quandl_dataset(PATTERN2 % {'country': country}, country)
        dflist.append(data)
    except Exception:  # dataset missing for this country: skip it
        continue
No authentication tokens found: usage will be limited.
See www.quandl.com/api for more information.
No authentication tokens found: usage will be limited.
See www.quandl.com/api for more information.
Returning Dataframe for  OECD/MEI_LAB_REG1_AUT_EMECJV_ST_Q
[... similar output repeats for the remaining country codes; a "Returning Dataframe" line appears only for the 17 countries where the dataset exists ...]
In [27]:
print len(dflist)
unemploy = pd.concat(dflist, axis=1)
print unemploy.columns
17
Index([u'AUT', u'BEL', u'FIN', u'FRA', u'DEU', u'HUN', u'LUX', u'NLD', u'NOR', u'PRT', u'ESP', u'SWE', u'GBR', u'CZE', u'POL', u'JPN', u'AUS'], dtype=object)

17 countries, all of which occur in my previous "murder rates" DataFrame, but with more dates and a different time step.

In [28]:
unemploy
Out[28]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 105 entries, 1986-03-31 00:00:00 to 2012-03-31 00:00:00
Freq: Q-DEC
Data columns (total 17 columns):
AUT    105  non-null values
BEL    73  non-null values
FIN    105  non-null values
FRA    93  non-null values
DEU    105  non-null values
HUN    81  non-null values
LUX    105  non-null values
NLD    88  non-null values
NOR    105  non-null values
PRT    105  non-null values
ESP    77  non-null values
SWE    105  non-null values
GBR    105  non-null values
CZE    86  non-null values
POL    89  non-null values
JPN    105  non-null values
AUS    105  non-null values
dtypes: float64(17)

Let's dump these two DataFrames into CSV files. That way, I will be able to get these datasets back without calling Quandl and re-running the concatenation.

In [1]:
murder.to_csv('oecd-murder-rates.csv')
unemploy.to_csv('oecd-unemployment.csv')

To load the DataFrames from these CSV files, just do:

In [2]:
murder = pd.read_csv('oecd-murder-rates.csv', index_col=0, parse_dates=True)
unemploy = pd.read_csv('oecd-unemployment.csv', index_col=0, parse_dates=True)

See the other read_csv arguments to fit your case: delimiter, header, date format, skiprows/skip_footer, dealing with missing values, etc.
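
For instance, a hypothetical messy file could be loaded with something like this (the file name and option values are only illustrative):

df = pd.read_csv('some-data.csv',
                 delimiter=';',         # non-comma field separator
                 header=0,              # first row holds the column labels
                 index_col=0,           # first column becomes the index
                 parse_dates=True,      # parse the index as dates
                 skiprows=2,            # ignore two junk lines at the top
                 na_values=['', 'NA'])  # strings to treat as missing values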

Maybe you saw that the two DataFrames do not have the same shape:

  • the number of dates is not the same (moreover, the time frequency differs);
  • the number of countries is not the same.

I didn't mention it before: unlike NumPy, you can 'attach' an Index to your data. Here, I use a time series index, which is a powerful pandas feature. You can use integers, strings or floats as an Index if you want to.
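
For instance, a throwaway Series with a string Index (just an illustration, not used below):

s = pd.Series([1.2, 3.4, 5.6], index=['a', 'b', 'c'])
print s['b']  # label-based access: 3.4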

In [31]:
print murder.shape, unemploy.shape
(26, 33) (105, 17)

First, I would like to select the countries which occur in both DataFrames. I know there are fewer countries in unemploy than in murder.

In [5]:
sub_murder = murder[unemploy.columns]

Then, I would like to have the same time frequency for my two DataFrames. Let's get an unemployment value for each date which occurs in my "murder" DataFrame.

  • murder DataFrame: last day of each year, i.e. yearly frequency;
  • unemployment DataFrame: last day of each quarter, i.e. quarterly frequency.

Here, I use pandas DateOffset objects to deal with the different frequencies.

In [3]:
import pandas.tseries.offsets as offs
In [6]:
sub_murder.index.offset = offs.YearEnd()
unemploy.index.offset = offs.QuarterEnd()

sub_murder.index
Out[6]:
<class 'pandas.tseries.index.DatetimeIndex'>
[1986-12-31 00:00:00, ..., 2011-12-31 00:00:00]
Length: 26, Freq: A-DEC, Timezone: None
In [7]:
unemploy.index
Out[7]:
<class 'pandas.tseries.index.DatetimeIndex'>
[1986-03-31 00:00:00, ..., 2012-03-31 00:00:00]
Length: 105, Freq: Q-MAR, Timezone: None

As you can see, the __repr__ of my Index object has changed. There is Freq: A-DEC which means calendar year end.

Note that these lines are not mandatory if you get the DataFrames from Quandl, since Quandl.get(your_data_code) can return a DataFrame with the right frequency already set.

See pandas doc about offsets (i.e. frequency) aliases http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
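
As a quick illustration of these aliases (not from the original notebook):

print pd.date_range('2000-01-01', periods=3, freq='A-DEC')  # calendar year ends
print pd.date_range('2000-01-01', periods=3, freq='Q-DEC')  # calendar quarter ends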

Well. Now, how can you get unemployment values (a yearly average, for instance) with the same frequency as the "murder" DataFrame? It's simple: just use the resample method.

In [8]:
year_unempl = unemploy.resample('A-DEC')
year_unempl
Out[8]:
AUT BEL FIN FRA DEU HUN LUX NLD NOR PRT ESP SWE GBR CZE POL JPN AUS
Date
1986-12-31 24692.6650 17739.3325 12438.58075 NaN 142974.775 NaN 235.000000 NaN 10493.41675 4275.00000 46104.4975 39000.00000 186850 NaN NaN 380499.975 64725
1987-12-31 26836.1650 14739.1675 12430.83250 NaN 157548.725 NaN 245.333350 NaN 12400.41500 7625.00000 46297.4200 46141.66750 234800 NaN NaN 436333.325 68675
1988-12-31 31184.0850 21508.8300 17358.33250 NaN 174979.050 NaN 216.250000 92625 8686.08350 9983.33400 55248.5850 52108.33500 245850 NaN NaN 558666.650 77750
1989-12-31 45598.6650 23157.1675 30380.00000 99923.160 235447.675 NaN 201.000000 92625 6906.00100 10908.33250 58452.7475 49950.00000 217525 NaN NaN 618166.650 83675
1990-12-31 55621.5000 18805.3325 26925.75000 100990.065 296489.150 NaN 177.666675 115400 6595.66675 9458.33350 51657.3350 41525.00250 172425 61912.6700 44291.66500 644500.000 58900
1991-12-31 49447.5825 16227.7525 13380.16750 93521.750 335004.200 NaN 153.749975 95550 6515.91650 8508.33350 41445.5000 18550.00000 116025 41269.6650 44208.33250 634750.000 31600
1992-12-31 44125.8325 16886.2500 7051.41675 100831.400 326219.100 21751.6650 159.000000 67200 6421.83325 7983.33325 31678.9150 10566.66650 116175 77535.2475 31708.33250 554083.350 33875
1993-12-31 32906.0850 14583.3350 5935.16650 99241.065 255561.825 34375.4150 125.583335 39275 7417.58325 6483.33350 20745.0850 8616.66675 126975 68332.0000 32033.33250 472750.025 43525
1994-12-31 30196.6675 19049.0850 7362.33325 132958.250 264042.800 35485.7475 271.250025 41750 7666.00000 5911.66675 19956.4150 11833.33325 157425 74074.9975 39050.00250 455000.000 74000
1995-12-31 24986.1675 19652.4175 8305.33425 167718.575 295787.875 28679.9150 437.666650 57000 8789.83325 7968.33350 28216.0825 14966.66500 181350 91248.1700 38525.00000 474333.350 74100
1996-12-31 19430.4975 21262.5025 10131.50000 188449.250 302137.600 38297.2500 1107.250175 68825 9990.24850 8622.83325 49616.1650 14041.66675 223925 98727.4900 30733.33250 530916.675 79450
1997-12-31 19018.5850 24568.6675 13690.33425 209466.650 311882.825 42543.9150 931.416750 89500 14057.33500 10998.66675 49468.1675 16150.00000 279850 77260.5850 26008.33000 558583.325 84750
1998-12-31 23088.1675 36319.5850 16842.25250 226316.650 390079.575 48260.0000 1443.250000 125725 18608.33000 12930.83250 75107.7475 23075.00000 296625 55423.5025 16041.66750 492333.325 98325
1999-12-31 31216.4175 43985.9175 14627.91575 246874.975 422541.850 51271.3325 1529.666750 163150 17878.16750 13732.50000 143101.8350 25775.00000 307725 35563.3325 12374.99825 488416.675 101100
2000-12-31 35495.2525 53144.4150 17120.16500 263016.650 452102.250 49999.7525 1364.333250 190150 18424.50000 15141.00000 106791.6600 35108.33250 357975 46700.5000 12358.33425 585833.325 115825
2001-12-31 29669.8325 46386.0850 19637.83500 252341.650 430995.425 43366.4175 1380.916500 176550 15191.41500 12220.25250 116460.9250 27350.00250 586400 58177.5000 9866.66750 594916.650 92925
2002-12-31 23189.0025 41156.7500 21591.58000 245508.350 372452.325 44602.5000 940.000000 130000 12225.41750 9727.83425 130919.8500 29209.08500 605500 48117.0850 11941.66825 598583.325 96825
2003-12-31 21716.4150 40471.0825 23320.83250 241716.650 266281.150 47239.4150 841.833350 96250 11114.50000 10227.33500 149534.2250 25263.00000 588250 42488.5000 16983.33500 670166.675 106325
2004-12-31 23774.2500 47681.3300 23479.25000 255758.350 206031.575 48223.4150 1014.083500 111650 10707.74900 9693.24825 150802.9000 20859.41500 637750 46054.0025 22108.33500 761666.675 123275
2005-12-31 26208.5000 NaN 29101.75000 276575.000 255758.250 41613.5000 1210.416750 143525 13321.08250 9057.83250 161163.0000 22348.41500 621500 55691.9150 29708.33250 825583.325 140050
2006-12-31 32912.4175 NaN 33991.92000 283608.350 354286.750 41676.4975 1750.750000 191750 18894.83250 10359.75175 NaN 32414.83250 601500 84235.0825 47491.66750 860750.000 153850
2007-12-31 38213.9175 NaN 40167.25250 302358.350 423431.600 29933.3350 2473.333250 227725 24302.00250 14402.58500 NaN 42312.66500 660500 123978.1750 66650.00250 805666.650 171525
2008-12-31 37497.9175 NaN 37493.08250 292975.000 389045.000 25386.0000 2080.500000 213325 26571.16750 16043.83500 NaN 37980.41750 617500 139698.0000 65799.99750 678666.650 90600
2009-12-31 27165.2500 NaN 26944.33250 240158.325 225648.175 20739.5850 1269.833250 114825 23875.83500 17869.75000 NaN 26995.75000 442750 46043.1675 38400.00000 522750.025 37700
2010-12-31 31008.6650 NaN 28496.91750 265891.650 359347.650 22240.5850 1959.250000 NaN 24323.74750 19841.67000 NaN 33264.00000 471250 33141.3350 44966.66750 571583.325 177200
2011-12-31 32310.0850 NaN 35072.41750 280016.675 466288.475 41304.8325 2856.666500 NaN 26703.58500 12567.66750 NaN 45255.91500 467750 36700.4175 40841.66750 655500.000 184800
2012-12-31 27586.3300 NaN 55361.00000 262300.000 471999.700 45270.6700 2883.333000 NaN 25564.00000 8844.33300 NaN 68413.00000 441000 37016.0000 37000.00000 760666.700 182200

If you want, you can specify the aggregation method with the how argument, either as a string or a function (the default is the mean).
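
For instance (a minimal sketch):

unemploy.resample('A-DEC', how='median')  # yearly median instead of the mean
unemploy.resample('A-DEC', how=np.max)    # any reducing function works too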

Data Shaking: Murder Rates and Unemployment DataFrames

Let's try to bring the datasets together using the pd.merge function.

First, I would like to add a name to the Index used to store country codes.

In [9]:
sub_murder.columns.name = u"Country"
year_unempl.columns.name = u"Country"
In [14]:
print sub_murder
Country     AUT  BEL  FIN  FRA  DEU  HUN  LUX  NLD  NOR  PRT  ESP  SWE  GBR  \
Date                                                                          
1986-12-31  1.3  2.1  3.2  1.2  NaN  2.9  2.2  1.0  1.6  1.5  1.1  1.5  0.8   
1987-12-31  1.3  1.6  2.7  1.1  NaN  2.6  1.6  0.9  1.5  1.4  1.2  1.2  1.0   
1988-12-31  1.2  1.6  2.8  1.0  NaN  2.8  1.3  0.9  1.2  1.5  0.9  1.4  1.0   
1989-12-31  1.1  1.3  3.1  1.2  NaN  3.0  3.0  1.0  1.4  1.6  0.9  1.5  1.1   
1990-12-31  1.6  1.4  3.2  1.1  1.0  3.2  3.0  0.8  1.2  1.7  1.0  1.3  0.7   
1991-12-31  1.2  1.4  3.0  1.1  1.1  4.2  2.2  1.1  1.6  1.7  0.9  1.5  0.7   
1992-12-31  1.5  1.7  3.4  1.0  1.2  4.1  1.9  1.2  1.0  1.5  0.9  1.4  0.9   
1993-12-31  1.3  2.0  3.2  1.1  1.2  4.2  0.2  1.2  1.0  1.5  1.0  1.3  1.2   
1994-12-31  1.1  1.8  3.2  1.1  1.2  3.6  1.3  1.1  0.8  1.5  0.9  1.2  1.0   
1995-12-31  1.0  1.6  2.9  1.1  1.1  3.6  0.7  1.2  1.0  1.7  0.9  1.0  1.0   
1996-12-31  1.2  2.0  3.3  1.0  1.1  3.2  1.1  1.3  1.0  1.3  0.8  1.2  0.8   
1997-12-31  0.9  1.7  2.7  0.9  0.9  3.4  0.7  1.3  0.9  1.2  0.8  1.1  0.7   
1998-12-31  1.1  2.0  2.4  0.7  0.9  3.3  0.7  1.1  1.0  1.3  0.9  1.1  0.7   
1999-12-31  0.8  2.5  2.7  0.7  0.9  2.9  0.9  1.2  0.9  1.1  0.8  1.2  0.7   
2000-12-31  0.9  NaN  2.6  0.9  0.7  2.6  1.1  1.1  1.2  0.9  1.0  1.0  NaN   
2001-12-31  1.0  NaN  2.9  0.8  0.7  2.4  1.9  1.2  0.7  1.3  1.0  1.0  0.4   
2002-12-31  0.9  NaN  2.5  0.8  0.7  2.5  1.8  1.2  0.9  1.7  1.0  1.2  0.4   
2003-12-31  0.6  1.4  1.8  0.7  0.7  2.0  0.9  1.2  1.0  1.6  1.0  0.9  0.5   
2004-12-31  0.7  1.7  2.4  0.8  0.6  2.1  0.5  1.2  0.9  NaN  1.3  1.1  0.5   
2005-12-31  0.8  1.5  1.9  0.7  0.5  1.9  1.5  1.1  0.7  NaN  0.9  0.9  0.4   
2006-12-31  0.8  1.4  2.0  0.7  0.6  1.9  1.4  0.8  1.0  NaN  0.8  0.9  0.4   
2007-12-31  0.6  1.5  2.2  0.6  0.6  1.7  1.2  0.9  0.7  1.0  0.7  1.2  0.4   
2008-12-31  0.6  1.2  2.2  0.7  0.5  2.0  1.1  0.9  0.6  1.3  0.8  0.8  0.3   
2009-12-31  0.7  1.2  1.9  0.8  0.5  1.4  0.9  0.9  0.6  0.9  0.7  0.9  0.3   
2010-12-31  0.5  NaN  1.9  NaN  0.6  1.4  2.1  0.9  0.7  1.2  0.7  1.0  0.3   
2011-12-31  0.5  NaN  1.8  NaN  0.5  1.5  0.4  0.9  2.3  0.9  0.7  NaN  NaN   

Country     CZE  POL  JPN  AUS  
Date                            
1986-12-31  1.1  1.7  0.9  1.9  
1987-12-31  1.1  1.9  0.8  2.1  
1988-12-31  0.8  1.9  0.8  2.0  
1989-12-31  1.0  2.2  0.7  1.8  
1990-12-31  2.0  3.3  0.6  2.1  
1991-12-31  1.9  3.1  0.6  1.9  
1992-12-31  2.0  3.2  0.6  1.7  
1993-12-31  2.2  2.9  0.7  1.9  
1994-12-31  2.3  3.2  0.6  1.8  
1995-12-31  1.8  3.0  0.6  1.6  
1996-12-31  1.7  2.8  0.6  1.7  
1997-12-31  1.6  NaN  0.6  1.7  
1998-12-31  1.6  NaN  0.6  1.6  
1999-12-31  1.4  2.4  0.6  1.5  
2000-12-31  1.5  2.2  0.6  1.6  
2001-12-31  1.3  1.8  0.6  1.6  
2002-12-31  1.3  1.8  0.6  1.4  
2003-12-31  1.3  1.5  0.5  1.3  
2004-12-31  1.2  1.5  0.5  0.8  
2005-12-31  0.9  1.5  0.5  0.9  
2006-12-31  1.0  1.5  0.5  0.7  
2007-12-31  1.0  1.4  0.4  1.0  
2008-12-31  0.7  1.2  0.4  1.2  
2009-12-31  0.8  1.1  0.4  1.2  
2010-12-31  0.8  0.9  0.3  1.1  
2011-12-31  0.8  1.0  NaN  0.8  
In [15]:
print year_unempl
Country            AUT         BEL          FIN         FRA         DEU  \
Date                                                                      
1986-12-31  24692.6650  17739.3325  12438.58075         NaN  142974.775   
1987-12-31  26836.1650  14739.1675  12430.83250         NaN  157548.725   
1988-12-31  31184.0850  21508.8300  17358.33250         NaN  174979.050   
1989-12-31  45598.6650  23157.1675  30380.00000   99923.160  235447.675   
1990-12-31  55621.5000  18805.3325  26925.75000  100990.065  296489.150   
1991-12-31  49447.5825  16227.7525  13380.16750   93521.750  335004.200   
1992-12-31  44125.8325  16886.2500   7051.41675  100831.400  326219.100   
1993-12-31  32906.0850  14583.3350   5935.16650   99241.065  255561.825   
1994-12-31  30196.6675  19049.0850   7362.33325  132958.250  264042.800   
1995-12-31  24986.1675  19652.4175   8305.33425  167718.575  295787.875   
1996-12-31  19430.4975  21262.5025  10131.50000  188449.250  302137.600   
1997-12-31  19018.5850  24568.6675  13690.33425  209466.650  311882.825   
1998-12-31  23088.1675  36319.5850  16842.25250  226316.650  390079.575   
1999-12-31  31216.4175  43985.9175  14627.91575  246874.975  422541.850   
2000-12-31  35495.2525  53144.4150  17120.16500  263016.650  452102.250   
2001-12-31  29669.8325  46386.0850  19637.83500  252341.650  430995.425   
2002-12-31  23189.0025  41156.7500  21591.58000  245508.350  372452.325   
2003-12-31  21716.4150  40471.0825  23320.83250  241716.650  266281.150   
2004-12-31  23774.2500  47681.3300  23479.25000  255758.350  206031.575   
2005-12-31  26208.5000         NaN  29101.75000  276575.000  255758.250   
2006-12-31  32912.4175         NaN  33991.92000  283608.350  354286.750   
2007-12-31  38213.9175         NaN  40167.25250  302358.350  423431.600   
2008-12-31  37497.9175         NaN  37493.08250  292975.000  389045.000   
2009-12-31  27165.2500         NaN  26944.33250  240158.325  225648.175   
2010-12-31  31008.6650         NaN  28496.91750  265891.650  359347.650   
2011-12-31  32310.0850         NaN  35072.41750  280016.675  466288.475   
2012-12-31  27586.3300         NaN  55361.00000  262300.000  471999.700   

Country            HUN          LUX     NLD          NOR          PRT  \
Date                                                                    
1986-12-31         NaN   235.000000     NaN  10493.41675   4275.00000   
1987-12-31         NaN   245.333350     NaN  12400.41500   7625.00000   
1988-12-31         NaN   216.250000   92625   8686.08350   9983.33400   
1989-12-31         NaN   201.000000   92625   6906.00100  10908.33250   
1990-12-31         NaN   177.666675  115400   6595.66675   9458.33350   
1991-12-31         NaN   153.749975   95550   6515.91650   8508.33350   
1992-12-31  21751.6650   159.000000   67200   6421.83325   7983.33325   
1993-12-31  34375.4150   125.583335   39275   7417.58325   6483.33350   
1994-12-31  35485.7475   271.250025   41750   7666.00000   5911.66675   
1995-12-31  28679.9150   437.666650   57000   8789.83325   7968.33350   
1996-12-31  38297.2500  1107.250175   68825   9990.24850   8622.83325   
1997-12-31  42543.9150   931.416750   89500  14057.33500  10998.66675   
1998-12-31  48260.0000  1443.250000  125725  18608.33000  12930.83250   
1999-12-31  51271.3325  1529.666750  163150  17878.16750  13732.50000   
2000-12-31  49999.7525  1364.333250  190150  18424.50000  15141.00000   
2001-12-31  43366.4175  1380.916500  176550  15191.41500  12220.25250   
2002-12-31  44602.5000   940.000000  130000  12225.41750   9727.83425   
2003-12-31  47239.4150   841.833350   96250  11114.50000  10227.33500   
2004-12-31  48223.4150  1014.083500  111650  10707.74900   9693.24825   
2005-12-31  41613.5000  1210.416750  143525  13321.08250   9057.83250   
2006-12-31  41676.4975  1750.750000  191750  18894.83250  10359.75175   
2007-12-31  29933.3350  2473.333250  227725  24302.00250  14402.58500   
2008-12-31  25386.0000  2080.500000  213325  26571.16750  16043.83500   
2009-12-31  20739.5850  1269.833250  114825  23875.83500  17869.75000   
2010-12-31  22240.5850  1959.250000     NaN  24323.74750  19841.67000   
2011-12-31  41304.8325  2856.666500     NaN  26703.58500  12567.66750   
2012-12-31  45270.6700  2883.333000     NaN  25564.00000   8844.33300   

Country             ESP          SWE     GBR          CZE          POL  \
Date                                                                     
1986-12-31   46104.4975  39000.00000  186850          NaN          NaN   
1987-12-31   46297.4200  46141.66750  234800          NaN          NaN   
1988-12-31   55248.5850  52108.33500  245850          NaN          NaN   
1989-12-31   58452.7475  49950.00000  217525          NaN          NaN   
1990-12-31   51657.3350  41525.00250  172425   61912.6700  44291.66500   
1991-12-31   41445.5000  18550.00000  116025   41269.6650  44208.33250   
1992-12-31   31678.9150  10566.66650  116175   77535.2475  31708.33250   
1993-12-31   20745.0850   8616.66675  126975   68332.0000  32033.33250   
1994-12-31   19956.4150  11833.33325  157425   74074.9975  39050.00250   
1995-12-31   28216.0825  14966.66500  181350   91248.1700  38525.00000   
1996-12-31   49616.1650  14041.66675  223925   98727.4900  30733.33250   
1997-12-31   49468.1675  16150.00000  279850   77260.5850  26008.33000   
1998-12-31   75107.7475  23075.00000  296625   55423.5025  16041.66750   
1999-12-31  143101.8350  25775.00000  307725   35563.3325  12374.99825   
2000-12-31  106791.6600  35108.33250  357975   46700.5000  12358.33425   
2001-12-31  116460.9250  27350.00250  586400   58177.5000   9866.66750   
2002-12-31  130919.8500  29209.08500  605500   48117.0850  11941.66825   
2003-12-31  149534.2250  25263.00000  588250   42488.5000  16983.33500   
2004-12-31  150802.9000  20859.41500  637750   46054.0025  22108.33500   
2005-12-31  161163.0000  22348.41500  621500   55691.9150  29708.33250   
2006-12-31          NaN  32414.83250  601500   84235.0825  47491.66750   
2007-12-31          NaN  42312.66500  660500  123978.1750  66650.00250   
2008-12-31          NaN  37980.41750  617500  139698.0000  65799.99750   
2009-12-31          NaN  26995.75000  442750   46043.1675  38400.00000   
2010-12-31          NaN  33264.00000  471250   33141.3350  44966.66750   
2011-12-31          NaN  45255.91500  467750   36700.4175  40841.66750   
2012-12-31          NaN  68413.00000  441000   37016.0000  37000.00000   

Country            JPN     AUS  
Date                            
1986-12-31  380499.975   64725  
1987-12-31  436333.325   68675  
1988-12-31  558666.650   77750  
1989-12-31  618166.650   83675  
1990-12-31  644500.000   58900  
1991-12-31  634750.000   31600  
1992-12-31  554083.350   33875  
1993-12-31  472750.025   43525  
1994-12-31  455000.000   74000  
1995-12-31  474333.350   74100  
1996-12-31  530916.675   79450  
1997-12-31  558583.325   84750  
1998-12-31  492333.325   98325  
1999-12-31  488416.675  101100  
2000-12-31  585833.325  115825  
2001-12-31  594916.650   92925  
2002-12-31  598583.325   96825  
2003-12-31  670166.675  106325  
2004-12-31  761666.675  123275  
2005-12-31  825583.325  140050  
2006-12-31  860750.000  153850  
2007-12-31  805666.650  171525  
2008-12-31  678666.650   90600  
2009-12-31  522750.025   37700  
2010-12-31  571583.325  177200  
2011-12-31  655500.000  184800  
2012-12-31  760666.700  182200  

Then, use the pd.merge function to create a single DataFrame with all values.

  1. You want to merge the data along their Index.
  2. You want to distinguish the "murder rates" values from the "unemployment" values for each country.
In [10]:
dfa = pd.merge(sub_murder, year_unempl, left_index=True, right_index=True)
dfa
Out[10]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26 entries, 1986-12-31 00:00:00 to 2011-12-31 00:00:00
Freq: A-DEC
Data columns (total 34 columns):
AUT_x    26  non-null values
BEL_x    21  non-null values
FIN_x    26  non-null values
FRA_x    24  non-null values
DEU_x    22  non-null values
HUN_x    26  non-null values
LUX_x    26  non-null values
NLD_x    26  non-null values
NOR_x    26  non-null values
PRT_x    23  non-null values
ESP_x    26  non-null values
SWE_x    25  non-null values
GBR_x    24  non-null values
CZE_x    26  non-null values
POL_x    24  non-null values
JPN_x    25  non-null values
AUS_x    26  non-null values
AUT_y    26  non-null values
BEL_y    19  non-null values
FIN_y    26  non-null values
FRA_y    23  non-null values
DEU_y    26  non-null values
HUN_y    20  non-null values
LUX_y    26  non-null values
NLD_y    22  non-null values
NOR_y    26  non-null values
PRT_y    26  non-null values
ESP_y    20  non-null values
SWE_y    26  non-null values
GBR_y    26  non-null values
CZE_y    22  non-null values
POL_y    22  non-null values
JPN_y    26  non-null values
AUS_y    26  non-null values
dtypes: float64(34)

As you can see, pandas distinguishes the overlapping left and right labels with the suffixes (_x, _y) by default. Let's be more explicit.

In [11]:
dfa = pd.merge(sub_murder, year_unempl, left_index=True, right_index=True,
               suffixes=('_MUR', '_UNP'))
dfa
Out[11]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26 entries, 1986-12-31 00:00:00 to 2011-12-31 00:00:00
Freq: A-DEC
Data columns (total 34 columns):
AUT_MUR    26  non-null values
BEL_MUR    21  non-null values
FIN_MUR    26  non-null values
FRA_MUR    24  non-null values
DEU_MUR    22  non-null values
HUN_MUR    26  non-null values
LUX_MUR    26  non-null values
NLD_MUR    26  non-null values
NOR_MUR    26  non-null values
PRT_MUR    23  non-null values
ESP_MUR    26  non-null values
SWE_MUR    25  non-null values
GBR_MUR    24  non-null values
CZE_MUR    26  non-null values
POL_MUR    24  non-null values
JPN_MUR    25  non-null values
AUS_MUR    26  non-null values
AUT_UNP    26  non-null values
BEL_UNP    19  non-null values
FIN_UNP    26  non-null values
FRA_UNP    23  non-null values
DEU_UNP    26  non-null values
HUN_UNP    20  non-null values
LUX_UNP    26  non-null values
NLD_UNP    22  non-null values
NOR_UNP    26  non-null values
PRT_UNP    26  non-null values
ESP_UNP    20  non-null values
SWE_UNP    26  non-null values
GBR_UNP    26  non-null values
CZE_UNP    22  non-null values
POL_UNP    22  non-null values
JPN_UNP    26  non-null values
AUS_UNP    26  non-null values
dtypes: float64(34)

Now you have all the values in the same DataFrame. You would like to sort the labels in order to have each COUNTRY_CODE_MUR next to its COUNTRY_CODE_UNP.

In [19]:
dfa.sort_index(axis=1)
Out[19]:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 26 entries, 1986-12-31 00:00:00 to 2011-12-31 00:00:00
Freq: A-DEC
Data columns (total 34 columns):
AUS_MUR    26  non-null values
AUS_UNP    26  non-null values
AUT_MUR    26  non-null values
AUT_UNP    26  non-null values
BEL_MUR    21  non-null values
BEL_UNP    19  non-null values
CZE_MUR    26  non-null values
CZE_UNP    22  non-null values
DEU_MUR    22  non-null values
DEU_UNP    26  non-null values
ESP_MUR    26  non-null values
ESP_UNP    20  non-null values
FIN_MUR    26  non-null values
FIN_UNP    26  non-null values
FRA_MUR    24  non-null values
FRA_UNP    23  non-null values
GBR_MUR    24  non-null values
GBR_UNP    26  non-null values
HUN_MUR    26  non-null values
HUN_UNP    20  non-null values
JPN_MUR    25  non-null values
JPN_UNP    26  non-null values
LUX_MUR    26  non-null values
LUX_UNP    26  non-null values
NLD_MUR    26  non-null values
NLD_UNP    22  non-null values
NOR_MUR    26  non-null values
NOR_UNP    26  non-null values
POL_MUR    24  non-null values
POL_UNP    22  non-null values
PRT_MUR    23  non-null values
PRT_UNP    26  non-null values
SWE_MUR    25  non-null values
SWE_UNP    26  non-null values
dtypes: float64(34)

Take 4 arbitrary countries and display the values.

In [12]:
clist = ['NOR', 'FRA', 'GBR', 'DEU']
clist = ['_'.join([c, 'MUR']) for c in clist] + ['_'.join([c, 'UNP']) for c in clist]
clist_sorted = np.sort(clist)  # the bare 'sort' name only exists with pylab; np.sort returns a sorted copy
print clist
print clist_sorted
['NOR_MUR', 'FRA_MUR', 'GBR_MUR', 'DEU_MUR', 'NOR_UNP', 'FRA_UNP', 'GBR_UNP', 'DEU_UNP']
['DEU_MUR' 'DEU_UNP' 'FRA_MUR' 'FRA_UNP' 'GBR_MUR' 'GBR_UNP' 'NOR_MUR'
 'NOR_UNP']
In [23]:
dfa[clist_sorted]
Out[23]:
Country DEU_MUR DEU_UNP FRA_MUR FRA_UNP GBR_MUR GBR_UNP NOR_MUR NOR_UNP
Date
1986-12-31 NaN 142974.775 1.2 NaN 0.8 186850 1.6 10493.41675
1987-12-31 NaN 157548.725 1.1 NaN 1.0 234800 1.5 12400.41500
1988-12-31 NaN 174979.050 1.0 NaN 1.0 245850 1.2 8686.08350
1989-12-31 NaN 235447.675 1.2 99923.160 1.1 217525 1.4 6906.00100
1990-12-31 1.0 296489.150 1.1 100990.065 0.7 172425 1.2 6595.66675
1991-12-31 1.1 335004.200 1.1 93521.750 0.7 116025 1.6 6515.91650
1992-12-31 1.2 326219.100 1.0 100831.400 0.9 116175 1.0 6421.83325
1993-12-31 1.2 255561.825 1.1 99241.065 1.2 126975 1.0 7417.58325
1994-12-31 1.2 264042.800 1.1 132958.250 1.0 157425 0.8 7666.00000
1995-12-31 1.1 295787.875 1.1 167718.575 1.0 181350 1.0 8789.83325
1996-12-31 1.1 302137.600 1.0 188449.250 0.8 223925 1.0 9990.24850
1997-12-31 0.9 311882.825 0.9 209466.650 0.7 279850 0.9 14057.33500
1998-12-31 0.9 390079.575 0.7 226316.650 0.7 296625 1.0 18608.33000
1999-12-31 0.9 422541.850 0.7 246874.975 0.7 307725 0.9 17878.16750
2000-12-31 0.7 452102.250 0.9 263016.650 NaN 357975 1.2 18424.50000
2001-12-31 0.7 430995.425 0.8 252341.650 0.4 586400 0.7 15191.41500
2002-12-31 0.7 372452.325 0.8 245508.350 0.4 605500 0.9 12225.41750
2003-12-31 0.7 266281.150 0.7 241716.650 0.5 588250 1.0 11114.50000
2004-12-31 0.6 206031.575 0.8 255758.350 0.5 637750 0.9 10707.74900
2005-12-31 0.5 255758.250 0.7 276575.000 0.4 621500 0.7 13321.08250
2006-12-31 0.6 354286.750 0.7 283608.350 0.4 601500 1.0 18894.83250
2007-12-31 0.6 423431.600 0.6 302358.350 0.4 660500 0.7 24302.00250
2008-12-31 0.5 389045.000 0.7 292975.000 0.3 617500 0.6 26571.16750
2009-12-31 0.5 225648.175 0.8 240158.325 0.3 442750 0.6 23875.83500
2010-12-31 0.6 359347.650 NaN 265891.650 0.3 471250 0.7 24323.74750
2011-12-31 0.5 466288.475 NaN 280016.675 NaN 467750 2.3 26703.58500

You can specify the how argument if you don't want to drop the dates which do not occur in both DataFrames: how='outer' keeps the union of the two indexes.

In [22]:
pd.merge(sub_murder, year_unempl, left_index=True, right_index=True,
               suffixes=('_MUR', '_UNP'), how='outer')[clist_sorted]
Out[22]:
Country DEU_MUR DEU_UNP FRA_MUR FRA_UNP GBR_MUR GBR_UNP NOR_MUR NOR_UNP
Date
1986-12-31 NaN 142974.775 1.2 NaN 0.8 186850 1.6 10493.41675
1987-12-31 NaN 157548.725 1.1 NaN 1.0 234800 1.5 12400.41500
1988-12-31 NaN 174979.050 1.0 NaN 1.0 245850 1.2 8686.08350
1989-12-31 NaN 235447.675 1.2 99923.160 1.1 217525 1.4 6906.00100
1990-12-31 1.0 296489.150 1.1 100990.065 0.7 172425 1.2 6595.66675
1991-12-31 1.1 335004.200 1.1 93521.750 0.7 116025 1.6 6515.91650
1992-12-31 1.2 326219.100 1.0 100831.400 0.9 116175 1.0 6421.83325
1993-12-31 1.2 255561.825 1.1 99241.065 1.2 126975 1.0 7417.58325
1994-12-31 1.2 264042.800 1.1 132958.250 1.0 157425 0.8 7666.00000
1995-12-31 1.1 295787.875 1.1 167718.575 1.0 181350 1.0 8789.83325
1996-12-31 1.1 302137.600 1.0 188449.250 0.8 223925 1.0 9990.24850
1997-12-31 0.9 311882.825 0.9 209466.650 0.7 279850 0.9 14057.33500
1998-12-31 0.9 390079.575 0.7 226316.650 0.7 296625 1.0 18608.33000
1999-12-31 0.9 422541.850 0.7 246874.975 0.7 307725 0.9 17878.16750
2000-12-31 0.7 452102.250 0.9 263016.650 NaN 357975 1.2 18424.50000
2001-12-31 0.7 430995.425 0.8 252341.650 0.4 586400 0.7 15191.41500
2002-12-31 0.7 372452.325 0.8 245508.350 0.4 605500 0.9 12225.41750
2003-12-31 0.7 266281.150 0.7 241716.650 0.5 588250 1.0 11114.50000
2004-12-31 0.6 206031.575 0.8 255758.350 0.5 637750 0.9 10707.74900
2005-12-31 0.5 255758.250 0.7 276575.000 0.4 621500 0.7 13321.08250
2006-12-31 0.6 354286.750 0.7 283608.350 0.4 601500 1.0 18894.83250
2007-12-31 0.6 423431.600 0.6 302358.350 0.4 660500 0.7 24302.00250
2008-12-31 0.5 389045.000 0.7 292975.000 0.3 617500 0.6 26571.16750
2009-12-31 0.5 225648.175 0.8 240158.325 0.3 442750 0.6 23875.83500
2010-12-31 0.6 359347.650 NaN 265891.650 0.3 471250 0.7 24323.74750
2011-12-31 0.5 466288.475 NaN 280016.675 NaN 467750 2.3 26703.58500
2012-12-31 NaN 471999.700 NaN 262300.000 NaN 441000 NaN 25564.00000

If you want to plot or better compare these data, you should normalize some values --- for instance, retrieve the total population (from OECD stats) and use it to normalize the unemployment data.
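
As a sketch of what that normalization could look like --- assuming a hypothetical population DataFrame with the same country columns and the same yearly DatetimeIndex as year_unempl:

def per_100000(values, population):
    """Normalize absolute counts into rates per 100,000 persons.

    Relies on pandas aligning the two DataFrames on their shared
    index and columns before dividing.
    """
    return values / population * 100000

# rate_unempl = per_100000(year_unempl, population)  # 'population' is hypothetical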

Maybe I'll continue to 'play' with these data to show you other interesting and powerful pandas features: group-by, more plotting, etc.

For now, let's finish by showing which country has the maximum murder rate over time...

In [43]:
print sub_murder.idxmax(axis=1)
Date
1986-12-31    FIN
1987-12-31    FIN
1988-12-31    FIN
1989-12-31    FIN
1990-12-31    POL
1991-12-31    HUN
1992-12-31    HUN
1993-12-31    HUN
1994-12-31    HUN
1995-12-31    HUN
1996-12-31    FIN
1997-12-31    HUN
1998-12-31    HUN
1999-12-31    HUN
2000-12-31    FIN
2001-12-31    FIN
2002-12-31    FIN
2003-12-31    HUN
2004-12-31    FIN
2005-12-31    FIN
2006-12-31    FIN
2007-12-31    FIN
2008-12-31    FIN
2009-12-31    FIN
2010-12-31    LUX
2011-12-31    NOR
Freq: A-DEC, dtype: object

... and the occurrences of each 'max' country in this Series.

In [44]:
print sub_murder.idxmax(axis=1).value_counts()
FIN    14
HUN     9
NOR     1
POL     1
LUX     1
dtype: int64
In [13]:
sub_murder.idxmax(axis=1).value_counts().plot(kind='bar')
Out[13]:
<matplotlib.axes.AxesSubplot at 0x7f2f14135050>
In [14]:
sub_murder.max(axis=1).plot(style='r')
Out[14]:
<matplotlib.axes.AxesSubplot at 0x49e4690>

Clone from http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/ with Mercurial:

hg clone http://hg.logilab.org/users/dag/blog/2013/quandl-data-pandas/

The repository contains this notebook, its HTML conversion and the two CSV files.

Any comments, improvement ideas, suggestions or questions are welcome.

See you. Damien