0. Import & parse CSV data

In [ ]:

%pylab inline

In [ ]:

import numpy as np
#from matplotlib.pylab import plot, legend
import pandas as pd

In [ ]:

# read trends.csv with pandas. The first line is the header: 
## week_start,week_end,textbooks,spring_break,kayak,skiing,global_warming
## 2004-01-04,2004-01-10,65,36,17,85,13
## 2004-01-11,2004-01-17,79,40,18,84,18


# Giving it index_col=0 uses the week_start as the index 
# and parse_dates=True parses them as a date.
df = pd.read_csv('trends.csv',index_col=0,parse_dates=True)

In [ ]:

df.

Here we read in the csv file, specifying that the first column (which was week_start) should be the index column. By setting parse_dates=True, we told pandas to parse the date and store each index as a Timestamp object.

The first line in the csv file was interpreted as the header, and these strings were stored as the column names. We can see the names and data types of each column with df.dtypes:

In [ ]:

df.dtypes

Let's take a look at the first few entries of the pandas dataframe to see how it looks:

In [ ]:

df.head()

In [ ]:

# del df["week_end"]

NB: In the table above it looks like the first row is "week_start" followed by a bunch of blank entries, but in fact this is just showing that the label for the indices is called "week_start". We can grab the first row of the data frame to verify what it contains:

In [ ]:

print(df.iloc[0])

Note that iloc indexes by integer position whereas loc will index by the actual index key, e.g.

In [ ]:

print(df.loc['2004-01-04'])

I. Plot trends in single frame, with legend

We can plot each column of interest against the index (which in this case is week_start) one at a time:

In [ ]:

df.plot(figsize=(10,10))
# if we want to add a legend, must import legend from matplotlib
#legend()

Or we could slice the columns of interest with df.loc and plot them all in one line. Pandas will even do the legend for you!

In [ ]:

df.loc[:,['textbooks','spring_break','kayak','skiing']].plot()

Or we could slice the columns of interest by integer position using df.iloc

In [ ]:

df.iloc[:,1:5].plot()

In [ ]:

# Let’s look at some summary statistics:
df.describe()

II. Determine maxima and minima of each trend term

Determine in which week of each year (for all five search trends including "global warming") that search reached its peak and its minimum.

We can query the pandas data frame to return to us a slice of the data frame where a year is a particular value

In [ ]:

sub_df = df[df.index.year == 2015]

In [ ]:

# look at the first few values of the data frame to see that it contains just 2006 values
print(sub_df.head())

In [ ]:

# we could find where there was a maximum value in a series with idxmax
sub_df.kayak.idxmax()

In [ ]:

# or we could use boolean indexing
sub_df.loc[sub_df.kayak == sub_df.kayak.max()].index

Find the week in each year that each trend was maximum and minimum. Let's loop through each key of interest and each year, and print out the max and min values of each column in that year.

In [ ]:

keylist = ['textbooks','spring_break','kayak','skiing','global_warming']

In [ ]:

for key in keylist:
    print('Searches for "' + key + '":')
    for yr in range(2004,2016):
        # our indices are timestamp objects which have nice properties such as being able to grab the year, month, day, week, etc
        sub_df = df.loc[df.index.year == yr]
        trend = sub_df[key]
        peak_start_timestamp = sub_df.loc[trend == max(trend)].index[0] # taking the first element in case there are multiple with the max value
        peak_start = str(peak_start_timestamp.month) + '-' + str(peak_start_timestamp.day)
        peak_week = str(peak_start_timestamp.week) 
        peak_end = sub_df.loc[trend == max(trend)].week_end.iloc[-1] # taking the first element in case there are multiple with the max value
        peak_end = peak_end[5:].lstrip('0') # take only the date, not the year
        
        # similar process for minimum values
        min_start_timestamp = sub_df.loc[trend == min(trend)].index[0]
        min_start = str(min_start_timestamp.month) + '-' + str(min_start_timestamp.day)
        min_week = str(min_start_timestamp.week)
        min_end = sub_df.loc[trend == min(trend)].week_end.iloc[0]
        min_end = min_end[5:].lstrip('0') 
        
        print('  in ' + str(yr) + ' were highest from ' + peak_start + ' to ' + peak_end + " (Week " + peak_week + ")")
        print('          were lowest from ' +  min_start + ' to ' + min_end + " (Week " + min_week + ")")

For each year, list week numbers corresponding to maximum and minimum search values

III. Study scatter about median values

Let's find which search value has the largest scatter over time. Q: Is it true that $\sigma(A) > \sigma(B) \Rightarrow \sigma_\textrm{median}(A) > \sigma_\textrm{median}(B)$?

In [ ]:

def std_median(datums):
    return np.sqrt( np.sum( (datums - datums.median())**2 ) * 1.0/ len(datums) )

In [ ]:

print(df.spring_break.std(), std_median(df.spring_break))
print(df.textbooks.std(), std_median(df.textbooks))
print(df.skiing.std(), std_median(df.skiing))
print(df.kayak.std(), std_median(df.kayak))
print(df.global_warming.std(), std_median(df.global_warming))

We see that searches for kayak have the highest scatter, whereas those for textbooks remain more stable

IV. Correlation between trends

We can use cross-correlation to find the time delay between two time-series. The maximum of the cross-correlation function gives the time when the two signals are best aligned (in other words, the lag between the two signals).

In [ ]:

result = np.correlate(df['spring_break'],df['skiing'], mode='full')
gap = np.arange(result.size) - result.size/2
plot(gap,result)

We can use the numpy function argmax to get the index at which this value is maximum (note for pandas expert - result appears still to be a pandas object and maintains attributes such as idxmax, but they don't work?)

In [ ]:

print(gap[result.argmax()]) # Find the x value corresponding to the maximum y value

Let's do the same exercise to find the lag between global warming and skiing searches

In [ ]:

result2 = np.correlate(df.global_warming,df.skiing, mode='full')
gap2 = np.arange(result2.size) - result2.size/2
plot(gap2,result2)

In [ ]:

print(gap2[result2.argmax()])  # Find the x value corresponding to the maximum y value

In [ ]:

mid_peak_min = len(gap2)/2 - 26
mid_peak_max = len(gap2)/2 + 26
print(gap2[mid_peak_min:mid_peak_max][result2[mid_peak_min:mid_peak_max].argmax()])

In [ ]: