A Python exercise motivated by this nice article from Hadley Wickham: Tidy Data. Reading the paper is a good prerequisite for this notebook.
The code in this notebook was originally developed and commented by Jean-François Puget in a blog post on "Tidy Data In Python".
Let's start.
We only need two Python packages here.
import pandas as pd
import numpy as np
Let's show readers which pandas version we are using.
pd.__version__
'0.18.1'
A messy data set. "Messy" is used as defined in Hadley Wickham's paper: any data set that is not tidy. Messy data sets are often convenient for showing to humans, as they are compact. This form is often used in publications.
messy = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'],
'Last' : ['Smith', 'Doe', 'Johnson'],
'Treatment A' : [np.nan, 16, 3],
'Treatment B' : [2, 11, 1]})
messy
First | Last | Treatment A | Treatment B | |
---|---|---|---|---|
0 | John | Smith | NaN | 2 |
1 | Jane | Doe | 16.0 | 11 |
2 | Mary | Johnson | 3.0 | 1 |
People may prefer the transposed view of that data set.
messy.T
0 | 1 | 2 | |
---|---|---|---|
First | John | Jane | Mary |
Last | Smith | Doe | Johnson |
Treatment A | NaN | 16 | 3 |
Treatment B | 2 | 11 | 1 |
Messy data sets aren't that easy to process with statistical or machine learning packages. These often assume that examples are provided as rows in a matrix whose columns are example features. That is precisely what a tidy data set is. Applying the melt() function to the messy data set creates a tidy version of it.
tidy = pd.melt(messy, id_vars=['First','Last'])
tidy
First | Last | variable | value | |
---|---|---|---|---|
0 | John | Smith | Treatment A | NaN |
1 | Jane | Doe | Treatment A | 16.0 |
2 | Mary | Johnson | Treatment A | 3.0 |
3 | John | Smith | Treatment B | 2.0 |
4 | Jane | Doe | Treatment B | 11.0 |
5 | Mary | Johnson | Treatment B | 1.0 |
The values are fine, but the column names aren't really meaningful. Fortunately, the melt() function has arguments for renaming them.
tidy = pd.melt(messy, id_vars=['First','Last'], var_name='treatment', value_name='result')
tidy
First | Last | treatment | result | |
---|---|---|---|---|
0 | John | Smith | Treatment A | NaN |
1 | Jane | Doe | Treatment A | 16.0 |
2 | Mary | Johnson | Treatment A | 3.0 |
3 | John | Smith | Treatment B | 2.0 |
4 | Jane | Doe | Treatment B | 11.0 |
5 | Mary | Johnson | Treatment B | 1.0 |
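One payoff of the tidy form worth noting here (a small illustration of my own, not from the original article): with one observation per row, filtering out missing measurements is a one-liner.

```python
import numpy as np
import pandas as pd

messy = pd.DataFrame({'First': ['John', 'Jane', 'Mary'],
                      'Last': ['Smith', 'Doe', 'Johnson'],
                      'Treatment A': [np.nan, 16, 3],
                      'Treatment B': [2, 11, 1]})
tidy = pd.melt(messy, id_vars=['First', 'Last'],
               var_name='treatment', value_name='result')

# With one observation per row, dropping missing results is trivial
complete = tidy.dropna(subset=['result'])
print(len(complete))  # 5 of the 6 observations have a result
```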
messy = pd.DataFrame({'row' : ['A', 'B', 'C'],
'a' : [1, 2, 3],
'b' : [4, 5, 6],
'c' : [7, 8, 9]})
messy
a | b | c | row | |
---|---|---|---|---|
0 | 1 | 4 | 7 | A |
1 | 2 | 5 | 8 | B |
2 | 3 | 6 | 9 | C |
pd.melt(messy, id_vars='row')
row | variable | value | |
---|---|---|---|
0 | A | a | 1 |
1 | B | a | 2 |
2 | C | a | 3 |
3 | A | b | 4 |
4 | B | b | 5 |
5 | C | b | 6 |
6 | A | c | 7 |
7 | B | c | 8 |
8 | C | c | 9 |
tidy = pd.melt(messy, id_vars='row', var_name='dimension', value_name='length')
tidy
row | dimension | length | |
---|---|---|---|
0 | A | a | 1 |
1 | B | a | 2 |
2 | C | a | 3 |
3 | A | b | 4 |
4 | B | b | 5 |
5 | C | b | 6 |
6 | A | c | 7 |
7 | B | c | 8 |
8 | C | c | 9 |
The pivot() function is almost the inverse of melt().
messy1 = tidy.pivot(index='row',columns='dimension',values='length')
messy1
dimension | a | b | c |
---|---|---|---|
row | |||
A | 1 | 4 | 7 |
B | 2 | 5 | 8 |
C | 3 | 6 | 9 |
This is almost the same as the original dataframe, except that row is used as the index. We can move it back to a column easily.
messy1.reset_index(inplace=True)
messy1
dimension | row | a | b | c |
---|---|---|---|---|
0 | A | 1 | 4 | 7 |
1 | B | 2 | 5 | 8 |
2 | C | 3 | 6 | 9 |
The last step is to remove the name of the set of columns.
messy1.columns.name = ''
messy1
row | a | b | c | |
---|---|---|---|---|
0 | A | 1 | 4 | 7 |
1 | B | 2 | 5 | 8 |
2 | C | 3 | 6 | 9 |
This is the same as the original dataframe, up to column reordering.
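As a quick sanity check (a sketch I added, not from the original post), we can verify the round trip programmatically: melting and then pivoting recovers the original dataframe once the columns are reordered.

```python
import pandas as pd

messy = pd.DataFrame({'row': ['A', 'B', 'C'],
                      'a': [1, 2, 3],
                      'b': [4, 5, 6],
                      'c': [7, 8, 9]})
tidy = pd.melt(messy, id_vars='row', var_name='dimension', value_name='length')

# Undo the melt: pivot, restore 'row' as a column, drop the columns name
roundtrip = tidy.pivot(index='row', columns='dimension', values='length')
roundtrip = roundtrip.reset_index()
roundtrip.columns.name = None

# Reorder columns to match the original before comparing
roundtrip = roundtrip[messy.columns]
print(roundtrip.equals(messy))
```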
This is the first issue with messy data in Hadley's paper: column headers are values, not variable names. Let's first create the dataframe used as an example. For practical reasons, it was simpler to construct its transpose first, then process it to get the data set used in the article.
messy = pd.DataFrame({'Agnostic' : [27, 34, 60, 81, 76, 137],
'Atheist' : [12, 27, 37, 52, 35, 70],
'Buddhist' : [27, 21, 30, 34, 33, 58],
'Catholic' : [418, 617, 732, 670, 638, 1116],
"Don't know/refused" : [15, 14, 15, 11, 10, 35],
'Evangelical Prot' : [575, 869, 1064, 982, 881, 1486],
'Hindu' : [1, 9, 7, 9, 11, 34],
'Historically Black Prot' : [228, 244, 236, 238, 197, 223],
"Jehovah's Witness" : [20, 27, 24, 24, 21, 30],
'Jewish' : [19, 19, 25, 25, 30, 95],
})
def transpose(df, columns):
df = df.T.copy()
df.reset_index(inplace=True)
df.columns = columns
return df
messy = transpose(messy, ['religion', '<$10k', '$10-20k', '$20-30k', '$30-40k', '$40-50k', '$50-75k'])
messy
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | |
---|---|---|---|---|---|---|---|
0 | Agnostic | 27 | 34 | 60 | 81 | 76 | 137 |
1 | Atheist | 12 | 27 | 37 | 52 | 35 | 70 |
2 | Buddhist | 27 | 21 | 30 | 34 | 33 | 58 |
3 | Catholic | 418 | 617 | 732 | 670 | 638 | 1116 |
4 | Don't know/refused | 15 | 14 | 15 | 11 | 10 | 35 |
5 | Evangelical Prot | 575 | 869 | 1064 | 982 | 881 | 1486 |
6 | Hindu | 1 | 9 | 7 | 9 | 11 | 34 |
7 | Historically Black Prot | 228 | 244 | 236 | 238 | 197 | 223 |
8 | Jehovah's Witness | 20 | 27 | 24 | 24 | 21 | 30 |
9 | Jewish | 19 | 19 | 25 | 25 | 30 | 95 |
Again, the melt() function is our friend. We sort the result by religion to make it easier to read.
tidy = pd.melt(messy, id_vars = ['religion'], var_name='income', value_name='freq')
tidy.sort_values(by=['religion'], inplace=True)
tidy.head()
religion | income | freq | |
---|---|---|---|
0 | Agnostic | <$10k | 27 |
30 | Agnostic | $30-40k | 81 |
40 | Agnostic | $40-50k | 76 |
50 | Agnostic | $50-75k | 137 |
10 | Agnostic | $10-20k | 34 |
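To see why the tidy form pays off (a hypothetical illustration of my own on a small excerpt of the data): aggregating across religions now takes a single groupby.

```python
import pandas as pd

# A small excerpt of the tidy religion data, enough to illustrate
tidy = pd.DataFrame({'religion': ['Agnostic', 'Agnostic', 'Atheist', 'Atheist'],
                     'income':   ['<$10k', '$10-20k', '<$10k', '$10-20k'],
                     'freq':     [27, 34, 12, 27]})

# With one observation per row, aggregation is a one-liner
totals = tidy.groupby('income')['freq'].sum()
print(totals['<$10k'])  # 27 + 12 = 39
```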
This example is a little trickier. We first read the input data as a dataframe. The data is available at https://github.com/hadley/tidy-data/blob/master/data/tb.csv; here we read it straight from the raw URL.
Reading it is easy. We remove the new_sp_ prefix appearing in most columns, and we rename a couple of columns as well.
url = "https://raw.githubusercontent.com/hadley/tidy-data/master/data/tb.csv"
tb = pd.read_csv(url)
tb.columns = tb.columns.str.replace('new_sp_','')
tb.rename(columns = {'new_sp' : 'total', 'iso2' : 'country'},
inplace=True)
tb.head()
country | year | total | m04 | m514 | m014 | m1524 | m2534 | m3544 | m4554 | ... | f04 | f514 | f014 | f1524 | f2534 | f3544 | f4554 | f5564 | f65 | fu | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AD | 1989 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | AD | 1990 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | AD | 1991 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | AD | 1992 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | AD | 1993 | 15.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 23 columns
Let's use year 2000, and drop a few columns, to stay in sync with Wickham's article.
messy = tb[tb['year'] == 2000].copy()
messy.drop(['total','m04','m514','f04','f514'], axis=1, inplace=True)
messy.head(10)
country | year | m014 | m1524 | m2534 | m3544 | m4554 | m5564 | m65 | mu | f014 | f1524 | f2534 | f3544 | f4554 | f5564 | f65 | fu | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | AD | 2000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
36 | AE | 2000 | 2.0 | 4.0 | 4.0 | 6.0 | 5.0 | 12.0 | 10.0 | NaN | 3.0 | 16.0 | 1.0 | 3.0 | 0.0 | 0.0 | 4.0 | NaN |
60 | AF | 2000 | 52.0 | 228.0 | 183.0 | 149.0 | 129.0 | 94.0 | 80.0 | NaN | 93.0 | 414.0 | 565.0 | 339.0 | 205.0 | 99.0 | 36.0 | NaN |
87 | AG | 2000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN |
136 | AL | 2000 | 2.0 | 19.0 | 21.0 | 14.0 | 24.0 | 19.0 | 16.0 | NaN | 3.0 | 11.0 | 10.0 | 8.0 | 8.0 | 5.0 | 11.0 | NaN |
165 | AM | 2000 | 2.0 | 152.0 | 130.0 | 131.0 | 63.0 | 26.0 | 21.0 | NaN | 1.0 | 24.0 | 27.0 | 24.0 | 8.0 | 8.0 | 4.0 | NaN |
178 | AN | 2000 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN |
207 | AO | 2000 | 186.0 | 999.0 | 1003.0 | 912.0 | 482.0 | 312.0 | 194.0 | NaN | 247.0 | 1142.0 | 1091.0 | 844.0 | 417.0 | 200.0 | 120.0 | NaN |
236 | AR | 2000 | 97.0 | 278.0 | 594.0 | 402.0 | 419.0 | 368.0 | 330.0 | NaN | 121.0 | 544.0 | 479.0 | 262.0 | 230.0 | 179.0 | 216.0 | NaN |
265 | AS | 2000 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN |
messy.iloc[:,:11].head(10)
country | year | m014 | m1524 | m2534 | m3544 | m4554 | m5564 | m65 | mu | f014 | |
---|---|---|---|---|---|---|---|---|---|---|---|
10 | AD | 2000 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
36 | AE | 2000 | 2.0 | 4.0 | 4.0 | 6.0 | 5.0 | 12.0 | 10.0 | NaN | 3.0 |
60 | AF | 2000 | 52.0 | 228.0 | 183.0 | 149.0 | 129.0 | 94.0 | 80.0 | NaN | 93.0 |
87 | AG | 2000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | NaN | 1.0 |
136 | AL | 2000 | 2.0 | 19.0 | 21.0 | 14.0 | 24.0 | 19.0 | 16.0 | NaN | 3.0 |
165 | AM | 2000 | 2.0 | 152.0 | 130.0 | 131.0 | 63.0 | 26.0 | 21.0 | NaN | 1.0 |
178 | AN | 2000 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 |
207 | AO | 2000 | 186.0 | 999.0 | 1003.0 | 912.0 | 482.0 | 312.0 | 194.0 | NaN | 247.0 |
236 | AR | 2000 | 97.0 | 278.0 | 594.0 | 402.0 | 419.0 | 368.0 | 330.0 | NaN | 121.0 |
265 | AS | 2000 | NaN | NaN | NaN | NaN | 1.0 | 1.0 | NaN | NaN | NaN |
The melt() function is useful here, but it is not enough on its own. Let's apply it first.
molten = pd.melt(messy, id_vars=['country', 'year'], value_name='cases')
molten.sort_values(by=['year', 'country'], inplace=True)
molten.head(10)
country | year | variable | cases | |
---|---|---|---|---|
0 | AD | 2000 | m014 | 0.0 |
201 | AD | 2000 | m1524 | 0.0 |
402 | AD | 2000 | m2534 | 1.0 |
603 | AD | 2000 | m3544 | 0.0 |
804 | AD | 2000 | m4554 | 0.0 |
1005 | AD | 2000 | m5564 | 0.0 |
1206 | AD | 2000 | m65 | 0.0 |
1407 | AD | 2000 | mu | NaN |
1608 | AD | 2000 | f014 | NaN |
1809 | AD | 2000 | f1524 | NaN |
What isn't really nice is the encoding of sex and age range as a single string in the variable column. Let's process the dataset to create two additional columns, one for the sex and one for the age range, then remove the variable column. The tidy form also makes it easy to remove the rows where the age is u (unknown).
def parse_age(s):
    # Drop the leading sex character, then format the age range
    s = s[1:]
    if s == '65':
        return '65+'
    return s[:-2] + '-' + s[-2:]

# Drop the 'unknown age' columns for both sexes
tidy = molten[~molten['variable'].isin(['mu', 'fu'])].copy()
tidy['sex'] = tidy['variable'].apply(lambda s: s[:1])
tidy['age'] = tidy['variable'].apply(parse_age)
tidy = tidy[['country', 'year', 'sex', 'age', 'cases']]
tidy.head(10)
country | year | sex | age | cases | |
---|---|---|---|---|---|
0 | AD | 2000 | m | 0-14 | 0.0 |
201 | AD | 2000 | m | 15-24 | 0.0 |
402 | AD | 2000 | m | 25-34 | 1.0 |
603 | AD | 2000 | m | 35-44 | 0.0 |
804 | AD | 2000 | m | 45-54 | 0.0 |
1005 | AD | 2000 | m | 55-64 | 0.0 |
1206 | AD | 2000 | m | 65+ | 0.0 |
1608 | AD | 2000 | f | 0-14 | NaN |
1809 | AD | 2000 | f | 15-24 | NaN |
2010 | AD | 2000 | f | 25-34 | NaN |
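As a side note, the same parsing can be done without apply, using pandas vectorized string methods. This is a sketch of my own on a few sample codes, not part of the original post.

```python
import pandas as pd

# A few sample 'variable' codes like those produced by the melt above
var = pd.Series(['m014', 'm1524', 'm65', 'f4554'])

sex = var.str[0]
age = var.str[1:]
# '65' encodes the open-ended bracket; other codes split before the last two digits
age = age.where(age == '65', age.str[:-2] + '-' + age.str[-2:])
age = age.replace('65', '65+')
print(list(sex))  # ['m', 'm', 'm', 'f']
print(list(age))  # ['0-14', '15-24', '65+', '45-54']
```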
This example is really tricky. Let's first create the dataframe. This time, I create it from a list of rows instead of a dictionary, just for the fun of doing something a bit different.
columns = ['id', 'year', 'month', 'element', 'd1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8']
data = [['MX17004', 2010, 1, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 1, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 2, 'tmax', np.nan, 27.3, 24.1, np.nan, np.nan, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 2, 'tmin', np.nan, 14.4, 14.4, np.nan, np.nan, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 3, 'tmax', np.nan, np.nan, np.nan, np.nan, 32.1, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 3, 'tmin', np.nan, np.nan, np.nan, np.nan, 14.2, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 4, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 4, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 5, 'tmax', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, ],
['MX17004', 2010, 5, 'tmin', np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan,]
]
messy = pd.DataFrame(data=data, columns=columns); messy
id | year | month | element | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | MX17004 | 2010 | 1 | tmax | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | MX17004 | 2010 | 1 | tmin | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | MX17004 | 2010 | 2 | tmax | NaN | 27.3 | 24.1 | NaN | NaN | NaN | NaN | NaN |
3 | MX17004 | 2010 | 2 | tmin | NaN | 14.4 | 14.4 | NaN | NaN | NaN | NaN | NaN |
4 | MX17004 | 2010 | 3 | tmax | NaN | NaN | NaN | NaN | 32.1 | NaN | NaN | NaN |
5 | MX17004 | 2010 | 3 | tmin | NaN | NaN | NaN | NaN | 14.2 | NaN | NaN | NaN |
6 | MX17004 | 2010 | 4 | tmax | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | MX17004 | 2010 | 4 | tmin | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
8 | MX17004 | 2010 | 5 | tmax | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
9 | MX17004 | 2010 | 5 | tmin | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Most of the values are missing. However, filtering out the NaN values is impossible in this wide form. We need to melt the dataframe first.
molten = pd.melt(messy,
id_vars=['id', 'year','month','element',],
var_name='day');
molten.dropna(inplace=True)
molten = molten.reset_index(drop=True)
molten
id | year | month | element | day | value | |
---|---|---|---|---|---|---|
0 | MX17004 | 2010 | 2 | tmax | d2 | 27.3 |
1 | MX17004 | 2010 | 2 | tmin | d2 | 14.4 |
2 | MX17004 | 2010 | 2 | tmax | d3 | 24.1 |
3 | MX17004 | 2010 | 2 | tmin | d3 | 14.4 |
4 | MX17004 | 2010 | 3 | tmax | d5 | 32.1 |
5 | MX17004 | 2010 | 3 | tmin | d5 | 14.2 |
This dataframe is not in tidy form yet. First, the element column contains variable names. Second, the year, month, and day columns together represent a single variable: the date. Let's fix the latter problem first.
def f(row):
    # Build an ISO date string from year, month, and the 'dX' day code
    return "%d-%02d-%02d" % (row['year'], row['month'], int(row['day'][1:]))
molten['date'] = molten.apply(f,axis=1)
molten = molten[['id', 'element','value','date']]
molten
id | element | value | date | |
---|---|---|---|---|
0 | MX17004 | tmax | 27.3 | 2010-02-02 |
1 | MX17004 | tmin | 14.4 | 2010-02-02 |
2 | MX17004 | tmax | 24.1 | 2010-02-03 |
3 | MX17004 | tmin | 14.4 | 2010-02-03 |
4 | MX17004 | tmax | 32.1 | 2010-03-05 |
5 | MX17004 | tmin | 14.2 | 2010-03-05 |
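An alternative worth mentioning (a vectorized sketch of my own, on a hypothetical two-row stand-in for molten): pd.to_datetime can assemble dates directly from year/month/day columns, avoiding the row-wise apply.

```python
import pandas as pd

# A small stand-in for the molten frame, with hypothetical values
molten = pd.DataFrame({'year': [2010, 2010], 'month': [2, 3],
                       'day': ['d2', 'd5']})

# Vectorized alternative to the row-wise apply: build the date from parts
parts = molten[['year', 'month']].assign(day=molten['day'].str[1:].astype(int))
molten['date'] = pd.to_datetime(parts)
print(molten['date'].dt.strftime('%Y-%m-%d').tolist())
```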
Now we need to pivot the element column.
tidy = molten.pivot(index='date',columns='element',values='value')
tidy
element | tmax | tmin |
---|---|---|
date | ||
2010-02-02 | 27.3 | 14.4 |
2010-02-03 | 24.1 | 14.4 |
2010-03-05 | 32.1 | 14.2 |
Wait a minute. Where is the id? One way to keep it is to move the id into the index with the groupby() function, and apply pivot() inside each group.
tidy = molten.groupby('id').apply(pd.DataFrame.pivot,
index='date',
columns='element',
values='value')
tidy
element | tmax | tmin | |
---|---|---|---|
id | date | ||
MX17004 | 2010-02-02 | 27.3 | 14.4 |
2010-02-03 | 24.1 | 14.4 | |
2010-03-05 | 32.1 | 14.2 |
We are almost there. We simply have to move id back to a column with reset_index().
tidy.reset_index(inplace=True)
tidy
element | id | date | tmax | tmin |
---|---|---|---|---|
0 | MX17004 | 2010-02-02 | 27.3 | 14.4 |
1 | MX17004 | 2010-02-03 | 24.1 | 14.4 |
2 | MX17004 | 2010-03-05 | 32.1 | 14.2 |
We get rid of the element name.
tidy.columns.name = ''
tidy
id | date | tmax | tmin | |
---|---|---|---|---|
0 | MX17004 | 2010-02-02 | 27.3 | 14.4 |
1 | MX17004 | 2010-02-03 | 24.1 | 14.4 |
2 | MX17004 | 2010-03-05 | 32.1 | 14.2 |
Et Voilà!
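For completeness, a shorter route (a sketch of my own, not from the original post): pivot_table accepts a list of index columns, so id survives the reshape without the groupby detour. The default mean aggregation is harmless here since each (id, date, element) triple is unique.

```python
import pandas as pd

# A stand-in for the molten frame built above
molten = pd.DataFrame({'id': ['MX17004'] * 4,
                       'element': ['tmax', 'tmin', 'tmax', 'tmin'],
                       'value': [27.3, 14.4, 24.1, 14.4],
                       'date': ['2010-02-02', '2010-02-02',
                                '2010-02-03', '2010-02-03']})

# pivot_table takes several index columns, so id survives the reshape
tidy = molten.pivot_table(index=['id', 'date'], columns='element', values='value')
tidy = tidy.reset_index()
tidy.columns.name = None
print(tidy.shape)  # (2, 4)
```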
This example illustrates two of the problems described above. Let's create the dataframe; it is an excerpt from the Billboard top hits for 2000.
columns = ['year','artist','track','time','date entered','wk1','wk2','wk3',]
data = [[2000,"2,Pac","Baby Don't Cry","4:22","2000-02-26",87,82,72,],
[2000,"2Ge+her","The Hardest Part Of ...","3:15","2000-09-02",91,87,92,],
[2000,"3 Doors Down","Kryptonite","3:53","2000-04-08",81,70,68,],
[2000,"98^0","Give Me Just One Nig...","3:24","2000-08-19",51,39,34,],
[2000,"A*Teens","Dancing Queen","3:44","2000-07-08",97,97,96,],
[2000,"Aaliyah","I Don't Wanna","4:15","2000-01-29",84,62,51,],
[2000,"Aaliyah","Try Again","4:03","2000-03-18",59,53,38,],
[2000,"Adams,Yolanda","Open My Heart","5:30","2000-08-26",76,76,74]
]
messy = pd.DataFrame(data=data, columns=columns)
messy
year | artist | track | time | date entered | wk1 | wk2 | wk3 | |
---|---|---|---|---|---|---|---|---|
0 | 2000 | 2,Pac | Baby Don't Cry | 4:22 | 2000-02-26 | 87 | 82 | 72 |
1 | 2000 | 2Ge+her | The Hardest Part Of ... | 3:15 | 2000-09-02 | 91 | 87 | 92 |
2 | 2000 | 3 Doors Down | Kryptonite | 3:53 | 2000-04-08 | 81 | 70 | 68 |
3 | 2000 | 98^0 | Give Me Just One Nig... | 3:24 | 2000-08-19 | 51 | 39 | 34 |
4 | 2000 | A*Teens | Dancing Queen | 3:44 | 2000-07-08 | 97 | 97 | 96 |
5 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 2000-01-29 | 84 | 62 | 51 |
6 | 2000 | Aaliyah | Try Again | 4:03 | 2000-03-18 | 59 | 53 | 38 |
7 | 2000 | Adams,Yolanda | Open My Heart | 5:30 | 2000-08-26 | 76 | 76 | 74 |
This dataset is messy because there are several observations per row, in the columns wk1, wk2, and wk3. We can get one observation per row by melting the dataset.
molten = pd.melt(messy,
id_vars=['year','artist','track','time','date entered'],
var_name = 'week',
value_name = 'rank',
)
molten.sort_values(by=['date entered','week'], inplace=True)
molten.head()
year | artist | track | time | date entered | week | rank | |
---|---|---|---|---|---|---|---|
5 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 2000-01-29 | wk1 | 84 |
13 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 2000-01-29 | wk2 | 62 |
21 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 2000-01-29 | wk3 | 51 |
0 | 2000 | 2,Pac | Baby Don't Cry | 4:22 | 2000-02-26 | wk1 | 87 |
8 | 2000 | 2,Pac | Baby Don't Cry | 4:22 | 2000-02-26 | wk2 | 82 |
We can clean the dataset further, first by turning the week into a number.
molten['week'] = molten['week'].apply(lambda s: int(s[2:]))
molten.head()
year | artist | track | time | date entered | week | rank | |
---|---|---|---|---|---|---|---|
5 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 2000-01-29 | 1 | 84 |
13 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 2000-01-29 | 2 | 62 |
21 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 2000-01-29 | 3 | 51 |
0 | 2000 | 2,Pac | Baby Don't Cry | 4:22 | 2000-02-26 | 1 | 87 |
8 | 2000 | 2,Pac | Baby Don't Cry | 4:22 | 2000-02-26 | 2 | 82 |
Second, we want the starting date of the week for each observation, instead of the date the track entered the chart.
from datetime import datetime, timedelta
def increment_date(row):
    # The date for week n is the entry date shifted by (n - 1) weeks
    date = datetime.strptime(row['date entered'], "%Y-%m-%d")
    return date + timedelta(7) * (row['week'] - 1)
molten['date'] = molten.apply(increment_date, axis=1)
molten.drop('date entered', axis=1, inplace=True)
molten.head()
year | artist | track | time | week | rank | date | |
---|---|---|---|---|---|---|---|
5 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 1 | 84 | 2000-01-29 |
13 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 2 | 62 | 2000-02-05 |
21 | 2000 | Aaliyah | I Don't Wanna | 4:15 | 3 | 51 | 2000-02-12 |
0 | 2000 | 2,Pac | Baby Don't Cry | 4:22 | 1 | 87 | 2000-02-26 |
8 | 2000 | 2,Pac | Baby Don't Cry | 4:22 | 2 | 82 | 2000-03-04 |
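The same shift can be done without apply (a vectorized sketch of my own, on a hypothetical two-row slice): pd.to_timedelta builds all the offsets in one go.

```python
import pandas as pd

# A hypothetical slice of the molten billboard frame
molten = pd.DataFrame({'date entered': ['2000-01-29', '2000-01-29'],
                       'week': [1, 3]})

# Vectorized alternative: entry date plus (week - 1) times seven days
start = pd.to_datetime(molten['date entered'])
molten['date'] = start + pd.to_timedelta(7 * (molten['week'] - 1), unit='D')
print(molten['date'].dt.strftime('%Y-%m-%d').tolist())
```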
Last, this dataset is denormalized: the information that is constant for a track (year, artist, track, time) is repeated for every week. This is fine for most statistical and machine learning packages, but we might want to normalize it by moving that repeated information into a separate table.
tidy_track = molten[['year','artist','track','time']]\
    .groupby(['year','artist','track'])\
    .first()
# The first reset_index restores the grouping keys as columns;
# the second creates an integer column, which we rename to id
tidy_track.reset_index(inplace=True)
tidy_track.reset_index(inplace=True)
tidy_track.rename(columns = {'index':'id'}, inplace=True)
tidy_track
id | year | artist | track | time | |
---|---|---|---|---|---|
0 | 0 | 2000 | 2,Pac | Baby Don't Cry | 4:22 |
1 | 1 | 2000 | 2Ge+her | The Hardest Part Of ... | 3:15 |
2 | 2 | 2000 | 3 Doors Down | Kryptonite | 3:53 |
3 | 3 | 2000 | 98^0 | Give Me Just One Nig... | 3:24 |
4 | 4 | 2000 | A*Teens | Dancing Queen | 3:44 |
5 | 5 | 2000 | Aaliyah | I Don't Wanna | 4:15 |
6 | 6 | 2000 | Aaliyah | Try Again | 4:03 |
7 | 7 | 2000 | Adams,Yolanda | Open My Heart | 5:30 |
# Merging on track alone works here because the track titles are unique
tidy_rank = pd.merge(molten, tidy_track, on='track')
tidy_rank = tidy_rank[['id', 'date', 'rank']]
tidy_rank.head()
id | date | rank | |
---|---|---|---|
0 | 5 | 2000-01-29 | 84 |
1 | 5 | 2000-02-05 | 62 |
2 | 5 | 2000-02-12 | 51 |
3 | 0 | 2000-02-26 | 87 |
4 | 0 | 2000-03-04 | 82 |
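When the denormalized view is needed again (say, for a learning package), a merge on id reconstructs it. A minimal sketch of my own, with small stand-in tables:

```python
import pandas as pd

# Minimal stand-ins for the two normalized tables
tidy_track = pd.DataFrame({'id': [0, 1],
                           'artist': ['2,Pac', 'Aaliyah'],
                           'track': ["Baby Don't Cry", 'Try Again']})
tidy_rank = pd.DataFrame({'id': [0, 0, 1],
                          'date': ['2000-02-26', '2000-03-04', '2000-03-18'],
                          'rank': [87, 82, 59]})

# Joining on id recovers the denormalized view when needed
full = pd.merge(tidy_rank, tidy_track, on='id')
print(full.shape)  # (3, 5)
```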