In this exercise we will have a look at the Essen Folksong Collection (EFD), a database of folk songs from all around the world gathered by the ethnomusicologist Helmut Schaffrath (1995). The collection can be downloaded from this website. It comes in the **kern format, a table of note events in which each row corresponds to a point in time.
We will answer a very specific question:
What can we learn about musical scales when we look at the notes in a musical piece?
For this purpose, and because time is limited, we need to make some simplifications. We will only consider note counts: for this exercise we count all the notes in a piece but ignore how long they are. In other words, we are only interested in the pitch dimension.
Notes in the EFD come as spelled pitches. A spelled pitch has three parts: its diatonic step (C, D, E, F, G, A, or B), potentially one or two accidentals (# for sharps, - for flats), and its octave as a number.
Moreover, we will reduce the pitches in a melody to pitch classes. This means we do not differentiate between the same pitches in different octaves, e.g. C4 and C5, nor will we distinguish between enharmonically equivalent pitches, such as F#3 and Gb3.
This way, each piece can be represented as a list of pitch classes that can then be counted.
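To make this representation concrete, here is a minimal sketch that counts the pitch classes of a short melody. The list `melody` is an invented example and is not taken from the dataset.

```python
from collections import Counter

# Hypothetical melody, already reduced to pitch classes (C=0, D=2, E=4, G=7)
melody = [0, 2, 4, 0, 7, 4, 0]

# Count how often each pitch class occurs; note durations are ignored
pitch_class_counts = Counter(melody)
print(pitch_class_counts)
```

A `Counter` returns 0 for pitch classes that never occur, which matches the idea that absent pitch classes simply have a count of zero.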
Because it is not straightforward to work with **kern scores in Python, the data has already been transformed into DataFrame format and exported to data.csv.
First, we need to import some libraries that contain the functions we will use for the data analysis. If one or more of the packages are not installed, run conda install package-names in a command-line tool (as administrator).
import numpy as np # for numerical computation (we need it to transpose melodies)
import pandas as pd # to organize our data in tabular format
import matplotlib.pyplot as plt # to visualize results
import seaborn as sns # more advanced visualization tools
# directly show the plots in the notebook
%matplotlib inline
import re # regular expressions: for pattern finding in strings
data = pd.read_csv("data.csv", sep='\t', index_col=0)
data.head()
We see that the (preprocessed) data comes with the features region, title, and key, and that each melody is represented as directed generic intervals (DGIs) and as spelled_pitches.
For this exercise we will only work with the data in the key and spelled_pitches columns. The spelled_pitches entries look like lists of pitches, but they are actually strings ('[', ',' and whitespace are characters). Therefore, we first need to transform them into a representation that we can work with. We will not go into details here; basically, we remove everything that we don't need from each string until only the spelled pitches are left.
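# regex=False treats the brackets as literal characters in every pandas version
data['spelled_pitches'] = (
    data['spelled_pitches']
    .str.replace("', '", " ", regex=False)  # separators between pitches
    .str.replace("['", "", regex=False)     # opening bracket and quote
    .str.replace("']", "", regex=False)     # closing quote and bracket
)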
data.head()
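As an alternative to the string replacements above, strings that look like Python lists can also be parsed with `ast.literal_eval` from the standard library. The sketch below uses a small made-up Series named `raw`; it assumes that every entry is a valid Python list literal.

```python
import ast
import pandas as pd

# Hypothetical column whose entries are strings that look like Python lists
raw = pd.Series(["['C4', 'D4', 'E4']", "['G4', 'B-4']"])

# literal_eval safely evaluates each string as a Python literal (here: a list)
parsed = raw.apply(ast.literal_eval)
print(parsed[1])
```

This approach yields real lists right away, so a later `.str.split()` step would not be needed in that case.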
Let's also drop the columns that we don't need for this exercise.
del data['region']
del data['DGIs']
data.head()
Let us inspect the data a bit further with the describe method.
data.describe()
It seems that we have some duplicates with different titles in the dataset. Normally, we should deal with this issue. For now we will treat them as separate songs. Also we can see that apparently there is missing key information for one song. Let's see where this happens.
data[ data['key'].isnull() ]
Since we have only the title for this song, it doesn't make much sense to include it in our analysis. We will exclude it by dropping this row from the DataFrame.
data.drop(1028, inplace=True)
data = data.reset_index(drop=True)
data.shape
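Dropping the row by its index label works here because we looked the label up by hand. A more general option, sketched below on a made-up miniature DataFrame `toy`, is to drop every row whose key is missing. The key strings such as 'G major' are an assumption about the format.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the dataset with one missing key
toy = pd.DataFrame({
    'title': ['Song A', 'Song B', 'Song C'],
    'key': ['G major', np.nan, 'a minor'],
})

# Drop every row without key information and renumber the remaining rows
cleaned = toy.dropna(subset=['key']).reset_index(drop=True)
print(cleaned)
```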
Translate the keys into modes and pitch classes of roots. The easiest way might be to write a dictionary by hand that does the job.
# Split each key at the first space into its root and its mode
data[['root', 'mode']] = data['key'].str.split(' ', n=1, expand=True)
data.head()
It is always a good idea to inspect the data to understand it better. What are the proportions of major and minor pieces in this dataset?
data['mode'].value_counts()
data['mode'].value_counts() / len(data)
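Proportions can also be obtained in a single call with the `normalize` parameter of `value_counts`. The Series `modes` below is a made-up example, not the dataset.

```python
import pandas as pd

# Hypothetical mode labels for five pieces
modes = pd.Series(['major', 'major', 'minor', 'major', 'minor'])

# normalize=True returns proportions instead of absolute counts
proportions = modes.value_counts(normalize=True)
print(proportions)
```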
data['root'].value_counts()
We don't want to distinguish between the roots of major and minor keys, so we just write all the roots as uppercase letters.
data['root'] = data['root'].str.upper()
data['root'].value_counts()
In order to transpose all the melodies to the same key, we need to know the pitch class of each root. We will use a pragmatic approach and explicitly state this information in a dictionary. It is common to define 'C' as pitch class 0.
roots_dict = {
'G':7,
'F':5,
'C':0,
'A':9,
'D':2,
'B-':10,
'E-':3,
'E':4,
'A-':8,
'D-':1,
'B':11,
'F#':6
}
Now we can translate the roots to pitch classes.
data['root'] = data['root'].map(roots_dict)
data.tail()
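A root that is missing from the dictionary is silently translated to NaN by `map`, so it can be worth checking for such cases. The following sketch uses an invented excerpt `toy_roots` and an invented Series `roots`.

```python
import pandas as pd

# Excerpt of a root-to-pitch-class dictionary (C is pitch class 0)
toy_roots = {'C': 0, 'G': 7, 'B-': 10}

# Hypothetical roots; 'C#' is deliberately missing from the dictionary
roots = pd.Series(['G', 'C', 'C#', 'B-'])
mapped = roots.map(toy_roots)

# Roots that could not be translated show up as NaN
unmapped = roots[mapped.isnull()]
print(unmapped.tolist())
```

If the list of unmapped roots is not empty, the dictionary needs additional entries before the melodies can be transposed.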
First, we transform the melody into a list of spelled pitches.
data['spelled_pitches'] = data['spelled_pitches'].str.split()
data.head()
Next, we need a way to transform each pitch to a pitch class. To that end we define a function that takes a spelled pitch (a symbol such as B-5) and returns its pitch class as a number between 0 and 11.
def spelled_pitch_to_pitch_class(spelled_pitch):
    """
    This function transforms a spelled pitch, such as `B-5`,
    into a pitch class, a number between 0 and 11.
    A spelled pitch consists of three parts:
    1. Its diatonic step (C, D, E, F, G, A, or B)
    2. Potentially one or two accidentals (# for sharps, - for flats)
    3. Its octave as a number.
    """
    # Remove the octave by removing the last character in the string
    # (this assumes a single-digit octave)
    spelled_pitch_class = spelled_pitch[:-1]

    # Extract the diatonic step
    # First, we define a dictionary that associates
    # each diatonic step with a pitch class
    pitch_classes = {
        'C':0,
        'D':2,
        'E':4,
        'F':5,
        'G':7,
        'A':9,
        'B':11
    }

    # Extract accidentals
    # We define a regular expression that finds the three parts
    # of a spelled pitch class: the step, the sharps, and the flats.
    match = re.match(r'(\w)(\#*)(-*)', spelled_pitch_class)

    # Without a match the input is not a spelled pitch
    if match is None:
        raise ValueError('Not a spelled pitch: ' + repr(spelled_pitch))

    # If we find a match, we get a triple (step, sharps, flats)
    step, sharps, flats = match.group(1, 2, 3)

    # The only thing left to do is to take the pitch class of the diatonic step,
    # add the number of sharps and subtract the number of flats.
    # Finally, since pitch classes are always between 0 and 11, we take this number mod 12.
    return (pitch_classes[step] + len(sharps) - len(flats)) % 12
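The conversion can also be tried out in isolation. The following self-contained sketch is a compact variant, not the notebook's own function: the name `to_pitch_class` and the lookup table `STEP_TO_PC` are hypothetical, and it assumes that flats are written as `-` and that the octave is a non-negative integer.

```python
import re

STEP_TO_PC = {'C': 0, 'D': 2, 'E': 4, 'F': 5, 'G': 7, 'A': 9, 'B': 11}

def to_pitch_class(spelled_pitch):
    """Hypothetical helper that maps a spelled pitch to a pitch class (0 to 11)."""
    # Step letter, then sharps (#) or flats (-), then the octave digits
    match = re.fullmatch(r'([A-G])(#*)(-*)(\d+)', spelled_pitch)
    if match is None:
        raise ValueError(f'not a spelled pitch: {spelled_pitch!r}')
    step, sharps, flats, _octave = match.groups()
    return (STEP_TO_PC[step] + len(sharps) - len(flats)) % 12

# Octave equivalence: C4 and C5 share a pitch class
print(to_pitch_class('C4'), to_pitch_class('C5'))
# Enharmonic equivalence: F#3 and G-3 share a pitch class
print(to_pitch_class('F#3'), to_pitch_class('G-3'))
```

Taking the result mod 12 also handles spellings that cross the octave boundary, such as a C flat, which ends up in the same pitch class as B.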
In the previous step we set up a function that converts spelled pitches into pitch classes. Now we can use it to count all the notes in a piece.
# First we set up an empty list that will later contain dictionaries of pitch-class counts for each song.
countdicts = []

# Then we loop over all the rows (pieces) in our dataframe
for index, row in data.iterrows():
    # We translate the spelled pitches of this piece into pitch classes
    pitch_classes = [spelled_pitch_to_pitch_class(pitch) for pitch in row['spelled_pitches']]

    # Then we count the occurrences of each pitch class in this list
    # We create an empty dictionary that will contain the pitch-class counts for the current piece
    intcounts = {}

    # We iterate over all pitch classes and check whether each one is already in the `intcounts` dictionary
    for pitch_class in pitch_classes:
        # if not, set the count to 1
        if pitch_class not in intcounts:
            intcounts[pitch_class] = 1
        # if yes, increment the count by 1
        else:
            intcounts[pitch_class] += 1

    # Finally, add the pitch-class counts dictionary to our list of pitch-class count dictionaries
    countdicts.append(intcounts)
This is what the list of pitch-class count dictionaries looks like:
countdicts
This is not really convenient. To make it easier to handle, we transform it into a DataFrame object and set the count of every pitch class that does not occur in a piece to 0.
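# reindex makes sure that the columns are the pitch classes 0 to 11 in ascending order
counts = pd.DataFrame(countdicts).reindex(columns=range(12)).fillna(0)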
counts.head(10)
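Depending on the pandas version, the columns of a DataFrame built from dictionaries may follow the order in which the keys first appear rather than numeric order. Since rotating the distributions later only makes sense if the columns run from 0 to 11, the following sketch shows one way to make the order explicit. The list `toy_countdicts` is an invented example.

```python
import pandas as pd

# Hypothetical count dictionaries for two pieces; keys are pitch classes
toy_countdicts = [{7: 3, 0: 2}, {4: 1, 0: 5, 11: 2}]

# reindex puts the columns in the order 0..11 and adds the missing ones
toy_counts = (
    pd.DataFrame(toy_countdicts)
    .reindex(columns=range(12))
    .fillna(0)
    .astype(int)
)
print(toy_counts)
```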
The DataFrame counts now contains the pitch-class counts for all pieces. We can check whether the dimensions of counts and data fit.
counts.shape, data.shape
But now longer pieces weigh more just because they contain more notes. To avoid that, we normalize the DataFrame to get relative frequencies: each row is divided by its sum.
normalized = counts.div(counts.sum(axis=1), axis=0)
normalized.head(10)
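The effect of the normalization can be seen on a tiny made-up example. In the sketch below, the second row of `toy_counts` is five times as long as the first but has the same proportions, so both rows end up identical.

```python
import pandas as pd

# Hypothetical counts for two pieces over three categories
toy_counts = pd.DataFrame([[2, 1, 1], [10, 5, 5]])

# Divide every row by its own total so that each row sums to 1
row_sums = toy_counts.sum(axis=1)
toy_normalized = toy_counts.div(row_sums, axis=0)
print(toy_normalized)
```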
Now we are almost done with transforming the data in order to answer our question. We still need to transpose all songs into the same key so that we can compare their pitch-class distributions. Let's think for a moment about how this can be done. We have the pitch-class distribution of each song in normalized, and we have key, root, and mode in data.
Let's say that we want to transpose all songs to the root C. C major and C minor pieces do not have to change. A piece in G major, for instance, has the root 7 and needs to be transposed to the root 0. A piece in Bb minor has the root 10 and needs to be transposed to the root 0. The easiest way to do this is to 'rotate' the pitch-class distributions by the negative amount of the root. Luckily, numpy provides the roll function to do exactly that. It takes an array (a vector) and rolls it by the specified amount. We do this for all songs in data and save the result in a new DataFrame transposed.
transposed = pd.DataFrame(
[ np.roll( normalized.iloc[i,:], -data['root'][i] ) for i in range(len(data)) ]
)
transposed.head(10)
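To see what the rotation does, consider a made-up distribution for a piece in G that only contains the notes G, B, and D. The array `g_piece` is an invented example; after rolling by the negative root, the same notes appear as C, E, and G.

```python
import numpy as np

# Hypothetical distribution of a piece with root G (pitch class 7):
# half of the notes are G (7), the rest are B (11) and D (2)
g_piece = np.zeros(12)
g_piece[7] = 0.5
g_piece[11] = 0.25
g_piece[2] = 0.25

# Rolling by the negative root moves the root to position 0
root = 7
in_c = np.roll(g_piece, -root)

# Positions of the non-zero entries after the transposition
print(np.nonzero(in_c)[0])
```

Because `np.roll` wraps around at the ends of the array, pitch classes below the root, such as D here, reappear at the top end instead of being lost.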
We can already observe that none of the first 10 songs in the dataset has the tritone (pitch class 6), and only one has a minor third (pitch class 3) or a minor seventh (pitch class 10).
It would be nice not to have to work with two DataFrames, data and transposed, so we combine (concatenate) them into a new one, simply called df.
df = pd.concat([data, transposed], axis=1)
df.shape
df.head()
# The last 12 columns hold the relative frequencies of pitch classes 0 to 11
piece = df.iloc[0,-12:]
piece.plot.bar(rot=0);
melted = df.melt(id_vars='mode',
value_vars=[0,1,2,3,4,5,6,7,8,9,10,11],
var_name='pitch_classes',
value_name='relative_frequencies'
)
melted.head()
melted.shape
sns.catplot(data=melted,
x='pitch_classes',
y='relative_frequencies',
hue='mode',
kind='bar',
aspect=2.5
);
plt.show()
melted['fifths'] = melted['pitch_classes'] * 7 % 12
sns.catplot(data=melted,
x='fifths',
y='relative_frequencies',
hue='mode',
kind='bar',
aspect=2.5
);
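The expression `pitch_classes * 7 % 12` used for the fifths column reorders the pitch classes along the circle of fifths. A perfect fifth spans 7 semitones, and because 7 * 7 = 49 leaves the remainder 1 when divided by 12, multiplying a pitch class by 7 mod 12 gives its position on that circle. The sketch below illustrates this; the note names in `names` are just one possible spelling and are not taken from the dataset.

```python
# One possible spelling for the twelve pitch classes, starting at C = 0
names = ['C', 'C#', 'D', 'E-', 'E', 'F', 'F#', 'G', 'A-', 'A', 'B-', 'B']

# Position of every pitch class on the circle of fifths
fifths_position = {names[pc]: pc * 7 % 12 for pc in range(12)}

# Sorting by that position lists the notes in ascending fifths
ordered = sorted(names, key=fifths_position.get)
print(ordered)
```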
plt.figure(figsize=(12,10))
sns.boxplot(
data=melted,
x='pitch_classes',
y='relative_frequencies',
hue='mode',
fliersize=2.5
);
These boxplots already show much more! For example, they reveal that there are many outliers, which we can't see in the bar plot. For a demonstration of how very different datasets can share the same summary statistics, see https://www.autodeskresearch.com/publications/samestats
plt.figure(figsize=(12,10))
sns.violinplot(
data=melted,
x='pitch_classes',
y='relative_frequencies',
hue='mode',
inner='quart',
split=True
);