Digital Musicology Exercise

Finding Similarities and Differences in Folk Melodies with Python and Pandas

In this exercise we will have a look at the Essen Folksong Collection (EFD). It is a database of folksongs from all around the world, gathered by the ethnomusicologist Helmut Schaffrath (1995). The collection can be downloaded from this website. It comes in the **kern format, which is essentially a table of note events whose rows correspond to event time.

We will answer a very specific question:

What can we learn about musical scales when we look at the notes in a musical piece?

For this purpose, and also due to limited time, we need to make some simplifications. We will only consider note counts. That means, for this exercise we just count all the notes in a piece and do not care how long they are. In other words, we are only interested in the pitch dimension.

Notes in the EFD come as spelled pitches. Spelled pitches have three parts:

  1. the diatonic step (C, D, E, F, G, A, or B)
  2. possibly one or two accidentals (# for sharp; in this encoding, flats are written as -, e.g. B-5)
  3. the octave in which the note sounds

Moreover, we will reduce the pitches in a melody to pitch classes. This means we neither differentiate between the same pitch in different octaves, e.g. C4 and C5, nor distinguish between enharmonically equivalent pitches, such as F#3 and Gb3.

This way, each piece can be represented as a list of pitch classes that can then be counted.
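
To make this concrete, here is a tiny illustration (not part of the dataset; the numbers just follow the standard convention that C is pitch class 0, each sharp adds 1, and each flat subtracts 1):

In [ ]:
# C4 and C5 collapse to the same pitch class, and so do F#3 and Gb3:
# F is 5, so F# is 5 + 1 = 6; G is 7, so Gb is 7 - 1 = 6
{'C4': 0, 'C5': 0, 'F#3': (5 + 1) % 12, 'Gb3': (7 - 1) % 12}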

Because it is not straightforward to work with **kern scores in Python, the data was already transformed into DataFrame format and exported to data.csv.

First, we need to import some libraries that contain the functions we will use for the data analysis. If you do not have one or more of the packages installed, run conda install <package-name> in a command-line tool (as administrator).

In [1]:
import numpy as np # for numerical computation (we need it to transpose melodies)
import pandas as pd # to organize our data in tabular format

import matplotlib.pyplot as plt # to visualize results 
import seaborn as sns # more advanced visualization tools
# directly show the plots in the notebook
%matplotlib inline 

import re # regular expressions: for pattern finding in strings

Step 1: Preprocessing

Read the data and create new column that contains a list of the pitches of the melody

In [ ]:
data = pd.read_csv("data.csv", sep='\t', index_col=0)
data.head()

We see that the (preprocessed) data comes with the features region, title, and key, and that the melodies are represented both as directed generic intervals (DGIs) and as spelled_pitches.

For this exercise we will only work with the data in the key and spelled_pitches columns. The spelled_pitches entries look like lists of pitches, but they are actually strings ('[', ',', and whitespace are characters, too). Therefore, we first need to transform them into a representation we can work with. We will not go into details here; basically, we just remove everything we don't need from the string until only the spelled pitches are left.

In [ ]:
data['spelled_pitches'] = (
    data['spelled_pitches']
    .str.replace("', '", " ", regex=False)  # the separators between pitches become spaces
    .str.replace(r"\['", "", regex=True)    # remove the opening bracket and quote
    .str.replace(r"\']", "", regex=True)    # remove the closing quote and bracket
)
In [ ]:
data.head()

Let's also drop the columns that we don't need for this exercise.

In [ ]:
del data['region']
del data['DGIs']
In [ ]:
data.head()

Let us inspect the data a bit further with the describe method.

In [ ]:
data.describe()

It seems that we have some duplicates with different titles in the dataset. Normally we would have to deal with this issue; for now, we will treat them as separate songs. We can also see that the key information is apparently missing for one song. Let's see where this happens.

In [ ]:
data[ data['key'].isnull() ]

Since we only have the title for this song, it doesn't make much sense to include it in our analysis. We will exclude it and drop this row from the DataFrame.

In [ ]:
data.drop(1028, inplace=True)
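
Hard-coding the row index works for this dataset. A more general alternative (a sketch with the same effect here) would be to drop every row with a missing key:

In [ ]:
# Equivalent for this dataset: drop all rows whose key is missing
data = data.dropna(subset=['key'])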
In [ ]:
data = data.reset_index(drop=True)
In [ ]:
data.shape

Step 2: Extract the root and mode of the pieces and write them in new columns

Translate the keys into modes and the pitch classes of their roots. The easiest way might be to write a dictionary by hand that does the job.

In [ ]:
# split e.g. 'G major' into root 'G' and mode 'major'
data[['root', 'mode']] = data['key'].str.split(' ', expand=True)
In [ ]:
data.head()

It is always a good idea to inspect the data to understand it better. What are the proportions of major and minor pieces in this dataset?

In [ ]:
data['mode'].value_counts()
In [ ]:
data['mode'].value_counts() / len(data)
In [ ]:
data['root'].value_counts()

We don't want to distinguish between the roots of major and minor keys, so we just write all the roots as uppercase letters.

In [ ]:
data['root'] = data['root'].str.upper()
In [ ]:
data['root'].value_counts()

In order to transpose all the melodies to the same key, we need to know the pitch-class of each root. We will use a pragmatic approach and just explicitly state the information in a dictionary. It is common to define 'C' as pitch class 0.

In [ ]:
roots_dict = { 
    'G':7,
    'F':5,
    'C':0,
    'A':9,
    'D':2,
    'B-':10,
    'E-':3,
    'E':4,
    'A-':8,
    'D-':1,
    'B':11,
    'F#':6
    }
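
As a quick sanity check (optional), we can verify that every root occurring in the data has an entry in the dictionary; the following expression should return an empty set:

In [ ]:
# Roots in the data that are missing from roots_dict (should be empty)
set(data['root'].unique()) - set(roots_dict)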

Now we can translate the roots to pitch classes.

In [ ]:
data['root'] = data['root'].map(roots_dict)
In [ ]:
data.tail()

Step 3: Create one new column for each pitch class in order to extract pitch-class counts

First, we transform the melody into a list of spelled pitches.

In [ ]:
data['spelled_pitches'] = data['spelled_pitches'].str.split()
In [ ]:
data.head()

Next, we need a way to transform each pitch into a pitch class. To that end, we define a function that takes a spelled pitch (a symbol such as B-5) and returns its pitch class as a number between 0 and 11.

In [ ]:
def spelled_pitch_to_pitch_class(spelled_pitch):
    """
    This function transforms a spelled pitch, such as `B-5`,
    into a pitch class, a number between 0 and 11.
    
    A spelled pitch consists of three parts:
    1. Its diatonic step (C, D, E, F, G, A, or B)
    2. Potentially one or two accidentals (# for sharps, - for flats)
    3. Its octave as a number.
    """
    
    # Remove the octave by removing the last character in the string
    spelled_pitch_class = spelled_pitch[:-1]
    
    # A dictionary that associates each diatonic step with a pitch class
    pitch_classes = {
        'C':0,
        'D':2,
        'E':4,
        'F':5,
        'G':7,
        'A':9,
        'B':11
    }
    
    # A regular expression that splits a spelled pitch class
    # into its diatonic step, its sharps, and its flats
    match = re.match(r'([A-G])(#*)(-*)', spelled_pitch_class)
    
    # Complain if the input is not a valid spelled pitch
    if match is None:
        raise ValueError(f"Cannot parse spelled pitch {spelled_pitch!r}")
    
    # The match gives us a triple (step, sharps, flats)
    step = pitch_classes[match.group(1)]
    sharps = len(match.group(2))
    flats = len(match.group(3))
    
    # The only thing left to do is to take the pitch class of the diatonic step,
    # add the number of sharps, and subtract the number of flats.
    # Finally, since pitch classes are always between 0 and 11, we take this number mod 12.
    return (step + sharps - flats) % 12
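
Before applying the function to the whole dataset, it is worth checking it on a few examples (the expected values follow directly from the rules above):

In [ ]:
# B-5 -> 11 - 1 = 10, F#4 -> 5 + 1 = 6, C4 -> 0
spelled_pitch_to_pitch_class('B-5'), spelled_pitch_to_pitch_class('F#4'), spelled_pitch_to_pitch_class('C4')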

In the previous step we set up a function that converts spelled pitches into pitch classes. Now we can use it to count all the notes in a piece.

In [ ]:
# First we set up an empty list that will later contain dictionaries of pitch-class counts for each song. 
countdicts = []

# Then we loop over all the rows (pieces) in our dataframe 
for index, row in data.iterrows():
    
    # We replace the spelled pitches with pitch classes
    row['spelled_pitches'] = [spelled_pitch_to_pitch_class(pitch) for pitch in row['spelled_pitches']]
    
    # Then we count the occurrences of each pitch class in this list
    # We create an empty dictionary that will contain the pitch-class counts for the current piece
    intcounts = {}
    
    # We iterate over all pitch classes and check whether each one is already in the `intcounts` dictionary
    for pitch_class in row['spelled_pitches']:
        # if not, set the count to 1
        if pitch_class not in intcounts.keys():
            intcounts[pitch_class] = 1
        # if yes, increment the count by 1
        else:
            intcounts[pitch_class] += 1
    # Finally, add the pitch-class counts dictionary to our list of pitch-class count dictionaries
    countdicts.append(intcounts)
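
As an aside, the counting loop above could also be written more compactly with collections.Counter from the standard library (a sketch that produces the same list of count dictionaries):

In [ ]:
from collections import Counter

# One pitch-class count dictionary per piece, equivalent to the loop above
countdicts = [
    Counter(spelled_pitch_to_pitch_class(pitch) for pitch in pitches)
    for pitches in data['spelled_pitches']
]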

This is what the list of pitch-class count dictionaries looks like:

In [ ]:
countdicts

This is not really convenient. To make it easier to handle, we transform it into a DataFrame object and set all pitch classes to 0 if they do not occur in a piece.

In [ ]:
# Bring the columns into chromatic order 0-11 and fill missing pitch classes with 0
counts = pd.DataFrame(countdicts).reindex(columns=range(12)).fillna(0)
In [ ]:
counts.head(10)

The DataFrame counts now contains the pitch-class counts for all pieces. We can check whether the dimensions of counts and data match.

In [ ]:
counts.shape, data.shape

But now longer pieces weigh more, just because they contain more notes. To avoid that, we have to normalize the DataFrame to get relative frequencies.

In [ ]:
normalized = counts.div(counts.sum(axis=1), axis=0)
normalized.head(10)
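
As a quick check, the relative frequencies in each row should now sum to 1:

In [ ]:
# Every row should sum to 1 (up to floating-point precision)
normalized.sum(axis=1).head()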

Now we are almost done with transforming the data to answer our question. We still need to transpose all songs into the same key so that we can compare their pitch-class distributions. Let's think for a moment about how this can be done: we have the pitch-class distribution of each song in normalized, and we have key, root, and mode in data.

Let's say that we want to transpose all songs to the root C. C major and C minor pieces do not have to change. A piece in G major, for instance, has the root 7 and needs to be transposed to the root 0. A piece in Bb minor has the root 10 and also needs to be transposed to the root 0. The easiest way to do this is to 'rotate' the pitch-class distributions by the negative amount of the root. Luckily, numpy provides the roll function to do exactly that: it takes an array (a vector) and rolls it by the specified amount. We do this for all songs in data and save the result in a new DataFrame transposed.
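
To see what rolling does, consider a toy example (illustrative only): a 'distribution' with all its weight on pitch class 7 (G), rolled by -7, ends up with all its weight on pitch class 0 (C).

In [ ]:
# All weight on pitch class 7 (G)...
toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0])
# ...moves to pitch class 0 (C) after rolling by -7
np.roll(toy, -7)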

In [ ]:
transposed = pd.DataFrame(
    [ np.roll( normalized.iloc[i,:], -data['root'][i] ) for i in range(len(data)) ]
    )

transposed.head(10)

We can already observe that none of the first 10 songs in the dataset has the tritone (pitch class 6), and only one has a minor third (pitch class 3) or a minor seventh (pitch class 10).

It would be nice not to have to work with two DataFrames, data and transposed, so we combine (concatenate) them into a new one, simply called df.

In [ ]:
df = pd.concat([data, transposed], axis=1)
In [ ]:
df.shape
In [ ]:
df.head()

Step 4: Plot your first pitch class histogram and pitch class distribution

  1. Choose an example piece.
  2. What do you expect to see?
  3. Plot a pitch class histogram in chromatic order.
In [ ]:
piece = df.iloc[0, -12:]  # the last 12 columns hold the pitch-class frequencies 0-11
piece.plot.bar(rot=0);

Step 5: Plot the averaged pitch class distribution for the major and the minor mode

  1. Plot the averaged distributions.
  2. Can you show everything in one figure?
  3. Would it also make sense to plot averaged pitch class histograms?
In [ ]:
melted = df.melt(id_vars='mode',
                 value_vars=[0,1,2,3,4,5,6,7,8,9,10,11],
                 var_name='pitch_classes',
                 value_name='relative_frequencies'
                )
In [ ]:
melted.head()
In [ ]:
melted.shape
In [ ]:
# factorplot was renamed to catplot in newer seaborn versions
sns.catplot(data=melted, 
            x='pitch_classes', 
            y='relative_frequencies', 
            hue='mode',
            kind='bar',
            aspect=2.5
           );

plt.show()

Step 6: Plot the averaged distributions in fifths ordering

  1. Create the plot.
  2. What do you see?
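
Multiplying each pitch class by 7 (the size of a perfect fifth in semitones) modulo 12 reorders the chromatic scale along the line of fifths; this works because 7 and 12 have no common divisor. A quick look at the resulting mapping (illustrative only):

In [ ]:
# pitch class -> its position in fifths ordering (C=0, G=1, D=2, ...)
{pc: pc * 7 % 12 for pc in range(12)}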
In [ ]:
melted['fifths'] = melted['pitch_classes'] * 7 % 12
In [ ]:
sns.catplot(data=melted, 
            x='fifths', 
            y='relative_frequencies', 
            hue='mode',
            kind='bar',
            aspect=2.5
           );

Step 7: Extend the plot above to show the diffusion of each pitch class

  1. Decide whether to use error bars, boxplots, or violin plots. What is the difference between them? Violin plots are of course the fanciest figures...
  2. Describe what you see.
In [ ]:
plt.figure(figsize=(12,10))
sns.boxplot(
    data=melted,
    x='pitch_classes',
    y='relative_frequencies',
    hue='mode',
    fliersize=2.5
);

These boxplots already show much more! For example, they reveal that there are many outliers which we can't see in the bar plot. See https://www.autodeskresearch.com/publications/samestats for a striking demonstration of why looking beyond summary statistics matters.

In [ ]:
plt.figure(figsize=(12,10))
sns.violinplot(
    data=melted,
    x='pitch_classes',
    y='relative_frequencies',
    hue='mode',
    inner='quart',
    split=True
);