This Jupyter notebook presents notes on the justification and evaluation of the constant-Q harmonic coefficients (CQHCs).
Timbre is the attribute of sound which makes, for example, two musical instruments playing the same note sound different. It is typically associated with the spectral (but also the temporal) envelope and assumed to be independent of the pitch (but also the loudness) of the sound. This notebook shows how to design a simple but effective pitch-independent timbre feature which is well adapted to musical data, by deriving it from the constant-Q transform (CQT), a log-frequency transform which matches the typical Western musical scale. We show how to decompose the CQT spectrum into an energy-normalized pitch component and a pitch-normalized spectral component, and extract a number of harmonic coefficients from the latter. We then evaluate the discriminative power of these constant-Q harmonic coefficients (CQHCs) on the NSynth dataset, a publicly available large-scale dataset of musical notes, comparing them with the mel-frequency cepstral coefficients (MFCCs), features originally designed for speech recognition but commonly used to characterize timbre in music.
To show how to "pitch-normalize" a CQT spectrum, we start with the assumption that a CQT spectrum X can be represented as the convolution between a pitch-normalized spectral component S (which mostly contains the timbre information) and an energy-normalized pitch component P (which mostly contains the pitch information), as shown below, where * represents the convolution operation.
X = S * P
Observation 1: A pitch change in the audio translates to a linear shift in the CQT spectrum.
Assuming that pitch and timbre are independent, this implies that the same musical object at different pitches would have a similar spectral component but a shifted pitch component (while two different musical objects at the same pitch would have different spectral components but a similar pitch component). This is summarized below, where X, S, P and X', S', P' represent the CQT spectrum, spectral component, and pitch component for a musical object and for a pitch-shifted version of the same musical object, respectively, and ≈ represents approximate equality.
X = S * P
X' = S' * P'
=> S ≈ S'
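The model above can be checked with a toy example (all values hypothetical, not actual CQT data): convolving a fixed spectral component with a pitch impulse placed at two different bins produces two spectra that are shifted copies of each other, so S stays the same while P carries the shift.

```python
import numpy as np

# Toy illustration of X = S * P: a fixed, decaying spectral envelope S
# convolved with an impulse-like pitch component P at two different bins
octave_resolution = 12
number_bins = 48
spectral_component = np.exp(-np.arange(number_bins)/6)  # hypothetical envelope S

def pitched_spectrum(pitch_bin):
    # Energy-normalized pitch component: a single impulse at pitch_bin
    pitch_component = np.zeros(number_bins)
    pitch_component[pitch_bin] = 1
    return np.convolve(spectral_component, pitch_component)[:number_bins]

# Shifting the pitch component by an octave simply shifts the spectrum
spectrum1 = pitched_spectrum(0)
spectrum2 = pitched_spectrum(octave_resolution)
assert np.allclose(spectrum2[octave_resolution:], spectrum1[:number_bins-octave_resolution])
```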
%matplotlib inline
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
# Define two sinusoids, with the second one octave higher in pitch than the first
sampling_frequency = 44100 # In Hz
time_duration = 1 # In seconds
time_array = np.arange(0, time_duration, 1/sampling_frequency) # In seconds
frequency_value1 = 65.41 # In Hz (= C2)
frequency_value2 = 2*frequency_value1 # In Hz (= C3)
audio_signal1 = np.sin(time_array*2*np.pi*frequency_value1)
audio_signal2 = np.sin(time_array*2*np.pi*frequency_value2)
# Compute their CQT spectrograms
step_length = int(pow(2, int(np.ceil(np.log2(0.04*sampling_frequency))))/2)
octave_resolution = 12
cqt_spectrogram1 = np.abs(librosa.cqt(audio_signal1, sr=sampling_frequency, hop_length=step_length, \
bins_per_octave=octave_resolution))
cqt_spectrogram2 = np.abs(librosa.cqt(audio_signal2, sr=sampling_frequency, hop_length=step_length, \
bins_per_octave=octave_resolution))
# Display their waveforms (in seconds) and their CQT spectrograms (in dB, against seconds and notes)
plt.figure(figsize=(14, 7))
plt.subplot(2, 2, 1)
librosa.display.waveshow(audio_signal1, sr=sampling_frequency)
plt.title('Original signal (C2)')
plt.subplot(2, 2, 2)
librosa.display.waveshow(audio_signal2, sr=sampling_frequency)
plt.title('Pitch-shifted signal (C3)')
plt.subplot(2, 2, 3)
librosa.display.specshow(librosa.amplitude_to_db(cqt_spectrogram1, ref=np.max), sr=sampling_frequency, \
hop_length=step_length, x_axis='time', y_axis='cqt_note', bins_per_octave=octave_resolution)
plt.title('CQT spectrogram (C2)')
plt.subplot(2, 2, 4)
librosa.display.specshow(librosa.amplitude_to_db(cqt_spectrogram2, ref=np.max), sr=sampling_frequency, \
hop_length=step_length, x_axis='time', y_axis='cqt_note', bins_per_octave=octave_resolution)
plt.title('CQT spectrogram (C3)')
plt.tight_layout()
plt.show()
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
# Load a musical note and derive a pitch-shifted version one octave higher (rescaled to keep the same energy)
audio_signal1, sampling_frequency = librosa.load('bass_acoustic_000-036-075.wav', sr=None, mono=True)
audio_signal2 = librosa.effects.pitch_shift(audio_signal1, sr=sampling_frequency, n_steps=12, bins_per_octave=12)
audio_signal2 = audio_signal2*np.sqrt(np.sum(np.power(audio_signal1, 2))/np.sum(np.power(audio_signal2, 2)))
# Compute their CQT spectrograms
step_length = int(pow(2, int(np.ceil(np.log2(0.04*sampling_frequency))))/2)
octave_resolution = 12
cqt_spectrogram1 = np.abs(librosa.cqt(audio_signal1, sr=sampling_frequency, hop_length=step_length, \
bins_per_octave=octave_resolution))
cqt_spectrogram2 = np.abs(librosa.cqt(audio_signal2, sr=sampling_frequency, hop_length=step_length, \
bins_per_octave=octave_resolution))
# Display their waveforms (in seconds) and their CQT spectrograms (in dB, against seconds and notes)
plt.figure(figsize=(14, 7))
plt.subplot(2, 2, 1)
librosa.display.waveshow(audio_signal1, sr=sampling_frequency)
plt.title('Original signal (C2)')
plt.subplot(2, 2, 2)
librosa.display.waveshow(audio_signal2, sr=sampling_frequency)
plt.title('Pitch-shifted signal (C3)')
plt.subplot(2, 2, 3)
librosa.display.specshow(librosa.amplitude_to_db(cqt_spectrogram1, ref=np.max), sr=sampling_frequency, \
hop_length=step_length, x_axis='time', y_axis='cqt_note', bins_per_octave=octave_resolution)
plt.title('CQT spectrogram (C2)')
plt.subplot(2, 2, 4)
librosa.display.specshow(librosa.amplitude_to_db(cqt_spectrogram2, ref=np.max), sr=sampling_frequency, \
hop_length=step_length, x_axis='time', y_axis='cqt_note', bins_per_octave=octave_resolution)
plt.title('CQT spectrogram (C3)')
plt.tight_layout()
plt.show()
Observation 2: The Fourier transform (FT) of the convolution between two functions is equal to the pointwise product between the FTs of the two functions, a property known as the convolution theorem.
This implies that the FT of the CQT spectrum is equal to the pointwise product between the FT of the spectral component and the FT of the pitch component. Given the first observation, this further implies that the FTs of the spectral components for a musical object and for a pitch-shifted version of it would be approximately equal. This is summarized below, where F represents the FT function and . the pointwise product.
F(X) = F(S * P) = F(S) . F(P)
F(X') = F(S' * P') = F(S') . F(P')
=> F(S) ≈ F(S')
import numpy as np
import matplotlib.pyplot as plt
# Define two random arrays
number_elements = 1000
data_array1 = np.random.normal(size=number_elements)
data_array2 = np.random.normal(size=number_elements)
# Compute their direct convolution
data_convolution = np.convolve(data_array1, data_array2)
# Compute their convolution using the convolution theorem
ft_array1 = np.fft.fft(data_array1, 2*number_elements-1)
ft_array2 = np.fft.fft(data_array2, 2*number_elements-1)
data_convolution2 = np.real(np.fft.ifft(ft_array1*ft_array2))
# Display the convolutions and their differences
maximum_value = max(abs(data_convolution))
plt.figure(figsize=(14, 3))
plt.subplot(1, 3, 1)
plt.plot(data_convolution), plt.ylim(-maximum_value, maximum_value), plt.title('Direct convolution')
plt.subplot(1, 3, 2)
plt.plot(data_convolution2), plt.ylim(-maximum_value, maximum_value), plt.title('Indirect convolution')
plt.subplot(1, 3, 3)
plt.plot(data_convolution-data_convolution2), plt.ylim(-maximum_value, maximum_value), plt.title('Differences')
plt.tight_layout()
plt.show()
Observation 3: The magnitude FT is shift-invariant.
This implies that the magnitudes of the FTs of the CQT spectra for a musical object and for a pitch-shifted version of it would be approximately equal. This is summarized below, where |.| and Arg(.) represent the modulus and argument of a complex array, respectively, and j the imaginary unit.
F(X) = |F(X)| . exp(j*Arg(F(X)))
F(X') = |F(X')| . exp(j*Arg(F(X')))
=> |F(X)| ≈ |F(X')|
import numpy as np
import matplotlib.pyplot as plt
# Define a (non-negative) array and a shifted version of it
number_elements = 100
delay_size = 10
data_array1 = np.random.rand(number_elements-delay_size)
data_array2 = np.concatenate((np.zeros(delay_size), data_array1[0:number_elements-delay_size]))
# Compute their magnitude FT
absft_array1 = abs(np.fft.fft(data_array1, 2*number_elements-1))[0:number_elements]
absft_array2 = abs(np.fft.fft(data_array2, 2*number_elements-1))[0:number_elements]
# Display the arrays, their magnitude FT, and the differences between their magnitude FT
maximum_value = np.max(absft_array1)
plt.figure(figsize=(14, 6))
plt.subplot(2, 3, 1), plt.plot(data_array1), plt.ylim(0, 1), plt.title('Array')
plt.subplot(2, 3, 2), plt.plot(data_array2), plt.ylim(0, 1), plt.title('Shifted array')
plt.subplot(2, 3, 4), plt.plot(absft_array1), plt.ylim(0, maximum_value), plt.title('Magnitude FT of array')
plt.subplot(2, 3, 5), plt.plot(absft_array2), plt.ylim(0, maximum_value), plt.title('Magnitude FT of shifted array')
plt.subplot(2, 3, 6), plt.plot(abs(absft_array1-absft_array2)), plt.ylim(0, maximum_value), plt.title('Differences')
plt.tight_layout()
plt.show()
import numpy as np
import librosa
import matplotlib.pyplot as plt
# Load a musical note and derive a pitch-shifted version one octave higher (rescaled to keep the same energy)
octave_resolution = 12
audio_signal1, sampling_frequency = librosa.load('bass_acoustic_000-036-075.wav', sr=None, mono=True)
audio_signal2 = librosa.effects.pitch_shift(audio_signal1, sr=sampling_frequency, n_steps=12, bins_per_octave=octave_resolution)
audio_signal2 = audio_signal2*np.sqrt(np.sum(np.power(audio_signal1, 2))/np.sum(np.power(audio_signal2, 2)))
# Compute their CQT spectrograms
step_length = int(pow(2, int(np.ceil(np.log2(0.04*sampling_frequency))))/2)
cqt_spectrogram1 = np.abs(librosa.cqt(audio_signal1, sr=sampling_frequency, hop_length=step_length, \
bins_per_octave=octave_resolution))
cqt_spectrogram2 = np.abs(librosa.cqt(audio_signal2, sr=sampling_frequency, hop_length=step_length, \
bins_per_octave=octave_resolution))
number_frequencies = np.shape(cqt_spectrogram1)[0]
# Compute the magnitude FT of their CQT spectrogram
absft_spectrogram1 = abs(np.fft.fft(cqt_spectrogram1, 2*number_frequencies-1, axis=0))[0:number_frequencies, :]
absft_spectrogram2 = abs(np.fft.fft(cqt_spectrogram2, 2*number_frequencies-1, axis=0))[0:number_frequencies, :]
# Display the CQT spectra, their magnitude FT, and the differences between their magnitude FT
i = 10  # Index of the time frame to display
audio_maximum = max(np.max(abs(cqt_spectrogram1[:, i])), np.max(abs(cqt_spectrogram2[:, i])))
absft_maximum = max(np.max(abs(absft_spectrogram1[:, i])), np.max(abs(absft_spectrogram2[:, i])))
plt.figure(figsize=(14, 6))
plt.subplot(2, 3, 1), plt.plot(cqt_spectrogram1[:, i])
plt.ylim(0, audio_maximum), plt.title('Original CQT spectrum (C2)')
plt.subplot(2, 3, 2), plt.plot(cqt_spectrogram2[:, i])
plt.ylim(0, audio_maximum), plt.title('Pitch-shifted CQT spectrum (C3)')
plt.subplot(2, 3, 4), plt.plot(absft_spectrogram1[:, i])
plt.ylim(0, absft_maximum), plt.title('Magnitude FT of CQT spectrum (C2)')
plt.subplot(2, 3, 5), plt.plot(absft_spectrogram2[:, i])
plt.ylim(0, absft_maximum), plt.title('Magnitude FT of CQT spectrum (C3)')
plt.subplot(2, 3, 6), plt.plot(abs(absft_spectrogram1[:, i]-absft_spectrogram2[:, i]))
plt.ylim(0, absft_maximum), plt.title('Differences')
plt.tight_layout()
plt.show()
Given the previous observations, we can therefore approximate the FT of the spectral component by the magnitude FT of the CQT spectrum, and the FT of the pitch component by its phase component. Taking the inverse FTs finally gives us estimates for the spectral component and the pitch component, as shown below, where F-1(.) represents the inverse FT function.
=> F(S) ≈ |F(X)| and F(P) ≈ exp(j*Arg(F(X)))
=> S ≈ F-1(|F(X)|) and P ≈ F-1(exp(j*Arg(F(X))))
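As a minimal sketch of these estimates (plain NumPy on a random stand-in for a CQT spectrum, not the cqhc package used later), the deconvolution and its consistency can be checked directly:

```python
import numpy as np

# Minimal sketch of the deconvolution above: estimate the spectral component S
# from the magnitude of the FT of X, and the pitch component P from its phase
number_frequencies = 64
rng = np.random.default_rng(0)
cqt_spectrum = rng.random(number_frequencies)  # stand-in for a CQT spectrum X

ft_spectrum = np.fft.fft(cqt_spectrum)
spectral_component = np.real(np.fft.ifft(np.abs(ft_spectrum)))            # S ~ F-1(|F(X)|)
pitch_component = np.real(np.fft.ifft(np.exp(1j*np.angle(ft_spectrum))))  # P ~ F-1(exp(j*Arg(F(X))))

# Circularly convolving the two estimates recovers the original spectrum
reconstruction = np.real(np.fft.ifft(np.fft.fft(spectral_component)*np.fft.fft(pitch_component)))
assert np.allclose(reconstruction, cqt_spectrum)
```

By construction, F(S).F(P) = |F(X)|.exp(j*Arg(F(X))) = F(X), which is why the circular convolution of the two estimates gives back X exactly.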
import os
import numpy as np
import librosa
import librosa.display
import cqhc
import matplotlib.pyplot as plt
## 1. Create a scale of notes
# Define the parameters for the notes to concatenate
folder_path = r'nsynth\nsynth-train\audio'
instrument_names = ['bass_acoustic_000']
note_number = 24
note_numbers = np.arange(note_number, note_number+12)
velocity_number = 75
sampling_frequency = 16000
# Initialize the audio signal
audio_signal = np.empty(0)
# Loop over the instrument names and note numbers to concatenate the notes
for instrument_name in instrument_names:
    for note_number in note_numbers:
        # Get the path to the file
        file_name = f'{instrument_name}-{note_number:03d}-{velocity_number:03d}.wav'
        file_path = os.path.join(folder_path, file_name)
        # Load the current audio signal and concatenate it
        audio_signal1, _ = librosa.load(file_path, sr=sampling_frequency, mono=True)
        audio_signal = np.concatenate((audio_signal, audio_signal1))
## 2. Compute all the features
# Define the parameters for the feature to compute
window_length = pow(2, int(np.ceil(np.log2(0.04 * sampling_frequency))))
step_length = int(window_length / 2)
minimum_frequency = 32.70
octave_resolution = 12
number_coefficients = 20
# Compute the CQT spectrogram, derive the spectral and pitch components, extract the CQHCs, and also compute the MFCCs
cqt_spectrogram = np.power(cqhc.cqtspectrogram(audio_signal, sampling_frequency, step_length, minimum_frequency, \
octave_resolution), 2)
spectral_component, pitch_component = cqhc.cqtdeconv(cqt_spectrogram)
audio_cqhc = cqhc.cqhc(audio_signal, sampling_frequency, step_length, minimum_frequency, octave_resolution, \
number_coefficients)
audio_mfcc = cqhc.mfcc(audio_signal, sampling_frequency, window_length, step_length, number_coefficients)
## 3. Display the CQT spectrogram, spectral component, and pitch component
plt.figure(figsize=(14, 4))
plt.subplot(1, 3, 1)
librosa.display.specshow(librosa.power_to_db(cqt_spectrogram), x_axis='time', y_axis='cqt_note', sr=sampling_frequency,
hop_length=step_length, fmin=minimum_frequency, bins_per_octave=octave_resolution, cmap='jet')
plt.title('CQT spectrogram')
plt.subplot(1, 3, 2)
librosa.display.specshow(librosa.power_to_db(spectral_component), x_axis='time', y_axis='cqt_note', sr=sampling_frequency,
hop_length=step_length, fmin=minimum_frequency, bins_per_octave=octave_resolution, cmap='jet')
plt.title('Spectral component')
plt.subplot(1, 3, 3)
librosa.display.specshow(pitch_component, x_axis='time', y_axis='cqt_note', sr=sampling_frequency,
hop_length=step_length, fmin=minimum_frequency, bins_per_octave=octave_resolution, cmap='jet')
plt.title('Pitch component')
plt.tight_layout()
plt.show()
## 4. Display the CQHCs and the MFCCs
plt.figure(figsize=(14, 4))
plt.subplot(1, 2, 1)
librosa.display.specshow(librosa.power_to_db(audio_cqhc), x_axis='time', sr=sampling_frequency, hop_length=step_length, \
cmap='jet')
plt.title('CQHCs')
plt.ylabel('Coefficients')
plt.subplot(1, 2, 2)
librosa.display.specshow(audio_mfcc, x_axis='time', sr=sampling_frequency, hop_length=step_length, cmap='jet')
plt.title('MFCCs')
plt.ylabel('Coefficients')
plt.tight_layout()
plt.show()
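The CQHC extraction itself is delegated to the cqhc package above. As a hedged sketch of the idea (the function below is illustrative, not the package's API): in a log-frequency representation with octave_resolution bins per octave, the k-th harmonic of a fundamental falls roughly octave_resolution*log2(k) bins above it, so harmonic coefficients can be read off the spectral component at those positions.

```python
import numpy as np

# Hypothetical sketch of harmonic-coefficient extraction (not the cqhc API):
# sample the spectral component at the bins where the harmonics would fall
def harmonic_coefficients(spectral_component, octave_resolution=12, number_coefficients=20):
    # Bin offsets of harmonics 1..number_coefficients on a log-frequency axis
    harmonic_indices = np.round(octave_resolution*np.log2(np.arange(1, number_coefficients+1))).astype(int)
    return spectral_component[harmonic_indices, :]

# Usage on a toy spectral component (64 frequency bins, 10 time frames)
toy_component = np.random.rand(64, 10)
coefficients = harmonic_coefficients(toy_component)
print(coefficients.shape)  # (20, 10)
```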
import os
import pickle
import numpy as np
import librosa
import cqhc
# Get the path to the dataset folder and the list of its subfolders
folder_path = r'nsynth'
folder_listdir = os.listdir(folder_path)
# Define the parameters
sampling_frequency = 16000
window_length = pow(2, int(np.ceil(np.log2(0.04*sampling_frequency))))
step_length = int(window_length/2)
minimum_frequency = 32.70
octave_resolution = 12
number_coefficients = 20
# Set the path to the feature files
folder_path2 = r'nsynth_features'
# Loop over the subsets
i = 0
for folder_name in folder_listdir:
    # Get the path to the subset and the list of its files
    subfolder_path = os.path.join(folder_path, folder_name, 'audio')
    subfolder_listdir = os.listdir(subfolder_path)
    # Loop over the files
    for file_name in subfolder_listdir:
        # Check for medium velocity only
        velocity_number = file_name.split('-')[2][0:3]
        if velocity_number == '075':
            # Get the path to the file
            file_path = os.path.join(subfolder_path, file_name)
            # Display the counter and the name of the file
            i = i+1
            print(f'{i}: {file_name}')
            # Load the audio signal
            audio_signal, _ = librosa.load(file_path, sr=None, mono=True)
            # Compute the CQHCs and the MFCCs
            audio_cqhc = cqhc.cqhc(audio_signal, sampling_frequency, step_length, minimum_frequency, octave_resolution, \
                                   number_coefficients)
            audio_mfcc = cqhc.mfcc(audio_signal, sampling_frequency, window_length, step_length, number_coefficients)
            # Create a dictionary for the current audio features and save it in a pkl file
            audio_dict = {'cqhc': audio_cqhc, 'mfcc': audio_mfcc}
            file_path2 = os.path.join(folder_path2, file_name[0:-4]+'.pkl')
            pickle.dump(audio_dict, open(file_path2, 'wb'))
import os
import pickle
import numpy as np
import librosa
# Get the path to the folder and the list of files
folder_path = r'nsynth_features'
folder_listdir = os.listdir(folder_path)
# Initialize an array for the instruments indices for all the files
number_files = len(folder_listdir)
instrument_indices = np.zeros(number_files, dtype=int)
# Initialize the previous unique instrument name and the instrument index
instrument_name0 = '0'
j = -1
# Loop over the files to get the instrument indices
i = 0
for file_name in folder_listdir:
    # Get the name of the instrument (with the instrument number)
    instrument_name = file_name.split('-')[0]
    # Update the instrument index and previous unique name, if it is a new instrument
    if instrument_name != instrument_name0:
        j = j+1
        instrument_name0 = instrument_name
    # Update the instrument index and the file index
    instrument_indices[i] = j
    i = i+1
# Get the full feature size
audio_dict = pickle.load(open(os.path.join(folder_path, folder_listdir[0]), 'rb'))
feature_size = np.size(audio_dict['cqhc'])
# Initialize arrays for all the features, flatten and normalized
audio_cqhcs = np.zeros((feature_size, number_files))
audio_mfccs = np.zeros((feature_size, number_files))
# Loop over the files to flatten and normalize the features
i = 0
for file_name in folder_listdir:
    # # Print the counter and the file name
    # print(f'{i+1}/{number_files}: {file_name}')
    # Get the full path to the file and load the dictionary
    file_path = os.path.join(folder_path, file_name)
    audio_dict = pickle.load(open(file_path, 'rb'))
    # Get the CQHCs, flatten them, and normalize them
    audio_cqhc = audio_dict['cqhc'].flatten()
    audio_cqhcs[:, i] = audio_cqhc/(np.sqrt(np.sum(np.power(audio_cqhc, 2)))+1e-16)
    # Get the MFCCs, flatten them, and normalize them
    audio_mfcc = audio_dict['mfcc'].flatten()
    audio_mfccs[:, i] = audio_mfcc/(np.sqrt(np.sum(np.power(audio_mfcc, 2)))+1e-16)
    # Update the file index
    i = i+1
# Get the path to the similarity folder
folder_path2 = r'nsynth_similarities'
# Get the number of instruments
number_instruments = instrument_indices[-1]+1
# Loop over the files to compute the instrument cross-similarities
i = 0
for file_name in folder_listdir:
    # Print the counter and the file name
    print(f'{i+1}/{number_files}: {file_name}')
    # Get the full path to the similarity file
    file_path2 = os.path.join(folder_path2, file_name)
    # If the similarity file does not already exist
    if not os.path.isfile(file_path2):
        # Compute the cosine similarity between the current features and all the others
        cqhc_similarities = np.matmul(audio_cqhcs[:, i], audio_cqhcs[:, i+1:])
        mfcc_similarities = np.matmul(audio_mfccs[:, i], audio_mfccs[:, i+1:])
        # Initialize the instrument similarities for the current file
        cqhc_similarities2 = np.zeros(number_instruments)
        mfcc_similarities2 = np.zeros(number_instruments)
        # Loop over the cross-similarities (past the self-similarity)
        l = 0
        for j in range(i+1, number_files):
            # Get the instrument index for the current cross-similarity
            k = instrument_indices[j]
            # Update the instrument similarities
            cqhc_similarities2[k] = cqhc_similarities2[k] + cqhc_similarities[l]
            mfcc_similarities2[k] = mfcc_similarities2[k] + mfcc_similarities[l]
            l = l+1
        # Create a similarity dictionary for the current file and save it
        audio_dict = {'cqhc': cqhc_similarities2, 'mfcc': mfcc_similarities2}
        pickle.dump(audio_dict, open(file_path2, 'wb'))
    # Update the file index
    i = i+1
import os
import pickle
import numpy as np
# Get the path to the folder of similarities and the list of files
folder_path = r'nsynth_similarities'
folder_listdir = os.listdir(folder_path)
# Initialize an array for the instruments indices for all the files
number_files = len(folder_listdir)
instrument_indices = np.empty(number_files, dtype=int)
# Initialize the previous unique instrument name and the instrument index
instrument_name0 = '0'
j = -1
# Initialize an array for the number of notes per instrument
number_instruments = 945
note_counters = np.zeros(number_instruments, dtype=int)
# Loop over the files to get the instrument indices and note counters
for i in range(number_files):
    # Get the name of the instrument (with the instrument number)
    instrument_name = folder_listdir[i].split('-')[0]
    # If it is a new instrument
    if instrument_name != instrument_name0:
        # Update the instrument index and the previous unique name
        j = j+1
        instrument_name0 = instrument_name
    # Update the instrument index for the current file
    instrument_indices[i] = j
    # Update the note counter for the current instrument
    note_counters[j] = note_counters[j]+1
# Initialize the instrument similarity matrix for the CQHCs and the MFCCs
cqhc_similarity = np.zeros((number_instruments, number_instruments))
mfcc_similarity = np.zeros((number_instruments, number_instruments))
# Loop over the files to compute the instrument similarity matrices
for i in range(number_files):
    # Get the file name and display it with the file counter
    file_name = folder_listdir[i]
    # print(f'{i+1}/{number_files}: {file_name}')
    # Get the full path to the similarity file and load the audio dictionary
    file_path = os.path.join(folder_path, file_name)
    audio_dict = pickle.load(open(file_path, 'rb'))
    # Get the instrument index and update the instrument similarity matrices
    j = instrument_indices[i]
    cqhc_similarity[j, :] = cqhc_similarity[j, :] + audio_dict['cqhc']
    mfcc_similarity[j, :] = mfcc_similarity[j, :] + audio_dict['mfcc']
# Recover the whole instrument similarity matrices
cqhc_similarity = cqhc_similarity+np.eye(number_instruments)+cqhc_similarity.T
mfcc_similarity = mfcc_similarity+np.eye(number_instruments)+mfcc_similarity.T
# Derive the number of pairs of notes used for the similarities for all the pairs of instruments
note_pairs = note_counters*note_counters[:, None]
# Normalize the instrument similarity matrices
cqhc_similarity = cqhc_similarity/note_pairs
mfcc_similarity = mfcc_similarity/note_pairs
# Create a dictionary and save the instrument similarities
audio_dict = {'cqhc': cqhc_similarity, 'mfcc': mfcc_similarity}
pickle.dump(audio_dict, open('instrument_similarity.pkl','wb'))
import os
import pickle
import numpy as np
import matplotlib.pyplot as plt
# Load the instrument similarity matrices
audio_dict = pickle.load(open('instrument_similarity.pkl', 'rb'))
cqhc_similarity = audio_dict['cqhc']
mfcc_similarity = audio_dict['mfcc']
# Display the instrument similarity matrices
plt.figure(figsize=(17, 6))
plt.subplot(1, 2, 1)
plt.imshow(cqhc_similarity, cmap="jet", aspect="auto", vmin=0, vmax=1, origin="lower")
plt.colorbar()
plt.title('CQHC instrument similarities')
plt.xlabel('Instrument index')
plt.ylabel('Instrument index')
plt.subplot(1, 2, 2)
plt.imshow(mfcc_similarity, cmap="jet", aspect="auto", vmin=0, vmax=1, origin="lower")
plt.colorbar()
plt.title('MFCC instrument similarities')
plt.xlabel('Instrument index')
plt.ylabel('Instrument index')
plt.show()
# Compute the self and cross-similarity matrices for the CQHCs and the MFCCs
number_instruments = 945
cqhc_selfsimilarity = np.diag(cqhc_similarity)
cqhc_crosssimilarity = cqhc_similarity[~np.eye(number_instruments, dtype=bool)].reshape(number_instruments,-1)
mfcc_selfsimilarity = np.diag(mfcc_similarity)
mfcc_crosssimilarity = mfcc_similarity[~np.eye(number_instruments, dtype=bool)].reshape(number_instruments,-1)
# Display the self-similarities and cross-similarities (with mean and +-std)
plt.figure(figsize=(17, 6))
plt.subplot(2, 1, 1)
plt.errorbar(np.arange(number_instruments), np.mean(cqhc_crosssimilarity, axis=1), \
yerr=np.std(cqhc_crosssimilarity, axis=1), ecolor='y', color='r')
plt.plot(cqhc_selfsimilarity, color='g')
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylim(0, 1)
plt.grid()
plt.title('CQHC instrument self-similarities (green) and cross-similarities (mean/red and +-std/yellow)')
plt.xlabel('Instrument index')
plt.ylabel('Similarity')
plt.subplot(2, 1, 2)
plt.errorbar(np.arange(number_instruments), np.mean(mfcc_crosssimilarity, axis=1), \
yerr=np.std(mfcc_crosssimilarity, axis=1), ecolor='y', color='r')
plt.plot(mfcc_selfsimilarity, color='g')
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylim(0, 1)
plt.grid()
plt.title('MFCC instrument self-similarities (green) and cross-similarities (mean/red and +-std/yellow)')
plt.xlabel('Instrument index')
plt.ylabel('Similarity')
plt.tight_layout()
plt.show()
print(f'Average CQHC self-similarity = {np.mean(cqhc_selfsimilarity)}')
print(f'Average CQHC cross-similarity = {np.mean(np.mean(cqhc_crosssimilarity, axis=1))} and std = {np.mean(np.std(cqhc_crosssimilarity, axis=1))}')
print(f'Average MFCC self-similarity = {np.mean(mfcc_selfsimilarity)}')
print(f'Average MFCC cross-similarity mean = {np.mean(np.mean(mfcc_crosssimilarity, axis=1))} and std = {np.mean(np.std(mfcc_crosssimilarity, axis=1))}')
Average CQHC self-similarity = 0.770456002165132
Average CQHC cross-similarity = 0.5126201079798618 and std = 0.20763389156466464
Average MFCC self-similarity = 0.939412354122086
Average MFCC cross-similarity mean = 0.9136073168156905 and std = 0.0363984181574115
import os
import pickle
import numpy as np
# Get the path to the folder of features and the list of files
folder_path = r'nsynth_features'
folder_listdir = os.listdir(folder_path)
# Initialize the instrument index and previous unique name
j = -1
instrument_name0 = '0'
# Initialize the class index and previous unique name
k = -1
class_name0 = '0'
# Initialize two arrays for the class indices for all the instruments and all the files
number_instruments = 945
number_files = len(folder_listdir)
class_indices = [np.empty(number_instruments, dtype=int), np.empty(number_files, dtype=int)]
# Initialize an array for the number of notes per class
number_classes = 11
note_counters = np.zeros(number_classes)
# Loop over the files
for i in range(number_files):
    # Get the name of the instrument
    instrument_name = folder_listdir[i].split('-')[0]
    # If it is a new instrument
    if instrument_name != instrument_name0:
        # Update the instrument counter and the previous unique name
        j = j+1
        instrument_name0 = instrument_name
    # Get the name of the class
    class_name = folder_listdir[i].split('_')[0]
    # If it is a new class
    if class_name != class_name0:
        # Update the class counter and the previous unique name
        k = k+1
        class_name0 = class_name
    # Update the class index for the current instrument and file
    class_indices[0][j] = k
    class_indices[1][i] = k
    # Update the note counter
    note_counters[k] = note_counters[k]+1
# Get the path to the folder of similarities and the list of files
folder_path = r'nsynth_similarities'
folder_listdir = os.listdir(folder_path)
# Initialize the class similarity matrices for the CQHCs and the MFCCs
cqhc_similarity = np.zeros((number_classes, number_classes))
mfcc_similarity = np.zeros((number_classes, number_classes))
# Loop over the files
i = 0
for file_name in folder_listdir:
    # Display the name of the file
    # print(f'{i+1}/{number_files}: {file_name}')
    # Get the full path to the similarity file and load the dictionary
    file_path = os.path.join(folder_path, file_name)
    audio_dict = pickle.load(open(file_path, 'rb'))
    # Get the class index per file
    j = class_indices[1][i]
    # Loop over the instruments
    for l in range(number_instruments):
        # Get the class index per instrument
        k = class_indices[0][l]
        # Update the class similarity matrices
        cqhc_similarity[j, k] = cqhc_similarity[j, k] + audio_dict['cqhc'][l]
        mfcc_similarity[j, k] = mfcc_similarity[j, k] + audio_dict['mfcc'][l]
    # Update the counter
    i = i+1
# Recover the whole class similarity matrices
cqhc_similarity = cqhc_similarity+np.eye(number_classes)+cqhc_similarity.T
mfcc_similarity = mfcc_similarity+np.eye(number_classes)+mfcc_similarity.T
# Derive the number of pairs of notes used for the similarities for all the pairs of classes
note_pairs = note_counters*note_counters[:, None]
# Normalize the class similarity matrices
cqhc_similarity = cqhc_similarity/note_pairs
mfcc_similarity = mfcc_similarity/note_pairs
# Create a dictionary and save the class similarity matrices
audio_dict = {'cqhc': cqhc_similarity, 'mfcc': mfcc_similarity}
pickle.dump(audio_dict, open('family_similarity.pkl','wb'))
import os
import pickle
import numpy as np
import matplotlib.pyplot as plt
# Load the instrument family similarity matrices
audio_dict = pickle.load(open('family_similarity.pkl', 'rb'))
cqhc_similarity = audio_dict['cqhc']
mfcc_similarity = audio_dict['mfcc']
# Display the instrument family similarity matrices
class_names = ['bass', 'brass', 'flute', 'guitar', 'keyboard', 'mallet', 'organ', 'reed', 'string', 'synth_lead', 'vocal']
number_classes = len(class_names)
plt.figure(figsize=(17, 6))
plt.subplot(1, 2, 1)
plt.imshow(cqhc_similarity, cmap="jet", aspect="auto", vmin=0, vmax=1, origin="lower")
plt.colorbar()
plt.xticks(np.arange(number_classes), class_names, rotation=45)
plt.yticks(np.arange(number_classes), class_names)
plt.title('CQHC instrument family similarities')
plt.xlabel('Instrument family')
plt.ylabel('Instrument family')
plt.subplot(1, 2, 2)
plt.imshow(mfcc_similarity, cmap="jet", aspect="auto", vmin=0, vmax=1, origin="lower")
plt.colorbar()
plt.xticks(np.arange(number_classes), class_names, rotation=45)
plt.yticks(np.arange(number_classes), class_names)
plt.title('MFCC instrument family similarities')
plt.xlabel('Instrument family')
plt.ylabel('Instrument family')
plt.show()
# Compute the self and cross-similarities for the CQHCs and the MFCCs
cqhc_selfsimilarity = np.diag(cqhc_similarity)
cqhc_crosssimilarity = cqhc_similarity[~np.eye(number_classes, dtype=bool)].reshape(number_classes,-1)
mfcc_selfsimilarity = np.diag(mfcc_similarity)
mfcc_crosssimilarity = mfcc_similarity[~np.eye(number_classes, dtype=bool)].reshape(number_classes,-1)
# Display the self-similarities and cross-similarities (with mean and +-std)
plt.figure(figsize=(17, 6))
plt.subplot(2, 1, 1)
plt.errorbar(np.arange(number_classes), np.mean(cqhc_crosssimilarity, axis=1), \
yerr=np.std(cqhc_crosssimilarity, axis=1), ecolor='r', color='r')
plt.plot(cqhc_selfsimilarity, color='g')
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylim(0, 1)
plt.grid()
plt.title('CQHC instrument family self-similarities (green) and cross-similarities (mean+-std/red)')
plt.xticks(np.arange(number_classes), class_names, rotation=45)
plt.ylabel('Similarity')
plt.subplot(2, 1, 2)
plt.errorbar(np.arange(number_classes), np.mean(mfcc_crosssimilarity, axis=1), \
yerr=np.std(mfcc_crosssimilarity, axis=1), ecolor='r', color='r')
plt.plot(mfcc_selfsimilarity, color='g')
plt.autoscale(enable=True, axis='x', tight=True)
plt.ylim(0, 1)
plt.grid()
plt.title('MFCC instrument family self-similarities (green) and cross-similarities (mean+-std/red)')
plt.xticks(np.arange(number_classes), class_names, rotation=45)
plt.ylabel('Similarity')
plt.tight_layout()
plt.show()
print(f'Average CQHC self-similarity = {np.mean(cqhc_selfsimilarity)}')
print(f'Average CQHC cross-similarity = {np.mean(np.mean(cqhc_crosssimilarity, axis=1))} and std = {np.mean(np.std(cqhc_crosssimilarity, axis=1))}')
print(f'Average MFCC self-similarity = {np.mean(mfcc_selfsimilarity)}')
print(f'Average MFCC cross-similarity mean = {np.mean(np.mean(mfcc_crosssimilarity, axis=1))} and std = {np.mean(np.std(mfcc_crosssimilarity, axis=1))}')
Average CQHC self-similarity = 0.6053008122676642
Average CQHC cross-similarity = 0.5003323385285742 and std = 0.0878082321438902
Average MFCC self-similarity = 0.9172353053976425
Average MFCC cross-similarity mean = 0.9051245096252998 and std = 0.0128273223146673