Chord Identification in Real-Time Using a Neural Network
Giró Serratosa, Daniel
Curs 2018-2019
Director: Sergio Ivan Giraldo Mendez
GRAU EN ENGINYERIA EN SISTEMES AUDIOVISUALS
Treball de Fi de Grau
Chord Identification in Real-Time Using a Neural
Network
Daniel Giró Serratosa
TREBALL FI DE GRAU
Enginyeria en Sistemes Audiovisuals
ESCOLA SUPERIOR POLITÈCNICA UPF
2019
DIRECTOR DEL TREBALL
Sergio Ivan Giraldo Mendez
Acknowledgements
First of all, I would like to thank Sergio Giraldo for supervising this work and helping me build it. I would also like to thank my colleagues, friends and family for their wise advice and their understanding.
Abstract
The need for musicians to transcribe a piece of music by ear is a well-known problem, since the score is not always available. Transcribing by ear is a difficult task, and for this reason automatic music transcription has been researched over the years.
In this work we focus on the task of chord recognition, for which we have implemented a real-time musical chord detection system. First, we build a data set of audio recordings consisting of all the major and minor chords played on a guitar in different ways. We then label the data using 25 classes: 12 for the major chords, 12 for the minor chords, and a final "None" class. From the audio signal we extract a feature vector using the chromagram, which is obtained by a spectral analysis that groups the peaks corresponding to notes into bins. With these features the system is trained to classify the considered chord classes.
The first models cover major and minor chords; in the future, more chord types such as augmented, diminished and suspended chords will be added.
The system is a real-time implementation: the sound coming through the microphone passes through the system, which returns a chord prediction. The objective is to build a real-time interface that gives visual feedback to the user.
Resum
The idea for this work arises from musicians' need to transcribe a piece of music by ear: a score is not always available, and transcribing by ear can be a great tool, but it is a very difficult task, which is the reason this work was developed.
This work presents a real-time musical chord detection system. First, we build a database of audio recordings consisting of all the major and minor chords played on a guitar in different ways. We then label the data using 25 classes: 12 for the major chords, 12 for the minor chords, and a final "None" class. From the audio signal we extract a feature vector called the chromagram. The chromagram counts all the notes that are sounding and groups them into bins, one bin per note. With these features we train a system to learn to distinguish and classify these chords.
The first models only implemented major and minor chords; in the future other chord types such as augmented, diminished and suspended chords will be added.
The difficulty of this method lies in the real-time implementation: the sound enters through a microphone, passes through the system and a chord is returned. The objective is to build a visual interface that gives the user visual feedback in real time.
Preface
The Bachelor's Thesis presented below, entitled "Chord Identification in Real-Time Using a Neural Network", arises from the difficulty musicians have in transcribing harmony. The project was carried out under the supervision of Sergio Iván Giraldo, whom I want to thank for his great guidance and help in this work.
Contents Page
Abstract............................................................ 7
Preface............................................................. 9
List of Figures..................................................... 13
List of Tables...................................................... 15
1. INTRODUCTION.................................................... 17
1.1. The Problem.................................................... 17
1.2 Related Work.................................................... 17
1.3 Objectives ...................................................... 18
2. STATE OF THE ART................................................ 21
2.1. Background...................................................... 21
2.2 Harmony Theory............................................... 21
a) Musical Chords............................................. 21
b) Chord Inversions........................................... 22
c) Extended Chords........................................... 26
3. MATERIALS AND METHODS............................... 27
3.1. External Codes................................................ 27
3.2 Database.......................................................... 29
a) First database trial......................................... 29
b) Offline database............................................ 29
c) Real-time database........................................ 31
4. METHODOLOGY..................................................... 33
4.1. Feature Extraction............................................ 33
a) Offline implementation................................. 33
b) Real-time implementation............................. 36
4.2 Classifier........................................................... 37
a) Offline implementation................................. 37
b) Real-time implementation............................. 37
5. RESULTS.................................................................. 39
5.1. Parameters definition....................................... 39
a) Peak Threshold.............................................. 40
b) Number of peaks........................................... 42
c) Peak Weights................................................. 44
d) Ponderation................................................... 45
e) Octave Weights............................................. 46
5.2 Classifier.......................................................... 47
a) Setting a baseline........................................... 47
b) The classifiers ............................................... 47
6. DISCUSSION……………………………............. 51
6.1. Conclusion………………………….............. 51
6.2 Future Work………………………................. 52
Bibliography…………………………………............... 54
List of Figures
Page
Fig. 1. Seriousness of Errors............................................ 19
Fig. 2. Harmonic Analysis in Chord Progressions.......... 20
Fig. 3. Helix of Fifths………………………………….. 23
Fig. 4. Different Positions of C Major…………………. 24
Fig. 5. D minor Chromagram………………………….. 25
Fig. 6. Bb major Chromagram……………………..….. 25
Fig. 7. G diminished Chromagram……………………. 26
Fig. 8. F-measure equation……………………………. 27
Fig. 9. Precision and Recall equation…………………. 27
Fig. 10. SVM Hyperplanes……………………………. 28
Fig. 11. MIDI Recordings in DAW………………….... 30
Fig. 12. Labeling in Sonic Visualizer…………………. 30
Fig. 13. Arff File Structure……………………………. 31
Fig. 14. STFT Equation……………………………….. 33
Fig. 15. Irrelevant Peaks in FFT………………………. 33
Fig. 16. Frequency to Note Algorithm………………... 34
Fig. 17. RMS Equation………………………………... 35
Fig. 18. Threshold…………………………………….. 35
Fig. 19. Overlapped Chromagram…………………….. 36
Fig. 20. Algorithm Scheme……………………………. 36
Fig. 21. Ponderation Function…………………………. 37
Fig. 22. Program Scheme………………………………. 38
Fig. 23. Chromagram with Thresholds……………….. 41
Fig. 24. Accuracy depending on Thresholds………….. 41
Fig. 25. Chromagram with different N………………... 43
Fig. 26. Accuracy depending on N………………….. 43
Fig. 27. Chromagram with Peak Weights……………. 44
Fig. 28. Accuracy depending on Peak Weights………. 44
Fig. 29. Chromagram applying Ponderation………….... 45
Fig. 30. Chromagram with Octave Weight…………….. 46
Fig. 31. Accuracy depending on Octave Weight………. 46
Fig. 32. Baseline Precision, Recall and F-measure……. 47
Fig. 33. Precision of Classifiers……………………….. 48
Fig. 34. Recall of Classifiers………………………….. 48
Fig. 35. F-measure of Classifiers……………………... 49
List of Tables
Page
Table 1. Chord Definition................................................ 21
Table 2. Intervalic Relation (IR) in Chords....... 22
Table 3. IR in First Inversions…………………………. 22
Table 4. IR in Second Inversions………………………. 23
Table 5. IR with Octaves in Chords……………………. 24
Table 6. IR with Octaves in Dispositions………...……. 24
Table 7. C Major Chromagram………………………... 34
1. INTRODUCTION
1.1 The Problem
A major challenge for musicians and music students is the recognition of chords and harmonic progressions. Mastering this skill has important implications when performing in "jam sessions" (musical events where musicians, usually instrumentalists, improvise music without extensive preparation), and also when transcribing a musical piece by ear. However, mastering this skill is not an easy task. This is the motivation for investigating automatic methods to transcribe musical harmony (i.e. chord progressions) in real time.
An application scenario for this method could be the following: musicians are accustomed to searching the Internet for the chords of a musical piece. These chord transcriptions are usually made by average users with little musical training. Therefore, there may be errors, a user may be looking for the chords of a specific version (e.g. with a reharmonization), or the chord progression for a specific piece might not be available at all. A system of this nature allows musicians to process the audio in real time and obtain a visualization of the ongoing chords.
1.2 Related Work
Automatic chord recognition has been researched in the past within the domain of audio description. Fujishima [1] developed a chord recognition system using a chromagram, which analyses pitches categorized as notes. Gómez and Herrera [2] developed a system that extracts tonal information by building a Harmonic Pitch Class Profile (HPCP), a vector of low-level instantaneous features that represents the pitch intensity content of polyphonic music signals, mapped to a single octave.
The main idea of this method is to extract the harmony of a musical piece, so beyond the tonal descriptors themselves there are other important considerations. For example, Christopher Harte, Mark Sandler and Martin Gasser's work on detecting harmonic change in musical audio [3] builds a Harmonic Change Detection Function (HCDF) based on the chromagram; it is an example of how tonal descriptors can be applied. Masataka Goto's work on a real-time beat tracking system [4] generates a hierarchical beat structure that helps a lot in determining where a chord has changed, but I have decided not to focus on the time domain; incorporating it could be a great improvement to this work.
When it comes to harmonic transcription, what musicians mostly do is focus on the bass and compare its relation with the tonality. So Satoru Hayamizu's work on detecting melody and bass [5] is interesting, together with Emilia Gómez's work on tonality [6], but I have finally focused on the relation between notes and how they form chords, especially Kyogu Lee's work [7] and Emilia Gómez's methods.
Emilia Gómez developed several methods to extract tonal descriptors [8], which helped a lot throughout the process of building this method. Together with Joan Serrà, she also developed a cover identification system based on tonal descriptors [9]; it compares the features of an original song and its cover, so it is not directly relevant to this work, but it helps to understand how tonal descriptors work.
Our method has two parts: the feature extraction and the classifier, which uses machine learning algorithms. The method is implemented in real time, so what Pascanu, Gulcehre, Cho and Bengio explain in their work on Recurrent Neural Networks [10] is relevant: the system needs an input that changes over time and an output that does too, so an RNN seems a good solution. Honglak Lee, Yan Largman, Peter Pham and Andrew Y. Ng applied deep belief networks to audio classification [11], and Philippe Hamel and Douglas Eck used them to learn features from music audio [12]. These articles helped me understand how my system should work.
1.3 Objectives
This work proposes a tool for musicians: a real-time harmonic transcription of music, using the chromagram as a feature and machine learning techniques.
My main goal is to achieve good accuracy in this classification task, first of all in an offline method: a song is passed through the program and the program returns the right chord, allowing the musician to simply sight-read the chords, as they do nowadays with the Internet. But the Internet does not have every song, or it may lack the particular reharmonized version you want; with this method you can look up the chords of any music audio.
The idea is to implement it in real time. So, once I get a good accuracy offline, I check the performance in real time, and the objective is to achieve a good result there too. The output has to be understandable and logical for the musician. In other words, if the guessed chord is wrong, that is obviously a bad result, but there can be other problems that are not necessarily programming mistakes yet make the result confusing: for example, a chord that changes too fast because it is not being detected correctly, or enharmonic problems of the tonality.
Computing an accuracy is easy, but it does not always reflect whether the system behaves correctly. One misclassified chord can be better or worse than another, because the tonal function of the chord also matters (the harmony theory is explained in the Harmony Theory subsection of the State of the Art section). For example, suppose we have a song in C major: if the program returns an F major but the real chord is D minor, it is a mistake, but a better mistake than if the program had returned a Bb, because the tonal function of F major and D minor is the same, subdominant. The same goes for Db7 instead of G7 (with C as the tonic), because Db7 is the tritone substitute of G7 and they perform the same function, dominant.
It is difficult to define a formula that evaluates performance while taking into account the tonal function or other important features of harmonic progressions. In most cases, if the chords the program has confused share some notes, the mistake is less serious.
Here are some examples of possible program confusions (the tonic is C major):

F major {F, A, C}     vs  D minor {D, F, A}        Less serious mistake
Db7 {Db, F, Ab, B}    vs  G7 {G, B, D, F}          Less serious mistake
C major {C, E, G}     vs  G major {G, B, D}        Serious mistake
F major {F, A, C}     vs  Ab minor {Ab, Cb, Eb}    Very serious mistake
Fig. 1. Seriousness of Errors
A future objective is to improve the system by implementing new chords and making the algorithm more efficient. The main tools I have used to build it are openFrameworks, to extract the features, and Wekinator, to perform the classification. One improvement would be to substitute the Wekinator application with code in openFrameworks: Wekinator is a great tool, open source and very practical, but it has some limitations when it comes to training or choosing the classifier, so a classifier implemented in openFrameworks would be more modifiable and would give better results.
Another option to improve the system is to implement, in addition to chord recognition, a tonal-function detection system: in other words, to identify the chord not by looking at its notes, but by looking at its tonal function and the tonal elements. A real-time implementation of this, however, is almost impossible.
Combining a chord recognition algorithm and a tonal-function recognition algorithm could give good results. Here are some examples of commonly used harmonic progressions that are simpler to read as tonal functions:
C major F major Ab minor Eb7 D minor G7
{C, E, G} {F, A, C} {Ab, Cb, Eb} {Eb, G, Bb, Db} {D, F, A} {G, B, D, F}
I IV [VI-bs IIbs] II V
C major A7 D7 G7
{C, E, G} {A, C#, E, G} {D, F#, A, C} {G, B, D, F}
I [Ve] [Vs] V
Fig. 2. Harmonic Analysis in Chord Progressions
2. STATE OF THE ART
2.1 Background
There has been a lot of research on music and audio descriptors. Most of these studies did not influence this work directly, but they helped to shape the idea of this project: articles that describe audio descriptors but focus on aspects of music other than harmony, such as rhythm, melody or texture.
Some articles help to understand audio descriptors: articles that explain music similarity methods [13], tonal descriptors like the HPCP and how useful these descriptors can be [9], or some machine and deep learning algorithms [10] [11].
This method uses the chromagram as a feature and some machine learning algorithms: KNN (k-nearest neighbour), NN (neural network) and SVM (support vector machine).
This method does not use any pre-existing database; a database was built specifically for this work and is stored on GitHub [I].
It uses an existing application that opens an interface and extracts some features [II]; the feature extractor, the chromagram, has been implemented by modifying this application.
2.2 Harmony Theory
a) Musical Chords
A musical chord is defined by its chord type and its height.

             A major     E minor
Height:      A           E
Chord type:  Major       Minor

Table 1. Chord Definition

The chord height, or chord key, is defined by the root. There are several types of chords depending on the relations between the notes: triad chords have three notes (the root, the third and the fifth), and the pattern of tones and semitones between these notes defines the chord type.
Major:       1 -2t- 3 -1'5t- 5
Minor:       1 -1'5t- b3 -2t- 5
Augmented:   1 -2t- 3 -2t- #5
Diminished:  1 -1'5t- b3 -1'5t- b5

Table 2. Intervalic Relation (IR) in Chords
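The interval patterns of Table 2 can be checked programmatically. The following Python sketch is for illustration only (the thesis itself uses MATLAB and C++); it maps the semitone intervals of a root-position triad to its chord type, using 2t = 4 semitones and 1'5t = 3 semitones:

```python
# Illustrative sketch: classify a root-position triad from the semitone
# intervals between its notes, following Table 2.
# 2t = 4 semitones, 1'5t = 3 semitones.
TRIAD_TYPES = {
    (4, 3): "major",       # 1 -2t- 3 -1'5t- 5
    (3, 4): "minor",       # 1 -1'5t- b3 -2t- 5
    (4, 4): "augmented",   # 1 -2t- 3 -2t- #5
    (3, 3): "diminished",  # 1 -1'5t- b3 -1'5t- b5
}

def triad_type(midi_notes):
    """Classify a root-position triad given three MIDI note numbers."""
    root, third, fifth = sorted(midi_notes)
    return TRIAD_TYPES.get((third - root, fifth - third), "unknown")

# C major = C4, E4, G4 -> MIDI 60, 64, 67
print(triad_type([60, 64, 67]))  # major
```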
b) Chord Inversions
But the root is not always the lowest note: there are chord inversions, where the lowest note is sometimes the third or the fifth. These are the chord structures in the 1st inversion (lowest note is the third) and the 2nd inversion (lowest note is the fifth).
First inversion:
Major:       3 -1'5t- 5 -2'5t- 1
Minor:       b3 -2t- 5 -2'5t- 1
Augmented:   3 -2t- #5 -2t- 1
Diminished:  b3 -1'5t- b5 -3t- 1

Table 3. IR in First Inversions
Second inversion:
Major:       5 -2'5t- 1 -2t- 3
Minor:       5 -2'5t- 1 -1'5t- b3
Augmented:   #5 -2t- 1 -2t- 3
Diminished:  b5 -3t- 1 -1'5t- b3

Table 4. IR in Second Inversions
But the relation between notes and frequency is a helix.
Fig. 3. Helix of Fifths
So sometimes the relation of tones and semitones between the notes of a chord is the one above plus 12 semitones (the whole chromatic scale, i.e. one octave), but the colour of the chord does not change.
Major:  1 -8t (12st + 2t)- 3 -7'5t (12st + 1'5t)- 5

Table 5. IR with Octaves in Chords
These are different dispositions of the same chord, C major (C, E, G):
Fig. 4. Different Positions of C Major
Disposition 1:  1 -3'5t- 5 -4'5t- 3
Disposition 2:  1 -2t- 3 -7'5t (12st + 1'5t)- 5
Disposition 3:  3 -4t- 1 -3'5t- 5

Table 6. IR with Octaves in Dispositions
In the chromagram you can see the features of a chord (height and type). For example:

D minor:
Fig. 5. D minor Chromagram (bins C, C#, D, Eb, E, F, F#, G, Ab, A, Bb, B)

Bb major:
Fig. 6. Bb major Chromagram

G diminished:
Fig. 7. G diminished Chromagram
c) Extended Chords
There are also quatriad (four-note) chords and extended chords, which include additional notes such as the seventh or the ninth. This method does not take these chord types into account.
3. MATERIALS AND METHODS
3.1 External Codes
This work consists of two parts, the offline implementation and the real-time implementation, and each one has a feature extractor part and a classifier part.
I have used Ableton Live and Sonic Visualizer to build the database (explained in more detail in the Database section).
First of all, in the first trials, I used some VAMP plugins for Sonic Visualizer, such as HPCP - Harmonic Pitch Class Profile, developed by the Music Technology Group at Universitat Pompeu Fabra [III], or Invariant Pitch Chroma, developed by Queen Mary, University of London [IV], but after trying them and seeing bad results I dismissed the VAMP plugin option and developed my own feature extractor, modelled on the chromagram.
For the offline implementation, I used MATLAB to extract the features: I built code that reads all the files from the database and extracts features from them. One function in the feature extraction code, which converts frequencies to notes, was taken from the MathWorks repository [V].
All the features are stored in txt files, which are read by a C++ text parser built in Microsoft Visual Studio 2017 [VI]. The output of the C++ code is read by the Weka software [VII], which performs the classification.
Weka offers a variety of classifiers and, depending on the project, one will work better than another. To measure how well the system performs, the F-measure (F1) is computed.
Fig. 8. F-measure equation
Where,
Fig. 9. Precision and Recall equation
tp = true positive
fp = false positive
fn = false negative
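Figs. 8 and 9 are images in the original; the standard definitions they show are:

```latex
\text{Precision} = \frac{tp}{tp + fp}, \qquad
\text{Recall} = \frac{tp}{tp + fn}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```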
To determine whether the F-measure is good enough, the classifiers are compared with the ZeroR classifier.
The ZeroR classifier classifies every instance as the class with the most instances. In this work there are many classes, one per chord, so the ZeroR F-measure value is very low.
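A minimal sketch of the ZeroR baseline, in Python for illustration (Weka provides its own implementation):

```python
from collections import Counter

# Illustrative ZeroR baseline: always predict the majority class
# seen during training.
class ZeroR:
    def fit(self, labels):
        self.majority = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, n):
        return [self.majority] * n

# With 25 roughly balanced chord classes, this baseline is right only
# about 1/25 of the time, which is why its F-measure is so low.
clf = ZeroR().fit(["C", "C", "Dm", "G", "C"])
print(clf.predict(2))  # ['C', 'C']
```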
First of all, I tried a Multilayer Perceptron [14]. A Multilayer Perceptron consists of at least three layers of nodes (the input layer, one or more hidden layers and the output layer), where each node represents a neuron. It is trained with a supervised learning technique called backpropagation.
After multiple trials with different parameters, I tried a Support Vector Machine (SVM) classifier [15]. An SVM classifier is a supervised learning model whose training algorithm places a hyperplane that separates the classes. The dimension of the space can be high, and the best hyperplane is the one that separates the classes by the largest margin.
In the image, H1 does not separate the classes correctly; H2 separates them, and H3 separates them with a larger margin, so H3 is the best hyperplane.
Fig. 10. SVM Hyperplanes
Finally, I tried a k-nearest neighbour (KNN) classifier [16]. A KNN classifier assigns a sample to a class by looking at the closest training examples in an n-dimensional space (n being the number of inputs).
The execution results of these methods are in the Results section.
For the real-time implementation, MATLAB is no longer a good tool, so I tried to build C++ code in Visual Studio to extract the audio features. I looked for existing functions, such as a Fast Fourier Transform (FFT) [VIII] or functions like findpeaks that MATLAB already implements [IX], but in the end I did not use them, because implementing the functions myself was faster than understanding their inner workings and making them work.
I also tried to call MATLAB from the C++ code using the MATLAB Engine API for C++ [X], but in the end I opted for Essentia. Essentia [XI] is an open-source library for audio and music analysis; however, it was only available for Mac, and I have used Windows. So I found an openFrameworks app [XII] in Visual Studio that extracts audio features such as MFCCs and the FFT in real time [XIII]. I have used this code, especially the FFT, to compute the chromagram.
The functions used in the offline implementation that MATLAB already provides, I have rewritten myself here.
I tried to implement all the machine learning classifiers inside the app using a neural network addon [XIV], but I could not make them work together, so I use Wekinator [XV]. Wekinator is open-source software that allows machine learning algorithms to be used in real time.
The openFrameworks app and Wekinator communicate using the OSC protocol [XVI].
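For illustration, here is a Python sketch of sending a 12-bin chromagram to Wekinator over OSC using only the standard library. Port 6448 and address /wek/inputs are Wekinator's usual defaults, but check your project's settings; the chroma values below are taken from the Arff example later in this chapter:

```python
import socket
import struct

def osc_pad(s: bytes) -> bytes:
    """Null-terminate and pad to a 4-byte boundary, per the OSC spec."""
    return s + b"\x00" * (4 - len(s) % 4)

def osc_message(address: str, floats) -> bytes:
    """Build a minimal OSC message carrying float32 arguments."""
    msg = osc_pad(address.encode())
    msg += osc_pad(b"," + b"f" * len(floats))  # type tag string
    for f in floats:
        msg += struct.pack(">f", f)            # big-endian float32
    return msg

# A 12-bin chromagram, sent to Wekinator's default input port/address.
chroma = [0.2573, 0.2863, 0.2573, 0.3470, 0.3239, 0.2631,
          0.2805, 0.2949, 0.2776, 0.2516, 0.3181, 0.2891]
packet = osc_message("/wek/inputs", chroma)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(packet, ("127.0.0.1", 6448))
```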
3.2 Database
a) First database trial
This system needs a database where the chords are well labelled and separated: since the feature extraction of the training part labels windows, it is important that the chords are well separated.
The first database trials consisted of using the chords of a song, but the labelling process was hard, slow and inaccurate. For this reason, a database called DB Guitar Chords was created. It consists of all the major and minor chords played in 8 different positions, simulating different chord positions on a guitar.
b) Offline database
The complexity of this database resides in the labelling, which is slow. For this reason, a group of MIDI tracks was recorded, where each track has all the major and minor chords in 8 guitar positions. All the chords are ordered within the track: first the track has all the C major chords, then all the C# major chords, and so on. In addition, the spaces between chords and the durations of the chords are always the same. There are three types of track: ones that play all the notes of the chord at the same time and let them ring, ones that play the notes as an arpeggio, and ones that strum the chord. Each type of track has 8 different MIDI velocities. Obtaining these MIDI tracks is fast, because it simply consists of recording 3 tracks (chord, arpeggio and strumming); all the other tracks are derived quickly and easily in the DAW.
Fig. 11. MIDI Recordings in DAW
Once we have the MIDI tracks, we export them as audio. We use Kontakt 5 and Native Instruments libraries to simulate the different guitar sounds. Using virtual instruments is interesting because it guarantees that the audio tracks do not contain any unwanted noise.
The final result is 60 audio tracks, each containing all the chords in 8 different positions. The chords are ordered, so the labelling process is very easy. Once we have the audio files, we define regions, where each region belongs to a chord; these regions are defined by creating a txt file with Sonic Visualizer. Once the regions and the chord of each region are defined, every window that falls temporally within a region is labelled with that region's chord.
Fig. 12. Labeling in Sonic Visualizer
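The labelling rule described above can be sketched as follows (a Python illustration with hypothetical region times; the actual assignment is done in the C++ code described next):

```python
# Hypothetical regions exported from Sonic Visualizer: (start_seconds, chord).
regions = [(0.0, "C"), (2.0, "C#"), (4.0, "D")]

def label_window(window_start, regions, default="None"):
    """Return the chord of the last region starting at or before the window."""
    label = default
    for start, chord in regions:
        if window_start >= start:
            label = chord
        else:
            break
    return label

hop = 0.5  # window hop in seconds (illustrative value)
labels = [label_window(i * hop, regions) for i in range(10)]
print(labels)  # ['C', 'C', 'C', 'C', 'C#', 'C#', 'C#', 'C#', 'D', 'D']
```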
The assignment of chords to windows is performed in C++ code. It reads all the audio files from the database, splits them into windows and extracts the corresponding features of each window; once it has the feature vector for a window, the corresponding chord is appended to the end of the vector (checking the regions txt file and the chords txt file). All the feature vectors are written to an Arff file, so they can be used later in Weka.
The Arff file has this structure:
@RELATION Chords
@ATTRIBUTE Band_0 NUMERIC
@ATTRIBUTE Band_1 NUMERIC
@ATTRIBUTE Band_2 NUMERIC
@ATTRIBUTE Band_3 NUMERIC
@ATTRIBUTE Band_4 NUMERIC
@ATTRIBUTE Band_5 NUMERIC
@ATTRIBUTE Band_6 NUMERIC
@ATTRIBUTE Band_7 NUMERIC
@ATTRIBUTE Band_8 NUMERIC
@ATTRIBUTE Band_9 NUMERIC
@ATTRIBUTE Band_10 NUMERIC
@ATTRIBUTE Band_11 NUMERIC
@ATTRIBUTE Chord
{None,C,C#,D,Eb,E,F,F#,G,Ab,A,Bb,B,Cm,C#m,Dm,Ebm,Em,Fm,F#m,Gm,Abm,Am,
Bbm,Bm}
@DATA 0.2573,0.2863,0.2573,0.3470,0.3239,0.2631,0.2805,0.2949,0.2776,0.2516,0.3181,0.2891, E
0.2455,0.2835,0.2630,0.3507,0.3215,0.2747,0.2922,0.2747,0.2805,0.2484,0.3185,0.2922, E
0.2477,0.2831,0.2536,0.3539,0.3214,0.2654,0.3008,0.2713,0.2890,0.2448,0.3273,0.2831, E
0.2457,0.2872,0.2516,0.3434,0.3198,0.2576,0.3109,0.2813,0.2842,0.2457,0.3227,0.2931, E
0.2428,0.2902,0.2369,0.3169,0.3287,0.2665,0.3139,0.2873,0.2843,0.2517,0.3228,0.3021, E
...
Fig. 13. Arff File Structure
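For illustration, here is a minimal Python sketch that writes feature vectors in the structure shown above (the real pipeline does this from C++; the example row is hypothetical):

```python
# The nominal class declaration used in the Arff file above.
CHORDS = ("None,C,C#,D,Eb,E,F,F#,G,Ab,A,Bb,B,Cm,C#m,Dm,Ebm,Em,Fm,F#m,"
          "Gm,Abm,Am,Bbm,Bm")

def write_arff(path, rows):
    """rows: iterable of (12-float chroma vector, chord label) pairs."""
    with open(path, "w") as f:
        f.write("@RELATION Chords\n\n")
        for i in range(12):
            f.write(f"@ATTRIBUTE Band_{i} NUMERIC\n")
        f.write(f"@ATTRIBUTE Chord {{{CHORDS}}}\n\n@DATA\n")
        for chroma, chord in rows:
            f.write(",".join(f"{v:.4f}" for v in chroma) + f",{chord}\n")

# Hypothetical single instance: a flat chromagram labelled "E".
write_arff("chords.arff", [([0.2573] * 12, "E")])
```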
c) Real-time database
This database was built to train the real-time model, since a model trained in Weka cannot be loaded into Wekinator. So we have taken the audio files from the offline database, passed them through the program, extracted the features and manually delimited the chords and regions in Wekinator in real time. This is a more difficult and slower task than building the previous database, so not all the files are used, because there are too many. The real-time database has three audio files (from the offline database): one with all the notes of each chord sounding at the same time, one with arpeggiated chords and one with strummed chords; each audio file contains all the chords.
4. METHODOLOGY
4.1 Feature Extraction
a) Offline implementation
The feature extractor is based on the chromagram. The chromagram counts certain frequencies and their multiples, in other words, the musical notes.
We perform an STFT with Blackman-Harris windowing on the audio signal,
Fig. 14. STFT Equation
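Fig. 14 is an image in the original; the STFT it shows follows the standard definition, here written with analysis window w(n) of length N and hop size H:

```latex
X(m, k) = \sum_{n=0}^{N-1} x(n + mH)\, w(n)\, e^{-j 2 \pi k n / N}
```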
then we extract the local maxima (peaks) of the spectrum, sort the peaks in descending order (so the highest ones come first) and normalize them.
We keep a part of the peaks and discard the rest: we keep the first length/8 (they are sorted). This step discards all the maxima that are irrelevant; here is a graphical example:
Fig. 15. Irrelevant Peaks in FFT
We transform the peaks into notes using the following algorithm:
Fig. 16. Frequency to Note Algorithm
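The algorithm in Fig. 16 is not reproduced here, but the standard frequency-to-pitch-class mapping it relies on can be sketched in Python as follows (assuming A4 = 440 Hz tuning):

```python
import math

# Map a peak frequency to the nearest MIDI note, then to one of the
# 12 pitch-class bins. A4 = 440 Hz = MIDI 69 is the assumed reference.
NOTE_NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]

def freq_to_midi(freq):
    return round(69 + 12 * math.log2(freq / 440.0))

def freq_to_pitch_class(freq):
    return NOTE_NAMES[freq_to_midi(freq) % 12]

print(freq_to_pitch_class(261.63))  # C (middle C)
print(freq_to_pitch_class(440.0))   # A
```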
We then compute the weights. There are two types of weights: peak weights and octave weights. The peak weights give more importance to the notes that sound louder, which helps to give less importance to higher harmonics, but they can confuse the program if some instrument or voice plays or sings a note outside the chord. The octave weights give less importance to higher harmonics and more to the lower notes, the core of the chord; the problem is that the FFT is less accurate at low frequencies, so these weights are difficult to determine.
Once the weighting is done, the only task remaining is to count every note: we take all the peaks belonging to a C, count them and place the result in the C bin. There are 12 bins, one per note. So finally we will have something like this:
Table 7. C Major Chromagram
We can now tell which chord has been sounding: we only have to look at the three highest values (in the case of triads) or more (this task is performed by the classifier, so we do not have to inspect anything ourselves).
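The weighted counting described above can be sketched as follows (a Python illustration with hypothetical weights and peak values, not the tuned values from the Results section):

```python
# Each spectral peak (already converted to a MIDI note) adds its
# weighted amplitude to one of the 12 pitch-class bins.
def chromagram(peaks, octave_weights=None):
    """peaks: list of (midi_note, peak_amplitude) tuples."""
    bins = [0.0] * 12
    for note, amp in peaks:
        octave = note // 12
        w = octave_weights[octave] if octave_weights else 1.0
        bins[note % 12] += amp * w
    return bins

# C major chord: C3, E3, G3 plus a weaker C5 harmonic (amplitudes made up).
peaks = [(48, 1.0), (52, 0.8), (55, 0.9), (72, 0.5)]
chroma = chromagram(peaks)
print(chroma[0], chroma[4], chroma[7])  # 1.5 0.8 0.9
```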
But sometimes in music there is no chord: in a solo by a lead instrument, such as a brass instrument without any accompaniment, in a percussive fragment, or simply in a silence. We have developed a very simple trick to handle this.
We calculate the RMS (root mean square) of the FFT:
Fig. 17. RMS Equation
We compare the RMS value with a threshold: if it is below the threshold, the chord is "None"; if it is above, the chord is the output of the classifier, the predicted one.
Fig. 18. Threshold
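The RMS check can be sketched as follows (a Python illustration; the threshold value here is arbitrary, not the one tuned in the Results section):

```python
import math

# Compute the RMS of the FFT magnitudes and output "None" when it
# falls below a threshold; otherwise defer to the classifier.
def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))

def detect(fft_magnitudes, classify, threshold=0.05):
    if rms(fft_magnitudes) < threshold:
        return "None"
    return classify(fft_magnitudes)

# Silence yields "None"; a loud frame is passed to the classifier.
print(detect([0.0] * 1024, classify=lambda x: "C"))  # None
print(detect([0.5] * 1024, classify=lambda x: "C"))  # C
```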
This feature extractor has a trade-off between note accuracy and time accuracy.
If the method counts the notes in only one window, it is very exposed to errors, so the solution is to count by summing the notes over a group of windows. Noise is random, so the noise peaks change from window to window; summing increases the difference between the note bins, making the count less exposed to errors.
But in a song the chords are constantly changing, so if the count is performed over, say, 100 windows, the chromagram will overlap several chords, giving a confusing result, and they will not be classified correctly.
Fig. 19. Overlapped Chromagram
b) Real-Time Implementation
The real-time implementation works like the offline implementation. It performs the windowing in real time and then applies all the steps described in the offline implementation section: finding the peaks, selecting the highest ones, transforming them into notes and placing them into bins.
So the whole structure looks like this:
Fig. 20. Algorithm Scheme
There is a weighting (ponderation) applied to the FFT; I have defined a function (better explained in the Results section) that looks like this:
Fig. 21. Ponderation Function

4.2 Classifier
a) Off-line implementation
The classifier is a KNN (k-nearest neighbours). The KNN algorithm is trained with multidimensional vectors in a feature space, each vector carrying a class label. In the classification task, the algorithm places the new vector in that space, calculates the Euclidean distance to the stored training vectors, and predicts the class most common among the k closest ones (each class corresponds to a chord).
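A minimal sketch of this nearest-neighbour classification (illustrative, not the Weka implementation; the training pairs used below are hypothetical):

```python
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train, query, k=3):
    """train: list of (chromagram_vector, chord_label) pairs.
    Vote among the k training vectors closest to the query."""
    nearest = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```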
The classifier is also trained with examples of musical parts with no chords: percussion parts like a drum groove, strumming the guitar with muted strings to make a percussive sound, or a melody alone with no harmony behind it. With this training the program differentiates the parts without harmony from the parts with chords.
Weka is a comprehensive machine-learning tool: it helps you apply different classifiers to your data, and the results were good, as shown in the Experimental Results section. But it cannot be used in real time; for that I have used Wekinator.
b) Real-Time Implementation
Wekinator allows users to use machine learning in real time, in this case a neural network. It would have been ideal to load a classifier trained in Weka into Wekinator and use it to classify the chords in real time, but that is not possible: Wekinator and Weka are free, open-source software, and in some aspects they are limited.
So I had to train the system again. I still have the chord database, so retraining is fast, but not as extensive as the training done with Weka.
The classifier works as follows: it receives as input the features of the signal, the chromagram, so it has 12 inputs. It performs the classification and sends the result.
The Wekinator project initially has 24 output nodes, one per chord (12 major and 12 minor); the idea is to add more chord types in the future. Wekinator returns 24 values, and in the program we keep the highest one and look up which node it is: that node is the predicted chord.
These messages are exchanged using the OSC protocol. Wekinator receives the features in a message sent by the program on port 6448 (the default) and sends the result back through port 12000. The whole structure looks like this:
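As an illustration, a minimal OSC message carrying the 12 chroma features can be hand-encoded and sent to port 6448 like this (a sketch, not the project code; "/wek/inputs" is Wekinator's default input address):

```python
import socket
import struct

def osc_message(address, floats):
    """Encode a minimal OSC message: null-padded address string, a type-tag
    string with one 'f' per value, then big-endian 32-bit floats."""
    def pad(b):
        # OSC strings are null-terminated and padded to a 4-byte boundary.
        return b + b"\x00" * (4 - len(b) % 4)
    tags = "," + "f" * len(floats)
    return pad(address.encode()) + pad(tags.encode()) + struct.pack(">%df" % len(floats), *floats)

# Send the 12 chroma features to Wekinator on its default input port.
chroma = [0.0] * 12
packet = osc_message("/wek/inputs", chroma)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(packet, ("127.0.0.1", 6448))
```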
Fig. 22. Program Scheme
5. RESULTS
5.1 Parameter definition
To improve the precision of the program, there are some parameters that modify the
feature extractor and improve its performance.
The program counts peaks (translated to notes), so it is important to define which peaks are important (chord notes) and which are not (noise, or upper harmonics that correspond to notes outside the chord).
So, the first parameter is the peak threshold: if a peak is below the threshold it is not counted as a note, and the threshold also defines the silences between chords. But sometimes there is noise above the threshold, so the solution is not to count every peak but only the n highest ones. This way a higher peak carries more importance and is not counted the same way as a lower one.
There is also another weighting to counteract the FFT characteristics. The FFT has uniform resolution across all frequencies, but our perception of frequency is logarithmic, so in the low frequencies the FFT cannot resolve the notes correctly: a roughly 10 Hz bin is less than a semitone in the high register, but may span around two tones for the lowest notes.
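We can quantify this with the analysis parameters used later (window size 4096 at 44100 Hz):

```python
SR = 44100
N_FFT = 4096
bin_width = SR / N_FFT  # ~10.77 Hz per FFT bin

def semitone_width(freq):
    """Width in Hz of one equal-tempered semitone starting at `freq`."""
    return freq * (2 ** (1 / 12) - 1)

# Around A4 (440 Hz) a semitone spans ~26 Hz, well above the bin width,
# but around E2 (82.4 Hz) it spans only ~4.9 Hz, so one bin covers
# roughly two semitones there.
```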
So the ponderation is a function that gives more importance to the high frequencies. But as frequency increases there are more harmonics, and these harmonics may be notes outside the chord: in a C chord (C, E, G), the partials also produce Bb, G#, B, F and D.
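This list of out-of-chord harmonics can be verified with a short computation (equal-tempered rounding assumed; A# is written for Bb):

```python
import math

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def partial_pitch_class(root_pc, n):
    """Pitch class of the n-th harmonic of a note, rounded to the
    nearest equal-tempered semitone."""
    return (root_pc + round(12 * math.log2(n))) % 12

# Pitch classes produced by partials 1..8 of the C major triad (C, E, G):
chord = [0, 4, 7]
produced = {NOTES[partial_pitch_class(pc, n)] for pc in chord for n in range(1, 9)}
extras = produced - {NOTES[pc] for pc in chord}
# extras is exactly the out-of-chord set named in the text: A# (Bb), G#, B, F, D
```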
There are other parameters that are not adjustable because of the frequency-time trade-off, such as the FFT size, the overlap, and the number of previous windows that also contribute to the count. The parameters I have used are the following:
Window size = 4096 samples
Overlap = 4 windows
Sampling Rate = 44100 samples/s
Accumulative = 5 windows
And we can calculate that each analysis spans from the current window back to 0.232 seconds behind it. We calculate it using this formula:
The results are reported as precision, recall and F-measure, using a neural network (NN) as the classifier, because it is the classifier used in the real-time implementation. All these results are computed in the off-line implementation using cross-validation.
a) Peak Threshold
The threshold must not be too high, because it is important that no note of the chord gets lost; if it is a low value, the removed peaks will be only noise. Here is the evaluation with different thresholds:
Fig. 23. Chromagram with Thresholds
Fig. 24. Accuracy depending on Thresholds
To compute the evolution of the accuracy depending on the threshold, we have kept the other parameters constant.
b) Number of peaks
The peaks are sorted in descending order, with the first ones being the highest, so it is interesting to keep the n first in order to remove noise and keep only the important peaks (notes). The number of peaks the program detects is not always the same; in fact, it is constantly changing, so defining the cutoff as a fixed number is dangerous, and it is better to express it as a fraction of the total number of peaks. I have tried different fractions. Here we can see how a chromagram changes depending on the number of peaks kept:
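A fraction-based peak selection can be sketched like this (illustrative; the fraction value is an example):

```python
def keep_top_fraction(peaks, fraction=0.25):
    """peaks: list of (frequency, height) pairs. Keep only the highest
    `fraction` of the detected peaks, since the count varies per frame."""
    n = max(1, round(len(peaks) * fraction))
    return sorted(peaks, key=lambda p: p[1], reverse=True)[:n]
```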
Fig. 25. Chromagram with different N
Fig. 26. Accuracy depending on N
To compute the evolution of the accuracy depending on the number of peaks, we have kept the other parameters constant.
c) Peak Weights
It is interesting not to count a low peak and a high one the same way, so when the program computes the chromagram, each note is weighted by the height of its peak before being added to the sum. Here is the difference between applying peak weights and not applying them:
Fig. 27. Chromagram with Peak Weights
Fig. 28. Accuracy depending on PW
To compute the evolution of the accuracy depending on the peak weights, we have kept the other parameters constant.
d) Ponderation
I have tried different ponderation functions, and the best one has this shape (the x-axis shows the cut frequencies):
Fig. 21. Ponderation Function
Here we can see the difference between using the ponderation function and not using it; the ponderation function helps to distinguish the notes better.
Fig. 29. Chromagram applying Ponderation
e) Octave Weights
The last parameter is the octave weights: a weight is applied to each note (peak) depending on the octave of the note, the lower the note the higher the weight. This goes against the idea of the ponderation (section d), which counteracts the FFT characteristics, and that is the reason for its bad performance.
The idea came from observing how musicians transcribe harmony: they especially focus on the bass (the root note). Here is how octave weights affect the chromagram:
Fig. 30. Chromagram with Octave Weight
Fig. 31. Accuracy depending on Octave Weight
We can see that octave weights are not a good weighting and it is better not to use them, because they give more importance to the low notes, where the FFT is less precise.
5.2 Classifier
a) Setting a baseline
In order to compare the accuracy of the classifiers, it is necessary to establish a baseline. This baseline is defined using what is called a ZeroR classifier.
The ZeroR classifier classifies every instance as the most frequent class. In this case there are many classes, one per chord (24: 12 major and 12 minor), so the baseline will be a low value. In fact, there are 25 classes, because there is a class called "None" that corresponds to the silent parts. The most frequent class is "None", and the baseline is the following:
Fig. 32. Baseline Precision, Recall and F-measure
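For reference, ZeroR and its baseline accuracy amount to the following (the label counts are hypothetical):

```python
from collections import Counter

def zero_r_baseline(labels):
    """ZeroR: always predict the most frequent class; its accuracy is
    that class's share of the data set."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# Hypothetical label distribution over the 25 classes:
labels = ["None"] * 30 + ["C"] * 10 + ["Am"] * 10
# zero_r_baseline(labels) -> ("None", 0.6)
```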
b) The classifiers
I have tried three classifiers besides ZeroR: a NN (neural network), a KNN (k-nearest neighbours) and an SVM (support vector machine).
Our feature extractor outputs a 12-dimensional vector in which each dimension corresponds to a note; no two chords contain exactly the same set of notes, so the chords are well separated in the 12-dimensional space.
To see the performance of the classifiers, I have set the parameters to obtain a good frequency resolution; I did not take the chord changes over time into account. In the frequency-time trade-off I have focused especially on frequency, so the results are not representative of the real performance in a song. As x increases, the parameters are modified to give more precision.
Fig. 33. Precision of Classifiers
Fig. 34. Recall of Classifiers
6. DISCUSSION
6.1 Conclusion
For this project, determining an accuracy is not as simple as applying a formula; it involves many variables: the correct extraction of the features, the temporal accuracy, the correct classification of the chords, etc. So the best way to determine whether it is useful is to look at it as a musician, not as an engineer; in other words, trying the program and seeing if it is a useful tool.
In the first place, there are basic limitations, like not considering the tempo in the chord changes. The program changes the chords when it detects them, which could work in rubato or ad libitum pieces, but usually musical pieces have a tempo, and the harmonic rhythm is important. So it would be helpful to implement a beat-tracking system allowing the program to change the chords on time.
It is obvious that the program would have a small delay, but the internal loops of the musical piece would let the musician understand and predict some chords, so the delay is not that important as long as the chords change on time.
Another basic limitation is not taking enharmonic equivalents into account. Ab major is the same as G# major in the tempered system, and the system would always return Ab major, which is musically wrong. In the E major tonality there is a G# minor, not an Ab minor, because the A is natural in E major, but the program would return Ab minor; that is musically wrong even though it sounds right, since the notes have different names but the same frequencies. Likewise, it does not make sense for an Ab major to be followed by a C# minor, because that Ab major is clearly a G# major, the dominant of C# minor.
Another basic limitation is not taking into account the harmonic relation between chords. The chords in a song are not chosen randomly: they preserve a harmonic relation among themselves and with the tonality. So if the tonality of the song is E major, the probability of a G minor appearing is very low; on the other hand, the probability of a B major is very high, because it is the fifth degree of the scale.
The last basic limitation is not considering the internal harmonic loops of a song. A song is built from a main structure with a harmonic progression in each section, and different sections can share the same progression. The interesting thing is that the harmonic loop usually has a fixed size: 4 bars, 8, 12 (blues, for example) or more (jazz standards).
Implementing these considerations would certainly make the program better; in fact, they are considerations that musicians take into account.
Despite this, the chromagram works very well as a feature extractor (99% F-measure in k-fold cross-validation on the database). The problem is not determining the chord; it is determining the chord at the right time. The feature extractor has a large time span: the window size plus the accumulated windows. So, when it comes to transcribing the harmony of a song, the program has some issues while the chords are changing, but it recognizes them correctly.
The real-time implementation does not work as well as the offline implementation, because there is noise in the analog-to-digital conversion due to a poor audio interface. I have used the computer's built-in audio interface; in fact, I have used another computer to simulate a microphone through a jack cable, so I have performed a digital-to-analog conversion followed by an analog-to-digital conversion, both with a poor audio interface. The noise is therefore considerable, and it affects the detection of chords because it causes random peaks that the program counts as notes.
With a good audio interface this would not be a problem, but the majority of devices (smartphones, PCs, etc.) do not have a good audio interface, so it would be interesting to implement some kind of noise-reduction algorithm.
6.2 Future Work
There are many ideas for improvements and implementations in future work. These ideas are especially focused on musical aspects, like the harmonic progression.
The next step is, in addition to recognizing the chord, to recognize its tonal function. Some mistakes are worse than others; confusing two chords that share the same tonal function is not a big problem.
The weighting function makes some chords easier to classify than others, depending on the register, so modifying this function to be constant along the octave would help.
The program does not work at all on detuned songs: it takes A4 = 440 Hz as a reference, but some songs use 432 Hz, or an instrument may simply be detuned. An improvement would be to make the reference variable (within some range).
The program has a peak threshold; if a peak is below this threshold, it is not counted as a peak. In some cases, depending on the microphone, its position, the volume, etc., the result could be bad, so implementing a variable threshold depending on the RMS would be a solution.
Another improvement would be a real-time implementation of the classifier. In the offline part, the KNN was clearly the best solution for classifying the chords, so implementing a real-time KNN instead of a NN in Wekinator would be an improvement.
Bibliography
1. Fujishima, T. (1999). Real-time chord recognition of musical sound: A system
using common lisp music. Proc. ICMC, Oct. 1999, 464-467.
2. Gómez, E., & Herrera, P. (2004). Automatic extraction of tonal metadata from
polyphonic audio recordings. In AES.
3. Harte, C., Sandler, M., & Gasser, M. (2006, October). Detecting harmonic
change in musical audio. In Proceedings of the 1st ACM workshop on Audio and
music computing multimedia (pp. 21-26). ACM
4. Goto, M. (2001). An audio-based real-time beat tracking system for music with
or without drum-sounds. Journal of New Music Research, 30(2), 159-171.
5. Goto, M., & Hayamizu, S. (1999, August). A real-time music scene description
system: Detecting melody and bass lines in audio signals. In Working Notes of
the IJCAI-99 Workshop on Computational Auditory Scene Analysis (pp. 31-40).
6. Gómez, E. (2006). Tonal description of polyphonic audio for music content
processing. INFORMS Journal on Computing, 18(3), 294-304.
7. Lee, K. (2006, November). Automatic Chord Recognition from Audio Using
Enhanced Pitch Class Profile. In ICMC.
8. Gómez, E. (2006). Tonal description of music audio signals. Department of
Information and Communication Technologies.
9. Serra, J., & Gómez, E. (2007). A cover song identification system based on
sequences of tonal descriptors. Music Information Retrieval Evaluation
eXchange (MIREX).
10. Pascanu, R., Gulcehre, C., Cho, K., & Bengio, Y. (2013). How to construct deep
recurrent neural networks. arXiv preprint arXiv:1312.6026.
11. Lee, H., Pham, P., Largman, Y., & Ng, A. Y. (2009). Unsupervised feature
learning for audio classification using convolutional deep belief networks.
In Advances in neural information processing systems (pp. 1096-1104).
12. Hamel, P., & Eck, D. (2010, August). Learning features from music audio with
deep belief networks. In ISMIR (Vol. 10, pp. 339-344).
13. Pohle, T., Schnitzer, D., Schedl, M., Knees, P., & Widmer, G. (2009, October).
On Rhythm and General Music Similarity. In ISMIR (pp. 525-530).
Aucouturier, J. J., & Pachet, F. (2002, October). Music similarity measures:
What's the use? In ISMIR (pp. 13-17).
14. Haykin, S. (1994). Neural networks: a comprehensive foundation. Prentice Hall
PTR.
15. Wang, L. (Ed.). (2005). Support vector machines: theory and applications (Vol.
177). Springer Science & Business Media.
16. Cover, T. M., & Hart, P. E. (1967). Nearest neighbour pattern classification.
IEEE transactions on information theory, 13(1), 21-27.
I. https://github.com/danielgirotfg/Guitar-Chords-DB
II. https://github.com/leozimmerman/ofxAudioAnalyzer
III. https://www.upf.edu/web/mtg/hpcp
IV. https://code.soundsoftware.ac.uk/projects/tipic
V. https://es.mathworks.com/matlabcentral/fileexchange/35330-frequency-to-
note
VI. https://visualstudio.microsoft.com/es/?rr=https%3A%2F%2Fwww.google.co
m%2F
VII. https://www.cs.waikato.ac.nz/ml/weka/downloading.html
VIII. https://www.nayuki.io/page/free-small-fft-in-multiple-languages
IX. https://github.com/claydergc/find-peaks
X. https://www.mathworks.com/help/matlab/calling-matlab-engine-from-cpp-
programs.html
XI. https://essentia.upf.edu/documentation
XII. https://openframeworks.cc
XIII. https://github.com/paulreimer/ofxAudioFeatures
XIV. http://www.opennn.net/documentation
XV. http://www.wekinator.org
XVI. http://opensoundcontrol.org/introduction-osc