Structured training for large-vocabulary chord recognition
Brian McFee* & Juan Pablo Bello
Small chord vocabularies
○ Frames → chord labels
○ 25 classes: N + (12 ⨉ min) + (12 ⨉ maj)
○ Hidden Markov Models, deep convolutional networks, etc.
○ Optimize accuracy, log-likelihood, etc.
All mistakes are equally bad
[Slide graphic: the 25 classes: N, C:maj, C:min, C#:maj, C#:min, D:maj, D:min, ..., B:maj, B:min]
Large chord vocabularies
○ C:7 = C:maj + m7
○ C:sus4 vs. F:sus2
Chord quality | Frequency
maj           | 52.53%
min           | 13.63%
7             | 10.05%
...           | ...
hdim7         |  0.17%
dim7          |  0.07%
minmaj7       |  0.04%
Distribution of the 1217 dataset
Some mistakes are better than others
[Slide graphic: example confusions marked "Very bad" vs. "Not so bad"]
This implies that chord space is structured!
Our contributions
○ Exploit the structure of chord symbols
○ Preserve accuracy in common classes
Chord simplification
a. Ignore inversions
b. Ignore added and suppressed notes
c. Template-match to nearest quality
d. Resolve enharmonic equivalences
G♭:9(*5)/3 → G♭:9(*5) → G♭:9 → G♭:7 → F♯:7
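The four simplification steps above can be sketched as string operations on Harte-style labels ("root:quality(modifications)/bass"). The quality-matching and enharmonic tables below are small illustrative stand-ins, not the paper's full templates.

```python
# Sketch of steps a-d on the slide's example. QUALITY_MATCH and
# ENHARMONIC are toy tables for illustration only.
QUALITY_MATCH = {"9": "7", "11": "7", "13": "7"}   # nearest-quality examples
ENHARMONIC = {"Gb": "F#", "Db": "C#", "Ab": "G#", "Eb": "D#", "Bb": "A#"}

def simplify(chord: str) -> str:
    # a. Ignore inversions: drop the "/bass" suffix
    chord = chord.split("/")[0]
    # b. Ignore added and suppressed notes: drop "(...)"
    if "(" in chord:
        chord = chord[:chord.index("(")]
    root, _, quality = chord.partition(":")
    # c. Template-match to the nearest of the 14 qualities
    quality = QUALITY_MATCH.get(quality, quality)
    # d. Resolve enharmonic equivalences (flats -> sharps)
    root = ENHARMONIC.get(root, root)
    return f"{root}:{quality}"

print(simplify("Gb:9(*5)/3"))  # -> F#:7
```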
Simplification is lossy! (but all chord models do it)
14 ⨉ 12 + 2 = 170 classes
14 qualities: min, maj, dim, aug, min6, maj6, min7, minmaj7, maj7, 7, dim7, hdim7, sus2, sus4
12 roots: C, C#, ..., B
N: no chord (e.g., silence)
X: out of gamut (e.g., power chords)
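The vocabulary count above can be reproduced by enumerating the cross product of roots and qualities plus the two special classes:

```python
# Minimal sketch: enumerate the 170-class vocabulary
# (14 qualities x 12 roots, plus N and X).
QUALITIES = ["min", "maj", "dim", "aug", "min6", "maj6", "min7",
             "minmaj7", "maj7", "7", "dim7", "hdim7", "sus2", "sus4"]
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

VOCAB = [f"{r}:{q}" for r in ROOTS for q in QUALITIES] + ["N", "X"]
print(len(VOCAB))  # -> 170
```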
Structural encoding
○ Similar chords with different labels will have similar encodings
○ Dissimilar chords will have dissimilar encodings
○ Predict the encoding from audio
○ Learn to decode into chord labels
* up to octave-folding
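A toy sketch of the structural encoding: each chord maps to a root (13 classes: 12 pitch classes + "none"), an octave-folded pitch-class vector (12-d multilabel), and a bass (13 classes). The interval templates and the "none" index here are illustrative assumptions covering only a few qualities.

```python
import numpy as np

ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
TEMPLATES = {"maj": [0, 4, 7], "min": [0, 3, 7], "7": [0, 4, 7, 10]}

def encode(chord: str):
    if chord == "N":                       # no chord
        return 12, np.zeros(12), 12        # "none" root/bass, empty chroma
    root_name, _, quality = chord.partition(":")
    root = ROOTS.index(root_name)
    pitches = np.zeros(12)
    for interval in TEMPLATES[quality]:    # octave-folded pitch classes
        pitches[(root + interval) % 12] = 1.0
    return root, pitches, root             # bass = root (no inversion)

root, pitches, bass = encode("C:7")
print(root, pitches.nonzero()[0], bass)    # C:7 -> root 0, pitches {0,4,7,10}
```

Note how C:maj and C:7 share root, bass, and three of four pitch classes, so similar chords receive similar encodings.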
Model architectures
○ Root [multiclass, 13]
○ Pitches [multilabel, 12]
○ Bass [multiclass, 13]
○ Chords [multiclass, 170]
(encoder-decoder)
Encoder architecture
Suppress transients → encode frequencies → contextual smoothing
Hidden state at frame t: h(t) ∊ [-1, +1]^D
Decoder architectures
Chords = logistic regression from the encoder state
Frames are independently decoded: y(t) = softmax(W h(t) + β)
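A minimal numpy sketch of this frame-wise softmax decoder; W, β, and all dimensions are placeholder values, not the trained model's.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, T = 256, 170, 8                  # hidden size, chord classes, frames
W, beta = rng.normal(size=(K, D)), np.zeros(K)
h = rng.uniform(-1, 1, size=(T, D))    # encoder states h(t)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Each frame is decoded independently of its neighbors
y = softmax(h @ W.T + beta)
print(y.shape, y.sum(axis=1).round(6))     # one distribution per frame
```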
Decoder architectures
Decoding = GRU + LR
Frames are recurrently decoded:
h2(t) = Bi-GRU[h](t)
y(t) = softmax(W h2(t) + β)
Decoder architectures
Chords = LR from encoder state + root/pitch/bass
Frames are independently decoded with structure:
y(t) = softmax(Wr r(t) + Wp p(t) + Wb b(t) + Wh h(t) + β)
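A sketch of the structured decoder's logit computation for one frame, assuming a one-hot root/bass and a binary pitch vector; all weights and sizes are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
K, D = 170, 256                        # chord classes, hidden size
Wr, Wp, Wb, Wh = (rng.normal(size=(K, n)) for n in (13, 12, 13, D))
beta = np.zeros(K)

r, b = np.eye(13)[0], np.eye(13)[0]    # e.g., root = bass = C
p = np.zeros(12)
p[[0, 4, 7]] = 1.0                     # C:maj pitch classes
h = rng.uniform(-1, 1, size=D)         # encoder state for this frame

# Chord logits sum contributions from root, pitches, bass, and state
logits = Wr @ r + Wp @ p + Wb @ b + Wh @ h + beta
y = np.exp(logits - logits.max())
y /= y.sum()
print(y.shape)                         # -> (170,)
```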
Decoder architectures
○ Chords = logistic regression from encoder state
○ Decoding = GRU + LR
○ Chords = LR from encoder state + root/pitch/bass
○ All of the above
What about root bias?
○ muda [McFee, Humphrey, Bello 2015]
○ Pitch-shift the audio and annotations simultaneously
○ All qualities are observed in all root positions
○ All roots, pitches, and bass values are observed
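In the talk, muda pitch-shifts the audio and the annotations together; the sketch below shows only the annotation half of that idea, with a toy `transpose` helper (an assumption for illustration, not muda's API) that shifts a chord label by k semitones.

```python
# Transposing a chord label by k semitones (the annotation side of
# pitch-shift data augmentation).
ROOTS = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def transpose(chord: str, k: int) -> str:
    if chord in ("N", "X"):                 # no-chord / out-of-gamut: unchanged
        return chord
    root, _, quality = chord.partition(":")
    root = ROOTS[(ROOTS.index(root) + k) % 12]
    return f"{root}:{quality}"

print([transpose("A:min7", k) for k in (-1, 0, 1)])
# -> ['G#:min7', 'A:min7', 'A#:min7']
```

Applying every shift in ±6 semitones to every track is what guarantees that all qualities are observed at all roots.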
Evaluation
○ ± data augmentation
○ ± structured training
○ 1 vs. 2 recurrent layers
○ 1217 dataset (Billboard + Isophonics + MARL corpus)
○ 5-fold cross-validation
○ DNN [Humphrey & Bello, 2015]
○ KHMM [Cho, 2014]
Results
Data augmentation (+A) is necessary to match baselines.
CR1: 1 recurrent layer; CR2: 2 recurrent layers; +A: data augmentation; +S: structure encoding
Structured training (+S) and deeper models improve over baselines.
Improvements are bigger on the harder metrics (7ths and tetrads)
Substantial gains in maj/min and MIREX metrics
CR2+S+A wins on all metrics
Error analysis: quality confusions
Errors tend toward simplification
Reflects maj/min bias in training data
Simplified vocab. accuracy: 63.6%
Summary
○ pip install muda
○ We probably need new data
○ https://github.com/bmcfee/ismir2017_chords
○ pip install crema
brian.mcfee@nyu.edu https://bmcfee.github.io/
Extra goodies
Error analysis: CR2+S+A vs CR2+A
Reduction of confusions to major
Improvements in rare classes: aug, maj6, dim7, hdim7, sus4
Learned model weights
Training details
○ Determined by decoder loss
Inter-root confusions
Confusions primarily toward P4/P5
Inversion estimation
○ Find the most likely bass note
○ If that note is within the detected quality, predict it as the inversion
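The two-step rule above can be sketched directly; the quality templates and probabilities are illustrative placeholders, and `estimate_inversion` is a hypothetical helper name.

```python
import numpy as np

# Toy interval templates per quality (illustrative only)
TEMPLATES = {"maj": [0, 4, 7], "min": [0, 3, 7]}

def estimate_inversion(bass_prob, root, quality):
    bass = int(np.argmax(bass_prob))          # most likely bass pitch class
    interval = (bass - root) % 12             # interval above the root
    if interval in TEMPLATES[quality] and interval != 0:
        return interval                       # e.g., 4 -> "/3" for a maj chord
    return None                               # root position

bass_prob = np.zeros(12)
bass_prob[4] = 1.0                            # E is the likeliest bass
print(estimate_inversion(bass_prob, root=0, quality="maj"))  # -> 4
```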
Pitches as chroma