DRUM TRANSCRIPTION VIA JOINT BEAT AND DRUM MODELING USING CONVOLUTIONAL RNNs
Richard Vogl
richard.vogl@tuwien.ac.at
ifs.tuwien.ac.at/~vogl
21st Vienna Deep Learning Meetup
15th of October 2018
Institute of Computational Perception
Richard Vogl (1,2), Matthias Dorfer (2), Gerhard Widmer (2), Peter Knees (1)
richard.vogl@tuwien.ac.at, matthias.dorfer@jku.at, gerhard.widmer@jku.at, peter.knees@tuwien.ac.at
[Affiliation logos 1 and 2; Institute of Computational Perception]
PART 1 AUTOMATIC DRUM TRANSCRIPTION
Task Definition, Problem Modeling, Architectures
PART 2 MULTI-TASK LEARNING
Metadata for Transcripts
WHAT IS DRUM TRANSCRIPTION?
Input: western popular music containing drums
Output: symbolic representation of notes played by drum instruments
[Diagram: audio → ADT system → symbolic representation (notes) → drum sampler → audio, with audio examples]
WHY DRUM TRANSCRIPTION?
Wide range of applications
sampling / remixing / resynthesis
use drum patterns for other tasks: genre classification, song segmentation
FOCUSED INSTRUMENTS
ADT methods focus on bass drum (BD), snare drum (SD), and hi-hat (HH)
[Images: bass drum, snare drum, hi-hat]
STATE OF THE ART
[Figure: spectrogram and activation functions for bass, snare, and hi-hat over t [ms]]
Overview Article
Wu, C.-W., Dittmar, C., Southall, C., Vogl, R., Widmer, G., Hockman, J., Müller, M., Lerch, A.: “An Overview of Automatic Drum Transcription,” IEEE TASLP, vol. 26, no. 9, Sept. 2018.
SYSTEM OVERVIEW
[Diagram: audio → signal preprocessing → NN (feature extraction, event detection, classification) → peak picking → events; NN training from train data]
[Intermediate representations: waveform (A over t [s]) → spectrogram (f [Hz] over t [s]) → activation functions (bass, snare, hi-hat over t [s]) → detected peaks (bass, snare, hi-hat over t [s])]
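The last stage of the pipeline, peak picking, turns each instrument's activation function into discrete onset events. The following is a minimal sketch of that step only, not the method used in the talk: the function name `pick_peaks`, the threshold of 0.5, the frame rate of 100 frames per second, and the minimum gap between onsets are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def pick_peaks(activation: Sequence[float],
               threshold: float = 0.5,   # assumed value, not from the slides
               fps: float = 100.0,       # assumed frame rate, not from the slides
               min_gap: float = 0.02) -> List[float]:
    """Return onset times in seconds for one instrument's activation function.

    A frame counts as an onset if it is a local maximum, reaches the
    threshold, and lies at least `min_gap` seconds after the previous onset.
    """
    onsets: List[float] = []
    for i, value in enumerate(activation):
        left = activation[i - 1] if i > 0 else float("-inf")
        right = activation[i + 1] if i + 1 < len(activation) else float("-inf")
        # Strictly greater than the left neighbour so a plateau yields one peak.
        is_peak = value >= threshold and value > left and value >= right
        time = i / fps
        if is_peak and (not onsets or time - onsets[-1] >= min_gap):
            onsets.append(time)
    return onsets


# One activation function per instrument, as in the system overview.
activations = {
    "bass": [0.0, 0.1, 0.9, 0.2, 0.0, 0.6, 0.7, 0.1],
    "snare": [0.0, 0.0, 0.1, 0.0, 0.8, 0.1, 0.0, 0.0],
}
events = {name: pick_peaks(act) for name, act in activations.items()}
```

In practice the threshold and the minimum gap would be tuned on validation data.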
NETWORK MODELS: RNN
Processing of spectrogram frames as sequential data
Frame-wise detection of instrument onsets
[Figure: RNN processing a train data sample]
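Frame-wise onset detection needs one target vector per spectrogram frame. As a rough sketch of how such targets might be built from annotated onsets (the function name `onsets_to_targets`, the annotation format, and the frame rate of 100 frames per second are assumptions, not details from the talk):

```python
from typing import List, Sequence, Tuple


def onsets_to_targets(annotations: Sequence[Tuple[float, str]],
                      n_frames: int,
                      instruments: Sequence[str] = ("BD", "SD", "HH"),
                      fps: float = 100.0) -> List[List[int]]:
    """Convert (time in seconds, instrument) pairs to frame-wise 0/1 targets.

    Returns a list of `n_frames` rows with one column per instrument; a 1
    marks the frame closest to an annotated onset of that instrument.
    """
    column = {name: k for k, name in enumerate(instruments)}
    targets = [[0] * len(instruments) for _ in range(n_frames)]
    for time, name in annotations:
        frame = int(round(time * fps))
        # Skip annotations outside the excerpt or for unmodelled instruments.
        if 0 <= frame < n_frames and name in column:
            targets[frame][column[name]] = 1
    return targets


annotations = [(0.0, "BD"), (0.02, "HH"), (0.05, "SD"), (0.05, "HH")]
targets = onsets_to_targets(annotations, n_frames=6)
```

Each row of `targets` is what the network is asked to predict for the corresponding frame.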
NETWORK MODELS: CNN
Operate on small windows of spectrogram (current frame + spectral context)
Acoustic modeling of drum sounds
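To make the idea of "current frame + spectral context" concrete, the sketch below cuts a spectrogram into one small window per frame. It assumes that the window is centered on the current frame and that the edges are zero-padded; neither detail is stated on the slides, and `context_windows` is a hypothetical helper name.

```python
from typing import List, Sequence


def context_windows(spectrogram: Sequence[Sequence[float]],
                    context: int = 9) -> List[List[List[float]]]:
    """Return one window of `context` frames per spectrogram frame.

    `spectrogram` is a list of frames, each a list of frequency-bin values.
    Windows are centered on the current frame (assumption) and zero-padded
    at the start and end of the excerpt (assumption).
    """
    if context % 2 == 0:
        raise ValueError("context must be odd so the window can be centered")
    if not spectrogram:
        return []
    half = context // 2
    n_bins = len(spectrogram[0])
    padding = [[0.0] * n_bins for _ in range(half)]
    padded = padding + [list(frame) for frame in spectrogram] + padding
    return [padded[i:i + context] for i in range(len(spectrogram))]


spec = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
windows = context_windows(spec, context=3)
```

A CNN would then classify each window independently, which is why it sees only as much time as the context covers.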
NETWORK MODELS: CRNN
Low-level CNN for acoustic modeling
High-level RNN for music language model
[Figure: CRNN processing a train data sample]
Stacked CNN + RNN architecture:
  2 x conv: 32 x 3x3 (batch norm)
  max pool: 3x3
  2 x conv: 64 x 3x3 (batch norm)
  max pool: 3x3
  3 x RNN: 50 BD GRU
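As a small worked example of the size of the convolutional part of this stack, the sketch below counts its trainable weights. It assumes a single-channel spectrogram input and one bias per filter, and it leaves out batch-norm parameters and the recurrent layers (whose size depends on an input dimension the slides do not give). The helper names are hypothetical.

```python
from typing import Sequence, Tuple


def conv_params(in_channels: int, out_channels: int,
                kernel: Tuple[int, int] = (3, 3), bias: bool = True) -> int:
    """Number of weights in one 2-D convolution layer."""
    height, width = kernel
    weights = in_channels * out_channels * height * width
    return weights + (out_channels if bias else 0)


def conv_stack_params(blocks: Sequence[Tuple[int, int]] = ((2, 32), (2, 64)),
                      in_channels: int = 1) -> int:
    """Total weights for blocks given as (number of layers, number of filters).

    Max pooling has no weights, so it does not appear here.
    """
    total = 0
    channels = in_channels  # assumed: one input channel (a single spectrogram)
    for n_layers, n_filters in blocks:
        for _ in range(n_layers):
            total += conv_params(channels, n_filters)
            channels = n_filters
    return total


# 320 + 9248 + 18496 + 36928 weights for the four 3x3 convolution layers.
total = conv_stack_params()
```

Under these assumptions the four convolution layers hold 64,992 weights, most of them in the two 64-filter layers.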
WHY IS CONTEXT RELEVANT?
Instruments from the same class often sound quite different (audio examples: snare drums)
Similar sound for different instruments (audio examples: crash vs. splash)
When humans transcribe drums
Music Language Model
BASS DRUM OR LOW TOM?
1: bass drum
2: floor tom
3: ? ? ?
[Audio examples: the sounds in isolation, then in context]
DATASETS
IDMT-SMT-Drums [Dittmar and Gärtner 2014]
  SMT (simple!)
ENST-Drums [Gillet and Richard 2006]
  ENST solo (harder!)
  ENST acc. (difficult!)
NETWORK MODELS
Architecture  Frames  Context  Conv. layers  Recurrent layers  Dense layers
RNN (S)       100     -        -             2x50 GRU          -
RNN (L)       400     -        -             3x30 GRU          -
CNN (S)       -       9        (*)           -                 2x256
CNN (L)       -       25       (*)           -                 2x256
CRNN (S)      100     9        (*)           2x50 GRU          -
CRNN (L)      400     13       (*)           3x60 GRU          -
tsRNN         baseline [Vogl et al. ICASSP’17]
(*) 2 x 32 3x3 filt., 3x3 max pooling, 2 x 64 3x3 filt., 3x3 max pooling, all w/ batch norm.
Early stopping, batch normalization, L2 norm, dropout, ADAM optimizer
RESULTS
[Figure: F-measure [%] (axis 60 to 100) on SMT, ENST solo, and ENST with accompaniment for RNN (S), RNN (L), CNN (S), CNN (L), CRNN (S), CRNN (L), and tsRNN]
HOW DOES IT SOUND?
“Punk” (MEDLEY DB)
[Audio examples: bass, snare, hi-hat]
PART 1 AUTOMATIC DRUM TRANSCRIPTION
Task Definition, Problem Modeling, Architectures
PART 2 MULTI-TASK LEARNING
Metadata for Transcripts
LIMITATIONS OF CURRENT SYSTEMS
Do not produce additional information for transcripts
  drum onset detection vs. drum transcription
Only three instrument classes
ADDITIONAL INFORMATION FOR TRANSCRIPTS
Use beat and downbeat tracking to get:
[Figure: HH, SD, and BD onsets over time t, aligned with beats 1 2 3 4; time signature 4/4]
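The items listed on this slide did not survive extraction, so the following only illustrates one kind of information that beat and downbeat positions could add to a list of onsets: the bar and beat in which each onset falls. The function name `annotate_with_beats`, the input format, and the convention that beat number 1 marks a downbeat are assumptions made for this sketch.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def annotate_with_beats(onsets: Sequence[float],
                        beat_times: Sequence[float],
                        beat_numbers: Sequence[int]
                        ) -> List[Tuple[float, Optional[int], Optional[int]]]:
    """Attach (bar index, beat number) of the most recent beat to each onset.

    `beat_times` must be sorted; `beat_numbers` holds the position of each
    beat within its bar, with 1 marking a downbeat. Onsets before the first
    beat get (None, None). Beats before the first downbeat get bar index -1.
    """
    result = []
    for onset in onsets:
        i = bisect_right(beat_times, onset) - 1
        if i < 0:
            result.append((onset, None, None))
            continue
        # Bars are counted by the downbeats seen up to and including beat i.
        bar = sum(1 for number in beat_numbers[:i + 1] if number == 1) - 1
        result.append((onset, bar, beat_numbers[i]))
    return result


beat_times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
beat_numbers = [1, 2, 3, 4, 1, 2]          # a 4/4 bar followed by a new bar
positions = annotate_with_beats([0.0, 0.75, 2.1], beat_times, beat_numbers)
```

With such positions, onsets can be written into bars instead of being left as a flat list of timestamps.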
LEVERAGE BEAT INFORMATION
Beats are highly correlated with drum patterns (drum onset locations / repetitive patterns)
Assume that prior knowledge of beats is helpful for drum transcription
Use multi-task learning for beats and drums
[Figure: HH, SD, and BD onsets over time t, aligned with beats 1 2 3 4]
MULTI-TASK LEARNING
Training one model to solve multiple related tasks
Individual activation functions are already learned using multi-task learning
[Figure: activation functions for bass, snare, and hi-hat over t [ms]]
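One way to read the statement about individual activation functions is that every output of the network (bass, snare, hi-hat, and optionally beats) is a separate binary detection task trained through a shared loss. The sketch below assumes a mean binary cross-entropy over all frames and outputs; the slides do not name the loss, so this choice and the function names are illustrative only.

```python
import math
from typing import Sequence


def binary_cross_entropy(prediction: float, target: float,
                         eps: float = 1e-7) -> float:
    """Cross-entropy for one output; predictions are clipped for stability."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def multi_task_loss(predictions: Sequence[Sequence[float]],
                    targets: Sequence[Sequence[float]]) -> float:
    """Mean binary cross-entropy over all frames and all output tasks.

    Each row is one frame; each column is one task (e.g. BD, SD, HH, beat).
    All tasks share the network, so one scalar loss trains them jointly.
    """
    total, count = 0.0, 0
    for frame_predictions, frame_targets in zip(predictions, targets):
        for p, t in zip(frame_predictions, frame_targets):
            total += binary_cross_entropy(p, t)
            count += 1
    return total / count


# Two frames, four outputs each: BD, SD, HH, beat (column order is assumed).
predictions = [[0.9, 0.1, 0.8, 0.7], [0.2, 0.1, 0.9, 0.1]]
targets = [[1, 0, 1, 1], [0, 0, 1, 0]]
loss = multi_task_loss(predictions, targets)
```

Adding beat outputs then amounts to adding columns to the targets, with the rest of the training setup unchanged.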
MULTI-TASK EXPERIMENTS
Three experiments:
  DT: drum transcription
  BF: DT plus beats as input features
  MT: DT and beat detection multi-tasking
Expected increase in performance for BF compared to DT
Desirable increase in performance for MT compared to DT
[Figure: input spectrogram, f [Hz] over t [s]]
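The difference between the experiments is where the beat information enters: as an extra input (BF) or as an extra output (MT). The sketch below shows this on plain lists. It simplifies the beat information to a single value per frame, and the function name `build_example` and the data layout are assumptions rather than the authors' setup.

```python
from typing import List, Sequence, Tuple

Frames = List[List[float]]


def build_example(spec_frames: Sequence[Sequence[float]],
                  drum_targets: Sequence[Sequence[float]],
                  beats: Sequence[float],
                  experiment: str) -> Tuple[Frames, Frames]:
    """Return (inputs, targets) for one of the experiments DT, BF, or MT.

    DT: spectrogram in, drum targets out.
    BF: beat value appended to every input frame, drum targets out.
    MT: spectrogram in, beat value appended to every target frame.
    """
    inputs = [list(frame) for frame in spec_frames]
    targets = [list(row) for row in drum_targets]
    if experiment == "DT":
        return inputs, targets
    if experiment == "BF":
        return [f + [b] for f, b in zip(inputs, beats)], targets
    if experiment == "MT":
        return inputs, [t + [b] for t, b in zip(targets, beats)]
    raise ValueError(f"unknown experiment: {experiment!r}")


spec = [[0.1, 0.2], [0.3, 0.4]]       # two frames, two frequency bins
drums = [[1, 0, 0], [0, 0, 1]]        # BD, SD, HH per frame
beats = [1.0, 0.0]                    # 1.0 where a beat falls on the frame
bf_inputs, bf_targets = build_example(spec, drums, beats, "BF")
mt_inputs, mt_targets = build_example(spec, drums, beats, "MT")
```

In BF the network is handed the beats, whereas in MT it has to infer them from the audio alongside the drums.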
NEW DATASETS
NEW! RBMA (super difficult!)
[Audio examples]
RESULTS
RBMA (super difficult!)
Model      DT    BF    MT
RNN (S)    59.8  63.6  64.6
RNN (L)    61.8  64.5  64.3
CNN (S)    66.2  66.7  63.3
CNN (L)    66.8  65.2  64.8
CRNN (S)   65.2  66.1  66.9
CRNN (L)   67.3  68.4  67.2
% F-measure for drum onsets, tolerance: ±20 ms, 3-fold cross-validation
DT … drum transcription
BF … DT plus beats as input features
MT … DT and beat detection multi-tasking
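The reported metric is an F-measure over drum onsets with a tolerance of ±20 ms. A minimal sketch of such a measure follows; it matches detections to annotations greedily in time order, which is a simplification and not necessarily the evaluation procedure behind the numbers in the table. The function name `onset_f_measure` is hypothetical.

```python
from typing import Sequence


def onset_f_measure(detected: Sequence[float],
                    reference: Sequence[float],
                    tolerance: float = 0.02) -> float:
    """F-measure for onset times in seconds with a +/- tolerance window.

    Each reference onset can be matched by at most one detection.
    """
    detected = sorted(detected)
    reference = sorted(reference)
    matched = 0
    i = j = 0
    while i < len(detected) and j < len(reference):
        difference = detected[i] - reference[j]
        if abs(difference) <= tolerance:
            matched += 1          # true positive
            i += 1
            j += 1
        elif difference < 0:
            i += 1                # detection too early: false positive
        else:
            j += 1                # reference missed: false negative
    if matched == 0:
        return 0.0
    precision = matched / len(detected)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


# Two of three detections fall within 20 ms of an annotated onset.
score = onset_f_measure([0.01, 0.50, 1.20], [0.0, 0.51, 1.0])
```

For a full transcription the counts would typically be accumulated over all instruments and tracks before computing the final score.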
RESULTS: RNNs
[Figure: F-measure [%] (axis 50 to 70) for RNN (small) and RNN (large) under DT, BF, and MT]
Impact of beats for RNNs:
BF improves for both models ✔
MT improves for both models ✔
RESULTS: CNNs
[Figure: F-measure [%] (axis 50 to 70) for CNN (small) and CNN (large) under DT, BF, and MT]
Impact of beats for CNNs:
BF inconsistent
MT declines for both models
Expected: CNNs have too little context for beats
RESULTS: CRNNs
[Figure: F-measure [%] (axis 50 to 70) for CRNN (small) and CRNN (large) under DT, BF, and MT]
Impact of beats for CRNNs:
BF improves for both models ✔
MT improves for small model ✔
MT equal for large model ?
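The per-architecture observations can be checked against the F-measure values in the results table. The short script below computes the change of BF and MT relative to DT for each model; the numbers are copied from the table, and the helper name `deltas` is only for illustration.

```python
from typing import Dict, Tuple

# F-measure [%] per model as (DT, BF, MT), copied from the results table.
RESULTS: Dict[str, Tuple[float, float, float]] = {
    "RNN (S)": (59.8, 63.6, 64.6),
    "RNN (L)": (61.8, 64.5, 64.3),
    "CNN (S)": (66.2, 66.7, 63.3),
    "CNN (L)": (66.8, 65.2, 64.8),
    "CRNN (S)": (65.2, 66.1, 66.9),
    "CRNN (L)": (67.3, 68.4, 67.2),
}


def deltas(results: Dict[str, Tuple[float, float, float]]
           ) -> Dict[str, Tuple[float, float]]:
    """Return (BF - DT, MT - DT) in percentage points for every model."""
    return {
        model: (round(bf - dt, 1), round(mt - dt, 1))
        for model, (dt, bf, mt) in results.items()
    }


changes = deltas(RESULTS)
for model, (bf_change, mt_change) in changes.items():
    print(f"{model:9s} BF {bf_change:+.1f}  MT {mt_change:+.1f}")
```

The differences agree with the slides: both beat variants help the RNNs, MT lowers both CNN scores, and the large CRNN changes by only 0.1 points under MT.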
RESULTS FOR RECURRENT ARCHITECTURES
No improvement because of beat tracking results?
HOW DOES IT SOUND?
[Audio example] three instruments + beats: bass, hi-hat
[Audio example] eight instruments + beats: bass, hi-hat, snare, side stick, crash, tom
CONCLUSIONS
Deep learning for automatic drum transcription
CRNNs can outperform RNNs and CNNs, especially on complex data
Leverage multi-task learning effects to increase performance
CRNN best overall results @ MIREX’17 and MIREX’18 drum transcription
MIREX system:
http://ifs.tuwien.ac.at/~vogl/models/mirex-17.zip
http://ifs.tuwien.ac.at/~vogl/models/mirex-18.tar.gz