SLIDE 1

Richard Vogl

richard.vogl@tuwien.ac.at 
 ifs.tuwien.ac.at/~vogl

DRUM TRANSCRIPTION VIA 
 JOINT BEAT AND DRUM MODELING USING CONVOLUTIONAL RNNs

21st Vienna Deep Learning Meetup

15th of October 2018

Institute of Computational Perception
SLIDE 2

Richard Vogl¹,², Matthias Dorfer², Gerhard Widmer², Peter Knees¹

richard.vogl@tuwien.ac.at, matthias.dorfer@jku.at, gerhard.widmer@jku.at, peter.knees@tuwien.ac.at

DRUM TRANSCRIPTION VIA 
 JOINT BEAT AND DRUM MODELING USING CONVOLUTIONAL RNNs

¹ TU Wien  ² Institute of Computational Perception, JKU
SLIDE 3

PART 1 AUTOMATIC DRUM TRANSCRIPTION

Task Definition, Problem Modeling, Architectures

PART 2 MULTI-TASK LEARNING

Metadata for Transcripts

SLIDE 8

WHAT IS DRUM TRANSCRIPTION?

Input: western popular music containing drums
Output: symbolic representation of the notes played by drum instruments: sheet music, a list of events, or a piano roll (bass / snare / hi-hat)

[figure: audio → ADT system → symbolic representation]
SLIDE 17

WHAT IS DRUM TRANSCRIPTION?

[demo: audio → ADT system → notes → drum sampler → resynthesized drum audio ♫]
SLIDE 22

WHY DRUM TRANSCRIPTION?

Wide range of applications:

  • Generate sheet music
  • Music production: sampling / remixing / resynthesis
  • Higher-level MIR tasks: use drum patterns for genre classification and song segmentation
SLIDE 27

FOCUSED INSTRUMENTS

ADT methods focus on bass drum (BD), snare drum (SD), and hi-hat (HH):

  • They make up the majority of notes in datasets
  • Beat-defining / most important
  • Well-separated spectral energy distributions

[figure: spectrograms of bass drum, snare drum, and hi-hat]
SLIDE 33

STATE OF THE ART

End-to-end / activation-function-based neural networks and NMF-based approaches.

[figure: spectrogram → activation functions (bass / snare / hi-hat)]

Overview article:
Wu, C.-W., Dittmar, C., Southall, C., Vogl, R., Widmer, G., Hockman, J., Müller, M., Lerch, A.:
"An Overview of Automatic Drum Transcription," IEEE TASLP, vol. 26, no. 9, Sept. 2018.
SLIDE 38

SYSTEM OVERVIEW

Pipeline: audio → signal preprocessing → NN feature extraction / event detection (classification) → peak picking → events; the NN is trained on train data.

[figure: waveform (A over t) → spectrogram (f over t) → activation functions (bass / snare / hi-hat) → detected peaks]
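The peak-picking step in the pipeline above can be sketched as follows; this is a minimal version, and the threshold and minimum peak distance are hypothetical defaults, not the values used in the presented system.

```python
import numpy as np

def pick_peaks(activation, threshold=0.5, min_dist=3):
    """Select frames that exceed a threshold and are local maxima,
    keeping a minimum distance (in frames) between picks.
    threshold / min_dist are hypothetical defaults."""
    peaks = []
    for t in range(1, len(activation) - 1):
        if (activation[t] >= threshold
                and activation[t] >= activation[t - 1]
                and activation[t] > activation[t + 1]
                and (not peaks or t - peaks[-1] >= min_dist)):
            peaks.append(t)
    return peaks
```

Applied per instrument to the network's activation function, the picked frame indices are then converted to onset times via the frame rate.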

SLIDE 42

NETWORK MODELS — RNN

Processing of spectrogram frames as sequential data; frame-wise detection of instrument onsets.

Bidirectional RNN architecture with GRUs: three layers of 30 bidirectional GRUs; output: 3 sigmoid units.
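A minimal PyTorch sketch of such a model; the input size of 84 spectrogram bins is an assumption, while the three bidirectional GRU layers of 30 units and the 3 sigmoid outputs follow the slide.

```python
import torch
import torch.nn as nn

class DrumRNN(nn.Module):
    """Bidirectional GRU stack with frame-wise sigmoid outputs for
    BD / SD / HH. n_features=84 is an assumed spectrogram size."""
    def __init__(self, n_features=84, n_units=30, n_layers=3, n_out=3):
        super().__init__()
        self.gru = nn.GRU(n_features, n_units, num_layers=n_layers,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_units, n_out)  # 2x: fwd + bwd states

    def forward(self, x):                  # x: (batch, frames, features)
        h, _ = self.gru(x)
        return torch.sigmoid(self.out(h))  # (batch, frames, 3)
```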
SLIDE 46

NETWORK MODELS — CNN

Operate on small windows of the spectrogram (current frame + spectral context); acoustic modeling of drum sounds.

VGG-style architecture: 2 × conv 32 @ 3×3 (batch norm) → max pool 3×3 → 2 × conv 64 @ 3×3 (batch norm) → max pool 3×3 → 2 × dense 256 ReLU; output: 3 sigmoid units.
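A PyTorch sketch of this VGG-style stack; the input window of 84 bins × 9 frames is an assumed size, while the layer shapes follow the slide.

```python
import torch
import torch.nn as nn

class DrumCNN(nn.Module):
    """VGG-style stack: 2x conv 32@3x3 (batch norm), 3x3 max pooling,
    2x conv 64@3x3 (batch norm), 3x3 max pooling, 2x dense 256 ReLU,
    3 sigmoid outputs. Input window size is an assumption."""
    def __init__(self, n_bins=84, context=9, n_out=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(3),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(3),
        )
        with torch.no_grad():  # infer flattened size from a dummy window
            n_flat = self.features(torch.zeros(1, 1, n_bins, context)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_flat, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_out), nn.Sigmoid(),
        )

    def forward(self, x):  # x: (batch, 1, freq_bins, frames)
        return self.classifier(self.features(x))
```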
SLIDE 51

NETWORK MODELS — CRNN

Low-level CNN for acoustic modeling; high-level RNN as a music language model.

Stacked CNN + RNN architecture: 2 × conv 32 @ 3×3 (batch norm) → max pool 3×3 → 2 × conv 64 @ 3×3 (batch norm) → max pool 3×3 → 3 × RNN layers of 50 bidirectional GRUs; output: 3 sigmoid units.
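The stacked CNN + RNN idea can be sketched like this; pooling only along frequency (so the frame axis survives for the RNN), the 84-bin input, and the 60-unit GRUs are assumptions for illustration, with conv shapes following the slide.

```python
import torch
import torch.nn as nn

class DrumCRNN(nn.Module):
    """CNN front end for acoustic modeling feeding a bidirectional GRU
    stack (music language model), 3 sigmoid outputs per frame.
    Frequency-only pooling and input dims are assumptions."""
    def __init__(self, n_bins=84, n_out=3):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d((3, 1)),  # pool frequency only, keep frames
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d((3, 1)),
        )
        feat = 64 * (n_bins // 3 // 3)  # channels x pooled frequency bins
        self.gru = nn.GRU(feat, 60, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.out = nn.Linear(120, n_out)

    def forward(self, x):                      # x: (batch, 1, bins, frames)
        h = self.conv(x)                       # (batch, 64, bins', frames)
        h = h.permute(0, 3, 1, 2).flatten(2)   # (batch, frames, 64*bins')
        h, _ = self.gru(h)
        return torch.sigmoid(self.out(h))      # (batch, frames, 3)
```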
SLIDE 58

WHY IS CONTEXT RELEVANT?

Instruments from the same class often sound quite different (e.g., different snare drums), while different instruments can sound similar (e.g., crash vs. splash cymbals).

When humans transcribe drums:

  • An instrument's function in the track is equally important (snare drum vs. backbeat)
  • Inaudible onsets are filled in if they are expected

➡ Music language model
SLIDE 64

BASS DRUM OR LOW TOM?

Example: onset 1 is a bass drum, onset 2 a floor tom; onset 3 is ambiguous in isolation. With context, onset 3 is a bass drum.
SLIDE 71

DATASETS

IDMT-SMT-Drums [Dittmar and Gärtner 2014] ("SMT", simple!)

  • Solo drum tracks: recorded, synthesized, and sampled
  • 95 tracks, total: 24 min, onsets: 8,004

ENST-Drums [Gillet and Richard 2006] ("ENST solo", harder!; "ENST acc.", difficult!)

  • Recordings of three drummers on different drum kits, with optional accompaniment
  • 64 tracks, total: 1 h, onsets: 22,391
SLIDE 72

NETWORK MODELS

Architecture | Frames | Context | Conv. layers           | Rec. layers | Dense layers
RNN (S)      | 100    | —       | —                      | 2×50 GRU    | —
RNN (L)      | 400    | —       | —                      | 3×30 GRU    | —
CNN (S)      | —      | 9       | conv stack (see below) | —           | 2×256
CNN (L)      | —      | 25      | conv stack (see below) | —           | 2×256
CRNN (S)     | 100    | 9       | conv stack (see below) | 2×50 GRU    | —
CRNN (L)     | 400    | 13      | conv stack (see below) | 3×60 GRU    | —

Conv stack: 2 × 32 3×3 filters, 3×3 max pooling, 2 × 64 3×3 filters, 3×3 max pooling, all w/ batch norm.

tsRNN: baseline [Vogl et al. ICASSP'17]

Training: early stopping, batch normalization, L2 norm, dropout, ADAM optimizer
SLIDE 73

RESULTS

[bar chart: F-measure (60–100 %) of RNN (S/L), CNN (S/L), CRNN (S/L), and tsRNN on SMT, ENST solo, and ENST with accompaniment]
SLIDE 74

HOW DOES IT SOUND?

[audio demos (bass / snare / hi-hat): "Punk" and "Hendrix" from MedleyDB; "Alexa, play some music…"]
SLIDE 80

PART 1 AUTOMATIC DRUM TRANSCRIPTION

Task Definition, Problem Modeling, Architectures

PART 2 MULTI-TASK LEARNING

Metadata for Transcripts

SLIDE 88

LIMITATIONS OF CURRENT SYSTEMS

They do not produce additional information for transcripts (drum onset detection vs. drum transcription):

  • bar lines
  • tempo
  • meter
  • dynamics / accents
  • stroke / playing technique

Only three instrument classes:
Richard Vogl, Gerhard Widmer, and Peter Knees, "Towards multi-instrument drum transcription," in Proc. 21st Intl. Conf. on Digital Audio Effects (DAFx-18), Aveiro, Portugal, Sep. 2018.
SLIDE 96

ADDITIONAL INFORMATION FOR TRANSCRIPTS

Use beat and downbeat tracking to get:

  • bar lines
  • tempo
  • meter (e.g., 4/4)

[figure: HH / SD / BD piano roll over t with beats 1 2 3 4 marked]
SLIDE 101

LEVERAGE BEAT INFORMATION

Beats are highly correlated with drum patterns (drum onset locations / repetitive patterns). Assume that prior knowledge of the beats is helpful for drum transcription. Use multi-task learning for beats and drums.

[figure: HH / SD / BD piano roll over t with beats 1 2 3 4 marked]
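One way annotated beats can be fed to the network as extra input features (the BF setting in the experiments below) is to append a binary beat column to the spectrogram; the function name, frame rate, and single-column encoding are illustrative assumptions, and downbeats would be handled analogously.

```python
import numpy as np

def add_beat_features(spectrogram, beat_times, fps=100):
    """Append a binary column marking annotated beat frames so that
    beats act as additional input features. spectrogram has shape
    (frames, bins); fps is an assumed frame rate."""
    n_frames = spectrogram.shape[0]
    beat_col = np.zeros((n_frames, 1))
    for t in beat_times:
        idx = int(round(t * fps))  # beat time in seconds -> frame index
        if 0 <= idx < n_frames:
            beat_col[idx] = 1.0
    return np.hstack([spectrogram, beat_col])
```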
SLIDE 108

MULTI-TASK LEARNING

Training one model to solve multiple related tasks:

  • Improves performance for each subtask ➡ context!

The individual activation functions are already learned via multi-task learning:

  • One network for all instruments
  • Instrument onsets are not independent
  • MIREX results show that this works better

[figure: activation functions over t for bass / snare / hi-hat]
SLIDE 115

MULTI-TASK EXPERIMENTS

Three experiments:

  • Training on drum targets (DT)
  • Training on drum targets with annotated beats as additional input features (BF)
  • Training on drum and beat targets as a multi-task problem (MT)

Expected: BF increases performance compared to DT. Desirable: MT increases performance compared to DT.

[figure: spectrogram input (f over t) → network → output activations]
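The MT setting can be sketched as a joint objective over the shared network's two output groups; the helper and the beat_weight knob are hypothetical, as the slides do not specify the loss weighting.

```python
import torch
import torch.nn.functional as F

def multitask_loss(drum_pred, beat_pred, drum_target, beat_target,
                   beat_weight=1.0):
    """Joint objective sketch for the MT experiments: sum of per-task
    binary cross-entropies on the drum and beat output groups.
    beat_weight is a hypothetical knob, not from the slides."""
    drum_loss = F.binary_cross_entropy(drum_pred, drum_target)
    beat_loss = F.binary_cross_entropy(beat_pred, beat_target)
    return drum_loss + beat_weight * beat_loss
```

Both tasks backpropagate through the shared layers, which is what lets the beat targets act as extra supervision for the drum task.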
SLIDE 116

NEW DATASETS

RBMA13-Drums (NEW!) [http://ifs.tuwien.ac.at/~vogl/datasets/] ("RBMA", super difficult!)

  • Free music from the 2013 Red Bull Music Academy, different styles
  • 27 tracks, total: 1 h 43 min, onsets: 24,365
  • Drum, beat, and downbeat annotations
SLIDE 119

RESULTS

Model    | DT   | BF   | MT
RNN (S)  | 59.8 | 63.6 | 64.6
RNN (L)  | 61.8 | 64.5 | 64.3
CNN (S)  | 66.2 | 66.7 | 63.3
CNN (L)  | 66.8 | 65.2 | 64.8
CRNN (S) | 65.2 | 66.1 | 66.9
CRNN (L) | 67.3 | 68.4 | 67.2

% F-measure for drum onsets, tolerance: ±20 ms, 3-fold cross-validation (RBMA, super difficult!)

DT … drum transcription
BF … DT plus beats as input features
MT … DT and beat detection multi-tasking
SLIDE 124

RESULTS: RNNs

[bar chart: F-measure (50–70 %) for RNN (small) and RNN (large) under DT, BF, and MT]

Impact of beats for RNNs:
BF improves for both models ✔
MT improves for both models ✔
SLIDE 129

RESULTS: CNNs

[bar chart: F-measure (50–70 %) for CNN (small) and CNN (large) under DT, BF, and MT]

Impact of beats for CNNs:
BF is inconsistent.
MT declines for both models.
Expected: CNNs have too little context to model beats.
SLIDE 134

RESULTS: CRNNs

[bar chart: F-measure (50–70 %) for CRNN (small) and CRNN (large) under DT, BF, and MT]

Impact of beats for CRNNs:
BF improves for both models ✔
MT improves for the small model ✔
MT is on par for the large model ?
SLIDE 138

RESULTS FOR RECURRENT ARCHITECTURES

[bar chart: F-measure (50–70 %) for RNN (S/L) and CRNN (S/L) under DT, BF, and MT]

No improvement because of the beat tracking results?
SLIDE 139

HOW DOES IT SOUND?

[audio demos: three instruments + beats (bass, snare, hi-hat); eight instruments + beats (incl. bass, snare, hi-hat, side stick, crash, tom)]
SLIDE 147

CONCLUSIONS

Deep learning for automatic drum transcription: CRNNs can outperform RNNs and CNNs, especially on complex data.

  • Modeling of acoustic and rhythmic properties ➡ better generalization!

Leveraging multi-task learning effects increases performance:

  • All instruments under observation within one model
  • Beats and downbeats provide additional metadata for transcripts

CRNN: best overall results @ MIREX'17 and MIREX'18 drum transcription.

MIREX systems:
http://ifs.tuwien.ac.at/~vogl/models/mirex-17.zip
http://ifs.tuwien.ac.at/~vogl/models/mirex-18.tar.gz