SLIDE 1

Neural Architectures for Music Representation Learning

Sanghyuk Chun, Clova AI Research

SLIDE 2

Contents

  • Understanding audio signals
  • Front-end and back-end framework for audio architectures
  • Powerful front-end with Harmonic filter banks
      • [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
      • [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
      • [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.
  • Interpretable back-end with self-attention mechanism
      • [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
      • [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.
  • Conclusion
SLIDE 3

Understanding audio signals

  • Raw audio
  • Spectrogram
  • Mel filter bank

SLIDE 4

Understanding audio signals in the time domain.

[0.001, -0.002, -0.005, -0.004, -0.003, -0.003, -0.003, -0.002, -0.001, …]

A “waveform” shows the “magnitude” of the input signal across time. How can we capture “frequency” information?
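A minimal sketch of inspecting a waveform like the one above, assuming librosa is installed and a hypothetical file music.wav:

```python
import librosa

# Load audio at 11025 Hz; y is a 1-D float array of amplitudes in [-1, 1].
y, sr = librosa.load("music.wav", sr=11025)
print(y[:9])        # e.g. [ 0.001 -0.002 -0.005 ...], one value per sample
print(len(y) / sr)  # duration in seconds
```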

SLIDE 5

Understanding audio signals in the frequency domain.

Time-amplitude => Frequency-amplitude
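A small NumPy sketch of that move: a pure 440 Hz tone in the time domain becomes a single peak in the frequency domain.

```python
import numpy as np

sr = 22050
t = np.arange(sr) / sr                     # 1 second of sample times
y = np.sin(2 * np.pi * 440 * t)            # time-amplitude: a 440 Hz tone

spectrum = np.abs(np.fft.rfft(y))          # frequency-amplitude
freqs = np.fft.rfftfreq(len(y), d=1 / sr)  # bin index -> Hz
print(freqs[np.argmax(spectrum)])          # ~440.0 Hz
```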

SLIDE 6

Understanding audio signals in the time-frequency domain.

Types of audio inputs (sketched in code after the list):

  • Raw audio waveform
  • Linear spectrogram
  • Log-scale spectrogram
  • Mel spectrogram
  • Constant Q transform (CQT)
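A sketch computing each listed input type with librosa (the file path is hypothetical):

```python
import numpy as np
import librosa

y, sr = librosa.load("music.wav", sr=22050)                   # raw audio waveform
linear = np.abs(librosa.stft(y, n_fft=2048))                  # linear spectrogram
log_spec = librosa.amplitude_to_db(linear)                    # log-scale spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # mel spectrogram
cqt = np.abs(librosa.cqt(y, sr=sr))                           # constant-Q transform
```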
SLIDE 7

Human perception of audio is log-scale.

A: 220 Hz → 440 Hz → 880 Hz; D: 146.83 Hz → 293.66 Hz → 587.33 Hz. Each doubling of frequency is perceived as the same interval (one octave).

SLIDE 8

Mel filter banks: a log-scale filter bank.
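As a sketch: a mel filter bank is a fixed matrix of triangular filters whose centers follow the mel scale, m = 2595 log10(1 + f / 700), i.e. roughly logarithmic in Hz, matching the perception slide above.

```python
import librosa

# [n_mels, n_fft // 2 + 1] matrix; multiplying a linear spectrogram by it
# yields a mel spectrogram.
mel_fb = librosa.filters.mel(sr=22050, n_fft=2048, n_mels=128)
print(mel_fb.shape)  # (128, 1025)
```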

SLIDE 9

Mel-spectrogram.


Comparing raw audio, linear spectrogram, and mel-spectrogram:

Raw audio
  • Input shape (1D): sampling rate × audio length = 11025 × 30 = 330K samples
  • Information density: very sparse along the time axis (1 sec = one sampling rate of samples), so a very large receptive field is needed
  • Loss: none (if SR > Nyquist rate)

Linear spectrogram
  • Input shape (2D): (FFT size / 2 + 1) × # frames = (2048 / 2 + 1) × 1255 = [1025, 1255]; tied to many hyperparameters (hop size, window size, …)
  • Information density: sparse along the frequency axis (each time bin has 1025 dims), so a large receptive field is needed
  • Loss: “resolution”

Mel-spectrogram
  • Input shape (2D): (# mel bins) × # frames = [128, 1255]
  • Information density: less sparse along the frequency axis (each time bin has 128 dims)
  • Loss: “resolution” + mel filter
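The shape arithmetic above as a sketch (the hop size here is assumed; the exact frame count depends on hop and window settings):

```python
sr, seconds = 11025, 30
n_fft, n_mels = 2048, 128

raw_len = sr * seconds       # 330750 samples (~330K) for the 1-D input
freq_bins = n_fft // 2 + 1   # 1025 bins in the linear spectrogram
hop = 256                    # hypothetical hop size
frames = 1 + raw_len // hop  # number of STFT frames
print(raw_len, (freq_bins, frames), (n_mels, frames))
```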

SLIDE 10

Bonus: MFCC (Mel-Frequency Cepstral Coefficients).

Mel-spectrogram (2D: # mel bins × # frames = [128, 1255]) → DCT (Discrete Cosine Transform) → MFCC (2D: # MFCCs × # frames = [20, 1255])

Frequently used in the speech domain, but a very lossy representation for high-level music representation learning (cf. SIFT in vision).
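A sketch of the pipeline with librosa (hypothetical file path); keeping only the first 20 coefficients is what makes the representation so lossy:

```python
import librosa

y, sr = librosa.load("music.wav", sr=11025)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # mel spectrogram -> DCT -> MFCC
print(mfcc.shape)                                   # (20, #frames)
```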

SLIDE 11

Front-end and back-end framework

  • Fully convolutional neural network baseline
  • Rethinking convolutional neural networks
  • Front-end and back-end framework

SLIDE 12

Fully convolutional neural network baseline for automatic music tagging.


[ISMIR 2016] “Automatic tagging using deep convolutional neural networks”, Keunwoo Choi, George Fazekas, Mark Sandler

SLIDE 13

Rethinking CNN as feature extractor and non-linear classifier.


SLIDE 14

Rethinking CNN as feature extractor and non-linear classifier.


[ISMIR 2016] “Automatic tagging using deep convolutional neural networks”, Keunwoo Choi, George Fazekas, Mark Sandler

Early layers extract low-level features (timbre, pitch), deeper layers extract high-level features (rhythm, tempo), and the final layers act as a non-linear classifier.
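A minimal PyTorch sketch in the spirit of this FCN baseline, not the paper's exact architecture: stacked conv/pool layers as the feature extractor and a final head as the classifier.

```python
import torch
import torch.nn as nn

class FCNTagger(nn.Module):
    def __init__(self, n_tags=50):
        super().__init__()
        layers, ch = [], 1
        for out_ch in (64, 128, 128, 64):  # illustrative layer widths
            layers += [nn.Conv2d(ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(),
                       nn.MaxPool2d(2)]
            ch = out_ch
        self.features = nn.Sequential(*layers)   # low- to high-level features
        self.classifier = nn.Linear(64, n_tags)  # classifier head

    def forward(self, mel):                       # mel: [batch, 1, n_mels, n_frames]
        h = self.features(mel).mean(dim=(2, 3))   # global average pooling
        return torch.sigmoid(self.classifier(h))  # multi-label tag probabilities

tags = FCNTagger()(torch.randn(2, 1, 128, 1255))  # -> [2, 50]
```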

SLIDE 15

Rethinking CNN as feature extractor and non-linear classifier.


SLIDE 16

Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]
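The framework as code, a sketch: every model in the following slides is this composition with different pieces plugged in (all names here are illustrative).

```python
import torch.nn as nn

class AudioTagger(nn.Module):
    def __init__(self, filter_bank, front_end, back_end):
        super().__init__()
        self.filter_bank = filter_bank  # time-frequency feature extraction
        self.front_end = front_end      # local feature extraction
        self.back_end = back_end        # temporal summarization & classification

    def forward(self, audio):
        return self.back_end(self.front_end(self.filter_bank(audio)))
```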

SLIDE 17

Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

[ICASSP 2017] “Convolutional recurrent neural networks for music classification”, Choi, et al.

Mel-filter banks → CNN front-end → RNN back-end

SLIDE 18

Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

[ISMIR 2018] “End-to-end learning for music audio tagging at scale”, Pons, et al.

Mel-filter banks → CNN front-end → RNN back-end
Mel-filter banks → CNN front-end → CNN back-end

SLIDE 19

Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

[ISMIR 2018] “End-to-end learning for music audio tagging at scale”, Pons, et al.

Mel-filter banks → CNN front-end → RNN back-end
Mel-filter banks → CNN front-end → CNN back-end

SLIDE 20

Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

[SMC 2017] “Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms”, Lee, et al.

Mel-filter banks → CNN front-end → RNN back-end
Mel-filter banks → CNN front-end → CNN back-end
Fully-learnable filters → CNN front-end → CNN back-end

SLIDE 21

Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

[ICASSP 2020] “Data-driven Harmonic Filters for Audio Representation Learning”, Won, et al.

Mel-filter banks → CNN front-end → RNN back-end
Mel-filter banks → CNN front-end → CNN back-end
Fully-learnable filters → CNN front-end → CNN back-end
Partially-learnable filters → CNN front-end → CNN back-end

SLIDE 22

Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

[ArXiv 2019] “Toward Interpretable Music Tagging with Self-attention”, Won, et al.

Mel-filter banks → CNN front-end → RNN back-end
Mel-filter banks → CNN front-end → CNN back-end
Fully-learnable filters → CNN front-end → CNN back-end
Partially-learnable filters → CNN front-end → CNN back-end
Mel-filter banks → CNN front-end → Self-attention back-end

SLIDE 23

Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

Mel-filter banks → CNN front-end → RNN back-end
Mel-filter banks → CNN front-end → CNN back-end
Fully-learnable filters → CNN front-end → CNN back-end
Partially-learnable filters → CNN front-end → CNN back-end
Mel-filter banks → CNN front-end → Self-attention back-end

SLIDE 24

Powerful front-end with Harmonic filter banks

  • [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
  • [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
  • [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.

SLIDE 25

Motivation: Data-driven, but human-guided

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

Traditional methods: Mel-filter banks → MFCC → SVM classification (hand-crafted features with a strong human prior)
Recent methods: fully-learnable filters → CNN → CNN (a data-driven approach without any human prior)

SLIDE 26


DATA-DRIVEN HARMONIC FILTERS

SLIDE 27


DATA-DRIVEN HARMONIC FILTERS

SLIDE 28

Data-driven filter banks

In a mel filter bank, f(m) gives pre-defined center frequencies that depend on the sampling rate, FFT size, number of mel bins, and so on. The proposed data-driven filter is instead parameterized by trainable values (sketched in code below):

  • fc: center frequency
  • BW: bandwidth
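A PyTorch sketch of the idea, not the paper's exact implementation: triangular band-pass filters whose center frequencies fc and bandwidths BW are trainable parameters (initial values here are illustrative).

```python
import torch
import torch.nn as nn

class LearnableFilterBank(nn.Module):
    def __init__(self, bin_hz, n_filters=128):
        super().__init__()
        self.register_buffer("bin_hz", bin_hz)  # [n_bins]: FFT bin -> Hz
        self.fc = nn.Parameter(torch.linspace(50.0, 8000.0, n_filters))  # centers
        self.bw = nn.Parameter(torch.full((n_filters,), 200.0))          # bandwidths

    def forward(self, spec):  # spec: [batch, n_bins, n_frames]
        # Triangular response: 1 at fc, falling linearly to 0 at fc ± BW.
        dist = (self.bin_hz[None, :] - self.fc[:, None]).abs()
        weight = torch.clamp(1 - dist / self.bw[:, None].clamp(min=1.0), min=0.0)
        return torch.matmul(weight, spec)  # -> [batch, n_filters, n_frames]

bins = torch.linspace(0, 22050 / 2, 2048 // 2 + 1)
out = LearnableFilterBank(bins)(torch.rand(2, 1025, 1255))  # -> [2, 128, 1255]
```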
SLIDE 29

Data-driven filter banks

The bandwidth is not fully free: it is derived from the equivalent rectangular bandwidth (ERB), with a trainable Q factor.
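As a sketch of that idea in formulas (the exact parameterization is in the ICASSP 2020 paper): the ERB of Glasberg and Moore is linear in frequency, and the filter bandwidth can be tied to it through a single trainable quality factor Q:

```latex
\mathrm{ERB}(f) = 24.7\left(\frac{4.37\,f}{1000} + 1\right) = \alpha f + \beta,
\quad \alpha \approx 0.1079,\ \beta = 24.7,
\qquad
\mathrm{BW}(f_c) = \frac{\alpha f_c + \beta}{Q}\ \text{with trainable } Q.
```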

SLIDE 30


DATA-DRIVEN HARMONIC FILTERS

SLIDE 31

Harmonic filters

Filters are stacked for harmonics n = 1, 2, 3, 4.

SLIDE 32

Output of harmonic filters

n = 1: fundamental frequency; n = 2: 2nd harmonic of the fundamental; n = 3: 3rd harmonic; n = 4: 4th harmonic.
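A sketch of the harmonic structure: the n-th filter is centered at n times the fundamental center frequency, so all harmonic filters share (and jointly train) the same fundamentals.

```python
import torch

fc = torch.tensor([110.0, 220.0, 440.0])    # illustrative fundamental centers (Hz)
harmonics = torch.arange(1, 5)              # n = 1..4, as on the slide
centers = harmonics[:, None] * fc[None, :]  # [4, 3]: row n-1 holds n * fc
print(centers)  # n=1: fundamentals; n=2..4: their 2nd-4th harmonics
```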

SLIDE 33

Harmonic tensors


SLIDE 34

Harmonic CNN


SLIDE 35

Back-end


SLIDE 36

Experiments

            Music tagging      Keyword spotting    Sound event detection
# data      21k audio clips    106k audio clips    53k audio clips
# classes   50                 35                  17
Task        Multi-labeled      Single-labeled      Multi-labeled

SLIDE 37

Experiments

            Music tagging      Keyword spotting    Sound event detection
# data      21k audio clips    106k audio clips    53k audio clips
# classes   50                 35                  17
Task        Multi-labeled      Single-labeled      Multi-labeled

Example tags: “techno”, “beat”, “no voice”, “fast”, “dance”, … Many tags are closely related to harmonic structure, e.g., timbre, genre, instruments, mood, …

SLIDE 38

Experiments

            Music tagging      Keyword spotting    Sound event detection
# data      21k audio clips    106k audio clips    53k audio clips
# classes   50                 35                  17
Task        Multi-labeled      Single-labeled      Multi-labeled

Example keyword: “wow” (other labels: “yes”, “no”, “one”, “four”, …). Harmonic characteristics are a well-known, important feature for speech recognition (cf. MFCC).

SLIDE 39

Experiments

            Music tagging      Keyword spotting    Sound event detection
# data      21k audio clips    106k audio clips    53k audio clips
# classes   50                 35                  17
Task        Multi-labeled      Single-labeled      Multi-labeled

Example classes: “Ambulance (siren)”, “Civil defense siren” (other labels: “train horn”, “Car”, “Fire truck”, …). Non-music, non-verbal audio signals are expected to have “inharmonic” features.

SLIDE 40

Experiments

Models compared (filters / front-end / back-end):
  • Mel-spectrogram / CNN / CNN
  • Fully-learnable filters / CNN / CNN
  • Partially-learnable filters / CNN / CNN
  • Mel-spectrogram / CNN / RNN
  • Mel-spectrogram / CNN / Attention
  • Linear or mel spectrogram, MFCC / Gated-CRNN

SLIDE 41

Effect of harmonics

SLIDE 42

Harmonic CNN is an efficient and effective architecture for music representation.

All models can be reproduced from the following repository: https://github.com/minzwon/sota-music-tagging-models

SLIDE 43

Harmonic CNN generalizes better to realistic noise than other methods.

SLIDE 44

Interpretable back-end with self-attention mechanism

  • [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
  • [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.

SLIDE 45

Recall: Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

SLIDE 46

Recall: Front-end and back-end framework

[Diagram: Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output]

What’s happening in the back-end?

SLIDE 47

A CNN-based back-end cannot capture “temporal properties”.

[ISMIR 2016] “Automatic tagging using deep convolutional neural networks”, Keunwoo Choi, George Fazekas, Mark Sandler

Stacked convolutions have loose locality: they summarize local patterns but do not model long-term temporal structure.

SLIDE 48

A self-attention back-end captures long-term relationships.
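A PyTorch sketch of such a back-end, in the spirit of the paper rather than its exact configuration: a Transformer encoder over the front-end's frame-level feature sequence, so any two frames can attend to each other directly.

```python
import torch
import torch.nn as nn

class AttentionBackEnd(nn.Module):
    def __init__(self, dim=256, n_tags=50):  # illustrative sizes
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(dim, n_tags)

    def forward(self, feats):    # feats: [batch, n_frames, dim]
        h = self.encoder(feats)  # self-attention relates distant frames
        return torch.sigmoid(self.classifier(h.mean(dim=1)))

out = AttentionBackEnd()(torch.randn(2, 78, 256))  # -> [2, 50]
```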

SLIDE 49

Experiments (music tagging)


SLIDE 50

Self-attention back-end for better interpretability


SLIDE 51

Self-attention back-end for better interpretability


SLIDE 52

Self-attention back-end for better interpretability


SLIDE 53

Attention score visualization.


My network says it has a “quiet” tag. But why?

SLIDE 54

Attention score visualization.


SLIDE 55

Back-end as “sound” detector: Positive tags case


“Piano” / “Drum”

SLIDE 56

Back-end as “sound” detector: Negative tags case


“No vocal” / “Quiet”

SLIDE 57

Tag-wise heatmap contribution


Set the selected attention weight to 1 and all other weights to 0, then measure how each tag’s confidence changes.
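A sketch of that probe (all tensors here are illustrative): replace the learned attention map with a one-hot map that keeps only the selected weight, then re-run the rest of the back-end and read off each tag's confidence.

```python
import torch

attn = torch.softmax(torch.randn(78, 78), dim=-1)  # learned [query, key] map
probe = torch.zeros_like(attn)
probe[40, 30] = 1.0                                # keep only the selected weight
# Feeding `probe` in place of `attn` through the remaining layers isolates that
# single connection's contribution to, e.g., the "Quiet" vs. "Loud" confidences.
```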

SLIDE 58

Tag-wise heatmap contribution


Confidence of “Quiet” / Confidence of “Loud”

SLIDE 59

Tag-wise heatmap contribution


Confidence of “Piano” / Confidence of “Flute”

SLIDE 60

Conclusion


SLIDE 61

  • Powerful front-end: Harmonic CNN
  • Interpretable back-end: self-attention

SLIDE 62

Reference

  • [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra.
  • [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra.
  • [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models. Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra.
  • [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging. Minz Won, Sanghyuk Chun, Xavier Serra.
  • [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention. Minz Won, Sanghyuk Chun, Xavier Serra.