Neural Architectures for Music Representation Learning
Sanghyuk Chun, Clova AI Research
Contents
- Understanding audio signals
- Front-end and back-end framework for audio architectures
- Powerful front-end with Harmonic filter banks
- [ISMIR 2019 Late-Breaking Demo] Automatic Music Tagging with Harmonic CNN.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.
- Interpretable back-end with self-attention mechanism
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.
- Conclusion
Understanding audio signals
Raw audio, spectrograms, and Mel filter banks
Understanding audio signals in the time domain.
[0.001, -0.002, -0.005, -0.004, -0.003, -0.003, -0.003, -0.002, -0.001, …]
A waveform shows the amplitude of the input signal over time. How can we capture frequency information?
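As a minimal sketch (librosa assumed available; "some_song.wav" is a placeholder path), the raw samples above can be obtained like this:

```python
import librosa

# Load 30 seconds of audio at 11025 Hz; "some_song.wav" is a placeholder path.
waveform, sr = librosa.load("some_song.wav", sr=11025, duration=30.0)

print(waveform.shape)   # (330750,) samples = 11025 Hz * 30 s (if the file is long enough)
print(waveform[:10])    # raw amplitude values like those listed above
```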
Understanding audio signals in the frequency domain.
Time-amplitude representation => frequency-amplitude representation (via the Fourier transform).
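A minimal NumPy sketch of moving from the time-amplitude view to the frequency-amplitude view with a discrete Fourier transform (the 440 Hz and 880 Hz sines are only illustrative):

```python
import numpy as np

sr = 11025                                    # sampling rate (Hz)
t = np.arange(sr) / sr                        # 1 second of time stamps
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.fft.rfft(signal)                # complex spectrum
magnitude = np.abs(spectrum)                  # frequency-amplitude view
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two largest peaks appear at 440 Hz and 880 Hz.
print(freqs[np.argsort(magnitude)[-2:]])
```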
Understanding audio signals in the time-frequency domain.
Types of audio inputs (see the short sketch after this list):
- Raw audio waveform
- Linear spectrogram
- Log-scale spectrogram
- Mel spectrogram
- Constant Q transform (CQT)
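A short librosa sketch for computing these representations (the FFT size, hop length, and mel-bin count mirror the numbers used later in this deck and are illustrative, not the papers' exact settings; "some_song.wav" is a placeholder path):

```python
import numpy as np
import librosa

y, sr = librosa.load("some_song.wav", sr=11025, duration=30.0)

# Linear spectrogram: magnitude of the short-time Fourier transform.
linear_spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Log-scale (dB) spectrogram.
log_spec = librosa.amplitude_to_db(linear_spec)

# Mel spectrogram: linear spectrogram projected onto 128 Mel filter banks.
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                          hop_length=512, n_mels=128)

# Constant-Q transform (CQT).
cqt = np.abs(librosa.cqt(y, sr=sr))

print(linear_spec.shape, mel_spec.shape, cqt.shape)
```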
Human perception of audio is log-scale.
Example: 220 Hz → 440 Hz → 880 Hz and 146.83 Hz → 293.66 Hz → 587.33 Hz; each step doubles the frequency, yet we hear the steps (octaves) as equal intervals.
Mel filter banks: Log-scale filter bank.
Mel-spectrogram.
Comparison: raw audio vs. linear spectrogram vs. Mel-spectrogram
- Input shape
  - Raw audio: 1D, sampling rate * audio length = 11025 * 30 ≈ 330K samples
  - Linear spectrogram: 2D, (FFT size / 2 + 1) x # frames = (2048 / 2 + 1) x 1255 = [1025, 1255]; depends on many hyperparameters (hop size, window size, …)
  - Mel-spectrogram: 2D, (# mel bins) x # frames = [128, 1255]
- Information density
  - Raw audio: very sparse along the time axis (1 second = sampling-rate samples), so a very large receptive field is needed
  - Linear spectrogram: sparse along the frequency axis (each time bin has 1025 dimensions), so a large receptive field is needed
  - Mel-spectrogram: less sparse along the frequency axis (each time bin has 128 dimensions)
- Information loss
  - Raw audio: no loss (as long as the sampling rate exceeds the Nyquist rate)
  - Linear spectrogram: time-frequency "resolution" trade-off
  - Mel-spectrogram: "resolution" trade-off plus the Mel filter
Bonus: MFCC (Mel-Frequency Cepstral Coefficients).
Mel-spectrogram (2D: (# mel bins) x # frames = [128, 1255]) → DCT (Discrete Cosine Transform) → MFCC (2D: (# MFCCs) x # frames = [20, 1255])
MFCCs are frequently used in the speech domain, but they are a very lossy representation for high-level music representation learning (cf. hand-crafted descriptors such as SIFT in vision).
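A minimal sketch of the Mel-spectrogram → DCT → MFCC pipeline in librosa ("some_song.wav" is a placeholder path; 128 mel bins and 20 MFCCs follow the slide):

```python
import librosa

y, sr = librosa.load("some_song.wav", sr=11025, duration=30.0)  # placeholder path

# 128-bin Mel spectrogram, then a DCT over the log-mel axis gives 20 MFCCs.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=20)

print(mel.shape)   # (128, # frames)
print(mfcc.shape)  # (20, # frames)
```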
Front-end and back-end framework
- Fully convolutional neural network baseline
- Rethinking convolutional neural networks
- Front-end and back-end framework
Fully convolutional neural network baseline for automatic music tagging.
[ISMIR 2016] "Automatic tagging using deep convolutional neural networks.”, Keunwoo Choi, George Fazekas, Mark Sandler
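A rough PyTorch sketch of a fully convolutional tagging baseline in this spirit (layer sizes and counts are illustrative, not the exact configuration of Choi et al.):

```python
import torch
import torch.nn as nn

class FullyConvTagger(nn.Module):
    """Stacked conv + pooling over a Mel spectrogram, ending in tag probabilities."""
    def __init__(self, n_mels=128, n_tags=50):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU(),
                                 nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, 32), block(32, 64),
                                      block(64, 128), block(128, 128))
        self.classifier = nn.Linear(128, n_tags)

    def forward(self, mel):            # mel: (batch, 1, n_mels, n_frames)
        h = self.features(mel)
        h = h.mean(dim=(2, 3))         # global average pooling over time and frequency
        return torch.sigmoid(self.classifier(h))  # multi-label tag probabilities

model = FullyConvTagger()
print(model(torch.randn(2, 1, 128, 1255)).shape)  # (2, 50)
```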
Rethinking the CNN as a feature extractor and a non-linear classifier.
The CNN can be viewed as a stack of a low-level feature extractor (timbre, pitch), a high-level feature extractor (rhythm, tempo), and a non-linear classifier.
Front-end and back-end framework
Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output
Instances of this framework:
[ICASSP 2017] "Convolutional recurrent neural networks for music classification.”, Choi, et al.
Mel filter banks → CNN front-end → RNN back-end (CRNN)
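A rough sketch of this CNN front-end + RNN back-end combination (a CRNN); sizes are illustrative rather than the exact architecture of Choi et al.:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN front-end for local time-frequency features, RNN back-end for temporal summarization."""
    def __init__(self, n_mels=128, n_tags=50):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 2)),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 2)),
        )
        self.backend = nn.GRU(input_size=128 * (n_mels // 32), hidden_size=128,
                              batch_first=True)
        self.classifier = nn.Linear(128, n_tags)

    def forward(self, mel):                       # (batch, 1, n_mels, n_frames)
        h = self.frontend(mel)                    # (batch, 128, n_mels/32, n_frames/8)
        h = h.permute(0, 3, 1, 2).flatten(2)      # (batch, time, channels * freq)
        _, last = self.backend(h)                 # summarize over time
        return torch.sigmoid(self.classifier(last[-1]))

print(CRNN()(torch.randn(2, 1, 128, 256)).shape)  # (2, 50)
```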
[ISMIR 2018] "End-to-end learning for music audio tagging at scale.”, Pons, et al.
Mel filter banks → CNN front-end → CNN back-end
[SMC 2017] "Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms.”, Lee, et al.
Fully-learnable filters (raw waveform input) → CNN front-end → CNN back-end
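A rough sketch of a fully-learnable front-end in the spirit of the sample-level CNN, operating directly on the raw waveform (filter sizes and depth are illustrative):

```python
import torch
import torch.nn as nn

class SampleLevelCNN(nn.Module):
    """1-D convolutions with very small filters learn the 'filter bank' from raw audio."""
    def __init__(self, n_tags=50):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                                 nn.BatchNorm1d(c_out), nn.ReLU(),
                                 nn.MaxPool1d(3))
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, stride=3),   # learned "filter bank"
            *[block(64 if i == 0 else 128, 128) for i in range(6)],
        )
        self.classifier = nn.Linear(128, n_tags)

    def forward(self, wav):                      # wav: (batch, 1, n_samples)
        h = self.frontend(wav)                   # (batch, 128, n_frames)
        return torch.sigmoid(self.classifier(h.mean(dim=2)))

# 59049 samples (3^10) is a typical excerpt length for this family of models.
print(SampleLevelCNN()(torch.randn(2, 1, 59049)).shape)  # (2, 50)
```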
[ICASSP 2020] "Data-driven Harmonic Filters for Audio Representation Learning.”, Won, et al.
Partially-learnable filters → CNN front-end → CNN back-end
[ArXiv 2019] “Toward Interpretable Music Tagging with Self-attention.”, Won, et al.
Mel filter banks → CNN front-end → self-attention back-end
All of these models share the same filter banks → front-end → back-end structure; they differ in the choice of filter banks (Mel, fully-learnable, or partially-learnable) and back-end (RNN, CNN, or self-attention).
Powerful front-end with Harmonic filter banks
- [ISMIR 2019 Late-Breaking Demo] Automatic Music Tagging with Harmonic CNN.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.
Motivation: Data-driven, but human-guided
- Traditional methods: Mel filter banks → MFCC → SVM, i.e., hand-crafted features with a strong human prior, followed by a classifier.
- Recent methods: fully-learnable filters → CNN → CNN, i.e., a data-driven approach without any human prior.
Data-driven Harmonic Filters
Data-driven filter banks
In a Mel filter bank, the center frequencies f(m) are pre-defined values that depend on the sampling rate, FFT size, number of mel bins, and so on. The proposed data-driven filter is instead parameterized by:
- fc: center frequency
- BW: bandwidth
The bandwidth BW is derived from the equivalent rectangular bandwidth (ERB), with a trainable Q factor.
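A minimal sketch of such a learnable filter bank with trainable center frequency and bandwidth per filter. This is only an illustration of the idea: the real model derives the bandwidth from the ERB-based trainable Q described above, while here the bandwidth is learned directly, and all names and sizes below are assumptions.

```python
import math
import torch
import torch.nn as nn

class LearnableFilterBank(nn.Module):
    """Triangular band-pass filters with trainable center frequencies and bandwidths."""
    def __init__(self, n_filters=128, n_fft_bins=513, sr=16000):
        super().__init__()
        # Frequency (Hz) of each FFT bin; fixed, not trained.
        self.register_buffer("fft_freqs", torch.linspace(0, sr / 2, n_fft_bins))
        # Centers start on a log-frequency grid; both tensors are learned end-to-end.
        centers = torch.logspace(math.log10(40.0), math.log10(0.9 * sr / 2), n_filters)
        self.center = nn.Parameter(centers)
        self.bandwidth = nn.Parameter(0.2 * centers.clone())

    def forward(self, spec):                       # spec: (batch, n_fft_bins, n_frames)
        # Triangular response: 1 at the center frequency, 0 beyond +/- bandwidth.
        dist = (self.fft_freqs[None, :] - self.center[:, None]).abs()
        filters = torch.clamp(1.0 - dist / self.bandwidth[:, None].abs(), min=0.0)
        return torch.matmul(filters, spec)         # (batch, n_filters, n_frames)

fb = LearnableFilterBank()
print(fb(torch.rand(2, 513, 100)).shape)           # torch.Size([2, 128, 100])
```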
Harmonic filters
Harmonic filters are placed at integer multiples of each learned center frequency: n = 1 captures the fundamental frequency, n = 2 its 2nd harmonic, n = 3 its 3rd harmonic, n = 4 its 4th harmonic, and so on.
Harmonic tensors
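A minimal, self-contained sketch of assembling a harmonic tensor: the same band-pass response is evaluated at integer multiples n * fc of each center frequency and stacked along a new harmonic axis. Everything here (triangular filters, the specific center frequencies and bandwidths) is illustrative; the real model learns the centers and ties the bandwidths to the trainable Q above.

```python
import torch

def harmonic_tensor(spec, fft_freqs, centers, bandwidths, n_harmonics=4):
    """Stack band-pass responses at f, 2f, 3f, 4f into a (harmonic, filter, time) tensor.

    spec:       (batch, n_fft_bins, n_frames) magnitude spectrogram
    fft_freqs:  (n_fft_bins,) frequency of each FFT bin in Hz
    centers:    (n_filters,) center frequencies in Hz (trainable in the real model)
    bandwidths: (n_filters,) bandwidths in Hz
    """
    outputs = []
    for n in range(1, n_harmonics + 1):
        dist = (fft_freqs[None, :] - n * centers[:, None]).abs()
        filters = torch.clamp(1.0 - dist / (n * bandwidths[:, None]), min=0.0)
        outputs.append(torch.matmul(filters, spec))      # (batch, n_filters, n_frames)
    return torch.stack(outputs, dim=1)                   # (batch, n_harmonics, n_filters, n_frames)

spec = torch.rand(2, 513, 100)
fft_freqs = torch.linspace(0, 8000, 513)
centers = torch.linspace(55, 1760, 128)
bandwidths = 0.1 * centers
print(harmonic_tensor(spec, fft_freqs, centers, bandwidths).shape)  # (2, 4, 128, 100)
```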
Harmonic CNN stacks the data-driven harmonic filter banks, a CNN front-end, and a CNN back-end.
Experiments
Three tasks and datasets:
- Music tagging: 21k audio clips, 50 classes, multi-labeled.
  Example tags: "techno", "beat", "no voice", "fast", "dance", … Many tags are highly related to harmonic structure, e.g., timbre, genre, instruments, and mood.
- Keyword spotting: 106k audio clips, 35 classes, single-labeled.
  Example labels: "wow", "yes", "no", "one", "four", … Harmonic characteristics are a well-known, important feature for speech recognition (cf. MFCC).
- Sound event detection: 53k audio clips, 17 classes, multi-labeled.
  Example labels: "Ambulance (siren)", "Civil defense siren", "train horn", "Car", "Fire truck", … Non-music, non-verbal audio signals are expected to have "inharmonic" features.
Compared models: front-ends based on linear/Mel-spectrograms, MFCC, fully-learnable filters, and partially-learnable filters, combined with CNN, RNN, attention, and gated-CRNN back-ends.
Effect of harmonic filters
Harmonic CNN is an efficient and effective architecture for music representation learning.
All models can be reproduced with the following repository: https://github.com/minzwon/sota-music-tagging-models
Harmonic CNN also generalizes better to realistic noise than the other methods.
Interpretable back-end with self-attention mechanism
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.
Recall: Front-end and back-end framework
Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output
What is happening in the back-end?
A CNN-based back-end cannot fully capture temporal properties: stacked convolutions only model loose locality, so long-term relationships across time are hard to capture.
[ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler
A self-attention back-end is used to capture long-term relationships.
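A rough sketch of a self-attention back-end: the front-end's feature map is treated as a sequence over time and fed through standard Transformer encoder layers. Sizes are illustrative, the paper's exact configuration is not reproduced here, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class SelfAttentionBackend(nn.Module):
    """Summarize front-end features over time with self-attention instead of convolutions."""
    def __init__(self, feat_dim=128, n_tags=50, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, n_tags)

    def forward(self, features):              # features: (batch, n_frames, feat_dim)
        h = self.encoder(features)            # every frame attends to every other frame
        return torch.sigmoid(self.classifier(h.mean(dim=1)))

# Front-end output: e.g. 64 time steps of 128-dim local features.
print(SelfAttentionBackend()(torch.randn(2, 64, 128)).shape)  # (2, 50)
```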
Experiments (music tagging)
Self-attention back-end for better interpretability
Attention score visualization.
My network says it has a “quiet” tag. But why?
Back-end as “sound” detector: Positive tags case
Examples: "Piano", "Drum"
Back-end as “sound” detector: Negative tags case
Examples: "No vocal", "Quiet"
Tag-wise heatmap contribution
Set the selected attention weight to 1 and the others to 0, then observe how each tag's confidence changes.
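A minimal sketch of this manipulation, assuming a hypothetical model interface that accepts an attention-weight override (the `attention_weights` argument and the model itself are placeholders, not the paper's actual API):

```python
import torch

def tagwise_contribution(model, features, position, tag_index):
    """Confidence of one tag when all attention mass is forced onto a single time step.

    `model` is assumed to accept an `attention_weights` override of shape (n_frames,);
    this interface is a placeholder for whatever hook the real implementation exposes.
    """
    n_frames = features.shape[1]
    forced = torch.zeros(n_frames)
    forced[position] = 1.0                      # selected weight -> 1, all others -> 0
    with torch.no_grad():
        tag_probs = model(features, attention_weights=forced)
    return tag_probs[0, tag_index].item()
```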
Example: confidence of "Quiet" vs. confidence of "Loud"
Example: confidence of "Piano" vs. confidence of "Flute"
Conclusion
- Powerful front-end with Harmonic CNN
- Interpretable back-end with self-attention
References
- [ISMIR 2019 Late-Breaking Demo] Automatic Music Tagging with Harmonic CNN. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models. Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra.
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging. Minz Won, Sanghyuk Chun, Xavier Serra.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention. Minz Won, Sanghyuk Chun, Xavier Serra.