Neural Architectures for Music Representation Learning


  1. Neural Architectures for Music Representation Learning Sanghyuk Chun, Clova AI Research

  2. Contents
     - Understanding audio signals
     - Front-end and back-end framework for audio architectures
     - Powerful front-end with Harmonic filter banks
       - [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
       - [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
       - [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.
     - Interpretable back-end with self-attention mechanism
       - [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
       - [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.
     - Conclusion

  3. Understanding audio signals
     - Raw audio
     - Spectrogram
     - Mel filter bank

  4. Understanding audio signals in the time domain.
     [0.001, -0.002, -0.005, -0.004, -0.003, -0.003, -0.003, -0.002, -0.001, …]
     The "waveform" shows the "amplitude" of the input signal across time. How can we capture "frequency" information?

  5. Understanding audio signals in the frequency domain.
     Time-amplitude => frequency-amplitude (the Fourier transform).
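
A quick way to see this mapping is the discrete Fourier transform. Below is a minimal NumPy sketch; the 440 Hz test tone and the 16 kHz sampling rate are illustrative choices, not values from the slides.

```python
import numpy as np

# Illustrative signal: a 1-second 440 Hz sine tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
waveform = np.sin(2 * np.pi * 440.0 * t)            # time-amplitude

# The real FFT maps the signal into the frequency domain.
spectrum = np.abs(np.fft.rfft(waveform))             # frequency-amplitude
freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sr)   # frequency axis in Hz

print(freqs[np.argmax(spectrum)])                    # ~440.0 Hz
```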

  6. Understanding audio signals in the time-frequency domain.
     Types of audio inputs:
     - Raw audio waveform
     - Linear spectrogram
     - Log-scale spectrogram
     - Mel spectrogram
     - Constant-Q transform (CQT)

  7. Human perception of audio is log-scale.
     220 Hz -> 440 Hz -> 880 Hz (A3 -> A4 -> A5) and 146.83 Hz -> 293.66 Hz -> 587.33 Hz (D3 -> D4 -> D5): each doubling in frequency is perceived as the same one-octave step.

  8. Mel filter banks: a log-scale filter bank.

  9. Mel-spectrogram.
     Raw audio
       - Input shape: 1D, sampling rate x audio length = 11025 x 30 = 330K
       - Information: very sparse in the time axis (needs a very large receptive field)
       - Density: 1 sec = sampling-rate samples
       - Loss: no loss (if the sampling rate exceeds the Nyquist rate)
     Linear spectrogram
       - Input shape: 2D, (# fft / 2 + 1) x # frames = (2048 / 2 + 1) x 1255 = [1025, 1255]
       - Information: sparse in the frequency axis (needs a large receptive field)
       - Density: each time bin has 1025 dims
       - Loss: "resolution"
     Mel-spectrogram
       - Input shape: 2D, (# mel bins) x # frames = [128, 1255]
       - Information: less sparse in the frequency axis
       - Density: each time bin has 128 dims
       - Loss: "resolution" + mel filter
     The spectrogram representations depend on many hyperparameters (hop size, window size, …).
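
As a rough illustration of these shapes, the sketch below computes the representations with librosa. The file path is a placeholder, the sampling rate and FFT size follow the slide (11025 Hz, n_fft = 2048), and the exact frame count depends on the hop size, which the slide does not specify.

```python
import librosa
import numpy as np

# Placeholder path; load 30 seconds of audio at 11025 Hz as on the slide.
y, sr = librosa.load("example.mp3", sr=11025, duration=30.0)
print(y.shape)        # (330750,) = sampling rate x audio length (if the clip is >= 30 s)

# Linear (magnitude) spectrogram: (n_fft / 2 + 1) x frames = 1025 x frames.
spec = np.abs(librosa.stft(y, n_fft=2048))
print(spec.shape)

# Mel-spectrogram: 128 mel bins x frames; the mel filter bank matrix itself is (128, 1025).
mel_fb = librosa.filters.mel(sr=sr, n_fft=2048, n_mels=128)
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, n_mels=128)
print(mel_fb.shape, mel_spec.shape)
```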

  10. Bonus: MFCC (Mel-Frequency Cepstral Coefficients).
      - Mel-spectrogram -> DCT (Discrete Cosine Transform) -> MFCC
      - Shape: 2D, (# mel bins) x # frames = [128, 1255] -> (# mfcc) x # frames = [20, 1255]
      - Frequently used in the speech domain.
      - Too lossy for high-level music representation learning (cf. hand-crafted SIFT features in vision).
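
A minimal sketch of the same idea with librosa (placeholder path, 20 coefficients as on the slide); librosa applies the log and the DCT to the mel-spectrogram internally.

```python
import librosa

y, sr = librosa.load("example.mp3", sr=11025, duration=30.0)   # placeholder path

# MFCC = DCT of the log mel-spectrogram.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_mels=128)
print(mfcc.shape)   # (20, frames)
```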

  11. Front-end and back-end framework
      - Fully convolutional neural network baseline
      - Rethinking convolutional neural networks
      - Front-end and back-end framework

  12. Fully convolutional neural network baseline for automatic music tagging.
      [ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler
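
A rough PyTorch sketch of this kind of fully convolutional pipeline is shown below. The layer count, channel widths, and pooling sizes are illustrative, not the exact configuration of Choi et al.; the shared idea is a stack of conv + pooling blocks over the mel-spectrogram followed by a sigmoid output for multi-label tagging.

```python
import torch
import torch.nn as nn

class FCNBaseline(nn.Module):
    """Illustrative fully convolutional tagger: mel-spectrogram in, tag probabilities out."""
    def __init__(self, n_tags=50):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128), block(128, 128))
        self.classifier = nn.Linear(128, n_tags)

    def forward(self, mel):                       # mel: (batch, 1, mel bins, frames)
        h = self.features(mel)
        h = h.mean(dim=[2, 3])                    # global average pooling over frequency and time
        return torch.sigmoid(self.classifier(h))  # multi-label tag probabilities

model = FCNBaseline()
print(model(torch.randn(2, 1, 128, 1255)).shape)  # -> (2, 50)
```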

  13. Rethinking CNN as feature extractor and non-linear classifier.

  14. Rethinking CNN as feature extractor and non-linear classifier.
      Low-level features (timbre, pitch) -> high-level features (rhythm, tempo) -> non-linear classifier.
      [ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler

  15. Rethinking CNN as feature extractor and non-linear classifier.

  16. Front-end and back-end framework
      Input -> Filter banks (time-frequency feature extraction) -> Front-end (local feature extraction) -> Back-end (temporal summarization & classification) -> Output

  17. Front-end and back-end framework
      Filter banks: Mel-filter banks | Front-end: CNN | Back-end: RNN
      [ICASSP 2017] "Convolutional recurrent neural networks for music classification.", Choi, et al.

  18. Front-end and back-end framework
      Filter banks: Mel-filter banks | Front-end: CNN | Back-end: CNN
      [ISMIR 2018] "End-to-end learning for music audio tagging at scale.", Pons, et al.

  20. Front-end and back-end framework
      Filter banks: Fully-learnable | Front-end: CNN | Back-end: CNN
      [SMC 2017] "Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms.", Lee, et al.
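
A rough sketch of such a fully learnable front-end operating directly on the raw waveform is given below. The kernel size and stride of 3 follow the sample-level idea of Lee et al., while the depth and channel width are illustrative.

```python
import torch
import torch.nn as nn

class SampleLevelFrontEnd(nn.Module):
    """Illustrative sample-level front-end: stacked 1-D convolutions on raw audio."""
    def __init__(self, n_channels=128, n_blocks=6):
        super().__init__()
        layers = [nn.Sequential(nn.Conv1d(1, n_channels, kernel_size=3, stride=3),
                                nn.BatchNorm1d(n_channels), nn.ReLU())]
        for _ in range(n_blocks):
            layers.append(nn.Sequential(
                nn.Conv1d(n_channels, n_channels, kernel_size=3, padding=1),
                nn.BatchNorm1d(n_channels), nn.ReLU(),
                nn.MaxPool1d(3)))                 # downsample by 3 per block
        self.net = nn.Sequential(*layers)

    def forward(self, wav):                       # wav: (batch, 1, samples)
        return self.net(wav)                      # (batch, n_channels, frames)

frontend = SampleLevelFrontEnd()
print(frontend(torch.randn(2, 1, 59049)).shape)   # 3**10 samples -> (2, 128, 27)
```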

  21. Front-end and back-end framework
      Filter banks: Partially-learnable | Front-end: CNN | Back-end: CNN
      [ICASSP 2020] "Data-driven Harmonic Filters for Audio Representation Learning.", Won, et al.

  22. Front-end and back-end framework
      Filter banks: Mel-filter banks | Front-end: CNN | Back-end: Self-attention
      [ArXiv 2019] "Toward Interpretable Music Tagging with Self-attention.", Won, et al.
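
A minimal sketch of a self-attention back-end over the front-end's frame-level features is shown below. It uses a generic Transformer encoder and mean pooling, not the exact configuration of Won et al.; the appeal is that the attention weights indicate which time frames drive each prediction, which is what makes this back-end interpretable.

```python
import torch
import torch.nn as nn

class SelfAttentionBackEnd(nn.Module):
    """Illustrative back-end: self-attention over time, then pooling and a tag classifier."""
    def __init__(self, d_model=128, n_heads=4, n_layers=2, n_tags=50):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_tags)

    def forward(self, feats):                     # feats: (batch, frames, d_model) from a CNN front-end
        h = self.encoder(feats)                   # self-attention mixes information across time frames
        h = h.mean(dim=1)                         # temporal summarization
        return torch.sigmoid(self.classifier(h))  # multi-label tag probabilities

backend = SelfAttentionBackEnd()
print(backend(torch.randn(2, 27, 128)).shape)     # -> (2, 50)
```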

  23. Front-end and back-end framework
      Filter banks          Front-end   Back-end
      Mel-filter banks      CNN         RNN
      Mel-filter banks      CNN         CNN
      Fully-learnable       CNN         CNN
      Partially-learnable   CNN         CNN
      Mel-filter banks      CNN         Self-attention

  24. Powerful front-end with Harmonic filter banks
      - [ISMIR 2019 Late Break Demo] Automatic Music Tagging with Harmonic CNN.
      - [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
      - [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.

  25. Motivation: Data-driven, but human-guided
      Input -> Filter banks -> Front-end -> Back-end -> Output
      Traditional methods: Mel-filter banks | MFCC | SVM (hand-crafted features with a strong human prior)
      Recent methods: Fully-learnable | CNN | CNN classification (data-driven approach without any human prior)

  26. DATA-DRIVEN HARMONIC FILTERS

  27. DATA-DRIVEN HARMONIC FILTERS

  28. Data-driven filter banks
      The proposed data-driven filter is parameterized by:
      - f_c: center frequency
      - BW: bandwidth
      f(m): pre-defined frequency values depending on the sampling rate, FFT size, number of mel bins, …

  29. Data-driven filter banks
      The bandwidth is derived from the equivalent rectangular bandwidth (ERB), with a trainable Q factor.
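
A rough sketch of this idea is given below: triangular band-pass filters with trainable center frequencies, whose bandwidths follow the equivalent rectangular bandwidth, ERB(f) ≈ 24.7 · (4.37 f / 1000 + 1) Hz, divided by a single trainable Q. The parameterization, initialization, and sizes here are illustrative assumptions for the sketch, not the exact formulation of Won et al.

```python
import torch
import torch.nn as nn

class LearnableTriangularFilterBank(nn.Module):
    """Illustrative data-driven filter bank: trainable center frequencies f_c and a trainable Q."""
    def __init__(self, sr=16000, n_fft=512, n_filters=128):
        super().__init__()
        # f(m): fixed frequency values determined by the sampling rate and FFT size.
        self.register_buffer("f", torch.linspace(0.0, sr / 2, n_fft // 2 + 1))
        # Trainable parameters: center frequencies (uniform initialization) and quality factor Q.
        self.f_c = nn.Parameter(torch.linspace(40.0, sr / 2 - 40.0, n_filters))
        self.Q = nn.Parameter(torch.tensor(1.0))

    def filter_bank(self):
        # Bandwidth derived from ERB(f_c), scaled by the trainable Q.
        bw = 24.7 * (4.37 * self.f_c / 1000.0 + 1.0) / self.Q              # (n_filters,)
        # Triangular response: 1 at f_c, linearly falling to 0 at f_c +/- bw.
        dist = torch.abs(self.f.unsqueeze(0) - self.f_c.unsqueeze(1))      # (n_filters, n_bins)
        return torch.clamp(1.0 - dist / bw.unsqueeze(1), min=0.0)

    def forward(self, spec):                 # spec: (batch, n_bins, frames) magnitude spectrogram
        return torch.matmul(self.filter_bank(), spec)                      # (batch, n_filters, frames)

fb = LearnableTriangularFilterBank()
print(fb(torch.rand(2, 257, 100)).shape)     # -> (2, 128, 100)
```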

  30. DATA-DRIVEN HARMONIC FILTERS

  31. Harmonic filters: n = 1, 2, 3, 4.

  32. Output of harmonic filters (n = 1, 2, 3, 4): n = 1 captures the fundamental frequency; n = 2, 3, 4 capture the 2nd, 3rd, and 4th harmonics of the fundamental frequency.

  33. Harmonic tensors.
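
A rough sketch of how such a harmonic tensor could be assembled is shown below: for each learned center frequency f_c, the triangular filters from the previous sketch are also evaluated at n · f_c (n = 1…4 as on the earlier slides), and the per-harmonic outputs are stacked along a new channel axis. This illustrates the stacking idea under the same assumptions as the filter-bank sketch above; it is not the exact implementation of Harmonic CNN.

```python
import torch

def harmonic_tensor(spec, f, f_c, Q, n_harmonics=4):
    """Stack filter-bank outputs evaluated at n * f_c into a (harmonic, filter, time) tensor.

    spec: (n_bins, frames) magnitude spectrogram; f: (n_bins,) frequency values f(m);
    f_c: (n_filters,) learned center frequencies; Q: scalar trainable quality factor.
    """
    outputs = []
    for n in range(1, n_harmonics + 1):
        center = n * f_c                                            # centers of the n-th harmonic filters
        bw = 24.7 * (4.37 * center / 1000.0 + 1.0) / Q              # ERB-based bandwidth, scaled by Q
        dist = torch.abs(f.unsqueeze(0) - center.unsqueeze(1))      # (n_filters, n_bins)
        fbank = torch.clamp(1.0 - dist / bw.unsqueeze(1), min=0.0)  # triangular filters
        outputs.append(fbank @ spec)                                # (n_filters, frames)
    return torch.stack(outputs)                                     # (n_harmonics, n_filters, frames)

H = harmonic_tensor(torch.rand(257, 100), torch.linspace(0.0, 8000.0, 257),
                    torch.linspace(40.0, 2000.0, 128), torch.tensor(1.0))
print(H.shape)   # -> (4, 128, 100)
```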

  34. Harmonic CNN.

  35. Back-end.

  36. Experiments
                     Music tagging     Keyword spotting   Sound event detection
      # data         21k audio clips   106k audio clips   53k audio clips
      # classes      50                35                 17
      Label type     Multi-labeled     Single-labeled     Multi-labeled

  37. Experiments: music tagging (21k audio clips, 50 classes, multi-labeled)
      Example tags: "techno", "beat", "no voice", "fast", "dance", …
      Many tags are highly related to harmonic structure, e.g., timbre, genre, instruments, mood, …

  38. Experiments: keyword spotting (106k audio clips, 35 classes, single-labeled)
      Example keyword: "wow" (other labels: "yes", "no", "one", "four", …)
      Harmonic characteristics are a well-known, important feature for speech recognition (cf. MFCC).

  39. Experiments: sound event detection (53k audio clips, 17 classes, multi-labeled)
      Example labels: "Ambulance (siren)", "Civil defense siren" (other labels: "train horn", "Car", "Fire truck", …)
      Non-music and non-verbal audio signals are expected to have "inharmonic" features.

  40. Experiments: compared models (filters | front-end | back-end)
      Mel-spectrogram | CNN | CNN
      Mel-spectrogram | CNN | Attention RNN
      Linear / Mel spectrogram, MFCC | Gated-CRNN (front-end and back-end combined)
      Fully-learnable | CNN | CNN
      Partially-learnable | CNN | CNN

  41. Effect of harmonics.

  42. Harmonic CNN is an efficient and effective architecture for music representation.
      All models can be reproduced with the following repository: https://github.com/minzwon/sota-music-tagging-models

  43. Harmonic CNN generalizes better to realistic noise than other methods.
