Neural Architectures for Music Representation Learning
Sanghyuk Chun, Clova AI Research
Contents
- Understanding audio signals
- Front-end and back-end framework for audio architectures
- Powerful front-end with Harmonic filter banks
- [ISMIR 2019 Late-Breaking Demo] Automatic Music Tagging with Harmonic CNN.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.
- Interpretable back-end with self-attention mechanism
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.
- Conclusion
Understanding audio signals
Raw audio, spectrograms, and Mel filter banks
Understanding audio signals in the time domain.
[0.001, -0.002, -0.005, -0.004, -0.003, -0.003, -0.003, -0.002, -0.001, …]
A waveform shows the amplitude of the input signal over time. How can we capture frequency information?
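As a minimal sketch (librosa assumed available; "some_song.wav" is a placeholder path), the raw samples above can be obtained like this:

```python
import librosa

# Load 30 seconds of audio at 11025 Hz; "some_song.wav" is a placeholder path.
waveform, sr = librosa.load("some_song.wav", sr=11025, duration=30.0)

print(waveform.shape)   # (330750,) samples = 11025 Hz * 30 s (if the file is long enough)
print(waveform[:10])    # raw amplitude values like those listed above
```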
Understanding audio signals in the frequency domain.
Time-amplitude representation => frequency-amplitude representation (via the Fourier transform).
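A minimal NumPy sketch of moving from the time-amplitude view to the frequency-amplitude view with a discrete Fourier transform (the 440 Hz and 880 Hz sines are only illustrative):

```python
import numpy as np

sr = 11025                                    # sampling rate (Hz)
t = np.arange(sr) / sr                        # 1 second of time stamps
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)

spectrum = np.fft.rfft(signal)                # complex spectrum
magnitude = np.abs(spectrum)                  # frequency-amplitude view
freqs = np.fft.rfftfreq(len(signal), d=1 / sr)

# The two largest peaks appear at 440 Hz and 880 Hz.
print(freqs[np.argsort(magnitude)[-2:]])
```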
Understanding audio signals in the time-frequency domain.
Types of audio inputs (see the short sketch after this list):
- Raw audio waveform
- Linear spectrogram
- Log-scale spectrogram
- Mel spectrogram
- Constant Q transform (CQT)
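A short librosa sketch for computing these representations (the FFT size, hop length, and mel-bin count mirror the numbers used later in this deck and are illustrative, not the papers' exact settings; "some_song.wav" is a placeholder path):

```python
import numpy as np
import librosa

y, sr = librosa.load("some_song.wav", sr=11025, duration=30.0)

# Linear spectrogram: magnitude of the short-time Fourier transform.
linear_spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Log-scale (dB) spectrogram.
log_spec = librosa.amplitude_to_db(linear_spec)

# Mel spectrogram: linear spectrogram projected onto 128 Mel filter banks.
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                          hop_length=512, n_mels=128)

# Constant-Q transform (CQT).
cqt = np.abs(librosa.cqt(y, sr=sr))

print(linear_spec.shape, mel_spec.shape, cqt.shape)
```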
Human perception of audio is log-scale.
Example: 220 Hz → 440 Hz → 880 Hz and 146.83 Hz → 293.66 Hz → 587.33 Hz; each step doubles the frequency, yet we hear the steps (octaves) as equal intervals.
Mel filter banks: Log-scale filter bank.
Mel-spectrogram.
Comparison: raw audio vs. linear spectrogram vs. Mel-spectrogram
- Input shape
  - Raw audio: 1D, sampling rate * audio length = 11025 * 30 ≈ 330K samples
  - Linear spectrogram: 2D, (FFT size / 2 + 1) x # frames = (2048 / 2 + 1) x 1255 = [1025, 1255]; depends on many hyperparameters (hop size, window size, …)
  - Mel-spectrogram: 2D, (# mel bins) x # frames = [128, 1255]
- Information density
  - Raw audio: very sparse along the time axis (1 second = sampling-rate samples), so a very large receptive field is needed
  - Linear spectrogram: sparse along the frequency axis (each time bin has 1025 dimensions), so a large receptive field is needed
  - Mel-spectrogram: less sparse along the frequency axis (each time bin has 128 dimensions)
- Information loss
  - Raw audio: no loss (as long as the sampling rate exceeds the Nyquist rate)
  - Linear spectrogram: time-frequency "resolution" trade-off
  - Mel-spectrogram: "resolution" trade-off plus the Mel filter
Bonus: MFCC (Mel-Frequency Cepstral Coefficients).
Mel-spectrogram (2D: (# mel bins) x # frames = [128, 1255]) → DCT (Discrete Cosine Transform) → MFCC (2D: (# MFCCs) x # frames = [20, 1255])
MFCCs are frequently used in the speech domain, but they are a very lossy representation for high-level music representation learning (cf. hand-crafted descriptors such as SIFT in vision).
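A minimal sketch of the Mel-spectrogram → DCT → MFCC pipeline in librosa ("some_song.wav" is a placeholder path; 128 mel bins and 20 MFCCs follow the slide):

```python
import librosa

y, sr = librosa.load("some_song.wav", sr=11025, duration=30.0)  # placeholder path

# 128-bin Mel spectrogram, then a DCT over the log-mel axis gives 20 MFCCs.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=20)

print(mel.shape)   # (128, # frames)
print(mfcc.shape)  # (20, # frames)
```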
Front-end and back-end framework
- Fully convolutional neural network baseline
- Rethinking convolutional neural networks
- Front-end and back-end framework
Fully convolutional neural network baseline for automatic music tagging.
[ISMIR 2016] "Automatic tagging using deep convolutional neural networks.”, Keunwoo Choi, George Fazekas, Mark Sandler
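A rough PyTorch sketch of a fully convolutional tagging baseline in this spirit (layer sizes and counts are illustrative, not the exact configuration of Choi et al.):

```python
import torch
import torch.nn as nn

class FullyConvTagger(nn.Module):
    """Stacked conv + pooling over a Mel spectrogram, ending in tag probabilities."""
    def __init__(self, n_mels=128, n_tags=50):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm2d(c_out), nn.ReLU(),
                                 nn.MaxPool2d(2))
        self.features = nn.Sequential(block(1, 32), block(32, 64),
                                      block(64, 128), block(128, 128))
        self.classifier = nn.Linear(128, n_tags)

    def forward(self, mel):            # mel: (batch, 1, n_mels, n_frames)
        h = self.features(mel)
        h = h.mean(dim=(2, 3))         # global average pooling over time and frequency
        return torch.sigmoid(self.classifier(h))  # multi-label tag probabilities

model = FullyConvTagger()
print(model(torch.randn(2, 1, 128, 1255)).shape)  # (2, 50)
```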
Rethinking the CNN as a feature extractor and a non-linear classifier.
The CNN can be viewed as a stack of a low-level feature extractor (timbre, pitch), a high-level feature extractor (rhythm, tempo), and a non-linear classifier.
Front-end and back-end framework
Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output
Instances of this framework:
[ICASSP 2017] "Convolutional recurrent neural networks for music classification.”, Choi, et al.
Mel filter banks → CNN front-end → RNN back-end (CRNN)
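A rough sketch of this CNN front-end + RNN back-end combination (a CRNN); sizes are illustrative rather than the exact architecture of Choi et al.:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN front-end for local time-frequency features, RNN back-end for temporal summarization."""
    def __init__(self, n_mels=128, n_tags=50):
        super().__init__()
        self.frontend = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 2)),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((4, 2)),
        )
        self.backend = nn.GRU(input_size=128 * (n_mels // 32), hidden_size=128,
                              batch_first=True)
        self.classifier = nn.Linear(128, n_tags)

    def forward(self, mel):                       # (batch, 1, n_mels, n_frames)
        h = self.frontend(mel)                    # (batch, 128, n_mels/32, n_frames/8)
        h = h.permute(0, 3, 1, 2).flatten(2)      # (batch, time, channels * freq)
        _, last = self.backend(h)                 # summarize over time
        return torch.sigmoid(self.classifier(last[-1]))

print(CRNN()(torch.randn(2, 1, 128, 256)).shape)  # (2, 50)
```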
[ISMIR 2018] "End-to-end learning for music audio tagging at scale.”, Pons, et al.
Mel filter banks → CNN front-end → CNN back-end
[SMC 2017] "Sample-level Deep Convolutional Neural Networks for Music Auto-tagging Using Raw Waveforms.”, Lee, et al.
Fully-learnable filters (raw waveform input) → CNN front-end → CNN back-end
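A rough sketch of a fully-learnable front-end in the spirit of the sample-level CNN, operating directly on the raw waveform (filter sizes and depth are illustrative):

```python
import torch
import torch.nn as nn

class SampleLevelCNN(nn.Module):
    """1-D convolutions with very small filters learn the 'filter bank' from raw audio."""
    def __init__(self, n_tags=50):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=3, padding=1),
                                 nn.BatchNorm1d(c_out), nn.ReLU(),
                                 nn.MaxPool1d(3))
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=3, stride=3),   # learned "filter bank"
            *[block(64 if i == 0 else 128, 128) for i in range(6)],
        )
        self.classifier = nn.Linear(128, n_tags)

    def forward(self, wav):                      # wav: (batch, 1, n_samples)
        h = self.frontend(wav)                   # (batch, 128, n_frames)
        return torch.sigmoid(self.classifier(h.mean(dim=2)))

# 59049 samples (3^10) is a typical excerpt length for this family of models.
print(SampleLevelCNN()(torch.randn(2, 1, 59049)).shape)  # (2, 50)
```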
[ICASSP 2020] "Data-driven Harmonic Filters for Audio Representation Learning.”, Won, et al.
Partially-learnable filters → CNN front-end → CNN back-end
[ArXiv 2019] “Toward Interpretable Music Tagging with Self-attention.”, Won, et al.
Mel filter banks → CNN front-end → self-attention back-end
All of these models share the same filter banks → front-end → back-end structure; they differ in the choice of filter banks (Mel, fully-learnable, or partially-learnable) and back-end (RNN, CNN, or self-attention).
Powerful front-end with Harmonic filter banks
- [ISMIR 2019 Late-Breaking Demo] Automatic Music Tagging with Harmonic CNN.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models.
Motivation: Data-driven, but human-guided
- Traditional methods: Mel filter banks → MFCC → SVM, i.e., hand-crafted features with a strong human prior, followed by a classifier.
- Recent methods: fully-learnable filters → CNN → CNN, i.e., a data-driven approach without any human prior.
Data-driven Harmonic Filters
Data-driven filter banks
In a Mel filter bank, the center frequencies f(m) are pre-defined values that depend on the sampling rate, FFT size, number of mel bins, and so on. The proposed data-driven filter is instead parameterized by:
- fc: center frequency
- BW: bandwidth
The bandwidth BW is derived from the equivalent rectangular bandwidth (ERB), with a trainable Q factor.
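A minimal sketch of such a learnable filter bank with trainable center frequency and bandwidth per filter. This is only an illustration of the idea: the real model derives the bandwidth from the ERB-based trainable Q described above, while here the bandwidth is learned directly, and all names and sizes below are assumptions.

```python
import math
import torch
import torch.nn as nn

class LearnableFilterBank(nn.Module):
    """Triangular band-pass filters with trainable center frequencies and bandwidths."""
    def __init__(self, n_filters=128, n_fft_bins=513, sr=16000):
        super().__init__()
        # Frequency (Hz) of each FFT bin; fixed, not trained.
        self.register_buffer("fft_freqs", torch.linspace(0, sr / 2, n_fft_bins))
        # Centers start on a log-frequency grid; both tensors are learned end-to-end.
        centers = torch.logspace(math.log10(40.0), math.log10(0.9 * sr / 2), n_filters)
        self.center = nn.Parameter(centers)
        self.bandwidth = nn.Parameter(0.2 * centers.clone())

    def forward(self, spec):                       # spec: (batch, n_fft_bins, n_frames)
        # Triangular response: 1 at the center frequency, 0 beyond +/- bandwidth.
        dist = (self.fft_freqs[None, :] - self.center[:, None]).abs()
        filters = torch.clamp(1.0 - dist / self.bandwidth[:, None].abs(), min=0.0)
        return torch.matmul(filters, spec)         # (batch, n_filters, n_frames)

fb = LearnableFilterBank()
print(fb(torch.rand(2, 513, 100)).shape)           # torch.Size([2, 128, 100])
```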
Harmonic filters
Harmonic filters are placed at integer multiples of each learned center frequency: n = 1 captures the fundamental frequency, n = 2 its 2nd harmonic, n = 3 its 3rd harmonic, n = 4 its 4th harmonic, and so on.
Harmonic tensors
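A minimal, self-contained sketch of assembling a harmonic tensor: the same band-pass response is evaluated at integer multiples n * fc of each center frequency and stacked along a new harmonic axis. Everything here (triangular filters, the specific center frequencies and bandwidths) is illustrative; the real model learns the centers and ties the bandwidths to the trainable Q above.

```python
import torch

def harmonic_tensor(spec, fft_freqs, centers, bandwidths, n_harmonics=4):
    """Stack band-pass responses at f, 2f, 3f, 4f into a (harmonic, filter, time) tensor.

    spec:       (batch, n_fft_bins, n_frames) magnitude spectrogram
    fft_freqs:  (n_fft_bins,) frequency of each FFT bin in Hz
    centers:    (n_filters,) center frequencies in Hz (trainable in the real model)
    bandwidths: (n_filters,) bandwidths in Hz
    """
    outputs = []
    for n in range(1, n_harmonics + 1):
        dist = (fft_freqs[None, :] - n * centers[:, None]).abs()
        filters = torch.clamp(1.0 - dist / (n * bandwidths[:, None]), min=0.0)
        outputs.append(torch.matmul(filters, spec))      # (batch, n_filters, n_frames)
    return torch.stack(outputs, dim=1)                   # (batch, n_harmonics, n_filters, n_frames)

spec = torch.rand(2, 513, 100)
fft_freqs = torch.linspace(0, 8000, 513)
centers = torch.linspace(55, 1760, 128)
bandwidths = 0.1 * centers
print(harmonic_tensor(spec, fft_freqs, centers, bandwidths).shape)  # (2, 4, 128, 100)
```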
Harmonic CNN stacks the data-driven harmonic filter banks, a CNN front-end, and a CNN back-end.
Experiments
Three tasks and datasets:
- Music tagging: 21k audio clips, 50 classes, multi-labeled.
  Example tags: "techno", "beat", "no voice", "fast", "dance", … Many tags are highly related to harmonic structure, e.g., timbre, genre, instruments, and mood.
- Keyword spotting: 106k audio clips, 35 classes, single-labeled.
  Example labels: "wow", "yes", "no", "one", "four", … Harmonic characteristics are a well-known, important feature for speech recognition (cf. MFCC).
- Sound event detection: 53k audio clips, 17 classes, multi-labeled.
  Example labels: "Ambulance (siren)", "Civil defense siren", "train horn", "Car", "Fire truck", … Non-music, non-verbal audio signals are expected to have "inharmonic" features.
Compared models: front-ends based on linear/Mel-spectrograms, MFCC, fully-learnable filters, and partially-learnable filters, combined with CNN, RNN, attention, and gated-CRNN back-ends.
Effect of harmonic filters
Harmonic CNN is an efficient and effective architecture for music representation learning.
All models can be reproduced with the following repository: https://github.com/minzwon/sota-music-tagging-models
Harmonic CNN also generalizes better to realistic noise than the other methods.
Interpretable back-end with self-attention mechanism
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention.
Recall: Front-end and back-end framework
Input → Filter banks (time-frequency feature extraction) → Front-end (local feature extraction) → Back-end (temporal summarization & classification) → Output
What is happening in the back-end?
A CNN-based back-end cannot fully capture temporal properties: stacked convolutions only model loose locality, so long-term relationships across time are hard to capture.
[ISMIR 2016] "Automatic tagging using deep convolutional neural networks.", Keunwoo Choi, George Fazekas, Mark Sandler
A self-attention back-end is used to capture long-term relationships.
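A rough sketch of a self-attention back-end: the front-end's feature map is treated as a sequence over time and fed through standard Transformer encoder layers. Sizes are illustrative, the paper's exact configuration is not reproduced here, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn

class SelfAttentionBackend(nn.Module):
    """Summarize front-end features over time with self-attention instead of convolutions."""
    def __init__(self, feat_dim=128, n_tags=50, n_layers=2, n_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.classifier = nn.Linear(feat_dim, n_tags)

    def forward(self, features):              # features: (batch, n_frames, feat_dim)
        h = self.encoder(features)            # every frame attends to every other frame
        return torch.sigmoid(self.classifier(h.mean(dim=1)))

# Front-end output: e.g. 64 time steps of 128-dim local features.
print(SelfAttentionBackend()(torch.randn(2, 64, 128)).shape)  # (2, 50)
```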
Experiments (music tagging)
Self-attention back-end for better interpretability
Attention score visualization.
My network says it has a “quiet” tag. But why?
Back-end as “sound” detector: Positive tags case
Examples: "Piano", "Drum"
Back-end as “sound” detector: Negative tags case
Examples: "No vocal", "Quiet"
Tag-wise heatmap contribution
Set the selected attention weight to 1 and the others to 0, then observe how each tag's confidence changes.
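A minimal sketch of this manipulation, assuming a hypothetical model interface that accepts an attention-weight override (the `attention_weights` argument and the model itself are placeholders, not the paper's actual API):

```python
import torch

def tagwise_contribution(model, features, position, tag_index):
    """Confidence of one tag when all attention mass is forced onto a single time step.

    `model` is assumed to accept an `attention_weights` override of shape (n_frames,);
    this interface is a placeholder for whatever hook the real implementation exposes.
    """
    n_frames = features.shape[1]
    forced = torch.zeros(n_frames)
    forced[position] = 1.0                      # selected weight -> 1, all others -> 0
    with torch.no_grad():
        tag_probs = model(features, attention_weights=forced)
    return tag_probs[0, tag_index].item()
```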
Example: confidence of "Quiet" vs. confidence of "Loud"
Example: confidence of "Piano" vs. confidence of "Flute"
Conclusion
- Powerful front-end with Harmonic CNN
- Interpretable back-end with self-attention
References
- [ISMIR 2019 Late-Breaking Demo] Automatic Music Tagging with Harmonic CNN. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra.
- [ICASSP 2020] Data-driven Harmonic Filters for Audio Representation Learning. Minz Won, Sanghyuk Chun, Oriol Nieto, Xavier Serra.
- [SMC 2020] Evaluation of CNN-based Automatic Music Tagging Models. Minz Won, Andres Ferraro, Dmitry Bogdanov, Xavier Serra.
- [ICML 2019 Workshop] Visualizing and Understanding Self-attention based Music Tagging. Minz Won, Sanghyuk Chun, Xavier Serra.
- [ArXiv 2019] Toward Interpretable Music Tagging with Self-attention. Minz Won, Sanghyuk Chun, Xavier Serra.