GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
CNN and Musical Applications
Juhan Nam

Motivation
○ Sensory data (image or audio) have high dimensionality
Motivation
○ Image: 256 x 256 pixels (commonly used size after crop and resize)
■ The average image resolution on ImageNet is 469x387 pixels
○ Audio: 128 mel bins x 128 frames (commonly used 3 sec mel-spectrogram)
■ 44,100 or 22,050 samples/sec
○ If the hidden layer size is 256 for 256x256 images, the number of parameters is 256 x 256 x 3 (RGB) x 256 (hidden layer size) = 50M!
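The slide's parameter count can be checked directly; a minimal sketch of the arithmetic for a single fully-connected hidden layer on a flattened RGB image:

```python
# A single fully-connected hidden layer of size 256 on a 256x256 RGB image
# already needs ~50M weights (biases excluded for simplicity).
input_dim = 256 * 256 * 3      # flattened RGB image
hidden_dim = 256
n_params = input_dim * hidden_dim
print(f"{n_params:,}")         # 50,331,648, i.e. ~50M
```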
Locality and Translation Invariance
○ Important parts of the object structures are locally correlated
Incorporating Locality
○ Each hidden unit is connected to a local area (receptive field)
○ Different hidden units are connected to different locations (feature map)
Source: NIPS 2017 Tutorial, Deep Learning: Practice and Trend
Incorporating Translation Invariance
○ Hidden units at different locations share the same weights (weight sharing)
○ Convolutional layer: a locally-connected layer with weight sharing
○ The weights are invariant to the location, and the output is equivariant to translation
Convolutional Neural Network (CNN)
○ Local filters (or weights) are convolved with the input or hidden layers and return feature maps
○ The feature maps are sub-sampled (or pooled) to reduce the dimensionality
[Figure] LeNet-5 (LeCun, 1998): INPUT 32x32 → convolutions → C1: feature maps 6@28x28 → subsampling → S2: f. maps 6@14x14 → convolutions → C3: f. maps 16@10x10 → subsampling → S4: f. maps 16@5x5 → C5: layer 120 (full connection) → F6: layer 84 (full connection) → OUTPUT 10 (Gaussian connections)
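The LeNet-5 pipeline above can be sketched in a few lines of PyTorch. This is a minimal modern re-implementation, not the original: tanh/average-pooling follow the 1998 design, but the Gaussian connections are replaced by a plain linear output layer, as is common practice.

```python
import torch
import torch.nn as nn

# Minimal LeNet-5 sketch following the slide's layer sizes.
lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, 5), nn.Tanh(),           # C1: 6@28x28
    nn.AvgPool2d(2),                          # S2: 6@14x14
    nn.Conv2d(6, 16, 5), nn.Tanh(),           # C3: 16@10x10
    nn.AvgPool2d(2),                          # S4: 16@5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),    # C5: 120
    nn.Linear(120, 84), nn.Tanh(),            # F6: 84
    nn.Linear(84, 10),                        # OUTPUT: 10
)

x = torch.randn(1, 1, 32, 32)                 # INPUT: 32x32 grayscale
print(lenet5(x).shape)                        # torch.Size([1, 10])
```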
Convolutional Neural Network (CNN)
○ AlexNet: CNN with more convolution and max-pooling layers
○ ReLU (fast and non-saturating), dropout (regularization)
○ Trained with 2 GPUs "directly" on 1.2M images during one week
○ ImageNet challenge: top-5 error of 15.3% (more than 10% lower than the second place)
ImageNet classification with deep convolutional neural networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, 2012
ImageNet Challenge
Deep Learning Breakthrough
Convolution Mechanics
○ The input is a 3D tensor: width (W), height (H) and depth (channel)
○ Channel (C): R, G, B for color images
○ 2D convolution: the filter must have the same depth as the input; each filter produces one 2D feature map of hidden units
○ The channel dimension of the output corresponds to the number of filters!
○ The input data become a 4D tensor when a batch or mini-batch is used → N (example) x C x W x H
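The shape rules above are easy to verify in PyTorch (the 3 input channels, 64 filters, and 32x32 size below are illustrative numbers):

```python
import torch
import torch.nn as nn

# The filter depth must match the input channel count (3 for RGB), and the
# output channel count equals the number of filters (here 64).
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3)

x = torch.randn(8, 3, 32, 32)   # N x C x H x W mini-batch
y = conv(x)
print(y.shape)                  # torch.Size([8, 64, 30, 30])
```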
Convolution Mechanics
[Figure] Convolution of a 3x3 filter over an input feature map, in four settings:
○ No padding, no striding
○ No padding, stride size = 2
○ Pad size = 1, no striding
○ Pad size = 1, stride size = 2
Source: https://github.com/vdumoulin/conv_arithmetic
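The four cases above all follow the standard output-size formula; a minimal sketch (assuming a 5x5 input, which is not stated on the slide):

```python
def conv_out(size, filt=3, stride=1, pad=0):
    # standard convolution output size: floor((size + 2*pad - filt) / stride) + 1
    return (size + 2 * pad - filt) // stride + 1

print(conv_out(5))                    # no padding, no striding -> 3
print(conv_out(5, stride=2))          # no padding, stride 2    -> 2
print(conv_out(5, pad=1))             # pad 1, no striding      -> 5 (same size)
print(conv_out(5, pad=1, stride=2))   # pad 1, stride 2         -> 3
```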
Sub-Sampling (or Pooling)
○ Max-pooling: the most popular choice
○ Average pooling, standard deviation pooling, L^p (power-average) pooling
[Figure] 2x2 max pooling with stride 2:
input [[1 5 2 4], [2 3 9 1], [5 3 3 4], [7 8 2 2]] → output [[5 9], [8 4]]
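The slide's max-pooling example can be reproduced directly with numpy:

```python
import numpy as np

x = np.array([[1, 5, 2, 4],
              [2, 3, 9, 1],
              [5, 3, 3, 4],
              [7, 8, 2, 2]])

# 2x2 max pooling with stride 2: take the max of each non-overlapping
# 2x2 block (reshape groups rows/cols into pairs, then reduce).
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[5 9]
                #  [8 4]]
```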
CNN Architecture for Image Classification
○ VGGNet: small filter size (3x3)
○ GoogLeNet: Inception module with multiple parallel convolution layers; 1x1 filters reduce the depth (significantly reducing parameters)
○ ResNet: skip connections between conv blocks for better gradient flow
[Figure] One block of VGGNet (3x3 convs, depth 64/256) and the 34-layer residual ResNet: 7x7 conv 64 /2, pool /2, then stacks of 3x3 convs at depths 64, 128, 256 and 512 (down-sampled by stride 2 between stacks), average pooling, fc 1000
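The skip connection mentioned above can be illustrated with a minimal residual-block sketch in PyTorch (a simplification of the real ResNet block, assuming equal input/output channels and no down-sampling):

```python
import torch
import torch.nn as nn

# Minimal ResNet-style block: two 3x3 convs with batch norm, plus an
# identity skip connection added before the final ReLU.
class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)   # skip connection: better gradient flow

x = torch.randn(2, 64, 56, 56)
y = ResBlock(64)(x)
print(y.shape)                       # torch.Size([2, 64, 56, 56])
```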
Classification-based MIR Tasks Using CNN
○ Music genre/mood classification and auto-tagging
○ Music recommendation
○ Onset detection
○ Musical instrument recognition
○ Singing voice detection (the output is usually predicted at the frame level)
○ Pitch estimation
○ Multiple F0 estimation
[Figure] Example outputs: quantized pitch contour, "piano", "singing voice", "soft rock"
Issues: Use of Domain-Specific or Task-Specific knowledge
○ Waveform
○ Spectrogram
○ Mel-spectrogram
○ Constant-Q transform
○ Mel-spectrogram and constant-Q transform are most popularly used in practice!
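A minimal numpy sketch of the first step toward these representations, the magnitude spectrogram (in practice a library such as librosa is used, and a mel filterbank is applied on top to get a mel-spectrogram; the 440 Hz test tone and frame sizes here are illustrative):

```python
import numpy as np

def stft_mag(y, n_fft=1024, hop=512):
    # frame the signal, apply a Hann window, take the magnitude FFT
    win = np.hanning(n_fft)
    n_frames = 1 + (len(y) - n_fft) // hop
    frames = np.stack([y[i * hop : i * hop + n_fft] * win
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)).T   # (n_fft//2 + 1, n_frames)

sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)        # 1 second of A4 (440 Hz)
S = stft_mag(y)
log_S = np.log1p(S)                    # log compression
print(S.shape)                         # (513, 42): freq bins x frames
```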
Issues: Use of Domain-Specific or Task-Specific knowledge
○ One-hot: multi-class, single-label classification
○ Multi-hot: multi-class, multi-label classification (e.g., music auto-tagging)
○ Blurred one-hot/multi-hot (e.g., pitch estimation)
○ The last layer is set to the sigmoid function instead of the softmax function
○ The loss function is defined as the cross-entropy between the sigmoid output σ(x) = 1 / (1 + e^(-x)) and the ground-truth (e.g. [1, 0, 1, 1, ..., 0]):

loss = − Σ_x p(x) log q(x)
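A minimal numpy sketch of this multi-label setup (the logits and tag vector below are made-up illustrative values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_bce(logits, targets):
    # cross-entropy between sigmoid output and multi-hot ground-truth,
    # summed over tags and averaged over examples
    q = sigmoid(logits)
    return -np.mean(np.sum(targets * np.log(q)
                           + (1 - targets) * np.log(1 - q), axis=-1))

logits = np.array([[2.0, -1.0, 0.5]])    # one example, three tags
targets = np.array([[1.0, 0.0, 1.0]])    # multi-hot ground-truth
loss = multilabel_bce(logits, targets)
print(round(float(loss), 4))             # 0.9143
```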
Issues: Use of Domain-Specific or Task-Specific knowledge
○ 1D convolution blocks with time-frequency representations
■ The receptive field of the first conv layer covers the entire frequency range
■ The 1D feature maps significantly reduce the number of parameters
■ Fast to train
■ Works well for small datasets
■ Not invariant to pitch shifting: key transposition changes the feature maps
[Figure] Frequency bins form the channel (depth) dimension; convolution and pooling run along time
○ 2D convolution blocks with time-frequency representation
■ Regards the time-frequency representation as an image
■ The filter slides over both time and frequency
■ The filter size can be small (smaller is more flexible, e.g., 3x3), horizontally long (to capture temporal patterns), or vertically long (to capture timbre)
■ Relatively invariant to pitch shifting, assuming that the input is a log-scaled spectrogram
■ The 2D feature maps significantly increase the number of parameters and accordingly require more computational resources
■ The most common architecture in music research
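A rough illustration of the 1D vs. 2D trade-off above, using hypothetical layer sizes (64 filters, 3-wide kernels, a 128-mel x 128-frame input):

```python
def conv_out(size, filt, stride=1, pad=0):
    return (size + 2 * pad - filt) // stride + 1

n_mels, n_frames, n_filters = 128, 128, 64

# 1D conv: the first-layer filter spans all 128 mel bins, so each
# filter yields a 1 x T feature map
t = conv_out(n_frames, 3, pad=1)
act_1d = n_filters * 1 * t

# 2D conv: a 3x3 filter slides over both axes, so each filter yields
# an F x T feature map -> far more activations to compute and store
f = conv_out(n_mels, 3, pad=1)
act_2d = n_filters * f * t

print(act_1d, act_2d)   # 8192 1048576
```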
Issues: Use of Domain-Specific or Task-Specific knowledge
○ 1D convolution blocks with raw waveforms
■ The filter size of the first conv layer can range from frame-level (e.g., 1024 samples) to sample-level (e.g., 3 samples); smaller is more flexible
■ End-to-end learning model
■ No need to tune STFT and log-scale parameters
■ No need to store the preprocessed spectrogram
■ May require more parameters and memory (slow to train)
[Figure] Filters form the channel (depth) dimension; convolution and pooling run along time
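The frame-level vs. sample-level first layers can be sketched with PyTorch 1D convolutions (the 128 filters and strides below are illustrative choices, not taken from a specific paper):

```python
import torch
import torch.nn as nn

# Frame-level: 1024-sample filters with 512-sample hop, roughly playing
# the role of an STFT window/hop. Sample-level: tiny 3-sample filters.
frame_level = nn.Conv1d(1, 128, kernel_size=1024, stride=512)
sample_level = nn.Conv1d(1, 128, kernel_size=3, stride=3)

x = torch.randn(1, 1, 22050)       # 1 second of audio at 22,050 Hz
y_frame = frame_level(x)
y_sample = sample_level(x)
print(y_frame.shape)               # torch.Size([1, 128, 42])
print(y_sample.shape)              # torch.Size([1, 128, 7350])
```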
Issues: Use of Domain-Specific or Task-Specific knowledge
○ Pooling size
■ Large pooling to reduce the temporal dimensionality (semantic-level tasks)
■ No temporal pooling, to make the output size equal to the input size in time (frame-level tasks)
○ Consider the receptive field of the last hidden layer
■ Related to the input size of the CNN (context window)
■ Temporal pooling of the local predictions from the sliding input
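The receptive field of the last hidden layer can be computed with the standard recurrence over the conv/pool stack; a minimal sketch with a hypothetical stack of three 3-wide convs, each followed by stride-2 pooling:

```python
def receptive_field(layers):
    # standard recurrence: rf grows by (filt - 1) * jump at each layer,
    # and the jump (effective stride) multiplies by the layer's stride
    rf, jump = 1, 1
    for filt, stride in layers:
        rf += (filt - 1) * jump
        jump *= stride
    return rf

# (filter size, stride) pairs: conv 3 / pool 2, repeated three times
layers = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field(layers))   # 22 input frames per output unit
```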
Content-based Music Recommendation
○ Collaborative filtering: recommend music (or other items) based on mutual user history
○ Use matrix factorization of the listening history
○ Cold-start problem: new items cannot be recommended
Person A: I like songs A, B, C and D.
Person B: I like songs A, B, C and E.
Person A: Really? You should check out song D.
Person B: Wow, you also should check out song E.
[Figure] The users x songs matrix is factorized into user latent vectors and song latent vectors
Song preference: p_us = x_u^T y_s (x_u: user latent vector, y_s: song latent vector)
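The preference score is just an inner product of the two latent vectors; a minimal sketch with made-up 3-dimensional vectors:

```python
import numpy as np

# Song preference as an inner product: p_us = x_u^T y_s
x_u = np.array([0.8, -0.1, 0.3])   # user latent vector (illustrative)
y_s = np.array([0.7,  0.2, 0.5])   # song latent vector (illustrative)

p_us = x_u @ y_s
print(round(float(p_us), 2))       # 0.69
```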
Content-based Music Recommendation
○ Overcome the cold-start problem by predicting the song latent vector from the audio
○ A regression problem that minimizes the MSE
○ Mel-spectrogram input; filters of 128 (mel bins) x 4 (frames)
○ Global pooling with mean, max and L2: different weights on the feature map
Deep content-based music recommendation, Aaron van den Oord, Sander Dieleman, Benjamin Schrauwen, 2013
http://benanne.github.io/2014/08/05/spotify-cnns.html
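The global-pooling step can be sketched in numpy (the 256-dim x 120-step feature map below is an illustrative shape, not taken from the paper):

```python
import numpy as np

# Global pooling over time with mean, max, and L2 statistics,
# concatenated into one song-level feature vector.
feat = np.random.rand(256, 120)   # feature map: 256 dims x 120 time steps

pooled = np.concatenate([
    feat.mean(axis=1),                    # mean pooling
    feat.max(axis=1),                     # max pooling
    np.sqrt((feat ** 2).mean(axis=1)),    # L2 (power-average) pooling
])
print(pooled.shape)   # (768,)
```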
Music Auto-Tagging
○ Tags include genre, mood, instrument and years
○ Can be used for content-based filtering (e.g., Pandora's Music Genome Project)
Music Auto-Tagging
○ 96-bin mel-spectrogram over 30 seconds (1366 frames)
○ VGGNet style: uses 3x3 filters
○ Uses large max-pooling sizes in time to reduce the temporal dimensionality
○ The last layer has the sigmoid function for multi-label classification
[Figure] FCN-4:
Mel-spectrogram (input: 96×1366×1)
Conv 3×3×128, MP (2, 4) (output: 48×341×128)
Conv 3×3×384, MP (4, 5) (output: 24×85×384)
Conv 3×3×768, MP (3, 8) (output: 12×21×768)
Conv 3×3×2048, MP (4, 8) (output: 1×1×2048)
Output 50×1 (sigmoid)
Automatic Tagging using Deep Convolutional Neural Networks, Keunwoo Choi, George Fazekas, Mark Sandler, 2016
○ Progressively reduce the filter size and stride size: these correspond to the window and hop sizes in the STFT
○ Deeper models with shorter filters work better: the best model has a filter size of 3 samples
Music Auto-Tagging
Sample-level Deep Convolutional Neural Networks for Music Auto-Tagging Using Raw Waveforms, Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, Juhan Nam, 2017
Onset Detection
○ Analogous to “edge detector” in computer vision
○ Input: three 80-bin mel spectrograms with different window sizes
○ Output: a binary output that predicts the onset at the center of the input frames
○ Filter size of the first conv layer: wide in time and narrow in frequency (7x3)
Improved Musical Onset Detection with Convolutional Neural Networks, Jan Schlüter and Sebastian Böck, 2014
[Figure] Network: 3 input channels (15x80) → convolve (7x3) → 10 feature maps (9x78) → max-pool (1x3) → 10 feature maps (9x26) → convolve (3x3) → 20 feature maps (7x24) → max-pool (1x3) → 20 feature maps (7x8) → fully connected (256 sigmoid units) → fully connected → sigmoid output unit. Each first-layer filter kernel consists of three 7x3 blocks for the three input spectrograms (mid, short, long).
Singing Voice Detection
○ Input: 80-bin mel spectrogram x 115 frames (1.6 sec)
○ Output: a binary output that predicts the voice activity
○ Data augmentation: dropout and Gaussian noise; pitch shifting and time stretching; loudness and random frequency filters
Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks, Jan Schlüter and Thomas Grill, 2015
Need a wide context to extract singing voice features (e.g., vibrato). Also, see SpecAugment (https://arxiv.org/abs/1904.08779)
Musical Instrument Recognition
○ Image: visual features are locally correlated
○ Audio spectrogram (CQT): spectral components are distributed apart, at harmonic positions
○ Design an octave-interval filter instead of 2D "patch" filters to extract features at low frequencies (similar to a "Shepard" pitch spiral)
Deep convolutional networks on the pitch spiral for musical instrument recognition, Vincent Lostanlen, Carmine-Emanuele Cella, 2016
[Figure] 1-D, 2-D, and spiral filters on the time/log-frequency plane
Monophonic Pitch Estimation
○ Frame-level pitch labels
○ 16 kHz sampling rate: resampling to 16 kHz is a pre-processing step
○ 1024 samples of raw waveform input: a typical size of one "audio frame"
○ 360-dimensional softmax output: the pitch is quantized with a resolution of 20 cents (100 cents is one semitone)
CREPE: A Convolutional Representation for Pitch Estimation Jong Wook Kim, Justin Salamon, Peter Li, Juan P. Bello 2018
Monophonic Pitch Estimation
○ The first layer is big: 1024 filters (size 512), stride of 4, max-pooling of 2
○ Adam optimizer, batch norm
○ The one-hot target vector is blurred by a Gaussian filter
○ This softens the penalty for near-correct predictions
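The blurred one-hot target can be sketched in numpy; the 25-cent standard deviation below is an illustrative choice, not necessarily the paper's value:

```python
import numpy as np

# Blur a one-hot pitch target with a Gaussian over the 360 pitch bins
# (20-cent resolution), so near-correct predictions are penalized less.
n_bins, true_bin = 360, 180
cents = np.arange(n_bins) * 20.0        # bin centers in cents
target = np.exp(-(cents - cents[true_bin]) ** 2 / (2 * 25.0 ** 2))
print(int(target.argmax()), round(float(target[true_bin]), 2))   # 180 1.0
```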
Multiple-F0 Estimation
○ Frame-level pitch activations in the time and pitch space
○ CQT with 60 bins per octave
○ Multiple CQTs with harmonic relations (0.5, 1, 2, 3, 4, 5)
○ Filters learn the relative weights of the harmonics
○ 3D input (time x frequency x harmonics): similar to RGB channels in images, but deeper
Deep salience representations for F0 estimation in polyphonic music Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, Juan P. Bello 2017
Multiple-F0 Estimation
○ 5x5 filter: spans 1 semitone; 70x3 filter: spans one octave
○ The last layer has a sigmoid output
■ The loss is the cross-entropy between the sigmoid output and the ground-truth
○ ReLU, batch norm, Adam optimizer
○ The input and output have the same dimensionality: no pooling layers