SLIDE 1

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

CNN and Musical Applications

Juhan Nam

SLIDE 2

Motivation

  • Sensory data (images or audio) are high-dimensional

○ Image: 256 x 256 pixels (commonly used size after crop and resize)

■ The average image resolution on ImageNet is 469 x 387 pixels

○ Audio: 128 mel bins x 128 frames (commonly used 3 sec mel-spectrogram)

■ 44,100 or 22,050 samples/sec

  • The fully-connected layer requires a large weight matrix

○ If the hidden layer size is 256 for 256x256 images, the number of parameters is 256 x 256 x 3 (RGB) x 256 (hidden layer size) ≈ 50M!

  • Can we reduce the number of parameters?
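This arithmetic can be checked directly (a minimal sketch; biases are ignored):

```python
# Weight count of one fully-connected hidden layer on a
# 256 x 256 RGB image (biases ignored for simplicity).
input_dim = 256 * 256 * 3    # flattened image: 196,608 values
hidden_dim = 256             # hidden layer size from the slide

print(f"{input_dim * hidden_dim:,}")   # 50,331,648, roughly 50M
```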
SLIDE 3

Locality and Translation Invariance

  • Locality: the objects of interest tend to have local spatial support

○ Important parts of object structures are locally correlated

  • Translation invariance: object appearance is independent of location
SLIDE 5

Incorporating Locality

  • Change the fully-connected layer to a locally-connected layer

○ Each hidden unit is connected to a local area (receptive field)
○ Different hidden units are connected to different locations (feature map)

Source: NIPS 2017 Tutorial, Deep Learning: Practice and Trends

SLIDE 6

Incorporating Translation Invariance

  • Make the hidden units connected to different locations have the same weights (weight sharing)

○ Convolutional layer: a locally-connected layer with weight sharing
○ The weights are invariant to location, and the output is translation-equivariant

Source: NIPS 2017 Tutorial, Deep Learning: Practice and Trends
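To see what weight sharing buys, here is a minimal PyTorch sketch comparing the two layer types; the 3x3 filter size is an assumption, not from the slide:

```python
import torch.nn as nn

# Fully-connected layer on a flattened 256x256 RGB image.
fc = nn.Linear(256 * 256 * 3, 256)

# Convolutional layer with 256 feature maps and a (hypothetical) 3x3
# filter: the same weights are reused at every spatial location.
conv = nn.Conv2d(in_channels=3, out_channels=256, kernel_size=3)

print(sum(p.numel() for p in fc.parameters()))    # 50,331,904 (incl. biases)
print(sum(p.numel() for p in conv.parameters()))  # 7,168
```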

SLIDE 7

Convolutional Neural Network (CNN)

  • Consists of convolution layers and subsampling (or pooling) layers

○ Local filters (or weights) are convolved with the input or hidden layers and return feature maps
○ The feature maps are sub-sampled (or pooled) to reduce the dimensionality

Figure: LeNet-5 (LeCun, 1998): INPUT 32x32 → convolutions → C1: feature maps 6@28x28 → subsampling → S2: f. maps 6@14x14 → convolutions → C3: f. maps 16@10x10 → subsampling → S4: f. maps 16@5x5 → full connection → C5: layer 120 → F6: layer 84 → Gaussian connections → OUTPUT 10

SLIDE 8

Convolutional Neural Network (CNN)

  • The breakthrough in image classification (2012)

○ CNN with more convolution and max-pooling layers
○ ReLU (fast and non-saturating), dropout (regularization)
○ Trained with 2 GPUs “directly” on 1.2M images for one week
○ ImageNet challenge: top-5 error of 15.3% (more than 10 percentage points lower than the second-place entry)

ImageNet classification with deep convolutional neural networks, Alex Krizhevsky, Ilya Sutskever, Geoffrey E Hinton, 2012

SLIDE 9

ImageNet Challenge

  • CNN models have become deeper and deeper
  • They now surpass human recognition performance

Figure: ImageNet challenge error rates by year, marking the deep learning breakthrough.

SLIDE 10

Convolution Mechanics

  • Image input and feature map

○ 3D tensor: width (W), height (H), and depth (channel)
○ Channel (C): R, G, B

Figure: 2D convolution. A filter slides over the width and height of the input volume (width x height x channel). The filter must have the same depth as the input. Each output channel corresponds to one filter, so the channel dimension of the feature map equals the number of filters. When a batch or mini-batch is used, the input becomes a 4D tensor: N (examples) x C x W x H.
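These shapes can be verified directly (a minimal PyTorch sketch; note that PyTorch orders the dimensions as N x C x H x W):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 256, 256)      # mini-batch: N x C x H x W
conv = nn.Conv2d(in_channels=3,      # must equal the input depth
                 out_channels=16,    # number of filters
                 kernel_size=5)

print(conv(x).shape)  # torch.Size([8, 16, 252, 252]): channels = #filters
```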

SLIDE 11

Convolution Mechanics

  • Stride: sliding with hopping (equivalent to the hop size in the STFT)
  • Padding: adjust the feature-map size by zero-padding the border of the input

Figure: 3 x 3 filter in four cases: no padding/no striding, no padding/stride 2, pad 1/no striding, pad 1/stride 2. Source: https://github.com/vdumoulin/conv_arithmetic
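The resulting feature-map size follows the usual arithmetic, output = floor((input - filter + 2·padding) / stride) + 1 (a small helper; the input size of 5 is an arbitrary example):

```python
import math

def conv_out(size, filt, pad=0, stride=1):
    """Output size of a convolution along one dimension."""
    return math.floor((size - filt + 2 * pad) / stride) + 1

# 3x3 filter, as in the figure:
print(conv_out(5, 3))                    # 3  (no padding, no striding)
print(conv_out(5, 3, stride=2))          # 2  (no padding, stride 2)
print(conv_out(5, 3, pad=1))             # 5  (pad 1 keeps the size)
print(conv_out(5, 3, pad=1, stride=2))   # 3  (pad 1, stride 2)
```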

SLIDE 12

Sub-Sampling (or Pooling)

  • Down-size the feature map by summarizing the local features
  • Types

○ Max-pooling: the most popular choice
○ Average pooling, standard deviation pooling, L^p (power-average) pooling

Example: 2 x 2 max pooling with stride 2 on the input

1 5 2 4 2
3 9 1 5 3
3 4 7 8 2
2 5 9 8 4
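The grid above can be pooled directly (a minimal PyTorch sketch; the rightmost column is dropped because it does not fill a 2 x 2 window):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[1., 5., 2., 4., 2.],
                  [3., 9., 1., 5., 3.],
                  [3., 4., 7., 8., 2.],
                  [2., 5., 9., 8., 4.]]).reshape(1, 1, 4, 5)

y = F.max_pool2d(x, kernel_size=2, stride=2)
print(y.squeeze())  # tensor([[9., 5.], [5., 9.]])
```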

SLIDE 13

CNN Architecture for Image Classification

  • VGGNet (flexibility)

○ Small filter size (3x3)

  • GoogLeNet (efficiency)

○ Inception module: multiple parallel convolution layers
○ 1x1 filter: reduces the depth (significantly reduces parameters)

  • ResNet (deep and high performance)

○ Add skip connections between conv blocks: better gradient flow

  • 1x1 Filter

Figure: a 1x1 convolution with 64 filters reducing the depth from 256 to 64, shown alongside one block of VGGNet.

Figure: the 34-layer residual ResNet: a 7x7 conv (stride 2) and pooling, followed by stacked 3x3 conv layers at depths 64, 128, 256, and 512 with stride-2 downsampling between stages, ending with average pooling and a 1000-way fully-connected layer.
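To make the 1x1 bullet concrete, here is a hypothetical bottleneck comparison in PyTorch (the 256 and 64 depths are from the figure; everything else is an assumption):

```python
import torch.nn as nn

# Direct 3x3 convolution at depth 256:
direct = nn.Conv2d(256, 256, kernel_size=3, padding=1)

# Bottleneck: 1x1 reduces depth to 64, 3x3 works at the reduced
# depth, and a final 1x1 restores depth 256.
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.Conv2d(64, 256, kernel_size=1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(direct))      # 590,080
print(count(bottleneck))  # 70,016
```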

SLIDE 14

Classification-based MIR Tasks Using CNN

  • Semantic-Level (long segment)

○ Music genre/mood classification and auto-tagging
○ Music recommendation

  • Event-Level (note, beat or phrase)

○ Onset detection
○ Musical instrument recognition
○ Singing voice detection (the output is usually predicted at the frame level)

  • Frame-Level (single audio frame)

○ Pitch estimation
○ Multiple F0 estimation

Figure: example outputs: “soft rock” (semantic level), “piano” and “singing voice” (event level), a quantized pitch contour (frame level).

SLIDE 15

Issues: Use of Domain-Specific or Task-Specific knowledge

  • Input audio representation

○ Waveform
○ Spectrogram
○ Mel-spectrogram
○ Constant-Q transform

  • The mel-spectrogram and constant-Q transform are the most popular in practice!

SLIDE 16

Issues: Use of Domain-Specific or Task-Specific knowledge

  • Output representation: highly task-specific

○ One-hot: multi-class, single-label classification
○ Multi-hot: multi-class, multi-label classification (e.g., music auto-tagging)
○ Blurred one-hot/multi-hot (e.g., pitch estimation)

  • Multi-label classification

○ The last layer uses the sigmoid function instead of the softmax function
○ The loss function is defined as the cross-entropy between the sigmoid output $p(y) = \frac{1}{1 + e^{-x}}$ and the ground truth $q(y)$ (e.g., [1, 0, 1, 1, ..., 0]):

$\mathrm{loss} = -\sum_{y} q(y) \log p(y)$
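In PyTorch this is typically implemented with a fused sigmoid/cross-entropy loss (a minimal sketch; the 50-tag output size is borrowed from the auto-tagging example later in the deck):

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 50)                     # batch of 4, 50 tags
targets = torch.randint(0, 2, (4, 50)).float()  # multi-hot ground truth

# BCEWithLogitsLoss fuses the sigmoid 1/(1+e^{-x}) with the
# cross-entropy against the multi-hot target.
loss = nn.BCEWithLogitsLoss()(logits, targets)
```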

SLIDE 17

Issues: Use of Domain-Specific or Task-Specific knowledge

  • CNN architecture design

○ 1D convolution blocks with a time-frequency representation

■ The receptive field of the first conv layer covers the entire frequency range
■ The 1D feature maps significantly reduce the number of parameters compared to 2D feature maps
■ Fast to train
■ Works well for small datasets
■ Not invariant to pitch shifting: a key transpose changes the feature maps
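A minimal sketch of one such 1D block, assuming a 128-bin mel-spectrogram input; the channel and filter sizes are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 128, 128)   # batch x mel bins x frames

# The filter implicitly spans all 128 mel bins, so the
# convolution slides along time only.
block = nn.Sequential(
    nn.Conv1d(in_channels=128, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(2),
)
print(block(x).shape)  # torch.Size([8, 64, 64])
```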

SLIDE 18
Issues: Use of Domain-Specific or Task-Specific knowledge

  • CNN architecture design

○ 2D convolution blocks with a time-frequency representation

■ Regards the time-frequency representation as an image
■ The filter slides over both time and frequency
■ The filter size can be small (smaller is more flexible, e.g., 3x3), horizontally long (to capture temporal patterns), vertically long (to capture timbre), or fit to a certain input unit (one semitone)
■ Relatively invariant to pitch shifting, assuming the input is a log-scaled spectrogram
■ The 2D feature maps significantly increase the number of parameters and accordingly demand more computational resources
■ The most common architecture in music research

SLIDE 19

Issues: Use of Domain-Specific or Task-Specific knowledge

  • CNN architecture design

○ 1D convolution blocks with raw waveforms

■ The filter size of the first conv layer can range from frame-level (e.g., 1024 samples) to sample-level (e.g., 3 samples); smaller is more flexible
■ End-to-end learning model
■ No need to tune STFT and log-scale parameters
■ No need to store preprocessed spectrograms
■ May require more parameters and memory (slow to train)
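A minimal sketch of the two extremes of this design space (the stride values are assumptions, chosen to mirror typical window/hop settings):

```python
import torch.nn as nn

# Frame-level front end: the first filter covers ~one STFT window.
frame_level = nn.Conv1d(1, 128, kernel_size=1024, stride=512)

# Sample-level front end: tiny filters, as in the sample-level CNN
# discussed later; depth replaces the long first filter.
sample_level = nn.Conv1d(1, 128, kernel_size=3, stride=3)
```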

SLIDE 20

Issues: Use of Domain-Specific or Task-Specific knowledge

  • CNN architecture design

○ Pooling size

■ Large pooling to reduce the temporal dimensionality (semantic-level tasks)
■ No temporal pooling, to keep the output size equal to the input size in time (frame-level tasks)

○ Consider the receptive field of the last hidden layer (see the sketch below)

■ Related to the input size of the CNN (context window)
■ Temporal pooling of the local predictions from the sliding input
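The receptive field of a stack of layers can be computed with the standard recursion r ← r + (k - 1)·j, j ← j·s (a small helper; the three-layer example is hypothetical):

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) from input to output."""
    r, jump = 1, 1
    for k, s in layers:
        r += (k - 1) * jump   # grow by the filter extent at the current jump
        jump *= s             # striding widens the step between units
    return r

# e.g., two 3x3 convs with a 2x2 pooling in between (per axis):
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```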

SLIDE 21

Content-based Music Recommendation

  • Collaborative Filtering

○ Recommend music (or other items) based on mutual user history
○ Use matrix factorization of the listening history
○ Cold-start problem: new items cannot be recommended

Person A: I like songs A, B, C and D. Person B: I like songs A, B, C and E. Person A: Really? You should check out song D. Person B: Wow, you also should check out song E.

Figure: the users x songs matrix is factorized into a user latent vector $x_u$ and a song latent vector $y_s$; the song preference is predicted as $p_{us} = x_u^{\top} y_s$.
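The preference model is a plain inner product (a minimal NumPy sketch; the latent dimensionality of 40 is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x_u = rng.standard_normal(40)   # user latent vector
y_s = rng.standard_normal(40)   # song latent vector

p_us = x_u @ y_s                # predicted preference p_us = x_u^T y_s
```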

SLIDE 22

Content-based Music Recommendation

  • Predict the song-level latent vector from the audio

○ Overcomes the cold-start problem
○ A regression problem that minimizes the MSE

  • 1D CNN

○ Mel-spectrogram input; the filter covers 128 (mel bins) x 4 (frames)
○ Global pooling with mean, max, and L2: different weightings of the feature map

Deep content-based music recommendation, Aaron van den Oord, Sander Dieleman, Benjamin Schrauwen, 2013

http://benanne.github.io/2014/08/05/spotify-cnns.html
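A minimal sketch of that global pooling step, concatenating mean, max, and L2 statistics over time (tensor sizes are illustrative, not from the paper):

```python
import torch

h = torch.randn(8, 256, 190)     # batch x channels x time feature map

pooled = torch.cat([
    h.mean(dim=2),                    # average over time
    h.max(dim=2).values,              # max over time
    h.pow(2).mean(dim=2).sqrt(),      # L2 (RMS) over time
], dim=1)
print(pooled.shape)  # torch.Size([8, 768])
```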

SLIDE 23

Music Auto-Tagging

  • Predict a rich set of descriptive words

○ Genre, mood, instrument, years
○ Can be used for content-based filtering (e.g., Pandora’s Music Genome Project)

SLIDE 24

Music Auto-Tagging

  • Input

○ 96-bin mel-spectrogram, 30 seconds long (1366 frames)

  • 2D CNN

○ VGGNet style: 3x3 filters
○ A large max-pooling size in time reduces the temporal dimensionality
○ The last layer uses the sigmoid function for multi-label classification

FCN-4 architecture:
Mel-spectrogram (input: 96×1366×1)
Conv 3×3×128, MP (2, 4) (output: 48×341×128)
Conv 3×3×384, MP (4, 5) (output: 12×68×384)
Conv 3×3×768, MP (3, 8) (output: 4×8×768)
Conv 3×3×2048, MP (4, 8) (output: 1×1×2048)
Output 50×1 (sigmoid)

Automatic Tagging using Deep Convolutional Neural Networks, Keunwoo Choi, George Fazekas, Mark Sandler, 2016
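A minimal PyTorch sketch of this FCN-4-style stack, using the pooling sizes quoted above; the padding and batch-norm choices are assumptions:

```python
import torch.nn as nn

def block(c_in, c_out, pool):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(pool),          # (frequency, time) pooling
    )

fcn4 = nn.Sequential(                # input: N x 1 x 96 x 1366
    block(1, 128, (2, 4)),           # -> 128 x 48 x 341
    block(128, 384, (4, 5)),         # -> 384 x 12 x 68
    block(384, 768, (3, 8)),         # -> 768 x 4 x 8
    block(768, 2048, (4, 8)),        # -> 2048 x 1 x 1
    nn.Flatten(),
    nn.Linear(2048, 50),
    nn.Sigmoid(),                    # 50 tags, multi-label output
)
```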

SLIDE 25
Music Auto-Tagging

  • Investigated 1D convolution blocks with raw waveform input

○ Progressively reduce the filter size and stride size: these correspond to the window and hop size in the STFT
○ Deeper models with shorter filters work better: the best model has a filter size of 3 samples (a 1D VGGNet)

Sample-level Deep Convolutional Neural Networks for Music Auto-Tagging Using Raw Waveforms, Jongpil Lee, Jiyoung Park, Keunhyoung Luke Kim, Juhan Nam, 2017

SLIDE 26

Onset Detection

  • Predict the beginning time of note events

○ Analogous to an “edge detector” in computer vision

  • 2D CNN

○ Input: three 80-bin mel-spectrograms with different window sizes
○ Output: a binary output that predicts an onset at the center of the input frames
○ Filter size of the first conv layer: wide in time and narrow in frequency (7x3)

Improved Musical Onset Detection with Convolutional Neural Networks, Jan Schlüter and Sebastian Böck, 2014

Architecture: 3 input channels (15x80) → convolve (7x3) → 10 feature maps (9x78) → max-pool (1x3) → 10 feature maps (9x26) → convolve (3x3) → 20 feature maps (7x24) → max-pool (1x3) → 20 feature maps (7x8) → fully connected (256 sigmoid units) → fully connected → sigmoid output unit

Figure: the filter kernels consist of three 7x3 blocks, one per input spectrogram (mid, short, and long windows).
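A minimal PyTorch sketch reproducing the feature-map sizes listed above (valid convolutions, no padding; the ReLU placement is an assumption):

```python
import torch.nn as nn

onset_cnn = nn.Sequential(                  # input: N x 3 x 15 x 80
    nn.Conv2d(3, 10, kernel_size=(7, 3)),   # -> 10 x 9 x 78
    nn.ReLU(),
    nn.MaxPool2d((1, 3)),                   # -> 10 x 9 x 26
    nn.Conv2d(10, 20, kernel_size=(3, 3)),  # -> 20 x 7 x 24
    nn.ReLU(),
    nn.MaxPool2d((1, 3)),                   # -> 20 x 7 x 8
    nn.Flatten(),
    nn.Linear(20 * 7 * 8, 256),
    nn.Sigmoid(),                           # 256 sigmoid units
    nn.Linear(256, 1),
    nn.Sigmoid(),                           # onset probability at the center frame
)
```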

SLIDE 27

Singing Voice Detection

  • Predict the presence of singing voice at the frame level
  • 2D CNN with the VGGNet architecture

○ Input: 80-bin mel-spectrogram x 115 frames (1.6 sec)
○ Output: a binary output that predicts the voice at the center of the input frames

  • Focused on data augmentation (see the sketch below)

○ Dropout and Gaussian noise
○ Pitch shifting and time stretching
○ Loudness scaling and random frequency filters

Exploring Data Augmentation for Improved Singing Voice Detection with Neural Networks, Jan Schlüter and Thomas Grill, 2015

A wide context is needed to extract singing voice features (e.g., vibrato). Also see SpecAugment (https://arxiv.org/abs/1904.08779).
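A minimal sketch of the waveform-level augmentations, assuming librosa; the parameter ranges are illustrative, not those of the paper:

```python
import librosa
import numpy as np

y, sr = librosa.load(librosa.example('trumpet'))

# Pitch shift by up to +/- 2 semitones and mild time stretching.
y_ps = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2, 2))
y_ts = librosa.effects.time_stretch(y, rate=np.random.uniform(0.9, 1.1))

# Loudness perturbation as a simple random gain.
y_gain = y * np.random.uniform(0.5, 1.5)
```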

SLIDE 28

Musical Instrument Recognition

  • Proposes a “pitch spiral CNN”

○ Images: visual features are locally correlated
○ Audio spectrograms (CQT): spectral components are distributed far apart, at harmonic positions
○ Design an octave-interval filter instead of 2D “patch” filters to extract features at low frequencies (similar to a “Shepard” pitch spiral)

Deep Convolutional Networks on the Pitch Spiral for Musical Instrument Recognition, Vincent Lostanlen, Carmine-Emanuele Cella, 2016

Figure: 1-D, 2-D, and spiral filter layouts over time and log-frequency.

SLIDE 29

Monophonic Pitch Estimation

  • Predict the pitch from a monophonic sound source

○ Frame-level pitch labels

  • Input and output

○ 16 kHz sampling rate: resampling to 16 kHz is a pre-processing step
○ 1024-sample raw waveform input: a typical size of one “audio frame”
○ 360-dimensional softmax output: the pitch is quantized with a resolution of 20 cents (100 cents is one semitone)

CREPE: A Convolutional Representation for Pitch Estimation Jong Wook Kim, Justin Salamon, Peter Li, Juan P. Bello 2018

SLIDE 30

Monophonic Pitch Estimation

  • 1D conv layers on waveforms

○ The first layer is big: 1024 filters (size 512), stride 4, max-pooling 2
○ Adam optimizer, batch norm

  • The output labels are smoothed

○ The one-hot vector is blurred by a Gaussian filter (see the sketch below)
○ This softens the penalty for near-correct predictions

CREPE: A Convolutional Representation for Pitch Estimation Jong Wook Kim, Justin Salamon, Peter Li, Juan P. Bello 2018
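A minimal sketch of the label blurring; the Gaussian width is an assumption, not a value from the slide:

```python
import numpy as np

def blurred_target(true_bin, n_bins=360, sigma=1.25):
    """Gaussian-blurred one-hot vector over pitch bins.

    sigma is in bins (20 cents each); 1.25 bins = 25 cents is an
    assumption here, not a value quoted on the slide.
    """
    bins = np.arange(n_bins)
    target = np.exp(-0.5 * ((bins - true_bin) / sigma) ** 2)
    return target / target.sum()     # normalize to a distribution

y = blurred_target(180)   # probability mass concentrated near bin 180
```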

SLIDE 31

Multiple-F0 Estimation

  • Predict the pitch salience from multi-track instruments

○ Frame-level pitch activations in the time and pitch space

  • Input representation: harmonic constant-Q transform (HCQT)

○ CQT with 60 bins per octave
○ Multiple CQTs at harmonic multiples of the minimum frequency (0.5, 1, 2, 3, 4, 5); see the sketch below
○ Filters learn the relative weights of the harmonics
○ 3D input (time x frequency x harmonics): similar to RGB in images, but deeper

Deep salience representations for F0 estimation in polyphonic music Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, Juan P. Bello 2017
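A minimal sketch of building an HCQT by stacking CQTs at harmonic multiples of a base frequency (assuming librosa; C1 = 32.7 Hz as the base and the 6-octave range are assumptions):

```python
import librosa
import numpy as np

y, sr = librosa.load(librosa.example('trumpet'))
fmin, harmonics = 32.7, [0.5, 1, 2, 3, 4, 5]   # C1 as the base frequency

# One CQT per harmonic, stacked along a new "harmonic" axis,
# giving a 3D input analogous to RGB channels in images.
hcqt = np.stack([
    np.abs(librosa.cqt(y, sr=sr, fmin=fmin * h,
                       n_bins=360, bins_per_octave=60))
    for h in harmonics
])
print(hcqt.shape)   # (6, 360, n_frames)
```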

SLIDE 32

Multiple-F0 Estimation

  • 2D CNN

○ 5x5 filter: spans 1 semitone; 70x3 filter: spans one octave
○ The last layer has a sigmoid output

■ The loss is the cross-entropy between the sigmoid output and the ground truth

○ ReLU, batch norm, Adam optimizer
○ The input and output have the same dimensionality: no pooling layers

Deep salience representations for F0 estimation in polyphonic music Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, Juan P. Bello 2017

Figure: filter extents on the CQT frequency axis (1 semitone and 1 octave).