SLIDE 1

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

RNN and Musical Applications

Juhan Nam

SLIDE 2

Motivation

  • When the output is sequential (e.g., pitch estimation or note transcription), a CNN predicts each of the successive outputs independently

  • Can we predict the output considering the surrounding context (features in previous or next predictions) for better performance?

[Figure: each CNN prediction is independent of the others]
SLIDE 3

Motivation

  • Method #1: increase the input context size

○ May improve the performance, but requires a relatively deeper network
○ Limited to capturing successive features only in close neighbors
SLIDE 4

Motivation

  • Method #2: connect the feature maps temporally

○ Use information not only from the input but also from the hidden-layer states at the previous or next time steps
○ Regard the hidden-unit activations as dynamic states that are successively updated
○ By doing so, the model is expected to capture a wider input context

The update can be forward or backward in time
SLIDE 5

Recurrent Neural Networks (RNN)

  • A family of neural networks that have connections between the previous and current states of the hidden layers

○ The hidden units are “state vectors” with regard to the input index (i.e., time)

[Diagram: RNN hidden states connected across time steps]
SLIDE 6

Recurrent Neural Networks (RNN)

  • This simple structure is often called “Vanilla RNN”

○ tanh(·) is a common choice of the activation function $f(\cdot)$

[Diagram: three hidden layers with recurrent connections, unrolled over time]

$h_1(t) = f(W_h^{(1)} h_1(t-1) + W_x^{(1)} x(t) + b_1)$
$h_2(t) = f(W_h^{(2)} h_2(t-1) + W_x^{(2)} h_1(t) + b_2)$
$h_3(t) = f(W_h^{(3)} h_3(t-1) + W_x^{(3)} h_2(t) + b_3)$
$y(t) = g(W_o h_3(t) + b_o)$

The recurrent connections ($W_h$) control how much information from the previous state is used to update the current state, for $t = 0, 1, 2, \ldots$
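The same recurrence is easy to state in code. A minimal NumPy sketch of a single vanilla RNN layer stepping through time with shared weights; all names and dimensions are illustrative, not from the slides:

```python
import numpy as np

def vanilla_rnn_forward(x_seq, W_h, W_x, b, h0):
    """One vanilla RNN layer unrolled over time (illustrative names).

    x_seq: (T, input_dim) input sequence
    W_h:   (hidden_dim, hidden_dim) recurrent weights, shared across steps
    W_x:   (hidden_dim, input_dim) input weights, shared across steps
    """
    h = h0
    states = []
    for x_t in x_seq:                      # the same weights are reused at every step
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)                # (T, hidden_dim) hidden states over time

# Example: a random sequence of length 10 with 4-dim input, 8-dim hidden state
rng = np.random.default_rng(0)
H = vanilla_rnn_forward(rng.standard_normal((10, 4)),
                        0.1 * rng.standard_normal((8, 8)),
                        0.1 * rng.standard_normal((8, 4)),
                        np.zeros(8), np.zeros(8))
print(H.shape)  # (10, 8)
```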

SLIDE 7

Training RNN: Forward Pass

  • The hidden layers keep updating their states over the time steps

○ Regard the network progressively extended over time as a single large neural network in which the weights ($W_h$, $W_x$) are shared at each time step

[Diagram: unrolled RNN with inputs $x(1), \ldots, x(T)$, predictions $\hat{y}(1), \ldots, \hat{y}(T)$, and shared weights]
SLIDE 8

Training RNN: Backward Pass

  • Backpropagation through time (BPTT)

○ Gradients flow both top-down through the layers and backward through time

[Diagram: unrolled RNN with per-step losses $L(1), L(2), \ldots, L(T)$ propagating gradients back through the shared weights]
SLIDE 9

The Problem of Vanilla RNN

  • As the number of time steps grows during training, the gradients in BPTT can become unstable

○ Exploding or vanishing gradients

  • Exploding gradients can be controlled by gradient clipping, but vanishing gradients require a different architecture

  • In practice, vanilla RNNs are used only when the input is a short sequence
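To make the clipping remedy concrete, here is a minimal PyTorch sketch of one training step with gradient clipping; the model, data, and max_norm value are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)  # vanilla RNN layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(2, 100, 4)           # (batch, time, features): a long sequence
target = torch.randn(2, 100, 8)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()                       # BPTT: gradients flow back through all 100 steps

# Rescale gradients whose global norm exceeds 1.0 to prevent explosion
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```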

SLIDE 10

Vanilla RNN

  • Another view

Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 11

Long Short-Term Memory (LSTM)

  • Four neural network layers in one module

○ Two recurrent flows

Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 12

Long Short-Term Memory (LSTM)

  • Cell state (“the key to LSTM”)

○ Information can flow through without a change: similar to the skip connection!
○ The sigmoid gates are a relaxation of the binary gate (0 or 1)

Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Figure annotations: forget gate, input gate, new information]

SLIDE 13

Long Short-Term Memory (LSTM)

  • Generate the next state from the cell

Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Figure annotation: output gate]
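Putting these slides together, a minimal NumPy sketch of one LSTM step with its four layers (forget gate, input gate, candidate values, output gate); the packed-weight layout and all sizes are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W: (4*hidden, input+hidden), b: (4*hidden,).

    The four blocks of W correspond to the four NN layers in the module:
    forget gate, input gate, candidate values, output gate.
    """
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates in (0, 1)
    g = np.tanh(g)                                 # new candidate information
    c = f * c_prev + i * g                         # cell state: near-linear flow
    h = o * np.tanh(c)                             # next hidden state from the cell
    return h, c

# Example with a 4-dim input and an 8-dim hidden/cell state
rng = np.random.default_rng(0)
h, c = lstm_step(rng.standard_normal(4), np.zeros(8), np.zeros(8),
                 0.1 * rng.standard_normal((32, 12)), np.zeros(32))
print(h.shape, c.shape)  # (8,) (8,)
```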

SLIDE 14

Long Short-Term Memory (LSTM)

  • Much more powerful than the vanilla RNNs

○ Uninterrupted gradient flow is possible through the cell over time steps
○ The structure with two recurrent flows is similar to ResNet

  • Long-term dependency can be learned

○ We can use long sequence data as input

[Figure: 34-layer ResNet architecture, shown for comparison]

SLIDE 15
Sequence Setups using RNN

  • There are several different input and output setups in RNN

Many-to-Many, Many-to-One, One-to-Many, Many-to-One/One-to-Many (Seq2Seq)
SLIDE 16
Many-to-Many RNN

  • Both the input and the output are sequences

○ Assume that the input and output data are strongly aligned in the training data

■ When the alignment is weak, an attention layer is added or Seq2Seq is used

○ A bi-directional RNN is more commonly used unless it is for a real-time system (see the sketch below)
○ Use cases

■ Video classification: image frames to label frames
■ Part-of-speech tagging: sentence to tags
■ Automatic music transcription: audio to note/pitch/beat/chord
■ Sound event detection: audio to events

[Diagram: bi-directional RNN (uses both past and future information) vs. uni-directional RNN (uses past information only)]
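A minimal PyTorch sketch of a many-to-many (frame-wise) tagger with a bi-directional LSTM; the dimensions and layer choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FrameTagger(nn.Module):
    """Many-to-many: one label prediction per input frame."""
    def __init__(self, n_features=80, n_hidden=64, n_classes=25):
        super().__init__()
        # bidirectional=True lets each frame see both past and future context
        self.rnn = nn.LSTM(n_features, n_hidden, batch_first=True,
                           bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_classes)  # 2x for the two directions

    def forward(self, x):                 # x: (batch, time, n_features)
        h, _ = self.rnn(x)                # h: (batch, time, 2*n_hidden)
        return self.fc(h)                 # logits: (batch, time, n_classes)

logits = FrameTagger()(torch.randn(2, 100, 80))
print(logits.shape)  # torch.Size([2, 100, 25]): one prediction per frame
```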
SLIDE 17
Convolutional Recurrent Neural Network (CRNN)

  • When the input is high-dimensional (image or audio), CNN and RNN are combined

○ The CNN provides the embedding vector, which is used as the input of the RNN (see the sketch below)

[Diagram: audio/video → CNN → image/audio embedding → RNN → frame-level labels (note/pitch/beat/chord/event)]
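A minimal CRNN sketch under the same idea: a small conv stack pools over frequency only, so the time axis is preserved for the RNN. All sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN front-end produces per-frame embeddings; the RNN models their dynamics."""
    def __init__(self, n_mels=80, n_classes=88):
        super().__init__()
        self.conv = nn.Sequential(             # pool over frequency only,
            nn.Conv2d(1, 16, 3, padding=1),    # so the time resolution is kept
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(32 * (n_mels // 4), 64, batch_first=True,
                           bidirectional=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        z = self.conv(x)                       # (batch, 32, n_mels//4, time)
        z = z.permute(0, 3, 1, 2).flatten(2)   # (batch, time, embedding)
        h, _ = self.rnn(z)
        return self.fc(h)                      # (batch, time, n_classes)

print(CRNN()(torch.randn(2, 1, 80, 100)).shape)  # torch.Size([2, 100, 88])
```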

SLIDE 18
Language Model (LM)

  • Predict the next word given a sequence of words

○ Compute the probability distribution of the next word $x^{(t)}$ given the preceding words $x^{(t-1)}, \ldots, x^{(1)}$: $P(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)})$

■ $x^{(t)}$ can be any word from the vocabulary $V = \{w_1, \ldots, w_{|V|}\}$

○ The likelihood of a whole sentence can be computed as

■ $P(x^{(1)}, x^{(2)}, \ldots, x^{(T)}) = \prod_{t=1}^{T} P(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)})$

○ Trained in the many-to-many RNN setting (see the sketch below)

■ Transformers are more dominantly used these days

○ LMs have many applications

■ Text generation: predict the most likely words
■ Speech recognition (acoustic model + LM)

[Diagram: RNN language model with word embeddings; input “I” “am” “so” predicts “am” “so” “full”]

By replacing words with musical notes or MIDI events, we can build a “musical language model”, which can be used for music generation and automatic music transcription
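A minimal RNN language-model sketch in PyTorch, trained many-to-many to predict the next token at every position (vocabulary size and dimensions are illustrative). The same code becomes a musical language model if the token indices denote notes or MIDI events rather than words:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # word embedding
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)   # uni-directional
        self.fc = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):             # tokens: (batch, time) word indices
        h, _ = self.rnn(self.embed(tokens))
        return self.fc(h)                  # next-word logits at every position

model = RNNLanguageModel()
tokens = torch.randint(0, 1000, (2, 16))
logits = model(tokens)                     # (2, 16, 1000)

# Many-to-many training: at each position t, the target is the token at t+1
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 1000),      # predictions for positions 1..T-1
    tokens[:, 1:].reshape(-1))             # shifted targets
loss.backward()
```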

SLIDE 19
Many-to-One RNN

  • The input is a sequence and the output is a categorical label

○ The input sequence can have a variable length!
○ Use cases (see the sketch below)

■ Text classification: sentence to labels (pos/neg)
■ Music genre/mood/audio scene classification and tagging: audio to labels
■ Video scene classification: image frames to labels

[Diagram: audio/video scene classification (audio/image embedding → “park”) and text classification (word embeddings of “You” “are” “awesome” → “positive”)]
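A minimal many-to-one classifier sketch: only the final hidden state is fed to the output layer, so inputs of different lengths map to a single label. Sizes are illustrative:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Many-to-one: a whole sequence maps to one categorical label."""
    def __init__(self, n_features=80, n_hidden=64, n_classes=10):
        super().__init__()
        self.rnn = nn.LSTM(n_features, n_hidden, batch_first=True)
        self.fc = nn.Linear(n_hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, n_features)
        _, (h_n, _) = self.rnn(x)          # h_n: final hidden state
        return self.fc(h_n[-1])            # one set of logits per sequence

# Sequences of different lengths map to the same output shape
clf = SequenceClassifier()
print(clf(torch.randn(2, 50, 80)).shape)   # torch.Size([2, 10])
print(clf(torch.randn(2, 200, 80)).shape)  # torch.Size([2, 10])
```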

SLIDE 20

One-to-Many RNN

  • The input is a single shot of data and the output is sequence data

○ This is regarded as a conditional generation model (see the sketch below)
○ Use cases

■ Image captioning: generate a text description of an image
■ Music playlist generation: playlist title (text) to a sequence of song embedding vectors

[Diagram: image captioning (image embedding → <start> → “The” “trees” “are” … “yellow”) and music playlist generation (“Bossa Nova Jazz Best” text embedding → <start> → track1 track2 track3)]
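A minimal one-to-many sketch: a single conditioning embedding (e.g., an image or playlist-title embedding) initializes the decoder state, and tokens are then generated step by step, each fed back as the next input. All names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """One-to-many: a single embedding conditions a generated sequence."""
    def __init__(self, cond_dim=128, vocab_size=1000, emb_dim=64, hidden=128):
        super().__init__()
        self.init_h = nn.Linear(cond_dim, hidden)    # condition -> initial state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim, hidden)
        self.fc = nn.Linear(hidden, vocab_size)

    def generate(self, cond, start_token=0, n_steps=10):
        h = torch.tanh(self.init_h(cond))            # (batch, hidden)
        c = torch.zeros_like(h)
        token = torch.full((cond.shape[0],), start_token, dtype=torch.long)
        out = []
        for _ in range(n_steps):                     # feed each output back in
            h, c = self.cell(self.embed(token), (h, c))
            token = self.fc(h).argmax(dim=-1)        # greedy choice of next token
            out.append(token)
        return torch.stack(out, dim=1)               # (batch, n_steps)

print(ConditionalDecoder().generate(torch.randn(2, 128)).shape)  # (2, 10)
```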

SLIDE 21

Many-to-One/One-to-Many (Seq2Seq)

  • Both the input and the output are sequences

○ Assume that the input and output data are not aligned
○ Regarded as an encoder-decoder RNN framework (see the sketch below)
○ Use cases

■ Machine translation: sentence to sentence (neural machine translation)
■ Speech recognition / note-level singing voice transcription: audio to text/note

[Diagram: machine translation, speech recognition, and singing voice transcription; e.g., “I” “love” “you” → encoder RNN → decoder RNN → “난” “너를” “사랑해” <EOS>]

The encoder output is the compressed latent vector of the input sequence, and the decoder becomes a conditional text generation model: Seq2Seq is a conditional language model!
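A minimal encoder-decoder (Seq2Seq) sketch: the encoder compresses the input sequence into its final state, which conditions the decoder as described above. Dimensions and token handling are illustrative:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder's final (h, c) is the compressed latent representation
        _, state = self.encoder(self.src_embed(src))
        # The decoder is a language model conditioned on that latent state
        out, _ = self.decoder(self.tgt_embed(tgt), state)
        return self.fc(out)                # (batch, tgt_time, tgt_vocab)

logits = Seq2Seq()(torch.randint(0, 1000, (2, 12)),   # source tokens
                   torch.randint(0, 1000, (2, 9)))    # shifted target tokens
print(logits.shape)  # torch.Size([2, 9, 1000])
```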

SLIDE 22

MIR Tasks Using RNN

  • Many-to-many RNN

○ Vocal melody extraction
○ Polyphonic piano transcription
○ Beat tracking
○ Chord recognition

  • Many-to-one RNN

○ Music auto-tagging

SLIDE 23

Vocal Melody Extraction

  • Extracting frame-level pitch contours of the singing voice from mixed tracks

○ Pitch estimation in the presence of interfering sources (background music)
○ Downstream tasks include cover song detection, query-by-humming, and singer identification

  • Singing voice detection and pitch estimation
  • Methods

○ Vocal source separation followed by monophonic pitch estimation
○ Estimate the vocal pitch directly from the audio as a classification task

[Figure: vocal pitch contour]

SLIDE 24

Vocal Melody Extraction

  • Joint learning of singing voice detection and vocal pitch estimation

○ Combining the loss functions from the two tasks
○ Vocal pitch classification

■ ResNet stacks: no pooling over time
■ Bi-directional LSTM-RNN
■ Use Gaussian blurring in the output layer (see the sketch below)

○ Singing voice detector

■ Uses the shared features from the three layers of the pitch classifier: “hierarchical” audio features (e.g., vocal formant, vibrato, portamento)
■ Bi-directional LSTM-RNN

Joint Detection and Classification of Singing Voice Melody Using Convolutional Recurrent Neural Networks, Sangeun Kum and Juhan Nam, 2019
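The Gaussian blurring spreads each one-hot pitch target over neighboring pitch bins, acknowledging that nearby pitches are nearly correct. A minimal sketch of how such soft targets could be built; the bin count and width are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

def gaussian_blurred_target(true_bin, n_bins=361, sigma=1.0):
    """Soft pitch target: a Gaussian bump centered on the true pitch bin."""
    bins = np.arange(n_bins)
    target = np.exp(-0.5 * ((bins - true_bin) / sigma) ** 2)
    return target / target.sum()           # normalize to a distribution

t = gaussian_blurred_target(180)
print(t.argmax(), round(t[180], 3))        # peak sits at the true bin
```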

SLIDE 25

Polyphonic Piano Transcription

  • Predicts individual notes from piano music recordings

○ Onset, duration (offset), velocity, and pedal on/off
○ An audio-to-MIDI conversion task

  • Methods

○ Non-negative matrix factorization (NMF): $V \approx WH$ (see the sketch below)

■ $V$: spectrogram, $W$: pitch templates, $H$: temporal activations

○ Multi-label classification

■ 88 binary state outputs (note on/off): an 88-dim. binary vector per frame
■ Use the sigmoid output in an NN architecture
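A minimal NMF sketch for this kind of decomposition using scikit-learn; the spectrogram here is random stand-in data, and 88 components (one per piano key) is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for a magnitude spectrogram V: (n_freq_bins, n_frames), non-negative
rng = np.random.default_rng(0)
V = rng.random((1025, 400))

# Factorize V ≈ W H: W holds per-pitch spectral templates,
# H holds their temporal activations (when each pitch sounds)
model = NMF(n_components=88, init='random', random_state=0, max_iter=200)
W = model.fit_transform(V)   # (1025, 88) pitch templates
H = model.components_        # (88, 400) temporal activations

print(W.shape, H.shape)
```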

SLIDE 26

Polyphonic Piano Transcription

  • Joint learning of note onset detection and frame-level pitch detection

○ Onset network (CRNN): detects the attack part of each note (percussive tone)
○ Frame network (CRNN): detects the on/off state of each note (harmonic tone)
○ Combining the two loss functions from the CRNNs (see the sketch below)
○ The two networks have a causal connection
○ Frame predictions without an onset are discarded

Onsets and Frames: Dual-Objective Piano Transcription, Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, Douglas Eck, 2018

[Diagram: log mel-spectrogram feeding two stacks (conv stack → BiLSTM → FC sigmoid), producing onset predictions with an onset loss and frame predictions with a frame loss]

[Figure legend: blue: frame prediction, red: onset prediction, pink: both; yellow: true positive, red: false negative, green: false positive]
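A minimal sketch of the dual-objective idea: two stacks each predict 88 per-frame note probabilities, the onset predictions feed into the frame stack (the causal connection), and the two binary cross-entropy losses are summed. The conv stacks are omitted and all sizes are simplified relative to the paper:

```python
import torch
import torch.nn as nn

class OnsetsAndFramesSketch(nn.Module):
    def __init__(self, n_mels=229, hidden=128, n_keys=88):
        super().__init__()
        self.onset_rnn = nn.LSTM(n_mels, hidden, batch_first=True,
                                 bidirectional=True)
        self.onset_fc = nn.Linear(2 * hidden, n_keys)
        # The frame stack also sees the onset predictions: the causal connection
        self.frame_rnn = nn.LSTM(n_mels + n_keys, hidden, batch_first=True,
                                 bidirectional=True)
        self.frame_fc = nn.Linear(2 * hidden, n_keys)

    def forward(self, mel):                        # mel: (batch, time, n_mels)
        h_on, _ = self.onset_rnn(mel)
        onset = torch.sigmoid(self.onset_fc(h_on))
        h_fr, _ = self.frame_rnn(torch.cat([mel, onset.detach()], dim=-1))
        frame = torch.sigmoid(self.frame_fc(h_fr))
        return onset, frame

model = OnsetsAndFramesSketch()
mel = torch.randn(2, 100, 229)
onset_true = torch.randint(0, 2, (2, 100, 88)).float()
frame_true = torch.randint(0, 2, (2, 100, 88)).float()
onset, frame = model(mel)
loss = (nn.functional.binary_cross_entropy(onset, onset_true)
        + nn.functional.binary_cross_entropy(frame, frame_true))  # joint objective
loss.backward()
```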

SLIDE 27

Polyphonic Note Transcription

  • The “Onsets and Frames” model significantly outperformed the previous state of the art

○ A huge jump in the note-level accuracy, which is perceptually more important than the frame-level accuracy
○ The key idea is detecting the “onset” state separately

■ The causal connection and the inference using note state transitions

○ The following studies investigated more note states: onset, sustain, offset, and other states associated with the sustain pedal
○ Demo: https://magenta.tensorflow.org/onsets-frames

Model                    | Frame P / R / F1      | Note P / R / F1       | Note w/ offset P / R / F1
Sigtia [3] (our reimpl.) | 71.99 / 73.32 / 72.22 | 44.97 / 49.55 / 46.58 | 17.64 / 19.71 / 18.38
Kelz [4] (our reimpl.)   | 81.18 / 65.07 / 71.60 | 44.27 / 61.29 / 50.94 | 20.13 / 27.80 / 23.14
Melodyne (decay mode)    | 71.85 / 50.39 / 58.57 | 62.08 / 48.53 / 54.02 | 21.09 / 16.56 / 18.40
Onsets and Frames        | 88.53 / 70.89 / 78.30 | 84.24 / 80.67 / 82.29 | 51.32 / 49.31 / 50.22

SLIDE 28

Beat Tracking

  • Predict the beat positions from music recordings

○ The first beat of a measure (or a bar) is called the “downbeat”
○ Segments audio frames into musical units (BPM: beats per minute)
○ Useful for beat-synchronous audio features, chord recognition, and automatic DJ systems

  • Methods

○ Dynamic programming using tempo estimation and onset strength features

■ See the “beat_track” function in librosa (usage sketch below)

○ Probabilistic model using an HMM

■ Define beat states and their transition probabilities

○ Classification-based approach

■ Discriminate between beats and non-beats at the frame level

[Figure: waveform with beat and downbeat positions marked]
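A short usage sketch of the dynamic-programming beat tracker in librosa; the audio file path is a placeholder:

```python
import librosa

# Load audio (path is a placeholder) and run the DP-based beat tracker,
# which combines a tempo estimate with onset strength features
y, sr = librosa.load('song.wav')
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print(tempo, beat_times[:5])  # estimated BPM and first few beat positions (s)
```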

SLIDE 29

Beat Tracking

  • Joint detection of beat and downbeat

○ Mel-spectrogram input with three different window sizes
○ 3 layers of LSTM-RNN (25 dims)
○ Softmax output with three classes

■ Downbeat, beat, and non-beat

○ The output was further improved using a language model

■ The language model was trained only on the output labels
■ Find the most likely beat sequence given a chunk of softmax prediction probabilities

Joint Beat and Downbeat Tracking with Recurrent Neural Networks, Sebastian Böck, Florian Krebs, and Gerhard Widmer, 2016

SLIDE 30

Chord Recognition

  • Chord: the basic unit of tonal harmony

○ Major, minor, 7th, diminished: different impressions of consonance and dissonance
○ Chord notation depends on the key
○ Chord progressions have regular patterns

■ I-IV, I-V-I, I-IV(ii)-V-I

[The FMP book]

SLIDE 31

Chord Recognition

  • Predict the chord labels from music recordings

○ The output can be predicted every frame or every beat
○ See this link: https://chordify.net/
○ Useful for music structure analysis and music performance practice

  • Methods (traditional approaches)

○ Binary chord templates with chroma features

■ Pattern matching (see the sketch below)

○ Probabilistic model using an HMM

■ Chord-progression language model
■ Define chord transition probabilities

○ Classification-based approach

■ Predict one of the chord labels
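A minimal sketch of binary-template chord matching: each chroma frame is compared against 12-dimensional binary templates (major and minor triads here), and the best-correlating chord wins. The chroma input is random stand-in data; in practice it would come from, e.g., librosa.feature.chroma_cqt:

```python
import numpy as np

# Binary chord templates: 12-dim vectors with 1s at the chord tones
def triad_template(root, minor=False):
    t = np.zeros(12)
    t[[root, (root + (3 if minor else 4)) % 12, (root + 7) % 12]] = 1.0
    return t

names = [f'{n}{q}' for q in ('', 'm') for n in
         ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']]
templates = np.array([triad_template(r, m) for m in (False, True)
                      for r in range(12)])           # (24, 12)

# Stand-in chroma frames: (12 pitch classes, n_frames)
chroma = np.random.default_rng(0).random((12, 100))

# Pattern matching: pick the chord whose template best matches each frame
scores = templates @ chroma                          # (24, n_frames)
pred = [names[i] for i in scores.argmax(axis=0)]
print(pred[:5])
```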

SLIDE 32

Chord Recognition

  • CRNN-based chord recognition

○ Uses the gated recurrent unit (GRU) for the RNN
○ Structured chord labels

■ The chord notation is represented with binary vectors of root, pitch, and bass fields
■ The root and bass fields use softmax; the pitch field uses sigmoid
■ Compared against the one-hot encoding of chord classes

Structured Training for Large-Vocabulary Chord Recognition, Brian McFee and Juan Pablo Bello, 2017

SLIDE 33

Music Auto-Tagging

  • Predict descriptive words from music recordings

○ Multi-label classification: genre, mood, instrument, and so on
○ Comparing CNN to CRNN: the RNN part (a GRU-RNN) is a better temporal aggregator than the averaging in the CNN model
○ The CRNN model slightly outperforms the CNN model, but it is slower

[Figure: 2D CNN model (k2c2) vs. CRNN model]

Convolutional Recurrent Neural Networks for Music Classification, Keunwoo Choi, George Fazekas, Mark Sandler, and Kyunghyun Cho, 2017