GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
RNN and Musical Applications
Juhan Nam
Motivation
- When the output is sequential, e.g., pitch estimation or note transcription, a CNN predicts each of the successive outputs independently
- Can we predict the output considering the surrounding context (features in previous or next predictions) for better performance?
Each prediction is independent
Motivation
- Method #1: Increasing the input context size
○ May improve the performance but needs a relatively deeper network
○ Limited to capturing successive features only in close neighbors
Motivation
- Method #2: connect the feature maps temporally
○ We use information from not only the input but also the hidden layer states in the previous or next time steps
○ We regard the hidden unit activations as dynamic states which are successively updated
○ By doing so, the model is expected to capture a wider input context
The update can be forward or backward in time
Recurrent Neural Networks (RNN)
- A family of neural networks that have connections between previous
states and current states of hidden layers
○ The hidden units are “state vectors” with regard to the input index (i.e. time)
Recurrent Neural Networks (RNN)
- This simple structure is often called “Vanilla RNN”
○ tanh is a common choice of the activation function f(·)
h^(1)(u) = f(X_h^(1) h^(1)(u−1) + X_y^(1) y(u) + c^(1))
h^(2)(u) = f(X_h^(2) h^(2)(u−1) + X_y^(2) h^(1)(u) + c^(2))
h^(3)(u) = f(X_h^(3) h^(3)(u−1) + X_y^(3) h^(2)(u) + c^(3))
z(u) = g(X_y^(4) h^(3)(u) + c^(4))
The recurrent connections determine how much information from the previous state is used to update the current state (u = 0, 1, 2, …)
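The layer update above can be sketched in NumPy; the layer sizes, weights, and input sequence below are illustrative, not from the slides.

```python
import numpy as np

# Sketch of one vanilla RNN layer following the update
# h(u) = f(X_h h(u-1) + X_y y(u) + c), with f = tanh.
rng = np.random.default_rng(0)
dim_in, dim_h = 4, 3
X_h = rng.normal(scale=0.1, size=(dim_h, dim_h))   # recurrent weights
X_y = rng.normal(scale=0.1, size=(dim_h, dim_in))  # input weights
c = np.zeros(dim_h)                                # bias

def rnn_step(h_prev, y_u):
    return np.tanh(X_h @ h_prev + X_y @ y_u + c)

h = np.zeros(dim_h)                  # initial state h(0)
for u in range(5):                   # u = 1, ..., 5
    h = rnn_step(h, rng.normal(size=dim_in))
```

The same `rnn_step` with the same weights is applied at every time step, which is exactly the weight sharing used when the network is unrolled.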
Training RNN: Forward Pass
- The hidden layers keep updating the states over the time steps
○ Regard the progressively extended neural network over time as a single large neural network where the weights (X_h, X_y) are shared at each time step
[Figure: unrolled RNN over time steps 1, 2, …, U, with inputs y(1), …, y(U), predictions ẑ(1), …, ẑ(U), and the same weights reused at every step]
Unrolled RNN
Training RNN: Backward Pass
- Backpropagation through time (BPTT)
○ Gradients flow both top-down through the layers and backward through time
[Figure: unrolled RNN with a loss term at each time step (M(1), M(2), …, M(U)); the gradient of each loss propagates back through all earlier time steps]
Unrolled RNN
The Problem of Vanilla RNN
- As the time steps increase in the training, the gradients during BPTT can
become unstable
○ Exploding or vanishing gradients
- Exploding gradients can be controlled by gradient clipping but vanishing
gradients require a different architecture
- In practice, vanilla RNNs are used only when the input is a short sequence
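Gradient clipping by global norm, the usual control for exploding gradients mentioned above, can be sketched as follows (the function name and threshold are illustrative):

```python
import numpy as np

# Sketch of gradient clipping by global norm: if the combined norm of
# all gradients exceeds max_norm, rescale them to exactly max_norm.
def clip_by_global_norm(grads, max_norm):
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total > max_norm:
        grads = [g * (max_norm / total) for g in grads]
    return grads, total

grads = [np.full((2, 2), 10.0), np.full((3,), 10.0)]  # exploding gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

Rescaling all gradients by a single factor preserves their direction while bounding the step size, unlike clipping each element independently.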
Vanilla RNN
- Another view
Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory (LSTM)
- Four neural network layers in one module
○ Two recurrent flows
Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Long Short-Term Memory (LSTM)
- Cell state (“the key to LSTM”)
○ Information can flow through without a change: similar to the skip connection!
○ The sigmoid gates are a relaxation of the binary gate (0 or 1)
Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: forget gate, input gate and new candidate information updating the cell state]
Long Short-Term Memory (LSTM)
- Generate the next state from the cell
Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Output gate
Long Short-Term Memory (LSTM)
- Much more powerful than the vanilla RNNs
○ Uninterrupted gradient flow is possible through the cell over time steps
○ The structure with two recurrent flows is similar to ResNet
- Long-term dependency can be learned
○ We can use long sequence data as input
[Figure: 34-layer residual network (ResNet) architecture]
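The four layers and two recurrent flows described above can be sketched as a single from-scratch LSTM step; the weight shapes and initialization here are illustrative.

```python
import numpy as np

# Sketch of one LSTM step: four layers (forget gate, input gate,
# candidate, output gate) and two recurrent flows (cell state c,
# hidden state h).
rng = np.random.default_rng(1)
dim_in, dim_h = 4, 3
d = dim_in + dim_h
W_f, W_i, W_g, W_o = (rng.normal(scale=0.1, size=(dim_h, d)) for _ in range(4))
b = np.zeros(dim_h)  # one shared zero bias for brevity

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z + b)      # forget gate: what to erase from the cell
    i = sigmoid(W_i @ z + b)      # input gate: what new information to write
    g = np.tanh(W_g @ z + b)      # new candidate information
    c = f * c_prev + i * g        # cell state: additive, skip-like flow
    o = sigmoid(W_o @ z + b)      # output gate
    h = o * np.tanh(c)            # next hidden state generated from the cell
    return h, c

h = np.zeros(dim_h)
c = np.zeros(dim_h)
for _ in range(5):
    h, c = lstm_step(rng.normal(size=dim_in), h, c)
```

The additive cell update `c = f * c_prev + i * g` is what allows gradients to flow uninterrupted over many time steps, analogous to a ResNet skip connection.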
Sequence Setups using RNN
- There are several different input and output setups in RNN
Many-to-Many, Many-to-One, One-to-Many, Many-to-One/One-to-Many (Seq2Seq)
Many-to-Many RNN
- Both input and output are sequences
○ Assume that the input and output data are strongly aligned in the training data
■ When the alignment is weak, an attention layer is added or Seq2Seq is used
○ Bi-directional RNN is more commonly used unless it is for a real-time system
○ Use cases
■ Video classification: image frames to label frames
■ Part-of-speech tagging: sentence to tags
■ Automatic music transcription: audio to note/pitch/beat/chord
■ Sound event detection: audio to events
Bi-directional RNN (use both past and future information)
Uni-directional RNN (use past information only)
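A bi-directional RNN can be sketched as two vanilla RNN passes, one over the sequence and one over its reverse, concatenated per time step (all shapes and weights below are illustrative):

```python
import numpy as np

# Sketch of a bi-directional RNN layer: a forward pass and a backward
# pass with separate weights, concatenated at each time step.
rng = np.random.default_rng(3)
dim_in, dim_h, T = 4, 3, 6
Wf_h, Wb_h = rng.normal(scale=0.1, size=(2, dim_h, dim_h))
Wf_x, Wb_x = rng.normal(scale=0.1, size=(2, dim_h, dim_in))

def run(W_h, W_x, xs):
    h, out = np.zeros(dim_h), []
    for x in xs:
        h = np.tanh(W_h @ h + W_x @ x)
        out.append(h)
    return out

xs = [rng.normal(size=dim_in) for _ in range(T)]
fwd = run(Wf_h, Wf_x, xs)                  # summarizes past context
bwd = run(Wb_h, Wb_x, xs[::-1])[::-1]      # summarizes future context
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```

Because the backward pass needs the whole sequence before it can run, this setup is unsuitable for real-time systems.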
Convolutional Recurrent Neural Network (CRNN)
- When the input is high-dimensional (image or audio), CNN and RNN are combined
○ The CNN provides the embedding vector which is used as input to the RNN
[Figure: CNN front-end producing image/audio embeddings, followed by an RNN predicting frame-level labels (note/pitch/beat/chord/event)]
Language Model (LM)
- Predict the next word given a sequence of words
○ Compute the probability distribution of the next word y(t) given a sequence of words y(t−1), . . . , y(1): P(y(t) | y(t−1), . . . , y(1))
■ y(t) can be any word from the vocabulary W = {w_1, …, w_|W|}
○ The likelihood can be computed for a sentence
■ P(y(1), y(2), . . . , y(T)) = ∏_{t=1}^{T} P(y(t) | y(t−1), . . . , y(1))
○ Trained in the many-to-many RNN setting
■ Transformer is more dominantly used these days
○ LM has many applications
■ Text generation: predict the most likely words
■ Speech recognition (acoustic model + LM)
[Figure: RNN language model predicting “am” “so” “full” from the input “I” “am” “so” through a word embedding layer]
By replacing words with musical notes or MIDI events, we can build a “Musical Language Model” which can be used for music generation and automatic music transcription
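The chain-rule factorization of the sentence likelihood can be illustrated with a toy conditional table; the words and probabilities below are made up for the example.

```python
# Toy illustration of P(y(1)..y(T)) = ∏_t P(y(t) | y(t-1), ..., y(1)).
# In a real LM the conditionals come from an RNN's softmax output;
# here they are a hand-written table.
cond = {
    (): {"I": 0.5, "You": 0.5},
    ("I",): {"am": 0.8, "was": 0.2},
    ("I", "am"): {"so": 0.6, "full": 0.4},
    ("I", "am", "so"): {"full": 0.9, "happy": 0.1},
}

def sentence_likelihood(words):
    p = 1.0
    for t, w in enumerate(words):
        p *= cond[tuple(words[:t])][w]  # P(y(t) | y(t-1), ..., y(1))
    return p

p = sentence_likelihood(["I", "am", "so", "full"])  # 0.5 * 0.8 * 0.6 * 0.9 ≈ 0.216
```

For a musical language model, the vocabulary entries would be notes or MIDI events instead of words.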
Many-to-One RNN
- The input is a sequence and the output is a categorical label
○ The input sequence can have a variable length!
○ Use cases
■ Text classification: sentence to labels (pos/neg)
■ Music genre/mood/audio scene classification and tagging: audio to labels
■ Video scene classification: image frames to labels
[Figure: many-to-one examples: text classification (“You” “are” “awesome” → “positive”) via word embeddings, and audio/video scene classification via audio/image embeddings]
One-to-Many RNN
- The input is a single shot of data and the output is sequence data
○ This is regarded as a conditional generation model
○ Use cases
■ Image captioning: generate the text description of an image
■ Music playlist generation: playlist title (text) to a sequence of song embedding vectors
[Figure: one-to-many examples: image captioning (image embedding → “The” “trees” “are” “yellow” …) and music playlist generation (“Bossa Nova Jazz Best” → track1 track2 track3), both starting from a <start> token]
Many-to-One/One-to-Many (Seq2Seq)
- Both input and output are sequences
○ Assume that the input and output data are not aligned
○ Regarded as an encoder-decoder RNN framework
○ Use cases
■ Machine translation: sentence to sentence (neural machine translation)
■ Speech recognition/note-level singing voice transcription: audio to text/note
[Figure: machine translation, speech recognition and singing voice transcription examples; an encoder RNN reads “I” “love” “you” and a decoder RNN emits “난” “너를” “사랑해” (“I love you” in Korean); the encoder passes the compressed latent vector of the input sequence to the decoder]
Seq2Seq is a conditional language model!
○ The decoder becomes a conditional text generation model, emitting tokens until <EOS>
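A minimal, untrained sketch of the encoder-decoder idea: the encoder compresses the input sequence into its final state, and the decoder greedily emits tokens from it until <EOS>. The vocabulary, weights, and start-token convention below are all illustrative.

```python
import numpy as np

# Encoder-decoder sketch with random (untrained) weights.
rng = np.random.default_rng(4)
vocab = ["<EOS>", "a", "b", "c"]
dim_h, dim_in = 3, 2
E = rng.normal(size=(len(vocab), dim_in))        # token embeddings
W_h = rng.normal(scale=0.5, size=(dim_h, dim_h))
W_x = rng.normal(scale=0.5, size=(dim_h, dim_in))
W_out = rng.normal(size=(len(vocab), dim_h))     # state -> token logits

def step(h, x):
    return np.tanh(W_h @ h + W_x @ x)

def encode(src):
    h = np.zeros(dim_h)
    for x in src:
        h = step(h, x)
    return h  # the compressed latent vector of the input sequence

def decode_greedy(h, max_len=10):
    out, tok = [], 0              # reuse <EOS> (id 0) as the start token
    for _ in range(max_len):
        h = step(h, E[tok])
        tok = int(np.argmax(W_out @ h))
        if tok == 0:              # stop at <EOS>
            break
        out.append(vocab[tok])
    return out

src = [rng.normal(size=dim_in) for _ in range(5)]
hyp = decode_greedy(encode(src))
```

Because generation is conditioned on the encoder's latent vector, the decoder is exactly the conditional language model described above.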
MIR Tasks Using RNN
- Many-to-many RNN
○ Vocal melody extraction ○ Polyphonic piano transcription ○ Beat tracking ○ Chord recognition
- Many-to-one RNN
○ Music auto-tagging
Vocal Melody Extraction
- Extracting frame-level pitch contours of singing voice from mixed tracks
○ Pitch estimation in the presence of interfering sources (background music)
○ Downstream tasks include cover song detection, query-by-humming and singer identification
- Singing voice detection and pitch estimation
- Methods
○ Vocal source separation followed by monophonic pitch estimation
○ Estimate the vocal pitch directly from audio as a classification task
Vocal Pitch contour
Vocal Melody Extraction
- Joint learning of singing voice detection and
vocal pitch estimation
○ Combining the loss functions from the two tasks
○ Vocal pitch classification
■ ResNet stacks: no pooling over time
■ Bi-directional LSTM-RNN
■ Use Gaussian blurring in the output layer
○ Singing voice detector
■ Use the shared features from the three layers of the pitch classifier: “hierarchical” audio features (e.g., vocal formant, vibrato, portamento)
■ Bi-directional LSTM-RNN
Joint Detection and Classification of Singing Voice Melody Using Convolutional Recurrent Neural Networks, Sangeun Kum and Juhan Nam, 2019
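The Gaussian blurring of the output layer can be sketched as replacing a one-hot pitch target with a normalized Gaussian over pitch bins; the bin count and sigma below are made up for illustration.

```python
import numpy as np

# Sketch of a Gaussian-blurred pitch target: neighbouring pitch bins
# receive soft probability mass instead of a hard one-hot label.
n_bins = 10

def gaussian_target(true_bin, sigma=1.0):
    bins = np.arange(n_bins)
    t = np.exp(-0.5 * ((bins - true_bin) / sigma) ** 2)
    return t / t.sum()  # normalize so the target is a distribution

t = gaussian_target(true_bin=4)
```

Soft targets like this penalize near-miss pitch predictions less than distant ones, which suits the continuous nature of pitch.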
Polyphonic Piano Transcription
- Predict individual notes from piano music recordings
○ Onset, duration (offset), velocity and pedal on/off
○ Audio-to-MIDI conversion task
- Methods
○ Non-negative matrix factorization (NMF)
■ 𝑊: spectrogram, 𝑋: pitch template, 𝐼: temporal activation
○ Multi-label classification
■ 88 binary state outputs (note on/off)
■ Use the sigmoid output in an NN architecture
[Figure: audio-to-MIDI conversion; NMF factorization W ≈ XI; 88-dim. binary vector output]
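The NMF factorization W ≈ XI can be sketched with the classic multiplicative updates of Lee and Seung; the matrix sizes and random data below are illustrative stand-ins for a real spectrogram.

```python
import numpy as np

# Sketch of NMF in the slide's notation: W (spectrogram) is factorized
# into X (pitch templates) times I (temporal activations), using
# Euclidean-distance multiplicative updates.
rng = np.random.default_rng(2)
n_freq, n_time, n_pitch = 20, 30, 4
W = np.abs(rng.normal(size=(n_freq, n_time)))   # stand-in "spectrogram"
X = np.abs(rng.normal(size=(n_freq, n_pitch)))  # pitch templates
I = np.abs(rng.normal(size=(n_pitch, n_time)))  # temporal activations
eps = 1e-9
for _ in range(200):
    # Multiplicative updates keep X and I non-negative at every step
    I *= (X.T @ W) / (X.T @ X @ I + eps)
    X *= (W @ I.T) / (X @ I @ I.T + eps)
err = np.linalg.norm(W - X @ I) / np.linalg.norm(W)
```

Each activation row of I then traces how strongly one pitch template is present over time, which is the starting point for note transcription.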
Polyphonic Piano Transcription
- Joint learning of note onset detection and frame-level pitch detection
○ Onset network (CRNN): detect the attack part of each note (percussive tone)
○ Frame network (CRNN): detect the on/off state of each note (harmonic tone)
○ Combining the two loss functions from the CRNNs
○ The two networks have a causal connection
○ Frame predictions without an onset are discarded
Onsets and Frames: Dual-Objective Piano Transcription, Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, Douglas Eck, 2018
[Figure: model architecture: log mel-spectrogram → conv stack → BiLSTM → FC sigmoid for both the onset and frame branches; the onset predictions feed into the frame branch]
[Figure: piano-roll output: blue: frame prediction, red: onset prediction, pink: both; yellow: true positive, red: false negative, green: false positive]
Polyphonic Note Transcription
- The “Onsets and Frames” model significantly outperformed the previous state of the art
○ Huge jump in the note-level accuracy: perceptually more important than the frame-level accuracy
○ The key idea is detecting the “onset” state separately
■ The causal connection and the inference using note state transition
○ The following studies investigated more note states: onset, sustain, and offset, and other states associated with the sustain pedal
○ Demo: https://magenta.tensorflow.org/onsets-frames
                         Frame                    Note                     Note with offset
                         Prec.   Recall  F1       Prec.   Recall  F1       Prec.   Recall  F1
Sigtia [3] (our reimpl.) 71.99   73.32   72.22    44.97   49.55   46.58    17.64   19.71   18.38
Kelz [4] (our reimpl.)   81.18   65.07   71.60    44.27   61.29   50.94    20.13   27.80   23.14
Melodyne (decay mode)    71.85   50.39   58.57    62.08   48.53   54.02    21.09   16.56   18.40
Onsets and Frames        88.53   70.89   78.30    84.24   80.67   82.29    51.32   49.31   50.22
Beat Tracking
- Predict the beat positions from music recordings
○ The first beat of a measure (or bar) is called the “downbeat”
○ Segment audio frames into musical units (BPM: beats per minute)
○ Useful for beat-synchronous audio features, chord recognition and automatic DJ applications
- Methods
○ Dynamic programming using tempo estimation and onset strength features
■ See the ”beat_track” function in Librosa
○ Probabilistic model using HMM
■ Define beat states and their transition probability
○ Classification-based approach
■ Discriminate between beats and non-beats at a frame-level
Beat Tracking
- Joint detection of beat and downbeat
○ Mel-spectrogram input with three different window sizes
○ 3 layers of LSTM-RNN (25 dim)
○ The softmax output with three classes
■ Downbeat, beat and non-beat
○ The output was further improved using a language model
■ The language model was trained only with the output labels
■ Find the most likely beat sequence given a chunk of softmax prediction probabilities
Joint Beat and Downbeat Tracking with Recurrent Neural Networks, Sebastian Böck, Florian Krebs, and Gerhard Widmer, 2016
Chord Recognition
- Chord: the basic units of tonal harmony
○ Major, minor, 7th, diminished: different impressions of consonance and dissonance
○ Chord notation depends on the key
○ Chord progressions have regular patterns
■ I-IV, I-V-I, I-IV(ii)-V-I
[The FMP book]
Chord Recognition
- Predict the chord labels from music recordings
○ The output can be predicted every frame or every beat
○ See this link: https://chordify.net/
○ Useful for music structure analysis and music performance practice
- Methods
○ Binary chord templates with chroma features
■ Pattern matching
○ Probabilistic model using HMM
■ Chord progression language model
■ Define chord transition probability
○ Classification-based approach
■ Predict one of the chord labels
Traditional approaches
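The binary-template matching approach can be sketched as follows; only a few triad templates are included for illustration.

```python
import numpy as np

# Sketch of chord recognition by matching a 12-dim chroma vector
# (pitch classes C, C#, ..., B) against binary triad templates.
def triad_template(root, minor=False):
    t = np.zeros(12)
    third = 3 if minor else 4
    t[[root % 12, (root + third) % 12, (root + 7) % 12]] = 1.0
    return t

templates = {"C:maj": triad_template(0), "C:min": triad_template(0, minor=True),
             "G:maj": triad_template(7), "A:min": triad_template(9, minor=True)}

def match_chord(chroma):
    # cosine similarity between the chroma frame and each binary template
    scores = {name: chroma @ t / (np.linalg.norm(chroma) * np.linalg.norm(t))
              for name, t in templates.items()}
    return max(scores, key=scores.get)

chroma = np.zeros(12)
chroma[[0, 4, 7]] = 1.0        # energy at pitch classes C, E, G
best = match_chord(chroma)     # 'C:maj'
```

Classification-based approaches replace this fixed template bank with a learned model, and HMMs add a chord transition model on top of the per-frame scores.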
Chord Recognition
- CRNN-based chord recognition
○ Use the gated recurrent unit (GRU) for the RNN
○ Structured chord labels
■ The chord notation is represented with binary vectors of root, pitch and bass fields
■ The root and bass fields: softmax; the pitch field: sigmoid
■ Compare with the one-hot encoding of chord classes
Structured Training for Large-Vocabulary Chord Recognition, Brian McFee and Juan Pablo Bello, 2017
Music Auto-Tagging
- Predict descriptive words from music recordings
○ Multi-label classification: genre, mood, instrument, and so on
○ Compare CNN to CRNN: the RNN part is a better temporal aggregator than the averaging in the CNN model (a GRU-RNN is used)
○ The CRNN model slightly outperforms the CNN model but it is slower
2D CNN model (k2c2) CRNN model
Convolutional Recurrent Neural Networks for Music Classification, Keunwoo Choi, George Fazekas, Mark Sandler, and Kyunghyun Cho, 2017