SLIDE 1

GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)

RNN and Musical Applications

Juhan Nam

SLIDE 2

Motivation

  • When the output is sequential (e.g., pitch estimation or note transcription), a CNN predicts each of the successive outputs independently

  • Can we predict the output considering the surrounding context (features in previous or next predictions) for better performance?

[Figure: each CNN prediction is independent of the others]
SLIDE 3

Motivation

  • Method #1: increase the input context size

○ May improve the performance, but requires a relatively deeper network
○ Limited to capturing successive features only in close neighbors
SLIDE 4

Motivation

  • Method #2: connect the feature maps temporally

○ Use information not only from the input but also from the hidden-layer states at the previous or next time steps
○ Regard the hidden-unit activations as dynamic states that are successively updated
○ By doing so, the model is expected to capture a wider input context

The update can be forward or backward in time
SLIDE 5

Recurrent Neural Networks (RNN)

  • A family of neural networks that have connections between the previous and current states of the hidden layers

○ The hidden units are “state vectors” with regard to the input index (i.e., time)

[Diagram: RNN hidden states connected across time steps]
SLIDE 6

Recurrent Neural Networks (RNN)

  • This simple structure is often called “Vanilla RNN”

○ tanh(·) is a common choice of the activation function $f(\cdot)$

[Diagram: three hidden layers with recurrent connections, unrolled over time]

$h_1(t) = f(W_h^{(1)} h_1(t-1) + W_x^{(1)} x(t) + b_1)$
$h_2(t) = f(W_h^{(2)} h_2(t-1) + W_x^{(2)} h_1(t) + b_2)$
$h_3(t) = f(W_h^{(3)} h_3(t-1) + W_x^{(3)} h_2(t) + b_3)$
$y(t) = g(W_o h_3(t) + b_o)$

The recurrent connections ($W_h$) control how much information from the previous state is used to update the current state, for $t = 0, 1, 2, \ldots$
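The same recurrence is easy to state in code. A minimal NumPy sketch of a single vanilla RNN layer stepping through time with shared weights; all names and dimensions are illustrative, not from the slides:

```python
import numpy as np

def vanilla_rnn_forward(x_seq, W_h, W_x, b, h0):
    """One vanilla RNN layer unrolled over time (illustrative names).

    x_seq: (T, input_dim) input sequence
    W_h:   (hidden_dim, hidden_dim) recurrent weights, shared across steps
    W_x:   (hidden_dim, input_dim) input weights, shared across steps
    """
    h = h0
    states = []
    for x_t in x_seq:                      # the same weights are reused at every step
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)                # (T, hidden_dim) hidden states over time

# Example: a random sequence of length 10 with 4-dim input, 8-dim hidden state
rng = np.random.default_rng(0)
H = vanilla_rnn_forward(rng.standard_normal((10, 4)),
                        0.1 * rng.standard_normal((8, 8)),
                        0.1 * rng.standard_normal((8, 4)),
                        np.zeros(8), np.zeros(8))
print(H.shape)  # (10, 8)
```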

SLIDE 7

Training RNN: Forward Pass

  • The hidden layers keep updating their states over the time steps

○ Regard the network progressively extended over time as a single large neural network in which the weights ($W_h$, $W_x$) are shared at each time step

[Diagram: unrolled RNN with inputs $x(1), \ldots, x(T)$, predictions $\hat{y}(1), \ldots, \hat{y}(T)$, and shared weights]
SLIDE 8

Training RNN: Backward Pass

  • Backpropagation through time (BPTT)

○ Gradients flow both top-down through the layers and backward through time

[Diagram: unrolled RNN with per-step losses $L(1), L(2), \ldots, L(T)$ propagating gradients back through the shared weights]
SLIDE 9

The Problem of Vanilla RNN

  • As the number of time steps grows during training, the gradients in BPTT can become unstable

○ Exploding or vanishing gradients

  • Exploding gradients can be controlled by gradient clipping, but vanishing gradients require a different architecture

  • In practice, vanilla RNNs are used only when the input is a short sequence
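To make the clipping remedy concrete, here is a minimal PyTorch sketch of one training step with gradient clipping; the model, data, and max_norm value are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = nn.RNN(input_size=4, hidden_size=8, batch_first=True)  # vanilla RNN layer
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(2, 100, 4)           # (batch, time, features): a long sequence
target = torch.randn(2, 100, 8)

output, _ = model(x)
loss = nn.functional.mse_loss(output, target)
loss.backward()                       # BPTT: gradients flow back through all 100 steps

# Rescale gradients whose global norm exceeds 1.0 to prevent explosion
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```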

SLIDE 10

Vanilla RNN

  • Another view

Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 11

Long Short-Term Memory (LSTM)

  • Four neural network layers in one module

○ Two recurrent flows

Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

SLIDE 12

Long Short-Term Memory (LSTM)

  • Cell state (“the key to LSTM”)

○ Information can flow through without a change: similar to the skip connection!
○ The sigmoid gates are a relaxation of the binary gate (0 or 1)

Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Figure annotations: forget gate, input gate, new information]

SLIDE 13

Long Short-Term Memory (LSTM)

  • Generate the next state from the cell

Source: Colah’s blog: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[Figure annotation: output gate]
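Putting these slides together, a minimal NumPy sketch of one LSTM step with its four layers (forget gate, input gate, candidate values, output gate); the packed-weight layout and all sizes are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W: (4*hidden, input+hidden), b: (4*hidden,).

    The four blocks of W correspond to the four NN layers in the module:
    forget gate, input gate, candidate values, output gate.
    """
    z = W @ np.concatenate([x, h_prev]) + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # gates in (0, 1)
    g = np.tanh(g)                                 # new candidate information
    c = f * c_prev + i * g                         # cell state: near-linear flow
    h = o * np.tanh(c)                             # next hidden state from the cell
    return h, c

# Example with a 4-dim input and an 8-dim hidden/cell state
rng = np.random.default_rng(0)
h, c = lstm_step(rng.standard_normal(4), np.zeros(8), np.zeros(8),
                 0.1 * rng.standard_normal((32, 12)), np.zeros(32))
print(h.shape, c.shape)  # (8,) (8,)
```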

SLIDE 14

Long Short-Term Memory (LSTM)

  • Much more powerful than the vanilla RNNs

○ Uninterrupted gradient flow is possible through the cell over time steps
○ The structure with two recurrent flows is similar to ResNet

  • Long-term dependency can be learned

○ We can use long sequence data as input

[Figure: 34-layer ResNet architecture, shown for comparison]

SLIDE 15
Sequence Setups using RNN

  • There are several different input and output setups in RNN

Many-to-Many, Many-to-One, One-to-Many, Many-to-One/One-to-Many (Seq2Seq)
SLIDE 16
Many-to-Many RNN

  • Both the input and the output are sequences

○ Assume that the input and output data are strongly aligned in the training data

■ When the alignment is weak, an attention layer is added or Seq2Seq is used

○ A bi-directional RNN is more commonly used unless it is for a real-time system (see the sketch below)
○ Use cases

■ Video classification: image frames to label frames
■ Part-of-speech tagging: sentence to tags
■ Automatic music transcription: audio to note/pitch/beat/chord
■ Sound event detection: audio to events

[Diagram: bi-directional RNN (uses both past and future information) vs. uni-directional RNN (uses past information only)]
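A minimal PyTorch sketch of a many-to-many (frame-wise) tagger with a bi-directional LSTM; the dimensions and layer choices are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FrameTagger(nn.Module):
    """Many-to-many: one label prediction per input frame."""
    def __init__(self, n_features=80, n_hidden=64, n_classes=25):
        super().__init__()
        # bidirectional=True lets each frame see both past and future context
        self.rnn = nn.LSTM(n_features, n_hidden, batch_first=True,
                           bidirectional=True)
        self.fc = nn.Linear(2 * n_hidden, n_classes)  # 2x for the two directions

    def forward(self, x):                 # x: (batch, time, n_features)
        h, _ = self.rnn(x)                # h: (batch, time, 2*n_hidden)
        return self.fc(h)                 # logits: (batch, time, n_classes)

logits = FrameTagger()(torch.randn(2, 100, 80))
print(logits.shape)  # torch.Size([2, 100, 25]): one prediction per frame
```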
SLIDE 17
Convolutional Recurrent Neural Network (CRNN)

  • When the input is high-dimensional (image or audio), CNN and RNN are combined

○ The CNN provides the embedding vector, which is used as the input of the RNN (see the sketch below)

[Diagram: audio/video → CNN → image/audio embedding → RNN → frame-level labels (note/pitch/beat/chord/event)]
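A minimal CRNN sketch under the same idea: a small conv stack pools over frequency only, so the time axis is preserved for the RNN. All sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """CNN front-end produces per-frame embeddings; the RNN models their dynamics."""
    def __init__(self, n_mels=80, n_classes=88):
        super().__init__()
        self.conv = nn.Sequential(             # pool over frequency only,
            nn.Conv2d(1, 16, 3, padding=1),    # so the time resolution is kept
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.rnn = nn.LSTM(32 * (n_mels // 4), 64, batch_first=True,
                           bidirectional=True)
        self.fc = nn.Linear(128, n_classes)

    def forward(self, x):                      # x: (batch, 1, n_mels, time)
        z = self.conv(x)                       # (batch, 32, n_mels//4, time)
        z = z.permute(0, 3, 1, 2).flatten(2)   # (batch, time, embedding)
        h, _ = self.rnn(z)
        return self.fc(h)                      # (batch, time, n_classes)

print(CRNN()(torch.randn(2, 1, 80, 100)).shape)  # torch.Size([2, 100, 88])
```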

SLIDE 18
Language Model (LM)

  • Predict the next word given a sequence of words

○ Compute the probability distribution of the next word $x^{(t)}$ given the preceding words $x^{(t-1)}, \ldots, x^{(1)}$: $P(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)})$

■ $x^{(t)}$ can be any word from the vocabulary $V = \{w_1, \ldots, w_{|V|}\}$

○ The likelihood of a whole sentence can be computed as

■ $P(x^{(1)}, x^{(2)}, \ldots, x^{(T)}) = \prod_{t=1}^{T} P(x^{(t)} \mid x^{(t-1)}, \ldots, x^{(1)})$

○ Trained in the many-to-many RNN setting (see the sketch below)

■ Transformers are more dominantly used these days

○ LMs have many applications

■ Text generation: predict the most likely words
■ Speech recognition (acoustic model + LM)

[Diagram: RNN language model with word embeddings; input “I” “am” “so” predicts “am” “so” “full”]

By replacing words with musical notes or MIDI events, we can build a “musical language model”, which can be used for music generation and automatic music transcription
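A minimal RNN language-model sketch in PyTorch, trained many-to-many to predict the next token at every position (vocabulary size and dimensions are illustrative). The same code becomes a musical language model if the token indices denote notes or MIDI events rather than words:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)          # word embedding
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True)   # uni-directional
        self.fc = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):             # tokens: (batch, time) word indices
        h, _ = self.rnn(self.embed(tokens))
        return self.fc(h)                  # next-word logits at every position

model = RNNLanguageModel()
tokens = torch.randint(0, 1000, (2, 16))
logits = model(tokens)                     # (2, 16, 1000)

# Many-to-many training: at each position t, the target is the token at t+1
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, 1000),      # predictions for positions 1..T-1
    tokens[:, 1:].reshape(-1))             # shifted targets
loss.backward()
```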

SLIDE 19
Many-to-One RNN

  • The input is a sequence and the output is a categorical label

○ The input sequence can have a variable length!
○ Use cases (see the sketch below)

■ Text classification: sentence to labels (pos/neg)
■ Music genre/mood/audio scene classification and tagging: audio to labels
■ Video scene classification: image frames to labels

[Diagram: audio/video scene classification (audio/image embedding → “park”) and text classification (word embeddings of “You” “are” “awesome” → “positive”)]
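A minimal many-to-one classifier sketch: only the final hidden state is fed to the output layer, so inputs of different lengths map to a single label. Sizes are illustrative:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Many-to-one: a whole sequence maps to one categorical label."""
    def __init__(self, n_features=80, n_hidden=64, n_classes=10):
        super().__init__()
        self.rnn = nn.LSTM(n_features, n_hidden, batch_first=True)
        self.fc = nn.Linear(n_hidden, n_classes)

    def forward(self, x):                  # x: (batch, time, n_features)
        _, (h_n, _) = self.rnn(x)          # h_n: final hidden state
        return self.fc(h_n[-1])            # one set of logits per sequence

# Sequences of different lengths map to the same output shape
clf = SequenceClassifier()
print(clf(torch.randn(2, 50, 80)).shape)   # torch.Size([2, 10])
print(clf(torch.randn(2, 200, 80)).shape)  # torch.Size([2, 10])
```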

SLIDE 20

One-to-Many RNN

  • The input is a single shot of data and the output is sequence data

○ This is regarded as a conditional generation model (see the sketch below)
○ Use cases

■ Image captioning: generate a text description of an image
■ Music playlist generation: playlist title (text) to a sequence of song embedding vectors

[Diagram: image captioning (image embedding → <start> → “The” “trees” “are” … “yellow”) and music playlist generation (“Bossa Nova Jazz Best” text embedding → <start> → track1 track2 track3)]
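A minimal one-to-many sketch: a single conditioning embedding (e.g., an image or playlist-title embedding) initializes the decoder state, and tokens are then generated step by step, each fed back as the next input. All names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class ConditionalDecoder(nn.Module):
    """One-to-many: a single embedding conditions a generated sequence."""
    def __init__(self, cond_dim=128, vocab_size=1000, emb_dim=64, hidden=128):
        super().__init__()
        self.init_h = nn.Linear(cond_dim, hidden)    # condition -> initial state
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.cell = nn.LSTMCell(emb_dim, hidden)
        self.fc = nn.Linear(hidden, vocab_size)

    def generate(self, cond, start_token=0, n_steps=10):
        h = torch.tanh(self.init_h(cond))            # (batch, hidden)
        c = torch.zeros_like(h)
        token = torch.full((cond.shape[0],), start_token, dtype=torch.long)
        out = []
        for _ in range(n_steps):                     # feed each output back in
            h, c = self.cell(self.embed(token), (h, c))
            token = self.fc(h).argmax(dim=-1)        # greedy choice of next token
            out.append(token)
        return torch.stack(out, dim=1)               # (batch, n_steps)

print(ConditionalDecoder().generate(torch.randn(2, 128)).shape)  # (2, 10)
```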

SLIDE 21

Many-to-One/One-to-Many (Seq2Seq)

  • Both the input and the output are sequences

○ Assume that the input and output data are not aligned
○ Regarded as an encoder-decoder RNN framework (see the sketch below)
○ Use cases

■ Machine translation: sentence to sentence (neural machine translation)
■ Speech recognition / note-level singing voice transcription: audio to text/note

[Diagram: machine translation, speech recognition, and singing voice transcription; e.g., “I” “love” “you” → encoder RNN → decoder RNN → “난” “너를” “사랑해” <EOS>]

The encoder output is the compressed latent vector of the input sequence, and the decoder becomes a conditional text generation model: Seq2Seq is a conditional language model!
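A minimal encoder-decoder (Seq2Seq) sketch: the encoder compresses the input sequence into its final state, which conditions the decoder as described above. Dimensions and token handling are illustrative:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=64, hidden=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # The encoder's final (h, c) is the compressed latent representation
        _, state = self.encoder(self.src_embed(src))
        # The decoder is a language model conditioned on that latent state
        out, _ = self.decoder(self.tgt_embed(tgt), state)
        return self.fc(out)                # (batch, tgt_time, tgt_vocab)

logits = Seq2Seq()(torch.randint(0, 1000, (2, 12)),   # source tokens
                   torch.randint(0, 1000, (2, 9)))    # shifted target tokens
print(logits.shape)  # torch.Size([2, 9, 1000])
```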

SLIDE 22

MIR Tasks Using RNN

  • Many-to-many RNN

○ Vocal melody extraction
○ Polyphonic piano transcription
○ Beat tracking
○ Chord recognition

  • Many-to-one RNN

○ Music auto-tagging

SLIDE 23

Vocal Melody Extraction

  • Extracting frame-level pitch contours of the singing voice from mixed tracks

○ Pitch estimation in the presence of interfering sources (background music)
○ Downstream tasks include cover song detection, query-by-humming, and singer identification

  • Singing voice detection and pitch estimation
  • Methods

○ Vocal source separation followed by monophonic pitch estimation
○ Estimate the vocal pitch directly from the audio as a classification task

[Figure: vocal pitch contour]

SLIDE 24

Vocal Melody Extraction

  • Joint learning of singing voice detection and vocal pitch estimation

○ Combining the loss functions from the two tasks
○ Vocal pitch classification

■ ResNet stacks: no pooling over time
■ Bi-directional LSTM-RNN
■ Use Gaussian blurring in the output layer (see the sketch below)

○ Singing voice detector

■ Uses the shared features from the three layers of the pitch classifier: “hierarchical” audio features (e.g., vocal formant, vibrato, portamento)
■ Bi-directional LSTM-RNN

Joint Detection and Classification of Singing Voice Melody Using Convolutional Recurrent Neural Networks, Sangeun Kum and Juhan Nam, 2019
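The Gaussian blurring spreads each one-hot pitch target over neighboring pitch bins, acknowledging that nearby pitches are nearly correct. A minimal sketch of how such soft targets could be built; the bin count and width are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

def gaussian_blurred_target(true_bin, n_bins=361, sigma=1.0):
    """Soft pitch target: a Gaussian bump centered on the true pitch bin."""
    bins = np.arange(n_bins)
    target = np.exp(-0.5 * ((bins - true_bin) / sigma) ** 2)
    return target / target.sum()           # normalize to a distribution

t = gaussian_blurred_target(180)
print(t.argmax(), round(t[180], 3))        # peak sits at the true bin
```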

SLIDE 25

Polyphonic Piano Transcription

  • Predicts individual notes from piano music recordings

○ Onset, duration (offset), velocity, and pedal on/off
○ An audio-to-MIDI conversion task

  • Methods

○ Non-negative matrix factorization (NMF): $V \approx WH$ (see the sketch below)

■ $V$: spectrogram, $W$: pitch templates, $H$: temporal activations

○ Multi-label classification

■ 88 binary state outputs (note on/off): an 88-dim. binary vector per frame
■ Use the sigmoid output in an NN architecture
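A minimal NMF sketch for this kind of decomposition using scikit-learn; the spectrogram here is random stand-in data, and 88 components (one per piano key) is an illustrative choice:

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for a magnitude spectrogram V: (n_freq_bins, n_frames), non-negative
rng = np.random.default_rng(0)
V = rng.random((1025, 400))

# Factorize V ≈ W H: W holds per-pitch spectral templates,
# H holds their temporal activations (when each pitch sounds)
model = NMF(n_components=88, init='random', random_state=0, max_iter=200)
W = model.fit_transform(V)   # (1025, 88) pitch templates
H = model.components_        # (88, 400) temporal activations

print(W.shape, H.shape)
```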

SLIDE 26

Polyphonic Piano Transcription

  • Joint learning of note onset detection and frame-level pitch detection

○ Onset network (CRNN): detects the attack part of each note (percussive tone)
○ Frame network (CRNN): detects the on/off state of each note (harmonic tone)
○ Combining the two loss functions from the CRNNs (see the sketch below)
○ The two networks have a causal connection
○ Frame predictions without an onset are discarded

Onsets and Frames: Dual-Objective Piano Transcription, Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, Douglas Eck, 2018

[Diagram: log mel-spectrogram feeding two stacks (conv stack → BiLSTM → FC sigmoid), producing onset predictions with an onset loss and frame predictions with a frame loss]

[Figure legend: blue: frame prediction, red: onset prediction, pink: both; yellow: true positive, red: false negative, green: false positive]
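A minimal sketch of the dual-objective idea: two stacks each predict 88 per-frame note probabilities, the onset predictions feed into the frame stack (the causal connection), and the two binary cross-entropy losses are summed. The conv stacks are omitted and all sizes are simplified relative to the paper:

```python
import torch
import torch.nn as nn

class OnsetsAndFramesSketch(nn.Module):
    def __init__(self, n_mels=229, hidden=128, n_keys=88):
        super().__init__()
        self.onset_rnn = nn.LSTM(n_mels, hidden, batch_first=True,
                                 bidirectional=True)
        self.onset_fc = nn.Linear(2 * hidden, n_keys)
        # The frame stack also sees the onset predictions: the causal connection
        self.frame_rnn = nn.LSTM(n_mels + n_keys, hidden, batch_first=True,
                                 bidirectional=True)
        self.frame_fc = nn.Linear(2 * hidden, n_keys)

    def forward(self, mel):                        # mel: (batch, time, n_mels)
        h_on, _ = self.onset_rnn(mel)
        onset = torch.sigmoid(self.onset_fc(h_on))
        h_fr, _ = self.frame_rnn(torch.cat([mel, onset.detach()], dim=-1))
        frame = torch.sigmoid(self.frame_fc(h_fr))
        return onset, frame

model = OnsetsAndFramesSketch()
mel = torch.randn(2, 100, 229)
onset_true = torch.randint(0, 2, (2, 100, 88)).float()
frame_true = torch.randint(0, 2, (2, 100, 88)).float()
onset, frame = model(mel)
loss = (nn.functional.binary_cross_entropy(onset, onset_true)
        + nn.functional.binary_cross_entropy(frame, frame_true))  # joint objective
loss.backward()
```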

SLIDE 27

Polyphonic Note Transcription

  • The “Onsets and Frames” model significantly outperformed the previous state of the art

○ A huge jump in the note-level accuracy, which is perceptually more important than the frame-level accuracy
○ The key idea is detecting the “onset” state separately

■ The causal connection and the inference using note state transitions

○ The following studies investigated more note states: onset, sustain, offset, and other states associated with the sustain pedal
○ Demo: https://magenta.tensorflow.org/onsets-frames

Model                    | Frame P / R / F1      | Note P / R / F1       | Note w/ offset P / R / F1
Sigtia [3] (our reimpl.) | 71.99 / 73.32 / 72.22 | 44.97 / 49.55 / 46.58 | 17.64 / 19.71 / 18.38
Kelz [4] (our reimpl.)   | 81.18 / 65.07 / 71.60 | 44.27 / 61.29 / 50.94 | 20.13 / 27.80 / 23.14
Melodyne (decay mode)    | 71.85 / 50.39 / 58.57 | 62.08 / 48.53 / 54.02 | 21.09 / 16.56 / 18.40
Onsets and Frames        | 88.53 / 70.89 / 78.30 | 84.24 / 80.67 / 82.29 | 51.32 / 49.31 / 50.22

SLIDE 28

Beat Tracking

  • Predict the beat positions from music recordings

○ The first beat of a measure (or a bar) is called the “downbeat”
○ Segments audio frames into musical units (BPM: beats per minute)
○ Useful for beat-synchronous audio features, chord recognition, and automatic DJ systems

  • Methods

○ Dynamic programming using tempo estimation and onset strength features

■ See the “beat_track” function in librosa (usage sketch below)

○ Probabilistic model using an HMM

■ Define beat states and their transition probabilities

○ Classification-based approach

■ Discriminate between beats and non-beats at the frame level

[Figure: waveform with beat and downbeat positions marked]
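A short usage sketch of the dynamic-programming beat tracker in librosa; the audio file path is a placeholder:

```python
import librosa

# Load audio (path is a placeholder) and run the DP-based beat tracker,
# which combines a tempo estimate with onset strength features
y, sr = librosa.load('song.wav')
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

beat_times = librosa.frames_to_time(beat_frames, sr=sr)
print(tempo, beat_times[:5])  # estimated BPM and first few beat positions (s)
```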

SLIDE 29

Beat Tracking

  • Joint detection of beat and downbeat

○ Mel-spectrogram input with three different window sizes
○ 3 layers of LSTM-RNN (25 dims)
○ Softmax output with three classes

■ Downbeat, beat, and non-beat

○ The output was further improved using a language model

■ The language model was trained only on the output labels
■ Find the most likely beat sequence given a chunk of softmax prediction probabilities

Joint Beat and Downbeat Tracking with Recurrent Neural Networks, Sebastian Böck, Florian Krebs, and Gerhard Widmer, 2016

SLIDE 30

Chord Recognition

  • Chord: the basic unit of tonal harmony

○ Major, minor, 7th, diminished: different impressions of consonance and dissonance
○ Chord notation depends on the key
○ Chord progressions have regular patterns

■ I-IV, I-V-I, I-IV(ii)-V-I

[The FMP book]

SLIDE 31

Chord Recognition

  • Predict the chord labels from music recordings

○ The output can be predicted every frame or every beat
○ See this link: https://chordify.net/
○ Useful for music structure analysis and music performance practice

  • Methods (traditional approaches)

○ Binary chord templates with chroma features

■ Pattern matching (see the sketch below)

○ Probabilistic model using an HMM

■ Chord-progression language model
■ Define chord transition probabilities

○ Classification-based approach

■ Predict one of the chord labels
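A minimal sketch of binary-template chord matching: each chroma frame is compared against 12-dimensional binary templates (major and minor triads here), and the best-correlating chord wins. The chroma input is random stand-in data; in practice it would come from, e.g., librosa.feature.chroma_cqt:

```python
import numpy as np

# Binary chord templates: 12-dim vectors with 1s at the chord tones
def triad_template(root, minor=False):
    t = np.zeros(12)
    t[[root, (root + (3 if minor else 4)) % 12, (root + 7) % 12]] = 1.0
    return t

names = [f'{n}{q}' for q in ('', 'm') for n in
         ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']]
templates = np.array([triad_template(r, m) for m in (False, True)
                      for r in range(12)])           # (24, 12)

# Stand-in chroma frames: (12 pitch classes, n_frames)
chroma = np.random.default_rng(0).random((12, 100))

# Pattern matching: pick the chord whose template best matches each frame
scores = templates @ chroma                          # (24, n_frames)
pred = [names[i] for i in scores.argmax(axis=0)]
print(pred[:5])
```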

SLIDE 32

Chord Recognition

  • CRNN-based chord recognition

○ Uses the gated recurrent unit (GRU) for the RNN
○ Structured chord labels

■ The chord notation is represented with binary vectors of root, pitch, and bass fields
■ The root and bass fields use softmax; the pitch field uses sigmoid
■ Compared against the one-hot encoding of chord classes

Structured Training for Large-Vocabulary Chord Recognition, Brian McFee and Juan Pablo Bello, 2017

SLIDE 33

Music Auto-Tagging

  • Predict descriptive words from music recordings

○ Multi-label classification: genre, mood, instrument, and so on
○ Comparing CNN to CRNN: the RNN part (a GRU-RNN) is a better temporal aggregator than the averaging in the CNN model
○ The CRNN model slightly outperforms the CNN model, but it is slower

[Figure: 2D CNN model (k2c2) vs. CRNN model]

Convolutional Recurrent Neural Networks for Music Classification, Keunwoo Choi, George Fazekas, Mark Sandler, and Kyunghyun Cho, 2017