Automatic Speech Recognition (CS753)
Lecture 9: Brief Introduction to Neural Networks
Instructor: Preethi Jyothi
Feb 2, 2017
Final Project Landscape
- Audio Synthesis Using LSTMs
- Automatic authorised ASR
- Automatic Tongue Twister Generator
- Bird call Recognition
- Emotion Recognition from speech
- Speaker Adaptation
- End-to-end Audio-Visual Speech Recognition
- InfoGAN for music
- Keyword spotting for continuous speech
- Music Genre Classification
- Nationality detection from speech accents
- Sanskrit Synthesis and Recognition
- Speech synthesis & ASR for Indic languages
- Swapping instruments in recordings
- Transcribing TED Talks
- Programming with speech-based commands
- Voice-based music player
- Tabla bol transcription
- Singer Identification
- Speaker Verification
- Ad detection in live radio streams
Feed-forward Neural Network
[Diagram: network with an input layer, a hidden layer and an output layer]
Brain Metaphor
Single neuron: y = g(Σi wi ⋅ xi), where the xi are the inputs, the wi are the weights, and g is the activation function
Image from: https://upload.wikimedia.org/wikipedia/commons/1/10/Blausen_0657_MultipolarNeuron.png
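As a small illustration (not from the slides), the single-neuron computation above can be written in a few lines of NumPy; the input and weight values below are made up for the example, and g is taken to be the sigmoid.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Example inputs x_i and weights w_i (arbitrary illustrative values)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.7])

# Single neuron: y = g(sum_i w_i * x_i), with g = sigmoid here
y = sigmoid(np.dot(w, x))
print(y)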
Feed-forward Neural Network
Parameterized Model
[Diagram: 5-node network with input nodes 1, 2; hidden nodes 3, 4; output node 5; weights w13, w14, w23, w24, w35, w45]
a5 = g(w35 ⋅ a3 + w45 ⋅ a4)
   = g(w35 ⋅ g(w13 ⋅ a1 + w23 ⋅ a2) + w45 ⋅ g(w14 ⋅ a1 + w24 ⋅ a2))
If x is a 2-dimensional input vector and the layer above it is a 2-dimensional vector h, a fully-connected layer computes:
h = xW + b
where wij in W is the weight of the connection between the ith neuron in the input layer and the jth neuron in the first hidden layer, and b is the bias vector.
Parameters of the network: all the wij (and the biases).
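A minimal sketch of this fully-connected layer in NumPy, assuming the 2-dimensional x and h from the slide; the particular numbers in x, W and b are illustrative assumptions.

import numpy as np

# x: 2-dimensional input row vector, W: 2x2 weight matrix, b: bias vector
x = np.array([1.0, 2.0])
W = np.array([[0.1, 0.3],
              [0.2, 0.4]])      # W[i, j] connects input neuron i to hidden neuron j
b = np.array([0.05, -0.05])

# Fully-connected (linear) layer: h = xW + b
h = x @ W + b
print(h)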
A 1-layer feedforward neural network has the form: MLP(x) = g(xW1 + b1) W2 + b2
The simplest neural network is the perceptron: Perceptron(x) = xW + b
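A short sketch contrasting the perceptron with the 1-layer feedforward network, with g taken to be the sigmoid and all parameter values chosen arbitrarily for illustration.

import numpy as np

def g(z):                        # nonlinearity, here sigmoid
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([1.0, -0.5])
W  = np.random.randn(2, 3); b  = np.zeros(3)       # perceptron parameters
W1 = np.random.randn(2, 4); b1 = np.zeros(4)       # MLP first layer
W2 = np.random.randn(4, 3); b2 = np.zeros(3)       # MLP second layer

perceptron_out = x @ W + b                  # Perceptron(x) = xW + b
mlp_out        = g(x @ W1 + b1) @ W2 + b2   # MLP(x) = g(xW1 + b1) W2 + b2
print(perceptron_out, mlp_out)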
Common Activation Functions (g)

[Plots of the nonlinear activation functions sigmoid, tanh and ReLU: input x on the horizontal axis, output on the vertical axis]

Sigmoid: σ(x) = 1/(1 + e^(-x))
Hyperbolic tangent (tanh): tanh(x) = (e^(2x) - 1)/(e^(2x) + 1)
Rectified Linear Unit (ReLU): ReLU(x) = max(0, x)
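These three activations can be written directly in NumPy (a sketch; np.tanh gives the same result as the explicit tanh formula):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return (np.exp(2 * x) - 1) / (np.exp(2 * x) + 1)   # same as np.tanh(x)

def relu(x):
    return np.maximum(0, x)

z = np.linspace(-10, 10, 5)
print(sigmoid(z), tanh(z), relu(z))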
Optimization Problem
- To train a neural network, define a loss function L(y, ỹ): a function of the true output y and the predicted output ỹ
- L(y, ỹ) assigns a non-negative numerical score to the neural network’s output ỹ
- The parameters of the network are set to minimise L over the training examples, i.e. a sum of losses over the different training samples (see the sketch after this list)
- L is typically minimised using a gradient-based method
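The slides do not commit to a particular loss; as one common example, a squared-error loss assigns a non-negative score to each prediction, and the training objective is the sum of such losses over the training samples (all values below are illustrative):

import numpy as np

def squared_error(y_true, y_pred):
    # Non-negative score; zero only when the prediction matches the target
    return np.sum((y_true - y_pred) ** 2)

# Total training loss = sum of per-example losses (illustrative values)
Y_true = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
Y_pred = [np.array([0.8, 0.1]), np.array([0.3, 0.6])]
total_loss = sum(squared_error(y, yp) for y, yp in zip(Y_true, Y_pred))
print(total_loss)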
Stochastic Gradient Descent (SGD)
Inputs: Function NN(x; θ), training examples x1 … xn with outputs y1 … yn, and loss function L.

SGD Algorithm:
do until stopping criterion:
    Pick a training example (xi, yi)
    Compute the loss L(NN(xi; θ), yi)
    Compute the gradient ∇L of L with respect to θ
    θ ← θ − η ∇L
done
Return: θ
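A minimal sketch of this SGD loop for an assumed linear model NN(x; θ) = xθ with a squared-error loss; the gradient is worked out by hand for this particular model, and the data, learning rate η and stopping criterion are illustrative assumptions.

import numpy as np

def NN(x, theta):                  # assumed model: NN(x; θ) = xθ
    return x @ theta

# Toy training data (illustrative)
X = np.random.randn(100, 3)
true_theta = np.array([1.0, -2.0, 0.5])
Y = X @ true_theta

theta = np.zeros(3)
eta = 0.01                         # learning rate η
for step in range(1000):           # stopping criterion: fixed number of steps
    i = np.random.randint(len(X))  # pick a training example (x_i, y_i)
    xi, yi = X[i], Y[i]
    # gradient of 0.5 * (xiθ - yi)^2 with respect to θ is (xiθ - yi) * xi
    grad = (NN(xi, theta) - yi) * xi
    theta = theta - eta * grad     # θ ← θ − η ∇L
print(theta)                       # should approach true_theta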
Training a Neural Network
Define the loss function to be minimised as a node L.
Goal: Learn weights for the neural network which minimise L.
Gradient Descent: Find ∂L/∂w for every weight w, and update it as w ← w − η ∂L/∂w.
How do we efficiently compute ∂L/∂w for all w?
We will compute ∂L/∂u for every node u in the network. Then ∂L/∂w = ∂L/∂u ⋅ ∂u/∂w, where u is the node which uses w.
Training a Neural Network
New goal: compute ∂L/∂u for every node u in the network.
Simple algorithm: Backpropagation.
Key fact: the chain rule of differentiation. If L can be written as a function of variables v1, …, vn, which in turn depend (partially) on another variable u, then ∂L/∂u = Σi ∂L/∂vi ⋅ ∂vi/∂u
Backpropagation
Consider v1, …, vn to be the layer above u, denoted Γ(u). Then the chain rule gives ∂L/∂u = Σv ∈ Γ(u) ∂L/∂v ⋅ ∂v/∂u
Backpropagation:
Base case: ∂L/∂L = 1
For each u (top to bottom):
    For each v ∈ Γ(u):
        Inductively, we have already computed ∂L/∂v
        Directly compute ∂v/∂u
    Compute ∂L/∂u = Σv ∈ Γ(u) ∂L/∂v ⋅ ∂v/∂u
(Computing ∂v/∂u is where values computed in the forward pass may be needed.)
Forward Pass
First compute all values of u given an input, in a forward pass
(The values of each node will be needed during backprop)
Finally, compute ∂L/∂w = ∂L/∂u ⋅ ∂u/∂w, where u is the node which uses w
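A worked sketch of the forward pass and backpropagation for the small 5-node network from the earlier slides, taking g to be the sigmoid and attaching a squared-error loss L = 0.5 (a5 − y)^2 on top of a5; the target y and all weight values are illustrative assumptions.

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid; note g'(z) = g(z) * (1 - g(z))

# Inputs, target and weights (illustrative values)
a1, a2, y = 1.0, 0.5, 1.0
w13, w23, w14, w24, w35, w45 = 0.1, -0.2, 0.4, 0.3, 0.6, -0.1

# Forward pass: compute and store the value of every node
a3 = g(w13 * a1 + w23 * a2)
a4 = g(w14 * a1 + w24 * a2)
a5 = g(w35 * a3 + w45 * a4)
L  = 0.5 * (a5 - y) ** 2

# Backward pass: ∂L/∂u for every node u (top to bottom), then ∂L/∂w
dL_da5 = a5 - y
dL_dz5 = dL_da5 * a5 * (1 - a5)           # z5 is the pre-activation of node 5
dL_dw35, dL_dw45 = dL_dz5 * a3, dL_dz5 * a4
dL_da3,  dL_da4  = dL_dz5 * w35, dL_dz5 * w45
dL_dz3 = dL_da3 * a3 * (1 - a3)
dL_dz4 = dL_da4 * a4 * (1 - a4)
dL_dw13, dL_dw23 = dL_dz3 * a1, dL_dz3 * a2
dL_dw14, dL_dw24 = dL_dz4 * a1, dL_dz4 * a2
print(L, dL_dw35, dL_dw13)

Notice how the forward-pass values (a3, a4, a5) are reused in the backward pass, exactly as the algorithm above requires.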
Neural Network Acoustic Models
- Input layer takes a window of acoustic feature vectors
- Output layer corresponds to classes (e.g. monophone labels, triphone states, etc.)
[Figure: deep neural network producing phone posteriors at the output layer]
Image adapted from: Dahl et al., "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition”, TASL’12
Neural Network Acoustic Models
- Input layer takes a window of acoustic feature vectors
- Hybrid NN/HMM systems: replace the GMM observation likelihoods with (suitably scaled) posteriors from the NN outputs
Image from: Dahl et al., "Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition”, TASL’12
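A rough sketch of such a DNN acoustic model: a window of stacked acoustic feature vectors goes in, and softmax phone posteriors come out. The window size, feature dimension, number of classes, layer sizes and random parameters are illustrative assumptions, not values from the Dahl et al. paper.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed shapes: 11-frame window of 40-dimensional features, 100 phone classes
window, feat_dim, n_classes, hidden = 11, 40, 100, 512
x = np.random.randn(window * feat_dim)            # stacked input window

# Two hidden layers + output layer (random illustrative parameters)
W1 = 0.01 * np.random.randn(window * feat_dim, hidden); b1 = np.zeros(hidden)
W2 = 0.01 * np.random.randn(hidden, hidden);            b2 = np.zeros(hidden)
W3 = 0.01 * np.random.randn(hidden, n_classes);         b3 = np.zeros(n_classes)

h1 = relu(x @ W1 + b1)
h2 = relu(h1 @ W2 + b2)
posteriors = softmax(h2 @ W3 + b3)                # phone posteriors P(class | window)
print(posteriors.shape, posteriors.sum())         # (100,) and ~1.0

In a hybrid NN/HMM system, these posteriors (divided by the class priors) would stand in for the GMM observation likelihoods during decoding.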