Lecture 5: Representation Learning
Kai-Wei Chang, CS @ UCLA, kw@kwchang.net
Course webpage: https://uclanlp.github.io/CS269-17/
This lecture
v Review: Neural Network
v Recurrent NN
v Representation learning in NLP
Neural Network
Based on slide by Andrew Ng
Slide by Andrew Ng
Neural Network (feed-forward)
Based on slide by T. Finin, M. desJardins, L. Getoor, R. Par
Feed-Forward Process
v Input layer units are features (in NLP, e.g., words)
v Usually a one-hot vector or a word embedding
v Working forward through the network, the input function is applied to compute each unit's input value
v E.g., a weighted sum of the inputs
v The activation function transforms this input value into the unit's final output
v Typically a nonlinear function (e.g., sigmoid); see the sketch below
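A minimal sketch of one such unit (weighted sum followed by a sigmoid activation); the weights and input values here are made-up illustrations, not taken from the slides:

```python
import numpy as np

def sigmoid(z):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical input features (e.g., indicators for three words)
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -1.2, 0.8])   # made-up weights
b = -0.3                          # made-up bias

z = np.dot(w, x) + b              # input function: weighted sum
a = sigmoid(z)                    # activation function: nonlinear squashing
print(a)
```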
Slide by Andrew Ng
Based on slide by Andrew Ng
Vector Representation
Can be extended to multi-class classification, with one output unit per class (see the one-hot sketch below)
Example classes: Pedestrian, Car, Motorcycle, Truck
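A hedged illustration of the multi-class output representation: each class gets its own output unit, and the target for an example is a one-hot vector. The class names follow the slide's example; the encoding itself is the standard construction, not code from the lecture:

```python
import numpy as np

classes = ["Pedestrian", "Car", "Motorcycle", "Truck"]

def one_hot(label, classes):
    # target vector with a 1 at the position of the true class
    y = np.zeros(len(classes))
    y[classes.index(label)] = 1.0
    return y

print(one_hot("Motorcycle", classes))  # [0. 0. 1. 0.]
```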
Slide by Andrew Ng
Why staged predictions?
Based on slide and example by Andrew Ng
Representing Boolean Functions
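The slide's figures are not reproduced here, but the classic example from Andrew Ng's course is that a single sigmoid unit with suitably chosen weights can represent a Boolean function such as AND. A small sketch (the specific weights are the usual illustrative choice, not unique):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def and_unit(x1, x2):
    # large weights drive the sigmoid toward 0 or 1, so the unit behaves like AND
    return sigmoid(-30 + 20 * x1 + 20 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(and_unit(x1, x2)))
```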
Combining Representations to Create Non-Linear Functions
Based on example by Andrew Ng
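In the same spirit (following Andrew Ng's well-known example), units computing AND, (NOT x1) AND (NOT x2), and OR can be layered to compute a non-linear function such as XNOR. A sketch with the standard illustrative weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unit(b, w1, w2, x1, x2):
    return sigmoid(b + w1 * x1 + w2 * x2)

def xnor(x1, x2):
    a1 = unit(-30, 20, 20, x1, x2)    # x1 AND x2
    a2 = unit(10, -20, -20, x1, x2)   # (NOT x1) AND (NOT x2)
    return unit(-10, 20, 20, a1, a2)  # a1 OR a2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```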
Layering Representations
[Figure: handwritten-digit example — each input is a 20 × 20 pixel image, so d = 400 input features (x1 … x400), with 10 output classes.]
Each image is “unrolled” into a vector x of pixel intensities
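A small sketch of that unrolling step; the array shapes follow the slide's 20 × 20 example, and the random image is just a stand-in for real pixel data:

```python
import numpy as np

image = np.random.rand(20, 20)   # stand-in for a 20 x 20 grayscale digit
x = image.reshape(-1)            # "unroll" into a length-400 feature vector
print(x.shape)                   # (400,)
```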
Layering Representations
[Figure: multi-layer network — input layer x1 … xd, a hidden layer, and an output layer with one unit per digit class "0" … "9".]
Visualization of Hidden Layer
This lecture
v Review: Neural Network
v Learning NN
v Recursive and Recurrent NN
v Representation learning in NLP
Stochastic Sub-gradient Descent
Given a training set $D = \{(\mathbf{x}, y)\}$
1. Initialize $\mathbf{w} \leftarrow \mathbf{0} \in \mathbb{R}^n$
2. For epoch $= 1 \dots T$:
3.   For $(\mathbf{x}, y)$ in $D$:
4.     Update $\mathbf{w} \leftarrow \mathbf{w} - \gamma_t\, g(\mathbf{w})$, where $g(\mathbf{w})$ is a (sub)gradient of the loss on $(\mathbf{x}, y)$
5. Return $\mathbf{w}$
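A minimal Python sketch of this loop; the gradient function, learning rate, and toy data are placeholders, since the slide gives only the pseudocode:

```python
import numpy as np

def sgd(data, grad_loss, dim, epochs=10, lr=0.1):
    """Stochastic (sub)gradient descent over a list of (x, y) examples."""
    w = np.zeros(dim)                        # initialize w <- 0
    for _ in range(epochs):                  # for epoch 1..T
        for x, y in data:                    # for each training example
            w = w - lr * grad_loss(w, x, y)  # step along the negative (sub)gradient
    return w

# toy usage with a squared-error loss gradient (illustrative only)
toy = [(np.array([1.0, 2.0]), 1.0), (np.array([0.5, -1.0]), -1.0)]
grad = lambda w, x, y: 2 * (np.dot(w, x) - y) * x
print(sgd(toy, grad, dim=2))
```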
Recap: Logistic regression
$$\min_{\mathbf{w}} \; \frac{\lambda}{2n}\, \mathbf{w}^\top \mathbf{w} \;+\; \frac{1}{n} \sum_{i=1}^{n} \log\!\big(1 + \exp(-y_i\, \mathbf{w}^\top \mathbf{x}_i)\big) \qquad \text{(labels } y_i \in \{-1, +1\})$$

Let $h_{\mathbf{w}}(\mathbf{x}_i) = 1 / \big(1 + \exp(-\mathbf{w}^\top \mathbf{x}_i)\big)$ (the probability that $y_i = 1$ given $\mathbf{x}_i$). Equivalently, with labels $y_i \in \{0, 1\}$:

$$\min_{\mathbf{w}} \; \frac{\lambda}{2n}\, \mathbf{w}^\top \mathbf{w} \;-\; \frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \log h_{\mathbf{w}}(\mathbf{x}_i) + (1 - y_i) \log\big(1 - h_{\mathbf{w}}(\mathbf{x}_i)\big) \Big]$$
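A quick numerical sketch of these two pieces (the sigmoid and the regularized objective); the data and λ are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(w, X, y, lam):
    """L2-regularized logistic regression objective with labels y in {0, 1}."""
    n = X.shape[0]
    h = sigmoid(X @ w)                                        # h_w(x_i) for every example
    reg = lam / (2 * n) * np.dot(w, w)                        # (lambda / 2n) w^T w
    ce = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))    # cross-entropy term
    return reg + ce

# toy usage
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.1]])
y = np.array([1.0, 0.0, 1.0])
print(logistic_loss(np.zeros(2), X, y, lam=0.1))
```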
Cost Function
Based on slide by Andrew Ng
$J(\theta) = \text{loss}(\theta) + R(\theta), \qquad R(\theta) = \lambda\, \theta^\top \theta$
Optimizing the Neural Network
Based on slide by Andrew Ng
Forward Propagation
Based on slide by Andrew Ng
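A hedged sketch of forward propagation through one hidden layer; the layer sizes, random weights, and sigmoid activations are illustrative assumptions, not the lecture's settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(4)                                    # input features (d = 4, illustrative)
W1, b1 = rng.standard_normal((3, 4)), np.zeros(3)    # input -> hidden weights and bias
W2, b2 = rng.standard_normal((2, 3)), np.zeros(2)    # hidden -> output weights and bias

z1 = W1 @ x + b1                 # hidden layer pre-activation (weighted sum)
a1 = sigmoid(z1)                 # hidden layer activation
z2 = W2 @ a1 + b2                # output layer pre-activation
a2 = sigmoid(z2)                 # network output
print(a2)
```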
Based on slide by Andrew Ng
Backpropagation: Compute Gradient
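A minimal sketch of backpropagation for the two-layer network above, using a squared-error loss and sigmoid units; this is the standard derivation rather than code from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, t = rng.random(4), np.array([1.0, 0.0])                 # input and target (illustrative)
W1, W2 = rng.standard_normal((3, 4)), rng.standard_normal((2, 3))

# forward pass
a1 = sigmoid(W1 @ x)
a2 = sigmoid(W2 @ a1)

# backward pass: propagate error signals layer by layer
delta2 = (a2 - t) * a2 * (1 - a2)           # output-layer error (squared loss * sigmoid')
delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # hidden-layer error

grad_W2 = np.outer(delta2, a1)              # gradient w.r.t. output weights
grad_W1 = np.outer(delta1, x)               # gradient w.r.t. hidden weights
print(grad_W1.shape, grad_W2.shape)         # (3, 4) (2, 3)
```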
This lecture
v Review: Neural Network
v Recurrent NN
v Representation learning in NLP
How to deal with input of variable size?
v Use the same parameters at every position
[Figure: an RNN reading a sentence one word at a time — input "<S> Today is … day", predicting the next words "Today is a … </S>".]
Recurrent Neural Networks
Unroll RNNs
[Figure: an RNN unrolled over time steps; the parameter matrices U and V are shared across every time step.]
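A minimal sketch of the unrolled computation. Here U is taken as the input-to-hidden matrix, W as the hidden-to-hidden matrix, and V as the hidden-to-output matrix; that assignment is an assumption, since the figure itself is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out, T = 5, 8, 3, 4
U = rng.standard_normal((d_h, d_in))     # input -> hidden (assumed role)
W = rng.standard_normal((d_h, d_h))      # hidden -> hidden, shared over time
V = rng.standard_normal((d_out, d_h))    # hidden -> output (assumed role)

xs = [rng.random(d_in) for _ in range(T)]   # a length-T input sequence
h = np.zeros(d_h)
for x_t in xs:
    h = np.tanh(U @ x_t + W @ h)         # the same parameters are reused at every step
    y_t = V @ h                          # per-step output
    print(y_t.shape)
```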
RNN training
v Backpropagation through time (BPTT)
Vanishing Gradients
v With the traditional activation functions, each gradient term has a value in the range (-1, 1).
v Computing the gradient multiplies n of these small numbers together.
v The longer the sequence, the more severe the problem (see the numeric sketch below).
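A toy numerical illustration of how repeatedly multiplying per-step derivative factors below 1 shrinks the gradient; the 0.9 factor and the sequence lengths are arbitrary choices:

```python
per_step_factor = 0.9          # magnitude of one gradient term, assumed < 1
for n in (5, 20, 50, 100):     # sequence lengths
    print(n, per_step_factor ** n)
# the product decays exponentially: by n = 100 it is ~2.7e-5,
# so gradients from distant time steps barely affect the update
```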
RNN characteristics
v Model dependencies among the hidden states (and the inputs)
v Errors are "back-propagated through time"
v A feature-learning method
v Vanishing gradient problem: cannot model long-distance dependencies between the hidden states
Long Short-Term Memory Networks (LSTMs)
v Use gates to control the information to be added from the input, forgotten from the previous memory, and output.
v σ and f denote the sigmoid and tanh functions respectively, mapping values to [0, 1] and [-1, 1] (a sketch of one cell update follows below).
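A hedged sketch of a single LSTM cell update using the standard gate equations; the weight shapes and initialization are illustrative, and this is the textbook formulation rather than code from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the four gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    d = len(h_prev)
    i = sigmoid(z[0:d])          # input gate: how much new information to add
    f = sigmoid(z[d:2*d])        # forget gate: how much old memory to keep
    o = sigmoid(z[2*d:3*d])      # output gate: how much memory to expose
    g = np.tanh(z[3*d:4*d])      # candidate memory content
    c = f * c_prev + i * g       # new cell memory
    h = o * np.tanh(c)           # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 5, 4
W = rng.standard_normal((4 * d_h, d_in + d_h))
b = np.zeros(4 * d_h)
h, c = lstm_step(rng.random(d_in), np.zeros(d_h), np.zeros(d_h), W, b)
print(h.shape, c.shape)
```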
Another Visualization
Figure credit: Christopher Olah
v Capable of modeling long-distance dependencies between states
Bidirectional LSTMs
How to deal with sequence output?
v Idea 1: combine DL with CRF
v Idea 2: introduce structure in DL
LSTMs for Sequential Tagging
v A sophisticated model of the input + local predictions
v Per-token prediction: $y_t = W h_t + b$
v Training objective: $\min \sum_t \ell(y_t, \hat{y}_t)$ (a sketch follows below)
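A compact sketch of those local predictions on top of per-token LSTM states; the hidden states here are random placeholders standing in for LSTM outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, n_tags = 6, 8, 5
H = rng.standard_normal((T, d_h))          # stand-in for LSTM hidden states h_1..h_T
W, b = rng.standard_normal((n_tags, d_h)), np.zeros(n_tags)
gold = rng.integers(0, n_tags, size=T)     # gold tag indices (placeholder labels)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

# per-token scores y_t = W h_t + b, summed cross-entropy over positions
loss = 0.0
for t in range(T):
    p = softmax(W @ H[t] + b)
    loss += -np.log(p[gold[t]])            # local loss l(y_t, y_hat_t)
print(loss)
```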
Recall CRFs for Sequential Tagging
v Arbitrary features on the input side
v Markov assumption on the output side
LSTMs for Sequential Tagging
v Completely ignores the interdependencies among the output labels
v Will this work? Yes.
v Liang et al. (2008), Structure Compilation: Trading Structure for Features
v Is this the best model? Not necessarily.
Combining CRFs with LSTMs
Traditional CRFs vs. LSTM-CRFs
v Traditional CRFs:
$$P(Y \mid X; \theta) = \frac{\prod_{i=1}^{n} \exp\big(\theta \cdot f(y_i, y_{i-1}, x_{1:n})\big)}{\sum_{Y'} \prod_{i=1}^{n} \exp\big(\theta \cdot f(y'_i, y'_{i-1}, x_{1:n})\big)}$$
v LSTM-CRFs:
$$P(Y \mid X; \Theta) = \frac{\prod_{i=1}^{n} \exp\big(\lambda \cdot f(y_i, y_{i-1}, \mathrm{LSTM}(x_{1:n}))\big)}{\sum_{Y'} \prod_{i=1}^{n} \exp\big(\lambda \cdot f(y'_i, y'_{i-1}, \mathrm{LSTM}(x_{1:n}))\big)}$$
v $\Theta = \{\lambda, \Omega\}$, where $\Omega$ are the LSTM parameters
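A rough sketch of how the unnormalized score of one tag sequence could be computed in the LSTM-CRF formulation, with emission scores from stand-in LSTM features and a transition matrix playing the role of λ·f(y_i, y_{i-1}, ·); all shapes and parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_h, n_tags = 5, 8, 4
H = rng.standard_normal((T, d_h))              # stand-in for LSTM(x_{1:n}) features
W_emit = rng.standard_normal((n_tags, d_h))    # maps features to per-tag emission scores
trans = rng.standard_normal((n_tags, n_tags))  # transition scores between adjacent tags
tags = [0, 2, 2, 1, 3]                         # one candidate tag sequence Y

# unnormalized log-score: sum of emission + transition terms over positions
score = 0.0
prev = None
for t, y in enumerate(tags):
    score += W_emit[y] @ H[t]                  # emission: feature score for tag y at position t
    if prev is not None:
        score += trans[prev, y]                # transition: score for the pair (y_{t-1}, y_t)
    prev = y
print(score)
# turning this into P(Y | X) would require normalizing over all tag sequences
# (e.g., with the forward algorithm), which is omitted here
```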
Combining Two Benefits
v The LSTM and the CRF split the "modeling responsibilities": the LSTM learns features of the input, while the CRF models the dependencies between output labels.
Transfer Learning with LSTM-CRFs
v Neural networks act as a feature learner
v Share the feature learner across different tasks
v Jointly train the feature learner so that it learns the common features
v Use a different CRF for each task to encode task-specific information
v Going forward, one can imagine using other graphical models besides linear chain CRFs.
Transfer Learning: CWS (Chinese word segmentation) + NER
[Figure: the CWS and NER models share the LSTM feature layer; each task has its own CRF output layer.]
Joint Training
v Simply combine the two objectives linearly.
v Alternate updates between the two modules' parameters (toy sketch below).
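A toy, runnable illustration of the joint-training idea: two tasks share a parameter vector (playing the role of the shared LSTM), each task has its own parameters (playing the role of its CRF), and updates alternate between the tasks on a linear combination of stand-in objectives. Everything here is a placeholder, not the lecture's training code:

```python
import numpy as np

rng = np.random.default_rng(0)
w_shared = rng.standard_normal(3)   # shared feature learner (stand-in for the LSTM)
w_task1 = rng.standard_normal(3)    # task-specific parameters (e.g., the CWS CRF)
w_task2 = rng.standard_normal(3)    # task-specific parameters (e.g., the NER CRF)
alpha, lr = 0.5, 0.1                # mixing weight and learning rate (arbitrary)

def grad_task(w_shared, w_task):
    # gradient of a stand-in quadratic objective ||w_shared + w_task||^2
    g = 2 * (w_shared + w_task)
    return g, g

for step in range(100):
    if step % 2 == 0:               # alternate: update task 1's module
        g_s, g_t = grad_task(w_shared, w_task1)
        w_shared -= lr * alpha * g_s
        w_task1 -= lr * alpha * g_t
    else:                           # then task 2's module
        g_s, g_t = grad_task(w_shared, w_task2)
        w_shared -= lr * (1 - alpha) * g_s
        w_task2 -= lr * (1 - alpha) * g_t
print(np.linalg.norm(w_shared + w_task1), np.linalg.norm(w_shared + w_task2))
```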
How to deal with sequence output?
v Idea 1: combine DL with CRF
v Idea 2: introduce structure in DL