

SLIDE 1

Lecture 5: Representation Learning

Kai-Wei Chang, CS @ UCLA
kw@kwchang.net
Course webpage: https://uclanlp.github.io/CS269-17/


SLIDE 2

This lecture

• Review: Neural Network
• Recurrent NN
• Representation learning in NLP


SLIDE 3

Neural Network

Based on slide by Andrew Ng

SLIDE 4

Neural Network (feed forward)

Slide by Andrew Ng

SLIDE 5

Feed-Forward Process

Based on slide by T. Finin, M. desJardins, L. Getoor, R. Parr

• Input layer units are features (in NLP, e.g., words)
  • Usually a one-hot vector or a word embedding
• Working forward through the network, the input function is applied to compute each unit's input value
  • E.g., a weighted sum of the inputs
• The activation function transforms this input value into the unit's final output value
  • Typically a nonlinear function (e.g., sigmoid)
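Below is a minimal NumPy sketch of this process for one hidden layer, assuming a toy vocabulary, layer size, and randomly chosen weights (all invented for illustration): a one-hot input, a weighted sum, then a sigmoid activation.

```python
import numpy as np

def sigmoid(z):
    # Nonlinear activation: squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Toy vocabulary; in NLP the input units typically correspond to words
vocab = ["the", "movie", "was", "great"]
x = np.zeros(len(vocab))
x[vocab.index("great")] = 1.0          # one-hot vector for the word "great"

rng = np.random.default_rng(0)
W = rng.normal(size=(3, len(vocab)))   # weights of a 3-unit hidden layer
b = np.zeros(3)                        # bias terms

z = W @ x + b                          # input function: weighted sum of the inputs
h = sigmoid(z)                         # activation function: final value of each unit
print(h)
```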

SLIDE 6

Slide by Andrew Ng

SLIDE 7

Vector Representation

Based on slide by Andrew Ng

SLIDE 8

Can extend to multi-class

Output classes: Pedestrian, Car, Motorcycle, Truck

Slide by Andrew Ng

SLIDE 9

Why staged predictions?

Based on slide and example by Andrew Ng

SLIDE 10

Representing Boolean Functions


SLIDE 11

Combining Representations to Create Non-Linear Functions

Based on example by Andrew Ng
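To make this concrete, here is a small sketch in the spirit of the Andrew Ng example the slide draws on (the specific weights are the usual textbook choice, not taken from the slide): a single sigmoid unit with large weights behaves like a Boolean gate, and combining such units across two layers represents the non-linear XNOR function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each unit computes sigmoid(w0 + w1*x1 + w2*x2); with large weights the
# output saturates near 0 or 1, so the unit acts like a Boolean gate.
def AND(x1, x2):      return sigmoid(-30 + 20 * x1 + 20 * x2)
def OR(x1, x2):       return sigmoid(-10 + 20 * x1 + 20 * x2)
def NOT_BOTH(x1, x2): return sigmoid( 10 - 20 * x1 - 20 * x2)  # (NOT x1) AND (NOT x2)

def XNOR(x1, x2):
    # Non-linear function obtained by combining two hidden units with an output unit
    return OR(AND(x1, x2), NOT_BOTH(x1, x2))

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(XNOR(x1, x2)))   # prints 1 exactly when x1 == x2
```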

SLIDE 12

Layering Representations

Input: 20 × 20 pixel images (features x1 … x400, d = 400); output: 10 classes.

Each image is “unrolled” into a vector x of pixel intensities.

SLIDE 13

Layering Representations

[Figure: network with input layer x1 … xd, a hidden layer, and an output layer over the digit classes “0” … “9”, together with a visualization of the learned hidden layer]

SLIDE 14

This lecture

• Review: Neural Network
• Learning NN
• Recursive and Recurrent NN
• Representation learning in NLP


SLIDE 15

Stochastic Sub-gradient Descent


Given a training set D = {(x, y)}:

1. Initialize w ← 0 ∈ ℝⁿ
2. For epoch 1 … T:
3.   For (x, y) in D:
4.     Update w ← w − γ_t g_t, where g_t is a sub-gradient of the loss on (x, y)
5. Return w
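A minimal NumPy sketch of this loop, using a regularized hinge loss as the example objective (the loss, learning rate, and synthetic data are choices made for the illustration; the slide does not fix them):

```python
import numpy as np

def ssgd(data, dim, epochs=10, lr=0.1, lam=0.01):
    """Stochastic sub-gradient descent over examples (x, y) with y in {-1, +1}."""
    w = np.zeros(dim)                         # 1. initialize w <- 0
    for _ in range(epochs):                   # 2. for each epoch
        np.random.shuffle(data)
        for x, y in data:                     # 3. for each training example
            # sub-gradient of lam/2 * ||w||^2 + max(0, 1 - y * w.x)
            g = lam * w
            if y * w.dot(x) < 1:
                g = g - y * x
            w = w - lr * g                    # 4. update step
    return w                                  # 5. return the learned weights

# Tiny synthetic dataset: positive points around +1, negative points around -1
rng = np.random.default_rng(0)
data = [(rng.normal(loc=y, size=3), y) for y in (+1, -1) for _ in range(20)]
print(ssgd(data, dim=3))
```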

SLIDE 16

Recap: Logistic regression

$$\min_{w}\ \frac{\lambda}{2n}\, w^\top w \;+\; \frac{1}{n}\sum_{i}\log\big(1 + \exp(-y_i\, w^\top x_i)\big)$$

Let $h_w(x_i) = 1/(1 + \exp(-w^\top x_i))$ (the probability that $y_i = 1$ given $x_i$). Writing the labels as $y_i \in \{0, 1\}$, the same objective is

$$\frac{\lambda}{2n}\, w^\top w \;-\; \frac{1}{n}\sum_{i}\Big[\, y_i \log h_w(x_i) + (1 - y_i)\log\big(1 - h_w(x_i)\big) \Big]$$

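A short NumPy check (on synthetic data, purely to illustrate the computation) that the two forms above evaluate to the same number:

```python
import numpy as np

def logistic_objective(w, X, y, lam):
    """Regularized logistic loss with labels y in {-1, +1}."""
    n = len(y)
    margins = y * (X @ w)
    return lam / (2 * n) * (w @ w) + np.mean(np.log1p(np.exp(-margins)))

def cross_entropy_objective(w, X, y01, lam):
    """Same objective written with h_w(x) = sigmoid(w.x) and labels y in {0, 1}."""
    n = len(y01)
    h = 1.0 / (1.0 + np.exp(-(X @ w)))
    ce = -np.mean(y01 * np.log(h) + (1 - y01) * np.log(1 - h))
    return lam / (2 * n) * (w @ w) + ce

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=50))
w = rng.normal(size=4)

print(logistic_objective(w, X, y, lam=0.1))
print(cross_entropy_objective(w, X, (y + 1) / 2, lam=0.1))   # identical value
```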

SLIDE 17

Cost Function

Based on slide by Andrew Ng

The cost is a loss term plus a regularization term:

$$f(w) = \ell(w) + R(w), \qquad R(w) = \lambda\, w^\top w$$

SLIDE 18

Optimizing the Neural Network

Based on slide by Andrew Ng

SLIDE 19

Forward Propagation

Based on slide by Andrew Ng

SLIDE 20

Backpropagation: Compute Gradient

Based on slide by Andrew Ng
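A minimal sketch (not the slide's exact derivation) of backpropagation for a tiny two-layer sigmoid network with squared error: the forward pass caches each layer's activations, and the backward pass applies the chain rule layer by layer. All sizes and data are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)                        # input features
t = np.array([1.0])                           # target output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)

# Forward propagation: compute and cache the activations
z1 = W1 @ x + b1;  a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)
loss = 0.5 * np.sum((a2 - t) ** 2)

# Backpropagation: the output error flows backwards through the same weights
delta2 = (a2 - t) * a2 * (1 - a2)             # output-layer error
grad_W2 = np.outer(delta2, a1)                # gradient w.r.t. W2
grad_b2 = delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)      # hidden-layer error via the chain rule
grad_W1 = np.outer(delta1, x)                 # gradient w.r.t. W1
grad_b1 = delta1

print(loss, grad_W1.shape, grad_W2.shape)
```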

SLIDE 21

This lecture

• Review: Neural Network
• Recurrent NN
• Representation learning in NLP


SLIDE 22

How to deal with input of varying size?

• Use the same parameters at every position


[Figure: language-modeling example — input “<S> Today is … day”, predicted output “Today is a … </S>”]

SLIDE 23

Recurrent Neural Networks

SLIDE 24

Recurrent Neural Networks

SLIDE 25

Unroll RNNs

[Figure: an RNN unrolled over time, reusing the same parameter matrices U and V at every step]
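A minimal NumPy sketch of the unrolled computation, with the shared matrices named U and V after the figure's labels (their exact roles in the figure are an assumption here) and random vectors standing in for word embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim, seq_len = 5, 4, 6

U = rng.normal(size=(hid_dim, emb_dim))   # input-to-hidden weights (shared)
V = rng.normal(size=(hid_dim, hid_dim))   # hidden-to-hidden weights (shared)

xs = rng.normal(size=(seq_len, emb_dim))  # stand-ins for word embeddings
h = np.zeros(hid_dim)                     # initial hidden state
states = []
for x_t in xs:                            # unrolling over time
    h = np.tanh(U @ x_t + V @ h)          # the same U and V are reused at every step
    states.append(h)

print(np.stack(states).shape)             # (seq_len, hid_dim)
```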

SLIDE 26

RNN training

• Back-propagation over time (i.e., back-propagation through time, BPTT)

SLIDE 27

Vanishing Gradients

• For traditional activation functions, each gradient term has a value in the range (−1, 1).
• Computing the gradient multiplies n of these small numbers together.
• The longer the sequence, the more severe the problem.
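A quick numeric illustration of this effect (my own, not from the slides): with tanh activations the per-step factor tanh′(z) lies in (0, 1], so the product over a long sequence shrinks toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
zs = rng.normal(size=100)            # pre-activations along a length-100 sequence
per_step = 1.0 - np.tanh(zs) ** 2    # tanh'(z), one factor per backward step

for n in (5, 20, 50, 100):
    print(n, np.prod(per_step[:n]))  # shrinks rapidly as the sequence grows
```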

SLIDE 28

RNN characteristics

• Model hidden-state (input) dependencies
• Errors are “back-propagated over time”
• A feature-learning method
• Vanishing gradient problem: cannot model long-distance dependencies between the hidden states

SLIDE 29

Long Short-Term Memory Networks (LSTMs)

Gates control the information to be added from the input, forgotten from the previous memory, and output. σ and f denote the sigmoid and tanh functions, which map values into [0, 1] and [−1, 1], respectively.
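A minimal single-step LSTM sketch following the standard formulation (the gate names and the omission of bias terms are simplifications made here; the slide's exact variant may differ): sigmoid gates decide what to forget, what to add from the input, and what to output, and tanh squashes values into (−1, 1).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state h and memory cell c (biases omitted)."""
    Wf, Wi, Wo, Wc = params                 # one weight matrix per gate
    z = np.concatenate([x, h_prev])         # gates see the input and the previous state
    f = sigmoid(Wf @ z)                     # forget gate: what to drop from c_prev
    i = sigmoid(Wi @ z)                     # input gate: what to add from the input
    o = sigmoid(Wo @ z)                     # output gate: what to expose as h
    c_tilde = np.tanh(Wc @ z)               # candidate memory content
    c = f * c_prev + i * c_tilde            # update the memory cell
    h = o * np.tanh(c)                      # new hidden state
    return h, c

rng = np.random.default_rng(0)
dim, hid = 5, 4
params = [rng.normal(size=(hid, dim + hid)) for _ in range(4)]
h, c = lstm_step(rng.normal(size=dim), np.zeros(hid), np.zeros(hid), params)
print(h, c)
```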

SLIDE 30

Another Visualization

Figure credit: Christopher Olah

Capable of modeling long-distance dependencies between states.

SLIDE 31

Bidirectional LSTMs

SLIDE 32

How to deal with sequence output?

• Idea 1: combine DL with CRF
• Idea 2: introduce structure in DL


SLIDE 33

LSTMs for Sequential Tagging

Sophisticated model of input + local predictions:

$$y_t = W h_t + b, \qquad \min \sum_t \ell(y_t, \hat{y}_t)$$
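A sketch of these local predictions: each hidden state h_t is mapped independently to tag scores via y_t = W h_t + b, and the per-position losses are summed. The hidden states below are random placeholders; in practice they come from the (bi)LSTM.

```python
import numpy as np

def log_softmax(scores):
    s = scores - scores.max()
    return s - np.log(np.exp(s).sum())

rng = np.random.default_rng(0)
seq_len, hid_dim, n_tags = 7, 8, 3
H = rng.normal(size=(seq_len, hid_dim))    # placeholders for LSTM hidden states h_t
gold = rng.integers(n_tags, size=seq_len)  # gold tags

W = rng.normal(size=(n_tags, hid_dim))
b = np.zeros(n_tags)

loss = 0.0
for t in range(seq_len):
    y_t = W @ H[t] + b                     # local prediction scores: y_t = W h_t + b
    loss += -log_softmax(y_t)[gold[t]]     # sum of independent per-position losses
print(loss / seq_len)
```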

SLIDE 34

Recall CRFs for Sequential Tagging

• Arbitrary features on the input side
• Markov assumption on the output side

SLIDE 35

LSTMs for Sequential Tagging

• Completely ignores the interdependencies of the outputs
• Will this work? Yes.
  • Liang et al. (2008), Structure Compilation: Trading Structure for Features
• Is this the best model? Not necessarily.

SLIDE 36

Combining CRFs with LSTMs

SLIDE 37

Traditional CRFs vs. LSTM-CRFs

• Traditional CRFs:

$$P(Y \mid X; \theta) = \frac{\exp\Big(\sum_{i=1}^{n} \theta^\top f(y_i, y_{i-1}, x_{1:n})\Big)}{\sum_{Y'} \exp\Big(\sum_{i=1}^{n} \theta^\top f(y'_i, y'_{i-1}, x_{1:n})\Big)}$$

• LSTM-CRFs:

$$P(Y \mid X; \Theta) = \frac{\exp\Big(\sum_{i=1}^{n} \lambda^\top f(y_i, y_{i-1}, \mathrm{LSTM}(x_{1:n}))\Big)}{\sum_{Y'} \exp\Big(\sum_{i=1}^{n} \lambda^\top f(y'_i, y'_{i-1}, \mathrm{LSTM}(x_{1:n}))\Big)}$$

Θ = {λ, Ω}, where Ω denotes the LSTM parameters.
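A sketch of how the numerator of the LSTM-CRF probability above is typically scored in the linear-chain case: emission scores (a function of LSTM(x_{1:n}), represented by random placeholders here) plus transition scores between adjacent tags. The denominator (the sum over all tag sequences) would be computed with the forward algorithm, which is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, n_tags = 6, 4

# In an LSTM-CRF, emissions come from the learned LSTM features of x_{1:n};
# random placeholders stand in for them in this sketch.
emissions = rng.normal(size=(seq_len, n_tags))    # score of tag y_i at position i
transitions = rng.normal(size=(n_tags, n_tags))   # score of moving from y_{i-1} to y_i

def sequence_score(tags):
    """Unnormalized log-score of one tag sequence (the numerator of P(Y | X))."""
    score = emissions[0, tags[0]]
    for i in range(1, seq_len):
        score += transitions[tags[i - 1], tags[i]] + emissions[i, tags[i]]
    return score

Y = rng.integers(n_tags, size=seq_len)
print(sequence_score(Y))
# P(Y | X) = exp(sequence_score(Y)) / sum over all Y' of exp(sequence_score(Y')),
# where the normalizer is computed efficiently with the forward algorithm.
```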

SLIDE 38

Combining Two Benefits

• Directly model output dependencies by CRFs.
• Powerful automatic feature learning using biLSTMs.
• Jointly training all the parameters to “share the modeling responsibilities”.

SLIDE 39

Transfer Learning with LSTM-CRFs

• Neural networks as the feature learner
• Share the feature learner across different tasks
• Jointly train the feature learner so that it learns the common features
• Use different CRFs for different tasks to encode task-specific information

• Going forward, one can imagine using other graphical models besides linear-chain CRFs.

SLIDE 40

Transfer Learning: CWS (Chinese word segmentation) + NER

[Figure: the feature-learning layers are shared between the two tasks]

SLIDE 41

Joint Training

• Simply linearly combine the two objectives.
• Alternate updates for each module’s parameters.
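A schematic sketch of these two options, with toy quadratic losses standing in for the two task objectives (everything here, including the weighting scheme, is invented for illustration): a shared parameter vector receives gradient signal from both tasks, while the task-specific modules are updated in alternation.

```python
import numpy as np

# Stand-in parameters: a shared feature learner plus one task-specific module each.
shared = np.zeros(3)
task_a = np.zeros(3)                       # e.g., CRF parameters for task A
task_b = np.zeros(3)                       # e.g., CRF parameters for task B

# Toy quadratic objectives standing in for the two task losses.
target_a, target_b = np.array([1.0, 0.0, -1.0]), np.array([0.5, 0.5, 0.5])
loss_a = lambda: np.sum((shared + task_a - target_a) ** 2)
loss_b = lambda: np.sum((shared + task_b - target_b) ** 2)

lr, alpha = 0.1, 0.5                       # alpha weights the linear combination
for step in range(200):
    if step % 2 == 0:                      # alternating updates: task A on even steps
        g = 2 * (shared + task_a - target_a)
        task_a -= lr * g                   # task-specific parameters
        shared -= lr * alpha * g           # shared feature learner gets both signals
    else:                                  # ... task B on odd steps
        g = 2 * (shared + task_b - target_b)
        task_b -= lr * g
        shared -= lr * (1 - alpha) * g

print(loss_a(), loss_b())                  # both objectives decrease during training
```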

SLIDE 42

How to deal with sequence output?

• Idea 1: combine DL with CRF
• Idea 2: introduce structure in DL


SLIDE 43


SLIDE 44


SLIDE 45


SLIDE 46


SLIDE 47
