SLIDE 1

CS11-747 Neural Networks for NLP

Introduction, Bag-of-words, and Multi-layer Perceptron

Graham Neubig

Site: https://phontron.com/class/nn4nlp2020/

SLIDE 2

Language is Hard!

SLIDE 3

Are These Sentences OK?

  • Jane went to the store.
  • store to Jane went the.
  • Jane went store.
  • Jane goed to the store.
  • The store went to Jane.
  • The food truck went to Jane.
SLIDE 4

Engineering Solutions

  • Jane went to the store.
  • store to Jane went the. } Create a grammar of the language
  • Jane went store.
  • Jane goed to the store. } Consider morphology and exceptions
  • The store went to Jane. } Semantic categories, preferences
  • The food truck went to Jane. } And their exceptions

SLIDE 5

Are These Sentences OK?

  • ジェインは店へ行った。(Jane went to the store.)
  • は店行ったジェインは。(scrambled word order)
  • ジェインは店へ行た。(wrong verb form, like "goed")
  • 店はジェインへ行った。(The store went to Jane.)
  • 屋台はジェインのところへ行った。(The food truck went to Jane.)
SLIDE 6

Phenomena to Handle

  • Morphology
  • Syntax
  • Semantics/World Knowledge
  • Discourse
  • Pragmatics
  • Multilinguality
SLIDE 7

Neural Nets for NLP

  • Neural nets are a tool to do hard things!
  • This class will give you the tools to handle the problems you want to solve in NLP.

SLIDE 8

Class Format/Structure

SLIDE 9

Class Format

  • Before class: Read material on the topic
  • During class:
    • Quiz: Simple questions about the required reading (should be easy)
    • Summary/Questions/Elaboration: Instructor or TAs will summarize the material, field questions, elaborate on details, and talk about advanced topics
    • Code Walk: The TAs (or instructor) will sometimes walk through some demonstration code or equations
  • After class: Review the code, try to run/modify it yourself. Visit office hours to talk about questions, etc.
SLIDE 10

Scope of Teaching

  • Basics of general neural network knowledge
    > Covered briefly (see reading and ask TAs if you are not familiar). Will have recitation.
  • Advanced training techniques for neural networks
    > Some coverage, like VAEs and adversarial training, mostly from the scope of NLP; not as much as other DL classes
  • Advanced NLP-related neural network architectures
    > Covered in detail
  • Structured prediction and structured models in neural nets
    > Covered in detail
  • Implementation details salient to NLP
    > Covered in detail
SLIDE 11

Assignments

  • Course is largely group (2-3 person) assignment based
  • Assignment 1 - Text Classifier / Questionnaire: Individually implement a text classifier and fill in a questionnaire on project topics
  • Assignment 2 - SOTA Survey: Survey your project topic and describe the state of the art
  • Assignment 3 - SOTA Re-implementation: Re-implement and reproduce results from a state-of-the-art model
  • Assignment 4 - Final Project: Perform a unique project that either (1) improves on the state of the art, or (2) applies neural net models to a unique task

SLIDE 12

Instructors/Office Hours

  • Instructors:
    • Graham Neubig (Fri. 4-5PM, GHC 5409)
    • Pengfei Liu (Wed. 2-3PM, GHC 6607)
  • TAs:
    • Aditi Chaudhary (Mon. 10-11AM, GHC 6509)
    • Chunting Zhou (Fri. 10-11AM, GHC 5705)
    • Hiroaki Hayashi (Thu. 11AM-12PM, GHC 5705)
    • Pengcheng Yin (Wed. 10-11AM, GHC 5505)
    • Vidhisha Balachandran (Tue. 10-11AM, GHC 5713)
    • Zi-Yi Dou (Tue. 12-1PM, GHC 5417)
  • Piazza: http://piazza.com/cmu/spring2020/cs11747/home
SLIDE 13

Neural Networks: A Tool for Doing Hard Things

SLIDE 14

An Example Prediction Problem: Sentence Classification

[Figure: the sentences "I hate this movie" and "I love this movie", each to be classified as one of {very good, good, neutral, bad, very bad}]

SLIDE 15

A First Try: Bag of Words (BOW)

[Figure: each word of "I hate this movie" is looked up to get a score vector; the vectors plus a bias are summed into scores, and a softmax turns the scores into probabilities]
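A minimal sketch of such a BOW classifier in Python/numpy (my own toy example with a four-word vocabulary, not the course's code):

```python
import numpy as np

LABELS = ["very good", "good", "neutral", "bad", "very bad"]
vocab = {"i": 0, "hate": 1, "this": 2, "movie": 3}  # toy vocabulary

W = np.zeros((len(vocab), len(LABELS)))  # one 5-element score vector per word
b = np.zeros(len(LABELS))                # bias: a prior over the labels

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def predict(words):
    scores = b + sum(W[vocab[w]] for w in words)  # lookup + lookup + ... + bias
    return softmax(scores)

print(predict("i hate this movie".split()))  # uniform until W and b are trained
```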

SLIDE 16

What do Our Vectors Represent?

  • Each word has its own 5 elements corresponding to [very good, good, neutral, bad, very bad]
  • "hate" will have a high value for "very bad", etc.
SLIDE 17

Build It, Break It

[Figure: "There's nothing I don't love about this movie" and "I don't love this movie", each scored over {very good, good, neutral, bad, very bad}; a bag of words sees nearly the same words in both, despite their opposite sentiment]

SLIDE 18

Combination Features

  • Does it contain “don’t” and “love”?
  • Does it contain “don’t”, “i”, “love”, and “nothing”?
SLIDE 19

Basic Idea of Neural Networks (for NLP Prediction Tasks)

[Figure: each word of "I hate this movie" is looked up; some complicated function (a neural net) extracts combination features and computes scores, then a softmax gives probabilities]

SLIDE 20

Continuous Bag of Words (CBOW)

[Figure: each word of "I hate this movie" is looked up to get a dense vector; the vectors are summed, multiplied by a weight matrix W, and added to a bias to give scores]
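A minimal CBOW sketch in PyTorch (my own example; the embedding size of 64 and the 5-way label set are assumptions for illustration):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_size=64, n_labels=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)  # "lookup"
        self.out = nn.Linear(emb_size, n_labels)       # W*h + bias = scores

    def forward(self, word_ids):
        h = self.emb(word_ids).sum(dim=0)  # + + + : sum the word vectors
        return self.out(h)                 # scores (softmax applied in the loss)
```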

SLIDE 21

What do Our Vectors Represent?

  • Each vector has "features" (e.g. is this an animate object? is this a positive word? etc.)
  • We sum these features, then use them to make predictions
  • Still no combination features: only the expressive power of a linear model, but with reduced dimensionality

SLIDE 22

Deep CBOW

[Figure: the summed word vectors of "I hate this movie" pass through two hidden layers, tanh(W1*h + b1) and tanh(W2*h + b2), then W and a bias give scores]
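A corresponding Deep CBOW sketch (same assumptions as the CBOW example above, with two tanh hidden layers added):

```python
import torch
import torch.nn as nn

class DeepCBOW(nn.Module):
    def __init__(self, vocab_size, emb_size=64, hid_size=64, n_labels=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.h1 = nn.Linear(emb_size, hid_size)   # tanh(W1*h + b1)
        self.h2 = nn.Linear(hid_size, hid_size)   # tanh(W2*h + b2)
        self.out = nn.Linear(hid_size, n_labels)  # W*h + bias = scores

    def forward(self, word_ids):
        h = self.emb(word_ids).sum(dim=0)
        h = torch.tanh(self.h1(h))
        h = torch.tanh(self.h2(h))
        return self.out(h)
```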

SLIDE 23

What do Our Vectors Represent?

  • Now things are more interesting!
  • We can learn feature combinations (a node in the second layer might be "feature 1 AND feature 5 are active")
  • e.g. capture things such as "not" AND "hate"
SLIDE 24

What is a Neural Net?: Computation Graphs

SLIDE 25

“Neural” Nets

Original Motivation: Neurons in the Brain

Image credit: Wikipedia

Current Conception: Computation Graphs

[Figure: a computation graph with value nodes x, A, b, c and function nodes f(u) = uᵀ, f(U, V) = UV, f(M, v) = Mv, f(u, v) = u · v, and f(x₁, x₂, x₃) = Σᵢ xᵢ]

SLIDE 26

expression: y = xᵀAx + b · x + c

graph: [Figure: a single node x]

A node is a {tensor, matrix, vector, scalar} value.

SLIDE 27

expression: y = xᵀAx + b · x + c

graph: [Figure: node x feeding f(u) = uᵀ]

An edge represents a function argument (and also a data dependency). Edges are just pointers to nodes. A node with an incoming edge is a function of that edge's tail node.

A node knows how to compute its value, and the value of its derivative w.r.t. each argument (edge) times the derivative of an arbitrary input F, ∂F/∂f(u). For f(u) = uᵀ:

∂f(u)/∂u · ∂F/∂f(u) = (∂F/∂f(u))ᵀ

SLIDE 28

expression: y = xᵀAx + b · x + c

graph: [Figure: nodes x and A, with function nodes f(u) = uᵀ and f(U, V) = UV]

Functions can be nullary, unary, binary, … n-ary. Often they are unary or binary.

SLIDE 29

expression: y = xᵀAx + b · x + c

graph: [Figure: the graph grows with f(M, v) = Mv]

Computation graphs are directed and acyclic (in DyNet).

SLIDE 30

expression: y = xᵀAx + b · x + c

graph: [Figure: the subgraph over x and A computes f(x, A) = xᵀAx]

∂f(x, A)/∂A = xxᵀ        ∂f(x, A)/∂x = (Aᵀ + A)x
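These gradients can be checked numerically; here is a quick finite-difference sanity check for ∂f/∂x (a snippet of my own, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

def f(v):
    return v @ A @ v            # f(x, A) = x^T A x

analytic = (A.T + A) @ x        # the claimed gradient w.r.t. x

eps = 1e-6
numeric = np.array([
    (f(x + eps * np.eye(3)[i]) - f(x - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```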

SLIDE 31

expression: y = xᵀAx + b · x + c

graph: [Figure: the full graph, now with b, c, f(u, v) = u · v, and f(x₁, x₂, x₃) = Σᵢ xᵢ]

SLIDE 32

expression: y = xᵀAx + b · x + c

graph: [Figure: the full graph with its output node labeled y]

Variable names are just labelings of nodes.

SLIDE 33

Algorithms (1)

  • Graph construction
  • Forward propagation:
    • In topological order, compute the value of the node given its inputs

SLIDES 34-41

Forward Propagation

[Figure: the graph for y = xᵀAx + b · x + c evaluated in topological order; successive frames fill in xᵀ, then xᵀA, then b · x, then xᵀAx, and finally xᵀAx + b · x + c]
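The same forward pass written out in topological order with small concrete values (a sketch of my own, not the course's code):

```python
import numpy as np

x = np.array([1.0, 2.0])
A = np.eye(2)
b = np.array([3.0, 4.0])
c = 5.0

xT   = x.T             # f(u) = uᵀ (a no-op for 1-D numpy arrays)
xTA  = xT @ A          # f(U, V) = UV
bx   = b @ x           # f(u, v) = u · v
xTAx = xTA @ x         # f(M, v) = Mv, collapsing to a scalar here
y    = xTAx + bx + c   # f(x₁, x₂, x₃) = Σᵢ xᵢ
print(y)               # 5.0 + 11.0 + 5.0 = 21.0
```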

SLIDE 42

Algorithms (2)

  • Back-propagation:
    • Process nodes in reverse topological order
    • Calculate the derivatives of the final value with respect to the parameters (the final value is usually a "loss function", a value we want to minimize)
  • Parameter update:
    • Move the parameters in the opposite direction of this derivative: W -= α * dl/dW
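A sketch of this back-propagation and update step using PyTorch autograd (my own toy example: a single linear scoring layer W and a cross-entropy loss):

```python
import torch

W = torch.randn(5, 3, requires_grad=True)  # a parameter: 5 labels x 3 features
x = torch.randn(3)                         # an input
target = torch.tensor([2])                 # gold label

loss = torch.nn.functional.cross_entropy((W @ x).unsqueeze(0), target)
loss.backward()                            # back-propagation: fills W.grad with dl/dW

alpha = 0.1                                # learning rate
with torch.no_grad():
    W -= alpha * W.grad                    # parameter update: W -= α * dl/dW
    W.grad.zero_()                         # clear the gradient for the next example
```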

SLIDE 43

Concrete Implementation Examples

SLIDE 44

Neural Network Frameworks

[Figure: framework logos, grouped into static frameworks and dynamic frameworks (recommended!), plus the hybrid modes "+Gluon" and "+Eager"]

SLIDE 45

Basic Process in Dynamic Neural Network Frameworks

  • Create a model
  • For each example:
    • create a graph that represents the computation you want
    • calculate the result of that computation
    • if training, perform back-propagation and update
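This loop, sketched in PyTorch (the toy model and data below are my own assumptions, not the course's code):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):                     # the CBOW sketch from slide 20, abbreviated
    def __init__(self, vocab_size=4, emb_size=16, n_labels=5):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.out = nn.Linear(emb_size, n_labels)
    def forward(self, ids):
        return self.out(self.emb(ids).sum(dim=0))

model = CBOW()                                     # create a model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
train_data = [(torch.tensor([0, 1, 2, 3]), 4)]     # toy: "i hate this movie" -> "very bad"

for word_ids, label in train_data:                 # for each example:
    scores = model(word_ids)                       #   build the graph, compute the result
    loss = nn.functional.cross_entropy(scores.unsqueeze(0), torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()                                #   back-propagation
    optimizer.step()                               #   update
```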

SLIDE 46

Bag of Words (BOW)

[Figure: as on slide 15: word lookups summed with a bias into scores, then softmax into probabilities]

SLIDE 47

Continuous Bag of Words (CBOW)

[Figure: as on slide 20: dense word vectors summed, multiplied by W, plus a bias, giving scores]

SLIDE 48

Deep CBOW

[Figure: as on slide 22: two tanh hidden layers, tanh(W1*h + b1) and tanh(W2*h + b2), between the summed vectors and the scores]

SLIDE 49

Things to Remember Going Forward

SLIDE 50

Things to Remember

  • Neural nets are powerful!
  • They are universal function approximators: they can approximate any continuous function
  • But language is hard, and data is limited
  • We need to design our networks to have inductive bias, to make it easy to learn the things we'd like to learn

SLIDE 51

Class Plan

SLIDE 52

Topic 1: Models of Sentences/Sequences

  • Bag of words, bag of n-grams
  • Convolutional nets
  • Recurrent neural networks and variations
  • Modeling documents and longer texts

[Figure: a neural net predicting the next word "undeserved" after "this movie's reputation is"]

SLIDE 53

Topic 2: Implementing, Debugging, and Interpreting

  • Implementation: How to efficiently and effectively implement your models
  • Debugging: How to find problems in your implemented models
  • Interpretation: How to find out why your model made a prediction

Example: [Ribeiro+ 16]

SLIDE 54

Topic 3: Conditioned Generation

[Figure: an encoder-decoder model: an encoder LSTM reads "I hate this movie" and a decoder LSTM generates its Japanese translation "この 映画 が 嫌い </s>", choosing each output word by argmax]

  • Encoder decoder models
  • Attentional models, self-attention (Transformers)
SLIDE 55

Topic 4: Pre-trained Embeddings

  • Pre-training word embeddings, contextualized word embeddings, sentence embeddings
  • Design decisions in pre-training: model, data, objective
SLIDE 56

Topic 5: Structured Prediction Models

[Figure: an LSTM tagger labeling "I hate this movie" with the POS tags PRP VB DT NN]

  • CRFs, and other marginalization-based training
  • REINFORCE, minimum risk training
  • Margin-based and search-based training methods
  • Advanced search algorithms
SLIDE 57

Topic 6: Models of Tree/Graph Structures

  • Shift reduce, minimum spanning tree parsing
  • Tree structured compositions
  • Models of graph structures

[Figure: a tree-structured RNN composing the words of "I hate this movie"]

SLIDE 58

Topic 7: Advanced Learning Techniques

  • Models with Latent Random Variables
  • Adversarial Networks
  • Semi-supervised and Unsupervised Learning
SLIDE 59

Topic 8: Knowledge-based and Text-based QA

  • Learning and QA over knowledge graphs
  • Machine reading and text-based QA

[Figure: a small knowledge graph: "dog" is-a "animal", "cat" is-a "animal"]

SLIDE 60

Topic 9: Multi-task and Multilingual Learning

  • Multi-task and transfer learning
  • Multilingual learning of representations

[Figure: "I hate this movie", its Japanese translation "この 映画 が 嫌い", and the POS tags PRP VB DT NN as related tasks]

SLIDE 61

Any Questions?