

SLIDE 1

CS11-747 Neural Networks for NLP

Convolutional Networks for Text

Graham Neubig

Site: https://phontron.com/class/nn4nlp2017/

SLIDE 2

An Example Prediction Problem: Sentence Classification

Examples: “I hate this movie”, “I love this movie”

Labels: very good / good / neutral / bad / very bad

SLIDE 3

A First Try: Bag of Words (BOW)

“I hate this movie”

Look up a vector for each word, sum the vectors, and add a bias to get scores; a softmax over the scores gives the probabilities.
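The lookup–sum–bias–softmax pipeline above can be sketched in a few lines of numpy. The vocabulary, random embedding values, and the name `bow_probs` are illustrative stand-ins, not the course's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"I": 0, "hate": 1, "this": 2, "movie": 3}
num_labels = 5  # very good / good / neutral / bad / very bad

# In BOW, each word directly looks up a vector of per-label scores.
emb = rng.normal(size=(len(vocab), num_labels))
bias = np.zeros(num_labels)

def bow_probs(words):
    scores = sum(emb[vocab[w]] for w in words) + bias  # lookup, sum, add bias
    exp = np.exp(scores - scores.max())                # numerically stable softmax
    return exp / exp.sum()

probs = bow_probs("I hate this movie".split())
```

Note that each word contributes the same score vector regardless of context, which is exactly the weakness the following slides probe.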

SLIDE 4

Build It, Break It

“There’s nothing I don’t love about this movie”
(very good / good / neutral / bad / very bad)

“I don’t love this movie”
(very good / good / neutral / bad / very bad)

SLIDE 5

Continuous Bag of Words (CBOW)

“I hate this movie”

Look up an embedding for each word and sum them; multiplying the sum by a weight matrix W and adding a bias gives the scores.

SLIDE 6

Deep CBOW

“I hate this movie”

Sum the word embeddings, then apply the nonlinear layers tanh(W1*h + b1) and tanh(W2*h + b2); a final linear layer W plus bias gives the scores.
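The deep CBOW forward pass can be written out directly. Dimensions, vocabulary, and the random initialization below are made-up stand-ins for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"I": 0, "hate": 1, "this": 2, "movie": 3}
emb_dim, hid_dim, num_labels = 8, 6, 5

emb = rng.normal(size=(len(vocab), emb_dim))
W1, b1 = rng.normal(size=(hid_dim, emb_dim)), np.zeros(hid_dim)
W2, b2 = rng.normal(size=(hid_dim, hid_dim)), np.zeros(hid_dim)
W, b = rng.normal(size=(num_labels, hid_dim)), np.zeros(num_labels)

def deep_cbow_scores(words):
    h = sum(emb[vocab[w]] for w in words)  # CBOW: sum of word embeddings
    h = np.tanh(W1 @ h + b1)               # tanh(W1*h + b1)
    h = np.tanh(W2 @ h + b2)               # tanh(W2*h + b2)
    return W @ h + b                       # final linear layer -> label scores

scores = deep_cbow_scores("I hate this movie".split())
```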

SLIDE 7

What do Our Vectors Represent?

  • We can learn feature combinations (a node in the second layer might be “feature 1 AND feature 5 are active”)
  • e.g. capture things such as “not” AND “hate”
  • BUT! Cannot handle “not hate”
SLIDE 8

Handling Combinations

SLIDE 9

Bag of n-grams

“I hate this movie”

Look up a vector for each n-gram, sum them and add a bias to get scores; a softmax over the scores gives the probabilities.

SLIDE 10

Why Bag of n-grams?

  • Allows us to capture combination features in a simple way: “don’t love”, “not the best”
  • Works pretty well
SLIDE 11

What Problems w/ Bag of n-grams?

  • Same as before: parameter explosion
  • No sharing between similar words/n-grams
SLIDE 12

Time Delay / Convolutional Neural Networks

SLIDE 13

Time Delay Neural Networks

(Waibel et al. 1989)

“I hate this movie”

For each adjacent word pair, compute tanh(W*[x1;x2] + b), tanh(W*[x2;x3] + b), tanh(W*[x3;x4] + b); combine the results into h, then softmax(W*h + b) gives the probs. These are soft 2-grams!
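A toy numpy version of the soft 2-grams above. The slide leaves the “combine” step unspecified; max pooling is used here as one plausible choice, and all sizes and weights are made-up:

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 4, 3
words = "I hate this movie".split()
X = rng.normal(size=(len(words), emb_dim))   # one embedding vector per word

W = rng.normal(size=(hid_dim, 2 * emb_dim))  # filter over adjacent word pairs
b = np.zeros(hid_dim)

# tanh(W*[x_i; x_{i+1}] + b) for each adjacent pair: "soft 2-grams"
H = np.stack([np.tanh(W @ np.concatenate([X[i], X[i + 1]]) + b)
              for i in range(len(words) - 1)])

h = H.max(axis=0)  # combine over positions (here: max pooling)
```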

SLIDE 14

Convolutional Networks

(LeCun et al. 1997)

Parameter extraction performs a 2D sweep, not 1D

SLIDE 15

CNNs for Text

(Collobert and Weston 2011)

  • 1D convolution ≈ Time Delay Neural Network
  • But often uses terminology/functions borrowed from image processing
  • Two main paradigms:
    • Context window modeling: for tagging, etc., get the surrounding context before tagging
    • Sentence modeling: do convolution to extract n-grams, pooling to combine over the whole sentence

SLIDE 16

CNNs for Tagging

(Collobert and Weston 2011)

SLIDE 17

CNNs for Sentence Modeling

(Collobert and Weston 2011)

SLIDE 18

Standard conv2d Function

  • 2D convolution function takes input + parameters
  • Input: 3D tensor
    • rows (e.g. words), columns, features (“channels”)
  • Parameters/Filters: 4D tensor
    • rows, columns, input features, output features
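These shapes can be checked with a naive loop implementation of a “valid” convolution (really cross-correlation, as in most toolkits). All sizes here are made-up, chosen only to make the shapes visible:

```python
import numpy as np

rows, cols, in_feats, out_feats = 6, 5, 3, 4  # e.g. 6 words, 5 columns, 3 channels
fr, fc = 2, 2                                 # filter rows/columns

x = np.ones((rows, cols, in_feats))           # input: 3D tensor
f = np.ones((fr, fc, in_feats, out_feats))    # filters: 4D tensor

# naive "valid" 2D convolution: slide the filter over rows and columns,
# contracting over filter rows, filter columns, and input features
out = np.zeros((rows - fr + 1, cols - fc + 1, out_feats))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = x[i:i + fr, j:j + fc, :]      # fr x fc x in_feats window
        out[i, j] = np.tensordot(patch, f, axes=([0, 1, 2], [0, 1, 2]))
```

With all-ones input and filters, every output entry is fr * fc * in_feats = 12, which makes the contraction easy to verify by hand.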
SLIDE 19

Padding/Striding

  • Padding: after convolution, the rows and columns of the output tensor are either
    • equal to the rows/columns of the input tensor (“same” convolution)
    • equal to the rows/columns of the input tensor minus the size of the filter plus one (“valid” or “narrow”)
    • equal to the rows/columns of the input tensor plus the filter minus one (“wide”)
  • Striding: it is also common to skip rows or columns (e.g. a stride of [2,2] means use every other)

(Figure: narrow vs. wide convolution. Image: Kalchbrenner et al. 2014)
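The three padding variants above fix the output size along each axis; a minimal helper (the function name is just for illustration):

```python
def conv_output_rows(n, k, mode):
    """Output length along one axis of a convolution,
    for input size n and filter size k."""
    if mode == "same":   # output matches input size
        return n
    if mode == "valid":  # narrow: n - k + 1
        return n - k + 1
    if mode == "wide":   # n + k - 1
        return n + k - 1
    raise ValueError(mode)

# e.g. 7 words and a filter of size 3:
sizes = {m: conv_output_rows(7, 3, m) for m in ("same", "valid", "wide")}
```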

SLIDE 20

Pooling

  • Pooling is like convolution, but calculates some reduction function feature-wise
  • Max pooling: “Did you see this feature anywhere in the range?” (most common)
  • Average pooling: “How prevalent is this feature over the entire range?”
  • k-Max pooling: “Did you see this feature up to k times?”
  • Dynamic pooling: “Did you see this feature in the beginning? In the middle? In the end?”
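The four variants above, on a tiny made-up feature matrix (rows = positions in the sentence, columns = features); the two-region split for dynamic pooling is one simple choice among many:

```python
import numpy as np

H = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.3, 0.7],
              [0.6, 0.4]])

max_pool = H.max(axis=0)          # strongest activation per feature
avg_pool = H.mean(axis=0)         # average prevalence per feature
k = 2
k_max = np.sort(H, axis=0)[-k:]   # top-k values per feature (k-max pooling)

# dynamic pooling: split the sentence into regions, pool each separately
regions = np.array_split(H, 2)    # beginning / end
dyn = np.concatenate([r.max(axis=0) for r in regions])
```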

SLIDE 21

Let’s Try It!

cnn-class.py

SLIDE 22

Stacked Convolution

SLIDE 23

Stacked Convolution

  • Feeding in convolution from the previous layer results in a larger area of focus for each feature

Image Credit: Goldberg Book

SLIDE 24

Dilated Convolution

(e.g. Kalchbrenner et al. 2016)

  • Gradually increase stride: low-level to high-level

(Figure: dilated convolution over the character sequence “i _ h a t e _ t h i s _ f i l m”, predicting a sentence class (classification), the next char (language modeling), or a word class (tagging).)
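Why this gives O(log N) depth: with a filter of size 2 and dilation doubling per layer, the receptive field doubles at every layer. A small sketch (assuming that doubling schedule, as in the figure):

```python
def receptive_field(layers, filter_size=2):
    """Receptive field of stacked dilated convolutions whose dilation
    doubles per layer (1, 2, 4, ...)."""
    rf = 1
    for l in range(layers):
        rf += (filter_size - 1) * (2 ** l)
    return rf

# Four layers of filter size 2 already cover all 16 characters of
# "i _ h a t e _ t h i s _ f i l m":
widths = [receptive_field(n) for n in range(1, 5)]  # 2, 4, 8, 16
```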

SLIDE 25

An Aside: Nonlinear Functions

  • Proper choice of a non-linear function is essential in stacked networks
  • Functions such as ReLU or softplus often work better at preserving gradients

(Figure: step, tanh, softplus, and rectifier (ReLU) curves. Image: Wikipedia)
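The “preserving gradients” point can be seen from the derivatives: tanh saturates (its gradient vanishes for large |x|), while ReLU keeps a gradient of 1 for any positive input and softplus approaches 1. A quick check:

```python
import math

def tanh_grad(x):
    """d/dx tanh(x) = 1 - tanh(x)^2: vanishes for large |x|."""
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    """Gradient of ReLU: exactly 1 for x > 0, so it is preserved."""
    return 1.0 if x > 0 else 0.0

def softplus_grad(x):
    """d/dx log(1 + e^x) = sigmoid(x): approaches 1 for large x."""
    return 1.0 / (1.0 + math.exp(-x))

# At x = 5, tanh's gradient is already tiny, ReLU's and softplus's are ~1.
grads = (tanh_grad(5.0), relu_grad(5.0), softplus_grad(5.0))
```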

SLIDE 26

Why (Dilated) Convolution for Modeling Sentences?

  • In contrast to recurrent neural networks (next class):
    • + Fewer steps from each word to the final representation: RNN O(N), Dilated CNN O(log N)
    • + Easier to parallelize on GPU
    • - Slightly less natural for arbitrary-length dependencies
    • - A bit slower on CPU?
SLIDE 27

Structured Convolution

SLIDE 28

Why Structured Convolution?

  • Language has structure, and we would like it to localize features
  • e.g. noun-verb pairs are very informative, but not captured by normal CNNs

SLIDE 29

Example: Dependency Structure

Sequa makes and repairs jet engines

(Dependency labels in the figure: ROOT, SBJ, COORD, CONJ, NMOD, OBJ. Example from: Marcheggiani and Titov 2017)

SLIDE 30

Tree-structured Convolution

(Ma et al. 2015)

  • Convolve over parents, grandparents, siblings
SLIDE 31

Graph Convolution

(e.g. Marcheggiani et al. 2017)

  • Convolution is shaped by graph structure
  • For example, a dependency tree is a graph with:
    • Self-loop connections
    • Dependency connections
    • Reverse connections

SLIDE 32

Convolutional Models of Sentence Pairs

SLIDE 33

Why Model Sentence Pairs?

  • Paraphrase identification / sentence similarity
  • Textual entailment
  • Retrieval
  • (More about these specific applications in two classes)

SLIDE 34

Siamese Network

(Bromley et al. 1993)

  • Use the same network, compare the extracted representations
  • (e.g. time-delay networks for signature recognition)
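The key point is weight sharing: both sentences go through one and the same encoder, and only the resulting representations are compared. A toy sketch, assuming max pooling over tanh features and cosine similarity for the comparison (all weights are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
emb_dim, hid_dim = 4, 3
W = rng.normal(size=(hid_dim, emb_dim))  # one shared weight matrix

def encode(X):
    """Encode a sentence (matrix of word vectors) with the *shared*
    network, then max-pool to a fixed-size representation."""
    return np.tanh(X @ W.T).max(axis=0)

def similarity(X1, X2):
    h1, h2 = encode(X1), encode(X2)      # same network for both inputs
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))

A = rng.normal(size=(5, emb_dim))        # sentence 1 (5 words)
B = rng.normal(size=(6, emb_dim))        # sentence 2 (6 words)
sim = similarity(A, B)
```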

SLIDE 35

Convolutional Matching Model (Hu et al. 2014)

  • Concatenate sentences into a 3D tensor and perform convolution
  • Shown more effective than simple Siamese network
SLIDE 36

Convolutional Features + Matrix-based Pooling

(Yin and Schutze 2015)

SLIDE 37

Understanding CNN Results

SLIDE 38

Why Understanding?

  • Sometimes we want to know why the model is making predictions (e.g. is there bias?)
  • Understanding extracted features might lead to new architectural ideas
  • Visualization of filters, etc. is easy in vision but harder in NLP; other techniques can be used

SLIDE 39

Maximum Activation

  • Calculate the hidden feature values for the whole data set, and find the section of the image/sentence that results in the max value

Example: Karpathy 2016
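A toy version of the idea: scan every position in a (tiny, made-up) data set and record which word maximally activates one hidden feature. The vocabulary, embeddings, and feature weights are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"I": 0, "hate": 1, "this": 2, "movie": 3, "love": 4}
emb = rng.normal(size=(len(vocab), 4))
w_feat = rng.normal(size=4)  # weights of one hidden feature

data = ["I hate this movie".split(), "I love this movie".split()]

# Scan the whole data set; keep the word with the max feature value.
activation, best_word = max(
    (float(emb[vocab[w]] @ w_feat), w) for sent in data for w in sent)
```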

SLIDE 40

PCA/t-SNE Embedding

  • Do dimension reduction on feature vectors

Example: Sutskever+ 2014

SLIDE 41

Occlusion

  • Blank out one part at a time (in NLP, a word?), and measure the difference from the final representation/prediction

Example: Karpathy 2016
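A minimal occlusion sketch: using a toy sentence representation (here just the mean of word embeddings, as a stand-in for a trained encoder), drop one word at a time and measure how far the representation moves; larger movement suggests a more important word:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"I": 0, "hate": 1, "this": 2, "movie": 3}
emb = rng.normal(size=(len(vocab), 4))

def represent(words):
    """Toy sentence representation: mean of word embeddings."""
    return np.mean([emb[vocab[w]] for w in words], axis=0)

def occlusion_importance(words):
    full = represent(words)
    # blank out one word at a time; bigger change = more important word
    return {w: float(np.linalg.norm(full - represent([u for u in words if u != w])))
            for w in words}

imp = occlusion_importance("I hate this movie".split())
```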

SLIDE 42

Let’s Try It!

cnn-activation.py

SLIDE 43

Questions?