SLIDE 1

Comparing Convolution Kernels and RNNs

on a wide-coverage computational analysis of natural language

Fabrizio Costa, Paolo Frasconi, Sauro Menchetti

  • Dept. Systems and Computer Science

Università di Firenze

Massimiliano Pontil

  • Dept. Information Engineering

Università di Siena

Related papers available from http://www.dsi.unifi.it/~paolo http://www.dsi.unifi.it/~costa

SLIDE 2

Overview

  • Incremental parsing of natural language

– A ranking problem on labeled forests

  • Supervised learning of discrete structures

– Recursive neural networks (RNNs)
– Kernel-based approaches

  • New results with RNNs
  • Experimental comparison
SLIDE 3

Human vs computer parsing

  • Computer parsing: typically bottom-up

– 'islands' are built at the beginning and are subsequently joined together

  • Human parsing: known to be left-to-right

– E.g., perception of speech is sequential, reading is sequential, etc.

SLIDE 4

Strong incrementality hypothesis

  • The human parser maintains a connected structure that explains the first n-1 words
  • When the n-th word arrives, it is attached to the existing structure

[Figure: incremental tree for "The servant of the actress" (NP, NP PP, D N P D N) as the left context, with the new word "who" (WH) attached via a connection path (CP)]

SLIDE 5

Attachment ambiguity

[Figure: two alternative parses for "The servant of the actress who ...", attaching the WH relative clause either low or high in the left context]

E.g. low vs. high attachment

SLIDE 6

Connection path ambiguity

[Figure: incremental trees for "The athlete realized his ...", with alternative connection paths attaching the new word "his" (PRP)]

Even for a fixed attachment point there may be several alternative legal paths (those matching the POS tag of the new word)

SLIDE 7

A forest of alternatives

  • Given a dynamic grammar, a left context and a next word
  • Many legal trees can be formed by attaching a CP
  • One is correct — we want to predict it
SLIDE 8

Supervised Learning of Discrete Structures

  • Lack of methods that handle "directly" recursive or relational structures such as trees and graphs

  • General approach:
  • 1. Convert structures to real vectors
  • 2. Apply known learning methods on vectors

  • These steps can be elegantly merged within a more general theoretical framework:
  • 1. Recursive neural networks (Goller & Küchler, IJCNN 96; Frasconi et al., TNN 98)
  • 2. Kernel machines (Haussler 99; Collins & Duffy, NIPS 01, ACL 02)

SLIDE 9

Differences

  • Kernel-based methods map a tree into a vector φ(x) in a very high-dimensional, possibly infinite-dimensional space

  • Bag-of-something kind of representation
  • Kernel choice difficult (prior knowledge?)
  • RNNs map a tree into a low-dimensional vector, e.g. φ(x) ∈ ℝ^30

  • Distributed representation
  • Task-driven: φ(x) in this case depends on the specific learning problem

SLIDE 10

Kernels

  • Given sets of nonterminals {A, B, …} and terminals {a, b, …} there are infinitely many possible subtrees:

[Figure: a few example subtrees built from nonterminals A, B, C and terminal a]

  • φi(t): count of occurrences of subtree i in tree t
  • φ(t) = [φ1(t), φ2(t), φ3(t), …] has infinite dimensionality, but
  • φ(t)^T φ(s) can be computed without actually enumerating all subtrees, by dynamic programming (Collins & Duffy, NIPS 2001)
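
As a concrete illustration, here is a minimal sketch of this dynamic program in the spirit of Collins & Duffy (2001); the Tree class and helper names are assumptions made for the example, not the authors' code.

```python
class Tree:
    """A parse-tree node: a label plus a (possibly empty) list of children."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def production(self):
        # Grammar production used at this node, e.g. ('NP', ('D', 'N')).
        return (self.label, tuple(c.label for c in self.children))

def nodes(t):
    """All nodes of t, root included."""
    yield t
    for c in t.children:
        yield from nodes(c)

def tree_kernel(t, s):
    """phi(t)^T phi(s): the number of common subtrees of t and s,
    computed by dynamic programming over pairs of nodes."""
    memo = {}

    def common(n1, n2):
        key = (id(n1), id(n2))
        if key not in memo:
            if not n1.children or not n2.children or n1.production() != n2.production():
                # Terminals, or different productions: no common subtrees rooted here.
                memo[key] = 0
            else:
                # Same production: combine the matches of corresponding children.
                prod = 1
                for c1, c2 in zip(n1.children, n2.children):
                    prod *= 1 + common(c1, c2)
                memo[key] = prod
        return memo[key]

    return sum(common(n1, n2) for n1 in nodes(t) for n2 in nodes(s))
```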

SLIDE 11

Recursive neural networks

  • Recurrent networks can in principle realize arbitrarily complex dynamical systems

  • Skepticism: long-term dependencies cannot be easily learned

  • But trees are different!

– Path lengths are O(log n)
– Vanishing gradient problems not as serious for RNNs on trees

[Figure: the tree (A (B (D E F G H) C)) — root-to-leaf paths are short compared with the sequence length]

SLIDE 12

Recursive Neural Networks

[Figure: example tree with nodes labeled A, B, C, D]

  • Let's introduce a representation vector X(v) ∈ ℝ^n for each vertex v in tree t

  • X(v) computed bottom-up
SLIDE 13

Recursive Neural Networks

[Figure: the example tree with external "nil" children added below its leaves, each labeled X0]

  • Base step: the representation of external nodes ("nil children") is a constant, X(v) = X0

SLIDE 14

Recursive Neural Networks

[Figure: computing X(v) at a node v of the example tree]

  • Induction: the representation of the subtree rooted at v is a function of
  • 1. the representations at the children of v
  • 2. the symbol U(v)
  • X(v) = f(X(w1), …, X(wk), U(v))
  • w1, …, wk are v's children (k assigned)

SLIDE 15

Recursive Neural Networks ...

  • What, more precisely, is f?

X(v) = f(X(w1), …, X(wk), U(v))

  • f is realized by an MLP:
  • n outputs, nk + m inputs (n for each X(wi), m for U(v))
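
A minimal sketch of this bottom-up encoder, assuming f is a single tanh layer and the maximum out-degree k is fixed; the class, weight shapes, and initialization below are illustrative, not the network actually used in the paper.

```python
import numpy as np

class Node:
    """A tree node: a symbol plus up to k children."""
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = list(children)

class RecursiveEncoder:
    """Bottom-up encoder X(v) = f(X(w1), ..., X(wk), U(v)) with f a one-layer tanh MLP."""
    def __init__(self, n, m, k, symbols, rng=np.random.default_rng(0)):
        self.n, self.m, self.k = n, m, k           # state size, symbol-code size, max out-degree
        self.sym_index = {s: i for i, s in enumerate(symbols)}
        self.U = rng.normal(scale=0.1, size=(len(symbols), m))   # symbol encodings U(v)
        self.W = rng.normal(scale=0.1, size=(n, n * k + m))      # MLP weights
        self.b = np.zeros(n)
        self.x0 = np.zeros(n)                      # base step: representation of nil children

    def encode(self, v):
        if v is None:                              # external ("nil") child
            return self.x0
        # Induction: concatenate the children's codes (padded to k) and the symbol code.
        kids = [self.encode(c) for c in v.children] + [self.x0] * (self.k - len(v.children))
        z = np.concatenate(kids + [self.U[self.sym_index[v.symbol]]])
        return np.tanh(self.W @ z + self.b)        # X(v)

# Usage on a toy tree: X(root) encodes the whole tree in R^30.
enc = RecursiveEncoder(n=30, m=10, k=2, symbols=["A", "B", "C", "D"])
tree = Node("A", [Node("B", [Node("D")]), Node("C", [Node("D")])])
x_root = enc.encode(tree)
```

The base step (X(v) = X0 for nil children) and the induction step appear directly in encode().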

SLIDE 16

Recursive Neural Networks ...

  • The computation continues bottom-up until the root r is reached
  • X(r) encodes the whole tree in a real vector; it plays the same role as φ(t)

SLIDE 17

Structure unfolding

[Figure: animation (slides 17-23) unfolding the recursive network over the parse tree of "It has no bearing ..." (PRP VBZ DT NN IN; NP, PP, NP, VP, S); the final frame adds the output network]
SLIDE 24

Prediction phase

Information Flow

[Figure: animation (slides 24-28) of the information flow through the unfolded network during the prediction phase, for "It has no bearing ..."]

SLIDE 29

Error correction

Information Flow

[Figure: animation (slides 29-33) of the information flow during the error-correction phase, for "It has no bearing ..."]

SLIDE 34

Disambiguation is a preference task

SLIDE 35

Learning preferences

  • Ranking: given a list of entities (x1, …, xr), find a corresponding list of integers (y1, …, yr), with yi in [1, r], such that yi is the rank of xi

  • In a total ranking: yi ≠ yj for i ≠ j
  • In our case the favorite element x1 gets y1 = 1 and the other xj get yj = 0

– typically r = 120 (but can go up to 2000)

  • Linear utility function:

w^T x1 − w^T xj > 0 for j = 2, …, r

  • Set of constraints: similar to binary classification, but on differences between vectors

  • Can be used with SVM and the Voted Perceptron:

w^T [φ(x1) − φ(xj)] = Σsv y [φ(x1) − φ(xj)]^T [φ(x1) − φ(xj)]

(the sum ranges over the stored support differences, so only inner products between difference vectors are needed, and these can be computed with the tree kernel)
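
A minimal sketch of learning these constraints in the kernel setting: a plain kernel perceptron over difference vectors (a simplification of the voted perceptron actually used), reusing the tree_kernel sketch above; the function names and training loop are illustrative.

```python
def kernel(t, s):
    # Any kernel over trees; here the tree_kernel from the earlier sketch.
    return tree_kernel(t, s)

def diff_dot(pair_a, pair_b):
    """Inner product of two difference vectors, phi(a1)-phi(aj) and phi(b1)-phi(bj),
    expanded into four kernel evaluations."""
    (a1, aj), (b1, bj) = pair_a, pair_b
    return kernel(a1, b1) - kernel(a1, bj) - kernel(aj, b1) + kernel(aj, bj)

def train_ranking_perceptron(forests, epochs=1):
    """Each forest is a list of candidate trees whose first element is the correct one.
    Returns the mistaken (correct, wrong) pairs, which play the role of support vectors."""
    support = []
    for _ in range(epochs):
        for forest in forests:
            correct, others = forest[0], forest[1:]
            for wrong in others:
                # Value of the constraint w^T [phi(correct) - phi(wrong)]
                score = sum(diff_dot(sv, (correct, wrong)) for sv in support)
                if score <= 0:                 # violated: perceptron update w += difference
                    support.append((correct, wrong))
    return support

def best_candidate(support, forest):
    """Pick the candidate with the highest utility w^T phi(x)."""
    def utility(x):
        return sum(kernel(c, x) - kernel(w, x) for (c, w) in support)
    return max(forest, key=utility)
```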

SLIDE 36

Learning preferences

  • To get a differentiable version we use the softmax function

yj = exp(w^T xj) / Σk exp(w^T xk)

  • Find w and the xj by maximizing

Σi Σj [ zj log yj + (1 − zj) log(1 − yj) ]

  • where z1 = 1 and zj = 0 for j > 1
  • Gradients w.r.t. xj are passed to the RNN, so in this sense xj is an adaptive encoding
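
A minimal numpy sketch of this softmax preference objective and its gradients for one forest of candidate encodings; treating it as a loss to be minimized, and the numerical details, are assumptions made for the example.

```python
import numpy as np

def softmax_rank_loss(w, X):
    """Softmax preference loss for one forest.

    X: (r, n) matrix whose rows are the candidate encodings x_1, ..., x_r,
       with the correct candidate in row 0 (z_1 = 1, z_j = 0 otherwise).
    Returns the negative objective and its gradients w.r.t. w and X; the
    gradient w.r.t. X is what would be backpropagated into the RNN encoder.
    """
    scores = X @ w                                # w^T x_j for each candidate
    scores -= scores.max()                        # numerical stability
    y = np.exp(scores) / np.exp(scores).sum()     # softmax over the forest
    z = np.zeros(len(y))
    z[0] = 1.0
    eps = 1e-12
    loss = -np.sum(z * np.log(y + eps) + (1 - z) * np.log(1 - y + eps))

    # Gradient w.r.t. the scores s_j = w^T x_j, chained through the softmax Jacobian.
    dL_dy = (y - z) / (y * (1 - y) + eps)
    J = np.diag(y) - np.outer(y, y)               # dy/ds (symmetric)
    dL_ds = J @ dL_dy
    grad_w = X.T @ dL_ds
    grad_X = np.outer(dL_ds, w)                   # passed back to the RNN encodings
    return loss, grad_w, grad_X
```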

SLIDE 37

Experimental setup

  • Training on the WSJ section of the Penn treebank

– realistic corpus, representative of natural language
– large size (40,000 sentences, 1 million words)
– uniform language (articles on economic subjects)
– train on sections 2-21, test on section 23

  • Note: we are not (yet) building a parser
  • Extending earlier results (Costa et al. 2000; Sturt et al., Cognition, in press)

SLIDE 38

Results

  • Percentage correctly predicted

[Bar chart comparing RNN, RNN500, LC, and MA]
SLIDE 39

Selecting the right attachment

[Chart: percentage correct vs. position (1-3), RNN vs. Freq]

  • Given the attachment site, the correct connection path is chosen 89% of the time

SLIDE 40

Reduced incremental trees: example tree

[Figure: incremental tree for "Jim saw the thief with a friend", with the left context and the connection path for the new word highlighted]

SLIDE 41

Reduced incremental trees: right frontier

[Figure: the same tree with the nodes on the right frontier highlighted]

SLIDE 42

Reduced incremental trees: right frontier + c-commanding nodes

[Figure: the same tree with right-frontier and c-commanding nodes highlighted]

SLIDE 43

Reduced incremental trees: right frontier + c-commanding nodes + connection path

[Figure: the same tree with right-frontier nodes, c-commanding nodes, and the connection path highlighted]

SLIDE 44

Reduced incremental trees

[Figure: the resulting reduced incremental tree (S NP VP V NP D N PP P NP)]
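
A minimal sketch of such a reduction, assuming the retained c-commanding nodes are exactly the left siblings of right-frontier nodes, kept as bare leaves (the connection path would then be attached to this reduced left context); it reuses the Node class from the encoder sketch above.

```python
def reduce_incremental_tree(node, on_frontier=True):
    """Keep the right frontier of the left context plus, for each frontier node,
    its other children (which c-command the attachment site) as bare leaves;
    everything below those siblings is pruned."""
    if not on_frontier:
        return Node(node.symbol)               # c-commanding node: kept without its subtree
    last = node.children[-1] if node.children else None
    reduced = [reduce_incremental_tree(c, on_frontier=(c is last)) for c in node.children]
    return Node(node.symbol, reduced)
```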

SLIDE 45

Results

  • Percentage correctly predicted

[Bar chart: full tree vs. reduced tree]
SLIDE 46

Data set partitioning (POS-tag based)

[Chart: percentage correct and error reduction for each POS-tag class (verb, noun, article, adjective, punctuation, conjunction, preposition, adverb, other)]

SLIDE 47

Comparing RNN and VP

  • Regularization parameter λ = 0.5 (best value based on preliminary trials using a validation set)

  • Modularization into 10 POS-tag categories
  • Performance assessment at 100, 500, 2000, 10000, and 40000 training sentences

  • Small datasets: CPU(VP) ~ k · CPU(RNN)
  • Larger datasets:

– RNN learns in 1-2 epochs (~ 3 days on a 2 GHz CPU)
– VP took over 2 months to complete 1 epoch

SLIDE 48

VP vs. RNN

[Six learning curves (percentage correct vs. training set size), RNN vs. VP, one per class: NOUNS (33%), VERBS (13%), PREPOSITIONS (13%), ARTICLES (12%), PUNCTUATION (12%), ADJECTIVES (8%)]

SLIDE 49

VP vs. RNN: modularization

Learning curve

[Chart: percentage correct vs. training set size, RNN vs. VP]

SLIDE 50

5 independent splits, no modularization

[Bar chart over splits 1-5: RNN average 77%, VP average 75.4%]

SLIDE 51

Summary

  • VP results perhaps to be strengthened, but

– 5 × 100 sentences takes ~ a week on a 2 GHz CPU
– VP does not scale up linearly with the number of examples

  • However, it appears that

– RNN is to be preferred, unless one has good knowledge to put into the design of the right kernel

  • Ongoing work: Collins' relabeling task

– Same problem setting (ranking on forests)
– Less computation involved (1 forest per sentence vs. 1 forest per word)

SLIDE 52

Thanks:

Patrick Sturt, University of Glasgow
Vincenzo Lombardo, Università di Torino
Giovanni Soda, Università di Firenze