SLIDE 1

Comparing Convolution Kernels and RNNs

on a wide-coverage computational analysis of natural language

Fabrizio Costa, Paolo Frasconi, Sauro Menchetti

  • Dept. Systems and Computer Science

Università di Firenze

Massimiliano Pontil

  • Dept. Information Engineering

Università di Siena

Related papers available from http://www.dsi.unifi.it/~paolo http://www.dsi.unifi.it/~costa

SLIDE 2

Overview

  • Incremental parsing of natural language

– A ranking problem on labeled forests

  • Supervised learning of discrete structures

– Recursive neural networks (RNNs)
– Kernel-based approaches

  • New results with RNNs
  • Experimental comparison
SLIDE 3

Human vs computer parsing

  • Computer parsing: typically bottom-up

– 'islands' are built at the beginning and are subsequently joined together

  • Human parsing: known to be left-to-right

– E.g., perception of speech is sequential, reading is sequential, etc.

SLIDE 4

Strong incrementality hypothesis

  • The human parser maintains a connected structure that explains the first n-1 words
  • When the n-th word arrives, it is attached to the existing structure

[Figure: incremental tree for "The servant of the actress" (NP, NP PP, D N P D N) as the left context, with the new word "who" (WH) attached via a connection path (CP)]

SLIDE 5

Attachment ambiguity

[Figure: two alternative parses for "The servant of the actress who ...", attaching the WH relative clause either low or high in the left context]

E.g. low vs. high attachment

SLIDE 6

Connection path ambiguity

[Figure: incremental trees for "The athlete realized his ...", with alternative connection paths attaching the new word "his" (PRP)]

Even for a fixed attachment point there may be several alternative legal paths (those matching the POS tag of the new word)

SLIDE 7

A forest of alternatives

  • Given a dynamic grammar, a left context and a next word
  • Many legal trees can be formed by attaching a CP
  • One is correct — we want to predict it
SLIDE 8

Supervised Learning of Discrete Structures

  • Lack of methods that handle "directly" recursive or relational structures such as trees and graphs

  • General approach:
  • 1. Convert structures to real vectors
  • 2. Apply known learning methods on vectors

  • These steps can be elegantly merged within a more general theoretical framework:
  • 1. Recursive neural networks (Goller & Küchler, IJCNN 96; Frasconi et al., TNN 98)
  • 2. Kernel machines (Haussler 99; Collins & Duffy, NIPS 01, ACL 02)

SLIDE 9

Differences

  • Kernel-based methods map a tree into a vector φ(x) in a very high-dimensional, possibly infinite-dimensional space

  • Bag-of-something kind of representation
  • Kernel choice difficult (prior knowledge?)
  • RNNs map a tree into a low-dimensional vector, e.g. φ(x) ∈ ℝ^30

  • Distributed representation
  • Task-driven: φ(x) in this case depends on the specific learning problem

SLIDE 10

Kernels

  • Given sets of nonterminals {A, B, …} and terminals {a, b, …} there are infinitely many possible subtrees:

[Figure: a few example subtrees built from nonterminals A, B, C and terminal a]

  • φi(t): count of occurrences of subtree i in tree t
  • φ(t) = [φ1(t), φ2(t), φ3(t), …] has infinite dimensionality, but
  • φ(t)^T φ(s) can be computed without actually enumerating all subtrees, by dynamic programming (Collins & Duffy, NIPS 2001)
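
As a concrete illustration, here is a minimal sketch of this dynamic program in the spirit of Collins & Duffy (2001); the Tree class and helper names are assumptions made for the example, not the authors' code.

```python
class Tree:
    """A parse-tree node: a label plus a (possibly empty) list of children."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def production(self):
        # Grammar production used at this node, e.g. ('NP', ('D', 'N')).
        return (self.label, tuple(c.label for c in self.children))

def nodes(t):
    """All nodes of t, root included."""
    yield t
    for c in t.children:
        yield from nodes(c)

def tree_kernel(t, s):
    """phi(t)^T phi(s): the number of common subtrees of t and s,
    computed by dynamic programming over pairs of nodes."""
    memo = {}

    def common(n1, n2):
        key = (id(n1), id(n2))
        if key not in memo:
            if not n1.children or not n2.children or n1.production() != n2.production():
                # Terminals, or different productions: no common subtrees rooted here.
                memo[key] = 0
            else:
                # Same production: combine the matches of corresponding children.
                prod = 1
                for c1, c2 in zip(n1.children, n2.children):
                    prod *= 1 + common(c1, c2)
                memo[key] = prod
        return memo[key]

    return sum(common(n1, n2) for n1 in nodes(t) for n2 in nodes(s))
```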

SLIDE 11

Recursive neural networks

  • Recurrent networks can in principle realize arbitrarily complex dynamical systems

  • Skepticism: long-term dependencies cannot be easily learned

  • But trees are different!

– Path lengths are O(log n)
– Vanishing gradient problems not as serious for RNNs on trees

[Figure: the tree (A (B (D E F G H) C)) — root-to-leaf paths are short compared with the sequence length]

SLIDE 12

Recursive Neural Networks

[Figure: example tree with nodes labeled A, B, C, D]

  • Let's introduce a representation vector X(v) ∈ ℝ^n for each vertex v in tree t

  • X(v) computed bottom-up
SLIDE 13

Recursive Neural Networks

[Figure: the example tree with external "nil" children added below its leaves, each labeled X0]

  • Base step: the representation of external nodes ("nil children") is a constant, X(v) = X0

SLIDE 14

Recursive Neural Networks

[Figure: computing X(v) at a node v of the example tree]

  • Induction: the representation of the subtree rooted at v is a function of
  • 1. the representations at the children of v
  • 2. the symbol U(v)
  • X(v) = f(X(w1), …, X(wk), U(v))
  • w1, …, wk are v's children (k assigned)

SLIDE 15

Recursive Neural Networks ...

  • What, more precisely, is f?

X(v) = f(X(w1), …, X(wk), U(v))

  • f is realized by an MLP:
  • n outputs, nk + m inputs (n for each X(wi), m for U(v))
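
A minimal sketch of this bottom-up encoder, assuming f is a single tanh layer and the maximum out-degree k is fixed; the class, weight shapes, and initialization below are illustrative, not the network actually used in the paper.

```python
import numpy as np

class Node:
    """A tree node: a symbol plus up to k children."""
    def __init__(self, symbol, children=()):
        self.symbol = symbol
        self.children = list(children)

class RecursiveEncoder:
    """Bottom-up encoder X(v) = f(X(w1), ..., X(wk), U(v)) with f a one-layer tanh MLP."""
    def __init__(self, n, m, k, symbols, rng=np.random.default_rng(0)):
        self.n, self.m, self.k = n, m, k           # state size, symbol-code size, max out-degree
        self.sym_index = {s: i for i, s in enumerate(symbols)}
        self.U = rng.normal(scale=0.1, size=(len(symbols), m))   # symbol encodings U(v)
        self.W = rng.normal(scale=0.1, size=(n, n * k + m))      # MLP weights
        self.b = np.zeros(n)
        self.x0 = np.zeros(n)                      # base step: representation of nil children

    def encode(self, v):
        if v is None:                              # external ("nil") child
            return self.x0
        # Induction: concatenate the children's codes (padded to k) and the symbol code.
        kids = [self.encode(c) for c in v.children] + [self.x0] * (self.k - len(v.children))
        z = np.concatenate(kids + [self.U[self.sym_index[v.symbol]]])
        return np.tanh(self.W @ z + self.b)        # X(v)

# Usage on a toy tree: X(root) encodes the whole tree in R^30.
enc = RecursiveEncoder(n=30, m=10, k=2, symbols=["A", "B", "C", "D"])
tree = Node("A", [Node("B", [Node("D")]), Node("C", [Node("D")])])
x_root = enc.encode(tree)
```

The base step (X(v) = X0 for nil children) and the induction step appear directly in encode().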

SLIDE 16

Recursive Neural Networks ...

  • The computation continues bottom-up until the root r is reached
  • X(r) encodes the whole tree in a real vector; it plays the same role as φ(t)

SLIDE 17

Structure unfolding

[Figure: animation (slides 17-23) unfolding the recursive network over the parse tree of "It has no bearing ..." (PRP VBZ DT NN IN; NP, PP, NP, VP, S); the final frame adds the output network]
SLIDE 24

Prediction phase

Information Flow

[Figure: animation (slides 24-28) of the information flow through the unfolded network during the prediction phase, for "It has no bearing ..."]

SLIDE 29

Error correction

Information Flow

[Figure: animation (slides 29-33) of the information flow during the error-correction phase, for "It has no bearing ..."]

SLIDE 34

Disambiguation is a preference task

SLIDE 35

Learning preferences

  • Ranking: given a list of entities (x1, …, xr), find a corresponding list of integers (y1, …, yr), with yi in [1, r], such that yi is the rank of xi

  • In a total ranking: yi ≠ yj for i ≠ j
  • In our case the favorite element x1 gets y1 = 1 and the other xj get yj = 0

– typically r = 120 (but can go up to 2000)

  • Linear utility function:

w^T x1 − w^T xj > 0 for j = 2, …, r

  • Set of constraints: similar to binary classification, but on differences between vectors

  • Can be used with SVM and the Voted Perceptron:

w^T [φ(x1) − φ(xj)] = Σsv y [φ(x1) − φ(xj)]^T [φ(x1) − φ(xj)]

(the sum ranges over the stored support differences, so only inner products between difference vectors are needed, and these can be computed with the tree kernel)
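
A minimal sketch of learning these constraints in the kernel setting: a plain kernel perceptron over difference vectors (a simplification of the voted perceptron actually used), reusing the tree_kernel sketch above; the function names and training loop are illustrative.

```python
def kernel(t, s):
    # Any kernel over trees; here the tree_kernel from the earlier sketch.
    return tree_kernel(t, s)

def diff_dot(pair_a, pair_b):
    """Inner product of two difference vectors, phi(a1)-phi(aj) and phi(b1)-phi(bj),
    expanded into four kernel evaluations."""
    (a1, aj), (b1, bj) = pair_a, pair_b
    return kernel(a1, b1) - kernel(a1, bj) - kernel(aj, b1) + kernel(aj, bj)

def train_ranking_perceptron(forests, epochs=1):
    """Each forest is a list of candidate trees whose first element is the correct one.
    Returns the mistaken (correct, wrong) pairs, which play the role of support vectors."""
    support = []
    for _ in range(epochs):
        for forest in forests:
            correct, others = forest[0], forest[1:]
            for wrong in others:
                # Value of the constraint w^T [phi(correct) - phi(wrong)]
                score = sum(diff_dot(sv, (correct, wrong)) for sv in support)
                if score <= 0:                 # violated: perceptron update w += difference
                    support.append((correct, wrong))
    return support

def best_candidate(support, forest):
    """Pick the candidate with the highest utility w^T phi(x)."""
    def utility(x):
        return sum(kernel(c, x) - kernel(w, x) for (c, w) in support)
    return max(forest, key=utility)
```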

SLIDE 36

Learning preferences

  • To get a differentiable version we use the softmax function

yj = exp(w^T xj) / Σk exp(w^T xk)

  • Find w and the xj by maximizing

Σi Σj [ zj log yj + (1 − zj) log(1 − yj) ]

  • where z1 = 1 and zj = 0 for j > 1
  • Gradients w.r.t. xj are passed to the RNN, so in this sense xj is an adaptive encoding
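
A minimal numpy sketch of this softmax preference objective and its gradients for one forest of candidate encodings; treating it as a loss to be minimized, and the numerical details, are assumptions made for the example.

```python
import numpy as np

def softmax_rank_loss(w, X):
    """Softmax preference loss for one forest.

    X: (r, n) matrix whose rows are the candidate encodings x_1, ..., x_r,
       with the correct candidate in row 0 (z_1 = 1, z_j = 0 otherwise).
    Returns the negative objective and its gradients w.r.t. w and X; the
    gradient w.r.t. X is what would be backpropagated into the RNN encoder.
    """
    scores = X @ w                                # w^T x_j for each candidate
    scores -= scores.max()                        # numerical stability
    y = np.exp(scores) / np.exp(scores).sum()     # softmax over the forest
    z = np.zeros(len(y))
    z[0] = 1.0
    eps = 1e-12
    loss = -np.sum(z * np.log(y + eps) + (1 - z) * np.log(1 - y + eps))

    # Gradient w.r.t. the scores s_j = w^T x_j, chained through the softmax Jacobian.
    dL_dy = (y - z) / (y * (1 - y) + eps)
    J = np.diag(y) - np.outer(y, y)               # dy/ds (symmetric)
    dL_ds = J @ dL_dy
    grad_w = X.T @ dL_ds
    grad_X = np.outer(dL_ds, w)                   # passed back to the RNN encodings
    return loss, grad_w, grad_X
```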

SLIDE 37

Experimental setup

  • Training on the WSJ section of the Penn treebank

– realistic corpus, representative of natural language
– large size (40,000 sentences, 1 million words)
– uniform language (articles on economic subjects)
– train on sections 2-21, test on section 23

  • Note: we are not (yet) building a parser
  • Extending earlier results (Costa et al. 2000; Sturt et al., Cognition, in press)

SLIDE 38

Results

  • Percentage correctly predicted

[Bar chart comparing RNN, RNN500, LC, and MA]
SLIDE 39

Selecting the right attachment

[Chart: percentage correct vs. position (1-3), RNN vs. Freq]

  • Given the attachment site, the correct connection path is chosen 89% of the time

SLIDE 40

Reduced incremental trees: example tree

[Figure: incremental tree for "Jim saw the thief with a friend", with the left context and the connection path for the new word highlighted]

SLIDE 41

Reduced incremental trees: right frontier

[Figure: the same tree with the nodes on the right frontier highlighted]

SLIDE 42

Reduced incremental trees: right frontier + c-commanding nodes

[Figure: the same tree with right-frontier and c-commanding nodes highlighted]

SLIDE 43

Reduced incremental trees: right frontier + c-commanding nodes + connection path

[Figure: the same tree with right-frontier nodes, c-commanding nodes, and the connection path highlighted]

SLIDE 44

Reduced incremental trees

[Figure: the resulting reduced incremental tree (S NP VP V NP D N PP P NP)]
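
A minimal sketch of such a reduction, assuming the retained c-commanding nodes are exactly the left siblings of right-frontier nodes, kept as bare leaves (the connection path would then be attached to this reduced left context); it reuses the Node class from the encoder sketch above.

```python
def reduce_incremental_tree(node, on_frontier=True):
    """Keep the right frontier of the left context plus, for each frontier node,
    its other children (which c-command the attachment site) as bare leaves;
    everything below those siblings is pruned."""
    if not on_frontier:
        return Node(node.symbol)               # c-commanding node: kept without its subtree
    last = node.children[-1] if node.children else None
    reduced = [reduce_incremental_tree(c, on_frontier=(c is last)) for c in node.children]
    return Node(node.symbol, reduced)
```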

SLIDE 45

Results

  • Percentage correctly predicted

[Bar chart: full tree vs. reduced tree]
SLIDE 46

Data set partitioning (POS-tag based)

[Chart: percentage correct and error reduction for each POS-tag class (verb, noun, article, adjective, punctuation, conjunction, preposition, adverb, other)]

SLIDE 47

Comparing RNN and VP

  • Regularization parameter λ = 0.5 (best value based on preliminary trials using a validation set)

  • Modularization into 10 POS-tag categories
  • Performance assessment at 100, 500, 2000, 10000, and 40000 training sentences

  • Small datasets: CPU(VP) ~ k · CPU(RNN)
  • Larger datasets:

– RNN learns in 1-2 epochs (~ 3 days on a 2 GHz CPU)
– VP took over 2 months to complete 1 epoch

SLIDE 48

VP vs. RNN

[Six learning curves (percentage correct vs. training set size), RNN vs. VP, one per class: NOUNS (33%), VERBS (13%), PREPOSITIONS (13%), ARTICLES (12%), PUNCTUATION (12%), ADJECTIVES (8%)]

SLIDE 49

VP vs. RNN: modularization

Learning curve

[Chart: percentage correct vs. training set size, RNN vs. VP]

SLIDE 50

5 independent splits, no modularization

[Bar chart over splits 1-5: RNN average 77%, VP average 75.4%]

SLIDE 51

Summary

  • VP results perhaps to be strengthened, but

– 5 × 100 sentences takes ~ a week on a 2 GHz CPU
– VP does not scale up linearly with the number of examples

  • However, it appears that

– RNN is to be preferred, unless one has good knowledge to put into the design of the right kernel

  • Ongoing work: Collins' relabeling task

– Same problem setting (ranking on forests)
– Less computation involved (1 forest per sentence vs. 1 forest per word)

SLIDE 52

Thanks:

Patrick Sturt, University of Glasgow
Vincenzo Lombardo, Università di Torino
Giovanni Soda, Università di Firenze