Improved Relation Extraction with Feature-Rich Compositional Embedding Models (Mo Yu*, Matt Gormley*, Mark Dredze). Slide transcript of a PowerPoint presentation.


SLIDE 1

Improved Relation Extraction with Feature-Rich Compositional Embedding Models

September 21, 2015 EMNLP


Mo Yu* Matt Gormley* Mark Dredze

*Co-first authors

SLIDE 2

FCM or: How I Learned to Stop Worrying (about Deep Learning) and Love Features


SLIDE 3

Handcrafted Features

[Figure: constituency parse of the sentence "Egypt-born Proyas directed", annotated with POS tags (NNP, VBN, VBD), entity tags (PER on "Proyas", LOC on "Egypt"), phrase labels (S, NP, VP, ADJP), and lemmas (egypt, born, proyas, direct), from which handcrafted features f(x) are extracted.]

p(y|x) ∝ exp(Θy · f(x))

Predicted relation: born-in
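The scoring rule on this slide is just multinomial logistic regression over binary features. A minimal numpy sketch, with hypothetical labels, weights, and feature values chosen only for illustration:

```python
import numpy as np

labels = ["born-in", "employed-by", "nil"]        # hypothetical label set
theta = np.array([[1.2, 0.3, -0.5, 0.8],          # Theta_y: one weight row per label y
                  [-0.4, 0.9, 0.6, -0.2],
                  [0.1, -0.7, 0.2, 0.0]])
f_x = np.array([1.0, 0.0, 1.0, 1.0])              # f(x): binary features that fired

scores = theta @ f_x                              # Theta_y . f(x) for every label y
p_y = np.exp(scores) / np.exp(scores).sum()       # p(y|x) ∝ exp(Theta_y . f(x))
print(labels[int(np.argmax(p_y))])                # highest-scoring relation
```

With these toy weights the fired features push the model toward born-in, matching the slide's example.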

SLIDE 4

Where do features come from?


Feature Engineering Feature Learning

hand-crafted features

Sun et al., 2011; Zhou et al., 2005. Example feature templates:

  • First word before M1
  • Second word before M1
  • Bag-of-words in M1
  • Head word of M1
  • Other word in between
  • First word after M2
  • Second word after M2
  • Bag-of-words in M2
  • Head word of M2
  • Bigrams in between
  • Words on dependency path
  • Country name list
  • Personal relative triggers
  • Personal title list
  • WordNet tags
  • Heads of chunks in between
  • Path of phrase labels
  • Combination of entity types

SLIDE 5

Where do features come from?


Feature Engineering Feature Learning

hand-crafted features

Sun et al., 2011 Zhou et al., 2005

word embeddings

Mikolov et al., 2013

CBOW model in Mikolov et al. (2013):

[Figure: the input context words are mapped through a look-up table to embeddings, and a classifier predicts the missing word. Example look-up rows: dog: [0.13, .26, …, .52]; cat: [0.11, .23, …, .45].]

Similar words get similar embeddings; training is unsupervised.

SLIDE 6

Where do features come from?


Feature Engineering Feature Learning

hand-crafted features

Sun et al., 2011 Zhou et al., 2005

word embeddings

Mikolov et al., 2013

string embeddings

Collobert & Weston, 2008 Socher, 2011

Convolutional Neural Networks (Collobert and Weston, 2008):

[Figure: a CNN with pooling applied to the sentence "The [movie] showed [wars]".]

Recursive Auto Encoder (Socher, 2011):

[Figure: an RAE composing the same sentence, "The [movie] showed [wars]", bottom-up.]

SLIDE 7

Where do features come from?


Feature Engineering Feature Learning

hand-crafted features

Sun et al., 2011 Zhou et al., 2005

word embeddings

Mikolov et al., 2013

tree embeddings

Socher et al., 2013 Hermann & Blunsom, 2013

string embeddings

Collobert & Weston, 2008 Socher, 2011

[Figure: tree embeddings compose the parse (S → NP VP) of "The [movie] showed [wars]" bottom-up, using composition matrices indexed by syntactic category pairs: WNP,VP, WDT,NN, WV,NN.]

SLIDE 8

Where do features come from?

Feature Engineering Feature Learning

  • hand-crafted features (Sun et al., 2011; Zhou et al., 2005)
  • word embeddings (Mikolov et al., 2013)
  • string embeddings (Collobert & Weston, 2008; Socher, 2011)
  • tree embeddings (Socher et al., 2013; Hermann & Blunsom, 2013; Hermann et al., 2014)
  • word embedding features (Turian et al., 2010; Koo et al., 2008)

SLIDE 9

Where do features come from?

Feature Engineering Feature Learning

  • hand-crafted features (Sun et al., 2011; Zhou et al., 2005)
  • word embeddings (Mikolov et al., 2013)
  • string embeddings (Collobert & Weston, 2008; Socher, 2011)
  • tree embeddings (Socher et al., 2013; Hermann & Blunsom, 2013; Hermann et al., 2014)
  • word embedding features (Turian et al., 2010; Koo et al., 2008)
  • Our model (FCM)

SLIDE 10

Feature-rich Compositional Embedding Model (FCM)

Goals for our Model:

  • 1. Incorporate semantic/syntactic structural information
  • 2. Incorporate word meaning
  • 3. Bridge the gap between feature engineering and feature learning, but remain as simple as possible

SLIDE 11

Feature-rich Compositional Embedding Model (FCM)

The [movie]M1 I watched depicted [hope]M2

[Figure: each word wi is tagged with a coarse word class (nil, noun-other, noun-person, verb-perception, verb-communication) and carries a binary feature vector (f1 … f6, one per word).]

Per-word Features:

  • on-path(wi)
  • is-between(wi)
  • head-of-M1(wi)
  • head-of-M2(wi)
  • before-M1(wi)
  • before-M2(wi)

SLIDE 12

Feature-rich Compositional Embedding Model (FCM)

The [movie]M1 I watched depicted [hope]M2

[Figure: the same example, now highlighting only f5, the binary feature vector of the fifth word, "depicted".]

Per-word Features:

  • on-path(wi)
  • is-between(wi)
  • head-of-M1(wi)
  • head-of-M2(wi)
  • before-M1(wi)
  • before-M2(wi)

SLIDE 13

Feature-rich Compositional Embedding Model (FCM)

The [movie]M1 I watched depicted [hope]M2

Per-word Features (with conjunction), shown for f5:

  • on-path(wi) & wi = "depicted"
  • is-between(wi) & wi = "depicted"
  • head-of-M1(wi) & wi = "depicted"
  • head-of-M2(wi) & wi = "depicted"
  • before-M1(wi) & wi = "depicted"
  • before-M2(wi) & wi = "depicted"

SLIDE 14

Feature-rich Compositional Embedding Model (FCM)

The [movie]M1 I watched depicted [hope]M2

Per-word Features (with soft conjunction), shown for f5:

  • on-path(wi)
  • is-between(wi)
  • head-of-M1(wi)
  • head-of-M2(wi)
  • before-M1(wi)
  • before-M2(wi)

[Figure: the binary feature vector f5 is combined with the word embedding edepicted (e.g. [.3, .9, .1, …, 1]) by an outer product.]

SLIDE 15

Feature-rich Compositional Embedding Model (FCM)

The [movie]M1 I watched depicted [hope]M2

Per-word Features (with soft conjunction), shown for f5:

  • on-path(wi)
  • is-between(wi)
  • head-of-M1(wi)
  • head-of-M2(wi)
  • before-M1(wi)
  • before-M2(wi)

[Figure: the outer product f5 ⊗ edepicted produces a matrix containing one copy of the embedding edepicted (e.g. [.3, .9, .1, …, 1]) per active feature, each scaled by that feature's value.]
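The outer product on these slides is a single numpy call; the feature values below are read off the slide, and the 3-dimensional embedding is a made-up stand-in for edepicted:

```python
import numpy as np

# Soft feature values for w5 = "depicted", as on the slide (truncated)
f5 = np.array([0.3, 0.9, 0.1, 1.0])
# Hypothetical low-dimensional embedding for "depicted"
e_depicted = np.array([0.2, -0.1, 0.5])

# Outer product: one copy of the embedding per feature, scaled by that
# feature's value; result has shape (num_features, embedding_dim)
g5 = np.outer(f5, e_depicted)
print(g5.shape)
```

Each row of g5 is the embedding scaled by one feature's value, which is exactly the "soft conjunction" picture above.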

SLIDE 16

Feature-rich Compositional Embedding Model (FCM)

p(y|x) ∝ exp( Σi=1..n Ty ⊙ (fi ⊗ ewi) )

Our full model sums over each word in the sentence, then takes the dot product with a parameter tensor Ty, and finally exponentiates and renormalizes.
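The full scoring rule can be sketched with einsum; every size and parameter value below is an arbitrary stand-in, since Ty and the embeddings are learned in the real model:

```python
import numpy as np

rng = np.random.default_rng(0)
num_labels, num_feats, emb_dim, n = 3, 6, 5, 4         # hypothetical sizes

T = rng.normal(size=(num_labels, num_feats, emb_dim))  # parameter tensor T_y
F = rng.random(size=(n, num_feats))                    # f_i for each word i
E = rng.normal(size=(n, emb_dim))                      # e_{w_i} for each word i

# score_y = sum_i T_y (dot) (f_i outer e_{w_i}), contracted in one einsum call
scores = np.einsum('yfd,if,id->y', T, F, E)
p = np.exp(scores - scores.max())                      # exponentiate (stably)
p /= p.sum()                                           # renormalize: p(y|x)
```

The einsum contracts over words, features, and embedding dimensions at once, avoiding ever materializing the n outer-product matrices.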

SLIDE 17

Features for FCM

  • Let M1 and M2 denote the left and right entity mentions
  • Our per-word binary features:
    – head of M1
    – head of M2
    – in-between M1 and M2
    – -2, -1, +1, or +2 of M1
    – -2, -1, +1, or +2 of M2
    – on dependency path between M1 and M2
  • Optionally: add the entity type of M1, M2, or both
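A sketch of these binary feature templates, under simplifying assumptions that are not in the slide: mentions are given as inclusive token-index spans, the last token of a span stands in for its head, the ±2 windows are folded into single context features, and the dependency-path feature is omitted because it needs a parse:

```python
def word_features(i, m1, m2):
    """Binary features for token position i; m1 and m2 are (start, end) spans."""
    feats = set()
    if i == m1[1]:
        feats.add("head-of-M1")          # last token as head (assumption)
    if i == m2[1]:
        feats.add("head-of-M2")
    if m1[1] < i < m2[0]:
        feats.add("in-between")
    if i in (m1[0] - 2, m1[0] - 1, m1[1] + 1, m1[1] + 2):
        feats.add("context-of-M1")       # -2, -1, +1, or +2 of M1
    if i in (m2[0] - 2, m2[0] - 1, m2[1] + 1, m2[1] + 2):
        feats.add("context-of-M2")
    return feats or {"nil"}

# "The [movie] I watched depicted [hope]": M1 = (1, 1), M2 = (5, 5)
print(word_features(4, (1, 1), (5, 5)))  # features for "depicted"
```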

SLIDE 18

FCM as a Neural Network

[Figure: the FCM drawn as a neural network. Each word wi contributes the outer product of its binary feature vector fi and its embedding ewi; these are summed, combined with the parameter tensor T, and normalized to produce p(y|x). Layer labels: binary features, embeddings, parameter tensor.]

  • Embeddings are (optionally) treated as model parameters
  • A log-bilinear model
  • We initialize, then fine-tune the embeddings
SLIDE 19

Baseline Model

[Figure: the relation variable Yi,j is predicted from the parse-annotated sentence "Egypt-born Proyas directed" (POS tags NNP, VBN, VBD; entity tags PER, LOC; lemmas egypt, born, proyas, direct) via handcrafted features f(x), with predicted relation born-in.]

p(y|x) ∝ exp(Θy · f(x))

  • Multinomial logistic regression (standard approach)
  • Bring in all the usual binary NLP features (Sun et al., 2011)
    – type of the left entity mention
    – dependency path between mentions
    – bag of words in right mention
    – …

SLIDE 20

Hybrid Model: Baseline + FCM

[Figure: the baseline log-linear model and the FCM network side by side, both predicting the same relation variable Yi,j for the sentence "Egypt-born Proyas directed".]

Product of Experts:

p(y|x) = (1/Z(x)) · pBaseline(y|x) · pFCM(y|x)
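The product of experts is a two-line operation; the posteriors below are hypothetical outputs of the two models over three relation labels:

```python
import numpy as np

def product_of_experts(p_baseline, p_fcm):
    """p(y|x) = pBaseline(y|x) * pFCM(y|x) / Z(x), with Z(x) the sum over labels."""
    joint = p_baseline * p_fcm
    return joint / joint.sum()

p_b = np.array([0.6, 0.3, 0.1])   # hypothetical baseline posterior
p_f = np.array([0.5, 0.4, 0.1])   # hypothetical FCM posterior
print(product_of_experts(p_b, p_f))
```

When both experts favor the same label, the combined distribution is sharper than either one alone; a label either expert considers unlikely is suppressed.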

SLIDE 21

Experimental Setup

ACE 2005

  • Data: 6 domains

  – Newswire (nw)
  – Broadcast Conversation (bc)
  – Broadcast News (bn)
  – Telephone Speech (cts)
  – Usenet Newsgroups (un)
  – Weblogs (wl)

  • Train: bn+nw (~3,600 relations)
  • Dev: ½ of bc
  • Test: ½ of bc, cts, wl

  • Metric: Micro F1 (given entity mention)

SemEval-2010 Task 8

  • Data: Web text
  • Train/Dev/Test: standard split from the shared task
  • Metric: Macro F1 (given entity boundaries)
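The two metrics aggregate per-label counts differently: micro F1 pools true/false positives over all labels, while macro F1 averages per-label F1 scores. A toy sketch (not the official scorers, which have extra conventions, e.g. for how the Other/nil label is handled):

```python
from collections import Counter

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro(gold, pred, labels):
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1          # predicted label gets a false positive
            fn[g] += 1          # gold label gets a false negative
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

gold = ["born-in", "born-in", "nil", "emp-org"]   # hypothetical labels
pred = ["born-in", "nil", "nil", "born-in"]
print(micro_macro(gold, pred, ["born-in", "nil", "emp-org"]))
```

Micro F1 weights every instance equally, so frequent labels dominate; macro F1 weights every label equally, so rare labels matter as much as common ones.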

SLIDE 22

ACE 2005 Results

[Figure: bar chart of Micro F1 (45%–65%) on the Broadcast Conversation, Conversational Telephone Speech, and Weblogs test sets, comparing Baseline, FCM, and Baseline+FCM.]

SLIDE 23

SemEval-2010 Results

Source | Classifier | F1
Socher et al. (2012) | RNN | 74.8
Socher et al. (2012) | MVRNN | 79.1
Hashimoto et al. (2015) | RelEmb | 81.8
Rink and Harabagiu (2010) | SVM (Best in SemEval-2010 Shared Task) | 82.2
Zeng et al. (2014) | CNN | 82.7
Santos et al. (2015) | CR-CNN (log-loss) | 82.7
Liu et al. (2015) | DepNN | 82.8
Hashimoto et al. (2015) | RelEmb (task-spec-emb) | 82.8

SLIDE 24

SemEval-2010 Results

Source | Classifier | F1
Socher et al. (2012) | RNN | 74.8
Socher et al. (2012) | MVRNN | 79.1
Hashimoto et al. (2015) | RelEmb | 81.8
Rink and Harabagiu (2010) | SVM (Best in SemEval-2010 Shared Task) | 82.2
Zeng et al. (2014) | CNN | 82.7
Santos et al. (2015) | CR-CNN (log-loss) | 82.7
Liu et al. (2015) | DepNN | 82.8
Hashimoto et al. (2015) | RelEmb (task-spec-emb) | 82.8
This work | FCM (log-linear) | 81.4
This work | FCM (log-bilinear) | 83.0


SLIDE 26

SemEval-2010 Results

Source | Classifier | F1
Socher et al. (2012) | RNN | 74.8
Socher et al. (2012) | MVRNN | 79.1
Hashimoto et al. (2015) | RelEmb | 81.8
Rink and Harabagiu (2010) | SVM | 82.2
Xu et al. (2015) | SDP-LSTM | 82.4
Zeng et al. (2014) | CNN | 82.7
Santos et al. (2015) | CR-CNN (log-loss) | 82.7
Liu et al. (2015) | DepNN | 82.8
Hashimoto et al. (2015) | RelEmb (task-spec-emb) | 82.8
Xu et al. (2015) | SDP-LSTM (full) | 83.7
This work | FCM (log-linear) | 81.4
This work | FCM (log-bilinear) | 83.0

SLIDE 27

SemEval-2010 Results

Source | Classifier | F1
Socher et al. (2012) | RNN | 74.8
Socher et al. (2012) | MVRNN | 79.1
Hashimoto et al. (2015) | RelEmb | 81.8
Rink and Harabagiu (2010) | SVM | 82.2
Xu et al. (2015) | SDP-LSTM | 82.4
Zeng et al. (2014) | CNN | 82.7
Santos et al. (2015) | CR-CNN (log-loss) | 82.7
Liu et al. (2015) | DepNN | 82.8
Hashimoto et al. (2015) | RelEmb (task-spec-emb) | 82.8
Xu et al. (2015) | SDP-LSTM (full) | 83.7
This work | FCM (log-linear) | 81.4
This work | FCM (log-bilinear) | 83.0
This work | FCM (log-bilinear, task-spec-emb) | 83.7

SLIDE 28

SemEval-2010 Results

Source | Classifier | F1
Socher et al. (2012) | RNN | 74.8
Socher et al. (2012) | MVRNN | 79.1
Hashimoto et al. (2015) | RelEmb | 81.8
Rink and Harabagiu (2010) | SVM | 82.2
Xu et al. (2015) | SDP-LSTM | 82.4
Zeng et al. (2014) | CNN | 82.7
Santos et al. (2015) | CR-CNN (log-loss) | 82.7
Liu et al. (2015) | DepNN | 82.8
Hashimoto et al. (2015) | RelEmb (task-spec-emb) | 82.8
Xu et al. (2015) | SDP-LSTM (full) | 83.7
Santos et al. (2015) | CR-CNN (ranking-loss) | 84.1
This work | FCM (log-linear) | 81.4
This work | FCM (log-bilinear) | 83.0
This work | FCM (log-bilinear, task-spec-emb) | 83.7

SLIDE 29

Takeaways

FCM bridges the gap between feature engineering and feature learning.

If you are allergic to deep learning:

  – Try the FCM for your task: it is simple, easy-to-implement, and was shown to be effective on two relation extraction benchmarks.

If you are a deep learning expert:

  – Inject the FCM (i.e. the outer product of features and embeddings) into your fancy deep network.

SLIDE 30

Questions?

Two open source implementations:

  – Java (within the Pacaya framework): https://github.com/mgormley/pacaya
  – C++ (from our NAACL 2015 paper on LRFCM): https://github.com/Gorov/ERE_RE