SLIDE 1

Low-Rank Tensors for Scoring Dependency Structures

Tao Lei

Yu Xin, Yuan Zhang, Regina Barzilay, Tommi Jaakkola

CSAIL, MIT

SLIDE 2

Our Goal

Dependency Parsing

  • Dependency parsing as a maximization problem:

$$y^* = \operatorname*{argmax}_{y \in \mathcal{T}(x)} S(x, y; \theta)$$

  • Key aspects of a parsing system:

1. An accurate scoring function $S(x, y; \theta)$
2. An efficient decoding procedure (the $\operatorname{argmax}$)

[Figure: running example "ROOT I ate cake with a fork today", with POS tags PRON VB NN IN DT NN NN]

SLIDE 3

Finding Expressive Feature Set

An accurate scoring function requires a rich, expressive set of manually-crafted feature templates.

Traditional view: a high-dimensional sparse vector $\varphi(x, y) \in \mathbb{R}^L$ whose entries indicate which features fire

Feature template: head POS ⨁ modifier POS ⨁ arc length
Feature example: "VB⨁NN⨁2"

SLIDE 4

Finding Expressive Feature Set

An accurate scoring function requires a rich, expressive set of manually-crafted feature templates.

Traditional view: a high-dimensional sparse vector $\varphi(x, y) \in \mathbb{R}^L$

Feature template: head word ⨁ modifier word
Feature example: "ate⨁cake"

SLIDE 5

Finding Expressive Feature Set

An accurate scoring function requires a rich, expressive set of manually-crafted feature templates.

Traditional view: a high-dimensional sparse vector $\varphi(x, y) \in \mathbb{R}^L$, scored against a parameter vector $\theta \in \mathbb{R}^L$ (e.g. weights 0.1, 0.3, 2.2, 1.1, 0.1, 0.9, …):

$$S_\theta(x, y) = \langle \theta, \varphi(x, y) \rangle$$
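
To make the traditional view concrete, here is a minimal sketch of vector-based scoring under the usual sparse representation (the feature strings and weights are hypothetical, not taken from the slides):

```python
# Sparse linear scoring: phi is a 0/1 indicator vector over feature
# concatenations, so <theta, phi> reduces to summing the weights of the
# features that fire for this arc.
theta = {"VB⨁NN⨁2": 2.2, "ate⨁cake": 1.1}  # hypothetical learned weights

def score_arc(firing_features):
    """Dot product <theta, phi(x, y)> with phi represented as the set of
    feature strings that fire; unseen features contribute zero."""
    return sum(theta.get(f, 0.0) for f in firing_features)

print(score_arc(["VB⨁NN⨁2", "ate⨁cake", "unseen⨁feature"]))  # 2.2 + 1.1
```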

SLIDE 6

Traditional Scoring Revisited

  • Features and templates are manually-selected concatenations of atomic features in traditional vector-based scoring.

Atomic features of the arc ate → cake:

                Word   POS   POS+Word   Left POS   Right POS
  Head          ate    VB    VB+ate     PRON       NN
  Modifier      cake   NN    NN+cake    VB         IN
  Attach length?   Yes / No

Arc features (concatenations, ⨁):
  HW_MW_LEN: ate⨁cake⨁2

SLIDE 7

Traditional Scoring Revisited

  • Features and templates are manually-selected concatenations of atomic features in traditional vector-based scoring.

Same atomic features of the arc ate → cake as above; arc features now:
  HW_MW_LEN: ate⨁cake⨁2
  HW_MW: ate⨁cake

SLIDE 8

Traditional Scoring Revisited

  • Features and templates are manually-selected concatenations of atomic features in traditional vector-based scoring.

Same atomic features of the arc ate → cake as above; arc features now:
  HW_MW_LEN: ate⨁cake⨁2
  HW_MW: ate⨁cake
  HP_MP_LEN: VB⨁NN⨁2
  HP_MP: VB⨁NN
  … …

SLIDE 9

Traditional Scoring Revisited

  • Problem: it is very difficult to pick the best subset of concatenations
    - Too few templates: lose performance
    - Too many templates: too many parameters to estimate
    - Searching for the best set? Features are correlated and the choices are exponential
  • Our approach: use a low-rank tensor (i.e. a multi-way array)
  • Capture a whole range of feature combinations
  • Keep the parameter estimation problem in control
SLIDE 10

Low-Rank Tensor Scoring: Formulation

  • Formulate ALL possible concatenations as a rank-1 tensor

$\varphi_h$: atomic head feature vector (ate, VB, VB+ate, PRON, NN, …)
$\varphi_m$: atomic modifier feature vector (cake, NN, NN+cake, VB, IN, …)
$\varphi_{h,m}$: atomic arc feature vector (attach length: yes/no, …)

SLIDE 11

Low-Rank Tensor Scoring: Formulation

  • Formulate ALL possible concatenations as a rank-1 tensor

πœšβ„Ž πœšπ‘› πœšβ„Ž,𝑛

βŠ— βŠ—

atomic head feature vector atomic modifier feature vector atomic arc feature vector

∈ β„π‘œΓ—π‘œΓ—π‘’

𝑦⨂𝑧⨂𝑨 π‘—π‘˜π‘™ = π‘¦π‘—π‘§π‘˜π‘¨π‘™ tensor product Each entry indicates the occurrence of one feature concatenation 11
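
As a quick illustration (with toy dimensions, not the real feature set), the rank-1 tensor can be built explicitly and checked entry by entry:

```python
# Rank-1 tensor of three atomic feature vectors: entry (j, k, l) fires
# exactly when head feature j, modifier feature k and arc feature l all
# fire together, i.e. it encodes one feature concatenation.
import numpy as np

phi_h = np.array([1.0, 0.0, 1.0])   # toy atomic head features
phi_m = np.array([0.0, 1.0, 1.0])   # toy atomic modifier features
phi_hm = np.array([1.0, 0.0])       # toy atomic arc features

T = np.einsum('j,k,l->jkl', phi_h, phi_m, phi_hm)
assert T[0, 1, 0] == phi_h[0] * phi_m[1] * phi_hm[0] == 1.0
print(T.shape)  # (3, 3, 2)
```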

SLIDE 12

Low-Rank Tensor Scoring: Formulation

  • Formulate ALL possible concatenations as a rank-1 tensor
  • Formulate the parameters as a tensor as well

(vector-based)  $S_\theta(h \to m) = \langle \theta, \varphi_{h \to m} \rangle$, with $\theta \in \mathbb{R}^L$

(tensor-based)  $S_{tensor}(h \to m) = \langle A, \varphi_h \otimes \varphi_m \otimes \varphi_{h,m} \rangle$, with $A \in \mathbb{R}^{n \times n \times d}$

The parameter tensor $A$ can be huge (on English, $n \times n \times d \approx 10^{11}$) and involves features not in $\theta$.

SLIDE 13
Low-Rank Tensor Scoring: Formulation

  • Formulate ALL possible concatenations as a rank-1 tensor
  • Formulate the parameters as a low-rank tensor

(vector-based)  $S_\theta(h \to m) = \langle \theta, \varphi_{h \to m} \rangle$

(tensor-based)  $S_{tensor}(h \to m) = \langle A, \varphi_h \otimes \varphi_m \otimes \varphi_{h,m} \rangle$

Low-rank tensor: a sum of r rank-1 tensors,

$$A = \sum_{j=1}^{r} U(j) \otimes V(j) \otimes W(j)$$

with $U, V \in \mathbb{R}^{r \times n}$ and $W \in \mathbb{R}^{r \times d}$, where $U(j)$ denotes the $j$-th row of $U$.
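
A small sketch of what the factorization buys (toy dimensions; the real $n$ and $d$ are vocabulary-sized). The full tensor is materialized here only to show the shapes, and the parameter counts are compared:

```python
# Low-rank parameter tensor A = sum_j U(j) (x) V(j) (x) W(j),
# where U(j) is the j-th row of U.
import numpy as np

n, d, r = 4, 3, 2
rng = np.random.default_rng(0)
U, V = rng.normal(size=(r, n)), rng.normal(size=(r, n))
W = rng.normal(size=(r, d))

A = np.einsum('jp,jq,js->pqs', U, V, W)  # materialized only for illustration
print(A.shape)                           # (4, 4, 3)
print(n * n * d, "full-tensor entries vs", r * (2 * n + d), "factor parameters")
```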

SLIDE 14

Low-Rank Tensor Scoring: Formulation

$$S_{tensor}(h \to m) = \langle A, \varphi_h \otimes \varphi_m \otimes \varphi_{h,m} \rangle = \sum_{j=1}^{r} [U\varphi_h]_j \, [V\varphi_m]_j \, [W\varphi_{h,m}]_j \qquad \text{where } A = \sum_{j=1}^{r} U(j) \otimes V(j) \otimes W(j)$$

Dense low-dimensional representations:

$$U\varphi_h,\; V\varphi_m,\; W\varphi_{h,m} \in \mathbb{R}^r \qquad \text{(dense matrix} \times \text{sparse vector} \Rightarrow \text{dense vector)}$$

SLIDE 15

Low-Rank Tensor Scoring: Formulation

$$S_{tensor}(h \to m) = \langle A, \varphi_h \otimes \varphi_m \otimes \varphi_{h,m} \rangle \qquad A = \sum_{j=1}^{r} U(j) \otimes V(j) \otimes W(j)$$

Computed in three cheap steps (see the sketch below):

1. Dense low-dimensional representations: $U\varphi_h,\; V\varphi_m,\; W\varphi_{h,m} \in \mathbb{R}^r$
2. Element-wise products: $[U\varphi_h]_j \, [V\varphi_m]_j \, [W\varphi_{h,m}]_j$
3. Sum over these products: $S_{tensor}(h \to m) = \sum_{j=1}^{r} [U\varphi_h]_j \, [V\varphi_m]_j \, [W\varphi_{h,m}]_j$
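
The following numpy sketch (toy dimensions again) checks that the three-step computation matches the naive inner product with the full tensor, which is the whole point: $A$ itself never has to be materialized.

```python
# Tensor scoring two ways: via the materialized tensor (slow, memory-hungry)
# and via the factors (three matrix-vector products, an element-wise
# product, and a sum).
import numpy as np

n, d, r = 4, 3, 2
rng = np.random.default_rng(0)
U, V = rng.normal(size=(r, n)), rng.normal(size=(r, n))
W = rng.normal(size=(r, d))
phi_h, phi_m = np.eye(n)[1], np.eye(n)[2]   # sparse indicator vectors
phi_hm = np.eye(d)[0]

A = np.einsum('jp,jq,js->pqs', U, V, W)
slow = np.einsum('pqs,p,q,s->', A, phi_h, phi_m, phi_hm)
fast = ((U @ phi_h) * (V @ phi_m) * (W @ phi_hm)).sum()
assert np.isclose(slow, fast)
```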

SLIDE 16

Intuition and Explanations

Example: Collaborative Filtering. Approximate user ratings via a low-rank matrix.

[Figure: sparse user-rating matrix A with missing entries "?"]

  • Ratings are not completely independent
  • Users have hidden preferences over properties ($U^{2 \times n}$: preferences over "price" and "quality")
  • Items share hidden properties ($V^{2 \times m}$: "price" and "quality")
SLIDE 17

Intuition and Explanations

Example: Collaborative Filtering. Approximate user ratings via a low-rank matrix:

$$A \approx U^{\mathsf{T}} V = \sum_{j=1}^{r} U(j) \otimes V(j) = U(1) \otimes V(1) + \cdots + U(r) \otimes V(r)$$

Intuition: data and parameters can be approximately characterized by a small number of hidden factors (e.g. "price", "quality").

Number of parameters: $n \times m$ for the full user-rating matrix vs. $(n + m)\,r$ for the rank-r factorization.
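
For a hypothetical sense of scale (numbers ours, not from the slides): with $n = m = 10^4$ users and items and rank $r = 10$, the full matrix has $n \times m = 10^8$ entries, while the factors have only $(n + m)\,r = 2 \times 10^5$ parameters, a 500-fold reduction.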

SLIDE 18

Intuition and Explanations

Our case: approximate the parameters (feature weights) via a low-rank tensor:

$$A = \sum_{j=1}^{r} U(j) \otimes V(j) \otimes W(j)$$

[Figure: the parameter tensor A approximated as a sum of rank-1 tensors]

  • Hidden properties are associated with each word
  • Parameter values are shared via the hidden properties
  • e.g. the parameters for "apple" and "banana" get similar values because the two words have similar syntactic behavior

SLIDE 19

Low-Rank Tensor Scoring: Summary

  • Naturally captures full feature expansion (concatenations)
    - without manually specifying a bunch of feature templates
  • Easily add and utilize new, auxiliary features
    - simply append them as atomic features

Atomic head features: ate, VB, VB+ate, PRON, NN, person:I, number:singular, Emb[1]: -0.0128, Emb[2]: 0.5392, …

  • Controlled feature expansion via low rank (small r)
    - better feature tuning and optimization

SLIDE 20

Combined Scoring

  • Combine traditional and tensor scoring in $S_\gamma(x, y)$:

$$S_\gamma(x, y) = \gamma \cdot S_\theta(x, y) + (1 - \gamma) \cdot S_{tensor}(x, y), \qquad \gamma \in [0, 1]$$

$S_\theta$: a set of manually selected features. $S_{tensor}$: full feature expansion controlled by low rank. A similar "sparse + low-rank" idea appears in matrix decomposition: Tao and Yuan, 2011; Zhou and Tao, 2011; Waters et al., 2011; Chandrasekaran et al., 2011.

  • Final maximization problem, given parameters $\theta, U, V, W$:

$$y^* = \operatorname*{argmax}_{y \in \mathcal{T}(x)} S_\gamma(x, y; \theta, U, V, W)$$
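
A one-line sketch of the combination, reusing the shapes from the earlier snippets (the value gamma = 0.3 mirrors the setting reported later in the deck; everything else is illustrative):

```python
# Combined arc score: gamma * traditional linear score
# + (1 - gamma) * low-rank tensor score.
def combined_arc_score(s_theta, U, V, W, phi_h, phi_m, phi_hm, gamma=0.3):
    s_tensor = ((U @ phi_h) * (V @ phi_m) * (W @ phi_hm)).sum()
    return gamma * s_theta + (1 - gamma) * s_tensor
```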

SLIDE 21

Learning Problem

  • Given a training set $D = \{(x_i, y_i)\}_{i=1}^{N}$
  • Search for parameter values that score the gold trees higher than all others:

$$\forall y \in \mathcal{T}(x_i): \quad S(x_i, y_i) \geq S(x_i, y) + \|y_i - y\| - \xi_i$$

($\|y_i - y\|$ is the structured margin; $\xi_i \geq 0$ is a non-negative slack, so unsatisfied constraints are penalized)

  • The training objective (training loss + regularization):

$$\min_{\theta, U, V, W,\; \xi_i \geq 0} \; C \sum_{i \in D} \xi_i \;+\; \|U\|^2 + \|V\|^2 + \|W\|^2 + \|\theta\|^2$$

Calculating the loss requires solving the expensive maximization problem; following common practice, we adopt an online learning framework.

SLIDE 22

βˆ‘π‘—=1

𝑠

π‘‰πœšβ„Ž 𝑗 π‘Šπœšπ‘› 𝑗 π‘‹πœšβ„Ž,𝑛 𝑗 is not linear nor convex

22

(ii) choose to update a pair of sets πœ„, 𝑉 , πœ„, π‘Š or πœ„, 𝑋 :

min

βˆ†πœ„,βˆ†π‘‰

1 2 βˆ†πœ„ 2 + 1 2 βˆ†π‘‰ 2 + π·πœŠπ‘— πœ„(𝑒+1) = πœ„(𝑒) + βˆ†πœ„, 𝑉(𝑒+1) = 𝑉(𝑒) + βˆ†π‘‰ Increments: Sub-problem:

Online Learning

  • Use passive-aggressive algorithm (Crammer et al. 2006) tailored to our tensor

setting

β‹―

𝑦𝑗, 𝑧𝑗

β‹―

𝑦1, 𝑧1 𝑦𝑂, 𝑧𝑂

(i) Iterate over training samples successively:

β‹―

revise parameter values for i-th training sample Efficient parameter update via closed-form solution
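
A minimal sketch of a closed-form passive-aggressive step for one such sub-problem, assuming (as in PA-I) a hinge loss and a single aggregated feature-difference vector g; the structured details of the parser are abstracted away:

```python
# One PA-I update (Crammer et al. 2006): with two of the three factors
# frozen, the score is linear in the remaining parameters, so the classic
# closed-form step applies to their concatenation.
import numpy as np

def pa_update(w, g, hinge_loss, C):
    """w: current parameters (flattened); g: gold-minus-predicted feature
    representation; hinge_loss: margin violation of the current prediction."""
    if hinge_loss <= 0:
        return w                          # constraint satisfied: stay passive
    tau = min(C, hinge_loss / (g @ g))    # closed-form step size, capped by C
    return w + tau * g                    # move just enough along g
```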

SLIDE 23

Experiment Setup


Datasets

  • 14 languages in CoNLL 2006 & 2008 shared tasks

Features

  • Only 16 atomic word features for the tensor
  • Combine with 1st-order (single arc) and up to 3rd-order (three arcs) features used in the MST/Turbo parsers

[Figure: first- to third-order parts: arc (h, m), grandparent (g, h, m), sibling (h, m, s), grand-sibling (g, h, m, s), tri-sibling (h, m, s, t), …]

SLIDE 24

Experiment Setup

Datasets

  • 14 languages in the CoNLL 2006 & 2008 shared tasks

Features

  • Only 16 atomic word features for the tensor
  • Combine with 1st-order (single arc) and up to 3rd-order (three arcs) features used in the MST/Turbo parsers

Implementation

  • By default, the rank of the tensor is r = 50
  • Train 10 iterations for all 14 languages
  • The 3-way tensor captures only 1st-order arc-based features

SLIDE 25

Baselines and Evaluation Measure

  • MST and Turbo parsers: representative graph-based parsers that use a similar set of features
  • NT-1st and NT-3rd: variants of our model with the tensor component removed; a reimplementation of the MST and Turbo parser features
  • Unlabeled Attachment Score (UAS), evaluated without punctuation

SLIDE 26

Overall 1st-order Results

  • > 0.7% average improvement
  • Outperforms the baselines on 11 out of 14 languages

[Chart: average UAS across 14 languages: Our Model 87.76%, NT-1st 87.05%, MST 86.50%, Turbo 86.83%]

SLIDE 27

Impact of Tensor Component

[Chart: UAS (84.0% to 88.0%) vs. # iterations (1 to 10)]

  • No tensor (γ = 1)

SLIDE 28

Impact of Tensor Component

  • Tensor component achieves better generalization on test data

[Chart: UAS (84.0% to 88.0%) vs. # iterations (1 to 10)]

  • No tensor (γ = 1)
  • Tensor only (γ = 0)

SLIDE 29

Impact of Tensor Component

  • Tensor component achieves better generalization on test data

[Chart: UAS (84.0% to 88.0%) vs. # iterations (1 to 10)]

  • No tensor (γ = 1)
  • Tensor only (γ = 0)
  • Combined (γ = 0.3)
  • Combined scoring outperforms either single component

SLIDE 30

Overall 3rd-order Results

[Chart: average UAS: Our Model 89.08%, Turbo 88.73%, NT-3rd 88.66%]

  • Our traditional scoring component is just as good as the state-of-the-art system

SLIDE 31

Overall 3rd-order Results

  • The 1st-order tensor component remains useful on high-order parsing
  • Outperforms state-of-the-art single system
  • Achieves best published results on 5 languages

[Chart: average UAS: Our Model 89.08%, Turbo 88.73%, NT-3rd 88.66%]

SLIDE 32

Leveraging Auxiliary Features

  • Unsupervised word embeddings are publicly available*
  • Append the embeddings of the current, previous and next words to $\varphi_h$ and $\varphi_m$
    - English, German and Swedish have word embeddings in this dataset
    - $\varphi_h \otimes \varphi_m$ alone involves more than $(50 \times 3)^2$ additional values for 50-dimensional embeddings!
  • Absolute UAS improvement from adding the embeddings:

[Chart: absolute UAS improvement (0.1 to 0.6) for Swedish, German and English, 1st-order and 3rd-order models]

* https://github.com/wolet/sprml13-word-embeddings

SLIDE 33

Conclusion

  • Modeling: we introduced a low-rank tensor factorization model for scoring dependency arcs
  • Learning: we proposed an online learning method that directly optimizes the low-rank factorization for parsing performance, achieving state-of-the-art results
  • Opportunities & challenges: we hope to apply this idea to other structures and NLP problems

Source code available at: https://github.com/taolei87/RBGParser

SLIDE 34


SLIDE 35

Rank of the Tensor

[Chart: UAS (80.0 to 95.0) vs. tensor rank r (10 to 70) for Japanese, English, Chinese and Slovene]

SLIDE 36

Choices of Gamma
