SFU NatLangLab: Bootstrapping via Graph Propagation. Max Whitney. PowerPoint presentation.



SFU NatLangLab

Bootstrapping via Graph Propagation

Max Whitney Anoop Sarkar Simon Fraser University Natural Language Laboratory http://natlang.cs.sfu.ca

Slide 1

Bootstrapping

◮ Semi-supervised (vs supervised)
◮ Single domain (vs domain adaptation)
◮ Small amount of seed data/rules (vs domain adaptation)

Assumption:

◮ No transductive learning

Slide 2

General approaches to semi-supervised learning

◮ Clustering

concept must be identifiable

◮ Maximum likelihood

problems with local optima; generative or discriminative

◮ Co-training

learn from agreement between models; need independent views

◮ Self-training

learn from agreement between features

Slide 3

The Yarowsky algorithm

◮ Yarowsky algorithm: self-training algorithm by David Yarowsky (1995)
◮ Works well empirically
◮ Little theoretical analysis
◮ Co-training by Avrim Blum and Tom Mitchell (1998): the paper has been cited over 1000 times, and received the 10-year Best Paper Award at the 25th International Conference on Machine Learning (2008)
◮ Collins and Singer (1999) provide Co-Boost: co-training with a per-iteration objective function and good accuracy
◮ Can we do the same for the Yarowsky algorithm?

Slide 4

Example task: word sense disambiguation

Data from Canadian Hansards (Eisner and Karakos, 2005):

◮ 2 labels (senses)
◮ features are adjacent and context (nearby) words
◮ 2 seed rules

Slide 5

Example task: word sense disambiguation

303 unlabelled training examples:

◮ Full time should be served for each sentence .
◮ The Liberals inserted a sentence of 14 words which reads :
◮ They get a concurrent sentence with no additional time added to their sentence .
◮ The words tax relief appeared in every second sentence in the federal government’s throne speech .
. . .

2 seed rules:
context: served   sense 1
context: reads    sense 2

→ 76.99% accuracy on unseen test set
→ non-seeded accuracy (Daume, 2011)
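Seed labelling on these examples is mechanical enough to sketch directly. This is a minimal illustration: treating the whole (pre-tokenized) sentence as a bag of words is a simplification of the actual adjacent-word and context features.

```python
# Apply the two seed rules to the raw sentences.
# NOTE: real features are adjacent words and context windows; treating the
# whole sentence as a bag of words is an illustrative simplification.
SEED_RULES = {"served": "sense 1", "reads": "sense 2"}

def label_with_seeds(sentence):
    """Return a seed label if a seed context word occurs, else None (unlabelled)."""
    words = sentence.lower().split()
    for context_word, sense in SEED_RULES.items():
        if context_word in words:
            return sense
    return None

examples = [
    "Full time should be served for each sentence .",
    "The Liberals inserted a sentence of 14 words which reads :",
    "They get a concurrent sentence with no additional time added to their sentence .",
]
labels = [label_with_seeds(s) for s in examples]  # → ['sense 1', 'sense 2', None]
```

Only the first two sentences match a seed rule; the third stays unlabelled until a later iteration learns a rule that covers it.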

Slide 6

Example task: named entity classification

Data from NYT (Collins and Singer, 1999):

◮ 3 labels (person, location, organization)
◮ spelling features from words in phrase, context features from parse tree
◮ 7 seed rules

Slide 7

Example task: named entity classification

89305 unlabelled training examples:

◮ Union Bank would automatically give it a foothold in this market in California .
◮ It is an ironic agreement , given Mr. Jobs’ historical disdain for IBM .
. . .

7 seed rules:
spelling: New-York        location
spelling: California      location
spelling: U.S.            location
spelling: Microsoft       organization
spelling: I.B.M.          organization
spelling: *Incorporated*  organization
spelling: *Mr.*           person

→ 89.97% test accuracy

Slide 8

Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)

seed DL → label data → train DL → threshold → repeat; final re-training (no threshold) → test

Seed DL:
1.0 context: served   sense 1
1.0 context: reads    sense 2

Labelled data:
Full time should be served for each sentence .
The Liberals inserted a sentence of 14 words which reads :
The sentence for such an offence would be a term of imprisonment for one year .
Mr. Speaker , I have a question based on the very last sentence of the hon. member .
. . .

Trained DL:
1.0  context: served          sense 1
1.0  context: reads           sense 2
.976 context: serv*           sense 1
.976 context: read*           sense 2
.969 next word: reads         sense 2
.969 next word: read*         sense 2
.955 previous word: his       sense 1
.955 previous word: hi*       sense 1
.955 context: inmate          sense 1
.917 previous word: their     sense 1
.917 previous word: relevant  sense 2
.917 previous word: next      sense 2
. . .
(a score threshold trims low-scoring rules before the next iteration)
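The whole loop can be sketched in a few lines. This is a minimal illustration under stated assumptions: the smoothed-precision scoring, the constants `ALPHA` and `THRESHOLD`, and the set-of-strings feature representation are illustrative choices, not the exact experimental settings.

```python
from collections import Counter

ALPHA = 0.1       # smoothing constant (illustrative assumption)
THRESHOLD = 0.9   # minimum rule score to keep (illustrative assumption)

def train_dl(labelled, num_labels):
    """Score each (feature, label) rule by smoothed precision; keep rules over the threshold."""
    count_fj, count_f = Counter(), Counter()
    for features, label in labelled:
        for f in features:
            count_fj[f, label] += 1
            count_f[f] += 1
    dl = {}
    for (f, j), c in count_fj.items():
        score = (c + ALPHA) / (count_f[f] + num_labels * ALPHA)
        if score >= THRESHOLD:
            dl[f, j] = score
    return dl

def apply_dl(dl, features):
    """Label an example by its highest-scoring matching rule, or None if no rule fires."""
    matches = [(score, j) for (f, j), score in dl.items() if f in features]
    return max(matches)[1] if matches else None

def yarowsky(examples, seed_dl, num_labels, iterations=5):
    """Alternate labelling the data with the current DL and retraining the DL."""
    dl = dict(seed_dl)
    for _ in range(iterations):
        labelled = [(x, apply_dl(dl, x)) for x in examples]
        labelled = [(x, y) for x, y in labelled if y is not None]
        dl = {**train_dl(labelled, num_labels), **seed_dl}  # seed rules are never dropped
    return dl

# Toy run: the rule for "time" is bootstrapped from its co-occurrence
# with the seeded "served" example.
seed = {("served", "sense 1"): 1.0, ("reads", "sense 2"): 1.0}
dl = yarowsky([{"served", "time"}, {"reads", "quote"}, {"time", "prison"}],
              seed, num_labels=2)
```

The third toy example has no seed match, but once `time` enters the decision list from the first example, the loop labels it on the next pass.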

Slide 9

Example decision list for the named entity task

Rank  Score     Feature                  Label
1     0.999900  New-York                 loc.
2     0.999900  California               loc.
3     0.999900  U.S.                     loc.
4     0.999900  Microsoft                org.
5     0.999900  I.B.M.                   org.
6     0.999900  Incorporated             org.
7     0.999900  Mr.                      per.
8     0.999976  U.S.                     loc.
9     0.999957  New-York-Stock-Exchange  loc.
10    0.999952  California               loc.
11    0.999947  New-York                 loc.
12    0.999946  court-in                 loc.
13    0.975154  Company-of               loc.
. . .

Context features are indicated by italics and seed rules by bold ranks in the original slide (formatting not preserved in this transcript).
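A decision list like the one above is applied by walking the ranked rules top-down and taking the first rule whose feature fires. A sketch using a few rows transcribed from the table (the set-of-strings feature representation is an illustrative simplification):

```python
# A decision list is a ranked rule list: the first matching rule wins.
DL = [
    ("New-York", "loc."),                 # rank 1 (seed)
    ("California", "loc."),               # rank 2 (seed)
    ("Microsoft", "org."),                # rank 4 (seed)
    ("Mr.", "per."),                      # rank 7 (seed)
    ("New-York-Stock-Exchange", "loc."),  # rank 9
]

def classify(features, dl=DL):
    """Return the label of the highest-ranked rule whose feature appears."""
    for feature, label in dl:
        if feature in features:
            return label
    return None  # no rule fires; example stays unlabelled

label = classify({"Mr.", "Jobs"})  # → 'per.'
```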

Slide 10

Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)

[Plot: DL size, number of labelled training examples, and test accuracy by iteration]

Iteration 0:
  sense 1: 1 rule (1.0 context: served), 4 labelled examples
  sense 2: 1 rule (1.0 context: reads), 4 labelled examples

Iteration 1:
  sense 1: 46 rules (1.0 context: served, .976 context: serv*, .976 context: served, .955 context: inmat*, .955 context: releas*, . . . ), 114 labelled
  sense 2: 31 rules (1.0 context: reads, .976 context: read*, .976 context: reads, .969 next: read*, .969 next: reads, . . . ), 37 labelled

Iteration 2:
  sense 1: 854 rules (1.0 context: served, .998 next: .*, .998 next: ., .995 context: serv*, .995 context: prison*, . . . ), 238 labelled
  sense 2: 214 rules (1.0 context: reads, .991 context: read*, .984 context: read, .976 context: reads, .969 context: 11*, . . . ), 56 labelled

Iteration 3:
  sense 1: 1520 rules (1.0 context: served, .998 next: .*, .998 next: ., .960 context: life*, .960 context: life, . . . ), 242 labelled
  sense 2: 223 rules (1.0 context: reads, .991 context: read*, .984 context: read, .984 next: :*, .984 next: :, . . . ), 49 labelled

Iterations 4-6 (unchanged after iteration 4):
  sense 1: 1557 rules (1.0 context: served, .998 next: .*, .998 next: ., .996 context: life*, .996 context: life, . . . ), 247 labelled
  sense 2: 221 rules (1.0 context: reads, .991 context: read*, .984 context: read, .984 next: :*, .984 next: :, . . . ), 49 labelled

Slide 11

Performance

Yarowsky  81.49
(% clean non-seeded accuracy, named entity)

Slide 12

Vs. co-training

DL-CoTrain from (Collins and Singer, 1999):

Yarowsky                 81.49
DL-CoTrain non-cautious  85.73
(% clean non-seeded accuracy, named entity)

Co-training needs two views, e.g.:

◮ adjacent words { next word: a, next word: about, next word: according, . . . }
◮ context words { context: abolition, context: abundantly, context: accepting, . . . }
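The two views are simply a partition of the feature set, with one classifier trained per view. A minimal sketch, assuming feature names carry the prefixes shown above:

```python
def split_views(features):
    """Partition features into the adjacent-word view and the context-word view."""
    adjacent = {f for f in features if f.startswith(("next word:", "previous word:"))}
    context = {f for f in features if f.startswith("context:")}
    return adjacent, context

v1, v2 = split_views({"next word: a", "context: abolition", "previous word: his"})
```

Co-training's independence assumption is about these two sets: each view alone should be enough to classify an example.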

Slide 13

Vs. EM

EM algorithm from (Collins and Singer, 1999):

Yarowsky  81.49
EM        80.31
(% clean non-seeded accuracy, named entity)

With Yarowsky we can exploit type-level information in the DL

Slide 14

Vs. EM

EM:
  Expected counts on data:  x1 x2 x3 x4 x5 . . .
  Probabilities on features:  f1 f2 f3 f4 f5 . . .

Yarowsky:
  Labelled training data:  x1 x2 x3 x4 x5 . . .
  Decision list:  f1 f2 f3 f4 f5 . . .
  Trimmed DL:  f1 f3 f5 . . .

Slide 15

Cautiousness

Can we improve decision list trimming?

◮ (Collins and Singer, 1999) cautiousness: take top n rules for each label, n = 5, 10, 15, . . . by iteration
◮ Yarowsky-cautious
◮ DL-CoTrain cautious
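Cautious trimming is easy to state in code: at iteration t, keep at most n = 5·t rules per label. A sketch assuming rules are (score, feature, label) triples; the per-iteration step of 5 follows the n = 5, 10, 15, . . . schedule above:

```python
def cautious_trim(rules, iteration, step=5):
    """Keep only the top (step * iteration) rules per label, ranked by score."""
    n = step * iteration
    kept = []
    for label in {lab for _, _, lab in rules}:
        per_label = sorted((r for r in rules if r[2] == label), reverse=True)
        kept.extend(per_label[:n])
    return kept

# Iteration 1 keeps at most 5 rules per label.
rules = [(0.90 + i / 100, "f%d" % i, "sense 1") for i in range(7)]
trimmed = cautious_trim(rules, iteration=1)  # 5 of the 7 rules survive
```

Growing n slowly is what makes the algorithm "cautious": only the most confident rules feed labels back into training early on.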

Slide 16

Yarowsky-cautious algorithm (Collins and Singer, 1999)

[Plot: DL size, number of labelled training examples, and test accuracy by iteration]

Iteration 0:
  sense 1: 1 rule (1.0 context: served), 4 labelled examples
  sense 2: 1 rule (1.0 context: reads), 4 labelled examples

Iteration 1:
  sense 1: 6 rules (1.0 context: served, .976 context: serv*, .976 context: served, .955 context: inmat*, .955 context: releas*, . . . ), 25 labelled
  sense 2: 6 rules (1.0 context: reads, .976 context: read*, .976 context: reads, .969 next: read*, .969 next: reads, . . . ), 12 labelled

Iteration 2:
  sense 1: 11 rules (1.0 context: served, .995 context: serv*, .989 context: serve, .986 context: serving, .984 context: life*, . . . ), 62 labelled
  sense 2: 11 rules (1.0 context: reads, .991 context: read*, .984 context: read, .976 context: reads, .969 next: from*, . . . ), 20 labelled

Iteration 3:
  sense 1: 16 rules (1.0 context: served, .996 context: life*, .996 context: life, .995 context: serv*, .995 context: prison*, . . . ), 84 labelled
  sense 2: 16 rules (1.0 context: reads, .991 context: read*, .991 next: from*, .991 next: from, .984 context: read, . . . ), 32 labelled

Iteration 4:
  sense 1: 21 rules (1.0 context: served, .996 context: commut*, .996 context: life*, .996 context: life, .995 context: serv*, . . . ), 100 labelled
  sense 2: 21 rules (1.0 context: reads, .991 context: read*, .991 next: from*, .991 next: from, .989 context: quot*, . . . ), 36 labelled

Iterations 5-20 add 5 rules per label per iteration, with the top rules stable; rule counts and labelled examples (sense 1 / sense 2):
  Iteration 5:  26 rules, 114 / 40
  Iteration 6:  31 rules, 128 / 40
  Iteration 7:  36 rules, 139 / 40
  Iteration 8:  41 rules, 139 / 48
  Iteration 9:  46 rules, 139 / 51
  Iteration 10: 51 rules, 146 / 53
  Iteration 11: 56 rules, 156 / 54
  Iteration 12: 61 rules, 159 / 57
  Iteration 13: 66 rules, 159 / 58
  Iteration 14: 71 rules, 163 / 58
  Iteration 15: 76 rules, 165 / 58
  Iteration 16: 81 rules, 166 / 58
  Iteration 17: 86 rules, 169 / 58
  Iteration 18: 91 rules, 170 / 58
  Iteration 19: 96 rules, 170 / 58
  Iteration 20: 101 rules, 172 / 59

Slide 17

Yarowsky-cautious vs. co-training and EM

Yarowsky-cautious        89.97
DL-CoTrain cautious      90.49
Yarowsky non-cautious    81.49
DL-CoTrain non-cautious  85.73
EM                       80.31
(% clean non-seeded accuracy, named entity)

Yarowsky-cautious and DL-CoTrain cautious are statistically equivalent.

◮ Yarowsky performs well
◮ Cautiousness is important
◮ Yarowsky does not need views

Slide 18

Did we really do EM right?

Hard Online EM       80.49
Online EM            83.89
Hard EM              80.94
EM                   80.31
Yarowsky-cautious    89.97
DL-CoTrain cautious  90.49
(% clean non-seeded accuracy, named entity)

Multiple runs of EM. Variance of results:

◮ EM: ±.34
◮ Hard EM: ±2.53
◮ Online EM: ±.45
◮ Hard Online EM: ±.68

Slide 19

Yarowsky algorithm: (Abney, 2004)'s analysis

Yarowsky algorithm lacks theoretical analysis:

◮ (Abney, 2004) gives bounds for some variants (no cautiousness, no algorithm)
◮ Basis for our work

Training examples x, labels j:

  Full time should be served for each sentence .
  The Liberals inserted a sentence of 14 words which reads :
  They get a concurrent sentence with no additional time added to their sentence .
  The words tax relief appeared in every second sentence in the federal government’s throne speech .
  . . .

Labelling distributions φx(j): peaked for a labelled example x, uniform for an unlabelled example x.

Features f, labels j:

  context: reads, context: served, context: inmate, next: the, context: article, previous: introductory, previous: passing, next: said, . . .

Parameter distributions θf(j): normalized DL scores for feature f; the DL chooses arg max_j max_{f∈Fx} θf(j).

slide-85
SLIDE 85

19

Yarowsky algorithm: (Abney, 2004)’s analysis

Yarowsky algorithm lacks theoretical analysis

◮ (Abney, 2004) gives bounds for some variants

(no cautiousness, no algorithm)

◮ Basis for our work

Training examples x, labels j:

Full time should be served for each sentence .

The Liberals inserted a sentence of 14 words which reads :

They get a concurrent sentence with no additional time added to their sentence .

The words tax relief appeared in every second sentence in the federal government’s throne speech . . . .

labelling distributions φx(j)

peaked for labelled example x uniform for unlabelled example x

Features f , labels j:

context: reads

context: served

context: inmate

next: the

context: article

previous: introductory

previous: passing

next: said . . .

parameter distributions θf (j)

normalized DL scores for feature f DL chooses arg maxj maxf ∈Fx θf (j) alternative: arg maxj

  • f ∈Fx θf (j)
slide-86
SLIDE 86

20

Yarowsky algorithm: (Haffari and Sarkar, 2007)’s analysis

◮ (Haffari and Sarkar, 2007) extend (Abney, 2004) to a bipartite graph representation (polytime algorithm; no cautiousness)

[Bipartite graph: feature vertices θf1 . . . θf|F| (parameter distributions θf(j)) on one side, example vertices φx1 . . . φx|X| (labelling distributions φx(j)) on the other.]

Algorithm: fix one side, update the other.

slide-91
SLIDE 91

21

Objective Function

◮ KL divergence between two probability distributions:

KL(p||q) = Σi p(i) log (p(i) / q(i))

◮ Entropy of a distribution:

H(p) = −Σi p(i) log p(i)

◮ The objective function:

K(φ, θ) = Σ(fi,xj)∈Edges [ KL(θfi || φxj) + H(θfi) + H(φxj) ] + Regularizer

◮ Reduce uncertainty in the labelling distribution while respecting the labelled data
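The three terms of the objective can be checked numerically; a minimal Python sketch (the names `kl`, `entropy`, and `objective` are illustrative, not from the paper, and the regularizer is omitted):

```python
import math

def kl(p, q):
    """KL(p||q) = sum_i p(i) log(p(i)/q(i)), with 0 log 0 := 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_i p(i) log p(i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def objective(theta, phi, edges):
    """K(phi, theta): sum over bipartite edges (f, x); regularizer omitted."""
    return sum(kl(theta[f], phi[x]) + entropy(theta[f]) + entropy(phi[x])
               for f, x in edges)
```

With a single edge between a peaked θf and a uniform φx, the KL term and the entropy of φx each contribute log 2, so the objective is 2 log 2.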

slide-95
SLIDE 95

22

Generalized Objective Function

◮ Bregman divergence between two probability distributions:

Bψ(p||q) = Σi [ ψ(p(i)) − ψ(q(i)) − ψ′(q(i)) (p(i) − q(i)) ],   with Bt log t(p||q) = KL(p||q)

◮ ψ-entropy of a distribution:

Hψ(p) = −Σi ψ(p(i)),   with Ht log t(p) = H(p)

◮ The generalized objective function:

Kψ(φ, θ) = Σ(fi,xj)∈Edges [ Bψ(θfi || φxj) + Hψ(θfi) + Hψ(φxj) ] + Regularizer
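A quick way to see that ψ(t) = t log t recovers KL and ψ(t) = t² gives squared error is to evaluate the definition directly; a minimal sketch (helper names are illustrative):

```python
import numpy as np

def bregman(p, q, psi, dpsi):
    """B_psi(p||q) = sum_i [psi(p_i) - psi(q_i) - psi'(q_i) (p_i - q_i)]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(psi(p) - psi(q) - dpsi(q) * (p - q)))

# psi(t) = t log t recovers KL divergence (taking 0 log 0 = 0):
tlogt  = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)
dtlogt = lambda t: np.log(t) + 1.0

# psi(t) = t^2 gives the squared-error divergence Bt2 used for propagation:
tsq, dtsq = lambda t: t ** 2, lambda t: 2.0 * t
```

Expanding the t log t case leaves KL(p||q) plus Σi (q(i) − p(i)), which vanishes when p and q are both probability distributions; the t² case simplifies to Σi (p(i) − q(i))².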

slide-98
SLIDE 98

23

Generalized Objective Function

[Figure: ψ is convex; per component, Bψ(p||q) is the gap a − b between ψ(p(i)) and the tangent line ψ(q(i)) + ψ′(q(i))(p(i) − q(i)) taken at q(i).]

slide-99
SLIDE 99

24

Variants from (Abney, 2004; Haffari and Sarkar, 2007)

Algorithm                          % clean non-seeded accuracy (named entity)
Yarowsky-cautious                  89.97
Yarowsky non-cautious              81.49
Yarowsky-cautious sum              90.49
HaffariSarkar-bipartite avg-maj    79.69

slide-100
SLIDE 100

25

Graph-based Propagation (Subramanya et al., 2010)

Self-training with CRFs: seed data → label data → train CRF → get posteriors → get types → graph propagate → (repeat)

Compare with Yarowsky: seed DL → label data → train DL → (repeat)

slide-102
SLIDE 102

26

Our contributions

1. A cautious, well-performing Yarowsky variant with a per-iteration objective
2. Unification of various bootstrapping algorithms: (Collins and Singer, 1999), (Abney, 2004), (Haffari and Sarkar, 2007), (Subramanya et al., 2010)
3. More evidence that cautiousness is important
slide-105
SLIDE 105

27

Graph propagation

(Subramanya et al., 2010)’s propagation objective:

μ Σu∈V Σv∈N(u) wuv Bt2(qu, qv) + ν Σu∈V Bt2(qu, U)

[Graph: neighbouring vertices qu and qv; each edge contributes a smoothness term Bt2(qu, qv), and each vertex a uniform-regularization term Bt2(qu, U).]

◮ Efficient iterative updates
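For ψ(t) = t², each coordinate update has a closed form: the minimizing qu is a convex combination of its neighbours and the uniform distribution U. A minimal sketch (the `propagate` helper and its parameters are illustrative; the seed-matching term of (Subramanya et al., 2010) is omitted):

```python
import numpy as np

def propagate(q, w, mu=1.0, nu=0.1, iters=50):
    """Coordinate descent on
        mu * sum_u sum_{v in N(u)} w[u,v] * ||q_u - q_v||^2
            + nu * sum_u ||q_u - U||^2
    (the Bt2 objective).  Each update sets q_u to a convex combination of
    its neighbours and the uniform distribution U, so every row stays a
    probability distribution.  Assumes w has a zero diagonal."""
    q = np.array(q, dtype=float)
    n, k = q.shape
    U = np.full(k, 1.0 / k)
    for _ in range(iters):
        for u in range(n):
            # q_u <- (mu * sum_v w_uv q_v + nu * U) / (mu * sum_v w_uv + nu)
            q[u] = (mu * (w[u] @ q) + nu * U) / (mu * w[u].sum() + nu)
    return q
```

On a two-node graph with opposed peaked distributions, the updates smooth both nodes toward their common average.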

slide-109
SLIDE 109

28

Using graph propagation

μ Σu∈V [ Σv∈N(u) wuv Bt2(qu, qv) + Ht2(qu) ] + ν Σu∈V Bt2(qu, U)

φ-θ: use the bipartite graph from (Haffari and Sarkar, 2007), with vertices θf1 . . . θf|F| and φx1 . . . φx|X| (motivated by a similar objective)

θ-only: use only θ in a unipartite graph over the feature vertices θf1 . . . θf|F|

slide-111
SLIDE 111

29

Yarowsky-prop (our algorithm)

seed DL → graph propagate to get θP → label data with θP (sum) → train DL θ → (repeat)

◮ Can use φ-θ (bipartite) or θ-only (unipartite) (or two more, in the ACL 2012 paper)
◮ Optimizes (Subramanya et al., 2010)’s objective per iteration
◮ Uses the cautiousness decisions of θ, labels with θP

slide-114
SLIDE 114

30

Yarowsky-prop: objective behaviour

μ Σu∈V [ Σv∈N(u) wuv Bt2(qu, qv) + Ht2(qu) ] + ν Σu∈V Bt2(qu, U)

[Plot: propagation objective value and training set coverage versus iteration, for φ-θ without cautiousness.]

slide-115
SLIDE 115

31

The basic Yarowsky algorithm.

Require: training data X and a seed DL θ(0)
1: for iteration t = 1, 2, . . . to maximum or convergence do
2:    apply θ(t−1) to X to produce Y(t)
3:    train a new DL θ(t) on Y(t), keeping only rules with score above ζ
4: end for
5: train a final DL θ on the last Y(t)   // retraining step
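The loop above can be sketched with a toy decision list; a hedged Python version (first-match labelling and unsmoothed precision scores are simplifications of the (Collins and Singer, 1999) DL, and the final retraining step is omitted):

```python
from collections import Counter

def yarowsky(examples, seed_rules, zeta=0.95, max_iter=10):
    """Sketch of the basic Yarowsky loop over a toy decision list (DL).
    examples: list of feature lists; seed_rules: {feature: label}.
    Each round applies the current DL (first matching rule wins), then
    retrains, keeping only rules whose precision on the current labelling
    exceeds zeta."""
    dl = dict(seed_rules)          # feature -> label
    labels = {}                    # example index -> label
    for _ in range(max_iter):
        labels = {}
        for i, feats in enumerate(examples):
            for f in feats:
                if f in dl:
                    labels[i] = dl[f]
                    break
        pos, tot = Counter(), Counter()
        for i, y in labels.items():
            for f in examples[i]:
                tot[f] += 1
                pos[(f, y)] += 1
        new_dl = {f: y for (f, y), c in pos.items() if c / tot[f] > zeta}
        if new_dl == dl:           # converged
            break
        dl = new_dl
    return dl, labels
```

Starting from a single seed rule, the label spreads through co-occurring features until the whole training set is covered.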

slide-116
SLIDE 116

32

Yarowsky-prop algorithm (θ-only form)

1: let θfj be the scores of the seed rules   // crf train
2: for iteration t to maximum or convergence do
3:    let πx(j) = (1/|Fx|) Σf∈Fx θfj   // post. decode
4:    let θTfj = Σx∈Xf πx(j) / |Xf|   // token to type
5:    propagate θT to get θP   // graph propagate
6:    label the data with θP   // viterbi decode; cautiousness
7:    train a new DL θfj   // crf train
8: end for
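Steps 3 and 4 above (posterior decode and token-to-type aggregation) amount to two averaging passes over the bipartite structure; a minimal sketch with illustrative names:

```python
import numpy as np

def posterior_decode(theta, features_of):
    """pi_x(j) = (1/|F_x|) sum_{f in F_x} theta_f(j)   (step 3, sum form).
    features_of maps each example x to its feature list F_x."""
    return {x: sum(theta[f] for f in fs) / len(fs)
            for x, fs in features_of.items()}

def token_to_type(pi, examples_of):
    """theta^T_f(j) = (sum_{x in X_f} pi_x(j)) / |X_f|   (step 4).
    examples_of maps each feature f to the examples X_f containing it."""
    return {f: sum(pi[x] for x in xs) / len(xs)
            for f, xs in examples_of.items()}
```

Each θf and πx is a distribution over labels (here a NumPy vector), so both passes are simple means of probability vectors and preserve normalization.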

slide-117
SLIDE 117

33

Yarowsky-prop-cautious: behaviour

[Animated plot: DL size, number of labelled training examples, and test accuracy versus iteration, built up over iterations 0–18.]

Iteration 0: 1 rule per label (1.0 context: served; 1.0 context: reads); 4 + 4 training examples labelled.

Iteration 1: 6 rules per label (1.0 context: served, .976 context: serv*, .976 context: served, .955 context: inmat*, .955 context: releas* . . . ; 1.0 context: reads, .976 context: read*, .976 context: reads, .969 next: read*, .969 next: reads . . . ); 107 + 44 training examples labelled.

The DL then grows by 5 rules per label each iteration while the labelled training set stabilizes:

Iteration    Rules per label    Labelled train examples
2            11                 211 + 73
3            16                 243 + 60
4            21                 233 + 70
5            26                 229 + 74
6            31                 230 + 73
7            36                 229 + 74
8            41                 238 + 65
9            46                 240 + 63
10           51                 245 + 58
11           56                 248 + 55
12           61                 251 + 52
13           66                 248 + 55
14           71                 250 + 53
15           76                 250 + 53
16           81                 250 + 53
17           86                 250 + 53
18           91                 250 + 53

Iteration 18: 91 rules per label (1.0 context: served, .971 previous: the*, .971 previous: the, .997 context: year*, .996 context: commut* . . . ; 1.0 context: reads, .989 context: quot*, .988 context: quote, .984 context: read, .984 next: :* . . . ); 250 + 53 training examples labelled.

slide-137
SLIDE 137

34

Results

Algorithm                       % clean non-seeded accuracy (named entity)
Yarowsky-cautious               89.97
DL-CoTrain cautious             90.49
Y.-prop-cautious theta-only     91.52
Y.-prop-cautious phi-theta      88.95

Statistically equivalent to DL-CoTrain. But:

◮ No need for views
◮ Per-iteration objective

slide-139
SLIDE 139

35

Correct Yarowsky-prop Examples

Gold label    Features
location      X0 Waukegan, X01 maker, X3 LEFT
location      X0 Mexico, X42 president, X42 of, X11 president-of, X3 RIGHT
location      X0 La-Jolla, X2 La, X2 Jolla, X01 company, X3 LEFT

Figure: Named entity test set examples where Yarowsky-prop θ-only is correct and no other tested algorithms are correct.

slide-140
SLIDE 140

36

Software available at https://github.com/sfu-natlang/yarowsky

Thank you!

slide-141
SLIDE 141

37

Introduction The Yarowsky algorithm Graph-based Propagation Our algorithm Extra slides References

slide-142
SLIDE 142

38

More results

Algorithm                       named entity    drug     land     sentence
Num. train examples             89305           134      1604     303
Num. test examples              962             386      1488     515
DL-CoTrain (non-cautious)       85.73           58.73    77.72    51.05
DL-CoTrain (cautious)           90.49           58.17    77.72    65.69
Yarowsky                        81.49           57.62    78.41    54.81
Yarowsky-cautious               89.97           52.63    78.48    76.99
Yarowsky-cautious-sum           90.49           52.63    77.72    76.99
HS-bipartite avg-maj            79.69           50.14    77.72    51.67
EM                              80.31           52.49    31.12    65.23
  (±)                           0.34            0.28     0.03     3.55
Yarowsky-prop φ-θ               77.89           51.80    77.72    51.88
Yarowsky-prop θ-only            75.84           52.91    77.72    51.05
Yarowsky-prop-cautious φ-θ      88.95           55.40    77.72    72.18
Yarowsky-prop-cautious θ-only   91.52           57.06    77.72    73.22

clean non-seeded accuracy

slide-143
SLIDE 143

39

EM results

Algorithm                       named entity    drug     land     sentence
Num. train examples             89305           134      1604     303
Num. test examples              962             386      1488     515
Yarowsky                        81.49           57.62    78.41    54.81
Yarowsky-cautious               89.97           52.63    78.48    76.99
Yarowsky-prop-cautious θ-only   91.52           57.06    77.72    73.22
EM                              80.31           52.49    31.12    65.23
  (±)                           0.34            0.28     0.03     3.55
Hard EM                         80.95           52.91    40.12    63.47
  (±)                           2.53            0.74     13.39    6.37
Online EM                       83.89           54.29    45.00    56.25
  (±)                           0.45            0.94     21.29    3.28
Hard online EM                  80.41           54.54    50.51    56.28
  (±)                           0.68            1.03     23.02    3.56

clean non-seeded accuracy

slide-144
SLIDE 144

40

Decision lists

(Collins and Singer, 1999)’s DL scores:

θfj ∝ (|Λfj| + ǫ) / (|Λf| + Lǫ)

max definition of π (strict DL): πx(j) ∝ maxf∈Fx θfj

sum definition of π: πx(j) = (1/|Fx|) Σf∈Fx θfj
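These score and decoding rules are easy to sketch directly; a minimal Python version (function names are illustrative, not from the paper):

```python
from collections import Counter

def dl_scores(labelled, n_labels, eps=0.1):
    """Smoothed DL scores theta_{f,j} = (|Lambda_{f,j}| + eps) / (|Lambda_f| + L*eps),
    where Lambda_{f,j} is the set of labelled examples with feature f and label j.
    labelled: iterable of (feature list, label) pairs."""
    count_fj, count_f = Counter(), Counter()
    for feats, j in labelled:
        for f in feats:
            count_f[f] += 1
            count_fj[(f, j)] += 1
    return {(f, j): (count_fj[(f, j)] + eps) / (count_f[f] + n_labels * eps)
            for f in count_f for j in range(n_labels)}

def pi_strict(theta, feats, n_labels):
    """Strict DL: pi_x(j) proportional to max_{f in F_x} theta_{f,j}."""
    s = [max(theta[(f, j)] for f in feats) for j in range(n_labels)]
    return [v / sum(s) for v in s]

def pi_sum(theta, feats, n_labels):
    """Sum definition: pi_x(j) = (1/|F_x|) sum_{f in F_x} theta_{f,j}."""
    return [sum(theta[(f, j)] for f in feats) / len(feats)
            for j in range(n_labels)]
```

Because the smoothed scores for a fixed feature sum to one over labels, the sum definition already yields a normalized distribution, while the strict (max) definition needs the explicit normalization shown.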

slide-145
SLIDE 145

41

HS-bipartite.

1: apply θ(0) to X to produce a labelling Y(0)
2: for iteration t to maximum or convergence do
3:    for f ∈ F do
4:       let p = examples-to-feature({φx : x ∈ Xf})
5:       if p ≠ U then let θf = p
6:    end for
7:    for x ∈ X do
8:       let p = features-to-example({θf : f ∈ Fx})
9:       if p ≠ U then let φx = p
10:   end for
11: end for

slide-146
SLIDE 146

42

Accuracy plot: (Collins and Singer, 1999) algorithms

[Plot: clean non-seeded test accuracy versus iteration for DL-CoTrain (cautious), Yarowsky, Yarowsky-sum, Yarowsky-cautious, and Yarowsky-cautious-sum.]

Non-seeded test accuracy versus iteration for various algorithms on named entity. The results for the Yarowsky-prop algorithms are for the propagated classifier θP, except for the final DL retraining iteration.

slide-147
SLIDE 147

43

Accuracy plot: Yarowsky-prop cautious

[Plot: non-seeded test accuracy versus iteration for Yarowsky-prop-cautious phi-theta, pi-theta, theta-only, and thetatype-only.]

Non-seeded test accuracy versus iteration for various algorithms on named entity. The results for the Yarowsky-prop algorithms are for the propagated classifier θP, except for the final DL retraining iteration.

slide-148
SLIDE 148

44

Accuracy and coverage plot: non-cautious

[Plot: non-seeded test accuracy and internal train set coverage (same scale) versus iteration.]

Internal train set coverage and non-seeded test accuracy (same scale) for Yarowsky-prop θ-only on named entity.

slide-149
SLIDE 149

45

Accuracy and coverage plot: cautious

[Plot: non-seeded test accuracy and internal train set coverage (same scale) versus iteration.]

Internal train set coverage and non-seeded test accuracy (same scale) for Yarowsky-prop θ-only on named entity.

slide-150
SLIDE 150

46

Objective plot

[Plot: propagation objective value (right axis) and training set coverage (left axis) versus iteration.]

Non-seeded test accuracy (left axis), coverage (left axis, same scale), and objective value (right axis) for Yarowsky-prop φ-θ. Iterations are shown on a log scale. We omit the first iteration (where the DL contains only the seed rules) and start the plot at iteration 2 where there is a complete DL.

slide-151
SLIDE 151

47

Graph structures for propagation.

Method     V        N(u)                     qu
φ-θ        X ∪ F    Nx = Fx, Nf = Xf         qx = φx, qf = θf
π-θ        X ∪ F    Nx = Fx, Nf = Xf         qx = πx, qf = θf
θ-only     F        Nf = ∪x∈Xf Fx \ {f}      qf = θf
θT-only    F        Nf = ∪x∈Xf Fx \ {f}      qf = θTf

Propagation objective:

μ Σu∈V Σv∈N(u) wuv Bt2(qu, qv) + ν Σu∈V Bt2(qu, U)

[Graphs: the bipartite graph over θf1 . . . θf|F| and φx1 . . . φx|X|, and the unipartite graph over θf1 . . . θf|F| only.]

slide-152
SLIDE 152

48

Introduction The Yarowsky algorithm Graph-based Propagation Our algorithm Extra slides References

slide-153
SLIDE 153

48

Abney, S. (2004). Understanding the Yarowsky algorithm. Computational Linguistics, 30(3).

Collins, M. and Singer, Y. (1999). Unsupervised models for named entity classification. In EMNLP 1999: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 100–110.

Daume, H. (2011). Seeding, transduction, out-of-sample error and the Microsoft approach... Blog post at http://nlpers.blogspot.com/2011/04/seeding-transduction-out-of-sample.html.

Eisner, J. and Karakos, D. (2005). Bootstrapping without the boot. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 395–402, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Haffari, G. and Sarkar, A. (2007). Analysis of semi-supervised learning with the Yarowsky algorithm. In UAI 2007, Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, pages 159–166.

Subramanya, A., Petrov, S., and Pereira, F. (2010). Efficient graph-based semi-supervised learning of structured tagging models. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 167–176, Cambridge, MA. Association for Computational Linguistics.

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA. Association for Computational Linguistics.