Bootstrapping via Graph Propagation
Max Whitney and Anoop Sarkar
Simon Fraser University Natural Language Laboratory
http://natlang.cs.sfu.ca

Bootstrapping: semi-supervised (vs. supervised); single domain (vs. domain …)


  1.–2. Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)
     Seed DL:
       1.0 context: served → sense 1
       1.0 context: reads → sense 2
     Data (word sense disambiguation for "sentence"):
       ◮ Full time should be served for each sentence .
       ◮ The Liberals inserted a sentence of 14 words which reads :
       ◮ The sentence for such an offence would be a term of imprisonment for one year .
       ◮ Mr. Speaker , I have a question based on the very last sentence of the hon. member .
       ◮ . . .
     Loop: label data → train DL → threshold. Trained DL (excerpt):
       1.0 context: served → sense 1
       1.0 context: reads → sense 2
       .976 context: serv* → sense 1
       .976 context: read* → sense 2
       .969 next word: reads → sense 2
       .969 next word: read* → sense 2
       .955 previous word: his → sense 1
       .955 previous word: hi* → sense 1
       .955 context: inmate → sense 1
       previous word: relevant, . . .

  3. Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)
     Same loop as above (seed DL → label data → train DL → threshold), followed by a final re-training step on the labelled data with no threshold, and then evaluation on test data.
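     The loop on these slides can be written compactly. Below is a minimal Python sketch of the basic procedure as described here (seed rules, label, re-train, threshold, final DL without the threshold); the scoring function, smoothing, data representation, and the choice to keep seed rules fixed are simplifying assumptions for illustration, not the exact implementation behind the experiments.

from collections import defaultdict

def train_dl(labelled, smoothing=0.1, num_labels=2):
    """Estimate a decision list: score(f, j) ~ smoothed P(label = j | feature = f)."""
    counts = defaultdict(lambda: defaultdict(float))
    for features, label in labelled:
        for f in features:
            counts[f][label] += 1.0
    dl = {}
    for f, by_label in counts.items():
        total = sum(by_label.values())
        for j in range(num_labels):
            dl[(f, j)] = (by_label[j] + smoothing) / (total + num_labels * smoothing)
    return dl

def predict(dl, features, num_labels=2):
    """DL prediction: label of the single highest-scoring matching rule."""
    best = max(((dl.get((f, j), 0.0), j) for f in features for j in range(num_labels)),
               default=(0.0, None))
    return best[1] if best[0] > 0.0 else None

def yarowsky(examples, seed_dl, iterations=20, threshold=0.95, num_labels=2):
    """examples: list of feature sets; seed_dl: {(feature, label): score}, e.g.
    {("context: served", 0): 1.0, ("context: reads", 1): 1.0}."""
    dl = dict(seed_dl)
    for _ in range(iterations):
        # Label the data with the current decision list, thresholded.
        kept = {k: v for k, v in dl.items() if v >= threshold}
        labelled = []
        for features in examples:
            j = predict(kept, features, num_labels)
            if j is not None:
                labelled.append((features, j))
        # Re-train the decision list on the newly labelled data.
        dl = train_dl(labelled, num_labels=num_labels)
        dl.update(seed_dl)  # assumption: seed rules are kept at score 1.0
    # Final re-training already happened on the last labelled set; the returned DL
    # is used on test data with no threshold.
    return dl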

  4. Example decision list for the named entity task
     Rank  Score     Feature                   Label
      1    0.999900  New-York                  loc.
      2    0.999900  California                loc.
      3    0.999900  U.S.                      loc.
      4    0.999900  Microsoft                 org.
      5    0.999900  I.B.M.                    org.
      6    0.999900  Incorporated              org.
      7    0.999900  Mr.                       per.
      8    0.999976  U.S.                      loc.
      9    0.999957  New-York-Stock-Exchange   loc.
     10    0.999952  California                loc.
     11    0.999947  New-York                  loc.
     12    0.999946  court-in                  loc.
     13    0.975154  Company-of                loc.
     . . .
     In the original slide, context features (ranks 12–13 here) are shown in italics and the seed rules (ranks 1–7) in bold; all other features are spelling features.

  5.–14. Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)
     [Figure: DL size and number of labelled training examples (left axis) and test accuracy (right axis) plotted per iteration, animated over iterations 0–6, with the current top decision-list rules for each sense shown alongside.]
     ◮ Iteration 0: 1 rule per sense (the seeds 1.0 context: served / 1.0 context: reads); 4 + 4 labelled training examples.
     ◮ Iteration 1: 46 / 31 rules; 114 + 37 labelled examples. Top rules: 1.0 context: served, .976 context: serv*, .955 context: inmat*, .955 context: releas*, . . . and 1.0 context: reads, .976 context: read*, .969 next: read*, .969 next: reads, . . .
     ◮ Iteration 2: 854 / 214 rules; 238 + 56 labelled examples. Low-precision rules such as .998 next: . and .995 context: prison* now appear.
     ◮ Iteration 3: 1520 / 223 rules; 242 + 49 labelled examples.
     ◮ Iterations 4–6: 1557 / 221 rules; 247 + 49 labelled examples (unchanged from iteration 4 onward).

  15. Performance
     Yarowsky: 81.49 % clean non-seeded accuracy (named entity)

  16.–17. Vs. co-training
     DL-CoTrain from (Collins and Singer, 1999):
       Yarowsky                  81.49
       DL-CoTrain non-cautious   85.73
     (% clean non-seeded accuracy, named entity)
     Co-training needs two views, e.g.:
     ◮ adjacent words { next word: a, next word: about, next word: according, . . . }
     ◮ context words { context: abolition, context: abundantly, context: accepting, . . . }
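     For concreteness, a minimal sketch of how a feature set might be split into the two views named above (adjacent-word features vs. context-word features). The feature-name prefixes and example features are taken from these slides; the helper itself and the inclusion of "previous word:" in the adjacent-word view are illustrative assumptions.

def split_views(features):
    """Split a feature set into the two co-training views."""
    adjacent = {f for f in features if f.startswith(("next word:", "previous word:"))}
    context = {f for f in features if f.startswith("context:")}
    return adjacent, context

# Usage with features named on this slide:
adj, ctx = split_views({"next word: according", "context: abolition", "context: accepting"})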

  18.–19. Vs. EM
     EM algorithm from (Collins and Singer, 1999):
       Yarowsky   81.49
       EM         80.31
     (% clean non-seeded accuracy, named entity)
     With Yarowsky we can exploit type-level information in the DL.

  20.–22. Vs. EM
     [Diagram comparing what each algorithm maintains.]
     EM:
     ◮ expected counts on data: x1, x2, x3, x4, x5, . . .
     ◮ probabilities on features: f1, f2, f3, f4, f5, . . .
     Yarowsky:
     ◮ labelled training data: x1, x2, x3, x4, x5, . . .
     ◮ decision list: f1, f2, f3, f4, f5, . . .
     ◮ trimmed DL: f1, f3, f5, . . .

  23.–25. Cautiousness
     Can we improve decision list trimming?
     ◮ (Collins and Singer, 1999) cautiousness: take the top n rules for each label, with n = 5, 10, 15, . . . growing by iteration
     ◮ Yarowsky-cautious
     ◮ DL-CoTrain cautious
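     A minimal sketch of the cautious trimming just described: keep only the top n rules per label, ranked by score, with n growing by a fixed step each iteration. The step of 5 matches the n = 5, 10, 15, . . . schedule on the slide; the data structures and tie-breaking are illustrative assumptions.

def cautious_trim(dl, iteration, step=5):
    """Keep only the top (step * iteration) rules per label, ranked by score.
    `dl` maps (feature, label) -> score, as in the earlier Yarowsky sketch."""
    n = step * iteration
    labels = {label for (_, label) in dl}
    trimmed = {}
    for j in labels:
        rules_j = sorted(((s, f) for (f, lab), s in dl.items() if lab == j), reverse=True)
        for score, feature in rules_j[:n]:
            trimmed[(feature, j)] = score
    return trimmed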

  26.–47. Yarowsky-cautious algorithm (Collins and Singer, 1999)
     [Figure: DL size and number of labelled training examples (left axis) and test accuracy (right axis) plotted per iteration, animated over iterations 0–20, with the current top rules for each sense shown alongside.]
     ◮ Iteration 0: 1 rule per sense (the seeds); 4 + 4 labelled training examples.
     ◮ With cautiousness the DL grows by only 5 rules per sense per iteration: 6 rules each at iteration 1 (25 + 12 labelled examples), 11 at iteration 2 (62 + 20), 16 at iteration 3 (84 + 32), 21 at iteration 4 (100 + 36), . . . , 101 rules per sense at iteration 20 (172 + 59 labelled examples).
     ◮ The top rules remain high-precision context rules throughout, e.g. context: serv*, context: life*, context: commut* for sense 1 and context: read*, next: from for sense 2.

  48.–51. Yarowsky-cautious vs. co-training and EM
       EM                        80.31
       DL-CoTrain non-cautious   85.73
       Yarowsky non-cautious     81.49
       DL-CoTrain cautious       90.49
       Yarowsky-cautious         89.97
     (% clean non-seeded accuracy, named entity; DL-CoTrain cautious and Yarowsky-cautious are statistically equivalent)
     ◮ Yarowsky performs well
     ◮ Cautiousness is important
     ◮ Yarowsky does not need views

  52.–54. Did we really do EM right?
       DL-CoTrain cautious   90.49
       Yarowsky-cautious     89.97
       EM                    80.31
       Hard EM               80.94
       Online EM             83.89
       Hard Online EM        80.49
     (% clean non-seeded accuracy, named entity)
     Multiple runs of EM. Variance of results:
     ◮ EM: ± .34
     ◮ Hard EM: ± 2.53
     ◮ Online EM: ± .45
     ◮ Hard Online EM: ± .68

  55.–60. Yarowsky algorithm: (Abney, 2004)'s analysis
     The Yarowsky algorithm lacks theoretical analysis.
     ◮ (Abney, 2004) gives bounds for some variants (no cautiousness, no algorithm)
     ◮ Basis for our work
     Training examples x, labels j:
     ◮ Full time should be served for each sentence .
     ◮ The Liberals inserted a sentence of 14 words which reads :
     ◮ They get a concurrent sentence with no additional time added to their sentence .
     ◮ The words tax relief appeared in every second sentence in the federal government's throne speech .
     ◮ . . .
     Labelling distributions φ_x(j): peaked for a labelled example x, uniform for an unlabelled example x.
     Features f, labels j:
     ◮ context: reads
     ◮ context: served
     ◮ context: inmate
     ◮ next: the
     ◮ context: article
     ◮ previous: introductory
     ◮ previous: passing
     ◮ next: said
     ◮ . . .
     Parameter distributions θ_f(j): normalized DL scores for feature f.
     The DL chooses arg max_j max_{f ∈ F_x} θ_f(j); an alternative is arg max_j Σ_{f ∈ F_x} θ_f(j).
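     The two decision rules on this slide differ only in how the parameter distributions of the matching features are combined. A minimal sketch below; the θ values and feature set are invented for illustration (the feature names come from this slide), and are chosen so the two rules disagree.

def dl_decide(theta, active_features, labels=(0, 1)):
    """Original DL rule: arg max_j max_{f in F_x} theta_f(j) (single best rule decides)."""
    return max(labels, key=lambda j: max(theta[f][j] for f in active_features))

def sum_decide(theta, active_features, labels=(0, 1)):
    """Alternative rule: arg max_j sum_{f in F_x} theta_f(j)."""
    return max(labels, key=lambda j: sum(theta[f][j] for f in active_features))

# Illustrative parameter distributions theta_f(j) over two senses:
theta = {
    "context: served":   [0.9, 0.1],
    "next: the":         [0.3, 0.7],
    "previous: passing": [0.3, 0.7],
    "next: said":        [0.3, 0.7],
}
features = list(theta)
print(dl_decide(theta, features))   # 0: the single strongest rule (context: served) decides
print(sum_decide(theta, features))  # 1: the three weaker rules together outweigh it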

  61.–65. Yarowsky algorithm: (Haffari and Sarkar, 2007)'s analysis
     ◮ (Haffari and Sarkar, 2007) extend (Abney, 2004) to a bipartite graph representation (polytime algorithm; no cautiousness)
     [Figure: bipartite graph with feature vertices θ_{f_1}, . . . , θ_{f_|F|} on one side and example vertices φ_{x_1}, . . . , φ_{x_|X|} on the other, joined by edges.]
     ◮ features f carry parameter distributions θ_f(j)
     ◮ examples x carry labelling distributions φ_x(j)
     ◮ algorithm: fix one side, update the other
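     The "fix one side, update the other" step can be pictured with a toy implementation. The neighbour-averaging update below is only one simple instantiation chosen for illustration (the actual update in (Haffari and Sarkar, 2007) is derived from the objective introduced on the next slides); the graph, seeds, and distributions are invented.

def normalize(d):
    s = sum(d)
    return [v / s for v in d]

def update_features(edges, phi, num_labels):
    """Fix the example side; set each theta_f from its neighbouring phi_x (simple average)."""
    theta = {}
    for f in {f for f, _ in edges}:
        neigh = [phi[x] for g, x in edges if g == f]
        theta[f] = normalize([sum(p[j] for p in neigh) for j in range(num_labels)])
    return theta

def update_examples(edges, theta, num_labels):
    """Fix the feature side; set each phi_x from its neighbouring theta_f (simple average)."""
    phi = {}
    for x in {x for _, x in edges}:
        neigh = [theta[f] for f, y in edges if y == x]
        phi[x] = normalize([sum(t[j] for t in neigh) for j in range(num_labels)])
    return phi

# Toy bipartite graph: edges are (feature, example) pairs.
edges = [("context: served", "x1"), ("context: reads", "x2"),
         ("next: the", "x1"), ("next: the", "x2")]
phi = {"x1": [1.0, 0.0], "x2": [0.0, 1.0]}  # x1, x2 seeded with opposite labels
for _ in range(5):                           # alternate updates
    theta = update_features(edges, phi, num_labels=2)
    phi = update_examples(edges, theta, num_labels=2)
    # (in practice, the phi of seed-labelled examples would be kept fixed)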

  66.–69. Objective Function
     ◮ KL divergence between two probability distributions:
         KL(p || q) = \sum_i p(i) \log \frac{p(i)}{q(i)}
     ◮ Entropy of a distribution:
         H(p) = - \sum_i p(i) \log p(i)
     ◮ The objective function:
         K(\phi, \theta) = \sum_{(f_i, x_j) \in \mathrm{Edges}} \big[ KL(\theta_{f_i} || \phi_{x_j}) + H(\theta_{f_i}) + H(\phi_{x_j}) \big] + \mathrm{Regularizer}
     ◮ Reduce uncertainty in the labelling distribution while respecting the labelled data
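     A small numerical sketch of this objective on a toy graph, following the reconstruction above (KL divergence plus the two entropy terms, summed over edges; the regularizer is omitted and the graph and distributions are invented):

import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def objective(edges, theta, phi):
    """K(phi, theta): sum over edges of KL(theta_f || phi_x) + H(theta_f) + H(phi_x)."""
    return sum(kl(theta[f], phi[x]) + entropy(theta[f]) + entropy(phi[x]) for f, x in edges)

edges = [("context: served", "x1"), ("next: the", "x1"), ("next: the", "x2")]
theta = {"context: served": [0.9, 0.1], "next: the": [0.5, 0.5]}
phi = {"x1": [0.99, 0.01], "x2": [0.5, 0.5]}
print(objective(edges, theta, phi))  # smaller when phi is peaked and agrees with theta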

  70.–72. Generalized Objective Function
     ◮ Bregman divergence between two probability distributions:
         B_\psi(p || q) = \sum_i \big[ \psi(p(i)) - \psi(q(i)) - \psi'(q(i)) (p(i) - q(i)) \big]
         B_{t \log t}(p || q) = KL(p || q)
     ◮ ψ-entropy of a distribution:
         -H_\psi(p) = \sum_i \psi(p(i)), \qquad H_{t \log t}(p) = H(p)
     ◮ The generalized objective function:
         K_\psi(\phi, \theta) = \sum_{(f_i, x_j) \in \mathrm{Edges}} \big[ B_\psi(\theta_{f_i} || \phi_{x_j}) + H_\psi(\theta_{f_i}) + H_\psi(\phi_{x_j}) \big] + \mathrm{Regularizer}
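     As a sanity check of the formulas above, a short sketch verifying numerically that the Bregman divergence with ψ(t) = t log t coincides with the KL divergence (for normalized distributions); the example distributions are arbitrary.

import math

def bregman(p, q, psi, dpsi):
    return sum(psi(pi) - psi(qi) - dpsi(qi) * (pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

psi = lambda t: t * math.log(t)      # psi(t) = t log t
dpsi = lambda t: math.log(t) + 1.0   # psi'(t) = log t + 1

p, q = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
print(bregman(p, q, psi, dpsi), kl(p, q))  # the two values agree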

  73. Generalized Objective Function
     [Figure: the Bregman divergence illustrated geometrically for a single coordinate: a = ψ(p(i)) − ψ(q(i)) and b = ψ'(q(i))(p(i) − q(i)), so each term of B_ψ is the gap a − b between ψ(p(i)) and the tangent to ψ at q(i), shown for two points p(i) and p'(i).]

  74. Variants from (Abney, 2004; Haffari and Sarkar, 2007)
       Yarowsky non-cautious             81.49
       Yarowsky-cautious                 89.97
       Yarowsky-cautious sum             90.49
       HaffariSarkar-bipartite avg-maj   79.69
     (% clean non-seeded accuracy, named entity)

  75. Graph-based Propagation (Subramanya et al., 2010)
     Self-training with CRFs, repeating the loop: seed data → train CRF → get posteriors → get types → graph propagate → label data → re-train.
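     The "graph propagate" box is the step this talk builds on. As a rough illustration only, here is a generic label-propagation update on a type graph (each node's distribution is pulled toward the average of its neighbours, with seed nodes kept fixed); this is not the specific propagation objective of (Subramanya et al., 2010), and the graph and distributions are invented.

def propagate(neighbours, dist, seeds, alpha=0.5, iterations=10):
    """Generic label propagation over a type graph: mix each node's label
    distribution with the average of its neighbours'; seed nodes stay fixed."""
    for _ in range(iterations):
        new = {}
        for node, dst in dist.items():
            if node in seeds or not neighbours.get(node):
                new[node] = dst
                continue
            avg = [sum(dist[n][j] for n in neighbours[node]) / len(neighbours[node])
                   for j in range(len(dst))]
            mixed = [(1 - alpha) * d + alpha * a for d, a in zip(dst, avg)]
            total = sum(mixed)
            new[node] = [m / total for m in mixed]
        dist = new
    return dist

# Toy type graph over word types, two labels:
neighbours = {"served": ["serving"], "serving": ["served", "reads"], "reads": ["serving"]}
dist = {"served": [1.0, 0.0], "serving": [0.5, 0.5], "reads": [0.0, 1.0]}
print(propagate(neighbours, dist, seeds={"served", "reads"}))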
