SFU NatLangLab: Bootstrapping via Graph Propagation. Max Whitney. PowerPoint presentation.



SFU NatLangLab

Bootstrapping via Graph Propagation

Max Whitney Anoop Sarkar Simon Fraser University Natural Language Laboratory http://natlang.cs.sfu.ca

Slide 1

Bootstrapping

◮ Semi-supervised (vs supervised)
◮ Single domain (vs domain adaptation)
◮ Small amount of seed data/rules (vs domain adaptation)

Assumption:

◮ No transductive learning

Slide 2

General approaches to semi-supervised learning

◮ Clustering

concept must be identifiable

◮ Maximum likelihood

problems with local optima; generative or discriminative

◮ Co-training

learn from agreement between models; need independent views

◮ Self-training

learn from agreement between features

Slide 3

The Yarowsky algorithm

◮ Yarowsky algorithm: self-training algorithm by David Yarowsky (1995)
◮ Works well empirically
◮ Little theoretical analysis
◮ Co-training by Avrim Blum and Tom Mitchell (1998): the paper has been cited over 1000 times, and received the 10-year Best Paper Award at the 25th International Conference on Machine Learning (2008)
◮ Collins and Singer (1999) provide Co-Boost: co-training with a per-iteration objective function and good accuracy
◮ Can we do the same for the Yarowsky algorithm?

Slide 4

Example task: word sense disambiguation

Data from Canadian Hansards (Eisner and Karakos, 2005):

◮ 2 labels (senses)
◮ features are adjacent and context (nearby) words
◮ 2 seed rules

Slide 5

Example task: word sense disambiguation

303 unlabelled training examples:

◮ Full time should be served for each sentence .
◮ The Liberals inserted a sentence of 14 words which reads :
◮ They get a concurrent sentence with no additional time added to their sentence .
◮ The words tax relief appeared in every second sentence in the federal government’s throne speech .
. . .

2 seed rules:
context: served   sense 1
context: reads    sense 2

→ 76.99% accuracy on unseen test set
→ non-seeded accuracy (Daume, 2011)
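Seed labelling on these examples is mechanical enough to sketch directly. This is a minimal illustration: treating the whole (pre-tokenized) sentence as a bag of words is a simplification of the actual adjacent-word and context features.

```python
# Apply the two seed rules to the raw sentences.
# NOTE: real features are adjacent words and context windows; treating the
# whole sentence as a bag of words is an illustrative simplification.
SEED_RULES = {"served": "sense 1", "reads": "sense 2"}

def label_with_seeds(sentence):
    """Return a seed label if a seed context word occurs, else None (unlabelled)."""
    words = sentence.lower().split()
    for context_word, sense in SEED_RULES.items():
        if context_word in words:
            return sense
    return None

examples = [
    "Full time should be served for each sentence .",
    "The Liberals inserted a sentence of 14 words which reads :",
    "They get a concurrent sentence with no additional time added to their sentence .",
]
labels = [label_with_seeds(s) for s in examples]  # → ['sense 1', 'sense 2', None]
```

Only the first two sentences match a seed rule; the third stays unlabelled until a later iteration learns a rule that covers it.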

Slide 6

Example task: named entity classification

Data from NYT (Collins and Singer, 1999):

◮ 3 labels (person, location, organization)
◮ spelling features from words in phrase, context features from parse tree
◮ 7 seed rules

Slide 7

Example task: named entity classification

89305 unlabelled training examples:

◮ Union Bank would automatically give it a foothold in this market in California .
◮ It is an ironic agreement , given Mr. Jobs’ historical disdain for IBM .
. . .

7 seed rules:
spelling: New-York        location
spelling: California      location
spelling: U.S.            location
spelling: Microsoft       organization
spelling: I.B.M.          organization
spelling: *Incorporated*  organization
spelling: *Mr.*           person

→ 89.97% test accuracy

Slide 8

Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)

seed DL → label data → train DL → threshold → repeat; final re-training (no threshold) → test

Seed DL:
1.0 context: served   sense 1
1.0 context: reads    sense 2

Labelled data:
Full time should be served for each sentence .
The Liberals inserted a sentence of 14 words which reads :
The sentence for such an offence would be a term of imprisonment for one year .
Mr. Speaker , I have a question based on the very last sentence of the hon. member .
. . .

Trained DL:
1.0  context: served          sense 1
1.0  context: reads           sense 2
.976 context: serv*           sense 1
.976 context: read*           sense 2
.969 next word: reads         sense 2
.969 next word: read*         sense 2
.955 previous word: his       sense 1
.955 previous word: hi*       sense 1
.955 context: inmate          sense 1
.917 previous word: their     sense 1
.917 previous word: relevant  sense 2
.917 previous word: next      sense 2
. . .
(a score threshold trims low-scoring rules before the next iteration)
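The whole loop can be sketched in a few lines. This is a minimal illustration under stated assumptions: the smoothed-precision scoring, the constants `ALPHA` and `THRESHOLD`, and the set-of-strings feature representation are illustrative choices, not the exact experimental settings.

```python
from collections import Counter

ALPHA = 0.1       # smoothing constant (illustrative assumption)
THRESHOLD = 0.9   # minimum rule score to keep (illustrative assumption)

def train_dl(labelled, num_labels):
    """Score each (feature, label) rule by smoothed precision; keep rules over the threshold."""
    count_fj, count_f = Counter(), Counter()
    for features, label in labelled:
        for f in features:
            count_fj[f, label] += 1
            count_f[f] += 1
    dl = {}
    for (f, j), c in count_fj.items():
        score = (c + ALPHA) / (count_f[f] + num_labels * ALPHA)
        if score >= THRESHOLD:
            dl[f, j] = score
    return dl

def apply_dl(dl, features):
    """Label an example by its highest-scoring matching rule, or None if no rule fires."""
    matches = [(score, j) for (f, j), score in dl.items() if f in features]
    return max(matches)[1] if matches else None

def yarowsky(examples, seed_dl, num_labels, iterations=5):
    """Alternate labelling the data with the current DL and retraining the DL."""
    dl = dict(seed_dl)
    for _ in range(iterations):
        labelled = [(x, apply_dl(dl, x)) for x in examples]
        labelled = [(x, y) for x, y in labelled if y is not None]
        dl = {**train_dl(labelled, num_labels), **seed_dl}  # seed rules are never dropped
    return dl

# Toy run: the rule for "time" is bootstrapped from its co-occurrence
# with the seeded "served" example.
seed = {("served", "sense 1"): 1.0, ("reads", "sense 2"): 1.0}
dl = yarowsky([{"served", "time"}, {"reads", "quote"}, {"time", "prison"}],
              seed, num_labels=2)
```

The third toy example has no seed match, but once `time` enters the decision list from the first example, the loop labels it on the next pass.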

Slide 9

Example decision list for the named entity task

Rank  Score     Feature                  Label
1     0.999900  New-York                 loc.
2     0.999900  California               loc.
3     0.999900  U.S.                     loc.
4     0.999900  Microsoft                org.
5     0.999900  I.B.M.                   org.
6     0.999900  Incorporated             org.
7     0.999900  Mr.                      per.
8     0.999976  U.S.                     loc.
9     0.999957  New-York-Stock-Exchange  loc.
10    0.999952  California               loc.
11    0.999947  New-York                 loc.
12    0.999946  court-in                 loc.
13    0.975154  Company-of               loc.
. . .

Context features are indicated by italics and seed rules by bold ranks in the original slide (formatting not preserved in this transcript).
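A decision list like the one above is applied by walking the ranked rules top-down and taking the first rule whose feature fires. A sketch using a few rows transcribed from the table (the set-of-strings feature representation is an illustrative simplification):

```python
# A decision list is a ranked rule list: the first matching rule wins.
DL = [
    ("New-York", "loc."),                 # rank 1 (seed)
    ("California", "loc."),               # rank 2 (seed)
    ("Microsoft", "org."),                # rank 4 (seed)
    ("Mr.", "per."),                      # rank 7 (seed)
    ("New-York-Stock-Exchange", "loc."),  # rank 9
]

def classify(features, dl=DL):
    """Return the label of the highest-ranked rule whose feature appears."""
    for feature, label in dl:
        if feature in features:
            return label
    return None  # no rule fires; example stays unlabelled

label = classify({"Mr.", "Jobs"})  # → 'per.'
```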

Slide 10

Yarowsky algorithm (Yarowsky, 1995; Collins and Singer, 1999)

[Plot: DL size, number of labelled training examples, and test accuracy by iteration]

Iteration 0:
  sense 1: 1 rule (1.0 context: served), 4 labelled examples
  sense 2: 1 rule (1.0 context: reads), 4 labelled examples

Iteration 1:
  sense 1: 46 rules (1.0 context: served, .976 context: serv*, .976 context: served, .955 context: inmat*, .955 context: releas*, . . . ), 114 labelled
  sense 2: 31 rules (1.0 context: reads, .976 context: read*, .976 context: reads, .969 next: read*, .969 next: reads, . . . ), 37 labelled

Iteration 2:
  sense 1: 854 rules (1.0 context: served, .998 next: .*, .998 next: ., .995 context: serv*, .995 context: prison*, . . . ), 238 labelled
  sense 2: 214 rules (1.0 context: reads, .991 context: read*, .984 context: read, .976 context: reads, .969 context: 11*, . . . ), 56 labelled

Iteration 3:
  sense 1: 1520 rules (1.0 context: served, .998 next: .*, .998 next: ., .960 context: life*, .960 context: life, . . . ), 242 labelled
  sense 2: 223 rules (1.0 context: reads, .991 context: read*, .984 context: read, .984 next: :*, .984 next: :, . . . ), 49 labelled

Iterations 4-6 (unchanged after iteration 4):
  sense 1: 1557 rules (1.0 context: served, .998 next: .*, .998 next: ., .996 context: life*, .996 context: life, . . . ), 247 labelled
  sense 2: 221 rules (1.0 context: reads, .991 context: read*, .984 context: read, .984 next: :*, .984 next: :, . . . ), 49 labelled

Slide 11

Performance

Yarowsky  81.49
(% clean non-seeded accuracy, named entity)

Slide 12

Vs. co-training

DL-CoTrain from (Collins and Singer, 1999):

Yarowsky                 81.49
DL-CoTrain non-cautious  85.73
(% clean non-seeded accuracy, named entity)

Co-training needs two views, e.g.:

◮ adjacent words { next word: a, next word: about, next word: according, . . . }
◮ context words { context: abolition, context: abundantly, context: accepting, . . . }
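The two views are simply a partition of the feature set, with one classifier trained per view. A minimal sketch, assuming feature names carry the prefixes shown above:

```python
def split_views(features):
    """Partition features into the adjacent-word view and the context-word view."""
    adjacent = {f for f in features if f.startswith(("next word:", "previous word:"))}
    context = {f for f in features if f.startswith("context:")}
    return adjacent, context

v1, v2 = split_views({"next word: a", "context: abolition", "previous word: his"})
```

Co-training's independence assumption is about these two sets: each view alone should be enough to classify an example.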

Slide 13

Vs. EM

EM algorithm from (Collins and Singer, 1999):

Yarowsky  81.49
EM        80.31
(% clean non-seeded accuracy, named entity)

With Yarowsky we can exploit type-level information in the DL

Slide 14

Vs. EM

EM:
  Expected counts on data:  x1 x2 x3 x4 x5 . . .
  Probabilities on features:  f1 f2 f3 f4 f5 . . .

Yarowsky:
  Labelled training data:  x1 x2 x3 x4 x5 . . .
  Decision list:  f1 f2 f3 f4 f5 . . .
  Trimmed DL:  f1 f3 f5 . . .

Slide 15

Cautiousness

Can we improve decision list trimming?

◮ (Collins and Singer, 1999) cautiousness: take top n rules for each label, n = 5, 10, 15, . . . by iteration
◮ Yarowsky-cautious
◮ DL-CoTrain cautious
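Cautious trimming is easy to state in code: at iteration t, keep at most n = 5·t rules per label. A sketch assuming rules are (score, feature, label) triples; the per-iteration step of 5 follows the n = 5, 10, 15, . . . schedule above:

```python
def cautious_trim(rules, iteration, step=5):
    """Keep only the top (step * iteration) rules per label, ranked by score."""
    n = step * iteration
    kept = []
    for label in {lab for _, _, lab in rules}:
        per_label = sorted((r for r in rules if r[2] == label), reverse=True)
        kept.extend(per_label[:n])
    return kept

# Iteration 1 keeps at most 5 rules per label.
rules = [(0.90 + i / 100, "f%d" % i, "sense 1") for i in range(7)]
trimmed = cautious_trim(rules, iteration=1)  # 5 of the 7 rules survive
```

Growing n slowly is what makes the algorithm "cautious": only the most confident rules feed labels back into training early on.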

Slide 16

Yarowsky-cautious algorithm (Collins and Singer, 1999)

[Plot: DL size, number of labelled training examples, and test accuracy by iteration]

Iteration 0:
  sense 1: 1 rule (1.0 context: served), 4 labelled examples
  sense 2: 1 rule (1.0 context: reads), 4 labelled examples

Iteration 1:
  sense 1: 6 rules (1.0 context: served, .976 context: serv*, .976 context: served, .955 context: inmat*, .955 context: releas*, . . . ), 25 labelled
  sense 2: 6 rules (1.0 context: reads, .976 context: read*, .976 context: reads, .969 next: read*, .969 next: reads, . . . ), 12 labelled

Iteration 2:
  sense 1: 11 rules (1.0 context: served, .995 context: serv*, .989 context: serve, .986 context: serving, .984 context: life*, . . . ), 62 labelled
  sense 2: 11 rules (1.0 context: reads, .991 context: read*, .984 context: read, .976 context: reads, .969 next: from*, . . . ), 20 labelled

Iteration 3:
  sense 1: 16 rules (1.0 context: served, .996 context: life*, .996 context: life, .995 context: serv*, .995 context: prison*, . . . ), 84 labelled
  sense 2: 16 rules (1.0 context: reads, .991 context: read*, .991 next: from*, .991 next: from, .984 context: read, . . . ), 32 labelled

Iteration 4:
  sense 1: 21 rules (1.0 context: served, .996 context: commut*, .996 context: life*, .996 context: life, .995 context: serv*, . . . ), 100 labelled
  sense 2: 21 rules (1.0 context: reads, .991 context: read*, .991 next: from*, .991 next: from, .989 context: quot*, . . . ), 36 labelled

Iterations 5-20 add 5 rules per label per iteration, with the top rules stable; rule counts and labelled examples (sense 1 / sense 2):
  Iteration 5:  26 rules, 114 / 40
  Iteration 6:  31 rules, 128 / 40
  Iteration 7:  36 rules, 139 / 40
  Iteration 8:  41 rules, 139 / 48
  Iteration 9:  46 rules, 139 / 51
  Iteration 10: 51 rules, 146 / 53
  Iteration 11: 56 rules, 156 / 54
  Iteration 12: 61 rules, 159 / 57
  Iteration 13: 66 rules, 159 / 58
  Iteration 14: 71 rules, 163 / 58
  Iteration 15: 76 rules, 165 / 58
  Iteration 16: 81 rules, 166 / 58
  Iteration 17: 86 rules, 169 / 58
  Iteration 18: 91 rules, 170 / 58
  Iteration 19: 96 rules, 170 / 58
  Iteration 20: 101 rules, 172 / 59

Slide 17

Yarowsky-cautious vs. co-training and EM

Yarowsky-cautious        89.97
DL-CoTrain cautious      90.49
Yarowsky non-cautious    81.49
DL-CoTrain non-cautious  85.73
EM                       80.31
(% clean non-seeded accuracy, named entity)

Yarowsky-cautious and DL-CoTrain cautious are statistically equivalent.

◮ Yarowsky performs well
◮ Cautiousness is important
◮ Yarowsky does not need views

Slide 18

Did we really do EM right?

Hard Online EM       80.49
Online EM            83.89
Hard EM              80.94
EM                   80.31
Yarowsky-cautious    89.97
DL-CoTrain cautious  90.49
(% clean non-seeded accuracy, named entity)

Multiple runs of EM. Variance of results:

◮ EM: ±.34
◮ Hard EM: ±2.53
◮ Online EM: ±.45
◮ Hard Online EM: ±.68

Slide 19

Yarowsky algorithm: (Abney, 2004)'s analysis

Yarowsky algorithm lacks theoretical analysis:

◮ (Abney, 2004) gives bounds for some variants (no cautiousness, no algorithm)
◮ Basis for our work

Training examples x, labels j:

  Full time should be served for each sentence .
  The Liberals inserted a sentence of 14 words which reads :
  They get a concurrent sentence with no additional time added to their sentence .
  The words tax relief appeared in every second sentence in the federal government’s throne speech .
  . . .

Labelling distributions φx(j): peaked for a labelled example x, uniform for an unlabelled example x.

Features f, labels j:

  context: reads, context: served, context: inmate, next: the, context: article, previous: introductory, previous: passing, next: said, . . .

Parameter distributions θf(j): normalized DL scores for feature f; the DL chooses arg max_j max_{f∈Fx} θf(j).

slide-85
SLIDE 85

19

Yarowsky algorithm: (Abney, 2004)’s analysis

Yarowsky algorithm lacks theoretical analysis

◮ (Abney, 2004) gives bounds for some variants

(no cautiousness, no algorithm)

◮ Basis for our work

Training examples x, labels j:

Full time should be served for each sentence .

The Liberals inserted a sentence of 14 words which reads :

They get a concurrent sentence with no additional time added to their sentence .

The words tax relief appeared in every second sentence in the federal government’s throne speech . . . .

labelling distributions φx(j)

peaked for labelled example x uniform for unlabelled example x

Features f , labels j:

context: reads

context: served

context: inmate

next: the

context: article

previous: introductory

previous: passing

next: said . . .

parameter distributions θf (j)

normalized DL scores for feature f DL chooses arg maxj maxf ∈Fx θf (j) alternative: arg maxj

  • f ∈Fx θf (j)
slide-86
SLIDE 86

20

Yarowsky algorithm: (Haffari and Sarkar, 2007)’s analysis

◮ (Haffari and Sarkar, 2007) extend (Abney, 2004) to a bipartite graph representation (polytime algorithm; no cautiousness)

[Bipartite graph: feature vertices θf1 . . . θf|F| (parameter distributions θf(j)) on one side, example vertices φx1 . . . φx|X| (labelling distributions φx(j)) on the other.]

Algorithm: fix one side, update the other.

slide-91
SLIDE 91

21

Objective Function

◮ KL divergence between two probability distributions:

KL(p||q) = Σi p(i) log (p(i) / q(i))

◮ Entropy of a distribution:

H(p) = −Σi p(i) log p(i)

◮ The objective function:

K(φ, θ) = Σ(fi,xj)∈Edges [ KL(θfi || φxj) + H(θfi) + H(φxj) ] + Regularizer

◮ Reduce uncertainty in the labelling distribution while respecting the labelled data
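The three terms of the objective can be checked numerically; a minimal Python sketch (the names `kl`, `entropy`, and `objective` are illustrative, not from the paper, and the regularizer is omitted):

```python
import math

def kl(p, q):
    """KL(p||q) = sum_i p(i) log(p(i)/q(i)), with 0 log 0 := 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_i p(i) log p(i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def objective(theta, phi, edges):
    """K(phi, theta): sum over bipartite edges (f, x); regularizer omitted."""
    return sum(kl(theta[f], phi[x]) + entropy(theta[f]) + entropy(phi[x])
               for f, x in edges)
```

With a single edge between a peaked θf and a uniform φx, the KL term and the entropy of φx each contribute log 2, so the objective is 2 log 2.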

slide-95
SLIDE 95

22

Generalized Objective Function

◮ Bregman divergence between two probability distributions:

Bψ(p||q) = Σi [ ψ(p(i)) − ψ(q(i)) − ψ′(q(i)) (p(i) − q(i)) ],   with Bt log t(p||q) = KL(p||q)

◮ ψ-entropy of a distribution:

Hψ(p) = −Σi ψ(p(i)),   with Ht log t(p) = H(p)

◮ The generalized objective function:

Kψ(φ, θ) = Σ(fi,xj)∈Edges [ Bψ(θfi || φxj) + Hψ(θfi) + Hψ(φxj) ] + Regularizer
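A quick way to see that ψ(t) = t log t recovers KL and ψ(t) = t² gives squared error is to evaluate the definition directly; a minimal sketch (helper names are illustrative):

```python
import numpy as np

def bregman(p, q, psi, dpsi):
    """B_psi(p||q) = sum_i [psi(p_i) - psi(q_i) - psi'(q_i) (p_i - q_i)]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(psi(p) - psi(q) - dpsi(q) * (p - q)))

# psi(t) = t log t recovers KL divergence (taking 0 log 0 = 0):
tlogt  = lambda t: np.where(t > 0, t * np.log(np.where(t > 0, t, 1.0)), 0.0)
dtlogt = lambda t: np.log(t) + 1.0

# psi(t) = t^2 gives the squared-error divergence Bt2 used for propagation:
tsq, dtsq = lambda t: t ** 2, lambda t: 2.0 * t
```

Expanding the t log t case leaves KL(p||q) plus Σi (q(i) − p(i)), which vanishes when p and q are both probability distributions; the t² case simplifies to Σi (p(i) − q(i))².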

slide-98
SLIDE 98

23

Generalized Objective Function

[Figure: ψ is convex; per component, Bψ(p||q) is the gap a − b between ψ(p(i)) and the tangent line ψ(q(i)) + ψ′(q(i))(p(i) − q(i)) taken at q(i).]

slide-99
SLIDE 99

24

Variants from (Abney, 2004; Haffari and Sarkar, 2007)

Algorithm                          % clean non-seeded accuracy (named entity)
Yarowsky-cautious                  89.97
Yarowsky non-cautious              81.49
Yarowsky-cautious sum              90.49
HaffariSarkar-bipartite avg-maj    79.69

slide-100
SLIDE 100

25

Graph-based Propagation (Subramanya et al., 2010)

Self-training with CRFs: seed data → label data → train CRF → get posteriors → get types → graph propagate → (repeat)

Compare with Yarowsky: seed DL → label data → train DL → (repeat)

slide-102
SLIDE 102

26

Our contributions

1. A cautious, well-performing Yarowsky variant with a per-iteration objective
2. Unification of various bootstrapping algorithms: (Collins and Singer, 1999), (Abney, 2004), (Haffari and Sarkar, 2007), (Subramanya et al., 2010)
3. More evidence that cautiousness is important
slide-105
SLIDE 105

27

Graph propagation

(Subramanya et al., 2010)’s propagation objective:

μ Σu∈V Σv∈N(u) wuv Bt2(qu, qv) + ν Σu∈V Bt2(qu, U)

[Graph: neighbouring vertices qu and qv; each edge contributes a smoothness term Bt2(qu, qv), and each vertex a uniform-regularization term Bt2(qu, U).]

◮ Efficient iterative updates
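For ψ(t) = t², each coordinate update has a closed form: the minimizing qu is a convex combination of its neighbours and the uniform distribution U. A minimal sketch (the `propagate` helper and its parameters are illustrative; the seed-matching term of (Subramanya et al., 2010) is omitted):

```python
import numpy as np

def propagate(q, w, mu=1.0, nu=0.1, iters=50):
    """Coordinate descent on
        mu * sum_u sum_{v in N(u)} w[u,v] * ||q_u - q_v||^2
            + nu * sum_u ||q_u - U||^2
    (the Bt2 objective).  Each update sets q_u to a convex combination of
    its neighbours and the uniform distribution U, so every row stays a
    probability distribution.  Assumes w has a zero diagonal."""
    q = np.array(q, dtype=float)
    n, k = q.shape
    U = np.full(k, 1.0 / k)
    for _ in range(iters):
        for u in range(n):
            # q_u <- (mu * sum_v w_uv q_v + nu * U) / (mu * sum_v w_uv + nu)
            q[u] = (mu * (w[u] @ q) + nu * U) / (mu * w[u].sum() + nu)
    return q
```

On a two-node graph with opposed peaked distributions, the updates smooth both nodes toward their common average.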

slide-109
SLIDE 109

28

Using graph propagation

μ Σu∈V [ Σv∈N(u) wuv Bt2(qu, qv) + Ht2(qu) ] + ν Σu∈V Bt2(qu, U)

φ-θ: use the bipartite graph from (Haffari and Sarkar, 2007), with vertices θf1 . . . θf|F| and φx1 . . . φx|X| (motivated by a similar objective)

θ-only: use only θ in a unipartite graph over the feature vertices θf1 . . . θf|F|

slide-111
SLIDE 111

29

Yarowsky-prop (our algorithm)

seed DL → graph propagate to get θP → label data with θP (sum) → train DL θ → (repeat)

◮ Can use φ-θ (bipartite) or θ-only (unipartite) (or two more, in the ACL 2012 paper)
◮ Optimizes (Subramanya et al., 2010)’s objective per iteration
◮ Uses the cautiousness decisions of θ, labels with θP

slide-114
SLIDE 114

30

Yarowsky-prop: objective behaviour

μ Σu∈V [ Σv∈N(u) wuv Bt2(qu, qv) + Ht2(qu) ] + ν Σu∈V Bt2(qu, U)

[Plot: propagation objective value and training set coverage versus iteration, for φ-θ without cautiousness.]

slide-115
SLIDE 115

31

The basic Yarowsky algorithm.

Require: training data X and a seed DL θ(0)
1: for iteration t = 1, 2, . . . to maximum or convergence do
2:    apply θ(t−1) to X to produce Y(t)
3:    train a new DL θ(t) on Y(t), keeping only rules with score above ζ
4: end for
5: train a final DL θ on the last Y(t)   // retraining step
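The loop above can be sketched with a toy decision list; a hedged Python version (first-match labelling and unsmoothed precision scores are simplifications of the (Collins and Singer, 1999) DL, and the final retraining step is omitted):

```python
from collections import Counter

def yarowsky(examples, seed_rules, zeta=0.95, max_iter=10):
    """Sketch of the basic Yarowsky loop over a toy decision list (DL).
    examples: list of feature lists; seed_rules: {feature: label}.
    Each round applies the current DL (first matching rule wins), then
    retrains, keeping only rules whose precision on the current labelling
    exceeds zeta."""
    dl = dict(seed_rules)          # feature -> label
    labels = {}                    # example index -> label
    for _ in range(max_iter):
        labels = {}
        for i, feats in enumerate(examples):
            for f in feats:
                if f in dl:
                    labels[i] = dl[f]
                    break
        pos, tot = Counter(), Counter()
        for i, y in labels.items():
            for f in examples[i]:
                tot[f] += 1
                pos[(f, y)] += 1
        new_dl = {f: y for (f, y), c in pos.items() if c / tot[f] > zeta}
        if new_dl == dl:           # converged
            break
        dl = new_dl
    return dl, labels
```

Starting from a single seed rule, the label spreads through co-occurring features until the whole training set is covered.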

slide-116
SLIDE 116

32

Yarowsky-prop algorithm (θ-only form)

1: let θfj be the scores of the seed rules   // crf train
2: for iteration t to maximum or convergence do
3:    let πx(j) = (1/|Fx|) Σf∈Fx θfj   // post. decode
4:    let θTfj = Σx∈Xf πx(j) / |Xf|   // token to type
5:    propagate θT to get θP   // graph propagate
6:    label the data with θP   // viterbi decode; cautiousness
7:    train a new DL θfj   // crf train
8: end for
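Steps 3 and 4 above (posterior decode and token-to-type aggregation) amount to two averaging passes over the bipartite structure; a minimal sketch with illustrative names:

```python
import numpy as np

def posterior_decode(theta, features_of):
    """pi_x(j) = (1/|F_x|) sum_{f in F_x} theta_f(j)   (step 3, sum form).
    features_of maps each example x to its feature list F_x."""
    return {x: sum(theta[f] for f in fs) / len(fs)
            for x, fs in features_of.items()}

def token_to_type(pi, examples_of):
    """theta^T_f(j) = (sum_{x in X_f} pi_x(j)) / |X_f|   (step 4).
    examples_of maps each feature f to the examples X_f containing it."""
    return {f: sum(pi[x] for x in xs) / len(xs)
            for f, xs in examples_of.items()}
```

Each θf and πx is a distribution over labels (here a NumPy vector), so both passes are simple means of probability vectors and preserve normalization.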

slide-117
SLIDE 117

33

Yarowsky-prop-cautious: behaviour

[Animated plot: DL size, number of labelled training examples, and test accuracy versus iteration, built up over iterations 0–18.]

Iteration 0: 1 rule per label (1.0 context: served; 1.0 context: reads); 4 + 4 training examples labelled.

Iteration 1: 6 rules per label (1.0 context: served, .976 context: serv*, .976 context: served, .955 context: inmat*, .955 context: releas* . . . ; 1.0 context: reads, .976 context: read*, .976 context: reads, .969 next: read*, .969 next: reads . . . ); 107 + 44 training examples labelled.

The DL then grows by 5 rules per label each iteration while the labelled training set stabilizes:

Iteration    Rules per label    Labelled train examples
2            11                 211 + 73
3            16                 243 + 60
4            21                 233 + 70
5            26                 229 + 74
6            31                 230 + 73
7            36                 229 + 74
8            41                 238 + 65
9            46                 240 + 63
10           51                 245 + 58
11           56                 248 + 55
12           61                 251 + 52
13           66                 248 + 55
14           71                 250 + 53
15           76                 250 + 53
16           81                 250 + 53
17           86                 250 + 53
18           91                 250 + 53

Iteration 18: 91 rules per label (1.0 context: served, .971 previous: the*, .971 previous: the, .997 context: year*, .996 context: commut* . . . ; 1.0 context: reads, .989 context: quot*, .988 context: quote, .984 context: read, .984 next: :* . . . ); 250 + 53 training examples labelled.

slide-137
SLIDE 137

34

Results

Algorithm                       % clean non-seeded accuracy (named entity)
Yarowsky-cautious               89.97
DL-CoTrain cautious             90.49
Y.-prop-cautious theta-only     91.52
Y.-prop-cautious phi-theta      88.95

Statistically equivalent to DL-CoTrain. But:

◮ No need for views
◮ Per-iteration objective

slide-139
SLIDE 139

35

Correct Yarowsky-prop Examples

Gold label    Features
location      X0 Waukegan, X01 maker, X3 LEFT
location      X0 Mexico, X42 president, X42 of, X11 president-of, X3 RIGHT
location      X0 La-Jolla, X2 La, X2 Jolla, X01 company, X3 LEFT

Figure: Named entity test set examples where Yarowsky-prop θ-only is correct and no other tested algorithms are correct.

slide-140
SLIDE 140

36

Software available at https://github.com/sfu-natlang/yarowsky

Thank you!

slide-141
SLIDE 141

37

Introduction The Yarowsky algorithm Graph-based Propagation Our algorithm Extra slides References

slide-142
SLIDE 142

38

More results

Algorithm                       named entity    drug     land     sentence
Num. train examples             89305           134      1604     303
Num. test examples              962             386      1488     515
DL-CoTrain (non-cautious)       85.73           58.73    77.72    51.05
DL-CoTrain (cautious)           90.49           58.17    77.72    65.69
Yarowsky                        81.49           57.62    78.41    54.81
Yarowsky-cautious               89.97           52.63    78.48    76.99
Yarowsky-cautious-sum           90.49           52.63    77.72    76.99
HS-bipartite avg-maj            79.69           50.14    77.72    51.67
EM                              80.31           52.49    31.12    65.23
  (±)                           0.34            0.28     0.03     3.55
Yarowsky-prop φ-θ               77.89           51.80    77.72    51.88
Yarowsky-prop θ-only            75.84           52.91    77.72    51.05
Yarowsky-prop-cautious φ-θ      88.95           55.40    77.72    72.18
Yarowsky-prop-cautious θ-only   91.52           57.06    77.72    73.22

clean non-seeded accuracy

slide-143
SLIDE 143

39

EM results

Algorithm                       named entity    drug     land     sentence
Num. train examples             89305           134      1604     303
Num. test examples              962             386      1488     515
Yarowsky                        81.49           57.62    78.41    54.81
Yarowsky-cautious               89.97           52.63    78.48    76.99
Yarowsky-prop-cautious θ-only   91.52           57.06    77.72    73.22
EM                              80.31           52.49    31.12    65.23
  (±)                           0.34            0.28     0.03     3.55
Hard EM                         80.95           52.91    40.12    63.47
  (±)                           2.53            0.74     13.39    6.37
Online EM                       83.89           54.29    45.00    56.25
  (±)                           0.45            0.94     21.29    3.28
Hard online EM                  80.41           54.54    50.51    56.28
  (±)                           0.68            1.03     23.02    3.56

clean non-seeded accuracy

slide-144
SLIDE 144

40

Decision lists

(Collins and Singer, 1999)’s DL scores:

θfj ∝ (|Λfj| + ǫ) / (|Λf| + Lǫ)

max definition of π (strict DL): πx(j) ∝ maxf∈Fx θfj

sum definition of π: πx(j) = (1/|Fx|) Σf∈Fx θfj
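These score and decoding rules are easy to sketch directly; a minimal Python version (function names are illustrative, not from the paper):

```python
from collections import Counter

def dl_scores(labelled, n_labels, eps=0.1):
    """Smoothed DL scores theta_{f,j} = (|Lambda_{f,j}| + eps) / (|Lambda_f| + L*eps),
    where Lambda_{f,j} is the set of labelled examples with feature f and label j.
    labelled: iterable of (feature list, label) pairs."""
    count_fj, count_f = Counter(), Counter()
    for feats, j in labelled:
        for f in feats:
            count_f[f] += 1
            count_fj[(f, j)] += 1
    return {(f, j): (count_fj[(f, j)] + eps) / (count_f[f] + n_labels * eps)
            for f in count_f for j in range(n_labels)}

def pi_strict(theta, feats, n_labels):
    """Strict DL: pi_x(j) proportional to max_{f in F_x} theta_{f,j}."""
    s = [max(theta[(f, j)] for f in feats) for j in range(n_labels)]
    return [v / sum(s) for v in s]

def pi_sum(theta, feats, n_labels):
    """Sum definition: pi_x(j) = (1/|F_x|) sum_{f in F_x} theta_{f,j}."""
    return [sum(theta[(f, j)] for f in feats) / len(feats)
            for j in range(n_labels)]
```

Because the smoothed scores for a fixed feature sum to one over labels, the sum definition already yields a normalized distribution, while the strict (max) definition needs the explicit normalization shown.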

slide-145
SLIDE 145

41

HS-bipartite.

1: apply θ(0) to X to produce a labelling Y(0)
2: for iteration t to maximum or convergence do
3:    for f ∈ F do
4:       let p = examples-to-feature({φx : x ∈ Xf})
5:       if p ≠ U then let θf = p
6:    end for
7:    for x ∈ X do
8:       let p = features-to-example({θf : f ∈ Fx})
9:       if p ≠ U then let φx = p
10:   end for
11: end for

slide-146
SLIDE 146

42

Accuracy plot: (Collins and Singer, 1999) algorithms

[Plot: clean non-seeded test accuracy versus iteration for DL-CoTrain (cautious), Yarowsky, Yarowsky-sum, Yarowsky-cautious, and Yarowsky-cautious-sum.]

Non-seeded test accuracy versus iteration for various algorithms on named entity. The results for the Yarowsky-prop algorithms are for the propagated classifier θP, except for the final DL retraining iteration.

slide-147
SLIDE 147

43

Accuracy plot: Yarowsky-prop cautious

[Plot: non-seeded test accuracy versus iteration for Yarowsky-prop-cautious phi-theta, pi-theta, theta-only, and thetatype-only.]

Non-seeded test accuracy versus iteration for various algorithms on named entity. The results for the Yarowsky-prop algorithms are for the propagated classifier θP, except for the final DL retraining iteration.

slide-148
SLIDE 148

44

Accuracy and coverage plot: non-cautious

[Plot: non-seeded test accuracy and internal train set coverage (same scale) versus iteration.]

Internal train set coverage and non-seeded test accuracy (same scale) for Yarowsky-prop θ-only on named entity.

slide-149
SLIDE 149

45

Accuracy and coverage plot: cautious

[Plot: non-seeded test accuracy and internal train set coverage (same scale) versus iteration.]

Internal train set coverage and non-seeded test accuracy (same scale) for Yarowsky-prop θ-only on named entity.

slide-150
SLIDE 150

46

Objective plot

[Plot: propagation objective value (right axis) and training set coverage (left axis) versus iteration.]

Non-seeded test accuracy (left axis), coverage (left axis, same scale), and objective value (right axis) for Yarowsky-prop φ-θ. Iterations are shown on a log scale. We omit the first iteration (where the DL contains only the seed rules) and start the plot at iteration 2 where there is a complete DL.

slide-151
SLIDE 151

47

Graph structures for propagation.

Method     V        N(u)                     qu
φ-θ        X ∪ F    Nx = Fx, Nf = Xf         qx = φx, qf = θf
π-θ        X ∪ F    Nx = Fx, Nf = Xf         qx = πx, qf = θf
θ-only     F        Nf = ∪x∈Xf Fx \ {f}      qf = θf
θT-only    F        Nf = ∪x∈Xf Fx \ {f}      qf = θTf

Propagation objective:

μ Σu∈V Σv∈N(u) wuv Bt2(qu, qv) + ν Σu∈V Bt2(qu, U)

[Graphs: the bipartite graph over θf1 . . . θf|F| and φx1 . . . φx|X|, and the unipartite graph over θf1 . . . θf|F| only.]

slide-152
SLIDE 152

48

Introduction The Yarowsky algorithm Graph-based Propagation Our algorithm Extra slides References

slide-153
SLIDE 153

48

Abney, S. (2004). Understanding the Yarowsky algorithm. Computational Linguistics, 30(3).

Collins, M. and Singer, Y. (1999). Unsupervised models for named entity classification. In EMNLP 1999: Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 100–110.

Daume, H. (2011). Seeding, transduction, out-of-sample error and the Microsoft approach... Blog post at http://nlpers.blogspot.com/2011/04/seeding-transduction-out-of-sample.html.

Eisner, J. and Karakos, D. (2005). Bootstrapping without the boot. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pages 395–402, Vancouver, British Columbia, Canada. Association for Computational Linguistics.

Haffari, G. and Sarkar, A. (2007). Analysis of semi-supervised learning with the Yarowsky algorithm. In UAI 2007, Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, pages 159–166.

Subramanya, A., Petrov, S., and Pereira, F. (2010). Efficient graph-based semi-supervised learning of structured tagging models. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 167–176, Cambridge, MA. Association for Computational Linguistics.

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics, pages 189–196, Cambridge, Massachusetts, USA. Association for Computational Linguistics.