Strong Baselines for Neural Semi-supervised Learning under Domain - - PowerPoint PPT Presentation



SLIDE 1

Strong Baselines for Neural Semi-supervised Learning under Domain Shift

Sebastian Ruder, Barbara Plank


SLIDE 7

Learning under Domain Shift

  • State-of-the-art domain adaptation approaches:
  • leverage task-specific features
  • evaluate on proprietary datasets or on a single benchmark
  • only compare against weak baselines
  • almost none evaluate against approaches from the extensive semi-supervised learning (SSL) literature


SLIDE 11

Revisiting Semi-Supervised Learning Classics in a Neural World

  • How do classics in SSL compare to recent advances?
  • Can we combine the best of both worlds?
  • How well do these approaches work on out-of-distribution data?


SLIDE 16

Bootstrapping algorithms

  • Self-training
  • (Co-training)
  • Tri-training
  • Tri-training with disagreement

SLIDE 21

Self-training

  • 1. Train model on labeled data.
  • 2. Use confident predictions on unlabeled data as training examples. Repeat.
  • Error amplification
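The two steps above can be sketched as a simple loop. This is an illustrative sketch, not the authors' implementation: `fit` and `predict_proba` are hypothetical model methods, and selection keeps the top-n most confident predictions per round.

```python
def self_train(model, labeled, unlabeled, n_new=100, rounds=5):
    """Self-training sketch: train on labeled data, then repeatedly
    add the model's most confident predictions on unlabeled data
    as pseudo-labeled training examples."""
    data, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model.fit(data)
        # Score the pool; predict_proba(x) -> (label, confidence).
        scored = sorted(((model.predict_proba(x), x) for x in pool),
                        key=lambda t: t[0][1], reverse=True)
        top, rest = scored[:n_new], scored[n_new:]
        data += [(x, y) for (y, conf), x in top]   # add pseudo-labels
        pool = [x for _, x in rest]
    model.fit(data)  # final fit on labeled + pseudo-labeled data
    return model
```

Because the model trains on its own predictions, any systematic mistake gets reinforced in later rounds, which is the error-amplification caveat the slide warns about.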

SLIDE 27

Self-training variants

  • Calibration
  • Output probabilities in neural networks are poorly calibrated.
  • Throttling (Abney, 2007), i.e. selecting the top n highest-confidence unlabeled examples, works best.
  • Online learning
  • Training until convergence on labeled data and then training on unlabeled data works best.
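Throttling amounts to a top-n selection rather than a probability threshold; a minimal sketch, with the `(confidence, example, pseudo_label)` triple format as an assumption:

```python
import heapq

def throttle(scored, n):
    """Throttling (Abney, 2007): instead of thresholding on output
    probability (unreliable, since neural network probabilities are
    poorly calibrated), keep only the n highest-confidence examples.
    `scored` holds (confidence, example, pseudo_label) triples."""
    return heapq.nlargest(n, scored, key=lambda t: t[0])
```

For example, `throttle([(0.2, 'a', 0), (0.9, 'b', 1), (0.5, 'c', 1)], 2)` keeps the examples scored 0.9 and 0.5.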


SLIDE 40

Tri-training

  • 1. Train three models on bootstrapped samples.
  • 2. Use predictions on unlabeled data for third if two agree.
  • 3. Final prediction: majority voting

(Figure: for an input x, the three models predict y = 1, y = 1, and y = 0; the majority vote yields y = 1.)
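The three steps above can be sketched as follows; `fit` and `predict` are hypothetical model methods, and the simplified agreement rule is an assumption rather than the exact original algorithm:

```python
import random

def tri_train(models, labeled, unlabeled, rounds=3):
    """Tri-training sketch (Zhou & Li, 2005): three models, each first
    trained on a bootstrap sample of the labeled data; an unlabeled
    example is pseudo-labeled for one model whenever the other two
    agree on it."""
    for m in models:  # 1. bootstrap samples (sampling with replacement)
        m.fit([random.choice(labeled) for _ in labeled])
    for _ in range(rounds):
        for i, m_i in enumerate(models):
            m_j, m_k = (m for j, m in enumerate(models) if j != i)
            # 2. where the other two agree, pseudo-label for the third
            pseudo = [(x, m_j.predict(x)) for x in unlabeled
                      if m_j.predict(x) == m_k.predict(x)]
            m_i.fit(list(labeled) + pseudo)
    return models

def majority_vote(models, x):
    """3. Final prediction: majority vote over the three models."""
    preds = [m.predict(x) for m in models]
    return max(set(preds), key=preds.count)
```

The bootstrap sampling in step 1 is what makes the three models diverse enough for their agreement to carry signal.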


SLIDE 49

Tri-training with disagreement

  • 1. Train three models on bootstrapped samples.
  • 2. Use predictions on unlabeled data for third if two agree and the third's prediction differs.
  • 3 independent models

(Figure: for an input x, two models predict y = 1 while the third predicts y = 0, so the agreed label y = 1 is added as training data for the third.)
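The disagreement condition only changes the pseudo-labeling step; a minimal sketch, assuming models expose a `predict` method:

```python
def disagreement_pseudo_labels(m_i, m_j, m_k, unlabeled):
    """Tri-training with disagreement sketch: an unlabeled example is
    pseudo-labeled for m_i only when m_j and m_k agree AND m_i itself
    currently predicts something different, keeping the added data
    informative and the procedure more conservative."""
    pseudo = []
    for x in unlabeled:
        y_j, y_k = m_j.predict(x), m_k.predict(x)
        if y_j == y_k and m_i.predict(x) != y_j:
            pseudo.append((x, y_j))
    return pseudo
```

Examples the third model already gets right are skipped, so each round adds only the cases it stands to learn from.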


SLIDE 55

Tri-training hyper-parameters

  • Sampling unlabeled data
  • Producing predictions for all unlabeled examples is expensive.
  • Sample a number of unlabeled examples instead.
  • Confidence thresholding
  • Not effective for the classic approaches, but essential for our method.
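The two hyper-parameters above can be combined in one candidate-selection step. This is a hedged sketch of the idea, not the authors' code; `predict_proba(x) -> (label, confidence)` is a hypothetical interface:

```python
import random

def candidate_pool(unlabeled, k, threshold, predict_proba, rng=random):
    """Sample k unlabeled examples per round (predicting on everything
    is expensive) and keep only predictions whose confidence clears
    the threshold."""
    sample = rng.sample(unlabeled, min(k, len(unlabeled)))
    pseudo = []
    for x in sample:
        y, conf = predict_proba(x)
        if conf >= threshold:
            pseudo.append((x, y))
    return pseudo
```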


SLIDE 64

Multi-task tri-training

  • 1. Train one model with 3 objective functions.
  • 2. Use predictions on unlabeled data for third if two agree.
  • 3. Restrict final layers to use different representations.
  • 4. Train the third objective function only on pseudo-labeled data to bridge the domain shift.

(Figure: for an input x, two output layers predict y = 1, so y = 1 is used as a pseudo-label for the third.)
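Step 2 works head-by-head over the shared model's three outputs; a minimal sketch, assuming each head is a callable `x -> label` (a hypothetical stand-in for a softmax output layer):

```python
def mt_tri_pseudo_labels(heads, unlabeled):
    """Multi-task tri-training sketch: a single shared model has three
    output heads m1, m2, m3. Where two heads agree on an unlabeled
    target example, the agreed label becomes pseudo-labeled training
    data for the third head."""
    pseudo = {i: [] for i in range(3)}
    for x in unlabeled:
        preds = [h(x) for h in heads]
        for i in range(3):
            j, k = [n for n in range(3) if n != i]
            if preds[j] == preds[k]:
                pseudo[i].append((x, preds[j]))
    return pseudo
```

Sharing one encoder across the three heads is what removes tri-training's space and time overhead of three independent models.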


SLIDE 76

Multi-task Tri-training

(Figure: the architecture of Plank et al. (2016): character BiLSTMs feed word-level BiLSTMs over words w1, w2, w3, with three output layers m1, m2, m3 on top.)

  • Orthogonality constraint (Bousmalis et al., 2016):

    L_orth = ‖W_m1^T W_m2‖²_F

  • Loss:

    L(θ) = −Σ_i Σ_{1,…,n} log P_mi(y | h⃗) + γ L_orth
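The orthogonality term is just the squared Frobenius norm of the product of two heads' weight matrices. A dependency-free sketch (real implementations would compute this on framework tensors):

```python
def orthogonality_penalty(W1, W2):
    """L_orth = ||W1^T W2||_F^2 (Bousmalis et al., 2016): penalizes
    overlap between two heads' softmax weight matrices, pushing them
    toward different representations. W1 and W2 are d x k matrices
    given as lists of rows."""
    rows = range(len(W1))
    penalty = 0.0
    for a in range(len(W1[0])):          # columns of W1
        for b in range(len(W2[0])):      # columns of W2
            m_ab = sum(W1[r][a] * W2[r][b] for r in rows)  # (W1^T W2)[a][b]
            penalty += m_ab ** 2
    return penalty
```

When the column spaces of the two matrices are orthogonal the penalty is zero, which is exactly the "use different representations" constraint from the previous slide.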


SLIDE 80

Data & Tasks

Two tasks and their domains:

  • Sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2006)
  • POS tagging on the SANCL 2012 dataset (Petrov and McDonald, 2012)


SLIDE 85

Sentiment Analysis Results

(Chart: accuracy averaged over 4 target domains, axis 75–82, comparing VFAE*, DANN*, Asym*, Source only, Self-training, Tri-training, Tri-training-Disagr., and MT-Tri; * results from Saito et al. (2017).)

  • Multi-task tri-training slightly outperforms tri-training, but has higher variance.


SLIDE 90

POS Tagging Results

Trained on 10% labeled data (WSJ).

(Chart: accuracy averaged over 5 target domains, axis 88.7–89.8, comparing Source (+embeds), Self-training, Tri-training, Tri-training-Disagr., and MT-Tri.)

  • Tri-training with disagreement works best with little data.

SLIDE 94

POS Tagging Results

Trained on full labeled data (WSJ).

(Chart: accuracy averaged over 5 target domains, axis 89–92, comparing TnT, Stanford*, Source (+embeds), Tri-training, Tri-training-Disagr., and MT-Tri; * result from Schnabel & Schütze (2014).)

  • Tri-training works best in the full data setting.
SLIDE 99

POS Tagging Analysis

Accuracy on out-of-vocabulary (OOV) tokens.

(Chart: OOV accuracy, axis 50–80, and OOV rate, 2.75–11%, for the Answers, Emails, Newsgroups, Reviews, and Weblogs domains; methods Src, Tri, MT-Tri.)

  • Classic tri-training works best on OOV tokens.
  • MT-Tri does worse than the source-only baseline on OOV tokens.

SLIDE 102

POS Tagging Analysis

POS accuracy per binned log frequency.

(Chart: accuracy delta vs. the source-only baseline, roughly −0.005 to 0.018, over binned frequencies 1–14, for MT-Tri and Tri.)

  • Tri-training works best on low-frequency tokens (leftmost bins).


SLIDE 108

POS Tagging Analysis

Accuracy on unknown word-tag (UWT) tokens (very difficult cases).

(Chart: UWT accuracy, axis 8–26, and UWT rate, 1–4%, for the Answers, Emails, Newsgroups, Reviews, and Weblogs domains; methods Src, Tri, MT-Tri, FLORS*; * result from Schnabel & Schütze (2014).)

  • No bootstrapping method works well on unknown word-tag combinations.
  • The less lexicalized FLORS approach is superior.


SLIDE 112

Takeaways

  • Classic tri-training works best: it outperforms recent state-of-the-art methods for sentiment analysis.
  • We address the drawback of tri-training (space & time complexity) with the proposed MT-Tri model.
  • MT-Tri works best on sentiment, but not for POS tagging.
  • Importance of:
  • comparing neural methods to classics (strong baselines)
  • evaluating on multiple tasks & domains