Variational Sequential Labelers for Semi-Supervised Learning (PowerPoint PPT Presentation)


SLIDE 1

Variational Sequential Labelers for Semi-Supervised Learning

Mingda Chen, Qingming Tang, Karen Livescu, Kevin Gimpel

SLIDE 2

Sequence Labeling

Part-of-Speech (POS) Tagging

This item is a small one and easily missed .
determiner noun verb determiner adjective noun coordinating-conjunction adverb verb punctuation

Named Entity Recognition (NER) EU rejects German call to boycott British lamb .

B-ORG O B-MISC O O O B-MISC O O

SLIDE 3

Overview

❖ Latent-variable generative models for sequence labeling
❖ 0.8% to 1% absolute improvements over 8 datasets without structured inference
❖ 0.1% to 0.3% absolute improvements from adding unlabeled data

SLIDE 4

Why latent-variable models?

❖ Natural way to incorporate unlabeled data
❖ Ability to disentangle representations via the configuration of latent variables
❖ Allow us to use neural variational methods

SLIDE 5

Variational Autoencoder (VAE)

[Kingma and Welling, ICLR’14; Rezende and Mohamed, ICML’15]

(Figure: graphical model with the observation and the latent variable)

SLIDE 6

Variational Autoencoder (VAE)

Evidence Lower Bound (ELBO)

[Kingma and Welling, ICLR’14; Rezende and Mohamed, ICML’15]

(Figure: graphical model with the observation and the latent variable)
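The ELBO equation itself did not survive extraction from the slide; the standard form from Kingma and Welling, with $x$ the observation and $z$ the latent variable, is:

```latex
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```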

SLIDE 7

Conditional Variational Autoencoder

(Figure: the observation, the latent variable, and the given context)

SLIDE 8

Conditional Variational Autoencoder

(Figure: the observation, the latent variable, and the given context)
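The conditional ELBO on this slide is likewise lost in the transcript; conditioning everything in the standard VAE bound on the given context $c$ yields:

```latex
\log p_\theta(x \mid c) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, c)}\!\big[\log p_\theta(x \mid z, c)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x, c) \,\|\, p_\theta(z \mid c)\big)
```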

SLIDE 9

The input words other than the word at the current position

SLIDE 10

The input words other than the word at the current position:

This item is a small one and easily missed .

SLIDE 11

The input words other than the word at the current position:

This item is a small one and easily missed .

SLIDE 12

Variational Sequential Labeler (VSL)

(Figure: the observation, the latent variable, and the given context)

SLIDE 13

Variational Sequential Labeler (VSL)

ELBO

(Figure: the observation, the latent variable, and the given context)
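The VSL ELBO on this slide is not recoverable from the transcript. By analogy with the conditional VAE above, one plausible per-position form (notation is my reconstruction, not the slide's: $x_t$ the word at position $t$, $x_{-t}$ its surrounding context, $z_t$ the latent variable) is:

```latex
\log p(x_t \mid x_{-t}) \;\ge\;
  \mathbb{E}_{q(z_t \mid x)}\!\big[\log p(x_t \mid z_t)\big]
  \;-\; \mathrm{KL}\big(q(z_t \mid x) \,\|\, p(z_t \mid x_{-t})\big)
```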

SLIDE 14

Variational Sequential Labeler (VSL)

SLIDE 15

Variational Sequential Labeler (VSL)

SLIDE 16

Variational Sequential Labeler (VSL)

Classification loss (CL)

SLIDE 17

Variational Sequential Labeler (VSL)

Classification loss (CL)

SLIDE 18

VSL: Training and Testing

Training

❖ Maximize the ELBO plus the classification loss scaled by a hyperparameter
❖ Use one sample from the Gaussian distribution via the reparameterization trick

Testing

❖ Use the mean of the Gaussian distribution
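The training/testing split above can be sketched with the reparameterization trick (a minimal NumPy sketch; in the actual model, `mu` and `log_var` are produced by the learned encoder):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Training: draw z = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

def predict_latent(mu, log_var):
    """Testing: use the mean of the Gaussian instead of a sample."""
    return mu

rng = np.random.default_rng(0)
mu, log_var = np.zeros(4), np.zeros(4)
z_train = reparameterize(mu, log_var, rng)  # stochastic, one sample
z_test = predict_latent(mu, log_var)        # deterministic mean
```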

SLIDE 19

Variants of VSL

VSL-G (figure: position of classifier)

SLIDE 20

Variants of VSL

VSL-G, where "G" stands for "Gaussian" (figure: position of classifier)

SLIDE 21

Variants of VSL

VSL-G ("G" stands for "Gaussian") and VSL-GG-Flat (figure: position of classifier)

SLIDE 22

Variants of VSL

VSL-G ("G" stands for "Gaussian") and VSL-GG-Flat (figure: position of classifier)

SLIDE 23

Variants of VSL

VSL-G ("G" stands for "Gaussian") and VSL-GG-Flat (figure: position of classifier)

SLIDE 24

Variants of VSL

VSL-G ("G" stands for "Gaussian"), VSL-GG-Flat, and VSL-GG-Hier (figure: position of classifier)

SLIDE 25

Variants of VSL

VSL-G ("G" stands for "Gaussian"), VSL-GG-Flat, and VSL-GG-Hier (figure: position of classifier)

SLIDE 26

Experiments

❖ Twitter POS Dataset
  ➢ Subset of 56 million English tweets as unlabeled data
  ➢ 25 tags
❖ Universal Dependencies POS Datasets
  ➢ 20% of original training set as labeled data
  ➢ 50% of original training set as unlabeled data
  ➢ 6 languages
  ➢ 17 tags
❖ CoNLL 2003 English NER Dataset
  ➢ 10% of original training set as labeled data
  ➢ 50% of original training set as unlabeled data
  ➢ BIOES labeling scheme
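The CoNLL data is distributed with BIO tags, so the BIOES scheme mentioned above implies a conversion step; a minimal sketch (the helper name `bio_to_bioes` is my own, not from the talk):

```python
def bio_to_bioes(tags):
    """Convert BIO tags to BIOES: single-token spans become S-,
    and the last token of a multi-token span becomes E-."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label  # span continues on the next token
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:  # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out

# e.g. bio_to_bioes(["B-ORG", "I-ORG", "O", "B-MISC"])
#      -> ["B-ORG", "E-ORG", "O", "S-MISC"]
```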

SLIDE 27

Results

SLIDE 28

Universal Dependencies POS

SLIDE 29

t-SNE Visualization

❖ Each point represents a word token
❖ Color indicates the gold-standard POS tag in the Twitter dev set

(Figure: t-SNE of the BiGRU baseline representations)
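A plot like this can be reproduced along these lines (a sketch assuming scikit-learn; random features stand in for the model's per-token representations, which would be colored by gold POS tag):

```python
import numpy as np
from sklearn.manifold import TSNE

def embed_2d(features, seed=0):
    """Project per-token feature vectors to 2-D, one point per word token."""
    tsne = TSNE(n_components=2, random_state=seed, perplexity=5, init="random")
    return tsne.fit_transform(features)

# Stand-in for per-token latent vectors (e.g. the z variable at each position).
tokens = np.random.default_rng(0).standard_normal((50, 16))
points = embed_2d(tokens)  # one 2-D point per token, ready to scatter-plot
```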

SLIDE 30

t-SNE Visualization

(Figures: t-SNE for VSL-GG-Hier and VSL-GG-Flat, shown separately for the y (label) variable and the z variable)

SLIDE 31

Effect of Position of Classification Loss

VSL-GG-Hier (figure: position of classifier)

SLIDE 32

Effect of Position of Classification Loss

VSL-GG-Hier vs. VSL-GG-Hier with the classifier on z (figure: position of classifier)

SLIDE 33

Effect of Position of Classification Loss

VSL-GG-Hier vs. VSL-GG-Hier-z (VSL-GG-Hier with the classifier on z) (figure: position of classifier)

SLIDE 34

Effect of Position of Classification Loss

SLIDE 35

Effect of Position of Classification Loss

Hierarchical structure is only helpful when the classification loss and the reconstruction loss are attached to different latent variables.

SLIDE 36

Effect of Variational Regularization (VR)

VR consists of two ingredients:
❖ Randomness in the latent space
❖ KL divergence between the approximate posterior and the prior
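For a diagonal Gaussian posterior, the KL term above has a closed form; a minimal sketch, assuming a standard normal prior as in the basic VAE:

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions:
    0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# The KL is zero exactly when the posterior equals the prior.
```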

SLIDE 37

Effect of VR

SLIDE 38

Effect of Unlabeled Data

❖ Evaluate VSL-GG-Hier on the Twitter dataset
❖ Subsample unlabeled data from the 56 million tweets
❖ Vary the amount of unlabeled data

SLIDE 39

Effect of Unlabeled Data

SLIDE 40

Summary

❖ We introduced VSLs for semi-supervised learning
❖ The best VSL uses multiple latent variables arranged in a hierarchical structure
❖ Hierarchical structure is only helpful when the classification loss and the reconstruction loss are attached to different latent variables
❖ VSLs show consistent improvements over a strong baseline across 8 datasets

SLIDE 41

Thank you!