Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks
Pedram Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum


SLIDE 1

Search-Guided, Lightly-Supervised Training of Structured Prediction Energy Networks

Pedram Rooshenas, Dongxu Zhang, Gopal Sharma, Andrew McCallum

SLIDE 2

Structured Prediction

  • We are interested in learning a function F: X → Y
  • X: input variables
  • Y: output variables
  • We can define F as F(x) = argmin_y E(x, y)
  • For a Gibbs distribution: P(y | x) ∝ exp(−E(x, y))
SLIDE 3

Structured Prediction Energy Networks (SPENs)

  • If E(x, y) is parameterized by a differentiable model such as a deep neural network, we can find a local minimum of E using gradient descent.
  • Energy networks express the correlation among input and output variables.
  • Traditionally, graphical models are used to represent the correlation among output variables.
  • Inference is intractable for most expressive graphical models.
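As a concrete illustration of gradient-descent inference, here is a minimal numpy sketch with a toy quadratic energy standing in for the deep energy network; the matrix `A`, the step size, and the step count are all made up for illustration:

```python
import numpy as np

# Toy energy: E(x, y) = ||A y - x||^2, a stand-in for the deep energy
# network; the matrix A plays the role of the learned parameters.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))

def energy(x, y):
    r = A @ y - x
    return float(r @ r)

def grad_y(x, y):
    # Analytic gradient of E with respect to the output y.
    return 2.0 * A.T @ (A @ y - x)

def gd_inference(x, steps=200, lr=0.01):
    """Gradient-descent inference: minimize E(x, y) over y."""
    y = np.zeros(A.shape[1])
    for _ in range(steps):
        y -= lr * grad_y(x, y)
    return y

x = rng.normal(size=4)
y_star = gd_inference(x)
```

In the actual model the relaxed output y lives on the simplex, so inference uses projected (or softmax-reparameterized) gradient steps rather than the unconstrained update shown here.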
SLIDE 4

Energy Models

[picture from Belanger (2016)] [picture from Altinel (2018)]

SLIDE 5

Training SPENs

  • Structural SVM (Belanger and McCallum, 2016)
  • End-to-End (Belanger et al., 2017)
  • Value-based training (Gygli et al., 2017)
  • Inference Network (Tu and Gimpel, 2018)
  • Rank-Based Training (Rooshenas et al., 2018)
SLIDE 6

Indirect Supervision

  • Data annotation is expensive, especially for structured outputs.
  • Domain knowledge can serve as a source of supervision.
  • It can be written as a reward function.
  • The reward function evaluates a pair of input and output configurations into a scalar value.
  • For a given x, we are looking for the best y that maximizes the reward.

SLIDE 7

Search-Guided Training

We have a reward function that provides indirect supervision.

SLIDE 8

Search-Guided Training

We have a reward function that provides indirect supervision. We want to learn a smooth version of the reward function so that we can use gradient-descent inference at test time.

SLIDE 9

Search-Guided Training

[sample y0] We sample a point from the energy function using noisy gradient-descent inference.

SLIDE 10

Search-Guided Training

[samples y0, y1] We sample a point from the energy function using noisy gradient-descent inference.

SLIDE 11

Search-Guided Training

[samples y0, y1, y2] We sample a point from the energy function using noisy gradient-descent inference.

SLIDE 12

Search-Guided Training

[samples y0, y1, y2, y3] We sample a point from the energy function using noisy gradient-descent inference.

SLIDE 13

Search-Guided Training

[samples y0 through y4] We sample a point from the energy function using noisy gradient-descent inference.

SLIDE 14

Search-Guided Training

[samples y0 through y5] We sample a point from the energy function using noisy gradient-descent inference.

SLIDE 15

Search-Guided Training

[samples y0 through y5] Then we project the sample onto the domain of the reward function (the sample is a point in the simplex, but the domain of the reward function is often discrete, i.e., the vertices of the simplex).

SLIDE 16

Search-Guided Training

[samples y0 through y5] Then the search procedure uses the sample as input and returns an output structure by searching the reward function.

SLIDE 17

Search-Guided Training

[samples y0 through y5] We expect the two points to have the same ranking in the reward function and in the negative of the energy function.
SLIDE 18

Search-Guided Training

[samples y0 through y5; ranking violation] We expect the two points to have the same ranking in the reward function and in the negative of the energy function.
SLIDE 19

Search-Guided Training

[samples y0 through y5] When we find a pair of points that violates the ranking constraints, we update the energy function to reduce the violation.
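A minimal sketch of this rank-based update, assuming a linear energy E(x, y) = w · φ(x, y) with a hypothetical joint feature map (the paper uses a deep energy network and gradient steps on its parameters):

```python
import numpy as np

def phi(x, y):
    # Hypothetical joint feature map; a stand-in for the deep energy net.
    return np.concatenate([x * y, y])

def energy(w, x, y):
    return float(w @ phi(x, y))

def rank_update(w, x, y_better, y_worse, margin=1.0, lr=0.1):
    """One rank-based step: if the reward prefers y_better but the energy
    does not rank it at least `margin` lower than y_worse, push
    E(x, y_better) down and E(x, y_worse) up."""
    violation = margin - (energy(w, x, y_worse) - energy(w, x, y_better))
    if violation > 0:  # ranking constraint violated
        w = w - lr * (phi(x, y_better) - phi(x, y_worse))
    return w

x = np.ones(3)
y_better = np.array([1.0, 0.0, 1.0])  # assumed to have higher reward
y_worse = np.array([0.0, 1.0, 0.0])
w = np.zeros(6)
for _ in range(5):
    w = rank_update(w, x, y_better, y_worse)
```

After a few updates the energy ranking agrees with the reward ranking, i.e. E(x, y_better) < E(x, y_worse).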

SLIDE 20

Task-Loss as Reward Function for Multi-Label Classification

  • The simplest form of indirect supervision is to use the task loss as the reward function.
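For multi-label classification the task loss is typically derived from F1; a small sketch of such a reward, assuming binary label vectors:

```python
def f1_reward(y_pred, y_true):
    """Task-loss reward for multi-label classification: F1 between the
    predicted and gold binary label vectors."""
    tp = sum(1 for p, t in zip(y_pred, y_true) if p and t)
    if tp == 0:
        return 0.0
    precision = tp / sum(y_pred)
    recall = tp / sum(y_true)
    return 2 * precision * recall / (precision + recall)
```

A perfect prediction gets reward 1.0; partially correct label sets get partial credit, which is what gives the search procedure a useful gradient of improvement.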

SLIDE 21

Domain Knowledge as Reward Function for Citation Field Extraction

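The actual rule set is not preserved in this transcript; as an illustration, a domain-knowledge reward for citation field extraction might score a tag sequence with a few hand-written rules. The regexes, weights, and tag names below are invented:

```python
import re

def citation_reward(tokens, tags):
    """Score a tag sequence against hand-written rules; the rules,
    weights, and tag names here are illustrative only."""
    score = 0.0
    for tok, tag in zip(tokens, tags):
        if re.fullmatch(r"(19|20)\d\d", tok):    # year-like token
            score += 1.0 if tag == "date" else -1.0
        if tok.endswith(".") and len(tok) <= 3:  # initials such as "J."
            score += 0.5 if tag == "author" else -0.5
    # Prefer contiguous field spans: penalize each tag change.
    score -= 0.1 * sum(1 for a, b in zip(tags, tags[1:]) if a != b)
    return score

tokens = ["J.", "Smith", "2016", "Deep", "Learning"]
good = ["author", "author", "date", "title", "title"]
bad = ["title"] * 5
```

Such a reward is non-differentiable and noisy, which is exactly why it is used only through the search procedure rather than through gradients.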

SLIDE 25

Energy Model

[Figure: the energy network for citation field extraction. Token embeddings and the tag distribution are concatenated, passed through a convolutional layer with multiple filters and different window sizes, max-pooled and concatenated, then fed to a multi-layer perceptron that outputs the energy.]
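A minimal numpy sketch of this data flow, with made-up sizes and random weights standing in for the trained network: concatenate token embeddings with the tag distribution, convolve with several window sizes, max-pool, concatenate, and score with an MLP.

```python
import numpy as np

# Made-up sizes for illustration only.
rng = np.random.default_rng(0)
EMB, TAGS, FILTERS = 8, 4, 6

def energy(token_emb, tag_dist, params):
    # Concatenate token embeddings with the current tag distribution.
    x = np.concatenate([token_emb, tag_dist], axis=1)   # (seq, EMB+TAGS)
    pooled = []
    for W in params["convs"]:            # one weight matrix per window size
        k = W.shape[0] // (EMB + TAGS)   # recover the window size
        windows = np.stack([x[i:i + k].ravel()
                            for i in range(len(x) - k + 1)])
        pooled.append((windows @ W).max(axis=0))        # max over positions
    h = np.concatenate(pooled)           # concatenate pooled filter outputs
    h = np.maximum(0.0, h @ params["W1"])               # MLP hidden layer
    return float(h @ params["w2"])       # scalar energy

params = {
    "convs": [rng.normal(size=((EMB + TAGS) * k, FILTERS)) for k in (2, 3)],
    "W1": rng.normal(size=(2 * FILTERS, 16)),
    "w2": rng.normal(size=16),
}
seq_len = 5
e = energy(rng.normal(size=(seq_len, EMB)),
           np.full((seq_len, TAGS), 1.0 / TAGS), params)
```

Because the tag distribution enters the energy directly, the same network supports relaxed (simplex-valued) outputs during gradient-descent inference.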

SLIDE 26

Performance on Citation Field Extraction

SLIDE 27

Semi-Supervised Setting

  • Alternately use the output of search and the ground-truth label for training.
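The alternation can be sketched as a simple schedule; the names and signatures below are placeholders for the paper's search procedure and rank-based parameter update:

```python
def train_semi_supervised(labeled, unlabeled, search, update):
    """Alternate between a supervised update (ground-truth label) and a
    search-guided update (search output as a weak label). `search` and
    `update` are placeholders for the real procedures."""
    for (x, y_gold), x_u in zip(labeled, unlabeled):
        update(x, y_gold)         # ground-truth supervision
        update(x_u, search(x_u))  # indirect supervision via search

calls = []
train_semi_supervised(
    labeled=[(1, "a")],
    unlabeled=[2],
    search=lambda x: "s",
    update=lambda x, y: calls.append((x, y)),
)
```

Interleaving the two sources lets the small labeled set anchor the energy function while the reward-driven search exploits the unlabeled data.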

SLIDE 28

Shape Parser

[Figure: an input image I is parsed into a shape program built from primitives such as c(32,32,28), c(32,32,24), t(32,32,20) and the + operator.]
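The shape programs on these slides can be executed with a tiny postfix interpreter. The sketch below renders circles on a 64x64 grid (triangles omitted for brevity) and computes an IoU-style reward against a target image; the canvas size, primitives, and reward choice are illustrative:

```python
import numpy as np

SIZE = 64  # canvas size; illustrative only

def circle(cx, cy, r):
    yy, xx = np.mgrid[:SIZE, :SIZE]
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2

def execute(program):
    """Execute a postfix shape program: primitives push a boolean mask,
    '+' unions and '-' subtracts the top two masks on the stack."""
    stack = []
    for op in program:
        if op == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        elif op == "-":
            b, a = stack.pop(), stack.pop()
            stack.append(a & ~b)
        else:                         # ("c", x, y, r) draws a circle
            stack.append(circle(*op[1:]))
    return stack.pop()

def iou_reward(pred_img, target_img):
    inter = (pred_img & target_img).sum()
    union = (pred_img | target_img).sum()
    return inter / union if union else 1.0

target = execute([("c", 32, 32, 28), ("c", 16, 24, 12), "-"])
```

Rendering a predicted program and comparing it with the input image gives a scalar reward without any program-level labels.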

SLIDE 29

Shape Parser

[Figure: the parser predicts a shape program, e.g. c(32,32,28) c(32,32,24) t(32,32,20) +, for the input image I.]

SLIDE 30

Shape Parser

[Figure: the parser predicts a shape program for the input image I, and a graphics engine renders the program into an output image O.]


SLIDE 32

Shape Parser Energy Model

[Figure: a CNN encodes the input image; the output distribution over program tokens (e.g. circle(16,16,12) triangle(32,48,16) + circle(16,24,12) -) passes through a convolutional layer and, together with the image encoding, feeds a multi-layer perceptron that outputs the energy.]

SLIDE 33

Search Budget vs. Constraints

SLIDE 34

Performance on Shape Parser

SLIDE 35

Conclusion and Future Directions

  • If a reward function exists that evaluates every structured output into a scalar value, we can use unlabeled data for training structured prediction energy networks.
  • Domain knowledge or non-differentiable pipelines can be used to define the reward functions.
  • The main ingredient for learning from the reward function is the search operator.
  • Here we only used simple search operators, but more complex search functions derived from domain knowledge can be used for complicated problems.