SLIDE 1

Tractable Semi-Supervised Learning of Complex Structured Prediction Models

Kai-Wei Chang
University of Illinois at Urbana-Champaign
(Work conducted while interning at Microsoft)

Joint work with Sundararajan S (Microsoft Research) and Sathiya Keerthi S (Microsoft CISL)

September 24, 2013

SLIDE 2

Structured Prediction Problems (examples)

Sequence learning (e.g., input: a sentence; output: POS tags):

  The  President  came  to  the  office
  DT   N          V     P   DT   N

Multi-label classification (e.g., a document belongs to more than one class, such as finance and politics): Object → {cl1, cl2, ..., clK}

In this paper, we consider general structures. Characteristics:
◮ Exponential number of output combinations for a given input (e.g., 2^K in a K-output multi-label classification problem)
◮ Label dependency across the outputs

SLIDE 3

Semi-supervised Learning (SSL)

Manual labeling is expensive
Unlabeled data is freely available (e.g., web pages, emails)
Additional domain knowledge or side information is often available:
◮ Label distribution in the unlabeled data (e.g., 80% positive examples)
◮ Label correlation (e.g., in multi-label classification problems)

For SSL, we need an inference engine that can handle domain constraints
Using unlabeled data together with domain knowledge or side information to constrain the solution space yields improved performance

SLIDE 4

SSL of Complex Structured Prediction Models

Most prior work assumes that the output structure is simple (e.g., Dhillon et al. 12, Chang et al. 12) ⇒ cannot handle problems with complex structure

Contributions: we propose an approximate semi-supervised learning algorithm that
◮ uses piecewise training for estimating the model weights
◮ uses the dual decomposition method for the inference problem
⇒ extending SSL to general structured prediction problems

Our inference engine can be applied to various SSL frameworks

SLIDE 5

Outline

Background
◮ Semi-supervised Learning Problem Setting
◮ Decomposable Scoring Function

Semi-supervised Learning for Structured Predictions
◮ Composite Likelihood - approximate learning
◮ Dual Decomposition Method - approximate inference (with constraints)

Experimental Results

Conclusion



SLIDE 9

SSL Problem

Input space X, output space Y
A small set of labeled examples X_L = {x_i}_{i=1}^n, Y_L = {y_i}_{i=1}^n
A large set of unlabeled examples X_U = {x_j}_{j=1}^m
Domain knowledge or a set of constraints C

Learning problem: learn a scoring function s(x, y; θ) = θ · f(x, y), where θ denotes the model parameters and f(·) is the feature function

Inference problem: y* = argmax_{y ∈ Y} s(x, y; θ)


SLIDE 12

SSL Problem (2)

(Exact) likelihood model (using the scoring function s(·)):

  p(y|x; θ) = exp(s(x, y; θ)) / Σ_{y'} exp(s(x, y'; θ))

Supervised learning: max_θ S(θ) = R(θ) + L(Y_L; X_L, θ)
◮ Regularization: R(θ) = −||θ||² / (2σ²) (σ²: regularization parameter)
◮ Log-likelihood function: L(Y; X, θ) = (1/n) log p(Y|X; θ) = (1/n) Σ_i log p(y_i|x_i; θ)

Semi-supervised learning:

  max_{θ, Y_U} S(θ) + L(Y_U; X_U, θ) s.t. label constraints on Y_U
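Collecting the slide's formulas in one place (with the unlabeled term normalized by 1/m, by analogy with the labeled term), the semi-supervised training problem reads:

```latex
\max_{\theta,\,Y_U}\;
-\frac{\|\theta\|^2}{2\sigma^2}
+\frac{1}{n}\sum_{i=1}^{n}\log p(y_i\mid x_i;\theta)
+\frac{1}{m}\sum_{j=1}^{m}\log p(y_j\mid x_j;\theta)
\quad\text{s.t. label constraints on } Y_U,
\qquad\text{where}\quad
p(y\mid x;\theta)=\frac{\exp(s(x,y;\theta))}{\sum_{y'}\exp(s(x,y';\theta))}.
```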


SLIDE 15

Decomposable Scoring Function

Learning the probabilistic model is intractable (except for simple models):

  p(y|x; θ) = exp(s(x, y; θ)) / Σ_{y'} exp(s(x, y'; θ))

◮ Partition function (sum over an exponential number of label combinations)

The inference involved in SSL is also intractable
◮ Number of output combinations is exponentially large

Decomposable scoring function s(·):

  s(y; x, θ) = Σ_c φ_c(y_πc), where c is a component
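To make the decomposition concrete, here is a minimal runnable sketch of such a score for K binary output labels, with one unary component per label and one pairwise component per label pair; the particular parametrization and the synthetic weights are illustrative assumptions, not the paper's exact model:

```python
import numpy as np

# K binary output labels; the score decomposes into one unary component
# per label and one pairwise component per label pair.

K, D = 4, 10                                 # number of labels, feature dim
rng = np.random.default_rng(0)
theta_u = rng.normal(size=(K, 2, D))         # unary weights theta_p(y_p)
theta_pw = rng.normal(size=(K, K, 2, 2, D))  # pairwise weights theta_pq(y_p, y_q)

def score(x, y):
    """s(y; x, theta) = sum_c phi_c(y_pi_c): unary plus pairwise components."""
    s = sum(theta_u[p, y[p]] @ x for p in range(K))
    s += sum(theta_pw[p, q, y[p], y[q]] @ x
             for p in range(K) for q in range(p + 1, K))
    return s

x = rng.normal(size=D)
print(score(x, np.array([1, 0, 1, 1])))
```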

SLIDE 16

Decomposable Scoring Function (2)

Decomposable scoring function s(·):

  s(y; x, θ) = Σ_c φ_c(y_πc), where c is a component

Can we use a simplified likelihood model to learn the model parameters efficiently?
◮ Composite likelihood approach: compose the likelihood from the likelihoods of individual components

Can we use popular decomposition methods to solve inference problems with domain constraints efficiently?
◮ e.g., dual decomposition

SLIDE 17

Outline

Background
◮ Semi-supervised Learning Problem Setting
◮ Decomposable Scoring Function

Semi-supervised Learning for Structured Predictions
◮ Composite Likelihood - approximate learning
◮ Dual Decomposition Method - approximate inference (with constraints)

Experimental Results

Conclusion


SLIDE 20

Composite Likelihood

Composite (log) likelihood:

  L̃(y; x, θ) = Σ_c L_c(y_πc; x, θ) = Σ_c φ_c(y_πc) − Σ_c log Z_c

where πc ⊂ {1, ..., N} is the index set associated with component c.
Key: the partition function Z_c of each component is easy to compute

Example: let y = (y1, y2, ..., yK), yk ∈ {+, −}, and decompose the likelihood function using K spanning trees (each involving all variables y):
◮ Score of the k-th tree: φ_k(y_πk) = (1/K) Σ_p θ_p(y_p) · x + (1/2) Σ_{q≠k} θ_kq(y_k, y_q) · x

Related to composite marginal maximization (Lindsay 88) and piecewise training methods (Sutton and McCallum 09)
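A minimal runnable sketch of the composite log-likelihood, using one component per label pair for brevity; in the slide's spanning-tree example, log Z_c would instead be computed by belief propagation on each tree. Weights and data below are synthetic:

```python
import itertools
import numpy as np

# Composite log-likelihood L~(y; x, theta) = sum_c [phi_c(y_pi_c) - log Z_c],
# with one component per label pair. Each Z_c is cheap because it sums over
# the component's small label space only.

K, D = 4, 10
rng = np.random.default_rng(0)
theta_pw = rng.normal(size=(K, K, 2, 2, D)) * 0.1

def phi(p, q, yp, yq, x):
    return theta_pw[p, q, yp, yq] @ x

def composite_loglik(y, x):
    total = 0.0
    for p, q in itertools.combinations(range(K), 2):
        # Z_c for a pair component: only 4 labelings to enumerate.
        log_zc = np.logaddexp.reduce(
            [phi(p, q, a, b, x) for a in (0, 1) for b in (0, 1)])
        total += phi(p, q, y[p], y[q], x) - log_zc
    return total

x = rng.normal(size=D)
print(composite_loglik(np.array([1, 0, 1, 1]), x))
```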


SLIDE 24

Model Parameter (θ) Learning

  max_{θ, Y_U} S(θ) + L(Y_U; X_U, θ) s.t. label constraints on Y_U

Alternately update θ and Y_U
The objective function is non-concave ⇒ annealing technique: gradually increase the effect of the unlabeled data

Keeping Y_U fixed, optimize over θ:

  max_θ S(θ) + L(Y_U; X_U, θ)

◮ The composite likelihood makes the log-likelihood and its gradient computation tractable
◮ Standard unconstrained optimizers can be used (e.g., L-BFGS)

Note: the composite likelihood does not simplify our inference problem!
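The loop below is a runnable toy sketch of this alternating scheme on synthetic data with independent binary labels (so the likelihood decomposes exactly); the 80%-positive constraint, the annealing schedule, and all data are illustrative stand-ins rather than the paper's setup:

```python
import numpy as np
from scipy.optimize import minimize

# Toy alternating optimization: (1) fix Y_U, maximize the annealed objective
# over theta with L-BFGS; (2) fix theta, re-infer Y_U under the constraint.

rng = np.random.default_rng(0)
D, n, m = 5, 20, 200
w_true = rng.normal(size=D)
X_L, X_U = rng.normal(size=(n, D)), rng.normal(size=(m, D))
Y_L = (X_L @ w_true > 0).astype(float)

def neg_obj(theta, Y_U, gamma, sigma2=10.0):
    def nll(X, Y):                        # per-label logistic log-loss
        z = X @ theta
        return np.mean(np.log1p(np.exp(-z)) + (1 - Y) * z)
    def grad(X, Y):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        return (X.T @ (p - Y)) / len(Y)
    val = nll(X_L, Y_L) + gamma * nll(X_U, Y_U) + theta @ theta / (2 * sigma2)
    g = grad(X_L, Y_L) + gamma * grad(X_U, Y_U) + theta / sigma2
    return val, g

def constrained_inference(theta, pos_frac=0.8):
    # labels maximizing the score subject to "80% positive examples":
    # the top-scoring fraction gets the positive label
    z = X_U @ theta
    Y = np.zeros(m)
    Y[np.argsort(-z)[: int(pos_frac * m)]] = 1.0
    return Y

theta = np.zeros(D)
Y_U = constrained_inference(theta)
for t in range(10):
    gamma = (t + 1) / 10.0                # annealing: grow unlabeled weight
    theta = minimize(neg_obj, theta, args=(Y_U, gamma), jac=True,
                     method="L-BFGS-B").x
    Y_U = constrained_inference(theta)
print(theta)
```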


SLIDE 28

Solving the Inference Problem

Constrained inference during learning: fix θ, optimize over Y_U:

  max_{Y_U} L(Y_U; X_U, θ) s.t. label constraints on Y_U

Basic form without constraints:

  y* = argmax_{y ∈ Y} s(x, y; θ) = argmax_{y ∈ Y} Σ_c φ_c(y_πc)

where each c is a component (e.g., a tree)

Assumption: efficient inference is feasible for each component c


SLIDE 30

Reformulation

Introduce a new set of auxiliary variables y^(c)_πc for each tree c
(note: variables are shared across the trees c)
Constrain the shared variables to take the same values (i.e., y^(c)_πc = y_πc)

New optimization problem:

  max_{y ∈ Y, y^(c)_πc ∈ Y_πc} Σ_c φ_c(y^(c)_πc) s.t. y^(c)_πc = y_πc for all c

Can be rewritten using a set of binary variables (one per label combination); we use the same notation here for simplicity

SLIDE 31

Inference via Dual Decomposition (e.g., Komodakis et al. 07)

Lagrangian dual problem:

  min_ν Σ_c max_{y^(c)_πc} [ φ_c(y^(c)_πc) + Σ_{j ∈ πc} ν_{c,j} (y^(c)_πc)_j ] s.t. Σ_{c ∈ π⁻¹(j)} ν_{c,j} = 0 ∀j

Iterative master-slave approach:
◮ Given the current dual variables, each slave solves its sub-problem and communicates the solution to the master
◮ The master receives the solutions, updates the dual variables, and communicates them back to the slaves

The inner maximization is not differentiable ⇒ use the projected sub-gradient method (iterative)
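A self-contained toy sketch of this master-slave loop for two components sharing K binary variables; the slaves maximize by enumeration as a stand-in for efficient per-component (e.g., tree) inference, and the 1/(t+1) step size is one common choice:

```python
import itertools
import numpy as np

# Master-slave dual decomposition: each slave maximizes its own score plus
# the dual terms; the master takes a projected sub-gradient step toward
# agreement. Scores and iteration count are illustrative.

K = 3
rng = np.random.default_rng(0)
labelings = list(itertools.product((0, 1), repeat=K))
phi = [dict(zip(labelings, rng.normal(size=2 ** K))) for _ in range(2)]

def slave(c, nu_c):
    # argmax_y  phi_c(y) + sum_j nu_c[j] * y_j
    return max(labelings,
               key=lambda y: phi[c][y] + sum(nu_c[j] * y[j] for j in range(K)))

nu = np.zeros((2, K))               # dual variables; kept summing to 0 over c
for t in range(100):
    ys = np.array([slave(c, nu[c]) for c in range(2)])
    if (ys[0] == ys[1]).all():      # slaves agree -> solution recovered
        break
    step = 1.0 / (t + 1)
    nu -= step * (ys - ys.mean(axis=0))   # projected sub-gradient update
print(ys)
```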

SLIDE 32

Dual Decomposition with Constraints

Constraints bring in new dual variables {η_m}
Alternate optimization of {ν_c} and {η_m}
The special form of the constraint functions helps maintain efficient sub-gradient computations
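To make the extra dual variables concrete, the sketch below adds a single count constraint Σ_j y_j ≤ B to the previous toy problem and dualizes it with one multiplier η ≥ 0; the equal split of the constraint term across slaves and the simultaneous (rather than alternating) updates are illustrative simplifications:

```python
import itertools
import numpy as np

# Same toy problem, now with a count constraint sum_j y_j <= B dualized by
# one multiplier eta >= 0; each slave absorbs an equal share (eta / 2) of
# the constraint term, and eta is projected to stay nonnegative.

K, B = 3, 1
rng = np.random.default_rng(0)
labelings = list(itertools.product((0, 1), repeat=K))
phi = [dict(zip(labelings, rng.normal(size=2 ** K))) for _ in range(2)]

def slave(c, nu_c, eta):
    return max(labelings,
               key=lambda y: phi[c][y] +
               sum((nu_c[j] - eta / 2) * y[j] for j in range(K)))

nu, eta = np.zeros((2, K)), 0.0
for t in range(200):
    ys = np.array([slave(c, nu[c], eta) for c in range(2)])
    step = 1.0 / (t + 1)
    nu -= step * (ys - ys.mean(axis=0))                       # agreement duals
    eta = max(0.0, eta + step * (ys.mean(axis=0).sum() - B))  # constraint dual
print(ys, eta)
```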

SLIDE 33

Outline

Background
◮ Semi-supervised Learning Problem Setting
◮ Decomposable Scoring Function

Semi-supervised Learning for Structured Predictions
◮ Composite Likelihood - approximate learning
◮ Dual Decomposition Method - approximate inference (with constraints)

Experimental Results

Conclusion

SLIDE 34

Datasets (Multi-label Classification)

  Dataset     No. Classes   n       d
  EMOTION     6             593     72
  SCENE       6             2403    294
  YEAST       14            2417    103
  RCV         30*           3000    47236
  SIAM 2007   22            28096   30438

n, d: total number of examples and features
Training split: 49%; validation split: 21%; test split: 30%
Labeled examples (%): 1, 2, 4, 8
10 random partitions

SLIDE 35

Performance Over Iterations

[Figure: performance over iterations on (a) rcv1 and (b) emotions]

SLIDE 36

Comparison of Semi-supervised and Supervised Learning Algorithms

  Degree of labeling:   1%                      4%
  Dataset     Semi-Sup    Sup         Semi-Sup    Sup
  scene       55.2±3.7    51.5±2.7    63.8±1.4    61.8±2.5
  yeast       42.9±2.1    42.5±1.8    45.2±1.2    44.5±1.0
  emotions    51.3±5.9    49.9±4.5    58.8±3.8    58.3±4.3
  rcv1        30.5±2.1    27.8±2.0    36.4±0.9    34.2±1.4
  tmc2007     42.0±1.2    41.3±1.2    44.1±0.6    41.4±1.7


SLIDE 38

Incorporating the Inference Engine into Other Methods

  Degree of labeling:      1%                      4%
  Dataset  Method       Semi-Sup    Sup         Semi-Sup    Sup
  Scene    TSVM+        45.7±4.0    42.1±2.6    60.4±1.5    55.9±1.9
           CoDL+        50.2±9.1    33.8±6.8    60.8±1.9    45.4±4.7
           Our Method   55.2±3.7    51.5±2.7    63.8±1.4    61.8±2.5
  Yeast    TSVM+        40.1±2.1    41.1±2.2    41.2±1.5    43.3±1.2
           CoDL+        40.9±3.3    39.6±3.0    40.9±1.7    43.2±1.9
           Our Method   42.9±2.1    42.5±1.8    45.2±1.2    44.5±1.0

Other experiments in the paper:
◮ exact vs. approximate
◮ different sets of constraints


SLIDE 40

Conclusion

We presented an effective SSL framework for structured prediction problems with complex output structure
Our inference engine can be applied to various SSL frameworks
The approach can be applied to problems other than multi-label classification

Thanks