Learning with Structured Output Spaces
Keerthiram Murugesan
Standard Prediction
- Find a function from input space X to output space Y such that the prediction error is low.
  (typically Y is "simple")
- Example (classification): x = "Microsoft announced today that they acquired Apple for the amount equal to the gross national product of Switzerland. Microsoft officials stated that they first wanted to buy Switzerland, but eventually were turned off by the mountains and the snowy winters…" → y = 1
- Example (classification): x = GATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCAGATACAACCTATCCCCGTATATATATTCTATGGGTATAGTATTAAATCACATTTA → y = −1
- Example (regression): x → y = 7.3
- Example: Conservation Reservoir Corridors (figure)
Structured Prediction
(typically Y is structured)
- Parsing: x = "The dog chased the cat." → y = parse tree: S → NP VP; NP → Det N; VP → V NP; NP → Det N
- Protein structure: x = APPGEAYLQPGEAYLQV → y = structure
- Coreference: x = "[Obama] running in the [presidential election] has mobilized [many young voters]. [His] [position] on [climate change] was well received by [this group]." → y = coreference links among the mentions Obama, presidential election, many young voters, His, position, climate change, this group
Talk Overview
- Structured Prediction (Quick Review)
  – Conventional Approach
- Structured Prediction Cascades
  – Ensemble Cascades
- Ensemble Learning for Structured Prediction
  – Online algorithm
  – Boosting-style algorithm
Structured Prediction
- Recall: x = "The dog chased the cat." → y = parse tree (S → NP VP; NP → Det N; VP → V NP; NP → Det N)
Structured Output Spaces
- Input: x
- Predict: y ∈ Y(x)      (Y(x) is structured!)
- Quality determined by a utility function
- Conventional Approach:
  – Train: learn a model U(x, y) of the utility
  – Test: predict via the scoring function
    h(x) = argmax_{y ∈ Y(x)} U(x, y)      (inference can be challenging)
Example: Sequence Prediction
- Part-of-Speech Tagging
  – Given a sequence of words x
  – Predict a sequence of tags y
- x = "The rain wet the cat" → y = Det N V Det N
- Y(x) contains every possible tag sequence, e.g. Adj V V Det V, V V N Adv Det, …
- h(x) = argmax_{y ∈ Y(x)} U(x, y)
Example: Sequence Prediction
- MAP inference in 1st-order Markov models
  – Chain y1 → y2 → y3 → y4, each state y_t emitting an observation x_t
  – 1st-order dynamics
- Similar models include CRFs, Kalman Filters, Linear Dynamical Systems, etc.
Example: Sequence Prediction
- Utility function (a sum over the maximal cliques):
  U(x, y) = Σ_{t=1}^{n} u(x_t, y_t, y_{t−1})
- Prediction (computed by dynamic programming):
  h(x) = argmax_y Σ_{t=1}^{n} u(x_t, y_t, y_{t−1})
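The dynamic program here is the Viterbi algorithm. A minimal sketch, assuming the clique utility u(x_t, y_t, y_{t−1}) is supplied as a callable `u(t, y, y_prev)` (a hypothetical interface, not from the slides):

```python
# Viterbi: MAP inference for a first-order chain model (minimal sketch).
# u(t, y, y_prev) stands in for the clique utility u(x_t, y_t, y_{t-1});
# at t = 0 it is called with y_prev = None.

def viterbi(n_steps, labels, u):
    """Return argmax_y sum_t u(t, y_t, y_{t-1}) by dynamic programming."""
    # best[t][y] = best score of any prefix ending in label y at step t
    best = [{y: u(0, y, None) for y in labels}]
    back = []
    for t in range(1, n_steps):
        row, ptr = {}, {}
        for y in labels:
            prev = max(labels, key=lambda yp: best[-1][yp] + u(t, y, yp))
            row[y] = best[-1][prev] + u(t, y, prev)
            ptr[y] = prev
        best.append(row)
        back.append(ptr)
    # Backtrace from the best final label
    y = max(labels, key=lambda yl: best[-1][yl])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))
```

Runtime is O(n · |labels|²), which is the "runtime proportional to model complexity" limitation discussed later.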
Scoring Function as a Linear Model
- U/u is parameterized linearly, with some feature representation f:
  u(x, y1, y2; θ) = θᵀ f(x, y1, y2)
  U(x, y; θ) = Σ_t u(x_t, y_t, y_{t−1}; θ)
- Prediction (by dynamic programming):
  h(x; θ) = argmax_y Σ_t θᵀ f(x_t, y_t, y_{t−1})
Generalizing to Other Structures
- From the last slide:
  h(x; θ) = argmax_y Σ_t θᵀ f(x_t, y_t, y_{t−1})
- General formulation:
  Ψ(x, y) = Σ_t f(x_t, y_t, y_{t−1})
  h(x; θ) = argmax_y θᵀ Ψ(x, y)
- Inference for different structures:
  – Viterbi
  – CKY Parsing
  – Sorting
  – Belief Propagation
  – Integer Programming
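The joint feature map Ψ(x, y) = Σ_t f(x_t, y_t, y_{t−1}) and the linear score θᵀΨ(x, y) can be sketched concretely. The feature function below (emission and transition indicators) is an illustrative assumption, not the talk's definition:

```python
# Sketch: joint feature map Psi(x, y) = sum_t f(x_t, y_t, y_{t-1}),
# scored linearly as theta^T Psi(x, y). `f` here is a toy stand-in.
from collections import Counter

def f(x_t, y_t, y_prev):
    """Toy clique features: one emission indicator, one transition indicator."""
    feats = Counter({("emit", x_t, y_t): 1.0})
    if y_prev is not None:
        feats[("trans", y_prev, y_t)] = 1.0
    return feats

def psi(x, y):
    """Psi(x, y): sum the clique features over the whole sequence."""
    total = Counter()
    for t in range(len(x)):
        total.update(f(x[t], y[t], y[t - 1] if t > 0 else None))
    return total

def score(theta, x, y):
    """theta^T Psi(x, y), with theta stored as a sparse dict."""
    return sum(theta.get(k, 0.0) * v for k, v in psi(x, y).items())
```

Because Ψ decomposes over cliques, the argmax over y can still be computed by the structure-specific procedures listed above (Viterbi for chains, CKY for trees, etc.).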
Learning Setting
- Generalization of conventional settings:
  – Hinge loss → Structural SVMs
  – Log-loss → Conditional Random Fields
  – Trained via Gradient Descent, Cutting Plane, etc.
- Requires running inference during training:
  argmin_θ (λ/2) ‖θ‖² + Σ_{(x,y)} ℓ(y, h(x; θ))      (regularization term + loss function)
  where h(x) = argmax_{y ∈ Y(x)} U(x, y)
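For the hinge-loss (structural SVM) case, training can be sketched as stochastic subgradient descent with loss-augmented inference inside each update. This is a minimal sketch, assuming toy emission-only features and a brute-force argmax over a tiny label set; real implementations use structured inference (e.g. Viterbi) instead:

```python
# Sketch: stochastic subgradient descent on the structured hinge loss
#   max_y' [ score(x, y') + loss(y, y') ] - score(x, y)
# using loss-augmented inference. Brute-force argmax over all label
# sequences; `features` is a toy stand-in.
from itertools import product

def features(x, y):
    """Toy joint features: emission indicators only (a hypothetical choice)."""
    feats = {}
    for xt, yt in zip(x, y):
        feats[(xt, yt)] = feats.get((xt, yt), 0.0) + 1.0
    return feats

def score(theta, x, y):
    return sum(theta.get(k, 0.0) * v for k, v in features(x, y).items())

def hamming(y, y_hat):
    return sum(a != b for a, b in zip(y, y_hat))

def sgd_epoch(theta, data, labels, lr=1.0):
    """One pass over the data; each example triggers loss-augmented inference."""
    for x, y in data:
        # Loss-augmented argmax (brute force; inference runs during training)
        y_hat = max(product(labels, repeat=len(x)),
                    key=lambda yp: score(theta, x, yp) + hamming(y, yp))
        if y_hat != tuple(y):
            # Subgradient step: toward gold features, away from y_hat's
            for k, v in features(x, y).items():
                theta[k] = theta.get(k, 0.0) + lr * v
            for k, v in features(x, y_hat).items():
                theta[k] = theta.get(k, 0.0) - lr * v
    return theta
```

Note that inference is called once per example per epoch, which is exactly why expensive structures make training slow.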
Restriction: Increased Complexity
Restriction: Pre-specified Structure
- Learn a (linearly) parameterized U
  – Such that h(x) gives good predictions
- What if the structure of U is "wrong"?
  – Known to not be consistent
  – Infinite training data ≠ converging to the best model
- h(x; θ) = argmax_y U(x, y; θ)
Summary: Structured Prediction
- Conventional Approach:
  – Specify the structure & inference procedure
  – Train parameters on the training set {(x, y)}
- Limitations:
  – Runtime proportional to model complexity
  – Structure mismatch & inconsistency
- h(x; θ) = argmax_y U(x, y; θ)
Structured Prediction Cascades
Classifier Cascades (Face Classifier)
Classifier Cascades
Tradeoffs in Cascaded Learning
- Accuracy: minimize the number of errors incurred by each level
- Efficiency: maximize the number of filtered assignments at each level
Structured Prediction Cascades
Clique Assignments
- Valid assignment for clique (Y_{k−1}, Y_k): e.g. (Adj, N)
- Invalid assignment (that will be eliminated/pruned): e.g. (N, N)
- Remember the sum over cliques? U(x, y) = Σ_{c ∈ C} u(x_t, y_t, y_{t−1})
  (for a sequence model, each clique c is a pair of adjacent tags)
Clique Assignments
- Valid assignment for clique (Y_{k−1}, Y_k): e.g. (Adj, N)
- Invalid assignment (that will be eliminated/pruned): e.g. (N, N)
- How do we know whether an assignment is good or bad?
  1. Score: the max-marginal score (for sequence models)
  2. Threshold: prune assignments whose max-marginal score falls below a threshold t
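One cascade level for a chain model can be sketched as: compute each clique assignment's max-marginal (the best full-sequence score consistent with that assignment, via forward and backward max passes), then prune assignments below the threshold. A minimal sketch with a fixed threshold; the cascades work derives the threshold from the max-marginals themselves, and `u` is a toy stand-in for the clique score:

```python
# Sketch: one structured-prediction-cascade level for a chain model.
# Max-marginal of (y_{t-1}=a, y_t=b) = best score of any full sequence
# passing through that clique assignment.

def max_marginals(n, labels, u):
    """m[t][(a, b)] = best full-sequence score with y_{t-1}=a, y_t=b."""
    # Forward max: best prefix score ending in each label
    fwd = [{y: u(0, y, None) for y in labels}]
    for t in range(1, n):
        fwd.append({y: max(fwd[-1][a] + u(t, y, a) for a in labels)
                    for y in labels})
    # Backward max: best suffix score after each label
    bwd = [dict.fromkeys(labels, 0.0) for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for y in labels:
            bwd[t][y] = max(u(t + 1, b, y) + bwd[t + 1][b] for b in labels)
    return {t: {(a, b): fwd[t - 1][a] + u(t, b, a) + bwd[t][b]
                for a in labels for b in labels}
            for t in range(1, n)}

def prune(m, threshold):
    """Keep only clique assignments whose max-marginal clears the threshold."""
    return {t: {ab for ab, s in row.items() if s >= threshold}
            for t, row in m.items()}
```

The surviving assignments define the (smaller) search space handed to the next, more expensive cascade level.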
Learning θ at each cascade level
Online learning
Structured Prediction Ensembles
Ensemble Learning
- Base hypotheses h1, h2, h3, …, hp each output a prediction (e.g. face, face, face, no face)
- Goal: combine the outputs from multiple models / hypotheses / experts:
  1) Majority voting
  2) Linear combination of hypotheses/experts
  3) Boosting, etc.
Weighted Majority Algorithm
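The classic (binary) Weighted Majority Algorithm keeps one weight per expert, predicts by weighted vote, and multiplicatively down-weights every expert that errs. A minimal sketch; the penalty factor `beta` is the usual free parameter:

```python
# Sketch: the (binary) Weighted Majority Algorithm.
# Predict by weighted vote; multiply the weight of every wrong expert
# by beta once the true label is revealed.

def weighted_majority(expert_preds, truths, beta=0.5):
    """expert_preds[i][t] is expert i's 0/1 prediction at round t."""
    weights = [1.0] * len(expert_preds)
    predictions = []
    for t, truth in enumerate(truths):
        # Weighted vote over the experts' current predictions
        vote_for_one = sum(w for w, e in zip(weights, expert_preds)
                           if e[t] == 1)
        predictions.append(1 if vote_for_one >= sum(weights) / 2 else 0)
        # Penalize every expert that was wrong this round
        weights = [w * beta if e[t] != truth else w
                   for w, e in zip(weights, expert_preds)]
    return predictions, weights
```

The standard guarantee is that the algorithm's mistake count is within a constant factor (depending on beta) of the best single expert's.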
Ensemble learning for Structured Predic8on
h1 h2 hp h3 h1
1
h1
2
h1
l
. . . . . . hp
1
hp
2
hp
l
. . .
h1 V V N Adv Det
Example: Sequence Model
Weighted Majority Algorithm for Structured Prediction Ensembles
Ensemble output from Weighted Majority Algorithm
- Given W1, W2, … WT
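Given weights W1, W2, …, WT, one way to assemble a structured ensemble output is a weighted vote per clique (here: per position of a tag sequence). The position-wise decomposition is an illustrative assumption and not necessarily the talk's exact combination rule:

```python
# Sketch: combine T sequence hypotheses with weights W_1..W_T by a
# weighted vote at each position. Position-wise voting is an assumption
# made for illustration.
from collections import defaultdict

def ensemble_sequence(hypotheses, weights):
    """hypotheses[i] is a tag sequence; weights[i] is its weight W_i."""
    length = len(hypotheses[0])
    output = []
    for t in range(length):
        votes = defaultdict(float)
        for h, w in zip(hypotheses, weights):
            votes[h[t]] += w
        # Keep the highest-weighted tag at this position
        output.append(max(votes, key=votes.get))
    return output
```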
Boosting for Structured Prediction Ensembles
Ensemble output from Boosting
- Given the base learners h1, h2, …, hT:
- Note: the boosting base learners h1, h2, …, hT are different from the ensemble hypotheses h1, h2, …, hp
- THE END