SLIDE 1

CS 6355: Structured Prediction

L2S: Learning to Search


Some slides adapted from Daumé and Ross

SLIDE 2

Inference

  • What is inference?

– An overview of what we have seen before
– Combinatorial optimization
– Different views of inference

  • Graph algorithms

– Dynamic programming, greedy algorithms, search

  • Integer programming
  • Heuristics for inference

– Sampling

  • Learning to search

SLIDE 3

Learning to Search (L2S)

We have seen how to treat inference as graph search:

– Iteratively construct a series of partial structures
– Find the highest-scoring structure in this fashion

Can we learn a model that is designed with such inference in mind?

– L2S formulates structured prediction problems as search problems
– It integrates learning and prediction into a unified framework

SLIDE 4

Overview

  • 1. Preliminaries

– Learning to minimize costs
– Search problems and a generic search algorithm

  • 2. Learning to search: A general formulation
  • 3. LaSO: Learning as Search Optimization
  • 4. SEARN: Search and Learning
  • 5. DAgger: Dataset Aggregation


SLIDE 5-9

Learning to minimize prediction cost

[Figure: inputs x1, x2, x3 with corresponding outputs y1, y2, y3]

Suppose each y can be one of A, B or C, and the true label is (𝑧1 = A, 𝑧2 = B, 𝑧3 = C), i.e. 𝐳 = (𝑧1, 𝑧2, 𝑧3).

The cost vector for this input x can be, for example:

𝑑(𝐴, 𝐴, 𝐴) = 1, 𝑑(𝐴, 𝐴, 𝐵) = 1, 𝑑(𝐴, 𝐴, 𝐶) = 1, …, 𝑑(𝐴, 𝐵, 𝐶) = 0, …, 𝑑(𝐶, 𝐶, 𝐵) = 1, 𝑑(𝐶, 𝐶, 𝐶) = 1

or, using Hamming distance:

𝑑(𝐴, 𝐴, 𝐴) = 2, 𝑑(𝐴, 𝐴, 𝐵) = 2, 𝑑(𝐴, 𝐴, 𝐶) = 1, …, 𝑑(𝐴, 𝐵, 𝐶) = 0, …, 𝑑(𝐶, 𝐶, 𝐵) = 3, 𝑑(𝐶, 𝐶, 𝐶) = 2

The goal: Learn a classifier that has the lowest cost.

What is the dimension of the cost vector c?
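As a check on this example, here is a minimal Python sketch (the helper names are mine, not the slides') that enumerates every complete output and computes its Hamming cost. It also answers the question above: with 3 labels and 3 slots, the cost vector has 3^3 = 27 components.

    from itertools import product

    LABELS = ["A", "B", "C"]
    gold = ("A", "B", "C")          # the true label z = (z1, z2, z3)

    def hamming(y, z):
        # Number of slots where the prediction disagrees with the gold output.
        return sum(yi != zi for yi, zi in zip(y, z))

    # One cost entry per complete output: 3^3 = 27 components.
    cost = {y: hamming(y, gold) for y in product(LABELS, repeat=3)}

    assert len(cost) == 27
    assert cost[("A", "B", "C")] == 0    # the gold output costs nothing
    assert cost[("C", "C", "C")] == 2    # two of the three slots are wrong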

SLIDE 10

A Structured Prediction Problem

Learn a mapping ℎ(𝐱) from inputs 𝐱 to outputs 𝐳

  • Each 𝐳 decomposes into a vector (𝑧1, 𝑧2, … , 𝑧𝑈)
  • Each 𝐱 has a cost vector 𝐝

– 𝐝 has 2^𝑈 components if the 𝑧𝑗 are binary
– Each component specifies the cost of the corresponding output

  • The goal is to minimize the expected cost 𝐿(ℎ) = 𝐸[𝑑_{ℎ(𝐱)}]

SLIDE 11-12

Formalizing search problems

  • Initial state: denoted by s0

– The starting point

  • Actions: Actions(s)

– The set of actions that can be performed at a state

  • Transition model: Result(s, a)

– “Applies” an action a to a state s to produce the next state

  • Goal test: A check for whether the search is complete or not
  • Path cost/score: A score for the path from the start state to any state

A solution is an action sequence that leads from the initial state to a goal state. An optimal solution has the lowest path cost or highest score.

SLIDE 13-14

Example Search Problem: 8-puzzle

Initial State:        Goal State:
7 2 4                 _ 1 2
5 _ 6                 3 4 5
8 3 1                 6 7 8

  • Initial state: s0
  • Actions: Actions(s)
  • Transition model: Result(s, a)
  • Goal test
  • Path cost / score

What are these five components for the 8-puzzle?
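As one possible answer, sketched in Python under my own encoding assumptions (states as 9-tuples read row by row, with 0 standing for the blank):

    INITIAL = (7, 2, 4, 5, 0, 6, 8, 3, 1)   # s0, read off the figure
    GOAL = (0, 1, 2, 3, 4, 5, 6, 7, 8)

    MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}

    def actions(s):
        # Actions(s): the legal directions the blank can move from state s.
        i = s.index(0)
        legal = []
        if i >= 3: legal.append("up")
        if i < 6: legal.append("down")
        if i % 3 != 0: legal.append("left")
        if i % 3 != 2: legal.append("right")
        return legal

    def result(s, a):
        # Result(s, a): swap the blank with the neighboring tile.
        i = s.index(0)
        j = i + MOVES[a]
        t = list(s)
        t[i], t[j] = t[j], t[i]
        return tuple(t)

    def goal_test(s):
        return s == GOAL

    # Path cost: every move costs 1, so the path cost is the path length.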

SLIDE 15

Generic Search Algorithm

How do we solve a search problem? Answer: start at the initial state and navigate the state space until we reach a goal.

SLIDE 16-24

Generic Search Algorithm

Algo Search(problem, initial, enqueue):
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if GoalTest(node) then return node
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next)
    return failure

All the magic happens in the enqueue function (BFS, DFS, beam, A*). Or is there any magic?
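A minimal sketch of that claim in Python: the loop below never changes, and swapping the enqueue function alone switches between BFS, DFS, and best-first search (expand and goal_test are placeholder names, not from the slides).

    def search(initial, goal_test, expand, enqueue):
        # The generic loop: Pop, GoalTest, expand, enqueue, repeat.
        frontier = [initial]
        while frontier:
            node = frontier.pop(0)           # Pop(nodes)
            if goal_test(node):
                return node
            frontier = enqueue(frontier, expand(node))
        return None                          # failure

    def bfs_enqueue(frontier, successors):   # FIFO: new nodes go to the back
        return frontier + successors

    def dfs_enqueue(frontier, successors):   # LIFO: new nodes go to the front
        return successors + frontier

    def best_first_enqueue(score):           # keep the frontier sorted by score
        def enqueue(frontier, successors):
            return sorted(frontier + successors, key=score)
        return enqueue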

SLIDE 25

Learning to search: General setting

The high-level idea:

– Frame structured prediction as a generic search problem
– Learn to enqueue nodes so that “good” states are explored first, and we get to a solution easily

Predicting an output 𝐳 as a sequence of decisions

SLIDE 26-31

Learning to search: General setting

General data structures:

– State: Partial assignments to (𝑧1, 𝑧2, … , 𝑧𝑈)
– Initial state: Empty assignments (−, −, … , −)
– Actions: Pick a component 𝑧𝑗 and assign a label to it
– Transition model: Move from one partial structure to another
– Goal test: Whether all 𝑧 components are assigned

  • A goal state does not need to be optimal

– Path cost/score function: 𝐰ᵀ𝜙(𝐱, node), or more generally a neural network that depends on 𝐱 and the node

  • A node contains the current state and the back pointer to trace back the search path

Predicting an output 𝐳 as a sequence of decisions

SLIDE 32-34

Example

[Figure: inputs x1, x2, x3 with corresponding outputs y1, y2, y3]

Suppose each y can be one of A, B or C.

  • State: Triples (y1, y2, y3), each entry possibly unknown
  • (A, -, -), (-, A, A), (-, -, -), …
  • Transition: Fill in one of the unknowns
  • Start state: (-, -, -)
  • End state: All three y’s are assigned

The search space branches out from the start state:

(-, -, -) → (A, -, -), (B, -, -), (C, -, -) → (A, A, -), …, (C, C, -) → (A, A, A), …, (C, C, C)
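The same state space in a short Python sketch, with None standing in for the unknowns (function names are illustrative):

    LABELS = ["A", "B", "C"]

    def successors(state):
        # All states reachable by filling exactly one unknown slot.
        out = []
        for i, v in enumerate(state):
            if v is None:
                for label in LABELS:
                    nxt = list(state)
                    nxt[i] = label
                    out.append(tuple(nxt))
        return out

    def is_goal(state):
        return all(v is not None for v in state)

    start = (None, None, None)
    print(successors(start))   # (A,-,-), (B,-,-), (C,-,-), (-,A,-), ...
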
SLIDE 35

LaSO: Learning as Search Optimization

1st framework: [Hal Daumé III and Daniel Marcu, ICML 2005]

SLIDE 36-41

The enqueue function in LaSO

  • The goal of learning is to produce an enqueue function that

– places good hypotheses high on the queue
– places bad hypotheses low on the queue

  • LaSO assumes enqueue is based on two components: g + h

– g: path component (g = wᵀφ(x, node))
– h: heuristic component (h is given)

  • This gives A* if h is admissible, heuristic search if h is not admissible, best-first search if h = 0, and beam search if the queue size is limited.

The goal is to learn w. How?
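One possible reading of the enqueue score in Python, treating lower g + h as better as in A* (phi and h are placeholder names, and the beam option is my addition):

    import numpy as np

    def priority(w, phi, h, x, node):
        g = float(np.dot(w, phi(x, node)))   # g: learned path component
        return g + h(x, node)                # h: given heuristic component

    def enqueue(w, phi, h, x, frontier, successors, beam=None):
        # Keep the queue ordered by g + h; truncating gives beam search.
        frontier = sorted(frontier + successors,
                          key=lambda node: priority(w, phi, h, x, node))
        return frontier[:beam] if beam else frontier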

SLIDE 42-46

“y-good” node

Assumption: for any given node s and a gold output y, we can tell whether s can or cannot lead to y.

Definition: The node s is y-good if s can lead to y.

Suppose each y can be one of A, B or C, and the true label is (y1 = A, y2 = B, y3 = C), y = (y1, y2, y3).

For example, starting from (-, -, -): the children (A, -, -) and (-, B, -) are y-good, while (C, -, -) is not, since no completion of it can match y.

SLIDE 47-53

Learning in LaSO

  • Search as if in the prediction phase, but when an error is made:

– update w
– clear the queue and insert all the correct moves

  • Two kinds of errors:

– Error type 1: none of the nodes in the queue is y-good
– Error type 2: the goal state reached is not y-good

SLIDE 54-65

Learning Algorithm in LaSO

Algo Learn(problem, initial, enqueue, w, x, y):
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if error:
            step 1: update w
            step 2: refresh queue
        else if GoalTest(node) then return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)

SLIDE 66-67

What should learning do?

[Figure: a snapshot of the search frontier; several nodes are y-good, but the current node is not]

Let’s say we found an error (of either type) at the current node. Then we should have chosen node 4 instead of the current node: node 4 is the y-good sibling of the current node.

SLIDE 68-72

Learning Algorithm in LaSO

Algo Learn(problem, initial, enqueue, w, x, y):
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if none of (node + nodes) is y-good,
           or GoalTest(node) and node is not y-good then
            sibs = siblings(node, y)
            w = update(w, x, sibs, {node, nodes})
            nodes = MakeQueue(sibs)
        else if GoalTest(node) then return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)

SLIDE 73

Parameter Updates

We need to specify w = update(w, x, sibs, nodes). A simple perceptron-style update rule: w = w + Δ, where

Δ = (1/|sibs|) Σ_{n ∈ sibs} Φ(x, n) − (1/|nodes|) Σ_{n ∈ nodes} Φ(x, n)

It comes with the usual perceptron-style mistake bound and generalization bound. (See references)
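A sketch of this update in Python, assuming phi(x, n) returns a NumPy feature vector; it moves w toward the average features of the y-good siblings and away from the average features of the popped node together with the remaining queue:

    def laso_update(w, x, sibs, nodes, phi):
        # Delta = mean features of y-good siblings
        #         minus mean features of the nodes we wrongly preferred.
        good = sum(phi(x, n) for n in sibs) / len(sibs)
        bad = sum(phi(x, n) for n in nodes) / len(nodes)
        return w + (good - bad)

    # Inside the learning loop, on an error:
    #   w = laso_update(w, x, siblings(node, y), [node] + queue, phi)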

SLIDE 74

SEARN: Search and Learning

2nd framework: [Hal Daumé III, John Langford, Daniel Marcu, 2007]

SLIDE 75-77

Policy

  • A policy is a mapping from a state to an action
  • For a given node, the policy tells us what action should be taken
  • A policy gives a search path in the search space

– Different policies mean different search paths
– A policy can be thought of as the “driver” in the search space

  • A policy may be deterministic, or may contain some randomness (more on this later)

SLIDE 78-84

Reference Policy and Learned Policy

  • We assume we already have a good reference policy π_ref for the training data (𝐱, 𝐜)

– i.e. examples associated with costs for outputs

  • Goal: Learn a good policy for test data, where we do not have access to the cost vector 𝐜 (imitation learning)

If we are using Hamming distance for the cost vector 𝐜, the reference policy is trivial to compute. Why? Just make the right decision at every step.

Suppose the gold output is (A, B, C, A) and we are at the state (A, C, -, -). The reference policy tells us the next action is to assign C to the third slot.
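A sketch of this reference policy in Python (the slot/label conventions are mine):

    def reference_policy(state, gold):
        # Return (slot, label): fill the next unknown slot with its gold label.
        for i, v in enumerate(state):
            if v is None:
                return i, gold[i]
        return None   # goal state: nothing left to do

    gold = ("A", "B", "C", "A")
    state = ("A", "C", None, None)
    print(reference_policy(state, gold))   # (2, 'C'): assign C to slot three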

SLIDE 85-89

Cost-Sensitive Classification

Suppose we want to learn a classifier ℎ that maps examples to one of 𝐿 labels.

Standard multiclass classification:

  • Training data: Pairs of examples and labels

– (𝑥, 𝑦) ∈ 𝑋 × [𝐿]

  • Learning goal: Find a classifier that has low error

– min_ℎ Pr[ℎ(𝑥) ≠ 𝑦]

Cost-sensitive classification:

  • Training data: Each example is paired with a cost vector that lists the cost of predicting each label

– (𝑥, 𝐜) ∈ 𝑋 × [0, ∞)^𝐿

  • Learning goal: Find a classifier that has low cost

– min_ℎ 𝐸_{(𝑥,𝐜)}[𝑐_{ℎ(𝑥)}]

Exercise: How would you design a cost-sensitive learner?

SEARN uses a cost-sensitive learner to learn a policy.
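One standard reduction, sketched here as an assumption rather than SEARN's prescribed recipe: fit one regressor per label to predict that label's cost, then predict the label with the smallest predicted cost.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    class CostSensitiveClassifier:
        def fit(self, X, C):
            # X: (n, d) array of examples; C: (n, L) array of cost vectors.
            self.regs = [LinearRegression().fit(X, C[:, k])
                         for k in range(C.shape[1])]
            return self

        def predict(self, X):
            # Predict each label's cost, then take the cheapest label.
            costs = np.column_stack([r.predict(X) for r in self.regs])
            return costs.argmin(axis=1)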

SLIDE 90-93

SEARN at test time

We have already learned a policy. We can use it to construct a sequence of decisions y and obtain the final structured output:

  • 1. Use the learned policy on the initial state (-, …, -) to compute y1
  • 2. Use the learned policy on the state (y1, -, …, -) to compute y2
  • 3. Keep going until we get y = (y1, …, yn)
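At test time SEARN is just a greedy decoding loop; policy(x, state) is an assumed interface that returns the next (slot, label) decision:

    def decode(x, policy, n):
        state = [None] * n
        while any(v is None for v in state):   # until the goal test passes
            slot, label = policy(x, tuple(state))
            state[slot] = label                # one decision per step
        return tuple(state)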

SLIDE 94-97

SEARN at training time

  • The core idea in training is to notice that each decision step is really a cost-sensitive classification
  • Construct cost-sensitive classification examples (s, c) with state s and cost vector c
  • Learn a cost-sensitive classifier (this is nothing but a policy)

SLIDE 98-104

Roll-in, Roll-out

[Figure: a roll-in path through the search space, a one-step deviation at one state, and roll-out paths from each deviation to a goal state]

  • Roll-in: At each state, use some policy to move to a new state
  • What is the cost of deviating from the policy at this step? (Assume there are three possible actions at this state)
  • Roll-out: Once we make the one-step deviation, we can use some policy to get to a goal state again

SLIDE 105-117

SEARN at training time (continued)

[Figure: roll-in with the current policy, one-step deviations at a state with three actions, roll-outs to goal states with losses 0.2, 0, and 0.8]

  • Roll-in with the current policy h to generate a search path
  • For each one-step deviation, roll out with the current policy h and measure the loss
  • Construct a cost-sensitive example: (state, c = (0, 0.2, 0.8))
  • Do this for every step along the path
  • And for every structured training example
  • Collect all these cost-sensitive examples to train an improved policy h′
  • Interpolate: h ← βh′ + (1 − β)h
  • Repeat

The cost of action a at state s:

  • If h is deterministic: ℓ_h(c, s, a) = c_{y(s,a,h)} − min_{a′} c_{y(s,a′,h)}
  • If h contains randomness: ℓ_h(c, s, a) = E_{y∼(s,a,h)}[c_y] − min_{a′} E_{y∼(s,a′,h)}[c_y]

The loss defined this way is called regret.
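A sketch of how one such cost vector can be computed with roll-outs (every helper here is a placeholder for the problem at hand); subtracting the minimum puts the costs in the regret form above:

    def deviation_costs(x, state, h, actions, result, rollout, loss):
        # Cost of each one-step deviation at `state`, measured by a full
        # roll-out with the current policy h from the deviated state.
        raw = []
        for a in actions(state):
            end = rollout(x, result(state, a), h)   # follow h to a goal state
            raw.append(loss(end))                   # e.g. Hamming to gold z
        best = min(raw)
        return [c - best for c in raw]              # regret: best action costs 0

    # Each (state, deviation_costs(...)) pair is one cost-sensitive example.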

SLIDE 118

DAgger: Dataset Aggregation

3rd framework: [Stéphane Ross, Geoffrey J. Gordon, J. Andrew Bagnell, 2011]

SLIDE 119-128

Dagger Algorithm (Simplified Version)

[Figure: the sequence of policies π_ref, π1, π2, …]

  • Initialize dataset D = ∅
  • Collect trajectories with the reference policy π_ref (the expert)
  • Dataset D1 = {(s, π_ref(s))}
  • Aggregate datasets: D = D ∪ D1
  • Train π1 on D
  • Collect new trajectories with π1, labeling each visited state with the expert’s action
  • New dataset D2 = {(s, π_ref(s))}
  • Aggregate datasets: D = D ∪ D2
  • Train π2 on D
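The whole loop fits in a few lines of Python; rollout_states and train are assumed helpers (collect the states visited when driving with a policy; fit a multiclass classifier on (state, action) pairs):

    def dagger(examples, pi_ref, train, rollout_states, rounds=5):
        D = []                  # the aggregated dataset
        pi = pi_ref             # the first round rolls in with the expert
        for _ in range(rounds):
            # Visit states with the current policy, label them with the expert.
            D_i = [(s, pi_ref(s)) for x in examples
                                  for s in rollout_states(x, pi)]
            D = D + D_i         # aggregate datasets: D = D ∪ D_i
            pi = train(D)       # train the next policy on everything so far
        return pi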

SLIDE 129-136

DAgger vs. SEARN

Similarities:

  • DAgger also treats a structured prediction problem as a sequence of multiclass classification problems
  • Roll-in with the current policy
  • Iteratively improve the current policy by learning better multiclass classifiers

Differences:

  • There is no roll-out stage
  • At each step we just have a regular multiclass example (not a cost-sensitive example), given by the expert
  • Datasets are aggregated across iterations

SLIDE 137

Other related algorithms

  • Incremental Perceptron (2002)

– Based on the structured Perceptron
– Instead of finishing inference during training, stop and update the parameters as soon as inference makes its first mistake

  • AggreVaTe: Aggregate Values to Imitate (2014)

– Combines ideas from DAgger and SEARN
– Cost-sensitive learning + dataset aggregation

  • LOLS: Locally Optimal Learning to Search (2015)

– What if the reference policy is not good?
– Changes roll-outs to account for this

SLIDE 138

Learning to search: Summary

  • Inference in structured prediction can be framed as search

– Can we learn a model that explicitly helps inference navigate the search space?

  • Several algorithms:

– LaSO, SEARN, DAgger, etc.
– Often easy to implement with simpler building blocks

  • Can be the basis of a general-purpose structured prediction framework