L2S: Learning to Search
CS 6355: Structured Prediction

Some slides adapted from Daumé and Ross. (Earlier in the deck: What is inference? An overview of what we have seen before; combinatorial optimization; different views of inference.)


1. Learning to search: General setting
Predicting an output y as a sequence of decisions. General data structures:
– State: partial assignments to (y_1, y_2, …, y_T)
– Initial state: the empty assignment (−, −, …, −)
– Actions: pick an unassigned component y_i and assign a label to it
– Transition model: move from one partial structure to another
– Goal test: whether all components of y are assigned
  • A goal state does not need to be optimal
– Path cost/score function: w^T φ(x, node), or more generally a neural network that depends on x and the node
  • A node contains the current state and a back pointer to trace back the search path
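To make the setting concrete, here is a minimal Python sketch of these data structures, using the three-label A/B/C example that follows; the names (Node, successors, LABELS) are my own, not from the lecture.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

LABELS = ["A", "B", "C"]  # assumed label set, matching the example below

@dataclass
class Node:
    state: Tuple[Optional[str], ...]  # partial assignment, e.g. ("A", None, None)
    parent: Optional["Node"] = None   # back pointer to trace the search path

def initial_node(length: int) -> Node:
    # Initial state: the empty assignment (-, -, ..., -).
    return Node(state=(None,) * length)

def is_goal(node: Node) -> bool:
    # Goal test: all components are assigned (not necessarily optimal).
    return all(z is not None for z in node.state)

def successors(node: Node):
    # Actions: pick the first unassigned component and try each label.
    i = node.state.index(None)
    for label in LABELS:
        yield Node(node.state[:i] + (label,) + node.state[i + 1:], parent=node)
```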

2. Example
Suppose each y_i can be one of A, B, or C. [Figure: a graphical model over y_1, y_2, y_3 with inputs x_1, x_2, x_3.]
• State: triples (y_1, y_2, y_3), each component possibly unknown, e.g. (A, −, −), (−, A, A), (−, −, −), …
• Start state: (−, −, −)
• Transition: fill in one of the unknowns, e.g. (−, −, −) → (A, −, −), (B, −, −), or (C, −, −); then (A, −, −) → (A, A, −), …
• End state: all three y's are assigned, e.g. (A, A, A), …, (C, C, C)

3. 1st Framework — LaSO: Learning as Search Optimization
[Hal Daumé III and Daniel Marcu, ICML 2005]

4. The enqueue function in LaSO
• The goal of learning is to produce an enqueue function that
  – places good hypotheses high on the queue
  – places bad hypotheses low on the queue
• LaSO assumes enqueue scores nodes with two components, g + h:
  – g: path component, g = w^T φ(x, node)
  – h: heuristic component (h is given)
    • This gives A* if h is admissible, heuristic search if h is not admissible, best-first search if h = 0, and beam search if the queue size is limited.
• The goal is to learn w. How?
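A minimal sketch of such an enqueue, assuming the Node and feature-map setup above (phi(x, node) returns a NumPy vector) and scoring so that higher g + h is better; capping the queue makes it beam search. This is my own version, not the authors' code.

```python
import numpy as np

def enqueue(queue, new_nodes, x, w, phi, h, beam_size=None):
    # Score each new node by g + h: learned path component plus heuristic.
    for node in new_nodes:
        g = float(w @ phi(x, node))          # g = w^T phi(x, node)
        queue.append((g + h(node), node))
    queue.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    if beam_size is not None:
        del queue[beam_size:]                # beam search: keep the top k
    return queue
```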

5. "y-good" nodes
Assumption: for any given node s and a gold output y, we can tell whether s can or cannot lead to y.
Definition: the node s is y-good if s can lead to y.
Example: y = (y_1, y_2, y_3), where each y_i can be one of A, B, or C, and the true label is (y_1 = A, y_2 = B, y_3 = C). In the search tree, (−, −, −), (A, −, −), and (−, B, −) are y-good, while (C, −, −), (A, A, −), (C, C, −), … are not.
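For this partial-assignment search space the y-good test is easy to implement: a node can still lead to y exactly when every component it has assigned agrees with y. A sketch under that assumption (actions never overwrite earlier assignments):

```python
def is_y_good(state, gold):
    # A partial assignment can still reach gold iff nothing assigned so far
    # disagrees with gold.
    return all(z is None or z == g for z, g in zip(state, gold))

# With gold = ("A", "B", "C"):
#   is_y_good(("A", None, None), gold)  -> True
#   is_y_good(("A", "A", None), gold)   -> False
```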

6. Learning in LaSO
• Search as if in the prediction phase, but when an error is made:
  – update w
  – clear the queue and insert all the correct moves
• Two kinds of errors:
  – Error type 1: no node in the queue is y-good
  – Error type 2: the goal state is not y-good

7. Learning Algorithm in LaSO (skeleton)
Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if error:
            step 1: update w
            step 2: refresh the queue
        else if GoalTest(node) then return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)


8. What should learning do?
[Figure: a search tree of y-good nodes; the current node (node 5) is not y-good, while its sibling node 4 is y-good.]
Say we found an error (of either type) at the current node. Then we should have chosen node 4 instead of the current node: node 4 is the y-good sibling of the current node.

9. Learning Algorithm in LaSO (full)
Algo Learn(problem, initial, enqueue, w, x, y)
    nodes = MakeQueue(MakeNode(problem, initial))
    while nodes is not empty:
        node = Pop(nodes)
        if none of (node + nodes) is y-good,
           or (GoalTest(node) and node is not y-good) then
            sibs = siblings(node, y)
            w = update(w, x, sibs, {node, nodes})
            nodes = MakeQueue(sibs)
        else if GoalTest(node) then return w
        next = Result(node, Actions(node))
        nodes = enqueue(problem, nodes, next, w)

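Putting the pieces together, a hedged Python sketch of this loop for a single training example (x, y), built on the Node/successors/is_goal/is_y_good helpers above with h = 0 and a beam; perceptron_update is sketched after the Parameter Updates slide below, and the queue details are my own simplifications, not the authors' code.

```python
def y_good_siblings(node, y):
    # The y-good children of node's parent (for the root, of node itself).
    parent = node.parent if node.parent is not None else node
    return [n for n in successors(parent) if is_y_good(n.state, y)]

def laso_learn_one(x, y, w, phi, length, beam_size=5):
    nodes = [initial_node(length)]
    while nodes:
        nodes.sort(key=lambda n: float(w @ phi(x, n)), reverse=True)
        node, rest = nodes[0], nodes[1:]
        error = (not any(is_y_good(n.state, y) for n in nodes)
                 or (is_goal(node) and not is_y_good(node.state, y)))
        if error:
            sibs = y_good_siblings(node, y)            # the correct moves
            w = perceptron_update(w, x, sibs, nodes, phi)
            nodes = sibs                               # refresh the queue
        elif is_goal(node):
            return w
        else:
            nodes = (list(successors(node)) + rest)[:beam_size]
    return w
```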

10. Parameter Updates
We need to specify w = update(w, x, sibs, nodes).
A simple perceptron-style update rule:

    w ← w + Δ,  where  Δ = (1/|sibs|) Σ_{n ∈ sibs} Φ(x, n) − (1/|nodes|) Σ_{n ∈ nodes} Φ(x, n)

It comes with the usual perceptron-style mistake bound and generalization bound. (See references.)
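A direct transcription of this update as a sketch (phi(x, node) is assumed to return a NumPy array):

```python
import numpy as np

def perceptron_update(w, x, sibs, nodes, phi):
    # Move w toward the average features of the good siblings and away
    # from the average features of the queue that caused the error.
    if not sibs or not nodes:
        return w
    good = np.mean([phi(x, n) for n in sibs], axis=0)
    bad = np.mean([phi(x, n) for n in nodes], axis=0)
    return w + (good - bad)
```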

11. 2nd Framework — SEARN: Search and Learning
[Hal Daumé III, John Langford, Daniel Marcu, 2007]

12. Policy
• A policy is a mapping from a state to an action: for a given node, the policy tells what action should be taken.
• A policy gives a search path in the search space.
  – Different policies mean different search paths.
  – A policy can be thought of as the "driver" in the search space.
• A policy may be deterministic, or may contain some randomness. (More on this later.)
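As a sketch, a policy here is just a function from a state to an action, and following it to completion yields a full output (the types and names are mine):

```python
import random
from typing import Callable, Optional, Tuple

State = Tuple[Optional[str], ...]
Action = Tuple[int, str]                 # (slot to fill, label to assign)
Policy = Callable[[State], Action]

def run_policy(policy: Policy, length: int) -> State:
    # Follow the policy from the empty state until every slot is filled.
    state: State = (None,) * length
    while any(z is None for z in state):
        i, label = policy(state)
        state = state[:i] + (label,) + state[i + 1:]
    return state

def random_policy(state: State) -> Action:
    # A policy with randomness: fill the next slot with a uniform label.
    return (state.index(None), random.choice(["A", "B", "C"]))
```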

13. Reference Policy and Learned Policy
• We assume we already have a good reference policy π^ref for the training data (x, c), i.e., examples associated with costs for outputs.
• Goal: learn a good policy π for test data, where we do not have access to the cost vector c. (Imitation learning.)
• For example, if we use Hamming distance for the cost vector c, the reference policy is trivial to compute. Why? Just make the right decision at every step.
• Suppose the gold output is (A, B, C, A) and we are at the state (A, C, −, −). The reference policy tells us the next action is to assign C to the third slot.
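Under Hamming cost this reference policy really is trivial, as the slide says; a sketch using the State/Action types above:

```python
def make_reference_policy(gold):
    def ref_policy(state):
        i = state.index(None)       # next unfilled slot
        return (i, gold[i])         # make the right decision at every step
    return ref_policy

# With gold (A, B, C, A), at state (A, C, -, -) the next action assigns C
# to the third slot:
# make_reference_policy(("A","B","C","A"))(("A","C",None,None)) == (2, "C")
```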

14. Cost-Sensitive Classification
Suppose we want to learn a classifier h that maps examples to one of K labels.

Standard multiclass classification:
• Training data: pairs of examples and labels, (x, y) ∈ X × [K]
• Learning goal: find a classifier with low error, min_h Pr[h(x) ≠ y]

Cost-sensitive classification:
• Training data: an example paired with a cost vector that lists the cost of predicting each label, (x, c) ∈ X × [0, ∞)^K
• Learning goal: find a classifier with low expected cost, min_h E_{(x,c)}[ c_{h(x)} ]

Exercise: how would you design a cost-sensitive learner?
SEARN uses a cost-sensitive learner to learn a policy.
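One simple answer to the exercise (my own choice, not from the slides): reduce cost-sensitive classification to regression by fitting one linear cost predictor per label, then predict the cheapest label.

```python
import numpy as np

class CostSensitiveClassifier:
    def fit(self, X, C):
        # X: (m, d) example features; C: (m, K) cost of each label.
        self.W, *_ = np.linalg.lstsq(X, C, rcond=None)   # (d, K) weights
        return self

    def predict(self, X):
        # Predict the label with the lowest estimated cost.
        return np.argmin(X @ self.W, axis=1)
```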

15. SEARN at test time
We have already learned a policy; we use it to construct a sequence of decisions y and obtain the final structured output:
1. Use the learned policy on the initial state (−, …, −) to compute y_1.
2. Use the learned policy on the state (y_1, −, …, −) to compute y_2.
3. Keep going until we get y = (y_1, …, y_n).
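This is exactly the run_policy loop sketched earlier, now driven by the learned policy: y_hat = run_policy(learned_policy, length=n), where learned_policy is a hypothetical name for the trained cost-sensitive classifier applied greedily.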

16. SEARN at training time
• The core idea in training is to notice that at each decision step we are actually doing a cost-sensitive classification.
• Construct cost-sensitive classification examples (s, c) with state s and cost vector c (see the sketch below).
• Learn a cost-sensitive classifier. (This is nothing but a policy.)
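A sketch of this construction, assuming the left-to-right filling and LABELS from the earlier sketches; action_cost(state, action) is assumed given, e.g. the roll-out estimate sketched after the next slide.

```python
def make_examples(roll_in_policy, length, action_cost):
    # Roll in with a policy, and at each visited state record a
    # cost-sensitive example: the state plus the cost of every action.
    examples = []
    state = (None,) * length
    while any(z is None for z in state):
        i = state.index(None)
        costs = {lab: action_cost(state, (i, lab)) for lab in LABELS}
        examples.append((state, costs))
        _, lab = roll_in_policy(state)               # roll-in step
        state = state[:i] + (lab,) + state[i + 1:]
    return examples
```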

17. Roll-in, Roll-out
• Roll-in: at each state, use some policy to move to a new state.
• Roll-out: what is the cost of deviating from the policy at this step? (See the sketch below.)
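A sketch of the roll-out cost estimate for a one-step deviation: take the deviating action, then follow the roll-out policy to completion, and score the final output with the loss (Hamming distance here, following the slides).

```python
from functools import partial

def hamming(y_hat, gold):
    return sum(a != b for a, b in zip(y_hat, gold))

def rollout_cost(state, action, roll_out_policy, gold):
    i, lab = action
    state = state[:i] + (lab,) + state[i + 1:]       # deviate at this step
    while any(z is None for z in state):             # then follow the policy
        j, lab2 = roll_out_policy(state)
        state = state[:j] + (lab2,) + state[j + 1:]
    return hamming(state, gold)

# Usable as the action_cost assumed above:
# action_cost = partial(rollout_cost, roll_out_policy=ref_policy, gold=gold)
```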
