  1. A Series of Lectures on Approximate Dynamic Programming
     Dimitri P. Bertsekas
     Laboratory for Information and Decision Systems, Massachusetts Institute of Technology
     Lucca, Italy, June 2017

  2. Our Aim
     Discuss optimization by Dynamic Programming (DP) and the use of approximations.
     Purpose: computational tractability in a broad variety of practical contexts.

  3. The Scope of these Lectures
     After an introduction to exact DP, we will focus on approximate DP for optimal control under stochastic uncertainty.
     The subject is broad, with a rich variety of theory/mathematics, algorithms, and applications.
     Applications come from a vast array of areas: control/robotics/planning, operations research, economics, artificial intelligence, and beyond.
     We will concentrate on control of discrete-time systems with a finite number of stages (a finite horizon), and the expected value criterion.
     We will focus mostly on algorithms, and less on theory and modeling.
     We will not cover:
       - Infinite horizon problems
       - Imperfect state information and minimax/game problems
       - Simulation-based methods: reinforcement learning, neuro-dynamic programming (a series of video lectures on these can be found at the author's web site)
     Reference: the lectures follow Chapters 1 and 6 of the author's book "Dynamic Programming and Optimal Control," Vol. I, Athena Scientific, 2017.

  4. Lectures Plan
     Exact DP:
       - The basic problem formulation
       - Some examples
       - The DP algorithm for finite horizon problems with perfect state information
       - Computational limitations; motivation for approximate DP
     Approximate DP - I:
       - Approximation in value space; limited lookahead
       - Parametric cost approximation, including neural networks
       - Q-factor approximation, model-free approximate DP
       - Problem approximation
     Approximate DP - II:
       - Simulation-based on-line approximation; rollout and Monte Carlo tree search
       - Applications in backgammon and AlphaGo
       - Approximation in policy space

  5. First Lecture: EXACT DYNAMIC PROGRAMMING

  6. Outline
     1. Basic Problem
     2. Some Examples
     3. The DP Algorithm
     4. Approximation Ideas

  7. Basic Problem Structure for DP
     Discrete-time system:
       x_{k+1} = f_k(x_k, u_k, w_k),  k = 0, 1, ..., N−1
     x_k: state; summarizes past information that is relevant for future optimization at time k
     u_k: control; decision to be selected at time k from a given set U_k(x_k)
     w_k: disturbance; random parameter with distribution P(w_k | x_k, u_k)
     For deterministic problems there is no w_k.
     Cost function that is additive over time:
       E[ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, u_k, w_k) ]
     Perfect state information: the control u_k is applied with (exact) knowledge of the state x_k.
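To fix notation, here is a minimal Python sketch of these ingredients for a hypothetical scalar example; the system f, stage cost g, terminal cost, disturbance distribution, and horizon N below are illustrative assumptions, not taken from the lecture.

import random

N = 5                                   # horizon (illustrative)

def f(k, x, u, w):                      # system x_{k+1} = f_k(x_k, u_k, w_k); a toy linear example
    return x + u + w

def g(k, x, u, w):                      # stage cost g_k(x_k, u_k, w_k)
    return x**2 + u**2

def gN(x):                              # terminal cost g_N(x_N)
    return x**2

def sample_w(k, x, u):                  # disturbance w_k ~ P(w_k | x_k, u_k); here independent of (x_k, u_k)
    return random.gauss(0.0, 0.1)

# Cost accumulated along one simulated trajectory from x_0, for an arbitrary choice of controls
x, total = 1.0, 0.0
for k in range(N):
    u = -0.5 * x                        # any feasible u_k in U_k(x_k); this rule is just for the demo
    w = sample_w(k, x, u)
    total += g(k, x, u, w)
    x = f(k, x, u, w)
total += gN(x)
print("cost of one sampled trajectory from x_0 = 1.0:", total)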

  8. Optimization over Feedback Policies
     [Figure: block diagram of the closed loop; the controller applies u_k = µ_k(x_k), and the system x_{k+1} = f_k(x_k, u_k, w_k) is driven by the control u_k and the disturbance w_k]
     Feedback policies: rules that specify the control to apply at each possible state x_k that can occur.
     Major distinction: we minimize over sequences of functions of the state,
       π = {µ_0, µ_1, ..., µ_{N−1}},  with u_k = µ_k(x_k) ∈ U_k(x_k),
     not over sequences of controls {u_0, u_1, ..., u_{N−1}}.
     Cost of a policy π = {µ_0, µ_1, ..., µ_{N−1}} starting at initial state x_0:
       J_π(x_0) = E[ g_N(x_N) + Σ_{k=0}^{N−1} g_k(x_k, µ_k(x_k), w_k) ]
     Optimal cost function: J*(x_0) = min_π J_π(x_0)
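To emphasize that a policy is a rule (a function of the current state) rather than a fixed control sequence, the cost J_π(x_0) of a given policy can be estimated by averaging simulated trajectories. A minimal standalone Python sketch; the toy model repeats the one sketched under slide 7 so this snippet runs on its own, and the linear policy µ is an illustrative assumption.

import random

N = 5                                   # horizon (illustrative)

def f(k, x, u, w):  return x + u + w    # toy system (an assumption, as in the earlier sketch)
def g(k, x, u, w):  return x**2 + u**2  # toy stage cost
def gN(x):          return x**2         # toy terminal cost
def mu(k, x):       return -0.5 * x     # a feedback policy µ_k(x_k); illustrative linear rule

def estimate_J(x0, policy, num_samples=10_000):
    """Monte Carlo estimate of J_π(x0) = E[ g_N(x_N) + Σ_k g_k(x_k, µ_k(x_k), w_k) ]."""
    total = 0.0
    for _ in range(num_samples):
        x, cost = x0, 0.0
        for k in range(N):
            u = policy(k, x)
            w = random.gauss(0.0, 0.1)  # w_k ~ P(w_k | x_k, u_k); here independent of (x_k, u_k)
            cost += g(k, x, u, w)
            x = f(k, x, u, w)
        total += cost + gN(x)
    return total / num_samples

print("estimated J_π(1.0) for the policy µ:", estimate_J(1.0, mu))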

  9. Scope of DP
     Any optimization problem (deterministic, stochastic, minimax, etc.) involving a sequence of decisions fits the framework.
     A continuous-state example: linear-quadratic optimal control
     Linear discrete-time system:
       x_{k+1} = A x_k + B u_k + w_k,  k = 0, ..., N−1
     x_k ∈ ℜ^n: the state at time k
     u_k ∈ ℜ^m: the control at time k (no constraints in the classical version)
     w_k ∈ ℜ^n: the disturbance at time k (w_0, ..., w_{N−1} are independent random variables with given distribution)
     Quadratic cost function:
       E[ x_N′ Q x_N + Σ_{k=0}^{N−1} (x_k′ Q x_k + u_k′ R u_k) ]
     where Q and R are positive definite symmetric matrices.
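The slide only states the problem; as a complement, the classical DP solution of this linear-quadratic problem is a backward Riccati recursion that produces linear feedback policies u_k = L_k x_k. A minimal numpy sketch; the particular matrices A, B, Q, R and the horizon below are illustrative assumptions, not values from the lecture.

import numpy as np

# Illustrative problem data (assumptions)
N = 10
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)          # state cost weight (also used as the terminal weight, as on the slide)
R = np.array([[0.5]])  # control cost weight

# Backward Riccati recursion: K_N = Q, and for k = N−1, ..., 0
#   L_k = −(R + B′ K_{k+1} B)^{−1} B′ K_{k+1} A          (optimal gain, u_k = L_k x_k)
#   K_k = A′ (K_{k+1} − K_{k+1} B (R + B′ K_{k+1} B)^{−1} B′ K_{k+1}) A + Q
K = Q
gains = []
for k in reversed(range(N)):
    M = R + B.T @ K @ B
    L = -np.linalg.solve(M, B.T @ K @ A)
    K = A.T @ (K - K @ B @ np.linalg.solve(M, B.T @ K)) @ A + Q
    gains.append(L)
gains.reverse()
print("optimal first-stage gain L_0 =", gains[0])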

  10. Discrete-State Deterministic Scheduling Example
      [Figure: transition graph of partial schedules, starting from the empty schedule (initial state) and passing through states such as A, C, AB, AC, CA, CD, ABC, ACB, ACD, CAB, CAD, CDA, with a cost on each transition]
      Find an optimal sequence of the operations A, B, C, D (A must precede B, and C must precede D).
      DP problem formulation:
        States: partial schedules; Controls: stage 0, 1, and 2 decisions
        DP idea: break down the problem into smaller pieces (tail subproblems)
        Start from the last decision and go backwards.

  11. Scheduling Example: Algorithm I
      [Figure: the same transition graph, with the stage 2 tail subproblems highlighted]
      Solve the stage 2 subproblems (using the terminal costs).
      At each state of stage 2, we record the optimal cost-to-go and the optimal decision.

  12. Scheduling Example: Algorithm II
      [Figure: the same transition graph, with the stage 1 tail subproblems highlighted]
      Solve the stage 1 subproblems (using the solution of the stage 2 subproblems).
      At each state of stage 1, we record the optimal cost-to-go and the optimal decision.

  13. Scheduling Example: Algorithm III
      [Figure: the same transition graph, with the stage 0 subproblem highlighted]
      Solve the stage 0 subproblem (using the solution of the stage 1 subproblems).
      The stage 0 subproblem is the entire problem.
      The optimal value of the stage 0 subproblem is the optimal cost J*(initial state).
      Construct the optimal sequence going forward (a small code sketch of this backward-then-forward procedure follows below).
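A compact way to carry out this backward-then-forward procedure in code is to treat partial schedules as states and recurse on the tail subproblems. A minimal Python sketch; the transition costs are hypothetical placeholders (the numeric values in the lecture's figure are not reproduced), and the terminal cost is taken to be zero.

from functools import lru_cache

OPS = "ABCD"
PRECEDES = {"B": "A", "D": "C"}          # A must precede B, and C must precede D

def step_cost(sched, op):
    # Hypothetical cost of appending operation `op` to partial schedule `sched`
    return 1 + (len(sched) + ord(op) - ord("A")) % 3

def allowed(sched, op):
    pre = PRECEDES.get(op)
    return op not in sched and (pre is None or pre in sched)

@lru_cache(maxsize=None)
def tail(sched):
    """Optimal (cost-to-go, next operation) for the tail subproblem at partial schedule `sched`."""
    if len(sched) == len(OPS):
        return 0.0, None                  # complete schedule; terminal cost assumed zero here
    return min((step_cost(sched, op) + tail(sched + op)[0], op)
               for op in OPS if allowed(sched, op))

# Stage 0 subproblem = the entire problem; then construct the optimal sequence going forward.
cost, _ = tail("")
sched = ""
while len(sched) < len(OPS):
    sched += tail(sched)[1]
print("optimal cost J*(empty schedule):", cost, " optimal schedule:", sched)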

  14. Principle of Optimality
      Let π* = {µ*_0, µ*_1, ..., µ*_{N−1}} be an optimal policy.
      Consider the "tail subproblem" whereby we are at x_k at time k and wish to minimize the "cost-to-go" from time k to time N:
        E[ g_N(x_N) + Σ_{m=k}^{N−1} g_m(x_m, µ_m(x_m), w_m) ]
      Consider the "tail" {µ*_k, µ*_{k+1}, ..., µ*_{N−1}} of the optimal policy.
      [Figure: time line from 0 to N; the tail subproblem starts at state x_k at time k]
      THE TAIL OF AN OPTIMAL POLICY IS OPTIMAL FOR THE TAIL SUBPROBLEM
      DP algorithm:
        Start with the last tail (stage N−1) subproblems.
        Solve the stage k tail subproblems, using the optimal costs-to-go of the stage (k+1) tail subproblems.
        The optimal value of the stage 0 subproblem is the optimal cost J*(initial state).
        In the process, construct the optimal policy.

  15. Formal Statement of the DP Algorithm
      Computes, for all k and states x_k, J_k(x_k): the optimal cost of the tail problem that starts at x_k.
      Go backwards, k = N−1, ..., 0, using
        J_N(x_N) = g_N(x_N)
        J_k(x_k) = min_{u_k ∈ U_k(x_k)} E_{w_k}[ g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k)) ]
      Interpretation: to solve a tail problem that starts at state x_k, minimize the (k-th stage cost + optimal cost of the tail problem that starts at state x_{k+1}).
      Notes:
        J_0(x_0) = J*(x_0): the cost generated at the last step is equal to the optimal cost.
        Let µ*_k(x_k) attain the minimum on the right-hand side above, for each x_k and k. Then the policy π* = {µ*_0, ..., µ*_{N−1}} is optimal.
        Proof by induction.
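For a problem with finitely many states, controls, and disturbance values, the recursion above can be implemented almost verbatim. A minimal Python sketch; the arguments (state list, feasible-control sets, disturbance values and probabilities, system and cost functions) are placeholders to be supplied for a specific problem.

def dp(N, states, controls, w_values, w_prob, f, g, g_terminal):
    """Backward DP: returns cost-to-go tables J[k][x] and a policy mu[k][x].

    f(k, x, u, w)      -> next state (the system equation)
    g(k, x, u, w)      -> stage cost
    g_terminal(x)      -> terminal cost g_N(x)
    w_prob(k, x, u, w) -> probability of disturbance value w
    controls(k, x)     -> iterable of feasible controls U_k(x)
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_terminal(x)
    for k in reversed(range(N)):
        for x in states:
            best_u, best_val = None, float("inf")
            for u in controls(k, x):
                # E_{w_k}[ g_k(x, u, w) + J_{k+1}(f_k(x, u, w)) ]
                val = sum(w_prob(k, x, u, w) * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                          for w in w_values)
                if val < best_val:
                    best_u, best_val = u, val
            J[k][x], mu[k][x] = best_val, best_u
    return J, mu

The returned tables give J_k(x_k) for every stage and state, and mu[k][x] records a minimizing control at each step, i.e., an optimal policy.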

  16. Practical Difficulties of DP
      The curse of dimensionality (too many values of x_k):
        - In continuous-state problems: discretization is needed; exponential growth of the computation with the dimensions of the state and control spaces
        - In naturally discrete/combinatorial problems: quick explosion of the number of states as the search space grows
      Length of the horizon (what if it is infinite?)
      The curse of modeling; we may not know f_k and P(w_k | x_k, u_k) exactly:
        - It is often hard to construct an accurate mathematical model of the problem
        - Sometimes a simulator of the system is easier to construct than a model
      The problem data may not be known well in advance:
        - A family of problems may be addressed; the data of the problem to be solved is given with little advance notice
        - The problem data may change as the system is controlled – need for on-line replanning and fast solution
