

1. A Series of Lectures on Approximate Dynamic Programming
Dimitri P. Bertsekas
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
Lucca, Italy, June 2017

2. Second Lecture: APPROXIMATE DYNAMIC PROGRAMMING I

3. Outline
1 Review of the Exact DP Algorithm
2 Approximation in Value Space
3 Parametric Cost Approximation
4 Tail Problem Approximation

4. Recall the Basic Problem Structure for DP

Discrete-time system:
x_{k+1} = f_k(x_k, u_k, w_k), \quad k = 0, 1, \ldots, N-1
- x_k: state
- u_k: control from a constraint set U_k(x_k)
- w_k: disturbance; a random parameter with distribution P(w_k | x_k, u_k)

Optimization over feedback policies \pi = \{\mu_0, \mu_1, \ldots, \mu_{N-1}\}, with u_k = \mu_k(x_k) \in U_k(x_k)

Cost of a policy starting at initial state x_0:
J_\pi(x_0) = E\Big\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k\big(x_k, \mu_k(x_k), w_k\big) \Big\}

Optimal cost function: J^*(x_0) = \min_\pi J_\pi(x_0)
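As a concrete illustration (not part of the slides), here is a minimal Python sketch of estimating a policy's cost J_\pi(x_0) by Monte Carlo simulation of the system equation; the callables f, g, g_N, sample_w and the policy list mu are hypothetical placeholders for a specific problem instance.

```python
def policy_cost(x0, mu, f, g, g_N, sample_w, N, num_runs=1000):
    """Monte Carlo estimate of J_pi(x0) = E{ g_N(x_N) + sum_k g_k(x_k, mu_k(x_k), w_k) }.

    mu[k]: feedback policy functions u_k = mu_k(x_k);
    f(k, x, u, w): system function x_{k+1} = f_k(x_k, u_k, w_k);
    g(k, x, u, w): stage cost; g_N(x): terminal cost;
    sample_w(k, x, u): draws w_k from P(w_k | x_k, u_k).
    """
    total = 0.0
    for _ in range(num_runs):
        x, cost = x0, 0.0
        for k in range(N):
            u = mu[k](x)              # u_k = mu_k(x_k)
            w = sample_w(k, x, u)     # w_k ~ P(w_k | x_k, u_k)
            cost += g(k, x, u, w)
            x = f(k, x, u, w)         # x_{k+1} = f_k(x_k, u_k, w_k)
        total += cost + g_N(x)
    return total / num_runs
```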

5. Recall the Exact DP Algorithm

Computes, for all k and all states x_k, the value J_k(x_k): the optimal cost of the tail problem that starts at x_k

Go backwards, k = N-1, \ldots, 0, using
J_N(x_N) = g_N(x_N)
J_k(x_k) = \min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}

Notes:
- J_0(x_0) = J^*(x_0): the cost generated at the last step is equal to the optimal cost
- Let \mu_k^*(x_k) attain the minimum on the right side above for each x_k and k. Then the policy \pi^* = \{\mu_0^*, \ldots, \mu_{N-1}^*\} is optimal
- Potentially ENORMOUS computational requirements
- IF we knew J_{k+1}, the computation of the minimizing u_k would be much simpler
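A minimal sketch (not from the lecture) of this backward recursion for a finite-state, finite-control problem, assuming for simplicity a disturbance distribution that does not depend on (x_k, u_k); all names are illustrative.

```python
def exact_dp(states, controls, f, g, g_N, w_dist, N):
    """Backward DP: J_N = g_N, then J_k(x) = min_u E{ g_k + J_{k+1}(f_k) }.

    states: finite list; controls(k, x): iterable giving U_k(x);
    w_dist: list of (w, probability) pairs (assumed independent of (x, u)).
    Returns cost-to-go tables J[k][x] and an optimal policy mu[k][x].
    """
    J = [dict() for _ in range(N + 1)]
    mu = [dict() for _ in range(N)]
    for x in states:
        J[N][x] = g_N(x)                     # J_N(x_N) = g_N(x_N)
    for k in range(N - 1, -1, -1):           # go backwards, k = N-1, ..., 0
        for x in states:
            best_u, best_q = None, float("inf")
            for u in controls(k, x):
                q = sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                        for w, p in w_dist)  # E{ g_k + J_{k+1}(f_k) }
                if q < best_q:
                    best_u, best_q = u, q
            J[k][x], mu[k][x] = best_q, best_u
    return J, mu
```

As the slide notes, the tables J[k] range over every state at every stage, which is exactly where the enormous computational requirements come from.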

6. One-Step and Multistep Lookahead

One-step lookahead:
- Replace J_{k+1} by an approximation \tilde J_{k+1}
- Apply the control \bar u_k that attains the minimum in
  \min_{u_k \in U_k(x_k)} E\big\{ g_k(x_k, u_k, w_k) + \tilde J_{k+1}\big(f_k(x_k, u_k, w_k)\big) \big\}

ℓ-step lookahead:
- At state x_k, solve the ℓ-step DP problem starting at x_k and using terminal cost \tilde J_{k+\ell}
- If u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1} is an optimal policy for the ℓ-step problem, apply the first control u_k

Notes:
- Other names used: rolling or receding horizon control
- A key issue: how do we choose \tilde J_{k+\ell}?
- Another issue: how do we deal with the minimization and with the computation of E\{\cdot\}?
- Implementation issues, e.g., the tradeoff between on-line and off-line computation
- Performance issues, e.g., error bounds (we will not cover these)
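A sketch of the one-step lookahead rule under the same hypothetical interface as above, with the approximation \tilde J_{k+1} supplied as a callable J_tilde:

```python
def one_step_lookahead(k, x, controls, f, g, J_tilde, w_dist):
    """Return the control attaining min_u E{ g_k + J~_{k+1}(f_k) } over U_k(x).

    J_tilde(k+1, x'): cost-to-go approximation used in place of J_{k+1}.
    """
    best_u, best_q = None, float("inf")
    for u in controls(k, x):
        q = sum(p * (g(k, x, u, w) + J_tilde(k + 1, f(k, x, u, w)))
                for w, p in w_dist)
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```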

7. A Summary of Approximation Possibilities in Value Space

At state x_k, DP minimization over the first ℓ steps, with \tilde J_{k+\ell} approximating the cost of the "future":
\min_{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}} E\Big\{ g_k(x_k, u_k, w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\big(x_m, \mu_m(x_m), w_m\big) + \tilde J_{k+\ell}(x_{k+\ell}) \Big\}

Approximations for the minimization (which could itself be approximate):
- Replace E\{\cdot\} with nominal values (certainty equivalent control; see the sketch below)
- Limited simulation (Monte Carlo tree search)

Approximations for the computation of \tilde J_{k+\ell}:
- Simple choices
- Parametric approximation
- Rollout
- Tail problem approximation
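For instance, the certainty equivalent variant drops the expectation and plugs a single nominal disturbance value into the one-step lookahead minimization; a sketch, again with hypothetical helper names:

```python
def certainty_equivalent_control(k, x, controls, f, g, J_tilde, w_nominal):
    """One-step lookahead with E{.} replaced by a nominal value w_nominal."""
    best_u, best_q = None, float("inf")
    for u in controls(k, x):
        q = g(k, x, u, w_nominal) + J_tilde(k + 1, f(k, x, u, w_nominal))
        if q < best_q:
            best_u, best_q = u, q
    return best_u
```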

8. A First-Order Division of Lookahead Choices

Long lookahead ℓ and simple choice of \tilde J_{k+\ell}
Some examples:
- \tilde J_{k+\ell}(x) \equiv 0 (or a constant)
- \tilde J_{k+\ell}(x) = g_N(x)
- For problems with a "goal state", use a simple penalty (see the sketch below):
  \tilde J_{k+\ell}(x) = \begin{cases} 0 & \text{if } x \text{ is a goal state} \\ \gg 1 & \text{if } x \text{ is not a goal state} \end{cases}
Long lookahead \Rightarrow a lot of DP computation, which often must be done off-line

Short lookahead ℓ and sophisticated choice \tilde J_{k+\ell} \approx J_{k+\ell}
- The lookahead cost function approximates (to within a constant) the optimal cost-to-go produced by exact DP
- We will next describe a variety of off-line and on-line approximation approaches
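The goal-state penalty above might look as follows in code; the goal set and the penalty value are problem-dependent placeholders.

```python
GOAL_STATES = {"goal"}   # hypothetical set of goal states
BIG_PENALTY = 1e6        # any value >> 1 serves as the non-goal penalty

def J_tilde_goal_penalty(x):
    """Simple terminal cost: 0 at goal states, a large penalty elsewhere."""
    return 0.0 if x in GOAL_STATES else BIG_PENALTY
```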

9. Approximation in Value Space: Parametric Approximation

Lookahead minimization over the first ℓ steps, with a cost-to-go approximation \tilde J_{k+\ell} of the "future":
\min_{u_k, \mu_{k+1}, \ldots, \mu_{k+\ell-1}} E\Big\{ g_k(x_k, u_k, w_k) + \sum_{m=k+1}^{k+\ell-1} g_m\big(x_m, \mu_m(x_m), w_m\big) + \tilde J_{k+\ell}(x_{k+\ell}) \Big\}

We focus next on obtaining \tilde J_{k+\ell} by parametric approximation

10. Parametric Approximation: Approximation Architectures

We approximate J_k(x_k) with a function from an approximation architecture, i.e., a parametric class \tilde J_k(x_k, r_k), where r_k = (r_{1,k}, \ldots, r_{m,k}) is a vector of "tunable" scalar weights

We use \tilde J_k in place of J_k (the optimal cost-to-go function) in a one-step or multistep lookahead scheme

Role of r_k: by adjusting r_k we can change the "shape" of \tilde J_k so that it is "close" to the optimal J_k (at least to within a constant)

Two key issues:
- The choice of the parametric class \tilde J_k(x_k, r_k); there is a large variety
- The method for tuning/adjusting the weights ("training" the architecture)

11. Feature-Based Architectures

Feature extraction: a process that maps the state x_k into a vector \phi_k(x_k) = \big(\phi_{1,k}(x_k), \ldots, \phi_{m,k}(x_k)\big), called the feature vector associated with x_k

A feature-based cost approximator has the form
\tilde J_k(x_k, r_k) = \hat J_k\big(\phi_k(x_k), r_k\big)
where r_k is a parameter vector and \hat J_k is some function, linear or nonlinear in r_k

With a well-chosen feature vector \phi_k(x_k), a good approximation to the cost-to-go is often provided by linearly weighting the features, i.e.,
\tilde J_k(x_k, r_k) = \hat J_k\big(\phi_k(x_k), r_k\big) = \sum_{i=1}^m r_{i,k} \phi_{i,k}(x_k) = r_k' \phi_k(x_k)

[Figure: state x_k → feature extraction mapping → feature vector \phi_k(x_k) → linear cost approximator r_k' \phi_k(x_k)]

This can be viewed as approximation onto a subspace spanned by basis functions of x_k defined by the features \phi_{i,k}(x_k)
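A minimal sketch of the linear feature-based approximator r_k' \phi_k(x_k), with hypothetical polynomial features of a scalar state:

```python
import numpy as np

def J_tilde_linear(x, r, features):
    """Linear feature-based approximator: J~_k(x, r_k) = r_k' phi_k(x).

    features: list of scalar feature functions phi_{i,k}; r: weight vector.
    """
    phi = np.array([phi_i(x) for phi_i in features])
    return float(r @ phi)

# Illustrative usage: two polynomial basis features of a scalar state
features = [lambda x: x, lambda x: x ** 2]
r = np.array([0.5, 1.0])
print(J_tilde_linear(3.0, r, features))   # 0.5*3 + 1.0*9 = 10.5
```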

12. Feature-Based Architectures (Continued)

Any generic basis functions, such as classes of polynomials, wavelets, radial basis functions, etc., can serve as features
In some cases, problem-specific features can be "hand-crafted"

Computer chess example
[Figure: position → feature extraction (material balance, mobility, safety, etc.) → weighting of features → score; a feature-based position evaluator]
- Think of the state as the board position and the control as the move choice
- Use a feature-based position evaluator assigning a score to each position
- Most chess programs use a linear architecture with a "manual" choice of weights
- Some programs choose the weights by a least squares fit using lots of grandmaster play examples (see the sketch below)
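Such a least squares fit of the feature weights might be sketched as follows, with synthetic data standing in for the (position, score) examples extracted from grandmaster games; NumPy's lstsq solves the resulting linear least squares problem.

```python
import numpy as np

# Hypothetical training data: each row of Phi is the feature vector
# phi(x) of one example position (material balance, mobility, ...);
# beta holds the corresponding target scores.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(500, 8))   # 500 positions, 8 features
beta = rng.normal(size=500)       # target scores

# Fit the linear weights r in score(x) = r' phi(x) by least squares
r, *_ = np.linalg.lstsq(Phi, beta, rcond=None)
print("fitted weights:", r)
```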

13. An Example of Architecture Training: Sequential DP Approximation

A common way to train architectures \tilde J_k(x_k, r_k) in the context of DP:
- We start with \tilde J_N = g_N and sequentially train going backwards, until k = 1
- Given a cost-to-go approximation \tilde J_{k+1}, we use one-step lookahead to construct a large number of state-cost pairs (x_k^s, \beta_k^s), s = 1, \ldots, q, where
  \beta_k^s = \min_{u \in U_k(x_k^s)} E\big\{ g_k(x_k^s, u, w_k) + \tilde J_{k+1}\big(f_k(x_k^s, u, w_k), r_{k+1}\big) \big\}, \quad s = 1, \ldots, q
- We "train" the architecture \tilde J_k on the training set (x_k^s, \beta_k^s), s = 1, \ldots, q

Training by least squares/regression: we minimize over r_k
\sum_{s=1}^q \big( \tilde J_k(x_k^s, r_k) - \beta_k^s \big)^2 + \gamma \| r_k - \bar r \|^2
where \bar r is an initial guess for r_k and \gamma > 0 is a regularization parameter
- Special algorithms called incremental gradient methods are typically used; they take advantage of the large-sum structure of the cost function
- For a linear architecture the training problem is a linear least squares problem (see the sketch below)
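For a linear architecture \tilde J_k(x, r_k) = r_k' \phi_k(x), the regularized regression above has a closed-form solution via its normal equations; a sketch, assuming the feature matrix and the lookahead targets \beta_k^s have already been generated:

```python
import numpy as np

def train_linear_architecture(Phi, beta, r_bar, gamma):
    """Solve min_r sum_s (phi(x^s)' r - beta^s)^2 + gamma * ||r - r_bar||^2.

    Phi: q x m matrix whose rows are the feature vectors phi_k(x^s);
    beta: the q lookahead targets; setting the gradient to zero gives
    (Phi' Phi + gamma I) r = Phi' beta + gamma r_bar.
    """
    m = Phi.shape[1]
    A = Phi.T @ Phi + gamma * np.eye(m)
    b = Phi.T @ beta + gamma * r_bar
    return np.linalg.solve(A, b)
```

For very large training sets one would instead use the incremental gradient methods mentioned above, which process the q terms of the sum one (or a few) at a time.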

14. Neural Networks for Constructing Cost-to-Go Approximations \tilde J_k

Neural nets can be used in the preceding sequential DP approximation scheme: train the stage-k neural net using a training set generated with the stage-(k+1) neural net

Two ways to view neural networks:
- As nonlinear approximation architectures
- As linear architectures with automatically constructed features

Focus on the typical stage k and drop the index k for convenience. Neural nets are approximation architectures of the form
\tilde J(x, v, r) = \sum_{i=1}^m r_i \phi_i(x, v) = r' \phi(x, v)
involving two parameter vectors r and v with different roles
- View \phi(x, v) as a feature vector, and r as a vector of linear weighting parameters for \phi(x, v)
- By training v jointly with r, we obtain automatically generated features!
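A minimal NumPy sketch (not from the slides) of this two-parameter-vector view: the hidden-layer weights V play the role of v and construct the features \phi(x, v), while r weights them linearly; the sizes and the ReLU activation are illustrative assumptions.

```python
import numpy as np

def nn_features(x, V):
    """phi(x, v): features constructed by a hidden layer with ReLU activation.
    Training V reshapes the features themselves."""
    return np.maximum(0.0, V @ x)

def J_tilde_nn(x, V, r):
    """Neural-net approximator J~(x, v, r) = r' phi(x, v)."""
    return float(r @ nn_features(x, V))

# Illustrative sizes: state dimension 4, m = 16 constructed features
rng = np.random.default_rng(0)
V = rng.normal(size=(16, 4))   # the 'v' parameters (feature construction)
r = rng.normal(size=16)        # the linear weighting parameters
x = rng.normal(size=4)
print(J_tilde_nn(x, V, r))
```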
