CS 287 Advanced Robotics (Fall 2019) Lecture 6: Unconstrained Optimization


SLIDE 1

CS 287 Advanced Robotics (Fall 2019) Lecture 6: Unconstrained Optimization

Pieter Abbeel UC Berkeley EECS

Many slides and figures adapted from Stephen Boyd

[optional] Boyd and Vandenberghe, Convex Optimization, Chapters 9–11
[optional] Betts, Practical Methods for Optimal Control Using Nonlinear Programming

SLIDE 2

Bellman’s Curse of Dimensionality

• n-dimensional state space
• Number of states grows exponentially in n, for a fixed number of discretization levels per coordinate (see the worked example below)
• In practice, discretization is considered computationally feasible only up to 5- or 6-dimensional state spaces, even when using:
  • Variable resolution discretization
  • Highly optimized implementations
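As a concrete illustration (my numbers, not from the slide): with k discretization levels per coordinate, an n-dimensional grid has

$$k^n \text{ states}, \qquad \text{e.g. } k = 100,\; n = 6 \;\Rightarrow\; 100^6 = 10^{12} \text{ states.}$$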

SLIDE 3

Optimization for Optimal Control

• Goal: find a sequence of control inputs (and corresponding sequence of states) that solves the problem on the slide (a standard reconstruction is given below).
• Generally hard to do. Exception: convex problems, which means g is convex, the sets Ut and Xt are convex, and f is linear.
• Note: iteratively applying LQR is one way to solve this problem, but it can get a bit tricky when there are constraints on the control inputs and state.
• In principle (though not in our examples), u could be parameters of a control policy rather than the raw control inputs.
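The optimization problem itself is an image on the slide; a standard form consistent with the g, f, Ut, Xt named in the bullets (my reconstruction, not a transcription) is:

$$\min_{u_0,\dots,u_{T-1},\;x_0,\dots,x_T}\; \sum_{t=0}^{T-1} g(x_t, u_t) \quad \text{s.t.}\quad x_{t+1} = f(x_t, u_t),\;\; u_t \in U_t,\;\; x_t \in X_t \;\;\forall t$$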

SLIDE 4

Outline

• Convex optimization problems
• Unconstrained minimization
  • Gradient Descent
  • Newton's Method
  • Natural Gradient / Gauss-Newton
  • Momentum, RMSprop, Adam

SLIDE 5

Convex Functions

• A function f is convex if and only if

∀x1, x2 ∈ Domain(f), ∀t ∈ [0, 1] : f(tx1 + (1 − t)x2) ≤ tf(x1) + (1 − t)f(x2)

Image source: Wikipedia
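A quick worked instance of the definition (my example, not on the slide): for f(x) = x² the inequality holds because

$$t x_1^2 + (1-t)\,x_2^2 - \big(t x_1 + (1-t)\,x_2\big)^2 \;=\; t(1-t)\,(x_1 - x_2)^2 \;\ge\; 0.$$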

SLIDE 6

Convex Functions

• Unique minimum (for strictly convex f; for convex f, any local minimum is a global minimum)
• The set of points for which f(x) ≤ a is convex

Source: Thomas Jungblut's Blog
SLIDE 7

Convex Optimization Problems

• Convex optimization problems are a special class of optimization problems, of the following form:

$$\min_{x \in \mathbb{R}^n} f_0(x) \qquad \text{s.t.}\;\; f_i(x) \le 0,\; i = 1, \dots, n, \qquad Ax = b$$

with fi(x) convex for i = 0, 1, …, n

• A function f is convex if and only if

∀x1, x2 ∈ Domain(f), ∀λ ∈ [0, 1] : f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2)

SLIDE 8

Outline

• Convex optimization problems
• Unconstrained minimization
  • Gradient Descent
  • Newton's Method
  • Natural Gradient / Gauss-Newton
  • Momentum, RMSprop, Adam

SLIDE 9

Unconstrained Minimization

• If x* is a local minimum of (differentiable) f, then it has to satisfy conditions (2) and (3) (reconstructed below).
• In simple cases we can directly solve the system of n equations given by (2) to find candidate local minima, and then verify (3) for these candidates.
• In general, however, solving (2) is a difficult problem. Going forward we will consider this more general setting and cover numerical solution methods for (1).
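The numbered formulas (1)–(3) are images on the slide; from the way the bullets reference them, they are presumably the unconstrained problem and the standard first- and second-order conditions:

$$\text{(1)}\;\; \min_{x \in \mathbb{R}^n} f(x), \qquad \text{(2)}\;\; \nabla f(x^*) = 0, \qquad \text{(3)}\;\; \nabla^2 f(x^*) \succeq 0.$$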

SLIDE 10

Steepest Descent

• Idea:
  • Start somewhere
  • Repeat: take a step in the steepest descent direction

Figure source: Mathworks

SLIDE 11

SLIDE 12
Steepest Descent Algorithm

1. Initialize x
2. Repeat:
   1. Determine the steepest descent direction Δx
   2. Line search: choose a step size t > 0
   3. Update: x := x + t Δx
3. Until stopping criterion is satisfied (a code sketch follows below)
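A minimal sketch of this loop in Python, assuming a gradient oracle; the function name, fixed step size, and tolerance are illustrative choices, not from the lecture:

```python
import numpy as np

def gradient_descent(f_grad, x0, step_size=0.05, tol=1e-6, max_iters=1000):
    """Repeat: step along the negative gradient until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        dx = -f_grad(x)               # steepest descent direction
        if np.linalg.norm(dx) < tol:  # stopping criterion
            break
        x = x + step_size * dx        # update (fixed step; line search comes next)
    return x

# Example: minimize f(x) = x1^2 + 10*x2^2 (condition number 10)
x_min = gradient_descent(lambda x: np.array([2 * x[0], 20 * x[1]]), [1.0, 1.0])
```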

SLIDE 13

What is the Steepest Descent Direction?

→ Steepest Descent = Gradient Descent
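The derivation on this slide is an image; for reference, under the Euclidean norm the normalized steepest descent direction is the negative gradient direction, which is why the two names coincide:

$$\Delta x_{\text{nsd}} \;=\; \arg\min_{\|v\|_2 \le 1} \nabla f(x)^\top v \;=\; -\,\frac{\nabla f(x)}{\|\nabla f(x)\|_2}.$$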

SLIDE 14

Stepsize Selection: Exact Line Search

• Used when the cost of solving the one-variable minimization problem is low compared to the cost of computing the search direction itself (the standard rule is given below).
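The exact line search rule itself is an image on the slide; the standard definition (as in Boyd and Vandenberghe) is to minimize f along the ray:

$$t \;=\; \arg\min_{s \ge 0} f(x + s\,\Delta x).$$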

SLIDE 15

Stepsize Selection: Backtracking Line Search

• Inexact: the step length is chosen to approximately minimize f along the ray {x + t Δx | t > 0} (a code sketch follows below)
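A minimal sketch of backtracking line search (Armijo condition) as described in Boyd and Vandenberghe; the parameter values alpha and beta are typical defaults, not taken from the slides:

```python
import numpy as np

def backtracking_line_search(f, grad_fx, x, dx, alpha=0.3, beta=0.8):
    """Shrink t until f(x + t*dx) lies below the sufficient-decrease line."""
    t = 1.0
    while f(x + t * dx) > f(x) + alpha * t * (grad_fx @ dx):
        t *= beta  # step too long: shrink geometrically
    return t

# Usage with the gradient descent direction dx = -grad f(x):
f = lambda x: x[0]**2 + 10 * x[1]**2
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
g = grad(x)
t = backtracking_line_search(f, g, x, -g)
x_next = x + t * (-g)
```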

SLIDE 16

Stepsize Selection: Backtracking Line Search

Figure source: Boyd and Vandenberghe

SLIDE 17

SLIDE 18

Steepest Descent (= Gradient Descent)

Source: Boyd and Vandenberghe

SLIDE 19

Gradient Descent: Example 1

Figure source: Boyd and Vandenberghe

SLIDE 20

Gradient Descent: Example 2

Figure source: Boyd and Vandenberghe

SLIDE 21

Gradient Descent: Example 3

Figure source: Boyd and Vandenberghe

SLIDE 22

Gradient Descent Convergence

• For a quadratic function, convergence speed depends on the ratio of the highest second derivative to the lowest second derivative (the "condition number"; see the formula below)
• In high dimensions, almost guaranteed to have a high (= bad) condition number
• Rescaling coordinates (as could happen by simply expressing quantities in different measurement units) results in a different condition number

[Figure: contour plots with condition number = 10 and condition number = 1]
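As an illustration (my example, not from the slide), for a quadratic f(x) = ½ xᵀQx this ratio is the eigenvalue spread of Q:

$$\kappa(Q) = \frac{\lambda_{\max}(Q)}{\lambda_{\min}(Q)}, \qquad \text{e.g. } f(x) = x_1^2 + 10\,x_2^2 \;\Rightarrow\; Q = \mathrm{diag}(2, 20),\; \kappa = 10.$$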

SLIDE 23

Outline

• Convex optimization problems
• Unconstrained minimization
  • Gradient Descent
  • Newton's Method
  • Natural Gradient / Gauss-Newton
  • Momentum, RMSprop, Adam

SLIDE 24

SLIDE 25

Newton's Method

• 2nd order Taylor approximation rather than 1st order:
• Assuming ∇²f(x) ⪰ 0 (which is true for convex f), the minimum of the 2nd order approximation is achieved at the Newton step (reconstructed below).

Figure source: Boyd and Vandenberghe
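The formulas on this slide are images; the standard expressions they refer to (my reconstruction, in Boyd and Vandenberghe's notation) are the quadratic model and its minimizer, the Newton step:

$$f(x + v) \;\approx\; f(x) + \nabla f(x)^\top v + \tfrac{1}{2}\, v^\top \nabla^2 f(x)\, v, \qquad \Delta x_{\text{nt}} \;=\; -\,\nabla^2 f(x)^{-1} \nabla f(x).$$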

SLIDE 26

Newton’s Method

Figure source: Boyd and Vandenberghe
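The algorithm on this slide is a figure; a minimal sketch of the (undamped) Newton iteration, assuming callable gradient and Hessian oracles (names and tolerances are my own, not from the lecture):

```python
import numpy as np

def newtons_method(grad, hess, x0, tol=1e-8, max_iters=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iters):
        g = grad(x)
        dx = np.linalg.solve(hess(x), -g)  # Newton step: solve H dx = -g
        if -(g @ dx) / 2 < tol:            # Newton decrement stopping criterion
            break
        x = x + dx                         # full step; a damped version would
                                           # add backtracking line search here
    return x

# For a convex quadratic, a single Newton step lands on the minimizer:
grad = lambda x: np.array([2 * x[0], 20 * x[1]])
hess = lambda x: np.diag([2.0, 20.0])
x_star = newtons_method(grad, hess, [1.0, 1.0])  # -> array([0., 0.])
```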

SLIDE 27

Affine Invariance

• Consider the coordinate transformation y = A⁻¹x (x = Ay)
• If running Newton's method starting from x(0) on f(x) results in x(0), x(1), x(2), …
• Then running Newton's method starting from y(0) = A⁻¹x(0) on g(y) = f(Ay) will result in the sequence y(0) = A⁻¹x(0), y(1) = A⁻¹x(1), y(2) = A⁻¹x(2), …
• Exercise: try to prove this!

SLIDE 28

SLIDE 29

Affine Invariance --- Proof
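The proof on this slide is an image; a sketch of the standard argument (my reconstruction, not a transcription of the slide): with g(y) = f(Ay),

$$\nabla g(y) = A^\top \nabla f(Ay), \qquad \nabla^2 g(y) = A^\top \nabla^2 f(Ay)\, A,$$

so the Newton step in y is

$$\Delta y = -\big(A^\top \nabla^2 f(Ay)\, A\big)^{-1} A^\top \nabla f(Ay) = -A^{-1} \nabla^2 f(Ay)^{-1} \nabla f(Ay) = A^{-1} \Delta x,$$

and by induction y(k) = A⁻¹x(k) for all k.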

SLIDE 30

Example 1

Figure source: Boyd and Vandenberghe
[Figure panels: gradient descent with backtracking line search; Newton's method with backtracking line search]

SLIDE 31

Example 2

Figure source: Boyd and Vandenberghe

[Figure panels: gradient descent; Newton's method]

SLIDE 32

Larger Version of Example 2

Figure source: Boyd and Vandenberghe

SLIDE 33

Gradient Descent: Example 3

Figure source: Boyd and Vandenberghe

SLIDE 34

Example 3

• Gradient descent
• Newton's method (converges in one step if f is a convex quadratic)

SLIDE 35

Quasi-Newton Methods

• Quasi-Newton methods use an approximation of the Hessian
• Example 1: only compute the diagonal entries of the Hessian, set the others equal to zero. Note this also simplifies computations done with the Hessian (a small sketch follows below).
• Example 2: natural gradient --- see next slide
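A minimal sketch of the diagonal-Hessian idea in Example 1 (my illustration, not from the slides): with only the diagonal of the Hessian, "inverting" it reduces to an elementwise division.

```python
import numpy as np

def diag_newton_step(grad_x, hess_diag, eps=1e-8):
    """Return dx solving diag(hess_diag) @ dx = -grad_x."""
    return -grad_x / (hess_diag + eps)  # eps guards against zero curvature

g = np.array([2.0, 20.0])         # gradient of x1^2 + 10*x2^2 at (1, 1)
h_diag = np.array([2.0, 20.0])    # exact Hessian diagonal for this f
dx = diag_newton_step(g, h_diag)  # -> [-1., -1.]; matches full Newton here,
                                  # since the true Hessian is already diagonal
```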

SLIDE 36

Outline

• Convex optimization problems
• Unconstrained minimization
  • Gradient Descent
  • Newton's Method
  • Natural Gradient / Gauss-Newton
  • Momentum, RMSprop, Adam

SLIDE 37

SLIDE 38

Natural Gradient

• Consider a standard maximum likelihood problem:
• Gradient:
• Hessian:

$$\nabla^2 f(\theta) \;=\; \sum_i \left[ \frac{\nabla^2 p(x^{(i)}; \theta)}{p(x^{(i)}; \theta)} \;-\; \Big(\nabla \log p(x^{(i)}; \theta)\Big)\Big(\nabla \log p(x^{(i)}; \theta)\Big)^{\top} \right]$$

• Natural gradient: only keeps the 2nd term in the Hessian (see the reconstruction below). Benefits: (1) faster to compute (only gradients needed); (2) guaranteed to be negative definite; (3) found to be superior in some experiments; (4) invariant to re-parameterization
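The objective, gradient, and natural gradient formulas on this slide are images; the standard expressions they presumably correspond to (my reconstruction) are

$$f(\theta) = \sum_i \log p(x^{(i)}; \theta), \qquad \nabla f(\theta) = \sum_i \nabla \log p(x^{(i)}; \theta),$$

with the natural gradient step obtained by replacing the Hessian with (minus) its second, outer-product term:

$$\Delta\theta_{\text{nat}} \;=\; \left[\sum_i \nabla \log p(x^{(i)}; \theta)\, \nabla \log p(x^{(i)}; \theta)^{\top}\right]^{-1} \nabla f(\theta).$$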

SLIDE 39

SLIDE 40

Natural Gradient

• Property: the natural gradient is invariant to the parameterization of the family of probability distributions p(x; θ)
• Hence the name.
• Note this property is stronger than the property of Newton's method, which is invariant to affine re-parameterizations only.
• Exercise: try to prove this property!

SLIDE 41

Natural Gradient Invariant to Reparametrization --- Proof

• Natural gradient for parametrization with θ:
• Let Φ = f(θ), and let …, i.e., … (the intermediate formulas are images on the slide)
• → the natural gradient direction is the same independent of the (invertible, but otherwise not constrained) reparametrization f

SLIDE 42

Outline

• Convex optimization problems
• Unconstrained minimization
  • Gradient Descent
  • Newton's Method
  • Natural Gradient / Gauss-Newton
  • Momentum, RMSprop, Adam

SLIDE 43

Gradient Descent with Momentum

[Slide compares the update rules for gradient descent vs. gradient descent with momentum]

• Typically beta = 0.9
• v = exponentially weighted average of the gradient (a code sketch follows below)
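A minimal sketch of the momentum update described on the slide (v is the exponentially weighted average of gradients, typical beta = 0.9); the function name and learning rate are illustrative choices, not from the lecture:

```python
import numpy as np

def momentum_step(x, v, grad_x, lr=0.1, beta=0.9):
    v = beta * v + (1 - beta) * grad_x  # exponentially weighted avg of gradients
    x = x - lr * v                      # step along the averaged direction
    return x, v

x, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = np.array([2 * x[0], 20 * x[1]])  # gradient of x1^2 + 10*x2^2
    x, v = momentum_step(x, v, g)
```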

SLIDE 44

RMSprop (Root Mean Square Propagation)

[Slide compares the update rules for gradient descent vs. RMSprop]

• Typically beta = 0.999
• s = exponentially weighted average of the squared gradients (a code sketch follows below)
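A minimal sketch of the RMSprop update described on the slide (s is the exponentially weighted average of squared gradients, typical beta = 0.999); the epsilon term is the usual guard against division by zero, and the names and learning rate are illustrative:

```python
import numpy as np

def rmsprop_step(x, s, grad_x, lr=0.01, beta=0.999, eps=1e-8):
    s = beta * s + (1 - beta) * grad_x**2     # avg of squared gradients
    x = x - lr * grad_x / (np.sqrt(s) + eps)  # per-coordinate rescaled step
    return x, s

x, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    g = np.array([2 * x[0], 20 * x[1]])  # gradient of x1^2 + 10*x2^2
    x, s = rmsprop_step(x, s, g)
```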

SLIDE 45

Adam (Adaptive Moment Estimation)

[Slide compares the update rules for gradient descent vs. Adam]

• Typically beta1 = 0.9; beta2 = 0.999; eps = 1e-8
• s = exponentially weighted average of the squared gradients
• v = momentum (a code sketch follows below)
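A minimal sketch of the Adam update, combining the momentum term v and the squared-gradient term s with the usual bias correction (the bias correction is standard in Adam even though the slide text does not mention it); names and learning rate are illustrative:

```python
import numpy as np

def adam_step(x, v, s, grad_x, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    v = beta1 * v + (1 - beta1) * grad_x     # momentum (1st moment)
    s = beta2 * s + (1 - beta2) * grad_x**2  # squared gradients (2nd moment)
    v_hat = v / (1 - beta1**t)               # bias correction for early steps
    s_hat = s / (1 - beta2**t)
    x = x - lr * v_hat / (np.sqrt(s_hat) + eps)
    return x, v, s

x, v, s = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for t in range(1, 201):                      # t starts at 1 for bias correction
    g = np.array([2 * x[0], 20 * x[1]])      # gradient of x1^2 + 10*x2^2
    x, v, s = adam_step(x, v, s, g, t)
```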