SLIDE 1

First-order methods

(OPTML++ Meeting 2)

Suvrit Sra, Massachusetts Institute of Technology
OPTML++, Fall 2015

SLIDE 2

Outline

– Lect 1: Recap on convexity
– Lect 1: Recap on duality, optimality
– First-order optimization algorithms
– Proximal methods, operator splitting


SLIDE 4

Descent methods: min_x f(x)

[Figure: iterates xk, xk+1, . . . descending toward the minimizer x∗, where ∇f(x∗) = 0]

SLIDE 8

Descent methods

[Figure: at a point x, the gradient ∇f(x) and the negative gradient −∇f(x); candidate steps x − α∇f(x) and x − δ∇f(x) along the negative gradient, and another descent direction d giving x + α₂d]

SLIDE 9

Algorithm

1. Start with some guess x0;
2. For each k = 0, 1, . . .:
   – xk+1 ← xk + αk dk
   – Check when to stop (e.g., if ∇f(xk+1) = 0)
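Below is a minimal NumPy sketch of this generic loop, with the simplest concrete choices filled in: the plain negative-gradient direction dk = −∇f(xk), a fixed stepsize, and a gradient-norm stopping test. The quadratic used at the end is only an illustrative assumption, not something from the slides.

```python
import numpy as np

def gradient_method(grad, x0, stepsize=0.1, max_iter=1000, tol=1e-8):
    """Generic iteration x_{k+1} = x_k + alpha_k * d_k with d_k = -grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:      # stop when the gradient is (numerically) zero
            break
        d = -g                            # descent direction
        x = x + stepsize * d              # x_{k+1} = x_k + alpha_k * d_k
    return x

# Illustrative use on f(x) = 0.5 * ||x||^2, whose gradient is x.
x_min = gradient_method(lambda x: x, x0=np.array([3.0, -4.0]))
```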

SLIDE 13

Gradient methods

xk+1 = xk + αkdk, k = 0, 1, . . .

– Stepsize αk ≥ 0, usually ensures f(xk+1) < f(xk)
– Descent direction dk satisfies ⟨∇f(xk), dk⟩ < 0

Numerous ways to select αk and dk

Usually methods seek monotonic descent f(xk+1) < f(xk)

SLIDE 15

Gradient methods – direction

xk+1 = xk + αkdk, k = 0, 1, . . .

◮ Different choices of direction dk

  • Scaled gradient: dk = −Dk∇f(xk), Dk ≻ 0
  • Newton’s method: Dk = [∇2f(xk)]−1
  • Quasi-Newton: Dk ≈ [∇2f(xk)]−1
  • Steepest descent: Dk = I
  • Diagonally scaled: Dk diagonal with (Dk)ii ≈ [∂2f(xk)/(∂xi)2]−1
  • Discretized Newton: Dk = [H(xk)]−1, H via finite-differences
  • . . .

Exercise: Verify that ⟨∇f(xk), dk⟩ < 0 for the above choices
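As a quick numerical companion to the exercise, the sketch below builds three of the directions above on a small positive-definite quadratic (an illustrative assumption) and checks that ⟨∇f(xk), dk⟩ < 0 for each.

```python
import numpy as np

# Toy quadratic f(x) = 0.5 * x^T Q x - b^T x with Q positive definite.
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: Q @ x - b              # gradient of f
hess = lambda x: Q                      # Hessian of f (constant for a quadratic)

x = np.array([5.0, -5.0])
g = grad(x)

directions = {
    "steepest descent (Dk = I)": -g,
    "Newton (Dk = inverse Hessian)": -np.linalg.solve(hess(x), g),
    "diagonally scaled": -g / np.diag(hess(x)),
}

for name, d in directions.items():
    print(f"{name:32s} <grad, d> = {g @ d:.4f}")   # all values should be negative
```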

SLIDE 20

Gradient methods – stepsize

◮ Exact: αk := argmin_{α ≥ 0} f(xk + αdk)

◮ Limited min: αk := argmin_{0 ≤ α ≤ s} f(xk + αdk)

◮ Armijo rule: Given fixed scalars s, β, σ with 0 < β < 1 and 0 < σ < 1 (chosen experimentally), set αk = β^mk s, where we try β^m s for m = 0, 1, . . . until sufficient descent

  f(xk) − f(xk + β^m s dk) ≥ −σ β^m s ⟨∇f(xk), dk⟩

  If ⟨∇f(xk), dk⟩ < 0, such a stepsize is guaranteed to exist. Usually σ is small, σ ∈ [10^−5, 0.1], while β ranges from 1/2 to 1/10 depending on how confident we are about the initial stepsize s.

◮ Constant: αk = 1/L (for a suitable value of L)

◮ Diminishing: αk → 0 but Σ_k αk = ∞.
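A minimal sketch of the Armijo rule as stated above: start from the initial stepsize s and keep multiplying by β until the sufficient-descent inequality holds. The default values of s, β, σ below sit in the ranges mentioned on the slide but are otherwise arbitrary choices.

```python
import numpy as np

def armijo_stepsize(f, grad_xk, xk, dk, s=1.0, beta=0.5, sigma=1e-4, max_backtracks=50):
    """Return alpha_k = beta^m * s for the smallest m with
    f(xk) - f(xk + beta^m * s * dk) >= -sigma * beta^m * s * <grad f(xk), dk>."""
    fk = f(xk)
    slope = grad_xk @ dk                  # <grad f(xk), dk>; must be < 0 for a descent direction
    alpha = s
    for _ in range(max_backtracks):
        if fk - f(xk + alpha * dk) >= -sigma * alpha * slope:
            return alpha                  # sufficient descent achieved
        alpha *= beta                     # otherwise try beta^(m+1) * s
    return alpha                          # fallback if the backtracking budget is exhausted

# Illustrative use on f(x) = 0.5 * ||x||^2 with the steepest-descent direction.
f = lambda x: 0.5 * np.dot(x, x)
xk = np.array([2.0, -1.0])
gk = xk                                   # gradient of f at xk
alpha_k = armijo_stepsize(f, gk, xk, -gk)
```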

SLIDE 24

Gradient methods – nonmonotonic steps∗

– Stepsize computation can be expensive
– Convergence analysis depends on monotonic descent
– Give up search for stepsizes
– Use closed-form formulae for stepsizes
– Don’t insist on monotonic descent? (e.g., diminishing stepsizes do not give monotonic descent)

Barzilai & Borwein stepsizes

xk+1 = xk − αk∇f(xk), k = 0, 1, . . .

αk = ⟨uk, vk⟩ / ‖vk‖²   or   αk = ‖uk‖² / ⟨uk, vk⟩,   where uk = xk − xk−1, vk = ∇f(xk) − ∇f(xk−1)
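The following sketch turns the two BB formulas into a gradient iteration; since no (uk, vk) pair exists at the first step, it falls back to a small fixed stepsize there, which is an implementation choice rather than something prescribed on the slide.

```python
import numpy as np

def bb_gradient(grad, x0, alpha0=1e-3, max_iter=100, variant=1):
    """x_{k+1} = x_k - alpha_k * grad f(x_k) with Barzilai-Borwein stepsizes."""
    x = np.asarray(x0, dtype=float)
    x_prev = g_prev = None
    for k in range(max_iter):
        g = grad(x)
        if x_prev is None:
            alpha = alpha0                 # no history yet: use a fixed initial step
        else:
            u = x - x_prev                 # u_k = x_k - x_{k-1}
            v = g - g_prev                 # v_k = grad f(x_k) - grad f(x_{k-1})
            if np.allclose(v, 0.0):        # gradient stopped changing: (near) stationary
                break
            if variant == 1:
                alpha = (u @ v) / (v @ v)  # alpha_k = <u_k, v_k> / ||v_k||^2
            else:
                alpha = (u @ u) / (u @ v)  # alpha_k = ||u_k||^2 / <u_k, v_k>
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

# Illustrative use on the convex quadratic f(x) = 0.5 * x^T Q x.
Q = np.array([[10.0, 0.0], [0.0, 1.0]])
x_min = bb_gradient(lambda x: Q @ x, np.array([1.0, 1.0]))
```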

SLIDE 25

Least-squares

SLIDE 26

Nonnegative least squares

min_{x ≥ 0} (1/2)‖Ax − b‖²

intensities, concentrations, frequencies, . . .

Applications

Machine learning, Statistics, Image Processing, Computer Vision, Medical Imaging, Astronomy, Physics, Bioinformatics, Remote Sensing, Engineering, Inverse problems, Finance

SLIDE 27

NNLS: min ‖Ax − b‖² s.t. x ≥ 0

Unconstrained solution: solve ∇f(x) = 0 ⟹ xuc = (AᵀA)⁻¹Aᵀb
Cannot just truncate: x = (xuc)₊

[Figure: the NNLS solution x∗, the unconstrained solution xuc, and its truncation (xuc)₊]

x ≥ 0 makes the problem trickier as the problem size grows
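A small illustrative example (not from the slides) of why truncation fails: take A = [[1, 1], [0, 1]] and b = (−1, 1). Then xuc = (AᵀA)⁻¹Aᵀb = (−2, 1), so the truncation (xuc)₊ = (0, 1) gives ‖Ax − b‖² = 4, whereas the NNLS solution is x∗ = (0, 0) with ‖Ax − b‖² = 2.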

SLIDE 29

Solving NNLS scalably

[Figure: x∗ and xuc, as on the previous slide]

x ← (x − α∇f(x))+

Good choice of α crucial:
◮ Backtracking line-search
◮ Armijo
◮ and many others

Too slow!
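For reference, here is a minimal sketch of the basic projected-gradient iteration x ← (x − α∇f(x))₊ that the next slides set out to accelerate. It uses the fixed stepsize α = 1/L with L = ‖AᵀA‖₂, a safe but conservative assumption, instead of the line-search rules listed above; the random test instance is purely illustrative.

```python
import numpy as np

def nnls_projected_gradient(A, b, max_iter=500):
    """Projected gradient for min_{x >= 0} 0.5 * ||A x - b||^2 with stepsize 1/L."""
    n = A.shape[1]
    L = np.linalg.norm(A.T @ A, 2)            # Lipschitz constant of the gradient
    alpha = 1.0 / L
    x = np.zeros(n)
    for _ in range(max_iter):
        grad = A.T @ (A @ x - b)              # gradient of the least-squares objective
        x = np.maximum(x - alpha * grad, 0.0) # gradient step, then project onto x >= 0
    return x

# Illustrative use on a small random instance.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
x = nnls_projected_gradient(A, b)
```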

SLIDE 30

NNLS: a long-studied problem

Method           Remarks              Scalability   Accuracy
NNLS (1976)      MATLAB default       poor          high
FNNLS (1989)     fast NNLS            poor          high
LBFGS-B (1997)   famous solver        fair          medium
TRON (1999)      TR Newton            poor          high
SPG (2000)       spectral proj.       fair+         medium
ASA (2006)       prev. state-of-art   fair+         medium
SBB (2011)       subspace BB steps    very good     medium

SLIDE 31

Spectacular failure of projection

[Plot: objective function value vs. running time (seconds); curve: Naive BB + Projection]

x′ = (x − α∇f(x))+

SLIDE 32

Rescue: occasional line-search?

[Plot: objective function value vs. running time (seconds); curves: Naive BB + Projection and Naive BB + Linesearch]

Mix BB-step with linesearch

SLIDE 34

Can we completely avoid linesearch?

Do not use all coordinates to compute α!
“Subspace-BB” (SBB), Kim, Sra, Dhillon (OMS, 2011)
– Identify fixed variables (those likely to satisfy xi = 0)
– Compute α using free variables (most crucial step!)
– SBB convergence theorem
– Global rate: open problem
– Empirically great!
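The sketch below is a rough reconstruction of the idea from the bullet points only: guess which variables are fixed at the bound, compute a BB stepsize from the remaining free coordinates, and take a projected step. It is a simplified illustration, not the actual SBB algorithm of Kim, Sra, Dhillon (2011).

```python
import numpy as np

def sbb_nnls(A, b, max_iter=200, alpha0=1e-3, eps=1e-10):
    """Simplified subspace-BB-style iteration for min_{x >= 0} 0.5 * ||A x - b||^2."""
    n = A.shape[1]
    x = np.zeros(n)
    g = A.T @ (A @ x - b)
    x_prev, g_prev = x.copy(), g.copy()
    alpha = alpha0
    for _ in range(max_iter):
        # Guess the "fixed" variables: at the bound and pushed against it by the gradient.
        fixed = (x <= eps) & (g > 0)
        free = ~fixed
        # BB stepsize computed only from the free coordinates.
        u = (x - x_prev)[free]
        v = (g - g_prev)[free]
        if u @ v > 0:
            alpha = (u @ u) / (u @ v)          # BB step restricted to the free subspace
        x_prev, g_prev = x.copy(), g.copy()
        x = np.maximum(x - alpha * g, 0.0)     # gradient step followed by projection
        g = A.T @ (A @ x - b)
    return x
```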

SLIDE 35

SBB: simplicity and scalability

[Plot: objective function value vs. running time (seconds); curves: Naive BB + Projection, Naive BB + Linesearch, and SBB]

SLIDE 37

Numerical result

Algorithm            Time     ‖Ax − b‖²   Convg. tol.
LBFGS-B (FORTRAN)    19000s   20.2        1.0E-03
SPG (FORTRAN)        8600s    20.5        3.8E-01
ASA (C++)            1001s    24.5        4.8E-02
SBB (MATLAB)         201s     21.2        8.7E-03

(“medium” 20,000 × 1,350,000 matrix)

SLIDE 40

Back to gradient-descent

Assumption: Lipschitz continuous gradient; denoted f ∈ C¹_L

‖∇f(x) − ∇f(y)‖₂ ≤ L‖x − y‖₂

♣ Gradient vectors of nearby points are close to each other
♣ Objective function has “bounded curvature”
♣ Speed at which the gradient varies is bounded

Lemma (Descent). Let f ∈ C¹_L. Then,

f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (L/2)‖y − x‖₂²

Theorem. Let f ∈ C¹_L and (xk) be the sequence generated as above, with αk = 1/L. Then, f(xk+1) − f(x∗) = O(1/k).
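A quick numerical illustration (not on the slides) of the constant stepsize αk = 1/L: for the least-squares objective f(x) = ½‖Ax − b‖², the gradient is Lipschitz with L = λmax(AᵀA), and running plain gradient descent with that stepsize drives the gap f(xk) − f(x∗) down, consistent with the O(1/k) guarantee. The random instance below is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
b = rng.standard_normal(50)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad = lambda x: A.T @ (A @ x - b)
L = np.linalg.eigvalsh(A.T @ A).max()        # Lipschitz constant of the gradient

x = np.zeros(10)
for k in range(200):
    x = x - (1.0 / L) * grad(x)              # constant stepsize alpha_k = 1/L

x_star = np.linalg.lstsq(A, b, rcond=None)[0]
print(f(x) - f(x_star))                       # small and shrinking with k, as O(1/k) suggests
```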

SLIDE 41

Linear convergence

Assumption: Strong convexity; denoted f ∈ S¹_{L,µ}

f(x) ≥ f(y) + ⟨∇f(y), x − y⟩ + (µ/2)‖x − y‖₂²

Setting αk = 2/(µ + L) yields a linear rate (µ > 0)

SLIDE 42

Strongly convex – linear rate

Theorem. If f ∈ S¹_{L,µ} and 0 < α < 2/(L + µ), then the gradient method generates a sequence (xk) that satisfies

‖xk − x∗‖₂² ≤ (1 − 2αµL/(µ + L))^k ‖x0 − x∗‖₂².

Moreover, if α = 2/(L + µ), then

f(xk) − f∗ ≤ (L/2) ((κ − 1)/(κ + 1))^{2k} ‖x0 − x∗‖₂²,

where κ = L/µ is the condition number.
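To get a feel for this rate (an illustrative calculation, not on the slides): with condition number κ = 100 and α = 2/(L + µ), the per-iteration contraction factor is ((κ − 1)/(κ + 1))² = (99/101)² ≈ 0.961, so the bound shrinks by roughly 4% per iteration and about ln(10)/0.04 ≈ 58 iterations are needed for each extra digit of accuracy in f(xk) − f∗.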

SLIDE 44

Gradient methods – lower bounds

xk+1 = xk − αk∇f(xk)

Theorem (Lower bound I, Nesterov). For any x0 ∈ Rⁿ and 1 ≤ k ≤ (n − 1)/2, there is a smooth f such that

f(xk) − f(x∗) ≥ 3L‖x0 − x∗‖₂² / (32(k + 1)²)

Theorem (Lower bound II, Nesterov). For the class of smooth, strongly convex functions, i.e., S∞_{L,µ} (µ > 0, κ > 1),

f(xk) − f(x∗) ≥ (µ/2) ((√κ − 1)/(√κ + 1))^{2k} ‖x0 − x∗‖₂².
