SLIDE 1

Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage

Madeleine Udell
Operations Research and Information Engineering, Cornell University
Based on joint work with Alp Yurtsever (EPFL), Volkan Cevher (EPFL), and Joel Tropp (Caltech)
LCCC, June 15, 2017

SLIDE 2

Desiderata

Suppose that the solution to a convex optimization problem has a compact representation.

Problem data: O(n) → Working memory: O(???) → Solution: O(n)

Can we develop algorithms that provably solve the problem using storage bounded by the size of the problem data and the size of the solution?

SLIDE 3

Model problem: low rank matrix optimization

consider a convex problem with decision variable X ∈ R^{m×n}; the compact matrix optimization problem:

    minimize    f(AX)
    subject to  ‖X‖_S1 ≤ α          (CMOP)

◮ A : R^{m×n} → R^d is a linear map
◮ f : R^d → R is convex and smooth
◮ ‖X‖_S1 is the Schatten-1 norm: the sum of the singular values

SLIDE 4

Model problem: low rank matrix optimization

consider a convex problem with decision variable X ∈ R^{m×n}; the compact matrix optimization problem:

    minimize    f(AX)
    subject to  ‖X‖_S1 ≤ α          (CMOP)

◮ A : R^{m×n} → R^d is a linear map
◮ f : R^d → R is convex and smooth
◮ ‖X‖_S1 is the Schatten-1 norm: the sum of the singular values

assume

◮ compact specification: the problem data use O(n) storage
◮ compact solution: rank(X⋆) = r is constant

SLIDE 5

Model problem: low rank matrix optimization

consider a convex problem with decision variable X ∈ R^{m×n}; the compact matrix optimization problem:

    minimize    f(AX)
    subject to  ‖X‖_S1 ≤ α          (CMOP)

◮ A : R^{m×n} → R^d is a linear map
◮ f : R^d → R is convex and smooth
◮ ‖X‖_S1 is the Schatten-1 norm: the sum of the singular values

assume

◮ compact specification: the problem data use O(n) storage
◮ compact solution: rank(X⋆) = r is constant

Note: the same ideas work for X ⪰ 0

SLIDE 6

Are desiderata achievable?

minimize f(AX) subject to ‖X‖_S1 ≤ α

(CMOP), using any first-order method:

Problem data: O(n) → Working memory: O(n²) → Solution: O(n)

SLIDE 7

Are desiderata achievable?

(CMOP), using ???:

Problem data: O(n) → Working memory: O(???) → Solution: O(n)

SLIDE 8

Application: matrix completion

find X matching M on observed entries:

    minimize    Σ_{(i,j)∈Ω} (X_ij − M_ij)²
    subject to  ‖X‖_S1 ≤ α

◮ m, n = number of rows and columns of the matrix to complete
◮ d = |Ω| = number of observations
◮ A selects the observed entries X_ij, (i, j) ∈ Ω
◮ f(AX) = ‖AX − AM‖²

compact if d = O(n) observations and rank(X⋆) is constant
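To make the pieces concrete, here is a minimal sketch in Python/NumPy; the names `A_select`, `rows`, `cols`, and `b` are ours, not from the slides.

```python
import numpy as np

# A minimal sketch (names and conventions ours) of the matrix completion
# pieces: A selects observed entries, and f is the squared error against b = A(M).

def A_select(X, rows, cols):
    """Linear map A: return the observed entries of X as a vector in R^d."""
    return X[rows, cols]

def f(z, b):
    """Smooth objective f(z) = ||z - b||^2."""
    return np.sum((z - b) ** 2)

# Example: a 4x5 matrix observed at three entries.
rows, cols = np.array([0, 2, 3]), np.array([1, 3, 0])
M = np.arange(20.0).reshape(4, 5)
b = A_select(M, rows, cols)                           # b = A(M), d = 3 observations
print(f(A_select(np.zeros((4, 5)), rows, cols), b))   # error of the iterate X = 0
```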

SLIDE 9

Application: matrix completion

find X matching M on observed entries:

    minimize    Σ_{(i,j)∈Ω} (X_ij − M_ij)²
    subject to  ‖X‖_S1 ≤ α

◮ m, n = number of rows and columns of the matrix to complete
◮ d = |Ω| = number of observations
◮ A selects the observed entries X_ij, (i, j) ∈ Ω
◮ f(AX) = ‖AX − AM‖²

compact if d = O(n) observations and rank(X⋆) is constant

Theorem: the ε-rank of M grows as log(m + n) if rows and cols are iid (under some technical conditions) (Udell & Townsend, 2017)

SLIDE 10

Application: Phase retrieval

◮ image with n pixels x♮ ∈ C^n
◮ acquire noisy nonlinear measurements b_i = |⟨a_i, x♮⟩|² + ω_i
◮ relax: if X = x♮x♮∗, then

    |⟨a_i, x♮⟩|² = x♮∗ a_i a_i∗ x♮ = tr(a_i a_i∗ x♮x♮∗) = tr(a_i a_i∗ X)

◮ recover the image by solving

    minimize    f(AX; b)
    subject to  tr(X) ≤ α, X ⪰ 0

compact if d = O(n) observations and rank(X⋆) is constant
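The lifting step is easy to check numerically; this snippet (ours, treating the a_i as column vectors) verifies the identity on random complex data.

```python
import numpy as np

# A quick numeric check (ours) of the lifting identity
# |<a, x>|^2 = tr(a a* X) when X = x x*.
rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=n) + 1j * rng.normal(size=n)
a = rng.normal(size=n) + 1j * rng.normal(size=n)

lhs = abs(np.vdot(a, x)) ** 2                 # |<a, x>|^2 (vdot conjugates a)
X = np.outer(x, x.conj())                     # lifted variable X = x x*
rhs = np.trace(np.outer(a, a.conj()) @ X)     # tr(a a* X)
print(np.isclose(lhs, rhs.real))              # True: the measurement is linear in X
```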

SLIDE 11

Optimal Storage

What kind of storage bounds can we hope for?

◮ Assume a black-box implementation of the three operations

    A(uv∗),   u∗(A∗z),   (A∗z)v

  where u ∈ R^m, v ∈ R^n, and z ∈ R^d
◮ Need Ω(m + n + d) storage to apply the linear map
◮ Need Θ(r(m + n)) storage for a rank-r approximate solution

Definition. An algorithm for the model problem has optimal storage if its working storage is Θ(d + r(m + n)).
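A minimal sketch (ours) of the three black-box primitives for the matrix-completion map A, with Ω stored as index arrays; each call touches only O(m + n + d) numbers, and the m-by-n matrix never exists.

```python
import numpy as np

# Hypothetical black-box primitives (ours) for the matrix-completion map A.
m, n = 4, 5
rows = np.array([0, 2, 3])          # row indices of observed entries
cols = np.array([1, 3, 0])          # column indices of observed entries

def A_rank1(u, v):
    """A(u v*): sample the rank-1 matrix u v* at the observed entries."""
    return u[rows] * v[cols]

def left_apply_adjoint(u, z):
    """u*(A* z): A* z places z_k at (rows[k], cols[k]); multiply by u* on the left."""
    return np.bincount(cols, weights=u[rows] * z, minlength=n)

def right_apply_adjoint(z, v):
    """(A* z) v: the same sparse matrix, applied to v on the right."""
    return np.bincount(rows, weights=z * v[cols], minlength=m)

u, v, z = np.ones(m), np.ones(n), np.ones(len(rows))
print(A_rank1(u, v), left_apply_adjoint(u, z), right_apply_adjoint(z, v))
```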

SLIDE 12

Goal: optimal storage

We can specify the problem using O(n) ≪ mn units of storage. Can we solve the problem using only O(n) units of storage?

SLIDE 13

Goal: optimal storage

We can specify the problem using O(n) ≪ mn units of storage. Can we solve the problem using only O(n) units of storage? If we write down X, we’ve already failed.

SLIDE 14

A brief biased history of matrix optimization

◮ 1990s: Interior-point methods
  ◮ storage cost Θ((m + n)⁴) for the Hessian
◮ 2000s: Convex first-order methods
  ◮ (accelerated) proximal gradient and others
  ◮ store the matrix variable: Θ(mn)

(Interior-point: Nemirovski & Nesterov 1994; . . . ; First-order: Rockafellar 1976; Auslender & Teboulle 2006; . . . )

SLIDE 15

A brief biased history of matrix optimization

◮ 2008–Present: Storage-efficient convex first-order methods
  ◮ conditional gradient method (CGM) and extensions
  ◮ store the matrix in low-rank form: O(t(m + n)) after t iterations
  ◮ requires Θ(mn) storage once t ≥ min(m, n)
◮ 2003–Present: Nonconvex heuristics
  ◮ Burer–Monteiro factorization idea + various optimization algorithms
  ◮ store low-rank matrix factors: Θ(r(m + n))
  ◮ for a guaranteed solution, need unrealistic + unverifiable statistical assumptions

(CGM: Frank & Wolfe 1956; Levitin & Poljak 1967; Hazan 2008; Clarkson 2010; Jaggi 2013; . . . ; Heuristics: Burer & Monteiro 2003; Keshavan et al. 2009; Jain et al. 2012; Bhojanapalli et al. 2015; Candès et al. 2014; Boumal et al. 2015; . . . )

SLIDE 16

The dilemma

◮ convex methods: slow memory hogs with guarantees
◮ nonconvex methods: fast, lightweight, but brittle

SLIDE 17

The dilemma

◮ convex methods: slow memory hogs with guarantees
◮ nonconvex methods: fast, lightweight, but brittle

low memory or guaranteed convergence . . . but not both?

SLIDE 18

Conditional Gradient Method

[Figure: one step of CGM on the ball ‖X‖_S1 ≤ α: from the iterate X^t, compute H^t = argmax_{‖X‖_S1 ≤ α} ⟨X, −∇g(X^t)⟩ and move to X^{t+1} on the segment between them.]

    minimize    f(AX) = g(X)
    subject to  ‖X‖_S1 ≤ α

SLIDE 19

Conditional Gradient Method

minimize f(AX) subject to ‖X‖_S1 ≤ α

CGM. Set X^0 = 0. For t = 0, 1, . . . :

◮ compute G^t = A∗∇f(AX^t)
◮ set search direction

    H^t = argmax_{‖X‖_S1 ≤ α} ⟨X, −G^t⟩

◮ set stepsize η_t = 2/(t + 2)
◮ update X^{t+1} = (1 − η_t)X^t + η_t H^t
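For reference, a dense NumPy implementation (ours) of this loop for matrix completion; it stores X in full, so it illustrates the oracle and the update rule but is not storage optimal.

```python
import numpy as np

# A dense reference implementation (ours) of CGM on the Schatten-1 ball,
# specialized to matrix completion with f(z) = ||z - b||^2.
def cgm_dense(m, n, rows, cols, b, alpha, iters=200):
    X = np.zeros((m, n))
    for t in range(iters):
        grad = 2.0 * (X[rows, cols] - b)          # ∇f(AX)
        G = np.zeros((m, n))
        G[rows, cols] = grad                      # G = A* ∇f(AX)
        U, s, Vt = np.linalg.svd(-G)              # top singular pair of -G
        H = alpha * np.outer(U[:, 0], Vt[0])      # argmax of <X, -G> over the ball
        eta = 2.0 / (t + 2.0)
        X = (1 - eta) * X + eta * H               # convex combination step
    return X
```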

SLIDE 20

Conditional gradient method (CGM)

features:

◮ relies on an efficient linear optimization oracle to compute

    H^t = argmax_{‖X‖_S1 ≤ α} ⟨X, −G^t⟩

◮ a bound on suboptimality follows from the subgradient inequality:

    f(AX^t) − f(AX⋆) ≤ ⟨X^t − X⋆, G^t⟩ = ⟨X^t − X⋆, A∗∇f(AX^t)⟩ = ⟨AX^t − AX⋆, ∇f(AX^t)⟩

  which provides a stopping condition: replacing X⋆ with the search direction H^t can only increase the right-hand side, yielding the computable bound ⟨AX^t − AH^t, ∇f(AX^t)⟩

◮ faster variants: linesearch, away steps, . . .

SLIDE 21

Linear optimization oracle for MOP

compute the search direction

    argmax_{‖X‖_S1 ≤ α} ⟨X, −G⟩

SLIDE 22

Linear optimization oracle for MOP

compute the search direction

    argmax_{‖X‖_S1 ≤ α} ⟨X, −G⟩

◮ the solution is given by the maximum singular vectors of −G:

    −G = Σ_{i=1}^n σ_i u_i v_i∗   ⟹   X = α u₁v₁∗

◮ use the Lanczos method: only need to apply G and G∗ (see the sketch below)
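One way (ours) to realize this oracle in practice: SciPy's `svds` accepts a `LinearOperator`, so only the two matrix-vector products G v and G∗ u are ever needed and −G is never materialized.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, svds

# Top singular pair of -G through matvecs only (conventions ours):
# G = A* grad is supported on the observation pattern Ω.
m, n, d = 1000, 800, 5000
rng = np.random.default_rng(0)
rows, cols = rng.integers(0, m, d), rng.integers(0, n, d)
grad = rng.normal(size=d)                 # stand-in for ∇f(AX) on Ω

def matvec(v):                            # (-G) v
    v = np.ravel(v)
    return -np.bincount(rows, weights=grad * v[cols], minlength=m)

def rmatvec(u):                           # (-G)* u
    u = np.ravel(u)
    return -np.bincount(cols, weights=grad * u[rows], minlength=n)

op = LinearOperator((m, n), matvec=matvec, rmatvec=rmatvec)
u, sigma, vt = svds(op, k=1)              # Lanczos-type iteration under the hood
```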

SLIDE 23

Conditional gradient descent

Algorithm 1 CGM for the model problem (CMOP)
Input: problem data for (CMOP); suboptimality ε
Output: solution X⋆

1:  function CGM
2:    X ← 0
3:    for t ← 0, 1, . . . do
4:      (u, v) ← MaxSingVec(−A∗(∇f(AX)))
5:      H ← −α uv∗
6:      if ⟨AX − AH, ∇f(AX)⟩ ≤ ε then break
7:      η ← 2/(t + 2)
8:      X ← (1 − η)X + ηH
9:    return X

SLIDE 24

Two crucial ideas

To solve the problem using optimal storage:

◮ Use the low-dimensional “dual” variable z^t = AX^t ∈ R^d to drive the iteration.
◮ Recover the solution from a small (randomized) sketch.

SLIDE 25

Two crucial ideas

To solve the problem using optimal storage:

◮ Use the low-dimensional “dual” variable z^t = AX^t ∈ R^d to drive the iteration.
◮ Recover the solution from a small (randomized) sketch.

Never write down X until it has converged to low rank.

SLIDE 26

Conditional gradient descent

Algorithm 2 CGM for the model problem (CMOP)
Input: problem data for (CMOP); suboptimality ε
Output: solution X⋆

1:  function CGM
2:    X ← 0
3:    for t ← 0, 1, . . . do
4:      (u, v) ← MaxSingVec(−A∗(∇f(AX)))
5:      H ← −α uv∗
6:      if ⟨AX − AH, ∇f(AX)⟩ ≤ ε then break
7:      η ← 2/(t + 2)
8:      X ← (1 − η)X + ηH
9:    return X

SLIDE 27

Conditional gradient descent

Introduce the “dual variable” z = AX ∈ R^d; eliminate X.

Algorithm 3 Dual CGM for the model problem (CMOP)
Input: problem data for (CMOP); suboptimality ε
Output: solution X⋆

1:  function dualCGM
2:    z ← 0
3:    for t ← 0, 1, . . . do
4:      (u, v) ← MaxSingVec(−A∗(∇f(z)))
5:      h ← A(−αuv∗)
6:      if ⟨z − h, ∇f(z)⟩ ≤ ε then break
7:      η ← 2/(t + 2)
8:      z ← (1 − η)z + ηh
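A storage-light sketch (ours) of this loop for matrix completion, with the maximum singular pair computed by power iteration through the pattern Ω; only z ∈ R^d and one singular pair are ever in memory, and the loop finishes without ever holding X.

```python
import numpy as np

# Dual CGM sketch (ours): drive the iteration entirely in z = A(X).

def top_singular_pair(grad, rows, cols, m, n, power_iters=50, seed=0):
    """Top singular pair of M = -A*(grad) by power iteration, applying M and
    M* only through Ω. The final half-step fixes the sign so <u v*, M> >= 0."""
    v = np.random.default_rng(seed).normal(size=n)
    v /= np.linalg.norm(v)
    for _ in range(power_iters):
        u = -np.bincount(rows, weights=grad * v[cols], minlength=m)   # M v
        u /= np.linalg.norm(u)
        v = -np.bincount(cols, weights=grad * u[rows], minlength=n)   # M* u
        v /= np.linalg.norm(v)
    u = -np.bincount(rows, weights=grad * v[cols], minlength=m)
    u /= np.linalg.norm(u)
    return u, v

def dual_cgm(m, n, rows, cols, b, alpha, iters=500):
    z = np.zeros(len(b))
    for t in range(iters):
        grad = 2.0 * (z - b)                       # ∇f(z) for f(z) = ||z - b||^2
        u, v = top_singular_pair(grad, rows, cols, m, n, seed=t)
        h = alpha * u[rows] * v[cols]              # h = A(α u v*) in R^d
        eta = 2.0 / (t + 2.0)
        z = (1 - eta) * z + eta * h
    return z                                       # f is solved... but X is gone
```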

SLIDE 28

Conditional gradient descent

Introduce the “dual variable” z = AX ∈ R^d; eliminate X.

Algorithm 4 Dual CGM for the model problem (CMOP)
Input: problem data for (CMOP); suboptimality ε
Output: solution X⋆

1:  function dualCGM
2:    z ← 0
3:    for t ← 0, 1, . . . do
4:      (u, v) ← MaxSingVec(−A∗(∇f(z)))
5:      h ← A(−αuv∗)
6:      if ⟨z − h, ∇f(z)⟩ ≤ ε then break
7:      η ← 2/(t + 2)
8:      z ← (1 − η)z + ηh

we’ve solved the problem. . . but where’s the solution?

SLIDE 29

Two crucial ideas

1. Use the low-dimensional “dual” variable z^t = AX^t ∈ R^d to drive the iteration.
2. Recover the solution from a small (randomized) sketch.

SLIDE 30

How to catch a low rank matrix

if X̂ has the same rank as X⋆, and X̂ acts like X⋆ (on its range and co-range), then X̂ is X⋆

SLIDE 31

How to catch a low rank matrix

if X̂ has the same rank as X⋆, and X̂ acts like X⋆ (on its range and co-range), then X̂ is X⋆

◮ see a series of additive updates
◮ remember how the matrix acts
◮ reconstruct a low-rank matrix that acts like X⋆

SLIDE 32

Single-pass randomized sketch

◮ Draw and fix two independent standard normal matrices Ω ∈ R^{n×k} and Ψ ∈ R^{ℓ×m}, with k = 2r + 1 and ℓ = 4r + 2.

SLIDE 33

Single-pass randomized sketch

◮ Draw and fix two independent standard normal matrices Ω ∈ R^{n×k} and Ψ ∈ R^{ℓ×m}, with k = 2r + 1 and ℓ = 4r + 2.

◮ The sketch consists of two matrices that capture the range and co-range of X:

    Y = XΩ ∈ R^{m×k}   and   W = ΨX ∈ R^{ℓ×n}

SLIDE 34

Single-pass randomized sketch

◮ Draw and fix two independent standard normal matrices Ω ∈ R^{n×k} and Ψ ∈ R^{ℓ×m}, with k = 2r + 1 and ℓ = 4r + 2.

◮ The sketch consists of two matrices that capture the range and co-range of X:

    Y = XΩ ∈ R^{m×k}   and   W = ΨX ∈ R^{ℓ×n}

◮ Rank-1 updates to X can be performed on the sketch:

    X′ = β₁X + β₂uv∗   ⟹   Y′ = β₁Y + β₂u(v∗Ω)   and   W′ = β₁W + β₂(Ψu)v∗

SLIDE 35

Single-pass randomized sketch

◮ Draw and fix two independent standard normal matrices Ω ∈ R^{n×k} and Ψ ∈ R^{ℓ×m}, with k = 2r + 1 and ℓ = 4r + 2.

◮ The sketch consists of two matrices that capture the range and co-range of X:

    Y = XΩ ∈ R^{m×k}   and   W = ΨX ∈ R^{ℓ×n}

◮ Rank-1 updates to X can be performed on the sketch:

    X′ = β₁X + β₂uv∗   ⟹   Y′ = β₁Y + β₂u(v∗Ω)   and   W′ = β₁W + β₂(Ψu)v∗

◮ Both the storage cost of the sketch and the arithmetic cost of an update are O(r(m + n)).
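A minimal sketch object (ours) implementing the two test matrices and the rank-1 update rule; the total state is O(r(m + n)) numbers.

```python
import numpy as np

# A minimal sketch object (ours) for the range/co-range sketch.
class MatrixSketch:
    def __init__(self, m, n, r, seed=0):
        rng = np.random.default_rng(seed)
        k, ell = 2 * r + 1, 4 * r + 2
        self.Omega = rng.normal(size=(n, k))       # fixed test matrix Ω
        self.Psi = rng.normal(size=(ell, m))       # fixed test matrix Ψ
        self.Y = np.zeros((m, k))                  # tracks Y = X Ω
        self.W = np.zeros((ell, n))                # tracks W = Ψ X

    def rank1_update(self, beta1, beta2, u, v):
        """Track X' = β1 X + β2 u v* without ever storing X."""
        self.Y = beta1 * self.Y + beta2 * np.outer(u, v @ self.Omega)
        self.W = beta1 * self.W + beta2 * np.outer(self.Psi @ u, v)
```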

SLIDE 36

Recovery from sketch

To recover a rank-r approximation X̂ from the sketch, compute:

1. Y = QR          (tall-skinny QR)
2. B = (ΨQ)†W      (small QR + back-substitution)
3. X̂ = Q[B]_r      (tall-skinny SVD)
SLIDE 37

Recovery from sketch

To recover a rank-r approximation X̂ from the sketch, compute:

1. Y = QR          (tall-skinny QR)
2. B = (ΨQ)†W      (small QR + back-substitution)
3. X̂ = Q[B]_r      (tall-skinny SVD)

Theorem (Reconstruction)
Fix a target rank r. Let X be a matrix, and let (Y, W) be a sketch of X. The reconstruction procedure above yields a rank-r matrix X̂ with

    E ‖X − X̂‖_F ≤ 2 ‖X − [X]_r‖_F.

Similar bounds hold with high probability.

(Tropp, Yurtsever, Udell & Cevher, 2016)

SLIDE 38

Recovery from sketch: intuition

recall Y = XΩ ∈ R^{m×k} and W = ΨX ∈ R^{ℓ×n}

◮ if Q is an orthonormal basis for R(X), then X = QQ∗X
◮ if QR = XΩ, then Q is (approximately) a basis for R(X)
◮ and since W = ΨX, we can estimate

    W = ΨX ≈ ΨQQ∗X   ⟹   (ΨQ)†W ≈ Q∗X

◮ hence we may reconstruct X as

    X ≈ QQ∗X ≈ Q(ΨQ)†W
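This intuition is easy to test numerically: for an exactly rank-r matrix the recovery is exact up to floating-point error (snippet ours).

```python
import numpy as np

# Numeric check (ours): Q (ΨQ)† W reproduces an exactly rank-r matrix X.
rng = np.random.default_rng(0)
m, n, r = 60, 50, 3
X = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))    # exactly rank r
k, ell = 2 * r + 1, 4 * r + 2
Omega, Psi = rng.normal(size=(n, k)), rng.normal(size=(ell, m))
Y, W = X @ Omega, Psi @ X                                # the two sketches
Q, _ = np.linalg.qr(Y)                                   # basis for range(X)
X_hat = Q @ np.linalg.lstsq(Psi @ Q, W, rcond=None)[0]   # Q (ΨQ)† W
print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))     # ~1e-15
```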

SLIDE 39

SketchyCGM

Algorithm 5 SketchyCGM for the model problem (CMOP)
Input: problem data; suboptimality ε; target rank r
Output: rank-r approximate solution X̂ = UΣV∗

1:   function SketchyCGM
2:     Sketch.Init(m, n, r)
3:     z ← 0
4:     for t ← 0, 1, . . . do
5:       (u, v) ← MaxSingVec(−A∗(∇f(z)))
6:       h ← A(−αuv∗)
7:       if ⟨z − h, ∇f(z)⟩ ≤ ε then break
8:       η ← 2/(t + 2)
9:       z ← (1 − η)z + ηh
10:      Sketch.CGMUpdate(−αu, v, η)
11:    (U, Σ, V) ← Sketch.Reconstruct( )
12:    return (U, Σ, V)

SLIDE 40

Guarantees

Suppose

◮ X_cgm^(t) is the tth CGM iterate
◮ [X_cgm^(t)]_r is the best rank-r approximation to the CGM iterate
◮ X̂^(t) is the SketchyCGM reconstruction after t iterations

Theorem (Convergence to the CGM solution)
After t iterations, the SketchyCGM reconstruction satisfies

    E ‖X̂^(t) − X_cgm^(t)‖_F ≤ 2 ‖[X_cgm^(t)]_r − X_cgm^(t)‖_F.

If in addition X⋆ = lim_{t→∞} X_cgm^(t) has rank r, then the right-hand side → 0!

(Tropp, Yurtsever, Udell & Cevher, 2016)

SLIDE 41

Convergence when rank(X_cgm) ≤ r

[Figure: the iterates travel from X^0 to X_cgm, and the reconstruction coincides with the limit: X_cgm = X̂.]

SLIDE 42

Convergence when rank(X_cgm) > r

[Figure: the iterates travel from X^0 to X_cgm; the reconstruction X̂ lies near [X_cgm]_r, the best rank-r approximation of X_cgm.]

SLIDE 43

Guarantees (II)

Theorem (Convergence rate)
Fix κ > 0 and ν ≥ 1. Suppose the (unique) solution X⋆ of (CMOP) has rank(X⋆) ≤ r and

    f(AX) − f(AX⋆) ≥ κ ‖X − X⋆‖_F^ν   for all ‖X‖_S1 ≤ α.   (1)

Then we have the error bound

    E ‖X̂^t − X⋆‖_F ≤ 6 (2κ⁻¹C / (t + 2))^{1/ν}   for t = 0, 1, 2, . . . ,

where C is the curvature constant (Eqn. (3), Jaggi 2013) of the problem (CMOP).

SLIDE 44

Application: Phase retrieval

◮ image with n pixels x♮ ∈ C^n
◮ acquire noisy measurements b_i = |⟨a_i, x♮⟩|² + ω_i
◮ recover the image by solving

    minimize    f(AX; b)
    subject to  tr(X) ≤ α, X ⪰ 0

SLIDE 45

SketchyCGM is scalable

[Figure: (a) memory usage (bytes) vs. signal length n for AT, PGM, CGM, ThinCGM, and SketchyCGM; (b) relative error in the solution vs. iterations for n = 8 · 10⁶.]

◮ PGM = proximal gradient method (via TFOCS (Becker, Candès & Grant, 2011))
◮ AT = accelerated PGM (Auslender & Teboulle, 2006) (via TFOCS)
◮ CGM = conditional gradient method (Jaggi, 2013)
◮ ThinCGM = CGM with thin SVD updates (Yurtsever, Hsieh & Cevher, 2015)
◮ SketchyCGM = ours, using r = 1
SLIDE 46

SketchyCGM is reliable

Fourier ptychography:

◮ imaging blood cells with A = subsampled FFT
◮ n = 25,600, d = 185,600
◮ rank(X⋆) ≈ 5 (empirically)

[Figure: reconstructed images from (a) SketchyCGM, (b) Burer–Monteiro, (c) Wirtinger Flow.]

◮ brightness indicates the phase of each pixel (thickness of the sample)
◮ red boxes mark malaria parasites in the blood cells

SLIDE 47

Conclusion

SketchyCGM offers a proof-of-concept convex method with optimal storage for low-rank matrix optimization, using two new ideas:

◮ Drive the algorithm using a smaller (dual) variable.
◮ Sketch and recover the decision variable.

References:

◮ J. A. Tropp, A. Yurtsever, M. Udell, and V. Cevher. Randomized single-view algorithms for low-rank matrix reconstruction. SIMAX (to appear).
◮ A. Yurtsever, M. Udell, J. A. Tropp, and V. Cevher. Sketchy Decisions: Convex Optimization with Optimal Storage. AISTATS 2017.
