SLIDE 1 The Power of Nonconvex Optimization in Solving Random Quadratic Systems of Equations
Yuxin Chen (Princeton), Emmanuel Candès (Stanford)
Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822-883, May 2017
SLIDE 2 Agenda
- 1. The power of nonconvex optimization in solving random quadratic systems of
equations (Aug. 28)
- 2. Random initialization and implicit regularization in nonconvex statistical
estimation (Aug. 29)
- 3. The projected power method: an efficient nonconvex algorithm for joint
discrete assignment from pairwise data (Sep. 3)
- 4. Spectral methods meet asymmetry: two recent stories (Sep. 4)
- 5. Inference and uncertainty quantification for noisy matrix completion (Sep. 5)
SLIDE 3
- (high-dimensional) statistics
- nonconvex optimization
SLIDE 4 Nonconvex problems are everywhere
Maximum likelihood estimation is usually nonconvex:
maximize_x  ℓ(x; data)  → may be nonconcave
subj. to  x ∈ S  → may be nonconvex
SLIDE 5 Nonconvex problems are everywhere
Maximum likelihood estimation is usually nonconvex:
maximize_x  ℓ(x; data)  → may be nonconcave
subj. to  x ∈ S  → may be nonconvex
- low-rank matrix completion
- robust principal component analysis
- graph clustering
- dictionary learning
- blind deconvolution
- learning neural nets
- ...
SLIDE 6
Nonconvex optimization may be super scary
There may be bumps everywhere and exponentially many local optima, e.g. 1-layer neural nets (Auer, Herbster, Warmuth ’96; Vu ’98)
SLIDE 7 Example: solving quadratic programs is hard
Finding the maximum cut in a graph:
maximize_x  x^⊤ W x
subj. to  x_i^2 = 1,  i = 1, · · · , n
SLIDE 8
Example: solving quadratic programs is hard
Fig credit: coding horror
SLIDE 9 One strategy: convex relaxation
Can relax into convex problems by
- finding convex surrogates (e.g. compressed sensing, matrix completion)
- lifting into higher dimensions (e.g. Max-Cut)
SLIDE 10 Example of convex surrogate: low-rank matrix completion
— Fazel ’02, Recht, Parrilo, Fazel ’10, Candès, Recht ’09
minimize_M  rank(M)   subj. to data constraints
cvx surrogate:  minimize_M  ‖M‖_nuc   subj. to data constraints
SLIDE 11 Example of convex surrogate: low-rank matrix completion
— Fazel ’02, Recht, Parrilo, Fazel ’10, Candès, Recht ’09
minimize_M  rank(M)   subj. to data constraints
cvx surrogate:  minimize_M  ‖M‖_nuc   subj. to data constraints
Robust variation used every day by Netflix
SLIDE 12 Example of convex surrogate: low-rank matrix completion
— Fazel ’02, Recht, Parrilo, Fazel ’10, Candès, Recht ’09
minimize_M  rank(M)   subj. to data constraints
cvx surrogate:  minimize_M  ‖M‖_nuc   subj. to data constraints
Robust variation used every day by Netflix
Problem: operates in the full matrix space even though X is low-rank
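Not part of the original slides: a small cvxpy sketch of the nuclear-norm surrogate under sampled-entry constraints, to make the relaxation concrete. The matrix size, rank, and sampling pattern are arbitrary illustrative choices; note that the decision variable M lives in the full n × n space, which is exactly the dimensionality issue flagged above.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, r = 20, 2
M_true = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r ground truth
mask = (rng.random((n, n)) < 0.5).astype(float)                      # observed-entry pattern

M = cp.Variable((n, n))
objective = cp.Minimize(cp.normNuc(M))                 # nuclear norm as convex surrogate for rank
constraints = [cp.multiply(mask, M) == mask * M_true]  # agree with the data on observed entries
cp.Problem(objective, constraints).solve()
print(np.linalg.matrix_rank(M.value, tol=1e-4))        # typically low-rank when enough entries are seen
```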
SLIDE 13 Example of lifting: Max-Cut
— Goemans, Williamson ’95
maximize_x  x^⊤ W x
subj. to  x_i^2 = 1,  i = 1, · · · , n
SLIDE 14 Example of lifting: Max-Cut
— Goemans, Williamson ’95
maximize_x  x^⊤ W x   subj. to  x_i^2 = 1, i = 1, · · · , n
Let X = xx^⊤:
maximize_X  ⟨X, W⟩
subj. to  X_{i,i} = 1, i = 1, · · · , n;  X ⪰ 0;  rank(X) = 1
SLIDE 15 Example of lifting: Max-Cut
— Goemans, Williamson ’95
maximize_x  x^⊤ W x   subj. to  x_i^2 = 1, i = 1, · · · , n
Let X = xx^⊤:
maximize_X  ⟨X, W⟩
subj. to  X_{i,i} = 1, i = 1, · · · , n;  X ⪰ 0;  rank(X) = 1
SLIDE 16 Example of lifting: Max-Cut
— Goemans, Williamson ’95
maximize_x  x^⊤ W x   subj. to  x_i^2 = 1, i = 1, · · · , n
Let X = xx^⊤:
maximize_X  ⟨X, W⟩
subj. to  X_{i,i} = 1, i = 1, · · · , n;  X ⪰ 0;  rank(X) = 1
Problem: explosion in dimensions (R^n → R^{n×n})
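For concreteness (not from the slides): dropping the rank-1 constraint in the lifted program gives the Goemans-Williamson SDP relaxation, which can be written in a few lines of cvxpy. The weight matrix below is a random illustrative example.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 8
W = rng.random((n, n))
W = (W + W.T) / 2                                  # symmetric edge-weight matrix (illustrative)

X = cp.Variable((n, n), symmetric=True)            # lifted variable X = xx^T lives in R^{n x n}
constraints = [X >> 0, cp.diag(X) == 1]            # X PSD with unit diagonal; rank(X) = 1 is dropped
cp.Problem(cp.Maximize(cp.trace(W @ X)), constraints).solve()
print(X.value.shape)                               # (n, n): the dimension blow-up noted above
```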
SLIDE 17
How about optimizing nonconvex problems directly without lifting?
SLIDE 18
A case study: solving random quadratic systems of equations
SLIDE 19 Solving quadratic systems of equations
(diagram: x → Ax → y = |Ax|^2, illustrated with a small numerical example)
Solve for x ∈ C^n from m quadratic equations  y_k ≈ |⟨a_k, x⟩|^2,  k = 1, . . . , m
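A minimal numpy sketch (real-valued case, illustrative sizes; not from the slides) of how such a random quadratic system is generated under an i.i.d. Gaussian design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 800                     # number of unknowns and number of equations (m = 8n)
x = rng.standard_normal(n)          # ground-truth signal
A = rng.standard_normal((m, n))     # i.i.d. Gaussian design; row k is a_k^T
y = (A @ x) ** 2                    # quadratic measurements y_k = |<a_k, x>|^2
```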
SLIDE 20 Motivation: a missing phase problem in imaging science
Detectors record intensities of diffracted rays
→ Fourier transform x̂(f_1, f_2)
intensity of electrical field:  |x̂(f_1, f_2)|^2 = |∫∫ x(t_1, t_2) e^{−i2π(f_1 t_1 + f_2 t_2)} dt_1 dt_2|^2
Phase retrieval: recover true signal x(t1, t2) from intensity measurements
SLIDE 21 Motivation: latent variable models
Example: mixture of regression
y ≈ ⟨x, β⟩   or   y ≈ ⟨x, −β⟩
- Samples {(y_k, x_k)}: drawn from one of two unknown regressors β and −β
y_k ≈ ⟨x_k, β⟩ with prob. 0.5;  y_k ≈ ⟨x_k, −β⟩, else   (labels: latent variables)
SLIDE 22 Motivation: latent variable models
Example: mixture of regression
y ≈ ⟨x, β⟩   or   y ≈ ⟨x, −β⟩
- Samples {(y_k, x_k)}: drawn from one of two unknown regressors β and −β
y_k ≈ ⟨x_k, β⟩ with prob. 0.5;  y_k ≈ ⟨x_k, −β⟩, else   (labels: latent variables)
— equivalent to observing y_k^2 ≈ |⟨x_k, β⟩|^2
SLIDE 23 Motivation: learning neural nets with quadratic activation
— Soltanolkotabi, Javanmard, Lee ’17, Li, Ma, Zhang ’17
(diagram: input layer a → hidden layer with activations σ → output layer y)
input features: a;  weights: X = [x_1, · · · , x_r]
y = Σ_{i=1}^r σ(a^⊤ x_i) = Σ_{i=1}^r (a^⊤ x_i)^2,  where σ(z) = z^2
SLIDE 24 An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize constraints
y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k  ⟹  y_k = a_k^∗ X a_k
SLIDE 25 An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize constraints
y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k  ⟹  y_k = a_k^∗ X a_k
find  X ⪰ 0
s.t.  y_k = a_k^∗ X a_k,  k = 1, · · · , m
      rank(X) = 1
SLIDE 26 An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize constraints
y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k  ⟹  y_k = a_k^∗ X a_k
find  X ⪰ 0
s.t.  y_k = a_k^∗ X a_k,  k = 1, · · · , m
      rank(X) = 1
SLIDE 27 An equivalent view: low-rank factorization
Lifting: introduce X = xx^∗ to linearize constraints
y_k = |a_k^∗ x|^2 = a_k^∗ (xx^∗) a_k  ⟹  y_k = a_k^∗ X a_k
find  X ⪰ 0
s.t.  y_k = a_k^∗ X a_k,  k = 1, · · · , m
      rank(X) = 1
Works well if {a_k} are random, but huge increase in dimensions
SLIDE 28 Prior art (before our work)
n: # unknowns; m: sample size (# eqns); y = |Ax|^2, A ∈ R^{m×n}
(chart: sample complexity vs. computational cost, with axis ticks n, n log n, n log^3 n and mn, mn^2; methods plotted: convex relaxation, Wirtinger flow, and alt-min (fresh samples at each iter); infeasible regions marked)
SLIDE 33 A glimpse of our results
n: # unknowns; m: sample size (# eqns); y = |Ax|^2, A ∈ R^{m×n}
(chart: same axes as above, with our algorithm added alongside convex relaxation, Wirtinger flow, and alt-min)
This work: random quadratic systems are solvable in linear time, with minimal sample size and optimal statistical accuracy!
SLIDE 35 A first impulse: maximum likelihood estimate
maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
SLIDE 36 A first impulse: maximum likelihood estimate
maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
y_k ∼ |a_k^∗ x|^2 + N(0, σ^2):   ℓ_k(z) = −(y_k − |a_k^∗ z|^2)^2 / 2
SLIDE 37 A first impulse: maximum likelihood estimate
maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
y_k ∼ |a_k^∗ x|^2 + N(0, σ^2):   ℓ_k(z) = −(y_k − |a_k^∗ z|^2)^2 / 2
y_k ∼ Poisson(|a_k^∗ x|^2):   ℓ_k(z) = −|a_k^∗ z|^2 + y_k log |a_k^∗ z|^2
SLIDE 38 A first impulse: maximum likelihood estimate
maximize_z  ℓ(z) = (1/m) Σ_{k=1}^m ℓ_k(z)
y_k ∼ |a_k^∗ x|^2 + N(0, σ^2):   ℓ_k(z) = −(y_k − |a_k^∗ z|^2)^2 / 2
y_k ∼ Poisson(|a_k^∗ x|^2):   ℓ_k(z) = −|a_k^∗ z|^2 + y_k log |a_k^∗ z|^2
Problem: −ℓ nonconvex, many local stationary points
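For reference, a short numpy sketch (real-valued case, names illustrative; not from the slides) of the average Poisson log-likelihood written above; the small eps only guards the logarithm at points where a_k^T z = 0.

```python
import numpy as np

def poisson_loglik(z, A, y, eps=1e-12):
    """(1/m) * sum_k [ -|a_k^T z|^2 + y_k * log|a_k^T z|^2 ]."""
    vals = (A @ z) ** 2          # |a_k^T z|^2 for all k
    return np.mean(-vals + y * np.log(vals + eps))
```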
SLIDE 39 Wirtinger flow: Candès, Li, Soltanolkotabi ’14
- Spectral initialization: z^0 ← leading eigenvector of (1/m) Σ_{k=1}^m y_k a_k a_k^∗
- Gradient iterations: for t = 0, 1, . . . :  z^{t+1} = z^t + µ_t ∇ℓ(z^t)
Already a rich theory (see also Soltanolkotabi ’14, Ma, Wang, Chi, Chen ’17)
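A compact real-valued sketch of the two WF stages, assuming the quadratic loss f(z) = (1/4m) Σ_k (|a_k^T z|^2 − y_k)^2 so that ascent on ℓ becomes descent on f; the step size, iteration counts, and power-method initialization below are illustrative choices, not the constants from the paper.

```python
import numpy as np

def wirtinger_flow(A, y, iters=200, mu=0.1, power_iters=50, seed=0):
    m, n = A.shape
    # spectral initialization: leading eigenvector of (1/m) sum_k y_k a_k a_k^T via power method
    z = np.random.default_rng(seed).standard_normal(n)
    for _ in range(power_iters):
        z = A.T @ (y * (A @ z)) / m
        z /= np.linalg.norm(z)
    z *= np.sqrt(np.mean(y))              # rescale to the estimated signal strength ||x||
    # gradient iterations on f(z) = (1/4m) sum_k (|a_k^T z|^2 - y_k)^2
    for _ in range(iters):
        Az = A @ z
        grad = A.T @ ((Az ** 2 - y) * Az) / m
        z = z - (mu / np.mean(y)) * grad  # step normalized by ||z^0||^2 ≈ mean(y)
    return z
```

Since x is identifiable only up to a global sign (or phase), the output is compared with the truth via min(‖z − x‖, ‖z + x‖).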
SLIDE 40 Interpretation of spectral initialization
Spectral initialization: z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k^∗
SLIDE 41 Interpretation of spectral initialization
Spectral initialization: z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k^∗
- Rationale: E[Y] = I + 2xx^∗ (when ‖x‖_2 = 1) under Gaussian design
SLIDE 42 Interpretation of spectral initialization
Spectral initialization: z^0 ← leading eigenvector of  Y := (1/m) Σ_{k=1}^m y_k a_k a_k^∗
- Rationale: E[Y] = I + 2xx^∗ (when ‖x‖_2 = 1) under Gaussian design
- Would succeed if Y → E[Y ]
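A quick numerical check of this rationale (purely illustrative, real-valued case): with many Gaussian samples and a unit-norm x, the matrix Y concentrates around I + 2xx^T.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 200_000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                     # unit-norm ground truth
A = rng.standard_normal((m, n))
y = (A @ x) ** 2
Y = (A.T * y) @ A / m                      # (1/m) sum_k y_k a_k a_k^T
print(np.linalg.norm(Y - (np.eye(n) + 2 * np.outer(x, x)), 2))   # small when m >> n
```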
SLIDE 43
Empirical performance of initialization (m = 12n)
Ground truth x ∈ R^{409600}
SLIDE 44
Empirical performance of initialization (m = 12n)
Ground truth x ∈ R^{409600} | Spectral initialization
SLIDE 45 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
SLIDE 46 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
(plot: a_k^∗ Y a_k / ‖a_k‖^2 vs. k, compared with x^∗ Y x; m = 6n)
SLIDE 47 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
(plot: a_k^∗ Y a_k / ‖a_k‖^2 vs. k, compared with x^∗ Y x; m = 6n)
Problem: large outliers y_k = |a_k^∗ x|^2 bear too much influence
SLIDE 48 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
(plot: a_k^∗ Y a_k / ‖a_k‖^2 vs. k, compared with x^∗ Y x; m = 6n)
Problem: large outliers y_k = |a_k^∗ x|^2 bear too much influence
Solution: discard large samples and run PCA on  (1/m) Σ_k y_k a_k a_k^∗ 1{|y_k| ≲ Avg{|y_l|}}
SLIDE 49 Improving initialization
Y = (1/m) Σ_k y_k a_k a_k^∗ is heavy-tailed unless m ≫ n
(plot: a_k^∗ Y a_k / ‖a_k‖^2 vs. k, compared with x^∗ Y x; m = 6n)
Problem: large outliers y_k = |a_k^∗ x|^2 bear too much influence
Solution: discard large samples and run PCA on  (1/m) Σ_k y_k a_k a_k^∗ 1{|y_k| ≲ Avg{|y_l|}}
— improvable via more refined pre-processing (Wang, Giannakis, Eldar ’16, Lu, Li ’17, Mondelli, Montanari ’17):  (1/m) Σ_k ρ(y_k) a_k a_k^∗,  e.g. ρ(y_k) = max{y_k, a}
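A numpy sketch of the truncated spectral initialization just described (real-valued case); the truncation threshold alpha_y and the plain power method are illustrative choices rather than the exact recipe and constants of the paper.

```python
import numpy as np

def truncated_spectral_init(A, y, alpha_y=3.0, power_iters=100, seed=0):
    m, n = A.shape
    keep = y <= alpha_y * np.mean(np.abs(y))        # discard samples with excessively large y_k
    z = np.random.default_rng(seed).standard_normal(n)
    for _ in range(power_iters):                    # power method on (1/m) sum_{kept} y_k a_k a_k^T
        z = A[keep].T @ (y[keep] * (A[keep] @ z)) / m
        z /= np.linalg.norm(z)
    return np.sqrt(np.mean(y)) * z                  # rescale to the estimated signal norm
```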
SLIDE 50
Empirical performance of initialization (m = 12n)
Ground truth x ∈ R^{409600} | Regularized spectral initialization
SLIDE 51 Iterative refinement stage: search directions
Wirtinger flow:  z^{t+1} = z^t − (µ_t/m) Σ_{k=1}^m (|a_k^⊤ z^t|^2 − y_k) (a_k^⊤ z^t) a_k
SLIDE 52 Iterative refinement stage: search directions
Wirtinger flow:  z^{t+1} = z^t − (µ_t/m) Σ_{k=1}^m (|a_k^⊤ z^t|^2 − y_k) (a_k^⊤ z^t) a_k
Even in a local region around x (e.g. {z : ‖z − x‖_2 ≤ 0.1 ‖x‖_2}):
- f(·) is NOT strongly convex unless m ≫ n
- f(·) has a huge smoothness parameter
SLIDE 53 Iterative refinement stage: search directions
Wirtinger flow:  z^{t+1} = z^t − (µ_t/m) Σ_{k=1}^m (|a_k^⊤ z^t|^2 − y_k) (a_k^⊤ z^t) a_k
(figure: locus of {−∇ℓ_k(z)} around z and x)
Problem: descent direction has large variability
SLIDE 54 Our solution: variance reduction via proper trimming
More adaptive rule:
z^{t+1} = z^t + (µ_t/m) Σ_{i=1}^m [(y_i − |a_i^⊤ z^t|^2) / (a_i^⊤ z^t)] a_i 1_{E_1^i(z^t) ∩ E_2^i(z^t)}
where  E_1^i(z) := { α_lb ≤ |a_i^⊤ z| / ‖z‖_2 ≤ α_ub },   E_2^i(z) := { |y_i − |a_i^⊤ z|^2| ≤ (α_h/m) ‖y − |Az|^2‖_1 · |a_i^⊤ z| / ‖z‖_2 }
SLIDE 55 Our solution: variance reduction via proper trimming
More adaptive rule:
z^{t+1} = z^t + (µ_t/m) Σ_{i=1}^m [(y_i − |a_i^⊤ z^t|^2) / (a_i^⊤ z^t)] a_i 1_{E_1^i(z^t) ∩ E_2^i(z^t)}
where  E_1^i(z) := { α_lb ≤ |a_i^⊤ z| / ‖z‖_2 ≤ α_ub },   E_2^i(z) := { |y_i − |a_i^⊤ z|^2| ≤ (α_h/m) ‖y − |Az|^2‖_1 · |a_i^⊤ z| / ‖z‖_2 }
SLIDE 56 Our solution: variance reduction via proper trimming
More adaptive rule:
z^{t+1} = z^t + (µ_t/m) Σ_{i=1}^m [(y_i − |a_i^⊤ z^t|^2) / (a_i^⊤ z^t)] a_i 1_{E_1^i(z^t) ∩ E_2^i(z^t)}
where  E_1^i(z) := { α_lb ≤ |a_i^⊤ z| / ‖z‖_2 ≤ α_ub },   E_2^i(z) := { |y_i − |a_i^⊤ z|^2| ≤ (α_h/m) ‖y − |Az|^2‖_1 · |a_i^⊤ z| / ‖z‖_2 }
informally,  z^{t+1} = z^t + (µ_t/m) Σ_{k∈T_t} ∇ℓ_k(z^t)
- T_t trims away excessively large gradient components
SLIDE 57 Our solution: variance reduction via proper trimming
More adaptive rule:
z^{t+1} = z^t + (µ_t/m) Σ_{i=1}^m [(y_i − |a_i^⊤ z^t|^2) / (a_i^⊤ z^t)] a_i 1_{E_1^i(z^t) ∩ E_2^i(z^t)}
where  E_1^i(z) := { α_lb ≤ |a_i^⊤ z| / ‖z‖_2 ≤ α_ub },   E_2^i(z) := { |y_i − |a_i^⊤ z|^2| ≤ (α_h/m) ‖y − |Az|^2‖_1 · |a_i^⊤ z| / ‖z‖_2 }
informally,  z^{t+1} = z^t + (µ_t/m) Σ_{k∈T_t} ∇ℓ_k(z^t)
- T_t trims away excessively large gradient components
Slight bias + much reduced variance
SLIDE 58 Larger step size µt is feasible
(figures: iterate paths from z^1 toward x, without and with trimming)
without trimming: µ_t = O(1/n)
with trimming: µ_t = O(1)
With better-controlled descent directions, one can proceed far more aggressively
SLIDE 59 Summary: truncated Wirtinger flows (TWF)
- 1. Regularized spectral initialization: z^0 ← leading eigenvector of (1/m) Σ_{k∈T_0} y_k a_k a_k^∗
- 2. Regularized gradient descent:  z^{t+1} = z^t + (µ_t/m) Σ_{k∈T_t} ∇ℓ_k(z^t)
Key idea: adaptively discard high-leverage data
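A compact real-valued sketch of the trimmed gradient stage, combining the truncation events E_1, E_2 from the previous slides with the Poisson-type search direction; the constants alpha_lb, alpha_ub, alpha_h and the step size mu are illustrative, not necessarily the exact values recommended in the paper.

```python
import numpy as np

def twf_iterations(A, y, z0, iters=100, mu=0.2, alpha_lb=0.3, alpha_ub=5.0, alpha_h=5.0):
    m, n = A.shape
    z = z0.copy()
    for _ in range(iters):
        Az = A @ z
        resid = y - Az ** 2
        norm_z = np.linalg.norm(z)
        # E_1: |a_i^T z| / ||z|| lies within [alpha_lb, alpha_ub]
        E1 = (np.abs(Az) >= alpha_lb * norm_z) & (np.abs(Az) <= alpha_ub * norm_z)
        # E_2: residual not too large relative to the average residual size
        E2 = np.abs(resid) <= (alpha_h / m) * np.sum(np.abs(resid)) * np.abs(Az) / norm_z
        keep = E1 & E2                       # trim high-leverage samples
        nu = np.zeros(m)
        nu[keep] = resid[keep] / Az[keep]    # trimmed Poisson-type search-direction weights
        z = z + (mu / m) * (A.T @ nu)
    return z
```

Used together with the truncated spectral initializer sketched earlier, this gives the whole two-stage pipeline.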
SLIDE 60
Performance guarantees of TWF (noiseless data)
dist(z, x) := min ‖z ± x‖_2
Theorem (Chen & Candès ’15). Under i.i.d. Gaussian design, TWF achieves  dist(z^t, x) ≲ (1 − ρ)^t ‖x‖_2,  t = 0, 1, · · ·  with high prob., provided that the sample size m ≳ n. Here, 0 < ρ < 1 is a constant.
SLIDE 61
Performance guarantees of TWF (noiseless data)
dist(z, x) := min ‖z ± x‖_2
Theorem (Chen & Candès ’15). Under i.i.d. Gaussian design, TWF achieves  dist(z^t, x) ≲ (1 − ρ)^t ‖x‖_2,  t = 0, 1, · · ·  with high prob., provided that the sample size m ≳ n. Here, 0 < ρ < 1 is a constant.
(figure: initial guess z^0 within the basin of attraction around x)
start within basin of attraction → linear convergence
SLIDE 62 Computational complexity
A := {a_k^∗}_{1≤k≤m}
- Initialization: leading eigenvector → a few applications of A and A^∗
Σ_{k∈T_0} y_k a_k a_k^∗ = A^∗ diag{y_k · 1_{k∈T_0}} A
SLIDE 63 Computational complexity
A := {a_k^∗}_{1≤k≤m}
- Initialization: leading eigenvector → a few applications of A and A^∗
Σ_{k∈T_0} y_k a_k a_k^∗ = A^∗ diag{y_k · 1_{k∈T_0}} A
- Iterations: one application of A and A^∗ per iteration
z^{t+1} = z^t + (µ_t/m) ∇ℓ_tr(z^t),   −∇ℓ_tr(z^t) = A^∗ ν,   ν = 2 (|Az^t|^2 − y) / (Az^t) · 1_{T_t}
SLIDE 64 Computational complexity
A := {a_k^∗}_{1≤k≤m}
- Initialization: leading eigenvector → a few applications of A and A^∗
Σ_{k∈T_0} y_k a_k a_k^∗ = A^∗ diag{y_k · 1_{k∈T_0}} A
- Iterations: one application of A and A^∗ per iteration
z^{t+1} = z^t + (µ_t/m) ∇ℓ_tr(z^t),   −∇ℓ_tr(z^t) = A^∗ ν,   ν = 2 (|Az^t|^2 − y) / (Az^t) · 1_{T_t}
Approximate runtime: several tens of applications of A and A∗
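In code, one trimmed-gradient evaluation indeed amounts to one multiplication by A and one by A^T, mirroring the display above (real-valued sketch; the boolean index set `keep` plays the role of T_t and is assumed to be computed elsewhere):

```python
import numpy as np

def trimmed_gradient(A, z, y, keep):
    """Return -grad = A^T nu with nu = 2(|Az|^2 - y)/(Az), zeroed outside the trimmed set."""
    Az = A @ z                                   # one application of A
    nu = np.zeros_like(y, dtype=float)
    nu[keep] = 2 * (Az[keep] ** 2 - y[keep]) / Az[keep]
    return A.T @ nu                              # one application of A^T
```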
SLIDE 65 Numerical surprise
- CG: solve y = Ax
- Our algorithm: solve y = |Ax|^2
For random quadratic systems (m = 8n):
- computational cost of our algorithm ≈ 4 × computational cost of least squares
SLIDE 66
Empirical performance
After regularized spectral initialization
SLIDE 67
Empirical performance
After regularized spectral initialization After 50 TWF iterations
SLIDE 68 Key convergence condition for gradient stage
If there are many samples: ∀z s.t. dist(z, x) ≤ ε‖x‖_2:
⟨∇ℓ(z), x − z⟩ ≳ ‖z − x‖_2^2 + ‖∇ℓ(z)‖_2^2
(figure: directions ∇ℓ(z) at points z around x)
SLIDE 69 Key convergence condition for gradient stage
If there are NOT many samples, i.e. m ≍ n: ∀z s.t. dist(z, x) ≤ ε‖x‖_2:
⟨∇ℓ(z), x − z⟩ ≳ ‖z − x‖_2^2 + ‖∇ℓ(z)‖_2^2
(figure: directions ∇ℓ(z) at points z around x)
SLIDE 70 Key convergence condition for gradient stage
If there are NOT many samples, i.e. m ≍ n: ∀z s.t. dist(z, x) ≤ ε‖x‖_2:
⟨∇ℓ_tr(z), x − z⟩ ≳ ‖z − x‖_2^2 + ‖∇ℓ_tr(z)‖_2^2
(figure: untrimmed ∇ℓ(z) vs. trimmed ∇ℓ_tr(z) at points z around x)
SLIDE 71 Stability under noisy data
y_k = |a_k^∗ x|^2 + η_k
SNR := Σ_k |a_k^∗ x|^4 / ‖η‖_2^2 ≈ 3m ‖x‖_2^4 / ‖η‖_2^2
SLIDE 72 Stability under noisy data
y_k = |a_k^∗ x|^2 + η_k
SNR := Σ_k |a_k^∗ x|^4 / ‖η‖_2^2 ≈ 3m ‖x‖_2^4 / ‖η‖_2^2
Theorem (Soltanolkotabi). WF converges to the MLE.
Theorem (Chen, Candès). Relative error of TWF converges to O(1/√SNR).
SLIDE 73 Relative MSE vs. SNR (Poisson data)
(plot: relative MSE (dB) vs. SNR (dB), n = 1000, for m = 6n, 8n, 10n; slope ≈ −1)
Empirical evidence: relative MSE scales inversely with SNR
SLIDE 74 This accuracy is nearly un-improvable (empirically)
Comparison with genie-aided MLE (with sign info. revealed):  y_k ∼ Poisson(|a_k^∗ x|^2)  and  ε_k = sign(a_k^∗ x)  (revealed by a genie)
SLIDE 75 This accuracy is nearly un-improvable (empirically)
Comparison with genie-aided MLE (with sign info. revealed):  y_k ∼ Poisson(|a_k^∗ x|^2)  and  ε_k = sign(a_k^∗ x)  (revealed by a genie)
(plot: relative MSE (dB) vs. SNR (dB), n = 100; truncated WF vs. genie-aided MLE)
little empirical loss due to missing signs
SLIDE 76 This accuracy is nearly un-improvable (theoretically)
y_k ∼ Poisson(|a_k^∗ x|^2), independently across k
SNR ≈ Σ_k |a_k^∗ x|^4 / Σ_k |a_k^∗ x|^2   (Poisson: noise variance = mean)
SLIDE 77 This accuracy is nearly un-improvable (theoretically)
y_k ∼ Poisson(|a_k^∗ x|^2), independently across k
SNR ≈ Σ_k |a_k^∗ x|^4 / Σ_k |a_k^∗ x|^2   (Poisson: noise variance = mean)
Theorem (Chen, Candès). Under i.i.d. Gaussian design, for any estimator x̂,
inf_x̂  sup_{x: ‖x‖_2 ≥ log^{1.5} m}  E[ dist(x̂, x) | {a_k} ]  ≳  ‖x‖_2 / √SNR,
provided that the sample size m ≍ n.
SLIDE 78 Phaseless 3D computational imaging
Fromenteze, Liu, Boyarsky, Gollub, & Smith ’16
(diagram: field source → metasurface → intensity measurement, with quantities ρ(ν), φ(r_r, ν), f(r), g(r, r_r, ν))
Measure intensities (with radiating metasurfaces) rather than complex signals for sub-centimeter wavelengths
f̂: computational imaging;  f̂_I: phaseless computational imaging
SLIDE 79 Phaseless 3D computational imaging
Fromenteze, Liu, Boyarsky, Gollub, & Smith ’16
f̂: computational imaging;  f̂_I: phaseless computational imaging
(plots: normalized magnitude (a.u.) vs. x, y, z (m); (red) phaseless reconstruction, (blue) reconstruction w/ phase¹)
¹This demonstration is proposed in the microwave range as a proof of concept
SLIDE 80 No need of sample splitting
- Several prior works use sample splitting: they require fresh samples at each iteration; not practical, but much easier to analyze
(diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, using fresh samples at each step)
SLIDE 81 No need of sample splitting
- Several prior works use sample splitting: they require fresh samples at each iteration; not practical, but much easier to analyze
(diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, using fresh samples at each step)
- Our work: reuse all samples in all iterations
(diagram: z^0 → z^1 → z^2 → z^3 → z^4 → z^5, same samples throughout)
SLIDE 82 A small sample of more recent works
- other optimal algorithms
- reshaped WF (Zhang et al.), truncated AF (Wang et al.), median-TWF (Zhang et al.)
- alt-min w/o resampling (Waldspurger)
- composite optimization (Duchi et al., Charisopoulos et al.)
- approximate message passing (Ma et al.)
- block coordinate descent (Barmherzig et al.)
- PhaseMax (Goldstein et al., Bahmani et al., Salehi et al., Dhifallah et al., Hand et al.)
- stochastic algorithms (Kolte et al., Zhang et al., Lu et al., Tan et al., Jeong et al.)
- improved WF theory: iteration complexity → O(log n · log(1/ε)) (Ma et al.)
- improved initialization (Lu et al., Wang et al., Mondelli et al.)
- random initialization (Chen et al.)
- structured quadratic systems (Cai et al., Soltanolkotabi, Wang et al., Yang et al.,
Qu et al.)
- geometric analysis (Sun et al., Davis et al.)
- low-rank generalization (White et al., Li et al., Vaswani et al.)
SLIDE 83 Central message
- Simple nonconvex paradigms are surprisingly effective for computing MLE
- Importance of statistical thinking (initialization)
(figure: statistical accuracy of convex relaxation vs. nonconvex procedure)
— Y. Chen, E. J. Candès, “Solving random quadratic systems of equations is nearly as easy as solving linear systems,” Comm. Pure and Applied Math., 2017