Implicit Regularization in Nonconvex Statistical Estimation

Yuxin Chen (Electrical Engineering, Princeton University)
Cong Ma (Princeton ORFE), Kaizheng Wang (Princeton ORFE), Yuejie Chi (CMU ECE)
Nonconvex estimation problems are everywhere
Empirical risk minimization is usually nonconvex:

    minimize_x   ℓ(x; y)    → may be nonconvex
    subj. to     x ∈ S      → may be nonconvex
- low-rank matrix completion
- graph clustering
- dictionary learning
- mixture models
- deep learning
- ...
Nonconvex optimization may be super scary
There may be bumps everywhere and exponentially many local optima, e.g. even for a 1-layer neural net (Auer, Herbster, Warmuth ’96; Vu ’98)
... but is sometimes much nicer than we think
Under certain statistical models, we see benign global geometry: no spurious local optima
Fig credit: Sun, Qu & Wright
[Diagram: statistical models → benign landscape, which efficient algorithms exploit via geometry]
Optimization-based methods: two-stage approach
[Figure: data → initial guess x0 inside the basin of attraction around the truth x̄, followed by iterates x1, x2, ...]
- Start from an appropriate initial point
- Proceed via some iterative optimization algorithms
Roles of regularization

- Prevents overfitting and improves generalization
  - e.g. ℓ1 penalization, SCAD, nuclear norm penalization, ...
- Improves computation by stabilizing search directions   ⇐ focus of this talk
  - e.g. trimming, projection, regularized loss
3 representative nonconvex problems
phase retrieval matrix completion blind deconvolution
Regularized vs. unregularized methods

                    phase retrieval            matrix completion              blind deconvolution
    regularized     trimming                   regularized cost, projection   regularized cost, projection
    unregularized   suboptimal comput. cost    ?                              ?

Are unregularized methods suboptimal for nonconvex estimation?
Missing phase problem
Detectors record intensities of diffracted rays (Fig credit: Stanford SLAC)

- electric field x(t1, t2)  →  Fourier transform x̂(f1, f2)
- intensity of electrical field:  |x̂(f1, f2)|² = | ∫∫ x(t1, t2) e^{−i2π(f1 t1 + f2 t2)} dt1 dt2 |²
Phase retrieval: recover the signal x(t1, t2) from the intensity measurements |x̂(f1, f2)|²
Solving quadratic systems of equations
[Figure: measurement model y = |Ax|², with A an m × n sensing matrix]
Recover x♮ ∈ R^n from m random quadratic measurements

    y_k = |a_k^⊤ x♮|²,   k = 1, . . . , m

Assume w.l.o.g. ‖x♮‖₂ = 1
Wirtinger flow (Candès, Li, Soltanolkotabi ’14)
Empirical risk minimization:

    minimize_x   f(x) = (1/4m) Σ_{k=1}^m ( (a_k^⊤ x)² − y_k )²
- Initialization by spectral method
- Gradient iterations: for t = 0, 1, . . . :   x_{t+1} = x_t − η ∇f(x_t)
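To make the two steps concrete, here is a minimal NumPy sketch of vanilla WF under the real-valued Gaussian design above. This is not the authors' code; the problem sizes, step size η = 0.1 (the same value used in the numerical experiment later in the talk), and iteration count are illustrative choices.

```python
import numpy as np

def wirtinger_flow(A, y, eta=0.1, n_iters=500):
    """Vanilla WF: spectral initialization followed by plain gradient descent."""
    m, n = A.shape
    # Spectral initialization: leading eigenvector of (1/m) * sum_k y_k a_k a_k^T,
    # rescaled to the estimated signal norm sqrt(mean(y)).
    Y = A.T @ (y[:, None] * A) / m
    _, V = np.linalg.eigh(Y)
    x = V[:, -1] * np.sqrt(y.mean())
    for _ in range(n_iters):
        z = A @ x
        grad = A.T @ ((z ** 2 - y) * z) / m   # grad of f(x) = (1/4m) sum_k ((a_k^T x)^2 - y_k)^2
        x = x - eta * grad
    return x

# Toy run (sizes, step size, and iteration count are illustrative choices)
rng = np.random.default_rng(0)
n, m = 100, 1000
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)                # w.l.o.g. ||x_star||_2 = 1
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2
x_hat = wirtinger_flow(A, y)
err = min(np.linalg.norm(x_hat - x_star), np.linalg.norm(x_hat + x_star))  # global sign ambiguity
print(f"l2 estimation error: {err:.2e}")
```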
Gradient descent theory revisited
Two standard conditions that enable geometric convergence of GD
- (local) restricted strong convexity (or regularity condition)
- (local) smoothness: ∇²f(x) ≻ 0 and is well-conditioned
f is said to be α-strongly convex and β-smooth if

    0 ≺ αI ⪯ ∇²f(x) ⪯ βI,   ∀x

ℓ2 error contraction: GD with η = 1/β obeys

    ‖x_{t+1} − x♮‖₂ ≤ (1 − α/β) ‖x_t − x♮‖₂
[Figure: GD iterates contracting toward x♮ inside the region of local strong convexity + smoothness, with ‖x_{t+1} − x♮‖₂ ≤ (1 − α/β) ‖x_t − x♮‖₂]
- Condition number β/α determines rate of convergence
- Attains ε-accuracy within O( (β/α) log(1/ε) ) iterations
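For intuition, here is a tiny sketch on an assumed quadratic objective f(x) = ½ x^⊤ H x with αI ⪯ H ⪯ βI (the matrix and sizes are made up for illustration), checking that GD with η = 1/β contracts the error by at least the factor 1 − α/β per step.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
eigs = np.linspace(1.0, 10.0, n)              # alpha = 1, beta = 10 (illustrative)
H = Q @ np.diag(eigs) @ Q.T                   # Hessian of f(x) = 0.5 * x^T H x
alpha, beta = eigs[0], eigs[-1]

x_star = np.zeros(n)                          # minimizer of the quadratic
x = rng.standard_normal(n)
eta = 1.0 / beta
for t in range(5):
    prev_err = np.linalg.norm(x - x_star)
    x = x - eta * (H @ x)                     # gradient step; grad f(x) = H x
    ratio = np.linalg.norm(x - x_star) / prev_err
    print(f"iter {t}: contraction ratio {ratio:.3f} <= 1 - alpha/beta = {1 - alpha/beta:.3f}")
```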
What does this optimization theory say about WF?

Gaussian designs: a_k ~ i.i.d. N(0, I_n), 1 ≤ k ≤ m

Population level (infinite samples):

    E[∇²f(x)] = 3( ‖x‖₂² I + 2xx^⊤ ) − ( ‖x♮‖₂² I + 2x♮x♮^⊤ )

- locally positive definite and well-conditioned
- Consequence: WF converges within O(log(1/ε)) iterations if m → ∞

Finite-sample level (m ≍ n log n): ∇²f(x) ≻ 0 but ill-conditioned

- condition number ≍ n (even locally)
- Consequence (Candès et al ’14): WF attains ε-accuracy within O(n log(1/ε)) iterations if m ≍ n log n

Too slow ... can we accelerate it?
One solution: truncated WF (Chen, Candès ’15)
Regularize / trim gradient components to accelerate convergence
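A hedged sketch of the trimming idea (a simplified variant with an illustrative threshold, not the exact truncation rules of Chen and Candès ’15): gradient components whose measurement vectors are too coherent with the current iterate are dropped before the update.

```python
import numpy as np

def truncated_gradient(A, y, x, tau=3.0):
    """Simplified truncated-WF gradient: discard components k for which |a_k^T x|
    is abnormally large relative to ||x||_2 (threshold tau is an illustrative choice)."""
    m = len(y)
    z = A @ x
    keep = np.abs(z) <= tau * np.linalg.norm(x)       # trim overly coherent measurements
    grad = A[keep].T @ (((z ** 2 - y) * z)[keep]) / m
    return grad

# Usage inside a GD loop:  x = x - eta * truncated_gradient(A, y, x)
```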
But wait a minute ...

- WF converges in O(n) iterations
- Step size taken to be η_t = O(1/n)
- This choice is suggested by worst-case optimization theory
- Does it capture what really happens?
Numerical surprise with η_t = 0.1

[Figure: relative error vs. iteration count for vanilla GD with η_t = 0.1; the error falls to about 10^{−15} within a few hundred iterations]
Vanilla GD (WF) can proceed much more aggressively!
A second look at gradient descent theory

Which region enjoys both strong convexity and smoothness?

    ∇²f(x) = (1/m) Σ_{k=1}^m [ 3(a_k^⊤ x)² − (a_k^⊤ x♮)² ] a_k a_k^⊤

- Not smooth if x and a_k are too close (coherent)

[Figure: neighborhood of x♮ in which |a_k^⊤(x − x♮)| ≲ √(log n) for every sampling vector a_k]

- x is not far away from x♮
- x is incoherent w.r.t. sampling vectors (incoherence region)
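One can probe this numerically: evaluate the empirical Hessian above at a generic point in the ℓ2 ball around x♮ and at a point of the same ℓ2 distance that is aligned with one sampling vector a_1. The sketch below uses illustrative sizes and a made-up constant in m ≍ n log n; the exact numbers will vary.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
m = int(5 * n * np.log(n))                           # sample size on the order of n log n
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))

def hessian(x):
    """Empirical Hessian (1/m) * sum_k [3(a_k^T x)^2 - (a_k^T x_star)^2] a_k a_k^T."""
    w = 3 * (A @ x) ** 2 - (A @ x_star) ** 2
    return A.T @ (w[:, None] * A) / m

d_rand = rng.standard_normal(n)
d_rand /= np.linalg.norm(d_rand)
x_incoh = x_star + d_rand                            # generic (incoherent) point in the l2 ball
x_coh = x_star + A[0] / np.linalg.norm(A[0])         # same l2 distance, but aligned with a_1
for name, x in [("incoherent point", x_incoh), ("point coherent with a_1", x_coh)]:
    lam_max = np.linalg.eigvalsh(hessian(x))[-1]
    print(f"{name}: largest Hessian eigenvalue ~ {lam_max:.1f}")
# Smoothness tends to degrade markedly along directions coherent with a sampling vector.
```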
[Figure: the region of local strong convexity + smoothness vs. the ℓ2 ball around x♮]

- Prior theory only ensures that iterates remain in the ℓ2 ball, but not in the incoherence region
- Prior theory enforces regularization to promote incoherence
Our findings: GD is implicitly regularized

[Figure: the GD trajectory stays inside the region of local strong convexity + smoothness]

GD implicitly forces iterates to remain incoherent
Theoretical guarantees

Theorem 1 (Phase retrieval). Under i.i.d. Gaussian design, WF achieves

- max_k |a_k^⊤(x_t − x♮)| ≲ √(log n) · ‖x♮‖₂   (incoherence)
- ‖x_t − x♮‖₂ ≲ (1 − η/2)^t ‖x♮‖₂   (near-linear convergence)

provided that step size η ≍ 1/log n and sample size m ≳ n log n.

- Step size: 1/log n (vs. 1/n)
- Computational complexity: n/log n times faster than existing theory
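Both bounds can be eyeballed numerically. Below is a self-contained sketch (repeating the setup of the earlier WF snippet; the constant in η ≍ 1/log n and all sizes are illustrative choices, not the theorem's exact constants) that tracks the ℓ2 error and the incoherence measure max_k |a_k^⊤(x_t − x♮)| along vanilla GD.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 200, 2000
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

# Spectral initialization, sign-aligned with the truth for error reporting.
Y = A.T @ (y[:, None] * A) / m
x = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(y.mean())
if x @ x_star < 0:
    x = -x

eta = 0.5 / np.log(n)                                # step size on the order of 1/log n
for t in range(101):
    if t % 25 == 0:
        err = np.linalg.norm(x - x_star)
        incoh = np.max(np.abs(A @ (x - x_star)))
        print(f"t={t:3d}  l2 error={err:.2e}  max_k |a_k^T(x_t - x_star)|={incoh:.2f}"
              f"  (sqrt(log n)={np.sqrt(np.log(n)):.2f})")
    z = A @ x
    x = x - eta * A.T @ ((z ** 2 - y) * z) / m
```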
Key ingredient: leave-one-out analysis

For each 1 ≤ l ≤ m, introduce leave-one-out iterates {x^{t,(l)}} by dropping the l-th measurement

[Figure: measurement matrix and observations with the l-th row removed; incoherence region w.r.t. a_l]

- Leave-one-out iterates {x^{t,(l)}} are independent of a_l, and are hence incoherent w.r.t. a_l with high prob.
- Leave-one-out iterates x^{t,(l)} ≈ true iterates x_t
- |a_l^⊤(x_t − x♮)| ≤ |a_l^⊤(x^{t,(l)} − x♮)| + |a_l^⊤(x_t − x^{t,(l)})|
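A hedged numerical sketch of this construction follows: run vanilla GD once on all m measurements and once with the l-th measurement dropped, then compare the two trajectories and their coherence with a_l. For brevity the sketch reuses a single spectral initialization for both runs (the actual analysis also builds a leave-one-out initialization); sizes, step size, and the iteration count T are illustrative.

```python
import numpy as np

def gd_iterates(A, y, x0, eta, T):
    """Vanilla GD on f(x) = (1/4m) sum_k ((a_k^T x)^2 - y_k)^2, returning the T-th iterate."""
    x, m = x0.copy(), len(y)
    for _ in range(T):
        z = A @ x
        x = x - eta * A.T @ ((z ** 2 - y) * z) / m
    return x

rng = np.random.default_rng(4)
n, m, T, eta = 100, 1000, 10, 0.1
x_star = rng.standard_normal(n)
x_star /= np.linalg.norm(x_star)
A = rng.standard_normal((m, n))
y = (A @ x_star) ** 2

# Shared spectral initialization (kept common to both runs to keep the sketch short).
Y = A.T @ (y[:, None] * A) / m
x0 = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(y.mean())
if x0 @ x_star < 0:
    x0 = -x0

l = 0                                                 # index of the left-out measurement
x_t = gd_iterates(A, y, x0, eta, T)                                          # true iterates
x_t_loo = gd_iterates(np.delete(A, l, axis=0), np.delete(y, l), x0, eta, T)  # leave-one-out iterates

print("||x_t - x_t^{(l)}||_2        =", np.linalg.norm(x_t - x_t_loo))
print("|a_l^T (x_t^{(l)} - x_star)| =", abs(A[l] @ (x_t_loo - x_star)))
print("|a_l^T (x_t - x_star)|       =", abs(A[l] @ (x_t - x_star)))
```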
This recipe is quite general
Low-rank matrix completion
[Figure: partially observed low-rank matrix, missing entries shown as “?”. Fig. credit: Candès]
Given partial samples Ω of a low-rank matrix M, fill in missing entries
Prior art

    minimize_X   f(X) = Σ_{(j,k)∈Ω} ( e_j^⊤ X X^⊤ e_k − M_{j,k} )²

Existing theory on gradient descent requires

- regularized loss (solve min_X f(X) + R(X) instead)
  - e.g. Keshavan, Montanari, Oh ’10; Sun, Luo ’14; Ge, Lee, Ma ’16
- projection onto set of incoherent matrices
  - e.g. Chen, Wainwright ’15; Zheng, Lafferty ’16
Theoretical guarantees

Theorem 2 (Matrix completion). Suppose M is rank-r, incoherent and well-conditioned. Vanilla gradient descent (with spectral initialization) achieves ε accuracy

- in O(log(1/ε)) iterations
- w.r.t. ‖·‖_F, ‖·‖, and ‖·‖_{2,∞} (incoherence)

provided that step size η ≲ 1/σ_max(M) and sample size ≳ n r³ log³ n.

- Byproduct: vanilla GD controls entrywise error: errors are spread out across all entries
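As an illustration of the setting in Theorem 2 (not the paper's implementation), the sketch below runs vanilla GD with spectral initialization on the rescaled loss f(X) = (1/4p) Σ_{(j,k)∈Ω} (e_j^⊤ X X^⊤ e_k − M_{j,k})² under assumed symmetric Bernoulli sampling; the dimensions, sampling rate, step-size constant, and iteration count are illustrative choices. It also reports the entrywise error mentioned in the byproduct.

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, p = 200, 3, 0.2                                 # dimension, rank, observation probability
X_star = rng.standard_normal((n, r))
M = X_star @ X_star.T                                 # ground-truth rank-r matrix
upper = np.triu(rng.random((n, n)) < p)
mask = upper | upper.T                                # symmetric observation pattern Omega

def grad(X):
    """Gradient of f(X) = (1/4p) * sum_{(j,k) in Omega} (e_j^T X X^T e_k - M_jk)^2."""
    E = mask * (X @ X.T - M) / p                      # rescaled P_Omega(X X^T - M)
    return (E + E.T) @ X / 2

# Spectral initialization: top-r eigenpairs of the rescaled observed matrix.
w, V = np.linalg.eigh((mask * M) / p)
X = V[:, -r:] * np.sqrt(np.maximum(w[-r:], 0))

eta = 0.2 / np.linalg.norm(M, 2)                      # step size on the order of 1/sigma_max(M)
for _ in range(300):
    X = X - eta * grad(X)

err_fro = np.linalg.norm(X @ X.T - M) / np.linalg.norm(M)
err_inf = np.abs(X @ X.T - M).max() / np.abs(M).max()
print(f"relative Frobenius error: {err_fro:.2e}   relative entrywise error: {err_inf:.2e}")
```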
Blind deconvolution
[Figure: blind deconvolution example. Fig. credits: Romberg; EngineeringsALL]

Reconstruct two signals from their convolution; equivalently, find h, x ∈ C^n s.t.

    b_k^* h x^* a_k = y_k,   1 ≤ k ≤ m
Prior art

    minimize_{x,h}   f(x, h) = Σ_{k=1}^m | b_k^* ( h x^* − h♮ x♮^* ) a_k |²

where a_k ~ i.i.d. N(0, I) and {b_k}: partial Fourier basis

Existing theory on gradient descent requires

- regularized loss + projection
  - e.g. Li, Ling, Strohmer, Wei ’16; Huang, Hand ’17; Ling, Strohmer ’17
- O(m) iterations even with regularization
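To make the measurement model and loss concrete, here is a small complex-valued sketch with illustrative sizes: the b_k are rows of a partial DFT basis, the a_k are complex Gaussian, and the Wirtinger gradients are written for the equivalent data-fit form Σ_k |b_k^* h x^* a_k − y_k|² (since h♮, x♮ are unknown in practice). This is only a sketch of the objective, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(6)
K, m = 32, 512                                        # signal dimension and number of measurements
F = np.fft.fft(np.eye(m)) / np.sqrt(m)                # unitary DFT matrix
B = F[:, :K]                                          # rows of B are the b_k^* (partial Fourier basis)
A = (rng.standard_normal((m, K)) + 1j * rng.standard_normal((m, K))) / np.sqrt(2)

h_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
x_star = rng.standard_normal(K) + 1j * rng.standard_normal(K)
y = (B @ h_star) * (A @ np.conj(x_star))              # y_k = (b_k^* h_nat)(x_nat^* a_k)

def loss(h, x):
    c = (B @ h) * (A @ np.conj(x)) - y                # residuals b_k^* h x^* a_k - y_k
    return np.sum(np.abs(c) ** 2)

def wirtinger_grads(h, x):
    """Wirtinger gradients of the loss w.r.t. conj(h) and conj(x)."""
    u = B @ h                                         # u_k = b_k^* h
    v = A @ np.conj(x)                                # v_k = x^* a_k
    c = u * v - y
    grad_h = B.conj().T @ (c * np.conj(v))            # d loss / d conj(h)
    grad_x = A.T @ (np.conj(c) * u)                   # d loss / d conj(x)
    return grad_h, grad_x

print("loss at the truth:", loss(h_star, x_star))     # ~ 0 up to round-off
gh, gx = wirtinger_grads(h_star, x_star)
print("gradient norms at the truth:", np.linalg.norm(gh), np.linalg.norm(gx))  # ~ 0
```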
Theoretical guarantees

Theorem 3 (Blind deconvolution). Suppose h♮ is incoherent w.r.t. {b_k}. Vanilla gradient descent (with spectral initialization) achieves ε accuracy in O(log(1/ε)) iterations, provided that step size η ≲ 1 and sample size m ≳ n poly log(m).

- Regularization-free
- Converges in O(log(1/ε)) iterations (vs. O(m log(1/ε)) iterations in prior theory)
Incoherence region in high dimensions
[Figure: incoherence region, 2-dimensional vs. high-dimensional (mental representation)]

- incoherence region is vanishingly small
Complicated dependencies across iterations

- Several prior sample-splitting approaches: require fresh samples at each iteration; not what we actually run in practice

[Figure: z0 → z1 → z2 → z3 → z4 → z5, each step using fresh samples]

- This work: reuses all samples in all iterations

[Figure: z0 → z1 → z2 → z3 → z4 → z5, all steps using the same samples]
Summary

- Implicit regularization: vanilla gradient descent automatically forces iterates to stay incoherent
- Enables error control in a much stronger sense (e.g. entrywise error control)

Paper: “Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution”, Cong Ma, Kaizheng Wang, Yuejie Chi, Yuxin Chen, arXiv:1711.10467