SLIDE 1

Random Methods for Large-Scale Linear Problems, Variational Inequalities, and Convex Optimization

Doctoral Thesis Defense Mengdi Wang

Laboratory for Information and Decision Systems (LIDS) Massachusetts Institute of Technology

April 1st, 2013

1/38

SLIDE 2

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

A Roadmap 2/38

SLIDE 3

The Broader Context of Our Work: Large-Scale Problems

Linear Systems: Ax = b or E[A_v] x = E[b_v] (inverse problems, regression, statistical learning, approximate DP)

Linear & Quadratic Programming: min_{Ax ≤ b} x′Qx + c′x (approximate DP, high-performance computation)

Complementarity Problems (equilibria, projected equations)

Convex Problems & Variational Inequalities: min_{x ∈ ∩_i X_i} Σ_i f_i(x) (networks, data-driven problems, cooperative games, online decision making)

Address large-scale problems by randomization/simulation

A Roadmap 3/38

SLIDE 4

Use Stochastic Methods to Tackle Large Scale

How to obtain random samples?
  - Importance sampling
  - Adaptive sampling
  - Monte Carlo methods
  - Application/implementation-dependent methods: asynchronous, distributed, irregular, unknown random process, etc.

How to use random samples?
  - Stochastic approximation
  - Sample average approximation
  - Use Monte Carlo estimates to iterate
  - Modify deterministic methods to allow stochasticity

A Roadmap 4/38

SLIDE 5

Our work

Part 1: Large-scale linear systems Ax = b
  - Deal with the joint effect of singularity and stochastic noise
  - Stabilize divergent iterative methods

Part 2: Large-scale optimization problems with complicated constraints
  - Combine optimization and feasibility methods with randomness
  - Incremental/online structure:
      - updating based on a part of all constraint/gradient information
      - using minimal storage to deal with large data sets
      - allowing various sources of stochasticity
  - Coupled convergence: x_k → x* vs. x_k → X

A Roadmap 5/38

SLIDE 6

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Stochastic Methods for Linear Systems 6/38

SLIDE 7

Solving linear systems Ax = b by stochastic sampling

Assume that A = E[A_w], b = E[b_v], and that a sequence of samples {(A_{w_k}, b_{v_k})} is available.

Stochastic Approximation (SA):
    x_{k+1} = x_k − α_k (A_{w_k} x_k − b_{v_k})
Using one sample per update is too slow!

Sample Average Approximation (SAA): obtain finite-sample estimates
    A_k = (1/k) Σ_{t=1}^k A_{w_t},   b_k = (1/k) Σ_{t=1}^k b_{v_t},
then solve A_k x = b_k.
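To make the SA-vs-SAA contrast concrete, here is a minimal sketch (not from the slides; the system, noise model, and sample counts are all made up for illustration) on a small well-conditioned linear system observed through zero-mean noise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
S = rng.standard_normal((n, n))
A = np.eye(n) + 0.05 * (S + S.T)          # symmetric, well-conditioned "true" A
x_star = rng.standard_normal(n)
b = A @ x_star

def sample():
    # one noisy observation pair (A_wk, b_vk) with zero-mean noise
    return (A + 0.5 * rng.standard_normal((n, n)),
            b + 0.5 * rng.standard_normal(n))

# SA: one sample per update, diminishing stepsize
x_sa = np.zeros(n)
for k in range(20000):
    Ak, bk = sample()
    x_sa = x_sa - (1.0 / (k + 1)) * (Ak @ x_sa - bk)

# SAA: average all samples first, then solve A_k x = b_k once
A_sum, b_sum = np.zeros((n, n)), np.zeros(n)
for _ in range(20000):
    Ak, bk = sample()
    A_sum += Ak
    b_sum += bk
x_saa = np.linalg.solve(A_sum / 20000, b_sum / 20000)

err_sa = np.linalg.norm(x_sa - x_star)
err_saa = np.linalg.norm(x_saa - x_star)
```

Both estimates improve at the Monte Carlo rate 1/√k; the next slide asks whether the SAA estimates can be used inside a faster iteration.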

Stochastic Methods for Linear Systems 7/38

SLIDE 8

Can we do better?

Using Monte Carlo Estimates: given A_k → A and b_k → b a.s. at a rate of 1/√k, iterate as
    x_{k+1} = x_k − γG(A_k x_k − b_k)
If ρ(I − γGA) < 1 ⇛ geometric convergence!

Not working if (close to) singular! (Wang and Bertsekas, 2011)
Divergence rate: ‖x_k‖ ∼ e^{√k} and ‖Ax_k − b‖ ∼ e^{√k}, w.p.1.
Based on random samples of A, we cannot detect the (near) singularity.
We still like the nonsingular part of the system.

Stochastic Methods for Linear Systems 8/38

SLIDE 9

Deal with singularity under noise

Stabilized Iterations (Wang and Bertsekas, 2011): given A_k → A and b_k → b a.s. at a rate of 1/√k, instead of the (divergent) iteration
    x_{k+1} = x_k − γG(A_k x_k − b_k)
add a stabilization term to deal with singularity and multiplicative noise:
    x_{k+1} = (1 − δ_k) x_k − γG(A_k x_k − b_k)
where δ_k ↓ 0, Σ δ_k = ∞, and δ_k ≫ noise. Then x_k → some x* a.s.

Proximal Iteration Naturally Converges (Wang and Bertsekas, 2011):
    x_{k+1} = argmin_x ‖A_k x − b_k‖² + λ‖x − x_k‖²
Then ‖Ax_k − b‖ → 0 a.s., and we can extract a subsequence x̂_k → some x* a.s.
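A toy numerical sketch of the stabilization idea (the system, noise scale, and δ_k schedule are illustrative, not from the thesis): a singular but consistent 2×2 system observed with simulation noise decaying like 1/√k.

```python
import numpy as np

rng = np.random.default_rng(1)
# Singular but consistent system: second row and column of A are zero
A = np.array([[1.0, 0.0],
              [0.0, 0.0]])
b = np.array([1.0, 0.0])
gamma = 0.5
G = np.eye(2)

x_plain = np.zeros(2)   # x_{k+1} = x_k - gamma G (A_k x_k - b_k): can wander in the null space
x_stab = np.zeros(2)    # x_{k+1} = (1 - delta_k) x_k - gamma G (A_k x_k - b_k)
for k in range(1, 5001):
    Ak = A + rng.standard_normal((2, 2)) / np.sqrt(k)   # simulation noise ~ 1/sqrt(k)
    bk = b + rng.standard_normal(2) / np.sqrt(k)
    delta = k ** -0.4   # delta_k -> 0, sum delta_k = inf, and delta_k >> 1/sqrt(k)
    x_plain = x_plain - gamma * (G @ (Ak @ x_plain - bk))
    x_stab = (1 - delta) * x_stab - gamma * (G @ (Ak @ x_stab - bk))

res_stab = np.linalg.norm(A @ x_stab - b)   # residual of the stabilized iterate
```

The stabilized iterate keeps the residual of the nonsingular part small, while the unstabilized iterate has no mechanism pulling its null-space component back.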

Stochastic Methods for Linear Systems 9/38

SLIDE 10

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Stochastic Methods for Large Scale COP & VI Motivation 10/38

SLIDE 11

The problems

Convex Optimization Problems (COP):
    min_{x ∈ X} F(x)
where F : ℜ^n → ℜ is convex and continuously differentiable.

Variational Inequalities (VI): find x* ∈ X such that
    G(x*)′(x − x*) ≥ 0,  ∀ x ∈ X
where G : ℜ^n → ℜ^n is strongly monotone, i.e., for some σ > 0,
    (y − x)′(G(y) − G(x)) ≥ σ‖x − y‖²,  ∀ x, y

VI = COP if G(x) = ∇F(x).
Equilibria / LP / Projected equations / Complementarity problems

Stochastic Methods for Large Scale COP & VI Motivation 11/38

SLIDE 12

We focus on large-scale problems with incremental structure

Linearly Additive Objectives
    COP: F(x) = Σ_{i=1}^r F_i(x)  or  F(x) = E[f(x, v)]
    VI:  G(x) = Σ_{i=1}^r G_i(x)  or  G(x) = E[g(x, v)]

Set Intersection Constraints
    X = ∩_{i=1}^m X_i,  where each X_i is closed and convex

Applications: machine learning, distributed optimization, computing Nash equilibria

Stochastic Methods for Large Scale COP & VI Motivation 12/38

SLIDE 13

Difficulty with practical large-scale problems

Operating with X = ∩_i X_i is difficult, especially for:
  - Big data-driven problems with a huge number of constraints stored on external hard drives
  - Distributed problems where each agent can only access part of all constraints
  - Stochastic process-driven problems whose constraints involve a random process only available through simulation

Question: Why not replace X with a single X_i?

Stochastic Methods for Large Scale COP & VI Motivation 13/38

SLIDE 14

Putting two ideas together

  - Gradient projection
  - Alternate projection

Stochastic Methods for Large Scale COP & VI Motivation 14/38

SLIDE 15

Related works

Incremental COP: min_{x∈X} F(x) by x_{k+1} = Π_X[x_k − α g(x_k, v_k)]
  - stochastic gradient projection (Nedić and Bertsekas 2001, etc.)
  - incremental proximal (Bertsekas 2010, etc.)
  - incremental gradient with random projection (Nedić 2011)

Feasibility Problems: finding x ∈ ∩_{i∈M} X_i by x_{k+1} = Π_{X_{w_k}} x_k
  - alternate/cyclic projection (Gubin 1967, Tseng 1990, Deutsch and Hundal 2006-2008, Lewis 2008, etc.)
  - random projection (Nedić 2010)
  - super-halfspace projection (Censor 2008, etc.)

Stochastic Methods for Large Scale COP & VI Motivation 15/38

SLIDE 16

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 16/38

SLIDE 17

Existing methods

Gradient/Subgradient Projection Method for COP:
    x_{k+1} = Π_X[x_k − α_k ∇F(x_k)]

Projection Method for VI:
    x_{k+1} = Π_X[x_k − α_k G(x_k)]

Stochastic Gradient Projection Method for COP / Projection Method for Stochastic VI:
    x_{k+1} = Π_X[x_k − α_k g(x_k, v_k)]

Proximal Method for COP:
    x_{k+1} = argmin_{x∈X} { F(x) + (1/(2α_k)) ‖x − x_k‖² }

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 17/38
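As a concrete (made-up) instance of the gradient projection method above: minimizing a smooth convex quadratic over a box, where the Euclidean projection is a componentwise clip.

```python
import numpy as np

# minimize F(x) = ||x - c||^2 over the box X = [0, 1]^3
c = np.array([1.5, -0.3, 0.4])

def proj(x):
    # Euclidean projection onto the box is a componentwise clip
    return np.clip(x, 0.0, 1.0)

def grad(x):
    return 2.0 * (x - c)

x = np.zeros(3)
alpha = 0.25            # constant stepsize, safely below 1/L with L = 2
for _ in range(200):
    x = proj(x - alpha * grad(x))

# fixed point is the projection of c onto the box: (1.0, 0.0, 0.4)
```

Each iteration requires a projection onto the full set X, which is cheap here but is exactly the expensive step when X = ∩_i X_i, motivating the incremental framework that follows.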

SLIDE 18

The general random incremental algorithm

A Two-Step Algorithm
  - Optimality update: z_k = x_k − α_k g(x̄_k, v_k), with x̄_k = x_k or x_{k+1}
  - Feasibility update: x_{k+1} = (1 − β_k) z_k + β_k Π_{X_{w_k}} z_k

When β_k = 1:
    x_{k+1} = Π_{X_{w_k}}[x_k − α_k g(x̄_k, v_k)],  x̄_k ∈ {x_k, x_{k+1}}

Analytical difficulty: x_k is no longer feasible!
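A runnable sketch of the two-step update with β_k = 1, the exact gradient used as the sample g, and a uniformly random halfspace constraint drawn each iteration (the problem instance is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# minimize F(x) = ||x - c||^2 over X = X1 ∩ X2, two halfspaces a_i'x <= b_i
c = np.array([2.0, 2.0])
halfspaces = [(np.array([1.0, 0.0]), 1.0),   # X1: x[0] <= 1
              (np.array([0.0, 1.0]), 0.5)]   # X2: x[1] <= 0.5

def proj_halfspace(z, a, bi):
    # Euclidean projection onto {x : a'x <= bi}
    viol = a @ z - bi
    return z - (viol / (a @ a)) * a if viol > 0 else z

x = np.zeros(2)
for k in range(1, 3001):
    alpha = 1.0 / k                          # diminishing optimality stepsize
    beta = 1.0                               # beta_k = 1: full projection step
    z = x - alpha * 2.0 * (x - c)            # optimality update, g = grad F
    a, bi = halfspaces[rng.integers(2)]      # random constraint X_{w_k}
    x = (1 - beta) * z + beta * proj_halfspace(z, a, bi)

# x should approach the constrained minimum (1.0, 0.5)
```

Note the analytical difficulty mentioned on the slide is visible here: after projecting onto only one of the two halfspaces, the iterate is generally infeasible for the other.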

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 18/38

SLIDE 19

Special cases of the general algorithm

Projection algorithm using random projection and stochastic gradient:
    x_{k+1} = Π_{X_{w_k}}[x_k − α_k g(x_k, v_k)]

Proximal algorithm using random constraint and random cost function:
    x_{k+1} = argmin_{x ∈ X_{w_k}} { F(x, v_k) + (1/(2α_k)) ‖x − x_k‖² }

Variations that alternate between proximal and projection steps

Successive projection algorithm:
    x_{k+1} = Π_{X_{w_k}} x_k
Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 19/38

SLIDE 20

Sampling schemes for X_{w_k}

Nearly independent samples, by random sampling such that
    inf_{k≥0} P(w_k = i | F_k) > 0,  i = 1, …, m

Cyclic samples, by cyclic selection or random shuffling, such that {X_{w_k}} consists of permutations of {X_1, …, X_m}

Most distant constraint sets, by adaptively selecting X_{w_k} such that
    w_k = argmax_{i=1,…,m} ‖x_k − Π_{X_i} x_k‖

Markov samples, by generating X_{w_k} through a recurrent Markov chain with states {X_i}_{i=1}^m

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 20/38

SLIDE 21

Sampling schemes for g(x_k, v_k)

Unbiased samples, by random sampling such that
    E[g(x, v_k) | F_k] = G(x),  ∀ x, k ≥ 0, w.p.1

Cyclic samples, by cyclic selection or random shuffling of component functions such that
    Avg_{k ∈ cycle} E[g(x, v_k) | F_beginning] = G(x),  ∀ x, w.p.1

Markov samples, by generating v_k through an irreducible Markov chain with invariant distribution ξ, such that
    E_{v∼ξ}[g(x, v)] = G(x),  ∀ x

Stochastic Methods for Large Scale COP & VI A Unified Algorithmic Framework 21/38

SLIDE 22

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 22/38

SLIDE 23

Almost sure convergence

Theorem (Wang and Bertsekas, 2012 and 2013): Under suitable assumptions, let w_k, v_k be generated by any combination of the preceding sampling schemes. Then the algorithm
    z_k = x_k − α_k g(x̄_k, v_k),  with x̄_k = x_k or x_{k+1}
    x_{k+1} = (1 − β_k) z_k + β_k Π_{X_{w_k}} z_k
generates iterates such that x_k → some x* a.s.

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 23/38

SLIDE 24

Assumptions

Stochastic Lipschitz continuity of the gradients: ∀ x, y,
    E[‖g(x, v_k) − g(y, v_k)‖² | F_k] ≤ L² ‖x − y‖²,  w.p.1

Regularity of constraints: there exists η > 0 such that
    ‖x − Πx‖² ≤ η max_{i=1,…,m} ‖x − Π_{X_i} x‖²,  ∀ x

Stepsizes:
    Σ_{k=0}^∞ α_k = ∞,   Σ_{k=0}^∞ α_k² < ∞,   Σ_{k=0}^∞ α_k² / ((2 − β_k) β_k) < ∞   (implying β_k ≫ α_k)

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 24/38

SLIDE 25

Proof outline: two timescales

Strongly monotone VI or strongly convex optimization: let w_k, v_k be uniform and i.i.d., let β_k = 1, and let x* be arbitrary.

Feasibility Improvement Inequality (ρ < 1):
    E[d²_X(x_{k+1}) | F_k] ≤ ρ d²_X(x_k) + α_k² O(‖x_k − x*‖² + 1),  w.p.1

Optimality Improvement Inequality:
    E[‖x_{k+1} − x*‖² | F_k] ≤ (1 − O(α_k)) ‖x_k − x*‖² + O(α_k) d_X(x_k),  w.p.1

Apply a supermartingale convergence argument to these relations:
    x_k → X at a geometric rate  vs.  x_k → x* at a stochastic approximation rate

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 25/38

SLIDE 26

Generalize the analysis

Coupled Supermartingale Convergence (Wang and Bertsekas, 2013): Let {ξ_t}, {ζ_t}, {u_t}, {ū_t}, {η_t}, {θ_t}, {ǫ_t}, {µ_t}, and {ν_t} be sequences of nonnegative random variables such that
    E[ξ_{t+1} | G_t] ≤ (1 + η_t) ξ_t − u_t + c θ_t ζ_t + µ_t,
    E[ζ_{t+1} | G_t] ≤ (1 − θ_t) ζ_t − ū_t + ǫ_t ξ_t + ν_t,
where G_t denotes the collection {ξ_k, ζ_k, u_k, ū_k, η_k, θ_k, ǫ_k, µ_k, ν_k}_{k=1}^t and c > 0. Also, let Σ_{t=0}^∞ (η_t + ǫ_t + µ_t + ν_t) < ∞ with probability 1.

Then ξ_t and ζ_t converge almost surely to nonnegative random variables, and
    Σ_{t=0}^∞ (u_t + ū_t) < ∞,  w.p.1.

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 26/38
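A deterministic toy instance of the coupled recursions (with u_t = ū_t = 0, c = 1, θ_t ≡ 0.5, and all perturbation sequences set to 1/t², chosen only so the summability conditions of the theorem are visibly satisfied):

```python
import math

# Deterministic instance of the coupled recursions:
#   xi_{t+1}   = (1 + eta_t) xi_t + theta * zeta_t + mu_t
#   zeta_{t+1} = (1 - theta) zeta_t + eps_t * xi_t + nu_t
xi, zeta = 5.0, 3.0
theta = 0.5                              # constant coupling/contraction rate
for t in range(1, 20001):
    eta = eps = mu = nu = 1.0 / t**2     # summable perturbation sequences
    xi, zeta = ((1 + eta) * xi + theta * zeta + mu,
                (1 - theta) * zeta + eps * xi + nu)

# xi settles to a finite limit; zeta is driven to (near) zero
```

The second sequence contracts at the fast rate θ while feeding only a summable amount (ǫ_t ξ_t) back into the first, which is exactly the feasibility/optimality coupling used in the convergence proof.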

SLIDE 27

The Coupled Convergence Theorem

Theorem (Wang and Bertsekas, 2013): Under the preceding assumptions, let w_k, v_k be such that for some N, M > 0,
    E[d²_X(x_{k+M}) | F_k] ≤ (1 − O(β_k)) d²_X(x_k) + α_k² O(‖x_k − x*‖² + 1)
    E[‖x_{k+N} − x*‖² | F_k] ≤ ‖x_k − x*‖² − O(α_k)(F(x_k) − F(x*)) + O(α_k) d_X(x_k)
Then the algorithm generates iterates such that x_k → some x* a.s.

Since β_k ≫ α_k, convergence to X is faster than convergence to x*.
Modular architecture: extendable to more algorithms and sampling schemes.

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 27/38

SLIDE 28

Error bounds & convergence rates

Theorem (strongly convex COP / strongly monotone VI with strong convexity constant σ): Under the same assumptions, there exists a random variable N > 0 such that
    min_{0≤k≤N} { ‖x_k − x*‖² − O(α_k / β_k) } ≤ ǫ,
where
    E[Σ_{k=0}^{N−1} α_k] ≤ ‖x_0 − x*‖² / (2σǫ)

If constant stepsizes α_k = α, β_k = β are used,
    lim inf_{k→∞} ‖x_k − x*‖² ≤ O(α / (σβ)),  w.p.1

When α_k ≈ O(1/√k): error bound ≈ O(1/√k), complexity bound ≈ O(1/ǫ²).

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 28/38

SLIDE 29

Constant factor in the error bound - comparison of constraint sampling schemes

Strongly convex COP / strongly monotone VI, number of constraints = m:

    Adaptive constraint selection    O(1)
    IID uniform sampling             O(m)
    Random shuffling                 between O(m) and O(m³)
    Deterministic cyclic sampling    O(m³)
    Markov sampling                  depends on the mixing rate and invariant distribution

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 29/38

SLIDE 30

Example: Estimate the invariant distribution ξ of a 1000-state Markov chain P

Approximate ξ = P′ξ by ξ ≈ Φx using the projected version
    Φx = Π_C P′Φx,  where C = {Φx | x ∈ ℜ²⁰, Φx ≥ 0, e′Φx = 1}

VI formulation:
    (x − x*)′Ax* ≥ 0,  ∀ x ∈ ℜ²⁰ s.t. Φx ≥ 0, e′Φx = 1
where
    A = Φ′Ξ(I − P′)Φ = Σ_{i=1}^{1000} Σ_{j=1}^{1000} (ξ_i φ_i φ_i′ − ξ_i p_{ij} φ_i φ_j′)

1000 states, 20 features, 1001 constraints, 10⁶ components.
A is not available but can be estimated by simulating the Markov chain.

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 30/38
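Since A and ξ are only reachable through simulation, such quantities are estimated from a trajectory of the chain. A scaled-down sketch of the simulation step (a 3-state chain with made-up transition probabilities standing in for the 1000-state chain):

```python
import numpy as np

rng = np.random.default_rng(4)

# 3-state stand-in for the 1000-state chain; transition matrix is made up
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

# Exact invariant distribution: left eigenvector of P for eigenvalue 1
w, V = np.linalg.eig(P.T)
xi = np.real(V[:, np.argmax(np.real(w))])
xi = xi / xi.sum()

# Simulation estimate: long-run occupation frequencies of a single trajectory
state = 0
counts = np.zeros(3)
for _ in range(100000):
    counts[state] += 1
    state = rng.choice(3, p=P[state])
xi_hat = counts / counts.sum()
```

The same trajectory also yields sample terms ξ_i φ_i φ_i′ − ξ_i p_{ij} φ_i φ_j′ of A, one per observed transition, at the Monte Carlo rate.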

SLIDE 31

Estimated distribution Φx vs. underlying invariant distribution ξ

[Figure: the estimated distribution Φx* plotted against the underlying invariant distribution ξ across the 1000 states; values on the order of 10⁻³.]

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 31/38

SLIDE 32

Compare the convergence rates

[Figure: log-log plot of ‖x_k − x*‖ versus iteration k (up to 10⁶ iterations), comparing: batch f / i.i.d. projection; i.i.d. f / i.i.d. projection; cyclic f / i.i.d. projection; i.i.d. f / cyclic projection; cyclic f / cyclic projection.]

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 32/38

SLIDE 33

Extensions: Assumptions that can be relaxed

Extend to nonsmooth optimization:
  - "F continuously differentiable" ⇛ F bounded by a quadratic function
  - use random samples of gradients ⇛ subgradients

Alternative assumptions for {X_i}_{i=1}^m:
  - linear regularity condition ⇛ ∩_i X_i has nonempty interior

Allow X = ∩_{i=1}^∞ X_i?
  - as long as each Π_{w_k} moves x_k by a sufficient distance

Extend to arbitrary convex constraint sets by super-halfspace projection

Stochastic Methods for Large Scale COP & VI The Coupled Convergence Process 33/38

SLIDE 34

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Summary 34/38

SLIDE 35

Summary

Randomization/simulation-based methods for large-scale problems:
  - A special case, linear systems Ax = b: deal with singularity under Monte Carlo noise
  - A general framework: optimization algorithms + feasibility algorithms
      - Problem: COP & VIs with additive objectives & intersection constraints
      - Algorithm: update using random samples of objectives/gradients & constraints
      - Flexibility in implementation & sampling (distributed, asynchronous, adaptive, …)
      - Optimality improvement is coupled with feasibility improvement
      - Convergence rates: adaptive constraint selection > i.i.d. uniform sampling ≈ random shuffling > deterministic cyclic sampling

Summary 35/38

SLIDE 36

1. A Roadmap
2. Stochastic Methods for Linear Systems
3. Stochastic Methods for Convex Optimization & Variational Inequalities
   - Motivation
   - A Unified Algorithmic Framework
   - The Coupled Convergence Process
4. Summary
5. Acknowledgement

Acknowledgement 36/38

SLIDE 37

Acknowledgement

Advisor: Prof. Dimitri Bertsekas

  • Prof. John Tsitsiklis, Prof. Devavrat Shah

Friends: Austin Collins, Lei Dai, Dawsen Huang, Ying-Zong Huang, Ying Liu, Rufan Luo, Yuan Luo, Beipeng Mu, Shen Shen, Wenzhe Wei, Bingxiao Wu, Ming Yang Family

Acknowledgement 37/38

SLIDE 38

The end

Thank You Very Much! Any Question is Welcome :-)

Acknowledgement 38/38