Big Data Optimization: Randomized lock-free methods for minimizing - PowerPoint PPT Presentation

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richt´ arik School of Mathematics The University of Edinburgh Joint work with Martin Tak´ aˇ c (Edinburgh) Les Houches ⋄ January 11, 2013 1 / 30

Lock-Free (Asynchronous) Updates Between the time when x is read by any given processor and an update is computed and applied to x by it, other processors apply their updates. x 6 ← x 5 + update ( x 3 ) Other processors Update x Update x Update x Update x x 3 x 6 time x 5 x 7 x 2 x 4 Read current x ( x 3 ) Compute update Write to current x ( x 5 ) Viewpoint of a single processor 2 / 30

Generic Parallel Lock-Free Algorithm In general: x j +1 = x j + update ( x r ( j ) ) ◮ r ( j ) = index of iterate current at reading time ◮ j = index of iterate current at writing time Assumption: j − r ( j ) ≤ τ τ + 1 ≈ # processors 3 / 30

The Problem and Its Structure � minimize x ∈ R | V | [ f ( x ) ≡ f e ( x )] ( OPT ) e ∈ E ◮ Set of vertices/coordinates: V ( x = ( x v , v ∈ V ) , dim x = | V | ) ◮ Set of edges: E ⊂ 2 V ◮ Set of blocks: B (a collection of sets forming a partition of V ) ◮ Assumption: f e depends on x v , v ∈ e , only Example (convex f : R 5 → R ): f ( x ) = 7( x 1 + x 3 ) 2 + 5( x 2 − x 3 + x 4 ) 2 + ( x 4 − x 5 ) 2 � �� f e 1 ( x ) f e 2 ( x ) f e 3 ( x ) V = { 1 , 2 , 3 , 4 , 5 } , | V | = 5 , e 1 = { 1 , 3 } , e 2 = { 2 , 3 , 4 } , e 3 = { 4 , 5 } 4 / 30

Applications ◮ structured stochastic optimization (via Sample Average Approximation) ◮ learning ◮ sparse least-squares ◮ sparse SVMs, matrix completion, graph cuts (see Niu-Recht-R´ e-Wright (2011)) ◮ truss topology design ◮ optimal statistical designs 5 / 30

PART 1: LOCK-FREE HYBRID SGD/RCD METHODS based on: P. R. and M. Tak´ aˇ c, Lock-free randomized first order methods, manuscript, 2013. 6 / 30

Problem-Specific Constants function definition average maximum Edge-Vertex Degree ω ′ ω e = | e | = |{ v ∈ V : v ∈ e }| ω ¯ (# vertices incident with an edge) (relevant if | B | = | V | ) Edge-Block Degree ¯ σ σ ′ σ e = |{ b ∈ B : b ∩ e � = ∅}| (# blocks incident with an edge) (relevant if | B | > 1) Vertex-Edge Degree ¯ δ ′ δ v = |{ e ∈ E : v ∈ e }| δ (# edges incident with a vertex) (not needed!) Edge-Edge Degree ¯ ρ e = |{ e ′ ∈ E : e ′ ∩ e � = ∅} ρ ′ ρ (# edges incident with an edge) (relevant if | E | > 1) Remarks: ◮ Our results depend on: ¯ σ (avg Edge-Block degree) and ¯ ρ (avg Edge-Edge degree) ◮ First and second row are identical if | B | = | V | (blocks correspond to vertices/coordinates) 7 / 30

Example     A T 5 0 − 3 1 A T 1 . 5 2 . 1 0      ∈ R 4 × 3 2 A =  =     A T 0 0 6   3 A T . 4 0 0 4 4 � f ( x ) = 1 2 � Ax � 2 2 = 1 i x ) 2 , ( A T | E | = 4 , | V | = 3 2 i =1 8 / 30

Example     A T 5 0 − 3 1 A T 1 . 5 2 . 1 0      ∈ R 4 × 3 2 A =  =     A T 0 0 6   3 A T . 4 0 0 4 4 � f ( x ) = 1 2 � Ax � 2 2 = 1 i x ) 2 , ( A T | E | = 4 , | V | = 3 2 i =1 Computation of ¯ ω and ¯ ρ : v 1 v 2 v 3 ω e i ρ e i e 1 × × 2 4 e 2 × × 2 3 e 3 × 1 2 e 4 × 1 3 ω = 2+2+1+1 ρ = 4+3+2+3 δ v j 3 1 2 ¯ = 1 . 5 , ¯ = 3 4 4 ρ e = |{ e ′ ∈ E : e ′ ∩ e � = ∅} , ω e = | e | , δ v = |{ e ∈ E : v ∈ e }| 8 / 30

Algorithm Iteration j + 1 looks as follows: x j +1 = x j − γ | E | σ e ∇ b f e ( x r ( j ) ) Viewpoint of the processor performing this iteration: ◮ Pick edge e ∈ E , uniformly at random ◮ Pick block b intersecting edge e , uniformly at random ◮ Read current x (enough to read x v for v ∈ e ) ◮ Compute ∇ b f e ( x ) ◮ Apply update: x ← x − α ∇ b f e ( x ) with α = γ | E | σ e and γ > 0 ◮ Do not wait (no synchronization!) and start again! Easy to show that E [ | E | σ e ∇ b f e ( x )] = ∇ f ( x ) 9 / 30

Main Result Setup: ◮ c = strong convexity parameter of f ◮ L = Lipschitz constant of ∇ f ◮ �∇ f ( x ) � 2 ≤ M for x visited by the method ◮ Starting point: x 0 ∈ R | V | ◮ 0 < ǫ < L 2 � x 0 − x ∗ � 2 2 c ǫ ◮ constant stepsize: γ := ρ/ | E | ) L 2 M 2 (¯ σ +2 τ ¯ 10 / 30

Main Result Setup: ◮ c = strong convexity parameter of f ◮ L = Lipschitz constant of ∇ f ◮ �∇ f ( x ) � 2 ≤ M for x visited by the method ◮ Starting point: x 0 ∈ R | V | ◮ 0 < ǫ < L 2 � x 0 − x ∗ � 2 2 c ǫ ◮ constant stepsize: γ := ρ/ | E | ) L 2 M 2 (¯ σ +2 τ ¯ Result: Under the above assumptions, for � � LM 2 � L � x 0 − x ∗ � 2 � σ + 2 τ ¯ ρ 2 k ≥ ¯ c 2 ǫ log − 1 , | E | ǫ we have 0 ≤ j ≤ k E { f ( x j ) − f ∗ } ≤ ǫ. min 10 / 30

Special Cases � � LM 2 2 L � x 0 − x ∗ � 2 σ + 2 τ ¯ ρ General result: (¯ | E | ) c 2 ǫ log − 1 ǫ � �� Λ common to all special cases special case lock-free parallel version of . . . Λ | B | + 2 τ | E | = 1 Randomized Block Coordinate Descent | E | Incremental Gradient Descent 1 + 2 τ ¯ ρ | B | = 1 | E | (Hogwild! as implemented) RAINCODE: RAndomized INcremental ω + 2 τ ¯ ρ | B | = | V | COordinate DEscent ¯ | E | (Hogwild! as analyzed) | E | = | B | = 1 Gradient Descent 1 + 2 τ 11 / 30

Analysis via a New Recurrence Let a j = 1 2 E [ � x j − x ∗ � 2 ] Nemirovski-Juditsky-Lan-Shapiro: a j +1 ≤ (1 − 2 c γ j ) a j + 1 2 γ 2 j M 2 Niu-Recht-R´ e-Wright (Hogwild!): √ 2 c ω ′ M τ ( δ ′ ) 1 / 2 ) a 1 / 2 a j +1 ≤ (1 − c γ ) a j + γ 2 ( 2 γ 2 M 2 Q , + 1 j Q = ω ′ + 2 τ ρ ′ | E | + 4 ω ′ ρ ′ | E | τ + 2 τ 2 ( ω ′ ) 2 ( δ ′ ) 1 / 2 where R.-Tak´ aˇ c: a j +1 ≤ (1 − 2 c γ ) a j + 1 2 γ 2 (¯ | E | ) M 2 ρ ¯ σ + 2 τ 12 / 30

Parallelization Speedup Factor Λ of serial version ¯ σ 1 PSF = (Λ of parallel version) /τ = | E | ) /τ = ρ ¯ 1 2¯ ρ (¯ σ + 2 τ τ + ¯ σ | E | 13 / 30

Parallelization Speedup Factor Λ of serial version ¯ σ 1 PSF = (Λ of parallel version) /τ = | E | ) /τ = ρ ¯ 1 2¯ ρ (¯ σ + 2 τ τ + ¯ σ | E | Three modes: ◮ Brute force (many processors; τ very large): PSF ≈ ¯ σ | E | 2¯ ρ ρ ¯ ◮ Favorable structure ( σ | E | ≪ 1 τ ; fixed τ ): ¯ PSF ≈ τ ◮ Special τ ( τ = | E | ρ ): ¯ PSF = | E | σ ¯ σ + 2 ≈ τ ρ ¯ ¯ 13 / 30

Improvements vs Hogwild! If | B | = | V | (blocks = coordinates), then our method coincides with Hogwild! (as analyzed in Niu et al), up to stepsize choice: x j +1 = x j − γ | E | ω e ∇ v f e ( x r ( j ) ) 14 / 30

Improvements vs Hogwild! If | B | = | V | (blocks = coordinates), then our method coincides with Hogwild! (as analyzed in Niu et al), up to stepsize choice: x j +1 = x j − γ | E | ω e ∇ v f e ( x r ( j ) ) Niu-Recht-R´ e-Wright (Hogwild!, 2011): Λ = 4 ω ′ + 24 τ ρ ′ | E | + 24 τ 2 ω ′ ( δ ′ ) 1 / 2 R.-Tak´ aˇ c: ω + 2 τ ¯ ρ Λ = ¯ | E | 14 / 30

Improvements vs Hogwild! If | B | = | V | (blocks = coordinates), then our method coincides with Hogwild! (as analyzed in Niu et al), up to stepsize choice: x j +1 = x j − γ | E | ω e ∇ v f e ( x r ( j ) ) Niu-Recht-R´ e-Wright (Hogwild!, 2011): Λ = 4 ω ′ + 24 τ ρ ′ | E | + 24 τ 2 ω ′ ( δ ′ ) 1 / 2 R.-Tak´ aˇ c: ω + 2 τ ¯ ρ Λ = ¯ | E | Advantages of our approach: ◮ Dependence on averages and not maxima! ( ω ′ → ¯ ω , ρ ′ → ¯ ρ ) ◮ Better constants (4 → 1, 24 → 2) ◮ The third large term is not present (no dependence on τ 2 and δ ′ ) ◮ Introduction of blocks ( ⇒ cover also block coordinate descent, gradient descent, SGD) ◮ Simpler analysis 14 / 30

Modified Algorithm: Global Reads and Local Writes ∗ Partition vertices (coordinates) into τ + 1 blocks V = b 1 ∪ b 2 ∪ · · · ∪ b τ +1 and assign block b i to processor i , i = 1 , 2 , . . . , τ + 1. Processor i will (asynchronously) do: ◮ Pick edge e ∈ { e ′ ∈ E : e ′ ∩ b i � = ∅} , uniformly at random (edge intersecting with block owned by processor i ) ◮ Update: x j +1 = x j − α ∇ b i f e ( x r ( j ) ) Pros and cons: ◮ + good if global reads and local writes are cheap, but global writes are expensive (NUMA = Non Uniform Memory Access) ◮ - do not have an analysis ∗ Idea proposed by Ben Recht. 15 / 30

Experiment 1: rcv size = 1.2 GB, features = | V | = 47,236, training: | E | = 677,399, testing: 20,242 0.12 1 CPU, Asyn. 1 CPU, Syn. 0.11 4 CPU, Asyn. 4 CPU, Syn. 0.1 16 CPU, Asyn. 16 CPU, Syn. 0.09 Train Error 0.08 0.07 0.06 0.05 0.04 0.03 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Epoch 16 / 30

Experiment 2 Artificial problem instance: m � 2 � Ax � 2 = f ( x ) = 1 1 2 ( A T i x ) 2 . minimize i =1 A ∈ R m × n ; m = | E | = 500 , 000; n = | V | = 50 , 000 Three methods: ◮ Synchronous, all = parallel synchronous method with | B | = 1 ◮ Asynchronous, all = parallel asynchronous method with | B | = 1 ◮ Asynchronous, block = parallel asynchronous method with | B | = τ (no need for atomic operations ⇒ additional speedup) We measure elapsed time needed to perform 20 m iterations (20 epochs) 17 / 30

Uniform instance: | e | = 10 for all edges 18 / 30

Big Data Optimization: Randomized lock-free methods for minimizing - PowerPoint PPT Presentation

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richt arik School of Mathematics The University of Edinburgh Joint work with Martin Tak a c (Edinburgh) Les Houches

Lock-Free, Wait-Free and Multi-core Programming Roger Deran boilerbay.com Fast, Efficient

1 Reader/Writer Lock: Second Try Reader/Writer Lock: Second Try Guidelines for Condition

LOCK/WAIT FREE SYNCHRONIZATION Synchronization Mutex Blocking Lock-free At

Randomized Algorithms Randomized Algorithms Two Types of Randomized Algorithms Two Types of

Machine Learning Anders Holst SICS Big Data Analytics Analysis Big Data Big Value Big Data

From Lock-Free to Wait-Free: Linked List Edward Duong Outline 1) Outline operations of the

Avoiding Vendor Lock-In Avoiding Vendor Lock-In Using Apache Libcloud Using Apache Libcloud

Concurrency Problems Thierry Sans (recap) Lock A lock is an object in memory providing two atomic

Synchronization: Going Deeper Synchronization: Going Deeper SharedLock : Reader/Writer Lock :

Big Data Algorithms with Medical Applications Yixin Chen Outline Challenges to big data

CSC373 Week 11: Randomized Algorithms 373F19 - Nisarg Shah & Karan Singh 1 Randomized

Transactional Memory: Architectural support for Lock-Free Data Structure Transactional Memory:

Thread-Modular Reasoning for Lock-Free Data Structures Roland Meyer based on joint work with

Analyzing the Performance of Lock-Free Data Structures: A Conflict-based Model Aras Atalar, Paul

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara CS535 Big Data | Computer Science Department

COMP9313: Big Data Management Introduction to Big Data Management What is big data? Tweeted by

Random Subwindows for Robust Image Classification Rapha el Mar ee, Pierre Geurts, Justus

Sublinear Algorithms Lecture 5 Sofya Raskhodnikova Penn State University Thanks to Madhav Jha

Mixed models in R using the lme4 package Part 2: Longitudinal data, modeling interactions Douglas

Expected Value Lecture A Tiefenbruck MWF 9-9:50am Center 212 Lecture B Jones MWF 2-2:50pm

Just-in-Time Code Reuse The more things change, the more they stay the same Kevin Z. Snow 1 Luca

Chapter 9 Object recognition Random Forests 9.9 Random forests 2 9.9 Random forests

Using Randomized Controlled Trials in Criminal Justice Gipsy Escobar, PhD June 8 th , 2016

Random Projections for Dimensionality Reduction: Some Theory and Applications Robert J. Durrant

Big Data Optimization: Randomized lock-free methods for minimizing - PowerPoint PPT Presentation

Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions Peter Richt arik School of Mathematics The University of Edinburgh Joint work with Martin Tak a c (Edinburgh) Les Houches

Lock-Free, Wait-Free and Multi-core Programming Roger Deran boilerbay.com Fast, Efficient

1 Reader/Writer Lock: Second Try Reader/Writer Lock: Second Try Guidelines for Condition

LOCK/WAIT FREE SYNCHRONIZATION Synchronization Mutex Blocking Lock-free At

Randomized Algorithms Randomized Algorithms Two Types of Randomized Algorithms Two Types of

Machine Learning Anders Holst SICS Big Data Analytics Analysis Big Data Big Value Big Data

From Lock-Free to Wait-Free: Linked List Edward Duong Outline 1) Outline operations of the

Avoiding Vendor Lock-In Avoiding Vendor Lock-In Using Apache Libcloud Using Apache Libcloud

Concurrency Problems Thierry Sans (recap) Lock A lock is an object in memory providing two atomic

Synchronization: Going Deeper Synchronization: Going Deeper SharedLock : Reader/Writer Lock :

Big Data Algorithms with Medical Applications Yixin Chen Outline Challenges to big data

CSC373 Week 11: Randomized Algorithms 373F19 - Nisarg Shah &amp; Karan Singh 1 Randomized

Transactional Memory: Architectural support for Lock-Free Data Structure Transactional Memory:

Thread-Modular Reasoning for Lock-Free Data Structures Roland Meyer based on joint work with

Analyzing the Performance of Lock-Free Data Structures: A Conflict-based Model Aras Atalar, Paul

CS535 Big Data 1/22/2020 Sangmi Lee Pallickara CS535 Big Data | Computer Science Department

COMP9313: Big Data Management Introduction to Big Data Management What is big data? Tweeted by

Random Subwindows for Robust Image Classification Rapha el Mar ee, Pierre Geurts, Justus

Sublinear Algorithms Lecture 5 Sofya Raskhodnikova Penn State University Thanks to Madhav Jha

Mixed models in R using the lme4 package Part 2: Longitudinal data, modeling interactions Douglas

Expected Value Lecture A Tiefenbruck MWF 9-9:50am Center 212 Lecture B Jones MWF 2-2:50pm

Just-in-Time Code Reuse The more things change, the more they stay the same Kevin Z. Snow 1 Luca

Chapter 9 Object recognition Random Forests 9.9 Random forests 2 9.9 Random forests

Using Randomized Controlled Trials in Criminal Justice Gipsy Escobar, PhD June 8 th , 2016

Random Projections for Dimensionality Reduction: Some Theory and Applications Robert J. Durrant

CSC373 Week 11: Randomized Algorithms 373F19 - Nisarg Shah & Karan Singh 1 Randomized