
Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions. Peter Richtárik, School of Mathematics, The University of Edinburgh. Joint work with Martin Takáč (Edinburgh). Les Houches, January 11, 2013.


  1. Big Data Optimization: Randomized lock-free methods for minimizing partially separable convex functions
  Peter Richtárik, School of Mathematics, The University of Edinburgh
  Joint work with Martin Takáč (Edinburgh)
  Les Houches ⋄ January 11, 2013

  2. Lock-Free (Asynchronous) Updates
  Between the time when x is read by a given processor and the time that processor's update is computed and applied to x, other processors apply their own updates. Viewpoint of a single processor: read the current x (say x_3), compute the update, then write to whatever x is current by then (say x_5):
  x_6 ← x_5 + update(x_3)
  [Timeline figure: while the processor reads x_3, computes, and writes, other processors advance the iterate through x_4 and x_5; the write produces x_6, and further updates continue with x_7.]

  3. Generic Parallel Lock-Free Algorithm
  In general: x_{j+1} = x_j + update(x_{r(j)})
  ◮ r(j) = index of the iterate current at reading time
  ◮ j = index of the iterate current at writing time
  Assumption: j − r(j) ≤ τ, with τ + 1 ≈ # processors
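  A minimal serial simulation of this bounded-staleness model (our sketch; `update` stands for whatever step the method computes, and all names here are ours):

    import random

    def lock_free_simulation(x0, update, num_steps, tau, seed=0):
        """Serially emulate tau-bounded asynchrony: the update applied at
        step j is computed from an iterate x_{r(j)} at most tau steps stale."""
        rng = random.Random(seed)
        history = [x0]                           # history[j] = x_j
        for j in range(num_steps):
            r = rng.randint(max(0, j - tau), j)  # reading index: j - r(j) <= tau
            history.append(history[j] + update(history[r]))
        return history[-1]

    # Gradient steps on f(x) = x^2/2 still converge under staleness:
    print(lock_free_simulation(1.0, lambda x: -0.1 * x, 200, tau=8))  # ~0.0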

  4. The Problem and Its Structure
  minimize_{x ∈ R^{|V|}} [ f(x) ≡ Σ_{e ∈ E} f_e(x) ]   (OPT)
  ◮ Set of vertices/coordinates: V  (x = (x_v, v ∈ V), dim x = |V|)
  ◮ Set of edges: E ⊂ 2^V
  ◮ Set of blocks: B (a collection of sets forming a partition of V)
  ◮ Assumption: f_e depends on x_v, v ∈ e, only
  Example (convex f : R^5 → R):
  f(x) = 7(x_1 + x_3)^2 + 5(x_2 − x_3 + x_4)^2 + (x_4 − x_5)^2
  with the three terms denoted f_{e_1}(x), f_{e_2}(x), f_{e_3}(x);
  V = {1, 2, 3, 4, 5}, |V| = 5, e_1 = {1, 3}, e_2 = {2, 3, 4}, e_3 = {4, 5}
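  For concreteness, the 5-variable example can be written as a list of (edge, f_e) pairs (a hypothetical encoding of ours, with 0-based coordinate indices):

    edges = [
        ({0, 2},    lambda x: 7 * (x[0] + x[2]) ** 2),         # f_{e1}, e1 = {1,3}
        ({1, 2, 3}, lambda x: 5 * (x[1] - x[2] + x[3]) ** 2),  # f_{e2}, e2 = {2,3,4}
        ({3, 4},    lambda x: (x[3] - x[4]) ** 2),             # f_{e3}, e3 = {4,5}
    ]

    def f(x):                       # f = sum of edge functions
        return sum(f_e(x) for _, f_e in edges)

    print(f([1.0, 0.0, 0.0, 0.0, 0.0]))  # 7.0: only f_{e1} involves x_1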

  5. Applications
  ◮ structured stochastic optimization (via Sample Average Approximation)
  ◮ learning
  ◮ sparse least-squares
  ◮ sparse SVMs, matrix completion, graph cuts (see Niu-Recht-Ré-Wright (2011))
  ◮ truss topology design
  ◮ optimal statistical designs

  6. PART 1: LOCK-FREE HYBRID SGD/RCD METHODS
  Based on: P. Richtárik and M. Takáč, Lock-free randomized first order methods, manuscript, 2013.

  7. Problem-Specific Constants

  Quantity                  Definition                       Average   Maximum   Notes
  Edge-Vertex Degree        ω_e = |e| = |{v ∈ V : v ∈ e}|    ω̄         ω′        # vertices incident with an edge (relevant if |B| = |V|)
  Edge-Block Degree         σ_e = |{b ∈ B : b ∩ e ≠ ∅}|      σ̄         σ′        # blocks incident with an edge (relevant if |B| > 1)
  Vertex-Edge Degree        δ_v = |{e ∈ E : v ∈ e}|          δ̄         δ′        # edges incident with a vertex (not needed!)
  Edge-Edge Degree          ρ_e = |{e′ ∈ E : e′ ∩ e ≠ ∅}|    ρ̄         ρ′        # edges incident with an edge (relevant if |E| > 1)

  Remarks:
  ◮ Our results depend on σ̄ (average Edge-Block degree) and ρ̄ (average Edge-Edge degree)
  ◮ The first and second rows coincide if |B| = |V| (blocks correspond to vertices/coordinates)
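  All four quantities are cheap to compute from the edge and block structure; a small helper (ours, not from the talk; assumes every vertex appears in some edge):

    from statistics import mean

    def degrees(edges, blocks):
        """Return (omega_bar, sigma_bar, delta_bar, rho_bar) for a list of
        edges (sets of vertices) and blocks (sets partitioning the vertices)."""
        omega = [len(e) for e in edges]                              # edge-vertex
        sigma = [sum(1 for b in blocks if b & e) for e in edges]     # edge-block
        vertices = set().union(*edges)
        delta = [sum(1 for e in edges if v in e) for v in vertices]  # vertex-edge
        rho = [sum(1 for e2 in edges if e2 & e) for e in edges]      # edge-edge
        return mean(omega), mean(sigma), mean(delta), mean(rho)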

  8. Example

  A = [ A_1^T ]   [ 5    0    −3 ]
      [ A_2^T ] = [ 1.5  2.1   0 ]  ∈ R^{4×3}
      [ A_3^T ]   [ 0    0     6 ]
      [ A_4^T ]   [ .4   0     0 ]

  f(x) = (1/2)‖Ax‖_2^2 = (1/2) Σ_{i=1}^{4} (A_i^T x)^2,   |E| = 4, |V| = 3

  9. Example (continued)
  With A and f(x) = (1/2)‖Ax‖_2^2 as above (|E| = 4, |V| = 3), computation of ω̄ and ρ̄:

         v_1   v_2   v_3   ω_{e_i}   ρ_{e_i}
  e_1     ×           ×       2         4
  e_2     ×     ×             2         3
  e_3                 ×       1         2
  e_4     ×                   1         3
  δ_{v_j}  3     1     2

  ω̄ = (2+2+1+1)/4 = 1.5,   ρ̄ = (4+3+2+3)/4 = 3

  ρ_e = |{e′ ∈ E : e′ ∩ e ≠ ∅}|,  ω_e = |e|,  δ_v = |{e ∈ E : v ∈ e}|
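  Reusing the degrees helper above, the nonzero patterns of the rows of A reproduce this table (0-based column indices):

    edges = [{0, 2}, {0, 1}, {2}, {0}]   # supports of A_1, ..., A_4
    blocks = [{0}, {1}, {2}]             # |B| = |V|: blocks = coordinates
    print(degrees(edges, blocks))        # omega=1.5, sigma=1.5, delta=2, rho=3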

  10. Algorithm
  Iteration j+1 looks as follows:
  x_{j+1} = x_j − γ |E| σ_e ∇_b f_e(x_{r(j)})
  Viewpoint of the processor performing this iteration:
  ◮ Pick edge e ∈ E, uniformly at random
  ◮ Pick block b intersecting edge e, uniformly at random
  ◮ Read current x (enough to read x_v for v ∈ e)
  ◮ Compute ∇_b f_e(x)
  ◮ Apply update: x ← x − α ∇_b f_e(x) with α = γ |E| σ_e and γ > 0
  ◮ Do not wait (no synchronization!) and start again!
  It is easy to show that E[ |E| σ_e ∇_b f_e(x) ] = ∇f(x).
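  A serial sketch of this loop, specialized to f(x) = (1/2)‖Ax‖_2^2 so that ∇f_e has a closed form (our illustration; in the real method many processors run this loop at once without synchronization, and all names are ours):

    import random

    def hybrid_sgd_rcd(A, x, gamma, blocks, num_iters, seed=0):
        """One processor's loop for f(x) = 1/2 ||Ax||^2, where each row of A
        is an edge and grad f_e(x) = (A_e^T x) A_e."""
        rng = random.Random(seed)
        E = len(A)
        supports = [{v for v, a in enumerate(row) if a != 0} for row in A]
        for _ in range(num_iters):
            e = rng.randrange(E)                          # edge, uniform
            hit = [b for b in blocks if b & supports[e]]  # blocks meeting e
            b = rng.choice(hit)                           # block, uniform
            sigma_e = len(hit)
            r = sum(A[e][v] * x[v] for v in supports[e])  # residual A_e^T x
            for v in b & supports[e]:                     # step along grad_b f_e
                x[v] -= gamma * E * sigma_e * r * A[e][v]
        return x

    A = [[5, 0, -3], [1.5, 2.1, 0], [0, 0, 6], [.4, 0, 0]]  # the 4x3 example
    print(hybrid_sgd_rcd(A, [1.0, 1.0, 1.0], 1e-3, [{0}, {1}, {2}], 5000))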

  11. Main Result
  Setup:
  ◮ c = strong convexity parameter of f
  ◮ L = Lipschitz constant of ∇f
  ◮ ‖∇f(x)‖_2 ≤ M for all x visited by the method
  ◮ Starting point: x_0 ∈ R^{|V|}
  ◮ 0 < ε < (L/2) ‖x_0 − x*‖_2^2
  ◮ constant stepsize: γ := cε / ( 2LM^2 (σ̄ + 2τρ̄/|E|) )

  12. Main Result (continued)
  Result: Under the above setup, for
  k ≥ (σ̄ + 2τρ̄/|E|) · (2LM^2 / (c^2 ε)) · log( L‖x_0 − x*‖_2^2 / ε ) − 1,
  we have min_{0 ≤ j ≤ k} E[ f(x_j) − f* ] ≤ ε.

  13. Special Cases
  General result:
  (σ̄ + 2τρ̄/|E|) · (2LM^2 / (c^2 ε)) · log( L‖x_0 − x*‖_2^2 / ε ) − 1
  where Λ := σ̄ + 2τρ̄/|E| varies by special case and the remaining factor is common to all of them.

  special case     lock-free parallel version of ...                        Λ
  |E| = 1          Randomized Block Coordinate Descent                      |B| + 2τ
  |B| = 1          Incremental Gradient Descent (Hogwild! as implemented)   1 + 2τρ̄/|E|
  |B| = |V|        RAINCODE: RAndomized INcremental COordinate DEscent      ω̄ + 2τρ̄/|E|
                   (Hogwild! as analyzed)
  |E| = |B| = 1    Gradient Descent                                         1 + 2τ
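  As a quick sanity check of the table (our arithmetic, not from the slides): if |E| = |B| = 1, the single edge meets the single block and itself, so σ̄ = ρ̄ = 1 and Λ = 1 + 2τ·1/1 = 1 + 2τ, the Gradient Descent row; if |B| = |V|, each block meeting an edge e is a single vertex of e, so σ_e = ω_e, hence σ̄ = ω̄ and Λ = ω̄ + 2τρ̄/|E|, the RAINCODE row.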

  14. Analysis via a New Recurrence
  Let a_j = (1/2) E[ ‖x_j − x*‖^2 ].

  Nemirovski-Juditsky-Lan-Shapiro:
  a_{j+1} ≤ (1 − 2cγ_j) a_j + (1/2) γ_j^2 M^2

  Niu-Recht-Ré-Wright (Hogwild!):
  a_{j+1} ≤ (1 − cγ) a_j + γ^2 ( √2 c ω′ M τ (δ′)^{1/2} ) a_j^{1/2} + (1/2) γ^2 M^2 Q,
  where Q = ω′ + 2τρ′/|E| + 4ω′ρ′τ/|E| + 2τ^2 (ω′)^2 (δ′)^{1/2}

  R.-Takáč:
  a_{j+1} ≤ (1 − 2cγ) a_j + (1/2) γ^2 ( σ̄ + 2τρ̄/|E| ) M^2
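  To connect the last recurrence to the main result, here is the standard unrolling step (our sketch, up to constants; write Λ := σ̄ + 2τρ̄/|E|). Iterating a_{j+1} ≤ (1 − 2cγ) a_j + (1/2) γ^2 Λ M^2 with a constant stepsize and bounding the geometric series by 1/(2cγ) gives

    a_k ≤ (1 − 2cγ)^k a_0 + γΛM^2 / (4c).

  Since f(x) − f* ≤ (L/2)‖x − x*‖_2^2, we have E[f(x_k) − f*] ≤ L a_k, so a stepsize γ on the order of cε/(ΛLM^2) makes the second term O(ε/L), and the first term drops below the target after k ≈ (1/(2cγ)) log(L a_0 / ε) iterations, which is exactly the Λ · (LM^2/(c^2 ε)) · log(·) shape of the bound on slide 12.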

  15. Parallelization Speedup Factor
  PSF = (Λ of serial version) / ( (Λ of parallel version) / τ )
      = σ̄ / ( (σ̄ + 2τρ̄/|E|) / τ )
      = 1 / ( 1/τ + 2ρ̄/(σ̄|E|) )

  16. Parallelization Speedup Factor (continued)
  PSF = 1 / ( 1/τ + 2ρ̄/(σ̄|E|) ). Three modes:
  ◮ Brute force (many processors; τ very large): PSF ≈ σ̄|E| / (2ρ̄)
  ◮ Favorable structure (2ρ̄/(σ̄|E|) ≪ 1/τ; fixed τ): PSF ≈ τ
  ◮ Special τ (τ = |E|/ρ̄): PSF = |E|σ̄ / ( (σ̄ + 2) ρ̄ ) ≈ τ
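  To make the three modes concrete, a small computation on the toy instance from slides 8-9 (σ̄ = ω̄ = 1.5, ρ̄ = 3, |E| = 4, taking |B| = |V|; our illustration):

    def psf(sigma_bar, rho_bar, E, tau):
        # PSF = 1 / (1/tau + 2*rho_bar / (sigma_bar*E)), as on slide 15
        return 1.0 / (1.0 / tau + 2.0 * rho_bar / (sigma_bar * E))

    for tau in (1, 3, 15, 1000):
        print(tau, psf(1.5, 3.0, 4, tau))
    # 0.5, 0.75, 0.9375, 0.999...: saturates at sigma_bar*|E|/(2*rho_bar) = 1.0

  The edges of this tiny instance overlap heavily, so extra processors buy almost nothing here; PSF ≈ τ requires the favorable-structure regime 2ρ̄/(σ̄|E|) ≪ 1/τ.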

  17. Improvements vs Hogwild!
  If |B| = |V| (blocks = coordinates), then our method coincides with Hogwild! (as analyzed in Niu et al.), up to the stepsize choice:
  x_{j+1} = x_j − γ |E| ω_e ∇_v f_e(x_{r(j)})

  18. Improvements vs Hogwild! (continued)
  Niu-Recht-Ré-Wright (Hogwild!, 2011):  Λ = 4ω′ + 24τρ′/|E| + 24τ^2 ω′ (δ′)^{1/2}
  R.-Takáč:  Λ = ω̄ + 2τρ̄/|E|

  19. Improvements vs Hogwild! (continued)
  Advantages of our approach:
  ◮ Dependence on averages rather than maxima! (ω′ → ω̄, ρ′ → ρ̄)
  ◮ Better constants (4 → 1, 24 → 2)
  ◮ The third large term is not present (no dependence on τ^2 and δ′)
  ◮ Introduction of blocks (⇒ also covers block coordinate descent, gradient descent, SGD)
  ◮ Simpler analysis
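  Plugging the toy instance from slides 8-9 into both leading factors makes the gap vivid (our arithmetic; |B| = |V|, so ω′ = 2, ρ′ = 4, δ′ = 3, ω̄ = 1.5, ρ̄ = 3, |E| = 4, and take τ = 3):

    import math

    tau, E = 3, 4
    hogwild = 4 * 2 + 24 * tau * 4 / E + 24 * tau**2 * 2 * math.sqrt(3)
    ours = 1.5 + 2 * tau * 3 / E
    print(round(hogwild), ours)   # 828 vs 6.0: two orders of magnitude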

  20. Modified Algorithm: Global Reads and Local Writes*
  Partition the vertices (coordinates) into τ+1 blocks, V = b_1 ∪ b_2 ∪ · · · ∪ b_{τ+1}, and assign block b_i to processor i, i = 1, 2, . . . , τ+1. Processor i will (asynchronously) do:
  ◮ Pick edge e ∈ {e′ ∈ E : e′ ∩ b_i ≠ ∅}, uniformly at random (an edge intersecting the block owned by processor i)
  ◮ Update: x_{j+1} = x_j − α ∇_{b_i} f_e(x_{r(j)})  (see the sketch after this slide)
  Pros and cons:
  ◮ + good if global reads and local writes are cheap but global writes are expensive (NUMA = Non-Uniform Memory Access)
  ◮ − we do not have an analysis
  * Idea proposed by Ben Recht.
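  A minimal sketch of one worker under this scheme, again for f(x) = (1/2)‖Ax‖_2^2 (our illustration; the names and the sparse-row encoding are ours):

    import random

    def local_writes_worker(block, rows, x, alpha, num_iters, seed=0):
        """Processor owning `block` in the local-writes scheme: it reads all
        of x but only writes coordinates in its own block, so no atomic
        operations are needed. `rows` is a list of sparse rows {v: A_ev};
        in the real method, tau+1 such workers share x and run concurrently."""
        rng = random.Random(seed)
        mine = [row for row in rows if block & row.keys()]  # edges meeting the block
        for _ in range(num_iters):
            row = rng.choice(mine)                          # edge, uniform
            r = sum(a * x[v] for v, a in row.items())       # global read: A_e^T x
            for v in block & row.keys():                    # local write only
                x[v] -= alpha * r * row[v]
        return x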

  21. Experiment 1: rcv1
  size = 1.2 GB, features = |V| = 47,236, training: |E| = 677,399, testing: 20,242
  [Figure: train error (from about 0.12 down to 0.03) vs. epoch (0 to 2) for 1, 4, and 16 CPUs, each run asynchronously and synchronously.]

  22. Experiment 2
  Artificial problem instance:
  minimize f(x) = (1/2)‖Ax‖_2^2 = (1/2) Σ_{i=1}^{m} (A_i^T x)^2
  A ∈ R^{m×n};  m = |E| = 500,000;  n = |V| = 50,000
  Three methods:
  ◮ Synchronous, all = parallel synchronous method with |B| = 1
  ◮ Asynchronous, all = parallel asynchronous method with |B| = 1
  ◮ Asynchronous, block = parallel asynchronous method with |B| = τ (no need for atomic operations ⇒ additional speedup)
  We measure the elapsed time needed to perform 20m iterations (20 epochs).
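  A plausible generator for such an instance (our reconstruction; the slides do not give the sampling recipe, so the Gaussian entries and uniform supports are assumptions):

    import random

    def make_instance(m, n, edge_size, seed=0):
        """m sparse rows (edges) over n coordinates, each touching edge_size
        random coordinates; f(x) = 1/2 * sum_i (A_i^T x)^2."""
        rng = random.Random(seed)
        return [{v: rng.gauss(0.0, 1.0) for v in rng.sample(range(n), edge_size)}
                for _ in range(m)]

    # Full-size instance: make_instance(500_000, 50_000, 10); small demo:
    rows = make_instance(1000, 100, edge_size=10)   # the uniform |e| = 10 case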

  23. Uniform instance: |e| = 10 for all edges
