  1. Dynamic Parameter Allocation in Parameter Servers. Alexander Renz-Wieland (1), Rainer Gemulla (2), Steffen Zeuch (1,3), Volker Markl (1,3). VLDB 2020. (1) TU Berlin, (2) University of Mannheim, (3) DFKI

  2. Takeaways
     ◮ Key challenge in distributed Machine Learning (ML): communication overhead
     ◮ Parameter Servers (PSs)
       ◮ Intuitive
       ◮ Limited support for common techniques to reduce overhead
     ◮ How to improve support? Dynamic parameter allocation
     ◮ Is this support beneficial? Up to two orders of magnitude faster

  3. Background: Distributed Machine Learning
     [Figure: a logical parameter server spanning several physical nodes; workers access the parameters via push() and pull().]
     ◮ Distributed training is a necessity for large-scale ML tasks
     ◮ Parameter management is a key concern
     ◮ Parameter servers (PSs) are widely used
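
     To make the push/pull interface concrete, here is a minimal single-process sketch of a classic PS key-value store. This is hypothetical Python for illustration, not the actual PS-Lite API: workers pull() parameter vectors by key and push() additive updates back.

```python
# Minimal single-process sketch of the classic PS interface (hypothetical
# Python, not the actual PS-Lite API). Workers pull() parameter vectors by
# key and push() additive updates back to the server.
from collections import defaultdict

class ClassicParameterServer:
    def __init__(self, dim):
        # key -> parameter vector, initialized to zeros on first access
        self.store = defaultdict(lambda: [0.0] * dim)

    def pull(self, keys):
        """Return copies of the current values for the requested keys."""
        return {k: list(self.store[k]) for k in keys}

    def push(self, updates):
        """Apply additive updates, the usual PS update semantics."""
        for k, delta in updates.items():
            vec = self.store[k]
            for i, d in enumerate(delta):
                vec[i] += d

# A worker step: pull the parameters its data point touches,
# compute an update locally, and push the update back.
ps = ClassicParameterServer(dim=2)
params = ps.pull([3, 7])
ps.push({3: [0.1, -0.2], 7: [0.05, 0.0]})
```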

  4.–6. Problem: Communication Overhead
     Training knowledge graph embeddings (RESCAL, dimension 100):
     [Figure, built up across slides 4–6: epoch run time in minutes over parallelism (nodes x threads: 1x4, 2x4, 4x4, 8x4) for Classic PS (PS-Lite), Classic PS with fast local access, and Dynamic Allocation PS (Lapse) incl. fast local access; labeled run times range from 4.5h down to 0.2h.]
     ◮ Communication overhead can limit scalability
     ◮ Performance can fall behind a single node

  7.–8. How to reduce communication overhead?
     [Figure: three techniques that co-locate parameter access with data, each shown as workers 1 and 2 accessing PARAMETERS for their DATA: data clustering, parameter blocking, latency hiding.]
     ◮ Common techniques to reduce overhead: data clustering, parameter blocking, latency hiding (a latency-hiding sketch follows below)
     ◮ Key is to avoid remote accesses
     ◮ Do PSs support these techniques?
       ◮ Techniques require local access at different nodes over time
       ◮ But PSs allocate parameters statically
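
     Of the three techniques, latency hiding is the easiest to sketch in isolation. The following illustrative Python builds on the pull/push interface sketched earlier; the names (`data_points[i].keys`, the executor-based asynchronous pull) are assumptions for illustration, not Lapse's API. It overlaps the pull for the next data point with computation on the current one.

```python
# Illustrative sketch of latency hiding (hypothetical; data_points[i].keys
# and the executor-based asynchronous pull are assumptions, not Lapse's API).
# The worker issues the pull for the next data point while it is still
# computing on the current one, overlapping communication with computation.
from concurrent.futures import ThreadPoolExecutor

def train_epoch(ps, data_points, compute_update):
    if not data_points:
        return
    with ThreadPoolExecutor(max_workers=1) as executor:
        # Prefetch the parameters for the first data point.
        future = executor.submit(ps.pull, data_points[0].keys)
        for i, point in enumerate(data_points):
            params = future.result()  # waits only if the prefetch is still in flight
            if i + 1 < len(data_points):
                # Start pulling the next point's parameters before computing.
                future = executor.submit(ps.pull, data_points[i + 1].keys)
            ps.push(compute_update(point, params))
```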

  9. Dynamic Parameter Allocation
     Localize(parameters)
     ◮ What if the PS could allocate parameters dynamically?
     ◮ Would provide support for
       ◮ Data clustering ✓
       ◮ Parameter blocking ✓
       ◮ Latency hiding ✓
     ◮ We call this dynamic parameter allocation
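
     A minimal sketch of what such a primitive could look like, modeled on the slide's Localize(parameters) call (hypothetical Python, not Lapse's actual C++ implementation): each key is owned by exactly one node, and localize() moves ownership so that subsequent accesses to the key are local.

```python
# Minimal sketch of dynamic parameter allocation, modeled on the slide's
# Localize(parameters) primitive (hypothetical Python, not Lapse's actual
# C++ implementation). Each key is owned by exactly one node; localize()
# moves ownership so that subsequent accesses to the key are local.

class DynamicAllocationPS:
    def __init__(self, num_nodes):
        self.owner = {}                                   # key -> owning node
        self.shards = [dict() for _ in range(num_nodes)]  # per-node storage

    def _owner_of(self, key):
        # Default owner assigned by hashing, as in a statically partitioned PS.
        return self.owner.setdefault(key, hash(key) % len(self.shards))

    def pull(self, node, keys):
        # In a real system this is a network hop whenever _owner_of(k) != node.
        return {k: self.shards[self._owner_of(k)].get(k, 0.0) for k in keys}

    def push(self, node, updates):
        for k, delta in updates.items():
            shard = self.shards[self._owner_of(k)]
            shard[k] = shard.get(k, 0.0) + delta

    def localize(self, node, keys):
        """Relocate the given keys to `node` so future accesses are local."""
        for k in keys:
            src = self._owner_of(k)
            if src != node:
                self.shards[node][k] = self.shards[src].pop(k, 0.0)
                self.owner[k] = node

# Before processing a block of data, a worker localizes the parameters the
# block touches (enabling data clustering and parameter blocking), then
# trains with purely local accesses.
ps = DynamicAllocationPS(num_nodes=2)
ps.push(0, {42: 1.0})
ps.localize(1, [42])
print(ps.pull(1, [42]))  # {42: 1.0}, now served from node 1's local shard
```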

  10. The Lapse Parameter Server
     ◮ Features
       ◮ Dynamic allocation ✓
       ◮ Location transparency ✓
       ◮ Retains sequential consistency ✓
     Per-key consistency guarantees (for synchronous operations):
                     Eventual  Stale  PRAM  Causal  Sequential  Serializability
       Classic PS    ✓         ✓      ✗     ✗       ✗           ✗
       Lapse         ✓         ✓      ✓     ✓       ✓           ✗
     ◮ Many system challenges (see paper)
       ◮ Manage parameter locations
       ◮ Route parameter accesses to current location (see the directory sketch below)
       ◮ Relocate parameters
       ◮ Handle reads and writes during relocations
       ◮ All while maintaining sequential consistency
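
     One way to route accesses to a parameter's current location is a home-node directory, a design assumed here purely for illustration (the paper describes Lapse's actual protocol): hashing fixes each key's home node, the home tracks the current owner, and a relocation updates only the home's entry rather than broadcasting to all nodes.

```python
# Sketch of one routing design for "route parameter accesses to current
# location": a fixed home node per key tracks the key's current owner
# (assumed here for illustration; see the paper for Lapse's actual protocol).
# Any node can resolve a relocated key in at most one extra hop, and a
# relocation updates only the home node's entry rather than broadcasting.

class LocationDirectory:
    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        # One table per home node: key -> current owner.
        self.tables = [dict() for _ in range(num_nodes)]

    def home_of(self, key):
        # The home node is fixed by hashing and never changes.
        return hash(key) % self.num_nodes

    def owner_of(self, key):
        # Ask the home node; by default a key lives at its home.
        return self.tables[self.home_of(key)].get(key, self.home_of(key))

    def relocate(self, key, new_owner):
        self.tables[self.home_of(key)][key] = new_owner

directory = LocationDirectory(num_nodes=4)
directory.relocate(123, new_owner=2)
assert directory.owner_of(123) == 2  # resolved via key 123's home node
```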

  11. Experimental study
     Tasks: matrix factorization, knowledge graph embeddings, word vectors
     Cluster: 1–8 nodes, each with 4 worker threads, 10 GBit Ethernet
     1. Performance of Classic PSs
        ◮ 2–8 nodes barely outperformed 1 node in all tested tasks
     2. Effect of dynamic parameter allocation
        ◮ 4–203x faster than a Classic PS, up to linear speed-ups
        [Figure: the epoch run time plot from slides 4–6, repeated.]

  12.–13. Comparison to Bounded Staleness PS
     ◮ Matrix factorization (matrix with 1b entries, rank 100)
     ◮ Parameter blocking
     [Figure: epoch run time in minutes over parallelism (nodes x threads: 1x4 to 8x4) for bounded staleness PS (Petuum) with client synchronization (0.6x), bounded staleness PS (Petuum) with server synchronization (2.9x), and Dynamic Allocation PS (Lapse) (8.4x); annotations: non-linear scaling, single-node overhead.]

  14. Experimental study
     Tasks: matrix factorization, knowledge graph embeddings, word vectors
     Cluster: 1–8 nodes, each with 4 worker threads, 10 GBit Ethernet
     1. Performance of Classic PSs
        ◮ 2–8 nodes barely outperformed 1 node in all tested tasks
     2. Effect of dynamic parameter allocation
        ◮ 4–203x faster than a Classic PS, up to linear speed-ups
     3. Comparison to bounded staleness PSs
        ◮ 2–28x faster and more scalable
     4. Comparison to manual management
        ◮ Competitive to a specialized low-level implementation
     5. Ablation study
        ◮ Combining fast local access and dynamic allocation is key

  15. Dynamic Parameter Allocation in Parameter Servers
     ◮ Key challenge in distributed Machine Learning (ML): communication overhead
     ◮ Parameter Servers (PSs)
       ◮ Intuitive
       ◮ Limited support for common techniques to reduce overhead
     ◮ How to improve support? Dynamic parameter allocation
     ◮ Is this support beneficial? Up to two orders of magnitude faster
     ◮ Lapse is open source: https://github.com/alexrenz/lapse-ps
