  1. Partition Cast - Modelling and Optimizing the Distribution of Large Data Sets in PC Clusters
  Felix Rauch, Christian Kurmann, Thomas M. Stricker
  Laboratory for Computer Systems, Eidgenössische Technische Hochschule (ETH) Zürich
  CoPs project: http://www.cs.inf.ethz.ch/CoPs/
  31 August 2000

  2. Clusters of PCs
  • Scientific computing (computational grids)
  • Enterprise computing (distributed databases / data mining)
  • Corporate computing (multimedia / collaborative work)
  • Education and training (classrooms)

  3. Common Problem
  Maintenance of software installations is hard:
  • Different operating systems or applications in the cluster
  • Temporary installations: tests, experiments, courses
  • Software rejuvenation to combat the software-rotting process
  Manual install: days. Network install: hours. Cloning: minutes.

  4. Partition Cast (Cloning)
  Fast replication of entire system installations (OS image, applications, data) on clusters is helpful.
  • How to do ultra-fast data distribution in clusters?
  Essential trade-offs:
  • What network is needed? (Gigabit Ethernet, switches, hubs)
  • Which protocol family? (multicast, broadcast, unicast)
  • Compressed or raw data?
  • Best logical topology for the distribution path?

  5. Overview
  • Network topologies and embedding
  • Related work
  • Analytical model for partition cast
  • Implemented tools for partition cast
  • Evaluation of alternative topologies
  • Model vs. measurement
  • Conclusion

  6. Network Topologies
  Given:
  • Physical network topology
  • Resource constraints (maximal throughput over links or through nodes)
  Wanted:
  • The logical network topology best suited for data distribution
  • The best embedding of the logical network into the physical network
  • The limit on throughput for the distribution of big data sets (partition cast)

  7. Physical Network
  [Figure: physical network: the COPS (16 nodes) and Patagonia (8 nodes) clusters attached to Cabletron SSR 8000 and SSR 8600 switches via Fast and Gigabit Ethernet; a further Cabletron SSR 8600 matrix connects the Math./Phys. Linneus (16 nodes) and Beowulf (192 nodes) clusters.]
  • Graph given by cables, nodes and switches

  8. Logical Network
  [Figure: a logical distribution tree overlaid on the server, clients and switches (S).]
  • Spanning tree, embedded into the physical network

  9. Previous and Related Work
  • Protocols and tools for the distribution of data to a large number of clients [Kotsopoulos and Cooperstock, USENIX 1996]
  • Our model is based on ideas for throughput-oriented memory-system performance in MPP computers [Stricker and Gross, ISCA 1995]
  • High-speed multicast leads to great variation in perceived bandwidth, is complex to implement and quite resource-intensive; high speeds seem impossible [Rauch, master's thesis, ETH Zürich 1997]

  10. Simple Model of Partition Cast
  Definitions:
  • Node types
  • Capacity constraints
  • Algorithm for evaluation of the model
  Example:
  • Heterogeneous network: Gigabit / Fast Ethernet

  11. Node Types
  • Active node: participates in the partition cast; can duplicate and store a stream
  • Passive node: can neither duplicate nor store data; passes one or more streams between active nodes

  12. Capacity Constraints
  • Reliable-transfer promise
  • Fair sharing of links
  Examples:
  • Edge capacity → a 125 MB/s link carrying 2 logical channels: < 62.5 MB/s per stream
  • Node capacity → a 30 MB/s switch forwarding 3 streams: < 10 MB/s per stream

  13. Model Algorithm (Constraint Satisfaction)
  Algorithm "evaluate basic model" (a sketch of this evaluation follows below):
  1 Choose a logical network
  2 Embed it into the given physical network
  3 For all edges: post bandwidth limitations due to edge congestion
  4 For all nodes: post bandwidth limitations due to node congestion
  5 Over all posted limitations: find the minimum bandwidth
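The evaluation is simple enough to express in a few lines. Below is a minimal Python sketch (the names `evaluate_basic_model`, `edge_capacity` and `node_capacity` are ours, not the paper's tool): each embedded stream is a path of physical edges and forwarding nodes, and fair sharing divides every capacity by the number of streams crossing it.

```python
# Minimal sketch of "evaluate basic model" (hypothetical data layout).
from collections import Counter

def evaluate_basic_model(edge_capacity, node_capacity, embedded_streams):
    """edge_capacity: {edge: MB/s}; node_capacity: {node: MB/s forwarding};
    embedded_streams: one (edges, nodes) path per logical stream.
    Returns the bandwidth limit (MB/s) for the whole partition cast."""
    edge_load, node_load = Counter(), Counter()
    for edges, nodes in embedded_streams:   # steps 3 and 4: count congestion
        edge_load.update(edges)
        node_load.update(nodes)
    limits = [edge_capacity[e] / n for e, n in edge_load.items()]
    limits += [node_capacity[v] / n for v, n in node_load.items()]
    return min(limits)                      # step 5: minimum over all limits

# Toy example: two Fast Ethernet clients fed through one 30 MB/s switch.
edge_cap = {("server", "sw"): 125.0, ("sw", "c1"): 12.5, ("sw", "c2"): 12.5}
node_cap = {"sw": 30.0}
streams = [((("server", "sw"), ("sw", "c1")), ("sw",)),
           ((("server", "sw"), ("sw", "c2")), ("sw",))]
print(evaluate_basic_model(edge_cap, node_cap, streams))
# -> 12.5: the Fast Ethernet links are the bottleneck
```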

  14. Example Network
  [Figure: example physical network with a server, clients and two switches (S).]

  15. Example Network
  [Figure: the same example network with the logical distribution tree embedded.]

  16. Example Network
  [Figure: edge-capacity limits posted on the example network: Fast Ethernet links < 12.5 MB/s; Gigabit Ethernet links < 125 MB/s, with shared links < 125/2 and < 125/3 MB/s per stream.]

  17. Example Network
  [Figure: node-capacity limits added: switch forwarding < 30/2, < 30/3 and < 30/4 MB/s per stream; switch backplane < 4000/5 and < 4000/6 MB/s; edge limits as before.]

  18. Example Network
  [Figure: disk limits of < 24 MB/s added at the active nodes, on top of the edge and node limits; the minimum over all posted limits gives the achievable bandwidth.]

  19. Detailed Model of Active Nodes
  • In the simple model, active nodes were black boxes
  • A detailed model would allow accurate predictions of achievable data-stream bandwidths
  • It requires detailed knowledge of:
    • the flows of node-internal data streams
    • the limits of the involved subsystems
    • the complexity of handling and coordinating data streams and subsystems

  20. Detailed Example: Data Streams
  [Figure: data streams within an active node, next to the logical topology. An incoming stream is DMAed from the network into system buffers, copied to a user buffer, decompressed with gunzip, copied to a system buffer and DMAed to the SCSI disk. A simplified sketch of this pipeline follows.]
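The figure's data path can be approximated in a few lines. The following Python sketch is our simplification, not Dolly's actual code: the kernel delivers packets into system buffers, recv() copies them into a user buffer, gunzip-style decompression produces the raw image, and write() hands it back to the kernel for DMA to disk.

```python
# Sketch of the copies inside an active node for a compressed cast.
import socket
import zlib

def receive_partition(sock: socket.socket, out_path: str) -> None:
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)  # gzip framing
    with open(out_path, "wb") as disk:
        while True:
            chunk = sock.recv(64 * 1024)          # system buffer -> user buffer
            if not chunk:
                break
            disk.write(decomp.decompress(chunk))  # gunzip, then back to the kernel
        disk.write(decomp.flush())                # drain any buffered tail
```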

  21. Limitations in Active Nodes
  • Link capacity: Gigabit Ethernet 125 MB/s, Fast Ethernet 12.5 MB/s
  • Disk system: Seagate Cheetah SCSI hard disk, 24 MB/s
  • I/O bus capacity: current 32-bit PCI bus, 132 MB/s
  • CPU utilization: processing power is required for each stream, depending on its speed and the complexity of handling

  22. Detailed Example of an Active Node
  Modelling switching capacity: binary spanning-tree topology with Fast Ethernet and compression. Constants: b = stream bandwidth (uncompressed, as written to disk), c = compression factor (e.g. c = 2).
  • b/c < 12.5 MB/s (link receive)
  • 2b/c < 12.5 MB/s (link send)
  • b < 24 MB/s (SCSI disk)
  • 3b/c + b < 132 MB/s (I/O, PCI)
  • 8b + 3b < 180 MB/s (memory)
  • (3/(45c) + 1/80 + 4/(90c) + 1/90 + 1/9) b < 1 (100% CPU)
  Solving these equations for b: the node can handle 5.25 MB/s (see the check below).
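Under the reconstruction above (the CPU coefficients are read from the slide's garbled fraction and should be treated as approximate), the bottleneck can be found numerically. A minimal Python check:

```python
# Solve the active-node constraints for b (binary tree, Fast Ethernet,
# compression factor c = 2). Coefficients as reconstructed above.
c = 2.0
bounds = [
    12.5 * c,           # link receive:  b/c  < 12.5
    12.5 * c / 2,       # link send:     2b/c < 12.5
    24.0,               # SCSI disk:     b    < 24
    132.0 / (3/c + 1),  # I/O, PCI:      3b/c + b < 132
    180.0 / 11,         # memory:        8b + 3b  < 180
    1.0 / (3/(45*c) + 1/80 + 4/(90*c) + 1/90 + 1/9),  # CPU < 100%
]
print(min(bounds))  # -> about 5.25 MB/s: the CPU constraint binds
```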

  23. Implementation (Tools for Partition Cast)
  • dd/NFS: built-in OS function and network file system based on UDP/IP; simple, but permits only a star topology
  • Dolly: small application for streaming with cloning, based on TCP/IP; reliable data streaming
  Dolly provides reliable data casting on all spanning trees:
  • star (n-ary)
  • 2-ary, 3-ary trees
  • chain (unary)

  24. Active Nodes with Dolly
  • Simple receiver for star topologies
  • Advanced cloning node for multi-drop chains (sketched below)
  • Node cloning streams for general spanning trees
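As an illustration of the multi-drop chain node, here is a minimal Python sketch (hypothetical port and framing; the real Dolly protocol differs): the node stores the incoming stream on disk and simultaneously forwards it to the next node, so each client costs only one traversal of its own link.

```python
# Sketch of a Dolly-style multi-drop chain node (hypothetical protocol).
import socket

PORT = 9998  # assumed port, not Dolly's actual default

def chain_node(out_path, next_host=None):
    """Receive the image from upstream, store it, and forward it on."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("", PORT))
    srv.listen(1)
    upstream, _ = srv.accept()
    # The last node in the chain is started without a next_host.
    downstream = socket.create_connection((next_host, PORT)) if next_host else None
    with open(out_path, "wb") as disk:
        while True:
            chunk = upstream.recv(64 * 1024)
            if not chunk:                  # upstream closed: end of image
                break
            disk.write(chunk)              # duplicate the stream to local disk ...
            if downstream:
                downstream.sendall(chunk)  # ... and pass it down the chain
    if downstream:
        downstream.close()
    upstream.close()
```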

  25. Experimental Evaluation
  Topologies:
  • star
  • 3-ary spanning tree
  • multi-drop chain
  Variants: Fast Ethernet / Gigabit Ethernet; compressed / uncompressed images.
  All experiments distribute 2 GByte to 1..15 clients.

  26. Star Topology (Standard NFS)
  [Plot: execution time (s) and bandwidth per node (MByte/s) versus number of nodes (1-20) for four variants: Fast Ethernet compressed/raw and Gigabit Ethernet compressed/raw. Per-node bandwidth degrades as clients are added.]

  27. 3-ary Tree (Dolly)
  [Plot: execution time (s) and bandwidth per node (MByte/s) versus number of nodes (1-19) for Fast and Gigabit Ethernet, raw and compressed.]

  28. Multi-Drop Chain (Dolly)
  [Plot: execution time (s) and bandwidth per node (MByte/s) versus number of nodes (1-20) for Fast and Gigabit Ethernet, compressed and raw. Per-node bandwidth stays nearly constant as the chain grows.]

  29. Scalability
  [Plot: aggregate bandwidth (MByte/s) versus number of nodes (1-20), with the theoretical limit (disk speed) marked. Multi-drop chain with raw data scales best on both Gigabit and Fast Ethernet, followed by spanning tree with raw data; star with compressed data hardly scales at all.]

  30. Predictions and Measurements
  [Bar chart: modelled vs. measured bandwidth per node (MByte/s) on Fast and Gigabit Ethernet, for the multi-drop chain topology (3 clients raw, 5 clients compressed) and the star topology (raw and compressed). The modelled values (e.g. 11.1, 8.8, 6.1 MByte/s) track the measured ones closely.]

  31. Conclusions
  • A simple model captures network topology and node congestion
  • An extended model also captures the utilisation of basic resources in nodes and switches
  • Optimal configurations can be derived from our model
  • For most physical networks, a linear multi-drop chain is better than any other spanning-tree configuration for distributing large data sets
  • Dolly, our simple tool, transfers an entire 2 GB Windows NT partition to 24 workstations in less than 5 minutes, with a sustained transfer rate of 9 MB/s per node

  32. Questions/Discussion?
  Our project: CoPs (Clusters of PCs), Laboratory for Computer Systems, ETH Zürich, Switzerland.
  Dolly is available for download under the GNU General Public License (source code included): http://www.cs.inf.ethz.ch/CoPs/
