Partition Cast - Modelling and Optimizing the Distribution of Large Data Sets in PC Clusters - PowerPoint PPT Presentation

SLIDE 1
Partition Cast - Modelling and Optimizing the Distribution of Large Data Sets in PC Clusters

Felix Rauch, Christian Kurmann, Thomas M. Stricker Laboratory for Computer Systems, ETH Zürich CoPs project: http://www.cs.inf.ethz.ch/CoPs/

Eidgenössische Technische Hochschule Zürich

  • 31 August 2000
SLIDE 2

Clusters of PCs

  • Scientific computing (computational grids)
  • Enterprise computing (distributed databases / data mining)
  • Corporate computing (multimedia/collaborative work)
  • Education and training (classrooms)
SLIDE 3

Common Problem

Maintenance of software installations is hard:

  • Different operating systems or applications in the cluster
  • Temporary installations: tests, experiments, courses
  • Software rejuvenation to combat the software-rotting process

Manual install: days; network install: hours; cloning: minutes.

SLIDE 4

Partition Cast (cloning)

Fast replication of entire system installations (OS image, applications, data) on clusters is helpful.

  • How to do ultra-fast data distribution in clusters?

Essential tradeoffs:

  • What network is needed? (Gigabit, switches, hubs)
  • Which protocol family? (multicast, broadcast, unicast)
  • Compressed or raw data?
  • Best logical topology for the distribution path?

SLIDE 5

Overview

  • Network topologies and embedding
  • Related work
  • Analytical model for partition cast
  • Implemented tools for partition cast
  • Evaluation of alternative topologies
  • Model vs. measurement
  • Conclusion
SLIDE 6

Network Topologies

Given:

  • Physical network topology
  • Resource constraints (maximal throughput over links or through nodes)

Wanted:

  • Best logical network topology for data distribution
  • Best embedding of the logical network into the physical network
  • Limit on throughput for the distribution of big data sets (partition cast)

The candidate logical topologies can be written down compactly, as sketched below.
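The three families of logical topologies evaluated later (star, k-ary spanning tree, multi-drop chain) can be represented as parent lists. The following Python sketch is purely illustrative, not code from the CoPs project; the representation and names are our own:

    # Each logical topology is a parent list: parent[i] is the node
    # that node i receives the data stream from; node 0 is the server.

    def star(n):
        """Server sends directly to every client (n-ary star)."""
        return [None] + [0] * (n - 1)

    def k_ary_tree(n, k):
        """Each active node forwards the stream to up to k children."""
        return [None] + [(i - 1) // k for i in range(1, n)]

    def chain(n):
        """Multi-drop chain: every node forwards to one successor."""
        return [None] + list(range(n - 1))

    print(star(5))           # [None, 0, 0, 0, 0]
    print(k_ary_tree(5, 2))  # [None, 0, 0, 1, 1]
    print(chain(5))          # [None, 0, 1, 2, 3]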

SLIDE 7

Physical Network

  • Graph given by cables, nodes and switches

[Figure: ETH campus network - Cabletron SSR 8000 and SSR 8600 switches connecting the clusters Patagonia (Gigabit Ethernet), Beowulf Math./Phys. (192 nodes), Linneus (16 nodes) and CoPs, with Fast Ethernet segments of 8 and 16 nodes]

SLIDE 8

Logical Network

  • Spanning tree, embedded into the physical network

[Figure: a spanning tree embedded across the switches (S) of the physical network]

SLIDE 9

Previous and Related Work

  • Protocols and tools for the distribution of data to a large number of clients [Kotsopoulos and Cooperstock, USENIX 1996]

  • The model is based on ideas for throughput-oriented memory-system performance in MPP computers [Stricker and Gross, ISCA 1995]

  • High-speed multicast leads to great variation in perceived bandwidth, is complex to implement and quite resource-intensive; high speeds seem impossible [Rauch, master's thesis, ETH Zürich 1997]

SLIDE 10

Simple Model of Partition Cast

Definitions:

  • Node types
  • Capacity constraints
  • Algorithm for evaluation of the model

Example:

  • Heterogeneous network: Gigabit / Fast Ethernet
SLIDE 11

Node Types

[Figure: symbols for active and passive nodes]

  • Active node: participates in partition cast; can duplicate and store the stream

  • Passive node: can neither duplicate nor store data; passes one or more streams between active nodes

SLIDE 12

Capacity Constraints

  • Reliable transfer promise
  • Fair sharing of links
  • Edge capacity - example: a 125 MB/s link carrying 2 logical channels → < 62 MB/s per channel
  • Node capacity - example: a 30 MB/s switch carrying 3 streams → < 10 MB/s per stream

[Figure: examples of capacity constraints at a passive node and an active node]

A one-line helper capturing this sharing rule follows below.
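The fair-sharing rule is just a division; as a tiny illustrative helper (our own, not from the paper):

    # A resource of capacity C shared by k concurrent streams leaves
    # each stream strictly less than C/k.
    def per_stream_limit(capacity_mb_s, streams):
        return capacity_mb_s / streams

    print(per_stream_limit(125.0, 2))  # Gigabit link, 2 channels -> 62.5
    print(per_stream_limit(30.0, 3))   # switch node, 3 streams   -> 10.0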

SLIDE 13

Model Algorithm (Constraint Satisfaction)

Algorithm “evaluate basic model”:

1. Choose a logical network
2. Embed it into the given physical network
3. For all edges: post bandwidth limitations due to edge congestion
4. For all nodes: post bandwidth limitations due to node congestion
5. Over all posted limitations: find the minimum bandwidth
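As an illustration, the whole evaluation step can be sketched in a few lines of Python. This is a hypothetical rendering of the algorithm on this slide, not the authors' code; the capacities are the example values from the previous slide:

    # Given how many streams of the embedded logical tree cross each
    # physical edge and node, every resource posts the limit
    # capacity/streams; the achievable stream bandwidth is the minimum.

    def evaluate_basic_model(edge_load, node_load):
        """edge_load/node_load: dicts mapping a resource name to
        (capacity in MB/s, number of streams crossing it)."""
        limits = [cap / streams for cap, streams in edge_load.values()]
        limits += [cap / streams for cap, streams in node_load.values()]
        return min(limits)

    edges = {"gigabit link": (125.0, 2)}   # step 3: edge congestion
    nodes = {"switch": (30.0, 3)}          # step 4: node congestion
    print(evaluate_basic_model(edges, nodes))  # step 5: -> 10.0 MB/s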

SLIDE 14

Example Network

[Figure: example physical network with two switches (S) and attached nodes]

SLIDE 15

Example Network

[Figure: the same network with a logical distribution topology embedded]

SLIDE 16

Example Network

[Figure: edge-capacity limits posted on the links: < 12.5, < 125, < 125/2, < 125/3, < 125/2 MB/s]

SLIDE 17

Example Network

[Figure: node-capacity limits added to the edge limits: < 30/3, < 30/2, < 30/4, < 4000/6, < 4000/5 MB/s]

SLIDE 18

Example Network

[Figure: disk-speed limits (< 24 MB/s) added to the edge and node limits; the minimum over all posted limits gives the achievable bandwidth]

SLIDE 19

Detailed Model of Active Nodes

  • In the simple model, active nodes were black boxes
  • A detailed model would allow accurate predictions of achievable data-stream bandwidths
  • It requires detailed knowledge of:
    • the flows of node-internal data streams
    • the limits of the involved subsystems
    • the complexity of handling and coordinating data streams and subsystems

SLIDE 20

Detailed Example: Data-Streams

[Figure: logical topology (left) and data streams within an active node (right) - network receive DMA into system buffers, copies through user buffers, gunzip (uncompress), SCSI DMA to disk, and a copy back to the network send path]

SLIDE 21

Limitations in Active Nodes

  • Link capacity: Gigabit Ethernet 125 MB/s; Fast Ethernet 12.5 MB/s

  • Disk system: Seagate Cheetah SCSI hard disk, 24 MB/s

  • I/O bus capacity: current 32-bit PCI bus, 132 MB/s

  • CPU utilization: processing power is required for each stream, depending on the speed and complexity of the handling

SLIDE 22

Detailed Example of an Active Node

  • Modelling switching capacity: a binary spanning-tree topology with Fast Ethernet and compression. Solving the equations for b shows that such a node can handle 5.25 MB/s.

With b the compressed-stream bandwidth and c the compression factor (e.g. c = 2), the subsystem constraints are:

  • Link receive: b < 12.5 MB/s
  • Link send (two child streams): 2b < 12.5 MB/s
  • SCSI disk (stores the uncompressed data): b·c < 24 MB/s
  • I/O, PCI bus: 3b + b·c < 132 MB/s
  • Memory system: total copy traffic < 180 MB/s
  • CPU: total utilization of all stream-handling and gunzip work < 1 (100%)
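The same constraint set can be checked numerically. The sketch below is hedged: it encodes only the constraints that are explicit on this slide, omits the memory-traffic term, and takes the CPU-utilization term as a caller-supplied function, since its exact coefficients are application-specific:

    # b = compressed-stream bandwidth (MB/s), c = compression factor.
    def max_stream_bandwidth(c, cpu_util=None):
        bounds = [
            12.5,             # link receive: b < 12.5 MB/s
            12.5 / 2,         # link send, two child streams: 2b < 12.5
            24.0 / c,         # SCSI disk stores raw data: b*c < 24
            132.0 / (3 + c),  # PCI: receive + 2 sends + disk, (3+c)b < 132
        ]
        if cpu_util is not None:
            # largest b with cpu_util(b, c) <= 1 (100%), by bisection
            lo, hi = 0.0, max(bounds)
            for _ in range(50):
                mid = (lo + hi) / 2
                lo, hi = (mid, hi) if cpu_util(mid, c) <= 1.0 else (lo, mid)
            bounds.append(lo)
        return min(bounds)

    # Without a CPU model the send link dominates: 6.25 MB/s.
    # With the paper's CPU cost terms the slide arrives at 5.25 MB/s.
    print(max_stream_bandwidth(c=2))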

SLIDE 23

Implementation (tools for partition cast)

  • dd/NFS: built-in OS functions and a network file system based on UDP/IP - simple, but permits a star topology only

  • Dolly: a small application for streaming with cloning, based on TCP/IP - reliable data casting on all spanning trees:
    • star (n-ary)
    • 2-ary, 3-ary trees
    • chain (unary)

A minimal sketch of such a chain node follows below.
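For illustration, the store-and-forward core of a multi-drop chain node can be sketched with plain TCP sockets. This is not the actual Dolly source (which is available from the project page), only a minimal hypothetical sketch of the idea:

    import socket, sys

    PORT = 9149          # arbitrary port for this sketch
    CHUNK = 64 * 1024    # receive/forward granularity

    def run_node(out_path, next_host=None):
        # Accept the image stream from the previous node in the chain.
        srv = socket.socket()
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT))
        srv.listen(1)
        conn, _ = srv.accept()
        # Connect onward if we are not the last node of the chain.
        nxt = socket.create_connection((next_host, PORT)) if next_host else None
        with open(out_path, "wb") as disk:
            while True:
                chunk = conn.recv(CHUNK)
                if not chunk:            # end of image
                    break
                disk.write(chunk)        # store locally
                if nxt:
                    nxt.sendall(chunk)   # forward down the chain
        if nxt:
            nxt.close()
        conn.close()

    if __name__ == "__main__":
        # usage: python node.py /path/to/partition-image [next-host]
        run_node(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None)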
SLIDE 24

Active Nodes with Dolly

  • Simple receiver for star topologies
  • Advanced cloning node for multi drop chains
  • Node cloning streams for general spanning trees

[Figure: simple receiver, multi-drop receiver, and active node cloning streams]

SLIDE 25

Experimental Evaluation

  • Topologies:
    • Star
    • 3-ary spanning tree
    • Multi-drop chain
  • Fast Ethernet / Gigabit Ethernet
  • Compressed / uncompressed images

All experiments: Distribute 2 GByte to 1..15 clients

SLIDE 26

Star topology (Standard NFS)

[Figure: execution time (200-2200 s) and bandwidth per node (1.0-10 MB/s) vs. number of nodes (1-20); series: Fast Ethernet compressed, Fast Ethernet raw, Gigabit Ethernet compressed, Gigabit Ethernet raw]

SLIDE 27

3-ary Tree (Dolly)

[Figure: execution time (100-600 s) and bandwidth per node (3.3-20 MB/s) vs. number of nodes (1-19); series: Fast Ethernet raw, Gigabit Ethernet raw, Fast Ethernet compressed, Gigabit Ethernet compressed]

SLIDE 28

Multi-Drop Chain (Dolly)

[Figure: execution time (100-600 s) and bandwidth per node (3.3-20 MB/s) vs. number of nodes (1-20); series: Fast Ethernet compressed, Fast Ethernet raw, Gigabit Ethernet compressed, Gigabit Ethernet raw]

SLIDE 29

Scalability

[Figure: aggregate bandwidth (20-180 MB/s) vs. number of nodes (1-20); series: Gigabit/Fast Ethernet multi-drop chain raw, Gigabit/Fast Ethernet spanning tree raw, Gigabit/Fast Ethernet star compressed; a line marks the theoretical limit (disk speed)]

SLIDE 30

Predictions and Measurements

[Chart: bandwidth per node in MByte/s, modelled vs. measured]

Configuration                  Network         Modelled   Measured
Multi-drop chain, raw          Fast Ethernet   11.1       8.8
Multi-drop chain, raw          Gigabit         11.1       9.0
Multi-drop chain, compressed   Fast Ethernet   6.1        4.9
Multi-drop chain, compressed   Gigabit         6.1        6.1
Star, 3 clients, raw           Fast Ethernet   4.2        3.8
Star, 3 clients, raw           Gigabit         6.4        8.0
Star, 5 clients, compressed    Fast Ethernet   5.0        3.6
Star, 5 clients, compressed    Gigabit         5.0        4.1

SLIDE 31

Conclusions

  • A simple model captures network topology and node congestion.

  • An extended model also captures the utilisation of basic resources in nodes and switches.

  • Optimal configurations can be derived from our model.

  • For most physical networks, a linear multi-drop chain is better than any other spanning-tree configuration for distributing large data sets.

  • Dolly, our simple tool, transfers an entire 2 GB Windows NT partition to 24 workstations in less than 5 minutes, with a sustained transfer rate of 9 MB/s per node.

SLIDE 32

Questions/Discussion?

Dolly is available for download under the GNU General Public License (source code included):
http://www.cs.inf.ethz.ch/CoPs/

CoPs project (Clusters of PCs), Laboratory for Computer Systems, ETH Zürich, Switzerland