dist-gem5: Distributed Simulation of Compute Clusters

SLIDE 1

dist-gem5: Distributed Simulation of Compute Clusters

Mohammad Alian, Umur Darbaz, Gabor Dozsa, Stephan Diestelhorst, Daehoon Kim, Nam Sung Kim

University of Illinois Urbana-Champaign; ARM Ltd., Cambridge, UK

SLIDE 2

Outline

  • motivation
      ◦ accelerating large-scale simulation
  • dist-gem5 architecture
      ◦ packet forwarding
      ◦ synchronization
      ◦ checkpointing
      ◦ network model
  • evaluation
      ◦ validation, speedup, synchronization overhead
  • conclusion

SLIDE 4

What is gem5 – overview

  • full-system, cycle-level, event-driven simulator
  • used/maintained at universities and industry

[Figure: gem5 component overview. Group titles: Core Integrated IP, ARM ISA Support, CPU Models, Memory, Interconnect, IO components, GPU models, Simulation support. Component labels: ARMv7a, ARMv8, GICv2, Atomic, Timing, In Order, Out of Order, L1-L3 $, SCU, ArchTimer, PMU, Flash, DRAM, HMC, Crossbar, Snoop filter, Bridges, UART, UHDLCD, 10Gb NIC, NVMe, DMA, PCA, UFS, Timers, RTC, NoMali, KVMv7, KVMv8, Traffic Gen, Traffic Monitor, Stream Line, Sim Points, Power Model Int., FracFact.]

SLIDE 5

Why dist-gem5?

  • performance and power dissipation of a distributed system
      ◦ complex interplay among system components at scale
  • need a full-system, cycle-level simulator that is fast enough to simulate a large-scale computer system
  • distributed simulation:
      ◦ simulate a distributed system w/ many simulation hosts

[Figure: performance and power at scale depend on cores, caches, memory, network, devices, ISAs, and the OS.]

SLIDE 6

dist-gem5 architecture – high level view

  • gem5 processes modeling full systems run in parallel on a cluster of physical machines
  • simulated network switch
      ◦ forwards packets among the simulated systems
      ◦ synchronizes the distributed simulation
      ◦ simulates the network topology

[Figure: high-level view. Labels: host #1 to host #4 (physical machines), gem5 process, simulated system #1 to #3, simulated network switch.]

SLIDE 8

dist-gem5 architecture – core components

  • packet forwarding
  • simulated network
  • distributed checkpointing
  • synchronization

SLIDE 10

dist-gem5 architecture – packet forwarding

[Figure: physical hosts #1 to #3, each with a physical NIC (phys NIC#1 to #3), attached to phys port1 to port3 of a physical switch.]

SLIDE 11

dist-gem5 architecture – packet forwarding

[Figure: the same physical setup with the gem5 processes added. gem5 #1 and gem5 #2 each model a simulated system (#1, #2) with a sim NIC; gem5 #3 models the simulated switch with sim port0 and sim port1.]

SLIDE 12

dist-gem5 architecture – packet forwarding

  • simulated packets are embedded into host TCP/IP packets

[Figure: the same setup; each simulated packet ("sim pkt") travels between gem5 processes inside a host TCP packet.]
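The embedding of a simulated packet in a host TCP stream can be sketched as follows. This is only an illustration: the header layout (payload length, send tick, link delay) and the names `pack_sim_packet` and `unpack_sim_packet` are assumptions made for this example, not dist-gem5's actual wire format or API.

```python
import struct

# Hypothetical header: payload length (uint32), send tick (uint64),
# simulated link delay in ticks (uint64), all in network byte order.
HEADER = struct.Struct("!IQQ")


def pack_sim_packet(payload: bytes, send_tick: int, link_delay: int) -> bytes:
    """Wrap one simulated Ethernet frame so it can travel over a host TCP stream."""
    return HEADER.pack(len(payload), send_tick, link_delay) + payload


def unpack_sim_packet(stream: bytes):
    """Split the first message off a byte stream.

    Returns (payload, send_tick, link_delay, remaining_bytes).
    """
    length, send_tick, link_delay = HEADER.unpack_from(stream)
    start = HEADER.size
    payload = stream[start:start + length]
    return payload, send_tick, link_delay, stream[start + length:]


# A made-up frame: destination MAC, source MAC, EtherType, data.
frame = b"\x02\x00\x00\x00\x00\x01" + b"\x02\x00\x00\x00\x00\x02" + b"\x08\x00" + b"hello"
message = pack_sim_packet(frame, send_tick=1_000, link_delay=500)
payload, tick, delay, rest = unpack_sim_packet(message)
```

A length prefix is needed because TCP delivers a byte stream, not message boundaries; carrying the send tick lets the receiver compute the expected delivery tick.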

SLIDE 13

Asynchronous processing of incoming messages

  • simulation thread (main thread)
      ◦ processes/inserts events in the event queue
      ◦ on a send pkt event, encapsulates the simulated Ethernet packet in a message and sends it out
  • receiver thread
      ◦ created for each gem5 process
      ◦ waits for incoming packets
      ◦ creates a recv pkt event and inserts it into the event queue

[Figure: a gem5 process on a physical host. The simulation thread and the receiver thread share the eventQ; send pkt and recv pkt messages pass through the phys NIC.]
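The two-thread arrangement can be sketched in plain Python. Everything here is an assumption made for illustration: the `EventQueue` class, the 12-byte header, and the use of `socket.socketpair()` as a stand-in for the TCP connection between two gem5 processes. It is not gem5 code.

```python
import heapq
import socket
import struct
import threading

HEADER = struct.Struct("!IQ")  # hypothetical: payload length, delivery tick


class EventQueue:
    """Tick-ordered event queue shared by the two threads."""

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()

    def insert(self, tick, kind, data):
        with self._lock:
            heapq.heappush(self._heap, (tick, kind, data))

    def pop(self):
        with self._lock:
            return heapq.heappop(self._heap)


def recv_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closed the stream."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf


def receiver_thread(sock, event_q):
    """Wait for incoming messages and turn each into a 'recv pkt' event."""
    while True:
        header = recv_exact(sock, HEADER.size)
        if header is None:
            return
        length, tick = HEADER.unpack(header)
        payload = recv_exact(sock, length)
        if payload is None:
            return
        event_q.insert(tick, "recv pkt", payload)


# A socket pair stands in for the connection between two gem5 processes.
sender, receiver = socket.socketpair()
event_q = EventQueue()
worker = threading.Thread(target=receiver_thread, args=(receiver, event_q))
worker.start()

# The main thread plays the sending side: encapsulate each frame and send it.
for tick, frame in [(300, b"second"), (100, b"first")]:
    sender.sendall(HEADER.pack(len(frame), tick) + frame)
sender.close()   # end of stream lets the receiver thread exit
worker.join()
receiver.close()

first = event_q.pop()
second = event_q.pop()
```

Because the queue is ordered by tick, events come out in simulated-time order even though the messages arrived in a different order on the wire.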

SLIDE 15

Need for synchronization

  • receiver gem5 can run ahead of sender gem5
      ◦ physical host mismatch
      ◦ different events to be processed
  • slow down the receiver gem5 to ensure simulation accuracy
  • quantum-based synchronization

[Figure: wall-clock timelines of gem5#0 and gem5#1, marking send time, simulated network delay, expected delivery time, and recv time. Annotation: late packet arrival.]

SLIDE 16

Accurate packet forwarding

  • quantum: interval for periodic synchronization in simulated time
  • sync-event flushes inter-gem5 communication channels
  • if quantum ≤ simulated link delay:
      ◦ expected delivery tick falls in the next quantum or later, never in the current one
  • optimal quantum size for accurate forwarding == simulated link delay

[Figure: wall-clock timelines of gem5#0 and gem5#1 with a global sync at each quantum boundary, marking send time, packet arrival, simulated network delay, and expected delivery time.]
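The condition quantum ≤ simulated link delay can be checked with a small model. The model is an assumption for illustration: within a quantum the receiver may run ahead as far as the next global sync, and the sync flushes every in-flight message. The function names are hypothetical.

```python
def quantum_end(tick: int, quantum: int) -> int:
    """Tick of the global sync that closes the quantum containing `tick`."""
    return (tick // quantum + 1) * quantum


def can_arrive_late(send_tick: int, link_delay: int, quantum: int) -> bool:
    """True if the receiver may already be past the expected delivery tick.

    Assumed model: the receiver can run ahead up to the next global sync,
    and the sync guarantees that all sent messages have been received.
    """
    delivery_tick = send_tick + link_delay
    return delivery_tick < quantum_end(send_tick, quantum)


LINK_DELAY = 1_000  # ticks; stands for the simulated link delay

# quantum <= link delay: no send tick can produce a late packet
safe_equal = not any(can_arrive_late(t, LINK_DELAY, quantum=1_000) for t in range(5_000))
safe_smaller = not any(can_arrive_late(t, LINK_DELAY, quantum=500) for t in range(5_000))

# quantum > link delay: a packet sent early in a quantum may be late
unsafe = can_arrive_late(send_tick=0, link_delay=LINK_DELAY, quantum=4_000)
```

A quantum smaller than the link delay is also safe but adds synchronization points, which is why the slide names the link delay itself as the optimal quantum.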

SLIDE 18

dist-gem5 architecture – network modeling

[Figure: servers #0 to #63 in groups of eight; each group connects to a top of rack switch (#0 to #7), and the top of rack switches connect to one aggregate switch. Annotation: simulate in one gem5 process.]

SLIDE 19

Configurable network model

  • configurable baseline Ethernet switch model
      ◦ port number, delay, bandwidth, buffer size

[Figure: top of rack switches #0 to #7 (ports p0 to p8) and the aggregate switch (ports p0 to p7), modeled in gem5 on a physical host. Legend: simulated etherLink, simulated port, distEtherLink, simulated etherSwitch. Switch internals: MAC Table, IPORT#0 to IPORT#n, In-orderQ#0 to In-orderQ#n, OPORT#0 to OPORT#n.]
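The switch internals named in the figure (a MAC table plus per-port queues) can be illustrated with a toy learning switch. `ToySwitch` is a hypothetical class written for this example, not gem5's switch model; it covers only the port count and buffer size and leaves out delay and bandwidth.

```python
from collections import deque

BROADCAST = "ff:ff:ff:ff:ff:ff"


class ToySwitch:
    """Learning switch with one bounded FIFO per output port."""

    def __init__(self, num_ports: int, buffer_size: int):
        self.mac_table = {}                                   # MAC address -> port
        self.out_queues = [deque() for _ in range(num_ports)]
        self.buffer_size = buffer_size                        # max frames per output queue
        self.dropped = 0

    def _enqueue(self, port, frame):
        queue = self.out_queues[port]
        if len(queue) >= self.buffer_size:
            self.dropped += 1                                 # tail drop when the queue is full
        else:
            queue.append(frame)

    def receive(self, in_port, src, dst, payload):
        """Learn the source address, then forward or flood the frame."""
        self.mac_table[src] = in_port
        frame = (src, dst, payload)
        out_port = self.mac_table.get(dst)
        if dst == BROADCAST or out_port is None:
            for port in range(len(self.out_queues)):
                if port != in_port:
                    self._enqueue(port, frame)
        else:
            self._enqueue(out_port, frame)


switch = ToySwitch(num_ports=3, buffer_size=2)
switch.receive(0, "aa", "bb", b"ping")   # "bb" is unknown: flooded to ports 1 and 2
switch.receive(1, "bb", "aa", b"pong")   # "aa" was learned on port 0
queue_lengths = [len(q) for q in switch.out_queues]
```

A timing-aware model would additionally hold each frame for the configured delay and drain the queues at the configured bandwidth.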

SLIDE 21

Methodology – simulation techniques

  • For example, simulating a cluster w/ 7 nodes and 1 network switch:

[Figure: three ways to simulate system#0 to system#6 plus the switch.
  single-threaded-gem5: one gem5 process (gem5#0) on one quad core physical host simulates all seven systems and the switch.
  parallel-gem5: eight gem5 processes (gem5#0 to gem5#7) on one quad core physical host, one per simulated system plus one for the switch.
  dist-gem5: the eight gem5 processes are split across two quad core physical hosts, four per host.]

SLIDE 22

Methodology – experimental setup

  • focus on off-chip network performance using network-intensive applications
      ◦ iperf, memcached, httperf, tcptest, netperf, NAS Parallel Benchmarks
  • verification/validation against:
      ◦ single-threaded-gem5
      ◦ physical cluster: 4-node cluster w/ AMD A10-5800K
  • speedup comparison against:
      ◦ single-threaded-gem5
      ◦ parallel-gem5

category    gem5 configuration
O3 core     4 cores; 4-way superscalar
memory      8GB DDR3 1600 MHz
network     Intel GbE NIC; 1 μs link latency
OS          Linux Ubuntu 14.04 (Kernel 4.3)

SLIDE 23

Verification

  • same node/network config
      ◦ dist-gem5 generates simulation statistics identical to those of single-threaded-gem5
      ◦ different cluster sizes

[Figure: single-threaded-gem5 (one gem5 process on one quad core physical host) = dist-gem5 (gem5 processes split across two quad core physical hosts).]

SLIDE 24

Validation – network latency and bandwidth

  • iperf (left) and memcached (right)
  • follows the behavior of physical setup
  • 17.5% lower response time for memcached

[Figure, left (iperf): latency (ms) and bandwidth (Gbps), dist-gem5 vs. phys. Right (memcached): latency (ms) vs. distribution percentile (1 to 95), dist-gem5 vs. phys.]

SLIDE 25

Speedup – simulation time reduction

  • running httperf on each simulated node
      ◦ sending a fixed number of requests to a unique simulated node (Apache server)
  • compared with single-threaded-gem5
  • dist-gem5 simulating 63 nodes on 16 physical hosts is
      ◦ 83.1× faster than single-threaded-gem5
      ◦ 12.8× faster than parallel-gem5

Speedup (normalized to single-threaded-gem5) vs. number of simulated nodes:

simulated nodes    3      7      15     31     63
dist-gem5          2.7    6.3    21.8   36.0   83.1
parallel-gem5      2.7    3.7    6.6    6.0    6.5

speedup of parallel-gem5 saturates!
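As a quick consistency check, the 12.8× figure follows from the chart's data labels. The assignment of the two label series to dist-gem5 and parallel-gem5 is read from the chart legend order and is an assumption of this sketch.

```python
nodes = [3, 7, 15, 31, 63]
dist_speedup = [2.7, 6.3, 21.8, 36.0, 83.1]       # vs. single-threaded-gem5
parallel_speedup = [2.7, 3.7, 6.6, 6.0, 6.5]      # vs. single-threaded-gem5

# Speedup of dist-gem5 relative to parallel-gem5 at each cluster size.
relative = {
    n: round(d / p, 1)
    for n, d, p in zip(nodes, dist_speedup, parallel_speedup)
}
```

At 63 simulated nodes the ratio is 83.1 / 6.5, which rounds to the 12.8 quoted on the slide.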

SLIDE 26

Scalability – simulation time vs. simulated cluster size

  • simulation time increase for simulating 64 vs. 3 nodes:
      ◦ 57.3× for single-threaded-gem5
      ◦ 23.9× for parallel-gem5
      ◦ 1.9× for dist-gem5

[Figure: normalized simulation time (log scale, 1 to 100) vs. number of simulated nodes for dist-gem5, parallel-gem5, and single-threaded-gem5.]

dist-gem5 scales well!

SLIDE 27

Synchronization overhead

  • sweep synchronization quantum size
  • number of HTTP requests remains nearly constant
      ◦ maximum 2.6% variance
      ◦ almost the same amount of work done at each quantum size
  • simulation time improvement
      ◦ 4.9% from 0.5 μs to 1 μs
      ◦ 15.7% from 0.5 μs to 128 μs

[Figure: normalized simulation time and number of requests (K Req) vs. synchronization quantum size (0.5 to 128 μs).]

dist-gem5 synchronization is efficient!
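One way to read these numbers is to count global syncs. The simulated duration below is a hypothetical value chosen so the arithmetic comes out even; only the two quantum sizes come from the slide.

```python
def num_syncs(simulated_time_us: float, quantum_us: float) -> int:
    """Number of global syncs needed to cover the simulated time."""
    return int(simulated_time_us / quantum_us)


SIMULATED_TIME_US = 1_024_000.0   # hypothetical simulated duration

syncs_small = num_syncs(SIMULATED_TIME_US, 0.5)     # smallest quantum in the sweep
syncs_large = num_syncs(SIMULATED_TIME_US, 128.0)   # largest quantum in the sweep
reduction = syncs_small / syncs_large
```

Going from 0.5 μs to 128 μs removes all but 1 in 256 syncs, yet the reported simulation time improves by only 15.7%, which suggests that the syncs account for a small share of the total run time.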

SLIDE 28

Conclusion

  • dist-gem5 is a distributed version of gem5 for modeling computer clusters
      ◦ validated against a physical cluster
      ◦ accurate/deterministic
      ◦ rich off-chip network modeling
      ◦ 83.1x speedup over single-threaded-gem5 simulating a 63-node cluster
  • integrated into mainstream gem5
      ◦ available at gem5.org
      ◦ enabled via “--dist” command line option
  • developed/maintained by university and industry

SLIDE 29

Thank You
