SLIDE 1

A Study of Network Quality of Service in Many-Core MPI Applications

Lee Savoie¹, David Lowenthal¹, Bronis de Supinski², Kathryn Mohror²

¹The University of Arizona, ²Lawrence Livermore National Laboratory

SLIDE 2

Introduction

  • Core counts increasing in high performance computing (HPC)
  • Many machines already include many-core accelerators
  • Many-core nodes process more data
  • The network must work harder to transfer data between nodes

SLIDE 3

Network Contention

“There goes the neighborhood: performance degradation due to nearby jobs” (Bhatele et al., SC ’13)

SLIDE 4

Fat-tree Contention

  • HPC systems with many-core nodes need better network management

SLIDE 5

Quality of Service (QoS)

  • Most networks provide QoS mechanisms for network management

  • In InfiniBand:
  • Packets are marked with a service level (SL)
  • Each SL has a priority

[Diagram: packet streams tagged SL 1 (priority 1) and SL 2 (priority 3) entering the network; the sketch below shows where the SL is set.]
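A minimal sketch of where the SL appears in the InfiniBand verbs API (my illustration, not code from the talk): the SL is a field of the address-handle attributes applied when a reliable-connected queue pair transitions to the ready-to-receive state, and the subnet manager maps each SL to a virtual lane with its own arbitration priority. The connection parameters (dest_qp_num, dlid, rq_psn) stand in for the usual out-of-band exchange; error handling and the rest of QP setup are omitted.

    #include <infiniband/verbs.h>

    /* Sketch only: tag all traffic on an RC queue pair with a service
     * level. The SL rides in the address-handle attributes used for
     * the INIT -> RTR transition. */
    int apply_service_level(struct ibv_qp *qp, uint8_t sl,
                            uint32_t dest_qp_num, uint16_t dlid,
                            uint32_t rq_psn)
    {
        struct ibv_qp_attr attr = {
            .qp_state           = IBV_QPS_RTR,
            .path_mtu           = IBV_MTU_4096,
            .dest_qp_num        = dest_qp_num,
            .rq_psn             = rq_psn,
            .max_dest_rd_atomic = 1,
            .min_rnr_timer      = 12,
            .ah_attr = {
                .sl       = sl,    /* service level: selects the VL/priority */
                .dlid     = dlid,
                .port_num = 1,
            },
        };
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC |
                             IBV_QP_MIN_RNR_TIMER);
    }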

SLIDE 6

Research Question

  • Can we improve the performance of contending jobs on HPC systems using QoS?
  • This would enable HPC systems to handle the increased data demands of many-core nodes.

  • This work focuses on per-job QoS
  • Each job runs in a separate service level
  • Each job is guaranteed a minimum amount of bandwidth

SLIDE 7

Experimental Setup

  • 300-node machine
  • Left 20 nodes free in case of failures
  • No other jobs running
  • Service levels with priorities 2286:254:9:1 (worked bandwidth shares after this list)
  • Applications
  • QBox
  • Crystal Router
  • MILC
  • pF3D
  • Micro-benchmarks
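A quick aside on those priorities (my arithmetic, under the assumption that they act as proportional arbitration weights, which the slide does not state): the weights sum to 2286 + 254 + 9 + 1 = 2550, so under full contention the four service levels would be guaranteed roughly 2286/2550 ≈ 89.6%, 254/2550 ≈ 10.0%, 9/2550 ≈ 0.35%, and 1/2550 ≈ 0.04% of the link bandwidth, respectively.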

SLIDE 8

Micro-Benchmarks

[Diagrams of the four communication patterns: Flood-Pairs, Nearest-Neighbor, All-to-all, Random-Pairs]

SLIDE 9

Methodology

  • Ran 4 jobs at a time
  • 70 nodes each
  • 22 ranks per node
  • Assigned nodes to jobs randomly
  • Repeated tests several times with different node assignments
  • Restarted each job when it completed, to maintain the contention profile until all jobs had completed at least once (see the harness sketch after this list)
  • Ran the following tests:
  • Ideal – each job running in isolation
  • Default – all jobs in the same service level
  • All assignments of jobs to 4 service levels
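A minimal sketch of that restart logic (a hypothetical harness in C; run_job.sh is a placeholder for each job's real launch command): every job is relaunched as soon as it exits, until all four have completed at least once, so any job still on its first run always faces the same contention.

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NJOBS 4

    static pid_t launch(const char *cmd)
    {
        pid_t pid = fork();
        if (pid == 0) {                    /* child: run the job command */
            execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
            _exit(127);                    /* exec failed */
        }
        return pid;
    }

    int main(void)
    {
        const char *cmds[NJOBS] = {        /* hypothetical launch commands */
            "./run_job.sh 0", "./run_job.sh 1",
            "./run_job.sh 2", "./run_job.sh 3",
        };
        pid_t pids[NJOBS];
        int done_once[NJOBS] = {0};
        int remaining = NJOBS;

        for (int i = 0; i < NJOBS; i++)
            pids[i] = launch(cmds[i]);

        /* On each exit, record the job's first completion; restart it
         * while any other job has not yet completed once, so the
         * contention profile stays constant. */
        while (remaining > 0) {
            int status;
            pid_t pid = wait(&status);
            for (int i = 0; i < NJOBS; i++) {
                if (pids[i] != pid)
                    continue;
                if (!done_once[i]) {
                    done_once[i] = 1;
                    remaining--;
                }
                if (remaining > 0)
                    pids[i] = launch(cmds[i]);
                break;
            }
        }
        return 0;
    }

A real harness would also reap or terminate the still-running restarted copies once the last job finishes its first run; that cleanup is omitted to keep the sketch short.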

SLIDE 10

Results: Micro-Benchmarks

  • Per-job QoS is insufficient to improve performance.

SLIDE 11

Flood-pairs Rank Timing

  • Only a few ranks need to be prioritized.

SLIDE 12

Nearest-neighbor Rank Timing

[Plot: per-rank communication times, revealed progressively across several animation slides; series: High Priority, Contended]

SLIDE 18

Per-Rank QoS

  • Prioritizing an entire job gives high priority to some ranks that are already fast.
  • This slows down other jobs, erasing any throughput improvement.
  • What if we prioritize only the slowest ranks?
  • Requires prioritizing only ~10% of ranks
  • Same performance as prioritizing the entire job
  • Expect significant reduction in impact on other jobs
  • This is the subject of ongoing research (see the selection sketch after this list)
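The talk does not show code for this, but the selection step might look like the following MPI sketch (my illustration, assuming a 90th-percentile cutoff; the timed barrier stands in for an application's real communication phase, and actually moving flagged ranks to a high-priority SL is outside the sketch):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Stand-in for a real communication phase: time a barrier; an
         * application would time its halo exchange or all-to-all. */
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);
        double my_time = MPI_Wtime() - t0;

        /* Gather every rank's communication time on every rank. */
        double *times = malloc(size * sizeof(double));
        MPI_Allgather(&my_time, 1, MPI_DOUBLE,
                      times, 1, MPI_DOUBLE, MPI_COMM_WORLD);

        /* 90th-percentile threshold: the slowest ~10% of ranks qualify
         * for the high-priority service level. */
        double *sorted = malloc(size * sizeof(double));
        memcpy(sorted, times, size * sizeof(double));
        qsort(sorted, size, sizeof(double), cmp_double);
        double threshold = sorted[(int)(0.9 * (size - 1))];

        if (my_time >= threshold)
            printf("rank %d: slow (%.6f s), request high-priority SL\n",
                   rank, my_time);

        free(times);
        free(sorted);
        MPI_Finalize();
        return 0;
    }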

SLIDE 19

Related Work

  • QoS has been studied for a long time
  • Jokanovic et al. (2012) reached the opposite conclusion
  • Segregated jobs into SLs with different priorities
  • Reported a 59% contention reduction
  • Possible reasons for the difference:
  • Simulation vs. hardware
  • Future vs. current hardware
  • Different service levels

SLIDE 20

Different Service Levels

  • QoS in HPC deserves more research

SLIDE 21

Conclusion

  • Many-core nodes will require efficient networks to move data around
  • Simple, per-job QoS is unlikely to improve performance
  • Differs from previous work
  • Per-rank QoS is more promising
  • Further research is needed to understand QoS in HPC

lsavoie@cs.arizona.edu http://www.cs.arizona.edu/people/lsavoie/

SLIDE 22

Backup

SLIDE 23

Per-Job QoS

[Diagram. No QoS: Jobs 1, 2, and 3 share the network in a single service level. QoS: Job 1 (priority 1), Job 2 (priority 3), and Job 3 (priority 2) enter the network in separate service levels.]

SLIDE 24

Related Work

  • QoS has been applied to:
  • The internet [Blake 1998]
  • Video streaming [Ke 2005, Kumwilaisak 2003]
  • Clouds and data centers [Voith 2012]
  • Wireless networks [Andrews 2001]
  • Divide traffic across SLs with the same priority to avoid head-of-line blocking [Subramoni 2010, Guay 2011]
  • We use service levels with different priorities
  • Other methods of dealing with contention:
  • Adaptive routing [Jain 2014]
  • Job placement [Yang 2016, Jokanovic 2015]
  • These methods are complementary to ours and insufficient on their own

SLIDE 25

Results: Applications

  • Per-job QoS is insufficient to improve performance.