HELP YOUR BUSY NEIGHBOURS DYNAMIC MULTICASTS OVER STATIC TOPOLOGIES

SLIDE 1

28th August 2017

HELP YOUR BUSY NEIGHBOURS DYNAMIC MULTICASTS OVER STATIC TOPOLOGIES

Robert Kuban, Randolf Rotta, Jörg Nolte

Distributed Systems / Operating Systems

SLIDE 2

OUR TARGET SCENARIO

  • objective: scalable multicasts
      + acknowledgement of completion
      + dynamic group membership (join/leave)
  • applications: cache invalidation, esp. TLB shootdown
  • hardware: many-cores like Intel XeonPhi, Tilera TilePro, ...
      + cache-coherent shared memory
      + point-to-point message passing

1·Motivation 2

SLIDE 3

EXAMPLE: LINUX TLB SHOOTDOWN

Linux 4.11 x86 smp_call_function_many()

Initiator (Sender)

  1. update page tables
  2. enqueue invalidation tasklet at each thread
  3. send IPI to each thread
  4. wait on flag in each tasklet

Other CPU Threads

IPI handler processes tasklet:

  1. invalidate page(s) in TLB
  2. set ACK flag in tasklet

[Figure: sender S sends to receivers R0, R1, R2, ..., Rn; the receivers ack]

⇒ flat topology

  • fast join/leave via bit-mask
  • O(n) latency
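The flat scheme above can be sketched as a small simulation. This is only an illustration, not the Linux implementation: `join`, `leave`, `members_of` and `flat_multicast` are hypothetical names, and the IPI and tasklet are modelled by a plain dictionary of ACK flags.

```python
def join(mask: int, tid: int) -> int:
    """Add thread `tid` to the group: set one bit."""
    return mask | (1 << tid)


def leave(mask: int, tid: int) -> int:
    """Remove thread `tid` from the group: clear one bit."""
    return mask & ~(1 << tid)


def members_of(mask: int) -> list:
    """Thread ids whose bit is set in the membership mask."""
    return [tid for tid in range(mask.bit_length()) if (mask >> tid) & 1]


def flat_multicast(mask: int, sender: int) -> int:
    """Simulate one acknowledged multicast and return the number of sends.

    The sender walks the whole group one thread at a time, which is
    where the O(n) latency of the flat topology comes from.
    """
    ack = {}
    for tid in members_of(mask):
        if tid != sender:
            ack[tid] = False      # steps 2 and 3: enqueue request, "send IPI"
    for tid in ack:
        ack[tid] = True           # receiver: invalidate, then set ACK flag
    assert all(ack.values())      # step 4: sender waits on every flag
    return len(ack)
```

Join and leave touch a single bit, while the number of sends issued by the one sender grows linearly with the group size.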

SLIDE 4

EXAMPLE: MULTICASTS IN BARRELFISH

[Figure: tree with root R0 and nodes R1 to R7; "send" edges go down the tree, "ack" edges come back]

  • propagate along a tree topology
  • use constraint solver for optimized topology
  • proposed for TLB shootdowns [1]
  • expensive join/leave, or interrupt ex-members
  • O(log n) latency

[1] Baumann et al., The multikernel: A new OS architecture for scalable multicore systems, 2009
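To make the O(n) versus O(log n) contrast concrete, here is a rough step-count model. It is a sketch under stated assumptions, not Barrelfish's topology: the tree is a heap-ordered binary tree (children of node i are 2i+1 and 2i+2), every send costs one step, and each node sends to its left child before its right child. `flat_steps` and `tree_steps` are hypothetical helper names.

```python
def flat_steps(n: int) -> int:
    """Sends issued one after another by a single sender to n - 1 receivers."""
    return n - 1


def tree_steps(n: int) -> int:
    """Step at which the last node of an n-node binary tree is reached."""
    arrival = [0] * n
    for node in range(1, n):
        parent = (node - 1) // 2
        # Left children (odd index) are served first, right children second.
        order = 1 if node % 2 == 1 else 2
        arrival[node] = arrival[parent] + order
    return max(arrival)
```

Under these assumptions the critical path of the tree grows with the depth of the tree rather than with the number of receivers.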

SLIDE 5

DESIGN SPACE

Flat

  • Multicasts (just members): low latency for small groups, high latency for large groups, fast join/leave
  • Broadcasts (over all threads): always high latency, interrupts non-members

Tree

  • Multicasts (just members): always low latency, costly join/leave
  • Broadcasts (over all threads): good latency for large groups, bad latency for small groups, interrupts non-members

SLIDE 6

MULTICASTS ON A STATIC TOPOLOGY

Problem Statement: Combine ...

  • fast join/leave like with flat topology
  • low latency like in tree topologies (parallel propagation)

Solution Idea

  • use static tree topology like in broadcasts (can be hand-crafted for the processor)
  • membership as bit-mask for fast join/leave
  • exploit shared memory to skip non-members, just message passing to actual members

2·Multicasts on a Static Topology 6

SLIDE 7

TREES WITH ACKNOWLEDGEMENT

Nodes = Cores; Two roles at each node

[Figure: tree with a root; edges labelled "send" and "ack"]

SLIDE 8

TREES WITH ACKNOWLEDGEMENT

Logical nodes for larger design space & simpler code

[Figure: separate scatter nodes (root, "send" edges) and gather nodes ("ack" edges back to the root)]

SLIDE 9

NON-MEMBER NODES IN BROADCASTS

[Figure: tree over nodes 1 to 9; every edge is labelled "send"]

SLIDE 10

SOLUTION: HELPING

Skip non-member scatter nodes

[Figure: tree over nodes 1 to 9; edges are labelled "help" or "send"]
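Helping can be sketched as follows. This is a minimal simulation under assumptions that are not in the slides: a heap-ordered binary tree stands in for the static topology, the bit-mask `mask` holds the membership, and `children`, `is_member` and `scatter_with_helping` are hypothetical names. A member core that finds a non-member child does not send a message; it reads the child's part of the topology from shared memory and runs that child's scatter role itself.

```python
def children(node: int, n: int) -> list:
    """Children of `node` in a heap-ordered binary tree with n nodes."""
    return [c for c in (2 * node + 1, 2 * node + 2) if c < n]


def is_member(mask: int, node: int) -> bool:
    """True if the bit for `node` is set in the membership mask."""
    return (mask >> node) & 1 == 1


def scatter_with_helping(core: int, mask: int, n: int, log: list) -> None:
    """Run the scatter role of member `core`, helping non-member children."""
    work = [core]                      # scatter nodes this core must process
    while work:
        current = work.pop()
        for child in children(current, n):
            if is_member(mask, child):
                log.append(("send", core, child))
                # The message wakes the child, which continues on its own.
                scatter_with_helping(child, mask, n, log)
            else:
                log.append(("help", core, child))
                work.append(child)     # no message: do the child's work here
```

In this model only members ever receive a message, but every non-member node still costs the helper one membership test and one traversal step.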

SLIDE 11

HUGE OVERHEAD FOR SMALL GROUPS :(

[Figure: tree over nodes 1 to 9; every edge is labelled "help"]

SLIDE 12

SOLUTION: SKIPPING

Jump over whole subtrees

[Figure: tree over nodes 1 to 9 with "help" edges and three "skip" edges]
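One way to jump over whole subtrees is sketched below. The slides do not say how empty subtrees are detected, so this example assumes a precomputed bit-mask per subtree of the static tree that is compared against the membership mask; `subtree_masks` and `scatter_with_skipping` are hypothetical names, and the heap-ordered binary tree is again only a stand-in for the real topology.

```python
def children(node: int, n: int) -> list:
    """Children of `node` in a heap-ordered binary tree with n nodes."""
    return [c for c in (2 * node + 1, 2 * node + 2) if c < n]


def subtree_masks(n: int) -> list:
    """For every node, a bit-mask of all nodes in its subtree (static)."""
    masks = [0] * n
    for node in reversed(range(n)):    # children are filled in before parents
        masks[node] = 1 << node
        for child in children(node, n):
            masks[node] |= masks[child]
    return masks


def scatter_with_skipping(core, mask, n, sub, log):
    """Scatter role of member `core`: send, help, or skip each child."""
    work = [core]
    while work:
        current = work.pop()
        for child in children(current, n):
            if (mask & sub[child]) == 0:
                log.append(("skip", core, child))   # no member below: jump
            elif (mask >> child) & 1:
                log.append(("send", core, child))
                scatter_with_skipping(child, mask, n, sub, log)
            else:
                log.append(("help", core, child))
                work.append(child)
```

With a small group, most of the tree is pruned by a single mask test per subtree instead of being traversed node by node.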

SLIDE 13

EVALUATION SETUP

Setup Flat Topology Binary Tree

  • Intel XeonPhi Knights Corner (1.053 GHz)
  • 60 cores
  • message passing via shared memory polling

3·Evaluation 13

SLIDE 14

FLAT TOPOLOGY

multicast similar to Linux TLB shootdown

[Plot: median latency [k cycles] over group size; series: broadcast, multicast]

SLIDE 15

FLAT TOPOLOGY WITH HELPING

Overhead from membership tests and graph traversal

[Plot: median latency [k cycles] over group size; series: broadcast, broadcast with helping, multicast]

SLIDE 16

BINARY TREE WITH HELPING, SKIPPING

[Plot: median latency [k cycles] over group size; series: broadcast with helping, broadcast with skipping]

SLIDE 17

CONCLUSION

Scalable, acknowledged, dynamic multicasts for manycores:

  • Challenges: generating good topologies is costly, flat topology not scalable, non-members should not be interrupted
  • Solution: static optimized broadcast topology, help and skip non-member cores
  • Result: success for large groups, alright for small
  • Implications: improve Linux TLB shootdown for Many-Core HPC apps

SLIDE 19

ACKNOWLEDGE VIA SHARED MEMORY

Decrement shared variable instead of message passing

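Only message passing:

[Figure: nodes 2 and 1, each edge labelled "ack"]

Using shared memory:

[Figure: nodes 2 and 1, each edge labelled "dec", followed by a single "ack"]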
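The shared-memory variant can be sketched with a counter that every receiver decrements, where only the receiver that brings it to zero reports completion. This is an illustration under assumptions: `SharedAck` and `run` are hypothetical names, Python threads stand in for cores, and a lock stands in for an atomic decrement on cache-coherent shared memory.

```python
import threading


class SharedAck:
    """Counter that receivers decrement; the last one reports completion."""

    def __init__(self, receivers: int):
        self.pending = receivers
        self.lock = threading.Lock()       # stand-in for an atomic decrement
        self.ack_messages = 0              # messages that reach the initiator
        self.completed = threading.Event()

    def dec(self) -> None:
        with self.lock:
            self.pending -= 1
            last = self.pending == 0
        if last:                           # exactly one thread gets here
            self.ack_messages += 1
            self.completed.set()


def run(receivers: int) -> SharedAck:
    """Let `receivers` threads acknowledge through one shared counter."""
    ack = SharedAck(receivers)
    threads = [threading.Thread(target=ack.dec) for _ in range(receivers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return ack
```

However many receivers take part, the initiator sees a single ack instead of one message per receiver.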

SLIDE 20

HELPING WITH SHARED MEM ACK

→ tree combining [2] for gather nodes

[Figure: tree over nodes 1 to 9 with "help" and "send" edges; acknowledgements are "dec" edges, with one "dec+ack"]

[2] Yew et al., Distributing Hot-Spot Addressing in Large-Scale Multiprocessors, 1987
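Tree combining over gather nodes can be sketched as one counter per node of the static tree. This is a simplified, sequential model under assumptions not taken from the slides: the tree is a heap-ordered binary tree with node 0 as root, each counter starts at the number of children plus one, and `completion_order` lists the nodes in the order their own work finishes (for a non-member node, the helping core would perform that decrement). `combine_acks` is a hypothetical name.

```python
def combine_acks(n: int, completion_order: list):
    """Return (node whose decrement completed the root, final counters)."""
    def child_count(i: int) -> int:
        return sum(1 for c in (2 * i + 1, 2 * i + 2) if c < n)

    pending = [child_count(i) + 1 for i in range(n)]
    finisher = None
    for node in completion_order:
        current = node
        while True:
            pending[current] -= 1          # "dec" on a gather node
            if pending[current] > 0:
                break                      # someone else finishes this node
            if current == 0:
                finisher = node            # root done: the single "ack"
                break
            current = (current - 1) // 2   # last one here moves up the tree
    return finisher, pending
```

Each counter is touched only by a node and its children, so no single shared variable is decremented by the whole group.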