SLIDE 1

TRAM: Improving Fine-grained Communication Performance with Topological Routing and Aggregation of Messages

Presented by Lukasz Wesolowski

11th Annual Charm++ Workshop, April 15 - 16, 2013

SLIDE 2

Topological Routing and Aggregation Module

SLIDE 3

Topological

exploits physical network topology

Routing and Aggregation Module

SLIDE 4

Topological Routing and

determines message path

Aggregation Module

SLIDE 5

Topological Routing and Aggregation

combines messages

Module

SLIDE 6

Topological Routing and Aggregation Module

component of a larger system

SLIDE 7

Introduction

  • Charm++ library

– Prototype: Mesh Streamer
– Originally developed for the 2011 Charm++ HPC Challenge submission

  • Aggregates fine-grained messages to improve communication performance

SLIDE 8

Why Aggregation?

  • Sending a message involves overhead

– Allocating buffer
– Serializing into buffer
– Injecting onto the network
– Routing
– Receiving
– Scheduling

SLIDE 9

Communication Overhead

  • Some overhead depends on data size

– Serialization

  • Some does not

– Scheduling

  • Aggregation targets the latter, constant overhead

SLIDE 10

Two Types of Constant Overhead

  • Processing overhead

– Processing time involved in sending a message

  • Bandwidth overhead

– To send some bytes on the network you must first … send some more bytes on the network
– What does it mean to send a 0-byte message?

  • Answer: in Charm++, to send at least 48 bytes
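A quick illustrative calculation shows why that fixed 48-byte envelope matters for small messages (the helper name is ours, not a Charm++ API):

```python
# Fraction of transmitted bytes taken up by a fixed per-message header.
# The 48-byte envelope size is the figure quoted on the slide.
def header_fraction(payload_bytes, header_bytes=48):
    return header_bytes / (payload_bytes + header_bytes)

# An 8-byte payload is ~86% header; a 4 KB aggregated buffer is ~1% header.
print(header_fraction(8))     # ~0.857
print(header_fraction(4096))  # ~0.0116
```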

SLIDE 11

Bandwidth Overhead

  • Message header

– Charm++ envelope: 48 bytes

  • Network overhead

– Routing
– Error checking
– Partially filled packets

SLIDE 12

Bandwidth Methodology

  • Network bandwidth is tricky to deal with
  • Fundamentally, it is a property of a single link, but our tendency is to try to distill it into a single value (e.g. bisection bandwidth)

  • If all links are utilized equally and link bandwidth is saturated, then each link’s consumption is significant

– We can then add up each link’s utilization, and concern ourselves with this aggregate bandwidth

SLIDE 13

Fine-grained Communication

  • Constant communication overhead really adds up when sending large numbers of small messages

– What about large numbers of large messages?

  • Sources of fine-grained communication

– Control messages, acknowledgments, requests, etc.

  • For strong scaling, communication becomes increasingly fine-grained with increasing processor count

SLIDE 14

Why Routing?

  • By routing, we mean not selection of the links along which messages travel, but instead:

– Selection of intermediate destination nodes or processes, and delivery of the message to the runtime system at the intermediate destinations
– Analogy: bus route

  • Why does a passenger bus make stops before reaching the end of the route?

SLIDE 15

Why Routing?

  • Why does a bus make stops before reaching the end of the route?

– To serve more people along its direction of travel

  • Picking up people who want to board the bus at ANY stop along the route
  • Dropping off people whose destination is ANY subsequent stop along the route

– Stopping at n stops serves (n-1)(n-2) separate trips (source/destination pairs)

  • This is why a relatively small number of buses can serve a large area of a city

SLIDE 16

Why Topological?

  • It is infeasible to have a separate hardware network link between every pair of nodes in the system

  • Consequences

– some messages must travel through one or more intermediate nodes or switches

  • How it happens is normally invisible to the application and runtime system

– aggregate bandwidth consumed grows linearly with every additional link along the route
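That linear growth can be put in numbers: every link a message crosses carries the full message once, so aggregate (summed-over-all-links) bandwidth consumption scales with hop count. A tiny illustrative helper:

```python
# Aggregate (summed over all links) bandwidth consumed by one message:
# every link along the route carries the full message once.
def aggregate_bytes(message_bytes, hops):
    return message_bytes * hops

# A 1 KB message crossing 4 links consumes 4 KB of aggregate bandwidth.
print(aggregate_bytes(1024, 4))  # 4096
```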

SLIDE 17

Congestion

  • Messages traveling concurrently along a link must split the bandwidth, leading to congestion

  • Due to aggregation, TRAM messages are larger than typical, so congestion is of higher concern

SLIDE 18

Network Topology

  • No single network topology for supercomputers is accepted as best, so in practice several are in use

Image sources: wiki.ci.uchicago.edu, en.wikipedia.org, and Bhatele et al., SC ’11

SLIDE 19

Virtual Topology

  • The nodes of a physical topology can be mapped onto a virtual topology

  • The same virtual topology can be reused for various physical topologies

  • TRAM employs a mesh virtual topology

SLIDE 20

Topological Routing

  • Most messages pass across multiple links to reach the destination

  • We can try combining messages, taking advantage of intermediate destinations analogously to bus stops

  • But hardware-level routing is transparent to the runtime system

– Solution: lift routing into software, at the level of the runtime system
– Possible pitfall: routing will still happen independently in hardware

SLIDE 21

Minimal Routing

  • Routing is minimal if every message sent travels over the minimum number of links possible to reach its destination

  • Our goal with TRAM is to preserve minimal routing if possible

– Reason: non-minimal routing consumes additional aggregate bandwidth

SLIDE 22

Virtual to Physical Topology Mapping

  • Simplest and often best: make the virtual topology identical to the physical one

– Using the Charm++ Topology Manager

  • For high-dimensional meshes and tori

– Reduce the number of dimensions while preserving minimal routing

  • Fat trees

– 2D within/across nodes

SLIDE 23

Data Item

  • Unit of fine-grained communication to be sent by TRAM

  • Sent to a particular destination
  • Submitted using a local library call instead of the regular Charm++ syntax for a message send
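The slide doesn't show TRAM's actual signatures, so the class and method names below are hypothetical; this is a minimal sketch of the buffer-and-flush pattern behind such a local call:

```python
from collections import defaultdict

class StreamerSketch:
    """Hypothetical sketch of buffer-and-flush aggregation: items
    submitted locally are buffered per destination and sent as one
    combined message when a buffer fills. Not TRAM's actual API."""

    def __init__(self, items_per_buffer):
        self.items_per_buffer = items_per_buffer
        self.buffers = defaultdict(list)
        self.messages_sent = 0

    def insert_data(self, item, destination):
        # Local call replacing a per-item Charm++ message send.
        buf = self.buffers[destination]
        buf.append(item)
        if len(buf) == self.items_per_buffer:
            self._flush(destination)

    def flush_all(self):
        # Send any partially filled buffers (e.g. at termination).
        for dest in list(self.buffers):
            if self.buffers[dest]:
                self._flush(dest)

    def _flush(self, destination):
        # A real implementation would serialize and send one message here.
        self.messages_sent += 1
        self.buffers[destination].clear()

s = StreamerSketch(items_per_buffer=10)
for i in range(100):
    s.insert_data(i, destination=7)
s.insert_data(0, destination=3)  # one stray item
s.flush_all()
print(s.messages_sent)  # 11: ten full buffers plus one partial flush
```

One hundred and one fine-grained sends collapse into eleven combined messages, which is the constant-overhead saving the earlier slides describe.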

SLIDE 24

TRAM Peers

  • In the context of TRAM, a process is allowed to communicate only with its peers

– peers are all the processes that can be reached from it by moving arbitrarily far strictly along a single dimension
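Under this definition, the peer count in an N-dimensional mesh is the sum of (size − 1) over the dimensions, which a quick check confirms for the 32 x 32 x 32 example used later in the talk:

```python
# Peers reachable by moving strictly along one dimension of a mesh:
# (size - 1) choices per dimension, summed over the dimensions.
def peer_count(dims):
    return sum(d - 1 for d in dims)

# 32 x 32 x 32 mesh of 32768 processes: 93 peers per process.
print(peer_count([32, 32, 32]))  # 93
```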

SLIDE 25

Mesh Routing Algorithm

  • Order the N dimensions in the virtual topology
  • According to the order, send data items along the highest dimension whose index does not match the destination’s

– to the peer whose index does match the final destination’s index along that dimension

  • Aggregate at the source and each intermediate destination
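A minimal sketch of this next-hop rule (the coordinate representation is illustrative; the slide doesn't fix one):

```python
# Dimension-ordered mesh routing sketch: correct the highest-indexed
# coordinate that still differs from the destination's, jumping
# directly to the matching peer along that dimension.
def next_hop(current, dest):
    for d in reversed(range(len(current))):
        if current[d] != dest[d]:
            hop = list(current)
            hop[d] = dest[d]
            return hop
    return list(current)  # already at the destination

at, dest = [0, 0, 0], [2, 3, 1]
hops = 0
while at != dest:
    at = next_hop(at, dest)
    hops += 1
print(hops)  # 3: one hop per mismatched dimension
```

Because each hop lands on a peer that already matches the destination in one more dimension, a message needs at most N hops in an N-dimensional mesh, and items can be aggregated at every stop.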

SLIDE 26

Mesh Routing and Aggregation

SLIDE 27

Aggregation Buffer Size

  • Buffers should be large enough to give good bandwidth utilization, but no larger

– Buffering time should be relatively low

  • On Blue Gene/P, buffers of size 4 KB or more are sufficient to almost saturate the bandwidth

SLIDE 28

TRAM Memory Footprint

  • Number of peers is typically a small fraction of all the processes in the run

– For example, 32 x 32 x 32 topology

  • 32768 processes
  • 93 peers

  • This allows TRAM’s memory footprint to remain relatively small

– Small enough for lower-level cache
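Combining the 93 peers with, say, the 4 KB buffer size from the Blue Gene/P discussion gives a rough, purely illustrative footprint estimate:

```python
# Rough aggregation-buffer footprint: one buffer per peer.
def footprint_bytes(peers, buffer_bytes):
    return peers * buffer_bytes

# 93 peers x 4 KB buffers ~= 372 KB, plausibly cache-resident.
print(footprint_bytes(93, 4096))  # 380928
```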

SLIDE 29

TRAM Usage Pattern

  • Start-up
  • Initialization
  • Sending and receiving
  • Termination
  • Re-initialization

SLIDE 30

Alltoall Performance on Blue Gene/P

SLIDE 31

ChaNGa on Blue Gene/Q

SLIDE 32

EpiSimdemics on Blue Waters

SLIDE 33

Future Plans

  • Develop alternative virtual topologies for non-mesh networks

  • Generalize

– First within Charm++
– Then to other communication models

  • Automate

– Library parameter selection
– Virtual topology dimensions
– Choice of which messages to aggregate

SLIDE 34

Acknowledgements

This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. EpiSimdemics results courtesy of Jae-Seung Yeom and the EpiSimdemics team.

SLIDE 35

Thank You
