
slide-1
SLIDE 1

A UPC++ Actor Library and its Evaluation on a Shallow Water Application

Alexander Pöppl¹, Scott Baden², Michael Bader¹

¹ Department of Informatics, Technical University of Munich

² Computational Research Division, Lawrence Berkeley National Laboratory; Department of Computer Science and Engineering, University of California, San Diego

Parallel Applications Workshop, Alternatives To MPI+X

November 18th 2019, Denver, Colorado

slide-2
SLIDE 2

Motivation

Is it feasible to program an actor library using standard languages and frameworks? If so, how does performance compare, both to our X10-based library and to BSP?

  ✓ Tools: C++, OpenMP, UPC++

2 Alexander Pöppl | A UPC++ Actor Library | PAW-ATM 2019

[Figure: tiled invasive computing architecture with CPU tiles, an i-Core tile, a TCPA tile, memory and I/O tiles, connected by NoC routers]

Invasive Computing:

  • Dynamic resource allocation
  • Predictability through exclusive resource usage
  • Heterogeneous compute tiles

Actor-based Modelling

  • Good fit for architecture, enables exploration of different mappings of actors to compute tiles

  • SWE-X10 as sample application

Transfer to larger-scale applications

slide-3
SLIDE 3

UPC++

  • Asynchronous Partitioned Global Address Space (APGAS) model
  • Reliance on one-sided communication
  • Asynchronous, continuation-based API
  • Based on GASNet-EX, makes direct use of InfiniBand and (some) Cray interconnects

[Figure: ranks 0, 1, …, n, each with a private segment and a shared segment]

Adapted from: UPC++ Specification v1.0 Draft 10, available at https://upcxx.lbl.gov

slide-4
SLIDE 4

UPC++

RPCs

  • Executed asynchronously
  • Serialization and transfer of parameters and return value
  • Completion events available after the local part (or the overall RPC execution) is finished

[Figure: Rank m, Rank n, …]

slide-5
SLIDE 5

UPC++

RPCs

  • Executed asynchronously
  • Serialization and transfer of parameters and return value
  • Completion events available after the local part (or the overall RPC execution) is finished

Global Pointers

  • Point to data in the shared segment
  • May be used as target for RMA operations

[Figure: Rank m, Rank n, …]

slide-6
SLIDE 6

UPC++

RPCs

  • Executed asynchronously
  • Serialization and transfer of parameters and return value
  • Completion events available after the local part (or the overall RPC execution) is finished

Global Pointers

  • Point to data in the shared segment
  • May be used as target for RMA operations

Distributed Objects

  • Created collectively
  • Same handle points to different objects on each rank

[Figure: Rank m, Rank n, …]

slide-7
SLIDE 7

UPC++ Actor Library

Actors

  • Encapsulate specific functionality, data and behavior
  • Behavior defined through finite state machines
  • No data sharing between actors
  • Defined communication endpoints (Ports)
  • Have the ability to compute whenever data in their ports (InPorts or OutPorts) changes

  ➢ Actors are being triggered

Application Developers…

  • …subclass and implement the act() method (actor FSM)
  • …use ports as communication endpoints
  • …specify which ports are connected

slide-8
SLIDE 8

UPC++ Actor Library

Channels

  • Unidirectional connection between two ports
  • FIFO semantics
  • Operations: read(), write(T), peek()
  • Guards: available(), freeCapacity()
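
The channel interface listed on this slide can be sketched in plain C++ as a bounded FIFO. This is a minimal single-process illustration under stated assumptions, not the library's implementation: the class name `Channel`, the `Capacity` template parameter and the use of `std::deque` are hypothetical; only the operation and guard names come from the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>

// Hypothetical bounded FIFO channel; only the method names are from the slide.
template <typename T, std::size_t Capacity>
class Channel {
public:
    // Guard: number of elements that can currently be read.
    std::size_t available() const { return queue_.size(); }

    // Guard: number of elements that can still be written.
    std::size_t freeCapacity() const { return Capacity - queue_.size(); }

    // Enqueue an element; the caller is expected to check freeCapacity() first.
    void write(T value) {
        assert(freeCapacity() > 0);
        queue_.push_back(std::move(value));
    }

    // Look at the oldest element without removing it.
    const T& peek() const {
        assert(available() > 0);
        return queue_.front();
    }

    // Dequeue the oldest element (FIFO order).
    T read() {
        assert(available() > 0);
        T value = std::move(queue_.front());
        queue_.pop_front();
        return value;
    }

private:
    std::deque<T> queue_;
};
```

An actor would typically test `available()` or `freeCapacity()` in its guards before calling `read()` or `write()`.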

slide-9
SLIDE 9

UPC++ Actor Library – Write

[Figure: actor A1 (port A1::Out) on Rank N connected through a Channel to actor A2 (port A2::In) on Rank M]

  1: RPC (insert Data)
  2: LPC (trigger Actor)
  3: LPC (track RPC completion)

slide-10
SLIDE 10

UPC++ Actor Library – Read

[Figure: actor A1 (port A1::Out) connected through a Channel to actor A2 (port A2::In)]

  1: read (dequeue Data)
  2: RPC (update capacity)
  3: LPC (trigger Actor)
  4: LPC (track RPC completion)
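
The write and read sequences on the last two slides can be imitated in a single process. The sketch below is an illustration under stated assumptions, not the library's code: remote procedure calls are modelled as callbacks stored in an inbox, the names `RemoteChannel`, `Inbox` and `progress` are hypothetical, the sender is assumed to keep its own free-capacity counter, and the completion-tracking LPC is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <queue>

// Stand-in for a rank's queue of incoming remote calls (hypothetical).
using Inbox = std::queue<std::function<void()>>;

struct RemoteChannel {
    std::size_t senderFreeCapacity;  // sender-side view of the free capacity
    std::deque<int> receiverQueue;   // data stored on the receiving rank
    bool senderTriggered = false;
    bool receiverTriggered = false;
    Inbox toSender;                  // calls travelling to the sending rank
    Inbox toReceiver;                // calls travelling to the receiving rank

    explicit RemoteChannel(std::size_t capacity) : senderFreeCapacity(capacity) {}

    // Write: reserve a slot locally, then insert the data remotely
    // and trigger the receiving actor.
    void write(int value) {
        assert(senderFreeCapacity > 0);
        --senderFreeCapacity;
        toReceiver.push([this, value] {
            receiverQueue.push_back(value);  // insert data
            receiverTriggered = true;        // trigger actor
        });
    }

    // Read: dequeue locally, then update the capacity on the sending
    // rank and trigger the sending actor.
    int read() {
        assert(!receiverQueue.empty());
        int value = receiverQueue.front();   // dequeue data
        receiverQueue.pop_front();
        toSender.push([this] {
            ++senderFreeCapacity;            // update capacity
            senderTriggered = true;          // trigger actor
        });
        return value;
    }
};

// Deliver all pending calls in an inbox, as a progress call would.
inline void progress(Inbox& inbox) {
    while (!inbox.empty()) {
        inbox.front()();
        inbox.pop();
    }
}
```

Neither side blocks in this sketch: the data becomes visible to the receiver, and the freed slot to the sender, only once the respective inbox is drained.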

slide-11
SLIDE 11

UPC++ Actor Library – Actor Execution Strategies

Rank-based Execution Strategy

One thread per UPC++ rank, one rank per (logical) core

One event loop:

  • Query runtime for progress
  • Execute RPCs, mark affected actors
  • Execute act() on affected actors

May use sequential UPC++ code mode

Low number of actors per rank

[Figure: per-rank event loops, each cycling through Query Runtime, Perform RPCs, act()]
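
One iteration of the rank-based event loop can be sketched as follows. This is a simplified illustration that does not use UPC++: the `Runtime` struct stands in for the communication runtime, each pending call returns the actor it affects, and the names `Runtime`, `CountingActor` and `eventLoopIteration` are hypothetical. Only `act()` is taken from the slides.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <set>

// Hypothetical actor interface; only act() is named on the slides.
struct Actor {
    virtual ~Actor() = default;
    virtual void act() = 0;
};

// Example actor that counts received messages and act() invocations.
struct CountingActor : Actor {
    int received = 0;  // incremented by incoming calls
    int acted = 0;     // incremented by act()
    void act() override { ++acted; }
};

// Stand-in for the communication runtime of one rank (an assumption).
struct Runtime {
    // Each pending call does its work and returns the actor it affects.
    std::queue<std::function<Actor*()>> incoming;
};

// One loop iteration: query the runtime, execute the pending calls while
// marking affected actors, then run act() once per marked actor.
// Returns the number of actors that were executed.
inline std::size_t eventLoopIteration(Runtime& runtime) {
    std::set<Actor*> affected;
    while (!runtime.incoming.empty()) {
        Actor* target = runtime.incoming.front()();  // execute the call
        runtime.incoming.pop();
        if (target != nullptr) affected.insert(target);  // mark the actor
    }
    for (Actor* actor : affected) actor->act();
    return affected.size();
}
```

Collecting the affected actors in a set means an actor that received several messages in one iteration still runs `act()` only once per iteration.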

slide-12
SLIDE 12

UPC++ Actor Library – Actor Execution Strategies

Thread-based Execution Strategy

One thread per actor, and one communication thread, low number of ranks per node

Two event loops:

  • Communication thread queries runtime and executes RPCs
  • Actor threads query runtime for progress and execute LPCs, execute act()

Requires balancing of communication thread against number of actors

[Figure: actor threads cycling through Query Runtime, Perform LPCs, act(); communication thread (Comm) cycling through Query Runtime, Perform RPCs]

slide-13
SLIDE 13

UPC++ Actor Library – Actor Execution Strategies

Task-based Execution Strategy

Map act() executions onto OpenMP tasks

One event loop:

  • Master thread queries runtime
  • Performs any incoming RPCs and triggers affected actors
  • Schedules OpenMP task for each invocation of act(). Dependencies between act() invocations of same actor

Large number of actors per rank possible

[Figure: master thread cycling through Query Runtime, Perform RPCs, Schedule; worker threads executing act() tasks]

slide-14
SLIDE 14

Pond – A Shallow Water Proxy Application

Based on prior applications

  • SWE, a BSP-based code written using MPI and OpenMP
  • SWE-X10, an actor-based X10 application written using the actorX10 library

Parallelized using our actor library

Possible to auto-vectorize with AVX512 with Intel Compiler (v18.0)

slide-15
SLIDE 15

Pond – A Shallow Water Proxy Application

\[
\begin{pmatrix} h \\ hu \\ hv \end{pmatrix}_t
+ \begin{pmatrix} hu \\ hu^2 + \tfrac{1}{2} g h^2 \\ huv \end{pmatrix}_x
+ \begin{pmatrix} hv \\ huv \\ hv^2 + \tfrac{1}{2} g h^2 \end{pmatrix}_y
= S(t, x, y)
\]

Image: Bachelor-Lab Tsunami Simulation http://www5.in.tum.de/wiki/index.php/Tsunami_Simulation_-_Winter_15

slide-16
SLIDE 16

Pond – A Shallow Water Proxy Application

Finite volume scheme on a Cartesian grid with piecewise constant unknown quantities and Euler time step

Numerical approach based on LeVeque (R. J. LeVeque, D. L. George, and M. J. Berger. Tsunami modelling with adaptively refined finite volume methods. Acta Numerica, 20:211–289, 2011)

Image: Bachelor-Lab Tsunami Simulation http://www5.in.tum.de/wiki/index.php/Tsunami_Simulation_-_Winter_15
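
For orientation, a generic explicit Euler finite volume update on a Cartesian grid can be written as below. This is the textbook flux-difference form, given here as an assumption; the slides do not state the exact update formula, and the symbols $Q$, $F$, $G$, $\Delta t$, $\Delta x$ and $\Delta y$ are introduced only for illustration.

```latex
\[
Q_{i,j}^{n+1} = Q_{i,j}^{n}
  - \frac{\Delta t}{\Delta x}\left(F_{i+\frac{1}{2},j}^{n} - F_{i-\frac{1}{2},j}^{n}\right)
  - \frac{\Delta t}{\Delta y}\left(G_{i,j+\frac{1}{2}}^{n} - G_{i,j-\frac{1}{2}}^{n}\right)
  + \Delta t \, S_{i,j}^{n},
\qquad
Q = \begin{pmatrix} h \\ hu \\ hv \end{pmatrix}
\]
```

Here $Q_{i,j}^{n}$ is the piecewise constant cell value at time step $n$, and $F$ and $G$ denote numerical fluxes across the cell edges in the $x$ and $y$ directions.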

slide-17
SLIDE 17

Pond – A Shallow Water Proxy Application

Finite volume scheme on a Cartesian grid with piecewise constant unknown quantities and Euler time step

Image: Bachelor-Lab Tsunami Simulation http://www5.in.tum.de/wiki/index.php/Tsunami_Simulation_-_Winter_15

slide-18
SLIDE 18

Pond – A Shallow Water Proxy Application

Subdivision into rectangular, equally-sized patches with halo regions

slide-19
SLIDE 19

Pond – A Shallow Water Proxy Application

  • One actor per patch
  • Actors connected with direct neighbors

[Figure: 3 x 3 grid of patch actors connected to their direct neighbors]
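
The mapping of patches to actors and their neighbor connections can be illustrated with a small helper. This is a sketch under stated assumptions: a square grid split into square patches whose size divides the grid size, and the names `PatchIndex`, `patchesPerDim` and `neighbors` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Index of a patch in the 2D patch grid (hypothetical helper type).
struct PatchIndex {
    std::size_t x;
    std::size_t y;
};

// Number of patches per dimension for a square grid split into square
// patches. Assumes cellsPerDim is a multiple of patchSize.
inline std::size_t patchesPerDim(std::size_t cellsPerDim, std::size_t patchSize) {
    return cellsPerDim / patchSize;
}

// Direct (left, right, bottom, top) neighbors of a patch; patches on the
// domain boundary have fewer neighbors.
inline std::vector<PatchIndex> neighbors(PatchIndex p, std::size_t perDim) {
    std::vector<PatchIndex> result;
    if (p.x > 0)          result.push_back({p.x - 1, p.y});
    if (p.x + 1 < perDim) result.push_back({p.x + 1, p.y});
    if (p.y > 0)          result.push_back({p.x, p.y - 1});
    if (p.y + 1 < perDim) result.push_back({p.x, p.y + 1});
    return result;
}
```

For example, a grid of 16384 x 16384 cells split into patches of 512 x 512 cells yields 32 x 32 = 1024 patches, and therefore 1024 actors; with 64 x 64 cells per patch it yields 256 x 256 = 65536.
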
slide-20
SLIDE 20

Pond – Simulation Actor

States: Init, compute, Done

Ports: LeftIn, LeftOut, RightIn, RightOut, TopIn, TopOut, BottomIn, BottomOut

Transitions (guard / action):

  • mayWrite / sendData()
  • t_cur < t_end ∧ mayRead ∧ mayWrite / receiveData(); computeFluxes(); applyUpdates(); sendData()
  • t_cur ≥ t_end / stop()
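
The state machine on this slide can be sketched as a small C++ class. Several points are assumptions made for illustration: the mapping of the three transitions to the states (Init to compute, compute to itself, compute to Done), the time variable names `tCur` and `tEnd`, the fixed time step `dt`, and the counters that replace the real port and solver work. Only the state names and the action names in the comments come from the slide.

```cpp
// States named on the slide.
enum class State { Init, Compute, Done };

// Hypothetical stand-in for the simulation actor.
struct SimulationActor {
    State state = State::Init;
    double tCur = 0.0;  // current simulation time (name is an assumption)
    double tEnd;        // end of the simulated time span (name is an assumption)
    double dt;          // fixed time step, assumed here for simplicity
    int sends = 0;      // counts sendData() calls for illustration
    int updates = 0;    // counts completed time steps

    SimulationActor(double tEnd_, double dt_) : tEnd(tEnd_), dt(dt_) {}

    // Placeholder for sending halo data to the neighbors.
    void sendData() { ++sends; }

    // One FSM step; mayRead and mayWrite summarize the port guards.
    void act(bool mayRead, bool mayWrite) {
        switch (state) {
        case State::Init:
            if (mayWrite) {  // guard: mayWrite
                sendData();
                state = State::Compute;
            }
            break;
        case State::Compute:
            if (tCur >= tEnd) {
                state = State::Done;  // corresponds to stop()
            } else if (mayRead && mayWrite) {
                // receiveData(); computeFluxes(); applyUpdates(); sendData()
                tCur += dt;
                ++updates;
                sendData();
            }
            break;
        case State::Done:
            break;  // nothing left to do
        }
    }
};
```

If a guard is not satisfied, `act()` returns without changing anything, and the actor waits to be triggered again when one of its ports changes.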

slide-21
SLIDE 21

Evaluation

Performed on NERSC Cori

  • Single socket Intel Xeon Phi (Knights Landing) nodes
  • 68 cores (272 hyperthreads) per node
  • 16 GB MCDRAM
  • 6 TFlop/s (SP)
  • Intel Compiler 18, vectorization using AVX512

Comparison of

  • SWE-X10, prior X10 application, based on actorX10, an X10 actor library
  • SWE, prior MPI+OpenMP application, follows the BSP model
  • Pond using our actor library (using the three available execution strategies: Rank, Thread and Task)

All subjects follow the same numerical approach; the same Riemann solver is used in all cases

slide-22
SLIDE 22

Evaluation – Weak Scaling

  • Radial Dam Break Scenario
  • 4096² grid cells per node

slide-23
SLIDE 23

Evaluation – Weak Scaling

  • Radial Dam Break Scenario
  • 16384² grid cells per node
  • Patch sizes from 512x512 down to 64x64 grid cells

slide-24
SLIDE 24

Conclusion

  • Competitive with OpenMP and MPI
  • UPC++ enables overlap of communication and computation
  • Higher abstraction level for application programmer
  • Flexibility regarding backend

slide-25
SLIDE 25
Questions + Acknowledgements

  • This research was funded by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) - Project number 14671743 - TRR 89 Invasive Computing.
  • This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.
  • This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
  • Scott Baden was supported in part by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.

UPC++ Tutorial at LBL

  • December 16th
  • At NERSC or Online

More Info at:

https://www.exascaleproject.org/event/upcpp

https://upcxx.lbl.gov ▶ News