

SLIDE 1

PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications

Yanhua Sun, Jonathan Lifflander, Laxmikant V. Kalé. April 29, 2014

Yanhua Sun 1/24

SLIDE 2

Motivation

Complexity

Modern parallel computer systems are becoming extremely complex due to complicated network topologies, hierarchical storage systems, heterogeneous processing units, etc. Obtaining the best performance on them is challenging. Applications and the runtime system should be reconfigurable to adapt to varying situations. The goal of the control system is to adjust the configuration automatically, based on application-specific knowledge and runtime observations.


SLIDE 3

Outline

Overview of the PICS framework
Control points in the runtime system and applications
Automatic performance analysis to speed up tuning
APIs implemented in Charm++
Results of benchmarks and applications


SLIDE 4

Overview of PICS framework

PICS (adaptive runtime system): performance instrumentation, automatic performance analysis, runtime control points, runtime reconfiguration
Applications (mini-apps, real-world applications): application control points, application reconfiguration
Controller: connects the two, driven by performance data and expert-knowledge rules


SLIDE 5

Control Points

Control points are tunable parameters through which applications and the runtime system interact with the control system. First proposed in Dooley's research.

Each control point has:

1 Name; values: default, min, max
2 Movement unit: +1, ×2
3 Effects and directions

Effect categories: degree of parallelism, grainsize, priority, memory usage, GPU load, message size, number of messages, other effects


SLIDE 6

Application Control Points

1 Application-specific control points are provided by users
2 Applications should be able to reconfigure themselves to use new values

Control points | Effects | Use Cases
sub-block size | parallelism, grain size | Jacobi, Wave, stencil code
parallel threshold | parallelism, overhead, grain size | state space search
stages in pipeline | number of messages, message size | pipeline collectives
algorithm selection | degree of parallelism, grain size | 3D FFT decomposition (slab or pencil)
software cache size | memory usage, amount of communication | ChaNGa
ratio of GPU/CPU load | computation, load balance | NAMD, ChaNGa

SLIDE 7

Runtime System Control Points

1 Traditionally, configurations for the runtime system do not change
2 Configurations for the runtime system itself should also be tunable:
  1 Registered by the runtime itself
  2 Require no change to applications
  3 Affect all applications

SLIDE 8

Runtime System Control Points

Control points | Effects | Use Cases
broadcast algorithm selection | communication | most applications
broadcast/reduction branch factor | critical path | most applications (NAMD)
compression algorithm | communication, overhead | NAMD, ChaNGa
fault tolerance frequency | overhead, memory usage | most applications
load balancing frequency | overhead, load balance | most applications
tracing data disk write frequency | memory usage, overhead | most applications
number of AMPI virtual threads | grain size | AMPI applications


SLIDE 9

Observe Program Behaviors

Record all events:

Events: begin idle, end idle
Functions: name, begin execution, end execution
Communication: message creation, size, source/destination
Hardware counters

Linked in as a module, with no source-code modification; produces performance summary data.


SLIDE 10

Automatically Analyze the Performance

Many control points are registered. How can the search space be reduced?

Performance analysis identifies program problems in three categories:
Decomposition
Mapping
Scheduling


SLIDE 11

Decomposition Characteristics

Decomposition problem?

Symptoms: high cache miss rate, low bytes per message, too much communication on one object.
Diagnosed problems: (1) too-big entry method, (2) too-big single object, (3) too-long critical path, (4) too few objects per PE.
Remedies: decrease grain size, increase grain size, replicate the objects.


SLIDE 12

Mapping Characteristics

Mapping problem?

Symptoms: load imbalance, too much communication on one PE, communication time >> LogP model time, too much external communication.
Remedies: load balancer, remapping, topology-aware mapping.

SLIDE 13

Scheduling Characteristics

Scheduling problem? Symptom: critical tasks are delayed. Remedy: prioritize the tasks.


SLIDE 14

Other Characteristics

Other problems?

Symptoms: low bytes per message, long reduction/broadcast, long latency.
Remedies: aggregate messages, use collectives, compress messages.


SLIDE 15

Correlate Performance with Control Points

A decision tree maps the performance summary to remedies:

CPU utilization > 90% → sequential performance?
- Cache miss rate > 10% → decrease grain size
- Small entry methods, small bytes per message → increase grain size

Otherwise (overhead > 10% or idle > 10%):
- Decomposition problem? Longer entry method, larger single object, long critical path, few objects per PE, large communication on one object → decrease grain size, replicate objects
- Mapping problem? Load imbalance, large communication on one PE, communication time >> model time, large external communication → load balancer, remap, topology-aware mapping, compress messages
- Scheduling problem? Critical tasks are delayed → prioritize the tasks
- Others? Large bytes per message, long reduction/broadcast, long latency for big messages → increase or decrease the aggregation threshold, use collectives, compress messages

One box can have multiple children; one box can have multiple parents.


SLIDE 16

Correlate Performance with Control Points

Traverse the tree using the performance summary results:
performance results ⇒ solutions
solution ⇒ effect of control points

This determines which control points to tune and in which direction. How much? For grainsize, use the ratio MaxObjLoad / AvgLoad.

Feed the results into the control-points database.


SLIDE 17

Control System APIs

typedef struct ControlPoint_t {
    char name[30];
    enum TP_DATATYPE datatype;
    double defaultValue;
    double currentValue;
    double minValue;
    double maxValue;
    double bestValue;
    double moveUnit;
    int moveOP;
    int effect;
    int effectDirection;
    int strategy;
    int entryEP;
    int objectID;
} ControlPoint;


SLIDE 18

APIs for applications

void registerControlPoint(ControlPoint *tp);
void startStep();
void endStep();
void startPhase(int phaseId);
void endPhase();
double getTunedParameter(const char *name, bool *valid);


SLIDE 19

Experimental Results of Benchmarks and Applications

1 Control points
2 Performance problem
3 Machines: Blue Gene/Q, Cray XE6

SLIDE 20

Tuning Message Pipeline

Control point: number of pipeline messages

[Plot: timestep (ms/step) and number of pipeline messages vs. step, for less-work and more-work cases]

Figure: Tuning the number of pipeline messages


SLIDE 21

Message Compression

Control points: compression algorithm for each message type (runtime control points)

[Plot: timestep (ms/step) vs. step, with compression ratios r1 = 0.1 and r2 = 1.0]

Figure: Steering the compression algorithm for all-to-all benchmark


SLIDE 22

Jacobi3d Performance Steering

Control points: sub-block size in each dimension (three control points)
High cache miss rate and high idle time suggest decreasing the sub-block size; runtime overhead is also monitored

[Plot: total time, idle time, CPU time, and runtime overhead (ms/step) vs. step]

Figure: Jacobi3d performance steering on 64 cores for a 1024×1024×1024 problem


SLIDE 23

Communication Bottleneck in ChaNGa

Control points: number of mirrors
Metric: ratio of the maximum communication per object to the average

[Plot: time per step (s/step) vs. steps, tuning mirrors with PICS vs. no mirrors]

Figure: Time cost of calculating gravity for various mirror counts and no mirrors, on 16k cores of Blue Gene/Q


SLIDE 24

Conclusion

Automatic performance tuning is required to improve productivity and performance
Automatic performance analysis helps guide performance steering
Steering both the runtime system and applications is important
