SLIDE 1

PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications

Yanhua Sun, Jonathan Lifflander, Laxmikant V. Kalé

Parallel Programming Laboratory University of Illinois at Urbana-Champaign sun51@illinois.edu

June 10, 2014

Yanhua Sun Parallel Programming Laboratory, UIUC 1/25

SLIDE 3

Motivation

1. Modern parallel computer systems are becoming extremely complex due to network topologies, hierarchical storage systems, heterogeneous processing units, etc.

2. Obtaining the best performance is challenging.

3. Moreover, there are multiple configurations for the same application.

[Plot: time (us) of using different numbers of messages (1 to 64) to send 1 MB of data (f: 0.03125)]

SLIDE 5

Introspection and Adaptivity

General Observation

Configurations of tunable parameters in the runtime system and applications significantly affect the performance.

Top Ten Exascale Research Challenges in DOE Report

"Introspection and automatic adaptation is listed as a significant research topic to achieve the performance goals on exascale computers."

Statement

This work addresses the problem of how to improve both parallel programming productivity and performance by letting applications and the runtime expose tunable parameters and letting the control system figure out the optimal configurations of these parameters.

SLIDE 6

Related work

Autotuning frameworks: generate multiple implementations (FFTW)

Autopilot [Ribler et al. (1998)]: fuzzy logic rules, grid applications, resource management

MATE [Morajko (2006)]: fully automatic tuning, performance model

Active Harmony [Chung and Hollingsworth (2006)]: heuristic algorithms

SEEC, a general and extensible framework for self-aware computing [Henry Hoffmann (2010, 2011, 2013)]

SLIDE 7

Our Approach

HPC applications at large scale

Does not rely on performance models

Richer set of tunable parameters, thanks to the powerful intelligent runtime system

Not only application configurations are tuned, but also the runtime system itself

Automatic performance analysis accelerates steering

SLIDE 8

Outline

Overview of the PICS framework

Control points in the runtime system and applications

Automatic performance analysis to accelerate steering

APIs implemented in Charm++

Results of benchmarks and applications

SLIDE 9

Overview of PICS framework

[Diagram: the PICS controller sits between the adaptive runtime system and the applications (mini-apps and real-world applications). The runtime side provides performance instrumentation, automatic performance analysis, runtime control points, and runtime reconfiguration; the application side provides application control points and application reconfiguration. The controller is driven by performance data and expert knowledge rules.]

SLIDE 10

Control Points

Control points

Control points are tunable parameters through which applications and the runtime interact with the control system; they were first proposed in Dooley's research. Each control point has:

1. Name; values: default, min, max
2. Movement unit: +1, ×2
3. Effects, directions

Effects: degree of parallelism, grainsize, priority, memory usage, GPU load, message size, number of messages, other effects.
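The movement unit above can be sketched as follows. This is a hypothetical illustration, not the PICS API: the names TunablePoint and step, and the clamping to [min, max], are assumptions; only the idea of additive (+1) versus multiplicative (×2) moves comes from the slide.

```cpp
#include <algorithm>

// Hypothetical sketch of a control point's movement unit (not the PICS API).
struct TunablePoint {
    double value;
    double minValue, maxValue;
    double moveUnit;     // e.g. 1.0 for "+1", 2.0 for "x2"
    bool multiplicative; // true: move by *moveUnit, false: move by +moveUnit
};

// Move the point one unit in the given direction (+1 up, -1 down),
// clamped to [minValue, maxValue] (the clamping is an assumption).
double step(TunablePoint& p, int dir) {
    double next = p.multiplicative
        ? (dir > 0 ? p.value * p.moveUnit : p.value / p.moveUnit)
        : p.value + dir * p.moveUnit;
    p.value = std::clamp(next, p.minValue, p.maxValue);
    return p.value;
}
```

For example, a "number of messages" point with a ×2 movement unit would move 8 → 16 → 32 as the controller searches upward.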

SLIDE 11

Application and Runtime Control Points

Application

1. Application-specific control points are provided by users.
2. Applications should be able to reconfigure themselves to use new values.

Runtime

1. Traditionally, configurations for the runtime system do not change.
2. Configurations for the runtime system itself should be tunable:
   - Registered by the runtime itself
   - Require no change from applications
   - Affect all applications

SLIDE 12

Observe Program Behaviors

Record all events:

- Events: begin idle, end idle
- Functions: name, begin execution, end execution
- Communication: message creation, size, source/destination
- Hardware counters

PICS is linked in as a module, with no source-code modification, and produces performance summary data.

SLIDE 14

Automatically Analyze the Performance

Many control points are registered. How do we reduce the search space?

Performance analysis identifies program problems in three areas: decomposition, mapping, and scheduling.

SLIDE 15

Decomposition Characteristics

Decomposition problems and their remedies:

(1) Too-big entry method (high cache miss rate, low bytes per message) → decrease grain size
(2) Too-big single object → decrease grain size
(3) Too-long critical path → decrease grain size
(4) Too few objects per PE → increase grain size
Too much communication on one object → replicate the objects
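The decomposition rules above can be sketched as a small classifier. Everything here is illustrative: the struct, the metric names, and all thresholds are invented; PICS derives its actions from the recorded performance summary, not from these constants.

```cpp
#include <string>

// Hypothetical metrics a decomposition check might look at (names invented).
struct DecompMetrics {
    double cacheMissRate;   // fraction of cache accesses that miss
    double commOnOneObject; // share of total communication on one object
    double objectsPerPE;    // average objects per processor
};

// Map decomposition symptoms to a steering action (thresholds invented).
std::string decompositionAction(const DecompMetrics& m) {
    if (m.commOnOneObject > 0.5)
        return "replicate-objects";  // one object is a communication hot spot
    if (m.cacheMissRate > 0.10)
        return "decrease-grainsize"; // entry methods / objects are too big
    if (m.objectsPerPE < 4.0)
        return "increase-grainsize"; // too few objects per PE
    return "no-change";
}
```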

SLIDE 16

Mapping Characteristics

Mapping problems and their remedies:

- Load imbalance → load balancer
- Too much communication on one PE → remap
- Communication time >> LogP model time, too much external communication → topology-aware mapping

SLIDE 17

Scheduling Characteristics

Scheduling problem? Critical tasks are delayed → prioritize the tasks.

SLIDE 18

Other Characteristics

Other problems and their remedies:

- Low bytes per message → aggregate messages
- Reductions/broadcasts → collectives
- Long latency → compress messages
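These communication remedies can be sketched as a simple dispatcher. The function name, parameters, and thresholds below are invented for illustration; only the symptom-to-remedy pairings come from the slide.

```cpp
#include <string>

// Illustrative mapping from communication symptoms to a remedy
// (thresholds invented; not the actual PICS rule set).
std::string commRemedy(double bytesPerMsg, double latencyUs, bool isReduction) {
    if (isReduction)          return "use-collectives";    // reduction/broadcast
    if (bytesPerMsg < 1024.0) return "aggregate-messages"; // many small messages
    if (latencyUs > 500.0)    return "compress-message";   // big, slow messages
    return "no-change";
}
```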

SLIDE 19

Correlate Performance with Control Points

[Decision tree correlating the performance summary with control points:

- CPU utilization > 90% → examine sequential performance: cache miss rate > 10% → decrease grain size; small entry methods or small bytes per message → increase grain size.
- Overhead > 10% or idle > 10% → examine decomposition, mapping, scheduling, and other problems:
  - Decomposition: longer entry method, larger single object, or long critical path → decrease grain size; few objects per PE → increase grain size; large communication on one object → replicate objects.
  - Mapping: load imbalance → load balancer; large communication on one PE → remap; communication time >> model time or large external communication → topology-aware mapping.
  - Scheduling: critical tasks are delayed → prioritize the tasks.
  - Others: large bytes per message, long reduction/broadcast, or long latency for big messages → increase/decrease the aggregation threshold, collectives, or message compression.]

One box can have multiple children; one box can have multiple parents.

SLIDE 20

Correlate Performance with Control Points

Traverse the tree using the performance summary results:

- performance results ⇒ solutions
- solution ⇒ effect of control points

This determines which control points to tune and in which direction. How much? For grain size, use the ratio MaxObjLoad / AvgLoad.

Feed the results into the control-points database.
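The "how much" heuristic can be illustrated numerically. Only the ratio MaxObjLoad / AvgLoad comes from the slide; the update rule below (dividing the current grain size by that ratio) is an assumption for illustration.

```cpp
// The imbalance ratio MaxObjLoad / AvgLoad answers "how much": a ratio of 4
// means the largest object carries 4x the average load, so grain size
// should shrink by roughly that factor.
double grainsizeFactor(double maxObjLoad, double avgLoad) {
    return maxObjLoad / avgLoad;
}

// Assumed update rule (not from the slide): shrink the current grain size
// by the imbalance factor.
double suggestedGrainsize(double current, double maxObjLoad, double avgLoad) {
    return current / grainsizeFactor(maxObjLoad, avgLoad);
}
```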

SLIDE 21

Control System APIs

Implemented in Charm++, an over-decomposed, asynchronous, message-driven programming model (http://charm.cs.uiuc.edu/).

typedef struct ControlPoint_t {
    char name[30];
    enum TP_DATATYPE datatype;
    double defaultValue;
    double currentValue;
    double minValue;
    double maxValue;
    double bestValue;
    double moveUnit;
    int moveOP;
    int effect;
    int effectDirection;
    int strategy;
    int entryEP;
    int objectID;
} ControlPoint;

SLIDE 22

APIs for applications

void registerControlPoint(ControlPoint *tp);
void startStep();
void endStep();
double getTunedParameter(const char *name, bool *valid);
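A typical application loop using this API might look like the sketch below. Only the four signatures come from the slide; the ControlPoint fields used, the control-point name "pipeline_messages", and the stub bodies (a map standing in for the PICS database, so the sketch is self-contained) are assumptions; in a real run the Charm++ runtime provides these calls and updates the tuned value between steps.

```cpp
#include <cstring>
#include <map>
#include <string>

// Minimal stand-in for the slide's ControlPoint (subset of fields).
struct ControlPoint {
    char name[30];
    double defaultValue, currentValue, minValue, maxValue, moveUnit;
};

// Stub bodies standing in for the PICS runtime, for illustration only.
static std::map<std::string, double> g_points; // stand-in tuned-value database

void registerControlPoint(ControlPoint* tp) { g_points[tp->name] = tp->defaultValue; }
void startStep() {}
void endStep() {} // real PICS analyzes the step and updates tuned values here
double getTunedParameter(const char* name, bool* valid) {
    auto it = g_points.find(name);
    *valid = (it != g_points.end());
    return *valid ? it->second : 0.0;
}

// Typical usage: register once, then bracket each timestep with
// startStep()/endStep() and query the tuned value before doing the work.
double runOneTimestep() {
    ControlPoint cp{};
    std::strcpy(cp.name, "pipeline_messages"); // hypothetical control point
    cp.defaultValue = 4; cp.minValue = 1; cp.maxValue = 16; cp.moveUnit = 2;
    registerControlPoint(&cp);

    startStep();
    bool valid = false;
    double n = getTunedParameter("pipeline_messages", &valid);
    // ... send the timestep's data as n pipelined messages ...
    endStep();
    return valid ? n : cp.defaultValue;
}
```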

SLIDE 23

Experimental Results of Benchmarks and Applications

1. Control points
2. Performance problems
3. Blue Gene/Q and Cray XE6 machines

SLIDE 24

Tuning Message Pipeline

Control point: number of pipeline messages

[Plot: timestep (ms/step) vs. step while tuning the number of pipeline messages, for the less-work and more-work cases]

Figure: Tuning the number of pipeline messages

SLIDE 25

Communication Bottleneck in ChaNGa

Control points: number of mirrors

[Plot: s/step vs. steps, comparing mirrors tuned with PICS against no mirrors]

Figure: Time cost of calculating gravity for various mirrors and no mirror on 16k cores on Blue Gene/Q

SLIDE 26

Message Compression

Control points: compression algorithm for each message type (runtime control points)

[Plot: timestep (ms/step) vs. step, with r1 = 0.1, r2 = 1.0]

Figure: Steering the compression algorithm for all-to-all benchmark

SLIDE 27

Jacobi3d Performance Steering

Control points: sub-block size in each dimension (three control points). A high cache miss rate and high idle time suggest decreasing the sub-block size; runtime overhead is also considered.

[Plot: timestep (ms/step) vs. step, showing total time, idle time, CPU time, and runtime overhead]

Figure: Jacobi3d performance steering on 64 cores for problem of 1024*1024*1024

SLIDE 28

Conclusion

An introspective control system is required to improve productivity and performance.

Automatic performance analysis helps guide performance steering.

Steering both the runtime system and the applications is important.

We implemented the system based on the Charm++ programming model.

Acknowledgment

This work was supported in part by NIH Grant 9P41GM104601, Center for Macromolecular Modeling and Bioinformatics. It was also supported in part by DOE DE-AC02-06CH11357 Argo Project. This research used resources of the Argonne Leadership Computing Facility at Argonne National Laboratory.
