PICS - a Performance-analysis-based Introspective Control System to Steer Parallel Applications
Yanhua Sun, Jonathan Lifflander, Laxmikant V. Kalé
April 29, 2014
Yanhua Sun 1/24
1 Name, values: default, min, max
2 Movement unit: +1, ×2
3 Effects, directions
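The three attributes above can be pictured as a small record. The following is a minimal sketch in Python, not the PICS or Charm++ API: the class name `ControlPoint`, its field names, and the `move` method are hypothetical, and the effect directions in the usage note are an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ControlPoint:
    """Hypothetical record for one tunable knob (illustration only)."""
    name: str
    default: int
    min_value: int
    max_value: int
    unit: str = "+1"  # movement unit: "+1" (additive) or "x2" (multiplicative)
    # effect name -> direction (+1: grows with the value, -1: shrinks with it)
    effects: Dict[str, int] = field(default_factory=dict)
    value: Optional[int] = None

    def __post_init__(self) -> None:
        if self.unit not in ("+1", "x2"):
            raise ValueError("unit must be '+1' or 'x2'")
        if self.value is None:
            self.value = self.default

    def move(self, direction: int) -> int:
        """Move one unit up (direction > 0) or down, clamped to [min, max]."""
        if self.unit == "+1":
            candidate = self.value + (1 if direction > 0 else -1)
        else:
            candidate = self.value * 2 if direction > 0 else self.value // 2
        self.value = max(self.min_value, min(self.max_value, candidate))
        return self.value
```

For example, a sub-block size that moves in ×2 steps between 1 and 64 could be declared as `ControlPoint("sub_block_size", default=16, min_value=1, max_value=64, unit="x2", effects={"grain size": +1, "parallelism": -1})`; repeated calls to `move(+1)` then stop at the maximum instead of overshooting it.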
1 Application-specific control points provided by users
2 Applications should be able to reconfigure to use new values

Control points | Effects | Use cases
sub-block size | parallelism, grain size | Jacobi, Wave, stencil code
parallel threshold | parallelism, overhead, grain size | state space search
stages in pipeline | number of messages, message size | pipeline collectives
algorithm selection | degree of parallelism, grain size | 3D FFT decomposition (slab or pencil)
software cache size | memory usage, amount of communication | ChaNGa
ratio of GPU/CPU load | computation, load balance | NAMD, ChaNGa
1 Traditionally, configurations for the runtime system do not change
2 Configurations for the runtime system itself should be tunable

1 Registered by the runtime itself
2 Requires no change from applications
3 Affects all applications
Decomposition problem? (decision tree; node labels from the diagram)
Symptoms: high cache miss rate; (1) entry method too big; (2) single object too big; (3) critical path too long; (4) too few objects per PE; bytes per message low; too much communication
Remedies: decrease grain size; increase grain size; replicate the objects
Mapping problem? (decision tree; node labels from the diagram)
Symptoms: load imbalance; too much communication; communication time >> LogP model time; too much external communication
Remedies: load balancer; remap; topology-aware mapping
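The "communication time >> LogP model time" check can be sketched as a comparison between a measured time and a model estimate. In the LogP model, one short message takes L + 2o end to end (latency plus send and receive overhead). The function names and the factor of 10 used to stand in for ">>" are assumptions for illustration; the slides do not state how PICS quantifies the gap.

```python
def logp_message_time(latency: float, overhead: float) -> float:
    """End-to-end time of one short message under LogP: L + 2o."""
    return latency + 2.0 * overhead


def communication_is_excessive(measured: float, latency: float,
                               overhead: float, factor: float = 10.0) -> bool:
    """True if the measured time exceeds the model time by `factor`.

    `factor` is an assumed stand-in for ">>", not a value from the slides.
    """
    return measured > factor * logp_message_time(latency, overhead)
```

All times must be in the same unit. With L = 5 and o = 1 the model time is 7, so a measured time of 100 is flagged while a measured time of 20 is not.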
Performance analysis decision tree (node labels from the diagram)
Performance summary: CPU utilization > 90%; overhead > 10%; idle > 10%
Problem categories: Sequential performance? Decomposition problem? Mapping problem? Scheduling problem? Others?
Symptoms: cache miss > 10%; small entry methods; small bytes per message; longer entry method; larger single object; long critical path; few objects per PE; large communication; load imbalance; communication time >> model time; large external communication; critical tasks are delayed; large bytes per message; long reduction broadcast; long latency for big messages
Remedies: decrease grain size; increase grain size; load balancer; remap; compress message; prioritize the tasks; increase aggregation threshold; decrease aggregation threshold; collectives; replicate objects; topology-aware mapping
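The numeric thresholds in the performance summary (CPU utilization > 90%, overhead > 10%, idle > 10%, cache miss > 10%) can be sketched as a first-pass filter that only reports which conditions hold. The function name `summarize` and its parameters are hypothetical, and the sketch does not map findings to remedies, since the edges of the tree are not recoverable from the slide text.

```python
from typing import List


def summarize(cpu_utilization: float, overhead: float,
              idle: float, cache_miss_rate: float) -> List[str]:
    """Return the summary conditions that hold; all inputs are fractions in [0, 1]."""
    findings = []
    if cpu_utilization > 0.90:
        findings.append("CPU utilization > 90%")
    if overhead > 0.10:
        findings.append("overhead > 10%")
    if idle > 0.10:
        findings.append("idle > 10%")
    if cache_miss_rate > 0.10:
        findings.append("cache miss > 10%")
    return findings
```

A run with 50% utilization, 20% overhead, 30% idle time and a 15% cache miss rate would report the last three conditions, which a tuner could then use to decide which branch of the tree to examine.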
MaxObjLoad AvgLoad
1 Control points
2 Performance problem
3 Blue Gene/Q machine, Cray XE6 machine
[Figure: timestep (ms/step) and number of pipeline messages versus step. Series: timestep (less work), timestep (more work), pipeline (less work), pipeline (more work)]
[Figure: timestep (ms/step) versus step. Legend: r1=0.1, r2=1.0]
[Figure: timestep (ms/step) versus step. Series: total time, idle time, cpu time, runtime overhead]
[Figure: s/step versus steps. Series: tune mirrors with PICS, no mirrors]