SLIDE 19 Correlate Performance with Control Points
Performance summary
Top-level checks: CPU Utilization > 90%; Overhead > 10%; Idle > 10%
Diagnostic questions: Sequential performance?; Decomposition problem?; Mapping problem?; Scheduling problem?; Others?
Symptoms: Cache Miss > 10%; Small entry methods; Small Bytes per message; Longer entry method; Larger single; Long critical path; Few per PE; Large communication; Load imbalance; Communication time >> model time; Large external communication; Critical tasks are delayed; Large Bytes per message; Long reduction broadcast; Long latency for big msgs
Control points: Decrease grain size; Increase grain size; Load balancer; Remap; Compress message; Prioritize the tasks; Increase aggregation threshold; Decrease aggregation threshold; Collectives; Replicate objects; Topology aware mapping
One box can have multiple children. One egg can have multiple parents.
Yanhua Sun Parallel Programming Laboratory, UIUC 16/25
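The correlation on this slide can be sketched in code. The following is a minimal illustration only, not the author's implementation: the three top-level thresholds come from the slide, while the `PerfSummary` type, the function names, and the specific symptom-to-control-point edges in `SYMPTOM_TO_ACTIONS` are assumptions made for the example (the flattened slide text does not preserve which box connects to which). It shows how one symptom can lead to several control points and how one control point can be reached from several symptoms.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PerfSummary:
    """Hypothetical performance summary; all values are fractions in [0, 1]."""
    cpu_utilization: float
    overhead: float
    idle: float


def triggered_checks(s: PerfSummary) -> List[str]:
    """Return the top-level checks from the slide that fire for this summary."""
    checks = []
    if s.cpu_utilization > 0.90:
        checks.append("CPU Utilization > 90%")
    if s.overhead > 0.10:
        checks.append("Overhead > 10%")
    if s.idle > 0.10:
        checks.append("Idle > 10%")
    return checks


# ASSUMED edges, chosen for illustration only. The labels appear on the
# slide, but the exact connections between them are not recoverable from it.
SYMPTOM_TO_ACTIONS: Dict[str, List[str]] = {
    "Longer entry method": ["Decrease grain size"],
    "Long critical path": ["Decrease grain size"],
    "Small entry methods": ["Increase grain size"],
    "Load imbalance": ["Load balancer"],
    "Large communication": ["Remap", "Compress message"],
    "Critical tasks are delayed": ["Prioritize the tasks"],
}


def parents_of(action: str) -> List[str]:
    """List every symptom that points at the given control point."""
    return sorted(
        symptom
        for symptom, actions in SYMPTOM_TO_ACTIONS.items()
        if action in actions
    )


if __name__ == "__main__":
    summary = PerfSummary(cpu_utilization=0.50, overhead=0.20, idle=0.30)
    print(triggered_checks(summary))
    print(SYMPTOM_TO_ACTIONS["Large communication"])  # one symptom, several actions
    print(parents_of("Decrease grain size"))          # one action, several symptoms
```

Storing the edges as a symptom-to-actions mapping keeps the structure a general graph rather than a strict tree, which matches the note that a node may have several children and several parents.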