RPPM: Rapid Performance Prediction of Multithreaded Workloads on Multicore Processors
Sander De Pestel*, Sam Van den Steen*, Shoaib Akram, Lieven Eeckhout
*Intel, Belgium
Ghent University
ISPASS, March 25-26, 2019, Madison, Wisconsin
1
2
Michaud [PACT’99], Karkhanis [ISCA’04], Eyerman [TOCS’09]
Van den Steen [ISPASS’15]
[Interval analysis diagram: effective dispatch rate over time; miss events (I-cache miss, branch misprediction, long-latency load miss) delimit intervals 1, 2 and 3]
3
4
[RPPM overview: a uarch-independent profile captures per-thread characteristics and inter-thread interactions; combined with a multicore config, RPPM predicts performance]
current limitation: same number of threads in profiling vs. prediction
5
total cycle count:
- N = dynamic instruction count
- Deff = effective dispatch rate; a function of ILP, instruction mix and ALU contention
- branch mispredictions: uarch-indep branch predictor model [De Pestel, ISPASS’15]
- cache miss rates: predicted using StatStack [Eklov and Hagersten, ISPASS’10]
- memory-level parallelism: uarch-indep MLP model [Van den Steen, CAL’18]
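The slide's additive interval model can be sketched as follows; the structure (base dispatch time plus a penalty per miss event, with long-latency misses overlapped by the predicted MLP) follows the interval-analysis form cited above, but the numbers and the exact penalty terms are illustrative assumptions, not the paper's calibrated model.

```python
def predict_cycles(N, D_eff, miss_events):
    """Interval-model total cycle count (sketch of the additive form):
    base dispatch time N/Deff plus a penalty per miss event.
    miss_events is a list of (miss count, penalty in cycles, MLP)."""
    cycles = N / D_eff
    for misses, penalty, mlp in miss_events:
        # overlapping long-latency misses share their penalty,
        # so the per-miss cost is divided by the predicted MLP
        cycles += misses * penalty / mlp
    return cycles

# hypothetical numbers: 1M instructions at dispatch rate 2.5,
# 1000 branch mispredictions at 15 cycles (no overlap, MLP = 1),
# 5000 long-latency load misses at 200 cycles with MLP = 2
print(predict_cycles(1_000_000, 2.5, [(1000, 15, 1), (5000, 200, 2)]))
# 915000.0
```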
6
[Van den Steen et al., ISPASS’15]
[Diagram: main thread and worker threads synchronizing at a barrier]
7
8
9
10
[Plot: prediction error vs. number of threads; random error accumulates per synchronization epoch]
11
Per-thread characteristics / Synchronization / Shared memory accesses
- Predict per-thread performance per synchronization epoch
- Predict impact of synchronization
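Combining per-epoch predictions can be sketched as below, under the barrier semantics shown on the earlier slide: every thread waits for the slowest one at each synchronization point, so an epoch lasts as long as its slowest thread. The function names and numbers are illustrative, not the paper's.

```python
def predict_epoch_time(per_thread_cycles):
    # At a barrier all threads wait for the slowest one, so the
    # epoch's duration is the maximum predicted per-thread time.
    return max(per_thread_cycles)

def predict_total_time(epochs):
    # Total execution time = sum over synchronization epochs.
    return sum(predict_epoch_time(e) for e in epochs)

# two epochs, three threads each (cycle counts are made up)
print(predict_total_time([[100, 120, 90], [80, 85, 95]]))  # 215
```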
Van den Steen [ISPASS’15]
12
Pin-based; measured per synchronization epoch
pthread_mutex_lock pthread_mutex_unlock
gomp_team_barrier_wait (gomp_barrier_t)
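The per-epoch profiling above can be illustrated with a toy post-processing step: splitting a per-thread event stream into epochs at the intercepted synchronization calls. The actual tool is Pin-based and works online; this standalone sketch only shows the epoch-splitting idea, and the trace format is an assumption.

```python
# synchronization calls the profiler intercepts (from the slide)
SYNC_CALLS = {"pthread_mutex_lock", "pthread_mutex_unlock",
              "gomp_team_barrier_wait"}

def split_into_epochs(trace):
    """Split a per-thread event trace into synchronization epochs:
    each intercepted sync call closes the current epoch."""
    epochs, current = [], []
    for event in trace:
        if event in SYNC_CALLS:
            epochs.append(current)
            current = []
        else:
            current.append(event)
    epochs.append(current)
    return epochs

print(split_into_epochs(["i1", "i2", "pthread_mutex_lock", "i3"]))
# [['i1', 'i2'], ['i3']]
```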
13
14
15
Reuse distance = number of references between two accesses to the same address = 5; stack distance = number of unique references in that window = 3. Cache miss rate prediction for an LRU cache.
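Both distances can be computed from an access trace; the sketch below uses a hypothetical trace A B C B A that reproduces the slide's numbers (5 references, 3 unique).

```python
def reuse_and_stack_distance(trace, idx):
    """For the access at position idx, find the previous access to
    the same address and return (reuse distance, stack distance):
    reuse distance counts all references in the window (inclusive),
    stack distance counts only the unique addresses."""
    addr = trace[idx]
    for j in range(idx - 1, -1, -1):
        if trace[j] == addr:
            window = trace[j:idx + 1]
            return len(window), len(set(window))
    return float("inf"), float("inf")  # cold (first) access

trace = ["A", "B", "C", "B", "A"]
print(reuse_and_stack_distance(trace, 4))  # (5, 3)
```

An access hits in a fully-associative LRU cache of S lines exactly when its stack distance is at most S, which is why stack distance drives the miss rate prediction.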
16
per-thread reuse distance → global reuse distance
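A simple way to relate the two, assuming uniform round-robin interleaving of the threads' access streams (an assumption for illustration, not necessarily the paper's exact transformation):

```python
def to_global_reuse_distance(d, num_threads):
    # If T threads' private streams interleave uniformly, roughly
    # T-1 foreign references land between every two local ones,
    # scaling a per-thread reuse distance d to about T * d in the
    # merged (global) stream seen by a shared cache.
    return d * num_threads

print(to_global_reuse_distance(4, 8))  # 32
```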
17
18
19
Rodinia Parsec Summary
23
config           smallest  small  base  big  biggest
frequency (GHz)  5.0       3.33   2.5   2.0  1.66
width            2         3      4     5    6
24
a thread’s criticality: share in total execution time
a thread’s parallelism:
[Du Bois, OOPSLA’13]
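A hedged sketch of the criticality-stack idea: each time interval's duration is split evenly over the threads running in it, so criticalities sum to the total execution time (the exact formulation in [Du Bois, OOPSLA'13] may differ; the interval data here is made up).

```python
def criticality(intervals, thread_id):
    """Assumed criticality-stack form: each (duration, running set)
    interval credits duration / |running| to every running thread,
    so per-thread criticalities add up to total execution time."""
    total = 0.0
    for duration, running in intervals:
        if thread_id in running:
            total += duration / len(running)
    return total

# two threads run 10 time units together, then thread 0 runs 5 alone
intervals = [(10, {0, 1}), (5, {0})]
print(criticality(intervals, 0))  # 10.0
print(criticality(intervals, 1))  # 5.0
```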
25
26
27
– Accumulating random errors
– Inter-thread synchronization, communication, interference
– Design space exploration
– Workload characterization
– Predict Y-thread performance from X-thread profile (Y>X)
– Predict Y-thread performance on X-core system (Y>X)
28
29