Bungee Jumps: Accelerating Indirect Branches Through Hardware/Software Co-Design
Daniel S. McFarlin, Craig Zilles
Indirect Branches Are Increasingly Predictable
[Figure: Mispredicts/Kilo Instrs (axis 5 to 20) for benchmarks 1 to 28 and their average (Avg); series: Nehalem, Sandy Bridge, Haswell, TAGE; a "Better" arrow marks the preferred direction.]
And Unbiased
[Figure: predictability and bias (axis 0.25 to 1) for meteor, raytrace, btree, fannkuch, fasta, richards, nqueens, revcomp, float, specnorm, regexdna, knuke, mandelbrot, and Geomean.]
In-Order Machines Specialize Based on Branch Bias or Eliminate Branch Prediction Altogether
Shape VTable: area(), area(), area() (60 / 30 / 10)
RCPO:
    if s->type == Circle          area()
    else if s->type == Rect       area()
    else if s->type == Square     area()
    else                          area()
Predication:
    p0 = (obj is type B);
    p1 = (obj is type C)
    p2 = (obj is type D)
    p0: r = B::func( );
    p1: r = C::func( );
    p2: r = D::func( );
    if( !(p0 | p1 | p2)) r = obj->func( );
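The two in-order strategies on this slide can be sketched in a few lines of Python. This is only an illustration: the concrete classes, the `Triangle` fallback type, and the function names `area_specialized` and `area_predicated` are assumptions made for the example, and the test order simply mirrors the 60 / 30 / 10 weights shown for the vtable.

```python
import math

class Shape:
    def area(self):
        raise NotImplementedError

class Circle(Shape):
    def __init__(self, r): self.r = r
    def area(self): return math.pi * self.r ** 2

class Rect(Shape):
    def __init__(self, w, h): self.w, self.h = w, h
    def area(self): return self.w * self.h

class Square(Shape):
    def __init__(self, s): self.s = s
    def area(self): return self.s ** 2

class Triangle(Shape):  # hypothetical rare receiver, not covered by the chain
    def __init__(self, b, h): self.b, self.h = b, h
    def area(self): return 0.5 * self.b * self.h

def area_specialized(s):
    """Type-test chain ordered by assumed bias; direct calls replace the indirect one."""
    t = type(s)
    if t is Circle:
        return Circle.area(s)
    elif t is Rect:
        return Rect.area(s)
    elif t is Square:
        return Square.area(s)
    else:
        return s.area()  # fallback: ordinary dynamic dispatch

def area_predicated(s):
    """Predicated form: compute every predicate first, then guard each call."""
    t = type(s)
    p0, p1, p2 = t is Circle, t is Rect, t is Square
    r = None
    if p0: r = Circle.area(s)
    if p1: r = Rect.area(s)
    if p2: r = Square.area(s)
    if not (p0 or p1 or p2): r = s.area()
    return r
```

Both forms trade the single indirect call for a sequence of type tests, so their cost depends on how biased the receiver types are.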
sjeng: f_in_check
A:  ld r9, [rax*8+0x94]
    jmp r9
B:  ld r8, [rip+0x5b]
    ld edi, [rsi*4+0x92]
    movsxd rcx, r8
    ld r10, [rcx*4+0x92]
    cmp edi, r10
    jnz H
C:  ld edx, [rsi*4+0x7d]
    cmp edx, 0x6
    jz I
D:  ld ecx, [rip+0x58]
    ld esi, [rip+0x81]
    lea edi, [rsi+rcx*1]
    cmp edi, edx
    jz J
E:  ld edx, [rcx*8+0x10]
    test edx, edx
    jz K
F:  ld r8, [rip+0x39]
    ld edi, [rsi*4+0x46]
    movsxd rcx, r8
    ld r10, [rcx*4+0x46]
    cmp edi, r10
    jnz M
G:  ld r8, [rip+0x5e]
    ld edi, [rsi*4+0x42]
    movsxd rcx, r8
    ld r10, [rcx*4+0x42]
    cmp edi, r10
    jnz L
[Figure: control-flow graph of the blocks above; edge annotations: 25 24 24 14 and 7 11 99 1 19 5 95 96 4 99 1 1 99.]
Missed Optimization Opportunity: Next Branch Bias
[Figure: Branch Bias (axis 80 to 100), three panels.
 Panel 1: gcc95, li, m88ksim, perl95, eon, gap, gcc2k, perl2k, gcc2k6, h264ref, perl2k6, povray, sjeng, xalanc, richards, Geomean.
 Panel 2: bench, btree, fannkuch, fasta, fastaredux, knuke, mandelbrot, nbody, regexdna, revcomp, run, specnorm, Geomean.
 Panel 3: btree, fannkuch, fasta, float, knuke, mandelbrot, meteor, nbody, queens, raytrace, regexdna, revcomp, richards, specnorm, Geomean.]
Exploiting Predictability: Benefit from Next Branch Bias
The indirect jump in A becomes pred-indirect-jump, and each target block carries an assert on r9 (figure arrows are labeled "Hoist From"):
A:  ld r9, [rax*8+0x94]
    pred-indirect-jump
B:  ld r8, [rip+0x5b]
    ld edi, [rsi*4+0x92]
    movsxd rcx, r8
    ld r10, [rcx*4+0x92]
    cmp edi, r10
    assert r9, B
    jnz H
C:  ld edx, [rsi*4+0x7d]
    assert r9, C
    cmp edx, 0x6
    jz I
D:  ld ecx, [rip+0x58]
    ld esi, [rip+0x81]
    lea edi, [rsi+rcx*1]
    assert r9, D
    cmp edi, edx
    jz J
E:  ld edx, [rcx*8+0x10]
    assert r9, E
    test edx, edx
    jz K
F:  ld r8, [rip+0x39]
    ld edi, [rsi*4+0x46]
    movsxd rcx, r8
    assert r9, F
    ld r10, [rcx*4+0x46]
    cmp edi, r10
    jnz M
G:  ld r8, [rip+0x5e]
    ld edi, [rsi*4+0x42]
    movsxd rcx, r8
    assert r9, G
    ld r10, [rcx*4+0x42]
    cmp edi, r10
    jnz L
[Figure: control-flow graph of the blocks above; edge annotations: 25 24 24 14 and 11 99 1 19 5 95 96 4 99 1 1 99.]
Original indirect branch:
    ...
    0: r0 = load[a]
    1: r2 = r0 + r1
    2: jmp [r2]            (Prediction Point)
    A B C D
    Valid Targets: {A, G}
    Predict A
Decomposed branch:
    ...
    0: r0 = load[a]
    predict
    ...
    A B C D
    1: r2 = r0 + r1
    2*: r3 = load[r2]
    2: resolve r3, A
Decomposed branch with a bad prediction:
    ...
    0: r0 = load[a]
    predict
    ...
    M N O P
    Invalid Target: M
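A toy Python model may help in reading the predict/resolve split. It is a sketch under assumptions: `JUMP_TABLE`, `original_branch`, and `decomposed_branch` are names invented for the example, and the model ignores timing entirely. It only shows that the resolve step compares the loaded target (r3) against the predicted one and resteers on a mismatch.

```python
# Hypothetical jump table: slot -> target label (stands in for load[r2])
JUMP_TABLE = {0: "A", 1: "G"}

def original_branch(slot):
    # 2: jmp [r2]: the target is known only after the load completes
    return JUMP_TABLE[slot]

def decomposed_branch(slot, predicted="A"):
    """Return (final_target, trace) for a predict ... resolve pair."""
    trace = [f"predict -> {predicted}"]           # front end steers early
    trace.append(f"run {predicted} speculatively")
    r3 = JUMP_TABLE[slot]                         # 2*: r3 = load[r2]
    ok = (r3 == predicted)                        # 2: resolve r3, A
    trace.append("resolve ok" if ok else f"resolve fails, resteer to {r3}")
    return (predicted if ok else r3), trace
```

In this model the decomposed form always ends at the same target as the original jump; the difference is that work on the predicted path can start before the load and add have finished.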
Decomposed branch with marker:
    ...
    0: r0 = load[a]
    predict 0x0
    ...
    marker 0x0
    A B C D
    1: r2 = r0 + r1
    2*: r3 = load[r2]
    2: resolve r3, A
Prediction Point:
    ...
    0: r0 = load[a]
    predict 0x0
    ...
    r2 = r0 + r1
    jmp [r2]
Invalid Target: M
    M N O P
    Fails 0x0 Check
No Predict: Stall Fetch
    r2 = r0 + r1
    jmp [r2]
    Redirect Fetch
Necessitates some changes to indirect call/return handling
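One way to read the marker check is as a guard in the front end: speculate only if the predicted target carries the marker named by the predict instruction, and otherwise stall until the ordinary jmp [r2] supplies the target. The sketch below assumes exactly that reading; the `MARKERS` layout table and the `front_end_predict` function are hypothetical names, not part of the talk.

```python
# Hypothetical code layout: label -> marker id found at that address (None if absent)
MARKERS = {"A": 0x0, "M": None}

def front_end_predict(predict_id, predicted_target, markers=MARKERS):
    """Return the fetch action for a `predict <id>` instruction."""
    if predicted_target is not None and markers.get(predicted_target) == predict_id:
        return ("fetch", predicted_target)   # marker matches: follow the prediction
    # Missing or mismatched marker: do not speculate, wait for jmp [r2]
    return ("stall", None)
```

With this guard, a stale or aliased prediction such as M never starts executing code that lacks a matching resolve.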
    A B C D
    1: r2 = r0 + r1
    2*: r3 = load[r2]
    2: resolve r3, A, RC
(1) resolve fails
(2) Resteer RC addr to Front End
(3) Resteer r3 to Front End
(4) Fetch Recovery Code
    RC: undo A
        undo B
        undo C
        undo D
        jmp r3
(5) r3 value from Resteer
(6) Fetch Correct Path starting at r3
    r3: E ... F
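The recovery sequence can be mimicked with an undo log. This is a sketch under assumptions: the talk does not say how each "undo" is implemented or in which order the blocks are undone, so the reverse-order replay, the `state` dictionary, and the name `run_with_recovery` are all choices made for the example.

```python
_MISSING = object()  # sentinel for keys that did not exist before speculation

def run_with_recovery(state, speculative_ops, r3, predicted="A"):
    """Apply speculative updates, then keep or undo them once r3 is known.

    speculative_ops: list of (block, key, value) applied on the predicted path.
    Returns (final_target, list_of_undo_steps).
    """
    undo_log = []
    for block, key, value in speculative_ops:
        undo_log.append((block, key, state.get(key, _MISSING)))  # save old value
        state[key] = value
    if r3 == predicted:                  # resolve succeeds: keep the updates
        return predicted, []
    undone = []
    for block, key, old in reversed(undo_log):   # (4) recovery code
        if old is _MISSING:
            del state[key]
        else:
            state[key] = old
        undone.append(f"undo {block}")
    return r3, undone                    # (5)/(6) continue at the real target
```

For example, with `state = {"x": 1}` and speculative writes from blocks A and B, a failed resolve restores `state` to `{"x": 1}` and returns the real target.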
– Dovetails with existing structures for outstanding branches
[Figure: branch prediction unit. Existing BPU (Old): History and PC feed a HASH that produces Indices into the Predictor Table (Read Port, Write Port). New: a DBB with a Tail Pointer, holding Indices and Prediction State. On a predict instruction: Insert New DBB Entry. Outputs: To Fetch Unit, Fetch Buffer, To Back-End.]
[Figure: branch prediction unit. Existing BPU: History, PC, HASH, Indices, Predictor Table (Read Port, Write Port). DBB with Tail Pointer, Indices, and Prediction State. The DBB Tail Pointer supplies an Index: Insert DBB Index Into Branch Resolution Instruction. Outputs: To Fetch Unit, Fetch Buffer, To Back-End.]
[Figure: branch prediction unit on resolution. History, PC, HASH, Indices, Predictor Table (Read Port, Write Port), DBB, Resteer Logic. From Back-End: DBB Index from Branch Resolution Instruction. From the DBB: Indices and New Prediction State. Output: To Fetch Unit.]
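The DBB diagrams suggest a small circular buffer that is written when a predict instruction is fetched and read again when the matching resolution instruction arrives from the back end. The sketch below is one possible reading, not the hardware described in the talk: the entry fields, the dictionary standing in for the Predictor Table, and the method names `insert` and `resolve` are assumptions.

```python
class DBB:
    """Toy circular buffer of outstanding decomposed-branch predictions."""

    def __init__(self, size=8):
        self.entries = [None] * size
        self.tail = 0                      # DBB Tail Pointer

    def insert(self, indices, predicted_target):
        """On a predict instruction: allocate an entry and return its DBB index.

        The returned index is what would be attached to the resolution instruction.
        """
        idx = self.tail
        self.entries[idx] = {"indices": indices, "target": predicted_target}
        self.tail = (self.tail + 1) % len(self.entries)
        return idx

    def resolve(self, idx, actual_target, predictor_table):
        """On resolution: train the predictor; return a resteer target or None."""
        entry = self.entries[idx]
        self.entries[idx] = None
        for i in entry["indices"]:
            predictor_table[i] = actual_target   # update through the write port
        return None if entry["target"] == actual_target else actual_target
```

Carrying the DBB index in the resolution instruction lets the back end name the entry directly, so no search by PC is needed when the branch resolves.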
LLVM 3.5
instruction stream indir branch
and (pred - bias) > 3%
without prediction guide on REF
for baseline and experimental

Key Structures     Configuration Parameters
Bpred              PTLSim default: GShare, 24 KB 3-table direction predictor, 4K-entry BTB, 64-entry RAS
Front-End          5 stages, Experimentally Varied 2/4/8 wide Fetch/Decode/Dispatch, 32-entry FetchBuffer
Execution Ports    Experimentally Varied 2/4/8
Functional Units   Up to 2 x LD/ST, 2 x INT/SIMD-Permute, 4 x 64-bit SIMD/FP, 1-cycle bypass
L1 Caches          8-way 32 KB L1-D$, 4-way 32 KB L1-I$, 64B lines, 4-cycle latency
L2 Cache           16-way 256 KB Unified, 12-cycle latency
L3 Cache           32-way 4 MB LLC, 25-cycle latency
Miss Handling      64-entry Miss Buffer, 64-entry Load Fill Request Queue
Main Memory        140-cycle latency

which are indirect branches (PDS) and amenable to transformation (weighted averaged bias: WAB)
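The "(pred - bias) > 3%" condition reads like a profile filter: transform an indirect branch only when its predictability exceeds its bias by more than three percentage points. A minimal sketch of such a filter follows; the profile format and the name `select_branches` are assumptions for illustration, and any further conditions on the slide are not modeled.

```python
def select_branches(profile, margin=0.03):
    """Pick branches whose predictability exceeds their bias by more than `margin`.

    profile: {branch_id: (predictability, bias)}, both as fractions in [0, 1].
    """
    return [b for b, (pred, bias) in profile.items() if pred - bias > margin]
```

For instance, a branch that is 99% predictable but only 60% biased passes the filter, while one that is 95% predictable and 94% biased does not.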
[Figure: axis 0% to 30%; benchmarks gcc95, li, m88ksim, perl95, eon, gap, gcc2k, perl2k, gcc2k6, h264ref, perl2k6, povray, sjeng, xalanc, richards, Geomean; series: 2-wide, 4-wide.]
[Figure: axis 0% to 50%; benchmarks bench, btree, fannkuch, fasta, fastaredux, knuke, mandelbrot, nbody, regexdna, revcomp, run, specnorm, Geomean; series: 2-wide, 4-wide.]
[Figure: Speedup (axis 0% to 20%); benchmarks btree, fannkuch, fasta, float, knuke, mandelbrot, meteor, nbody, queens, raytrace, regexdna, revcomp, richards, specnorm, Geomean; series: 2-wide, 4-wide.]
[Figure: Extra Instrs Issued (axis 0% to 4%), three panels; a "Better" arrow marks the preferred direction.
 SPEC: gcc95, li, m88ksim, perl95, eon, gap, gcc2k, perl2k, gcc2k6, h264ref, perl2k6, povray, sjeng, xalanc, richards, Geomean.
 Python: btree, fannkuch, fasta, float, knuke, mandelbrot, meteor, nbody, queens, raytrace, regexdna, revcomp, richards, specnorm, Geomean.
 PHP: bench, btree, fannkuch, fasta, fastaredux, knuke, mandelbrot, nbody, regexdna, revcomp, run, specnorm, Geomean.]
– Leverages DBT profiling and speculation facilities