University of Michigan, Electrical Engineering and Computer Science
Uncovering Hidden Loop Level Parallelism in Sequential Applications
Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, Scott Mahlke
Advanced Computer Architecture Lab, University of Michigan
Intel Core 2 Duo, AMD Quad-core (Barcelona), Sun Niagara 2
DOALL loop: i = 0-39
DOALL loop: i = 0-39
Core 0: i = 0-19
Core 1: i = 20-39
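
The split above can be sketched in C. This is a minimal illustration, not the authors' implementation: the names `chunk_t`, `doall_chunk` and `doall_double` are hypothetical, the loop body `a[i] = 2*i` is an assumed stand-in, and the cores are visited one after another so the sketch stays deterministic. In a real DOALL each chunk would run on its own core, which is safe because no iteration reads a value written by another.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open iteration range [start, end) assigned to one core. */
typedef struct { size_t start, end; } chunk_t;

/* Split n iterations into num_cores contiguous chunks. The first
   (n % num_cores) chunks receive one extra iteration. */
chunk_t doall_chunk(size_t n, size_t num_cores, size_t core) {
    size_t base = n / num_cores, extra = n % num_cores;
    chunk_t c;
    c.start = core * base + (core < extra ? core : extra);
    c.end = c.start + base + (core < extra ? 1 : 0);
    return c;
}

/* Assumed loop body: a[i] = 2*i. Chunks are executed in turn here;
   since iterations are independent, any order gives the same array. */
void doall_double(int *a, size_t n, size_t num_cores) {
    for (size_t core = 0; core < num_cores; core++) {
        chunk_t c = doall_chunk(n, num_cores, core);
        for (size_t i = c.start; i < c.end; i++)
            a[i] = 2 * (int)i;
    }
}
```

Calling `doall_chunk(40, 2, 0)` and `doall_chunk(40, 2, 1)` reproduces the two ranges on the slide, i = 0-19 and i = 20-39.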
Speculative DOALL loop: i = 0-39
Speculative DOALL loop: i = 0-39, divided into loop chunks
Core 0: i = 0-9, i = 20-29
Core 1: i = 10-19, i = 30-39
Initialization
Spawn
IS = ...; IE = ...;
XBEGIN
if (global_brk_flag) break;
for (i = IS; i < IE; i++) {
  ......
  if (brk_cond) { local_brk_flag = 1; break; }
}
perm = RECV(THREADj-1)
XCOMMIT
if (local_brk_flag) { global_brk_flag = 1; kill_other_threads; }
else if (IE < n) SEND(perm, THREADj+1)
Consolidation
Abort Handler
[Diagram: Core 0, Core 1, Core 2 and Core 3, with two L2 caches]
for (i = 0; i < n; i++)
  // original loop code

while (...) {
  IS += ...; IE += ...;
  XBEGIN
  for (i = IS; i < IE; i++)
    // original loop code
  XCOMMIT
}

while (...) {
  IS += ...; IE += ...;
  XBEGIN
  for (i = IS; i < IE; i++)
    // original loop code
  RECV(THREADj-1)
  XCOMMIT
  SEND(THREADj+1)
}

Spawn
while (...) {
  IS += ...; IE += ...;
  XBEGIN
  for (i = IS; i < IE; i++)
    // original loop code
  RECV(THREADj-1)
  XCOMMIT
  SEND(THREADj+1)
}

Spawn
while (...) {
  IS += ...; IE += ...;
  XBEGIN
  for (i = IS; i < IE; i++) {
    // original loop code
    if (brkCond) break;
  }
  RECV(THREADj-1)
  XCOMMIT
  SEND(THREADj+1)
}

Spawn
while (...) {
  IS += ...; IE += ...;
  XBEGIN
  if (globalBrk) break;
  for (i = IS; i < IE; i++) {
    // original loop code
    if (brkCond) { localBrk = 1; break; }
  }
  RECV(THREADj-1)
  XCOMMIT
  if (localBrk) { globalBrk = 1; abortOtherTXs; }
  SEND(THREADj+1)
}

Spawn
while (...) {
  IS += ...; IE += ...;
  XBEGIN
  if (globalBrk) break;
  for (i = IS; i < IE; i++) {
    // original loop code
    if (brkCond) { localBrk = 1; break; }
  }
  RECV(THREADj-1)
  XCOMMIT
  if (localBrk) { globalBrk = 1; abortOtherTXs; }
  SEND(THREADj+1)
}
Consolidation
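
The commit protocol above can be modeled in plain C without transactional hardware. The sketch below is an assumption-laden model rather than the scheme itself: `run_sequential`, `run_chunked`, the chunk size `CS` and the loop body (`out[i] = 2*in[i]`, break on a negative input) are all hypothetical. Each chunk writes into a private buffer, which stands in for the transaction's write set, and chunks commit in order, which stands in for RECV/XCOMMIT/SEND. Unlike the template, which tests `globalBrk` right after XBEGIN, this model lets later chunks run and squashes them at commit time.

```c
#include <assert.h>
#include <string.h>

#define N 40   /* trip count, matching the i = 0-39 example */
#define CS 10  /* chunk size (assumed) */

/* Original loop: doubles inputs until a negative value triggers break.
   Returns the value of i at loop exit. */
int run_sequential(const int *in, int *out) {
    int i;
    for (i = 0; i < N; i++) {
        if (in[i] < 0) break;
        out[i] = 2 * in[i];
    }
    return i;
}

/* Chunked model of the speculative version. */
int run_chunked(const int *in, int *out) {
    int global_brk = 0, exit_i = N;
    for (int is = 0; is < N; is += CS) {
        int ie = (is + CS < N) ? is + CS : N;
        int buf[CS];            /* private write set of this chunk */
        int local_brk = 0, i;
        /* XBEGIN: speculative execution of the chunk */
        for (i = is; i < ie; i++) {
            if (in[i] < 0) { local_brk = 1; break; }
            buf[i - is] = 2 * in[i];
        }
        /* In-order commit point (RECV ... XCOMMIT ... SEND) */
        if (global_brk) continue;   /* an earlier chunk broke: squash */
        memcpy(out + is, buf, (size_t)(i - is) * sizeof(int));
        if (local_brk) { global_brk = 1; exit_i = i; }
    }
    return exit_i;
}
```

With a negative value at index 25, the chunk covering i = 20-29 commits only its first five results and the chunk covering i = 30-39 is discarded, so both functions leave `out` in the same state.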
[Figure: bar chart of the fraction of sequential execution per benchmark (SPEC FP, SPEC INT, Mediabench, Utilities, and average); series: Provable DOALL]
[Figure: bar chart of the fraction of sequential execution per benchmark (SPEC FP, SPEC INT, Mediabench, Utilities, and average); series: Profiled DOALL, Provable DOALL]
Still not good enough! A few dependences hinder parallelization in many loops.
1: while (node) {
2:   work(node);
3:   node = node->next;
   }

1: while (node) {
4:   node_array[count++] = node;
3:   node = node->next;
   }

    XBEGIN
5:  node = node_array[IS]; i = 0;
1': while (node && i++ < CS) {
2:    work(node);
3':   node = node->next;
    }
    RECV(THREADj-1)
    XCOMMIT
    SEND(THREADj+1)

1:  while (node) {
4:    node_array[count++] = node;
3:    node = node->next;
    }

    XBEGIN
5:  node = node_array[IS]; i = 0;
1': while (node && i++ < CS) {
2:    work(node);
3':   node = node->next;
    }
    RECV(THREADj-1)
    XCOMMIT
    if (node != node_array[IS+CS]) {
      update_node_array;
      kill_other_threads();
    }
    SEND(THREADj+1)
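
A minimal C sketch of the two loops above, under stated assumptions: the `node_t` layout, `MAX_NODES`, and the body of `work` are hypothetical, chunks are run one after another rather than on separate cores, and a failed check is only reported instead of triggering `update_node_array` and `kill_other_threads`. It shows how the recorded pointers let each chunk start independently, and how the check at the end of a chunk validates the recorded successor.

```c
#include <assert.h>
#include <stddef.h>

typedef struct node { int value; int visited; struct node *next; } node_t;

#define MAX_NODES 64  /* capacity of node_array (assumed) */

/* Stand-in for the real per-node work (assumed). */
void work(node_t *n) { n->value *= 2; n->visited++; }

/* First loop after fission (statements 1, 4, 3): record every node pointer. */
size_t collect_nodes(node_t *node, node_t **node_array) {
    size_t count = 0;
    while (node && count < MAX_NODES) {
        node_array[count++] = node;
        node = node->next;
    }
    return count;
}

/* One chunk (statements 5, 1', 2, 3'). Returns 1 when the node reached
   after the chunk matches the recorded start of the next chunk, or NULL
   when this is the last chunk. */
int run_chunk(node_t **node_array, size_t count, size_t is, size_t cs) {
    node_t *node = node_array[is];
    size_t i = 0;
    while (node && i++ < cs) {
        work(node);
        node = node->next;
    }
    node_t *expected = (is + cs < count) ? node_array[is + cs] : NULL;
    return node == expected;
}

/* Runs all chunks in order; returns 1 if every end-of-chunk check passed. */
int process_list(node_t *head, size_t cs) {
    node_t *node_array[MAX_NODES];
    size_t count = collect_nodes(head, node_array);
    int ok = 1;
    for (size_t is = 0; is < count; is += cs)
        ok &= run_chunk(node_array, count, is, cs);
    return ok;
}
```

If `work` never modifies the list structure, every check passes and each node is processed exactly once, as in the original loop.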
[Diagram: loop drawn as blocks A, B and C with branch probabilities 99% and 1%; the transformed loop adds a copy C' of block C and a break]
Sample loop from yacc benchmark

for (j = 0; j <= nstate; ++j) {
  if (tystate[j] == 0) continue;
  if (tystate[j] == best) continue;
  count = 0;
  cbest = tystate[j];
  for (k = j; k <= nstate; ++k)
    if (tystate[k] == cbest) ++count;
  if (count > times) {
    best = cbest;
    times = count;
  }
}

if (count > times) { best = cbest; times = count; }

j = 0;
while (j <= nstate) {
  for ( ; j <= nstate; ++j) {
    if (tystate[j] == 0) continue;
    if (tystate[j] == best) continue;
    count = 0;
    cbest = tystate[j];
    for (k = j; k <= nstate; ++k)
      if (tystate[k] == cbest) ++count;
    if (count > times) break;
  }
  if (count > times) {
    best = cbest;
    times = count;
    j++;
  }
}

Sample loop from yacc benchmark
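
To check that the restructured loop computes the same result as the original, both can be wrapped as C functions and compared. This is a sketch under assumptions: the function names, the `result_t` type, and the initial values `best = -1`, `times = 0`, `count = 0` are not given on the slide. The equivalence relies on `count` never exceeding `times` before the loop starts.

```c
#include <assert.h>

typedef struct { int best; int times; } result_t;

/* Original loop: updates best/times inside the loop body. */
result_t scan_original(const int *tystate, int nstate) {
    int best = -1, times = 0, count = 0, cbest = 0;  /* assumed initial values */
    for (int j = 0; j <= nstate; ++j) {
        if (tystate[j] == 0) continue;
        if (tystate[j] == best) continue;
        count = 0;
        cbest = tystate[j];
        for (int k = j; k <= nstate; ++k)
            if (tystate[k] == cbest) ++count;
        if (count > times) { best = cbest; times = count; }
    }
    result_t r = { best, times };
    return r;
}

/* Transformed loop: the rare update moves out of the inner for loop,
   which now breaks when the update is needed and is then re-entered. */
result_t scan_isolated(const int *tystate, int nstate) {
    int best = -1, times = 0, count = 0, cbest = 0;  /* assumed initial values */
    int j = 0;
    while (j <= nstate) {
        for (; j <= nstate; ++j) {
            if (tystate[j] == 0) continue;
            if (tystate[j] == best) continue;
            count = 0;
            cbest = tystate[j];
            for (int k = j; k <= nstate; ++k)
                if (tystate[k] == cbest) ++count;
            if (count > times) break;
        }
        if (count > times) { best = cbest; times = count; j++; }
    }
    result_t r = { best, times };
    return r;
}
```

Between two breaks the inner for loop writes neither `best` nor `times`, which is what makes it a candidate for speculative DOALL execution.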
[Figure: bar chart of the fraction of sequential execution per benchmark (SPEC FP, SPEC INT, Mediabench, Utilities, and average); series: profiled + provable]
[Figure: bar chart of the fraction of sequential execution per benchmark (SPEC FP, SPEC INT, Mediabench, Utilities, and average); series: profiled + provable, transformations]
[Figure: bar chart of the fraction of sequential execution (y-axis ticks 10 to 70) for SpecINT, MediaBench and Utilities; series: DOALL loops, Control speculation for uncounted loops, Speculative fission, Speculative prematerialization, Infrequent dependence isolation, DOALL loops after transformations]
[Figure: bar chart of speedup (y-axis from 1 to 5) per benchmark (SPEC FP, SPEC INT, Mediabench, Utilities, and average) on 2, 4 and 8 cores, with and without transformations; data labels 7.89, 7.37, 7.87 and 6.44]
[Diagram: Core 0, Core 1, Core 2 and Core 3]
    for (...) {
1:    current = ...;
2:    work(last);
3:    last = current;
    }

    XBEGIN
1': current =
3': last =
    for (...) {
1:    current = ...;
2:    work(last);
3:    last = current;
    }
    XCOMMIT
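
The idea behind statements 1' and 3' can be sketched in C: before a chunk starts, it recomputes the value of `last` that the previous iteration would have produced, so the chunk no longer waits for its predecessor. The sketch makes several assumptions: `produce` is a hypothetical stand-in for the elided right-hand side of statement 1, `work(last)` is modeled as storing `last` into `out[i]`, chunks run one after another, and the transaction (XBEGIN/XCOMMIT) is omitted because here the recomputed value is always exact.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the elided "current = ...;" (depends only on i). */
int produce(int i) { return i * i + 1; }

/* Original loop: 'last' carries a value from one iteration to the next. */
void run_original(int *out, int n, int initial_last) {
    int last = initial_last;
    for (int i = 0; i < n; i++) {
        int current = produce(i);  /* 1 */
        out[i] = last;             /* 2: work(last), modeled as a store */
        last = current;            /* 3 */
    }
}

/* One chunk [is, ie): statements 1' and 3' rebuild 'last' for iteration
   is - 1 before the chunk body runs. */
void run_chunk(int *out, int is, int ie, int initial_last) {
    int last = initial_last;
    if (is > 0) {
        int current = produce(is - 1);  /* 1' */
        last = current;                 /* 3' */
    }
    for (int i = is; i < ie; i++) {
        int current = produce(i);
        out[i] = last;
        last = current;
    }
}

/* Runs every chunk; each call is independent of the others. */
void run_prematerialized(int *out, int n, int cs, int initial_last) {
    for (int is = 0; is < n; is += cs) {
        int ie = (is + cs < n) ? is + cs : n;
        run_chunk(out, is, ie, initial_last);
    }
}
```

Because each `run_chunk` call reads nothing written by another chunk, the calls could be issued in any order or concurrently and still fill `out` as the original loop does.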