An Ecosystem for Combining Performance and Correctness for Many-Cores
Pieter Hijma pieter@cs.vu.nl Friday 16 May 2018 Fourth NIRICT GPGPU Reconnaissance Workshop
by SURF & NWO
Message: In coming years, software will have to adapt to
Figure: Moore's law. Number of transistors (10^3 to 10^10) and clock speed (0.1 to 10000 MHz) by year, 1970 to 2015, for the 8008, 8080, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4, Core 2, Core i7 (Nehalem), Core i7 (Sandy), Core i7 (Haswell), and Core i7 (Broadwell).
Figure: diagram of repeated blocks, each with four ALUs, control, and cache.
Figure: Performance and power efficiency of the #1 TOP500 system, 2005 to 2017. Axes: performance (TFLOPS) and efficiency (TFLOPS/W).
Figure: diagram of many repeated blocks, each with four ALUs, control, and cache.
Figure: Lithography over the years. Manufacturing process (nm, 200 to 1000) by year, 1985 to 2020.
Figure: number of transistors on chip and number of bases in database (1000 to 1x10^10), by year, 1970 to 2018.
MCL Cashmere Constellation Application
Program Algorithm Mapping Hardware
control performance portability
Figure: measured performance per version. 89 GFLOPS; v1: 100 GFLOPS; v2: 92 GFLOPS; v3: 205 GFLOPS; 205 GFLOPS; v1: 494 GFLOPS; 89 GFLOPS.
Figure: verification workflow. Extract a model and check property p. For each refinement: extract the refinement of the model, check property p, and check equivalence.
MCL Cashmere Constellation Application
Figure: speedup versus number of nodes (1, 2, 4, 8, 16) for the Pentax, Praktica, and Olympos data sets, compared with ideal speedup.

data set          Pentax    Praktica  Olympos
number of images  638       1095      4980
#jobs             2075      1128      73920
time 1 node       47m 14s   44m 44s   53h 25m
time 16 nodes     2m 55s    3m 16s    3h 10m
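The speedup on 16 nodes can be estimated directly from the two timing rows of the table above. The following is a minimal sketch, not part of the original slides: the names `TIMES`, `to_seconds`, and `speedup` are hypothetical, and because the table reports rounded times, the quotients only approximate the plotted speedup.

```python
# Wall-clock times copied from the table above, as (hours, minutes, seconds).
# TIMES, to_seconds and speedup are illustrative names, not from the slides.
TIMES = {
    "Pentax":   {"1 node": (0, 47, 14), "16 nodes": (0, 2, 55)},
    "Praktica": {"1 node": (0, 44, 44), "16 nodes": (0, 3, 16)},
    "Olympos":  {"1 node": (53, 25, 0), "16 nodes": (3, 10, 0)},
}


def to_seconds(hms):
    """Convert an (hours, minutes, seconds) tuple to seconds."""
    hours, minutes, seconds = hms
    return hours * 3600 + minutes * 60 + seconds


def speedup(data_set):
    """Ratio of the 1-node time to the 16-node time for one data set."""
    times = TIMES[data_set]
    return to_seconds(times["1 node"]) / to_seconds(times["16 nodes"])


if __name__ == "__main__":
    for name in TIMES:
        print(f"{name}: {speedup(name):.2f}x on 16 nodes")
```

Dividing the speedup by 16 gives the parallel efficiency for each data set.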
Figure: timeline per device (Titan X 4, Titan X 3, K40, TitanX-Pascal 1, TitanX-Pascal 0, Titan X 1, Titan X 0, K20), with a time axis from 6h 55m to 7h 05m.
Figure (panels A, B, C): hardware description of the device "perfect", with memory (mem), interconnect (ic), and an execution group of cores indexed 0,0 through 2,4.
Figure (panels A, B, C): sequence alignment example.

CATGTGGTCGGTA  CATGCGGTGTA  TGTGGTCGGTA  ATGCGGTCGGTA
1
CAαβαββαCGGTA  CAαβγββαGTA  αβαββαCGGTA  AαβγββαCGGTA
2
CAαβαββαCGGTA  CAαβγββα--GTA
3
CATGTGGTCGGTA  CATGCGGT--GTA
4

Substitution matrix over A, C, G, T:

   A  C  G  T
A  1
C  0  1
G  0  0  1
T  0  0  0  1

Substitution matrix extended with α, β, γ:

   A  C  G  T  α    β    γ
A  1
C  0  1
G  0  0  1
T  0  0  0  1
α  0  0  0  1  MMW
β  0  0  1  0  MSW  MMW
γ  0  1  0  0  MSW  MSW  MMW

Input sequences:

>Sequence1
CATGCGGTA
>Sequence2
CATGTGGTCGGTA

CLUSTAL 2.1 multiple sequence alignment

Sequence1 CATG----CGGTA 8
Sequence2 CATGTGGTCGGTA 12
* *** * ****
Figure: capital and country pairs. Athens, Greece; Berlin, Germany; Ankara, Turkey; Bern, Switzerland; Hanoi, Vietnam; Lisbon, Portugal; Moscow, Russia; Stockholm, Sweden; Tokyo, Japan; Washington, USA.
Henri Bal, Alessio Sclocco, Pieter Hijma, Rob van Nieuwpoort, Jaap Heringa, Sanne Abeln, Maurits Dijkstra, Piek Vossen, Antske Fokkens, Atze van der Ploeg, Jan Friso Groote, Anton Wijs, Tim Willemse, Maurice Laveaux, PhD student, PhD student, Ceriel Jacobs