An Ecosystem for Combining Performance and Correctness for Many-Cores


An Ecosystem for Combining Performance and Correctness for Many-Cores. Pieter Hijma, pieter@cs.vu.nl. Friday 16 May 2018, Fourth NIRICT GPGPU Reconnaissance Workshop, by SURF & NWO.


slide-1
SLIDE 1

An Ecosystem for Combining Performance and Correctness for Many-Cores

Pieter Hijma pieter@cs.vu.nl Friday 16 May 2018 Fourth NIRICT GPGPU Reconnaissance Workshop

by SURF & NWO

slide-2
SLIDE 2

Message

In the coming years, software will have to adapt to hardware more than before.

Our solution: An Ecosystem for Combining Performance and Correctness for Many-Cores


slide-3
SLIDE 3

Look into the Past

[Figure: number of transistors per processor, 1970 to 2015, on a logarithmic scale from 10^3 to 10^10, plotted against Moore's law. Processors shown: 8008, 8080, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4, Core 2, Core i7 (Nehalem), Core i7 (Sandy), Core i7 (Haswell), Core i7 (Broadwell).]

slide-4
SLIDE 4

Look into the Past

[Figure: number of transistors (10^3 to 10^10) and clock speed (0.1 to 10000 MHz) per processor, 1970 to 2015, on logarithmic scales, plotted against Moore's law. Processors shown: 8008 through Core i7 (Broadwell).]

slide-5
SLIDE 5

Look into the Past

[Figure: number of transistors (10^3 to 10^10) and clock speed (0.1 to 10000 MHz) per processor, 1970 to 2015, on logarithmic scales, plotted against Moore's law. Processors shown: 8008 through Core i7 (Broadwell). Annotation: Single-core era.]

slide-6
SLIDE 6

Look into the Past

[Figure: number of transistors (10^3 to 10^10) and clock speed (0.1 to 10000 MHz) per processor, 1970 to 2015, on logarithmic scales, plotted against Moore's law. Processors shown: 8008 through Core i7 (Broadwell). Annotation: Lucky time.]

slide-7
SLIDE 7

Look into the Past

[Figure: number of transistors (10^3 to 10^10) and clock speed (0.1 to 10000 MHz) per processor, 1970 to 2015, on logarithmic scales, plotted against Moore's law. Processors shown: 8008 through Core i7 (Broadwell). Annotation: Multi-core era.]

slide-8
SLIDE 8

Processor types

  • Single-core
      • Optimized for latency
  • Multi-core
      • Still optimized for latency, but just more than one
  • Many-core
      • Optimized for throughput
      • High performance/Watt

[Figure: block diagrams of the processor types, built from ALU, Control, and Cache blocks.]

slide-9
SLIDE 9

Performance per Watt

[Figure: performance (TFLOPS) and power efficiency (TFLOPS/W) of the #1 system in the TOP500, 2005 to 2017.]

slide-10
SLIDE 10

Many-core processors

Features

  • throughput oriented
  • fast evolution of the architecture
  • architectural features for high performance

Difficult to program, especially for high performance

slide-11
SLIDE 11

Future processors

[Figure: block diagram of a future processor, composed of multiple groups of ALU, Control, and Cache blocks.]

slide-12
SLIDE 12

Moore’s Law

[Figure: number of transistors per processor, 1970 to 2015, on a logarithmic scale from 10^3 to 10^10, plotted against Moore's law. Processors shown: 8008, 8080, 8086, 80286, 80386, 80486, Pentium, Pentium II, Pentium III, Pentium 4, Core 2, Core i7 (Nehalem), Core i7 (Sandy), Core i7 (Haswell), Core i7 (Broadwell).]

slide-13
SLIDE 13

Moore’s Law ending

[Figure: lithography over the years: manufacturing process (nm, axis ticks from 200 to 1000) from 1985 to 2020.]

slide-14
SLIDE 14

Walls

  • energy wall
  • memory wall
  • Moore’s law → Moore’s wall

Result: hardware without compromises to the interface for programmers → difficult to program →

  • programming wall

slide-15
SLIDE 15

Large demand for computational power

Chemistry

  • in vitro → in silico

Machine Learning

  • Shooting with a computational cannon

Increase in data to process

  • For example gene-sequence alignment


slide-16
SLIDE 16

Increase in data

[Figure: number of transistors on chip, 1970 to 2018, on a logarithmic scale from 1000 to 10^10.]

Moore’s law

slide-17
SLIDE 17

Increase in data

[Figure: number of bases in the database and number of transistors on chip, 1970 to 2018, on a logarithmic scale from 1000 to 10^10.]

Moore’s law against the SRA genetic database.

slide-18
SLIDE 18

Many-core era

  • window of 5-10 years to figure out:
      • what hardware is going to look like
      • how to program well for performance

slide-19
SLIDE 19

Recap

To deal with energy problems, hardware will be:

  • highly parallel
  • throughput oriented
  • rich in architectural details for performance
  • difficult to program

Result

  • More responsibility for software developers
  • Increase in performance relies on software

slide-20
SLIDE 20

Ecosystem for Performance and Correctness

  • clusters of many-cores
  • obtain high performance
  • understanding performance
  • correctness with model checking

[Figure: the components of the ecosystem: MCL, Cashmere, Constellation, and the Application.]

slide-21
SLIDE 21

Programming in MCL

A program is an algorithm mapped to hardware

[Figure: diagram with the elements Program, Algorithm, Mapping, and Hardware.]

Solution: incorporate hardware descriptions in the programming model

slide-22
SLIDE 22

Hierarchy of hardware descriptions

[Figure: tree of hardware descriptions with the nodes perfect, mic, gpu, nvidia, amd, fermi, xeon phi, gtx480, kepler, and gtx680. Axis labels: control, performance, portability.]
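
The hierarchy can be pictured as a tree in which every hardware description refines its parent. The sketch below is not MCL syntax and not how MCL stores its descriptions. It is a plain Python illustration, and the parent links are an assumption inferred from the hardware names on the slide (for example, that gtx480 is a Fermi GPU from NVIDIA). The function names `refinement_path` and `targets` are hypothetical.

```python
# Hypothetical parent links, inferred from the hardware names on the slide.
# The root "perfect" is the most portable description; leaves are the most specific.
PARENT = {
    "perfect": None,
    "gpu": "perfect",
    "mic": "perfect",
    "nvidia": "gpu",
    "amd": "gpu",
    "xeon_phi": "mic",
    "fermi": "nvidia",
    "kepler": "nvidia",
    "gtx480": "fermi",
    "gtx680": "kepler",
}


def refinement_path(level):
    """Return the chain from `level` up to the root, most specific first."""
    path = []
    while level is not None:
        path.append(level)
        level = PARENT[level]
    return path


def targets(level):
    """Return all leaf descriptions at or below `level`."""
    children = {}
    for child, parent in PARENT.items():
        children.setdefault(parent, []).append(child)

    def leaves(node):
        kids = children.get(node, [])
        if not kids:
            return [node]
        return [leaf for kid in kids for leaf in leaves(kid)]

    return leaves(level)


print(refinement_path("gtx480"))
print(targets("nvidia"))
```

Read this way, a kernel written at a high level such as `gpu` covers many leaves (portability), while a kernel written at `gtx480` covers one leaf but can use everything that description exposes (control and performance).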


slide-23
SLIDE 23

Stepwise-refinement for performance

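[Figure: the hierarchy of hardware descriptions (perfect, mic, gpu, nvidia, amd, fermi, xeon phi, gtx480, kepler, gtx680) annotated with the performance of successive kernel versions. Values in the order they appear on the slide: 89 GFLOPS; v1: 100 GFLOPS; v2: 92 GFLOPS; v3: 205 GFLOPS; 205 GFLOPS; v1: 494 GFLOPS; 89 GFLOPS.]

Feedback: Using 1/8 blocks per SMP. Reduce the amount of shared memory used by storing/loading shared memory in phases.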

slide-24
SLIDE 24

Model checking: mCRL2

  • effective tool for finding software flaws
  • supports rich data structures
  • versatile
  • memory access problems
  • correctness of optimizations

Goals

  • non-intrusive
  • feed back verified properties into the compiler for optimization
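
To give a feel for what checking a memory access problem means, here is a minimal sketch of explicit-state model checking in Python. It does not use mCRL2 and does not reflect how mCRL2 works internally; it simply enumerates all interleavings of two threads that each increment a shared counter without synchronization, and searches for a state that violates a property. The state layout and the names `successors`, `check`, and `no_lost_update` are assumptions made for this illustration.

```python
from collections import deque

# A state is (shared, pcs, regs): the shared counter, one program counter
# per thread, and one private register per thread.
# Each thread runs: pc 0: reg = shared; pc 1: shared = reg + 1; pc 2: done.
INITIAL = (0, (0, 0), (0, 0))


def successors(state):
    """Yield every state reachable by letting one thread take one step."""
    shared, pcs, regs = state
    for t in range(2):
        pc, reg = pcs[t], regs[t]
        if pc == 0:        # load the shared counter into the private register
            new_shared, new_reg = shared, shared
        elif pc == 1:      # store the incremented register back
            new_shared, new_reg = reg + 1, reg
        else:              # this thread has finished
            continue
        new_pcs = tuple(pc + 1 if i == t else p for i, p in enumerate(pcs))
        new_regs = tuple(new_reg if i == t else r for i, r in enumerate(regs))
        yield (new_shared, new_pcs, new_regs)


def check(initial, invariant):
    """Breadth-first search; return a counterexample trace or None."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


def no_lost_update(state):
    """Property: once both threads are done, the counter must be 2."""
    shared, pcs, _ = state
    return pcs != (2, 2) or shared == 2


trace = check(INITIAL, no_lost_update)
for step in trace or []:
    print(step)
```

The search returns a trace in which both threads load the counter before either stores it, so one update is lost. A real model checker works on a much richer specification language, but the idea of exhaustively exploring states against a property is the same.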


slide-25
SLIDE 25

Performance-correctness co-refinement

[Figure: the hierarchy of hardware descriptions (perfect, mic, gpu, nvidia, amd, fermi, xeon phi, gtx480, kepler, gtx680) with labels: extract model, check property p; extract refinement of model, check property p (twice); check equivalence (twice).]

slide-26
SLIDE 26

Accelerating Verification

  • exploit symmetry in many-core programs
  • use many-cores to accelerate model checking
  • accelerate the term-rewriting core in mCRL2

slide-27
SLIDE 27

Many-core cluster computers

  • Supports heterogeneous many-core clusters
  • Can handle large-scale applications
  • Excellent load balancing and scalability

[Figure: the components of the ecosystem: MCL, Cashmere, Constellation, and the Application.]

slide-28
SLIDE 28

Scalability results forensics application

[Figure: speedup (2 to 16) against number of nodes (1, 2, 4, 8, 16) for the data sets Pentax, Praktica, and Olympos, compared with the ideal speedup.]

data set            Pentax     Praktica   Olympos
number of images    638        1095       4980
#jobs               2075       1128       73920
time 1 node         47m 14s    44m 44s    53h 25m
time 16 nodes       2m 55s     3m 16s     3h 10m
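
The speedup and parallel efficiency on 16 nodes can be recomputed from the run times in the table. The sketch below is only arithmetic on the reported values; the helper names are hypothetical, and since the table entries are rounded, the results need not match the plotted curves exactly.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a duration from the table to seconds."""
    return hours * 3600 + minutes * 60 + secs


NODES = 16

# (time on 1 node, time on 16 nodes), taken from the table on the slide.
RUNS = {
    "Pentax": (seconds(minutes=47, secs=14), seconds(minutes=2, secs=55)),
    "Praktica": (seconds(minutes=44, secs=44), seconds(minutes=3, secs=16)),
    "Olympos": (seconds(hours=53, minutes=25), seconds(hours=3, minutes=10)),
}


def speedup(t_one, t_many):
    """Speedup relative to the single-node run."""
    return t_one / t_many


def efficiency(t_one, t_many, nodes):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup(t_one, t_many) / nodes


for name, (t1, t16) in RUNS.items():
    s = speedup(t1, t16)
    e = efficiency(t1, t16, NODES)
    print(f"{name}: speedup {s:.2f}, efficiency {e:.2f}")
```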


slide-29
SLIDE 29

Load balancing

[Figure: timeline per device with axis ticks at 6h 55m, 7h 00m, and 7h 05m. Devices: Titan X 4, Titan X 3, K40, TitanX-Pascal 1, TitanX-Pascal 0, Titan X 1, Titan X 0, K20.]

slide-30
SLIDE 30

Visualizing kernel execution

Hardware descriptions designed such that they can be drawn:

[Figure: drawing of the hardware description "perfect": Device perfect with Memory mem, Interconnect ic, and Execution group cores (a grid of cores numbered 0,0 to 2,4), together with the labels A, B, and C.]

slide-31
SLIDE 31

Bioinformatics application

Motif-aware multiple sequence alignment

Panel 1:

CATGTGGTCGGTA
CATGCGGTGTA
TGTGGTCGGTA
ATGCGGTCGGTA

Panel 2:

CAαβαββαCGGTA
CAαβγββαGTA
αβαββαCGGTA
AαβγββαCGGTA

Panel 3:

CAαβαββαCGGTA
CAαβγββα--GTA
--αβαββαCGGTA
-AαβγββαCGGTA

Panel 4:

CATGTGGTCGGTA
CATGCGGT--GTA
--TGTGGTCGGTA
-ATGCGGTCGGTA

Matrices (lower triangles shown):

    A  C  G  T
A   1
C   0  1
G   0  0  1
T   0  0  0  1

    A  C  G  T  α    β    γ
A   1
C   0  1
G   0  0  1
T   0  0  0  1
α   0  0  0  1  MMW
β   0  0  1  0  MSW  MMW
γ   0  1  0  0  MSW  MSW  MMW

Panels A, B, C:

>Sequence1
CATGCGGTA
>Sequence2
CATGTGGTCGGTA

CLUSTAL 2.1 multiple sequence alignment

Sequence1 CATG----CGGTA 8
Sequence2 CATGTGGTCGGTA 12
* *** * ****
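
The four numbered panels suggest a pipeline: rewrite motif occurrences in a separate alphabet, align, and translate back. The sketch below illustrates only the rewriting and the translation back. It is not the authors' implementation. The motif pattern `TG[TC]GGT` and the letter mapping (T to α, G to β, C to γ) are assumptions inferred from the example sequences on the slide, and the alignment step itself is left out.

```python
import re

# Assumption: the motif is TG[TC]GGT, inferred from the slide's example.
MOTIF = re.compile(r"TG[TC]GGT")

# Assumption: inside a motif, T, G, and C are written as alpha, beta, gamma.
TO_GREEK = str.maketrans({"T": "α", "G": "β", "C": "γ"})
FROM_GREEK = str.maketrans({"α": "T", "β": "G", "γ": "C"})


def encode(sequence):
    """Rewrite every motif occurrence in the Greek alphabet."""
    return MOTIF.sub(lambda m: m.group(0).translate(TO_GREEK), sequence)


def decode(sequence):
    """Translate Greek letters back to nucleotides; gaps are left alone."""
    return sequence.translate(FROM_GREEK)


sequences = ["CATGTGGTCGGTA", "CATGCGGTGTA", "TGTGGTCGGTA", "ATGCGGTCGGTA"]
encoded = [encode(s) for s in sequences]
for original, rewritten in zip(sequences, encoded):
    print(original, "->", rewritten)
```

With motif letters in their own alphabet, an extended matrix like the second one on the slide can score motif positions (MMW, MSW) differently from ordinary nucleotides during alignment.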


slide-32
SLIDE 32

Natural Language Processing application

Word embeddings

  • Map words to vectors of real numbers (word2vec)
  • Take a large corpus, create a large multi-dimensional vector space

[Figure: two-dimensional plot of word vectors for capital and country pairs: Athens and Greece, Berlin and Germany, Ankara and Turkey, Bern and Switzerland, Hanoi and Vietnam, Lisbon and Portugal, Moscow and Russia, Stockholm and Sweden, Tokyo and Japan, Washington and USA.]
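
A plot of capitals and countries is usually shown because the offset between a capital and its country is roughly the same for every pair. The sketch below illustrates that idea with made-up two-dimensional vectors. They are not word2vec output, the numbers are chosen by hand for the illustration, and the function names are hypothetical.

```python
import math

# Made-up 2-D vectors for illustration only. Real embeddings are learned
# from a large corpus and typically have hundreds of dimensions.
VECTORS = {
    "Athens": (1.0, 2.0),
    "Greece": (-1.0, 1.5),
    "Berlin": (1.2, 0.5),
    "Germany": (-0.8, 0.0),
    "Tokyo": (0.9, -1.0),
    "Japan": (-1.1, -1.5),
}


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using the offset b - a + c."""
    target = tuple(
        vb - va + vc
        for va, vb, vc in zip(VECTORS[a], VECTORS[b], VECTORS[c])
    )
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


print(analogy("Athens", "Greece", "Berlin"))
print(analogy("Athens", "Greece", "Tokyo"))
```

In these toy vectors every capital lies a fixed offset away from its country, so the nearest word to the computed target is the matching country.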

slide-33
SLIDE 33

Overview people

Henri Bal, Alessio Sclocco, Pieter Hijma, Rob van Nieuwpoort, Jaap Heringa, Sanne Abeln, Maurits Dijkstra, Piek Vossen, Antske Fokkens, Atze van der Ploeg, Jan Friso Groote, Anton Wijs, Tim Willemse, Maurice Laveaux, Ceriel Jacobs, PhD student, PhD student

by SURF & NWO

DTEC TOP

slide-34
SLIDE 34

Conclusion

In the coming years, software will have to adapt to hardware more than before.

Our solution: An Ecosystem for Combining Performance and Correctness for Many-Cores


slide-35
SLIDE 35

FPGAs

  • high-level synthesis: as successful as automagically parallelizing compilers
  • only for:
      • extremely low latency applications
      • extremely power efficient
      • prototyping hardware
  • tools are of low quality