An Ecosystem for Combining Performance and Correctness for Many-Cores

Pieter Hijma, pieter@cs.vu.nl. Friday 16 May 2018, Fourth NIRICT GPGPU Reconnaissance Workshop, by SURF & NWO.


  1. An Ecosystem for Combining Performance and Correctness for Many-Cores. Pieter Hijma, pieter@cs.vu.nl. Friday 16 May 2018, Fourth NIRICT GPGPU Reconnaissance Workshop, by SURF & NWO.

  2. Message: In the coming years, software will have to adapt to hardware more than previously. Our solution: An Ecosystem for Combining Performance and Correctness for Many-Cores.

  3. Look into the Past. [Figure: Moore's law. Number of transistors (log scale, 10^3 to 10^10) per processor, from the 8008 to the Core i7 (Broadwell), 1970 to 2015.]

  4. Look into the Past. [Figure: Moore's law. Number of transistors (log scale, 10^3 to 10^10) and clock speed (MHz, log scale, 0.1 to 10000) per processor, from the 8008 to the Core i7 (Broadwell), 1970 to 2015.]

  5. Look into the Past. [Same figure, annotated: Single-core era.]

  6. Look into the Past. [Same figure, annotated: Lucky time.]

  7. Look into the Past. [Same figure, annotated: Multi-core era.]

  8. Processor types • Single-core: optimized for latency • Multi-core: still optimized for latency, but just more than one • Many-core: optimized for throughput, high performance/Watt. [Figure: block diagrams of the processor types, built from ALU, control, and cache blocks.]

  9. Performance per Watt. [Figure: Performance and power efficiency of #1 TOP500, 2005 to 2017. Series: "Performance (TFLOPS)" and "Efficiency (TFLOPS/W)"; vertical axis 0 to 100.]

  10. Many-core processor features • throughput oriented • fast evolution of the architecture • architectural features for high performance. Difficult to program, especially for high performance.

  11. Future processors. [Figure: block diagram of processors built from ALU, control, and cache blocks.]

  12. Moore's Law. [Figure: Moore's law. Number of transistors (log scale, 10^3 to 10^10) per processor, from the 8008 to the Core i7 (Broadwell), 1970 to 2015.]

  13. Moore's Law ending. [Figure: Lithography over the years. Manufacturing process (nm, 0 to 1000), 1985 to 2020.]

  14. Walls • energy wall • memory wall • Moore's law → Moore's wall. Result: hardware that makes no compromises for the interface to programmers → difficult to program → programming wall.

  15. Large demand for computational power. Chemistry: • in vitro → in silico. Machine Learning: • Shooting with a computational cannon. Increase in data to process: • For example gene-sequence alignment.

  16. Increase in data. [Figure: number of transistors on chip (log scale, 1000 to 1x10^10), 1970 to 2018. Caption: Moore's law.]

  17. Increase in data. [Figure: number of transistors on chip and number of bases in the database (log scale, 1000 to 1x10^10), 1970 to 2018. Caption: Moore's law against the SRA genetic database.]

  18. Many-core era • window of 5-10 years to figure out: • what hardware is going to look like • how to program well for performance

  19. Recap. To deal with energy problems, hardware will be: • highly parallel • throughput oriented • architectural details for performance • difficult to program. Result: • More responsibility for software developers • Increase in performance relies on software

  20. Ecosystem for Performance and Correctness • clusters of many-cores • obtain high performance • understanding performance • correctness with model checking. [Figure: software stack, top to bottom: Application, MCL, Cashmere, Constellation.]

  21. Programming in MCL. A program is an algorithm mapped to hardware. [Figure: diagram with the labels Program, Algorithm, Mapping, Hardware.] Solution: incorporate hardware descriptions in the programming model.

  22. Hierarchy of hardware descriptions. [Figure: tree of hardware descriptions, level by level: perfect; mic, gpu; xeon phi, nvidia, amd; fermi, kepler; gtx480, gtx680. Labels alongside the tree: portability, performance, control.]

  23. Stepwise-refinement for performance. [Figure: the hierarchy of hardware descriptions (perfect; mic, gpu; xeon phi, nvidia, amd; fermi, kepler; gtx480, gtx680) annotated with the performance of kernel versions. Values shown: 89 GFLOPS, v1: 100 GFLOPS, v2: 92 GFLOPS, v3: 205 GFLOPS, 205 GFLOPS, v1: 494 GFLOPS.] Feedback: Using 1/8 blocks per smp. Reduce the amount of shared memory used by storing/loading shared memory in phases.
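
The idea of refining a kernel level by level can be sketched in a few lines of Python. This is only an illustration, not MCL itself: the parent links are one reading of the flattened hierarchy on the slides, and the kernel names (`matmul_perfect` and so on) are hypothetical. The sketch picks, for a target device, the kernel version written for the most specific hardware description available.

```python
# Parent links of the hardware-description hierarchy.
# Assumption: the parent of each node is inferred from the levels on the slide.
PARENT = {
    "perfect": None,
    "mic": "perfect",
    "gpu": "perfect",
    "xeon_phi": "mic",
    "nvidia": "gpu",
    "amd": "gpu",
    "fermi": "nvidia",
    "kepler": "nvidia",
    "gtx480": "fermi",
    "gtx680": "kepler",
}


def ancestry(level):
    """Return the chain from `level` up to the root, most specific first."""
    chain = []
    while level is not None:
        chain.append(level)
        level = PARENT[level]
    return chain


def select_kernel(device, versions):
    """Pick the kernel version written for the most specific level that covers `device`."""
    for level in ancestry(device):
        if level in versions:
            return level, versions[level]
    raise LookupError(f"no kernel version applies to {device}")


# Hypothetical kernel versions, keyed by the level they were written for.
versions = {
    "perfect": "matmul_perfect",
    "gpu": "matmul_gpu",
    "fermi": "matmul_fermi",
}

for device in ("gtx480", "gtx680", "xeon_phi"):
    print(device, "->", select_kernel(device, versions))
```

With these assumed versions, the gtx480 uses the fermi kernel, the gtx680 falls back to the generic gpu kernel, and the xeon phi falls back to the kernel written for perfect hardware.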

  24. Model checking: mCRL2 • effective tool for finding software flaws • supports rich data structures • versatile • memory access problems • correctness of optimizations. Goals: • non-intrusive • feed back verified properties into the compiler for optimization
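
To give a feel for what a model checker does with a memory access problem, here is a minimal explicit-state sketch in plain Python. It does not use mCRL2 or its specification language; the two-thread program, the state encoding, and the property name `no_lost_update` are all assumptions made for illustration. The checker explores every interleaving of two threads that each perform a non-atomic increment and returns a shortest trace that violates the property.

```python
from collections import deque

# Each thread runs: tmp = shared; shared = tmp + 1 (a non-atomic increment).
# State: (shared, pc0, tmp0, pc1, tmp1), where pc is 0 (before the read),
# 1 (before the write), or 2 (done).


def successors(state):
    """Yield every state reachable by letting one thread take one step."""
    shared, *threads = state
    for t in (0, 1):
        pc, tmp = threads[2 * t], threads[2 * t + 1]
        nxt = list(threads)
        if pc == 0:  # read the shared counter into the local copy
            nxt[2 * t], nxt[2 * t + 1] = 1, shared
            yield (shared, *nxt)
        elif pc == 1:  # write the incremented local copy back
            nxt[2 * t] = 2
            yield (tmp + 1, *nxt)


def check(initial, is_final, prop):
    """Breadth-first search; return a shortest trace to a final state that violates prop."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if is_final(state) and not prop(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in successors(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None  # the property holds in every reachable final state


initial = (0, 0, 0, 0, 0)
is_final = lambda s: s[1] == 2 and s[3] == 2
no_lost_update = lambda s: s[0] == 2

trace = check(initial, is_final, no_lost_update)
for step in trace or []:
    print(step)
```

The returned trace is an interleaving in which both threads read the counter before either writes it, so one increment is lost. A real model checker works on a far larger state space, which is why the talk looks at symmetry and many-core acceleration.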

  25. Performance-correctness co-refinement. [Figure: the hierarchy of hardware descriptions (perfect; mic, gpu; xeon phi, nvidia, amd; fermi, kepler; gtx480, gtx680) with verification steps between levels: extract model, check property p, extract refinement of model, check property p, check equivalence.]

  26. Accelerating Verification • exploit symmetry in many-core programs • use many-cores to accelerate model checking • accelerate the term-rewriting core in mCRL2

  27. Many-core cluster computers • Supports heterogeneous many-core clusters • Can handle large-scale applications • Excellent load balancing and scalability. [Figure: software stack, top to bottom: Application, MCL, Cashmere, Constellation.]

  28. Scalability results forensics application. [Figure: speedup (0 to 16) against number of nodes (1, 2, 4, 8, 16) for Ideal, Pentax, Praktica, and Olympos.]

      | data set         | Pentax  | Praktica | Olympos |
      |------------------|---------|----------|---------|
      | number of images | 638     | 1095     | 4980    |
      | #jobs            | 2075    | 1128     | 73920   |
      | time 1 node      | 47m 14s | 44m 44s  | 53h 25m |
      | time 16 nodes    | 2m 55s  | 3m 16s   | 3h 10m  |
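
The speedup on 16 nodes follows directly from the two timing rows of the slide above: speedup is the 1-node time divided by the 16-node time, and parallel efficiency is the speedup divided by the number of nodes. The short script below does that arithmetic; the helper `seconds` is just a convenience introduced here, and the results come from the times as printed on the slide, which may not match exactly how the plot was produced.

```python
def seconds(hours=0, minutes=0, secs=0):
    """Convert a duration to seconds."""
    return hours * 3600 + minutes * 60 + secs


# (time on 1 node, time on 16 nodes), copied from the slide.
runs = {
    "Pentax": (seconds(minutes=47, secs=14), seconds(minutes=2, secs=55)),
    "Praktica": (seconds(minutes=44, secs=44), seconds(minutes=3, secs=16)),
    "Olympos": (seconds(hours=53, minutes=25), seconds(hours=3, minutes=10)),
}
NODES = 16

speedups = {}
for name, (t1, t16) in runs.items():
    speedups[name] = t1 / t16
    efficiency = speedups[name] / NODES
    print(f"{name:9s} speedup {speedups[name]:5.2f}  efficiency {efficiency:4.0%}")
```

From these times, Pentax and Olympos come out slightly above 16 and Praktica at roughly 13.7.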

  29. Load balancing. [Figure: execution timeline per device (K20, Titan X 0, Titan X 1, TitanX-Pascal 0, TitanX-Pascal 1, K40, Titan X 3, Titan X 4); time axis from 6h 55m to 7h 05m.]
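
Why devices of very different speeds can finish within minutes of each other after hours of work can be illustrated with a small simulation. This is a sketch of generic greedy dynamic scheduling, not of the scheduler actually used in Cashmere or Constellation; the relative device speeds and the job sizes are made up for the example.

```python
import heapq


def simulate(job_costs, device_speeds):
    """Greedy dynamic scheduling: the device that becomes idle first takes the next job."""
    # Heap of (time at which the device becomes idle, device name).
    idle = [(0.0, name) for name in device_speeds]
    heapq.heapify(idle)
    finish = {name: 0.0 for name in device_speeds}
    for cost in job_costs:
        t, name = heapq.heappop(idle)
        t += cost / device_speeds[name]  # faster devices finish a job sooner
        finish[name] = t
        heapq.heappush(idle, (t, name))
    return finish


# Hypothetical relative speeds; these are not measurements from the talk.
speeds = {"K20": 1.0, "K40": 1.3, "TitanX": 2.0, "TitanX-Pascal": 3.0}
jobs = [1.0] * 10_000  # many small jobs of equal cost

finish = simulate(jobs, speeds)
spread = max(finish.values()) - min(finish.values())
for name, t in finish.items():
    print(f"{name:14s} finishes at {t:8.2f}")
print(f"spread between first and last device: {spread:.2f}")
```

Because each job goes to whichever device is free first, the finish times can differ by at most the duration of one job on the slowest device, so many small jobs give a tight finish.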

  30. Visualizing kernel execution. Hardware descriptions designed such that they can be drawn. [Figure: device "perfect" drawn with memory "mem" holding C, A, and B, interconnect "ic", and execution group "cores" as a 3 x 5 grid of cores numbered 0,0 to 2,4.]

  31. Bioinformatics application. Motif-aware multiple sequence alignment. [Figure: panel A shows a CLUSTAL 2.1 multiple sequence alignment of Sequence1 and Sequence2; panel B shows four steps in which sequences are rewritten with the motif symbols α, β, γ, aligned, and mapped back to nucleotides; panel C shows a substitution matrix over A, C, G, T next to one extended with α, β, γ, with the labels MMW and MSW.]

  32. Natural Language Processing application. Word embeddings • Map words to vectors of real numbers (word2vec) • Take large corpus, create large multi-dimensional vector space. [Figure: two-dimensional plot (both axes -2 to 2) of countries and their capitals: Vietnam, Hanoi; Japan, Tokio; USA, Washington; Russia, Moscow; Turkey, Ankara; Germany, Berlin; Sweden, Stockholm; Switzerland, Bern; Portugal, Lisbon; Greece, Athens.]
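
A toy example shows why such a vector space is useful: if countries and capitals are placed consistently, the offset from a country to its capital can be reused for another country. The vectors below are hand-made 2-D values chosen for illustration, not trained word2vec embeddings, and the nearest neighbour is found with Euclidean distance for simplicity (cosine similarity is the more common choice in practice).

```python
import math

# Hand-made 2-D toy vectors, NOT trained embeddings.
vectors = {
    "Japan": (-1.0, 1.5),
    "Tokio": (0.5, 1.3),
    "Russia": (-0.9, 0.8),
    "Moscow": (0.6, 0.5),
    "Germany": (-1.1, 0.0),
    "Berlin": (0.45, -0.25),
}


def analogy(a, b, c):
    """Return the word whose vector is closest to vec(b) - vec(a) + vec(c)."""
    target = tuple(
        vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    )
    candidates = (w for w in vectors if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(vectors[w], target))


# "Japan is to Tokio as Germany is to ...?"
print(analogy("Japan", "Tokio", "Germany"))
print(analogy("Japan", "Tokio", "Russia"))
```

With these toy vectors the two queries return Berlin and Moscow.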

  33. Overview people: Rob van Nieuwpoort, Henri Bal, Jaap Heringa, Jan Friso Groote, Piek Vossen, Alessio Sclocco, Sanne Abeln, Pieter Hijma, Ceriel Jacobs, Anton Wijs, Tim Willemse, Atze van der Ploeg, Antske Fokkens, PhD student, Maurits Dijkstra, Maurice Laveaux, PhD student. [Logos: by SURF & NWO, DTEC, TOP.]
