
How's the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services
Kathryn S. McKinley, The University of Texas at Austin

20th Century Simplistic Hardware View
• Faster processors: frequency scaling, speculation, OO
• Programs do not change; they just run faster

Programming Language Evolution
• Managed programming languages
• Native programming languages

20th Century Simplistic Software View
• Larger, more capable software: managed languages
• Hardware does not change; it just runs faster

Processor Technology Evolution
• Pentium 4 NetBurst (130nm), 2003
• Power 5, 2 cores (90nm), 2004
• Pentium M Dothan (90nm), 2005
• Core 2 Duo Conroe (65nm), 2006
• Atom Diamondville (45nm), 2008
• i7 Bloomfield (45nm), 2008
• Core 2 Duo Wolfdale (45nm), 2009
• i5 Clarkdale (32nm), 2010

The 20th Century Virtuous Cycle ✓
• Faster single processor: frequency scaling
• Larger, more capable software: managed languages

The 21st Century Virtuous Cycle?
• More cores: chip multiprocessors (CMP)
• Scalable software: scalable apps + scalable runtime

How is this new virtuous cycle going?

Measured Power vs Performance
[Figure: power (W, log scale) versus speedup over Atom 230 (log scale) for Pentium 4 (130nm), Core 2 Duo (65nm), Core 2 Duo (45nm), i7 (45nm), and i5 (32nm), labeled by year from 2003 to 2010. Benchmarks: SPEC CPU 2006, DaCapo, SPEC jvm98.]

How is this new virtuous cycle going for single-threaded Java?

Performance Scaling: Single-Threaded Java Benchmarks
Core i7: 4 cores, 2-way SMT
[Figure: speedup versus number of hardware contexts for antlr, bloat, compress, db, fop, jack, javac, jess, mpegaudio, pmd, raytrace, and their geometric mean.]

How is this new virtuous cycle going for multi-threaded Java?

Performance Scaling: Multi-Threaded Java Benchmarks
Core i7: 4 cores, 2-way SMT
[Figure: speedup versus number of hardware contexts for avrora, batik, eclipse, h2, jython, luindex, lusearch, mtrt, pjbb2005, sunflow, tomcat, tradebeans, tradesoap, xalan, and their geometric mean.]

Power, Performance, and Concurrency
[Figure: native and Java benchmarks; single-threaded hollow, multithreaded solid.]
• Microarchitecture changes from Pentium 4 (130nm) to i5 (32nm) favored parallelism: no surprise
• Multithreaded performance incurs a significant power cost

Is there hope?

Managed Languages: Challenges & Opportunities

Must Start with a Scalable Managed Runtime

Sequential Managed Programs
[Figure: timeline of a single core running the application and the managed runtime.]
Managed runtime:
• Profiling
• Dynamic analysis
• Compilation
• Garbage collection
• Other helper threads
• ……

Steps towards scalability
Step 1. Parallel application
[Figure: timeline over cores 0 to 7; application threads run on some cores, leaving the others unused.]
• Each thread has a different running time

Steps towards scalability
Step 2. Parallel runtime
[Figure: timeline over cores 0 to 7 showing application threads and managed runtime threads.]
• Runtime waits for all application threads to pause

Steps towards scalability
Step 3. Parallel & concurrent runtime
[Figure: timeline over cores 0 to 7 showing application threads and managed runtime threads.]
• Managed runtime on the application's critical path may perturb its performance

Steps towards scalability: ideal model
Step 4. Minimize perturbation
[Figure: timeline over cores 0 to 7 showing application threads.]
• Whole runtime task taken off critical path
• Application offloads work to concurrent runtime threads

Steps towards scalability: ideal model
Step 4. Minimize perturbation
[Figure: timeline over cores 0 to 7 showing application threads.]
• Worst case is parallel & concurrent

Vision
• Scalable Runtimes
  – Runtime & application parallelism & concurrency
  – CMP-aware runtime improves application scalability
• Communication
  – Cache coherency is expensive and performance sensitive
  – Memory bandwidth scaling is problematic
• Heterogeneity
  – Move non-critical path off power-hungry cores
  – Smarter, more aggressive analysis
• Specialization?
  – Tuned cores? Special-purpose cores?

Approach
• Profiling (feedback-directed optimization)
  – Concurrent analysis
  – More invasive analysis on low-power cores
• GC
  – High-performance concurrent GC
  – High-performance non-moving GC
  – Reduced synchronization overheads
  – Distributed & scratchpad GC
• JIT
  – Concurrent, parallel JIT
  – Cost-benefit shift as low-power cores are used
• Architecture
  – Tuned and/or specialized cores for runtime services
  – Coherence tailored for the restricted, common case of GC

Today
• Profiling (feedback-directed optimization)
  – Concurrent analysis
  – More invasive analysis on low-power cores
• GC
  – High-performance concurrent GC
  – High-performance non-moving GC
  – Reduced synchronization overheads
  – Distributed & scratchpad GC
• JIT
  – Concurrent, parallel JIT
  – Cost-benefit shift as low-power cores are used
• Architecture
  – Tuned and/or specialized cores for runtime services
  – Coherence tailored for the restricted, common case of GC

A Concurrent Dynamic Analysis Framework for CMP Hardware
Jungwoo Ha (U. Texas & UCS/ICI-East), Matthew Arnold (IBM Research), Stephen M. Blackburn (Australian National University), Kathryn S. McKinley (University of Texas at Austin)

Generic Sequential Analysis
[Figure: application timeline in which instrumented code (== overhead) performs data collection and analysis inline.]
• Difficult to optimize instrumented code
• Trade accuracy for overhead (sampling)
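The accuracy-for-overhead trade can be illustrated with a minimal sketch. It assumes a hypothetical inline profiler, `SampledCounter`, that records only one event out of every `period` and scales the counts back up; the class name, the `period` parameter, and the event stream are illustrative and not taken from the slides.

```python
from collections import Counter

class SampledCounter:
    """Hypothetical inline profiler: records one event out of every `period`."""

    def __init__(self, period):
        self.period = period
        self.seen = 0            # events observed so far (cheap bookkeeping)
        self.counts = Counter()  # events actually recorded (the costly part)

    def record(self, event):
        self.seen += 1
        if self.seen % self.period == 0:
            self.counts[event] += 1

    def estimate(self, event):
        # Scale the sampled count back up to an estimate of the true count.
        return self.counts[event] * self.period

# Illustrative event stream: 1000 events, 750 of them "a".
events = ["a", "b", "a", "a"] * 250

full = SampledCounter(period=1)     # exact, but records every event
sampled = SampledCounter(period=7)  # records roughly 1/7 of the events
for e in events:
    full.record(e)
    sampled.record(e)

print(full.estimate("a"), sampled.estimate("a"))
```

The sampled profiler does about one seventh of the recording work and returns an estimate rather than the exact count. A sampling period that happens to align with a periodic event stream can bias the estimate badly, which is one reason sampling costs accuracy.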

Generic Concurrent Analysis
[Figure: the application (producer) runs instrumented code with reduced overhead: data collection, buffering, enqueue. A separate analysis thread (consumer) dequeues and analyzes.]
• Lower overhead & higher accuracy
• Must deal with microarchitectural side-effects
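The producer/consumer split can be sketched as follows. This is only an illustration of the division of labor, under stated assumptions: the function names, the `SENTINEL` marker, and the toy workload are hypothetical, and Python's `queue.Queue` is lock-based, so it does not model a lock-free channel or any cache behavior.

```python
import queue
import threading
from collections import Counter

SENTINEL = None  # hypothetical end-of-stream marker

def analyzer(channel, profile):
    """Consumer: dequeues records and does the analysis off the application thread."""
    while True:
        record = channel.get()
        if record is SENTINEL:
            break
        profile[record] += 1  # stands in for an expensive analysis step

def application(channel, n):
    """Producer: the instrumentation only enqueues a record; no analysis runs inline."""
    total = 0
    for i in range(n):
        total += i                                    # the application's real work
        channel.put("even" if i % 2 == 0 else "odd")  # light-weight instrumentation
    channel.put(SENTINEL)
    return total

channel = queue.Queue()
profile = Counter()
worker = threading.Thread(target=analyzer, args=(channel, profile))
worker.start()
result = application(channel, 1000)
worker.join()
print(result, dict(profile))
```

Because every record is analyzed rather than sampled, the profile is exact, while the application thread pays only for the enqueue.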

Side-effects to Avoid
[Figure: application (producer) on Core A and analysis (consumer) on Core B, each with its own L1 above the lower-level cache(s).]
• False & true sharing
• High-latency memory operations
• Cache line ping-ponging

Cache-friendly Asymmetric Buffering
• Lock-free communication channel between application and analysis thread
• Cache-friendly asymmetric buffering
  – Actively avoids microarchitectural side-effects
  – Enqueue
    • light-weight instrumentation
    • produces one record at a time
  – Dequeue
    • consumes one chunk (fraction of a buffer) at a time

Cache-friendly Asymmetric Buffering
[Figure: application (producer) on Core A and analysis (consumer) on Core B share an application buffer with slots 0 to 15. Labels: "application writes here", "analyzer reads here", "analyzer waits for application here"; one analyzer chunk is marked.]
• 16 slots in the buffer
• 4 chunks, 4 slots in each chunk
• L1 size == chunk size

Cache-friendly Asymmetric Buffering
[Figure: the same producer/consumer diagram, with buffer slots shown in the caches, the application buffer with slots 0 to 15, and the analyzer chunk marked.]
• Delay consumer dequeue operation until cache line is flushed
  – 2 chunks away (smiley location)
• Analyzer operates one chunk at a time
  – chunk_size > L1 size
  – In practice, chunk_size >= 2 * L1 works well.
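The index arithmetic behind the asymmetric buffer can be sketched with a small single-threaded model. It uses the geometry from the slides (16 slots, 4 chunks of 4 slots) and the rule that the analyzer stays 2 chunks behind the application. Everything else is an assumption made for illustration: the class and method names are hypothetical, there are no memory fences or real threads, and cache flushing is not modeled at all.

```python
class AsymmetricBuffer:
    """Single-threaded model of the buffer's index arithmetic (hypothetical names)."""

    def __init__(self, num_chunks=4, chunk_slots=4, gap=2):
        self.num_chunks = num_chunks
        self.chunk_slots = chunk_slots
        self.gap = gap        # chunks the consumer must stay behind the producer
        self.slots = [None] * (num_chunks * chunk_slots)
        self.written = 0      # total records enqueued by the producer
        self.chunks_read = 0  # total chunks dequeued by the consumer

    def enqueue(self, record):
        """Producer side: write one record; return False if the buffer is full."""
        unread = self.written - self.chunks_read * self.chunk_slots
        if unread >= len(self.slots):
            return False
        self.slots[self.written % len(self.slots)] = record
        self.written += 1
        return True

    def dequeue_chunk(self):
        """Consumer side: return one whole chunk, or None while the producer
        is fewer than `gap` chunks ahead of the chunk to be read."""
        producer_chunk = self.written // self.chunk_slots
        if producer_chunk - self.chunks_read < self.gap:
            return None
        start = (self.chunks_read % self.num_chunks) * self.chunk_slots
        self.chunks_read += 1
        return self.slots[start:start + self.chunk_slots]

buf = AsymmetricBuffer()
for i in range(7):
    buf.enqueue(i)
early = buf.dequeue_chunk()   # producer is still in chunk 1: too close
buf.enqueue(7)                # producer moves on to chunk 2 (slot 8)
first = buf.dequeue_chunk()   # chunk 0 is now 2 chunks behind the producer
second = buf.dequeue_chunk()  # chunk 1 is only 1 chunk behind: wait again
print(early, first, second)
```

The asymmetry is visible in the two methods: the producer touches one slot per call, while the consumer takes a whole chunk and only after the producer has moved far enough away. A real implementation would also need a way to drain the last chunks at shutdown, which this sketch omits.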
