How’s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services



slide-1
SLIDE 1

How’s the Parallel Computing Revolution Going? Towards Parallel, Scalable VM Services

Kathryn S McKinley

The University of Texas at Austin

1 Towards Parallel, Scalable VM Services Kathryn McKinley

slide-2
SLIDE 2

20th Century Simplistic Hardware View


Faster Processors

Frequency scaling, speculation, out-of-order execution (OO)

Programs do not change; they just run faster

slide-3
SLIDE 3


Programming Language Evolution

Native Programming Languages Managed Programming Languages

slide-4
SLIDE 4

20th Century Simplistic Software View


Larger, More Capable Software

Managed Languages

Hardware does not change; it just runs faster

slide-5
SLIDE 5


Processor Technology Evolution

  • Pentium 4 NetBurst (130nm), 2003
  • Pentium M Dothan (90nm), 2005
  • Core 2 Duo Conroe (65nm), 2006
  • Atom Diamondville (45nm), 2008
  • Core 2 Duo Wolfdale (45nm), 2009
  • i7 Bloomfield (45nm), 2008
  • i5 Clarkdale (32nm), 2010
  • Power 5, 2 cores (90nm), 2004

slide-6
SLIDE 6

The 20th Century Virtuous Cycle


Faster Single Processor

Frequency Scaling

Larger, More Capable Software

Managed Languages

slide-7
SLIDE 7

The 21st Century Virtuous Cycle?


More Cores

Chip Multiprocessors

CMP

Scalable Software

Scalable Apps + Scalable Runtime

?

slide-8
SLIDE 8

How is this new virtuous cycle going?


slide-9
SLIDE 9

Measured Power vs Performance

[Figure: scatter plot with axes Power (W), log scale, and Speedup (vs. Atom 230), log scale. Points: Pentium 4 (130nm), Core 2 Duo (65nm), i7 (45nm), Core 2 Duo (45nm), i5 (32nm), released between 2003 and 2010. Benchmarks: SPEC CPU 2006, DaCapo, SPEC jvm98.]

slide-10
SLIDE 10

How is this new virtuous cycle going for single-threaded Java?


slide-11
SLIDE 11

Performance Scaling

Single-Threaded Java Benchmarks on Core i7: 4 cores, 2-way SMT

[Figure: Speedup plotted against Hardware Contexts. Series: antlr, bloat, compress, db, fop, jack, javac, jess, mpegaudio, pmd, raytrace, geomean.]

slide-12
SLIDE 12

How is this new virtuous cycle going for multi-threaded Java?


slide-13
SLIDE 13

Performance Scaling

Multi-Threaded Java Benchmarks on Core i7: 4 cores, 2-way SMT

[Figure: Speedup plotted against Hardware Contexts. Series: avrora, batik, eclipse, h2, jython, luindex, lusearch, mtrt, pjbb2005, sunflow, tomcat, tradebeans, tradesoap, xalan, geomean.]

slide-14
SLIDE 14


Power, Performance, and Concurrency

  • Single threaded hollow; multithreaded solid
  • Microarchitecture changes from Pentium 4 (130nm) to i5 (32nm) favored parallelism; no surprise
  • Multithreaded performance incurs a significant power cost

[Figure legend: Native, Java]

slide-15
SLIDE 15

Is there hope?


slide-16
SLIDE 16

Managed Languages

Challenges & Opportunities


slide-17
SLIDE 17

Must Start with a Scalable Managed Runtime


slide-18
SLIDE 18

Sequential Managed Programs


Application

Managed Runtime Single Core

time

  • Profiling
  • Dynamic Analysis
  • Compilation
  • Garbage Collection
  • Other Helper Threads
  • ……
slide-19
SLIDE 19

Steps towards scalability

Step 1. Parallel application

[Figure: application threads running on Core 0 through Core 7 over time]

Unused cores; each thread has a different running time

slide-20
SLIDE 20

Steps towards scalability

Step 2. Parallel runtime

[Figure: application threads and managed runtime running on Core 0 through Core 7 over time]

Runtime waits for all application threads to pause

slide-21
SLIDE 21

Steps towards scalability

Step 3. Parallel & concurrent runtime

[Figure: application threads and managed runtime running on Core 0 through Core 7 over time]

Managed runtime on application’s critical path may perturb its performance

slide-22
SLIDE 22

Steps towards scalability: Ideal model

Step 4. Minimize perturbation

[Figure: application threads running on Core 0 through Core 7 over time]

Application offloads work to concurrent runtime threads. Whole runtime task taken off critical path
slide-23
SLIDE 23

Steps towards scalability: Ideal model

Step 4. Minimize perturbation

[Figure: application threads running on Core 0 through Core 7 over time]

Worst case is parallel & concurrent

slide-24
SLIDE 24

Vision

  • Scalable Runtimes
      – Runtime & application parallelism & concurrency
      – CMP aware runtime improves application scalability
  • Communication
      – Cache coherency is expensive and performance sensitive
      – Memory bandwidth scaling is problematic
  • Heterogeneity
      – Move non-critical path off power-hungry cores
      – Smarter, more aggressive analysis
  • Specialization?
      – Tuned cores? Special purpose cores?

slide-25
SLIDE 25

Approach

  • Profiling (feedback directed optimization)
      – Concurrent analysis
      – More invasive analysis on low-power cores
  • GC
      – High performance concurrent GC
      – High performance non-moving GC
      – Reduced synchronization overheads
      – Distributed & scratchpad GC
  • JIT
      – Concurrent, parallel JIT
      – Cost-benefit shift as low-power cores used
  • Architecture
      – Tuned and/or specialized cores for runtime services
      – Coherence tailored for restricted, common case of GC

slide-26
SLIDE 26

Today

  • Profiling (feedback directed optimization)
      – Concurrent analysis
      – More invasive analysis on low-power cores
  • GC
      – High performance concurrent GC
      – High performance non-moving GC
      – Reduced synchronization overheads
      – Distributed & scratchpad GC
  • JIT
      – Concurrent, parallel JIT
      – Cost-benefit shift as low-power cores used
  • Architecture
      – Tuned and/or specialized cores for runtime services
      – Coherence tailored for restricted, common case of GC

slide-27
SLIDE 27

A Concurrent Dynamic Analysis Framework For CMP Hardware

Stephen M. Blackburn, Australian National University

Kathryn S. McKinley, University of Texas at Austin

Jungwoo Ha, U. Texas & UCS/ICI-East

Matthew Arnold, IBM Research

slide-28
SLIDE 28

Generic Sequential Analysis

  • Difficult to optimize instrumented code
  • Trade accuracy for overhead (sampling)

[Figure: timeline of one application thread; data collection and analysis both run inside the instrumented code (== overhead)]
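The "trade accuracy for overhead (sampling)" bullet can be illustrated with a countdown sampler. This is a minimal sketch of one common way to sample, not the mechanism used in the talk; `sampler_t`, `sampler_init`, and `sampler_on_event` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sampling state: record one event out of every `period`. */
typedef struct {
    unsigned period;     /* sampling period; 1 means exhaustive collection */
    unsigned countdown;  /* events left until the next sample */
    size_t   recorded;   /* number of events actually recorded */
} sampler_t;

static void sampler_init(sampler_t *s, unsigned period) {
    assert(period >= 1);
    s->period = period;
    s->countdown = period;
    s->recorded = 0;
}

/* Called from instrumented code on every event.
 * Returns 1 if this event should be recorded, 0 if it is skipped. */
static int sampler_on_event(sampler_t *s) {
    if (--s->countdown != 0)
        return 0;                 /* cheap path: only a decrement and a test */
    s->countdown = s->period;     /* re-arm for the next sample */
    s->recorded++;
    return 1;
}
```

A larger period lowers the cost paid on the application's critical path, but the analysis then sees a smaller fraction of the events.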

slide-29
SLIDE 29

Generic Concurrent Analysis

  • Lower overhead & higher accuracy
  • Must deal with microarchitectural side-effects

[Figure: timeline. The application (producer) runs instrumented code (reduced overhead) that collects data and enqueues it; after buffering, the analysis (consumer) dequeues the data and analyzes it.]

slide-30
SLIDE 30

Side-effects to Avoid

[Figure: Application (Producer) on Core A and Analysis (Consumer) on Core B, each with its own L1 above the lower level cache(s)]

  • False & true sharing
  • High latency memory operation
  • Cache line ping-ponging
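One of these side-effects, false sharing, is commonly avoided by giving each thread's private variable its own cache line. The sketch below assumes a 64-byte line and uses hypothetical names (`channel_state_t`, `index_separation`); it does not address true sharing of the buffer itself.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; check the target CPU */

/* Each index sits on its own cache line, so a write by one core
 * does not invalidate the line the other core is using. */
typedef struct {
    alignas(CACHE_LINE) size_t producer_index;  /* written only by the application */
    alignas(CACHE_LINE) size_t consumer_index;  /* written only by the analysis thread */
} channel_state_t;

/* Compile-time check that the layout is two full lines. */
static_assert(sizeof(channel_state_t) == 2 * CACHE_LINE,
              "each index should occupy its own cache line");

/* Distance in bytes between the two indices. */
static size_t index_separation(void) {
    return offsetof(channel_state_t, consumer_index)
         - offsetof(channel_state_t, producer_index);
}
```

Without the alignment, both indices would probably share one line, and every update by either thread would bounce that line between the two L1 caches.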

slide-31
SLIDE 31

Cache-friendly Asymmetric Buffering

  • Lock-free communication channel between application and analysis thread
  • Cache-friendly asymmetric buffering
      – Actively avoids microarchitectural side-effects
      – Enqueue
          • light-weight instrumentation
          • produces one record at a time
      – Dequeue
          • consumes one chunk (fraction of a buffer) at a time

slide-32
SLIDE 32

Cache-friendly Asymmetric Buffering

  • 16 slots in the buffer
  • 4 chunks, 4 slots in each chunk
  • L1 size == chunk size

[Figure: Application (Producer) on Core A and Analysis (Consumer) on Core B, each with its own L1 above the lower level cache(s), sharing a buffer divided into chunks. Labels: application writes here; analyzer reads here; analyzer waits for application here.]

slide-33
SLIDE 33

Cache-friendly Asymmetric Buffering

  • Delay consumer dequeue operation until cache line is flushed
      – 2 chunks away (smiley location)
  • Analyzer operates one chunk at a time
      – chunk_size > L1 size
      – In practice, chunk_size >= 2 * L1 works well.

[Figure: Application (Producer) on Core A and Analysis (Consumer) on Core B sharing the chunked buffer; analyzer and application positions are marked]
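The chunk arithmetic behind the "2 chunks away" rule can be sketched with the toy geometry from the 16-slot illustration. The name `index_of` follows the consumer code shown later in the talk; `chunk_of` and `guard_slot` are hypothetical helpers added for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy geometry from the 16-slot illustration; real buffers are far larger. */
enum {
    NUM_CHUNKS      = 4,
    SLOTS_PER_CHUNK = 4,
    NUM_SLOTS       = NUM_CHUNKS * SLOTS_PER_CHUNK
};

/* Index of the first slot of a chunk; chunk numbers wrap around the ring. */
static size_t index_of(size_t chunk_num) {
    return (chunk_num % NUM_CHUNKS) * SLOTS_PER_CHUNK;
}

/* Chunk that contains a given slot. */
static size_t chunk_of(size_t slot) {
    assert(slot < NUM_SLOTS);
    return slot / SLOTS_PER_CHUNK;
}

/* Slot the consumer watches before reading chunk_num:
 * the first slot of the chunk two ahead. */
static size_t guard_slot(size_t chunk_num) {
    return index_of(chunk_num + 2);
}
```

Once the guard slot is non-zero, the producer has written past a whole intervening chunk, so with chunk_size >= 2 * L1 the lines of the chunk being consumed have most likely left the producer's L1.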

slide-34
SLIDE 34

Cache-friendly Asymmetric Buffering

  • Application blocks only when buffer is full
      – waiting until two more chunks are available

[Figure: Application (Producer) on Core A and Analysis (Consumer) on Core B sharing the chunked buffer; analyzer and application positions are marked]

slide-35
SLIDE 35

Cache-friendly Asymmetric Buffering

  • Producer may spin on bufptr, while consumer may spin on buffer
  • Producer code common case is 6 instructions on x86.

Producer (sees buffer)

    while (*bufptr != 0) {
        if (*bufptr == MAGIC) bufptr = buffer;
        if (*bufptr != 0) block();
    }
    *bufptr++ = data;

Consumer (sees chunk)

    while (app_is_running) {
        index = index_of(chunk_num + 2);
        while (buffer[index] == 0) spin_or_sleep();
        consume(chunk_num);
        chunk_num = NEXT(chunk_num);
    }

[Figure: buffer slots ending in a MAGIC sentinel; label: consumer only spins here]
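The producer and consumer loops above can be restated as a single-threaded, non-blocking sketch that is easy to step through. It is an illustration under assumptions, not the framework's code: `cab_t`, `cab_init`, `cab_try_enqueue`, and `cab_try_consume` are hypothetical names, the geometry is the toy 16-slot one, and the functions return 0 where the slide code would call `block()` or `spin_or_sleep()`. A real two-thread version would also need atomic accesses with suitable memory ordering, which the slides do not show.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { NUM_CHUNKS = 4, SLOTS_PER_CHUNK = 4,
       NUM_SLOTS = NUM_CHUNKS * SLOTS_PER_CHUNK };
#define MAGIC UINTPTR_MAX  /* assumed sentinel; must never be a valid record */

typedef struct {
    uintptr_t  slots[NUM_SLOTS + 1]; /* last entry permanently holds MAGIC */
    uintptr_t *bufptr;               /* producer's private write cursor */
    size_t     chunk_num;            /* consumer's private read position */
} cab_t;

static void cab_init(cab_t *q) {
    for (size_t i = 0; i < NUM_SLOTS; i++) q->slots[i] = 0;  /* 0 == empty */
    q->slots[NUM_SLOTS] = MAGIC;
    q->bufptr = q->slots;
    q->chunk_num = 0;
}

/* Producer: returns 1 on success, 0 where the slide code would block(). */
static int cab_try_enqueue(cab_t *q, uintptr_t data) {
    assert(data != 0 && data != MAGIC);
    if (*q->bufptr != 0) {                              /* uncommon path */
        if (*q->bufptr == MAGIC) q->bufptr = q->slots;  /* wrap to the start */
        if (*q->bufptr != 0) return 0;                  /* buffer is full */
    }
    *q->bufptr++ = data;                                /* common case */
    return 1;
}

/* Consumer: copies one chunk into out (SLOTS_PER_CHUNK entries) once the
 * producer has reached the chunk two ahead; returns the records copied. */
static size_t cab_try_consume(cab_t *q, uintptr_t *out) {
    size_t guard = ((q->chunk_num + 2) % NUM_CHUNKS) * SLOTS_PER_CHUNK;
    if (q->slots[guard] == 0) return 0;   /* slide code: spin_or_sleep() */
    size_t base = q->chunk_num * SLOTS_PER_CHUNK;
    for (size_t i = 0; i < SLOTS_PER_CHUNK; i++) {
        out[i] = q->slots[base + i];
        q->slots[base + i] = 0;           /* hand the slot back to the producer */
    }
    q->chunk_num = (q->chunk_num + 1) % NUM_CHUNKS;
    return SLOTS_PER_CHUNK;
}
```

In this sketch the zero value doubles as the "empty" flag, so the two sides never share an index or a lock: the producer only tests the slot it is about to write, and the consumer only tests one guard slot per chunk.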

slide-36
SLIDE 36

Framework Provides …

  • Cache-friendly Asymmetric Buffering (CAB)
      – Minimizes microarchitectural side-effects
      – Quickly offloads event data from application’s critical path
  • Configurable parameters for optimization
      – buffer size & chunk size
  • Various collection modes
      – Exhaustive mode
      – Sampling mode
  • Works on various threading models
      – N:M (green) threading model
      – native threading model

slide-37
SLIDE 37

Evaluation

  • 3 different CMP processors
      – Pentium 4 w/ hyperthreading: 8KB L1, 512KB L2
      – Core 2 Quad: four 32KB L1 caches, two 4MB L2 caches, system bus
      – Core i7: four 32KB L1 caches, four 256KB L2 caches, 8MB L3

slide-38
SLIDE 38

Evaluation

  • Jikes RVM (2 different threading models)
      – N:M threading (Jikes RVM 2.9.2)
      – Native threading (Jikes RVM 3.0.1)
  • Reference Dynamic Analysis Implementation
      – Method counting
      – Call graph
      – Call tree profiling
      – Path profiling
      – Cache simulator using load/store events
  • Benchmarks
      – DaCapo, SPEC JVM 98 benchmark suites
  • Parameters
      – buffer size = 2MB, chunk size = 128KB
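As a quick check of these parameters: a 2MB buffer with 128KB chunks holds 16 chunks, and a 128KB chunk is four times a 32KB L1, which satisfies the chunk_size >= 2 * L1 rule of thumb from the buffering slides. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define KB ((size_t)1024)
#define MB (1024 * KB)

/* Number of chunks in a buffer; both sizes are in bytes. */
static size_t chunk_count(size_t buffer_size, size_t chunk_size) {
    assert(chunk_size > 0 && buffer_size % chunk_size == 0);
    return buffer_size / chunk_size;
}

/* Rule of thumb from the buffering slides: chunk_size >= 2 * L1 size. */
static int chunk_large_enough(size_t chunk_size, size_t l1_size) {
    return chunk_size >= 2 * l1_size;
}
```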

slide-39
SLIDE 39

Call Graph Profiling

[Figure: bar chart of GeoMean Overhead (%) on Core-i7, Core-2, and Pentium-4. Bars: Instrumentation, Enqueue, Enqueue+Dequeue, Concurrent (N:M), Sequential.]

  • Instrumentation Overhead – Bar 1
  • Bar 1 – Collect event data and write into a single word. No analysis thread

slide-40
SLIDE 40

Call Graph Profiling

[Figure: bar chart of GeoMean Overhead (%) on Core-i7, Core-2, and Pentium-4. Bars: Instrumentation, Enqueue, Enqueue+Dequeue, Concurrent (N:M), Sequential.]

  • Enqueueing Overhead – (Bar 2 – Bar 1)
  • Bar 2 – Collect event data and write into the buffer. No analysis thread

slide-41
SLIDE 41

Call Graph Profiling

[Figure: bar chart of GeoMean Overhead (%) on Core-i7, Core-2, and Pentium-4. Bars: Instrumentation, Enqueue, Enqueue+Dequeue, Concurrent (N:M), Sequential.]

  • Communication Overhead – (Bar 3 – Bar 2)
  • Bar 3 – Analysis thread dequeues the data and writes it into a single word.

slide-42
SLIDE 42

Call Graph Profiling

[Figure: bar chart of GeoMean Overhead (%) on Core-i7, Core-2, and Pentium-4. Bars: Instrumentation, Enqueue, Enqueue+Dequeue, Concurrent (N:M), Sequential.]

  • Analysis (data processing) Overhead – (Bar 5 – Bar 1)
  • Bar 4 – Concurrent Analysis
  • Bar 5 – Sequential Analysis

slide-43
SLIDE 43

Call Graph Profiling

[Figure: bar chart of GeoMean Overhead (%) on Core-i7, Core-2, and Pentium-4. Bars: Instrumentation, Enqueue, Enqueue+Dequeue, Concurrent (N:M), Sequential.]

  • Overhead reduction with Concurrent Analysis – (Bar 5 – Bar 4)
  • Bar 4 – Concurrent Analysis
  • Bar 5 – Sequential Analysis
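The bar differences used across the call graph profiling slides can be collected into one small helper. This is a sketch with hypothetical names (`bars_t`, `breakdown_t`, `decompose`); it only restates the subtractions named on the slides and carries no measured data.

```c
#include <assert.h>

/* Overheads in percent over an uninstrumented run, one per bar. */
typedef struct {
    double instrumentation;  /* Bar 1 */
    double enqueue;          /* Bar 2 */
    double enqueue_dequeue;  /* Bar 3 */
    double concurrent;       /* Bar 4 */
    double sequential;       /* Bar 5 */
} bars_t;

typedef struct {
    double enqueueing;     /* Bar 2 - Bar 1 */
    double communication;  /* Bar 3 - Bar 2 */
    double analysis;       /* Bar 5 - Bar 1 */
    double reduction;      /* Bar 5 - Bar 4 */
} breakdown_t;

static breakdown_t decompose(bars_t b) {
    assert(b.instrumentation >= 0.0);
    breakdown_t d;
    d.enqueueing    = b.enqueue - b.instrumentation;
    d.communication = b.enqueue_dequeue - b.enqueue;
    d.analysis      = b.sequential - b.instrumentation;
    d.reduction     = b.sequential - b.concurrent;
    return d;
}
```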

slide-44
SLIDE 44

Path Profiling

[Figure: bar chart of GeoMean Overhead (%) on Core-i7, Core-2, and Pentium-4. Bars: Instrumentation, Enqueue, Enqueue+Dequeue, Concurrent (N:M), Sequential.]

  • Overhead reduction with Concurrent Analysis – (Bar 5 – Bar 4)
  • More data & computation than call graph

slide-45
SLIDE 45

Path Profiling Multithreaded Benchmarks

[Figure: bar chart of Overhead (%) for mtrt, hsqldb, lusearch, and xalan. Bars: Instrumentation, Enqueue, Enqueue+Dequeue, Concurrent (N:M), Sequential.]

  • Core 2
  • Multi-threaded benchmarks

slide-46
SLIDE 46

Concurrent Dynamic Analysis Framework

Conclusions

  • Framework eases implementation of many client analyses
  • CAB efficiently transfers application data to analysis thread, avoiding microarchitectural side-effects
  • Framework efficiently utilizes extra cycles to perform dynamic analysis concurrently

slide-47
SLIDE 47

Related Work

  • Concurrent Lock-free Queue
      – FastForward – single-producer & single-consumer [Giacomoni et al. 09]
  • Concurrent analysis for specific clients
      – PiPA – cache simulator [Zhao et al. 08]
  • Shadow process approach
      – Shadow profiling [Moseley et al. 07]
      – SuperPin [Wallace et al. 07]

slide-48
SLIDE 48

Are we finished with CMP efficient buffering?


slide-49
SLIDE 49

Not yet


Are we finished with CMP efficient buffering?

slide-50
SLIDE 50

Are we finished with CMP efficient buffering?

Not yet

  • parallel analysis
  • memory scalable
  • self-tuning parameters

slide-51
SLIDE 51

There is some hope

but we need many such base mechanisms


slide-52
SLIDE 52

Software Challenges and Opportunities

  • Communication (efficient coherency)
  • Analysis (off critical path, new analyses)
  • GC (concurrent, parallel, high throughput)
  • JIT (concurrent, parallel, more aggressive)
  • Heterogeneity (exploit it)
  • Memory (PCM, bandwidth limits)

slide-53
SLIDE 53

Hardware Challenges and Opportunities

  • Heterogeneity
      – Tune cores to ubiquitous loads?
      – Specialize for ubiquitous loads?
  • Coherence
      – SMT coherency does not scale
      – Software guarantees for simplified protocols?
  • Memory/Cache
      – Exploit access behavior of managed languages

slide-54
SLIDE 54

The 21st Century Virtuous Cycle?


More Cores

Chip Multiprocessors

CMP

Scalable Software

Scalable Apps + Scalable Runtime

?

Questions?

slide-55
SLIDE 55

I love my job

  • Because when I fail, I get better


slide-56
SLIDE 56

Failures

  • Rejected: jobs (all)
  • Failed: my Rice PhD qualifying exam
  • Rejected: jobs (8 of 11)
  • Rejected: my first three grant applications
  • Bad teaching evaluations
  • Rejected 2 times: my most cited paper
  • Rejected: jobs (some)
  • Rejected: papers, grants, papers, grants, …


slide-57
SLIDE 57


Processor Technologies and Power

  • Thermal Design Power (TDP) or chip power budget
      – The amount of power consumed without exceeding the maximum junction temperature
  • Power measurement
      – Hall effect current sensor on 12V line driving the chip
      – Sampling rate 50Hz

Processor          µArch     µProcessor    Process (nm)  Cores  Threads  Clock (GHz)  LLC (MB)  TDP (W)  Release Date
Pentium 4          NetBurst  Northwood     130           1      2        2.40         0.5       66.2     May '03
Core 2 Duo E6600   Core      Conroe        65            2      2        2.40         4.0       65.0     Jul '06
Core 2 Quad Q6600  Core      Kentsfield    65            4      4        2.40         8.0       105.0    Jan '07
Core i7 920        Nehalem   Bloomfield    45            4      8        2.66         8.0       130.0    Nov '08
Atom 230           Atom      Diamondville  45            1      2        1.66         0.5       4.0      Jun '08
Core 2 Duo E7600   Core      Wolfdale      45            2      2        3.06         3.0       65.0     May '09
Atom D510          Atom      Pineview      45            2      4        1.66         1.0       13.0     Dec '09
Core i5 670        Nehalem   Clarkdale     32            2      4        3.40         4.0       73.0     Jan '10
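The power measurement described above (current sensed on the 12V line, sampled at 50Hz) could be turned into average power and energy roughly as follows. This is a sketch under stated assumptions: the supply is treated as a constant 12V, the sensor readings are already converted to amperes, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SUPPLY_VOLTS   12.0  /* 12V line driving the chip, assumed constant */
#define SAMPLE_RATE_HZ 50.0  /* one current sample every 1/50 s */

/* Average power in watts from current samples in amperes. */
static double average_power_w(const double *amps, size_t n) {
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += amps[i];
    return SUPPLY_VOLTS * (sum / (double)n);
}

/* Energy in joules: average power times the sampled duration. */
static double energy_j(const double *amps, size_t n) {
    assert(n > 0);
    return average_power_w(amps, n) * ((double)n / SAMPLE_RATE_HZ);
}
```

For example, 50 samples of 1 A cover one second at 12 W, which is 12 J.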

slide-58
SLIDE 58


Native and Java Benchmarks

  • 61 benchmarks from six suites
  • Native (Fortran/C/C++) single-threaded benchmarks (NST)
      – SPEC CINT2006 (12)
      – SPEC CFP2006 (15)
  • Native multithreaded benchmarks (NMT)
      – PARSEC (11)
  • Java single-threaded benchmarks (JST)
      – SPEC JVM98 (6)
      – DaCapo-2006-10-MR2 (2)
      – DaCapo-9.12 (2)
  • Java multithreaded benchmarks (JMT)
      – SPEC JVM98 (1)
      – DaCapo-9.12 (11)
      – PJBB2005 (1)

slide-59
SLIDE 59


Compiler, JVMs, OS, and Performance

  • Intel compiler v11.1 with -O3 for NST
  • GNU gcc compiler v4.4.1 with -O3 for NMT
  • Java virtual machines
      – Sun (Oracle) HotSpot build 16.3-b01
      – Oracle JRockit build R28.0.0-679-130297
      – IBM J9 build pxi3260sr8
      – We measure and report the 5th iteration
  • Operating system
      – 32-bit Ubuntu 9.10 Karmic
      – Linux kernel version 2.6.31
  • Performance
      – Normalized execution time to Atom Diamondville (45nm)
      – Java with HotSpot default
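Normalizing execution time to a reference machine (the Atom in these slides) is a simple ratio. The sketch below uses hypothetical function names and makes no claim about how the original measurements were aggregated.

```c
#include <assert.h>

/* Speedup relative to the reference machine: > 1 means faster. */
static double speedup_vs_reference(double t_reference_s, double t_machine_s) {
    assert(t_reference_s > 0.0 && t_machine_s > 0.0);
    return t_reference_s / t_machine_s;
}

/* Normalized execution time: < 1 means faster than the reference. */
static double normalized_time(double t_reference_s, double t_machine_s) {
    assert(t_reference_s > 0.0 && t_machine_s > 0.0);
    return t_machine_s / t_reference_s;
}
```

With made-up timings, a benchmark that takes 10 s on the reference and 2.5 s on another machine has a speedup of 4 and a normalized time of 0.25.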