

SLIDE 1

Performance Monitoring of Diverse Computer Systems

Joseph M. Lancaster, Roger D. Chamberlain
Dept. of Computer Science and Engineering
Washington University in St. Louis
{lancaster, roger}@wustl.edu

Research supported by NSF grant CNS-0720667

SLIDE 2

  • Run correctly
    • Do not deadlock
    • Meet hard real-time deadlines
  • Run fast
    • High throughput / low latency
    • Low rate of soft-deadline misses

Infrastructure should help us debug when an application runs incorrectly or slowly.

9/25/08 – HPEC 2008

SLIDE 3

Increasingly common in HPEC systems, e.g. Mercury, XtremeData, DRC, Nallatech, ClearSpeed.

[Figure: a chip multiprocessor (CMP) with two cores alongside an FPGA containing a µP and custom logic]

SLIDE 4

App deployed using all four components.

[Figure: an application mapped across two CMPs (two cores each), a GPU, and an FPGA]

SLIDE 5

[Figure: the component types: a Cell CMP with eight cores, a GPU with 256 cores, two quad-core CMPs, and an FPGA with an embedded core and custom logic]

SLIDE 6

  • Large performance gains realized
  • Power efficient compared to a CMP alone
  • Requires knowledge of the individual architectures/languages
  • Components operate independently
    • Distributed system
    • Separate memories and clocks

SLIDE 7

Tool support for these systems is insufficient:

  • Many architectures lack tools for monitoring and validation
  • Tools for different architectures are not integrated
  • Ad hoc solutions

Solution: runtime performance monitoring and validation for diverse systems!

SLIDE 8

  • Introduction
  • Runtime performance monitoring
  • Frame monitoring
  • User-guidance

SLIDE 9

The dataflow model is a natural fit for diverse HPEC systems:

  • Composed of blocks and edges
  • Blocks compute concurrently
  • Data flows along edges

Languages: StreamIt, Streams-C, X

[Figure: a four-block pipeline, A → B → C → D]
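The block-and-edge model above can be sketched in a few lines of Python (a hypothetical illustration; `Block` and `run_pipeline` are invented names, not the API of StreamIt, Streams-C, or X):

```python
# Hypothetical sketch of the dataflow model: blocks are named
# computations, and data flows along edges from one block to the next.
class Block:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # the computation this block applies to each item

def run_pipeline(blocks, items):
    """Send each input item through blocks A -> B -> C -> D in order."""
    results = []
    for item in items:
        for blk in blocks:  # a real system runs blocks concurrently
            item = blk.fn(item)
        results.append(item)
    return results

pipeline = [Block("A", lambda x: x + 1),
            Block("B", lambda x: x * 2),
            Block("C", lambda x: x - 3),
            Block("D", lambda x: x * x)]
print(run_pipeline(pipeline, [1, 2, 3]))  # [1, 9, 25]
```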

SLIDE 10

[Figure: the pipeline A → B → C → D mapped onto hardware: a CMP (cores 1 and 2), a GPU, and an FPGA]

SLIDE 11

[Figure: two alternative mappings of blocks A, B, C, D onto the CMP cores, GPU, and FPGA]

SLIDE 12

Programming model   | Strategy                              | Tools / Environments
Shared Memory       | Execution profiling                   | gprof, Valgrind, PAPI
Message Passing     | Execution profiling, message logging  | TAU, mpiP, PARAVER
Stream Programming  | Simulation                            | StreamIt [MIT], StreamC [Stanford], Streams-C [LANL], Auto-Pipe [WUSTL]

SLIDE 13

Limitations for diverse systems:

  • No universal PC or architecture
  • No shared memory
  • Different clocks
  • Communication latency and bandwidth

SLIDE 14

Simulation is a useful first step, but:

  • Models can abstract away system details
  • Too slow for large datasets
  • HPEC applications are growing in complexity

We need to monitor the deployed, running app:

  • Measure the actual performance of the system
  • Validate performance on large, real-world datasets

SLIDE 15

  • Report more than just aggregate statistics
    • Capture rare events
  • Quantify measurement impact where possible
    • Overhead due to sampling, communication, etc.
  • Measure runtime performance efficiently
    • Low overhead
    • High accuracy
  • Validate performance on real datasets
  • Increase developer productivity

SLIDE 16

  • Monitor edges / queues
  • Find bottlenecks in the app
    • Do they change over time?
    • Computation or communication?
  • Measure latency between two points

[Figure: a six-block pipeline, 1 → 2 → 3 → 4 → 5 → 6]
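Measuring latency between two points reduces to pairing the timestamps recorded at the two taps; a minimal sketch, assuming each list holds the time at which item i passed its tap (function names are illustrative):

```python
# Sketch of point-to-point latency measurement between two taps.
# t_in[i] / t_out[i] are the timestamps of item i at each tap point.
def latencies(t_in, t_out):
    return [b - a for a, b in zip(t_in, t_out)]

def summarize(lat):
    """Aggregate statistics plus the worst case, so rare spikes survive."""
    return {"mean": sum(lat) / len(lat), "max": max(lat)}

lat = latencies([0, 10, 20, 30], [5, 16, 27, 90])
print(summarize(lat))  # {'mean': 19.5, 'max': 60}
```

Keeping the max alongside the mean matters here: the 60-unit spike on the last item would vanish in an average-only report.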

SLIDE 17

  • Interconnects are a precious resource
  • Monitoring uses the same interconnects as the application
    • Stay below the bandwidth constraint
    • Keep perturbation low

[Figure: monitoring architecture: CPU agents embedded in the application code on the CMP cores, an FPGA agent beside the application logic, and a monitor server collecting from both]

SLIDE 18

  • Understand measurement perturbation
  • Dedicate compute resources when possible
  • Aggressively reduce the amount of performance meta-data stored and transmitted
    • Utilize compression in both time resolution and fidelity of data values
  • Use knowledge from the user to specify their performance expectations / measurements
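The meta-data reduction above can be illustrated with two toy transforms (an assumed encoding, not the authors' actual scheme): coarsening time resolution by averaging windows of samples, and reducing value fidelity by quantizing to a step size:

```python
# Sketch of compressing performance meta-data along the two axes
# named on the slide: time resolution and value fidelity.
def coarsen(samples, window):
    """Replace every `window` raw samples with their average."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]

def quantize(samples, step):
    """Round each value to the nearest multiple of `step`."""
    return [round(s / step) * step for s in samples]

raw = [101, 99, 150, 148, 52, 48]
print(quantize(coarsen(raw, 2), 10))  # [100, 150, 50]
```

Both transforms are lossy by design; the trade-off is fewer bytes on the shared interconnect in exchange for coarser measurements.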

SLIDE 19

  • Use a CMP core as the monitor server
    • Monitor other cores for performance information
    • Process data from agents (e.g. FPGA, GPU)
    • Combine hardware and software information for a global view
  • Use logical clocks to synchronize events
  • Dedicate unused FPGA area to monitoring
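The logical-clock bullet can be sketched with Lamport clocks, a standard choice for ordering events across agents that share no common clock (the talk does not name a specific algorithm):

```python
# Sketch of Lamport logical clocks: each agent keeps a counter, and a
# received timestamp pulls the local clock forward, so a send is always
# ordered before its receive even with no shared physical time.
class Agent:
    def __init__(self):
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock  # timestamp carried with the message

    def receive(self, msg_ts):
        self.clock = max(self.clock, msg_ts) + 1
        return self.clock

fpga, cpu = Agent(), Agent()
fpga.local_event(); fpga.local_event()  # two events on the FPGA agent
ts = fpga.send()                        # ts == 3
print(cpu.receive(ts))                  # 4: receive ordered after send
```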

SLIDE 20

  • Introduction
  • Runtime performance monitoring
  • Frame monitoring
  • User-guidance

SLIDE 21

SLIDE 22

A frame summarizes performance over a period of the execution:

  • Maintains some temporal information
  • Captures system performance anomalies

[Figure: an execution timeline divided into frames]
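A frame might look like the following (a hypothetical data layout): each window of the execution is collapsed into summary statistics, so aggregate behavior and anomalies both survive:

```python
# Sketch of frame monitoring: one frame per window of samples, holding
# summary statistics instead of the raw event stream.
def build_frames(samples, frame_len):
    frames = []
    for i in range(0, len(samples), frame_len):
        window = samples[i:i + frame_len]
        frames.append({"start": i,
                       "mean": sum(window) / len(window),
                       "max": max(window)})  # max preserves anomalies
    return frames

# A latency spike in the second window stays visible in that frame,
# even though the raw samples are discarded:
print(build_frames([5, 5, 5, 5, 5, 90, 5, 5], 4))
```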

SLIDES 23-25

[Animation builds of the previous slide: frames 1 through 9 accumulate along the timeline]

SLIDE 26

  • Each frame reports one performance metric
  • Frame size can be dynamic
    • Dynamic bandwidth budget
    • Low-variance data / application phases
    • Trade temporal granularity for lower perturbation
  • Frames from different agents will likely be unsynchronized and of different sizes
  • The monitor server presents the user with a consistent global view of performance

SLIDE 27

  • Introduction
  • Runtime performance monitoring
  • Frame monitoring
  • User-guidance

SLIDE 28

Why? Related work: Performance Assertions for Mobile Devices [Lencevicius '06]

  • Validates user performance assertions on a multi-threaded embedded CPU

Our system enables validation of performance expectations across diverse architectures.

SLIDE 29

1. Measurement

  • User specifies a set of "taps" for an agent
  • Taps can be off an edge or an input queue
  • The agent then records events on each tap
  • Supported measurements for a tap:
    • Average value + standard deviation
    • Min or max value
    • Histogram of values
    • Outliers (based on a parameter)
  • Basic arithmetic and logical operators on taps:
    • Arithmetic: add, subtract, multiply, divide
    • Logic: and, or, not
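The tap measurements listed above can be sketched as follows (class and method names are assumptions, not the system's actual interface):

```python
import math
from collections import Counter

# Sketch of a tap supporting the measurements on the slide: average +
# standard deviation, min/max, a histogram, and parameter-based outliers.
class Tap:
    def __init__(self, outlier_threshold):
        self.values = []
        self.outlier_threshold = outlier_threshold

    def record(self, v):
        self.values.append(v)

    def stats(self):
        n = len(self.values)
        mean = sum(self.values) / n
        var = sum((v - mean) ** 2 for v in self.values) / n
        return {"mean": mean,
                "stddev": math.sqrt(var),
                "min": min(self.values),
                "max": max(self.values),
                "histogram": dict(Counter(self.values)),
                "outliers": [v for v in self.values
                             if abs(v - mean) > self.outlier_threshold]}

tap = Tap(outlier_threshold=10)
for v in [4, 5, 6, 5, 40]:
    tap.record(v)
s = tap.stats()
print(s["mean"], s["max"], s["outliers"])  # 12.0 40 [40]
```

A hardware agent would compute these incrementally rather than storing every value; the list here just keeps the sketch short.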

SLIDE 30

What is the throughput of block A?

[Figure: block A with a measurement context feeding the runtime monitor]

SLIDE 31

What is the throughput of block A when it is not data-starved?

[Figure: block A with a measurement context feeding the runtime monitor]

SLIDE 32

What is the throughput of block A when it is not starved for data and there is no downstream congestion?

[Figure: block A with a measurement context feeding the runtime monitor]

SLIDE 33

1. Measurement

  • A set of "taps" for an agent to count, histogram, or perform simple logical operations on
  • Taps can be an edge or an input queue

2. Performance assertion

  • The user describes their performance expectations of an application as assertions
  • The runtime monitor validates these assertions by collecting measurements and evaluating logical expressions
    • Arithmetic operators: +, -, *, /
    • Logical operators: and, or, not
    • Annotations: t, L

SLIDE 34

  • Throughput: "at least 100 A.Input events will be produced in any period of 1001 time units"

      t(A.Input[i+100]) – t(A.Input[i]) ≤ 1001

  • Latency: "A.Output is generated no more than 125 time units after A.Input"

      t(A.Output[i]) – t(A.Input[i]) ≤ 125

  • Queue bound: "A.InQueue never exceeds 100 elements"

      L(A.InQueue[i]) ≤ 100
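These three assertions can be checked mechanically against recorded measurements; a sketch with illustrative data (the timestamps and function names are invented for the example):

```python
# Sketch of validating the three example assertions above.
# t_in[i] / t_out[i] are timestamps of the i-th A.Input / A.Output
# event; q holds sampled A.InQueue lengths.
def check_throughput(t_in, window=100, bound=1001):
    """t(A.Input[i+100]) - t(A.Input[i]) <= 1001 for every i."""
    return all(t_in[i + window] - t_in[i] <= bound
               for i in range(len(t_in) - window))

def check_latency(t_in, t_out, bound=125):
    """t(A.Output[i]) - t(A.Input[i]) <= 125 for every i."""
    return all(o - i <= bound for i, o in zip(t_in, t_out))

def check_queue(q, bound=100):
    """L(A.InQueue[i]) <= 100 for every sample."""
    return all(length <= bound for length in q)

t_in = list(range(0, 2000, 10))   # one input every 10 time units
t_out = [t + 50 for t in t_in]    # each output 50 units later
print(check_throughput(t_in),     # True: 100 events per 1000 units
      check_latency(t_in, t_out), # True: latency 50 <= 125
      check_queue([3, 7, 2]))     # True: queue stays under 100
```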

SLIDE 35

  • Runtime measurements
    • Query CMP/GPU performance counters
    • Custom FPGA counters
  • Local assertions
    • Can be evaluated within a single agent
    • No need for communication with other agents or the system monitor
  • Global assertions
    • Require aggregating results from more than one agent on different compute resources

SLIDE 36

  • Some assertions impose prohibitive memory requirements
    • Either disallow these or warn the user of the large monitoring impact
  • Other assertions are compute intensive
  • A few are both!
  • Fortunately, much can be gained from simple queries
    • e.g. input queue lengths over time

SLIDE 37

  • FPGA agent mostly operational
    • Monitor only, no user assertions yet
  • Initial target application is the BLAST biosequence analysis application
    • CPU + FPGA hardware platform [Jacob, et al. TRETS '08]
  • Next target application is computational finance
    • CPU + GPU + FPGA
    • Performance significantly worse than models

SLIDE 38

Runtime performance monitoring enables:

  • More efficient development
  • Better testing for real-time systems

Future work:

  • Support correctness assertions
  • Investigate ways to best present results to the developer