TAPE: a Transactional Application Profiling Environment

Hassan Chafi, Chi Cao Minh, Austen McDonald, Brian D. Carlstrom, JaeWoong Chung, Lance Hammond, Christos Kozyrakis, and Kunle Olukotun
Computer Systems Laboratory, Stanford University


SLIDE 1

TAPE: A Transactional Application Profiling Environment ICS 2005

TAPE: a Transactional Application Profiling Environment

Hassan Chafi, Chi Cao Minh, Austen McDonald, Brian D. Carlstrom, JaeWoong Chung, Lance Hammond, Christos Kozyrakis, and Kunle Olukotun

Computer Systems Laboratory Stanford University

http://tcc.stanford.edu

SLIDE 2

Optimizing Parallel Performance

  • CMPs are here but parallel programming is still difficult

Need correct and fast parallel executables

  • Transactional memory simplifies correct parallel programming

No locks
Speculative parallelization

  • The issue is now performance tuning
  • TAPE: a system for performance profiling of transactional applications

Expressive: tracks all performance bottlenecks
Accurate: identifies bottleneck location in source code
Easy to use: leads to optimal performance in few tuning steps
Low overhead: negligible area & performance cost

  • TAPE allows for continuous profiling, even on production runs

SLIDE 3

TCC Architecture for Transactional Execution

[Figure: transaction timeline (transaction start, commit token request, commit) alongside the per-CPU hardware: write buffer, transaction control bits (read, modified, etc.), commit control, and TAPE HW.]

SLIDE 4

Out-of-the-box TCC Performance

[Figure: initial benchmark runtime on an 8-processor CMP. Y-axis: normalized execution time, with the ideal time marked. Benchmarks: art, equake, lufact, moldyn, mp3d, quicksort, radix, swim, tomcatv.]

  • Initial parallelization is quick and easy
  • Performance tuning is critical

SLIDE 5

Performance Bottlenecks

  • Dependency violations

Due to speculative nature of execution

  • Buffer overflows

Transaction’s state does not fit in cache

  • Workload imbalance

Transactions are assigned disproportionate amount of work

  • Transactional API overhead

Overhead of starting, committing, and aborting transactions

SLIDE 6

Dependency Violations

[Figure: execution timelines for CPU 1 and CPU 2 (legend: useful, arbitrate + commit, idle, violations). One CPU writes X and commits; the other CPU, which has read X, restarts its transaction.]
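The check behind this figure can be sketched in software. This is only an illustration: it assumes read and write sets are kept as small address arrays, whereas the slides describe hardware read/modified bits. The names `txn_t`, `violates`, and `MAX_SET` are hypothetical and do not come from TCC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SET 8  /* illustrative capacity, not a TCC parameter */

/* Hypothetical software model of the addresses one transaction tracks. */
typedef struct {
    uintptr_t read_set[MAX_SET];
    size_t    n_reads;
    uintptr_t write_set[MAX_SET];
    size_t    n_writes;
} txn_t;

/* Returns true if the committing transaction wrote an address that the
 * still-running transaction has already read. In that case the running
 * transaction consumed stale data and has to restart. */
bool violates(const txn_t *committer, const txn_t *running)
{
    assert(committer->n_writes <= MAX_SET && running->n_reads <= MAX_SET);
    for (size_t w = 0; w < committer->n_writes; w++)
        for (size_t r = 0; r < running->n_reads; r++)
            if (committer->write_set[w] == running->read_set[r])
                return true;
    return false;
}
```

A transaction that read only unrelated addresses is unaffected by the commit and keeps running.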

SLIDE 7

Buffer Overflows

[Figure: execution timelines for CPU 1 and CPU 2 (legend: useful, arbitration + commit). Timeline events shown: overflow, commit, overflow, commit.]

SLIDE 8

Initial Performance Results - 8 processors

[Figure: initial benchmark runtime, broken down into useful, idle, arbitration + commit, and violations. Y-axis: normalized execution time, with the ideal time marked. Benchmarks: art, equake, lufact, moldyn, mp3d, quicksort, radix, swim, tomcatv.]

SLIDE 9

Outline

  • Motivation
  • TAPE system overview
  • Example: Violation Profiling

Information gathering and filtering
Using profile information for optimizations

  • Evaluation
  • Conclusions

SLIDE 10

Key Insights

  • 1. Leverage hardware for transactional execution
  • Already monitoring everything
  • TAPE operations can be amortized at commit time
  • 2. Repeatability of bottlenecks
  • Critical performance bottlenecks occur repeatedly
  • Data aggregation saves space without losing accuracy
  • TAPE automatically filters out infrequent bottlenecks

SLIDE 11

TAPE System Overview


  • Online – Hardware

Each CPU gathers profile data in private buffers
Bottlenecks aggregated over multiple occurrences
Infrequent bottlenecks filtered out
Data periodically flushed to pre-allocated memory regions

  • Offline – Software

Combine information from all CPUs
Rank bottlenecks by cost
Format profiling output & relate data to source code
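The offline combine-and-rank step could look roughly like the sketch below. The record layout (`record_t` with `pc`, `cost`, `count`) and the function names are assumptions made for illustration; the slides do not specify TAPE's actual data format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical record: one aggregated bottleneck per source location. */
typedef struct {
    unsigned long pc;     /* program counter identifying the location */
    unsigned long cost;   /* accumulated cost, e.g. wasted cycles */
    unsigned long count;  /* number of occurrences */
} record_t;

/* Merge records flushed by all CPUs (concatenated in `in`) into `out`,
 * summing cost and count for entries with the same pc.
 * `out` must have room for n_in entries. Returns the merged count. */
size_t combine(const record_t *in, size_t n_in, record_t *out)
{
    size_t n_out = 0;
    assert(n_in == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < n_in; i++) {
        size_t j = 0;
        while (j < n_out && out[j].pc != in[i].pc)
            j++;
        if (j == n_out) {
            out[n_out++] = in[i];          /* first time this pc is seen */
        } else {
            out[j].cost  += in[i].cost;    /* aggregate repeat occurrence */
            out[j].count += in[i].count;
        }
    }
    return n_out;
}

/* qsort comparator: higher cost sorts first. */
static int by_cost_desc(const void *a, const void *b)
{
    const record_t *x = a, *y = b;
    return (x->cost < y->cost) - (x->cost > y->cost);
}

/* Rank bottlenecks so the most expensive one is reported first. */
void rank_by_cost(record_t *recs, size_t n)
{
    qsort(recs, n, sizeof recs[0], by_cost_desc);
}
```

The linear search in `combine` keeps the sketch short; a real tool would more likely use a hash table keyed on the program counter.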

SLIDE 12

Profiling Violations

  • CPU 1 writes address X
  • CPU 2 reads address X
  • CPU 1 commits first
  • CPU 2 detects violation on X

Inserts entry in Transaction Violation Buffer (TVB)

  • CPU 2 restarts transaction

Re-reads address X
Sends read PC2 to TVB

  • CPU 2 commits

Most costly violations flushed to Period Violation Buffer (PVB)
Others may get evicted

  • PVB can be flushed periodically

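[Figure: CPU 2 hardware (core, L1 cache, violation detection, TVB, PVB, network) and a timeline in which CPU 1 writes X and commits while CPU 2 reads X, is violated, and re-reads X. Example buffer entry: TPC = PCt, read PC = PC2, object addr = X, wasted work = 500.]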
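A software model can make the commit-time behavior of the PVB concrete. The sketch below is a guess at one workable policy, not the hardware design: it aggregates repeat violations that share a read PC and object address, and when the buffer is full it replaces the cheapest entry only if the new violation wasted more work. `PVB_ENTRIES`, `violation_t`, `pvb_t`, and `pvb_record` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PVB_ENTRIES 4  /* illustrative size, not the paper's design point */

typedef struct {
    uintptr_t     read_pc;  /* PC of the load that read stale data */
    uintptr_t     addr;     /* object address that caused the violation */
    unsigned long wasted;   /* accumulated wasted work, in cycles */
    unsigned      count;    /* occurrences; 0 marks a free slot */
} violation_t;

typedef struct {
    violation_t e[PVB_ENTRIES];
} pvb_t;

/* Record one violation when its transaction finally commits. */
void pvb_record(pvb_t *pvb, uintptr_t read_pc, uintptr_t addr,
                unsigned long wasted)
{
    size_t victim = 0;
    assert(pvb != NULL);

    /* 1. Aggregate with an existing entry for the same location. */
    for (size_t i = 0; i < PVB_ENTRIES; i++) {
        violation_t *v = &pvb->e[i];
        if (v->count > 0 && v->read_pc == read_pc && v->addr == addr) {
            v->wasted += wasted;
            v->count++;
            return;
        }
    }

    /* 2. Pick the first free slot, or else the cheapest occupied one. */
    for (size_t i = 0; i < PVB_ENTRIES; i++) {
        if (pvb->e[i].count == 0) { victim = i; break; }
        if (pvb->e[i].wasted < pvb->e[victim].wasted) victim = i;
    }

    /* 3. Install the entry unless it is cheaper than everything kept,
     *    which filters out infrequent, low-cost violations. */
    if (pvb->e[victim].count == 0 || pvb->e[victim].wasted < wasted) {
        pvb->e[victim].read_pc = read_pc;
        pvb->e[victim].addr    = addr;
        pvb->e[victim].wasted  = wasted;
        pvb->e[victim].count   = 1;
    }
}
```

Because entries are aggregated by location, a violation that repeats on every loop iteration accumulates cost in a single slot instead of filling the buffer.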

SLIDE 13

Example of Interaction with TAPE

Original code:

1: int* data = load_data(); /* input */
2: int i, buckets[101], sum = 0;
3:
4: t_for_n (i = 0; i < 10000; i++; 500) {
5:   sum += data[i];
6:   buckets[data[i]]++;
7: }
8:
9: print_buckets(buckets); /* output */

Violations

Replacement lines shown on the slide:

4: t_for_n (i = 0; i < 10000; i++; 50) {
5:   pSum[TCC_getMyID()] += data[i];
8: for i = 0 to num_procs: sum += pSum[i];
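The privatization of `sum` can be tried in standard C without TCC hardware. The sketch below runs sequentially and only emulates the structure: chunks of iterations are dealt round-robin to processor ids, standing in for `t_for_n` and `TCC_getMyID()`. The function name `privatized_sum`, the round-robin assignment, and `NUM_PROCS` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PROCS 8  /* matches the 8-processor CMP used in the slides */

/* Sum n ints using one private accumulator per emulated processor.
 * Under TCC each chunk would run as a transaction, and the id would
 * come from TCC_getMyID() rather than from the chunk index. */
long privatized_sum(const int *data, size_t n, size_t chunk)
{
    long pSum[NUM_PROCS] = {0};
    long sum = 0;
    assert(chunk > 0);

    for (size_t start = 0; start < n; start += chunk) {
        size_t id  = (start / chunk) % NUM_PROCS;
        size_t end = (start + chunk < n) ? start + chunk : n;
        for (size_t i = start; i < end; i++)
            pSum[id] += data[i];  /* private write: no conflict on sum */
    }

    /* Reduction after the loop, as in replacement line 8. */
    for (size_t p = 0; p < NUM_PROCS; p++)
        sum += pSum[p];
    return sum;
}
```

Each transaction now writes only its own `pSum` slot, so the shared variable `sum` is touched once, in the final reduction, rather than on every iteration.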

SLIDE 14

Evaluation Methodology

  • 8-core CMP processor

Bus interconnected to shared L2 cache
Transactional buffering in private L1 caches (32 Kbytes)
Execution-driven simulation with accurate contention modeling

  • Applications: SPEC2K FP and SPLASH-2 benchmarks

See ASPLOS’04 for transactional programming details

  • Questions

Ease of performance tuning with TAPE?
TAPE buffer size requirements?
TAPE performance overhead?

SLIDE 15

Performance Improvements for 8 Processors

[Figure: execution time per benchmark after each tuning step, broken down into useful, idle, arbitration + commit, and violations. Y-axis: normalized execution time, with the ideal line marked. Tuning steps shown: Initial, Rechunk, Privatize, Split, Unordered. Benchmarks: art, equake, moldyn, radix, swim, tomcatv, mp3d, quicksort, lufact.]

  • At most two tuning steps were required to fully optimize each application
  • The programmer is directed to the source of the bottlenecks in the actual code

SLIDE 16

The Cost of TAPE

  • Low chip area cost

Proposed design point requires less than 5K SRAM bits and 244 CAM bits per core
Less than 1% of overall chip area

  • Low performance impact

Maximum slowdown of only 1.84% (average was 0.28%)
Allows for continuous profiling, even on production runs
Maximum bandwidth usage was 0.11%

  • Memory usage

On average only 1 MB/hr of data generated

SLIDE 17

Conclusions

  • TAPE: a profiling system for transactional applications

Supports easy performance tuning
Complements correctness benefits of transactions

  • Key features

Expressive: tracks all performance bottlenecks
Accurate: identifies bottleneck location in source code
Easy to use: leads to optimal performance in few tuning steps
Low overhead: negligible area & performance cost
Allows for continuous profiling, even on production runs

SLIDE 18

Thanks for listening

http://tcc.stanford.edu