Adaptive Optimization using Hardware Performance Monitors (Master Thesis presentation)



SLIDE 1

Adaptive Optimization using HPM 1/21

Adaptive Optimization using Hardware Performance Monitors
Master Thesis by Mathias Payer

Supervising Professor: Thomas Gross
Supervising Assistant: Florian Schneider

SLIDE 2

  • 1. Summary
  • Tasks:
      • Interface for HPM
      • Optimization of MM (memory management) using HPM information
  • Challenges:
      • Fast sampling & processing
      • Precise sampling
      • Runtime benefit
  • Method:
      • Uses the perfmon2 kernel patch
      • User-space library libpebsi
      • Collector thread in Jikes
      • Changes in memory management
SLIDE 3

Overview

  • 1. Summary / Introduction
  • 2. Preliminaries:
      • Profiling vs. HW Sampling
      • Pentium 4 Sampling (PEBS)
      • Perfmon2 & Jikes RVM
  • 3. Extensions
      • libpebsi
      • Jikes RVM & Collector
  • 4. Evaluation & Benchmarks
  • 5. Application for HW profiling (Hot/Cold GC)

SLIDE 4

  • 2. Profiling vs. Sampling
  • Modern compilers/VMs (may) use two types of information:
  • Profiling:
      • Monitors & traces runtime events
      • Platform independent (written in Java)
      • Data is used by the AOS (OptCompiler)
  • HW sampling:
      • Uses low-level hardware information
      • Direct HW feedback
      • Can be used for (new) optimizations
      • Relatively new field
SLIDE 5

  • 2. Pentium 4 Profiling (PEBS)
  • Pentium 4 offers many (new) Hardware Performance Monitors
  • Supports Precise Event Based Sampling (PEBS)
  • HW takes a sample & saves it in memory; an interrupt is generated on overflow
  • Programmed through special registers, runs in a global context
  • Many events can be sampled:
      • Cache misses (L1 & L2), DTLB misses, memory accesses, arithmetic instructions, ...
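
The sampling scheme on this slide can be illustrated with a toy software model. This is only a sketch under stated assumptions: the class `SamplingModel` and all of its members are hypothetical names, the model is plain Java and does not touch the PMU, and the real hardware buffer handling is more involved. It shows one sample being recorded every `interval` events and software being notified only when the sample buffer is full.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy software model of interval-based sampling; not the real PEBS hardware or any perfmon2 API. */
final class SamplingModel {
    private final long interval;       // monitored events between two samples
    private final int bufferCapacity;  // samples the buffer holds before an interrupt
    private final List<Long> buffer = new ArrayList<>();
    private final List<Long> delivered = new ArrayList<>();
    private long eventsSinceSample = 0;
    private int interrupts = 0;

    SamplingModel(long interval, int bufferCapacity) {
        if (interval <= 0 || bufferCapacity <= 0) {
            throw new IllegalArgumentException("interval and capacity must be positive");
        }
        this.interval = interval;
        this.bufferCapacity = bufferCapacity;
    }

    /** Called once per monitored event, e.g. an L2 cache miss at the given instruction pointer. */
    void onEvent(long instructionPointer) {
        eventsSinceSample++;
        if (eventsSinceSample < interval) {
            return;                      // counter has not reached the interval yet
        }
        eventsSinceSample = 0;
        buffer.add(instructionPointer);  // the sample is stored without involving software
        if (buffer.size() == bufferCapacity) {
            interrupts++;                // buffer full: software is notified and drains it
            delivered.addAll(buffer);
            buffer.clear();
        }
    }

    int interrupts() { return interrupts; }
    int pendingSamples() { return buffer.size(); }
    List<Long> deliveredSamples() { return delivered; }
}
```

With an interval of 10000 and a buffer of four samples, the model interrupts software once per 40000 events, which is why larger intervals and buffers lower the sampling overhead.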

SLIDE 6

  • 2. Perfmon2
  • Fast, precise sampling is needed for effective optimizations.
  • Many different kernel extensions exist; most are obsolete, outdated, or no longer maintained.
  • Perfmon2 combines a low-level kernel interface with a high-level user library.
  • Supports virtualization, access restrictions, PEBS & randomization.
SLIDE 7

  • 2. Jikes RVM
  • The Jikes Research Virtual Machine is a complete OO Java VM.
  • Used for implementations of new ideas, GCs and optimizations.
  • The Adaptive Optimization System uses profiling to decide which methods need recompilation at a higher optimization level.

  • HPMs are not yet used for additional information.
  • A Pebsi thread runs inside Jikes to collect and process samples.
SLIDE 8

  • 3. libpebsi
  • libpebsi directly accesses the PMU (reads/writes the PMC & PMD registers)
  • Offers a simple interface to PEBS (event, interval, buffer)
  • Bindings for C, C++ and JNI available
  • Written as a redistributable library, independent of Jikes
  • Language independent
SLIDE 9

  • 3. PEBS Control-Flow

[Diagram components: CPU, Linux Kernel & Perfmon2 Module, libpebsi, Jikes RVM (including PEBS Thread); three sample buffers (Buff)]

  • 1. Jikes loads & initializes libpebsi
  • 2. libpebsi initializes perfmon2
  • 3. perfmon2 initializes the buffer & hardware

SLIDE 10

  • 3. PEBS Data-Flow

[Diagram components: CPU, Linux Kernel & Perfmon2 Module, libpebsi, Jikes RVM (including PEBS Thread); three sample buffers (Buff)]

Hardware side:

  • 1. The CPU copies samples autonomously

Software side:

  • 1. Jikes polls libpebsi, which polls perfmon2
  • 2. Samples are copied from kernel space into libpebsi
  • 3. libpebsi copies the samples into Jikes
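
The software side of this data flow can be sketched as a polling loop that copies samples in batches from a lower layer into a local buffer. Everything here is an assumption made for illustration: `SampleSource` is a hypothetical stand-in for the libpebsi binding (whose real signatures are not shown on the slides), `QueueSource` fakes the kernel side, and `PollingCollector` is not the Jikes implementation.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical stand-in for the library layer: copies pending samples into dest, returns the count. */
interface SampleSource {
    int poll(long[] dest);
}

/** In-memory fake of the lower layers, used only to exercise the polling loop. */
final class QueueSource implements SampleSource {
    private final Queue<Long> pending = new ArrayDeque<>();

    void produce(long instructionPointer) { pending.add(instructionPointer); }

    @Override
    public int poll(long[] dest) {
        int n = 0;
        while (n < dest.length && !pending.isEmpty()) {
            dest[n++] = pending.remove();  // copy one sample into the caller's buffer
        }
        return n;
    }
}

/** Collector side: drains the source in batches into its own buffer and processes each sample. */
final class PollingCollector {
    private final SampleSource source;
    private final long[] localBuffer = new long[64];
    private long processed = 0;

    PollingCollector(SampleSource source) { this.source = source; }

    /** One polling round: keeps copying batches until the source reports no more samples. */
    int pollOnce() {
        int total = 0;
        int n;
        while ((n = source.poll(localBuffer)) > 0) {
            for (int i = 0; i < n; i++) {
                handle(localBuffer[i]);
            }
            total += n;
        }
        return total;
    }

    private void handle(long instructionPointer) {
        processed++;  // placeholder for the per-sample analysis
    }

    long processed() { return processed; }
}
```

Copying in fixed-size batches keeps each call short, at the cost of several calls when many samples are pending.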

SLIDE 11

  • 3. Jikes Collector Thread
  • Polls libpebsi for new samples
      • Each sample contains EFLAGS, EIP, EAX, EBX, ECX, EDX, ESI, EDI, EBP & ESP
  • Maps the IP to the corresponding compiled method
  • Analyzes the bytecode instruction & gathers information
  • Saves additional statistics if selected
      • Analyzes field references
      • Analyzes method references
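
Mapping a sampled instruction pointer to a compiled method amounts to a range lookup over code addresses. The sketch below shows one common way to do this with a sorted map. It is an assumption, not the Jikes RVM mechanism: `CodeMap`, `register` and `lookup` are hypothetical names, and methods are identified by plain strings.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical lookup table from code address ranges to method names. */
final class CodeMap {
    private static final class Range {
        final long end;        // exclusive end address of the compiled code
        final String method;

        Range(long end, String method) {
            this.end = end;
            this.method = method;
        }
    }

    // Keyed by the start address of each compiled method.
    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    void register(long start, long end, String method) {
        byStart.put(start, new Range(end, method));
    }

    /** Returns the method whose [start, end) range contains ip, or null if none does. */
    String lookup(long ip) {
        Map.Entry<Long, Range> entry = byStart.floorEntry(ip);  // greatest start <= ip
        if (entry == null || ip >= entry.getValue().end) {
            return null;  // ip lies before the first method or in a gap between methods
        }
        return entry.getValue().method;
    }
}
```

The lookup is logarithmic in the number of registered methods, which matters because it runs once per sample.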
SLIDE 12

  • 4. Benchmarks
  • Typical benchmarks with high memory and GC activity are used:
      • spec MTRT: Concurrent raytracer with two threads
      • spec JACK: Java parser generator & lexical analyzer
      • DaCapo ANTLR: Parser and lexer for grammar files
      • DaCapo FOP: XML to PDF transformation using XSL-FO
      • DaCapo HSQLDB: JDBC in-memory DB with transactions
      • DaCapo JYTHON: Python interpreter in Java
      • DaCapo PS: PostScript interpreter
      • pseudo JBB: spec pseudo JBB transactional DB

SLIDE 13

  • 4. Overhead Benchmarks
  • Overhead (L2 cache miss) & number of processed samples per second
  • In detail for pseudo spec JBB, and for all other benchmarks at interval 10000

pseudo spec JBB:

Interval   Overhead   Samples / Sec
5000       2.85%      844.24
10000      1.59%      451.74
15000      1.40%      311.36
25000      0.54%      201.6
50000      0.73%      95.63
100000     0.62%      48.96

Other benchmarks (interval 10000):

Benchmark       Overhead   Samples / Sec
spec MTRT       0.55%      103.06
spec JACK       0.39%      82.05
DaCapo ANTLR    -0.22%     132.96
DaCapo FOP      2.24%      305.19
DaCapo HSQLDB   6.89%      139.31
DaCapo JYTHON   0.87%      83.21
DaCapo PS       -0.05%     67.01
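
A quick consistency check on the pseudo spec JBB figures: if each processed sample stands for one full interval of monitored events, then interval times samples per second estimates the event rate. That reading is an assumption (dropped samples would lower the estimate), and `SampleRate` is a hypothetical helper. For the six intervals listed, the product stays between roughly 4.2 and 5.0 million events per second.

```java
/** Hypothetical helper for the arithmetic above; the input values come from the slide. */
final class SampleRate {
    /** Estimated monitored events per second, assuming one sample per `interval` events. */
    static double eventsPerSecond(long interval, double samplesPerSecond) {
        return interval * samplesPerSecond;
    }

    public static void main(String[] args) {
        long[] intervals = {5000, 10000, 15000, 25000, 50000, 100000};
        double[] samplesPerSecond = {844.24, 451.74, 311.36, 201.6, 95.63, 48.96};
        for (int i = 0; i < intervals.length; i++) {
            System.out.printf("interval %d: %.0f events/s%n",
                    intervals[i], eventsPerSecond(intervals[i], samplesPerSecond[i]));
        }
    }
}
```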

SLIDE 14

  • 4. Benchmark status

  • Bytecode distribution of second level cache misses:

[Chart, 0% to 100% per benchmark: spec MTRT, spec JACK, DaCapo ANTLR, DaCapo FOP, DaCapo HSQLDB, DaCapo JYTHON, DaCapo PS, pseudo specJBB. Legend: other BC, obj. hdr. acc., heap, array, stack, interf MR, virtual MR, special MR, static MR, reference FR, primitive FR, static FR]
SLIDE 15

  • 5. Application for HW Sampling
  • The HW information is used in an extended garbage collector
  • Objects are handled differently if they are frequently used
  • A special memory space is reserved only for hot objects
  • Hotness depends on a variable threshold that is adjusted during runtime
  • Analyzes field references and reorders hot fields
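
The slides do not say how the hotness threshold is adjusted, so the following is only one possible policy, shown to make the idea concrete. All names (`HotnessTracker`, `hotSpaceCapacity`, `adjust`) are hypothetical, objects are identified by strings, and the policy itself is an assumption: pick the smallest threshold at which the hot objects still fit into the reserved space.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical hotness bookkeeping; the adjustment policy is an assumption, not the thesis design. */
final class HotnessTracker {
    private final Map<String, Integer> heat = new HashMap<>();
    private final int hotSpaceCapacity;  // how many objects the hot space may hold
    private int threshold;               // samples needed for an object to count as hot

    HotnessTracker(int initialThreshold, int hotSpaceCapacity) {
        if (initialThreshold < 1 || hotSpaceCapacity < 0) {
            throw new IllegalArgumentException("threshold must be >= 1, capacity >= 0");
        }
        this.threshold = initialThreshold;
        this.hotSpaceCapacity = hotSpaceCapacity;
    }

    /** Records one hardware sample that hit the given object. */
    void recordSample(String objectId) {
        heat.merge(objectId, 1, Integer::sum);
    }

    boolean isHot(String objectId) {
        return heat.getOrDefault(objectId, 0) >= threshold;
    }

    int threshold() { return threshold; }

    private long hotCountAt(int candidate) {
        return heat.values().stream().filter(h -> h >= candidate).count();
    }

    /** Moves the threshold to the smallest value (at least 1) at which the hot objects fit. */
    void adjust() {
        while (hotCountAt(threshold) > hotSpaceCapacity) {
            threshold++;  // too many hot objects: be stricter
        }
        while (threshold > 1 && hotCountAt(threshold - 1) <= hotSpaceCapacity) {
            threshold--;  // room to spare: be more lenient
        }
    }
}
```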
SLIDE 16

  • 5. Hot Cold Garbage Collector

[Diagram: two heap layouts]

Standard generational garbage collector: Nursery, Mature Space

Hot Cold garbage collector with a hot copy space: Nursery, Copy Space 0, Copy Space 1, Mark & Sweep

SLIDE 17

  • 5. Hot Scanning Algorithm

    // Scan the reference fields in hotness order instead of declaration order.
    for (int i = 0; i < NrReferences(type); i++) {
        Address slot = type.getSlot(object, type.pebsiFieldOrder[i]);
        trace.traceObjectLocation(slot);
    }

    class Foo { Bar a, b; X d; X e; X f; Bar c; }

    pebsiFieldOrder:   d    a    b    c    e    f
    Heat:            150   60   60   20   10    5
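
A field order like the one in the Foo example can be derived by sorting field indices by descending heat. The sketch below is an assumption about how such an order might be computed: `FieldOrder` and `byDescendingHeat` are hypothetical names, and the indices refer to declaration order (a, b, d, e, f, c), which may differ from what `pebsiFieldOrder` stores in Jikes.

```java
import java.util.Arrays;

/** Hypothetical helper that orders field indices from hottest to coldest. */
final class FieldOrder {
    /** Returns field indices sorted by descending heat; ties keep declaration order. */
    static int[] byDescendingHeat(int[] heat) {
        Integer[] order = new Integer[heat.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on an object array is stable, so equally hot fields stay in declared order.
        Arrays.sort(order, (x, y) -> Integer.compare(heat[y], heat[x]));
        int[] result = new int[order.length];
        for (int i = 0; i < result.length; i++) {
            result[i] = order[i];
        }
        return result;
    }

    public static void main(String[] args) {
        String[] names = {"a", "b", "d", "e", "f", "c"};  // declaration order of class Foo
        int[] heat = {60, 60, 150, 10, 5, 20};            // heat values from the slide
        for (int index : byDescendingHeat(heat)) {
            System.out.print(names[index] + " ");
        }
        System.out.println();
    }
}
```

For the heat values on the slide this yields the order d, a, b, c, e, f, so the fields scanned first are the ones most likely to be touched again soon.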

SLIDE 18

  • 5. HC GC Benchmarks

Benchmark       Runtime [s]   Total # Samples
spec MTRT       15.9          900.5
spec JACK       21.52         627.25
DaCapo FOP      38.54         11941.75
DaCapo ANTLR    12.24         2675.25
DaCapo HSQLDB   79.66         2846.25
DaCapo JYTHON   49.31         2087.75
DaCapo PS       27.04         2520.5
specJBB         200.67        63295

[Chart: L2 miss reduction (HCcopyMS), axis 0.00% to 50.00%, for spec MTRT, spec JACK, DaCapo FOP, DaCapo ANTLR, DaCapo HSQLDB, DaCapo JYTHON, DaCapo PS, pseudo specJBB]

[Chart: Runtime benefit & std. deviation (HCcopyMS), axis -3.00% to 3.50%. Series labels: spec MTRT (0.34%), spec JACK (0.08%), DaCapo FOP (0.49%), DaCapo ANTLR (2.56%), DaCapo HSQLDB (6.1%), DaCapo JYTHON (0.34%), DaCapo PS (0.17%), pseudo specJBB (1.17%)]

SLIDE 19

Conclusions

  • Extendable interface for low overhead sampling
  • Useful for offline performance analysis
  • Suited for direct adaptive optimizations (avg. overhead: ~1%)
  • Many events and rich statistics available inside Jikes
  • Easily portable to other VMs/HW interfaces
SLIDE 20

Questions?