Data Placement in Multi-tiered Memory T. Chad Effler 1 , Adam P. - - PowerPoint PPT Presentation

SLIDE 1

On Automated Feedback-Driven Data Placement in Multi-tiered Memory

T. Chad Effler¹, Adam P. Howard¹, Tong Zhou¹, Michael R. Jantz¹, Kshitij A. Doshi², and Prasad A. Kulkarni³

¹ University of Tennessee, {teffler,ahoward,tzhou9,mrjantz}@utk.edu
² Intel Corporation, kshitij.a.doshi@intel.com
³ University of Kansas, kulkarni@ittc.ku.edu

SLIDE 2

The Problem

  • Multi-Tiered Memory Hierarchies
      • Different Capacities
      • Different Performance
  • Cross-Layer Data Management for Heterogeneous Tiers
      • Match Memory Needs Efficiently
      • Optimality
      • Transparency
      • Simplicity
SLIDE 3

Current Solutions

  • Hardware-Managed Caching
      • Inflexible
      • Large Architectural Costs
  • OS-Managed Data Placement
      • Reactive
      • Relies on Non-Standard Hardware
  • Developer-Managed Data Placement
SLIDE 4

Feedback-Driven Data Placement

  • Allocation Site Partitioning
      • Knapsack
      • Hotset
  • Profile-Guided Management
      • Static Arena Allocation
      • Phase-based Arena Allocation
SLIDE 7

Collecting Application Guidance

  • Track by allocation sites
      • Site == call path up to malloc, calloc, etc.
  • Intuition 1: pedigree is a good predictor
  • Intuition 2: profile transferability
  • Profile = {∀S : <peak RSS, #post-cache-accesses>}
  • Partition sites into hot and cold sets
      • Knapsack, Hotset
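The partitioning step can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that "Knapsack" means a 0/1 knapsack that maximizes profiled accesses within the fast tier's capacity, and that "Hotset" means a greedy selection by accesses per unit of peak RSS. The function names, site sizes, access counts, and capacity are all hypothetical.

```python
def knapsack_partition(profile, capacity):
    """Return the hot set chosen by a 0/1 knapsack (assumed formulation).

    profile:  {site: (peak_rss, accesses)}, peak_rss as a non-negative integer
    capacity: fast-tier capacity in the same integer units
    """
    # best[c] = (total accesses, chosen sites) using at most c units of capacity
    best = [(0, frozenset())] * (capacity + 1)
    for site, (size, accesses) in profile.items():
        # Iterate capacity downward so each site is used at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + accesses
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] | {site})
    return set(best[capacity][1])


def hotset_partition(profile, capacity):
    """Return the hot set chosen greedily by access density (assumed formulation)."""
    ranked = sorted(
        profile,
        key=lambda s: profile[s][1] / max(profile[s][0], 1),
        reverse=True,
    )
    hot, used = set(), 0
    for site in ranked:
        size = profile[site][0]
        if used + size <= capacity:  # skip sites that no longer fit
            hot.add(site)
            used += size
    return hot


# Hypothetical profile: site -> (peak RSS, post-cache accesses)
profile = {
    "S1": (4, 10), "S2": (3, 90), "S3": (5, 20),
    "S4": (2, 5), "S5": (6, 15), "S6": (1, 60),
}
hot_knapsack = knapsack_partition(profile, capacity=4)
hot_hotset = hotset_partition(profile, capacity=4)
cold = set(profile) - hot_knapsack
```

With these made-up numbers both strategies select the same hot set. On real profiles they can differ, since the greedy pass never revisits an earlier choice.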

SLIDE 8

Applying Application Guidance

  • Profile-guided allocation into arenas
      • Static: one arena for each tier

[Figure: static arena allocation. Hardware: CPU 0 with DDR and MCDRAM tiers. Application address space: arena A0 contains sites 2 and 6; arena A1 contains sites 1, 3, 4, and 5. Blocks in the address space, in order: A0, A0, A1, A1, A1, A1.]

Allocation guidance: S1: Cold, S2: Hot, S3: Cold, S4: Cold, S5: Cold, S6: Hot
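The static scheme on this slide can be sketched as a lookup from allocation site to arena. The guidance values and arena names A0 and A1 come from the slide; the function names and the rule that unprofiled sites default to the cold arena are assumptions made for illustration.

```python
HOT_ARENA, COLD_ARENA = "A0", "A1"

# Allocation guidance from the slide: site number -> classification
guidance = {1: "Cold", 2: "Hot", 3: "Cold", 4: "Cold", 5: "Cold", 6: "Hot"}


def arena_for_site(site, guidance):
    """Pick the arena for an allocation made at `site`.

    Sites missing from the guidance fall back to the cold arena (assumption).
    """
    return HOT_ARENA if guidance.get(site) == "Hot" else COLD_ARENA


def arena_contents(guidance):
    """Group the profiled sites by the arena they allocate from."""
    contents = {HOT_ARENA: [], COLD_ARENA: []}
    for site in sorted(guidance):
        contents[arena_for_site(site, guidance)].append(site)
    return contents


contents = arena_contents(guidance)
# contents == {"A0": [2, 6], "A1": [1, 3, 4, 5]}
```

Because each arena is a contiguous region of the address space, the choice of backing tier can be made once per arena rather than once per object.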

SLIDE 9

Applying Application Guidance

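  • Profile-guided allocation into arenas
      • Per-phase: one arena for each phase signature

[Figure: per-phase arena allocation, phase 0. Hardware: CPU 0 with DDR and MCDRAM tiers. Arenas by phase signature: A0: 10001, A1: 01111, A2: 11010, A3: 10100. Blocks in the application address space, in order: A1, A2, A0, A0, A2, A3.]

Allocation guidance (phase signature per site): S1: 10001, S2: 01111, S3: 10001, S4: 11010, S5: 10100, S6: 11010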

SLIDE 10

Applying Application Guidance

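  • Profile-guided allocation into arenas
      • Per-phase: one arena for each phase signature

[Figure: per-phase arena allocation, phase 1. Hardware: CPU 0 with DDR and MCDRAM tiers. Arenas by phase signature: A0: 10001, A1: 01111, A2: 11010, A3: 10100. Blocks in the application address space, in order: A1, A2, A0, A0, A2, A3.]

Allocation guidance (phase signature per site): S1: 10001, S2: 01111, S3: 10001, S4: 11010, S5: 10100, S6: 11010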
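The per-phase scheme can be sketched by grouping sites that share a phase signature into one arena. The signatures and arena names below are taken from the slides. The reading of each signature as one bit per phase, with "1" meaning hot in that phase, is an assumption, as are the function names.

```python
# Phase signature per site, as listed in the allocation guidance
signatures = {
    "S1": "10001", "S2": "01111", "S3": "10001",
    "S4": "11010", "S5": "10100", "S6": "11010",
}

# One arena per distinct signature, using the arena labels from the slides
arenas = {"10001": "A0", "01111": "A1", "11010": "A2", "10100": "A3"}


def sites_by_arena(signatures, arenas):
    """Group sites by the arena that matches their phase signature."""
    grouped = {}
    for site, sig in signatures.items():
        grouped.setdefault(arenas[sig], []).append(site)
    return grouped


def hot_arenas(arenas, phase):
    """Arenas whose signature marks them hot in `phase` (assumed bit meaning)."""
    return sorted(a for sig, a in arenas.items() if sig[phase] == "1")


grouped = sites_by_arena(signatures, arenas)
phase0_hot = hot_arenas(arenas, 0)
phase1_hot = hot_arenas(arenas, 1)
```

Under that reading, a phase change only requires moving whole arenas between tiers, because every site in an arena has the same hot or cold status in every phase.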

SLIDE 11

Simulation Framework

  • Marena – arena allocation library
  • Memtracer – Pin-based instrumentation tool
  • Ramulator – cycle-accurate DRAM simulator
SLIDE 12

Marena

  • Arena-based allocator
  • Built on jemalloc
  • Allocation-site guidance
SLIDE 13

Memtracer

  • Pin-based instrumentation tool
  • Profiling
  • Generates trace files
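As a rough illustration of the profiling step, the sketch below folds a trace into the per-site profile of peak RSS and post-cache access count described earlier in the deck. It does not reflect Memtracer's actual trace format: the event tuples are hypothetical, peak RSS is approximated by peak live bytes, and every access event is assumed to have already missed the cache.

```python
def build_profile(events):
    """Aggregate trace events into {site: (peak_bytes, accesses)}.

    Hypothetical event shapes:
      ("alloc", site, addr, size), ("free", addr), ("access", addr)
    """
    live = {}     # base address -> (site, size) for live allocations
    current = {}  # site -> bytes currently live
    profile = {}  # site -> [peak live bytes, access count]
    for event in events:
        kind = event[0]
        if kind == "alloc":
            _, site, addr, size = event
            live[addr] = (site, size)
            current[site] = current.get(site, 0) + size
            entry = profile.setdefault(site, [0, 0])
            entry[0] = max(entry[0], current[site])
        elif kind == "free":
            _, addr = event
            site, size = live.pop(addr)
            current[site] -= size
        elif kind == "access":
            _, addr = event
            # Linear scan keeps the sketch short; a real tool would use
            # an interval structure for address-to-allocation lookup.
            for base, (site, size) in live.items():
                if base <= addr < base + size:
                    profile[site][1] += 1
                    break
    return {site: tuple(entry) for site, entry in profile.items()}


# Hypothetical trace
trace = [
    ("alloc", "S1", 0x1000, 64),
    ("access", 0x1010),
    ("alloc", "S1", 0x2000, 64),
    ("free", 0x1000),
    ("alloc", "S2", 0x3000, 128),
    ("access", 0x3004),
    ("access", 0x3008),
    ("access", 0x9000),  # not inside any live allocation, so ignored
]
site_profile = build_profile(trace)
```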
SLIDE 14

Ramulator

  • Memory simulator
  • Trace-based, cycle-accurate modeling
  • Modified to support tiered memory simulation
SLIDE 15

Framework

[Figure: simulation framework. Components: Pin, the executable, the arena allocator with malloc() and free(), the instruction trace, and Ramulator with a CPU, a memory controller, HBM, and DDR. Labels: memory usage guidance (twice), allocation guidance.]

SLIDE 16

Evaluation

  • Benchmarks: SPEC CPU2006
  • Multi-tier configuration:
      • HBM-DDR
      • HBM capacity = 12.5% of upper-level memory

SLIDE 18

Evaluation

  • Baseline Modes
      • Cache mode
      • Static First Touch (FT)
      • HBM/DDR only
  • Feedback-Guidance-Directed Modes
      • Static Train
      • Static Ref
      • Adaptive Ref
  • Reactive Profiling
      • First Touch Hot Page (FTHP)
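For orientation, a first-touch baseline can be sketched as below. The slides do not define the policy, so this assumes the common meaning: a page is placed in HBM the first time it is touched, until HBM is full, after which new pages go to DDR and nothing migrates. The function name and page numbers are hypothetical.

```python
def first_touch_placement(page_accesses, hbm_pages):
    """Assign each page to a tier in order of first touch (assumed policy).

    page_accesses: iterable of page numbers in access order
    hbm_pages:     number of pages that fit in HBM
    """
    placement = {}
    used = 0
    for page in page_accesses:
        if page in placement:
            continue  # static policy: placement never changes after first touch
        if used < hbm_pages:
            placement[page] = "HBM"
            used += 1
        else:
            placement[page] = "DDR"
    return placement


placement = first_touch_placement([3, 1, 3, 2, 1, 4], hbm_pages=2)
# placement == {3: "HBM", 1: "HBM", 2: "DDR", 4: "DDR"}
```

Such a policy uses no profile information, which is what makes it a useful point of comparison for the guided modes.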
SLIDE 19

[Figure: HBM-DDR3: baseline performance. Y-axis: IPC relative to DDR3-only (0.0 to 4.0). X-axis: benchmarks in two groups. 512 KB cache: bzip2, gcc, mcf, milc, cactusADM, leslie3d, gobmk, soplex, hmmer, GemsFDTD, libquantum, h264ref, lbm, sphinx3, average. 8 MB cache: mcf, milc, cactusADM, leslie3d, soplex, GemsFDTD, libquantum, lbm, average. Series: cache-mode, static-FT, HBM-only.]

SLIDE 20

[Figure: Performance of static guidance strategies. Y-axis: IPC relative to DDR3-only (0.0 to 3.0). X-axis: benchmarks in two groups. 512 KB cache: bzip2, gcc, mcf, milc, cactusADM, leslie3d, gobmk, soplex, hmmer, GemsFDTD, libquantum, h264ref, lbm, sphinx3, average. 8 MB cache: mcf, milc, cactusADM, leslie3d, soplex, GemsFDTD, libquantum, lbm, average. Series: static-FT, static-train, static-ref.]

SLIDE 21

[Figure: Performance of static and adaptive policies. Y-axis: IPC relative to DDR3-only (0.0 to 3.0). X-axis: benchmarks in two groups. 512 KB cache: bzip2, gcc, mcf, milc, cactusADM, leslie3d, gobmk, soplex, hmmer, GemsFDTD, libquantum, h264ref, lbm, sphinx3, average. 8 MB cache: mcf, milc, cactusADM, leslie3d, soplex, GemsFDTD, libquantum, lbm, average. Series: static-ref, adaptive-ref, FTHP.]
