Machine Learning for Fine-Grained Hardware Prefetcher Control

SLIDE 1

Machine Learning for Fine-Grained Hardware Prefetcher Control

Jason Hiebel jshiebel@mtu.edu Laura E. Brown lebrown@mtu.edu Zhenlin Wang zlwang@mtu.edu

Department of Computer Science Michigan Technological University

International Conference on Parallel Processing August 2019

Hiebel, Brown, Wang ICPP ’19 21

SLIDE 2

◮ Hardware Prefetching
◮ The Contextual Bandit Problem
◮ Experimental Design
◮ Results
◮ Conclusion

SLIDE 3

Intel Hardware Prefetchers

L2 Cache

◮ DPL (Data Prefetch Logic): ascending/descending stream prefetcher
◮ ACL (Adjacent Cache Line): spatial prefetcher

L1 Cache

◮ DCU (Data Cache Unit): ascending stream prefetcher
◮ DCU IP (Data Cache Unit Instruction Pointer): ascending/descending stride prefetcher
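As a hedged illustration of what per-prefetcher control can look like in software: Intel has documented, for a number of its processors, a model-specific register (0x1A4) in which bits 0 to 3 disable the four prefetchers listed above when set. The slides do not say which mechanism the authors used, so the register address, the bit layout, and every name below are assumptions to verify against Intel's documentation for a given processor. The sketch only computes the register value and never touches hardware.

```python
# Assumed bit layout of MSR 0x1A4 (a set bit means the prefetcher is DISABLED).
# Verify against Intel documentation for the target processor before use.
PREFETCHER_BITS = {
    "DPL": 0,     # L2 hardware (stream) prefetcher
    "ACL": 1,     # L2 adjacent cache line prefetcher
    "DCU": 2,     # L1 data cache unit prefetcher
    "DCU_IP": 3,  # L1 data cache unit instruction-pointer (stride) prefetcher
}


def disable_mask(disabled):
    """Return the 4-bit value that disables the named prefetchers."""
    mask = 0
    for name in disabled:
        mask |= 1 << PREFETCHER_BITS[name]
    return mask


if __name__ == "__main__":
    print(hex(disable_mask(["DPL"])))          # only DPL disabled
    print(hex(disable_mask(PREFETCHER_BITS)))  # all four disabled
```

Because each prefetcher maps to its own bit, any of the sixteen per-core combinations is a single small integer.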

SLIDE 4

Resource Contention

◮ Increased usage and contention of shared resources

  ◮ last level cache
  ◮ off-chip memory bandwidth

◮ Adverse performance in (some) multi-tenant workloads!

Average Performance (IPC) and Speedup

Core   Benchmark   DPL Enabled   DPL Disabled   Speedup
1      xalancbmk   0.45          0.54           20%
2      fotonik3d   0.73          0.56           −23%
3      lbm         0.79          0.75           −5%
4      omnetpp     0.18          0.30           67%
Average                                         15%
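The speedup figures can be reproduced from the two IPC columns, reading speedup as the IPC with DPL disabled relative to the IPC with DPL enabled (an interpretation that matches the numbers shown). A minimal Python check using only the values from the slide:

```python
# IPC values from the slide: (DPL enabled, DPL disabled) per benchmark.
ipc = {
    "xalancbmk": (0.45, 0.54),
    "fotonik3d": (0.73, 0.56),
    "lbm": (0.79, 0.75),
    "omnetpp": (0.18, 0.30),
}

# Speedup of disabling DPL, relative to leaving it enabled.
speedup = {name: off / on for name, (on, off) in ipc.items()}
average = sum(speedup.values()) / len(speedup)

for name, s in speedup.items():
    print(f"{name:10s} {100 * (s - 1):+.0f}%")
print(f"{'average':10s} {100 * (average - 1):+.0f}%")
```

Averaging the four ratios gives roughly 15%, so the workload as a whole benefits from disabling DPL even though two of the four benchmarks slow down.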

SLIDE 5

Optimizing Prefetcher Usage

Static Recommendations (Liao et al., SC ’09; Rahman et al., HPCC ’15)

◮ Recommend workload-specific configuration
◮ Requires prior profiling/evaluation of workload

Dynamic Optimization (Jiménez et al., PACT ’12)

◮ Periodically test configuration performance and exploit
◮ Requires enumerating configurations of interest
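The dynamic approach amounts to an explore-then-exploit loop. The following Python sketch is a generic illustration under assumed names (`measure_ipc`, the round and interval counts), not the algorithm from the cited paper: each enumerated configuration is sampled for one interval, then the best one runs for a longer stretch.

```python
def explore_then_exploit(configs, measure_ipc, rounds=3, exploit_intervals=10):
    """Return the configuration chosen in each round.

    configs: the enumerated configurations of interest.
    measure_ipc: hypothetical callback that runs one interval under a
        configuration and returns the observed IPC.
    """
    chosen = []
    for _ in range(rounds):
        # Exploration: try every configuration for one interval.
        scores = {cfg: measure_ipc(cfg) for cfg in configs}
        best = max(scores, key=scores.get)
        # Exploitation: keep the best configuration for several intervals.
        for _ in range(exploit_intervals):
            measure_ipc(best)
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Toy stand-in for hardware: a fixed IPC per configuration.
    fake_ipc = {"enabled": 0.45, "disabled": 0.54}
    print(explore_then_exploit(list(fake_ipc), fake_ipc.get))
```

The exploration cost grows with the number of configurations, which is why having to enumerate them is listed as a requirement.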

SLIDE 6

Optimizing Prefetcher Usage

◮ Contextual Bandit model for hardware prefetcher control
◮ Dynamic control using architectural metrics (branch mispredictions, cache misses, memory bandwidth usage, etc.)

◮ Learn model for prefetcher control using random profiling data

SLIDE 8

The Contextual Bandit

Sequential model for decision making with limited feedback

(Langford and Zhang, NIPS ’07; Beygelzimer and Langford, SIGKDD ’09)

For iteration t = 1, 2, ...

1. Observe contextual information (cache, memory behavior)
2. Select an action (hardware prefetcher configuration)
3. Receive reward for the selected action (performance improvement)
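The three steps of each iteration can be sketched in Python. Everything here is a toy stand-in: `observe_context` and `reward_for` are hypothetical placeholders for reading hardware counters and measuring performance, and the policy picks uniformly at random, which corresponds to collecting profiling data rather than to a learned controller.

```python
import random


def run_bandit(observe_context, reward_for, actions, iterations, seed=0):
    """Log (context, action, reward, probability) for each iteration."""
    rng = random.Random(seed)
    log = []
    for t in range(1, iterations + 1):
        context = observe_context(t)          # 1. observe contextual information
        action = rng.choice(actions)          # 2. select an action
        reward = reward_for(context, action)  # 3. reward for that action only
        log.append((context, action, reward, 1.0 / len(actions)))
    return log


if __name__ == "__main__":
    # Toy environment: enabling the prefetcher pays off when the miss rate is low.
    def toy_context(t):
        return {"miss_rate": (t % 10) / 10}

    def toy_reward(context, action):
        good = (action == "enabled") == (context["miss_rate"] < 0.5)
        return 1.0 if good else 0.0

    for row in run_bandit(toy_context, toy_reward, ["enabled", "disabled"], 5):
        print(row)
```

Only the reward of the chosen action is observed; the reward of the other action stays unknown, which is the limited feedback that distinguishes this setting from ordinary supervised learning.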

SLIDE 9

Binary-Offset

(Beygelzimer and Langford, SIGKDD ’09)

Off-policy (offline) method: learns from logged profiling data collected using random actions

◮ Convert from bandit data to weighted classification data
◮ Learn control model using a weighted classifier

(context, action, reward) ⇒ (context, label, weight) ⇒ model
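The conversion from bandit data to weighted classification data can be sketched as follows. This is a minimal reading of the binary offset trick for two actions, not the authors' code: the offset of 0.5 assumes rewards scaled to [0, 1], and the function and parameter names are illustrative. A reward above the offset keeps the chosen action as the label, a reward below it flips the label to the other action, and the weight is the distance from the offset divided by the probability with which the action was chosen.

```python
def binary_offset(samples, offset=0.5):
    """Convert logged bandit samples into weighted classification examples.

    samples: iterable of (context, action, reward, probability), with
        action in {0, 1} and probability the chance the logging policy
        had of choosing that action (0.5 for uniformly random actions).
    offset: assumed reward midpoint; 0.5 suits rewards scaled to [0, 1].
    """
    examples = []
    for context, action, reward, probability in samples:
        gap = reward - offset
        # Above the offset: the chosen action looks good, so it is the label.
        # Below the offset: the other action becomes the label.
        label = action if gap >= 0 else 1 - action
        weight = abs(gap) / probability
        examples.append((context, label, weight))
    return examples


if __name__ == "__main__":
    logged = [((0.1, 0.7), 1, 0.9, 0.5), ((0.4, 0.2), 0, 0.2, 0.5)]
    for example in binary_offset(logged):
        print(example)
```

The resulting examples can be passed to any classifier that accepts per-sample weights; the trained classifier then serves as the control model.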

SLIDE 10

Contextual Information

Challenge: Identify relevant architectural behaviors (independent variables)

Solution:

◮ Performance Monitoring Unit (PMU)
◮ Utilize domain expertise to select a subset of hardware events

Cache: L1D:ALLOCATED_IN_M, L1D:M_EVICT, L1D:REPLACEMENT, L2_LINES_IN:ANY, L3_LAT_CACHE:MISS

Memory Bandwidth: OFFCORE_REQUESTS:DEMAND_DATA_RD

DTLB Misses: DTLB_LOAD_MISSES:WALK_COMPLETED

Blocked Loads: LD_BLOCKS:STORE_FORWARD, LD_BLOCKS:NO_SR, LD_BLOCKS:ALL_BLOCK

Branch Mispredictions: BR_MISP_RETIRED:ALL_BRANCHES
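One plausible way to turn these events into a context vector is to normalize each counter by the number of retired instructions in the sampling interval. The normalization, the function name, and the sample values below are assumptions for illustration; the slides list only the events themselves.

```python
# Event names as listed on the slide, in a fixed feature order.
EVENTS = [
    "L1D:ALLOCATED_IN_M", "L1D:M_EVICT", "L1D:REPLACEMENT",
    "L2_LINES_IN:ANY", "L3_LAT_CACHE:MISS",
    "OFFCORE_REQUESTS:DEMAND_DATA_RD",
    "DTLB_LOAD_MISSES:WALK_COMPLETED",
    "LD_BLOCKS:STORE_FORWARD", "LD_BLOCKS:NO_SR", "LD_BLOCKS:ALL_BLOCK",
    "BR_MISP_RETIRED:ALL_BRANCHES",
]


def context_vector(counts, instructions):
    """Return events per 1000 instructions, in the fixed EVENTS order.

    counts: dict of raw event counts for one sampling interval
        (events missing from the dict count as zero).
    instructions: retired instructions in the same interval.
    """
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return [1000.0 * counts.get(event, 0) / instructions for event in EVENTS]


if __name__ == "__main__":
    sample = {"L1D:REPLACEMENT": 50, "BR_MISP_RETIRED:ALL_BRANCHES": 5}
    print(context_vector(sample, instructions=10_000))
```

Keeping the feature order fixed means the same vector layout is used when logging profiling data and when querying the learned model.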

SLIDE 11

Action Selection

Challenge: Exponential number, $2^{4 \cdot \mathrm{cores}}$, of system-wide configurations

Solution:

◮ Myopic control: separate bandit per-core and per-prefetcher
◮ Binary action sets {enabled, disabled}
◮ System-wide effects as part of the context and reward
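A short calculation shows why the decomposition matters. With four prefetchers per core, each either enabled or disabled, the number of joint configurations doubles with every added prefetcher, while the number of independent binary bandits grows only linearly. The function names are illustrative.

```python
PREFETCHERS_PER_CORE = 4  # DPL, ACL, DCU, DCU IP


def joint_configurations(cores):
    """Number of system-wide on/off configurations: 2^(4 * cores)."""
    return 2 ** (PREFETCHERS_PER_CORE * cores)


def independent_bandits(cores):
    """Number of binary bandits with one per core and per prefetcher."""
    return PREFETCHERS_PER_CORE * cores


if __name__ == "__main__":
    for cores in (4, 8):
        print(cores, joint_configurations(cores), independent_bandits(cores))
```

For a four-core machine this is 65,536 joint configurations against 16 binary decisions, and for eight cores roughly 4.3 billion against 32.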

SLIDE 12

Reward Formulation

Challenge: Maximize average workload speedup,

$$\frac{1}{n} \sum_{i} \frac{\mathrm{IPC}^{\mathrm{conf}}_{i}}{\mathrm{IPC}^{\mathrm{base}}_{i}}$$

Solution:

◮ Extract program phases using performance change-points
◮ Local (i) per-phase speedup vs average: $R_{(i),t}$
◮ Cross-core (i, j) per-phase speedup vs average: $R_{(i,j),t}$
◮ Average system-wide speedup:

$$R_{i,t} = \frac{1}{n} \Big( R_{(i),t} + \sum_{j \neq i} R_{(i,j),t} \Big) - 1$$
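Reading the reward as the mean of the local speedup and the cross-core speedups, minus one, a worked example in Python looks like this. The function name and the sample speedups are illustrative, and the formula follows the slide only as far as it can be recovered from the extracted text.

```python
def system_reward(local_speedup, cross_core_speedups):
    """Average system-wide speedup minus one.

    local_speedup: per-phase speedup of core i itself, R_(i),t.
    cross_core_speedups: per-phase speedups R_(i,j),t observed on every
        other core j while core i's action was in effect.
    """
    n = 1 + len(cross_core_speedups)  # core i plus every other core
    return (local_speedup + sum(cross_core_speedups)) / n - 1.0


if __name__ == "__main__":
    # Core i speeds up by 20%; the three other cores see -5%, 0%, and +5%.
    print(system_reward(1.2, [0.95, 1.0, 1.05]))
```

A positive value means the action helped the system as a whole, and a negative value means the local gain was outweighed by slowdowns on other cores.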

SLIDE 14

Evaluation Environment

CPU                                 Cores   Memory
Sandy Bridge 3.3 GHz Core i5-2500   4       2x2 GB (DDR3-1333)
Kaby Lake 3.6 GHz Core i7-7700      4       4x8 GB (DDR4-2400)
Broadwell 2.1 GHz Xeon E5-2620 v4   8       4x32 GB (DDR4-2400)

◮ Disable turbo-boost
◮ Disable energy saving features
◮ Disable hyper-threading (prefetchers on physical cores)

SLIDE 15

Workload Construction

◮ Generate 60 workloads consisting of four benchmarks
◮ Benchmark suites: SPEC CPU2006, SPEC CPU2017, PARSEC

◮ Determine benchmark sensitivity to each prefetcher (Broadwell)

[Figure: benchmark sensitivity, one panel per prefetcher (DPL, ACL, DCU, DCU IP). Axes: Bandwidth Reduction (0% to 40%) and Speedup (0.6 to 1.0).]

SLIDE 16

Workload Construction

SPEC CPU2006: bwaves, gcc, GemsFDTD, lbm, leslie3d, libquantum, mcf, milc, omnetpp, soplex, wrf, xalancbmk

SPEC CPU2017: bwaves_r, fotonik3d_r, gcc_r, lbm_r, mcf_r, omnetpp_r, roms_r

PARSEC: fluidanimate

[Figure: benchmark sensitivity to the DPL prefetcher. Axes: Bandwidth Reduction (0% to 40%) and Speedup (0.6 to 1.0).]
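As a rough illustration of building 60 four-benchmark workloads from the benchmarks listed on this slide, the sketch below draws four distinct benchmarks uniformly at random for each workload. The slides do not describe the selection procedure, so uniform sampling, the fixed seed, and the function name are all assumptions.

```python
import random

# Benchmarks as listed on the slide, grouped by suite.
BENCHMARKS = {
    "SPEC CPU2006": ["bwaves", "gcc", "GemsFDTD", "lbm", "leslie3d", "libquantum",
                     "mcf", "milc", "omnetpp", "soplex", "wrf", "xalancbmk"],
    "SPEC CPU2017": ["bwaves_r", "fotonik3d_r", "gcc_r", "lbm_r", "mcf_r",
                     "omnetpp_r", "roms_r"],
    "PARSEC": ["fluidanimate"],
}


def sample_workloads(count=60, size=4, seed=0):
    """Draw `count` workloads of `size` distinct benchmarks each (assumed uniform)."""
    rng = random.Random(seed)
    pool = [name for suite in BENCHMARKS.values() for name in suite]
    return [rng.sample(pool, size) for _ in range(count)]


if __name__ == "__main__":
    for workload in sample_workloads()[:3]:
        print(workload)
```

With four benchmarks per workload, each workload fills the four cores of the Sandy Bridge and Kaby Lake machines with one benchmark per core.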

SLIDE 17

DPL Prefetcher Selection

Baselines

◮ DPL Enabled (on all cores)
◮ DPL Disabled (on all cores)
◮ Best Static

Dynamic Policies

◮ Binary-Offset (Ind): create specialized model for each workload
◮ Binary-Offset (X): create general model using X training workloads

SLIDE 19

DPL Prefetcher (Sandy Bridge)

Overview

[Figure: workload speedup (axis 0.9 to 1.2). Legend: DPL Disabled, Binary-Offset (Ind), Binary-Offset (5), Binary-Offset (10).]

SLIDE 20

DPL Prefetcher (Sandy Bridge)

Baseline Performance

[Figure: workload speedup (axis 0.9 to 1.2). Legend: DPL Disabled, Binary-Offset (Ind), Binary-Offset (5), Binary-Offset (10).]

SLIDE 21

DPL Prefetcher (Sandy Bridge)

Outperforming “Best Static” Performance

[Figure: workload speedup (axis 0.9 to 1.2). Legend: DPL Disabled, Binary-Offset (Ind), Binary-Offset (5), Binary-Offset (10).]

SLIDE 22

DPL Prefetcher (Kaby Lake)

Overview

[Figure: workload speedup (axis 0.8 to 1.2). Legend: DPL Disabled, Binary-Offset (Ind), Binary-Offset (5), Binary-Offset (10).]

SLIDE 23

DPL Prefetcher (Kaby Lake)

Training Data Impact

[Figure: workload speedup (axis 0.8 to 1.2). Legend: DPL Disabled, Binary-Offset (Ind), Binary-Offset (5), Binary-Offset (10).]

SLIDE 24

DPL Prefetcher Selection

[Figure: average speedup (axis 0.90 to 1.05) on Sandy Bridge and Kaby Lake for DPL Disabled, Best Static, Binary-Offset (Ind), Binary-Offset (5), and Binary-Offset (10). Bar labels in extraction order: 4.3%, 3.6%, 5.1%, 3.3%, 1.2%, 1.0%, −1.0%, 1.7%, 0.4%, −8.5%.]

SLIDE 26

Conclusion

◮ Contextual bandit model of hardware prefetcher control

  ◮ independent control of hardware prefetchers per core
  ◮ system-wide context, feedback

◮ Binary-Offset policy improves upon static configurations

  ◮ dynamic control
  ◮ general, workload-agnostic configuration control
  ◮ random (non-enumerative) profiling data

SLIDE 28

References I

[1] Alina Beygelzimer and John Langford. The offset tree for learning with partial labels. In Proceedings of 15th International Conference on Knowledge Discovery and Data Mining, KDD ’09, pages 129–38, 2009.

[2] Victor Jiménez, Roberto Gioiosa, Francisco J. Cazorla, Alper Buyuktosunoglu, Pradip Bose, and Francis P. O’Connell. Making data prefetch smarter: Adaptive prefetching on POWER7. In 21st International Conference on Parallel Architectures and Compilation Techniques, PACT ’12, pages 137–146, 2012.

[3] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems 20, NIPS, pages 817–824, 2007.

SLIDE 29

References II

[4] Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, Chinyen Chou, Chiaheng Tu, and Hucheng Zhou. Machine learning-based prefetch optimization for data center applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, pages 1–10, 2009.

[5] Saami Rahman, Martin Burtscher, Ziliang Zong, and Apan Qasem. Maximizing hardware prefetch effectiveness with machine learning. In Proceedings of the 17th International Conference on High Performance Computing and Communications, pages 383–389, 2015.

SLIDE 30

Reward Formulation

[Figure panels: IPC (Core 0), IPC (Core 1).]

SLIDE 31

Reward Formulation

[Figure panels: Speedup (Core 0), Speedup (Core 1).]
