An Imitation Learning Approach for Cache Replacement


  1. An Imitation Learning Approach for Cache Replacement
     Evan Z. Liu, Milad Hashemi, Kevin Swersky, Parthasarathy Ranganathan, Junwhan Ahn

  2. The Need for Faster Compute
     Small cache improvements can make large differences! (Beckman, 2019)
     ● E.g., a 1% cache hit rate improvement → a 35% decrease in latency (Cidon et al., 2016)
     ● Caches are everywhere: CPU chips, operating systems, databases, web applications
     Our goal: faster applications via better cache replacement policies
     (https://openai.com/blog/ai-and-compute/)

  3. TL;DR
     I. We approximate the optimal cache replacement policy by (implicitly) predicting the future
     II. Caching is an attractive benchmark for the general reinforcement learning / imitation learning communities

  4. Cache Replacement
     Goal: evict cache lines so as to maximize cache hits
     [Figure: a cache holding lines {A, B, C} serves the access sequence D, A, C. On the miss for D a line must be evicted (here C), giving Miss, Hit (~100x faster than a miss), Miss.]
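To make the running example concrete, here is a minimal cache simulator in Python; the function names and structure are illustrative, not from the talk:

def simulate(cache, accesses, choose_victim):
    """Return the number of hits when `choose_victim` picks eviction victims."""
    hits = 0
    for t, line in enumerate(accesses):
        if line in cache:
            hits += 1                          # hit: ~100x faster than a miss
        else:
            victim = choose_victim(cache, accesses, t)
            cache[cache.index(victim)] = line  # evict the victim, insert the new line
    return hits

# Evicting B on the miss for D preserves both later hits:
print(simulate(["A", "B", "C"], ["D", "A", "C"], lambda c, a, t: "B"))   # 2 hits
# Always evicting the first slot loses the hit on A:
print(simulate(["A", "B", "C"], ["D", "A", "C"], lambda c, a, t: c[0]))  # 1 hit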

  5. Cache Replacement
     [Figure: the same example. Evicting a line that is about to be reused (e.g., C, accessed again two steps later) is a mistake: it turns a would-be hit into a miss.]

  6. Cache Replacement
     [Figure: the same example. The optimal decision is to evict B, the line reused furthest in the future, preserving the later hits on A and C.]

  7. Cache Replacement
     Reuse distance d_t(line): the number of accesses from access t until the line is reused
     In the example: d_0(A) = 1, d_0(B) > 2, d_0(C) = 2
     Optimal policy (Belady's): evict the line with the greatest reuse distance (Belady, 1966)
     [Figure: the same example, annotated with reuse distances.]
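Belady's policy is simple to state in code. A sketch (mine, not the paper's implementation), reusing the `simulate` helper above:

def reuse_distance(line, accesses, t):
    """d_t(line): number of accesses from access t until `line` is next used."""
    for d, future in enumerate(accesses[t + 1:], start=1):
        if future == line:
            return d
    return float("inf")   # never reused again: the ideal eviction candidate

def belady_victim(cache, accesses, t):
    """Evict the line with the greatest reuse distance (Belady, 1966)."""
    return max(cache, key=lambda line: reuse_distance(line, accesses, t))

# On the slide's example: d_0(A) = 1, d_0(B) = inf, d_0(C) = 2, so B is evicted.
print(simulate(["A", "B", "C"], ["D", "A", "C"], belady_victim))  # 2 hits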

  8. Belady's Requires Future Information
     Reuse distance d_t(line): the number of accesses from access t until the line is reused
     Problem: computing reuse distance requires knowing the future
     So in practice, we use heuristics, e.g.:
     ● Least-recently used (LRU)
     ● Most-recently used (MRU)
     ... but these perform poorly on complex access patterns
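For illustration, a minimal LRU cache in Python (my sketch), along with an access pattern on which it fails badly:

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()          # least-recently used first

    def access(self, line):
        """Return True on a hit; on a miss, evict the least-recently-used line."""
        if line in self.lines:
            self.lines.move_to_end(line)    # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[line] = None
        return False

# A cyclic pattern one line larger than the cache defeats LRU completely:
cache = LRUCache(capacity=3)
print(sum(cache.access(x) for x in "ABCD" * 3))  # 0 hits out of 12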

  9. Leveraging Belady's
     Idea: approximate Belady's from past accesses
     [Figure: training setup. A learned model sees the past accesses and the current access and predicts an eviction decision; Belady's, computed from the logged future accesses, supplies the optimal decision as the training target.]

  10. Prior Work: Hawkeye / Glider
      Current state of the art (Shi et al., '19; Jain et al., '18)
      [Figure: a model trained on Belady's answers a binary question, "is the current line cache-friendly or cache-averse?", from the past accesses and current access; a traditional algorithm then combines these predictions with the current cache state to pick the line to evict.]

  11. Prior Work: Hawkeye / Glider
      + binary classification is relatively easy to learn
      - the traditional algorithm can't express the optimal policy

  12. Our Approach
      Our contribution: directly approximate Belady's via imitation learning
      [Figure: side by side. Current state of the art (Shi et al., '19; Jain et al., '18): a model trained on Belady's classifies the current line as cache-friendly or cache-averse, and a traditional algorithm picks the eviction. Our proposal: a model trained on Belady's directly outputs "evict line X" from the past accesses, current access, and current cache state.]
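Schematically, the training step is plain supervised learning against Belady's decisions. A sketch assuming PyTorch and a hypothetical model that scores each cached line for eviction:

import torch
import torch.nn.functional as F

def imitation_step(model, optimizer, history, cache_lines, belady_choice):
    """One update toward Belady's eviction decision.

    history:       tensor encoding past accesses and the current access
    cache_lines:   tensor encoding the current cache contents
    belady_choice: index of the line Belady's evicts (computed offline)
    """
    scores = model(history, cache_lines)      # one eviction score per cached line
    loss = F.cross_entropy(scores.unsqueeze(0),
                           torch.tensor([belady_choice]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()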

  13. Cache Replacement as a Markov Decision Process
      Similar to Wang et al., 2019
      [Figure: the running example (cache {A, B, C}; accesses D, A, C) recast as an MDP.]

  14. Cache Replacement as a Markov Decision Process
      State: the current cache contents, the past accesses, and the current access
      [Figure: the same example with the state components highlighted.]

  15. Cache Replacement as a Markov Decision Process
      [Figure: the same example, shown again as the MDP walkthrough continues.]

  16. Cache Replacement as a Markov Decision Process
      Action: which cached line to evict on a miss
      [Figure: the same example with the eviction action highlighted.]
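Written down as code, the MDP from these slides might look as follows; the framing follows the slides (and Wang et al., 2019), but the concrete types are my illustration:

from dataclasses import dataclass
from typing import Tuple

@dataclass
class State:
    cache_contents: Tuple[str, ...]  # lines currently in the cache
    past_accesses: Tuple[str, ...]   # history of previously accessed lines
    current_access: str              # the line being requested now

# An action is an index into cache_contents (the line to evict on a miss).
Action = int

# A natural reward is +1 per cache hit, so maximizing expected return
# corresponds to maximizing the cache-hit rate.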

  17. Leveraging the Optimal Policy
      Typical imitation learning setting (Pomerleau, 1991; Ross et al., 2011; Kim et al., 2013): a learned policy is trained to reproduce the (approximate) optimal action in each state
      Observation: not all errors are equally bad
      ● Learning from the optimal policy yields greater training signal
      Concretely: minimize, e.g., a ranking loss
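One common margin-based form of such a ranking loss (a sketch under my assumptions; the paper's exact loss may differ): the line Belady's evicts should out-score every other cached line by a margin, so near-misses are penalized less than confidently wrong rankings:

import torch
import torch.nn.functional as F

def ranking_loss(scores, optimal_idx, margin=1.0):
    """scores: (num_lines,) eviction scores; optimal_idx: Belady's choice."""
    others = torch.cat([scores[:optimal_idx], scores[optimal_idx + 1:]])
    # Hinge penalty whenever a non-optimal line comes within `margin`
    # of the optimal line's score.
    return F.relu(margin - (scores[optimal_idx] - others)).mean()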

  18. Reuse Distance as an Auxiliary Task
      Observation: predicting reuse distance is correlated with cache replacement
      ● Cast this as an auxiliary task (Jaderberg et al., 2016)
      [Figure: the state s_t feeds a shared state embedding, which feeds both the policy and a reuse-distance prediction head; both heads contribute to the loss.]
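A sketch of that architecture in PyTorch (the module shapes, regression target, and loss weighting are my assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyWithAuxiliaryTask(nn.Module):
    def __init__(self, state_dim, hidden_dim, num_lines):
        super().__init__()
        self.encoder = nn.Linear(state_dim, hidden_dim)      # shared state embedding
        self.policy_head = nn.Linear(hidden_dim, num_lines)  # eviction scores
        self.reuse_head = nn.Linear(hidden_dim, num_lines)   # predicted reuse distances

    def forward(self, state):
        h = torch.relu(self.encoder(state))
        return self.policy_head(h), self.reuse_head(h)

def total_loss(model, state, optimal_idx, true_reuse, aux_weight=0.5):
    scores, predicted_reuse = model(state)
    policy_loss = F.cross_entropy(scores.unsqueeze(0),
                                  torch.tensor([optimal_idx]))
    aux_loss = F.mse_loss(predicted_reuse, true_reuse)  # auxiliary regression
    return policy_loss + aux_weight * aux_loss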

  19. Results
      ● ~19% cache-hit rate increase over Glider (Shi et al., 2019) on memory-intensive SPEC2006 applications (Jaleel et al., 2009)
      ● ~64% cache-hit rate increase over LRU on Google Web Search
      [Figure: per-benchmark cache-hit rates, with the LRU and optimal (Belady's) hit rates marked for reference.]

  20. A Note on Practicality
      This work: establish a proof of concept
      Per-byte address embedding (e.g., address 0x 12 C5 A1 → Byte 1, Byte 2, Byte 3, each embedded separately, then combined by a linear layer):
      ● Reduces embedding size from 100MB to <10KB
      ● ~6% cache-hit rate increase on SPEC2006 vs. Glider
      ● ~59% cache-hit rate increase on Google Web Search vs. LRU
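A sketch of such a per-byte embedding in PyTorch (the dimensions and byte count are illustrative, chosen by me): each byte indexes a small shared table of 256 embeddings, instead of the full address indexing one enormous table:

import torch
import torch.nn as nn

class PerByteAddressEmbedding(nn.Module):
    def __init__(self, num_bytes=8, dim=16):
        super().__init__()
        self.byte_embedding = nn.Embedding(256, dim)  # 256 rows, shared across byte positions
        self.merge = nn.Linear(num_bytes * dim, dim)
        self.num_bytes = num_bytes

    def forward(self, addresses):
        # addresses: (batch,) integer memory addresses
        shifts = 8 * torch.arange(self.num_bytes)
        byte_ids = (addresses.unsqueeze(-1) >> shifts) & 0xFF  # split into bytes
        embedded = self.byte_embedding(byte_ids)               # (batch, num_bytes, dim)
        return self.merge(embedded.flatten(start_dim=1))       # (batch, dim)

emb = PerByteAddressEmbedding()
print(emb(torch.tensor([0x12C5A1])).shape)  # torch.Size([1, 16])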

  21. A Note on Practicality
      [Build slide: repeats the per-byte address embedding content above.]
      Future work: production-ready learned policies
      ● Smaller models via distillation (Hinton et al., 2015), pruning (Janowsky, 1989; Han et al., 2015; Sze et al., 2017), or quantization
      ● Target domains with longer latency and larger caches (e.g., software caches)

  22. A New Imitation / Reinforcement Learning Benchmark
      ● Simulated games and control (Bellemare et al., 2012; Lillicrap et al., 2015; Silver et al., 2017; OpenAI, 2019; Vinyals et al., 2019): + plentiful data, - delayed real-world utility
      ● Real-world tasks such as robotics (Levine et al., 2016): - limited / expensive data, + immediate real-world impact
      ● Cache replacement: + plentiful data, + immediate real-world impact
      Open-source cache replacement Gym environment coming soon!
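The slide only announces the environment, so the interface below is entirely my guess at what a Gym-style cache replacement environment could look like, written as a tiny self-contained sketch:

import random

class CacheReplacementEnv:
    """Toy Gym-style environment: observe the cache and the current access;
    on a miss, the action picks the slot to evict. Reward is +1 per hit."""

    def __init__(self, capacity=3, trace=None):
        self.capacity = capacity
        self.trace = trace or [random.choice("ABCDE") for _ in range(100)]

    def reset(self):
        self.cache, self.t = [], 0
        return self._obs()

    def _obs(self):
        return {"cache": tuple(self.cache), "access": self.trace[self.t]}

    def step(self, action):
        line = self.trace[self.t]
        if line in self.cache:
            reward = 1.0                   # hit: the action is ignored
        else:
            reward = 0.0                   # miss: evict the chosen slot
            if len(self.cache) < self.capacity:
                self.cache.append(line)
            else:
                self.cache[action] = line
        self.t += 1
        done = self.t >= len(self.trace)
        return (None if done else self._obs()), reward, done, {}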

  23. Takeaways
      ● A new state-of-the-art approach for cache replacement by imitating the oracle policy
        ○ Future work: making this production ready
      ● A new benchmark for imitation learning / reinforcement learning research
