

SLIDE 1

TLB and Pagewalk Performance in Multicore Architectures with Large Die-Stacked DRAM Cache

Adarsh Patil

Adviser: Prof. R Govindarajan

Perspective Seminar 6th Nov 2015

SLIDE 2

Outline

■ Introduction
  – Address Translation: TLBs and Page Walks
  – Die-stacked DRAM caches
■ Objective
■ Experimental Setup
  – Framework
  – Methodology
■ Results
■ Conclusion and Future Work

CSA Perspective Seminar 2 6th Nov 2015

SLIDE 3-6

Computing Trends

■ Software
  – Large memory footprint (*apps in BigDataBench / CloudSuite benchmarks)
  – Virtualization and cloud computing (*Source: VMware)
■ Architectural
  – Multicore / manycore architectures (e.g. Intel Haswell-E & IBM POWER8)
  – Large die-stacked DRAM caches (*Source: Invensas, Tessera)

SLIDE 7

Paged Virtual Memory

■ Virtual address space is divided into "pages"
■ "Page table": an in-memory table, organized as a radix tree, that maps
  virtual to physical addresses and stores meta-information (replacement,
  access privileges, dirty bit, etc.)
■ Page table entries are cached in fast lookup structures called
  Translation Lookaside Buffers (TLBs)
■ The page table has evolved into a 4-level tree to accommodate 48-bit
  virtual addresses

SLIDE 8

Page Table Structure

■ Hierarchical page table
■ 4 memory references for a 4KB page;
  3 memory references for a 2MB superpage
■ Each entry is 8 bytes
■ TLB stores VA-to-PA mappings

Virtual address layout (x86-64, 48-bit VA):

  63:48            47:39        38:30       29:21      20:12       11:0
  Sign extension   PML4 | PL4   PDP | PL3   PD | PL2   PTE | PL1   Page Offset

[Figure: 4-level radix-tree page walk rooted at the CR3 register; an L2 (PD)
entry can map a 2MB data superpage directly, while L1 (PTE) entries map 4KB
data pages.]
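The bit-slicing above can be made concrete with a short sketch. The helper below is illustrative (not from the talk); it splits a canonical 48-bit x86-64 virtual address into the four 9-bit table indices and the 12-bit page offset:

```python
# Illustrative helper: decompose a 48-bit x86-64 virtual address into
# the four page-table indices and the page offset shown in the layout.
def split_va(va):
    offset = va & 0xFFF            # bits 11:0  - page offset
    pl1 = (va >> 12) & 0x1FF       # bits 20:12 - PTE index (PL1)
    pl2 = (va >> 21) & 0x1FF       # bits 29:21 - PD index  (PL2)
    pl3 = (va >> 30) & 0x1FF       # bits 38:30 - PDP index (PL3)
    pl4 = (va >> 39) & 0x1FF       # bits 47:39 - PML4 index (PL4)
    return pl4, pl3, pl2, pl1, offset
```

Each 9-bit index selects one of 512 eight-byte entries, i.e. exactly one 4KB table per level (512 × 8B = 4KB), which is why each walk level costs one memory reference.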

SLIDE 18

Page Table Structure - Virtualization

■ Guest Page Table (gPT)
  – Translates guest virtual to guest physical addresses
  – Set up and modified independently by the guest
■ Nested Page Table (nPT)
  – Translates guest physical to host physical addresses
  – Controlled by the host
■ Up to 24 memory references per page walk
■ TLB stores the end-to-end (guest virtual to host physical) translation
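The "up to 24 references" figure follows from a simple count. The sketch below is an illustrative derivation (not from the talk) assuming 4-level guest and nested page tables:

```python
# Back-of-the-envelope count of why a 2D/nested page walk takes up to
# 24 memory references with 4-level guest and nested page tables.
def nested_walk_refs(guest_levels=4, nested_levels=4):
    # The walk must translate (guest_levels + 1) guest-physical
    # addresses through the nested table: one per guest PTE fetch plus
    # one for the final data address. Each guest PTE fetch is itself a
    # memory reference; the data access is not counted as walk cost.
    return (guest_levels + 1) * nested_levels + guest_levels

print(nested_walk_refs())  # 24
```

(4 + 1) × 4 nested references plus 4 guest PTE fetches gives 24, which is why virtualized page walks benefit so much from MMU/page-walk caches.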

SLIDE 19

Address Translation in Hardware

[Figure: per-core address translation hardware in a 4-core chip. Each core
has split L1 instruction/data TLBs and a superpage L1 TLB backed by a shared
L2 TLB; on a miss, a hardware page walker performs the walk, aided by MMU
caches (page walk caches). The L1 cache is VIPT; the large L2/L3 caches are
PIPT. Latencies noted on the slide: < 4 cycles, 4 cycles, 6/10 cycles, and
180-200 cycles to memory.]

SLIDE 20

TLB-reach & page walk latencies

■ TLB-reach: the amount of memory that can be accessed through the TLB
  without causing a miss
  – Clustered [HPCA '14] and Coalesced [MICRO '12] TLBs
  – Superpage-friendly TLBs [HPCA '15] using skewed TLBs
  – Shared last-level TLBs [HPCA '11]: evaluates shared TLBs for multicores
  – Direct segment [ISCA '13]: a primary-region abstraction that maps part
    of the virtual address space with segment registers, avoiding paging
    entirely
  – Redundant memory mappings [ISCA '15]: allocates in units called ranges
    (eager paging in the OS), held in a separate range-TLB; compatible with
    traditional paging
■ Speeding up miss handling
  – AMD's Accelerating 2D page walks [ASPLOS '08]: page walk caches for
    virtualization
  – TLB behavior and sensitivity characterized for individual SPEC 2000
    [SIGMETRICS '02] and PARSEC [PACT '09] applications

SLIDE 21

Die-stacked DRAM Cache

■ 50% lower power consumption*
■ 8x bandwidth improvement*
■ 20% lower latency than main memory
■ Large capacities (~GBs)
■ 35% smaller package size*
■ JEDEC Wide-IO / HBM standards

[Figure: cores with 3D-stacked DRAM layers, and a 2.5D organization with
cores and DRAM layers on a silicon interposer.]

*Source: Applied Materials and HMC

SLIDE 22

Die-stacked DRAM caches

■ Metadata overhead - tag storage
  – Loh-Hill cache [MICRO '11]: co-locates tags and data in the same DRAM
    row, with a MissMap; the direct-mapped Alloy cache [MICRO '12] uses a
    TAD (tag-and-data) structure to reduce access latency
  – Bimodal cache [MICRO '14]: caches in both 64B and 512B block sizes,
    using the abundant bandwidth to store metadata in a bank of another
    channel
  – ATCache [PACT '14]: caches hot metadata in on-chip SRAM
■ Piggybacking translation on tag match
  – Tagtables [HPCA '15]: organizes tags as a hierarchical but flipped page
    table
  – Tagless fully-associative DRAM cache [ISCA '15]: translates VA directly
    to a cache address, moving VA-to-PA translation off the critical path
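To see the scale of the metadata problem the designs above attack, a back-of-the-envelope calculation helps. The 4 bytes of tag metadata per block used below is an assumed illustrative figure, not one from the talk:

```python
# Rough tag-storage cost for a die-stacked DRAM cache. The per-block
# tag size is an illustrative assumption.
GB, MB = 1 << 30, 1 << 20

def tag_storage_bytes(cache_bytes, block_bytes, tag_bytes_per_block=4):
    blocks = cache_bytes // block_bytes
    return blocks * tag_bytes_per_block

# A 1GB cache with 64B blocks needs ~64MB of tags - far too much for
# on-chip SRAM, which is why tags end up in the DRAM cache itself.
print(tag_storage_bytes(GB, 64) // MB)    # 64
# Larger 512B blocks (as in Bimodal cache) cut the overhead 8x.
print(tag_storage_bytes(GB, 512) // MB)   # 8
```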

SLIDE 23

Objective

■ Experimentally measure the overheads of paged address translation in the
  x86-64 architecture
  – Determine page walk latency on a TLB miss
  – Assess the effects of caching the page-walk levels in the cache
    hierarchy
  – Quantify the impact of page walks on IPC relative to an ideal TLB /
    address translation
■ Correlate the TLB-reach problem with the size and latency of last-level
  die-stacked caches
  – Count the percentage of accesses where a TLB miss occurs for blocks
    that hit in the large LLC

SLIDE 24

Experimental Framework

■ MARSSx86 - QEMU-based full-system simulator
  – Added shared L2 TLB and superpage TLB structures
  – Page walk handler: reduced page-walk levels for superpages
  – Flat cache hierarchy for the stacked L4 cache
■ Unmodified Linux 2.6.38 with transparent huge pages (THP)
■ Configuration
  – Four 5-way OoO cores, 4GB memory, x86-64 ISA
  – L1 i/d cache private, 32KB, 2/4 cycles; L2 cache private, 256KB,
    6 cycles
  – L3 cache shared, 4MB, 9 cycles
  – Die-stacked L4 cache shared, 64/128/256/512/1024MB, 16-way
    set-associative, 40 cycles
  – L1 i/d TLBs, 64 entries - 4KB pages
  – Shared L2 TLB, 1024 entries - 4KB pages
  – L1 dTLB, 4 entries - 2MB pages
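Under the TLB parameters listed above, the combined TLB-reach can be computed directly; this sketch compares it against the simulated L4 capacities (summing the data-side TLBs this way is my own simplification, not the talk's methodology):

```python
# TLB-reach of the simulated configuration vs. the die-stacked L4
# capacities, using the parameters from the setup slide.
KB, MB = 1 << 10, 1 << 20

reach = (64 * 4 * KB       # L1 dTLB: 64 entries x 4KB pages
         + 1024 * 4 * KB   # shared L2 TLB: 1024 entries x 4KB pages
         + 4 * 2 * MB)     # L1 dTLB: 4 entries x 2MB superpages

print(reach / MB)  # total reach in MB
for l4_mb in (64, 128, 256, 512, 1024):
    # Fraction of each L4 capacity the TLBs can cover without a miss.
    print(l4_mb, round(reach / (l4_mb * MB), 4))
```

Total reach is about 12.25MB, so even the smallest 64MB L4 far exceeds what the TLBs can cover - the crux of the TLB-reach problem studied here.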

SLIDE 25

Experimental Methodology

■ Multi-programmed
  – SPEC CPU 2006 programs
  – High-CPI apps chosen
  – 8 billion instructions fast-forwarded
  – 4 billion cycles in detailed mode
■ Multi-threaded
  – PARSEC, 4 threads
  – 'simlarge' input data set
  – Entire ROI in detailed mode
  – Apps with unbounded and large working sets

SPEC CPU 2006 mixes:
  mix1: milc, mcf, omnetpp, gcc
  mix2: GemsFDTD, leslie3d, dealII, soplex
  mix3: cactusADM, libquantum, tonto, sphinx3
  mix4: lbm, bwaves, zeusmp, sjeng
  mix5: milc, GemsFDTD, cactusADM, lbm
  mix6: mcf, omnetpp, soplex, leslie3d
  mix7: bwaves, astar, zeusmp, gcc
  mix8: gobmk, bzip2, h264ref, hmmer

PARSEC (4 threads): canneal, dedup, ferret, fluidanimate, freqmine, raytrace

SLIDE 26-27

Page Walk Latency

Multi-programmed workloads: average page walk latency ranges from 37 to 147
cycles.

  Page walk latency range (cycles) | Count
  37-74                            | 2
  74-111                           | 1
  111-148                          | 5

SLIDE 28

Page Walk Latency

Multi-threaded workloads: average page walk latency ranges from 18 to 144
cycles.

  Page walk latency range (cycles) | Count
  18-50                            | 3
  50-82                            | 2
  114-146                          | 1

SLIDE 29

Page Walk Levels Locality

■ PL1 has the highest number of memory accesses
■ PL2 has a roughly uniform hit percentage across all cache levels, and most
  of the 8 translations in a cache line are used
■ PL3 sees around 50% of its hits in either the L1 or L2 cache
■ Cache pollution: AMD processors perform the page walk in the L2 cache

Multi-programmed: mix1 to mix8, left to right.
Multi-threaded: canneal, dedup, ferret, fluidanimate, freqmine, raytrace.
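These per-level hit locations translate into walk latency roughly additively. The model below is an illustrative sketch; the per-level latencies follow my reading of the numbers on the earlier hardware slide (L1 ~4, L2 ~6, L3 ~10, memory ~200 cycles), and the hit placements in the examples are assumptions, not measured data:

```python
# Simple additive model of a single page walk's latency: each level's
# PTE fetch costs the latency of wherever in the hierarchy it hits.
LATENCY = {"L1": 4, "L2": 6, "L3": 10, "MEM": 200}

def walk_latency(hits):
    # 'hits' lists where the PL4, PL3, PL2, PL1 entries were found.
    return sum(LATENCY[h] for h in hits)

# Upper levels (PL4/PL3) usually hit high in the hierarchy; a PL1 miss
# that goes to memory dominates the entire walk.
print(walk_latency(["L1", "L1", "L2", "L3"]))   # 24 cycles
print(walk_latency(["L1", "L1", "L2", "MEM"]))  # 214 cycles
```

This is why PL1's poor cache locality matters most: a single PL1 miss to memory costs more than an entire walk served from the cache hierarchy.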

SLIDE 30

IPC Impact due to Page Walks

■ Modern architectures hide memory access latency using the ROB and LSQ
  (here, a 128-entry ROB and an 80-entry LSQ)
■ Impact on IPC due to page walk latency, compared against an ideal TLB that
  – has zero page walk overhead,
  – causes no cache pollution, and
  – returns a translation in a single cycle

SLIDE 31

TLB-reach and large LLCs

[Figure: results for a 64B-block-size L4 cache and a 512B-block-size L4
cache.]

SLIDE 32

Conclusion

■ Conclusion
  – Measured the effect of page walk latency on IPC
  – Built a framework to study the effects of page walk latency and
    TLB-reach
  – Quantified the TLB-reach problem in the context of large die-stacked
    caches
■ Future Work
  – Use large-footprint applications from CloudSuite & BigDataBench
  – Detailed timing simulation of die-stacked DRAM
  – Superpage TLBs
