SLIDE 1

TLB misses - The Missing Issue of Adaptive Radix Tree?

Petrie Wong, Ziqiang Feng, Wenjian Xu, Eric Lo, Ben Kao

Department of Computer Science, The University of Hong Kong
Department of Computing, The Hong Kong Polytechnic University

SLIDE 2

TLB misses - the Missing Issue of Adaptive Radix Tree? - presented by Petrie Wong

Motivation

  • In-memory databases
  • H-Store
  • Hekaton
  • Efficient in-memory index structures
  • Cache-Sensitive B+-Tree (CSB+-Tree)
  • Fast Architecture Sensitive Tree (FAST)
  • Adaptive Radix Tree (ART)


SLIDE 3

Why Adaptive Radix Tree?

  • Outperforms existing index structures
  • in both search and update
  • has a small memory footprint
  • avoids cache misses
  • leverages SIMD data parallelism
  • reduces branch mis-prediction
  • achieves this by adopting a radix tree structure

  • V. Leis, A. Kemper, et al. ICDE'13
SLIDE 4

What is the Adaptive Radix Tree?

[Figure: ART structure — inner nodes hold a key array (or index array) plus a child pointer array; shown are Node4, Node48, and Node256 layouts along a path down to the data]

A small node type (Node4) is used for nodes with few child pointers; a large node type (Node256) for nodes with many child pointers.
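The adaptivity described above can be sketched in a few lines of Python (our own illustrative model, not the paper's C++ implementation; the class and method names are hypothetical):

```python
class Node4:
    """Small inner node: parallel key/child arrays, searched linearly."""
    def __init__(self):
        self.keys = []      # up to 4 one-byte partial keys
        self.children = []  # matching child pointers

    def find(self, byte):
        # linear scan over at most 4 entries (SIMD-comparable in the real ART)
        for k, child in zip(self.keys, self.children):
            if k == byte:
                return child
        return None


class Node256:
    """Large inner node: child array indexed directly by the key byte."""
    def __init__(self):
        self.children = [None] * 256

    def find(self, byte):
        # one array access, no search at all
        return self.children[byte]
```

In the full ART a node grows through Node4, Node16, Node48, and Node256 as children are added, so memory stays proportional to the actual fan-out.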

SLIDE 5

Do TLB misses matter in ART?

  • Translation Look-aside Buffer (TLB)
  • a cache for page table entries
  • a fast way to translate virtual memory addresses to physical memory addresses while the CPU executes instructions
  • for an in-memory index structure like ART
  • few cache misses, few branch mis-predictions, SIMD-friendly
  • would misses in the TLB become a bottleneck?
  • if positive
  • what are the measures to alleviate them?
  • how effective are those measures?
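The TLB's role can be illustrated with a toy model (a sketch of the general mechanism with made-up sizes; real TLBs are set-associative hardware with LRU-like replacement, not a FIFO dictionary):

```python
PAGE_SIZE = 4096  # regular 4KB page

class TinyTLB:
    """Toy TLB: a small cache of virtual-page-number -> physical-frame
    translations, with FIFO eviction when full."""
    def __init__(self, entries=64):
        self.entries = entries
        self.cache = {}   # vpn -> physical frame number
        self.order = []   # FIFO eviction order
        self.misses = 0

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.cache:
            self.misses += 1  # TLB miss: walk the (slow) page table
            if len(self.order) >= self.entries:
                self.cache.pop(self.order.pop(0))
            self.cache[vpn] = page_table[vpn]
            self.order.append(vpn)
        return self.cache[vpn] * PAGE_SIZE + offset
```

An index whose lookups touch more distinct pages than the TLB has entries keeps paying the miss penalty, which is exactly the question the following experiments quantify.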


SLIDE 6

Do TLB misses matter in ART?

  • Experiment to show
  • stall time % due to TLB misses
  • System specification
  • Intel Core i7-2630QM CPU
  • 2.00 GHz clock rate, 2.9 GHz turbo frequency
  • each core has
  • a 32KB L1i cache, a 32KB L1d cache, and a 256KB unified L2 cache
  • all cores share a 6MB L3 cache; 16GB 1600MHz RAM


SLIDE 7

Do TLB misses matter in ART?

  • Data
  • 1,000,000 integer keys
  • Dense: from 1 to n (19MB in RAM)
  • Sparse: random numbers in the 32-bit domain (22MB in RAM)
  • neither index fits into the 6MB L3 cache


SLIDE 8

Do TLB misses matter in ART?

  • Workload
  • 256M lookups
  • varying skewness: from zipf = 0 (each key is uniformly accessed) to zipf = 3 (a few very hot keys and many non-hot keys)
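A Zipf-skewed lookup stream like this can be generated as follows (our own sketch; the paper does not specify its generator):

```python
import random

def zipf_workload(n_keys, n_lookups, s, seed=0):
    """Draw lookup keys from a Zipf(s) distribution over key ranks:
    s = 0 is uniform access; larger s concentrates lookups on a few hot keys."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    return rng.choices(range(n_keys), weights=weights, k=n_lookups)
```

With s = 0 every key is equally likely; at s = 3 the top-ranked key alone receives roughly 80% of all lookups, which is why the TLB behaves so differently across the skewness range.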


SLIDE 9

Do TLB misses matter in ART? — Very skewed access

  • No, when key access is very skewed (Zipf = 2 to 3)
  • the few very hot search keys occupy very few page table entries in the TLB
  • very few TLB misses are incurred (0% to 2% of stall time)
  • TLB misses don't matter here

[Figure: stall time due to TLB misses / index lookup latency (%) vs. Zipf (0 to 3), for Dense and Sparse keys]
SLIDE 10

Do TLB misses matter in ART? — Uniform access

  • No, when the workload is not skewed (Zipf = 0 to 1)
  • each key is uniformly accessed, so there is no spatial locality
  • lots of cache misses, which dominate the lookup latency
  • TLB misses matter little (5% to 7% of stall time)

[Figure: stall time due to TLB misses / index lookup latency (%) vs. Zipf (0 to 3), for Dense and Sparse keys]

SLIDE 11

Do TLB misses matter in ART? — Realistic skew

  • YES, when the workload possesses realistic skewness (Zipf = 1 to 2)
  • key accesses have a certain spatial locality
  • the cache miss rate is not high
  • TLB misses matter now (up to 23% of stall time)

[Figure: stall time due to TLB misses / index lookup latency (%) vs. Zipf (0 to 3), for Dense and Sparse keys]

SLIDE 12

What measures can we take to alleviate TLB misses?

  • use of huge page
  • workload-conscious node-to-page reorganization


SLIDE 13

What measures can we take to alleviate TLB misses?

  • use of huge page
  • workload-conscious node-to-page reorganization


SLIDE 14

What is a Huge Page?

  • In memory allocation
  • the OS cuts the memory space into pages
  • to eliminate fragmentation over the whole memory space
  • Regular page size (in most processors, e.g. Intel Sandy Bridge Xeon E5)
  • 4KB (the OS's default)
  • Huge page size (e.g. Sandy Bridge)
  • 2MB or 1GB
  • a good tactic to reduce TLB misses
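A quick back-of-the-envelope calculation, using the 22MB sparse-key index from the experiments, shows why the page size matters:

```python
MB, KB = 1 << 20, 1 << 10

def pages_spanned(index_bytes, page_bytes):
    # ceiling division: number of pages (and hence page table
    # entries) an index of this size touches
    return -(-index_bytes // page_bytes)

regular_pages = pages_spanned(22 * MB, 4 * KB)  # 5632 translations
huge_pages = pages_spanned(22 * MB, 2 * MB)     # 11 translations
```

5,632 translations cannot come close to fitting in any TLB, while 11 easily can.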


SLIDE 15

Why Huge Page?

  • if we apply huge pages in ART
  • the # of pages spanned by ART nodes is reduced
  • the pressure on the TLB is reduced
  • fewer TLB misses
  • throughput increases


SLIDE 16

Why Huge Page?

  • a page table entry
  • besides being stored in the TLB
  • occupies space in the L1/L2/L3 caches and RAM
  • So… fewer page table entries
  • occupy less space in the processor's caches
  • fewer cache misses
  • throughput increases

[Figure: L2 cache contents — with regular pages, page table entries crowd out ART data; with huge pages, fewer page table entries leave more room for ART data]

SLIDE 17

Does Huge Page Always Help?

  • but…
  • the # of TLB entries differs across page sizes
  • there are fewer huge-page entries than regular-page entries
  • In Xeon E5:
  • 64 DTLB and 512 STLB entries for regular pages
  • 32 DTLB entries for huge pages
  • with fewer TLB entries available for huge pages
  • throughput may decrease
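The trade-off becomes concrete if we compare "TLB reach" (the total memory the cached translations cover without a miss) for the Xeon E5 numbers above:

```python
KB, MB = 1 << 10, 1 << 20

def tlb_reach(entries, page_bytes):
    # total address range covered by the TLB's cached translations
    return entries * page_bytes

regular_reach = tlb_reach(64 + 512, 4 * KB)  # DTLB + STLB for 4KB pages
huge_reach = tlb_reach(32, 2 * MB)           # DTLB for 2MB pages
```

Huge pages raise the reach from 2.25MB to 64MB, but a working set spread over more than 32 distinct huge pages will thrash the 32-entry DTLB, which is why huge pages do not always win.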


SLIDE 18

Can Huge Page Help?

  • Yes
  • when the workload is uniform up to quite skewed (Zipf < 2)
  • TLB misses and cache misses are reduced
  • throughput increases as expected
  • when the workload is extremely skewed (Zipf > 2)
  • there are already very few TLB misses and cache misses
  • no further improvement

[Figure: Throughput improvement (%) vs. Zipf (0 to 3), for Dense and Sparse keys]

SLIDE 19

What measures can we take to alleviate TLB misses?

  • use of huge page
  • workload-conscious node-to-page reorganization

SLIDE 20

What is Workload-Conscious Node-to-Page Reorganization?

  • tree nodes in ART are allocated by dynamic memory allocation
  • the OS's default scheme
  • eliminates fragmentation over the whole memory space
  • workload-conscious allocation (R. Stoica and A. Ailamaki, DaMoN'13)
  • takes over the OS's control
  • organizes the hot ART nodes into the same page

SLIDE 21

Why Workload-Conscious Node-to-Page Reorganization?

  • OLTP workloads are skewed
  • some keys are hot and accessed frequently
  • if we put all hot nodes into one (huge) page
  • the page table entry of the hot page will be kept in the TLB
  • no TLB miss when accessing hot keys
  • throughput increases

SLIDE 22

How Does Workload-Conscious Node-to-Page Reorganization Work?

  • during query execution
  • log key accesses
  • analyze the access logs
  • sort the keys by their access frequencies
  • node-to-page reorganization
  • according to access frequencies
  • hot nodes are placed in the same page

[Figure: pages P1, P2, … reorganized into a hot page Phot and a cold page Pcold]
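The log-analyze-reorganize loop above can be sketched as follows (an illustrative simplification; the node IDs and the fixed nodes-per-page packing policy are ours, not the paper's):

```python
from collections import Counter

def reorganize(access_log, nodes_per_page):
    """Count accesses per node, sort nodes hottest-first, and pack them
    into pages in that order so hot nodes share a few hot pages."""
    freq = Counter(access_log)
    hot_first = [node for node, _ in freq.most_common()]
    return [hot_first[i:i + nodes_per_page]
            for i in range(0, len(hot_first), nodes_per_page)]
```

For example, `reorganize(["a", "a", "a", "b", "b", "c", "d"], nodes_per_page=2)` puts the two hottest nodes `a` and `b` together on the first page, so a skewed workload keeps hitting one page table entry.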

SLIDE 23

Can Workload-Conscious Node-to-Page Reorganization Help?

  • Yes, when
  • the data is sparse and the workload is skewed
  • sparse data
  • each node contains few children
  • small nodes (Node4, 36 bytes in size) are used
  • many nodes, not so condensed
  • more space, more pages
  • more page table entries
  • TLB misses matter

[Figure: Throughput (lookups/s) vs. Zipf (0 to 3), ART with reorganization vs. plain ART]
SLIDE 24

Can Workload-Conscious Node-to-Page Reorganization Help?

  • with sparse data
  • when workload-conscious reorganization is applied
  • all hot nodes can be put into a few pages
  • fewer page table entries need to be cached (for the hot nodes)
  • fewer TLB misses, and throughput increases

[Figure: Throughput (lookups/s) vs. Zipf (0 to 3), ART with reorganization vs. plain ART]

SLIDE 25

Can Workload-Conscious Node-to-Page Reorganization Help?

  • No, when
  • the data is dense and huge pages are used
  • few pages are needed
  • all page table entries can stay in the TLB
  • giving almost no TLB misses
  • making node-to-page reorganization immaterial

[Figure: Throughput (lookups/s) vs. Zipf (0 to 3), ART with reorganization vs. plain ART]

SLIDE 26

Summary

  • TLB misses do matter when the access workload possesses realistic skew
  • the use of huge pages provides a 1-32% lookup throughput improvement over regular pages
  • workload-conscious node-to-page reorganization does help when the data to be indexed is sparse

SLIDE 27

Thank you