

SLIDE 1

Search Lookaside Buffer: Efficient Caching for Index Data Structures

Xingbo Wu, Fan Ni, Song Jiang

SLIDE 2

Background

  • Large-scale in-memory applications.

  ○ In-memory databases
  ○ In-memory NoSQL stores and caches
  ○ Software routing tables

  • They rely on index data structures to access their data.

[Figure: a B+-tree and a hash table as example index data structures]

SLIDE 3

Background

  • Large-scale in-memory applications.

  ○ In-memory databases
  ○ In-memory NoSQL stores and caches
  ○ Software routing tables

  • They rely on index data structures to access their data.
  • “hash index (i.e., hash table) accesses are the most significant single source of runtime overhead, constituting 14–94% of total query execution time.” [Kocberber et al., MICRO-46]

SLIDE 4

CPU Cache is Not Effectively Used

  • Indices are too large to fit in CPU cache.

  ○ In-memory databases: “55% of the total memory” [Zhang et al., SIGMOD’16]
  ○ In-memory KV caches: 20–40% of the memory [Atikoglu et al., Sigmetrics’12]

  • Access locality has potential to address the problem.

  ○ Facebook’s Memcached workload study: “All workloads exhibit the expected long-tail distributions, with a small percentage of keys appearing in most of the requests...”

  • However, data locality is compromised during index search.


SLIDE 5

Case Study: Search in a B+-tree-indexed Store

Setup: store size 10 GB; 8 B keys, 64 B values; Zipfian workload; 40 MB CPU cache

Accessed data set: 10 GB → 10 M ops/sec

SLIDE 6

Case Study: Search in a B+-tree-indexed Store

Setup: store size 10 GB; 8 B keys, 64 B values; Zipfian workload; 40 MB CPU cache

Accessed data set: 10 GB → 10 M ops/sec; 10 MB → 12.5 M ops/sec

SLIDE 7

Case Study: Search in a B+-tree-indexed Store

Setup: store size 10 GB; 8 B keys, 64 B values; Zipfian workload; 40 MB CPU cache

Accessed data set: 10 GB → 10 M ops/sec; 10 MB → 12.5 M ops/sec

If we remove the index and put the same data set in an array: 382 M ops/sec

SLIDE 8

A Look at Index Traversal

  • Index search in B+-tree: binary search at each node


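As the bullet above says, each lookup performs one binary search per node along the root-to-leaf path. A minimal sketch of that traversal in C, with a hypothetical node layout (fanout, field names, and key type invented for illustration; not the code from the talk):

#include <stddef.h>
#include <stdint.h>

#define FANOUT 32

typedef struct node {
    int      is_leaf;
    int      nkeys;
    uint64_t keys[FANOUT];
    union {
        struct node *children[FANOUT + 1]; /* interior node */
        void        *values[FANOUT];       /* leaf node */
    };
} node;

/* Binary search within one node: the number of keys <= key.
 * Each probe can land on a different cache line of the node. */
static int node_upper(const node *n, uint64_t key)
{
    int lo = 0, hi = n->nkeys;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (n->keys[mid] <= key) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* Root-to-leaf traversal: one binary search per level. Every node on
 * the path is touched on every lookup of this key, which is why the
 * intermediate entries become hot (next slide). */
void *btree_get(const node *root, uint64_t key)
{
    const node *n = root;
    while (!n->is_leaf)
        n = n->children[node_upper(n, key)];
    int i = node_upper(n, key);
    return (i > 0 && n->keys[i - 1] == key) ? n->values[i - 1] : NULL;
}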

SLIDE 11

A Look at Index Traversal

  • The intermediate entries on the path become hot.


slide-12
SLIDE 12

False Temporal Locality

  • The intermediate entries on the path become hot.
  • The purpose of index search is to find the target entry.

[Figure: only the target entry is needed, yet every intermediate entry on the path becomes hot — false temporal locality]

SLIDE 13

False Spatial Locality

  • Each hot intermediate entry occupies a whole cache line.
  • Touched cache lines ≫ entries required in the search.

[Figure: each hot intermediate entry pinning a whole 64-byte cache line — false spatial locality]
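To get a feel for the gap between cache lines touched and entries needed, here is a small counting sketch; the array size, key width, and 64 B line size are illustrative assumptions:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    size_t n = 128UL * 1024 * 1024;   /* 1 GiB of sorted 8-byte keys */
    size_t target = n - 1;            /* pretend the key is the last one */
    size_t lo = 0, hi = n, lines = 0, prev = SIZE_MAX;

    while (lo < hi) {                 /* classic binary search */
        size_t mid = lo + (hi - lo) / 2;
        size_t line = mid * sizeof(uint64_t) / 64;
        if (line != prev) { lines++; prev = line; }  /* new 64 B line */
        if (mid < target) lo = mid + 1; else hi = mid;
    }
    /* ~27 probes, nearly as many distinct cache lines, 1 useful entry */
    printf("cache lines touched: %zu\n", lines);
    return 0;
}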

SLIDE 14

False Localities on a Hash Table

  • Chains or open addressing lead to false temporal locality.
  • False spatial locality is significant even with short chains.

[Figure: bucket chain traversal toward the target entry]

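A sketch of why even a hash table suffers: with chaining, every hop follows a pointer to a fresh cache line. The layout and names below are hypothetical, not the tables evaluated in the talk:

#include <stdint.h>
#include <string.h>
#include <stddef.h>

typedef struct entry {
    struct entry *next;   /* following this is usually a cache miss */
    uint64_t      hash;
    const char   *key;
    void         *value;
} entry;

typedef struct {
    entry **buckets;
    size_t  nbuckets;     /* power of two */
} htable;

void *ht_get(const htable *t, const char *key, uint64_t hash)
{
    entry *e = t->buckets[hash & (t->nbuckets - 1)];
    for (; e != NULL; e = e->next)
        /* every entry visited on the chain is pulled into cache,
         * though only the final match is wanted: false localities */
        if (e->hash == hash && strcmp(e->key, key) == 0)
            return e->value;
    return NULL;
}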

SLIDE 15

A Closer Look at Your CPU Cache

  • Cache space is occupied by index entries of false localities.

[Figure: cache largely filled with intermediate entries, leaving little room for target entries]

SLIDE 16

Existing Efforts on Improving Index Search

  • Redesigning the data structure: cuckoo hashing, Masstree, etc.

  ○ Must be an expert on the data structure
  ○ Optimizations are specific to certain data structures
  ○ May add overhead to other operations (e.g., expensive insertions)

  • Hardware accelerators: Widx, MegaKV, etc.

  ○ High design cost
  ○ Hard to adapt to new index data structures
  ○ High latency for out-of-core accelerators (e.g., GPUs, FPGAs)

SLIDE 17

The Issue of Virtual Address Translation

Use of page tables shares the same challenges as index search:

  • Large index: every process has a page table.
  • Frequently accessed: consulted in every memory access.
  • False temporal locality: tree-structured tables.
  • False spatial locality: intermediate page-table directories.


SLIDE 18

Fast Address Translation with TLB

The TLB directly caches page table entries (PTEs) for translation:

➔ Bypasses page table walking
➔ Covers a large memory area with a small cache

[Figure: the TLB holding PTEs, short-circuiting the page-table walk]

SLIDE 19

Our Solution: Search Lookaside Buffer

  • Pure software library
  • Easy integration with any index data structure
  • Negligible overhead even in the worst case


SLIDE 20

Index Search with SLB

Every lookup first consults the SLB.

[Figure: SLB_GET misses; the search falls through to the index]

X = SLB_GET(key)
if X:
    return X
X = INDEX_GET(key)
if X:
    SLB_EMIT(key, X)
    return X
return NULL

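To make the flow concrete, here is a toy stand-in for the SLB as a small direct-mapped cache of target entries. The slot layout, hash, and sizes are invented for this sketch; only the get/emit call pattern mirrors the slide, and index_get() stands for any existing index search:

#include <stdint.h>
#include <stddef.h>

#define SLB_SLOTS 4096                /* small, cache-resident table */

typedef struct {
    uint64_t key;                     /* 8-byte keys, as in the case study */
    void    *value;                   /* pointer to the target entry */
    int      valid;
} slb_slot;

static slb_slot slb[SLB_SLOTS];

static inline size_t slb_idx(uint64_t k)
{
    return (k * 0x9E3779B97F4A7C15ULL) % SLB_SLOTS;
}

static void *slb_get(uint64_t key)
{
    slb_slot *s = &slb[slb_idx(key)];
    return (s->valid && s->key == key) ? s->value : NULL;
}

static void slb_emit(uint64_t key, void *value)
{
    slb_slot *s = &slb[slb_idx(key)];
    s->key = key; s->value = value; s->valid = 1;  /* overwrite = evict */
}

void *index_get(uint64_t key);        /* any index: B+-tree, hash table, ... */

void *store_get(uint64_t key)
{
    void *v = slb_get(key);
    if (v) return v;                  /* SLB hit: index traversal skipped */
    v = index_get(key);
    if (v) slb_emit(key, v);          /* cache the target entry only */
    return v;
}

The point of the pattern is that a hit serves the request from a single, likely-cached slot instead of a multi-node traversal — the TLB analogy from the previous slides.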

SLIDE 21

Index Search with SLB

Emits the target entry after a successful index search.


X = SLB_GET(key)
if X:
    return X
X = INDEX_GET(key)
if X:
    SLB_EMIT(key, X)
    return X
return NULL

SLIDE 22

Index Search with SLB

A hit in the SLB completes the search.

[Figure: SLB_GET returns the KV item directly]

X = SLB_GET(key)
if X:
    return X
X = INDEX_GET(key)
if X:
    SLB_EMIT(key, X)
    return X
return NULL

SLIDE 23

Design Challenges

❖ Tracking KV temperatures can pollute the CPU cache
  ➢ Cache-line-local access counters for cached items.
  ➢ Approximate access logging for uncached items.
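One way to read “cache-line-local”: keep each item’s counter in the same 64-byte line as the item’s tag, so updating hotness never touches a line the lookup did not already load. The layout below is a guess for illustration, not the structure from the paper:

#include <stdint.h>

/* One cache line holding 4 cached items plus their counters. */
struct __attribute__((aligned(64))) slb_line {
    uint32_t tags[4];     /* 16 B: hash tags of the 4 items */
    uint8_t  counts[4];   /*  4 B: access counters, co-located */
    uint8_t  pad[12];     /* 12 B: padding */
    void    *items[4];    /* 32 B: pointers to the target entries */
};                        /* 64 B total: exactly one cache line */

/* A hit bumps a counter inside the line the lookup already loaded,
 * so temperature tracking adds no extra cache-line traffic. */
static void *line_get(struct slb_line *l, uint32_t tag)
{
    for (int i = 0; i < 4; i++) {
        if (l->tags[i] == tag) {
            if (l->counts[i] < 255)   /* saturating counter */
                l->counts[i]++;
            return l->items[i];
        }
    }
    return NULL;
}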

SLIDE 24

Design Challenges

❖ Tracking temperatures of items can pollute the CPU cache
  ➢ Cache-line-local access counters for cached items.
  ➢ Approximate access logging for uncached items.

❖ Frequent replacement hurts index performance

➢ Adaptive logging throttling for uncached items.

❖ More details in the paper...


SLIDE 25

Experimental Setup

  • B+-tree, Skip list, and hash tables
  • Filled with 10⁸ KVs (8 B keys, 64 B values)
  • Store size: ~10GB
  • Zipfian workload
  • Accessed data set: 10 MB → 10 GB
  • SLB size: 16/32/64 MB
  • Uses one NUMA node (16 cores)


SLIDE 26

B+-tree and Skip List

  • Significant improvements for ordered data structures

  ○ Substantial false localities caused by index traversal

[Figure: throughput improvements — up to 15× for B+-tree and 2.5× for skip list]

SLIDE 27

Hash Tables

  • Chaining hash table: average chain length ≤ 1

  ○ The index has no false temporal locality.
  ○ Throughput improves by up to 28% by removing false spatial locality.

[Figure: throughput improvements — up to +50% for cuckoo hashing and +28% for chaining]

SLIDE 28

High-performance KV Server

  • An RDMA port of MICA [Lim et al., NSDI’14]

  ○ In-memory KV store
  ○ Bulk-chaining partitioned hash tables
  ○ Batch processing
  ○ Lock-free accesses

SLIDE 29

MICA over 100 Gbps InfiniBand

  • GET: Limited improvements due to network bandwidth.

  ○ 10.7 GB/s, ~90% of the network bandwidth

  • PROBE: only returns True/False
  ○ Improves by 20%–66%

[Figure: GET and PROBE throughput]

SLIDE 30

Conclusion

  • We identify the issue of false temporal/spatial locality in index search.
  • We propose SLB, a general software solution that improves search for any index data structure by removing the false localities.
  • SLB improves index search for workloads with strong locality, and imposes negligible overhead with weak locality.


SLIDE 31

Thank You!

☺ Questions?


SLIDE 32

Backup slides


SLIDE 33

Replaying Facebook KV Workloads

Five key-value traces collected on production Memcached servers [Atikoglu et al., Sigmetrics’12]

SLIDE 34

Replaying Facebook KV Workloads

  • USR: GET-dominant
  • Less skewed; working set ≫ cache
  • No improvement


SLIDE 35

Replaying Facebook KV Workloads

APP & ETC: More skewed Working set fits the cache 10%-30% DELETE frequent invalidations in SLB Improvement < 20%


SLIDE 36

Replaying Facebook KV Workloads

  • SYS & VAR: GET & UPDATE; working set fits the cache
  • Improvement > 43%