SLIDE 1

Optimizing Redis for Locality and Capacity

Kevin C., Yoongu K., Lavanya S. 15-799 Project Presentation, 12/4/2013

SLIDE 2

Goals of Our Project

  • Leverage DRAM and dataset characteristics to improve the performance of an in-memory database

  • Locality: Exploit DRAM's internal row buffers
  • Capacity: Exploit redundancy in the dataset

SLIDE 3

DRAM System Organization

[Diagram: CPU connected to DRAM over a memory bus]

SLIDE 4

DRAM System Organization

[Diagram: CPU connected over a bus to a DRAM system made up of multiple banks]

Banks can be accessed in parallel

SLIDE 5

DRAM Bank Organization

  • Row buffer serves as a fast cache in a bank

– A row buffer miss transfers an entire row of data into the row buffer
– A row buffer hit serves accesses to the same row (reducing latency by up to 2x)

[Diagram: a DRAM bank organized as 8KB rows and columns, with a row buffer at the bottom of the bank]

SLIDE 6

Row-Buffer Locality (RBL) in In-Memory Databases

  • Idea: Map hot data to a few DRAM rows
  • Hot data: Data with high temporal correlation
  • Examples of temporally correlated data:

– Records touched around the same time
– Query terms searched together often

SLIDE 7

Challenge

  • How are data mapped to DRAM? Which bank? Which row?

[Diagram: a virtual address (virtual page number + offset) is translated to a physical address (physical page number + offset), which the memory controller then maps onto a DRAM bank and row]

Unexposed to the system: determined by the hardware (memory controller)
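One half of this chain, virtual page number to physical page number, can be observed from user space on Linux through /proc/self/pagemap (the physical-to-DRAM half still requires the reverse engineering described next). A minimal sketch, not the project's kernel-module approach, assuming Linux and root privileges, since recent kernels hide frame numbers from unprivileged readers:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Look up the physical address backing a virtual address via /proc/self/pagemap.
 * Each pagemap entry is 8 bytes: bit 63 = page present, bits 0-54 = page frame
 * number (PFN).  Recent kernels report a PFN of 0 to unprivileged processes. */
uint64_t virt_to_phys(const void *virt)
{
    long page_size = sysconf(_SC_PAGESIZE);
    uint64_t vpn = (uint64_t)virt / (uint64_t)page_size;
    uint64_t entry = 0;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open pagemap"); exit(1); }
    if (pread(fd, &entry, sizeof(entry), (off_t)(vpn * sizeof(entry))) != sizeof(entry)) {
        perror("pread pagemap"); exit(1);
    }
    close(fd);

    if (!(entry & (1ULL << 63)))            /* page not present in RAM */
        return 0;
    uint64_t pfn = entry & ((1ULL << 55) - 1);
    return pfn * (uint64_t)page_size + (uint64_t)virt % (uint64_t)page_size;
}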

SLIDE 8

Task 1: Find the Mapping to DRAM

  • Approach: a kernel module with assembly code to observe access latency to different addresses


Input: addr1 & addr2

  1. Load addr1                 // Fill TLB for addr1
  2. Load addr2                 // Fill TLB for addr2
  3. Flush the cache lines of addr1 and addr2
  4. Load addr1
  5. Read CPU cycle counter     // Tstart for addr2
  6. Load addr2
  7. Read CPU cycle counter     // Tend for addr2

Possible outcomes for the timed addr2 access, distinguished by latency:
  1. Cache hit
  2. Cache miss – row hit
  3. Cache miss – row miss

Courtesy: the backbone kernel module was obtained from Hyoseung Kim (Prof. Rajkumar's group)
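The project implements this loop inside the kernel module with inline assembly; a user-space analogue is sketched below, assuming x86-64, GCC intrinsics, and that addr1 and addr2 are already mapped (in practice the measurement is repeated many times to average out noise):

#include <stdint.h>
#include <x86intrin.h>   /* _mm_clflush, _mm_mfence, __rdtscp */

/* Time a dependent pair of loads.  addr1 and addr2 are first touched to warm
 * the TLB, both cache lines are flushed, addr1 is reloaded (opening its DRAM
 * row), and the latency of the load to addr2 is measured.  With the flush in
 * place the timed load is a row hit or a row miss depending on how the two
 * addresses map to DRAM; skipping step 3 gives the cache-hit baseline. */
static inline uint64_t time_addr2(volatile char *addr1, volatile char *addr2)
{
    unsigned int aux;

    (void)*addr1;                         /* 1. load addr1: fill TLB */
    (void)*addr2;                         /* 2. load addr2: fill TLB */
    _mm_clflush((const void *)addr1);     /* 3. flush both cache lines */
    _mm_clflush((const void *)addr2);
    _mm_mfence();
    (void)*addr1;                         /* 4. load addr1 (opens its row) */
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);      /* 5. Tstart for addr2 */
    (void)*addr2;                         /* 6. load addr2 */
    uint64_t end = __rdtscp(&aux);        /* 7. Tend for addr2 */
    return end - start;
}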

SLIDE 9

Task 1: Find the Mapping to DRAM

  • Experimental setup: 3.4GHz Haswell CPU, 2GB DRAM DIMM (8 banks)
  • With an exhaustive selection of addr1 and addr2, we discover the mapping to be:

[Diagram: physical address bit fields. Bits [12:0] are the byte offset within an 8KB row; bits [15:13] XORed with bits [18:16] select one of the 8 banks; the bits from 16 upward select the row]
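Written out as code, the discovered mapping amounts to the following; the bit positions are those shown above, and taking the row index from bit 16 upward is our reading of the diagram:

#include <stdint.h>

/* The mapping discovered above, written out explicitly.
 * offset = bits [12:0]                    (byte within an 8KB row)
 * bank   = bits [15:13] XOR bits [18:16]  (8 banks)
 * row    = bits from 16 upward            (our reading of the diagram) */
static inline unsigned bank_of(uint64_t paddr)
{
    return (unsigned)(((paddr >> 13) ^ (paddr >> 16)) & 0x7);
}

static inline uint64_t row_of(uint64_t paddr)
{
    return paddr >> 16;
}

static inline uint64_t offset_in_row(uint64_t paddr)
{
    return paddr & 0x1FFF;   /* 8KB row: 13 offset bits */
}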

SLIDE 10

Task 1: Find the Mapping to DRAM

[Diagram: the physical address space divided into 8KB units P0 (0x0000), P1 (0x2000), P2 (0x4000), and so on; P0 maps to bank 0, P1 to bank 1, ..., P7 to bank 7, while P8, P9, ... fill the next row of each bank in an order permuted by the XOR rule above]

SLIDE 11

Task 1: Find the Mapping to DRAM


  • Measurement:
  • The cache hit latency includes the overhead of the extra assembly instructions

  • Under investigation: why does a row hit in a different bank incur extra latency?

Request Type                    Approximate Latency (CPU cycles)
Cache hit                       30
Row hit in the same bank        170
Row hit in a different bank     220
Row miss                        270

Row miss vs. row hit in the same bank: ~60% latency increase

SLIDE 12

Task 2: Microbenchmark


  • Kernel module: allocates a 128KB memory region (guaranteed to be contiguous physical pages)

[Diagram: within the 128KB region, the 8KB block at Base maps to row X of bank Y; the block at Base + (9 * 8KB) maps to row X+1 of the same bank Y]

Test 1: Striding within a row
  • > Results in row hits

Test 2: Zigzag between 2 rows in the same bank
  • > Results in row misses (see the sketch below)
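A user-space sketch of the two access patterns follows. The project runs them on the kernel-allocated 128KB region; the clflush calls (to force each access out to DRAM) and the 9-row stride for the same-bank case are assumptions carried over from the Task 1 mapping:

#include <stddef.h>
#include <x86intrin.h>   /* _mm_clflush */

#define ROW_SIZE   (8 * 1024)   /* one 8KB DRAM row */
#define CACHE_LINE 64

/* Test 1: stride through the cache lines of a single row.
 * The first access opens the row; the rest are row hits. */
static void stride_within_row(volatile char *base)
{
    for (size_t off = 0; off < ROW_SIZE; off += CACHE_LINE) {
        _mm_clflush((const void *)(base + off));  /* keep the access out of the cache */
        (void)base[off];
    }
}

/* Test 2: zigzag between two rows that fall in the same bank.
 * Every access closes one row and opens the other, so every access is a row
 * miss.  Per the Task 1 mapping, base and base + 9*8KB share a bank but sit
 * in adjacent rows (an assumption carried over from that result). */
static void zigzag_same_bank(volatile char *base)
{
    volatile char *row_a = base;
    volatile char *row_b = base + 9 * ROW_SIZE;

    for (size_t off = 0; off < ROW_SIZE; off += CACHE_LINE) {
        _mm_clflush((const void *)(row_a + off));
        _mm_clflush((const void *)(row_b + off));
        (void)row_a[off];
        (void)row_b[off];
    }
}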
SLIDE 13

Why Understand Mapping to DRAM?

  • Enables mapping application data to exploit locality

  • Pages mapped to rows:

– Data accesses to the same row incur low latency
– Colocate frequently accessed data in the same row

  • Next cache line prefetched:

– Accessing the next cache line incurs low latency
– Map data accessed together to adjacent cache lines (see the sketch below)
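One way to act on this, sketched below, is a hypothetical row-sized arena: values that are accessed together are packed into the same 8KB-aligned region so they share a DRAM row and occupy adjacent cache lines. This is illustrative only, not part of the project's code, and physical contiguity of the region is an assumption (see the mmap-based plan under Next Steps).

#include <stdlib.h>
#include <string.h>

#define ROW_SIZE (8 * 1024)   /* one DRAM row */

/* Hypothetical row-sized arena: values pushed together end up in the same
 * 8KB-aligned region, so they are likely to share a DRAM row and to occupy
 * adjacent cache lines. */
struct row_arena {
    char  *base;
    size_t used;
};

static int arena_init(struct row_arena *a)
{
    a->used = 0;
    return posix_memalign((void **)&a->base, ROW_SIZE, ROW_SIZE);
}

/* Copy a value into the arena; returns its new location, or NULL when the
 * row is full and the caller should start a new arena. */
static void *arena_put(struct row_arena *a, const void *val, size_t len)
{
    if (a->used + len > ROW_SIZE)
        return NULL;
    void *dst = a->base + a->used;
    memcpy(dst, val, len);
    a->used += len;
    return dst;
}

Hot keys identified from the access trace would be stored through the same arena so that their values become neighbors in memory as well as in access order.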

SLIDE 14

Data Mapping Benefits in Redis

  • Is memory access the bottleneck?
  • Profiling using Performance API (PAPI)

– An interface to hardware performance counters

  • Profile the set-key and get-key functions (instrumentation sketch below)

– Determine what fraction of total cycles is spent in set and get

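A minimal sketch of that instrumentation, using PAPI's low-level API; cycles_in_call and the fn wrapper are hypothetical names, not Redis or project code:

#include <papi.h>

/* Hypothetical helper (not the project's actual instrumentation): count the
 * CPU cycles spent inside one call with PAPI's low-level API.  Summing these
 * counts over a run and dividing by the run's total cycles gives the
 * "fraction of cycles" reported on the next slides. */
static long long cycles_in_call(void (*fn)(void))
{
    int evset = PAPI_NULL;
    long long cycles = 0;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return -1;                        /* in real code, initialize once per process */
    if (PAPI_create_eventset(&evset) != PAPI_OK) return -1;
    if (PAPI_add_event(evset, PAPI_TOT_CYC) != PAPI_OK) return -1;

    PAPI_start(evset);
    fn();                                 /* wrapper around the set or get handler under test */
    PAPI_stop(evset, &cycles);

    PAPI_cleanup_eventset(evset);
    PAPI_destroy_eventset(&evset);
    return cycles;
}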

SLIDE 15

Data Mapping Benefits in Redis

[Plot: fraction of cycles spent in set and get (y-axis, 0.05 to 0.35) vs. number of random queries (x-axis); series: Set Cycle Fraction, Get Cycle Fraction]

Memory is not a significant bottleneck in Redis

SLIDE 16

Sensitivity to Payload Size

[Plot: fraction of cycles spent in set (y-axis, 0.05 to 0.4) vs. payload size in bytes (x-axis, 2 to 65536); series: Set Fraction]

Memory is still not a significant bottleneck in Redis

SLIDE 17

Next Steps

  • Row-hit vs. row-miss behavior in Redis:

– Use mmap to allocate data contiguously within a page (allocation sketch below)
– Microbenchmarks that access the same and different rows/pages

[Diagram: accesses alternating between row X and row X+1 of the same bank Y]
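A sketch of the planned allocation path; MAP_POPULATE and the huge-page remark are assumptions about how contiguity might be achieved, not a confirmed design choice:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define ROW_SIZE (8 * 1024)

/* Reserve a region with mmap and fault it in up front (MAP_POPULATE) so the
 * benchmark data is backed by physical frames before timing starts.  Adding
 * MAP_HUGETLB (2MB pages) would further make several consecutive 8KB rows
 * physically contiguous; both flags are assumptions to validate. */
static void *alloc_benchmark_region(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    return p;
}

int main(void)
{
    char *region = alloc_benchmark_region(16 * ROW_SIZE);
    region[0] = 1;   /* ... place keys/values and run same-row vs. different-row accesses ... */
    return 0;
}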

SLIDE 18

More Potential for Data Mapping?

  • Single-node databases
  • Mainframe transaction processing systems
  • Data analytics systems

SLIDE 19

Dataset

  • Could not find a suitable in-memory dataset
  • We constructed our own dataset based on the English Wikipedia corpus

1. XML dump of current revisions for all English articles

  • 43GB (uncompressed)
  • 11/04/2013
  • http://dumps.wikimedia.org/enwiki/20131104/enwiki-20131104-pages-articles.xml.bz2

2. Article hit-count log (one hour; parsing sketch below)

  • 307MB (uncompressed)
  • Last hour of 11/04/2013
  • http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-11/pagecounts-20131105-000001.gz
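For reference, each line of an hourly pagecounts file is a space-separated record of project code, page title, view count, and bytes transferred. A hypothetical parsing sketch (ours, not the project's), assuming the file has been decompressed and keeping only English Wikipedia ("en") entries:

#include <stdio.h>
#include <string.h>

/* Each line of an hourly pagecounts file has the form
 *   <project> <page_title> <view_count> <bytes_transferred>
 * We keep only the "en" (English Wikipedia) records; titles still contain
 * URI-style %xx escapes that the sanitization step has to undo.
 * Assumes the .gz file has already been decompressed. */
static void load_hit_counts(const char *path)
{
    char project[64], title[2048];
    unsigned long views;
    unsigned long long bytes;

    FILE *f = fopen(path, "r");
    if (!f) { perror("fopen"); return; }

    while (fscanf(f, "%63s %2047s %lu %llu", project, title, &views, &bytes) == 4) {
        if (strcmp(project, "en") != 0)
            continue;
        /* ... accumulate (title -> views) to identify hot articles ... */
    }
    fclose(f);
}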

SLIDE 20

Dataset (cont’d)

  • Sanitization was unexpectedly non-trivial...

– Spam and/or invalid user queries
– ASCII vs. UTF-8 vs. ISO/IEC 8859-1
– URI escape characters, HTML escape characters
– Running out of memory

  • Sanitized dataset

– 141K key-value pairs: (title, article)
– 3.6GB (uncompressed)
