SLIDE 1

FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs

  • N. Satish, C. Kim, J. Chhugani, A. Nguyen, V. Lee, D. Kim, P. Dubey

SIGMOD 2010 Presented by: Andy Hwang

SLIDE 2

Motivation

  • Index trees are not optimized for architecture
  • Only one node is accessed per tree level, ineffective cache line utilization
  • Prefetch cannot be used (depends on comparison of search key to parent)
  • Nodes in different pages, causing TLB misses
  • Previous work optimized for page, cache, SIMD separately, not together
  • Compression can be used to save memory bandwidth

SLIDE 3

Motivation: Index Tree Layout

Bad for traversal

SLIDE 4

  • Motivation
  • Hierarchical Blocking
  • CPU/GPU Implementation
  • Compression
  • Throughput/Response Time
  • Summary/Discussion

SLIDE 5

Hierarchical Blocking

Optimize for accesses (SIMD/cache/memory)
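
The blocking figure is not reproduced in this transcript. As a rough illustration of the idea, the sketch below reorders a sorted key array so that every subtree of a fixed depth sits contiguously in memory. It is a simplified, single-level version under stated assumptions: the tree is a complete binary tree, only one block depth is applied (the scheme described in these slides nests SIMD, cache line, and page blocks), and the name `blocked_layout` is hypothetical.

```python
def blocked_layout(sorted_keys, block_depth):
    """Reorder a sorted list into a blocked binary-search-tree layout.

    Each block holds the top `block_depth` levels of a subtree in
    breadth-first order, followed by the blocks of its child subtrees.
    Assumes len(sorted_keys) == 2**d - 1 for some d (complete tree).
    """
    n = len(sorted_keys)
    depth = (n + 1).bit_length() - 1
    if n != 2 ** depth - 1:
        raise ValueError("expected 2**d - 1 keys for a complete tree")

    out = []

    def emit(lo, hi, levels):
        # Lay out the subtree over sorted_keys[lo:hi], which has `levels` levels.
        if levels == 0:
            return
        d = min(block_depth, levels)
        frontier = [(lo, hi)]
        for _ in range(d):
            # Breadth-first: append the median of every range on this level.
            next_frontier = []
            for a, b in frontier:
                mid = (a + b) // 2
                out.append(sorted_keys[mid])
                next_frontier.append((a, mid))
                next_frontier.append((mid + 1, b))
            frontier = next_frontier
        # Recurse into the child subtrees hanging below this block.
        for a, b in frontier:
            emit(a, b, levels - d)

    emit(0, n, depth)
    return out


layout = blocked_layout(list(range(1, 16)), block_depth=2)
print(layout)
```

With 15 keys and a block depth of 2, each run of three consecutive entries in `layout` is one block (a parent and its two children), so a search touches one contiguous block per two tree levels.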

SLIDE 6

Hierarchical Blocking

SLIDE 7

  • Motivation
  • Hierarchical Blocking
  • CPU/GPU Implementation
  • Compression
  • Throughput/Response Time
  • Summary/Discussion

SLIDE 8

Tree Construction

  • Assuming 4-byte keys (32 bits)
  • Block size depends on SIMD instruction width, cache line size, and page size
  • Use one SIMD instruction to calculate multiple indices
  • Parallelize output amongst CPU cores / GPU streaming multiprocessors

SLIDE 9

Tree Construction: CPU

  • 128-bit SIMD = max 4 nodes at once
  • SIMD block = 2 tree levels (3 nodes)
  • 64-byte cache line = max 16 nodes
  • Cache line block = 4 levels (15 nodes)
  • 2MB page size
  • Page block = 19 levels
  • 4KB page = 10 levels
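
The block depths above follow from simple arithmetic: a block of depth d holds 2^d - 1 nodes, so the depth is the largest d whose nodes fit in the available space. The sketch below reproduces the numbers on this slide, assuming 4-byte keys and nothing but keys stored in a block; the helper names are hypothetical.

```python
def block_depth(capacity_bytes, key_bytes=4):
    """Largest depth d such that 2**d - 1 keys fit in capacity_bytes."""
    slots = capacity_bytes // key_bytes
    # 2**d - 1 <= slots  <=>  d <= log2(slots + 1)
    return (slots + 1).bit_length() - 1


def nodes_in_block(depth):
    """Number of nodes in a complete binary subtree with `depth` levels."""
    return 2 ** depth - 1


for name, size in [("128-bit SIMD", 16), ("64-byte cache line", 64),
                   ("4KB page", 4 * 1024), ("2MB page", 2 * 1024 * 1024)]:
    d = block_depth(size)
    print(f"{name}: {d} levels, {nodes_in_block(d)} nodes")
```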

SLIDE 10

Tree Construction: GPU

  • 32 data elements (thread warp)
  • Various SIMD block sizes possible (up to 32)
  • Set depth to 4 to make use of instruction granularity at half-warp
  • No cache exposed – cache line block size set equal to SIMD block size

SLIDE 11

Tree Traversal: CPU
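
The traversal figure is not reproduced in this transcript. As a rough sketch of a single step, the code below picks which of the four child subtrees to follow from a 3-node SIMD block. It assumes the block stores its keys in the order root, left child, right child, and it emulates with scalar comparisons what would be one 128-bit SIMD compare in native code. Turning the comparison mask into a child index by counting set bits is an illustrative choice, and the function name is hypothetical.

```python
def child_in_simd_block(block, query):
    """Return the child subtree index (0..3) to follow for `query`.

    `block` is (root, left, right) with left < root < right.
    """
    root, left, right = block
    # One comparison per lane; a SIMD unit would do all three at once.
    mask = (query > left, query > root, query > right)
    # Because the keys are ordered, the number of keys smaller than the
    # query is exactly the index of the child subtree to descend into.
    return sum(mask)


block = (8, 4, 12)
for q in (3, 5, 9, 13):
    print(q, "->", child_in_simd_block(block, q))
```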

SLIDE 12

Tree Traversal: GPU

SLIDE 13

Simultaneous Queries

  • Issue queries in parallel on the hardware
  • Software pipelining used to hide cache/TLB miss or GPU memory latency
  • CPU: 8 concurrent queries per thread, 64 total
  • GPU: 2 concurrent queries per thread warp, 960 total
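
The loop structure behind this kind of software pipelining can be sketched as follows: instead of finishing one query before starting the next, a batch of queries advances one tree level at a time, so the memory access for one query can overlap with work on the others. This is only a structural sketch. Python cannot issue prefetches, the tree here is a plain breadth-first array rather than a blocked layout, and the name `search_batch` is hypothetical.

```python
def search_batch(tree, depth, queries):
    """Interleave several searches over a complete BST in breadth-first order.

    tree[0] is the root; the children of tree[i] are tree[2*i + 1] and
    tree[2*i + 2]. Returns, for each query, how many keys are smaller than it.
    """
    pos = [0] * len(queries)
    for _ in range(depth):
        # All queries take one step per level. Native code would issue a
        # prefetch for each query's next node here before comparing.
        for i, key in enumerate(queries):
            go_right = key > tree[pos[i]]
            pos[i] = 2 * pos[i] + 1 + go_right
    first_leaf_slot = 2 ** depth - 1
    return [p - first_leaf_slot for p in pos]


tree = [4, 2, 6, 1, 3, 5, 7]  # keys 1..7 in breadth-first order
print(search_batch(tree, depth=3, queries=[0, 4, 5, 8]))
```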

SLIDE 14

Optimization Speedup

SLIDE 15

CPU vs GPU Search Throughput

SLIDE 16

Tree Traversal: MICA

  • Intel Many-Core Architecture Platform
  • Intel GPGPU effort
  • 32KB L1, 256KB L2 (partitioned)
  • 4 threads/core
  • Traversal code similar to CPU
  • 16-wide SIMD
  • SIMD block depth = 4 (15 nodes at once)

SLIDE 17

Tree Traversal: MICA

Throughput (million queries / sec)

Platform   Small Tree (64K keys)   Large Tree (16M keys)
CPU        280                     60
GPU        150                     100
MICA       667                     183

Benefits of both CPU and GPU!
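
To make the comparison concrete, the relative throughput can be worked out directly from the figures on this slide. The snippet below uses only those six numbers; the dictionary layout and the `speedup` helper are illustrative.

```python
# Throughput in million queries per second, as listed on the slide.
throughput = {
    "CPU":  {"small": 280, "large": 60},
    "GPU":  {"small": 150, "large": 100},
    "MICA": {"small": 667, "large": 183},
}


def speedup(platform, baseline, tree):
    """Throughput of `platform` relative to `baseline` for one tree size."""
    return throughput[platform][tree] / throughput[baseline][tree]


for tree in ("small", "large"):
    for baseline in ("CPU", "GPU"):
        ratio = speedup("MICA", baseline, tree)
        print(f"MICA vs {baseline}, {tree} tree: {ratio:.2f}x")
```

The CPU leads the GPU on the small tree and the GPU leads the CPU on the large tree, while MICA is ahead of both in each case.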

SLIDE 18

  • Motivation
  • Hierarchical Blocking
  • CPU/GPU Implementation
  • Compression
  • Throughput/Response Time
  • Summary/Discussion

SLIDE 19

Compression

  • Key sizes are different in practice
  • Impact cache line and page usage
  • Non-Contiguous Common Prefix
  • Hashing keys based on their difference (partial keys)

  • 4-bit blocks as unit of compression
  • SIMD instruction to find similarity and compress
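
A minimal sketch of the partial-key idea, under simplifying assumptions: all keys have the same length, the 4-bit blocks that are identical across every key are dropped, and the remaining blocks are concatenated in their original order. This is a scalar illustration rather than the SIMD version, it does not cap the partial key at a fixed width, and the function names are hypothetical.

```python
def to_nibbles(key):
    """Split a bytes key into a list of 4-bit blocks, high nibble first."""
    out = []
    for b in key:
        out.append(b >> 4)
        out.append(b & 0xF)
    return out


def varying_nibbles(keys):
    """Positions of the 4-bit blocks that differ between at least two keys."""
    columns = zip(*(to_nibbles(k) for k in keys))
    return [i for i, col in enumerate(columns) if len(set(col)) > 1]


def partial_key(key, positions):
    """Concatenate the selected 4-bit blocks of `key` into one integer."""
    nibbles = to_nibbles(key)
    value = 0
    for i in positions:
        value = (value << 4) | nibbles[i]
    return value


keys = [b"\x12\x34\x56\x78", b"\x12\x3a\x56\x79", b"\x12\x3f\x56\x70"]
positions = varying_nibbles(keys)
print(positions, [hex(partial_key(k, positions)) for k in keys])

# A query outside the key set can compress to the same partial key as a
# stored key, which is why a match must be checked against the full key.
query = b"\x99\x3a\x00\x79"
print(hex(partial_key(query, positions)))
```

Because the dropped blocks are identical in every stored key, the partial keys sort in the same order as the full keys, so the tree can be searched on the compressed form.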

SLIDE 20

Compression

  • First page partial key size is larger (128 bits) to reduce false positives
  • Subsequent pages have partial key size of 32 bits
  • Construction overhead increased
  • +75% for variable size keys, +30% for integer keys
  • During traversal, the query key is compressed

SLIDE 21

Compression

SLIDE 22

Compression: Alphabet Size

SLIDE 23

Compression: Throughput

SLIDE 24

Query Batching/Buffering

SLIDE 25

Summary

  • Hierarchical blocking to optimize search tree for page, cache, SIMD instructions
  • Architecture-aware block depths
  • CPU/GPU/MICA implementations
  • Fast construction, search, and parallel queries for varying tree sizes
  • Hide memory latency wherever possible
  • NCCP compression for integer and variable length keys
  • Throughput/Response time for different query batching schemes

SLIDE 26

Discussion

  • Focus on throughput
  • Assumes large number of queries
  • Not much info on latency
  • Updates
  • Full reconstruction? Flushed from cache?
  • Synthetic workloads
  • Deployment
