
FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs N. Satish, C. Kim, J. Chhugani, A. Nguyen, V. Lee, D. Kim, P. Dubey SIGMOD 2010 Presented by: Andy Hwang

2 Motivation
• Index trees are not optimized for the underlying architecture
• Only one node is accessed per tree level, so cache lines are used ineffectively
• Prefetching cannot be used, because the next node depends on comparing the search key with the parent
• Nodes lie in different pages, causing TLB misses
• Previous work optimized for page, cache, and SIMD separately, not together
• Compression can be used to save memory bandwidth

3 Motivation: Index Tree Layout Bad for traversal

4 Outline
• Motivation
• Hierarchical Blocking
• CPU/GPU Implementation
• Compression
• Throughput/Response Time
• Summary/Discussion

5 Hierarchical Blocking
• Optimize for accesses (SIMD/cache/memory)

6 Hierarchical Blocking

8 Tree Construction
• Assumes 4-byte (32-bit) keys
• Block size depends on SIMD instruction width, cache line size, and page size
• Use one SIMD instruction to calculate multiple indices
• Parallelize output among CPU cores / GPU streaming multiprocessors

9 Tree Construction: CPU
• 128-bit SIMD = max 4 nodes at once
  • SIMD block = 2 tree levels (3 nodes)
• 64-byte cache line = max 16 nodes
  • Cache line block = 4 levels (15 nodes)
• 2 MB page size
  • Page block = 19 levels
  • 4 KB page = 10 levels
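The block depths above follow from how many 4-byte keys fit in each hardware unit: a complete binary subtree of depth d holds 2^d - 1 keys, so the depth is the largest d that still fits. A minimal sketch of that arithmetic (the function name `block_depth` is hypothetical and not from the slides):

```python
KEY_BYTES = 4  # 4-byte keys, as assumed on the slides


def block_depth(block_bytes: int, key_bytes: int = KEY_BYTES) -> int:
    """Depth of the largest complete binary subtree whose keys fit in block_bytes."""
    capacity = block_bytes // key_bytes  # number of keys the block can hold
    # A subtree of depth d holds 2**d - 1 keys, so we need 2**d <= capacity + 1.
    # For positive n, n.bit_length() - 1 equals floor(log2(n)).
    return (capacity + 1).bit_length() - 1


for name, size in [("128-bit SIMD register", 16),
                   ("64-byte cache line", 64),
                   ("4 KB page", 4 * 1024),
                   ("2 MB page", 2 * 1024 * 1024)]:
    print(f"{name}: depth {block_depth(size)}")
```

Under these assumptions the four results are 2, 4, 10, and 19, matching the SIMD, cache line, and page block depths listed on the slide.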

10 Tree Construction: GPU
• 32 data elements (thread warp)
• Various SIMD block sizes possible (up to 32)
• Depth set to 4 to make use of instruction granularity at half-warp
• No cache exposed: cache line block size set equal to SIMD block size

11 Tree Traversal: CPU
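As a rough illustration of one traversal step inside a SIMD block: a depth-2 block holds three keys (root, left child, right child), and comparing the query against all three at once is enough to pick one of the four child subtrees. The sketch below is scalar Python and counts the comparison results to get the child index; that counting step is a simplification chosen for illustration and is not necessarily how the authors derive the index. The function name `simd_block_child` is hypothetical.

```python
def simd_block_child(block_keys, query):
    """Pick the child subtree (0..3) below a depth-2 block.

    block_keys is (root, left, right) with left < root < right,
    i.e. the three keys of the block in breadth-first order.
    """
    root, left, right = block_keys
    # On SIMD hardware these three comparisons would be a single
    # packed compare of the query against all keys in the block.
    gt_left = query > left
    gt_root = query > root
    gt_right = query > right
    # Because left < root < right, the number of keys smaller than the
    # query is exactly the index of the child subtree to descend into.
    return int(gt_left) + int(gt_root) + int(gt_right)


block = (20, 10, 30)
print([simd_block_child(block, q) for q in (5, 15, 25, 35)])
```

The point of the design is that no branch depends on an earlier comparison within the block, so two tree levels are resolved in one step.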

12 Tree Traversal: GPU

13 Simultaneous Queries
• Issue queries in parallel on the hardware
• Software pipelining used to hide cache/TLB misses or GPU memory latency
• CPU: 8 concurrent queries per thread, 64 total
• GPU: 2 concurrent queries per thread warp, 960 total
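The structure of interleaved queries can be sketched as follows: instead of finishing one query before starting the next, every query in a batch is advanced by one tree level per pass, so a memory access for one query can overlap with work on the others. Python cannot express prefetching, so this only shows the loop structure; the plain breadth-first layout (rather than the hierarchically blocked one) and the names `build_bfs_tree` and `batched_search` are assumptions made for illustration.

```python
def build_bfs_tree(sorted_keys):
    """Lay out 2**d - 1 sorted keys as a complete binary tree in breadth-first order."""
    n = len(sorted_keys)
    assert (n + 1) & n == 0, "sketch assumes a complete tree (2**d - 1 keys)"
    tree = [None] * n
    it = iter(sorted_keys)

    def fill(i):
        # In-order walk over array positions assigns keys in sorted order.
        if i < n:
            fill(2 * i + 1)
            tree[i] = next(it)
            fill(2 * i + 2)

    fill(0)
    return tree


def batched_search(tree, queries):
    """Advance all queries level by level; return the count of keys < query for each."""
    n = len(tree)
    depth = n.bit_length()            # number of levels in the complete tree
    pos = [0] * len(queries)          # current node of every in-flight query
    for _ in range(depth):
        for i, q in enumerate(queries):
            # Go right when the query is greater than the node key.
            pos[i] = 2 * pos[i] + 1 + (q > tree[pos[i]])
    return [p - n for p in pos]       # leaf slot index equals the rank


tree = build_bfs_tree([10, 20, 30, 40, 50, 60, 70])
print(batched_search(tree, [5, 25, 40, 75]))
```

A batch of 8 queries per pass would correspond to the per-thread CPU figure on the slide.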

14 Optimization Speedup

15 CPU vs GPU Search Throughput

16 Tree Traversal: MICA
• Intel Many-Core Architecture Platform
• Intel GPGPU effort
• 32 KB L1, 256 KB L2 (partitioned)
• 4 threads/core
• Traversal code similar to CPU
• 16-wide SIMD
• SIMD block depth = 4 (15 nodes at once)

17 Tree Traversal: MICA
Throughput (million queries / sec)

        Small Tree (64K keys)   Large Tree (16M keys)
CPU     280                     60
GPU     150                     100
MICA    667                     183

Benefits of both CPU and GPU!

19 Compression
• Key sizes are different in practice
• Key size impacts cache line and page usage
• Non-Contiguous Common Prefix (NCCP)
• Hashing keys based on their difference (partial keys)
• 4-bit blocks as unit of compression
• SIMD instruction to find similarity and compress

20 Compression
• First page partial key size is larger (128 bits) to reduce false positives
• Subsequent pages have a partial key size of 32 bits
• Construction overhead increased
  • +75% for variable-size keys, +30% for integer keys
• During traversal, the query key is compressed
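The partial-key idea can be illustrated with a small sketch: find the 4-bit blocks that are identical across all keys, drop them, and keep only the blocks where keys differ. This is a simplified reading of the slides, not the authors' implementation; in particular it keeps every differing block instead of fitting the result into a fixed 32-bit or 128-bit partial key, and the function names are hypothetical.

```python
def differing_nibbles(keys, key_bits):
    """Indices (0 = most significant) of 4-bit blocks that are not identical in all keys."""
    assert key_bits % 4 == 0
    diff = 0
    for k in keys:
        diff |= k ^ keys[0]  # accumulate every bit where some key differs
    n = key_bits // 4
    return [i for i in range(n) if (diff >> (4 * (n - 1 - i))) & 0xF]


def partial_key(key, nibble_idx, key_bits):
    """Concatenate the selected 4-bit blocks of key into a shorter partial key."""
    n = key_bits // 4
    out = 0
    for i in nibble_idx:
        out = (out << 4) | ((key >> (4 * (n - 1 - i))) & 0xF)
    return out


keys = [0xAB10CD3F, 0xAB12CD34, 0xAB17CD39]
idx = differing_nibbles(keys, 32)
print(idx, [hex(partial_key(k, idx, 32)) for k in keys])
```

Because the dropped blocks are the same in every stored key, the partial keys keep the original sort order. A query key that is not in the tree is compressed with the same block selection and may differ only in the dropped blocks, which is one way the false positives mentioned on the slide can arise.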

21 Compression

22 Compression: Alphabet Size

23 Compression: Throughput

24 Query Batching/Buffering

25 Summary
• Hierarchical blocking to optimize the search tree for page, cache, and SIMD instructions
• Architecture-aware block depths
• CPU/GPU/MICA implementations
• Fast construction, search, and parallel queries for varying tree sizes
• Hide memory latency wherever possible
• NCCP compression for integer and variable-length keys
• Throughput/response time for different query batching schemes

26 Discussion
• Focus on throughput
  • Assumes large number of queries
  • Not much info on latency
• Updates
  • Full reconstruction? Flushed from cache?
• Synthetic workloads
• Deployment
