Vector Extensions for Decision Support DBMS Acceleration



SLIDE 1

Vector Extensions for Decision Support DBMS Acceleration

Timothy Hayes, Oscar Palomar, Osman Unsal, Adrian Cristal & Mateo Valero Barcelona Supercomputing Center

Presented by Timothy Hayes timothy.hayes@bsc.es

SLIDE 2

Introduction

• Databases are important
  – OnLine Analytical Processing
  – Data mining
  – E-commerce
  – Scientific analysis
• Decision Support System (DSS) DBMSs
  – Extract useful information from large structured data
  – Frequent reads, infrequent updates
  – Have moved from disk-bound to memory/CPU-bound
• Abundance of analysis
  – Recent research on DBMS implementation – Zukowski et al. (2006)
• Opportunity for computer architecture
  – Speed up queries in a power-efficient way
  – Data-level parallelism (DLP) is very attractive here, if available

SLIDE 3

Vectorwise

• State-of-the-art analytical database engine
  – Based on MonetDB/X100 – Boncz et al. (2005)
  – Redesigned database software architecture
  – Highly optimised for modern commodity superscalar CPUs
• Finding hotspots is relevant
  – Column-oriented, block-at-a-time execution (batches of values)
  – Possible opportunities for data-level parallelism (DLP)
• Profiling
  – TPC-H decision support benchmark
  – 22 queries, 100 GB database
  – Intel Nehalem microarchitecture
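Block-at-a-time execution can be sketched as a primitive that consumes one cache-resident batch of column values per call and emits a selection vector. This is an illustrative C sketch, not Vectorwise's actual code; the names `select_lt` and `BLOCK` are invented:

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 1024  /* values per primitive call (illustrative size) */

/* Block-at-a-time selection primitive: writes the positions of
 * qualifying values into sel[] and returns how many qualified.
 * The tight, branch-light loop over a batch is what makes this
 * style friendly to superscalar CPUs -- and a candidate for DLP. */
size_t select_lt(const int *col, size_t n, int bound, size_t *sel)
{
    size_t out = 0;
    for (size_t i = 0; i < n; i++)
        if (col[i] < bound)
            sel[out++] = i;
    return out;
}
```

Each operator call touches only one block, so intermediate results tend to stay cache-resident instead of spilling whole columns to memory.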

SLIDE 4

Profiling Vectorwise w/ TPC-H 100 GB

(Figure: CPU time in seconds per TPC-H query, 1–22; the hash join dominates, with "other" making up the remainder.)
SLIDE 5

Hash Join Analysis

 61 % of total time

 Build – 33% (20%)  Probe – 67% (41%)

 Poor ILP scalability

 Simulated wide configs  Superscalar/OoO structs  Maximum speedup 1.21x

 Algorithm has DLP

 Each probe independent  Why isn’t it vectorised?

1 1.05 1.1 1.15 1.2 1.25 1.3 ss2 ss4 ss8 speedup configuration

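The independence of the probes shows up clearly in a simplified C sketch (one value per bucket, no chaining; `probe_block` and `NBUCKETS` are invented names): phase 1 is plain arithmetic over the whole block, and phase 2 is exactly the access pattern a vector gather instruction expresses.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 8  /* illustrative power-of-two bucket count */

/* Probe a block of up to 64 keys against a hash table with a single
 * value per bucket. Phase 1 is vectorisable arithmetic; phase 2 maps
 * directly onto a vector gather: no probe depends on another. */
void probe_block(const int *keys, size_t n,
                 const int *bucket_val, int *match)
{
    int hash[64];                       /* one block of hashes      */
    for (size_t i = 0; i < n; i++)      /* phase 1: hash all keys   */
        hash[i] = keys[i] & (NBUCKETS - 1);
    for (size_t i = 0; i < n; i++)      /* phase 2: gather + compare */
        match[i] = (bucket_val[hash[i]] == keys[i]);
}
```

Without gather/scatter in the ISA, phase 2 forces a scalar loop, which is why the hot loop stays unvectorised on SSE/AVX-class hardware.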

SLIDE 6

DLP Support in Hardware

• SIMD multimedia extensions (SSE/AVX)
  – Register lengths are relatively short
  – SIMD operations are fixed in length
  – Indexed memory operations missing*
  – Experiments show a speedup of less than 1%
• Vectors: the traditional pipelined solution
  – Solve many of the problems SIMD suffers from:
    · Long vector registers with pipelined operations
    · Programmable vector length
    · Mask registers for conditionals
    · Gather/scatter
  – Traditionally applied to scientific/multimedia domains
  – Opportunity to explore business-domain applications
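Programmable vector length removes SIMD's fixed-width restriction via strip-mining: the same instruction sequence handles any array size, with no scalar epilogue. A C sketch, assuming a hypothetical maximum vector length `MVL`:

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* hypothetical maximum vector length */

/* Strip-mined loop: the variable `vl` models a programmable
 * vector-length register. Each outer iteration corresponds to one
 * vector multiply of vl elements; the last strip simply runs shorter
 * instead of needing a separate scalar clean-up loop. */
void scale_by_two(const int *src, int *dst, size_t n)
{
    size_t i = 0;
    while (i < n) {
        size_t vl = (n - i < MVL) ? n - i : MVL;  /* "set vector length" */
        for (size_t j = 0; j < vl; j++)           /* one vector operation */
            dst[i + j] = src[i + j] * 2;
        i += vl;
    }
}
```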

SLIDE 7

Paper Contributions

• Show that vectors can be reapplied to DSS workloads
• Extend a modern out-of-order x86-64 microprocessor
  – Provide suitable vector ISA extensions
  – Optimise the implementation for a DSS workload
• Experimental evaluation
  1. Demonstrate that vectors are beneficial
  2. Design space exploration
  3. Memory bandwidth analysis
  4. Prefetching support

SLIDE 8

Vector Extensions to x86-64

• Vector instruction set
  – Traditional instructions: vectorise the hash join without being overly specific
  – Integer over floating point
  – Instruction classes:
    · Arithmetic / logical (with optional mask)
    · Compress
    · Mask arithmetic
    · Programmable vector length
    · Memory: unit-stride / indexed
• Architecture
  – 8 vector registers (size discussed later)
  – 4 mask registers
  – 1 vector-length register
• Microarchitecture
  – Adds 3 new vector clusters: 2 arithmetic, 1 memory
  – Tightly integrated with the core (not a coprocessor); reuses existing structures
  – Cache integration is difficult
  – Out-of-order execution is difficult
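The compress class, for example, has simple scalar semantics. The sketch below (the name `vcompress` is invented) shows the behaviour such an instruction would implement over one vector register: pack the elements whose mask bit is set contiguously and return the new vector length, which is exactly how a selection produces a dense selection vector for later gathers.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar model of a masked vector compress: copy the active elements
 * of src (mask[i] != 0) contiguously into dst; the return value is
 * the resulting vector length. */
size_t vcompress(const int *src, const unsigned char *mask,
                 size_t vl, int *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            dst[out++] = src[i];
    return out;
}
```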

SLIDE 9

Cache Hierarchy Integration

• Want to take advantage of the cache hierarchy
  – Vectorwise is blocked and cache-conscious
  – Sometimes datasets are cache-resident
• Vector integration should...
  – Not compromise the existing access time of the L1D cache
  – Provide enough bandwidth to the vector unit
  – Exploit regular access patterns, i.e. unit stride
• Bypass the L1D and go directly to the L2
  – Quintana et al. (1999)
  – Pull many elements in a single request
  – Amortise the extra latency incurred with long pipelined operations

SLIDE 10

Out of Order Execution

• Espasa et al. (1997): vectors with out-of-order execution
  – Performance benefits ✔
  – Hides memory latency even more ✔
  – Only supports unit-stride memory access ✘
• Indexed accesses are very difficult
  – Need to check for memory aliases
  – Gather/scatter too complex for the load/store queue (LSQ)
• Our proposal
  – Explicitly program fences between memory dependencies (seldom needed)
  – Relax the memory model
  – Bypass the LSQ completely
  – Very simple hardware tracks outstanding memory operations
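The fence proposal can be illustrated as follows. `vfence`, `scatter` and `gather` are hypothetical stand-ins modelled with scalar loops, not a real ISA: since indexed accesses bypass the LSQ, the one place a scatter may alias a later gather is ordered explicitly, so the hardware never has to check for aliases itself.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar models of indexed vector memory operations. */
static void scatter(int *mem, const size_t *idx, const int *v, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        mem[idx[i]] = v[i];
}

static void gather(int *v, const int *mem, const size_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        v[i] = mem[idx[i]];
}

/* Hypothetical vector memory fence: real hardware would wait for all
 * outstanding scatters to complete; here a compiler barrier suffices. */
static void vfence(void)
{
    __asm__ volatile("" ::: "memory");
}

/* Update a table and then re-read it through the same indices: the
 * explicit fence between the scatter and the gather states the only
 * memory dependence the hardware must honour. */
void update_and_read(int *table, const size_t *idx, const int *vals,
                     int *out, size_t vl)
{
    scatter(table, idx, vals, vl);
    vfence();                 /* needed only when aliasing is possible */
    gather(out, table, idx, vl);
}
```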

SLIDE 11

Experimental Setup

• Scalar baseline
  – Intel Nehalem, 2.67 GHz, single core
  – Inclusive cache hierarchy:
    · L1i: 32 KB, 1 cycle
    · L1d: 32 KB, 4 cycles
    · L2: 256 KB, 10 cycles
• Memory system
  – DDR3-1333, 10.667 GB/s bandwidth
• Simulators
  – PTLsim
  – DRAMSim2
• Application
  – Hand-vectorised
• Datasets
  1. L1-resident (l1r)
  2. L2-resident (l2r)
  3. 2 MB
  4. HUGE
  5. TPCH

SLIDE 12

Vector Benefits

Are vectors suitable for DSS acceleration?


SLIDE 13

Scalability of Vector Length

13

0.5 1 1.5 2 2.5 3 3.5 4 4.5 4 8 16 32 64 speedup over scalar vector register length l1r l2r 2mb huge tpch

SLIDE 14

Design Exploration

Are the design decisions justified?


SLIDE 15

Design Exploration – MVL64

(Figure: processor cycles per dataset (l1r, l2r, 2mb, huge, tpch) for the decoupled, fenceless and l1 design points.)

SLIDE 16

Memory Bandwidth

Can vectors utilise the available bandwidth?


SLIDE 17

Memory Bandwidth Utilisation

(Figure: speedup over scalar vs. vector register length (4–64) for memory configurations mc1, mc2 and infinite bandwidth.)

SLIDE 18

Memory Bandwidth / MSHRs – MVL64

(Figure: speedup over scalar for experiments s-mc1, s-mc2, s-inf.bw, v-mc1, v-mc2 and v-inf.bw, each with 1x, 2x and 4x MSHRs.)

SLIDE 19

Software Prefetching Support

Increasing the utilisation of available memory bandwidth


SLIDE 20

Prefetching Improvements – MVL64

(Figure: speedup over scalar without prefetching, per dataset; series s-pre, v-no-pre and v-pre.)

SLIDE 21

Conclusions

• Superscalar/OoO
  – Does not offer good scalability for a DSS workload
  – Does not saturate the available memory bandwidth
• Vectors are ideal for a DSS workload
  – Speedups between 1.94x and 4.56x on 41% of the benchmark
  – Fully saturate the available memory bandwidth
  – Long vector operations have the potential to scale further
  – All achieved with pipelining, not parallel lanes
• Design space measurements
  – Cache integration: bypassing the L1 cache does not incur a penalty
  – Out-of-order integration: indexed memory support is challenging; 1.4x improvement
  – Future work will determine the cost in area/energy

SLIDE 22

Vector Extensions for Decision Support DBMS Acceleration

Timothy Hayes, Oscar Palomar, Osman Unsal, Adrian Cristal & Mateo Valero Barcelona Supercomputing Center

Presented by Timothy Hayes timothy.hayes@bsc.es