A Trillion Rows Per Second as a Foundation for Interactive Analytics - - PowerPoint PPT Presentation

a trillion rows per second as a foundation for
SMART_READER_LITE
LIVE PREVIEW

A Trillion Rows Per Second as a Foundation for Interactive Analytics - - PowerPoint PPT Presentation

A Trillion Rows Per Second as a Foundation for Interactive Analytics Eric Hanson, Principal Product Manager April 18, 2018 Overview MemSQL Interactivity and user satisfaction State-of-the-art query execution technology Demo Where


slide-1
SLIDE 1

A Trillion Rows Per Second as a Foundation for Interactive Analytics

Eric Hanson, Principal Product Manager April 18, 2018

slide-2
SLIDE 2

2

Overview

§ MemSQL § Interactivity and user satisfaction § State-of-the-art query execution technology § Demo § Where can we go with this technology?

slide-3
SLIDE 3

MemSQL Overview

3

slide-4
SLIDE 4

4

What is MemSQL?

§ SQL DBMS § Fast: scale-out, compilation, in-memory, vectorized § In-memory rowstore § Disk-based columnstore § Transactions and analytics § Fantastic operational data store

slide-5
SLIDE 5

5

Why MemSQL?

FAST DATA Ingest LOW LATENCY Queries HIGH Concurrency

slide-6
SLIDE 6

MemSQL scale-out architecture

Leaf

Aggregator

Leaf Leaf Leaf

Client App

slide-7
SLIDE 7

7

Challenges to lightning-fast response

§ Large data volume § Many concurrent users § Query complexity § Rapidly changing data

slide-8
SLIDE 8

Response Time, Productivity, and User Satisfaction

8

slide-9
SLIDE 9

9

Stimulation is the indispensable requisite for pleasure in an experience, and the feeling of bare time is the least stimulating experience we can have. WILLIAM JAMES, 1842-1910 Principles of Psychology, Volume I (1890)

slide-10
SLIDE 10

10

The need for speed

§ Users become used to fast response & expect it § Satisfaction increases as

response time decreases

§ Delays over 50-150 msec

are noticeable in realtime apps

§ ~250 msec is median human

reaction time

10 20 30 40 50 60 70

Snooze Meh Good Wow!

Response Time (sec)

Response Time (sec)

slide-11
SLIDE 11

11

Subtleties about response time

§ High variance can bother users

  • < ¼ of mean or > 2X the mean
  • Can help to give message if high variance

§ Unexpectedly fast results can make users apprehensive § Fast response

  • Can lead to more “input errors”
  • Makes users interact and explore more

ç Creates business value

slide-12
SLIDE 12

MemSQL Query Execution Technology

12

slide-13
SLIDE 13

13

MemSQL technology to give lightning-fast response for analytics

§ Scale-out § Compiled query § In-memory row store § Columnstore § Vectorization § Intel AVX2 SIMD

slide-14
SLIDE 14

14

MemSQL Scale-Out

§ True horizontal scaling

  • Shared nothing
  • Not shared disk

§ Hash partitioning across leaf nodes § Can resize cluster and redistribute data § Can add aggregators or leaves § Scales both transactions and analytics

Leaf

Aggregator

Leaf Leaf Leaf

Client App

slide-15
SLIDE 15

15

MemSQL Compiles Queries

§ Queries compile to machine code § Example is Row Store § First run takes compile time § 49.3 million rows/sec on 2 cores § 24.7 rows/sec/core § Compare to 1 to 2 million

rows/sec/core on interpreted DBMS

memsql> select count(*) from t; +----------+ | count(*) | +----------+ | 8388608 | +----------+ 1 row in set (0.10 sec) memsql> select count(*) from t where color = "Red"; +----------+ | count(*) | +----------+ | 4194304 | +----------+ 1 row in set (0.42 sec) ç includes compile time memsql> select count(*) from t where color = "Red"; +----------+ | count(*) | +----------+ | 4194304 | +----------+ 1 row in set (0.17 sec) ç executes from cache

slide-16
SLIDE 16

16

MemSQL Columnstore

§ On disk § 1M-row segments § Each column stored in separate file § Only read columns you touch § Highly compressed

  • Dictionary
  • Run-length
  • LZ
  • Integer value

§ Min/max per column per segment

slide-17
SLIDE 17

17

MemSQL columnstore ctd.

§ Sorted by key § Segment elimination § Compiled code built into system for handling segments § Linux file buffer caches keeps data in RAM § In-memory row store segment for new data § Background merger

slide-18
SLIDE 18

18

Vectorization

§ Process data in 4,000-row

chunks

§ a.k.a. “vector projections” § Process column vector in a tight

loop of C++

  • Filters
  • Local group-by
  • Joins

§ Few hundred million

rows/sec/core

4K-row chunk Column vector

slide-19
SLIDE 19

19

SIMD overview

▪ Intel AVX-2 ▪ 256-bit registers ▪ Pack multiple values per

register

▪ Special instructions for

SIMD register operations

▪ Arithmetic, logic, load,

store etc.

▪ Allows multiple operations

in 1 instruction

1 2 3 4 1 1 1 1 2 3 4 5

+

slide-20
SLIDE 20

20

Operations on Encoded Data in MemSQL

§ Intel AVX-2 SIMD § Filters § Group-By § Process 256-bit chunk of encoded (compressed) data at

  • nce

§ Can process > 3 billion rows/sec/core § Applied before vectorization for local group-by

slide-21
SLIDE 21

MemSQL Confidential 21

Encoded data example

§ Dictionary encoding § Values:

  • Green: 00
  • Red: 01
  • Blue: 10

§

select color, count(*) from t group by color

Red Red Blue Green Red Blue

01 01 10 00 01 10

6 values in only 12 bits! SIMD can process multiple 2-bit values at once

slide-22
SLIDE 22

DEMO

22

slide-23
SLIDE 23

The Hardware

Leaf

Aggregator

Leaf Leaf Leaf Leaf Leaf Leaf Leaf

2 x Intel Xeon Platinum 8180 CPU @ 2.50GHz, 28 cores, “Skylake”

Total leaf cores = 8 x 2 x 28 = 448

slide-24
SLIDE 24

24

The data

§ Synthetically-generated

stock trades

§ 57.8 billion rows

slide-25
SLIDE 25

25

How big is a trillion?

Dollar amount of a football field covered with stacks of $100 bills 6 feet high Number of tweets in 5 years Number of text messages in the world in 45 days More than the number of checkout transactions at Walmart since it was founded

slide-26
SLIDE 26

Drum Roll Please!

26

slide-27
SLIDE 27

27

The results

  • Avg query time: 0.0525 sec
  • 57.8 billion / 0.0525 =

1.10 trillion rows/sec

slide-28
SLIDE 28

28

What does it mean?

§ You can encourage analytic exploration § The technology exists to meet these challenges:

  • Expectation of interactive response
  • Data explosion
  • Higher concurrency demands
  • Preference for SQL
  • Real-time update
  • Need to run on economical hardware
slide-29
SLIDE 29

Thank You!