Architecting a Low-Latency Schemaless SQL Engine, Igor Canadi



slide-1
SLIDE 1

Architecting a Low-Latency Schemaless SQL Engine

Igor Canadi, Rockset

slide-2
SLIDE 2

About

Igor
  • Rockset
  • Facebook
  • RocksDB
  • GraphQL

Rockset
  • Search and analytics engine
  • Enables data-driven applications

slide-3
SLIDE 3

Overview

  • Hardware and people efficiency
  • Designing systems for people efficiency

    ○ Schemaless SQL
    ○ Converged indexing
    ○ Serverless architecture

slide-4
SLIDE 4

Hardware Efficiency

slide-5
SLIDE 5
slide-6
SLIDE 6
slide-7
SLIDE 7

Hardware Efficiency

  • Faster databases ~= less hardware
  • How much hardware do I need?
  • Important, but not the only thing that matters
slide-8
SLIDE 8

People Efficiency

slide-9
SLIDE 9

People Efficiency

  • How many people do I need?
  • How much time do I need?
slide-10
SLIDE 10

People Efficiency - Configuration

“My query is slow”
“Do you have an index?”
“What’s your partition key?”
“What’s your buffer size?”
“You should hire a DBA”

slide-11
SLIDE 11

People Efficiency - Organizational Friction

  • Pre-cloud era: Application developers blocked on provisioning

  • Data scientists blocked on data engineers
slide-12
SLIDE 12

People Efficiency - Pipelines

slide-13
SLIDE 13

Hardware vs. People Efficiency

  • Hardware is frequently cheaper than people
  • Increase hardware efficiency - spend less money
  • Increase people efficiency - spark creativity
slide-14
SLIDE 14

Designing Systems for People Efficiency

slide-15
SLIDE 15

Rockset

  • Search and analytics engine
  • “Shortest path from data to applications”
  • Connect to data sources or streams
  • Execute fast queries
slide-16
SLIDE 16

Schemaless SQL

slide-17
SLIDE 17

Choosing the Query Language

  • SQL is the obvious choice
  • Maximize usefulness
  • Existing knowledge
  • Ecosystem of tools
slide-18
SLIDE 18

Querying existing data sources

Web/Mobile Email/Docs Sensors OLTP Social Data Lake Files Logs

SQL

slide-19
SLIDE 19

Querying existing data sources

Web/Mobile Email/Docs Sensors OLTP Social Data Lake Files Logs

SQL ...but first, let me define a schema

slide-20
SLIDE 20

SQL Schema

  • Drag on people efficiency
  • Messy data
  • Complex ETL jobs
slide-21
SLIDE 21

Schemaless SQL

  • “Smart schema”
  • Frictionless data onboarding
  • Data scientists no longer blocked on data engineers
  • Performance overhead?

https://rockset.com/blog/using-smart-schema-to-accelerate-insights-from-nested-json/

slide-22
SLIDE 22

Schemaless SQL - Storage

Strict schema
    Schema: name: String, age: Int
    Data:   John, 35

Schemaless
    “name”: S “John”
    “age”: I 35

Schemaless (with field interning)
    0: S “John”
    1: I 35
    name: 0, age: 1
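
The three storage layouts above can be sketched in a few lines of Python. This is a minimal illustration only, not Rockset's storage format: the `FieldInterner` class, the `encode` function, and the `TYPE_TAGS` table are hypothetical names, and only the `S` and `I` type tags shown on the slide are modelled.

```python
# Hypothetical sketch of field interning; not Rockset's actual storage format.
from typing import Any

TYPE_TAGS = {str: "S", int: "I"}  # type tags as shown on the slide


class FieldInterner:
    """Maps each distinct field name to a small integer ID."""

    def __init__(self) -> None:
        self.ids: dict[str, int] = {}

    def intern(self, name: str) -> int:
        # Assign the next free ID the first time a field name is seen.
        return self.ids.setdefault(name, len(self.ids))


def encode(doc: dict[str, Any], interner: FieldInterner) -> dict[int, tuple[str, Any]]:
    """Store each value with its type tag, keyed by interned field ID."""
    return {interner.intern(k): (TYPE_TAGS[type(v)], v) for k, v in doc.items()}


interner = FieldInterner()
row = encode({"name": "John", "age": 35}, interner)
```

Here `row` holds the typed values keyed by field ID, while `interner.ids` holds the one shared copy of the field names, so a long field name is stored once rather than once per document.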

slide-23
SLIDE 23

Schemaless SQL - Query Execution

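[Figure: two columns of five rows each, shown in three layouts]

Strict schema
    Column 1: 1 10 7 4 5
    Column 2: a b c d e

Schemaless
    Column 1: I 1, I 10, I 7, I 4, I 5
    Column 2: S a, S b, I 3, I 5, S e

Schemaless (with type hoisting)
    Column 1: I | 1 10 7 4 5
    Column 2: M | S a, S b, I 3, I 5, S e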
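
Type hoisting, as drawn in the figure, can be approximated as follows. This is a sketch under assumptions: the `hoist` function is a hypothetical name, and the `M` tag is read here as "mixed types", which the slide does not spell out.

```python
# Hypothetical sketch of type hoisting; "M" is assumed to mean a mixed-type column.
from typing import Any

TYPE_TAGS = {int: "I", str: "S"}


def hoist(column: list[Any]) -> tuple[str, list[Any]]:
    """Return (tag, values).

    If every value shares one type, the tag is stored once for the whole
    column. Otherwise the column is tagged "M" and each value keeps its own tag.
    """
    tags = [TYPE_TAGS[type(v)] for v in column]
    if len(set(tags)) == 1:
        return tags[0], list(column)
    return "M", list(zip(tags, column))


uniform = hoist([1, 10, 7, 4, 5])
mixed = hoist(["a", "b", 3, 5, "e"])
```

With the tag hoisted, a query operator can check the type once per uniform column instead of once per value, which is where the reduced execution overhead would come from.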

slide-24
SLIDE 24

Schemaless SQL

  • Superior user experience
  • Field interning reduces storage overhead
  • Type hoisting reduces query execution overhead
slide-25
SLIDE 25

Converged indexing

slide-26
SLIDE 26

Converged Indexing

  • “Query is slow because of the missing index”
slide-27
SLIDE 27

Converged Indexing

  • “Query is slow because of the missing index”

Index all the fields!

slide-28
SLIDE 28

Background on Indexing

  • Columnar storage
  • Search indexing
slide-29
SLIDE 29

Columnar Storage

  • Store each column separately
  • Great compression
  • Only fetch columns the query needs

slide-30
SLIDE 30

Columnar Storage

  • Store each column separately
  • Great compression
  • Only fetch columns the query needs

<doc 0> { “name”: “Igor”, “interests”: [“databases”, “snowboarding”], “last_active”: 2019/3/15 }
<doc 1> { “name”: “Dhruba”, “interests”: [“cars”, “databases”], “last_active”: 2019/3/22 }

“name”
    0: Igor
    1: Dhruba

“interests”
    0.0: databases
    0.1: snowboarding
    1.0: cars
    1.1: databases

“last_active”
    0: 2019/3/15
    1: 2019/3/22
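
The column layout above can be reproduced with a small shredding routine. This is a sketch only: `shred` is a hypothetical helper, dates are kept as plain strings, and the `doc.position` addressing for array elements is inferred from the figure.

```python
# Hypothetical sketch: shred documents into one mapping per column.
from typing import Any


def shred(docs: list[dict[str, Any]]) -> dict[str, dict[str, Any]]:
    """Return {field: {position: value}}.

    Scalars are addressed by document ID ("0"); array elements by
    document ID and array index ("0.1"), following the figure.
    """
    columns: dict[str, dict[str, Any]] = {}
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            col = columns.setdefault(field, {})
            if isinstance(value, list):
                for pos, item in enumerate(value):
                    col[f"{doc_id}.{pos}"] = item
            else:
                col[str(doc_id)] = value
    return columns


docs = [
    {"name": "Igor", "interests": ["databases", "snowboarding"], "last_active": "2019/3/15"},
    {"name": "Dhruba", "interests": ["cars", "databases"], "last_active": "2019/3/22"},
]
columns = shred(docs)
```

A query that only needs `name` reads `columns["name"]` and never touches the other two columns.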

slide-31
SLIDE 31

Columnar Storage

Advantages
  • Cost effective
  • Narrow queries, wide tables
  • Scan queries
  • Analytical queries

Disadvantages
  • High write latency
  • High minimum read latency
  • Not suitable for online applications

slide-32
SLIDE 32

Search Indexing

  • For each value, store documents containing that value (posting list)
  • Quickly retrieve a list of document IDs that match a predicate

slide-33
SLIDE 33

Search Indexing

  • For each value, store documents containing that value (posting list)
  • Quickly retrieve a list of document IDs that match a predicate

<doc 0> { “name”: “Igor”, “interests”: [“databases”, “snowboarding”], “last_active”: 2019/3/15 }
<doc 1> { “name”: “Dhruba”, “interests”: [“cars”, “databases”], “last_active”: 2019/3/22 }

“name”
    Dhruba: 1
    Igor: 0

“interests”
    databases: 0.0; 1.1
    cars: 1.0
    snowboarding: 0.1

“last_active”
    2019/3/15: 0
    2019/3/22: 1
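
A posting-list index over the same two documents might look like the sketch below. The names `build_search_index` and `lookup` are hypothetical, and for brevity the posting lists hold document IDs only, not the array positions (`0.0`, `1.1`) shown in the figure.

```python
# Hypothetical sketch of a search (inverted) index with posting lists.
from collections import defaultdict
from typing import Any


def build_search_index(docs: list[dict[str, Any]]) -> dict[str, dict[Any, list[int]]]:
    """Return {field: {value: [doc IDs containing that value]}}."""
    index: dict[str, dict[Any, list[int]]] = defaultdict(lambda: defaultdict(list))
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():
            items = value if isinstance(value, list) else [value]
            for item in items:
                postings = index[field][item]
                # Skip repeats so each document appears once per value.
                if not postings or postings[-1] != doc_id:
                    postings.append(doc_id)
    return index


def lookup(index: dict[str, dict[Any, list[int]]], field: str, value: Any) -> list[int]:
    """Return the posting list for field = value (empty if absent)."""
    return index.get(field, {}).get(value, [])


docs = [
    {"name": "Igor", "interests": ["databases", "snowboarding"], "last_active": "2019/3/15"},
    {"name": "Dhruba", "interests": ["cars", "databases"], "last_active": "2019/3/22"},
]
index = build_search_index(docs)
```

An equality predicate such as `interests = 'databases'` becomes a single `lookup` call, with no scan over the documents.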

slide-34
SLIDE 34

Search Indexing

Advantages
  • High selectivity queries
  • Low latency queries
  • Suitable for online applications

Disadvantages
  • Slower analytical queries

slide-35
SLIDE 35

Converged Indexing

  • Columnar and search indexes in the same system
  • Built on top of key-value store abstraction
  • Each document maps to many key-value pairs

slide-36
SLIDE 36

Converged Indexing

  • Columnar and search indexes in the same system
  • Built on top of key-value store abstraction
  • Each document maps to many key-value pairs

<doc 0> { “name”: “Igor” }
<doc 1> { “name”: “Dhruba” }

Key                 Value     Index
R.0.name            Igor      Row store
R.1.name            Dhruba    Row store
C.name.0            Igor      Column store
C.name.1            Dhruba    Column store
S.name.Dhruba.1               Search index
S.name.Igor.0                 Search index
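
The key layout in the table can be generated mechanically. The sketch below assumes flat documents with scalar values and uses a plain Python dict as a stand-in for the key-value store; `to_key_values` and `scan_prefix` are hypothetical helpers, and a real encoding would need to escape separators and encode values more carefully.

```python
# Hypothetical sketch: one document fans out into row, column and search keys.
from typing import Any


def to_key_values(doc_id: int, doc: dict[str, Any]) -> dict[str, Any]:
    """Map a flat document to the R./C./S. keys shown in the table."""
    kvs: dict[str, Any] = {}
    for field, value in doc.items():
        kvs[f"R.{doc_id}.{field}"] = value          # row store
        kvs[f"C.{field}.{doc_id}"] = value          # column store
        kvs[f"S.{field}.{value}.{doc_id}"] = ""     # search index: the key carries everything
    return kvs


def scan_prefix(store: dict[str, Any], prefix: str) -> list[tuple[str, Any]]:
    """Emulate an ordered prefix scan over the key-value store."""
    return [(k, store[k]) for k in sorted(store) if k.startswith(prefix)]


store: dict[str, Any] = {}
store.update(to_key_values(0, {"name": "Igor"}))
store.update(to_key_values(1, {"name": "Dhruba"}))
```

Scanning the prefix `C.name.` reads the whole `name` column, while scanning `S.name.Igor.` returns only the keys for documents whose name is Igor.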

slide-37
SLIDE 37

Converged Indexing - Queries

  • Fast analytical queries + fast search queries
  • Optimizer picks between the columnar store and the search index

slide-38
SLIDE 38

Converged Indexing - Queries

  • Fast analytical queries + fast search queries
  • Optimizer picks between the columnar store and the search index

Search index:
    SELECT * FROM search_logs
    WHERE keyword = 'datacouncil' AND locale = 'en'

Columnar store:
    SELECT keyword, count(*) FROM search_logs
    GROUP BY keyword ORDER BY count(*) DESC
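
The choice between the two access paths can be illustrated with a deliberately simple rule. This is a toy heuristic and not Rockset's optimizer, which the slides do not describe: the `Query` class and `choose_access_path` function are hypothetical, and a real optimizer would estimate selectivity and cost rather than just check for filters.

```python
# Toy access-path chooser; an illustration, not Rockset's optimizer.
from dataclasses import dataclass, field


@dataclass
class Query:
    equality_filters: dict[str, str] = field(default_factory=dict)
    group_by: list[str] = field(default_factory=list)


def choose_access_path(query: Query) -> str:
    """Equality filters favour the search index; queries that must
    read every row of a few columns favour the columnar store."""
    if query.equality_filters:
        return "search index"
    return "columnar store"


point_lookup = Query(equality_filters={"keyword": "datacouncil", "locale": "en"})
top_keywords = Query(group_by=["keyword"])
```

The two `Query` objects correspond to the two SQL statements on the slide: the filtered lookup goes to the search index and the full aggregation goes to the columnar store.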

slide-39
SLIDE 39

Converged Indexing - Writes

  • One document write results in many key-value store writes
  • Use a write-optimized key-value store: RocksDB

slide-40
SLIDE 40

Converged Indexing - Writes

  • One document write results in many key-value store writes
  • Use a write-optimized key-value store: RocksDB

[Diagram labels: Memory Manager, Memory Buffer, new keys, Storage, SST 1, SST 2, SST 3, SST 4, background compaction]
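
The write path in the diagram (new keys land in a memory buffer, get flushed to sorted SST files, and are merged by compaction) can be mimicked with a small in-memory model. `TinyLSM` is a hypothetical teaching class that does not use the RocksDB API; here compaction runs only when called, whereas RocksDB runs it in the background.

```python
# Hypothetical in-memory model of a log-structured write path; not the RocksDB API.
from typing import Any, Optional


class TinyLSM:
    def __init__(self, memtable_limit: int = 2) -> None:
        self.memtable: dict[str, Any] = {}
        self.ssts: list[list[tuple[str, Any]]] = []  # oldest first, each sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key: str, value: Any) -> None:
        # Writes only touch the memory buffer, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self) -> None:
        # Persist the buffer as a new immutable, sorted SST.
        if self.memtable:
            self.ssts.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str) -> Optional[Any]:
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.ssts):  # newest SST wins
            for k, v in sst:
                if k == key:
                    return v
        return None

    def compact(self) -> None:
        # Merge all SSTs into one, keeping the newest value per key.
        merged: dict[str, Any] = {}
        for sst in self.ssts:
            merged.update(sst)
        self.ssts = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)
```

Because a write never rewrites existing files, the many key-value writes produced by one document stay cheap; the cost is deferred to compaction.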

slide-41
SLIDE 41

Converged Indexing

  • Fast queries out of the box
  • Real-time index writes

More efficient
  • Database configuration
  • Queries

Less efficient
  • Storage
  • Writes
slide-42
SLIDE 42

Serverless Architecture

slide-43
SLIDE 43

Serverless Architecture

  • Rockset is a cloud service
  • No need to manage hardware
  • Seamless autoscale

slide-44
SLIDE 44

Storage in the Cloud

  • Data is sharded across leaves

[Diagram: Rockset SQL API, two Aggregators, three Leaves (each running RocksDB), Distributed Log]

slide-45
SLIDE 45

Storage in the Cloud

  • Data is sharded across leaves
  • RocksDB-Cloud keeps a consistent copy in cloud object storage

[Diagram: Rockset SQL API, two Aggregators, three Leaves (each running RocksDB with RocksDB-Cloud), Distributed Log, Object Storage (AWS S3, GCS, Minio, ...) holding SST files]
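
Sharding across leaves with an aggregator on top can be sketched as below. The slides do not say how documents are assigned to leaves, so hash sharding by document ID is an assumption here, and `leaf_for`, `Leaf`, and `aggregate_count_by` are hypothetical names.

```python
# Hypothetical sketch: hash-sharded leaves and an aggregator that merges partial results.
import zlib
from collections import Counter
from typing import Any


def leaf_for(doc_id: str, num_leaves: int) -> int:
    """Pick a leaf by hashing the document ID (assumed sharding scheme)."""
    return zlib.crc32(doc_id.encode()) % num_leaves


class Leaf:
    def __init__(self) -> None:
        self.docs: dict[str, dict[str, Any]] = {}

    def write(self, doc_id: str, doc: dict[str, Any]) -> None:
        self.docs[doc_id] = doc

    def count_by(self, field: str) -> Counter:
        # Each leaf computes a partial aggregate over its own shard.
        return Counter(d[field] for d in self.docs.values() if field in d)


def aggregate_count_by(leaves: list[Leaf], field: str) -> Counter:
    """Aggregator: fan out to every leaf and merge the partial counts."""
    total: Counter = Counter()
    for leaf in leaves:
        total.update(leaf.count_by(field))
    return total


leaves = [Leaf() for _ in range(3)]
events = {"d1": "datacouncil", "d2": "rocksdb", "d3": "datacouncil", "d4": "sql"}
for doc_id, keyword in events.items():
    leaves[leaf_for(doc_id, len(leaves))].write(doc_id, {"keyword": keyword})
result = aggregate_count_by(leaves, "keyword")
```

The merged `result` is the same regardless of how the four documents happen to be distributed over the three leaves.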

slide-46
SLIDE 46

Adding new read replica

  • Copy data to a new leaf

[Diagram: Rockset SQL API, two Aggregators, Distributed Log, four Leaves (each running RocksDB with RocksDB-Cloud), Object Storage (AWS S3, GCS, Minio, ...) holding SST files]
slide-47
SLIDE 47

Adding new read replica

  • Copy data to a new leaf
  • Tail new updates from log

[Diagram: Rockset SQL API, two Aggregators, Distributed Log, four Leaves (each running RocksDB with RocksDB-Cloud), Object Storage (AWS S3, GCS, Minio, ...) holding SST files]
slide-48
SLIDE 48

Adding new read replica

  • Copy data to a new leaf
  • Tail new updates from log
  • Able to serve more queries

[Diagram: Rockset SQL API, two Aggregators, Distributed Log, four Leaves (each running RocksDB with RocksDB-Cloud), Object Storage (AWS S3, GCS, Minio, ...) holding SST files]
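
The three steps for bringing up a read replica can be sketched as follows. This is a simplified model under assumptions: the snapshot stands in for the SST files copied from object storage, the log is a plain list, and the per-entry sequence numbers used to avoid re-applying old updates are not taken from the slides. `Replica`, `bootstrap`, and `tail` are hypothetical names.

```python
# Hypothetical sketch: bootstrap a replica from a snapshot, then tail the log.
from typing import Any


class Replica:
    def __init__(self) -> None:
        self.data: dict[str, Any] = {}
        self.applied_seq = 0

    def bootstrap(self, snapshot: dict[str, Any], snapshot_seq: int) -> None:
        # Step 1: copy data to the new leaf (here, a snapshot of the SST contents).
        self.data = dict(snapshot)
        self.applied_seq = snapshot_seq

    def tail(self, log: list[tuple[int, str, Any]]) -> None:
        # Step 2: apply only the updates newer than what the snapshot contains.
        for seq, key, value in log:
            if seq > self.applied_seq:
                self.data[key] = value
                self.applied_seq = seq

    def get(self, key: str) -> Any:
        # Step 3: once caught up, the replica can serve queries.
        return self.data.get(key)


log = [(1, "a", 1), (2, "b", 2), (3, "a", 3)]
snapshot, snapshot_seq = {"a": 1, "b": 2}, 2  # state as of sequence number 2

replica = Replica()
replica.bootstrap(snapshot, snapshot_seq)
replica.tail(log)
```

Because the bulk of the data comes from object storage and only the recent tail comes from the log, existing leaves do not have to stream their full contents to the new one.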

slide-49
SLIDE 49

Conclusion

slide-50
SLIDE 50

Conclusion

  • Schemaless SQL: no need to configure schema
  • Converged indexing: no need to configure indexes
  • Serverless architecture: no need to configure servers

slide-51
SLIDE 51

Conclusion

  • Rockset - “shortest path from data to applications”
  • Making workflows easy catalyzes creativity

slide-52
SLIDE 52

Thank you