Maximilian Michels @stadtlegende max@crate.io mxm@apache.org 2 W - - PowerPoint PPT Presentation

maximilian michels stadtlegende max crate io mxm apache
SMART_READER_LITE
LIVE PREVIEW

Maximilian Michels @stadtlegende max@crate.io mxm@apache.org 2 W - - PowerPoint PPT Presentation

C R AT E D B : A S E A R C H E N G I N E O R A D ATA B A S E ? B O T H ! H O W W E B U I LT A S Q L D ATA B A S E O N T O P O F E L A S T I C S E A R C H A N D L U C E N E Maximilian Michels @stadtlegende max@crate.io


slide-1
SLIDE 1

C R AT E D B : A S E A R C H E N G I N E O R A D ATA B A S E ? B O T H !

H O W W E B U I LT A S Q L D ATA B A S E O N T O P O F E L A S T I C S E A R C H A N D L U C E N E

Maximilian Michels @stadtlegende max@crate.io mxm@apache.org

slide-2
SLIDE 2

W H Y A R E W E TA L K I N G A B O U T T H I S ?

  • Traditional databases are well-researched and there

are plenty of them (Postgres, MySQL, Oracle…)

  • Scalable search using these can be tricky
  • Search engines are databases optimized for search

and scale (Lucene, Solr, Elasticsearch)

  • You can’t typically use SQL with Search Engines
  • Why not stick with an mature query language standard

which everybody knows?

2

slide-3
SLIDE 3

“A scalable SQL database optimized for search without the NoSQL bullshit.”

slide-4
SLIDE 4

C R AT E D B I N A N U T S H E L L

  • Since 2014: https://github.com/crate/crate
  • Apache 2.0 licensed (community edition)
  • Built using Elasticsearch, Lucene, Netty, Antlr, …
  • SQL-99 compatible
  • REST / Postgres Wire Protocol / JDBC / Python …

4

slide-5
SLIDE 5

W H AT T O E X P E C T

  • What is great about CrateDB
  • Easy to setup
  • No funny APIs / SQL
  • Great scale out - Massive reads / writes
  • Container aware
  • Not so great
  • Transactions
  • Foreign keys

5

slide-6
SLIDE 6

U S I N G C R AT E D B

slide-7
SLIDE 7

C R AT E D B I S J U S T L I K E A S Q L D B

  • SQL is the only query API
  • CREATE TABLE fosdem.speakers (id int PRIMARY KEY,

name string)

  • CREATE TABLE fosdem.talks (id INT PRIMARY KEY,

title STRING, abstract STRING, speaker INT);

  • INSERT INTO fosdem.speakers (id, name) VALUES (1,

’max’)

  • INSERT INTO fosdem.talks (id, title, abstract,

speaker) VALUES (1, ‘Talk about CrateDB’, ‘bla’, 1)

  • SELECT * FROM fosdem.talks t1 LEFT JOIN

fosdem.speakers t2 ON t1.id = t2.id

7

slide-8
SLIDE 8

B U T T H E R E I S M O R E

  • CrateDB denormalized (no joins necessary)
  • CREATE TABLE fosdem.speakers (name STRING, talk

OBJECT AS (title STRING, abstract STRING))

  • INSERT INTO fosdem.speakers (name, talk) VALUES

(‘max’, {title = ‘CrateDB’, abstract = ‘Lorem ipsum’})

  • SELECT talk[‘title’] as title FROM

fosdem.speakers ORDER BY title

8

slide-9
SLIDE 9

C L U S T E R I N G / R E P L I C AT I O N

  • CREATE TABLE fosdem.speakers (name STRING, talk

OBJECT AS (title STRING, abstract STRING))

  • CLUSTERED BY name into 4 shards

9

S H A R D N O D E 1 N O D E 2 N O D E 3 N O D E 4

slide-10
SLIDE 10

C L U S T E R I N G / R E P L I C AT I O N

  • CREATE TABLE fosdem.speakers (name STRING, talk

OBJECT AS (title STRING, abstract STRING))

  • CLUSTERED BY name into 4 shards
  • WITH (number_of_replicas = 1)

10

P R I M A RY R E P L I C A N O D E 1 N O D E 2 N O D E 3 N O D E 4

slide-11
SLIDE 11

PA R T I T I O N E D TA B L E S

  • CREATE TABLE fosdem.speakers (name STRING, talk

OBJECT as (title = STRING, abstract = STRING), year INT)

  • CLUSTERED BY name into 4 shards
  • PARTITIONED BY (year, …)
  • WITH (number_of_replicas = 1)

11

N O D E 1 N O D E 2 N O D E 3 N O D E 4 P R I M A RY R E P L I C A

slide-12
SLIDE 12

M O R E F E AT U R E S

  • Aggregations
  • Geo search
  • Text Analyzers
  • UDFs
  • Snapshots
  • User management
  • Schema / Table privileges
  • SSL encryption
  • MQTT Ingestion

12

slide-13
SLIDE 13

A R C H I T E C T U R E

slide-14
SLIDE 14

14

  • CrateDB: Distributed SQL Execution Engine
  • Antlr: Parsing of SQL statements
  • Netty: REST, Postgres Wire Protocol, Web interface
  • Lucene: Storage, Indexing, Queries
  • Elasticsearch: Transport, Routing, Replication

O N T H E S H O U L D E R S O F G I A N T S

slide-15
SLIDE 15

I N T R O D U C T I O N T O

  • Lucene stores documents which are CrateDB’s rows
  • Documents have fields
  • { _id : ‘123’, 


name : ‘Bob’,
 title : ‘How I Learned to Stop Worrying 
 and Love the Bomb’,
 text : ‘Lorem ipsum…'
 }

  • Fields are indexed for efficient lookup
  • Fields have column store for efficient aggregation

15

slide-16
SLIDE 16

I N T R O D U C T I O N T O E L A S T I C S E A R C H

  • Elasticsearch core concepts revolve around indices,

shards, and replicas

  • An index is a document store with n parts,


called shards

  • Each shard has 0 or more replicas which hold copies of

the shard data

  • Replicas are not only useful for fault tolerance but also

increase the search performance

16

slide-17
SLIDE 17

H O W TA B L E S R E L AT E T O I N D I C E S A N D S H A R D S

  • Each table in CrateDB is

represented by an ES index with a mapping

  • Each partition in a

partitioned table is represented by an ES index

  • Partition indices are

created by encoding the partition value in the index name

17 TA B L E T 1 T 2 T 3 … I N D E X

t1 t2.day1 t2.day2

t3 …

S H A R D 1

X X X

X …

S H A R D 2

X X X

X …

S H A R D 3

X X …

S H A R D 4

X …

"properties":{ “name":{"type":"keyword"}, "talks":{"dynamic":"true", "properties":{ "abstract":{"type":"keyword"}, "title":{"type":"keyword"} } } }

slide-18
SLIDE 18

N O D E 1

F R O M Q U E RY T O E X E C U T I O N

  • SELECT name, count(*) as talks FROM fosdem.speakers 


WHERE room = ‘hpc’ AND year = 2018 GROUP BY name ORDER BY name

18

C L I E N T

PA R S E R A N A LY Z E R P L A N N E R E X E C U T O R

R E S T / P O S T G R E S

T R A N S P O RT ( E S ) S T O R A G E ( L U C E N E )

W E B J D B C P S Q L C R A S H

N O D E 3 N O D E 2 N O D E 4 N O D E 5

P Y T H O N R U S T N O D E

slide-19
SLIDE 19

A R C H I T E C T U R E H I G H L I G H T S

  • Distributed storage / Distributed

query execution

  • Masterless
  • Replication
  • Only ephemeral storage needed

(Container aware)

  • Optimized for search: Indexing of

all fields with Lucene (tuneable)

19

N O D E 2 N O D E 1 N O D E 3 N O D E 4

slide-20
SLIDE 20

H A N D S - O N

slide-21
SLIDE 21

21

  • Monitoring (IoT, Industry 4.0, Cyber Security)
  • Stream Analysis
  • Text Analysis
  • Time Series Analysis
  • Geospatial Queries

W H AT C A N Y O U D O W I T H C R AT E D B ?

slide-22
SLIDE 22

D E M O

CrateDB Web Interface

slide-23
SLIDE 23

CrateDB Web Interface

slide-24
SLIDE 24

CrateDB Web Interface

slide-25
SLIDE 25

CrateDB Web Interface

slide-26
SLIDE 26

C O N C L U S I O N

slide-27
SLIDE 27

W H AT W E H AV E L E A R N E D

  • Elasticsearch used Lucene and Netty to built a distributed

search engine

  • CrateDB used Elasticsearch, Lucene, and Netty to built a

distributed SQL database

  • CrateDB is perfect when you
  • want or have to use SQL
  • store large amounts of structured or unstructured data
  • have many thousands of queries per second

27

slide-28
SLIDE 28

S E E F O R Y O U R S E L F !

  • Try out CrateDB
  • Download from https://crate.io/download/
  • or $ curl try.crate.io | bash
  • or $ docker run crate
  • or build from source https://github.com/crate/crate
  • Check out https://crate.io/docs
  • Contributions welcome
  • Check out https://github.com/crate/crate/blob/master/devs/docs/index.rst
  • Check out the issues
  • Stackoverflow
  • Join our Slack channel

28

slide-29
SLIDE 29

T H A N K Y O U !

Maximilian Michels @stadtlegende max@crate.io mxm@apache.org