maximilian michels stadtlegende max crate io mxm apache
play

Maximilian Michels @stadtlegende max@crate.io mxm@apache.org 2 W - PowerPoint PPT Presentation

C R AT E D B : A S E A R C H E N G I N E O R A D ATA B A S E ? B O T H ! H O W W E B U I LT A S Q L D ATA B A S E O N T O P O F E L A S T I C S E A R C H A N D L U C E N E Maximilian Michels @stadtlegende max@crate.io


  1. C R AT E D B : A S E A R C H E N G I N E O R A D ATA B A S E ? B O T H ! H O W W E B U I LT A S Q L D ATA B A S E O N T O P O F E L A S T I C S E A R C H A N D L U C E N E Maximilian Michels @stadtlegende max@crate.io mxm@apache.org

  2. 2 W H Y A R E W E TA L K I N G A B O U T T H I S ? • Traditional databases are well-researched and there are plenty of them (Postgres, MySQL, Oracle…) • Scalable search using these can be tricky • Search engines are databases optimized for search and scale (Lucene, Solr, Elasticsearch) • You can’t typically use SQL with Search Engines • Why not stick with an mature query language standard which everybody knows?

  3. “A scalable SQL database optimized for search without the NoSQL bullshit.”

  4. 4 C R AT E D B I N A N U T S H E L L • Since 2014: https://github.com/crate/crate • Apache 2.0 licensed (community edition) • Built using Elasticsearch, Lucene, Netty, Antlr, … • SQL-99 compatible • REST / Postgres Wire Protocol / JDBC / Python …

  5. 5 W H AT T O E X P E C T • What is great about CrateDB • Easy to setup • No funny APIs / SQL • Great scale out - Massive reads / writes • Container aware • Not so great • Transactions • Foreign keys

  6. U S I N G C R AT E D B

  7. 7 C R AT E D B I S J U S T L I K E A S Q L D B • SQL is the only query API •CREATE TABLE fosdem.speakers (id int PRIMARY KEY, name string) •CREATE TABLE fosdem.talks (id INT PRIMARY KEY, title STRING, abstract STRING, speaker INT); •INSERT INTO fosdem.speakers (id, name) VALUES (1, ’max’) •INSERT INTO fosdem.talks (id, title, abstract, speaker) VALUES (1, ‘Talk about CrateDB’, ‘bla’, 1) •SELECT * FROM fosdem.talks t1 LEFT JOIN fosdem.speakers t2 ON t1.id = t2.id

  8. 8 B U T T H E R E I S M O R E • CrateDB denormalized (no joins necessary) • CREATE TABLE fosdem.speakers (name STRING, talk OBJECT AS (title STRING, abstract STRING)) • INSERT INTO fosdem.speakers (name, talk) VALUES (‘max’, {title = ‘CrateDB’, abstract = ‘Lorem ipsum’}) • SELECT talk[‘title’] as title FROM fosdem.speakers ORDER BY title

  9. 9 C L U S T E R I N G / R E P L I C AT I O N N O D E 1 N O D E 2 N O D E 3 N O D E 4 • CREATE TABLE fosdem.speakers (name STRING, talk OBJECT AS (title STRING, abstract STRING)) • CLUSTERED BY name into 4 shards S H A R D

  10. 10 C L U S T E R I N G / R E P L I C AT I O N N O D E 1 N O D E 2 N O D E 3 N O D E 4 • CREATE TABLE fosdem.speakers (name STRING, talk OBJECT AS (title STRING, abstract STRING)) • CLUSTERED BY name into 4 shards • WITH (number_of_replicas = 1) P R I M A RY R E P L I C A

  11. 11 PA R T I T I O N E D TA B L E S N O D E 1 N O D E 2 N O D E 3 N O D E 4 • CREATE TABLE fosdem.speakers (name STRING, talk OBJECT as (title = STRING, abstract = STRING), year INT) • CLUSTERED BY name into 4 shards • PARTITIONED BY (year, …) • WITH (number_of_replicas = 1) P R I M A RY R E P L I C A

  12. 12 M O R E F E AT U R E S • Aggregations • Geo search • Text Analyzers • UDFs • Snapshots • User management • Schema / Table privileges • SSL encryption • MQTT Ingestion

  13. A R C H I T E C T U R E

  14. 14 O N T H E S H O U L D E R S O F G I A N T S • CrateDB: Distributed SQL Execution Engine • Antlr: Parsing of SQL statements • Netty: REST, Postgres Wire Protocol, Web interface • Lucene: Storage, Indexing, Queries • Elasticsearch: Transport, Routing, Replication

  15. 15 I N T R O D U C T I O N T O • Lucene stores documents which are CrateDB’s rows • Documents have fields • { _id : ‘123’, 
 name : ‘Bob’, 
 title : ‘How I Learned to Stop Worrying 
 and Love the Bomb’, 
 text : ‘Lorem ipsum…' 
 } • Fields are indexed for efficient lookup • Fields have column store for efficient aggregation

  16. 16 I N T R O D U C T I O N T O E L A S T I C S E A R C H • Elasticsearch core concepts revolve around indices, shards, and replicas • An index is a document store with n parts, 
 called shards • Each shard has 0 or more replicas which hold copies of the shard data • Replicas are not only useful for fault tolerance but also increase the search performance

  17. 17 H O W TA B L E S R E L AT E T O I N D I C E S A N D S H A R D S "properties":{ “name":{"type":"keyword"}, • Each table in CrateDB is "talks":{"dynamic":"true", "properties":{ represented by an ES index "abstract":{"type":"keyword"}, "title":{"type":"keyword"} with a mapping } } } • Each partition in a TA B L E T 1 T 2 T 3 … partitioned table is I N D E X t1 t2.day1 t2.day2 … t3 … represented by an ES index S H A R X X X … X … D 1 • Partition indices are S H A R X X … X X … D 2 created by encoding the S H A R X X … D 3 partition value in the index S H A R X … D 4 name … …

  18. 18 F R O M Q U E RY T O E X E C U T I O N SELECT name, count(*) as talks FROM fosdem.speakers 
 • WHERE room = ‘hpc’ AND year = 2018 GROUP BY name ORDER BY name C L I E N T P S Q L W E B J D B C R U S T C R A S H P Y T H O N N O D E N O D E 1 N O D E 2 N O D E 4 R E S T / P O S T G R E S PA R S E R A N A LY Z E R P L A N N E R E X E C U T O R N O D E 3 N O D E 5 T R A N S P O RT ( E S ) S T O R A G E ( L U C E N E )

  19. 19 A R C H I T E C T U R E H I G H L I G H T S • Distributed storage / Distributed query execution N O D E 1 N O D E 3 • Masterless N O D E 2 N O D E 4 • Replication • Only ephemeral storage needed (Container aware) • Optimized for search: Indexing of all fields with Lucene (tuneable)

  20. H A N D S - O N

  21. 21 W H AT C A N Y O U D O W I T H C R AT E D B ? • Monitoring (IoT, Industry 4.0, Cyber Security) • Stream Analysis • Text Analysis • Time Series Analysis • Geospatial Queries

  22. CrateDB Web Interface D E M O

  23. CrateDB Web Interface

  24. CrateDB Web Interface

  25. CrateDB Web Interface

  26. C O N C L U S I O N

  27. 27 W H AT W E H AV E L E A R N E D • Elasticsearch used Lucene and Netty to built a distributed search engine • CrateDB used Elasticsearch, Lucene, and Netty to built a distributed SQL database • CrateDB is perfect when you • want or have to use SQL • store large amounts of structured or unstructured data • have many thousands of queries per second

  28. 28 S E E F O R Y O U R S E L F ! • Try out CrateDB • Download from https://crate.io/download/ • or $ curl try.crate.io | bash • or $ docker run crate • or build from source https://github.com/crate/crate • Check out https://crate.io/docs • Contributions welcome • Check out https://github.com/crate/crate/blob/master/devs/docs/index.rst • Check out the issues • Stackoverflow • Join our Slack channel

  29. T H A N K Y O U ! Maximilian Michels @stadtlegende max@crate.io mxm@apache.org

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend