Turbocharge your MySQL analytics with ElasticSearch
Guillaume Lefranc, Data & Infrastructure Architect, Productsup GmbH
Percona Live Europe 2017
About the Speaker Guillaume Lefranc
- Data Architect at Productsup
- Lead Architect of Replication Manager for MySQL and MariaDB
- DBA Manager at MariaDB Corporation
- Infrastructure Consultant at Coinigy
- DB Architect at dailymotion.com
Takeaways
In this presentation, we will cover:
- How Elasticsearch works as a document and column store
- Its strengths and weaknesses when it comes to analytics
- How to sync data with MySQL
- How to build aggregations
Case study
Disclaimer: This case study is about medium data, not big data.
What is medium data? Answer: from 100 GB to a few TBs.
Types of data:
- User activity (clicks)
- Market data
- Trips
Use Cases
Business cases:
- A ride-sharing app
  Example dataset: NYC Taxi Data (6 months: 78 million trips)
- Cryptocurrency market data
  200 million documents per month - courtesy of coinigy.com
MySQL and Medium Data
- Medium data can scale well in MySQL (SELECT … WHERE id = ?)
- … but not with analytics
- Not every case can be covered by an index
- Aggregations can be slow, especially if doing table scans
Elasticsearch - What?
- "You know, for Search" -> Inverted Index
- Document Store
- REST API
  ○ POST /market_data -d '{ "market": "BTC/USD", "value": "3418.03953" }'
- JSON Native
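The "inverted index" idea behind "You know, for Search" can be sketched in a few lines of Python. This is a toy illustration, not Lucene's actual data structures: each token maps to the set of documents that contain it, and a search intersects those sets.

```python
from collections import defaultdict

# Toy corpus: document ID -> text.
docs = {
    1: "BTC/USD trade on Bitstamp",
    2: "ETH/USD trade on Coinbase",
    3: "BTC/EUR quote on Bitstamp",
}

# Inverted index: token -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().replace("/", " ").split():
        index[token].add(doc_id)

def search(*tokens):
    """AND-search: intersect the posting sets of every token."""
    postings = [index[t.lower()] for t in tokens]
    return sorted(set.intersection(*postings)) if postings else []

print(search("btc", "bitstamp"))  # -> [1, 3]
```

Instead of scanning every document for a term, the query looks up two small posting lists and intersects them, which is why search stays fast as the corpus grows.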
Elasticsearch - Anatomy of a Document
{
  "_index" : "raw-2016-07",
  "_type" : "market_raw",
  "_id" : "108051174765",
  "_score" : 7.455347,
  "_source" : {
    "quantity" : 0.64130859,
    "time_local" : "2016-07-12 06:45:34",
    "type" : "SELL",
    "market" : "USD/BTC",
    "total" : 414.52263332,
    "@timestamp" : "2016-07-12T06:45:34.000Z",
    "price" : 646.37,
    "exchange" : "BITS",
    "id" : 108051174765,
    "tradeid" : "11649948"
  }
}
Elasticsearch - Why?
- Distributed
- Fault Tolerant
- Scales Horizontally
Elasticsearch - Column Store
- All fields are indexed by default
- A query is (almost) always an index scan
- Doc values
  ○ Field values serialized on disk
  ○ Not stored in the JVM heap: relies on the OS cache
  ○ Compressed on disk
  ○ Enabled by default for all numerics, geo_points, dates, IPs and not_analyzed strings (keywords)
- Not a general-purpose column-store replacement
- Not exactly fast as a document DB
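Why do doc values make aggregations fast? They store each field column-wise, so a metric only has to scan one contiguous array per field instead of materializing whole documents. A rough Python analogy (not the actual on-disk format):

```python
# Row store: one dict per document, like the _source of each hit.
rows = [
    {"driver_id": 1, "total_amount": 11.0},
    {"driver_id": 2, "total_amount": 23.5},
    {"driver_id": 1, "total_amount": 8.0},
]

# Column store ("doc values"): one array per field, aligned by doc ordinal.
columns = {
    "driver_id":    [r["driver_id"] for r in rows],
    "total_amount": [r["total_amount"] for r in rows],
}

# An aggregation like SUM(total_amount) touches only one column,
# never loading driver_id or any other field.
total = sum(columns["total_amount"])
print(total)  # -> 42.5
```

The columnar layout is also what makes the compression mentioned above effective: values of one field tend to be similar, so they encode well together.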
Bibliography:
- https://www.elastic.co/blog/elasticsearch-as-a-column-store
- https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html
Alternatives (Open Source)
- MariaDB Columnstore
- Yandex Clickhouse
- Apache Spark
Bibliography:
- https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/
MySQL Analytics: Performance
- Fast if the data is in the index
- Very slow if linear scans have to be used
Example:
SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips WHERE driver_id=102
GROUP BY date ORDER BY date DESC;
-> returns in milliseconds

SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips
GROUP BY date ORDER BY date DESC;
-> 4 minutes
Importing data: Logstash
- Logstash is an open source data collection engine with real-time pipelining capabilities.
- The L in ELK Stack
- ETL for ElasticSearch
- Pipeline model (input -> filter -> output)
Importing data - Input
input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
    jdbc_user => "root"
    jdbc_password => "admin"
    jdbc_fetch_size => 100000
    statement => "SELECT id, driver_id, passenger_id, pickup_datetime, dropoff_datetime,
                  CONCAT(pickup_latitude, ',', pickup_longitude) AS pickup_location,
                  CONCAT(dropoff_latitude, ',', dropoff_longitude) AS dropoff_location,
                  payment_type, total_amount FROM trips WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => id
    #schedule => "*/5 * * * *"
  }
}
Importing data - Input
- Dealing with large result sets
  jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
  jdbc_fetch_size => 10000
- SQL Last Value
  use_column_value => true
  tracking_column => id
  statement => "SELECT … WHERE id > :sql_last_value"
- Scheduler
  schedule => "*/5 * * * *"
Reference: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
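The :sql_last_value bookkeeping amounts to: remember the highest id imported so far, and only ask MySQL for newer rows on the next scheduled run. A simplified Python sketch of that loop (not Logstash's implementation, which persists the value in a metadata file between runs):

```python
# Simulated source table, ordered by auto-increment id.
trips = [{"id": i, "total_amount": float(i)} for i in range(1, 8)]

sql_last_value = 0  # Logstash persists this between runs

def fetch_batch():
    """Emulates: SELECT ... FROM trips WHERE id > :sql_last_value"""
    global sql_last_value
    batch = [t for t in trips if t["id"] > sql_last_value]
    if batch:
        # tracking_column => id: remember the highest value seen.
        sql_last_value = max(t["id"] for t in batch)
    return batch

first = fetch_batch()   # initial run imports all 7 rows
trips.append({"id": 8, "total_amount": 8.0})
second = fetch_batch()  # next scheduled run picks up only the new row
print(len(first), len(second))  # -> 7 1
```

This is why the tracking column should be monotonically increasing (an auto-increment id or an updated-at timestamp); otherwise rows can be silently skipped.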
Importing Data - Filters
Filters (Input Transformation):
filter {
  mutate { convert => [ "pickup_datetime", "string" ] }
  date { match => [ "pickup_datetime", "ISO8601" ] }
}
- The date field is used for partitioning; if there is no suitable field, Elasticsearch will use the current date and time
Importing Data - Output
output {
  elasticsearch {
    hosts => [ "es1:9200" ]
    user => "elastic"
    password => "elasticpassword"
    index => "taxi-%{+YYYY-MM}"
    document_type => "trips"
    document_id => "%{id}"
  }
}
Importing Data - Output
- Document partitioning
index => "taxi-%{+YYYY-MM}"
- Matching ID with MySQL
document_id => "%{id}"
Schema Design - Indexes
- Document Partitioning
- ElasticSearch Types
- Number of Indices
- Number of Shards
- Replication Factor
Schema Design - Indexes
PUT /taxi
{
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
    }
  }
}
PUT _template/template1
{
  "template": "taxi*",
  "settings": { ... }
}
Schema Design - Mapping
- Indexing with the optimal type
- Avoiding Full Text Search indexing (aka text or “analyzed”)
- Type overview
  ○ keyword
  ○ long
  ○ byte
  ○ date
  ○ geo_point
  ○ scaled_float
Schema Design - Mapping
{
  "mappings": {
    "trips": {
      "properties": {
        "id": { "type": "long" },
        "driver_id": { "type": "long" },
        "passenger_id": { "type": "long" },
        "pickup_datetime": { "type": "date" },
        "dropoff_datetime": { "type": "date" },
        "passenger_count": { "type": "byte" },
        "pickup_location": { "type": "geo_point" },
        "dropoff_location": { "type": "geo_point" },
        "payment_type": { "type": "byte" },
        "total_amount": { "type": "scaled_float", "scaling_factor": 100 }
      }
    }
  }
}
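Why scaled_float for total_amount? With "scaling_factor": 100, Elasticsearch stores the amount as an integer number of cents (a long), which compresses far better in doc values than a float. Roughly:

```python
SCALING_FACTOR = 100  # matches "scaling_factor": 100 in the mapping

def encode(amount):
    # Elasticsearch stores round(value * scaling_factor) as a long.
    return round(amount * SCALING_FACTOR)

def decode(stored):
    return stored / SCALING_FACTOR

stored = encode(11.11)
print(stored, decode(stored))  # -> 1111 11.11
```

The trade-off: precision is capped at 1/scaling_factor, which is exactly right for currency amounts but wrong for fields like the crypto prices shown earlier, which need more decimal places.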
Query Design - Aggregation
High-level concepts:
- Buckets
  ○ driver name, passenger name, location
  ○ currency exchange (Bitstamp, Coinbase, etc.)
  ○ Can be nested
- Metrics
  ○ count
  ○ pricing sum, avg, max, min, etc.
SQL analogy: SELECT metric FROM index GROUP BY bucket
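The SELECT metric FROM index GROUP BY bucket analogy can be sketched directly in Python: group documents into buckets by a field, then compute a metric over each bucket.

```python
from collections import defaultdict

trips = [
    {"driver_id": 1, "total_amount": 10.0},
    {"driver_id": 1, "total_amount": 20.0},
    {"driver_id": 2, "total_amount": 5.0},
]

# Bucket step: like a terms aggregation on driver_id.
buckets = defaultdict(list)
for t in trips:
    buckets[t["driver_id"]].append(t["total_amount"])

# Metric step: like an avg aggregation inside each bucket.
result = {driver: sum(v) / len(v) for driver, v in buckets.items()}
print(result)  # -> {1: 15.0, 2: 5.0}
```

Nesting buckets (e.g. driver, then day) just repeats the bucket step inside each existing bucket before applying the metric, which is exactly how the nested "aggs" blocks on the following slides compose.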
Building an aggregation, step by step
Drivers' average earning per day, with customers using credit cards as a filter:
- Search component: payment_type:1
- Bucket terms aggregation (driver name or id)
- Bucket date histogram aggregation (per day, month, specific interval, etc)
- Metric average aggregation
Query design - The search component
- Aggregations can be filtered by search
- Multiple indices can be matched using wildcard patterns
GET /taxi-*/_search?q=payment_type:1
- Up to complex searches
GET /taxi-2016-01/_search?q=+pickup_datetime:2016-01-09 +total_amount:>60&size=10000
Query Design - The Terms bucket
{
  "size": 0,                  <- do not return documents
  "aggs": {                   <- define aggregation
    "drivers": {              <- canonical agg name
      "terms": {              <- agg type
        "field": "driver_id"  <- agg argument (and options)
      }
    }
  }
}
Query Design - The date histogram bucket
"aggs": {
  "drivers": {
    "terms": { "field": "driver_id" },
    "aggs": {
      "by_day": {
        "date_histogram": {
          "field": "pickup_datetime",
          "interval": "day"
        }
      }
    }
  }
}
Query design - Metric aggregation
"aggs": {
  "by_day": {
    "date_histogram": {
      "field": "pickup_datetime",
      "interval": "day"
    },
    "aggs": {
      "avg_earning": {
        "avg": { "field": "total_amount" }
      }
    }
  }
}
Query Results
Obviously too long to show here! General info comes first:
{
  "took" : 1804,          <- 1804 ms execution time
  "timed_out" : false,
  "_shards" : {
    "total" : 30,         <- number of shards parsed
    "successful" : 30,
    "failed" : 0
  },
  "hits" : {
    "total" : 46066859    <- number of parsed documents
  }
}
Query Results - Example aggregation result
{
  "key" : "Shay Banon",
  "doc_count" : 5476,
  "by_day" : {
    "buckets" : [
      {
        "key_as_string" : "2017-01-01 00:00:00",
        "doc_count" : 152,
        "avg_earning" : { "value" : 11.11 }
      }
    ]
  }
}
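A nested terms > date_histogram > avg response is usually flattened into rows client-side before charting or storing it. A minimal sketch, assuming the response shape shown in this example:

```python
# Abbreviated shape of a nested aggregation response.
response = {
    "aggregations": {
        "drivers": {"buckets": [
            {"key": "Shay Banon",
             "by_day": {"buckets": [
                 {"key_as_string": "2017-01-01 00:00:00",
                  "avg_earning": {"value": 11.11}},
             ]}},
        ]}
    }
}

# One (driver, day, avg) row per inner bucket.
rows = [
    (driver["key"], day["key_as_string"], day["avg_earning"]["value"])
    for driver in response["aggregations"]["drivers"]["buckets"]
    for day in driver["by_day"]["buckets"]
]
print(rows)  # -> [('Shay Banon', '2017-01-01 00:00:00', 11.11)]
```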
Query filtering - Filtering by location
"filter": {
  "geo_polygon": {
    "pickup_location": {
      "points": [
        {"lat": 40.56, "lon": -74.04},
        {"lat": 40.74, "lon": -74.04},
        {"lat": 40.74, "lon": -73.83},
        {"lat": 40.57, "lon": -73.83}
      ]
    }
  }
}
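Under the hood, geo_polygon keeps documents whose pickup_location falls inside the polygon; the core test is ordinary point-in-polygon. A toy ray-casting sketch (not Elasticsearch's implementation), using an illustrative bounding polygon around New York:

```python
def point_in_polygon(lat, lon, polygon):
    """Ray casting: count edge crossings of a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        (lat1, lon1), (lat2, lon2) = polygon[i], polygon[(i + 1) % n]
        # Only edges that straddle our latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude where the edge crosses our latitude.
            cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if cross > lon:
                inside = not inside
    return inside

# Illustrative polygon roughly around New York City, as (lat, lon) pairs.
nyc = [(40.56, -74.04), (40.74, -74.04), (40.74, -73.83), (40.57, -73.83)]
print(point_in_polygon(40.65, -73.95, nyc))  # -> True
print(point_in_polygon(41.00, -73.95, nyc))  # -> False
```

Elasticsearch answers this over millions of documents quickly because geo_point fields keep their coordinates in doc values, so the polygon test runs over compact columnar data.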
Going further - Bucket aggregations
Adjacency Matrix, Children, Date Histogram, Date Range, Diversified Sampler, Filter, Filters, Geo Distance, GeoHash Grid, Global, Histogram, IP Range, Missing, Nested, Range, Reverse Nested, Sampler, Significant Terms, and Terms aggregations
Going Further - Metric Aggregations
Avg, Cardinality, Extended Stats, Geo Bounds, Geo Centroid, Max, Min, Percentiles, Percentile Ranks, Scripted Metric, Stats, Sum, Top Hits, and Value Count aggregations
Going Further - Kibana Build cool dashboards with your aggregations
We’re Hiring!
- Frontend Developers
- Backend Developers
- Data Scientists