Turbocharge your MySQL analytics with ElasticSearch
Guillaume Lefranc, Data & Infrastructure Architect, Productsup GmbH
Percona Live Europe 2017
About the Speaker Guillaume Lefranc
- Data Architect at Productsup
- Lead Architect of Replication Manager for MySQL and MariaDB
- DBA Manager at MariaDB Corporation
- Infrastructure Consultant at Coinigy
- DB Architect at dailymotion.com
Takeaways
In this presentation, we will cover:
- How Elasticsearch works as a document and column store
- Its strengths and weaknesses when it comes to analytics
- How to sync data with MySQL
- How to build aggregations
Case study
Disclaimer: This case study is about medium data, not big data.
What is medium data? Answer: from 100 GB to a few TBs.
Types of data:
- User activity (clicks)
- Market data
- Trips
Use Cases
Business cases:
- A ride-sharing app
  Example dataset: NYC Taxi Data (6 months: 78 million trips)
- Cryptocurrency market data
  200 million documents per month - courtesy of coinigy.com
MySQL and Medium Data
- Medium data can scale well in MySQL (SELECT … WHERE id = ?)
- … but not with analytics
- Not every case can be covered by an index
- Aggregations can be slow, especially if doing table scans
Elasticsearch - What?
- "You know, for Search" -> Inverted Index
- Document Store
- REST API
  ○ POST /market_data -d '{ "market": "BTC/USD", "value": "3418.03953" }'
- JSON Native
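The "inverted index" idea behind "You know, for Search" can be sketched in a few lines of Python. This is a toy illustration, not Lucene's actual data structures: each token maps to the set of documents that contain it, and a search intersects those sets.

```python
from collections import defaultdict

# Toy corpus: document ID -> text.
docs = {
    1: "BTC/USD trade on Bitstamp",
    2: "ETH/USD trade on Coinbase",
    3: "BTC/EUR quote on Bitstamp",
}

# Inverted index: token -> set of document IDs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().replace("/", " ").split():
        index[token].add(doc_id)

def search(*tokens):
    """AND-search: intersect the posting sets of every token."""
    postings = [index[t.lower()] for t in tokens]
    return sorted(set.intersection(*postings)) if postings else []

print(search("btc", "bitstamp"))  # -> [1, 3]
```

Instead of scanning every document for a term, the query looks up two small posting lists and intersects them, which is why search stays fast as the corpus grows.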
Elasticsearch - Anatomy of a Document
{
  "_index" : "raw-2016-07",
  "_type" : "market_raw",
  "_id" : "108051174765",
  "_score" : 7.455347,
  "_source" : {
    "quantity" : 0.64130859,
    "time_local" : "2016-07-12 06:45:34",
    "type" : "SELL",
    "market" : "USD/BTC",
    "total" : 414.52263332,
    "@timestamp" : "2016-07-12T06:45:34.000Z",
    "price" : 646.37,
    "exchange" : "BITS",
    "id" : 108051174765,
    "tradeid" : "11649948"
  }
}
Elasticsearch - Why?
- Distributed
- Fault Tolerant
- Scales Horizontally
Elasticsearch - Column Store
- All fields are indexed by default
- A query is (almost) always an index scan
- Doc values
  ○ Field values serialized on disk
  ○ Not stored in the JVM heap: relies on the OS cache
  ○ Compressed on disk
  ○ Enabled by default for all numerics, geo_points, dates, IPs and not_analyzed strings (keywords)
- Not a general-purpose column-store replacement
- Not exactly fast as a document DB
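Why do doc values make aggregations fast? They store each field column-wise, so a metric only has to scan one contiguous array per field instead of materializing whole documents. A rough Python analogy (not the actual on-disk format):

```python
# Row store: one dict per document, like the _source of each hit.
rows = [
    {"driver_id": 1, "total_amount": 11.0},
    {"driver_id": 2, "total_amount": 23.5},
    {"driver_id": 1, "total_amount": 8.0},
]

# Column store ("doc values"): one array per field, aligned by doc ordinal.
columns = {
    "driver_id":    [r["driver_id"] for r in rows],
    "total_amount": [r["total_amount"] for r in rows],
}

# An aggregation like SUM(total_amount) touches only one column,
# never loading driver_id or any other field.
total = sum(columns["total_amount"])
print(total)  # -> 42.5
```

The columnar layout is also what makes the compression mentioned above effective: values of one field tend to be similar, so they encode well together.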
Bibliography:
- https://www.elastic.co/blog/elasticsearch-as-a-column-store
- https://www.elastic.co/guide/en/elasticsearch/guide/current/_deep_dive_on_doc_values.html
Alternatives (Open Source)
- MariaDB Columnstore
- Yandex Clickhouse
- Apache Spark
Bibliography:
- https://www.percona.com/blog/2017/03/17/column-store-database-benchmarks-mariadb-columnstore-vs-clickhouse-vs-apache-spark/
MySQL Analytics: Performance
- Fast if the data is in the index
- Very slow if linear scans have to be used
Example:
SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips WHERE driver_id=102
GROUP BY date ORDER BY date DESC;
-> returns in milliseconds

SELECT DATE(pickup_datetime) AS date, SUM(total_amount) AS earnings
FROM trips
GROUP BY date ORDER BY date DESC;
-> 4 minutes
Importing data: Logstash
- Logstash is an open source data collection engine with real-time pipelining capabilities.
- The L in ELK Stack
- ETL for ElasticSearch
- Pipeline model (input -> filter -> output)
Importing data - Input
input {
  jdbc {
    jdbc_driver_library => "/usr/share/java/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
    jdbc_user => "root"
    jdbc_password => "admin"
    jdbc_fetch_size => 100000
    statement => "SELECT id, driver_id, passenger_id, pickup_datetime, dropoff_datetime,
                  CONCAT(pickup_latitude, ',', pickup_longitude) AS pickup_location,
                  CONCAT(dropoff_latitude, ',', dropoff_longitude) AS dropoff_location,
                  payment_type, total_amount FROM trips WHERE id > :sql_last_value"
    use_column_value => true
    tracking_column => id
    #schedule => "*/5 * * * *"
  }
}
Importing data - Input
- Dealing with large result sets
  jdbc_connection_string => "jdbc:mysql://db1:3306/taxi_platform?useCursorFetch=true"
  jdbc_fetch_size => 10000
- SQL Last Value
  use_column_value => true
  tracking_column => id
  statement => "SELECT … WHERE id > :sql_last_value"
- Scheduler
  schedule => "*/5 * * * *"
Reference: https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
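The :sql_last_value bookkeeping amounts to: remember the highest id imported so far, and only ask MySQL for newer rows on the next scheduled run. A simplified Python sketch of that loop (not Logstash's implementation, which persists the value in a metadata file between runs):

```python
# Simulated source table, ordered by auto-increment id.
trips = [{"id": i, "total_amount": float(i)} for i in range(1, 8)]

sql_last_value = 0  # Logstash persists this between runs

def fetch_batch():
    """Emulates: SELECT ... FROM trips WHERE id > :sql_last_value"""
    global sql_last_value
    batch = [t for t in trips if t["id"] > sql_last_value]
    if batch:
        # tracking_column => id: remember the highest value seen.
        sql_last_value = max(t["id"] for t in batch)
    return batch

first = fetch_batch()   # initial run imports all 7 rows
trips.append({"id": 8, "total_amount": 8.0})
second = fetch_batch()  # next scheduled run picks up only the new row
print(len(first), len(second))  # -> 7 1
```

This is why the tracking column should be monotonically increasing (an auto-increment id or an updated-at timestamp); otherwise rows can be silently skipped.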
Importing Data - Filters
Filters (Input Transformation):
filter {
  mutate { convert => [ "pickup_datetime", "string" ] }
  date { match => [ "pickup_datetime", "ISO8601" ] }
}
- The date field is used for partitioning; if there is no suitable field, Elasticsearch will use the current date and time
Importing Data - Output
output {
  elasticsearch {
    hosts => [ "es1:9200" ]
    user => "elastic"
    password => "elasticpassword"
    index => "taxi-%{+YYYY-MM}"
    document_type => "trips"
    document_id => "%{id}"
  }
}
Importing Data - Output
- Document partitioning
index => "taxi-%{+YYYY-MM}"
- Matching ID with MySQL
document_id => "%{id}"
Schema Design - Indexes
- Document Partitioning
- ElasticSearch Types
- Number of Indices
- Number of Shards
- Replication Factor
Schema Design - Indexes
PUT /taxi
{
  "settings" : {
    "index" : {
      "number_of_shards" : 5,
      "number_of_replicas" : 1
    }
  }
}
PUT _template/template1
{
  "template": "taxi*",
  "settings": { ... }
}
Schema Design - Mapping
- Indexing with the optimal type
- Avoiding Full Text Search indexing (aka text or “analyzed”)
- Type overview
  ○ keyword
  ○ long
  ○ byte
  ○ date
  ○ geo_point
  ○ scaled_float
Schema Design - Mapping
{
  "mappings": {
    "trips": {
      "properties": {
        "id": { "type": "long" },
        "driver_id": { "type": "long" },
        "passenger_id": { "type": "long" },
        "pickup_datetime": { "type": "date" },
        "dropoff_datetime": { "type": "date" },
        "passenger_count": { "type": "byte" },
        "pickup_location": { "type": "geo_point" },
        "dropoff_location": { "type": "geo_point" },
        "payment_type": { "type": "byte" },
        "total_amount": { "type": "scaled_float", "scaling_factor": 100 }
      }
    }
  }
}
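Why scaled_float for total_amount? With "scaling_factor": 100, Elasticsearch stores the amount as an integer number of cents (a long), which compresses far better in doc values than a float. Roughly:

```python
SCALING_FACTOR = 100  # matches "scaling_factor": 100 in the mapping

def encode(amount):
    # Elasticsearch stores round(value * scaling_factor) as a long.
    return round(amount * SCALING_FACTOR)

def decode(stored):
    return stored / SCALING_FACTOR

stored = encode(11.11)
print(stored, decode(stored))  # -> 1111 11.11
```

The trade-off: precision is capped at 1/scaling_factor, which is exactly right for currency amounts but wrong for fields like the crypto prices shown earlier, which need more decimal places.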
Query Design - Aggregation
High-level concepts:
- Buckets
  ○ driver name, passenger name, location
  ○ currency exchange (Bitstamp, Coinbase, etc.)
  ○ Can be nested
- Metrics
  ○ count
  ○ pricing sum, avg, max, min, etc.
SQL analogy: SELECT metric FROM index GROUP BY bucket
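The SELECT metric FROM index GROUP BY bucket analogy can be sketched directly in Python: group documents into buckets by a field, then compute a metric over each bucket.

```python
from collections import defaultdict

trips = [
    {"driver_id": 1, "total_amount": 10.0},
    {"driver_id": 1, "total_amount": 20.0},
    {"driver_id": 2, "total_amount": 5.0},
]

# Bucket step: like a terms aggregation on driver_id.
buckets = defaultdict(list)
for t in trips:
    buckets[t["driver_id"]].append(t["total_amount"])

# Metric step: like an avg aggregation inside each bucket.
result = {driver: sum(v) / len(v) for driver, v in buckets.items()}
print(result)  # -> {1: 15.0, 2: 5.0}
```

Nesting buckets (e.g. driver, then day) just repeats the bucket step inside each existing bucket before applying the metric, which is exactly how the nested "aggs" blocks on the following slides compose.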
Building an aggregation, step by step
Drivers' average earning per day, with customers using credit cards as a filter:
- Search component: payment_type:1
- Bucket terms aggregation (driver name or id)
- Bucket date histogram aggregation (per day, month, specific interval, etc)
- Metric average aggregation
Query design - The search component
- Aggregations can be filtered by search
- Multiple indices can be matched using wildcard patterns
GET /taxi-*/_search?q=payment_type:1
- Up to complex searches
GET /taxi-2016-01/_search?q=+pickup_datetime:2016-01-09 +total_amount:>60&size=10000
Query Design - The Terms bucket
{
  "size": 0,                  <- do not return documents
  "aggs": {                   <- define aggregation
    "drivers": {              <- canonical agg name
      "terms": {              <- agg type
        "field": "driver_id"  <- agg argument (and options)
      }
    }
  }
}
Query Design - The date histogram bucket
"aggs": {
  "drivers": {
    "terms": { "field": "driver_id" },
    "aggs": {
      "by_day": {
        "date_histogram": {
          "field": "pickup_datetime",
          "interval": "day"
        }
      }
    }
  }
}
Query design - Metric aggregation
"aggs": {
  "by_day": {
    "date_histogram": {
      "field": "pickup_datetime",
      "interval": "day"
    },
    "aggs": {
      "avg_earning": {
        "avg": { "field": "total_amount" }
      }
    }
  }
}
Query Results
Obviously too long to show here! General info comes first:
{
  "took" : 1804,          <- 1804 ms execution time
  "timed_out" : false,
  "_shards" : {
    "total" : 30,         <- number of shards parsed
    "successful" : 30,
    "failed" : 0
  },
  "hits" : {
    "total" : 46066859    <- number of parsed documents
  }
}
Query Results - Example aggregation result
{
  "key" : "Shay Banon",
  "doc_count" : 5476,
  "by_day" : {
    "buckets" : [
      {
        "key_as_string" : "2017-01-01 00:00:00",
        "doc_count" : 152,
        "avg_earning" : { "value" : 11.11 }
      }
    ]
  }
}
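A nested terms > date_histogram > avg response is usually flattened into rows client-side before charting or storing it. A minimal sketch, assuming the response shape shown in this example:

```python
# Abbreviated shape of a nested aggregation response.
response = {
    "aggregations": {
        "drivers": {"buckets": [
            {"key": "Shay Banon",
             "by_day": {"buckets": [
                 {"key_as_string": "2017-01-01 00:00:00",
                  "avg_earning": {"value": 11.11}},
             ]}},
        ]}
    }
}

# One (driver, day, avg) row per inner bucket.
rows = [
    (driver["key"], day["key_as_string"], day["avg_earning"]["value"])
    for driver in response["aggregations"]["drivers"]["buckets"]
    for day in driver["by_day"]["buckets"]
]
print(rows)  # -> [('Shay Banon', '2017-01-01 00:00:00', 11.11)]
```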
Query filtering - Filtering by location
"filter": {
  "geo_polygon": {
    "pickup_location": {
      "points": [
        {"lat": 40.56, "lon": -74.04},
        {"lat": 40.74, "lon": -74.04},
        {"lat": 40.74, "lon": -73.83},
        {"lat": 40.57, "lon": -73.83}
      ]
    }
  }
}
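Under the hood, geo_polygon keeps documents whose pickup_location falls inside the polygon; the core test is ordinary point-in-polygon. A toy ray-casting sketch (not Elasticsearch's implementation), using an illustrative bounding polygon around New York:

```python
def point_in_polygon(lat, lon, polygon):
    """Ray casting: count edge crossings of a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        (lat1, lon1), (lat2, lon2) = polygon[i], polygon[(i + 1) % n]
        # Only edges that straddle our latitude can be crossed.
        if (lat1 > lat) != (lat2 > lat):
            # Longitude where the edge crosses our latitude.
            cross = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if cross > lon:
                inside = not inside
    return inside

# Illustrative polygon roughly around New York City, as (lat, lon) pairs.
nyc = [(40.56, -74.04), (40.74, -74.04), (40.74, -73.83), (40.57, -73.83)]
print(point_in_polygon(40.65, -73.95, nyc))  # -> True
print(point_in_polygon(41.00, -73.95, nyc))  # -> False
```

Elasticsearch answers this over millions of documents quickly because geo_point fields keep their coordinates in doc values, so the polygon test runs over compact columnar data.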
Going further - Bucket aggregations
Adjacency Matrix, Children, Date Histogram, Date Range, Diversified Sampler, Filter, Filters, Geo Distance, GeoHash Grid, Global, Histogram, IP Range, Missing, Nested, Range, Reverse Nested, Sampler, Significant Terms, and Terms aggregations
Going Further - Metric Aggregations
Avg, Cardinality, Extended Stats, Geo Bounds, Geo Centroid, Max, Min, Percentiles, Percentile Ranks, Scripted Metric, Stats, Sum, Top Hits, and Value Count aggregations
Going Further - Kibana Build cool dashboards with your aggregations
We’re Hiring!
- Frontend Developers
- Backend Developers
- Data Scientists