
MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCH
Yann Cluchey
CTO @ Cogenta
CTO @ GfK Online Pricing Intelligence

What is Enterprise Data?

Online Pricing Intelligence
1. Gather data from 500+ eCommerce sites
2. Organise into high quality market view
3. Competitive intelligence tools

Custom Crawler
• Parse web content
• Discover product data
• Tracking 20m products
• Daily+
[Diagram labels: HTML; Price, Stock, Meta]

Processing, Storage
• Enrichment
• Persistent Storage
• Product Catalogue
• + time series data
[Diagram labels: Processing, Database]

Thing #1 - Detection
• Identify distinct products
• Automated information retrieval
• Lucene + custom index builder
• Continuous process
• (Humans for QA)
[Diagram labels: Matcher, Index Builder, Database, Lucene Index, GUI]

Thing #2 - BI Tools
• Web Applications
• Also based on Lucene
• Batch index build process
• Per-customer indexes
[Diagram labels: Database, Index Builder, Customer Index 1, Customer Index 2, Customer Index 3, BI Tools]

Thing #1 - Pain
• Continuously indexing
• Track changes, read back out to index
• Drain on performance
• Latency, coping with peaks
• Full rebuild for index schema change or inconsistencies
• Full rebuild doesn’t scale well…
• Unnecessary work..?
[Diagram labels: Database, Index Builder, Lucene Index, GUI]

Thing #2 - Pain
• Twice daily batch rebuild, per customer
• Very slow
• Moar customers?
• Moar data?
• Moar often?
• Data set too complex, keeps changing
• Index shipping
• Moar web servers?
[Diagram labels: Database, Batch Sync, Indexing Database, Index Builder, Customer Index 1, Customer Index 2, Customer Index 3, Web Server 1, Web Server 2, BI Tools]

Pain Points
• As data, customers scale, processes slow down
• Adapting to change
• Easy to layer on, hard to make fundamental changes
• Read vs write concerns
• Database Maintenance
[Diagram labels: Database, Index Builder, Index]

Goals
• Eliminate latencies
• Improve scalability
• Improve availability
• Something achievable
• Your mileage will vary

elasticsearch
• Open source, distributed search engine
• Based on Lucene, fully featured API
• Querying, filtering, aggregation
• Text processing / IR
• Schema-free
• Yummy (real-time, sharding, highly available)
• Silver bullets not included

Our Pipeline
[Diagram labels: Crawlers, Processors, Database, Indexing Database, Indexers, Indexes, Web Servers]

Our New Pipeline
[Diagram labels: Crawlers, Processors, Database, Indexers, Indexes, Web Servers]

Event Hooks
• Messages fired OnCreate.. and OnUpdate
• Payload contains everything needed for indexing
  • The data
  • Keys (still mastered in SQL)
  • Versioning
• Sender has all the information already
• Use RabbitMQ to control event message flow
• Messages are durable
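A minimal sketch of what such an event message could look like, assuming a JSON body. The class name, the field names and the sample values are all hypothetical; the slides only say that the payload carries the data, the keys and versioning.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class IndexEvent:
    """One OnCreate/OnUpdate message; every field name here is illustrative."""
    kind: str      # "OnCreate" or "OnUpdate"
    keys: dict     # identifiers that are still mastered in SQL
    version: int   # revision number, used downstream to order updates
    data: dict     # everything the indexer needs, so it never reads SQL


def encode(event: IndexEvent) -> bytes:
    """Serialise the event into the message body handed to the queue."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def decode(body: bytes) -> IndexEvent:
    """Rebuild the event on the consumer side."""
    return IndexEvent(**json.loads(body.decode("utf-8")))


event = IndexEvent(
    kind="OnUpdate",
    keys={"product_id": 42, "retailer_id": 7},
    version=3,
    data={"price": 199.99, "in_stock": True},
)
body = encode(event)
```

Durability is a broker and publisher setting and is not modelled here; the sketch only covers the self-contained payload.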

Indexing Strategy
• RESTful API (HTTP, Thrift, Memcache)
• Use bulk methods
• They support percolation
• Rivers (pull)
  • RabbitMQ River
  • JDBC River
  • Mongo/Couch/etc. River
• Logstash
[Diagram labels: Event Q, Index Q, Indexer]
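To illustrate "use bulk methods", here is a small sketch that builds the newline-delimited body the elasticsearch bulk endpoint expects: an action line followed by the document source, repeated per document. The index name and documents are made up, and older releases may also expect a type in the action line, so treat the exact metadata keys as an assumption to check against your version.

```python
import json


def bulk_body(index: str, docs: dict) -> str:
    """Build a newline-delimited body for the elasticsearch bulk endpoint.

    Each document contributes two lines: an action line naming the target
    index and id, then the document source. The body ends with a newline.
    """
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Hypothetical index name and documents.
body = bulk_body(
    "products",
    {
        "p-1": {"name": "Camera", "price": 199.99},
        "p-2": {"name": "Tripod", "price": 24.50},
    },
)
```

Batching many documents into one request like this trades a little latency per document for far fewer round trips, which is the point of the bullet above.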

Model Your Data
• What’s in your documents?
• Database = Index, Table = Type ...?
• Start backwards
• What do your applications need?
• How will they need to query the data?
• Prototyping! Fail quickly!
• elasticsearch supports Nested objects, parent/child docs
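As one way to "start backwards" from a query, the sketch below pairs a mapping that uses a nested field with the query it is meant to serve. The document shape (a product with a list of retailer offers) and every field name are assumptions for illustration, not the schema used in the talk.

```python
# Hypothetical product document shape: one product, many retailer offers.
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "offers": {
                "type": "nested",  # each offer is matched as its own unit
                "properties": {
                    "retailer": {"type": "keyword"},
                    "price": {"type": "float"},
                    "in_stock": {"type": "boolean"},
                },
            },
        }
    }
}


def cheap_offers_query(retailer: str, max_price: float) -> dict:
    """Products where one and the same offer matches retailer and price."""
    return {
        "query": {
            "nested": {
                "path": "offers",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"offers.retailer": retailer}},
                            {"range": {"offers.price": {"lt": max_price}}},
                        ]
                    }
                },
            }
        }
    }


query = cheap_offers_query("amazon", 200.0)
```

Without the nested type, the retailer of one offer could match together with the price of another, which is why the query shape drives the modelling decision.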

Joins
• Events relate to line-items
  • Amazon decreased price
  • Pixmania is running a promotion
• Need to group by Product
• Use key/value store
• Get full Product document
• Modify it, write it back
• Enqueue indexing instruction
[Diagram labels: Event Q, Key/value store, Index Q, Indexer, with steps numbered 1 to 5]
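The read, modify, write back, enqueue sequence can be sketched with in-memory stand-ins. Here a dict plays the key/value store and a deque plays the Index Q; the event and document shapes are hypothetical.

```python
from collections import deque

store = {}             # stand-in for the key/value store, keyed by product id
index_queue = deque()  # stand-in for the Index Q


def apply_event(event: dict) -> dict:
    """Fold one line-item event into its Product document."""
    product_id = event["product_id"]
    retailer = event["retailer"]
    # Get the full Product document, or start an empty one.
    product = store.get(product_id, {"product_id": product_id, "offers": {}})
    # Modify it: only the offer of the retailer named in the event changes.
    offers = dict(product["offers"])
    offers[retailer] = {**offers.get(retailer, {}), **event["changes"]}
    product = {**product, "offers": offers}
    # Write it back.
    store[product_id] = product
    # Enqueue an indexing instruction that carries the whole document.
    index_queue.append({"id": product_id, "doc": product})
    return product


apply_event({"product_id": 42, "retailer": "amazon", "changes": {"price": 179.0}})
apply_event({"product_id": 42, "retailer": "pixmania", "changes": {"promotion": True}})
```

After both events the stored Product holds both retailers' offers, and each queued instruction contains the full document as it stood at that point.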

Where to join?
• elasticsearch
• Consider performance
• Depends how data is structured/indexed (e.g. parent/child)
• Compression, collisions
• In-memory cache (e.g. Memcache)
• Persistent storage (e.g. Cassandra or Mongo)
• Two awesome benefits
  • Quickly re-index if needed
  • Updates have access to the full Product data
• Serialisation is costly

Synchronisation & Concurrency
• Fault tolerance
  • Code to expect missing data
  • Out of sequence events
• Concurrency Control
  • Apply Optimistic Concurrency Control at Mongo
  • Optimise for collisions
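A minimal sketch of optimistic concurrency control, using an in-memory class in place of Mongo. The `Store` class, its method names and the retry limit are assumptions; in MongoDB the conditional write is typically a filtered update that matches on both the key and a revision field.

```python
class RevisionConflict(Exception):
    """Raised when the stored revision is not the one the writer read."""


class Store:
    """In-memory stand-in for a document store with conditional writes."""

    def __init__(self):
        self._docs = {}  # key -> (revision, document)

    def read(self, key):
        return self._docs.get(key, (0, {}))

    def write_if(self, key, expected_revision, document):
        current_revision, _ = self.read(key)
        if current_revision != expected_revision:
            raise RevisionConflict(key)
        self._docs[key] = (expected_revision + 1, document)


def update(store, key, change, max_attempts=5):
    """Read, modify and write back; retry from a fresh read on collision."""
    for _ in range(max_attempts):
        revision, document = store.read(key)
        try:
            store.write_if(key, revision, {**document, **change})
            return revision + 1
        except RevisionConflict:
            continue
    raise RevisionConflict(key)


store = Store()
update(store, "p-42", {"price": 179.0})
stale_revision, stale_doc = store.read("p-42")  # a slow writer reads here
update(store, "p-42", {"in_stock": False})      # a faster writer gets in first
```

A write based on `stale_revision` is now rejected rather than silently overwriting the newer document; the slow writer has to re-read and reapply its change.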

Synchronisation & Concurrency
• Synchronisation
  • Out of sequence index instructions
  • elasticsearch external versioning
  • Can rebuild from scratch if need to
• Consistency
  • Which version is right?
  • Dates
  • Revision numbers from SQL
  • Independent updates
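The effect of external versioning on out of sequence instructions can be shown with a toy index. With the external version type, elasticsearch accepts a write only when the supplied version is higher than the stored one; the `Index` class below merely imitates that rule in memory, and the ids, versions and prices are invented.

```python
class VersionConflict(Exception):
    """Stand-in for the conflict response returned for a stale version."""


class Index:
    """Toy index that applies the external-versioning rule."""

    def __init__(self):
        self._docs = {}  # id -> (version, document)

    def index(self, doc_id, document, version):
        stored_version, _ = self._docs.get(doc_id, (None, None))
        if stored_version is not None and version <= stored_version:
            raise VersionConflict(f"{doc_id}: {version} <= {stored_version}")
        self._docs[doc_id] = (version, document)

    def get(self, doc_id):
        return self._docs[doc_id]


def apply_all(index, instructions):
    """Apply instructions in arrival order and count the stale ones dropped."""
    dropped = 0
    for doc_id, version, document in instructions:
        try:
            index.index(doc_id, document, version)
        except VersionConflict:
            dropped += 1
    return dropped


index = Index()
# Revision 3 arrives before revision 2.
dropped = apply_all(index, [
    ("p-42", 1, {"price": 199.0}),
    ("p-42", 3, {"price": 175.0}),
    ("p-42", 2, {"price": 179.0}),
])
```

The late revision 2 is discarded, so the index ends up holding revision 3 regardless of arrival order. Against a real cluster the version would be passed with the request together with the external version type.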

Figures
• Ingestion
  • 20m data points/day (continuously)
  • ~200GB
  • 3K msgs/second at peak

              Custom-Built Lucene    elasticsearch
  Latency     3 hours                < 1 second
  Bottleneck  Disk (SQL)             CPU

• Hardware
  • SQL: 2 x 12-core, 64GB, 72-spindle SAN
  • Indexing: 4 x 4-core, 8GB
  • Mongo: 1 x 4-core, 16GB, 1xSSD
  • Elastic: 5 x 4-core, 16GB, 1xSSD

Managing Change
[Diagram labels: Client, Alias, Index_A, Index_B, Indexer, Key/value store, Event Q]
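The diagram suggests clients read through an alias while a second index is built, presumably followed by a switch of the alias. Under that assumption, the request body for the elasticsearch aliases endpoint could be built as below; the alias name is hypothetical and the index names are lower-cased versions of the diagram labels.

```python
import json


def swap_alias(alias: str, old_index: str, new_index: str) -> dict:
    """Body for the aliases endpoint; both actions go in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical alias name; clients only ever query "products".
body = json.dumps(swap_alias("products", "index_a", "index_b"))
```

Because the remove and add are submitted together, clients querying the alias should not observe a moment where it points at neither index.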

Thanks
• @YannCluchey
• Concurrency Patterns with MongoDB: http://slidesha.re/YFOehF
• Consistency without Consensus, Peter Bourgon, SoundCloud: http://bit.ly/1DUAO1R
• Eventually Consistent Data Structures, Sean Cribbs, Basho: https://vimeo.com/43903960
