
slide-1
SLIDE 1

MAKING ENTERPRISE DATA AVAILABLE IN REAL TIME WITH ELASTICSEARCH

Yann Cluchey
CTO @ Cogenta
CTO @ GfK Online Pricing Intelligence

slide-2
SLIDE 2

What is Enterprise Data?

slide-3
SLIDE 3

What is Enterprise Data?

slide-4
SLIDE 4

Online Pricing Intelligence

  1. Gather data from 500+ eCommerce sites
  2. Organise into high quality market view
  3. Competitive intelligence tools

slide-5
SLIDE 5

Custom Crawler

  • Parse web content
  • Discover product data
  • Tracking 20m products
  • Daily+

[Diagram labels: HTML; Custom Crawler; Price, Stock, Meta]

slide-6
SLIDE 6

Processing, Storage

  • Enrichment
  • Persistent Storage
  • Product Catalogue
  • + time series data

[Diagram labels: Processing; Database]

slide-7
SLIDE 7

Thing #1 - Detection

  • Identify distinct products
  • Automated information retrieval
  • Lucene + custom index builder
  • Continuous process
  • (Humans for QA)

[Diagram labels: Database; Index Builder; Lucene Index; Matcher; GUI]

slide-8
SLIDE 8

Thing #2 - BI Tools

  • Web Applications
  • Also based on Lucene
  • Batch index build process
  • Per-customer indexes

[Diagram labels: Database; Index Builder; Customer Index 1; Customer Index 2; Customer Index 3; BI Tools]

slide-9
SLIDE 9

Thing #1 - Pain

  • Continuously indexing
  • Track changes, read back out to index
  • Drain on performance
  • Latency, coping with peaks
  • Full rebuild for index schema change or inconsistencies
  • Full rebuild doesn’t scale well…
  • Unnecessary work..?

[Diagram labels: Database; Index Builder; Lucene Index; GUI]

slide-10
SLIDE 10

Thing #2 - Pain

  • Twice daily batch rebuild, per customer
  • Very slow
  • Moar customers?
  • Moar data?
  • Moar often?
  • Data set too complex, keeps changing
  • Index shipping
  • Moar web servers?

[Diagram labels: Database; Indexing Database; Batch Sync; Index Builder; Customer Index 1; Customer Index 2; Customer Index 3; BI Tools; Web Server 1; Web Server 2]

slide-11
SLIDE 11

Pain Points

  • As data, customers scale, processes slow down
  • Adapting to change
  • Easy to layer on, hard to make fundamental changes
  • Read vs write concerns
  • Database Maintenance

[Diagram labels: Database; Index Builder; Index]

slide-12
SLIDE 12

Goals

  • Eliminate latencies
  • Improve scalability
  • Improve availability
  • Something achievable
  • Your mileage will vary

slide-13
SLIDE 13

elasticsearch

  • Open source, distributed search engine
  • Based on Lucene, fully featured API
  • Querying, filtering, aggregation
  • Text processing / IR
  • Schema-free
  • Yummy (real-time, sharding, highly available)
  • Silver bullets not included

slide-14
SLIDE 14

Our Pipeline

[Diagram labels: Crawlers; Processors; Database; Indexing Database; Indexers; Indexes; Web Servers]

slide-15
SLIDE 15

Our New Pipeline

[Diagram labels: Crawlers; Processors; Database; Indexers; Indexes; Web Servers]

slide-16
SLIDE 16

Event Hooks

  • Messages fired OnCreate.. and OnUpdate
  • Payload contains everything needed for indexing
      • The data
      • Keys (still mastered in SQL)
      • Versioning
  • Sender has all the information already
  • Use RabbitMQ to control event message flow
  • Messages are durable
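
A minimal sketch of what such an event message could look like, assuming a JSON payload. The class name, field names and sample values are hypothetical and not taken from the talk; publishing to RabbitMQ is left out so the example stays self-contained.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ProductEvent:
    """Hypothetical OnCreate/OnUpdate message; all field names are illustrative."""
    event_type: str   # "OnCreate" or "OnUpdate"
    product_key: int  # key that is still mastered in SQL
    version: int      # revision number, used later to order updates
    data: dict        # everything the indexer needs, so it never has to query SQL


def to_message(event: ProductEvent) -> bytes:
    """Serialise the event as the body of a queue message."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def from_message(body: bytes) -> ProductEvent:
    """Rebuild the event on the consumer side."""
    return ProductEvent(**json.loads(body.decode("utf-8")))


event = ProductEvent("OnUpdate", 42, 7, {"retailer": "Amazon", "price": 199.99})
body = to_message(event)
```

Because the sender already holds the full record, the consumer can index straight from the message body without a round trip to the database.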

slide-17
SLIDE 17

Indexing Strategy

  • RESTful API (HTTP, Thrift, Memcache)
      • Use bulk methods
      • They support percolation
  • Rivers (pull)
      • RabbitMQ River
      • JDBC River
      • Mongo/Couch/etc. River
  • Logstash

[Diagram labels: Event Q; Indexer; Index Q]
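
To illustrate "use bulk methods": the elasticsearch `_bulk` endpoint takes a newline-delimited body in which each document is an action line followed by a source line. The sketch below only builds that body. The index name and documents are made up, and the exact action metadata varies by elasticsearch version (older releases also expect a `_type`), so treat it as an assumption-laden example rather than the speaker's indexer.

```python
import json


def build_bulk_body(index: str, docs: dict) -> str:
    """Build a newline-delimited bulk request body.

    Each document becomes two lines: an action line and a source line.
    The body has to end with a newline.
    """
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


body = build_bulk_body(
    "products",  # hypothetical index name
    {1: {"name": "Camera", "price": 199.99}, 2: {"name": "Lens", "price": 89.0}},
)
```

An indexer that drains a queue could accumulate messages into one such body and send it in a single HTTP request instead of one request per document.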

slide-18
SLIDE 18

Model Your Data

  • What’s in your documents?
  • Database = Index, Table = Type ...?
  • Start backwards
      • What do your applications need?
      • How will they need to query the data?
  • Prototyping! Fail quickly!
  • elasticsearch supports Nested objects, parent/child docs
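
A small sketch of why nested objects matter when starting from the query. It assumes a hypothetical product document with an `offers` list; none of the field names come from the talk. With a `nested` mapping, both conditions in the query must hold on the same offer, which the pure-Python helper mimics.

```python
# Hypothetical mapping fragment: offers are indexed as nested objects.
product_mapping = {
    "properties": {
        "offers": {
            "type": "nested",
            "properties": {
                "price": {"type": "double"},
                "in_stock": {"type": "boolean"},
            },
        }
    }
}

# Hypothetical query: an in-stock offer priced under 190.
nested_query = {
    "nested": {
        "path": "offers",
        "query": {
            "bool": {
                "must": [
                    {"term": {"offers.in_stock": True}},
                    {"range": {"offers.price": {"lt": 190}}},
                ]
            }
        },
    }
}

doc = {
    "name": "Camera",
    "offers": [
        {"retailer": "Amazon", "price": 180.0, "in_stock": False},
        {"retailer": "Pixmania", "price": 195.0, "in_stock": True},
    ],
}


def has_in_stock_offer_below(document: dict, max_price: float) -> bool:
    """Pure-Python equivalent: both conditions must hold on the same offer."""
    return any(o["in_stock"] and o["price"] < max_price for o in document["offers"])
```

Here the document should not match a limit of 190: one offer is cheap enough and another is in stock, but no single offer is both. Without nested objects the offer fields are flattened across the array and the same query could match.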

slide-19
SLIDE 19

Joins

  • Events relate to line-items
      • Amazon decreased price
      • Pixmania is running a promotion
  • Need to group by Product
  • Use key/value store
      • Get full Product document
      • Modify it, write it back
      • Enqueue indexing instruction

[Diagram labels: Event Q; Indexer; Key/value store; Index Q; steps numbered 1 to 5]
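
The get, modify, write back, enqueue cycle can be sketched as follows. A dict and a deque stand in for the key/value store and the index queue, and the event and document shapes are hypothetical.

```python
import copy
from collections import deque

kv_store = {}          # stand-in for the key/value store, keyed by product key
index_queue = deque()  # stand-in for the index queue


def apply_event(event: dict) -> dict:
    """Merge one line-item event into its Product document."""
    key = event["product_key"]
    # 1. Get the full Product document (or start a new one).
    product = kv_store.get(key, {"product_key": key, "offers": {}})
    # 2. Modify it: keep one entry per retailer.
    product["offers"][event["retailer"]] = {
        "price": event["price"],
        "promotion": event.get("promotion", False),
    }
    # 3. Write it back.
    kv_store[key] = product
    # 4. Enqueue an indexing instruction carrying a snapshot of the document.
    index_queue.append({"product_key": key, "document": copy.deepcopy(product)})
    return product


apply_event({"product_key": 5, "retailer": "Amazon", "price": 180.0})
apply_event({"product_key": 5, "retailer": "Pixmania", "price": 185.0, "promotion": True})
```

Each instruction on the index queue carries the whole Product document, so the indexer never has to join line-items itself.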

slide-20
SLIDE 20

Where to join?

  • elasticsearch
      • Consider performance
      • Depends how data is structured/indexed (e.g. parent/child)
      • Compression, collisions
  • In-memory cache (e.g. Memcache)
  • Persistent storage (e.g. Cassandra or Mongo)
  • Two awesome benefits
      • Quickly re-index if needed
      • Updates have access to the full Product data
  • Serialisation is costly

slide-21
SLIDE 21

Synchronisation & Concurrency

  • Fault tolerance
      • Code to expect missing data
      • Out of sequence events
  • Concurrency Control
      • Apply Optimistic Concurrency Control at Mongo
      • Optimise for collisions
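
A sketch of optimistic concurrency control with retry, using an in-memory store so it runs on its own. The class and function names are hypothetical. With MongoDB the same idea is usually expressed as a conditional update that filters on both the key and the expected revision; that mapping is an assumption here, not something the slides spell out.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale revision."""


class InMemoryStore:
    """Toy document store; each document carries a revision counter."""

    def __init__(self):
        self._docs = {}  # key -> (document, revision)

    def get(self, key):
        doc, rev = self._docs.get(key, ({}, 0))
        return dict(doc), rev  # return a copy so callers cannot mutate the store

    def put_if_revision(self, key, doc, expected_rev):
        """Write only if nobody else has written since expected_rev was read."""
        _, current_rev = self._docs.get(key, ({}, 0))
        if current_rev != expected_rev:
            raise VersionConflict(key)
        self._docs[key] = (dict(doc), current_rev + 1)


def update_with_retry(store, key, mutate, max_attempts=5):
    """Read, modify, conditionally write; re-read and retry on collision."""
    for _ in range(max_attempts):
        doc, rev = store.get(key)
        mutate(doc)
        try:
            store.put_if_revision(key, doc, rev)
            return rev + 1
        except VersionConflict:
            continue
    raise VersionConflict(key)


store = InMemoryStore()
update_with_retry(store, 5, lambda d: d.update(price=180.0))
update_with_retry(store, 5, lambda d: d.update(stock=3))
```

No lock is held between the read and the write, so the common case stays cheap and only colliding writers pay for a retry.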

slide-22
SLIDE 22

Synchronisation & Concurrency

  • Synchronisation
      • Out of sequence index instructions
      • elasticsearch external versioning
      • Can rebuild from scratch if need to
  • Consistency
      • Which version is right?
      • Dates
      • Revision numbers from SQL
      • Independent updates
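
With `version_type=external`, elasticsearch accepts a write only when the supplied version is higher than the one it already holds, which is what makes out of sequence index instructions harmless. The toy class below simulates that rule in memory. It is an illustration of the rule under that assumption, not a client for elasticsearch, and all names are hypothetical.

```python
class ExternallyVersionedIndex:
    """Toy index that applies the external-versioning rule."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, version):
        """Store the document unless a newer or equal version is already present."""
        current = self._docs.get(doc_id)
        if current is not None and version <= current[0]:
            return False  # stale instruction, dropped
        self._docs[doc_id] = (version, source)
        return True

    def get(self, doc_id):
        return self._docs[doc_id]


idx = ExternallyVersionedIndex()
results = [
    idx.index(5, {"price": 180.0}, version=2),
    idx.index(5, {"price": 199.0}, version=1),  # arrives late, must not win
    idx.index(5, {"price": 175.0}, version=3),
]
```

The version could come from a date or from a revision number mastered in SQL, as long as it only ever increases for a given document.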

slide-23
SLIDE 23

Figures

  • Ingestion
      • 20m data points/day (continuously)
      • ~ 200GB
      • 3K msgs/second at peak
  • Hardware
      • SQL: 2 x 12-core, 64GB, 72-spindle SAN
      • Indexing: 4 x 4-core, 8GB
      • Mongo: 1 x 4-core, 16GB, 1xSSD
      • Elastic: 5 x 4-core, 16GB, 1xSSD

              Custom-Built Lucene    elasticsearch
  Latency     3 hours                < 1 second
  Bottleneck  Disk (SQL)             CPU

slide-24
SLIDE 24

Managing Change

[Diagram labels: Event Q; Indexer; Key/value store; Index_A; Index_B; Alias; Client]
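
This slide is a diagram only. Assuming it depicts the usual alias technique (clients query an alias, a new index is rebuilt from the key/value store, and the alias is then repointed), the request body for the `_aliases` endpoint can be sketched as below. The alias name is hypothetical, and the index names are lower-cased because elasticsearch index names must be lowercase.

```python
import json


def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build one request body that moves an alias from old_index to new_index.

    Both actions travel in the same request, so clients querying the alias
    never see a moment where it points at nothing.
    """
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names standing in for Alias, Index_A and Index_B on the slide.
body = json.dumps(alias_swap_actions("products", "index_a", "index_b"))
```

Once the alias points at the new index, the old one can be kept for rollback or deleted.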

slide-25
SLIDE 25

Thanks

  • @YannCluchey
  • Concurrency Patterns with MongoDB, http://slidesha.re/YFOehF
  • Consistency without Consensus, Peter Bourgon, SoundCloud, http://bit.ly/1DUAO1R
  • Eventually Consistent Data Structures, Sean Cribbs, Basho, https://vimeo.com/43903960