SLIDE 1 Scaling Saved Searches
Serving real-time push notifications for millions of saved searches
SLIDE 2
SLIDE 3
SLIDE 5 eBay Kleinanzeigen ≠ eBay
SLIDE 6
SLIDE 8 ads = classified ads
SLIDE 9
SLIDE 10
SLIDE 11
SLIDE 12
SLIDE 13
SLIDE 14
SLIDE 17 18M searches/day
SLIDE 18
SLIDE 19
SLIDE 20
SLIDE 21 Saved Searches
Serving real-time push notifications for millions of saved searches
SLIDE 22 700k new ads/day 8M saved searches
SLIDE 23 48,000,000,000 theoretical matches a day!
SLIDE 24 process it!
SLIDE 26
SLIDE 27
SLIDE 28
SLIDE 29
SLIDE 30
SLIDE 31
SLIDE 32
SLIDE 33
SLIDE 34
SLIDE 37 real time?
SLIDE 38 scalable?
SLIDE 39 Can we do better?
SLIDE 41 src=https://www.esciencecenter.nl/img/main/logo-elastic.png
SLIDE 42 Percolator
Traditionally you design documents based on your data, store them into an index, and then define queries via the search API in order to retrieve these documents. The percolator works in the opposite direction. First you store queries into an index and then, via the percolate API, you define documents in order to retrieve these queries.
src=https://www.elastic.co/guide/en/elasticsearch/reference/current/search-percolate.html
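The inversion described in the quote can be illustrated with a toy in-memory model. This is only a sketch of the idea, not the Elasticsearch percolate API: the class `ToyPercolator`, the query ids, and the ad fields are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Query = Callable[[Document], bool]


class ToyPercolator:
    """Stores queries first, then matches incoming documents against them."""

    def __init__(self) -> None:
        self._queries: Dict[str, Query] = {}

    def register(self, query_id: str, query: Query) -> None:
        # "Indexing" here means storing the query, not the document.
        self._queries[query_id] = query

    def percolate(self, doc: Document) -> List[str]:
        # Return the ids of all stored queries that match the document.
        return sorted(qid for qid, query in self._queries.items() if query(doc))


percolator = ToyPercolator()
percolator.register(
    "cars-in-berlin",
    lambda d: d["category"] == "cars" and d["city"] == "Berlin",
)
percolator.register("anything-free", lambda d: d["price"] == "0")

# A newly created ad plays the role of the document being percolated.
new_ad = {"category": "cars", "city": "Berlin", "price": "4500"}
matches = percolator.percolate(new_ad)
```

Each saved search corresponds to one registered query, and each new ad is percolated once to find the searches that should be notified.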
SLIDE 43
SLIDE 44
SLIDE 45
SLIDE 46
SLIDE 47
SLIDE 48
SLIDE 49
SLIDE 50
SLIDE 51
SLIDE 52
SLIDE 53
SLIDE 54
SLIDE 55
SLIDE 56
SLIDE 57
SLIDE 58
SLIDE 59 How many pushes per day?
SLIDE 60
SLIDE 61
~3x
SLIDE 63 700k new ads/day
SLIDE 65 ask search
SLIDE 66 how many results?
SLIDE 67 create buckets
SLIDE 68
0 - 100: RT
101 - 1000: 1h
1001 - 10000: 2h
> 10000: 6h
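The bucket boundaries on this slide could be expressed as a small lookup. This is a minimal sketch; the function name `push_interval` is an assumption, and "RT" is read as real time, i.e. a zero delay between pushes.

```python
from datetime import timedelta


def push_interval(result_count: int) -> timedelta:
    """Map the number of results a search yields to its push interval."""
    if result_count <= 100:
        return timedelta(0)  # RT: notify in real time
    if result_count <= 1000:
        return timedelta(hours=1)
    if result_count <= 10000:
        return timedelta(hours=2)
    return timedelta(hours=6)
```

Searches with few results are pushed immediately, while broad searches are throttled to at most one push every few hours.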
SLIDE 70 lifetime
a search
SLIDE 71
SLIDE 72
SLIDE 73 sleep ...
ZZZZZZ
SLIDE 74 ZZZZZZ
SLIDE 75
SLIDE 76
SLIDE 77
SLIDE 80
SLIDE 83 2 data centers 10 data + 3 master
SLIDE 84 2 data centers 10 data + 3 master
SLIDE 85 replication x1 shards x80
SLIDE 86
SLIDE 87
SLIDE 88
SLIDE 91
SLIDE 92 elastic fast indexing
SLIDE 93 filter sleeping searches
SLIDE 94
SLIDE 96 filter: { "next_pushdate": [* TO NOW] }
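One way to express this filter is as a range clause that keeps only searches whose next push date has already passed. The sketch below assumes the current Elasticsearch query DSL (a `range` clause inside a `bool` filter); the exact syntax used in the talk's Elasticsearch version may differ. The helper `is_awake` is a hypothetical local equivalent of the same check.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def awake_filter(field: str = "next_pushdate") -> Dict[str, Any]:
    """Build a filter matching searches that are due for a push."""
    # "lte": "now" corresponds to the range [* TO NOW] on the slide.
    return {"bool": {"filter": [{"range": {field: {"lte": "now"}}}]}}


def is_awake(next_pushdate: datetime, now: datetime) -> bool:
    """Local equivalent: a search is awake once its push date has passed."""
    return next_pushdate <= now


now = datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
sleeping = datetime(2016, 1, 1, 18, 0, tzinfo=timezone.utc)
due = datetime(2016, 1, 1, 11, 0, tzinfo=timezone.utc)
```

Sleeping searches never reach the percolation step, so the number of queries evaluated per ad shrinks accordingly.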
SLIDE 97
searches are
SLIDE 99 avoid db-read per search
SLIDE 100 hash per search
SLIDE 101 bloom filter in cookie
SLIDE 102 src=https://upload.wikimedia.org/wikipedia/commons/thumb/a/ac/Bloom_filter.svg/2000px-Bloom_filter.svg.png
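The idea of the last three slides (one hash per search, kept in a Bloom filter inside a cookie, so that no database read is needed to check whether a search is already saved) might look roughly like this. It is a sketch under assumptions: the filter size, the number of hash functions, the cookie encoding, and the names `search_hash` and `BloomFilter` are illustrative, not the production design.

```python
import base64
import hashlib
from typing import Dict, Iterator


def search_hash(params: Dict[str, str]) -> str:
    """Derive a stable hash from normalised search parameters."""
    canonical = "&".join(f"{k}={v}" for k, v in sorted(params.items()))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()


class BloomFilter:
    def __init__(self, size_bits: int = 1024, num_hashes: int = 3) -> None:
        self.size_bits = size_bits  # must be a multiple of 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not saved"; True means "possibly saved".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

    def to_cookie(self) -> str:
        return base64.urlsafe_b64encode(bytes(self.bits)).decode("ascii")

    @classmethod
    def from_cookie(cls, value: str, size_bits: int = 1024,
                    num_hashes: int = 3) -> "BloomFilter":
        restored = cls(size_bits, num_hashes)
        restored.bits = bytearray(base64.urlsafe_b64decode(value))
        return restored


saved = search_hash({"category": "cars", "city": "Berlin"})
bloom = BloomFilter()
bloom.add(saved)
cookie_value = bloom.to_cookie()
restored = BloomFilter.from_cookie(cookie_value)
```

A Bloom filter never gives false negatives, so a "not contained" answer can be trusted without a lookup; a "possibly contained" answer may still need confirmation.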
SLIDE 104 deep link
result size
SLIDE 105
SLIDE 108 store searches local
SLIDE 109 backend sync
actions
SLIDE 110 Saved Search
SLIDE 112 Stabilize elastic
SLIDE 113 Boost your percolator!
Tips & Tricks
SLIDE 114 “This indeed seems like a large application of percolate.”
Elastic support, June 2015
SLIDE 115 Performance linear with number of queries
SLIDE 116
- 1. Consider using other systems.
SLIDE 117
- 1. Consider using other systems.
“It is worth noting that simple exist matches on a field are probably not a great application for percolator. This doesn’t utilize any text matching capability or complex boolean.”
Anything, anywhere! Every ad offering something for free!
SLIDE 118
- 1. Consider using other systems.
SLIDE 119
- 2. Optimise your data structure.
SLIDE 120
- 2. Optimise your data structure.
SLIDE 121
- 2. Optimise your data structure.
SLIDE 122
- 3. Filter, filter, filter!
SLIDE 123
- 3. Filter, filter, filter!
“The filter only works on the metadata fields. The query field isn’t indexed by default.”
SLIDE 124
- 3. Filter, filter, filter!
CATEGORY: cars CATEGORY: all CATEGORY: cars OR all
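The category example can be read as: for an ad in a given category, only searches registered for that category or for all categories need to be evaluated. A minimal sketch, assuming a metadata field named `category` and the sentinel value "all" (both taken from the slide, with the exact field layout being an assumption):

```python
from typing import Any, Dict, List


def category_filter(ad_category: str) -> Dict[str, Any]:
    """Build a terms clause selecting searches for this category or 'all'."""
    return {"terms": {"category": [ad_category, "all"]}}


def candidate_searches(searches: List[Dict[str, str]],
                       ad_category: str) -> List[Dict[str, str]]:
    """Local equivalent: drop searches that cannot match the ad's category."""
    wanted = {ad_category, "all"}
    return [s for s in searches if s["category"] in wanted]


stored = [
    {"id": "s1", "category": "cars"},
    {"id": "s2", "category": "all"},
    {"id": "s3", "category": "furniture"},
]
remaining = candidate_searches(stored, "cars")
```

Because the filter runs on indexed metadata, it narrows the candidate set before any stored query has to be executed.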
SLIDE 125 … what else can we filter?
- 3. Filter, filter, filter!
SLIDE 126
- 3. Filter, filter, filter!
SLIDE 128
- 5. Use parallel bulk requests.
SLIDE 129
- 5. Use parallel bulk requests.
index node1 A1 node2 A2
SLIDE 130
- 5. Use parallel bulk requests.
“Currently, to utilise all of your shards, you would need to consider sending multipercolate requests in parallel.”
index node1 A1 node2 A2
https://github.com/elastic/elasticsearch/issues/13177
SLIDE 131
- 5. Use parallel bulk requests.
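Sending several bulk requests at once could be sketched with a thread pool. The function `percolate_batch` stands in for whatever client call sends one multi-percolate request; it is a hypothetical placeholder, as are the batch size and worker count.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Ad = Dict[str, str]
BatchResult = Dict[str, List[str]]


def chunked(items: List[Ad], size: int) -> List[List[Ad]]:
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def percolate_in_parallel(ads: List[Ad],
                          percolate_batch: Callable[[List[Ad]], BatchResult],
                          batch_size: int = 2,
                          workers: int = 4) -> BatchResult:
    """Send batches concurrently and merge the per-ad match lists."""
    batches = chunked(ads, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(percolate_batch, batches))
    merged: BatchResult = {}
    for result in results:
        merged.update(result)
    return merged


def fake_percolate_batch(batch: List[Ad]) -> BatchResult:
    # Stand-in for a real bulk request: pretend each ad matches one search.
    return {ad["id"]: [f"search-for-{ad['category']}"] for ad in batch}


ads = [{"id": str(i), "category": "cars"} for i in range(5)]
all_matches = percolate_in_parallel(ads, fake_percolate_batch)
```

With a real client, each worker would hold one request in flight, so several requests can be processed by the cluster at the same time.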
SLIDE 133 Matthias: Antique copper lamps in Pankow André: Cars in Berlin
SLIDE 134
André: Cars in Berlin Matthias: Antique copper lamps in Pankow
SLIDE 135
HIGH PRIORITY LOW PRIORITY
André: Cars in Berlin Matthias: Antique copper lamps in Pankow
SLIDE 137 Outcome
Reduced percolation time:
SLIDE 138 Outcome
Doubled the number of push notifications:
SLIDE 139 Stabilize elastic
SLIDE 141 8,000,000 searches 700,000 ads/day
SLIDE 142 Stabilize platform
SLIDE 143 eBayK saved searches: the 2016 architecture
SLIDE 144 Before: one DB rules it all
MySQL
SLIDE 145 Before: one DB rules it all
create saved search
MySQL
SLIDE 146 Before: one DB rules it all
create saved search change saved search
MySQL
SLIDE 147 Before: one DB rules it all
create ad create saved search change saved search
MySQL
SLIDE 148 Before: one DB rules it all
create ad create saved search change saved search
MySQL
found match
SLIDE 149 Before: one DB rules it all
create ad create saved search change saved search
MySQL
got push found match
SLIDE 150 MySQL
Before...
SLIDE 151 MySQL
AwakeJob
Before...
SLIDE 152 MySQL
AwakeJob SendJob CreateJob
Before...
SLIDE 153 MySQL
CleanupJob AwakeJob SendJob IndexerJob CreateJob ExpireJob
Before...
SLIDE 154 Before: bottleneck communication via DB
super high performance resiliency scalability
..?
SLIDE 155 Goal: event-driven data pipeline
SLIDE 156
SLIDE 157 What is Apache Kafka?
distributed messaging system - persistent - high throughput
Topic 1 Topic 2
Producer Producer Consumer Consumer Consumer
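The producer, topic, and consumer roles can be modelled with a toy in-memory log. This is not a Kafka client and leaves out partitions, brokers, and durability; the classes `Topic` and `Consumer` are hypothetical and only illustrate that messages persist in the topic and that each consumer tracks its own read position.

```python
from typing import List


class Topic:
    """Append-only log: messages stay in place after being read."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.log: List[str] = []

    def append(self, message: str) -> int:
        # A producer appends and receives the offset of the new message.
        self.log.append(message)
        return len(self.log) - 1


class Consumer:
    """Reads from a topic and remembers its own offset."""

    def __init__(self, topic: Topic) -> None:
        self.topic = topic
        self.offset = 0

    def poll(self) -> List[str]:
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return messages


ads_topic = Topic("create-ad")
first_offset = ads_topic.append("ad-1")
ads_topic.append("ad-2")

percolator_consumer = Consumer(ads_topic)
indexer_consumer = Consumer(ads_topic)
seen_by_percolator = percolator_consumer.poll()
seen_by_indexer = indexer_consumer.poll()
```

Both consumers receive every message independently, which is what lets several jobs react to the same event without coordinating through a shared database.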
SLIDE 158 But what’s new?
SLIDE 159 But what’s new?
1 2 3
SLIDE 160 Now: streams and data flows
percolate create ad
SLIDE 161 Now: streams and data flows
percolate create ad found match
SLIDE 162 Now: streams and data flows
percolate process match create ad found match
SLIDE 163 Now: streams and data flows
percolate process push create ad found match
SLIDE 164 Now: streams and data flows
percolate process push create ad found match
MySQL
SLIDE 165 Now: streams and data flows
percolate process push create ad found match
MySQL
SLIDE 166 Compaction
SLIDE 167 Compaction: Kafka == source of truth?
SLIDE 168 Compaction: Kafka == source of truth?
A: 23 B: 12 B: null C: A: 24 time
SLIDE 169 Compaction: Kafka == source of truth?
A: 23 B: 12 B: null C: A: 24 A: 24 C: time
SLIDE 170 Compaction: Kafka == source of truth?
A: 24 C: time
SLIDE 171 Compaction: Kafka == source of truth?
Consumer A: 24 C:
SLIDE 172 Compaction: Kafka == source of truth?
Consumer A: 24 C:
SLIDE 173 Compaction: Kafka == source of truth?
Consumer A: 24 C:
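The compaction example on these slides can be reproduced with a short function: keep only the latest record per key and drop keys whose latest value is null. This is a sketch of the end state only; in Kafka, compaction runs in the background and null markers are removed eventually rather than immediately. The value for key C is elided on the slide, so "c-value" below is a hypothetical placeholder.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep the last record per key, dropping keys deleted via null."""
    last_offset = {key: offset for offset, (key, _) in enumerate(log)}
    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if last_offset[key] == offset and value is not None
    ]


def rebuild_state(log: List[Record]) -> Dict[str, str]:
    """What a consumer reading the compacted topic from the start ends up with."""
    return {key: value for key, value in compact(log) if value is not None}


log: List[Record] = [
    ("A", "23"),
    ("B", "12"),
    ("B", None),       # null value: B is deleted
    ("C", "c-value"),  # hypothetical value, elided on the slide
    ("A", "24"),
]
compacted = compact(log)
state = rebuild_state(log)
```

A consumer that starts from the beginning of the compacted topic sees the current value of every live key, which is why the topic can be considered as a source of truth.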
SLIDE 174 Issues encountered
SLIDE 175 Issues encountered
latency - used local cache
SLIDE 176 Issues encountered
some components couldn’t keep up - spot-on optimisation
latency - used local cache
SLIDE 177 Issues encountered
some components couldn’t keep up - spot-on optimisation
latency - used local cache
SLIDE 179
simplicity
fine tune elastic
use streaming
SLIDE 180 Thank you
SLIDE 181
SLIDE 182 References
“Building LinkedIn’s Real-time Activity Data Pipeline”, Ken Goodhope, Joel Koshy, Jay Kreps, Neha Narkhede, Richard Park, Jun Rao, Victor Yang Ye