Thinking about performance: Search, a case study


SLIDE 1

Thinking about performance

SLIDE 2

Search: a case study

SLIDE 3

Perf: speed/power/etc.

SLIDE 4

Perf: why do we care?

SLIDE 5

“Premature optimization is the root of all evil”

SLIDE 6

“We should forget about small efficiencies, say about 97% of the time”

SLIDE 7

Different designs: 100x - 1000x perf difference

SLIDE 8
SLIDE 9

“Coding feels like real work”

SLIDE 10

Whiteboard: 1h/iteration
Implementation: 2yr/iteration

SLIDE 11

Scale

(precursor to perf discussion)

SLIDE 12

10k; 10M; 10G

(5kB per doc)

SLIDE 13

What’s the actual problem?

SLIDE 14

AND queries

SLIDE 15

10k; 10M; 10G

(5kB per doc)

10k

SLIDE 16

One person’s email
One forum

10k

SLIDE 17

5kB * 10k = 50MB

10k

SLIDE 18

50MB is small!

10k

SLIDE 19

$50 phone => 1GB RAM

10k

SLIDE 20

Naive algorithm

for loop over all documents {
    for loop over terms in document {
        // matching logic here.
    }
}

10k
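The loop above can be sketched as runnable Python; `naive_and_search` and the toy documents are my names, not anything from the talk. The point is that at 10k documents (~50MB) a full scan like this is already fast enough.

```python
# Naive search: scan every term of every document for each query.
# At ~50MB of text this brute-force approach is perfectly adequate.
def naive_and_search(documents, query_terms):
    """Return ids of documents containing every query term (AND query)."""
    matches = []
    for doc_id, document in enumerate(documents):
        terms = set(document.split())  # crude tokenization, once per doc
        if all(term in terms for term in query_terms):
            matches.append(doc_id)
    return matches

docs = ["large yellow dog", "small yellow cat", "large yellow dog barks"]
print(naive_and_search(docs, ["large", "dog"]))  # → [0, 2]
```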

SLIDE 21

10k; 10M; 10G

(5kB per doc)

10M

SLIDE 22

~Wikipedia sized

10M

SLIDE 23

5kB * 10M = 50GB

10M

SLIDE 24

$2000 for 128GB server

(Broadwell single socket Xeon-D)

10M

SLIDE 25

25 GB/s memory bandwidth

10M

SLIDE 26

50GB / 25 GB/s = 2s

(½ query per sec (QPS))

10M

SLIDE 27

Is 2s latency ok?

10M

SLIDE 28

Is 1/2 QPS ok?

10M

SLIDE 29

Larger service

Latency == $$$

10M

SLIDE 30

Latency == $$$

http://assets.en.oreilly.com/1/event/29/Keynote%20Presentation%202.pdf
http://www.bizreport.com/2016/08/mobify-report-reveals-impact-of-mobile-website-speed.html
http://assets.en.oreilly.com/1/event/29/The%20User%20and%20Business%20Impact%20of%20Server%20Delays,%20Additional%20Bytes,%20and%20HTTP%20Chunking%20in%20Web%20Search%20Presentation.pptx
http://assets.en.oreilly.com/1/event/27/Varnish%20-%20A%20State%20of%20the%20Art%20High-Performance%20Reverse%20Proxy%20Presentation.pdf

10M

SLIDE 31

Google: 400ms extra latency

0.44% decrease in searches per user

10M

SLIDE 32

Google: 400ms extra latency

0.44% decrease in searches per user
0.76% after six weeks

10M

SLIDE 33

Google: 400ms extra latency

0.44% decrease in searches per user
0.76% after six weeks
0.21% decrease after delay removed

10M

SLIDE 34

Bing

10M

SLIDE 35

Mobify

100ms home load => 1.11% delta in conversions

10M

SLIDE 36

Mobify

100ms home load => 1.11% delta in conversions
100ms checkout page speed => 1.55% delta in conversions

10M

SLIDE 37

10M

SLIDE 38

To hit 500ms round trip...

10M

SLIDE 39

...budget ~10ms for search

10M

SLIDE 40

Larger service

Latency == $$$
Need to handle more than ½ QPS

10M

SLIDE 41

Use an index?

Salton; The SMART Retrieval System (1971); work originally done in early 60s

10M

SLIDE 42

30 - 30,000 QPS

(we’ll talk about figuring this out later)

http://www.anandtech.com/show/9185/intel-xeon-d-review-performance-per-watt-server-soc-champion/14
Haque et al.; Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services (ASPLOS, 2015)

10M

SLIDE 43

10k; 10M; 10G

(5kB per doc)

10B

SLIDE 44

5kB * 10G = 50TB

10B

SLIDE 45

Horizontal scaling

(use more machines)

10B

SLIDE 46

Easy to scale

(different documents on different machines)

10B

SLIDE 47

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines

10B

SLIDE 48

Redmond-Dresden: 150ms

10B

SLIDE 49

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines
1k machines * 10 clusters = 10k machines

10B

SLIDE 50

“[With 1800 machines, in one year], it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will “go wonky,” with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span”

10B

SLIDE 51

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines
1k machines * 10 clusters = 10k machines
10k machines * 3 redundancy = 30k machines

10B

SLIDE 52

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines
1k machines * 10 clusters = 10k machines
10k machines * 3 redundancy = 30k machines
30k machines * $1k/yr/machine = $30M / yr

10B
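The sizing chain above is plain arithmetic, so it can be checked directly. This sketch just replays the slide's numbers; the $1k/yr/machine all-in cost is the talk's assumption, not a measured figure.

```python
# Cluster sizing and cost, straight from the slide's arithmetic.
docs = 10e9                   # 10G documents
docs_per_machine = 10e6       # 10M docs fit on one machine
clusters = 10                 # clusters spread around the world for latency
redundancy = 3                # replicas for fault tolerance
cost_per_machine_year = 1000  # $1k/yr/machine (assumed all-in cost)

machines = docs / docs_per_machine  # 1k machines for one copy of the index
machines *= clusters                # 10k machines across clusters
machines *= redundancy              # 30k machines with redundancy
yearly_cost = machines * cost_per_machine_year
print(int(machines), int(yearly_cost))  # → 30000 30000000
```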

SLIDE 53

2x perf: $15M/yr

10B

SLIDE 54

2% perf: $600k/yr

10B

SLIDE 55

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines
1k machines * 10 clusters = 10k machines
10k machines * 3 redundancy = 30k machines
30k machines * $1k/yr/machine = $30M / yr
Machine time vs. dev time

10B

SLIDE 56

Search Algorithms

SLIDE 57

What’s the problem again?

Algorithms

SLIDE 58

Posting list

Algorithms: posting list

SLIDE 59

See http://nlp.stanford.edu/IR-book/ for implementation details

SLIDE 60

HashMap[term] => list[docs]

Algorithms: posting list
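The HashMap-of-lists structure on this slide can be sketched in a few lines; `build_index` and `and_query` are illustrative names (see the Stanford IR book cited above for real implementations, which keep postings sorted and intersect them with merge-style walks).

```python
from collections import defaultdict

# Posting list: map each term to the sorted list of doc ids containing it.
def build_index(documents):
    index = defaultdict(list)
    for doc_id, document in enumerate(documents):
        for term in set(document.split()):
            index[term].append(doc_id)  # doc ids arrive in sorted order
    return index

# AND query: intersect the posting lists of all query terms.
def and_query(index, query_terms):
    result = set(index[query_terms[0]])
    for term in query_terms[1:]:
        result &= set(index[term])
    return sorted(result)

docs = ["large yellow dog", "small yellow cat", "large dog"]
idx = build_index(docs)
print(and_query(idx, ["yellow", "dog"]))  # → [0]
```

Unlike the naive scan, query cost now depends on the lengths of the posting lists touched, not on the total corpus size.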

SLIDE 61

Bloom filter

Algorithms: bloom filter

SLIDE 62

BitFunnel

Algorithms: bloom filter

SLIDE 63

What about an array?

Algorithms: bloom filter

SLIDE 64
SLIDE 65

How many terms?

Algorithms: bloom filter

SLIDE 66

Algorithms: bloom filter

SLIDE 67

One site has 37B primes

Algorithms: bloom filter

SLIDE 68

GUIDs, timestamps, DNA, etc.

Algorithms: bloom filter

SLIDE 69

Why index that stuff?

Algorithms: bloom filter

SLIDE 70

GTGACCTTGGGCAAGTTACTTA ACCTCTCTGTGCCTCAGTTTCCT CATCTGTAAAATGGGGATAATA

Algorithms: bloom filter

SLIDE 71
SLIDE 72

Most terms aren’t in most docs => use hashing

Algorithms: bloom filter

SLIDE 73

Bloom Filters

Algorithms: bloom filter

SLIDE 74
SLIDE 75
SLIDE 76
SLIDE 77
SLIDE 78

Probability of false positive?

Algorithms: bloom filter

SLIDE 79

(assume 10% bit density)
1 location: .1 = 10% false positive rate

Algorithms: bloom filter

SLIDE 80

(assume 10% bit density)
1 location: .1 = 10% false positive rate
2 locations: .1 * .1 = 1% false positive rate

Algorithms: bloom filter

SLIDE 81

(assume 10% bit density)
1 location: .1 = 10% false positive rate
2 locations: .1 * .1 = 1% false positive rate
3 locations: .1 * .1 * .1 = 0.1% false positive rate

Algorithms: bloom filter
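The multiplication on this slide generalizes directly: with bit density d and k probe locations, a non-member survives all k checks with probability d^k. A one-line check:

```python
# Bloom filter false positive rate: with bit density d, each of the k
# probed locations is set "by accident" with probability d, so a
# non-member passes all k checks with probability d**k.
def false_positive_rate(bit_density, num_locations):
    return bit_density ** num_locations

for k in (1, 2, 3):
    print(k, false_positive_rate(0.1, k))  # 10%, 1%, 0.1%
```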

SLIDE 82

Linear cost
Exponential benefit

Algorithms: bloom filter

SLIDE 83

Multiple Documents
Multiple Bloom Filters

Algorithms: bloom filter

SLIDE 84
SLIDE 85

Do comparisons in parallel!

Algorithms: bloom filter
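The parallel comparison can be sketched with a bit-sliced layout: row r holds bit r of every document's filter, so ANDing rows checks many documents per machine word. This is only an illustration of the idea, not BitFunnel's actual layout; `NUM_ROWS` and `hash_rows` are made up, and Python ints stand in for fixed-width words.

```python
# Bit-sliced Bloom filters: rows[r] holds bit r of every document's
# filter, so a single AND over words tests many documents at once.
NUM_ROWS = 64

def hash_rows(term, k=3):
    """Map a term to k row indices (stand-in for real hash functions)."""
    return [hash((term, i)) % NUM_ROWS for i in range(k)]

def build_rows(documents):
    rows = [0] * NUM_ROWS
    for doc_id, document in enumerate(documents):
        for term in document.split():
            for r in hash_rows(term):
                rows[r] |= 1 << doc_id  # set this doc's bit in row r
    return rows

def query(rows, query_terms):
    matches = ~0  # start with every document as a candidate
    for term in query_terms:
        for r in hash_rows(term):
            matches &= rows[r]  # one AND filters all documents at once
    return matches  # bitmap of (possibly false positive) matches

docs = ["large yellow dog", "small cat", "yellow dog barks"]
m = query(build_rows(docs), ["yellow", "dog"])
print([i for i in range(len(docs)) if m & (1 << i)])
```

There are no false negatives, so every true match is always in the output; occasional extra doc ids are the false positives the previous slides priced out.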

SLIDE 86
SLIDE 87

Algorithms: bloom filter

SLIDE 88

Algorithms: bloom filter

SLIDE 89

Algorithms: bloom filter

SLIDE 90

Algorithms: bloom filter

SLIDE 91

Algorithms: bloom filter

SLIDE 92
SLIDE 93
SLIDE 94

Algorithms: bloom filter

SLIDE 95

Algorithms: bloom filter

SLIDE 96

How do we estimate perf?

SLIDE 97

Cost model
Number of operations

Perf estimation

SLIDE 98

512-bit “blocks”

(pay for memory accesses)

Perf estimation

SLIDE 99

How many memory accesses per block?

Perf estimation

SLIDE 100

http://bitfunnel.org

Perf estimation

SLIDE 101

Perf estimation

SLIDE 102
SLIDE 103

Why do we have so many rows?

Term rewriting

Perf estimation

SLIDE 104

Term Rewriting

“Large yellow dog”

Perf estimation

SLIDE 105

Term Rewriting

“Large yellow dog” || “Golden Retriever”

Perf estimation

SLIDE 106

Term Rewriting

“Large yellow dog” || “Golden Retriever” || “Old Yeller” ||

Perf estimation

SLIDE 107
SLIDE 108

Expected performance?

Perf estimation

SLIDE 109

10 M docs / 512 bits per block = 20k “blocks”

Perf estimation

SLIDE 110

10 M docs / 512 bits per block = 20k “blocks”
20 k-blocks * 5 transfers per block = 100 kT

Perf estimation

SLIDE 111

10 M docs / 512 bits per block = 20k “blocks”
20 k-blocks * 5 transfers per block = 100 kT
25 GB/s / 512 bits per transfer = 390 MT/s

Perf estimation

SLIDE 112

10 M docs / 512 bits per block = 20k “blocks”
20 k-blocks * 5 transfers per block = 100 kT
25 GB/s / 512 bits per transfer = 390 MT/s
390 MT/s / 100 kT = 3900 QPS (with rounding)

Perf estimation
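The whole estimate is back-of-envelope arithmetic, so it can be replayed exactly. Without the slide's intermediate rounding (20k blocks, 390 MT/s) the chain comes out at 4000 QPS rather than 3900; the 5 transfers per block is the cost-model input from the earlier slide.

```python
# Back-of-envelope throughput for the bit-sliced index.
docs = 10e6               # 10M documents
bits_per_block = 512      # one "block" covers 512 documents' bits
transfers_per_block = 5   # memory accesses per block (cost-model input)
bandwidth = 25e9          # bytes/sec of memory bandwidth
bytes_per_transfer = 512 / 8

blocks = docs / bits_per_block                       # ~20k blocks
transfers_per_query = blocks * transfers_per_block   # ~100k transfers
transfers_per_sec = bandwidth / bytes_per_transfer   # ~390M transfers/sec
qps = transfers_per_sec / transfers_per_query
print(round(qps))  # → 4000 (the slide's rounded inputs give ~3900)
```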

SLIDE 113

Actual performance?

Perf estimation

SLIDE 114

Actual performance ~similar

Perf estimation

SLIDE 115

Small factors

Perf estimation

SLIDE 116

Large factors

Perf estimation

SLIDE 117

Ranking results

Perf estimation

SLIDE 118

Ingestion

(faster than querying)

Perf estimation

SLIDE 119

Ingestion is just setting bits

Perf estimation

SLIDE 120

Hierarchical bloom filters

Perf estimation

SLIDE 121

Complicating issues?

Perf estimation

SLIDE 122
SLIDE 123

Conclusions?

SLIDE 124

False conclusions

Search is simple

SLIDE 125

False conclusions

Search is simple
Bloom filters are better than posting lists

Zobel et al., Inverted files versus signature files for text indexing; TODS 1998

SLIDE 126

False conclusions

Search is simple
Bloom filters are better than posting lists
You can easily reason about all performance

Zobel et al., Inverted files versus signature files for text indexing; TODS 1998

SLIDE 127

Conclusions!

SLIDE 128

You can reason about perf

SLIDE 129

It’s often just arithmetic

SLIDE 130

Acknowledgements

Thanks to Leah Hanson, Mike Hopcroft, Julia Evans, Hari Angepat, David Turner, Danielle Sucher, Ikhwan Lee, Tejas Sapre, Raul Jara, Rich Ercolani, Bert Muthalaly, Harsha Nori, Jeshua Smith, Bill Barnes, Gary Bernhardt, Marek Majkowski, Tom Crayford, Gina Willard, Laura Lindzey, Larry Marbuger, Siddarth Anand, Eric Lemmon, Tom Ballinger, and [Anonymous Reviewer] for feedback.

SLIDE 131

bitfunnel.org/strangeloop github.com/bitfunnel/bitfunnel danluu.com

SLIDE 132

Unused slides

(thar be dragons)

SLIDE 133

SLIDE FOR HOMEWORK. TODO: USE DIFFERENT TEMPLATE

SLIDE 134

Why are posting lists standard?

SLIDE 135

Literature on alternatives

“Signatures files were proposed in [23] and shown to be inferior to inverted indexing in [24]. “

“Inverted indexes have been benchmarked as the most generalisable, and well performing structure (Zobel et al., 1998). The experiments in this thesis are therefore conducted solely on an inverted index system.”

“While this technique provides a relatively low computation overhead, studies by Zobel et al. [1998] have shown that inverted files significantly outperform signature files. We will now focus the analysis on inverted files as it is generally considered to be the most efficient indexing method for most IR systems.”

“The other two mechanisms are usually adopted in certain applications even if, recently, they have been mostly abandoned in favor of inverted indexes because some extensive experimental results [194] have shown that: Inverted indexes offer better performance than signature files and bitmaps, in terms of both size of index and speed of query handling [188]”

“Zobel et al. [16] compared inverted files and signature files with respect to query response time and space requirements. They found that the inverted files evaluated queries in less time than the signature files and needed less space. Their results showed that the signature files were much larger, more expensive to construct and update, their response time was unpredictable, they support ranked queries only with difficulty, they did not scale well and they were slow”

SLIDE 136

Zobel et al., actual quotes

“Inverted file indexes with in-memory search structures require no more disk accesses to answer a conjunctive query than do bitsliced signature files.”

“One of the difficulties in the comparison of inverted files and signature files is that many variants of signature file techniques have been proposed, and it is possible that some combination of parameters and variants will result in a better method.”

SLIDE 137

Citations are lossy

SLIDE 138
SLIDE 139
SLIDE 140
SLIDE 141
SLIDE 142
SLIDE 143
SLIDE 144
SLIDE 145
SLIDE 146
SLIDE 147
SLIDE 148
SLIDE 149

Search: why do we care?

SLIDE 150

$20M/yr * 2% savings = $400k/yr

SLIDE 151

How things fit together

SLIDE 152

TODO: add diagram

SLIDE 153

Posting list

SLIDE 154
SLIDE 155
SLIDE 156
SLIDE 157

How many terms?

SLIDE 158

TODO: pseudo-code
TODO: diagram about how bits drop out
TODO: search is a high dynamic range problem
TODO: higher rank rows
TODO: sharding by document length
TODO: diagram of how things fit together. Could just be concentric circles

SLIDE 159

Posting lists are standard

SLIDE 160

Posting list optimizations

Skip list
Delta compression
etc.
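One of these optimizations is simple enough to sketch: delta compression stores the gaps between sorted doc ids instead of the ids themselves, so the numbers stay small and compress well with a variable-length encoding (function names here are mine).

```python
# Delta compression of a posting list: [3, 7, 21, 22] -> gaps [3, 4, 14, 1].
# Small gaps take fewer bits under a varint-style encoding than raw ids.
def delta_encode(postings):
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

p = [3, 7, 21, 22]
g = delta_encode(p)
print(g)  # → [3, 4, 14, 1]
assert delta_decode(g) == p
```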

SLIDE 161

Search

SLIDE 162

Perf: how to think about it?

SLIDE 163

Performance

SLIDE 164

Search is BIG

SLIDE 165

Parsing / Tokenization

Harder than it sounds

SLIDE 166

Search is a big problem

Tokenization
Some languages mix alphabets, are partially left-to-right and right-to-left, etc.
Can’t drop non-alphanumeric characters (C# vs C++)
Multi-language queries
Ranking / Relevance
Distributed Systems
etc.