Thinking about performance: Search, a case study


SLIDE 1

Thinking about performance

SLIDE 2

Search: a case study

SLIDE 3

Perf: speed/power/etc.

SLIDE 4

Perf: why do we care?

SLIDE 5

“Premature optimization is the root of all evil”

SLIDE 6

“We should forget about small efficiencies, say about 97% of the time”

SLIDE 7

Different designs: 100x - 1000x perf difference

SLIDE 8
SLIDE 9

“Coding feels like real work”

SLIDE 10

Whiteboard: 1h/iteration
Implementation: 2yr/iteration

SLIDE 11

Scale

(precursor to perf discussion)

SLIDE 12

10k; 10M; 10G

(5kB per doc)

SLIDE 13

What’s the actual problem?

SLIDE 14

AND queries

SLIDE 15

10k; 10M; 10G

(5kB per doc)

10k

SLIDE 16

One person’s email
One forum

10k

SLIDE 17

5kB * 10k = 50MB

10k

SLIDE 18

50MB is small!

10k

SLIDE 19

$50 phone => 1GB RAM

10k

SLIDE 20

Naive algorithm

for loop over all documents {
    for loop over terms in document {
        // matching logic here.
    }
}

10k
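The loop above can be sketched as runnable Python; `naive_and_search` and the toy documents are my names, not anything from the talk. The point is that at 10k documents (~50MB) a full scan like this is already fast enough.

```python
# Naive search: scan every term of every document for each query.
# At ~50MB of text this brute-force approach is perfectly adequate.
def naive_and_search(documents, query_terms):
    """Return ids of documents containing every query term (AND query)."""
    matches = []
    for doc_id, document in enumerate(documents):
        terms = set(document.split())  # crude tokenization, once per doc
        if all(term in terms for term in query_terms):
            matches.append(doc_id)
    return matches

docs = ["large yellow dog", "small yellow cat", "large yellow dog barks"]
print(naive_and_search(docs, ["large", "dog"]))  # → [0, 2]
```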

SLIDE 21

10k; 10M; 10G

(5kB per doc)

10M

SLIDE 22

~Wikipedia sized

10M

SLIDE 23

5kB * 10M = 50GB

10M

SLIDE 24

$2000 for 128GB server

(Broadwell single socket Xeon-D)

10M

SLIDE 25

25 GB/s memory bandwidth

10M

SLIDE 26

50GB / 25 GB/s = 2s

(½ query per sec (QPS))

10M

SLIDE 27

Is 2s latency ok?

10M

SLIDE 28

Is 1/2 QPS ok?

10M

SLIDE 29

Larger service

Latency == $$$

10M

SLIDE 30

Latency == $$$

http://assets.en.oreilly.com/1/event/29/Keynote%20Presentation%202.pdf
http://www.bizreport.com/2016/08/mobify-report-reveals-impact-of-mobile-website-speed.html
http://assets.en.oreilly.com/1/event/29/The%20User%20and%20Business%20Impact%20of%20Server%20Delays,%20Additional%20Bytes,%20and%20HTTP%20Chunking%20in%20Web%20Search%20Presentation.pptx
http://assets.en.oreilly.com/1/event/27/Varnish%20-%20A%20State%20of%20the%20Art%20High-Performance%20Reverse%20Proxy%20Presentation.pdf

10M

SLIDE 31

Google: 400ms extra latency

0.44% decrease in searches per user

10M

SLIDE 32

Google: 400ms extra latency

0.44% decrease in searches per user
0.76% after six weeks

10M

SLIDE 33

Google: 400ms extra latency

0.44% decrease in searches per user
0.76% after six weeks
0.21% decrease after delay removed

10M

SLIDE 34

Bing

10M

SLIDE 35

Mobify

100ms home load => 1.11% delta in conversions

10M

SLIDE 36

Mobify

100ms home load => 1.11% delta in conversions
100ms checkout page speed => 1.55% delta in conversions

10M

SLIDE 37

10M

SLIDE 38

To hit 500ms round trip...

10M

SLIDE 39

...budget ~10ms for search

10M

SLIDE 40

Larger service

Latency == $$$
Need to handle more than ½ QPS

10M

SLIDE 41

Use an index?

Salton; The SMART Retrieval System (1971); work originally done in early 60s

10M

SLIDE 42

30 - 30,000 QPS

(we’ll talk about figuring this out later)

http://www.anandtech.com/show/9185/intel-xeon-d-review-performance-per-watt-server-soc-champion/14
Haque et al.; Few-to-Many: Incremental Parallelism for Reducing Tail Latency in Interactive Services (ASPLOS, 2015)

10M

SLIDE 43

10k; 10M; 10G

(5kB per doc)

10B

SLIDE 44

5kB * 10G = 50TB

10B

SLIDE 45

Horizontal scaling

(use more machines)

10B

SLIDE 46

Easy to scale

(different documents on different machines)

10B

SLIDE 47

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines

10B

SLIDE 48

Redmond-Dresden: 150ms

10B

SLIDE 49

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines
1k machines * 10 clusters = 10k machines

10B

SLIDE 50

“[With 1800 machines, in one year], it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur; one power distribution unit will fail, bringing down 500 to 1,000 machines for about 6 hours; 20 racks will fail, each time causing 40 to 80 machines to vanish from the network; 5 racks will “go wonky,” with half their network packets missing in action; and the cluster will have to be rewired once, affecting 5 percent of the machines at any given moment over a 2-day span”

10B

SLIDE 51

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines
1k machines * 10 clusters = 10k machines
10k machines * 3 redundancy = 30k machines

10B

SLIDE 52

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines
1k machines * 10 clusters = 10k machines
10k machines * 3 redundancy = 30k machines
30k machines * $1k/yr/machine = $30M / yr

10B
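The sizing chain above is plain arithmetic, so it can be checked directly. This sketch just replays the slide's numbers; the $1k/yr/machine all-in cost is the talk's assumption, not a measured figure.

```python
# Cluster sizing and cost, straight from the slide's arithmetic.
docs = 10e9                   # 10G documents
docs_per_machine = 10e6       # 10M docs fit on one machine
clusters = 10                 # clusters spread around the world for latency
redundancy = 3                # replicas for fault tolerance
cost_per_machine_year = 1000  # $1k/yr/machine (assumed all-in cost)

machines = docs / docs_per_machine  # 1k machines for one copy of the index
machines *= clusters                # 10k machines across clusters
machines *= redundancy              # 30k machines with redundancy
yearly_cost = machines * cost_per_machine_year
print(int(machines), int(yearly_cost))  # → 30000 30000000
```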

SLIDE 53

2x perf: $15M/yr

10B

SLIDE 54

2% perf: $600k/yr

10B

SLIDE 55

Horizontal scaling

10G docs / (10M docs / machine) = 1k machines
1k machines * 10 clusters = 10k machines
10k machines * 3 redundancy = 30k machines
30k machines * $1k/yr/machine = $30M / yr
Machine time vs. dev time

10B

SLIDE 56

Search Algorithms

SLIDE 57

What’s the problem again?

Algorithms

SLIDE 58

Posting list

Algorithms: posting list

SLIDE 59

See http://nlp.stanford.edu/IR-book/ for implementation details

SLIDE 60

HashMap[term] => list[docs]

Algorithms: posting list
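The HashMap-of-lists structure on this slide can be sketched in a few lines; `build_index` and `and_query` are illustrative names (see the Stanford IR book cited above for real implementations, which keep postings sorted and intersect them with merge-style walks).

```python
from collections import defaultdict

# Posting list: map each term to the sorted list of doc ids containing it.
def build_index(documents):
    index = defaultdict(list)
    for doc_id, document in enumerate(documents):
        for term in set(document.split()):
            index[term].append(doc_id)  # doc ids arrive in sorted order
    return index

# AND query: intersect the posting lists of all query terms.
def and_query(index, query_terms):
    result = set(index[query_terms[0]])
    for term in query_terms[1:]:
        result &= set(index[term])
    return sorted(result)

docs = ["large yellow dog", "small yellow cat", "large dog"]
idx = build_index(docs)
print(and_query(idx, ["yellow", "dog"]))  # → [0]
```

Unlike the naive scan, query cost now depends on the lengths of the posting lists touched, not on the total corpus size.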

SLIDE 61

Bloom filter

Algorithms: bloom filter

SLIDE 62

BitFunnel

Algorithms: bloom filter

SLIDE 63

What about an array?

Algorithms: bloom filter

SLIDE 64
SLIDE 65

How many terms?

Algorithms: bloom filter

SLIDE 66

Algorithms: bloom filter

SLIDE 67

One site has 37B primes

Algorithms: bloom filter

SLIDE 68

GUIDs, timestamps, DNA, etc.

Algorithms: bloom filter

SLIDE 69

Why index that stuff?

Algorithms: bloom filter

SLIDE 70

GTGACCTTGGGCAAGTTACTTA ACCTCTCTGTGCCTCAGTTTCCT CATCTGTAAAATGGGGATAATA

Algorithms: bloom filter

SLIDE 71
SLIDE 72

Most terms aren’t in most docs => use hashing

Algorithms: bloom filter

SLIDE 73

Bloom Filters

Algorithms: bloom filter

SLIDE 74
SLIDE 75
SLIDE 76
SLIDE 77
SLIDE 78

Probability of false positive?

Algorithms: bloom filter

SLIDE 79

(assume 10% bit density)
1 location: .1 = 10% false positive rate

Algorithms: bloom filter

SLIDE 80

(assume 10% bit density)
1 location: .1 = 10% false positive rate
2 locations: .1 * .1 = 1% false positive rate

Algorithms: bloom filter

SLIDE 81

(assume 10% bit density)
1 location: .1 = 10% false positive rate
2 locations: .1 * .1 = 1% false positive rate
3 locations: .1 * .1 * .1 = 0.1% false positive rate

Algorithms: bloom filter
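The multiplication on this slide generalizes directly: with bit density d and k probe locations, a non-member survives all k checks with probability d^k. A one-line check:

```python
# Bloom filter false positive rate: with bit density d, each of the k
# probed locations is set "by accident" with probability d, so a
# non-member passes all k checks with probability d**k.
def false_positive_rate(bit_density, num_locations):
    return bit_density ** num_locations

for k in (1, 2, 3):
    print(k, false_positive_rate(0.1, k))  # 10%, 1%, 0.1%
```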

SLIDE 82

Linear cost
Exponential benefit

Algorithms: bloom filter

SLIDE 83

Multiple Documents
Multiple Bloom Filters

Algorithms: bloom filter

SLIDE 84
SLIDE 85

Do comparisons in parallel!

Algorithms: bloom filter
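The parallel comparison can be sketched with a bit-sliced layout: row r holds bit r of every document's filter, so ANDing rows checks many documents per machine word. This is only an illustration of the idea, not BitFunnel's actual layout; `NUM_ROWS` and `hash_rows` are made up, and Python ints stand in for fixed-width words.

```python
# Bit-sliced Bloom filters: rows[r] holds bit r of every document's
# filter, so a single AND over words tests many documents at once.
NUM_ROWS = 64

def hash_rows(term, k=3):
    """Map a term to k row indices (stand-in for real hash functions)."""
    return [hash((term, i)) % NUM_ROWS for i in range(k)]

def build_rows(documents):
    rows = [0] * NUM_ROWS
    for doc_id, document in enumerate(documents):
        for term in document.split():
            for r in hash_rows(term):
                rows[r] |= 1 << doc_id  # set this doc's bit in row r
    return rows

def query(rows, query_terms):
    matches = ~0  # start with every document as a candidate
    for term in query_terms:
        for r in hash_rows(term):
            matches &= rows[r]  # one AND filters all documents at once
    return matches  # bitmap of (possibly false positive) matches

docs = ["large yellow dog", "small cat", "yellow dog barks"]
m = query(build_rows(docs), ["yellow", "dog"])
print([i for i in range(len(docs)) if m & (1 << i)])
```

There are no false negatives, so every true match is always in the output; occasional extra doc ids are the false positives the previous slides priced out.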

SLIDE 86
SLIDE 87

Algorithms: bloom filter

SLIDE 88

Algorithms: bloom filter

SLIDE 89

Algorithms: bloom filter

SLIDE 90

Algorithms: bloom filter

SLIDE 91

Algorithms: bloom filter

SLIDE 92
SLIDE 93
SLIDE 94

Algorithms: bloom filter

SLIDE 95

Algorithms: bloom filter

SLIDE 96

How do we estimate perf?

SLIDE 97

Cost model
Number of operations

Perf estimation

SLIDE 98

512-bit “blocks”

(pay for memory accesses)

Perf estimation

SLIDE 99

How many memory accesses per block?

Perf estimation

SLIDE 100

http://bitfunnel.org

Perf estimation

SLIDE 101

Perf estimation

SLIDE 102
SLIDE 103

Why do we have so many rows?

Term rewriting

Perf estimation

SLIDE 104

Term Rewriting

“Large yellow dog”

Perf estimation

SLIDE 105

Term Rewriting

“Large yellow dog” || “Golden Retriever”

Perf estimation

SLIDE 106

Term Rewriting

“Large yellow dog” || “Golden Retriever” || “Old Yeller” ||

Perf estimation

SLIDE 107
SLIDE 108

Expected performance?

Perf estimation

SLIDE 109

10 M docs / 512 bits per block = 20k “blocks”

Perf estimation

SLIDE 110

10 M docs / 512 bits per block = 20k “blocks”
20 k-blocks * 5 transfers per block = 100 kT

Perf estimation

SLIDE 111

10 M docs / 512 bits per block = 20k “blocks”
20 k-blocks * 5 transfers per block = 100 kT
25 GB/s / 512 bits per transfer = 390 MT/s

Perf estimation

SLIDE 112

10 M docs / 512 bits per block = 20k “blocks”
20 k-blocks * 5 transfers per block = 100 kT
25 GB/s / 512 bits per transfer = 390 MT/s
390 MT/s / 100 kT = 3900 QPS (with rounding)

Perf estimation
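The whole estimate is back-of-envelope arithmetic, so it can be replayed exactly. Without the slide's intermediate rounding (20k blocks, 390 MT/s) the chain comes out at 4000 QPS rather than 3900; the 5 transfers per block is the cost-model input from the earlier slide.

```python
# Back-of-envelope throughput for the bit-sliced index.
docs = 10e6               # 10M documents
bits_per_block = 512      # one "block" covers 512 documents' bits
transfers_per_block = 5   # memory accesses per block (cost-model input)
bandwidth = 25e9          # bytes/sec of memory bandwidth
bytes_per_transfer = 512 / 8

blocks = docs / bits_per_block                       # ~20k blocks
transfers_per_query = blocks * transfers_per_block   # ~100k transfers
transfers_per_sec = bandwidth / bytes_per_transfer   # ~390M transfers/sec
qps = transfers_per_sec / transfers_per_query
print(round(qps))  # → 4000 (the slide's rounded inputs give ~3900)
```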

SLIDE 113

Actual performance?

Perf estimation

SLIDE 114

Actual performance ~similar

Perf estimation

SLIDE 115

Small factors

Perf estimation

SLIDE 116

Large factors

Perf estimation

SLIDE 117

Ranking results

Perf estimation

SLIDE 118

Ingestion

(faster than querying)

Perf estimation

SLIDE 119

Ingestion is just setting bits

Perf estimation

SLIDE 120

Hierarchical bloom filters

Perf estimation

SLIDE 121

Complicating issues?

Perf estimation

SLIDE 122
SLIDE 123

Conclusions?

SLIDE 124

False conclusions

Search is simple

SLIDE 125

False conclusions

Search is simple
Bloom filters are better than posting lists

Zobel et al., Inverted files versus signature files for text indexing; TODS 1998

SLIDE 126

False conclusions

Search is simple
Bloom filters are better than posting lists
You can easily reason about all performance

Zobel et al., Inverted files versus signature files for text indexing; TODS 1998

SLIDE 127

Conclusions!

SLIDE 128

You can reason about perf

SLIDE 129

It’s often just arithmetic

SLIDE 130

Acknowledgements

Thanks to Leah Hanson, Mike Hopcroft, Julia Evans, Hari Angepat, David Turner, Danielle Sucher, Ikhwan Lee, Tejas Sapre, Raul Jara, Rich Ercolani, Bert Muthalaly, Harsha Nori, Jeshua Smith, Bill Barnes, Gary Bernhardt, Marek Majkowski, Tom Crayford, Gina Willard, Laura Lindzey, Larry Marbuger, Siddarth Anand, Eric Lemmon, Tom Ballinger, and [Anonymous Reviewer] for feedback.

SLIDE 131

bitfunnel.org/strangeloop github.com/bitfunnel/bitfunnel danluu.com

SLIDE 132

Unused slides

(thar be dragons)

SLIDE 133

SLIDE FOR HOMEWORK. TODO: USE DIFFERENT TEMPLATE

SLIDE 134

Why are posting lists standard?

SLIDE 135

Literature on alternatives

“Signatures files were proposed in [23] and shown to be inferior to inverted indexing in [24]. “

“Inverted indexes have been benchmarked as the most generalisable, and well performing structure (Zobel et al., 1998). The experiments in this thesis are therefore conducted solely on an inverted index system.”

“While this technique provides a relatively low computation overhead, studies by Zobel et al. [1998] have shown that inverted files significantly outperform signature files. We will now focus the analysis on inverted files as it is generally considered to be the most efficient indexing method for most IR systems.”

“The other two mechanisms are usually adopted in certain applications even if, recently, they have been mostly abandoned in favor of inverted indexes because some extensive experimental results [194] have shown that: Inverted indexes offer better performance than signature files and bitmaps, in terms of both size of index and speed of query handling [188]”

“Zobel et al. [16] compared inverted files and signature files with respect to query response time and space requirements. They found that the inverted files evaluated queries in less time than the signature files and needed less space. Their results showed that the signature files were much larger, more expensive to construct and update, their response time was unpredictable, they support ranked queries only with difficulty, they did not scale well and they were slow”

SLIDE 136

Zobel et al., actual quotes

“Inverted file indexes with in-memory search structures require no more disk accesses to answer a conjunctive query than do bitsliced signature files.”

“One of the difficulties in the comparison of inverted files and signature files is that many variants of signature file techniques have been proposed, and it is possible that some combination of parameters and variants will result in a better method.”

SLIDE 137

Citations are lossy

SLIDE 138
SLIDE 139
SLIDE 140
SLIDE 141
SLIDE 142
SLIDE 143
SLIDE 144
SLIDE 145
SLIDE 146
SLIDE 147
SLIDE 148
SLIDE 149

Search: why do we care?

SLIDE 150

$20M/yr * 2% savings = $400k/yr

SLIDE 151

How things fit together

SLIDE 152

TODO: add diagram

SLIDE 153

Posting list

SLIDE 154
SLIDE 155
SLIDE 156
SLIDE 157

How many terms?

SLIDE 158

TODO: pseudo-code
TODO: diagram about how bits drop out
TODO: search is a high dynamic range problem
TODO: higher rank rows
TODO: sharding by document length
TODO: diagram of how things fit together. Could just be concentric circles

SLIDE 159

Posting lists are standard

SLIDE 160

Posting list optimizations

Skip list
Delta compression
etc.
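One of these optimizations is simple enough to sketch: delta compression stores the gaps between sorted doc ids instead of the ids themselves, so the numbers stay small and compress well with a variable-length encoding (function names here are mine).

```python
# Delta compression of a posting list: [3, 7, 21, 22] -> gaps [3, 4, 14, 1].
# Small gaps take fewer bits under a varint-style encoding than raw ids.
def delta_encode(postings):
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps

def delta_decode(gaps):
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings

p = [3, 7, 21, 22]
g = delta_encode(p)
print(g)  # → [3, 4, 14, 1]
assert delta_decode(g) == p
```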

SLIDE 161

Search

SLIDE 162

Perf: how to think about it?

SLIDE 163

Performance

SLIDE 164

Search is BIG

SLIDE 165

Parsing / Tokenization

Harder than it sounds

SLIDE 166

Search is a big problem

Tokenization
Some languages mix alphabets, are partially left-to-right and right-to-left, etc.
Can’t drop non-alphanumeric characters (C# vs C++)
Multi-language queries
Ranking / Relevance
Distributed Systems
etc.