

SLIDE 1
SLIDE 2

230 Million

Tweets per day

SLIDE 3

2 Billion

Queries per day

SLIDE 4

< 10 s

Indexing latency

SLIDE 5

50 ms

  • Avg. query response time
SLIDE 6

Earlybird - Realtime Search @twitter

Michael Busch

@michibusch michael@twitter.com buschmi@apache.org

SLIDE 7

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • Top Tweets

Earlybird - Realtime Search @twitter

SLIDE 8

Introduction

SLIDE 9

Introduction

  • Twitter acquired Summize in 2008
  • 1st gen search engine based on MySQL
SLIDE 10

Introduction

  • Next gen search engine based on Lucene
  • Improves scalability and performance by orders of magnitude
  • Open Source
SLIDE 11

Realtime Search @twitter

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • Top Tweets
SLIDE 12

Search Architecture

SLIDE 13

Search Architecture

  • Ingester pre-processes Tweets for search
  • Geo-coding, URL expansion, tokenization, etc.

Ingester Tweets

SLIDE 14

Search Architecture

  • Tweets are serialized to MySQL in Thrift format

Thrift MySQL Master MySQL Slaves Ingester Tweets

SLIDE 15

Earlybird

  • Earlybird reads from MySQL slaves
  • Builds an in-memory inverted index in real time

Thrift MySQL Master MySQL Slaves Ingester Tweets Earlybird Index

SLIDE 16

Blender

Earlybird Index Blender Thrift Thrift

  • Blender is our Thrift service aggregator
  • Queries multiple Earlybirds, merges results
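The merge step can be sketched as follows. This is our illustration, not Twitter's code; it assumes each Earlybird returns its hits as docID arrays and that larger docIDs are newer:

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of Blender's merge step: gather per-Earlybird hit lists and keep
// the k newest docIDs overall, returned newest first.
public class Blender {
    static int[] merge(List<int[]> perEarlybirdHits, int k) {
        PriorityQueue<Integer> newest = new PriorityQueue<>(); // min-heap of kept docIDs
        for (int[] hits : perEarlybirdHits)
            for (int doc : hits) {
                newest.add(doc);
                if (newest.size() > k) newest.poll();          // drop the oldest
            }
        int[] out = new int[newest.size()];
        for (int i = out.length - 1; i >= 0; i--) out[i] = newest.poll();
        return out;                                            // newest first
    }

    public static void main(String[] args) {
        List<int[]> hits = Arrays.asList(new int[]{10, 40}, new int[]{25, 90});
        System.out.println(Arrays.toString(merge(hits, 3)));
        // [90, 40, 25]
    }
}
```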
SLIDE 17

Realtime Search @twitter

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • Top Tweets
SLIDE 18

Inverted Index 101

SLIDE 19

Inverted Index 101

1. The old night keeper keeps the keep in the town
2. In the big old house in the big old gown.
3. The house in the town had the big old keep
4. Where the old night keeper never did sleep.
5. The night keeper keeps the keep in the night
6. And keeps in the dark and sleeps in the light.

Table with 6 documents

Example from: Justin Zobel, Alistair Moffat, Inverted files for text search engines, ACM Computing Surveys (CSUR) v.38 n.2, p.6-es, 2006

SLIDE 20

Inverted Index 101

1. The old night keeper keeps the keep in the town
2. In the big old house in the big old gown.
3. The house in the town had the big old keep
4. Where the old night keeper never did sleep.
5. The night keeper keeps the keep in the night
6. And keeps in the dark and sleeps in the light.

term freq postings
and 1 <6>
big 2 <2> <3>
dark 1 <6>
did 1 <4>
gown 1 <2>
had 1 <3>
house 2 <2> <3>
in 5 <1> <2> <3> <5> <6>
keep 3 <1> <3> <5>
keeper 3 <1> <4> <5>
keeps 3 <1> <5> <6>
light 1 <6>
never 1 <4>
night 3 <1> <4> <5>
old 4 <1> <2> <3> <4>
sleep 1 <4>
sleeps 1 <6>
the 6 <1> <2> <3> <4> <5> <6>
town 2 <1> <3>
where 1 <4>

Table with 6 documents; dictionary and posting lists

SLIDE 21

Inverted Index 101

1. The old night keeper keeps the keep in the town
2. In the big old house in the big old gown.
3. The house in the town had the big old keep
4. Where the old night keeper never did sleep.
5. The night keeper keeps the keep in the night
6. And keeps in the dark and sleeps in the light.

term freq postings
and 1 <6>
big 2 <2> <3>
dark 1 <6>
did 1 <4>
gown 1 <2>
had 1 <3>
house 2 <2> <3>
in 5 <1> <2> <3> <5> <6>
keep 3 <1> <3> <5>
keeper 3 <1> <4> <5>
keeps 3 <1> <5> <6>
light 1 <6>
never 1 <4>
night 3 <1> <4> <5>
old 4 <1> <2> <3> <4>
sleep 1 <4>
sleeps 1 <6>
the 6 <1> <2> <3> <4> <5> <6>
town 2 <1> <3>
where 1 <4>

Table with 6 documents; dictionary and posting lists

Query: keeper

SLIDE 22

Inverted Index 101

1. The old night keeper keeps the keep in the town
2. In the big old house in the big old gown.
3. The house in the town had the big old keep
4. Where the old night keeper never did sleep.
5. The night keeper keeps the keep in the night
6. And keeps in the dark and sleeps in the light.

term freq postings
and 1 <6>
big 2 <2> <3>
dark 1 <6>
did 1 <4>
gown 1 <2>
had 1 <3>
house 2 <2> <3>
in 5 <1> <2> <3> <5> <6>
keep 3 <1> <3> <5>
keeper 3 <1> <4> <5>
keeps 3 <1> <5> <6>
light 1 <6>
never 1 <4>
night 3 <1> <4> <5>
old 4 <1> <2> <3> <4>
sleep 1 <4>
sleeps 1 <6>
the 6 <1> <2> <3> <4> <5> <6>
town 2 <1> <3>
where 1 <4>

Table with 6 documents; dictionary and posting lists

Query: keeper

SLIDE 23

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

SLIDE 24

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90
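The delta step can be sketched in Java. This is a minimal illustration of the idea, not Lucene's actual code:

```java
import java.util.Arrays;

public class DeltaDemo {
    // Delta-encode sorted doc IDs: store each ID as the gap from the previous one.
    static int[] toDeltas(int[] docIds) {
        int[] deltas = new int[docIds.length];
        int prev = 0;
        for (int i = 0; i < docIds.length; i++) {
            deltas[i] = docIds[i] - prev;
            prev = docIds[i];
        }
        return deltas;
    }

    public static void main(String[] args) {
        int[] docIds = {5, 15, 9000, 9002, 100000, 100090};
        System.out.println(Arrays.toString(toDeltas(docIds)));
        // [5, 10, 8985, 2, 90998, 90]
    }
}
```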

SLIDE 25

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

00000101

VInt compression: Values 0 <= delta <= 127 need one byte
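A minimal sketch of VInt encoding: 7 data bits per byte, with a flag bit indicating whether more bytes follow. This writes the low-order byte first, as Lucene's DataOutput.writeVInt does; the slides draw the bytes most-significant-first, but the principle is the same:

```java
import java.io.ByteArrayOutputStream;

public class VIntDemo {
    static byte[] writeVInt(int value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {        // more than 7 bits remain
            out.write((value & 0x7F) | 0x80); // low 7 bits + continuation flag
            value >>>= 7;
        }
        out.write(value);                     // final byte, flag clear
        return out.toByteArray();
    }

    static int readVInt(byte[] bytes) {
        int value = 0, shift = 0;
        for (byte b : bytes) {
            value |= (b & 0x7F) << shift;     // stitch 7-bit groups together
            shift += 7;
            if ((b & 0x80) == 0) break;       // continuation flag clear: done
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(writeVInt(5).length);    // 1 byte
        System.out.println(writeVInt(8985).length); // 2 bytes
        System.out.println(readVInt(writeVInt(8985)));
        // 8985
    }
}
```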
SLIDE 26

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

11000110

VInt compression: Values 128 <= delta <= 16383 need two bytes

00011001

SLIDE 27

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

11000110

VInt compression: First bit indicates whether next byte belongs to the same value

00011001

SLIDE 28

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

11000110

VInt compression:

00011001

  • Variable number of bytes: a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically

SLIDE 29

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90
Read direction

  • Each posting depends on the previous one; decoding is only possible in old-to-new direction

  • With recency ranking (new-to-old) no early termination is possible
SLIDE 30

Posting list encoding

  • By default Lucene uses a combination of delta encoding and VInt compression

  • VInts are expensive to decode
  • Problem 1: How to traverse posting lists backwards?
  • Problem 2: How to write a posting atomically?
SLIDE 31

Posting list encoding in Earlybird

int (32 bits) = docID (24 bits) + textPosition (8 bits)

  • docID: max. 16.7M
  • textPosition: max. 255 (Tweet text can only have 140 chars)
  • Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest 5x improvement compared to vanilla Lucene with FSDirectory)
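The 24+8-bit packing can be sketched as below. The exact bit layout (docID in the high 24 bits, text position in the low byte) is our assumption for illustration; the slide only specifies the widths:

```java
public class PostingDemo {
    // Earlybird-style posting: one int = 24-bit docID + 8-bit text position.
    static int encode(int docId, int textPosition) {
        return (docId << 8) | (textPosition & 0xFF);
    }
    static int docId(int posting)        { return posting >>> 8; }
    static int textPosition(int posting) { return posting & 0xFF; }

    public static void main(String[] args) {
        int p = encode(9000, 37);
        System.out.println(docId(p));        // 9000
        System.out.println(textPosition(p)); // 37
    }
}
```

Decoding is two bit operations per posting, which is why it is so much cheaper than unwinding VInt continuation bytes.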

SLIDE 32

Posting list encoding in Earlybird

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Earlybird encoding (absolute docIDs, read direction: newest to oldest): 5, 15, 9000, 9002, 100000, 100090

SLIDE 33

Early query termination

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Earlybird encoding (absolute docIDs, read direction: newest to oldest): 5, 15, 9000, 9002, 100000, 100090
E.g. if 3 results are requested, we can terminate after reading 3 postings
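Because the postings hold absolute docIDs, a searcher can start at the end of the list and stop as soon as it has enough hits. A minimal sketch of that early termination:

```java
import java.util.Arrays;

public class EarlyTermination {
    // Postings stored oldest-to-newest; read backwards and stop after k hits.
    static int[] newestK(int[] postings, int k) {
        int n = Math.min(k, postings.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = postings[postings.length - 1 - i]; // walk from the end
        }
        return result;
    }

    public static void main(String[] args) {
        int[] postings = {5, 15, 9000, 9002, 100000, 100090};
        System.out.println(Arrays.toString(newestK(postings, 3)));
        // [100090, 100000, 9002]
    }
}
```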

SLIDE 34

Posting list encoding - Summary

  • ints can be written atomically in Java
  • Backwards traversal easy on absolute docIDs (not deltas)
  • Every posting is a possible entry point for a searcher
  • Skipping can be done without additional data structures via binary search, though better approaches should be explored

  • On tweet indexes we need about 30% more storage for docIDs compared to delta+VInts; compensated by compression of complete segments

  • Max. segment size: 2^24 = 16.7M tweets
SLIDE 35

Realtime Search @twitter

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • Top Tweets
SLIDE 36

Memory Model & Concurrency

SLIDE 37

Inverted index components

Dictionary Posting list storage

?

SLIDE 38

Inverted index components

Dictionary Posting list storage

?

SLIDE 39

Inverted Index

1. The old night keeper keeps the keep in the town
2. In the big old house in the big old gown.
3. The house in the town had the big old keep
4. Where the old night keeper never did sleep.
5. The night keeper keeps the keep in the night
6. And keeps in the dark and sleeps in the light.

term freq postings
and 1 <6>
big 2 <2> <3>
dark 1 <6>
did 1 <4>
gown 1 <2>
had 1 <3>
house 2 <2> <3>
in 5 <1> <2> <3> <5> <6>
keep 3 <1> <3> <5>
keeper 3 <1> <4> <5>
keeps 3 <1> <5> <6>
light 1 <6>
never 1 <4>
night 3 <1> <4> <5>
old 4 <1> <2> <3> <4>
sleep 1 <4>
sleeps 1 <6>
the 6 <1> <2> <3> <4> <5> <6>
town 2 <1> <3>
where 1 <4>

Table with 6 documents; dictionary and posting lists. Per term we store different kinds of metadata: text pointer, frequency, postings pointer, etc.

SLIDE 40

Term dictionary

Parallel arrays indexed by termID: int[] textPointer, int[] frequency, int[] postingsPointer, plus a shared term text pool

SLIDE 41

Diagram: first term ("cat") added to the text pool; parallel-array entries t0 (text pointer), f0 (frequency), p0 (postings pointer) created

Term dictionary

SLIDE 42

Diagram: second term ("foo") appended to the text pool; entries t1, f1, p1 created

Term dictionary

SLIDE 43

Diagram: third term ("bar") appended to the text pool; entries t2, f2, p2 created

Term dictionary

SLIDE 44

Term dictionary

  • Number of objects << number of terms
  • O(1) lookups
  • Easy to store more term metadata by adding additional parallel arrays
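A sketch of such a parallel-array dictionary. The class and method names are ours, the text pool and growth policy are simplified; the point is that per-term metadata lives in int[] columns, so the object count stays tiny regardless of term count:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class TermDictionary {
    private final Map<String, Integer> termIds = new HashMap<>(); // O(1) lookup
    int[] frequency = new int[16];        // parallel arrays, indexed by termID
    int[] postingsPointer = new int[16];  // most recently indexed posting

    int lookupOrAdd(String term) {
        Integer id = termIds.get(term);
        if (id != null) return id;
        int newId = termIds.size();
        if (newId == frequency.length) {  // grow all columns together
            frequency = Arrays.copyOf(frequency, newId * 2);
            postingsPointer = Arrays.copyOf(postingsPointer, newId * 2);
        }
        termIds.put(term, newId);
        return newId;
    }

    void onPosting(int termId, int posting) {
        frequency[termId]++;              // bookkeeping per indexed posting
        postingsPointer[termId] = posting;
    }

    public static void main(String[] args) {
        TermDictionary dict = new TermDictionary();
        int id = dict.lookupOrAdd("keeper");
        dict.onPosting(id, 42);
        dict.onPosting(id, 77);
        System.out.println(id + " " + dict.frequency[id] + " " + dict.postingsPointer[id]);
        // 0 2 77
    }
}
```

Adding another kind of metadata is just another int[] column of the same length.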
SLIDE 45

Inverted index components

Parallel arrays dictionary: pointer to the most recently indexed posting for each term. Posting list storage: ?

SLIDE 46

Inverted index components

Parallel arrays dictionary: pointer to the most recently indexed posting for each term. Posting list storage: ?

SLIDE 47
Posting lists storage - Objectives

  • Store many singly-linked lists of different lengths space-efficiently
  • The number of Java objects should be independent of the number of lists or the number of items in the lists
  • Every item should be a possible entry point into the lists for iterators, i.e. items should not be dependent on other items (e.g. no delta encoding)
  • Append and read possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads)
  • Traversal in backwards order

SLIDE 48

Memory management

One block = 32K int[]; 4 int[] pools

SLIDE 49

Memory management

One block = 32K int[]; 4 int[] pools. Each pool can be grown individually by adding 32K blocks

SLIDE 50

Memory management

  • For simplicity we can forget about the blocks for now and think of the pools as continuous, unbounded int[] arrays
  • Small total number of Java objects (each 32K block is one object)

4 int[] pools

SLIDE 51

Memory management

  • Slices can be allocated in each pool
  • Each pool has a different, but fixed slice size

Slice sizes per pool: 2^1, 2^4, 2^7, 2^11
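One pool can be sketched as below: a list of 32K int[] blocks treated as one unbounded array, with fixed-size slices handed out sequentially. Names and the sequential allocator are our simplification:

```java
import java.util.ArrayList;
import java.util.List;

public class IntPool {
    static final int BLOCK_SIZE = 1 << 15;  // 32K ints per block
    final List<int[]> blocks = new ArrayList<>();
    final int sliceSize;                    // fixed per pool, e.g. 2^4
    int nextFree = 0;                       // next free global int offset

    IntPool(int sliceSize) { this.sliceSize = sliceSize; }

    // Hand out the next slice; grow the pool by one 32K block when needed.
    int allocateSlice() {
        if (nextFree + sliceSize > blocks.size() * BLOCK_SIZE) {
            blocks.add(new int[BLOCK_SIZE]);
        }
        int offset = nextFree;
        nextFree += sliceSize;
        return offset;                      // global offset of the new slice
    }

    public static void main(String[] args) {
        IntPool pool = new IntPool(1 << 4); // slice size 2^4 = 16 ints
        System.out.println(pool.allocateSlice()); // 0
        System.out.println(pool.allocateSlice()); // 16
        System.out.println(pool.blocks.size());   // 1
    }
}
```

Because every slice size is a power of two that divides 32K, slices never straddle block boundaries.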

SLIDE 52

Adding and appending to a list

Slice sizes: 2^1, 2^4, 2^7, 2^11 (legend: available, allocated, current list)

SLIDE 53

Adding and appending to a list

Slice sizes: 2^1, 2^4, 2^7, 2^11. Store first two postings in this slice (legend: available, allocated, current list)

SLIDE 54

Adding and appending to a list

Slice sizes: 2^1, 2^4, 2^7, 2^11. When the first slice is full, allocate another one in the second pool (legend: available, allocated, current list)

SLIDE 55

Adding and appending to a list

Slice sizes: 2^1, 2^4, 2^7, 2^11. Allocate a slice on each level as the list grows (legend: available, allocated, current list)

SLIDE 56

Adding and appending to a list

Slice sizes: 2^1, 2^4, 2^7, 2^11. On the uppermost level one list can own multiple slices (legend: available, allocated, current list)

SLIDE 57

Posting list format

int (32 bits) = docID (24 bits) + textPosition (8 bits)

  • docID: max. 16.7M
  • textPosition: max. 255 (Tweet text can only have 140 chars)
  • Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest 5x improvement compared to vanilla Lucene with FSDirectory)

SLIDE 58

Addressing items

  • Use 32-bit (int) pointers to address any item in any list unambiguously:

int (32 bits) = poolIndex (2 bits, values 0-3) + sliceIndex (19-29 bits, depends on pool) + offset in slice (1-11 bits, depends on pool)

  • Nice symmetry: postings and address pointers both fit into a 32-bit int
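This pointer packing can be sketched as below. The field order (pool in the top 2 bits, offset in the low bits) is our assumption; it reproduces the slide's widths, since with slice sizes 2^1 to 2^11 the offset takes 1-11 bits and the slice index gets the remaining 19-29 bits:

```java
public class SlicePointer {
    static final int[] OFFSET_BITS = {1, 4, 7, 11}; // pools with slice sizes 2^1..2^11

    static int encode(int pool, int slice, int offset) {
        return (pool << 30) | (slice << OFFSET_BITS[pool]) | offset;
    }
    static int pool(int ptr)   { return ptr >>> 30; }
    static int offset(int ptr) { return ptr & ((1 << OFFSET_BITS[pool(ptr)]) - 1); }
    static int slice(int ptr)  { return (ptr & 0x3FFFFFFF) >>> OFFSET_BITS[pool(ptr)]; }

    public static void main(String[] args) {
        int ptr = encode(2, 5, 100); // pool 2: slice size 2^7, offsets 0..127
        System.out.println(pool(ptr) + " " + slice(ptr) + " " + offset(ptr));
        // 2 5 100
    }
}
```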
SLIDE 59

Linking the slices

Slice sizes: 2^1, 2^4, 2^7, 2^11 (legend: available, allocated, current list)

SLIDE 60

Linking the slices

Slice sizes: 2^1, 2^4, 2^7, 2^11 (legend: available, allocated, current list). Parallel arrays dictionary: pointer to the last posting indexed for a term

SLIDE 61

Concurrency - Definitions

  • Pessimistic locking
    • A thread holds an exclusive lock on a resource while an action is performed [mutual exclusion]
    • Usually used when conflicts are expected to be likely
  • Optimistic locking
    • Operations are attempted atomically without holding a lock; conflicts can be detected, and retry logic is often used in case of conflicts
    • Usually used when conflicts are expected to be the exception
SLIDE 62

Concurrency - Definitions

  • Non-blocking algorithm

Ensures that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion.

  • Lock-free algorithm

A non-blocking algorithm is lock-free if there is guaranteed system-wide progress.

  • Wait-free algorithm

A non-blocking algorithm is wait-free if there is guaranteed per-thread progress.

* Source: Wikipedia

SLIDE 63

Concurrency

  • Having a single writer thread simplifies our problem: no locks have to be used to protect data structures from corruption (only one thread modifies data)
  • But: we have to make sure that all readers always see a consistent state of all data structures -> this is much harder than it sounds!
  • In Java, it is not guaranteed that one thread will see changes that another thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication
  • Safe publication can be achieved in different, subtle ways. Read the great book "Java Concurrency in Practice" by Brian Goetz for more information!

SLIDE 64

Java Memory Model

  • Program order rule

Each action in a thread happens-before every action in that thread that comes later in the program order.

  • Volatile variable rule

A write to a volatile field happens-before every subsequent read of that same field.

  • Transitivity

If A happens-before B, and B happens-before C, then A happens-before C.

* Source: Brian Goetz: Java Concurrency in Practice

SLIDE 65

Concurrency

RAM int x; Cache Thread 1 Thread 2 time

SLIDE 66

Concurrency

RAM int x; Cache 5 Thread 1 Thread 2

x = 5;

Thread 1 writes x=5 to cache

SLIDE 67

Concurrency

RAM int x; Cache 5 Thread 1 Thread 2

x = 5; while(x != 5);

time This condition will likely never become false!

SLIDE 68

Concurrency

RAM int x; Cache Thread 1 Thread 2 time

SLIDE 69

Concurrency

RAM int x; 1 Cache Thread 1 Thread 2 time volatile int b;

x = 5;

5 Thread 1 writes b=1 to RAM, because b is volatile

b = 1;

SLIDE 70

Concurrency

RAM int x; 1 Cache Thread 1 Thread 2 time volatile int b;

x = 5;

5 Read volatile b

b = 1; int dummy = b; while(x != 5);

SLIDE 71

Concurrency

RAM int x; 1 Cache Thread 1 Thread 2 time volatile int b;

x = 5;

5

b = 1; int dummy = b; while(x != 5);

  • Program order rule: Each action in a thread happens-before every action in

that thread that comes later in the program order.

happens-before

SLIDE 72

Concurrency

RAM int x; 1 Cache Thread 1 Thread 2 time volatile int b;

x = 5;

5

b = 1; int dummy = b; while(x != 5);

happens-before

  • Volatile variable rule: A write to a volatile field happens-before every

subsequent read of that same field.

SLIDE 73

Concurrency

RAM int x; 1 Cache Thread 1 Thread 2 time volatile int b;

x = 5;

5

b = 1; int dummy = b; while(x != 5);

happens-before

  • Transitivity: If A happens-before B, and B happens-before C, then A

happens-before C.

SLIDE 74

Concurrency

RAM int x; 1 Cache Thread 1 Thread 2 time volatile int b;

x = 5;

5

b = 1; int dummy = b; while(x != 5);

This condition will become false, i.e. Thread 2 will see x==5

  • Note: x itself doesn’t have to be volatile. There can be many variables like x,

but we need only a single volatile field.

SLIDE 75

Concurrency

RAM int x; 1 Cache Thread 1 Thread 2 time volatile int b;

x = 5;

5

b = 1; int dummy = b; while(x != 5);

Memory barrier

  • Note: x itself doesn’t have to be volatile. There can be many variables like x,

but we need only a single volatile field.

SLIDE 76

Demo
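The pattern from the preceding slides can be demonstrated with a small runnable sketch (our code, not the talk's demo): plain writes made before a volatile write become visible to a thread that first observes the volatile write, so x itself needs no synchronization:

```java
public class Publication {
    static int x;              // plain field: visibility piggybacks on b
    static volatile int b;     // the single volatile field / memory barrier

    static int publishAndRead() throws InterruptedException {
        x = 0; b = 0;
        Thread writer = new Thread(() -> { x = 5; b = 1; });
        final int[] seen = new int[1];
        Thread reader = new Thread(() -> {
            while (b != 1) { } // spin on the volatile read until published
            seen[0] = x;       // happens-before chain guarantees x == 5 here
        });
        reader.start();
        writer.start();
        writer.join();
        reader.join();
        return seen[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(publishAndRead()); // 5
    }
}
```

Without the volatile field, the reader's loop might spin forever, exactly as the earlier slides warned.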

SLIDE 77

Concurrency

RAM int x; 1 Cache Thread 1 Thread 2 time volatile int b;

x = 5;

5

b = 1; int dummy = b; while(x != 5);

Memory barrier

  • Note: x itself doesn’t have to be volatile. There can be many variables like x,

but we need only a single volatile field.

SLIDE 78

Concurrency

IndexWriter IndexReader time

write 100 docs maxDoc = 100 in IR.open(): read maxDoc search up to maxDoc

maxDoc is volatile

write more docs

SLIDE 79

Concurrency

IndexWriter IndexReader time

write 100 docs maxDoc = 100 in IR.open(): read maxDoc search up to maxDoc

maxDoc is volatile

write more docs

happens-before

  • Only maxDoc is volatile. All other fields that IW writes to and IR reads from don't need to be!

SLIDE 80
Wait-free

  • Not a single exclusive lock
  • Writer thread can always make progress
  • Optimistic locking (retry logic) in a few places for the searcher thread
  • Retry logic very simple and guaranteed to always make progress

SLIDE 81

Realtime Search @twitter

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • Top Tweets
SLIDE 82

Top Tweets

SLIDE 83

Signals

  • Query-dependent
  • E.g. Lucene text score, language
  • Query-independent
  • Static signals (e.g. text quality)
  • Dynamic signals (e.g. retweets)
  • Timeliness
SLIDE 84

Signals

Many Earlybird segments (8M documents each)

Task: Find the best tweet in billions of tweets efficiently
SLIDE 85

Signals

Many Earlybird segments (8M documents each)

  • For top tweets we can’t early terminate as efficiently
  • Scoring and ranking billions of tweets is impractical
SLIDE 86

Query cache

Many Earlybird segments (8M documents each)

Idea: Mark all tweets with high query-independent scores and only visit those for top-tweets queries. (Diagram: query results cache, high-static-score docs)

SLIDE 87

Query cache

Many Earlybird segments (8M documents each)

  • A background thread periodically wakes up, executes queries, and stores the results in the per-segment cache

  • Rewriting queries
  • User query: q = ‘lucene’
  • Rewritten: q’ = ‘lucene AND cached_filter:toptweets’
  • Efficient skipping over tweets with low query-independent scores

This clause will be executed as a Lucene ConstantScoreQuery that wraps a BitSet or SortedVIntList
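The effect of that cached clause can be sketched without Lucene. This is our illustration of cache-only filtering, not Twitter's code: intersect the query's hits with the precomputed high-static-score set, up to the docID the cache covers:

```java
import java.util.Arrays;
import java.util.BitSet;

public class QueryCacheDemo {
    // Keep only hits covered by the cache (docID <= cacheMaxDocId)
    // and flagged as top tweets in the precomputed BitSet.
    static int[] filter(int[] queryHits, BitSet topTweets, int cacheMaxDocId) {
        return Arrays.stream(queryHits)
                     .filter(d -> d <= cacheMaxDocId && topTweets.get(d))
                     .toArray();
    }

    public static void main(String[] args) {
        BitSet top = new BitSet();
        top.set(3); top.set(7); top.set(12);
        int[] hits = {1, 3, 7, 9, 12, 15};
        System.out.println(Arrays.toString(filter(hits, top, 10)));
        // [3, 7]  (doc 12 is a top tweet but lies beyond the cache's maxDocID)
    }
}
```

In hybrid mode, documents above cacheMaxDocId would instead be checked against the original query, as the later slides describe.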

SLIDE 88

Query cache

Many Earlybird segments (8M documents each)

  • A background thread periodically wakes up, executes queries, and stores the results in the per-segment cache

  • Configurable per cached query:
  • Result set type: BitSet, SortedVIntList
  • Execution schedule
  • Filter mode: cache-only or hybrid
SLIDE 89

Query cache

Many Earlybird segments (8M documents each)

  • Result set type: BitSet, SortedVIntList
  • BitSet for results with many hits
  • SortedVIntList for very sparse results
SLIDE 90

Query cache

Many Earlybird segments (8M documents each)

  • Execution schedule
  • Per segment: Sleep time between refreshing the cached results
SLIDE 91

Query cache

Many Earlybird segments (8M documents each)

  • Filter mode: cache-only or hybrid
SLIDE 92

Query cache

  • Filter mode: cache-only or hybrid

Read direction

SLIDE 93

Query cache

  • Filter mode: cache-only or hybrid

Read direction Partially filled segment (IndexWriter is currently appending)

SLIDE 94

Query cache

  • Filter mode: cache-only or hybrid

Read direction Query cache can only be computed until current maxDocID

SLIDE 95

Query cache

  • Filter mode: cache-only or hybrid

Read direction Some time later: More docs in segment, but query cache was not updated yet

SLIDE 96

Query cache

  • Filter mode: cache-only or hybrid

Read direction Cache-only mode: Ignore documents added to segment since cache was updated Search range

SLIDE 97

Query cache

  • Filter mode: cache-only or hybrid

Read direction. Hybrid mode: fall back to the original query and execute it on the latest documents. Search range

SLIDE 98

Query cache

  • Filter mode: cache-only or hybrid

Read direction Use query cache up to the cache’s maxDocID Search range

SLIDE 99

Query result cache

Many Earlybird segments (8M documents each); query cache configured in a YAML file

SLIDE 100

Questions?