

SLIDE 1

SLIDE 2

340 Million

Tweets per day

SLIDE 3

2.3 Billion

Queries per day

SLIDE 4

< 10 s

Indexing latency

SLIDE 5

50 ms

Avg. query response time
SLIDE 6

Earlybird - Realtime Search @twitter

Michael Busch

@michibusch michael@twitter.com buschmi@apache.org

SLIDE 7

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • What’s next?

Earlybird - Realtime Search @twitter

SLIDE 8

Introduction

SLIDE 9

Introduction

  • Twitter acquired Summize in 2008
  • 1st gen search engine based on MySQL
SLIDE 10

Introduction

  • Next gen search engine based on Lucene
  • Improves scalability and performance by orders of magnitude
  • Open Source
SLIDE 11

Realtime Search @twitter

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • What’s next?
SLIDE 12

Search Architecture

SLIDE 13

Search Architecture

  • Ingester pre-processes Tweets for search
  • Geo-coding, URL expansion, tokenization, etc.

[Diagram: Tweets → Ingester]

SLIDE 14

Search Architecture

  • Tweets are serialized to MySQL in Thrift format

[Diagram: Tweets → Ingester → Thrift → MySQL master → MySQL slaves]

SLIDE 15

Earlybird

  • Earlybird reads from MySQL slaves
  • Builds an in-memory inverted index in real time

[Diagram: Tweets → Ingester → Thrift → MySQL master → MySQL slaves → Earlybird → in-memory index]

SLIDE 16

Blender

[Diagram: Blender speaks Thrift on both sides: it receives queries via Thrift and queries the Earlybird indexes via Thrift]

  • Blender is our Thrift service aggregator
  • Queries multiple Earlybirds, merges results
SLIDE 17

Cluster layout

Earlybird
SLIDE 18

Cluster layout

[Diagram: one Earlybird partition with several replicas]

SLIDE 19

Cluster layout

[Diagram: a row of Earlybirds: n hash partitions (docId % n), each with replicas]

SLIDE 20

Cluster layout

[Diagram: a grid of Earlybirds: timeslices × n hash partitions (docId % n) × replicas]

SLIDE 21

Cluster layout

[Diagram: the same grid: the most recent timeslice is writable, older timeslices are complete (read-only)]
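
The routing rule on these slides is simple enough to sketch. A minimal illustration (hypothetical names, not Twitter's actual code) of how a docId maps to one of the n hash partitions, and why a query has to fan out to all of them:

    // Illustrative sketch of the cluster layout above.
    final class ClusterLayout {
        private final int numPartitions;             // n hash partitions

        ClusterLayout(int numPartitions) { this.numPartitions = numPartitions; }

        // Index path: each tweet lands in exactly one partition.
        int partitionFor(long docId) {
            return (int) (docId % numPartitions);
        }

        // Query path: any partition may hold matches, so a query is sent to
        // one replica of every partition and the results are merged.
        int[] partitionsToQuery() {
            int[] all = new int[numPartitions];
            for (int i = 0; i < numPartitions; i++) all[i] = i;
            return all;
        }
    }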

SLIDE 22

Realtime Search @twitter

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • What’s next?
SLIDE 23

Inverted Index 101

SLIDE 24

Inverted Index 101

1. The old night keeper keeps the keep in the town
2. In the big old house in the big old gown.
3. The house in the town had the big old keep
4. Where the old night keeper never did sleep.
5. The night keeper keeps the keep in the night
6. And keeps in the dark and sleeps in the light.

Table with 6 documents

Example from: Justin Zobel and Alistair Moffat, "Inverted files for text search engines," ACM Computing Surveys (CSUR), vol. 38, no. 2, 2006.

SLIDE 25

Inverted Index 101

1. The old night keeper keeps the keep in the town
2. In the big old house in the big old gown.
3. The house in the town had the big old keep
4. Where the old night keeper never did sleep.
5. The night keeper keeps the keep in the night
6. And keeps in the dark and sleeps in the light.

term    freq  postings
and     1     <6>
big     2     <2> <3>
dark    1     <6>
did     1     <4>
gown    1     <2>
had     1     <3>
house   2     <2> <3>
in      5     <1> <2> <3> <5> <6>
keep    3     <1> <3> <5>
keeper  3     <1> <4> <5>
keeps   3     <1> <5> <6>
light   1     <6>
never   1     <4>
night   3     <1> <4> <5>
old     4     <1> <2> <3> <4>
sleep   1     <4>
sleeps  1     <6>
the     6     <1> <2> <3> <4> <5> <6>
town    2     <1> <3>
where   1     <4>

Table with 6 documents; dictionary and posting lists

SLIDE 26

Inverted Index 101

(Same six documents and dictionary as on the previous slide.)

Query: keeper

SLIDE 27

Inverted Index 101

(Same six documents and dictionary as on the previous slide.)

Query: keeper

SLIDE 28

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090

SLIDE 29

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

SLIDE 30

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

VInt compression: values 0 <= delta <= 127 need one byte, e.g. 00000101
SLIDE 31

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

VInt compression: values 128 <= delta <= 16383 need two bytes, e.g. 11000110 00011001

SLIDE 32

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

VInt compression: the first bit of each byte indicates whether the next byte belongs to the same value (e.g. 11000110 00011001)

SLIDE 33

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

VInt compression: 11000110 00011001

  • Variable number of bytes: a VInt-encoded posting cannot be written as a primitive Java type, and therefore cannot be written atomically
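
For readers unfamiliar with VInts, a minimal sketch of the delta + VInt scheme described here. This is not Lucene's actual implementation, and note that Lucene writes the low-order 7 bits first, whereas the slides draw the high-order byte first:

    import java.io.ByteArrayOutputStream;

    // Sketch of delta + VInt encoding: the low 7 bits of each byte carry
    // payload; the high bit says "another byte of this value follows".
    final class VIntSketch {
        static void writeVInt(ByteArrayOutputStream out, int value) {
            while ((value & ~0x7F) != 0) {            // more than 7 bits left
                out.write((value & 0x7F) | 0x80);     // set continuation bit
                value >>>= 7;
            }
            out.write(value);                         // last byte, high bit clear
        }

        static byte[] encodeDocIds(int[] docIds) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int previous = 0;
            for (int docId : docIds) {
                writeVInt(out, docId - previous);     // delta, not absolute ID
                previous = docId;
            }
            return out.toByteArray();   // variable length: not atomically writable
        }
    }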

SLIDE 34

Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90 (read direction: old-to-new)

  • Each posting depends on the previous one; decoding is only possible in old-to-new direction
  • With recency ranking (new-to-old), no early termination is possible
SLIDE 35

Posting list encoding

  • By default, Lucene uses a combination of delta encoding and VInt compression
  • VInts are expensive to decode
  • Problem 1: How to traverse posting lists backwards?
  • Problem 2: How to write a posting atomically?
SLIDE 36

Posting list encoding in Earlybird

int (32 bits) = docID (24 bits) + textPosition (8 bits)

  • docID: max. 16.7M
  • textPosition: max. 255 (tweet text can only have 140 chars)
  • Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest a 5x improvement compared to vanilla Lucene with FSDirectory)
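
Packing and unpacking such a posting is plain bit arithmetic. A sketch (illustrative names, not Earlybird's actual code):

    // One posting = one int: 24-bit docID in the high bits, 8-bit text
    // position in the low bits. An int can be written atomically in Java.
    final class PostingSketch {
        static int encode(int docId, int textPosition) {
            assert docId >= 0 && docId < (1 << 24);    // max. 16.7M per segment
            assert textPosition >= 0 && textPosition <= 255;
            return (docId << 8) | textPosition;
        }

        static int docId(int posting)        { return posting >>> 8; }
        static int textPosition(int posting) { return posting & 0xFF; }
    }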

SLIDE 37

Posting list encoding in Earlybird

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Earlybird encoding (absolute docIDs): 5, 15, 9000, 9002, 100000, 100090 (read direction: new-to-old)

SLIDE 38

Early query termination

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Earlybird encoding (absolute docIDs): 5, 15, 9000, 9002, 100000, 100090 (read direction: new-to-old)

E.g. if 3 results are requested, we can terminate after reading 3 postings.
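
Because every posting holds an absolute docID, a searcher can simply walk the array from the end. A sketch of this early termination (illustrative, not Earlybird's actual code), reusing the encoding from the previous sketch:

    // Collect the docIDs of the newest n postings from a list of `count`
    // encoded postings (newest last), reading only n postings.
    static int[] newestDocIds(int[] postings, int count, int n) {
        int[] results = new int[Math.min(n, count)];
        for (int i = 0; i < results.length; i++) {
            results[i] = postings[count - 1 - i] >>> 8;   // 24-bit docID
        }
        return results;   // e.g. n = 3: only 3 postings were read
    }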

SLIDE 39

Posting list encoding - Summary

  • ints can be written atomically in Java
  • Backwards traversal is easy on absolute docIDs (not deltas)
  • Every posting is a possible entry point for a searcher
  • Skipping can be done without additional data structures via binary search, even though there are better approaches which should be explored
  • On tweet indexes we need about 30% more storage for docIDs compared to delta+VInts; this is compensated by compression of complete segments
  • Max. segment size: 2^24 = 16.7M tweets
SLIDE 40

Realtime Search @twitter

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • What’s next?
SLIDE 41

Memory Model & Concurrency

SLIDE 42

Inverted index components

[Diagram: Dictionary and posting list storage, with a still-unknown (?) link between them]

SLIDE 43

Inverted index components

[Diagram: Dictionary and posting list storage, with a still-unknown (?) link between them]

SLIDE 44

Inverted Index

(Same six documents and dictionary as in the Inverted Index 101 section.)

Per term we store different kinds of metadata: text pointer, frequency, postings pointer, etc.

SLIDE 45

Term dictionary

[Diagram: parallel arrays indexed by termID (int[] textPointer, int[] frequency, int[] postingsPointer) plus a term text pool]

SLIDE 46

Term dictionary

[Diagram: first term "cat" added: textPointer t0, frequency f0 and postingsPointer p0 stored in the parallel arrays; the term text pool now holds "cat"]

SLIDE 47

Term dictionary

[Diagram: second term "foo" added: t1, f1, p1 appended to the parallel arrays; the term text pool now holds "cat foo"]

SLIDE 48

Term dictionary

[Diagram: third term "bar" added: t2, f2, p2 appended to the parallel arrays; the term text pool now holds "cat foo bar"]

SLIDE 49

Term dictionary

  • Number of objects << number of terms
  • O(1) lookups
  • Easy to store more term metadata by adding additional parallel arrays
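
A sketch of this parallel-array dictionary (the field names follow the slides; the hash lookup from term text to termID is omitted for brevity):

    // One slot per termID in each array; no per-term objects are allocated.
    final class TermDictionarySketch {
        int[] textPointer;       // termID -> offset of the term in the text pool
        int[] frequency;         // termID -> term frequency
        int[] postingsPointer;   // termID -> most recently indexed posting
        char[] termTextPool;     // all term texts, concatenated

        // O(1) metadata access once the termID is known:
        int postingsFor(int termID) { return postingsPointer[termID]; }

        // More metadata per term = one more parallel array, no new objects.
    }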
SLIDE 50

Inverted index components

[Diagram: Dictionary (parallel arrays; pointer to the most recently indexed posting for a term), a still-unknown (?) link, and posting list storage]

SLIDE 51

Inverted index components

[Diagram: Dictionary (parallel arrays; pointer to the most recently indexed posting for a term), a still-unknown (?) link, and posting list storage]

SLIDE 52

Posting lists storage - Objectives

  • Store many singly-linked lists of different lengths space-efficiently
  • The number of Java objects should be independent of the number of lists or the number of items in the lists
  • Every item should be a possible entry point into the lists for iterators, i.e. items should not depend on other items (e.g. no delta encoding)
  • Append and read possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads)
  • Traversal in backwards order

SLIDE 53

Memory management

[Diagram: 4 int[] pools; each block = one 32K int[]]

SLIDE 54

Memory management

[Diagram: 4 int[] pools; each block = one 32K int[]]

Each pool can be grown individually by adding 32K blocks.

SLIDE 55

Memory management

  • For simplicity we can forget about the blocks for now and think of the pools as contiguous, unbounded int[] arrays
  • Small total number of Java objects (each 32K block is one object)

[Diagram: 4 int[] pools]

SLIDE 56

Memory management

  • Slices can be allocated in each pool
  • Each pool has a different, but fixed slice size

Slice sizes: 2^1, 2^4, 2^7, 2^11 ints

SLIDE 57

Adding and appending to a list

[Diagram: the four pools with slice sizes 2^1, 2^4, 2^7, 2^11; legend: available / allocated / current list]

SLIDE 58

Adding and appending to a list

[Diagram: slice sizes 2^1, 2^4, 2^7, 2^11; available / allocated / current list]

Store the first two postings in this slice.

SLIDE 59

Adding and appending to a list

[Diagram: slice sizes 2^1, 2^4, 2^7, 2^11; available / allocated / current list]

When the first slice is full, allocate another one in the second pool.

SLIDE 60

Adding and appending to a list

[Diagram: slice sizes 2^1, 2^4, 2^7, 2^11; available / allocated / current list]

Allocate a slice on each level as the list grows.

SLIDE 61

Adding and appending to a list

[Diagram: slice sizes 2^1, 2^4, 2^7, 2^11; available / allocated / current list]

On the uppermost level, one list can own multiple slices.
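
A sketch of this allocation policy, assuming the four slice sizes shown on the slides (growing a pool by adding 32K-int blocks is elided):

    // Slice sizes per pool: 2^1, 2^4, 2^7 and 2^11 ints.
    final class SlicePoolsSketch {
        static final int[] SLICE_SIZES = {1 << 1, 1 << 4, 1 << 7, 1 << 11};
        private final int[] nextFree = new int[4];   // next free offset per pool

        // Allocate the next slice on the given level; returns its start offset.
        int allocateSlice(int level) {
            int start = nextFree[level];
            nextFree[level] += SLICE_SIZES[level];
            return start;
        }

        // A growing list takes its next slice one level up; once on the top
        // level it keeps allocating 2^11-int slices there.
        static int nextLevel(int level) {
            return Math.min(level + 1, SLICE_SIZES.length - 1);
        }
    }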

SLIDE 62

Posting list format

int (32 bits) = docID (24 bits) + textPosition (8 bits)

  • docID: max. 16.7M
  • textPosition: max. 255 (tweet text can only have 140 chars)
  • Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest a 5x improvement compared to vanilla Lucene with FSDirectory)

SLIDE 63

Addressing items

  • Use 32-bit (int) pointers to address any item in any list unambiguously:

int (32 bits) = poolIndex (2 bits, values 0-3) + sliceIndex (19-29 bits, depends on pool) + offset in slice (1-11 bits, depends on pool)

  • Nice symmetry: postings and address pointers both fit into a 32-bit int
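
The pointer arithmetic can be sketched as follows; the exact bit layout is an assumption, chosen to be consistent with the bit counts above:

    // 2 bits pool index, then sliceIndex, then offset-in-slice. The split
    // between sliceIndex and offset depends on the pool's slice size.
    static final int[] OFFSET_BITS = {1, 4, 7, 11};   // log2 of the slice sizes

    static int encodePointer(int pool, int sliceIndex, int offsetInSlice) {
        return (pool << 30) | (sliceIndex << OFFSET_BITS[pool]) | offsetInSlice;
    }

    static int poolIndex(int ptr) { return ptr >>> 30; }

    static int offsetInSlice(int ptr) {
        return ptr & ((1 << OFFSET_BITS[poolIndex(ptr)]) - 1);
    }

    static int sliceIndex(int ptr) {
        return (ptr & 0x3FFFFFFF) >>> OFFSET_BITS[poolIndex(ptr)];
    }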
SLIDE 64

Linking the slices

[Diagram: slice sizes 2^1, 2^4, 2^7, 2^11; available / allocated / current list]

SLIDE 65

Linking the slices

[Diagram: slice sizes 2^1, 2^4, 2^7, 2^11; the slices of a list are linked, and the dictionary's parallel arrays point to the last posting indexed for a term]

SLIDE 66

Concurrency - Definitions

  • Pessimistic locking
    • A thread holds an exclusive lock on a resource while an action is performed [mutual exclusion]
    • Usually used when conflicts are expected to be likely
  • Optimistic locking
    • Operations are attempted atomically without holding a lock; conflicts can be detected, and retry logic is often used in case of conflicts
    • Usually used when conflicts are expected to be the exception
SLIDE 67

Concurrency - Definitions

  • Non-blocking algorithm

Ensures that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion.

  • Lock-free algorithm

A non-blocking algorithm is lock-free if there is guaranteed system-wide progress.

  • Wait-free algorithm

A non-blocking algorithm is wait-free if there is guaranteed per-thread progress.

* Source: Wikipedia

SLIDE 68

Concurrency

  • Having a single writer thread simplifies our problem: no locks have to be used to protect data structures from corruption (only one thread modifies data)
  • But: we have to make sure that all readers always see a consistent state of all data structures -> this is much harder than it sounds!
  • In Java it is not guaranteed that one thread will see changes that another thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication
  • Safe publication can be achieved in different, subtle ways. Read the great book "Java Concurrency in Practice" by Brian Goetz for more information!

SLIDE 69

Java Memory Model

  • Program order rule

Each action in a thread happens-before every action in that thread that comes later in the program order.

  • Volatile variable rule

A write to a volatile field happens-before every subsequent read of that same field.

  • Transitivity

If A happens-before B, and B happens-before C, then A happens-before C.

* Source: Brian Goetz: Java Concurrency in Practice

SLIDE 70

Concurrency

[Diagram: Thread 1 and Thread 2, each with its own cache, above shared RAM holding int x; time flows downward]

SLIDE 71

Concurrency

[Diagram: Thread 1 executes x = 5; the value 5 sits only in Thread 1's cache]

Thread 1 writes x=5 to its cache.

SLIDE 72

Concurrency

[Diagram: Thread 1: x = 5 (still only in its cache). Thread 2: while(x != 5);]

This condition will likely never become false!

SLIDE 73

Concurrency

[Diagram: Thread 1 and Thread 2, each with its own cache, above shared RAM holding int x; time flows downward]

SLIDE 74

Concurrency

[Diagram: Thread 1 executes x = 5 (cached), then b = 1, which goes to RAM]

Thread 1 writes b=1 to RAM, because b is volatile.

SLIDE 75

Concurrency

[Diagram: Thread 1: x = 5; b = 1. Thread 2: int dummy = b; while(x != 5);]

Thread 2 reads volatile b.

SLIDE 76

Concurrency

[Diagram: Thread 1: x = 5; b = 1. Thread 2: int dummy = b; while(x != 5); a happens-before edge connects x = 5 to b = 1]

  • Program order rule: Each action in a thread happens-before every action in that thread that comes later in the program order.

SLIDE 77

Concurrency

[Diagram: same code; a happens-before edge connects the volatile write b = 1 to the volatile read int dummy = b]

  • Volatile variable rule: A write to a volatile field happens-before every subsequent read of that same field.

SLIDE 78

Concurrency

[Diagram: same code; by transitivity, x = 5 happens-before the read of x in while(x != 5)]

  • Transitivity: If A happens-before B, and B happens-before C, then A happens-before C.

SLIDE 79

Concurrency

[Diagram: Thread 1: x = 5; b = 1. Thread 2: int dummy = b; while(x != 5);]

This condition will become false, i.e. x == 5.

  • Note: x itself doesn't have to be volatile. There can be many variables like x, but we need only a single volatile field.

SLIDE 80

Concurrency

[Diagram: same code; the volatile write/read pair of b forms the memory barrier]

  • Note: x itself doesn't have to be volatile. There can be many variables like x, but we need only a single volatile field.

SLIDE 81

Demo
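
The live demo is not reproduced in this transcript. A minimal runnable sketch of the pattern from the preceding slides might look like this (the reader spins on the volatile field instead of reading it once, so the example reliably terminates):

    // Safe publication via a single volatile field: without `volatile` on b,
    // the reader might spin on x forever.
    public class SafePublicationDemo {
        static int x;               // plain field, published via b
        static volatile int b;      // the one volatile field (memory barrier)

        public static void main(String[] args) {
            Thread reader = new Thread(() -> {
                while (b != 1) { }  // spin until the volatile write is visible
                // volatile rule + program order + transitivity: x == 5 here
                System.out.println("x = " + x);
            });
            reader.start();
            x = 5;                  // happens-before the volatile write below
            b = 1;                  // publish
        }
    }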

SLIDE 82

Concurrency

[Diagram: same code; the volatile write/read pair of b forms the memory barrier]

  • Note: x itself doesn't have to be volatile. There can be many variables like x, but we need only a single volatile field.

SLIDE 83

Concurrency

[Diagram: IndexWriter: write 100 docs, set maxDoc = 100, write more docs. IndexReader: in IR.open(): read maxDoc, then search up to maxDoc]

maxDoc is volatile.

SLIDE 84

Concurrency

[Diagram: as above; the write maxDoc = 100 happens-before the read of maxDoc in IR.open()]

maxDoc is volatile.

  • Only maxDoc is volatile. All other fields that IW writes to and IR reads from don't need to be!
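
A sketch of this handshake (illustrative field and method names, not the actual Lucene/Earlybird code):

    // The writer updates plain index structures, then publishes them by
    // writing volatile maxDoc; a reader reads maxDoc first and searches
    // only up to that document.
    final class IndexHandshakeSketch {
        private volatile int maxDoc;     // the only volatile field needed
        // ... pools, parallel arrays: all plain, written before maxDoc ...

        void publish(int newMaxDoc) {    // writer thread, after indexing docs
            maxDoc = newMaxDoc;          // volatile write: the memory barrier
        }

        int openReader() {               // searcher thread, in IR.open()
            return maxDoc;               // volatile read: everything the writer
        }                                // wrote before publish() is visible
    }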

SLIDE 85

Wait-free

  • Not a single exclusive lock
  • The writer thread can always make progress
  • Optimistic locking (retry logic) in a few places for the searcher thread
  • The retry logic is very simple and guaranteed to always make progress

SLIDE 86

Realtime Search @twitter

Agenda

  • Introduction
  • Search Architecture
  • Inverted Index 101
  • Memory Model & Concurrency
  • What’s next?
SLIDE 87

What’s next?

SLIDE 88

What’s next?

  • Posting list format that supports:
    • Positions > 255
    • Payloads
    • Point-in-time document frequencies (df)
  • Code complete
  • Performance tests soon!
SLIDE 89

DocID Posting

int (32 bits) = docID (24 bits) + textPosition or term freq (tf) (7 bits) + flag (1 bit: 0=textPosition, 1=tf)

  • docID: max. 16.7M
  • textPosition / tf: max. 127
  • textPosition is only stored inline if (tf == 1 && textPosition <= 127 && hasPayload == false)
  • Change from the old format: for tf > 1 we don't repeat the docID anymore
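
A decoding sketch for this format. The slide fixes only the field widths; the placement of the flag bit and the 7-bit field below is my assumption:

    // 24-bit docID, 7 bits holding either textPosition or tf, 1 flag bit.
    static int docId(int posting) { return posting >>> 8; }

    static boolean holdsTf(int posting) {
        return (posting & 1) == 1;        // flag: 0 = textPosition, 1 = tf
    }

    static int positionOrTf(int posting) {
        return (posting >>> 1) & 0x7F;    // 7-bit field, max. 127
    }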
SLIDE 90

Embedded Skip Lists

[Diagram: a slice of 2048 ints divided into posting blocks of x * 64 bytes (cache line size), each block carrying an embedded skip-list entry]

  • x to be determined in performance tests
SLIDE 91

Storing Positions and Payloads

  • Use Lucene's ByteBlockPool: it builds up a linked list of byte slices of increasing lengths
  • Unlike Lucene's ByteBlockPool, we use doubly-linked lists, i.e. the slices have pointers to both the previous and the next slice

[Diagram: docID slices linked to position/payload slices]

SLIDE 92

Embedded Skip Lists

[Diagram: a slice of 2048 ints divided into posting blocks of x * 64 bytes (cache line size), each block carrying an embedded skip-list entry]

  • x to be determined in performance tests

SLIDE 93

Embedded Skip Lists

[Diagram: a skip list embedded in slices of 2048 ints]

SLIDE 94

Document Frequencies

[Diagram: slice sizes 2^1, 2^4, 2^7, 2^11; the list's current top-level slice stores sliceID = 1]

  • Store the sliceID; sliceIDs are only stored for slices on the highest level

E.g. DF = 2^1 + 2^4 + 2^7 + (sliceID * 2^11) + offsetWithinSlice
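
The formula transcribed into code, with a worked example:

    // All lower-level slices are full, plus sliceID full top-level slices,
    // plus the postings already in the current top-level slice.
    static int documentFrequency(int sliceId, int offsetWithinSlice) {
        return (1 << 1) + (1 << 4) + (1 << 7)    // 2 + 16 + 128 = 146
             + sliceId * (1 << 11)               // full 2^11-int slices
             + offsetWithinSlice;
    }
    // e.g. documentFrequency(1, 100) == 146 + 2048 + 100 == 2294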

SLIDE 95

Summary

  • Efficient for small documents due to position inlining
  • Position/payload encoding size comparable to vanilla Lucene for bigger documents
  • Concurrency model unchanged
  • A reader thread will never try to access positions/payloads that have not been safely published yet
  • Document frequencies can be looked up in constant time (even in the worst case)

SLIDE 96

Questions?