340 Million Tweets per day
2.3 Billion Queries per day
< 10 s Indexing latency
50 ms Avg. query response time

Earlybird - Realtime Search @twitter
Michael Busch
@michibusch michael@twitter.com buschmi@apache.org
Agenda
- Introduction
- Search Architecture
- Inverted Index 101
- Memory Model & Concurrency
- What’s next?
Introduction
- Twitter acquired Summize in 2008
- 1st gen search engine based on MySQL
Introduction
- Next gen search engine based on Lucene
- Improves scalability and performance by orders of magnitude
- Open Source
Search Architecture
- Ingester pre-processes Tweets for search
- Geo-coding, URL expansion, tokenization, etc.
[Diagram: Tweets -> Ingester]
Search Architecture
- Tweets are serialized to MySQL in Thrift format
[Diagram: Tweets -> Ingester -> (Thrift) -> MySQL Master -> MySQL Slaves]
Earlybird
- Earlybird reads from MySQL slaves
- Builds an in-memory inverted index in real time
[Diagram: Tweets -> Ingester -> (Thrift) -> MySQL Master -> MySQL Slaves -> Earlybird -> Index]
Blender
[Diagram: Earlybird indexes <- (Thrift) -> Blender <- (Thrift) -> clients]
- Blender is our Thrift service aggregator
- Queries multiple Earlybirds, merges results
Cluster layout

[Diagram: grid of Earlybird instances]
- n hash partitions (docId % n)
- Replicas of each partition
- Timeslices: the newest timeslice is writable, older timeslices are complete
Inverted Index 101
Table with 6 documents:

1 The old night keeper keeps the keep in the town
2 In the big old house in the big old gown.
3 The house in the town had the big old keep
4 Where the old night keeper never did sleep.
5 The night keeper keeps the keep in the night
6 And keeps in the dark and sleeps in the light.

Example from: Justin Zobel, Alistair Moffat, "Inverted files for text search engines", ACM Computing Surveys (CSUR) v.38 n.2, 2006
Inverted Index 101

Dictionary and posting lists built from the 6-document table:

term    freq  postings
and     1     <6>
big     2     <2> <3>
dark    1     <6>
did     1     <4>
gown    1     <2>
had     1     <3>
house   2     <2> <3>
in      5     <1> <2> <3> <5> <6>
keep    3     <1> <3> <5>
keeper  3     <1> <4> <5>
keeps   3     <1> <5> <6>
light   1     <6>
never   1     <4>
night   3     <1> <4> <5>
old     4     <1> <2> <3> <4>
sleep   1     <4>
sleeps  1     <6>
the     6     <1> <2> <3> <4> <5> <6>
town    2     <1> <3>
where   1     <4>
Inverted Index 101

Query: keeper
Look up "keeper" in the dictionary and retrieve its posting list: <1> <4> <5>
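The dictionary lookup above can be sketched as a minimal in-memory inverted index. This is an illustrative toy, not Earlybird's implementation; class and method names are assumptions, and term frequencies are omitted for brevity:

```java
import java.util.*;

// Minimal sketch: term -> ordered posting list of docIDs.
public class InvertedIndex101 {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    public void add(int docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
            // store each docID at most once per document
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
    }

    public List<Integer> query(String term) {
        return postings.getOrDefault(term, Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndex101 idx = new InvertedIndex101();
        idx.add(1, "The old night keeper keeps the keep in the town");
        idx.add(2, "In the big old house in the big old gown");
        idx.add(3, "The house in the town had the big old keep");
        idx.add(4, "Where the old night keeper never did sleep");
        idx.add(5, "The night keeper keeps the keep in the night");
        idx.add(6, "And keeps in the dark and sleeps in the light");
        System.out.println(idx.query("keeper")); // [1, 4, 5]
    }
}
```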
Posting list encoding

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Delta encoding: 5, 10, 8985, 2, 90998, 90

VInt compression:
- The first bit of each byte indicates whether the next byte belongs to the same value
- Values 0 <= delta <= 127 need one byte (e.g. 00000101)
- Values 128 <= delta <= 16383 need two bytes (e.g. 11000110 00011001)
- Variable number of bytes: a VInt-encoded posting cannot be written as a primitive Java type; therefore it cannot be written atomically
- Each posting depends on the previous one; decoding is only possible in old-to-new (read) direction
- With recency ranking (new-to-old) no early termination is possible
Posting list encoding
- By default Lucene uses a combination of delta encoding and VInt compression
- VInts are expensive to decode
- Problem 1: How to traverse posting lists backwards?
- Problem 2: How to write a posting atomically?
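The delta + VInt scheme described above can be sketched as follows. This follows the common 7-payload-bits-per-byte convention with a continuation bit; method names are illustrative, not Lucene's API:

```java
import java.io.ByteArrayOutputStream;

// Sketch of delta + VInt posting encoding: each byte carries 7 payload
// bits, and the high bit signals that the next byte belongs to the same value.
public class VIntEncoding {
    static byte[] encodeDeltas(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int docId : docIds) {
            int delta = docId - prev;   // delta encoding
            prev = docId;
            while ((delta & ~0x7F) != 0) {        // more than 7 bits remain
                out.write((delta & 0x7F) | 0x80); // continuation bit set
                delta >>>= 7;
            }
            out.write(delta);                     // last byte, high bit clear
        }
        return out.toByteArray();
    }

    static int[] decode(byte[] bytes, int count) {
        int[] docIds = new int[count];
        int pos = 0, prev = 0;
        for (int i = 0; i < count; i++) {
            int value = 0, shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                value |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            prev += value;   // undo delta encoding: old-to-new direction only!
            docIds[i] = prev;
        }
        return docIds;
    }

    public static void main(String[] args) {
        int[] ids = {5, 15, 9000, 9002, 100000, 100090};
        byte[] encoded = encodeDeltas(ids);
        // deltas 5, 10, 2, 90 take one byte; 8985 takes two; 90998 takes three
        System.out.println(encoded.length + " bytes"); // 9 bytes
        System.out.println(java.util.Arrays.toString(decode(encoded, ids.length)));
    }
}
```

Note how `decode` must accumulate deltas from the start of the list: this is exactly why backwards traversal and atomic writes are hard with this format.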
Posting list encoding in Earlybird

Posting: int (32 bits)
- docID: 24 bits (max. 16.7M)
- textPosition: 8 bits (max. 255; Tweet text can only have 140 chars)

- Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest 5x improvement compared to vanilla Lucene with FSDirectory)
Posting list encoding in Earlybird

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Earlybird encoding (absolute docIDs, read direction new-to-old): 5 15 9000 9002 100000 100090
Early query termination

Doc IDs to encode: 5, 15, 9000, 9002, 100000, 100090
Because postings are absolute docIDs, the list can be read new-to-old. E.g. if 3 results are requested, we can terminate after reading 3 postings.
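The 24+8-bit posting layout can be sketched with plain bit operations. The exact bit order (docID in the upper bits) is an assumption; the slides only give the field widths:

```java
// Sketch of the Earlybird posting: one 32-bit int per posting,
// docID in the upper 24 bits, textPosition in the lower 8 bits.
public class EarlybirdPosting {
    static int encode(int docId, int textPosition) {
        assert docId <= 0xFFFFFF && textPosition <= 0xFF;
        return (docId << 8) | textPosition;
    }

    static int docId(int posting)        { return posting >>> 8; }
    static int textPosition(int posting) { return posting & 0xFF; }

    public static void main(String[] args) {
        int posting = encode(9000, 42);
        System.out.println(docId(posting) + " " + textPosition(posting)); // 9000 42
        // Because each posting is a self-contained int holding an absolute
        // docID, a searcher can walk the array new-to-old and stop early.
    }
}
```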
Posting list encoding - Summary
- ints can be written atomically in Java
- Backwards traversal easy on absolute docIDs (not deltas)
- Every posting is a possible entry point for a searcher
- Skipping can be done without additional data structures via binary search, even though there are better approaches which should be explored
- On tweet indexes we need about 30% more storage for docIDs compared to delta+VInts; compensated by compression of complete segments
- Max. segment size: 2^24 = 16.7M tweets
Memory Model & Concurrency
Inverted index components
[Diagram: two components side by side, Dictionary and Posting list storage, each marked with a "?"]
Inverted Index

(Dictionary and posting lists for the same 6-document example as before.)
Per term we store different kinds of metadata: text pointer, frequency, postings pointer, etc.
Term dictionary

Parallel arrays indexed by termID:

int[] textPointer;
int[] frequency;
int[] postingsPointer;

The term text itself lives in a shared term text pool; e.g. after adding the terms "cat", "foo" and "bar" the pool contains "catfoobar", and textPointer[termID] points at each term's start offset (t0, t1, t2).
Term dictionary
- Number of objects << number of terms
- O(1) lookups
- Easy to store more term metadata by adding additional parallel arrays
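A sketch of the parallel-array dictionary described above. Field names follow the slides; the HashMap used for text-to-termID lookup and the growth logic are assumptions to keep the sketch self-contained:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// One int per term per attribute: the number of Java objects stays
// independent of the number of terms.
public class TermDictionary {
    private final StringBuilder termTextPool = new StringBuilder(); // shared text pool
    private int[] textPointer = new int[8];     // start offset of term text in pool
    private int[] frequency = new int[8];       // per-term frequency
    private int[] postingsPointer = new int[8]; // most recently indexed posting
    private final Map<String, Integer> termIds = new HashMap<>();
    private int nextTermId = 0;

    int addOrGet(String term) {
        Integer id = termIds.get(term);
        if (id != null) return id;               // O(1) lookup
        int termId = nextTermId++;
        if (termId == textPointer.length) grow();
        textPointer[termId] = termTextPool.length();
        termTextPool.append(term);               // "cat" + "foo" -> "catfoo"
        termIds.put(term, termId);
        return termId;
    }

    void recordPosting(int termId, int postingPointer) {
        frequency[termId]++;
        postingsPointer[termId] = postingPointer;
    }

    int frequency(int termId) { return frequency[termId]; }

    // Adding more term metadata = adding another parallel array here.
    private void grow() {
        textPointer = Arrays.copyOf(textPointer, textPointer.length * 2);
        frequency = Arrays.copyOf(frequency, frequency.length * 2);
        postingsPointer = Arrays.copyOf(postingsPointer, postingsPointer.length * 2);
    }

    public static void main(String[] args) {
        TermDictionary dict = new TermDictionary();
        System.out.println(dict.addOrGet("cat") + " "
            + dict.addOrGet("foo") + " " + dict.addOrGet("bar")); // 0 1 2
    }
}
```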
Inverted index components

[Diagram: Dictionary (parallel arrays, including a pointer to the most recently indexed posting for each term) | Posting list storage (?)]
Posting lists storage - Objectives

- Store many singly-linked lists of different lengths space-efficiently
- The number of Java objects should be independent of the number of lists or the number of items in the lists
- Every item should be a possible entry point into the lists for iterators, i.e. items should not be dependent on other items (e.g. no delta encoding)
- Append and read possible by multiple threads in a lock-free fashion (single append thread, multiple reader threads)
- Traversal in backwards order
Memory management

- 4 int[] pools, each made of 32K int[] blocks
- Each pool can be grown individually by adding 32K blocks
- For simplicity we can forget about the blocks for now and think of the pools as continuous, unbounded int[] arrays
- Small total number of Java objects (each 32K block is one object)
- Slices can be allocated in each pool
- Each pool has a different, but fixed slice size: 2^1, 2^4, 2^7, 2^11
Adding and appending to a list

- Slice sizes per pool: 2^1, 2^4, 2^7, 2^11
- Store the first two postings in a slice from the smallest pool
- When the first slice is full, allocate another slice in the second pool
- Allocate a slice on each level as the list grows
- On the uppermost level one list can own multiple slices
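The growth scheme above can be sketched numerically: a list's n-th slice comes from pool min(n, 3), so slice sizes grow 2, 16, 128, 2048, 2048, ... (class and method names are illustrative; the slides only describe the scheme):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how a posting list grows across the four slice pools.
public class SliceGrowth {
    static final int[] SLICE_SIZE = {1 << 1, 1 << 4, 1 << 7, 1 << 11};

    /** Returns the sizes of the slices a list owns after `postings` appends. */
    static List<Integer> slicesFor(int postings) {
        List<Integer> slices = new ArrayList<>();
        int level = 0;
        while (postings > 0) {
            // levels 0..2 use pools of growing slice size; level 3+ stays in
            // the top pool, where a list may own many slices
            int size = SLICE_SIZE[Math.min(level++, 3)];
            slices.add(size);
            postings -= size;
        }
        return slices;
    }

    public static void main(String[] args) {
        System.out.println(slicesFor(5));    // [2, 16]
        System.out.println(slicesFor(5000)); // [2, 16, 128, 2048, 2048, 2048]
    }
}
```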
Posting list format

Posting: int (32 bits)
- docID: 24 bits (max. 16.7M)
- textPosition: 8 bits (max. 255; Tweet text can only have 140 chars)

- Decoding speed significantly improved compared to delta and VInt decoding (early experiments suggest 5x improvement compared to vanilla Lucene with FSDirectory)
Addressing items

- Use 32-bit (int) pointers to address any item in any list unambiguously:

int (32 bits)
- poolIndex: 2 bits (0-3)
- offset in slice: 1-11 bits (depends on pool)
- sliceIndex: 19-29 bits (depends on pool)

- Nice symmetry: postings and address pointers both fit into a 32-bit int
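One plausible packing of such a pointer, as a sketch: 2 bits select the pool, the low bits address the offset inside a slice (1, 4, 7 or 11 bits, matching the slice sizes 2^1..2^11), and the bits in between hold the slice index. The exact field order is an assumption:

```java
// Sketch of a 32-bit address pointer into the four slice pools.
public class AddressPointer {
    static final int[] OFFSET_BITS = {1, 4, 7, 11}; // log2 of slice size per pool

    static int encode(int poolIndex, int sliceIndex, int offsetInSlice) {
        int bits = OFFSET_BITS[poolIndex];
        return (poolIndex << 30) | (sliceIndex << bits) | offsetInSlice;
    }

    static int poolIndex(int pointer) { return pointer >>> 30; }

    static int sliceIndex(int pointer) {
        int bits = OFFSET_BITS[poolIndex(pointer)];
        return (pointer & 0x3FFFFFFF) >>> bits;   // strip pool bits, then shift
    }

    static int offsetInSlice(int pointer) {
        return pointer & ((1 << OFFSET_BITS[poolIndex(pointer)]) - 1);
    }

    public static void main(String[] args) {
        int p = encode(3, 12345, 2047);  // pool 3 has 11 offset bits
        System.out.println(poolIndex(p) + " " + sliceIndex(p) + " "
            + offsetInSlice(p)); // 3 12345 2047
    }
}
```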
Linking the slices

[Diagram: slice pools (slice sizes 2^1, 2^4, 2^7, 2^11) with each list's slices linked across levels; the dictionary's parallel array holds a pointer to the last posting indexed for each term]
Concurrency - Definitions
- Pessimistic locking
- A thread holds an exclusive lock on a resource, while an action is
performed [mutual exclusion]
- Usually used when conflicts are expected to be likely
- Optimistic locking
- Operations are attempted atomically without holding a lock; conflicts can be detected; retry logic is often used in case of conflicts
- Usually used when conflicts are expected to be the exception
Concurrency - Definitions
- Non-blocking algorithm
Ensures that threads competing for shared resources do not have their execution indefinitely postponed by mutual exclusion.
- Lock-free algorithm
A non-blocking algorithm is lock-free if there is guaranteed system-wide progress.
- Wait-free algorithm
A non-blocking algorithm is wait-free if there is guaranteed per-thread progress.
* Source: Wikipedia
Concurrency
- Having a single writer thread simplifies our problem: no locks have to be used
to protect data structures from corruption (only one thread modifies data)
- But: we have to make sure that all readers always see a consistent state of
all data structures -> this is much harder than it sounds!
- In Java, it is not guaranteed that one thread will see changes that another
thread makes in program execution order, unless the same memory barrier is crossed by both threads -> safe publication
- Safe publication can be achieved in different, subtle ways. Read the great book "Java Concurrency in Practice" by Brian Goetz for more information!
Java Memory Model
- Program order rule
Each action in a thread happens-before every action in that thread that comes later in the program order.
- Volatile variable rule
A write to a volatile field happens-before every subsequent read of that same field.
- Transitivity
If A happens-before B, and B happens-before C, then A happens-before C.
* Source: Brian Goetz: Java Concurrency in Practice
Concurrency

[Diagram: int x in RAM; Thread 1 and Thread 2 each with a CPU cache]

Thread 1 executes x = 5; the write lands in Thread 1's cache.
Thread 2 executes while(x != 5); this condition will likely never become false, because the write may never reach RAM.
Concurrency

[Diagram: int x and volatile int b in RAM; Thread 1 and Thread 2 each with a CPU cache]

Thread 1:              Thread 2:
x = 5;
b = 1;                 int dummy = b;
                       while(x != 5);

- Thread 1 writes b=1 to RAM, because b is volatile
- Program order rule: x = 5 happens-before b = 1 in Thread 1
- Volatile variable rule: the write b = 1 happens-before the read int dummy = b in Thread 2
- Transitivity: x = 5 happens-before the read of x in while(x != 5)
- The while condition will be false, i.e. x == 5
- Note: x itself doesn’t have to be volatile. There can be many variables like x, but we need only a single volatile field acting as the memory barrier.

Demo
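The happens-before chain above can be demonstrated in a small program. This is a sketch of the safe-publication pattern, not Earlybird code; class and field names are illustrative:

```java
// Plain writes to x become visible to the reader because both threads
// cross the same memory barrier: the volatile field b.
public class SafePublication {
    static int x;            // plain field, not volatile
    static volatile int b;   // the single memory barrier

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            while (b != 1) { }  // read volatile b: crosses the barrier
            // volatile rule + program order + transitivity guarantee that
            // the write x = 5 happens-before this read of x
            System.out.println("x = " + x); // prints "x = 5"
        });
        reader.start();
        x = 5;   // plain write, published by the volatile write below
        b = 1;   // volatile write: crosses the barrier
        reader.join();
    }
}
```

Without the volatile field, the reader's spin loop could legally run forever, exactly as in the first example.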
Concurrency

[Diagram: IndexWriter and IndexReader on a shared timeline]

- IndexWriter writes 100 docs, then sets maxDoc = 100 (maxDoc is volatile), then writes more docs
- IndexReader, in IR.open(), reads maxDoc and searches up to maxDoc
- Volatile variable rule: the write of maxDoc happens-before the read in IR.open()
- Only maxDoc is volatile. All other fields that IW writes to and IR reads from don’t need to be!
- Not a single exclusive lock
- Writer thread can always make progress
- Optimistic locking (retry-logic) in a few places for searcher thread
- Retry logic very simple and guaranteed to always make progress
Wait-free
What’s next?
- Posting list format that supports
- Positions > 255
- Payloads
- Point-in-time document frequencies (df)
- Code complete
- Performance tests soon!
DocID Posting

int (32 bits)
- docID: 24 bits (max. 16.7M)
- textPosition or term freq (tf): 7 bits (max. 127)
- type flag: 1 bit (0=textPosition, 1=tf)

- textPosition is only stored inline, if (tf == 1 && textPosition <= 127 && hasPayload == false)
- Change from old format: for tf > 1 we don’t repeat the docID anymore
Embedded Skip Lists

- Slice: 2048 ints
- Posting block: x * 64 bytes (cache line size); x to be determined in performance tests
- Skip list entries are embedded directly in the slices

Storing Positions and Payloads

- Use Lucene’s ByteBlockPool: it builds up a linked list of byte slices of increasing lengths
- Unlike Lucene’s ByteBlockPool we use doubly-linked lists, i.e. the slices will have pointers to the previous and next slices

[Diagram: docID slices linked to position/payload slices]
Document Frequencies

- Store a sliceID per slice (slice sizes: 2^1, 2^4, 2^7, 2^11)
- SliceIDs only stored for slices on the highest level
- E.g. DF = 2^1 + 2^4 + 2^7 + (sliceID * 2^11) + offsetWithinSlice
Summary
- Efficient for small documents due to position inlining
- Position/payload encoding size comparable to vanilla Lucene for bigger
documents
- Concurrency model unchanged
- A reader thread will never try to access positions/payloads that have not
been safely published yet
- Document frequencies can be looked up in constant time (even worst
case)
Questions?