SASI, Cassandra on the full text search ride (DuyHai DOAN)



SLIDE 1

SASI, Cassandra on the full text search ride

DuyHai DOAN – Apache Cassandra™ Evangelist

SLIDE 2

@doanduyhai

1. 5 minutes introduction to Apache Cassandra™
2. SASI introduction
3. SASI cluster-wide
4. SASI local read/write path
5. Query planner
6. Some benchmarks
7. Take away

SLIDE 3

Trademark Policy

From now on …

Cassandra ⩵ Apache Cassandra™

SLIDE 4

5 minutes introduction to Apache Cassandra™

SLIDE 5

The tokens

Random hash of #partition → token = hash(#p)
Hash: $]-x, x]$
Hash range: $2^{64}$ values
$x = 2^{64}/2$

[Diagram: ring of 8 Cassandra (C*) nodes]
SLIDE 6

Token ranges

A: $]-x, -\frac{3x}{4}]$
B: $]-\frac{3x}{4}, -\frac{2x}{4}]$
C: $]-\frac{2x}{4}, -\frac{x}{4}]$
D: $]-\frac{x}{4}, 0]$
E: $]0, \frac{x}{4}]$
F: $]\frac{x}{4}, \frac{2x}{4}]$
G: $]\frac{2x}{4}, \frac{3x}{4}]$
H: $]\frac{3x}{4}, x]$

[Diagram: ring of 8 nodes, one per token range A to H]
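The mapping from a partition key to one of the eight ranges A to H can be sketched as follows. This is only an illustration: the hash is a stand-in (BLAKE2b truncated to 64 bits, not the partitioner Cassandra actually uses), and the names `token_for` and `owner_of` are hypothetical.

```python
import hashlib
from bisect import bisect_left

X = 2**63  # x = 2^64 / 2, so tokens live in ]-x, x]

NODES = "ABCDEFGH"
# Upper bounds of the eight ranges: -3x/4, -2x/4, -x/4, 0, x/4, 2x/4, 3x/4, x
UPPER_BOUNDS = [-X + (i + 1) * X // 4 for i in range(8)]


def token_for(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(partition_key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def owner_of(token: int) -> str:
    """Return the node whose range ]lower, upper] contains the token."""
    if token == -X:
        # -x itself is excluded from ]-x, x]; wrap it around to the last range.
        return NODES[-1]
    # First upper bound >= token, which matches the closed upper end of each range.
    return NODES[bisect_left(UPPER_BOUNDS, token)]


if __name__ == "__main__":
    for key in ("user_id1", "user_id2", "user_id3"):
        print(key, "->", owner_of(token_for(key)))
```

Because the hash is effectively random, consecutive keys such as user_id1 and user_id2 usually land on unrelated nodes.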
SLIDE 7

Distributed tables

[Diagram: partitions user_id1 to user_id5 spread over the ring of nodes A to H]

CREATE TABLE users(user_id int, …, PRIMARY KEY(user_id));

SLIDE 8

Distributed tables

[Diagram: partitions user_id1 to user_id5 spread over the ring of nodes A to H]

SLIDE 9

Coordinator node

Responsible for handling requests (read/write)
Every node can be a coordinator

  • masterless
  • no SPOF
  • proxy role

[Diagram: ring of nodes A to H; a request reaches the coordinator, with nodes marked 1, 2 and 3]
SLIDE 10

Q & A

SLIDE 11

SASI introduction

SLIDE 12

What is SASI?

  • SSTable-Attached Secondary Index → a new secondary index implementation that follows the SSTable life-cycle
  • Objective: provide a more performant and more capable secondary index
SLIDE 13

Who created it?

Open-source contribution by a team of engineers

SLIDE 14

Why is it better than the native secondary index?

  • follows the SSTable life-cycle (flush, compaction, rebuild …) → more optimized
  • new data structures
  • range queries (<, ≤, >, ≥) possible
  • full text search options
SLIDE 15

Demo

SLIDE 16

SASI cluster-wide

SLIDE 17

Distributed index

On cluster level, SASI works exactly like the native secondary index

[Diagram: ring of nodes A to H; index entries shown next to individual nodes:]
  UK → user1, user102, …, user493
  US → user54, user483, …, user938
  UK → user87, user176, …, user987
  UK → user17, user409, …, user787

SLIDE 18

Distributed search algorithm

[Diagram: ring of nodes A to H with a coordinator]
1st round: concurrency factor = 1

SLIDE 19

Distributed search algorithm

[Diagram: ring of nodes A to H with a coordinator]
Not enough results?

SLIDE 20

Distributed search algorithm

[Diagram: ring of nodes A to H with a coordinator]
2nd round: concurrency factor = 2

SLIDE 21

Distributed search algorithm

[Diagram: ring of nodes A to H with a coordinator]
Still not enough results?

SLIDE 22

Distributed search algorithm

[Diagram: ring of nodes A to H with a coordinator]
3rd round: concurrency factor = 4

SLIDE 23

Concurrency factor formula

[Figure: concurrency factor formula]

  • more details at: http://www.doanduyhai.com/blog/?p=13191
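The round-by-round behaviour of the coordinator can be sketched as below. This is a simplified model, not Cassandra code: it assumes the concurrency factor simply doubles each round (1, 2, 4, as in the three rounds shown on the slides), whereas the real formula is the one described in the linked blog post. The function name and the `query_range` callback are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def distributed_search(
    token_ranges: Sequence[str],
    query_range: Callable[[str], List[str]],
    limit: int,
) -> Tuple[List[str], int]:
    """Query token ranges round by round until `limit` rows are found.

    Returns the rows found and the number of rounds that were needed.
    """
    results: List[str] = []
    concurrency_factor = 1  # 1st round on the slides
    position = 0
    rounds = 0
    while position < len(token_ranges) and len(results) < limit:
        batch = token_ranges[position:position + concurrency_factor]
        for token_range in batch:  # a real coordinator queries these in parallel
            results.extend(query_range(token_range))
        position += len(batch)
        rounds += 1
        # Assumption: the factor doubles each round (1, 2, 4 on the slides).
        concurrency_factor *= 2
    return results[:limit], rounds


if __name__ == "__main__":
    rows_per_range = {
        "A": [], "B": ["user54"], "C": [], "D": ["user87"],
        "E": [], "F": [], "G": ["user17"], "H": [],
    }
    rows, rounds = distributed_search(list("ABCDEFGH"), rows_per_range.__getitem__, limit=2)
    print(rows, rounds)
```

In this model a query that matches nothing keeps going until every range has been visited.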
SLIDE 24

Caveat 1: non restrictive filters

[Diagram: ring of nodes A to H with a coordinator]

Hit all nodes eventually :(

SLIDE 25

Caveat 1 solution: always use LIMIT

[Diagram: ring of nodes A to H with a coordinator]

SELECT * FROM … WHERE ... LIMIT 1000

SLIDE 26

Caveat 2: 1-to-1 index (user_email)

[Diagram: ring of nodes A to H with a coordinator]
WHERE user_email = 'xxx'
Not found

SLIDE 27

Caveat 2: 1-to-1 index (user_email)

[Diagram: ring of nodes A to H with a coordinator]
WHERE user_email = 'xxx'
Still no result

SLIDE 28

Caveat 2: 1-to-1 index (user_email)

[Diagram: ring of nodes A to H with a coordinator]
WHERE user_email = 'xxx'
At best 1 user found
At worst 0 user found

SLIDE 29

Caveat 2 solution: materialized views

For a 1-to-1 index/relationship, use materialized views instead

CREATE MATERIALIZED VIEW user_by_email AS
    SELECT * FROM users
    WHERE user_id IS NOT NULL AND user_email IS NOT NULL
    PRIMARY KEY (user_email, user_id);

But range queries (<, >, ≤, ≥) are not possible …

SLIDE 30

Caveat 3: fetch all rows for analytics use-case

[Diagram: ring of nodes A to H, coordinator and client]

SLIDE 31

Caveat 3 solution: use co-located Spark

  • Local index filtering in Cassandra
  • Aggregation in Spark

[Diagram: ring of nodes A to H, each running a local index query]

SLIDE 32

SASI local read/write path

SLIDE 33

SASI Life-cycle: in-memory

[Diagram: in-memory write path]
1. Commit log (Commit log1, Commit log2, …, Commit logN)
2. MemTable in memory (MemTable Table1, MemTable Table2, …, MemTable TableN)
3. Index MemTable (Index MemTable1, Index MemTable2, …, Index MemTableN)
Then ACK the client

SLIDE 34

Local write path data structures

Index mode, data type | Data structure                               | Usage
PREFIX, text          | ConcurrentRadixTree (concurrent-trees lib)   | name LIKE 'John%'
CONTAINS, text        | ConcurrentSuffixTree (concurrent-trees lib)  | name LIKE '%John%', name LIKE '%ny'
PREFIX, other         | JDK ConcurrentSkipListSet                    | age = 20, age >= 20 AND age <= 30
SPARSE, other         | JDK ConcurrentSkipListSet                    | age = 20, age >= 20 AND age <= 30 (suitable for 1-to-N index with N ≤ 5)
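To see why a PREFIX index serves LIKE 'John%' while a CONTAINS index is needed for LIKE '%John%' and LIKE '%ny', here is a rough sketch. It does not use the radix or suffix trees named above; it swaps in a plain sorted list, and the class and method names are hypothetical. The idea it illustrates is that indexing every suffix of a term turns a "contains" query into a prefix scan.

```python
from bisect import bisect_left, insort
from typing import Dict, List, Set


class PrefixIndex:
    """Sorted distinct terms; LIKE 'John%' becomes a short range scan."""

    def __init__(self) -> None:
        self._terms: List[str] = []           # distinct terms, kept sorted
        self._rows: Dict[str, Set[int]] = {}  # term -> row ids

    def add(self, term: str, row_id: int) -> None:
        if term not in self._rows:
            insort(self._terms, term)
            self._rows[term] = set()
        self._rows[term].add(row_id)

    def exact(self, term: str) -> Set[int]:
        return set(self._rows.get(term, ()))

    def starts_with(self, prefix: str) -> Set[int]:
        matches: Set[int] = set()
        # All terms sharing the prefix are contiguous in sorted order.
        for term in self._terms[bisect_left(self._terms, prefix):]:
            if not term.startswith(prefix):
                break
            matches |= self._rows[term]
        return matches


class ContainsIndex:
    """Indexes every suffix of each term (illustrative stand-in for a suffix tree)."""

    def __init__(self) -> None:
        self._suffixes = PrefixIndex()

    def add(self, term: str, row_id: int) -> None:
        for start in range(len(term)):
            self._suffixes.add(term[start:], row_id)

    def contains(self, fragment: str) -> Set[int]:   # LIKE '%fragment%'
        return self._suffixes.starts_with(fragment)

    def ends_with(self, suffix: str) -> Set[int]:    # LIKE '%suffix'
        return self._suffixes.exact(suffix)
```

Note that the CONTAINS variant stores one entry per suffix, so it holds far more entries than the PREFIX variant for the same data.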

SLIDE 35

SASI Life-cycle: flush to SSTable

[Diagram: flush from memory to disk]
4. Table1, Table2 and Table3 are flushed to SSTable1, SSTable2 and SSTable3, together with OnDiskIndex1, OnDiskIndex2 and OnDiskIndex3

SLIDE 36

SASI Life-cycle: compaction

SSTable1 + SSTable2 + SSTable3 → New SSTable
OnDiskIndex1 + OnDiskIndex2 + OnDiskIndex3 → New OnDiskIndex

SLIDE 37

Local write path summary

Index files are built

  • on memtable flush
  • on compaction flush

To avoid OOM, index files are split into chunks of

  • 1 GB for memtable flush
  • max_compaction_flush_memory_in_mb for compaction flush

→ Consequence: SASI has an impact on write bandwidth (CPU & disk I/O)

SLIDE 38

Local read path

  • first, optimize the query using the Query Planner (see later)
  • then load chunks (4k) of index files from disk into memory
  • perform a binary search to find the indexed value(s)
  • retrieve the corresponding partition keys and push them into the Partition Key Cache

→ Yes, currently SASI only keeps partition key(s), so on wide partitions it is not very optimized ...
SLIDE 39

OnDiskIndex files

SSTable1: user_id4 → FR, user_id1 → US, user_id5 → FR
SSTable2: user_id3 → UK, user_id2 → DE
OnDiskIndex1: FR, US
OnDiskIndex2: UK, DE

B+Tree-like data structures

SLIDE 40

OnDiskIndex Layout

  • Header Block (4k)
  • Data Block (multiple of 4k)
  • Pointer Block (multiple of 4k)
  • Meta Data Info: Levels Count, Data Block Meta, Pointer Block Meta, Level Index Offset

SLIDE 41

Header Block Layout

Field              | Size
Descriptor Version | variable
Term Size          | short
Min Term           | short
Max Term           | short
Min Pk             | short
Max Pk             | short
Index Mode         | variable
Has Partial        | byte

SLIDE 42

OnDiskIndex Layout

  • Header Block (4k)
  • Data Block (multiple of 4k)
  • Pointer Block (multiple of 4k)
  • Meta Data Info: Levels Count, Data Block Meta, Pointer Block Meta, Level Index Offset

SLIDE 43

Data Block layout

Each 4k data block holds: Terms Count | Offset Array | Term Block | TokenTree Block, with padding

  • Block 1: Terms Count | Offset Array: [0, 10, 22, …] | Term Block | TokenTree Block
  • Block 2: Terms Count | Offset Array: [0, 23, 35, …] | Term Block | TokenTree Block
  • Block 3: Terms Count | Offset Array: [0, 17, 34, …] | Term Block | TokenTree Block
  • Block 4: Terms Count | Offset Array: [0, 12, 28, …] | Term Block | TokenTree Block

SLIDE 44

OnDiskIndex Layout

  • Header Block (4k)
  • Data Block (multiple of 4k)
  • Pointer Block (multiple of 4k)
  • Meta Data Info: Levels Count, Data Block Meta, Pointer Block Meta, Level Index Offset

SLIDE 45

Pointer Block building

[Diagram: pointer blocks (4k each) built bottom-up from the last term of each block below]

  • Data Level: Data Block1, Data Block2, …, Data BlockN
  • Pointer Level 1: Pointer Block1, Pointer Block2, …, Pointer BlockN, holding LastTerm1, LastTerm2, …, LastTermN
  • Pointer Level 2: Pointer BlockN+1, Pointer BlockN+2, holding LastTermM, LastTermM+1, …, LastTermO
  • Pointer Root Level: Root Pointer Block

SLIDE 46

Binary search using OnDiskIndex files

[Diagram: the search starts at the Root Pointer Block (Pointer Root Level) and descends through Pointer Level 3, Pointer Level 2 and Pointer Level 1 to the Data Level (Data Block1, Data Block2, Data Block3, …, Data BlockN)]
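The descent from the root pointer block down to a data block can be sketched like this. It is a toy in-memory model, not the on-disk format: the `fanout` parameter (how many last terms fit in one pointer block) and both function names are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def build_pointer_levels(data_blocks: Sequence[Sequence[str]], fanout: int) -> List[List[List[str]]]:
    """Build pointer levels bottom-up from the last term of each block.

    levels[0] is the level just above the data blocks, levels[-1] is the root.
    """
    if not data_blocks or fanout < 2:
        raise ValueError("need at least one data block and fanout >= 2")
    levels: List[List[List[str]]] = []
    last_terms = [block[-1] for block in data_blocks]
    while True:
        level = [last_terms[i:i + fanout] for i in range(0, len(last_terms), fanout)]
        levels.append(level)
        if len(level) == 1:  # a single pointer block: this is the root
            return levels
        last_terms = [pointer_block[-1] for pointer_block in level]


def find_data_block(levels: List[List[List[str]]], fanout: int, term: str) -> Optional[int]:
    """Return the index of the data block that may contain `term`, or None."""
    child = 0  # index of the pointer block to read at the current level
    for level in reversed(levels):  # root first, then down towards the data level
        pointer_block = level[child]
        slot = bisect_left(pointer_block, term)  # first last-term >= term
        if slot == len(pointer_block):
            return None  # term is greater than every indexed term
        child = child * fanout + slot
    return child


if __name__ == "__main__":
    blocks = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"], ["i", "j"]]
    levels = build_pointer_levels(blocks, fanout=2)
    print(find_data_block(levels, 2, "e"))
```

Each level costs one small binary search, so the number of blocks read grows with the number of levels rather than with the number of data blocks.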

SLIDE 47

Term Block Binary Search

  • Term1, Term25, Term50, Term75, Term100: val < Term100 ?
  • Term50, Term75, Term100: val > Term50 ?
  • Term50, Term63, Term75: val < Term75 ?
  • Term57: val = Term57 ?
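A binary search over a term block addressed through an offset array (as in the data block layout, e.g. Offset Array: [0, 10, 22, …]) might look like the sketch below. The byte layout is simplified and assumed: terms are simply concatenated, and the helper names are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple


def pack_terms(terms: Iterable[str]) -> Tuple[List[int], bytes]:
    """Sort terms and pack them back to back, recording each start offset."""
    offsets: List[int] = []
    block = bytearray()
    for encoded in sorted(term.encode("utf-8") for term in terms):
        offsets.append(len(block))
        block += encoded
    return offsets, bytes(block)


def term_at(offsets: List[int], block: bytes, index: int) -> bytes:
    """Slice one term out of the block using the offset array."""
    end = offsets[index + 1] if index + 1 < len(offsets) else len(block)
    return block[offsets[index]:end]


def find_term(offsets: List[int], block: bytes, value: bytes) -> Optional[int]:
    """Classic binary search; returns the term position or None."""
    low, high = 0, len(offsets) - 1
    while low <= high:
        mid = (low + high) // 2
        term = term_at(offsets, block, mid)
        if term == value:
            return mid
        if term < value:
            low = mid + 1
        else:
            high = mid - 1
    return None
```

The offset array is what makes variable-length terms searchable: the search jumps to the middle entry without scanning the terms before it.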

SLIDE 48

Query Planner

SLIDE 49

Query planner

  • build predicates tree
  • predicates push-down & re-ordering
  • predicate fusions for != operator
SLIDE 50

Query optimization example

WHERE age < 100 AND fname LIKE 'p%' AND fname != 'pa%' AND age > 21

SLIDE 51

Query optimization example

AND is associative and commutative

SLIDE 52

Query optimization example

!= transformed to exclusion on range scan

SLIDE 53

Query optimization example

AND is associative and commutative
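The effect of these rewrites on the example WHERE clause can be sketched as follows: predicates are regrouped per column (possible because AND is associative and commutative), range bounds are merged, and != becomes an exclusion attached to the scan. This is an illustrative model only; `ColumnScan` and `plan_scans` are hypothetical names, not the planner's actual classes.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple

Predicate = Tuple[str, str, object]  # (column, operator, value)


@dataclass
class ColumnScan:
    lower: Optional[Tuple[object, str]] = None  # (value, ">" or ">=")
    upper: Optional[Tuple[object, str]] = None  # (value, "<" or "<=")
    prefix: Optional[str] = None                # from LIKE 'p%'
    exclusions: List[object] = field(default_factory=list)  # from !=


def plan_scans(predicates: Iterable[Predicate]) -> Dict[str, ColumnScan]:
    """Group AND-ed predicates per column and keep the tightest bounds."""
    scans: Dict[str, ColumnScan] = {}
    for column, operator, value in predicates:
        scan = scans.setdefault(column, ColumnScan())
        if operator in (">", ">="):
            # Larger value is tighter; on a tie the strict bound wins.
            if scan.lower is None or (value, operator == ">") > (scan.lower[0], scan.lower[1] == ">"):
                scan.lower = (value, operator)
        elif operator in ("<", "<="):
            # Smaller value is tighter; on a tie the strict bound wins.
            if scan.upper is None or (value, operator == "<=") < (scan.upper[0], scan.upper[1] == "<="):
                scan.upper = (value, operator)
        elif operator == "LIKE":
            scan.prefix = str(value).rstrip("%")  # only prefix patterns are handled here
        elif operator == "!=":
            scan.exclusions.append(value)
        else:
            raise ValueError(f"unsupported operator: {operator}")
    return scans


if __name__ == "__main__":
    where = [("age", "<", 100), ("fname", "LIKE", "p%"), ("fname", "!=", "pa%"), ("age", ">", 21)]
    for column, scan in plan_scans(where).items():
        print(column, scan)
```

With the slide's query, the four predicates collapse into two scans: one range scan on age and one prefix scan on fname that carries an exclusion.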

SLIDE 54

Some benchmarks

SLIDE 55

Hardware specs

13 bare-metal machines

  • 6 CPU HT (12 vcores)
  • 64 GB RAM
  • 4 SSDs in RAID0 for a total of 1.5 TB

Data set

  • 13 billion rows
  • 1 numerical index with 36 distinct values
  • 2 text indexes with 7 distinct values
  • 1 text index with 3 distinct values

SLIDE 56

Benchmark results

Full table scan using co-located Spark (no LIMIT)

Predicate count | Fetched rows | Query time in sec
1               | 36 109 986   | 609
2               | 2 781 492    | 330
3               | 1 044 547    | 372
4               | 360 334      | 116

SLIDE 57

Benchmark results

Full table scan using co-located Spark (no LIMIT)

Predicate count | Fetched rows | Query time in sec
1               | 36 109 986   | 609
2               | 2 781 492    | 330
3               | 1 044 547    | 372
4               | 360 334      | 116
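As a rough reading aid, the figures above can be turned into fetched rows per second. The snippet only divides the two columns of the table; it says nothing about how the time was spent, and the variable names are illustrative.

```python
# (predicate count, fetched rows, query time in seconds), copied from the table
RESULTS = [
    (1, 36_109_986, 609),
    (2, 2_781_492, 330),
    (3, 1_044_547, 372),
    (4, 360_334, 116),
]


def rows_per_second(fetched_rows: int, seconds: int) -> float:
    """Average number of fetched rows per second of query time."""
    return fetched_rows / seconds


THROUGHPUT = {count: rows_per_second(rows, secs) for count, rows, secs in RESULTS}

if __name__ == "__main__":
    for count, value in THROUGHPUT.items():
        print(f"{count} predicate(s): {value:,.0f} rows/s")
```

Query time does not shrink in proportion to the number of fetched rows: the 3-predicate query returns fewer rows than the 2-predicate one but takes longer.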

SLIDE 58

Benchmark results

Beware of disk space usage for full text search !!!

Table albums with ≈ 110 000 records, 6.8 MB data size

SLIDE 59

Take Away

SLIDE 60

SASI vs search engines

SASI vs Solr/ElasticSearch?

  • Cassandra is not a search engine !!! (database = durability)
  • always slower because of 2 passes (SASI index read + original Cassandra data read)
  • no scoring
  • no ordering (ORDER BY)
  • no grouping (GROUP BY) → use Apache Spark for analytics

If you don’t need the above features, SASI is for you!

SLIDE 61

SASI sweet spots

SASI is a relevant choice if

  • you need multi-criteria search and you don't need ordering/grouping/scoring
  • you mostly need 100 to 10 000 rows for your search queries
  • you always know the partition keys of the rows to be searched for (this one applies to the native secondary index too)
  • you want to index static columns (SASI has no penalty since it indexes the whole partition)

SLIDE 62

SASI blind spots

SASI is a poor choice if

  • you have a strict SLA on search latency, for example a requirement of a few milliseconds
  • ordering of the search results is important for you

SLIDE 63

Q & A

SLIDE 64

Thank You

@doanduyhai duy_hai.doan@datastax.com