SASI, Cassandra on the full text search ride
DuyHai DOAN – Apache Cassandra™ Evangelist
@doanduyhai
1. 5 minutes introduction to Apache Cassandra™
2. SASI introduction
3. SASI cluster-wide
4. SASI local read/write path
5. Query planner
6. Some benchmarks
7. Take away
Trademark Policy
From now on …
Cassandra ⩵ Apache Cassandra™
5 minutes introduction to Apache Cassandra™
The tokens
Random hash of #partition → token = hash(#p)
Hash: ]−x, x]
Hash range: 2^64 values
x = 2^64 / 2
[Figure: ring of 8 Cassandra (C*) nodes]
Token ranges
A: ]−x, −3x/4]
B: ]−3x/4, −2x/4]
C: ]−2x/4, −x/4]
D: ]−x/4, 0]
E: ]0, x/4]
F: ]x/4, 2x/4]
G: ]2x/4, 3x/4]
H: ]3x/4, x]
[Figure: ring of 8 nodes, one per token range A to H]
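The mapping from a partition key to one of the eight token ranges can be sketched as follows. This is only an illustration: `token_for` uses an MD5-based stand-in hash rather than Cassandra's real partitioner, and the names `X`, `STEP`, `NODES`, `token_for` and `range_for_token` are hypothetical.

```python
import hashlib

X = 2**64 // 2        # x = 2^64 / 2, as on the slide
STEP = X // 4         # each of the 8 ranges is x/4 wide
NODES = "ABCDEFGH"    # range letters, in ring order


def token_for(partition_key: str) -> int:
    """Stand-in hash (NOT Cassandra's partitioner): maps a key into ]-x, x]."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big", signed=True)  # in [-x, x - 1]
    return X if token == -X else token  # fold -x into the half-open range


def range_for_token(token: int) -> str:
    """Return the letter of the range ]lower, upper] that contains the token."""
    if not -X < token <= X:
        raise ValueError("token outside ]-x, x]")
    # Subtracting 1 makes the upper bound of each range inclusive.
    return NODES[(token + X - 1) // STEP]


if __name__ == "__main__":
    for key in ("user_id1", "user_id2", "user_id3"):
        token = token_for(key)
        print(key, token, range_for_token(token))
```

Because the ranges are half-open on the left, token 0 belongs to D and token 1 to E.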
Distributed tables
[Figure: partitions user_id1 … user_id5 spread over the ring of nodes A to H]
CREATE TABLE users(user_id int, …, PRIMARY KEY(user_id));
Coordinator node
Responsible for handling requests (read/write)
Every node can be coordinator
[Figure: a request sent to the coordinator node; nodes labelled 1, 2, 3]
Q & A
SASI introduction
What is SASI ?
SSTable-Attached Secondary Index: a secondary index implementation that follows the SSTable life-cycle
Who created it ?
Open-source contribution by a team of engineers
Why is it better than native 2nd index ?
Demo
SASI cluster-wide
Distributed index
At the cluster level, SASI works exactly like the native secondary index
[Figure: each node of the ring holds a local index over its own data, e.g. UK → user1, user102, …, user493; US → user54, user483, …, user938; UK → user87, user176, …, user987; UK → user17, user409, …, user787]
Distributed search algorithm
1st round: concurrency factor = 1
Not enough results?
2nd round: concurrency factor = 2
Still not enough results?
3rd round: concurrency factor = 4
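The round-by-round behaviour can be sketched as below. This is a simplified simulation, not Cassandra's coordinator code: the doubling of the concurrency factor (1, 2, 4) is read off the three rounds shown on the slides and is an assumption, and `distributed_search` is a hypothetical name.

```python
def distributed_search(ranges, limit):
    """Query token ranges round by round until `limit` rows are found.

    `ranges` is a list with one entry per token range; each entry is the
    list of rows that the local index of that range returns.
    Returns (rows, ranges_queried_per_round).
    """
    results = []
    rounds = []
    position = 0
    concurrency = 1  # 1st round: concurrency factor = 1
    while position < len(ranges) and len(results) < limit:
        batch = ranges[position:position + concurrency]
        rounds.append(len(batch))
        for rows in batch:
            results.extend(rows)
        position += len(batch)
        concurrency *= 2  # assumption: 1, 2, 4, ... as on the slides
    return results[:limit], rounds


if __name__ == "__main__":
    ring = [["a"], [], ["b"], ["c", "d"], [], ["e"], [], []]
    print(distributed_search(ring, limit=3))    # stops after three rounds
    print(distributed_search(ring, limit=100))  # ends up querying every range
```

With a limit the filter cannot satisfy, the loop only stops once every range has been queried.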
Concurrency factor formula
Caveat 1: non-restrictive filters
Hit all nodes eventually :(
Caveat 1 solution : always use LIMIT
SELECT * FROM … WHERE ... LIMIT 1000
Caveat 2: 1-to-1 index (user_email)
WHERE user_email = 'xxx'
Not found
Still no result
At best 1 user found, at worst 0 users found
Caveat 2 solution: materialized views
For a 1-to-1 index/relationship, use materialized views instead:
CREATE MATERIALIZED VIEW user_by_email AS
  SELECT * FROM users
  WHERE user_id IS NOT NULL AND user_email IS NOT NULL
  PRIMARY KEY (user_email, user_id);
But range queries (<, >, ≤, ≥) are not possible …
Caveat 3: fetch all rows for analytics use-case
[Figure: ring of nodes A to H, coordinator and client]
Caveat 3 solution: use co-located Spark
Local index filtering in Cassandra
Aggregation in Spark
[Figure: local index query on each node of the ring]
SASI local read/write path
SASI Life-cycle: in-memory
[Diagram: write steps]
1. Commit log (Commit log1 … Commit logn)
2. Memory: MemTable Table1, MemTable Table2, …, MemTable TableN
3. Index MemTable1, Index MemTable2, …, Index MemTableN
ACK the client
Local write path data structures
Index mode, data type | Data structure | Usage
PREFIX, text | ConcurrentRadixTree (concurrent-trees library) | name LIKE 'John%'
CONTAINS, text | ConcurrentSuffixTree (concurrent-trees library) | name LIKE '%John%', name LIKE '%ny'
PREFIX, other | JDK ConcurrentSkipListSet | age = 20, age >= 20 AND age <= 30
SPARSE, other | JDK ConcurrentSkipListSet | age = 20, age >= 20 AND age <= 30; suitable for 1-to-N index with N ≤ 5
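The kinds of lookups these in-memory structures serve can be illustrated with a single sorted list. This is a minimal sketch, not the Java classes named in the table: `SortedIndex` and `add_suffixes` are hypothetical, a sorted list with `bisect` stands in for the skip list and the radix tree, and indexing every suffix stands in for the suffix tree.

```python
import bisect


class SortedIndex:
    """Sorted (term, key) pairs supporting equality, range and prefix lookups."""

    def __init__(self):
        self._entries = []

    def add(self, term, key):
        bisect.insort(self._entries, (term, key))

    def range(self, low, high):
        """Keys whose term satisfies low <= term <= high."""
        start = bisect.bisect_left(self._entries, (low,))
        result = []
        for term, key in self._entries[start:]:
            if term > high:
                break
            result.append(key)
        return result

    def prefix(self, prefix):
        """Keys whose (text) term starts with `prefix`, like LIKE 'John%'."""
        start = bisect.bisect_left(self._entries, (prefix,))
        result = []
        for term, key in self._entries[start:]:
            if not term.startswith(prefix):
                break
            result.append(key)
        return result


def add_suffixes(index, term, key):
    """CONTAINS-style indexing: store every suffix, so that LIKE '%ny' and
    LIKE '%John%' both become prefix lookups over the stored suffixes."""
    for start in range(len(term)):
        index.add(term[start:], key)


if __name__ == "__main__":
    ages = SortedIndex()
    for age, user in [(20, "u1"), (25, "u2"), (31, "u3"), (30, "u4")]:
        ages.add(age, user)
    print(ages.range(20, 30))        # age >= 20 AND age <= 30

    contains = SortedIndex()
    add_suffixes(contains, "John", "u1")
    add_suffixes(contains, "Johnny", "u2")
    print(set(contains.prefix("ny")))  # name LIKE '%ny'
```

Storing every suffix makes the CONTAINS index much larger than the PREFIX one, which is the trade-off a suffix structure implies.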
SASI Life-cycle: flush to SSTable
[Diagram: on flush, the MemTables of Table1, Table2 and Table3 are written to SSTable1, SSTable2 and SSTable3; in step 4 the index files OnDiskIndex1, OnDiskIndex2 and OnDiskIndex3 are written with them]
SASI Life-cycle: compaction
SSTable1 + SSTable2 + SSTable3 → New SSTable
OnDiskIndex1 + OnDiskIndex2 + OnDiskIndex3 → New OnDiskIndex
Local write path summary
Index files are built
To avoid OOM, index files are split into chunk of
→ Consequences: SASI has an impact on write bandwidth (CPU & disk I/O)
Local read path
Key Cache → Yes, currently SASI only keeps partition key(s), so on wide partitions it’s not very
OnDiskIndex files
SSTable1: user_id4 FR, user_id1 US, user_id5 FR → OnDiskIndex1: FR, US
SSTable2: user_id3 UK, user_id2 DE → OnDiskIndex2: UK, DE
B+Tree-like data structures
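The one-index-per-SSTable relationship in this example can be sketched as follows. It is only a toy model: a sorted dictionary stands in for the B+Tree-like file, user ids stand in for whatever the real index stores, and `build_on_disk_index` and `search` are hypothetical names.

```python
from collections import defaultdict


def build_on_disk_index(sstable_rows):
    """Build a term -> [user ids] mapping for one SSTable, sorted by term."""
    index = defaultdict(list)
    for user_id, country in sstable_rows:
        index[country].append(user_id)
    return dict(sorted(index.items()))


def search(indexes, country):
    """A local query consults the index of every SSTable and merges the hits."""
    matches = []
    for index in indexes:
        matches.extend(index.get(country, []))
    return matches


sstable1 = [("user_id4", "FR"), ("user_id1", "US"), ("user_id5", "FR")]
sstable2 = [("user_id3", "UK"), ("user_id2", "DE")]
indexes = [build_on_disk_index(sstable1), build_on_disk_index(sstable2)]

if __name__ == "__main__":
    print(indexes)
    print(search(indexes, "FR"))
```

Each index only knows about the rows of its own SSTable, so a query has to visit all of them.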
OnDiskIndex Layout
[Diagram: file layout, in order]
Header Block (4k)
Data Blocks (multiple of 4k)
Pointer Blocks (multiple of 4k)
Meta Data Info: Levels Count, Pointer Block Meta, Data Block Meta
Level Index Offset
Header Block Layout
Field | Size
Descriptor Version | variable
Term Size | short
Min Term | short
Max Term | short
Min Pk | short
Max Pk | short
Index Mode | variable
Has Partial | byte
Data Block layout
Each 4k data block: Terms Count | Offset Array | Term Block | TokenTree Block | Padding
[Diagram: four consecutive data blocks with offset arrays [0, 10, 22, …], [0, 23, 35, …], [0, 17, 34, …], [0, 12, 28, …]]
Pointer Block building
Data level: Data Block1, Data Block2, …, Data BlockN
Pointer level 1: 4k pointer blocks (Pointer Block1, Pointer Block2, …, Pointer BlockN) holding the last term of each data block (LastTerm1, LastTerm2, …, LastTermN)
Pointer level 2: pointer blocks (Pointer BlockN+1, Pointer BlockN+2, …) holding the last term of each lower-level pointer block (LastTermM, LastTermM+1, …, LastTermO)
…
Pointer root level: Root Pointer Block
Binary search using OnDiskIndex files
[Diagram: tree of pointer blocks, from the Root Pointer Block (pointer root level) through pointer levels 3, 2 and 1 down to Data Block1 … Data BlockN (data level)]
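Building pointer levels from the last term of each block, then descending from the root to a data block, can be sketched as below. This is an in-memory illustration under stated assumptions, not the on-disk format: `fanout` replaces the 4k block size, and `build_pointer_levels` and `find_data_block` are hypothetical names.

```python
import bisect


def build_pointer_levels(data_blocks, fanout):
    """Return the pointer levels, lowest level first, root level last.

    `data_blocks` is a list of sorted term lists. Each level is a list of
    pointer blocks; each pointer block is a list of (last_term, child_index).
    """
    if not data_blocks or fanout < 2:
        raise ValueError("need at least one data block and fanout >= 2")
    levels = []
    last_terms = [block[-1] for block in data_blocks]
    while True:
        entries = [(term, child) for child, term in enumerate(last_terms)]
        blocks = [entries[i:i + fanout] for i in range(0, len(entries), fanout)]
        levels.append(blocks)
        if len(blocks) == 1:  # a single block is the root pointer block
            return levels
        last_terms = [block[-1][0] for block in blocks]


def find_data_block(levels, term):
    """Descend from the root; return the index of the data block that may
    contain `term`, or None if `term` is greater than every indexed term."""
    block_index = 0
    for level in reversed(levels):
        block = level[block_index]
        last_terms = [last for last, _ in block]
        position = bisect.bisect_left(last_terms, term)  # first last_term >= term
        if position == len(last_terms):
            return None
        block_index = block[position][1]
    return block_index


if __name__ == "__main__":
    blocks = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"], ["i", "j"]]
    levels = build_pointer_levels(blocks, fanout=2)
    print(len(levels), find_data_block(levels, "e"))
```

Each level is searched with a binary search over the last terms, so the number of blocks read grows with the number of levels rather than with the number of data blocks.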
Term Block Binary Search
[Term1 … Term50 … Term100]: val < Term100 ?
[Term50 … Term75 … Term100]: val > Term50 ?
[Term50 … Term63 … Term75]: val < Term75 ?
…
Term57: val = Term57 ?
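A binary search over a term block addressed through an offset array can be sketched like this. The packing is a simplification chosen for the example (UTF-8 terms concatenated back to back), not the real data block encoding, and `pack_terms`, `term_at` and `find_term` are hypothetical names.

```python
def pack_terms(sorted_terms):
    """Concatenate the terms and record the byte offset where each one starts."""
    offsets = []
    payload = bytearray()
    for term in sorted_terms:
        offsets.append(len(payload))
        payload += term.encode("utf-8")
    return offsets, bytes(payload)


def term_at(offsets, payload, index):
    """Decode the term stored at position `index` of the offset array."""
    end = offsets[index + 1] if index + 1 < len(offsets) else len(payload)
    return payload[offsets[index]:end].decode("utf-8")


def find_term(offsets, payload, value):
    """Binary search; returns the position of `value` or -1 if it is absent."""
    low, high = 0, len(offsets) - 1
    while low <= high:
        middle = (low + high) // 2
        term = term_at(offsets, payload, middle)
        if term == value:
            return middle
        if term < value:
            low = middle + 1
        else:
            high = middle - 1
    return -1


if __name__ == "__main__":
    offsets, payload = pack_terms(["anna", "bob", "carol", "dave", "eve"])
    print(offsets, find_term(offsets, payload, "carol"))
```

The offset array is what makes the search possible: terms have variable length, so without it the middle term could not be located directly.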
Query Planner
Query optimization example
WHERE age < 100 AND fname LIKE 'p%' AND fname != 'pa%' AND age > 21
AND is associative and commutative
!= transformed to exclusion on range scan
AND is associative and commutative
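The effect of these rewrites on the example WHERE clause can be sketched as follows. This is not SASI's actual planner: it only shows how AND-ed predicates can be regrouped per column into one range scan with exclusions. `merge_predicates` and the dictionary layout are hypothetical, and only the operators of the example are handled.

```python
def merge_predicates(predicates):
    """Group AND-ed (column, operator, value) predicates per column.

    Because AND is associative and commutative, predicates on the same
    column can be collected in any order and merged into one scan.
    """
    plan = {}
    for column, operator, value in predicates:
        entry = plan.setdefault(
            column, {"lower": None, "upper": None, "prefix": None, "exclude": []}
        )
        if operator == ">":
            entry["lower"] = value if entry["lower"] is None else max(entry["lower"], value)
        elif operator == "<":
            entry["upper"] = value if entry["upper"] is None else min(entry["upper"], value)
        elif operator == "LIKE":
            entry["prefix"] = value.rstrip("%")  # only prefix patterns like 'p%'
        elif operator == "!=":
            entry["exclude"].append(value)       # exclusion applied during the scan
        else:
            raise ValueError("unsupported operator: " + operator)
    return plan


if __name__ == "__main__":
    where = [
        ("age", "<", 100),
        ("fname", "LIKE", "p%"),
        ("fname", "!=", "pa%"),
        ("age", ">", 21),
    ]
    for column, scan in merge_predicates(where).items():
        print(column, scan)
```

Four predicates collapse into two scans: one range on age and one prefix scan on fname with an exclusion.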
Some benchmarks
Hardware specs
13 bare-metal machines
Data set
Benchmark results
Full table scan using co-located Spark (no LIMIT)
Predicate count | Fetched rows | Query time in sec
1 | 36 109 986 | 609
2 | 2 781 492 | 330
3 | 1 044 547 | 372
4 | 360 334 | 116
Benchmark results
Beware of disk space usage for full text search!
Table albums with ≈ 110 000 records, 6.8 MB data size
Take Away
SASI vs search engines
SASI vs Solr/ElasticSearch ?
If you don’t need the above features, SASI is for you!
SASI sweet spots
SASI is a relevant choice if
native secondary index too)
partition)
SASI blind spots
SASI is a poor choice if
Q & A
@doanduyhai duy_hai.doan@datastax.com