[PPT] - Full-Text Search Explained Philipp Krenn @xeraa Infrastructure

SLIDE 1

Full-Text Search

Explained

Philipp Krenn @xeraa

SLIDE 2

SLIDE 3

SLIDE 4

SLIDE 5

Infrastructure | Developer Advocate

SLIDE 6

ViennaDB
Papers We Love Vienna

SLIDE 7

Who uses

databases?

SLIDE 8

Who uses

search?

SLIDE 9

Databases

vs

Full-text search

SLIDE 10

SLIDE 11

But I can do...

SELECT * FROM my_table WHERE my_text LIKE '%blue%'

SLIDE 12

SLIDE 13

1. Speed

B-Tree
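A rough illustration of the speed argument, as a sketch rather than how any database is implemented: a B-tree keeps values sorted, which helps with exact and prefix lookups but not with LIKE '%blue%', so every row has to be inspected. An inverted index maps each term directly to the rows that contain it. The sample rows and function names are made up for this example.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for my_table.my_text.
rows = {
    1: "the sky is blue",
    2: "red roses",
    3: "blue suede shoes",
}

def like_scan(needle):
    """Emulates LIKE '%needle%': every row has to be inspected."""
    return [row_id for row_id, text in rows.items() if needle in text]

def build_inverted_index(table):
    """Maps each whitespace-separated term to the ids of the rows containing it."""
    index = defaultdict(set)
    for row_id, text in table.items():
        for term in text.split():
            index[term].add(row_id)
    return index

index = build_inverted_index(rows)

scan_hits = like_scan("blue")        # touches all rows
index_hits = sorted(index["blue"])   # a single dictionary lookup
print(scan_hits, index_hits)
```

The scan cost grows with the number of rows, while the lookup cost depends mainly on how many rows contain the term.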

SLIDE 14

SLIDE 15

SLIDE 16

2. Features

Fuzziness, synonyms, scoring,...

SLIDE 17

Storing

SLIDE 18

Indexing

Remove formatting

SLIDE 19

Indexing

Tokenize

SLIDE 20

Indexing

Stop words

SLIDE 21

Indexing

Stemming

http://snowballstem.org

SLIDE 22

Indexing

Synonyms
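The indexing steps above can be sketched as a small pipeline. This is a toy version under loud assumptions: the stop word list, the suffix-stripping "stemmer", and the synonym table are tiny stand-ins invented for the example, not what Snowball, MongoDB, or Elasticsearch actually ship.

```python
import re

# Assumed, deliberately tiny stand-ins for real stop word lists and synonym files.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}
SYNONYMS = {"quick": ["fast"]}

def strip_formatting(text):
    """Drop anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Split on non-alphanumeric characters and lowercase each token."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token):
    """Naive suffix stripping; Snowball applies far more careful rules."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def expand_synonyms(tokens):
    """Emit each token followed by its configured synonyms."""
    out = []
    for t in tokens:
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

def analyze(text):
    tokens = tokenize(strip_formatting(text))
    tokens = remove_stop_words(tokens)
    tokens = [stem(t) for t in tokens]
    return expand_synonyms(tokens)

terms = analyze("<b>The quick foxes</b> are jumping")
print(terms)  # ['quick', 'fast', 'fox', 'jump']
```

The same analysis has to run on the query string, otherwise a search for "foxes" would never meet the indexed term "fox".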

SLIDE 23

MongoDB

SLIDE 24

Before FTS

Regular expressions !

SLIDE 25

Before FTS

Arrays with relevant terms !

https://docs.mongodb.com/manual/tutorial/model-data-for-keyword-search/

SLIDE 26

{
  title: "Moby-Dick",
  author: "Herman Melville",
  published: 1851,
  ISBN: 0451526996,
  topics: [ "whaling", "allegory", "revenge", "American",
            "novel", "nautical", "voyage", "Cape Cod" ]
}

db.volumes.createIndex({ topics: 1 })
db.volumes.findOne({ topics: "voyage" }, { title: 1 })

SLIDE 27

No

Stemming
Fuzziness
Synonyms
Scoring

SLIDE 28

Finally FTS

230+ votes
Created 2009
Resolved 2013

https://jira.mongodb.org/browse/SERVER-380

SLIDE 29

FTS in MongoDB

Beta since 2.4
Stable since 3.0
"80% solution"; for more, Elasticsearch

SLIDE 30

FTS in MongoDB

In Latin alphabets

Case insensitive (default in 3.2)
[A-z] other characters removed (default in 3.2)

SLIDE 31

Indexing

String or array of strings
Optional language or translations
Optional weight if multiple fields indexed

SLIDE 32

$text

Updated version in MongoDB 3.2

{
  $text: {
    $search: "<string>",
    $language: "<string>",
    $caseSensitive: <boolean>,
    $diacriticSensitive: <boolean>
  }
}

SLIDE 33

SLIDE 34

Example Text

These are not the droids you are looking for.

SLIDE 35

Tokenizer

these are not the droids you are looking for

SLIDE 36

Stop Words

droids looking

SLIDE 37

Stemming

droid look

SLIDE 38

> db.starwars.ensureIndex({ quote: "text" })
{
  "createdCollectionAutomatically": true,
  "numIndexesBefore": 1,
  "numIndexesAfter": 2,
  "ok": 1
}

SLIDE 39

> db.starwars.getIndices()
[
  ...
  {
    "v": 1,
    "key": { "_fts": "text", "_ftsx": 1 },
    "name": "quote_text",
    "ns": "starwars.starwars",
    "weights": { "quote": 1 },
    "default_language": "english",
    "language_override": "language",
    "textIndexVersion": 3
  }
]

SLIDE 40

> db.starwars.insert({ quote: "These are not the droids you are looking for." })
Inserted 1 record(s) in 39ms
WriteResult({ "nInserted": 1 })

SLIDE 41

> db.germanStarwars.ensureIndex(
    { "$**": "text" },
    { default_language: "german" }
  )

SLIDE 42

Elasticsearch

SLIDE 43

Apache Lucene Elasticsearch

SLIDE 44

Example Text

These are not the droids you are looking for.

SLIDE 45

html_strip Char Filter

These are not the droids you are looking for.

SLIDE 46

standard Tokenizer

These are not the droids you are looking for

SLIDE 47

lowercase Token Filter

these are not the droids you are looking for

SLIDE 48

stop Token Filter

droids you looking

SLIDE 49

snowball Token Filter

droid you look

SLIDE 50

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": "These are not the droids you are looking for."
}

SLIDE 51

{
  "tokens": [
    {
      "token": "droid",
      "start_offset": 27,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "you",
      "start_offset": 34,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 5
    },
    ...
  ]
}

SLIDE 52

Another Example

Obi-Wan never told you what happened to your father.

SLIDE 53

Another Example

obi wan never told you what happen your father

SLIDE 54

Another Example

No. I am your father.

SLIDE 55

Another Example

i am your father

SLIDE 56

Language Rules

English: Philipp's → philipp
French: l'église → eglis
German: äußerst → ausserst

SLIDE 57

phonetic Token Filter

Plugin

Joe Bloggs → J joe BLKS bloggs

SLIDE 58

Elasticsearch

Index, type, mapping

SLIDE 59

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "word1,synonym",
            "word2,synonym"
          ]
        }
      },

SLIDE 60

      "analyzer": {
        "my_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ]
        }
      }
    }
  },

SLIDE 61

  "mappings": {
    "my_type": {
      "properties": {
        "my_title": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

SLIDE 62

PUT /starwars/quotes/1
{ "quote": "These are not the droids you are looking for." }

PUT /starwars/quotes/2
{ "quote": "Obi-Wan never told you what happened to your father." }

PUT /starwars/quotes/3
{ "quote": "No. I am your father." }

SLIDE 63

Inverted Index

Term     ID 1    ID 2    ID 3
am       0       0       1[2]
droid    1[4]    0       0
father   0       1[9]    1[4]
happen   0       1[6]    0
i        0       0       1[1]
look     1[7]    0       0
never    0       1[2]    0
obi      0       1[0]    0
told     0       1[3]    0
wan      0       1[1]    0
what     0       1[5]    0
you      1[5]    1[4]    0
your     0       1[8]    1[3]
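The table can be reproduced with a short sketch. It assumes a hard-coded stop word subset and a three-entry stem table chosen so the result matches the slide; Elasticsearch derives these through the stop and snowball token filters instead. The bracketed numbers are token positions, and removed stop words still consume a position.

```python
from collections import defaultdict

# Assumed stand-ins for the stop and snowball token filters, limited to these quotes.
STOP_WORDS = {"these", "are", "not", "the", "for", "to", "no"}
STEMS = {"droids": "droid", "looking": "look", "happened": "happen"}

docs = {
    1: "These are not the droids you are looking for.",
    2: "Obi-Wan never told you what happened to your father.",
    3: "No. I am your father.",
}

def analyze(text):
    """Yield (position, term) pairs; stop words are dropped but keep their position."""
    words = text.lower().replace("-", " ").replace(".", " ").split()
    for position, word in enumerate(words):
        if word not in STOP_WORDS:
            yield position, STEMS.get(word, word)

def build_index(documents):
    """Build term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in analyze(text):
            index[term].setdefault(doc_id, []).append(position)
    return index

index = build_index(docs)
for term in sorted(index):
    print(term, index[term])
```

Keeping positions alongside document ids is what makes phrase queries possible: two terms form a phrase only if their positions are adjacent.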

SLIDE 64

Searching

SLIDE 65

MongoDB

SLIDE 66

> db.starwars.insert({ quote: "Obi-Wan never told you what happened to your father." })
> db.starwars.insert({ quote: "No. I am your father." })

SLIDE 67

> db.starwars.find({ $text: { $search: "droid" }})
{
  "_id": ObjectId("57f2d54de814412463c3adef"),
  "quote": "These are not the droids you are looking for."
}
Fetched 1 record(s) in 35ms

SLIDE 68

> db.starwars.find({ $text: { $search: "father" }})
{
  "_id": ObjectId("57f2d56fe814412463c3adf0"),
  "quote": "Obi-Wan never told you what happened to your father."
}
{
  "_id": ObjectId("57f2d581e814412463c3adf1"),
  "quote": "No. I am your father."
}
Fetched 2 record(s) in 3ms

SLIDE 69

> db.starwars.find({ $text: { $search: "droid" }}).explain()
{
  "queryPlanner": {
    ...
    "$text": {
      "$search": "droid",
      "$language": "english",
      "$caseSensitive": false,
      "$diacriticSensitive": false
    }
  },

SLIDE 70

  "winningPlan": {
    "stage": "TEXT",
    "indexPrefix": { },
    "indexName": "quote_text",
    "parsedTextQuery": {
      "terms": [ "droid" ],
      "negatedTerms": [ ],
      "phrases": [ ],
      "negatedPhrases": [ ]
    },
    ...

SLIDE 71

> db.starwars.find({ $text: { $search: "father -obi" }})
{
  "_id": ObjectId("57f2d581e814412463c3adf1"),
  "quote": "No. I am your father."
}
Fetched 1 record(s) in 4ms

SLIDE 72

> db.starwars.find({ $text: { $search: "father -obi" }}).explain()
...
"parsedTextQuery": {
  "terms": [ "father" ],
  "negatedTerms": [ "obi" ],
  "phrases": [ ],
  "negatedPhrases": [ ]
},
...

SLIDE 73

Queries

// OR
> db.starwars.find({ $text: { $search: "look droid" } })

// AND but without input stemming
> db.starwars.find({ $text: { $search: "\"look\" \"droid\"" } })

// Negation
> db.starwars.find({ $text: { $search: "look -droid" } })

// Phrase
> db.starwars.find({ $text: { $search: "\"look droid\"" } })

// Translation
> db.starwars.find({ $text: { $search: "suchen", $language: "de" } })

SLIDE 74

Elasticsearch

SLIDE 75

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": "droid"
    }
  }
}

SLIDE 76

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.39556286,
    "hits": [
      {
        "_index": "starwars",
        "_type": "quotes",
        "_id": "1",
        "_score": 0.39556286,
        "_source": {
          "quote": "These are not the droids you are looking for."
        }
      }
    ]
  }
}

SLIDE 77

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "van",
        "fuzziness": "AUTO"
      }
    }
  }
}

SLIDE 78

{
  "took": 14,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.18155496,
    "hits": [
      {
        "_index": "starwars",
        "_type": "quotes",
        "_id": "2",
        "_score": 0.18155496,
        "_source": {
          "quote": "Obi-Wan never told you what happened to your father."
        }
      }
    ]
  }
}

SLIDE 79

SELECT * FROM starwars
WHERE quote LIKE '_an'
   OR quote LIKE 'V_n'
   OR quote LIKE 'Va_'
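Fuzzy matching is usually described in terms of edit distance rather than wildcard patterns. The sketch below uses plain Levenshtein distance over a hand-picked list of indexed terms; Elasticsearch by default also counts a transposition as a single edit, and with "fuzziness": "AUTO" it allows one edit for terms of three to five characters. The term list and the function name are assumptions for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

# A few terms from the second quote after analysis.
indexed_terms = ["obi", "wan", "never", "told", "father"]
matches = [t for t in indexed_terms if edit_distance("van", t) <= 1]
print(matches)  # ['wan']
```

This is why the query "van" finds the Obi-Wan quote: "van" and "wan" differ by a single substitution.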

SLIDE 80

Scoring

SLIDE 81

MongoDB

SLIDE 82

> db.starwars.find({ $text: { $search: "droid" }}, { score: { $meta: "textScore" } })
{
  "_id": ObjectId("57f2d54de814412463c3adef"),
  "quote": "These are not the droids you are looking for.",
  "score": 0.75
}
Fetched 1 record(s) in 14ms

SLIDE 83

One Term

https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/fts_spec.cpp#L219

double coeff = (0.5 * data.count / numTokens) + 0.5;

data.count: matches
numTokens: stemmed words

SLIDE 84

Search for droid

"These are not the droids you are looking for."

droid look == 1 match, 2 tokens

coeff: (0.5 × 1 / 2) + 0.5 = 0.75

SLIDE 85

Search for father

"No. I am your father."

father == 1 match, 1 token

coeff: (0.5 × 1 / 1) + 0.5 = 1.0

SLIDE 86

Search for father

"Obi-Wan never told you what happened to your father."

obi wan never told happen father ==

1 match, 6 tokens

coeff: (0.5 × 1 / 6) + 0.5 ≈ 0.583

SLIDE 87

> db.starwars.find({ $text: { $search: "obi-wan" }}, { score: { $meta: "textScore" } })
{
  "_id": ObjectId("57f2d56fe814412463c3adf0"),
  "quote": "Obi-Wan never told you what happened to your father.",
  "score": 1.1666666666666667
}
Fetched 1 record(s) in 6ms

SLIDE 88

Multiple Terms

https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/fts_spec.cpp#L228

score += (weight * data.freq * coeff * adjustment);

weight: method parameter
data.freq, adjustment: 1

SLIDE 89

Search for obi-wan

obi wan never told happen father ==

1 match, 6 tokens

coeff: (0.5 × 1 / 6) + 0.5 ≈ 0.583

SLIDE 90

Search for obi-wan

obi wan never told happen father ==

1 match, 6 tokens

coeff: (0.5 × 1 / 6) + 0.5 ≈ 0.583

SLIDE 91

Search for obi-wan

score: 1 × 1 × 0.583 × 1 ≈ 0.583 per term

Sum: 0.583 + 0.583 ≈ 1.167
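The arithmetic of the two quoted fts_spec.cpp lines can be replayed in a few lines. This sketch takes weight, data.freq, and adjustment as 1, as stated on the slides, and uses the token counts from the examples; the function names are made up here and this is not MongoDB's actual code path.

```python
def coeff(matches, num_tokens):
    """Per-term coefficient: (0.5 * data.count / numTokens) + 0.5."""
    return (0.5 * matches / num_tokens) + 0.5

def term_score(matches, num_tokens, weight=1.0, freq=1.0, adjustment=1.0):
    """One term's contribution: weight * data.freq * coeff * adjustment."""
    return weight * freq * coeff(matches, num_tokens) * adjustment

droid = term_score(1, 2)          # "droid look": 1 match, 2 tokens
father_short = term_score(1, 1)   # "father": 1 match, 1 token
father_long = term_score(1, 6)    # 1 match, 6 tokens
obi_wan = term_score(1, 6) + term_score(1, 6)  # two query terms, summed

print(droid, father_short, round(father_long, 4), round(obi_wan, 4))
```

The results line up with the textScore values returned by the shell: 0.75 for droid and roughly 1.1667 for obi-wan. The coefficient favors shorter documents, since one match in a one-token text scores 1.0 while one match among six tokens scores about 0.583.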

SLIDE 92

Elasticsearch

SLIDE 93

Term Frequency / Inverse Document Frequency (TF/IDF)

Search one term

SLIDE 94

BM25

https://speakerdeck.com/elastic/improved-text-scoring-with-bm25

SLIDE 95

Term Frequency

SLIDE 96

SLIDE 97

Inverse Document Frequency

SLIDE 98

SLIDE 99

Field-Length Norm

SLIDE 100

Putting it Together

score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
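One summand of this formula can be sketched as follows, assuming the definitions of Lucene's classic (pre-BM25) similarity: tf is the square root of the term frequency, idf is 1 + ln(numDocs / (docFreq + 1)), and the field-length norm is 1 / sqrt(number of terms). The sketch leaves out queryNorm and coord, ignores Lucene's lossy storage of the norm, and pretends all three quotes share one shard; the function names are made up.

```python
import math

def tf(freq):
    """Term frequency: more occurrences help, with diminishing returns."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_length_norm(num_terms):
    """Field-length norm: a match in a shorter field weighs more."""
    return 1.0 / math.sqrt(num_terms)

def term_weight(freq, num_docs, doc_freq, num_terms, boost=1.0):
    """One summand: tf · idf² · boost · norm."""
    return tf(freq) * idf(num_docs, doc_freq) ** 2 * boost * field_length_norm(num_terms)

# "father" occurs once in quote 3, which has 4 indexed terms; 2 of the 3 quotes contain it.
w = term_weight(freq=1, num_docs=3, doc_freq=2, num_terms=4)
print(w)
```

With these inputs idf comes out as exactly 1, so the weight is driven entirely by the field-length norm of 0.5.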

SLIDE 101

POST /starwars/_search?explain
{
  "query": {
    "match": {
      "quote": "father"
    }
  }
}

SLIDE 102

"_explanation": {
  "value": 0.2876821,
  "description": "weight(quote:father in 0) [PerFieldSimilarity], result of:",
  "details": [
    {
      "value": 0.2876821,
      "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
      "details": [
        {
          "value": 0.2876821,
          "description": "idf(docFreq=1, docCount=1)",
          "details": []
        },
        ...

SLIDE 103

Score

0.2876821 vs 0.27233246
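The value 0.2876821 in the explanation output is consistent with the BM25 formulas as Lucene implemented them at the time, sketched here with the default parameters k1 = 1.2 and b = 0.75. The explanation reports docFreq=1 and docCount=1 because term statistics are collected per shard by default. The field length of 4 is an assumption; only its ratio to the average field length matters, and with a single document on the shard that ratio is 1.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))"""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """Saturating term frequency, normalized by relative field length."""
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * field_length / avg_field_length))

idf = bm25_idf(doc_freq=1, doc_count=1)
tf_norm = bm25_tf_norm(freq=1.0, field_length=4, avg_field_length=4)
score = idf * tf_norm

print(round(idf, 7), tf_norm, round(score, 7))
```

Because the term frequency part is 1 when a single occurrence sits in a field of average length, the score equals the idf, which is ln(4/3).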

SLIDE 104

Vector Space Model

Search multiple terms

SLIDE 105

Score each term
Vector
Calculate angle

SLIDE 106

Search your obi
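The "calculate angle" step is usually expressed as cosine similarity between the query vector and each document vector, with one axis per query term. In this sketch the axes are "your" and "obi", and the weights are made-up placeholders rather than real term scores.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Axes: ("your", "obi"). Placeholder weights, not real scores.
query = [1.0, 1.0]
doc_with_both = [1.0, 1.0]   # a quote containing "your" and "obi"
doc_with_one = [1.0, 0.0]    # a quote containing only "your"

both = cosine_similarity(query, doc_with_both)
one = cosine_similarity(query, doc_with_one)
print(round(both, 4), round(one, 4))
```

The document that matches both terms points in the same direction as the query, so it ranks above the document that matches only one.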

SLIDE 107

SLIDE 108

Conclusion

SLIDE 109

Indexing

Formatting
Tokenize
Lowercase, stop words, stemming
Synonyms

SLIDE 110

Elasticsearch Scoring

Term Frequency
Inverse Document Frequency
Field-Length Norm

SLIDE 111

Elasticsearch Scoring

Vector Space Model

SLIDE 112

MongoDB Limitations

B-tree vs inverted index
Simple scoring

SLIDE 113

Elasticsearch Features

Fuzziness
Synonyms
Suggestions & highlighting
Documentation

SLIDE 114

80 - 20 solution or 20 - 80 solution?

SLIDE 115

More

https://www.elastic.co/training
Amsterdam: Nov 21-24

SLIDE 116

Thanks!

Questions?

Philipp Krenn @xeraa
PS: Stickers

SLIDE 117

Image Credit

→ Schnitzel https://flic.kr/p/9m27wm
→ Architecture https://flic.kr/p/6dwCAe
→ Conchita https://flic.kr/p/nBqSHT
→ Black and grey http://hdimagelib.com/zedge+quote+wallpapers