Full-Text Search Explained, by Philipp Krenn (@xeraa)



slide-1
SLIDE 1

Full-Text Search

Explained

Philipp Krenn @xeraa

slide-2
SLIDE 2
slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

Infrastructure | Developer Advocate

slide-6
SLIDE 6

ViennaDB
Papers We Love Vienna

slide-7
SLIDE 7

Who uses

databases?

slide-8
SLIDE 8

Who uses

search?

slide-9
SLIDE 9

Databases

vs

Full-text search

slide-10
SLIDE 10
slide-11
SLIDE 11

But I can do...

SELECT * FROM my_table WHERE my_text LIKE '%blue%'

slide-12
SLIDE 12
slide-13
SLIDE 13
1. Speed

B-Tree
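A minimal sketch of why a sorted (B-tree style) index helps with a prefix match but not with LIKE '%blue%'. The column values are hypothetical, and a sorted Python list stands in for the B-tree.

```python
from bisect import bisect_left

# Hypothetical column values, kept sorted the way a B-tree keeps its keys.
values = sorted(["blue moon", "blueberry", "deep blue sea", "green", "red"])

def prefix_lookup(sorted_values, prefix):
    """LIKE 'prefix%': binary-search to the first candidate, then read forward."""
    start = bisect_left(sorted_values, prefix)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order guarantees there is no later match
        matches.append(value)
    return matches

def substring_scan(sorted_values, needle):
    """LIKE '%needle%': the sort order does not help, every value is inspected."""
    return [value for value in sorted_values if needle in value]

print(prefix_lookup(values, "blue"))   # ['blue moon', 'blueberry']
print(substring_scan(values, "blue"))  # ['blue moon', 'blueberry', 'deep blue sea']
```

The prefix lookup stops as soon as the sorted order rules out further matches; the substring scan always touches every row.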

slide-14
SLIDE 14
slide-15
SLIDE 15
slide-16
SLIDE 16
2. Features

Fuzziness, synonyms, scoring,...

slide-17
SLIDE 17

Storing

slide-18
SLIDE 18

Indexing

Remove formatting

slide-19
SLIDE 19

Indexing

Tokenize

slide-20
SLIDE 20

Indexing

Stop words

slide-21
SLIDE 21

Indexing

Stemming

http://snowballstem.org

slide-22
SLIDE 22

Indexing

Synonyms
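The indexing steps above (remove formatting, tokenize, stop words, stemming, synonyms) can be sketched as a small pipeline. Everything here is illustrative: the stop word list and synonym map are hypothetical, and the stemmer is a naive suffix stripper, not the Snowball algorithm.

```python
import re

STOP_WORDS = {"are", "for", "not", "the", "these", "you"}  # tiny illustrative list
SYNONYMS = {"droid": ["robot"]}                            # hypothetical synonym map

def strip_formatting(text):
    """Remove HTML-like tags."""
    return re.sub(r"<[^>]+>", "", text)

def tokenize(text):
    """Split into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def stem(token):
    """Naive suffix stripping; a stand-in for a real stemmer such as Snowball."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def expand_synonyms(tokens):
    """Emit each token followed by its configured synonyms."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        expanded.extend(SYNONYMS.get(token, []))
    return expanded

def analyze(text):
    tokens = tokenize(strip_formatting(text))
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [stem(t) for t in tokens]
    return expand_synonyms(tokens)

print(analyze("These are <em>not</em> the droids you are looking for."))
# ['droid', 'robot', 'look']
```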

slide-23
SLIDE 23

MongoDB

slide-24
SLIDE 24

Before FTS

Regular expressions

slide-25
SLIDE 25

Before FTS

Arrays with relevant terms

https://docs.mongodb.com/manual/tutorial/model-data-for-keyword-search/

slide-26
SLIDE 26

{
  title: "Moby-Dick",
  author: "Herman Melville",
  published: 1851,
  ISBN: 0451526996,
  topics: [ "whaling", "allegory", "revenge", "American",
            "novel", "nautical", "voyage", "Cape Cod" ]
}

db.volumes.createIndex({ topics: 1 })

db.volumes.findOne({ topics: "voyage" }, { title: 1 })

slide-27
SLIDE 27

No

Stemming
Fuzziness
Synonyms
Scoring
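A rough sketch of what the keyword-array approach amounts to: an exact-match lookup from topic to document. The dictionary is a hypothetical stand-in for the index on topics, not MongoDB's implementation.

```python
from collections import defaultdict

volumes = [
    {
        "title": "Moby-Dick",
        "topics": ["whaling", "allegory", "revenge", "American",
                   "novel", "nautical", "voyage", "Cape Cod"],
    },
]

def build_keyword_index(documents):
    """Map every topic string, exactly as stored, to the titles that carry it."""
    index = defaultdict(list)
    for doc in documents:
        for topic in doc["topics"]:
            index[topic].append(doc["title"])
    return index

index = build_keyword_index(volumes)

print(index.get("voyage", []))   # ['Moby-Dick']  exact keyword
print(index.get("voyages", []))  # []             no stemming
print(index.get("Voyage", []))   # []             no case folding
```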

slide-28
SLIDE 28

Finally FTS

230+ votes
Created 2009
Resolved 2013

https://jira.mongodb.org/browse/SERVER-380

slide-29
SLIDE 29

FTS in MongoDB

Beta since 2.4
Stable since 3.0
"80% solution" (for more, use Elasticsearch)

slide-30
SLIDE 30

FTS in MongoDB

In Latin alphabets

Case insensitive (default in 3.2)
[A-z] other characters removed (default in 3.2)

slide-31
SLIDE 31

Indexing

String or array of strings
Optional language or translations
Optional weight if multiple fields indexed

slide-32
SLIDE 32

$text

Updated version in MongoDB 3.2

{
  $text: {
    $search: "<string>",
    $language: "<string>",
    $caseSensitive: <boolean>,
    $diacriticSensitive: <boolean>
  }
}

slide-33
SLIDE 33
slide-34
SLIDE 34

Example Text

These are not the droids you are looking for.

slide-35
SLIDE 35

Tokenizer

these are not the droids you are looking for

slide-36
SLIDE 36

Stop Words

droids looking

slide-37
SLIDE 37

Stemming

droid look

slide-38
SLIDE 38

> db.starwars.ensureIndex({ quote: "text" })
{
  "createdCollectionAutomatically": true,
  "numIndexesBefore": 1,
  "numIndexesAfter": 2,
  "ok": 1
}

slide-39
SLIDE 39

> db.starwars.getIndices()
[
  ...
  {
    "v": 1,
    "key": { "_fts": "text", "_ftsx": 1 },
    "name": "quote_text",
    "ns": "starwars.starwars",
    "weights": { "quote": 1 },
    "default_language": "english",
    "language_override": "language",
    "textIndexVersion": 3
  }
]

slide-40
SLIDE 40

> db.starwars.insert({
    quote: "These are not the droids you are looking for."
  })
Inserted 1 record(s) in 39ms
WriteResult({ "nInserted": 1 })

slide-41
SLIDE 41

> db.germanStarwars.ensureIndex(
    { "$**": "text" },
    { default_language: "german" }
  )

slide-42
SLIDE 42

Elasticsearch

slide-43
SLIDE 43

Apache Lucene
Elasticsearch

slide-44
SLIDE 44

Example Text

These are <em>not</em> the droids you are looking for.

slide-45
SLIDE 45

html_strip Char Filter

These are not the droids you are looking for.

slide-46
SLIDE 46

standard Tokenizer

These are not the droids you are looking for

slide-47
SLIDE 47

lowercase Token Filter

these are not the droids you are looking for

slide-48
SLIDE 48

stop Token Filter

droids you looking

slide-49
SLIDE 49

snowball Token Filter

droid you look

slide-50
SLIDE 50

GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

slide-51
SLIDE 51

{
  "tokens": [
    {
      "token": "droid",
      "start_offset": 27,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "you",
      "start_offset": 34,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 5
    },
    ...
  ]
}
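The offsets and positions in this response refer to the original text: markup and stop words are dropped, but the surviving tokens keep their character offsets and their position in the token stream. A rough sketch of that bookkeeping follows. The stop word set is assumed to mirror Lucene's default English list, and the STEMS dictionary is a hypothetical stand-in for the snowball filter.

```python
import re

# Assumed to mirror Lucene's default English stop word list.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}
# Hypothetical lookup table standing in for the snowball stemmer.
STEMS = {"droids": "droid", "looking": "look"}

def analyze_with_offsets(text):
    tokens = []
    position = 0
    for match in re.finditer(r"<[^>]+>|\w+", text):
        if match.group().startswith("<"):
            continue  # markup is skipped, offsets still refer to the original text
        word = match.group().lower()
        if word not in STOP_WORDS:
            tokens.append({
                "token": STEMS.get(word, word),
                "start_offset": match.start(),
                "end_offset": match.end(),
                "position": position,
            })
        position += 1  # removed stop words still consume a position
    return tokens

for token in analyze_with_offsets("These are <em>not</em> the droids you are looking for."):
    print(token)
```

For the example sentence this yields droid (27 to 33, position 4), you (34 to 37, position 5) and look (42 to 49, position 7).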

slide-52
SLIDE 52

Another Example

Obi-Wan never told you what happened to your father.

slide-53
SLIDE 53

Another Example

obi wan never told you what happen your father

slide-54
SLIDE 54

Another Example

<b>No</b>. I am your father.

slide-55
SLIDE 55

Another Example

i am your father

slide-56
SLIDE 56

Language Rules

English: Philipp's → philipp
French: l'église → eglis
German: äußerst → ausserst

slide-57
SLIDE 57

phonetic Token Filter

Plugin

Joe Bloggs → J joe BLKS bloggs

slide-58
SLIDE 58

Elasticsearch

Index, type, mapping

slide-59
SLIDE 59

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "word1,synonym",
            "word2,synonym"
          ]
        }
      },

slide-60
SLIDE 60

      "analyzer": {
        "my_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ]
        }
      }
    }
  },

slide-61
SLIDE 61

  "mappings": {
    "my_type": {
      "properties": {
        "my_title": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

slide-62
SLIDE 62

PUT /starwars/quotes/1
{
  "quote": "These are <em>not</em> the droids you are looking for."
}

PUT /starwars/quotes/2
{
  "quote": "Obi-Wan never told you what happened to your father."
}

PUT /starwars/quotes/3
{
  "quote": "<b>No</b>. I am your father."
}

slide-63
SLIDE 63

Inverted Index

Term     ID 1    ID 2    ID 3
am       0       0       1[2]
droid    1[4]    0       0
father   0       1[9]    1[4]
happen   0       1[6]    0
i        0       0       1[1]
look     1[7]    0       0
never    0       1[2]    0
obi      0       1[0]    0
told     0       1[3]    0
wan      0       1[1]    0
what     0       1[5]    0
you      1[5]    1[4]    0
your     0       1[8]    1[3]
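A minimal sketch of how such an inverted index with positions could be built from the three quotes. The stop word subset and the STEMS lookup are hypothetical stand-ins for the stop and snowball filters, chosen only to cover these sentences.

```python
import re
from collections import defaultdict

STOP_WORDS = {"are", "for", "no", "not", "the", "these", "to"}        # subset, enough for these quotes
STEMS = {"droids": "droid", "looking": "look", "happened": "happen"}  # stand-in for a stemmer

quotes = {
    1: "These are <em>not</em> the droids you are looking for.",
    2: "Obi-Wan never told you what happened to your father.",
    3: "<b>No</b>. I am your father.",
}

def build_inverted_index(documents):
    """Return {term: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        words = re.findall(r"\w+", re.sub(r"<[^>]+>", " ", text).lower())
        for position, word in enumerate(words):
            if word in STOP_WORDS:
                continue  # dropped, but its position stays reserved
            index[STEMS.get(word, word)].setdefault(doc_id, []).append(position)
    return index

index = build_inverted_index(quotes)
print(index["father"])  # {2: [9], 3: [4]}
print(index["you"])     # {1: [5], 2: [4]}
```

A search for a term is then a single dictionary lookup instead of a scan over every document.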

slide-64
SLIDE 64

Searching

slide-65
SLIDE 65

MongoDB

slide-66
SLIDE 66

> db.starwars.insert({
    quote: "Obi-Wan never told you what happened to your father."
  })
> db.starwars.insert({
    quote: "No. I am your father."
  })

slide-67
SLIDE 67

> db.starwars.find({ $text: { $search: "droid" }})
{
  "_id": ObjectId("57f2d54de814412463c3adef"),
  "quote": "These are not the droids you are looking for."
}
Fetched 1 record(s) in 35ms

slide-68
SLIDE 68

> db.starwars.find({ $text: { $search: "father" }})
{
  "_id": ObjectId("57f2d56fe814412463c3adf0"),
  "quote": "Obi-Wan never told you what happened to your father."
}
{
  "_id": ObjectId("57f2d581e814412463c3adf1"),
  "quote": "No. I am your father."
}
Fetched 2 record(s) in 3ms

slide-69
SLIDE 69

> db.starwars.find({ $text: { $search: "droid" }}).explain()
{
  "queryPlanner": {
    ...
      "$text": {
        "$search": "droid",
        "$language": "english",
        "$caseSensitive": false,
        "$diacriticSensitive": false
      }
    },

slide-70
SLIDE 70

    "winningPlan": {
      "stage": "TEXT",
      "indexPrefix": { },
      "indexName": "quote_text",
      "parsedTextQuery": {
        "terms": [ "droid" ],
        "negatedTerms": [ ],
        "phrases": [ ],
        "negatedPhrases": [ ]
      },
      ...

slide-71
SLIDE 71

> db.starwars.find({ $text: { $search: "father -obi" }})
{
  "_id": ObjectId("57f2d581e814412463c3adf1"),
  "quote": "No. I am your father."
}
Fetched 1 record(s) in 4ms

slide-72
SLIDE 72

> db.starwars.find({ $text: { $search: "father -obi" }}).explain()
...
"parsedTextQuery": {
  "terms": [ "father" ],
  "negatedTerms": [ "obi" ],
  "phrases": [ ],
  "negatedPhrases": [ ]
},
...

slide-73
SLIDE 73

Queries

// OR
> db.starwars.find({ $text: { $search: "look droid" } })

// AND but without input stemming
> db.starwars.find({ $text: { $search: "\"look\" \"droid\"" } })

// Negation
> db.starwars.find({ $text: { $search: "look -droid" } })

// Phrase
> db.starwars.find({ $text: { $search: "\"look droid\"" } })

// Translation
> db.starwars.find({ $text: { $search: "suchen", $language: "de" } })
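A simplified sketch of how a $search string like the ones above could be split into the four buckets shown in parsedTextQuery. This is an illustration only: it does no stemming or stop word removal, and MongoDB's actual parser may treat phrase words and edge cases differently.

```python
import re

def parse_search(search):
    """Split a search string into terms, negated terms, phrases and negated phrases."""
    parsed = {"terms": [], "negatedTerms": [], "phrases": [], "negatedPhrases": []}
    # Each match is an optional "-", then either a quoted phrase or a bare word.
    for negated, phrase, word in re.findall(r'(-?)(?:"([^"]*)"|(\S+))', search):
        if phrase:
            parsed["negatedPhrases" if negated else "phrases"].append(phrase.lower())
        elif word:
            parsed["negatedTerms" if negated else "terms"].append(word.lower())
    return parsed

print(parse_search("father -obi"))
# {'terms': ['father'], 'negatedTerms': ['obi'], 'phrases': [], 'negatedPhrases': []}
print(parse_search('"look droid"'))
# {'terms': [], 'negatedTerms': [], 'phrases': ['look droid'], 'negatedPhrases': []}
```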

slide-74
SLIDE 74

Elasticsearch

slide-75
SLIDE 75

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": "droid"
    }
  }
}

slide-76
SLIDE 76

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.39556286,
    "hits": [
      {
        "_index": "starwars",
        "_type": "quotes",
        "_id": "1",
        "_score": 0.39556286,
        "_source": {
          "quote": "These are <em>not</em> the droids you are looking for."
        }
      }
    ]
  }
}

slide-77
SLIDE 77

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "van",
        "fuzziness": "AUTO"
      }
    }
  }
}

slide-78
SLIDE 78

{
  "took": 14,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.18155496,
    "hits": [
      {
        "_index": "starwars",
        "_type": "quotes",
        "_id": "2",
        "_score": 0.18155496,
        "_source": {
          "quote": "Obi-Wan never told you what happened to your father."
        }
      }
    ]
  }
}

slide-79
SLIDE 79

SELECT * FROM starwars
WHERE quote LIKE "?an"
   OR quote LIKE "V?n"
   OR quote LIKE "Va?"
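The fuzzy match of "van" against "wan" comes down to edit distance. A minimal sketch follows, using plain Levenshtein distance (Elasticsearch may additionally count a transposition as a single edit) and the documented AUTO thresholds: exact match for terms of up to 2 characters, one edit for 3 to 5 characters, two edits beyond that. The term list is taken from the inverted index slide.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def auto_fuzziness(term):
    """Allowed number of edits for a term under fuzziness AUTO."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def fuzzy_matches(query, indexed_terms):
    allowed = auto_fuzziness(query)
    return [term for term in indexed_terms if edit_distance(query, term) <= allowed]

terms = ["obi", "wan", "never", "told", "you", "what", "happen", "your", "father"]
print(fuzzy_matches("van", terms))  # ['wan']
```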

slide-80
SLIDE 80

Scoring

slide-81
SLIDE 81

MongoDB

slide-82
SLIDE 82

> db.starwars.find({ $text: { $search: "droid" }}, {score: {$meta: "textScore"}})
{
  "_id": ObjectId("57f2d54de814412463c3adef"),
  "quote": "These are not the droids you are looking for.",
  "score": 0.75
}
Fetched 1 record(s) in 14ms

slide-83
SLIDE 83

One Term

https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/fts_spec.cpp#L219

double coeff = (0.5 * data.count / numTokens) + 0.5;

data.count: matches
numTokens: stemmed words

slide-84
SLIDE 84

Search for droid

"These are not the droids you are looking for."

droid look == 1 match, 2 tokens

coeff: 0.5 * 1 / 2 + 0.5 = 0.75

slide-85
SLIDE 85

Search for father

"No. I am your father."

father == 1 match, 1 token

coeff: 0.5 * 1 / 1 + 0.5 = 1.0

slide-86
SLIDE 86

Search for father

"Obi-Wan never told you what happened to your father."

obi wan never told happen father == 1 match, 6 tokens

coeff: 0.5 * 1 / 6 + 0.5 ≈ 0.5833

slide-87
SLIDE 87

> db.starwars.find({ $text: { $search: "obi-wan" }}, {score: {$meta: "textScore"}})
{
  "_id": ObjectId("57f2d56fe814412463c3adf0"),
  "quote": "Obi-Wan never told you what happened to your father.",
  "score": 1.1666666666666667
}
Fetched 1 record(s) in 6ms

slide-88
SLIDE 88

Multiple Terms

https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/fts_spec.cpp#L228

score += (weight * data.freq * coeff * adjustment);

weight: method parameter
data.freq, adjustment: 1

slide-89
SLIDE 89

Search for obi-wan

obi wan never told happen father == 1 match, 6 tokens

coeff: 0.5 * 1 / 6 + 0.5 ≈ 0.5833

slide-90
SLIDE 90

Search for obi-wan

obi wan never told happen father == 1 match, 6 tokens

coeff: 0.5 * 1 / 6 + 0.5 ≈ 0.5833

slide-91
SLIDE 91

Search for obi-wan

score: 1 * 1 * 0.5833 * 1 ≈ 0.5833 (for each of obi and wan)

Sum: 0.5833 + 0.5833 ≈ 1.1667
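The two quoted lines from fts_spec.cpp are enough to recompute the scores shown in the shell output. A minimal sketch, assuming weight, data.freq and adjustment are all 1 as stated on the slides; the function names are hypothetical.

```python
import math

def coeff(count, num_tokens):
    """double coeff = (0.5 * data.count / numTokens) + 0.5;"""
    return (0.5 * count / num_tokens) + 0.5

def term_score(count, num_tokens, weight=1.0, freq=1.0, adjustment=1.0):
    """score += (weight * data.freq * coeff * adjustment);"""
    return weight * freq * coeff(count, num_tokens) * adjustment

# "droid" in "droid look": 1 match, 2 tokens
print(term_score(1, 2))                     # 0.75
# "father" in "father": 1 match, 1 token
print(term_score(1, 1))                     # 1.0
# "obi-wan" is two terms, each 1 match in 6 tokens; the term scores are summed
print(term_score(1, 6) + term_score(1, 6))  # about 1.1667
```

The results agree with the textScore values 0.75 and 1.1666666666666667 returned by the queries.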

slide-92
SLIDE 92

Elasticsearch

slide-93
SLIDE 93

Term Frequency / Inverse Document Frequency (TF/IDF)

Search one term

slide-94
SLIDE 94

BM25

https://speakerdeck.com/elastic/improved-text-scoring-with-bm25

slide-95
SLIDE 95

Term Frequency

slide-96
SLIDE 96
slide-97
SLIDE 97

Inverse Document Frequency

slide-98
SLIDE 98
slide-99
SLIDE 99

Field-Length Norm

slide-100
SLIDE 100

Putting it Together

score(q,d) = queryNorm(q) · coord(q,d) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
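A minimal sketch of the per-term factors inside the sum, using the formulas commonly documented for Lucene's classic TF/IDF similarity: tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), norm = 1 / sqrt(numTerms). These formulas are an assumption here, not taken from the slides; queryNorm and coord are left out, and the document statistics are hypothetical.

```python
import math

def tf(freq):
    """Term frequency: more occurrences in the field, higher weight."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Inverse document frequency: rarer terms weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def field_length_norm(num_terms):
    """Field-length norm: a match in a short field weighs more."""
    return 1 / math.sqrt(num_terms)

def term_weight(freq, num_docs, doc_freq, num_terms, boost=1.0):
    """One summand of the formula above: tf · idf² · boost · norm."""
    return tf(freq) * idf(num_docs, doc_freq) ** 2 * boost * field_length_norm(num_terms)

# Hypothetical statistics: 3 documents, "father" occurs in 2 of them,
# once in a 4-term field and once in a 9-term field.
print(term_weight(freq=1, num_docs=3, doc_freq=2, num_terms=4))  # 0.5
print(term_weight(freq=1, num_docs=3, doc_freq=2, num_terms=9))  # about 0.333
```

With equal term frequency and idf, the shorter field wins purely through the field-length norm.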

slide-101
SLIDE 101

POST /starwars/_search?explain
{
  "query": {
    "match": {
      "quote": "father"
    }
  }
}

slide-102
SLIDE 102

"_explanation": {
  "value": 0.2876821,
  "description": "weight(quote:father in 0) [PerFieldSimilarity], result of:",
  "details": [
    {
      "value": 0.2876821,
      "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
      "details": [
        {
          "value": 0.2876821,
          "description": "idf(docFreq=1, docCount=1)",
          "details": []
        },
        ...

slide-103
SLIDE 103

Score

0.2876821 vs 0.27233246
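The value 0.2876821 is consistent with the BM25 idf formula ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) for docFreq=1 and docCount=1, as reported in the explanation. A rough sketch follows, assuming the usual BM25 defaults k1 = 1.2 and b = 0.75. It does not attempt to reproduce 0.27233246, which depends on index statistics that are not shown on the slides.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """BM25 inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(freq, doc_freq, doc_count, field_length, avg_field_length,
                    k1=1.2, b=0.75):
    """idf multiplied by the saturating, length-normalized term frequency."""
    length_ratio = field_length / avg_field_length
    tf_norm = (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))
    return bm25_idf(doc_freq, doc_count) * tf_norm

# docFreq=1, docCount=1 as in the explanation output
print(round(bm25_idf(1, 1), 7))  # 0.2876821

# With one occurrence in a field of exactly average length, tf_norm is 1
# and the score equals the idf (field lengths here are hypothetical).
print(round(bm25_term_score(1, 1, 1, field_length=4, avg_field_length=4), 7))  # 0.2876821
```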

slide-104
SLIDE 104

Vector Space Model

Search multiple terms

slide-105
SLIDE 105

Score each term
Vector
Calculate angle
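These three steps can be sketched with cosine similarity: each query term is one axis, the query and every document become vectors of per-term weights, and a smaller angle (larger cosine) means a better match. The weights below are hypothetical placeholders, not values computed by Elasticsearch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally long weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_a * norm_b)

# Axes: ("your", "obi"). Hypothetical per-term weights.
query = [1.0, 1.0]
documents = {
    "quote 2 (contains your and obi)": [0.3, 0.6],
    "quote 3 (contains only your)": [0.5, 0.0],
}

ranked = sorted(documents, key=lambda name: cosine_similarity(query, documents[name]), reverse=True)
for name in ranked:
    print(name, round(cosine_similarity(query, documents[name]), 4))
```

The document that matches both terms points in nearly the same direction as the query and ranks first.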

slide-106
SLIDE 106

Search your obi

slide-107
SLIDE 107
slide-108
SLIDE 108

Conclusion

slide-109
SLIDE 109

Indexing

Formatting
Tokenize
Lowercase, stop words, stemming
Synonyms

slide-110
SLIDE 110

Elasticsearch Scoring

Term Frequency
Inverse Document Frequency
Field-Length Norm

slide-111
SLIDE 111

Elasticsearch Scoring

Vector Space Model

slide-112
SLIDE 112

MongoDB Limitations

B-tree vs inverted index
Simple scoring

slide-113
SLIDE 113

Elasticsearch Features

Fuzziness
Synonyms
Suggestions & highlighting
Documentation

slide-114
SLIDE 114

80 - 20 solution or 20 - 80 solution?

slide-115
SLIDE 115

More

https://www.elastic.co/training
Amsterdam: Nov 21-24

slide-116
SLIDE 116

Thanks!

Questions?

Philipp Krenn @xeraa
PS: Stickers

slide-117
SLIDE 117

Image Credit

→ Schnitzel https://flic.kr/p/9m27wm
→ Architecture https://flic.kr/p/6dwCAe
→ Conchita https://flic.kr/p/nBqSHT
→ Black and grey http://hdimagelib.com/zedge+quote+wallpapers