Full-Text Search Explained
Philipp Krenn @xeraa
Infrastructure | Developer Advocate
ViennaDB, Papers We Love Vienna
Who uses databases? Who uses search?
Databases vs Full-text search
But I can do...
SELECT * FROM my_table WHERE my_text LIKE '%blue%'
B-Tree
Fuzziness, synonyms, scoring,...
Remove formatting
Tokenize
Stop words
Stemming
http://snowballstem.org
Synonyms
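The steps above (remove formatting, tokenize, stop words, stemming, synonyms) can be sketched as a small pipeline. This is a minimal Python sketch, not how MongoDB or Elasticsearch implement analysis: the stop-word list, the suffix-stripping `toy_stem` (a crude stand-in for a real Snowball stemmer), and the optional `synonyms` map are all assumptions made for illustration.

```python
import re

# Assumption: a tiny illustrative stop-word list, not the list any engine ships.
STOP_WORDS = {"these", "are", "not", "the", "you", "for"}


def toy_stem(token):
    """Crude suffix stripping; a stand-in for a real Snowball stemmer."""
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text, synonyms=None):
    """Run the five steps in order and return the resulting terms."""
    synonyms = synonyms or {}
    text = re.sub(r"<[^>]+>", "", text)                   # remove formatting
    tokens = re.findall(r"[A-Za-z0-9]+", text)            # tokenize
    tokens = [t.lower() for t in tokens]                  # lowercase
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words
    tokens = [toy_stem(t) for t in tokens]                # stemming
    expanded = []
    for token in tokens:                                  # synonyms
        expanded.append(token)
        expanded.extend(synonyms.get(token, []))
    return expanded


print(analyze("These are <em>not</em> the droids you are looking for."))
# ['droid', 'look']
```

Passing a hypothetical map such as `{"droid": ["robot"]}` as `synonyms` would add `robot` right after `droid`, so a search for either word finds the same document.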
Regular expressions !
Arrays with relevant terms !
https://docs.mongodb.com/manual/tutorial/model-data-for-keyword-search/
{
  title: "Moby-Dick",
  author: "Herman Melville",
  published: 1851,
  ISBN: 0451526996,
  topics: [ "whaling", "allegory", "revenge", "American", "novel", "nautical", "voyage", "Cape Cod" ]
}

db.volumes.createIndex({ topics: 1 })
db.volumes.findOne({ topics: "voyage" }, { title: 1 })
Stemming
Fuzziness
Synonyms
Scoring
230+ votes, created 2009, resolved 2013
https://jira.mongodb.org/browse/SERVER-380
Beta since 2.4
Stable since 3.0
"80% solution"; for more: Elasticsearch
In Latin alphabets
Case insensitive (default in 3.2)
[A-z]: other characters removed (default in 3.2)
String or array of strings
Optional language or translations
Optional weight if multiple fields indexed
$text
Updated version in MongoDB 3.2
{
  $text: {
    $search: "<string>",
    $language: "<string>",
    $caseSensitive: <boolean>,
    $diacriticSensitive: <boolean>
  }
}
These are not the droids you are looking for.
these are not the droids you are looking for
droids looking
droid look
> db.starwars.ensureIndex({ quote: "text" })
{
  "createdCollectionAutomatically": true,
  "numIndexesBefore": 1,
  "numIndexesAfter": 2,
  "ok": 1
}

> db.starwars.getIndices()
[
  ...
  {
    "v": 1,
    "key": { "_fts": "text", "_ftsx": 1 },
    "name": "quote_text",
    "ns": "starwars.starwars",
    "weights": { "quote": 1 },
    "default_language": "english",
    "language_override": "language",
    "textIndexVersion": 3
  }
]

> db.starwars.insert( { quote: "These are not the droids you are looking for." } )
Inserted 1 record(s) in 39ms
WriteResult({ "nInserted": 1 })

> db.germanStarwars.ensureIndex( { "$**": "text" }, { default_language: "german" } )
Elasticsearch
These are <em>not</em> the droids you are looking for.
html_strip Char Filter
These are not the droids you are looking for.
standard Tokenizer
These are not the droids you are looking for
lowercase Token Filter
these are not the droids you are looking for
stop Token Filter
droids you looking
snowball Token Filter
droid you look
GET /_analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "standard",
  "filter": [ "lowercase", "stop", "snowball" ],
  "text": "These are <em>not</em> the droids you are looking for."
}

{
  "tokens": [
    {
      "token": "droid",
      "start_offset": 27,
      "end_offset": 33,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "you",
      "start_offset": 34,
      "end_offset": 37,
      "type": "<ALPHANUM>",
      "position": 5
    },
    ...
  ]
}
Obi-Wan never told you what happened to your father.
happen your father
<b>No</b>. I am your father.
i am your father
English: Philipp's → philipp
French: l'église → eglis
German: äußerst → ausserst
phonetic Token Filter
Plugin
Joe Bloggs → J joe BLKS bloggs
Index, type, mapping
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [ "word1,synonym", "word2,synonym" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "stop", "snowball", "my_synonym_filter" ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "my_title": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

PUT /starwars/quotes/1
{ "quote": "These are <em>not</em> the droids you are looking for." }

PUT /starwars/quotes/2
{ "quote": "Obi-Wan never told you what happened to your father." }

PUT /starwars/quotes/3
{ "quote": "<b>No</b>. I am your father." }
Inverted Index
term     ID 1    ID 2    ID 3
am       0       0       1[2]
droid    1[4]    0       0
father   0       1[9]    1[4]
happen   0       1[6]    0
i        0       0       1[1]
look     1[7]    0       0
never    0       1[2]    0
told     0       1[3]    0
wan      0       1[1]    0
what     0       1[5]    0
you      1[5]    1[4]    0
your     0       1[8]    1[3]
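The table above maps each term to the documents containing it, with the token position in brackets. A minimal Python sketch of building such a positional inverted index follows. The function name and input shape are assumptions for illustration: each document is given as an already-analyzed token list, with `None` marking a removed stop word so that the remaining positions keep their gaps, as in the table.

```python
from collections import defaultdict


def build_inverted_index(analyzed_docs):
    """Map term -> {doc_id: [positions]} for pre-analyzed documents.

    `None` entries stand for removed stop words; they still occupy a position.
    """
    index = defaultdict(dict)
    for doc_id, tokens in analyzed_docs.items():
        for position, token in enumerate(tokens):
            if token is None:
                continue
            index[token].setdefault(doc_id, []).append(position)
    return index


# The three quotes after analysis (stop words replaced by None).
docs = {
    1: [None, None, None, None, "droid", "you", None, "look", None],
    2: ["obi", "wan", "never", "told", "you", "what", "happen", None, "your", "father"],
    3: [None, "i", "am", "your", "father"],
}

index = build_inverted_index(docs)
print(index["father"])  # {2: [9], 3: [4]}
print(index["you"])     # {1: [5], 2: [4]}
```

Looking up a term is then a single dictionary access, which is why a search does not have to scan every document the way a `LIKE '%...%'` query does.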
> db.starwars.insert( { quote: "Obi-Wan never told you what happened to your father." } )
> db.starwars.insert( { quote: "No. I am your father." } )

> db.starwars.find({ $text: { $search: "droid" }})
{
  "_id": ObjectId("57f2d54de814412463c3adef"),
  "quote": "These are not the droids you are looking for."
}
Fetched 1 record(s) in 35ms

> db.starwars.find({ $text: { $search: "father" }})
{
  "_id": ObjectId("57f2d56fe814412463c3adf0"),
  "quote": "Obi-Wan never told you what happened to your father."
}
{
  "_id": ObjectId("57f2d581e814412463c3adf1"),
  "quote": "No. I am your father."
}
Fetched 2 record(s) in 3ms

> db.starwars.find({ $text: { $search: "droid" }}).explain()
{
  "queryPlanner": {
    ...
      "$text": {
        "$search": "droid",
        "$language": "english",
        "$caseSensitive": false,
        "$diacriticSensitive": false
      }
    },
    "winningPlan": {
      "stage": "TEXT",
      "indexPrefix": { },
      "indexName": "quote_text",
      "parsedTextQuery": {
        "terms": [ "droid" ],
        "negatedTerms": [ ],
        "phrases": [ ],
        "negatedPhrases": [ ]
      },
      ...

> db.starwars.find({ $text: { $search: "father -obi" }})
{
  "_id": ObjectId("57f2d581e814412463c3adf1"),
  "quote": "No. I am your father."
}
Fetched 1 record(s) in 4ms

> db.starwars.find({ $text: { $search: "father -obi" }}).explain()
...
"parsedTextQuery": {
  "terms": [ "father" ],
  "negatedTerms": [ "obi" ],
  "phrases": [ ],
  "negatedPhrases": [ ]
},
...
Queries
// OR
> db.starwars.find({ $text: { $search: "look droid" } })
// AND but without input stemming
> db.starwars.find({ $text: { $search: "\"look\" \"droid\"" } })
// Negation
> db.starwars.find({ $text: { $search: "look -droid" } })
// Phrase
> db.starwars.find({ $text: { $search: "\"look droid\"" } })
// Translation
> db.starwars.find({ $text: { $search: "suchen", $language: "de" } })
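To make the query syntax above more concrete, here is a toy Python parser that splits a search string into plain terms, negated terms, and quoted phrases. It is only a sketch and not MongoDB's implementation: the function name is hypothetical, the key names merely echo the `parsedTextQuery` output shown earlier, and stemming, stop words, and negated phrases are deliberately left out.

```python
import re


def parse_text_query(search):
    """Split a search string into terms, negated terms and phrases (toy version)."""
    # Quoted parts are treated as phrases and removed from the remaining input.
    phrases = re.findall(r'"([^"]*)"', search)
    rest = re.sub(r'"[^"]*"', " ", search)

    terms, negated_terms = [], []
    for word in rest.split():
        if word.startswith("-") and len(word) > 1:
            negated_terms.append(word[1:].lower())  # leading "-" means negation
        else:
            terms.append(word.lower())
    return {"terms": terms, "negatedTerms": negated_terms, "phrases": phrases}


print(parse_text_query("father -obi"))
# {'terms': ['father'], 'negatedTerms': ['obi'], 'phrases': []}
```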
Elasticsearch
POST /starwars/_search
{
  "query": {
    "match": {
      "quote": "droid"
    }
  }
}

{
  "took": 2,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.39556286,
    "hits": [
      {
        "_index": "starwars",
        "_type": "quotes",
        "_id": "1",
        "_score": 0.39556286,
        "_source": {
          "quote": "These are <em>not</em> the droids you are looking for."
        }
      }
    ]
  }
}

POST /starwars/_search
{
  "query": {
    "match": {
      "quote": {
        "query": "van",
        "fuzziness": "AUTO"
      }
    }
  }
}

{
  "took": 14,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 1,
    "max_score": 0.18155496,
    "hits": [
      {
        "_index": "starwars",
        "_type": "quotes",
        "_id": "2",
        "_score": 0.18155496,
        "_source": {
          "quote": "Obi-Wan never told you what happened to your father."
        }
      }
    ]
  }
}
SELECT * FROM starwars
WHERE quote LIKE '%_an%' OR quote LIKE '%V_n%' OR quote LIKE '%Va_%'
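Instead of enumerating wildcard patterns, fuzzy matching is usually expressed as an edit distance between the query term and an indexed term. The Python sketch below uses plain Levenshtein distance as a simplification (Elasticsearch by default also counts a transposition of two adjacent characters as one edit). The length thresholds in `auto_fuzziness` follow the documented behavior of `"fuzziness": "AUTO"`; the function names themselves are made up for this example.

```python
def levenshtein(a, b):
    """Number of single-character inserts, deletes or substitutions from a to b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Allowed edits for AUTO: 0 up to 2 chars, 1 for 3 to 5 chars, 2 beyond that."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query, candidate):
    return levenshtein(query, candidate) <= auto_fuzziness(query)


print(levenshtein("van", "wan"))    # 1
print(fuzzy_match("van", "wan"))    # True
print(fuzzy_match("van", "father")) # False
```

This is why the query for `van` above still finds the quote containing `wan`: the two terms are one substitution apart, and a three-letter term is allowed one edit.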
> db.starwars.find({ $text: { $search: "droid" }}, { score: { $meta: "textScore" } })
{
  "_id": ObjectId("57f2d54de814412463c3adef"),
  "quote": "These are not the droids you are looking for.",
  "score": 0.75
}
Fetched 1 record(s) in 14ms
One Term
https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/fts_spec.cpp#L219
double coeff = (0.5 * data.count / numTokens) + 0.5;
data.count: matches
numTokens: stemmed words
"These are not the droids you are looking for."
droid look == 1 match, 2 tokens
coeff: 0.5 × 1/2 + 0.5 = 0.75
"No. I am your father."
father == 1 match, 1 token
coeff: 0.5 × 1/1 + 0.5 = 1.0
"Obi-Wan never told you what happened to your father."
1 match, 6 tokens
coeff: 0.5 × 1/6 + 0.5 ≈ 0.583
> db.starwars.find({ $text: { $search: "obi-wan" }}, { score: { $meta: "textScore" } })
{
  "_id": ObjectId("57f2d56fe814412463c3adf0"),
  "quote": "Obi-Wan never told you what happened to your father.",
  "score": 1.1666666666666667
}
Fetched 1 record(s) in 6ms
Multiple Terms
https://github.com/mongodb/mongo/blob/v3.2/src/mongo/db/fts/fts_spec.cpp#L228
score += (weight * data.freq * coeff * adjustment);
weight: method parameter
data.freq, adjustment: 1
1 match, 6 tokens
coeff: 0.5 × 1/6 + 0.5 ≈ 0.583
1 match, 6 tokens
coeff: 0.5 × 1/6 + 0.5 ≈ 0.583
score: 1 × 1 × 0.583 × 1 ≈ 0.583 per term
Sum: ≈ 1.1667
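The arithmetic on these slides can be reproduced with a few lines of Python. The sketch below mirrors only the two quoted lines from `fts_spec.cpp`; it assumes `data.freq` and `adjustment` are 1 (as stated on the slide) and an index weight of 1, and the function names are chosen for this example rather than taken from MongoDB.

```python
def coeff(count, num_tokens):
    """Mirrors: double coeff = (0.5 * data.count / numTokens) + 0.5;"""
    return (0.5 * count / num_tokens) + 0.5


def text_score(match_counts, num_tokens, weight=1.0):
    """Mirrors: score += (weight * data.freq * coeff * adjustment);

    Assumption: data.freq and adjustment are both 1 for every matched term.
    `match_counts` holds one match count per matched search term.
    """
    freq = 1.0
    adjustment = 1.0
    return sum(weight * freq * coeff(count, num_tokens) * adjustment
               for count in match_counts)


# "droid" in "These are not the droids you are looking for." (2 tokens)
print(text_score([1], num_tokens=2))               # 0.75

# "obi-wan" in the Obi-Wan quote: two terms, one match each, 6 tokens
print(round(text_score([1, 1], num_tokens=6), 4))  # 1.1667
```

Both results agree with the `textScore` values returned by the shell queries: 0.75 for `droid` and roughly 1.1667 for `obi-wan`.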
Elasticsearch
Search one term
https://speakerdeck.com/elastic/improved-text-scoring-with-bm25
Putting it Together
score(q,d) = queryNorm(q) · coord(q,d) · ∑(t in q) ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
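The per-term factors inside the sum can be sketched in Python. The definitions below follow Lucene's classic TF/IDF similarity as commonly documented (square root of the term frequency, logarithmic inverse document frequency, inverse square root of the field length); `queryNorm` and `coord` are left out, and the function names are assumptions for this example. Lucene also stores the field norm in a compressed form, so real scores differ slightly.

```python
import math


def tf(freq):
    """Term frequency: more occurrences in the field score higher, with damping."""
    return math.sqrt(freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: terms found in fewer documents score higher."""
    return 1 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: a match in a shorter field scores higher."""
    return 1 / math.sqrt(num_terms)


def term_weight(freq, num_docs, doc_freq, num_terms, boost=1.0):
    """One summand of the formula: tf · idf² · boost · norm."""
    return tf(freq) * idf(num_docs, doc_freq) ** 2 * boost * field_norm(num_terms)


# Same term, same statistics, but a 4-term field versus a 9-term field.
print(term_weight(freq=1, num_docs=3, doc_freq=2, num_terms=4))  # 0.5
print(term_weight(freq=1, num_docs=3, doc_freq=2, num_terms=9))
```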
POST /starwars/_search?explain
{
  "query": {
    "match": {
      "quote": "father"
    }
  }
}

"_explanation": {
  "value": 0.2876821,
  "description": "weight(quote:father in 0) [PerFieldSimilarity], result of:",
  "details": [
    {
      "value": 0.2876821,
      "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
      "details": [
        {
          "value": 0.2876821,
          "description": "idf(docFreq=1, docCount=1)",
          "details": []
        },
        ...
0.2876821 vs 0.27233246
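The `idf(docFreq=1, docCount=1)` value in the explain output matches the BM25 inverse document frequency, which can be checked with a short Python sketch. The formulas and the defaults `k1 = 1.2` and `b = 0.75` are the usual BM25 ones; the function names are assumptions for this example. The second score (0.27233246) depends on index statistics that are not shown in the output, so it is not reproduced here.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """BM25 inverse document frequency."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """BM25 term-frequency part, normalized by relative field length."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# One matching document out of one, as in idf(docFreq=1, docCount=1).
print(round(bm25_idf(doc_freq=1, doc_count=1), 7))  # 0.2876821

# A field of exactly average length leaves the score equal to the idf.
print(round(bm25_tf_norm(freq=1, field_length=5, avg_field_length=5), 7))  # 1.0

# Longer than average fields are penalized, shorter ones are rewarded.
print(bm25_tf_norm(1, 10, 5) < 1.0 < bm25_tf_norm(1, 3, 5))  # True
```

A `docCount` of 1 in a three-document index is presumably a result of the five shards, each of which computes these statistics on its own.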
Search multiple terms
Formatting
Tokenize
Lowercase, stop words, stemming
Synonyms
Term Frequency
Inverse Document Frequency
Field-Length Norm
Vector Space Model
B-tree vs inverted index
Simple scoring
Fuzziness
Synonyms
Suggestions & highlighting
Documentation
https://www.elastic.co/training Amsterdam: Nov 21-24
Philipp Krenn @xeraa
PS: Stickers
Image Credit
→ Schnitzel https://flic.kr/p/9m27wm
→ Architecture https://flic.kr/p/6dwCAe
→ Conchita https://flic.kr/p/nBqSHT
→ Black and grey http://hdimagelib.com/zedge+quote+wallpapers