stop stupid fuzzy searches
Table of contents
01 Fuzzy search
02 Smart Query Rewriting
03 Conclusion
04 Surprise
01 Fuzzy search
Why we need it / Distribution of spelling errors
[Chart: share of searches per error class (Edit distance 0, Edit distance 1, Edit distance 2, Edit distance >2, Singular & Plural, Decomposition), plotted for two series: Frequency and Value/Search. Data labels as extracted, in two groups that each sum to 100%: 37%, 23%, 9%, 5%, 15%, 11% and 26%, 20%, 10%, 12%, 18%, 14%.]
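Why we need it / Distribution of spelling errors by device type
[Chart: share of spelling errors per error type (Insert, Delete, Replace, Transpose, Singular & Plural, Decomposition), plotted for two series: Desktop and Mobile. Data labels as extracted, in two groups that each sum to 100%: 10%, 25%, 27%, 10%, 17%, 11% and 6%, 23%, 34%, 7%, 16%, 14%.]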
Causes of spelling errors
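format, phonetic, typo, decomposition
Query intent: spannbettlaken (German for "fitted sheet")
Spellings seen for this intent: spannbettlaken, spann-bettlaken, spannbettlacken, spanbettlaken, spannbettllaken, spammbettlaken, Spann bettlaken, Bettlaken zum spannen, …42 additional spellings
[Figure: table with the columns Query-Intent, Error-type, query, and Result size, with percentage labels. Extracted labels that cannot be attributed to individual rows: 1%, 3%, 13%, 9%, 7%, 1%, 4%, 1%; 61%, 4%; 22%, 8%, 5%; result sizes 83, 56, 50, 47, 43.]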
How it works

EditDistance 1: generates 835 candidates

GET catalog/products/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "spannbettlacken",
        "fuzziness": 1
      }
    }
  }
}

EditDistance 2: generates ~650k candidates

GET catalog/products/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "spannbettlacken",
        "fuzziness": 2
      }
    }
  }
}
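To get a feel for why the candidate count grows so quickly with the edit distance, here is a minimal Python sketch that naively enumerates every string within one or two edits of a query. It is an illustration only: the 26-letter alphabet and the function names `edits1` and `edits2` are assumptions, and Lucene matches against terms that actually exist in the index rather than enumerating strings like this, so the counts printed here are not expected to reproduce the figures above.

```python
import string

# Assumption: plain lowercase ASCII letters only (no umlauts, digits, or hyphens).
ALPHABET = string.ascii_lowercase


def edits1(word, alphabet=ALPHABET):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return (deletes | transposes | replaces | inserts) - {word}


def edits2(word, alphabet=ALPHABET):
    """All strings reachable by applying two single edits in a row."""
    return {e2 for e1 in edits1(word, alphabet) for e2 in edits1(e1, alphabet)} - {word}


# The correct spelling is one deletion away from the misspelled query.
print("spannbettlaken" in edits1("spannbettlacken"))
# Candidate counts for a short word at one and two edits.
print(len(edits1("laken")), len(edits2("laken")))
```

Every additional edit multiplies the candidate set by roughly the size of the previous step, which is the effect the two candidate counts on this slide point at.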
Resulting in / high recall but low precision
[Chart: Precision (PREC) plotted against Recall (TPR), both axes from 0 to 1.]
Resulting in / low search throughput
~0.1 seconds for spelling a short word
[Chart: Searches per Second (0 to 35,000) by number of Query Terms (1 to 6), for four query types: or-term, and-term, or-fuzzy 2, and-fuzzy 2.]
Observations
- Searches for all possible candidates inside a given edit distance
- Natively implemented in Elasticsearch and Lucene
- Increased CPU usage and query response time
- Inconsistent and not always relevant results
- Skewed search analytics
02 Smart Query Rewriting
MAKE FUZZY SEARCH AS FAST, EASY AND RELEVANT AS EXACT SEARCH

Our Solution / smart query rewrites
Incoming queries: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken
Steps: Cluster similar Queries, Test & Select MasterQuery
MasterQuery sent to the Search Engine: spannbettlaken
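One way to picture the rewrite step at query time is a plain lookup from known variants to their MasterQuery, applied before the query reaches the search engine. The Python sketch below uses the variants from this slide. The table contents, the function name `rewrite`, and the pass-through behaviour for unknown queries are assumptions for illustration, not the product's implementation.

```python
# Hypothetical rewrite table: every known variant maps to its MasterQuery.
MASTER_QUERY = {
    "spannbettlaken": "spannbettlaken",
    "spann-bettlaken": "spannbettlaken",
    "spannbettlacken": "spannbettlaken",
    "schpanbettlaken": "spannbettlaken",
    "spannbettllaken": "spannbettlaken",
    "spammbettlaken": "spannbettlaken",
    "spanmbettlaken": "spannbettlaken",
}


def rewrite(query):
    """Return the MasterQuery for a known variant, else the query unchanged."""
    normalized = query.strip().lower()
    return MASTER_QUERY.get(normalized, query)


# The search engine then only ever sees an exact-match query.
print(rewrite("spannbettlacken"))
```

Because the expensive clustering and selection happen offline, the work left at query time is a single hash lookup followed by an exact search.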
Cluster similar Queries
Based on deep learning and crafted algorithms, we clean and cluster queries with the same meaning.
We use the concept of controlled precision reduction, matching in stages: Exact Match, Fingerprint, Lemmatization & Phonemes, Fuzzy Match.
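The idea of reducing precision step by step can be sketched in Python as a series of increasingly lossy keys, where queries that share a key fall into the same cluster. The normalisation rules below (dropping punctuation, collapsing doubled letters, two crude German sound rules) and the names `fingerprint`, `phonetic_key`, and `cluster` are assumptions chosen for this example; the slides do not describe the actual rules.

```python
import re
from collections import defaultdict


def fingerprint(query):
    """Lowercase, drop everything except a-z and 0-9, collapse repeated characters.

    Limitation of this sketch: umlauts and other non-ASCII letters are dropped.
    """
    s = re.sub(r"[^a-z0-9]", "", query.lower())
    return re.sub(r"(.)\1+", r"\1", s)


def phonetic_key(query):
    """A coarser key: fingerprint plus two hypothetical sound rules."""
    return fingerprint(query).replace("sch", "s").replace("ck", "k")


def cluster(queries, key):
    """Group queries by the given key function."""
    groups = defaultdict(list)
    for q in queries:
        groups[key(q)].append(q)
    return dict(groups)


variants = ["spannbettlaken", "spann-bettlaken", "spannbettllaken",
            "spannbettlacken", "schpanbettlaken", "spammbettlaken"]
print(cluster(variants, fingerprint))
print(cluster(variants, phonetic_key))
```

In this sketch the fingerprint already merges the hyphenated and doubled-letter variants, the phonetic key additionally merges spannbettlacken and schpanbettlaken, and spammbettlaken is left over for the final fuzzy stage.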
Test & Select MasterQuery
Based on tracking KPIs, deep learning, and global parameter optimization, we test and select the query that maximises the balance between the search result interaction probability and the economic outcome.
CXP search|hub / Query Intelligence Platform
Frontend, Search Endpoint
High performance Caching & Logging
Smart|Query: Query Segmentation, Query Scoping
Data|hub: Semantic Query Parsing, Site Search Analytics, Guided Selling, Personalization, …
Search Engine: Solr, Elasticsearch, FACT-Finder, Fredhopper, Celebros, Algolia, ACS
03 Conclusion
Impact – top-10 ecom player A
[Chart: monthly values from Jan to Dec on a 90% to 140% scale, comparing w/o smart|query and w smart|query.]
Uses an already highly optimized, state-of-the-art eCommerce search solution
Impact – top-50 ecom player B
[Chart: monthly values from Apr to Mar on a 90% to 140% scale, comparing w/o smart|query and w smart|query.]
Uses an optimized Solr implementation
Resulting in / High recall & high precision
[Chart: Precision (PREC) and Recall (TPR) by number of Queries, from 500,000 to 5,000,000. Axis ranges as extracted: 0.97 to 1 and 0.5 to 1.]
Resulting in / insane query performance
~0.00005 seconds for spelling a short word – 80 ops/ms
[Chart: Searches per Second (0 to 35,000) by number of Query Terms (1 to 6), search|hub & Elastic, for four query types: or-term, and-term, or-fuzzy 2, and-fuzzy 2.]
Observations
- more relevant results
- consistent results
- reduced manual effort for curated search results
- reduced CPU usage
- improved query response time
- consistent site search analytics
- additional complexity
04 Surprise
CXP smart|query - PreDictLib
fast & accurate spell correction at scale

search|hub - PreDictLib
Quick Highlights:
§ extremely fast & constant index access
§ truly language independent edit distance
§ ability to add records to the index at runtime without performance decrease
Based on one of the most efficient spell correction implementations out there, SymSpell by Wolf Garbe.
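SymSpell is built around a symmetric delete idea: precompute deletions of the dictionary terms, generate deletions of the incoming query, and look both up in one hash index. The Python sketch below shows only that core idea. The class name `DeleteIndex` and its methods are assumptions for this example and are not the SymSpell or PreDict API.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word, max_distance):
    """The word itself plus every string made by deleting up to max_distance characters."""
    out = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for kept in combinations(range(len(word)), len(word) - k):
            out.add("".join(word[i] for i in kept))
    return out


class DeleteIndex:
    """Hypothetical minimal index: delete-variant -> dictionary terms."""

    def __init__(self, max_distance=2):
        self.max_distance = max_distance
        self.index = defaultdict(set)

    def add(self, term):
        # Terms can be added at any time; each add only touches its own keys.
        for key in deletes(term, self.max_distance):
            self.index[key].add(term)

    def candidates(self, query):
        # Candidates are a superset of the true matches; a full implementation
        # would verify each one with a real edit distance afterwards.
        found = set()
        for key in deletes(query, self.max_distance):
            found |= self.index.get(key, set())
        return found


idx = DeleteIndex(max_distance=2)
idx.add("spannbettlaken")
print(idx.candidates("spannbettlacken"))
print(idx.candidates("spammbettlaken"))
```

Lookups only generate deletions of the query and never touch the alphabet, which is consistent with the language-independence and runtime-insert highlights listed above.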
SymSpell / some Benchmarks
Throughput vs Accuracy

System                       Accuracy   Searches/sec (relative)
Lucene WordCorrect           88.7%      1.0%
ElasticSearch WordCorrect    45.8%      1.0%
No.2 eCommerce Search        69.2%      1.7%
No.1 in eCommerce Search     88.3%      2.2%
SymSpell                     88.7%      100.0%
search|hub - PreDictLib
fast & accurate spell correction at scale
- modified edit distance to a weighted edit distance
- replaced Damerau-Levenshtein distance with a weighted Damerau-Levenshtein distance that takes keyboard distance into account
- re-rank the candidate list by applying additional similarity algorithms
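A keyboard-weighted edit distance can be sketched in Python as follows. This is the restricted Damerau-Levenshtein variant (optimal string alignment) with a cheaper substitution cost for neighbouring keys. The QWERTZ grid, the 0.5 cost, and the function names are assumptions for this example; the weights PreDict actually uses are not given in the slides.

```python
# Assumption: a simplified German QWERTZ layout treated as an unstaggered grid.
QWERTZ_ROWS = ["qwertzuiop", "asdfghjkl", "yxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTZ_ROWS) for c, ch in enumerate(row)}


def substitution_cost(a, b):
    """0 for equal characters, 0.5 for neighbouring keys (assumed weight), else 1."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5
    return 1.0


def weighted_osa(s, t):
    """Optimal string alignment distance with keyboard-weighted substitutions."""
    d = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = float(i)
    for j in range(1, len(t) + 1):
        d[0][j] = float(j)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                    # delete
                d[i][j - 1] + 1.0,                                    # insert
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),  # replace
            )
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)         # transpose
    return d[len(s)][len(t)]


print(weighted_osa("spammbettlaken", "spannbettlaken"))
print(weighted_osa("spannbettlacken", "spannbettlaken"))
```

With weights like these, a slip onto a neighbouring key ranks closer to the intended word than an arbitrary replacement, which is one way a candidate list could be ordered before any further re-ranking.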
search|hub - PreDict (CE) & PreDict (EE) / some Benchmarks
Throughput vs. Accuracy

System                       Accuracy   Searches/sec (relative)
Lucene WordCorrect           89%        1%
ElasticSearch WordCorrect    46%        1%
No.2 eCommerce Search        69%        1%
No.1 in eCommerce Search     88%        2%
SymSpell                     89%        86%
CXP PreDict (CE)             89%        100%
CXP Searchhub                99%        98%
CXP SmartQuery – PreDictLib (CE)
fast & accurate spell correction at scale
what you'll get
§ the Lib as Java source
§ accuracy and benchmark tests
§ real-life test data
https://github.com/searchhub/preDict