stop stupid fuzzy searches table of contents


  1. stop stupid fuzzy searches

  2. Table of contents 01 Fuzzy search 02 Smart Query Rewriting 03 Conclusion 04 Surprise

  3. 01 Fuzzy search

  4. Why we need it / Distribution of spelling errors
     [Bar chart, 0% to 100%. Categories: Edit distance 0, Edit distance 1, Edit distance 2, Edit distance >2, Singular & Plural, Decomposition. Series: Frequency, Value/Search.]

  5. Why we need it / Distribution of spelling errors by device type
     [Bar chart, 0% to 100%. Categories: Insert, Delete, Replace, Transpose, Singular & Plural, Decomposition. Series: Desktop, Mobile.]

  6. Causes of spelling errors
     Query-Intent: spannbettlaken (61%)

     Error-type            Query                    Share   Result size
     format (4%)           -spannbettlaken          1%      0
                           spann-bettlaken          3%      83
     phonetic (22%)        spannbettlacken          13%     56
                           spanbettlaken            9%      50
     typo (8%)             spannbettllaken          7%      47
                           spammbettlaken           1%      0
     decomposition (5%)    Spann bettlaken          4%      43
                           Bettlaken zum spannen    1%      0

     …42 additional spellings

  7. How it works

     EditDistance 1 (generates 835 candidates):

         GET catalog/products/_search
         {
           "query": {
             "fuzzy": {
               "title": {
                 "value": "spannbettlacken",
                 "fuzziness": 1
               }
             }
           }
         }

     EditDistance 2 (generates ~650k candidates):

         GET catalog/products/_search
         {
           "query": {
             "fuzzy": {
               "title": {
                 "value": "spannbettlacken",
                 "fuzziness": 2
               }
             }
           }
         }
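
To make the notion of edit distance on this slide concrete, here is a minimal sketch of the optimal string alignment variant of the Damerau-Levenshtein distance (insert, delete, replace, transpose). It is an illustration only, not the Lucene or Elasticsearch implementation; the function name `osa_distance` is an assumption made for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, replace and adjacent transpose."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            # adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("spannbettlacken", "spannbettlaken"))  # one deletion
print(osa_distance("spammbettlaken", "spannbettlaken"))   # two replacements
```

A fuzzy query with fuzziness 2 has to consider every indexed term within distance 2 of the input, which is why the candidate set grows so quickly compared with fuzziness 1.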

  8. Resulting in / high recall but low precision
     [Chart: Precision (PREC) on the y-axis against Recall (TPR) on the x-axis, both from 0 to 1.]

  9. Resulting in / low search throughput
     ~0.1 seconds for spelling a short word
     [Chart: Searches per Second (0 to 35000) against number of Query Terms (1 to 6). Series: or - term, and - term, or - fuzzy 2, and - fuzzy 2.]

  10. Observations
      +
      - Searches for all possible candidates inside a given edit distance
      - Natively implemented in Elasticsearch and Lucene
      -
      - Increased CPU usage and query response time
      - Inconsistent and not always relevant results
      - Skewed search analytics

  11. 02 Smart Query rewriting MAKE FUZZY SEARCH AS FAST, EASY AND RELEVANT AS EXACT SEARCH

  12. Our Solution / smart query rewrites
      [Diagram: Cluster similar Queries, then Test & Select MasterQuery, then send the MasterQuery to the Search Engine.]
      Clustered queries: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken
      MasterQuery: spannbettlaken
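
The rewrite step on this slide can be pictured as a plain lookup from a query variant to its MasterQuery, which is what lets a misspelled query run as an exact search. The sketch below assumes the cluster from the slide has already been computed; the names `MASTER_QUERY` and `rewrite` are hypothetical and not part of search|hub.

```python
# Hypothetical, precomputed mapping: every clustered variant -> its MasterQuery.
VARIANTS = [
    "spannbettlaken", "spann-bettlaken", "spannbettlacken", "schpanbettlaken",
    "spannbettllaken", "spammbettlaken", "spanmbettlaken",
]
MASTER_QUERY = {variant: "spannbettlaken" for variant in VARIANTS}


def rewrite(query: str) -> str:
    """Return the MasterQuery for a known variant, else the query unchanged."""
    normalized = query.strip().lower()
    return MASTER_QUERY.get(normalized, query)


print(rewrite("spannbettlacken"))
print(rewrite("kopfkissen"))
```

The lookup is a single hash access at query time, so the cost of handling a misspelling no longer depends on the edit distance.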

  13. Our Solution / smart query rewrites: Cluster similar Queries
      Based on deep learning & crafted algorithms we clean and cluster queries with the same meaning.
      We use the concept of controlled precision reduction:
      - Exact Match
      - Fingerprint
      - Lemmatization & Phonemes
      - Fuzzy Match
      Example queries: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken
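
One way to read "controlled precision reduction" is as a cascade that tries the strictest match first and relaxes only when nothing is found. The sketch below is an assumption-laden illustration, not the search|hub algorithm: `fingerprint` and `phonetic_key` are crude hypothetical stand-ins (no real lemmatizer or phoneme model is used), and the fuzzy stage uses Python's standard `difflib` with an arbitrary cutoff.

```python
import difflib
import re

KNOWN_QUERIES = ["spannbettlaken", "bettdecke"]  # hypothetical clean queries


def fingerprint(query: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", query.lower())


def phonetic_key(query: str) -> str:
    """Crude stand-in for lemmatization & phonemes (assumption, not a real model)."""
    key = fingerprint(query).replace("sch", "s").replace("ck", "k")
    return re.sub(r"(.)\1+", r"\1", key)  # collapse repeated letters


def match(query: str, known=KNOWN_QUERIES):
    """Return (matched query, stage) using the strictest stage that succeeds."""
    if query in known:
        return query, "exact"
    by_fingerprint = {fingerprint(k): k for k in known}
    if fingerprint(query) in by_fingerprint:
        return by_fingerprint[fingerprint(query)], "fingerprint"
    by_phonetic = {phonetic_key(k): k for k in known}
    if phonetic_key(query) in by_phonetic:
        return by_phonetic[phonetic_key(query)], "phonetic"
    close = difflib.get_close_matches(
        fingerprint(query), list(by_fingerprint), n=1, cutoff=0.8
    )
    if close:
        return by_fingerprint[close[0]], "fuzzy"
    return None, "none"


for q in ["spannbettlaken", "spann-bettlaken", "spannbettlacken", "spammbettlaken"]:
    print(q, "->", match(q))
```

Because each stage only runs when the stricter ones fail, the loosest and least precise matching is applied to as few queries as possible.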

  14. Our Solution / smart query rewrites: Test & Select MasterQuery
      Based on tracking KPIs and deep learning and global parameter optimization we test & select the query which maximises the balance between the search result interaction probability and the economic outcome.
      Candidate queries: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken
      Selected MasterQuery: spannbettlaken

  15. CXP search|hub / Query Intelligence Platform
      [Architecture diagram. Components shown: Frontend; Search Endpoint; Search Engine (Solr, Elasticsearch, FACT-Finder, Fredhopper, Celebros, Algolia, ACS); High performance Caching & Logging; Data|hub; Smart|Query; Semantic Query Parsing; Site Search Analytics; Guided Selling; Personalization; Query Segmentation; Query Scoping; …]

  16. 03 Conclusion

  17. Impact – top-10 ecom player A
      Uses an already highly optimized state-of-the-art eCommerce Search solution
      [Line chart, 90% to 140%, Jan to Dec. Series: w/o smart|query, w smart|query.]

  18. Impact – top-50 ecom player B
      Uses an optimized Solr implementation
      [Line chart, 90% to 140%, Apr to Mar. Series: w/o smart|query, w smart|query.]

  19. Resulting in / High recall & high precision
      [Chart: Recall (TPR) and Precision (PREC) against number of Queries (0 to 5.000.000).]

  20. Resulting in / insane query performance
      ~0.00005 seconds for spelling a short word – 80 ops/ms
      [Chart, search|hub & Elastic: Searches per Second (0 to 35000) against number of Query Terms (1 to 6). Series: or - term, and - term, or - fuzzy 2, and - fuzzy 2.]

  21. Observations
      +
      - more relevant results
      - consistent results
      - reduced manual effort for curated search results
      - save CPU usage
      - improved query response time
      - consistent site search analytics
      -
      - additional complexity

  22. 04 Surprise CXP smart|query- PreDictLib fast & accurate spell correction at scale

  23. search|hub - PreDictLib: fast & accurate spell correction at scale
      Quick Highlights:
      - extremely fast & constant index access
      - truly language independent edit distance
      - ability to add records to the index at runtime without performance decrease
      Based on one of the most efficient spell correction implementations out there, called SymSpell, by Wolf Garbe.
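
The highlights above follow from SymSpell's symmetric delete idea: index every dictionary word under all strings reachable by deleting up to N characters, then generate the same deletes for the query and look them up directly. The sketch below is a simplified illustration of that idea, not the SymSpell or PreDict code; the class name `DeleteIndex` is an assumption, and a real implementation would verify each candidate with a true edit distance.

```python
from collections import defaultdict
from itertools import combinations


def deletes(word: str, max_distance: int) -> set:
    """The word itself plus all strings made by deleting up to max_distance chars."""
    out = {word}
    for k in range(1, min(max_distance, len(word)) + 1):
        for removed in combinations(range(len(word)), k):
            out.add("".join(c for i, c in enumerate(word) if i not in removed))
    return out


class DeleteIndex:
    """Hypothetical minimal symmetric delete index."""

    def __init__(self, max_distance: int = 2):
        self.max_distance = max_distance
        self.index = defaultdict(set)  # delete variant -> dictionary words

    def add(self, word: str) -> None:
        # Adding a record only touches the keys of that record,
        # so the index can grow at runtime.
        for key in deletes(word, self.max_distance):
            self.index[key].add(word)

    def candidates(self, query: str) -> set:
        # Candidates still need checking with a real edit distance.
        found = set()
        for key in deletes(query, self.max_distance):
            found |= self.index.get(key, set())
        return found


idx = DeleteIndex(max_distance=2)
idx.add("spannbettlaken")
print(idx.candidates("spannbettlacken"))
print(idx.candidates("spammbettlaken"))
```

Only deletes are generated, never inserts or replacements over an alphabet, which is why the approach does not depend on the language's character set.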

  24. SymSpell / some Benchmarks: Throughput vs Accuracy
      [Bar chart, 0% to 100%. Compared: Lucene WordCorrect, ElasticSearch WordCorrect, No.2 eCommerce Search, No.1 in eCommerce Search, SymSpell. Series: Accuracy, Searches/sec.]

  25. search|hub - PreDictLib: fast & accurate spell correction at scale
      - modified the edit distance to a weighted edit distance
      - replaced the Damerau-Levenshtein distance with a weighted Damerau-Levenshtein distance that takes keyboard distance into account
      - re-rank the candidate list by applying additional similarity algorithms
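
As a rough illustration of a keyboard-weighted Damerau-Levenshtein distance, the sketch below makes a replacement cheaper when the two keys are neighbours on a QWERTZ layout, then re-ranks candidates by that distance. The weights (0.5 for neighbouring keys, 1.0 otherwise), the simplified key grid and all names are assumptions for this example; the actual PreDict weighting is not described in the slides.

```python
# Simplified QWERTZ grid; the physical row offsets are ignored (assumption).
ROWS = ["qwertzuiopü", "asdfghjklöä", "yxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def substitution_cost(x: str, y: str) -> float:
    """0.0 for equal chars, 0.5 for neighbouring keys, 1.0 otherwise."""
    if x == y:
        return 0.0
    if x in POS and y in POS:
        (r1, c1), (r2, c2) = POS[x], POS[y]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5
    return 1.0


def weighted_distance(a: str, b: str) -> float:
    """Optimal string alignment distance with keyboard-weighted replacements."""
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        d[i][0] = float(i)
    for j in range(1, len(b) + 1):
        d[0][j] = float(j)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,  # delete
                d[i][j - 1] + 1.0,  # insert
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)  # transpose
    return d[-1][-1]


def rerank(query: str, candidates: list) -> list:
    """Order candidates by weighted distance to the query, closest first."""
    return sorted(candidates, key=lambda c: weighted_distance(query, c))


print(weighted_distance("spammbettlaken", "spannbettlaken"))
print(rerank("spammbettlaken", ["spaqqbettlaken", "spannbettlaken"]))
```

With these assumed weights, typing "mm" for "nn" (neighbouring keys) counts as less severe than the same number of replacements with distant keys, so the more plausible correction ranks first.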

  26. search|hub – PreDict (CE) & PreDict (EE) / some Benchmarks: Throughput vs. Accuracy
      [Bar chart, 0% to 100%. Compared: Lucene WordCorrect, ElasticSearch WordCorrect, No.2 eCommerce Search, No.1 in eCommerce Search, SymSpell, CXP PreDict (CE), CXP Searchhub. Series: Accuracy, Searches/sec.]

  27. What you'll get: CXP SmartQuery – PreDictLib (CE), fast & accurate spell correction at scale
      - the Lib as Java source
      - accuracy and benchmark tests
      - real-life test data
      https://github.com/searchhub/preDict

  28. Questions
