stop stupid fuzzy searches



slide-1
SLIDE 1

stop stupid fuzzy searches

slide-2
SLIDE 2

Table of contents

01 Fuzzy search
02 Smart Query Rewriting
03 Conclusion
04 Surprise

slide-3
SLIDE 3

01 Fuzzy search

slide-4
SLIDE 4

Why we need it / Distribution of spelling errors

[Chart, 0% to 100%: share of queries per error class (Edit distance 0, Edit distance 1, Edit distance 2, Edit distance >2, Singular & Plural, Decomposition); series: Frequency, Value/Search. Values as extracted: 37%, 23%, 9%, 5%, 15%, 11% and 26%, 20%, 10%, 12%, 18%, 14%.]

slide-5
SLIDE 5

Why we need it / Distribution of spelling errors by device type

[Chart, 0% to 100%: share of spelling errors per error class (Insert, Delete, Replace, Transpose, Singular & Plural, Decomposition); series: Desktop, Mobile. Values as extracted: 10%, 25%, 27%, 10%, 17%, 11% and 6%, 23%, 34%, 7%, 16%, 14%.]

slide-6
SLIDE 6

Causes of spelling errors

Query intent: spannbettlaken

Error types: format, phonetic, typo, decomposition

Query variants: spannbettlaken, spann-bettlaken, spannbettlacken, spanbettlaken, spannbettllaken, spammbettlaken, Spann bettlaken, Bettlaken zum spannen, …42 additional spellings

[Diagram columns: Query-Intent, Error-type, query, Result size. Values as extracted: 1%, 3%, 13%, 9%, 7%, 1%, 4%, 1%; 61%, 4%; 22%, 8%, 5%; 83, 56, 50, 47, 43.]

slide-7
SLIDE 7

How it works

EditDistance 1: generates 835 candidates

GET catalog/products/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "spannbettlacken",
        "fuzziness": 1
      }
    }
  }
}

EditDistance 2: generates ~650k candidates

GET catalog/products/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "spannbettlacken",
        "fuzziness": 2
      }
    }
  }
}
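
To get a feel for why the candidate count explodes between edit distance 1 and edit distance 2, the sketch below enumerates by brute force every string within one or two edits of a term. It is an illustration only: Lucene and Elasticsearch do not enumerate candidates this way (they match index terms against a Levenshtein automaton), and the a-z alphabet and the names `edits1` and `edits2` are assumptions, so the counts it produces will not match the slide's figures exactly.

```python
import string

# Assumption: plain lowercase a-z; a real German index would also see umlauts, digits, etc.
ALPHABET = string.ascii_lowercase


def edits1(word):
    """Return all distinct strings exactly one edit away from `word`.

    Edits are single-character delete, adjacent transpose, replace and insert.
    """
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    transposes = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    replaces = {left + c + right[1:] for left, right in splits if right for c in ALPHABET}
    inserts = {left + c + right for left, right in splits for c in ALPHABET}
    return (deletes | transposes | replaces | inserts) - {word}


def edits2(word):
    """Return all distinct strings reachable with at most two edits (excluding `word`)."""
    one_edit = edits1(word)
    two_edits = {e2 for e1 in one_edit for e2 in edits1(e1)}
    return (one_edit | two_edits) - {word}
```

Comparing `len(edits1("spannbettlacken"))` with `len(edits2("spannbettlacken"))` shows the jump: the first is in the hundreds, the second is larger by orders of magnitude.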

slide-8
SLIDE 8

Resulting in / high recall but low precision

[Chart: Precision (PREC) plotted against Recall (TPR), both axes from 0 to 1.]
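
For readers less familiar with the two metrics on this chart, here is a minimal sketch of how precision and recall are computed for a single query. The function name and the set-based inputs are assumptions for illustration; the slides do not say how the measurements were taken.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: document ids returned by the search
    relevant:  document ids that should have been returned
    """
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    # Precision: which fraction of what we returned is relevant?
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: which fraction of the relevant documents did we return?
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall
```

A fuzzy query that returns every relevant product plus as many irrelevant ones again has recall 1.0 but precision 0.5, which is the pattern the slide title describes.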

slide-9
SLIDE 9

Resulting in / low search throughput

~0.1 seconds for spelling a short word

[Chart: Searches per Second (up to 35000) by number of Query Terms (1 to 6); series: or - term, and - term, or - fuzzy 2, and - fuzzy 2.]

slide-10
SLIDE 10

Observations

+ Searches for all possible candidates inside a given edit distance
+ Natively implemented in Elasticsearch and Lucene
- Increased CPU usage and query response time
- Inconsistent and not always relevant results
- Skewed search analytics

slide-11
SLIDE 11

02 Smart Query Rewriting

MAKE FUZZY SEARCH AS FAST, EASY AND RELEVANT AS EXACT SEARCH

slide-12
SLIDE 12

Our Solution / smart query rewrites

Incoming queries: spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken, spannbettlaken

Cluster similar Queries -> Test & Select MasterQuery -> MasterQuery (spannbettlaken) -> Search Engine

slide-13
SLIDE 13

Our Solution / smart query rewrites

Cluster similar Queries

Based on deep learning & crafted algorithms, we clean and cluster queries with the same meaning. We use the concept of controlled precision reduction:

Exact Match -> Fingerprint -> Lemmatization & Phonemes -> Fuzzy Match

spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken, spannbettlaken
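
One way to read "controlled precision reduction" is as a cascade of increasingly lossy keys: a query joins an existing cluster at the strictest level where its key matches, and looser levels are tried only if stricter ones fail. The sketch below follows that reading. All key functions are hypothetical stand-ins (in particular the toy phonetic rules), the final Fuzzy Match stage is left out, and none of this is taken from the search|hub implementation.

```python
import re


def exact_key(query):
    """Level 1: exact match after trimming and lowercasing."""
    return query.strip().lower()


def fingerprint_key(query):
    """Level 2: drop everything except letters and digits (hyphens, spaces, ...)."""
    return re.sub(r"[^a-z0-9]", "", exact_key(query))


def phonetic_key(query):
    """Level 3: toy phonetic reduction (assumed rules, not a real phonetic algorithm)."""
    key = fingerprint_key(query)
    for src, dst in (("sch", "s"), ("ck", "k"), ("m", "n")):
        key = key.replace(src, dst)
    return re.sub(r"(.)\1+", r"\1", key)  # collapse doubled letters


LEVELS = [exact_key, fingerprint_key, phonetic_key]  # strictest first


def cluster(queries):
    """Group queries, matching each one at the strictest level that finds a cluster."""
    clusters = []                        # list of lists of queries
    index = [dict() for _ in LEVELS]     # per level: key -> position in `clusters`
    for query in queries:
        keys = [level(query) for level in LEVELS]
        hit = next((index[i][k] for i, k in enumerate(keys) if k in index[i]), None)
        if hit is None:                  # no level matched: open a new cluster
            hit = len(clusters)
            clusters.append([])
        clusters[hit].append(query)
        for i, key in enumerate(keys):   # make this query findable at every level
            index[i].setdefault(key, hit)
    return clusters
```

With these assumed rules, the seven distinct spellings listed on the slide fall into one cluster, while an unrelated query opens its own.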

slide-14
SLIDE 14

Our Solution / smart query rewrites

Test & Select MasterQuery

Based on tracking KPIs, deep learning and global parameter optimization, we test & select the query which maximises the balance between the search result interaction probability and the economic outcome.

spannbettlaken, spann-bettlaken, spannbettlacken, schpanbettlaken, spannbettllaken, spammbettlaken, spanmbettlaken, spannbettlaken
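
The selection step can be pictured as scoring every query in a cluster and keeping the best one. The sketch below is a deliberately simple heuristic, not the deep-learning and parameter-optimization approach the slide mentions: the `QueryStats` fields, the weighted geometric mean in `score`, and the `weight` parameter are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    """Hypothetical tracked KPIs for one query variant."""
    query: str
    searches: int
    clicks: int
    revenue: float


def score(stats, weight=0.5):
    """Balance interaction probability against economic outcome.

    Uses a weighted geometric mean of click-through rate and revenue per search;
    weight=1.0 looks only at interaction, weight=0.0 only at revenue.
    """
    if stats.searches == 0:
        return 0.0
    click_through_rate = stats.clicks / stats.searches
    revenue_per_search = stats.revenue / stats.searches
    return (click_through_rate ** weight) * (revenue_per_search ** (1 - weight))


def select_master_query(cluster_stats, weight=0.5):
    """Return the query string with the highest score in the cluster."""
    return max(cluster_stats, key=lambda s: score(s, weight)).query
```

Every query in the cluster would then be rewritten to the selected MasterQuery before it reaches the search engine.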

slide-15
SLIDE 15

CXP search|hub / Query Intelligence Platform

Frontend Search Endpoint

search|hub: High performance Caching & Logging, Data|hub, Smart|Query, Semantic Query Parsing, Query Segmentation, Query Scoping, Site Search Analytics, Guided Selling, Personalization, …

Search Engine: Solr, Elasticsearch, FACT-Finder, Fredhopper, Celebros, Algolia, ACS

slide-16
SLIDE 16

03 Conclusion

slide-17
SLIDE 17

Impact – top-10 ecom player A

[Chart: monthly values from Jan to Dec on a 90% to 140% scale; series: w/o smart|query, w smart|query.]

Uses an already highly optimized state-of-the-art eCommerce Search solution

slide-18
SLIDE 18

Impact – top-50 ecom player B

[Chart: monthly values from Apr to Mar on a 90% to 140% scale; series: w/o smart|query, w smart|query.]

Uses an optimized Solr implementation

slide-19
SLIDE 19

Resulting in / High recall & high precision

[Chart: Precision (PREC) and Recall (TPR) by number of Queries (500.000 to 5.000.000); axis ranges 0,97 to 1 and 0,5 to 1.]

slide-20
SLIDE 20

Resulting in / insane query performance

~0.00005 seconds for spelling a short word – 80 ops/ms

[Chart, search|hub & Elastic: Searches per Second (up to 35000) by number of Query Terms (1 to 6); series: or - term, and - term, or - fuzzy 2, and - fuzzy 2.]

slide-21
SLIDE 21

Observations

+ more relevant results
+ consistent results
+ reduced manual effort for curated search results
+ lower CPU usage
+ improved query response time
+ consistent site search analytics
- additional complexity

slide-22
SLIDE 22

04 Surprise

CXP smart|query - PreDictLib

fast & accurate spell correction at scale

slide-23
SLIDE 23

search|hub - PreDictLib

fast & accurate spell correction at scale

Quick Highlights:

  • extremely fast & constant index access
  • truly language independent edit distance
  • ability to add records to the index at runtime without performance decrease

Based on one of the most efficient spell correction implementations out there, called SymSpell, by Wolf Garbe.
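
The core idea behind SymSpell is the symmetric delete approach: only deletions are precomputed for dictionary terms, and only deletions are generated for the input, so lookup becomes a handful of hash accesses. The sketch below is a stripped-down illustration of that idea under assumed names (`SymmetricDeleteIndex`, `deletes`, `osa_distance`); it omits the optimizations of the real SymSpell and is not the PreDictLib code.

```python
from collections import defaultdict


def deletes(word, max_distance):
    """All strings reachable by deleting up to max_distance characters (word included)."""
    result, frontier = {word}, {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        result |= frontier
    return result


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2, prev = None, list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
            # adjacent transposition counts as a single edit
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


class SymmetricDeleteIndex:
    def __init__(self, max_distance=2):
        self.max_distance = max_distance
        self.index = defaultdict(set)    # delete variant -> dictionary terms

    def add(self, term):
        """Records can be added at any time; lookup cost does not depend on index size."""
        for variant in deletes(term, self.max_distance):
            self.index[variant].add(term)

    def lookup(self, query):
        """Return (distance, term) pairs within max_distance, closest first."""
        candidates = set()
        for variant in deletes(query, self.max_distance):
            candidates |= self.index.get(variant, set())
        scored = ((osa_distance(query, term), term) for term in candidates)
        return sorted(pair for pair in scored if pair[0] <= self.max_distance)
```

The final distance check is needed because two words sharing a delete variant can still be further apart than `max_distance`.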

slide-24
SLIDE 24

SymSpell / some Benchmarks

Throughput vs Accuracy

System                      Accuracy   Searches/sec (relative)
Lucene WordCorrect          88,7%      1,0%
ElasticSearch WordCorrect   45,8%      1,0%
No.2 eCommerce Search       69,2%      1,7%
No.1 in eCommerce Search    88,3%      2,2%
SymSpell                    88,7%      100,0%

slide-25
SLIDE 25

search|hub - PreDictLib

fast & accurate spell correction at scale

  • modified the edit distance to a weighted edit distance
  • replaced the Damerau-Levenshtein distance with a weighted Damerau-Levenshtein distance that takes keyboard distance into account
  • re-rank the candidate list by applying additional similarity algorithms
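
A keyboard-aware weighted distance can be sketched as follows: substituting a character with a neighbouring key costs less than substituting it with a distant one, so likely typos rank ahead of unlikely ones. The QWERTZ layout, the rule for what counts as a neighbour, and the 0.5 weight are assumptions for illustration; the weights actually used in PreDictLib are not given on the slides.

```python
# Assumption: simplified German QWERTZ layout, letters only, rows not staggered.
QWERTZ_ROWS = ["qwertzuiop", "asdfghjkl", "yxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTZ_ROWS) for c, ch in enumerate(row)}


def substitution_cost(a, b):
    """0.0 for equal characters, 0.5 for neighbouring keys (assumed weight), else 1.0."""
    if a == b:
        return 0.0
    if a in KEY_POS and b in KEY_POS:
        (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
        if abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1:
            return 0.5
    return 1.0


def weighted_distance(a, b):
    """Restricted Damerau-Levenshtein distance with keyboard-weighted substitutions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)
    for j in range(cols):
        d[0][j] = float(j)
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                   # delete
                d[i][j - 1] + 1.0,                                   # insert
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)        # transpose
    return d[rows - 1][cols - 1]


def rerank(query, candidates):
    """Order candidate corrections by weighted distance, closest first."""
    return sorted(candidates, key=lambda term: (weighted_distance(query, term), term))
```

Under these assumed weights, "spammbettlaken" is closer to "spannbettlaken" than a spelling with two non-adjacent substitutions would be, because m and n sit next to each other on the keyboard.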

slide-26
SLIDE 26

search|hub - PreDict (CE) & PreDict (EE) / some Benchmarks

Throughput vs. Accuracy

System                      Accuracy   Searches/sec (relative)
Lucene WordCorrect          89%        1%
ElasticSearch WordCorrect   46%        1%
No.2 eCommerce Search       69%        1%
No.1 in eCommerce Search    88%        2%
SymSpell                    89%        86%
CXP PreDict (CE)            89%        100%
CXP Searchhub               99%        98%

slide-27
SLIDE 27

what you'll get

CXP SmartQuery – PreDictLib (CE)

fast & accurate spell correction at scale

  • the Lib as Java source
  • accuracy and benchmark tests
  • real-life test data

https://github.com/searchhub/preDict

slide-28
SLIDE 28

Questions