
BM25 is so Yesterday: Modern Techniques for Better Search Relevance



  1. BM25 is so Yesterday: Modern Techniques for Better Search Relevance in Solr. Grant Ingersoll, CTO, Lucidworks; Lucene/Solr/Mahout Committer

  2. iPad case 😋

  3. "ipad accessory"~3 iPad case OR "ipad case"~5 😋 🤔

  4. 1. 15.

  5. 👏

  6. So, what do you do?

  7. Index time: if (doc.name.contains("Vikings")) { doc.boost = 100; } OR query time: q=(MAIN QUERY) OR (name:Vikings)^Y

  8. TF*IDF
     • Term Frequency: "How well does a term describe a document?" Measure: how often the term occurs in the document
     • Inverse Document Frequency: "How important is the term overall?" Measure: how rare the term is across all documents
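The two measures on this slide can be sketched as a minimal Python function (an illustrative in-memory version, assuming documents are already tokenized into lists of strings; it is not how Lucene computes the score internally):

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """Score one term against one document; corpus is a list of token lists."""
    tf = Counter(doc)[term]                   # term frequency: occurrences in this doc
    df = sum(1 for d in corpus if term in d)  # document frequency across the corpus
    idf = math.log(len(corpus) / (1 + df))    # rarer terms get a higher weight
    return tf * idf

docs = [["ipad", "case"], ["ipad", "charger"],
        ["phone", "case"], ["phone", "charger"]]
print(tf_idf("case", docs[0], docs))  # positive: "case" is in 2 of 4 docs
```

A term that never occurs in the document always scores zero, no matter how rare it is, since tf multiplies the whole expression.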

  9. BM25 (aka Okapi)
     Score(q, d) = Σ_{t in q} idf(t) · tf(t in d) · (k + 1) / ( tf(t in d) + k · (1 − b + b · |d| / avgdl) )
     Where:
     t = term; d = document; q = query; i = index
     tf(t in d) = number of occurrences of t in d
     idf(t) = 1 + log( numDocs / (docFreq + 1) )
     |d| = Σ_{t in d} 1 (length of d)
     avgdl = ( Σ_{d in i} |d| ) / ( Σ_{d in i} 1 ) (average document length in the index)
     k = free parameter, usually ~1.2 to 2.0; raises the term-frequency saturation point
     b = free parameter, usually ~0.75; increases the impact of document-length normalization
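The formula above translates directly into a small Python scorer (a teaching sketch over tokenized documents, with the slide's Lucene-style idf; real engines precompute these statistics in the index):

```python
import math

def bm25_score(query_terms, doc, corpus, k=1.2, b=0.75):
    """BM25 score of one document for a query, following the slide's formula."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for t in query_terms:
        tf = doc.count(t)
        if tf == 0:
            continue  # absent terms contribute nothing
        df = sum(1 for d in corpus if t in d)
        idf = 1 + math.log(n_docs / (df + 1))            # idf(t) from the slide
        norm = tf + k * (1 - b + b * len(doc) / avgdl)   # saturation + length norm
        score += idf * tf * (k + 1) / norm
    return score

docs = [["ipad", "case", "case"], ["ipad", "charger"], ["phone", "case"]]
print(bm25_score(["ipad", "case"], docs[0], docs))
```

Note how the denominator makes tf saturate: the second occurrence of "case" adds less than the first, which is the behavioral difference from plain TF*IDF that k controls.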

  10. Lather, Rinse, Repeat

  11. 💢

  12. WWGD?

  13. Measure, Measure, Measure
      • Capture and log pretty much everything: searches, time on page, first click, what was not chosen, etc.
      • Precision: of those shown, what's relevant?
      • Recall: of all that's relevant, what was found?
      • NDCG: accounts for position
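The three metrics named above can be computed in a few lines (a minimal sketch over document-ID lists and graded relevance scores; production pipelines would compute these per query and average):

```python
import math

def precision_recall(shown, relevant):
    """Set-based precision and recall over document IDs."""
    hits = len(set(shown) & set(relevant))
    return hits / len(shown), hits / len(relevant)

def ndcg(gains):
    """gains[i] = graded relevance of the result at rank i (0-based).
    Discounts gains by position, then normalizes by the ideal ordering."""
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

p, r = precision_recall(shown=["d1", "d2", "d3"], relevant=["d1", "d4"])
print(p, r)             # 1 of 3 shown is relevant; 1 of 2 relevant were found
print(ndcg([3, 0, 2]))  # < 1.0 because the gain-2 result sits at rank 3
```

NDCG is the one of the three that rewards putting the best results first: the same set of gains in the ideal order scores exactly 1.0.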

  14. The relevance stack (top to bottom):
      • Magic: fuhgeddaboudit
      • Rules, domain-specific knowledge
      • Machine learning: clicks, recommendations, personalization, user feedback
      • Search aids: facets, Did You Mean, highlighting
      • Core information theory (aka Lucene/Solr)
      • Guessing

  15. Next Generation Relevance
      • Content: core Solr capabilities (text matching, faceting, spell checking, highlighting); business rules for content (landing pages, boost/block, promotions, etc.)
      • Collaboration: leverage collective intelligence to predict what users will do based on historical, aggregated data; recommenders, popularity, search paths
      • Context: who are you? where are you? what have you done previously? User/market segmentation, roles, security, personalization

  16. But What About the Real World? Indexing Edition
      • Content → NER, topic detection, clustering
      • Domain rules: synonyms, regexes, lexical resources, Word2Vec, etc.
      • Content loaded into Spark; offline jobs build Word2Vec, PageRank, topic, and clustering models
      • Built models are loaded for use at index and query time

  17. But What About the Real World? Query Edition
      • iPad case 😋 → Parse → semantic enhancement → transform → cascading rerankers → results
      • Head/tail analysis, clickstream
      • Query intent: strategic, tactical
      • User factors: segmentation, location, history, profile, security, …
      • Domain-specific rules
      • Cascading rerankers: Learning to Rank (multi-model), bias corrections

  18. But What About the Real World? Signals Edition
      • Raw queries (iPad case 😋) and clickstream are captured as signals
      • Signals are loaded into Spark; query-analysis jobs build signals models
      • Recommender/personalization models feed back into the Query Edition pipeline
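The aggregation step in this pipeline can be sketched without Spark (a toy in-memory version of the same idea; the (query, clicked_doc) rows and SKU names are hypothetical, and a real signals job would also weight by position, recency, and signal type):

```python
from collections import defaultdict

# Hypothetical raw clickstream rows: (query, clicked_doc_id)
clicks = [
    ("ipad case", "SKU-1"), ("ipad case", "SKU-1"),
    ("ipad case", "SKU-2"), ("ipad", "SKU-1"),
]

# Aggregate click counts per (query, doc) pair.
counts = defaultdict(int)
for query, doc in clicks:
    counts[(query, doc)] += 1

# Normalize counts into per-query boost weights the query side can apply.
boosts = defaultdict(dict)
for (query, doc), n in counts.items():
    total = sum(c for (q, _), c in counts.items() if q == query)
    boosts[query][doc] = n / total

print(boosts["ipad case"])  # SKU-1 gets 2/3 of the clicks, SKU-2 gets 1/3
```

At query time these weights become per-document boost clauses, which is how aggregated behavior biases future rankings toward what users actually chose.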

  19. The Perfect(?!?) Query* (YMMV! Caveat Emptor!)
      Precision:
      (Exact/Original Match)^X
      (Sloppy Phrase)~M^Y
      (AND Q)^Z
      (OR Q)^XX
      Recall:
      (Expansions/Click/Head/Tail Boosts)^YY
      (Personalization Biases)^ZZ
      ({!ltr …}) Learning to Rank
      Filters + options: security, rules, hard preferences, categories
      Where X > Y > Z > XX; all weights can be learned.
      * Note: there are a lot of variations on this; edismax handles most.
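One way to assemble the precision-to-recall ladder above into a Solr-style query string might look like this (a sketch: the boost values are illustrative placeholders, not tuned or learned weights, and the expansion/personalization/LTR clauses are omitted):

```python
def build_query(user_q, boosts=(10, 5, 3, 1)):
    """Join boosted clauses from most precise to most recall-oriented.
    Boost values are hypothetical defaults; in practice they are learned."""
    x, y, z, xx = boosts
    terms = user_q.split()
    clauses = [
        f'("{user_q}")^{x}',                 # exact/original phrase match
        f'("{user_q}"~3)^{y}',               # sloppy phrase, slop of 3
        f'(+{" +".join(terms)})^{z}',        # AND of all terms
        f'({" ".join(terms)})^{xx}',         # OR of terms (recall net)
    ]
    return " OR ".join(clauses)

print(build_query("ipad case"))
```

Each clause widens the match set while lowering the boost, so documents matching the precise forms sort above those caught only by the recall clauses.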

  20. Experimentation, Not Editorialization
      • Don't take my word for it, experiment!
      • Good primer: http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics
      • Rules are fine, as long as they are contained, have a lifespan, and are measured for effectiveness

  21. Show Us Already, Will You!

  22. Fusion Architecture
      • Core services: ETL and query pipelines, recommenders/signals/rules, NLP, machine learning, scheduling, alerting and messaging, security
      • Apache Spark: cluster manager plus workers
      • Apache Solr: shards, with optional HDFS
      • Apache Zookeeper (ZK 1 … ZK N): shared config, leader election, load balancing
      • REST API, Admin UI, Twigkit front end
      • Built-in connectors: logs, file, web, database, cloud, security
      • But wait, there's more!

  23. Key Features
      • Solr: extensive text-ranking features, similarity models, function queries, boost/block, pluggable reranker, Learning to Rank contrib, multi-tenant
      • Spark: SparkML (random forests, regression, etc.), large-scale distributed compute

  24. Demo Details
      • Best Buy Kaggle competition data set: product catalog of ~1.3M items; signals: 1 month of query and document logs
      • Fusion 3.1 Preview + recommenders (sampled dataset) + rules (open source add-on module) + Solr LTR contrib
      • Twigkit UI (http://twigkit.com)

  25. Resources
      • http://lucidworks.com
      • http://lucene.apache.org/solr
      • http://spark.apache.org/
      • https://github.com/lucidworks/spark-solr
      • https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank
      • Bloomberg talk on LTR: https://www.youtube.com/watch?v=M7BKwJoh96s
