BM25 is so Yesterday: Modern Techniques for Better Search Relevance - PowerPoint PPT Presentation



SLIDE 1
SLIDE 2

BM25 is so Yesterday

Modern Techniques for Better Search Relevance in Solr

Grant Ingersoll CTO Lucidworks Lucene/Solr/Mahout Committer

SLIDE 3

😋

iPad case

SLIDE 4

😋

iPad case

🤔

"ipad accessory"~3 OR "ipad case"~5
SLIDE 5


SLIDE 6

👏

SLIDE 7
SLIDE 8

So, what do you do?

SLIDE 9

Index Time:
  if (doc.name.contains("Vikings")) { doc.boost = 100 }

Query Time:
  q = (MAIN QUERY) OR (name:Vikings)^Y
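The query-time approach above can be sketched as parameter building for Solr's edismax parser; the field name ("name"), boost term, and weight here are illustrative stand-ins, not values from the talk.

```python
# Sketch: boost matching docs at query time instead of index time by
# ORing the user's query with a boosted field match (edismax syntax).
from urllib.parse import urlencode

def boosted_query_params(user_query, boost_field, boost_term, weight):
    """Build edismax params that boost docs matching boost_field:boost_term."""
    return {
        "defType": "edismax",
        "q": f"({user_query}) OR ({boost_field}:{boost_term})^{weight}",
    }

params = boosted_query_params("minnesota football", "name", "Vikings", 100)
print(urlencode(params))
```

Unlike the index-time `doc.boost` approach, the weight here can be changed per request without reindexing.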

SLIDE 10
SLIDE 11
SLIDE 12
  • Term Frequency: “How well does a term describe a document?”
  • Measure: how often the term occurs in the document
  • Inverse Document Frequency: “How important is a term overall?”
  • Measure: how rare the term is across all documents

TF*IDF
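The two measures can be sketched in a few lines; the idf variant here follows the next slide's definition (1 + log(numDocs / (docFreq + 1))), and the toy corpus is illustrative.

```python
# Toy TF and IDF: "laptop" appears in 1 of 3 docs, so it is rarer,
# hence more informative, than "ipad" (2 of 3 docs).
import math

corpus = [
    "ipad case black leather".split(),
    "ipad case red".split(),
    "laptop sleeve black".split(),
]

def tf(term, doc):
    """Term frequency: how often the term occurs in this document."""
    return doc.count(term)

def idf(term):
    """Inverse document frequency: rarer terms score higher."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / (doc_freq + 1))

print(idf("laptop"), idf("ipad"))
```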

SLIDE 13

Score(q, d) = ∑_{t in q} idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 − b + b · |d| / avgdl) )

Where:
  t = term; d = document; q = query; i = index
  tf(t in d) = numTermOccurrencesInDocument^(1/2)
  idf(t) = 1 + log( numDocs / (docFreq + 1) )
  |d| = ∑_{t in d} 1  (the document’s length in terms)
  avgdl = ( ∑_{d in i} |d| ) / ( ∑_{d in i} 1 )  (average document length across the index)
  k = free parameter, usually ~1.2 to 2.0; raises the term-frequency saturation point
  b = free parameter, usually ~0.75; increases the impact of document-length normalization

BM25 (aka Okapi)
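As a sketch, the formula transcribes almost directly to code. This mirrors the slide's definitions (including the square-root tf); Lucene's actual BM25Similarity differs in details such as norm encoding, so treat this as illustration only.

```python
# Direct transcription of the BM25 scoring formula above.
import math

def bm25(query_terms, doc, corpus, k=1.2, b=0.75):
    """Score one document (a list of terms) against a query."""
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs    # average doc length
    score = 0.0
    for t in query_terms:
        tf = math.sqrt(doc.count(t))                  # tf(t in d)
        idf = 1 + math.log(num_docs / (sum(1 for d in corpus if t in d) + 1))
        norm = 1 - b + b * len(doc) / avgdl           # length normalization
        score += idf * (tf * (k + 1)) / (tf + k * norm)
    return score

corpus = [
    "ipad case black leather".split(),
    "ipad case red".split(),
    "laptop sleeve black".split(),
]
print(bm25(["ipad", "case"], corpus[0], corpus))
```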

SLIDE 14

Lather, Rinse, Repeat

SLIDE 15
SLIDE 16

💢

SLIDE 17

WWGD?

SLIDE 18
  • Capture and log pretty much everything
  • Searches, Time on page/1st click, What was not chosen, etc.

  • Precision — Of those shown, what’s relevant?
  • Recall — Of all that’s relevant, what was found?
  • NDCG — Account for position

Measure, Measure, Measure
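The three metrics listed above can be sketched for a single ranked result list with binary relevance judgments (a graded-relevance NDCG would weight each hit's gain by its judgment level):

```python
# Precision, Recall, and NDCG for one ranked list of doc ids.
import math

def precision(shown, relevant):
    """Of those shown, what fraction is relevant?"""
    return len([d for d in shown if d in relevant]) / len(shown)

def recall(shown, relevant):
    """Of all that's relevant, what fraction was found?"""
    return len([d for d in shown if d in relevant]) / len(relevant)

def ndcg(shown, relevant):
    """Account for position: hits near the top are worth more."""
    dcg = sum(1 / math.log2(i + 2) for i, d in enumerate(shown) if d in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(shown), len(relevant))))
    return dcg / ideal if ideal else 0.0

shown, relevant = ["a", "b", "c", "d"], {"a", "c"}
print(precision(shown, relevant), recall(shown, relevant), ndcg(shown, relevant))
```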

SLIDE 19

[Diagram: spectrum of relevance approaches]
  • Magic Guessing
  • Core Information Theory (aka Lucene/Solr)
  • Search Aids (Facets, Did You Mean, Highlighting)
  • Machine Learning (Clicks, Recs, Personalization, User feedback)
  • Rules, Domain Specific Knowledge: fuhgeddaboudit

SLIDE 20

Next Generation Relevance

  • Content: Core Solr capabilities (text matching, faceting, spell checking, highlighting); Business Rules for content (landing pages, boost/block, promotions, etc.)
  • Collaboration: Leverage collective intelligence to predict what users will do based on historical, aggregated data; Recommenders, Popularity, Search Paths
  • Context: Who are you? Where are you? What have you done previously? User/Market Segmentation, Roles, Security, Personalization
SLIDE 21

But What About the Real World? Indexing Edition

  • Extraction: NER, Topic Detection, Clustering, Word2Vec, etc.; Domain Rules (Synonyms, Regexes, Lexical Resources)
  • Offline: Load into Spark; build W2V, PageRank, Topic, and Clustering Models

Content Models
SLIDE 22

But What About the Real World? Query Edition

Query Intent

Strategic, Tactical, Semantic

😋

iPad case

  • Head/Tail/Clickstream enhancement
  • User Factors: Segmentation, Location, History, Profile, Security
  • Parse: Domain Specific Rules
  • Transform Results: … Cascading Rerankers, Learn To Rank (multi-model), Bias corrections
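The "Cascading Rerankers" idea can be sketched as a two-pass scorer: rank all candidates with a cheap function, then rerank only the head of the list with a more expensive one. Both scoring functions here are illustrative stand-ins (a lexical score and a click-informed score), not anything from the talk.

```python
# Cascading rerank: cheap first pass over everything, expensive second
# pass over only the top N. Docs are dicts of term counts plus metadata.
def cheap_score(doc, query):
    """First pass: simple lexical match count (BM25-ish stand-in)."""
    return sum(doc.get(t, 0) for t in query)

def expensive_score(doc, query):
    """Second pass: fold in a behavioral signal (LTR-model stand-in)."""
    return cheap_score(doc, query) + doc.get("_clicks", 0)

def cascade(docs, query, top_n=2):
    first_pass = sorted(docs, key=lambda d: cheap_score(d, query), reverse=True)
    head = sorted(first_pass[:top_n], key=lambda d: expensive_score(d, query),
                  reverse=True)
    return head + first_pass[top_n:]

docs = [{"id": 1, "ipad": 1}, {"id": 2, "ipad": 1, "_clicks": 5}, {"id": 3}]
print([d["id"] for d in cascade(docs, ["ipad"], top_n=2)])
```

The expensive model never sees documents outside the top N, which is what keeps reranking affordable at query time.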
SLIDE 23

But What About the Real World? Signals Edition

  • Signals: Load into Spark
  • Jobs: Clickstream Models, Query Analysis, Recommenders/Personalization

😋

iPad case

Query Edition

Raw Models
SLIDE 24

The Perfect(?!?) Query* (YMMV! Caveat Emptor!)

  (Exact/Original Match)^X
  (Sloppy Phrase)~M^Y
  (AND Q)^Z
  (OR Q)^XX
  (Expansions/Click/Head/Tail Boosts)^YY
  (Personalization Biases)^ZZ
  ({!ltr …})
  Filters+Options: security, rules, hard preferences, categories

The stricter clauses near the top drive Precision; the looser ones below drive Recall.

* Note: there are a lot of variations on this; edismax handles most.

Learn to Rank: X > Y > Z > XX, and all weights can be learned.
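A sketch of assembling the layered query above as edismax parameters. The slop value, the default weights satisfying X > Y > Z > XX, and the clause details are illustrative; the expansion, personalization, and {!ltr} stages are omitted.

```python
# Build a layered "perfect query": exact match, sloppy phrase, AND of
# terms, and OR of terms, each with a descending boost.
def perfect_query(q, x=10, y=5, z=2, xx=1, slop=3):
    terms = q.split()
    clauses = [
        f'"{q}"^{x}',                      # exact/original match
        f'"{q}"~{slop}^{y}',               # sloppy phrase
        f'({" AND ".join(terms)})^{z}',    # all terms required
        f'({" OR ".join(terms)})^{xx}',    # any term matches (recall)
    ]
    return {"defType": "edismax", "q": " OR ".join(clauses)}

print(perfect_query("ipad case")["q"])
```

In practice the weights would come from a learning-to-rank pass over click data rather than being hand-set like this.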
SLIDE 25
  • Don’t take my word for it, experiment!
  • Good primer: http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics

  • Rules are fine, as long as they are contained, have a lifespan, and are measured for effectiveness

Experimentation, Not Editorialization

SLIDE 26

Show Us Already, Will You!

SLIDE 27
  • But Wait, There’s More!

Fusion Architecture

SECURITY BUILT-IN

  • Apache Solr: Shards
  • Apache Zookeeper (ZK 1 … ZK N): Leader Election, Load Balancing, Shared Config Management
  • Apache Spark: Workers, Cluster Manager
  • REST API, Admin UI, Twigkit
  • Connectors: Logs, File, Web, Database, Cloud, HDFS (Optional)
  • Core Services: ETL and Query Pipelines; Recommenders/Signals/Rules; NLP; Machine Learning; Alerting and Messaging; Security; Scheduling
SLIDE 28

Key Features

  • Solr:
  • Extensive Text Ranking Features
  • Similarity Models
  • Function Queries
  • Boost/Block
  • Pluggable Reranker
  • Learn to Rank contrib
  • Multi-tenant
  • Spark:
  • SparkML (Random Forests, Regression, etc.)
  • Large scale, distributed compute
SLIDE 29
  • Best Buy Kaggle Competition Data Set
  • Product Catalog: ~1.3M
  • Signals: 1 month of query, document logs
  • Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib

  • Twigkit UI (http://twigkit.com)

Demo Details

SLIDE 30
  • http://lucidworks.com
  • http://lucene.apache.org/solr
  • http://spark.apache.org/
  • https://github.com/lucidworks/spark-solr
  • https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank

  • Bloomberg talk on LTR: https://www.youtube.com/watch?v=M7BKwJoh96s

Resources