BM25 is so Yesterday: Modern Techniques for Better Search Relevance - PowerPoint PPT Presentation



SLIDE 1
SLIDE 2

BM25 is so Yesterday

Modern Techniques for Better Search Relevance in Solr

Grant Ingersoll CTO Lucidworks Lucene/Solr/Mahout Committer

SLIDE 3

😋

iPad case

SLIDE 4

😋

iPad case

🤔

"ipad accessory"~3 OR "ipad case"~5
SLIDE 5


SLIDE 6

👏

SLIDE 7
SLIDE 8

So, what do you do?

SLIDE 9

Index Time:
  if (doc.name.contains("Vikings")) { doc.boost = 100 }

Query Time:
  q = (MAIN QUERY) OR (name:Vikings)^Y
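The query-time approach above can be sketched as parameter building for Solr's edismax parser; the field name ("name"), boost term, and weight here are illustrative stand-ins, not values from the talk.

```python
# Sketch: boost matching docs at query time instead of index time by
# ORing the user's query with a boosted field match (edismax syntax).
from urllib.parse import urlencode

def boosted_query_params(user_query, boost_field, boost_term, weight):
    """Build edismax params that boost docs matching boost_field:boost_term."""
    return {
        "defType": "edismax",
        "q": f"({user_query}) OR ({boost_field}:{boost_term})^{weight}",
    }

params = boosted_query_params("minnesota football", "name", "Vikings", 100)
print(urlencode(params))
```

Unlike the index-time `doc.boost` approach, the weight here can be changed per request without reindexing.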

SLIDE 10
SLIDE 11
SLIDE 12
  • Term Frequency: “How well does a term describe a document?”
  • Measure: how often the term occurs in the document
  • Inverse Document Frequency: “How important is a term overall?”
  • Measure: how rare the term is across all documents

TF*IDF
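The two measures can be sketched in a few lines; the idf variant here follows the next slide's definition (1 + log(numDocs / (docFreq + 1))), and the toy corpus is illustrative.

```python
# Toy TF and IDF: "laptop" appears in 1 of 3 docs, so it is rarer,
# hence more informative, than "ipad" (2 of 3 docs).
import math

corpus = [
    "ipad case black leather".split(),
    "ipad case red".split(),
    "laptop sleeve black".split(),
]

def tf(term, doc):
    """Term frequency: how often the term occurs in this document."""
    return doc.count(term)

def idf(term):
    """Inverse document frequency: rarer terms score higher."""
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / (doc_freq + 1))

print(idf("laptop"), idf("ipad"))
```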

SLIDE 13

Score(q, d) = ∑_{t in q} idf(t) · ( tf(t in d) · (k + 1) ) / ( tf(t in d) + k · (1 − b + b · |d| / avgdl) )

Where:
  t = term; d = document; q = query; i = index
  tf(t in d) = numTermOccurrencesInDocument^(1/2)
  idf(t) = 1 + log( numDocs / (docFreq + 1) )
  |d| = ∑_{t in d} 1  (the document’s length in terms)
  avgdl = ( ∑_{d in i} |d| ) / ( ∑_{d in i} 1 )  (average document length across the index)
  k = free parameter, usually ~1.2 to 2.0; raises the term-frequency saturation point
  b = free parameter, usually ~0.75; increases the impact of document-length normalization

BM25 (aka Okapi)
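As a sketch, the formula transcribes almost directly to code. This mirrors the slide's definitions (including the square-root tf); Lucene's actual BM25Similarity differs in details such as norm encoding, so treat this as illustration only.

```python
# Direct transcription of the BM25 scoring formula above.
import math

def bm25(query_terms, doc, corpus, k=1.2, b=0.75):
    """Score one document (a list of terms) against a query."""
    num_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / num_docs    # average doc length
    score = 0.0
    for t in query_terms:
        tf = math.sqrt(doc.count(t))                  # tf(t in d)
        idf = 1 + math.log(num_docs / (sum(1 for d in corpus if t in d) + 1))
        norm = 1 - b + b * len(doc) / avgdl           # length normalization
        score += idf * (tf * (k + 1)) / (tf + k * norm)
    return score

corpus = [
    "ipad case black leather".split(),
    "ipad case red".split(),
    "laptop sleeve black".split(),
]
print(bm25(["ipad", "case"], corpus[0], corpus))
```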

SLIDE 14

Lather, Rinse, Repeat

SLIDE 15
SLIDE 16

💢

SLIDE 17

WWGD?

SLIDE 18
  • Capture and log pretty much everything
  • Searches, Time on page/1st click, What was not chosen, etc.

  • Precision — Of those shown, what’s relevant?
  • Recall — Of all that’s relevant, what was found?
  • NDCG — Account for position

Measure, Measure, Measure
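The three metrics listed above can be sketched for a single ranked result list with binary relevance judgments (a graded-relevance NDCG would weight each hit's gain by its judgment level):

```python
# Precision, Recall, and NDCG for one ranked list of doc ids.
import math

def precision(shown, relevant):
    """Of those shown, what fraction is relevant?"""
    return len([d for d in shown if d in relevant]) / len(shown)

def recall(shown, relevant):
    """Of all that's relevant, what fraction was found?"""
    return len([d for d in shown if d in relevant]) / len(relevant)

def ndcg(shown, relevant):
    """Account for position: hits near the top are worth more."""
    dcg = sum(1 / math.log2(i + 2) for i, d in enumerate(shown) if d in relevant)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(len(shown), len(relevant))))
    return dcg / ideal if ideal else 0.0

shown, relevant = ["a", "b", "c", "d"], {"a", "c"}
print(precision(shown, relevant), recall(shown, relevant), ndcg(shown, relevant))
```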

SLIDE 19

[Diagram: spectrum of relevance approaches]
  • Magic Guessing
  • Core Information Theory (aka Lucene/Solr)
  • Search Aids (Facets, Did You Mean, Highlighting)
  • Machine Learning (Clicks, Recs, Personalization, User feedback)
  • Rules, Domain Specific Knowledge: fuhgeddaboudit

SLIDE 20

Next Generation Relevance

  • Content: Core Solr capabilities (text matching, faceting, spell checking, highlighting); Business Rules for content (landing pages, boost/block, promotions, etc.)
  • Collaboration: Leverage collective intelligence to predict what users will do based on historical, aggregated data; Recommenders, Popularity, Search Paths
  • Context: Who are you? Where are you? What have you done previously? User/Market Segmentation, Roles, Security, Personalization
SLIDE 21

But What About the Real World? Indexing Edition

  • Extraction: NER, Topic Detection, Clustering, Word2Vec, etc.; Domain Rules (Synonyms, Regexes, Lexical Resources)
  • Offline: Load into Spark; build W2V, PageRank, Topic, and Clustering Models

Content Models
SLIDE 22

But What About the Real World? Query Edition

Query Intent

Strategic, Tactical, Semantic

😋

iPad case

  • Head/Tail/Clickstream enhancement
  • User Factors: Segmentation, Location, History, Profile, Security
  • Parse: Domain Specific Rules
  • Transform Results: … Cascading Rerankers, Learn To Rank (multi-model), Bias corrections
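The "Cascading Rerankers" idea can be sketched as a two-pass scorer: rank all candidates with a cheap function, then rerank only the head of the list with a more expensive one. Both scoring functions here are illustrative stand-ins (a lexical score and a click-informed score), not anything from the talk.

```python
# Cascading rerank: cheap first pass over everything, expensive second
# pass over only the top N. Docs are dicts of term counts plus metadata.
def cheap_score(doc, query):
    """First pass: simple lexical match count (BM25-ish stand-in)."""
    return sum(doc.get(t, 0) for t in query)

def expensive_score(doc, query):
    """Second pass: fold in a behavioral signal (LTR-model stand-in)."""
    return cheap_score(doc, query) + doc.get("_clicks", 0)

def cascade(docs, query, top_n=2):
    first_pass = sorted(docs, key=lambda d: cheap_score(d, query), reverse=True)
    head = sorted(first_pass[:top_n], key=lambda d: expensive_score(d, query),
                  reverse=True)
    return head + first_pass[top_n:]

docs = [{"id": 1, "ipad": 1}, {"id": 2, "ipad": 1, "_clicks": 5}, {"id": 3}]
print([d["id"] for d in cascade(docs, ["ipad"], top_n=2)])
```

The expensive model never sees documents outside the top N, which is what keeps reranking affordable at query time.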
SLIDE 23

But What About the Real World? Signals Edition

  • Signals: Load into Spark
  • Jobs: Clickstream Models, Query Analysis, Recommenders/Personalization

😋

iPad case

Query Edition

Raw Models
SLIDE 24

The Perfect(?!?) Query* (YMMV! Caveat Emptor!)

  (Exact/Original Match)^X
  (Sloppy Phrase)~M^Y
  (AND Q)^Z
  (OR Q)^XX
  (Expansions/Click/Head/Tail Boosts)^YY
  (Personalization Biases)^ZZ
  ({!ltr …})
  Filters+Options: security, rules, hard preferences, categories

The stricter clauses near the top drive Precision; the looser ones below drive Recall.

* Note: there are a lot of variations on this; edismax handles most.

Learn to Rank: X > Y > Z > XX, and all weights can be learned.
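A sketch of assembling the layered query above as edismax parameters. The slop value, the default weights satisfying X > Y > Z > XX, and the clause details are illustrative; the expansion, personalization, and {!ltr} stages are omitted.

```python
# Build a layered "perfect query": exact match, sloppy phrase, AND of
# terms, and OR of terms, each with a descending boost.
def perfect_query(q, x=10, y=5, z=2, xx=1, slop=3):
    terms = q.split()
    clauses = [
        f'"{q}"^{x}',                      # exact/original match
        f'"{q}"~{slop}^{y}',               # sloppy phrase
        f'({" AND ".join(terms)})^{z}',    # all terms required
        f'({" OR ".join(terms)})^{xx}',    # any term matches (recall)
    ]
    return {"defType": "edismax", "q": " OR ".join(clauses)}

print(perfect_query("ipad case")["q"])
```

In practice the weights would come from a learning-to-rank pass over click data rather than being hand-set like this.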
SLIDE 25
  • Don’t take my word for it, experiment!
  • Good primer: http://www.slideshare.net/InfoQ/online-controlled-experiments-introduction-insights-scaling-and-humbling-statistics

  • Rules are fine, as long as they are contained, have a lifespan, and are measured for effectiveness

Experimentation, Not Editorialization

SLIDE 26

Show Us Already, Will You!

SLIDE 27
  • But Wait, There’s More!

Fusion Architecture

SECURITY BUILT-IN

  • Apache Solr: Shards
  • Apache Zookeeper (ZK 1 … ZK N): Leader Election, Load Balancing, Shared Config Management
  • Apache Spark: Workers, Cluster Manager
  • REST API, Admin UI, Twigkit
  • Connectors: Logs, File, Web, Database, Cloud, HDFS (Optional)
  • Core Services: ETL and Query Pipelines; Recommenders/Signals/Rules; NLP; Machine Learning; Alerting and Messaging; Security; Scheduling
SLIDE 28

Key Features

  • Solr:
  • Extensive Text Ranking Features
  • Similarity Models
  • Function Queries
  • Boost/Block
  • Pluggable Reranker
  • Learn to Rank contrib
  • Multi-tenant
  • Spark:
  • SparkML (Random Forests, Regression, etc.)
  • Large scale, distributed compute
SLIDE 29
  • Best Buy Kaggle Competition Data Set
  • Product Catalog: ~1.3M
  • Signals: 1 month of query, document logs
  • Fusion 3.1 Preview + Recommenders (sampled dataset) + Rules (open source add-on module) + Solr LTR contrib

  • Twigkit UI (http://twigkit.com)

Demo Details

SLIDE 30
  • http://lucidworks.com
  • http://lucene.apache.org/solr
  • http://spark.apache.org/
  • https://github.com/lucidworks/spark-solr
  • https://cwiki.apache.org/confluence/display/solr/Learning+To+Rank

  • Bloomberg talk on LTR: https://www.youtube.com/watch?v=M7BKwJoh96s

Resources