Hacking Lucene for Custom Search Results Doug Turnbull OpenSource - PowerPoint PPT Presentation

Hacking Lucene for Custom Search Results
Doug Turnbull
OpenSource Connections

Hello
• Me: @softwaredoug, dturnbull@o19s.com
• Us: http://o19s.com - Trusted Advisors in Search, Discovery & Analytics

Tough Search Problems
• We have demanding users!
[Image caption: "Switch these two!"]

Tough Search Problems
• Demanding users! WRONG!
[Image caption: "Make search do what is in my head!"]

Tough Search Problems
• Our Eternal Problem:
  o Customers don't care about the technology field of Information Retrieval: they just want results
  o BUT we are constrained by the tech!
[Image captions: "In one ear", "Out the other", "This is how a search engine works!", "/dev/null"]

Satisfying User Expectations
• Easy: The Search Relevancy Game:
  o Solr/Elasticsearch query operations (boosts, etc.)
  o Analysis of query/index to enhance matching
• Medium: Forget this, let's write some Java
  o Solr/Elasticsearch query parsers: reuse existing Lucene Queries to get closer to user needs

That Still Didn't Work
• Look at him, he's angrier than ever!
• For the toughest problems, we've made search complex and brittle
• WHACK-A-MOLE:
  o Fix one problem, cause another
  o We give up

Next Level
• Hard: Custom Lucene Scoring. Implement a query and scorer to explicitly control matching and scoring
This is the Nuclear Option!

Shameless Plug
• How do we know if we're making progress?
• Quepid! Our test-driven search workbench

Lucene: Let's Review
• At some point we wrote a Lucene index to a directory
• Boilerplate (open up the index):
    Directory d = new RAMDirectory();
    IndexReader ir = DirectoryReader.open(d);
    IndexSearcher is = new IndexSearcher(ir);
• Boilerplate setup of:
  o Directory: Lucene's handle to the FS
  o IndexReader: access to Lucene's data structures
  o IndexSearcher: use the index searcher to perform search

Lucene: Let's Review
• Queries: make a Query and search!
• TermQuery: basic term search for a field
    Term termToFind = new Term("tag", "space");
    TermQuery spaceQ = new TermQuery(termToFind);
    termToFind = new Term("tag", "star-trek");
    TermQuery starTrekQ = new TermQuery(termToFind);
• Queries that combine queries
    BooleanQuery bq = new BooleanQuery();
    BooleanClause bClause = new BooleanClause(spaceQ, Occur.MUST);
    BooleanClause bClause2 = new BooleanClause(starTrekQ, Occur.SHOULD);
    bq.add(bClause);
    bq.add(bClause2);
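To make the MUST / SHOULD distinction on this slide concrete, here is a minimal sketch in plain Java with no Lucene on the classpath. The class `BooleanClauseSketch` and its one-point-per-clause score are hypothetical stand-ins; real Lucene scoring is similarity-based and far richer. The only point illustrated is that a MUST clause decides matching, while a SHOULD clause only adds to the score.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of MUST vs SHOULD clauses; this is not the Lucene API.
class BooleanClauseSketch {

    // A document (modeled as its list of tags) matches only if it has the MUST tag.
    static boolean matches(List<String> docTags, String mustTag) {
        return docTags.contains(mustTag);
    }

    // Toy score (an assumption): one point per satisfied clause, 0 if the doc does not match.
    static int score(List<String> docTags, String mustTag, String shouldTag) {
        if (!matches(docTags, mustTag)) {
            return 0;
        }
        return 1 + (docTags.contains(shouldTag) ? 1 : 0);
    }

    public static void main(String[] args) {
        List<String> both = Arrays.asList("space", "star-trek");
        List<String> onlyMust = Arrays.asList("space");
        List<String> onlyShould = Arrays.asList("star-trek");

        System.out.println(score(both, "space", "star-trek"));
        System.out.println(score(onlyMust, "space", "star-trek"));
        System.out.println(score(onlyShould, "space", "star-trek"));
    }
}
```

A document carrying only the SHOULD tag does not match at all, which mirrors how the "space" clause gates the results in the slide's BooleanQuery.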

Lucene: Let's Review
• A Query is responsible for specifying search behavior
• Both:
  o Matching: what documents to include in the results
  o Scoring: how relevant a result is to the query, expressed as a score

Lucene Queries, 30,000 ft view
Aka, "not really accurate, but what to tell your boss to not confuse them"
[Diagram: IndexSearcher asks LuceneQuery "Next Match Plz" and gets "Here ya go", then asks "Score That Plz" and gets "Score of last doc". LuceneQuery uses the IndexReader to "Find next Match" and "Calc. score".]

First Stop: CustomScoreQuery
• Wrap a query but override its score
[Diagram: CustomScoreQuery wraps a LuceneQuery. The wrapped query still answers "Next Match Plz" / "Here ya go" and "Score That Plz" / "Score of last doc"; CustomScoreQuery then asks its CustomScoreProvider to "Rescore doc" and returns the "New Score".]
Result:
- Matching behavior unchanged
- Scoring completely overridden
A chance to reorder results of a Lucene Query by tweaking scoring

How to use?
• Use a normal Lucene query for matching
    Term t = new Term("tag", "star-trek");
    TermQuery tq = new TermQuery(t);
• Create & use a CustomScoreQuery for scoring that wraps the Lucene query
    CountingQuery ct = new CountingQuery(tq);

Implementation
• Extend CustomScoreQuery, provide a CustomScoreProvider
    protected CustomScoreProvider getCustomScoreProvider(
            AtomicReaderContext context) throws IOException {
        return new CountingQueryScoreProvider("tag", context);
    }
  (boilerplate omitted)

Implementation
• CustomScoreProvider rescores each doc with IndexReader & docId
    // Give all docs a score of 1.0
    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        return 1.0f; // New score
    }

Implementation
• Example: Sort by number of terms in a field
    // Rescores by counting the number of terms in the field
    public float customScore(int doc, float subQueryScore, float valSrcScores[]) throws IOException {
        IndexReader r = context.reader();
        // Note: getTermVector returns null if no term vector was indexed for this doc and field
        Terms tv = r.getTermVector(doc, _field);
        TermsEnum termsEnum = null;
        termsEnum = tv.iterator(termsEnum);
        int numTerms = 0;
        while (termsEnum.next() != null) {
            numTerms++;
        }
        return (float) numTerms; // New score
    }
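The effect of this kind of rescoring can be sketched without Lucene: matching stays with the wrapped query, and only the ordering changes. In the plain-Java model below, `CountingRescoreSketch`, the list-of-terms document model, and the helper names are all hypothetical; the custom score is assumed to be the number of distinct terms in the document, which mirrors iterating a term vector's unique terms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical toy model of "wrap a query, override its score"; this is not the Lucene API.
class CountingRescoreSketch {

    // Custom score (an assumption): the number of distinct terms in the document.
    static int customScore(List<String> docTerms) {
        return new HashSet<String>(docTerms).size();
    }

    // Matching is a plain term lookup; ordering is by custom score, highest first.
    // Doc ids are simply positions in the docs list.
    static List<Integer> searchAndRescore(List<List<String>> docs, String term) {
        List<Integer> hits = new ArrayList<Integer>();
        for (int doc = 0; doc < docs.size(); doc++) {
            if (docs.get(doc).contains(term)) {
                hits.add(doc); // matching behavior is untouched by the rescoring
            }
        }
        hits.sort((a, b) -> Integer.compare(customScore(docs.get(b)), customScore(docs.get(a))));
        return hits;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<List<String>>();
        docs.add(Arrays.asList("star-trek", "space"));
        docs.add(Arrays.asList("star-trek", "space", "tng", "picard"));
        docs.add(Arrays.asList("space"));
        System.out.println(searchAndRescore(docs, "star-trek"));
    }
}
```

Doc 2 never appears in the results no matter how it would score, which is the limitation of this approach: the wrapper can reorder matches but cannot change what matches.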

CustomScoreQuery, Takeaway
• SIMPLE!
  o Relatively few gotchas or bells & whistles (we will see lots of gotchas)
• Limited
  o No tight control on what matches
• If this satisfies your requirements: you should get off the train here

Lucene: Circle Back
• I care about overriding scoring
  o CustomScoreQuery
• I need to control custom scoring and matching
  o Custom Lucene Queries!

Example: Backwards Query
• Search for terms backwards!
  o Instead of banana, let's create a query that finds ananab, matches, and scores the document (5.0)
  o But let's also match forward terms (banana), with a lower score (1.0)
• Disclaimer: it's probably possible to do this with easier means!
https://github.com/o19s/lucene-query-example/
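Before any Lucene plumbing, the scoring rule described on this slide can be written down as a small plain-Java sketch. The class `BackwardsScoringSketch` and the list-of-terms document model are hypothetical, and letting the backwards match win when a document contains both forms is an assumption; the slide does not say how that case is handled.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the slide's scoring rule; this is not the Lucene API.
class BackwardsScoringSketch {
    static final float BACKWARDS_SCORE = 5.0f;
    static final float FORWARDS_SCORE = 1.0f;

    // Reverse a term, e.g. "banana" becomes "ananab".
    static String reverse(String term) {
        return new StringBuilder(term).reverse().toString();
    }

    // Score a document (modeled as a list of terms) against a query term.
    // Returns 0 when neither the backwards nor the forwards form is present.
    static float score(List<String> docTerms, String queryTerm) {
        if (docTerms.contains(reverse(queryTerm))) {
            return BACKWARDS_SCORE; // assumption: the backwards match takes precedence
        }
        if (docTerms.contains(queryTerm)) {
            return FORWARDS_SCORE;
        }
        return 0.0f;
    }

    public static void main(String[] args) {
        System.out.println(score(Arrays.asList("ananab", "apple"), "banana"));
        System.out.println(score(Arrays.asList("banana"), "banana"));
        System.out.println(score(Arrays.asList("cherry"), "banana"));
    }
}
```

Unlike the CustomScoreQuery approach, this rule changes what matches (ananab) as well as how it scores, which is why it needs a custom query.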

Lucene Queries, 30,000 ft view
Aka, "not really accurate, but what to tell your boss to not confuse them"
[Diagram: IndexSearcher asks LuceneQuery "Next Match Plz" and gets "Here ya go", then asks "Score That Plz" and gets "Score of last doc". LuceneQuery uses the IndexReader to "Find next Match" and "Calc. score".]

Anatomy of Lucene Query
A Tale Of Three Classes:
• Queries create Weights:
  o Query-level stats for this search
  o Think "IDF" when you hear weights
• Weights create Scorers:
  o Heavy lifting: reports matches and returns a score
Weight & Scorer are inner classes of Query
[Diagram: LuceneQuery contains a Weight and a Scorer. The Scorer answers "Next Match Plz" / "Here ya go" and "Score That Plz" / "Score of last doc" by way of "Find next Match" and "Calc. score".]

Backwards Query Outline
    class BackwardsQuery {
        class BackwardsScorer {
            // matching & scoring functionality
        }
        class BackwardsWeight {
            // query normalization and other "global" stats
            public Scorer scorer(AtomicReaderContext context, …)
        }
        public Weight createWeight(IndexSearcher)
    }

How are these used?
When you do:
    Query q = new BackwardsQuery();
    idxSearcher.search(q);
This setup happens:
    Weight w = q.createWeight(idxSearcher);
    normalize(w);
    foreach IndexReader idxReader:
        Scorer s = w.scorer(idxReader);
Important to know how Lucene is calling your code
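The call order on this slide can be traced with a small stand-alone model. The interfaces below are hypothetical, stripped-down stand-ins for Lucene's Query, Weight, and Scorer (the real signatures take more arguments), and segments are modeled as plain strings. The sketch only records the sequence: one weight per search, normalized once, then one scorer per index reader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the call order only; these are not the Lucene interfaces.
class CallSequenceSketch {
    interface Scorer { }

    interface Weight {
        void normalize();
        Scorer scorer(String segment);
    }

    interface Query {
        Weight createWeight();
    }

    // A query that records every call made on it and on the weight it creates.
    static Query loggingQuery(final List<String> log) {
        return new Query() {
            public Weight createWeight() {
                log.add("createWeight");
                return new Weight() {
                    public void normalize() {
                        log.add("normalize");
                    }

                    public Scorer scorer(String segment) {
                        log.add("scorer(" + segment + ")");
                        return new Scorer() { };
                    }
                };
            }
        };
    }

    // Mimics the setup order from the slide and returns the recorded calls.
    static List<String> search(List<String> segments) {
        List<String> log = new ArrayList<String>();
        Weight w = loggingQuery(log).createWeight();
        w.normalize();
        for (String segment : segments) {
            w.scorer(segment); // a fresh scorer for each index reader
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(search(Arrays.asList("segment0", "segment1")));
    }
}
```

The practical consequence is that anything computed in the weight is shared across readers, while scorer state is rebuilt for each one.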

Weight
What should we do with our weight?
    Weight w = q.createWeight(idxSearcher);
    normalize(w);
IndexSearcher-level stats
- Notice we pass the IndexSearcher when we create the weight
- Weight tracks IndexSearcher-level statistics used for scoring
Query normalization
- Weight also participates in query normalization
Remember: it's your Weight! Weight can be a no-op and just create scorers

Weight & Query Normalization
Query normalization: an optional little ritual to take your Weight instance through:
    float v = weight.getValueForNormalization();   // what I think my weight is
    float norm = getSimilarity().queryNorm(v);     // normalize that weight against global statistics
    weight.normalize(norm, 1.0f);                  // pass back the normalized stats

Weight & Query Normalization
    float v = weight.getValueForNormalization();
    float norm = getSimilarity().queryNorm(v);
    weight.normalize(norm, 1.0f);
• For TermQuery:
  o The result of all this ceremony is the IDF (inverse document frequency) of the term.
• This code is fairly abstract
  o All three steps are pluggable, and can be totally ignored
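A worked version of the three steps shows why a single-term query ends up with the IDF. The sketch below is plain Java arithmetic, not Lucene code. It assumes the classic TF-IDF formulas as I understand them from the Lucene 4.x era (idf = 1 + ln(numDocs / (docFreq + 1)), queryNorm = 1 / sqrt(sum of squared weights), and a term weight that starts as idf times a boost of 1); verify these against the similarity in your version. The class and method names are hypothetical.

```java
// Hypothetical arithmetic sketch of query normalization for one term; not the Lucene API.
class QueryNormSketch {

    // Assumed classic IDF formula: 1 + ln(numDocs / (docFreq + 1)).
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Assumed classic query norm: 1 / sqrt(sumOfSquaredWeights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Runs the three-step ritual for a single term with boost 1.
    static float normalizedTermWeight(long docFreq, long numDocs) {
        float idf = idf(docFreq, numDocs);
        float queryWeight = idf;              // idf * boost, with boost = 1
        float v = queryWeight * queryWeight;  // step 1: getValueForNormalization()
        float norm = queryNorm(v);            // step 2: queryNorm(v)
        queryWeight *= norm * 1.0f;           // step 3: normalize(norm, topLevelBoost = 1.0f)
        return queryWeight * idf;             // the weight's final value
    }

    public static void main(String[] args) {
        System.out.println("idf             = " + idf(9, 1000));
        System.out.println("after normalize = " + normalizedTermWeight(9, 1000));
    }
}
```

With only one term, the norm is 1 / idf, so the normalized query weight collapses to 1 and the final value is the IDF itself. With several clauses the sum of squares spans all of them and the values no longer cancel.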

BackwardsWeight
• Custom Weight that completely ignores query normalization:
    @Override
    public float getValueForNormalization() throws IOException {
        return 0.0f;
    }

    @Override
    public void normalize(float norm, float topLevelBoost) {
        // no-op
    }

Weights make Scorers!
    @Override
    public Scorer scorer(AtomicReaderContext context, boolean scoreDocsInOrder,
            boolean topScorer, Bits acceptDocs) throws IOException {
        return new BackwardsScorer(...);
    }
• Scorers have two jobs:
  o Match! An iterator interface over matching results
  o Score! Score the current match

Scorer as an iterator
• Inherits the following from DocIdSetIterator (class hierarchy: DocIdSetIterator > DocsEnum > Scorer):
• nextDoc()
  o Next match
• advance(int docId)
  o Seek to the first match at or after the specified docId
• docID()
  o Id of the current document we're on
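The iterator contract listed here can be modeled in a few lines of plain Java. `DocIdIteratorSketch` is a hypothetical stand-in that walks a sorted array of matching doc ids; it is not Lucene's DocIdSetIterator, though it borrows the convention of returning a sentinel (`NO_MORE_DOCS`, here `Integer.MAX_VALUE`) once matches are exhausted and -1 before iteration starts.

```java
// Hypothetical model of the nextDoc / advance / docID contract; not the Lucene class.
class DocIdIteratorSketch {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] matches; // matching doc ids, sorted ascending
    private int pos = -1;        // -1 means iteration has not started

    DocIdIteratorSketch(int... sortedMatches) {
        this.matches = sortedMatches;
    }

    // Id of the current document: -1 before the first call, NO_MORE_DOCS when exhausted.
    int docID() {
        if (pos < 0) {
            return -1;
        }
        return pos < matches.length ? matches[pos] : NO_MORE_DOCS;
    }

    // Move to the next match and return its id.
    int nextDoc() {
        if (pos < matches.length) {
            pos++;
        }
        return docID();
    }

    // Move forward to the first match whose id is at or after target.
    int advance(int target) {
        int doc;
        do {
            doc = nextDoc();
        } while (doc < target);
        return doc;
    }

    public static void main(String[] args) {
        DocIdIteratorSketch it = new DocIdIteratorSketch(2, 5, 9);
        for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
            System.out.println("match: " + doc);
        }
    }
}
```

A real scorer adds `score()` on top of this, returning the score of whatever document `docID()` currently points at.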

In other words… Actually…
• Remember THIS?
[Diagram: the 30,000 ft view again, with LuceneQuery replaced by LuceneScorer. "Next Match Plz" is nextDoc(), answered with "Here ya go"; "Score That Plz" is score(), answered with "Score of curr doc". The scorer uses the IndexReader to "Find next Match" and "Calc. score".]
Scorer == Engine of the Query
