

SLIDE 1

Integrating Query Performance Prediction in Term Scoring for Diachronic Thesaurus

Chaya Liebeskind and Ido Dagan

LaTeCH2015

SLIDE 2

Research Context: Domain-Specific Diachronic Corpus

Example: searching for “vegetarian” in a biblical scholarship archive

  • Ancient text: “Of every tree of the garden thou mayest freely eat: … and thou shalt eat the herb of the field;” (King James Bible, Genesis)
  • Modern text: “Were All Men Vegetarians before the Flood? … God instructed Adam saying, ‘I have given you every herb that yields…’ (Genesis 1:29) …” (by Eric Lyons, M.Min.)

SLIDE 3

Diachronic Thesaurus

A useful tool for supporting searches in a diachronic corpus

  • Users are mostly aware of modern language
  • Target term (modern): vegetarian
  • Related terms (ancient): tree of the garden, herb of the field

SLIDE 4

Diachronic Thesaurus

Our task: Collecting a relevant list of modern target terms

  • Domain/corpus dependent

Prior work: Collecting relevant related terms

  • For given thesaurus entries

SLIDE 5

Diachronic Thesaurus: Our Task

  • Utilize a given candidate list of modern terms as input
  • Predict which candidates are relevant for the domain corpus

Example candidates (× marks a candidate judged not relevant): vegetarian, ecology, × cell-phone, × computer

SLIDE 6

Background: Terminology Extraction (TE)

Corpus-based Terminology Extraction

  • 1. Automatically extract prominent terms from a given corpus
  • 2. Score candidate terms for domain relevancy

Scoring candidate terms for domain relevancy uses statistical measures for identifying prominent terms, based on

  • Frequencies in the target corpus (e.g. tf, tf-idf), or
  • Comparison with frequencies in a reference background corpus
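
The two families of scoring measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `tf_idf` and `frequency_ratio`, the corpus-level tf-idf variant, the additive smoothing, and the toy corpora are all hypothetical and are not taken from the talk.

```python
import math
from collections import Counter

def tf_idf(term, docs):
    """One simple corpus-level variant: total term frequency * log(N / document frequency)."""
    df = sum(1 for doc in docs if term in doc)
    if df == 0:
        return 0.0  # the term never occurs in the target corpus
    tf = sum(doc.count(term) for doc in docs)
    return tf * math.log(len(docs) / df)

def frequency_ratio(term, target_docs, background_docs, smoothing=1.0):
    """Relative frequency in the target corpus divided by that in the background corpus."""
    target = Counter(tok for doc in target_docs for tok in doc)
    background = Counter(tok for doc in background_docs for tok in doc)
    # Additive smoothing (an assumption) avoids division by zero for unseen terms.
    p_target = (target[term] + smoothing) / (sum(target.values()) + smoothing)
    p_background = (background[term] + smoothing) / (sum(background.values()) + smoothing)
    return p_target / p_background

# Toy, hand-made tokenized corpora (not data from the talk).
target_corpus = [["herb", "field", "eat"], ["herb", "garden", "tree"], ["eat", "tree"]]
background_corpus = [["phone", "computer"], ["computer", "eat"]]

for term in ("herb", "computer"):
    print(term, tf_idf(term, target_corpus),
          frequency_ratio(term, target_corpus, background_corpus))
```

A ratio above 1 suggests the term is more typical of the target corpus than of the background corpus; a ratio below 1 suggests the opposite.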

SLIDE 7

Supervised framework for TE

  • 1. Candidate target terms are learning instances
  • 2. Calculate a set of features for each candidate
  • 3. Classification predicts which candidates are suitable

Baseline system (TE)

  • Features: state-of-the-art TE scoring measures
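
The three steps of the supervised framework can be illustrated with a minimal sketch. Several things here are assumptions made only for the example: the feature values in `toy_scores` are invented, the helper names are hypothetical, and a perceptron is swapped in as a small dependency-free stand-in for the actual classifier (the evaluation uses a Support Vector Machine with polynomial kernel).

```python
def featurize(term, measures):
    """Step 2: compute one feature per scoring measure for a candidate term."""
    return [measure(term) for measure in measures]

def train_perceptron(X, y, epochs=100):
    """Stand-in linear classifier trained on labels +1 (relevant) / -1 (not relevant)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errors = 0
        for x, label in zip(X, y):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * activation <= 0:  # misclassified: move the boundary
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
                errors += 1
        if errors == 0:
            break
    return w, b

def predict(w, b, x):
    """Step 3: predict whether a candidate is suitable (+1) or not (-1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Step 1: candidate target terms are the learning instances.
# The two scores per term are invented placeholders for TE scoring measures.
toy_scores = {"vegetarian": (0.8, 0.6), "ecology": (0.7, 0.9),
              "cell-phone": (0.1, 0.2), "computer": (0.2, 0.1)}
measures = [lambda t: toy_scores[t][0], lambda t: toy_scores[t][1]]
labels = {"vegetarian": 1, "ecology": 1, "cell-phone": -1, "computer": -1}

terms = list(toy_scores)
X = [featurize(t, measures) for t in terms]
y = [labels[t] for t in terms]
w, b = train_perceptron(X, y)
predictions = {t: predict(w, b, x) for t, x in zip(terms, X)}
print(predictions)
```

Adding further measures, such as QPP scores, only requires appending more callables to `measures`; the rest of the pipeline stays unchanged.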

SLIDE 8

Contributions

  • 1. Integrating Query Performance Prediction in term scoring
  • 2. Penetrating to ancient texts, via query expansion

SLIDE 9

Contribution #1

Integrating Query Performance Prediction in Term Scoring

SLIDE 10

Query Performance Prediction (QPP)

Estimate the retrieval quality of search queries

  • Assess quality of query results on the text collection.

Our terminology scoring task

  • QPP scoring measures are potentially useful: they may capture additional aspects of term relevancy for the collection
  • Assumed correspondence: a term that is relevant for a domain tends to be a good query

SLIDE 11

Query Performance Prediction (QPP)

Two types of statistical QPP methods

  • 1. Pre-retrieval methods
      • Analyze the query terms’ distribution within the corpus
  • 2. Post-retrieval methods
      • Additionally analyze the top search results
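
The slides do not name the individual QPP measures. As a hedged illustration of the two types, the sketch below uses two well-known predictors from the QPP literature: average IDF of the query terms (pre-retrieval) and a heavily simplified Clarity-style score (post-retrieval), computed as the KL divergence between the language model of the top results and that of the whole collection. The toy retrieval function, the smoothing in `avg_idf`, and the documents are assumptions for the example.

```python
import math
from collections import Counter

def avg_idf(query_terms, docs):
    """Pre-retrieval: looks only at how the query terms are distributed in the corpus."""
    n = len(docs)
    idfs = []
    for term in query_terms:
        df = sum(1 for doc in docs if term in doc)
        idfs.append(math.log((n + 1) / (df + 1)))  # add-one smoothing (assumption)
    return sum(idfs) / len(idfs)

def retrieve(query_terms, docs, k):
    """Toy ranking: documents ordered by total occurrences of the query terms."""
    scored = [(sum(doc.count(t) for t in set(query_terms)), i) for i, doc in enumerate(docs)]
    hits = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [docs[i] for _, i in hits[:k]]

def simplified_clarity(query_terms, docs, k=2):
    """Post-retrieval: KL divergence (bits) of the top-k results from the collection."""
    top = retrieve(query_terms, docs, k)
    if not top:
        return 0.0
    top_counts = Counter(tok for doc in top for tok in doc)
    coll_counts = Counter(tok for doc in docs for tok in doc)
    top_total, coll_total = sum(top_counts.values()), sum(coll_counts.values())
    return sum((c / top_total) * math.log2((c / top_total) / (coll_counts[w] / coll_total))
               for w, c in top_counts.items())

# Toy tokenized documents (not data from the talk).
docs = [["herb", "field", "eat"], ["herb", "garden", "tree"],
        ["eat", "tree", "garden"], ["law", "court", "witness"]]

print(avg_idf(["herb"], docs), simplified_clarity(["herb"], docs))
```

A higher value of either score is read as a sign that the query is more focused, and therefore more likely to retrieve well.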

SLIDE 12

Query Performance Prediction (QPP)

Integrate QPP measures as additional features

First integrated system (TE-QPPTerm)

  • Applies the QPP measures to the candidate term as the query
  • Utilizes these scores as additional classification features

SLIDE 13

Contribution #2

Penetrating to ancient texts

SLIDE 14

Penetrating to ancient periods

In a diachronic corpus

  • A candidate term might be rare in its original modern form, yet frequently referred to by archaic forms

Query term: vegetarian

  • Modern text: “Were All Men Vegetarians before the Flood? … God instructed Adam saying, ‘I have given you every herb that yields…’ (Genesis 1:29) …” (by Eric Lyons, M.Min.)
  • Ancient text: “Of every tree of the garden thou mayest freely eat: … and thou shalt eat the herb of the field;” (King James Bible, Genesis)

SLIDE 15

Penetrating to ancient periods

Baseline (TE) and First integrated system (TE-QPPTerm)

  • Rely on corpus occurrences of the original candidate term
  • Prioritize relatively frequent terms

Our inspiration

  • A post-retrieval QPP method

Query Feedback measure (Zhou and Croft, 2007)
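
As commonly described, the Query Feedback measure builds a new query from the top results of the original query, runs it, and takes the overlap between the two result lists as the predicted quality. The sketch below follows that idea under simplifying assumptions: the toy search function, the frequency-based choice of feedback terms (a stand-in for the language-model-based selection of the original method), and the documents are all hypothetical.

```python
from collections import Counter

def search(query_terms, docs, k):
    """Toy engine: rank document ids by total occurrences of the query terms."""
    scored = [(sum(doc.count(t) for t in set(query_terms)), i) for i, doc in enumerate(docs)]
    hits = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [i for _, i in hits[:k]]

def feedback_query(result_ids, docs, n_terms):
    """Build a new query from the most frequent terms of the retrieved documents."""
    counts = Counter(tok for i in result_ids for tok in docs[i])
    return sorted(counts, key=lambda w: (-counts[w], w))[:n_terms]  # ties broken alphabetically

def query_feedback_score(query_terms, docs, k=2, n_terms=2):
    """Fraction of the k result slots shared by the original and the feedback query."""
    original = search(query_terms, docs, k)
    if not original:
        return 0.0
    new_results = search(feedback_query(original, docs, n_terms), docs, k)
    return len(set(original) & set(new_results)) / k

# Toy tokenized documents (not data from the talk).
docs = [["herb", "field", "eat", "herb"], ["herb", "garden", "tree"],
        ["eat", "tree", "garden"], ["law", "court", "witness"]]

print(query_feedback_score(["herb"], docs))
```

The relevant point for term scoring is that the score depends on the retrieved documents as a whole, not only on occurrences of the original term.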

SLIDE 16

Penetrating to ancient periods

Second integrated system (TE-QPPQE)

  • Utilizes Pseudo Relevance Feedback Query Expansion

[Diagram: query → Search Engine → Top Results → Query Expansion → query’ → QPP → QPP score]
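
A minimal sketch of pseudo relevance feedback query expansion, assuming a toy search function and hand-made documents: the top results of the original query supply the expansion terms, and the expanded query then reaches documents that never mention the modern term. The helper names and the frequency-based term selection are assumptions for illustration; in the second integrated system the QPP measures would be applied to the expanded query.

```python
from collections import Counter

def search(query_terms, docs, k):
    """Toy engine: rank document ids by total occurrences of the query terms."""
    scored = [(sum(doc.count(t) for t in set(query_terms)), i) for i, doc in enumerate(docs)]
    hits = sorted((p for p in scored if p[0] > 0), key=lambda p: (-p[0], p[1]))
    return [i for _, i in hits[:k]]

def expand_query(query_terms, docs, k=1, n_terms=1):
    """Add the n most frequent non-query terms found in the top-k results."""
    top = search(query_terms, docs, k)
    counts = Counter(tok for i in top for tok in docs[i] if tok not in query_terms)
    best = sorted(counts, key=lambda w: (-counts[w], w))[:n_terms]
    return list(query_terms) + best

def matching_docs(query_terms, docs):
    """Ids of documents containing at least one query term."""
    return [i for i, doc in enumerate(docs) if any(t in doc for t in query_terms)]

# Toy documents: one modern text mentioning "vegetarian", two ancient texts that do not.
docs = [["vegetarian", "herb", "adam", "herb"],  # modern
        ["herb", "field", "eat"],                # ancient
        ["herb", "tree", "garden", "eat"],       # ancient
        ["law", "court", "witness"]]

query = ["vegetarian"]
expanded = expand_query(query, docs)
print(expanded, matching_docs(query, docs), matching_docs(expanded, docs))
```

In this toy setting the original query matches only the modern document, while the expanded query also matches the two ancient ones.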

SLIDE 17

Evaluation Setting

Diachronic corpus: the Responsa Project

  • Questions posed to rabbis along with their detailed rabbinic answers
  • Written over a period of about a thousand years
  • 76,760 articles
  • Used for previous IR and NLP research

Candidate target terms

  • Hebrew Wikipedia entries
  • Balanced for positive and negative examples
  • #candidates: 500 train, 200 test

Classifier

  • Support Vector Machine with polynomial kernel

SLIDE 18

Results

Additional QPP features increase the classification accuracy

Feature Set      Accuracy (%)
TE (baseline)    61.5
TE-QPPTerm       65
TE-QPPQE         66.5*

Utilizing ancient documents, via query expansion, improves performance

* Improvement over baseline statistically significant (p<0.05, McNemar’s test)
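
For readers unfamiliar with McNemar’s test, the sketch below shows the arithmetic. With 200 test candidates, the reported accuracies correspond to 123, 130 and 133 correct decisions. The test itself depends on the discordant pairs, that is, the candidates on which exactly one of the two systems is correct, and the slides do not report those counts. The pair (5, 15) used below is purely hypothetical, chosen only to match the net gain of 10, and the two-sided exact (binomial) variant is an assumption about which form of the test was applied.

```python
from math import comb

def correct_count(accuracy_percent, n_items):
    """Number of correct decisions implied by an accuracy on n_items test items."""
    return round(accuracy_percent / 100 * n_items)

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: items the first system classifies correctly and the second incorrectly
    c: items the second system classifies correctly and the first incorrectly
    """
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

baseline_correct = correct_count(61.5, 200)
qppqe_correct = correct_count(66.5, 200)
net_gain = qppqe_correct - baseline_correct

# Hypothetical discordant counts (NOT from the talk) consistent with the net gain.
hypothetical_b, hypothetical_c = 5, 15
p_value = mcnemar_exact(hypothetical_b, hypothetical_c)
print(baseline_correct, qppqe_correct, net_gain, p_value)
```

The same net gain could come from many different discordant counts, and larger counts with the same difference give larger p-values, which is why accuracy alone does not determine significance.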

SLIDE 19

Summary

Task: target term selection for a diachronic thesaurus

Main contributions:

  • 1. Integrating Query Performance Prediction in Term Scoring
  • 2. Penetrating to ancient texts via query expansion

Future work

  • Utilize additional query expansion algorithms
  • Investigate the selective query expansion approach