

SLIDE 1

Language models Text classification Naive Bayes Evaluation of text classification

NPFL103: Information Retrieval (8)

Language Models for Information Retrieval, Text Classification

Pavel Pecina

pecina@ufal.mff.cuni.cz Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics Charles University

Original slides are courtesy of Hinrich Schütze, University of Stuttgart.

SLIDE 2

Contents

▶ Language models
▶ Text classification
▶ Naive Bayes
▶ Evaluation of text classification

SLIDE 3

Language models


SLIDE 4

Using language models for Information Retrieval

View the document d as a generative model that generates the query q. What we need to do:

1. Define the precise generative model we want to use
2. Estimate parameters (different for each document's model)
3. Smooth to avoid zeros
4. Apply to query and find document most likely to generate the query
5. Present most likely document(s) to user

SLIDE 5

What is a language model?

We can view a finite state automaton as a deterministic language model.

Example: I wish I wish I wish I wish I wish …
Cannot generate: "wish I wish" or "I wish I"

Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

SLIDE 6

A probabilistic language model

This is a one-state probabilistic finite-state automaton – a unigram language model – and the state emission distribution for its one state q1. STOP is a special symbol indicating that the automaton stops.

w      P(w|q1)     w       P(w|q1)
STOP   0.2         toad    0.01
the    0.2         said    0.03
a      0.1         likes   0.02
frog   0.01        that    0.04
…      …

Example: frog said that toad likes frog STOP
P(string) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 0.0000000000048

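The probability computation above can be sketched in a few lines of Python (the dictionary repeats only the emission probabilities from the table; variable names are my own):

```python
# Emission probabilities of the one-state automaton q1 (from the table above)
model = {"STOP": 0.2, "the": 0.2, "a": 0.1, "frog": 0.01,
         "toad": 0.01, "said": 0.03, "likes": 0.02, "that": 0.04}

def string_probability(tokens, lm):
    """Unigram LM: the probability of a string is the product of emissions."""
    p = 1.0
    for t in tokens:
        p *= lm[t]
    return p

p = string_probability("frog said that toad likes frog STOP".split(), model)
print(p)  # ≈ 4.8e-12, as on the slide
```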

SLIDE 7

A different language model for each document

language model of d1:
w      P(w|·)      w       P(w|·)
STOP   .20         toad    .01
the    .20         said    .03
a      .10         likes   .02
frog   .01         that    .04
…      …

language model of d2:
w      P(w|·)      w       P(w|·)
STOP   .20         toad    .02
the    .15         said    .03
a      .08         likes   .02
frog   .01         that    .05
…      …

query: frog said that toad likes frog STOP
P(query|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 4.8 · 10^-12
P(query|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 12 · 10^-12

P(query|Md1) < P(query|Md2): d2 is more relevant to the query than d1

SLIDE 8

Using language models in IR

▶ Each document is treated as (the basis for) a language model.
▶ Given a query q, rank documents based on P(d|q):

  P(d|q) = P(q|d) P(d) / P(q)

▶ P(q) is the same for all documents, so we can ignore it.
▶ P(d) is the prior – often treated as the same for all d, but we can give a higher prior to "high-quality" documents (e.g. by PageRank).
▶ P(q|d) is the probability of q given d.
▶ Under the assumptions we made, ranking documents according to P(q|d) and P(d|q) is equivalent.

SLIDE 9

Where we are

▶ In the LM approach to IR, we model the query generation process.
▶ Then we rank documents by the probability that a query would be observed as a random sample from the respective document model.
▶ That is, we rank according to P(q|d).
▶ Next: how do we compute P(q|d)?

SLIDE 10

How to compute P(q|d)

▶ The conditional independence assumption:

  P(q|Md) = P(⟨t1, …, t|q|⟩|Md) = ∏_{1≤k≤|q|} P(tk|Md)

▶ |q|: the length of q
▶ tk: the token occurring at position k in q
▶ This is equivalent to:

  P(q|Md) = ∏_{distinct t in q} P(t|Md)^tf_{t,q}

▶ tf_{t,q}: term frequency (number of occurrences) of t in q
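The equivalence of the position-wise and term-wise products is easy to check numerically; this is a small sketch with made-up probabilities:

```python
from collections import Counter
from math import prod  # Python 3.8+

# A toy document model (illustrative probabilities, not from the slides)
Md = {"frog": 0.01, "said": 0.03, "that": 0.04}

query = ["frog", "said", "that", "frog"]

# Product over query positions t_1 ... t_|q|
p_positions = prod(Md[t] for t in query)

# Product over distinct terms t, each raised to its query term frequency tf_{t,q}
tf = Counter(query)
p_terms = prod(Md[t] ** n for t, n in tf.items())

assert abs(p_positions - p_terms) < 1e-18  # the two factorizations agree
```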

SLIDE 11

Parameter estimation

▶ Missing piece: Where do the parameters P(t|Md) come from?
▶ Start with maximum likelihood estimates:

  P̂(t|Md) = tf_{t,d} / |d|

▶ |d|: the length of d
▶ tf_{t,d}: number of occurrences of t in d
▶ The zero problem (in numerator and denominator):
  ▶ A single t with P(t|Md) = 0 will make P(q|Md) = ∏ P(t|Md) zero.
  ▶ Example: for the query [Michael Jackson top hits], a document about "top songs" (but not containing the word "hits") would have P(q|Md) = 0.
▶ We need to smooth the estimates to avoid zeros.

SLIDE 12

Smoothing

▶ Idea: A nonoccurring term is possible (even though it didn't occur) …but no more likely than expected by chance in the collection.
▶ We will use P̂(t|Mc) to "smooth" P̂(t|d) away from zero:

  P̂(t|Mc) = cf_t / T

▶ Mc: the collection model
▶ cf_t: the number of occurrences of t in the collection
▶ T = ∑_t cf_t: the total number of tokens in the collection

SLIDE 13

Jelinek-Mercer smoothing

▶ Intuition: Mix the probability from the document with the general collection frequency of the word:

  P(t|d) = λ P(t|Md) + (1 − λ) P(t|Mc)

▶ High value of λ: "conjunctive-like" search – tends to retrieve documents containing all query words.
▶ Low value of λ: more disjunctive, suitable for long queries.
▶ Correctly setting λ is very important for good performance.

SLIDE 14

Jelinek-Mercer smoothing: Summary

  P(q|d) ∝ ∏_{1≤k≤|q|} (λ P(tk|Md) + (1 − λ) P(tk|Mc))

▶ What we model: The user has a document in mind and generates the query from this document.
▶ The equation represents the probability that the document that the user had in mind was in fact this one.
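A sketch of Jelinek-Mercer-smoothed query likelihood scoring in Python (the function name and signature are my own; working in log space avoids floating point underflow):

```python
import math
from collections import Counter

def jm_log_score(query, doc, collection, lam=0.5):
    """log P(q|d) with P(t|d) = lam * P(t|Md) + (1 - lam) * P(t|Mc).
    `collection` is the concatenation of all document token lists."""
    tf_d, cf, T = Counter(doc), Counter(collection), len(collection)
    score = 0.0
    for t in query:
        # A term unseen in the whole collection would still give log(0);
        # in practice such terms are usually dropped from the query.
        score += math.log(lam * tf_d[t] / len(doc) + (1 - lam) * cf[t] / T)
    return score
```

Documents are then ranked by this score; higher is better.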

SLIDE 15

Example

▶ Collection: d1 and d2
  ▶ d1: Jackson was one of the most talented entertainers of all time.
  ▶ d2: Michael Jackson anointed himself King of Pop.
▶ Query q: Michael Jackson
▶ Use the mixture model with λ = 1/2:
  ▶ P(q|d1) = [(0/11 + 1/18)/2] · [(1/11 + 2/18)/2] ≈ 0.003
  ▶ P(q|d2) = [(1/7 + 1/18)/2] · [(1/7 + 2/18)/2] ≈ 0.013
▶ Ranking: d2 > d1
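The numbers on this slide can be reproduced directly (punctuation is dropped and matching is case-sensitive, which is enough for this query):

```python
from collections import Counter

d1 = "Jackson was one of the most talented entertainers of all time".split()
d2 = "Michael Jackson anointed himself King of Pop".split()
collection = d1 + d2                      # 11 + 7 = 18 tokens
q = ["Michael", "Jackson"]
lam = 0.5

def p_query(doc):
    """P(q|d) under the Jelinek-Mercer mixture with lambda = 1/2."""
    tf, cf, T = Counter(doc), Counter(collection), len(collection)
    p = 1.0
    for t in q:
        p *= lam * tf[t] / len(doc) + (1 - lam) * cf[t] / T
    return p

print(round(p_query(d1), 3), round(p_query(d2), 3))  # 0.003 0.013
```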

SLIDE 16

Dirichlet smoothing

▶ Intuition: Before having seen any part of the document, we start with the background distribution as our estimate:

  P̂(t|d) = (tf_{t,d} + μ P̂(t|Mc)) / (L_d + μ)

▶ The background distribution P̂(t|Mc) is the prior for P̂(t|d).
▶ As we read the document and count terms, we update the background distribution.
▶ The weight factor μ determines how strong an effect the prior has.
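A sketch of the Dirichlet-smoothed estimate (μ = 2000 is a commonly cited default in the IR literature; the function name is my own):

```python
from collections import Counter

def dirichlet_p(t, doc, collection, mu=2000.0):
    """P-hat(t|d) = (tf_{t,d} + mu * P-hat(t|Mc)) / (L_d + mu)."""
    p_col = Counter(collection)[t] / len(collection)
    return (Counter(doc)[t] + mu * p_col) / (len(doc) + mu)

# With mu = 5: (2 + 5 * 0.4) / (3 + 5) = 0.5
print(dirichlet_p("a", ["a", "b", "a"], ["a", "b", "a", "b", "c"], mu=5.0))
```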

SLIDE 17

Jelinek-Mercer or Dirichlet?

▶ Dirichlet performs better for keyword queries, Jelinek-Mercer performs better for verbose queries.
▶ Both models are sensitive to the smoothing parameters – you shouldn't use these models without parameter tuning.

SLIDE 18

Sensitivity of Dirichlet to smoothing parameter


SLIDE 19

Language model vs. Vector space model: Example

Precision at fixed recall levels, TF-IDF vs. language model (LM):

Recall    TF-IDF   LM       %Δ       significant
0.0       0.7439   0.7590   +2.0
0.1       0.4521   0.4910   +8.6
0.2       0.3514   0.4045   +15.1    *
0.4       0.2093   0.2572   +22.9    *
0.6       0.1024   0.1405   +37.1    *
0.8       0.0160   0.0432   +169.6   *
1.0       0.0028   0.0050   +76.9
average   0.1868   0.2233   +19.6    *

The language modeling approach always does better in these experiments, but significant gains are shown at higher levels of recall.

SLIDE 20

Language model vs. Vector space model: Things in common

1. Term frequency is directly in the model.
   ▶ But it is not scaled in LMs.
2. Probabilities are inherently "length-normalized".
   ▶ Cosine normalization does something similar for vector space.
3. Mixing document/collection frequencies has an effect similar to idf.
   ▶ Terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.

SLIDE 21

Language model vs. Vector space model: Differences

1. Language model: based on probability theory
2. Vector space: based on similarity, a geometric/linear algebra notion
3. Collection frequency vs. document frequency
4. Details of term frequency, length normalization, etc.

SLIDE 22

Language models for IR: Assumptions

1. Queries and documents are objects of the same type.
   ▶ There are other LMs for IR that do not make this assumption.
   ▶ The vector space model makes the same assumption.
2. Terms are conditionally independent.
   ▶ The vector space model (and Naive Bayes) make the same assumption.

▶ Language models have a cleaner statement of assumptions and a better theoretical foundation than vector space …but "pure" LMs perform much worse than "tuned" LMs.

SLIDE 23

Text classification


SLIDE 24

A text classification task: Email spam filtering

From: ``'' <takworlld@hotmail.com>
Subject: real estate is the only way... gem alvgkay

Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click Below to order:
http://www.wholesaledaily.com/sales/nmd.htm
=================================================

How would you write a program that would automatically detect and delete this type of message?

SLIDE 25

Formal definition of TC: Training

Given:

▶ A document space X
  ▶ Documents are represented in this space – typically some type of high-dimensional space.
▶ A fixed set of classes C = {c1, c2, …, cJ}
  ▶ The classes are human-defined for the needs of an application (e.g., spam vs. nonspam).
▶ A training set D of labeled documents. Each labeled document ⟨d, c⟩ ∈ X × C.

Using a learning method or learning algorithm, we then wish to learn a classifier γ that maps documents to classes: γ : X → C

SLIDE 26

Formal definition of TC: Application/Testing

Given: a description d ∈ X of a document
Determine: γ(d) ∈ C, that is, the class that is most appropriate for d

SLIDE 27

Topic classification

[Figure: topic classification example. Classes are drawn from regions (UK, China), industries (poultry, coffee), and subject areas (elections, sports). Each class has a training set of documents, e.g. for China: Beijing, Olympics, Great Wall, tourism, communist, Mao. The test document d′ ("first private Chinese airline") is classified as γ(d′) = China.]

SLIDE 28

Examples of how search engines use classification

▶ Language identification (English vs. French, etc.)
▶ Detection of spam pages (spam vs. nonspam)
▶ Detection of sexually explicit content (sexually explicit vs. not)
▶ Topic-specific or vertical search (relevant to vertical vs. not)
▶ Sentiment detection (positive vs. negative)
▶ Machine-learned ranking function in ad hoc retrieval (relevant vs. nonrelevant)

SLIDE 29

Classification methods: 1. Manual

▶ Manual classification was used by Yahoo at the beginning of the web
▶ Domain-specific classification, e.g. PubMed/MeSH
▶ Very accurate if the job is done by experts
▶ Consistent when the problem size and team are small
▶ Scaling manual classification is difficult and expensive.
→ We need automatic methods for classification.

SLIDE 30

Classification methods: 2. Rule-based

▶ E.g., Google Alerts is rule-based classification.
▶ There are IDE-type development environments for writing very complex rules efficiently (e.g., Verity).
▶ Often: Boolean combinations (as in Google Alerts)
▶ Accuracy is very high if a rule has been carefully refined over time by a subject expert.
▶ Building and maintaining rule-based classification systems is cumbersome and expensive.

SLIDE 31

Classification methods: 3. Statistical/Probabilistic

▶ This was our original definition of the classification problem – text classification as a learning problem
▶ Tasks:
  i. supervised learning of the classification function γ
  ii. application of γ to classifying new documents
▶ Examples of methods for doing this: Naive Bayes and SVMs
▶ No free lunch: requires hand-classified training data
▶ But this manual classification can be done by non-experts.

SLIDE 32

Naive Bayes


SLIDE 33

The Naive Bayes classifier

▶ The Naive Bayes classifier is a probabilistic classifier.
▶ We compute the probability of a document d being in a class c as:

  P(c|d) ∝ P(c) ∏_{1≤k≤nd} P(tk|c)

▶ nd – length of the document (number of tokens)
▶ P(tk|c) – the probability of tk occurring in a document of class c
▶ P(c) – the prior probability of c
▶ P(tk|c) measures how much evidence the term tk contributes that c is the correct class of the document d.
▶ If a document's terms do not provide clear evidence for one class vs. another, we choose the c with the highest P(c).

SLIDE 34

Maximum a posteriori class

▶ Our goal in Naive Bayes classification is to find the "best" class.
▶ The best class is the most likely or maximum a posteriori (MAP) class:

  cmap = arg max_{c∈C} P̂(c|d) = arg max_{c∈C} P̂(c) ∏_{1≤k≤nd} P̂(tk|c)

▶ We write P̂ for P since these values are estimates from the training set.

SLIDE 35

Taking the log

▶ Multiplying lots of small probabilities can result in floating point underflow.
▶ Since log(xy) = log(x) + log(y), we can sum log probabilities instead of multiplying probabilities.
▶ Since log is a monotonic function, the class with the highest score does not change.
▶ So what we usually compute in practice is:

  cmap = arg max_{c∈C} [log P̂(c) + ∑_{1≤k≤nd} log P̂(tk|c)]
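A minimal sketch of the log-space computation (the parameter values below are invented for illustration):

```python
import math

# Illustrative (made-up) parameter estimates for two classes
prior = {"China": 0.6, "UK": 0.4}
condprob = {
    "China": {"Beijing": 0.10, "London": 0.01},
    "UK":    {"Beijing": 0.01, "London": 0.10},
}

def c_map(tokens):
    """arg max over classes of log P(c) + sum_k log P(t_k|c)."""
    scores = {
        c: math.log(prior[c]) + sum(math.log(condprob[c][t]) for t in tokens)
        for c in prior
    }
    return max(scores, key=scores.get)

print(c_map(["Beijing", "Beijing", "London"]))  # China
```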

SLIDE 36

Naive Bayes classifier

▶ Classification rule:

  cmap = arg max_{c∈C} [log P̂(c) + ∑_{1≤k≤nd} log P̂(tk|c)]

▶ Simple interpretation:
  ▶ Each conditional parameter log P̂(tk|c) is a weight that indicates how good an indicator tk is for c.
  ▶ The prior log P̂(c) is a weight indicating the relative frequency of c.
  ▶ The sum of log prior and term weights is then a measure of how much evidence there is for the document being in the class.
  ▶ We select the class with the most evidence.

SLIDE 37

Parameter estimation take 1: Maximum likelihood

▶ Estimate parameters P̂(c) and P̂(tk|c) from training data: How?
▶ Prior:

  P̂(c) = Nc / N

▶ Nc: number of docs in class c; N: total number of docs
▶ Conditional:

  P̂(t|c) = T_ct / ∑_{t′∈V} T_ct′

▶ T_ct is the number of tokens of t in training documents from class c
▶ We have made a Naive Bayes independence assumption here: P̂(Xk1 = t|c) = P̂(Xk2 = t|c), independent of position.

SLIDE 38

The problem with maximum likelihood estimates: Zeros

C=China  X1=Beijing  X2=and  X3=Taipei  X4=join  X5=WTO

P(China|d) ∝ P(China) · P(Beijing|China) · P(and|China) · P(Taipei|China) · P(join|China) · P(WTO|China)

▶ If WTO never occurs in class China in the training set:

  P̂(WTO|China) = T_China,WTO / ∑_{t′∈V} T_China,t′ = 0 / ∑_{t′∈V} T_China,t′ = 0

SLIDE 39

The problem with maximum likelihood estimates: Zeros (cont)

▶ If there are no occurrences of WTO in documents in class China, we get a zero estimate:

  P̂(WTO|China) = T_China,WTO / ∑_{t′∈V} T_China,t′ = 0

→ We will get P(China|d) = 0 for any document that contains WTO!

SLIDE 40

To avoid zeros: Add-one smoothing

▶ Before:

  P̂(t|c) = T_ct / ∑_{t′∈V} T_ct′

▶ Now, add one to each count to avoid zeros:

  P̂(t|c) = (T_ct + 1) / ∑_{t′∈V} (T_ct′ + 1) = (T_ct + 1) / ((∑_{t′∈V} T_ct′) + B)

▶ B is (in this case) the number of different words, i.e. the size of the vocabulary |V| = M.
▶ For BIM we used "add 0.5" or ELE – we could also use that here.
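A sketch of the add-one estimate, assuming every class token is in the vocabulary (the function name is my own):

```python
from collections import Counter

def add_one_estimates(class_tokens, vocabulary):
    """P-hat(t|c) = (T_ct + 1) / ((sum over t' of T_ct') + B), with B = |V|."""
    T = Counter(class_tokens)
    # len(class_tokens) equals sum over t' of T_ct' when all tokens are in V
    denom = len(class_tokens) + len(vocabulary)
    return {t: (T[t] + 1) / denom for t in vocabulary}

probs = add_one_estimates(["Beijing", "Beijing", "Taipei"],
                          vocabulary={"Beijing", "Taipei", "WTO"})
print(probs["WTO"])  # 1/6 -- no zero even though WTO never occurred in the class
```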

SLIDE 41

Naive Bayes: Summary

▶ Estimate parameters from the training corpus by add-one smoothing
▶ For a new document, for each class, compute the sum of (i) the log of the prior and (ii) the logs of the conditional probabilities of the terms
▶ Assign the document to the class with the largest score.

SLIDE 42

Naive Bayes: Training

TrainMultinomialNB(C, D)
 1  V ← ExtractVocabulary(D)
 2  N ← CountDocs(D)
 3  for each c ∈ C
 4  do Nc ← CountDocsInClass(D, c)
 5     prior[c] ← Nc/N
 6     text_c ← ConcatenateTextOfAllDocsInClass(D, c)
 7     for each t ∈ V
 8     do T_ct ← CountTokensOfTerm(text_c, t)
 9     for each t ∈ V
10     do condprob[t][c] ← (T_ct + 1) / ∑_{t′} (T_ct′ + 1)
11  return V, prior, condprob

SLIDE 43

Naive Bayes: Testing

ApplyMultinomialNB(C, V, prior, condprob, d)
1  W ← ExtractTokensFromDoc(V, d)
2  for each c ∈ C
3  do score[c] ← log prior[c]
4     for each t ∈ W
5     do score[c] += log condprob[t][c]
6  return arg max_{c∈C} score[c]
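The two routines above translate almost line for line into Python; this is a sketch (the toy training data are my own, loosely following the China example from earlier slides; note that here the conditional table is indexed condprob[c][t] rather than condprob[t][c]):

```python
import math
from collections import Counter

def train_multinomial_nb(docs):
    """docs: list of (tokens, class) pairs. Returns V, prior, condprob
    as in the training pseudocode, with add-one smoothing."""
    V = {t for tokens, _ in docs for t in tokens}
    N = len(docs)
    prior, condprob = {}, {}
    for c in {c for _, c in docs}:
        class_docs = [tokens for tokens, cls in docs if cls == c]
        prior[c] = len(class_docs) / N
        T = Counter(t for tokens in class_docs for t in tokens)
        denom = sum(T.values()) + len(V)
        condprob[c] = {t: (T[t] + 1) / denom for t in V}
    return V, prior, condprob

def apply_multinomial_nb(V, prior, condprob, tokens):
    """arg max_c of log prior[c] + sum of log condprob; terms outside V are ignored."""
    scores = {}
    for c in prior:
        scores[c] = math.log(prior[c])
        for t in tokens:
            if t in V:
                scores[c] += math.log(condprob[c][t])
    return max(scores, key=scores.get)

docs = [(["Beijing", "Chinese"], "China"),
        (["Chinese", "Shanghai"], "China"),
        (["Tokyo", "Japan"], "not-China")]
V, prior, condprob = train_multinomial_nb(docs)
print(apply_multinomial_nb(V, prior, condprob, ["Chinese", "Beijing"]))  # China
```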

SLIDE 44

Time complexity of Naive Bayes

mode      time complexity
training  Θ(|D| L_ave + |C||V|)
testing   Θ(L_a + |C| M_a) = Θ(|C| M_a)

▶ L_ave: average length of a training doc, L_a: length of the test doc, M_a: number of distinct terms in the test doc, D: training set, V: vocabulary, C: set of classes
▶ Training time is linear:
  ▶ Θ(|D| L_ave) – time it takes to compute all counts.
  ▶ Θ(|C||V|) – time to compute the parameters from the counts.
  ▶ Generally: |C||V| < |D| L_ave
▶ Test time is also linear (in the length of the test document).
▶ Thus: Naive Bayes is linear in the size of the training set (training) and the test document (testing). This is optimal.

SLIDE 45

Derivation of Naive Bayes rule

▶ We want to find the class that is most likely given the document:

  cmap = arg max_{c∈C} P(c|d)

▶ Apply Bayes' rule P(A|B) = P(B|A) P(A) / P(B):

  cmap = arg max_{c∈C} P(d|c) P(c) / P(d)

▶ Drop the denominator since P(d) is the same for all classes:

  cmap = arg max_{c∈C} P(d|c) P(c)

SLIDE 46

Too many parameters / sparseness

  cmap = arg max_{c∈C} P(d|c) P(c) = arg max_{c∈C} P(⟨t1, …, tk, …, t_nd⟩|c) P(c)

▶ There are too many parameters P(⟨t1, …, tk, …, t_nd⟩|c), one for each unique combination of a class and a sequence of words.
▶ We would need a very, very large number of training examples to estimate that many parameters.
▶ This is the problem of data sparseness.

SLIDE 47

Naive Bayes conditional independence assumption

▶ To reduce the number of parameters to a manageable size, we make the Naive Bayes conditional independence assumption:

  P(d|c) = P(⟨t1, …, t_nd⟩|c) = ∏_{1≤k≤nd} P(Xk = tk|c)

▶ We assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(Xk = tk|c).
▶ Recall from earlier the estimates for these conditional probabilities:

  P̂(t|c) = (T_ct + 1) / ((∑_{t′∈V} T_ct′) + B)

▶ Difference to BIM? Will be discussed later.

SLIDE 48

Generative model

C=China  X1=Beijing  X2=and  X3=Taipei  X4=join  X5=WTO

  P(c|d) ∝ P(c) ∏_{1≤k≤nd} P(tk|c)

▶ Generate a class with probability P(c)
▶ Generate each of the words (in their respective positions), conditional on the class, but independent of each other, with probability P(tk|c)
▶ To classify docs, we "reengineer" this process and find the class that is most likely to have generated the doc.

SLIDE 49

Second independence assumption

  P̂(Xk1 = t|c) = P̂(Xk2 = t|c)

▶ For example, for a document in the class UK, the probability of generating queen in the first position of the document is the same as generating it in the last position.
▶ The two independence assumptions amount to the bag of words model.

SLIDE 50

Violation of Naive Bayes independence assumptions

▶ Conditional independence:

  P(⟨t1, …, t_nd⟩|c) = ∏_{1≤k≤nd} P(Xk = tk|c)

▶ Positional independence:

  P̂(Xk1 = t|c) = P̂(Xk2 = t|c)

▶ The independence assumptions do not really hold of documents written in natural language.
▶ How can Naive Bayes work if it makes such inappropriate assumptions?

SLIDE 51

Why does Naive Bayes work?

▶ Naive Bayes can work well even though the conditional independence assumptions are badly violated.
▶ Example:

                                     c1        c2        class selected
  true probability P(c|d)            0.6       0.4       c1
  P̂(c) ∏_{1≤k≤nd} P̂(tk|c)           0.00099   0.00001
  NB estimate P̂(c|d)                 0.99      0.01      c1

▶ Double counting of evidence causes underestimation (0.01) and overestimation (0.99).
▶ Classification is about predicting the correct class, not about accurately estimating probabilities.
▶ Naive Bayes is terrible for correct estimation, but it often performs well at accurate prediction (choosing the correct class).

SLIDE 52

Naive Bayes is not so naive

▶ More robust to nonrelevant features than some more complex learning methods
▶ More robust to concept drift (the definition of a class changing over time) than some more complex learning methods
▶ Better than methods like decision trees when we have many equally important features
▶ A good dependable baseline for text classification (but not the best)
▶ Optimal if the independence assumptions hold (never true for text, but true for some domains)
▶ Very fast
▶ Low storage requirements

SLIDE 53

Evaluation of text classification


SLIDE 54

Evaluation on Reuters

[Figure: the topic classification example from Slide 27, repeated. Classes are drawn from regions, industries, and subject areas; the test document d′ ("first private Chinese airline") is classified as γ(d′) = China.]

SLIDE 55

Example: The Reuters collection

symbol  statistic                         value
N       documents                         800,000
L       avg. # word tokens per document   200
M       word types                        400,000

type of class   number   examples
region          366      UK, China
industry        870      poultry, coffee
subject area    126      elections, sports

SLIDE 56

Evaluating classification

▶ Evaluation must be done on test data that are independent of the training data, i.e., training and test sets are disjoint.
▶ It's easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).
▶ Measures: precision, recall, F1, classification accuracy

SLIDE 57

Precision P, recall R, and F1 measure

                                   in the class           not in the class
predicted to be in the class       true positives (TP)    false positives (FP)
predicted to not be in the class   false negatives (FN)   true negatives (TN)

▶ TP, FP, FN, TN are counts of documents.
▶ The sum of these four counts is the total number of documents.

  P = TP / (TP + FP)        R = TP / (TP + FN)

  F1 = 1 / (½ · 1/P + ½ · 1/R) = 2PR / (P + R)

▶ F1 allows us to trade off precision against recall.
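The three measures as code (the counts in the example call are invented for illustration):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from the contingency counts above."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

p, r, f1 = prf1(tp=8, fp=2, fn=8)
print(p, r, round(f1, 3))  # 0.8 0.5 0.615
```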

SLIDE 58

Averaging: Micro vs. Macro

▶ We now have an evaluation measure (F1) for one class.
▶ But we also want a single number that measures the aggregate performance over all classes in the collection.
▶ Macroaveraging
  ▶ Compute F1 for each of the C classes
  ▶ Average these C numbers
▶ Microaveraging
  ▶ Compute TP, FP, FN for each of the C classes
  ▶ Sum these C numbers (e.g., all TP to get the aggregate TP)
  ▶ Compute F1 for the aggregate TP, FP, FN
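The two averaging schemes side by side (the per-class counts below are made up; note how the microaverage is pulled down by the large, poorly classified class while the macroaverage treats both classes equally):

```python
def f1(tp, fp, fn):
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)

# Per-class (TP, FP, FN) counts; the numbers are invented for illustration.
counts = {"UK": (20, 5, 10), "China": (5, 20, 40)}

# Macroaverage: mean of per-class F1 scores
macro = sum(f1(*c) for c in counts.values()) / len(counts)

# Microaverage: F1 of the summed counts
tp = sum(c[0] for c in counts.values())
fp = sum(c[1] for c in counts.values())
fn = sum(c[2] for c in counts.values())
micro = f1(tp, fp, fn)

print(round(macro, 3), round(micro, 3))
```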

SLIDE 59

Naive Bayes vs. other methods (F1)

(a)                         NB   Rocchio  kNN   SVM
micro-avg-L (90 classes)    80   85       86    89
macro-avg (90 classes)      47   59       60    60

(b)                         NB   Rocchio  kNN   trees  SVM
earn                        96   93       97    98     98
acq                         88   65       92    90     94
money-fx                    57   47       78    66     75
grain                       79   68       82    85     95
crude                       80   70       86    85     89
trade                       64   65       77    73     76
interest                    65   63       74    67     78
ship                        85   49       79    74     86
wheat                       70   69       77    93     92
corn                        65   48       78    92     90
micro-avg (top 10)          82   65       82    88     92
micro-avg-D (118 classes)   75   62       n/a   n/a    87