SLIDE 1

Robust Models in Information Retrieval

Nedim Lipka Benno Stein Bauhaus-Universität Weimar

[www.webis.de]

SLIDE 2

Robust Models in Information Retrieval

Outline · Introduction · Bias and Variance · Robust Models in IR · Summary · Excursus: Bias Types

SLIDE 3

Introduction

© stein TIR’11

SLIDE 6

Introduction

Classification Task

Given:

❑ a set O of real-world objects o
❑ a feature space X with feature vectors x
❑ a classification function c : X → Y (closed form unknown)
❑ a sample S = {(x, y) | x ∈ X, y = c(x)}

Searched:

❑ a hypothesis h ∈ H that minimizes P(h(x) ≠ c(x)) = err(h), the generalization error.

Measuring the effectiveness of h:

❑ err_S(h) = (1/|S|) · Σ_{x∈S} loss_{0/1}(h(x), c(x)); err_S(h) is called the test error if S is not used for the construction of h.

❑ err(h∗) := min_{h∈H} err(h) defines a lower bound for err(h) ➜ restriction bias.
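As a minimal sketch of the definition above, err_S under the 0/1 loss can be computed as follows; the threshold classifier and the five labeled points are hypothetical, not from the experiments:

```python
def err_S(h, S):
    """Empirical 0/1 error of hypothesis h on a labeled sample S = [(x, y), ...]."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

# Toy example: a threshold classifier on 1-D features (hypothetical data).
h = lambda x: 1 if x >= 0.5 else 0
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 1)]
print(err_S(h, S))  # one of five examples is misclassified -> 0.2
```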

SLIDE 11

Introduction

Model Formation Task

The process (the function) α for deriving x from o is called model formation: α : O → X.

Choosing between different model formation functions α1, …, αm
➜ choosing between different feature spaces X_α1, …, X_αm
➜ choosing between different hypothesis spaces H_α1, …, H_αm

[Figure: feature vectors x in the feature spaces X_α1, …, X_αm and hypotheses h in the corresponding hypothesis spaces H_α1, …, H_αm]

We call the model under α1 more robust than the model under α2 ⇔

err_S(h∗_α1) > err_S(h∗_α2)  and  err(h∗_α1) < err(h∗_α2)

SLIDE 14

Introduction

The Whole Picture

[Diagram: real-world object classification maps the objects O to the classes Y; model formation α maps O to the feature space X; feature vector classification c maps X to Y]

Learning means searching for an h ∈ H such that P(h(x) ≠ c(x)) is minimal.

SLIDE 15

Bias and Variance

SLIDE 18

Bias and Variance

Error Decomposition

Consider:

❑ a feature vector x and its predicted class label ŷ = h(x), where
❑ h is characterized by a weight vector θ, and
❑ θ has been estimated from a random sample S = {(x, c(x))}.
➜ θ ≡ θ(S), and hence h ≡ h(θ_S)

Observations:

❑ A series of samples S_i, S_i ⊆ U, entails a series of hypotheses h(θ_i),
❑ giving for a feature vector x a series of class labels ŷ_i = h(θ_i, x).
➜ ŷ is considered a random variable, denoted Z.

Consequences:

❑ σ²(Z) is the variance of Z (= the variance of the prediction)
❑ |θ| : |S| ↑ ➜ σ²(Z) ↑ (more parameters per training example)
❑ |S| : |U| ↓ ➜ σ²(Z) ↑ (less representative sample)
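A small simulation can make the variance of the prediction concrete; the population U, the midpoint-threshold learner, and all constants below are hypothetical illustrations, not part of the slides:

```python
import random
import statistics

random.seed(0)
# Hypothetical population U: 1-D feature x, class label 1 iff x > 0.
U = [(x, int(x > 0)) for x in (random.uniform(-1, 1) for _ in range(10_000))]

def fit_threshold(S):
    """theta(S): midpoint between the class means of the sample."""
    m0 = statistics.mean(x for x, y in S if y == 0)
    m1 = statistics.mean(x for x, y in S if y == 1)
    return (m0 + m1) / 2

x_query = 0.05                  # a fixed feature vector x
preds = []                      # class labels y_i = h(theta_i, x)
for _ in range(200):            # a series of samples S_i drawn from U
    S_i = random.sample(U, 30)
    preds.append(int(x_query > fit_threshold(S_i)))

print(statistics.variance(preds))   # sigma^2(Z), the variance of the prediction
```

Because each sample S_i yields a slightly different θ_i, the label at the fixed x fluctuates, which is exactly why ŷ is treated as a random variable Z.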

SLIDE 26

Bias and Variance

Error Decomposition (continued)

Let Z and Y denote the random variables for ŷ (= h(θ_S, x)) and y (= c(x)), with Z and Y independent.

MSE(Z) = E((Z − Y)²)
       = E(Z² − 2·Z·Y + Y²)
       = E(Z²) − 2·E(Z·Y) + E(Y²)
       = (E(Z))² + σ²(Z) − 2·E(Z·Y) + E(Y²)
       = (E(Z))² + σ²(Z) − 2·E(Z·Y) + (E(Y))² + σ²(Y)
       = (E(Z))² − 2·E(Z·Y) + (E(Y))² + σ²(Y) + σ²(Z)
       = (E(Z) − E(Y))² + σ²(Y) + σ²(Z)        [using E(Z·Y) = E(Z)·E(Y) by independence]
       = (E(Z − Y))² + σ²(Z) + σ²(Y)
       = (bias(Z))² + σ²(Z) + IrreducibleError

If Y is constant:

MSE(Z) = (E(Z) − Y)² + σ²(Z)

When analyzing MSE, bias, and σ² of a classifier h, the average over all examples of the test set is taken.
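The constant-Y case of the decomposition can be checked numerically; the shrunken-mean estimator below is a hypothetical example of a biased predictor, chosen only so that both the bias and the variance term are nonzero:

```python
import random
import statistics

random.seed(1)
Y = 2.0  # constant target (the "Y is constant" case)

def estimate(n=10):
    """Hypothetical biased estimator: shrunken mean of n noisy observations."""
    obs = [Y + random.gauss(0, 1) for _ in range(n)]
    return 0.9 * statistics.mean(obs)

Z = [estimate() for _ in range(100_000)]
mse = statistics.mean((z - Y) ** 2 for z in Z)
bias = statistics.mean(Z) - Y
var = statistics.pvariance(Z)

# For sample moments the identity holds exactly: MSE = (E(Z) - Y)^2 + sigma^2(Z),
# so the two printed values agree up to floating-point error.
print(round(mse, 3), round(bias ** 2 + var, 3))
```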

SLIDE 30

Bias and Variance

Illustration

[Figure: MSE, Bias², and Variance over the number of parameters (hypothesis complexity), from low to high; two parameter vectors θ1 and θ2 are marked, with err_S(h∗_α1) > err_S(h∗_α2).]

Comparing two model-classifier combinations under a sample S.

SLIDE 31

Bias and Variance

Illustration

[Figure: the same MSE, Bias², and Variance curves over hypothesis complexity, now with err(h∗_α1) < err(h∗_α2).]

The same model-classifier combinations under a sample S′ with |S′| ≫ |S|.
➜ The model under α1 is more robust than the model under α2.
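The effect in the two illustrations can be reproduced in miniature; everything below (the sine target, the noise level, the polynomial degrees standing in for |θ|) is an illustrative assumption, not the slides' experiment:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                                   # hypothetical true target function
x_S = rng.uniform(0, 3, 10)                  # small training sample S
y_S = f(x_S) + rng.normal(0, 0.1, 10)
x_U = np.linspace(0, 3, 1000)                # large "application set" U

results = {}
for deg in (2, 9):                           # simple vs. complex hypothesis
    h = np.polynomial.Polynomial.fit(x_S, y_S, deg)
    err_S = float(np.mean((h(x_S) - y_S) ** 2))   # sample error
    err = float(np.mean((h(x_U) - f(x_U)) ** 2))  # proxy for the true error
    results[deg] = (err_S, err)
    print(f"degree {deg}: err_S={err_S:.4f}  err={err:.4f}")

# Typical outcome: the complex model wins on the sample but loses on U, i.e.
# err_S is smaller for degree 9 while err is smaller for degree 2.
```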

SLIDE 32

Bias and Variance

Preliminary Summary

❑ Even when training and test sets are chosen properly, a model selection decision may not be justified by error minimization.

❑ Rationale: the concept of representativeness gets lost for extreme ratios between the sample size and the application set in the wild (consider working against the web).
➜ The bias of the less complex classifier is over-estimated.
➜ The variance of the more complex classifier is under-estimated.

❑ This behavior is consistent with the bias-variance tradeoff.

SLIDE 33

Robust Models in IR

SLIDE 34

Robust Models in IR

Case Study I: Text Categorization

The model under α1 is more robust than the model under α2 ⇔

err_S(h∗_α1) > err_S(h∗_α2)  and  err(h∗_α1) < err(h∗_α2)

Experiment rationale:

❑ Topic classification for the web is learned on extremely small samples.
❑ The web generalization error of a classifier h cannot be computed.
➜ err(h) is usually unknown.
➜ Study the effect with a large (test) corpus in the role of the web, by comparing err_S(h_α) and err(h_α) for different α.

SLIDE 35

Robust Models in IR

Case Study I: Text Categorization

Experiment setup 1:

❑ Corpus: RCV1
❑ Corpus size: 663 768 documents
❑ Considered classes: corporate (292 348), economics (51 148), government (161 523), market (158 749)
❑ Sample size: 800, drawn i.i.d. from RCV1
❑ Ratio of sample to corpus: 0.0012
❑ Inductive learner: SVM with linear kernel
❑ Model formation functions α: 5 VSM variants
  1. α1: V = {[a-z]5∗}, |V| = 9951
  2. α2: V = {[a-z]4∗}, |V| = 6172
  3. α3: V = {[a-z]3∗}, |V| = 2729
  4. α4: V = {[a-z]2∗}, |V| = 464
  5. α5: V = {[a-z]∗}, |V| = 26
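The α variants above can be sketched as model formation functions that index a text by lowercase prefixes of a fixed length, so that shorter prefixes yield smaller vocabularies; the toy sentence below is hypothetical:

```python
from collections import Counter

def prefix_features(text, k):
    """Model formation: map a text to counts of length-k lowercase prefixes,
    i.e. one VSM variant per prefix length k (cf. V = {[a-z]^k *})."""
    tokens = [w.lower() for w in text.split() if w.isalpha()]
    return Counter(w[:k] for w in tokens)

text = "Markets rallied as corporate earnings beat economic forecasts"
for k in (5, 4, 3, 2, 1):   # alpha_1 ... alpha_5 use shrinking prefix lengths
    feats = prefix_features(text, k)
    print(f"k={k}: {len(feats)} index terms")
```

On a large corpus the number of distinct index terms shrinks sharply as k decreases, mirroring the vocabulary sizes 9951, …, 26 in the table.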

SLIDE 38

Robust Models in IR

Case Study I: Text Categorization

[Figure: error rate [%] (10-60) over the number of VSM index terms (9951, 6172, 2729, 464, 26 for α1, …, α5); the sample error err_S, the error err, the restriction bias, and the optimum model are marked.]

SLIDE 39

Robust Models in IR

Case Study I: Text Categorization

Experiment setup 2:

❑ Corpus: RCV1
❑ Corpus size: 663 768 documents
❑ Considered classes: corporate (292 348), economics (51 148), government (161 523), market (158 749)
❑ Sample size: 800, drawn i.i.d. from RCV1
❑ Ratio of sample to corpus: 0.0012
❑ Inductive learner: SVM with linear kernel
❑ Model formation functions α: 2 VSM variants
  1. α1: tf·idf weighting scheme
  2. α2: Boolean weighting scheme
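The two weighting schemes can be sketched as follows; the three miniature documents are hypothetical, and idf is taken here as log(N / df):

```python
import math
from collections import Counter

docs = [["market", "rallied", "market"],
        ["economic", "policy"],
        ["market", "policy"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))          # document frequency
idf = {t: math.log(N / df[t]) for t in df}

def tf_idf(doc):       # alpha1: tf*idf weighting
    tf = Counter(doc)
    return {t: tf[t] * idf[t] for t in tf}

def boolean(doc):      # alpha2: Boolean weighting
    return {t: 1 for t in set(doc)}

print(tf_idf(docs[0]))   # terms weighted by frequency and rarity
print(boolean(docs[0]))  # presence/absence only
```

The Boolean model discards all frequency information, so it spans a coarser feature space than tf·idf over the same vocabulary.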

SLIDE 42

Robust Models in IR

Case Study I: Text Categorization

[Figure: error rate [%] (10-60) for the weight types tf·idf (α1) and Boolean (α2); the sample error err_S, the error err, the restriction bias, and the optimum model are marked.]

SLIDE 43

Robust Models in IR

Case Study II: Web Genre Classification

Given a web page, classify it into one of the following 8 classes:

help, article, discussion, shop, non-personal home, personal home, link collection, download

Experiment rationale:

❑ The sizes of existing genre corpora vary between 200 and 2500 documents.
❑ The number of web genres in these corpora is between 3 and 16.
❑ Researchers report very good (too good?) classification results.
➜ The genre corpora are biased, e.g. because
  1. editors collect only their favored documents, and
  2. editors subconsciously introduce correlations between topic and genre.
➜ Classifiers learned with these corpora will not generalize well.
➜ Learn two classifiers hα1, hα2 on corpus A and measure their export accuracy on corpus B.
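The evaluation protocol in the last step can be sketched generically; the majority-class learner and the toy corpora below are hypothetical stand-ins for the SVM and the genre corpora:

```python
def accuracy(h, corpus):
    """Fraction of (x, y) pairs in corpus that h labels correctly."""
    return sum(1 for x, y in corpus if h(x) == y) / len(corpus)

def export_accuracy(learner, corpus_A, corpus_B):
    """Train on corpus A, evaluate on the independent corpus B."""
    h = learner(corpus_A)
    return accuracy(h, corpus_B)

# Hypothetical stand-in for the inductive learner: a majority-class classifier.
def majority_learner(train):
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

A = [(0, "shop"), (1, "shop"), (2, "article")]      # toy training corpus A
B = [(3, "article"), (4, "article"), (5, "shop")]   # toy test corpus B
print(export_accuracy(majority_learner, A, B))      # "shop" matches 1 of 3 in B
```

Because corpus B was collected independently, its accuracy is not inflated by corpus-specific correlations, which is what makes it a proxy for the generalization error.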

SLIDE 44

Robust Models in IR

Case Study II: Web Genre Classification

Experiment setup:

❑ Corpus A: KI-04, 1 200 documents
❑ Considered classes: article, discussion, shop, help, personal home, non-personal home, link collection, download
❑ Corpus B: 7-Web-Genre, 900 documents
❑ Considered classes: listing (KI-04 link collection), eshop (KI-04 shop), home page (KI-04 personal home)
❑ Sample sizes: 50-350, drawn i.i.d. from KI-04
❑ Inductive learner: SVM with linear kernel
❑ Model formation functions α: 2 genre retrieval models
  1. α1: VSM-based model with 3 500 words
  2. α2: special concentration measures plus a core vocabulary (98 features)

SLIDE 45

Robust Models in IR

Case Study II: Web Genre Classification

Within-corpus accuracy:

[Figure: predictive accuracy (45-75%) over the number of training instances (100-300) on corpus A (KI-04), for ⟨α1, h⟩ and ⟨α2, h⟩; err_S(h∗_α1) < err_S(h∗_α2).]

SLIDE 46

Robust Models in IR

Case Study II: Web Genre Classification

Export accuracy:

[Figure: export accuracy (45-75%) over the number of training instances (100-700); training corpus A (KI-04), test corpus B (7-Web-Genre), for ⟨α1, h⟩ and ⟨α2, h⟩; err(h∗_α1) > err(h∗_α2).]

SLIDE 47

Summary

SLIDE 48

Summary

1. Be careful if the ratio between sample size and application set ("test set") becomes extreme: a model selection decision may not be justified by error minimization.

2. Consider . . .
   ❑ a bias over-estimation of the less complex classifier, or
   ❑ a variance under-estimation of the more complex classifier.

3. In web scenarios the true error (generalization error) of a classifier cannot be analyzed:
   ➜ develop a scale-up scenario to assess the impact on the error
   ➜ if in doubt, stick to the less complex classifier

SLIDE 49

Thank you!

SLIDE 50

Excursus: Bias Types

SLIDE 51

Excursus: Bias Types

Bias in Classification Tasks

[Diagram: supervised learning, from the task ⟨O, Y⟩ via model formation α to ⟨X, Y⟩ (restriction bias) and via sample formation to ⟨S, Y⟩ (sample selection bias), leading to the solution ⟨α, h⟩ (preference bias).]
