Making Generative Classifiers Robust to Selection Bias
Andrew Smith, Charles Elkan
November 30th, 2007
Outline
◮ What is selection bias?
◮ Types of selection bias.
◮ Overcoming learnable bias with weighting.
◮ Overcoming bias with maximum likelihood (ML).
◮ Experiment 1: ADULT dataset.
◮ Experiment 2: CA-housing dataset.
◮ Future work & conclusions.
What is selection bias?
Traditional semi-supervised learning assumes:
◮ Some samples are labeled, some are not.
◮ Labeled and unlabeled examples are identically distributed.
Semi-supervised learning under selection bias:
◮ Labeled examples are not selected at random from the general population.
◮ Labeled and unlabeled examples may be differently distributed.
Examples
◮ Loan application approval
◮ Goal is to model repay/default behavior of all applicants.
◮ But the training set only includes labels for people who were approved for a loan.
◮ Spam filtering
◮ Goal is an up-to-date spam filter.
◮ But, while up-to-date unlabeled emails are available, hand-labeled data sets are expensive and may be rarely updated.
Framework
Types of selection bias are distinguished by conditional independence assumptions among three variables:

◮ x is the feature vector.
◮ y is the class label. If y is binary, y ∈ {1, 0}.
◮ s is the binary selection variable: si = 1 if yi is observable, otherwise si = 0.
Types of selection bias – No bias
[Graphical model over x, y, s]

s ⊥ x, s ⊥ y

◮ The standard semi-supervised learning scenario.
◮ Labeled examples are selected completely at random from the general population.
◮ The missing labels are said to be “missing completely at random” (MCAR) in the literature.
Types of selection bias – Learnable bias
[Graphical model over s, x, y]

s ⊥ y | x

◮ Labeled examples are selected from the general population depending only on the features x.
◮ A model p(s|x) is learnable.
◮ The missing labels are said to be “missing at random” (MAR), or to have “ignorable bias,” in the literature.
◮ p(y|x, s = 1) = p(y|x).
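The MAR property p(y|x, s = 1) = p(y|x) is easy to check on synthetic data. The following sketch (an invented data-generating process, for illustration only) simulates selection that depends only on x, then compares the label rate among selected samples to the label rate among all samples within a narrow x-bin:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Synthetic population: y depends on x; selection s depends only on x (MAR).
x = rng.normal(size=n)
p_y = 1.0 / (1.0 + np.exp(-2.0 * x))          # p(y=1|x)
y = rng.random(n) < p_y
p_s = 1.0 / (1.0 + np.exp(-(x - 1.0)))        # p(s=1|x): high-x samples labeled more often
s = rng.random(n) < p_s

# Within a narrow x-bin, the label rate among selected samples
# should match the label rate among all samples: p(y|x, s=1) = p(y|x).
in_bin = np.abs(x - 0.5) < 0.05
rate_all = y[in_bin].mean()
rate_sel = y[in_bin & s].mean()
print(rate_all, rate_sel)  # both near p(y=1|x=0.5) ≈ 0.73
```

Although the labeled subset is heavily skewed toward large x overall, conditioning on x removes the bias, which is exactly what makes MAR bias "ignorable" for a well-specified model of p(y|x).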
Model mis-specification under learnable bias
p(y|x, s = 1) = p(y|x) implies decision boundaries are the same in the labeled and general populations. But what if the model is misspecified? Then a sub-optimal decision boundary may be learned under MAR bias.
[Scatter plot of labeled samples (+ / −)]
Viewing hidden labels
[Scatter plot with hidden labels revealed; legend: best mis-specified boundary, true boundary]
Ignoring samples without labels
[Scatter plot; legend: estimated mis-specified boundary, estimated well-specified boundary]
Types of selection bias – Arbitrary bias
[Graphical model over s, x, y]

◮ Labeled examples are selected from the general population possibly depending on the label itself.
◮ No independence assumptions can be made.
◮ The missing labels are said to be “missing not at random” (MNAR) in the literature.
Overcoming bias – Two alternate goals
The training data consist of {(xi, yi) | si = 1} and {xi | si = 0}. Two goals are possible:

◮ General population modeling: learn p(y|x), e.g. loan application approval.
◮ Unlabeled population modeling: learn p(y|x, s = 0), e.g. spam filtering.
Overcoming learnable bias – General population modeling
Lemma 1
Under MAR bias in the labeling,

p(x, y) = [p(s = 1) / p(s = 1|x)] · p(x, y|s = 1)

if all probabilities are non-zero. The distribution of samples in the general population is a weighted version of the distribution of labeled samples. Since p(s|x) is learnable, we can estimate the weights.
General population modeling – Application
Lemma 1 can be used:
◮ to estimate class-conditional density models p(x|y).
◮ to improve the loss of misspecified discriminative classifiers in the general population.
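A minimal sketch of the Lemma 1 weighting mechanics, on invented MAR-biased synthetic data (the data-generating process and the use of scikit-learn are illustrative assumptions, not the paper's experimental setup). A selection model p(s = 1|x) is fit on all samples, and labeled samples are reweighted by p(s = 1) / p(s = 1|x) before fitting the classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000

# Synthetic population with MAR selection bias.
X = rng.normal(size=(n, 2))
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))).astype(int)
p_s = 1 / (1 + np.exp(-3 * X[:, 0]))          # selection depends only on x
s = rng.random(n) < p_s

# Step 1: learn the selection model p(s=1|x) from all samples (s is observed).
sel_model = LogisticRegression().fit(X, s.astype(int))
p_s_hat = sel_model.predict_proba(X)[:, 1]

# Step 2: Lemma 1 weight for each labeled sample: p(s=1) / p(s=1|x).
w = s.mean() / p_s_hat

# Step 3: fit the classifier on labeled samples only, with Lemma 1 weights.
clf = LogisticRegression().fit(X[s], y[s], sample_weight=w[s])

# For comparison: an unweighted fit on the same biased labeled set.
plain = LogisticRegression().fit(X[s], y[s])
print(clf.score(X, y), plain.score(X, y))   # general-population accuracy
```

Since logistic regression is well-specified for this toy process, the two fits score similarly here; the weighting matters when the model is misspecified, as in the slides' synthetic example.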
[Scatter plot of labeled samples (+ / −)]
Viewing hidden labels
[Scatter plot with hidden labels revealed; legend: best general-population boundary]
Using Lemma 1

[Scatter plot; legend: estimated general-population boundary]

The weighted logistic regression finds a decision boundary close to the best possible for the general population.
Overcoming learnable bias – Unlabeled population modeling
Lemma 2
Under MAR bias in the labeling,

p(x, y|s = 0) = [(1 − p(s = 1|x)) / (1 − p(s = 1))] · [p(s = 1) / p(s = 1|x)] · p(x, y|s = 1)

if all probabilities are non-zero. Similarly, the distribution of samples in the unlabeled population is a weighted version of the distribution of labeled samples. Since p(s|x) is learnable, we can estimate the weights.
Unlabeled population modeling – Application

Lemma 2 can be used:

◮ to estimate class-conditional density models p(x|y, s = 0).
◮ to improve the loss of misspecified discriminative classifiers in the unlabeled population.
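The Lemma 2 weight can be computed directly from an estimate of p(s = 1|x) and the marginal labeling rate. A small sketch (the function name and example values are made up for illustration):

```python
import numpy as np

def lemma2_weights(p_s_x, p_s):
    """Lemma 2 weight for each labeled sample, assuming MAR bias:
    [(1 - p(s=1|x)) / (1 - p(s=1))] * [p(s=1) / p(s=1|x)].
    Reweighted labeled samples then mimic the unlabeled population.

    p_s_x : estimated p(s=1|x) for each labeled sample
    p_s   : marginal labeling rate p(s=1)
    """
    return (1 - p_s_x) / (1 - p_s) * p_s / p_s_x

# Samples that were unlikely to be labeled get large weight, and vice versa.
p_s_x = np.array([0.9, 0.5, 0.1])
print(lemma2_weights(p_s_x, p_s=0.5))  # → [0.111..., 1.0, 9.0]
```

Intuitively, a labeled sample with p(s = 1|x) = 0.1 stands in for the many similar samples that went unlabeled, so it receives weight 9 when modeling the unlabeled population.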
[Scatter plot of labeled samples (+ / −)]
Viewing hidden labels
[Scatter plot with hidden labels revealed; legend: best unlabeled-population boundary]
Using Lemma 2
[Scatter plot; legend: estimated unlabeled-population boundary]

The weighted logistic regression finds a decision boundary close to the best possible for the unlabeled population.
Overcoming arbitrary bias – Maximum Likelihood
The log likelihood of parameters Θ over a semi-labeled dataset X is:

ℓ(Θ, X) = Σ_{i=1..m} log pΘ(xi, yi, si = 1) + Σ_{i=m+1..m+n} log Σ_y pΘ(xi, y, si = 0)

for labeled data i = 1...m and unlabeled data i = m + 1...m + n.

◮ Under the assumption of learnable bias this reduces to a traditional semi-supervised likelihood equation.
Log-likelihood assuming learnable bias
◮ Factor p(x, y, s) = p(s|x, y) p(x|y) p(y).
◮ Assume learnable bias (MAR): p(s|x, y) = p(s|x). Then:

ℓ(Θ, X) = Σ_{i=1..m} log p(xi, yi, si = 1) + Σ_{i=m+1..m+n} log Σ_y p(xi, y, si = 0)

= Σ_{i=1..m} log p(si|xi, yi) p(xi|yi) p(yi) + Σ_{i=m+1..m+n} log Σ_y p(si|xi, y) p(xi|y) p(y)

= Σ_{i=1..m} log p(si|xi) p(xi|yi) p(yi) + Σ_{i=m+1..m+n} log { p(si|xi) Σ_y p(xi|y) p(y) }

= Σ_{i=1..m} log p(xi|yi) p(yi) + Σ_{i=m+1..m+n} log Σ_y p(xi|y) p(y) + Σ_{i=1..m+n} log p(si|xi)

The last term does not involve the classifier parameters, so maximizing ℓ is equivalent to maximizing the standard semi-supervised likelihood.
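The claim that the p(s|x) factors separate out of the likelihood can be checked numerically. A toy sketch, with a hypothetical 1-D Gaussian class model and a logistic selection model invented purely for illustration:

```python
import numpy as np

def normal_pdf(x, mu):
    # Unit-variance Gaussian density, standing in for p(x|y).
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def sigmoid(z):
    # Hypothetical selection model: p(s=1|x) = sigmoid(x).
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x_lab = rng.normal(size=5)              # labeled features
y_lab = rng.integers(0, 2, size=5)      # their labels
x_unl = rng.normal(size=5)              # unlabeled features
mu = np.array([-1.0, 1.0])              # class-conditional means
prior = np.array([0.4, 0.6])            # class priors p(y)

def full_ll(mu, prior):
    # Full MAR likelihood: p(s|x) factors included, mixture for unlabeled.
    lab = np.log(sigmoid(x_lab)) + np.log(normal_pdf(x_lab, mu[y_lab]) * prior[y_lab])
    mix = normal_pdf(x_unl[:, None], mu[None, :]) @ prior
    unl = np.log(1 - sigmoid(x_unl)) + np.log(mix)
    return lab.sum() + unl.sum()

def semi_ll(mu, prior):
    # Standard semi-supervised likelihood: the p(s|x) factors are dropped.
    lab = np.log(normal_pdf(x_lab, mu[y_lab]) * prior[y_lab])
    mix = normal_pdf(x_unl[:, None], mu[None, :]) @ prior
    return lab.sum() + np.log(mix).sum()

# The two differ by sum_i log p(si|xi), which is constant in (mu, prior).
const = np.log(sigmoid(x_lab)).sum() + np.log(1 - sigmoid(x_unl)).sum()
print(np.isclose(full_ll(mu, prior), semi_ll(mu, prior) + const))  # True
```

Because the offset is constant in (mu, prior), both likelihoods are maximized by the same classifier parameters, which is the reduction the derivation establishes.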
Log-likelihood making no assumptions about bias
Use a different factoring: p(x, y, s) = p(x|y, s) p(y|s) p(s).

ℓ(ΘL, ΘU, X) = Σ_{i=1..m} log pΘL(xi|yi, si = 1) pΘL(yi|si = 1) p(si = 1) + Σ_{i=m+1..m+n} log Σ_y pΘU(xi|y, si = 0) pΘU(y|si = 0) p(si = 0)

◮ For |C| classes, this has 2|C| class-conditional density models: p(x|y = c, s) for each c ∈ C.
◮ This simplifies to two independent maximizations.
◮ But how do we maximize the likelihood in a sensible way for the unlabeled data?
The “Shifted Mixture-Model” (SMM) approach to improving the likelihood for unlabeled samples
Solution: Assume the model parameters for the labeled data, pΘL(x|y, s = 1) and pΘL(y|s = 1), and the parameters for the unlabeled data, pΘU(x|y, s = 0) and pΘU(y|s = 0), are “close.”

◮ Learn density models pΘL(xi|yi, si = 1) and priors pΘL(yi|si = 1) from labeled data.
◮ Initialize the parameters ΘU for the unlabeled data with the parameters ΘL learned from the labeled data, or possibly using Lemma 2.
◮ Improve the likelihood of ΘU given the unlabeled data (unsupervised learning).

The SMM approach is useful for both general and unlabeled population modeling, as it produces an explicit generative model of both populations.
Example with synthetic data – Concept Drift
Application to EM
Application of the SMM approach to EM:

◮ Let ΘU^0 be the parameters for pΘU(x|y, s = 0), initialized from the labeled data.
◮ Limit to a few (5) iterations to limit parameter changes.
◮ Use an inertia parameter α = 0.99 to slow parameter evolution: given ΘU^t and EM update Θ′, use ΘU^(t+1) ← α ΘU^t + (1 − α) Θ′.
◮ The final ΘU^5 gives the parameters for the unlabeled data.
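The inertia update is a one-liner. A toy sketch with a single scalar parameter (the fixed EM proposal is a stand-in; a real M-step would recompute it each iteration):

```python
def damped_em_update(theta, theta_em, alpha=0.99):
    """Inertia update from the slide: keep alpha of the old parameters,
    move only (1 - alpha) of the way toward the EM-proposed parameters."""
    return alpha * theta + (1 - alpha) * theta_em

# Toy illustration: one scalar parameter (e.g. a component mean).
theta = 0.0                  # initialized from the labeled data
theta_em = 2.0               # stand-in for what a full EM M-step proposes
for _ in range(5):           # only a few iterations, as on the slide
    theta = damped_em_update(theta, theta_em)
print(theta)  # ≈ 0.098: the unlabeled-data parameters barely drift
```

With α = 0.99 and 5 iterations, the parameters move only 2(1 − 0.99^5) ≈ 5% of the way toward the EM proposal, enforcing the SMM assumption that the labeled and unlabeled populations are “close.”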
Experiment 1 – the ADULT data set
Features:
◮ x: (AGE, EDUCATION, CAPITAL GAIN, CAPITAL LOSS,
HOURS PER WEEK, SEX, NATIVE TO US, ETHNICITY, FULL TIME)
◮ y: INCOME > $50,000? ◮ s: MARRIED?
Most of the information in this dataset could plausibly be used in a loan approval system. The target is analogous to a repay/default behavior label. Marital status is analogous to an unquantifiable measure of responsibility that would not be recorded in a bank’s records, but might influence the label.
Determining the type of bias
Is it MCAR? No:

◮ p(y = 1|s = 1) = 0.4556
◮ p(y = 1|s = 0) = 0.0692

Is it MAR? Not as far as logistic regression can detect:

◮ Accuracy in gen. pop. based on labeled data = 74.2%
◮ Accuracy in gen. pop. based on all data = 80.7%
Results
(boxplots over 10 random trials)

[Boxplots: general-population accuracy, roughly 0.74–0.82, for LR, LR + Lemma 1, GMM, GMM + Lemma 1, SMM]

[Boxplots: unlabeled-population accuracy, roughly 0.80–0.90, for LR, LR + Lemma 2, GMM, GMM + Lemma 2, SMM]

Analysis in the loan-application context:

◮ We can achieve better accuracy than logistic regression.
◮ We can do “reject inference.”
◮ We can improve accuracy by not assuming MAR bias.
Experiment 2 – the CA-HOUSING data set

[Map of California census tracts; longitude −124 to −116, latitude 34 to 42]

Features:

◮ x: CA census tract data: MEDIAN INCOME, MEDIAN HOUSE AGE, TOTAL ROOMS, TOTAL BEDROOMS, POPULATION, HOUSEHOLDS
◮ y: house VALUE > California median?
◮ s: LATITUDE > 36 and within 0.4 degrees of the coast?

The goal is to learn a model of housing prices throughout California when price information is available only for the northern California coast.
Determining the type of bias
Is it MCAR? No:

◮ p(y = 1|s = 1) = 0.751
◮ p(y = 1|s = 0) = 0.443

Is it MAR? Not as far as logistic regression can detect:

◮ Accuracy in gen. pop. based on labeled data = 74.8%
◮ Accuracy in gen. pop. based on all data = 80.5%
Results
(boxplots over 10 random trials)

[Boxplots: general-population accuracy, roughly 0.70–0.78, for LR, LR + Lemma 1, GMM, GMM + Lemma 1, SMM]

[Boxplots: unlabeled-population accuracy, roughly 0.68–0.80, for LR, LR + Lemma 2, GMM, GMM + Lemma 2, SMM]