Making Generative Classifiers Robust to Selection Bias
Andrew Smith, Charles Elkan
November 30th, 2007
Outline
◮ What is selection bias?
◮ Types of selection bias.
◮ Overcoming learnable bias with weighting.
◮ Overcoming bias with maximum likelihood (ML).
◮ Experiment 1: ADULT dataset.
◮ Experiment 2: CA-housing dataset.
◮ Future work & conclusions.
What is selection bias?
Traditional semi-supervised learning assumes:
◮ Some samples are labeled, some are not.
◮ Labeled and unlabeled examples are identically distributed.
Semi-supervised learning under selection bias:
◮ Labeled examples are not selected at random from the general population.
◮ Labeled and unlabeled examples may be differently distributed.
Examples
◮ Loan application approval
◮ Goal is to model repay/default behavior of all applicants.
◮ But the training set only includes labels for people who were approved for a loan.
◮ Spam filtering
◮ Goal is an up-to-date spam filter.
◮ But, while up-to-date unlabeled emails are available, hand-labeled data sets are expensive and may be rarely updated.
Framework
Types of selection bias are distinguished by conditional independence assumptions among three variables:

◮ x is the feature vector.
◮ y is the class label. If y is binary, y ∈ {1, 0}.
◮ s is the binary selection variable: si = 1 if yi is observable, otherwise si = 0.
Types of selection bias – No bias
[Graphical model over x, y, s]

s ⊥ x, s ⊥ y

◮ The standard semi-supervised learning scenario.
◮ Labeled examples are selected completely at random from the general population.
◮ The missing labels are said to be “missing completely at random” (MCAR) in the literature.
Types of selection bias – Learnable bias
[Graphical model over s, x, y]

s ⊥ y | x

◮ Labeled examples are selected from the general population depending only on the features x.
◮ A model p(s|x) is learnable.
◮ The missing labels are said to be “missing at random” (MAR), or to have “ignorable bias,” in the literature.
◮ p(y|x, s = 1) = p(y|x).
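The MAR property p(y|x, s = 1) = p(y|x) is easy to check on synthetic data. The following sketch (an invented data-generating process, for illustration only) simulates selection that depends only on x, then compares the label rate among selected samples to the label rate among all samples within a narrow x-bin:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Synthetic population: y depends on x; selection s depends only on x (MAR).
x = rng.normal(size=n)
p_y = 1.0 / (1.0 + np.exp(-2.0 * x))          # p(y=1|x)
y = rng.random(n) < p_y
p_s = 1.0 / (1.0 + np.exp(-(x - 1.0)))        # p(s=1|x): high-x samples labeled more often
s = rng.random(n) < p_s

# Within a narrow x-bin, the label rate among selected samples
# should match the label rate among all samples: p(y|x, s=1) = p(y|x).
in_bin = np.abs(x - 0.5) < 0.05
rate_all = y[in_bin].mean()
rate_sel = y[in_bin & s].mean()
print(rate_all, rate_sel)  # both near p(y=1|x=0.5) ≈ 0.73
```

Although the labeled subset is heavily skewed toward large x overall, conditioning on x removes the bias, which is exactly what makes MAR bias "ignorable" for a well-specified model of p(y|x).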
Model mis-specification under learnable bias
p(y|x, s = 1) = p(y|x) implies decision boundaries are the same in the labeled and general populations. But what if the model is misspecified? Then a sub-optimal decision boundary may be learned under MAR bias.
[Scatter plot of labeled samples (+ / −)]
Viewing hidden labels
[Scatter plot with hidden labels revealed; legend: best mis-specified boundary, true boundary]
Ignoring samples without labels
[Scatter plot; legend: estimated mis-specified boundary, estimated well-specified boundary]
Types of selection bias – Arbitrary bias
[Graphical model over s, x, y]

◮ Labeled examples are selected from the general population possibly depending on the label itself.
◮ No independence assumptions can be made.
◮ The missing labels are said to be “missing not at random” (MNAR) in the literature.
Overcoming bias – Two alternate goals
The training data consist of {(xi, yi) | si = 1} and {xi | si = 0}. Two goals are possible:

◮ General population modeling: learn p(y|x), e.g. loan application approval.
◮ Unlabeled population modeling: learn p(y|x, s = 0), e.g. spam filtering.
Overcoming learnable bias – General population modeling
Lemma 1
Under MAR bias in the labeling,

p(x, y) = [p(s = 1) / p(s = 1|x)] · p(x, y|s = 1)

if all probabilities are non-zero. The distribution of samples in the general population is a weighted version of the distribution of labeled samples. Since p(s|x) is learnable, we can estimate the weights.
General population modeling – Application
Lemma 1 can be used:
◮ to estimate class-conditional density models p(x|y).
◮ to improve the loss of misspecified discriminative classifiers in the general population.
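A minimal sketch of the Lemma 1 weighting mechanics, on invented MAR-biased synthetic data (the data-generating process and the use of scikit-learn are illustrative assumptions, not the paper's experimental setup). A selection model p(s = 1|x) is fit on all samples, and labeled samples are reweighted by p(s = 1) / p(s = 1|x) before fitting the classifier:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 20_000

# Synthetic population with MAR selection bias.
X = rng.normal(size=(n, 2))
y = (rng.random(n) < 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))).astype(int)
p_s = 1 / (1 + np.exp(-3 * X[:, 0]))          # selection depends only on x
s = rng.random(n) < p_s

# Step 1: learn the selection model p(s=1|x) from all samples (s is observed).
sel_model = LogisticRegression().fit(X, s.astype(int))
p_s_hat = sel_model.predict_proba(X)[:, 1]

# Step 2: Lemma 1 weight for each labeled sample: p(s=1) / p(s=1|x).
w = s.mean() / p_s_hat

# Step 3: fit the classifier on labeled samples only, with Lemma 1 weights.
clf = LogisticRegression().fit(X[s], y[s], sample_weight=w[s])

# For comparison: an unweighted fit on the same biased labeled set.
plain = LogisticRegression().fit(X[s], y[s])
print(clf.score(X, y), plain.score(X, y))   # general-population accuracy
```

Since logistic regression is well-specified for this toy process, the two fits score similarly here; the weighting matters when the model is misspecified, as in the slides' synthetic example.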
[Scatter plot of labeled samples (+ / −)]
Viewing hidden labels
[Scatter plot with hidden labels revealed; legend: best general-population boundary]
Using Lemma 1

[Scatter plot; legend: estimated general-population boundary]

The weighted logistic regression finds a decision boundary close to the best possible for the general population.
Overcoming learnable bias – Unlabeled population modeling
Lemma 2
Under MAR bias in the labeling,

p(x, y|s = 0) = [(1 − p(s = 1|x)) / (1 − p(s = 1))] · [p(s = 1) / p(s = 1|x)] · p(x, y|s = 1)

if all probabilities are non-zero. Similarly, the distribution of samples in the unlabeled population is a weighted version of the distribution of labeled samples. Since p(s|x) is learnable, we can estimate the weights.
Unlabeled population modeling – Application

Lemma 2 can be used:

◮ to estimate class-conditional density models p(x|y, s = 0).
◮ to improve the loss of misspecified discriminative classifiers in the unlabeled population.
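The Lemma 2 weight can be computed directly from an estimate of p(s = 1|x) and the marginal labeling rate. A small sketch (the function name and example values are made up for illustration):

```python
import numpy as np

def lemma2_weights(p_s_x, p_s):
    """Lemma 2 weight for each labeled sample, assuming MAR bias:
    [(1 - p(s=1|x)) / (1 - p(s=1))] * [p(s=1) / p(s=1|x)].
    Reweighted labeled samples then mimic the unlabeled population.

    p_s_x : estimated p(s=1|x) for each labeled sample
    p_s   : marginal labeling rate p(s=1)
    """
    return (1 - p_s_x) / (1 - p_s) * p_s / p_s_x

# Samples that were unlikely to be labeled get large weight, and vice versa.
p_s_x = np.array([0.9, 0.5, 0.1])
print(lemma2_weights(p_s_x, p_s=0.5))  # → [0.111..., 1.0, 9.0]
```

Intuitively, a labeled sample with p(s = 1|x) = 0.1 stands in for the many similar samples that went unlabeled, so it receives weight 9 when modeling the unlabeled population.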
[Scatter plot of labeled samples (+ / −)]
Viewing hidden labels
[Scatter plot with hidden labels revealed; legend: best unlabeled-population boundary]
Using Lemma 2
[Scatter plot; legend: estimated unlabeled-population boundary]

The weighted logistic regression finds a decision boundary close to the best possible for the unlabeled population.
Overcoming arbitrary bias – Maximum Likelihood
The log likelihood of parameters Θ over a semi-labeled dataset X is:

ℓ(Θ, X) = Σ_{i=1..m} log pΘ(xi, yi, si = 1) + Σ_{i=m+1..m+n} log Σ_y pΘ(xi, y, si = 0)

for labeled data i = 1...m and unlabeled data i = m + 1...m + n.

◮ Under the assumption of learnable bias this reduces to a traditional semi-supervised likelihood equation.
Log-likelihood assuming learnable bias
◮ Factor p(x, y, s) = p(s|x, y) p(x|y) p(y).
◮ Assume learnable bias (MAR): p(s|x, y) = p(s|x). Then:

ℓ(Θ, X) = Σ_{i=1..m} log p(xi, yi, si = 1) + Σ_{i=m+1..m+n} log Σ_y p(xi, y, si = 0)

= Σ_{i=1..m} log p(si|xi, yi) p(xi|yi) p(yi) + Σ_{i=m+1..m+n} log Σ_y p(si|xi, y) p(xi|y) p(y)

= Σ_{i=1..m} log p(si|xi) p(xi|yi) p(yi) + Σ_{i=m+1..m+n} log { p(si|xi) Σ_y p(xi|y) p(y) }

= Σ_{i=1..m} log p(xi|yi) p(yi) + Σ_{i=m+1..m+n} log Σ_y p(xi|y) p(y) + Σ_{i=1..m+n} log p(si|xi)

The last term does not involve the classifier parameters, so maximizing ℓ is equivalent to maximizing the standard semi-supervised likelihood.
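The claim that the p(s|x) factors separate out of the likelihood can be checked numerically. A toy sketch, with a hypothetical 1-D Gaussian class model and a logistic selection model invented purely for illustration:

```python
import numpy as np

def normal_pdf(x, mu):
    # Unit-variance Gaussian density, standing in for p(x|y).
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def sigmoid(z):
    # Hypothetical selection model: p(s=1|x) = sigmoid(x).
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
x_lab = rng.normal(size=5)              # labeled features
y_lab = rng.integers(0, 2, size=5)      # their labels
x_unl = rng.normal(size=5)              # unlabeled features
mu = np.array([-1.0, 1.0])              # class-conditional means
prior = np.array([0.4, 0.6])            # class priors p(y)

def full_ll(mu, prior):
    # Full MAR likelihood: p(s|x) factors included, mixture for unlabeled.
    lab = np.log(sigmoid(x_lab)) + np.log(normal_pdf(x_lab, mu[y_lab]) * prior[y_lab])
    mix = normal_pdf(x_unl[:, None], mu[None, :]) @ prior
    unl = np.log(1 - sigmoid(x_unl)) + np.log(mix)
    return lab.sum() + unl.sum()

def semi_ll(mu, prior):
    # Standard semi-supervised likelihood: the p(s|x) factors are dropped.
    lab = np.log(normal_pdf(x_lab, mu[y_lab]) * prior[y_lab])
    mix = normal_pdf(x_unl[:, None], mu[None, :]) @ prior
    return lab.sum() + np.log(mix).sum()

# The two differ by sum_i log p(si|xi), which is constant in (mu, prior).
const = np.log(sigmoid(x_lab)).sum() + np.log(1 - sigmoid(x_unl)).sum()
print(np.isclose(full_ll(mu, prior), semi_ll(mu, prior) + const))  # True
```

Because the offset is constant in (mu, prior), both likelihoods are maximized by the same classifier parameters, which is the reduction the derivation establishes.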
Log-likelihood making no assumptions about bias
Use a different factoring: p(x, y, s) = p(x|y, s) p(y|s) p(s).

ℓ(ΘL, ΘU, X) = Σ_{i=1..m} log pΘL(xi|yi, si = 1) pΘL(yi|si = 1) p(si = 1) + Σ_{i=m+1..m+n} log Σ_y pΘU(xi|y, si = 0) pΘU(y|si = 0) p(si = 0)

◮ For |C| classes, this has 2|C| class-conditional density models: p(x|y = c, s) for each c ∈ C.
◮ This simplifies to two independent maximizations.
◮ But how do we maximize the likelihood in a sensible way for the unlabeled data?
The “Shifted Mixture-Model” (SMM) approach to improving the likelihood for unlabeled samples
Solution: Assume the model parameters for the labeled data, pΘL(x|y, s = 1) and pΘL(y|s = 1), and the parameters for the unlabeled data, pΘU(x|y, s = 0) and pΘU(y|s = 0), are “close.”

◮ Learn density models pΘL(xi|yi, si = 1) and priors pΘL(yi|si = 1) from labeled data.
◮ Initialize the parameters ΘU for the unlabeled data with the parameters ΘL learned from the labeled data, or possibly using Lemma 2.
◮ Improve the likelihood of ΘU given the unlabeled data (unsupervised learning).

The SMM approach is useful for both general and unlabeled population modeling, as it produces an explicit generative model of both populations.
Example with synthetic data – Concept Drift
Application to EM
Application of the SMM approach to EM:

◮ Let ΘU^0 be the parameters for pΘU(x|y, s = 0), initialized from the labeled data.
◮ Limit to a few (5) iterations to limit parameter changes.
◮ Use an inertia parameter α = 0.99 to slow parameter evolution: given ΘU^t and EM update Θ′, use ΘU^(t+1) ← α ΘU^t + (1 − α) Θ′.
◮ The final ΘU^5 gives the parameters for the unlabeled data.
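The inertia update is a one-liner. A toy sketch with a single scalar parameter (the fixed EM proposal is a stand-in; a real M-step would recompute it each iteration):

```python
def damped_em_update(theta, theta_em, alpha=0.99):
    """Inertia update from the slide: keep alpha of the old parameters,
    move only (1 - alpha) of the way toward the EM-proposed parameters."""
    return alpha * theta + (1 - alpha) * theta_em

# Toy illustration: one scalar parameter (e.g. a component mean).
theta = 0.0                  # initialized from the labeled data
theta_em = 2.0               # stand-in for what a full EM M-step proposes
for _ in range(5):           # only a few iterations, as on the slide
    theta = damped_em_update(theta, theta_em)
print(theta)  # ≈ 0.098: the unlabeled-data parameters barely drift
```

With α = 0.99 and 5 iterations, the parameters move only 2(1 − 0.99^5) ≈ 5% of the way toward the EM proposal, enforcing the SMM assumption that the labeled and unlabeled populations are “close.”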
Experiment 1 – the ADULT data set
Features:
◮ x: (AGE, EDUCATION, CAPITAL GAIN, CAPITAL LOSS,
HOURS PER WEEK, SEX, NATIVE TO US, ETHNICITY, FULL TIME)
◮ y: INCOME > $50,000? ◮ s: MARRIED?
Most of the information in this dataset could plausibly be used in a loan approval system. The target is analogous to a repay/default behavior label. Marital status is analogous to an unquantifiable measure of responsibility that would not be recorded in a bank’s records, but might influence the label.
Determining the type of bias
Is it MCAR? No:

◮ p(y = 1|s = 1) = 0.4556
◮ p(y = 1|s = 0) = 0.0692

Is it MAR? Not as far as logistic regression can detect:

◮ Accuracy in gen. pop. based on labeled data = 74.2%
◮ Accuracy in gen. pop. based on all data = 80.7%
Results
(boxplots over 10 random trials)

[Boxplots: general-population accuracy, roughly 0.74–0.82, for LR, LR + Lemma 1, GMM, GMM + Lemma 1, SMM]

[Boxplots: unlabeled-population accuracy, roughly 0.80–0.90, for LR, LR + Lemma 2, GMM, GMM + Lemma 2, SMM]

Analysis in the loan-application context:

◮ We can achieve better accuracy than logistic regression.
◮ We can do “reject inference.”
◮ We can improve accuracy by not assuming MAR bias.
Experiment 2 – the CA-HOUSING data set

[Map of California census tracts; longitude −124 to −116, latitude 34 to 42]

Features:

◮ x: CA census tract data: MEDIAN INCOME, MEDIAN HOUSE AGE, TOTAL ROOMS, TOTAL BEDROOMS, POPULATION, HOUSEHOLDS
◮ y: house VALUE > California median?
◮ s: LATITUDE > 36 and within 0.4 degrees of the coast?

The goal is to learn a model of housing prices throughout California when price information is available only for the northern California coast.
Determining the type of bias
Is it MCAR? No:

◮ p(y = 1|s = 1) = 0.751
◮ p(y = 1|s = 0) = 0.443

Is it MAR? Not as far as logistic regression can detect:

◮ Accuracy in gen. pop. based on labeled data = 74.8%
◮ Accuracy in gen. pop. based on all data = 80.5%
Results
(boxplots over 10 random trials)

[Boxplots: general-population accuracy, roughly 0.70–0.78, for LR, LR + Lemma 1, GMM, GMM + Lemma 1, SMM]

[Boxplots: unlabeled-population accuracy, roughly 0.68–0.80, for LR, LR + Lemma 2, GMM, GMM + Lemma 2, SMM]