SLIDE 1
Importance-Weighted Cross-Validation for Covariate Shift
Masashi Sugiyama (1), Benjamin Blankertz (2,3), Matthias Krauledat (3,2), Guido Dornhege (2), Klaus-Robert Müller (2)
(1) Tokyo Institute of Technology, Tokyo, Japan
(2) Fraunhofer FIRST.IDA, Berlin, Germany
(3) Technical University Berlin, Berlin, Germany
SLIDE 2
Common Assumption in Supervised Learning
Goal: from given training samples, predict the outputs of unseen test samples.
To do so, we always assume that training and test samples are drawn from the same distribution.
Is this assumption really true?
SLIDE 3
Not Always True!
- Fewer women in face datasets than in reality.
- More criticism in survey samples than in reality.
- We tend to collect easy-to-gather samples for training.
- The sample generation mechanism varies over time.
[Figures: The Yale Face Database B; brain activity data]
SLIDE 4
Covariate Shift
However, there is no chance of generalization if training and test samples have nothing in common.
Covariate shift:
- The input distribution changes.
- The functional relation remains unchanged.
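To make the setting concrete, here is a minimal sketch of covariate-shifted data in one dimension (a hypothetical toy setup; the sinc target and all Gaussian parameters are my own illustrative choices, not values from the slides): the training and test inputs follow different Gaussians, while the same noisy function generates the outputs in both phases.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The functional relation p(y|x) is identical in both phases.
    return np.sinc(x)

# Covariate shift: only the input densities differ.
x_train = rng.normal(loc=1.0, scale=0.5, size=150)  # p_tr(x) = N(1, 0.5^2)
x_test = rng.normal(loc=2.0, scale=0.3, size=150)   # p_te(x) = N(2, 0.3^2)

y_train = f(x_train) + rng.normal(scale=0.1, size=150)
y_test = f(x_test) + rng.normal(scale=0.1, size=150)
```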
SLIDE 5
Examples of Covariate Shift
(Weak) extrapolation: predict output values outside the training region.
[Figure: training samples vs. test samples]
SLIDE 6
Examples (cont.)
Possible applications:
- Non-stationarity compensation in brain-computer interfaces
- Online system adaptation in robot motor control
- Correcting sample selection bias in survey sampling
- Active learning (experimental design): Sugiyama (JMLR 2006)
SLIDE 7
Covariate Shift
To illustrate the effect of covariate shift, let's focus on linear extrapolation.
[Figure: training and test samples; true function vs. learned function]
SLIDE 8
Ordinary Least-Squares
- If the model is correct: OLS minimizes the bias asymptotically.
- If the model is misspecified: OLS does not minimize the bias even asymptotically.
We never have the correct model in practice, so we need to reduce the bias!
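For reference, the OLS criterion this slide refers to, written for a parametric model \(\hat{f}(x;\theta)\) (standard notation; the slide's own formula did not survive extraction):

```latex
\hat{\theta}_{\mathrm{OLS}}
  = \operatorname*{arg\,min}_{\theta}
    \sum_{i=1}^{n} \bigl( \hat{f}(x_i;\theta) - y_i \bigr)^2
```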
SLIDE 9
Law of Large Numbers
The sample average converges to the population mean.
We want to estimate the expectation over test input points using only training input points.
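In symbols (a standard statement; p_tr and p_te denote the training and test input densities): the plain sample average over training inputs converges to the training-distribution mean, whereas the quantity of interest is the test-distribution mean.

```latex
\frac{1}{n}\sum_{i=1}^{n} g(x_i)
  \;\xrightarrow[\,n\to\infty\,]{}\;
  \mathbb{E}_{x \sim p_{\mathrm{tr}}}[g(x)]
\qquad \text{but we want} \qquad
\mathbb{E}_{x \sim p_{\mathrm{te}}}[g(x)].
```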
SLIDE 10
Key Trick: Importance-Weighted Average
Importance: the ratio of the test and training input densities.
Importance-weighted average (cf. importance sampling): weight each training point by its importance before averaging.
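Restoring the formulas behind this slide (the standard importance sampling identity, valid whenever p_tr(x) > 0 wherever p_te(x) > 0):

```latex
w(x) = \frac{p_{\mathrm{te}}(x)}{p_{\mathrm{tr}}(x)},
\qquad
\frac{1}{n}\sum_{i=1}^{n} w(x_i)\, g(x_i)
  \;\xrightarrow[\,n\to\infty\,]{}\;
  \mathbb{E}_{x \sim p_{\mathrm{te}}}[g(x)]
\quad \text{for } x_i \sim p_{\mathrm{tr}}.
```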
SLIDE 11
Importance-Weighted LS
Even for misspecified models, IWLS minimizes the bias asymptotically.
The importance w(x) is assumed known and strictly positive.
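A minimal sketch of IWLS for a linear-in-parameters model, reusing the toy data from the Slide 4 sketch and assuming (as on this slide) that both input densities, and hence w, are known:

```python
import numpy as np
from scipy.stats import norm

def iwls(X, y, w):
    """Importance-weighted least squares:
    minimizes sum_i w[i] * (X[i] @ theta - y[i])**2."""
    Xw = X * w[:, None]  # scale row i by w_i
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)  # weighted normal equations

# Importance weights for the toy training inputs (densities assumed known).
w = norm.pdf(x_train, loc=2.0, scale=0.3) / norm.pdf(x_train, loc=1.0, scale=0.5)

# Linear model y = theta_0 + theta_1 * x.
X = np.column_stack([np.ones_like(x_train), x_train])
theta_iwls = iwls(X, y_train, w)
theta_ols = iwls(X, y_train, np.ones_like(w))  # OLS is the unit-weight case
```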
SLIDE 12
Importance-Weighted LS (cont.)
However, the variance of IWLS is larger than that of OLS (cf. BLUE).
We want to reduce the variance, so we add a small bias to IWLS (e.g., by changing the weights or by regularization).
SLIDE 13
Adaptive IWLS (Shimodaira, 2000)
Flatten the importance weights to w(x)^λ with 0 ≤ λ ≤ 1:
- λ = 0 (OLS): large bias, small variance
- λ = 1 (IWLS): small bias, large variance
- 0 < λ < 1: intermediate
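In code, the adaptive family is a one-line extension of the `iwls` sketch above; λ is the flattening exponent.

```python
def aiwls(X, y, w, lam):
    """Adaptive IWLS: flatten the importance weights as w**lam.
    lam = 0 recovers OLS; lam = 1 recovers full IWLS."""
    return iwls(X, y, w ** lam)
```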
SLIDE 14
Model Selection
We want to determine λ so that the generalization error (bias + variance) is minimized.
However, the generalization error is inaccessible, so we use a generalization error estimator instead.
SLIDE 15
Cross-Validation
A standard method for generalization error estimation:
1. Divide the training samples into k groups.
2. Train a learning machine with k-1 groups.
3. Validate the trained machine on the remaining group.
4. Repeat this for all k combinations and output the mean validation error.
[Figure: k-fold split into Group 1, ..., Group k; training vs. validation]
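A plain k-fold CV sketch for squared loss (generic code, not specific to the paper; `fit` and `predict` stand for any learning machine):

```python
import numpy as np

def kfold_cv_error(X, y, fit, predict, k=10, seed=0):
    """Ordinary k-fold CV estimate of the squared test error."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for fold in folds:
        tr = np.setdiff1d(np.arange(len(y)), fold)  # train on k-1 groups
        theta = fit(X[tr], y[tr])
        errs.append(np.mean((predict(X[fold], theta) - y[fold]) ** 2))
    return float(np.mean(errs))  # mean validation error
```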
SLIDE 16
CV under Covariate Shift
CV is almost unbiased without covariate shift.
However, it is heavily biased under covariate shift.
[Figure: cross-validation estimate vs. true generalization error]
SLIDE 17
Goal of This Talk
We propose a better generalization error estimator under covariate shift!
SLIDE 18
Importance-Weighted CV (IWCV)
When testing the classifier in the CV process, we also importance-weight the test error.
[Figure: k-fold split into Set 1, ..., Set k; training vs. testing]
IWCV gives almost unbiased estimates of the generalization error even under covariate shift.
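Relative to the plain CV sketch above, the change is a single line: each held-out error is multiplied by the importance of its input. The usage example reuses `aiwls`, `X`, `y_train`, and `w` from the earlier sketches; the candidate grid `lams` is an illustrative choice.

```python
import numpy as np

def iw_kfold_cv_error(X, y, w, fit, predict, k=10, seed=0):
    """Importance-weighted k-fold CV under covariate shift."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errs = []
    for fold in folds:
        tr = np.setdiff1d(np.arange(len(y)), fold)
        theta = fit(X[tr], y[tr], w[tr])  # the learner may also use the weights
        # Importance-weight the validation error as well.
        errs.append(np.mean(w[fold] * (predict(X[fold], theta) - y[fold]) ** 2))
    return float(np.mean(errs))

# Example: tune the flattening parameter of AIWLS by IWCV.
lams = [0.0, 0.25, 0.5, 0.75, 1.0]
best_lam = min(lams, key=lambda lam: iw_kfold_cv_error(
    X, y_train, w,
    fit=lambda Xs, ys, ws: aiwls(Xs, ys, ws, lam),
    predict=lambda Xs, theta: Xs @ theta))
```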
SLIDE 19
Example of IWCV
IWCV is nicely unbiased; model selection by IWCV works well.
[Figure: ordinary CV and IWCV vs. true generalization error]
Obtained generalization error, Mean (Std.):
  IWCV:        0.077 (0.020)
  Ordinary CV: 0.356 (0.086)
SLIDE 20
Relation to Existing Methods
IWAIC (Shimodaira, JSPI 2000); IWSIC (Sugiyama & Müller, Stat. & Deci. 2005)

                     IWCV           IWSIC                IWAIC
Unbiasedness         Finite sample  Asymptotic & finite  Asymptotic
Loss                 Arbitrary      Squared              Smooth
Computation          Slow           Fast                 Fast
Parameter learning   Arbitrary      Linear               Smooth
Model                Arbitrary      Linear               Regular

IWCV is the first method that is applicable to classification with covariate shift!
SLIDE 21
Application: Brain-Computer Interface
Brain activity in different mental states is transformed into control signals
SLIDE 22
Non-Stationarity in EEG Features
Different mental conditions (attention, sleepiness, etc.) between training and test phases may change the EEG signals.
[Figures: bandpower differences between training and test phases; features extracted from brain activity during training and test phases]
SLIDE 23
Adaptive Importance-Weighted Linear Discriminant Analysis
The standard classification method in BCI is LDA (after appropriate feature extraction). We use its variant, AIWLDA:
- λ = 0: ordinary LDA (standard method)
- λ = 1: IWLDA (consistent)
- λ is tuned by the proposed IWCV
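One plausible construction of the weighted LDA behind this slide, sketched under my own assumptions (the slides do not spell out the estimator, so this is not necessarily the authors' exact formulation): class means, priors, and the pooled covariance are estimated with the flattened weights w**λ, and prediction uses the usual LDA discriminant.

```python
import numpy as np

def aiwlda_fit(X, y, w, lam):
    """Sketch of AIWLDA: LDA whose class means, priors, and pooled
    covariance are estimated with flattened importance weights w**lam."""
    v = w ** lam
    classes = np.unique(y)
    means, priors = {}, {}
    S = np.zeros((X.shape[1], X.shape[1]))
    for c in classes:
        m = (y == c)
        mu = np.average(X[m], axis=0, weights=v[m])
        d = X[m] - mu
        S += d.T @ (d * v[m][:, None])  # weighted within-class scatter
        means[c], priors[c] = mu, v[m].sum() / v.sum()
    S_inv = np.linalg.pinv(S / v.sum())
    return classes, means, priors, S_inv

def aiwlda_predict(X, model):
    """Classify with the standard LDA discriminant scores."""
    classes, means, priors, S_inv = model
    scores = np.stack([
        X @ S_inv @ means[c]
        - 0.5 * means[c] @ S_inv @ means[c]
        + np.log(priors[c])
        for c in classes])
    return classes[np.argmax(scores, axis=0)]
```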
SLIDE 24
BCI Results
The proposed method outperforms the existing one in 5 cases!

Error rates per subject and trial:

Subject  Trial  AIWLDA+10IWCV  Ordinary LDA
1        1      10.0 %          9.3 %
1        2       8.8 %          8.8 %
1        3       4.3 %          4.3 %
2        1      40.0 %         40.0 %
2        2      38.7 %         39.3 %
2        3      25.5 %         25.5 %
3        1      34.4 %         36.9 %
3        2      19.3 %         21.3 %
3        3      17.5 %         22.5 %
4        1      21.3 %         21.3 %
4        2       2.4 %          2.4 %
4        3       6.4 %          6.4 %
5        1      21.3 %         21.3 %
5        2      14.0 %         15.3 %
SLIDE 25
BCI Results
When the KL divergence is large, IWCV is better; when it is small, there is no difference.
Non-stationarity in EEG could be successfully modeled by covariate shift!
Subject  Trial  KL    AIWLDA+10IWCV  Ordinary LDA
1        1      0.76  10.0 %          9.3 %
1        2      1.11   8.8 %          8.8 %
1        3      0.69   4.3 %          4.3 %
2        1      0.97  40.0 %         40.0 %
2        2      1.05  38.7 %         39.3 %
2        3      0.43  25.5 %         25.5 %
3        1      2.63  34.4 %         36.9 %
3        2      2.88  19.3 %         21.3 %
3        3      1.25  17.5 %         22.5 %
4        1      9.23  21.3 %         21.3 %
4        2      5.58   2.4 %          2.4 %
4        3      1.83   6.4 %          6.4 %
5        1      0.79  21.3 %         21.3 %
5        2      2.01  14.0 %         15.3 %

KL: KL divergence from the training to the test input distribution.
SLIDE 26
Conclusions
- Covariate shift: the input distribution varies but the functional relation remains unchanged.
- The importance weight plays a central role in compensating for covariate shift.
- IW cross-validation: unbiased and general.
- IWCV improves the performance of BCI.
- Class-prior change: a variant of IWCV works.
- Latent distribution shift: Storkey & Sugiyama (to be presented at NIPS 2006).