LEARNING UNDER COVARIATE SHIFT
Domain Adaptation, Transfer Learning, Data Shift, Concept Drift…
Marco Loog, Pattern Recognition Laboratory, Delft University of Technology


SLIDE 1

SLIDE 2

LEARNING UNDER COVARIATE SHIFT

Domain Adaptation, Transfer Learning, Data Shift, Concept Drift… Marco Loog Pattern Recognition Laboratory Delft University of Technology

SLIDE 3

SLIDE 4

SLIDE 5

Covariate Shift Assumption

Covariate shift defined via the posterior or via the labeling function:

P(Y|X) = Q(Y|X) vs. ℓ(X|P) = ℓ(X|Q) = ℓ(X)

Equivalent to the missing-at-random assumption:

P(S=1|X,Y) = P(S=1|X)
Standard [i.i.d.] setting: P(S=1|X,Y) = P(S=1)

SLIDE 6

Graphically Speaking

Covariate shift: P(S=1|X,Y) = P(S=1|X)
So a change of class priors is not covariate shift; that would be P(S=1|X,Y) = P(S=1|Y)

SLIDE 7

The Canonical Example

How much does it help, really, when the hypotheses considered are very nonparametric?

SLIDE 8

Importance Weighting : Basic Idea

Expected risk on test: ∫∫ L(x,y|θ) P(x,y) dx dy
Rewrite: ∫∫ L(x,y|θ) [P(x)/Q(x)] Q(x,y) dx dy
Empirical loss [on training]: ∑ᵢ L(xᵢ,yᵢ|θ) P(xᵢ)/Q(xᵢ)
Importance weights: P(xᵢ)/Q(xᵢ)
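The rewrite above can be sketched in a few lines: training points are drawn from Q, but each loss term is reweighted by P/Q so that the empirical loss estimates the test risk under P. A minimal sketch assuming both densities are known univariate Gaussians (the particular means and the linear model are illustrative choices, not from the slides):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate normal density."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)

# Training data drawn from Q = N(0, 1); true relation y = x + noise.
x_tr = rng.normal(0.0, 1.0, size=2000)
y_tr = x_tr + rng.normal(0.0, 0.1, size=2000)

# Importance weights w(x) = P(x) / Q(x), with test density P = N(1, 1).
w = gauss_pdf(x_tr, 1.0, 1.0) / gauss_pdf(x_tr, 0.0, 1.0)

def weighted_risk(theta):
    """Importance-weighted empirical squared loss: (1/n) sum_i w_i * L(x_i, y_i | theta)."""
    return np.mean(w * (y_tr - theta * x_tr) ** 2)
```

The weighted loss is minimized near the true coefficient, even though it is computed entirely on samples from Q.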

SLIDE 9

Estimation of Importance : E.g.

Estimate P(x) and Q(x) [normal distributions, Parzen densities, whatever] and compute the weights as w = P/Q
Sugiyama suggests estimating the weights directly:

Find w such that KL(P||wQ) is minimal [KLIEP]; Q and P are modelled by Parzen densities
More well-founded suggestions have been given by Huang, Smola, Cortes, Mohri, Mansour, et al.

Yet another approach is based on a very simple [Laplace-smoothed] nearest neighbor estimate
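The plug-in route in the first bullet might look as follows: fit a simple density to each sample separately and take the ratio at the training points. This is a sketch with maximum-likelihood Gaussian fits standing in for "normal distributions, Parzen densities, whatever"; in practice the ratio of two separately estimated densities can be badly behaved in the tails, which is exactly what motivates the direct estimators mentioned above:

```python
import numpy as np

def fit_gauss(x):
    """Maximum-likelihood fit of a univariate normal: sample mean and std."""
    return x.mean(), x.std()

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def plugin_weights(x_train, x_test):
    """Estimate w(x) = P_hat(x) / Q_hat(x) at the training points."""
    mu_q, sd_q = fit_gauss(x_train)   # Q: training density
    mu_p, sd_p = fit_gauss(x_test)    # P: test density
    return gauss_pdf(x_train, mu_p, sd_p) / gauss_pdf(x_train, mu_q, sd_q)

rng = np.random.default_rng(1)
x_train = rng.normal(0.0, 1.0, size=1000)
x_test = rng.normal(1.0, 1.0, size=1000)

# Training points lying where the test density is high get large weights.
w_hat = plugin_weights(x_train, x_test)
```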

SLIDE 10

Again! A Shameless Plug…

But only a short one this time…
Nearest neighbor weighting [NNeW]
The idea…
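One reading of the nearest-neighbor weighting idea (a sketch of the general principle, not the paper's exact estimator): give each training point a weight proportional to the number of test points for which it is the nearest training point, with Laplace smoothing so that no weight is exactly zero.

```python
import numpy as np

def nn_weights(x_train, x_test, smooth=1.0):
    """Nearest-neighbor importance weights (sketch).

    Each training point is weighted by the number of test points whose
    nearest training point it is, Laplace-smoothed by `smooth`, then
    normalized to mean 1.
    """
    # Pairwise distances: rows index test points, columns training points.
    d = np.abs(x_test[:, None] - x_train[None, :])
    nearest = d.argmin(axis=1)  # for each test point: index of nearest training point
    counts = np.bincount(nearest, minlength=len(x_train))
    w = counts + smooth
    return w / w.mean()

rng = np.random.default_rng(2)
x_train = rng.normal(0.0, 1.0, size=500)
x_test = rng.normal(1.0, 1.0, size=500)
w = nn_weights(x_train, x_test)
```

Note that no densities are estimated at all; the counts play the role of the ratio P/Q directly.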

SLIDE 11

“Optimal” Weights

Linear regression example

Find the coefficient θ that relates y to x via y = θx + ɛ
Optimal θ = 1
Squared loss
Assume one knows the true P(X) and Q(X)

For a particular weighting, the solution can be found by means of weighted regression

[Figure: training and test densities P and Q]
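For this one-parameter model the weighted-regression solution has a simple closed form: minimizing ∑ᵢ wᵢ(yᵢ − θxᵢ)² gives θ̂ = ∑ᵢ wᵢxᵢyᵢ / ∑ᵢ wᵢxᵢ². A sketch under the slide's assumptions (both densities known, optimal θ = 1; the particular Gaussians and noise level are my own illustrative choices):

```python
import numpy as np

def gauss_pdf(t, mu, sigma):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def weighted_theta(x, y, w):
    """Minimizer of sum_i w_i * (y_i - theta * x_i)^2."""
    return np.sum(w * x * y) / np.sum(w * x * x)

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(0.0, 1.0, size=n)       # training inputs, density N(0, 1)
y = x + rng.normal(0.0, 0.5, size=n)   # optimal theta = 1

# "True" weights: (test density) / (training density), test density N(1, 1).
w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)

theta_unweighted = weighted_theta(x, y, np.ones(n))
theta_weighted = weighted_theta(x, y, w)
# Both land near 1 here because the linear model is correctly specified;
# under misspecification the two solutions would differ.
```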

SLIDE 12

Learning Curve for “Optimal” Weights

Using the true weights Q/P, what behavior do we expect for increasing sample sizes?
Let us consider relative improvements: MSE(Q)/MSE(P)

1 training sample? Many [say ∞] training samples? And in between?
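One way to get a feel for this learning-curve question is a small Monte Carlo simulation on the toy regression model: estimate the MSE of the weighted and unweighted θ̂ at a given training-set size and look at their ratio. This is a sketch; the densities, noise level, and the ratio reported are my own illustrative choices, not the slide's exact experiment:

```python
import numpy as np

def gauss_pdf(t, mu, sigma):
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def theta_hat(x, y, w):
    """Weighted least squares for y = theta * x."""
    return np.sum(w * x * y) / np.sum(w * x * x)

def mse_ratio(n, reps=500, seed=0):
    """Monte Carlo estimate of MSE(weighted) / MSE(unweighted) for theta_hat."""
    rng = np.random.default_rng(seed)
    err_w, err_u = [], []
    for _ in range(reps):
        x = rng.normal(0.0, 1.0, size=n)      # training density N(0, 1)
        y = x + rng.normal(0.0, 0.5, size=n)  # optimal theta = 1
        # True (test density) / (training density) weights, test = N(1, 1).
        w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)
        err_w.append((theta_hat(x, y, w) - 1.0) ** 2)
        err_u.append((theta_hat(x, y, np.ones(n)) - 1.0) ** 2)
    return np.mean(err_w) / np.mean(err_u)
```

Evaluating `mse_ratio` over a grid of `n` traces out the learning curve; in this correctly specified setting, weighting only inflates the variance of the estimate, which illustrates the remark later in the deck that the weighted version can deteriorate even with the true weights.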

SLIDE 13

As a Side Remark

Can we solve semi-supervised learning by importance weighting?

[Earlier references to Sokolovska and Kawakita]

SLIDE 14

[Further] Questions, Remarks, etc.

What problems can be modelled as covariate shift?
What if P(S=1|X,Y) cannot be simplified?
Bickel et al. take Sugiyama et al. a step further, and discrepancy minimization makes yet another step
The weighted version can deteriorate even if the “true” weights are used
Correction by weighting might have hardly any influence when nonparametric hypotheses are considered
When to use weighting in the first place?

SLIDE 15

References

  • Ben-David, Blitzer, Crammer, Kulesza, Pereira, Vaughan, “A theory of learning from different domains,” ML, 2010
  • Ben-David, Lu, Pál, “Impossibility theorems for domain adaptation,” AISTATS, 2010
  • Ben-David, Urner, “On the hardness of domain adaptation and the utility of unlabeled target samples,” ALT, 2012
  • Bickel, Brückner, Scheffer, “Discriminative learning under covariate shift,” JMLR, 2009
  • Cortes, Mohri, “Domain adaptation and sample bias correction theory and algorithm for regression,” Theoretical CS, 2014
  • Daumé III, “Frustratingly easy domain adaptation,” ACL, 2009
  • Dinh, Duin, Piqueras-Salazar, Loog, “FIDOS: A generalized Fisher based feature extraction method for domain shift,” PR, 2013
  • Gama, Zliobaite, Bifet, Pechenizkiy, Bouchachia, “A survey on concept drift adaptation,” ACM CSUR, 2014
  • Jiang, “A literature survey on domain adaptation of statistical classifiers,” 2008
  • Loog, “Nearest neighbor-based importance weighting,” MLSP, 2012
  • Lu, Behbood, Hao, Zuo, Xue, Zhang, “Transfer Learning using Computational Intelligence: A Survey,” KBS, 2015
  • Mansour, Mohri, Rostamizadeh, “Domain adaptation: Learning bounds and algorithms,” COLT, 2009
  • Margolis, “A literature review of domain adaptation with unlabeled data,” University of Washington, TR 35, 2010
  • Pan, Tsang, Kwok, Yang, “Domain adaptation via transfer component analysis,” IEEE TNN, 2011
  • Pan, Yang, “A survey on transfer learning,” IEEE TKDE, 2010
  • Quiñonero-Candela, Sugiyama, Schwaighofer, Lawrence, “Dataset shift in machine learning,” The MIT Press, 2009
  • Shimodaira, “Improving predictive inference under covariate shift by weighting the log-likelihood function,” J. Stat. Plan. Inference, 2000
  • Sugiyama, Krauledat, & Müller, “Covariate shift adaptation by importance weighted cross validation,” JMLR, 2007
  • Torrey, Shavlik, “Transfer learning,” Handbook of Research on ML Applications and Trends, 2009