
Few-shot Domain Adaptation by Causal Mechanism Transfer



  1. Few-shot Domain Adaptation by Causal Mechanism Transfer
     Domain adaptation. Q. When is it possible? Transfer assumption (TA)?
     Causal mechanism. A. Common causal mechanism as the relation.
     Summary: A common data-generating (causal) mechanism can be a foundation for domain adaptation.

  2. What’s the intuition?
     Intuition: Humans care about finding causal knowledge because, once discovered, it applies to different systems.
     Motivating example: Regional disease prediction
     • Predict disease risk from medical records [1].
     • Data distributions may vary for different lifestyles.
     • Common pathological mechanism across regions.

  3. Few-shot Domain Adaptation by Causal Mechanism Transfer
     Takeshi Teshima (1, 2), Issei Sato (1, 2), and Masashi Sugiyama (2, 1)
     (1) The University of Tokyo, (2) RIKEN
     (This work was supported by RIKEN Junior Research Associate Program.)

  4. Preliminary: Causal model (SEMs)
     Structural Equation Models (SEMs) [2]
     • Generative model for the joint distribution of data.
     • Consists of deterministic functions of the form
       $$
       \begin{aligned}
       X_1 &= f'_1(\mathrm{pa}_1, S_1) \\
       X_2 &= f'_2(\mathrm{pa}_2, S_2) \\
       X_3 &= f'_3(\mathrm{pa}_3, S_3) \\
       Y   &= f'_4(\mathrm{pa}_4, S_4)
       \end{aligned}
       $$
       and an independent distribution of $(S_1, \dots, S_D)$.
     Notes: More precisely, the model is an NPSEM-IE (Nonparametric SEM with Independent Errors). Acyclicity is assumed.
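To make the SEM definition concrete, here is a minimal sketch that samples from a toy three-variable model. The specific structural functions, the parent sets, and the Gaussian noise are assumptions chosen for illustration; none of them come from the slides.

```python
import random


def sample_sem(n, seed=0):
    """Draw n samples (x1, x2, y) from a toy, hypothetical acyclic SEM."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Independent noise terms S1, S2, S3 (the independent distribution).
        s1, s2, s3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        # Structural equations, evaluated in causal (topological) order.
        x1 = s1                    # X1 has no parents
        x2 = 0.5 * x1 ** 2 + s2    # pa(X2) = {X1}
        y = x1 - x2 + s3           # pa(Y) = {X1, X2}
        samples.append((x1, x2, y))
    return samples


data = sample_sem(5)
```

Because the graph is acyclic, each variable can be computed once its parents are known, which is why a single pass in topological order suffices.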

  5. Nonlinear-ICA and reduced-form SEM
     Reduced form: Structural equations solved for $(X, Y)$.
     $$
     \underbrace{
     \begin{aligned}
     X_1 &= f'_1(\mathrm{pa}_1, S_1) \\
     X_2 &= f'_2(\mathrm{pa}_2, S_2) \\
     X_3 &= f'_3(\mathrm{pa}_3, S_3) \\
     Y   &= f'_4(\mathrm{pa}_4, S_4)
     \end{aligned}
     }_{\text{Structural equations}}
     \quad\Rightarrow\quad
     \underbrace{
     \begin{pmatrix} X_1 \\ X_2 \\ X_3 \\ Y \end{pmatrix}
     = f \begin{pmatrix} S_1 \\ S_2 \\ S_3 \\ S_4 \end{pmatrix}
     }_{\text{Reduced form}}
     $$
     • Under certain identification conditions, nonlinear-ICA methods can estimate $f$ (we use it in our method). ICA = independent component analysis.
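As an illustration of the reduced form, the sketch below writes a toy SEM as a single map from noise terms to observed variables, together with its inverse. The structural functions are hypothetical; in the method described here the map would be estimated by nonlinear ICA rather than written by hand.

```python
def f(s):
    """Reduced form of a toy SEM: maps (S1, S2, S3) to (X1, X2, Y)."""
    s1, s2, s3 = s
    x1 = s1
    x2 = 0.5 * x1 ** 2 + s2
    y = x1 - x2 + s3
    return (x1, x2, y)


def f_inv(z):
    """Inverse of f: recovers the noise terms from an observation."""
    x1, x2, y = z
    s1 = x1
    s2 = x2 - 0.5 * x1 ** 2
    s3 = y - x1 + x2
    return (s1, s2, s3)


s = (0.3, -1.2, 2.0)
recovered = f_inv(f(s))
```

Each structural equation here is additive in its own noise term, which makes the inverse easy to write down; this is a convenience of the toy example, not a requirement stated in the slides.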

  6. Formulation 1/3: Data and goal
     Basic setup: regression domain adaptation
     1. Homogeneous (i.e., all domains in the same space): $\mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^{D-1} \times \mathbb{R}$.
     2. Multi-source (i.e., multiple source domains): $\mathcal{D}_k = \{(x_{k,i}, y_{k,i})\}_{i=1}^{n_k} \overset{\text{i.i.d.}}{\sim} p_{\mathrm{src}(k)}$ $(k = 1, \dots, K)$, with large $n_k$.
     3. Few-shot supervised (i.e., target data with labels): $\{(x_{\mathrm{tar},i}, y_{\mathrm{tar},i})\}_{i=1}^{n_{\mathrm{tar}}} \overset{\text{i.i.d.}}{\sim} p_{\mathrm{tar}}$, with small $n_{\mathrm{tar}}$.
     Goal: accurate predictor for the target distribution. Find $g : \mathbb{R}^{D-1} \to \mathbb{R}$ such that $R(g) := \mathbb{E}_{\mathrm{tar}}\, \ell(g, X, Y)$ is minimal ($\ell$: loss function).

  7. Formulation 2/3: Each domain
     • Each domain follows a nonlinear-ICA model: a distribution $p$ consists of a pair $(f, q)$.
       1. $D$-dimensional ICs $S$ are sampled from $q$.
       2. An invertible $f$ transforms $S$ into $(X, Y) = f(S)$.
     • $f$ can be estimated by ICA under assumptions.
     • $f$ corresponds to the reduced form of an SEM.

  8. Formulation 3/3: Key assumption
     • Key assumption: the generative mechanism $f$ is common across domains.
     • Flexible shift in $q$ is allowed, which enables DA among seemingly very different distributions.
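A small sketch of what the key assumption permits: two domains share one mechanism but draw their independent components from different distributions, so the observed data look very different. The mechanism and both component distributions are hypothetical choices for illustration.

```python
import random


def f(s):
    """Shared mechanism (hypothetical): X = S1, Y = X**2 + S2."""
    s1, s2 = s
    return (s1, s1 ** 2 + s2)


def sample_domain(draw_s1, n, rng):
    """Sample n points; only the distribution q of the ICs differs per domain."""
    return [f((draw_s1(rng), rng.gauss(0, 0.1))) for _ in range(n)]


rng = random.Random(0)
domain_a = sample_domain(lambda r: r.gauss(0, 1), 1000, rng)    # q_A
domain_b = sample_domain(lambda r: r.uniform(2, 3), 1000, rng)  # q_B

# The marginal of Y differs strongly between the two domains ...
mean_y_a = sum(y for _, y in domain_a) / len(domain_a)
mean_y_b = sum(y for _, y in domain_b) / len(domain_b)

# ... yet the same inverse mechanism recovers S2 in both of them.
residuals = [y - x ** 2 for x, y in domain_a + domain_b]
```

Methods that match or reweight the observed distributions have little to work with in such a case, whereas the shared $f$ still links the domains.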

  9. Proposed method: Strategy
     [Figure: pipeline applying $\hat{f}^{-1}$, then $\hat{f}$.]
     Idea: How to exploit the assumption
     1. Estimate $f$ using source domain data (nonlinear ICA, NLICA).
     2. Estimate the ICs of the target data using $\hat{f}^{-1}$.
     3. Reshuffle the independent components.
     4. Generate target data from the reshuffled ICs using $\hat{f}$.
     5. Train the predictor $g$ on the generated data.
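Steps 2 to 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: step 1 is skipped by using a hand-written invertible map as a stand-in for the nonlinear-ICA estimate, "reshuffling" is taken to mean forming every cross-sample combination of component values, and a least-squares line replaces the actual predictor.

```python
import itertools


def f_hat(s):
    """Stand-in for the estimated mechanism (hypothetical): (S1, S2) -> (X, Y)."""
    s1, s2 = s
    return (s1, s1 ** 2 + s2)


def f_hat_inv(z):
    """Inverse of the stand-in mechanism: (X, Y) -> (S1, S2)."""
    x, y = z
    return (x, y - x ** 2)


def augment(target_data):
    """Steps 2-4: extract ICs, recombine them across samples, map back."""
    ics = [f_hat_inv(z) for z in target_data]   # step 2
    columns = list(zip(*ics))                   # values of each component
    reshuffled = itertools.product(*columns)    # step 3: n**D combinations
    return [f_hat(s) for s in reshuffled]       # step 4


def fit_line(data):
    """Step 5 with a simple stand-in predictor: least-squares line y ~ x."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


target_data = [(0.0, 1.0), (1.0, 1.5), (2.0, 3.0)]
augmented = augment(target_data)   # 3**2 = 9 synthetic points
g = fit_line(augmented)
```

The number of generated points grows as $n^D$, so for anything beyond a handful of target samples one would likely subsample the combinations rather than enumerate them all.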

  10. Theoretical analyses
      Q1. How does the method statistically help?
      Theorem: If $\hat{f} = f$, the proposed risk estimator is the uniformly minimum variance unbiased risk estimator. The method should help in terms of variance.
      Q2. What happens when $\hat{f} \neq f$? What’s the catch?
      Theorem: generalization error bound for $\hat{f} \neq f$. The method mitigates overfitting but introduces bias.
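The slides do not show the risk estimator itself. One form consistent with the five-step strategy, assuming that reshuffling means averaging over every combination of sample indices, is sketched below; the symbol $\check{R}$ and the component notation $\hat{s}^{(d)}_i$ are introduced here for illustration only.

$$
\hat{s}_i = \hat{f}^{-1}(x_{\mathrm{tar},i}, y_{\mathrm{tar},i}), \qquad
\check{R}(g) = \frac{1}{n_{\mathrm{tar}}^{D}} \sum_{(i_1, \dots, i_D) \in \{1, \dots, n_{\mathrm{tar}}\}^{D}} \ell\bigl(g, \hat{f}(\hat{s}^{(1)}_{i_1}, \dots, \hat{s}^{(D)}_{i_D})\bigr)
$$

Here $\ell(g, \hat{f}(\cdot))$ is shorthand for evaluating $\ell(g, X, Y)$ at the generated pair $(X, Y)$. When $\hat{f} = f$, every summand is the loss at a point drawn from the target distribution, because the components are independent, so the average remains unbiased while using $n_{\mathrm{tar}}^{D}$ synthetic points instead of $n_{\mathrm{tar}}$.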

  11. Experiment: Setup
      • Dataset: Gasoline consumption dataset [3].
        ▶ Panel data from econometrics (SEMs have been applied).
        ▶ 18 countries (= domains), 19 years, $D = 4$.
      • Baselines for regression domain adaptation (predictor: KRR):

        | Name | Compared method |
        |---|---|
        | TarOnly | Train on target. |
        | SrcOnly | Train on source. |
        | S&TV | Train on source, CV on target. |
        | TrAdaBoost | Boosting for few-shot regression transfer [4]. |
        | IW | Joint importance weight using RuLSIF [5]. |
        | GDM | Generalized discrepancy minimization [6]. |
        | Copula | Non-parametric R-vine copula method [7]. |
        | LOO (reference) | LOOCV error estimate. |

  12. Experiment: Result

      | Target | (LOO) | TrgOnly | Prop | SrcOnly | S&TV | TrAda | GDM | Copula | IW(.0) | IW(.5) | IW(.95) |
      |---|---|---|---|---|---|---|---|---|---|---|---|
      | AUT | 1 | 5.88 (1.60) | 5.39 (1.86) | 9.67 (0.57) | 9.84 (0.62) | 5.78 (2.15) | 31.56 (1.39) | 27.33 (0.77) | 39.72 (0.74) | 39.45 (0.72) | 39.18 (0.76) |
      | BEL | 1 | 10.70 (7.50) | 7.94 (2.19) | 8.19 (0.68) | 9.48 (0.91) | 8.10 (1.88) | 89.10 (4.12) | 119.86 (2.64) | 105.15 (2.96) | 105.28 (2.95) | 104.30 (2.95) |
      | CAN | 1 | 5.16 (1.36) | 3.84 (0.98) | 157.74 (8.83) | 156.65 (10.69) | 51.94 (30.06) | 516.90 (4.45) | 406.91 (1.59) | 592.21 (1.87) | 591.21 (1.84) | 589.87 (1.91) |
      | DNK | 1 | 3.26 (0.61) | 3.23 (0.63) | 30.79 (0.93) | 28.12 (1.67) | 25.60 (13.11) | 16.84 (0.85) | 14.46 (0.79) | 22.15 (1.10) | 22.11 (1.10) | 21.72 (1.07) |
      | FRA | 1 | 2.79 (1.10) | 1.92 (0.66) | 4.67 (0.41) | 3.05 (0.11) | 52.65 (25.83) | 91.69 (1.34) | 156.29 (1.96) | 116.32 (1.27) | 116.54 (1.25) | 115.29 (1.28) |
      | DEU | 1 | 16.99 (8.04) | 6.71 (1.23) | 229.65 (9.13) | 210.59 (14.99) | 341.03 (157.80) | 739.29 (11.81) | 929.03 (4.85) | 817.50 (4.60) | 818.13 (4.55) | 812.60 (4.57) |
      | GRC | 1 | 3.80 (2.21) | 3.55 (1.79) | 5.30 (0.90) | 5.75 (0.68) | 11.78 (2.36) | 26.90 (1.89) | 23.05 (0.53) | 47.07 (1.92) | 45.50 (1.82) | 45.72 (2.00) |
      | IRL | 1 | 3.05 (0.34) | 4.35 (1.25) | 135.57 (5.64) | 12.34 (0.58) | 23.40 (17.50) | 3.84 (0.22) | 26.60 (0.59) | 6.38 (0.13) | 6.31 (0.14) | 6.16 (0.13) |
      | ITA | 1 | 13.00 (4.15) | 14.05 (4.81) | 35.29 (1.83) | 39.27 (2.52) | 87.34 (24.05) | 226.95 (11.14) | 343.10 (10.04) | 244.25 (8.50) | 244.84 (8.58) | 242.60 (8.46) |
      | JPN | 1 | 10.55 (4.67) | 12.32 (4.95) | 8.10 (1.05) | 8.38 (1.07) | 18.81 (4.59) | 95.58 (7.89) | 71.02 (5.08) | 135.24 (13.57) | 134.89 (13.50) | 134.16 (13.43) |
      | NLD | 1 | 3.75 (0.80) | 3.87 (0.79) | 0.99 (0.06) | 0.99 (0.05) | 9.45 (1.43) | 28.35 (1.62) | 29.53 (1.58) | 33.28 (1.78) | 33.23 (1.77) | 33.14 (1.77) |
      | NOR | 1 | 2.70 (0.51) | 2.82 (0.73) | 1.86 (0.29) | 1.63 (0.11) | 24.25 (12.50) | 23.36 (0.88) | 31.37 (1.17) | 27.86 (0.94) | 27.86 (0.93) | 27.52 (0.91) |
      | ESP | 1 | 5.18 (1.05) | 6.09 (1.53) | 5.17 (1.14) | 4.29 (0.72) | 14.85 (4.20) | 33.16 (6.99) | 152.59 (6.19) | 53.53 (2.47) | 52.56 (2.42) | 52.06 (2.40) |
      | SWE | 1 | 6.44 (2.66) | 5.47 (2.63) | 2.48 (0.23) | 2.02 (0.21) | 2.18 (0.25) | 15.53 (2.59) | 2706.85 (17.91) | 118.46 (1.64) | 118.23 (1.64) | 118.27 (1.64) |
      | CHE | 1 | 3.51 (0.46) | 2.90 (0.37) | 43.59 (1.77) | 7.48 (0.49) | 38.32 (9.03) | 8.43 (0.24) | 29.71 (0.53) | 9.72 (0.29) | 9.71 (0.29) | 9.79 (0.28) |
      | TUR | 1 | 1.65 (0.47) | 1.06 (0.15) | 1.22 (0.18) | 0.91 (0.09) | 2.19 (0.34) | 64.26 (5.71) | 142.84 (2.04) | 159.79 (2.63) | 157.89 (2.63) | 157.13 (2.69) |
      | GBR | 1 | 5.95 (1.86) | 2.66 (0.57) | 15.92 (1.02) | 10.05 (1.47) | 7.57 (5.10) | 50.04 (1.75) | 68.70 (1.25) | 70.98 (1.01) | 70.87 (0.99) | 69.72 (1.01) |
      | USA | 1 | 4.98 (1.96) | 1.60 (0.42) | 21.53 (3.30) | 12.28 (2.52) | 2.06 (0.47) | 308.69 (5.20) | 244.90 (1.82) | 462.51 (2.14) | 464.75 (2.08) | 465.88 (2.16) |
      | #Best | - | 2 | 10 | 2 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |

      Proposed > TrgOnly when the other methods using source domain data suffer from negative transfer.

  13. Take-home message
      1. Transfer assumption of a shared generative mechanism: developed a few-shot regression DA method.
      2. The proposed method extracts and uses the causal model to reduce overfitting via data augmentation.
      3. Experiments with real-world data demonstrate the validity.

  14. References
      [1] P. Yadav, M. Steinbach, V. Kumar, and G. Simon, ‘Mining electronic health records (EHRs): A survey’, ACM Computing Surveys, vol. 50, no. 6, pp. 1–40, 2018.
      [2] J. Pearl, Causality: Models, Reasoning and Inference, 2nd ed. Cambridge, U.K.; New York: Cambridge University Press, 2009.
      [3] W. H. Greene, Econometric Analysis, 7th ed. Boston: Prentice Hall, 2012.
      [4] D. Pardoe and P. Stone, ‘Boosting for regression transfer’, in Proceedings of the Twenty-Seventh International Conference on Machine Learning, Haifa, Israel, 2010, pp. 863–870.
      [5] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama, ‘Relative density-ratio estimation for robust distribution comparison’, in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., Curran Associates, Inc., 2011, pp. 594–602.
      [6] C. Cortes, M. Mohri, and A. M. Medina, ‘Adaptation based on generalized discrepancy’, Journal of Machine Learning Research, vol. 20, no. 1, pp. 1–30, 2019.

  15. References (cont.)
      [7] D. Lopez-Paz, J. M. Hernández-Lobato, and B. Schölkopf, ‘Semi-supervised domain adaptation with non-parametric copulas’, in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., Curran Associates, Inc., 2012, pp. 665–673.
