
Bridging Theory and Algorithm for Domain Adaptation (presentation by Yuchen Zhang)



  1. Bridging Theory and Algorithm for Domain Adaptation
     Yuchen Zhang, Tianle Liu, Mingsheng Long, Michael I. Jordan
     School of Software, Tsinghua University; National Engineering Lab for Big Data Software; University of California, Berkeley
     36th International Conference on Machine Learning, June 13, 2019

  2. Outline
     1. Transfer Learning
     2. Previous Theory and Algorithm
     3. MDD: Margin Disparity Discrepancy (Definition; Generalization Bounds)
     4. MDD: Theoretically Justified Algorithm
     5. Experiments

  3. Transfer Learning
     Machine learning across domains with non-IID distributions, $P \neq Q$.
     How can we design models that effectively bound the generalization error under this shift?
     [Figure: source domain (2D renderings) vs. target domain (real images), with $P(x, y) \neq Q(x, y)$; the same model $f : x \to y$ must transfer from the source representation to the target.]

  4. Notations and Assumptions
     Notations:
     - 0-1 risk: $\mathrm{err}_D(h) = \mathbb{E}_{(x,y)\sim D}\,\mathbf{1}[h(x) \neq y]$
     - Empirical 0-1 risk: $\mathrm{err}_{\widehat{D}}(h) \triangleq \mathbb{E}_{(x,y)\sim \widehat{D}}\,\mathbf{1}[h(x) \neq y] = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[h(x_i) \neq y_i]$
     - Disparity: $\mathrm{disp}_D(h', h) \triangleq \mathbb{E}_{D}\,\mathbf{1}[h' \neq h]$
     Assumptions: In unsupervised domain adaptation, there are two distinct domains, the source $P$ and the target $Q$. The learner is trained on:
     - A labeled sample $\widehat{P} = \{(x^s_i, y^s_i)\}_{i=1}^{n}$ drawn from the source distribution $P$.
     - An unlabeled sample $\widehat{Q} = \{x^t_i\}_{i=1}^{m}$ drawn from the target distribution $Q$.
     Key problem: How to control the target-domain expected risk $\mathrm{err}_Q(h)$?
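To make these notations concrete, here is a minimal NumPy sketch (not from the slides) of the empirical 0-1 risk and the disparity, assuming hypotheses are available as arrays of predicted labels; the data below are purely hypothetical.

```python
import numpy as np

def empirical_risk(preds, labels):
    # Empirical 0-1 risk: fraction of examples where h(x_i) != y_i.
    return np.mean(preds != labels)

def empirical_disparity(preds_h_prime, preds_h):
    # Empirical disparity disp(h', h): fraction of points where the two
    # hypotheses disagree (the true labels are not needed).
    return np.mean(preds_h_prime != preds_h)

# Hypothetical usage with random binary predictions.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=100)        # ground-truth labels
h = rng.integers(0, 2, size=100)        # predictions of hypothesis h
h_prime = rng.integers(0, 2, size=100)  # predictions of hypothesis h'
print(empirical_risk(h, y), empirical_disparity(h_prime, h))
```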

  5. Outline (section divider: Previous Theory and Algorithm)

  6. Previous Theory
     In the seminal work [1], the $\mathcal{H}\Delta\mathcal{H}$-divergence was proposed to measure domain discrepancy and control the target risk:
     $$d_{\mathcal{H}\Delta\mathcal{H}}(P, Q) = \sup_{h, h' \in \mathcal{H}} \left| \mathrm{disp}_Q(h', h) - \mathrm{disp}_P(h', h) \right|. \quad (1)$$
     [3] extended the $\mathcal{H}\Delta\mathcal{H}$-divergence to general loss functions, leading to the discrepancy distance:
     $$\mathrm{disc}_L(P, Q) = \sup_{h, h' \in \mathcal{H}} \left| \mathbb{E}_Q L(h', h) - \mathbb{E}_P L(h', h) \right|, \quad (2)$$
     where $L$ must be a bounded function satisfying symmetry and the triangle inequality. Note that many widely used losses, e.g. the margin loss, do not satisfy these requirements.
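For intuition only, the supremum in (1) can be computed by brute force when the hypothesis class is a small finite set of predictors. The sketch below is an illustration under that assumption, with hypothetical 1-D threshold classifiers and synthetic data; it is not how the divergence is estimated in practice.

```python
import itertools
import numpy as np

def hdh_divergence(hypotheses, Xs, Xt):
    """Brute-force H-Delta-H divergence over a finite list of hypotheses.

    Each hypothesis maps an array of inputs to predicted labels.
    Xs, Xt are unlabeled samples from the source and target domains.
    """
    best = 0.0
    for h, h_prime in itertools.product(hypotheses, repeat=2):
        disp_s = np.mean(h(Xs) != h_prime(Xs))   # disparity on source
        disp_t = np.mean(h(Xt) != h_prime(Xt))   # disparity on target
        best = max(best, abs(disp_t - disp_s))
    return best

# Hypothetical usage: threshold classifiers on shifted 1-D Gaussians.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=200)
Xt = rng.normal(1.0, 1.0, size=200)
hypotheses = [lambda x, t=t: (x > t).astype(int) for t in np.linspace(-2, 2, 9)]
print(hdh_divergence(hypotheses, Xs, Xt))
```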

  7. Previous Theory
     Theorem. For every hypothesis $h \in \mathcal{H}$,
     $$\mathrm{err}_Q(h) \le \mathrm{err}_P(h) + d_{\mathcal{H}\Delta\mathcal{H}}(P, Q) + \lambda, \quad (3)$$
     where $\lambda = \lambda(\mathcal{H}, P, Q)$ is the ideal combined loss:
     $$\lambda = \min_{h^* \in \mathcal{H}} \{ \mathrm{err}_P(h^*) + \mathrm{err}_Q(h^*) \}. \quad (4)$$
     - $\mathrm{err}_P(h)$ depicts the performance of $h$ on the source domain.
     - $d_{\mathcal{H}\Delta\mathcal{H}}$ bounds the performance gap caused by domain shift.
     - $\lambda$ quantifies the inverse of the "adaptability" between domains.
     The order of the complexity term is $O(\sqrt{d/n} + \sqrt{d/m})$, where $d$ is the VC-dimension of $\mathcal{H}$.

  8. Previous Algorithm
     [2] uses a class of domain discriminators $\mathcal{G}$ to approximate the function class $\mathcal{H}\Delta\mathcal{H} = \{ \mathbf{1}[h' \neq h] \mid h, h' \in \mathcal{H} \}$ for computing $d_{\mathcal{H}\Delta\mathcal{H}}$:
     $$d_{\mathcal{H}\Delta\mathcal{H}} \approx \sup_{g \in \mathcal{G}} \left( \mathbb{E}_Q \mathbf{1}[g(x) = 0] + \mathbb{E}_P \mathbf{1}[g(x) = 1] \right).$$
     [4] assumes that $h$ and $h'$ agree on the source domain, and uses the L1 distance between the two classifiers' probabilistic outputs on the target domain to approximate $d_{\mathcal{H}\Delta\mathcal{H}}$:
     $$d_{\mathcal{H}\Delta\mathcal{H}} \approx \sup_{f, f'} \mathbb{E}_Q \left| f(x) - f'(x) \right|.$$
     There are two crucial directions for improvement:
     - A generalization bound for classification with scoring functions and margin loss has not been formally studied in the DA setting.
     - Computing the supremum requires a search over $\mathcal{H}\Delta\mathcal{H}$, which increases the difficulty of optimization.
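As a rough illustration of the first approximation (a sketch under my own assumptions, not the authors' code), one can train a domain discriminator $g$ to separate source from target samples and plug its empirical accuracies on each domain into the expression above; here sklearn's logistic regression stands in for the discriminator class $\mathcal{G}$, and the data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def discriminator_estimate(Xs, Xt):
    """Estimate sup_g ( E_Q 1[g(x)=0] + E_P 1[g(x)=1] ) with a learned discriminator.

    Label source points 1 and target points 0, fit g, then evaluate the two
    empirical expectations from the slide on the respective samples.
    """
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.ones(len(Xs)), np.zeros(len(Xt))])
    g = LogisticRegression().fit(X, d)
    on_source = np.mean(g.predict(Xs) == 1)   # E_P 1[g(x) = 1]
    on_target = np.mean(g.predict(Xt) == 0)   # E_Q 1[g(x) = 0]
    return on_source + on_target

# Hypothetical 2-D Gaussian domains, purely for illustration.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(300, 2))
Xt = rng.normal(0.8, 1.0, size=(300, 2))
print(discriminator_estimate(Xs, Xt))
```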

  9. Outline (section divider: MDD: Margin Disparity Discrepancy)

  10. DD: Hypothesis-induced Discrepancy
     Definition (Disparity Discrepancy). Given a hypothesis space $\mathcal{H}$ and a specific classifier $h \in \mathcal{H}$, the Disparity Discrepancy (DD) induced by $h' \in \mathcal{H}$ is defined by
     $$d_{h, \mathcal{H}}(P, Q) = \sup_{h' \in \mathcal{H}} \left| \mathbb{E}_Q \mathbf{1}[h' \neq h] - \mathbb{E}_P \mathbf{1}[h' \neq h] \right|. \quad (5)$$
     The supremum in the disparity discrepancy is taken only over the hypothesis space $\mathcal{H}$ and thus can be optimized more easily.
     Theorem. For every hypothesis $h \in \mathcal{H}$,
     $$\mathrm{err}_Q(h) \le \mathrm{err}_P(h) + d_{h, \mathcal{H}}(P, Q) + \lambda, \quad (6)$$
     where $\lambda = \lambda(\mathcal{H}, P, Q)$ is the ideal combined loss.
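To contrast (5) with the pairwise supremum in (1), here is a small illustrative estimator (my sketch, with a hypothetical finite candidate set standing in for $\mathcal{H}$): the classifier $h$ is fixed, and only the auxiliary hypothesis $h'$ is searched.

```python
import numpy as np

def disparity_discrepancy(h, candidates, Xs, Xt):
    # sup over h' of |E_Q 1[h' != h] - E_P 1[h' != h]|, by brute force,
    # with h fixed and only h' ranging over the candidate set.
    hs, ht = h(Xs), h(Xt)
    gaps = [abs(np.mean(hp(Xt) != ht) - np.mean(hp(Xs) != hs)) for hp in candidates]
    return max(gaps)

# Hypothetical usage with 1-D threshold classifiers on shifted Gaussians.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=200)
Xt = rng.normal(1.0, 1.0, size=200)
h = lambda x: (x > 0.0).astype(int)
candidates = [lambda x, t=t: (x > t).astype(int) for t in np.linspace(-2, 2, 9)]
print(disparity_discrepancy(h, candidates, Xs, Xt))
```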

  11. MDD: Towards an Informative Margin Theory
     Notations for multi-class classification:
     - Scoring function: $f \in \mathcal{F} : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$
     - Labeling function induced by $f$: $h_f : x \mapsto \arg\max_{y \in \mathcal{Y}} f(x, y) \quad (7)$
     - Margin of a scoring function: $\rho_f(x, y) = \frac{1}{2}\left( f(x, y) - \max_{y' \neq y} f(x, y') \right)$
     - Margin loss:
       $$\Phi_\rho(x) = \begin{cases} 0 & \rho \le x \\ 1 - x/\rho & 0 \le x \le \rho \\ 1 & x \le 0 \end{cases}$$
     [Figure: plot of the margin loss $\Phi_\rho$, equal to 1 for $x \le 0$ and ramping linearly down to 0 at $x = \rho$.]
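A minimal sketch of these two quantities (my own, assuming the scores are given as an (n, k) array with one column per class; the inputs below are hypothetical):

```python
import numpy as np

def margins(scores, labels):
    # rho_f(x, y) = 0.5 * (f(x, y) - max_{y' != y} f(x, y')).
    n = scores.shape[0]
    correct = scores[np.arange(n), labels]
    masked = scores.copy()
    masked[np.arange(n), labels] = -np.inf      # exclude the true class
    runner_up = masked.max(axis=1)
    return 0.5 * (correct - runner_up)

def margin_loss(m, rho):
    # Phi_rho: 1 for m <= 0, linear ramp 1 - m/rho on [0, rho], 0 for m >= rho.
    return np.clip(1.0 - m / rho, 0.0, 1.0)

# Hypothetical usage with random scores over k = 3 classes.
rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 3))
labels = rng.integers(0, 3, size=5)
print(margin_loss(margins(scores, labels), rho=0.5))
```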

  12. MDD: Margin Disparity Discrepancy
     Margin error: $\mathrm{err}^{(\rho)}_D(f) = \mathbb{E}_{(x,y)\sim D}\left[ \Phi_\rho \circ \rho_f(x, y) \right]$
     Margin disparity: $\mathrm{disp}^{(\rho)}_D(f', f) \triangleq \mathbb{E}_{x \sim D}\left[ \Phi_\rho \circ \rho_{f'}(x, h_f(x)) \right]$
     Definition (Margin Disparity Discrepancy). With the definition of margin disparity, we define the Margin Disparity Discrepancy (MDD) and its empirical version by
     $$d^{(\rho)}_{f, \mathcal{F}}(P, Q) \triangleq \sup_{f' \in \mathcal{F}} \left( \mathrm{disp}^{(\rho)}_Q(f', f) - \mathrm{disp}^{(\rho)}_P(f', f) \right),$$
     $$d^{(\rho)}_{f, \mathcal{F}}(\widehat{P}, \widehat{Q}) \triangleq \sup_{f' \in \mathcal{F}} \left( \mathrm{disp}^{(\rho)}_{\widehat{Q}}(f', f) - \mathrm{disp}^{(\rho)}_{\widehat{P}}(f', f) \right). \quad (8)$$
     The margin disparity discrepancy is well defined since $d^{(\rho)}_{f, \mathcal{F}}(P, P) = 0$, and it satisfies nonnegativity and subadditivity.
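To make (8) concrete, the sketch below evaluates the empirical MDD with the supremum over $f'$ approximated by a small finite candidate set. This is only an illustration of the definition under my own assumptions (linear scoring functions, synthetic data); the paper optimizes the auxiliary $f'$ adversarially rather than enumerating candidates.

```python
import numpy as np

def margin_disparity(scores_fp, pseudo_labels, rho):
    # disp^(rho)(f', f) on a sample: margin loss of f' with respect to h_f(x).
    n = scores_fp.shape[0]
    correct = scores_fp[np.arange(n), pseudo_labels]
    masked = scores_fp.copy()
    masked[np.arange(n), pseudo_labels] = -np.inf
    m = 0.5 * (correct - masked.max(axis=1))
    return np.mean(np.clip(1.0 - m / rho, 0.0, 1.0))

def empirical_mdd(score_fn, candidate_fns, Xs, Xt, rho=0.5):
    # Empirical MDD (8): sup over f' of disp_Qhat(f', f) - disp_Phat(f', f),
    # with the sup approximated over a finite candidate set for f'.
    ys = score_fn(Xs).argmax(axis=1)   # h_f on the source sample
    yt = score_fn(Xt).argmax(axis=1)   # h_f on the target sample
    vals = [margin_disparity(fp(Xt), yt, rho) - margin_disparity(fp(Xs), ys, rho)
            for fp in candidate_fns]
    return max(vals)

# Hypothetical usage: random linear scorers on 2-D inputs, k = 3 classes.
rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(200, 2))
Xt = rng.normal(0.7, 1.0, size=(200, 2))
W = rng.normal(size=(2, 3))
score_fn = lambda X: X @ W
candidate_fns = [lambda X, Wp=Wp: X @ Wp
                 for Wp in (rng.normal(size=(2, 3)) for _ in range(20))]
print(empirical_mdd(score_fn, candidate_fns, Xs, Xt))
```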

  13. MDD: Bounding the Target Expected Error
     Theorem. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ be a hypothesis set with $\mathcal{Y} = \{1, \cdots, k\}$, and let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be the corresponding class of $\mathcal{Y}$-valued classifiers. For every scoring function $f \in \mathcal{F}$,
     $$\mathrm{err}_Q(h_f) \le \mathrm{err}^{(\rho)}_P(f) + d^{(\rho)}_{f, \mathcal{F}}(P, Q) + \lambda, \quad (9)$$
     where $\lambda = \lambda(\rho, \mathcal{F}, P, Q)$ is the ideal combined margin loss:
     $$\lambda = \min_{f^* \in \mathcal{F}} \{ \mathrm{err}^{(\rho)}_P(f^*) + \mathrm{err}^{(\rho)}_Q(f^*) \}. \quad (10)$$
     This upper bound has a form similar to the previous bound:
     - $\mathrm{err}^{(\rho)}_P(f)$ depicts the performance of $f$ on the source domain.
     - MDD bounds the performance gap caused by domain shift.
     - $\lambda$ quantifies the inverse of "adaptability".
     This gives a new perspective for analyzing DA with respect to the margin loss.
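As a purely arithmetic illustration of how the three terms of (9) combine (the numbers are hypothetical, not results from the paper):

```python
# Hypothetical values, only to show the additive structure of the bound (9).
source_margin_error = 0.05      # err_P^(rho)(f)
mdd = 0.12                      # d_{f,F}^(rho)(P, Q)
ideal_combined_loss = 0.03      # lambda
target_error_upper_bound = source_margin_error + mdd + ideal_combined_loss
print(target_error_upper_bound)  # err_Q(h_f) is at most 0.20 in this example
```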

  14. MDD: Notations for Generalization Bounds
     To derive generalization bounds for MDD, we first introduce two function classes.
     Definition. Given a class of scoring functions $\mathcal{F}$, $\Pi_1(\mathcal{F})$ is defined as
     $$\Pi_1 \mathcal{F} = \{ x \mapsto f(x, y) \mid y \in \mathcal{Y},\ f \in \mathcal{F} \}. \quad (11)$$
     We also introduce a new function class $\Pi_{\mathcal{H}} \mathcal{F}$ that serves as a "scoring" version of the symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$.
     Definition. Given a class of scoring functions $\mathcal{F}$ and the class $\mathcal{H}$ of induced classifiers, we define $\Pi_{\mathcal{H}} \mathcal{F}$ as
     $$\Pi_{\mathcal{H}} \mathcal{F} \triangleq \{ x \mapsto f(x, h(x)) \mid h \in \mathcal{H},\ f \in \mathcal{F} \}. \quad (12)$$

  15. MDD: Notations for Generalization Bounds
     Definition (Rademacher complexity). The empirical Rademacher complexity of $\mathcal{F}$ with respect to the sample $\widehat{D}$ is defined as
     $$\widehat{\mathfrak{R}}_{\widehat{D}}(\mathcal{F}) = \mathbb{E}_{\sigma} \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(z_i), \quad (13)$$
     where the $\sigma_i$ are independent uniform random variables taking values in $\{-1, +1\}$. The Rademacher complexity is
     $$\mathfrak{R}_{n, D}(\mathcal{F}) = \mathbb{E}_{\widehat{D} \sim D^n}\, \widehat{\mathfrak{R}}_{\widehat{D}}(\mathcal{F}). \quad (14)$$
     Definition (Covering number, informal). A covering number $\mathcal{N}_2(\tau, \mathcal{G})$ is the minimal number of $L_2$ balls of radius $\tau > 0$ needed to cover a class $\mathcal{G}$ of bounded functions $g : \mathcal{X} \to \mathbb{R}$.
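For intuition (an illustrative sketch only, with a hypothetical finite function class), the empirical Rademacher complexity in (13) can be approximated by Monte Carlo over random sign vectors $\sigma$:

```python
import numpy as np

def empirical_rademacher(function_values, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity (13).

    function_values: (num_functions, n) array whose rows hold f(z_1), ..., f(z_n)
    for each f in a finite class F, evaluated on the fixed sample.
    """
    rng = np.random.default_rng(seed)
    num_f, n = function_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)          # Rademacher signs
        total += np.max(function_values @ sigma) / n     # sup_f (1/n) sum sigma_i f(z_i)
    return total / n_draws

# Hypothetical usage: 10 random bounded functions evaluated on 50 sample points.
rng = np.random.default_rng(1)
values = rng.uniform(-1.0, 1.0, size=(10, 50))
print(empirical_rademacher(values))
```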
