SLIDE 1

Bridging Theory and Algorithm for Domain Adaptation

Yuchen Zhang, Tianle Liu, Mingsheng Long, Michael I. Jordan

School of Software, Tsinghua University; National Engineering Lab for Big Data Software; University of California, Berkeley

36th International Conference on Machine Learning

SLIDE 2

Outline

1. Transfer Learning
2. Previous Theory and Algorithm
3. MDD: Margin Disparity Discrepancy (Definition, Generalization Bounds)
4. MDD: Theoretically Justified Algorithm
5. Experiments

SLIDE 3

Transfer Learning

Machine learning across domains with non-IID distributions, $P \neq Q$. How can we design models that effectively bound the generalization error?

[Figure: source domain (2D renderings) vs. target domain (real images), with $P(x, y) \neq Q(x, y)$; both domains share the same labeling task $f: x \to y$.]

SLIDE 4

Notations and Assumptions

Notations:
- 0-1 risk: $\mathrm{err}_D(h) = \mathbb{E}_{(x,y)\sim D}\,\mathbf{1}[h(x) \neq y]$
- Empirical 0-1 risk: $\mathrm{err}_{\widehat{D}}(h) = \mathbb{E}_{(x,y)\sim \widehat{D}}\,\mathbf{1}[h(x) \neq y] = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}[h(x_i) \neq y_i]$
- Disparity: $\mathrm{disp}_D(h', h) = \mathbb{E}_D\,\mathbf{1}[h' \neq h]$

Assumptions: In unsupervised domain adaptation, there are two distinct domains, the source P and the target Q. The learner is trained on:
- A labeled sample $\widehat{P} = \{(x^s_i, y^s_i)\}_{i=1}^{n}$ drawn from the source distribution P.
- An unlabeled sample $\widehat{Q} = \{x^t_i\}_{i=1}^{m}$ drawn from the target distribution Q.

Key problem: How to control the target-domain expected risk $\mathrm{err}_Q(h)$?
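As a quick illustration of these notations (not part of the original slides), the following minimal NumPy sketch computes the empirical 0-1 risk and the empirical disparity from classifier predictions; all arrays and function names are hypothetical.

```python
import numpy as np

def empirical_risk(h_pred, y):
    """Empirical 0-1 risk: fraction of samples where h(x_i) != y_i."""
    return np.mean(h_pred != y)

def empirical_disparity(h1_pred, h2_pred):
    """Empirical disparity disp_D(h', h): fraction of samples where two classifiers disagree."""
    return np.mean(h1_pred != h2_pred)

# Hypothetical predictions on a labeled source sample of size n = 5.
y_source  = np.array([0, 1, 2, 1, 0])
h_source  = np.array([0, 1, 1, 1, 0])   # predictions of h
hp_source = np.array([0, 2, 1, 1, 0])   # predictions of an auxiliary h'

print(empirical_risk(h_source, y_source))        # 0.2
print(empirical_disparity(hp_source, h_source))  # 0.2
```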

SLIDE 5

Outline

1. Transfer Learning
2. Previous Theory and Algorithm
3. MDD: Margin Disparity Discrepancy (Definition, Generalization Bounds)
4. MDD: Theoretically Justified Algorithm
5. Experiments

SLIDE 6

Previous Theory

In the seminal work [1], the $\mathcal{H}\Delta\mathcal{H}$-divergence was proposed to measure domain discrepancy and control the target risk:
$$d_{\mathcal{H}\Delta\mathcal{H}}(P, Q) = \sup_{h, h' \in \mathcal{H}} \left| \mathrm{disp}_Q(h', h) - \mathrm{disp}_P(h', h) \right|. \quad (1)$$
[3] extended the $\mathcal{H}\Delta\mathcal{H}$-divergence to general loss functions, leading to the discrepancy distance:
$$\mathrm{disc}_L(P, Q) = \sup_{h, h' \in \mathcal{H}} \left| \mathbb{E}_Q L(h', h) - \mathbb{E}_P L(h', h) \right|, \quad (2)$$
where L should be a bounded function satisfying symmetry and the triangle inequality. Note that many widely used losses, e.g. the margin loss, do not satisfy these requirements.

SLIDE 7

Previous Theory

Theorem. For every hypothesis $h \in \mathcal{H}$,
$$\mathrm{err}_Q(h) \le \mathrm{err}_P(h) + d_{\mathcal{H}\Delta\mathcal{H}}(P, Q) + \lambda, \quad (3)$$
where $\lambda = \lambda(\mathcal{H}, P, Q)$ is the ideal combined loss:
$$\lambda = \min_{h^* \in \mathcal{H}} \{ \mathrm{err}_P(h^*) + \mathrm{err}_Q(h^*) \}. \quad (4)$$

- $\mathrm{err}_P(h)$ depicts the performance of h on the source domain.
- $d_{\mathcal{H}\Delta\mathcal{H}}$ bounds the performance gap caused by domain shift.
- $\lambda$ quantifies the inverse of the "adaptability" between domains.
- The order of the complexity term is $O(\sqrt{d/m} + \sqrt{d/n})$, where d is the VC dimension of $\mathcal{H}$.
SLIDE 8

Previous Algorithm

[2] uses a class of domain discriminators $\mathcal{G}$ to approximate the function class $\mathcal{H}\Delta\mathcal{H} = \{\mathbf{1}[h' \neq h] \mid h, h' \in \mathcal{H}\}$ for computing $d_{\mathcal{H}\Delta\mathcal{H}}$:
$$d_{\mathcal{H}\Delta\mathcal{H}} \approx \sup_{g \in \mathcal{G}} \left( \mathbb{E}_Q \mathbf{1}[g(x) = 0] + \mathbb{E}_P \mathbf{1}[g(x) = 1] \right)$$
[4] assumes that h and h' agree on the source domain, and then uses the $L_1$ distance between the two classifiers' probabilistic outputs on the target domain to approximate $d_{\mathcal{H}\Delta\mathcal{H}}$:
$$d_{\mathcal{H}\Delta\mathcal{H}} \approx \sup_{f, f'} \mathbb{E}_Q \left| f(x) - f'(x) \right|$$

There are two crucial directions for improvement:
- A generalization bound for classification with scoring functions and the margin loss has not been formally studied in the DA setting.
- Computing the supremum requires traversing $\mathcal{H}\Delta\mathcal{H}$, which increases the difficulty of optimization.
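To make the second approximation above concrete, here is a minimal sketch (not from the original slides) of the quantity inside the supremum in [4]: the L1 distance between two classifiers' probabilistic outputs on the target sample. In practice the supremum over f' is approximated by training f' adversarially; the arrays below are hypothetical.

```python
import numpy as np

def l1_output_discrepancy(probs_f, probs_fp):
    """Mean L1 distance between two classifiers' probabilistic outputs
    on the unlabeled target sample: E_Q |f(x) - f'(x)|."""
    return np.mean(np.abs(probs_f - probs_fp).sum(axis=1))

# Hypothetical softmax outputs of f and f' on m = 3 target samples, k = 3 classes.
probs_f  = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
probs_fp = np.array([[0.5, 0.3, 0.2], [0.2, 0.7, 0.1], [0.2, 0.5, 0.3]])

print(l1_output_discrepancy(probs_f, probs_fp))  # ≈ 0.33
```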
SLIDE 9

Outline

1. Transfer Learning
2. Previous Theory and Algorithm
3. MDD: Margin Disparity Discrepancy (Definition, Generalization Bounds)
4. MDD: Theoretically Justified Algorithm
5. Experiments

SLIDE 10

DD: Hypothesis-induced Discrepancy

Definition (Disparity Discrepancy). Given a hypothesis space $\mathcal{H}$ and a specific classifier $h \in \mathcal{H}$, the Disparity Discrepancy (DD) induced by $h' \in \mathcal{H}$ is defined by
$$d_{h,\mathcal{H}}(P, Q) = \sup_{h' \in \mathcal{H}} \left( \mathbb{E}_Q \mathbf{1}[h' \neq h] - \mathbb{E}_P \mathbf{1}[h' \neq h] \right). \quad (5)$$
The supremum in the disparity discrepancy is taken only over the hypothesis space $\mathcal{H}$ and thus can be optimized more easily.

Theorem. For every hypothesis $h \in \mathcal{H}$,
$$\mathrm{err}_Q(h) \le \mathrm{err}_P(h) + d_{h,\mathcal{H}}(P, Q) + \lambda, \quad (6)$$
where $\lambda = \lambda(\mathcal{H}, P, Q)$ is the ideal combined loss.
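For illustration (not from the original slides), the quantity inside the supremum of Eq. (5) for one fixed candidate h' is just a difference of empirical disparities; estimating the DD itself would additionally require maximizing this difference over h', e.g. by training h' adversarially. All names below are hypothetical.

```python
import numpy as np

def disparity_difference(h_pred_q, hp_pred_q, h_pred_p, hp_pred_p):
    """Inner term of the disparity discrepancy for one candidate h':
    E_Q 1[h' != h] - E_P 1[h' != h], estimated on finite samples."""
    disp_q = np.mean(hp_pred_q != h_pred_q)  # disagreement rate on the target sample
    disp_p = np.mean(hp_pred_p != h_pred_p)  # disagreement rate on the source sample
    return disp_q - disp_p

# Hypothetical predictions of h and h' on target (m = 4) and source (n = 4) samples.
h_q, hp_q = np.array([0, 1, 2, 1]), np.array([0, 2, 1, 1])
h_p, hp_p = np.array([1, 0, 2, 2]), np.array([1, 0, 2, 1])

print(disparity_difference(h_q, hp_q, h_p, hp_p))  # 0.5 - 0.25 = 0.25
```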

SLIDE 11

MDD: Towards an Informative Margin Theory

Notations for multi-class classification:
- Scoring function: $f \in \mathcal{F}: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$
- Labeling function induced by f: $h_f: x \mapsto \arg\max_{y \in \mathcal{Y}} f(x, y)$. (7)
- Margin of a scoring function: $\rho_f(x, y) = \frac{1}{2} \left( f(x, y) - \max_{y' \neq y} f(x, y') \right)$
- Margin loss:
$$\Phi_\rho(x) = \begin{cases} 0 & \rho \le x \\ 1 - x/\rho & 0 \le x \le \rho \\ 1 & x \le 0 \end{cases}$$
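A minimal sketch (not from the original slides) of the margin and the ramp-shaped margin loss defined above, evaluated for a single example; the score vector is hypothetical.

```python
import numpy as np

def margin(scores, y):
    """rho_f(x, y) = 0.5 * (f(x, y) - max_{y' != y} f(x, y'))."""
    others = np.delete(scores, y)
    return 0.5 * (scores[y] - others.max())

def margin_loss(x, rho):
    """Phi_rho: 1 for x <= 0, linear ramp 1 - x/rho on [0, rho], 0 for x >= rho."""
    if x <= 0:
        return 1.0
    if x >= rho:
        return 0.0
    return 1.0 - x / rho

scores = np.array([2.0, 0.5, -1.0])  # hypothetical scores f(x, .) over k = 3 classes
m = margin(scores, y=0)              # 0.5 * (2.0 - 0.5) = 0.75
print(m, margin_loss(m, rho=1.0))    # 0.75, 0.25
```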

SLIDE 12

MDD: Margin Disparity Discrepancy

Margin error: $\mathrm{err}^{(\rho)}_D(f) = \mathbb{E}_{(x,y)\sim D} \left[ \Phi_\rho \circ \rho_f(x, y) \right]$

Margin disparity: $\mathrm{disp}^{(\rho)}_D(f', f) \triangleq \mathbb{E}_{x \sim D_\mathcal{X}} \left[ \Phi_\rho \circ \rho_{f'}(x, h_f(x)) \right]$

Definition (Margin Disparity Discrepancy). With the definition of margin disparity, we define the Margin Disparity Discrepancy (MDD) and its empirical version by
$$d^{(\rho)}_{f,\mathcal{F}}(P, Q) \triangleq \sup_{f' \in \mathcal{F}} \left( \mathrm{disp}^{(\rho)}_Q(f', f) - \mathrm{disp}^{(\rho)}_P(f', f) \right),$$
$$d^{(\rho)}_{f,\mathcal{F}}(\widehat{P}, \widehat{Q}) \triangleq \sup_{f' \in \mathcal{F}} \left( \mathrm{disp}^{(\rho)}_{\widehat{Q}}(f', f) - \mathrm{disp}^{(\rho)}_{\widehat{P}}(f', f) \right). \quad (8)$$
The margin disparity discrepancy is well-defined, since $d^{(\rho)}_{f,\mathcal{F}}(P, P) = 0$, and it satisfies nonnegativity and subadditivity.
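Putting the pieces together, the sketch below (not from the original slides) computes the empirical margin disparity for one auxiliary scorer f' and the inner term of the empirical MDD in Eq. (8); the full MDD would maximize this term over f'. Score matrices and pseudo-labels are hypothetical.

```python
import numpy as np

def margin_disparity(scores_fp, pseudo_labels, rho):
    """disp^(rho)_D(f', f): average margin loss of f', measured at the
    pseudo-labels h_f(x) predicted by the main classifier f."""
    idx = np.arange(len(pseudo_labels))
    top = scores_fp[idx, pseudo_labels]            # f'(x, h_f(x))
    rest = scores_fp.copy()
    rest[idx, pseudo_labels] = -np.inf
    rho_fp = 0.5 * (top - rest.max(axis=1))        # margins rho_{f'}(x, h_f(x))
    return np.mean(np.clip(1.0 - rho_fp / rho, 0.0, 1.0))  # Phi_rho applied, then averaged

def empirical_mdd_term(scores_fp_q, hf_q, scores_fp_p, hf_p, rho=1.0):
    """Inner term of the empirical MDD (Eq. 8) for one candidate f':
    disp_Qhat(f', f) - disp_Phat(f', f); the MDD maximizes this over f'."""
    return margin_disparity(scores_fp_q, hf_q, rho) - margin_disparity(scores_fp_p, hf_p, rho)

# Hypothetical f'-scores and f-pseudo-labels on target (m = 2) and source (n = 2) samples.
scores_q = np.array([[1.0, 0.2, -0.5], [0.1, 0.8, 0.3]])
labels_q = np.array([0, 2])
scores_p = np.array([[0.9, -0.2, 0.1], [0.4, 1.5, 0.0]])
labels_p = np.array([0, 1])
print(empirical_mdd_term(scores_q, labels_q, scores_p, labels_p, rho=1.0))  # 0.8 - 0.525 = 0.275
```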

SLIDE 13

MDD: Bounding the Target Expected Error

Theorem. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ be a hypothesis set with $\mathcal{Y} = \{1, \dots, k\}$, and let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be the corresponding $\mathcal{Y}$-valued classifier class. For every scoring function $f \in \mathcal{F}$,
$$\mathrm{err}_Q(h_f) \le \mathrm{err}^{(\rho)}_P(f) + d^{(\rho)}_{f,\mathcal{F}}(P, Q) + \lambda, \quad (9)$$
where $\lambda = \lambda(\rho, \mathcal{F}, P, Q)$ is the ideal combined margin loss:
$$\lambda = \min_{f^* \in \mathcal{F}} \{ \mathrm{err}^{(\rho)}_P(f^*) + \mathrm{err}^{(\rho)}_Q(f^*) \}. \quad (10)$$

This upper bound has a similar form to the previous bound:
- $\mathrm{err}^{(\rho)}_P(f)$ depicts the performance of f on the source domain.
- MDD bounds the performance gap caused by domain shift.
- $\lambda$ quantifies the inverse of "adaptability".
- A new perspective for analyzing DA with respect to the margin loss.

SLIDE 14

MDD: Notations for Generalization Bounds

For deriving generalization bounds for MDD, we first introduce two function classes.

Definition. Given a class of scoring functions $\mathcal{F}$, $\Pi_1\mathcal{F}$ is defined as
$$\Pi_1\mathcal{F} = \{ x \mapsto f(x, y) \mid y \in \mathcal{Y}, f \in \mathcal{F} \}. \quad (11)$$
We also introduce a new function class $\Pi_\mathcal{H}\mathcal{F}$ that serves as a "scoring" version of the symmetric difference hypothesis space $\mathcal{H}\Delta\mathcal{H}$:

Definition. Given a class of scoring functions $\mathcal{F}$ and the class $\mathcal{H}$ of induced classifiers, we define $\Pi_\mathcal{H}\mathcal{F}$ as
$$\Pi_\mathcal{H}\mathcal{F} \triangleq \{ x \mapsto f(x, h(x)) \mid h \in \mathcal{H}, f \in \mathcal{F} \}. \quad (12)$$

SLIDE 15

MDD: Notations for Generalization Bounds

Definition (Rademacher complexity). The empirical Rademacher complexity of $\mathcal{F}$ with respect to a sample $\widehat{D} = \{z_1, \dots, z_n\}$ is defined as
$$\widehat{\mathfrak{R}}_{\widehat{D}}(\mathcal{F}) = \mathbb{E}_\sigma \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \sigma_i f(z_i), \quad (13)$$
where the $\sigma_i$ are independent uniform random variables taking values in $\{-1, +1\}$. The Rademacher complexity is
$$\mathfrak{R}_{n,D}(\mathcal{F}) = \mathbb{E}_{\widehat{D} \sim D^n}\, \widehat{\mathfrak{R}}_{\widehat{D}}(\mathcal{F}). \quad (14)$$

Definition (Covering number, informal). A covering number $\mathcal{N}_2(\tau, \mathcal{G})$ is the minimal number of $L_2$ balls of radius $\tau > 0$ needed to cover a class $\mathcal{G}$ of bounded functions $g: \mathcal{X} \to \mathbb{R}$.
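To make Eq. (13) concrete, here is a small Monte Carlo sketch (not from the original slides) of the empirical Rademacher complexity for a finite function class: the expectation over σ is approximated by sampling, and the supremum is an exact maximum over the finite class. The function values are hypothetical.

```python
import numpy as np

def empirical_rademacher(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of Eq. (13) for a finite class.
    values[j, i] = f_j(z_i): the j-th function evaluated at the i-th sample point."""
    rng = np.random.default_rng(seed)
    n = values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)   # Rademacher variables
        total += np.max(values @ sigma) / n       # sup over the finite class
    return total / n_draws

# Hypothetical class of 3 functions evaluated on 5 sample points.
values = np.array([[ 0.5, -0.2,  0.1,  0.7, -0.3],
                   [-0.4,  0.6,  0.2, -0.1,  0.5],
                   [ 0.0,  0.3, -0.5,  0.4,  0.2]])
print(empirical_rademacher(values))
```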

SLIDE 16

MDD: Rademacher Generalization Bounds

With the Rademacher complexity, we proceed to show that MDD can be well estimated through finite samples.

Lemma. For any $\delta > 0$, with probability $1 - 2\delta$, the following holds simultaneously for any scoring function $f \in \mathcal{F}$:
$$\left| d^{(\rho)}_{f,\mathcal{F}}(\widehat{P}, \widehat{Q}) - d^{(\rho)}_{f,\mathcal{F}}(P, Q) \right| \le \frac{2k}{\rho} \mathfrak{R}_{n,P}(\Pi_\mathcal{H}\mathcal{F}) + \frac{2k}{\rho} \mathfrak{R}_{m,Q}(\Pi_\mathcal{H}\mathcal{F}) + \sqrt{\frac{\log \frac{2}{\delta}}{2n}} + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \quad (15)$$
This lemma justifies that the expected MDD with respect to f can be uniformly approximated by the empirical one computed on samples.

SLIDE 17

MDD: Margin Theory for Domain Adaptation

Combining the previous theorems, we obtain a Rademacher complexity based generalization bound for the expected target error.

Theorem (Generalization Bound). For any $\delta > 0$, with probability $1 - 3\delta$, we have the following uniform generalization bound for all scoring functions $f \in \mathcal{F}$:
$$\mathrm{err}_Q(h_f) \le \mathrm{err}^{(\rho)}_{\widehat{P}}(f) + d^{(\rho)}_{f,\mathcal{F}}(\widehat{P}, \widehat{Q}) + \lambda + \frac{2k^2}{\rho} \mathfrak{R}_{n,P}(\Pi_1\mathcal{F}) + \frac{2k}{\rho} \mathfrak{R}_{n,P}(\Pi_\mathcal{H}\mathcal{F}) + 2\sqrt{\frac{\log \frac{2}{\delta}}{2n}} + \frac{2k}{\rho} \mathfrak{R}_{m,Q}(\Pi_\mathcal{H}\mathcal{F}) + \sqrt{\frac{\log \frac{2}{\delta}}{2m}}, \quad (16)$$
where $\lambda = \lambda(\rho, \mathcal{F}, P, Q)$ is the ideal combined margin loss.

SLIDE 18

MDD: Rademacher Bound of Linear Classifier

We need to check how $\mathfrak{R}_{n,D}(\Pi_\mathcal{H}\mathcal{F})$ varies with the growth of n. First, we include a simple example of binary linear classifiers.

Theorem. Let $S \subseteq \mathcal{X} = \{x \in \mathbb{R}^s \mid \|x\|_2 \le r\}$ be a sample of size m, and suppose
$$\mathcal{F} = \left\{ f: \mathcal{X} \times \{\pm 1\} \to \mathbb{R} \mid f(x, y) = \mathrm{sgn}(y)\, w \cdot x,\ \|w\|_2 \le \Lambda \right\},$$
$$\mathcal{H} = \left\{ h \mid h(x) = \mathrm{sgn}(w \cdot x),\ \|w\|_2 \le \Lambda \right\}.$$
Then the empirical Rademacher complexity of $\Pi_\mathcal{H}\mathcal{F}$ can be bounded as follows:
$$\widehat{\mathfrak{R}}_S(\Pi_\mathcal{H}\mathcal{F}) \le 2\Lambda r \sqrt{\frac{d \log \frac{em}{d}}{m}},$$
where d is the VC dimension of $\mathcal{H}$.

SLIDE 19

MDD: Generalization Bound with Covering Numbers

For more general settings, we derive a bound based on covering numbers.

Theorem. Let $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X} \times \mathcal{Y}}$ be a hypothesis set with $\mathcal{Y} = \{1, \dots, k\}$, and let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be the corresponding $\mathcal{Y}$-valued classifier class. Suppose $\Pi_1\mathcal{F}$ is bounded in $L_2$ by L. Fix $\rho > 0$. For all $\delta > 0$, with probability $1 - 3\delta$ the following inequality holds for all hypotheses $f \in \mathcal{F}$:
$$\begin{aligned} \mathrm{err}_Q(h_f) \le\ & \mathrm{err}^{(\rho)}_{\widehat{P}}(f) + d^{(\rho)}_{f,\mathcal{F}}(\widehat{P}, \widehat{Q}) + \lambda + 2\sqrt{\frac{\log \frac{2}{\delta}}{2n}} + \sqrt{\frac{\log \frac{2}{\delta}}{2m}} \\ & + \frac{16 k^2 \sqrt{k}}{\rho} \inf_{\epsilon \ge 0} \left\{ \epsilon + 3 \left( \frac{1}{\sqrt{n}} + \frac{1}{\sqrt{m}} \right) \left( \int_{\epsilon}^{L} \sqrt{\log \mathcal{N}_2(\tau, \Pi_1\mathcal{F})}\, d\tau + L \int_{\epsilon/L}^{1} \sqrt{\log \mathcal{N}_2(\tau, \Pi_1\mathcal{H})}\, d\tau \right) \right\}. \quad (17) \end{aligned}$$

SLIDE 20

Outline

1. Transfer Learning
2. Previous Theory and Algorithm
3. MDD: Margin Disparity Discrepancy (Definition, Generalization Bounds)
4. MDD: Theoretically Justified Algorithm
5. Experiments

SLIDE 21

MDD: Theoretically Justified Algorithm

MDD is defined as a supremum over the hypothesis space $\mathcal{F}$, so minimizing MDD is a minimax game. Because the max-player is still too strong, we introduce a feature extractor $\psi$ to make the min-player stronger. The overall optimization problem can be written as
$$\min_{f, \psi}\ \mathrm{err}^{(\rho)}_{\psi(\widehat{P})}(f) + \left( \mathrm{disp}^{(\rho)}_{\psi(\widehat{Q})}(f^*, f) - \mathrm{disp}^{(\rho)}_{\psi(\widehat{P})}(f^*, f) \right),$$
$$f^* = \arg\max_{f'} \left( \mathrm{disp}^{(\rho)}_{\psi(\widehat{Q})}(f', f) - \mathrm{disp}^{(\rho)}_{\psi(\widehat{P})}(f', f) \right). \quad (18)$$
To enable representation-based domain adaptation, we need to learn a new representation $\psi$ such that MDD is minimized.

SLIDE 22

MDD: Theoretically Justified Algorithm

We design an adversarial learning algorithm to solve this problem. We introduce an auxiliary classifier $f'$ sharing the same hypothesis space with $f$:
$$\min_{f, \psi} \max_{f'}\ \mathrm{err}^{(\rho)}_{\psi(\widehat{P})}(f) + \left( \mathrm{disp}^{(\rho)}_{\psi(\widehat{Q})}(f', f) - \mathrm{disp}^{(\rho)}_{\psi(\widehat{P})}(f', f) \right). \quad (19)$$
The multiclass margin loss causes vanishing gradients. Denote by $\sigma$ the softmax function,
$$\sigma_j(z) = \frac{e^{z_j}}{\sum_{i=1}^{k} e^{z_i}}, \quad j = 1, \dots, k.$$
We choose a combined cross-entropy loss to approximate MDD:
$$\mathcal{E}(\widehat{P}) = -\mathbb{E}_{(x^s, y^s) \sim \widehat{P}} \log \left[ \sigma_{y^s}(f(\psi(x^s))) \right],$$
$$\mathcal{D}(\widehat{P}, \widehat{Q}) = \mathbb{E}_{x^t \sim \widehat{Q}} \log \left[ 1 - \sigma_{h_f(\psi(x^t))}(f'(\psi(x^t))) \right] + \mathbb{E}_{x^s \sim \widehat{P}} \log \left[ \sigma_{h_f(\psi(x^s))}(f'(\psi(x^s))) \right]. \quad (20)$$
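A minimal PyTorch sketch (one possible reading of Eq. (20), not code released with the talk) of the source classification term $\mathcal{E}(\widehat{P})$ and the cross-entropy surrogate $\mathcal{D}(\widehat{P}, \widehat{Q})$ for a single mini-batch; the modules feature_extractor, classifier, and adversarial_classifier are hypothetical stand-ins for $\psi$, $f$, and $f'$.

```python
import torch
import torch.nn.functional as F

def mdd_surrogate_losses(feature_extractor, classifier, adversarial_classifier,
                         x_source, y_source, x_target, eps=1e-6):
    """Per-batch estimates of E(P_hat) and D(P_hat, Q_hat) from Eq. (20)."""
    feat_s = feature_extractor(x_source)           # psi(x^s)
    feat_t = feature_extractor(x_target)           # psi(x^t)

    logits_s = classifier(feat_s)                  # f(psi(x^s))
    logits_t = classifier(feat_t)                  # f(psi(x^t))
    e_loss = F.cross_entropy(logits_s, y_source)   # E(P_hat): -log sigma_{y^s}(f(psi(x^s)))

    pseudo_s = logits_s.argmax(dim=1).detach()     # h_f(psi(x^s))
    pseudo_t = logits_t.argmax(dim=1).detach()     # h_f(psi(x^t))

    probs_adv_s = F.softmax(adversarial_classifier(feat_s), dim=1)
    probs_adv_t = F.softmax(adversarial_classifier(feat_t), dim=1)

    # D(P_hat, Q_hat): log(1 - sigma_{h_f}(f'(.))) on target, log(sigma_{h_f}(f'(.))) on source.
    d_target = torch.log(1.0 - probs_adv_t.gather(1, pseudo_t.unsqueeze(1)) + eps).mean()
    d_source = torch.log(probs_adv_s.gather(1, pseudo_s.unsqueeze(1)) + eps).mean()
    return e_loss, d_target + d_source
```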

SLIDE 23

MDD: Theoretically Justified Algorithm

We combine the two terms in $\mathcal{D}(\widehat{P}, \widehat{Q})$ with a coefficient $\gamma$:
$$\mathcal{E}(\widehat{P}) = -\mathbb{E}_{(x^s, y^s) \sim \widehat{P}} \log \left[ \sigma_{y^s}(f(\psi(x^s))) \right],$$
$$\mathcal{D}_\gamma(\widehat{P}, \widehat{Q}) = \mathbb{E}_{x^t \sim \widehat{Q}} \log \left[ 1 - \sigma_{h_f(\psi(x^t))}(f'(\psi(x^t))) \right] + \gamma\, \mathbb{E}_{x^s \sim \widehat{P}} \log \left[ \sigma_{h_f(\psi(x^s))}(f'(\psi(x^s))) \right]. \quad (21)$$
$\gamma$ is related to the margin of $f'$ when the algorithm reaches equilibrium.

Theorem (Informal). Assuming that there is no restriction on the choice of $f'$ and $\gamma > 1$, the global minimum of $\mathcal{D}_\gamma(P, Q)$ is attained at $P = Q$. The value of $\sigma_{h_f}(f'(\cdot))$ at equilibrium is $\gamma/(1 + \gamma)$, and the corresponding margin of $f'$ is $\rho = \log \gamma$. We refer to $\gamma = \exp \rho$ as the margin factor. For example, $\gamma = 4$ gives an equilibrium value of $4/5 = 0.8$ and a margin of $\rho = \log 4 \approx 1.39$.

SLIDE 24

MDD: Theoretically Justified Algorithm

[Figure: adversarial network architecture — a feature extractor feeds both the main classifier (trained on the source risk) and an auxiliary classifier (estimating the MDD term), with a gradient reversal layer (GRL) implementing the min/max game.]

The practical optimization problem in the adversarial learning is stated as
$$\min_{f, \psi}\ \mathcal{E}(\widehat{P}) + \eta\, \mathcal{D}_\gamma(\widehat{P}, \widehat{Q}), \qquad \max_{f'}\ \mathcal{D}_\gamma(\widehat{P}, \widehat{Q}). \quad (22)$$
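One common way to realize Eq. (22) with a single backward pass is the gradient reversal layer (GRL) shown in the architecture above: f' is trained to maximize $\mathcal{D}_\gamma$, while the reversed gradient drives $\psi$ to minimize it. The sketch below (not code released with the talk) shows a standard GRL and how the combined objective could be assembled; eta, the coefficient value, and all module names are hypothetical.

```python
import torch

class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -coeff in the backward pass."""
    @staticmethod
    def forward(ctx, x, coeff):
        ctx.coeff = coeff
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient w.r.t. x, no gradient w.r.t. coeff.
        return -ctx.coeff * grad_output, None

# Sketch of one training step realizing Eq. (22) with one backward pass:
#   feat = feature_extractor(x)                               # psi(x)
#   adv_logits = adversarial_classifier(GradientReversal.apply(feat, 1.0))
#   total = e_loss - eta * d_gamma_loss                       # E(P_hat) - eta * D_gamma(P_hat, Q_hat)
# Minimizing `total` over all parameters trains f' to maximize D_gamma, while the
# reversed gradient makes (f, psi) minimize E + eta * D_gamma, matching Eq. (22).
```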

SLIDE 25

Outline

1. Transfer Learning
2. Previous Theory and Algorithm
3. MDD: Margin Disparity Discrepancy (Definition, Generalization Bounds)
4. MDD: Theoretically Justified Algorithm
5. Experiments

SLIDE 26

Results

Table: Accuracy (%) on Office-31 for unsupervised domain adaptation

Method           A→W        D→W        W→D        A→D        D→A        W→A        Avg
ResNet-50        68.4±0.2   96.7±0.1   99.3±0.1   68.9±0.2   62.5±0.3   60.7±0.3   76.1
JAN              85.4±0.3   97.4±0.2   99.8±0.2   84.7±0.3   68.6±0.3   70.0±0.4   84.3
GTA              89.5±0.5   97.9±0.3   99.8±0.4   87.7±0.5   72.8±0.3   71.4±0.4   86.5
MCD              88.6±0.2   98.5±0.1   100.0±0.0  92.2±0.2   69.5±0.1   69.7±0.3   86.5
CDAN             94.1±0.1   98.6±0.1   100.0±0.0  92.9±0.2   71.0±0.3   69.3±0.3   87.7
MDD (Proposed)   94.5±0.3   98.4±0.1   100.0±0.0  93.5±0.2   74.6±0.3   72.2±0.1   88.9

Table: Accuracy (%) on Office-Home for unsupervised domain adaptation

Method           Ar→Cl  Ar→Pr  Ar→Rw  Cl→Ar  Cl→Pr  Cl→Rw  Pr→Ar  Pr→Cl  Pr→Rw  Rw→Ar  Rw→Cl  Rw→Pr  Avg
ResNet-50        34.9   50.0   58.0   37.4   41.9   46.2   38.5   31.2   60.4   53.9   41.2   59.9   46.1
JAN              45.9   61.2   68.9   50.4   59.7   61.0   45.8   43.4   70.3   63.9   52.4   76.8   58.3
CDAN             50.7   70.6   76.0   57.6   70.0   70.0   57.4   50.9   77.3   70.9   56.7   81.6   65.8
MDD (Proposed)   54.9   73.7   77.8   60.0   71.4   71.8   61.2   53.6   78.1   72.5   60.2   82.3   68.1

SLIDE 27

Analysis

[Figure: curves over training steps for γ = 1, 2, 4 — (a) test accuracy, (b) source margin value (equilibrium on source), (c) target margin value (equilibrium on target).]

Figure: Test accuracy and empirical values of $\sigma_{h_f} \circ f'$ on D → A, where dashed lines indicate $\gamma/(\gamma+1)$.

Margin γ           1      2      3      4      5      6
A → W              92.5   93.7   94.0   94.5   93.8   93.5
D → A              72.4   73.0   73.7   74.6   74.3   74.2
Avg on Office-31   87.6   88.1   88.5   88.9   88.7   88.6

Table: Accuracy (%) on Office-31 by different margins.

SLIDE 28

Analysis

[Figure: empirical discrepancy values over training steps for γ = 1, 2, 4 — (a) MDD w/o adversarial training, (b) DD, (c) log 2-MDD, (d) log 4-MDD.]

Figure: Empirical values of the MDD computed by the auxiliary classifier $f'$.

SLIDE 29

Summary

We extend previous theories to multiclass classification in domain adaptation, where classifiers based on scoring functions and the margin loss are standard choices in algorithm design. We introduce the Margin Disparity Discrepancy, a novel measurement with rigorous generalization bounds, tailored to distribution comparison with the asymmetric margin loss and to minimax optimization for easier training.

Thanks!

Poster: tonight at Pacific Ballroom # 184.

SLIDE 30

References

[1] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151-175, 2010.
[2] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015.
[3] Y. Mansour, M. Mohri, and A. Rostamizadeh. Domain adaptation: Learning bounds and algorithms. In Conference on Learning Theory (COLT), 2009.
[4] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3723-3732, 2018.
