

SLIDE 1

Learning Transferable Features with Deep Adaptation Networks

Mingsheng Long¹², Yue Cao¹, Jianmin Wang¹, and Michael I. Jordan²

¹School of Software, Institute for Data Science, Tsinghua University

²Department of EECS, Department of Statistics, University of California, Berkeley

International Conference on Machine Learning, 2015


SLIDE 2

Motivation: Domain Adaptation

Deep Learning for Domain Adaptation

  • None or very weak supervision in the target task (new domain)

  • The target classifier cannot be reliably trained due to over-fitting

  • Fine-tuning is impossible, as it requires substantial supervision

  • Generalize a related supervised source task to the target task

  • Deep networks can learn transferable features for adaptation

  • Hard to find a big source task for learning deep features from scratch

  • Transfer from deep networks pre-trained on an unrelated big dataset

  • Transferring features from distant tasks works better than random features

[Figure: pipeline — a deep neural network is pre-trained on unrelated big data, fine-tuned on the labeled source task, and adapted to the unlabeled/semi-labeled target task]


SLIDE 3

Motivation: Transferability

How Transferable Are Deep Features?

Transferability is restricted by (Yosinski et al. 2014; Glorot et al. 2011):

  • Specialization of higher-layer neurons to the original task (hurting the new task)

  • Disentangling of variations in higher layers enlarges the task discrepancy

  • Transferability of features decreases as the task discrepancy increases


SLIDE 4

Method: Model

Deep Adaptation Network (DAN)

Key Observations on AlexNet (Krizhevsky et al. 2012):

  • Convolutional layers learn general features that are safely transferable
    → safely freeze conv1–conv3 and fine-tune conv4–conv5

  • Fully connected layers fit task specificity and are NOT safely transferable
    → deeply adapt fc6–fc8 using statistically optimal two-sample matching

[Figure: DAN architecture — source and target inputs share conv1–conv3 (frozen) and conv4–conv5 (fine-tuned); fc6–fc8 are learned, with an MK-MMD penalty matching source and target representations at each of fc6, fc7, and fc8 before the outputs]
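To make the frozen / fine-tune / learn split concrete, here is a minimal sketch assuming a recent PyTorch and torchvision (not the authors' code); the layer indices follow torchvision's AlexNet layout, and the 31-class head matches the Office-31 labels used in the experiments:

```python
# Minimal sketch (assumption: PyTorch + torchvision, not the authors' code).
# Freeze conv1-conv3, fine-tune conv4-conv5, relearn the classifier head,
# mirroring the frozen / fine-tune / learn split on the slide.
import torch.nn as nn
from torchvision.models import alexnet

net = alexnet(weights="DEFAULT")   # pre-trained on ImageNet

# In torchvision's AlexNet, conv1-conv5 are features[0], [3], [6], [8], [10].
for idx in (0, 3, 6):              # conv1-conv3: frozen (general features)
    for p in net.features[idx].parameters():
        p.requires_grad = False
# conv4-conv5 (features[8], features[10]) keep requires_grad=True: fine-tuned.

net.classifier[6] = nn.Linear(4096, 31)   # fc8 relearned for 31 Office classes
```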


SLIDE 5

Method: Model

Objective Function

Main Problems

  • Feature transferability decreases with increasing task discrepancy

  • Higher layers are tailored to specific tasks and are NOT safely transferable

  • The adaptation effect may vanish in back-propagation through deep networks

Deep Adaptation with Optimal Matching

  • Deep adaptation: match distributions in multiple layers, including the output

  • Optimal matching: maximize two-sample test power with multiple kernels

$$\min_{\theta \in \Theta} \max_{k \in \mathcal{K}} \; \frac{1}{n_a} \sum_{i=1}^{n_a} J\!\left(\theta(\mathbf{x}_i^a), y_i^a\right) + \lambda \sum_{\ell=l_1}^{l_2} d_k^2\!\left(\mathcal{D}_s^{\ell}, \mathcal{D}_t^{\ell}\right), \tag{1}$$

where $\lambda > 0$ is a penalty parameter and $\mathcal{D}_{\ast}^{\ell} = \{\mathbf{h}_i^{\ast\ell}\}$ is the $\ell$-th layer hidden representation.
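As a sketch of how objective (1) combines the two terms during training, assuming PyTorch tensors and an MK-MMD² estimator like the one sketched on the following slides (the names `dan_loss` and `mk_mmd2` are illustrative, not from the paper):

```python
import torch.nn.functional as F

def dan_loss(logits_s, labels_s, feats_s, feats_t, lam, mk_mmd2):
    """Eq. (1): source cross-entropy + lambda * sum of MK-MMD^2 over adapted layers.

    feats_s / feats_t: lists of fc6-fc8 activations for the source / target batch.
    mk_mmd2: any MK-MMD^2 estimator, e.g. the linear-time one from slide 7.
    """
    ce = F.cross_entropy(logits_s, labels_s)                       # J term
    mmd = sum(mk_mmd2(hs, ht) for hs, ht in zip(feats_s, feats_t)) # sum over layers
    return ce + lam * mmd
```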


SLIDE 6

Method: Model

MK-MMD

Multiple Kernel Maximum Mean Discrepancy (MK-MMD) ≜ the RKHS distance between the kernel embeddings of distributions $p$ and $q$:

$$d_k^2(p, q) \triangleq \left\lVert \mathbb{E}_p[\phi(\mathbf{x}^s)] - \mathbb{E}_q[\phi(\mathbf{x}^t)] \right\rVert_{\mathcal{H}_k}^2, \tag{2}$$

where $k(\mathbf{x}^s, \mathbf{x}^t) = \langle \phi(\mathbf{x}^s), \phi(\mathbf{x}^t) \rangle$ is a convex combination of $m$ PSD kernels:

$$\mathcal{K} \triangleq \left\{ k = \sum_{u=1}^{m} \beta_u k_u : \sum_{u=1}^{m} \beta_u = 1, \; \beta_u \geqslant 0, \; \forall u \right\}. \tag{3}$$

Theorem (Two-Sample Test (Gretton et al. 2012))
$p = q$ if and only if $d_k^2(p, q) = 0$ (in practice, $d_k^2(p, q) < \varepsilon$).

$$\max_{k \in \mathcal{K}} d_k^2(p, q)\, \sigma_k^{-2} \;\Leftrightarrow\; \min \text{ Type II error} \quad \left(d_k^2(p, q) < \varepsilon \text{ when } p \neq q\right)$$
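A minimal NumPy sketch of the MK-MMD of (2)–(3), using Gaussian base kernels with illustrative bandwidths and uniform weights β (the paper selects β via the QP on slide 8); this is the biased O(n²) form, with the linear-time form on the next slide:

```python
# Sketch of MK-MMD^2 with Gaussian base kernels (bandwidths and weights
# are illustrative assumptions, not the paper's choices).
import numpy as np

def gaussian_gram(x, y, bw):
    """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 bw^2))."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bw ** 2))

def mk_mmd2(xs, xt, bandwidths=(0.5, 1.0, 2.0), beta=None):
    """Biased O(n^2) estimate of d_k^2(p, q) for k = sum_u beta_u k_u."""
    beta = np.full(len(bandwidths), 1 / len(bandwidths)) if beta is None else beta
    val = 0.0
    for b_u, bw in zip(beta, bandwidths):      # MMD^2 is linear in the kernel
        val += b_u * (gaussian_gram(xs, xs, bw).mean()
                      + gaussian_gram(xt, xt, bw).mean()
                      - 2 * gaussian_gram(xs, xt, bw).mean())
    return val
```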


SLIDE 7

Method: Algorithm

Learning CNN

Linear-Time Algorithm for MK-MMD (Streaming Algorithm)

$O(n^2)$: $d_k^2(p, q) = \mathbb{E}_{\mathbf{x}^s \mathbf{x}'^s}\, k(\mathbf{x}^s, \mathbf{x}'^s) + \mathbb{E}_{\mathbf{x}^t \mathbf{x}'^t}\, k(\mathbf{x}^t, \mathbf{x}'^t) - 2\, \mathbb{E}_{\mathbf{x}^s \mathbf{x}^t}\, k(\mathbf{x}^s, \mathbf{x}^t)$

$O(n)$: $d_k^2(p, q) = \frac{2}{n_s} \sum_{i=1}^{n_s/2} g_k(\mathbf{z}_i)$ → a linear-time unbiased estimate, where the quad-tuple is $\mathbf{z}_i \triangleq (\mathbf{x}_{2i-1}^s, \mathbf{x}_{2i}^s, \mathbf{x}_{2i-1}^t, \mathbf{x}_{2i}^t)$ and

$$g_k(\mathbf{z}_i) \triangleq k(\mathbf{x}_{2i-1}^s, \mathbf{x}_{2i}^s) + k(\mathbf{x}_{2i-1}^t, \mathbf{x}_{2i}^t) - k(\mathbf{x}_{2i-1}^s, \mathbf{x}_{2i}^t) - k(\mathbf{x}_{2i}^s, \mathbf{x}_{2i-1}^t)$$

Stochastic Gradient Descent (SGD)
For each layer $\ell$ and each quad-tuple $\mathbf{z}_i^{\ell} = (\mathbf{h}_{2i-1}^{s\ell}, \mathbf{h}_{2i}^{s\ell}, \mathbf{h}_{2i-1}^{t\ell}, \mathbf{h}_{2i}^{t\ell})$:

$$\nabla_{\Theta^{\ell}} = \frac{\partial J(\mathbf{z}_i)}{\partial \Theta^{\ell}} + \lambda \frac{\partial g_k(\mathbf{z}_i^{\ell})}{\partial \Theta^{\ell}} \tag{4}$$
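A NumPy sketch of the linear-time unbiased estimate, pairing samples into quad-tuples exactly as defined above (the kernel and its bandwidth are illustrative; any k(x, y) → float works):

```python
# Linear-time unbiased MMD^2 estimate via quad-tuples, as on the slide.
import numpy as np

def rbf(x, y, bw=1.0):
    """Gaussian base kernel (illustrative bandwidth)."""
    return float(np.exp(-((x - y) ** 2).sum() / (2 * bw ** 2)))

def mmd2_linear(xs, xt, k=rbf):
    """(2/n) * sum_i g_k(z_i) with z_i = (x^s_2i-1, x^s_2i, x^t_2i-1, x^t_2i)."""
    n = (min(len(xs), len(xt)) // 2) * 2     # even number of samples per domain
    g = [k(xs[2*i], xs[2*i+1]) + k(xt[2*i], xt[2*i+1])
         - k(xs[2*i], xt[2*i+1]) - k(xs[2*i+1], xt[2*i])
         for i in range(n // 2)]
    return 2.0 * sum(g) / n
```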


SLIDE 8

Method: Algorithm

Learning Kernel

Learning the optimal kernel $k = \sum_{u=1}^{m} \beta_u k_u$

Maximizing test power ≜ minimizing Type II error (Gretton et al. 2012):

$$\max_{k \in \mathcal{K}} d_k^2\!\left(\mathcal{D}_s^{\ell}, \mathcal{D}_t^{\ell}\right) \sigma_k^{-2}, \tag{5}$$

where $\sigma_k^2 = \mathbb{E}_{\mathbf{z}}\, g_k^2(\mathbf{z}) - \left[\mathbb{E}_{\mathbf{z}}\, g_k(\mathbf{z})\right]^2$ is the estimation variance.

This reduces to a quadratic program (QP) scaling linearly with sample size, $O(m^2 n + m^3)$:

$$\min_{\mathbf{d}^{\mathsf{T}} \boldsymbol{\beta} = 1, \; \boldsymbol{\beta} \geqslant 0} \boldsymbol{\beta}^{\mathsf{T}} (Q + \varepsilon I)\, \boldsymbol{\beta}, \tag{6}$$

where $\mathbf{d} = (d_1, d_2, \ldots, d_m)^{\mathsf{T}}$ and each $d_u$ is the MMD using base kernel $k_u$.
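A sketch of solving QP (6) for the kernel weights, treating SciPy's SLSQP as a generic QP solver (an assumption; any QP solver works, and Q and d would be computed from the $g_k$ statistics):

```python
# Solve min beta^T (Q + eps*I) beta  s.t.  d^T beta = 1, beta >= 0.
import numpy as np
from scipy.optimize import minimize

def learn_beta(Q, d, eps=1e-3):
    """Kernel weights beta for QP (6); Q, d come from the g_k statistics."""
    m = len(d)
    A = Q + eps * np.eye(m)
    res = minimize(
        lambda b: b @ A @ b,                  # quadratic objective
        x0=np.full(m, 1.0 / max(d.sum(), 1e-12)),
        jac=lambda b: 2 * A @ b,
        bounds=[(0, None)] * m,               # beta >= 0
        constraints=[{"type": "eq", "fun": lambda b: d @ b - 1}],  # d^T beta = 1
        method="SLSQP",
    )
    return res.x
```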


SLIDE 9

Method: Analysis

Analysis

Theorem (Adaptation Bound) (Ben-David et al. 2010)
Let $\theta \in \mathcal{H}$ be a hypothesis, and let $\epsilon_s(\theta)$ and $\epsilon_t(\theta)$ be the expected risks on the source and target, respectively. Then

$$\epsilon_t(\theta) \leqslant \epsilon_s(\theta) + d_{\mathcal{H}}(p, q) + C_0 \leqslant \epsilon_s(\theta) + 2\, d_k(p, q) + C, \tag{7}$$

where $C$ is a constant accounting for the complexity of the hypothesis space, the error of the empirical estimate of the $\mathcal{H}$-divergence, and the risk of an ideal hypothesis for both tasks.

Two-Sample Classifier: Nonparametric vs. Parametric

  • Nonparametric MMD directly approximates $d_{\mathcal{H}}(p, q)$

  • A parametric classifier approximates $d_{\mathcal{H}}(p, q)$ via adversarial training


SLIDE 10

Experiment: Setup

Experiment Setup

  • Datasets: pre-trained on ImageNet, fine-tuned on Office & Caltech

  • Tasks: 12 adaptation tasks → an unbiased look at dataset bias

  • Variants: DAN; single-layer: DAN7, DAN8; single-kernel: DANSK

  • Protocols: unsupervised adaptation vs. semi-supervised adaptation

  • Parameter selection: cross-validation by jointly assessing the test errors of the source classifier and the two-sample classifier (MK-MMD)

[Figure: pre-train / fine-tune pipeline over ImageNet and the Office & Caltech datasets (Fei-Fei et al. 2012; Jia et al. 2014; Saenko et al. 2010)]


SLIDE 11

Experiment: Results

Results and Discussion

Learning transferable features by deep adaptation and optimal matching:

  • Deep adaptation of multiple domain-specific layers (DAN) vs. shallow adaptation of one hard-to-tweak layer (DDC)

  • Two samples can be matched better by MK-MMD than by SK-MMD

Table: Accuracy on Office-31 dataset via standard protocol (Gong et al. 2013)

Method   A → W      D → W      W → D      A → D      D → A      W → A      Average
TCA      21.5±0.0   50.1±0.0   58.4±0.0   11.4±0.0    8.0±0.0   14.6±0.0   27.3
GFK      19.7±0.0   49.7±0.0   63.1±0.0   10.6±0.0    7.9±0.0   15.8±0.0   27.8
CNN      61.6±0.5   95.4±0.3   99.0±0.2   63.8±0.5   51.1±0.6   49.8±0.4   70.1
LapCNN   60.4±0.3   94.7±0.5   99.1±0.2   63.1±0.6   51.6±0.4   48.2±0.5   69.5
DDC      61.8±0.4   95.0±0.5   98.5±0.4   64.4±0.3   52.1±0.8   52.2±0.4   70.6
DAN7     63.2±0.2   94.8±0.4   98.9±0.3   65.2±0.4   52.3±0.4   52.1±0.4   71.1
DAN8     63.8±0.4   94.6±0.5   98.8±0.6   65.8±0.4   52.8±0.4   51.9±0.5   71.3
DANSK    63.3±0.3   95.6±0.2   99.0±0.4   65.9±0.7   53.2±0.5   52.1±0.4   71.5
DAN      68.5±0.4   96.0±0.3   99.0±0.2   67.0±0.4   54.0±0.4   53.1±0.3   72.9


SLIDE 12

Experiment: Results

Results and Discussion

Semi-supervised adaptation: source supervision vs. target supervision?

  • Limited target supervision is prone to over-fitting the target task

  • Source supervision provides a strong but possibly inaccurate inductive bias

  • With the source inductive bias, target supervision becomes much more powerful

  • Two-sample matching is more effective for bridging dissimilar tasks

Table: Accuracy on Office-31 dataset via down-sample protocol (Saenko et al.)

Paradigm         Method   A → W      D → W      W → D      Average
Unsupervised     DDC      59.4±0.8   92.5±0.3   91.7±0.8   81.2
Unsupervised     DAN      66.0±0.4   93.5±0.2   95.3±0.3   84.9
Semi-supervised  DDC      84.1±0.6   95.4±0.4   96.3±0.3   91.9
Semi-supervised  DAN      85.7±0.3   97.2±0.2   96.4±0.2   93.1


SLIDE 13

Experiment: Analysis

Visualization

How transferable are DAN features? t-SNE embeddings for visualization:

  • With DAN features, target points form clearer class boundaries

  • With DAN features, target points can be classified more accurately

  • Source and target categories are aligned better with DAN features

[Figure: t-SNE embeddings — (a) CNN on Source, (b) DDC on Target, (c) DAN on Target]
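A sketch of how such a t-SNE plot can be produced, assuming scikit-learn and matplotlib, with random stand-ins for the layer activations and class labels:

```python
# t-SNE visualization sketch (feats/labels are random stand-ins, not real data).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

feats = np.random.randn(500, 256).astype(np.float32)  # stand-in for fc8 activations
labels = np.random.randint(0, 10, 500)                # stand-in class ids

emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab10")
plt.title("t-SNE of target features")
plt.show()
```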


SLIDE 14

Experiment: Analysis

A-distance $\hat{d}_{\mathcal{A}}$

How is generalization performance related to two-sample discrepancy?

  • $\hat{d}_{\mathcal{A}}$ on CNN & DAN features is larger than $\hat{d}_{\mathcal{A}}$ on raw features: deep features are salient for both category and domain discrimination

  • $\hat{d}_{\mathcal{A}}$ on DAN features is much smaller than $\hat{d}_{\mathcal{A}}$ on CNN features: domain adaptation can be boosted by reducing domain discrepancy

[Figure: (d) cross-domain A-distance on tasks A→W and C→W for Raw, CNN, and DAN features; (e) average accuracy vs. MMD penalty λ for A→W and C→W]
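$\hat{d}_{\mathcal{A}}$ is commonly computed as a proxy A-distance, $\hat{d}_{\mathcal{A}} = 2(1 - 2\,\mathrm{err})$, where err is the test error of a classifier trained to separate source from target features; here is a sketch assuming scikit-learn's linear SVM as that domain classifier:

```python
# Proxy A-distance sketch: d_A = 2 * (1 - 2 * err), with err the test error
# of a source-vs-target domain classifier (assumption: sklearn LinearSVC).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def proxy_a_distance(feats_s, feats_t):
    X = np.vstack([feats_s, feats_t])
    y = np.r_[np.zeros(len(feats_s)), np.ones(len(feats_t))]   # domain labels
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
    err = 1.0 - LinearSVC(dual=False).fit(Xtr, ytr).score(Xte, yte)
    return 2.0 * (1.0 - 2.0 * err)
```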


SLIDE 15

Summary

Summary

A deep adaptation network (DAN) for learning transferable features, with two important improvements:

  • Deep adaptation of multiple task-specific layers (including the output)

  • Optimal adaptation using multiple-kernel two-sample matching

A brief analysis of the learning bound for the proposed deep network.

Open Problems

  • A principled way of deciding the boundary between generality and specificity

  • Deeper adaptation of convolutional layers to enhance transferability

  • Fine-grained adaptation using structural embeddings of distributions
