

SLIDE 1

CS 330

Multi-Task Learning & Transfer Learning Basics


SLIDE 2

Logistics

Homework 1 posted Monday 9/21, due Wednesday 9/30 at midnight. TensorFlow review session tomorrow at 6:00 pm PT. Project guidelines posted early next week.


SLIDE 3

Plan for Today

Multi-Task Learning

  • Problem statement
  • Models, objectives, optimization
  • Challenges
  • Case study of real-world multi-task learning

Transfer Learning

  • Pre-training & fine-tuning


Goals by the end of lecture:

  • Know the key design decisions when building multi-task learning systems
  • Understand the difference between multi-task learning and transfer learning
  • Understand the basics of transfer learning

short break here

SLIDE 4

Multi-Task Learning


SLIDE 5

Some notation

Single-task [supervised] learning: given a dataset 𝒠 = {(x, y)k}, solve

min_θ ℒ(θ, 𝒠)

Typical loss: negative log likelihood, ℒ(θ, 𝒠) = −𝔼(x,y)∼𝒠[log fθ(y|x)]

What is a task? (more formally this time)

A task: 𝒰i ≜ {pi(x), pi(y|x), ℒi}, where pi(x), pi(y|x) are the data generating distributions.

Corresponding datasets: 𝒠i^tr, 𝒠i^test (will use 𝒠i as shorthand for 𝒠i^tr)

[Figure: example tasks — classifying images as tiger, lynx, or cat; predicting the length of a paper.]

SLIDE 6

Examples of Tasks

Recall, a task: 𝒰i ≜ {pi(x), pi(y|x), ℒi}, with data generating distributions pi(x), pi(y|x) and corresponding datasets 𝒠i^tr, 𝒠i^test (will use 𝒠i as shorthand for 𝒠i^tr).

Multi-task classification: ℒi same across all tasks.

  • e.g. per-language handwriting recognition
  • e.g. personalized spam filter

Multi-label learning: ℒi and pi(x) same across all tasks.

  • e.g. CelebA attribute recognition
  • e.g. scene understanding

When might ℒi vary across tasks?

  • mixed discrete, continuous labels across tasks
  • multiple metrics that you care about

SLIDE 7

The model fθ(y|x) becomes fθ(y|x, zi), where zi is a task descriptor:

  • e.g. one-hot encoding of the task index
  • or, whatever meta-data you have
      - personalization: user features/attributes
      - language description of the task
      - formal specifications of the task

[Figure: one model conditioned on zi producing different outputs for the same paper — length of paper, paper review, summary of paper.]

Vanilla MTL Objective: min_θ ∑_{i=1}^{T} ℒi(θ, 𝒠i)

Decisions on the model, the objective, and the optimization:

  • How should we condition on zi?
  • What objective should we use?
  • How to optimize our objective?

SLIDE 8

Model: How should the model be conditioned on zi? What parameters of the model should be shared?

Objective: How should the objective be formed?

Optimization: How should the objective be optimized?

SLIDE 9

Conditioning on the task

Question: How should you condition on the task in order to share as little as possible? Let’s assume zi is the one-hot task index.

(raise your hand)

SLIDE 10

Conditioning on the task

[Diagram: T separate networks with no shared parameters, x → y1, x → y2, …, x → yT.]

zi selects among them via multiplicative gating:

y = ∑_j 1(zi = j) yj

This amounts to independent training within a single network, with no shared parameters!
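To make this concrete, here is a minimal PyTorch sketch of one-hot multiplicative gating over per-task subnetworks (the module name and sizes are illustrative, not from the lecture):

```python
import torch
import torch.nn as nn

class GatedIndependentNets(nn.Module):
    """y = sum_j 1(z = j) * y_j: a one-hot task descriptor z gates
    T per-task subnetworks packed into a single module."""
    def __init__(self, num_tasks, in_dim, out_dim, hidden=64):
        super().__init__()
        self.nets = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, out_dim))
            for _ in range(num_tasks))

    def forward(self, x, z):        # x: (B, in_dim), z: one-hot, shape (num_tasks,)
        ys = torch.stack([net(x) for net in self.nets])  # (T, B, out_dim)
        return (z.view(-1, 1, 1) * ys).sum(dim=0)        # zero out all but one net
```

Since z zeroes out every subnetwork but the selected one, the tasks train completely independently even though they live in one module.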

SLIDE 11

The other extreme

Concatenate zi with the input and/or activations.

[Diagram: a single network x → y with zi concatenated at the input and hidden layers.]

Here all parameters are shared (except the parameters directly following zi, if zi is one-hot).

SLIDE 12

An Alternative View on the Multi-Task Architecture

Split θ into shared parameters θsh and task-specific parameters θi. Then, our objective is:

min_{θsh, θ1, …, θT} ∑_{i=1}^{T} ℒi({θsh, θi}, 𝒠i)

Choosing how to condition on zi ⟺ choosing how & where to share parameters.
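As a minimal sketch of this split (names and sizes are illustrative): a shared trunk holds θsh, and one head per task holds θi.

```python
import torch.nn as nn

class SharedBottomMTL(nn.Module):
    """Shared trunk (theta_sh) feeding one output head per task (theta_i)."""
    def __init__(self, num_tasks, in_dim, out_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())  # theta_sh
        self.heads = nn.ModuleList(nn.Linear(hidden, out_dim)
                                   for _ in range(num_tasks))             # theta_i

    def forward(self, x, task_idx):
        return self.heads[task_idx](self.trunk(x))
```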

SLIDE 13

Conditioning: Some Common Choices

Diagram sources: distill.pub/2018/feature-wise-transformations/

1. Concatenation-based conditioning: concatenate zi with the input to a layer.

2. Additive conditioning: add a function of zi to the layer’s activations.

These are actually equivalent! Question: why are they the same thing? (Consider the application of the following fully-connected layer.)

(raise your hand)
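One way to see the equivalence, sketched numerically below with arbitrary dimensions: a fully-connected layer applied to the concatenation [x; zi] splits its weight matrix into an input block and a task block, and the task block contributes exactly an additive, task-dependent shift.

```python
import torch

x, z = torch.randn(4, 10), torch.randn(4, 3)   # inputs and task descriptors
W, b = torch.randn(8, 13), torch.randn(8)      # layer applied to [x; z]
Wx, Wz = W[:, :10], W[:, 10:]                  # split into input / task blocks

concat_out = torch.cat([x, z], dim=1) @ W.T + b   # concatenation-based
additive_out = x @ Wx.T + (z @ Wz.T + b)          # additive conditioning
print(torch.allclose(concat_out, additive_out, atol=1e-5))  # True
```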

SLIDE 14

Conditioning: Some Common Choices

3. Multi-head architecture: a shared trunk with separate output heads per task (Ruder ’17).

4. Multiplicative conditioning: multiply activations element-wise by a function of zi.

Why might multiplicative conditioning be a good idea?

  • more expressive per layer
  • recall: multiplicative gating

Multiplicative conditioning generalizes independent networks and independent heads.

Diagram sources: distill.pub/2018/feature-wise-transformations/
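A minimal sketch of multiplicative conditioning, where a function of zi rescales the hidden units (FiLM-style scaling; the layer names and sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class MultiplicativeConditioning(nn.Module):
    """Hidden activations are scaled element-wise by a task-dependent gate."""
    def __init__(self, in_dim, hidden, z_dim):
        super().__init__()
        self.fc = nn.Linear(in_dim, hidden)
        self.gate = nn.Linear(z_dim, hidden)   # maps z_i to per-unit scales

    def forward(self, x, z):
        return torch.relu(self.fc(x)) * self.gate(z)   # element-wise modulation
```

If zi is one-hot and the gate emits binary masks, disjoint subnetworks emerge, which is one way to see how this generalizes independent networks and independent heads.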

SLIDE 15

Conditioning: More Complex Choices

  • Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert ’16
  • Deep Relation Networks. Long, Wang ’15
  • Multi-Task Attention Network. Liu, Johns, Davison ’18
  • Sluice Networks. Ruder, Bingel, Augenstein, Sogaard ’17


SLIDE 16

Conditioning Choices

Unfortunately, these design decisions are like neural network architecture tuning:

  • problem dependent
  • largely guided by intuition or knowledge of the problem
  • currently more of an art than a science

SLIDE 17

Model: How should the model be conditioned on zi? What parameters of the model should be shared?

Objective: How should the objective be formed?

Optimization: How should the objective be optimized?

SLIDE 18

Vanilla MTL Objective: min_θ ∑_{i=1}^{T} ℒi(θ, 𝒠i)

Often want to weight tasks differently: min_θ ∑_{i=1}^{T} wi ℒi(θ, 𝒠i)

How to choose wi?

  • manually, based on importance or priority
  • dynamically adjust throughout training:
      a. various heuristics, e.g. encourage gradients to have similar magnitudes (Chen et al. GradNorm. ICML 2018)
      b. use task uncertainty (e.g. see Kendall et al. CVPR 2018)
      c. aim for monotonic improvement towards a Pareto optimal solution (e.g. see Sener et al. NeurIPS 2018)
      d. optimize for the worst-case task loss: min_θ max_i ℒi(θ, 𝒠i) (e.g. for task robustness, or for fairness)

Aside, on Pareto optimality: θa dominates θb if ℒi(θa) ≤ ℒi(θb) ∀i and ∑_i ℒi(θa) ≠ ∑_i ℒi(θb). θ⋆ is Pareto optimal if there exists no θ that dominates θ⋆. (At θ⋆, improving one task will always require worsening another.)
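A minimal sketch of the weighted objective and of option (d), assuming the per-task losses are already computed as scalar tensors (function names are illustrative):

```python
import torch

def weighted_mtl_loss(task_losses, weights):
    # sum_i w_i * L_i, with w_i set manually or adjusted during training
    return sum(w * l for w, l in zip(weights, task_losses))

def worst_case_mtl_loss(task_losses):
    # min-max objective (d): descend on the single worst task loss
    return torch.stack(task_losses).max()
```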

SLIDE 19

Model: How should the model be conditioned on zi? What parameters of the model should be shared?

Objective: How should the objective be formed?

Optimization: How should the objective be optimized?

SLIDE 20

Optimizing the objective

Vanilla MTL Objective: min_θ ∑_{i=1}^{T} ℒi(θ, 𝒠i)

Basic Version:

  1. Sample mini-batch of tasks ℬ ∼ {𝒰i}
  2. Sample mini-batch of datapoints for each task 𝒠i^b ∼ 𝒠i
  3. Compute loss on the mini-batch: ℒ̂(θ, ℬ) = ∑_{𝒰k ∈ ℬ} ℒk(θ, 𝒠k^b)
  4. Backpropagate loss to compute gradient ∇θ ℒ̂
  5. Apply gradient with your favorite neural net optimizer (e.g. Adam)

Note: This ensures that tasks are sampled uniformly, regardless of data quantities.

Tip: For regression problems, make sure your task labels are on the same scale!
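Putting the five steps together, a minimal PyTorch training loop; `model.loss` and `dataset.sample` are assumed interfaces, not from the slides:

```python
import random
import torch

def train_mtl(model, task_datasets, steps, task_batch=4, data_batch=32):
    opt = torch.optim.Adam(model.parameters())
    for _ in range(steps):
        # 1. Sample a mini-batch of tasks, uniformly regardless of data quantity.
        tasks = random.sample(range(len(task_datasets)), task_batch)
        loss = 0.0
        for i in tasks:
            # 2. Sample a mini-batch of datapoints for task i (assumed interface).
            xs, ys = task_datasets[i].sample(data_batch)
            # 3. Accumulate the mini-batch loss across sampled tasks.
            loss = loss + model.loss(xs, ys, task_idx=i)
        # 4./5. Backpropagate and apply the gradient with Adam.
        opt.zero_grad()
        loss.backward()
        opt.step()
```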

SLIDE 21

Challenges


SLIDE 22

Challenge #1: Negative transfer

Negative transfer: sometimes independent networks work the best.

[Table: Multi-Task CIFAR-100 results comparing recent approaches — multi-head architectures, cross-stitch architecture, and independent training. (Yu et al. Gradient Surgery for Multi-Task Learning. 2020)]

Why?

  • optimization challenges
      - caused by cross-task interference
      - tasks may learn at different rates
  • limited representational capacity
      - multi-task networks often need to be much larger than their single-task counterparts

SLIDE 23

If you have negative transfer, share less across tasks. But it’s not just a binary decision!

“Soft parameter sharing”: keep per-task parameters, but softly constrain them to stay close:

min_{θsh, θ1, …, θT} ∑_{i=1}^{T} ℒi({θsh, θi}, 𝒠i) + ∑_{t′=1}^{T} ∥θt − θt′∥

[Diagram: separate networks x → y1, …, x → yT with corresponding weights constrained across networks.]

  + allows for more fluid degrees of parameter sharing
  − yet another set of design decisions / hyperparameters
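A sketch of the soft-sharing penalty for a list of per-task networks with identical architectures (the coefficient and function name are illustrative assumptions):

```python
def soft_sharing_penalty(task_models, coef=1e-2):
    """Sum of pairwise distances between corresponding parameters
    of per-task networks (architectures must match)."""
    params = [list(m.parameters()) for m in task_models]
    penalty = 0.0
    for layer in zip(*params):          # corresponding tensors across tasks
        for i in range(len(layer)):
            for j in range(i + 1, len(layer)):
                penalty = penalty + (layer[i] - layer[j]).norm()
    return coef * penalty
```

Adding this term to the summed task losses interpolates between fully independent and fully shared training.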

SLIDE 24

Challenge #2: Overfitting

You may not be sharing enough! Multi-task learning ↔ a form of regularization.

Solution: Share more.


SLIDE 25

Plan for Today

Multi-Task Learning

  • Problem statement
  • Models, objectives, optimization
  • Challenges
  • Case study of real-world multi-task learning

Transfer Learning

  • Pre-training & fine-tuning


short break here

SLIDE 26

Case study

Goal: Make recommendations for YouTube


SLIDE 27

Case study

Goal: Make recommendations for YouTube

Conflicting objectives:

  • videos that users will rate highly
  • videos that users will share
  • videos that users will watch

Also, an implicit bias caused by feedback: a user may have watched a video because it was recommended!

SLIDE 28

Framework Set-Up

  1. Generate a few hundred candidate videos
  2. Rank the candidates
  3. Serve the top-ranking videos to the user

Input: what the user is currently watching (query video) + user features

Candidate videos: pool videos from multiple candidate generation algorithms

  • matching topics of the query video
  • videos most frequently watched with the query video
  • and others

Ranking: the central topic of this paper

SLIDE 29

The Ranking Problem

Input: query video, candidate video, user & context features

Model output: engagement and satisfaction with the candidate video

Engagement:

  • binary classification tasks, like clicks
  • regression tasks, for behaviors related to time spent

Satisfaction:

  • binary classification tasks, like clicking “like”
  • regression tasks, such as rating

Weighted combination of engagement & satisfaction predictions → ranking score (score weights manually tuned).

Question: Are these objectives reasonable? What are some of the issues that might come up?

(answer in chat)

SLIDE 30

The Architecture

Basic option: “Shared-Bottom Model” (i.e. multi-head architecture)

  • can harm learning when correlation between tasks is low

SLIDE 31

The Architecture

Instead: use a form of soft parameter sharing, “Multi-gate Mixture-of-Experts (MMoE)”: a pool of expert neural networks, allowing different parts of the network to “specialize”.

  1. Decide which expert to use for input x, task k: gk(x) = softmax(Wgk x)
  2. Compute features from the selected experts: fk(x) = ∑_j gk(x)j fj(x)
  3. Compute the output: yk = hk(fk(x))
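A minimal PyTorch sketch of the MMoE computation above (expert/gate/head sizes are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Per-task softmax gates mix a shared pool of expert networks."""
    def __init__(self, num_tasks, num_experts, in_dim, hidden, out_dim):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            for _ in range(num_experts))
        self.gates = nn.ModuleList(nn.Linear(in_dim, num_experts)
                                   for _ in range(num_tasks))
        self.heads = nn.ModuleList(nn.Linear(hidden, out_dim)
                                   for _ in range(num_tasks))

    def forward(self, x, task_idx):
        g = torch.softmax(self.gates[task_idx](x), dim=-1)     # 1. which experts
        f = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, hidden)
        mixed = (g.unsqueeze(-1) * f).sum(dim=1)               # 2. mix features
        return self.heads[task_idx](mixed)                     # 3. task output
```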

SLIDE 32
Experiments

Set-Up:

  • Implementation in TensorFlow, on TPUs
  • Train in temporal order, running training continuously to consume newly arriving data
  • Offline AUC & squared error metrics
  • Online A/B testing in comparison to the production system
      - live metrics based on time spent, survey responses, rate of dismissals
  • Model computational efficiency matters

Results:

  • Found a 20% chance of gating polarization during distributed training → use drop-out on the experts

SLIDE 33

Plan for Today

Multi-Task Learning

  • Problem statement
  • Models & training
  • Challenges
  • Case study of real-world multi-task learning

Transfer Learning

  • Pre-training & fine-tuning


SLIDE 34

Multi-Task Learning vs. Transfer Learning

Multi-Task Learning: Solve multiple tasks 𝒰1, ⋯, 𝒰T at once:

min_θ ∑_{i=1}^{T} ℒi(θ, 𝒠i)

Transfer Learning: Solve target task 𝒰b after solving source task 𝒰a, by transferring knowledge learned from 𝒰a.

Side note: 𝒰a may itself include multiple tasks.

Key assumption: Cannot access data 𝒠a during transfer.

Transfer learning is a valid solution to multi-task learning (but not vice versa).

Question: In what settings might transfer learning make sense?

(answer in chat or raise hand)

SLIDE 35

Transfer learning via fine-tuning

Start with pre-trained parameters θ and run gradient descent on the training data 𝒠tr for the new task (typically for many gradient steps):

φ ← θ − α ∇θ ℒ(θ, 𝒠tr)

Where do you get the pre-trained parameters?

  • ImageNet classification
  • Models trained on large language corpora (BERT, LMs)
  • Other unsupervised learning techniques
  • Whatever large, diverse dataset you might have

Pre-trained models often available online.

Some common practices (a code sketch follows below):

  • Fine-tune with a smaller learning rate
  • Smaller learning rate for earlier layers
  • Freeze earlier layers, gradually unfreeze
  • Reinitialize the last layer
  • Search over hyperparameters via cross-validation
  • Architecture choices matter (e.g. ResNets)

What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. ’16
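A minimal fine-tuning sketch reflecting several of the practices above (the backbone choice, layer grouping, and learning rates are illustrative assumptions):

```python
import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # pre-trained theta
model.fc = torch.nn.Linear(model.fc.in_features, 10)          # reinitialize last layer

backbone = [p for n, p in model.named_parameters() if not n.startswith("fc")]
opt = torch.optim.Adam([
    {"params": backbone, "lr": 1e-5},               # smaller LR for earlier layers
    {"params": model.fc.parameters(), "lr": 1e-3},  # larger LR for the new head
])
# Gradient steps on the new task's D^tr go here, typically many of them.
```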

SLIDE 36

Fine-tuning doesn’t work well with small target task datasets.

Upcoming lectures: few-shot learning via meta-learning

(Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. ’18)

SLIDE 37

Plan for Today

Multi-Task Learning

  • Problem statement
  • Models, objectives, optimization
  • Challenges
  • Case study of real-world multi-task learning

Transfer Learning

  • Pre-training & fine-tuning


Goals by the end of lecture:

  • Know the key design decisions when building multi-task learning systems
  • Understand the difference between multi-task learning and transfer learning
  • Understand the basics of transfer learning

short break here

SLIDE 38

Reminders

Next time: Meta-learning problem statement, Black-box meta-learning, GPT-3


Homework 1 posted Monday 9/21, due Wednesday 9/30 at midnight. TensorFlow review session tomorrow at 6:00 pm PT. Project guidelines posted early next week.