
Multi-Task & Meta-Learning Basics (CS 330)



  1. Multi-Task & Meta-Learning Basics CS 330

  2. Logistics: Homework 1 posted today, due Wednesday, October 9. Fill out paper preferences by tomorrow. TensorFlow review session tomorrow, 4:30 pm in Gates B03.

  3. Plan for Today
 Multi-Task Learning
 - Models & training
 - Challenges
 - Case study of real-world multi-task learning
 (short break)
 Meta-Learning
 - Problem formulation
 - General recipe of meta-learning algorithms
 - Black-box adaptation approaches
 (the last two are the topic of Homework 1!)

  4. Multi-Task Learning Basics

  5. Some notation
 Single-task learning [supervised]: data 𝒟 = {(x, y)_k}, objective min_θ ℒ(θ, 𝒟).
 Typical loss: negative log likelihood, ℒ(θ, 𝒟) = −𝔼_{(x,y)∼𝒟}[log f_θ(y | x)].
 [Figure: a model f_θ(y | x) maps an input x (e.g. an image, or a paper) to an output y (e.g. a label such as tiger / lynx / cat, or the length of the paper).]
 What is a task? (more formally this time) A task: 𝒯_i ≜ {p_i(x), p_i(y | x), ℒ_i}, i.e. the data-generating distributions plus a loss.
 Corresponding datasets: 𝒟_i^tr, 𝒟_i^test. We will use 𝒟_i as shorthand for 𝒟_i^tr.

  6. Examples of Tasks
 Recall a task: 𝒯_i ≜ {p_i(x), p_i(y | x), ℒ_i}, with corresponding datasets 𝒟_i^tr, 𝒟_i^test (𝒟_i as shorthand for 𝒟_i^tr).
 Multi-task classification: ℒ_i is the same across all tasks. e.g. per-language handwriting recognition; e.g. a personalized spam filter.
 Multi-label learning: ℒ_i and p_i(x) are the same across all tasks. e.g. CelebA attribute recognition; e.g. scene understanding.
 When might ℒ_i vary across tasks?
 - mixed discrete, continuous labels across tasks
 - if you care more about one task than another

  7. [Figure: one model, many outputs: x = a paper, y = length of paper / summary of paper / paper review.]
 Condition the model on a task descriptor z_i: f_θ(y | x) becomes f_θ(y | x, z_i), where z_i is e.g. a one-hot encoding of the task index, or whatever meta-data you have:
 - personalization: user features/attributes
 - language description of the task
 - formal specifications of the task
 Objective: min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i)
 A model decision and an algorithm decision: How should we condition on z_i? How do we optimize our objective?

  8. Conditioning on the task
 Let's assume z_i is the task index. Question: How should you condition on the task in order to share as little as possible?

  9. Conditioning on the task
 Multiplicative gating: y = Σ_j 1(z_i = j) y_j, where y_j is the output of a separate subnetwork x → y_j for each task j.
 -> independent training within a single network, with no shared parameters!
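To make the gating concrete, here is a minimal PyTorch sketch (the course itself uses TensorFlow; the class and variable names here are ours, not the lecture's). The one-hot gate selects the output of a per-task tower, so each task's data only ever trains that task's parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedPerTaskNet(nn.Module):
    """One subnetwork ("tower") per task; nothing is shared."""
    def __init__(self, num_tasks, in_dim, out_dim, hidden=64):
        super().__init__()
        self.num_tasks = num_tasks
        self.towers = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
            for _ in range(num_tasks)
        )

    def forward(self, x, z):
        # z: integer task indices, shape (batch,)
        gate = F.one_hot(z, self.num_tasks).float()                   # 1(z_i = j)
        ys = torch.stack([tower(x) for tower in self.towers], dim=1)  # (batch, T, out)
        return (gate.unsqueeze(-1) * ys).sum(dim=1)                   # y = sum_j 1(z_i = j) y_j
```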

  10. The other extreme
 Concatenate z_i with the input and/or activations: all parameters are shared, except the parameters directly following z_i.

  11. An Alternative View on the Multi-Task Objective
 Split θ into shared parameters θ^sh and task-specific parameters θ^i.
 Then, our objective is: min_{θ^sh, θ^1, …, θ^T} Σ_{i=1}^T ℒ_i({θ^sh, θ^i}, 𝒟_i)
 Choosing how to condition on z_i is equivalent to choosing how & where to share parameters.

  12. Conditioning: Some Common Choices
 1. Concatenation-based conditioning: concatenate z_i with the activations, then apply a layer.
 2. Additive conditioning: add a learned function of z_i to the activations.
 These are actually the same: a linear layer applied to the concatenation [x; z_i] decomposes into a linear term in x plus a z_i-dependent additive term, as the sketch below shows.
 Diagram sources: distill.pub/2018/feature-wise-transformations/
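A tiny numerical check of that equivalence (dimensions are arbitrary, chosen only for illustration): splitting the weight matrix by input block turns the concatenation into an addition.

```python
import torch

# x: activations, z: task descriptor; W: weights of a layer applied to [x; z].
x = torch.randn(8, 10)
z = torch.randn(8, 4)
W = torch.randn(14, 6)
Wx, Wz = W[:10], W[10:]  # split the rows of W by input block

concat_out = torch.cat([x, z], dim=1) @ W  # concatenation-based conditioning
additive_out = x @ Wx + z @ Wz             # additive conditioning: a z-dependent bias
assert torch.allclose(concat_out, additive_out, atol=1e-5)
```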

  13. Conditioning: Some Common Choices
 3. Multi-head architecture: shared trunk, separate output head per task (Ruder '17).
 4. Multiplicative conditioning: scale activations by a learned function of z_i.
 Why might multiplicative conditioning be a good idea?
 - more expressive
 - recall: multiplicative gating
 Multiplicative conditioning generalizes independent networks and independent heads.
 Diagram sources: distill.pub/2018/feature-wise-transformations/
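Below is a minimal FiLM-style sketch in the spirit of the feature-wise-transformations article linked above (the architecture and names are ours, not the lecture's). If the learned gate saturates toward 0/1 patterns per task, it can shut off entire sub-networks, which is exactly why multiplicative conditioning subsumes independent networks and heads:

```python
import torch
import torch.nn as nn

class MultiplicativeConditioning(nn.Module):
    """Scale hidden features elementwise by a learned function of the task descriptor."""
    def __init__(self, in_dim, task_dim, hidden=64, out_dim=1):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.gate = nn.Linear(task_dim, hidden)  # per-feature scales from z
        self.head = nn.Linear(hidden, out_dim)

    def forward(self, x, z):
        h = self.trunk(x)
        h = torch.sigmoid(self.gate(z)) * h  # multiplicative conditioning on z
        return self.head(h)
```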

  14. Conditioning: More Complex Choices
 Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert '16
 Multi-Task Attention Network. Liu, Johns, Davison '18
 Deep Relation Networks. Long, Wang '15
 Sluice Networks. Ruder, Bingel, Augenstein, Søgaard '17

  15. Conditioning Choices Unfortunately, these design decisions are like neural network architecture tuning: - problem dependent - largely guided by intuition or knowledge of the problem - currently more of an art than a science

  16. Optimizing the objective
 Objective: min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i)
 Basic Version:
 1. Sample a mini-batch of tasks ℬ ∼ {𝒯_i}
 2. Sample a mini-batch of datapoints for each task, 𝒟_i^b ∼ 𝒟_i
 3. Compute the loss on the mini-batch: ℒ̂(θ, ℬ) = Σ_{𝒯_k ∈ ℬ} ℒ_k(θ, 𝒟_k^b)
 4. Backpropagate the loss to compute the gradient ∇_θ ℒ̂
 5. Apply the gradient with your favorite neural net optimizer (e.g. Adam)
 (A minimal sketch of this loop follows.)
 Note: This ensures that tasks are sampled uniformly, regardless of data quantities.
 Tip: For regression problems, make sure your task labels are on the same scale!
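Here is a minimal Python sketch of the "Basic Version" (the course uses TensorFlow; the `tasks` interface with `sample_batch`, `loss_fn`, and `z` attributes is an illustrative assumption, not from the lecture):

```python
import random
import torch

def train_step(model, optimizer, tasks, tasks_per_batch=4):
    """One multi-task optimization step over a uniformly sampled task mini-batch."""
    batch = random.sample(tasks, tasks_per_batch)        # 1. sample tasks uniformly
    loss = 0.0
    for task in batch:
        x, y = task.sample_batch()                       # 2. sample datapoints per task
        loss = loss + task.loss_fn(model(x, task.z), y)  # 3. sum the per-task losses
    optimizer.zero_grad()
    loss.backward()                                      # 4. backprop the mini-batch loss
    optimizer.step()                                     # 5. e.g. an Adam update
    return float(loss)
```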

  17. Challenges

  18. Challenge #1: Negative transfer
 Negative transfer: sometimes independent networks work the best. [Table: on Multi-Task CIFAR-100, independent networks outperform state-of-the-art multi-task approaches.]
 Why?
 - optimization challenges
   - caused by cross-task interference
   - tasks may learn at different rates
 - limited representational capacity
   - multi-task networks often need to be much larger than their single-task counterparts

  19. If you have negative transfer, share less across tasks.
 It's not just a binary decision! "Soft parameter sharing": keep per-task parameters, but constrain them to stay close:
 min_{θ^sh, θ^1, …, θ^T} Σ_{i=1}^T ℒ_i({θ^sh, θ^i}, 𝒟_i) + Σ_{t'=1}^T ‖θ^t − θ^{t'}‖
 [Figure: parallel per-task networks x → y_1, …, x → y_T with constrained ("soft-shared") weights between corresponding layers.]
 + allows for more fluid degrees of parameter sharing
 - yet another set of design decisions / hyperparameters
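As a concrete sketch of the penalty term (names are ours; `task_params` is assumed to be a list over tasks, each itself a list of that task's parameter tensors, and `lam` is exactly the extra hyperparameter the slide warns about):

```python
def soft_sharing_penalty(task_params, lam=1e-3):
    """Sum of distances between corresponding parameters of each pair of
    task networks; added to the multi-task loss."""
    penalty = 0.0
    for t in range(len(task_params)):
        for t2 in range(t + 1, len(task_params)):
            for p, q in zip(task_params[t], task_params[t2]):
                penalty = penalty + (p - q).norm()
    return lam * penalty
```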

  20. Challenge #2: Overfitting
 You may not be sharing enough! Multi-task learning acts as a form of regularization.
 Solution: Share more.

  21. Case study
 Goal: Make recommendations for YouTube

  22. Case study
 Goal: Make recommendations for YouTube
 Conflicting objectives:
 - videos that users will rate highly
 - videos that users will share
 - videos that users will watch
 Implicit bias caused by feedback: the user may have watched a video because it was recommended!

  23. Framework Set-Up
 Input: what the user is currently watching (query video) + user features
 1. Generate a few hundred candidate videos
 2. Rank the candidates
 3. Serve the top-ranking videos to the user
 Candidate videos: pool videos from multiple candidate generation algorithms
 - matching topics of the query video
 - videos most frequently watched with the query video
 - and others
 Ranking: the central topic of this paper

  24. The Ranking Problem
 Input: query video, candidate video, user & context features
 Model output: engagement and satisfaction with the candidate video
 Engagement:
 - binary classification tasks, like clicks
 - regression tasks for tasks related to time spent
 Satisfaction:
 - binary classification tasks, like clicking "like"
 - regression tasks for tasks such as rating
 A weighted combination of the engagement & satisfaction predictions gives the ranking score, with the score weights manually tuned, as in the sketch below.
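An illustrative sketch only: the task names and weight values below are made up, since the paper tunes the weights manually and does not publish them.

```python
# Hypothetical per-task weights (manually tuned in the real system).
ENGAGEMENT_WEIGHTS = {"click": 0.3, "expected_watch_time": 0.4}
SATISFACTION_WEIGHTS = {"like": 0.2, "rating": 0.1}

def ranking_score(predictions):
    """Weighted combination of per-task predictions -> a single ranking score."""
    weights = {**ENGAGEMENT_WEIGHTS, **SATISFACTION_WEIGHTS}
    return sum(w * predictions[name] for name, w in weights.items())
```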

  25. The Architecture
 Basic option: "Shared-Bottom Model" (i.e. multi-head architecture)
 -> can harm learning when the correlation between tasks is low

  26. The Architecture
 Instead: use a form of soft parameter sharing, "Multi-gate Mixture-of-Experts (MMoE)", which allows different parts of the network to "specialize" via expert neural networks (see the sketch below):
 1. Decide which experts to use for input x and task k (a per-task softmax gate)
 2. Compute features from the selected experts
 3. Compute the output (a per-task head)
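A minimal PyTorch sketch of the MMoE computation following Ma et al. '18 (the paper's implementation is in TensorFlow; the layer sizes and single-logit heads here are our own illustrative choices):

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Multi-gate Mixture-of-Experts: a shared pool of expert networks,
    mixed by a separate softmax gate per task."""
    def __init__(self, in_dim, expert_dim, num_experts, num_tasks):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts)
        )
        self.gates = nn.ModuleList(nn.Linear(in_dim, num_experts) for _ in range(num_tasks))
        self.heads = nn.ModuleList(nn.Linear(expert_dim, 1) for _ in range(num_tasks))

    def forward(self, x):
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
        outputs = []
        for gate, head in zip(self.gates, self.heads):
            g = torch.softmax(gate(x), dim=-1)                  # 1. which experts to use
            feats = (g.unsqueeze(-1) * expert_outs).sum(dim=1)  # 2. mix expert features
            outputs.append(head(feats))                         # 3. per-task output
        return outputs
```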

  27. Experiments
 Set-Up:
 - Implementation in TensorFlow, on TPUs
 - Train in temporal order, running training continuously to consume newly arriving data
 - Offline AUC & squared-error metrics
 - Online A/B testing in comparison to the production system: live metrics based on time spent, survey responses, rate of dismissals
 - Model computational efficiency matters
 Results: Found a 20% chance of gating polarization during distributed training -> use drop-out on the experts

  28. Plan for Today
 Multi-Task Learning
 - Models & training
 - Challenges
 - Case study of real-world multi-task learning
 (short break)
 Meta-Learning
 - Problem formulation
 - General recipe of meta-learning algorithms
 - Black-box adaptation approaches
 (the last two are the topic of Homework 1!)

  29. Meta-Learning Basics

  30. Two ways to view meta-learning algorithms
 Mechanistic view:
 ➢ A deep neural network model that can read in an entire dataset and make predictions for new datapoints
 ➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
 ➢ This view makes it easier to implement meta-learning algorithms
 Probabilistic view:
 ➢ Extract prior information from a set of (meta-training) tasks that allows efficient learning of new tasks
 ➢ Learning a new task uses this prior and a (small) training set to infer the most likely posterior parameters
 ➢ This view makes it easier to understand meta-learning algorithms

  31. Problem definitions
 [Figure: the standard supervised-learning objective, annotated term by term: arg max_φ log p(φ | 𝒟), with model parameters φ and training data 𝒟 = {(x_1, y_1), …, (x_k, y_k)} of inputs x (e.g., images) and labels y; it decomposes into the data likelihood log p(𝒟 | φ) plus a regularizer log p(φ) (e.g., weight decay).]
 What is wrong with this?
 ➢ The most powerful models typically require large amounts of labeled data
 ➢ Labeled data for some tasks may be very limited

  32. Problem definitions [Figure illustrating the few-shot learning setup; image adapted from Ravi & Larochelle]

  33. The meta-learning problem
 [Figure: incorporate meta-training data into the objective: arg max_φ log p(φ | 𝒟, 𝒟_meta-train), where 𝒟_meta-train = {𝒟_1, …, 𝒟_n}. This is the meta-learning problem.]

  34. A Quick Example [Figure: a few-shot classification episode with a small training set and a held-out (test input, test label) pair.]

  35. How do we train this thing? [Figure: the same episode, with the model predicting the test label from the training set plus the test input.]
 Key idea: "our training procedure is based on a simple machine learning principle: test and train conditions must match" (Vinyals et al., Matching Networks for One-Shot Learning). An episodic-sampling sketch follows.
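One common way to realize "train conditions match test conditions" is episodic sampling: every meta-training batch is itself a small N-way, K-shot task with its own train/test split, mirroring evaluation. A minimal Python sketch (the function and the `data_by_class` mapping of class name to example list are our own illustrative assumptions):

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, q_queries=1):
    """Sample one N-way, K-shot episode: a tiny training (support) set and a
    held-out query set drawn from the same randomly chosen classes."""
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query  # train on support, evaluate the meta-loss on query
```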
