Multi-Task & Meta-Learning Basics (CS 330)
Logistics
- Homework 1 posted today, due Wednesday, October 9
- Fill out paper preferences by tomorrow
- TensorFlow review session tomorrow, 4:30 pm in Gates B03
Plan for Today
Multi-Task Learning
- Models & training
- Challenges
- Case study of real-world multi-task learning
— short break —
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
(Topic of Homework 1!)
Multi-Task Learning Basics
Some notation

Inputs x, labels y; model f_θ(y|x) with parameters θ.

Typical loss: negative log likelihood (sketched in code below):

    ℒ(θ, 𝒟) = −𝔼_{(x,y)∼𝒟}[log f_θ(y|x)]

Single-task learning [supervised]: dataset 𝒟 = {(x, y)_k}, objective min_θ ℒ(θ, 𝒟).

What is a task? (more formally this time)

A task: 𝒰_i ≜ {p_i(x), p_i(y|x), ℒ_i}, where p_i(x) and p_i(y|x) are the data-generating distributions. Corresponding datasets: 𝒟_i^tr and 𝒟_i^test (we will use 𝒟_i as shorthand for 𝒟_i^tr).

[Figure: example tasks, e.g. image classification (tiger, lynx, cat) and regression (predicting the length of a paper).]
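For concreteness, a minimal sketch of estimating this negative log likelihood on a mini-batch, here for a 3-way classifier; all numbers are invented for illustration:

```python
import numpy as np

# f_theta(y|x) for a 3-way classifier on a batch of 2 examples (made-up values).
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])
labels = np.array([0, 2])

# Negative log likelihood: -E_{(x,y)~D}[log f_theta(y|x)], estimated on the batch.
nll = -np.mean(np.log(probs[np.arange(len(labels)), labels]))
```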
Examples of Tasks

A task: 𝒰_i ≜ {p_i(x), p_i(y|x), ℒ_i}, i.e. data-generating distributions plus a loss, with corresponding datasets 𝒟_i^tr, 𝒟_i^test (again using 𝒟_i as shorthand for 𝒟_i^tr).

Multi-task classification: ℒ_i is the same across all tasks.
e.g. per-language handwriting recognition; personalized spam filter

Multi-label learning: ℒ_i and p_i(x) are the same across all tasks.
e.g. CelebA attribute recognition; scene understanding

When might ℒ_i vary across tasks?
- mixed discrete, continuous labels across tasks
- if you care more about one task than another
Instead of f_θ(y|x), model f_θ(y|x, z_i), where z_i is a task descriptor, e.g. a one-hot encoding of the task index, or whatever meta-data you have:
- personalization: user features/attributes
- language description of the task
- formal specifications of the task

[Example: from the same paper x, different tasks predict the length of the paper, a paper review, or a summary of the paper.]

Objective: min_θ ∑_{i=1}^T ℒ_i(θ, 𝒟_i)

A model decision and an algorithm decision:
- How should we condition on z_i?
- How to optimize our objective?
Conditioning on the task

Question: How should you condition on z_i in order to share as little as possible? (Assume z_i is the task index.)

One extreme: use z_i as a switch among T independent sub-networks within a single model, x -> y_1, …, x -> y_T:

    y = ∑_j 1(z_i = j) y_j    (multiplicative gating)

This gives independent training within a single network, with no shared parameters.

The other extreme: concatenate z_i with the input and/or activations. Then all parameters are shared, except the parameters directly following z_i. (Both extremes are sketched in code below.)
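A minimal NumPy sketch of the two extremes; the dimensions and random weights are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_in, d_out = 3, 4, 2          # number of tasks, input dim, output dim
x = rng.normal(size=d_in)         # a single input
z = np.eye(T)[1]                  # one-hot task descriptor, here task i = 1

# Extreme 1: multiplicative gating selects one independent sub-network per task.
# y = sum_j 1(z_i = j) y_j  -- no parameters are shared across tasks.
W_per_task = rng.normal(size=(T, d_out, d_in))
y_per_task = np.stack([W_per_task[j] @ x for j in range(T)])
y_gated = (z[:, None] * y_per_task).sum(axis=0)   # picks out task i's output

# Extreme 2: concatenate z_i with the input; all parameters are shared,
# except (effectively) those directly multiplying z_i.
W_shared = rng.normal(size=(d_out, d_in + T))
y_concat = W_shared @ np.concatenate([x, z])
```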
An Alternative View on the Multi-Task Objective
Split θ into shared parameters θ^sh and task-specific parameters θ^i. Then our objective is:

    min_{θ^sh, θ^1, …, θ^T} ∑_{i=1}^T ℒ_i({θ^sh, θ^i}, 𝒟_i)

Choosing how to condition on z_i is equivalent to choosing how & where to share parameters.
Conditioning: Some Common Choices

1. Concatenation-based conditioning: concatenate z_i with the input or with hidden activations.
2. Additive conditioning: add a (linearly transformed) z_i to the activations.

These are actually the same! A linear layer applied to the concatenation splits as [x; z_i] W = x W_x + z_i W_z, i.e., a task-dependent additive bias (see the sketch below).

3. Multi-head architecture: a shared trunk with task-specific output heads (Ruder '17).
4. Multiplicative conditioning: scale the activations by a (transformed) z_i.

Why might multiplicative conditioning be a good idea?
- more expressive
- recall multiplicative gating: multiplicative conditioning generalizes both independent networks and independent heads.

Diagram sources: distill.pub/2018/feature-wise-transformations/
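A minimal NumPy sketch of these choices: it checks the concatenation/additive identity numerically and shows a FiLM-style multiplicative conditioning; all dimensions and weights are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, d_h = 4, 3, 5
x, z = rng.normal(size=d_x), rng.normal(size=d_z)

# 1./2. Concatenation-based conditioning == additive conditioning:
W = rng.normal(size=(d_h, d_x + d_z))
W_x, W_z = W[:, :d_x], W[:, d_x:]
h_concat = W @ np.concatenate([x, z])
h_additive = W_x @ x + W_z @ z          # same computation, rewritten
assert np.allclose(h_concat, h_additive)

# 4. Multiplicative conditioning: the task descriptor scales activations
# (a FiLM-style gating; more expressive than adding a bias).
gamma = rng.normal(size=(d_h, d_z)) @ z  # per-unit scale computed from z
h_mult = gamma * np.tanh(W_x @ x)
```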
Conditioning: More Complex Choices

- Cross-Stitch Networks (Misra, Shrivastava, Gupta, Hebert '16)
- Deep Relation Networks (Long, Wang '15)
- Multi-Task Attention Network (Liu, Johns, Davison '18)
- Sluice Networks (Ruder, Bingel, Augenstein, Sogaard '17)

Unfortunately, these design decisions are like neural network architecture tuning:
- problem dependent
- largely guided by intuition or knowledge of the problem
- currently more of an art than a science
Optimizing the objective
Objective: min_θ ∑_{i=1}^T ℒ_i(θ, 𝒟_i)

Basic version:
1. Sample a mini-batch of tasks ℬ ∼ {𝒰_i}
2. Sample a mini-batch of datapoints 𝒟_i^b ∼ 𝒟_i for each task
3. Compute the loss on the mini-batch: ℒ̂(θ, ℬ) = ∑_{𝒰_k ∈ ℬ} ℒ_k(θ, 𝒟_k^b)
4. Backpropagate the loss to compute the gradient ∇_θ ℒ̂
5. Apply the gradient with your favorite neural net optimizer (e.g. Adam)

Note: sampling tasks before datapoints ensures that tasks are weighted uniformly, regardless of data quantities. (A runnable sketch of this loop follows below.)
Tip: for regression problems, make sure your task labels are on the same scale!
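A runnable sketch of this basic version, on synthetic multi-task linear regression with a shared weight vector and per-task biases so the gradients can be written by hand; the task construction, model, and plain-SGD step are illustrative assumptions, not the course's reference code:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 10, 5, 100                      # tasks, input dim, points per task

# Synthetic regression tasks: shared true weights + a task-specific bias.
w_true = rng.normal(size=d)
b_true = rng.normal(size=T)
data = []
for i in range(T):
    X = rng.normal(size=(n, d))
    data.append((X, X @ w_true + b_true[i]))

w, b = np.zeros(d), np.zeros(T)           # theta = {shared w, task-specific b_i}
lr, task_bs, point_bs = 0.1, 4, 16

for step in range(500):
    tasks = rng.choice(T, size=task_bs, replace=False)    # 1. sample tasks uniformly
    gw = np.zeros(d)
    for i in tasks:
        X, y = data[i]
        idx = rng.choice(n, size=point_bs, replace=False) # 2. sample datapoints
        Xb, yb = X[idx], y[idx]
        err = Xb @ w + b[i] - yb                          # 3. batch loss = mean(err**2)
        gw += 2 * Xb.T @ err / point_bs                   # 4. gradient wrt shared w ...
        b[i] -= lr * 2 * err.mean()                       #    ... and task-specific b_i
    w -= lr * gw / task_bs                                # 5. gradient step (plain SGD)
```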
Challenges
Challenge #1: Negative transfer
Negative transfer: sometimes independent networks work best! On multi-task CIFAR-100, independently trained networks outperform state-of-the-art multi-task approaches. Why?
- optimization challenges
  - caused by cross-task interference
  - tasks may learn at different rates
- limited representational capacity
  - multi-task networks often need to be much larger than their single-task counterparts
If you have negative transfer, share less across tasks. But it's not just a binary decision! "Soft parameter sharing" keeps per-task parameters while penalizing how far they drift apart (a small sketch of the penalty follows below):

    min_{θ^sh, θ^1, …, θ^T} ∑_{i=1}^T ℒ_i({θ^sh, θ^i}, 𝒟_i) + ∑_{t'=1}^T ‖θ^t − θ^{t'}‖

+ allows for more fluid degrees of parameter sharing
− yet another set of design decisions / hyperparameters

[Diagram: per-task networks x -> y_1, …, x -> y_T with corresponding ("constrained") weights softly tied across tasks.]
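A minimal sketch of that penalty term; the pairwise-sum form and the weight lam are illustrative choices (implementations of soft sharing vary):

```python
import numpy as np

def soft_sharing_penalty(task_params, lam=0.1):
    """Sum of pairwise distances between task-specific parameter vectors.

    task_params: array of shape (T, p), one parameter vector per task.
    Adding lam * this penalty to the multi-task loss pulls the per-task
    weights toward each other: a fluid degree of parameter sharing.
    """
    T = len(task_params)
    total = 0.0
    for t in range(T):
        for t2 in range(T):
            total += np.linalg.norm(task_params[t] - task_params[t2])
    return lam * total
```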
Challenge #2: Overfitting
You may not be sharing enough! Multi-task learning ↔ a form of regularization. Solution: share more.
Case study

Goal: Make recommendations for YouTube

Conflicting objectives:
- videos that users will rate highly
- videos that users will share
- videos that users will watch

There is also an implicit bias caused by the feedback loop: a user may have watched a video simply because it was recommended!
Framework Set-Up
Input: what the user is currently watching (query video) + user features

1. Generate a few hundred candidate videos
2. Rank candidates
3. Serve top-ranking videos to the user

Candidate videos: pooled from multiple candidate generation algorithms
- matching topics of the query video
- videos most frequently watched with the query video
- and others

Ranking is the central topic of this paper.
The Ranking Problem
Input: query video, candidate video, user & context features
Model output: predicted engagement with and satisfaction from the candidate video

Engagement:
- binary classification tasks, e.g. clicks
- regression tasks related to time spent

Satisfaction:
- binary classification tasks, e.g. clicking "like"
- regression tasks, e.g. ratings

A weighted combination of the engagement & satisfaction predictions yields the ranking score; the score weights are manually tuned (sketched below).
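A sketch of that weighted combination; the task names, predictions, and weights here are all invented for illustration (the paper tunes its weights manually):

```python
# Hypothetical per-task predictions for one candidate video.
predictions = {"click": 0.31, "expected_watch_time": 4.2,   # engagement
               "like": 0.12, "predicted_rating": 3.9}       # satisfaction
weights = {"click": 1.0, "expected_watch_time": 0.5,        # manually tuned
           "like": 2.0, "predicted_rating": 1.5}

ranking_score = sum(weights[k] * predictions[k] for k in predictions)
```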
The Architecture
Basic option: "Shared-Bottom Model" (i.e. a multi-head architecture)
-> can harm learning when correlation between tasks is low

Instead: use a form of soft parameter sharing, the "Multi-gate Mixture-of-Experts (MMoE)". Expert neural networks allow different parts of the network to "specialize":
1. Decide which experts to use for input x and task k: gate g^k(x) = softmax(W_{g^k} x)
2. Compute features from the selected experts: f^k(x) = ∑_i g^k(x)_i f_i(x)
3. Compute the task output: y_k = h^k(f^k(x))
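A minimal NumPy sketch of this forward pass, with single-layer experts and arbitrary dimensions standing in for the paper's deeper networks:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, d_h, n_tasks = 8, 4, 16, 2
x = rng.normal(size=d)

# Expert networks (here: a single hidden layer each).
W_experts = rng.normal(size=(n_experts, d_h, d))
expert_out = np.tanh(W_experts @ x)               # (n_experts, d_h)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

W_gates = rng.normal(size=(n_tasks, n_experts, d))  # one gate per task
W_heads = rng.normal(size=(n_tasks, d_h))           # one output head per task

outputs = []
for k in range(n_tasks):
    g = softmax(W_gates[k] @ x)                   # which experts to use for task k
    f_k = g @ expert_out                          # weighted combination of experts
    outputs.append(W_heads[k] @ f_k)              # task k's output
```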
Experiments

Set-up:
- Implementation in TensorFlow, on TPUs
- Train in temporal order, running training continuously to consume newly arriving data
- Offline metrics: AUC & squared error
- Online A/B testing in comparison to the production system
  - live metrics based on time spent, survey responses, rate of dismissals
- Model computational efficiency matters

Results: found a 20% chance of gating polarization during distributed training -> use dropout on the experts.
Plan for Today
Multi-Task Learning
- Models & training
- Challenges
- Case study of real-world multi-task learning
— short break —
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
(Topic of Homework 1!)
Meta-Learning Basics
Two ways to view meta-learning algorithms
Mechanistic view
➢ Deep neural network model that can read in an entire dataset and make predictions for new datapoints
➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
➢ This view makes it easier to implement meta-learning algorithms
Probabilistic view
➢ Extract prior information from a set of (meta-training) tasks that allows efficient learning of new tasks
➢ Learning a new task uses this prior and a (small) training set to infer the most likely posterior parameters
➢ This view makes it easier to understand meta-learning algorithms
Problem definitions
The standard supervised learning objective: arg max_φ log p(φ | 𝒟) = arg max_φ [log p(𝒟 | φ) + log p(φ)], with model parameters φ, training data 𝒟 of inputs (e.g., images) and labels, a data likelihood term log p(𝒟 | φ), and a regularizer log p(φ) (e.g., weight decay).

What is wrong with this?
➢ The most powerful models typically require large amounts of labeled data
➢ Labeled data for some tasks may be very limited
The meta-learning problem

Incorporate data from previous tasks, 𝒟_meta-train = {𝒟_1, …, 𝒟_n}, and learn a new task via arg max_φ log p(φ | 𝒟, 𝒟_meta-train); learning to extract what 𝒟_meta-train provides about new tasks is the meta-learning problem. [Image adapted from Ravi & Larochelle]
A Quick Example

[Figure: a few-shot classification task, i.e. a handful of labeled examples plus a test input whose test label must be predicted.]

How do we train this thing?

Key idea: "our training procedure is based on a simple machine learning principle: test and train conditions must match" (Vinyals et al., Matching Networks for One-Shot Learning). To make (meta) training-time look like (meta) test-time, reserve a test set for each task!
The complete meta-learning optimization

    min_θ ∑_i ℒ(φ_i, 𝒟_i^test), where φ_i = f_θ(𝒟_i^tr)

(for a new task, its 𝒟^test labels are unobserved at meta-test time; this is exactly why we reserve labeled test sets for the meta-training tasks; a schematic sketch follows below)
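A schematic, runnable sketch of this optimization on toy 1-D regression tasks. The black-box learner f_theta here (a least-squares fit shrunk toward a prior by one meta-parameter theta) and the finite-difference meta-gradient are stand-ins for the neural network and autodiff you would actually use (e.g. in TensorFlow for Homework 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Illustrative task distribution: 1-D linear regression y = a*x + b."""
    a, b = rng.normal(), rng.normal()
    xs = rng.uniform(-2, 2, size=10)
    ys = a * xs + b
    # Reserve a per-task test set, distinct from the per-task train set.
    return (xs[:5], ys[:5]), (xs[5:], ys[5:])

def f_theta(train_x, train_y, test_x, theta):
    """Stand-in learner: reads the whole task training set, predicts test points.
    Here a least-squares fit blended toward a zero prior by meta-parameter theta."""
    A = np.stack([train_x, np.ones_like(train_x)], axis=1)
    phi = np.linalg.lstsq(A, train_y, rcond=None)[0]   # task parameters phi_i
    phi = (1 - theta) * phi                            # theta shrinks toward prior
    return phi[0] * test_x + phi[1]

# Meta-training: evaluate phi_i = f_theta(D_i^tr) on the held-out D_i^test.
theta = 0.5
for step in range(100):
    (tr_x, tr_y), (ts_x, ts_y) = sample_task()
    meta_loss = np.mean((f_theta(tr_x, tr_y, ts_x, theta) - ts_y) ** 2)
    # Finite-difference meta-gradient; a real implementation backprops
    # through f_theta with automatic differentiation instead.
    eps = 1e-4
    loss2 = np.mean((f_theta(tr_x, tr_y, ts_x, theta + eps) - ts_y) ** 2)
    grad = (loss2 - meta_loss) / eps
    theta = np.clip(theta - 0.01 * grad, 0.0, 1.0)
```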
Some meta-learning terminology
Tasks seen during meta-training are (meta-training) tasks; held-out tasks are (meta-test) tasks. Within a task, the small training set is called the support (set) and the test inputs the query. "Shot" counts the labeled examples per class in the support set (i.e., k-shot, 5-shot). [Image credit: Ravi & Larochelle '17]
Closely related problem settings (discussed later in the course).
Plan for Today
Multi-Task Learning
- Models & training
- Challenges
- Case study of real-world multi-task learning
— short break —
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches