SLIDE 1

What should be transferred in transfer learning?

Chris Williams and Kian Ming A. Chai, July 2009

SLIDE 2

Motivation

◮ Is learning the N-th thing any easier than learning the first? (Thrun, 1996)
◮ Gain strength by sharing information across tasks
◮ Examples of multi-task learning:
  ◮ Co-occurrence of ores (geostatistics)
  ◮ Object recognition for multiple object classes
  ◮ Personalization (personalizing spam filters, speaker adaptation in speech recognition)
  ◮ Compiler optimization of many computer programs
  ◮ Robot inverse dynamics (multiple loads)

◮ Are task descriptors available?

SLIDE 3

Outline

◮ Co-kriging
◮ Intrinsic Correlation Model
◮ Multi-task learning:
  ◮ 1. MTL as Hierarchical Modelling
  ◮ 2. MTL as Input-space Transformation
  ◮ 3. MTL as Shared Feature Extraction
◮ Multi-task learning in Robot Inverse Dynamics

SLIDE 4

Co-kriging

Consider M tasks and N distinct inputs $x_1, \ldots, x_N$:

◮ $f_{i\ell}$ is the response for the $\ell$-th task on the $i$-th input $x_i$
◮ Gaussian process with covariance function
$$k(x, \ell; x', m) = \langle f_\ell(x)\, f_m(x') \rangle$$
◮ Goal: given noisy observations $y$ of $f$, make predictions of unobserved values $f_*$ at locations $X_*$
◮ Solution: use the usual GP prediction equations (given below)
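For reference, the "usual GP prediction equations" are the standard Gaussian conditioning formulas (see e.g. Rasmussen and Williams, 2006); in the multi-task setting the kernel matrices are simply built from $k(x, \ell; x', m)$, so the task index acts as part of the input:
$$\bar f_* = K(X_*, X)\, K_y^{-1}\, y, \qquad \mathrm{cov}(f_*) = K(X_*, X_*) - K(X_*, X)\, K_y^{-1}\, K(X, X_*),$$
where $K_y$ is the covariance matrix of the noisy observations $y$.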

SLIDE 5

[Figure: plot with axes x and f]

SLIDE 6

Covariance functions and hyperparameters

◮ The squared-exponential covariance function
$$k(x, x') = \sigma_f^2 \exp\!\left[-\tfrac{1}{2}(x - x')^\top M\, (x - x')\right]$$
is often used in machine learning
◮ Many other choices, e.g. the Matérn family, rational quadratic, non-stationary covariance functions, etc.
◮ If $M$ is diagonal, the entries are inverse squared lengthscales → automatic relevance determination (ARD; Neal, 1996)
◮ Estimation of hyperparameters by optimization of the log marginal likelihood
$$\mathcal{L} = -\tfrac{1}{2}\, y^\top K_y^{-1} y - \tfrac{1}{2} \log |K_y| - \tfrac{n}{2} \log 2\pi$$
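As a concrete illustration (our own sketch, not from the slides), here are the SE-ARD covariance and this log marginal likelihood in NumPy; all function and variable names are ours:

```python
import numpy as np

def se_ard(X1, X2, sigma_f, lengthscales):
    """Squared-exponential kernel with ARD: M = diag(1 / lengthscales^2)."""
    d = (X1[:, None, :] - X2[None, :, :]) / lengthscales  # scaled differences
    return sigma_f**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))

def log_marginal_likelihood(X, y, sigma_f, lengthscales, sigma_n):
    """L = -1/2 y^T Ky^{-1} y - 1/2 log|Ky| - n/2 log 2pi, with Ky = K + sigma_n^2 I."""
    n = len(y)
    Ky = se_ard(X, X, sigma_f, lengthscales) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(Ky)                       # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = Ky^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))             # = 1/2 log|Ky|
            - 0.5 * n * np.log(2 * np.pi))
```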

SLIDE 7

Some questions

◮ What kinds of (cross-)covariance structures match different ideas of multi-task learning?
◮ Are there multi-task relationships that don't fit well with co-kriging?

SLIDE 8

Intrinsic Correlation Model (ICM)

$$\langle f_\ell(x)\, f_m(x') \rangle = K^f_{\ell m}\, k^x(x, x'), \qquad y_{i\ell} \sim N\!\left(f_\ell(x_i), \sigma_\ell^2\right)$$

◮ $K^f$: PSD matrix that specifies the inter-task similarities (could depend parametrically on task descriptors if these are available)
◮ $k^x$: covariance function over inputs
◮ $\sigma_\ell^2$: noise variance for the $\ell$-th task
◮ The Linear Model of Coregionalization is a sum of ICMs
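On a set of inputs shared across all tasks, the ICM Gram matrix factorizes as a Kronecker product; a minimal sketch under that assumption (names are ours):

```python
import numpy as np

def icm_gram(Kf, Kx, noise_vars):
    """Gram matrix of the ICM on a common set of inputs.

    Kf: (M, M) PSD task-similarity matrix
    Kx: (N, N) Gram matrix of k^x on the N inputs
    noise_vars: (M,) per-task noise variances sigma_l^2
    Returns the (M*N, M*N) covariance of y, stacked task-major.
    """
    M, N = Kf.shape[0], Kx.shape[0]
    noise = np.kron(np.diag(noise_vars), np.eye(N))  # iid noise per task
    return np.kron(Kf, Kx) + noise
```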

SLIDE 9

ICM as a linear combination of independent GPs

◮ Independent GP priors over the functions $z_j(x)$ ⇒ multi-task GP prior over the $f_m(x)$:
$$\langle f_\ell(x)\, f_m(x') \rangle = K^f_{\ell m}\, k^x(x, x')$$
◮ $K^f \in \mathbb{R}^{M \times M}$ is a task (or context) similarity matrix with
$$K^f_{\ell m} = (\rho^m)^\top \rho^\ell$$

[Figure: each output $f_m$, $m = 1 \ldots M$, is a linear combination of the independent latent processes $z_1, \ldots, z_M$ with weight vector $\rho^m = (\rho^m_1, \ldots, \rho^m_M)^\top$]
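A sketch of this construction (our own illustration): draw independent GP samples $z_j$ and mix them with weights $\rho$; the implied task covariance is then $K^f = P P^\top$, where row $m$ of $P$ is $(\rho^m)^\top$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 3                           # inputs; tasks (= number of latent GPs here)
X = np.linspace(0, 1, N)[:, None]

# Gram matrix of a shared input kernel k^x (SE, unit amplitude, lengthscale 0.2)
Kx = np.exp(-0.5 * ((X - X.T) / 0.2)**2)
Lx = np.linalg.cholesky(Kx + 1e-9 * np.eye(N))

Z = Lx @ rng.standard_normal((N, M))   # M independent GP draws, one per column
P = rng.standard_normal((M, M))        # row m holds the mixing weights rho^m
F = Z @ P.T                            # f_m(x) = sum_j rho^m_j z_j(x)

Kf = P @ P.T                           # implied task similarity K^f_{lm} = (rho^m)^T rho^l
```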

SLIDE 10

◮ Some problems conform nicely to the ICM setup, e.g. robot inverse dynamics (Chai, Williams, Klanke and Vijayakumar, 2009; see later)
◮ The semiparametric latent factor model (SLFM) of Teh et al. (2005) has $P$ latent processes, each with its own covariance function. Noiseless outputs are obtained by linear mixing of these latent functions (see the sketch below)
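Under that description, the SLFM cross-covariance is a sum over latent processes, each with its own input kernel; a minimal sketch (our own notation, with mixing weights Phi):

```python
import numpy as np

def slfm_cross_cov(Kx_list, Phi, l, m):
    """Cross-covariance between tasks l and m in an SLFM-style model.

    Kx_list: list of P Gram matrices, one per latent process (own kernel each)
    Phi: (M, P) mixing weights; task l output is sum_p Phi[l, p] * g_p(x)
    Returns the (N, N) matrix <f_l(x) f_m(x')>.
    """
    return sum(Phi[l, p] * Phi[m, p] * Kx_list[p] for p in range(len(Kx_list)))
```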

SLIDE 11
1. Multi-task Learning as Hierarchical Modelling

e.g. Baxter (JAIR, 2000), Evgeniou et al. (JMLR, 2005), Goldstein (2003)

[Figure: graphical model with a shared parameter $\theta$ as parent of the task functions $f_1, f_2, f_3$, each generating its observations $y_1, y_2, y_3$]

SLIDE 12

◮ The prior on $\theta$ may be generic (e.g. isotropic Gaussian) or more structured
◮ Mixture model on $\theta$ → task clustering
◮ Task clustering can be implemented in the ICM model using a block-diagonal $K^f$, where each block is a cluster
◮ Manifold model for $\theta$, e.g. a linear subspace ⇒ low-rank structure of $K^f$ (e.g. linear regression with correlated priors)
◮ Combination of the above ideas → a mixture of linear subspaces
◮ If task descriptors are available then one can have (see the sketch below)
$$K^f_{\ell m} = k^f(t_\ell, t_m)$$
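A sketch of two of these structures for $K^f$ (our own illustration): a block-diagonal $K^f$ for task clustering, and a $K^f$ built from task descriptors via a kernel $k^f$:

```python
import numpy as np
from scipy.linalg import block_diag

def cluster_kf(block_sizes, within=0.9):
    """Block-diagonal K^f: tasks in the same cluster have covariance `within`,
    1.0 on the diagonal, and zero covariance across clusters."""
    blocks = [within * np.ones((s, s)) + (1 - within) * np.eye(s)
              for s in block_sizes]
    return block_diag(*blocks)

def descriptor_kf(T, lengthscale=1.0):
    """K^f_{lm} = k^f(t_l, t_m): SE kernel on task descriptors T of shape (M, D)."""
    d = (T[:, None, :] - T[None, :, :]) / lengthscale
    return np.exp(-0.5 * np.sum(d**2, axis=-1))
```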

SLIDE 13

GP view

Integrate out θ

[Figure: graphical model after integrating out $\theta$: the task functions $f_1, f_2, f_3$ become directly coupled, each generating $y_1, y_2, y_3$]

SLIDE 14
2. MTL as Input-space Transformation

◮ Ben-David and Schuller (COLT, 2003): $f_2(x)$ is related to $f_1(x)$ by an $\mathcal{X}$-space transformation $f : \mathcal{X} \to \mathcal{X}$
◮ Suppose $f_2(x)$ is related to $f_1(x)$ by a shift $a$ in $x$-space
◮ Then
$$\langle f_1(x)\, f_2(x') \rangle = \langle f_1(x)\, f_1(x' - a) \rangle = k_1(x, x' - a)$$
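A minimal sketch of this shifted cross-covariance (our own code), assuming a stationary base kernel $k_1$:

```python
import numpy as np

def k1(x, xp, lengthscale=0.5):
    """Stationary base kernel for task 1 (SE, unit amplitude)."""
    return np.exp(-0.5 * ((x - xp) / lengthscale)**2)

def cross_cov_shift(x, xp, a):
    """<f1(x) f2(x')> when f2(x) = f1(x - a): equals k1(x, x' - a)."""
    return k1(x, xp - a)
```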

SLIDE 15

◮ More generally one can consider convolutions, e.g.
$$f_i(x) = \int h_i(x - x')\, g(x')\, dx'$$
to generate dependent $f$'s (e.g. Ver Hoef and Barry, 1998; Higdon, 2002; Boyle and Frean, 2005). Taking $h_i(x) = \delta(x - a)$ recovers the shift of the previous slide as a special case
◮ Alvarez and Lawrence (2009) generalize this to allow a linear combination of several latent processes:
$$f_i(x) = \sum_{r=1}^{R} \int h_{ir}(x - x')\, g_r(x')\, dx'$$
◮ ICM and SLFM are special cases using the $\delta(\cdot)$ kernel
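To illustrate (our own sketch, not from the slides): discretize the convolution on a grid, smoothing one shared white-noise latent process $g$ with task-specific kernels $h_i$ to obtain dependent outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 400)
dx = x[1] - x[0]
g = rng.standard_normal(x.size)          # shared latent (white-noise) process

def smooth(width):
    """f_i(x) ~= sum_j h_i(x - x_j) g(x_j) dx, with a Gaussian h_i of given width."""
    H = np.exp(-0.5 * ((x[:, None] - x[None, :]) / width)**2)
    return (H * g).sum(axis=1) * dx

f1 = smooth(width=0.1)   # task 1: narrow smoothing kernel
f2 = smooth(width=0.4)   # task 2: broader kernel; smoother, but dependent on f1
```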

SLIDE 16
3. Shared Feature Extraction

◮ Intuition: multiple tasks can depend on the same extracted features; all tasks can be used to help learn these features
◮ If data is scarce for each task this should help learn the features
◮ Bakker and Heskes (2003): neural network setup (see the sketch below)

[Figure: feed-forward network with input layer ($x$), shared hidden layers 1 and 2, and an output layer with one unit per task]
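A minimal sketch of this hard-parameter-sharing architecture (our own, in PyTorch; the layer sizes and names are arbitrary choices, not from Bakker and Heskes):

```python
import torch
import torch.nn as nn

class SharedFeatureNet(nn.Module):
    """Two shared hidden layers feeding one linear output head per task."""
    def __init__(self, d_in: int, d_hidden: int, n_tasks: int):
        super().__init__()
        self.shared = nn.Sequential(                   # features learned jointly
            nn.Linear(d_in, d_hidden), nn.Tanh(),      # hidden layer 1
            nn.Linear(d_hidden, d_hidden), nn.Tanh())  # hidden layer 2
        self.heads = nn.ModuleList(
            [nn.Linear(d_hidden, 1) for _ in range(n_tasks)])  # task-specific outputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.shared(x)                             # shared extracted features
        return torch.cat([head(z) for head in self.heads], dim=-1)
```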

SLIDE 17

◮ Minka and Picard (1999): assume that the multiple tasks are independent GPs but with shared hyperparameters
◮ Yu, Tresp and Schwaighofer (2005) extend this so that all tasks share the same kernel hyperparameters, but can have different kernels
◮ Could also have inter-task correlations
◮ An interesting case arises if different tasks have different $x$-spaces; convert from each task-dependent $x$-space to the same feature space?

SLIDE 18

Discussion

◮ Three types of multi-task learning setup
◮ ICM and convolutional cross-covariance functions; shared feature extraction
◮ Are there multi-task relationships that don't fit well with a co-kriging framework?

SLIDE 19

Multi-task Learning in Robot Inverse Dynamics

◮ Joint variables $q$
◮ Apply torque $\tau_i$ to joint $i$ to trace a trajectory
◮ Inverse dynamics: predict $\tau_i(q, \dot q, \ddot q)$

[Figure: two-link arm with base (link 0), links 1 and 2, joint angles $q_1, q_2$, and end effector]

SLIDE 20

Inverse Dynamics

Characteristics of τ

◮ Torques are non-linear functions of $x \stackrel{\mathrm{def}}{=} (q, \dot q, \ddot q)$
◮ (One) idealized rigid-body control:
$$\tau_i(x) = \underbrace{b_i^\top(q)\, \ddot q + \dot q^\top H_i(q)\, \dot q}_{\text{kinetic}} + \underbrace{g_i(q)}_{\text{potential}} + \underbrace{f_i^v\, \dot q_i + f_i^c\, \mathrm{sgn}(\dot q_i)}_{\text{viscous and Coulomb frictions}}$$
◮ Physics-based modelling can be hard due to factors like unknown parameters, friction and contact forces, and joint elasticity, making analytical predictions infeasible
◮ This is particularly true for compliant, lightweight humanoid robots

SLIDE 21

Inverse Dynamics

Characteristics of τ

◮ The functions change with the loads handled at the end effector
◮ Loads have different masses, shapes and sizes
◮ Bad news (1): a different inverse dynamics model is needed for each load
◮ Bad news (2): different loads may go through different trajectories in the data-collection phase, and so may explore different portions of the $x$-space

SLIDE 22

◮ Good news: the changes enter through changes in the dynamic parameters of the last link
◮ Good news: the changes are linear wrt the dynamic parameters:
$$\tau_i^m(x) = y_i^\top(x)\, \pi^m,$$
where $\pi^m \in \mathbb{R}^{11}$ (e.g. Petkos and Vijayakumar, 2007)
◮ Reparameterization (verified numerically in the sketch below):
$$\tau_i^m(x) = y_i^\top(x)\, \pi^m = y_i^\top(x)\, A_i^{-1} A_i\, \pi^m = z_i^\top(x)\, \rho_i^m,$$
where $A_i$ is a non-singular $11 \times 11$ matrix
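A quick numerical check of this identity (our own sketch; dimensions as on the slide):

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_normal(11)          # y_i(x) for one joint at one input x
pi = rng.standard_normal(11)         # dynamic parameters pi^m of load m
A = rng.standard_normal((11, 11))    # a random 11 x 11 matrix (non-singular a.s.)

z = np.linalg.solve(A.T, y)          # z_i(x) = A_i^{-T} y_i(x)
rho = A @ pi                         # rho_i^m = A_i pi^m

assert np.isclose(y @ pi, z @ rho)   # the torque is unchanged by the reparameterization
```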

SLIDE 23

GP prior for Inverse Dynamics for multiple loads

◮ Independent GP priors over the functions $z_{ij}(x)$ ⇒ multi-task GP prior over the $\tau_i^m$:
$$\langle \tau_i^\ell(x)\, \tau_i^m(x') \rangle = (K_i^\rho)_{\ell m}\, k_i^x(x, x')$$
◮ $K_i^\rho \in \mathbb{R}^{M \times M}$ is a task (or context) similarity matrix with
$$(K_i^\rho)_{\ell m} = (\rho_i^m)^\top \rho_i^\ell$$

[Figure: for each joint $i = 1 \ldots J$, the latent functions $z_{i,1}, \ldots, z_{i,s}$ are mixed with weights $\rho_i^m = (\rho_{i,1}^m, \ldots, \rho_{i,s}^m)^\top$ to give the torques $\tau_i^m$, $m = 1 \ldots M$]

SLIDE 24

GP prior for k(x, x′)

$$k(x, x') = \text{bias} + [\text{linear with ARD}](x, x') + [\text{squared exponential with ARD}](x, x') + [\text{linear with ARD}](\mathrm{sgn}(\dot q), \mathrm{sgn}(\dot q'))$$

◮ Domain knowledge relates to the last term (Coulomb friction); a sketch of the composed kernel follows
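A minimal sketch of such a composed covariance (our own code; the hyperparameter names are ours, not from the paper):

```python
import numpy as np

def linear_ard(A, B, weights):
    """Linear kernel with ARD: k(a, b) = a^T diag(weights) b."""
    return (A * weights) @ B.T

def se_ard(A, B, sigma_f, ls):
    """Squared-exponential kernel with ARD lengthscales ls."""
    d = (A[:, None, :] - B[None, :, :]) / ls
    return sigma_f**2 * np.exp(-0.5 * np.sum(d**2, axis=-1))

def inverse_dynamics_kernel(X1, X2, qdot1, qdot2, p):
    """bias + linear-ARD(x, x') + SE-ARD(x, x') + linear-ARD(sgn(qdot), sgn(qdot'))."""
    return (p['bias']
            + linear_ard(X1, X2, p['w_lin'])
            + se_ard(X1, X2, p['sigma_f'], p['ls'])
            + linear_ard(np.sign(qdot1), np.sign(qdot2), p['w_sgn']))
```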

SLIDE 25

Data

◮ Puma 560 robot arm manipulator: 6 degrees of freedom
◮ Realistic simulator (Corke, 1996), including viscous and asymmetric-Coulomb frictions
◮ 4 paths × 4 speeds = 16 different trajectories
◮ Speeds: 5 s, 10 s, 15 s and 20 s completion times
◮ 15 loads (contexts): 0.2 kg to 3.0 kg, various shapes and sizes

[Figure: left, the Puma 560 joint layout (base; waist, joint 1; shoulder, joint 2; elbow, joint 3; wrist bend and rotation, joints 4-6; flange); right, the four paths p1-p4 traced in $x$-$y$-$z$ space]

SLIDE 26

Data

Training data

◮ 1 reference trajectory common to the handling of all loads
◮ 14 unique training trajectories, one for each context (load)
◮ 1 trajectory has no data for any context; thus it is always novel

Test data

◮ Interpolation data sets: testing on the reference trajectory and the unique trajectory for each load
◮ Extrapolation data sets: testing on all trajectories

SLIDE 27

Methods

sGP (single-task GPs): GPs trained separately for each load
iGP (independent GPs): GPs trained independently for each load, but tying parameters across loads
pGP (pooled GP): one single GP trained by pooling data across loads
mGP (multi-task GP with BIC): sharing latent functions across loads, selecting the similarity matrix using BIC

◮ For mGP, the rank of $K^f$ is determined using the BIC criterion

SLIDE 28

Results

[Figure: average nMSEs against training-set size $n \in \{280, 532, 896, 1820\}$ for joint $j = 4$. Left panel: interpolation (scale $\times 10^{-4}$); right panel: extrapolation (scale $\times 10^{-2}$). Methods compared: mGP-BIC, iGP, sGP, pGP]
SLIDE 29

Conclusions and Discussion

◮ GP formulation of MTL with the factorization into $k^x(x, x')$ and $K^f$, encoding task similarity
◮ This model fits exactly for multi-context inverse dynamics
◮ Results show that MTL can be effective
◮ This is one model for MTL, but what about others, e.g. covariance functions that don't factorize?
