CS 330
Multi-Task Learning & Transfer Learning Basics
Logistics

- Homework 1 posted Monday 9/21, due Wednesday 9/30 at midnight.
- TensorFlow review session tomorrow at 6:00 pm PT.
- Project guidelines posted early next week.

Plan for Today
1. Multi-task learning: problem statement
2. Multi-task learning: models, objectives, optimization
3. Challenges, and a case study of real-world multi-task learning
4. Transfer learning basics
Some notation (more formally this time)

Single-task supervised learning: min_θ ℒ(θ, 𝒟) for a model f_θ trained on a dataset 𝒟 of (x, y) pairs.

A task: 𝒯_i ≜ {p_i(x), p_i(y|x), ℒ_i}, i.e. the data generating distributions p_i(x) and p_i(y|x) together with a loss function ℒ_i. Corresponding datasets: 𝒟_i^tr and 𝒟_i^test; we will use 𝒟_i as shorthand for 𝒟_i^tr.

[Figure: example tasks, e.g. image classification with labels such as "tiger", "cat", "lynx", and regression targets such as the length of a paper.]
Two common special cases:

Multi-task classification: ℒ_i is the same across all tasks, e.g. per-language handwriting recognition, or a personalized spam filter.

Multi-label learning: ℒ_i and p_i(x) are the same across all tasks, e.g. CelebA attribute recognition, or scene understanding.

When might ℒ_i vary across tasks?
Task descriptor z_i: e.g. a one-hot encoding of the task index. The model becomes f_θ(y | x, z_i). [Example: given a paper x, z_i selects whether to predict the length of the paper, a paper review, or a summary of the paper.]

The multi-task objective: min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i)

Decisions on the model, the objective, and the optimization:
- Model: how should we condition on z_i?
- Objective: what objective should we use?
- Optimization: how should we optimize our objective?

Question: how might you condition on z_i in a neural network? (raise your hand)
One extreme: share as little as possible. With a one-hot z_i, give each task its own subnetwork with output y_j and gate between them: y = Σ_j 1(z_i = j) y_j. This multiplicative gating amounts to independent training within a single network, with no shared parameters.
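A minimal sketch of this extreme (module and dimension names are illustrative, not from the slides): each task gets its own subnetwork, and the one-hot descriptor gates among their outputs.

```python
import torch
import torch.nn as nn

class GatedIndependentNets(nn.Module):
    """One subnetwork per task; a one-hot task descriptor z selects which
    subnetwork's output is used: y = sum_j 1(z_i = j) * y_j."""

    def __init__(self, num_tasks: int, in_dim: int, out_dim: int):
        super().__init__()
        self.subnets = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
             for _ in range(num_tasks)]
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim); z: (batch, num_tasks), one-hot and float
        ys = torch.stack([net(x) for net in self.subnets], dim=1)  # (batch, T, out_dim)
        return (z.unsqueeze(-1) * ys).sum(dim=1)  # multiplicative gating over tasks
```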
The other extreme: concatenate z_i with the input and/or activations. Then all parameters are shared (except the parameters directly following z_i, if z_i is one-hot).
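The fully shared extreme, as a matching sketch (again with illustrative names): the task identity enters only through the concatenated descriptor.

```python
import torch
import torch.nn as nn

class ConcatConditioned(nn.Module):
    """All parameters are shared across tasks; only the first layer sees the
    task descriptor, via concatenation with the input."""

    def __init__(self, num_tasks: int, in_dim: int, out_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + num_tasks, 64), nn.ReLU(), nn.Linear(64, out_dim)
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_dim); z: (batch, num_tasks) one-hot task descriptor
        return self.net(torch.cat([x, z], dim=-1))
```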
An alternative view: split θ into shared parameters θ_sh and task-specific parameters θ_1, …, θ_T, so the objective becomes min_{θ_sh, θ_1, …, θ_T} Σ_{i=1}^T ℒ_i({θ_sh, θ_i}, 𝒟_i). Choosing how to condition on z_i is equivalent to choosing how and where to share parameters.
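A shared-trunk, per-task-head sketch of this split (hypothetical layer sizes): the trunk plays the role of θ_sh, the heads of θ_1, …, θ_T.

```python
import torch
import torch.nn as nn

class MultiHead(nn.Module):
    """Shared trunk (theta_sh) followed by one output head per task (theta_i)."""

    def __init__(self, num_tasks: int, in_dim: int, out_dim: int):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU())  # theta_sh
        self.heads = nn.ModuleList(
            [nn.Linear(64, out_dim) for _ in range(num_tasks)]  # theta_1, ..., theta_T
        )

    def forward(self, x: torch.Tensor, task_index: int) -> torch.Tensor:
        return self.heads[task_index](self.trunk(x))
```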
Common conditioning choices: (1) concatenation-based conditioning and (2) additive conditioning. These two are actually equivalent. Can you see why? (raise your hand)
Diagram sources: distill.pub/2018/feature-wise-transformations/
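A quick numerical check of the equivalence (dimensions are arbitrary): a linear layer applied to the concatenation [x; z] decomposes as W_x x + W_z z, so concatenation followed by a linear layer is additive conditioning; with a one-hot z, the W_z z term is just a per-task bias.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x, z = torch.randn(8, 16), torch.randn(8, 4)
layer = nn.Linear(16 + 4, 32, bias=False)
W_x, W_z = layer.weight[:, :16], layer.weight[:, 16:]

concat_out = layer(torch.cat([x, z], dim=-1))   # concatenation-based conditioning
additive_out = x @ W_x.T + z @ W_z.T            # additive conditioning
assert torch.allclose(concat_out, additive_out, atol=1e-5)
```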
More conditioning choices: (3) multi-head architectures (Ruder '17) and (4) multiplicative conditioning, which is more expressive per layer and generalizes the gated, independent-subnetwork case above.
Diagram sources: distill.pub/2018/feature-wise-transformations/
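A sketch of multiplicative, feature-wise conditioning in the spirit of the distill.pub diagrams (illustrative names): a task-dependent vector rescales the hidden features.

```python
import torch
import torch.nn as nn

class MultiplicativeConditioning(nn.Module):
    """Hidden features are scaled elementwise by a task-dependent vector."""

    def __init__(self, num_tasks: int, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.gate = nn.Linear(num_tasks, hidden)  # maps z to per-feature scales
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc1(x))
        return self.fc2(h * self.gate(z))  # task-dependent multiplicative gating
```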
More complex conditioning choices:
- Cross-Stitch Networks (Misra, Shrivastava, Gupta, Hebert '16)
- Multi-Task Attention Network (Liu, Johns, Davison '18)
- Deep Relation Networks (Long, Wang '15)
- Sluice Networks (Ruder, Bingel, Augenstein, Sogaard '17)
What objective should we use? The vanilla multi-task objective is min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i); a common variant weights the tasks, min_θ Σ_{i=1}^T w_i ℒ_i(θ, 𝒟_i). How to choose the weights w_i? Various heuristics exist: encourage gradients to have similar magnitudes across tasks (Chen et al., GradNorm, ICML 2018), weight tasks by their uncertainty (e.g. see Kendall et al., CVPR 2018), or treat multi-task learning as multi-objective optimization (e.g. see Sener et al., NeurIPS 2018).
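As one concrete instance, a simplified sketch of uncertainty-based weighting in the spirit of Kendall et al. (the paper's per-task scaling differs by constant factors, and the names here are my own). The log-variances s_i are learned jointly with θ, so noisy tasks get down-weighted automatically.

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + s_i,
    where s_i = log sigma_i^2 is a learned per-task log-variance."""

    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        # task_losses: shape (num_tasks,), e.g. torch.stack([loss_1, loss_2, ...])
        return (torch.exp(-self.log_vars) * task_losses + self.log_vars).sum()
```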
The multi-objective view: θ_a dominates θ_b if ℒ_i(θ_a) ≤ ℒ_i(θ_b) ∀i and if Σ_i ℒ_i(θ_a) ≠ Σ_i ℒ_i(θ_b). A point θ⋆ is Pareto optimal if there exists no θ that dominates θ⋆. (At θ⋆, improving one task will always require worsening another.)

Another alternative objective: minimize the worst-case task loss, min_θ max_i ℒ_i(θ, 𝒟_i) (e.g. for task robustness, or for fairness).
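To make the dominance definition concrete, a small helper (plain NumPy; hypothetical usage) that checks dominance and Pareto optimality over a set of recorded per-task loss vectors:

```python
import numpy as np

def dominates(losses_a: np.ndarray, losses_b: np.ndarray) -> bool:
    """theta_a dominates theta_b: no task is worse under theta_a,
    and the total losses are not equal."""
    return bool(np.all(losses_a <= losses_b) and losses_a.sum() != losses_b.sum())

def is_pareto_optimal(idx: int, all_losses: np.ndarray) -> bool:
    """theta_idx is Pareto optimal among the candidates if none dominates it."""
    return not any(
        dominates(all_losses[j], all_losses[idx])
        for j in range(len(all_losses)) if j != idx
    )
```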
How should we optimize min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i)? A basic recipe:
1. Sample a mini-batch of tasks ℬ ∼ 𝒰{𝒯_1, …, 𝒯_T}
2. Sample a mini-batch of datapoints 𝒟_i^b ∼ 𝒟_i for each task 𝒯_i ∈ ℬ
3. Compute the loss on the mini-batch: ℒ̂(θ, ℬ) = Σ_{𝒯_k ∈ ℬ} ℒ_k(θ, 𝒟_k^b)
4. Backpropagate the loss to compute the gradient ∇_θ ℒ̂
5. Apply the gradient with your favorite neural-net optimizer (e.g. Adam)
Note: sampling tasks uniformly ensures they are weighted equally regardless of how much data each task has.
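A sketch of one such training step, assuming a hypothetical task object that can sample its own mini-batch and compute its loss on a model:

```python
import random

def train_step(model, optimizer, tasks, tasks_per_batch=4):
    """One step of the basic recipe: sample tasks uniformly, sample a data
    mini-batch per task, sum the per-task losses, take a gradient step."""
    batch = random.sample(tasks, tasks_per_batch)  # step 1: B ~ U{T_1, ..., T_T}
    loss = sum(task.loss(model, task.sample_minibatch()) for task in batch)  # steps 2-3
    optimizer.zero_grad()
    loss.backward()                                # step 4
    optimizer.step()                               # step 5 (e.g. Adam)
    return loss.item()
```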
Challenge: negative transfer, i.e. sometimes independent networks work best (see Yu et al., Gradient Surgery for Multi-Task Learning, 2020). If you observe negative transfer, share less across tasks. This is a spectrum rather than a binary choice: from multi-head architectures, to cross-stitch architectures, to fully independent training.
You can also share parameters softly: train per-task networks x → y_1, …, x → y_T, but penalize their parameters for drifting apart:
min_{θ_sh, θ_1, …, θ_T} Σ_{i=1}^T ℒ_i({θ_sh, θ_i}, 𝒟_i) + λ Σ_{t=1}^T Σ_{t′=1}^T ‖θ_t - θ_{t′}‖
This "soft parameter sharing" allows more fluid degrees of sharing, at the cost of another set of design decisions and hyperparameters (e.g. λ).
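A sketch of the regularizer, assuming the per-task networks share an architecture so their parameters can be zipped pairwise (λ and the choice of norm are design decisions):

```python
import torch

def soft_sharing_penalty(task_models, lam: float = 1e-3) -> torch.Tensor:
    """Penalize corresponding parameters of the per-task networks for
    drifting apart (soft parameter sharing)."""
    models = list(task_models)
    penalty = torch.zeros(())
    for i, m_a in enumerate(models):
        for m_b in models[i + 1:]:
            for p_a, p_b in zip(m_a.parameters(), m_b.parameters()):
                penalty = penalty + (p_a - p_b).norm()
    return lam * penalty
```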
In the case study's mixture-of-experts model, the authors found a 20% chance of gating polarization during distributed training; the fix is to use dropout on the experts.
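A hedged sketch of the fix (illustrative module, not the paper's implementation): dropout applied to the gate's softmax weights randomly silences experts during training, discouraging the gate from collapsing onto a few of them.

```python
import torch
import torch.nn as nn

class GatedExperts(nn.Module):
    """Softmax gate over experts, with dropout on the gate weights."""

    def __init__(self, num_experts: int, in_dim: int, out_dim: int, p_drop: float = 0.1):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
        )
        self.gate = nn.Linear(in_dim, num_experts)
        self.drop = nn.Dropout(p_drop)  # zeroes individual gate weights in training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = self.drop(torch.softmax(self.gate(x), dim=-1))  # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)   # (batch, E, out)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)
```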
Transfer learning, in contrast to multi-task learning. Multi-task learning: solve multiple tasks 𝒯_1, …, 𝒯_T at once, min_θ Σ_{i=1}^T ℒ_i(θ, 𝒟_i). Transfer learning: solve a target task 𝒯_b after solving a source task 𝒯_a, by transferring knowledge learned from 𝒯_a. Side note: 𝒯_a may include multiple tasks itself.
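The standard mechanism for this transfer is fine-tuning: initialize from parameters trained on 𝒯_a, then take gradient steps on the target task's data. A minimal sketch with an ImageNet-pretrained backbone (assumes torchvision >= 0.13; the class count and learning rate are placeholders):

```python
import torch
import torchvision

# Initialize from source-task (ImageNet) parameters, swap in a new head,
# and fine-tune everything on the target task.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Linear(model.fc.in_features, 10)  # new head for 10 target classes

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # smaller lr than pre-training
loss_fn = torch.nn.CrossEntropyLoss()

def finetune_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```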
What makes ImageNet good for transfer learning? (Huh, Agrawal, Efros '16)