ACL 2020
Multi-source Meta Transfer for Low Resource MCQA
Ming Yan1, Hao Zhang1,2, Di Jin3, Joey Tianyi Zhou1
1 IHPC, A*STAR, Singapore  2 CSCE, NTU, Singapore  3 CSAIL, MIT, USA

Background: low-resource MCQA, with dataset sizes under 100K.
[Figure: QA dataset sizes (K): SearchQA 140, NewsQA 120, SWAG 113.5, HotpotQA 113, SQuAD 108, RACE 97.6, SemEval 13.9, DREAM 6.1, MCTest 2.6. Task types span extractive/abstractive QA, multi-hop QA, and MCQA; domains include story, dialogue, narrative text, exam, scenario text, Wikipedia, Wikipedia snippets, and newswire. MCQA corpora come from different domains, and the low-resource ones have dataset sizes under 100K.]
§ Low-resource setting
§ Domain discrepancy
Existing approaches: transfer learning and multi-task learning; fine-tuning on the target domain.
[Finn et al. 17] Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
MAML learns an initialization θ that can be adapted to a new task in a few gradient steps, alternating two updates (FF: feedforward, BP: backpropagation in the slide figure).

Fast adaptation (learner, on the support set y_s ~ T):
θ' := θ − β ∇_θ L_support(f_θ)

Meta update (meta model, on the query set y_q ~ T):
θ := θ − γ ∇_θ L_query(f_θ')
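The two-level MAML update (fast adaptation on a support set, meta update on a query set) can be sketched as a runnable toy example. This is a minimal first-order approximation (FOMAML-style) with a scalar linear model; the function names and the toy tasks are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def loss_and_grad(theta, x, y):
    """Squared-error loss and gradient for a toy linear model f(x) = theta * x."""
    pred = theta * x
    loss = np.mean((pred - y) ** 2)
    grad = np.mean(2 * (pred - y) * x)
    return loss, grad

def maml_step(theta, tasks, beta=0.05, gamma=0.05):
    """One meta-update: inner fast adaptation per task, outer update on query sets."""
    meta_grad = 0.0
    for (x_s, y_s), (x_q, y_q) in tasks:
        # Inner loop: fast adaptation on the support set.
        _, g_s = loss_and_grad(theta, x_s, y_s)
        theta_prime = theta - beta * g_s
        # Outer loop: first-order approximation, query-set gradient at theta'.
        _, g_q = loss_and_grad(theta_prime, x_q, y_q)
        meta_grad += g_q
    # Meta update on the accumulated query-set gradients.
    return theta - gamma * meta_grad
```

Full MAML differentiates through the inner update (a second-order gradient); the first-order variant above drops that term, which is a common simplification.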
[Figure: comparison of transfer learning and multi-task learning, each moving knowledge from Source 1, Source 2, and Source 3 to the Target.]
[Figure: meta-learning across source domains (dialogue, exam, story, narrative text) and a target domain. Starting from a pretrained model, task gradients ∇L₁, ∇L₂, ∇L₃ drive fast adaptation to per-task parameters θ₁, θ₂, θ₃; the meta objective learns a model that can generalize. In standard meta-learning, support and query tasks come from the same domain, whereas these sources differ in format (4, 3, 4, and 2 answer choices).]
§ Learn knowledge from multiple sources.
§ Reduce the discrepancy between sources and target.
Multi-source Meta Transfer (MMT)
[Figure: meta-learning over tasks sampled from source 1 (dialogue, 3 choices), source 2 (exam, 4 choices), and source 3 (story, 4 choices); the meta model learned across sources is transferred to the target (scenario text, 4 choices) to give the MMT model.]
Supervised MMT
[Figure: tasks from sources 1–4 are mapped from the input space to a shared representation space; MML moves the MMT representation toward the target, then MTL transfers it to the target.]
Multi-source Meta Learning (MML): learn knowledge from multiple sources, and learn a representation close to the target.
Multi-source Transfer Learning (MTL): fine-tune the meta model on the target.
[Figure: toy M, Q, and A feature matrices for source 1, source 2, source 3, and the target, illustrating the inputs embedded by the model.]
Algorithm 1: The procedure of MMT
Input: task distribution over sources p_s(T), data distribution over the target p_t(D), backbone model f_θ, learning rates in MMT β, γ, η
Output: optimized parameters θ

Initialize θ
while not done do
    for all sources s do
        Sample a batch of tasks T_i ~ p_s(T)
        for all T_i do
            Evaluate ∇_θ L_{T_i}(f_θ) with respect to k examples
            Compute parameters for fast adaptation: θ' := θ − β ∇_θ L_{T_i}(f_θ)
        end
        Meta model update: θ := θ − γ ∇_θ Σ_{T_i ~ p_s(T)} L_{T_i}(f_θ')
        Get a batch of data D_j ~ p_t(D)
        for all D_j do
            Evaluate ∇_θ L_{D_j}(f_θ) with respect to k examples
            Gradient step for target fine-tuning: θ := θ − γ ∇_θ L_{D_j}(f_θ)
        end
    end
end
Get all batches of data D_j ~ p_t(D)
for all D_j do
    Evaluate with respect to the batch size
    Gradient step for meta transfer learning: θ := θ − γ ∇_θ L_{D_j}(f_θ)
end

Meta tasks: support and query tasks are disjoint samples from the same source distribution.
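The loop structure of Algorithm 1 (per-source fast adaptation and meta update, interleaved target fine-tuning, and a final meta transfer stage) can be sketched as runnable code. This is a minimal first-order sketch with a scalar linear model standing in for the backbone; all names and toy data are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def loss_and_grad(theta, x, y):
    """Squared-error loss and gradient for a toy linear model f(x) = theta * x."""
    pred = theta * x
    return np.mean((pred - y) ** 2), np.mean(2 * (pred - y) * x)

def mmt_train(theta, sources, target, beta=0.05, gamma=0.05, n_rounds=50):
    """Sketch of the MMT loop: per-source meta updates with interleaved
    target fine-tuning, followed by a final meta transfer stage."""
    x_t, y_t = target
    for _ in range(n_rounds):
        for tasks in sources:                      # for all sources s
            meta_grad = 0.0
            for (x_s, y_s), (x_q, y_q) in tasks:   # sampled batch of tasks T_i
                # Fast adaptation on the support set (first-order approximation).
                _, g_s = loss_and_grad(theta, x_s, y_s)
                theta_prime = theta - beta * g_s
                # Accumulate the query-set gradient for the meta update.
                _, g_q = loss_and_grad(theta_prime, x_q, y_q)
                meta_grad += g_q
            theta = theta - gamma * meta_grad      # meta model update
            # Target fine-tuning step inside the loop, as in Algorithm 1.
            _, g_t = loss_and_grad(theta, x_t, y_t)
            theta = theta - gamma * g_t
    # Final stage: meta transfer learning on all target data.
    for _ in range(n_rounds):
        _, g_t = loss_and_grad(theta, x_t, y_t)
        theta = theta - gamma * g_t
    return theta
```

The interleaved target steps keep the meta model from drifting toward any one source, and the final stage corresponds to the MTL fine-tuning on the target.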
§ MMT is agnostic to backbone models.
[Figure: MML aggregates sources S1–S4 toward the target; MTL then transfers the meta model to the target.]
§ MTL: transfer the meta model to the target.
§ MML: support and query tasks are sampled from the same distribution; update the learner (θ') on the support task, update the meta model (θ) on the query task, and update the meta model (θ) on target data.
[Tables: performance of supervised MMT on MCTest; performance of unsupervised MMT; MMT ablation study.]
[Figures: transferability matrix (sources × targets), tested on SemEval 2018; t-SNE visualization of BERT features on 100 random samples.]