Neural Network based NLP: Its Progresses and Challenges
- Dr. Ming Zhou
Microsoft Research Asia
CLSW 2020, City University of Hong Kong, May 30, 2020
Figure: NLP research areas, including NMT, MRC, Conversational System, Text Generation, Multi-Modality, and Question Answering.
This figure was credited to AMiner, Tsinghua University, 2019
Word embedding (Mikolov et al., 2013); Sentence Embedding; Encoder with attention (Bahdanau et al., 2014); Transformer (Vaswani et al., 2017)
RNN figure: recurrent hidden units over the input features, $h_0 = f(x_0)$, $h_1 = f(x_1, h_0)$, $h_2 = f(x_2, h_1)$, $y = f(x_3, h_2)$; activation function $y = f(x)$.
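Below is a minimal NumPy sketch of this recurrence (not from the slides); the tanh activation, the weight matrices, and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, h0):
    """Unroll a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1}), y = W_hy h_T."""
    h = h0
    for x in xs:                      # one step per input vector
        h = np.tanh(W_xh @ x + W_hh @ h)
    return W_hy @ h                   # output computed from the final hidden state

# toy dimensions: 4-dim inputs, 8-dim hidden state, 3-dim output
rng = np.random.default_rng(0)
xs = [rng.normal(size=4) for _ in range(3)]
y = rnn_forward(xs, rng.normal(size=(8, 4)), rng.normal(size=(8, 8)),
                rng.normal(size=(3, 8)), np.zeros(8))
print(y)
```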
Mikolov et al. Efficient Estimation of Word Representations in Vector Space. ICLR Workshop, 2013.
CBOW (Continuous Bag-of-Words): using the context words $x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}$ in a window (summed) to predict the central word $x_t$.
Skip-gram (Continuous Skip-gram): using the central word to predict the context words in a window.
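As a concrete illustration (not part of the talk), both objectives can be trained with gensim's Word2Vec; the toy corpus and hyper-parameters below are assumptions.

```python
from gensim.models import Word2Vec

# toy corpus: each sentence is a list of tokens
corpus = [["economic", "growth", "has", "slowed", "down", "in", "recent", "years"],
          ["the", "company", "reported", "strong", "economic", "results"]]

# sg=0 -> CBOW (context predicts the central word); sg=1 -> skip-gram (central word predicts context)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv["economic"][:5])                      # first dimensions of a 50-dim word vector
print(skipgram.wv.most_similar("economic", topn=3)) # nearest neighbours in the toy space
```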
Word embedding visualization: clusters such as electronic products, leaders of China, leaders of companies, comparative adjectives, and psychological reaction words.
NMT example: source $X$ = (Economic, growth, has, slowed, down, in, recent, years, .), target $Y$ = (θΏ, ε εΉ΄, ,, η»ζ΅, εε±, ε, ζ’, δΊ, .). The encoder reads "economic growth has slowed down in recent years ." and the decoder generates "θΏ ε εΉ΄ , η»ζ΅ εε± ε ζ’ δΊ . </S>".
Encoder-Decoder with attention, same example: when generating each target word, the decoder attends to the source hidden states with attention weights (e.g. 0.2, 0.9, 0.1, 0.5, 0.7, 0.0, 0.2). Source positions: economic/0 growth/1 has/2 slowed/3 down/4 in/5 recent/6 years/7; target positions: θΏ/0 ε εΉ΄/1 ,/2 η»ζ΅/3 εε±/4.
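A small NumPy sketch of how such attention weights can be computed for one decoder step; the dot-product scoring is a simplifying assumption rather than the exact Bahdanau formulation.

```python
import numpy as np

def attention(decoder_state, source_hidden):
    """Dot-product attention: weights over source positions, then a context vector."""
    scores = source_hidden @ decoder_state            # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                          # softmax over source positions
    context = weights @ source_hidden                 # weighted sum of source hidden states
    return weights, context

# 8 source words ("economic ... years ."), hidden size 16
rng = np.random.default_rng(0)
source_hidden = rng.normal(size=(8, 16))
weights, context = attention(rng.normal(size=16), source_hidden)
print(weights.round(2))   # which source words the current target word attends to
```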
Transformer architecture: encoder blocks stack Self-Attention and FFN sub-layers, each wrapped in a residual link; decoder blocks stack Self-Attention, Attention to the Source Hidden States, and FFN sub-layers, each with its own residual link.
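A simplified PyTorch sketch of one encoder block as drawn above (self-attention and FFN, each wrapped in a residual link); the post-norm layout and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)      # self-attention over the sequence
        x = self.norm1(x + attn_out)          # residual link + layer norm
        x = self.norm2(x + self.ffn(x))       # FFN with its own residual link
        return x

block = EncoderBlock()
out = block(torch.randn(2, 10, 512))          # (batch, sequence length, d_model)
```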
Pre-trained Model
Self-supervised learning on a large-scale corpus produces the pre-trained model, which is then fine-tuned separately for Task 1, Task 2, ..., Task n, yielding a model for each task.
Pre-training stage: learn task-agnostic general knowledge from a large-scale corpus by self-supervised learning. Fine-tuning stage: transfer the learnt knowledge to downstream tasks by discriminative training.
A simplified example of self-attention in Transformer
Self-supervised learning is a form of unsupervised learning where the data itself provides the supervision. (a) Word-level example, "LM is a typical task in natural language processing": an autoregressive (AR) LM predicts the next word (e.g. "processing"), while an auto-encoding (AE) model predicts a masked word (e.g. "natural") from its two-sided context. (b) Sentence-level.
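A tiny illustrative sketch (not from the talk) contrasting the two objectives on the example sentence: the AR pair predicts the next token from the prefix, while the AE pair predicts the token hidden by [MASK].

```python
tokens = "LM is a typical task in natural language processing".split()

# Autoregressive (AR): input is the prefix, target is the next token
ar_pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
print(ar_pairs[-1])   # (['LM', ..., 'language'], 'processing')

# Auto-encoding (AE): mask a token, target is the masked token given both sides
masked_pos = tokens.index("natural")
ae_input = tokens[:masked_pos] + ["[MASK]"] + tokens[masked_pos + 1:]
print(ae_input, "->", tokens[masked_pos])
```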
Masked LM example: the input "An apple is a sweet, edible [MASK] produced by an apple tree." is fed to the pre-trained model (e.g. a multilayer Transformer), which produces contextualized representations; the prediction over the vocabulary (e.g. "fruit" vs. "company") is compared with the ground truth "fruit" to compute the loss. Unsupervised (self-supervised) learning.
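This masked-word prediction can be reproduced with a public checkpoint; a hedged sketch using the Hugging Face transformers fill-mask pipeline, where the library and the bert-base-uncased model are assumptions, not tools used in the talk.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("An apple is a sweet, edible [MASK] produced by an apple tree.")
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))   # "fruit" should rank near the top
```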
BERT-based Sentence Pair Matching
Given the final hidden vector $C \in \mathbb{R}^{H}$ of the first input token ([CLS]), fine-tune BERT with a standard classification loss over $C$ and $W$: $\log(\mathrm{softmax}(C W^{T}))$, where $W \in \mathbb{R}^{K \times H}$ is a classification layer and $K$ is the number of labels.
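A minimal sketch of such [CLS]-based sentence-pair fine-tuning with the Hugging Face transformers library; the checkpoint name, the example pair, the two labels, and the single training step are illustrative assumptions.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# encode a sentence pair; BERT prepends [CLS] and joins the sentences with [SEP]
batch = tokenizer("Economic growth has slowed down.", "The economy grew quickly.",
                  return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([0]))   # classification loss over the [CLS] vector
outputs.loss.backward()                              # one fine-tuning step (optimizer omitted)
print(outputs.logits)
```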
Timeline of pre-trained models: Word2Vec (2013), CoVe (2017), ULMFiT (2017), ELMo (Peters et al., 2018), GPT (2018), BERT (2018), ..., all feeding into downstream NLP tasks.
Machine Translation, Search Engine, Semantic Parsing, Question Answering, Chatbot & Dialogue, Paraphrase Classification, Text Entailment, Sentiment Analysis
β¦
XLM (U), MASS (G), MT-DNN (U), UniLM (U, G), Unicoder (U), BART (G), ProphetNet (G), mBART (G)
GREEN: monolingual pre-trained models; BLUE: multilingual pre-trained models; U: for understanding tasks; G: for generation tasks
https://arxiv.org/abs/2003.08271
Model Name | Model Usage | Model Backbone | Model Contribution
GPT (OpenAI) | Understanding & Generation | Transformer Encoder | 1st unidirectional pre-trained LM based on Transformer
BERT (Google) | Understanding | Transformer Encoder | 1st bidirectional pre-trained LM based on Transformer
MT-DNN (MS) | Understanding | Transformer Encoder | use multiple understanding tasks in pre-training
MASS (MS) | Generation | Separate Transformer Encoder-Decoder | use masked span prediction for generation tasks
UniLM (MS) | Understanding & Generation | Unified Transformer Encoder-Decoder | unify understanding and generation tasks in pre-training with different attention masks
RoBERTa (FB) | Understanding | Transformer Encoder | use better pre-training tricks, such as dynamic masking, large batches, removing NSP, data sampling
ERNIE (Baidu) | Understanding | Transformer Encoder | prove noun-phrase masking and entity masking are better than word masking
SpanBERT (FB) | Understanding | Transformer Encoder | prove random span masking is better than others
XLNet (Google) | Understanding | Transformer Encoder | unify autoregressive LM and auto-encoding tasks in pre-training with the two-stream self-attention
T5 (Google) | Generation | Separate Transformer Encoder-Decoder | use a separate encoder-decoder for understanding and generation tasks and prove it is the best choice; compare different hyper-parameters and show the best settings
BART (FB) | Generation | Separate Transformer Encoder-Decoder | try different text noising methods for generation tasks
ELECTRA (Google) | Understanding | Transformer Generator-Discriminator | use a simple but effective GAN-style pre-training task
ProphetNet (MS) | Generation | Separate Transformer Encoder-Decoder | use future n-gram prediction for generation tasks with the n-stream self-attention
Pre-training: Large-scale Corpus → Pre-trained Model; Fine-tuning: Task-specific Datasets → Models for Downstream Tasks
GREEN: efforts for performance BLUE: efforts for practical usage
https://gluebenchmark.com/
CoLA: The Corpus of Linguistic Acceptability
SST-2: The Stanford Sentiment Treebank
MRPC: The Microsoft Research Paraphrase Corpus
STS-B: The Semantic Textual Similarity Benchmark
QQP: The Quora Question Pairs
MNLI: The Multi-Genre Natural Language Inference Corpus
QNLI: The Stanford Question Answering Dataset
RTE: The Recognizing Textual Entailment
WNLI: The Winograd Schema Challenge
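Any of these tasks can be loaded for fine-tuning with the Hugging Face datasets library; a small sketch, with MRPC chosen only as an example.

```python
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")           # sentence-pair paraphrase classification
print(mrpc["train"][0])                       # {'sentence1': ..., 'sentence2': ..., 'label': ...}
print(mrpc["train"].features["label"].names)  # ['not_equivalent', 'equivalent']
```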
Cross-lingual pre-training (Devlin et al., 2018; Lample and Conneau, 2019; Huang et al., 2019; ...): monolingual data for Language A, Language B, Language C, ..., Language X, plus bilingual data (A↔B, A↔C, ...).
Cross-lingual fine-tuning: labeled data of a given task in Language A is used for task-specific fine-tuning, and the resulting model is applied to the same task in Language B, C, ...; where task labeled data in those languages is available, it can be used for further task-specific fine-tuning.
Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Ming Zhou. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks. EMNLP, 2019.
(12 layers, shared 256K vocabulary size, 100 languages) this could be a sentence in any language .
Text Noising (Denoising) Methods:
Sentence Permutation: could this be sentence a in . any language
Token Deletion: this be a in any language
Token Masking: [MASK] could be a [MASK] in any [MASK] .
Text Infilling: this could be [MASK] in [MASK] .
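A simplified Python sketch of the four noising methods applied to a token list; the masking probabilities and span length are assumptions, not the exact settings used by these models.

```python
import random

def sentence_permutation(tokens):
    shuffled = tokens[:]
    random.shuffle(shuffled)
    return shuffled

def token_deletion(tokens, p=0.3):
    return [t for t in tokens if random.random() > p]

def token_masking(tokens, p=0.3):
    return [("[MASK]" if random.random() < p else t) for t in tokens]

def text_infilling(tokens, span=2):
    # replace one span of `span` consecutive tokens with a single [MASK]
    start = random.randrange(len(tokens) - span)
    return tokens[:start] + ["[MASK]"] + tokens[start + span:]

tokens = "this could be a sentence in any language .".split()
for noise in (sentence_permutation, token_deletion, token_masking, text_infilling):
    print(noise.__name__, ":", " ".join(noise(tokens)))
```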
Yaobo Liang, Nan Duan, Yeyun Gong, et al. XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation.
NER: Named Entity Recognition
POS: Part-of-Speech Tagging
NC: News Classification
MLQA: Multilingual MRC
XNLI: Natural Language Inference
PAWS-X: Paraphrase Classification
QADSM: Query-Ads Matching
WPR: Web Page Ranking
QAM: Question-Answer Matching
QG: Question Generation
NTG: News Title Generation
https://arxiv.org/abs/2004.01401 (leaderboard to be released soon). Understanding tasks; Generation tasks.
English translation: where is the largest sugar factory in the world
English translation: The sugar refinery of the Algerian group Cevital produces 2.7 million tonnes of sugar a year, making it the largest refinery in the world. This refinery doubled its exports of white sugar from 377,000 tonnes to 600,000 tonnes in 2012.
English translation: delete browser history windows 10
English translation: In the tab "General" you will find the sub-item "Browser History". Click on the "Delete ..." button there. A window with the name "delete browser history" will open. Check the data you want to delete. Click on "Delete". The history has now been removed.
Bing fr-FR; Bing de-DE
Establishing Strong Baselines for the New Decade: Sequence Tagging, Syntactic and Semantic Parsing with BERT
XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation
The model fine-tuned on labeled data in one language is applied to all test sets in different languages, showing cross-lingual transfer capability on the multilingual NER task.
Deep learning models can find hidden syntactic tree structures of natural language sentences in an unsupervised way.
Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, Andrew McCallum. Linguistically-Informed Self-Attention for Semantic Role Labeling. EMNLP, 2018.
Deep learning models can predict better syntactic tree structures
Danqi Chen and Christopher Manning. A Fast and Accurate Dependency Parser Using Neural Networks. EMNLP, 2014.
NN helps Linguistics; Linguistics helps NN
Linguistic information can improve NLP tasks as input signals.
Yikang Shen, Shawn Tan, Alessandro Sordoni, Aaron Courville. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. ICLR, 2019.
Linguistic information can improve NLP tasks by designing syntax-aware neural network structures.
Huadong Chen, Shujian Huang, David Chiang, Jiajun Chen. Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder. ACL, 2017.
There remain challenges in modelling, reasoning and interpretability, and new methods are needed to solve these challenges.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR, 2017.
Thomas Manzini, Yao Chong Lim, Yulia Tsvetkov, Alan W Black. Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings. NAACL, 2019.
Model | Parameters | BLEU | Δ BLEU
Transformer | 210.4M | 28.8 ± 0.2 |
NAS (Evolved Transformer) | 221.7M | 29.0 ± 0.1 | +0.2
David R. So, Chen Liang, Quoc V. Le. The Evolved Transformer. ICML, 2019.
https://medium.com/syncedreview/tracking-the-transforming-ai-chip-market-bac117359459
Emma Strubell, Ananya Ganesh and Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP. ACL, 2019.
Approaches to low-resource tasks:
Prior knowledge and human role: cold-start with seeds such as rules and dictionaries, active learning, reinforcement learning.
Cross-language learning: learn mappings and relationships among languages for cross-lingual NLP tasks.
Unsupervised learning: discover knowledge from unannotated data based on distribution and patterns.
Transfer learning: transfer knowledge learnt from rich-resource tasks to low-resource tasks, such as BERT and ResNet.
Fact: ACL 2019 is held in Florence.
Q-1: Where is ACL 2019 held? Florence
Q-2: Is ACL 2019 held in France? No, because...
Q-3: Can I attend this conference without an accepted paper? Yes, if...
Q-4: Why is ACL 2019 held in Florence? Because...
Common sense and reasoning are required.
Challenges of multi-turn tasks:
Context modeling: represent, memorize and forget context information in reasoning.
Knowledge and common sense: extract, represent, conflate and use different types of knowledge and common sense.
Inference mechanism: annotate, model and evaluate the inference procedure.
Explainability: mechanism, debugging, evaluation, visualization.
Rich-resource tasks, low-resource tasks, multi-turn tasks; prior knowledge and human role; knowledge and common sense.
Heterogeneous contents including texts, images, videos, and audio
Machine translation
Better multi-turn and reasoning capabilities
News, reports, poetry and music
Enterprises and cities, e-commerce, health, etc.
Scenarios with human in the loop
Transformer
Knowledge, common sense, reasoning mechanisms
Model architectures, for more modalities, with smaller model sizes and faster training
NLP, CL and linguistic study leverage each other