Neural Network based NLP: Its Progresses and Challenges



SLIDE 1

Neural Network based NLP: Its Progresses and Challenges

  • Dr. Ming Zhou

Microsoft Research Asia, CLSW 2020, City University of Hong Kong, May 30, 2020

SLIDE 2

[Figure: major NLP application areas, including NMT, MRC, Conversational System, Text Generation, Multi-Modality, and Question Answering. Figure credited to AMiner, Tsinghua University, 2019.]

SLIDE 3
  • Neural NLP (NN-NLP)
SLIDE 4

  • Word embedding (Mikolov et al., 2013)
  • Sentence Embedding
  • Encoder-Decoder with attention (Bahdanau et al., 2014)
  • Transformer (Vaswani et al., 2017)

SLIDE 5

[Figure: a basic neural network computed as a chain of activations. Hidden states $h_0 = g(x_0 y)$, $h_1 = g(x_1 h_0)$, $h_2 = g(x_2 h_1)$, and output $z = g(x_3 h_2)$, where $g(\cdot)$ is the activation function ($z = g(y)$).]
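To make the chained computation above concrete, here is a minimal NumPy sketch; the tanh activation and the random weight matrices x0..x3 are illustrative assumptions, not the slide's actual parameters.

```python
import numpy as np

# Toy version of the chained computation on this slide; tanh and the random
# matrices are placeholders for illustration only.
def g(a):
    return np.tanh(a)  # activation function

rng = np.random.default_rng(0)
y = rng.normal(size=4)                                 # input vector
x0, x1, x2, x3 = (rng.normal(size=(4, 4)) for _ in range(4))

h0 = g(x0 @ y)    # h0 = g(x0 y)
h1 = g(x1 @ h0)   # h1 = g(x1 h0)
h2 = g(x2 @ h1)   # h2 = g(x2 h1)
z  = g(x3 @ h2)   # z  = g(x3 h2): network output
```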

SLIDE 6

Mikolov et al. Efficient Estimation of Word Representations in Vector Space. arXiv, 2013.

Word embedding tries to map words from a discrete space into a semantic space, in which semantically similar words have similar embedding vectors. "You shall know a word by the company it keeps."

SLIDE 7

CBOW (Continuous Bag-of-Words): use the context words in a window ($x_{t-2}, x_{t-1}, x_{t+1}, x_{t+2}$, summed) to predict the central word $x_t$.

Skip-gram (Continuous Skip-gram): use the central word $x_t$ to predict the context words in a window.

Mikolov et al. Efficient Estimation of Word Representations in Vector Space. arXiv, 2013.
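For illustration, a minimal sketch of how CBOW and Skip-gram training pairs are constructed from a sentence is given below; the window size and whitespace tokenization are assumptions made for the example, not Word2Vec's exact preprocessing.

```python
# Build CBOW and Skip-gram training pairs from a token sequence.
def training_pairs(tokens, window=2):
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, center))                  # CBOW: context -> central word
        skipgram.extend((center, c) for c in context)   # Skip-gram: central word -> each context word
    return cbow, skipgram

cbow_pairs, sg_pairs = training_pairs("you shall know a word by the company it keeps".split())
```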

SLIDE 8

[Figure: clusters of semantically related words in the embedding space, e.g. electronic products, leaders of China, leaders of companies, comparative adjectives, and psychological reaction words.]

SLIDE 9

Sentence Embedding

SLIDE 10

Encoder-Decoder (Cho et al., 2014)

f = (Economic, growth, has, slowed, down, in, recent, years, .)
g = (θΏ‘, ε‡ εΉ΄, ,, 经桎, 发展, 变, ζ…’, δΊ†, .)  [English gloss: "in recent years, economic growth has slowed down."]

[Figure: the encoder reads the source sentence "economic growth has slowed down in recent years ." into a single fixed-length vector, and the decoder generates the target sentence "θΏ‘ ε‡ εΉ΄ , 经桎 发展 变 ζ…’ δΊ† . </S>" from that vector.]

SLIDE 11

Encoder-Decoder with Attention (Bahdanau et al., 2014)

f = (Economic, growth, has, slowed, down, in, recent, years, .)
g = (θΏ‘, ε‡ εΉ΄, ,, 经桎, 发展, 变, ζ…’, δΊ†, .)

[Figure: at each decoding step the attention module computes a weight for every encoder hidden state over the source "economic growth has slowed down in recent years ." (e.g. 0.2, 0.9, 0.1, 0.5, 0.7, 0.0, 0.2), and the weighted sum of the source hidden states is fed to the decoder together with the previous decoder state while it generates "θΏ‘ ε‡ εΉ΄ , 经桎 发展 变 ζ…’ δΊ† . </S>".]
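A minimal NumPy sketch of one step of this kind of additive (Bahdanau-style) attention is shown below; the dimensions and the random parameters are placeholders, not trained weights.

```python
import numpy as np

# One decoder step of additive attention: score each source position, normalize
# with softmax, and return the weighted sum of encoder hidden states.
def attention_step(enc_states, dec_state, Wa, Ua, va):
    scores = va @ np.tanh(Wa @ dec_state[:, None] + Ua @ enc_states.T)  # one score per source token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # attention weights sum to 1
    context = weights @ enc_states            # weighted sum of source hidden states
    return context, weights

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(9, 8))          # 9 source tokens, hidden size 8
dec_state = rng.normal(size=8)                # previous decoder state
Wa, Ua, va = rng.normal(size=(8, 8)), rng.normal(size=(8, 8)), rng.normal(size=8)
context, weights = attention_step(enc_states, dec_state, Wa, Ua, va)
```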

SLIDE 12

Transformer (Vaswani et al., 2017)

[Figure: the Transformer encoder reads the position-indexed source tokens economic/0 growth/1 has/2 slowed/3 down/4 in/5 recent/6 years/7, with token embeddings added (βŠ•) to positional encodings; the decoder attends to the encoder outputs while generating the target tokens θΏ‘/0 ε‡ εΉ΄/1 ,/2 经桎/3 发展/4 …]

SLIDE 13

[Figure: the Transformer encoder block. Each layer applies self-attention followed by a feed-forward network (FFN), each wrapped in a residual link; the stack produces the source hidden states.]

SLIDE 14

[Figure: the Transformer decoder block. Each layer applies self-attention, attention to the source hidden states, and an FFN, each wrapped in a residual link.]
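The following is a minimal single-head NumPy sketch of one such encoder layer (self-attention and FFN, each with a residual link); layer normalization, multiple heads, and trained weights are omitted, so it illustrates the structure rather than a faithful Transformer implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled dot-product attention
    return softmax(scores) @ V

def encoder_layer(X, Wq, Wk, Wv, W1, W2):
    X = X + self_attention(X, Wq, Wk, Wv)     # residual link around self-attention
    X = X + np.maximum(X @ W1, 0) @ W2        # residual link around the FFN (ReLU; kept at model width)
    return X

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(9, d))                   # 9 source tokens, model dimension 8
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
H = encoder_layer(X, *params)                 # source hidden states
```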

SLIDE 15
  • Pre-trained models
SLIDE 16

Pre-trained Model

[Figure: self-supervised learning on a large-scale corpus yields a pre-trained model, which is then fine-tuned separately for Task 1, Task 2, …, Task N, producing one model per downstream task.]

Pre-training stage: learn task-agnostic general knowledge from a large-scale corpus by self-supervised learning. Fine-tuning stage: transfer the learnt knowledge to downstream tasks by discriminative training (a minimal fine-tuning sketch follows the list below).

  • Texts
  • Text-Image Pairs
  • Text-Video Pairs
  • Autoregressive LM
  • Auto-Encoding
  • Monolingual
  • Multilingual
  • Multimodal
  • Classification
  • Sequence Labeling
  • Structure Prediction
  • Sequence Generation
  • POS/NER/Parsing
  • Question Answering
  • Text Summarization
  • Machine Translation
  • Image Retrieval
  • Video Captioning
  • …
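As a concrete illustration of the fine-tuning stage, here is a minimal sketch using the Hugging Face transformers library; the bert-base-uncased checkpoint, the toy data, and the hyper-parameters are illustrative assumptions, not the setup from the talk.

```python
# Minimal fine-tuning sketch for a sentence-classification task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["a great movie", "a boring movie"]        # toy labeled data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):                                 # a few gradient steps
    out = model(**batch, labels=labels)            # classification loss on the [CLS] representation
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```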
SLIDE 17

  • Embed task-agnostic general knowledge
  • Transfer learnt knowledge to downstream tasks
  • Hold state-of-the-art results on (almost) all NLP tasks

SLIDE 18

A simplified example of self-attention in Transformer

SLIDE 19

Self-supervised learning objectives:

(a) Word-level: for the sentence "LM is a typical task in natural language processing", an autoregressive (AR) LM predicts the next word (e.g. "processing") from its left context, while an auto-encoding (AE) objective recovers a masked word (e.g. "natural") from the surrounding context.

(b) Sentence-level: example sentence "Self-supervised learning is a form of unsupervised learning where the data itself provides the supervision."

SLIDE 20

[Figure: masked language modelling as unsupervised (self-supervised) learning. In "An apple is a sweet, edible fruit produced by an apple tree.", the word "fruit" is replaced by [MASK]. The pre-trained model (e.g. a multilayer Transformer) produces contextualized representations and predicts a distribution over the vocabulary for the masked position (with probabilities for candidates such as "fruit" and "company"); the loss compares the prediction with the ground truth "fruit".]
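As an illustration of this masked-word prediction, here is a minimal sketch with the Hugging Face transformers library; the bert-base-uncased checkpoint is an assumption used only to mirror the slide's example sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "An apple is a sweet, edible [MASK] produced by an apple tree."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                     # vocabulary scores per position

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
token_id = logits[0, mask_pos].argmax().item()
predicted = tokenizer.decode([token_id])                # most likely filler, e.g. "fruit"
```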

SLIDE 21

BERT-based Sentence Pair Matching

Given the final hidden vector $C \in \mathbb{R}^{H}$ of the first input token ([CLS]), fine-tune BERT with a standard classification loss over $C$ and the classification layer $W \in \mathbb{R}^{K \times H}$: $\log(\mathrm{softmax}(C W^{T}))$, where $K$ is the number of labels.
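A minimal NumPy sketch of this [CLS]-based classification head is given below; the hidden size, label count, and random parameters are placeholders for illustration.

```python
import numpy as np

H, K = 768, 2                        # hidden size and number of labels (assumed)
rng = np.random.default_rng(0)
C = rng.normal(size=H)               # final hidden vector of the [CLS] token
W = rng.normal(size=(K, H)) * 0.02   # classification layer

logits = W @ C                       # shape (K,)
# log softmax computed stably: logits - logsumexp(logits)
log_probs = logits - logits.max() - np.log(np.exp(logits - logits.max()).sum())
loss = -log_probs[1]                 # cross-entropy loss for gold label 1
```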

SLIDE 22

Timeline of Pre-trained Models for Natural Language

[Figure: a timeline of pre-trained models, from Word2Vec (2013), CoVe (2017), ULMFiT (2017), ELMo (Peters et al., 2018), GPT (2018) and BERT (2018) to XLM (U), MASS (G), MT-DNN (U), UniLM (U, G), Unicoder (U), BART (G), ProphetNet (G) and mBART (G), shown alongside the NLP tasks they serve: Machine Translation, Search Engine, Semantic Parsing, Question Answering, Chatbot & Dialogue, Paraphrase Classification, Text Entailment, Sentiment Analysis, …]

GREEN: monolingual pre-trained models; BLUE: multilingual pre-trained models; U: for understanding tasks; G: for generation tasks

https://arxiv.org/abs/2003.08271

SLIDE 23

Connections and Differences Between (Monolingual) Pre-trained Models

Model (Org) | Usage | Backbone | Contribution
GPT (OpenAI) | Understanding & Generation | Transformer Encoder | 1st unidirectional pre-trained LM based on Transformer
BERT (Google) | Understanding | Transformer Encoder | 1st bidirectional pre-trained LM based on Transformer
MT-DNN (MS) | Understanding | Transformer Encoder | uses multiple understanding tasks in pre-training
MASS (MS) | Generation | Separate Transformer Encoder-Decoder | uses masked span prediction for generation tasks
UniLM (MS) | Understanding & Generation | Unified Transformer Encoder-Decoder | unifies understanding and generation tasks in pre-training with different attention masks
RoBERTa (FB) | Understanding | Transformer Encoder | uses better pre-training tricks, such as dynamic masking, large batches, removing NSP, data sampling
ERNIE (Baidu) | Understanding | Transformer Encoder | shows noun-phrase masking and entity masking are better than word masking
SpanBERT (FB) | Understanding | Transformer Encoder | shows random span masking is better than other masking schemes
XLNet (Google) | Understanding | Transformer Encoder | unifies autoregressive LM and auto-encoding tasks in pre-training with the two-stream self-attention
T5 (Google) | Generation | Separate Transformer Encoder-Decoder | uses a separate encoder-decoder for understanding and generation tasks and shows it is the best choice; compares different hyper-parameters and reports the best settings
BART (FB) | Generation | Separate Transformer Encoder-Decoder | tries different text noising methods for generation tasks
ELECTRA (Google) | Understanding | Transformer Generator-Discriminator | uses a simple but effective GAN-style pre-training task
ProphetNet (MS) | Generation | Separate Transformer Encoder-Decoder | uses future n-gram prediction for generation tasks with the n-stream self-attention

SLIDE 24

[Figure: pre-training maps a large-scale corpus to a pre-trained model; fine-tuning maps task-specific datasets to models for downstream tasks.]

  • Pre-training tasks
  • Pre-trained model structures
  • Pre-trained model compression
  • Pre-training acceleration
  • Knowledge distillation
  • Inference acceleration
  • Fine-tuning strategies

GREEN: efforts for performance BLUE: efforts for practical usage

SLIDE 25

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

https://gluebenchmark.com/
  • CoLA: The Corpus of Linguistic Acceptability
  • SST-2: The Stanford Sentiment Treebank
  • MRPC: The Microsoft Research Paraphrase Corpus
  • STS-B: The Semantic Textual Similarity Benchmark
  • QQP: The Quora Question Pairs
  • MNLI: The Multi-Genre Natural Language Inference Corpus
  • QNLI: The Stanford Question Answering Dataset
  • RTE: The Recognizing Textual Entailment
  • WNLI: The Winograd Schema Challenge

SLIDE 30

SLIDE 31

UniLM (Dong et al., 2019)

SLIDE 32

Cross-lingual Pre-trained Model (Devlin et al., 2018; Lample and Conneau, 2019; Huang et al., 2019; …)

[Figure: a cross-lingual pre-trained model is trained on monolingual data in languages A, B, C, …, X together with bilingual data (A↔B, A↔C, …). It is then fine-tuned with labeled data of a given task in language A (task-specific fine-tuning) and applied to the same task in language B, language C, …, language X.]

SLIDE 33

XLM (Lample and Conneau, 2019; Conneau et al., 2019)

SLIDE 34

Unicoder (Huang et al., 2019)

Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Ming Zhou. Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks. EMNLP, 2019.

SLIDE 35

Unicoder (Liang et al., 2020)

Unicoder Encoder (12 layers, shared 256K vocabulary, 100 languages) and Unicoder Decoder (12 layers, shared 256K vocabulary, 100 languages); example input: "this could be a sentence in any language ."

Text noising methods, applied to the example above (a minimal sketch of these operations follows the reference below):
  • Sentence Permutation: could this be sentence a in . any language
  • Token Deletion: this be a in any language
  • Token Masking: [MASK] could be a [MASK] in any [MASK] .
  • Text Infilling: this could be [MASK] in [MASK] .

The encoder-decoder is pre-trained to reconstruct the original text from these noised inputs (text denoising).

Yaobo Liang, Nan Duan, Yeyun Gong, et al. XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation. arXiv, 2020.
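A minimal sketch of the four noising operations listed above is given below; the masking and deletion rates and the fixed span length are illustrative assumptions, not Unicoder's actual settings.

```python
import random

def sentence_permutation(tokens, rng):
    tokens = tokens[:]
    rng.shuffle(tokens)                 # reorder the whole sentence
    return tokens

def token_deletion(tokens, rng, p=0.15):
    return [t for t in tokens if rng.random() > p]      # drop tokens at random

def token_masking(tokens, rng, p=0.15):
    return [("[MASK]" if rng.random() < p else t) for t in tokens]

def text_infilling(tokens, rng, span=2, p=0.15):
    out, i = [], 0
    while i < len(tokens):
        if rng.random() < p:
            out.append("[MASK]")        # one [MASK] replaces a whole span
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

rng = random.Random(0)
tokens = "this could be a sentence in any language .".split()
noisy = {f.__name__: f(tokens, rng)
         for f in (sentence_permutation, token_deletion, token_masking, text_infilling)}
```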
SLIDE 36

Cross-lingual Benchmarks: XNLI, XTREME, XGLUE

SLIDE 37

Tasks in XGLUE

  • NER: Named Entity Recognition
  • POS: Part-of-Speech Tagging
  • NC: News Classification
  • MLQA: Multilingual MRC
  • XNLI: Natural Language Inference
  • PAWS-X: Paraphrase Classification
  • QADSM: Query-Ads Matching
  • WPR: Web Page Ranking
  • QAM: Question-Answer Matching
  • QG: Question Generation
  • NTG: News Title Generation

SLIDE 38

Evaluation on XGLUE

https://arxiv.org/abs/2004.01401 (leaderboard to be released soon). Results are reported separately for understanding tasks and generation tasks.

SLIDE 39

Application (1): Question Answering

SLIDE 40

Application (2): Question Generation

SLIDE 41

Application (3): Multilingual Question Answering

Example (Bing fr-FR). Query (English translation): where is the largest sugar factory in the world. Answer (English translation): The sugar refinery of the Algerian group Cevital produces 2.7 million tonnes of sugar a year, making it the largest refinery in the world. This refinery doubled its exports of white sugar from 377,000 tonnes to 600,000 tonnes in 2012.

Example (Bing de-DE). Query (English translation): delete browser history windows 10. Answer (English translation): In the tab "General" you will find the sub-item "Browser History". Click on the "Delete ..." button there. A window with the name "delete browser history" will open. Check the data you want to delete. Click on "Delete". The history has now been removed.

SLIDE 42

Application (4): Multilingual News Headline Generation

SLIDE 43
  • He and Choi. Establishing Strong Baselines for the New Decade: Sequence Tagging, Syntactic and Semantic Parsing with BERT. arXiv, 2020.
  • Baseline: uses GloVe and Flair embeddings in these two tasks.
  • Baseline \ BERT: substitutes GloVe and Flair with BERT embeddings.
  • Baseline + BERT: uses all three types of embeddings together.
SLIDE 44
  • Liang et al. XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation. arXiv, 2020.
  • Fine-tuned on the en labeled data and then applied to all test sets in different languages.
  • Showed a very strong cross-lingual transfer capability on the multilingual NER task.

SLIDE 45

Can deep learning and linguistics boost each other?

NN helps Linguistics:
  • Deep learning models can find hidden syntactic tree structures of natural language sentences in an unsupervised way. (Yikang Shen, Shawn Tan, Alessandro Sordoni, Aaron Courville. Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks. ICLR, 2019.)
  • Deep learning models can predict better syntactic tree structures of natural language sentences in a supervised way. (Danqi Chen and Christopher Manning. A Fast and Accurate Dependency Parser Using Neural Networks. EMNLP, 2014.)

Linguistics helps NN:
  • Linguistic information can improve NLP tasks as input signals. (Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, Andrew McCallum. Linguistically-Informed Self-Attention for Semantic Role Labeling. EMNLP, 2018.)
  • Linguistic information can improve NLP tasks by designing syntax-aware neural network structures. (Huadong Chen, Shujian Huang, David Chiang, Jiajun Chen. Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder. ACL, 2017.)

SLIDE 46
  • Neural NLP (NN-NLP)
  • Future directions
SLIDE 47

Where is the Future Direction of NLP?

  • Are we satisfied with current DNN-NLP?
  • DNN-NLP relies heavily on costly computing power and annotated data, and suffers from big challenges in modelling, reasoning and interpretability.
  • Linguistics, knowledge, common sense and symbolic reasoning should still play important roles in solving these challenges.

SLIDE 48

Datasets: high cost, bias, noise, privacy, and discrepancy from real scenarios

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, Devi Parikh. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. CVPR, 2017.

Thomas Manzini, Yao Chong Lim, Yulia Tsvetkov, Alan W Black. Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings. NAACL, 2019.

SLIDE 49

Fierce computing power arms race

Model | Parameters | BLEU | Δ BLEU
Transformer | 210.4M | 28.8 ± 0.2 |
NAS (Evolved Transformer) | 221.7M | 29.0 ± 0.1 | +0.2

David R. So, Chen Liang, Quoc V. Le. The Evolved Transformer. ICML, 2019.

https://medium.com/syncedreview/tracking-the-transforming-ai-chip-market-bac117359459

Emma Strubell, Ananya Ganesh and Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP . ACL, 2019.

SLIDE 50

Important Topics for Low-Resource Tasks

  • Unsupervised learning: discover knowledge from unannotated data based on distributions and patterns.
  • Transfer learning: transfer knowledge learnt from rich-resource tasks to low-resource tasks, as with BERT and ResNet.
  • Cross-language learning: learn mappings and relationships among languages for cross-lingual NLP tasks.
  • Prior knowledge and human role: cold start with seeds such as rules and dictionaries, active learning, reinforcement learning.

SLIDE 51

Multi-Turn Tasks (Common Sense and Reasoning)

Fact: ACL 2019 is held in Florence.

Q-1: Where is ACL 2019 held? (Florence)
Q-2: Is ACL 2019 held in France? (No, because…)
Q-3: Can I attend this conference without an accepted paper? (Yes, if…)
Q-4: Why is ACL 2019 held in Florence? (Because…)

Common sense and reasoning are required.

SLIDE 52

Important topics for multi-turn tasks:

  • Context modeling: represent, memorize and forget context information in reasoning.
  • Knowledge and common sense: extract, represent, conflate and use different types of knowledge and common sense.
  • Inference mechanism: annotate, model and evaluate the inference procedure.
  • Explainability: mechanism, debugging, evaluation, visualization.

SLIDE 53

Towards interpretable, knowledgeable, ethical, economical and non-stop-learnable NLP

Rich-resource tasks:
  • Context modelling
  • Data de-biasing
  • Multi-task learning
  • Human knowledge

Low-resource tasks:
  • Transfer learning
  • Unsupervised learning
  • Cross-language learning
  • Prior knowledge and human role

Multi-turn tasks:
  • Knowledge/common sense
  • Context modelling
  • Inference mechanism
  • Interpretation

Representative tasks:
  • Language understanding
  • Text analysis/text mining
  • Reading comprehension
  • Translation
  • Summarization
  • Question answering
  • Text generation
  • Conversation and chat

Application scenarios:
  • Search engines over heterogeneous content including texts, images, videos and audio
  • Text/speech-based machine translation
  • Conversational AI with better multi-turn and reasoning capabilities
  • Text generation for news, reports, poetry and music
  • Virtual agents and robots
  • Smart devices, homes, enterprises and cities
  • AI + education, finance, e-commerce, health, etc.

Practical requirements:
  • Clear problem definition
  • Public data and evaluation
  • Fast iteration with real scenarios
  • The ability to keep learning with humans in the loop

SLIDE 54

A full set of NN-NLP techniques has been proposed and effectively applied in various NLP tasks:

  • End-to-end training with large-scale data and big models
  • Expression, encoder-decoder, attention model, Transformer
  • Transfer learning with pre-trained models

In the future, we will need to further explore:

  • NN-NLP enhanced with knowledge graphs, common sense and reasoning mechanisms
  • Pre-trained models with new pre-training tasks and new model architectures, for more modalities, with smaller model sizes and faster training
  • Interpretation mechanisms
  • NN-NLP, CL and linguistic study leveraging each other