SLIDE 1

Deep learning

Deep dual learning¹

Hamid Beigy

Sharif university of technology

December 21, 2019

¹Some slides are adapted from the slides of Tao Qin, Sreeja R. Thoom, et al.

Hamid Beigy | Sharif university of technology | December 21, 2019

SLIDE 2

Table of contents

1 Introduction
2 Dual learning
3 Dual Supervised Learning

SLIDE 3

Introduction

SLIDE 4

Three Pillars of Deep Learning

1 The three pillars are:

Big data: web pages, search logs, social networks, and new mechanisms for data collection such as conversation and crowd-sourcing.
Big models: 1000+ layers, tens of billions of parameters.
Big computing: CPU clusters, GPU clusters, TPU clusters, and FPGA farms, provided by Amazon, Azure, Alibaba, etc.

SLIDE 5

Some Challenges of Deep Learning

1 Big-Data Challenge

Today's deep learning relies heavily on huge amounts of human-labeled training data.

Task                  Typical training data
Image classification  Millions of labeled images
Speech recognition    Thousands of hours of annotated voice data
Machine translation   Tens of millions of bilingual sentence pairs

Human labeling is in general very expensive, and it is hard, if not impossible, to obtain large-scale labeled data for rare domains.

SLIDE 6

Machine translation

1 How do we translate from a source language to a destination language?
2 Main problems:

How do we translate words from the source language to the destination language?
How do we order words in the destination language?
How do we measure the goodness of a translation?
What type of corpus is needed? (monolingual or bilingual)
How do we build a sequence of translators? (Persian → English → French)

SLIDE 7

Neural machine translation (NMT)

1 In NMT², recurrent neural networks with LSTM or GRU units are used.

²Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." ICLR 2015.
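As a minimal illustration of the GRU units mentioned above, one forward step of a GRU cell can be sketched in numpy (the dimensions and random weights below are arbitrary toy values, not from any trained NMT system):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W_z, U_z, W_r, U_r, W_h, U_h):
    """One GRU forward step for input x and previous hidden state h."""
    z = sigmoid(W_z @ x + U_z @ h)               # update gate
    r = sigmoid(W_r @ x + U_r @ h)               # reset gate
    h_tilde = np.tanh(W_h @ x + U_h @ (r * h))   # candidate hidden state
    return (1.0 - z) * h + z * h_tilde           # blend old and candidate state

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16
W = [rng.standard_normal((d_hid, d_in)) * 0.1 for _ in range(3)]
U = [rng.standard_normal((d_hid, d_hid)) * 0.1 for _ in range(3)]

# Encode a toy "sentence" of five word vectors into a final hidden state.
h = np.zeros(d_hid)
for _ in range(5):
    x_t = rng.standard_normal(d_in)
    h = gru_step(x_t, h, W[0], U[0], W[1], U[1], W[2], U[2])
print(h.shape)  # (16,)
```

An NMT encoder runs such a step over every source word; the decoder runs a second recurrent network conditioned on the encoder's output.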

SLIDE 8

Neural machine translation (NMT)

1 A critical disadvantage of the fixed-length context-vector design is its inability to remember long sentences.
2 The attention mechanism was proposed to help memorize long source sentences in NMT.
3 Another critical disadvantage of this model is the training set: we need a large bilingual corpus.
4 Dual learning was introduced to overcome the need for a large bilingual corpus.
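The attention idea in point 2 can be sketched as dot-product attention over the encoder's hidden states (a toy numpy version; the original Bahdanau et al. model uses a learned additive scoring function instead):

```python
import numpy as np

def attention(decoder_state, encoder_states):
    """Return a context vector: softmax-weighted sum of encoder states."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over source positions
    return weights @ encoder_states, weights  # context vector, attention weights

rng = np.random.default_rng(1)
enc = rng.standard_normal((7, 16))  # 7 source positions, hidden size 16
dec = rng.standard_normal(16)       # current decoder state
context, w = attention(dec, enc)
print(context.shape, round(w.sum(), 6))  # (16,) 1.0
```

Because the decoder re-reads all source positions at every step, no single fixed-length vector has to carry the whole sentence.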

SLIDE 9

Dual learning

SLIDE 10

Duality in Machine Translation

1 Dual learning is an auto-encoder-like mechanism for utilizing monolingual datasets³.

³Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma. Dual learning for machine translation. NIPS 2016.

SLIDE 11

Duality in Speech Processing

1 Duality in speech processing (example utterance: "Welcome to Beijing!").

Primal task (g: y → z): speech recognition
Dual task (h: z → y): speech synthesis

SLIDE 12

Duality in Question Answering and Generation

1 Duality in question answering and generation.

Question: "for what purpose do organisms make peroxide and superoxide?"
Answer: "Parts of the immune system of higher organisms create peroxide, superoxide, and singlet oxygen to destroy invading microbes."

Primal task (g: y → z): question answering
Dual task (h: z → y): question generation

SLIDE 13

Duality in Search and Advertising

1 Duality in search and advertising.

Primal task (g: y → z): search (find webpages for a given query)
Dual task (h: z → y): advertising (suggest keywords for a given webpage)

SLIDE 14

Structural Duality in AI

Structural duality is very common in artificial intelligence.

AI task              X → Y                      Y → X
Machine translation  Translation from EN to CH  Translation from CH to EN
Speech processing    Speech recognition         Text to speech
Image understanding  Image captioning           Image generation
Conversation         Question answering         Question generation
Search engine        Query-document matching    Query/keyword suggestion

Currently, most machine learning algorithms do not exploit structural duality for training and inference.

SLIDE 15

Dual Learning

1 A new learning framework that leverages the symmetric (primal-dual) structure of AI tasks to obtain effective feedback or regularization signals that enhance the learning/inference process.
2 If we do not have enough labeled data for training, can we use unlabeled data?
3 Dual unsupervised learning can leverage structural duality to learn from unlabeled data.

SLIDE 16

Dual learning (Definition)

1 Let us define⁴:

D_A: corpus of language A
D_B: corpus of language B
P(.|s, θ_AB): translation model from A to B
P(.|s, θ_BA): translation model from B to A
LM_A(.): learned language model of A
LM_B(.): learned language model of B

⁴Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma. Dual learning for machine translation. NIPS 2016.

SLIDE 17

Dual learning (Algorithm)

1 We have a sentence s sampled from D_A.
2 Generate K translated sentences s_mid,1, s_mid,2, ..., s_mid,K from P(.|s, θ_AB).
3 Compute an intermediate reward r_1,k for each sentence as r_1,k = LM_B(s_mid,k).
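Steps 2 and 3 can be sketched with stand-in components; `translate_AB` and `lm_B` below are hypothetical placeholders for the trained translation model and language model, returning toy values:

```python
import random

random.seed(0)

def translate_AB(s, k):
    """Hypothetical stand-in sampler: produce k candidate translations of s."""
    return [f"{s} (candidate {i})" for i in range(k)]

def lm_B(sentence):
    """Hypothetical stand-in language model of B: returns a log-probability."""
    return -random.uniform(5.0, 50.0)  # toy log-prob, always negative

s = "a sentence from corpus D_A"
K = 4
s_mid = translate_AB(s, K)              # step 2: K sampled translations
r1 = [lm_B(c) for c in s_mid]           # step 3: rewards r_1,k = LM_B(s_mid,k)
print(len(r1), all(r < 0 for r in r1))  # 4 True
```

A fluent candidate gets a higher (less negative) language-model score, so r_1,k rewards fluency in language B even without any parallel data.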

SLIDE 18

Dual learning (Algorithm)

1 We have the K translated sentences from the previous step.
2 Compute a communication (reconstruction) reward r_2,k for each sentence as r_2,k = ln P(s|s_mid,k, θ_BA).
3 Set the total reward of the kth sentence to r_k = α r_1,k + (1 − α) r_2,k.

SLIDE 19

Dual learning (Algorithm)

1 We have the rewards r_1, r_2, ..., r_K from the previous step.
2 Compute the stochastic gradients with respect to θ_AB and θ_BA:

∇_θ_AB E[r] = (1/K) Σ_{k=1}^{K} r_k ∇_θ_AB ln P(s_mid,k|s, θ_AB)

∇_θ_BA E[r] = (1/K) Σ_{k=1}^{K} (1 − α) ∇_θ_BA ln P(s|s_mid,k, θ_BA)

SLIDE 20

Dual learning (Algorithm)

1 We have the gradients from the previous step.
2 Update the model parameters θ_AB and θ_BA:

θ_AB ← θ_AB + γ_1 ∇_θ_AB E[r]
θ_BA ← θ_BA + γ_2 ∇_θ_BA E[r]

SLIDE 21

Dual learning algorithm (pseudocode)

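The one-direction update described on the preceding slides can be sketched in Python. The scalar "models" below are toy stand-ins chosen purely to make the reward and update rules concrete (a real implementation would use neural translation models, their log-probability gradients, and candidates sampled from the current model):

```python
import math
import random

random.seed(0)
alpha, gamma1, gamma2, K = 0.5, 0.01, 0.01, 4

# Toy scalar "models" standing in for the translation models:
#   forward:  ln P(c|s, th_ab) = -(c - s - th_ab)^2 / 2
#   reverse:  ln P(s|c, th_ba) = -(s - c - th_ba)^2 / 2
def log_p(diff, th):
    return -0.5 * (diff - th) ** 2

def grad_log_p(diff, th):
    # derivative of log_p with respect to th
    return diff - th

th_ab, th_ba = 0.0, 0.0
corpus_A = [random.gauss(1.0, 0.5) for _ in range(100)]  # toy monolingual corpus D_A

for _ in range(200):
    s = random.choice(corpus_A)  # sample a "sentence" from D_A
    # K candidate "translations" (a fixed sampler here, for simplicity)
    cands = [s + random.gauss(0.0, 0.3) for _ in range(K)]
    r1 = [-0.5 * (c - 1.0) ** 2 for c in cands]  # toy LM_B reward, peaked at 1.0
    r2 = [log_p(s - c, th_ba) for c in cands]    # reconstruction reward ln P(s|c, th_ba)
    r = [alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]
    # stochastic gradient estimates from the slides
    g_ab = sum(rk * grad_log_p(c - s, th_ab) for rk, c in zip(r, cands)) / K
    g_ba = sum((1 - alpha) * grad_log_p(s - c, th_ba) for c in cands) / K
    th_ab += gamma1 * g_ab  # gradient ascent on the expected reward
    th_ba += gamma2 * g_ba

print(math.isfinite(th_ab) and math.isfinite(th_ba))  # True
```

The structure (sample, translate, score with LM and reconstruction, policy-gradient update) mirrors the two-agent game of the NIPS 2016 paper; in the full algorithm the same loop also runs in the B → A direction.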

SLIDE 22

Experimental results

SLIDE 23

Experimental results

1 Reconstruction performance (BLEU: geometric mean of n-gram precision).

[Figure: BLEU reconstruction-performance chart; details not recoverable]

SLIDE 24

Experimental results

1 Results for different source sentence lengths (the improvement is significant for long sentences).

[Figure: BLEU vs. source sentence length; details not recoverable]

SLIDE 25

Experimental results

1 Reconstruction examples.

[Figure: reconstruction examples; details not recoverable]

SLIDE 26

Dual Supervised Learning

SLIDE 27

Supervised learning

1 Given m training pairs {(x_1, y_1), ..., (x_m, y_m)} sampled from the space X × Y.
2 Learn the bi-directional relationship of (x, y) as two independent supervised learning tasks (primal f and dual g):

min_θ_xy (1/m) Σ_{i=1}^{m} L1(f(x_i; θ_xy), y_i)

min_θ_yx (1/m) Σ_{i=1}^{m} L2(g(y_i; θ_yx), x_i)

3 If the learned primal and dual models are perfect, then for all x and y we should have

P(x) P(y|x; θ_xy) = P(y) P(x|y; θ_yx)
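The condition in point 3 is simply the two factorizations of the joint distribution P(x, y). A tiny numeric check, using an arbitrary 2×2 joint distribution, makes this concrete:

```python
import numpy as np

# Arbitrary joint distribution over X = {0, 1} and Y = {0, 1}
P_xy = np.array([[0.1, 0.3],
                 [0.4, 0.2]])

P_x = P_xy.sum(axis=1)             # marginal P(x)
P_y = P_xy.sum(axis=0)             # marginal P(y)
P_y_given_x = P_xy / P_x[:, None]  # conditional P(y|x)
P_x_given_y = P_xy / P_y[None, :]  # conditional P(x|y)

# Both factorizations recover the same joint: P(x)P(y|x) = P(y)P(x|y)
lhs = P_x[:, None] * P_y_given_x
rhs = P_y[None, :] * P_x_given_y
print(np.allclose(lhs, rhs))  # True
```

Trained models only approximate the conditionals, so the equality generally fails to hold, which is what the next slide exploits as a training signal.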

SLIDE 28

Dual supervised learning

1 Incorporate joint distribution matching into supervised learning:

min_θ_xy (1/m) Σ_{i=1}^{m} L1(f(x_i; θ_xy), y_i)

min_θ_yx (1/m) Σ_{i=1}^{m} L2(g(y_i; θ_yx), x_i)

subject to P(x) P(y|x; θ_xy) = P(y) P(x|y; θ_yx)

2 Using the empirical marginal distributions P̂(x) and P̂(y), the duality constraint becomes a regularizer:

L_duality = (log P̂(x) + log P(y|x; θ_xy) − log P̂(y) − log P(x|y; θ_yx))²
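A minimal sketch of the regularizer, assuming it penalizes the squared gap between the two log-factorizations of the joint; the log-probability values below are made up for the example:

```python
def duality_loss(log_px, log_py_given_x, log_py, log_px_given_y):
    """Squared gap between the two log-factorizations of the joint."""
    gap = (log_px + log_py_given_x) - (log_py + log_px_given_y)
    return gap ** 2

# Made-up log-probabilities for one (x, y) pair:
#   log P_hat(x) = -2.0, log P(y|x) = -1.5, log P_hat(y) = -1.8, log P(x|y) = -1.6
loss = duality_loss(-2.0, -1.5, -1.8, -1.6)
print(round(loss, 6))  # 0.01
```

In training, this term is added (with a weight) to each model's supervised loss, so the primal and dual models regularize each other toward a consistent joint distribution.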

SLIDE 29

Dual supervised learning algorithm

SLIDE 30

Dual supervised learning algorithm results

SLIDE 31

Some extensions

1 Yingce Xia, Tao Qin, Wei Chen, Jiang Bian, Nenghai Yu, and Tie-Yan Liu. Dual Supervised Learning. ICML 2017.
2 Yijun Wang, Yingce Xia, Li Zhao, Jiang Bian, Tao Qin, Guiquan Liu, and Tie-Yan Liu. Dual Transfer Learning for Neural Machine Translation with Marginal Distribution Regularization. AAAI 2018.
