SLIDE 1

Learning Architectures and Loss Functions in Continuous Space

Fei Tian Machine Learning Group Microsoft Research Asia

SLIDE 2

Self-Introduction

  • Researcher @ MSRA Machine Learning Group
  • Joined in July, 2016
  • Research Interests:
  • Machine Learning for NLP (especially NMT)
  • Automatic Machine Learning
  • More Information: https://ustctf.github.io
SLIDE 3

Outline

  • Overview
  • Efficiently optimizing continuous decisions
  • Loss Function Teaching
  • Continuous space for discrete decisions
  • Neural Architecture Optimization
SLIDE 4

Automatic Machine Learning

Architectures, Depth, Width, Batch size, … Learning rate, Dropout, Weight decay, Temperature, …

Automate every decision in machine learning

SLIDE 5

Why Continuous Space?

  • Life is easier if we have gradients
  • For example, we have a bunch of powerful gradient-based optimization algorithms
  • Representations are compact
  • One-hot representations of words vs. word embeddings
SLIDE 6

The Role of Continuous Space in AutoML

  • For continuous decisions
  • How to efficiently optimize them?
  • And, more importantly, how to do so elegantly
  • Our work: Loss Function Teaching
  • For discrete decisions
  • How to effectively cast them into continuous space?
  • Our work: Neural Architecture Optimization

SLIDE 7

Learning to Teach with Dynamic Loss Functions

Lijun Wu, Fei Tian, Yingce Xia, Tao Qin, Tie-Yan Liu NeurIPS 2018

SLIDE 8

Loss Function Teaching

  • Recap: the loss function $L(f_\omega(x), y)$
  • Typical examples (see the sketch below):
  • Cross-Entropy: $L = -\log q(x) \cdot \vec{y}$, where $\vec{y}_j = \mathbf{1}_{j=y}$
  • Maximum Margin: $L = \max_{y' \neq y} \left( \log q_{y'} - \log q_y \right)$
  • Learning objective of $f_\omega$:
  • Minimize $L$: $\omega_t = \omega_{t-1} - \eta \frac{\partial L}{\partial \omega_{t-1}}$

Discover the best loss function $L$ to train the student model $f_\omega$

Pipeline: $X \to f_\omega(X)$, compared against $Y$ through $L(f_\omega(X), y)$

  • Objective of loss function teaching:
  • Ultimate goal: improve the performance of $f_\omega$
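To make the recap concrete, here is a minimal NumPy sketch of the two example losses and one student update step, using a toy softmax-linear student (the toy model and its shapes are my own illustration, not the slide's actual setup):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(q, y):
    """L = -log q(x) . y_onehot: negative log-probability of the true class y."""
    return -np.log(q[y])

def max_margin(q, y):
    """L = max_{y' != y} (log q_{y'} - log q_y)."""
    log_q = np.log(q)
    return np.max(np.delete(log_q, y)) - log_q[y]

# Toy student f_omega: a linear layer followed by softmax, 3 classes, 5 features
rng = np.random.default_rng(0)
omega = rng.normal(size=(3, 5))
x, y, eta = rng.normal(size=5), 1, 0.1

q = softmax(omega @ x)
print(cross_entropy(q, y), max_margin(q, y))

# One SGD step omega_t = omega_{t-1} - eta * dL/domega_{t-1} for the cross-entropy loss
grad = np.outer(q - np.eye(3)[y], x)   # d(-log q_y)/d omega for a softmax-linear model
omega = omega - eta * grad
```
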
SLIDE 9

Why is it called "Teaching"?

  • If we view the model $f_\omega$ as the student, then $L$ is the exam
  • Good teachers are adaptive:
  • They set good exams according to the status of the students
  • An analogy:
  • Data $(x, y)$ is the textbook
  • Curriculum learning schedules the textbooks (data) per the status of the student model

SLIDE 10

Can We Achieve Automatic Teaching?

  • The first task: design a good decision space
  • Our way: use another (parametric) neural network $L_\phi(f_\omega(x), y)$ as the loss function
  • The decision space: the coefficients $\phi$
  • It is continuous
SLIDE 11

Automatic Loss Function Teaching, cont.

  • Assume the loss function itself is a neural network
  • $L_\phi(f_\omega(x), y)$, with $\phi$ as its coefficients
  • For example, a generalized cross-entropy loss
  • $L_\phi = \sigma\left(-\log q(x)^{\top} W \vec{y} + b\right)$
  • $\phi = \{W, b\}$
  • A parametric teacher model $\mu_\theta$
  • Outputs $\phi$: $\phi = \mu_\theta(\cdot)$

Pipeline: $X \to f_\omega(X)$, compared against $Y$ through $L_\phi(f_\omega(X), y)$, with the coefficients $\phi$ produced by the teacher $\mu_\theta$ (a code sketch follows below)
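For concreteness, the generalized cross-entropy above fits in a few lines. A sketch assuming a 1-D probability vector $q(x)$ and a scalar output (the paper's exact parameterization may differ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def parametric_loss(q, y, W, b):
    """Generalized cross-entropy L_phi = sigma(-log(q)^T W y_onehot + b),
    where phi = {W, b} are the coefficients emitted by the teacher."""
    y_onehot = np.eye(len(q))[y]
    return sigmoid(-np.log(q) @ W @ y_onehot + b)

# With W = identity and b = 0 this is the usual cross-entropy passed through a sigmoid
q = np.array([0.2, 0.7, 0.1])
print(parametric_loss(q, y=1, W=np.eye(3), b=0.0))
```
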

SLIDE 12

How to Be Adaptive?

  • Extract features $s_t$ at each training step $t$ of the student model $f_\omega$
  • The coefficients are adaptive
  • $\phi_t = \mu_\theta(s_t)$, generating adaptive loss functions $L_{\phi_t}(f_\omega(x), y)$ (see the sketch below)
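A minimal sketch of this adaptivity, with a tiny MLP standing in for the teacher $\mu_\theta$ (the state features, sizes, and MLP here are illustrative assumptions, not the paper's exact design):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical state features s_t of the student at step t,
# e.g., normalized training progress, current training loss, current dev accuracy
s_t = np.array([0.1, 1.9, 0.42])

# Teacher mu_theta: a tiny two-layer MLP that emits the loss coefficients phi_t
theta = {"W1": rng.normal(size=(8, 3)), "b1": np.zeros(8),
         "W2": rng.normal(size=(10, 8)), "b2": np.zeros(10)}

def teacher(s, theta):
    h = np.tanh(theta["W1"] @ s + theta["b1"])
    return theta["W2"] @ h + theta["b2"]

phi_t = teacher(s_t, theta)   # reshaped downstream into the {W, b} of L_{phi_t}
```
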
SLIDE 13

How to Optimize the Teacher Model?

  • Hyper-gradient (a numeric check follows below):
  • $\frac{\partial L_{dev}}{\partial \phi} = \frac{\partial L_{dev}}{\partial \omega_T}\,\frac{\partial \omega_T}{\partial \phi} = \frac{\partial L_{dev}}{\partial \omega_T}\left(\frac{\partial \omega_{T-1}}{\partial \phi} - \eta_{T-1}\,\frac{\partial^2 L_{train}(\omega_{T-1})}{\partial \omega_{T-1}\,\partial \phi}\right)$
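The chain rule above can be sanity-checked on a toy problem by differentiating the dev loss through one unrolled student SGD step with respect to the loss coefficient. A sketch with a scalar weight $\omega$ and a scalar coefficient $\phi$ (the toy losses are mine, not the paper's):

```python
import numpy as np

# Toy setting: train loss L_train(omega; phi) = phi * (omega - 1)^2,
#              dev loss   L_dev(omega)        = (omega - 2)^2
eta, omega0, phi = 0.1, 0.0, 0.5

def dev_loss_after_one_step(phi):
    grad_train = 2.0 * phi * (omega0 - 1.0)   # dL_train/domega at omega0
    omega1 = omega0 - eta * grad_train        # one student SGD step
    return (omega1 - 2.0) ** 2

# Analytic hyper-gradient: dL_dev/dphi = dL_dev/domega1 * domega1/dphi
omega1 = omega0 - eta * 2.0 * phi * (omega0 - 1.0)
analytic = 2.0 * (omega1 - 2.0) * (-eta * 2.0 * (omega0 - 1.0))

# Finite-difference check
eps = 1e-6
numeric = (dev_loss_after_one_step(phi + eps) - dev_loss_after_one_step(phi - eps)) / (2 * eps)
print(analytic, numeric)   # the two values agree
```
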

SLIDE 14

Neural Machine Translation Experiment

BLEU on WMT 2014 English→German translation (Transformer):

  Method                    BLEU
  Cross Entropy             28.4
  Reinforcement Learning    28.7
  L2T                       29.1

SLIDE 15

Experiments: Image Classification

  • On CIFAR-10 and CIFAR-100

Error rate (%) of CIFAR-10 classification:

  Method                 ResNet-32   Wide ResNet
  Cross Entropy          7.51        3.80
  Large Margin Softmax   7.01        3.69
  L2T                    6.56        3.38

Error rate (%) of CIFAR-100 classification:

  Method                 ResNet-32   Wide ResNet
  Cross Entropy          30.38       19.93
  Large Margin Softmax   30.12       19.75
  L2T                    29.25       18.98

SLIDE 16

Till now…

  • We talked about how to set continuous decisions for a particular AutoML task
  • And how to effectively optimize them
  • But what if the design space is discrete?
SLIDE 17

Neural Architecture Optimization

Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, Tie-Yan Liu NeurIPS 2018

SLIDE 18

The Background: Neural Architecture Search

  • There might be no particular need to introduce the basics…
  • Two mainstream algorithms:
  • Reinforcement Learning and Evolutionary Computing
SLIDE 19

How to Cast the Problem into Continuous Space?

  • Intuitive Idea

Map the (discrete) architectures into continuous embeddings → optimize the embeddings → revert back to the architectures

  • How to optimize?
  • Use the help of a performance predictor function $g$
SLIDE 20
How Does NAO Work?

[Figure: the output surface of the performance prediction function $g$ over the embedding space of all architectures, with points $f_x$ and $f_{x'}$]

Architecture $x$ → Encoder → embedding $f_x$ → Gradient ascent: $f_{x'} = f_x + \eta \frac{\partial g}{\partial f}$ → Decoder → optimized architecture $x'$ (sketched in code below)
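A toy sketch of the gradient-ascent step in the embedding space, with a made-up concave surrogate $g$ standing in for NAO's learned performance predictor (the real encoder, predictor, and decoder are jointly trained networks; everything below is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=16)                # embedding where the toy predictor peaks

def g(f):
    """Toy performance predictor over the embedding space."""
    return -np.sum((f - target) ** 2)

def grad_g(f):
    return -2.0 * (f - target)

f_x = rng.normal(size=16)                   # embedding of the starting architecture x
eta = 0.05
for _ in range(100):                        # f_x' = f_x + eta * dg/df
    f_x = f_x + eta * grad_g(f_x)

print(g(f_x))  # predicted performance has improved; the decoder would map f_x' back to x'
```
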
SLIDE 21

Why Could the Encoder (Including the Performance Predictor) Work? Two Tricks

  • Normalize the performance into (0, 1)
  • Sometimes even with its CDF
  • Data augmentation (see the sketch below)
  • $(x, y) \to (x', y)$, if the architectures $x$ and $x'$ are symmetric
  • Improves the pairwise accuracy by 2% on CIFAR-10
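Both tricks take only a couple of lines. A sketch with made-up accuracies and a toy token encoding of an architecture (the real NAO encoding and its symmetry rules are richer):

```python
import numpy as np

# Trick 1: normalize raw validation accuracies into (0, 1), here via the empirical CDF,
# so the predictor's regression targets are well spread out
acc = np.array([0.921, 0.934, 0.918, 0.940, 0.930])   # hypothetical accuracies
ranks = np.argsort(np.argsort(acc))                   # 0 = worst, n-1 = best
targets = (ranks + 1) / (len(acc) + 1)                # empirical-CDF values in (0, 1)

# Trick 2: symmetry-based augmentation -- if x and x' describe the same computation
# graph (e.g., the two inputs of an "add" node swapped), add (x', y) with the same y
def swap_inputs(arch):
    """Hypothetical symmetry on a toy (input_1, input_2, op) encoding."""
    in1, in2, op = arch
    return (in2, in1, op)

pairs = [(("prev_cell", "skip", "add"), targets[0])]
pairs += [(swap_inputs(x), y) for x, y in pairs]      # doubled training pairs
```
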
SLIDE 22

Why Could the Decoder (i.e., Perfect Recovery) Work?

  • A sentence-level auto-encoder with an attention mechanism is easy to train
  • You can even obtain near 100 BLEU on the test set!
  • So we sometimes need perturbations to avoid trivial solutions (e.g., in unsupervised machine translation [1, 2])
  • $g$ happens to act as the perturbation
  • 1. Artetxe, Mikel, et al. "Unsupervised neural machine translation." ICLR 2018
  • 2. Lample, Guillaume, et al. "Unsupervised machine translation using monolingual corpora only." ICLR 2018
SLIDE 23

Experiments: CIFAR-10

  Method      Error Rate (%)   Resource (#GPUs × #Hours)
  ENAS        2.89             12
  NAO-WS      2.80             7
  AmoebaNet   2.13             3150 × 24
  Hie-EA      3.15             300 × 24
  NAO         2.10             200 × 24

SLIDE 24

Experiments: Transfer to CIFAR-100

SLIDE 25

Experiments: PTB Language Modelling

  Method    Perplexity   Resource (#GPUs × #Hours)
  NASNet    62.4         1e4 CPU days
  ENAS      58.6         12
  NAO       56.0         300
  NAO-WS    56.4         8

SLIDE 26

Experiments: Transfer to WikiText2

SLIDE 27

Open Source

  • https://github.com/renqianluo/NAO
SLIDE 28

Thanks!

We are hiring! Send me a message if you are interested: fetia@microsoft.com

SLIDE 29

The Panel Discussion

  • What exactly does AutoML include (neural architecture search, hyperparameter search, traditional machine learning models, etc.)?
  • What is the relationship between AutoML and meta-learning?
  • What are the limitations of NAS? How can human intervention be removed entirely?
  • How does NAS relate to representation/transfer learning?
  • How should we view "Random Search and Reproducibility for NAS"?
  • RL, ES, or SGD: is gradient-based NAS the future?