
SLIDE 1

Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search

Youhei Akimoto (presenter; University of Tsukuba / RIKEN AIP), Shinichi Shirakawa (Yokohama National University), Nozomu Yoshinari (Yokohama National University), Kento Uchida (Yokohama National University), Shota Saito (Yokohama National University), Kouhei Nishida (Shinshu University)

SLIDE 2

Neural Network Architectures

Choosing an architecture for a task (dataset) is trial and error: VGGNet, ResNet, Inception, ...

Sometimes...
  • a known architecture, often pre-trained on some datasets, works well on our task. Happy!

Other times...
  • we have to find a good one, or
  • design a brand-new architecture and train it.

SLIDE 3

One-Shot Neural Architecture Search

Joint optimization of architecture c and weights w:

    max_{w, c} f(w, c)

[Figure: a cell mapping x_t to x_{t+1} through four candidate operations: Conv 3x3 (weights W1), Conv 5x5 (weights W2), max pooling, avg pooling. Weights w = (W1, W2); the one-hot architecture vector c = (0, 0, 1, 0) selects max pooling.]

NAS as hyper-parameter search:

    max_c f(w*(c), c)   subject to   w*(c) = argmax_w f(w, c)

  • 1 evaluation of c = 1 training

One-shot NAS:
  • optimization of w and c within 1 training
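As a concrete illustration of the cell above, a one-hot architecture vector can be read as an operation selector. This is a hypothetical toy sketch (made-up names and stand-in operations, not the authors' implementation):

```python
import numpy as np

# Toy sketch: the one-hot architecture vector c selects one of four
# candidate operations mapping x_t to x_{t+1}; w = (W1, W2) holds the
# trainable weights of the two convolution candidates.
def forward(x, w, c):
    W1, W2 = w
    ops = [
        lambda v: W1 * v,   # stand-in for Conv 3x3 with weights W1
        lambda v: W2 * v,   # stand-in for Conv 5x5 with weights W2
        np.max,             # max pooling (collapsed to a scalar here)
        np.mean,            # avg pooling
    ]
    return ops[int(np.argmax(c))](x)

x_t = np.array([1.0, 2.0, 3.0])
y = forward(x_t, w=(0.5, 2.0), c=(0, 0, 1, 0))  # c selects max pooling
print(y)  # 3.0
```

With c fixed, only the selected operation's weights receive gradient signal, which is why evaluating a candidate c ordinarily costs one full training run.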

SLIDE 4

Difficulties for Practitioners

How to choose / tune the search strategy?

Both a search space and a search strategy must be chosen. Gradient-based method:

    w ← w + ε_w ∇_w f(w, c(θ))
    θ ← θ + ε_θ ∇_θ f(w, c(θ))

Other choices:
  • Evolutionary computation based
  • Reinforcement learning based

Hyper-parameter: step-size ε

  • How to treat integer variables such as #filters?
  • How to tune the hyper-parameters in such situations?
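To make the role of the step-size hyper-parameters concrete, here is a toy sketch of the alternating gradient updates above, applied to a made-up differentiable objective f(w, θ) = -(w - θ)² - (θ - 1)² (not a real NAS objective; its maximum is at w = θ = 1):

```python
# Alternating gradient ascent with two step sizes, on a toy objective.
def grad_w(w, th):
    return -2.0 * (w - th)                     # df/dw

def grad_th(w, th):
    return 2.0 * (w - th) - 2.0 * (th - 1.0)   # df/dtheta

w, th = 0.0, 0.0
eps_w, eps_th = 0.1, 0.1   # the step-size hyper-parameters in question
for _ in range(2000):
    w += eps_w * grad_w(w, th)
    th += eps_th * grad_th(w, th)
print(round(w, 3), round(th, 3))  # both converge to the optimum at 1.0
```

Too large a step size makes the iteration diverge; too small and it crawls. That tuning burden is exactly the practical difficulty the slide points out.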
SLIDE 5

Contributions

Novel search strategy for one-shot NAS:
  • 1. Arbitrary search space (categorical + ordinal)
  • 2. Robust against its inputs (hyper-parameters and search space)

Our approach:

1. Stochastic relaxation:

    max_{w, c} f(w, c)   ⇒   max_{w, θ} J(w, θ) := ∫ f(w, c) p(c | θ) dc

  • p(c | θ): exponential family, so J is differentiable w.r.t. both w and θ

2. Stochastic natural gradient + adaptive step-size:

    w^{t+1} = w^t + ε_w ∇̂_w J(w^t, θ^t)
    θ^{t+1} = θ^t + ε_θ F(θ^t)^{-1} ∇̂_θ J(w^{t+1}, θ^t)   (natural gradient; ∇̂ denotes a stochastic estimate)

Monotone improvement under an appropriate step-size:

    J(w^t, θ^t) < J(w^{t+1}, θ^t) < J(w^{t+1}, θ^{t+1})
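For the categorical family in expectation parametrization, the natural gradient has a simple closed form: F(θ)⁻¹ ∇_θ ln p(c | θ) = c − θ for one-hot c. Below is a minimal sketch of the resulting θ-update with a fixed step size (the talk's adaptive step-size mechanism is omitted, and the per-category utilities f are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 4
theta = np.full(K, 1.0 / K)            # categorical distribution over 4 choices
f = np.array([0.0, 0.2, 1.0, 0.1])     # hypothetical utility of each category
eps_theta = 0.1                        # fixed step size (no adaptation here)

for _ in range(300):
    ks = rng.choice(K, size=8, p=theta)   # sample c ~ p(. | theta)
    onehots = np.eye(K)[ks]
    u = f[ks] - f[ks].mean()              # baseline-subtracted utilities
    # stochastic natural gradient step: mean of u * (c - theta)
    theta += eps_theta * (u[:, None] * (onehots - theta)).mean(axis=0)
    theta = np.clip(theta, 1e-9, None)
    theta /= theta.sum()

print(int(np.argmax(theta)))  # the distribution concentrates on category 2
```

The (c − θ) form is what makes the update invariant to the chosen parametrization; the adaptive mechanism of the talk additionally scales the effective step size from the observed signal-to-noise of this stochastic estimate.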

SLIDE 6

Results and Details

  • Faster than, and competitive in accuracy with, other one-shot NAS methods

Details will be presented at Poster #53.