Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search



  1. Adaptive Stochastic Natural Gradient Method for One-Shot Neural Architecture Search ○Youhei Akimoto (University of Tsukuba / RIKEN AIP) Shinichi Shirakawa (Yokohama National University) Nozomu Yoshinari (Yokohama National University) Kento Uchida (Yokohama National University) Shota Saito (Yokohama National University) Kouhei Nishida (Shinshu University)

  2. Neural Architecture: Neural network architectures (VGGNet, ResNet, Inception, …) are often pre-trained on some datasets. Given a task (dataset), architecture design is trial and error. Sometimes a known architecture works well on our task. Happy! Other times we have to find a good one, or design a brand-new architecture and train it.

  3. One-Shot Neural Architecture Search: Joint Optimization of Architecture c and Weights w.
  NAS as hyper-parameter search (1 evaluation = 1 training of the weights):
    $\max_{c} f(w^*(c), c)$ subject to $w^*(c) = \operatorname{argmax}_{w} f(w, c)$
  One-shot NAS (optimization of w and c within a single training run; see the sketch below):
    $\max_{w, c} f(w, c)$
  Example search space: candidate operations {Conv 3x3, Conv 5x5, max pooling, avg pooling}, weights w = (W1, W2), architecture c = (0, 0, 1, 0).
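A minimal, self-contained toy sketch (my own illustration, not from the slides) contrasting the two formulations above. Here f(w, c) is a toy quadratic surrogate for validation performance, c one-hot selects one of two operations, and TARGETS, train_w, and the step sizes are all assumed for illustration; in real NAS the inner call train_w(c) would be a full network training.

```python
import numpy as np

# Toy surrogate for f(w, c): c is a one-hot choice between two "operations"
# with weights w = (W1, W2); each operation has a different optimal weight.
TARGETS = np.array([1.0, -2.0])

def f(w, c):
    i = int(np.argmax(c))
    return -(w[i] - TARGETS[i]) ** 2

def train_w(c, steps=50, lr=0.2):
    """Inner optimization w*(c) = argmax_w f(w, c), by gradient ascent."""
    w = np.zeros(2)
    i = int(np.argmax(c))
    for _ in range(steps):
        w[i] += lr * 2.0 * (TARGETS[i] - w[i])     # df/dw_i
    return w

candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]

# NAS as hyper-parameter search: every evaluation of f(w*(c), c) pays for a full training run.
best_c = max(candidates, key=lambda c: f(train_w(c), c))

# One-shot NAS: w and c are updated together within one training loop (max over w and c jointly).
w, c = np.zeros(2), candidates[0]
for _ in range(50):
    i = int(np.argmax(c))
    w[i] += 0.2 * 2.0 * (TARGETS[i] - w[i])        # one weight step under the current c
    c = max(candidates, key=lambda cc: f(w, cc))   # one architecture step
print("conventional:", best_c, "  one-shot:", c, w)
```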

  4. Difficulties for Practitioners: How to choose and tune the search strategy (search space and search strategy)?
  Gradient-based methods (a sketch follows below):
    $w \leftarrow w + \epsilon_w \nabla_w f(w, c(\theta))$,  $\theta \leftarrow \theta + \epsilon_\theta \nabla_\theta f(w, c(\theta))$
    hyper-parameter: step-size
  Other choices: evolutionary-computation-based, reinforcement-learning-based.
  Open questions: How to treat integer variables such as #filters? How to tune the hyper-parameters in such situations?
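A minimal sketch (my own illustration, not the authors' method) of the gradient-based update above, using a softmax relaxation c(θ) over two candidate operations. The toy objective f, TARGETS, and the values of step_w and step_theta are assumptions; the step sizes are exactly the hyper-parameters the slide says must be tuned.

```python
import numpy as np

TARGETS = np.array([1.0, -2.0])

def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def f(w, c):
    # Same toy objective as in the previous sketch, but with a soft architecture c = softmax(theta).
    return -np.sum(c * (w - TARGETS) ** 2)

def grads(w, theta):
    c = softmax(theta)
    grad_w = -2.0 * c * (w - TARGETS)          # df/dw
    df_dc = -(w - TARGETS) ** 2
    jac = np.diag(c) - np.outer(c, c)          # Jacobian of the softmax
    grad_theta = jac @ df_dc                   # df/dtheta by the chain rule
    return grad_w, grad_theta

w, theta = np.zeros(2), np.zeros(2)
step_w, step_theta = 0.1, 0.5                  # the step-size hyper-parameters
for _ in range(200):
    grad_w, grad_theta = grads(w, theta)
    w = w + step_w * grad_w                    # w <- w + eps_w * grad_w f(w, c(theta))
    theta = theta + step_theta * grad_theta    # theta <- theta + eps_theta * grad_theta f(w, c(theta))
print("c(theta) =", softmax(theta), " w =", w)
```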

  5. Contributions: a novel search strategy for one-shot NAS that (1) works on an arbitrary search space (categorical + ordinal variables) and (2) is robust against its inputs (hyper-parameters and search space).
  Our approach (a sketch of the θ update follows below):
  1. Stochastic relaxation with an exponential family p(c | θ):
    $\max_{w, c} f(w, c) \;\Rightarrow\; \max_{w, \theta} J(w, \theta) := \int f(w, c)\, p(c \mid \theta)\, dc$
    J is differentiable w.r.t. both w and θ.
  2. Stochastic natural gradient with adaptive step-size:
    $w^{t+1} = w^t + \epsilon_w^t\, \widehat{\nabla}_w J(w^t, \theta^t)$
    $\theta^{t+1} = \theta^t + \epsilon_\theta^t\, F(\theta^t)^{-1} \widehat{\nabla}_\theta J(w^{t+1}, \theta^t)$  (natural gradient)
  Under an appropriate step-size, monotone improvement holds: $J(w^t, \theta^t) < J(w^{t+1}, \theta^t) < J(w^{t+1}, \theta^{t+1})$.
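A minimal sketch (my own illustration, not the authors' released implementation) of the stochastic natural gradient step for θ when p(c | θ) is a single categorical distribution in its expectation parameterization, where the natural gradient of log p(c | θ) takes the simple form onehot(c) − θ. The paper's adaptive step-size mechanism and the simultaneous w update are omitted here; values, f_of_c, lam, and step_theta are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_onehot(theta, rng):
    """Sample an architecture variable c ~ p(c | theta) as a one-hot vector."""
    i = rng.choice(len(theta), p=theta)
    c = np.zeros_like(theta)
    c[i] = 1.0
    return c

def natural_grad_step(theta, f_of_c, lam=2, step_theta=0.1, rng=rng):
    """One Monte-Carlo estimate of F(theta)^-1 grad_theta J at fixed w.
    For a categorical p(c | theta) in expectation parameters, the natural
    gradient of log p(c | theta) is (c - theta)."""
    samples = [sample_onehot(theta, rng) for _ in range(lam)]
    scores = np.array([f_of_c(c) for c in samples])
    utilities = scores - scores.mean()                       # baseline for variance reduction
    nat_grad = sum(u * (c - theta) for u, c in zip(utilities, samples)) / lam
    theta = np.clip(theta + step_theta * nat_grad, 1e-6, None)
    return theta / theta.sum()                               # keep theta a probability vector

# Toy objective over 4 candidate operations (stands in for f(w, c) at the current w).
values = np.array([0.1, 0.2, 1.0, 0.3])
f_of_c = lambda c: float(values @ c)

theta = np.full(4, 0.25)
for _ in range(300):
    theta = natural_grad_step(theta, f_of_c)
print(theta)   # the probability mass should concentrate on the third operation
```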

  6. Results and Details: faster than, and competitive in accuracy with, other one-shot NAS methods. The details will be explained at Poster #53.
