Deep Reinforcement Learning Building Blocks
Arjun Chandra, Research Scientist, Telenor Research / Telenor-NTNU AI Lab
arjun.chandra@telenor.com · @boelger
8 November 2017
https://join.slack.com/t/deep-rl-tutorial/signup
The Plan
The Problem

how to make decisions over time to maximise my return / “long term reward”?
emergence of locomotion
https://arxiv.org/abs/1707.02286
https://deepmind.com/blog/producing-flexible-behaviours-simulated-environments/
https://www.youtube.com/watch?v=hx_bgoTF7bs

As we know…
late 1980s — Rich Sutton et al.
1993 — RL for robots using NNs, L-J Lin, PhD thesis, CMU
1995 — Gerald Tesauro
2004 — Stanford (http://heli.stanford.edu/)
2013 — Vlad Mnih et al.
2015 — David Silver et al., Google DeepMind

Problem Characteristics
requires strategy
delayed consequences
dynamic uncertainty/volatility
uncharted/unimagined/exception laden
Image credit: http://wonderfulengineering.com/inside-the-data-center-where-google-stores-all-its-data-pictures/

machines with agency that learn, plan, and act to find a strategy for solving the problem
explore and exploit
probe and learn from feedback
autonomous to some extent
focus on the long-term objective
Solution
what is the sequence of actions I could take to maximise my return / “long term reward”?
Reinforcement Learning
the excruciatingly awesome MDP game!
you
env
interact to maximise long term reward

Inspired by Rich Sutton's tutorial: https://www.youtube.com/watch?v=ggqnxyjaKe4
the MDP (S, A, P, R, γ)
https://github.com/traai/basic-rl
(diagram: a two-state MDP with states A and B and actions 1 and 2; each transition is labelled with an immediate reward and a probability, e.g. R=10±3, P=1.00; R=-10±3, P=0.99; R=40±3, P=0.99; R=20±3, P=0.01; R=20±3, P=0.99; R=40±3, P=0.01; R=-10±3, P=0.01)

R: immediate reward function R(s, a)
P: state transition probability P(s'|s, a)
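A minimal sketch of encoding such an MDP in Python; the dict layout is illustrative (not the basic-rl repo's format), and the assignment of rewards/probabilities to particular transitions is one plausible reading of the diagram:

```python
import random

# One plausible encoding of the two-state MDP above: for each (state, action),
# a list of (next_state, probability, mean_reward); sampled rewards are
# mean +/- Gaussian noise with std 3, matching the "R = x +/- 3" labels.
MDP = {
    ('A', 1): [('A', 0.99, -10), ('B', 0.01, -10)],
    ('A', 2): [('B', 1.00,  10)],
    ('B', 1): [('A', 0.99,  40), ('B', 0.01,  40)],
    ('B', 2): [('B', 0.99,  20), ('A', 0.01,  20)],
}
GAMMA = 0.9  # discount factor (assumed value; the slides leave it unspecified here)

def step(state, action):
    """Sample one transition: returns (next_state, reward)."""
    outcomes = MDP[(state, action)]
    u, cum = random.random(), 0.0
    for next_state, p, mean_reward in outcomes:
        cum += p
        if u <= cum:
            return next_state, random.gauss(mean_reward, 3)
    # numerical fallback: return the last outcome
    s, _, m = outcomes[-1]
    return s, random.gauss(m, 3)
```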
the problem (cartoon of an MDP)
state · reward · action
?
agent’s job/goal? maximise expected cumulative reward/return
toy problem
home
state and action spaces
large
a good learning agent
size of state space = 100 × 100 × 100 × 100 × 100
can quantise state space differently
5 values belonging to 2 classes: {1, 2, 1, 2, 1} → size of state space = 2 × 2 × 2 × 2 × 2
in the toy problem? 9
reward
taking an action in some state results in an immediate reward (can be negative)
home
reward system should tell the agent:
what to achieve
rather than how to achieve
reward
this is all the feedback an agent gets!
immediate!
reward
but agent has to choose an action based on expected return
expected return
task
episodic
(there is an end)
continual
(there is no end)
episodic
(there is an end) agent taking finite (say 5) steps till the end... should act based on the
e.g. average of the following
R_0 = r_1 + r_2 + r_3 + r_4 + r_5
continual
(there is no end) agent can continue acting for infinite steps in time...
should discount future rewards and act based on e.g. average of the following
R_0 = r_1 + γ r_2 + γ² r_3 + γ³ r_4 + γ⁴ r_5 + …
discount
future reward is probably more uncertain than immediate reward
0 ≤ γ ≤ 1
shortsighted? γ = 0
farsighted? γ = 1
R_0 = r_1 + γ r_2 + γ² r_3 + γ³ r_4 + γ⁴ r_5 + …

R_t = Σ_{k=0}^{T} γ^k r_{t+k+1}

R_0 = Σ_{k=0}^{T} γ^k r_{k+1}
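In code, the discounted return is a one-liner. A minimal sketch (the reward list and γ = 0.9 are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma):
    """R_0 = sum over k of gamma^k * r_{k+1}, for rewards r_1, r_2, ..."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# e.g. a 5-step episode with an assumed gamma of 0.9:
# discounted_return([1.0, 0.0, -1.0, 0.0, 10.0], 0.9)
```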
expected return
R_t = Σ_{k=0}^{T} γ^k r_{t+k+1}
but these expected returns are not known to the agent beforehand!
what knowledge might the agent try to acquire to behave properly?
ranking/probability of an action in some state bringing max expected return (long term value)?
(grid of E{R} values, one per state) expected long term value of being in each state, under some action selection scheme?

(grid of E{R} values, one per state-action pair) expected long term value of taking some action in each state, then behaving using some action selection scheme?
modelling dynamics / mapping the environment?
prediction problem: learn to predict expected long term reward/value
control problem: learn to find the optimal action selection scheme/policy
value: how good is an action/state
policy: action selection
model: predict next state/reward to look ahead/plan
value based · policy based · model based
types of RL agents?
both value and policy
value/policy + model of dynamics
we will focus on value based RL in the first half
action selection?
?
values of each possible action in the current state help select actions!
expected return for carrying out an action is its value
policy can be derived from value (e.g. act greedily)
<<expected returns unknown>>
<<actions based on unknowns>>
but what are these values?
value can be estimated by sampling environment while acting using some policy
e.g. act, accumulate new reward (ground truth), and update
agent maintains values for actions within each state
selects actions using these values under some “policy”
agent maintains state values
selects actions using these values under some “policy”
but… agent needs a model of the environment!
home
9 states · 10^16992 states (pixels) · 10^308 states (RAM) · continuous!
extract features that help generalise across states
state s → features → action values given state
policy?
probability of choosing an action in state / feature representation thereof
Q^π(s,a), V^π(s)
usual policies

greedy: choose best action
ε-greedy: choose best action with probability 1−ε
soft-max: choose action with probability given by its value
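A minimal sketch of these three action-selection rules over a dict of action values (the ε and temperature defaults are illustrative):

```python
import math
import random

def greedy(q_values):
    """Pick the action with the highest estimated value."""
    return max(q_values, key=q_values.get)

def epsilon_greedy(q_values, epsilon=0.1):
    """Explore uniformly with probability epsilon, exploit otherwise."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return greedy(q_values)

def soft_max(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / temperature)."""
    actions = list(q_values)
    weights = [math.exp(q_values[a] / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```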
exploration vs. exploitation
trial and error
a/b testing: try new website feature
ads: try new ads
smart camera networks: try new comm. protocol
game play: try new moves

Static, dynamic and adaptive heterogeneity in socio-economic distributed smart camera networks, P. R. Lewis, L. Esterle, A. Chandra, et al.
Q*(s,a) = max_π Q^π(s,a)

V*(s) = max_π V^π(s)

π*(a|s) = { 1 if a = argmax_a Q*(s,a); 0 otherwise }
the current state (or state-action pair) has an estimated value (say zero/random initially), which can be used together with r_{t+1} to update the value of the previous state (or state-action pair)
estimation?
<<use currently visible returns to update values of where you are coming from>>
(transition: in state s_t take action a_t, observe s_{t+1} and r_{t+1})
i.e.

new value = old value + fraction × (currently visible return − old value)
          = (1 − fraction) × old value + fraction × currently visible return

the currently visible return r_{t+1} + γ E{R_{t+1}}, estimated as r_{t+1} + γ Q(s_{t+1}, a_{t+1}), stands in for the unknown R_t = Σ_{k=0}^{T} γ^k r_{t+k+1};
so Q(s,a) updates towards r_{t+1} + γ Q(s', a')
e.g.
V(s) ← V(s) + α (r_s^a + γ V(s') − V(s))   under some policy 𝜌(a|s)

Q(s,a) ← Q(s,a) + α (r_s^a + γ Q(s',a') − Q(s,a))   under some policy 𝜌(a|s)

e.g. update a lookup table maintaining expected returns
let’s play with a version of the above update rule:

Q(s,a) ← Q(s,a) + α (r_s^a + γ Q(s',a') − Q(s,a))

Q(s,a) ← Q(s,a) + α (r_s^a + γ max_{a'} Q(s',a') − Q(s,a))

max_{a'} indicates a' to be the action with maximum value in next state s'
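As a sketch, one such Q-learning backup on a lookup table kept as a dict of dicts (Q[s][a]); the α and γ defaults match the walkthrough below, and the done flag is an assumed convention for episode ends:

```python
def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.5):
    """One tabular Q-learning backup: nudge Q(s,a) a fraction alpha towards
    the target r + gamma * max_a' Q(s',a') (just r if the episode ended)."""
    target = r if done else r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])
```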
lookup table

(table: states 1–9 × actions N, S, E, W; the states form a 3×3 grid world with home as the goal)
reward structure?
home
let’s fix α = 0.1, γ = 0.5
episode 1 begins... say ε-greedy policy…
Q(s,a) ← Q(s,a) + α (r_s^a + γ max_{a'} Q(s',a') − Q(s,a))   with α = 0.1, γ = 0.5

(a run of slides steps through episode 1 on the 3×3 grid, applying this update to the lookup table after each transition)

episode 1 ends.
let’s work out the next episode, starting at state 4: go WEST and then SOUTH
how does the table change?
and the next episode, starting at state 3:
go WEST → SOUTH → WEST → SOUTH
what we just saw was some episodes of Q-learning
values update towards value of optimal policy: target comes from value of assumed next best action

SARSA-learning?
values update towards value of current policy: target comes from value of the actual next action
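Side by side, the two update loops differ only in how the target is formed. A sketch, assuming a minimal environment API (reset() → state, step(action) → (next_state, reward, done)) and a policy like the ε-greedy function above:

```python
def sarsa_episode(env, Q, policy, alpha=0.1, gamma=0.5):
    """One SARSA episode: the target uses the action the behaviour policy
    actually takes next, so values track the current policy (on-policy)."""
    s = env.reset()                     # assumed API: reset() -> state
    a = policy(Q[s])
    done = False
    while not done:
        s_next, r, done = env.step(a)   # assumed: step(a) -> (s', r, done)
        if done:
            target = r
        else:
            a_next = policy(Q[s_next])  # pick the next action first...
            target = r + gamma * Q[s_next][a_next]
        Q[s][a] += alpha * (target - Q[s][a])
        if not done:
            s, a = s_next, a_next       # ...then actually take it

def q_learning_episode(env, Q, policy, alpha=0.1, gamma=0.5):
    """One Q-learning episode: the target assumes the next best action,
    whatever the behaviour policy then does (off-policy)."""
    s = env.reset()
    done = False
    while not done:
        a = policy(Q[s])
        s_next, r, done = env.step(a)
        target = r if done else r + gamma * max(Q[s_next].values())
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next
```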
Q vs SARSA (ε: 0.1, γ: 1.0)
Q: data not generated by target policy
SARSA: data generated by target policy

Image credit: Andreas Tille (Own work) [GFDL (www.gnu.org/copyleft/fdl.html) or CC-BY-SA-3.0-2.5-2.0-1.0 (www.creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons
Example credit: Travis DeWolf: https://studywolf.wordpress.com/ and https://git.io/vFBvv

Problem Decomposition
solution to sub-problem informs solution to whole problem
nested sub-problems
Bellman Expectation Backup
(backup diagram: from state s with value v(s), over actions a with values q(s,a), rewards r, next states s' with values v(s'), and next actions a' with values q(s',a'); Value of a node = Σ P(path) × Value(path))

Bellman expectation equations under a given policy:

q_π(s,a) = r_s^a + γ Σ_{s'} P^a_{ss'} Σ_{a'} π(a'|s') q_π(s',a')

v_π(s) = Σ_a π(a|s) ( r_s^a + γ Σ_{s'} P^a_{ss'} v_π(s') )

system of linear equations; solution: value of policy

Bellman Optimality Backup
(backup diagram as above, but maximising over actions)

Bellman optimality equations under the optimal policy:

q*(s,a) = r_s^a + γ Σ_{s'} P^a_{ss'} max_{a'} q*(s',a')

v*(s) = max_a ( r_s^a + γ Σ_{s'} P^a_{ss'} v*(s') )

system of non-linear equations; solution: value of optimal policy

Value Based
Dynamic Programming

(grid world example: states 1–4, actions N, S, E, W, goal reward 10)

Policy Iteration
Evaluate Policy (sweep, apply Bellman expectation) → Improve Policy (greedy)

q_π(s,a) = r_s^a + γ Σ_{s'} P^a_{ss'} Σ_{a'} π(a'|s') q_π(s',a')

π(W|2): 1.0 (greedy)
π(S|1): 1.0 (greedy)

iteratively apply Bellman expectation equations in inner loop until values do not change much; use greedy policy, given new values
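A sketch of this loop, assuming model access in a hypothetical encoding (not the slides'): P[s][a] = [(prob, next_state), ...] and R[s][a] = immediate reward r_s^a:

```python
import random

def policy_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Policy iteration sketch: alternate evaluation sweeps and greedy
    improvement until the policy stops changing."""
    policy = {s: random.choice(actions) for s in states}
    V = {s: 0.0 for s in states}
    while True:
        # 1. policy evaluation: sweep with the Bellman expectation equation
        while True:
            delta = 0.0
            for s in states:
                a = policy[s]
                v = R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta < theta:
                break
        # 2. policy improvement: act greedily w.r.t. the new values
        stable = True
        for s in states:
            best = max(actions,
                       key=lambda a: R[s][a]
                       + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            if best != policy[s]:
                policy[s], stable = best, False
        if stable:
            return policy, V
```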
Value Iteration

Find Optimal Value and Policy (sweep, apply Bellman optimality)

q*(s,a) = r_s^a + γ Σ_{s'} P^a_{ss'} max_{a'} q*(s',a')

e.g. a first sweep from zero initial values, with γ = 0.9:
N: -5 + 0.9×0   E: -5 + 0.9×0   S: -10 + 0.9×0   W: -1 + 0.9×0
N: -5 + 0.9×0   E: -1 + 0.9×0   S: 10 + 0.9×0   W: -5 + 0.9×0

iteratively apply Bellman optimality equations until values do not change much
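And the corresponding value-iteration sweep, under the same assumed model encoding:

```python
def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """Value iteration sketch: sweep with the Bellman optimality equation
    until the largest change across states falls below theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in actions)
            delta, V[s] = max(delta, abs(v - V[s])), v
        if delta < theta:
            break
    # read off the greedy (optimal) policy from the converged values
    return {s: max(actions,
                   key=lambda a: R[s][a]
                   + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            for s in states}, V
```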
Bellman backups

largest distance between two value functions (v1, v2) decreases after a Bellman backup (a contraction)
From DP to Learning
full-width backups to sample backups

Full-width Backup
Backup with Sample Return
Backup with Guess

(each of these slides shows a backup tree from the current state down to terminal nodes T)

Incremental Updates
E{R} ≈ μ_k = (1/k) Σ_{τ=1}^{k} R_τ   (batched)

μ_k = μ_{k−1} + (1/k) (R_k − μ_{k−1})   (incremental)

μ_k = μ_{k−1} + α (R_k − μ_{k−1})   (running; saw this in Q-learning!)
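A sketch of the running form (the names are illustrative):

```python
def running_mean_update(mu, R, alpha):
    """Incremental estimate of E{R}: mu <- mu + alpha * (R - mu).
    alpha = 1/k recovers the exact sample mean after k returns; a constant
    alpha gives a running mean that can track non-stationary returns."""
    return mu + alpha * (R - mu)
```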
Sample and Bootstrap

(diagram: methods laid out along two axes for estimating returns; sampling runs from full-width backup (exhaustive search, dynamic programming) to sample backup, and bootstrapping (𝜇) runs from deep backup (full trajectory returns) to shallow backup (step returns / guess))

estimating returns, towards achieving returns; it all comes down to: s, r, a and Q / V / 𝜌

Q-learning
from full-width backups to sample backups; target policy ≠ behaviour policy

SARSA
from full-width backups to sample backups; target policy same as behaviour policy

scaling up RL with function approximation
e.g. linear approximation:

Q_θ(s,a) = θ_0 f_0(s,a) + θ_1 f_1(s,a) + ... + θ_n f_n(s,a)

Q_target = r_s^a + γ max_{a'} Q(s',a')

θ ← θ − α ∇_θ ½ (Q_target − Q_θ(s,a))²
Approximate Q-learning

Say θ ∈ ℝ^{S×A}, so Q_θ(s,a) = θ_sa and Q_target = r_s^a + γ max_{a'} Q(s',a'):

θ_sa ← θ_sa − α ∇_{θ_sa} ½ (Q_target − θ_sa)²
θ_sa ← θ_sa − α (−Q_target + θ_sa)
θ_sa ← θ_sa + α (Q_target − θ_sa)
θ_sa ← (1 − α) θ_sa + α Q_target   (tabular equivalent)

gradient updates equivalent to tabular Q updates
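A sketch of the general linear case; the feature vectors and the α, γ defaults are illustrative:

```python
def linear_q(theta, features):
    """Q_theta(s,a) = theta . f(s,a) for a feature vector f(s,a)."""
    return sum(t * f for t, f in zip(theta, features))

def approx_q_update(theta, features, r, q_next_max, alpha=0.01, gamma=0.99):
    """Semi-gradient Q-learning step on 1/2 (Q_target - Q_theta(s,a))^2:
    the gradient w.r.t. theta is -(Q_target - Q_theta(s,a)) * f(s,a),
    so each weight moves in proportion to its feature's activation."""
    q = linear_q(theta, features)
    target = r + gamma * q_next_max  # q_next_max = max_a' Q(s', a')
    return [t + alpha * (target - q) * f for t, f in zip(theta, features)]
```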
DQN
(diagram: Agent with a NN and an experience Buffer, acting towards a Goal)

human level game control
Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015
neural network
backpropagation
What is the target against which to minimise error?
experience replay buffer
save transition in memory; randomly sample from memory for training ≈ i.i.d.
(transitions: s_t, a_t, r_{t+1}, s_{t+1})

freeze target
https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
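A sketch of the replay-buffer half of this recipe (capacity and batch size are illustrative); target freezing is noted in a comment rather than a full network implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay sketch: store transitions, sample uniformly so
    training minibatches are approximately i.i.d."""
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # old transitions fall off

    def save(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        return random.sample(list(self.buffer), batch_size)

# Target freezing, schematically: keep a second copy of the Q-network's
# parameters for computing r + gamma * max_a' Q_frozen(s', a'), and copy
# the online parameters into the frozen copy only every N steps.
```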
Human-level control through deep reinforcement learning, Mnih et al., Nature 518, Feb 2015

however training is
SLOOOOOo….W
parallelise…
Parallel Asynchronous Training
shared parameters · parallel agents · lock-free updates · value and policy based methods

Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016
https://youtu.be/0xo1Ldx3L5Q

(diagram: many Agent Copies sharing parameters with a central Agent)
shared params · parallel learners · HOGWILD! updates
https://github.com/traai/async-deep-rl
Policy Based
state s → features → policy 𝜌(a|s)
Intuition
τ : s_1, a_1, r_1, s_2, a_2, r_2, ..., s_{H−1}, a_{H−1}, r_{H−1}

(three sampled trajectories from home, with returns R_{τ1} = 10, R_{τ2} = 5, R_{τ3} = 2)
Intuition

probabilities are relative: 𝜌(a|s) along a path with high return should be higher
(R_{τ1} = 10, R_{τ2} = 5, R_{τ3} = 2)

Revisiting the Objective

τ : s_1, a_1, r_1, s_2, a_2, r_2, ..., s_{H−1}, a_{H−1}, r_{H−1}

max_θ E_τ { Σ_{t=0}^{H−1} r(s_t, a_t) | π_θ }

max_θ J(θ) = max_θ Σ_τ P(τ|θ) R(τ)

Samples Gradient
J(θ) = Σ_τ P(τ|θ) R(τ)

max_θ J(θ):   θ ← θ + ∇_θ J(θ)

∇_θ J(θ) = ∇_θ Σ_τ P(τ|θ) R(τ)
= Σ_τ ∇_θ P(τ|θ) R(τ)
= Σ_τ P(τ|θ) (∇_θ P(τ|θ) / P(τ|θ)) R(τ)
= Σ_τ P(τ|θ) ∇_θ log P(τ|θ) R(τ)

∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} ∇_θ log P(τ^(i)|θ) R(τ^(i))   (gradient via sampling)
∇_θ log P(τ|θ) = ∇_θ log [ Π_{t=0}^{H−1} P(s_{t+1}|s_t,a_t) · π_θ(a_t|s_t) ]
= ∇_θ [ Σ_{t=0}^{H−1} log P(s_{t+1}|s_t,a_t) + Σ_{t=0}^{H−1} log π_θ(a_t|s_t) ]
= ∇_θ Σ_{t=0}^{H−1} log π_θ(a_t|s_t)

∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} ( Σ_{t=0}^{H−1} ∇_θ log π_θ(a_t^(i)|s_t^(i)) ) R(τ^(i))

(diagram: a trajectory through the dynamics model, with the policy 𝜌(a|s) choosing actions over horizon H)

Dynamics Model
∇_θ J(θ) ≈ (1/m) Σ_{i=1}^{m} ( Σ_{t=0}^{H−1} ∇_θ log π_θ(a_t^(i)|s_t^(i)) ) R(τ^(i))

the dynamics model drops out: no model needed

∇_θ log π_θ(a_t|s_t) R(τ): for each action a_t in state s_t during each of the m trajectories,
to increase π_θ(a_t|s_t), move by Δθ ∝ ∇_θ log π_θ(a_t|s_t) × R(τ)
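A sketch of this Monte Carlo estimator; grad_log_pi is an assumed callable returning the score vector ∇_θ log π_θ(a|s) for one step, and each trajectory is a (steps, return) pair:

```python
def reinforce_gradient(trajectories, grad_log_pi):
    """Estimate grad J ~ (1/m) * sum_i [sum_t grad log pi(a_t|s_t)] * R(tau_i).
    trajectories: list of (steps, R) where steps = [(s_0, a_0), (s_1, a_1), ...]
    and R is the trajectory's full return."""
    m = len(trajectories)
    first_s, first_a = trajectories[0][0][0]
    grad = [0.0] * len(grad_log_pi(first_s, first_a))
    for steps, R in trajectories:
        for s, a in steps:
            for j, g in enumerate(grad_log_pi(s, a)):
                grad[j] += g * R / m
    return grad
```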
Noisy Gradient

(every action along a trajectory is weighted by the same full-horizon return R)

to increase π_θ(a_t|s_t): Δθ ∝ ∇_θ log π_θ(a_t|s_t) × R(τ)
Reduce Noise

use only the return from time t onwards:

to increase π_θ(a_t|s_t): Δθ ∝ ∇_θ log π_θ(a_t|s_t) × R(τ_{t onwards})
Reduce Noise

subtract a baseline b, e.g. V = E{R|s} (how much is the action better than average):

to increase π_θ(a_t|s_t): Δθ ∝ ∇_θ log π_θ(a_t|s_t) × (R(τ_{t onwards}) − b)

Reduce Noise
with b = V(s_t):

to increase π_θ(a_t|s_t): Δθ ∝ ∇_θ log π_θ(a_t|s_t) × (R(τ_{t onwards}) − V(s_t))

Actor-Critic
Reduce Noise

critic Q = E{R|s,a} = E{r + γV} (expected long term value of an action):

to increase π_θ(a_t|s_t): Δθ ∝ ∇_θ log π_θ(a_t|s_t) × (Q(s_t,a_t) − V(s_t))

Reduce Noise
A = Q − V, the advantage of an action (how much is the action better than average):

to increase π_θ(a_t|s_t): Δθ ∝ ∇_θ log π_θ(a_t|s_t) × A(s_t,a_t)
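As a sketch, one advantage-weighted update direction, with the critic's Q and V passed in as assumed callables:

```python
def advantage_update_direction(grad_log_pi, s, a, Q, V):
    """Delta-theta direction ~ grad log pi(a|s) * A(s,a), A(s,a) = Q(s,a) - V(s).
    A positive advantage pushes pi(a|s) up, a negative one pushes it down,
    with far lower variance than weighting by raw returns."""
    advantage = Q(s, a) - V(s)
    return [g * advantage for g in grad_log_pi(s, a)]
```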
parallelise…
Parallel Asynchronous Training
shared parameters · parallel agents · lock-free updates · value and policy based methods

Asynchronous Methods for Deep Reinforcement Learning, Mnih et al., ICML 2016
https://youtu.be/0xo1Ldx3L5Q
https://youtu.be/Ajjc08-iPx8
https://youtu.be/nMR5mjCFZCw

(diagram: many Agent Copies sharing parameters with a central Agent)
shared params · parallel learners · HOGWILD! updates
https://github.com/traai/async-deep-rl
PAAC (Parallel Advantage Actor-Critic)

Efficient Parallel Methods for Deep Reinforcement Learning
1 GPU/CPU · SOTA performance · reduced training time
https://github.com/alfredvc/paac (Alfredo Clemente)

code for you to play with...
Rich Sutton's book examples (exhaustive, must try!):
https://github.com/ShangtongZhang/reinforcement-learning-an-introduction

Telenor's implementation of asynchronous parallel methods:
https://github.com/traai/async-deep-rl

Alfredo's faster parallel methods:
https://github.com/alfredvc/paac

…
Next lecture: Applications (and some hacking) November 21, 2017
https://join.slack.com/t/deep-rl-tutorial/signup
Inspired to code/apply RL?