Deep Hep Reading Group: 1611.05763 Learning To Reinforcement Learn

SLIDE 1

Deep Hep Reading Group

1611.05763 Learning To Reinforcement Learn
1611.02779 RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

SLIDE 2

Schematic

  • Approach for solving a Markov Decision Process
  • Agent interacts with an environment
    – Takes actions to move from one state to another
    – Is rewarded or penalized in the process
  • Example: grid world
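The agent-environment loop described above can be sketched for a one-dimensional grid world; the class, the reward scheme, and the random policy below are illustrative, not from the slides:

```python
import random

class GridWorld:
    """States 0..size-1; reaching state size-1 pays reward 1 and ends the episode."""
    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def step(self, action):  # action: -1 (move left) or +1 (move right)
        self.state = min(max(self.state + action, 0), self.size - 1)
        done = self.state == self.size - 1
        reward = 1.0 if done else 0.0   # rewarded only on reaching the goal
        return self.state, reward, done

env = GridWorld()
random.seed(0)
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, +1])          # a (random) policy
    state, reward, done = env.step(action)
    total_reward += reward
print(total_reward)  # the episode ends with reward 1.0
```

A real agent would replace the random choice with a learned policy; the interaction loop itself stays the same.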
SLIDE 3

Notation

If there exists an optimal policy, one can similarly define the cumulative regret.
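The formula this slide refers to did not survive extraction; for the bandit tasks used later (the cumulative expected regret R_T of Figure 2), the standard definition is presumably what was meant, where μ_a is the mean payout of arm a and a_t is the arm pulled at step t:

```latex
R_T \;=\; \sum_{t=1}^{T} \left( \mu^{*} - \mu_{a_t} \right),
\qquad \mu^{*} = \max_{a} \mu_{a}
```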

SLIDE 4
Strategies

  • Value, Q-value iteration
    – Define the value V(s) of a state, or Q(s,a) of a state and action, based on taking the optimal action from that state (action) until the end. Easy to do when the horizon T is small.
    – Iterate in the size of T
  • Policy iteration
    – Similar, but don't use the optimal policy; iteratively improve the policy.
  • Good for gridworld, bad for Atari
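As a concrete sketch of the Q-value iteration described above, here is finite-horizon Q iteration on a small deterministic grid world; the MDP, the discount factor, and all names are illustrative, not from the slides:

```python
import numpy as np

n_states, n_actions, T, gamma = 5, 2, 10, 0.9
# P[s, a] = deterministic next state; action 0 = left, action 1 = right.
P = np.array([[max(s - 1, 0), min(s + 1, n_states - 1)] for s in range(n_states)])
R = np.zeros((n_states, n_actions))
R[P == n_states - 1] = 1.0                # reward 1 for stepping onto the goal

Q = np.zeros((n_states, n_actions))
for _ in range(T):                         # iterate in the size of the horizon T
    V = Q.max(axis=1)                      # V(s) = value of the best action
    V[-1] = 0.0                            # goal is terminal: no future value
    Q = R + gamma * V[P]                   # backup: Q(s,a) = r + gamma * V(s')
print(Q.max(axis=1))
```

Each backup propagates value one more step out from the goal, which is why a horizon of T passes suffices when T is small.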

SLIDE 5

DQN

  • For large-sized games, can't use exact iteration.
  • Instead, model Q parametrically as Q(θ). Why not make this a deep neural net?
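A minimal sketch of the parametric idea: Q(s, a; θ) trained by a semi-gradient TD update. A linear model stands in for the deep network here, and every name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, lr = 4, 2, 0.99, 0.1
theta = np.zeros((n_features, n_actions))       # theta parameterizes Q

def q_values(phi, theta):
    return phi @ theta                           # Q(s, .; theta) from features phi

def td_update(phi, a, r, phi_next, done, theta):
    target = r + (0.0 if done else gamma * q_values(phi_next, theta).max())
    td_error = target - q_values(phi, theta)[a]
    theta[:, a] += lr * td_error * phi           # semi-gradient step on theta
    return td_error

# One illustrative transition: feature vectors for s and s', action 0, reward 1.
phi, phi_next = rng.random(n_features), rng.random(n_features)
before = q_values(phi, theta)[0]
td_update(phi, 0, 1.0, phi_next, done=True, theta=theta)
after = q_values(phi, theta)[0]
print(before, after)   # Q(s, a; theta) moves toward the target of 1.0
```

Swapping the linear model for a neural network, plus experience replay and a target network, gives DQN proper.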

SLIDE 6

Trajectory Dependence

Natural Generalizations

Vanilla RL → 1611.05763, 1611.02779: train on varied problems

SLIDE 7

Trajectory Dependence

  • Use LSTM to retain information
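Concretely, both papers let the recurrent core see the previous action and reward alongside the current observation, so the hidden state can accumulate trajectory information within an episode. A sketch of that input construction (names are illustrative):

```python
import numpy as np

def recurrent_input(obs, prev_action, prev_reward, n_actions):
    one_hot = np.zeros(n_actions)
    one_hot[prev_action] = 1.0
    # [observation, one-hot previous action, scalar previous reward]
    return np.concatenate([obs, one_hot, [prev_reward]])

x = recurrent_input(np.array([0.5, -0.2]), prev_action=1, prev_reward=1.0, n_actions=2)
print(x)  # a 5-vector: obs (2), one-hot action (2), reward (1)
```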
SLIDE 8

Trajectory Dependence

Natural Generalizations

Vanilla RL → 1611.05763, 1611.02779: train on varied problems

SLIDE 9

1611.05763 Idea

  • Train LSTM to learn structure-dependent policies:

Some Examples

SLIDE 10

1611.05763 Training

  • Fix MDP distribution D:
    – Sample from D, run for time T
    – Repeat many times
  • Details were varied slightly depending on D
  • Main point: the agent gets good at all tasks from D, not just at a particular instance.
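The training loop above can be sketched as follows; `D_sample`, the episode length, and the placeholder random policy are illustrative stand-ins, not the paper's code:

```python
import random

random.seed(0)

def D_sample():
    """D: two-armed bandits with correlated arms, p1 = 1 - p2."""
    p1 = random.random()
    return (p1, 1.0 - p1)

def run_episode(task, T):
    p, total = task, 0.0
    for _ in range(T):
        arm = random.randrange(2)          # placeholder for the agent's policy
        total += 1.0 if random.random() < p[arm] else 0.0
        # (a real agent would update its recurrent state here)
    return total

# Sample from D, run for time T, repeat many times.
episodes = [run_episode(D_sample(), T=100) for _ in range(1000)]
print(sum(episodes) / len(episodes))  # a random policy averages ~50 per episode
```

A learning agent trained this way is rewarded for doing well across all tasks drawn from D, which is the point the slide makes.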

SLIDE 11

1611.05763 Bandit Tasks

  • Two-armed bandit: each arm has probability pi of paying out 1, otherwise gives 0.
  • Two-armed bandit with correlated arms: p1 = 1 - p2
  • Deferred gratification:
    – Among 11 arms, 1 random arm gives a high reward, 9 give a low one; arm 11 encodes which arm is high, but itself gives a low payout
  • Goosed-up bandit with images
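A sketch of the deferred-gratification task, with illustrative payout values (the slide specifies only the structure, not the numbers):

```python
import random

class DeferredBandit:
    """11 arms: one of arms 0-9 pays high, the rest low; arm 10 is informative."""
    def __init__(self, rng):
        self.high_arm = rng.randrange(10)          # hidden target arm

    def pull(self, arm):
        if arm == 10:                              # the informative arm
            return 0.1 * self.high_arm             # low payout that encodes the target
        return 5.0 if arm == self.high_arm else 1.0

rng = random.Random(0)
bandit = DeferredBandit(rng)
hint = bandit.pull(10)                 # pay the cost of pulling the informative arm
decoded = int(round(hint / 0.1))       # decode which arm is high
print(bandit.pull(decoded))            # → 5.0
```

An agent that has learned the task structure accepts the low payout of arm 11 once, then exploits the high arm for the rest of the episode.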
SLIDE 12

1611.05763 Results

Figure 2: Performance on independent- and correlated-arm bandits. We report performance as the cumulative expected regret RT for 150 test episodes, averaged over the top 5 hyperparameters for each agent-task configuration, where the top 5 were determined based on performance on a separate set of 150 test episodes. (a) LSTM A2C trained and evaluated on bandits with independent arms (distribution Di; see text), and compared with theoretically optimal models. (b) A single agent playing the medium-difficulty task with distribution Dm. Suboptimal arm pulls over trials are depicted for 300 episodes. (c) LSTM A2C trained and evaluated on bandits with dependent uniform arms (distribution Du), (d) trained on medium bandit tasks (Dm) and tested on easy (De), and (e) trained on medium (Dm) and tested on the hard task (Dh). (f) Cumulative regret for all possible combinations of training and testing environments (Di, Du, De, Dm, Dh).

SLIDE 13

1611.05763 Deferred Gratification

SLIDE 14

Goosed Bandit

Figure 6: Learning abstract task structure in a visually rich 3D environment. a-c) Example of a single trial, beginning with a central fixation, followed by two images with random left-right placement. d) Average performance (measured in average reward per trial) of the top 40 out of 100 seeds during training. Maximum expected performance is indicated with a black dashed line. e) Performance at episode 100,000 for 100 random seeds, in decreasing order of performance. f) Probability of selecting the rewarded image, as a function of trial number, for a single A3C stacked-LSTM agent over a range of training durations (episodes per thread, 32 threads).

SLIDE 15

1611.02779 Training Structure

  • Use GRUs instead of LSTMs; also sample broader classes of problems.
  • "The objective is to maximize the … reward... over a single trial" – odd wording: over each trial, or over multiple?
  • Slightly different use of "episode" between the two papers: a trial here = an episode there
SLIDE 16
1611.02779 Bandit Results
SLIDE 17
1611.02779 Maze Task

r = +1 for reaching the target, -0.001 per wall hit, and -0.04 per time step
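The reward specification on this slide, written out as a function (assuming, as seems intended, that the penalties combine additively within a step; the function name is illustrative):

```python
def maze_reward(reached_target, hit_wall):
    r = -0.04                  # per-time-step penalty
    if hit_wall:
        r += -0.001            # wall-hit penalty
    if reached_target:
        r += 1.0               # bonus for reaching the target
    return r
```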

SLIDE 18
1611.02779 Maze Results

Videos

SLIDE 19

Comments

  • The previous learning-to-learn work is a special case of this.
    – Think of gradient descent as an agent moving in a potential: the state is the position and cost, an action is a move in any direction by any amount, and the reward is the decrease in cost.

  • DQN alone already accomplishes some of this.
    – E.g., think of each frame of Atari as a new draw
    – The Seaquest agent displays delayed gratification, for instance
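The gradient-descent-as-agent mapping from the comments above, made concrete; the quadratic cost and the fixed step rule are illustrative:

```python
import numpy as np

def cost(x):
    return float(np.sum(x ** 2))          # the "potential" the agent moves in

def gd_policy(x, lr=0.1):
    return -lr * 2 * x                    # gradient descent as one fixed policy

x = np.array([3.0, -4.0])                 # state: position (cost is cost(x))
rewards = []
for _ in range(50):
    action = gd_policy(x)                 # the policy chooses a move
    new_x = x + action
    rewards.append(cost(x) - cost(new_x)) # reward = decrease in cost
    x = new_x
print(cost(x), sum(rewards))              # cost near 0; rewards telescope to ~25
```

In this framing, a learned optimizer is just a learned policy over the same state and action spaces, which is why the earlier learning-to-learn work is a special case.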