Reinforcement Learning
Hang Su (suhangss@tsinghua.edu.cn, http://www.suhangss.me)
State Key Lab of Intelligent Technology & Systems, Tsinghua University
Nov 4th, 2019
Sequential Decision Making
Goal: select actions to maximize total future reward
q Actions may have long-term consequences
q Reward may be delayed
q It may be better to sacrifice immediate reward to gain more long-term reward
Learning and Planning
Two fundamental problems in sequential decision making
Reinforcement Learning:
q The environment is initially unknown
q The agent interacts with the environment
q The agent improves its policy
Planning:
q A model of the environment is known
q The agent performs computations with its model (without any external interaction)
q The agent improves its policy via reasoning, search, etc.
Atari Example: Planning
Rules of the game are known
Can query the emulator
q perfect model inside agent’s brain
If I take action a from state s:
q what would the next state be?
q what would the score be?
Plan ahead to find optimal policy
q e.g. tree search
(figure: lookahead search tree over left/right joystick actions)
Atari Example: Reinforcement Learning
Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on the joystick, see pixels and scores
(figure: agent-environment loop with observation $O_t$, reward $R_t$, action $A_t$)
Reinforcement learning
Intelligent animals can learn from interactions to adapt to the environment
Can computers do similarly?
Reinforcement Learning in a nutshell
RL is a general-purpose framework for decision-making
q RL is for an agent with the capacity to act
q Each action influences the agent’s future state
Success is measured by a scalar reward signal
q Goal: select actions to maximize future reward
Reinforcement Learning
The history is the sequence of observations, actions, rewards
q Agent chooses actions so as to maximize expected cumulative reward over a time horizon
q Observations can be vectors or other structures
q Actions can be multi-dimensional
q Rewards are scalar but can be arbitrarily informative
Ht = O1, R1, A1, ..., At−1, Ot, Rt
Agent and Environment
The environment:
q Receives action $a_t$
q Emits state $s_t$
q Emits scalar reward $r_t$
At each step t the agent:
q Receives state $s_t$
q Receives scalar reward $r_t$
q Executes action $a_t$
(figure: agent-environment loop with state $s_t$, reward $r_t$, action $a_t$)
State
Experience is a sequence of observations, actions, rewards: $o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t$
The state is a summary of experience: $s_t = f(o_1, r_1, a_1, \ldots, a_{t-1}, o_t, r_t)$
In a fully observed environment: $s_t = f(o_t)$
Major Components of an RL Agent
An RL agent may include one or more of these components:
q Policy: agent’s behavior function
q Value function: how good is each state and/or action
q Model: agent’s representation of the environment
Policy
A policy is the agent’s behavior: a map from state to action, e.g.
q Deterministic policy: $a = \pi(s)$
q Stochastic policy: $\pi(a|s) = P[A_t = a \mid S_t = s]$
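As a concrete illustration of the two policy types, here is a minimal Python sketch; the state/action sizes and the probability table below are made-up placeholders, not from the lecture:

```python
import numpy as np

# Minimal sketch (not from the slides): tabular policies for a toy problem
# with 5 states and 2 actions.
rng = np.random.default_rng(0)

policy_table = np.array([0, 1, 1, 0, 1])      # deterministic: a = pi(s)
action_probs = np.array([[0.9, 0.1],          # stochastic: pi(a|s), rows sum to 1
                         [0.2, 0.8],
                         [0.5, 0.5],
                         [0.7, 0.3],
                         [0.1, 0.9]])

def deterministic_policy(s):
    return int(policy_table[s])

def stochastic_policy(s):
    return int(rng.choice(2, p=action_probs[s]))

a1 = deterministic_policy(3)   # always action 0 in state 3
a2 = stochastic_policy(3)      # action 0 with prob. 0.7, action 1 with prob. 0.3
```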
Value Function
A value function is a prediction of future reward
Used to evaluate the goodness/badness of states
The Q-value function gives the expected total reward
q from state s and action a
q under policy π
q with discount factor γ
Value functions decompose into a Bellman equation
$Q^\pi(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots \mid s, a \right]$
$Q^\pi(s, a) = \mathbb{E}_{s', a'}\left[ r + \gamma Q^\pi(s', a') \mid s, a \right]$
Model
A model predicts what the environment will do next
q $\mathcal{P}$ predicts the next state, e.g. $\mathcal{P}^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
q $\mathcal{R}$ predicts the next (immediate) reward, e.g. $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
Reinforcement Learning
Inside the agent:
Agent’s goal: learn a policy to maximize long-term total reward
Difference between RL and SL?
Both learn a model ...
Supervised learning:
q the environment provides labeled data (x, y), (x, y), ... which the algorithm turns into a model (open loop)
q learning from labeled data; passive data
Reinforcement learning:
q the algorithm interacts with the environment, generating data (s, a, r, s, a, r, ...) in a closed loop
q learning from delayed reward; explores the environment
Supervised Learning
Spam detection based on supervised learning
Reinforcement Learning
Spam detection based on reinforcement learning
Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
q There is no supervisor, only a reward signal
q Feedback is delayed, not instantaneous
q Time really matters (sequential, non-i.i.d. data)
q Agent’s actions affect the subsequent data it receives
RL vs SL (Supervised Learning)
Differences from SL
q Learn by trial-and-error: needs an exploration/exploitation trade-off
q Optimize long-term reward: needs temporal credit assignment
Similarities to SL
q Representation
q Generalization
q Hierarchical problem solving
q …
Applications: The Atari games
DeepMind deep Q-learning on Atari
q Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540): 529-533, 2015.
Applications: The game of Go
DeepMind deep reinforcement learning on Go (AlphaGo)
q Silver et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587): 484-489, 2016.
Application: Producing flexible behaviors
NIPS 2017: Learning to Run competition
More applications
Search, recommendation systems, stock prediction, ...
q every decision changes the world
Generality of RL
The shortest path problem:
q classically solved by Dijkstra's algorithm, the Bellman–Ford algorithm, etc.
q can also be solved by reinforcement learning
(figure: weighted directed graph from source node s to target node t)
q every node is a state, an action is an edge out
q reward function = the negative edge weight
q optimal policy leads to the shortest path
More applications
RL also serves as a differentiable approach for structure learning and structured-data modeling
[Bahdanau et al., An Actor-Critic Algorithm for Sequence Prediction. arXiv:1607.07086]
[He et al., Deep Reinforcement Learning with a Natural Language Action Space. ACL'16]
[Dhingra et al., End-to-End Reinforcement Learning of Dialogue Agents for Information Access. arXiv:1609.00777]
[Yu et al., SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI'17]
(Partial) History...
Idea of programming a computer to learn by trial and error (Turing, 1954)
SNARCs (Stochastic Neural-Analog Reinforcement Calculators) (Minsky, 1951)
Checkers-playing program (Samuel, 1959)
Lots of RL in the 60s (e.g., Waltz & Fu 1965; Mendel 1966; Fu 1970)
MENACE (Matchbox Educable Noughts and Crosses Engine) (Michie, 1963)
RL-based Tic-Tac-Toe learner (GLEE) (Michie, 1968)
Classifier Systems (Holland, 1975)
Adaptive Critics (Barto & Sutton, 1981)
Temporal Differences (Sutton, 1988)
Outline
Markov Decision Process
Value-based methods
Policy search
Model-based methods
Deep reinforcement learning
History and State
The history is the sequence of observations, actions, rewards: all observable variables up to time t
What happens next depends on the history:
q The agent selects actions
q The environment selects observations/rewards
State is the information used to determine what happens next
Formally, state is a function of the history:
Ht = O1, R1, A1, ..., At−1, Ot, Rt
St = f (Ht)
Agent State
The agent state $S^a_t$ is the agent’s internal representation
q whatever information the agent uses to pick the next action
q it is the information used by reinforcement learning algorithms
It can be any function of the history: $S^a_t = f(H_t)$
(figure: agent-environment loop with observation $O_t$, reward $R_t$, action $A_t$, and agent state $S^a_t$)
Markov state
A Markov state contains all useful information from the history: “the future is independent of the past given the present”
Once the state is known, the history may be thrown away
The state is a sufficient statistic of the future
A state St is Markov if and only if P[St+1 | St] = P[St+1 | S1, ..., St]
H1:t → St → Ht+1:∞
Introduction to MDPs
Markov decision processes formally describe an environment for reinforcement learning, where the environment is fully observable
q i.e. The current state completely characterizes the process
Almost all RL problems can be formalized as MDPs
q Optimal control primarily deals with continuous MDPs
q Partially observable problems can be converted into MDPs
q Bandits are MDPs with one state
Markov Property
“The future is independent of the past given the present”
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
q The state is a sufficient statistic of the future
A state St is Markov if and only if P [St+1 | St] = P [St+1 | S1, ..., St]
Markov Decision Process
A Markov reward process is a Markov chain with values
A Markov decision process (MDP) is a Markov reward process with decisions.
A Markov Decision Process is a tuple $\langle S, A, P, R, \gamma \rangle$
q $S$ is a finite set of states
q $A$ is a finite set of actions
q $P$ is a state transition probability matrix, $P^a_{ss'} = P[S_{t+1} = s' \mid S_t = s, A_t = a]$
q $R$ is a reward function, $R^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
q $\gamma$ is a discount factor, $\gamma \in [0, 1]$
RL in MDP
Observe initial state $s_1$
For t = 1, 2, 3, ...
q Choose action $a_t$ based on $s_t$ and the current policy
q Observe reward $r_t$ and next state $s_{t+1}$
q Update the policy using the new information $(s_t, a_t, r_t, s_{t+1})$
Episode length may be finite or infinite
The agent can have multiple episodes starting from new initial states
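A minimal Python sketch of this interaction loop, using a hypothetical two-state toy environment and a random behaviour policy; the environment, its reward rule, and the policy are illustrative assumptions, and the policy-update step is left as a comment:

```python
import numpy as np

# Minimal sketch of the loop above with a hypothetical two-state toy
# environment and a random behaviour policy (both are illustrative).
rng = np.random.default_rng(0)

class ToyEnv:
    """Two states, two actions; action 1 taken in state 1 yields reward 1."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        r = 1.0 if (self.s == 1 and a == 1) else 0.0
        self.s = int(rng.integers(2))   # random next state
        return self.s, r

env = ToyEnv()
s = env.reset()                         # observe initial state s_1
for t in range(100):                    # for t = 1, 2, 3, ...
    a = int(rng.integers(2))            # choose a_t from s_t (here: random policy)
    s_next, r = env.step(a)             # observe reward r_t and next state s_{t+1}
    # update the policy using (s_t, a_t, r_t, s_{t+1}) -- omitted in this sketch
    s = s_next
```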
Solving the optimal policy in MDP
Given the MDP model, we can compute an optimal policy
Value-based RL:
q Estimate the optimal value function $Q^*(s, a)$
q This is the maximum value achievable under any policy
Policy-based RL
q Search directly for the optimal policy $\pi^*$
q This is the policy achieving maximum future reward
Model-based RL
q Build a model of the environment
q Plan (e.g. by lookahead) using the model
What if R and P are unknown?
q This is what reinforcement learning is about!
Policy Evaluation
Q: what is the total reward of a policy?
State value function: $V^\pi(s) = \mathbb{E}\left[\sum_{t=1}^T r_t \mid s\right]$
State-action value function: $Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=1}^T r_t \mid s, a\right]$
Consequently: $Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + V^\pi(s')\right]$
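A minimal sketch of iterative policy evaluation, repeatedly applying the Bellman expectation backup on a made-up 2-state, 2-action MDP; the transition matrix P, reward table R, and policy pi below are illustrative assumptions:

```python
import numpy as np

# Minimal sketch of iterative policy evaluation on a made-up 2-state,
# 2-action MDP; P, R, and pi are illustrative assumptions.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],    # P[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[0.0, 1.0],                  # R[s, a]: expected immediate reward
              [0.5, 0.0]])
pi = np.array([[0.5, 0.5],                 # pi[s, a]
               [0.9, 0.1]])

V = np.zeros(2)
for _ in range(1000):                      # sweep the Bellman expectation backup
    Q = R + gamma * (P @ V)                # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')
    V_new = (pi * Q).sum(axis=1)           # V(s) = sum_a pi(a|s) Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
```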
Solving the optimal policy in MDP
Idea:
q how good is the current policy? (policy evaluation)
q improve the current policy (policy improvement)
Policy iteration:
q policy evaluation: backward calculation
$V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V^\pi(s')\right]$
q policy improvement (from the Bellman optimality equation):
$V(s) \leftarrow \max_a Q^\pi(s, a)$
q value iteration, from the Bellman optimality equation:
$Q^*(s, a) = \max_\pi Q^\pi(s, a) = Q^{\pi^*}(s, a)$
$\pi^*(s) = \arg\max_a Q^*(s, a)$
$Q^*(s, a) = r_{t+1} + \gamma \max_{a_{t+1}} r_{t+2} + \gamma^2 \max_{a_{t+2}} r_{t+3} + \ldots = r_{t+1} + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})$
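The value-iteration step above can be sketched in a few lines of Python; the tabular P and R below are the same kind of illustrative placeholders as before, not values from the slides:

```python
import numpy as np

# Minimal value-iteration sketch on the same kind of tabular MDP as above
# (the numbers are illustrative placeholders).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[0.0, 1.0],
              [0.5, 0.0]])

V = np.zeros(2)
for _ in range(1000):
    Q = R + gamma * (P @ V)                # one-step lookahead over all actions
    V_new = Q.max(axis=1)                  # V(s) <- max_a Q(s, a)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

pi_star = Q.argmax(axis=1)                 # greedy policy: pi*(s) = argmax_a Q*(s, a)
```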
Optimal Value Functions
An optimal value function is the maximum achievable value
Once we have $Q^*$ we can act optimally
Optimal values maximize over all decisions and decompose into a Bellman equation:
$Q^*(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s', a') \mid s, a \right]$
Value Function Approximation
So far we have represented the value function by a lookup table:
q Every state s has an entry V(s)
q Every state-action pair (s, a) has an entry Q(s, a)
Problem with large MDPs:
q There are too many states and/or actions to store in memory
q It is too slow to learn the value of each state individually
Solution for large MDPs:
q Estimate the value function with function approximation
q Generalize from seen states to unseen states
$\hat v(s, w) \approx v_\pi(s)$ or $\hat q(s, a, w) \approx q_\pi(s, a)$
Q-Networks
Approximate the action-value function
(figure: networks that take s and a, or s alone, and output $Q(s, a, w)$ or $Q(s, a_1, w), \ldots, Q(s, a_m, w)$)
Minimize the mean-squared error between the approximate and true action-values
Use stochastic gradient descent to find a local minimum
$\hat q(S, A, w) \approx q_\pi(S, A)$
$J(w) = \mathbb{E}_\pi\left[\left(q_\pi(S, A) - \hat q(S, A, w)\right)^2\right]$
$\Delta w = \alpha\left(q_\pi(S, A) - \hat q(S, A, w)\right) \nabla_w \hat q(S, A, w)$
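A minimal sketch of this stochastic gradient update with a linear approximator $\hat q(s, a, w) = \phi(s, a)^\top w$; the feature map phi and the target value below are hypothetical placeholders:

```python
import numpy as np

# Minimal sketch of the gradient update above with a linear approximator
# q_hat(s, a, w) = phi(s, a)^T w; the feature map and the target value are
# hypothetical placeholders.
d = 8                                       # feature dimension
w = np.zeros(d)
alpha = 0.05

def phi(s, a):
    """Hypothetical fixed random features for a state-action pair."""
    local_rng = np.random.default_rng(1000 * s + a)
    return local_rng.standard_normal(d)

def q_hat(s, a, w):
    return phi(s, a) @ w

s, a, q_target = 3, 1, 2.0                  # pretend sample of q_pi(s, a)
error = q_target - q_hat(s, a, w)
w += alpha * error * phi(s, a)              # dw = alpha (q_pi - q_hat) grad_w q_hat
```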
Simple MDP: Shortest Path Problem
Principle of Optimality (Richard Bellman, 1957)
q An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
$\forall i : \text{CostToGo}(i) = \min_{j \in \text{Neighbor}(i)} \left\{\text{cost}(i \to j) + \text{CostToGo}(j)\right\}$
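A minimal sketch of the CostToGo recursion, swept Bellman-Ford style over a small made-up weighted graph; the edge costs below are illustrative, not the ones in the slide's figure:

```python
import math

# Minimal sketch of the CostToGo recursion, swept Bellman-Ford style over a
# small made-up weighted graph (edge costs are illustrative).
graph = {                                   # node -> {neighbor: edge cost}
    's': {'a': 2, 'b': 1},
    'a': {'t': 3},
    'b': {'a': 1, 't': 5},
    't': {},
}

cost_to_go = {node: math.inf for node in graph}
cost_to_go['t'] = 0.0                       # reaching the goal costs nothing more

for _ in range(len(graph)):                 # enough sweeps to propagate costs
    for i, neighbors in graph.items():
        for j, cost in neighbors.items():
            cost_to_go[i] = min(cost_to_go[i], cost + cost_to_go[j])

# cost_to_go['s'] now holds the shortest-path cost from s to t (here 5)
```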
Bellman Equations for MDPs
Deterministic shortest path:
$\text{CostToGo}(i) = \min_{j \in \text{Neighbor}(i)} \left\{\text{cost}(i \to j) + \text{CostToGo}(j)\right\}$
Markov decision process:
$V^*(s) = \max_{a \in A}\left[ R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}[V^*(s')] \right]$ (maximum long-term reward starting from s)
$Q^*(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[\max_{a' \in A} Q^*(s', a')\right]$ (maximum long-term reward after choosing a from s)
$V^*$ and $Q^*$ are called optimal value functions
Policy-Based Reinforcement Learning
Directly parametrize the policy: $\pi_\theta(s, a) = P[a \mid s, \theta]$
Start with an arbitrary policy $\pi_0 : S \to A$
For k = 0, 1, 2, ...
q Policy evaluation: solve for $Q_k$ that satisfies $\forall (s, a) : Q_k(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q_k(s', \pi_k(s'))$
q Policy improvement: $\pi_{k+1}(s) \leftarrow \arg\max_a Q_k(s, a)$
Policy Gradient
Let $J(\theta)$ be any policy objective function
Policy gradient algorithms search for a local maximum in $J(\theta)$ by ascending the gradient of the policy: $\Delta\theta = \alpha \nabla_\theta J(\theta)$
Computing Gradients By Finite Differences
To evaluate the policy gradient of $\pi_\theta(s, a)$:
For each dimension k ∈ [1, n]
q Estimate the kth partial derivative of the objective function w.r.t. θ
q By perturbing θ by a small amount ε in the kth dimension, where $u_k$ is the unit vector with 1 in the kth component and 0 elsewhere:
$\frac{\partial J(\theta)}{\partial \theta_k} \approx \frac{J(\theta + \epsilon u_k) - J(\theta)}{\epsilon}$
Uses n evaluations to compute the policy gradient in n dimensions
Simple, noisy, inefficient, but sometimes effective
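A minimal sketch of the finite-difference estimate; J below is a stand-in smooth objective, whereas in practice it would be the average return of $\pi_\theta$ estimated from rollouts:

```python
import numpy as np

# Minimal sketch of the finite-difference estimate. J is a stand-in smooth
# objective; in practice it would be the average return of pi_theta,
# estimated from rollouts.
def J(theta):
    return -np.sum((theta - 1.0) ** 2)      # hypothetical objective, maximized at theta = 1

def finite_difference_gradient(theta, eps=1e-4):
    grad = np.zeros_like(theta)
    for k in range(len(theta)):             # one perturbation per dimension
        u_k = np.zeros_like(theta)
        u_k[k] = 1.0
        grad[k] = (J(theta + eps * u_k) - J(theta)) / eps
    return grad

theta = np.zeros(5)
theta = theta + 0.1 * finite_difference_gradient(theta)   # one gradient ascent step
```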
Score Function
We now compute the policy gradient analytically
Assume the policy $\pi_\theta$ is differentiable whenever it is non-zero
Likelihood ratios exploit the following identity:
$\nabla_\theta \pi_\theta(s, a) = \pi_\theta(s, a) \frac{\nabla_\theta \pi_\theta(s, a)}{\pi_\theta(s, a)} = \pi_\theta(s, a) \nabla_\theta \log \pi_\theta(s, a)$
The score function is $\nabla_\theta \log \pi_\theta(s, a)$
Softmax Policy
Softmax policy:
q Weight actions using a linear combination of features $\phi(s, a)^\top \theta$
q Probability of an action is proportional to the exponentiated weight: $\pi_\theta(s, a) \propto e^{\phi(s, a)^\top \theta}$
The score function is $\nabla_\theta \log \pi_\theta(s, a) = \phi(s, a) - \mathbb{E}_{\pi_\theta}[\phi(s, \cdot)]$
Gaussian Policy
In continuous action spaces, a Gaussian policy is natural
q The mean is a linear combination of state features: $\mu(s) = \phi(s)^\top \theta$
q The variance may be fixed at $\sigma^2$, or can also be parametrized
q The policy is Gaussian: $a \sim \mathcal{N}(\mu(s), \sigma^2)$
The score function is $\nabla_\theta \log \pi_\theta(s, a) = \frac{(a - \mu(s))\,\phi(s)}{\sigma^2}$
Policy Gradient Theorem
Consider a simple class of one-step MDPs
q Starting in state $s \sim d(s)$
q Terminating after one time-step with reward $r$
q Use likelihood ratios to compute the policy gradient:
$\nabla_\theta J(\theta) = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta(s, a)\, \nabla_\theta \log \pi_\theta(s, a)\, r = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, r\right]$
The policy gradient theorem generalizes the likelihood-ratio approach to multi-step MDPs, replacing the instantaneous reward r with the long-term value $Q^\pi(s, a)$:
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, Q^{\pi_\theta}(s, a)\right]$
Monte-Carlo Policy Gradient (REINFORCE)
Update parameters by stochastic gradient ascent, using the policy gradient theorem
Use the return $v_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$:
$\Delta\theta_t = \alpha \nabla_\theta \log \pi_\theta(s_t, a_t)\, v_t$
function REINFORCE
  initialise θ arbitrarily
  for each episode {s_1, a_1, r_2, ..., s_{T-1}, a_{T-1}, r_T} ~ π_θ do
    for t = 1 to T-1 do
      θ ← θ + α ∇_θ log π_θ(s_t, a_t) v_t
    end for
  end for
  return θ
end function
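A minimal Python sketch of REINFORCE for a tabular softmax policy; the episode data below is a made-up placeholder rather than real environment interaction, and the softmax score function is written out explicitly:

```python
import numpy as np

# Minimal REINFORCE sketch for a tabular softmax policy pi_theta(a|s) over
# 4 states and 2 actions; the episode below is a made-up placeholder rather
# than real environment interaction.
alpha, gamma = 0.1, 1.0
theta = np.zeros((4, 2))                    # one preference per (state, action)

def grad_log_pi(s, a):
    """Softmax score: grad_theta[s] log pi(a|s) = one_hot(a) - pi(.|s)."""
    prefs = theta[s] - theta[s].max()
    probs = np.exp(prefs) / np.exp(prefs).sum()
    g = -probs
    g[a] += 1.0
    return g

episode = [(0, 1, 0.0), (2, 0, 0.0), (3, 1, 1.0)]   # (s_t, a_t, r_{t+1}) triples

returns, G = [], 0.0                        # v_t = r_{t+1} + gamma r_{t+2} + ...
for (_, _, r) in reversed(episode):
    G = r + gamma * G
    returns.append(G)
returns.reverse()

for (s, a, _), v_t in zip(episode, returns):
    theta[s] += alpha * grad_log_pi(s, a) * v_t      # theta <- theta + alpha grad log pi * v_t
```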
Reducing Variance Using a Critic
Monte-Carlo policy gradient still has high variance
We use a critic to estimate the action-value function
Actor-critic algorithms maintain two sets of parameters:
q Critic: updates action-value function parameters w
q Actor: updates policy parameters θ, in the direction suggested by the critic
Actor-critic algorithms follow an approximate policy gradient
$Q_w(s, a) \approx Q^{\pi_\theta}(s, a)$
$\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)\right]$
$\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s, a)\, Q_w(s, a)$
Bias in Actor-Critic Algorithms
Approximating the policy gradient introduces bias
A biased policy gradient may not find the right solution
Subtract a baseline function B(s) from the policy gradient; a good baseline is $B(s) = V^{\pi_\theta}(s)$
So we can rewrite the policy gradient using the advantage function:
$A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$
$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(s, a)\, A^{\pi_\theta}(s, a)\right]$
Model-Based and Model-Free RL
Model-Free RL
q No model
q Learn value function (and/or policy) from experience
Model-Based RL
q Learn a model from experience
q Plan value function (and/or policy) from the model
(figures: model-free and model-based agent-environment loops with state $S_t$, reward $R_t$, action $A_t$)
Advantages of Model-Based RL
Advantages:
q Can efficiently learn the model by supervised learning methods
q Can reason about model uncertainty
Disadvantages:
q First learn a model, then construct a value function
Model Learning
Goal: estimate the model $M_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$
This is a supervised learning problem:
$S_1, A_1 \to R_2, S_2$
$S_2, A_2 \to R_3, S_3$
$\ldots$
$S_{T-1}, A_{T-1} \to R_T, S_T$
q Learning $s, a \to r$ is a regression problem
q Learning $s, a \to s'$ is a density estimation problem
q Pick a loss function, e.g. mean-squared error, KL divergence, ...
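A minimal sketch of the simplest case, a table-lookup model estimated by counting transitions and averaging rewards; the experience tuples below are made-up placeholders:

```python
from collections import defaultdict

# Minimal sketch of a table-lookup model: count transitions and average
# rewards from (s, a, r, s') tuples; the experience below is a placeholder.
experience = [(0, 1, 0.0, 1), (1, 0, 1.0, 0), (0, 1, 0.0, 1), (1, 0, 0.5, 2)]

transition_counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
reward_sum = defaultdict(float)                             # (s, a) -> total reward
visits = defaultdict(int)                                   # (s, a) -> visit count

for s, a, r, s_next in experience:
    transition_counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    visits[(s, a)] += 1

def P_hat(s_next, s, a):
    """Estimated P(s' | s, a): density estimation by counting."""
    return transition_counts[(s, a)][s_next] / visits[(s, a)]

def R_hat(s, a):
    """Estimated E[R | s, a]: regression by averaging."""
    return reward_sum[(s, a)] / visits[(s, a)]
```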
Examples of Models
Table Lookup Model
Linear Expectation Model
Linear Gaussian Model
Gaussian Process Model
Deep Belief Network Model
……
Exploration vs. Exploitation Dilemma
Online decision-making involves a fundamental choice:
q Exploitation: Make the best decision given current information
q Exploration: Gather more information
The best long-term strategy may involve short-term sacrifices
Gather enough information to make the best overall decisions
Examples
Restaurant Selection
q Exploitation: Go to your favourite restaurant
q Exploration: Try a new restaurant
Online Banner Advertisements
q Exploitation: Show the most successful advert
q Exploration: Show a different advert
Oil Drilling
q Exploitation: Drill at the best known location
q Exploration: Drill at a new location
Game Playing
q Exploitation: Play the move you believe is best
q Exploration: Play an experimental move
Exploration methods
exploration only policy: try every action in turn
q waste many trials
exploitation only policy: try each action once, follow the best action forever
q risk of picking a bad action
balance between exploration and exploitation
Exploration methods
ε-greedy:
q follow the best action with probability 1-ε
q choose an action randomly with probability ε
q ε should decrease over time
Given a policy π, this ensures the probability of visiting every state is > 0:
$\pi_\epsilon(s) = \begin{cases} \pi(s), & \text{with prob. } 1 - \epsilon \\ \text{randomly chosen action}, & \text{with prob. } \epsilon \end{cases}$
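A minimal ε-greedy sketch over a table of Q-values; the table size and the 1/√t decay schedule are illustrative assumptions:

```python
import numpy as np

# Minimal epsilon-greedy sketch over a Q-table; the table size and the
# 1/sqrt(t) decay schedule are illustrative assumptions.
rng = np.random.default_rng(0)
Q = np.zeros((10, 4))                        # 10 states, 4 actions

def epsilon_greedy(q_row, epsilon):
    """q_row holds Q(s, .) for the current state."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))  # explore: random action
    return int(np.argmax(q_row))              # exploit: best known action

s = 0
for t in range(1, 1001):
    epsilon = 1.0 / np.sqrt(t)                # epsilon decreases over time
    a = epsilon_greedy(Q[s], epsilon)
    # ... interact with the environment and update Q[s, a] here ...
```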
Deep Reinforcement Learning
DL is a general-purpose framework for representation learning
q Given an objective, learn the representation that is required to achieve the objective
q Directly from raw inputs, using minimal domain knowledge
Deep Reinforcement Learning: AI = RL + DL
Seek a single agent which can solve any human-level task
q RL defines the objective
q DL gives the mechanism
q RL + DL = general intelligence
Deep Reinforcement Learning
Use deep neural networks to represent
q Value function
q Policy
q Model
Optimize loss function by stochastic gradient descent
Stochastic Gradient Descent with Experience Replay
Given experience consisting of ⟨state, value⟩ pairs: $D = \{\langle s_1, v^\pi_1 \rangle, \langle s_2, v^\pi_2 \rangle, \ldots, \langle s_T, v^\pi_T \rangle\}$
Repeat:
q Sample a state-value pair from the experience: $\langle s, v^\pi \rangle \sim D$
q Apply a stochastic gradient descent update: $\Delta w = \alpha\left(v^\pi - \hat v(s, w)\right) \nabla_w \hat v(s, w)$
Deep Q-Networks (DQN): Experience Replay
To remove correlations, build a data-set from the agent’s own experience:
$s_1, a_1, r_2, s_2$
$s_2, a_2, r_3, s_3$
$s_3, a_3, r_4, s_4$
$\ldots$
$s_t, a_t, r_{t+1}, s_{t+1}$
Sample transitions $(s, a, r, s')$ from the data-set and apply the update, minimizing
$l = \left(r + \gamma \max_{a'} Q(s', a', w) - Q(s, a, w)\right)^2$
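A minimal sketch of experience replay: store transitions, sample a random minibatch, and form the Q-learning targets; the tabular Q standing in for the network Q(s, a, w) and the hand-written transitions are placeholders:

```python
import random
from collections import deque
import numpy as np

# Minimal experience-replay sketch: store (s, a, r, s') transitions, sample a
# random minibatch, and form Q-learning targets. The tabular Q stands in for
# the network Q(s, a, w); all numbers are placeholders.
gamma, batch_size = 0.99, 4
buffer = deque(maxlen=10_000)
Q = np.zeros((10, 4))                        # 10 states, 4 actions

def store(s, a, r, s_next):
    buffer.append((s, a, r, s_next))

def sample_batch():
    batch = random.sample(list(buffer), batch_size)
    return [np.array(column) for column in zip(*batch)]

for transition in [(0, 1, 0.0, 1), (1, 2, 1.0, 3), (3, 0, 0.0, 2), (2, 3, 0.5, 0)]:
    store(*transition)                       # normally filled during interaction

s, a, r, s_next = sample_batch()
targets = r + gamma * Q[s_next].max(axis=1)  # r + gamma max_a' Q(s', a', w)
loss = np.mean((targets - Q[s, a]) ** 2)     # squared TD error, minimized by SGD
```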
Deep Reinforcement Learning in Atari
(figure: DQN agent playing Atari, receiving state $s_t$ and reward $r_t$, emitting action $a_t$)
DQN in Atari
q End-to-end learning of values Q(s, a) from pixels s
q Input state s is a stack of raw pixels from the last 4 frames
q Output is Q(s, a) for 18 joystick/button positions
q Reward is the change in score for that step
Network architecture and hyperparameters fixed across all games
Deep Policy Networks
Represent the policy by a deep network with weights u: $a = \pi(a \mid s, u)$ or $a = \pi(s, u)$
Define the objective function as the total discounted reward: $L(u) = \mathbb{E}\left[r_1 + \gamma r_2 + \gamma^2 r_3 + \ldots \mid \pi(\cdot, u)\right]$
Optimize the objective end-to-end by SGD
Adjust policy parameters u to achieve more reward
Policy Gradients
The gradient of a stochastic policy $\pi(a \mid s, u)$ is given by
$\frac{\partial L(u)}{\partial u} = \mathbb{E}\left[\frac{\partial \log \pi(a \mid s, u)}{\partial u}\, Q^\pi(s, a)\right]$
q Similar to the policy gradient theorem for RL
Deep Reinforcement Learning in Labyrinth
End-to-end learning of the softmax policy π(a|s_t) from pixels
Observations $o_t$ are raw pixels from the current frame
The state $s_t = f(o_1, \ldots, o_t)$ is a recurrent neural network (LSTM)
Outputs both the value V(s) and a softmax over actions π(a|s)
(figure: recurrent network unrolled over $s_{t-1}, s_t, s_{t+1}$, emitting $\pi(a|s)$ and $V(s)$ at each step)
Model-based RL
Forward search algorithms select the best action by lookahead
They build a search tree with the current state $s_t$ at the root
Using a model of the MDP to look ahead
No need to solve the whole MDP, just the sub-MDP starting from now
(figure: search tree rooted at $s_t$ with terminal leaf nodes T)
Simulation-Based Search
Forward search paradigm using sample-based planning
Simulate episodes of experience from now with the model
Apply model-free RL to simulated episodes
(figure: search tree rooted at $s_t$ built from simulated episodes)
Simple Monte-Carlo Search
Given a model $M_\nu$ and a simulation policy π
For each action a ∈ A
q Simulate K episodes from the current (real) state $s_t$: $\{s_t, a, R^k_{t+1}, S^k_{t+1}, A^k_{t+1}, \ldots, S^k_T\}_{k=1}^K \sim M_\nu, \pi$
q Evaluate actions by mean return (Monte-Carlo evaluation): $Q(s_t, a) = \frac{1}{K}\sum_{k=1}^K G_t \xrightarrow{P} q_\pi(s_t, a)$
q Select the current (real) action with maximum value: $a_t = \arg\max_{a \in A} Q(s_t, a)$
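A minimal sketch of simple Monte-Carlo search; the simulate_step model standing in for $M_\nu$ is a hypothetical placeholder, and the simulation policy is uniformly random:

```python
import numpy as np

# Minimal sketch of simple Monte-Carlo search; simulate_step is a
# hypothetical model M_nu and the simulation policy is uniformly random.
rng = np.random.default_rng(0)
n_actions, K, horizon = 3, 50, 10

def simulate_step(s, a):
    """Placeholder model: returns (next state, reward)."""
    return int(rng.integers(5)), float(rng.random() < 0.2 * (a + 1))

def rollout(s, a):
    G, state, action = 0.0, s, a
    for _ in range(horizon):
        state, r = simulate_step(state, action)
        G += r                                 # undiscounted return of the rollout
        action = int(rng.integers(n_actions))  # fixed random simulation policy
    return G

s_t = 0
Q = [np.mean([rollout(s_t, a) for _ in range(K)]) for a in range(n_actions)]
a_t = int(np.argmax(Q))                        # real action: argmax_a Q(s_t, a)
```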
Monte-Carlo Tree Search (Evaluation)
Given a model $M_\nu$
Simulate K episodes from the current state $s_t$ using the current simulation policy π: $\{s_t, A^k_t, R^k_{t+1}, S^k_{t+1}, \ldots, S^k_T\}_{k=1}^K \sim M_\nu, \pi$
Build a search tree containing visited states and actions
Evaluate states Q(s, a) by the mean return of episodes from s, a:
$Q(s, a) = \frac{1}{N(s, a)}\sum_{k=1}^K \sum_{u=t}^T \mathbf{1}(S_u, A_u = s, a)\, G_u \xrightarrow{P} q_\pi(s, a)$
After the search is finished, select the current (real) action with maximum value in the search tree: $a_t = \arg\max_{a \in A} Q(s_t, a)$
Monte-Carlo Tree Search (Simulation)
In MCTS, the simulation policy π improves
Each simulation consists of two phases (in-tree, out-of-tree):
q Tree policy (improves): pick actions to maximize Q(S, A)
q Default policy (fixed): pick actions randomly
Repeat (each simulation):
q Evaluate states Q(S, A) by Monte-Carlo evaluation
q Improve the tree policy, e.g. by ε-greedy(Q)
This is Monte-Carlo control applied to simulated experience
Converges on the optimal search tree, Q(S, A) → q*(S, A)
Case Study: the Game of Go
How good is a position s?
Reward function (undiscounted):
$R_t = 0$ for all non-terminal steps $t < T$
$R_T = \begin{cases} 1 & \text{if Black wins} \\ 0 & \text{if White wins} \end{cases}$
Policy $\pi = \langle \pi_B, \pi_W \rangle$ selects moves for both players
Value function (how good is position s):
$v_\pi(s) = \mathbb{E}_\pi[R_T \mid S = s] = P[\text{Black wins} \mid S = s]$
$v_*(s) = \max_{\pi_B} \min_{\pi_W} v_\pi(s)$
Monte-Carlo Evaluation in Go
(figure: from the current position s, four Monte-Carlo rollouts with outcomes 1, 1, 0, 0 give V(s) = 2/4 = 0.5)
AlphaGo paper: www.nature.com/articles/nature16961
AlphaStar
A visualisation of the AlphaStar agent during game two of the match against MaNa.
AlphaStar – Challenges of StarCraft
Game theory
q StarCraft is a game where, just like rock-paper-scissors, there is no single best strategy
Imperfect information
q Crucial information is hidden from a StarCraft player and must be actively discovered by “scouting”
Long term planning
q Like many real-world problems, cause-and-effect is not instantaneous
Real time
q StarCraft players must perform actions continually as the game clock progresses
Large action space
q Hundreds of different units and buildings must be controlled at once, in real time, resulting in a combinatorial space of possibilities
Summary
Key concepts:
q Markov Decision Process
q Value-based methods
q Policy gradient
q Deep reinforcement learning
What’s more
q POMDP
q Exploration and Exploitation
q A3C
q HRL
q On-policy and off-policy
q ……