Wentworth Institute of Technology
COMP4050 – Machine Learning | Fall 2015 | Derbinsky

Lecture 8: Reinforcement Learning
November 24, 2015
Outline
1. Context
2. TD Learning
3. Issues
Machine Learning Tasks
• Supervised – Given a training set and a target variable, generalize; performance is measured over a testing set
• Unsupervised – Given a dataset, find "interesting" patterns; potentially no "right" answer
• Reinforcement – Learn an optimal action policy over time; given an environment that provides states, affords actions, and provides feedback as numerical reward, maximize the expected future reward
  – Never given I/O pairs
  – Focus: online learning (balancing exploration/exploitation)
Success Stories
The Agent-Environment Interface
[Figure: at each step t, the agent in state s_t takes action a_t; the (stochastic) environment responds with reward r_{t+1} and next state s_{t+1}]
Pole Balancing
Multi-Armed Bandit
Types of Tasks
• Some tasks are continuing, meaning they are an ongoing sequence of decisions with no natural end
• Some tasks are episodic, meaning there exist terminal states that reset the problem
Policies
A policy π(s, a) is a function that associates a probability with taking a particular action a in a particular state s.
The goal of RL is to learn an "effective" policy for a particular task.
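A stochastic policy can be sketched as a table mapping each state to a probability distribution over actions. The state and action names below are purely illustrative, not from the lecture:

```python
# A stochastic policy pi(s, a): for each state, a probability
# distribution over the actions available in that state.
# (State/action names here are hypothetical.)
policy = {
    "low_battery":  {"recharge": 0.8, "search": 0.1, "wait": 0.1},
    "high_battery": {"recharge": 0.0, "search": 0.7, "wait": 0.3},
}

def pi(s, a):
    """Probability of taking action a in state s under this policy."""
    return policy[s].get(a, 0.0)

# Sanity check: each state's action probabilities must sum to 1.
for s, dist in policy.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

print(pi("low_battery", "recharge"))  # 0.8
```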
Objective
Select actions so that the sum of the discounted rewards the agent receives over the future is maximized
– Discount rate: 0 ≤ γ ≤ 1

R_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + … = Σ_{k=0}^{∞} γ^k · r_{t+k+1}
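The discounted return above can be computed directly from a (finite) observed reward sequence; a minimal sketch, where `rewards[0]` plays the role of r_{t+1}:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_{k=0}^{inf} gamma^k * r_{t+k+1}, truncated to the
    observed reward sequence. rewards[k] corresponds to r_{t+k+1}."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With gamma = 0.5: 1 + 0.5*2 + 0.25*3 = 2.75
print(discounted_return([1, 2, 3], 0.5))  # 2.75
```

Note that with γ < 1 the infinite sum stays finite even for continuing tasks, which is why the discount rate is bounded by 1.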
Environmental Modeling
• An important issue in RL is state representation
  – Current sensors (observability!)
  – Past history?
• A stochastic process has the Markov property if the conditional probability distribution of future states of the process depends only upon the present state
  – Given the present, the future does not depend on the past
  – Memoryless, pathless
Implications of the Markov Property
Often the process is not strictly Markovian, but we can either (i) approximate it as such and yield good results, or (ii) include a fixed window of history as state.

Thus we can approximate

P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t, s_{t−1}, a_{t−1}, …, r_1, s_0, a_0)

via

P(s_{t+1} = s′, r_{t+1} = r | s_t, a_t)
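Option (ii) above, folding a fixed window of history into the state, can be sketched as follows (the interface is an illustrative assumption, not a standard API):

```python
from collections import deque

def make_windowed_state(window):
    """Return a function that folds the last `window` observations
    into a single tuple-valued state, approximating the Markov
    property when one observation alone is not enough."""
    history = deque(maxlen=window)
    def observe(obs):
        history.append(obs)       # oldest observation drops off
        return tuple(history)     # the tuple is the agent's "state"
    return observe

observe = make_windowed_state(2)
print(observe("a"))  # ('a',)
print(observe("b"))  # ('a', 'b')
print(observe("c"))  # ('b', 'c')
```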
Markov Decision Processes
If a process is Markovian, we can model it as a 5-tuple MDP: (S, A, P(·,·), R(·,·), γ)
– S: set of states
– A: set of actions
– P_a(s, s′): transition function (probability of moving from s to s′ via action a)
– R_a(s, s′): immediate reward
– γ: discount rate
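A tiny MDP can be written out literally as that 5-tuple; the two-state example below is a hypothetical illustration (loosely in the spirit of a battery-powered robot), not the lecture's model:

```python
# A toy MDP as a 5-tuple (S, A, P, R, gamma).
# P[(s, a)] maps next state s' -> probability;
# R[(s, a, s')] is the immediate reward. All values illustrative.
S = {"high", "low"}
A = {"search", "wait"}
gamma = 0.9

P = {
    ("high", "search"): {"high": 0.7, "low": 0.3},
    ("high", "wait"):   {"high": 1.0},
    ("low",  "search"): {"low": 0.6, "high": 0.4},
    ("low",  "wait"):   {"low": 1.0},
}
R = {
    ("high", "search", "high"): 2.0, ("high", "search", "low"): 2.0,
    ("high", "wait",   "high"): 1.0,
    ("low",  "search", "low"):  2.0, ("low",  "search", "high"): -3.0,
    ("low",  "wait",   "low"):  1.0,
}

# Sanity check: transition probabilities out of each (s, a) sum to 1.
for sa, dist in P.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```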
Recycling Robot MDP
Value Functions
Almost all RL algorithms are based on estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
Value functions are defined with respect to particular policies.
State-Value Function

V^π(s) = E_π[ R_t | s_t = s ] = E_π[ Σ_{k=0}^{∞} γ^k · r_{t+k+1} | s_t = s ]
Action-Value Function

Q^π(s, a) = E_π[ R_t | s_t = s, a_t = a ] = E_π[ Σ_{k=0}^{∞} γ^k · r_{t+k+1} | s_t = s, a_t = a ]
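The two value functions are linked: the state value is the policy-weighted average of the action values, V^π(s) = Σ_a π(s, a)·Q^π(s, a). A minimal sketch of that identity for a single state (action names hypothetical):

```python
def v_from_q(pi_s, q_s):
    """V^pi(s) = sum_a pi(s, a) * Q^pi(s, a).
    pi_s: dict action -> probability under the policy at state s.
    q_s:  dict action -> estimated action value at state s."""
    return sum(pi_s[a] * q_s[a] for a in pi_s)

pi_s = {"left": 0.5, "right": 0.5}
q_s  = {"left": 2.0, "right": 4.0}
print(v_from_q(pi_s, q_s))  # 3.0
```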
Example: Golf
Temporal Difference (TD) Learning
• Combines ideas from Monte Carlo sampling and dynamic programming
• Learns directly from raw experience without a model of environment dynamics
• Updates estimates based in part on other learned estimates, without waiting for a final outcome
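The "update estimates from other estimates" idea is visible in the simplest TD rule, TD(0) for state values, which moves V(s) toward the bootstrapped target r + γ·V(s′) after a single step; a minimal sketch:

```python
def td0_update(V, s, r, s_next, alpha, gamma):
    """TD(0) state-value update: nudge V(s) toward the bootstrapped
    target r + gamma * V(s'), without waiting for episode end."""
    V[s] = V[s] + alpha * (r + gamma * V[s_next] - V[s])

V = {"A": 0.0, "B": 1.0}
td0_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=0.9)
print(V["A"])  # 0 + 0.5 * (1 + 0.9*1 - 0) = 0.95
```

The target uses the current estimate V(s′) rather than an observed final return, which is exactly the dynamic-programming-style bootstrapping the slide describes.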
Visual TD Learning
Q-Learning: Off-Policy TD Control
1. Initialize Q(s, a)
   – Random, optimistic, realistic, or knowledge-based
2. Repeat (for each episode):
   a. Initialize s
   b. Repeat (for each step of episode):
      i.   Choose action a via Q
      ii.  Take action a, observe r, s′
      iii. Q(s, a) ← Q(s, a) + α [ r + γ·max_{a′} Q(s′, a′) − Q(s, a) ]
      iv.  s ← s′
      until s is terminal
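The loop above can be sketched as tabular Q-learning on a toy episodic task. The environment interface (`reset`/`step`/`actions`) and the chain task are assumptions for illustration, not part of the lecture:

```python
import random
from collections import defaultdict

class ChainEnv:
    """Toy episodic task: states 0, 1, 2; moving right from 1
    reaches terminal state 2 and yields reward +1."""
    def reset(self):
        return 0
    def actions(self, s):
        return ["left", "right"]
    def step(self, s, a):
        s2 = min(s + 1, 2) if a == "right" else max(s - 1, 0)
        done = (s2 == 2)
        return (1.0 if done else 0.0), s2, done

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                 # step 1: initialize Q(s, a)
    for _ in range(episodes):              # step 2: for each episode
        s = env.reset()                    # 2a: initialize s
        done = False
        while not done:                    # 2b: for each step
            acts = env.actions(s)
            if random.random() < epsilon:  # i: choose action via Q
                a = random.choice(acts)    #    (epsilon-greedy)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            r, s2, done = env.step(s, a)   # ii: take action, observe r, s'
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in env.actions(s2))
            # iii: off-policy TD update toward r + gamma * max_a' Q(s', a')
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2                         # iv: s <- s'
    return Q

random.seed(0)
Q = q_learning(ChainEnv(), episodes=300)
```

Q-learning is *off-policy*: the update target uses the greedy max over next actions even when the behavior policy explored, so it learns the optimal action values regardless of the ε-greedy behavior.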
Choosing Actions
• Given a Q function, a common approach to selecting an action is ε-greedy:
  1. Select a random value in [0, 1]
     – If > ε, take the action with the highest estimated value
     – Else, select an action at random
• In the limit, every action will be sampled an infinite number of times
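The ε-greedy rule above is a few lines of code; a minimal sketch, with action values passed in as a dict:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """q_values: dict mapping action -> estimated value.
    With probability epsilon explore uniformly at random;
    otherwise exploit the greedy (highest-value) action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # explore
    return max(q_values, key=q_values.get)     # exploit

# With epsilon = 0, selection is purely greedy.
print(epsilon_greedy({"a": 1.0, "b": 3.0, "c": 2.0}, epsilon=0.0))  # b
```

Because every action keeps a nonzero selection probability for ε > 0, each action is sampled infinitely often in the limit, which is the condition the slide notes.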
Function Representation
• Given large state-action spaces, there is a practical problem of how to sample the space, and how to represent it
• Modern approaches include hierarchical methods and neural networks
Application: Michigan Liar's Dice
• Multi-agent opponents
• Varying degrees of background knowledge
  – Opponent modeling
  – Probabilistic calculation
  – Symbolic heuristics
Evaluation: Learning vs. Static
[Figure: Games Won vs. Blocks of Training (250 Games/Block)]
Evaluation: Learning vs. Learned
[Figure: Games Won vs. Blocks of Training (250 Games/Block)]
Evaluation: Value-Function Initialization
[Figure: Games Won vs. Blocks of Training (250 Games/Block); series: PMH, PM, PH, P, PMH-0, PM-0, PH-0, P-0]
Summary
• Reinforcement Learning (RL) is the problem of learning an effective action policy for obtaining reward
• Most RL algorithms model the task as a Markov Decision Process (MDP) and estimate the value of states/state-actions in a value function
• Temporal-Difference (TD) Learning is one effective method that is online and model-free