SLIDE 1

CSCE 496/896 Lecture 7: Reinforcement Learning

Stephen Scott

(Adapted from Paul Quint)

sscott@cse.unl.edu

SLIDE 2

Introduction

Consider learning to choose actions, e.g.,

- Robot learning to dock on a battery charger
- Learning to choose actions to optimize factory output
- Learning to play Backgammon, chess, Go, etc.

Note several problem characteristics:

- Delayed reward (thus the problem of temporal credit assignment)
- Opportunity for active exploration (versus exploitation of known good actions)

  ⇒ The learner has some influence over the training data it sees

- Possibility that the state is only partially observable

SLIDE 3

Example: TD-Gammon (Tesauro, 1995)

Learn to play Backgammon. Immediate reward:

- +100 if win
- −100 if lose
- 0 for all other states

Trained by playing 1.5 million games against itself; approximately equal to the best human player at that time.

SLIDE 4

Outline

- Markov decision processes
- The agent's learning task
- Q learning
- Temporal difference learning
- Deep Q learning
- Example: Learning to play Atari

SLIDE 5

Reinforcement Learning Problem

[Figure: the agent–environment interaction loop. The agent observes the state, chooses an action, and receives a reward from the environment, producing the sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, \ldots$]

Goal: Learn to choose actions that maximize $r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots$, where $0 \le \gamma < 1$

SLIDE 6

Markov Decision Processes

Assume:

- Finite set of states $S$
- Set of actions $A$
- At each discrete time $t$, the agent observes state $s_t \in S$ and chooses action $a_t \in A$
- It then receives immediate reward $r_t$, and the state changes to $s_{t+1}$
- Markov assumption: $s_{t+1} = \delta(s_t, a_t)$ and $r_t = r(s_t, a_t)$

- I.e., $r_t$ and $s_{t+1}$ depend only on the current state and action
- Functions $\delta$ and $r$ may be nondeterministic
- Functions $\delta$ and $r$ are not necessarily known to the agent (a small sketch of $\delta$ and $r$ follows)
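Not from the slides, but to make the definitions concrete: a minimal Python sketch of a deterministic MDP given by two functions $\delta$ and $r$, using a toy 2×3 grid world in the spirit of the lecture's running example (the cell layout and state names are assumptions).

```python
# Minimal sketch of a deterministic MDP: delta (transition) and r (reward).
# Assumes a toy 2x3 grid world with an absorbing goal cell "G" (top-right);
# the layout and names are illustrative, not the lecture's code.

GRID = [["s1", "s2", "G"],
        ["s3", "s4", "s5"]]
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
POS = {GRID[i][j]: (i, j) for i in range(2) for j in range(3)}

def delta(s, a):
    """Deterministic transition function; walls block movement."""
    if s == "G":                      # absorbing goal state
        return "G"
    i, j = POS[s]
    di, dj = MOVES[a]
    ni, nj = i + di, j + dj
    if 0 <= ni < 2 and 0 <= nj < 3:   # stay inside the grid
        return GRID[ni][nj]
    return s

def r(s, a):
    """Immediate reward: 100 for entering the goal, 0 otherwise."""
    return 100 if s != "G" and delta(s, a) == "G" else 0
```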

SLIDE 7

Agent’s Learning Task

Execute actions in the environment, observe the results, and learn an action policy $\pi : S \to A$ that maximizes

$$E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \right]$$

from any starting state in $S$

Here $0 \le \gamma < 1$ is the discount factor for future rewards

Note something new:

- Target function is $\pi : S \to A$
- But we have no training examples of the form $\langle s, a \rangle$
- Training examples are of the form $\langle \langle s, a \rangle, r \rangle$
- I.e., the agent is not told what the best action is; instead it is told the reward for executing action $a$ in state $s$

SLIDE 8

Value Function

First consider deterministic worlds. For each possible policy $\pi$ the agent might adopt, define the discounted cumulative reward as

$$V^\pi(s) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i},$$

where $r_t, r_{t+1}, \ldots$ are generated by following policy $\pi$ starting at state $s$.

Restated, the task is to learn an optimal policy $\pi^*$:

$$\pi^* \equiv \underset{\pi}{\operatorname{argmax}}\ V^\pi(s), \quad (\forall s)$$
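A small illustrative sketch of the discounted cumulative reward for a finite reward sequence; the example numbers are chosen so the result matches the $V^*(s) = 90$ value in the grid-world figure on the next slide.

```python
# Sketch: discounted cumulative reward sum_i gamma^i * r_{t+i},
# approximated over a finite reward sequence.

def discounted_return(rewards, gamma=0.9):
    """Return r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... for a finite list."""
    return sum((gamma ** i) * r_i for i, r_i in enumerate(rewards))

# Example: reward 100 one step in the future, 0 elsewhere -> 0.9 * 100 = 90,
# matching V*(s) = 90 for the state next to the goal in the grid world.
print(discounted_return([0, 100, 0, 0]))  # 90.0
```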

SLIDE 9

Value Function

[Figure: a six-state grid world with absorbing goal state G. Four panels show the immediate rewards $r(s, a)$ (100 for actions entering G, 0 otherwise), the $Q(s, a)$ values, the $V^*(s)$ values (100, 90, 81, ...), and one optimal policy.]

SLIDE 10

What to Learn

We might try to have the agent learn the evaluation function $V^{\pi^*}$ (which we write as $V^*$). It could then do a lookahead search to choose the best action from any state $s$, because

$$\pi^*(s) = \underset{a}{\operatorname{argmax}}\ [r(s, a) + \gamma V^*(\delta(s, a))],$$

i.e., choose the action that maximizes immediate reward plus discounted reward if the optimal strategy is followed from then on.

E.g., $V^*(\text{bottom center}) = 0 + \gamma \cdot 100 + \gamma^2 \cdot 0 + \gamma^3 \cdot 0 + \cdots = 90$

A problem:

- This works well if the agent knows $\delta : S \times A \to S$ and $r : S \times A \to \mathbb{R}$
- But when it doesn't, it can't choose actions this way

SLIDE 11

Q Function

Define a new function very similar to $V^*$:

$$Q(s, a) \equiv r(s, a) + \gamma V^*(\delta(s, a))$$

i.e., $Q(s, a)$ = total discounted reward if action $a$ is taken in state $s$ and optimal choices are made from then on.

If the agent learns $Q$, it can choose the optimal action even without knowing $\delta$:

$$\pi^*(s) = \underset{a}{\operatorname{argmax}}\ [r(s, a) + \gamma V^*(\delta(s, a))] = \underset{a}{\operatorname{argmax}}\ Q(s, a)$$

$Q$ is the evaluation function the agent will learn.

SLIDE 12

Training Rule to Learn Q

Note that $Q$ and $V^*$ are closely related:

$$V^*(s) = \max_{a'} Q(s, a')$$

which allows us to write $Q$ recursively as

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma V^*(\delta(s_t, a_t)) = r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a')$$

Let $\hat{Q}$ denote the learner's current approximation to $Q$; consider the training rule

$$\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a'),$$

where $s'$ is the state resulting from applying action $a$ in state $s$.

SLIDE 13

Q Learning for Deterministic Worlds

For each $s, a$ initialize the table entry $\hat{Q}(s, a) \leftarrow 0$

Observe the current state $s$

Do forever:

- Select an action $a$ (greedily or probabilistically) and execute it
- Receive immediate reward $r$
- Observe the new state $s'$
- Update the table entry for $\hat{Q}(s, a)$ as follows: $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
- $s \leftarrow s'$

Note that actions not taken and states not seen don’t get explicit updates (might need to generalize)
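A minimal tabular sketch of this loop (not the lecture's code), run as repeated episodes in a tiny made-up chain environment so it stays self-contained:

```python
import random
from collections import defaultdict

# Sketch: tabular Q-learning in a deterministic world, the "do forever" loop
# run as repeated episodes.  The tiny chain environment is illustrative only.

N_STATES, GOAL, GAMMA = 5, 4, 0.9
ACTIONS = [-1, +1]                      # move left / move right

def delta(s, a):                        # deterministic transition function
    return min(max(s + a, 0), N_STATES - 1)

def r(s, a):                            # reward 100 for entering the goal
    return 100 if s != GOAL and delta(s, a) == GOAL else 0

Q = defaultdict(float)                  # Q[(s, a)], initialized to 0

for episode in range(200):
    s = 0
    while s != GOAL:
        a = random.choice(ACTIONS)      # explore (could also act greedily)
        reward, s_next = r(s, a), delta(s, a)
        # Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[(s, a)] = reward + GAMMA * max(Q[(s_next, a2)] for a2 in ACTIONS)
        s = s_next

print(Q[(3, +1)])   # 100.0: moving right from the state next to the goal
print(Q[(2, +1)])   # 90.0 = gamma * 100, once the value has propagated back
```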

SLIDE 14

Updating $\hat{Q}$

[Figure: a grid world before and after the robot R moves right, with $\hat{Q}$ values shown on the arrows out of each state. The initial state is $s_1$; after taking action $a_{\text{right}}$, the next state is $s_2$.]

$$\hat{Q}(s_1, a_{\text{right}}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \max\{66, 81, 100\} = 90$$

Can show via induction on $n$ that if rewards are non-negative and the $\hat{Q}$ values are initially 0, then

$$(\forall s, a, n)\ \hat{Q}_{n+1}(s, a) \ge \hat{Q}_n(s, a) \quad \text{and} \quad (\forall s, a, n)\ 0 \le \hat{Q}_n(s, a) \le Q(s, a)$$

SLIDE 15

Updating $\hat{Q}$

Convergence

$\hat{Q}$ converges to $Q$: Consider the case of a deterministic world where each $\langle s, a \rangle$ is visited infinitely often.

Proof: Define a full interval to be an interval during which each $\langle s, a \rangle$ is visited. We will show that during each full interval the largest error in the $\hat{Q}$ table is reduced by a factor of $\gamma$.

Let $\hat{Q}_n$ be the table after $n$ updates, and $\Delta_n$ be the maximum error in $\hat{Q}_n$; i.e.,

$$\Delta_n = \max_{s,a} |\hat{Q}_n(s, a) - Q(s, a)|$$

Let $s' = \delta(s, a)$.

SLIDE 16

Updating $\hat{Q}$

Convergence

For any table entry $\hat{Q}_n(s, a)$ updated on iteration $n + 1$, the error in the revised estimate $\hat{Q}_{n+1}(s, a)$ is

$$\begin{aligned}
|\hat{Q}_{n+1}(s, a) - Q(s, a)| &= \left|\left(r + \gamma \max_{a'} \hat{Q}_n(s', a')\right) - \left(r + \gamma \max_{a'} Q(s', a')\right)\right| \\
&= \gamma\, \left|\max_{a'} \hat{Q}_n(s', a') - \max_{a'} Q(s', a')\right| \\
&\overset{(*)}{\le} \gamma \max_{a'} \left|\hat{Q}_n(s', a') - Q(s', a')\right| \\
&\overset{(**)}{\le} \gamma \max_{s'', a'} \left|\hat{Q}_n(s'', a') - Q(s'', a')\right| = \gamma \Delta_n
\end{aligned}$$

$(*)$ holds since $|\max_a f_1(a) - \max_a f_2(a)| \le \max_a |f_1(a) - f_2(a)|$

$(**)$ holds since taking the max over more variables cannot decrease the value

SLIDE 17

Updating $\hat{Q}$

Convergence

Also, $\hat{Q}_0(s, a)$ and $Q(s, a)$ are both bounded $\forall s, a$

⇒ $\Delta_0$ is bounded

Thus after $k$ full intervals, the error is at most $\gamma^k \Delta_0$

Finally, each $\langle s, a \rangle$ is visited infinitely often ⇒ the number of full intervals is infinite, so $\Delta_n \to 0$ as $n \to \infty$

SLIDE 18

Nondeterministic Case

What if the reward and next state are nondeterministic? We redefine $V$ and $Q$ by taking expected values:

$$V^\pi(s) \equiv E\left[ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \right] = E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$$

$$\begin{aligned}
Q(s, a) &\equiv E\left[ r(s, a) + \gamma V^*(\delta(s, a)) \right] \\
&= E[r(s, a)] + \gamma E[V^*(\delta(s, a))] \\
&= E[r(s, a)] + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \\
&= E[r(s, a)] + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')
\end{aligned}$$

SLIDE 19

Nondeterministic Case

Q learning generalizes to nondeterministic worlds. Alter the training rule to

$$\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n)\, \hat{Q}_{n-1}(s, a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \right]$$

where

$$\alpha_n = \frac{1}{1 + \text{visits}_n(s, a)}$$

Can still prove convergence of $\hat{Q}$ to $Q$ with this and other forms of $\alpha_n$ (Watkins and Dayan, 1992)
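A short sketch of this stochastic update with the visit-count learning rate; the dictionary-based table and helper names are illustrative assumptions, not the lecture's code.

```python
from collections import defaultdict

# Sketch: the nondeterministic-world update with learning rate
# alpha_n = 1 / (1 + visits_n(s, a)).  Table structure is illustrative.

Q = defaultdict(float)        # Q[(s, a)]
visits = defaultdict(int)     # visits[(s, a)]
GAMMA = 0.9

def q_update(s, a, reward, s_next, actions):
    """One Q-learning update suited to a stochastic environment."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = reward + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```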

SLIDE 20

Temporal Difference Learning

Q learning: reduce the error between successive Q estimates

Q estimate using a one-step time difference:

$$Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_{a} \hat{Q}(s_{t+1}, a)$$

Why not two steps?

$$Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_{a} \hat{Q}(s_{t+2}, a)$$

Or $n$?

$$Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_{a} \hat{Q}(s_{t+n}, a)$$

SLIDE 21

Temporal Difference Learning

Blend all of these ($0 \le \lambda \le 1$):

$$Q^\lambda(s_t, a_t) \equiv (1 - \lambda) \left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]$$

Equivalently, written recursively:

$$Q^\lambda(s_t, a_t) = r_t + \gamma \left[ (1 - \lambda) \max_{a} \hat{Q}(s_{t+1}, a) + \lambda\, Q^\lambda(s_{t+1}, a_{t+1}) \right]$$

The TD($\lambda$) algorithm uses the above training rule:

- Sometimes converges faster than Q learning
- Converges for learning $V^*$ for any $0 \le \lambda \le 1$ (Dayan, 1992)
- Tesauro's TD-Gammon uses this algorithm
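To illustrate the $n$-step estimates and their $\lambda$-weighted blend, a small sketch that computes them from a recorded trajectory; the rewards, $\hat{Q}$ values, and the truncation at the end of the data are assumptions made for the example.

```python
# Sketch: n-step Q estimates Q^(n) and their lambda-weighted blend Q^lambda,
# computed from a recorded trajectory (rewards plus max_a Q-hat per state).
# A finite trajectory is used, so the geometric series is truncated.

def n_step_estimate(rewards, max_q, t, n, gamma=0.9):
    """Q^(n)(s_t,a_t) = r_t + ... + gamma^(n-1) r_{t+n-1} + gamma^n max_a Q(s_{t+n},a)."""
    ret = sum((gamma ** i) * rewards[t + i] for i in range(n))
    return ret + (gamma ** n) * max_q[t + n]

def lambda_estimate(rewards, max_q, t, lam=0.7, gamma=0.9):
    """Truncated Q^lambda(s_t,a_t) ~ (1-lam) * sum_n lam^(n-1) Q^(n)(s_t,a_t)."""
    n_max = len(rewards) - t
    total = sum((lam ** (n - 1)) * n_step_estimate(rewards, max_q, t, n, gamma)
                for n in range(1, n_max + 1))
    return (1 - lam) * total

# Example with made-up rewards and Q-hat values (illustrative only):
rewards = [0, 0, 100, 0]
max_q   = [50, 60, 70, 0, 0]        # max_a Q-hat(s_t, a) for t = 0..4
print(n_step_estimate(rewards, max_q, t=0, n=2))   # 0 + 0.9*0 + 0.81*70
print(lambda_estimate(rewards, max_q, t=0))
```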

SLIDE 22

Representing $\hat{Q}$

Convergence proofs assume that $\hat{Q}(s, a)$ is represented exactly

E.g., as an array

How well does this scale to real problems? What can we do about it?

SLIDE 23

Deep Q Learning

- We already have machinery to approximate functions based on labeled samples
- Search for a deep Q network (DQN) to implement a function $Q_\theta$ approximating $Q$
- Each training instance is $\langle s, a \rangle$ with label $y(s, a) = r + \gamma \max_{a'} Q_\theta(s', a')$ (sketched below)

- I.e., take action $a$ in state $s$, get reward $r$, and observe the new state $s'$
- Then use $Q_\theta$ to compute the label $y(s, a)$ and update as usual

Convergence proofs break, but we get scalability to large state spaces
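The slides give no implementation; as a hedged sketch, here is the same label construction with a linear approximator standing in for the deep network (shapes, names, and hyperparameters are made up).

```python
import numpy as np

# Sketch: Q-learning with a parameterized approximator Q_theta instead of a
# table.  A linear model Q_theta(s, a) = theta[a] . phi(s) stands in for the
# deep network; the label y = r + gamma * max_a' Q_theta(s', a') is formed the
# same way a DQN would form it.  All shapes and names are illustrative.

N_ACTIONS, N_FEATURES, GAMMA, LR = 4, 8, 0.99, 0.01
theta = np.zeros((N_ACTIONS, N_FEATURES))

def q_values(phi_s):
    """Q_theta(s, a) for all actions, given feature vector phi(s)."""
    return theta @ phi_s

def dqn_style_update(phi_s, a, reward, phi_s_next, done):
    """One semi-gradient update toward the target y(s, a)."""
    target = reward if done else reward + GAMMA * np.max(q_values(phi_s_next))
    td_error = target - q_values(phi_s)[a]
    theta[a] += LR * td_error * phi_s      # gradient of Q wrt theta[a] is phi(s)

# Usage with random data, just to show the call pattern:
phi_s, phi_s2 = np.random.rand(N_FEATURES), np.random.rand(N_FEATURES)
dqn_style_update(phi_s, a=2, reward=1.0, phi_s_next=phi_s2, done=False)
```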

SLIDE 24

DQN Example: Playing Atari (Mnih et al., 2015)

Applied same architecture and hyperparameters to 49 Atari 2600 games

- System learned an effective policy for each of these very different games
- No game-specific modifications

- State description consists of raw input from the emulator
- Frames rescaled to 84 × 84, single channel
- Each state is the sequence of the four most recent frames (see the sketch below)
- Rather than take $s$ and $a$ as inputs, the network takes $s$ and gives predictions of $Q(s, a)$ for all $a$ as outputs
- Clipped positive rewards to +1 and negative rewards to −1
- Evaluated each policy's performance against a professional human tester
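An illustrative sketch of the state construction and reward clipping just described; the grayscale/resize step is assumed to happen elsewhere (it is emulator- and library-specific), so this only stacks already-preprocessed frames.

```python
import numpy as np
from collections import deque

# Sketch: keep the four most recent preprocessed (84x84, single-channel)
# frames as one 84x84x4 state, and clip rewards to {-1, 0, +1}.
# Preprocessing of raw emulator frames is assumed to happen before push().

FRAME_SHAPE, HISTORY = (84, 84), 4

class FrameStack:
    def __init__(self):
        blank = np.zeros(FRAME_SHAPE, dtype=np.uint8)
        self.frames = deque([blank] * HISTORY, maxlen=HISTORY)

    def push(self, frame):
        """Add the newest 84x84 frame and return the stacked 84x84x4 state."""
        self.frames.append(frame)
        return np.stack(list(self.frames), axis=-1)

def clip_reward(raw_reward):
    """Map score changes to +1 / -1 / 0, as described above."""
    return int(np.sign(raw_reward))

# Usage:
stack = FrameStack()
state = stack.push(np.random.randint(0, 256, FRAME_SHAPE, dtype=np.uint8))
print(state.shape, clip_reward(37))   # (84, 84, 4)  1
```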

SLIDE 25

DQN Example: Playing Atari (Mnih et al., 2015)

Architecture

- Input: 84 × 84 × 4; 3 convolutional layers, two dense layers
- Conv layers: 32 feature maps of 20 × 20, then 64 of 9 × 9, then 64 of 7 × 7
- 512 units in the dense layers
- 18 outputs: output $i$ is the estimate of $Q(s, a_i)$

SLIDE 26

DQN Example: Playing Atari (Mnih et al., 2015)

Training

- Reward signal at time $t$: +1 if the score increased, −1 if it decreased, 0 otherwise
- Action selected via $\epsilon$-greedy policy: with probability $\epsilon$ choose an action uniformly at random, with probability $(1 - \epsilon)$ choose $\operatorname{argmax}_a Q_\theta(s, a)$
- Chosen action $a_t$ is run in the emulator, which returns reward $r_t$ and the next frame for state $s_{t+1}$
- Update:

$$\theta_{t+1} = \theta_t + \alpha \left[ r_t + \gamma \max_{a'} Q_{\theta_t}(s_{t+1}, a') - Q_{\theta_t}(s_t, a_t) \right] \nabla_\theta Q_{\theta_t}(s_t, a_t)$$

Trained with RMSProp, mini-batch size of 32
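A minimal sketch of the $\epsilon$-greedy selection described above; `q_of_state` is a placeholder for one forward pass of the network, not the actual DQN.

```python
import numpy as np

# Sketch: epsilon-greedy action selection.  `q_of_state` is a stand-in that
# returns Q_theta(s, a) for every action.

def epsilon_greedy(q_of_state, state, n_actions, epsilon=0.1):
    """With prob. epsilon pick uniformly at random, else pick argmax_a Q(s, a)."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_of_state(state)))

# Usage with a dummy Q function over 18 actions (the Atari action set size):
dummy_q = lambda s: np.random.rand(18)
action = epsilon_greedy(dummy_q, state=None, n_actions=18, epsilon=0.05)
```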

SLIDE 27

DQN Example: Playing Atari (Mnih et al., 2015)

Modifications

Deep RL systems can be unstable or divergent, so Mnih et al. made three modifications (the first two are sketched in code below):

1. Used experience replay: rather than train on consecutive tuples, each tuple $(s_t, a_t, r_t, s_{t+1})$ from game play is added to a replay memory

   - Replay memory is sampled uniformly at random for training mini-batches
   - Independent instances in mini-batches reduce correlations in the training data
   - Training is off-policy (the policy trained is not the one choosing actions in the game)

2. Used a separate target network $\tilde{\theta}$ to generate labels:

   $$\theta_{t+1} = \theta_t + \alpha \left[ r_t + \gamma \max_{a'} Q_{\tilde{\theta}_t}(s_{t+1}, a') - Q_{\theta_t}(s_t, a_t) \right] \nabla_\theta Q_{\theta_t}(s_t, a_t)$$

   - Copied $\theta$ into $\tilde{\theta}$ every $C$ updates

3. Clipped the error term $[r_t + \cdots - Q_{\theta_t}(s_t, a_t)]$ to $[-1, 1]$
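An illustrative sketch of modifications 1 and 2 (replay memory and periodic target-network sync); the capacity, the period C, and the dict-based parameter copy are assumptions, not values or code from Mnih et al.

```python
import random
from collections import deque

# Sketch: a uniformly-sampled replay memory and a periodic copy of the online
# parameters into the target parameters.  Parameter containers are assumed to
# be dict-like; capacity and C are illustrative values.

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        """Uniform random mini-batch; breaks correlation between consecutive steps."""
        return random.sample(list(self.buffer), batch_size)

C = 10_000                 # target-network update period (illustrative)

def maybe_sync_target(step, online_params, target_params):
    """Copy theta into theta-tilde every C updates."""
    if step % C == 0:
        target_params.update(online_params)   # assumes dict-like parameters
    return target_params
```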

SLIDE 28

DQN Example: Playing Atari (Mnih et al., 2015)

Pseudocode

SLIDE 29

DQN Example: Playing Atari (Mnih et al., 2015)

SLIDE 30

DQN Example: Playing Atari (Mnih et al., 2015)

SLIDE 31

DQN Example: Playing Atari (Mnih et al., 2015)

Results

- Trained on each game for 50 million frames, no transfer learning
- Testing: averaged the final score over 30 sessions per game
- Measured performance of DQN RL and a linear-learner RL (with custom features) vs. a human player: $100 \cdot (\text{RL} - \text{random}) / (\text{human} - \text{random})$
- I.e., human = 100%, random = 0%

DQN outperformed the linear learner on all but 6 games, outperformed the human on 22, and was comparable to the human on 7

Shortcoming: performance is poor (near random) when long-term planning is required, e.g., Montezuma's Revenge

SLIDE 32

DQN Example: Playing Atari (Mnih et al., 2015)

Results

SLIDE 33

Go Example

- One of the most complex board games humans have
- Checkers has about $10^{18}$ distinct states, Backgammon about $10^{20}$, Chess about $10^{47}$, Go about $10^{170}$
- The number of atoms in the universe is around $10^{81}$
- Another issue: it is difficult to quantify the goodness of a board configuration

- AlphaGo: used RL and human knowledge to defeat a professional player
- AlphaGo Zero: improved on AlphaGo without human knowledge
- AlphaZero: generalized to chess and shogi with general RL

SLIDE 34

AlphaGo (Silver et al., 2016)

Overview

Input: 19 × 19 × 48 image stack representing the player's and opponent's board positions, the number of opponent's stones that could be captured at each point, etc.

Training:

- Supervised learning (classification) of policy networks $p_\pi$ and $p_\sigma$ based on expert moves for given states
- Transfer learning from $p_\sigma$ to policy network $p_\rho$
- Reinforcement learning to refine $p_\rho$ via policy gradient and self-play
- Regression to learn the value network $v_\theta$

Live play:

- Uses these networks in Monte Carlo tree search to choose actions during games

99.8% winning rate vs. other Go programs; defeated a human Go champion 5–0

SLIDE 35

AlphaGo (Silver et al., 2016)

Overview

SLIDE 36

AlphaGo (Silver et al., 2016)

Supervised Learning

Supervised learning of policies $p_\pi$ and $p_\sigma$:

- Board positions from the KGS Go Server; labels are experts' moves
- $p_\sigma$ is the full network (accuracy 57%, 3 ms/move); $p_\pi$ is simpler (accuracy 24%, 2 µs/move)

SLIDE 37

AlphaGo (Silver et al., 2016)

Transfer Learning

Transfer learning of $p_\sigma$ to $p_\rho$ (same architecture, copy parameters)

SLIDE 38

AlphaGo (Silver et al., 2016)

Reinforcement Learning

Trained $p_\rho$ via play against $p_{\tilde{\rho}}$ (a randomly selected earlier version of $p_\rho$)

For state $s_t$, the terminal reward is $z_t = +1$ if the game is ultimately won from $s_t$ and $-1$ otherwise

Note that $p_\rho$ does not compute the value of actions like Q-learning does:

- It directly implements a policy that outputs $a_t$ given $s_t$

Use a policy gradient method to train:

- If the agent chooses action $a_t$ in state $s_t$ and ultimately wins 90% of the time, what should happen to $p_\rho(a_t \mid s_t)$? How can we make that happen?

SLIDE 39

AlphaGo (Silver et al., 2016)

Policy Gradient

REINFORCE: REward Increment = Nonnegative Factor times Offset Reinforcement times Characteristic Eligibility

Perform gradient ascent to increase the probability of actions that on average lead to greater rewards:

$$\Delta \rho_j = \alpha (r - b_s) \frac{\partial \log p_\rho(a \mid s)}{\partial \rho_j},$$

where $\alpha$ is the learning rate, $r$ is the reward, $a$ is the action taken in state $s$, and $b_s$ is the reinforcement baseline (independent of $a$)

- The baseline keeps the expected update the same but reduces its variance
- E.g., if all actions from $s$ are good, $b_s$ helps differentiate among them
- Common choice: $b_s = \hat{v}(s)$, the estimated value of $s$
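Not AlphaGo's code, but a small sketch of the REINFORCE update for a softmax policy with a linear score, to show how the $(r - b_s)\,\nabla \log p_\rho(a \mid s)$ term is applied; all sizes and names are made up for illustration.

```python
import numpy as np

# Sketch: REINFORCE with a baseline for a softmax policy over a small discrete
# action set, with a linear score rho[a] . phi(s).  Illustrates the update
# Delta rho_j = alpha * (r - b_s) * d log p(a|s) / d rho_j; not AlphaGo's network.

N_ACTIONS, N_FEATURES, ALPHA = 3, 5, 0.01
rho = np.zeros((N_ACTIONS, N_FEATURES))

def policy(phi_s):
    """Softmax action probabilities p_rho(a | s)."""
    scores = rho @ phi_s
    e = np.exp(scores - scores.max())
    return e / e.sum()

def reinforce_update(phi_s, a, reward, baseline):
    """Gradient-ascent step on log p_rho(a | s), scaled by (r - b_s)."""
    probs = policy(phi_s)
    # For a softmax policy: d log p(a|s) / d rho[k] = (1{k == a} - p(k|s)) * phi(s)
    for k in range(N_ACTIONS):
        grad_log = ((1.0 if k == a else 0.0) - probs[k]) * phi_s
        rho[k] += ALPHA * (reward - baseline) * grad_log

# Usage with made-up data: action 1 led to a win (r = +1), baseline 0.2
reinforce_update(np.random.rand(N_FEATURES), a=1, reward=1.0, baseline=0.2)
```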

SLIDE 40

AlphaGo (Silver et al., 2016)

Policy Gradient

AlphaGo uses REINFORCE with baseline $b_s = v_\theta(s)$, $r = z_t$, and sums over all game steps $t = 1, \ldots, T$

Averaging updates over games $i = 1, \ldots, n$:

$$\Delta\rho = \frac{\alpha}{n} \sum_{i=1}^{n} \sum_{t=1}^{T_i} \left( z_t^i - v_\theta(s_t^i) \right) \nabla_\rho \log p_\rho(a_t^i \mid s_t^i)$$

SLIDE 41

AlphaGo (Silver et al., 2016)

Value Learning

$v_\theta(s)$ approximates $v^{p_\rho}(s)$, the value of $s$ under policy $p_\rho$

- Regression problem on state–outcome pairs $(s, z)$
- Train with MSE loss
- Analogous to experience replay, overfitting was mitigated by drawing each instance from a unique self-play game:

1. Choose a time step $U$ uniformly from $\{1, \ldots, 450\}$
2. Play moves $t = 1, \ldots, U$ from $p_\sigma$
3. Choose move $a_U$ uniformly at random
4. Play moves $t = U + 1, \ldots, T$ from $p_\rho$
5. Add the instance $(s_{U+1}, z_{U+1})$ to the training set

SLIDE 42

AlphaGo (Silver et al., 2016)

Live Play

- Now we're ready for live play
- Rather than exclusively using $p_\rho$ or $v_\theta$ to determine actions, base the action choice on a rollout algorithm
- Use the learned functions to simulate game play from state $s$ forward in time ("rolling it out") and compute statistics about the outcome
- Repeat as much as the time limit allows, then choose the most favorable action

⇒ Monte Carlo Tree Search (MCTS)

SLIDE 43

AlphaGo (Silver et al., 2016)

Monte Carlo Tree Search

Given the current state $s$, MCTS runs four operations:

(a) Selection: given a tree rooted at $s$, follow the tree policy to traverse the tree and select a leaf node
(b) Expansion: expand the selected leaf by adding children
(c) Evaluation (simulation): perform a rollout to the end of the game, using $p_\pi$ to speed up this part
(d) Backup: use the rollout results to update the action values in the tree

Each tree edge (an $(s, a)$ pair) has statistics:

- Prior probability $P(s, a)$
- Action values $W_v(s, a)$ and $W_r(s, a)$
- Visit counts $N_v(s, a)$ and $N_r(s, a)$
- Mean action value $Q(s, a)$

After many parallel simulations, choose the action maximizing $N_v(s, a)$

SLIDE 44

AlphaGo (Silver et al., 2016)

Monte Carlo Tree Search: Selection

Before reaching a leaf state, choose action

$$a_t = \underset{a}{\operatorname{argmax}}\ \left( Q(s_t, a) + u(s_t, a) \right), \quad \text{where}\quad u(s, a) = \frac{c\, P(s, a)\, \sqrt{\sum_b N_r(s, b)}}{1 + N_r(s, a)}$$

- I.e., if $(s_t, a_t)$ has been evaluated a lot relative to other actions from $s_t$, then $N_r(s_t, a_t)$ is large and $a_t$ is evaluated mainly by $Q$
- Otherwise, exploration is encouraged
- To avoid all parallel searches choosing the same actions: when $(s_t, a_t)$ is chosen, update its statistics as if $n_{vl}$ games were lost (a "virtual loss"): $N_r(s_t, a_t) = N_r(s_t, a_t) + n_{vl}$ and $W_r(s_t, a_t) = W_r(s_t, a_t) - n_{vl}$ (see the sketch below)
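A minimal sketch of this selection rule plus the virtual loss, with edge statistics kept in plain dictionaries purely for illustration (the constants c and n_vl here are arbitrary, not AlphaGo's values).

```python
import math

# Sketch: pick argmax_a [Q(s,a) + u(s,a)] with
# u(s,a) = c * P(s,a) * sqrt(sum_b N_r(s,b)) / (1 + N_r(s,a)),
# then apply a virtual loss so parallel searches spread out.

C_PUCT, N_VL = 5.0, 3          # exploration constant and virtual-loss size (illustrative)

def select_action(edges):
    """edges: {a: {"Q": float, "P": float, "Nr": int, "Wr": float}} for one state."""
    total_n = sum(e["Nr"] for e in edges.values())
    def score(a):
        e = edges[a]
        u = C_PUCT * e["P"] * math.sqrt(total_n) / (1 + e["Nr"])
        return e["Q"] + u
    best = max(edges, key=score)
    # Virtual loss: pretend n_vl games were played and lost through this edge
    edges[best]["Nr"] += N_VL
    edges[best]["Wr"] -= N_VL
    return best

# Usage with two toy edges:
edges = {"a1": {"Q": 0.4, "P": 0.6, "Nr": 10, "Wr": 4.0},
         "a2": {"Q": 0.5, "P": 0.4, "Nr": 2,  "Wr": 1.0}}
print(select_action(edges))
```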

SLIDE 45

AlphaGo (Silver et al., 2016)

Monte Carlo Tree Search: Expansion

If $N_r(s, a) > n_{thr}$, expand the next state $s'$ in the tree, initializing its edges with

$$N_v(s', a) = N_r(s', a) = 0, \quad W_v(s', a) = W_r(s', a) = 0, \quad P(s', a) = p_\sigma(a \mid s')$$

SLIDE 46

AlphaGo (Silver et al., 2016)

Monte Carlo Tree Search: Evaluation

- Play out the game from the leaf $s_L$ until it ends
- At each time $t \ge L$, each player chooses $a_t \sim p_\pi$
- At the game's end, compute $z_t = \pm 1$ for all $t$

SLIDE 47

AlphaGo (Silver et al., 2016)

Monte Carlo Tree Search: Backup

At the end of a simulated game, update the statistics for all steps $t \le L$:

1. Undo the virtual loss and fold in the outcome $z$: $N_r(s_t, a_t) = N_r(s_t, a_t) - n_{vl} + 1$ and $W_r(s_t, a_t) = W_r(s_t, a_t) + n_{vl} + z_t$

2. After the leaf evaluation is done: $N_v(s_t, a_t) = N_v(s_t, a_t) + 1$ and $W_v(s_t, a_t) = W_v(s_t, a_t) + v_\theta(s_L)$

3. Take a weighted average for the final action value:

$$Q(s, a) = (1 - \lambda)\, \frac{W_v(s, a)}{N_v(s, a)} + \lambda\, \frac{W_r(s, a)}{N_r(s, a)}$$

SLIDE 48

AlphaGo Zero (Silver et al., 2017)

Overview

The "Zero" refers to zero human knowledge:

- No supervised training from KGS Go data
- Trained only via RL in self-play
- Trained a single network $(p, v) = f_\theta$ for both policy and value

Integrated MCTS into training as well as live play:

- Folded lookahead search into the training loop
- Did not roll out to the end of the game

Input: 19 × 19 × 17 image stack:

- Eight of the 17 binary planes indicate the locations of the player's stones over the past 8 time steps
- Eight of the 17 binary planes indicate the locations of the opponent's stones over the past 8 time steps
- The final plane indicates the color to play

Discovered new Go knowledge during self-play, including previously unknown tactics

SLIDE 49

AlphaGo Zero (Silver et al., 2017)

Self-Play

- Play games against itself, choosing actions $a_t \sim \pi_t$ via MCTS
- The outcome of each game is recorded as $z = \pm 1$

SLIDE 50

AlphaGo Zero (Silver et al., 2017)

Training

Training is a form of policy iteration, alternating between:

- Policy evaluation: estimating the value $v$ of policy $p$
- Policy improvement: improving the policy with respect to $v$

Use MCTS to map the NN policy $p$ to a search policy $\pi$; self-play outcomes inform updates to $v$

SLIDE 51

AlphaGo Zero (Silver et al., 2017)

Training

State $s_t$'s targets are the MCTS distribution $\pi_t$ and the reward $z_t$

Update the network using the loss function

$$\ell = \underbrace{(z_t - v(s_t))^2}_{\text{squared loss}}\ \underbrace{-\ \pi_t^\top \log p_t}_{\text{cross-entropy}}\ \underbrace{+\ c\,\|\theta\|^2}_{\text{regularizer}}$$
SLIDE 52

AlphaGo Zero (Silver et al., 2017)

MCTS

MCTS is similar to that of AlphaGo, but drops $N_r$ and $W_r$ since there is no rollout: each edge stores $[N(s, a), W(s, a), Q(s, a), P(s, a)]$

(a) Select: same as before, but $u(s, a)$ uses $N$ instead of $N_r$
(b) Expand + evaluate: $f_\theta$ computes the value $v(s)$ (modulo symmetry) for backup, instead of rolling out to the game's end
(c) Backup: same as before, but with no $N_r$ or $W_r$
(d) Play policy:

$$\pi(a \mid s_0) = \frac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}$$

($\tau$ controls exploration)

SLIDE 53

AlphaZero (Silver et al., 2017b)

- AlphaGo Zero's approach applied to chess and shogi
- Same use of $(p, v) = f_\theta(s)$ and MCTS
- Go-specific parts removed, plus other generalizations
- No game-specific hyperparameter tuning

A general framework similar in spirit to the Atari work
