

  1. Reinforcement Learning (Philipp Koehn, 16 April 2020)

  2. Rewards
     ● Agent takes actions
     ● Agent occasionally receives reward
     ● Maybe just at the end of the process, e.g., chess:
       – agent has to decide on individual moves
       – reward only at the end: win/lose
     ● Maybe more frequently:
       – Scrabble: points for each word played
       – ping pong: any point scored
       – baby learning to crawl: any forward movement

  3. Markov Decision Process
     [Figure: state map of the grid world with stochastic movement]
     ● States s ∈ S, actions a ∈ A
     ● Model T(s,a,s′) ≡ P(s′ ∣ s,a) = probability that a in s leads to s′
     ● Reward function R(s) (or R(s,a), R(s,a,s′)):
       R(s) = { −0.04 (small penalty) for nonterminal states
              { ±1 for terminal states
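To make these definitions concrete, here is a minimal Python sketch of such an MDP. The 4x3 layout, the wall at (2,2), and the 0.8/0.1/0.1 movement model are assumptions borrowed from the classic AIMA grid world that these slides appear to use; only the reward values are stated on the slide.

```python
# Sketch of the grid-world MDP: states, actions, transition model T, reward R.
GOAL, PIT, WALL = (4, 3), (4, 2), (2, 2)   # assumed (column, row) coordinates
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != WALL]
ACTIONS = ["up", "down", "left", "right"]

def R(s):
    """Reward function R(s): +/-1 for terminal states, -0.04 otherwise."""
    return 1.0 if s == GOAL else -1.0 if s == PIT else -0.04

DELTA = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def T(s, a, s_next):
    """Transition model T(s,a,s') = P(s' | s,a): assumed 0.8 in the intended
    direction, 0.1 for each perpendicular direction; blocked moves stay put."""
    prob = 0.0
    for direction, p in [(a, 0.8), (PERP[a][0], 0.1), (PERP[a][1], 0.1)]:
        dc, dr = DELTA[direction]
        candidate = (s[0] + dc, s[1] + dr)
        if candidate not in STATES:     # ran into a wall or the border
            candidate = s
        if candidate == s_next:
            prob += p
    return prob
```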

  4. Agent Designs
     ● Utility-based agent
       – needs a model of the environment
       – learns a utility function on states
       – selects the action that maximizes expected outcome utility
     ● Q-learning
       – learns an action-utility function (Q(s,a) function)
       – does not need to model outcomes of actions
       – function provides the expected utility of taking a given action in a given state
     ● Reflex agent
       – learns a policy that maps states to actions

  5. passive reinforcement learning

  6. Setup
     [Figure: grid-world state map with stochastic movement]
     ● Reward function
       R(s) = { +1 for goal
              { −1 for pit
              { −0.04 for other states
     ● We know which state we are in (= fully observable environment)
     ● We know which actions we can take
     ● But only after taking an action
       → new state becomes known
       → reward becomes known
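A small sketch of the interaction this setup implies, reusing R, T, and STATES from the MDP sketch above (the function names here are illustrative): the agent commits to an action first, and only then observes the successor state and its reward.

```python
import random

def sample_successor(s, a):
    """Draw s' according to T(s,a,.)"""
    weights = [T(s, a, s2) for s2 in STATES]
    return random.choices(STATES, weights=weights)[0]

def step(s, a):
    """Only after taking the action do the new state and its reward become known."""
    s_next = sample_successor(s, a)
    return s_next, R(s_next)
```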

  7. Passive Reinforcement Learning
     ● Given a policy
     ● Task: compute the utility of the policy
     ● We will extend this later to active reinforcement learning (⇒ policy needs to be learned)

  8.–14. Sampling
     [Figures: one trial is sampled step by step in the grid world; each nonterminal state visited contributes a reward of −0.04, and the trial ends with +1 at the goal]

  15. Sampling
     ● Sample of reward to go
     [Figure: reward-to-go values along the sampled trial, from 0.72 at the start state to 1.00 at the goal: 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.00]

  16.–17. Sampling
     [Figures: additional sampled trials]

  18. Utility of Policy
     ● Definition of the utility U of the policy π for state s:
       U^π(s) = E[ ∑_{t=0}^{∞} γ^t R(S_t) ]
     ● Start at state S_0 = s
     ● Reward for a state is R(s)
     ● Discount factor γ (we use γ = 1 in our examples)
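As a worked instance of this definition, the return of the trial from the sampling slides can be computed directly from its reward sequence. The sketch below assumes the trial visited seven nonterminal states before the +1 terminal state, which is what the 0.72 reward-to-go shown for the start state implies with γ = 1.

```python
def utility_sample(rewards, gamma=1.0):
    """One sample of U^pi(s): sum over t of gamma^t * R(S_t) along one trial."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Seven -0.04 steps followed by the +1 goal state.
print(utility_sample([-0.04] * 7 + [1.0]))   # ~0.72
```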

  19. Direct Utility Estimation
     ● Learning from the samples
     ● Reward to go of the first trial:
       – (1,1) one sample: 0.72
       – (1,2) two samples: 0.76, 0.84
       – (1,3) two samples: 0.80, 0.88
     [Figure: grid annotated with reward-to-go values 0.72, 0.76, 0.80, 0.84, 0.88, 0.92, 0.96, 1.00]
     ● The reward to go will converge to the utility of the state
     ● But very slowly; can we do better?
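A sketch of direct utility estimation as described here: average the observed reward-to-go over every visit to a state. The trial format (a list of (state, reward) pairs) is an assumption for illustration.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """U(s) = average reward-to-go over all visits to s in all observed trials."""
    totals, counts = defaultdict(float), defaultdict(int)
    for trial in trials:                      # trial = [(state, reward), ...]
        rewards = [r for _, r in trial]
        for t, (s, _) in enumerate(trial):
            reward_to_go = sum((gamma ** k) * r for k, r in enumerate(rewards[t:]))
            totals[s] += reward_to_go
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in counts}
```

With the single trial from the slides, state (1,2) gets the two samples 0.76 and 0.84, so its estimate after this one trial is 0.80.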

  20. Bellman Equation
     ● Direct utility estimation ignores the dependency between states
     ● Given by the Bellman equation:
       U^π(s) = R(s) + γ ∑_{s′} P(s′ ∣ s, π(s)) U^π(s′)
       (γ = reward decay)
     ● Use of this known dependence can speed up learning
     ● Requires learning the transition probabilities P(s′ ∣ s, π(s))
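A sketch of simple iterative policy evaluation built directly on this equation. Here P is assumed to be a learned model in the form of a dict mapping (s, a) to a distribution over successor states; terminal states simply have no outgoing entries.

```python
def policy_evaluation(policy, P, R, states, gamma=1.0, sweeps=100):
    """Repeatedly apply U(s) <- R(s) + gamma * sum_s' P(s'|s,pi(s)) * U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = {s: R(s) + gamma * sum(p * U[s2]
                                   for s2, p in P.get((s, policy(s)), {}).items())
             for s in states}
    return U
```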

  21. Adaptive Dynamic Programming
     Need to learn:
     ● State rewards R(s)
       – whenever a state is visited, record its reward (deterministic)
     ● Outcome of action π(s) at state s according to policy π
       – collect statistic count(s,s′) that s′ is reached from s
       – estimate the probability distribution
         P(s′ ∣ s, π(s)) = count(s,s′) / ∑_{s′′} count(s,s′′)
     ⇒ Ingredients for a policy evaluation algorithm
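A sketch of these two learning ingredients; combined with the policy evaluation sketch above, this gives a passive ADP learner. The class and method names are illustrative, not from the slides.

```python
from collections import defaultdict

class LearnedModel:
    """Record observed state rewards and transition counts for the policy's actions."""

    def __init__(self):
        self.R = {}                                          # R(s), deterministic
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[(s,a)][s']

    def observe(self, s, a, s_next, reward_next):
        """Called after each transition s --a--> s' with the observed reward R(s')."""
        self.R[s_next] = reward_next
        self.counts[(s, a)][s_next] += 1

    def P(self, s, a):
        """Estimate P(s'|s,a) = count(s,s') / sum over s'' of count(s,s'')."""
        total = sum(self.counts[(s, a)].values())
        return {s2: c / total for s2, c in self.counts[(s, a)].items()}
```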

  22. Adaptive Dynamic Programming

  23. Learning Curve
     ● Major change at the 78th trial: first time the agent terminated in the −1 state at (4,2)

  24. Temporal Difference Learning
     ● Idea: do not model P(s′ ∣ s, π(s)); directly adjust the utilities U(s) of all visited states
     ● Estimate of current utility: U^π(s)
     ● Estimate of utility after the action: R(s) + γ U^π(s′)
     ● Adjust the utility of the current state U^π(s) if they differ:
       ∆U^π(s) = α (R(s) + γ U^π(s′) − U^π(s))
       (α = learning rate)
     ● The learning rate may decrease when a state has been visited often
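The update rule translates almost literally into code. A sketch, with U as a dict of utility estimates and reward being R(s), the reward of the state just left.

```python
def td_update(U, s, s_next, reward, alpha=0.1, gamma=1.0):
    """U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))."""
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])
```

Matching the last bullet, a common refinement is to let alpha shrink with the number of visits to s, for example alpha proportional to 1/N(s).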

  25. Learning Curve
     ● Noisier, converging more slowly

  26. Comparison
     ● Both eventually converge to the correct values
     ● Adaptive dynamic programming (ADP) is faster than temporal difference learning (TD)
       – both make adjustments to make successors agree
       – but: ADP adjusts all possible successors, TD only the observed successor
     ● ADP is computationally more expensive due to the policy evaluation algorithm

  27. active reinforcement learning

  28. Active Reinforcement Learning
     ● Previously: passive agent follows a prescribed policy
     ● Now: active agent decides which action to take
       – following the optimal policy (as currently viewed)
       – exploration
     ● Goal: optimize rewards for a given time frame

  29. Greedy Agent
     1. Start with an initial policy
     2. Compute utilities (using ADP)
     3. Optimize the policy
     4. Go to step 2
     ● This very seldom converges to the globally optimal policy (see the sketch below)
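A sketch of this loop, reusing the policy_evaluation function sketched earlier; the function name and model format are assumptions. Because the agent always acts on its current model and never explores, it can settle on a suboptimal policy.

```python
def greedy_agent(states, actions, P, R, gamma=1.0, rounds=20):
    """1) initial policy, 2) evaluate it (ADP-style), 3) make it greedy, repeat."""
    policy = {s: actions[0] for s in states}
    for _ in range(rounds):
        U = policy_evaluation(lambda s: policy[s], P, R, states, gamma)
        for s in states:
            policy[s] = max(actions,
                            key=lambda a: sum(p * U[s2]
                                              for s2, p in P.get((s, a), {}).items()))
    return policy
```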

  30. Learning Curve
     ● Greedy agent stuck in a local optimum

  31. Bandit Problems
     ● Bandit: slot machine

  32. Bandit Problems
     ● Bandit: slot machine
     ● N-armed bandit: n levers
     ● Each lever has a different probability distribution over payoffs
     ● Spend a coin on
       – the presumed optimal payoff
       – exploration (a new lever)
     ● If the levers are independent
       – Gittins index: formula for the solution
       – uses payoff / number of times used

  33. Greedy in the Limit of Infinite Exploration
     ● Explore any action in any state an unbounded number of times
     ● Eventually has to become greedy
       – carry out the optimal policy ⇒ maximize reward
     ● Simple strategy (sketched below)
       – with probability 1/t take a random action
       – initially (t small) focus on exploration
       – later (t big) focus on the optimal policy
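A minimal sketch of the 1/t strategy, assuming t counts the number of steps taken so far (starting at 1) and that the greedy action under the current policy has already been computed.

```python
import random

def glie_action(t, greedy_action, actions):
    """With probability 1/t take a random action, otherwise act greedily.
    For small t this mostly explores; as t grows it follows the current policy."""
    if random.random() < 1.0 / t:
        return random.choice(actions)
    return greedy_action
```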

  34. Extension of Adaptive Dynamic Programming
     ● Previous definition of the utility calculation:
       U(s) ← R(s) + γ max_a ∑_{s′} P(s′ ∣ s,a) U(s′)
     ● New utility calculation:
       U⁺(s) ← R(s) + γ max_a f( ∑_{s′} P(s′ ∣ s,a) U⁺(s′), N(s,a) )
     ● One possible definition of f(u,n):
       f(u,n) = { R⁺ if n < N_e
                { u  otherwise
       R⁺ is an optimistic estimate: the best possible reward obtainable in any state
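A sketch of this exploratory update, reusing the dict-based model P from the earlier sketches; N is assumed to be a dict counting how often each (s, a) pair has been tried, and the parameter values are the ones quoted on the next slide.

```python
R_PLUS, N_E = 2.0, 5   # optimistic reward estimate and visit threshold (next slide)

def f(u, n):
    """Exploration function: stay optimistic until (s,a) has been tried N_e times."""
    return R_PLUS if n < N_E else u

def optimistic_utility_update(U_plus, N, s, actions, P, R, gamma=1.0):
    """U+(s) <- R(s) + gamma * max_a f( sum_s' P(s'|s,a) U+(s'), N(s,a) )."""
    U_plus[s] = R(s) + gamma * max(
        f(sum(p * U_plus[s2] for s2, p in P.get((s, a), {}).items()),
          N.get((s, a), 0))
        for a in actions)
```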

  35. Learning Curve
     ● Performance of the exploratory ADP agent
     ● Parameter settings: R⁺ = 2 and N_e = 5
     ● Fairly quick convergence to the optimal policy
