Reinforcement Learning
Manuela Veloso
See "Machine Learning" – Tom Mitchell, chapter 13 on RL
15-381, Fall 2009
Learning Paradigms
• Supervised learning
– Classification / concept learning
– Learning from labeled data
– Function approximation
• Unsupervised learning
– Data is not labeled
– Data needs to be clustered, grouped
– We need a distance metric
• Control and action model learning
– Learning to select actions efficiently
– Feedback: goal achievement, failure, reward
– Control learning, reinforcement learning
2
Discounted Rewards
• "Reward" today versus future (promised) reward
• Future rewards are not worth as much as current rewards.
• $100K + $100K + $100K + ... is an INFINITE sum
• Assume reality discounts the future: a discount factor, say γ.
• $100K + γ $100K + γ² $100K + ... CONVERGES.
3
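The convergence claim is easy to check numerically: a constant reward discounted by γ is a geometric series with limit r / (1 − γ). A minimal sketch (the $100K and γ = 0.9 figures are just for illustration):

```python
# Discounted sum of a constant reward: sum of gamma**t * r converges
# to r / (1 - gamma) when 0 <= gamma < 1 (geometric series).

def discounted_sum(reward, gamma, horizon):
    """Truncated sum of gamma**t * reward for t = 0 .. horizon-1."""
    return sum(gamma**t * reward for t in range(horizon))

reward, gamma = 100_000, 0.9
approx = discounted_sum(reward, gamma, horizon=1000)
closed_form = reward / (1 - gamma)   # limit of the infinite series
```

With γ = 0.9, the infinite stream of $100K rewards is worth $1M today, not infinity.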
The Reinforcement Learning Problem
Goal: Learn to choose actions that maximize
r_0 + γ r_1 + γ² r_2 + ..., where 0 ≤ γ < 1
4
Markov Decision Processes
• Assume the world can be modeled as a Markov Decision Process, with rewards as a function of state and action.
• Markov assumption: the new state and reward are a function only of the current state and action, i.e.,
– s_{t+1} = δ(s_t, a_t)
– r_t = r(s_t, a_t)
• Unknown, possibly nondeterministic world: the functions δ and r may be nondeterministic and are not necessarily known to the learner.
5
Agent's Learning Task
• Execute actions in the world,
• Observe the state of the world,
• Learn an action policy π : S → A
• Maximize the expected reward E[r_t + γ r_{t+1} + γ² r_{t+2} + ...] from any starting state in S
– 0 ≤ γ < 1 is the discount factor for future rewards
6
What to Learn
• We have a target function to learn: π : S → A
• We have no training examples of the form ⟨s, a⟩
• We have training examples of the form ⟨⟨s, a⟩, r⟩ — immediate reward values r(s, a) (rewards can be any real number)
7
Policies
• There are many possible policies, of course not necessarily optimal, i.e., with maximum expected reward.
• There can also be nondeterministic policies.
8
Value Function
• For each possible policy π, define an evaluation function over states (deterministic world):

V^π(s) ≡ r_t + γ r_{t+1} + γ² r_{t+2} + ...
       ≡ Σ_{i=0}^∞ γ^i r_{t+i}

where r_t, r_{t+1}, ... are generated by following policy π starting at state s
• Learning task: Learn the OPTIMAL policy π* ≡ argmax_π V^π(s), (∀s)
9
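In a deterministic world, V^π(s) can be approximated by simply rolling the policy forward and truncating the infinite discounted sum. A sketch, assuming a made-up two-state world (states A/B, single action "go") not taken from the lecture:

```python
# Approximating V^pi(s) by following a policy and truncating the
# discounted sum. The toy transition and reward tables are invented.

delta = {("A", "go"): "B", ("B", "go"): "A"}   # deterministic transitions
r     = {("A", "go"): 1.0, ("B", "go"): 0.0}  # immediate rewards

def V(policy, s, gamma=0.9, steps=500):
    """Truncated discounted return from state s under the given policy."""
    total = 0.0
    for i in range(steps):
        a = policy(s)
        total += gamma**i * r[(s, a)]
        s = delta[(s, a)]
    return total

pi = lambda s: "go"   # the only available policy in this toy world
# From A the reward stream is 1, 0, 1, 0, ..., so V(A) = 1 / (1 - gamma**2)
```

Truncating at 500 steps is safe here because γ^500 is vanishingly small.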
Optimal Value Function
• Learn the evaluation function V^{π*} ≡ V*.
• Select the optimal action from any state s, i.e., have an optimal policy, by using V* with one-step lookahead:

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
10
A Problem with Learning V*
π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
A problem:
• This works well if the agent knows δ : S × A → S, and r : S × A → ℜ
• When it doesn't, it can't choose actions this way
11
Q Function
• Define a new function very similar to V*:

Q(s, a) ≡ r(s, a) + γ V*(δ(s, a))

• Learn the Q function – Q-learning
• If the agent learns Q, it can choose the optimal action even without knowing δ or r:

π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
π*(s) = argmax_a Q(s, a)
12
Q and V*
Note that Q and V* are closely related:

V*(s) = max_{a'} Q(s, a')

which allows us to write Q recursively as

Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
            = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')

Q-learning actively generates examples. It "processes" examples by updating its Q values. While learning, the Q values are approximations.
13
Training Rule to Learn Q
Let Q̂ denote the current approximation to Q. Then Q-learning uses the following training rule:

Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')

where s' is the state resulting from applying action a in state s, and r is the reward that is returned.
14
Example: Updating Q̂

Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                ← 0 + 0.9 max{63, 81, 100}
                ← 90
15
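The same update is one line of arithmetic. Only the values 0.9, 63, 81, 100 come from the slide; the action names at state s_2 are placeholders:

```python
gamma = 0.9
r = 0
# Current Q-hat estimates at the successor state s2 (action names invented)
q_next = {"a1": 63, "a2": 81, "a3": 100}

# Q-hat(s1, a_right) <- r + gamma * max over a' of Q-hat(s2, a')
q_update = r + gamma * max(q_next.values())
```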
Q-Learning for Deterministic Worlds
For each s, a initialize the table entry Q̂(s, a) ← 0
Observe the current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
• s ← s'
16
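The algorithm above can be sketched in a few lines of Python. The 3-state corridor (S0 – S1 – GOAL, reward 100 on reaching the goal, γ = 0.9) is an invented toy problem, not one from the lecture:

```python
import random

STATES = ["S0", "S1", "GOAL"]
ACTIONS = ["left", "right"]
GAMMA = 0.9

def step(s, a):
    """Deterministic transition delta(s, a) and reward r(s, a)."""
    if s == "S1" and a == "right":
        return "GOAL", 100.0
    if s == "S0" and a == "right":
        return "S1", 0.0
    if s == "S1" and a == "left":
        return "S0", 0.0
    return "S0", 0.0            # left from S0 bumps into the wall

# Initialize every table entry Q-hat(s, a) to 0
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

random.seed(0)
for _ in range(200):            # episodes, restarting at S0
    s = "S0"
    while s != "GOAL":
        a = random.choice(ACTIONS)          # explore with random actions
        s2, r = step(s, a)
        # Q-hat(s, a) <- r + gamma * max over a' of Q-hat(s', a')
        Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        s = s2
```

After enough episodes the table converges to Q*: 100 for the final step into the goal, and γ-discounted values (90, 81, ...) for the actions further away.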
Q-Learning Iterations
• The agent starts at the bottom-left corner and moves clockwise around the perimeter; initially Q̂(s, a) = 0; γ = 0.8
• Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
[grid-world figure omitted]
17
A Simple Example
How many possible policies are there in this 3-state, 2-action deterministic world? (Each of the 3 states maps to one of 2 actions: 2³ = 8 deterministic policies.)
A robot starts in the state MILD. It moves for 4 steps choosing actions West, East, East, West. The initial values of its Q-table are 0 and the discount factor is γ = 0.5.

Initial state: MILD
Step 1: Action West → new state HOT
Step 2: Action East → new state MILD
Step 3: Action East → new state COLD
Step 4: Action West → new state MILD

Q̂-table (East / West) after each step:

        Initial   Step 1   Step 2   Step 3   Step 4
HOT     0 / 0     0 / 0    5 / 0    5 / 0    5 / 0
MILD    0 / 0     0 / 10   0 / 10   0 / 10   0 / 10
COLD    0 / 0     0 / 0    0 / 0    0 / 0    0 / 5
18
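The four updates can be replayed directly. The reward function is inferred from the table values: the only nonzero immediate reward assumed here is r(MILD, West) = 10, which is what the Q̂ entries imply:

```python
# Replaying the four Q-learning updates with gamma = 0.5.
# Assumption (inferred, not stated on the slide): r(MILD, West) = 10,
# all other immediate rewards are 0; every Q-hat entry starts at 0.

GAMMA = 0.5
STATES, ACTIONS = ("HOT", "MILD", "COLD"), ("East", "West")
Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}

def update(s, a, r, s2):
    Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)

update("MILD", "West", 10, "HOT")   # Q(MILD, West) = 10 + 0.5 * 0  = 10
update("HOT",  "East",  0, "MILD")  # Q(HOT, East)  =  0 + 0.5 * 10 =  5
update("MILD", "East",  0, "COLD")  # Q(MILD, East) =  0 + 0.5 * 0  =  0
update("COLD", "West",  0, "MILD")  # Q(COLD, West) =  0 + 0.5 * 10 =  5
```

Each update pulls value back one step along the trajectory, which is how the 10 at (MILD, West) propagates as 5 into (HOT, East) and (COLD, West).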
Q-Learning Example
[figure-only slide]
19
Nondeterministic Case
What if the reward and next state are non-deterministic? We redefine V and Q by taking expected values:

V^π(s) ≡ E[r_t + γ r_{t+1} + γ² r_{t+2} + ...]
       ≡ Σ_{i=0}^∞ γ^i E[r_{t+i}]

Q(s, a) ≡ E[r(s, a) + γ V*(δ(s, a))]
20
Nondeterministic Case
Q-learning generalizes to nondeterministic worlds. Alter the training rule to

Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [r + γ max_{a'} Q̂_{n−1}(s', a')]

where α_n = 1 / (1 + visits_n(s, a)), and s' = δ(s, a).

Q̂ still converges to Q* (Watkins and Dayan, 1992).
21
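A minimal sketch of the altered rule, with α_n = 1 / (1 + visits_n(s, a)); the states, actions, and the two noisy reward samples below are made up for illustration:

```python
# Nondeterministic Q-learning update: blend the old estimate with the
# new sample using a decaying learning rate alpha_n.

GAMMA = 0.9
Q = {}        # (s, a) -> current Q-hat estimate
visits = {}   # (s, a) -> number of times (s, a) has been updated

def nd_update(s, a, r, s2, actions):
    n = visits.get((s, a), 0) + 1          # visits_n(s, a)
    visits[(s, a)] = n
    alpha = 1.0 / (1 + n)
    sample = r + GAMMA * max(Q.get((s2, b), 0.0) for b in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample

# Two noisy reward samples for the same (s, a) pair (invented numbers);
# the decaying alpha averages over the stochastic rewards.
for r in (0.0, 10.0):
    nd_update("s", "a", r, "terminal", ["a"])
```

Because α_n shrinks as (s, a) is revisited, later samples perturb the estimate less and less, which is what makes convergence possible despite the noise.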
Nondeterministic Example
[figure-only slide]
22