

slide-1
SLIDE 1

Reinforcement Learning

or, Learning and Planning with Markov Decision Processes

295 Seminar, Winter 2018, Rina Dechter

Slides will follow David Silver's lectures and Sutton & Barto's book.

Goals: To learn together the basics of RL. Some lectures, and classic and recent papers from the literature. Students will be active learners and teachers.

1

Class page Demo Detailed demo

295, Winter 2018 1

slide-2
SLIDE 2

Topics

  • 1. Introduction and Markov Decision Processes: basic concepts. S&B chapters 1, 3. (myslides 2)
  • 2. Planning by Dynamic Programming: Policy Iteration, Value Iteration. S&B chapter 4. (myslides 3)
  • 3. Monte-Carlo (MC) and Temporal Differences (TD): S&B chapters 5 and 6. (myslides 4, myslides 5)
  • 4. Multi-step bootstrapping: S&B chapter 7. (myslides 4, last part; slides 6, Sutton)
  • 5. Bandit algorithms: S&B chapter 2. (myslides 7, Sutton-based)
  • 6. Exploration and exploitation. (Slides: Silver 9, Brunskill)
  • 7. Planning and learning, MCTS: S&B chapter 8. (slides Brunskill)
  • 8. Function approximation: S&B chapters 9, 10, 11. (slides: Silver 6; Sutton 9, 10, 11)
  • 9. Policy gradient methods: S&B chapter 13. (slides: Silver 7, Sutton 13)
  • 10. Deep RL ???

295, Winter 2018 2

slide-3
SLIDE 3

Resources

  • Book: Reinforcement Learning: An Introduction

Richard S. Sutton and Andrew G. Barto

  • UCL Course on Reinforcement Learning

David Silver

  • Real-Life Reinforcement Learning

Emma Brunskill

  • Udacity course on Reinforcement Learning:

Isbell, Littman and Pryby

295, Winter 2018 3

slide-4
SLIDE 4

295, Winter 2018 4

slide-5
SLIDE 5

Lecture 1: Introduction to Reinforcement Learning Course Outline

Course Outline, Silver

Part I: Elementary Reinforcement Learning

1. Introduction to RL
2. Markov Decision Processes
3. Planning by Dynamic Programming
4. Model-Free Prediction
5. Model-Free Control

Part II: Reinforcement Learning in Practice

1. Value Function Approximation
2. Policy Gradient Methods
3. Integrating Learning and Planning
4. Exploration and Exploitation
5. Case study: RL in games

295, Winter 2018 5

slide-6
SLIDE 6

Introduction to Reinforcement Learning, Chapter 1 S&B

295, Winter 2018 6

slide-7
SLIDE 7

Reinforcement Learning

295, Winter 2018 7

Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in an unknown and stochastic environment (Emma Brunskill).

Planning under Uncertainty

Learn a behavior strategy (policy) that maximizes the long-term sum of rewards in a known stochastic environment (Emma Brunskill).

slide-8
SLIDE 8

Reinforcement Learning

295, Winter 2018 8

slide-9
SLIDE 9

Lecture 1: Introduction to Reinforcement Learning The RL Problem Environments

Agent and Environment

[Diagram: agent-environment interaction loop, with observation Ot, reward Rt, and action At]

slide-10
SLIDE 10

Lecture 1: Introduction to Reinforcement Learning About RL

Branches of Machine Learning

Reinforcement Learning Supervised Learning Unsupervised Learning Machine Learning

295, Winter 2018 10

slide-11
SLIDE 11

Lecture 1: Introduction to Reinforcement Learning The RL Problem Reward

Sequential Decision Making

Goal: select actions to maximise total future reward.

Actions may have long-term consequences; reward may be delayed. It may be better to sacrifice immediate reward to gain more long-term reward.

Examples:

A financial investment (may take months to mature)
Refuelling a helicopter (might prevent a crash in several hours)
Blocking opponent moves (might help winning chances many moves from now)

  • My pet project: The academic commitment problem.

Given outside requests (committees, reviews, talks, teach…) what to accept and what to reject today?

11

slide-12
SLIDE 12

295, Winter 2018 12

slide-13
SLIDE 13

Lecture 1: Introduction to Reinforcement Learning Problems within RL

Atari Example: Reinforcement Learning

[Diagram: agent-environment loop for Atari, with observation Ot (screen pixels), reward Rt (score), and action At (joystick)]

Rules of the game are unknown
Learn directly from interactive game-play
Pick actions on the joystick, see pixels and scores

295, Winter 2018 13

slide-14
SLIDE 14

Lecture 1: Introduction to Reinforcement Learning The RL Problem Environments

Agent and Environment

[Diagram: agent-environment interaction loop, with observation Ot, reward Rt, and action At]

At each step t the agent:

Executes action At
Receives observation Ot
Receives scalar reward Rt

The environment:

Receives action At
Emits observation Ot+1
Emits scalar reward Rt+1

t increments at each environment step

295, Winter 2018 14

slide-15
SLIDE 15

Markov Decision Processes

295, Winter 2018 15

In a nutshell:

Policy: π : S → A

slide-16
SLIDE 16

Value and Q Functions

295, Winter 2018 17

Most of the story in a nutshell:

slide-17
SLIDE 17

295, Winter 2018 18

Most of the story in a nutshell:

slide-18
SLIDE 18

295, Winter 2018 19

Most of the story in a nutshell:

slide-19
SLIDE 19

295, Winter 2018 20

Most of the story in a nutshell:

slide-20
SLIDE 20

295, Winter 2018 21

Most of the story in a nutshell:

slide-21
SLIDE 21

295, Winter 2018 22

Most of the story in a nutshell:

slide-22
SLIDE 22

295, Winter 2018 23

Most of the story in a nutshell:

slide-23
SLIDE 23
slide-24
SLIDE 24
slide-25
SLIDE 25

Lecture 1: Introduction to Reinforcement Learning The RL Problem State

History and State

The history is the sequence of observations, actions, and rewards, Ht = O1, R1, A1, ..., At−1, Ot, Rt, i.e. all observable variables up to time t (the sensorimotor stream of a robot or embodied agent).

What happens next depends on the history:

The agent selects actions
The environment selects observations/rewards

State is the information used to determine what happens next. Formally, state is a function of the history: St = f(Ht)

295, Winter 2018 26

slide-26
SLIDE 26

Lecture 1: Introduction to Reinforcement Learning The RL Problem State

Information State

An information state (a.k.a. Markov state) contains all useful information from the history.

Definition: A state St is Markov if and only if P[St+1 | St] = P[St+1 | S1, ..., St]

"The future is independent of the past given the present": H1:t → St → Ht+1:∞

Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future.

The environment state St is Markov
The history Ht is Markov

27

slide-27
SLIDE 27

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Major Components of an RL Agent

An RL agent may include one or more of these components:

Policy: agent’s behaviour function
Value function: how good is each state and/or action
Model: agent’s representation of the environment

295, Winter 2018 28

slide-28
SLIDE 28

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Policy

A policy is the agent’s behaviour. It is a map from state to action, e.g.

Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]

295, Winter 2018 29

slide-29
SLIDE 29

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Value Function

The value function is a prediction of future reward, used to evaluate the goodness/badness of states, and therefore to select between actions, e.g.

vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]

295, Winter 2018 30

slide-30
SLIDE 30

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Model

295, Winter 2018 31

slide-31
SLIDE 31

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Maze Example

Start Goal

Rewards: −1 per time-step
Actions: N, E, S, W
States: agent’s location

295, Winter 2018 32

slide-32
SLIDE 32

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Maze Example: Policy

Start Goal

Arrows represent policy π(s) for each state s

33

slide-33
SLIDE 33

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Maze Example: Value Function

[Figure: maze from Start to Goal, with each cell along the way labelled by its value vπ(s)]

Start Goal

Numbers represent value vπ (s) of each state s

34

slide-34
SLIDE 34

Lecture 1: Introduction to Reinforcement Learning Inside An RL Agent

Maze Example: Model

[Figure: maze from Start to Goal, with each cell labelled by its immediate reward (−1)]

Start Goal

Agent may have an internal model of the environment:

Dynamics: how actions change the state
Rewards: how much reward comes from each state
The model may be imperfect

The grid layout represents the transition model P^a_ss'. The numbers represent the immediate reward R^a_s from each state s (same for all a).

295, Winter 2018 35

slide-35
SLIDE 35

Lecture 1: Introduction to Reinforcement Learning Problems within RL

Learning and Planning

Two fundamental problems in sequential decision making.

Reinforcement Learning:

The environment is initially unknown
The agent interacts with the environment
The agent improves its policy

Planning:

A model of the environment is known
The agent performs computations with its model (without any external interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering, thought, search

295, Winter 2018 36

slide-36
SLIDE 36

Lecture 1: Introduction to Reinforcement Learning Problems within RL

Prediction and Control

Prediction: evaluate the future

Given a policy

Control: optimise the future

Find the best policy

295, Winter 2018 37

slide-37
SLIDE 37

Markov Decision Processes Chapter 3 S&B

295, Winter 2018 38

slide-38
SLIDE 38

295, Winter 2018 39

slide-39
SLIDE 39

MDPs

  • The world is an MDP (combining the agent and the environment): it gives rise to a trajectory S0, A0, R1, S1, A1, R2, S2, A2, R3, S3, …

  • The process is governed by a transition function
  • Markov Process (MP)
  • Markov Reward Process (MRP)
  • Markov Decision Process (MDP)

295, Winter 2018 40

slide-40
SLIDE 40

Lecture 2: Markov DecisionProcesses Markov Processes Markov Property

Markov Property

“The future is independent of the past given the present”

Definition: A state St is Markov if and only if P[St+1 | St] = P[St+1 | S1, ..., St]

The state captures all relevant information from the history. Once the state is known, the history may be thrown away, i.e. the state is a sufficient statistic of the future.

295, Winter 2018 42

slide-41
SLIDE 41

Lecture 2: Markov DecisionProcesses Markov Processes Markov Property

State Transition Matrix

The state transition matrix P collects the probabilities Pss' = P[St+1 = s' | St = s] for all pairs of states, where each row of the matrix sums to 1.

295, Winter 2018 43
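The matrix itself did not survive extraction; in the notation used later on these slides, the definition being referenced is:

```latex
P_{ss'} = \mathbb{P}\left[ S_{t+1} = s' \mid S_t = s \right],
\qquad
P =
\begin{pmatrix}
P_{11} & \cdots & P_{1n} \\
\vdots & \ddots & \vdots \\
P_{n1} & \cdots & P_{nn}
\end{pmatrix},
\qquad
\sum_{s'} P_{ss'} = 1 \ \text{for every row } s .
```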

slide-42
SLIDE 42

Lecture 2: Markov DecisionProcesses Markov Processes Markov Chains

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, ... with the Markov property.

Definition: A Markov Process (or Markov Chain) is a tuple (S, P)

S is a (finite) set of states
P is a state transition probability matrix, Pss' = P[St+1 = s' | St = s]

295, Winter 2018 44
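As a concrete illustration of this definition (not part of the original slides), here is a minimal sketch that samples episodes from a finite Markov chain given (S, P). The transition probabilities are reconstructed from the Student Markov Chain graph on the following slides and should be treated as illustrative.

```python
import numpy as np

# Student Markov Chain, reconstructed from the transition graph on the next slides.
# States (order fixes rows/columns of P):
states = ["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"]

# Row s of P holds P[S_{t+1} = s' | S_t = s]; each row sums to 1.
P = np.array([
    # C1   C2   C3   Pass Pub  FB   Sleep
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
])

def sample_episode(start="C1", rng=np.random.default_rng(0)):
    """Sample one episode S1, S2, ..., ST until the absorbing Sleep state."""
    episode, s = [start], states.index(start)
    while states[s] != "Sleep":
        s = rng.choice(len(states), p=P[s])
        episode.append(states[s])
    return episode

print(sample_episode())   # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']
```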

slide-43
SLIDE 43

Lecture 2: Markov DecisionProcesses Markov Processes Markov Chains

Example: Student Markov Chain, a transition graph

[Figure: transition graph of the Student Markov Chain over the states Class 1, Class 2, Class 3, Pass, Pub, Facebook, Sleep, with transition probabilities 0.5, 0.2, 0.4, 0.8, 0.9, 0.1, 0.6, 1.0 on the edges]

295, Winter 2018 45

slide-44
SLIDE 44

Lecture 2: Markov DecisionProcesses Markov Processes Markov Chains

Example: Student Markov Chain Episodes

[Figure: Student Markov Chain transition graph, as on the previous slide]

Sample episodes for the Student Markov Chain starting from S1 = C1 (each episode is a sequence S1, S2, ..., ST):

  • C1 C2 C3 Pass Sleep
  • C1 FB FB C1 C2 Sleep
  • C1 C2 C3 Pub C2 C3 Pass Sleep
  • C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep

295, Winter 2018 46

slide-45
SLIDE 45

Lecture 2: Markov DecisionProcesses Markov Processes Markov Chains

Example: Student Markov Chain Transition Matrix

[Figure: the Student Markov Chain transition graph together with its state transition matrix P, rows and columns indexed by C1, C2, C3, Pass, Pub, FB, Sleep]

295, Winter 2018 47

slide-46
SLIDE 46

Markov Decision Processes

  • States: S
  • Model: T(s,a,s’) = P(s’|s,a)
  • Actions: A(s), A
  • Reward: R(s), R(s,a), R(s,a,s’)
  • Discount: γ
  • Policy: π : S → A
  • Utility/Value: sum of discounted rewards.
  • We seek an optimal policy that maximizes the expected total (discounted) reward (see the sketch after this slide).

295, Winter 2018 48
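To make the list above concrete, here is a minimal sketch (not from the original slides) of how such a finite MDP specification can be held in arrays. The array names and shapes are illustrative assumptions, reused in later sketches.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    """Tabular MDP: states 0..nS-1, actions 0..nA-1 (illustrative layout)."""
    P: np.ndarray        # transition model, shape (nA, nS, nS): P[a, s, s'] = P(s' | s, a)
    R: np.ndarray        # expected reward,  shape (nA, nS):     R[a, s]    = E[R | s, a]
    gamma: float         # discount factor in [0, 1]

    def one_step_lookahead(self, s: int, a: int, v: np.ndarray) -> float:
        """One-step look-ahead: R(s,a) + gamma * sum_s' P(s'|s,a) v(s')."""
        return self.R[a, s] + self.gamma * self.P[a, s] @ v
```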

slide-47
SLIDE 47

Lecture 2: Markov DecisionProcesses Markov Reward Processes MRP

Example: Student MRP

[Figure: Student MRP, the Student Markov Chain with a reward attached to each state: R = −2 for the Class states, R = −1 for Facebook, R = +1 for Pub, R = +10 for Pass, R = 0 for Sleep]

49

slide-48
SLIDE 48

Goals, Returns and Rewards

  • The agent’s goal is to maximize the total amount of reward it gets (not immediate rewards), over the long run.
  • Reward is typically −1 per time step in mazes.
  • Deciding how to associate rewards with states is part of the problem modelling.

If T is the final step, then the return is Gt = Rt+1 + Rt+2 + ... + RT

295, Winter 2018 50

slide-49
SLIDE 49

Lecture 2: Markov DecisionProcesses Markov Reward Processes Return

Return

Definition: The return Gt is the total discounted reward from time-step t.

The discount γ ∈ [0, 1] is the present value of future rewards. The value of receiving reward R after k + 1 time-steps is γ^k R. This values immediate reward above delayed reward.

γ close to 0 leads to ”myopic” evaluation; γ close to 1 leads to ”far-sighted” evaluation.

295, Winter 2018 51
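The formula omitted from the extracted definition is the standard one from S&B:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
```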

slide-50
SLIDE 50

Lecture 2: Markov DecisionProcesses Markov Reward Processes Return

Why discount?

Most Markov reward and decision processes are discounted. Why?

Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
Uncertainty about the future may not be fully represented
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Animal/human behaviour shows preference for immediate reward

It is sometimes possible to use undiscounted Markov reward processes (i.e. γ = 1), e.g. if all sequences terminate.

295, Winter 2018 52

slide-51
SLIDE 51

Lecture 2: Markov DecisionProcesses Markov Reward Processes Value Function

Value Function

The value function v(s) gives the long-term value of state s.

Definition: The state value function v(s) of an MRP is the expected return starting from state s, v(s) = E[Gt | St = s]

295, Winter 2018 53

slide-52
SLIDE 52

Lecture 2: Markov DecisionProcesses Markov Reward Processes Value Function

Example: Student MRP Returns

Sample returns for the Student MRP, starting from S1 = C1 with γ = 1/2:

G1 = R2 + γR3 + ... + γ^(T−2) RT

  • C1 C2 C3 Pass Sleep
  • C1 FB FB C1 C2 Sleep
  • C1 C2 C3 Pub C2 C3 Pass Sleep
  • C1 FB FB C1 C2 C3 Pub C1 ... FB FB FB C1 C2 C3 Pub C2 Sleep

295, Winter 2018 54

slide-53
SLIDE 53

Lecture 2: Markov DecisionProcesses Markov Reward Processes Bellman Equation

Bellman Equation for MRPs

The value function can be decomposed into two parts:

the immediate reward Rt+1
the discounted value of the successor state, γv(St+1)

v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
     = E[Rt+1 + γGt+1 | St = s]
     = E[Rt+1 + γv(St+1) | St = s]

295, Winter 2018 55

slide-54
SLIDE 54

Lecture 2: Markov DecisionProcesses Markov Reward Processes Bellman Equation

Bellman Equation for MRPs (2)

295, Winter 2018 56

slide-55
SLIDE 55

Lecture 2: Markov DecisionProcesses Markov Reward Processes Bellman Equation

Example: Bellman Equation for Student MRP

[Figure: Student MRP annotated with state values for γ = 1: 4.3 (Class 3), 1.5 (Class 2), −13 (Class 1), −23 (Facebook), 10 (Pass), 0.8 (Pub)]

Check for Class 3: 4.3 = −2 + 0.6·10 + 0.4·0.8

57

slide-56
SLIDE 56

Lecture 2: Markov DecisionProcesses Markov Reward Processes Bellman Equation

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices, v = R + γPv where v is a column vector with one entry per state

295, Winter 2018 58

slide-57
SLIDE 57

Lecture 2: Markov DecisionProcesses Markov Reward Processes Bellman Equation

Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly:

v = R + γPv
(I − γP) v = R
v = (I − γP)⁻¹ R

Computational complexity is O(n³) for n states, so the direct solution is only possible for small MRPs. There are many iterative methods for large MRPs, e.g.

Dynamic programming
Monte-Carlo evaluation
Temporal-Difference learning

295, Winter 2018 59
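A minimal numerical sketch of the direct solution above (not from the slides). It reuses the Student MRP transition matrix and per-state rewards, both of which are reconstructions from the earlier figures; γ = 0.9 is chosen so that I − γP stays invertible despite the absorbing Sleep state.

```python
import numpy as np

# States: C1, C2, C3, Pass, Pub, FB, Sleep (order fixes rows of P and R)
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep (absorbing)
])
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])   # per-state rewards from the MRP figure
gamma = 0.9                                              # gamma < 1 keeps I - gamma*P invertible

# Direct solution of v = R + gamma * P v  =>  v = (I - gamma*P)^{-1} R, cost O(n^3)
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(dict(zip(["C1", "C2", "C3", "Pass", "Pub", "FB", "Sleep"], v.round(2))))
```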

slide-58
SLIDE 58

Lecture 2: Markov DecisionProcesses Markov Decision Processes MDP

Markov Decision Process

295, Winter 2018 60

slide-59
SLIDE 59

Lecture 2: Markov DecisionProcesses Markov Decision Processes MDP

Example: Student MDP

[Figure: Student MDP. From each state the agent chooses an action: Study (R = −2 from the first two classes, R = +10 from the last), Facebook (R = −1), Quit (R = 0), Sleep (R = 0), and Pub (R = +1, with stochastic transitions 0.2/0.4/0.4)]

295, Winter 2018 61

slide-60
SLIDE 60

Lecture 2: Markov DecisionProcesses Markov Decision Processes Policies

Policies and Value functions (1)

Definition: A policy π is a distribution over actions given states, π(a|s) = P[At = a | St = s]

A policy fully defines the behaviour of an agent. MDP policies depend on the current state (not the history), i.e. policies are stationary (time-independent): At ∼ π(·|St), ∀ t > 0

295, Winter 2018 62

slide-61
SLIDE 61

Policies and Value Functions

295, Winter 2018 63

slide-62
SLIDE 62

Lecture 1: Introduction to Reinforcement Learning Problems withinRL

Gridworld Example: Prediction

[Figure 3.3 (S&B): (a) a 5×5 gridworld with special states A and B and their target states A’ and B’; (b) the state-value function for the uniform random policy:]

 3.3  8.8  4.4  5.3  1.5
 1.5  3.0  2.3  1.9  0.5
 0.1  0.7  0.7  0.4 −0.4
−1.0 −0.4 −0.4 −0.6 −1.2
−1.9 −1.3 −1.2 −1.4 −2.0

What is the value function for the uniform random policy? Gamma = 0.9, solved using Eq. 3.14. Exercise: show that Eq. 3.14 holds for each state in Figure (b).

Actions: up, down, left, right. Rewards are 0, except that actions leading off the grid give reward −1, every action from A gives +10 and moves the agent to A’, and every action from B gives +5 and moves it to B’. Policy: actions are uniformly random.

64

slide-63
SLIDE 63

Lecture 2: Markov DecisionProcesses Markov Decision Processes Value Functions

Value Function, Q Functions

Definition: The state-value function vπ(s) of an MDP is the expected return starting from state s and then following policy π, vπ(s) = Eπ[Gt | St = s]

Definition: The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π, qπ(s, a) = Eπ[Gt | St = s, At = a]

295, Winter 2018 65

slide-64
SLIDE 64

Lecture 2: Markov DecisionProcesses Markov Decision Processes Bellman Expectation Equation

Bellman Expectation Equation

The state-value function can again be decomposed into immediate reward plus discounted value of the successor state,

vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]

The action-value function can similarly be decomposed,

qπ(s, a) = Eπ[Rt+1 + γqπ(St+1, At+1) | St = s, At = a]

295, Winter 2018 66

Expressing the functions recursively translates into a one-step look-ahead, as written out below.
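The one-step look-ahead forms referred to here (the content of the equation slides that follow, written in standard notation) are:

```latex
v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \Big( R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \, v_\pi(s') \Big),
\qquad
q_\pi(s,a) = R_s^a + \gamma \sum_{s' \in \mathcal{S}} P_{ss'}^a \sum_{a' \in \mathcal{A}} \pi(a' \mid s') \, q_\pi(s',a') .
```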

slide-65
SLIDE 65

Lecture 2: Markov DecisionProcesses Markov Decision Processes Bellman Expectation Equation

Bellman Expectation Equation for V π

295, Winter 2018 67

slide-66
SLIDE 66

Lecture 2: Markov DecisionProcesses Markov Decision Processes Bellman Expectation Equation

Bellman Expectation Equation for Qπ

295, Winter 2018 68

slide-67
SLIDE 67

Lecture 2: Markov DecisionProcesses Markov Decision Processes Bellman Expectation Equation

Bellman Expectation Equation for vπ (2)

295, Winter 2018 69

slide-68
SLIDE 68

Lecture 2: Markov DecisionProcesses Markov Decision Processes Bellman Expectation Equation

Bellman Expectation Equation for qπ (2)

295, Winter 2018 70

slide-69
SLIDE 69

Lecture 2: Markov DecisionProcesses Markov Decision Processes Optimal Value Functions

Optimal Policies and Optimal Value Function

Definition: The optimal state-value function v∗(s) is the maximum value function over all policies,

v∗(s) = max_π vπ(s)

The optimal action-value function q∗(s, a) is the maximum action-value function over all policies,

q∗(s, a) = max_π qπ(s, a)

The optimal value function specifies the best possible performance in the MDP. An MDP is “solved” when we know the optimal value function.

slide-70
SLIDE 70

Lecture 2: Markov DecisionProcesses Markov Decision Processes Optimal Value Functions

Optimal Value Function for Student MDP

[Figure: Student MDP annotated with optimal state values v∗(s) for γ = 1: 6 (Facebook state), 6 (Class 1), 8 (Class 2), 10 (Class 3), with the same action rewards as before]

295, Winter 2018 72

slide-71
SLIDE 71

Lecture 2: Markov DecisionProcesses Markov Decision Processes Optimal Value Functions

Optimal Action-Value Function for Student MDP

[Figure: Student MDP annotated with optimal action values q∗(s, a) for γ = 1: Facebook q∗ = 5, Quit q∗ = 6, Study q∗ = 6 (Class 1), Study q∗ = 8 (Class 2), Study q∗ = 10 (Class 3), Sleep q∗ = 0, Pub q∗ = 8.4]

295, Winter 2018 73

slide-72
SLIDE 72

Lecture 2: Markov DecisionProcesses Markov Decision Processes Optimal Value Functions

Optimal Policy

Define a partial ordering over policies: π ≥ π' if vπ(s) ≥ vπ'(s), ∀s

Theorem: For any Markov Decision Process

There exists an optimal policy π∗ that is better than or equal to all other policies, π∗ ≥ π, ∀π
All optimal policies achieve the optimal value function, vπ∗(s) = v∗(s)
All optimal policies achieve the optimal action-value function, qπ∗(s, a) = q∗(s, a)

295, Winter 2018 74

slide-73
SLIDE 73

Lecture 2: Markov DecisionProcesses Markov Decision Processes Optimal Value Functions

Finding an Optimal Policy

An optimal policy can be found by maximising over q∗(s, a):

There is always a deterministic optimal policy for any MDP
If we know q∗(s, a), we immediately have the optimal policy

295, Winter 2018 75
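A minimal sketch of this extraction step (illustrative, assuming a tabular q∗ stored as an array indexed by [state, action]):

```python
import numpy as np

def greedy_policy_from_q(q_star: np.ndarray) -> np.ndarray:
    """Deterministic optimal policy: pi*(s) = argmax_a q*(s, a).

    q_star has shape (num_states, num_actions); the result maps each state
    to the index of a maximising action (ties broken by the first maximum).
    """
    return np.argmax(q_star, axis=1)

# Example with 3 states and 2 actions (numbers are made up for illustration):
q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [3.0, 3.0]])
print(greedy_policy_from_q(q_star))   # -> [1 0 0]
```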

slide-74
SLIDE 74

Bellman Equation for V* and Q*

295, Winter 2018 77

V*(s) q*(s; a)
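The two equations this slide refers to are the Bellman optimality equations; in standard notation:

```latex
v_*(s) = \max_{a} \Big( R_s^a + \gamma \sum_{s'} P_{ss'}^a \, v_*(s') \Big),
\qquad
q_*(s,a) = R_s^a + \gamma \sum_{s'} P_{ss'}^a \, \max_{a'} q_*(s', a') .
```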

slide-75
SLIDE 75

Lecture 2: Markov DecisionProcesses Markov Decision Processes Bellman Optimality Equation

Example: Bellman Optimality Equation in Student MDP

[Figure: Student MDP annotated with the optimal values 6, 8, 10 (and 6 for the Facebook state), with the same action rewards as before]

Check at Class 1: 6 = max {−2 + 8, −1 + 6}

295, Winter 2018 78

slide-76
SLIDE 76

Lecture 1: Introduction to Reinforcement Learning Problems withinRL

Gridworld Example: Control

[Figure 3.6 (S&B): (a) the 5×5 gridworld; (b) the optimal value function v∗; (c) the optimal policy π∗]

v∗ for the gridworld:

22.0 24.4 22.0 19.4 17.5
19.8 22.0 19.8 17.8 16.0
17.8 19.8 17.8 16.0 14.4
16.0 17.8 16.0 14.4 13.0
14.4 16.0 14.4 13.0 11.7

What is the optimal value function over all possible policies? What is the optimal policy?

295, Winter 2018 79

slide-77
SLIDE 77

Lecture 2: Markov DecisionProcesses Markov Decision Processes Bellman Optimality Equation

Solving the Bellman Optimality Equation

The Bellman Optimality Equation is non-linear, with no closed-form solution in general. Many iterative solution methods exist:

Value Iteration
Policy Iteration
Q-learning
Sarsa

295, Winter 2018 80

slide-78
SLIDE 78

Planning by Dynamic Programming

Sutton & Barto, Chapter 4

295, Winter 2018 81

slide-79
SLIDE 79

Lecture 3: Planning by Dynamic Programming Introduction

Planning by Dynamic Programming

Dynamic programming assumes full knowledge of the MDP. It is used for planning in an MDP.

For prediction:

Input: MDP (S, A, P, R, γ) and policy π, or: MRP (S, Pπ, Rπ, γ)
Output: value function vπ

Or for control:

Input: MDP (S, A, P, R, γ)
Output: optimal value function v∗ and optimal policy π∗

295, Winter 2018 83

slide-80
SLIDE 80

Lecture 3: Planning by Dynamic Programming Policy Evaluation Iterative Policy Evaluation

Policy Evaluation (Prediction)

Problem: evaluate a given policy π.
Solution: iterative application of the Bellman expectation backup, v1 → v2 → ... → vπ, using synchronous backups:

At each iteration k + 1, for all states s ∈ S, update vk+1(s) from vk(s'), where s' is a successor state of s

We will discuss asynchronous backups later. Convergence to vπ will be proven at the end of the lecture. A minimal sketch of this procedure follows this slide.

295, Winter 2018 84
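A minimal sketch of synchronous iterative policy evaluation (not from the slides; the tabular array layout P[a, s, s'], R[a, s] is an assumption, as in the earlier MDP sketch):

```python
import numpy as np

def policy_evaluation(P, R, gamma, pi, tol=1e-8, max_iters=10_000):
    """Synchronous iterative policy evaluation for a tabular MDP.

    P  : transitions, shape (nA, nS, nS), P[a, s, s'] = P(s' | s, a)
    R  : expected rewards, shape (nA, nS)
    pi : stochastic policy, shape (nS, nA), pi[s, a] = pi(a | s)
    Returns v_pi as an array of shape (nS,).
    """
    nA, nS, _ = P.shape
    v = np.zeros(nS)
    for _ in range(max_iters):
        # Bellman expectation backup:
        # v_{k+1}(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        q = R + gamma * P @ v              # shape (nA, nS): q[a, s]
        v_new = np.einsum("sa,as->s", pi, q)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v
```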

slide-81
SLIDE 81

Iterative Policy Evaluation

295, Winter 2018 85

This is a system of simultaneous linear equations in |S| unknowns and can be solved directly. In practice, an iterative procedure run until a fixed point is reached can be more effective: iterative policy evaluation.

slide-82
SLIDE 82

Iterative policy Evaluation

295, Winter 2018 87

slide-83
SLIDE 83

Lecture 3: Planning by Dynamic Programming Policy Evaluation Example: Small Gridworld

Evaluating a Random Policy in the Small Gridworld

Undiscounted episodic MDP (γ = 1)
Nonterminal states 1, ..., 14; one terminal state (shown twice, as the shaded squares)
Actions leading out of the grid leave the state unchanged
Reward is −1 until the terminal state is reached
Agent follows the uniform random policy π(n|·) = π(e|·) = π(s|·) = π(w|·) = 0.25

295, Winter 2018 88

slide-84
SLIDE 84

Lecture 3: Planning by Dynamic Programming Policy Evaluation Example: Small Gridworld

Iterative Policy Evaluation in Small Gridworld

vk for the random policy (left), and the greedy policy w.r.t. vk (right):

k = 0:
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0

k = 1:
 0.0 −1.0 −1.0 −1.0
−1.0 −1.0 −1.0 −1.0
−1.0 −1.0 −1.0 −1.0
−1.0 −1.0 −1.0  0.0

k = 2:
 0.0 −1.7 −2.0 −2.0
−1.7 −2.0 −2.0 −2.0
−2.0 −2.0 −2.0 −1.7
−2.0 −2.0 −1.7  0.0

295, Winter 2018 89

slide-85
SLIDE 85

Lecture 3: Planning by Dynamic Programming Policy Evaluation Example: Small Gridworld

Iterative Policy Evaluation in Small Gridworld (2)

vk for the random policy (left), and the greedy policy w.r.t. vk (right); from k = 3 onward the greedy policy is already the optimal policy:

k = 3:
 0.0 −2.4 −2.9 −3.0
−2.4 −2.9 −3.0 −2.9
−2.9 −3.0 −2.9 −2.4
−3.0 −2.9 −2.4  0.0

k = 10:
 0.0 −6.1 −8.4 −9.0
−6.1 −7.7 −8.4 −8.4
−8.4 −8.4 −7.7 −6.1
−9.0 −8.4 −6.1  0.0

k = ∞:
 0.0 −14. −20. −22.
−14. −18. −20. −20.
−20. −20. −18. −14.
−22. −20. −14.  0.0

295, Winter 2018 90

slide-86
SLIDE 86

Lecture 3: Planning by Dynamic Programming Policy Iteration

Policy Improvement

Given a policy π:

Evaluate the policy π: vπ(s) = E[Rt+1 + γRt+2 + ... | St = s]
Improve the policy by acting greedily with respect to vπ: π' = greedy(vπ)

In the Small Gridworld the improved policy was already optimal, π' = π∗. In general, more iterations of improvement/evaluation are needed, but this process of policy iteration always converges to π∗. A sketch of the full loop follows this slide.

295, Winter 2018 91
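A minimal sketch of the policy iteration loop described above (illustrative, using the same assumed array layout as the earlier sketches, and assuming γ < 1 so the evaluation step's linear system is well-posed):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate policy evaluation and greedy policy improvement until stable.

    P: (nA, nS, nS) transitions, R: (nA, nS) expected rewards, gamma < 1.
    Returns (pi, v) with a deterministic policy pi of shape (nS,).
    """
    nA, nS, _ = P.shape
    pi = np.zeros(nS, dtype=int)                      # start with an arbitrary policy
    while True:
        # Policy evaluation: solve v = R_pi + gamma * P_pi v directly (fine for small MDPs)
        P_pi = P[pi, np.arange(nS)]                   # (nS, nS)
        R_pi = R[pi, np.arange(nS)]                   # (nS,)
        v = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) v(s')
        q = R + gamma * P @ v                         # (nA, nS)
        pi_new = np.argmax(q, axis=0)                 # (nS,)
        if np.array_equal(pi_new, pi):
            return pi, v                              # policy is stable => optimal
        pi = pi_new
```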

slide-87
SLIDE 87

Policy Iteration

295, Winter 2018 92

slide-88
SLIDE 88

Lecture 3: Planning by Dynamic Programming Policy Iteration

Policy Iteration

Policy evaluation: estimate vπ (iterative policy evaluation)
Policy improvement: generate π' ≥ π (greedy policy improvement)

295, Winter 2018 93

slide-89
SLIDE 89

Lecture 3: Planning by Dynamic Programming Policy Iteration Policy Improvement

Policy Improvement

295, Winter 2018 94

slide-90
SLIDE 90

Lecture 3: Planning by Dynamic Programming Policy Iteration Policy Improvement

Policy Improvement (2)

If improvements stop,

qπ(s, π'(s)) = max_{a∈A} qπ(s, a) = qπ(s, π(s)) = vπ(s)

then the Bellman optimality equation has been satisfied,

vπ(s) = max_{a∈A} qπ(s, a)

Therefore vπ(s) = v∗(s) for all s ∈ S, so π is an optimal policy.

295, Winter 2018 95

slide-91
SLIDE 91

Lecture 3: Planning by Dynamic Programming Policy Iteration Extensions to Policy Iteration

Modified Policy Iteration

Does policy evaluation need to converge to vπ? Or should we introduce a stopping condition, e.g. ε-convergence of the value function?

Or simply stop after k iterations of iterative policy evaluation? For example, in the small gridworld k = 3 was sufficient to achieve the optimal policy.

Why not update the policy every iteration, i.e. stop after k = 1?

This is equivalent to value iteration (next section)

295, Winter 2018 96

slide-92
SLIDE 92

Lecture 3: Planning by Dynamic Programming Policy Iteration Extensions to Policy Iteration

Generalised Policy Iteration

Policy evaluation: estimate vπ (any policy evaluation algorithm)
Policy improvement: generate π' ≥ π (any policy improvement algorithm)

295, Winter 2018 97

slide-93
SLIDE 93

Lecture 3: Planning by Dynamic Programming Value Iteration Value Iteration in MDPs

Principle of Optimality

Any optimal policy can be subdivided into two components:

An optimal first action A∗
Followed by an optimal policy from the successor state S'

Theorem (Principle of Optimality): A policy π(a|s) achieves the optimal value from state s, vπ(s) = v∗(s), if and only if, for any state s' reachable from s, π achieves the optimal value from state s', vπ(s') = v∗(s')

295, Winter 2018 98

slide-94
SLIDE 94

Lecture 3: Planning by Dynamic Programming Value Iteration Value Iteration in MDPs

Deterministic Value Iteration

295, Winter 2018 99

slide-95
SLIDE 95

Value Iteration

295, Winter 2018 100

slide-96
SLIDE 96

Value Iteration

295, Winter 2018 101

slide-97
SLIDE 97

Lecture 3: Planning by Dynamic Programming Value Iteration Value Iteration in MDPs

Example: Shortest Path

[Figure: shortest-path gridworld with the goal g in one corner. The panels show the problem and the value-iteration estimates V1 through V7; at each iteration the correct distance-to-goal value spreads one step further from the goal until V7 is correct everywhere]

295, Winter 2018 102

slide-98
SLIDE 98

Lecture 3: Planning by Dynamic Programming Value Iteration Value Iteration in MDPs

Value Iteration

Problem: find the optimal policy π∗.
Solution: iterative application of the Bellman optimality backup, v1 → v2 → ... → v∗

Using synchronous backups:

At each iteration k + 1, for all states s ∈ S, update vk+1(s) from vk(s')

Convergence to v∗ will be proven later.

Unlike policy iteration, there is no explicit policy; intermediate value functions may not correspond to any policy. A minimal sketch of the update follows this slide.

295, Winter 2018 103
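A minimal sketch of synchronous value iteration (not from the slides; same assumed tabular layout as in the earlier sketches):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Synchronous value iteration: repeated Bellman optimality backups.

    P: (nA, nS, nS) transitions, R: (nA, nS) expected rewards.
    Returns (v_star, pi_star) with a greedy deterministic policy.
    """
    nA, nS, _ = P.shape
    v = np.zeros(nS)
    for _ in range(max_iters):
        # v_{k+1}(s) = max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v_k(s') ]
        q = R + gamma * P @ v          # (nA, nS)
        v_new = q.max(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    return v, np.argmax(R + gamma * P @ v, axis=0)
```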

slide-99
SLIDE 99

Lecture 3: Planning by Dynamic Programming Value Iteration Value Iteration in MDPs

Value Iteration (2)

295, Winter 2018 104

slide-100
SLIDE 100

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

Asynchronous Dynamic Programming

DP methods described so far used synchronous backups i.e. all states are backed up in parallel Asynchronous DP backs up states individually, in any order For each selected state, apply the appropriate backup Can significantly reduce computation Guaranteed to converge if all states continue to be selected

295, Winter 2018 106

slide-101
SLIDE 101

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

Asynchronous Dynamic Programming

Three simple ideas for asynchronous dynamic programming: In-place dynamicprogramming Prioritised sweeping Real-time dynamicprogramming

295, Winter 2018 107

slide-102
SLIDE 102

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

In-Place Dynamic Programming

295, Winter 2018 108

slide-103
SLIDE 103

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

Prioritised Sweeping

295, Winter 2018 109

slide-104
SLIDE 104

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Asynchronous Dynamic Programming

Real-Time Dynamic Programming

Idea: back up only states that are relevant to the agent. Use the agent’s experience to guide the selection of states: after each time-step St, At, Rt+1, back up the state St.

295, Winter 2018 110

slide-105
SLIDE 105

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Full-width and sample backups

Full-Width Backups

DP uses full-width backups. For each backup (sync or async):

Every successor state and action is considered
Using knowledge of the MDP transitions and reward function

DP is effective for medium-sized problems (millions of states). For large problems DP suffers Bellman’s curse of dimensionality:

Number of states n = |S| grows exponentially with number of state variables

Even one backup can be too expensive

111

slide-106
SLIDE 106

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Full-width and sample backups

Sample Backups

In subsequent lectures we will consider sample backups: using sample rewards and sample transitions (S, A, R, S') instead of the reward function R and transition dynamics P.

Advantages:

Model-free: no advance knowledge of the MDP required
Breaks the curse of dimensionality through sampling
Cost of a backup is constant, independent of n = |S|

295, Winter 2018 112

slide-107
SLIDE 107

Lecture 3: Planning by Dynamic Programming Extensions to Dynamic Programming Approximate Dynamic Programming

Approximate Dynamic Programming

295, Winter 2018 113

slide-108
SLIDE 108

Csaba slides,

295, Winter 2018 114

slide-109
SLIDE 109

295, Winter 2018 115

slide-110
SLIDE 110

295, Winter 2018 116

slide-111
SLIDE 111

295, Winter 2018 117

slide-112
SLIDE 112

295, Winter 2018 118

slide-113
SLIDE 113

295, Winter 2018 119

slide-114
SLIDE 114

295, Winter 2018 120

slide-115
SLIDE 115

295, Winter 2018 121

slide-116
SLIDE 116

295, Winter 2018 122

slide-117
SLIDE 117

295, Winter 2018 123

slide-118
SLIDE 118

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Value Function ∞-Norm

We will measure the distance between state-value functions u and v by the ∞-norm, i.e. the largest difference between state values,

||u − v||∞ = max_{s∈S} |u(s) − v(s)|

295, Winter 2018 126

slide-119
SLIDE 119

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Contraction Mapping Theorem

Theorem (Contraction Mapping Theorem): For any metric space V that is complete (i.e. closed) under an operator T(v), where T is a γ-contraction,

T converges to a unique fixed point
At a linear convergence rate of γ

295, Winter 2018 128

slide-120
SLIDE 120

295, Winter 2018 129

slide-121
SLIDE 121

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Convergence of Iterative Policy Evaluation and Policy Iteration

The Bellman expectation operator Tπ has a unique fixed point
vπ is a fixed point of Tπ (by the Bellman expectation equation)
By the contraction mapping theorem:
Iterative policy evaluation converges on vπ
Policy iteration converges on v∗

295, Winter 2018 130

slide-122
SLIDE 122

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Bellman Optimality Backup is a Contraction

Define the Bellman optimality backup operator T∗,

T∗(v) = max_{a∈A} (R^a + γP^a v)

This operator is a γ-contraction, i.e. it makes value functions closer by at least γ (similar to the previous proof),

||T∗(u) − T∗(v)||∞ ≤ γ||u − v||∞

295, Winter 2018 131
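The slide only states the bound; a short sketch of the standard argument (an addition, not on the original slide): since |max_a f(a) − max_a g(a)| ≤ max_a |f(a) − g(a)| and each P^a is a stochastic matrix,

```latex
\left| (T^*u)(s) - (T^*v)(s) \right|
\;\le\; \max_{a} \gamma \left| \big(P^a (u - v)\big)(s) \right|
\;\le\; \gamma \, \lVert u - v \rVert_\infty ,
```

and taking the maximum over s gives ||T∗(u) − T∗(v)||∞ ≤ γ||u − v||∞.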

slide-123
SLIDE 123

Lecture 3: Planning by Dynamic Programming Contraction Mapping

Convergence of Value Iteration

The Bellman optimality operator T∗ has a unique fixed point
v∗ is a fixed point of T∗ (by the Bellman optimality equation)
By the contraction mapping theorem, value iteration converges on v∗

295, Winter 2018 132

slide-124
SLIDE 124

295, Winter 2018 133

slide-125
SLIDE 125

295, Winter 2018 134

slide-126
SLIDE 126

295, Winter 2018 135

slide-127
SLIDE 127

295, Winter 2018 136