

CS 188: Artificial Intelligence, Spring 2010
Lecture 10: MDPs
2/18/2010
Pieter Abbeel – UC Berkeley
Many slides over the course adapted from either Dan Klein, Stuart Russell or Andrew Moore

Announcements
- P2: due tonight
- W3 (Expectimax, utilities and MDPs): out tonight, due next Thursday
- Online book: Sutton and Barto, http://www.cs.ualberta.ca/~sutton/book/ebook/the-book.html

Recap: MDPs
- Markov decision processes:
  - States S
  - Actions A
  - Transitions P(s'|s,a) (or T(s,a,s'))
  - Rewards R(s,a,s') (and discount γ)
  - Start state s_0
- Quantities:
  - Policy = map of states to actions
  - Utility = sum of discounted rewards
  - Values = expected future utility from a state
  - Q-Values = expected future utility from a q-state (s,a)

Recap MDP Example: Grid World
- The agent lives in a grid; walls block the agent's path
- The agent's actions do not always go as planned:
  - 80% of the time, the action North takes the agent North (if there is no wall there)
  - 10% of the time, North takes the agent West; 10% East
  - If there is a wall in the direction the agent would have been taken, the agent stays put
- Small "living" reward each step
- Big rewards come at the end
- Goal: maximize the sum of rewards

Why Not Search Trees?
- Why not solve with expectimax?
- Problems:
  - This tree is usually infinite (why?)
  - Same states appear over and over (why?)
  - We would search once per state (why?)
- Idea: Value iteration
  - Compute optimal values for all states all at once using successive approximations
  - Will be a bottom-up dynamic program similar in cost to memoization
  - Do all planning offline, no replanning needed!

Value Iteration
- Idea:
  - V_i*(s): the expected discounted sum of rewards accumulated when starting from state s and acting optimally for a horizon of i time steps
  - Start with V_0*(s) = 0, which we know is right (why?)
  - Given V_i*, calculate the values for all states for horizon i+1:
      V_{i+1}*(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i*(s') ]
  - This is called a value update or Bellman update
  - Repeat until convergence
- Theorem: will converge to unique optimal values
  - Basic idea: approximations get refined towards optimal values
  - Policy may converge long before values do
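To make the value update concrete, here is a minimal Python sketch. The 3-state chain MDP (STATES, ACTIONS, T, R, GAMMA) is a made-up toy example, not the lecture's grid world or project code; only the update rule itself follows the slide.

```python
# A minimal value-iteration sketch on a hypothetical 3-state chain MDP.

GAMMA = 0.9  # discount

STATES = ["s0", "s1", "s2"]
ACTIONS = ["stay", "go"]

# T[s][a] = list of (next_state, probability) pairs, mirroring T(s,a,s')
T = {
    "s0": {"stay": [("s0", 1.0)], "go": [("s1", 0.8), ("s0", 0.2)]},
    "s1": {"stay": [("s1", 1.0)], "go": [("s2", 0.8), ("s1", 0.2)]},
    "s2": {"stay": [("s2", 1.0)], "go": [("s2", 1.0)]},
}

# R(s,a,s'): reward of 10 for first entering s2, 0 otherwise
def R(s, a, s2):
    return 10.0 if s2 == "s2" and s != "s2" else 0.0

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in STATES}  # V_0(s) = 0
    while True:
        # Bellman update for every state at once:
        # V_{i+1}(s) = max_a sum_{s'} T(s,a,s') [ R(s,a,s') + GAMMA * V_i(s') ]
        new_V = {
            s: max(
                sum(p * (R(s, a, s2) + GAMMA * V[s2]) for s2, p in T[s][a])
                for a in ACTIONS
            )
            for s in STATES
        }
        # Stop once the values change very little between passes
        if max(abs(new_V[s] - V[s]) for s in STATES) < eps:
            return new_V
        V = new_V

if __name__ == "__main__":
    print(value_iteration())
```

Each pass recomputes every state from the previous pass's values, which is exactly the bottom-up dynamic program described on the slide.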

Example: Bellman Updates
- Example: γ = 0.9, living reward = 0, noise = 0.2
- (Figure: the max happens for a = right; other actions not shown)

Convergence*
- Define the max-norm: ||U|| = max_s |U(s)|
- Theorem: for any two approximations U and V,
    ||U_{i+1} - V_{i+1}|| ≤ γ ||U_i - V_i||
  - I.e. any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true values, and value iteration converges to a unique, stable, optimal solution
- Theorem: once the change in our approximation from one update to the next is small, the approximation must also be close to correct

At Convergence
- At convergence, we have found the optimal value function V* for the discounted infinite horizon problem, which satisfies the Bellman equations:
    V*(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]

Practice: Computing Actions
- Which action should we choose from state s:
  - Given optimal values V?
  - Given optimal q-values Q?
- Lesson: actions are easier to select from Q's!

Complete procedure
1. Run value iteration (off-line)
   - Returns V, which (assuming sufficiently many iterations) is a good approximation of V*
2. Agent acts. At time t the agent is in state s_t and takes the action
    a_t = argmax_a Σ_{s'} T(s_t, a, s') [ R(s_t, a, s') + γ V(s') ]
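As a hedged illustration of why actions are easier to select from Q's, the sketch below reuses the hypothetical STATES, ACTIONS, T, R, GAMMA and value_iteration() from the previous sketch: selecting from V needs the model for a one-step look-ahead, while selecting from Q is a plain argmax over stored numbers.

```python
# Greedy action from optimal values V: needs the model (T, R) for a
# one-step look-ahead, the same expression the agent evaluates at time t
# in the "Complete procedure" slide.
def action_from_values(s, V):
    return max(
        ACTIONS,
        key=lambda a: sum(p * (R(s, a, s2) + GAMMA * V[s2]) for s2, p in T[s][a]),
    )

# Greedy action from optimal q-values Q[(s, a)]: no model needed,
# just an argmax over the stored numbers.
def action_from_qvalues(s, Q):
    return max(ACTIONS, key=lambda a: Q[(s, a)])

V = value_iteration()
print(action_from_values("s0", V))  # -> "go" for the toy chain MDP
```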

Utilities for Fixed Policies
- Another basic operation: compute the utility of a state s under a fixed (general, non-optimal) policy π
- Define the utility of a state s under a fixed policy π:
    V^π(s) = expected total discounted rewards (return) starting in s and following π
- Recursive relation (one-step look-ahead / Bellman equation):
    V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]

Policy Evaluation
- How do we calculate the V's for a fixed policy?
- Idea one: modify the Bellman updates (fix the action to π(s) instead of maximizing)
- Idea two: it's just a linear system, solve with Matlab (or whatever)

Policy Iteration
- Alternative approach:
  - Step 1: Policy evaluation: calculate utilities for some fixed policy (not optimal utilities!) until convergence
  - Step 2: Policy improvement: update the policy using one-step look-ahead with the resulting converged (but not optimal!) utilities as future values
  - Repeat steps until the policy converges
- This is policy iteration
  - It's still optimal!
  - Can converge faster under some conditions

Policy Iteration
- Policy evaluation: with the current policy π fixed, find values with simplified Bellman updates:
    V^π_{i+1}(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π_i(s') ]
  - Iterate until values converge
- Policy improvement: with the utilities fixed, find the best action according to one-step look-ahead:
    π_new(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + γ V^π(s') ]
- (A Python sketch of both steps follows after the Asynchronous Value Iteration slide below.)

Comparison
- In value iteration:
  - Every pass (or "backup") updates both utilities (explicitly, based on current utilities) and the policy (possibly implicitly, based on the current policy)
- In policy iteration:
  - Several passes to update utilities with the policy frozen
  - Occasional passes to update the policy
- Hybrid approaches (asynchronous policy iteration):
  - Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Asynchronous Value Iteration*
- In value iteration, we update every state in each iteration
- Actually, any sequence of Bellman updates will converge if every state is visited infinitely often
- In fact, we can update the policy as seldom or often as we like, and we will still converge
- Idea: update states whose value we expect to change:
    if |V_{i+1}(s) - V_i(s)| is large, then update the predecessors of s
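The sketch referenced on the Policy Iteration slide above is below. It runs on the same hypothetical chain MDP, reusing STATES, ACTIONS, T, R, GAMMA and action_from_values() from the earlier sketches; the initial policy and helper names are my own choices, not from the slides.

```python
# Policy evaluation: simplified Bellman updates with the action fixed to pi[s]
def evaluate_policy(pi, eps=1e-6):
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {
            s: sum(p * (R(s, pi[s], s2) + GAMMA * V[s2]) for s2, p in T[s][pi[s]])
            for s in STATES
        }
        if max(abs(new_V[s] - V[s]) for s in STATES) < eps:
            return new_V
        V = new_V

# Policy iteration: evaluate the current policy, then improve it with a
# one-step look-ahead on the converged (non-optimal) values
def policy_iteration():
    pi = {s: ACTIONS[0] for s in STATES}  # arbitrary initial policy: "stay" everywhere
    while True:
        V = evaluate_policy(pi)
        new_pi = {s: action_from_values(s, V) for s in STATES}
        if new_pi == pi:  # policy converged; no improvement step changed it
            return pi, V
        pi = new_pi

print(policy_iteration())
```

Note that evaluation stops when the values stop changing, while the outer loop stops when the policy itself stops changing, which typically happens after only a few improvement steps.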

MDPs recap
- Markov decision processes:
  - States S
  - Actions A
  - Transitions P(s'|s,a) (or T(s,a,s'))
  - Rewards R(s,a,s') (and discount γ)
  - Start state s_0
- Solution methods:
  - Value iteration (VI)
  - Policy iteration (PI)
  - Asynchronous value iteration
- Current limitations:
  - Relatively small state spaces
  - Assumes T and R are known
