
Logistics; Reading: AIMA Ch 21 (Reinforcement Learning); Markov Decision Processes



1. Logistics
   • Reading: AIMA Ch 21 (Reinforcement Learning); Markov Decision Processes
   • Project 1 due today: 2 printouts of the report; email Miao with "CSE 573"
     • Source code
     • Document in .doc or .pdf
   • Project 2 description on web

   New teams
   • By Monday 11/15 - email Miao w/ team + direction
   • Feel free to consider other ideas

   Idea 1: Spam Filter
   • Decision Tree Learner? Ensemble of…? Naïve Bayes? (a minimal sketch follows this slide)
   • Bag of Words representation
   • Enhancement: augment the data set? ???????

   Idea 2: Localization
   • Placelab data
   • Learn "places" (K-means clustering)
   • Predict movements between places (Markov model, or ….)

   Proto-idea 3: Captchas
   • The problem of software robots
   • Turing test is big business
   • Break or create; non-vision based?

   Proto-idea 4: Openmind.org
   • Repository of Knowledge in NLP
   • What the heck can we do with it????
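Idea 1 above pairs a Naïve Bayes learner with a bag-of-words representation. Below is a minimal sketch of that combination, assuming add-one smoothing; the tiny corpus, the function names, and the smoothing constant are illustrative placeholders, not part of the project description.

```python
from collections import Counter
import math

def train_nb(docs, labels, alpha=1.0):
    """Train a bag-of-words Naive Bayes classifier with add-alpha smoothing."""
    classes = set(labels)
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}            # word counts per class
    for words, c in zip(docs, labels):
        counts[c].update(words)
    vocab = {w for words in docs for w in words}
    return prior, counts, vocab, alpha

def classify(words, model):
    prior, counts, vocab, alpha = model
    best, best_score = None, float("-inf")
    for c in prior:
        total = sum(counts[c].values())
        score = math.log(prior[c])
        for w in words:
            if w in vocab:                               # ignore unseen words
                score += math.log((counts[c][w] + alpha) /
                                  (total + alpha * len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Toy usage (illustrative data only)
docs = [["win", "cash", "now"], ["meeting", "at", "noon"]]
labels = ["spam", "ham"]
model = train_nb(docs, labels)
print(classify(["cash", "now"], model))                  # -> "spam"
```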

2. Proto-idea 4: Wordnet (www.cogsci.princeton.edu/~wn/); Openmind Animals
   • Giant graph of concepts; centrally controlled → semantics
   • What to do? Integrate with FAQ lists, Openmind, ???

   573 Topics: Where are We?
   • Agency; Problem Spaces; Search
   • Knowledge Representation & Inference: Logic-Based, Probabilistic
   • Planning; Supervised Learning; Reinforcement Learning
   • Uncertainty: Bayesian Networks; Sequential Stochastic Processes ((Hidden) Markov Models, Dynamic Bayesian Networks (DBNs), Markov Decision Processes (MDPs))
   • Probabilistic STRIPS Representation

   An Example Bayes Net
   • Nodes: Earthquake, Burglary, Radio, Alarm, Nbr1Calls, Nbr2Calls
   • Pr(B=t) = 0.05, Pr(B=f) = 0.95
   • Pr(A | E, B): (e,b) 0.9 (0.1); (e,¬b) 0.2 (0.8); (¬e,b) 0.85 (0.15); (¬e,¬b) 0.01 (0.99)

   Planning under Uncertainty
   • Environment: static, stochastic, fully observable; actions are instantaneous
   • Percepts: full, perfect
   • What action next?
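The example Bayes net comes with concrete CPT numbers, which support a quick enumeration exercise. Below is a minimal sketch computing P(Alarm = true) by summing over Earthquake and Burglary. The Earthquake prior does not appear in the slide text, so the value used here is an assumed placeholder, and the row ordering (e,b), (e,¬b), (¬e,b), (¬e,¬b) is the reconstruction used above.

```python
# Compute P(Alarm = true) by enumerating Earthquake and Burglary.
P_B = 0.05                      # Pr(Burglary = true), from the slide
P_E = 0.05                      # assumed Pr(Earthquake = true): NOT on the slide
P_A = {                         # Pr(Alarm = true | E, B), from the slide
    (True, True): 0.9,
    (True, False): 0.2,
    (False, True): 0.85,
    (False, False): 0.01,
}

p_alarm = 0.0
for e in (True, False):
    for b in (True, False):
        p_eb = (P_E if e else 1 - P_E) * (P_B if b else 1 - P_B)
        p_alarm += p_eb * P_A[(e, b)]

print(f"P(Alarm = true) = {p_alarm:.4f}")   # roughly 0.061 with the assumed prior
```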

3. Recap: Markov Models
   • Q: set of states
   • π: initial probability distribution
   • A: transition probability distribution, ONE per ACTION
   • Markov assumption; stationary model assumption

   Models of Planning
                   Uncertainty
     Observation   Deterministic   Disjunctive   Probabilistic
     Complete      Classical       Contingent    MDP
     Partial       ???             Contingent    POMDP
     None          ???             Conformant    POMDP

   A Factored Domain
   • Variables: has_user_coffee (huc), has_robot_coffee (hrc), robot_is_wet (w), has_robot_umbrella (u), raining (r), robot_in_office (o)
   • Actions: buy_coffee, deliver_coffee, get_umbrella, move
   • What is the number of states? Can we succinctly represent transition probabilities in this case? (a counting sketch follows this slide)

   Probabilistic "STRIPS"?
   • Move: office → cafe, drawn as an outcome tree over in-office, raining, has-umbrella; one low-probability branch (P < .1) also leaves the robot wet [outcome-tree diagram]

   Dynamic Bayesian Nets / Dynamic Bayesian Net for Move
   • Two-slice network over huc, hrc, w, u, r, o and their primed (next-step) copies
   • Per-variable CPT sizes: huc' 8, hrc' 4, w' 16, u' 4, r' 2, o' 2; total values required to represent the transition probability table = 36, vs 4096 for a flat table
   • Pr(w' | u, w): (u,w) 1.0 (0); (u,¬w) 0.1 (0.9); (¬u,w) 1.0 (0); (¬u,¬w) 1.0 (0). Actually this table should have 16 entries!
   • Pr(r' = T | r): 0.95 if r = T, 0.5 if r = F
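The two questions at the end of "A Factored Domain" have a short quantitative answer that the DBN slide spells out: 64 states, 4096 entries in a flat per-action transition table, but only 36 numbers in the factored form. Below is a minimal counting sketch; the parent sets are assumptions chosen only to reproduce the per-CPT sizes on the slide (8, 4, 16, 4, 2, 2), not a statement of the original diagram.

```python
# Parameter-counting argument for the factored Move action.
variables = ["huc", "hrc", "w", "u", "r", "o"]
n_states = 2 ** len(variables)               # 2^6 = 64 states
flat_table = n_states * n_states             # 64 * 64 = 4096 entries

parents = {                                  # assumed parents of each primed variable
    "huc'": ["huc", "hrc", "o"],             # 2^3 = 8 rows
    "hrc'": ["hrc", "o"],                    # 4 rows
    "w'":   ["u", "w", "r", "o"],            # 16 rows ("table should have 16 entries")
    "u'":   ["u", "o"],                      # 4 rows
    "r'":   ["r"],                           # 2 rows
    "o'":   ["o"],                           # 2 rows
}
dbn_table = sum(2 ** len(ps) for ps in parents.values())
print(n_states, flat_table, dbn_table)       # 64 4096 36
```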

4. Actions in DBN
   • Last time: actions in the DBN via unrolling; today: don't need them
   • [figure: two-slice DBN over huc, hrc, w, u, r, o at T and T+1, with an action node a]

   Observability
   • Full observability
   • Partial observability
   • No observability

   Reward/cost
   • Each action has an associated cost.
   • The agent may accrue rewards at different stages; a reward may depend on:
     • the current state
     • the (current state, action) pair
     • the (current state, action, next state) triplet
   • Additivity assumption: costs and rewards are additive.
   • Reward accumulated = R(s0) + R(s1) + R(s2) + …

   Horizon
   • Finite: plan till t stages. Reward = R(s0) + R(s1) + R(s2) + … + R(st)
   • Infinite: the agent never dies; the reward R(s0) + R(s1) + R(s2) + … could be unbounded.
     • Discounted reward: R(s0) + γ R(s1) + γ² R(s2) + …
     • Average reward: lim_{n→∞} (1/n) Σ_i R(s_i)

   Goal for an MDP
   • Find a policy which maximizes expected discounted reward over an infinite horizon for a fully observable Markov decision process.
   • Why shouldn't the planner find a plan?? What is a policy??

   Optimal value of a state
   • Define V*(s), the "value of a state", as the maximum expected discounted reward achievable from this state.
   • Value of the state if we force the agent to take action a right now, but let it act optimally afterwards:
     Q*(a,s) = R(s) + c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s')
   • V* should satisfy:
     V*(s) = max_{a∈A} Q*(a,s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V*(s') }
     (a short code sketch of these two equations follows this slide)
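A minimal sketch of the Q*/V* definitions above, using the slide's convention Q*(a,s) = R(s) + c(a) + γ Σ_{s'} Pr(s'|a,s) V*(s'). The dictionaries R (state rewards), C (action costs), T (transition distributions) and the discount gamma are placeholders the caller supplies; none of these names come from the slides.

```python
def q_value(s, a, V, R, C, T, gamma=0.9):
    """Q*(a,s) given a value function V (dict: state -> value)."""
    return R[s] + C[a] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())

def state_value(s, actions, V, R, C, T, gamma=0.9):
    """V*(s) = max_a Q*(a,s)."""
    return max(q_value(s, a, V, R, C, T, gamma) for a in actions)
```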

5. Value iteration
   • Assign an arbitrary assignment of values to each state (or use an admissible heuristic).
   • Iterate over the set of states and in each iteration improve the value function as follows (a "Bellman backup"):
     V_{t+1}(s) = R(s) + max_{a∈A} { c(a) + γ Σ_{s'∈S} Pr(s'|a,s) V_t(s') }
   • Stop the iteration appropriately; V_t approaches V* as t increases. (A sketch of the loop follows this slide.)

   Bellman Backup
   • [figure: Q_{n+1}(s,a) computed from V_n for actions a1, a2, a3; V_{n+1}(s) = max_a Q_{n+1}(s,a)]

   Stopping Condition
   • ε-convergence: a value function is ε-optimal if the error (residue) at every state is less than ε.
   • Residue(s) = |V_{t+1}(s) − V_t(s)|
   • Stop when max_{s∈S} Residue(s) < ε

   Complexity of value iteration
   • One iteration takes O(|S|² |A|) time.
   • Number of iterations required: poly(|S|, |A|, 1/(1−γ)).
   • Overall the algorithm is polynomial in the size of the state space, and thus exponential in the number of state variables.

   Computation of optimal policy
   • Given the value function V*(s), do a Bellman backup at each state; the action that maximises the inner term is the optimal action.
   • ⇒ The optimal policy is stationary (time-independent); intuitive for the infinite-horizon case.

   Policy evaluation
   • Given a policy Π: S → A, find the value of each state under this policy:
     V^Π(s) = R(s) + c(Π(s)) + γ Σ_{s'∈S} Pr(s'|Π(s),s) V^Π(s')
   • This is a system of linear equations in |S| variables.
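Putting the loop, the Bellman backup, and the ε stopping condition together gives a short procedure. This is a sketch under the same assumed R/C/T/gamma placeholders as above, not the course's reference implementation.

```python
def value_iteration(states, actions, R, C, T, gamma=0.9, eps=1e-4):
    V = {s: 0.0 for s in states}                    # arbitrary initial values
    while True:
        V_new = {}
        for s in states:
            V_new[s] = R[s] + max(                  # Bellman backup at s
                C[a] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in actions
            )
        residue = max(abs(V_new[s] - V[s]) for s in states)
        V = V_new
        if residue < eps:                           # epsilon-convergence
            return V

def greedy_policy(states, actions, V, R, C, T, gamma=0.9):
    """Extract a stationary policy by one final Bellman backup per state."""
    return {
        s: max(actions,
               key=lambda a: C[a] + gamma *
               sum(p * V[s2] for s2, p in T[(s, a)].items()))
        for s in states
    }
```

greedy_policy is the "computation of optimal policy" step: one extra backup per state, keeping the maximising action.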

6. Bellman's principle of optimality
   • A policy Π is optimal if V^Π(s) ≥ V^Π'(s) for all policies Π' and all states s ∈ S.
   • Rather than finding the optimal value function, we can try to find the optimal policy directly by doing a policy-space search.

   Policy iteration
   • Start with any policy Π_0.
   • Iterate:
     • Policy evaluation: for each state find V^{Π_i}(s).
     • Policy improvement: for each state s, find the action a* that maximises Q^{Π_i}(a,s). If Q^{Π_i}(a*,s) > V^{Π_i}(s), let Π_{i+1}(s) = a*; else let Π_{i+1}(s) = Π_i(s).
   • Stop when Π_{i+1} = Π_i. (A sketch follows this slide.)
   • Converges in fewer iterations than value iteration, but each policy evaluation step is more expensive.

   Modified Policy iteration
   • Rather than evaluating the actual value of the policy by solving a system of equations, approximate it by running value iteration with the policy held fixed.

   RTDP iteration
   • Start with the initial belief and initialize the value of each belief to its heuristic value.
   • For the current belief:
     • Save the action that minimises the current state value in the current policy.
     • Update the value of the belief through a Bellman backup.
     • Apply that minimising action and then randomly pick an observation.
     • Move to the next belief given that observation.
   • Repeat until the goal is achieved.

   Fast RTDP convergence
   • What are the advantages of RTDP? What are the disadvantages? How can RTDP be sped up?

   Other speedups
   • Heuristics
   • Aggregations
   • Reachability analysis
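A minimal sketch of policy iteration as described above: exact policy evaluation by solving the |S|-variable linear system, then greedy improvement, stopping when the policy no longer changes. States are assumed to be integers 0..n−1 and R, C, T, gamma are the same placeholder structures as before; numpy is used only for the linear solve.

```python
import numpy as np

def evaluate_policy(pi, n_states, R, C, T, gamma=0.9):
    """Solve V = R + C(pi) + gamma * P_pi V exactly."""
    P = np.zeros((n_states, n_states))
    b = np.zeros(n_states)
    for s in range(n_states):
        b[s] = R[s] + C[pi[s]]
        for s2, p in T[(s, pi[s])].items():
            P[s, s2] = p
    return np.linalg.solve(np.eye(n_states) - gamma * P, b)

def policy_iteration(n_states, actions, R, C, T, gamma=0.9):
    pi = {s: actions[0] for s in range(n_states)}   # arbitrary initial policy
    while True:
        V = evaluate_policy(pi, n_states, R, C, T, gamma)
        stable = True
        for s in range(n_states):
            q = {a: R[s] + C[a] + gamma *
                 sum(p * V[s2] for s2, p in T[(s, a)].items())
                 for a in actions}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]]:                  # strict improvement only
                pi[s], stable = best, False
        if stable:                                  # pi_{i+1} == pi_i
            return pi, V
```

Modified policy iteration would replace evaluate_policy with a few sweeps of value iteration under the fixed policy.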

7. Going beyond full observability
   • In the execution phase we are uncertain where we are, but we have some idea of where we can be.
   • A belief state = ?

   Models of Planning
                   Uncertainty
     Observation   Deterministic   Disjunctive   Probabilistic
     Complete      Classical       Contingent    MDP
     Partial       ???             Contingent    POMDP
     None          ???             Conformant    POMDP

   Speedups
   • Reachability analysis
   • More informed heuristics

   Mathematical modelling
   • Search space: finite/infinite, state/belief space (belief state = some idea of where we are)
   • Initial state/belief
   • Actions
   • Action transitions (state to state / belief to belief)
   • Action costs
   • Feedback: zero / partial / total

   Algorithms for search
   • A*: works for sequential solutions.
   • AO*: works for acyclic solutions.
   • LAO*: works for cyclic solutions.
   • RTDP: works for cyclic solutions. (A trial-based sketch follows this slide.)

   Full Observability
   • Modelled as MDPs (also called fully observable MDPs).
   • Output: policy (State → Action).
   • Bellman equation: V*(s) = max_{a∈A(s)} [ c(a) + Σ_{s'∈S} V*(s') P(s'|s,a) ]
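RTDP, listed above for cyclic solutions and described on the previous slide, can be sketched for the fully observable case as repeated greedy trials from the start state. The sketch below uses the cost-minimisation convention of the RTDP description (the Bellman equation on this slide is written with max; the difference is only the sign convention), and the heuristic h, transition model T, action costs C and goal set are placeholders.

```python
import random

def rtdp_trial(s0, goals, actions, C, T, h, V, max_steps=1000):
    s = s0
    for _ in range(max_steps):
        if s in goals:
            break
        # Greedy action under current values (Bellman backup at s).
        q = {a: C[a] + sum(p * V.get(s2, h(s2)) for s2, p in T[(s, a)].items())
             for a in actions}
        a_best = min(q, key=q.get)
        V[s] = q[a_best]                       # update V(s)
        # Sample a successor according to the transition distribution.
        succs, probs = zip(*T[(s, a_best)].items())
        s = random.choices(succs, weights=probs)[0]
    return V

def rtdp(s0, goals, actions, C, T, h, n_trials=100):
    V = {}                                     # values initialised lazily to h(s)
    for _ in range(n_trials):
        rtdp_trial(s0, goals, actions, C, T, h, V)
    return V
```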

8. Partial Observability
   • Modelled as POMDPs (partially observable MDPs); also called probabilistic contingent planning.
   • Belief = probability distribution over states. What is the size of the belief space?
   • Output: policy (discretized belief → action).
   • Bellman equation: V*(b) = max_{a∈A(b)} [ c(a) + Σ_{o∈O} P(b,a,o) V*(b_a^o) ]
     (a belief-update sketch follows this slide)

   No observability
   • Deterministic search in the belief space.
   • Output?
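The POMDP Bellman equation above sums over observations o and recurses on the updated belief b_a^o. Below is a minimal sketch of that belief update under an assumed transition model T and observation model O (both placeholders); the returned normaliser is exactly the P(b,a,o) term in the equation.

```python
def belief_update(b, a, o, states, T, O):
    """b_a^o(s') is proportional to O[(s', o)] * sum_s T[(s, a)][s'] * b(s)."""
    b_new = {}
    for s2 in states:
        pred = sum(T[(s, a)].get(s2, 0.0) * p for s, p in b.items())
        b_new[s2] = O[(s2, o)] * pred
    z = sum(b_new.values())                    # P(o | b, a), the normaliser
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return {s2: v / z for s2, v in b_new.items()}, z
```

Discretizing the beliefs produced by this update is what yields the "discretized belief → action" policies mentioned above.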
