 
Class #21: Markov Decision Processes as Models for Learning
Machine Learning (COMP 135): M. Allen
Monday, 18 Nov. 2019

What Do We Want AI and ML to Do?
- Short answer: Lots of things!
- Intelligent robot and vehicle navigation
- Better web search
- Automated personal assistants
- Scheduling for delivery vehicles, air traffic control, industrial processes, …
- Simulated agents in video games
- Automated translation systems

What Do We Need?
- AI systems must be able to handle complex, uncertain worlds, and come up with plans that are useful to us over extended periods of time
- Uncertainty: requires something like probability theory
- Value-based planning: we want to maximize expected utility over time, as in decision theory
- Planning over time: we need some sort of temporal model of how the world can change as we go about our business

Markov Decision Processes
- Markov Decision Processes (MDPs) combine various ideas from probability theory and decision theory
- A useful model for doing full planning, and for representing environments where agents can learn what to do
- Basic idea: a world made up of states, changing based on the actions of an AI agent, who is trying to maximize its long-term reward as it does so
- One technical detail: change happens probabilistically (under the Markov assumption)

Formal Definition of an MDP
- An MDP has several components: M = <S, A, P, R, T>
  1. S = a set of states of the world
  2. A = a set of actions an agent can take
  3. P = a state-transition function: P(s, a, s') is the probability of ending up in state s' if you start in state s and you take action a: P(s' | s, a)
  4. R = a reward function: R(s, a, s') is the one-step reward you get if you go from state s to state s' after taking action a
  5. T = a time horizon (how many steps): we assume that every state-transition, following a single action, takes a single unit of time

An Example: Maze Navigation
- Suppose we have a robot in a maze, looking for the exit
- The robot can see where it is currently, and where surrounding walls are, but doesn't know anything else
- We would like it to be able to learn the shortest route out of the maze, no matter where it starts
- How can we formulate this problem as an MDP?

MDP for the Maze Problem
- States: each state is simply the robot's current location (imagine the map is a grid), including nearby walls
- Actions: the robot can move in one of the four directions (UP, DOWN, LEFT, RIGHT)

Action Transitions
[Figure: grid of states; top row s2, s3; bottom row s1, s4]
- We can use the transition function to represent important features of the maze problem domain
- For instance, the robot cannot move through walls
- For example, if the robot starts in the corner (s1), and tries to go DOWN, nothing happens:
  P(s1, DOWN, s1) = 1.0
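
The deterministic wall behavior above can be sketched in a few lines of Python. This is only an illustration, not the lecture's own code: the coordinate layout (s1 bottom-left, s2 top-left, s3 top-right, s4 bottom-right) is read off the slide figure, and the names COORDS, MOVES, next_state, and P are hypothetical. It also assumes the only walls are the outer boundary of the 2x2 grid.

```python
# Hypothetical encoding of the 2x2 maze: (row, col), with row 0 at the bottom.
COORDS = {"s1": (0, 0), "s2": (1, 0), "s3": (1, 1), "s4": (0, 1)}
MOVES = {"UP": (1, 0), "DOWN": (-1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}

STATES = list(COORDS)
ACTIONS = list(MOVES)


def next_state(s, a):
    """Deterministic move: step one cell, or stay put when a wall blocks the way."""
    row, col = COORDS[s]
    d_row, d_col = MOVES[a]
    target = (row + d_row, col + d_col)
    for name, pos in COORDS.items():
        if pos == target:
            return name
    return s  # assumed outer wall: the robot does not move


def P(s, a, s_next):
    """State-transition function P(s, a, s') for the deterministic maze."""
    return 1.0 if next_state(s, a) == s_next else 0.0


print(P("s1", "DOWN", "s1"))  # the corner case from the slide
```

For every state and action, the probabilities P(s, a, s') summed over all s' should equal 1, which is a quick sanity check on any transition function.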

Action Transitions, II
[Figure: grid of states; top row s2, s3; bottom row s1, s4]
- Similarly, we can model uncertain action outcomes using the transition model
- Suppose the robot is a little unstable, and occasionally goes in the wrong direction
- Thus, if it starts in state s1 and tries to go UP to s2:
  1. 80% of the time it works: P(s1, UP, s2) = 0.8
  2. But it may slip and miss: P(s1, UP, s3) = 0.2

Rewards in the Maze
[Figure: goal state G with neighboring states s1, s2, s3]
- If G is our goal (exit) state, we can "encourage" the robot by giving any action that gets to G positive reward:
  R(s1, DOWN, G) = +100
  R(s2, LEFT, G) = +100
  R(s3, UP, G) = +100
- Further, we can reward quicker solutions by making all other movements have negative reward, e.g.:
  R(s1, RIGHT, s') = -1
  R(s2, UP, s') = -1
  etc.
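
A minimal sketch of how the uncertain transition and the reward scheme might be written down, assuming a dictionary-based representation. The names TRANSITIONS, reward, sample_next_state, and expected_reward are hypothetical, and the table holds only the single (s1, UP) entry given on the slide; a full maze would need one entry per state-action pair. Note that the reward slide uses a different maze layout from the 2x2 grid, so the reward function here only encodes the general rule (+100 for reaching G, -1 otherwise).

```python
import random

# Only the entry stated on the slide; other (state, action) pairs are omitted.
TRANSITIONS = {("s1", "UP"): {"s2": 0.8, "s3": 0.2}}


def reward(s, a, s_next):
    """R(s, a, s'): +100 for any move that reaches the goal G, -1 for any other move."""
    return 100.0 if s_next == "G" else -1.0


def sample_next_state(s, a, rng):
    """Draw s' according to P(s' | s, a)."""
    dist = TRANSITIONS[(s, a)]
    states = list(dist)
    weights = [dist[x] for x in states]
    return rng.choices(states, weights=weights, k=1)[0]


def expected_reward(s, a):
    """One-step expected reward: sum over s' of P(s' | s, a) * R(s, a, s')."""
    return sum(p * reward(s, a, s2) for s2, p in TRANSITIONS[(s, a)].items())


rng = random.Random(0)  # fixed seed so repeated runs draw the same samples
draws = [sample_next_state("s1", "UP", rng) for _ in range(10_000)]
print(draws.count("s2") / len(draws))  # should be close to 0.8
```

Since neither s2 nor s3 is the goal in this example, both outcomes of going UP from s1 cost -1, so the expected one-step reward is also -1.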

Solving the Maze
[Figure: 3x3 grid; top row s1, s2, s3; middle row G, s4, s5; bottom row s6, s7, s8]
- A solution to our problem takes the form of a policy of action, π
- At each state, it tells the agent the best thing to do:
  π(s1) = DOWN
  π(s2) = LEFT
- Similarly for all other states…

Planning and Learning
- How do we find policies?
- If we know the entire problem, we plan
  - e.g., if we already know the whole maze, and know all the MDP dynamics, we can solve it to find the best policy of action (even if we have to take into account the probability that some movements fail some of the time)
- If we don't know it all ahead of time, we learn
  - Reinforcement Learning: use the positive and negative feedback from the one-step reward in an MDP, and figure out a policy that gives us long-term value

Maximizing Expected Return
- If we are solving a planning problem like an MDP, we want our plan to give us maximum expected reward over time
- In a finite-time problem, the total reward we get at some time-step t is just the sum of future rewards (up to our time-limit T):
  R_t = r_{t+1} + r_{t+2} + … + r_T
- The optimal policy would make this sum as large as possible, taking into account any probabilistic outcomes (e.g. robot moves that go the wrong way by accident)

The Infinite (Indefinite) Case
- Unfortunately, this simple idea doesn't really work for problems with indefinite time-horizons
- In such problems, our agent can keep on acting, and we have no known upper bound on how long this may continue
- In such cases we treat the upper bound as if it is infinite: T = ∞
- If the time-horizon T is infinite, then the sum of rewards:
  R_t = r_{t+1} + r_{t+2} + … + r_T
  can be infinitely large (or unboundedly negative), too!
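
The finite-horizon return, and the reason it breaks down as T grows, can be illustrated with a short sketch. The function names and the list-based indexing convention (index k holds r_{k+1}) are assumptions made for this example, and the reward sequence is made up for illustration.

```python
# A partial policy as a plain lookup table, using the two entries from the slide.
policy = {"s1": "DOWN", "s2": "LEFT"}


def finite_return(rewards, t=0):
    """R_t = r_{t+1} + ... + r_T, where rewards[k] stores r_{k+1} and T = len(rewards)."""
    return sum(rewards[t:])


def constant_reward_return(r, horizon):
    """Return R_0 when every one of `horizon` steps yields the same reward r."""
    return finite_return([r] * horizon)


# Hypothetical episode: three ordinary moves, then a move that reaches the goal.
episode = [-1, -1, -1, 100]
print(finite_return(episode, t=0))

# With -1 per step and no goal ever reached, the return shrinks without bound
# as the horizon grows, which is the problem raised for T = infinity.
for horizon in (10, 100, 1000):
    print(horizon, constant_reward_return(-1, horizon))
```

The loop shows that no finite value can stand in for the T = ∞ sum under this definition: each longer horizon gives a strictly smaller return.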