SLIDE 1

Making Complex Decisions

Paolo Turrini
Department of Computing, Imperial College London

Introduction to Artificial Intelligence, 2nd Part

SLIDE 2–7

AlphaGo beats World Go Champion

Welcome to scientific journalism! It's the number of possible positions that makes the fundamental difference, together with the branching factor.

SLIDE 8

Outline

Time
Patience
Risk

SLIDE 9

The main reference

Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Chapter 17.

SLIDE 10–13

The World

Begin at the start state. The game ends when we reach either goal state, +1 or −1. Collision results in no movement.

SLIDE 14–18

The Agent

The agent goes:

towards the intended direction with probability 0.8
to the left of the intended direction with probability 0.1
to the right of the intended direction with probability 0.1
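
This motion model is easy to write down. Below is a minimal Python sketch, assuming the standard 4×3 grid world of Russell and Norvig (wall at (2,2), terminals at (4,3) and (4,2)); the coordinate convention and helper names are illustrative, not from the slides.

```python
# Minimal sketch of the motion model, assuming the 4x3 grid of Russell &
# Norvig, Ch. 17: wall at (2,2), terminal squares at (4,3) and (4,2).
WALL = (2, 2)
WIDTH, HEIGHT = 4, 3

MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
LEFT_OF = {'Up': 'Left', 'Left': 'Down', 'Down': 'Right', 'Right': 'Up'}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def step(state, direction):
    """Deterministic move; hitting the wall or the border means no movement."""
    x, y = state[0] + MOVES[direction][0], state[1] + MOVES[direction][1]
    if (x, y) == WALL or not (1 <= x <= WIDTH and 1 <= y <= HEIGHT):
        return state
    return (x, y)

def transitions(state, action):
    """P(s' | s, a): intended direction w.p. 0.8, left/right of it w.p. 0.1 each."""
    dist = {}
    for direction, p in [(action, 0.8), (LEFT_OF[action], 0.1), (RIGHT_OF[action], 0.1)]:
        s2 = step(state, direction)
        dist[s2] = dist.get(s2, 0.0) + p
    return dist

print(transitions((1, 1), 'Up'))  # {(1, 2): 0.8, (1, 1): 0.1, (2, 1): 0.1}
```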

SLIDE 19–22

The Agent and the World

The environment is fully observable:

the agent always knows what the world looks like: e.g., there is a wall, where the wall is, how to get to the wall...

the agent always knows his or her position during the game, even though some trajectories might not be reached with certainty.

SLIDE 23–24

The Agent and the World

The environment is Markovian: the probability of reaching a state only depends on the state the agent is in and the action she performs.

SLIDE 25–30

The Agent and the World

[x, y]_t is the fact that the agent is at square [x, y] at time t.

(x, y)_t is the fact that the agent intends to go to [x, y] at time t.

P([x, y]_t | (x, y)_{t−1}, [x−1, y]_{t−1})
  = P([x, y]_t | (x, y)_{t−1}, [x−1, y]_{t−1}, [x−5, y−6]_{t−20})
  = P([x, y]_t | (x, y)_{t−1}, [x−1, y]_{t−1}, (x−4, y−6)_{t−20})

SLIDE 31–32

The Agent and the World

These properties allow us to make plans. E.g., plans with deterministic agents: as we know, P([x, y]_t | (x, y)_{t−1}, [x−1, y]_{t−1}) = 1.

SLIDE 33–35

Let's make plans, then

We use {Up, Down, Left, Right} to denote the intended directions. So [Up, Down, Up, Right] is going to be the plan that, from the starting state, executes the moves in the specified order.

SLIDE 36–41

Making plans

Goal: get to +1.

Consider the plan [Up, Up, Right, Right, Right]. With deterministic agents, it gets us to +1 with probability 1. But now? What's the probability that [Up, Up, Right, Right, Right] gets us to +1?

SLIDE 42–45

Making plans

It's not 0.8^5!

0.8^5 is the probability that we get to +1 actually using the intended plan [Up, Up, Right, Right, Right].

0.8^5 = 0.32768: this means that we do not even get there one time out of three.

SLIDE 46–49

Making plans

There is a small chance of [Up, Up, Right, Right, Right] accidentally reaching the goal by going the other way round! The probability of this happening is 0.1^4 × 0.8 = 0.00008.

So the probability that [Up, Up, Right, Right, Right] gets us to +1 is 0.32768 + 0.00008 = 0.32776.
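
These numbers can be checked mechanically. Below is a sketch that propagates the state distribution through the plan, treating the terminal squares as absorbing; the grid layout (wall at (2,2), +1 at (4,3), −1 at (4,2), start at (1,1)) is assumed from Russell and Norvig.

```python
# Sketch: propagate the state distribution through a plan, treating the
# terminal squares as absorbing. Layout assumed from Russell & Norvig:
# wall at (2,2), +1 at (4,3), -1 at (4,2), start at (1,1).
WALL, PLUS, MINUS = (2, 2), (4, 3), (4, 2)
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
LEFT_OF = {'Up': 'Left', 'Left': 'Down', 'Down': 'Right', 'Right': 'Up'}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def step(s, d):
    x, y = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    return s if (x, y) == WALL or not (1 <= x <= 4 and 1 <= y <= 3) else (x, y)

def run_plan(plan, start=(1, 1)):
    dist = {start: 1.0}
    for a in plan:
        nxt = {}
        for s, p in dist.items():
            if s in (PLUS, MINUS):            # the game has already ended here
                nxt[s] = nxt.get(s, 0.0) + p
                continue
            for d, q in [(a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)]:
                s2 = step(s, d)
                nxt[s2] = nxt.get(s2, 0.0) + p * q
        dist = nxt
    return dist

final = run_plan(['Up', 'Up', 'Right', 'Right', 'Right'])
print(round(final[PLUS], 5))  # 0.32776 = 0.8^5 + 0.1^4 * 0.8
```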

SLIDE 50–52

Making plans

In this case, the probability of accidental successes doesn't play a significant role. However, it very well might under different decision models, rewards, environments, etc.

0.32776 is still less than 1/3, so we don't seem to be doing very well.

SLIDE 53–55

Rewards

We introduce a utility function r : S → R. Here r stands for rewards; to avoid confusion with established terminology, we also call it a reward function.

SLIDE 56–59

Terminology

rewards: local utilities, assigned to states - denoted r
values: global, long-range utilities, also assigned to states - denoted v
utility and expected utility: general terms applied to actions, states, sequences of states, etc. - denoted u

SLIDE 60–64

Rewards

Consider now the following. The reward is: +1 at state +1, −1 at state −1, −0.04 in all other states.

What's the expected utility of [Up, Up, Right, Right, Right]? IT DEPENDS on how we are going to put rewards together!

SLIDE 65–73

Utility of state sequences

We need to compare sequences of states. Look at the following: u[s_1, s_2, ..., s_n] is the utility of sequence s_1, s_2, ..., s_n. Does it remind you of anything? Multi-criteria decision making.

Many ways of comparing states:

summing all the rewards
giving priority to the immediate rewards
...

SLIDE 74–76

Utility of state sequences

We are going to assume only one axiom, stationary preferences on reward sequences:

[r, r_0, r_1, r_2, ...] ≻ [r, r'_0, r'_1, r'_2, ...] ⇔ [r_0, r_1, r_2, ...] ≻ [r'_0, r'_1, r'_2, ...]

SLIDE 77–82

Utility of state sequences

Theorem. Under stationary preferences, there are only two ways to combine rewards over time.

Additive utility function:

u([s_0, s_1, s_2, ...]) = r(s_0) + r(s_1) + r(s_2) + ···

Discounted utility function:

u([s_0, s_1, s_2, ...]) = r(s_0) + γ r(s_1) + γ^2 r(s_2) + ···

where γ ∈ [0, 1] is the discount factor.
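
As a quick illustration, both combination rules fit in one Python helper (a minimal sketch; γ = 1 recovers the additive case):

```python
def discounted_utility(rewards, gamma=1.0):
    # gamma = 1.0 gives the additive utility function as a special case
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Five steps at the running example's -0.04 step reward, discounted at 0.9:
print(discounted_utility([-0.04] * 5, gamma=0.9))  # about -0.1638
```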

SLIDE 83–87

Discount factor

γ is a measure of the agent's patience: how much more she values a gain of 5 today than a gain of 5 tomorrow, the day after, etc.

Used everywhere in AI, game theory, cognitive psychology. A lot of experimental research on it. Variants: hyperbolic discounting.
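
To make the comparison concrete, here is a small sketch contrasting exponential weights γ^t with the hyperbolic form 1/(1 + kt); the parameter values are illustrative only:

```python
gamma, k = 0.9, 0.5  # illustrative patience parameters
for t in range(4):
    exponential = 5 * gamma**t      # discounted value of a gain of 5 at time t
    hyperbolic = 5 / (1 + k * t)    # hyperbolic-discounting variant
    print(t, round(exponential, 2), round(hyperbolic, 2))
```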

SLIDE 88–91

Discounting

With discounted rewards, the utility of an infinite sequence is finite. In fact, if γ < 1 and rewards are bounded by r, we have:

u[s_1, s_2, ...] = Σ_{t=0}^{∞} γ^t r(s_t) ≤ Σ_{t=0}^{∞} γ^t r = r / (1 − γ)
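
A quick numeric check of the geometric-series bound, with illustrative values of γ and the reward bound r:

```python
gamma, r = 0.9, 1.0  # illustrative discount factor and reward bound
partial = sum(gamma**t * r for t in range(1000))
print(partial, r / (1 - gamma))  # both print (approximately) 10.0
```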

SLIDE 92–96

Markov Decision Process

A Markov Decision Process is a sequential decision problem for a:

fully observable environment
with stochastic actions
with a Markovian transition model
and with discounted (possibly additive) rewards

SLIDE 97

MDPs formally

Definition.
States s ∈ S; actions a ∈ A.
Model P(s'|s, a) = probability that a in s leads to s'.
Reward function R(s) (or R(s, a), R(s, a, s')):
  −0.04 (small penalty) for nonterminal states
  ±1 for terminal states
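
Put together, the running example fits this definition directly. Below is a sketch of it as an explicit MDP in Python; the layout is assumed from Russell and Norvig, and the names are illustrative:

```python
# The 4x3 world as an explicit MDP: states S, actions A, model P, rewards R.
S = [(x, y) for x in range(1, 5) for y in range(1, 4) if (x, y) != (2, 2)]
A = ['Up', 'Down', 'Left', 'Right']
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}

def R(s):
    """Reward function: +-1 at the terminals, -0.04 everywhere else."""
    return TERMINALS.get(s, -0.04)

MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
LEFT_OF = {'Up': 'Left', 'Left': 'Down', 'Down': 'Right', 'Right': 'Up'}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def P(s, a):
    """Transition model P(s'|s, a), returned as a dict of next-state probabilities."""
    def move(d):
        nxt = (s[0] + MOVES[d][0], s[1] + MOVES[d][1])
        return nxt if nxt in S else s  # collisions (wall or border): no movement
    dist = {}
    for d, p in [(a, 0.8), (LEFT_OF[a], 0.1), (RIGHT_OF[a], 0.1)]:
        dist[move(d)] = dist.get(move(d), 0.0) + p
    return dist
```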

SLIDE 98–100

Value of plans

The utility of executing a plan p from state s is given by:

v_p(s) = E[ Σ_{t=0}^{∞} γ^t r(S_t) ]

where S_t is a random variable and the expectation is with respect to the probability distribution over state sequences determined by s and p.

SLIDE 101–103

Value of plans

Calculate the utility of the sequences you can actually perform, times the probability of reaching them. Add these numbers. Forget about the rest.

SLIDE 104–116

Value of plans

For instance, the plan [Up, Up] can generate the sequences:

([1, 1], [2, 1], [3, 1]) with probability 0.8^2
([1, 1], [2, 1], [2, 1]) with probability 2 × 0.8 × 0.1 (collisions)
([1, 1], [1, 2], [1, 2]) with probability 0.1 × 0.8
([1, 1], [1, 1], [1, 1]) with probability 0.1^2
([1, 1], [1, 1], [2, 1]) with probability 0.1 × 0.8
([1, 1], [1, 1], [1, 2]) with probability 0.1^2
([1, 1], [1, 2], [1, 1]) with probability 0.1^2
([1, 1], [1, 2], [1, 3]) with probability 0.1^2

for a total of nine sequences.

SLIDE 117–118

Value of plans

Adding utilities, weighted by probabilities, the expected utility comes out at −0.08. To be expected: no matter how we proceed, we are making two steps and at each step getting −0.04 of reward.
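
The same number falls out of a brute-force enumeration. A sketch, with the usual assumed layout; the accounting convention here (a −0.04 reward collected at each of the two states entered) is chosen to match the slide's −0.08:

```python
from itertools import product

# Enumerate every slip pattern of a short plan and average the sequence
# utilities, weighted by probability. Layout assumed as before; no terminal
# square is reachable within two steps of (1,1), so terminals are ignored.
WALL = (2, 2)
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
LEFT_OF = {'Up': 'Left', 'Left': 'Down', 'Down': 'Right', 'Right': 'Up'}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def step(s, d):
    x, y = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    return s if (x, y) == WALL or not (1 <= x <= 4 and 1 <= y <= 3) else (x, y)

def expected_utility(plan, start=(1, 1), r=-0.04, gamma=1.0):
    total = 0.0
    # each action goes straight (0.8), slips left (0.1), or slips right (0.1)
    outcomes = [(lambda a: a, 0.8), (lambda a: LEFT_OF[a], 0.1),
                (lambda a: RIGHT_OF[a], 0.1)]
    for slips in product(outcomes, repeat=len(plan)):
        s, prob, utility = start, 1.0, 0.0
        for t, (a, (direction_of, p)) in enumerate(zip(plan, slips)):
            s = step(s, direction_of(a))
            prob *= p
            utility += gamma**(t + 1) * r  # -0.04 collected at each state entered
        total += prob * utility
    return total

print(round(expected_utility(['Up', 'Up']), 2))  # -0.08
```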

SLIDE 119–123

Plans vs Policies

We have looked at a finite sequence of actions. But why should the agent stop after, say, five steps, if she can reach the terminal states in a few steps?

The intuitively "best" course of action is not getting us there in 2/3 of the cases, even if we count the unwanted trajectories. Can we do better?

The idea is that we don't only care about specifying a sequence of moves; we need to think of what to do in each situation. A policy is a specification of moves at each decision point.

SLIDE 124

A policy

SLIDE 125–128

Expected utility of a policy

The expected utility (or value) of policy π, from state s, is:

v_π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t) ]

The probability distribution over the state sequences is induced by the policy π, the initial state s and the transition model for the environment.

We want the optimal policy:

π*_s = argmax_π v_π(s)
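
Since v_π(s) is just an expectation, it can be estimated by simulation. Below is a Monte Carlo sketch under the usual assumed layout; the example policy, episode count, horizon, and γ are illustrative choices, not the slides':

```python
import random

# Monte Carlo sketch of v_pi(s): simulate the policy many times and average
# the discounted reward sums. Layout assumed from Russell & Norvig; a policy
# is any dict mapping states to actions.
WALL, W, H = (2, 2), 4, 3
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
MOVES = {'Up': (0, 1), 'Down': (0, -1), 'Left': (-1, 0), 'Right': (1, 0)}
LEFT_OF = {'Up': 'Left', 'Left': 'Down', 'Down': 'Right', 'Right': 'Up'}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def step(s, d):
    x, y = s[0] + MOVES[d][0], s[1] + MOVES[d][1]
    return s if (x, y) == WALL or not (1 <= x <= W and 1 <= y <= H) else (x, y)

def sample_next(s, a):
    d = random.choices([a, LEFT_OF[a], RIGHT_OF[a]], weights=[0.8, 0.1, 0.1])[0]
    return step(s, d)

def value_estimate(policy, s, gamma=0.99, episodes=20_000, horizon=200):
    total = 0.0
    for _ in range(episodes):
        state, ret = s, 0.0
        for t in range(horizon):
            if state in TERMINALS:            # absorbing: collect and stop
                ret += gamma**t * TERMINALS[state]
                break
            ret += gamma**t * (-0.04)
            state = sample_next(state, policy[state])
        total += ret
    return total / episodes

# A hypothetical policy: head Up, then Right along the top row.
policy = {(x, y): ('Right' if y == 3 else 'Up') for x in range(1, 5)
          for y in range(1, 4) if (x, y) != (2, 2)}
print(round(value_estimate(policy, (1, 1)), 2))
```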

SLIDE 129–130

A remarkable fact

Theorem. With discounted rewards and infinite horizons, π*_s = π*_{s'} for each s' ∈ S.

Idea: take π*_a and π*_b. If they both reach a state c then, because they are both optimal, there is no reason why they should disagree from c onwards: π*_c is identical for both. But then they behave the same at all states!

SLIDE 131

Optimal policies

Figure: optimal policy when the state penalty R(s) is −0.04.

SLIDE 132–136

Risk and reward

SLIDE 137

To be continued

Next Tuesday we are going to finish the slides on MDPs.