SLIDE 1

Goals and Preferences

Alice . . . went on: “Would you tell me, please, which way I ought to go from here?” “That depends a good deal on where you want to get to,” said the Cat. “I don’t much care where —” said Alice. “Then it doesn’t matter which way you go,” said the Cat.

Lewis Carroll (1832–1898), Alice’s Adventures in Wonderland, 1865, Chapter 6.

© D. Poole and A. Mackworth 2010

SLIDE 2

Learning Objectives

At the end of the class you should be able to:
• justify the use and semantics of utility
• estimate the utility of an outcome
• build a decision network for a domain
• compute the optimal policy of a decision network

SLIDE 3

Preferences

• Actions result in outcomes.
• Agents have preferences over outcomes.
• A rational agent will do the action that has the best outcome for them.
• Sometimes agents don’t know the outcomes of the actions, but they still need to compare actions.
• Agents have to act. (Doing nothing is (often) an action.)

SLIDE 4

Preferences Over Outcomes

If o1 and o2 are outcomes:

• o1 ⪰ o2 means o1 is at least as desirable as o2.
• o1 ∼ o2 means o1 ⪰ o2 and o2 ⪰ o1.
• o1 ≻ o2 means o1 ⪰ o2 and not o2 ⪰ o1.

SLIDE 5

Lotteries

An agent may not know the outcomes of their actions, but only have a probability distribution over the outcomes. A lottery is a probability distribution over outcomes. It is written

[p1 : o1, p2 : o2, . . . , pk : ok]

where the oi are outcomes and pi ≥ 0 such that ∑i pi = 1. The lottery specifies that outcome oi occurs with probability pi. When we talk about outcomes, we will include lotteries.

SLIDE 8

Properties of Preferences

Completeness: Agents have to act, so they must have preferences:

∀o1 ∀o2 (o1 ⪰ o2 or o2 ⪰ o1)

Transitivity: Preferences must be transitive:

if o1 ⪰ o2 and o2 ≻ o3 then o1 ≻ o3

(and similarly for other mixtures of ≻ and ⪰). Rationale: otherwise o1 ⪰ o2, o2 ≻ o3, and o3 ⪰ o1. An agent that is prepared to pay to get o2 instead of o3, is happy to have o1 instead of o2, and is happy to have o3 instead of o1 can be exploited as a money pump.

SLIDE 9

Properties of Preferences (cont.)

Monotonicity: An agent prefers a larger chance of getting a better outcome than a smaller chance: If o1 ≻ o2 and p > q then [p : o1, 1 − p : o2] ≻ [q : o1, 1 − q : o2]

SLIDE 10

Consequence of axioms

Suppose o1 ≻ o2 and o2 ≻ o3. Consider whether the agent would prefer

◮ o2
◮ the lottery [p : o1, 1 − p : o3]

for different values of p ∈ [0, 1]. Plot which one is preferred as a function of p.

[Figure: a p-axis from 0 to 1, to be marked with the region where o2 is preferred and the region where the lottery is preferred.]

SLIDE 11

Properties of Preferences (cont.)

Continuity: Suppose o1 ≻ o2 and o2 ≻ o3. Then there exists p ∈ [0, 1] such that

o2 ∼ [p : o1, 1 − p : o3]

SLIDE 12

Properties of Preferences (cont.)

Decomposability: (“no fun in gambling”). An agent is indifferent between lotteries that have the same probabilities over the same outcomes. This includes lotteries over lotteries. For example:

[p : o1, 1 − p : [q : o2, 1 − q : o3]] ∼ [p : o1, (1 − p)q : o2, (1 − p)(1 − q) : o3]
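
Since probabilities multiply through nested lotteries, decomposability is easy to check numerically once outcomes carry utilities (as the theorem a few slides ahead justifies). A minimal sketch in Python, with illustrative probabilities and utility values that are not from the slides:

```python
# Check decomposability: a lottery over lotteries equals its flattened form.
p, q = 0.3, 0.6
u = {"o1": 1.0, "o2": 0.4, "o3": 0.0}   # illustrative utilities

def expected_utility(lottery):
    """lottery: list of (probability, outcome) pairs."""
    return sum(pi * u[oi] for pi, oi in lottery)

inner = [(q, "o2"), (1 - q, "o3")]
# Compound lottery [p : o1, 1-p : inner], evaluated in two stages:
compound = p * u["o1"] + (1 - p) * expected_utility(inner)
# Flattened lottery [p : o1, (1-p)q : o2, (1-p)(1-q) : o3]:
flat = expected_utility([(p, "o1"),
                         ((1 - p) * q, "o2"),
                         ((1 - p) * (1 - q), "o3")])
assert abs(compound - flat) < 1e-12     # same value either way
```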

SLIDE 13

Properties of Preferences (cont.)

Substitutability: if o1 ∼ o2 then the agent is indifferent between lotteries that only differ by o1 and o2: [p : o1, 1 − p : o3] ∼ [p : o2, 1 − p : o3]

SLIDE 14

Alternative Axiom for Substitutability

Substitutability: if o1 ⪰ o2 then the agent weakly prefers lotteries that contain o1 instead of o2, everything else being equal. That is, for any number p and outcome o3:

[p : o1, (1 − p) : o3] ⪰ [p : o2, (1 − p) : o3]

SLIDE 16

What we would like

We would like a measure of preference that can be combined with probabilities, so that

value([p : o1, 1 − p : o2]) = p × value(o1) + (1 − p) × value(o2)

Money does not act like this. What would you prefer: $1,000,000 or [0.5 : $0, 0.5 : $2,000,000]? It may seem that preferences are too complex and multi-faceted to be represented by single numbers.

SLIDE 17

Theorem

If preferences follow the preceding properties, then preferences can be measured by a function

utility : outcomes → [0, 1]

such that:
• o1 ⪰ o2 if and only if utility(o1) ≥ utility(o2).
• Utilities are linear with probabilities:

utility([p1 : o1, p2 : o2, . . . , pk : ok]) = ∑_{i=1}^{k} pi × utility(oi)
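
The linearity property is straightforward to operationalize. A minimal sketch in Python (the outcome names and utility values are invented for illustration), applied to the money question from the previous slide:

```python
# Expected utility of a lottery [p1 : o1, ..., pk : ok].
def lottery_utility(lottery, utility):
    """lottery: list of (pi, oi) pairs, pi >= 0, summing to 1."""
    assert abs(sum(p for p, _ in lottery) - 1.0) < 1e-9
    return sum(p * utility[o] for p, o in lottery)

utility = {"$0": 0.0, "$1m": 0.95, "$2m": 1.0}   # a risk-averse assignment
sure = lottery_utility([(1.0, "$1m")], utility)                  # 0.95
gamble = lottery_utility([(0.5, "$0"), (0.5, "$2m")], utility)   # 0.5
print(sure > gamble)   # True: this agent takes the sure million
```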

SLIDE 20

Proof

If all outcomes are equally preferred, set utility(oi) = 0 for all outcomes oi. Otherwise, suppose the best outcome is best and the worst outcome is worst. For any outcome oi, define utility(oi) to be the number ui such that

oi ∼ [ui : best, 1 − ui : worst]

This exists by the Continuity property.

SLIDE 23

Proof (cont.)

Suppose o1 ⪰ o2 and utility(oi) = ui. Then, by Substitutability,

[u1 : best, 1 − u1 : worst] ⪰ [u2 : best, 1 − u2 : worst]

which, by completeness and monotonicity, implies u1 ≥ u2.

SLIDE 24

Proof (cont.)

Suppose p = utility([p1 : o1, p2 : o2, . . . , pk : ok]) and utility(oi) = ui. We know:

oi ∼ [ui : best, 1 − ui : worst]

By substitutability, we can replace each oi by [ui : best, 1 − ui : worst], so

p = utility([p1 : [u1 : best, 1 − u1 : worst], . . . , pk : [uk : best, 1 − uk : worst]])

SLIDE 25

Proof (cont.)

By decomposability, this is equivalent to:

p = utility([p1u1 + · · · + pkuk : best, p1(1 − u1) + · · · + pk(1 − uk) : worst])

Thus, by the definition of utility,

p = p1 × u1 + · · · + pk × uk

SLIDE 26

Utility as a function of money

[Figure: utility as a function of money, from $0 to $2,000,000, with utility scaled 0 to 1, showing three curves: risk averse (concave), risk neutral (linear), and risk seeking (convex).]

SLIDE 27

Possible utility as a function of money

Someone who really wants a toy worth $30, but who would also like one worth $20:

[Figure: utility as a function of dollars from $10 to $100, utility scaled 0 to 1; the curve rises sharply around $20 and again around $30, and is nearly flat elsewhere.]

SLIDE 28

Factored Representation of Utility

Suppose the outcomes can be described in terms of features X1, . . . , Xn. An additive utility is one that can be decomposed into a sum of factors:

u(X1, . . . , Xn) = f1(X1) + · · · + fn(Xn)

This assumes additive independence, a strong assumption: the contribution of each feature doesn’t depend on the other features. There are many ways to represent the same utility: a number can be added to one factor as long as it is subtracted from others.

SLIDE 30

Additive Utility

An additive utility has a canonical representation:

u(X1, . . . , Xn) = w1 × u1(X1) + · · · + wn × un(Xn)

If besti is the best value of Xi, ui(Xi = besti) = 1. If worsti is the worst value of Xi, ui(Xi = worsti) = 0. The wi are weights with ∑i wi = 1. The weights reflect the relative importance of features. We can determine weights by comparing outcomes:

w1 = u(best1, x2, . . . , xn) − u(worst1, x2, . . . , xn)

for any values x2, . . . , xn of X2, . . . , Xn.
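
A small sketch of this representation in Python (the two features and all numbers are invented for illustration): each feature gets a local utility scaled to [0, 1], and a weight can be recovered by toggling its feature between best and worst values while holding the others fixed.

```python
# Canonical additive utility over two features, weights summing to 1.
w = {"price": 0.7, "location": 0.3}            # illustrative weights
u_local = {                                     # local utilities in [0, 1]
    "price":    {"cheap": 1.0, "mid": 0.5, "expensive": 0.0},
    "location": {"downtown": 1.0, "suburb": 0.0},
}

def u(price, location):
    return (w["price"] * u_local["price"][price]
            + w["location"] * u_local["location"][location])

# Recover w_price: compare best vs. worst price, holding location fixed.
w_price = u("cheap", "suburb") - u("expensive", "suburb")
print(w_price)   # 0.7, whichever location we held fixed
```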

SLIDE 32

General Setup for Additive Utility

Suppose there are:
• multiple users
• multiple alternatives to choose among, e.g., hotel1, . . .
• multiple criteria upon which to judge, e.g., rate, location

Utility is a function of users and alternatives. fact(crit, alt) is the fact about the domain value of criterion crit for alternative alt; e.g., fact(rate, hotel1) is the room rate for hotel1, which is $125 per night. score(val, user, crit) gives the score of domain value val for user on criterion crit. Then

utility(user, alt) = ∑_crit weight(user, crit) × score(fact(crit, alt), user, crit)

for user user, alternative alt, and criteria crit.
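
A minimal sketch of this setup in Python. The hotels, weights, and scoring functions are assumptions made up for illustration (only the $125 rate comes from the text above):

```python
# utility(user, alt) = sum over crit of weight(user, crit) * score(fact(crit, alt), user, crit)
fact = {("rate", "hotel1"): 125, ("rate", "hotel2"): 200,
        ("location", "hotel1"): "suburb", ("location", "hotel2"): "downtown"}
weight = {("sam", "rate"): 0.8, ("sam", "location"): 0.2}   # Sam mostly cares about price

def score(val, user, crit):
    if crit == "rate":            # cheaper is better; map $100-$300 onto [1, 0]
        return max(0.0, min(1.0, (300 - val) / 200))
    return 1.0 if val == "downtown" else 0.3

def utility(user, alt, criteria=("rate", "location")):
    return sum(weight[(user, c)] * score(fact[(c, alt)], user, c) for c in criteria)

print(utility("sam", "hotel1"), utility("sam", "hotel2"))   # ~0.76 vs ~0.60: hotel1 wins
```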

SLIDE 35

Complements and Substitutes

Often additive independence is not a good assumption. Values x1 of feature X1 and x2 of feature X2 are complements if having both is better than the sum of the two. Values x1 of feature X1 and x2 of feature X2 are substitutes if having both is worse than the sum of the two.

Example (substitutes): on a holiday,
◮ an excursion for 6 hours North on day 3
◮ an excursion for 6 hours South on day 3

Example (complements): on a holiday,
◮ a trip to a location 3 hours North on day 3
◮ the return trip for the same day

SLIDE 36

Generalized Additive Utility

A generalized additive utility can be written as a sum of factors:

u(X1, . . . , Xn) = f1(X1) + · · · + fk(Xk)

where each Xi ⊆ {X1, . . . , Xn} is a set of features. It can represent complements and substitutes, but an intuitive canonical representation is difficult to find.

SLIDE 37

Utility and time

Would you prefer $1000 today or $1000 next year? What price would you pay now to have an eternity of happiness? How can you trade off pleasures today with pleasures in the future?

SLIDE 39

Pascal’s Wager (1670)

Decide whether to believe in God.

[Decision network: a decision node Believe in God and a random variable God Exists, both parents of the Utility node.]

SLIDE 40

Utility and time

How would you compare the following sequences of rewards (per week)?
A: $1000000, $0, $0, $0, $0, $0, . . .
B: $1000, $1000, $1000, $1000, $1000, . . .
C: $1000, $0, $0, $0, $0, . . .
D: $1, $1, $1, $1, $1, . . .
E: $1, $2, $3, $4, $5, . . .

SLIDE 42

Rewards and Values

Suppose the agent receives a sequence of rewards r1, r2, r3, r4, . . . in time. What utility should be assigned? The “return” or “value” can be defined as:
• total reward: V = ∑_{i=1}^{∞} ri
• average reward: V = lim_{n→∞} (r1 + · · · + rn)/n

SLIDE 43

Average vs Accumulated Rewards

[Flowchart: choosing between average and accumulated reward, based on whether the agent goes on forever and whether it can get stuck in “absorbing” state(s) with zero reward.]

SLIDE 44

Rewards and Values

Suppose the agent receives a sequence of rewards r1, r2, r3, r4, . . . in time. The discounted return is

V = r1 + γr2 + γ²r3 + γ³r4 + · · ·

where γ is the discount factor, 0 ≤ γ ≤ 1.

SLIDE 49

Properties of the Discounted Rewards

The discounted return for rewards r1, r2, r3, r4, . . . is

V = r1 + γr2 + γ²r3 + γ³r4 + · · · = r1 + γ(r2 + γ(r3 + γ(r4 + . . . )))

If Vt is the value obtained from time step t, then

Vt = rt + γVt+1

How is the infinite future valued compared to immediate rewards? Since 1 + γ + γ² + γ³ + · · · = 1/(1 − γ),

(minimum reward)/(1 − γ) ≤ Vt ≤ (maximum reward)/(1 − γ)

We can approximate V with the first k terms, with error

V − (r1 + γr2 + · · · + γ^(k−1) rk) = γ^k Vk+1
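
These identities can be checked numerically. A small Python sketch with an arbitrary reward sequence (chosen only for illustration) confirms the recursion Vt = rt + γVt+1 and the truncation-error formula:

```python
# Discounted return, its recursion, and the error of a k-term approximation.
gamma = 0.9
rewards = [3.0, -1.0, 2.0, 2.0, 2.0] + [1.0] * 200   # long enough to look infinite

def discounted_return(rs, gamma):
    v = 0.0
    for r in reversed(rs):          # V_t = r_t + gamma * V_{t+1}, back to front
        v = r + gamma * v
    return v

V1 = discounted_return(rewards, gamma)
V2 = discounted_return(rewards[1:], gamma)
assert abs(V1 - (rewards[0] + gamma * V2)) < 1e-9     # the recursion holds

k = 3
head = sum(gamma**i * rewards[i] for i in range(k))   # first k terms
tail = discounted_return(rewards[k:], gamma)          # V_{k+1}
assert abs((V1 - head) - gamma**k * tail) < 1e-9      # error = gamma^k * V_{k+1}
```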

SLIDE 53

Allais Paradox (1953)

What would you prefer:
A: $1m — one million dollars
B: the lottery [0.10 : $2.5m, 0.89 : $1m, 0.01 : $0]

What would you prefer:
C: the lottery [0.11 : $1m, 0.89 : $0]
D: the lottery [0.10 : $2.5m, 0.90 : $0]

It is inconsistent with the axioms of preferences to have A ≻ B and D ≻ C. To see why, write all four as lotteries with a common component X:
A, C: lottery [0.11 : $1m, 0.89 : X]
B, D: lottery [0.10 : $2.5m, 0.01 : $0, 0.89 : X]
X = $1m gives A and B; X = $0 gives C and D.
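
The common-component argument can be verified mechanically: for any utility assignment, EU(A) − EU(B) and EU(C) − EU(D) are the same expression, 0.11 u($1m) − 0.10 u($2.5m) − 0.01 u($0). A short Python check over random utility assignments:

```python
# Allais check: A beats B in expected utility exactly when C beats D.
import random

for _ in range(1000):
    u0, u1m, u25m = sorted(random.random() for _ in range(3))   # u($0) <= u($1m) <= u($2.5m)
    EU_A = u1m
    EU_B = 0.10 * u25m + 0.89 * u1m + 0.01 * u0
    EU_C = 0.11 * u1m + 0.89 * u0
    EU_D = 0.10 * u25m + 0.90 * u0
    assert abs((EU_A - EU_B) - (EU_C - EU_D)) < 1e-9
# So no expected-utility maximizer can have both A ≻ B and D ≻ C.
```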

SLIDE 56

Framing Effects [Tversky and Kahneman]

A disease is expected to kill 600 people. Two alternative programs have been proposed:
Program A: 200 people will be saved.
Program B: with probability 1/3, 600 people will be saved; with probability 2/3, no one will be saved.
Which program would you favor?

The same situation, reframed:
Program C: 400 people will die.
Program D: with probability 1/3, no one will die; with probability 2/3, 600 will die.
Which program would you favor?

Tversky and Kahneman: 72% chose A over B; 22% chose C over D.

SLIDE 57

Prospect Theory

[Figure: the prospect-theory value curve: psychological value as a function of dollar gains and losses; concave for gains, convex and steeper for losses.]

In mixed gambles, loss aversion causes extreme risk-averse choices. In bad choices, diminishing sensitivity causes risk seeking.

SLIDE 59

Reference Points

Consider Anthony and Betty:
• Anthony’s current wealth is $1 million.
• Betty’s current wealth is $4 million.
They are both offered the choice between a gamble and a sure thing:
• Gamble: equal chance to end up owning $1 million or $4 million.
• Sure thing: own $2 million.
What does expected utility theory predict? What does prospect theory predict?

[From D. Kahneman, Thinking, Fast and Slow, 2011, pp. 275–276.]

SLIDE 62

Framing Effects

What do you think of Alan and Ben?
Alan: intelligent — industrious — impulsive — critical — stubborn — envious
Ben: envious — stubborn — critical — impulsive — industrious — intelligent

[From D. Kahneman, Thinking, Fast and Slow, 2011, p. 82.]

SLIDE 64

Framing Effects

Suppose you had bought theatre tickets for $50. When you got to the theatre, you found you had lost the tickets. You have your credit card and can buy equivalent tickets for $50. Do you buy the replacement tickets on your credit card?

Now suppose you had $50 in your pocket to buy tickets. When you got to the theatre, you found you had lost the $50. You have your credit card and can buy the tickets for $50. Do you buy the tickets on your credit card?

[From R.M. Dawes, Rational Choice in an Uncertain World, 1988.]

SLIDE 67

The Ellsberg Paradox

Two bags:
Bag 1: 40 white chips, 30 yellow chips, 30 green chips.
Bag 2: 40 white chips, 60 chips that are yellow or green, in unknown proportion.
What do you prefer:
A: receive $1m if a white or yellow chip is drawn from bag 1
B: receive $1m if a white or yellow chip is drawn from bag 2
C: receive $1m if a white or green chip is drawn from bag 2
What about D: the lottery [0.5 : B, 0.5 : C]? Note that A and D give the same chance of winning, no matter what the proportion in bag 2.

SLIDE 72
St. Petersburg Paradox

What if there is no “best” outcome? Are utilities unbounded? Suppose utilities are unbounded. Then for any outcome oi there is an outcome oi+1 such that u(oi+1) > 2u(oi). Would the agent prefer o1 or the lottery [0.5 : o2, 0.5 : 0], where 0 is the worst outcome? Is it rational to gamble o1 on a coin toss to get o2? Is it rational to gamble o2 on a coin toss to get o3? Is it rational to gamble o3 on a coin toss to get o4? What will eventually happen?

SLIDE 74

Predictor Paradox

Two boxes:
Box 1: contains $10,000.
Box 2: contains either $0 or $1m.
You can either choose both boxes or just box 2. The “predictor” has put $1m in box 2 if he thinks you will take just box 2, and $0 in box 2 if he thinks you will take both. The predictor has been correct in previous predictions. Do you take both boxes or just box 2?

SLIDE 75

Making Decisions Under Uncertainty

What an agent should do depends on:
• The agent’s ability — what options are available to it.
• The agent’s beliefs — the ways the world could be, given the agent’s knowledge. Sensing updates the agent’s beliefs.
• The agent’s preferences — what the agent wants, and its tradeoffs when there are risks.
Decision theory specifies how to trade off the desirability and probabilities of the possible outcomes for competing actions.

SLIDE 76

Decision Variables

Decision variables are like random variables that an agent gets to choose a value for. A possible world specifies a value for each decision variable and each random variable. For each assignment of values to all decision variables, the measures of the worlds satisfying that assignment sum to 1. The probability of a proposition is undefined unless the agent conditions on the values of all decision variables.

SLIDE 77

Decision Tree for Delivery Robot

The robot can choose to wear pads to protect itself or not. The robot can choose to go the short way past the stairs or a long way that reduces the chance of an accident. There is one random variable: whether there is an accident.

[Decision tree over the eight possible worlds:]
wear pads, short way, accident → w0: moderate damage
wear pads, short way, no accident → w1: quick, extra weight
wear pads, long way, accident → w2: moderate damage
wear pads, long way, no accident → w3: slow, extra weight
don’t wear pads, short way, accident → w4: severe damage
don’t wear pads, short way, no accident → w5: quick, no weight
don’t wear pads, long way, accident → w6: severe damage
don’t wear pads, long way, no accident → w7: slow, no weight

SLIDE 78

Expected Values

The expected value of a function of possible worlds is its average value, weighting possible worlds by their probability. Suppose f(ω) is the value of function f on world ω.

◮ The expected value of f is

E(f) = ∑_{ω∈Ω} P(ω) × f(ω)

◮ The conditional expected value of f given e is

E(f | e) = ∑_{ω⊨e} P(ω | e) × f(ω)

SLIDE 79

Single decisions

Given a single decision variable D, the agent can choose D = di for any di ∈ dom(D). The expected utility of decision D = di is E(u | D = di), where u(ω) is the utility of world ω. An optimal single decision is a decision D = dmax whose expected utility is maximal:

E(u | D = dmax) = max_{di ∈ dom(D)} E(u | D = di)

SLIDE 80

Single-stage decision networks

Extend belief networks with:
• Decision nodes, which the agent chooses the value for. The domain is the set of possible actions. Drawn as a rectangle.
• A utility node, whose parents are the variables on which the utility depends. Drawn as a diamond.

[Network: decision nodes Which Way and Wear Pads; random variable Accident with parent Which Way; Utility with parents Which Way, Accident, and Wear Pads.]

This shows explicitly which nodes affect whether there is an accident.

SLIDE 83

Finding an optimal decision

Suppose the random variables are X1, . . . , Xn, and the utility depends on Xi1, . . . , Xik. Then

E(u | D) = ∑_{X1,...,Xn} P(X1, . . . , Xn | D) × u(Xi1, . . . , Xik)
         = ∑_{X1,...,Xn} ∏_{i=1}^{n} P(Xi | parents(Xi)) × u(Xi1, . . . , Xik)

To find an optimal decision:
◮ Create a factor for each conditional probability and for the utility.
◮ Sum out all of the random variables.
◮ This creates a factor on D that gives the expected utility for each value of D.
◮ Choose the value of D with the maximum value in the factor.

SLIDE 84

Example Initial Factors

P(Accident | Which Way):

Which Way   Accident   Value
long        true       0.01
long        false      0.99
short       true       0.2
short       false      0.8

Utility(Which Way, Accident, Wear Pads):

Which Way   Accident   Wear Pads   Value
long        true       true        30
long        true       false       0
long        false      true        75
long        false      false       80
short       true       true        35
short       true       false       3
short       false      true        95
short       false      false       100
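
Summing out Accident and then maximizing can be scripted directly from these two factors. A minimal sketch in Python (the variable names are mine); its output matches the table on the next slide:

```python
# f(way, pads) = sum over acc of P(acc | way) * u(way, acc, pads)
P_acc = {("long", True): 0.01, ("long", False): 0.99,
         ("short", True): 0.2, ("short", False): 0.8}
u = {("long", True, True): 30,  ("long", True, False): 0,
     ("long", False, True): 75, ("long", False, False): 80,
     ("short", True, True): 35, ("short", True, False): 3,
     ("short", False, True): 95, ("short", False, False): 100}

eu = {(way, pads): sum(P_acc[(way, acc)] * u[(way, acc, pads)]
                       for acc in (True, False))
      for way in ("long", "short") for pads in (True, False)}

print(eu)                    # {('long', True): 74.55, ('long', False): 79.2,
                             #  ('short', True): 83.0, ('short', False): 80.6}
print(max(eu, key=eu.get))   # ('short', True): go the short way, wearing pads
```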

SLIDE 85

After summing out Accident

Which Way   Wear Pads   Value
long        true        74.55
long        false       79.2
short       true        83.0
short       false       80.6

SLIDE 86

Decision Networks

• flat or modular or hierarchical
• explicit states or features or individuals and relations
• static or finite stage or indefinite stage or infinite stage
• fully observable or partially observable
• deterministic or stochastic dynamics
• goals or complex preferences
• single agent or multiple agents
• knowledge is given or knowledge is learned
• perfect rationality or bounded rationality

SLIDE 87

Sequential Decisions

An intelligent agent doesn’t carry out a multi-step plan ignoring information it receives between actions. A more typical scenario is that the agent observes, acts, observes, acts, . . . Subsequent actions can depend on what is observed, and what is observed depends on previous actions. Often the sole reason for carrying out an action is to provide information for future actions; for example, diagnostic tests or spying.

SLIDE 88

Sequential decision problems

A sequential decision problem consists of a sequence of decision variables D1, . . . , Dn. Each Di has an information set of variables parents(Di), whose value will be known at the time decision Di is made.

SLIDE 89

Decision Networks

A decision network is a graphical representation of a finite sequential decision problem, with 3 types of nodes:
• A random variable is drawn as an ellipse. Arcs into the node represent probabilistic dependence.
• A decision variable is drawn as a rectangle. Arcs into the node represent information available when the decision is made.
• A utility node is drawn as a diamond. Arcs into the node represent variables that the utility depends on.

SLIDE 90

Umbrella Decision Network

[Network: Weather is a random variable and the parent of Forecast; Umbrella is a decision node with parent Forecast; Utility has parents Weather and Umbrella.]

You don’t get to observe the weather when you have to decide whether to take your umbrella. You do get to observe the forecast.

SLIDE 91

Decision Network for the Alarm Problem

[Network: random variables Tampering, Fire, Alarm, Leaving, Report, Smoke, and SeeSmoke; decision nodes Check Smoke and Call; a single Utility node.]

SLIDE 92

No-forgetting

A no-forgetting decision network is a decision network where:
• The decision nodes are totally ordered. This is the order in which the actions will be taken.
• All decision nodes that come before Di are parents of decision node Di. Thus the agent remembers its previous actions.
• Any parent of a decision node is a parent of subsequent decision nodes. Thus the agent remembers its previous observations.

SLIDE 93

What should an agent do?

What an agent should do at any time depends on what it will do in the future. What an agent does in the future depends on what it did before.

SLIDE 94

Policies

A policy specifies what an agent should do under each circumstance. A policy is a sequence δ1, . . . , δn of decision functions δi : dom(parents(Di)) → dom(Di). This policy means that when the agent has observed O ∈ dom(parents(Di)), it will do δi(O).

SLIDE 95

Expected Utility of a Policy

Possible world ω satisfies policy δ, written ω ⊨ δ, if the world assigns the value to each decision node that the policy specifies. The expected utility of policy δ is

E(u | δ) = ∑_{ω⊨δ} u(ω) × P(ω)

An optimal policy is one with the highest expected utility.

SLIDE 97

Finding an optimal policy

Create a factor for each conditional probability table and a factor for the utility. Repeat:
◮ Sum out random variables that are not parents of a decision node.
◮ Select a variable D that appears only in one factor f, together with (some of) its parents.
◮ Eliminate D by maximizing. This returns:
   ◮ an optimal decision function for D: arg max_D f
   ◮ a new factor: max_D f
until there are no more decision nodes. Then sum out the remaining random variables and multiply the remaining factors: this gives the expected utility of an optimal policy.

SLIDE 98

Initial factors for the Umbrella Decision

P(Weather):

Weather   Value
norain    0.7
rain      0.3

P(Fcast | Weather):

Weather   Fcast    Value
norain    sunny    0.7
norain    cloudy   0.2
norain    rainy    0.1
rain      sunny    0.15
rain      cloudy   0.25
rain      rainy    0.6

Utility(Weather, Umb):

Weather   Umb     Value
norain    take    20
norain    leave   100
rain      take    70
rain      leave   0

SLIDE 99

Eliminating By Maximizing

f(Fcast, Umb), with Weather summed out:

Fcast    Umb     Value
sunny    take    12.95
sunny    leave   49.0
cloudy   take    8.05
cloudy   leave   14.0
rainy    take    14.0
rainy    leave   7.0

max_Umb f:

Fcast    Value
sunny    49.0
cloudy   14.0
rainy    14.0

arg max_Umb f:

Fcast    Umb
sunny    leave
cloudy   leave
rainy    take
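
The whole elimination can be scripted from the factors two slides back. A minimal sketch in Python (names are mine), reproducing the f, max, and arg max tables above:

```python
# Umbrella network: sum out Weather, then eliminate Umbrella by maximizing.
P_w = {"norain": 0.7, "rain": 0.3}
P_f = {("norain", "sunny"): 0.7, ("norain", "cloudy"): 0.2, ("norain", "rainy"): 0.1,
       ("rain", "sunny"): 0.15, ("rain", "cloudy"): 0.25, ("rain", "rainy"): 0.6}
u = {("norain", "take"): 20, ("norain", "leave"): 100,
     ("rain", "take"): 70, ("rain", "leave"): 0}
forecasts, choices = ("sunny", "cloudy", "rainy"), ("take", "leave")

f = {(fc, umb): sum(P_w[w] * P_f[(w, fc)] * u[(w, umb)] for w in P_w)
     for fc in forecasts for umb in choices}

policy = {fc: max(choices, key=lambda umb: f[(fc, umb)]) for fc in forecasts}
value = sum(max(f[(fc, umb)] for umb in choices) for fc in forecasts)
print(policy)   # {'sunny': 'leave', 'cloudy': 'leave', 'rainy': 'take'}
print(value)    # 77.0: expected utility of the optimal policy
```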

SLIDE 100

Exercise

[Network: random variables Disease, Symptoms, Test Result, and Outcome; decision nodes Test and Treatment; a Utility node.]

What are the factors? Which random variables get summed out first? Which decision variable is eliminated? What factor is created? Then what is eliminated (and how)? What factors are created after maximization?

SLIDE 105

Complexity of finding an optimal policy

Suppose decision D has k binary parents and b possible actions:
• there are 2^k assignments of values to the parents
• there are b^(2^k) different decision functions
If there are multiple decision nodes:
• the number of policies is the product of the numbers of decision functions
• the number of optimizations in the dynamic programming is the sum of the numbers of assignments of values to parents
The dynamic programming algorithm is much more efficient than searching through policy space.
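
For concreteness, a quick count in Python with invented sizes: two decisions, one with 2 binary parents and 3 actions, one with 3 binary parents and 2 actions.

```python
# Decision functions and policies vs. dynamic-programming optimizations.
decisions = [(2, 3), (3, 2)]    # (k binary parents, b actions), illustrative

policies, optimizations = 1, 0
for k, b in decisions:
    assignments = 2 ** k              # parent contexts for this decision
    policies *= b ** assignments      # decision functions for this decision
    optimizations += assignments      # one maximization per parent context

print(policies)        # 3**4 * 2**8 = 20736 policies to search naively
print(optimizations)   # only 4 + 8 = 12 maximizations in dynamic programming
```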

SLIDE 109

Value of Information

The value of information X for decision D is the utility of the network with an arc from X to D (plus no-forgetting arcs) minus the utility of the network without the arc. The value of information is always non-negative. It is positive only if the agent changes its action depending on X. The value of information provides a bound on how much an agent should be prepared to pay for a sensor: how much is a better weather forecast worth? We need to be careful when adding an arc would create a cycle; e.g., how much would it be worth knowing whether the fire truck will arrive quickly when deciding whether to call them?
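
In the umbrella network this is a one-line computation from the factor f computed earlier: with the Forecast → Umbrella arc the agent chooses per forecast; without it, it must fix one action for all forecasts. A small Python sketch reusing that table:

```python
# Value of the forecast information for the umbrella decision.
f = {("sunny", "take"): 12.95, ("sunny", "leave"): 49.0,
     ("cloudy", "take"): 8.05, ("cloudy", "leave"): 14.0,
     ("rainy", "take"): 14.0,  ("rainy", "leave"): 7.0}
forecasts, choices = ("sunny", "cloudy", "rainy"), ("take", "leave")

with_arc = sum(max(f[(fc, umb)] for umb in choices) for fc in forecasts)     # 77.0
without_arc = max(sum(f[(fc, umb)] for fc in forecasts) for umb in choices)  # 70.0
print(with_arc - without_arc)  # 7.0: never pay more than this for the forecast
```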

SLIDE 110

Value of Control

The value of control of a variable X is the value of the network when you make X a decision variable (and add no-forgetting arcs) minus the value of the network when X is a random variable. You need to be explicit about what information is available when you control X. If you control X without observing it, controlling X can be worse than observing X; e.g., controlling a thermometer. If you keep the parents the same, the value of control is always non-negative.

SLIDE 111

Agents as Processes

Agents carry out actions:
• forever: infinite horizon
• until some stopping criterion is met: indefinite horizon
• for a finite and fixed number of steps: finite horizon

SLIDE 112

Decision-theoretic Planning

What should an agent do when:
• it gets rewards (and punishments) and tries to maximize the rewards received;
• actions can be stochastic, so the outcome of an action can’t be fully predicted;
• there is a model that specifies the (probabilistic) outcomes of actions and the rewards;
• the world is fully observable?

SLIDE 113

Initial Assumptions

• flat or modular or hierarchical
• explicit states or features or individuals and relations
• static or finite stage or indefinite stage or infinite stage
• fully observable or partially observable
• deterministic or stochastic dynamics
• goals or complex preferences
• single agent or multiple agents
• knowledge is given or knowledge is learned
• perfect rationality or bounded rationality

SLIDE 126

World State

The world state is the information such that, if the agent knew the world state, no information about the past would be relevant to the future. This is the Markovian assumption. If Si is the state at time i and Ai is the action at time i:

P(St+1 | S0, A0, . . . , St, At) = P(St+1 | St, At)

P(s′ | s, a) is the probability that the agent will be in state s′ immediately after doing action a in state s. The dynamics is stationary if the distribution is the same for each time point.

SLIDE 127

Decision Processes

A Markov decision process augments a Markov chain with actions and values:

[Diagram: a chain of states S0, S1, S2, S3, with actions A0, A1, A2 influencing each transition and rewards R1, R2, R3 received along the way.]

SLIDE 128

Markov Decision Processes

An MDP consists of:
• a set S of states
• a set A of actions
• P(St+1 | St, At), which specifies the dynamics
• R(St, At, St+1), which specifies the reward at time t; R(s, a, s′) is the expected reward received when the agent is in state s, does action a, and ends up in state s′
• γ, the discount factor

SLIDE 129

Example: to exercise or not?

Each week Sam has to decide whether to exercise or not:
States: {fit, unfit}
Actions: {exercise, relax}

Dynamics:

State   Action     P(fit | State, Action)
fit     exercise   0.99
fit     relax      0.7
unfit   exercise   0.2
unfit   relax      0.0

Reward (does not depend on the resulting state):

State   Action     Reward
fit     exercise   8
fit     relax      10
unfit   exercise
unfit   relax      5

SLIDE 130

Example: Simple Grid World

[Figure: a 10×10 grid world. Four special squares carry rewards +10, +3, −5, and −10; the −1 values mark the penalty for crashing into a wall.]

SLIDE 131

Grid World Model

Actions: up, down, left, right. There are 100 states, corresponding to the positions of the robot. The robot goes in the commanded direction with probability 0.7, and in one of the other three directions with probability 0.1 each. If it crashes into an outside wall, it remains in its current position and gets a reward of −1. There are four special rewarding states; the agent gets the reward when leaving such a state.

SLIDE 132

Planning Horizons

The planning horizon is how far ahead the planner looks to make a decision.

If the robot gets flung to one of the corners at random after leaving a positive (+10 or +3) reward state:
◮ the process never halts
◮ infinite horizon

If the robot instead gets the +10 or +3 in the state and then stays there getting no reward (an absorbing state):
◮ the robot will eventually reach an absorbing state
◮ indefinite horizon

SLIDE 133

Information Availability

What information is available when the agent decides what to do?
• Fully-observable MDP: the agent gets to observe St when deciding on action At.
• Partially-observable MDP (POMDP): the agent has some noisy sensor of the state, and needs to remember its sensing and acting history.
[This lecture only considers fully-observable MDPs.]

SLIDE 134

Policies

A stationary policy is a function π : S → A. Given a state s, π(s) specifies the action that an agent following π will do. An optimal policy is one with maximum expected discounted reward. For a fully-observable MDP with stationary dynamics and rewards, and an infinite or indefinite horizon, there is always an optimal stationary policy.

SLIDE 136

Example: to exercise or not?

Each week Sam has to decide whether to exercise or not:
States: {fit, unfit}
Actions: {exercise, relax}
How many stationary policies are there? What are they? For the grid world with 100 states and 4 actions, how many stationary policies are there? (In general there are |A|^|S| stationary policies: 2² = 4 for Sam, 4¹⁰⁰ for the grid world.)

SLIDE 137

Value of a Policy

Given a policy π:
• Qπ(s, a), where a is an action and s is a state, is the expected value of doing a in state s and then following policy π.
• Vπ(s), where s is a state, is the expected value of following policy π in state s.

Qπ and Vπ can be defined mutually recursively:

Qπ(s, a) = ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γVπ(s′))
Vπ(s) = Qπ(s, π(s))

SLIDE 138

Value of the Optimal Policy

• Q∗(s, a), where a is an action and s is a state, is the expected value of doing a in state s and then following the optimal policy.
• V∗(s), where s is a state, is the expected value of following the optimal policy in state s.

Q∗ and V∗ can be defined mutually recursively:

Q∗(s, a) = ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γV∗(s′))
V∗(s) = max_a Q∗(s, a)
π∗(s) = arg max_a Q∗(s, a)

SLIDE 139

Value Iteration

Let Vk and Qk be the k-step lookahead value and Q functions. Idea: given an estimate Vi of the i-step lookahead value function, determine the (i + 1)-step lookahead value function. Set V0 arbitrarily, then compute Qi+1 and Vi+1 from Vi. This converges exponentially fast (in k) to the optimal value function; the error reduces proportionally to γ^k/(1 − γ).
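
A runnable sketch of value iteration on Sam's exercise MDP (Python). Two assumptions, since they are not fixed by the slides: γ = 0.9, and reward 0 for (unfit, exercise), the cell left blank in the earlier table.

```python
# Value iteration: V_{i+1}(s) = max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V_i(s') ]
gamma = 0.9                                   # assumed discount factor
states, actions = ("fit", "unfit"), ("exercise", "relax")
P_fit = {("fit", "exercise"): 0.99, ("fit", "relax"): 0.7,
         ("unfit", "exercise"): 0.2, ("unfit", "relax"): 0.0}
R = {("fit", "exercise"): 8, ("fit", "relax"): 10,
     ("unfit", "exercise"): 0,                # blank on the slide; 0 assumed
     ("unfit", "relax"): 5}

def q(s, a, V):
    p = P_fit[(s, a)]                         # reward doesn't depend on s'
    return R[(s, a)] + gamma * (p * V["fit"] + (1 - p) * V["unfit"])

V = {s: 0.0 for s in states}
for _ in range(200):                          # gamma**200 / (1 - gamma) is tiny
    V = {s: max(q(s, a, V) for a in actions) for s in states}

policy = {s: max(actions, key=lambda a: q(s, a, V)) for s in states}
print(V)        # approx {'fit': 77.5, 'unfit': 50.0}
print(policy)   # {'fit': 'exercise', 'unfit': 'relax'} under these assumptions
```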

SLIDE 140

Asynchronous Value Iteration

The agent doesn’t need to sweep through all the states, but can update the value functions for each state individually. This converges to the optimal value functions, if each state and action is visited infinitely often in the limit. It can either store V [s] or Q[s, a].

SLIDE 141

Asynchronous VI: storing V [s]

Repeat forever:
◮ Select a state s.
◮ V[s] ← max_a ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γV[s′])
SLIDE 142

Asynchronous VI: storing Q[s, a]

Repeat forever:
◮ Select a state s and an action a.
◮ Q[s, a] ← ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γ max_{a′} Q[s′, a′])
SLIDE 144

Policy Iteration

Set π0 arbitrarily, let i = 0.
Repeat:
◮ Evaluate Qπi(s, a).
◮ Let πi+1(s) = arg max_a Qπi(s, a).
◮ Set i = i + 1.
until πi = πi−1.

Evaluating Qπi(s, a) means finding a solution to a set of |S| × |A| linear equations with |S| × |A| unknowns. It can also be approximated iteratively.
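
A compact sketch of policy iteration on the same exercise MDP (Python, same assumptions as the value-iteration sketch: γ = 0.9 and reward 0 for (unfit, exercise)). With two states it is easy to evaluate a policy exactly by solving the 2×2 linear system for Vπ instead of the |S| × |A| system for Qπ:

```python
# Policy iteration: exact evaluation (2x2 linear solve), then greedy improvement.
gamma = 0.9
states, actions = ("fit", "unfit"), ("exercise", "relax")
P_fit = {("fit", "exercise"): 0.99, ("fit", "relax"): 0.7,
         ("unfit", "exercise"): 0.2, ("unfit", "relax"): 0.0}
R = {("fit", "exercise"): 8, ("fit", "relax"): 10,
     ("unfit", "exercise"): 0, ("unfit", "relax"): 5}    # 0 assumed, as before

def evaluate(pi):
    # Solve  V_f = r_f + g*(a*V_f + (1-a)*V_u),  V_u = r_u + g*(b*V_f + (1-b)*V_u)
    a, b = P_fit[("fit", pi["fit"])], P_fit[("unfit", pi["unfit"])]
    r_f, r_u = R[("fit", pi["fit"])], R[("unfit", pi["unfit"])]
    c1, c2 = 1 - gamma * a, gamma * (1 - a)     #  c1*V_f - c2*V_u = r_f
    c3, c4 = gamma * b, 1 - gamma * (1 - b)     # -c3*V_f + c4*V_u = r_u
    V_f = (r_f * c4 + c2 * r_u) / (c1 * c4 - c2 * c3)
    return {"fit": V_f, "unfit": (r_u + c3 * V_f) / c4}

pi = {s: "relax" for s in states}
while True:
    V = evaluate(pi)
    q = lambda s, a: R[(s, a)] + gamma * (P_fit[(s, a)] * V["fit"]
                                          + (1 - P_fit[(s, a)]) * V["unfit"])
    new_pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
    if new_pi == pi:
        break
    pi = new_pi
print(pi, V)   # same optimal policy as value iteration finds
```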

SLIDE 145

Modified Policy Iteration

Set π[s] arbitrarily.
Set Q[s, a] arbitrarily.
Repeat forever:
◮ Repeat for a while:
   ◮ Select a state s and an action a.
   ◮ Q[s, a] ← ∑_{s′} P(s′ | s, a) (R(s, a, s′) + γQ[s′, π[s′]])
◮ π[s] ← arg max_a Q[s, a]

SLIDE 147

Q, V, π, R

Q∗(s, a) = ∑_{s′} P(s′ | a, s) (R(s, a, s′) + γV∗(s′))
V∗(s) = max_a Q∗(s, a)
π∗(s) = arg max_a Q∗(s, a)

Let R(s, a) = ∑_{s′} P(s′ | a, s) R(s, a, s′). Then:

Q∗(s, a) = R(s, a) + γ ∑_{s′} P(s′ | a, s) V∗(s′)
