

slide-1
SLIDE 1

Decision Theory

Decision theory is about making choices

  • It has a normative aspect
  • what “rational” people should do
  • . . . and a descriptive aspect
  • what people do do

Not surprisingly, it’s been studied by economists, psychologists, and philosophers.

More recently, computer scientists have looked at it too:

  • How should we design robots that make reasonable decisions?
  • What about software agents acting on our behalf?
  • agents bidding for you on eBay
  • managed health care
  • Algorithmic issues in decision making

This course will focus on normative aspects, informed by a computer science perspective.

1

slide-2
SLIDE 2

Uncertain Prospects

Suppose you have to eat at a restaurant and your choices are:

  • chicken
  • quiche

Normally you prefer chicken to quiche, but . . . Now you’re uncertain as to whether the chicken has salmonella. You think it’s unlikely, but it’s possible.

  • Key point: you no longer know the outcome of your choice.

  • This is the common situation!

How do you model this, so you can make a sensible choice?

2

slide-3
SLIDE 3

States, Acts, and Outcomes

The standard formulation of decision problems involves:

  • a set S of states of the world
  • state: the way that the world could be (the chicken is infected or isn’t)
  • a set O of outcomes
  • outcome: what happens (you eat chicken and get sick)
  • a set A of acts
  • act: a function from states to outcomes

3

slide-4
SLIDE 4

One way of modeling the example:

  • two states:
  • s1: chicken is not infected
  • s2: chicken is infected
  • three outcomes:
  • o1: you eat quiche
  • o2: you eat chicken and don’t get sick
  • o3: you eat chicken and get sick
  • two acts:
  • a1: eat quiche
    ∗ a1(s1) = a1(s2) = o1
  • a2: eat chicken
    ∗ a2(s1) = o2
    ∗ a2(s2) = o3

This is often easiest to represent using a matrix, where the columns correspond to states, the rows correspond to acts, and the entries correspond to outcomes:

            s1                              s2
  a1        eat quiche                      eat quiche
  a2        eat chicken; don’t get sick     eat chicken; get sick
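To make the formulation concrete, here is a minimal Python sketch of the chicken/quiche example. The representation (states as strings, acts as dicts from states to outcomes) is an illustrative choice, not part of the formal definition.

```python
# A minimal sketch of the chicken/quiche decision problem.
# An act is a function from states to outcomes; here, a dict.

states = ["s1: not infected", "s2: infected"]

outcomes = {
    "o1": "you eat quiche",
    "o2": "you eat chicken and don't get sick",
    "o3": "you eat chicken and get sick",
}

acts = {
    "a1: eat quiche":  {"s1: not infected": "o1", "s2: infected": "o1"},
    "a2: eat chicken": {"s1: not infected": "o2", "s2: infected": "o3"},
}

# Print the matrix: rows are acts, columns are states, entries are outcomes.
for act, mapping in acts.items():
    row = [outcomes[mapping[s]] for s in states]
    print(f"{act:16} | " + " | ".join(row))
```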

4

slide-5
SLIDE 5

Specifying a Problem

Sometimes it’s pretty obvious what the states, acts, and outcomes should be; sometimes it’s not.

Problem 1: the state might not be detailed enough to make the act a function.

  • Even if the chicken is infected, you might not get sick.

Solution 1: Acts can return a probability distribution over outcomes:

  • If you eat the chicken in state s2 (the chicken is infected), then with probability 60% you get sick.

Solution 2: Put more detail into the state.

  • state s11: the chicken is infected and you have a weak stomach
  • state s12: the chicken is infected and you have a strong stomach

5

slide-6
SLIDE 6

Problem 2: Treating the act as a function may force you to identify two acts that should be different.

Example: Consider two possible acts:

  • carrying a red umbrella
  • carrying a blue umbrella

If the state just mentions what the weather will be (sunny, rainy, . . . ) and the outcome just involves whether you stay dry, these acts are the same.

  • An act is just a function from states to outcomes

Solution: If you think these acts are different, take a richer state space and outcome space.

6

slide-7
SLIDE 7

Problem 3: The choice of labels might matter.

Example: Suppose you’re a doctor and need to decide between two treatments for 1000 people. Consider the following outcomes:

  • Treatment 1 results in 400 people being dead
  • Treatment 2 results in 600 people being saved

Are they the same?

  • Most people don’t think so!

7

slide-8
SLIDE 8

Problem 4: The states must be independent of the acts.

Example: Should you bet on the American League or the National League in the All-Star game?

            AL wins    NL wins
  Bet AL    +$5        −$2
  Bet NL    −$2        +$3

But suppose you use a different choice of states:

            I win my bet    I lose my bet
  Bet AL    +$5             −$2
  Bet NL    +$3             −$2

It looks like betting AL is at least as good as betting NL, no matter what happens. So should you bet AL? What is wrong with this representation?

Example: Should the US build up its arms, or disarm?

            War     No war
  Arm       Dead    Status quo
  Disarm    Red     Improved society

8

slide-9
SLIDE 9

Problem 5: The actual outcome might not be among the outcomes you list! Similarly for states.

  • In 2002, the All-Star game was called before it ended, so it was a tie.
  • What are the states/outcomes if trying to decide whether to attack Iraq?

9

slide-10
SLIDE 10

Decision Rules

We want to be able to tell a computer what to do in all circumstances.

  • Assume the computer knows S, O, A
  • This is reasonable in limited domains, perhaps not in general.
  • Remember that the choice of S, O, and A may affect the possible decisions!
  • Moreover, assume that there is a utility function u mapping outcomes to real numbers.
  • You have a total preference order on outcomes!
  • There may or may not be a measure of likelihood (probability or something else) on S.

You want a decision rule: something that tells the computer what to do in all circumstances, as a function of these inputs. There are lots of decision rules out there.

10

slide-11
SLIDE 11

Maximin

This is a conservative rule:

  • Pick the act with the best worst case.
  • Maximize the minimum

Formally, given act a ∈ A, define worst_u(a) = min{u_a(s) : s ∈ S}.

  • worst_u(a) is the worst-case outcome for act a

Maximin rule: a ⪰ a′ iff worst_u(a) ≥ worst_u(a′).

        s1    s2    s3    s4
  a1     5     0∗    0∗    2
  a2    −1∗    4     3     7
  a3     6     4     4     1∗
  a4     5     6     4     3∗

(∗ marks the worst outcome for each act.) Thus, we get a4 ≻ a3 ≻ a1 ≻ a2.

But what if you thought s4 was much likelier than the other states?

11

slide-12
SLIDE 12

Maximax

This is a rule for optimists:

  • Choose the rule with the best case outcome:
  • Maximize the maximum

Formally, given act a ∈ A, define best_u(a) = max{u_a(s) : s ∈ S}.

  • best_u(a) is the best-case outcome for act a

Maximax rule: a ⪰ a′ iff best_u(a) ≥ best_u(a′).

        s1    s2    s3    s4
  a1     5∗    0     0     2
  a2    −1     4     3     7∗
  a3     6∗    4     4     1
  a4     5     6∗    4     3

(∗ marks the best outcome for each act.) Thus, we get a2 ≻ a4 ∼ a3 ≻ a1.

12

slide-13
SLIDE 13

Optimism-Pessimism Rule

Idea: weight the best case and the worst case according to how optimistic you are. Define

  opt^α_u(a) = α·best_u(a) + (1 − α)·worst_u(a).

  • if α = 1, get maximax
  • if α = 0, get maximin
  • in general, α measures how optimistic you are.

Rule: a ⪰ a′ iff opt^α_u(a) ≥ opt^α_u(a′).

This rule is strange if you think probabilistically:

  • worst_u(a) puts weight (probability) 1 on the state where a has the worst outcome.
  • This may be a different state for different acts!
  • More generally, opt^α_u puts weight α on the state where a has the best outcome, and weight 1 − α on the state where it has the worst outcome.
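A short sketch of maximin, maximax, and the optimism-pessimism rule on the running four-state example; the table values are the ones from the maximin slide, and the choice α = 0.7 is arbitrary.

```python
# Maximin, maximax, and the optimism-pessimism rule on the running example.
U = {  # U[act][state] = utility u_a(s)
    "a1": {"s1": 5, "s2": 0, "s3": 0, "s4": 2},
    "a2": {"s1": -1, "s2": 4, "s3": 3, "s4": 7},
    "a3": {"s1": 6, "s2": 4, "s3": 4, "s4": 1},
    "a4": {"s1": 5, "s2": 6, "s3": 4, "s4": 3},
}

def worst(act):             # worst_u(a) = min_s u_a(s)
    return min(U[act].values())

def best(act):              # best_u(a) = max_s u_a(s)
    return max(U[act].values())

def opt_alpha(act, alpha):  # alpha * best + (1 - alpha) * worst
    return alpha * best(act) + (1 - alpha) * worst(act)

print(sorted(U, key=worst, reverse=True))   # maximin: a4, a3, a1, a2
print(sorted(U, key=best, reverse=True))    # maximax: a2, then a3/a4 tied, then a1
print(sorted(U, key=lambda a: opt_alpha(a, 0.7), reverse=True))
```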

13

slide-14
SLIDE 14

Minimax Regret

Idea: minimize how much regret you would feel once you discovered the true state of the world.

  • The “I wish I would have done x” feeling

For each state s, let a_s be the act with the best outcome in s. Define

  regret_u(a, s) = u_{a_s}(s) − u_a(s)
  regret_u(a) = max_{s∈S} regret_u(a, s)

  • regret_u(a) is the maximum regret you could ever feel if you performed act a

Minimax regret rule: a ⪰ a′ iff regret_u(a) ≤ regret_u(a′)

  • minimize the maximum regret

14

slide-15
SLIDE 15

Example:

        s1    s2    s3    s4
  a1     5     0     0     2
  a2    −1     4     3     7∗
  a3     6∗    4     4∗    1
  a4     5     6∗    4∗    3

(∗ marks the best outcome in each state.)

  • a_{s1} = a3; u_{a_{s1}}(s1) = 6
  • a_{s2} = a4; u_{a_{s2}}(s2) = 6
  • a_{s3} = a3 (and a4); u_{a_{s3}}(s3) = 4
  • a_{s4} = a2; u_{a_{s4}}(s4) = 7
  • regret_u(a1) = max(6 − 5, 6 − 0, 4 − 0, 7 − 2) = 6
  • regret_u(a2) = max(6 − (−1), 6 − 4, 4 − 3, 7 − 7) = 7
  • regret_u(a3) = max(6 − 6, 6 − 4, 4 − 4, 7 − 1) = 6
  • regret_u(a4) = max(6 − 5, 6 − 6, 4 − 4, 7 − 3) = 4

We get a4 ≻ a1 ∼ a3 ≻ a2.
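The same calculation as a sketch in Python, reproducing the regret numbers above (same table as before):

```python
# Minimax regret on the same table: regret_u(a, s) = u_{a_s}(s) - u_a(s).
U = {
    "a1": {"s1": 5, "s2": 0, "s3": 0, "s4": 2},
    "a2": {"s1": -1, "s2": 4, "s3": 3, "s4": 7},
    "a3": {"s1": 6, "s2": 4, "s3": 4, "s4": 1},
    "a4": {"s1": 5, "s2": 6, "s3": 4, "s4": 3},
}
states = ["s1", "s2", "s3", "s4"]

# Best achievable utility in each state (the utility of a_s).
best_in_state = {s: max(U[a][s] for a in U) for s in states}

def max_regret(act):
    return max(best_in_state[s] - U[act][s] for s in states)

for a in U:
    print(a, max_regret(a))        # a1: 6, a2: 7, a3: 6, a4: 4
print(min(U, key=max_regret))      # a4, as on the slide
```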

15

slide-16
SLIDE 16

Effect of Transformations

Proposition: Let f be an ordinal transformation of utilities (i.e., f is an increasing function). Then:

  • maximin(u) = maximin(f(u))
  • The preference order determined by maximin given u is the same as that determined by maximin given f(u).
  • An ordinal transformation doesn’t change which outcome is the worst.
  • maximax(u) = maximax(f(u))
  • opt^α(u) may not be the same as opt^α(f(u))
  • regret(u) may not be the same as regret(f(u)).

Proposition: Let f be a positive affine transformation

  • f(x) = ax + b, where a > 0.

Then

  • maximin(u) = maximin(f(u))
  • maximax(u) = maximax(f(u))
  • opt^α(u) = opt^α(f(u))
  • regret(u) = regret(f(u))

16

slide-17
SLIDE 17

“Irrelevant” Acts

Suppose that A = {a1, . . . , an} and, according to some decision rule, a1 ≻ a2. Can adding another possible act change things? That is, suppose A′ = A ∪ {a}.

  • Can it now be the case that a2 ≻ a1?

No, in the case of maximin, maximax, and optα. But . . . Possibly yes in the case of minimax regret!

  • The new act may change what is the best act in a given state, so may change all the calculations.

17

slide-18
SLIDE 18

Example: start with

        s1    s2
  a1     8     1
  a2     2     5

regret_u(a1) = 4 < regret_u(a2) = 6, so a1 ≻ a2.

But now suppose we add a3:

        s1    s2
  a1     8     1
  a2     2     5
  a3     0     8

Now regret_u(a2) = 6 < regret_u(a1) = 7 < regret_u(a3) = 8, so a2 ≻ a1 ≻ a3.

Is this reasonable?

18

slide-19
SLIDE 19

Multiplicative Regret

The notion of regret is additive; we want an act such that the difference between what you get and what you could have gotten is not too large. There is a multiplicative version:

  • find an act such that the ratio of what you get to what you could have gotten is not too large.
  • usual formulation: your cost / what your cost could have been is low.

This notion of regret has been extensively studied in the CS literature, under the names online algorithms and competitive ratio. Given a problem P with optimal algorithm OPT:

  • The optimal algorithm is given the true state.

Algorithm A for P has competitive ratio c if there exists a constant k such that, for all inputs x,

  running time(A(x)) ≤ c · (running time(OPT(x))) + k

19

slide-20
SLIDE 20

The Object Location Problem

Typical goal in CS literature:

  • find optimal competitive ratio for problems of interest

This approach has been applied to lots of problems,

  • caching, scheduling, portfolio selection, . . .

Example: Suppose you have a robot located at point 0 on a line, trying to find an object located somewhere on the line.

  • What’s a good algorithm for the robot to use?

The optimal algorithm is trivial:

  • Go straight to the object

Here’s one algorithm:

  • Go to +1, then −2, then +4, then −8, . . . until you find the object

Can be shown: this algorithm has a competitive ratio of 9

  • I believe this is optimal
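A small simulation of the doubling strategy as a sketch; the object positions, the tolerance, and the phase cap are illustrative choices. The bad cases are objects just past a turning point the robot has abandoned, and the ratio of distance traveled to |x| climbs toward 9 there.

```python
# Doubling search on a line: go to +1, then -2, then +4, then -8, ...
# Measure total distance traveled before reaching an object at position x,
# and compare it to the optimal |x| (go straight to the object).

def distance_traveled(x, max_phases=80):
    pos, traveled, radius, sign = 0.0, 0.0, 1.0, +1
    for _ in range(max_phases):
        turn = sign * radius
        lo, hi = min(pos, turn), max(pos, turn)
        if lo <= x <= hi:                    # object found during this sweep
            return traveled + abs(x - pos)
        traveled += abs(turn - pos)
        pos, radius, sign = turn, 2 * radius, -sign
    raise RuntimeError("object not found within the phase cap")

# Bad cases: an object slightly beyond a turning point on the side just abandoned.
candidates = [4**k + 1e-9 for k in range(1, 10)] + [-2 * 4**k - 1e-9 for k in range(1, 10)]
ratios = [distance_traveled(x) / abs(x) for x in candidates]
print(max(ratios))   # approaches (but stays below) 9
```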

20

slide-21
SLIDE 21

The Ski Rental Problem

Example:

  • It costs $p to purchase skis
  • it costs $r to rent skis
  • You will ski for at most N days (but maybe less)

How long should you rent before you buy?

  • It depends (in part) on the ratio of p to r
  • If the purchase price is high relative to the rental price, you should rent longer, to see if you like skiing
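The slides don’t commit to a particular strategy; a standard one from the online-algorithms literature is the break-even rule: rent until the rent paid would reach the purchase price, then buy. A sketch (the particular prices and the simple cost model are assumptions):

```python
# Break-even ski rental: rent while total rent paid stays below the purchase
# price p, then buy. Its cost is at most roughly twice the offline optimum,
# which pays min(n * r, p) when the number of ski days n is known in advance.
import math

def break_even_cost(n, p, r):
    days_to_rent = math.ceil(p / r) - 1   # rent this many days, then buy
    if n <= days_to_rent:
        return n * r                       # you stopped skiing before buying
    return days_to_rent * r + p            # rented for a while, then bought

def offline_optimum(n, p, r):
    return min(n * r, p)

p, r = 300, 50
for n in [1, 3, 5, 6, 7, 20]:
    alg, opt = break_even_cost(n, p, r), offline_optimum(n, p, r)
    print(n, alg, opt, round(alg / opt, 2))   # ratio stays below 2
```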

21

slide-22
SLIDE 22

The Principle of Insufficient Reason

Consider the following example:

        s1   s2   s3   s4   s5   s6   s7   s8   s9
  a1     9    9    9    9    9    9    9    9    0
  a2     9    0    0    0    0    0    0    0    9

None of the previous decision rules can distinguish a1 and a2. But a lot of people would find a1 better:

  • it’s more “likely” to produce a better result

Formalization:

  • u_a(s) = u(a(s)): the utility of act a in state s
  • u_a is a random variable
  • Let Pr be the uniform distribution on S
  • All states are equiprobable
  • No reason to assume that one is more likely than the others.
  • Let E_Pr(u_a) be the expected value of u_a

Rule: a ≻ a′ iff E_Pr(u_a) > E_Pr(u_{a′}).
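Applied to the earlier four-state table, the principle of insufficient reason is just expected utility with a uniform prior; a sketch:

```python
# Principle of insufficient reason: expected utility under a uniform prior,
# using the four-state table from the maximin/maximax slides.
U = {
    "a1": {"s1": 5, "s2": 0, "s3": 0, "s4": 2},
    "a2": {"s1": -1, "s2": 4, "s3": 3, "s4": 7},
    "a3": {"s1": 6, "s2": 4, "s3": 4, "s4": 1},
    "a4": {"s1": 5, "s2": 6, "s3": 4, "s4": 3},
}

def expected_utility(act, pr):
    return sum(pr[s] * u for s, u in U[act].items())

uniform = {s: 0.25 for s in ["s1", "s2", "s3", "s4"]}
for a in sorted(U, key=lambda a: expected_utility(a, uniform), reverse=True):
    print(a, expected_utility(a, uniform))   # a4 (4.5), a3 (3.75), a2 (3.25), a1 (1.75)
```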

22

slide-23
SLIDE 23

Problem: this approach is sensitive to the choice of states.

  • What happens if we split s9 into 20 states?

Related problem: why is it reasonable to assume that all states are equally likely?

  • Sometimes it’s reasonable (we do it all the time when analyzing card games); often it’s not

23

slide-24
SLIDE 24

Maximizing Expected Utility

If there is a probability distribution Pr on states, we can compute the expected utility of each act a:

  E_Pr(u_a) = Σ_{s∈S} Pr(s) u_a(s).

Maximizing expected utility (MEU) rule: a ≻ a′ iff E_Pr(u_a) > E_Pr(u_{a′}).

Obvious question:

  • Where is the probability coming from?

In computer systems:

  • Computer can gather statistics
  • Unlikely to be complete

When dealing with people:

  • Subjective probabilities
  • These can be hard to elicit
  • What do they even mean?

24

slide-25
SLIDE 25

Maximizing Expected Utility

Another possibility: instead of maximizing expected utility, minimize expected regret:

  E_Pr(regret_u(a, ·)) = Σ_{s∈S} Pr(s) regret_u(a, s).

In the literature, there are other approaches that use probability, but don’t maximize the expected value of anything.

  • “probabilistically sophisticated approaches”
  • at least the decision maker uses probability

25

slide-26
SLIDE 26

Eliciting Utilities

MEU is unaffected by positive affine transformations, but may be affected by ordinal transformations:

  • if f is a positive affine transformation, then MEU(u) = MEU(f(u))
  • if f is an ordinal transformation, then MEU(u) may differ from MEU(f(u)).

So where are the utilities coming from?

  • People are prepared to say “good”, “better”, “terrible”
  • This can be converted to an ordinal utility
  • Can people necessarily give differences?
  • Utility elicitation is a significant problem in practice, and the subject of lots of research.

26

slide-27
SLIDE 27

Minimizing Expected Regret

Recall that a_s is the act with the best outcome in state s.

  regret_u(a, s) = u_{a_s}(s) − u_a(s)
  regret_u(a) = max_{s∈S} regret_u(a, s)

Given Pr, the expected regret of a is

  E_Pr(regret_u(a, ·)) = Σ_{s∈S} Pr(s) regret_u(a, s)

Minimizing expected regret (MER) rule: a ≻ a′ iff E_Pr(regret_u(a, ·)) < E_Pr(regret_u(a′, ·))

Theorem: MEU and MER are equivalent rules! a ≻_MEU a′ iff a ≻_MER a′

Proof:

  • 1. Let u′ = −u.
  • Maximizing E_Pr(u_a) is the same as minimizing E_Pr(u′_a).
  • 2. Let u^v(a, s) = u′(a, s) + v(s), where v : S → ℝ is arbitrary.
  • Minimizing E_Pr(u′_a) is the same as minimizing E_Pr(u^v_a).
  • You’ve just added the same constant (E_Pr(v)) to the expected value of u′_a, for each a.
  • 3. Taking v(s) = u_{a_s}(s), E_Pr(u^v_a) is the expected regret of a!
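A quick numerical check of the equivalence on the running example, with an arbitrary (illustrative) non-uniform prior:

```python
# MEU vs MER: for any fixed prior, ranking acts by expected utility (descending)
# matches ranking them by expected regret (ascending).
U = {
    "a1": {"s1": 5, "s2": 0, "s3": 0, "s4": 2},
    "a2": {"s1": -1, "s2": 4, "s3": 3, "s4": 7},
    "a3": {"s1": 6, "s2": 4, "s3": 4, "s4": 1},
    "a4": {"s1": 5, "s2": 6, "s3": 4, "s4": 3},
}
pr = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}          # an arbitrary prior
best_in_state = {s: max(U[a][s] for a in U) for s in pr}

def eu(a):
    return sum(pr[s] * U[a][s] for s in pr)

def expected_regret(a):
    return sum(pr[s] * (best_in_state[s] - U[a][s]) for s in pr)

print(sorted(U, key=eu, reverse=True))    # rank by expected utility
print(sorted(U, key=expected_regret))     # same ranking, by expected regret
```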

27

slide-28
SLIDE 28

Representing Uncertainty by a Set of Probabilities

Why is probability even the right way to represent uncertainty?

Consider tossing a fair coin. A reasonable way to represent your uncertainty is with the probability measure Pr_{1/2}: Pr_{1/2}(heads) = Pr_{1/2}(tails) = 1/2.

Now suppose the bias of the coin is unknown. How do you represent your uncertainty about heads?

  • Could still use Pr_{1/2}
  • Perhaps better: use the set {Pr_a : a ∈ [0, 1]}, where Pr_a(heads) = a.

28

slide-29
SLIDE 29

Decision Rules with Sets of Probabilities

Given a set P of probability measures, define

  E_P(u_a) = inf_{Pr ∈ P} E_Pr(u_a).

This is like maximin:

  • Optimizing the worst-case expectation

In fact, if P_S consists of all probability measures on S, then E_{P_S}(u_a) = worst_u(a).

Decision rule 1: a >¹_P a′ iff E_P(u_a) > E_P(u_{a′})

  • the maximin order agrees with >¹_{P_S}
  • >¹_P can take advantage of extra information

Define Ē_P(u_a) = sup_{Pr ∈ P} E_Pr(u_a).

  • Rule 2: a >²_P a′ iff Ē_P(u_a) > Ē_P(u_{a′})
  • This is like maximax
  • Rule 3: a >³_P a′ iff E_P(u_a) > Ē_P(u_{a′})
  • This is an extremely conservative rule
  • Rule 4: a >⁴_P a′ iff E_Pr(u_a) > E_Pr(u_{a′}) for all Pr ∈ P
  • Can show: a ≥³_P a′ implies a ≥⁴_P a′
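A sketch of the lower and upper expectations and the four rules on a small two-state example; the particular acts and the finite set of priors are illustrative choices (a finite grid stands in for the infimum and supremum over P).

```python
# Decision rules with a set P of probability measures (two states for brevity).
U = {"a": {"s1": 4, "s2": 4}, "b": {"s1": 0, "s2": 10}}

# P: a few priors putting between 0.2 and 0.8 on s1 (an illustrative set).
P = [{"s1": p, "s2": 1 - p} for p in [0.2, 0.4, 0.6, 0.8]]

def eu(act, pr):
    return sum(pr[s] * u for s, u in U[act].items())

def lower(act):   # E_P(u_a): infimum over Pr in P
    return min(eu(act, pr) for pr in P)

def upper(act):   # Ē_P(u_a): supremum over Pr in P
    return max(eu(act, pr) for pr in P)

a, b = "a", "b"
print(lower(a) > lower(b))                       # Rule 1 (maximin-like): a preferred
print(upper(b) > upper(a))                       # Rule 2 (maximax-like): b preferred
print(lower(a) > upper(b))                       # Rule 3 (very conservative): False here
print(all(eu(a, pr) > eu(b, pr) for pr in P))    # Rule 4: a beats b for every Pr in P?
```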

29

slide-30
SLIDE 30

What’s the “right” rule?

One way to determine the right rule is to characterize the rules axiomatically:

  • What properties of a preference order on acts guarantee that it can be represented by MEU? maximin? . . .
  • We’ll talk about this for MEU

Can also look at examples.

30

slide-31
SLIDE 31

Rawls vs. Harsanyi

Which of two societies (each with 1000 people) is better:

  • Society 1: 900 people get utility 90, 100 get 1
  • Society 2: everybody gets utility 35.

To make this a decision problem:

  • two acts:
  • 1. live in Society 1
  • 2. live in Society 2
  • 1000 states: in state i, you get to be person i

Rawls says: use maximin to decide.
Harsanyi says: use the principle of insufficient reason.

  • If you like maximin, consider Society 1′, where 999 people get utility 100 and 1 gets utility 34.
  • If you like the principle of insufficient reason, consider Society 1′′, where 1 person gets utility 100,000 and 999 get utility 1.

31

slide-32
SLIDE 32

Example: The Paging Problem

Consider a two-level virtual memory system:

  • Each level can store a number of fixed-size memory units called pages
  • Slow memory can store N pages
  • Fast memory (aka cache) can store k < N of these
  • Given a request for a page p, the system must make p available in fast memory.
  • If p is already in fast memory (a hit) then there’s nothing to do
  • otherwise (on a miss) the system incurs a page fault and must copy p from slow memory to fast memory
  • But then a page must be deleted from fast memory
  • Which one?

Cost models:

  • 1. charge 0 for a hit, charge 1 for a miss
  • 2. charge 1 for a hit, charge s > 1 for a miss

The results I state are for the first cost model.

32

slide-33
SLIDE 33

Algorithms Used in Practice

Paging has been studied since the 1960s. Many algorithms are used:

  • LRU (Least Recently Used): replace the page whose most recent request was earliest
  • FIFO (First In/First Out): replace the page which has been in fast memory longest
  • LIFO (Last In/First Out): replace the page most recently moved to fast memory
  • LFU (Least Frequently Used): replace the page requested the least since entering fast memory
  • . . .

These are all online algorithms; they don’t depend on knowing the full sequence of future requests. What you’d love to implement is:

  • LFD (Longest Forward Distance): replace the page whose next request is latest

But this requires knowing the request sequence.

33

slide-34
SLIDE 34

Paging as a Decision Problem

This is a dynamic problem. What are the states/outcomes/acts?

  • States: sequence of requests
  • Acts: strategy for initially placing pages in fast memory + replacement strategy
  • Outcomes: a sequence of hits and misses

Typically, no distribution over request sequences is assumed.

  • If a distribution were assumed, you could try to compute the strategy that minimized expected cost
  • utility = −cost
  • But this might be difficult to do in practice
  • Characterizing the distribution of request sequences is also difficult
  • A set of distributions may be more reasonable
    ∗ There has been some work on this
  • Each distribution characterizes a class of “requestors”

34

slide-35
SLIDE 35

Paging: Competitive Ratio

Maximin is clearly not a useful decision rule for paging:

  • Whatever the strategy, you can always find a request sequence that results in all misses.

There’s been a lot of work on the competitive ratio of various algorithms:

Theorem [Belady]: LFD is an optimal offline algorithm.

  • Replacing the page whose next request comes latest seems like the obvious thing to do, but proving optimality is not completely trivial.
  • The theorem says that we should thus compare the performance of an online algorithm to that of LFD.

Theorem: If fast memory has size k, LRU and FIFO are k-competitive:

  • For all request sequences, they have at most k times as many misses as LFD
  • There is a matching lower bound.

LIFO and LFU are not competitive:

  • For all ℓ, there exists a request sequence for which LIFO (LFU) has at least ℓ times as many misses as LFD
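A small sketch comparing LRU with Belady's LFD on a request sequence; the cache size and the sequence are illustrative.

```python
# Paging: count misses for LRU and for the offline optimum LFD (Belady's rule),
# which evicts the page whose next request is farthest in the future.

def lru_misses(requests, k):
    cache, misses = [], 0
    for p in requests:
        if p in cache:
            cache.remove(p)
        else:
            misses += 1
            if len(cache) == k:
                cache.pop(0)          # least recently used page is at the front
        cache.append(p)               # most recently used page goes to the back
    return misses

def lfd_misses(requests, k):
    cache, misses = set(), 0
    for i, p in enumerate(requests):
        if p in cache:
            continue
        misses += 1
        if len(cache) == k:
            def next_use(q):
                later = [j for j in range(i + 1, len(requests)) if requests[j] == q]
                return later[0] if later else float("inf")
            cache.remove(max(cache, key=next_use))   # evict farthest-in-future page
        cache.add(p)
    return misses

seq = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(lru_misses(seq, k=3), lfd_misses(seq, k=3))   # LRU misses more than LFD
```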

35

slide-36
SLIDE 36
  • For LIFO, consider the request sequence

      p1, . . . , pk, pk+1, pk, pk+1, pk, pk+1, . . .

  • Whatever the initial fast memory, LFD has at most k + 1 misses
  • LIFO has a miss at every step after the first k
  • For LFU, consider the request sequence

      p1^ℓ, . . . , p_{k−1}^ℓ, (pk, pk+1)^{ℓ−1}

    (each of p1, . . . , p_{k−1} requested ℓ times, then pk and pk+1 alternating)

  • Whatever the initial fast memory, LFD has at most k + 1 misses
  • LFU has a miss at every step after the first (k − 1)ℓ requests ⇒ 2(ℓ − 1) misses
    ∗ Thus, (k − 1) + 2(ℓ − 1) misses altogether.
    ∗ This makes the competitive ratio at least [(k − 1) + 2(ℓ − 1)]/(k + 1).
    ∗ Since ℓ can be arbitrarily large, the competitive ratio can be made arbitrarily large.
  • Note both examples require that there be only k + 1 pages altogether.

36

slide-37
SLIDE 37

Paging: Theory vs. Practice

  • the “empirical” competitive ratio of LRU is < 2, in-

dependent of fast memory size

  • the “empirical” competitive ratio of FIFO is ∼ 3, in-

dependent of fast memory size Why do they do well in practice?

  • One intution: in practice, request sequences obey some

locality of reference

  • Consecutive requests are related

37

slide-38
SLIDE 38

Modeling Locality of Reference

One way to model locality of reference: use an access graph G

  • the nodes in G are requests
  • require that successive requests in a sequence have an edge between them in G
  • if G is completely connected, arbitrary sequences of requests are possible
  • FIFO does not adequately exploit locality of reference
  • For any access graph G, the competitive ratio of FIFO is > k/2
  • LRU can exploit locality of reference
  • E.g.: if G is a line, the competitive ratio of LRU is 1
    ∗ LRU does as well as the optimal algorithm in this case!
  • E.g.: if G is a grid, the competitive ratio of LRU is ∼ 3/2

Key point: you can model knowledge of the access pattern without necessarily using probability.

38

slide-39
SLIDE 39

The rest of the decision theory course

  • Savage’s approach
  • Finding the right state space:
  • The wrong state space leads to intuitively incorrect answers when conditioning
  • Taking causality into account
  • If you don’t, again you have problems
  • A few words on computational issues:
  • Computing probabilities and utilities efficiently using graphical representations

  • Problems with maximizing expected utility
  • Effects of framing
  • Ellsberg paradox
  • Recent research by Blume, Easley, Halpern

39

slide-40
SLIDE 40

Savage

Leonard (Jimmy) Savage proved perhaps the most significant representation theorem in decision theory.

  • He assumed that agents have a preference order ⪰ on acts
  • a1 ⪰ a2: a1 is at least as good as a2
  • He provided seven postulates (axioms) that ⪰ should satisfy (see next slide)
  • in his view, the postulates were normative
  • a rational person should satisfy the postulates

Theorem [Savage]: If ⪰ satisfies Savage’s postulates, then there exists a probability Pr on states and a utility u on outcomes such that a ⪰ a′ iff E_Pr(u_a) ≥ E_Pr(u_{a′}). Moreover, Pr is unique and u is unique up to affine transformations.

40

slide-41
SLIDE 41

Savage’s Postulates

Some typical postulates include:

  • Completeness: For all acts a and a′, either a ⪰ a′ or a′ ⪰ a.
  • If you have millions of possible acts, can you really compare them all?
  • Transitivity: If a1 ⪰ a2 and a2 ⪰ a3, then a1 ⪰ a3.

Some notation: If T ⊆ S, let aTa′ be the act that agrees with a on T and with a′ on T^c:

  aTa′(s) = a(s) if s ∈ T, and a′(s) if s ∉ T.

  • Independence: If aTb ⪰ a′Tb then aTc ⪰ a′Tc.

41

slide-42
SLIDE 42

Three-Prisoners Puzzle

Computing the value of information involves conditioning. Conditioning can be subtle . . .

Consider the three-prisoners puzzle:

  • Two of three prisoners a, b, and c are chosen at random to be executed.
  • a’s prior that he will be executed is 2/3.
  • a asks the jailer whether b or c will be executed
  • The jailer says b.

It seems that the jailer gives a no useful information about his own chances of being executed.

  • a already knew that one of b or c was going to be executed

But conditioning seems to indicate that a’s posterior probability of being executed should be 1/2. This is easily rephrased in terms of value of information . . .

42

slide-43
SLIDE 43

The Monty Hall Puzzle

  • You’re on a game show and given a choice of three doors.
  • Behind one is a car; behind the others are goats.
  • You pick door 1.
  • Monty Hall opens door 2, which has a goat.
  • He then asks you if you still want to take what’s behind door 1, or to take what’s behind door 3 instead.

Should you switch?

  • What’s the value of Monty’s information?

43

slide-44
SLIDE 44

The Second-Ace Puzzle

Alice gets two cards from a deck with four cards: A♠, 2♠, A♥, 2♥. The six possible hands:

  A♠ A♥    A♠ 2♠    A♠ 2♥    A♥ 2♠    A♥ 2♥    2♠ 2♥

Alice then tells Bob “I have an ace”.

  • Conditioning ⇒ Pr_B(both aces | one ace) = 1/5.

She then says “I have the ace of spades”.

  • Pr_B(both aces | A♠) = 1/3.

The situation is similar if Alice says “I have the ace of hearts”.

Puzzle: Why should finding out which particular ace it is raise the conditional probability of Alice having two aces?

44

slide-45
SLIDE 45

Protocols

Claim 1: conditioning is always appropriate here, but you have to condition in the right space.

Claim 2: The right space has to take the protocol (algorithm, strategy) into account:

  • a protocol is a description of each agent’s actions as a function of their information.
  • if receive message
    then send acknowledgment

45

slide-46
SLIDE 46

Protocols

What is the protocol in the second-ace puzzle?

  • There are lots of possibilities!

Possibility 1:

  • 1. Alice gets two cards
  • 2. Alice tells Bob whether she has an ace
  • 3. Alice tells Bob whether she has the ace of spades

There are six possible runs (one for each pair of cards that Alice could have gotten); the earlier analysis works:

  • Pr_B(two aces | one ace) = 1/5
  • Pr_B(two aces | A♠) = 1/3

With this protocol, we can’t say “Bob would also think that the probability was 1/3 if Alice said she had the ace of hearts”.

46

slide-47
SLIDE 47

Possibility 2:

  • 1. Alice gets two cards
  • 2. Alice tells Bob whether she has an ace
  • 3. Alice tells Bob which kind of ace she has (if she has one)

This protocol is not well specified. What does Alice do at step 3 if she has both aces?

47

slide-48
SLIDE 48

Possibility 2(a):

  • She chooses which ace to say at random:

Now there are seven possible runs.

[Probability tree: each of the six hands has probability 1/6; if the hand is A♠, A♥, Alice says “A♠” or “A♥” with probability 1/2 each.]

  • Each run has probability 1/6, except the two runs where Alice was dealt two aces, which each have probability 1/12.
  • Pr_B(two aces | one ace) = 1/5
  • Pr_B(two aces | A♠) = (1/12)/(1/6 + 1/6 + 1/12) = 1/5
  • Pr_B(two aces | A♥) = 1/5
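A sketch that checks the 1/5 and 1/3 figures by enumerating hands and runs; generalizing to the α of Possibility 2(b) on the next slide amounts to replacing the two 1/12 weights by α/6 and (1 − α)/6.

```python
from fractions import Fraction
from itertools import combinations

cards = ["AS", "2S", "AH", "2H"]
hands = list(combinations(cards, 2))          # six equally likely hands

# Possibility 1: Alice says whether she has an ace, then whether she has AS.
has_ace = [h for h in hands if "AS" in h or "AH" in h]
both    = [h for h in hands if "AS" in h and "AH" in h]
print(Fraction(len(both), len(has_ace)))                               # 1/5
says_AS = [h for h in hands if "AS" in h]
print(Fraction(len([h for h in says_AS if h in both]), len(says_AS)))  # 1/3

# Possibility 2(a): if she has both aces, she names one at random (prob 1/2 each).
# Runs are (hand, announcement, probability) triples.
runs = []
for h in hands:
    aces = [c for c in h if c.startswith("A")]
    if len(aces) == 2:
        runs += [(h, "AS", Fraction(1, 12)), (h, "AH", Fraction(1, 12))]
    elif len(aces) == 1:
        runs += [(h, aces[0], Fraction(1, 6))]
    else:
        runs += [(h, None, Fraction(1, 6))]

said_AS = [r for r in runs if r[1] == "AS"]
num = sum(p for h, ann, p in said_AS if h in both)
den = sum(p for _, _, p in said_AS)
print(num / den)                                                       # 1/5
```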

48

slide-49
SLIDE 49

More generally, Possibility 2(b):

  • If she has both aces, she says “I have the ace of spades” with probability α
  • Possibility 2(a) is a special case with α = 1/2

Again, there are seven possible runs.

  • PrB(two aces | A♠) = α/(α + 2)
  • if α = 1/2, get 1/5, as before
  • if α = 0, get 0
  • if α = 1, get 1/3 (reduces to protocol 1)

49

slide-50
SLIDE 50

Possibility 3:

  • 1. Alice gets two cards
  • 2. Alice tells Bob she has an ace iff her leftmost card is an ace; otherwise she says nothing.
  • 3. Alice tells Bob the kind of ace her leftmost card is, if it is an ace.

What is the sample space in this case?

  • it has 12 points, not 6: the order matters
  • (2♥, A♠) is not the same as (A♠, 2♥)

Now Pr(2 aces | Alice says she has an ace) = 1/3.

50

slide-51
SLIDE 51

The Monty Hall puzzle

Again, what is the protocol?

  • 1. Monty places a car behind one door and a goat behind the other two. (Assume Monty chooses at random.)
  • 2. You choose a door.
  • 3. Monty opens a door (with a goat behind it, other than the one you’ve chosen).

This protocol is not well specified.

  • How does Monty choose which door to open if you choose the door with the car?
  • Is this even the protocol? What if Monty does not have to open a door at Step 3?

Not too hard to show:

  • If Monty necessarily opens a door at step 3, and chooses which one at random if Door 1 has the car, then switching wins with probability 2/3. But . . .
  • if Monty does not have to open a door at step 3, then all bets are off!
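A simulation sketch of the well-specified protocol (Monty always opens a goat door other than yours, choosing at random when he has a choice); the trial count is arbitrary.

```python
import random

def monty_trial():
    doors = [1, 2, 3]
    car = random.choice(doors)
    pick = 1                                        # you pick door 1
    # Monty opens a goat door other than your pick, at random if he has a choice.
    openable = [d for d in doors if d != pick and d != car]
    opened = random.choice(openable)
    switch_to = next(d for d in doors if d not in (pick, opened))
    return car == pick, car == switch_to            # (staying wins, switching wins)

trials = 100_000
results = [monty_trial() for _ in range(trials)]
print(sum(stay for stay, _ in results) / trials)      # about 1/3
print(sum(switch for _, switch in results) / trials)  # about 2/3
```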

51

slide-52
SLIDE 52

Naive vs. Sophisticated Spaces

Working in the sophisticated space gives the right answers, BUT . . .

  • the sophisticated space can be very large
  • it is often not even clear what the sophisticated space is
  • What exactly is Alice’s protocol?

When does conditioning in the naive space give the right answer?

  • Hardly ever!

52

slide-53
SLIDE 53

Formalization

Assume

  • There is an underlying space W: the naive space
  • The sophisticated space S consists of pairs (w, o) where
  • w ∈ W
  • o (the observation) is a subset of W
  • w ∈ o: the observation is always accurate.

Example: Three prisoners

  • The naive space is W = {wa, wb, wc}, where wx is the world where x is not executed.
  • There are two possible observations:
  • {wa, wb}: c is to be executed (i.e., one of a or b won’t be executed)
  • {wa, wc}: b is to be executed

The sophisticated space consists of four elements of the form (wx, {wx, wy}), where x ≠ y and {wx, wy} ≠ {wb, wc}

  • the jailer will never tell a that he (a) will be executed

53

slide-54
SLIDE 54

Given a probability Pr on S (the sophisticated space), let Pr_W be the marginal on W:

  Pr_W(U) = Pr({(w, o) : w ∈ U}).

In the three-prisoners puzzle, Pr_W(w) = 1/3 for all w ∈ W, but Pr is not specified.

Some notation:

  • Let X_O and X_W be random variables describing the agent’s observation and the actual world:
  • X_O = U is the event {(w, o) : o = U}.
  • X_W ∈ U is the event {(w, o) : w ∈ U}.

Question of interest: When is conditioning on U the same as conditioning on the observation of U?

  • When is Pr(· | X_O = U) = Pr(· | X_W ∈ U)?
  • Equivalently, when is Pr(· | X_O = U) = Pr_W(· | U)?

This question has been studied before in the statistics community. The CAR (Coarsening at Random) condition characterizes when this happens.

54

slide-55
SLIDE 55

The CAR Condition

Theorem: Fix a probability Pr on S and a set U ⊆ W. The following are equivalent:

(a) If Pr(X_O = U) > 0, then for all w ∈ U,
    Pr(X_W = w | X_O = U) = Pr(X_W = w | X_W ∈ U).

(b) If w, w′ ∈ U, Pr(X_W = w) > 0, and Pr(X_W = w′) > 0, then
    Pr(X_O = U | X_W = w) = Pr(X_O = U | X_W = w′).

For the three-prisoners puzzle, this means that

  • the probability of the jailer saying “b will be executed” must be the same if a is pardoned and if c is pardoned.
  • Similarly for “c will be executed”.

This is impossible no matter what protocol the jailer uses.

  • Thus, conditioning must give the wrong answers.

CAR also doesn’t hold for Monty Hall or any of the other puzzles.

55

slide-56
SLIDE 56

Why CAR is important

Consider drug testing:

  • In a medical study to test a new drug, several patients drop out before the end of the experiment
  • for compliers (who don’t drop out) you observe their actual response; for dropouts, you observe nothing at all.

You may be interested in the fraction of people who have a bad side effect as a result of taking the drug three times:

  • You can observe the fraction of compliers who have bad side effects
  • Are dropouts “missing at random”?
  • If someone drops out, you observe W.
  • Is Pr(X_W = w | X_O = W) = Pr(X_W = w | X_W ∈ W) = Pr(X_W = w)?

Similar issues arise in questionnaires and polling:

  • Are shoplifters really as likely as non-shoplifters to answer a question like “Have you ever shoplifted?”
  • concerns of the homeless are under-represented in polls

56

slide-57
SLIDE 57

Newcomb’s Paradox

A highly superior being presents you with two boxes, one open and one closed:

  • The open box contains a $1,000 bill
  • Either $0 or $1,000,000 has just been placed in the closed box by the being.

You can take the closed box or both boxes.

  • You get to keep what’s in the boxes; no strings attached.

But there’s a catch:

  • The being can predict what humans will do
  • If he predicted you’ll take both boxes, he put $0 in the second box.
  • If he predicted you’ll just take the closed box, he put $1,000,000 in the second box.

The being has been right 999 of the last 1000 times this was done. What do you do?

57

slide-58
SLIDE 58

The decision matrix:

  • s1: the being put $0 in the second box
  • s2: the being put $1,000,000 in the second box
  • a1: choose both boxes
  • a2: choose only the closed box

            s1          s2
  a1        $1,000      $1,001,000
  a2        $0          $1,000,000

Dominance suggests choosing a1.

  • But we’ve already seen that dominance is inappropriate if states and acts are not independent.

What does expected utility maximization say?

  • If acts and states aren’t independent, we need to compute Pr(si | aj).
  • Suppose Pr(s1 | a1) = .999 and Pr(s2 | a2) = .999.
  • Then take the act a that maximizes

      Pr(s1 | a)u(s1, a) + Pr(s2 | a)u(s2, a).

  • That’s a2.

Is this really right?

  • the money is either in the box, or it isn’t . . .

58

slide-59
SLIDE 59

A More Concrete Version

The facts

  • Smoking cigarettes is highly correlated with heart disease.
  • Heart disease runs in families
  • Heart disease is more common in type A personalities

Suppose that type A personality is inherited and people with type A personalities are more likely to smoke.

  • That’s why smoking is correlated with heart disease.

Suppose you’re a type A personality.

  • Should you smoke?

Now you get a decision table similar to Newcomb’s paradox.

  • But the fact that Pr(heart disease | smoke) is high shouldn’t deter you from smoking.

59

slide-60
SLIDE 60

More Details

Consider two causal models:

  • 1. Smoking causes heart disease:
  • Pr(heart disease | smoke) = .6
  • Pr(heart disease | ¬smoke) = .2
  • 2. There is a gene that causes a type A personality, heart disease, and a desire to smoke.

  • Pr(heart disease ∧ smoke | gene) = .48
  • Pr(heart disease ∧ ¬smoke | gene) = .04
  • Pr(smoke | gene) = .8
  • Pr(heart disease ∧ smoke | ¬gene) = .12
  • Pr(heart disease ∧ ¬smoke | ¬gene) = .16
  • Pr(smoke | ¬gene) = .2
  • Pr(gene) = .3

Conclusion:

  • Pr(heart disease | smoke) = .6
  • Pr(heart disease | ¬smoke) = .2

Both causal models lead to the same statistics.

  • Should the difference affect decisions?

60

slide-61
SLIDE 61

Recall:

  • Pr(heart disease | smoke) = .6
  • Pr(heart disease | ¬smoke) = .2

Suppose that

  • u(heart disease) = −1, 000, 000
  • u(smoke) = 1, 000

A naive use of expected utility suggests:

  EU(smoke) = −999,000 · Pr(heart disease | smoke) + 1,000 · Pr(¬heart disease | smoke)
            = −999,000(.6) + 1,000(.4) = −599,000

  EU(¬smoke) = −1,000,000 · Pr(heart disease | ¬smoke) = −200,000

Conclusion: don’t smoke.

  • But if smoking doesn’t cause heart disease (even though they’re correlated) then you have nothing to lose by smoking!
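The arithmetic behind the naive recommendation, as a sketch; "naive" because it conditions on the act itself, and the utilities are the ones given on the slide.

```python
# Naive expected utility for the smoking example: condition on the act.
u_heart_disease = -1_000_000
u_smoke = 1_000

pr_hd_given_smoke = 0.6
pr_hd_given_not_smoke = 0.2

eu_smoke = (pr_hd_given_smoke * (u_heart_disease + u_smoke)
            + (1 - pr_hd_given_smoke) * u_smoke)
eu_not_smoke = pr_hd_given_not_smoke * u_heart_disease

print(eu_smoke)       # -599000.0
print(eu_not_smoke)   # -200000.0  -> the naive rule says: don't smoke
```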

61

slide-62
SLIDE 62

Causal Decision Theory

In the previous example, we want to distinguish between the case where smoking causes heart disease and the case where they are correlated, but there is no causal relationship.

  • the probabilities are the same in both cases

This is the goal of causal decision theory:

  • Want to distinguish between Pr(s | a) and the probability that a causes s.
  • What is the probability that smoking causes heart disease vs. the probability that you get heart disease, given that you smoke?

Let Pr_C(s | a) denote the probability that a causes s.

  • Causal decision theory recommends choosing the act a that maximizes

      Σ_s Pr_C(s | a) u(s, a)

    as opposed to the act that maximizes

      Σ_s Pr(s | a) u(s, a)

So how do you compute Pr_C(s | a)?

62

slide-63
SLIDE 63
  • You need a good model of causality . . .

Basic idea:

  • include the causal model as part of the state, so the state has the form (causal model, rest of state).
  • put a probability on causal models; the causal model tells you the probability of the rest of the state
  • in the case of smoking, you need to know the probability that . . .

63

slide-64
SLIDE 64

In the smoking example, you need to know:

  • the probability that smoking is a cause of heart disease: α
  • the probability of heart disease given that you smoke, if smoking is a cause: .6
  • the probability of heart disease given that you don’t smoke, if smoking is a cause: .2
  • the probability that the gene is the cause: 1 − α
  • the probability of heart disease if the gene is the cause (whether or not you smoke): (.52 × .3) + (.28 × .7) = .352.

  EU(smoke) = α(.6(−999,000) + .4(1,000)) + (1 − α)(.352(−999,000) + .648(1,000))
  EU(¬smoke) = α(.2(−1,000,000)) + (1 − α)(.352(−1,000,000))

  • If α = 1 (smoking causes heart disease), then we get the same answer as standard decision theory: you shouldn’t smoke.
  • If α = 0 (there’s a gene that’s a common cause of smoking and heart disease), you have nothing to lose by smoking.
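The same computation as a sketch, with the weight α on the "smoking causes heart disease" model made an explicit parameter; the numbers are the ones from the slide.

```python
# Causal decision theory for the smoking example: mix the two causal models
# with weight alpha on "smoking causes heart disease".
u_hd, u_smoke = -1_000_000, 1_000

def eu_smoke(alpha):
    smoking_model = 0.6 * (u_hd + u_smoke) + 0.4 * u_smoke
    gene_model = 0.352 * (u_hd + u_smoke) + 0.648 * u_smoke
    return alpha * smoking_model + (1 - alpha) * gene_model

def eu_not_smoke(alpha):
    return alpha * (0.2 * u_hd) + (1 - alpha) * (0.352 * u_hd)

for alpha in [0.0, 0.5, 1.0]:
    print(alpha, eu_smoke(alpha), eu_not_smoke(alpha))
# alpha = 1: don't smoke (as in standard decision theory);
# alpha = 0: smoking comes out slightly ahead; nothing to lose by smoking.
```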

64

slide-65
SLIDE 65

So what about Newcomb?

  • Choose both boxes unless you believe that choosing both boxes causes the second box to be empty!

65

slide-66
SLIDE 66

A Medical Decision Problem

You want to build a system to help doctors make decisions, by maximizing expected utility.

  • What are the states/acts/outcomes?

States:

  • Assume a state is characterized by n binary random variables, X1, . . . , Xn:
  • A state is a tuple (x1, . . . , xn), xi ∈ {0, 1}.
  • The Xi’s describe symptoms and diseases.
    ∗ Xi = 0: you haven’t got it
    ∗ Xi = 1: you have it
  • For any one disease, relatively few symptoms may be relevant.
  • But in a complete system, you need to keep track of all of them.

Acts:

  • Ordering tests, performing operations, prescribing medication

66

slide-67
SLIDE 67

Outcomes are also characterized by m random variables:

  • Does patient die?
  • If not, length of recovery time
  • Quality of life after recovery
  • Side-effects of medications

67

slide-68
SLIDE 68

Some obvious problems:

  • 1. Suppose n = 100 (certainly not unreasonable).
  • Then there are 2^100 states
  • How do you get all the probabilities?
  • You don’t have statistics for most combinations!
  • How do you even begin to describe a probability distribution on 2^100 states?
  • 2. To compute expected utility, you have to attach a numerical utility to outcomes.
  • What’s the utility of dying? Of living in pain for 5 years?
  • Different people have different utilities
  • Eliciting these utilities is very difficult
    ∗ People often don’t know their own utilities
  • Knowing these utilities is critical for making a decision.

68

slide-69
SLIDE 69

Computer Science to the Rescue

A solution for dealing with representing probabilities and utilities:

  • Use graphical models
  • Many of the relevant random variables are (conditionally) independent.

Thinking in terms of (in)dependence

  • helps structure a problem
  • makes it easier to elicit information from experts
  • makes it easier to represent and reason about the problem efficiently.

The same is true for utilities!

  • Explaining all the details could fill up another course . . .

69

slide-70
SLIDE 70

Problems with Savage

All of Savage’s axioms appear reasonable . . . until you start to look at them carefully.

  • People don’t always act as Savage would expect.
  • When this is pointed out to them, sometimes they agree that they made a mistake, but not always
  • Much recent work has been aimed at finding alternative approaches, often motivated by some standard “paradoxes”.

70

slide-71
SLIDE 71

Ellsberg Paradox

There is one urn with 300 balls: 100 of these balls are red (R) and the rest are either blue (B) or yellow (Y). Consider the following two choice situations:

I:   a:  Win $100 if a ball drawn from the urn is R, and nothing otherwise.
     a′: Win $100 if a ball drawn from the urn is B, and nothing otherwise.

II:  b:  Win $100 if a ball drawn from the urn is R or Y, and nothing otherwise.
     b′: Win $100 if a ball drawn from the urn is B or Y, and nothing otherwise.

71

slide-72
SLIDE 72

This is inconsistent with MEU

Suppose a decision maker’s preferences are such that a ≻ a′ and b′ ≻ b. If there is a probability on states, then the first choice implies that the probability of a red ball is greater than the probability of a blue ball, and the second choice implies the reverse.

Which of Savage’s axioms is violated?

  • Independence: Remember that an act is a function from states to outcomes. Let T ⊆ S be a subset of states. Then aTb ⪰ a′Tb iff aTc ⪰ a′Tc.

State space: S = {R, B, Y}. Outcomes: O = {0, 100}. Let T = {R, B} and note that S = T ∪ {Y}. On T, a = b and a′ = b′. Further, a(Y) = a′(Y) and b(Y) = b′(Y). We have a ≻ a′. The independence axiom then implies that b ≻ b′. But we have b′ ≻ b. So the independence axiom is violated.

72

slide-73
SLIDE 73

Maxmin Expected Utility Rule

Suppose that the decision maker’s uncertainty can be represented by a set P of probabilities. Let

  E_P(u_a) = inf_{Pr ∈ P} E_Pr(u_a)

Recall the maxmin expected utility rule (covered earlier in the course):

  • a >¹_P a′ iff E_P(u_a) > E_P(u_{a′})

This is like maximin:

  • Optimizing the worst-case expectation

This could explain the Ellsberg Paradox:

  • Let P = {(1/3, p_B, p_Y) : 0 ≤ p_B ≤ 2/3, p_Y = 2/3 − p_B}

Gilboa and Schmeidler axiomatized the maxmin expected utility rule:

  • It does not satisfy independence
  • Gilboa and Schmeidler replaced independence by a weaker axiom.
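A sketch checking that maxmin expected utility reproduces the Ellsberg preferences; a finite grid over p_B stands in for the infimum over P.

```python
# Maxmin expected utility on the Ellsberg urn: 100 red balls, 200 blue-or-yellow.
# A prior is (pR, pB, pY) = (1/3, p, 2/3 - p); a grid over p approximates P.
priors = [(1/3, p / 300, 2/3 - p / 300) for p in range(0, 201)]

bets = {              # payoff of each bet in states (Red, Blue, Yellow)
    "a (red)":             (100, 0, 0),
    "a' (blue)":           (0, 100, 0),
    "b (red or yellow)":   (100, 0, 100),
    "b' (blue or yellow)": (0, 100, 100),
}

def lower_expectation(bet):
    return min(sum(p * x for p, x in zip(pr, bets[bet])) for pr in priors)

for name in bets:
    print(name, round(lower_expectation(name), 2))
# a: 33.33, a': 0.0, b: 33.33, b': 66.67  ->  a > a' and b' > b, as observed.
```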

73

slide-74
SLIDE 74

Framing Effects

[McNeill et al.]: DMs are asked to choose between surgery or radiation therapy as a treatment for lung cancer. The problem is framed in two ways.

  • Version I (survival frame): DMs are told that, of 100 people having surgery, 90 live through the post-operative period, 68 are alive at the end of the first year, and 34 are alive at the end of five years; and of 100 people having radiation therapy, all live through the treatment, 77 are alive at the end of the first year, and 22 are alive at the end of five years.
  • Version II (mortality frame): DMs are told that, of 100 people having surgery, 10 die during the post-operative period, 32 die by the end of the first year, and 66 die by the end of five years; and of 100 people having radiation therapy, all live through the treatment, 23 die by the end of the first year, and 78 die by the end of five years.

The outcomes are equivalent in the two frames: k of 100 people living is the same as 100 − k out of 100 dying.

  • Yet, while only 18% of DMs prefer radiation therapy in the survival frame, the number goes up to 44% in the mortality frame.

74

slide-75
SLIDE 75

Conjunction Fallacy

Kahneman and Tversky experiment:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which is more probable?

  • Linda is a bank teller.
  • Linda is a bank teller and is active in the feminist

movement.

75