SLIDE 1

CS286r Presentation James Burns March 7, 2006

  • Calibrated Learning and Correlated Equilibrium
    – by Dean Foster and Rakesh Vohra
  • Regret in the On-Line Decision Problem
    – by Dean Foster and Rakesh Vohra

SLIDE 2

Outline

  • Correlated Equilibria
  • Forecasts and Calibration
  • Calibration and Correlated Equilibria
  • Loss and Regret
  • Existence of a no-regret forecasting scheme
  • Further results and discussion
SLIDE 3

Correlated Equilibria

  • Motivation
    – Difficult to find learning rules that guarantee convergence to NE
    – CE are easy to compute
    – Consistent with the Bayesian perspective (Aumann, 1987)
    – CEs can Pareto dominate NEs; is this relevant?

  • Drawback
    – The problem of multiplicity of equilibria is worse!

SLIDE 4

Forecasts

  • $f(t) = \{p_1(t), \dots, p_n(t)\}$
  • $p_j(t)$ is the forecasted probability that event $j$ occurs at time $t$
  • Let $N(p, t)$ be the number of times that $f$ generates the forecast $p$ up to time $t$

SLIDE 5

Calibration

  • Let $\chi(j, t) = 1$ if event $j$ occurs at time $t$
  • We now define $\rho(p, j, t)$ as the empirical frequency of action $j$ given the forecast $p$:

    $$\rho(p, j, t) = \begin{cases} 0 & \text{if } N(p, t) = 0 \\[4pt] \dfrac{\sum_{s=1}^{t} I_{f(s)=p}\,\chi(j, s)}{N(p, t)} & \text{otherwise} \end{cases}$$

  • For the forecasting scheme to be calibrated we require:

    $$\lim_{t \to \infty} \sum_{p} |\rho(p, j, t) - p_j|\,\frac{N(p, t)}{t} = 0$$

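To make the definition concrete, here is a minimal Python sketch of the calibration score for a single event $j$; the function name and data layout are mine, not from the paper.

```python
from collections import defaultdict

def calibration_score(forecasts, outcomes):
    """Calibration score at time t for a single event j:
    sum over forecasts p of |rho(p, j, t) - p| * N(p, t) / t.
    Should tend to 0 as t grows if the forecaster is calibrated.

    forecasts: forecast probabilities p(s) for s = 1..t
    outcomes:  indicators chi(j, s) in {0, 1} for s = 1..t
    """
    t = len(forecasts)
    if t == 0:
        return 0.0
    count = defaultdict(int)  # N(p, t): how often forecast p was issued
    hits = defaultdict(int)   # how often event j occurred after forecast p
    for p, x in zip(forecasts, outcomes):
        count[p] += 1
        hits[p] += x
    return sum(abs(hits[p] / n - p) * n / t for p, n in count.items())
```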
SLIDE 6

Example: Forecasting the Weather

  • Pick a forecasting scheme to predict whether it will rain or not
  • $f(t) = p(t)$ is the forecasted probability that it will rain at time $t$
  • Let $N(p, t)$ be the number of times that $f(t) = p$ up to time $t$
  • $\rho(p, t)$ is the frequency with which it rained given that rain was forecasted with probability $p$
  • For the forecasting scheme to be calibrated we require:

    $$\lim_{t \to \infty} \sum_{p} |\rho(p, t) - p|\,\frac{N(p, t)}{t} = 0$$

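A toy usage of the calibration_score sketch from the previous slide, with made-up weather data: a forecaster that always announces the true long-run rain probability should score close to zero.

```python
import random

random.seed(0)
# Hypothetical data: it rains on 30% of days, and the forecaster
# always announces p = 0.3 (the true long-run frequency).
rain = [1 if random.random() < 0.3 else 0 for _ in range(10000)]
forecasts = [0.3] * len(rain)
print(calibration_score(forecasts, rain))  # small, and shrinking as t grows
```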
SLIDE 7

How does fictitious play fit in?

  • Fictitious play is a particular forecast scheme that requires the forecast to be equal to an agent's prior updated by the unconditioned empirical frequency of events
  • This means that if the forecast converges, we have

    $$p_j(t) \to \frac{1}{t} \sum_{s=1}^{t} \chi(j, s)$$

    where $\chi(j, s) = 1$ if event $j$ occurs at time $s$
  • In fictitious play forecasts converge to empirical frequencies, whereas calibration requires that forecasts converge to empirical frequencies conditioned on the forecasts (see the sketch below).

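A sketch of a fictitious-play-style forecaster, under the simplifying assumption that the prior is a set of uniform pseudo-counts; illustrative only.

```python
def fictitious_play_forecast(history, n_events, prior_count=1.0):
    """Fictitious-play-style forecast: a prior (here uniform
    pseudo-counts, an assumption) updated by the unconditional
    empirical frequency of events observed so far.

    history: observed event indices j in {0, ..., n_events - 1}
    Returns the forecast vector (p_1(t), ..., p_n(t)).
    """
    counts = [prior_count] * n_events
    for j in history:
        counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts]
```

Because the forecast depends only on unconditional frequencies, nothing ties the frequency of event $j$ on the rounds where a particular forecast $p$ was issued back to $p$ itself, which is exactly what calibration demands.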
SLIDE 8

Calibrated Forecasts and Correlated Equilibrium

  • Consider a two-player game $G$. We can characterize a CE in the set of all CE of the game $G$, $\pi(G)$, by the induced joint distribution over the agents' strategy sets $S(1) \times S(2)$.
  • We denote this joint distribution by $D(x, y)$. Further, let $D_t(x, y)$ be the empirical frequency with which $(x, y)$ is played up to time $t$ (a small tracking sketch follows).

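Tracking $D_t(x, y)$ is straightforward; a minimal sketch (names are mine):

```python
from collections import Counter

def empirical_joint(plays):
    """D_t(x, y): the empirical frequency of each joint action (x, y)
    over the first t rounds.  plays is a list of (x, y) pairs."""
    t = len(plays)
    return {xy: c / t for xy, c in Counter(plays).items()}
```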
SLIDE 9
  • Theorem 1 (VF, 97): If each player uses a forecast that is calibrated against the other's sequence of plays, and then makes a best response to this forecast, then

    $$\min_{D \in \pi(G)} \; \max_{x \in S(1),\, y \in S(2)} |D_t(x, y) - D(x, y)| \to 0$$

  • Important assumption: players use a deterministic tie-breaking rule in making best responses.
  • What does this actually claim?
SLIDE 10

Outline of Proof

  • $D_t(x, y)$ lies in the $(nm - 1)$-dimensional unit simplex, hence is closed and bounded
  • The Bolzano-Weierstrass theorem implies that $D_t(x, y)$ has a convergent subsequence $D_{t_i}(x, y)$
  • Let $D^*$ be the limit of $D_{t_i}(x, y)$; we show $D^*$ is a correlated equilibrium
  • Basic argument: show that the vector whose $y$th component is

    $$\frac{D^*(x, y)}{\sum_{c \in S(2)} D^*(x, c)}$$

    is in the set of mixtures over $S(2)$ for which $x$ is a best response. This will hold because the forecasting rule is calibrated.

SLIDE 11
  • Missing! If the theorem does not hold, there must be a subsequence $D_{t_j}(x, y)$ such that $|D_{t_j}(x, y) - D(x, y)| > \epsilon$ for some $\epsilon > 0$ and all $t_j$. However, this subsequence must itself have a convergent subsequence that, by the argument above, must converge to a CE, contradicting our assumption.

SLIDE 12

Calibration and CE continued

  • Theorem 2 (VF, 97): For almost every game, the set of distributions to which calibrated learning rules can converge is identical to the set of correlated equilibria.
    – Proof is constructive
    – Is this theorem useful? What can it really tell us?

SLIDE 13
  • Theorem 3 (VF, 97): There exists a randomized forecast that player 1 can use such that no matter what learning rule player 2 uses, player 1 will be calibrated.
    – The proof gives an algorithm for constructing a randomized forecast rule that is calibrated, but it is not intuitive.
    – Based on a regret measure.
    – Each step in the procedure requires computing an invariant vector of increasing size.

SLIDE 14
  • We consider an on-line decision problem (ODP) in which an agent incurs a loss in every period as a function of the decision made and the state of the world in that period. The objective of the agent is to minimize the total loss incurred, e.g., guessing a sequence of 0s and 1s.

SLIDE 15

Loss

  • Notation
    – Let $D = \{d_1, d_2, \dots, d_n\}$ be the set of possible decisions at time $t$
    – $L^j_t \le 1$ is the loss incurred at time $t$ from taking action $j$
    – We represent a decision-making scheme $S$ by the probability vectors $w_t$, where $w^j_t$ is the probability that decision $j$ is chosen at time $t$
  • Define $L(S)$, the expected loss from using scheme $S$ over $T$ periods:

    $$L(S) = \sum_{t=1}^{T} \sum_{d_j \in D} w^j_t L^j_t$$

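A direct transcription of $L(S)$ into Python, assuming weights and losses are given as T-by-n lists (a sketch; the data layout is mine):

```python
def expected_loss(weights, losses):
    """L(S) = sum over t and j of w_t^j * L_t^j.

    weights: T rows of probability vectors w_t (length n each)
    losses:  T rows of loss vectors L_t (each entry <= 1)
    """
    return sum(w * loss
               for w_t, L_t in zip(weights, losses)
               for w, loss in zip(w_t, L_t))
```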
SLIDE 16

Regret

  • We now compare the loss under the scheme $S$ with the loss that would have been incurred had a different scheme been used.
  • In particular, we consider the change in loss from replacing an action $d_j$ with another action $d_i$.
  • Given a scheme $S$ that uses decision $d_j$ in period $t$ with probability $w^j_t$, define the pairwise regret of switching from decision $d_j$ to $d_i$ as

    $$R^{j \to i}_T(S) = \sum_{t=1}^{T} w^j_t\,(L^j_t - L^i_t)$$

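The pairwise regret is likewise a one-line sum; this sketch reuses the weights/losses layout from the expected-loss sketch above.

```python
def pairwise_regret(weights, losses, j, i):
    """R_T^{j->i}(S): the change in expected loss had every play of
    decision j been replaced by decision i, i.e.
    sum over t of w_t^j * (L_t^j - L_t^i)."""
    return sum(w_t[j] * (L_t[j] - L_t[i])
               for w_t, L_t in zip(weights, losses))
```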
SLIDE 17
  • Define the regret incurred by $S$ from using decision $d_j$ up to time $T$ as

    $$R^j_T(S) = \sum_{i \in D} \left(R^{j \to i}_T(S)\right)^+$$

    where $\left(R^{j \to i}_T(S)\right)^+ = \max\{0,\, R^{j \to i}_T(S)\}$
  • Define the regret from using $S$ as

    $$R_T(S) = \sum_{j \in D} R^j_T(S)$$

  • We say that the scheme $S$ has the no-internal-regret property if its expected regret is small: $R_T(S) = o(T)$

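Combining the pieces, the total internal regret sums the positive parts of all pairwise regrets. This sketch reuses pairwise_regret from the previous slide; the diagonal terms $R^{j \to j}$ are zero and are harmlessly included.

```python
def internal_regret(weights, losses):
    """R_T(S) = sum over j, i of max(0, R_T^{j->i}(S)).  The scheme has
    the no-internal-regret property if this grows sublinearly in T."""
    n = len(weights[0])
    return sum(max(0.0, pairwise_regret(weights, losses, j, i))
               for j in range(n) for i in range(n))
```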
SLIDE 18

Existence of a No-Regret Scheme

  • Proof for the case where $|D| = 2$
  • We have defined

    $$R_T(S) = \sum_{i \in D} \sum_{j \in D} \left(R^{i \to j}_T(S)\right)^+$$

  • But $\left(R^{0 \to 0}_T(S)\right)^+ = \left(R^{1 \to 1}_T(S)\right)^+ = 0$
  • Goal: show that the time averages of $\left(R^{1 \to 0}_T(S)\right)^+$ and $\left(R^{0 \to 1}_T(S)\right)^+$ go to zero.

SLIDE 19
  • Define the following game:
    – The agent chooses between strategy "0" and strategy "1" in each period
    – Payoffs are vectors: the payoff for using strategy "0" in period $t$ is $(L^0_t - L^1_t,\, 0)$, and the payoff for using strategy "1" is $(0,\, L^1_t - L^0_t)$
  • Suppose that the agent follows a scheme that chooses strategy "0" with probability $w_t$; then the time-averaged payoff at round $T$ is

    $$\left(\frac{\sum_{t=1}^{T} w_t\,(L^0_t - L^1_t)}{T},\; \frac{\sum_{t=1}^{T} (1 - w_t)(L^1_t - L^0_t)}{T}\right)$$

  • Note that we have defined the payoffs such that the time-averaged payoffs are equal to $(R^{0 \to 1}_T(S)/T,\; R^{1 \to 0}_T(S)/T)$ as defined above.

SLIDE 20
  • Blackwell's Approachability Theorem: a convex set $G$ is approachable iff every tangent hyperplane of $G$ is approachable.
  • Our target set is the nonpositive orthant; that is, we want $R^{1 \to 0}_T(S)/T \le 0$ and $R^{0 \to 1}_T(S)/T \le 0$
  • If the payoff vector is not in the nonpositive orthant, then we consider the line separating the payoff vector from the target set. The line $l$ is given by

    $$\left(R^{0 \to 1}_T(S)\right)^+ x + \left(R^{1 \to 0}_T(S)\right)^+ y = 0$$

SLIDE 21
  • The agent must choose "0" with probability $p$ such that the expected payoff vector

    $$\left(p\,(L^0_{T+1} - L^1_{T+1}),\; (1 - p)\,(L^1_{T+1} - L^0_{T+1})\right)$$

    lies on the line $l$.
  • This requires:

    $$\left(R^{0 \to 1}_T(S)\right)^+ p\,(L^0_{T+1} - L^1_{T+1}) + \left(R^{1 \to 0}_T(S)\right)^+ (1 - p)\,(L^1_{T+1} - L^0_{T+1}) = 0$$

  • Which yields:

    $$p = \frac{\left(R^{1 \to 0}_T(S)\right)^+}{\left(R^{0 \to 1}_T(S)\right)^+ + \left(R^{1 \to 0}_T(S)\right)^+}$$

  • Not what is in the paper!
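Read as an algorithm, the derivation says: play "0" with probability proportional to the positive part of the regret for not having switched $1 \to 0$. A sketch for the two-action case, reusing pairwise_regret from earlier; the tie-breaking value of 1/2 when both regrets vanish is my assumption.

```python
def no_regret_probability(weights, losses):
    """p = (R^{1->0})^+ / ((R^{0->1})^+ + (R^{1->0})^+): the probability
    of playing "0" next period, as solved for on this slide.  If both
    positive-part regrets are zero, the payoff vector is already in the
    target set and any probability (here 1/2) will do."""
    r01 = max(0.0, pairwise_regret(weights, losses, 0, 1))
    r10 = max(0.0, pairwise_regret(weights, losses, 1, 0))
    if r01 + r10 == 0.0:
        return 0.5
    return r10 / (r01 + r10)
```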
SLIDE 22
  • We have solved for the $p$ that will, in expectation, lie on the line separating the payoff vector from the target set. By Blackwell's Theorem the target set is approachable: we have found a no-regret scheme.
  • This result can be generalized to $|D| > 2$ but will require solving a system of equations.

SLIDE 23

Further Results:

  • The existence of a no-regret scheme implies the existence of an almost-calibrated forecast scheme
  • If all agents in a game play a no-regret strategy, play will converge to the set of correlated equilibria.

SLIDE 24

Further Reading

  • A Simple Adaptive Procedure Leading to Correlated Equilibrium – Hart and Mas-Colell, 2000
  • A General Class of Adaptive Strategies – Hart and Mas-Colell, 2001
  • A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria – Greenwald, Jafari, and Marks