  1. CS286r Presentation
     James Burns, March 7, 2006
     • Calibrated Learning and Correlated Equilibrium
       – by Dean Foster and Rakesh Vohra
     • Regret in the On-Line Decision Problem
       – by Dean Foster and Rakesh Vohra

  2. Outline
     • Correlated Equilibria
     • Forecasts and Calibration
     • Calibration and Correlated Equilibria
     • Loss and Regret
     • Existence of a no-regret forecasting scheme
     • Further results and discussion

  3. Correlated Equilibria
     • Motivation
       – It is difficult to find learning rules that guarantee convergence to Nash equilibrium (NE)
       – CE are easy to compute: the incentive constraints are linear in the distribution (see the sketch below)
       – Consistent with the Bayesian perspective (Aumann '87)
       – CE can Pareto dominate NE – is this relevant?
     • Drawback
       – The problem of multiplicity of equilibria is even worse!
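The claim that CE are easy to compute can be made concrete: the incentive constraints are linear in the joint distribution, so verifying a candidate CE is just checking a finite set of inequalities. A minimal sketch in Python; the function and variable names are ours, and the payoff matrices are the standard game-of-chicken numbers often used to illustrate CE (our choice, not from the slides):

```python
import numpy as np

def is_correlated_eq(D, U1, U2, tol=1e-9):
    """Check the CE incentive constraints for a two-player game.

    D  : joint distribution over action profiles, shape (n, m)
    U1 : row player's payoffs, shape (n, m)
    U2 : column player's payoffs, shape (n, m)

    For every recommended action with positive probability, no
    deviation may raise the player's conditional expected payoff.
    """
    n, m = D.shape
    # Row player: sum_y D(x, y) * (U1[x, y] - U1[x', y]) >= 0 for all x, x'
    for x in range(n):
        for xp in range(n):
            if np.dot(D[x], U1[x] - U1[xp]) < -tol:
                return False
    # Column player: sum_x D(x, y) * (U2[x, y] - U2[x, y']) >= 0 for all y, y'
    for y in range(m):
        for yp in range(m):
            if np.dot(D[:, y], U2[:, y] - U2[:, yp]) < -tol:
                return False
    return True

# Game of chicken; the CE putting 1/3 each on (Dare, Chicken),
# (Chicken, Dare) and (Chicken, Chicken) passes the check.
U1 = np.array([[0, 7], [2, 6]])
U2 = np.array([[0, 2], [7, 6]])
D = np.array([[0.0, 1/3], [1/3, 1/3]])
print(is_correlated_eq(D, U1, U2))  # True
```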

  4. Forecasts
     • $f(t) = \{p_1(t), \dots, p_n(t)\}$
     • $p_j(t)$ is the forecasted probability that event $j$ occurs at time $t$
     • Let $N(p, t)$ be the number of times that $f$ generates the forecast $p$ up to time $t$

  5. Calibration
     • Let $\chi(j, t) = 1$ if event $j$ occurs at time $t$
     • We now define $\rho(p, j, t)$ as the empirical frequency of event $j$ given the forecast $p$:
       $$\rho(p, j, t) = \begin{cases} 0 & \text{if } N(p, t) = 0 \\ \dfrac{\sum_{s=1}^{t} I_{f(s)=p}\, \chi(j, s)}{N(p, t)} & \text{otherwise} \end{cases}$$
     • For the forecasting scheme to be calibrated we require:
       $$\lim_{t \to \infty} \sum_{p} |\rho(p, j, t) - p_j| \, \frac{N(p, t)}{t} = 0$$

  6. Example: Forecasting the Weather
     • Pick a forecasting scheme to predict whether or not it will rain
     • $f(t) = p(t)$ is the forecasted probability that it will rain at time $t$
     • Let $N(p, t)$ be the number of times that $f(s) = p$ up to time $t$
     • $\rho(p, t)$ is the frequency with which it rained, given that rain was forecast with probability $p$
     • For the forecasting scheme to be calibrated we require (a numerical sketch follows below):
       $$\lim_{t \to \infty} \sum_{p} |\rho(p, t) - p| \, \frac{N(p, t)}{t} = 0$$
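To make the calibration score concrete, here is a minimal sketch (the function and variable names are ours, not from the paper) that computes $\sum_p |\rho(p,t) - p|\, N(p,t)/t$ for a binary-outcome forecaster:

```python
from collections import defaultdict
import random

def calibration_score(forecasts, outcomes):
    """Calibration score for binary forecasts.

    forecasts : list of forecast probabilities p(t) of rain
    outcomes  : list of 0/1 indicators; 1 means it rained at time t

    Returns sum_p |rho(p, t) - p| * N(p, t) / t, which a calibrated
    scheme drives to zero as t grows.
    """
    t = len(forecasts)
    count = defaultdict(int)   # N(p, t): times forecast p was issued
    rained = defaultdict(int)  # times it rained when forecast was p
    for p, x in zip(forecasts, outcomes):
        count[p] += 1
        rained[p] += x
    return sum(abs(rained[p] / count[p] - p) * count[p] / t for p in count)

# A forecaster that always says 0.5 is calibrated against a fair coin...
random.seed(0)
outcomes = [random.randint(0, 1) for _ in range(10000)]
print(calibration_score([0.5] * 10000, outcomes))    # near 0
# ...but badly miscalibrated against a sequence that never rains.
print(calibration_score([0.5] * 10000, [0] * 10000))  # 0.5
```

Note that calibration alone is a fairly weak property: the constant forecaster passes against any sequence whose overall rain frequency happens to be 0.5.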

  7. How does fictitious play fit in?
     • Fictitious play is a particular forecast scheme that requires the forecast to equal the agent's prior updated by the unconditioned empirical frequency of events
     • This means that if the forecast converges, we have
       $$p_j(t) \to \frac{1}{t} \sum_{s=1}^{t} \chi(j, s)$$
       where $\chi(j, s) = 1$ if event $j$ occurs at time $s$
     • In fictitious play, forecasts converge to empirical frequencies, whereas calibration requires that forecasts converge to empirical frequencies conditioned on the forecasts (see the sketch below)
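To instantiate the contrast, here is a sketch (names ours; the prior weighting is an assumption) of a fictitious-play style forecaster whose forecasts track the unconditional empirical frequency. To score such a forecaster with calibration_score above, one would first round forecasts to a finite grid, since the calibration sum conditions on each distinct forecast value:

```python
def fictitious_play_forecasts(outcomes, prior=0.5, prior_weight=1.0):
    """Fictitious-play style forecast: a prior updated toward the
    unconditional empirical frequency of the event."""
    forecasts = []
    hits, total = prior * prior_weight, prior_weight
    for x in outcomes:
        forecasts.append(hits / total)  # forecast issued before seeing x
        hits, total = hits + x, total + 1
    return forecasts

# The forecasts converge to the raw empirical frequency of rain.
outcomes = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1] * 50
f = fictitious_play_forecasts(outcomes)
print(f[-1], sum(outcomes) / len(outcomes))  # both near 0.6
```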

  8. Calibrated Forecasts and Correlated Equilibrium
     • Consider a two-player game $G$. We can characterize a CE in the set of all CE of the game, $\pi(G)$, by the induced joint distribution over the agents' strategy sets $S(1) \times S(2)$.
     • We denote this joint distribution by $D(x, y)$. Further, let $D_t(x, y)$ be the empirical frequency with which $(x, y)$ is played up to time $t$ (computed as in the sketch below).
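Computing $D_t$ is simple bookkeeping; a minimal sketch (names ours):

```python
from collections import Counter

def empirical_joint(history):
    """Empirical frequency D_t(x, y) of each action profile (x, y)
    in a history of joint plays."""
    t = len(history)
    return {xy: c / t for xy, c in Counter(history).items()}

history = [("a", "b"), ("a", "b"), ("c", "b"), ("a", "d")]
print(empirical_joint(history))
# {('a', 'b'): 0.5, ('c', 'b'): 0.25, ('a', 'd'): 0.25}
```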

  9. • Theorem 1 (VF, 97): If each player uses a forecast that is calibrated against the other's sequence of plays, and then makes a best response to this forecast, then
       $$\min_{D \in \pi(G)} \; \max_{x \in S(1),\, y \in S(2)} |D_t(x, y) - D(x, y)| \to 0$$
     • Important assumption: players use a deterministic tie-breaking rule in making best responses.
     • What does this actually claim?

 10. Outline of Proof
     • $D_t(x, y)$ lies in the $(nm - 1)$-dimensional unit simplex, hence is closed and bounded
     • The Bolzano–Weierstrass theorem implies that $D_t(x, y)$ has a convergent subsequence $D_{t_i}(x, y)$
     • Let $D^*$ be the limit of $D_{t_i}(x, y)$; we show $D^*$ is a correlated equilibrium
     • Basic argument: show that the vector whose $y$-th component is
       $$D^*(x, y) \Big/ \sum_{c \in S(2)} D^*(x, c)$$
       is in the set of mixtures over $S(2)$ for which $x$ is a best response. This holds because the forecasting rule is calibrated.

 11. • Missing! If the theorem does not hold, there must be a subsequence $D_{t_j}(x, y)$ such that
       $$|D_{t_j}(x, y) - D(x, y)| > \epsilon$$
       for some $\epsilon > 0$ and all $t_j$. However, this subsequence must itself have a convergent subsequence that, from above, must converge to a CE, contradicting our assumption.

 12. Calibration and CE, continued
     • Theorem 2 (VF, 97): For almost every game, the set of distributions to which calibrated learning rules can converge is identical to the set of correlated equilibria.
       – The proof is constructive
       – Is this theorem useful? What can it really tell us?

 13. • Theorem 3 (VF, 97): There exists a randomized forecast that player 1 can use such that, no matter what learning rule player 2 uses, player 1 will be calibrated.
       – The proof gives an algorithm for constructing a randomized forecast rule that is calibrated, but it is not intuitive.
       – It is based on a regret measure.
       – Each step in the procedure requires computing an invariant vector of increasing size

 14. • We consider an on-line decision problem (ODP) in which an agent incurs a loss in every period as a function of the decision made and the state of the world in that period. The objective of the agent is to minimize the total loss incurred, e.g. when guessing a sequence of 0s and 1s.

 15. Loss
     • Notation
       – Let $D = \{d_1, d_2, \dots, d_n\}$ be the set of possible decisions at time $t$
       – $L_t^j \le 1$ is the loss incurred at time $t$ from taking action $d_j$
       – We represent a decision-making scheme $S$ by the probability vectors $w_t$, where $w_t^j$ is the probability that decision $j$ is chosen at time $t$
     • Define $L(S)$, the expected loss from using scheme $S$ over $T$ periods:
       $$L(S) = \sum_{t=1}^{T} \sum_{d_j \in D} w_t^j L_t^j$$
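Read as code, $L(S)$ is a single weighted sum; a one-line sketch (array names ours, mirroring the notation above):

```python
import numpy as np

def expected_loss(W, L):
    """L(S) = sum over t <= T and d_j in D of w_t^j * L_t^j.

    W : (T, n) array of play probabilities w_t^j
    L : (T, n) array of per-period losses L_t^j (each <= 1)
    """
    return float(np.sum(W * L))
```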

 16. Regret
     • We now compare the loss under the scheme $S$ with the loss that would have been incurred had a different scheme been used.
     • In particular, we consider the change in loss from replacing an action $d_j$ with another action $d_i$.
     • Given a scheme $S$ that uses decision $d_j$ in period $t$ with probability $w_t^j$, define the pairwise regret of switching from decision $d_j$ to $d_i$ as
       $$R_T^{j \to i}(S) = \sum_{t=1}^{T} w_t^j \left( L_t^j - L_t^i \right)$$

 17. • Define the regret incurred by $S$ from using decision $d_j$ up to time $T$ as
       $$R_T^j(S) = \sum_{i \in D} \left[ R_T^{j \to i}(S) \right]^+ \quad \text{where} \quad \left[ R_T^{j \to i}(S) \right]^+ = \max\{0,\, R_T^{j \to i}(S)\}$$
     • Define the regret from using $S$ as
       $$R_T(S) = \sum_{j \in D} R_T^j(S)$$
     • We say that the scheme $S$ has the no-internal-regret property if its expected regret is small: $R_T(S) = o(T)$ (a computational sketch follows below)
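As a computational companion to these definitions, here is a sketch (all names ours; the 0/1 guessing loss is the example from the earlier slide) that evaluates the pairwise regrets and the internal regret of a given scheme:

```python
import numpy as np

def internal_regret(W, L):
    """Pairwise and internal regret of a randomized scheme.

    W : (T, n) array, W[t, j] = probability the scheme plays d_j at time t
    L : (T, n) array, L[t, j] = loss of decision d_j at time t (<= 1)

    R[j, i] = sum_t W[t, j] * (L[t, j] - L[t, i])  (pairwise regret j -> i)
    Internal regret is the sum of the positive parts of all R[j, i].
    """
    T, n = W.shape
    R = np.zeros((n, n))
    for j in range(n):
        for i in range(n):
            R[j, i] = np.sum(W[:, j] * (L[:, j] - L[:, i]))
    return R, float(np.maximum(R, 0.0).sum())

# Guessing a 0/1 sequence: loss 1 for a wrong guess, 0 for a right one.
rng = np.random.default_rng(0)
seq = rng.integers(0, 2, size=1000)
L = np.column_stack([(seq != 0).astype(float), (seq != 1).astype(float)])
W = np.full((1000, 2), 0.5)   # always guess uniformly at random
R, total = internal_regret(W, L)
print(R, total / 1000)        # time-averaged internal regret
```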

 18. Existence of a No-Regret Scheme
     • Proof for the case where $|D| = 2$
     • We have defined
       $$R_T(S) = \sum_{i \in D} \sum_{j \in D} \left[ R_T^{i \to j}(S) \right]^+$$
     • But $\left[ R_T^{0 \to 0}(S) \right]^+ = \left[ R_T^{1 \to 1}(S) \right]^+ = 0$
     • Goal: to show that the time averages of $\left[ R_T^{1 \to 0}(S) \right]^+$ and $\left[ R_T^{0 \to 1}(S) \right]^+$ go to zero.

 19. • Define the following game:
       – The agent chooses between strategy "0" and strategy "1" in each period
       – Payoffs are vectors: the payoff for using strategy "0" in period $t$ is $(L_t^0 - L_t^1,\, 0)$, and $(0,\, L_t^1 - L_t^0)$ for using strategy "1"
     • Suppose that the agent follows a scheme that chooses strategy "0" with probability $w_t$. Then the time-averaged payoffs at round $T$ are
       $$\left( \frac{\sum_{t=1}^{T} w_t (L_t^0 - L_t^1)}{T},\; \frac{\sum_{t=1}^{T} (1 - w_t)(L_t^1 - L_t^0)}{T} \right)$$
     • Note that we have defined the payoffs such that these time-averaged payoffs are equal to $\left( R_T^{0 \to 1}(S)/T,\; R_T^{1 \to 0}(S)/T \right)$ as defined above.
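This identity can be checked numerically; the following sketch reuses internal_regret from the earlier sketch and verifies that the averaged vector payoffs match $(R_T^{0\to1}(S)/T,\, R_T^{1\to0}(S)/T)$ for arbitrary losses and weights (names ours):

```python
import numpy as np

T = 1000
rng = np.random.default_rng(2)
L = rng.random((T, 2))   # arbitrary bounded losses for actions 0 and 1
w = rng.random(T)        # arbitrary probabilities of playing "0"

avg0 = np.mean(w * (L[:, 0] - L[:, 1]))        # first payoff coordinate
avg1 = np.mean((1 - w) * (L[:, 1] - L[:, 0]))  # second payoff coordinate

W = np.column_stack([w, 1 - w])
R, _ = internal_regret(W, L)  # defined in the sketch after slide 17
assert np.isclose(avg0, R[0, 1] / T) and np.isclose(avg1, R[1, 0] / T)
```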

 20. • Blackwell's Approachability Theorem: a convex set is approachable iff every tangent hyperplane of the set is approachable.
     • Our target set is the nonpositive orthant; that is, we want
       $$R_T^{1 \to 0}(S)/T \le 0 \quad \text{and} \quad R_T^{0 \to 1}(S)/T \le 0$$
     • If the payoff vector is not in the nonpositive orthant, then we consider the line separating the payoff vector from the target set. The line $\ell$ is given by
       $$\left[ R_T^{0 \to 1}(S) \right]^+ x + \left[ R_T^{1 \to 0}(S) \right]^+ y = 0$$

 21. • The agent must choose "0" with a probability $p$ such that the expected payoff vector
       $$\left( p \left( L_{T+1}^0 - L_{T+1}^1 \right),\; (1 - p) \left( L_{T+1}^1 - L_{T+1}^0 \right) \right)$$
       lies on the line $\ell$.
     • This requires:
       $$\left[ R_T^{0 \to 1}(S) \right]^+ p \left( L_{T+1}^0 - L_{T+1}^1 \right) + \left[ R_T^{1 \to 0}(S) \right]^+ (1 - p) \left( L_{T+1}^1 - L_{T+1}^0 \right) = 0$$
     • Which yields:
       $$p = \frac{\left[ R_T^{1 \to 0}(S) \right]^+}{\left[ R_T^{0 \to 1}(S) \right]^+ + \left[ R_T^{1 \to 0}(S) \right]^+}$$
     • Not what is in the paper!
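Assembled into an algorithm, the derivation gives a two-action scheme that plays each action with probability proportional to the positive part of the regret for not having played it, closely related to regret matching. A minimal simulation sketch, reflecting our reading of the construction (all names ours; zero regrets default to a uniform coin):

```python
import numpy as np

def no_regret_play(losses):
    """Run the two-action scheme: at each step choose "0" with
    probability p = [R^{1->0}]^+ / ([R^{0->1}]^+ + [R^{1->0}]^+).

    losses : (T, 2) array of per-period losses for actions 0 and 1.
    Returns the sequence of probabilities w_t of playing "0".
    """
    r01 = 0.0  # R^{0->1}: running sum of w_t * (L0_t - L1_t)
    r10 = 0.0  # R^{1->0}: running sum of (1 - w_t) * (L1_t - L0_t)
    probs = []
    for l0, l1 in losses:
        denom = max(r01, 0.0) + max(r10, 0.0)
        w = max(r10, 0.0) / denom if denom > 0 else 0.5
        probs.append(w)
        r01 += w * (l0 - l1)
        r10 += (1 - w) * (l1 - l0)
    return np.array(probs)

# Against a sequence that is "1" 80% of the time, the scheme learns
# to guess "1", driving the time-averaged regrets toward zero.
rng = np.random.default_rng(1)
seq = (rng.random(5000) < 0.8).astype(int)
losses = np.column_stack([(seq != 0).astype(float), (seq != 1).astype(float)])
w = no_regret_play(losses)
print(w[-10:])  # probabilities of guessing "0" drift toward 0
```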

 22. • We have solved for the $p$ that, in expectation, puts the payoff vector on the line separating it from the target set. By Blackwell's Theorem, the target set is therefore approachable: we have found a no-regret scheme.
     • This result can be generalized to $|D| > 2$, but doing so requires solving a system of equations.

 23. Further Results
     • The existence of a no-regret scheme implies the existence of an almost calibrated forecast scheme
     • If all agents in a game play a no-regret strategy, play will converge to correlated equilibrium.

 24. Further Reading
     • A Simple Adaptive Procedure Leading to Correlated Equilibrium – Hart and Mas-Colell, 2000
     • A General Class of Adaptive Strategies – Hart and Mas-Colell, 2001
     • A General Class of No-Regret Learning Algorithms and Game-Theoretic Equilibria – Greenwald, Jafari and Marks
