 
CSC304 Lecture 6
Game Theory: Minimax Theorem via Expert Learning
CSC304 - Nisarg Shah
2-Player Zero-Sum Games
• Reward of P2 = -(Reward of P1)
• Matrix $A$ s.t. $A_{i,j}$ is the reward to P1 when P1 chooses her $i^{\text{th}}$ action and P2 chooses her $j^{\text{th}}$ action
• Mixed strategy profile $(x_1, x_2)$ → reward to P1 is $x_1^T A x_2$
• Minimax Theorem: For all $A$,
  $\max_{x_1} \min_{x_2} x_1^T A x_2 = \min_{x_2} \max_{x_1} x_1^T A x_2$
• Proof through online expert learning!
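The following is a quick numerical sanity check of the theorem statement, not a proof. It grid-searches mixed strategies in a small 2×2 game. The matrix `A` and the grid resolution are hypothetical choices. For this particular matrix, both sides equal 0.48. P1 attains this by mixing (0.3, 0.7) over her rows, and P2 attains it by mixing (0.4, 0.6) over her columns.

```python
# Sketch: compare max-min and min-max for a 2x2 zero-sum game by grid search.
# A[i][j] is the (hypothetical) reward to P1 for row i against column j.


def max_min(A, steps=1000):
    """max over P1's mixed strategy (p, 1-p) of P1's worst-case expected reward."""
    best = float("-inf")
    for k in range(steps + 1):
        p = k / steps
        # For a fixed x1 the reward is linear in x2, so a pure column is a best response.
        worst = min(p * A[0][j] + (1 - p) * A[1][j] for j in range(2))
        best = max(best, worst)
    return best


def min_max(A, steps=1000):
    """min over P2's mixed strategy (q, 1-q) of P1's best-response expected reward."""
    best = float("inf")
    for k in range(steps + 1):
        q = k / steps
        worst = max(q * A[i][0] + (1 - q) * A[i][1] for i in range(2))
        best = min(best, worst)
    return best


A = [[0.9, 0.2], [0.3, 0.6]]
print(max_min(A), min_max(A))  # both approximately 0.48
```

The grid search can only under-estimate the left side and over-estimate the right side. The two values still agree here because the optimal mixes (0.3 and 0.4) lie on the grid.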
Online Expert Learning
• Setup:
  • On each day, we want to predict if a stock price will go up or down
  • $n$ experts provide their predictions every day
    o Each expert says either up or down
  • Based on their advice, we make a final prediction
  • At the end of the day, we learn if our prediction was correct (reward = 1) or wrong (reward = 0)
• Goal:
  • Do almost as well as the best expert in hindsight!
CSC2420 – Allan Borodin & Nisarg Shah
Online Expert Learning
• Notation
  • $n$ = #experts
  • Predictions and ground truth: 1 or 0
  • $m_i^{(T)}$ = #mistakes of expert $i$ in first $T$ steps
  • $M^{(T)}$ = #mistakes of the algorithm in first $T$ steps
• Simplest idea:
  • Keep a weight for each expert
  • Use weighted majority of experts to make prediction
  • Decrease the weight of an expert whenever the expert makes a mistake
Online Expert Learning
• Weighted Majority:
  • Fix $\eta \le 1/2$.
  • Start with $w_i^{(1)} = 1$.
  • In time step $t$, predict 1 if the total weight of experts predicting 1 is larger than the total weight of experts predicting 0, and vice versa.
  • At the end of time step $t$, set $w_i^{(t+1)} \leftarrow w_i^{(t)} \cdot (1-\eta)$ for every expert $i$ that made a mistake.
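Below is a minimal Python sketch of the deterministic rule. It assumes predictions and outcomes are given as 0/1 lists. The function name `weighted_majority` is illustrative, and so is breaking ties toward 0; the slide leaves ties unspecified.

```python
import math


def weighted_majority(expert_preds, outcomes, eta=0.5):
    """Deterministic weighted majority (sketch).

    expert_preds[t][i] is expert i's 0/1 prediction in round t, and
    outcomes[t] is the true 0/1 outcome. Returns (M, m): the algorithm's
    number of mistakes and the list of each expert's mistakes.
    """
    if not 0 < eta <= 0.5:
        raise ValueError("eta must lie in (0, 1/2]")
    n = len(expert_preds[0])
    w = [1.0] * n  # w_i^(1) = 1
    M, m = 0, [0] * n
    for preds, y in zip(expert_preds, outcomes):
        weight_one = sum(wi for wi, p in zip(w, preds) if p == 1)
        weight_zero = sum(wi for wi, p in zip(w, preds) if p == 0)
        guess = 1 if weight_one > weight_zero else 0  # ties broken toward 0 (assumption)
        if guess != y:
            M += 1
        for i, p in enumerate(preds):
            if p != y:  # expert i erred: shrink its weight by (1 - eta)
                m[i] += 1
                w[i] *= 1 - eta
    return M, m


preds = [[1, 0, 1], [0, 0, 1], [1, 1, 1], [0, 1, 0]]
truth = [1, 1, 1, 0]
M, m = weighted_majority(preds, truth)
# The theorem holds for every expert i, so compare against the smallest bound.
bound = min(2 * (1 + 0.5) * mi + 2 * math.log(3) / 0.5 for mi in m)
print(M, m, M <= bound)  # 1 [1, 3, 0] True
```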
Online Expert Learning
• Theorem: For every $i$ and $T$,
  $M^{(T)} \le 2(1+\eta)\, m_i^{(T)} + \frac{2 \ln n}{\eta}$
• Proof:
  • Consider a "potential function" $\Phi^{(t)} = \sum_i w_i^{(t)}$.
  • If the algorithm makes a mistake in round $t$, at least half of the weight decreases by a factor of $1-\eta$:
    $\Phi^{(t+1)} \le \Phi^{(t)} \left(\frac{1}{2} + \frac{1}{2}(1-\eta)\right) = \Phi^{(t)} \left(1 - \frac{\eta}{2}\right)$
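Online Expert Learning
• Theorem: For every $i$ and $T$,
  $M^{(T)} \le 2(1+\eta)\, m_i^{(T)} + \frac{2 \ln n}{\eta}$
• Proof:
  • $\Phi^{(1)} = n$
  • Thus: $\Phi^{(T+1)} \le n \left(1 - \frac{\eta}{2}\right)^{M^{(T)}}$ (the potential never increases, and it shrinks by a factor of $1 - \eta/2$ in each of the $M^{(T)}$ rounds with a mistake).
  • Weight of expert $i$: $w_i^{(T+1)} = (1-\eta)^{m_i^{(T)}}$
  • Use $\Phi^{(T+1)} \ge w_i^{(T+1)}$ and $-\ln(1-\eta) \le \eta + \eta^2$ (as $\eta \le 1/2$).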
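Here is the remaining algebra, spelled out. Take logarithms of $w_i^{(T+1)} \le \Phi^{(T+1)}$. Then apply $\ln(1-\eta/2) \le -\eta/2$, which is the elementary inequality $\ln(1-x) \le -x$, together with the stated bound on $-\ln(1-\eta)$:

```latex
\begin{aligned}
(1-\eta)^{m_i^{(T)}} = w_i^{(T+1)} \le \Phi^{(T+1)} &\le n\left(1-\tfrac{\eta}{2}\right)^{M^{(T)}} \\
m_i^{(T)} \ln(1-\eta) &\le \ln n + M^{(T)} \ln\left(1-\tfrac{\eta}{2}\right) \le \ln n - \tfrac{\eta}{2} M^{(T)} \\
\tfrac{\eta}{2} M^{(T)} &\le \ln n - m_i^{(T)} \ln(1-\eta) \le \ln n + (\eta + \eta^2)\, m_i^{(T)} \\
M^{(T)} &\le 2(1+\eta)\, m_i^{(T)} + \frac{2 \ln n}{\eta}
\end{aligned}
```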
Online Expert Learning
• Beautiful!
  • Comparison to the best expert in hindsight.
  • At most (roughly) twice as many mistakes + small additive term
  • In the worst case over how experts make mistakes
    o No statistical assumptions.
  • Simple policy to implement.
• It can be shown that this bound is tight for any deterministic algorithm: no deterministic algorithm can improve on the factor of 2.
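To see where the factor of 2 comes from, here is a small sketch of the usual adversary argument. It is run against weighted majority with $\eta = 1/2$, which is an assumed setting. There are two experts that always disagree, and the adversary reveals whichever outcome the algorithm did not predict. The algorithm errs in every round. The two experts' mistakes sum to $T$, so the better one errs at most $T/2$ times. The function name `adversarial_run` is hypothetical. The same adversary works against any deterministic rule, because it only needs to know the prediction in advance.

```python
def adversarial_run(T, eta=0.5):
    """Weighted majority vs. an adversary; expert 0 always says 0, expert 1 always says 1.

    Returns (algorithm mistakes, [mistakes of expert 0, mistakes of expert 1]).
    """
    w = [1.0, 1.0]
    expert_mistakes = [0, 0]
    alg_mistakes = 0
    for _ in range(T):
        guess = 1 if w[1] > w[0] else 0  # weighted majority, ties broken toward 0
        truth = 1 - guess                # adversary reveals the other outcome
        alg_mistakes += 1                # so the algorithm is wrong every round
        wrong_expert = 1 - truth         # expert k always predicts k
        expert_mistakes[wrong_expert] += 1
        w[wrong_expert] *= 1 - eta
    return alg_mistakes, expert_mistakes


print(adversarial_run(1000))  # (1000, [500, 500])
```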
Randomized Weighted Majority
• Randomization → beat the factor of 2
• Simple change:
  • At the beginning of round $t$, let
    o $\Phi_1^{(t)}$ = total weight of experts predicting 1
    o $\Phi_0^{(t)}$ = total weight of experts predicting 0
  • Deterministic: predict 1 if $\Phi_1^{(t)} > \Phi_0^{(t)}$, 0 otherwise.
  • Randomized: predict 1 with probability $\frac{\Phi_1^{(t)}}{\Phi_1^{(t)} + \Phi_0^{(t)}}$, 0 with the remaining probability.
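Here is a sketch of the randomized prediction rule. It assumes weights and 0/1 predictions are parallel lists. `prob_predict_one` and `randomized_predict` are hypothetical names. The `rng` argument exists only to make runs reproducible.

```python
import random


def prob_predict_one(weights, preds):
    """Phi_1 / (Phi_1 + Phi_0): probability that the randomized rule outputs 1."""
    phi1 = sum(w for w, p in zip(weights, preds) if p == 1)
    phi0 = sum(w for w, p in zip(weights, preds) if p == 0)
    return phi1 / (phi1 + phi0)


def randomized_predict(weights, preds, rng=random):
    """Output 1 with probability Phi_1 / (Phi_1 + Phi_0), and 0 otherwise."""
    return 1 if rng.random() < prob_predict_one(weights, preds) else 0


weights, preds = [0.5, 1.0, 0.25], [1, 0, 1]
rng = random.Random(0)
draws = [randomized_predict(weights, preds, rng) for _ in range(100_000)]
# Exact probability 0.75 / 1.75 = 3/7, and an empirical frequency close to it.
print(prob_predict_one(weights, preds), sum(draws) / len(draws))
```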
Randomized Weighted Majority
• Equivalently:
  • "Pick an expert with probability proportional to weight, and go with their prediction"
  • Pr[picking expert $i$ in step $t$] $= p_i^{(t)} = \frac{w_i^{(t)}}{\Phi^{(t)}}$
• Let $m_i^{(t)} = 1$ if expert $i$ makes a mistake in step $t$, 0 otherwise.
• Algorithm makes a mistake in round $t$ with probability $\sum_i p_i^{(t)} m_i^{(t)} = \mathbf{p}^{(t)} \cdot \mathbf{m}^{(t)}$
• $E[\#\text{mistakes after } T \text{ rounds}] = \sum_{t=1}^{T} \mathbf{p}^{(t)} \cdot \mathbf{m}^{(t)}$
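The "pick an expert" view can be checked directly. The probability that the sampled expert is wrong is $\sum_i p_i^{(t)} m_i^{(t)}$. This matches the probability that the weighted coin of the randomized rule lands on the wrong answer. Here is a sketch with hypothetical helper names:

```python
def pick_probabilities(weights):
    """p_i^(t) = w_i^(t) / Phi^(t)."""
    total = sum(weights)
    return [w / total for w in weights]


def mistake_probability(weights, preds, truth):
    """Probability that following a weight-proportional random expert is wrong."""
    p = pick_probabilities(weights)
    m = [1 if pred != truth else 0 for pred in preds]  # m_i^(t)
    return sum(pi * mi for pi, mi in zip(p, m))         # p^(t) . m^(t)


weights, preds = [0.5, 1.0, 0.25], [1, 0, 1]
# Truth 1: only the expert voting 0 is wrong, so the probability is 1.0 / 1.75 = 4/7.
print(mistake_probability(weights, preds, truth=1))
```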
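Randomized Weighted Majority
$\Phi^{(t+1)} = \sum_i w_i^{(t+1)} = \sum_i w_i^{(t)} \left(1 - \eta\, m_i^{(t)}\right) = \Phi^{(t)} - \eta\, \Phi^{(t)} \sum_i p_i^{(t)} m_i^{(t)}$
$\qquad = \Phi^{(t)} \left(1 - \eta\, \mathbf{p}^{(t)} \cdot \mathbf{m}^{(t)}\right) \le \Phi^{(t)} \exp\left(-\eta\, \mathbf{p}^{(t)} \cdot \mathbf{m}^{(t)}\right)$
• Applying iteratively: $\Phi^{(T+1)} \le n \cdot \exp\left(-\eta \cdot E[\#\text{mistakes}]\right)$
• But $\Phi^{(T+1)} \ge w_i^{(T+1)} = (1-\eta)^{m_i^{(T)}}$
• QED!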
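The last two bullets combine into an explicit bound. Write $E[M]$ for the expected number of mistakes, and use $\ln(1-\eta) \ge -(\eta + \eta^2)$ for $\eta \le 1/2$, the same inequality as in the deterministic proof:

```latex
\begin{aligned}
(1-\eta)^{m_i^{(T)}} \le \Phi^{(T+1)} &\le n\, e^{-\eta\, E[M]} \\
m_i^{(T)} \ln(1-\eta) &\le \ln n - \eta\, E[M] \\
\eta\, E[M] &\le \ln n + (\eta + \eta^2)\, m_i^{(T)} \\
E[M] &\le (1+\eta)\, m_i^{(T)} + \frac{\ln n}{\eta}
\end{aligned}
```

Its additive term $\frac{\ln n}{\eta}$ is at most $\frac{2 \ln n}{\eta}$, so this also gives the weaker form $(1+\eta)\, m_i^{(T)} + \frac{2 \ln n}{\eta}$.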
Randomized Weighted Majority
• Theorem: For every $i$ and $T$, the expected number of mistakes $M^{(T)}$ of randomized weighted majority in the first $T$ rounds satisfies
  $M^{(T)} \le (1+\eta)\, m_i^{(T)} + \frac{2 \ln n}{\eta}$
• Setting $\eta = \sqrt{\ln n / T}$ (valid once $T \ge 4 \ln n$, so that $\eta \le 1/2$): $M^{(T)} \le m_i^{(T)} + O(\sqrt{T \ln n})$
• We say that the algorithm has $O(\sqrt{T \ln n})$ regret, i.e., its excess over the best expert's number of mistakes
  • Sublinear regret in $T$
  • Regret per round → 0 as $T \to \infty$
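A small simulation can confirm the bound on synthetic data. The expert accuracies, the seed, and the sizes $n = 10$, $T = 2000$ are all hypothetical choices. The code computes the expected number of mistakes exactly, as $\sum_t \mathbf{p}^{(t)} \cdot \mathbf{m}^{(t)}$, rather than by sampling predictions.

```python
import math
import random


def rwm_expected_mistakes(expert_preds, outcomes, eta):
    """Randomized weighted majority: returns (expected mistakes, per-expert mistakes)."""
    n = len(expert_preds[0])
    w = [1.0] * n
    expected, mistakes = 0.0, [0] * n
    for preds, y in zip(expert_preds, outcomes):
        total = sum(w)                                            # Phi^(t)
        m = [1 if p != y else 0 for p in preds]                   # m_i^(t)
        expected += sum(wi * mi for wi, mi in zip(w, m)) / total  # p^(t) . m^(t)
        w = [wi * (1 - eta * mi) for wi, mi in zip(w, m)]
        s = sum(w)
        w = [wi / s for wi in w]  # rescaling avoids underflow; p^(t) is unchanged
        mistakes = [a + b for a, b in zip(mistakes, m)]
    return expected, mistakes


rng = random.Random(1)
n, T = 10, 2000
truth = [rng.randint(0, 1) for _ in range(T)]
acc = [0.5 + 0.04 * i for i in range(n)]  # hypothetical accuracy of expert i
preds = [[y if rng.random() < a else 1 - y for a in acc] for y in truth]

eta = math.sqrt(math.log(n) / T)  # eta <= 1/2 here because T >= 4 ln n
expected, m = rwm_expected_mistakes(preds, truth, eta)
best = min(m)
bound = (1 + eta) * best + 2 * math.log(n) / eta
print(f"E[mistakes] = {expected:.1f}, best expert = {best}, bound = {bound:.1f}")
```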
How is this related to the minimax theorem?!!
Minimax via Regret Learning
• Recall:
  $V_R = \max_{x_1} \min_{x_2} x_1^T A x_2$
  $V_C = \min_{x_2} \max_{x_1} x_1^T A x_2$
• Row player's guarantee: my reward ≥ $V_R$
• Column player's guarantee: row player's reward ≤ $V_C$
• Hence, $V_R \le V_C$ (trivial direction)
• To prove: $V_R = V_C$
Minimax via Regret Learning
• Scale values in $A$ to be in $[0,1]$.
  • Without loss of generality: subtracting a constant $c$ from every entry and dividing by $s > 0$ maps $V_R$ to $(V_R - c)/s$ and $V_C$ to $(V_C - c)/s$, so whether they are equal is unaffected.
• Suppose for contradiction that $V_R = V_C - \delta$, $\delta > 0$.
• Suppose row player $R$ uses randomized weighted majority (experts = row player's actions)
• In each round, column player $C$ responds by choosing her action that minimizes the row player's expected reward.
Minimax via Regret Learning
• After $T$ iterations, the row player's total expected reward $V$ satisfies:
  • $V \le T \cdot V_R$ (in each round, the column player's best response holds the row player to at most $V_R$)
  • $V \ge$ "reward of best action in hindsight" $- O(\sqrt{T \ln n})$
    o Reward of best action in hindsight $\ge T \cdot V_C$.
    o Why?
    o Suppose column player plays action $j_t$ in round $t$
    o For any fixed row action, this gives the same total reward as playing the mixed strategy $s$ in each round
      • $s$ picks $t \in \{1, \ldots, T\}$ at random and plays $j_t$
    o By definition of $V_C$, $s$ cannot ensure that row player's reward is less than $V_C$
      • Then, there is an action of row player with E[reward] at least $V_C$ against $s$, i.e., with total reward at least $T \cdot V_C$ over the $T$ rounds
Minimax via Regret Learning
• After $T$ iterations, the row player's total expected reward $V$ satisfies:
  • $V \le T \cdot V_R$
  • $V \ge T \cdot V_C - O(\sqrt{T \ln n})$
• $T \cdot V_R = T \cdot (V_C - \delta) \ge T \cdot V_C - O(\sqrt{T \ln n})$
• $\Rightarrow \delta T \le O(\sqrt{T \ln n})$
• Contradiction for sufficiently large $T$, since the left side grows linearly in $T$ and the right side only as $\sqrt{T}$.
• QED!
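The dynamics in this proof can be simulated. In the sketch below, the row player runs randomized weighted majority with losses $1 - A_{i,j}$, which lie in $[0,1]$ after the rescaling. The update $w_i \leftarrow w_i (1 - \eta \cdot \text{loss})$ is an assumed extension of the 0/1-mistake version to fractional losses. The column player best-responds in each round. The matrix, $T$, and $\eta = \sqrt{\ln n / T}$ are illustrative assumptions. For this matrix $V_R = V_C = 0.48$. So the row player's average reward should approach 0.48 from below, while the best row action in hindsight earns at least 0.48 per round.

```python
import math


def rwm_vs_best_response(A, T):
    """Row: randomized weighted majority over its actions with loss 1 - A[i][j].
    Column: best response to the row player's current mixed strategy.
    Returns (row's average expected reward, best row action's average reward in hindsight).
    """
    n, k = len(A), len(A[0])
    eta = math.sqrt(math.log(n) / T)
    p = [1.0 / n] * n  # normalized weights = row player's mixed strategy
    total_reward = 0.0
    col_counts = [0] * k
    for _ in range(T):
        rewards = [sum(p[i] * A[i][c] for i in range(n)) for c in range(k)]
        j = min(range(k), key=rewards.__getitem__)  # column minimizes row's reward
        total_reward += rewards[j]
        col_counts[j] += 1
        w = [p[i] * (1 - eta * (1 - A[i][j])) for i in range(n)]
        s = sum(w)
        p = [wi / s for wi in w]
    best = max(sum(A[i][c] * col_counts[c] for c in range(k)) for i in range(n))
    return total_reward / T, best / T


A = [[0.9, 0.2], [0.3, 0.6]]  # hypothetical game with value 0.48
avg, hindsight = rwm_vs_best_response(A, 20000)
print(round(avg, 3), round(hindsight, 3))
```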
Yao's Minimax Principle
• Goal:
  • Provide a lower bound on the expected running time that any randomized algorithm for a problem can achieve in the worst case over problem instances
• Note:
  • Expectation (in running time) is over randomization of the algorithm
  • The problem instance (worst case) is chosen to maximize this expected running time
Yao's Minimax Principle
• Notation
  • Capital letters for "randomized", small for deterministic
  • $a$: a deterministic algorithm
  • $A$: a randomized algorithm
  • $p$: a problem instance
  • $P$: a distribution over problem instances
  • $T$: running time
• We are interested in $\min_A \max_p T(A, p)$
Yao's Minimax Principle
[Figure: table of running times, indexed by deterministic algorithms and problem instances]
Yao's Minimax Principle
• Minimax Theorem: $\min_A \max_p T(A, p) = \max_P \min_a T(a, P)$
• So:
  • To lower bound the E[running time] of any randomized algorithm $A$ on its worst-case instance $p$ by a quantity $Q$…
  • Choose a distribution $P$ over problem instances, and show that every det. algorithm $a$ has expected running time at least $Q$ on problems drawn from $P$
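Here is a toy instance of the principle. It uses a hypothetical 2×2 table of running times `RUNTIME[a][p]` for two deterministic algorithms and two instances. The uniform distribution over instances gives a lower bound of 2.5 on every randomized algorithm's worst-case expected running time. A grid search over randomized algorithms, taken here as mixtures of the two deterministic ones, shows that 2.5 is achieved by running algorithm 0 with probability 1/4.

```python
# Hypothetical running times: RUNTIME[a][p] for deterministic algorithm a on instance p.
RUNTIME = [[1.0, 4.0], [3.0, 2.0]]


def randomized_worst_case(q):
    """Worst-case expected time of running algorithm 0 w.p. q and algorithm 1 otherwise."""
    return max(q * RUNTIME[0][p] + (1 - q) * RUNTIME[1][p] for p in range(2))


def yao_lower_bound(r):
    """min over deterministic algorithms of expected time when instance 0 has probability r."""
    return min(r * RUNTIME[a][0] + (1 - r) * RUNTIME[a][1] for a in range(2))


best_randomized = min(randomized_worst_case(k / 100) for k in range(101))
print(yao_lower_bound(0.5), best_randomized)  # 2.5 2.5
```

Any choice of distribution gives a valid lower bound, but only a well-chosen one, here the uniform distribution, makes the bound tight.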