Today: Experts/Zero-Sum Games Equilibrium. Boosting and Experts. (PowerPoint PPT Presentation)



SLIDE 1

Today

Experts/Zero-Sum Games Equilibrium. Boosting and Experts. Routing and Experts.

SLIDE 2

Two person zero sum games.

m×n payoff matrix A. Row mixed strategy: x = (x_1,...,x_m). Column mixed strategy: y = (y_1,...,y_n). Payoff for strategy pair (x,y):

p(x,y) = x^T A y = ∑_i ∑_j x_i a_{i,j} y_j = ∑_j ∑_i x_i a_{i,j} y_j.

Recall: the row player minimizes, the column player maximizes. Equilibrium pair (x*,y*)?

(x*)^T A y* = max_y (x*)^T A y = min_x x^T A y*.

(No better column strategy, no better row strategy.)
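Sanity check: the double sum above can be evaluated directly. A minimal sketch in Python (the matrix and the strategies are arbitrary illustrations, not from the slides):

```python
def payoff(A, x, y):
    """p(x, y) = x^T A y = sum_i sum_j x_i * a_ij * y_j."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

A = [[0.0, 1.0, 0.5],      # arbitrary 2x3 payoff matrix
     [1.0, 0.0, 0.5]]
x = [0.5, 0.5]             # row mixed strategy
y = [0.25, 0.25, 0.5]      # column mixed strategy
print(payoff(A, x, y))     # 0.5
```

Summing over i first or j first gives the same number, which is exactly the identity on the slide.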

SLIDE 3

Equilibrium.

Equilibrium pair (x*,y*)?

p(x*,y*) = (x*)^T A y* = max_y (x*)^T A y = min_x x^T A y*.

(No better column strategy, no better row strategy.)
No row is better: min_i A^(i) · y* = (x*)^T A y*, where A^(i) is the ith row.
No column is better: max_j (A^T)^(j) · x* = (x*)^T A y*.

SLIDE 4

Best Response

Column goes first: find y where the best row response is not too low.

R = max_y min_x (x^T A y).

Note: the minimizing x can be taken pure, of the form (0,0,...,1,...,0). Example: Roshambo. Value of R?

Row goes first: find x where the best column response is not too high.

C = min_x max_y (x^T A y).

Again: the maximizing y can be taken of the form (0,0,...,1,...,0). Example: Roshambo. Value of C?
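Why can the inner player be taken pure? x^T A y is linear in x for fixed y, so the inner minimum is attained at a basis vector. A quick check on Roshambo (scaling the payoffs to [0,1] is my assumption):

```python
# Roshambo (rock, paper, scissors) with payoffs scaled to [0, 1]:
# A[i][j] = what the row player pays when playing i against column j.
A = [[0.5, 1.0, 0.0],   # rock
     [0.0, 0.5, 1.0],   # paper
     [1.0, 0.0, 0.5]]   # scissors

def row_payoffs(A, y):
    """(A y)_i: row i's expected payment against column strategy y."""
    return [sum(aij * yj for aij, yj in zip(row, y)) for row in A]

# Against a fixed column strategy y, x^T A y is linear in x, so the
# minimizing row strategy can be taken pure: all mass on the cheapest row.
y = [0.2, 0.3, 0.5]
pays = row_payoffs(A, y)
best_row = min(range(3), key=lambda i: pays[i])
print(best_row, min(pays))   # row 0 (rock), paying about 0.4
```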

SLIDE 5

Duality.

R = max_y min_x (x^T A y).    C = min_x max_y (x^T A y).

Weak Duality: R ≤ C. Proof: better to go second.

At an equilibrium (x*,y*) with payoff v:
the row payoffs (Ay*) are all ≥ v ⇒ R ≥ v;
the column payoffs ((x*)^T A) are all ≤ v ⇒ v ≥ C.
⇒ R ≥ C. So an equilibrium ⇒ R = C!

Strong Duality: there is an equilibrium point, and R = C! It doesn't matter who plays first!
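For Roshambo, weak duality already pins down the value numerically: uniform y gives R ≥ 1/2, uniform x gives C ≤ 1/2, and R ≤ C squeezes R = C = 1/2. A sketch (the [0,1] scaling of the payoffs is my assumption):

```python
A = [[0.5, 1.0, 0.0],   # Roshambo, payoffs scaled to [0, 1]
     [0.0, 0.5, 1.0],
     [1.0, 0.0, 0.5]]
u = [1 / 3, 1 / 3, 1 / 3]     # uniform mixed strategy

# R = max_y min_i (A y)_i:  playing y = u gives a lower bound on R.
Au = [sum(aij * uj for aij, uj in zip(row, u)) for row in A]
R_lower = min(Au)

# C = min_x max_j (A^T x)_j:  playing x = u gives an upper bound on C.
Atu = [sum(A[i][j] * u[i] for i in range(3)) for j in range(3)]
C_upper = max(Atu)

# R_lower <= R <= C <= C_upper, and both bounds equal 1/2: R = C = 1/2.
print(R_lower, C_upper)
```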

SLIDE 6

Proof of Equilibrium.

  • Later. Still later...

Approximate equilibrium: let C(x) = max_y x^T A y and R(y) = min_x x^T A y. Always R(y) ≤ C(x). For a strategy pair (x,y), equilibrium means R(y) = C(x), i.e., C(x) − R(y) = 0.

Approximate equilibrium: C(x) − R(y) ≤ ε. Combined with R(y) ≤ C(x), this says: the response y to x is within ε of the best response, and the response x to y is within ε of the best response.
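Since the outer max and min are attained at pure strategies, the gap C(x) − R(y) is directly computable for any strategy pair. A sketch (the strategies below are arbitrary illustrations):

```python
def C(A, x):
    """C(x) = max_y x^T A y, attained at a pure column."""
    m, n = len(A), len(A[0])
    return max(sum(A[i][j] * x[i] for i in range(m)) for j in range(n))

def R(A, y):
    """R(y) = min_x x^T A y, attained at a pure row."""
    return min(sum(aij * yj for aij, yj in zip(row, y)) for row in A)

A = [[0.5, 1.0, 0.0],   # Roshambo again, payoffs in [0, 1]
     [0.0, 0.5, 1.0],
     [1.0, 0.0, 0.5]]
x = [0.4, 0.3, 0.3]
y = [1 / 3, 1 / 3, 1 / 3]
gap = C(A, x) - R(A, y)   # = 0 exactly when (x, y) is an equilibrium pair
print(gap)   # small but nonzero: this pair is near, not at, equilibrium
```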

SLIDE 7

Proof of approximate equilibrium.

How? (A) Using geometry. (B) Using a fixed point theorem. (C) Using multiplicative weights. (D) By the skin of my teeth. Answer: (C) ... and (D). Not hard. Even easy. Still, head scratching happens.

SLIDE 8

Games and experts

Again: find (x*,y*) such that (max_y (x*)^T A y) − (min_x x^T A y*) ≤ ε, i.e., C(x*) − R(y*) ≤ ε.

Experts framework: n experts, T days, L* = total loss of the best expert. The Multiplicative Weights method yields loss L where

L ≤ (1+ε)L* + (log n)/ε.
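The guarantee above comes from the standard update w_i ← w_i (1−ε)^{loss_i}. A minimal sketch of the experts loop (the random loss table is an arbitrary illustration):

```python
import math, random

def mw_experts(losses, eps):
    """Multiplicative weights over a T x n table of losses in [0, 1].
    Returns (algorithm's expected loss L, best expert's loss L*)."""
    n = len(losses[0])
    w = [1.0] * n
    alg_loss = 0.0
    for day in losses:
        total = sum(w)
        p = [wi / total for wi in w]          # play the weighted distribution
        alg_loss += sum(pi * li for pi, li in zip(p, day))
        w = [wi * (1 - eps) ** li for wi, li in zip(w, day)]  # penalize loss
    best = min(sum(day[i] for day in losses) for i in range(n))
    return alg_loss, best

random.seed(0)
T, n, eps = 200, 5, 0.1
losses = [[random.random() for _ in range(n)] for _ in range(T)]
L, Lstar = mw_experts(losses, eps)
print(L, Lstar)   # the guarantee: L <= (1 + eps) * Lstar + ln(n) / eps
```

The bound holds deterministically for any loss sequence with ε ≤ 1/2, which is what the game arguments below rely on.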

SLIDE 9

Games and Experts.

Assume: A has payoffs in [0,1]. For T = (log n)/ε² days:

1) The m pure row strategies are the experts. Use multiplicative weights to produce a row distribution; let x_t be the distribution (row strategy) on day t.
2) Each day, the adversary plays the best column response to x_t: choose the column of A that maximizes the row's expected loss. Let y_t be the indicator vector for this column.

Let y* = (1/T) ∑_t y_t and x* = argmin_{x_t} x_t^T A y_t.

Claim: (x*, y*) is 2ε-optimal for matrix A.
Proof idea: x* is the best of the days on which the best column response was chosen; clearly good for the row. The column's best response to y* is at least its average response against the x_t, so the total loss L is at least T times the column payoff C(x*). By the MW analysis, L is roughly at most the best row's loss L*, and L* ≤ T·R(y*). Combine bounds. Done!
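The two steps above can be run as-is on a small matrix. A sketch on Roshambo (the [0,1] scaling is my assumption), checking the 2ε claim at the end:

```python
import math

A = [[0.5, 1.0, 0.0],   # Roshambo: entry = what the row pays, in [0, 1]
     [0.0, 0.5, 1.0],
     [1.0, 0.0, 0.5]]
m, n = 3, 3
eps = 0.1
T = int(math.log(m) / eps**2) + 1

w = [1.0] * m                       # one expert per pure row strategy
y_count = [0.0] * n
best_val, x_star = float("inf"), None
for _ in range(T):
    total = sum(w)
    x = [wi / total for wi in w]    # row distribution from MW
    # Adversary: best column response, maximizing the row's expected loss.
    col_pay = [sum(A[i][j] * x[i] for i in range(m)) for j in range(n)]
    j = max(range(n), key=lambda c: col_pay[c])
    y_count[j] += 1.0
    if col_pay[j] < best_val:       # x* = argmin over days of x_t^T A y_t
        best_val, x_star = col_pay[j], x
    # MW update: row i's loss today is A[i][j].
    w = [wi * (1 - eps) ** A[i][j] for i, wi in enumerate(w)]

y_star = [c / T for c in y_count]
C = max(sum(A[i][j] * x_star[i] for i in range(m)) for j in range(n))
R = min(sum(A[i][j] * y_star[j] for j in range(n)) for i in range(m))
print(C - R)   # the claim: at most 2 * eps
```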

SLIDE 10

Approximate Equilibrium!

Experts: x_t is the strategy on day t, y_t the best column against x_t. Let y* = (1/T) ∑_t y_t and x* = argmin_{x_t} x_t^T A y_t.

Claim: (x*, y*) is 2ε-optimal for matrix A.

Column payoff: C(x*) = max_y (x*)^T A y. The loss on day t is x_t^T A y_t ≥ C(x*) by the choice of x*. Thus the algorithm's loss L is ≥ T·C(x*).

Best expert: L* is the best row against all the columns played: the best row against ∑_t A y_t, and T y* = ∑_t y_t → the best row against T A y*. → L* ≤ T·R(y*).

Multiplicative Weights: L ≤ (1+ε)L* + (ln n)/ε.

T·C(x*) ≤ (1+ε)·T·R(y*) + (ln n)/ε
→ C(x*) ≤ (1+ε)R(y*) + (ln n)/(εT)
→ C(x*) − R(y*) ≤ εR(y*) + (ln n)/(εT).

With T = (ln n)/ε² and R(y*) ≤ 1:
→ C(x*) − R(y*) ≤ 2ε.

SLIDE 11

Approximate Equilibrium: notes!

Experts: x_t is the strategy on day t, y_t the best column against x_t. Now let x* = (1/T) ∑_t x_t and y* = (1/T) ∑_t y_t.

Claim: (x*, y*) is 2ε-optimal for matrix A.

Column payoff: C(x*) = max_y (x*)^T A y. Let y_r be the best response to x*. On day t, y_t is the best response to x_t → x_t^T A y_t ≥ x_t^T A y_r. Algorithm loss: L = ∑_t x_t^T A y_t ≥ ∑_t x_t^T A y_r = T·(x*)^T A y_r = T·C(x*).

Best expert: L* is the best row against all the columns played: the best row against ∑_t A y_t, and T y* = ∑_t y_t → the best row against T A y*. → L* ≤ T·R(y*).

Multiplicative Weights: L ≤ (1+ε)L* + (ln n)/ε.

T·C(x*) ≤ (1+ε)·T·R(y*) + (ln n)/ε
→ C(x*) ≤ (1+ε)R(y*) + (ln n)/(εT)
→ C(x*) − R(y*) ≤ εR(y*) + (ln n)/(εT).

With T = (ln n)/ε² and R(y*) ≤ 1: C(x*) − R(y*) ≤ 2ε.

SLIDE 12

Comments

For any ε, there exists an ε-approximate equilibrium. Does an exact equilibrium exist? Yes. (Some real math here: a fixed point theorem.) Later: we will use geometry and linear programming.

Complexity? T = (ln n)/ε² → O(nm log n / ε²) total work. Basically linear!

Versus linear programming: O(n³m). Basically quadratic. (Faster linear programming: O(√(n+m)) linear system solves.) Still much slower ... and more complicated.

Dynamics: best response, update weights, best response, ... It also works with both players using multiplicative weights. "In practice."

SLIDE 13

Learning.

Learning, just a bit. Example: a set of labelled (+/−) points; find a hyperplane that separates them. Looks hard.

Get 1/2 of the points on the correct side? Easy: an arbitrary line, and scan. Useless by itself. A bit more than 1/2?

Weak Learner: classifies ≥ 1/2 + ε of the points correctly.

Not really important, but ...

SLIDE 14

Weak Learner/Strong Learner

Input: n labelled points.
Weak Learner: produces a hypothesis that correctly classifies a 1/2 + ε fraction.
Strong Learner: produces a hypothesis that correctly classifies a 1 + µ fraction. (That's a really strong learner!) Make that: correctly classifies a 1 − µ fraction.
Same thing? Can one use weak learning to produce a strong learner?
Boosting: use a weak learner to produce a strong learner.

SLIDE 15

Poll.

Given a weak learning method (produces OK hypotheses), produce a great hypothesis. Can we do this? (A) Yes (B) No. If yes, how? Multiplicative Weights! The endpoint of a line of research.

SLIDE 16

Experts Picture

SLIDE 17

Boosting/MW Framework

Experts are the points. The "adversary" is the weak learner. Points want to be misclassified; the learner wants to maximize the probability of f classifying a random point correctly. The strong learner algorithm will come from the adversary.

Do T = (2/γ²) log(1/µ) rounds:

  • 1. Row player: multiplicative weights (factor 1−γ) on the points.
  • 2. Column: run the weak learner on the row distribution; get hypothesis h_t.
  • 3. Final hypothesis h(x): majority of h_1(x), h_2(x), ..., h_T(x).

Claim: h(x) is correct on 1 − µ of the points ! ! ! Cool! Really? Proof?
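The three numbered steps can be sketched as a loop. The decision-stump weak learner and the toy data below are my own illustration, not from the slides; the data is even threshold-separable, so the weak learner is secretly strong. The point here is the shape of the loop:

```python
import math

# Toy run of the three steps above on threshold-separable 1-d data.
pts    = [0.0, 1.0, 2.0, 5.0, 6.0, 7.0]
labels = [-1, -1, -1, +1, +1, +1]
n = len(pts)

def stumps():
    """All threshold stumps h(x) = s if x > theta else -s."""
    thetas = [-1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    return [(s, t) for s in (+1, -1) for t in thetas]

def classify(h, x):
    s, t = h
    return s if x > t else -s

gamma, mu = 0.1, 0.1
T = int(2 / gamma**2 * math.log(1 / mu)) + 1

w = [1.0] * n                    # multiplicative weights on the points
chosen = []
for _ in range(T):
    total = sum(w)
    # 2. Weak learner: the stump with the best weighted accuracy.
    def weighted_acc(h):
        return sum(wi for wi, x, l in zip(w, pts, labels)
                   if classify(h, x) == l) / total
    h_t = max(stumps(), key=weighted_acc)
    chosen.append(h_t)
    # 1. MW update: points classified correctly lose weight; each point
    #    "wants" to be misclassified.
    w = [wi * (1 - gamma) if classify(h_t, x) == l else wi
         for wi, x, l in zip(w, pts, labels)]

# 3. Final hypothesis: majority vote over the T weak hypotheses.
def H(x):
    return 1 if sum(classify(h, x) for h in chosen) > 0 else -1

errors = sum(1 for x, l in zip(pts, labels) if H(x) != l)
print(errors)   # 0: every point classified correctly on this toy data
```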

SLIDE 18

Some intuition

Intuition 1: each point is classified correctly, independently in each round, with probability 1/2 + ε. After enough rounds, the majority rule is correct for almost all points.

Intuition 2: say some point is classified correctly ≤ 1/2 of the time. Such a point gets high probability in the distribution; in the limit, the whole distribution concentrates on such points. But then this subset will be classified correctly with probability 1/2 + ε.

SLIDE 19

Adaboost proof.

Claim: h(x) is correct on 1 − µ of the points ! !

Let S_bad be the set of points where h(x) is incorrect. The majority of the h_t(x) are wrong for x ∈ S_bad, so each x ∈ S_bad is a good expert: it loses less than 1/2 the time.

W(T) ≥ (1−ε)^{T/2} |S_bad|.

Each day, the weak learner gets payoff ≥ 1/2 + γ → the daily loss L_t ≥ 1/2 + γ.

→ W(T) ≤ n(1−ε)^L ≤ n e^{−εL} ≤ n e^{−ε(1/2+γ)T}.

Combining: |S_bad| (1−ε)^{T/2} ≤ W(T) ≤ n e^{−ε(1/2+γ)T}.

SLIDE 20

Calculation..

|S_bad| (1−ε)^{T/2} ≤ n e^{−ε(1/2+γ)T}.

Set ε = γ and take logs:

ln(|S_bad|/n) + (T/2) ln(1−γ) ≤ −γ(1/2+γ)T.

Again, using −γ − γ² ≤ ln(1−γ):

ln(|S_bad|/n) + (T/2)(−γ − γ²) ≤ −γ(1/2+γ)T

→ ln(|S_bad|/n) ≤ −γT/2 − γ²T + γT/2 + γ²T/2 = −γ²T/2.

And with T = (2/γ²) ln(1/µ):

→ ln(|S_bad|/n) ≤ ln µ → |S_bad|/n ≤ µ.

The misclassified set is at most a µ fraction of all the points. The hypothesis correctly classifies 1 − µ of the points ! ! !

Claim (via multiplicative weights): h(x) is correct on 1 − µ of the points !

Claim: weak learning → strong learning! Not so weak after all.

SLIDE 21

Some details...

The weak learner learns over distributions on the points, not just sets of points. Make copies of points to simulate distributions. Used often in machine learning.

SLIDE 22

Example.

Set of points on the unit ball in d-space. Learner: learns hyperplanes through the origin. Can learn if there is a hyperplane H that separates all the points, and can find a (1/2 + ε)-weighted separating plane.

The experts' output is an average of hyperplanes ... a hyperplane! A (1/2 + ε) separating hyperplane?

Assumption: margin γ. Weak learner: a random hyperplane. Not likely to be exactly normal to H, but should get advantage about γ/√d.

→ O(d log n / γ²) rounds to find a separating hyperplane. Wow. That's weak.
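The random-hyperplane weak learner can be sketched directly: draw random directions and try each one and its negation, keeping the better. Trying both orientations guarantees accuracy ≥ 1/2 outright; the margin assumption is what buys the extra ≈ γ/√d advantage. (All data and constants below are arbitrary illustrations.)

```python
import math, random

random.seed(1)
d = 5

def rand_unit():
    v = [random.gauss(0.0, 1.0) for _ in range(d)]
    nrm = math.sqrt(sum(x * x for x in v))
    return [x / nrm for x in v]

# Separable toy data on the unit sphere: label = sign of the first
# coordinate; keep only points at distance > 0.2 from the separator,
# which enforces a margin.
data = []
for _ in range(200):
    p = rand_unit()
    if abs(p[0]) > 0.2:
        data.append((p, 1 if p[0] > 0 else -1))

def accuracy(h, data):
    ok = sum(1 for p, l in data
             if (1 if sum(a * b for a, b in zip(h, p)) > 0 else -1) == l)
    return ok / len(data)

# Weak learner: a few random hyperplanes, trying each direction and its
# negation.  Since acc + (1 - acc) = 1, max(acc, 1 - acc) >= 1/2 always.
best = 0.0
for _ in range(20):
    h = rand_unit()
    acc = accuracy(h, data)
    best = max(best, acc, 1 - acc)
print(best)   # >= 1/2 by construction, typically noticeably above
```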

SLIDE 23

Better weak learner?

A hyperplane that separates the weighted averages of the + and − points? Change the loss a bit, and get better results.

SLIDE 24

Toll/Congestion

Given: G = (V,E) and pairs (s_1,t_1), ..., (s_k,t_k).
Row: choose a routing of all the paths. Column: choose an edge. The row pays if the column's edge is on any of its paths.
Matrix: a row for each routing r, a column for each edge e; A[r,e] is the congestion on edge e under routing r.
Offense (best response): Router: route along shortest paths. Toll: charge the most loaded edge.
Defense: Toll: maximize the shortest path lengths under the tolls. Router: minimize the max congestion on any edge.

SLIDE 25

Two person game.

A row for every routing (A[r,e]): an exponential number of rows! A two-person game with experts won't be so easy to implement. The version with rows and columns flipped may work: A[e,r] = congestion of edge e under routing r. Now m rows (one per edge) and an exponential number of columns. Multiplicative weights only maintains m weights, and the adversary only needs to provide the best column each day. The runtime depends only on m and T (the number of days).

SLIDE 26

Congestion minimization and Experts.

We use the gain version of experts with gains in [0,ρ]: G ≥ (1−ε)G* − ρ(log n)/ε. Here ρ = k. Let T = k(log n)/ε².

  • 1. Row player runs multiplicative weights: w_i ← w_i (1+ε)^{g_i/k}.
  • 2. Route all paths along shortest paths (under the current weights).
  • 3. Output the average of all the routings: (1/T) ∑_t f(t).

Claim: the congestion c_max of the average routing is at most (1+2ε)C* + ε/(1−ε).

Proof: G ≥ (1−ε)G* − k(log n)/ε.
G* = c_max·T (the best row, i.e. edge, payoff against the average routing).
G ≤ C*·T (each day the gain is the average congestion, which is ≤ C*, since each day's routing is a shortest-path solution under the tolls, with congestion at most C*).

C*·T ≥ c_max·T(1−ε) − k(log n)/ε.

For T = k(log n)/ε², the last term is εT, so

C* ≥ (1−ε)c_max − ε → c_max ≤ C*/(1−ε) + ε/(1−ε).

With 1/(1−ε) ≤ 1 + 2ε (for ε ≤ 1/2):

c_max − C* ≤ 2εC* + ε/(1−ε).
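The routing loop above can be run on a toy instance: two parallel two-edge s-t paths and k = 2 identical (s,t) demands, with the MW weights acting as the tolls for the shortest-path computation. The graph and constants below are an arbitrary illustration:

```python
import heapq, math

edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
pairs = [("s", "t"), ("s", "t")]      # k = 2 identical demands
k = len(pairs)
eps = 0.1
T = int(k * math.log(len(edges)) / eps**2) + 1

adj = {}
for i, (u, v) in enumerate(edges):
    adj.setdefault(u, []).append((v, i))
    adj.setdefault(v, []).append((u, i))   # undirected

def shortest_path(src, dst, w):
    """Dijkstra under edge weights w; returns the edge indices used."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, math.inf):
            continue
        for v, i in adj.get(u, []):
            nd = d + w[i]
            if nd < dist.get(v, math.inf):
                dist[v], prev[v] = nd, (u, i)
                heapq.heappush(heap, (nd, v))
    path, node = [], dst
    while node != src:
        node, i = prev[node]
        path.append(i)
    return path

w = [1.0] * len(edges)                     # MW weights = the tolls
avg_cong = [0.0] * len(edges)
for _ in range(T):
    cong = [0] * len(edges)
    for s, t in pairs:                     # route along shortest paths
        for i in shortest_path(s, t, w):
            cong[i] += 1
    for i in range(len(edges)):
        avg_cong[i] += cong[i] / T
        w[i] *= (1 + eps) ** (cong[i] / k)  # MW update on the edges
print(max(avg_cong))   # near the optimal fractional congestion of 1
```

The routing alternates between the two paths as their tolls grow, so the averaged routing splits the demand nearly evenly.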

SLIDE 27

Better setup.

Runtime: O(km) to route in each step, and O(k(log n)/ε²) steps

→ O(k²m log n) to get a constant-factor approximation (constant ε). Homework: an O(km log n) algorithm.

SLIDE 28

Fractional versus Integer.

Did we solve path routing? Yes? No? No! The output is an average of T routings: we approximately solved the fractional routing problem. That does not give a (1+ε)-optimal solution to the (integral) path routing problem! Homework 2, Problem 1: a decent solution to the path routing problem?

SLIDE 29

Randomized Rounding

For each (s_i, t_i), choose path p_i with probability f(p_i). The congestion c(e) of edge e rounds to c̃(e).

Fix an edge e, used by paths p_1, ..., p_m. Let X_i = 1 if path p_i is chosen, otherwise X_i = 0. The rounded congestion is c̃(e) = ∑_i X_i. Expected congestion: ∑_i E[X_i], and

E[X_i] = 1·Pr[X_i = 1] + 0·Pr[X_i = 0] = f(p_i)
→ ∑_i E[X_i] = ∑_i f(p_i) = c(e) → E[c̃(e)] = c(e).

Concentration (law of large numbers): if c(e) is relatively large (Ω(log n)), then c̃(e) ≈ c(e). Concentration results? Later.
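The expectation computation can be checked by simulation: choose each path with its fractional probability and compare the empirical mean of c̃(e) to c(e). (The fractional values below are an arbitrary illustration, and the X_i are modeled as independent.)

```python
import random

random.seed(0)
# One edge e used by three candidate paths, with fractional flows f(p_i).
f = [0.5, 0.25, 0.25]          # c(e) = sum_i f(p_i) = 1.0
c_e = sum(f)

trials = 100_000
total = 0
for _ in range(trials):
    # X_i = 1 with probability f(p_i): path p_i is chosen and uses e.
    total += sum(1 for fi in f if random.random() < fi)

mean = total / trials          # empirical E[c~(e)]
print(mean)   # close to c(e) = 1.0, up to sampling noise
```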

SLIDE 30

See you on Tuesday.