1
Announcements
• HW 1 deadline is postponed to next Tuesday before class, i.e., 3:30 pm
CS6501: Topics in Learning and Game Theory (Fall 2019)
Swap Regret
Instructor: Haifeng Xu
3
Outline
• (External) Regret vs Swap Regret
• Convergence to Correlated Equilibrium
• Converting Regret Bounds to Swap Regret Bounds
4
At each time step $t = 1, \cdots, T$, the following occurs in order:
1. Learner picks a distribution $p_t$ over actions $[n]$
2. Adversary picks cost vector $c_t \in [0,1]^n$
3. Action $i_t \sim p_t$ is chosen and the learner incurs cost $c_t(i_t)$
4. Learner observes $c_t$ (for use in future time steps)
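To make the protocol concrete, here is a minimal runnable Python sketch of one run; the uniform learner, the random adversary, and all constants are illustrative placeholders, not part of the setting.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 1000                       # n actions, T rounds (illustrative values)

total_cost = 0.0
for t in range(T):
    p_t = np.full(n, 1.0 / n)        # 1. learner picks a distribution over [n]
                                     #    (uniform here; a real learner adapts)
    c_t = rng.random(n)              # 2. adversary picks a cost vector in [0,1]^n
    i_t = rng.choice(n, p=p_t)       # 3. action i_t ~ p_t is drawn; learner
    total_cost += c_t[i_t]           #    incurs cost c_t(i_t)
    # 4. learner observes the full vector c_t for use in future rounds

print(f"realized cost over {T} rounds: {total_cost:.1f}")
```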
5
• External regret:
$$R_T = \mathbb{E}_{i_t \sim p_t}\Big[ \sum_{t\in[T]} c_t(i_t) \Big] - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$
• Benchmark: $\min_{k\in[n]} \sum_t c_t(k)$ is the learner's total cost had he known $c_1, \cdots, c_T$ and been allowed to take the best single action across all rounds
• Describes how much the learner regrets, had he known the cost vectors $c_1, \cdots, c_T$ in hindsight
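The definition translates directly into code. A small sketch, assuming the run is recorded in hypothetical arrays P (row t is $p_t$) and C (row t is $c_t$), and using the expected cost $\sum_t \langle p_t, c_t \rangle$ in place of the sampled cost:

```python
import numpy as np

def external_regret(P: np.ndarray, C: np.ndarray) -> float:
    """R_T = sum_t <p_t, c_t> - min_k sum_t c_t(k).

    P: (T, n) array, row t is the learner's distribution p_t.
    C: (T, n) array, row t is the adversary's cost vector c_t.
    """
    learner_cost = float(np.sum(P * C))      # sum_t sum_i c_t(i) p_t(i)
    best_fixed = float(C.sum(axis=0).min())  # best single action in hindsight
    return learner_cost - best_fixed
```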
6-7
• A closer look at external regret:
$$R_T = \mathbb{E}_{i_t \sim p_t}\Big[ \sum_{t\in[T]} c_t(i_t) \Big] - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$
$$= \sum_{t\in[T]} \sum_{i\in[n]} c_t(i)\, p_t(i) - \min_{k\in[n]} \sum_{t\in[T]} c_t(k)$$
$$= \max_{k\in[n]} \Big[ \sum_{t\in[T]} \sum_{i\in[n]} c_t(i)\, p_t(i) - \sum_{t\in[T]} c_t(k) \Big]$$
$$= \max_{k\in[n]} \sum_{t\in[T]} \sum_{i\in[n]} [c_t(i) - c_t(k)]\, p_t(i)$$
• This is a many-to-one action swap: in external regret, the learner is allowed to swap every action to a single action $k$, and can choose the best $k$ in hindsight
8
• A closer look at external regret:
$$R_T = \max_{k\in[n]} \sum_{t\in[T]} \sum_{i\in[n]} [c_t(i) - c_t(k)]\, p_t(i)$$
• Swap regret allows many-to-many action swaps. Formally,
$$\mathrm{swR}_T = \max_{s} \sum_{t\in[T]} \sum_{i\in[n]} [c_t(i) - c_t(s(i))]\, p_t(i)$$
where the max is over all possible swap functions $s: [n] \to [n]$, and $c_t(s(i))$ is the cost of the action that $i$ swaps to
• There are $n^n$ swap functions: each action $i$ has $n$ choices to swap to
• Quiz: how many many-to-one swaps?
9
Recall swap regret:
$$\mathrm{swR}_T = \max_{s} \sum_{t\in[T]} \sum_{i\in[n]} [c_t(i) - c_t(s(i))]\, p_t(i)$$

Fact 1. For any algorithm: $\mathrm{swR}_T \ge R_T$.

Fact 2. For any algorithm execution $p_1, \cdots, p_T$, the optimal swap function $s^*$ satisfies, for any $i$,
$$s^*(i) = \arg\max_{k\in[n]} \sum_{t\in[T]} [c_t(i) - c_t(k)]\, p_t(i)$$

Proof:
• Fact 1 holds because restricting the max to constant swap functions ($s(i) = k$ for all $i$) recovers exactly the external regret
• Fact 2 holds because $s(i)$ only affects the term $\sum_{t\in[T]} [c_t(i) - c_t(s(i))]\, p_t(i)$, so it should be picked to maximize this term
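Fact 2 is also computationally useful: it lets one evaluate swap regret without enumerating all $n^n$ swap functions, by picking the best swap target of each action independently. A sketch under the same hypothetical P/C conventions as before:

```python
import numpy as np

def swap_regret(P: np.ndarray, C: np.ndarray) -> float:
    """swR_T via Fact 2: choose the best swap target s*(i) per action i.

    P, C: (T, n) arrays of distributions p_t and cost vectors c_t.
    G[i, k] = sum_t [c_t(i) - c_t(k)] * p_t(i) is the gain of swapping i -> k.
    """
    A = np.sum(C * P, axis=0)        # A[i] = sum_t c_t(i) p_t(i)
    B = P.T @ C                      # B[i, k] = sum_t p_t(i) c_t(k)
    G = A[:, None] - B               # pairwise swap gains; G[i, i] = 0
    # Fact 2: s*(i) = argmax_k G[i, k]; each row max is >= 0
    # Fact 1 check: external regret = max_k G[:, k].sum() <= G.max(axis=1).sum()
    return float(G.max(axis=1).sum())
```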
10-12
Remarks (on Facts 1 and 2 above):
• The optimal swap can be decided "independently" for each action $i$
• The benchmark of swap regret depends on the algorithm execution $p_1, \cdots, p_T$, but the benchmark of external regret does not
• This raises a subtle issue: an algorithm minimizing swap regret does not necessarily minimize the total loss, e.g., when its execution leaves few opportunities to swap
• The quantity $\max_{i\in[n]} \max_{k\in[n]} \sum_{t\in[T]} [c_t(i) - c_t(k)]\, p_t(i)$, which picks the single worst action $i$, is also called the internal regret
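The internal regret of the last remark is the largest single entry of the same pairwise-gain matrix; a one-line variant of the previous sketch:

```python
import numpy as np

def internal_regret(P: np.ndarray, C: np.ndarray) -> float:
    """Internal regret: the single worst pair (i, k), i.e., the largest swap gain."""
    G = np.sum(C * P, axis=0)[:, None] - P.T @ C   # same G as in swap_regret
    return float(G.max())
```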
13
Outline
• (External) Regret vs Swap Regret
• Convergence to Correlated Equilibrium
• Converting Regret Bounds to Swap Regret Bounds
14
• $n$ players, denoted by set $[n] = \{1, \cdots, n\}$
• Player $i$ takes action $a_i \in A_i$
• Player utility depends on the outcome of the game, i.e., an action profile $a = (a_1, \cdots, a_n) \in A = \prod_{i\in[n]} A_i$
• Correlated equilibrium (CE) is an action recommendation policy

A recommendation policy $\pi$ is a correlated equilibrium if
$$\sum_{a_{-i}} u_i(a_i, a_{-i}) \cdot \pi(a_i, a_{-i}) \;\ge\; \sum_{a_{-i}} u_i(a_i', a_{-i}) \cdot \pi(a_i, a_{-i}), \quad \forall\, a_i' \in A_i,\ \forall\, i \in [n].$$

• That is, for any recommended action $a_i$, player $i$ does not want to "swap" to another action $a_i'$
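Since the CE condition is a finite list of linear inequalities in $\pi$, it can be checked mechanically. A sketch for two players; the function name and the game-of-Chicken example are illustrative, not from the slides:

```python
import numpy as np

def is_correlated_eq(pi: np.ndarray, U: list[np.ndarray], tol: float = 1e-9) -> bool:
    """Check the CE condition for a 2-player game.

    pi:   (m1, m2) joint distribution over action profiles (a_1, a_2).
    U[i]: (m1, m2) utility array of player i, indexed by (a_1, a_2).
    """
    m1, m2 = pi.shape
    # Player 1: for each recommended a1 and deviation a1p,
    # sum_{a2} u1(a1, a2) pi(a1, a2) >= sum_{a2} u1(a1p, a2) pi(a1, a2)
    for a1 in range(m1):
        for a1p in range(m1):
            if np.dot(U[0][a1p] - U[0][a1], pi[a1]) > tol:
                return False
    # Player 2: symmetric, conditioning on the recommended column a2
    for a2 in range(m2):
        for a2p in range(m2):
            if np.dot(U[1][:, a2p] - U[1][:, a2], pi[:, a2]) > tol:
                return False
    return True

# Example: in Chicken with payoffs (D,D)=(0,0), (D,C)=(7,2), (C,D)=(2,7),
# (C,C)=(6,6), the uniform distribution over (D,C), (C,D), (C,C) is a classic CE.
U1 = np.array([[0, 7], [2, 6]])       # row player's utility, actions 0=Dare, 1=Chicken
U2 = U1.T                             # symmetric game
pi = np.array([[0, 1 / 3], [1 / 3, 1 / 3]])
print(is_correlated_eq(pi, [U1, U2]))  # True
```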
15
Repeated Games with No-Swap-Regret Players
ØThe game is played repeatedly for 𝑈 rounds ØEach player uses an online learning algorithm to select a mixed
strategy at each round 𝑢
ØFor any player 𝑗’s perspective, the following occurs in order at 𝑢
( ∈ Δ|`X| over actions in 𝐵>
( ∈ Δ|`b|
(, 𝑦Y> (
= 𝔽V∼(cX
?,cWX ? ) 𝑣>(𝑏)
( (for future use)
16
Theorem. Suppose each player $i$ uses a no-swap-regret algorithm, generating mixed strategy sequence $\{x_i^t\}_{t\in[T]}$ for $i$. The following recommendation policy $\pi_T$ converges to a CE:
$$\pi_T(a) = \frac{1}{T} \sum_t \prod_{i\in[n]} x_i^t(a_i), \quad \forall\, a \in A.$$

Remarks:
• In the mixed strategy profile $(x_1^t, x_2^t, \cdots, x_n^t)$, the probability of action profile $a$ is $\prod_{i\in[n]} x_i^t(a_i)$
• $\pi_T(a)$ is simply the average of $\prod_{i\in[n]} x_i^t(a_i)$ over the $T$ rounds
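Computing $\pi_T$ from the played strategy sequences is a direct transcription of the theorem's formula; a sketch for two players, with hypothetical arrays X1 and X2 holding the per-round mixed strategies:

```python
import numpy as np

def recommendation_policy(X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
    """pi_T(a) = (1/T) sum_t prod_i x_i^t(a_i), specialized to two players.

    X1: (T, m1) array, row t is player 1's mixed strategy x_1^t.
    X2: (T, m2) array, row t is player 2's mixed strategy x_2^t.
    Returns the (m1, m2) joint distribution pi_T over action profiles.
    """
    # the outer product of x_1^t and x_2^t is round t's profile distribution;
    # einsum sums these outer products over t, and dividing by T averages them
    return np.einsum('ti,tj->ij', X1, X2) / X1.shape[0]
```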
17-19
Proof:
• Derive player $i$'s expected utility from $\pi_T$:
$$\sum_{a\in A} \Big[ \frac{1}{T} \sum_t \prod_{j\in[n]} x_j^t(a_j) \Big] \cdot u_i(a) = \frac{1}{T} \sum_t \sum_{a\in A} \prod_{j\in[n]} x_j^t(a_j) \cdot u_i(a)$$
$$= \frac{1}{T} \sum_t u_i(x_i^t, x_{-i}^t)$$
$$= \frac{1}{T} \sum_{a_i \in A_i} \sum_{t=1}^{T} u_i(a_i, x_{-i}^t) \cdot x_i^t(a_i)$$
• Player $i$'s expected utility conditioned on being recommended $a_i$ is
$$\frac{1}{T} \sum_{t=1}^{T} u_i(a_i, x_{-i}^t) \cdot x_i^t(a_i) \qquad \text{(normalization factor omitted)}$$
20-23
Proof (continued):
• The CE condition requires, for all players $i$ and all $a_i \in A_i$,
$$\frac{1}{T} \sum_{t=1}^{T} u_i(a_i, x_{-i}^t) \cdot x_i^t(a_i) \;\ge\; \frac{1}{T} \sum_{t=1}^{T} u_i(s(a_i), x_{-i}^t) \cdot x_i^t(a_i), \quad \forall\, s(a_i) \in A_i$$
• Let $s^*$ be the optimal swap function in player $i$'s swap regret (written here in utility form):
$$\mathrm{swR}_T^i = \max_{s} \sum_{t=1}^{T} \sum_{a_i \in A_i} [u_i(s(a_i), x_{-i}^t) - u_i(a_i, x_{-i}^t)] \cdot x_i^t(a_i) = \sum_{a_i} \sum_{t=1}^{T} [u_i(s^*(a_i), x_{-i}^t) - u_i(a_i, x_{-i}^t)] \cdot x_i^t(a_i)$$
• From Fact 2 before, the optimal swap function $s^*$ satisfies
$$s^*(a_i) = \arg\max_{s(a_i) \in A_i} \sum_{t=1}^{T} [u_i(s(a_i), x_{-i}^t) - u_i(a_i, x_{-i}^t)] \cdot x_i^t(a_i)$$
so every term of the outer sum over $a_i$ is nonnegative (taking $s(a_i) = a_i$ gives $0$), and dropping all but one term is valid
• This implies
$$\mathrm{swR}_T^i \;\ge\; \sum_{t=1}^{T} [u_i(s(a_i), x_{-i}^t) - u_i(a_i, x_{-i}^t)] \cdot x_i^t(a_i), \quad \forall\, a_i \text{ and } s(a_i)$$
• The theorem follows by dividing both sides by $T$: since $\mathrm{swR}_T^i$ is sublinear, the CE condition holds up to error $\mathrm{swR}_T^i / T \to 0$ as $T \to \infty$
24
Outline
• (External) Regret vs Swap Regret
• Convergence to Correlated Equilibrium
• Converting Regret Bounds to Swap Regret Bounds
25
Good External Regret ≠ Good Swap Regret
ØAn algorithm with small swap regret also has small external regret ØThe reverse is not true – an algorithm with small external regret
does not necessarily have small swap regret
Do there exist online learning algorithms with sublinear regret?
26
Theorem. Any online learning algorithm $A$ with external regret $R$ can be converted to another online algorithm $H$ with swap regret $nR$, where $n$ = number of actions.

• $H$ utilizes $A$ but is different and more complicated
• In particular, there exist no-swap-regret online learning algorithms
27-28
Proof Overview:
• The idea starts from the following observation. Let $s^*$ be the optimal swap function; then
$$\mathrm{swR}_T = \max_{s} \sum_{t\in[T]} \sum_{i\in[n]} [c_t(i) - c_t(s(i))]\, p_t(i) = \sum_{i\in[n]} \underbrace{\sum_{t\in[T]} [c_t(i) - c_t(s^*(i))]\, p_t(i)}_{\text{regret from action } i\text{'s swap}}$$
Two observations:
1. The inner term, the regret from action $i$'s swap, "looks like" an external regret term
2. If the inner term is at most $R$ for every $i$, then we are done
29-30
Proof Step 1: constructing $H$
• Make $n$ copies of algorithm $A$ as $A_1, \cdots, A_n$
• Construction of $H$ at each round $t$:
1. Let $q_t^i \in \Delta_n$ be the randomized action of copy $A_i$ generated at round $t$
2. $H$ picks the distribution $p_t$ satisfying
$$\sum_i p_t(i) = 1 \ \ (p_t \text{ is a distribution}), \qquad \sum_i p_t(i)\, q_t^i(k) = p_t(k), \ \forall\, k \in [n] \ \ (p_t \text{ is stationary})$$
3. After observing $c_t$, $H$ feeds the "scaled cost" $p_t(i) \cdot c_t$ to copy $A_i$ for its future use
• That is, the following two ways for $H$ to select actions are equivalent: draw the action directly from $p_t$, or first draw a copy $i \sim p_t$ and then draw the action from $q_t^i$
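A runnable sketch of the whole construction, assuming multiplicative weights as the base algorithm $A$ (any no-external-regret algorithm would do) and solving the stationarity condition $p_t = p_t Q_t$ as a left-eigenvector problem; all names here are illustrative:

```python
import numpy as np

class MW:
    """Multiplicative weights: a standard no-external-regret base learner."""
    def __init__(self, n: int, eta: float):
        self.w = np.ones(n)
        self.eta = eta

    def distribution(self) -> np.ndarray:
        return self.w / self.w.sum()

    def update(self, cost: np.ndarray) -> None:
        self.w *= np.exp(-self.eta * cost)  # costs assumed to lie in [0, 1]

def stationary(Q: np.ndarray) -> np.ndarray:
    """Solve p = p Q for row-stochastic Q: left eigenvector for eigenvalue 1."""
    vals, vecs = np.linalg.eig(Q.T)
    p = np.abs(np.real(vecs[:, np.argmin(np.abs(vals - 1.0))]))
    return p / p.sum()

def master_round(copies: list, c_t: np.ndarray) -> np.ndarray:
    """One round of the master H built from the n copies A_1, ..., A_n."""
    Q = np.vstack([A_i.distribution() for A_i in copies])  # row i is q_t^i
    p_t = stationary(Q)            # p_t(k) = sum_i p_t(i) q_t^i(k), for all k
    for i, A_i in enumerate(copies):
        A_i.update(p_t[i] * c_t)   # feed the scaled cost p_t(i) * c_t to copy A_i
    return p_t

# Usage: n actions, T rounds of random costs; swap_regret(P, C) from the
# earlier sketch should then grow sublinearly in T (the theorem gives n * R)
rng = np.random.default_rng(1)
n, T = 3, 500
copies = [MW(n, eta=np.sqrt(np.log(n) / T)) for _ in range(n)]
C = rng.random((T, n))
P = np.vstack([master_round(copies, C[t]) for t in range(T)])
```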
31-34
Proof Step 2: deriving the regret bound
• Each copy $A_i$ has external regret at most $R$ with respect to the scaled costs $p_t(i) \cdot c_t$ it observes, so
$$\sum_{t\in[T]} \sum_{k} q_t^i(k) \cdot [\, p_t(i)\, c_t(k) - p_t(i)\, c_t(k') \,] \;\le\; R, \quad \forall\, k' \in [n] \qquad (1)$$
• Swap regret of $H$:
$$\mathrm{swR}_T = \max_{s} \sum_{t\in[T]} \sum_{k\in[n]} p_t(k) \cdot [c_t(k) - c_t(s(k))]$$
• We need to somehow relate $\mathrm{swR}_T$ to the $q_t^i$'s, because Inequality (1) is the only bound we have. By our construction, $p_t(k) = \sum_i p_t(i)\, q_t^i(k)$ for all $k \in [n]$; substituting this into the first term, relabeling the benchmark index $k$ as $i$ in the second term, and re-inserting $\sum_k q_t^i(k) = 1$ gives
$$\mathrm{swR}_T = \max_{s} \sum_{t\in[T]} \sum_{k\in[n]} \sum_{i} p_t(i)\, q_t^i(k) \cdot [c_t(k) - c_t(s(i))] = \max_{s} \sum_{i} \Big( \sum_{t\in[T]} \sum_{k\in[n]} p_t(i)\, q_t^i(k) \cdot [c_t(k) - c_t(s(i))] \Big)$$
• Applying Inequality (1) with $k' = s(i)$ bounds each copy's inner sum by $R$, so summing over the $n$ copies yields
$$\mathrm{swR}_T \;\le\; n \cdot R$$
Haifeng Xu
University of Virginia
hx4ad@virginia.edu