Crush Optimism with Pessimism: Structured Bandits Beyond Asymptotic Optimality
Kwang-Sung Jun (joint work with Chicheng Zhang)
1
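Structured bandits
Input: arm set $\mathcal{A}$ and hypothesis class $\mathcal{F}$; the true mean-reward function $f^* \in \mathcal{F}$ is unknown to the learner
At each time $t = 1, \dots, n$, the learner pulls an arm $a_t \in \mathcal{A}$ and receives the reward $y_t = f^*(a_t)$ + (zero-mean stochastic noise)
Goal: minimize the regret $n \cdot \max_{a \in \mathcal{A}} f^*(a) - \sum_{t=1}^{n} f^*(a_t)$
2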
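The interaction protocol and the regret can be sketched in a few lines. This is only an illustration under stated assumptions: a finite arm set, unit-variance Gaussian noise, and placeholder policies standing in for the learner. The names `run_bandit`, `uniform_policy`, `always_a1` and the mean rewards are hypothetical and do not come from the talk.

```python
import random

def run_bandit(f_star, n, policy, seed=0):
    """Play n rounds and return n * max_a f*(a) - sum_t f*(a_t)."""
    rng = random.Random(seed)
    arms = list(f_star)
    best_mean = max(f_star.values())
    history = []      # (arm, noisy reward) pairs seen so far
    collected = 0.0   # running sum of f*(a_t)
    for _ in range(n):
        arm = policy(arms, history, rng)
        reward = f_star[arm] + rng.gauss(0.0, 1.0)  # ASSUMPTION: N(0, 1) noise
        history.append((arm, reward))
        collected += f_star[arm]
    return n * best_mean - collected

def uniform_policy(arms, history, rng):
    """Placeholder learner: ignores the history and pulls a random arm."""
    return rng.choice(arms)

def always_a1(arms, history, rng):
    """Placeholder learner that already knows the best arm."""
    return "A1"

f_star = {"A1": 1.0, "A2": 0.5, "A3": 0.25}  # hypothetical mean rewards
regret_uniform = run_bandit(f_star, n=1000, policy=uniform_policy)
regret_oracle = run_bandit(f_star, n=1000, policy=always_a1)
```

A learner that always pulls the best arm has zero regret, while the uniform placeholder pays the average gap in every round.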
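e.g., linear: $\mathcal{A} = \{a_1, \dots, a_K\} \subset \mathbb{R}^d$, $\mathcal{F} = \{a \mapsto \theta^\top a : \theta \in \mathbb{R}^d\}$
$\mathcal{F}$: "the set of possible configurations of the mean rewards"
$c(f^*) \cdot \log n$ regret bound (instance-dependent)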
*.
3
[Ok18] Ok et al., Exploration in Structured Reinforcement Learning, NeurIPS, 2018
(the worst-case regret is beyond the scope)
4
(AISTATS'17)
Do they like orange or apple? Maybe have them try lemon and see if they are sensitive to sourness.
(1,0) (0.95, 0.1) (1,0)
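$c(f^*) = \min_{\gamma_1, \dots, \gamma_K \ge 0} \sum_{a=1}^{K} \gamma_a \cdot \Delta_a$
Optimal solution $\gamma_1^*, \dots, \gamma_K^*$: pull each arm $a$ about $\gamma_a^* \cdot \log n$ times.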
5
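subject to: $\forall g \in \mathcal{C}(f^*)$ (the "competing" hypotheses), $\sum_{a=1}^{K} \gamma_a \cdot \mathrm{KL}_\nu(f^*(a), g(a)) \ge 1$
$\mathrm{KL}_\nu$: KL divergence with noise distribution $\nu$
$\Delta_a = \max_{b \in \mathcal{A}} f^*(b) - f^*(a)$
$\gamma_{a^*(f^*)} = 0$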
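To make the constraints concrete, the sketch below checks whether a candidate allocation is feasible for this program and evaluates its cost; it does not solve the minimization. Two assumptions are not stated on the slide: the noise is unit-variance Gaussian, and a hypothesis counts as "competing" when it has a different best arm but agrees with $f^*$ on the best arm of $f^*$. All function names and numbers are hypothetical.

```python
def best_arm(f):
    """Arm with the largest mean reward under hypothesis f."""
    return max(f, key=f.get)

def gaps(f):
    """Delta_a = max_b f(b) - f(a) for every arm a."""
    top = max(f.values())
    return {a: top - v for a, v in f.items()}

def competing(f_star, hypotheses):
    # ASSUMPTION: g competes with f* if its best arm differs from that of f*
    # while g agrees with f* on the best arm of f*.
    a_star = best_arm(f_star)
    return [g for g in hypotheses
            if best_arm(g) != a_star and g[a_star] == f_star[a_star]]

def kl_gauss(mu1, mu2):
    # ASSUMPTION: unit-variance Gaussian noise.
    return (mu1 - mu2) ** 2 / 2

def is_feasible(gamma, f_star, hypotheses):
    """Check gamma_{a*} = 0 and the information constraint for every competitor."""
    if gamma[best_arm(f_star)] != 0:
        return False
    return all(
        sum(gamma[a] * kl_gauss(f_star[a], g[a]) for a in f_star) >= 1
        for g in competing(f_star, hypotheses)
    )

def cost(gamma, f_star):
    """Objective value sum_a gamma_a * Delta_a."""
    d = gaps(f_star)
    return sum(gamma[a] * d[a] for a in f_star)

f_star = {"A1": 1.0, "A2": 0.5, "A3": 0.0}   # hypothetical
g = {"A1": 1.0, "A2": 0.5, "A3": 2.0}        # competes: best arm A3, same value on A1
h = {"A1": 0.0, "A2": 0.5, "A3": 0.0}        # differs on A1, so not competing here
hypotheses = [f_star, g, h]

good = {"A1": 0.0, "A2": 0.0, "A3": 0.5}
bad = {"A1": 0.0, "A2": 5.0, "A3": 0.25}
good_is_feasible = is_feasible(good, f_star, hypotheses)
bad_is_feasible = is_feasible(bad, f_star, hypotheses)
good_cost = cost(good, f_star)
```

In this toy instance only arm A3 separates $f^*$ from its competitor, so pulls of A2 carry no information no matter how many are made.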
Lattimore & Munos, Bounded regret for finite-armed structured bandits, 2014.
+.2 # (3+.2 +.2 #.
+.2 # (3+.2 +.2 #
6
*Dependence on $K$ can be avoided in special cases (e.g., linear).
7
* it's necessary (will be updated on arXiv)
8
$y_t = f^*(a_t) + \eta_t$
$βπ π π ,
9
10
[Figure: mean reward (axis ticks 1, .75, .5, .25) of each hypothesis in $\mathcal{F}$ on arms 1, 2, 3; one of the hypotheses is $f^*$.]
$\forall g \in \mathcal{C}(f^*)$: $\sum_{a=1}^{K} \gamma_a \cdot \frac{(f^*(a) - g(a))^2}{2} \ge 1$
11
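$\mathcal{C}(f^*)$: "competing" hypotheses
$\gamma_{a^*(f^*)} = 0$
$c(f^*) \triangleq \min_{\gamma_1, \dots, \gamma_K \ge 0} \sum_{a=1}^{K} \gamma_a \cdot \Delta_a$
$\gamma_a \ln n$ samples for each $a \in \mathcal{A}$ can distinguish $f^*$ from $g$ confidently.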
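The squared-difference form of the constraint on this slide is what the KL constraint becomes under Gaussian noise. A short derivation, assuming unit-variance Gaussian noise (the variance is an assumption made here for illustration), with $\mu_1$ and $\mu_2$ as placeholder means:

```latex
\begin{align*}
\mathrm{KL}\big(\mathcal{N}(\mu_1, 1),\, \mathcal{N}(\mu_2, 1)\big)
  &= \mathbb{E}_{x \sim \mathcal{N}(\mu_1, 1)}\!\left[\frac{(x - \mu_2)^2 - (x - \mu_1)^2}{2}\right] \\
  &= \frac{2\mu_1(\mu_1 - \mu_2) + \mu_2^2 - \mu_1^2}{2} \\
  &= \frac{(\mu_1 - \mu_2)^2}{2}.
\end{align*}
```

Setting $\mu_1 = f^*(a)$ and $\mu_2 = g(a)$ gives the per-arm term $(f^*(a) - g(a))^2 / 2$ that is weighted by $\gamma_a$.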
Agrawal, Teneketzis, Anantharam. Asymptotically Efficient Adaptive Allocation Schemes for Controlled I.I.D. Processes: Finite Parameter Space, 1989.
$\Delta_a = \max_{b \in \mathcal{A}} f^*(b) - f^*(a)$
+.2! ) 4!
) 5 ln π
12
Hypothesis   A1     A2     A3     A4     A5, A6
f_1          1      1-ε    1-ε    1-ε
f_2          1-ε    1      1-ε    1-ε    Λ
f_3          1-ε    1-ε    1      1-ε    Λ
f_4          1-ε    1-ε    1-ε    1      Λ Λ
f_5          1+ε    1      1-ε    1-ε
f_6          1      1+ε    1-ε    1-ε    Λ
f_7          1-ε    1-ε    1+ε    1      Λ
…
Base arms: $K_0$ (A1 to A4), mean rewards in {1-ε, 1, 1+ε}
Cheating arms: $\log_2 K_0$ (A5, A6), mean rewards in {0, Λ}
13
[Figure: mean reward of the hypotheses $f_1, \dots, f_6$.]
14
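$L_t(f) = \sum_{s=1}^{t} (y_s - f(a_s))^2$
$\mathcal{F}_t = \{ f \in \mathcal{F} : L_{t-1}(f) - \min_{g \in \mathcal{F}} L_{t-1}(g) \le \beta_t \}$, with $\beta_t \triangleq \Theta(\ln(t|\mathcal{F}|))$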
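A minimal sketch of a loss-based confidence set of this kind for a finite hypothesis class: keep every hypothesis whose cumulative squared loss is within a threshold of the smallest loss. The slide gives the threshold only up to a constant, so the factor 2 below is an arbitrary assumption, and the function names and data are hypothetical.

```python
import math

def cum_loss(f, history):
    """Cumulative squared loss of hypothesis f on the observed (arm, reward) pairs."""
    return sum((y - f[a]) ** 2 for a, y in history)

def confidence_set(hypotheses, history, beta):
    """Keep every f whose loss is within beta of the smallest loss."""
    losses = [cum_loss(f, history) for f in hypotheses]
    smallest = min(losses)
    return [f for f, loss in zip(hypotheses, losses) if loss - smallest <= beta]

f1 = {"A1": 1.0, "A2": 0.0}   # hypothetical hypotheses
f2 = {"A1": 0.0, "A2": 1.0}
history = [("A1", 1.0)] * 20  # twenty noiseless pulls of A1, for illustration
t = len(history) + 1
beta = 2.0 * math.log(t * 2)  # ASSUMPTION: constant 2 is arbitrary; 2 = |F|
survivors = confidence_set([f1, f2], history, beta)
```

Here `f2` accumulates a loss of 20 against 0 for `f1`, which exceeds the threshold, so only `f1` survives; with an empty history every hypothesis is kept.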
15
confidence level: $1 - \mathrm{poly}(1/t)$
$\bar{f}_t = \arg\min_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$
(break ties by the cumulative loss)
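The pessimistic choice can be sketched as follows for a finite confidence set: pick the hypothesis whose best achievable mean reward is smallest, and break ties by the smaller cumulative loss. The function name `pessimistic_index` and the example values are hypothetical.

```python
def pessimistic_index(conf_set, cum_losses):
    """Index of argmin_f max_a f(a); ties go to the smaller cumulative loss."""
    return min(
        range(len(conf_set)),
        key=lambda i: (max(conf_set[i].values()), cum_losses[i]),
    )

conf_set = [
    {"A1": 1.0, "A2": 0.5},   # best mean 1.0
    {"A1": 0.9, "A2": 0.5},   # best mean 0.9
    {"A1": 0.5, "A2": 0.9},   # best mean 0.9, smaller loss than the previous one
]
cum_losses = [3.0, 5.0, 4.0]
chosen = pessimistic_index(conf_set, cum_losses)
```

Sorting on the pair (best mean, cumulative loss) implements the tie-breaking rule directly: the last two hypotheses tie on the best mean, and the one with the smaller loss is selected.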
"
$βπ <=++_?.=/@($) C#
β
16
$\tilde{f}_t = \arg\max_{f \in \mathcal{F}_t} \max_{a \in \mathcal{A}} f(a)$
(, π 7, π D
17
Arms   A1    A2    A3    A4    A5
f_1    1     .99   .98
f_2    .98   .99   .98   .25
f_3    .97   .97   .98   .25   .25
18
3
+
πΏ+ π π β π π
2
2 β₯ 1 πΏ β₯ max πΏ π , π π
Arms   A1    A2    A3    A4      A5
f_1    1     .99   .98
f_2    .98   .99   .98   .25
f_3    .97   .97   1     .25     .25
f_4    .97   .97   1     .2499   .25
π π β arg min
&β *,8 " Ξ9:; π β πΏ+β 1 +
3
+<+β 1
Ξ+ π β πΏ+ πβ =
" if
$\forall g \in \tilde{\mathcal{F}}_t$: $\sum_a \gamma_a(\bar{f}_t) \cdot \frac{(g(a) - \bar{f}_t(a))^2}{2} \ge 1$
" .
19
mean rewards arms
pessimistic set
At time $t$, $\bar{f}_t$ is sufficient to eliminate the optimistic set $\tilde{\mathcal{F}}_t$
Μ π
7
Μ π
7
+βπ =>??_AB>;C(+) F+,-
20
7,
7, it's fine.
D, we have suboptimal_const $\cdot \log n$ regret (and can be made arbitrarily
# -confident.
( +.2 # -confident set.
21
Arms   A1    A2    A3    A4    A5
f_1    1     .99   .98
f_2    .98   .99   .98   .25
f_3    .98   .99   .98   .25   .50
"
"
22
πΏ+β 1 = 0 π π = arg min
&!,β¦,&" )* 3 +,! "
πΏ+ β Ξ+
βπ β β° π : πΏ π β πΏ π , 3
+,! "
πΏ+ β π π β π π
2
2 β₯ 1 confidence level: 1 β poly
! ?BG 7
At time $t$,
Μ β±7: πΏ(π) and πΏ π are not proportional to each other
π7 = π Μ π
7
$\bar{f}_t$ is sufficient to eliminate the optimistic set $\tilde{\mathcal{F}}_t$
Μ π
7
Μ π
7
+βπ =>??_AB>;C(+) F+,-
23
24
Regret bound terms: $P_1 \ln n + P_2 \ln \ln n$ and $P_3 \ln |\mathcal{F}|$
If $P_1 = 0$, then $P_2 = 0$. Thus, bounded regret.
25
$P_1 = \sum_a \Delta_a \cdot \gamma_a(f^*)$
2 = 3 +
Ξ+ β max
1ββ°(1β) π+(π)
π
4 = 3 +
Ξ+ β max
1ββ± π+(π)
+/()) 4! ln π
+/()) 4! ln π + πΏ
26
Hypothesis   A1     A2     A3     A4     A5, A6
f_1          1      1-ε    1-ε    1-ε
f_2          1-ε    1      1-ε    1-ε    Λ
f_3          1-ε    1-ε    1      1-ε    Λ
f_4          1-ε    1-ε    1-ε    1      Λ Λ
f_5          1      1+ε    1-ε    1-ε
f_6          1      1-ε    1+ε    1-ε
f_7          1      1-ε    1-ε    1+ε
…
Base arms: $K_0$ (A1 to A4); cheating arms: $\log_2 K_0$ (A5, A6)
27
28
5 ln π , ππ
4! ln π , π
29
Hypothesis   A1     A2     A3     A4     A5, A6
f_1          1      1-ε    1-ε    1-ε
f_2          1-ε    1      1-ε    1-ε    Λ
f_3          1-ε    1-ε    1      1-ε    Λ
f_4          1-ε    1-ε    1-ε    1      Λ Λ
f_5          1      1+ε    1-ε    1-ε
f_6          1      1-ε    1+ε    1-ε
f_7          1      1-ε    1-ε    1+ε
…
Base arms: $K_0$ (A1 to A4); cheating arms: $\log_2 K_0$ (A5, A6)
[Figure: regret versus time for the oracle and UCB.]
4! ln π , ππ
30
[Figure: regret versus time for AO and UCB.]
31