SLIDE 1
Self-bounding functions and concentration of variance
Andreas Maurer
Advances in stochastic inequalities and their applications, BIRS 2009
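SLIDE 2
Notation and definitions

$\Omega := \prod_{k=1}^{n} \Omega_k$ is some product space with product probability $\mu = \otimes_{k=1}^{n} \mu_k$.

For $x \in \Omega$ write $x_{y,k} := (x_1, \dots, x_{k-1}, y, x_{k+1}, \dots, x_n)$.

$f : \Omega \to \mathbb{R}$ is some generic function, bounded below. For $1 \le k \le n$ define functions $\inf_k f,\ Df : \Omega \to \mathbb{R}$ by

$$\inf_k f(x) := \inf_{y \in \Omega_k} f(x_{y,k}), \qquad Df(x) := \sum_{k=1}^{n} \left( f(x) - \inf_k f(x) \right)^2 .$$

$Df$ is a local measure of the sensitivity of $f$ to modifications of individual arguments.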
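The two definitions on this slide can be made concrete by brute force when every coordinate space is finite. The sketch below is only an illustration under that assumption. The names `inf_k`, `D`, `spaces` and `count_ones` are hypothetical and not from the talk.

```python
from itertools import product

def inf_k(f, x, k, omega_k):
    """Infimum over y in omega_k of f(x with coordinate k replaced by y)."""
    return min(f(x[:k] + (y,) + x[k + 1:]) for y in omega_k)

def D(f, x, spaces):
    """Df(x) = sum over k of (f(x) - inf_k f(x))^2, for finite coordinate spaces."""
    fx = f(x)
    return sum((fx - inf_k(f, x, k, spaces[k])) ** 2 for k in range(len(x)))

# Hypothetical example: f counts the ones in a binary vector (x is a tuple).
spaces = [(0, 1)] * 4
count_ones = lambda x: sum(x)

# Setting coordinate k to 0 attains the infimum, so f - inf_k f = x_k and
# Df = sum_k x_k^2 = f: the function satisfies Df <= a f with a = 1.
for x in product(*spaces):
    assert D(count_ones, x, spaces) == count_ones(x)
```

For this toy function the relation $Df \le a f$ holds with equality and $a = 1$, which is the kind of self-bounding condition used later in the talk.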
SLIDE 3
Theorem 1 (Boucheron, Lugosi, Massart (2003), Maurer (2006))

$$\Pr\{ f - E[f] \ge t \} \le \exp\left( \frac{-t^2}{2 \|Df\|_\infty} \right).$$

If also $f - \inf_k f \le 1$ a.s. for all $k$, then

$$\Pr\{ E[f] - f \ge t \} \le \exp\left( \frac{-t^2}{2 \|Df\|_\infty + 2t/3} \right).$$

Applies to convex Lipschitz functions, eigenvalues of random symmetric matrices, shortest TSP tours, ...
SLIDE 4
Theorem 2 (Boucheron, Lugosi, Massart (2003), Maurer (2006))

Suppose $Df \le a f$ a.s., with $a > 0$. Then

$$\Pr\{ f - E[f] \ge t \} \le \exp\left( \frac{-t^2}{2 a E[f] + a t} \right).$$

If also $f - \inf_k f \le 1$ a.s. for all $k$, and $a \ge 1$, then

$$\Pr\{ E[f] - f \ge t \} \le \exp\left( \frac{-t^2}{2 a E[f]} \right).$$

This talk is about applications of this result.
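To get a feel for the two tails in Theorem 2, the sketch below evaluates the bounds and inverts them: given a confidence level `delta`, it returns the deviation `t` at which the bound equals `delta`. The function names are hypothetical, and the inversion is plain algebra on the stated exponents (a quadratic in `t` for the upper tail), not something taken from the slides.

```python
import math

def upper_tail(t, a, mean_f):
    """Theorem 2 bound on Pr{f - E[f] >= t}, assuming Df <= a*f with a > 0."""
    return math.exp(-t * t / (2 * a * mean_f + a * t))

def lower_tail(t, a, mean_f):
    """Theorem 2 bound on Pr{E[f] - f >= t}; also needs f - inf_k f <= 1, a >= 1."""
    return math.exp(-t * t / (2 * a * mean_f))

def upper_deviation(delta, a, mean_f):
    """t solving upper_tail(t) = delta: positive root of t^2 - a*L*t - 2*a*E[f]*L = 0."""
    L = math.log(1 / delta)
    return (a * L + math.sqrt(a * a * L * L + 8 * a * mean_f * L)) / 2

def lower_deviation(delta, a, mean_f):
    """t solving lower_tail(t) = delta."""
    return math.sqrt(2 * a * mean_f * math.log(1 / delta))

# Hypothetical numbers: a = 4 and E[f] = 10, at confidence 1 - delta = 0.95.
print(upper_deviation(0.05, 4, 10.0), lower_deviation(0.05, 4, 10.0))
```

The lower tail is purely sub-Gaussian, while the extra `a*t` term makes the upper deviation slightly larger at the same confidence level.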
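SLIDE 5
Application 1
Amendment to Theorem 1, idea from Boucheron, Lugosi, Massart (2009).

If $f \ge 0$ and $f^2 - \inf_k f^2 \le 1$, then

$$\Pr\{ E[f] - f \ge t \} \le \exp\left( \frac{-t^2}{8 \|Df\|_\infty} \right).$$

Proof: Since $f \ge 0$ we have $\inf_k f^2 = (\inf_k f)^2$ and $0 \le \inf_k f \le f$, so

$$D(f^2) = \sum_k \left( f^2 - \inf_k f^2 \right)^2 = \sum_k \left( f - \inf_k f \right)^2 \left( f + \inf_k f \right)^2 \le 4 f^2\, Df \le 4 \|Df\|_\infty f^2 ,$$

so by Theorem 2 applied to $f^2$

$$\Pr\{ E[f] - f \ge t \} \le \Pr\left\{ E[f^2] - f^2 \ge E[f]\, t \right\} \le \exp\left( \frac{-t^2}{8 \|Df\|_\infty} \right).$$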
SLIDE 6
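Application 2 (with Massi Pontil for COLT09):

$X, X_1, \dots, X_n$ iid r.v. with values in $[0,1]$. Want to give bounds on $EX$ in terms of $\mathbf{X} = (X_1, \dots, X_n)$ with high confidence $1 - \delta$. Below, $\bar{X} = \frac{1}{n} \sum_i X_i$ is the sample mean and $V$ the variance of $X$.

Hoeffding:

$$\Pr\left\{ EX - \bar{X} \le \sqrt{\frac{\ln 1/\delta}{2n}} \right\} \ge 1 - \delta .$$

Bernstein/Bennett:

$$\Pr\left\{ EX - \bar{X} \le \sqrt{V} \sqrt{\frac{2 \ln 1/\delta}{n}} + \frac{\ln 1/\delta}{3n} \right\} \ge 1 - \delta .$$

To use Bernstein without other information we need a bound on the standard deviation $\sqrt{V}$ in terms of the sample.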
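The difference between the two bounds shows up once numbers are plugged in. The sketch below evaluates both confidence widths. The function names and the chosen values of `n`, `delta` and the variance are illustrative assumptions, not values from the talk.

```python
import math

def hoeffding_width(n, delta):
    """Width sqrt(ln(1/delta) / (2n)) of the Hoeffding bound."""
    return math.sqrt(math.log(1 / delta) / (2 * n))

def bernstein_width(n, delta, variance):
    """Width sqrt(2 V ln(1/delta) / n) + ln(1/delta) / (3n) of the Bernstein bound."""
    L = math.log(1 / delta)
    return math.sqrt(2 * variance * L / n) + L / (3 * n)

n, delta = 1000, 0.05
for V in (0.25, 0.01):  # 0.25 is the largest possible variance on [0, 1]
    print(V, hoeffding_width(n, delta), bernstein_width(n, delta, V))
```

At the maximal variance 1/4 the Bernstein width is slightly worse than Hoeffding's. For small variance it is much tighter, which is why a data-dependent bound on $\sqrt{V}$ is worth having.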
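SLIDE 7
Estimators for variance and standard deviation

For the variance use the sample variance $\hat{V}$:

$$\hat{V}(x) = \frac{1}{2n(n-1)} \sum_{i,j} (x_i - x_j)^2 \quad \text{for } x \in [0,1]^n .$$

For the standard deviation we use $\sqrt{\hat{V}}$. Then we can show that $f := n \hat{V}$ satisfies

$$f - \inf_k f \le 1 \quad \text{and} \quad Df \le \frac{n}{n-1} f ,$$

and Theorem 2 gives the lower tail bounds

$$\Pr\left\{ V - \hat{V} > t \right\} \le \exp\left( \frac{-(n-1) t^2}{2V} \right) \quad \text{and} \quad \Pr\left\{ \sqrt{V} - \sqrt{\hat{V}} > t \right\} \le \exp\left( \frac{-(n-1) t^2}{2} \right).$$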
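The two self-bounding conditions claimed for $f = n\hat{V}$ can be checked numerically on a sample. This is a sketch, not the argument of the talk. It relies on one observation: as a function of a single coordinate, the pairwise sum is quadratic and is minimized at the mean of the remaining points, which lies in $[0,1]$, so $f - \inf_k f$ is the squared distance of $x_k$ to that mean. The function names are hypothetical.

```python
import random

def sample_variance(x):
    """V_hat(x) = sum_{i,j} (x_i - x_j)^2 / (2 n (n - 1))."""
    n = len(x)
    return sum((xi - xj) ** 2 for xi in x for xj in x) / (2 * n * (n - 1))

def self_bounding_terms(x):
    """Return f = n * V_hat(x), the gaps f - inf_k f for each k, and Df."""
    n = len(x)
    f = n * sample_variance(x)
    total = sum(x)
    # Gap for coordinate k: squared distance of x_k to the mean of the others.
    gaps = [(xk - (total - xk) / (n - 1)) ** 2 for xk in x]
    return f, gaps, sum(g * g for g in gaps)

rng = random.Random(0)
x = [rng.random() for _ in range(20)]
n = len(x)
f, gaps, Df = self_bounding_terms(x)
# Both conditions used with Theorem 2: f - inf_k f <= 1 and Df <= n/(n-1) * f.
print(max(gaps) <= 1.0, Df <= n / (n - 1) * f)
```

With the gaps written this way, $\frac{n}{n-1} f$ equals the sum of the gaps, and $Df$ is the sum of their squares, so the second condition follows from each gap being at most 1.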
SLIDE 8
Other methods to get such bounds

Audibert, Munos, Szepesvári (2007): Apply Bernstein-like bounds to $X_i$, $-X_i$ and $(X_i - EX)^2$ respectively, and combine to get

$$\Pr\left\{ \sqrt{V} - \sqrt{\hat{V}_{\mathrm{emp}}} > t \right\} \le \exp\left( \frac{-n t^2}{3.24} \right),$$

where $\hat{V}_{\mathrm{emp}} = (n-1) \hat{V} / n$ (= variance of the empirical distribution).

Alternative: $\hat{V}$ is a U-statistic with kernel $q(x, x') = (x - x')^2 / 2$. Hoeffding's version of Bennett's inequality for U-statistics leads to

$$\Pr\left\{ \sqrt{V} - \sqrt{\hat{V}} > t \right\} \le \exp\left( \frac{-(n-1) t^2}{2.62} \right).$$
SLIDE 9
Empirical Bernstein bounds

Substitution of the above in Bernstein's inequality gives the empirical version:

$$\Pr\left\{ EX - \bar{X} \le \sqrt{\hat{V}} \sqrt{\frac{2 \ln 2/\delta}{n}} + \frac{7 \ln 2/\delta}{3(n-1)} \right\} \ge 1 - \delta .$$

Applications: Multi-armed bandit problem (Audibert, Munos, Szepesvári, 2007), stopping algorithms (Mnih, Szepesvári, Audibert, 2008), sample variance penalization (Pontil, Maurer, 2009).
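The empirical Bernstein bound uses only the sample, so it can be computed directly. The sketch below assumes the sample lies in $[0,1]$ and uses `statistics.variance`, which divides by $n-1$ and therefore coincides with the sample variance $\hat{V}$ of the talk. The function name is hypothetical.

```python
import math
import statistics

def empirical_bernstein_upper(sample, delta):
    """Upper confidence bound on EX, valid with probability >= 1 - delta
    for an iid sample with values in [0, 1]."""
    n = len(sample)
    L = math.log(2 / delta)
    v_hat = statistics.variance(sample)  # unbiased sample variance (divides by n - 1)
    return (statistics.mean(sample)
            + math.sqrt(2 * v_hat * L / n)
            + 7 * L / (3 * (n - 1)))

# Hypothetical low-variance sample: the variance term is small and the
# 7 ln(2/delta) / (3 (n - 1)) term dominates the width.
sample = [0.5, 0.52, 0.48, 0.51, 0.49] * 20
print(empirical_bernstein_upper(sample, 0.05))
```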
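SLIDE 10
Application 3 (Largest eigenvalue of the Gramian):

$\mathbf{X} = (X_1, \dots, X_n)$ indep. r.v. distributed in the unit ball $B$ of a Hilbert space $H$.

$G(x)_{ij} = \langle x_i, x_j \rangle$, $f(x) = \lambda_{\max}(x)$ = largest eigenvalue of $G(x)$.

By Weyl's monotonicity $\inf_k f(x) = f(x_{0,k})$.

Also $\exists u \in \mathbb{R}^n$, $\|u\|_{\mathbb{R}^n} = 1$, such that

$$f(x) - f(x_{0,k}) \le \Big\| \sum_i u_i x_i \Big\|^2 - \Big\| \sum_{i \ne k} u_i x_i \Big\|^2 = \Big\langle u_k x_k,\ \sum_i u_i x_i + \sum_{i \ne k} u_i x_i \Big\rangle \le 2 |u_k| \Big\| \sum_i u_i x_i \Big\| = 2 |u_k| \sqrt{f(x)} .$$

Conclusion 1: $f - \inf_k f \le 1$.
Conclusion 2: Square and sum over $k$ to get $Df \le 4 f$.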
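The two conclusions of this slide can be checked numerically in a small finite-dimensional case. The sketch below takes $H = \mathbb{R}^3$ and approximates the largest eigenvalue by power iteration with the standard library only, so the values are accurate only up to the iteration error. The sampler `random_in_ball` and all function names are hypothetical, and the infimum over coordinate $k$ is computed by replacing $x_k$ with $0$, as on the slide.

```python
import math
import random

def gram(points):
    """Gram matrix G_ij = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]

def lambda_max(matrix, iters=5000, seed=1):
    """Approximate largest eigenvalue of a PSD matrix by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in matrix]
    for _ in range(iters):
        w = [sum(m * vj for m, vj in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in w]
    w = [sum(m * vj for m, vj in zip(row, v)) for row in matrix]
    return sum(a * b for a, b in zip(v, w))  # Rayleigh quotient of a unit vector

def random_in_ball(rng, dim):
    """Hypothetical sampler: Gaussian direction scaled to a random radius < 1."""
    g = [rng.gauss(0, 1) for _ in range(dim)]
    r = rng.random() / math.sqrt(sum(c * c for c in g))
    return [r * c for c in g]

rng = random.Random(0)
pts = [random_in_ball(rng, 3) for _ in range(5)]
f = lambda_max(gram(pts))
gaps = []
for k in range(len(pts)):
    zeroed = pts[:k] + [[0.0] * 3] + pts[k + 1:]  # x with x_k replaced by 0
    gaps.append(f - lambda_max(gram(zeroed)))
Df = sum(g * g for g in gaps)
# Conclusion 1 and Conclusion 2, up to the power-iteration error.
print(max(gaps) <= 1.0, Df <= 4 * f)
```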
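SLIDE 11
Application 3 (Largest eigenvalue of the Gramian):

$\mathbf{X} = (X_1, \dots, X_n)$ indep. r.v. distributed in the unit ball $B$ of a Hilbert space $H$.

$G(x)_{ij} = \langle x_i, x_j \rangle$, $f(x) = \lambda_{\max}(x)$ = largest eigenvalue of $G(x)$.

From Theorem 2 we get

$$\Pr\{ \lambda_{\max} - E\lambda_{\max} > t \} \le \exp\left( \frac{-t^2}{8 E\lambda_{\max} + 4t} \right),$$

$$\Pr\{ E\lambda_{\max} - \lambda_{\max} > t \} \le \exp\left( \frac{-t^2}{8 E\lambda_{\max}} \right).$$

For the largest singular value $\sigma_{\max}$ of the matrix $\mathbf{X}$ we get

$$\Pr\{ \pm(\sigma_{\max} - E\sigma_{\max}) > t \} \le e^{-t^2/8} .$$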
SLIDE 12
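Another result related to self-bounded functions:

Theorem 3. Suppose $f, g : \Omega \to \mathbb{R}$, $0 \le f \le g$, $Df \le a g$, $Dg \le a g$ and $a \ge 1$. Then

$$\Pr\{ f - Ef > t \} \le \exp\left( \frac{-t^2}{4 a E g + 3 a t / 2} \right).$$

If also $f - \inf_k f \le 1$, then

$$\Pr\{ Ef - f > t \} \le \exp\left( \frac{-t^2}{4 a E g + a t} \right).$$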
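SLIDE 13
Application 4 (any eigenvalue of the Gramian)

$\mathbf{X} = (X_1, \dots, X_n)$ indep. r.v. distributed in the unit ball $B$ of a Hilbert space $H$.

$G(\mathbf{X})_{ij} = \langle X_i, X_j \rangle$; now let $\lambda_d(\mathbf{X})$ be any eigenvalue of $G(\mathbf{X})$.

Set $f := \lambda_d / 2$ and $g := \lambda_{\max} / 2$. We can show

$$0 \le f \le g, \quad f - \inf_k f \le 1, \quad Df \le 2g \quad \text{and} \quad Dg \le 2g .$$

Applying Theorem 3 gives

$$\Pr\{ \lambda_d - E\lambda_d > t \} \le \exp\left( \frac{-t^2}{16 E\lambda_{\max} + 6t} \right),$$

$$\Pr\{ E\lambda_d - \lambda_d > t \} \le \exp\left( \frac{-t^2}{16 E\lambda_{\max} + 4t} \right).$$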
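The constants 16, 6 and 4 in these bounds come from the rescaling by 1/2: a deviation $t$ of $\lambda_d$ is a deviation $t/2$ of $f = \lambda_d/2$, and $Eg = E\lambda_{\max}/2$ with $a = 2$. The sketch below encodes Theorem 3 and this substitution so the arithmetic can be checked. The function names are hypothetical.

```python
import math

def theorem3_upper(t, a, mean_g):
    """Theorem 3 bound on Pr{f - Ef > t}."""
    return math.exp(-t * t / (4 * a * mean_g + 1.5 * a * t))

def theorem3_lower(t, a, mean_g):
    """Theorem 3 bound on Pr{Ef - f > t}; also needs f - inf_k f <= 1."""
    return math.exp(-t * t / (4 * a * mean_g + a * t))

def eigenvalue_upper(t, mean_lambda_max):
    """Bound on Pr{lambda_d - E lambda_d > t}: apply Theorem 3 to f = lambda_d / 2
    with a = 2, E[g] = E[lambda_max] / 2 and deviation t / 2."""
    return theorem3_upper(t / 2, 2, mean_lambda_max / 2)

def eigenvalue_lower(t, mean_lambda_max):
    """Bound on Pr{E lambda_d - lambda_d > t}, by the same substitution."""
    return theorem3_lower(t / 2, 2, mean_lambda_max / 2)

# Agrees with exp(-t^2 / (16 E lambda_max + 6 t)) and exp(-t^2 / (16 E lambda_max + 4 t)).
t, m = 1.0, 3.0
print(eigenvalue_upper(t, m), math.exp(-t * t / (16 * m + 6 * t)))
print(eigenvalue_lower(t, m), math.exp(-t * t / (16 * m + 4 * t)))
```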
SLIDE 14
References
[1] J. Y. Audibert, R. Munos, C. Szepesvári. Exploration-exploitation trade-off using variance estimates in multi-armed bandits. Theoretical Computer Science, 2008.
[2] S. Boucheron, G. Lugosi, P. Massart. Concentration inequalities using the entropy method. Annals of Probability 31:1583-1614, 2003.
[3] M. Ledoux. The Concentration of Measure Phenomenon. AMS Surveys and Monographs 89, 2001.
[4] A. Maurer. Concentration inequalities for functions of independent variables. Random Structures & Algorithms 29:121-138, 2006.
SLIDE 15 [5] Volodymyr Mnih, C. Szepesvári, J. Y. Audibert. Empirical Bernstein Stop-