SLIDE 1

Concentration inequalities

Jean-Yves Audibert 1,2

  • 1. Imagine - ENPC/CSTB - université Paris Est
  • 2. Willow (INRIA/ENS/CNRS)

ThRaSH'2010

SLIDE 2

Problem

Tight upper and lower bounds on $f(X_1, \ldots, X_n)$ with $X_1, \ldots, X_n$ i.i.d. random variables taking their values in some (measurable) space $\mathcal{X}$ and $f : \mathcal{X}^n \to \mathbb{R}$ a function whose value depends on all the variables but not too much on any of them. For example:

$$f(X_1, \ldots, X_n) = \frac{X_1 + \cdots + X_n}{n}$$

or

$$f(X_1, \ldots, X_n) = \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$$

SLIDE 3

Outline

  • Asymptotic viewpoint
  • Nonasymptotic viewpoint
    – Gaussian approximation
    – Gaussian processes
    – Sum of i.i.d. r.v.
    – Functions with bounded differences
    – Self-bounding functions

SLIDE 4

The asymptotic viewpoint

  • What is the limit of $f(X_1, \ldots, X_n)$?
  • What is the limit of its centered and scaled version:
$$\frac{f(X_1, \ldots, X_n) - \mathbb{E} f(X_1, \ldots, X_n)}{\sqrt{\operatorname{Var} f(X_1, \ldots, X_n)}} \;?$$

SLIDE 5

Convergence of random variables

  • Convergence in distribution: $W_n \xrightarrow[n \to +\infty]{d} W$
    ⇔ $\forall t \in \mathbb{R}$ s.t. $F_W$ cont. at $t$, $F_{W_n}(t) \xrightarrow[n \to +\infty]{} F_W(t)$
    ⇔ $\forall f : \mathbb{R} \to \mathbb{R}$ cont. and bounded, $\mathbb{E} f(W_n) \xrightarrow[n \to +\infty]{} \mathbb{E} f(W)$
    ⇔ $\forall t \in \mathbb{R}$, $\mathbb{E} e^{itW_n} \xrightarrow[n \to +\infty]{} \mathbb{E} e^{itW}$ (with $i^2 = -1$)
  • Convergence in probability: $W_n \xrightarrow[n \to +\infty]{P} W$ ⇔ $\forall \varepsilon > 0$, $\mathbb{P}(|W_n - W| \ge \varepsilon) \xrightarrow[n \to +\infty]{} 0$
  • Almost sure convergence: $W_n \xrightarrow[n \to +\infty]{a.s.} W$ ⇔ $\mathbb{P}\left(W_n \xrightarrow[n \to +\infty]{} W\right) = 1$

Almost sure cvg ⇒ cvg in probability ⇒ cvg in distribution

  • If $\forall \varepsilon > 0$, $\sum_{n \ge 1} \mathbb{P}(|W_n - W| > \varepsilon) < +\infty$, then $W_n \xrightarrow[n \to +\infty]{a.s.} W$

SLIDE 6

Convergence of the empirical mean $f(X_1, \ldots, X_n) = \frac{X_1 + \cdots + X_n}{n}$

  • LLN (1713): If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $\mathbb{E}|X| < +\infty$, then
$$\bar{X} = \frac{\sum_{i=1}^n X_i}{n} \xrightarrow[n \to +\infty]{a.s.} \mathbb{E}X$$
  • CLT (1733): If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $\mathbb{E}X^2 < +\infty$, then
$$\sqrt{n}\,(\bar{X} - \mathbb{E}X) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, \operatorname{Var} X),$$
or equivalently: for any $t$,
$$\mathbb{P}\left( \sqrt{\tfrac{n}{\operatorname{Var} X}}\, (\bar{X} - \mathbb{E}X) > t \right) \xrightarrow[n \to +\infty]{} \int_t^{+\infty} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du.$$
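
A minimal Monte Carlo sketch in Python (not part of the slides; the Exponential(1) choice, for which EX = Var X = 1, is arbitrary) illustrating the LLN and the CLT tail statement above:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 500, 10000

samples = rng.exponential(scale=1.0, size=(n_rep, n))
x_bar = samples.mean(axis=1)                        # one empirical mean per repetition

# LLN: the empirical mean concentrates around EX = 1
print("average of X_bar over repetitions:", x_bar.mean())

# CLT: sqrt(n / Var X) * (X_bar - EX) has approximately standard Gaussian tails
z = math.sqrt(n) * (x_bar - 1.0)
for t in (0.5, 1.0, 2.0):
    empirical = (z > t).mean()
    gaussian = 0.5 * math.erfc(t / math.sqrt(2.0))  # integral of exp(-u^2/2)/sqrt(2 pi) from t to infinity
    print(f"t={t}: empirical tail {empirical:.4f} vs Gaussian tail {gaussian:.4f}")
```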

SLIDE 7

Slutsky's lemma (1925)

Let $(V_n)$ and $(W_n)$ be two sequences of random vectors or variables. If $V_n \xrightarrow[n \to +\infty]{P} v$ and $W_n \xrightarrow[n \to +\infty]{d} W$, then

  • 1. $V_n + W_n \xrightarrow[n \to +\infty]{d} v + W$
  • 2. $V_n W_n \xrightarrow[n \to +\infty]{d} v W$
  • 3. $V_n^{-1} W_n \xrightarrow[n \to +\infty]{d} v^{-1} W$ if $v$ invertible

SLIDE 8

An example of complicated functional: the t-statistics

Let $f(X_1, \ldots, X_n) = \frac{\sqrt{n}(\bar{X} - \mathbb{E}X)}{S_n}$, with $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2$.

Since $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mathbb{E}X)^2 - (\mathbb{E}X - \bar{X})^2$, from the LLN, we have $S_n^2 \xrightarrow[n \to +\infty]{a.s.} \operatorname{Var} X$. From the CLT, $\sqrt{n}(\bar{X} - \mathbb{E}X) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, \operatorname{Var} X)$.

Thus, from Slutsky's lemma, $f(X_1, \ldots, X_n) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, 1)$.

Appropriate decompositions of complicated functionals allow one to compute their asymptotic distribution.
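
A small Python sketch (the Exponential(1) data below is an arbitrary choice) checking that the self-normalized statistic above is approximately N(0, 1), as the LLN + CLT + Slutsky argument predicts:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, n_rep = 1000, 5000

x = rng.exponential(1.0, size=(n_rep, n))
x_bar = x.mean(axis=1)
s_n = x.std(axis=1)                                  # sqrt of (1/n) sum (X_i - X_bar)^2
t_stat = math.sqrt(n) * (x_bar - 1.0) / s_n          # EX = 1 for Exponential(1)

# Empirical quantiles should be close to the standard normal ones (1.2816, 1.6449, 2.3263)
for q in (0.90, 0.95, 0.99):
    print(q, round(float(np.quantile(t_stat, q)), 3))
```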

SLIDE 9

Nonasymptotic bounds

Motivations:

  • When the nonasymptotic regime plays a crucial role (for instance, multi-armed bandit problems, racing algorithms, stopping time problems)
  • When asymptotic analysis is not achievable through standard arguments
  • To derive asymptotic results!

SLIDE 10

The Berry (1941) - Esseen (1942) theorem

  • $X, X_1, \ldots, X_n$ i.i.d.
  • $\mathbb{E}|X|^3 < +\infty$ and $\sigma^2 = \operatorname{Var} X$
  • $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$
  • $Z \sim \mathcal{N}(\mathbb{E}\bar{X}, \operatorname{Var} \bar{X})$

$$\sup_{x \in \mathbb{R}} \left| \mathbb{P}(\bar{X} > x) - \mathbb{P}(Z > x) \right| \le \frac{\mathbb{E}|X - \mathbb{E}X|^3}{2\sigma^3} \frac{1}{\sqrt{n}}$$
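
A hedged numerical check of the statement, using SciPy and Bernoulli(p) variables (an arbitrary choice for which the distribution of X̄ is known exactly); the supremum is approximated over a fine grid of x values:

```python
import numpy as np
from scipy.stats import binom, norm

p, n = 0.3, 200
sigma = np.sqrt(p * (1 - p))
rho = p * (1 - p) * (p**2 + (1 - p)**2)              # E|X - EX|^3 for a Bernoulli(p)

xs = np.linspace(-0.1, 1.1, 20001)
tail_exact = binom.sf(np.floor(n * xs), n, p)        # P(X_bar > x) = P(Bin(n, p) > n x)
tail_gauss = norm.sf(xs, loc=p, scale=sigma / np.sqrt(n))
distance = np.max(np.abs(tail_exact - tail_gauss))

bound = rho / (2 * sigma**3 * np.sqrt(n))
print(f"observed sup-distance ~ {distance:.4f}, Berry-Esseen bound {bound:.4f}")
```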

SLIDE 11

Slud's theorem (1977)

  • $X_1, \ldots, X_n$ i.i.d. $\sim \mathcal{B}(p)$ with $p \le \frac{1}{2}$
  • $Z \sim \mathcal{N}(\mathbb{E}\bar{X}, \operatorname{Var} \bar{X})$
  • for any $x \in [p, 1 - p]$,
$$\mathbb{P}(\bar{X} > x) \ge \mathbb{P}(Z > x)$$

SLIDE 12

The Paley-Zygmund inequality (1932)

  • $X_1, \ldots, X_n$ i.i.d.
  • for any $0 \le \lambda < 1$,
$$\mathbb{P}\left( \frac{\sqrt{n}(\bar{X} - \mathbb{E}X)}{\sqrt{\operatorname{Var} X}} > \lambda \right) \ge (1 - \lambda^2)^2 \min\left\{ \frac{1}{3}, \frac{(\operatorname{Var} X)^2}{\mathbb{E}(X - \mathbb{E}X)^4} \right\}.$$
SLIDE 13

Supremum of Gaussian processes (GP)

  • Gaussian process $(W(g))_{g \in \mathcal{G}}$: for any $g_1, \ldots, g_d \in \mathcal{G}$, $(W(g_1), \ldots, W(g_d))$ is a Gaussian random vector
  • GP: a powerful, flexible probabilistic model parametrized by $\mu(g) = \mathbb{E}W(g)$ and $K(g, g') = \operatorname{Cov}(W(g), W(g'))$
  • Good intuition on GP ⇒ good intuition on $\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$:
$$\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n} \approx \sup_{g \in \mathcal{G}} W(g) \quad \text{with } \mu(g) = \mathbb{E}g(X) \text{ and } K(g, g') = \frac{1}{n}\operatorname{Cov}\left(g(X), g'(X)\right).$$
SLIDE 14

The Borell (1975) - Cirel'son et al. (1976) inequality

  • $Z = \sup_{g \in \mathcal{G}} \left( W(g) - \mathbb{E}W(g) \right)$
  • $\sigma^2 = \sup_{g \in \mathcal{G}} \operatorname{Var} W(g) = \sup_{g \in \mathcal{G}} K(g, g)$

For any $\lambda \in \mathbb{R}$, $\log \mathbb{E} e^{\lambda(Z - \mathbb{E}Z)} \le \frac{\lambda^2 \sigma^2}{2}$

For any $t > 0$, $\mathbb{P}(Z - \mathbb{E}Z \ge t) \le e^{-\frac{t^2}{2\sigma^2}}$
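
A Monte Carlo sketch over a finite index set (the covariance matrix below is an arbitrary positive semi-definite choice), checking the tail bound for Z = max of a centered Gaussian vector; EZ is itself estimated from the simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_rep = 50, 100000

A = rng.standard_normal((d, d)) / np.sqrt(d)
K = A @ A.T                                          # arbitrary covariance matrix
sigma2 = np.max(np.diag(K))                          # sup_g Var W(g) = sup_g K(g, g)

W = rng.multivariate_normal(np.zeros(d), K, size=n_rep)
Z = W.max(axis=1)
EZ = Z.mean()                                        # Monte Carlo estimate of E Z

for t in (0.5, 1.0, 1.5):
    empirical = (Z - EZ >= t).mean()
    bound = np.exp(-t**2 / (2 * sigma2))
    print(f"t={t}: P(Z - EZ >= t) ~ {empirical:.4f} <= bound {bound:.4f}")
```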

SLIDE 15

Dudley's integral (1967)

  • $d(g, g') = \sqrt{\mathbb{E}[W(g) - W(g')]^2}$
  • $N(\varepsilon)$ = $\varepsilon$-packing number of $(\mathcal{G}, d)$
  • $\sigma^2 = \sup_{g \in \mathcal{G}} \operatorname{Var} W(g) = \sup_{g \in \mathcal{G}} K(g, g)$

$$\mathbb{E} \sup_{g \in \mathcal{G}} \left( W(g) - \mathbb{E}W(g) \right) \le 12 \int_0^\sigma \sqrt{\log N(\varepsilon)}\, d\varepsilon$$
SLIDE 16

Another Borell (1975) - Cirel'son et al. (1976) inequality

  • $X_1, \ldots, X_n$ i.i.d. $\sim \mathcal{N}(0, 1)$
  • $f : \mathbb{R}^n \to \mathbb{R}$ $L$-Lipschitz for the Euclidean distance:
    for any $x, x'$ in $\mathbb{R}^n$, $|f(x) - f(x')| \le L \|x - x'\|$

For any $t > 0$, $\mathbb{P}\left( f(X_1, \ldots, X_n) - \mathbb{E}f(X_1, \ldots, X_n) \ge t \right) \le e^{-\frac{t^2}{2L^2}}$.
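
A quick sketch with f(x) = the Euclidean norm, which is 1-Lipschitz (this choice of f is only for illustration), checking the Gaussian concentration bound above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_rep = 50, 100000

norms = np.linalg.norm(rng.standard_normal((n_rep, n)), axis=1)  # f(X_1, ..., X_n) = ||X||
E_norm = norms.mean()                                            # Monte Carlo estimate of E f
for t in (0.5, 1.0, 2.0):
    empirical = (norms - E_norm >= t).mean()
    print(f"t={t}: empirical {empirical:.5f} <= bound {np.exp(-t**2 / 2):.5f}")  # L = 1
```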

SLIDE 17

Some useful probabilistic inequalities

  • Markov's inequality: for any r.v. $X$ and $a > 0$, since $|X| \ge a \mathbf{1}_{|X| \ge a}$,
$$\mathbb{P}(|X| \ge a) \le \frac{1}{a} \mathbb{E}|X|.$$
  • Jensen's ineq.: for any integrable r.v. $X$ and $\varphi : \mathbb{R}^d \to \mathbb{R}$ convex,
$$\varphi(\mathbb{E}X) \le \mathbb{E}\varphi(X).$$
  • For any r.v. $X$,
$$\mathbb{E}X \le \int_0^{+\infty} \mathbb{P}(X \ge t)\, dt \quad \text{(with equality if } X \ge 0\text{)}$$
  • Markov's inequality is at the basis of Chernoff's argument: $\forall s > 0$,
$$\mathbb{P}(X \ge t) = \mathbb{P}\left( e^{sX} \ge e^{st} \right) \le e^{-st} \mathbb{E}e^{sX}.$$
Control of the Laplace transform ⇒ control of the large deviations.

SLIDE 18

Hoeffding's inequality (1963)

If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $a \le X \le b$, then

  • 1. $\forall s \in \mathbb{R}$,
$$\mathbb{E}e^{s(X - \mathbb{E}X)} \le e^{\frac{s^2(b-a)^2}{8}}$$
  • 2. For any $t \ge 0$,
$$\mathbb{P}\left( \bar{X} - \mathbb{E}X \ge t \right) \le e^{-\frac{2nt^2}{(b-a)^2}},$$
or equivalently, for any $\varepsilon > 0$,
$$\mathbb{P}\left( \bar{X} - \mathbb{E}X < (b-a)\sqrt{\frac{\log(\varepsilon^{-1})}{2n}} \right) \ge 1 - \varepsilon,$$
i.e., "w.h.p." $\bar{X} - \mathbb{E}X < (b-a)\sqrt{\frac{\log(\varepsilon^{-1})}{2n}}$.
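
A minimal sketch (Bernoulli(0.5) data, an arbitrary choice with b - a = 1) comparing the empirical tail probability of X̄ - EX with Hoeffding's bound:

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_rep = 100, 200000

x_bar = rng.binomial(n, 0.5, size=n_rep) / n         # empirical means of Bernoulli(0.5) samples
for t in (0.05, 0.10, 0.15):
    empirical = (x_bar - 0.5 >= t).mean()
    hoeffding = np.exp(-2 * n * t**2)                # (b - a) = 1
    print(f"t={t}: empirical {empirical:.4f} <= Hoeffding {hoeffding:.4f}")
```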

SLIDE 19

Log-Laplace upper bound

  • 1. $\forall s \in \mathbb{R}$, $\mathbb{E}e^{s(X - \mathbb{E}X)} \le e^{\frac{s^2(b-a)^2}{8}}$

$$\varphi(s) = \log \mathbb{E}e^{sX} \qquad \mathbb{P}_s(d\omega) = \frac{e^{sX(\omega)}}{\mathbb{E}e^{sX}} \cdot \mathbb{P}(d\omega)$$
$$\varphi'(s) = \mathbb{E}_{\mathbb{P}_s} X \qquad \varphi''(s) = \operatorname{Var}_{\mathbb{P}_s} X$$
$$\operatorname{Var}_{\mathbb{P}_s} X = \inf_{r \in \mathbb{R}} \mathbb{E}_{\mathbb{P}_s} (X - r)^2 \le \mathbb{E}_{\mathbb{P}_s} \left( X - \tfrac{a+b}{2} \right)^2 \le \tfrac{(b-a)^2}{4}.$$
$$\varphi(s) = \varphi(0) + s\varphi'(0) + \int_0^s (s - t)\varphi''(t)\, dt$$
$$\Rightarrow \quad \log \mathbb{E}e^{sX} \le s\mathbb{E}X + \int_0^s (s - t)\frac{(b-a)^2}{4}\, dt \le s\mathbb{E}X + \frac{(b-a)^2 s^2}{8}$$

SLIDE 20

Chernoff's Argument

  • 2. For any $t \ge 0$, $\mathbb{P}\left( \bar{X} - \mathbb{E}X > t \right) \le e^{-\frac{2nt^2}{(b-a)^2}}$.

$$\mathbb{P}(\bar{X} - \mathbb{E}X \ge t) = \mathbb{P}\left( e^{s(\bar{X} - \mathbb{E}X)} \ge e^{st} \right) \le e^{-st} \mathbb{E}\left[e^{s(\bar{X} - \mathbb{E}X)}\right] = e^{-st} \mathbb{E}\left[ e^{\frac{s \sum_{i=1}^n (X_i - \mathbb{E}X)}{n}} \right]$$
$$= e^{-st} \left( \mathbb{E}\left[ e^{\frac{s(X - \mathbb{E}X)}{n}} \right] \right)^n \le e^{-st + \frac{s^2}{n} \frac{(b-a)^2}{8}} = e^{-\frac{2nt^2}{(b-a)^2}}$$
by choosing $s = \frac{4nt}{(b-a)^2}$.

SLIDE 21

Union bound

  • $\mathbb{P}(A) \ge 1 - \varepsilon$ and $\mathbb{P}(B) \ge 1 - \varepsilon$ ⇒ $\mathbb{P}(A \cap B) \ge 1 - 2\varepsilon$
    (since $\mathbb{P}(A^c \cup B^c) \le \mathbb{P}(A^c) + \mathbb{P}(B^c)$)

For instance: Hoeffding applied to $X$ + Hoeffding applied to $-X$ + union bound ⇒ with proba $\ge 1 - \varepsilon$,
$$|\bar{X} - \mathbb{E}X| < (b-a)\sqrt{\frac{\log(2\varepsilon^{-1})}{2n}}$$
(leads to pessimistic but correct confidence intervals, unlike the CLT)

  • If $\mathbb{P}(A_1) \ge 1 - \varepsilon, \ldots, \mathbb{P}(A_m) \ge 1 - \varepsilon$, then $\mathbb{P}\left( A_1 \cap \cdots \cap A_m \right) \ge 1 - m\varepsilon$
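
A small sketch of the resulting two-sided confidence interval; the Bernoulli(0.3) data is an arbitrary choice used to check the coverage empirically:

```python
import numpy as np

def hoeffding_halfwidth(n, eps, a=0.0, b=1.0):
    """Half-width of the two-sided (1 - eps) Hoeffding confidence interval."""
    return (b - a) * np.sqrt(np.log(2.0 / eps) / (2.0 * n))

rng = np.random.default_rng(5)
n, n_rep, eps, p = 200, 50000, 0.05, 0.3

x_bar = rng.binomial(n, p, size=n_rep) / n
covered = (np.abs(x_bar - p) < hoeffding_halfwidth(n, eps)).mean()
print(f"half-width {hoeffding_halfwidth(n, eps):.4f}, coverage {covered:.4f} >= {1 - eps}")
```
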
SLIDE 22

Bernstein's (1946) inequality

Hoeffding's inequality vs CLT:
$$e^{-\frac{2\alpha^2 \operatorname{Var} X}{(b-a)^2}} \ge \mathbb{P}\left( \sqrt{\tfrac{n}{\operatorname{Var} X}}\, (\bar{X} - \mathbb{E}X) > \alpha \right) \xrightarrow[n \to +\infty]{} \mathbb{P}(Z > \alpha) \approx \frac{e^{-\alpha^2/2}}{\alpha\sqrt{2\pi}}$$
⇒ Hoeffding's inequality is imprecise for r.v. having low variance.

Bernstein's inequality: If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $X - \mathbb{E}X \le c$, then

  • for any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$\bar{X} \le \mathbb{E}X + \sqrt{\frac{2 \log(\varepsilon^{-1}) \operatorname{Var} X}{n}} + \frac{c \log(\varepsilon^{-1})}{3n}$$
  • for any $t \ge 0$,
$$\mathbb{P}\left( \bar{X} - \mathbb{E}X > t \right) \le e^{-\frac{nt^2}{2 \operatorname{Var} X + 2ct/3}}$$
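
A hedged comparison (Bernoulli(0.01) data, i.e. a low-variance r.v. in [0, 1], chosen only for illustration) of the Hoeffding and Bernstein deviation terms at confidence level 1 - eps:

```python
import numpy as np

n, eps, p = 1000, 0.01, 0.01                     # EX = p, Var X = p(1 - p), b - a = 1, X - EX <= c = 1 - p
var, c = p * (1 - p), 1 - p

hoeffding = np.sqrt(np.log(1 / eps) / (2 * n))
bernstein = np.sqrt(2 * np.log(1 / eps) * var / n) + c * np.log(1 / eps) / (3 * n)
print(f"Hoeffding deviation: {hoeffding:.5f}")   # ~ 0.048
print(f"Bernstein deviation: {bernstein:.5f}")   # ~ 0.011, much tighter when Var X is small
```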

SLIDE 23

Empirical Bernstein's inequality (A., Munos, Szepesvári, 2007; Maurer, Pontil, 2009)

If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $a \le X \le b$, then for any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$\mathbb{E}X \le \bar{X} + \sqrt{\frac{2 \log(\varepsilon^{-1})\, \hat{\sigma}^2}{n}} + \frac{7(b-a) \log(\varepsilon^{-1})}{3n} \quad \text{with } \hat{\sigma}^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n - 1}.$$
  • to be compared with
$$\mathbb{E}X \le \bar{X} + \sqrt{\frac{2 \log(\varepsilon^{-1}) \operatorname{Var} X}{n}} + \frac{(b-a) \log(\varepsilon^{-1})}{3n}$$
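
A minimal sketch of the empirical Bernstein upper confidence bound stated above; the Beta(2, 8) sample is an arbitrary choice of a r.v. bounded in [0, 1]:

```python
import numpy as np

def empirical_bernstein_ucb(x, eps, a=0.0, b=1.0):
    """Upper bound on EX holding with probability at least 1 - eps, from a sample x in [a, b]."""
    n = len(x)
    sigma2_hat = x.var(ddof=1)                   # sum (X_i - X_bar)^2 / (n - 1)
    return (x.mean()
            + np.sqrt(2 * np.log(1 / eps) * sigma2_hat / n)
            + 7 * (b - a) * np.log(1 / eps) / (3 * n))

rng = np.random.default_rng(6)
x = rng.beta(2, 8, size=500)                     # true mean 0.2
print("empirical Bernstein upper confidence bound:", empirical_bernstein_ucb(x, eps=0.05))
```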

SLIDE 24

Hoeffding-Azuma inequalities (McDiarmid's version, 1989)

If for some $c \ge 0$,
$$\sup_{\substack{i \in \{1, \ldots, n\} \\ (x_1, \ldots, x_n) \in \mathcal{X}^n \\ x \in \mathcal{X}}} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_n) \right| \le c,$$
then, for any $\lambda \in \mathbb{R}$, $W = f(X_1, \ldots, X_n)$ satisfies
$$\mathbb{E}e^{\lambda(W - \mathbb{E}W)} \le e^{\frac{n\lambda^2 c^2}{8}}$$
and for any $t \ge 0$,
$$\mathbb{P}\left( W - \mathbb{E}W > t \right) \le e^{-\frac{2t^2}{nc^2}}$$

SLIDE 25

First example: Hoeffding's inequality in Hilbert space

  • $X_1, \ldots, X_n$ i.i.d. r.v. taking values in a separable Hilbert space
  • $\mathbb{E}X = 0$ and $\|X\| \le 1$

For any $t \ge 4\sqrt{n}$,
$$\mathbb{P}\left( \|X_1 + \cdots + X_n\| \ge t \right) \le e^{-\frac{t^2}{8n}}.$$

SLIDE 26

Second example: supremum of empirical process

$W = f(X_1, \ldots, X_n) = \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$, with $\mathcal{G}$ finite

  • Assumptions: $\forall g \in \mathcal{G}$, $g$ takes its values in $[-1, 1]$ and $\mathbb{E}g(X_1) = 0$
  • $\sup_{\substack{i \in \{1, \ldots, n\} \\ (x_1, \ldots, x_n) \in \mathcal{X}^n \\ x \in \mathcal{X}}} \left| f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x, x_{i+1}^n) \right| \le \frac{2}{n}$,
  • McDiarmid's inequality ⇒ $\mathbb{P}\left( W - \mathbb{E}W > t \right) \le e^{-nt^2/2}$

⇒ with proba $\ge 1 - \varepsilon$,
$$\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n} \le \mathbb{E} \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n} + \sqrt{\frac{2 \log(\varepsilon^{-1})}{n}}$$
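
A hedged Monte Carlo sketch of this finite-class example. The assumed setup (not from the slides): X uniform on [0, 1] and G the finite class g_s(x) = 1{x <= s} - s over a grid of thresholds s, so each g takes values in [-1, 1] and E g(X) = 0; E W is estimated from the simulation itself:

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_rep, eps = 500, 5000, 0.05
thresholds = np.linspace(0.05, 0.95, 19)

def sup_empirical_process(x):
    # (1/n) sum_i g_s(X_i) for every threshold s, then take the supremum over s
    vals = (x[:, None] <= thresholds).mean(axis=0) - thresholds
    return vals.max()

W = np.array([sup_empirical_process(rng.uniform(size=n)) for _ in range(n_rep)])
EW = W.mean()
deviation = np.sqrt(2 * np.log(1 / eps) / n)
coverage = (W - EW <= deviation).mean()
print(f"E W ~ {EW:.4f}, P(W - EW <= sqrt(2 log(1/eps)/n)) ~ {coverage:.4f} >= {1 - eps}")
```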

SLIDE 27

Third example: kernel density estimation

  • $X_1, \ldots, X_n$ i.i.d. r.v. from a distribution with density $p$ on $\mathbb{R}$
  • $h > 0$ and $K : \mathbb{R} \to \mathbb{R}_+$ with $\int_{\mathbb{R}} K = 1$
  • $\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^n K\left( \frac{x - X_i}{h} \right)$
  • $W = f(X_1, \ldots, X_n) = \int \left| \hat{p}(x) - p(x) \right| dx$

$$\left| f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x_i', x_{i+1}^n) \right| \le \frac{1}{nh} \int \left| K\left( \tfrac{x - x_i}{h} \right) - K\left( \tfrac{x - x_i'}{h} \right) \right| dx \le \frac{2}{n},$$
so with proba $\ge 1 - \varepsilon$,
$$W - \mathbb{E}W \le \sqrt{\frac{2 \log(\varepsilon^{-1})}{n}}$$
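
A hedged numerical sketch of this third example: standard normal data, a Gaussian kernel, and the L1 distance approximated by a Riemann sum on a grid (all of these are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(8)
n, n_rep, h, eps = 300, 500, 0.3, 0.05
grid = np.linspace(-6, 6, 1001)
dx = grid[1] - grid[0]
p_true = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)

def l1_error(x):
    # p_hat(grid) = (1/(n h)) sum_i K((grid - X_i) / h) with a Gaussian kernel K
    u = (grid[:, None] - x[None, :]) / h
    p_hat = np.exp(-u**2 / 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    return np.abs(p_hat - p_true).sum() * dx

W = np.array([l1_error(rng.standard_normal(n)) for _ in range(n_rep)])
deviation = np.sqrt(2 * np.log(1 / eps) / n)
coverage = (W - W.mean() <= deviation).mean()
print(f"E W ~ {W.mean():.4f}, empirical coverage of the bound ~ {coverage:.4f} >= {1 - eps}")
```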

SLIDE 28

Self-bounded functions

(Boucheron, Lugosi, Massart, 2003, 2009; Maurer, 2005)

  • $f_i(x_1, \ldots, x_n) = \inf_{x_i \in \mathcal{X}} f(x_1, \ldots, x_n)$
  • If for some $a, b \ge 0$, for any $(x_1, \ldots, x_n) \in \mathcal{X}^n$,
$$\sum_{i=1}^n \left( f(x_1, \ldots, x_n) - f_i(x_1, \ldots, x_n) \right)^2 \le a f(x_1, \ldots, x_n) + b,$$
then, for any $t \ge 0$, $W = f(X_1, \ldots, X_n)$ satisfies
$$\mathbb{P}\left( W - \mathbb{E}W > t \right) \le e^{-\frac{t^2}{2(a\mathbb{E}W + b + at/2)}}$$

SLIDE 29

Talagrand's inequality

(Talagrand, 1996; Rio, 2002; Bousquet, 2003)

  • $W = \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$
  • $\mathbb{E}g(X) = 0$ and $g(X) \le c$
  • $v = \sup_{g \in \mathcal{G}} \operatorname{Var} g(X) + 2c\mathbb{E}W$

For any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$W - \mathbb{E}W \le \sqrt{\frac{2v \log(\varepsilon^{-1})}{n}} + \frac{c \log(\varepsilon^{-1})}{3n}$$
For any $t \ge 0$,
$$\mathbb{P}\left( W - \mathbb{E}W > t \right) \le e^{-\frac{nt^2}{2v + 2ct/3}}$$

SLIDE 30

Expected maximal deviations

Let $\sigma > 0$, $m \ge 2$, $W_1, \ldots, W_m$ r.v. s.t. for all $s > 0$ and any $1 \le i \le m$, $\mathbb{E}e^{sW_i} \le e^{\frac{s^2\sigma^2}{2}}$. Then
$$\mathbb{E}\left[ \max_{1 \le i \le m} W_i \right] \le \sigma \sqrt{2 \log m}.$$
If for any $s > 0$, we also have $\mathbb{E}e^{-sW_i} \le e^{\frac{s^2\sigma^2}{2}}$, then
$$\mathbb{E}\left[ \max_{1 \le i \le m} |W_i| \right] \le \sigma \sqrt{2 \log(2m)}.$$
Proof: $\max_{1 \le i \le m} W_i \le \frac{1}{s} \log \sum_{i=1}^m e^{sW_i}$, so by Jensen's inequality
$$\mathbb{E}\max_{1 \le i \le m} W_i \le \frac{1}{s} \log \mathbb{E}\sum_{i=1}^m e^{sW_i} \le \frac{1}{s} \log\left( m e^{s^2\sigma^2/2} \right),$$
and the choice $s = \sqrt{2 \log m}/\sigma$ gives the bound.
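
A small Monte Carlo sketch (i.i.d. N(0, σ²) variables, an arbitrary choice satisfying the log-Laplace condition) checking E max_i W_i ≤ σ √(2 log m):

```python
import numpy as np

rng = np.random.default_rng(9)
sigma, n_rep = 1.0, 10000

for m in (2, 10, 100, 1000):
    W = sigma * rng.standard_normal((n_rep, m))
    emax = W.max(axis=1).mean()                  # Monte Carlo estimate of E max_i W_i
    bound = sigma * np.sqrt(2 * np.log(m))
    print(f"m={m}: E max ~ {emax:.3f} <= {bound:.3f}")
```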

SLIDE 31

Extension to martingale difference sequences

  • Let $X_1, X_2, \ldots$ and $U_1, U_2, \ldots$ be r.v. such that $\mathbb{E}[X_i | U_1, \ldots, U_{i-1}] = 0$ for all $i \ge 1$
  • Assume that for some $c > 0$ and some r.v. $A_i$ measurable w.r.t. $U_1, \ldots, U_{i-1}$, $X_i$ takes its values in $[A_i, A_i + c]$. Then
$$\mathbb{P}(\bar{X} > t) \le e^{-\frac{2nt^2}{c^2}}$$
  • same r.h.s. as if we had i.i.d. r.v. taking values in $[0, c]$

SLIDE 32

Other extensions

  • All upper bounds easily extend to independent non-identically distributed r.v.
  • Some upper bounds on the empirical mean can be extended to random vectors
  • All upper bounds on the empirical mean remain valid if the $X_i$ are sampled without replacement

SLIDE 33

Some nice references:

  • Appendix of G. Lugosi and N. Cesa-Bianchi's book "Prediction, Learning, and Games"
  • G. Lugosi's lecture notes on concentration inequalities
  • Boucheron, Lugosi, Massart (2003, 2009)
  • P. Massart's Saint-Flour lecture notes