
Concentration inequalities (Jean-Yves Audibert)



  1. Concentration inequalities
     Jean-Yves Audibert (1, 2)
     (1) Imagine - ENPC/CSTB - Université Paris Est
     (2) Willow (INRIA/ENS/CNRS)
     ThRaSH’2010

  2. Problem
     Tight upper and lower bounds on $f(X_1, \dots, X_n)$, with $X_1, \dots, X_n$ i.i.d. random variables taking their values in some (measurable) space $\mathcal{X}$ and $f : \mathcal{X}^n \to \mathbb{R}$ a function whose value depends on all the variables but not too much on any of them.
     For example:
     $$f(X_1, \dots, X_n) = \frac{X_1 + \cdots + X_n}{n}$$
     or
     $$f(X_1, \dots, X_n) = \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$$

  3. Outline
     • Asymptotic viewpoint
     • Nonasymptotic viewpoint
       – Gaussian approximation
       – Gaussian processes
       – Sum of i.i.d. r.v.
       – Functions with bounded differences
       – Self-bounding functions

  4. The asymptotic viewpoint
     • What is the limit of $f(X_1, \dots, X_n)$?
     • What is the limit of its centered and scaled version
       $$\frac{f(X_1, \dots, X_n) - \mathbb{E} f(X_1, \dots, X_n)}{\sqrt{\operatorname{Var} f(X_1, \dots, X_n)}} \ ?$$

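  5. Convergence of random variables
     • Convergence in distribution: $W_n \xrightarrow[n \to +\infty]{d} W$
       ⇔ for all $t \in \mathbb{R}$ such that $F_W$ is continuous at $t$, $F_{W_n}(t) \xrightarrow[n \to +\infty]{} F_W(t)$
       ⇔ for all $f : \mathbb{R} \to \mathbb{R}$ continuous and bounded, $\mathbb{E} f(W_n) \xrightarrow[n \to +\infty]{} \mathbb{E} f(W)$
       ⇔ for all $t \in \mathbb{R}$, $\mathbb{E} e^{itW_n} \xrightarrow[n \to +\infty]{} \mathbb{E} e^{itW}$ (with $i^2 = -1$)
     • Convergence in probability: $W_n \xrightarrow[n \to +\infty]{P} W$ ⇔ for all $\varepsilon > 0$, $\mathbb{P}(|W_n - W| \ge \varepsilon) \xrightarrow[n \to +\infty]{} 0$
     • Almost sure convergence: $W_n \xrightarrow[n \to +\infty]{a.s.} W$ ⇔ $\mathbb{P}\big(W_n \xrightarrow[n \to +\infty]{} W\big) = 1$
     • Almost sure cvg ⇒ cvg in probability ⇒ cvg in distribution
     • If for all $\varepsilon > 0$, $\sum_{n \ge 1} \mathbb{P}(|W_n - W| > \varepsilon) < +\infty$, then $W_n \xrightarrow[n \to +\infty]{a.s.} W$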

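  6. Convergence of the empirical mean
     $$f(X_1, \dots, X_n) = \frac{X_1 + \cdots + X_n}{n}$$
     • LLN (1713): If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $\mathbb{E}|X| < +\infty$, then
       $$\bar X = \frac{\sum_{i=1}^n X_i}{n} \xrightarrow[n \to +\infty]{a.s.} \mathbb{E} X$$
     • CLT (1733): If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $\mathbb{E} X^2 < +\infty$, then
       $$\sqrt{n}\,(\bar X - \mathbb{E} X) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, \operatorname{Var} X),$$
       or equivalently: for any $t$,
       $$\mathbb{P}\Big(\sqrt{\tfrac{n}{\operatorname{Var} X}}\,(\bar X - \mathbb{E} X) > t\Big) \xrightarrow[n \to +\infty]{} \int_t^{+\infty} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du.$$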
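
The CLT statement above can be checked by simulation. The following is a minimal sketch, not part of the original slides: it assumes Uniform(0, 1) samples (so $\mathbb{E}X = 1/2$ and $\operatorname{Var}X = 1/12$), and the function names, sample sizes, and seed are illustrative choices.

```python
import math
import random

def standardized_mean(n, rng):
    """Return sqrt(n / Var X) * (mean - E X) for n Uniform(0, 1) draws."""
    mean = sum(rng.random() for _ in range(n)) / n
    # Assumption: Uniform(0, 1) data, so E X = 1/2 and Var X = 1/12.
    return math.sqrt(12.0 * n) * (mean - 0.5)

def tail_frequency(t, n=200, reps=4000, seed=0):
    """Monte Carlo estimate of P(sqrt(n / Var X) * (mean - E X) > t)."""
    rng = random.Random(seed)
    hits = sum(standardized_mean(n, rng) > t for _ in range(reps))
    return hits / reps

def gaussian_tail(t):
    """P(N(0, 1) > t), the limit integral in the CLT statement."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))
```

Comparing `tail_frequency(t)` with `gaussian_tail(t)` for a few values of `t` gives a feel for how quickly the Gaussian limit becomes a good approximation.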

  7. Slutsky’s lemma (1925)
     Let $(V_n)$ and $(W_n)$ be two sequences of random vectors or variables.
     If $V_n \xrightarrow[n \to +\infty]{P} v$ and $W_n \xrightarrow[n \to +\infty]{d} W$, then
     1. $V_n + W_n \xrightarrow[n \to +\infty]{d} v + W$
     2. $V_n W_n \xrightarrow[n \to +\infty]{d} vW$
     3. $V_n^{-1} W_n \xrightarrow[n \to +\infty]{d} v^{-1} W$ if $v$ is invertible

  8. An example of a complicated functional: the t-statistic
     Let
     $$f(X_1, \dots, X_n) = \frac{\sqrt{n}\,(\bar X - \mathbb{E} X)}{S_n}, \quad \text{with} \quad S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar X)^2.$$
     Since $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mathbb{E} X)^2 - (\mathbb{E} X - \bar X)^2$, from the LLN, we have
     $$S_n^2 \xrightarrow[n \to +\infty]{a.s.} \operatorname{Var} X.$$
     From the CLT,
     $$\sqrt{n}\,(\bar X - \mathbb{E} X) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, \operatorname{Var} X).$$
     Thus, from Slutsky’s lemma,
     $$f(X_1, \dots, X_n) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, 1).$$
     Appropriate decompositions of complicated functionals make it possible to compute their asymptotic distribution.
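
The limit of the t-statistic can be illustrated numerically. This is a hedged sketch under the assumption of Uniform(0, 1) data (so $\mathbb{E}X = 1/2$); the helper names, sample size, and seed are hypothetical and not from the slides.

```python
import math
import random

def t_statistic(sample, true_mean):
    """sqrt(n) * (mean - E X) / S_n with S_n^2 = (1/n) * sum (X_i - mean)^2."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / n
    return math.sqrt(n) * (mean - true_mean) / math.sqrt(s2)

def t_tail_frequency(t, n=100, reps=4000, seed=1):
    """Monte Carlo estimate of P(f(X_1, ..., X_n) > t) for Uniform(0, 1) data."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.random() for _ in range(n)]  # assumed distribution, E X = 1/2
        hits += t_statistic(sample, 0.5) > t
    return hits / reps
```

For moderate `n`, `t_tail_frequency(t)` should be close to the standard Gaussian tail `0.5 * math.erfc(t / math.sqrt(2))`, as the Slutsky argument predicts.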

  9. Nonasymptotic bounds
     Motivations:
     • When the nonasymptotic regime plays a crucial role (for instance, multi-armed bandit problems, racing algorithms, stopping-time problems)
     • When asymptotic analysis is not achievable through standard arguments
     • To derive asymptotic results!

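  10. The Berry (1941) - Esseen (1942) theorem
     • $X, X_1, \dots, X_n$ i.i.d.
     • $\mathbb{E}|X|^3 < +\infty$ and $\sigma^2 = \operatorname{Var} X$
     • $\bar X = \frac{X_1 + \cdots + X_n}{n}$
     • $Z \sim \mathcal{N}(\mathbb{E} \bar X, \operatorname{Var} \bar X)$
     $$\sup_{x \in \mathbb{R}} \big| \mathbb{P}(\bar X > x) - \mathbb{P}(Z > x) \big| \le \frac{\mathbb{E}|X - \mathbb{E} X|^3}{2 \sigma^3} \cdot \frac{1}{\sqrt{n}}$$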
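
For Bernoulli variables both sides of the Berry-Esseen bound can be computed exactly. The sketch below assumes $X \sim \mathcal{B}(p)$, for which $\mathbb{E}|X - p|^3 = p(1-p)\big((1-p)^2 + p^2\big)$; the function names are illustrative. Because $\mathbb{P}(\bar X > x)$ is a step function and the Gaussian tail is monotone, the supremum only needs to be examined at the jump points $k/n$ (value and left limit).

```python
import math

def binom_upper_tail(n, p, k):
    """P(S >= k) for S ~ Binomial(n, p); an empty sum (k > n) gives 0."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(max(k, 0), n + 1))

def gaussian_tail(x, mean, var):
    """P(Z > x) for Z ~ N(mean, var)."""
    return 0.5 * math.erfc((x - mean) / math.sqrt(2.0 * var))

def sup_gap(n, p):
    """sup_x |P(mean > x) - P(Z > x)|, evaluated at the jump points k/n."""
    gap = 0.0
    for k in range(n + 1):
        g = gaussian_tail(k / n, p, p * (1 - p) / n)
        right = binom_upper_tail(n, p, k + 1)  # P(mean > k/n)
        left = binom_upper_tail(n, p, k)       # left limit of P(mean > x) at k/n
        gap = max(gap, abs(right - g), abs(left - g))
    return gap

def berry_esseen_bound(n, p):
    """Right-hand side E|X - E X|^3 / (2 sigma^3 sqrt(n)) for Bernoulli(p)."""
    third = p * (1 - p) * ((1 - p) ** 2 + p ** 2)
    sigma3 = (p * (1 - p)) ** 1.5
    return third / (2.0 * sigma3 * math.sqrt(n))
```

Calling `sup_gap(n, p)` and `berry_esseen_bound(n, p)` side by side shows how much slack the bound leaves for a given `n` and `p`.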

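  11. Slud’s theorem (1977)
     • $X_1, \dots, X_n$ i.i.d. $\sim \mathcal{B}(p)$ (Bernoulli) with $p \le \frac{1}{2}$
     • $Z \sim \mathcal{N}(\mathbb{E} \bar X, \operatorname{Var} \bar X)$
     • for any $x \in [p, 1 - p]$ such that $nx$ is an integer,
       $$\mathbb{P}(\bar X \ge x) \ge \mathbb{P}(Z \ge x)$$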

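  12. The Paley-Zygmund inequality (1932)
     • $X_1, \dots, X_n$ i.i.d.
     • for any $0 \le \lambda < 1$,
       $$\mathbb{P}\bigg( \bigg| \frac{\sqrt{n}\,(\bar X - \mathbb{E} X)}{\sqrt{\operatorname{Var} X}} \bigg| > \lambda \bigg) \ge (1 - \lambda^2)^2 \min\bigg( \frac{1}{3}, \frac{(\operatorname{Var} X)^2}{\mathbb{E}(X - \mathbb{E} X)^4} \bigg).$$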

  13. Supremum of Gaussian processes (GP)
     • Gaussian process $(W(g))_{g \in \mathcal{G}}$: for any $g_1, \dots, g_d \in \mathcal{G}$, $\big(W(g_1), \dots, W(g_d)\big)$ is a Gaussian random vector
     • GP: a powerful, flexible probabilistic model parametrized by $\mu(g) = \mathbb{E} W(g)$ and $K(g, g') = \operatorname{Cov}\big(W(g), W(g')\big)$
     • Good intuition on GP ⇒ good intuition on $\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$:
       $$\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n} \approx \sup_{g \in \mathcal{G}} W(g)$$
       with $\mu(g) = \mathbb{E} g(X)$ and $K(g, g') = \frac{1}{n} \operatorname{Cov}\big(g(X), g'(X)\big)$.

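  14. The Borell (1975) - Cirel’son et al. (1976) inequality
     • $Z = \sup_{g \in \mathcal{G}} \big( W(g) - \mathbb{E} W(g) \big)$
     • $\sigma^2 = \sup_{g \in \mathcal{G}} \operatorname{Var} W(g) = \sup_{g \in \mathcal{G}} K(g, g)$
     For any $\lambda \in \mathbb{R}$,
     $$\log \mathbb{E} e^{\lambda (Z - \mathbb{E} Z)} \le \frac{\lambda^2 \sigma^2}{2}$$
     For any $t > 0$,
     $$\mathbb{P}(Z - \mathbb{E} Z \ge t) \le e^{-\frac{t^2}{2 \sigma^2}}$$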

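  15. Dudley’s integral (1967)
     • $d(g, g') = \sqrt{\mathbb{E}\,[W(g) - W(g')]^2}$
     • $N(\varepsilon)$ = $\varepsilon$-packing number of $(\mathcal{G}, d)$
     • $\sigma^2 = \sup_{g \in \mathcal{G}} \operatorname{Var} W(g) = \sup_{g \in \mathcal{G}} K(g, g)$
     $$\mathbb{E} \sup_{g \in \mathcal{G}} \big( W(g) - \mathbb{E} W(g) \big) \le 12 \int_0^{\sigma} \sqrt{\log N(\varepsilon)}\, d\varepsilon.$$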

  16. Another Borell (1975) - Cirel’son et al. (1976) inequality
     • $X_1, \dots, X_n$ i.i.d. $\sim \mathcal{N}(0, 1)$
     • $f : \mathbb{R}^n \to \mathbb{R}$ is $L$-Lipschitz for the Euclidean distance: for any $x, x'$ in $\mathbb{R}^n$, $|f(x) - f(x')| \le L \|x - x'\|$
     For any $t > 0$,
     $$\mathbb{P}\big( f(X_1, \dots, X_n) - \mathbb{E} f(X_1, \dots, X_n) \ge t \big) \le e^{-\frac{t^2}{2 L^2}}.$$

  17. Some useful probabilistic inequalities
     • Markov’s inequality: for any r.v. $X$ and $a > 0$, since $|X| \ge a \mathbf{1}_{|X| \ge a}$,
       $$\mathbb{P}(|X| \ge a) \le \frac{1}{a} \mathbb{E}|X|.$$
     • Jensen’s inequality: for any integrable r.v. $X$ and $\varphi : \mathbb{R}^d \to \mathbb{R}$ convex,
       $$\varphi(\mathbb{E} X) \le \mathbb{E} \varphi(X).$$
     • For any r.v. $X$, $\mathbb{E} X \le \int_0^{+\infty} \mathbb{P}(X \ge t)\, dt$ (with equality if $X \ge 0$)
     • Markov’s inequality is the basis of Chernoff’s argument: for all $s > 0$,
       $$\mathbb{P}(X \ge t) = \mathbb{P}\big( e^{sX} \ge e^{st} \big) \le e^{-st}\, \mathbb{E} e^{sX}.$$
     Control of the Laplace transform ⇒ control of the large deviations.

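  18. Hoeffding’s inequality (1963)
     If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $a \le X \le b$, then
     1. For all $s \in \mathbb{R}$,
        $$\mathbb{E} e^{s (X - \mathbb{E} X)} \le e^{\frac{s^2 (b - a)^2}{8}}$$
     2. For any $t \ge 0$,
        $$\mathbb{P}\big( \bar X - \mathbb{E} X \ge t \big) \le e^{-\frac{2 n t^2}{(b - a)^2}},$$
        or equivalently, for any $\varepsilon > 0$,
        $$\mathbb{P}\bigg( \bar X - \mathbb{E} X < (b - a) \sqrt{\frac{\log(\varepsilon^{-1})}{2n}} \bigg) \ge 1 - \varepsilon,$$
        i.e., “w.h.p.” $\bar X - \mathbb{E} X < (b - a) \sqrt{\frac{\log(\varepsilon^{-1})}{2n}}$.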
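
The two equivalent forms of Hoeffding’s inequality translate directly into small helper functions. This is a minimal sketch; the function names and the default range $[a, b] = [0, 1]$ are assumptions made for illustration.

```python
import math

def hoeffding_tail(n, t, a=0.0, b=1.0):
    """Upper bound exp(-2 n t^2 / (b - a)^2) on P(mean - E X >= t)."""
    return math.exp(-2.0 * n * t * t / (b - a) ** 2)

def hoeffding_radius(n, eps, a=0.0, b=1.0):
    """Deviation (b - a) * sqrt(log(1/eps) / (2n)) not exceeded with probability >= 1 - eps."""
    return (b - a) * math.sqrt(math.log(1.0 / eps) / (2.0 * n))

def hoeffding_sample_size(t, eps, a=0.0, b=1.0):
    """Smallest n for which the tail bound at deviation t is at most eps."""
    return math.ceil((b - a) ** 2 * math.log(1.0 / eps) / (2.0 * t * t))
```

Plugging `hoeffding_radius(n, eps)` back into `hoeffding_tail` returns `eps`, which is exactly the equivalence stated on the slide.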

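  19. Log-Laplace upper bound
     1. For all $s \in \mathbb{R}$, $\mathbb{E} e^{s (X - \mathbb{E} X)} \le e^{\frac{s^2 (b - a)^2}{8}}$
     Define
     $$\varphi(s) = \log \mathbb{E} e^{sX}, \qquad P_s(d\omega) = \frac{e^{s X(\omega)}}{\mathbb{E} e^{sX}} \cdot P(d\omega).$$
     Then
     $$\varphi'(s) = \mathbb{E}_{P_s} X, \qquad \varphi''(s) = \operatorname{Var}_{P_s} X,$$
     and
     $$\operatorname{Var}_{P_s} X = \inf_{r \in \mathbb{R}} \mathbb{E}_{P_s} (X - r)^2 \le \mathbb{E}_{P_s} \Big( X - \frac{a + b}{2} \Big)^2 \le \frac{(b - a)^2}{4}.$$
     By Taylor's formula with integral remainder,
     $$\varphi(s) = \varphi(0) + s \varphi'(0) + \int_0^s (s - t)\, \varphi''(t)\, dt$$
     $$\Rightarrow \quad \log \mathbb{E} e^{sX} \le s\, \mathbb{E} X + \int_0^s \frac{(s - t)(b - a)^2}{4}\, dt \le s\, \mathbb{E} X + \frac{(b - a)^2 s^2}{8}$$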
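
The log-Laplace bound can be checked in closed form on a simple case. The sketch below assumes $X \sim \mathcal{B}(p)$ (values in $[0, 1]$), for which $\mathbb{E} e^{s(X - \mathbb{E}X)} = (1 - p) e^{-sp} + p\, e^{s(1-p)}$; the function names are illustrative.

```python
import math

def centered_log_laplace_bernoulli(s, p):
    """log E exp(s (X - E X)) for X ~ Bernoulli(p), computed exactly."""
    return math.log((1 - p) * math.exp(-s * p) + p * math.exp(s * (1 - p)))

def hoeffding_lemma_bound(s, a=0.0, b=1.0):
    """Upper bound s^2 (b - a)^2 / 8 on the centered log-Laplace transform."""
    return s * s * (b - a) ** 2 / 8.0
```

Evaluating both functions over a grid of `s` values shows the bound is nearly tight for `p = 0.5` and small `|s|`, and looser for skewed `p`.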

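  20. Chernoff’s argument
     2. For any $t \ge 0$, $\mathbb{P}\big( \bar X - \mathbb{E} X > t \big) \le e^{-\frac{2 n t^2}{(b - a)^2}}$.
     For any $s > 0$,
     $$\mathbb{P}(\bar X - \mathbb{E} X \ge t) = \mathbb{P}\big( e^{s (\bar X - \mathbb{E} X)} \ge e^{st} \big)$$
     $$\le e^{-st}\, \mathbb{E}\big[ e^{s (\bar X - \mathbb{E} X)} \big]$$
     $$= e^{-st}\, \mathbb{E}\, e^{\frac{s}{n} \sum_{i=1}^n (X_i - \mathbb{E} X)}$$
     $$= e^{-st} \Big[ \mathbb{E}\, e^{\frac{s}{n} (X - \mathbb{E} X)} \Big]^n$$
     $$\le e^{-st + \frac{s^2 (b - a)^2}{8n}}$$
     $$= e^{-\frac{2 n t^2}{(b - a)^2}} \quad \text{by choosing } s = \frac{4 n t}{(b - a)^2}.$$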
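
The last step of the argument, the choice of $s$, can be verified numerically. This is a small sketch with hypothetical function names: it evaluates the exponent $-st + \frac{s^2 (b-a)^2}{8n}$ and the minimizer $s = \frac{4nt}{(b-a)^2}$ stated on the slide.

```python
import math

def chernoff_exponent(s, n, t, a=0.0, b=1.0):
    """Exponent -s t + s^2 (b - a)^2 / (8 n) obtained after applying the log-Laplace bound."""
    return -s * t + s * s * (b - a) ** 2 / (8.0 * n)

def optimal_s(n, t, a=0.0, b=1.0):
    """Minimizer s = 4 n t / (b - a)^2 of the quadratic exponent."""
    return 4.0 * n * t / (b - a) ** 2
```

With `n = 50` and `t = 0.1` on $[0, 1]$, the optimal `s` is 20 and the exponent equals $-2nt^2/(b-a)^2 = -1$; any other `s` gives a larger (weaker) exponent.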

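  21. Union bound
     • $\mathbb{P}(A) \ge 1 - \varepsilon$ and $\mathbb{P}(B) \ge 1 - \varepsilon$ ⇒ $\mathbb{P}(A \cap B) \ge 1 - 2\varepsilon$
       (since $\mathbb{P}(A^c \cup B^c) \le \mathbb{P}(A^c) + \mathbb{P}(B^c)$)
       For instance: Hoeffding applied to $X$ + Hoeffding applied to $-X$ + union bound
       ⇒ with proba $\ge 1 - \varepsilon$,
       $$|\bar X - \mathbb{E} X| < (b - a) \sqrt{\frac{\log(2 \varepsilon^{-1})}{2n}}$$
       (leads to pessimistic but correct confidence intervals, unlike the CLT)
     • If $\mathbb{P}(A_1) \ge 1 - \varepsilon, \dots, \mathbb{P}(A_m) \ge 1 - \varepsilon$, then
       $$\mathbb{P}\big( A_1 \cap \cdots \cap A_m \big) \ge 1 - m\varepsilon$$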
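
The two-sided confidence interval obtained from the union bound can be tested empirically. The sketch below assumes Uniform(0, 1) data (true mean 1/2); the function names, sample sizes, and seed are illustrative choices.

```python
import math
import random

def two_sided_radius(n, eps, a=0.0, b=1.0):
    """Half-width (b - a) * sqrt(log(2/eps) / (2n)) of the two-sided Hoeffding interval."""
    return (b - a) * math.sqrt(math.log(2.0 / eps) / (2.0 * n))

def coverage(n=100, eps=0.1, reps=2000, seed=2):
    """Fraction of simulated samples whose interval contains the true mean 1/2."""
    rng = random.Random(seed)
    radius = two_sided_radius(n, eps)
    inside = 0
    for _ in range(reps):
        mean = sum(rng.random() for _ in range(n)) / n  # assumed Uniform(0, 1) data
        inside += abs(mean - 0.5) < radius
    return inside / reps
```

The observed coverage is typically far above $1 - \varepsilon$, which illustrates the "pessimistic but correct" remark on the slide.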

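  22. Bernstein’s (1946) inequality
     Hoeffding’s inequality vs CLT: with $Z \sim \mathcal{N}(0, 1)$,
     $$e^{-\frac{2 \alpha^2 \operatorname{Var} X}{(b - a)^2}} \ge \mathbb{P}\Big( \sqrt{\tfrac{n}{\operatorname{Var} X}}\,(\bar X - \mathbb{E} X) > \alpha \Big) \xrightarrow[n \to +\infty]{} \mathbb{P}(Z > \alpha) \approx \frac{e^{-\alpha^2/2}}{\alpha \sqrt{2\pi}} \quad \text{(for large } \alpha\text{)}$$
     ⇒ Hoeffding’s inequality is imprecise for r.v. having low variance
     Bernstein’s inequality: If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $X - \mathbb{E} X \le c$, then
     • for any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
       $$\bar X \le \mathbb{E} X + \sqrt{\frac{2 \log(\varepsilon^{-1}) \operatorname{Var} X}{n}} + \frac{c \log(\varepsilon^{-1})}{3n}$$
     • for any $t \ge 0$,
       $$\mathbb{P}\big( \bar X - \mathbb{E} X > t \big) \le e^{-\frac{n t^2}{2 \operatorname{Var} X + 2ct/3}}$$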
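
The gain from Bernstein’s inequality on low-variance variables is easy to see numerically. This sketch uses hypothetical helper names and compares the two deviation radii; as an illustrative assumption, take $X \sim \mathcal{B}(0.01)$, so that $\operatorname{Var} X = 0.0099$ and $c = 1$ is a valid bound on $X - \mathbb{E}X$.

```python
import math

def hoeffding_radius(n, eps, a, b):
    """One-sided Hoeffding deviation (b - a) * sqrt(log(1/eps) / (2n))."""
    return (b - a) * math.sqrt(math.log(1.0 / eps) / (2.0 * n))

def bernstein_radius(n, eps, var, c):
    """One-sided Bernstein deviation sqrt(2 log(1/eps) Var X / n) + c log(1/eps) / (3n)."""
    log_term = math.log(1.0 / eps)
    return math.sqrt(2.0 * log_term * var / n) + c * log_term / (3.0 * n)

def bernstein_tail(n, t, var, c):
    """Upper bound exp(-n t^2 / (2 Var X + 2 c t / 3)) on P(mean - E X > t)."""
    return math.exp(-n * t * t / (2.0 * var + 2.0 * c * t / 3.0))
```

With `n = 1000` and `eps = 0.05`, `bernstein_radius(1000, 0.05, 0.0099, 1.0)` is several times smaller than `hoeffding_radius(1000, 0.05, 0.0, 1.0)`, because the leading term scales with the standard deviation rather than with the range.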

  23. Empirical Bernstein’s inequality (A., Munos, Szepesvári, 2007; Maurer, Pontil, 2009)
     If $X, X_1, X_2, \dots$ are i.i.d. r.v. with $a \le X \le b$, then for any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
     $$\mathbb{E} X \le \bar X + \sqrt{\frac{2 \log(\varepsilon^{-1})\, \hat\sigma^2}{n}} + \frac{7 (b - a) \log(\varepsilon^{-1})}{3n}$$
     with
     $$\hat\sigma^2 = \frac{\sum_{i=1}^n (X_i - \bar X)^2}{n - 1},$$
     to be compared with
     $$\mathbb{E} X \le \bar X + \sqrt{\frac{2 \log(\varepsilon^{-1}) \operatorname{Var} X}{n}} + \frac{(b - a) \log(\varepsilon^{-1})}{3n}.$$
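
Unlike Bernstein’s inequality, the empirical version can be computed from the sample alone. The following is a minimal sketch of the bound in the form stated on this slide (constants as read from the slide, not taken from the cited papers); the function names, the Uniform(0, 1) test data, and the seed are assumptions.

```python
import math
import random
import statistics

def empirical_bernstein_upper(sample, eps, a=0.0, b=1.0):
    """Upper confidence bound on E X computed from the sample only."""
    n = len(sample)
    mean = statistics.fmean(sample)
    var_hat = statistics.variance(sample)  # unbiased estimator, divides by n - 1
    log_term = math.log(1.0 / eps)
    return (mean
            + math.sqrt(2.0 * log_term * var_hat / n)
            + 7.0 * (b - a) * log_term / (3.0 * n))

def uniform_sample(n, seed):
    """Hypothetical test data: n i.i.d. Uniform(0, 1) draws, true mean 1/2."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]
```

For example, `empirical_bernstein_upper(uniform_sample(500, 3), 0.05)` gives an upper bound on the true mean 1/2 that holds with probability at least 0.95 over the draw of the sample.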

  24. Hoeffding-Azuma inequalities (McDiarmid’s version, 1989)
     If for some $c \ge 0$,
     $$\sup_{\substack{i \in \{1, \dots, n\} \\ (x_1, \dots, x_n) \in \mathcal{X}^n \\ x \in \mathcal{X}}} \big[ f(x_1, \dots, x_n) - f(x_1, \dots, x_{i-1}, x, x_{i+1}, \dots, x_n) \big] \le c,$$
     then, for any $\lambda \in \mathbb{R}$, $W = f(X_1, \dots, X_n)$ satisfies
     $$\mathbb{E} e^{\lambda (W - \mathbb{E} W)} \le e^{\frac{n \lambda^2 c^2}{8}}$$
     and for any $t \ge 0$,
     $$\mathbb{P}\big( W - \mathbb{E} W > t \big) \le e^{-\frac{2 t^2}{n c^2}}$$
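
As a sanity check, the bounded-differences bound recovers Hoeffding’s inequality for the empirical mean: changing one coordinate of $\frac{x_1 + \cdots + x_n}{n}$ with $x_i \in [a, b]$ moves it by at most $c = \frac{b - a}{n}$. A minimal sketch, with an illustrative function name:

```python
import math

def mcdiarmid_tail(n, c, t):
    """Upper bound exp(-2 t^2 / (n c^2)) on P(W - E W > t) under bounded differences c."""
    return math.exp(-2.0 * t * t / (n * c * c))
```

With `c = (b - a) / n`, the exponent becomes $-2nt^2/(b-a)^2$, the Hoeffding exponent. For instance, `mcdiarmid_tail(100, 1.0 / 100, 0.1)` equals $e^{-2}$.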
