SLIDE 1

Concentration inequalities

Jean-Yves Audibert 1,2

  • 1. Imagine - ENPC/CSTB - université Paris Est
  • 2. Willow (INRIA/ENS/CNRS)

ThRaSH'2010

SLIDE 2

Problem

Tight upper and lower bounds on $f(X_1, \ldots, X_n)$ with $X_1, \ldots, X_n$ i.i.d. random variables taking their values in some (measurable) space $\mathcal{X}$ and $f : \mathcal{X}^n \to \mathbb{R}$ a function whose value depends on all the variables but not too much on any of them. For example:

$$f(X_1, \ldots, X_n) = \frac{X_1 + \cdots + X_n}{n}$$

or

$$f(X_1, \ldots, X_n) = \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$$

SLIDE 3

Outline

  • Asymptotic viewpoint
  • Nonasymptotic viewpoint
    – Gaussian approximation
    – Gaussian processes
    – Sum of i.i.d. r.v.
    – Functions with bounded differences
    – Self-bounding functions

SLIDE 4

The asymptotic viewpoint

  • What is the limit of $f(X_1, \ldots, X_n)$?
  • What is the limit of its centered and scaled version:
$$\frac{f(X_1, \ldots, X_n) - \mathbb{E} f(X_1, \ldots, X_n)}{\sqrt{\operatorname{Var} f(X_1, \ldots, X_n)}} \;?$$

SLIDE 5

Convergence of random variables

  • Convergence in distribution: $W_n \xrightarrow[n \to +\infty]{d} W$
    ⇔ $\forall t \in \mathbb{R}$ s.t. $F_W$ cont. at $t$, $F_{W_n}(t) \xrightarrow[n \to +\infty]{} F_W(t)$
    ⇔ $\forall f : \mathbb{R} \to \mathbb{R}$ cont. and bounded, $\mathbb{E} f(W_n) \xrightarrow[n \to +\infty]{} \mathbb{E} f(W)$
    ⇔ $\forall t \in \mathbb{R}$, $\mathbb{E} e^{itW_n} \xrightarrow[n \to +\infty]{} \mathbb{E} e^{itW}$ (with $i^2 = -1$)
  • Convergence in probability: $W_n \xrightarrow[n \to +\infty]{P} W$ ⇔ $\forall \varepsilon > 0$, $\mathbb{P}(|W_n - W| \ge \varepsilon) \xrightarrow[n \to +\infty]{} 0$
  • Almost sure convergence: $W_n \xrightarrow[n \to +\infty]{a.s.} W$ ⇔ $\mathbb{P}\left(W_n \xrightarrow[n \to +\infty]{} W\right) = 1$

Almost sure cvg ⇒ cvg in probability ⇒ cvg in distribution

  • If $\forall \varepsilon > 0$, $\sum_{n \ge 1} \mathbb{P}(|W_n - W| > \varepsilon) < +\infty$, then $W_n \xrightarrow[n \to +\infty]{a.s.} W$

SLIDE 6

Convergence of the empirical mean $f(X_1, \ldots, X_n) = \frac{X_1 + \cdots + X_n}{n}$

  • LLN (1713): If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $\mathbb{E}|X| < +\infty$, then
$$\bar{X} = \frac{\sum_{i=1}^n X_i}{n} \xrightarrow[n \to +\infty]{a.s.} \mathbb{E}X$$
  • CLT (1733): If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $\mathbb{E}X^2 < +\infty$, then
$$\sqrt{n}\,(\bar{X} - \mathbb{E}X) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, \operatorname{Var} X),$$
or equivalently: for any $t$,
$$\mathbb{P}\left( \sqrt{\tfrac{n}{\operatorname{Var} X}}\, (\bar{X} - \mathbb{E}X) > t \right) \xrightarrow[n \to +\infty]{} \int_t^{+\infty} \frac{e^{-u^2/2}}{\sqrt{2\pi}}\, du.$$
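
A minimal Monte Carlo sketch in Python (not part of the slides; the Exponential(1) choice, for which EX = Var X = 1, is arbitrary) illustrating the LLN and the CLT tail statement above:

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 500, 10000

samples = rng.exponential(scale=1.0, size=(n_rep, n))
x_bar = samples.mean(axis=1)                        # one empirical mean per repetition

# LLN: the empirical mean concentrates around EX = 1
print("average of X_bar over repetitions:", x_bar.mean())

# CLT: sqrt(n / Var X) * (X_bar - EX) has approximately standard Gaussian tails
z = math.sqrt(n) * (x_bar - 1.0)
for t in (0.5, 1.0, 2.0):
    empirical = (z > t).mean()
    gaussian = 0.5 * math.erfc(t / math.sqrt(2.0))  # integral of exp(-u^2/2)/sqrt(2 pi) from t to infinity
    print(f"t={t}: empirical tail {empirical:.4f} vs Gaussian tail {gaussian:.4f}")
```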

SLIDE 7

Slutsky's lemma (1925)

Let $(V_n)$ and $(W_n)$ be two sequences of random vectors or variables. If $V_n \xrightarrow[n \to +\infty]{P} v$ and $W_n \xrightarrow[n \to +\infty]{d} W$, then

  • 1. $V_n + W_n \xrightarrow[n \to +\infty]{d} v + W$
  • 2. $V_n W_n \xrightarrow[n \to +\infty]{d} v W$
  • 3. $V_n^{-1} W_n \xrightarrow[n \to +\infty]{d} v^{-1} W$ if $v$ invertible

SLIDE 8

An example of complicated functional: the t-statistics

Let $f(X_1, \ldots, X_n) = \frac{\sqrt{n}(\bar{X} - \mathbb{E}X)}{S_n}$, with $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2$.

Since $S_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \mathbb{E}X)^2 - (\mathbb{E}X - \bar{X})^2$, from the LLN, we have $S_n^2 \xrightarrow[n \to +\infty]{a.s.} \operatorname{Var} X$. From the CLT, $\sqrt{n}(\bar{X} - \mathbb{E}X) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, \operatorname{Var} X)$.

Thus, from Slutsky's lemma, $f(X_1, \ldots, X_n) \xrightarrow[n \to +\infty]{d} \mathcal{N}(0, 1)$.

Appropriate decompositions of complicated functionals allow one to compute their asymptotic distribution.
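
A small Python sketch (the Exponential(1) data below is an arbitrary choice) checking that the self-normalized statistic above is approximately N(0, 1), as the LLN + CLT + Slutsky argument predicts:

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n, n_rep = 1000, 5000

x = rng.exponential(1.0, size=(n_rep, n))
x_bar = x.mean(axis=1)
s_n = x.std(axis=1)                                  # sqrt of (1/n) sum (X_i - X_bar)^2
t_stat = math.sqrt(n) * (x_bar - 1.0) / s_n          # EX = 1 for Exponential(1)

# Empirical quantiles should be close to the standard normal ones (1.2816, 1.6449, 2.3263)
for q in (0.90, 0.95, 0.99):
    print(q, round(float(np.quantile(t_stat, q)), 3))
```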

SLIDE 9

Nonasymptotic bounds

Motivations:

  • When the nonasymptotic regime plays a crucial role (for instance, multi-armed bandit problems, racing algorithms, stopping time problems)
  • When asymptotic analysis is not achievable through standard arguments
  • To derive asymptotic results!

SLIDE 10

The Berry (1941) - Esseen (1942) theorem

  • $X, X_1, \ldots, X_n$ i.i.d.
  • $\mathbb{E}|X|^3 < +\infty$ and $\sigma^2 = \operatorname{Var} X$
  • $\bar{X} = \frac{X_1 + \cdots + X_n}{n}$
  • $Z \sim \mathcal{N}(\mathbb{E}\bar{X}, \operatorname{Var} \bar{X})$

$$\sup_{x \in \mathbb{R}} \left| \mathbb{P}(\bar{X} > x) - \mathbb{P}(Z > x) \right| \le \frac{\mathbb{E}|X - \mathbb{E}X|^3}{2\sigma^3} \frac{1}{\sqrt{n}}$$
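
A hedged numerical check of the statement, using SciPy and Bernoulli(p) variables (an arbitrary choice for which the distribution of X̄ is known exactly); the supremum is approximated over a fine grid of x values:

```python
import numpy as np
from scipy.stats import binom, norm

p, n = 0.3, 200
sigma = np.sqrt(p * (1 - p))
rho = p * (1 - p) * (p**2 + (1 - p)**2)              # E|X - EX|^3 for a Bernoulli(p)

xs = np.linspace(-0.1, 1.1, 20001)
tail_exact = binom.sf(np.floor(n * xs), n, p)        # P(X_bar > x) = P(Bin(n, p) > n x)
tail_gauss = norm.sf(xs, loc=p, scale=sigma / np.sqrt(n))
distance = np.max(np.abs(tail_exact - tail_gauss))

bound = rho / (2 * sigma**3 * np.sqrt(n))
print(f"observed sup-distance ~ {distance:.4f}, Berry-Esseen bound {bound:.4f}")
```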

SLIDE 11

Slud's theorem (1977)

  • $X_1, \ldots, X_n$ i.i.d. $\sim \mathcal{B}(p)$ with $p \le \frac{1}{2}$
  • $Z \sim \mathcal{N}(\mathbb{E}\bar{X}, \operatorname{Var} \bar{X})$
  • for any $x \in [p, 1 - p]$,
$$\mathbb{P}(\bar{X} > x) \ge \mathbb{P}(Z > x)$$

SLIDE 12

The Paley-Zygmund inequality (1932)

  • $X_1, \ldots, X_n$ i.i.d.
  • for any $0 \le \lambda < 1$,
$$\mathbb{P}\left( \frac{\sqrt{n}(\bar{X} - \mathbb{E}X)}{\sqrt{\operatorname{Var} X}} > \lambda \right) \ge (1 - \lambda^2)^2 \min\left\{ \frac{1}{3}, \frac{(\operatorname{Var} X)^2}{\mathbb{E}(X - \mathbb{E}X)^4} \right\}.$$
SLIDE 13

Supremum of Gaussian processes (GP)

  • Gaussian process $(W(g))_{g \in \mathcal{G}}$: for any $g_1, \ldots, g_d \in \mathcal{G}$, $(W(g_1), \ldots, W(g_d))$ is a Gaussian random vector
  • GP: a powerful, flexible probabilistic model parametrized by $\mu(g) = \mathbb{E}W(g)$ and $K(g, g') = \operatorname{Cov}(W(g), W(g'))$
  • Good intuition on GP ⇒ good intuition on $\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$:
$$\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n} \approx \sup_{g \in \mathcal{G}} W(g) \quad \text{with } \mu(g) = \mathbb{E}g(X) \text{ and } K(g, g') = \frac{1}{n}\operatorname{Cov}\left(g(X), g'(X)\right).$$
SLIDE 14

The Borell (1975) - Cirel'son et al. (1976) inequality

  • $Z = \sup_{g \in \mathcal{G}} \left( W(g) - \mathbb{E}W(g) \right)$
  • $\sigma^2 = \sup_{g \in \mathcal{G}} \operatorname{Var} W(g) = \sup_{g \in \mathcal{G}} K(g, g)$

For any $\lambda \in \mathbb{R}$, $\log \mathbb{E} e^{\lambda(Z - \mathbb{E}Z)} \le \frac{\lambda^2 \sigma^2}{2}$

For any $t > 0$, $\mathbb{P}(Z - \mathbb{E}Z \ge t) \le e^{-\frac{t^2}{2\sigma^2}}$
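
A Monte Carlo sketch over a finite index set (the covariance matrix below is an arbitrary positive semi-definite choice), checking the tail bound for Z = max of a centered Gaussian vector; EZ is itself estimated from the simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_rep = 50, 100000

A = rng.standard_normal((d, d)) / np.sqrt(d)
K = A @ A.T                                          # arbitrary covariance matrix
sigma2 = np.max(np.diag(K))                          # sup_g Var W(g) = sup_g K(g, g)

W = rng.multivariate_normal(np.zeros(d), K, size=n_rep)
Z = W.max(axis=1)
EZ = Z.mean()                                        # Monte Carlo estimate of E Z

for t in (0.5, 1.0, 1.5):
    empirical = (Z - EZ >= t).mean()
    bound = np.exp(-t**2 / (2 * sigma2))
    print(f"t={t}: P(Z - EZ >= t) ~ {empirical:.4f} <= bound {bound:.4f}")
```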

SLIDE 15

Dudley's integral (1967)

  • $d(g, g') = \sqrt{\mathbb{E}[W(g) - W(g')]^2}$
  • $N(\varepsilon)$ = $\varepsilon$-packing number of $(\mathcal{G}, d)$
  • $\sigma^2 = \sup_{g \in \mathcal{G}} \operatorname{Var} W(g) = \sup_{g \in \mathcal{G}} K(g, g)$

$$\mathbb{E} \sup_{g \in \mathcal{G}} \left( W(g) - \mathbb{E}W(g) \right) \le 12 \int_0^\sigma \sqrt{\log N(\varepsilon)}\, d\varepsilon$$
SLIDE 16

Another Borell (1975) - Cirel'son et al. (1976) inequality

  • $X_1, \ldots, X_n$ i.i.d. $\sim \mathcal{N}(0, 1)$
  • $f : \mathbb{R}^n \to \mathbb{R}$ $L$-Lipschitz for the Euclidean distance:
    for any $x, x'$ in $\mathbb{R}^n$, $|f(x) - f(x')| \le L \|x - x'\|$

For any $t > 0$, $\mathbb{P}\left( f(X_1, \ldots, X_n) - \mathbb{E}f(X_1, \ldots, X_n) \ge t \right) \le e^{-\frac{t^2}{2L^2}}$.
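
A quick sketch with f(x) = the Euclidean norm, which is 1-Lipschitz (this choice of f is only for illustration), checking the Gaussian concentration bound above:

```python
import numpy as np

rng = np.random.default_rng(3)
n, n_rep = 50, 100000

norms = np.linalg.norm(rng.standard_normal((n_rep, n)), axis=1)  # f(X_1, ..., X_n) = ||X||
E_norm = norms.mean()                                            # Monte Carlo estimate of E f
for t in (0.5, 1.0, 2.0):
    empirical = (norms - E_norm >= t).mean()
    print(f"t={t}: empirical {empirical:.5f} <= bound {np.exp(-t**2 / 2):.5f}")  # L = 1
```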

SLIDE 17

Some useful probabilistic inequalities

  • Markov's inequality: for any r.v. $X$ and $a > 0$, since $|X| \ge a \mathbf{1}_{|X| \ge a}$,
$$\mathbb{P}(|X| \ge a) \le \frac{1}{a} \mathbb{E}|X|.$$
  • Jensen's ineq.: for any integrable r.v. $X$ and $\varphi : \mathbb{R}^d \to \mathbb{R}$ convex,
$$\varphi(\mathbb{E}X) \le \mathbb{E}\varphi(X).$$
  • For any r.v. $X$,
$$\mathbb{E}X \le \int_0^{+\infty} \mathbb{P}(X \ge t)\, dt \quad \text{(with equality if } X \ge 0\text{)}$$
  • Markov's inequality is at the basis of Chernoff's argument: $\forall s > 0$,
$$\mathbb{P}(X \ge t) = \mathbb{P}\left( e^{sX} \ge e^{st} \right) \le e^{-st} \mathbb{E}e^{sX}.$$
Control of the Laplace transform ⇒ control of the large deviations.

SLIDE 18

Hoeffding's inequality (1963)

If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $a \le X \le b$, then

  • 1. $\forall s \in \mathbb{R}$,
$$\mathbb{E}e^{s(X - \mathbb{E}X)} \le e^{\frac{s^2(b-a)^2}{8}}$$
  • 2. For any $t \ge 0$,
$$\mathbb{P}\left( \bar{X} - \mathbb{E}X \ge t \right) \le e^{-\frac{2nt^2}{(b-a)^2}},$$
or equivalently, for any $\varepsilon > 0$,
$$\mathbb{P}\left( \bar{X} - \mathbb{E}X < (b-a)\sqrt{\frac{\log(\varepsilon^{-1})}{2n}} \right) \ge 1 - \varepsilon,$$
i.e., "w.h.p." $\bar{X} - \mathbb{E}X < (b-a)\sqrt{\frac{\log(\varepsilon^{-1})}{2n}}$.
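
A minimal sketch (Bernoulli(0.5) data, an arbitrary choice with b - a = 1) comparing the empirical tail probability of X̄ - EX with Hoeffding's bound:

```python
import numpy as np

rng = np.random.default_rng(4)
n, n_rep = 100, 200000

x_bar = rng.binomial(n, 0.5, size=n_rep) / n         # empirical means of Bernoulli(0.5) samples
for t in (0.05, 0.10, 0.15):
    empirical = (x_bar - 0.5 >= t).mean()
    hoeffding = np.exp(-2 * n * t**2)                # (b - a) = 1
    print(f"t={t}: empirical {empirical:.4f} <= Hoeffding {hoeffding:.4f}")
```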

SLIDE 19

Log-Laplace upper bound

  • 1. $\forall s \in \mathbb{R}$, $\mathbb{E}e^{s(X - \mathbb{E}X)} \le e^{\frac{s^2(b-a)^2}{8}}$

$$\varphi(s) = \log \mathbb{E}e^{sX} \qquad \mathbb{P}_s(d\omega) = \frac{e^{sX(\omega)}}{\mathbb{E}e^{sX}} \cdot \mathbb{P}(d\omega)$$
$$\varphi'(s) = \mathbb{E}_{\mathbb{P}_s} X \qquad \varphi''(s) = \operatorname{Var}_{\mathbb{P}_s} X$$
$$\operatorname{Var}_{\mathbb{P}_s} X = \inf_{r \in \mathbb{R}} \mathbb{E}_{\mathbb{P}_s} (X - r)^2 \le \mathbb{E}_{\mathbb{P}_s} \left( X - \tfrac{a+b}{2} \right)^2 \le \tfrac{(b-a)^2}{4}.$$
$$\varphi(s) = \varphi(0) + s\varphi'(0) + \int_0^s (s - t)\varphi''(t)\, dt$$
$$\Rightarrow \quad \log \mathbb{E}e^{sX} \le s\mathbb{E}X + \int_0^s (s - t)\frac{(b-a)^2}{4}\, dt \le s\mathbb{E}X + \frac{(b-a)^2 s^2}{8}$$

SLIDE 20

Chernoff's Argument

  • 2. For any $t \ge 0$, $\mathbb{P}\left( \bar{X} - \mathbb{E}X > t \right) \le e^{-\frac{2nt^2}{(b-a)^2}}$.

$$\mathbb{P}(\bar{X} - \mathbb{E}X \ge t) = \mathbb{P}\left( e^{s(\bar{X} - \mathbb{E}X)} \ge e^{st} \right) \le e^{-st} \mathbb{E}\left[e^{s(\bar{X} - \mathbb{E}X)}\right] = e^{-st} \mathbb{E}\left[ e^{\frac{s \sum_{i=1}^n (X_i - \mathbb{E}X)}{n}} \right]$$
$$= e^{-st} \left( \mathbb{E}\left[ e^{\frac{s(X - \mathbb{E}X)}{n}} \right] \right)^n \le e^{-st + \frac{s^2}{n} \frac{(b-a)^2}{8}} = e^{-\frac{2nt^2}{(b-a)^2}}$$
by choosing $s = \frac{4nt}{(b-a)^2}$.

SLIDE 21

Union bound

  • $\mathbb{P}(A) \ge 1 - \varepsilon$ and $\mathbb{P}(B) \ge 1 - \varepsilon$ ⇒ $\mathbb{P}(A \cap B) \ge 1 - 2\varepsilon$
    (since $\mathbb{P}(A^c \cup B^c) \le \mathbb{P}(A^c) + \mathbb{P}(B^c)$)

For instance: Hoeffding applied to $X$ + Hoeffding applied to $-X$ + union bound ⇒ with proba $\ge 1 - \varepsilon$,
$$|\bar{X} - \mathbb{E}X| < (b-a)\sqrt{\frac{\log(2\varepsilon^{-1})}{2n}}$$
(leads to pessimistic but correct confidence intervals, unlike the CLT)

  • If $\mathbb{P}(A_1) \ge 1 - \varepsilon, \ldots, \mathbb{P}(A_m) \ge 1 - \varepsilon$, then $\mathbb{P}\left( A_1 \cap \cdots \cap A_m \right) \ge 1 - m\varepsilon$
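
A small sketch of the resulting two-sided confidence interval; the Bernoulli(0.3) data is an arbitrary choice used to check the coverage empirically:

```python
import numpy as np

def hoeffding_halfwidth(n, eps, a=0.0, b=1.0):
    """Half-width of the two-sided (1 - eps) Hoeffding confidence interval."""
    return (b - a) * np.sqrt(np.log(2.0 / eps) / (2.0 * n))

rng = np.random.default_rng(5)
n, n_rep, eps, p = 200, 50000, 0.05, 0.3

x_bar = rng.binomial(n, p, size=n_rep) / n
covered = (np.abs(x_bar - p) < hoeffding_halfwidth(n, eps)).mean()
print(f"half-width {hoeffding_halfwidth(n, eps):.4f}, coverage {covered:.4f} >= {1 - eps}")
```
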
SLIDE 22

Bernstein's (1946) inequality

Hoeffding's inequality vs CLT:
$$e^{-\frac{2\alpha^2 \operatorname{Var} X}{(b-a)^2}} \ge \mathbb{P}\left( \sqrt{\tfrac{n}{\operatorname{Var} X}}\, (\bar{X} - \mathbb{E}X) > \alpha \right) \xrightarrow[n \to +\infty]{} \mathbb{P}(Z > \alpha) \approx \frac{e^{-\alpha^2/2}}{\alpha\sqrt{2\pi}}$$
⇒ Hoeffding's inequality is imprecise for r.v. having low variance.

Bernstein's inequality: If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $X - \mathbb{E}X \le c$, then

  • for any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$\bar{X} \le \mathbb{E}X + \sqrt{\frac{2 \log(\varepsilon^{-1}) \operatorname{Var} X}{n}} + \frac{c \log(\varepsilon^{-1})}{3n}$$
  • for any $t \ge 0$,
$$\mathbb{P}\left( \bar{X} - \mathbb{E}X > t \right) \le e^{-\frac{nt^2}{2 \operatorname{Var} X + 2ct/3}}$$
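
A hedged comparison (Bernoulli(0.01) data, i.e. a low-variance r.v. in [0, 1], chosen only for illustration) of the Hoeffding and Bernstein deviation terms at confidence level 1 - eps:

```python
import numpy as np

n, eps, p = 1000, 0.01, 0.01                     # EX = p, Var X = p(1 - p), b - a = 1, X - EX <= c = 1 - p
var, c = p * (1 - p), 1 - p

hoeffding = np.sqrt(np.log(1 / eps) / (2 * n))
bernstein = np.sqrt(2 * np.log(1 / eps) * var / n) + c * np.log(1 / eps) / (3 * n)
print(f"Hoeffding deviation: {hoeffding:.5f}")   # ~ 0.048
print(f"Bernstein deviation: {bernstein:.5f}")   # ~ 0.011, much tighter when Var X is small
```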

SLIDE 23

Empirical Bernstein's inequality (A., Munos, Szepesvári, 2007; Maurer, Pontil, 2009)

If $X, X_1, X_2, \ldots$ are i.i.d. r.v. with $a \le X \le b$, then for any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$\mathbb{E}X \le \bar{X} + \sqrt{\frac{2 \log(\varepsilon^{-1})\, \hat{\sigma}^2}{n}} + \frac{7(b-a) \log(\varepsilon^{-1})}{3n} \quad \text{with } \hat{\sigma}^2 = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n - 1}.$$
  • to be compared with
$$\mathbb{E}X \le \bar{X} + \sqrt{\frac{2 \log(\varepsilon^{-1}) \operatorname{Var} X}{n}} + \frac{(b-a) \log(\varepsilon^{-1})}{3n}$$
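
A minimal sketch of the empirical Bernstein upper confidence bound stated above; the Beta(2, 8) sample is an arbitrary choice of a r.v. bounded in [0, 1]:

```python
import numpy as np

def empirical_bernstein_ucb(x, eps, a=0.0, b=1.0):
    """Upper bound on EX holding with probability at least 1 - eps, from a sample x in [a, b]."""
    n = len(x)
    sigma2_hat = x.var(ddof=1)                   # sum (X_i - X_bar)^2 / (n - 1)
    return (x.mean()
            + np.sqrt(2 * np.log(1 / eps) * sigma2_hat / n)
            + 7 * (b - a) * np.log(1 / eps) / (3 * n))

rng = np.random.default_rng(6)
x = rng.beta(2, 8, size=500)                     # true mean 0.2
print("empirical Bernstein upper confidence bound:", empirical_bernstein_ucb(x, eps=0.05))
```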

SLIDE 24

Hoeffding-Azuma inequalities (McDiarmid's version, 1989)

If for some $c \ge 0$,
$$\sup_{\substack{i \in \{1, \ldots, n\} \\ (x_1, \ldots, x_n) \in \mathcal{X}^n \\ x \in \mathcal{X}}} \left| f(x_1, \ldots, x_n) - f(x_1, \ldots, x_{i-1}, x, x_{i+1}, \ldots, x_n) \right| \le c,$$
then, for any $\lambda \in \mathbb{R}$, $W = f(X_1, \ldots, X_n)$ satisfies
$$\mathbb{E}e^{\lambda(W - \mathbb{E}W)} \le e^{\frac{n\lambda^2 c^2}{8}}$$
and for any $t \ge 0$,
$$\mathbb{P}\left( W - \mathbb{E}W > t \right) \le e^{-\frac{2t^2}{nc^2}}$$

SLIDE 25

First example: Hoeffding's inequality in Hilbert space

  • $X_1, \ldots, X_n$ i.i.d. r.v. taking values in a separable Hilbert space
  • $\mathbb{E}X = 0$ and $\|X\| \le 1$

For any $t \ge 4\sqrt{n}$,
$$\mathbb{P}\left( \|X_1 + \cdots + X_n\| \ge t \right) \le e^{-\frac{t^2}{8n}}.$$

SLIDE 26

Second example: supremum of empirical process

$W = f(X_1, \ldots, X_n) = \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$, with $\mathcal{G}$ finite

  • Assumptions: $\forall g \in \mathcal{G}$, $g$ takes its values in $[-1, 1]$ and $\mathbb{E}g(X_1) = 0$
  • $\sup_{\substack{i \in \{1, \ldots, n\} \\ (x_1, \ldots, x_n) \in \mathcal{X}^n \\ x \in \mathcal{X}}} \left| f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x, x_{i+1}^n) \right| \le \frac{2}{n}$,
  • McDiarmid's inequality ⇒ $\mathbb{P}\left( W - \mathbb{E}W > t \right) \le e^{-nt^2/2}$

⇒ with proba $\ge 1 - \varepsilon$,
$$\sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n} \le \mathbb{E} \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n} + \sqrt{\frac{2 \log(\varepsilon^{-1})}{n}}$$
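
A hedged Monte Carlo sketch of this finite-class example. The assumed setup (not from the slides): X uniform on [0, 1] and G the finite class g_s(x) = 1{x <= s} - s over a grid of thresholds s, so each g takes values in [-1, 1] and E g(X) = 0; E W is estimated from the simulation itself:

```python
import numpy as np

rng = np.random.default_rng(7)
n, n_rep, eps = 500, 5000, 0.05
thresholds = np.linspace(0.05, 0.95, 19)

def sup_empirical_process(x):
    # (1/n) sum_i g_s(X_i) for every threshold s, then take the supremum over s
    vals = (x[:, None] <= thresholds).mean(axis=0) - thresholds
    return vals.max()

W = np.array([sup_empirical_process(rng.uniform(size=n)) for _ in range(n_rep)])
EW = W.mean()
deviation = np.sqrt(2 * np.log(1 / eps) / n)
coverage = (W - EW <= deviation).mean()
print(f"E W ~ {EW:.4f}, P(W - EW <= sqrt(2 log(1/eps)/n)) ~ {coverage:.4f} >= {1 - eps}")
```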

SLIDE 27

Third example: kernel density estimation

  • $X_1, \ldots, X_n$ i.i.d. r.v. from a distribution with density $p$ on $\mathbb{R}$
  • $h > 0$ and $K : \mathbb{R} \to \mathbb{R}_+$ with $\int_{\mathbb{R}} K = 1$
  • $\hat{p}(x) = \frac{1}{nh} \sum_{i=1}^n K\left( \frac{x - X_i}{h} \right)$
  • $W = f(X_1, \ldots, X_n) = \int \left| \hat{p}(x) - p(x) \right| dx$

$$\left| f(x_1^{i-1}, x_i, x_{i+1}^n) - f(x_1^{i-1}, x_i', x_{i+1}^n) \right| \le \frac{1}{nh} \int \left| K\left( \tfrac{x - x_i}{h} \right) - K\left( \tfrac{x - x_i'}{h} \right) \right| dx \le \frac{2}{n},$$
so with proba $\ge 1 - \varepsilon$,
$$W - \mathbb{E}W \le \sqrt{\frac{2 \log(\varepsilon^{-1})}{n}}$$
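
A hedged numerical sketch of this third example: standard normal data, a Gaussian kernel, and the L1 distance approximated by a Riemann sum on a grid (all of these are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(8)
n, n_rep, h, eps = 300, 500, 0.3, 0.05
grid = np.linspace(-6, 6, 1001)
dx = grid[1] - grid[0]
p_true = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)

def l1_error(x):
    # p_hat(grid) = (1/(n h)) sum_i K((grid - X_i) / h) with a Gaussian kernel K
    u = (grid[:, None] - x[None, :]) / h
    p_hat = np.exp(-u**2 / 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    return np.abs(p_hat - p_true).sum() * dx

W = np.array([l1_error(rng.standard_normal(n)) for _ in range(n_rep)])
deviation = np.sqrt(2 * np.log(1 / eps) / n)
coverage = (W - W.mean() <= deviation).mean()
print(f"E W ~ {W.mean():.4f}, empirical coverage of the bound ~ {coverage:.4f} >= {1 - eps}")
```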

SLIDE 28

Self-bounded functions

(Boucheron, Lugosi, Massart, 2003, 2009; Maurer, 2005)

  • $f_i(x_1, \ldots, x_n) = \inf_{x_i \in \mathcal{X}} f(x_1, \ldots, x_n)$
  • If for some $a, b \ge 0$, for any $(x_1, \ldots, x_n) \in \mathcal{X}^n$,
$$\sum_{i=1}^n \left( f(x_1, \ldots, x_n) - f_i(x_1, \ldots, x_n) \right)^2 \le a f(x_1, \ldots, x_n) + b,$$
then, for any $t \ge 0$, $W = f(X_1, \ldots, X_n)$ satisfies
$$\mathbb{P}\left( W - \mathbb{E}W > t \right) \le e^{-\frac{t^2}{2(a\mathbb{E}W + b + at/2)}}$$

SLIDE 29

Talagrand's inequality

(Talagrand, 1996; Rio, 2002; Bousquet, 2003)

  • $W = \sup_{g \in \mathcal{G}} \frac{g(X_1) + \cdots + g(X_n)}{n}$
  • $\mathbb{E}g(X) = 0$ and $g(X) \le c$
  • $v = \sup_{g \in \mathcal{G}} \operatorname{Var} g(X) + 2c\mathbb{E}W$

For any $\varepsilon > 0$, with proba at least $1 - \varepsilon$,
$$W - \mathbb{E}W \le \sqrt{\frac{2v \log(\varepsilon^{-1})}{n}} + \frac{c \log(\varepsilon^{-1})}{3n}$$
For any $t \ge 0$,
$$\mathbb{P}\left( W - \mathbb{E}W > t \right) \le e^{-\frac{nt^2}{2v + 2ct/3}}$$

SLIDE 30

Expected maximal deviations

Let $\sigma > 0$, $m \ge 2$, $W_1, \ldots, W_m$ r.v. s.t. for all $s > 0$ and any $1 \le i \le m$, $\mathbb{E}e^{sW_i} \le e^{\frac{s^2\sigma^2}{2}}$. Then
$$\mathbb{E}\left[ \max_{1 \le i \le m} W_i \right] \le \sigma \sqrt{2 \log m}.$$
If for any $s > 0$, we also have $\mathbb{E}e^{-sW_i} \le e^{\frac{s^2\sigma^2}{2}}$, then
$$\mathbb{E}\left[ \max_{1 \le i \le m} |W_i| \right] \le \sigma \sqrt{2 \log(2m)}.$$
Proof: $\max_{1 \le i \le m} W_i \le \frac{1}{s} \log \sum_{i=1}^m e^{sW_i}$, so by Jensen's inequality
$$\mathbb{E}\max_{1 \le i \le m} W_i \le \frac{1}{s} \log \mathbb{E}\sum_{i=1}^m e^{sW_i} \le \frac{1}{s} \log\left( m e^{s^2\sigma^2/2} \right),$$
and the choice $s = \sqrt{2 \log m}/\sigma$ gives the bound.
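
A small Monte Carlo sketch (i.i.d. N(0, σ²) variables, an arbitrary choice satisfying the log-Laplace condition) checking E max_i W_i ≤ σ √(2 log m):

```python
import numpy as np

rng = np.random.default_rng(9)
sigma, n_rep = 1.0, 10000

for m in (2, 10, 100, 1000):
    W = sigma * rng.standard_normal((n_rep, m))
    emax = W.max(axis=1).mean()                  # Monte Carlo estimate of E max_i W_i
    bound = sigma * np.sqrt(2 * np.log(m))
    print(f"m={m}: E max ~ {emax:.3f} <= {bound:.3f}")
```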

SLIDE 31

Extension to martingale difference sequences

  • Let $X_1, X_2, \ldots$ and $U_1, U_2, \ldots$ be r.v. such that $\mathbb{E}[X_i | U_1, \ldots, U_{i-1}] = 0$ for all $i \ge 1$
  • Assume that for some $c > 0$ and some r.v. $A_i$ measurable w.r.t. $U_1, \ldots, U_{i-1}$, $X_i$ takes its values in $[A_i, A_i + c]$. Then
$$\mathbb{P}(\bar{X} > t) \le e^{-\frac{2nt^2}{c^2}}$$
  • same r.h.s. as if we had i.i.d. r.v. taking values in $[0, c]$

SLIDE 32

Other extensions

  • All upper bounds easily extend to independent non-identically distributed r.v.
  • Some upper bounds on the empirical mean can be extended to random vectors
  • All upper bounds on the empirical mean remain valid if the $X_i$ are sampled without replacement

SLIDE 33

Some nice references:

  • Appendix of G. Lugosi and N. Cesa-Bianchi's book "Prediction, Learning, and Games"
  • G. Lugosi's lecture notes on concentration inequalities
  • Boucheron, Lugosi, Massart (2003, 2009)
  • P. Massart's Saint-Flour lecture notes