
Lectures on learning theory

Gábor Lugosi

ICREA and Pompeu Fabra University, Barcelona

what is learning theory?

A mathematical theory to understand the behavior of learning algorithms and assist their design.

Ingredients: probability; (linear) algebra; optimization; complexity of algorithms; high-dimensional geometry; statistics (hypothesis testing, regression, Bayesian methods, etc.).

learning theory

Statistical learning: supervised (classification, regression, ranking, ...); unsupervised (clustering, density estimation, ...); semi-supervised learning; active learning.

Online learning.
statistical learning

How is it different from “classical” statistics? Focus is on prediction rather than inference; distribution-free approach; non-asymptotic results are preferred; high-dimensional problems; algorithmic aspects play a central role.

Here we focus on concentration inequalities.

a binary classification problem

(X, Y) is a pair of random variables. X ∈ 𝒳 represents the observation, Y ∈ {−1, 1} is the (binary) label. A classifier is a function g : 𝒳 → {−1, 1} whose risk is

$$R(g) = \mathbf{P}\{g(X) \neq Y\}.$$

Training data: n i.i.d. observation/label pairs Dn = ((X1, Y1), . . . , (Xn, Yn)). The risk of a data-based classifier gn is

$$R(g_n) = \mathbf{P}\{g_n(X) \neq Y \mid D_n\}.$$

empirical risk minimization

Given a class 𝒞 of classifiers, choose one that minimizes the empirical risk:

$$g_n = \mathop{\mathrm{argmin}}_{g\in\mathcal{C}} R_n(g) = \mathop{\mathrm{argmin}}_{g\in\mathcal{C}} \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{g(X_i)\neq Y_i\}}.$$

Fundamental questions: How close is Rn(g) to R(g)? How close is R(gn) to min_{g∈𝒞} R(g)? How close is R(gn) to Rn(gn)?

empirical risk minimization

To understand |Rn(g) − R(g)|, we need to study deviations of empirical averages from their means. For the other two questions, note that

$$|R(g_n) - R_n(g_n)| \le \sup_{g\in\mathcal{C}} |R(g) - R_n(g)|$$

and, since Rn(gn) ≤ Rn(g) for every g ∈ 𝒞 by the definition of gn,

$$R(g_n) - \min_{g\in\mathcal{C}} R(g) = \bigl(R(g_n) - R_n(g_n)\bigr) + \Bigl(R_n(g_n) - \min_{g\in\mathcal{C}} R(g)\Bigr) \le 2\sup_{g\in\mathcal{C}} |R(g) - R_n(g)|.$$

We need to understand uniform deviations of empirical averages from their means.
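
A quick numerical illustration (not part of the slides; numpy and the synthetic data-generating distribution below are assumptions): empirical risk minimization over a finite class of threshold classifiers, comparing the uniform deviation sup_g |Rn(g) − R(g)| with the excess risk of the minimizer.

```python
# Sketch: ERM over a finite class of threshold classifiers g_s(x) = sign(x - s).
import numpy as np

rng = np.random.default_rng(0)
n = 2000
thresholds = np.linspace(-2, 2, 41)          # the finite class C

# synthetic distribution: X ~ N(0,1), Y = sign(X - 0.5) with labels flipped w.p. 0.1
X = rng.standard_normal(n)
Y = np.sign(X - 0.5)
Y[rng.random(n) < 0.1] *= -1

# estimate the true risks R(g_s) on a large independent test sample
Xt = rng.standard_normal(200_000)
Yt = np.sign(Xt - 0.5)
Yt[rng.random(Xt.size) < 0.1] *= -1
risk = np.array([np.mean(np.sign(Xt - s) != Yt) for s in thresholds])

emp_risk = np.array([np.mean(np.sign(X - s) != Y) for s in thresholds])
print("ERM threshold:", thresholds[np.argmin(emp_risk)])
print("sup_g |R_n(g) - R(g)| ~", np.max(np.abs(emp_risk - risk)))
print("R(g_n) - min_g R(g)   ~", risk[np.argmin(emp_risk)] - risk.min())
```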

markov’s inequality

If Z ≥ 0, then

$$\mathbf{P}\{Z > t\} \le \frac{\mathbf{E}Z}{t}.$$

This implies Chebyshev’s inequality: if Z has a finite variance Var(Z) = E(Z − EZ)², then

$$\mathbf{P}\{|Z - \mathbf{E}Z| > t\} = \mathbf{P}\{(Z - \mathbf{E}Z)^2 > t^2\} \le \frac{\mathrm{Var}(Z)}{t^2}.$$

Andrey Markov (1856–1922)

sums of independent random variables

Let X1, . . . , Xn be independent real-valued random variables and let Z = X1 + · · · + Xn. By independence, Var(Z) = Var(X1) + · · · + Var(Xn). If they are identically distributed, Var(Z) = n Var(X1), so

$$\mathbf{P}\left\{\left|\sum_{i=1}^n X_i - n\mathbf{E}X_1\right| > t\right\} \le \frac{n\,\mathrm{Var}(X_1)}{t^2}.$$

Equivalently,

$$\mathbf{P}\left\{\left|\sum_{i=1}^n X_i - n\mathbf{E}X_1\right| > t\sqrt{n}\right\} \le \frac{\mathrm{Var}(X_1)}{t^2}.$$

Typical deviations are at most of the order √n.

Pafnuty Chebyshev (1821–1894)

chernoff bounds

By the central limit theorem,

$$\lim_{n\to\infty}\mathbf{P}\left\{\sum_{i=1}^n X_i - n\mathbf{E}X_1 > t\sqrt{n}\right\} = 1 - \Phi\!\left(\frac{t}{\sqrt{\mathrm{Var}(X_1)}}\right) \le e^{-t^2/(2\mathrm{Var}(X_1))}$$

(Φ denotes the standard normal distribution function), so we expect an exponential decrease in t²/Var(X1).

Trick: use Markov’s inequality in a more clever way: if λ > 0,

$$\mathbf{P}\{Z - \mathbf{E}Z > t\} = \mathbf{P}\left\{e^{\lambda(Z-\mathbf{E}Z)} > e^{\lambda t}\right\} \le \frac{\mathbf{E}e^{\lambda(Z-\mathbf{E}Z)}}{e^{\lambda t}}.$$

Now derive bounds for the moment generating function of Z − EZ and optimize λ.

chernoff bounds

If Z = X1 + · · · + Xn is a sum of independent random variables, then

$$\mathbf{E}e^{\lambda Z} = \mathbf{E}\prod_{i=1}^n e^{\lambda X_i} = \prod_{i=1}^n \mathbf{E}e^{\lambda X_i}$$

by independence. Now it suffices to find bounds for each EeλXi.

Serguei Bernstein (1880–1968), Herman Chernoff (1923–)

hoeffding’s inequality

If X1, . . . , Xn ∈ [0, 1], then

$$\mathbf{E}e^{\lambda(X_i-\mathbf{E}X_i)} \le e^{\lambda^2/8}.$$

We obtain

$$\mathbf{P}\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \mathbf{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right| > t\right\} \le 2e^{-2nt^2}.$$

Wassily Hoeffding (1914–1991)
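
A quick sanity check (not part of the slides; numpy assumed): simulate means of n bounded variables and compare the observed tail probability with the bound 2e^{−2nt²}.

```python
# Sketch: empirical tail of the mean of n Uniform[0,1] variables vs. Hoeffding's bound.
import numpy as np

rng = np.random.default_rng(1)
n, reps, t = 100, 100_000, 0.1
means = rng.random((reps, n)).mean(axis=1)        # X_i in [0,1], E X_i = 0.5
print("empirical P{|mean - 0.5| > t}:", np.mean(np.abs(means - 0.5) > t))
print("Hoeffding bound 2*exp(-2*n*t^2):", 2 * np.exp(-2 * n * t**2))
```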

bernstein’s inequality

Hoeffding’s inequality is distribution-free: it does not take variance information into account. Bernstein’s inequality is an often useful variant. Let X1, . . . , Xn be independent such that Xi ≤ 1, and let v = ∑_{i=1}^n E[Xi²]. Then

$$\mathbf{P}\left\{\sum_{i=1}^n (X_i - \mathbf{E}X_i) \ge t\right\} \le \exp\left(-\frac{t^2}{2(v + t/3)}\right).$$
a maximal inequality

Suppose Y1, . . . , YN are sub-Gaussian in the sense that EeλYi ≤ e^{λ²σ²/2} for all λ > 0. Then

$$\mathbf{E}\max_{i=1,\dots,N} Y_i \le \sigma\sqrt{2\log N}.$$

Proof:

$$e^{\lambda\mathbf{E}\max_{i=1,\dots,N} Y_i} \le \mathbf{E}e^{\lambda\max_{i=1,\dots,N} Y_i} \le \sum_{i=1}^N \mathbf{E}e^{\lambda Y_i} \le N e^{\lambda^2\sigma^2/2}.$$

Take logarithms and optimize in λ.
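
A quick numerical check (not part of the slides; numpy assumed) of the maximal inequality for independent standard normals, which are sub-Gaussian with σ = 1:

```python
# Sketch: E max of N standard normals vs. the bound sigma*sqrt(2 log N).
import numpy as np

rng = np.random.default_rng(2)
for N in (10, 100, 1000):
    maxima = rng.standard_normal((10_000, N)).max(axis=1)
    print(N, "E max ~", round(maxima.mean(), 3),
          " bound:", round(np.sqrt(2 * np.log(N)), 3))
```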

uniform deviations – finite classes

Let A1, . . . , AN ⊂ 𝒳 and let X1, . . . , Xn be i.i.d. random points in 𝒳. Let P(A) = P{X1 ∈ A} and

$$P_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbb{1}_{\{X_i\in A\}}.$$

By Hoeffding’s inequality, for each A,

$$\mathbf{E}e^{\lambda(P(A)-P_n(A))} = \mathbf{E}e^{(\lambda/n)\sum_{i=1}^n (P(A)-\mathbb{1}_{\{X_i\in A\}})} = \prod_{i=1}^n \mathbf{E}e^{(\lambda/n)(P(A)-\mathbb{1}_{\{X_i\in A\}})} \le e^{\lambda^2/(8n)}.$$

By the maximal inequality,

$$\mathbf{E}\max_{j=1,\dots,N}\bigl(P(A_j) - P_n(A_j)\bigr) \le \sqrt{\frac{\log N}{2n}}.$$

johnson-lindenstrauss

Suppose A = {a1, . . . , an} ⊂ ℝ^D is a finite set and D is large. We would like to embed A in ℝ^d where d ≪ D. Is this possible? In what sense?

Given ε > 0, a function f : ℝ^D → ℝ^d is an ε-isometry if for all a, a′ ∈ A,

$$(1-\varepsilon)\|a-a'\|^2 \le \|f(a)-f(a')\|^2 \le (1+\varepsilon)\|a-a'\|^2.$$

Johnson-Lindenstrauss lemma: If d ≥ (c/ε²) log n, then there exists an ε-isometry f : ℝ^D → ℝ^d. The required dimension is independent of D!

random projections

We take f to be linear. How? At random! Let f be the linear map given by the d × D matrix W = (Wi,j) with Wi,j = Xi,j/√d, where the Xi,j are independent standard normal. For any a = (α1, . . . , αD) ∈ ℝ^D,

$$\mathbf{E}\|f(a)\|^2 = \frac{1}{d}\sum_{i=1}^d\sum_{j=1}^D \alpha_j^2\,\mathbf{E}X_{i,j}^2 = \|a\|^2.$$

The expected squared distances are preserved! Moreover, ‖f(a)‖²/‖a‖² is a weighted sum of squared normals.

random projections

Let b = ai − aj for some ai, aj ∈ A. Then

$$\mathbf{P}\left\{\exists b : \left|\frac{\|f(b)\|^2}{\|b\|^2}-1\right| > \sqrt{\frac{8\log(n/\sqrt{\delta})}{d}} + \frac{8\log(n/\sqrt{\delta})}{d}\right\} \le \binom{n}{2}\,\mathbf{P}\left\{\left|\frac{\|f(b)\|^2}{\|b\|^2}-1\right| > \sqrt{\frac{8\log(n/\sqrt{\delta})}{d}} + \frac{8\log(n/\sqrt{\delta})}{d}\right\} \le \delta$$

(by a Bernstein-type inequality). If d ≥ (c/ε²) log(n/√δ), then

$$\sqrt{\frac{8\log(n/\sqrt{\delta})}{d}} + \frac{8\log(n/\sqrt{\delta})}{d} \le \varepsilon$$

and f is an ε-isometry with probability at least 1 − δ.
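
A small demonstration of the construction (not part of the slides; numpy, the point set, and the dimensions are illustrative assumptions): project a finite set of high-dimensional points with a Gaussian random matrix and inspect the distortion of pairwise squared distances.

```python
# Sketch: random projection f(a) = W a with W having i.i.d. N(0, 1/d) entries.
import numpy as np

rng = np.random.default_rng(3)
n, D, d = 50, 10_000, 500
A = rng.standard_normal((n, D))                   # n points in R^D
W = rng.standard_normal((d, D)) / np.sqrt(d)      # random linear map
B = A @ W.T                                       # projected points in R^d

ratios = [np.sum((B[i] - B[j])**2) / np.sum((A[i] - A[j])**2)
          for i in range(n) for j in range(i + 1, n)]
print("distortion of squared distances: min", round(min(ratios), 3),
      "max", round(max(ratios), 3))
```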

martingale representation

X1, . . . , Xn are independent random variables taking values in some set 𝒳. Let f : 𝒳^n → ℝ and Z = f(X1, . . . , Xn). Denote Ei[·] = E[· | X1, . . . , Xi]. Thus, E0Z = EZ and EnZ = Z. Writing ∆i = EiZ − Ei−1Z, we have

$$Z - \mathbf{E}Z = \sum_{i=1}^n \Delta_i.$$

This is the Doob martingale representation of Z.

Joseph Leo Doob (1910–2004)

martingale representation: the variance

$$\mathrm{Var}(Z) = \mathbf{E}\left[\left(\sum_{i=1}^n \Delta_i\right)^2\right] = \sum_{i=1}^n \mathbf{E}\bigl[\Delta_i^2\bigr] + 2\sum_{j>i}\mathbf{E}[\Delta_i\Delta_j].$$

Now if j > i, then Ei∆j = 0, so

$$\mathbf{E}_i[\Delta_i\Delta_j] = \Delta_i\,\mathbf{E}_i\Delta_j = 0,$$

and we obtain

$$\mathrm{Var}(Z) = \mathbf{E}\left[\left(\sum_{i=1}^n \Delta_i\right)^2\right] = \sum_{i=1}^n \mathbf{E}\bigl[\Delta_i^2\bigr].$$

From this, using independence, it is easy to derive the Efron-Stein inequality.

efron-stein inequality (1981)

Let X1, . . . , Xn be independent random variables taking values in 𝒳. Let f : 𝒳^n → ℝ and Z = f(X1, . . . , Xn). Then

$$\mathrm{Var}(Z) \le \mathbf{E}\sum_{i=1}^n \bigl(Z - \mathbf{E}^{(i)}Z\bigr)^2 = \mathbf{E}\sum_{i=1}^n \mathrm{Var}^{(i)}(Z),$$

where E^{(i)}Z denotes expectation with respect to the i-th variable Xi only.

We obtain more useful forms by using that Var(X) = ½ E(X − X′)² and Var(X) ≤ E(X − a)² for any constant a.

efron-stein inequality (1981)

If X′1, . . . , X′n are independent copies of X1, . . . , Xn, and

$$Z'_i = f(X_1, \dots, X_{i-1}, X'_i, X_{i+1}, \dots, X_n),$$

then

$$\mathrm{Var}(Z) \le \frac{1}{2}\,\mathbf{E}\sum_{i=1}^n (Z - Z'_i)^2.$$

Z is concentrated if it doesn’t depend too much on any of its variables. If Z = X1 + · · · + Xn then we have an equality: sums are the “least concentrated” of all functions!

efron-stein inequality (1981)

If, for some arbitrary functions fi,

$$Z_i = f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n),$$

then

$$\mathrm{Var}(Z) \le \mathbf{E}\sum_{i=1}^n (Z - Z_i)^2.$$
slide-48
SLIDE 48

efron, stein, and steele

Bradley Efron Charles Stein Mike Steele

slide-49
SLIDE 49

example: uniform deviations

Let A be a collection of subsets of X, and let X1, . . . , Xn be n random points in X drawn i.i.d. Let P(A) = P{X1 ∈ A} and Pn(A) = 1 n

n

  • i=1

✶Xi∈A If Z = supA∈A |P(A) − Pn(A)|, Var(Z) ≤ 1 2n

slide-50
SLIDE 50

example: uniform deviations

Let A be a collection of subsets of X, and let X1, . . . , Xn be n random points in X drawn i.i.d. Let P(A) = P{X1 ∈ A} and Pn(A) = 1 n

n

  • i=1

✶Xi∈A If Z = supA∈A |P(A) − Pn(A)|, Var(Z) ≤ 1 2n regardless of the distribution and the richness of A.

example: kernel density estimation

Let X1, . . . , Xn be i.i.d. real samples drawn according to some density φ. The kernel density estimate is

$$\varphi_n(x) = \frac{1}{nh}\sum_{i=1}^n K\!\left(\frac{x-X_i}{h}\right),$$

where h > 0 and K is a nonnegative “kernel” with ∫K = 1. The L1 error is

$$Z = f(X_1,\dots,X_n) = \int |\varphi(x) - \varphi_n(x)|\,dx.$$

It is easy to see that

$$|f(x_1,\dots,x_n) - f(x_1,\dots,x'_i,\dots,x_n)| \le \frac{1}{nh}\int \left|K\!\left(\frac{x-x_i}{h}\right) - K\!\left(\frac{x-x'_i}{h}\right)\right| dx \le \frac{2}{n},$$

so we get Var(Z) ≤ 2/n.
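
A numerical sketch (not part of the slides; numpy, the Gaussian kernel, and a grid approximation of the integral are assumptions) of the bounded-differences property: replacing one sample changes the L1 error by at most 2/n.

```python
# Sketch: bounded differences for the L1 error of a Gaussian-kernel density estimate.
import numpy as np

rng = np.random.default_rng(5)
n, h = 200, 0.3
grid = np.linspace(-6, 6, 2001)
dx = grid[1] - grid[0]
phi = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)   # true density: standard normal

def l1_error(samples):
    u = (grid[None, :] - samples[:, None]) / h
    phi_n = np.exp(-u**2 / 2).sum(axis=0) / (n * h * np.sqrt(2 * np.pi))
    return np.sum(np.abs(phi - phi_n)) * dx

X = rng.standard_normal(n)
Xmod = X.copy()
Xmod[0] = rng.standard_normal()                   # replace a single observation
print("|Z - Z'| =", abs(l1_error(X) - l1_error(Xmod)), " <= 2/n =", 2 / n)
```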

bounding the expectation

Let P′n(A) = (1/n) ∑_{i=1}^n 1{X′i ∈ A} and let E′ denote expectation only with respect to X′1, . . . , X′n. Then

$$\mathbf{E}\sup_{A\in\mathcal{A}} |P_n(A) - P(A)| = \mathbf{E}\sup_{A\in\mathcal{A}} \bigl|\mathbf{E}'[P_n(A) - P'_n(A)]\bigr| \le \mathbf{E}\sup_{A\in\mathcal{A}} |P_n(A) - P'_n(A)| = \frac{1}{n}\,\mathbf{E}\sup_{A\in\mathcal{A}} \left|\sum_{i=1}^n \bigl(\mathbb{1}_{\{X_i\in A\}} - \mathbb{1}_{\{X'_i\in A\}}\bigr)\right|.$$

Second symmetrization: if ε1, . . . , εn are independent Rademacher variables, then this equals

$$\frac{1}{n}\,\mathbf{E}\sup_{A\in\mathcal{A}} \left|\sum_{i=1}^n \varepsilon_i\bigl(\mathbb{1}_{\{X_i\in A\}} - \mathbb{1}_{\{X'_i\in A\}}\bigr)\right| \le \frac{2}{n}\,\mathbf{E}\sup_{A\in\mathcal{A}} \left|\sum_{i=1}^n \varepsilon_i\mathbb{1}_{\{X_i\in A\}}\right|.$$

conditional rademacher average

If

$$R_n = \mathbf{E}_\varepsilon \sup_{A\in\mathcal{A}} \left|\sum_{i=1}^n \varepsilon_i \mathbb{1}_{\{X_i\in A\}}\right|,$$

then

$$\mathbf{E}\sup_{A\in\mathcal{A}} |P_n(A) - P(A)| \le \frac{2}{n}\,\mathbf{E}R_n.$$

Rn is a data-dependent quantity!
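
A Monte Carlo sketch (not part of the slides; numpy and the class of half-lines are assumptions) of the conditional Rademacher average for a given sample:

```python
# Sketch: estimate R_n = E_eps sup_s |sum_i eps_i 1{X_i <= s}| for half-lines (-inf, s].
import numpy as np

rng = np.random.default_rng(6)
n = 500
X = rng.standard_normal(n)
order = np.argsort(X)                             # sup over s = max over prefixes in sorted order

def sup_abs_prefix(eps):
    prefix = np.concatenate(([0.0], np.cumsum(eps[order])))
    return np.abs(prefix).max()

R_n = np.mean([sup_abs_prefix(rng.choice([-1.0, 1.0], size=n)) for _ in range(2000)])
print("R_n ~", R_n, "   data-dependent bound (2/n) R_n ~", 2 * R_n / n)
```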

concentration of conditional rademacher average

Define

$$R_n^{(i)} = \mathbf{E}_\varepsilon \sup_{A\in\mathcal{A}} \left|\sum_{j\neq i} \varepsilon_j \mathbb{1}_{\{X_j\in A\}}\right|.$$

One can show easily that

$$0 \le R_n - R_n^{(i)} \le 1 \qquad\text{and}\qquad \sum_{i=1}^n \bigl(R_n - R_n^{(i)}\bigr) \le R_n.$$

By the Efron-Stein inequality,

$$\mathrm{Var}(R_n) \le \mathbf{E}\sum_{i=1}^n \bigl(R_n - R_n^{(i)}\bigr)^2 \le \mathbf{E}R_n.$$

The standard deviation is at most √(ERn)! Such functions are called self-bounding.

bounding the conditional rademacher average

If S(X1^n, 𝒜) is the number of different sets of the form {X1, . . . , Xn} ∩ A with A ∈ 𝒜, then Rn is the maximum of S(X1^n, 𝒜) sub-Gaussian random variables. By the maximal inequality,

$$\frac{1}{n}R_n \le \sqrt{\frac{\log S(X_1^n,\mathcal{A})}{2n}}.$$

In particular,

$$\mathbf{E}\sup_{A\in\mathcal{A}} |P_n(A) - P(A)| \le 2\,\mathbf{E}\sqrt{\frac{\log S(X_1^n,\mathcal{A})}{2n}}.$$

random VC dimension

Let V = V(X1^n, 𝒜) be the size of the largest subset of {X1, . . . , Xn} shattered by 𝒜. By Sauer’s lemma,

$$\log S(X_1^n,\mathcal{A}) \le V(X_1^n,\mathcal{A})\log(n+1).$$

V is also self-bounding:

$$\sum_{i=1}^n \bigl(V - V^{(i)}\bigr)^2 \le V,$$

so by Efron-Stein, Var(V) ≤ EV.

vapnik and chervonenkis

Vladimir Vapnik, Alexey Chervonenkis

beyond the variance

X1, . . . , Xn are independent random variables taking values in some set 𝒳. Let f : 𝒳^n → ℝ and Z = f(X1, . . . , Xn). Recall the Doob martingale representation:

$$Z - \mathbf{E}Z = \sum_{i=1}^n \Delta_i, \qquad \Delta_i = \mathbf{E}_iZ - \mathbf{E}_{i-1}Z, \qquad \mathbf{E}_i[\cdot] = \mathbf{E}[\,\cdot\,|\,X_1,\dots,X_i].$$

To get exponential inequalities, we bound the moment generating function Ee^{λ(Z−EZ)}.

azuma’s inequality

Suppose that the martingale differences are bounded: |∆i| ≤ ci. Then

$$\mathbf{E}e^{\lambda(Z-\mathbf{E}Z)} = \mathbf{E}e^{\lambda\sum_{i=1}^n \Delta_i} = \mathbf{E}\left[e^{\lambda\sum_{i=1}^{n-1}\Delta_i}\,\mathbf{E}_{n-1}e^{\lambda\Delta_n}\right] \le \mathbf{E}\left[e^{\lambda\sum_{i=1}^{n-1}\Delta_i}\right]e^{\lambda^2 c_n^2/2} \quad\text{(by Hoeffding)}$$

$$\le \cdots \le e^{\lambda^2\left(\sum_{i=1}^n c_i^2\right)/2}.$$

This is the Azuma-Hoeffding inequality for sums of bounded martingale differences.

bounded differences inequality

If Z = f(X1, . . . , Xn) and f is such that

$$|f(x_1,\dots,x_n) - f(x_1,\dots,x'_i,\dots,x_n)| \le c_i,$$

then the martingale differences are bounded. Bounded differences inequality: if X1, . . . , Xn are independent, then

$$\mathbf{P}\{|Z - \mathbf{E}Z| > t\} \le 2e^{-2t^2/\sum_{i=1}^n c_i^2}.$$

Also known as McDiarmid’s inequality. Colin McDiarmid
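
A simulation sketch (not part of the slides; numpy and the particular statistic are assumptions): for Z = sup_s |Pn((−∞, s]) − P((−∞, s])|, the Kolmogorov-Smirnov statistic, each ci = 1/n, so the bounded differences inequality gives P{|Z − EZ| > t} ≤ 2e^{−2nt²}.

```python
# Sketch: bounded differences inequality applied to the one-sample KS statistic.
import numpy as np

rng = np.random.default_rng(7)
n, reps, t = 200, 20_000, 0.1

def ks_stat(u):                                   # u ~ Uniform[0,1], so P((-inf, s]) = s
    u = np.sort(u)
    i = np.arange(1, n + 1)
    return np.max(np.maximum(i / n - u, u - (i - 1) / n))

Z = np.array([ks_stat(rng.random(n)) for _ in range(reps)])
print("empirical P{|Z - EZ| > t}:", np.mean(np.abs(Z - Z.mean()) > t))
print("bound 2*exp(-2*n*t^2):     ", 2 * np.exp(-2 * n * t**2))
```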

hoeffding in a hilbert space

Let X1, . . . , Xn be independent zero-mean random variables taking values in a separable Hilbert space, such that ‖Xi‖ ≤ c/2, and denote v = nc²/4. Then, for all t ≥ √v,

$$\mathbf{P}\left\{\left\|\sum_{i=1}^n X_i\right\| > t\right\} \le e^{-(t-\sqrt{v})^2/(2v)}.$$

Proof: By the triangle inequality, ‖∑_{i=1}^n Xi‖ has the bounded differences property with constants c, so

$$\mathbf{P}\left\{\left\|\sum_{i=1}^n X_i\right\| > t\right\} = \mathbf{P}\left\{\left\|\sum_{i=1}^n X_i\right\| - \mathbf{E}\left\|\sum_{i=1}^n X_i\right\| > t - \mathbf{E}\left\|\sum_{i=1}^n X_i\right\|\right\} \le \exp\left(-\frac{\bigl(t - \mathbf{E}\bigl\|\sum_{i=1}^n X_i\bigr\|\bigr)^2}{2v}\right).$$

Also,

$$\mathbf{E}\left\|\sum_{i=1}^n X_i\right\| \le \sqrt{\mathbf{E}\left\|\sum_{i=1}^n X_i\right\|^2} = \sqrt{\sum_{i=1}^n \mathbf{E}\|X_i\|^2} \le \sqrt{v}.$$

bounded differences inequality

Easy to use and distribution-free. Often close to optimal (e.g., for the L1 error of the kernel density estimate). But it does not exploit “variance information” and is often too rigid: other methods are necessary.

shannon entropy

If X, Y are random variables taking values in a set of size N,

$$H(X) = -\sum_x p(x)\log p(x), \qquad H(X|Y) = H(X,Y) - H(Y) = -\sum_{x,y} p(x,y)\log p(x|y).$$

Then H(X) ≤ log N and H(X|Y) ≤ H(X).

Claude Shannon (1916–2001)

han’s inequality

If X = (X1, . . . , Xn) and X^{(i)} = (X1, . . . , Xi−1, Xi+1, . . . , Xn), then

$$\sum_{i=1}^n \bigl(H(X) - H(X^{(i)})\bigr) \le H(X).$$

Proof: H(X) = H(X^{(i)}) + H(Xi | X^{(i)}) ≤ H(X^{(i)}) + H(Xi | X1, . . . , Xi−1). Since ∑_{i=1}^n H(Xi | X1, . . . , Xi−1) = H(X), summing the inequality over i we get

$$(n-1)H(X) \le \sum_{i=1}^n H(X^{(i)}).$$

Te Sun Han

subadditivity of entropy

The entropy of a random variable Z ≥ 0 is

$$\mathrm{Ent}(Z) = \mathbf{E}\Phi(Z) - \Phi(\mathbf{E}Z), \qquad \text{where } \Phi(x) = x\log x.$$

By Jensen’s inequality, Ent(Z) ≥ 0.

Han’s inequality implies the following sub-additivity property. Let X1, . . . , Xn be independent and let Z = f(X1, . . . , Xn), where f ≥ 0. Denote

$$\mathrm{Ent}^{(i)}(Z) = \mathbf{E}^{(i)}\Phi(Z) - \Phi(\mathbf{E}^{(i)}Z).$$

Then

$$\mathrm{Ent}(Z) \le \mathbf{E}\sum_{i=1}^n \mathrm{Ent}^{(i)}(Z).$$

a logarithmic sobolev inequality on the hypercube

Let X = (X1, . . . , Xn) be uniformly distributed over {−1, 1}^n. If f : {−1, 1}^n → ℝ and Z = f(X), then

$$\mathrm{Ent}(Z^2) \le \frac{1}{2}\,\mathbf{E}\sum_{i=1}^n (Z - Z'_i)^2.$$

The proof uses subadditivity of the entropy and calculus for the case n = 1. It implies the Efron-Stein inequality and the edge-isoperimetric inequality.

herbst’s argument: exponential concentration

If f : {−1, 1}^n → ℝ, the log-Sobolev inequality may be used with g(x) = e^{λf(x)/2}, where λ ∈ ℝ. If F(λ) = Ee^{λZ} is the moment generating function of Z = f(X), then

$$\mathrm{Ent}(g(X)^2) = \lambda\,\mathbf{E}\bigl[Ze^{\lambda Z}\bigr] - \mathbf{E}\bigl[e^{\lambda Z}\bigr]\log\mathbf{E}\bigl[e^{\lambda Z}\bigr] = \lambda F'(\lambda) - F(\lambda)\log F(\lambda).$$

Differential inequalities are obtained for F(λ).

herbst’s argument

As an example, suppose f is such that

$$\sum_{i=1}^n (Z - Z'_i)_+^2 \le v.$$

Then, by the log-Sobolev inequality,

$$\lambda F'(\lambda) - F(\lambda)\log F(\lambda) \le \frac{v\lambda^2}{4}F(\lambda).$$

If G(λ) = log F(λ), this becomes

$$\left(\frac{G(\lambda)}{\lambda}\right)' \le \frac{v}{4}.$$

This can be integrated: G(λ) ≤ λEZ + λ²v/4, so

$$F(\lambda) \le e^{\lambda\mathbf{E}Z + \lambda^2 v/4}.$$

This implies

$$\mathbf{P}\{Z > \mathbf{E}Z + t\} \le e^{-t^2/v}.$$

Stronger than the bounded differences inequality!

gaussian log-sobolev and concentration inequalities

Let X = (X1, . . . , Xn) be a vector of i.i.d. standard normal random variables. If f : ℝ^n → ℝ and Z = f(X), then

$$\mathrm{Ent}(Z^2) \le 2\,\mathbf{E}\|\nabla f(X)\|^2.$$

This can be proved using the central limit theorem and the Bernoulli log-Sobolev inequality.

It implies the Gaussian concentration inequality: suppose f is Lipschitz, that is, for all x, y ∈ ℝ^n,

$$|f(x) - f(y)| \le L\|x - y\|.$$

Then, for all t > 0,

$$\mathbf{P}\{f(X) - \mathbf{E}f(X) \ge t\} \le e^{-t^2/(2L^2)}.$$

an application: supremum of a gaussian process

Let (Xt)_{t∈T} be an almost surely continuous centered Gaussian process. Let Z = sup_{t∈T} Xt. If

$$\sigma^2 = \sup_{t\in T}\mathbf{E}\bigl[X_t^2\bigr],$$

then

$$\mathbf{P}\{|Z - \mathbf{E}Z| \ge u\} \le 2e^{-u^2/(2\sigma^2)}.$$

Proof: We may assume T = {1, . . . , n}. Let Γ be the covariance matrix of X = (X1, . . . , Xn) and let A = Γ^{1/2}. If Y is a standard normal vector, then

$$f(Y) = \max_{i=1,\dots,n}(AY)_i \stackrel{\mathrm{distr.}}{=} \max_{i=1,\dots,n} X_i.$$

By Cauchy-Schwarz,

$$|(Au)_i - (Av)_i| = \left|\sum_j A_{i,j}(u_j - v_j)\right| \le \left(\sum_j A_{i,j}^2\right)^{1/2}\|u - v\| \le \sigma\|u - v\|,$$

so f is Lipschitz with constant σ and the Gaussian concentration inequality applies.
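
A simulation sketch (not part of the slides; numpy, the particular covariance matrix, and the threshold u are assumptions) of this concentration statement for the maximum of a correlated Gaussian vector:

```python
# Sketch: concentration of Z = max_i X_i for a centered Gaussian vector with covariance C C^T.
import numpy as np

rng = np.random.default_rng(8)
n, reps, u = 50, 100_000, 3.0
C = rng.standard_normal((n, n)) / np.sqrt(n)      # X = C Y has covariance Gamma = C C^T
sigma2 = np.max(np.sum(C**2, axis=1))             # sigma^2 = max_i E X_i^2 = max_i Gamma_ii

Y = rng.standard_normal((reps, n))
Z = (Y @ C.T).max(axis=1)
print("empirical P{|Z - EZ| >= u}:", np.mean(np.abs(Z - Z.mean()) >= u))
print("bound 2*exp(-u^2/(2*sigma^2)):", 2 * np.exp(-u**2 / (2 * sigma2)))
```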

beyond bernoulli and gaussian: the entropy method

For general distributions, logarithmic Sobolev inequalities are not available. Solution: modified logarithmic Sobolev inequalities.

Suppose X1, . . . , Xn are independent. Let Z = f(X1, . . . , Xn) and Zi = fi(X^{(i)}) = fi(X1, . . . , Xi−1, Xi+1, . . . , Xn). Let φ(x) = e^x − x − 1. Then for all λ ∈ ℝ,

$$\lambda\,\mathbf{E}\bigl[Ze^{\lambda Z}\bigr] - \mathbf{E}\bigl[e^{\lambda Z}\bigr]\log\mathbf{E}\bigl[e^{\lambda Z}\bigr] \le \sum_{i=1}^n \mathbf{E}\bigl[e^{\lambda Z}\,\varphi\bigl(-\lambda(Z - Z_i)\bigr)\bigr].$$

Michel Ledoux

the entropy method

Define Zi = inf_{x′i} f(X1, . . . , x′i, . . . , Xn) and suppose

$$\sum_{i=1}^n (Z - Z_i)^2 \le v.$$

Then for all t > 0,

$$\mathbf{P}\{Z - \mathbf{E}Z > t\} \le e^{-t^2/(2v)}.$$

This implies the bounded differences inequality and much more.

example: the largest eigenvalue of a symmetric matrix

Let A = (Xi,j)_{n×n} be symmetric, with the Xi,j independent (i ≤ j) and |Xi,j| ≤ 1. Let

$$Z = \lambda_1 = \sup_{u:\|u\|=1} u^T A u,$$

and suppose v is a unit vector such that Z = v^T A v. Let A′i,j be the matrix obtained from A by replacing Xi,j with x′i,j. Then

$$(Z - Z_{i,j})_+ \le \bigl(v^T A v - v^T A'_{i,j} v\bigr)\mathbb{1}_{\{Z>Z_{i,j}\}} = \bigl(v^T (A - A'_{i,j}) v\bigr)\mathbb{1}_{\{Z>Z_{i,j}\}} \le 2\bigl(v_i v_j (X_{i,j} - X'_{i,j})\bigr)_+ \le 4|v_i v_j|.$$

Therefore,

$$\sum_{1\le i\le j\le n} (Z - Z'_{i,j})_+^2 \le \sum_{1\le i\le j\le n} 16|v_i v_j|^2 \le 16\left(\sum_{i=1}^n v_i^2\right)^2 = 16.$$
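
A simulation sketch (not part of the slides; numpy, the ±1 entry distribution, and the threshold t are assumptions): the largest eigenvalue of a random symmetric matrix fluctuates on a scale bounded independently of n, and with v = 16 the entropy method gives P{Z > EZ + t} ≤ e^{−t²/32}.

```python
# Sketch: concentration of the largest eigenvalue of a random symmetric +/-1 matrix.
import numpy as np

rng = np.random.default_rng(9)
n, reps, t = 100, 2_000, 2.0

def lambda1():
    M = rng.choice([-1.0, 1.0], size=(n, n))
    M = np.triu(M) + np.triu(M, 1).T              # symmetrize; entries stay in {-1, +1}
    return np.linalg.eigvalsh(M)[-1]              # largest eigenvalue

Z = np.array([lambda1() for _ in range(reps)])
print("EZ ~", Z.mean(), "  std(Z) ~", Z.std(), "(order 1, not growing with n)")
print("empirical P{Z > EZ + t}:", np.mean(Z > Z.mean() + t),
      "  bound exp(-t^2/32):", np.exp(-t**2 / 32))
```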

example: convex lipschitz functions

Let f : [0, 1]^n → ℝ be a convex function that is Lipschitz with constant L. Let Zi = inf_{x′i} f(X1, . . . , x′i, . . . , Xn) and let X′i be the value of x′i at which the minimum is achieved. Then, writing X̄^{(i)} = (X1, . . . , Xi−1, X′i, Xi+1, . . . , Xn),

$$\sum_{i=1}^n (Z - Z_i)^2 = \sum_{i=1}^n \bigl(f(X) - f(\bar X^{(i)})\bigr)^2 \le \sum_{i=1}^n \left(\frac{\partial f}{\partial x_i}(X)\right)^2 (X_i - X'_i)^2 \quad\text{(by convexity)}$$

$$\le \sum_{i=1}^n \left(\frac{\partial f}{\partial x_i}(X)\right)^2 = \|\nabla f(X)\|^2 \le L^2.$$

self-bounding functions

Suppose Z satisfies 0 ≤ Z − Zi ≤ 1 and

$$\sum_{i=1}^n (Z - Z_i) \le Z.$$

Recall that Var(Z) ≤ EZ. We have much more:

$$\mathbf{P}\{Z > \mathbf{E}Z + t\} \le e^{-t^2/(2\mathbf{E}Z + 2t/3)} \qquad\text{and}\qquad \mathbf{P}\{Z < \mathbf{E}Z - t\} \le e^{-t^2/(2\mathbf{E}Z)}.$$

Rademacher averages and the random VC dimension are self-bounding. Configuration functions are another example.

weakly self-bounding functions

f : 𝒳^n → [0, ∞) is weakly (a, b)-self-bounding if there exist functions fi : 𝒳^{n−1} → [0, ∞) such that for all x ∈ 𝒳^n,

$$\sum_{i=1}^n \bigl(f(x) - f_i(x^{(i)})\bigr)^2 \le a f(x) + b.$$

Then

$$\mathbf{P}\{Z \ge \mathbf{E}Z + t\} \le \exp\left(-\frac{t^2}{2(a\mathbf{E}Z + b + at/2)}\right).$$

If, in addition, f(x) − fi(x^{(i)}) ≤ 1, then for 0 < t ≤ EZ,

$$\mathbf{P}\{Z \le \mathbf{E}Z - t\} \le \exp\left(-\frac{t^2}{2(a\mathbf{E}Z + b + c_- t)}\right),$$

where c_− = (3a − 1)/6.

the isoperimetric view

Let X = (X1, . . . , Xn) have independent components, taking values in 𝒳^n. Let A ⊂ 𝒳^n. The Hamming distance of X to A is

$$d(X, A) = \min_{y\in A} d(X, y) = \min_{y\in A}\sum_{i=1}^n \mathbb{1}_{\{X_i\neq y_i\}}.$$

Then

$$\mathbf{P}\left\{d(X, A) \ge t + \sqrt{\frac{n}{2}\log\frac{1}{\mathbf{P}\{A\}}}\right\} \le e^{-2t^2/n}.$$

Concentration of measure!

Michel Talagrand

the isoperimetric view

Proof: By the bounded differences inequality,

$$\mathbf{P}\{\mathbf{E}d(X,A) - d(X,A) \ge t\} \le e^{-2t^2/n}.$$

Taking t = Ed(X, A) and noting that d(X, A) = 0 on the event {X ∈ A}, we get P{A} ≤ e^{−2(Ed(X,A))²/n}, that is,

$$\mathbf{E}d(X,A) \le \sqrt{\frac{n}{2}\log\frac{1}{\mathbf{P}\{A\}}}.$$

By the bounded differences inequality again,

$$\mathbf{P}\left\{d(X,A) \ge t + \sqrt{\frac{n}{2}\log\frac{1}{\mathbf{P}\{A\}}}\right\} \le e^{-2t^2/n}.$$
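
A simulation sketch (not part of the slides; numpy and the particular choice of A are assumptions): for the uniform distribution on {0, 1}^n and A = {x : ∑i xi ≤ n/2}, the Hamming distance to A is max(0, ∑i Xi − n/2), and the measure concentrates around A as the inequality predicts.

```python
# Sketch: concentration of the Hamming distance d(X, A) for A = {x : sum(x) <= n/2}.
import numpy as np

rng = np.random.default_rng(10)
n, reps, t = 100, 200_000, 10.0
S = rng.integers(0, 2, size=(reps, n), dtype=np.uint8).sum(axis=1)
d = np.maximum(0, S - n // 2)                     # Hamming distance of X to A
pA = np.mean(d == 0)                              # empirical P{A}
threshold = t + np.sqrt((n / 2) * np.log(1 / pA))
print("empirical P{d(X,A) >= t + sqrt((n/2) log(1/P{A}))}:", np.mean(d >= threshold))
print("bound exp(-2 t^2 / n):", np.exp(-2 * t**2 / n))
```
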
talagrand’s convex distance

The weighted Hamming distance is

$$d_\alpha(x, A) = \inf_{y\in A} d_\alpha(x, y) = \inf_{y\in A}\sum_{i: x_i\neq y_i} |\alpha_i|,$$

where α = (α1, . . . , αn). The same argument as before gives

$$\mathbf{P}\left\{d_\alpha(X, A) \ge t + \sqrt{\frac{\|\alpha\|^2}{2}\log\frac{1}{\mathbf{P}\{A\}}}\right\} \le e^{-2t^2/\|\alpha\|^2}.$$

This implies

$$\sup_{\alpha:\|\alpha\|=1}\min\Bigl(\mathbf{P}\{A\},\; \mathbf{P}\bigl\{d_\alpha(X, A) \ge t\bigr\}\Bigr) \le e^{-t^2/2}.$$

convex distance inequality

The convex distance is

$$d_T(x, A) = \sup_{\alpha\in[0,\infty)^n:\,\|\alpha\|=1} d_\alpha(x, A).$$

Convex distance inequality:

$$\mathbf{P}\{A\}\,\mathbf{P}\{d_T(X, A) \ge t\} \le e^{-t^2/4}.$$

It follows from the fact that d_T(X, A)² is weakly (4, 0)-self-bounding (by a saddle point representation of d_T). Talagrand’s original proof was different.

convex lipschitz functions

For A ⊂ [0, 1]^n and x ∈ [0, 1]^n, define

$$D(x, A) = \inf_{y\in A}\|x - y\|.$$

If A is convex, then D(x, A) ≤ d_T(x, A).

Proof: Writing 𝓜(A) for the set of probability measures supported on A,

$$D(x, A) = \inf_{\nu\in\mathcal{M}(A)}\|x - \mathbf{E}_\nu Y\| \quad\text{(since A is convex)}$$

$$\le \inf_{\nu\in\mathcal{M}(A)}\sqrt{\sum_{j=1}^n \bigl(\mathbf{E}_\nu \mathbb{1}_{\{x_j\neq Y_j\}}\bigr)^2} \quad\text{(since } x_j, Y_j \in [0,1]\text{)}$$

$$= \inf_{\nu\in\mathcal{M}(A)}\sup_{\alpha:\|\alpha\|\le 1}\,\sum_{j=1}^n \alpha_j\,\mathbf{E}_\nu \mathbb{1}_{\{x_j\neq Y_j\}} \quad\text{(by Cauchy-Schwarz)}$$

$$= d_T(x, A) \quad\text{(by a minimax theorem)}.$$

convex lipschitz functions

Let X = (X1, . . . , Xn) have independent components taking values in [0, 1]. Let f : [0, 1]^n → ℝ be quasi-convex such that |f(x) − f(y)| ≤ ‖x − y‖. Then

$$\mathbf{P}\{f(X) > \mathbf{M}f(X) + t\} \le 2e^{-t^2/4} \qquad\text{and}\qquad \mathbf{P}\{f(X) < \mathbf{M}f(X) - t\} \le 2e^{-t^2/4},$$

where Mf(X) denotes a median of f(X).

Proof: Let As = {x : f(x) ≤ s} ⊂ [0, 1]^n; As is convex since f is quasi-convex. Since f is Lipschitz,

$$f(x) \le s + D(x, A_s) \le s + d_T(x, A_s).$$

By the convex distance inequality,

$$\mathbf{P}\{f(X) \ge s + t\}\,\mathbf{P}\{f(X) \le s\} \le e^{-t^2/4}.$$

Take s = Mf(X) for the upper tail and s = Mf(X) − t for the lower tail.
