 
Gram Matrix Estimation in High Dimension
Ilaria Giulini
INRIA (project CLASSIC), Département de Mathématiques et Applications, ENS, 45 rue d'Ulm, 75005 Paris
Joint work with Olivier Catoni
Journée DIM RDM-IdF 2013, 12 September 2013
General Setting

Let $P \in \mathcal{M}_+^1(\mathbb{R}^d)$. The Gram matrix is
\[ G = \int x x^\top \, dP(x). \]
Estimating $G$ is equivalent to estimating
\[ N(\theta) = \int \langle x, \theta \rangle^2 \, dP(x), \]
since $N(\theta) = \theta^\top G \theta$.

$P$ is unknown; $X_1, \dots, X_n \in \mathbb{R}^d$ is an i.i.d. sample drawn from $P$.

Goal: estimate $N(\theta)$ for every $\theta \in \mathbb{R}^d$ from the sample.
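Assumption:
\[ \operatorname{Tr}(G) = \int \|x\|^2 \, dP(x) < +\infty. \]
Our goal: estimate
\[ N(\theta) = \int \langle \theta, x \rangle^2 \, dP(x), \]
that is, build $\hat{N}$ (depending on $X_1, \dots, X_n$) such that, with probability at least $1 - \epsilon$, for any $\theta \in \mathbb{R}^d$,
\[ \left| N(\theta) - \hat{N}(\theta) \right| \leq \eta(n, \theta, \epsilon), \]
where $\eta(n, \theta, \epsilon) \to 0$ as $n \to \infty$.

Techniques: PAC-Bayesian bounds.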
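As a point of comparison for the goal above, the following minimal sketch computes the plain empirical counterpart of $N(\theta)$. It is not the robust estimator $\hat{N}$ of the talk; it only illustrates the identities $N(\theta) = \theta^\top G \theta$ and $\operatorname{Tr}(G) = \int \|x\|^2 dP(x)$ on a sample. The function names and the Gaussian test data are assumptions made for illustration.

```python
import numpy as np

def empirical_gram(X):
    """Empirical Gram matrix (1/n) * sum_i X_i X_i^T for a sample X of shape (n, d)."""
    n = X.shape[0]
    return X.T @ X / n

def quadratic_form(G, theta):
    """Quadratic form theta^T G theta."""
    return float(theta @ G @ theta)

# Illustrative data only: a standard Gaussian sample (an assumption, not from the talk).
rng = np.random.default_rng(0)
n, d = 500, 4
X = rng.normal(size=(n, d))
theta = rng.normal(size=d)

G_hat = empirical_gram(X)

# Direct empirical average of <X_i, theta>^2; it coincides with theta^T G_hat theta.
direct = float(np.mean((X @ theta) ** 2))
via_gram = quadratic_form(G_hat, theta)

# The trace of the empirical Gram matrix equals the average squared norm.
trace_gram = float(np.trace(G_hat))
mean_sq_norm = float(np.mean(np.sum(X ** 2, axis=1)))
```

Both pairs of quantities agree up to floating-point error, which is the empirical version of the equivalence between estimating $G$ and estimating $N(\theta)$ for every $\theta$.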
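Dimension Dependent Bound

Let
\[ \kappa = \sup_{\theta \neq 0} \frac{\int \langle \theta, x \rangle^4 \, dP(x)}{\left( \int \langle \theta, x \rangle^2 \, dP(x) \right)^2} < +\infty. \]
For any $\epsilon > 0$ and any sample size $n$ above an explicit threshold depending on $\kappa$, $d$ and $\log(\epsilon^{-1})$, with probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
\[ \left| \hat{N}(\theta) - N(\theta) \right| \leq N(\theta) \, \frac{\mu}{1 - 3\mu}, \tag{1} \]
where
\[ \mu = \sqrt{\frac{2(\kappa - 1)}{n} \left( \log(\epsilon^{-1}) + 1.11\, d \right)} + \sqrt{\frac{2\kappa \times 89\, d}{n}}. \]

Remark: $\operatorname{Var}\left( \langle \theta, X \rangle^2 \right) \sim (\kappa - 1)\, N(\theta)^2$.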
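To make the kurtosis coefficient concrete, here is a small sketch that estimates the ratio defining $\kappa$ in a single fixed direction $\theta$ (the slide takes the supremum over all directions). The helper name `directional_kurtosis` and the Gaussian sample are assumptions for illustration; for a centered Gaussian the ratio equals 3 in every direction.

```python
import numpy as np

def directional_kurtosis(X, theta):
    """Empirical ratio E<theta,X>^4 / (E<theta,X>^2)^2 for one direction theta."""
    s = (X @ theta) ** 2          # squared projections <X_i, theta>^2
    m2 = s.mean()                 # empirical second moment of the projection
    m4 = (s ** 2).mean()          # empirical fourth moment of the projection
    return float(m4 / m2 ** 2)

# Illustrative data only: centered Gaussian sample (assumption, not from the talk).
rng = np.random.default_rng(0)
X = rng.normal(size=(200_000, 3))
theta = np.array([1.0, -2.0, 0.5])

kappa_theta = directional_kurtosis(X, theta)

# Empirical version of Var(<theta,X>^2) = (kappa_theta - 1) * N(theta)^2.
s = (X @ theta) ** 2
var_s = float(s.var())
identity_rhs = (kappa_theta - 1.0) * float(s.mean()) ** 2
```

The last two quantities agree exactly (up to rounding) for the empirical moments, which is the directional form of the remark relating the variance of $\langle \theta, X \rangle^2$ to $\kappa - 1$.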
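Dimension-free Bound

With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$, the same estimator $\hat{N}$ is such that
\[ \mathbb{1}\{4\mu < 1\} \left| \frac{\hat{N}(\theta)}{N(\theta)} - 1 \right| \leq \frac{\mu}{1 - 4\mu}, \]
where, for $n < 10^{20}$,
\[ \mu = \sqrt{\frac{2.07\,(\kappa - 1)}{n} \left( \log(\epsilon^{-1}) + 4.3 + 1.6 \times \frac{\|\theta\|^2 \operatorname{Tr}(G)}{N(\theta)} \right)} + \sqrt{\frac{2\kappa}{n} \times 92\, \frac{\|\theta\|^2 \operatorname{Tr}(G)}{N(\theta)}}. \]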
Remark

Let $\theta_i$, $i = 1, \dots, d$, be an orthonormal basis. Then
\[ \operatorname{Tr}(G) = \int \|x\|^2 \, dP(x) = \sum_{i=1}^{d} N(\theta_i). \]
If the energy is equally distributed, that is, $N(\theta_i) = N(\theta)$ for any $i = 1, \dots, d$, then
\[ \frac{\operatorname{Tr}(G)}{N(\theta)} = \frac{\sum_{i=1}^{d} N(\theta_i)}{N(\theta)} = \frac{d\, N(\theta)}{N(\theta)} = d. \]
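The ratio in this remark can be checked numerically. The sketch below computes $\|\theta\|^2 \operatorname{Tr}(G) / N(\theta)$ for two Gram matrices chosen purely for illustration (both are assumptions, not examples from the talk): the identity, where the energy is equally distributed, and a diagonal matrix with eigenvalues $1/k^2$. The name `effective_dimension` is a hypothetical label for this ratio.

```python
import numpy as np

def effective_dimension(G, theta):
    """Ratio ||theta||^2 * Tr(G) / (theta^T G theta)."""
    theta = np.asarray(theta, dtype=float)
    return float(np.dot(theta, theta) * np.trace(G) / (theta @ G @ theta))

d = 100
e1 = np.zeros(d)
e1[0] = 1.0  # first vector of the canonical orthonormal basis

# Equally distributed energy: G = identity, so the ratio equals d.
iso = effective_dimension(np.eye(d), e1)

# Decaying spectrum: eigenvalues 1/k^2, so the ratio in direction e1
# is sum_k 1/k^2, which stays below pi^2/6 whatever the value of d.
G_decay = np.diag(1.0 / np.arange(1, d + 1) ** 2)
decaying = effective_dimension(G_decay, e1)
```

In the isotropic case the ratio is the ambient dimension, while with a fast-decaying spectrum it remains bounded as $d$ grows, which is the situation where a bound expressed through $\operatorname{Tr}(G)/N(\theta)$ is more informative than one expressed through $d$.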
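PAC-Bayesian approach (D. McAllester; O. Catoni (2012))

Let $X_1, \dots, X_n \sim P$ be an i.i.d. sample and let $\nu \in \mathcal{M}_+^1(\Theta)$ be a prior probability measure. For any $f$ and any posterior $\rho \in \mathcal{M}_+^1(\Theta)$ such that $K(\rho, \nu) < +\infty$,
\[ \mathbb{P}\left( \int \frac{1}{n} \sum_{i=1}^{n} \log\left( 1 + f(X_i, \theta', \lambda) \right) d\rho(\theta') \leq \int\!\!\int f(x, \theta', \lambda) \, dP(x) \, d\rho(\theta') + \frac{K(\rho, \nu) + \log(\epsilon^{-1})}{n} \right) \geq 1 - \epsilon, \]
where the Kullback divergence of $\rho$ with respect to $\nu$ is
\[ K(\rho, \nu) = \begin{cases} \displaystyle \int \log\left( \frac{d\rho}{d\nu} \right) d\rho & \text{if } \rho \ll \nu, \\ +\infty & \text{otherwise.} \end{cases} \]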
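The Kullback divergence and the complexity term $(K(\rho, \nu) + \log(\epsilon^{-1}))/n$ can be illustrated on a finite parameter set. This is a minimal sketch under that simplifying assumption (the talk works with measures on $\Theta$, not probability vectors); the function names are hypothetical.

```python
import math

def kl_divergence(rho, nu):
    """K(rho, nu) for two probability vectors on the same finite set."""
    total = 0.0
    for r, v in zip(rho, nu):
        if r == 0.0:
            continue  # convention: 0 * log(0 / v) = 0
        if v == 0.0:
            return math.inf  # rho is not absolutely continuous w.r.t. nu
        total += r * math.log(r / v)
    return total

def complexity_term(kl, eps, n):
    """Penalty (K(rho, nu) + log(1 / eps)) / n appearing in the bound."""
    return (kl + math.log(1.0 / eps)) / n

uniform = [0.5, 0.5]
point_mass = [1.0, 0.0]

kl_same = kl_divergence(uniform, uniform)        # identical measures
kl_point = kl_divergence(point_mass, uniform)    # equals log 2
kl_infinite = kl_divergence(uniform, point_mass) # no absolute continuity
penalty = complexity_term(kl_point, eps=0.05, n=1000)
```

A posterior concentrated far from the prior pays a larger divergence, and the penalty shrinks at rate $1/n$ for fixed $\rho$ and $\epsilon$.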
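1. With probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
\[ \hat{B}_-(\theta) \leq N(\theta) \leq \hat{B}_+(\theta). \]
2. Definition of $\hat{N}$:
\[ \hat{N}(\theta) = \frac{\hat{B}_+(\theta) + \hat{B}_-(\theta)}{2}. \]
3. Results: with probability at least $1 - 2\epsilon$, for any $\theta \in \mathbb{R}^d$,
\[ \left| N(\theta) - \hat{N}(\theta) \right| \leq \frac{\hat{B}_+(\theta) - \hat{B}_-(\theta)}{2}. \]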
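The three steps above amount to taking the midpoint of a confidence interval. The sketch below shows only this last step, assuming the two bounds $\hat{B}_-(\theta)$ and $\hat{B}_+(\theta)$ are already available as numbers; how they are computed from the sample is not covered here, and the function name and the numeric values are hypothetical.

```python
def midpoint_estimate(b_minus, b_plus):
    """Return (N_hat, half_width) from a lower and an upper confidence bound."""
    if b_minus > b_plus:
        raise ValueError("lower bound must not exceed upper bound")
    n_hat = (b_plus + b_minus) / 2.0       # definition of the estimator
    half_width = (b_plus - b_minus) / 2.0  # guaranteed error on the same event
    return n_hat, half_width

# Hypothetical bounds for one direction theta (illustrative numbers only).
n_hat, half_width = midpoint_estimate(0.8, 1.4)

# Any value of N(theta) inside [B_-, B_+] is within half_width of n_hat.
candidates = [0.8, 0.95, 1.1, 1.25, 1.4]
errors = [abs(value - n_hat) for value in candidates]
```

Whenever $N(\theta)$ lies between the two bounds, its distance to the midpoint cannot exceed half the interval length, which is exactly the inequality of step 3.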
Work in progress

- Dimension-free bounds for the quadratic form associated with the empirical Gram matrix
\[ \hat{G} = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^\top. \]
- Stability of algorithms for spectral clustering (PCA).
Bibliography

O. Catoni. Estimating the Gram matrix through PAC-Bayes bounds. Preprint.
O. Catoni. Challenging the empirical mean and empirical variance: a deviation study. Ann. Inst. H. Poincaré Probab. Statist., Vol. 48, No. 4 (2012).
G. Biau, A. Mas. PCA-Kernel Estimation. Stat. Risk. Model., 29, No. 1 (2012).
J. Langford, J. Shawe-Taylor. PAC-Bayes & Margins. Advances in Neural Information Processing Systems (2002).
D. McAllester. Simplified PAC-Bayesian margin bounds. In COLT (2003).