
Lecture 3: Dependence measures using RKHS embeddings
MLSS Tübingen, 2015
Arthur Gretton, Gatsby Unit, CSML, UCL
Outline: three or more variable interactions, comparison with conditional dependence testing [Sejdinovic et al., 2013a]


  1. Total independence test
• Total independence test: $H_0: P_{XYZ} = P_X P_Y P_Z$ vs. $H_1: P_{XYZ} \neq P_X P_Y P_Z$.
• For $(X_1, \ldots, X_D) \sim P_X$, and $\kappa = \bigotimes_{i=1}^D k^{(i)}$:
$$\Delta_{\mathrm{tot}}\,\hat P \;=\; \left\| \mu_\kappa\big(\hat P_X\big) - \mu_\kappa\Big(\prod_{i=1}^D \hat P_{X_i}\Big) \right\|^2 \;=\; \frac{1}{n^2}\sum_{a=1}^n\sum_{b=1}^n \prod_{i=1}^D K^{(i)}_{ab} \;-\; \frac{2}{n^{D+1}}\sum_{a=1}^n \prod_{i=1}^D \sum_{b=1}^n K^{(i)}_{ab} \;+\; \frac{1}{n^{2D}} \prod_{i=1}^D \sum_{a=1}^n \sum_{b=1}^n K^{(i)}_{ab}.$$
• Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: a similar relationship to that between dCov and HSIC [Sejdinovic et al., 2013b].
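
A minimal numerical sketch of this statistic (not from the slides): assuming Gaussian kernels $k^{(i)}$ on each coordinate and one column of the data matrix per variable, the three terms can be read off directly from the per-variable kernel matrices $K^{(i)}$. The function names and the bandwidth choice below are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def gaussian_kernel(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel on a 1-D sample x of length n."""
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

def total_independence_stat(X, sigma=1.0):
    """Empirical Delta_tot: squared RKHS distance between the joint embedding
    and the product-of-marginals embedding, for the columns X[:, 0..D-1]."""
    n, D = X.shape
    Ks = [gaussian_kernel(X[:, i], sigma) for i in range(D)]
    joint = np.mean(np.prod(Ks, axis=0))                             # (1/n^2) sum_ab prod_i K_ab
    cross = np.mean(np.prod([K.mean(axis=1) for K in Ks], axis=0))   # (1/n^{D+1}) sum_a prod_i sum_b K_ab
    marg = np.prod([K.mean() for K in Ks])                           # (1/n^{2D}) prod_i sum_ab K_ab
    return joint - 2 * cross + marg

# Example: three independent variables give a value close to zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
print(total_independence_stat(X))
```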

  2. Example B: total independence tests
[Figure 4: Total independence test on Dataset B, null acceptance rate (Type II error) vs. dimension (1 to 19) for the $\Delta_L$ and $\Delta_{\mathrm{tot}}$ total independence statistics, $n = 500$.]

  3. Kernel dependence measures - in detail

  4. MMD for independence: HSIC
[Slide shows two pet-breed text descriptions as example paired data; text from dogtime.com and petfinder.com.]
Empirical $\mathrm{HSIC}(\hat P_{XY}, \hat P_X \hat P_Y) = \frac{1}{n^2}\,(HKH \circ HLH)_{++}$ (elementwise product, then sum over all entries).
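
A minimal sketch of this biased estimator (my own illustrative function names, with Gaussian kernels assumed for $k$ and $l$):

```python
import numpy as np

def gaussian_gram(Z, sigma=1.0):
    """Gram matrix of a Gaussian kernel for the rows of Z (n x d)."""
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: (1/n^2) * sum of entries of (HKH) o (HLH)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    return np.sum((H @ K @ H) * (H @ L @ H)) / n ** 2

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 1))
Y_dep = X + 0.1 * rng.standard_normal((300, 1))   # dependent pair: HSIC clearly nonzero
Y_ind = rng.standard_normal((300, 1))             # independent pair: HSIC near zero
print(hsic_biased(X, Y_dep), hsic_biased(X, Y_ind))
```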

  5. Covariance to reveal dependence
A more intuitive idea: maximize the covariance of smooth mappings:
$$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\ \|g\|_G = 1} \big( \mathbf{E}_{x,y}[f(x)g(y)] - \mathbf{E}_x[f(x)]\,\mathbf{E}_y[g(y)] \big)$$

  6. Covariance to reveal dependence
A more intuitive idea: maximize the covariance of smooth mappings:
$$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\ \|g\|_G = 1} \big( \mathbf{E}_{x,y}[f(x)g(y)] - \mathbf{E}_x[f(x)]\,\mathbf{E}_y[g(y)] \big)$$
[Scatter plot of a sample $(X, Y)$ that is clearly dependent yet has linear correlation $-0.00$.]

  7. Covariance to reveal dependence
A more intuitive idea: maximize the covariance of smooth mappings:
$$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\ \|g\|_G = 1} \big( \mathbf{E}_{x,y}[f(x)g(y)] - \mathbf{E}_x[f(x)]\,\mathbf{E}_y[g(y)] \big)$$
[Figure: the sample scatter (correlation $-0.00$) alongside the dependence witness functions $f(x)$ and $g(y)$.]

  8. Covariance to reveal dependence
A more intuitive idea: maximize the covariance of smooth mappings:
$$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\ \|g\|_G = 1} \big( \mathbf{E}_{x,y}[f(x)g(y)] - \mathbf{E}_x[f(x)]\,\mathbf{E}_y[g(y)] \big)$$
[Figure: sample scatter (correlation $-0.00$), dependence witnesses $f(x)$, $g(y)$, and the scatter of $g(Y)$ vs. $f(X)$: correlation $-0.90$, COCO $0.14$.]

  9. Covariance to reveal dependence
A more intuitive idea: maximize the covariance of smooth mappings:
$$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\ \|g\|_G = 1} \big( \mathbf{E}_{x,y}[f(x)g(y)] - \mathbf{E}_x[f(x)]\,\mathbf{E}_y[g(y)] \big)$$
[Figure: as on the previous slide; $g(Y)$ vs. $f(X)$ has correlation $-0.90$, COCO $0.14$.]
How do we define covariance in (infinite) feature spaces?

  10. Covariance to reveal dependence
Covariance in RKHS: let's first look at the finite linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?

  11. Covariance to reveal dependence
Covariance in RKHS: let's first look at the finite linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?
Compute their covariance matrix (ignore centering): $C_{xy} = \mathbf{E}\big[x y^\top\big]$.
How to get a single "summary" number?

  12. Covariance to reveal dependence
Covariance in RKHS: let's first look at the finite linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?
Compute their covariance matrix (ignore centering): $C_{xy} = \mathbf{E}\big[x y^\top\big]$.
How to get a single "summary" number? Solve for vectors $f \in \mathbb{R}^d$, $g \in \mathbb{R}^{d'}$:
$$\operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} f^\top C_{xy}\, g \;=\; \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} \mathbf{E}_{xy}\big[(f^\top x)(g^\top y)\big] \;=\; \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} \mathbf{E}_{x,y}[f(x)g(y)] \;=\; \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} \operatorname{cov}\big(f(x), g(y)\big)$$
(the maximum value attained is the largest singular value of $C_{xy}$).
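
A sketch of this finite linear case (all numeric choices below, including the noise level and the sample size, are illustrative): the maximizing directions $f$, $g$ are the top singular vectors of the empirical covariance matrix, and the maximum itself is its largest singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d2 = 2000, 3, 4
X = rng.standard_normal((n, d))
A = rng.standard_normal((d2, d))
Y = X @ A.T + 0.5 * rng.standard_normal((n, d2))    # y linearly dependent on x

# Empirical covariance matrix C_xy (centering included here for realism)
Xc = X - X.mean(axis=0)
Yc = Y - Y.mean(axis=0)
C_xy = Xc.T @ Yc / n                                 # d x d' matrix

# The maximizer over unit vectors f, g is the top singular pair of C_xy,
# and the maximum value is the largest singular value.
U, s, Vt = np.linalg.svd(C_xy)
f, g, linear_coco = U[:, 0], Vt[0], s[0]
print("largest singular value (linear 'COCO'):", linear_coco)
print("check: f^T C_xy g =", f @ C_xy @ g)
```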

  13. Challenges in defining feature space covariance
Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
Challenge 1: can we define a feature space analog to $x y^\top$? YES:
• Given $f \in \mathbb{R}^d$, $g \in \mathbb{R}^{d'}$, $h \in \mathbb{R}^{d'}$, define the matrix $f g^\top$ such that $(f g^\top) h = f (g^\top h)$.
• Given $f \in F$, $g \in G$, $h \in G$, define the tensor product operator $f \otimes g$ such that $(f \otimes g) h = f \langle g, h \rangle_G$.
• Now just set $f := \varphi(x)$, $g := \psi(y)$, to get $x y^\top \to \varphi(x) \otimes \psi(y)$.

  14. Challenges in defining feature space covariance
Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
Challenge 2: does a covariance "matrix" (operator) in feature space exist? I.e., is there some $C_{XY}: G \to F$ such that
$$\langle f, C_{XY}\, g \rangle_F = \mathbf{E}_{x,y}[f(x)\, g(y)] = \operatorname{cov}\big(f(x), g(y)\big)$$
(centering still ignored, as before)?

  15. Challenges in defining feature space covariance
Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
Challenge 2: does a covariance "matrix" (operator) in feature space exist? I.e., is there some $C_{XY}: G \to F$ such that
$$\langle f, C_{XY}\, g \rangle_F = \mathbf{E}_{x,y}[f(x)\, g(y)] = \operatorname{cov}\big(f(x), g(y)\big)$$
YES: via a Bochner integrability argument (as with the mean embedding). Under the condition $\mathbf{E}_{x,y}\sqrt{k(x,x)\, l(y,y)} < \infty$, we can define
$$C_{XY} := \mathbf{E}_{x,y}\big[\varphi(x) \otimes \psi(y)\big],$$
which is a Hilbert-Schmidt operator (its sum of squared singular values is finite).

  16. REMINDER: functions revealing dependence
$$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\ \|g\|_G = 1} \big( \mathbf{E}_{x,y}[f(x)g(y)] - \mathbf{E}_x[f(x)]\,\mathbf{E}_y[g(y)] \big)$$
[Figure: dependence witnesses $f(x)$, $g(y)$; $g(Y)$ vs. $f(X)$ has correlation $-0.90$, COCO $0.14$.]
How do we compute this from finite data?

  17. Empirical covariance operator
The empirical covariance given $z := (x_i, y_i)_{i=1}^n$ (now include centering):
$$\hat{C}_{XY} := \frac{1}{n}\sum_{i=1}^n \varphi(x_i) \otimes \psi(y_i) \;-\; \hat{\mu}_x \otimes \hat{\mu}_y, \qquad \text{where } \hat{\mu}_x := \frac{1}{n}\sum_{i=1}^n \varphi(x_i).$$
More concisely,
$$\hat{C}_{XY} = \frac{1}{n}\, X H Y^\top, \qquad H = I_n - n^{-1}\mathbf{1}_n,$$
where $\mathbf{1}_n$ is an $n \times n$ matrix of ones, and
$$X = \big[\varphi(x_1)\ \ldots\ \varphi(x_n)\big], \qquad Y = \big[\psi(y_1)\ \ldots\ \psi(y_n)\big].$$
Define the kernel matrices $K = X^\top X$ with $K_{ij} = k(x_i, x_j)$, and $L_{ij} = l(y_i, y_j)$.

  18. Functions revealing dependence
Optimization problem:
$$\mathrm{COCO}(z; F, G) := \max_{\|f\|_F \le 1,\ \|g\|_G \le 1} \big\langle f, \hat{C}_{XY}\, g \big\rangle_F$$
Assume
$$f = \sum_{i=1}^n \alpha_i \big[\varphi(x_i) - \hat{\mu}_x\big] = X H \alpha, \qquad g = \sum_{j=1}^n \beta_j \big[\psi(y_j) - \hat{\mu}_y\big] = Y H \beta.$$
The associated Lagrangian is
$$L(f, g, \lambda, \gamma) = \big\langle f, \hat{C}_{XY}\, g \big\rangle_F - \frac{\lambda}{2}\big(\|f\|_F^2 - 1\big) - \frac{\gamma}{2}\big(\|g\|_G^2 - 1\big).$$

  19. Covariance to reveal dependence
• The empirical $\mathrm{COCO}(z; F, G)$ is the largest eigenvalue $\gamma$ of the generalized eigenvalue problem
$$\begin{bmatrix} 0 & \frac{1}{n}\tilde{K}\tilde{L} \\[2pt] \frac{1}{n}\tilde{L}\tilde{K} & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \gamma \begin{bmatrix} \tilde{K} & 0 \\ 0 & \tilde{L} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$$
• $\tilde{K}$ and $\tilde{L}$ are matrices of inner products between centred observations in the respective feature spaces:
$$\tilde{K} = HKH, \qquad H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top.$$

  20. Covariance to reveal dependence
• The empirical $\mathrm{COCO}(z; F, G)$ is the largest eigenvalue $\gamma$ of the generalized eigenvalue problem
$$\begin{bmatrix} 0 & \frac{1}{n}\tilde{K}\tilde{L} \\[2pt] \frac{1}{n}\tilde{L}\tilde{K} & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \gamma \begin{bmatrix} \tilde{K} & 0 \\ 0 & \tilde{L} \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$$
• $\tilde{K}$ and $\tilde{L}$ are matrices of inner products between centred observations in the respective feature spaces:
$$\tilde{K} = HKH, \qquad H = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top.$$
• Mapping function for $x$:
$$f(x) = \sum_{i=1}^n \alpha_i \Big( k(x_i, x) - \frac{1}{n}\sum_{j=1}^n k(x_j, x) \Big)$$
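
A sketch of solving this generalized eigenvalue problem numerically. The small ridge added to the right-hand side matrix is my own numerical safeguard (the centred Gram matrices are singular); the kernel choices and toy data below are likewise illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def coco_empirical(K, L):
    """Empirical COCO: largest generalized eigenvalue of the block problem
    above, built from the centred Gram matrices Kt = HKH, Lt = HLH."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kt, Lt = H @ K @ H, H @ L @ H
    Z = np.zeros((n, n))
    A = np.block([[Z, Kt @ Lt / n], [Lt @ Kt / n, Z]])           # left-hand side
    B = np.block([[Kt, Z], [Z, Lt]]) + 1e-8 * np.eye(2 * n)      # right-hand side + ridge
    gammas = eigh(A, B, eigvals_only=True)                       # ascending order
    return gammas[-1]

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = np.sin(x) + 0.3 * rng.standard_normal(200)     # nonlinear dependence
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)    # Gaussian Gram matrices
L = np.exp(-(y[:, None] - y[None, :]) ** 2 / 2)
print(coco_empirical(K, L))
```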

  21. Hard-to-detect dependence
The density takes the form $P_{x,y} \propto 1 + \sin(\omega x)\sin(\omega y)$.
[Figure: a smooth density and a rough density of this form, with scatter plots of 500 samples from each.]
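
A rejection-sampling sketch for reproducing data of this kind. The support (a square of half-width $\pi$) and the scaling are my assumptions; the lecture does not specify them here.

```python
import numpy as np

def sample_sinusoid(n, omega, half_width=np.pi, seed=0):
    """Rejection sampling from p(x, y) proportional to 1 + sin(omega*x)*sin(omega*y)
    on the square [-half_width, half_width]^2 (support is an assumption)."""
    rng = np.random.default_rng(seed)
    out = []
    while len(out) < n:
        x, y = rng.uniform(-half_width, half_width, size=2)
        u = rng.uniform(0, 2)                        # the unnormalized density is bounded by 2
        if u < 1 + np.sin(omega * x) * np.sin(omega * y):
            out.append((x, y))
    return np.array(out)

samples = sample_sinusoid(500, omega=2)
print(samples.shape, np.corrcoef(samples.T)[0, 1])   # linear correlation shrinks as omega grows
```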

  22. Hard-to-detect dependence
• Example: sinusoids of increasing frequency.
[Figure: sample scatter plots for $\omega = 1, \ldots, 6$, and a plot of COCO (empirical average, 1500 samples) against the frequency of the non-constant density component, $\omega = 1, \ldots, 7$; COCO decreases as $\omega$ grows.]

  23. Hard-to-detect dependence COCO vs frequency of perturbation from independence.

  24. Hard-to-detect dependence
COCO vs frequency of perturbation from independence. Case of $\omega = 1$.
[Figure: dependence witnesses $f(x)$, $g(y)$; sample scatter (correlation $0.27$); $g(Y)$ vs. $f(X)$ scatter: correlation $-0.50$, COCO $0.09$.]

  25. Hard-to-detect dependence
COCO vs frequency of perturbation from independence. Case of $\omega = 2$.
[Figure: dependence witnesses $f(x)$, $g(y)$; sample scatter (correlation $0.04$); $g(Y)$ vs. $f(X)$ scatter: correlation $0.51$, COCO $0.07$.]

  26. Hard-to-detect dependence
COCO vs frequency of perturbation from independence. Case of $\omega = 3$.
[Figure: dependence witnesses $f(x)$, $g(y)$; sample scatter (correlation $0.03$); $g(Y)$ vs. $f(X)$ scatter: correlation $-0.45$, COCO $0.03$.]

  27. Hard-to-detect dependence
COCO vs frequency of perturbation from independence. Case of $\omega = 4$.
[Figure: dependence witnesses $f(x)$, $g(y)$; sample scatter (correlation $0.03$); $g(Y)$ vs. $f(X)$ scatter: correlation $0.21$, COCO $0.02$.]

  28. Hard-to-detect dependence
COCO vs frequency of perturbation from independence. Case of $\omega = ??$
[Figure: dependence witnesses $f(x)$, $g(y)$; sample scatter (correlation $0.00$); $g(Y)$ vs. $f(X)$ scatter: correlation $-0.13$, COCO $0.02$.]

  29. Hard-to-detect dependence
COCO vs frequency of perturbation from independence. Case of uniform noise! This bias will decrease with increasing sample size.
[Figure: dependence witnesses $f(x)$, $g(y)$; sample scatter (correlation $0.00$); $g(Y)$ vs. $f(X)$ scatter: correlation $-0.13$, COCO $0.02$.]

  30. Hard-to-detect dependence
COCO vs frequency of perturbation from independence.
• As dependence is encoded at higher frequencies, the smooth mappings $f, g$ achieve lower linear covariance.
• Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be induced by $f, g$ (bias).
• This bias will decrease with increasing sample size.

  31. More functions revealing dependence • Can we do better than COCO?

  32. More functions revealing dependence
• Can we do better than COCO?
• A second example with zero correlation.
[Figure: first pair of dependence witnesses $f(x)$, $g(y)$; sample scatter (correlation $0$); $g(Y)$ vs. $f(X)$ scatter: correlation $-0.80$, COCO $0.11$.]

  33. More functions revealing dependence
• Can we do better than COCO?
• A second example with zero correlation.
[Figure: second pair of dependence witnesses $f_2(x)$, $g_2(y)$, and the sample scatter (correlation $0$).]

  34. More functions revealing dependence
• Can we do better than COCO?
• A second example with zero correlation.
[Figure: second pair of dependence witnesses $f_2(x)$, $g_2(y)$; $g_2(Y)$ vs. $f_2(X)$ scatter: correlation $-0.37$, $\mathrm{COCO}_2$: $0.06$.]

  35. Hilbert-Schmidt Independence Criterion
• Given $\gamma_i := \mathrm{COCO}_i(z; F, G)$, define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
$$\mathrm{HSIC}(z; F, G) := \sum_{i=1}^n \gamma_i^2$$

  36. Hilbert-Schmidt Independence Criterion
• Given $\gamma_i := \mathrm{COCO}_i(z; F, G)$, define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
$$\mathrm{HSIC}(z; F, G) := \sum_{i=1}^n \gamma_i^2$$
• In the limit of infinite samples:
$$\mathrm{HSIC}(P; F, G) := \|C_{xy}\|_{HS}^2 = \langle C_{xy}, C_{xy} \rangle_{HS} = \mathbf{E}_{x,x',y,y'}\big[k(x,x')\,l(y,y')\big] + \mathbf{E}_{x,x'}\big[k(x,x')\big]\,\mathbf{E}_{y,y'}\big[l(y,y')\big] - 2\,\mathbf{E}_{x,y}\Big[\mathbf{E}_{x'}\big[k(x,x')\big]\,\mathbf{E}_{y'}\big[l(y,y')\big]\Big]$$
($x'$ is an independent copy of $x$, $y'$ a copy of $y$).
– HSIC is identical to $\mathrm{MMD}^2(P_{XY}, P_X P_Y)$.
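
A small numerical check of the finite-sample definition (my own sketch, not from the slides): the squared empirical COCO values are the eigenvalues of $\frac{1}{n^2}\tilde K \tilde L$, a standard identity not stated on the slide, so their sum should match the biased HSIC estimator from slide 4. Kernels and toy data below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
y = x ** 2 + 0.2 * rng.standard_normal(200)          # nonlinearly dependent pair

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2)      # Gaussian Gram matrices
L = np.exp(-(y[:, None] - y[None, :]) ** 2 / 2)
n = len(x)
H = np.eye(n) - np.ones((n, n)) / n
Kt, Lt = H @ K @ H, H @ L @ H

hsic_biased = np.sum(Kt * Lt) / n ** 2               # (1/n^2)(HKH o HLH)_{++}
gamma_sq = np.linalg.eigvals(Kt @ Lt).real / n ** 2  # squared empirical COCO values
print(hsic_biased, gamma_sq.sum())                   # the two agree up to float error
```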

  37. When does HSIC determine independence?
Theorem: when the kernels $k$ and $l$ are each characteristic, then $\mathrm{HSIC} = 0$ iff $P_{x,y} = P_x P_y$ [Gretton, 2015].
This is weaker than the MMD condition (which requires a kernel characteristic on $X \times Y$ to distinguish $P_{x,y}$ from $Q_{x,y}$).

  38. Intuition: why characteristic needed on both X and Y
Question: wouldn't it be enough just to use a rich mapping from $X$ to $Y$, e.g. via ridge regression with a characteristic $F$:
$$f^* = \arg\min_{f \in F}\; \mathbf{E}_{XY}\Big[\big(Y - \langle f, \varphi(X) \rangle_F\big)^2\Big] + \lambda \|f\|_F^2\ ?$$

  39. Intuition: why characteristic needed on both X and Y
Question: wouldn't it be enough just to use a rich mapping from $X$ to $Y$, e.g. via ridge regression with a characteristic $F$:
$$f^* = \arg\min_{f \in F}\; \mathbf{E}_{XY}\Big[\big(Y - \langle f, \varphi(X) \rangle_F\big)^2\Big] + \lambda \|f\|_F^2\ ?$$
Counterexample: a density symmetric about the $x$-axis, s.t. $p(x, y) = p(x, -y)$.
[Figure: scatter plot of a sample from such a density; correlation $-0.00$.]

  40. Energy Distance and the MMD

  41. Energy distance and MMD
Distance between probability distributions:
Energy distance [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]:
$$D_E(P, Q) = 2\,\mathbf{E}_{P,Q}\|X - Y\|^q - \mathbf{E}_P\|X - X'\|^q - \mathbf{E}_Q\|Y - Y'\|^q, \qquad 0 < q \le 2.$$
Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012]:
$$\mathrm{MMD}^2(P, Q; F) = \mathbf{E}_P\, k(X, X') + \mathbf{E}_Q\, k(Y, Y') - 2\,\mathbf{E}_{P,Q}\, k(X, Y).$$

  42. Energy distance and MMD
Distance between probability distributions:
Energy distance [Baringhaus and Franz, 2004, Székely and Rizzo, 2004, 2005]:
$$D_E(P, Q) = 2\,\mathbf{E}_{P,Q}\|X - Y\|^q - \mathbf{E}_P\|X - X'\|^q - \mathbf{E}_Q\|Y - Y'\|^q, \qquad 0 < q \le 2.$$
Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012]:
$$\mathrm{MMD}^2(P, Q; F) = \mathbf{E}_P\, k(X, X') + \mathbf{E}_Q\, k(Y, Y') - 2\,\mathbf{E}_{P,Q}\, k(X, Y).$$
Energy distance is MMD with a particular kernel! [Sejdinovic et al., 2013b]
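
A numerical sketch of this equivalence for $q = 1$ (not from the slides): with the distance-induced kernel $k(z, z') = \|z - z_0\| + \|z' - z_0\| - \|z - z'\|$ defined later in the lecture, the plug-in MMD$^2$ and the plug-in energy distance coincide. The base point $z_0$, the plug-in (V-statistic) estimates, and the toy distributions are my choices.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 2))            # sample from P
Y = rng.standard_normal((300, 2)) + 0.5      # sample from Q (shifted mean)

# Energy distance (q = 1), plug-in estimate
Dxy, Dxx, Dyy = cdist(X, Y), cdist(X, X), cdist(Y, Y)
energy = 2 * Dxy.mean() - Dxx.mean() - Dyy.mean()

# MMD^2 with the distance-induced kernel k(z, z') = |z - z0| + |z' - z0| - |z - z'|
z0 = np.zeros((1, 2))                        # any fixed base point works; the origin is chosen here
def k_dist(A, B):
    return cdist(A, z0) + cdist(B, z0).T - cdist(A, B)
mmd2 = k_dist(X, X).mean() + k_dist(Y, Y).mean() - 2 * k_dist(X, Y).mean()

print(energy, mmd2)                          # the two agree up to float error
```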

  43. Distance covariance and HSIC
Distance covariance ($0 < q, r \le 2$) [Feuerverger, 1993, Székely et al., 2007]:
$$V^2(X, Y) = \mathbf{E}_{XY}\mathbf{E}_{X'Y'}\big[\|X - X'\|^q\,\|Y - Y'\|^r\big] + \mathbf{E}_X\mathbf{E}_{X'}\|X - X'\|^q\;\mathbf{E}_Y\mathbf{E}_{Y'}\|Y - Y'\|^r - 2\,\mathbf{E}_{XY}\big[\mathbf{E}_{X'}\|X - X'\|^q\;\mathbf{E}_{Y'}\|Y - Y'\|^r\big].$$
Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]:
Define an RKHS $F$ on $X$ with kernel $k$, and an RKHS $G$ on $Y$ with kernel $l$. Then
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) = \mathbf{E}_{XY}\mathbf{E}_{X'Y'}\,k(X, X')\,l(Y, Y') + \mathbf{E}_X\mathbf{E}_{X'}\,k(X, X')\;\mathbf{E}_Y\mathbf{E}_{Y'}\,l(Y, Y') - 2\,\mathbf{E}_{X'Y'}\big[\mathbf{E}_X\,k(X, X')\;\mathbf{E}_Y\,l(Y, Y')\big].$$

  44. Distance covariance and HSIC
Distance covariance ($0 < q, r \le 2$) [Feuerverger, 1993, Székely et al., 2007]:
$$V^2(X, Y) = \mathbf{E}_{XY}\mathbf{E}_{X'Y'}\big[\|X - X'\|^q\,\|Y - Y'\|^r\big] + \mathbf{E}_X\mathbf{E}_{X'}\|X - X'\|^q\;\mathbf{E}_Y\mathbf{E}_{Y'}\|Y - Y'\|^r - 2\,\mathbf{E}_{XY}\big[\mathbf{E}_{X'}\|X - X'\|^q\;\mathbf{E}_{Y'}\|Y - Y'\|^r\big].$$
Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]:
Define an RKHS $F$ on $X$ with kernel $k$, and an RKHS $G$ on $Y$ with kernel $l$. Then
$$\mathrm{HSIC}(P_{XY}, P_X P_Y) = \mathbf{E}_{XY}\mathbf{E}_{X'Y'}\,k(X, X')\,l(Y, Y') + \mathbf{E}_X\mathbf{E}_{X'}\,k(X, X')\;\mathbf{E}_Y\mathbf{E}_{Y'}\,l(Y, Y') - 2\,\mathbf{E}_{X'Y'}\big[\mathbf{E}_X\,k(X, X')\;\mathbf{E}_Y\,l(Y, Y')\big].$$
Distance covariance is HSIC with particular kernels! [Sejdinovic et al., 2013b]

  45. Semimetrics and Hilbert spaces
Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: let $\rho: X \times X \to \mathbb{R}$ be a semimetric (no triangle inequality) on $X$. Let $z_0 \in X$, and denote
$$k_\rho(z, z') = \rho(z, z_0) + \rho(z', z_0) - \rho(z, z').$$
Then $k_\rho$ is positive definite (and so, via Moore-Aronszajn, defines a unique RKHS) iff $\rho$ is of negative type. Call $k_\rho$ a distance-induced kernel.
Negative type: the semimetric space $(Z, \rho)$ is said to have negative type if for all $n \ge 2$, $z_1, \ldots, z_n \in Z$, and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$ with $\sum_{i=1}^n \alpha_i = 0$,
$$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j\, \rho(z_i, z_j) \le 0. \qquad (1)$$

  46. Semimetrics and Hilbert spaces
Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: let $\rho: X \times X \to \mathbb{R}$ be a semimetric (no triangle inequality) on $X$. Let $z_0 \in X$, and denote
$$k_\rho(z, z') = \rho(z, z_0) + \rho(z', z_0) - \rho(z, z').$$
Then $k_\rho$ is positive definite (and so, via Moore-Aronszajn, defines a unique RKHS) iff $\rho$ is of negative type. Call $k_\rho$ a distance-induced kernel.
Special case: $Z \subseteq \mathbb{R}^d$ and $\rho_q(z, z') = \|z - z'\|^q$. Then $\rho_q$ is a valid semimetric of negative type for $0 < q \le 2$.

  47. Semimetrics and Hilbert spaces
Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: let $\rho: X \times X \to \mathbb{R}$ be a semimetric (no triangle inequality) on $X$. Let $z_0 \in X$, and denote
$$k_\rho(z, z') = \rho(z, z_0) + \rho(z', z_0) - \rho(z, z').$$
Then $k_\rho$ is positive definite (and so, via Moore-Aronszajn, defines a unique RKHS) iff $\rho$ is of negative type. Call $k_\rho$ a distance-induced kernel.
Special case: $Z \subseteq \mathbb{R}^d$ and $\rho_q(z, z') = \|z - z'\|^q$. Then $\rho_q$ is a valid semimetric of negative type for $0 < q \le 2$.
Energy distance is MMD with a distance-induced kernel.
Distance covariance is HSIC with distance-induced kernels.
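
A companion numerical check (my own sketch): the plug-in distance covariance, computed from the three-term formula of slide 43 with $q = r = 1$, coincides with the biased HSIC estimator when $k$ and $l$ are distance-induced kernels $k_\rho$. The base point $z_0$ (taken as the first sample point) and the toy data are illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n = 300
X = rng.standard_normal((n, 2))
Y = X ** 2 + 0.3 * rng.standard_normal((n, 2))       # nonlinearly dependent

Dx, Dy = cdist(X, X), cdist(Y, Y)                    # pairwise distance matrices

# Plug-in distance covariance, following the three-term formula (q = r = 1)
dcov2 = (np.mean(Dx * Dy) + Dx.mean() * Dy.mean()
         - 2 * np.mean(Dx.mean(axis=1) * Dy.mean(axis=1)))

# Biased HSIC with distance-induced kernels k_rho (base point z0 = first sample point)
def dist_kernel(D, d0):
    # k(z_i, z_j) = rho(z_i, z0) + rho(z_j, z0) - rho(z_i, z_j)
    return d0[:, None] + d0[None, :] - D

K = dist_kernel(Dx, Dx[:, 0])
L = dist_kernel(Dy, Dy[:, 0])
H = np.eye(n) - np.ones((n, n)) / n
hsic = np.sum((H @ K @ H) * (H @ L @ H)) / n ** 2

print(dcov2, hsic)                                   # the two agree up to float error
```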

  48. Two-sample testing benchmark
[Figure: a two-sample testing example in 1-D, a density $P(X)$ compared against three candidate densities $Q(X)$.]

  49. Two-sample test, MMD with distance kernel
More powerful tests are obtained on this problem when $q \neq 1$ (the exponent of the distance).
[Figure key: Gaussian kernel; $q = 1$; best: $q = 1/3$; worst: $q = 2$.]

  50. Nonparametric Bayesian inference using distribution embeddings

  51. Motivating Example: Bayesian inference without a model
• 3600 downsampled frames of $20 \times 20$ RGB pixels ($Y_t \in [0, 1]^{1200}$)
• 1800 training frames, the remainder for testing
• Gaussian noise added to $Y_t$
Challenges:
• No parametric model of camera dynamics (only samples)
• No parametric model of the map from camera angle to image (only samples)
• Want to do filtering: Bayesian inference

  52. ABC: an approach to Bayesian inference without a model
Bayes rule:
$$P(y \mid x) = \frac{P(x \mid y)\,\pi(y)}{\int P(x \mid y)\,\pi(y)\, dy}$$
• $P(x \mid y)$ is the likelihood
• $\pi(y)$ is the prior
One approach: Approximate Bayesian Computation (ABC).

  53. ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure (ABC demonstration): likelihood samples $x \sim P(X \mid y)$ plotted against prior samples $y \sim \pi(Y)$.]

  54. ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure (ABC demonstration): likelihood samples $x \sim P(X \mid y)$ vs. prior samples $y \sim \pi(Y)$, with the observation $x^*$ marked.]

  55. ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure (ABC demonstration, continued): likelihood samples vs. prior samples, with the observation $x^*$ marked.]

  56. ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure (ABC demonstration, continued): likelihood samples vs. prior samples, with the observation $x^*$ marked.]

  57. ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure (ABC demonstration, continued): the retained prior samples near $x^*$ give the approximate posterior $\hat P(Y \mid x^*)$.]

  58. ABC: an approach to Bayesian inference without a model
Approximate Bayesian Computation (ABC):
[Figure (ABC demonstration, continued): the retained prior samples near $x^*$ give the approximate posterior $\hat P(Y \mid x^*)$.]
Needed: a distance measure $D$ and a tolerance parameter $\tau$.

  59. ABC: an approach to Bayesian inference without a model
Bayes rule:
$$P(y \mid x) = \frac{P(x \mid y)\,\pi(y)}{\int P(x \mid y)\,\pi(y)\, dy}$$
• $P(x \mid y)$ is the likelihood
• $\pi(y)$ is the prior
ABC generates a sample from $P(Y \mid x^*)$ as follows:
1. generate a sample $y_t$ from the prior $\pi$,
2. generate a sample $x_t$ from $P(X \mid y_t)$,
3. if $D(x^*, x_t) < \tau$, accept $y = y_t$; otherwise reject,
4. go to (1).
In step (3), $D$ is a distance measure and $\tau$ is a tolerance parameter.
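
A minimal sketch of this rejection loop on a toy conjugate-Gaussian problem. The prior, likelihood, distance $D$ (absolute difference in 1-D), tolerance, and sample counts are all my illustrative choices, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): prior pi(Y) = N(0, 2^2), likelihood P(X|y) = N(y, 1)
def sample_prior():
    return rng.normal(0.0, 2.0)

def sample_likelihood(y):
    return rng.normal(y, 1.0)

def abc_rejection(x_star, tau, n_accept):
    """ABC rejection sampling for P(Y | x_star):
    1. draw y_t ~ pi, 2. draw x_t ~ P(X|y_t),
    3. accept y_t if D(x_star, x_t) < tau, 4. repeat."""
    accepted = []
    while len(accepted) < n_accept:
        y_t = sample_prior()
        x_t = sample_likelihood(y_t)
        if abs(x_star - x_t) < tau:          # D = absolute difference in 1-D
            accepted.append(y_t)
    return np.array(accepted)

post = abc_rejection(x_star=1.5, tau=0.1, n_accept=2000)
# For this conjugate toy the exact posterior is N(1.2, sqrt(0.8)), so the
# empirical mean and standard deviation should land near 1.2 and 0.89.
print(post.mean(), post.std())
```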

  60. Motivating example 2: simple Gaussian case
• $p(x, y)$ is $N\big((0, \mathbf{1}_d^\top)^\top, V\big)$ with $V$ a randomly generated covariance.
Posterior mean on $x$: ABC vs. the kernel approach.
[Figure: mean squared error vs. CPU time (seconds) in 6 dimensions for KBI, COND, and ABC, each over a range of sample sizes (200 to 6000).]

  61. Bayes again
Bayes rule:
$$P(y \mid x) = \frac{P(x \mid y)\,\pi(y)}{\int P(x \mid y)\,\pi(y)\, dy}$$
• $P(x \mid y)$ is the likelihood
• $\pi$ is the prior
How would this look with kernel embeddings?

  62. Bayes again
Bayes rule:
$$P(y \mid x) = \frac{P(x \mid y)\,\pi(y)}{\int P(x \mid y)\,\pi(y)\, dy}$$
• $P(x \mid y)$ is the likelihood
• $\pi$ is the prior
How would this look with kernel embeddings?
Define an RKHS $G$ on $Y$ with feature map $\psi_y$ and kernel $l(y, \cdot)$.
We need a conditional mean embedding: for all $g \in G$,
$$\mathbf{E}_{Y \mid x^*}\, g(Y) = \big\langle g, \mu_{P(y \mid x^*)} \big\rangle_G.$$
This will be obtained by RKHS-valued ridge regression.

  63. Ridge regression and the conditional feature mean
Ridge regression from $X := \mathbb{R}^d$ to a finite vector output $Y := \mathbb{R}^{d'}$ (these could be $d'$ nonlinear features of $y$). Define the training data
$$X = \big[x_1\ \ldots\ x_m\big] \in \mathbb{R}^{d \times m}, \qquad Y = \big[y_1\ \ldots\ y_m\big] \in \mathbb{R}^{d' \times m}.$$

  64. Ridge regression and the conditional feature mean
Ridge regression from $X := \mathbb{R}^d$ to a finite vector output $Y := \mathbb{R}^{d'}$ (these could be $d'$ nonlinear features of $y$). Define the training data
$$X = \big[x_1\ \ldots\ x_m\big] \in \mathbb{R}^{d \times m}, \qquad Y = \big[y_1\ \ldots\ y_m\big] \in \mathbb{R}^{d' \times m}.$$
Solve
$$\breve{A} = \arg\min_{A \in \mathbb{R}^{d' \times d}} \Big( \|Y - AX\|^2 + \lambda \|A\|_{HS}^2 \Big), \qquad \text{where } \|A\|_{HS}^2 = \operatorname{tr}(A^\top A) = \sum_{i=1}^{\min\{d, d'\}} \gamma_{A,i}^2.$$

  65. Ridge regression and the conditional feature mean
Ridge regression from $X := \mathbb{R}^d$ to a finite vector output $Y := \mathbb{R}^{d'}$ (these could be $d'$ nonlinear features of $y$). Define the training data
$$X = \big[x_1\ \ldots\ x_m\big] \in \mathbb{R}^{d \times m}, \qquad Y = \big[y_1\ \ldots\ y_m\big] \in \mathbb{R}^{d' \times m}.$$
Solve
$$\breve{A} = \arg\min_{A \in \mathbb{R}^{d' \times d}} \Big( \|Y - AX\|^2 + \lambda \|A\|_{HS}^2 \Big), \qquad \text{where } \|A\|_{HS}^2 = \operatorname{tr}(A^\top A) = \sum_{i=1}^{\min\{d, d'\}} \gamma_{A,i}^2.$$
Solution: $\breve{A} = C_{YX}\,(C_{XX} + m\lambda I)^{-1}$.

  66. Ridge regression and the conditional feature mean
Prediction at a new point $x$:
$$y^* = \breve{A} x = C_{YX}(C_{XX} + m\lambda I)^{-1} x = \sum_{i=1}^m \beta_i(x)\, y_i,$$
where
$$\beta(x) = (K + \lambda m I)^{-1} \big[ k(x_1, x)\ \ldots\ k(x_m, x) \big]^\top, \qquad K := X^\top X, \quad k(x_1, x) = x_1^\top x.$$

  67. Ridge regression and the conditional feature mean
Prediction at a new point $x$:
$$y^* = \breve{A} x = C_{YX}(C_{XX} + m\lambda I)^{-1} x = \sum_{i=1}^m \beta_i(x)\, y_i,$$
where
$$\beta(x) = (K + \lambda m I)^{-1} \big[ k(x_1, x)\ \ldots\ k(x_m, x) \big]^\top, \qquad K := X^\top X, \quad k(x_1, x) = x_1^\top x.$$
What if we do everything in kernel space?

  68. Ridge regression and the conditional feature mean
Recall our setup:
• Given training pairs: $(x_i, y_i) \sim P_{XY}$
• $F$ on $X$ with feature map $\varphi_x$ and kernel $k(x, \cdot)$
• $G$ on $Y$ with feature map $\psi_y$ and kernel $l(y, \cdot)$
We define the covariance between feature maps:
$$C_{XX} = \mathbf{E}_X(\varphi_X \otimes \varphi_X), \qquad C_{XY} = \mathbf{E}_{XY}(\varphi_X \otimes \psi_Y),$$
and matrices of feature-mapped training data
$$X = \big[\varphi_{x_1}\ \ldots\ \varphi_{x_m}\big], \qquad Y := \big[\psi_{y_1}\ \ldots\ \psi_{y_m}\big].$$

  69. Ridge regression and the conditional feature mean
Objective [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]:
$$\breve{A} = \arg\min_{A \in HS(F, G)} \Big( \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|_G^2 + \lambda \|A\|_{HS}^2 \Big), \qquad \|A\|_{HS}^2 = \sum_{i=1}^\infty \gamma_{A,i}^2.$$
The solution is the same as in the vector case: $\breve{A} = C_{YX}(C_{XX} + m\lambda I)^{-1}$.
Prediction at a new $x$ using kernels:
$$\breve{A}\varphi_x = \big[\psi_{y_1}\ \ldots\ \psi_{y_m}\big]\,(K + \lambda m I)^{-1}\,\big[k(x_1, x)\ \ldots\ k(x_m, x)\big]^\top = \sum_{i=1}^m \beta_i(x)\, \psi_{y_i},$$
where $K_{ij} = k(x_i, x_j)$.
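
A sketch of using the $\beta_i(x)$ weights above to approximate a conditional expectation, $\mathbf{E}[g(Y) \mid x] \approx \sum_i \beta_i(x)\, g(y_i)$. The Gaussian kernel, bandwidth, regularization $\lambda$, and test function $g$ (which should strictly lie in $G$, used loosely here) are my illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 400
x = rng.uniform(-3, 3, m)
y = np.sin(x) + 0.2 * rng.standard_normal(m)          # Y | x ~ N(sin(x), 0.2^2)

def gauss(a, b, sigma=0.5):
    """Gaussian kernel matrix between 1-D point sets a and b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

K = gauss(x, x)
lam = 1e-3
x_test = np.array([0.0, 1.0, 2.0])
# beta(x) = (K + lambda*m*I)^{-1} [k(x_1, x), ..., k(x_m, x)]^T, one column per test point
beta = np.linalg.solve(K + lam * m * np.eye(m), gauss(x, x_test))

# Conditional expectation of a test function g via the embedding weights
g = lambda t: t ** 2
print(beta.T @ g(y))                 # estimates of E[Y^2 | x] at the test points
print(np.sin(x_test) ** 2 + 0.04)    # ground truth E[Y^2 | x] = sin(x)^2 + 0.2^2
```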

  70. Ridge regression and the conditional feature mean
How is the loss $\|\psi_Y - A\varphi_X\|_G^2$ relevant to the conditional expectation of some $\mathbf{E}_{Y \mid x}\, g(Y)$?
Define [Song et al. (2009), Grunewalder et al. (2013)]:
$$\mu_{Y \mid x} := A \varphi_x$$

  71. Ridge regression and the conditional feature mean
How is the loss $\|\psi_Y - A\varphi_X\|_G^2$ relevant to the conditional expectation of some $\mathbf{E}_{Y \mid x}\, g(Y)$?
Define [Song et al. (2009), Grunewalder et al. (2013)]:
$$\mu_{Y \mid x} := A \varphi_x$$
We need $A$ to have the property
$$\mathbf{E}_{Y \mid x}\, g(Y) \approx \langle g, \mu_{Y \mid x} \rangle_G = \langle g, A\varphi_x \rangle_G = \langle A^* g, \varphi_x \rangle_F = (A^* g)(x).$$

  72. Ridge regression and the conditional feature mean
How is the loss $\|\psi_Y - A\varphi_X\|_G^2$ relevant to the conditional expectation of some $\mathbf{E}_{Y \mid x}\, g(Y)$?
Define [Song et al. (2009), Grunewalder et al. (2013)]:
$$\mu_{Y \mid x} := A \varphi_x$$
We need $A$ to have the property
$$\mathbf{E}_{Y \mid x}\, g(Y) \approx \langle g, \mu_{Y \mid x} \rangle_G = \langle g, A\varphi_x \rangle_G = \langle A^* g, \varphi_x \rangle_F = (A^* g)(x).$$
A natural risk function for the conditional mean:
$$L(A, P_{XY}) := \sup_{\|g\| \le 1}\; \mathbf{E}_X \Big[ \underbrace{\big(\mathbf{E}_{Y \mid X}\, g(Y)\big)(X)}_{\text{Target}} - \underbrace{(A^* g)(X)}_{\text{Estimator}} \Big]^2.$$

  73. Ridge regression and the conditional feature mean
The squared-loss risk provides an upper bound on the natural risk:
$$L(A, P_{XY}) \le \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|_G^2.$$

  74. Ridge regression and the conditional feature mean
The squared-loss risk provides an upper bound on the natural risk:
$$L(A, P_{XY}) \le \mathbf{E}_{XY}\, \|\psi_Y - A\varphi_X\|_G^2.$$
Proof: Jensen and Cauchy-Schwarz.
$$L(A, P_{XY}) := \sup_{\|g\| \le 1}\; \mathbf{E}_X \Big[ \big(\mathbf{E}_{Y \mid X}\, g(Y)\big)(X) - (A^* g)(X) \Big]^2 \;\le\; \mathbf{E}_{XY} \sup_{\|g\| \le 1}\, \big[ g(Y) - (A^* g)(X) \big]^2$$
