
Lecture 3: Dependence measures using RKHS embeddings
MLSS Cadiz, 2016
Arthur Gretton
Gatsby Unit, CSML, UCL

Outline: three-or-more-variable interactions, with a comparison to conditional dependence testing [Sejdinovic et al., 2013a]


  1. Total independence test
  • Total independence test: $H_0: P_{XYZ} = P_X P_Y P_Z$ vs. $H_1: P_{XYZ} \ne P_X P_Y P_Z$
  • For $(X_1, \ldots, X_D) \sim P_X$, and $\kappa = \prod_{i=1}^D k^{(i)}$:
  $$\underbrace{\left\| \hat\mu_\kappa\big(\hat P_X\big) - \prod_{i=1}^D \hat P_{X_i} \right\|^2}_{\widehat{\Delta}_{\mathrm{tot}}} = \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} \prod_{i=1}^{D} K^{(i)}_{ab} \;-\; \frac{2}{n^{D+1}} \sum_{a=1}^{n} \prod_{i=1}^{D} \sum_{b=1}^{n} K^{(i)}_{ab} \;+\; \frac{1}{n^{2D}} \prod_{i=1}^{D} \sum_{a=1}^{n} \sum_{b=1}^{n} K^{(i)}_{ab}$$
  • Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions.
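  A minimal NumPy sketch of this V-statistic, computed from the per-variable kernel matrices (the function name and interface are illustrative, not from the lecture):

```python
import numpy as np

def total_independence_stat(Ks):
    """Total-independence V-statistic from per-variable kernel matrices.

    Ks: list of D kernel matrices, Ks[i][a, b] = k^(i)(x_i^(a), x_i^(b)),
    one n-by-n matrix per variable.
    """
    n = Ks[0].shape[0]
    D = len(Ks)
    prod_K = np.ones((n, n))
    for K in Ks:
        prod_K = prod_K * K                              # entrywise product over variables
    term1 = prod_K.sum() / n**2
    row_sums = np.stack([K.sum(axis=1) for K in Ks])     # shape (D, n)
    term2 = 2.0 / n**(D + 1) * np.prod(row_sums, axis=0).sum()
    term3 = np.prod([K.sum() for K in Ks]) / n**(2 * D)
    return term1 - term2 + term3
```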

  2. Kernel dependence measures - in detail

  3. MMD for independence: HSIC
  [Figure: matched pairs of images and breed-description text snippets (text from dogtime.com and petfinder.com), an example of dependent paired observations.]
  Empirical $\mathrm{HSIC}(P_{XY}, P_X P_Y)$:
  $$\frac{1}{n^2}\,(HKH \circ HLH)_{++}$$
  where $\circ$ is the entrywise product and $(\cdot)_{++}$ sums all entries of the matrix.
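  A minimal NumPy sketch of the empirical statistic (summing the entrywise product of $HKH$ and $HLH$ equals $\mathrm{tr}(KHLH)$, so either form may be used):

```python
import numpy as np

def hsic_biased(K, L):
    """Biased empirical HSIC: (1/n^2) * sum of entries of (HKH) o (HLH)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc, Lc = H @ K @ H, H @ L @ H
    return (Kc * Lc).sum() / n**2         # entrywise product, then grand sum
```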

  4. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$

  5. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: scatter plot of Y against X; Correlation: −0.00.]

  6. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: dependence witnesses f(x) and g(y), alongside the scatter plot of Y against X (Correlation: −0.00).]

  7. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: dependence witnesses f(x) and g(y); scatter of Y against X (Correlation: −0.00); scatter of g(Y) against f(X) (Correlation: −0.90, COCO: 0.14).]

  8. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: dependence witnesses f(x) and g(y); scatter of Y against X (Correlation: −0.00); scatter of g(Y) against f(X) (Correlation: −0.90, COCO: 0.14).]
  How do we define covariance in (infinite) feature spaces?

  9. Covariance to reveal dependence
  Covariance in RKHS: let's first look at the finite, linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?

  10. Covariance to reveal dependence
  Covariance in RKHS: let's first look at the finite, linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?
  Compute their covariance matrix (ignoring centering):
  $$C_{xy} = E\big[xy^\top\big]$$
  How do we get a single "summary" number?

  11. Covariance to reveal dependence
  Covariance in RKHS: let's first look at the finite, linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?
  Compute their covariance matrix (ignoring centering):
  $$C_{xy} = E\big[xy^\top\big]$$
  How do we get a single "summary" number? Solve for vectors $f \in \mathbb{R}^d$, $g \in \mathbb{R}^{d'}$:
  $$\operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} f^\top C_{xy}\, g = \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} E_{x,y}\big[(f^\top x)(g^\top y)\big] = \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} E_{x,y}[f(x)g(y)] = \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} \operatorname{cov}\big(f(x), g(y)\big)$$
  (the maximum is the largest singular value of $C_{xy}$).
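  In the linear case the summary number is therefore the top singular value of the empirical covariance matrix; a small NumPy illustration on synthetic data (the data-generating choices here are only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))                                       # x in R^3
y = x @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(1000, 2))   # y in R^2

# Empirical covariance matrix (with centering), then its largest singular value
C_xy = (x - x.mean(0)).T @ (y - y.mean(0)) / len(x)
linear_coco = np.linalg.svd(C_xy, compute_uv=False)[0]
```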

  12. Challenges in defining feature space covariance
  Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
  Challenge 1: Can we define a feature-space analogue of $xy^\top$? YES:
  • Given $f \in \mathbb{R}^d$ and $g, h \in \mathbb{R}^{d'}$, define the matrix $fg^\top$ such that $(fg^\top)h = f(g^\top h)$.
  • Given $f \in F$ and $g, h \in G$, define the tensor product operator $f \otimes g$ such that $(f \otimes g)h = f\,\langle g, h \rangle_G$.
  • Now just set $f := \varphi(x)$, $g := \psi(y)$, to get $xy^\top \to \varphi(x) \otimes \psi(y)$.
  • Corresponds to the product kernel: $\langle \varphi(x) \otimes \psi(y),\, \varphi(x') \otimes \psi(y') \rangle = k(x, x')\, l(y, y')$

  13. Challenges in defining feature space covariance
  Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
  Challenge 2: Does a covariance "matrix" (operator) in feature space exist? I.e., is there some $C_{XY}: G \to F$ such that
  $$\langle f, C_{XY}\, g \rangle_F = E_{x,y}[f(x)g(y)] = \operatorname{cov}\big(f(x), g(y)\big)?$$

  14. Challenges in defining feature space covariance
  Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
  Challenge 2: Does a covariance "matrix" (operator) in feature space exist? I.e., is there some $C_{XY}: G \to F$ such that
  $$\langle f, C_{XY}\, g \rangle_F = E_{x,y}[f(x)g(y)] = \operatorname{cov}\big(f(x), g(y)\big)?$$
  YES, via a Bochner integrability argument (as with the mean embedding). Under the condition $E_{x,y}\sqrt{k(x,x)\, l(y,y)} < \infty$, we can define
  $$C_{XY} := E_{x,y}[\varphi(x) \otimes \psi(y)],$$
  which is a Hilbert-Schmidt operator (its sum of squared singular values is finite).

  15. REMINDER: functions revealing dependence
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: dependence witnesses f(x) and g(y); scatter of Y against X (Correlation: −0.00); scatter of g(Y) against f(X) (Correlation: −0.90, COCO: 0.14).]
  How do we compute this from finite data?

  16. Empirical covariance operator
  The empirical feature covariance given $z := (x_i, y_i)_{i=1}^n$ (now including centering):
  $$\widehat C_{XY} := \frac{1}{n} \sum_{i=1}^n \varphi(x_i) \otimes \psi(y_i) - \hat\mu_x \otimes \hat\mu_y, \qquad \text{where } \hat\mu_x := \frac{1}{n}\sum_{i=1}^n \varphi(x_i).$$

  17. Functions revealing dependence
  Optimization problem:
  $$\mathrm{COCO}(z; F, G) := \max\ \langle f, \widehat C_{XY}\, g \rangle_F \quad \text{subject to } \|f\|_F \le 1,\ \|g\|_G \le 1$$
  Assume
  $$f = \sum_{i=1}^n \alpha_i\, [\varphi(x_i) - \hat\mu_x], \qquad g = \sum_{j=1}^n \beta_j\, [\psi(y_j) - \hat\mu_y].$$
  The associated Lagrangian is
  $$L(f, g, \lambda, \gamma) = \langle f, \widehat C_{XY}\, g \rangle_F - \frac{\lambda}{2}\big(\|f\|_F^2 - 1\big) - \frac{\gamma}{2}\big(\|g\|_G^2 - 1\big),$$
  where $\lambda \ge 0$ and $\gamma \ge 0$.

  18. Covariance to reveal dependence
  • Empirical $\mathrm{COCO}(z; F, G)$ is the largest eigenvalue $\gamma$ of
  $$\begin{bmatrix} 0 & \frac{1}{n}\widetilde K \widetilde L \\ \frac{1}{n}\widetilde L \widetilde K & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \gamma \begin{bmatrix} \widetilde K & 0 \\ 0 & \widetilde L \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$$
  • $\widetilde K$ and $\widetilde L$ are matrices of inner products between centred observations in the respective feature spaces: $\widetilde K = HKH$ where $K_{ij} = k(x_i, x_j)$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ (similarly for $\widetilde L$).

  19. Covariance to reveal dependence
  • Empirical $\mathrm{COCO}(z; F, G)$ is the largest eigenvalue $\gamma$ of
  $$\begin{bmatrix} 0 & \frac{1}{n}\widetilde K \widetilde L \\ \frac{1}{n}\widetilde L \widetilde K & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \gamma \begin{bmatrix} \widetilde K & 0 \\ 0 & \widetilde L \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$$
  • $\widetilde K$ and $\widetilde L$ are matrices of inner products between centred observations in the respective feature spaces: $\widetilde K = HKH$ where $K_{ij} = k(x_i, x_j)$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ (similarly for $\widetilde L$).
  • Mapping function for $x$:
  $$f(x) = \sum_{i=1}^n \alpha_i \left( k(x_i, x) - \frac{1}{n}\sum_{j=1}^n k(x_j, x) \right)$$
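  A sketch of the empirical COCO via the generalized eigenvalue problem above; a small ridge `reg` is assumed here, since the centred Gram matrices on the right-hand side are singular:

```python
import numpy as np
from scipy.linalg import eigh

def coco(K, L, reg=1e-8):
    """Largest generalized eigenvalue of the block COCO problem."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H            # centred Gram matrices
    Z = np.zeros((n, n))
    A = np.block([[Z, Kc @ Lc / n], [Lc @ Kc / n, Z]])
    B = np.block([[Kc, Z], [Z, Lc]]) + reg * np.eye(2 * n)
    return eigh(A, B, eigvals_only=True)[-1]  # eigenvalues in ascending order
```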

  20. Hard-to-detect dependence
  The density takes the form
  $$P_{x,y} \propto 1 + \sin(\omega x)\sin(\omega y).$$
  [Figure: four panels — smooth density (small ω) and rough density (large ω), each shown with 500 samples plotted as Y against X.]
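  To reproduce such samples one can rejection-sample the perturbed density; a sketch assuming a uniform base density on $[-\pi, \pi]^2$ (the exact support used in the lecture's plots is not stated here):

```python
import numpy as np

def sample_sinusoid(n, omega, rng=None):
    """Rejection sampling from p(x, y) proportional to 1 + sin(wx)sin(wy)."""
    rng = rng or np.random.default_rng(0)
    samples = []
    while len(samples) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        # the unnormalized density is bounded by 2: accept w.p. p(x, y) / 2
        if rng.uniform(0, 2) < 1 + np.sin(omega * x) * np.sin(omega * y):
            samples.append((x, y))
    return np.array(samples)
```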

  21. Hard-to-detect dependence COCO vs frequency of perturbation from independence.

  22. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = 1.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.27); scatter of g(Y) against f(X) (Correlation: −0.50, COCO: 0.09).]

  23. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = 2.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.04); scatter of g(Y) against f(X) (Correlation: 0.51, COCO: 0.07).]

  24. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = 3.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.03); scatter of g(Y) against f(X) (Correlation: −0.45, COCO: 0.03).]

  25. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = 4.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.03); scatter of g(Y) against f(X) (Correlation: 0.21, COCO: 0.02).]

  26. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = ??
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.00); scatter of g(Y) against f(X) (Correlation: −0.13, COCO: 0.02).]

  27. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of uniform noise! The nonzero COCO here is finite-sample bias, which will decrease with increasing sample size.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.00); scatter of g(Y) against f(X) (Correlation: −0.13, COCO: 0.02).]

  28. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence.
  • As dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear covariance.
  • Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be induced by f, g (bias).
  • This bias will decrease with increasing sample size.

  29. Hard-to-detect dependence
  • Example: sinusoids of increasing frequency.
  [Figure: COCO (empirical average, 1500 samples) against the frequency ω = 1, ..., 7 of the non-constant density component, alongside density plots for ω = 1, ..., 6; COCO decreases as ω grows.]

  30. More functions revealing dependence • Can we do better than COCO?

  31. More functions revealing dependence
  • Can we do better than COCO?
  • A second example with zero correlation.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0); scatter of g(Y) against f(X) (Correlation: −0.80, COCO: 0.11).]

  32. More functions revealing dependence
  • Can we do better than COCO?
  • A second example with zero correlation.
  [Figure: second dependence witnesses f₂(x), g₂(y), alongside the scatter of Y against X (Correlation: 0).]

  33. More functions revealing dependence
  • Can we do better than COCO?
  • A second example with zero correlation.
  [Figure: second dependence witnesses f₂(x), g₂(y); scatter of Y against X (Correlation: 0); scatter of g₂(Y) against f₂(X) (Correlation: −0.37, COCO₂: 0.06).]

  34. Hilbert-Schmidt Independence Criterion
  • Given $\gamma_i := \mathrm{COCO}_i(z; F, G)$, define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
  $$\mathrm{HSIC}(z; F, G) := \sum_{i=1}^n \gamma_i^2$$

  35. Hilbert-Schmidt Independence Criterion
  • Given $\gamma_i := \mathrm{COCO}_i(z; F, G)$, define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
  $$\mathrm{HSIC}(z; F, G) := \sum_{i=1}^n \gamma_i^2$$
  • In the limit of infinite samples:
  $$\mathrm{HSIC}(P; F, G) := \|C_{xy}\|_{\mathrm{HS}}^2 = \langle C_{xy}, C_{xy} \rangle_{\mathrm{HS}} = E_{x,x',y,y'}[k(x,x')\, l(y,y')] + E_{x,x'}[k(x,x')]\, E_{y,y'}[l(y,y')] - 2\, E_{x,y}\big[ E_{x'}[k(x,x')]\, E_{y'}[l(y,y')] \big],$$
  where $x'$ is an independent copy of $x$, and $y'$ a copy of $y$.
  – HSIC is identical to $\mathrm{MMD}^2(P_{XY}, P_X P_Y)$.
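  In practice a common way to turn HSIC into an independence test is a permutation threshold, sketched below (this construction is an assumption here, not spelled out on this slide; it reuses `hsic_biased` from the earlier sketch, and permuting one sample simulates the null $P_X P_Y$):

```python
import numpy as np

def hsic_permutation_test(K, L, n_perms=500, alpha=0.05, rng=None):
    """Reject independence if empirical HSIC exceeds the permutation quantile."""
    rng = rng or np.random.default_rng(0)
    n = K.shape[0]
    stat = hsic_biased(K, L)                      # from the earlier sketch
    null = np.empty(n_perms)
    for i in range(n_perms):
        p = rng.permutation(n)
        null[i] = hsic_biased(K, L[np.ix_(p, p)])  # break the (x, y) pairing
    return stat, stat > np.quantile(null, 1 - alpha)
```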

  36. When does HSIC determine independence?
  Theorem: when the kernels k and l are each characteristic, then HSIC = 0 iff $P_{x,y} = P_x P_y$ [Gretton, 2015].
  This is weaker than the MMD condition (which requires a kernel characteristic on $X \times Y$ to distinguish $P_{x,y}$ from $Q_{x,y}$).

  37. Intuition: why characteristic needed on both X and Y
  Question: wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:
  $$f^* = \operatorname*{arg\,min}_{f \in F} \left( E_{XY}\big(Y - \langle f, \varphi(X) \rangle_F\big)^2 + \lambda \|f\|_F^2 \right)?$$

  38. Intuition: why characteristic needed on both X and Y
  Question: wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:
  $$f^* = \operatorname*{arg\,min}_{f \in F} \left( E_{XY}\big(Y - \langle f, \varphi(X) \rangle_F\big)^2 + \lambda \|f\|_F^2 \right)?$$
  Counterexample: a density symmetric about the x-axis, such that $p(x, y) = p(x, -y)$. Here $E[Y \mid X] = 0$, so the regression finds nothing, yet X and Y are dependent.
  [Figure: scatter plot of Y against X for such a density; Correlation: −0.00.]

  39. Regression using distribution embeddings

  40. Kernels on distributions in supervised learning
  • Kernels have been very widely used in supervised learning
    – support vector classification/regression, kernel ridge regression, ...

  41. Kernels on distributions in supervised learning
  • Kernels have been very widely used in supervised learning
  • Simple kernel on distributions (population counterpart of the set kernel) [Haussler, 1999, Gärtner et al., 2002]:
  $$K(P, Q) = \langle \mu_P, \mu_Q \rangle_F$$
  • Squared distance between distribution embeddings (MMD):
  $$\mathrm{MMD}^2(\mu_P, \mu_Q) := \|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$$

  42. Kernels on distributions in supervised learning
  • Kernels have been very widely used in supervised learning
  • Simple kernel on distributions (population counterpart of the set kernel) [Haussler, 1999, Gärtner et al., 2002]:
  $$K(P, Q) = \langle \mu_P, \mu_Q \rangle_F$$
  • Can define kernels on mean embedding features [Christmann, Steinwart NIPS10], [AISTATS15]:
    – $K_G(P, Q) = \exp\big(-\|\mu_P - \mu_Q\|_F^2 / (2\theta^2)\big)$
    – $K_e(P, Q) = \exp\big(-\|\mu_P - \mu_Q\|_F / (2\theta^2)\big)$
    – $K_C(P, Q) = \big(1 + \|\mu_P - \mu_Q\|_F^2 / \theta^2\big)^{-1}$
    – $K_t(P, Q) = \big(1 + \|\mu_P - \mu_Q\|_F^\theta\big)^{-1}$, $\theta \le 2$, ...
  where $\|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$.
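  Each of these kernels depends only on $\|\mu_P - \mu_Q\|_F$, so it can be estimated from samples by plugging in the empirical MMD. A sketch for $K_G$ with a Gaussian base kernel (bandwidths and names are illustrative):

```python
import numpy as np

def rbf(X, Y, s=1.0):
    """Gaussian base-kernel matrix between sample sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

def mmd2(X, Y, s=1.0):
    """Biased estimate of ||mu_P - mu_Q||^2 from samples X ~ P, Y ~ Q."""
    return rbf(X, X, s).mean() + rbf(Y, Y, s).mean() - 2 * rbf(X, Y, s).mean()

def K_gauss(X, Y, s=1.0, theta=1.0):
    """K_G(P, Q) = exp(-||mu_P - mu_Q||^2 / (2 theta^2)), from samples."""
    return np.exp(-mmd2(X, Y, s) / (2 * theta**2))
```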

  43. Regression using population mean embeddings
  • Samples $z := \{(\mu_{P_i}, y_i)\}_{i=1}^\ell \overset{i.i.d.}{\sim} \rho(\mu_P, y) = \rho(y \mid \mu_P)\, \rho(\mu_P)$, with $\mu_{P_i} = E_{P_i}[\varphi_x]$
  • Regression function
  $$f_\rho(\mu_P) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_P)$$

  44. Regression using population mean embeddings
  • Samples $z := \{(\mu_{P_i}, y_i)\}_{i=1}^\ell \overset{i.i.d.}{\sim} \rho(\mu_P, y) = \rho(y \mid \mu_P)\, \rho(\mu_P)$, with $\mu_{P_i} = E_{P_i}[\varphi_x]$
  • Regression function
  $$f_\rho(\mu_P) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_P)$$
  • Ridge regression for labelled distributions:
  $$f_z^\lambda = \operatorname*{arg\,min}_{f \in H} \frac{1}{\ell} \sum_{i=1}^\ell \big( f(\mu_{P_i}) - y_i \big)^2 + \lambda \|f\|_H^2, \qquad (\lambda > 0)$$
  • Define the RKHS $H$ with kernel $K(\mu_P, \mu_Q) := \langle \psi_{\mu_P}, \psi_{\mu_Q} \rangle_H$: functions from $\mathcal{F} \subset F$ to $\mathbb{R}$, where $\mathcal{F} := \{\mu_P : P \in \mathcal{P}\}$ and $\mathcal{P}$ is a set of probability measures on $X$
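  A sketch of the resulting two-level procedure: each labelled distribution is represented by a bag of samples, the outer kernel $K$ is evaluated via `K_gauss` from the previous sketch, and standard kernel ridge regression is applied (regularization and bandwidths are illustrative):

```python
import numpy as np

def fit_distribution_ridge(bags, y, lam=1e-3, s=1.0, theta=1.0):
    """Kernel ridge regression over bags of samples, one bag per distribution."""
    l = len(bags)
    G = np.array([[K_gauss(bags[i], bags[j], s, theta) for j in range(l)]
                  for i in range(l)])                 # outer Gram matrix
    alpha = np.linalg.solve(G + lam * l * np.eye(l), y)

    def predict(bag):
        return np.array([K_gauss(bag, b, s, theta) for b in bags]) @ alpha

    return predict
```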

  45. Regression using population mean embeddings
  • Expected risk and excess risk:
  $$R[f] = E_{\rho(\mu_P, y)}\big( f(\mu_P) - y \big)^2, \qquad \mathcal{E}(f_z^\lambda, f_\rho) = R[f_z^\lambda] - R[f_\rho].$$
  • Minimax rate [Caponnetto and Vito, 2007]:
  $$\mathcal{E}(f_z^\lambda, f_\rho) = O_p\!\left( \ell^{-\frac{bc}{bc+1}} \right) \qquad (b > 1,\ c \in (1, 2]).$$
  – b: size of the input space; c: smoothness of $f_\rho$

  46. Regression using population mean embeddings
  • Expected risk and excess risk:
  $$R[f] = E_{\rho(\mu_P, y)}\big( f(\mu_P) - y \big)^2, \qquad \mathcal{E}(f_z^\lambda, f_\rho) = R[f_z^\lambda] - R[f_\rho].$$
  • Minimax rate [Caponnetto and Vito, 2007]:
  $$\mathcal{E}(f_z^\lambda, f_\rho) = O_p\!\left( \ell^{-\frac{bc}{bc+1}} \right) \qquad (b > 1,\ c \in (1, 2]).$$
  – b: size of the input space; c: smoothness of $f_\rho$
  • Replace $\mu_{P_i}$ with $\hat\mu_{P_i} = N^{-1} \sum_{j=1}^N \varphi_{x_j}$, where $x_j \overset{i.i.d.}{\sim} P_i$
  • Given $N = \ell^a \log(\ell)$ with $a = 2$ (and a Hölder condition on $\psi: \mathcal{F} \to H$),
  $$\mathcal{E}(f_z^\lambda, f_\rho) = O_p\!\left( \ell^{-\frac{bc}{bc+1}} \right) \qquad (b > 1,\ c \in (1, 2]).$$
  The same rate as for the population embeddings $\mu_{P_i}$! [AISTATS15, JMLR in revision]

  47. Kernels on distributions in supervised learning
  Supervised learning applications:
  • Regression, from distributions to vector spaces [AISTATS15]
    – Atmospheric monitoring: predict the aerosol value from the distribution of pixel values of a multispectral satellite image over an area (performance matches the engineered state-of-the-art [Wang et al., 2012])
  • Expectation propagation: learn to predict outgoing messages from incoming messages, when updates would otherwise be done by numerical integration [UAI15]
  • Learning causal direction with mean embeddings [Lopez-Paz et al., 2015]

  48. Learning causal direction with mean embeddings
  Additive noise model to direct an edge between random variables x and y [Hoyer et al., 2009].
  [Figure: D. Lopez-Paz]

  49. Learning causal direction with mean embeddings
  Classification of cause-effect relations [Lopez-Paz et al., 2015]
  • Tuebingen cause-effect pairs: 82 scalar real-world examples where causes and effects are known [Zscheischler, J., 2014]
  • Training data: artificial, random nonlinear functions with additive Gaussian noise
  • Features: $\hat\mu_{P_x}$, $\hat\mu_{P_y}$, $\hat\mu_{P_{xy}}$, with labels for x → y and y → x
  • Performance: 81% correct
  [Figure: Mooij et al. (2015)]

  50. Co-authors
  • From UCL:
    – Luca Baldassarre
    – Steffen Grunewalder
    – Guy Lever
    – Sam Patterson
    – Massimiliano Pontil
    – Dino Sejdinovic
  • External:
    – Karsten Borgwardt, MPI
    – Wicher Bergsma, LSE
    – Kenji Fukumizu, ISM
    – Zaid Harchaoui, INRIA
    – Bernhard Schoelkopf, MPI
    – Alex Smola, CMU/Google
    – Le Song, Georgia Tech
    – Bharath Sriperumbudur, Cambridge

  51. Kernel two-sample tests for big data, optimal kernel choice

  52. Quadratic time estimate of MMD
  $$\mathrm{MMD}^2 = \|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$$

  53. Quadratic time estimate of MMD
  $$\mathrm{MMD}^2 = \|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$$
  Given i.i.d. samples $X := \{x_1, \ldots, x_m\}$ and $Y := \{y_1, \ldots, y_m\}$ from P and Q respectively, the earlier (quadratic-time) estimate:
  $$\widehat{E_P\, k(x, x')} = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \ne i} k(x_i, x_j)$$

  54. Quadratic time estimate of MMD
  $$\mathrm{MMD}^2 = \|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$$
  Given i.i.d. samples $X := \{x_1, \ldots, x_m\}$ and $Y := \{y_1, \ldots, y_m\}$ from P and Q respectively, the earlier (quadratic-time) estimate:
  $$\widehat{E_P\, k(x, x')} = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \ne i} k(x_i, x_j)$$
  New, linear-time estimate:
  $$\widehat{E_P\, k(x, x')} = \frac{2}{m} \big[ k(x_1, x_2) + k(x_3, x_4) + \cdots \big] = \frac{2}{m} \sum_{i=1}^{m/2} k(x_{2i-1}, x_{2i})$$

  55. Linear time MMD
  A shorter expression with explicit dependence on k:
  $$\mathrm{MMD}^2 =: \eta_k(p, q) = E_{x, x', y, y'}\, h_k(x, x', y, y') =: E_v\, h_k(v),$$
  where
  $$h_k(x, x', y, y') = k(x, x') + k(y, y') - k(x, y') - k(x', y),$$
  and $v := [x, x', y, y']$.

  56. Linear time MMD
  A shorter expression with explicit dependence on k:
  $$\mathrm{MMD}^2 =: \eta_k(p, q) = E_{x, x', y, y'}\, h_k(x, x', y, y') =: E_v\, h_k(v),$$
  where
  $$h_k(x, x', y, y') = k(x, x') + k(y, y') - k(x, y') - k(x', y),$$
  and $v := [x, x', y, y']$.
  The linear-time estimate again:
  $$\check\eta_k = \frac{2}{m} \sum_{i=1}^{m/2} h_k(v_i),$$
  where $v_i := [x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i}]$ and
  $$h_k(v_i) := k(x_{2i-1}, x_{2i}) + k(y_{2i-1}, y_{2i}) - k(x_{2i-1}, y_{2i}) - k(x_{2i}, y_{2i-1}).$$
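  A sketch of the linear-time estimator, for a kernel function `k` that maps two paired sample arrays to a vector of kernel values; the per-pair values `h` are also returned, since they give the variance estimate used below:

```python
import numpy as np

def mmd_linear(x, y, k):
    """Linear-time MMD estimate: mean of h_k over m/2 disjoint sample pairs."""
    m2 = min(len(x), len(y)) // 2
    x1, x2 = x[0:2*m2:2], x[1:2*m2:2]   # pairs (x_1, x_2), (x_3, x_4), ...
    y1, y2 = y[0:2*m2:2], y[1:2*m2:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean(), h
```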

  57. Linear time vs quadratic time MMD
  Disadvantages of the linear time MMD vs the quadratic time MMD:
  • Much higher variance for a given m, hence ...
  • ... a much less powerful test for a given m

  58. Linear time vs quadratic time MMD
  Disadvantages of the linear time MMD vs the quadratic time MMD:
  • Much higher variance for a given m, hence ...
  • ... a much less powerful test for a given m
  Advantages of the linear time MMD vs the quadratic time MMD:
  • Very simple asymptotic null distribution (a Gaussian, vs an infinite weighted sum of χ²)
  • Both the test statistic and the threshold are computable in O(m), with storage O(1)
  • Given unlimited data, a given Type II error can be attained with less computation

  59. Asymptotics of linear time MMD
  By the central limit theorem,
  $$m^{1/2}\big(\check\eta_k - \eta_k(p, q)\big) \overset{D}{\to} \mathcal{N}(0, 2\sigma_k^2)$$
  • assuming $0 < E(h_k^2) < \infty$ (true for bounded k)
  • $\sigma_k^2 = E_v\, h_k^2(v) - \big[E_v(h_k(v))\big]^2$.

  60. Hypothesis test
  Hypothesis test of asymptotic level α: reject the null when $\check\eta_k$ exceeds the threshold
  $$t_{k,\alpha} = m^{-1/2}\, \check\sigma_k\, \sqrt{2}\, \Phi^{-1}(1 - \alpha),$$
  where $\Phi^{-1}$ is the inverse CDF of $\mathcal{N}(0, 1)$.
  [Figure: null distribution of $\check\eta_k$, with $t_{k,\alpha}$ at the $(1-\alpha)$ quantile and the Type I error region shaded.]
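  A sketch of the complete level-α test, plugging the empirical standard deviation of the per-pair values into the threshold formula above (`mmd_linear` is from the earlier sketch):

```python
import numpy as np
from scipy.stats import norm

def mmd_linear_test(x, y, k, alpha=0.05):
    """Reject H0: P = Q if the linear-time MMD exceeds the Gaussian threshold."""
    stat, h = mmd_linear(x, y, k)
    m = 2 * len(h)                             # number of samples used
    sigma = h.std(ddof=1)                      # plug-in estimate of sigma_k
    threshold = np.sqrt(2.0 / m) * sigma * norm.ppf(1 - alpha)
    return stat > threshold, stat, threshold
```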

  61. Type II error
  [Figure: null vs alternative distribution of $\check\eta_k$; the alternative is centred at $\eta_k(p, q)$, and the Type II error is the mass of the alternative falling below $t_{k,\alpha}$.]

  62. The best kernel: minimizes Type II error
  Type II error: $\check\eta_k$ falls below the threshold $t_{k,\alpha}$ while $\eta_k(p, q) > 0$. Probability of a Type II error:
  $$P(\check\eta_k < t_{k,\alpha}) = \Phi\!\left( \Phi^{-1}(1 - \alpha) - \frac{\eta_k(p, q)\sqrt{m}}{\sigma_k \sqrt{2}} \right),$$
  where Φ is the standard Normal CDF.

  63. The best kernel: minimizes Type II error
  Type II error: $\check\eta_k$ falls below the threshold $t_{k,\alpha}$ while $\eta_k(p, q) > 0$. Probability of a Type II error:
  $$P(\check\eta_k < t_{k,\alpha}) = \Phi\!\left( \Phi^{-1}(1 - \alpha) - \frac{\eta_k(p, q)\sqrt{m}}{\sigma_k \sqrt{2}} \right),$$
  where Φ is the standard Normal CDF.
  Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is
  $$k_* = \operatorname*{arg\,max}_{k \in \mathcal{K}}\ \eta_k(p, q)\, \sigma_k^{-1},$$
  where $\mathcal{K}$ is the family of kernels under consideration.

  64. Learning the best kernel in a family
  Define the family of kernels as follows:
  $$\mathcal{K} := \left\{ k : k = \sum_{u=1}^d \beta_u k_u,\ \|\beta\|_1 = D,\ \beta_u \ge 0,\ \forall u \in \{1, \ldots, d\} \right\}.$$
  Properties:
  • all $k \in \mathcal{K}$ are valid kernels,
  • if all $k_u$ are characteristic and at least one $\beta_u > 0$, then k is characteristic.

  65. Test statistic
  The squared MMD becomes
  $$\eta_k(p, q) = \|\mu_k(p) - \mu_k(q)\|_{F_k}^2 = \sum_{u=1}^d \beta_u\, \eta_u(p, q),$$
  where $\eta_u(p, q) := E_v\, h_u(v)$.

  66. Test statistic
  The squared MMD becomes
  $$\eta_k(p, q) = \|\mu_k(p) - \mu_k(q)\|_{F_k}^2 = \sum_{u=1}^d \beta_u\, \eta_u(p, q),$$
  where $\eta_u(p, q) := E_v\, h_u(v)$.
  Denote:
  • $\beta = (\beta_1, \beta_2, \ldots, \beta_d)^\top \in \mathbb{R}^d$,
  • $h = (h_1, h_2, \ldots, h_d)^\top \in \mathbb{R}^d$, with $h_u(x, x', y, y') = k_u(x, x') + k_u(y, y') - k_u(x, y') - k_u(x', y)$,
  • $\eta = E_v(h) = (\eta_1, \eta_2, \ldots, \eta_d)^\top \in \mathbb{R}^d$.
  Quantities for the test:
  $$\eta_k(p, q) = E(\beta^\top h) = \beta^\top \eta, \qquad \sigma_k^2 := \beta^\top \mathrm{cov}(h)\, \beta.$$

  67. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Empirical test parameters:
  $$\hat\eta_k = \beta^\top \hat\eta, \qquad \hat\sigma_{k,\lambda} = \sqrt{\beta^\top \big(\hat Q + \lambda_m I\big)\beta},$$
  where $\hat Q$ is the empirical estimate of $\mathrm{cov}(h)$.
  Note: $\hat\eta_k$, $\hat\sigma_{k,\lambda}$ are computed on training data, vs $\check\eta_k$, $\check\sigma_k$ on the data to be tested. (Why? Choosing the kernel on the same data used for the test would make the statistic and threshold dependent on that choice, invalidating the null distribution.)

  68. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Empirical test parameters:
  $$\hat\eta_k = \beta^\top \hat\eta, \qquad \hat\sigma_{k,\lambda} = \sqrt{\beta^\top \big(\hat Q + \lambda_m I\big)\beta},$$
  where $\hat Q$ is the empirical estimate of $\mathrm{cov}(h)$.
  Objective:
  $$\beta^* = \operatorname*{arg\,max}_{\beta \succeq 0}\ \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} = \operatorname*{arg\,max}_{\beta \succeq 0}\ \beta^\top \hat\eta \left( \beta^\top \big(\hat Q + \lambda_m I\big)\beta \right)^{-1/2} =: \operatorname*{arg\,max}_{\beta \succeq 0}\ \alpha(\beta; \hat\eta, \hat Q)$$

  69. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Assume $\hat\eta$ has at least one positive entry. Then there exists $\beta \succeq 0$ such that $\alpha(\beta; \hat\eta, \hat Q) > 0$. Thus $\alpha(\beta^*; \hat\eta, \hat Q) > 0$.

  70. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Assume $\hat\eta$ has at least one positive entry. Then there exists $\beta \succeq 0$ such that $\alpha(\beta; \hat\eta, \hat Q) > 0$. Thus $\alpha(\beta^*; \hat\eta, \hat Q) > 0$.
  Solve the easier problem $\beta^* = \operatorname*{arg\,max}_{\beta \succeq 0}\ \alpha^2(\beta; \hat\eta, \hat Q)$, a quadratic program:
  $$\min\left\{ \beta^\top \big(\hat Q + \lambda_m I\big)\beta \;:\; \beta^\top \hat\eta = 1,\ \beta \succeq 0 \right\}$$

  71. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Assume $\hat\eta$ has at least one positive entry. Then there exists $\beta \succeq 0$ such that $\alpha(\beta; \hat\eta, \hat Q) > 0$. Thus $\alpha(\beta^*; \hat\eta, \hat Q) > 0$.
  Solve the easier problem $\beta^* = \operatorname*{arg\,max}_{\beta \succeq 0}\ \alpha^2(\beta; \hat\eta, \hat Q)$, a quadratic program:
  $$\min\left\{ \beta^\top \big(\hat Q + \lambda_m I\big)\beta \;:\; \beta^\top \hat\eta = 1,\ \beta \succeq 0 \right\}$$
  What if $\hat\eta$ has no positive entries?
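  A sketch of this quadratic program via SciPy's SLSQP solver (a dedicated QP solver would be the usual choice; if $\hat\eta$ has no positive entry the constraint set is empty, matching the random-kernel fallback on the next slide):

```python
import numpy as np
from scipy.optimize import minimize

def optimal_beta(eta_hat, Q_hat, lam_m):
    """min beta' (Q_hat + lam I) beta  s.t.  beta' eta_hat = 1, beta >= 0."""
    d = len(eta_hat)
    A = Q_hat + lam_m * np.eye(d)
    res = minimize(
        lambda b: b @ A @ b,
        x0=np.full(d, 1.0 / d),
        jac=lambda b: 2 * A @ b,
        constraints=[{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}],
        bounds=[(0.0, None)] * d,
        method="SLSQP",
    )
    beta = np.clip(res.x, 0.0, None)
    return beta / beta.sum()    # rescale onto the simplex: only the direction matters
```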

  72. Test procedure
  1. Split the data into training and test sets.
  2. On the training data:
     (a) Compute $\hat\eta_u$ for all $k_u \in \mathcal{K}$
     (b) If at least one $\hat\eta_u > 0$, solve the QP to get $\beta^*$; else choose a random kernel from $\mathcal{K}$
  3. On the test data:
     (a) Compute $\check\eta_{k^*}$ using $k^* = \sum_{u=1}^d \beta^*_u k_u$
     (b) Compute the test threshold $\check t_{\alpha, k^*}$ using $\check\sigma_{k^*}$
  4. Reject the null if $\check\eta_{k^*} > \check t_{\alpha, k^*}$

  73. Convergence bounds
  Assume a bounded kernel and $\sigma_k$ bounded away from 0. If $\lambda_m = \Theta(m^{-1/3})$ then
  $$\left| \sup_{k \in \mathcal{K}} \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k\, \sigma_k^{-1} \right| = O_P\!\left(m^{-1/3}\right).$$

  74. Convergence bounds
  Assume a bounded kernel and $\sigma_k$ bounded away from 0. If $\lambda_m = \Theta(m^{-1/3})$ then
  $$\left| \sup_{k \in \mathcal{K}} \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k\, \sigma_k^{-1} \right| = O_P\!\left(m^{-1/3}\right).$$
  Idea:
  $$\left| \sup_{k \in \mathcal{K}} \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k\, \sigma_k^{-1} \right| \le \sup_{k \in \mathcal{K}} \left| \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} - \eta_k\, \sigma_{k,\lambda}^{-1} \right| + \sup_{k \in \mathcal{K}} \left| \eta_k\, \sigma_{k,\lambda}^{-1} - \eta_k\, \sigma_k^{-1} \right|$$
  $$\le C_1 \sup_{k \in \mathcal{K}} |\hat\eta_k - \eta_k| + C_2 \sup_{k \in \mathcal{K}} |\hat\sigma_{k,\lambda} - \sigma_{k,\lambda}| + C_3 D^2 \lambda_m.$$
