
Lecture 3: Dependence measures using RKHS embeddings
MLSS Cadiz, 2016
Arthur Gretton
Gatsby Unit, CSML, UCL

Outline: three-or-more-variable interactions, with a comparison to conditional dependence testing [Sejdinovic et al., 2013a]


  1. Total independence test
  • Total independence test: $H_0: P_{XYZ} = P_X P_Y P_Z$ vs. $H_1: P_{XYZ} \ne P_X P_Y P_Z$
  • For $(X_1, \ldots, X_D) \sim P_X$, and $\kappa = \prod_{i=1}^D k^{(i)}$:
  $$\underbrace{\left\| \hat\mu_\kappa\big(\hat P_X\big) - \prod_{i=1}^D \hat P_{X_i} \right\|^2}_{\widehat{\Delta}_{\mathrm{tot}}} = \frac{1}{n^2} \sum_{a=1}^{n} \sum_{b=1}^{n} \prod_{i=1}^{D} K^{(i)}_{ab} \;-\; \frac{2}{n^{D+1}} \sum_{a=1}^{n} \prod_{i=1}^{D} \sum_{b=1}^{n} K^{(i)}_{ab} \;+\; \frac{1}{n^{2D}} \prod_{i=1}^{D} \sum_{a=1}^{n} \sum_{b=1}^{n} K^{(i)}_{ab}$$
  • Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions.
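  A minimal NumPy sketch of this V-statistic, computed from the per-variable kernel matrices (the function name and interface are illustrative, not from the lecture):

```python
import numpy as np

def total_independence_stat(Ks):
    """Total-independence V-statistic from per-variable kernel matrices.

    Ks: list of D kernel matrices, Ks[i][a, b] = k^(i)(x_i^(a), x_i^(b)),
    one n-by-n matrix per variable.
    """
    n = Ks[0].shape[0]
    D = len(Ks)
    prod_K = np.ones((n, n))
    for K in Ks:
        prod_K = prod_K * K                              # entrywise product over variables
    term1 = prod_K.sum() / n**2
    row_sums = np.stack([K.sum(axis=1) for K in Ks])     # shape (D, n)
    term2 = 2.0 / n**(D + 1) * np.prod(row_sums, axis=0).sum()
    term3 = np.prod([K.sum() for K in Ks]) / n**(2 * D)
    return term1 - term2 + term3
```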

  2. Kernel dependence measures - in detail

  3. MMD for independence: HSIC
  [Figure: matched pairs of images and breed-description text snippets (text from dogtime.com and petfinder.com), an example of dependent paired observations.]
  Empirical $\mathrm{HSIC}(P_{XY}, P_X P_Y)$:
  $$\frac{1}{n^2}\,(HKH \circ HLH)_{++}$$
  where $\circ$ is the entrywise product and $(\cdot)_{++}$ sums all entries of the matrix.
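  A minimal NumPy sketch of the empirical statistic (summing the entrywise product of $HKH$ and $HLH$ equals $\mathrm{tr}(KHLH)$, so either form may be used):

```python
import numpy as np

def hsic_biased(K, L):
    """Biased empirical HSIC: (1/n^2) * sum of entries of (HKH) o (HLH)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    Kc, Lc = H @ K @ H, H @ L @ H
    return (Kc * Lc).sum() / n**2         # entrywise product, then grand sum
```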

  4. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$

  5. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: scatter plot of Y against X; Correlation: −0.00.]

  6. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: dependence witnesses f(x) and g(y), alongside the scatter plot of Y against X (Correlation: −0.00).]

  7. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: dependence witnesses f(x) and g(y); scatter of Y against X (Correlation: −0.00); scatter of g(Y) against f(X) (Correlation: −0.90, COCO: 0.14).]

  8. Covariance to reveal dependence
  A more intuitive idea: maximize the covariance of smooth mappings:
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: dependence witnesses f(x) and g(y); scatter of Y against X (Correlation: −0.00); scatter of g(Y) against f(X) (Correlation: −0.90, COCO: 0.14).]
  How do we define covariance in (infinite) feature spaces?

  9. Covariance to reveal dependence
  Covariance in RKHS: let's first look at the finite, linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?

  10. Covariance to reveal dependence
  Covariance in RKHS: let's first look at the finite, linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?
  Compute their covariance matrix (ignoring centering):
  $$C_{xy} = E\big[xy^\top\big]$$
  How do we get a single "summary" number?

  11. Covariance to reveal dependence
  Covariance in RKHS: let's first look at the finite, linear case. We have two random vectors $x \in \mathbb{R}^d$, $y \in \mathbb{R}^{d'}$. Are they linearly dependent?
  Compute their covariance matrix (ignoring centering):
  $$C_{xy} = E\big[xy^\top\big]$$
  How do we get a single "summary" number? Solve for vectors $f \in \mathbb{R}^d$, $g \in \mathbb{R}^{d'}$:
  $$\operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} f^\top C_{xy}\, g = \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} E_{x,y}\big[(f^\top x)(g^\top y)\big] = \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} E_{x,y}[f(x)g(y)] = \operatorname*{argmax}_{\|f\|=1,\,\|g\|=1} \operatorname{cov}\big(f(x), g(y)\big)$$
  (the maximum is the largest singular value of $C_{xy}$).
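  In the linear case the summary number is therefore the top singular value of the empirical covariance matrix; a small NumPy illustration on synthetic data (the data-generating choices here are only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 3))                                       # x in R^3
y = x @ rng.normal(size=(3, 2)) + 0.1 * rng.normal(size=(1000, 2))   # y in R^2

# Empirical covariance matrix (with centering), then its largest singular value
C_xy = (x - x.mean(0)).T @ (y - y.mean(0)) / len(x)
linear_coco = np.linalg.svd(C_xy, compute_uv=False)[0]
```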

  12. Challenges in defining feature space covariance
  Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
  Challenge 1: Can we define a feature-space analogue of $xy^\top$? YES:
  • Given $f \in \mathbb{R}^d$ and $g, h \in \mathbb{R}^{d'}$, define the matrix $fg^\top$ such that $(fg^\top)h = f(g^\top h)$.
  • Given $f \in F$ and $g, h \in G$, define the tensor product operator $f \otimes g$ such that $(f \otimes g)h = f\,\langle g, h \rangle_G$.
  • Now just set $f := \varphi(x)$, $g := \psi(y)$, to get $xy^\top \to \varphi(x) \otimes \psi(y)$.
  • Corresponds to the product kernel: $\langle \varphi(x) \otimes \psi(y),\, \varphi(x') \otimes \psi(y') \rangle = k(x, x')\, l(y, y')$

  13. Challenges in defining feature space covariance
  Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
  Challenge 2: Does a covariance "matrix" (operator) in feature space exist? I.e., is there some $C_{XY}: G \to F$ such that
  $$\langle f, C_{XY}\, g \rangle_F = E_{x,y}[f(x)g(y)] = \operatorname{cov}\big(f(x), g(y)\big)?$$

  14. Challenges in defining feature space covariance
  Given features $\varphi(x) \in F$ and $\psi(y) \in G$:
  Challenge 2: Does a covariance "matrix" (operator) in feature space exist? I.e., is there some $C_{XY}: G \to F$ such that
  $$\langle f, C_{XY}\, g \rangle_F = E_{x,y}[f(x)g(y)] = \operatorname{cov}\big(f(x), g(y)\big)?$$
  YES, via a Bochner integrability argument (as with the mean embedding). Under the condition $E_{x,y}\sqrt{k(x,x)\, l(y,y)} < \infty$, we can define
  $$C_{XY} := E_{x,y}[\varphi(x) \otimes \psi(y)],$$
  which is a Hilbert-Schmidt operator (its sum of squared singular values is finite).

  15. REMINDER: functions revealing dependence
  $$\mathrm{COCO}(P; F, G) := \sup_{\|f\|_F = 1,\, \|g\|_G = 1} \big( E_{x,y}[f(x)g(y)] - E_x[f(x)]\, E_y[g(y)] \big)$$
  [Figure: dependence witnesses f(x) and g(y); scatter of Y against X (Correlation: −0.00); scatter of g(Y) against f(X) (Correlation: −0.90, COCO: 0.14).]
  How do we compute this from finite data?

  16. Empirical covariance operator
  The empirical feature covariance given $z := (x_i, y_i)_{i=1}^n$ (now including centering):
  $$\widehat C_{XY} := \frac{1}{n} \sum_{i=1}^n \varphi(x_i) \otimes \psi(y_i) - \hat\mu_x \otimes \hat\mu_y, \qquad \text{where } \hat\mu_x := \frac{1}{n}\sum_{i=1}^n \varphi(x_i).$$

  17. Functions revealing dependence
  Optimization problem:
  $$\mathrm{COCO}(z; F, G) := \max\ \langle f, \widehat C_{XY}\, g \rangle_F \quad \text{subject to } \|f\|_F \le 1,\ \|g\|_G \le 1$$
  Assume
  $$f = \sum_{i=1}^n \alpha_i\, [\varphi(x_i) - \hat\mu_x], \qquad g = \sum_{j=1}^n \beta_j\, [\psi(y_j) - \hat\mu_y].$$
  The associated Lagrangian is
  $$L(f, g, \lambda, \gamma) = \langle f, \widehat C_{XY}\, g \rangle_F - \frac{\lambda}{2}\big(\|f\|_F^2 - 1\big) - \frac{\gamma}{2}\big(\|g\|_G^2 - 1\big),$$
  where $\lambda \ge 0$ and $\gamma \ge 0$.

  18. Covariance to reveal dependence
  • Empirical $\mathrm{COCO}(z; F, G)$ is the largest eigenvalue $\gamma$ of
  $$\begin{bmatrix} 0 & \frac{1}{n}\widetilde K \widetilde L \\ \frac{1}{n}\widetilde L \widetilde K & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \gamma \begin{bmatrix} \widetilde K & 0 \\ 0 & \widetilde L \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$$
  • $\widetilde K$ and $\widetilde L$ are matrices of inner products between centred observations in the respective feature spaces: $\widetilde K = HKH$ where $K_{ij} = k(x_i, x_j)$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ (similarly for $\widetilde L$).

  19. Covariance to reveal dependence
  • Empirical $\mathrm{COCO}(z; F, G)$ is the largest eigenvalue $\gamma$ of
  $$\begin{bmatrix} 0 & \frac{1}{n}\widetilde K \widetilde L \\ \frac{1}{n}\widetilde L \widetilde K & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \gamma \begin{bmatrix} \widetilde K & 0 \\ 0 & \widetilde L \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$$
  • $\widetilde K$ and $\widetilde L$ are matrices of inner products between centred observations in the respective feature spaces: $\widetilde K = HKH$ where $K_{ij} = k(x_i, x_j)$ and $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$ (similarly for $\widetilde L$).
  • Mapping function for $x$:
  $$f(x) = \sum_{i=1}^n \alpha_i \left( k(x_i, x) - \frac{1}{n}\sum_{j=1}^n k(x_j, x) \right)$$
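  A sketch of the empirical COCO via the generalized eigenvalue problem above; a small ridge `reg` is assumed here, since the centred Gram matrices on the right-hand side are singular:

```python
import numpy as np
from scipy.linalg import eigh

def coco(K, L, reg=1e-8):
    """Largest generalized eigenvalue of the block COCO problem."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ L @ H            # centred Gram matrices
    Z = np.zeros((n, n))
    A = np.block([[Z, Kc @ Lc / n], [Lc @ Kc / n, Z]])
    B = np.block([[Kc, Z], [Z, Lc]]) + reg * np.eye(2 * n)
    return eigh(A, B, eigvals_only=True)[-1]  # eigenvalues in ascending order
```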

  20. Hard-to-detect dependence
  The density takes the form
  $$P_{x,y} \propto 1 + \sin(\omega x)\sin(\omega y).$$
  [Figure: four panels — smooth density (small ω) and rough density (large ω), each shown with 500 samples plotted as Y against X.]
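  To reproduce such samples one can rejection-sample the perturbed density; a sketch assuming a uniform base density on $[-\pi, \pi]^2$ (the exact support used in the lecture's plots is not stated here):

```python
import numpy as np

def sample_sinusoid(n, omega, rng=None):
    """Rejection sampling from p(x, y) proportional to 1 + sin(wx)sin(wy)."""
    rng = rng or np.random.default_rng(0)
    samples = []
    while len(samples) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        # the unnormalized density is bounded by 2: accept w.p. p(x, y) / 2
        if rng.uniform(0, 2) < 1 + np.sin(omega * x) * np.sin(omega * y):
            samples.append((x, y))
    return np.array(samples)
```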

  21. Hard-to-detect dependence COCO vs frequency of perturbation from independence.

  22. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = 1.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.27); scatter of g(Y) against f(X) (Correlation: −0.50, COCO: 0.09).]

  23. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = 2.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.04); scatter of g(Y) against f(X) (Correlation: 0.51, COCO: 0.07).]

  24. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = 3.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.03); scatter of g(Y) against f(X) (Correlation: −0.45, COCO: 0.03).]

  25. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = 4.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.03); scatter of g(Y) against f(X) (Correlation: 0.21, COCO: 0.02).]

  26. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of ω = ??
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.00); scatter of g(Y) against f(X) (Correlation: −0.13, COCO: 0.02).]

  27. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence. Case of uniform noise! The nonzero COCO here is finite-sample bias, which will decrease with increasing sample size.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0.00); scatter of g(Y) against f(X) (Correlation: −0.13, COCO: 0.02).]

  28. Hard-to-detect dependence
  COCO vs frequency of perturbation from independence.
  • As dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear covariance.
  • Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be induced by f, g (bias).
  • This bias will decrease with increasing sample size.

  29. Hard-to-detect dependence
  • Example: sinusoids of increasing frequency.
  [Figure: COCO (empirical average, 1500 samples) against the frequency ω = 1, ..., 7 of the non-constant density component, alongside density plots for ω = 1, ..., 6; COCO decreases as ω grows.]

  30. More functions revealing dependence • Can we do better than COCO?

  31. More functions revealing dependence
  • Can we do better than COCO?
  • A second example with zero correlation.
  [Figure: dependence witnesses f(x), g(y); scatter of Y against X (Correlation: 0); scatter of g(Y) against f(X) (Correlation: −0.80, COCO: 0.11).]

  32. More functions revealing dependence
  • Can we do better than COCO?
  • A second example with zero correlation.
  [Figure: second dependence witnesses f₂(x), g₂(y), alongside the scatter of Y against X (Correlation: 0).]

  33. More functions revealing dependence
  • Can we do better than COCO?
  • A second example with zero correlation.
  [Figure: second dependence witnesses f₂(x), g₂(y); scatter of Y against X (Correlation: 0); scatter of g₂(Y) against f₂(X) (Correlation: −0.37, COCO₂: 0.06).]

  34. Hilbert-Schmidt Independence Criterion
  • Given $\gamma_i := \mathrm{COCO}_i(z; F, G)$, define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
  $$\mathrm{HSIC}(z; F, G) := \sum_{i=1}^n \gamma_i^2$$

  35. Hilbert-Schmidt Independence Criterion
  • Given $\gamma_i := \mathrm{COCO}_i(z; F, G)$, define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:
  $$\mathrm{HSIC}(z; F, G) := \sum_{i=1}^n \gamma_i^2$$
  • In the limit of infinite samples:
  $$\mathrm{HSIC}(P; F, G) := \|C_{xy}\|_{\mathrm{HS}}^2 = \langle C_{xy}, C_{xy} \rangle_{\mathrm{HS}} = E_{x,x',y,y'}[k(x,x')\, l(y,y')] + E_{x,x'}[k(x,x')]\, E_{y,y'}[l(y,y')] - 2\, E_{x,y}\big[ E_{x'}[k(x,x')]\, E_{y'}[l(y,y')] \big],$$
  where $x'$ is an independent copy of $x$, and $y'$ a copy of $y$.
  – HSIC is identical to $\mathrm{MMD}^2(P_{XY}, P_X P_Y)$.
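  In practice a common way to turn HSIC into an independence test is a permutation threshold, sketched below (this construction is an assumption here, not spelled out on this slide; it reuses `hsic_biased` from the earlier sketch, and permuting one sample simulates the null $P_X P_Y$):

```python
import numpy as np

def hsic_permutation_test(K, L, n_perms=500, alpha=0.05, rng=None):
    """Reject independence if empirical HSIC exceeds the permutation quantile."""
    rng = rng or np.random.default_rng(0)
    n = K.shape[0]
    stat = hsic_biased(K, L)                      # from the earlier sketch
    null = np.empty(n_perms)
    for i in range(n_perms):
        p = rng.permutation(n)
        null[i] = hsic_biased(K, L[np.ix_(p, p)])  # break the (x, y) pairing
    return stat, stat > np.quantile(null, 1 - alpha)
```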

  36. When does HSIC determine independence?
  Theorem: when the kernels k and l are each characteristic, then HSIC = 0 iff $P_{x,y} = P_x P_y$ [Gretton, 2015].
  This is weaker than the MMD condition (which requires a kernel characteristic on $X \times Y$ to distinguish $P_{x,y}$ from $Q_{x,y}$).

  37. Intuition: why characteristic needed on both X and Y
  Question: wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:
  $$f^* = \operatorname*{arg\,min}_{f \in F} \left( E_{XY}\big(Y - \langle f, \varphi(X) \rangle_F\big)^2 + \lambda \|f\|_F^2 \right)?$$

  38. Intuition: why characteristic needed on both X and Y
  Question: wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F:
  $$f^* = \operatorname*{arg\,min}_{f \in F} \left( E_{XY}\big(Y - \langle f, \varphi(X) \rangle_F\big)^2 + \lambda \|f\|_F^2 \right)?$$
  Counterexample: a density symmetric about the x-axis, such that $p(x, y) = p(x, -y)$. Here $E[Y \mid X] = 0$, so the regression finds nothing, yet X and Y are dependent.
  [Figure: scatter plot of Y against X for such a density; Correlation: −0.00.]

  39. Regression using distribution embeddings

  40. Kernels on distributions in supervised learning
  • Kernels have been very widely used in supervised learning
    – support vector classification/regression, kernel ridge regression, ...

  41. Kernels on distributions in supervised learning
  • Kernels have been very widely used in supervised learning
  • Simple kernel on distributions (population counterpart of the set kernel) [Haussler, 1999, Gärtner et al., 2002]:
  $$K(P, Q) = \langle \mu_P, \mu_Q \rangle_F$$
  • Squared distance between distribution embeddings (MMD):
  $$\mathrm{MMD}^2(\mu_P, \mu_Q) := \|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$$

  42. Kernels on distributions in supervised learning
  • Kernels have been very widely used in supervised learning
  • Simple kernel on distributions (population counterpart of the set kernel) [Haussler, 1999, Gärtner et al., 2002]:
  $$K(P, Q) = \langle \mu_P, \mu_Q \rangle_F$$
  • Can define kernels on mean embedding features [Christmann, Steinwart NIPS10], [AISTATS15]:
    – $K_G(P, Q) = \exp\big(-\|\mu_P - \mu_Q\|_F^2 / (2\theta^2)\big)$
    – $K_e(P, Q) = \exp\big(-\|\mu_P - \mu_Q\|_F / (2\theta^2)\big)$
    – $K_C(P, Q) = \big(1 + \|\mu_P - \mu_Q\|_F^2 / \theta^2\big)^{-1}$
    – $K_t(P, Q) = \big(1 + \|\mu_P - \mu_Q\|_F^\theta\big)^{-1}$, $\theta \le 2$, ...
  where $\|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$.
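  Each of these kernels depends only on $\|\mu_P - \mu_Q\|_F$, so it can be estimated from samples by plugging in the empirical MMD. A sketch for $K_G$ with a Gaussian base kernel (bandwidths and names are illustrative):

```python
import numpy as np

def rbf(X, Y, s=1.0):
    """Gaussian base-kernel matrix between sample sets X and Y."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

def mmd2(X, Y, s=1.0):
    """Biased estimate of ||mu_P - mu_Q||^2 from samples X ~ P, Y ~ Q."""
    return rbf(X, X, s).mean() + rbf(Y, Y, s).mean() - 2 * rbf(X, Y, s).mean()

def K_gauss(X, Y, s=1.0, theta=1.0):
    """K_G(P, Q) = exp(-||mu_P - mu_Q||^2 / (2 theta^2)), from samples."""
    return np.exp(-mmd2(X, Y, s) / (2 * theta**2))
```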

  43. Regression using population mean embeddings
  • Samples $z := \{(\mu_{P_i}, y_i)\}_{i=1}^\ell \overset{i.i.d.}{\sim} \rho(\mu_P, y) = \rho(y \mid \mu_P)\, \rho(\mu_P)$, with $\mu_{P_i} = E_{P_i}[\varphi_x]$
  • Regression function
  $$f_\rho(\mu_P) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_P)$$

  44. Regression using population mean embeddings
  • Samples $z := \{(\mu_{P_i}, y_i)\}_{i=1}^\ell \overset{i.i.d.}{\sim} \rho(\mu_P, y) = \rho(y \mid \mu_P)\, \rho(\mu_P)$, with $\mu_{P_i} = E_{P_i}[\varphi_x]$
  • Regression function
  $$f_\rho(\mu_P) = \int_{\mathbb{R}} y \, d\rho(y \mid \mu_P)$$
  • Ridge regression for labelled distributions:
  $$f_z^\lambda = \operatorname*{arg\,min}_{f \in H} \frac{1}{\ell} \sum_{i=1}^\ell \big( f(\mu_{P_i}) - y_i \big)^2 + \lambda \|f\|_H^2, \qquad (\lambda > 0)$$
  • Define the RKHS $H$ with kernel $K(\mu_P, \mu_Q) := \langle \psi_{\mu_P}, \psi_{\mu_Q} \rangle_H$: functions from $\mathcal{F} \subset F$ to $\mathbb{R}$, where $\mathcal{F} := \{\mu_P : P \in \mathcal{P}\}$ and $\mathcal{P}$ is a set of probability measures on $X$
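  A sketch of the resulting two-level procedure: each labelled distribution is represented by a bag of samples, the outer kernel $K$ is evaluated via `K_gauss` from the previous sketch, and standard kernel ridge regression is applied (regularization and bandwidths are illustrative):

```python
import numpy as np

def fit_distribution_ridge(bags, y, lam=1e-3, s=1.0, theta=1.0):
    """Kernel ridge regression over bags of samples, one bag per distribution."""
    l = len(bags)
    G = np.array([[K_gauss(bags[i], bags[j], s, theta) for j in range(l)]
                  for i in range(l)])                 # outer Gram matrix
    alpha = np.linalg.solve(G + lam * l * np.eye(l), y)

    def predict(bag):
        return np.array([K_gauss(bag, b, s, theta) for b in bags]) @ alpha

    return predict
```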

  45. Regression using population mean embeddings
  • Expected risk and excess risk:
  $$R[f] = E_{\rho(\mu_P, y)}\big( f(\mu_P) - y \big)^2, \qquad \mathcal{E}(f_z^\lambda, f_\rho) = R[f_z^\lambda] - R[f_\rho].$$
  • Minimax rate [Caponnetto and Vito, 2007]:
  $$\mathcal{E}(f_z^\lambda, f_\rho) = O_p\!\left( \ell^{-\frac{bc}{bc+1}} \right) \qquad (b > 1,\ c \in (1, 2]).$$
  – b: size of the input space; c: smoothness of $f_\rho$

  46. Regression using population mean embeddings
  • Expected risk and excess risk:
  $$R[f] = E_{\rho(\mu_P, y)}\big( f(\mu_P) - y \big)^2, \qquad \mathcal{E}(f_z^\lambda, f_\rho) = R[f_z^\lambda] - R[f_\rho].$$
  • Minimax rate [Caponnetto and Vito, 2007]:
  $$\mathcal{E}(f_z^\lambda, f_\rho) = O_p\!\left( \ell^{-\frac{bc}{bc+1}} \right) \qquad (b > 1,\ c \in (1, 2]).$$
  – b: size of the input space; c: smoothness of $f_\rho$
  • Replace $\mu_{P_i}$ with $\hat\mu_{P_i} = N^{-1} \sum_{j=1}^N \varphi_{x_j}$, where $x_j \overset{i.i.d.}{\sim} P_i$
  • Given $N = \ell^a \log(\ell)$ with $a = 2$ (and a Hölder condition on $\psi: \mathcal{F} \to H$),
  $$\mathcal{E}(f_z^\lambda, f_\rho) = O_p\!\left( \ell^{-\frac{bc}{bc+1}} \right) \qquad (b > 1,\ c \in (1, 2]).$$
  The same rate as for the population embeddings $\mu_{P_i}$! [AISTATS15, JMLR in revision]

  47. Kernels on distributions in supervised learning
  Supervised learning applications:
  • Regression, from distributions to vector spaces [AISTATS15]
    – Atmospheric monitoring: predict the aerosol value from the distribution of pixel values of a multispectral satellite image over an area (performance matches the engineered state-of-the-art [Wang et al., 2012])
  • Expectation propagation: learn to predict outgoing messages from incoming messages, when updates would otherwise be done by numerical integration [UAI15]
  • Learning causal direction with mean embeddings [Lopez-Paz et al., 2015]

  48. Learning causal direction with mean embeddings
  Additive noise model to direct an edge between random variables x and y [Hoyer et al., 2009].
  [Figure: D. Lopez-Paz]

  49. Learning causal direction with mean embeddings
  Classification of cause-effect relations [Lopez-Paz et al., 2015]
  • Tuebingen cause-effect pairs: 82 scalar real-world examples where causes and effects are known [Zscheischler, J., 2014]
  • Training data: artificial, random nonlinear functions with additive Gaussian noise
  • Features: $\hat\mu_{P_x}$, $\hat\mu_{P_y}$, $\hat\mu_{P_{xy}}$, with labels for x → y and y → x
  • Performance: 81% correct
  [Figure: Mooij et al. (2015)]

  50. Co-authors
  • From UCL:
    – Luca Baldassarre
    – Steffen Grunewalder
    – Guy Lever
    – Sam Patterson
    – Massimiliano Pontil
    – Dino Sejdinovic
  • External:
    – Karsten Borgwardt, MPI
    – Wicher Bergsma, LSE
    – Kenji Fukumizu, ISM
    – Zaid Harchaoui, INRIA
    – Bernhard Schoelkopf, MPI
    – Alex Smola, CMU/Google
    – Le Song, Georgia Tech
    – Bharath Sriperumbudur, Cambridge

  51. Kernel two-sample tests for big data, optimal kernel choice

  52. Quadratic time estimate of MMD
  $$\mathrm{MMD}^2 = \|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$$

  53. Quadratic time estimate of MMD
  $$\mathrm{MMD}^2 = \|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$$
  Given i.i.d. samples $X := \{x_1, \ldots, x_m\}$ and $Y := \{y_1, \ldots, y_m\}$ from P and Q respectively, the earlier (quadratic-time) estimate:
  $$\widehat{E_P\, k(x, x')} = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \ne i} k(x_i, x_j)$$

  54. Quadratic time estimate of MMD
  $$\mathrm{MMD}^2 = \|\mu_P - \mu_Q\|_F^2 = E_P\, k(x, x') + E_Q\, k(y, y') - 2 E_{P,Q}\, k(x, y)$$
  Given i.i.d. samples $X := \{x_1, \ldots, x_m\}$ and $Y := \{y_1, \ldots, y_m\}$ from P and Q respectively, the earlier (quadratic-time) estimate:
  $$\widehat{E_P\, k(x, x')} = \frac{1}{m(m-1)} \sum_{i=1}^m \sum_{j \ne i} k(x_i, x_j)$$
  New, linear-time estimate:
  $$\widehat{E_P\, k(x, x')} = \frac{2}{m} \big[ k(x_1, x_2) + k(x_3, x_4) + \cdots \big] = \frac{2}{m} \sum_{i=1}^{m/2} k(x_{2i-1}, x_{2i})$$

  55. Linear time MMD
  A shorter expression with explicit dependence on k:
  $$\mathrm{MMD}^2 =: \eta_k(p, q) = E_{x, x', y, y'}\, h_k(x, x', y, y') =: E_v\, h_k(v),$$
  where
  $$h_k(x, x', y, y') = k(x, x') + k(y, y') - k(x, y') - k(x', y),$$
  and $v := [x, x', y, y']$.

  56. Linear time MMD
  A shorter expression with explicit dependence on k:
  $$\mathrm{MMD}^2 =: \eta_k(p, q) = E_{x, x', y, y'}\, h_k(x, x', y, y') =: E_v\, h_k(v),$$
  where
  $$h_k(x, x', y, y') = k(x, x') + k(y, y') - k(x, y') - k(x', y),$$
  and $v := [x, x', y, y']$.
  The linear-time estimate again:
  $$\check\eta_k = \frac{2}{m} \sum_{i=1}^{m/2} h_k(v_i),$$
  where $v_i := [x_{2i-1}, x_{2i}, y_{2i-1}, y_{2i}]$ and
  $$h_k(v_i) := k(x_{2i-1}, x_{2i}) + k(y_{2i-1}, y_{2i}) - k(x_{2i-1}, y_{2i}) - k(x_{2i}, y_{2i-1}).$$
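  A sketch of the linear-time estimator, for a kernel function `k` that maps two paired sample arrays to a vector of kernel values; the per-pair values `h` are also returned, since they give the variance estimate used below:

```python
import numpy as np

def mmd_linear(x, y, k):
    """Linear-time MMD estimate: mean of h_k over m/2 disjoint sample pairs."""
    m2 = min(len(x), len(y)) // 2
    x1, x2 = x[0:2*m2:2], x[1:2*m2:2]   # pairs (x_1, x_2), (x_3, x_4), ...
    y1, y2 = y[0:2*m2:2], y[1:2*m2:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean(), h
```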

  57. Linear time vs quadratic time MMD
  Disadvantages of the linear time MMD vs the quadratic time MMD:
  • Much higher variance for a given m, hence ...
  • ... a much less powerful test for a given m

  58. Linear time vs quadratic time MMD
  Disadvantages of the linear time MMD vs the quadratic time MMD:
  • Much higher variance for a given m, hence ...
  • ... a much less powerful test for a given m
  Advantages of the linear time MMD vs the quadratic time MMD:
  • Very simple asymptotic null distribution (a Gaussian, vs an infinite weighted sum of χ²)
  • Both the test statistic and the threshold are computable in O(m), with storage O(1)
  • Given unlimited data, a given Type II error can be attained with less computation

  59. Asymptotics of linear time MMD
  By the central limit theorem,
  $$m^{1/2}\big(\check\eta_k - \eta_k(p, q)\big) \overset{D}{\to} \mathcal{N}(0, 2\sigma_k^2)$$
  • assuming $0 < E(h_k^2) < \infty$ (true for bounded k)
  • $\sigma_k^2 = E_v\, h_k^2(v) - \big[E_v(h_k(v))\big]^2$.

  60. Hypothesis test
  Hypothesis test of asymptotic level α: reject the null when $\check\eta_k$ exceeds the threshold
  $$t_{k,\alpha} = m^{-1/2}\, \check\sigma_k\, \sqrt{2}\, \Phi^{-1}(1 - \alpha),$$
  where $\Phi^{-1}$ is the inverse CDF of $\mathcal{N}(0, 1)$.
  [Figure: null distribution of $\check\eta_k$, with $t_{k,\alpha}$ at the $(1-\alpha)$ quantile and the Type I error region shaded.]
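  A sketch of the complete level-α test, plugging the empirical standard deviation of the per-pair values into the threshold formula above (`mmd_linear` is from the earlier sketch):

```python
import numpy as np
from scipy.stats import norm

def mmd_linear_test(x, y, k, alpha=0.05):
    """Reject H0: P = Q if the linear-time MMD exceeds the Gaussian threshold."""
    stat, h = mmd_linear(x, y, k)
    m = 2 * len(h)                             # number of samples used
    sigma = h.std(ddof=1)                      # plug-in estimate of sigma_k
    threshold = np.sqrt(2.0 / m) * sigma * norm.ppf(1 - alpha)
    return stat > threshold, stat, threshold
```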

  61. Type II error
  [Figure: null vs alternative distribution of $\check\eta_k$; the alternative is centred at $\eta_k(p, q)$, and the Type II error is the mass of the alternative falling below $t_{k,\alpha}$.]

  62. The best kernel: minimizes Type II error
  Type II error: $\check\eta_k$ falls below the threshold $t_{k,\alpha}$ while $\eta_k(p, q) > 0$. Probability of a Type II error:
  $$P(\check\eta_k < t_{k,\alpha}) = \Phi\!\left( \Phi^{-1}(1 - \alpha) - \frac{\eta_k(p, q)\sqrt{m}}{\sigma_k \sqrt{2}} \right),$$
  where Φ is the standard Normal CDF.

  63. The best kernel: minimizes Type II error
  Type II error: $\check\eta_k$ falls below the threshold $t_{k,\alpha}$ while $\eta_k(p, q) > 0$. Probability of a Type II error:
  $$P(\check\eta_k < t_{k,\alpha}) = \Phi\!\left( \Phi^{-1}(1 - \alpha) - \frac{\eta_k(p, q)\sqrt{m}}{\sigma_k \sqrt{2}} \right),$$
  where Φ is the standard Normal CDF.
  Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is
  $$k_* = \operatorname*{arg\,max}_{k \in \mathcal{K}}\ \eta_k(p, q)\, \sigma_k^{-1},$$
  where $\mathcal{K}$ is the family of kernels under consideration.

  64. Learning the best kernel in a family
  Define the family of kernels as follows:
  $$\mathcal{K} := \left\{ k : k = \sum_{u=1}^d \beta_u k_u,\ \|\beta\|_1 = D,\ \beta_u \ge 0,\ \forall u \in \{1, \ldots, d\} \right\}.$$
  Properties:
  • all $k \in \mathcal{K}$ are valid kernels,
  • if all $k_u$ are characteristic and at least one $\beta_u > 0$, then k is characteristic.

  65. Test statistic
  The squared MMD becomes
  $$\eta_k(p, q) = \|\mu_k(p) - \mu_k(q)\|_{F_k}^2 = \sum_{u=1}^d \beta_u\, \eta_u(p, q),$$
  where $\eta_u(p, q) := E_v\, h_u(v)$.

  66. Test statistic
  The squared MMD becomes
  $$\eta_k(p, q) = \|\mu_k(p) - \mu_k(q)\|_{F_k}^2 = \sum_{u=1}^d \beta_u\, \eta_u(p, q),$$
  where $\eta_u(p, q) := E_v\, h_u(v)$.
  Denote:
  • $\beta = (\beta_1, \beta_2, \ldots, \beta_d)^\top \in \mathbb{R}^d$,
  • $h = (h_1, h_2, \ldots, h_d)^\top \in \mathbb{R}^d$, with $h_u(x, x', y, y') = k_u(x, x') + k_u(y, y') - k_u(x, y') - k_u(x', y)$,
  • $\eta = E_v(h) = (\eta_1, \eta_2, \ldots, \eta_d)^\top \in \mathbb{R}^d$.
  Quantities for the test:
  $$\eta_k(p, q) = E(\beta^\top h) = \beta^\top \eta, \qquad \sigma_k^2 := \beta^\top \mathrm{cov}(h)\, \beta.$$

  67. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Empirical test parameters:
  $$\hat\eta_k = \beta^\top \hat\eta, \qquad \hat\sigma_{k,\lambda} = \sqrt{\beta^\top \big(\hat Q + \lambda_m I\big)\beta},$$
  where $\hat Q$ is the empirical estimate of $\mathrm{cov}(h)$.
  Note: $\hat\eta_k$, $\hat\sigma_{k,\lambda}$ are computed on training data, vs $\check\eta_k$, $\check\sigma_k$ on the data to be tested. (Why? Choosing the kernel on the same data used for the test would make the statistic and threshold dependent on that choice, invalidating the null distribution.)

  68. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Empirical test parameters:
  $$\hat\eta_k = \beta^\top \hat\eta, \qquad \hat\sigma_{k,\lambda} = \sqrt{\beta^\top \big(\hat Q + \lambda_m I\big)\beta},$$
  where $\hat Q$ is the empirical estimate of $\mathrm{cov}(h)$.
  Objective:
  $$\beta^* = \operatorname*{arg\,max}_{\beta \succeq 0}\ \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} = \operatorname*{arg\,max}_{\beta \succeq 0}\ \beta^\top \hat\eta \left( \beta^\top \big(\hat Q + \lambda_m I\big)\beta \right)^{-1/2} =: \operatorname*{arg\,max}_{\beta \succeq 0}\ \alpha(\beta; \hat\eta, \hat Q)$$

  69. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Assume $\hat\eta$ has at least one positive entry. Then there exists $\beta \succeq 0$ such that $\alpha(\beta; \hat\eta, \hat Q) > 0$. Thus $\alpha(\beta^*; \hat\eta, \hat Q) > 0$.

  70. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Assume $\hat\eta$ has at least one positive entry. Then there exists $\beta \succeq 0$ such that $\alpha(\beta; \hat\eta, \hat Q) > 0$. Thus $\alpha(\beta^*; \hat\eta, \hat Q) > 0$.
  Solve the easier problem $\beta^* = \operatorname*{arg\,max}_{\beta \succeq 0}\ \alpha^2(\beta; \hat\eta, \hat Q)$, a quadratic program:
  $$\min\left\{ \beta^\top \big(\hat Q + \lambda_m I\big)\beta \;:\; \beta^\top \hat\eta = 1,\ \beta \succeq 0 \right\}$$

  71. Optimization of ratio $\eta_k(p,q)\,\sigma_k^{-1}$
  Assume $\hat\eta$ has at least one positive entry. Then there exists $\beta \succeq 0$ such that $\alpha(\beta; \hat\eta, \hat Q) > 0$. Thus $\alpha(\beta^*; \hat\eta, \hat Q) > 0$.
  Solve the easier problem $\beta^* = \operatorname*{arg\,max}_{\beta \succeq 0}\ \alpha^2(\beta; \hat\eta, \hat Q)$, a quadratic program:
  $$\min\left\{ \beta^\top \big(\hat Q + \lambda_m I\big)\beta \;:\; \beta^\top \hat\eta = 1,\ \beta \succeq 0 \right\}$$
  What if $\hat\eta$ has no positive entries?
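  A sketch of this quadratic program via SciPy's SLSQP solver (a dedicated QP solver would be the usual choice; if $\hat\eta$ has no positive entry the constraint set is empty, matching the random-kernel fallback on the next slide):

```python
import numpy as np
from scipy.optimize import minimize

def optimal_beta(eta_hat, Q_hat, lam_m):
    """min beta' (Q_hat + lam I) beta  s.t.  beta' eta_hat = 1, beta >= 0."""
    d = len(eta_hat)
    A = Q_hat + lam_m * np.eye(d)
    res = minimize(
        lambda b: b @ A @ b,
        x0=np.full(d, 1.0 / d),
        jac=lambda b: 2 * A @ b,
        constraints=[{"type": "eq", "fun": lambda b: b @ eta_hat - 1.0}],
        bounds=[(0.0, None)] * d,
        method="SLSQP",
    )
    beta = np.clip(res.x, 0.0, None)
    return beta / beta.sum()    # rescale onto the simplex: only the direction matters
```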

  72. Test procedure
  1. Split the data into training and test sets.
  2. On the training data:
     (a) Compute $\hat\eta_u$ for all $k_u \in \mathcal{K}$
     (b) If at least one $\hat\eta_u > 0$, solve the QP to get $\beta^*$; else choose a random kernel from $\mathcal{K}$
  3. On the test data:
     (a) Compute $\check\eta_{k^*}$ using $k^* = \sum_{u=1}^d \beta^*_u k_u$
     (b) Compute the test threshold $\check t_{\alpha, k^*}$ using $\check\sigma_{k^*}$
  4. Reject the null if $\check\eta_{k^*} > \check t_{\alpha, k^*}$

  73. Convergence bounds
  Assume a bounded kernel and $\sigma_k$ bounded away from 0. If $\lambda_m = \Theta(m^{-1/3})$ then
  $$\left| \sup_{k \in \mathcal{K}} \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k\, \sigma_k^{-1} \right| = O_P\!\left(m^{-1/3}\right).$$

  74. Convergence bounds
  Assume a bounded kernel and $\sigma_k$ bounded away from 0. If $\lambda_m = \Theta(m^{-1/3})$ then
  $$\left| \sup_{k \in \mathcal{K}} \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k\, \sigma_k^{-1} \right| = O_P\!\left(m^{-1/3}\right).$$
  Idea:
  $$\left| \sup_{k \in \mathcal{K}} \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} - \sup_{k \in \mathcal{K}} \eta_k\, \sigma_k^{-1} \right| \le \sup_{k \in \mathcal{K}} \left| \hat\eta_k\, \hat\sigma_{k,\lambda}^{-1} - \eta_k\, \sigma_{k,\lambda}^{-1} \right| + \sup_{k \in \mathcal{K}} \left| \eta_k\, \sigma_{k,\lambda}^{-1} - \eta_k\, \sigma_k^{-1} \right|$$
  $$\le C_1 \sup_{k \in \mathcal{K}} |\hat\eta_k - \eta_k| + C_2 \sup_{k \in \mathcal{K}} |\hat\sigma_{k,\lambda} - \sigma_{k,\lambda}| + C_3 D^2 \lambda_m.$$
