Lecture 2: Mappings of Probabilities to RKHS and Applications
MLSS Tübingen, 2015
Arthur Gretton Gatsby Unit, CSML, UCL
Outline
– Kernel metric on the space of probability measures
– Function revealing differences in distributions
– Distance between means in space of features (RKHS)
– Independence measure: features of joint minus product of marginals
– Distributions on strings, images, graphs, groups (rotation matrices), semigroups, ...
– Testing on big data, kernel choice
– Energy distance/distance covariance: special case of kernel statistic
Feature mean difference
[Figure: two Gaussian densities P_X and Q_X with different means.]
Feature mean difference
[Figure: two Gaussian densities P_X and Q_X with different variances.]
Feature mean difference
[Figure: two Gaussian densities P_X and Q_X with different variances, and the densities of the feature X².]
Feature mean difference
[Figure: Gaussian and Laplace densities P_X and Q_X.]
Probabilities in feature space: the mean trick

The kernel trick:
– define the feature map φ_x ∈ F, φ_x = [... φ_i(x) ...] ∈ ℓ₂
– k(x, x′) = ⟨φ_x, φ_{x′}⟩_F
– f(x) = ⟨f, φ_x⟩_F

The mean trick:
– for a measure P on X, define the feature map µ_P ∈ F, µ_P = [... E_P[φ_i(x)] ...]
– E_{P,Q} k(x, y) = ⟨µ_P, µ_Q⟩_F for x ∼ P and y ∼ Q
– (mean/distribution embedding) E_P(f(x)) =: ⟨µ_P, f⟩_F
What does µ_P look like?

We plot the function µ_P, the element of the RKHS satisfying
  ⟨µ_P(·), f(·)⟩_F = E_P f(x).
What does it look like? Evaluate it at a point x:
  µ_P(x) = ⟨µ_P(·), φ(x)⟩_F = ⟨µ_P(·), k(·, x)⟩_F = E_{x′∼P} k(x′, x).   Expectation of the kernel!
Empirical estimate:
  µ̂_P(x) = (1/m) Σ_{i=1}^m k(x_i, x),   x_i ∼ P.

[Figure: histogram of samples from P, and the empirical embedding µ̂_P.]
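The empirical embedding is just an average of kernel functions centred at the sample points, so it is easy to compute and plot. A minimal sketch follows (not from the slides), assuming a Gaussian kernel; the bandwidth and the toy sample are illustrative choices.

```python
# Sketch: empirical mean embedding mu_hat_P(x) = (1/m) * sum_i k(x_i, x),
# with a Gaussian kernel (bandwidth sigma is an illustrative choice).
import numpy as np

def gauss_kernel(a, b, sigma=1.0):
    """Gaussian kernel k(a_i, b_j) for all pairs of entries of 1-d arrays a, b."""
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def empirical_mean_embedding(sample, query, sigma=1.0):
    """mu_hat_P at the query points: average of k(x_i, .) over the sample."""
    return gauss_kernel(sample, query, sigma).mean(axis=0)

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)   # x_i ~ P (toy example)
query = np.linspace(-4, 4, 101)                     # grid at which to plot mu_hat_P
mu_hat = empirical_mean_embedding(sample, query)
print(mu_hat.max())
```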
Does the feature space mean exist?

Does there exist an element µ_P ∈ F such that
  E_P f(x) = E_P ⟨f(·), φ(x)⟩_F = ⟨f(·), E_P φ(x)⟩_F = ⟨f(·), µ_P(·)⟩_F   ∀f ∈ F ?
Yes: you can exchange expectation and inner product (i.e. φ(x) is Bochner integrable [Steinwart and Christmann, 2008]) under the condition
  E_P ‖φ(x)‖_F = E_P √(k(x, x)) < ∞.
Function Showing Difference in Distributions
[Figure: samples from P and Q.]

  MMD(P, Q; F) := sup_{f∈F} [E_P f(x) − E_Q f(y)]

[Figure: a smooth function taking large values on samples from P and small values on samples from Q.]
[Figure: a bounded continuous function taking large values on samples from P and small values on samples from Q.]
[Figure: witness function f for Gauss and Laplace densities.]

Classical function classes F for which MMD is a metric:
– F = bounded continuous [Dudley, 2002]
– F = bounded variation 1 (Kolmogorov metric) [Müller, 1997]
– F = bounded Lipschitz (Earth mover's distances) [Dudley, 2002]
Here: F = unit ball of an RKHS (coming soon!)
[ISMB06, NIPS06a, NIPS07b, NIPS08a, JMLR10]
How do smooth functions relate to feature maps?
Function view vs feature mean view

  MMD²(P, Q; F) = [ sup_{f∈F} (E_P f(x) − E_Q f(y)) ]²
                = [ sup_{f∈F} ⟨f, µ_P − µ_Q⟩_F ]²       use E_P(f(x)) =: ⟨µ_P, f⟩_F
                = ‖µ_P − µ_Q‖²_F                         use ‖θ‖_F = sup_{f∈F} ⟨f, θ⟩_F

Function view and feature view are equivalent.

[Figure: witness function f for Gauss and Laplace densities.]
Empirical estimate of MMD

Given samples {x_i}_{i=1}^m ∼ P and {y_i}_{i=1}^m ∼ Q, an unbiased empirical estimate is

  MMD̂² = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} [k(x_i, x_j) + k(y_i, y_j)] − 1/m² Σ_{i=1}^m Σ_{j=1}^m [k(y_i, x_j) + k(x_i, y_j)].

Derivation: expand the squared distance between the mean embeddings,

  ‖µ_P − µ_Q‖²_F = ⟨µ_P − µ_Q, µ_P − µ_Q⟩_F
                 = ⟨µ_P, µ_P⟩ + ⟨µ_Q, µ_Q⟩ − 2⟨µ_P, µ_Q⟩
                 = E_P[µ_P(x)] + ...
                 = E_P ⟨µ_P(·), k(x, ·)⟩ + ...
                 = E_P k(x, x′) + E_Q k(y, y′) − 2 E_{P,Q} k(x, y).

Then each population expectation is replaced by an unbiased empirical average, e.g.

  Ê k(x, x′) = 1/(m(m−1)) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j).
MMD for independence: HSIC

[NIPS07a, ALT07, ALT08, JMLR10]; related to [Feuerverger, 1993] and [Székely and Rizzo, 2009, Székely et al., 2007]

Dependence measure: the MMD between the embedding of the joint and the embedding of the product of the marginals,
  HSIC(P_XY, P_X P_Y) := ‖µ_{P_XY} − µ_{P_X P_Y}‖²

[Figure: schematic of paired observations (X, Y).]

HSIC using expectations of kernels: define an RKHS F on X with kernel k, and an RKHS G on Y with kernel l. Then
  HSIC(P_XY, P_X P_Y) = E_{XY} E_{X′Y′} k(x, x′) l(y, y′) + E_X E_{X′} k(x, x′) E_Y E_{Y′} l(y, y′) − 2 E_{X′Y′} [E_X k(x, x′) E_Y l(y, y′)]
HSIC: empirical estimate and intuition

[Figure: paired samples (x_i, y_i), e.g. images and accompanying text; dependence shows up as shared similarity structure between the two Gram matrices.]

Empirical HSIC(P_XY, P_X P_Y):
  HSIC_b = (1/n²) (HKH ◦ HLH)_{++} = (1/n²) trace(KHLH),
where K_ij = k(x_i, x_j), L_ij = l(y_i, y_j), H = I − (1/n) 1 1⊤ is the centring matrix, and (·)_{++} sums all entries of the matrix.
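A minimal sketch of the empirical statistic HSIC_b = (1/n²) trace(KHLH), assuming Gaussian kernels on X and Y; the bandwidths and the toy data are illustrative choices.

```python
# Sketch: biased empirical HSIC_b = (1/n^2) trace(K H L H) with Gaussian kernels.
import numpy as np

def gaussian_gram(A, sigma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(A**2, 1)[None, :] - 2 * A @ A.T
    return np.exp(-sq / (2 * sigma**2))

def hsic_b(X, Y, sigma_x=1.0, sigma_y=1.0):
    n = X.shape[0]
    K = gaussian_gram(X, sigma_x)
    L = gaussian_gram(Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1))
Y = X + 0.5 * rng.normal(size=(300, 1))      # dependent toy data
print(hsic_b(X, Y))
```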
Characteristic kernels (Via Fourier, on the torus T)
Characteristic Kernels (via Fourier)

Reminder: a characteristic kernel is one for which MMD is a metric (MMD = 0 iff P = Q) [NIPS07b, JMLR10]. In the next slides: which kernels are characteristic?

Reminder: Fourier series
  f(x) = Σ_{ℓ=−∞}^{∞} f̂_ℓ exp(ıℓx) = Σ_{ℓ=−∞}^{∞} f̂_ℓ (cos(ℓx) + ı sin(ℓx)).

[Figure: top-hat function f(x) and its Fourier series coefficients f̂_ℓ.]
Reminder: Fourier series of a translation-invariant kernel k(x, y) = k(x − y) = k(z),
  k(z) = Σ_{ℓ=−∞}^{∞} k̂_ℓ exp(ıℓz).
E.g., the Gaussian-spectrum kernel on the torus,
  k(x) = (1/2π) ϑ(x/(2π), ıσ²/(2π)),    k̂_ℓ = (1/2π) exp(−σ²ℓ²/2),
where ϑ is the Jacobi theta function; k is close to a Gaussian when σ² is sufficiently narrower than [−π, π].

[Figure: the kernel k(x) and its Fourier series coefficients k̂_ℓ.]
Maximum mean embedding via Fourier series: let φ_{P,ℓ} denote the Fourier series coefficients (characteristic function) of P. Since the embedding is a convolution,
  µ_P(x) = E_{t∼P} k(x − t) = ∫_{−π}^{π} k(x − t) dP(t),
the convolution theorem gives
  µ̂_{P,ℓ} = k̂_ℓ × φ̄_{P,ℓ}.
Hence the witness µ_P − µ_Q has Fourier coefficients (φ̄_{P,ℓ} − φ̄_{Q,ℓ}) k̂_ℓ, and MMD(P, Q; F) = ‖µ_P − µ_Q‖_F.
A simpler Fourier expression for MMD

Recall the RKHS norm in terms of Fourier coefficients,
  ‖f‖²_F = ⟨f, f⟩_F = Σ_{ℓ=−∞}^{∞} |f̂_ℓ|² / k̂_ℓ.
Applying this to the witness µ_P − µ_Q, whose coefficients are (φ_{P,ℓ} − φ_{Q,ℓ}) k̂_ℓ,
  MMD²(P, Q; F) = Σ_{ℓ=−∞}^{∞} |(φ_{P,ℓ} − φ_{Q,ℓ}) k̂_ℓ|² / k̂_ℓ = Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ.
Example

[Figure: two densities P(x) and Q(x) on the torus, their Fourier series coefficients φ_{P,ℓ} and φ_{Q,ℓ}, and the characteristic function difference |φ_{P,ℓ} − φ_{Q,ℓ}|.]
Example

Is the Gaussian-spectrum kernel characteristic? YES: all its Fourier coefficients k̂_ℓ are strictly positive, so
  MMD²(P, Q; F) := Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ
vanishes only if φ_{P,ℓ} = φ_{Q,ℓ} for all ℓ, i.e. only if P = Q.

[Figure: the Gaussian-spectrum kernel and its (everywhere positive) Fourier series coefficients.]
Example

Is the triangle kernel characteristic? NO: some of its Fourier coefficients k̂_ℓ are zero, so two distributions that differ only at those frequencies give
  MMD²(P, Q; F) := Σ_{ℓ=−∞}^{∞} |φ_{P,ℓ} − φ_{Q,ℓ}|² k̂_ℓ = 0   even though P ≠ Q.

[Figure: the triangle kernel and its Fourier series coefficients, some of which vanish.]
Characteristic kernels (Via Fourier, on Rd)
Characteristic Kernels (via Fourier)

Characteristic function of P on R^d:
  φ_P(ω) = ∫_{R^d} e^{−i x⊤ω} dP(x).
Bochner's theorem: a continuous translation-invariant kernel can be written
  k(z) = ∫_{R^d} e^{−i z⊤ω} dΛ(ω),
– Λ a finite non-negative Borel measure.
The MMD in terms of characteristic functions:
  MMD²(P, Q; F) = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω),    φ_P the characteristic function of P.
Proof: using Bochner's theorem (a) and Fubini's theorem (b),
  MMD²(P, Q) = ∫∫_{R^d} k(x − y) d(P − Q)(x) d(P − Q)(y)
    (a) = ∫∫ [ ∫_{R^d} e^{−i(x−y)⊤ω} dΛ(ω) ] d(P − Q)(x) d(P − Q)(y)
    (b) = ∫ | ∫_{R^d} e^{−i x⊤ω} d(P − Q)(x) |² dΛ(ω)
        = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω).
Example

[Figure: densities P(X) and Q(X) on R, their characteristic function magnitudes |φ_P| and |φ_Q|, and the characteristic function difference |φ_P − φ_Q| as a function of frequency ω.]
Example

Which kernels detect the difference |φ_P − φ_Q|?
– Gaussian kernel: spectrum supported on all of R, so no difference can be hidden → Characteristic.
– Sinc kernel: spectrum supported on a bounded band, so differences outside that band are invisible → NOT characteristic.
– Triangle (B-spline) kernel: spectrum has isolated zeros only → Characteristic.

[Figure: the difference |φ_P − φ_Q| over frequency ω, compared with each kernel's spectrum.]
Summary: Characteristic Kernels

Characteristic kernel: MMD = 0 iff P = Q [NIPS07b, COLT08]

Main theorem: a translation-invariant k is characteristic for probability measures on R^d if and only if supp(Λ) = R^d (i.e. the spectrum vanishes on no open set, e.g. at most on a countable set of points). [COLT08, JMLR10]

Corollary: any continuous, compactly supported k is characteristic (since its Fourier spectrum Λ(ω) cannot be zero on an interval).

1-D proof sketch from [Mallat, 1999, Theorem 2.6]; proof on R^d via distribution theory in [Sriperumbudur et al., 2010, Corollary 10 p. 1535].
k characteristic iff supp(Λ) = Rd
Proof: supp{Λ} = R^d ⟹ k characteristic. Recall the Fourier definition of MMD:
  MMD²(P, Q) = ∫ |φ_P(ω) − φ_Q(ω)|² dΛ(ω).
Characteristic functions φ_P(ω) and φ_Q(ω) are uniformly continuous, hence their difference cannot be non-zero only on a countable set: if it is non-zero anywhere, it is non-zero on an open set, and Λ charges every open set, so the integral is positive.

Map φ_P uniformly continuous: ∀ε > 0, ∃δ > 0 such that ∀(ω₁, ω₂) ∈ Ω for which d(ω₁, ω₂) < δ, we have d(φ_P(ω₁), φ_P(ω₂)) < ε. Uniform: δ depends only on ε, not on ω₁, ω₂.
Proof: k characteristic ⟹ supp{Λ} = R^d. Proof by contrapositive: given supp{Λ} ⊊ R^d, there exists an open interval U on which Λ is zero. Construct densities p(x), q(x) such that φ_P, φ_Q differ only inside U; then MMD(P, Q) = 0 although P ≠ Q.
Further extensions
Characteristic kernels beyond R^d [Fukumizu et al., 2009]:
– Locally compact Abelian groups (periodic domains, as we saw)
– Compact, non-Abelian groups (orthogonal matrices)
– The semigroup R₊^n (histograms)
Related: other RKHS distances between distributions [Zhou and Chellappa, 2006] (not yet shown to establish whether P = Q), and energy distances (a special case of the kernel statistic, see the outline).
Statistical hypothesis testing
Motivating question: differences in brain signals. The problem: do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst.]
Statistical test using MMD (1)

Two hypotheses:
– H0: null hypothesis (P = Q)
– H1: alternative hypothesis (P ≠ Q)

Compute the empirical MMD and compare with zero:
– "far from zero": reject H0
– "close to zero": accept H0
Statistical test using MMD (2)

Unbiased empirical estimate of MMD²:
  MMD̂²_u = 1/(n(n−1)) Σ_{i≠j} [k(x_i, x_j) − k(x_i, y_j) − k(y_i, x_j) + k(y_i, y_j)]

Asymptotics when P ≠ Q: the estimate is asymptotically normal,
  √n (MMD̂²_u − MMD²) →_D N(0, σ²_u)    [Hoeffding, 1948, Serfling, 1980]
with U-statistic variance
  σ²_u = 4 ( E_z[(E_{z′} h(z, z′))²] − [E_{z,z′} h(z, z′)]² ),   z := (x, y).
Statistical test using MMD (3)

[Figure: empirical PDF of the MMD estimate under H1 and a Gaussian fit, for two Laplace distributions P_X, Q_X with different variances.]
Statistical test using MMD (4)

Asymptotics under H0 (P = Q): the statistic is degenerate, and its null distribution is an infinite weighted sum of χ² variables,
  n MMD̂²_u(x, y; F) →_D Σ_{l=1}^∞ λ_l (z_l² − 2)
– z_l ∼ N(0, 2) i.i.d.
– λ_i are the eigenvalues of the centred kernel k̃(x, x′):
  ∫ k̃(x, x′) ψ_i(x) dP(x) = λ_i ψ_i(x′)

[Figure: MMD density under H0 (n × MMD²): empirical PDF vs the χ² sum.]
Statistical test using MMD (5)

In terms of the pooled Gram matrix blocks,
  MMD̂² = K̄_{P,P} + K̄_{Q,Q} − 2 K̄_{P,Q},
writing K̄_{·,·} for the average of the entries of the corresponding block.

[Figure: MMD density under H0 and H1 (n × MMD²), with the (1 − α) null quantile marked; the alternative mass below the threshold is the Type II error.]

How to get the test threshold (the (1 − α) null quantile)? One option: fit the null CDF with a Pearson curve by matching moments [Arcones and Giné, 1992, Alba Fernández et al., 2008].

[Figure: CDF of the MMD under H0 and the Pearson curve fit.]
Approximate null distribution of MMD via permutation

Write the pooled Gram matrix in blocks [K_{P,P}, K_{P,Q}; K_{Q,P}, K_{Q,Q}] and let w be a vector of ±1 labels.

Empirical MMD: w = (1, 1, 1, ..., 1, −1, ..., −1, −1, −1)⊤ (the first n entries label the x-sample, the last n the y-sample),
  MMD̂² = (1/n²) w⊤ [K_{P,P}  K_{P,Q}; K_{Q,P}  K_{Q,Q}] w.

Permuted case [Alba Fernández et al., 2008]: w = (1, −1, 1, ..., 1, −1, ..., 1, −1, −1)⊤ (equal number of +1 and −1, assigned at random),
  (1/n²) w⊤ [K_{P,P}  K_{P,Q}; K_{Q,P}  K_{Q,Q}] w = ?
This simulates a draw of the statistic under H0; repeating over many permutations gives the null distribution.

Figure thanks to Kacper Chwialkowski.
Approximate null distribution of MMD² via permutation:
  MMD̂²_perm ≈ (1/n²) w⊤ [K_{P,P}  K_{P,Q}; K_{Q,P}  K_{Q,Q}] w,   w a random ±1 labelling with equal counts.

[Figure: MMD density under H0 (n × MMD²): true null PDF vs null PDF from permutation.]
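A sketch of the permutation procedure above: recompute the statistic under random relabellings of the pooled sample, and take the (1 − α) empirical quantile as the test threshold. The biased block-sum form of MMD² is used; the kernel, α and the number of permutations are illustrative choices.

```python
# Sketch: permutation approximation to the null distribution of the (biased) MMD^2.
import numpy as np

def mmd2_biased_from_gram(K, n):
    """Biased MMD^2 from the pooled (2n x 2n) Gram matrix; first n rows = x-sample."""
    w = np.concatenate([np.ones(n), -np.ones(n)]) / n
    return w @ K @ w

def permutation_threshold(K, n, alpha=0.05, n_perm=500, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    stats = []
    for _ in range(n_perm):
        idx = rng.permutation(2 * n)                     # random relabelling
        stats.append(mmd2_biased_from_gram(K[np.ix_(idx, idx)], n))
    return np.quantile(stats, 1 - alpha)

# toy usage with a Gaussian kernel on the pooled data Z = [X; Y]
rng = np.random.default_rng(1)
X, Y = rng.normal(0, 1, (100, 1)), rng.normal(0.3, 1, (100, 1))
Z = np.vstack([X, Y])
sq = np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T
K = np.exp(-sq / 2.0)
stat = mmd2_biased_from_gram(K, 100)
print(stat, stat > permutation_threshold(K, 100))        # True = reject H0
```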
Detecting differences in brain signals

Do local field potential (LFP) signals change when measured near a spike burst?

[Figure: LFP amplitude over time, near a spike burst and without a spike burst.]

Neuro data: consistent test w/o permutation
  MMD²(P, Q; F) := ‖µ_P − µ_Q‖²_F
Is the empirical MMD significantly > 0? Null distribution:
  n MMD̂² →_D Σ_{l=1}^∞ λ_l (z_l² − 2),
– λ_l is the l-th eigenvalue of the centred kernel k̃(x_i, x_j).
Use the Gram matrix spectrum for λ̂_l: consistent test without permutation.

[Figure: Type II error vs sample size m on the neuro data (P ≠ Q), comparing the spectral and permutation tests.]
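A sketch of the spectral approach: estimate the λ_l from the spectrum of the centred pooled Gram matrix and simulate the limiting law Σ_l λ_l (z_l² − 2). The eigenvalue normalisation shown (dividing by the pooled sample size) is a common recipe and should be treated as an assumption here, not as the slides' exact prescription.

```python
# Sketch: spectral approximation to the null distribution of n * MMD^2.
import numpy as np

def spectral_null_samples(K, n_samples=2000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    m2 = K.shape[0]                                   # pooled sample size (2n)
    H = np.eye(m2) - np.ones((m2, m2)) / m2
    lam = np.linalg.eigvalsh(H @ K @ H) / m2          # estimated eigenvalues lambda_l
    lam = lam[lam > 1e-12]                            # drop numerically zero eigenvalues
    z = rng.normal(0.0, np.sqrt(2.0), size=(n_samples, lam.size))   # z_l ~ N(0, 2)
    return (z**2 - 2.0) @ lam                         # draws of sum_l lambda_l (z_l^2 - 2)

# usage: null_draws = spectral_null_samples(K); threshold = np.quantile(null_draws, 0.95)
# then compare n * MMD^2 against the threshold.
```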
Hypothesis testing with HSIC
Distribution of HSIC at independence

Empirical statistic: HSIC_b = (1/n²) trace(KHLH)
– Statistical testing: how do we find when this is large enough that the null hypothesis P = P_x P_y is unlikely?
– Formally: given P = P_x P_y, what is the threshold T such that P(HSIC_b > T) < α for small α?

Under H0 the scaled statistic converges to a weighted sum of χ² variables,
  n HSIC_b →_D Σ_{l=1}^∞ λ_l z_l²,    z_l ∼ N(0, 1) i.i.d.,
where the λ_l solve
  λ_l ψ_l(z_j) = ∫ h_{ijqr} ψ_l(z_i) dF_{i,q,r},
with the symmetrised kernel
  h_{ijqr} = (1/4!) Σ_{(t,u,v,w)}^{(i,j,q,r)} [k_{tu} l_{tu} + k_{tu} l_{vw} − 2 k_{tu} l_{tv}].

Moments (used below for a Gamma approximation to the null):
  E(HSIC_b) = (1/n) Tr(C_xx) Tr(C_yy),
  var(HSIC_b) = [2(n − 4)(n − 5) / (n)₄] ‖C_xx‖²_HS ‖C_yy‖²_HS + O(n⁻³),
where (n)₄ = n(n−1)(n−2)(n−3).
Statistical testing with HSIC

How do we find the threshold T such that P(HSIC_b > T) < α for small α?

Approach 1: permutation (a code sketch follows below)
– Compute HSIC for {x_i, y_{π(i)}}_{i=1}^n for a random permutation π of the indices {1, ..., n}. This gives HSIC for independent variables.
– Repeat for many different permutations, get the empirical CDF.
– Threshold T is the (1 − α) quantile of the empirical CDF.

Approach 2: Gamma approximation to the null distribution,
  n HSIC_b ∼ x^{α−1} e^{−x/β} / (β^α Γ(α)),   where α = (E(HSIC_b))² / var(HSIC_b),   β = n var(HSIC_b) / E(HSIC_b).
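A sketch of the permutation approach (Approach 1): recompute HSIC_b with the y-sample randomly permuted, and use the (1 − α) quantile of these values as the threshold. The estimator is passed in as a function (for example the hsic_b sketch shown earlier); α and the number of permutations are illustrative choices.

```python
# Sketch: permutation test for independence using an HSIC estimator hsic_fn(X, Y).
import numpy as np

def hsic_permutation_test(X, Y, hsic_fn, alpha=0.05, n_perm=500, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    stat = hsic_fn(X, Y)
    # permuting Y breaks the pairing, simulating the null P = Px * Py
    null = [hsic_fn(X, Y[rng.permutation(len(Y))]) for _ in range(n_perm)]
    threshold = np.quantile(null, 1 - alpha)
    return stat, threshold, stat > threshold     # True = reject independence
```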
Experiment: dependence testing for translation. Are the French text extracts translations of the English ones?

X1: Honourable senators, I have a question for the Leader of the Government in the Senate with regard to the support funding to farmers that has been announced. Most farmers have not received any money yet.

Y1: Honorables sénateurs, ma question s'adresse au leader du gouvernement au Sénat et concerne l'aide financière qu'on a annoncée pour les agriculteurs. La plupart des agriculteurs n'ont encore rien reçu de cet argent.

X2: No doubt there is great pressure on provincial and municipal governments in relation to the issue of child care, but the reality is that there have been no cuts to child care funding from the federal government to the provinces. In fact, we have increased federal investments for early childhood development.

Y2: Il est évident que les ordres de gouvernements provinciaux et municipaux subissent de fortes pressions en ce qui concerne les services de garde, mais le gouvernement n'a pas réduit le financement qu'il verse aux provinces pour les services de garde. Au contraire, nous avons augmenté le financement fédéral pour le développement des jeunes enfants.

· · ·
Experiment: dependence testing for translation

  HSIC_b = (1/n²) trace(KHLH)    [NIPS07b]

Canadian Hansard corpus (agriculture topic); k-spectrum kernel on strings, k = 10; repetitions = 300; sample size 10. The Gram matrix K on the English extracts and the Gram matrix L on the French extracts feed into HSIC.
Kernel two-sample tests for big data, optimal kernel choice
Quadratic time estimate of MMD

  MMD² = ‖µ_P − µ_Q‖²_F = E_P k(x, x′) + E_Q k(y, y′) − 2 E_{P,Q} k(x, y)

Given i.i.d. samples X := {x_1, ..., x_m} and Y := {y_1, ..., y_m} from P, Q respectively, the earlier estimate averages over all pairs (quadratic time), e.g.
  Ê_P k(x, x′) = 1/(m(m − 1)) Σ_{i=1}^m Σ_{j≠i} k(x_i, x_j).

New, linear time estimate: average over non-overlapping pairs,
  Ê_P k(x, x′) = (2/m) [k(x_1, x_2) + k(x_3, x_4) + ...] = (2/m) Σ_{i=1}^{m/2} k(x_{2i−1}, x_{2i}).
Linear time MMD

Shorter expression with explicit dependence on k:
  MMD² =: η_k(p, q) = E_{xx′yy′} h_k(x, x′, y, y′) =: E_v h_k(v),
where h_k(x, x′, y, y′) = k(x, x′) + k(y, y′) − k(x, y′) − k(x′, y), and v := [x, x′, y, y′].

The linear time estimate again:
  η̌_k = (2/m) Σ_{i=1}^{m/2} h_k(v_i),
where v_i := [x_{2i−1}, x_{2i}, y_{2i−1}, y_{2i}] and h_k(v_i) := k(x_{2i−1}, x_{2i}) + k(y_{2i−1}, y_{2i}) − k(x_{2i−1}, y_{2i}) − k(x_{2i}, y_{2i−1}).
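A minimal sketch of the linear-time estimate η̌_k, pairing consecutive observations; the Gaussian kernel, bandwidth and toy data are illustrative choices.

```python
# Sketch: linear-time MMD estimate eta_check = (2/m) * sum_i h_k(v_i).
import numpy as np

def gaussian_k(a, b, sigma=1.0):
    """Gaussian kernel evaluated row-wise on paired points a, b of shape (n, d)."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2 * sigma**2))

def mmd2_linear(X, Y, sigma=1.0):
    m = (min(len(X), len(Y)) // 2) * 2           # use an even number of points
    x1, x2 = X[0:m:2], X[1:m:2]
    y1, y2 = Y[0:m:2], Y[1:m:2]
    h = (gaussian_k(x1, x2, sigma) + gaussian_k(y1, y2, sigma)
         - gaussian_k(x1, y2, sigma) - gaussian_k(x2, y1, sigma))
    return 2.0 * h.sum() / m                     # i.e. the mean of the m/2 terms h_k(v_i)

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (10000, 1))
Y = rng.normal(0.05, 1.0, (10000, 1))
print(mmd2_linear(X, Y))
```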
Linear time vs quadratic time MMD

Disadvantages of the linear time MMD vs the quadratic time MMD:
– higher variance for a given sample size m, hence lower power.

Advantages of the linear time MMD vs the quadratic time MMD:
– simple asymptotic null distribution (Gaussian, vs an infinite weighted sum of χ²)
– O(m) computation and O(1) memory: the data need not be stored.
Asymptotics of linear time MMD

By the central limit theorem,
  m^{1/2} (η̌_k − η_k(p, q)) →_D N(0, 2σ²_k),
assuming E_v h²_k(v) < ∞ (true for bounded k), where
  σ²_k = E_v h²_k(v) − [E_v h_k(v)]².
Hypothesis test

Hypothesis test of asymptotic level α: reject H0 when η̌_k > t_{k,α}, where
  t_{k,α} = m^{−1/2} σ_k √2 Φ^{−1}(1 − α)
and Φ^{−1} is the inverse CDF of N(0, 1).

[Figure: null distribution of η̌_k (linear time), with the (1 − α) quantile t_{k,α} and Type I error marked; null vs alternative distribution of η̌_k, with the Type II error as the alternative mass below t_{k,α}.]
The best kernel: minimizes Type II error

A Type II error occurs when η̌_k falls below the threshold t_{k,α} although η_k(p, q) > 0. Asymptotically,
  P(η̌_k < t_{k,α}) = Φ( Φ^{−1}(1 − α) − η_k(p, q) √m / (σ_k √2) ).

Since Φ is monotonic, the best kernel choice to minimize the Type II error probability is
  k* = arg max_{k∈K} η_k(p, q) σ_k^{−1},
where K is the family of kernels under consideration.
Learning the best kernel in a family

Define the family of kernels as follows:
  K := { k = Σ_{u=1}^d β_u k_u : ‖β‖₁ = D, β_u ≥ 0, ∀u ∈ {1, ..., d} }
Properties: k is characteristic if at least one β_u > 0 (given characteristic base kernels k_u).

Test statistic

The squared MMD becomes
  η_k(p, q) = ‖µ_k(p) − µ_k(q)‖²_{F_k} = Σ_{u=1}^d β_u η_u(p, q),   where η_u(p, q) := E_v h_u(v).
Denote:
– h = (h_1, ..., h_d)⊤, with h_u(x, x′, y, y′) = k_u(x, x′) + k_u(y, y′) − k_u(x, y′) − k_u(x′, y)
– η = (η_1, ..., η_d)⊤

Quantities for the test:
  η_k(p, q) = E(β⊤h) = β⊤η,    σ²_k := β⊤ cov(h) β.
Optimization of the ratio η_k(p, q) σ_k^{−1}

Empirical test parameters:
  η̂_k = β⊤η̂,    σ̂_{k,λ} = ( β⊤(Q̂ + λ_m I) β )^{1/2},
where Q̂ is the empirical estimate of cov(h). Note: η̂_k, σ̂_{k,λ} are computed on training data, vs η̌_k, σ̌_k on the data to be tested (why? selecting the kernel on the data used for the test would bias the null distribution).

Objective:
  β̂* = arg max_{β⪰0} η̂_k σ̂_{k,λ}^{−1} = arg max_{β⪰0} β⊤η̂ ( β⊤(Q̂ + λ_m I) β )^{−1/2} =: α(β; η̂, Q̂)
Assume η̂ has at least one positive entry. Then there exists β ⪰ 0 such that α(β; η̂, Q̂) > 0, and thus α(β̂*; η̂, Q̂) > 0.

Solve the easier problem β̂* = arg max_{β⪰0} α²(β; η̂, Q̂), which is a quadratic program (a code sketch follows below):
  min { β⊤(Q̂ + λ_m I) β : β⊤η̂ = 1, β ⪰ 0 }.

What if η̂ has no positive entries? Then no β ⪰ 0 gives a positive ratio, and a kernel is chosen at random (see the test procedure below).
Test procedure

Step 1 (on the training data):
(a) Compute η̂_u for all k_u ∈ K
(b) If at least one η̂_u > 0, solve the QP to get β*; else choose a random kernel from K

Step 2 (on the test data):
(a) Compute η̌_{k*} using k* = Σ_{u=1}^d β*_u k_u
(b) Compute the test threshold ť_{α,k*} using σ̌_{k*}
(c) Reject H0 if η̌_{k*} > ť_{α,k*}
Convergence bounds

Assume a bounded kernel and σ_k bounded away from 0. If λ_m = Θ(m^{−1/3}), then the empirically selected ratio converges (in probability) to the best attainable one:
  sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} → 0.

Idea:
  | sup_{k∈K} η̂_k σ̂_{k,λ}^{−1} − sup_{k∈K} η_k σ_k^{−1} |
    ≤ sup_{k∈K} | η̂_k σ̂_{k,λ}^{−1} − η_k σ_{k,λ}^{−1} | + sup_{k∈K} | η_k σ_{k,λ}^{−1} − η_k σ_k^{−1} |,
and the first term is controlled, up to a factor of order √d / (D √λ_m), by
  C₁ sup_{k∈K} |η̂_k − η_k| + C₂ sup_{k∈K} |σ̂_{k,λ} − σ_{k,λ}|.
Experiments

Competing approaches for kernel choice:
– maxmmd: choose the single base kernel with the largest η̂_u (the same as maximizing β⊤η̂ subject to ‖β‖₁ ≤ 1)
– l2: maximize β⊤η̂ subject to ‖β‖₂ ≤ 1
Also compare with:
– med: median heuristic for the kernel k
– xval, xvalc: cross-validation based choices
Blobs: data

Difficult problems: the lengthscale of the difference between the distributions is not the same as the lengthscale of the distributions themselves. We distinguish a field of Gaussian blobs with different covariances.

[Figure: blob data from p and from q; ε = 3.2 is the ratio of largest to smallest eigenvalues of the blobs in q.]
Blobs: results

[Figure: Type II error vs the eigenvalue ratio ε for max-ratio (optimize η_k(p, q) σ_k^{−1}), l2, maxmmd (maximize η_k(p, q) with a norm constraint on β), xval, xvalc, and med (median heuristic).]

Parameters: m = 10,000 (for training and test); ε is the ratio of largest to smallest eigenvalues of the blobs in q. Results are averages over 617 trials.
Feature selection: data

Idea: there is no single best kernel. Each of the k_u is univariate (along a single coordinate).

[Figure: selection data, p vs q in two coordinates.]

Feature selection: results

[Figure: Type II error vs feature selection dimension for max-ratio (linear combination), l2, and maxmmd (single best kernel); m = 10,000, average over 5000 trials.]
Amplitude modulated signals

Given an audio signal s(t), an amplitude modulated signal can be defined as
  u(t) = sin(ω_c t) [a s(t) + l],
with scaling a and offset l. Two amplitude modulated signals from the same artist (in this case, Magnetic Fields) give the samples from P and from Q.

Results: AM signals

[Figure: Type II error vs added noise for max-ratio, med, l2, maxmmd; m = 10,000 (for training and test), scaling a = 0.5, average over 4124 trials.]
Observations on kernel choice

Kernel choice matters for the two-sample test, especially when the distributions differ on a lengthscale different to that of the data.
Open directions:
– kernel choice for the quadratic time statistic
– avoiding the training/test split

Summary

Kernel distances between distributions handle:
– high dimensionality
– non-Euclidean data (strings, graphs)
– Nonparametric hypothesis tests
Characteristic kernels:
– Easy to check: does the spectrum cover R^d?
Co-authors
– Luca Baldassarre – Steffen Grünewälder – Guy Lever – Sam Patterson – Massimiliano Pontil – Dino Sejdinovic
– Karsten Borgwardt, MPI – Wicher Bergsma, LSE – Kenji Fukumizu, ISM – Zaid Harchaoui, INRIA – Bernhard Schoelkopf, MPI – Alex Smola, CMU/Google – Le Song, Georgia Tech – Bharath Sriperumbudur, Cambridge
Selected references
Characteristic kernels and mean embeddings:
– B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf (2010). Hilbert space embeddings and metrics on probability measures. JMLR.
Two-sample, independence, conditional independence tests:
Energy distance, relation to kernel distances
– D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics.
Three way interaction
Selected references (continued)
Conditional mean embedding, RKHS-valued regression:
Estimation, NIPS.
Foundations of Computational Mathematics.
– Conditional mean embeddings as regressors. ICML.
Kernel Bayes rule:
kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
kernels, JMLR
Local departures from the null
What is a hard testing problem?
[Figure: samples from P and Q.]
Local departures from the null

What is a hard testing problem? Take departures from the null of the form
  f_Q = f_P + δ · g,
where g is some fixed function such that f_Q is a valid density.
– If δ ∼ m^{−1/2}, the Type II error approaches a constant.

More general local departures from the null

[Figure: density P(X) vs a sequence of perturbed densities Q(X).]

General characterization of local departures from H0, in terms of the distribution embedding: perturb µ_P by an RKHS function g_m with
  ‖g_m‖_F = c m^{−1/2}.
More general local departures from the null

[Figure: density P(X) vs a sequence of perturbed densities Q(X).]
Kernels vs kernels

Comparison with the L2 distance between Parzen window (kernel density) estimates [Anderson et al., 1994]:
  f̂_P(x) = (1/m) Σ_{i=1}^m κ(x_i − x),   where κ satisfies ∫ κ(x) dx = 1 and κ(x) ≥ 0.

The squared L2 distance between the two Parzen window estimates is an MMD with a convolved kernel:
  D₂(f̂_P, f̂_Q)² = ∫ [ (1/m) Σ_{i=1}^m κ(x_i − z) − (1/m) Σ_{i=1}^m κ(y_i − z) ]² dz
                 = (1/m²) Σ_{i,j} k(x_i − x_j) + (1/m²) Σ_{i,j} k(y_i − y_j) − (2/m²) Σ_{i,j} k(x_i − y_j),
  where k(x − y) = ∫ κ(x − z) κ(y − z) dz.

Difference: for the Parzen-window statistic the width h_m of κ must shrink with m, and the detectable local departure is then of size
  δ = m^{−1/2} h_m^{−d/2},   where h_m is the width of κ
(compare with the m^{−1/2} local departures discussed above).
Characteristic Kernels (via universality)
Characteristic: MMD a metric (MMD = 0 iff P = Q) [NIPS07b, COLT08]

Classical result: P = Q if and only if E_P(f(x)) = E_Q(f(y)) for all f ∈ C(X), the space of bounded continuous functions on X [Dudley, 2002].

Universal RKHS: k(x, x′) continuous, X compact, and F dense in C(X) with respect to L∞ [Steinwart, 2001].

If F is universal, then MMD{P, Q; F} = 0 iff P = Q.
Characteristic Kernels (via universality)

Proof: First, it is clear that P = Q implies MMD{P, Q; F} = 0.
Converse: by the universality of F, for any given ε > 0 and f ∈ C(X) there exists g ∈ F with
  ‖f − g‖_∞ ≤ ε.
We next make the expansion
  |E_P f(x) − E_Q f(y)| ≤ |E_P f(x) − E_P g(x)| + |E_P g(x) − E_Q g(y)| + |E_Q g(y) − E_Q f(y)|.
The first and third terms satisfy
  |E_P f(x) − E_P g(x)| ≤ E_P |f(x) − g(x)| ≤ ε.
Next, write
  E_P g(x) − E_Q g(y) = ⟨g(·), µ_P − µ_Q⟩_F = 0,
since MMD{P, Q; F} = 0 implies µ_P = µ_Q. Hence |E_P f(x) − E_Q f(y)| ≤ 2ε for all f ∈ C(X) and ε > 0, which implies P = Q.
References
V. Alba Fernández, M. Jiménez-Gamero, and J. Muñoz García. A test for the two-sample problem based on empirical characteristic functions. Computational Statistics and Data Analysis, 2008.
N. Anderson, P. Hall, and D. Titterington. Two-sample test statistics for measuring discrepancies between two multivariate probability density functions using kernel-based density estimates. Journal of Multivariate Analysis, 50:41–54, 1994.
R. M. Dudley. Real Analysis and Probability. Cambridge University Press, Cambridge, UK, 2002.
Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.
K. Fukumizu, B. Sriperumbudur, A. Gretton, and B. Schölkopf. Characteristic kernels on groups and semigroups. In Advances in Neural Information Processing Systems 21, pages 473–480, Red Hook, NY, 2009. Curran Associates Inc.
Z. Harchaoui, F. Bach, and E. Moulines. Testing for homogeneity with kernel Fisher discriminant analysis. In Advances in Neural Information Processing Systems 20, pages 609–616. MIT Press, Cambridge, MA, 2008.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13–30, 1963.
Wassily Hoeffding. A class of statistics with asymptotically normal distribution. The Annals of Mathematical Statistics, 19(3):293–325, 1948.
Consistent Testing of Total Independence Based on the Empirical Characteristic Function. PhD thesis, University of Jyväskylä, 1995.
S. Mallat. A Wavelet Tour of Signal Processing. Academic Press, 2nd edition, 1999.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, 1989.
R. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, New York, 1980.
B. Sriperumbudur, A. Gretton, K. Fukumizu, G. Lanckriet, and B. Schölkopf. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
I. Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2:67–93, 2001.
Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer, 2008.
G. Székely and M. Rizzo. Brownian distance covariance. Annals of Applied Statistics, 4(3):1233–1303, 2009.
G. Székely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Annals of Statistics, 35(6):2769–2794, 2007.
S. K. Zhou and R. Chellappa. Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):917–929, 2006.