Recent Advances in Hilbert Space Representation of Probability Distributions

Krikamol Muandet
Max Planck Institute for Intelligent Systems, Tübingen, Germany

RegML 2020, Genova, Italy
July 1, 2020

Reference: Muandet, Fukumizu, Sriperumbudur, and Schölkopf. Kernel Mean Embedding of Distributions: A Review and Beyond. Foundations and Trends in Machine Learning, 2017.
The kernel trick

[Figure: data in input space (x1, x2), where the two classes (+1, −1) are not linearly separable, and the same data in feature space (ϕ1, ϕ2, ϕ3), where they are.]

Consider the feature map φ(x) = (x_1², x_2², √2 x_1 x_2). The inner product in feature space can be computed directly in the input space:

⟨φ(x), φ(z)⟩ = x_1² z_1² + x_2² z_2² + 2(x_1 x_2)(z_1 z_2) = (x_1 z_1 + x_2 z_2)² = (x · z)².
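A quick numerical check of this identity (a minimal sketch in NumPy; the feature map phi below is the one from the slide):

```python
import numpy as np

def phi(x):
    # Explicit feature map: phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

def poly_kernel(x, z):
    # Degree-2 polynomial kernel: k(x, z) = (x . z)^2
    return np.dot(x, z) ** 2

x = np.array([0.3, -1.2])
z = np.array([0.7, 0.5])

# Both prints agree: <phi(x), phi(z)> = (x . z)^2
print(np.dot(phi(x), phi(z)))
print(poly_kernel(x, z))
```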
A two-layer neural network¹ computes

f(x) = w_2 σ(w_1^⊤ x + b_1) + b_2,

where σ is a nonlinear activation function.

1 Rosenblatt 1958; Minsky and Papert 1969.
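As a toy illustration of this formula (a sketch assuming tanh as the activation and random weights, neither of which is specified on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 3, 5                                   # input and hidden widths
W1, b1 = rng.normal(size=(h, d)), rng.normal(size=h)
w2, b2 = rng.normal(size=h), rng.normal()

def f(x):
    # f(x) = w2 . sigma(W1 x + b1) + b2, with sigma = tanh elementwise
    return w2 @ np.tanh(W1 @ x + b1) + b2

print(f(np.array([1.0, -0.5, 2.0])))
```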
More generally, a one-hidden-layer network with n units computes

f(x) = Σ_{i=1}^n c_i σ(w_i^⊤ x + b_i).
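Functions of this form also arise from random features: with σ = cos and randomly drawn (w_i, b_i), inner products of the resulting feature vectors approximate a shift-invariant kernel (Rahimi and Recht, 2007). A minimal sketch for the Gaussian kernel k(x, z) = exp(−γ‖x − z‖²); the bandwidth and feature count below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feat, gamma = 2, 2000, 0.5

# Spectral sampling for the Gaussian kernel exp(-gamma * ||x - z||^2):
# frequencies w_i ~ N(0, 2*gamma*I), phases b_i ~ Uniform[0, 2*pi]
W = rng.normal(scale=np.sqrt(2 * gamma), size=(n_feat, d))
b = rng.uniform(0, 2 * np.pi, size=n_feat)

def z_feat(x):
    # Random Fourier features: z(x) . z(y) ~ k(x, y)
    return np.sqrt(2.0 / n_feat) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
print(z_feat(x) @ z_feat(y))                  # approximate kernel value
print(np.exp(-gamma * np.sum((x - y) ** 2)))  # exact kernel value
```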
[Figure: the learning-from-distributions setting. In distribution space, training data are drawn from distributions P¹_XY, P²_XY, ..., P^N_XY; in input space, each training set consists of samples (X_k, Y_k), k = 1, ..., n_i, while unseen test data come from a new distribution P_X with samples X_k, k = 1, ..., n.]
Empirical estimation. Given an i.i.d. sample {x_1, ..., x_n} from P, the kernel mean embedding is estimated by³ ⁴

μ̂_P = (1/n) Σ_{i=1}^n k(x_i, ·) ∈ H,

i.e., the embedding of the empirical measure P̂ = (1/n) Σ_{i=1}^n δ_{x_i}. By the reproducing property, the expectation of any f ∈ H can be estimated as Ê_P[f(X)] = ⟨f, μ̂_P⟩_H.

3 Tolstikhin et al. Minimax Estimation of Kernel Mean Embeddings. JMLR, 2017.
4 Muandet et al. Kernel Mean Shrinkage Estimators. JMLR, 2016.
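A small sketch of this estimator (NumPy, Gaussian kernel; the data and bandwidth are illustrative):

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    # Gaussian kernel matrix: K[i, j] = exp(-gamma * ||X_i - Z_j||^2)
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, size=(500, 1))        # sample {x_i} ~ P

# mu_hat(t) = (1/n) sum_i k(x_i, t): the empirical embedding at points t
t = np.linspace(-3.0, 5.0, 9)[:, None]
print(rbf(t, X).mean(axis=1))

# Reproducing property with f = k(x0, .): <f, mu_hat>_H = mu_hat(x0)
x0 = np.array([[0.0]])
print(rbf(x0, X).mean())
```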
In practice, each distribution is observed only through a finite sample: for each i, we see {x_j^(i)}_j ∼ P_i(X). Learning then proceeds from these bags of samples, for instance via their mean embeddings, as in the sketch below.
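A minimal sketch of distribution regression with mean embeddings (NumPy; the bag sizes, kernels, and ridge parameter are illustrative, and the label of each bag is taken to be the mean of its generating distribution):

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)

# Each example is a bag {x_j^(i)} ~ P_i(X); the label is a property of P_i.
mus = rng.uniform(-2, 2, size=20)
bags = [rng.normal(m, 1.0, size=(50, 1)) for m in mus]

# Linear kernel on embeddings: K[i, j] = <mu_hat_i, mu_hat_j>_H,
# estimated by averaging the cross kernel matrix of bags i and j.
K = np.array([[rbf(a, c).mean() for c in bags] for a in bags])

# Kernel ridge regression from distributions to labels.
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(bags)), mus)
test_bag = rng.normal(1.5, 1.0, size=(50, 1))
k_test = np.array([rbf(test_bag, c).mean() for c in bags])
print(k_test @ alpha)   # prediction; should be near 1.5
```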
Maximum mean discrepancy (MMD). The MMD is the RKHS distance between mean embeddings:

MMD²(P, Q, H) = ‖μ_P − μ_Q‖²_H = ‖μ_P‖²_H − 2⟨μ_P, μ_Q⟩_H + ‖μ_Q‖²_H.

Equivalently, it is an integral probability metric:

MMD(P, Q, H) = sup_{h∈H, ‖h‖≤1} |E_P[h(X)] − E_Q[h(Y)]|.

Given samples {x_i}_{i=1}^n ∼ P and {y_j}_{j=1}^m ∼ Q, the unbiased empirical MMD is

MMD²_u(P, Q, H) = 1/(n(n−1)) Σ_{i≠j} k(x_i, x_j) + 1/(m(m−1)) Σ_{i≠j} k(y_i, y_j) − 2/(nm) Σ_{i=1}^n Σ_{j=1}^m k(x_i, y_j).

Two-sample testing: given {x_i}_{i=1}^n ∼ P and {y_j}_{j=1}^n ∼ Q, check whether P = Q, using MMD²_u(P, Q, H) as the test statistic.
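The estimator and a permutation two-sample test, sketched in NumPy (Gaussian kernel; the bandwidth and number of permutations are illustrative):

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2_u(X, Y, gamma=1.0):
    # Unbiased MMD^2: off-diagonal means of Kxx and Kyy, full mean of Kxy
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = rbf(X, X, gamma), rbf(Y, Y, gamma), rbf(X, Y, gamma)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))
Y = rng.normal(0.5, 1.0, size=(200, 1))
stat = mmd2_u(X, Y)

# Permutation test: re-split the pooled sample to simulate the null P = Q.
pooled, null = np.vstack([X, Y]), []
for _ in range(200):
    p = rng.permutation(len(pooled))
    null.append(mmd2_u(pooled[p[:200]], pooled[p[200:]]))
print(stat, np.mean(np.array(null) >= stat))   # statistic and p-value
```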
[Figure: generative adversarial networks. Random noise z is passed through a generator Gθ to produce synthetic samples Gθ(z); a discriminator Dφ is trained (min_G max_D) to decide whether its input is real data x or synthetic data Gθ(z). In the MMD variant, the discriminator is replaced by an MMD test between real data {x_i} and synthetic data {Gθ(z_i)}, with statistic ‖μ̂_X − μ̂_{Gθ(Z)}‖.]
MMD GANs train the generator by minimizing the MMD between the data and model distributions:

min_θ ‖μ_X − μ_{Gθ(Z)}‖_H = min_θ sup_{h∈H, ‖h‖≤1} |E_X[h(X)] − E_Z[h(Gθ(Z))]|.
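A toy version of this idea (a sketch assuming PyTorch; the biased V-statistic estimate of MMD² serves as a differentiable loss, and the "generator" is just a learned shift Gθ(z) = z + θ):

```python
import torch

def mmd2_biased(X, Y, gamma=1.0):
    # Biased (V-statistic) MMD^2; differentiable w.r.t. X and Y.
    Kxx = torch.exp(-gamma * torch.cdist(X, X) ** 2)
    Kyy = torch.exp(-gamma * torch.cdist(Y, Y) ** 2)
    Kxy = torch.exp(-gamma * torch.cdist(X, Y) ** 2)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

torch.manual_seed(0)
data = torch.randn(256, 1) + 2.0              # "real" data with mean 2
theta = torch.zeros(1, requires_grad=True)    # generator parameter
opt = torch.optim.Adam([theta], lr=0.05)

for _ in range(300):
    z = torch.randn(256, 1)
    loss = mmd2_biased(data, z + theta)       # G_theta(z) = z + theta
    opt.zero_grad(); loss.backward(); opt.step()

print(theta.item())   # should approach 2.0
```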
Conditional mean embedding. The conditional distribution P(Y | X = x) is embedded as

C_YX C_XX^{-1} k(x, ·) =: μ_{Y|x},

where C_YX and C_XX denote the (cross-)covariance operators. In general, the inverse of C_XX does not exist. Hence, we often use the regularized version C_YX (C_XX + λI)^{-1} k(x, ·). Its empirical counterpart from n pairs (x_i, y_i) takes the form μ̂_{Y|x} = Σ_{i=1}^n β_i(x) ℓ(y_i, ·) with β(x) = (K + nλI)^{-1} k_x.
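A sketch of the empirical conditional mean embedding (NumPy; taking g(y) = y, the code estimates E[Y | X = x] for a noisy sine; bandwidth and regularization are illustrative):

```python
import numpy as np

def rbf(X, Z, gamma=1.0):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, lam = 500, 1e-3
X = rng.uniform(-3, 3, size=(n, 1))
Y = np.sin(X) + 0.1 * rng.normal(size=(n, 1))

# beta(x) = (K + n*lam*I)^{-1} k_x for the query point x = 1
K = rbf(X, X)
beta = np.linalg.solve(K + n * lam * np.eye(n), rbf(X, np.array([[1.0]])))

# E_hat[g(Y) | X = x] = sum_i beta_i(x) g(y_i), here with g(y) = y
print((beta[:, 0] * Y[:, 0]).sum())   # estimate of E[Y | X = 1]
print(np.sin(1.0))                    # ground truth, ~0.841
```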
Application: counterfactual distributions. Consider the difference P_{Y*_0}(·) − P_{Y*_1}(·), where Y*_0 and Y*_1 are potential outcomes of a treatment policy T. In observational data, each unit yields a sample from only one of P_{Y*_0} or P_{Y*_1}, so these distributions must be estimated rather than observed directly.
The embeddings of the counterfactual distributions can be estimated via the conditional operator C_YX C_XX^{-1} k(x, ·) applied within each treatment group.
Conditional moment restrictions of the form E[ψ(X; θ_0) | Z] = 0 can be reformulated as a maximum moment restriction: for a sufficiently rich class F of test functions, the restriction holds if and only if the moment E[ψ(X; θ)^⊤ f(Z)] vanishes for all f ∈ F, and it suffices to take the supremum over the unit ball {f ∈ F : ‖f‖ ≤ 1}.
Given samples {(x_i, z_i)}_{i=1}^n from P(X, Z), we aim to estimate θ_0 by

θ̂ = argmin_{θ∈Θ} sup_{f∈F, ‖f‖≤1} ( (1/n) Σ_{i=1}^n ψ(x_i; θ)^⊤ f(z_i) )².

When F is an RKHS with kernel k, the inner supremum has a closed form, giving the empirical risk R̂(θ) = (1/n²) Σ_{i,j} ψ(x_i; θ)^⊤ ψ(x_j; θ) k(z_i, z_j).
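For a linear residual ψ((x, y); θ) = y − θx, this risk is quadratic in θ, so the minimizer is available in closed form. A sketch on a simple confounded instrumental-variable problem (NumPy; the data-generating process and bandwidth are illustrative):

```python
import numpy as np

def rbf(z, w, gamma=1.0):
    return np.exp(-gamma * (z[:, None] - w[None, :]) ** 2)

rng = np.random.default_rng(0)
n, theta0 = 2000, 1.5
z = rng.normal(size=n)                 # instrument
e = rng.normal(size=n)                 # unobserved confounder
x = z + e
y = theta0 * x + e + 0.1 * rng.normal(size=n)

# R_hat(theta) = (1/n^2) * sum_{i,j} psi_i psi_j k(z_i, z_j) with
# psi_i = y_i - theta * x_i is quadratic in theta; its minimizer is:
K = rbf(z, z)
theta_mmr = (x @ K @ y) / (x @ K @ x)

theta_ols = (x @ y) / (x @ x)          # biased by the confounder (~2.0)
print(theta_ols, theta_mmr)            # MMR estimate is close to 1.5
```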
i=1 αiϕ(˜
YY . Then,
YY ˆ
51/53
i=1 αiϕ(˜
YY . Then,
YY ˆ
j=1 βjφ(xj)
For the Bayesian update, define the second-order mean embeddings μ⊗_X = E_X[φ(X) ⊗ φ(X)] and μ⊗_Y = E_Y[ϕ(Y) ⊗ ϕ(Y)]. These satisfy μ⊗_Y = U_{Y|X} μ⊗_X, extending the kernel sum rule to second-order embeddings.⁵

5 Fukumizu et al. Kernel Bayes' Rule. JMLR, 2013.