SLIDE 1

Vector-valued Distribution Regression: A Simple and Consistent Approach

Zoltán Szabó

Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (PSU)

Statistical Science Seminars, October 9, 2014
SLIDE 2

Outline

  • Motivation.
  • Previous work.
  • High-level goal.
  • Definitions, algorithm, error guarantees, consistency.
  • Numerical illustration.
SLIDE 3

Problem: regression on distributions

Given: samples {(x_i, y_i)}_{i=1}^l; find f ∈ H such that f(x_i) ≈ y_i.

Our interest:

  • the x_i are distributions, but (challenge!)
  • only samples from the x_i are given: {x_{i,n}}_{n=1}^{N_i}.
SLIDE 4

Two-stage sampled setting = bag-of-features

Examples:

  • image = set of patches / visual descriptors,
  • document = bag of words / sentences / paragraphs,
  • molecule = set of different configurations / shapes,
  • group of people on a social network = bag of friendship graphs,
  • customer = his/her shopping records,
  • user = set of trial time-series.
SLIDE 5

Distribution regression: wider context

Several problems are covered in machine learning and statistics: multi-instance learning, point estimation tasks without analytical formula.

SLIDE 6

Existing methods

Idea:

1 estimate distribution similarities,
2 plug them into a learning algorithm.

Approaches:

1 parametric approaches: Gaussian, mixture of Gaussians (MOG), exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012],
2 kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].
SLIDE 7

Existing methods+

1 (Positive definite) kernels [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005].
2 Divergence measures (KL, …) [Póczos et al., 2011].
3 Set metric based algorithms:
  1 Hausdorff metric [Edgar, 1995], and
  2 its variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].
SLIDE 8

Existing methods: summary

MIL dates back to [Haussler, 1999, Gärtner et al., 2002]. There are several multi-instance methods and applications.
SLIDE 9

Existing methods: summary

MIL dates back to [Haussler, 1999, Gärtner et al., 2002]. There are several multi-instance methods and applications. One 'small' open question: do any of these techniques make sense?
SLIDE 10

Existing methods: “exceptions”

APR (axis-parallel rectangles) and its variants, classification [Auer, 1998, Long and Tan, 1998, Blum and Kalai, 1998, Babenko et al., 2011, Zhang et al., 2013, Sabato and Tishby, 2012]:

y_i = max(I_R(x_{i,1}), …, I_R(x_{i,N})) ∈ {0, 1},

where R is an unknown rectangle and I_R its indicator function.
SLIDE 11

Existing methods: “exceptions”

APR (axis-parallel rectangles) and its variants, classification [Auer, 1998, Long and Tan, 1998, Blum and Kalai, 1998, Babenko et al., 2011, Zhang et al., 2013, Sabato and Tishby, 2012]:

y_i = max(I_R(x_{i,1}), …, I_R(x_{i,N})) ∈ {0, 1}, where R is an unknown rectangle.

Density based approaches, regression: KDE + kernel smoothing [Póczos et al., 2013, Oliva et al., 2014]:

  • densities live on a compact Euclidean domain,
  • density estimation: nuisance step.
SLIDE 12

High-level goal: set kernel

Given (2 bags): B_i := {x_{i,n}}_{n=1}^{N_i} ∼ x_i, B_j := {x_{j,m}}_{m=1}^{N_j} ∼ x_j.

Similarity of the bags (set / multi-instance / ensemble / convolution kernel [Haussler, 1999, Gärtner et al., 2002]):

K(B_i, B_j) = (1 / (N_i N_j)) Σ_{n=1}^{N_i} Σ_{m=1}^{N_j} k(x_{i,n}, x_{j,m}).
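A minimal numerical sketch of this set kernel; the Gaussian base kernel k, the bandwidth θ = 1, and the function names are our illustration choices, not fixed by the slides:

```python
import numpy as np

def gaussian_k(a, b, theta=1.0):
    """Base kernel k(a, b) = exp(-||a - b||^2 / (2 * theta^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * theta ** 2))

def set_kernel(Bi, Bj, k=gaussian_k):
    """K(Bi, Bj) = (1 / (Ni * Nj)) * sum over all pairs of k(x_in, x_jm)."""
    return sum(k(a, b) for a in Bi for b in Bj) / (len(Bi) * len(Bj))

rng = np.random.default_rng(0)
Bi = rng.standard_normal((8, 2))        # bag of Ni = 8 samples in R^2
Bj = rng.standard_normal((5, 2)) + 1.0  # bag of Nj = 5 samples, shifted mean
print(set_kernel(Bi, Bj))               # scalar similarity of the two bags
```

With the Gaussian k every summand lies in (0, 1], so the bag similarity does too; as slide 24 notes, this double average is exactly the inner product of the two empirical mean embeddings.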

SLIDE 13

High-level goal: consistency of set kernels

Are set kernels consistent when plugged into some regression scheme? Our focus: ridge regression. Motivation (ridge scheme):

1 simple algorithm,
2 recently proved parallelizations [Zhang et al., 2014].
SLIDE 14

Story

H: assumed function class to capture the (x, y) relation.
f_ρ: true regression function (might not be in H).
f_H: "best" function from H (l = ∞, N := N_i = ∞).
f̂: estimated function from H based on {({x_{i,n}}_{n=1}^N, y_i)}_{i=1}^l.

Aim: high-probability error guarantees (λ: regularization, E: risk):

E[f̂] − E[f_H] ≤ r_1(l, N, λ),   (1)
‖f̂ − f_ρ‖_{L²} ≤ r_2(l, N, λ) + r_3(richness of H).   (2)

Consistency: choose (l, N, λ) such that r_i(l, N, λ) → 0 (i = 1, 2).
SLIDE 15

Distribution regression: definition, solution idea

z = {(x_i, y_i)}_{i=1}^l: x_i ∈ M₁⁺(D), y_i ∈ Y.
ẑ = {({x_{i,n}}_{n=1}^N, y_i)}_{i=1}^l: x_{i,1}, …, x_{i,N} i.i.d. ∼ x_i.

Goal: learn the relation between x and y based on ẑ.

Idea:

1 embed the distributions (using µ defined by k),
2 apply ridge regression (determined by K).

M₁⁺(D) --µ--> X ⊆ H(k) --f ∈ H(K)--> Y.
SLIDE 16

Kernel part (k, K): RKHS

k : D × D → R is a kernel on D if ∃ϕ : D → H (a Hilbert space) feature map such that

k(a, b) = ⟨ϕ(a), ϕ(b)⟩_H   (∀a, b ∈ D).

Kernel examples on D = R^d (p > 0, θ > 0):

  • k(a, b) = (⟨a, b⟩ + θ)^p: polynomial,
  • k(a, b) = e^{−‖a−b‖²₂ / (2θ²)}: Gaussian,
  • k(a, b) = e^{−θ‖a−b‖₂}: Laplacian.

In the RKHS H = H(k) (which exists and is unique): ϕ(u) = k(·, u).
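The three example kernels can be written down directly; a hedged sketch (parameter defaults and helper names are ours):

```python
import numpy as np

def k_poly(a, b, p=2, theta=1.0):
    """Polynomial kernel k(a, b) = (<a, b> + theta)^p."""
    return (np.dot(a, b) + theta) ** p

def k_gauss(a, b, theta=1.0):
    """Gaussian kernel k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    return np.exp(-np.dot(a - b, a - b) / (2.0 * theta ** 2))

def k_laplace(a, b, theta=1.0):
    """Laplacian kernel k(a, b) = exp(-theta * ||a - b||_2)."""
    return np.exp(-theta * np.linalg.norm(a - b))

a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(k_poly(a, b), k_gauss(a, b), k_laplace(a, b))
```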

SLIDE 17

Kernel part: example domains (D)

Euclidean space: D = Rd. Strings, time series, graphs, dynamical systems. Distributions.

SLIDE 18

Embedding step: µ : M₁⁺(D) → X ⊆ H(k)

Given: kernel k : D × D → R. Mean embedding of a distribution x ∈ M₁⁺(D):

µ_x = ∫_D k(·, u) dx(u) ∈ H(k).

Mean embedding of the empirical distribution x̂_i = (1/N) Σ_{n=1}^N δ_{x_{i,n}} ∈ M₁⁺(D):

µ_{x̂_i} = ∫_D k(·, u) dx̂_i(u) = (1/N) Σ_{n=1}^N k(·, x_{i,n}) ∈ H(k).
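The empirical embedding is a finite sum of kernel sections, so it can be represented as a plain closure and evaluated anywhere; a sketch (the Gaussian k and all names are our choices):

```python
import numpy as np

def gaussian_k(a, b, theta=1.0):
    """Base kernel k(a, b) = exp(-||a - b||^2 / (2 * theta^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * theta ** 2))

def empirical_mean_embedding(bag, k=gaussian_k):
    """Return mu_hat(.) = (1/N) * sum_{n=1}^{N} k(., x_n), a function in H(k)."""
    return lambda t: sum(k(t, x_n) for x_n in bag) / len(bag)

rng = np.random.default_rng(0)
bag = rng.standard_normal((100, 1))   # N = 100 samples from N(0, 1)
mu_hat = empirical_mean_embedding(bag)
print(mu_hat(np.zeros(1)))            # embedding evaluated at the point t = 0
```

Evaluating mu_hat far from the data gives values near 0, close to the data values near the kernel's maximum; only such evaluations (inner products) are ever needed downstream.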

SLIDE 19

Objective function: f : X → Y, f ∈ H = H(K)

Optimal (H / measurable) in expected risk (E) sense:

E[f_H] = inf_{f∈H} E[f] = inf_{f∈H} ∫_{X×Y} ‖f(µ_a) − y‖²_Y dρ(µ_a, y),
f_ρ(µ_a) = E[y|µ_a] = ∫_Y y dρ(y|µ_a)   (µ_a ∈ X).

One-stage (ρ → z), two-stage difficulty (z → ẑ):

f^λ_z = argmin_{f∈H} (1/l) Σ_{i=1}^l ‖f(µ_{x_i}) − y_i‖²_Y + λ‖f‖²_H,   (3)

f^λ_ẑ = argmin_{f∈H} (1/l) Σ_{i=1}^l ‖f(µ_{x̂_i}) − y_i‖²_Y + λ‖f‖²_H.   (4)
SLIDE 20

Algorithmically: ridge regression ⇒ analytical solution

Given: training sample ẑ, test distribution t. Prediction:

(f^λ_ẑ ∘ µ)(t) = [y_1, …, y_l] (K + lλ I_l)^{−1} k,   (5)

K = [K_{ij}] = [K(µ_{x̂_i}, µ_{x̂_j})] ∈ L(Y)^{l×l},   (6)

k = [K(µ_{x̂_1}, µ_t); …; K(µ_{x̂_l}, µ_t)] ∈ L(Y)^l.   (7)

Specially: Y = R ⇒ L(Y) = R; Y = R^d ⇒ L(Y) = R^{d×d}.

SLIDE 21

Assumption-1

D: separable, topological. Y: separable Hilbert. k:

  • bounded: sup_{u∈D} k(u, u) ≤ B_k ∈ (0, ∞),
  • continuous.

X = µ(M₁⁺(D)) ∈ B(H).
SLIDE 22

Assumption-1 – continued

K [K_{µ_a} := K(·, µ_a)]:

1 bounded: ‖K_{µ_a}‖²_{HS} = Tr(K*_{µ_a} K_{µ_a}) ≤ B_K ∈ (0, ∞)   (∀µ_a ∈ X),
2 Hölder continuous: ∃L > 0, h ∈ (0, 1] such that

‖K_{µ_a} − K_{µ_b}‖_{L(Y,H)} ≤ L ‖µ_a − µ_b‖^h_H,   ∀(µ_a, µ_b) ∈ X × X.

y is bounded: ∃C < ∞ such that ‖y‖_Y ≤ C almost surely.
SLIDE 23

Assumption-1: remarks (before the ρ assumptions)

k bounded, continuous ⇒ µ : (M₁⁺(D), B(τ_w)) → (H, B(H)) is measurable.
µ measurable, X ∈ B(H) ⇒ ρ on X × Y is well-defined.
If (*) := D is compact metric and k is universal, then µ is continuous and X ∈ B(H).
If Y = R, we get the traditional boundedness of K: K(µ_a, µ_a) ≤ B_K (∀µ_a ∈ X).
SLIDE 24

Assumption-1: linear K ↔ set kernel

Let Y = R and K(µ_a, µ_b) = ⟨µ_a, µ_b⟩_H. Recall µ_{x̂_i} = (1/N) Σ_{n=1}^N k(·, x_{i,n}).

In this case B_K = B_k, L = 1, h = 1, and we get the set kernel

K(µ_{x̂_i}, µ_{x̂_j}) = ⟨µ_{x̂_i}, µ_{x̂_j}⟩_H = (1/N²) Σ_{n,m=1}^N k(x_{i,n}, x_{j,m}).
SLIDE 25

Assumption-1: nonlinear K examples for Y = R

In case of (*) and Y = R, Hölder continuous K-s (D: compact, metric; µ: continuous):

  • K_G(µ_a, µ_b) = e^{−‖µ_a−µ_b‖²_H / (2θ²)}   (Gaussian),   h = 1,
  • K_e(µ_a, µ_b) = e^{−‖µ_a−µ_b‖_H / (2θ²)}   (exponential),   h = 1/2,
  • K_C(µ_a, µ_b) = (1 + ‖µ_a−µ_b‖²_H / θ²)^{−1}   (Cauchy),   h = 1,
  • K_t(µ_a, µ_b) = (1 + ‖µ_a−µ_b‖^θ_H)^{−1}   (generalized t-student),   h = θ/2 (θ ≤ 2),
  • K_i(µ_a, µ_b) = (‖µ_a−µ_b‖²_H + θ²)^{−1/2}   (inverse multiquadric),   h = 1.
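Each of these K-s depends on (µ_a, µ_b) only through ‖µ_a − µ_b‖_H, which is computable from three set-kernel (inner product) evaluations. A sketch for two of them (the Gaussian base kernel and all names are our choices):

```python
import numpy as np

def gaussian_gram(A, B, theta=1.0):
    """Matrix of base-kernel values k(a, b) between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * theta ** 2))

def emb_sqdist(Ba, Bb):
    """||mu_a - mu_b||_H^2 = <mu_a, mu_a> + <mu_b, mu_b> - 2 <mu_a, mu_b>."""
    return (gaussian_gram(Ba, Ba).mean() + gaussian_gram(Bb, Bb).mean()
            - 2.0 * gaussian_gram(Ba, Bb).mean())

def K_G(Ba, Bb, theta=1.0):
    """Gaussian kernel on embeddings: exp(-||mu_a - mu_b||_H^2 / (2 theta^2))."""
    return np.exp(-emb_sqdist(Ba, Bb) / (2.0 * theta ** 2))

def K_C(Ba, Bb, theta=1.0):
    """Cauchy kernel on embeddings: (1 + ||mu_a - mu_b||_H^2 / theta^2)^(-1)."""
    return 1.0 / (1.0 + emb_sqdist(Ba, Bb) / theta ** 2)

rng = np.random.default_rng(1)
Ba = rng.standard_normal((40, 2))
Bb = rng.standard_normal((40, 2)) + 0.5
print(K_G(Ba, Bb), K_C(Ba, Bb))
```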

SLIDE 26

Assumption-1 – continued: ρ ∈ P(b, c)

Let T : H → H be the covariance operator

T = ∫_X K(·, µ_a) K*(·, µ_a) dρ_X(µ_a)

with eigenvalues t_n (n = 1, 2, …). Assumption: ρ ∈ P(b, c) = set of distributions on X × Y with

  • α ≤ n^b t_n ≤ β   (∀n ≥ 1; α > 0, β > 0),
  • ∃g ∈ H such that f_H = T^{(c−1)/2} g with ‖g‖²_H ≤ R   (R > 0),

where b ∈ (1, ∞), c ∈ [1, 2]. Intuition: b – effective input dimension, c – smoothness of f_H.
SLIDE 27

Assumption-2: Assumption-1, but with alternative ρ

Let T̃ be the extension of T from H to L²_{ρ_X}:

S*_K : H ↪ L²_{ρ_X},   S_K : L²_{ρ_X} → H,
(S_K g)(µ_u) = ∫_X K(µ_u, µ_t) g(µ_t) dρ_X(µ_t),
T̃ = S*_K S_K : L²_{ρ_X} → L²_{ρ_X}.

Our assumptions on ρ:

  • range space assumption: f_ρ ∈ Im(T̃^s) for some s ≥ 0,
  • L²_{ρ_X}: separable.
SLIDE 28

Assumption-2: remarks

Range space assumption: f_ρ ∈ Im(T̃^s); s captures the smoothness of f_ρ.
L²_{ρ_X} is separable ⇔ the measure space with d(A, B) = ρ_X(A △ B) is so [Thomson et al., 2008].
SLIDE 29

Error guarantees, consistency (in human-readable format)

In case of Assumption-1: if l ≥ λ^{−1/b − 1}, then with high probability

E[f^λ_ẑ] − E[f_H] ≲ log^h(l) / (N^h λ³) + λ^c + 1/(l² λ) + (1/(lλ))^{1/b} → 0.

In case of Assumption-2: if 1/λ² ≤ l, then with high probability

‖S*_K f^λ_ẑ − f_ρ‖_{L²_{ρ_X}} ≲ log^{h/2}(l) / (N^{h/2} λ^{3/2}) + 1/(λ√l) + D_H → D_H,

where D_H = inf_{q∈H} ‖f_ρ − S*_K q‖_{L²_{ρ_X}}.
SLIDE 30

Demo-1 (Y = R): Supervised entropy learning

Problem: learn the entropy of (rotated) Gaussians.
Baseline: kernel smoothing based distribution regression (applying density estimation) =: DFDR.
Performance: RMSE boxplot over 25 random experiments.
Experience: more precise than the only previously theoretically justified method, by avoiding density estimation.
SLIDE 31

Supervised entropy learning: plots

[Plot: true entropy vs. rotation angle (β), with MERR and DFDR estimates; RMSE boxplots. RMSE: MERR = 0.75, DFDR = 2.02.]
SLIDE 32

Demo-2 (Y = R): Aerosol prediction from satellite images

Bag := multispectral satellite image over an area.
Label of a bag := AOD value of a highly accurate ground-based instrument.
Performance: RMSE.
Experience: ≈ domain-specific, engineered methods; beats state-of-the-art MI techniques.
SLIDE 33

Aerosol prediction: results

Baseline [mixture model (EM)]: 7.5–8.5 (±0.1–0.6).

Linear K:
  • single: 7.91 (±1.61),
  • ensemble: 7.86 (±1.71).

Nonlinear K:
  • single: 7.90 (±1.63),
  • ensemble: 7.81 (±1.64).
SLIDE 34

Summary

Problem: two-stage sampled distribution regression.
Literature: large number of heuristics.
Contribution:

  • error guarantees, consistency for the ridge based solution,
  • specially, the set kernel is consistent in regression (a 15-year-old open question).

Code ∈ ITE toolbox: https://bitbucket.org/szzoli/ite/
SLIDE 35

Future research directions

Theoretical:

  • relaxation of: quadratic loss (E), bounded kernels (k, K), mean embedding (µ) with i.i.d. samples ({x_{i,n}}_{n=1}^N),
  • equivalent characterizations / alternative priors (ρ),
  • lower/optimal bounds,
  • error guarantees for non-point estimates.

Practical: large-scale solvers, dim(Y) = ∞.
SLIDE 36

Thank you for the attention!

Acknowledgments: This work was supported by the Gatsby Charitable Foundation, and by NSF grants IIS1247658 and IIS1250350. The work was carried out while Bharath K. Sriperumbudur was a research fellow in the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics at the University of Cambridge, UK.

SLIDE 37

Appendix: Contents

Topological definitions. Vector-valued RKHS. Hausdorff metric. Weak topology on M₁⁺(D). Universal kernel examples.
SLIDE 38

Topological space, open sets

Given: a set D ≠ ∅. τ ⊆ 2^D is called a topology on D if:

1 ∅ ∈ τ, D ∈ τ,
2 finite intersection: O_1 ∈ τ, O_2 ∈ τ ⇒ O_1 ∩ O_2 ∈ τ,
3 arbitrary union: O_i ∈ τ (i ∈ I) ⇒ ∪_{i∈I} O_i ∈ τ.

Then (D, τ) is called a topological space; the O ∈ τ are the open sets.
SLIDE 39

Topology: examples

τ = {∅, D}: indiscrete topology. τ = 2^D: discrete topology. (D, d) metric space:

  • open ball: B_ε(x) = {y ∈ D : d(x, y) < ε},
  • O ⊆ D is open if ∀x ∈ O ∃ε > 0 such that B_ε(x) ⊆ O,
  • τ := {O ⊆ D : O is an open subset of D}.
SLIDE 40

Closed-, compact set, closure, dense subset, separability

Given: (D, τ). A ⊆ D is

  • closed if D\A ∈ τ (i.e., its complement is open),
  • compact if for any family (O_i)_{i∈I} of open sets with A ⊆ ∪_{i∈I} O_i, ∃ i_1, …, i_n ∈ I with A ⊆ ∪_{j=1}^n O_{i_j}.

Closure of A ⊆ D: Ā := ∩ {C : A ⊆ C, C closed in D}.   (8)

A ⊆ D is dense if Ā = D. (D, τ) is separable if ∃ a countable, dense subset of D. Counterexample: l∞ / L∞.
SLIDE 41

The discrete space

(D, 2^D): complete metric space. Discrete metric (inducing the discrete topology):

d(x, y) = 0 if x = y, 1 if x ≠ y.   (9)

Discrete space: separable ⇔ |D| is countable.
SLIDE 42

Vector-valued RKHS

Definition: a Hilbert space H ⊆ Y^X of functions is an RKHS if

A_{µ_x, y} : f ↦ ⟨y, f(µ_x)⟩_Y   (10)

is continuous for ∀µ_x ∈ X, y ∈ Y, i.e., the evaluation functional is continuous in every direction. Riesz representation theorem ⇒ ∃K_{µ_t} ∈ L(Y, H):

K(µ_x, µ_t)(y) = (K_{µ_t} y)(µ_x)   (∀µ_x, µ_t ∈ X), or shortly K(·, µ_t)(y) = K_{µ_t} y,   (11)

H(K) = closure of span{K_{µ_t} y : µ_t ∈ X, y ∈ Y}.   (12)
SLIDE 43

Vector-valued RKHS – continued

Examples (Y = R^d):

1 K_i : X × X → R kernels (i = 1, …, d). Diagonal kernel:

K(µ_a, µ_b) = diag(K_1(µ_a, µ_b), …, K_d(µ_a, µ_b)).   (13)

2 Combination of diagonal kernels D_j [D_j(µ_a, µ_b) ∈ R^{r×r}, A_j ∈ R^{r×d}]:

K(µ_a, µ_b) = Σ_{j=1}^m A*_j D_j(µ_a, µ_b) A_j.   (14)
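A sketch of (13) and (14) with the operator values represented as plain d × d matrices; the scalar kernels and the toy A_j, D_j below are our choices:

```python
import numpy as np

def diag_K(scalar_Ks, mu_a, mu_b):
    """Diagonal kernel (13): K(mu_a, mu_b) = diag(K_1(...), ..., K_d(...))."""
    return np.diag([K_i(mu_a, mu_b) for K_i in scalar_Ks])

def combined_K(Ds, As, mu_a, mu_b):
    """Combination (14): K(mu_a, mu_b) = sum_j A_j^* D_j(mu_a, mu_b) A_j."""
    return sum(A.T @ D(mu_a, mu_b) @ A for D, A in zip(Ds, As))

K1 = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))  # Gaussian, scalar
K2 = lambda a, b: float(a @ b)                          # linear, scalar
a, b = np.ones(3), np.array([1.0, 0.0, 0.0])

print(diag_K([K1, K2], a, b))            # 2 x 2 diagonal operator, Y = R^2
D1 = lambda u, v: K1(u, v) * np.eye(2)   # r x r diagonal block (r = 2)
A1 = np.array([[1.0, 0.0], [0.0, 2.0]])  # A_1 in R^{r x d}, here 2 x 2
print(combined_K([D1], [A1], a, b))      # d x d operator value
```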

SLIDE 44

Existing methods: set metric based algorithms

Hausdorff metric [Edgar, 1995]:

d_H(X, Y) = max( sup_{x∈X} inf_{y∈Y} d(x, y), sup_{y∈Y} inf_{x∈X} d(x, y) ).   (15)

Metric on compact sets of metric spaces [(M, d); X, Y ⊆ M]. 'Slight' problem: highly sensitive to outliers.
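For finite point sets in R^d, (15) is a two-line computation; the toy sets below (our choice) also illustrate the outlier sensitivity:

```python
import numpy as np

def hausdorff(X, Y):
    """d_H(X, Y) = max(sup_x inf_y d(x, y), sup_y inf_x d(x, y)), finite sets."""
    D = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # pairwise d(x, y)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

X = np.array([[0.0, 0.0], [1.0, 0.0]])
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 5.0]])  # one outlier point
print(hausdorff(X, Y))  # 5.0 -- a single outlier dominates the distance
```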

SLIDE 45

Weak topology on M₁⁺(D)

Def.: the weakest topology on M₁⁺(D) such that the mapping L_h : (M₁⁺(D), τ_w) → R, L_h(x) = ∫_D h(u) dx(u), is continuous for all h ∈ C_b(D), where C_b(D) = {bounded, continuous functions (D, τ) → R}.
SLIDE 46

Universal kernel examples

On every compact subset of R^d:

  • k(a, b) = e^{−‖a−b‖²₂ / (2σ²)}   (σ > 0),
  • k(a, b) = e^{β⟨a,b⟩}   (β > 0), or more generally
  • k(a, b) = f(⟨a, b⟩), f(x) = Σ_{n=0}^∞ a_n x^n   (∀a_n > 0),
  • k(a, b) = (1 − ⟨a, b⟩)^{−α}   (α > 0).
SLIDE 47

Auer, P. (1998). Approximating hyper-rectangles: Learning and pseudorandom sets. Journal of Computer and System Sciences, 57:376–388.

Babenko, B., Verma, N., Dollár, P., and Belongie, S. (2011). Multiple instance learning with manifold bags. In International Conference on Machine Learning (ICML), pages 81–88.

Blum, A. and Kalai, A. (1998). A note on learning from multiple-instance examples. Machine Learning, 30:23–29.

Chen, Y. and Wu, O. (2012). Contextual Hausdorff dissimilarity for multi-instance clustering. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pages 870–873.
SLIDE 48

Cuturi, M., Fukumizu, K., and Vert, J.-P. (2005). Semigroup kernels on measures. Journal of Machine Learning Research, 6:1169–1198.

Edgar, G. (1995). Measure, Topology and Fractal Geometry. Springer-Verlag.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.

Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutio
SLIDE 49

Hein, M. and Bousquet, O. (2005). Hilbertian metrics and positive definite kernels on probability measures. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 136–143.

Jebara, T., Kondor, R., and Howard, A. (2004). Probability product kernels. Journal of Machine Learning Research, 5:819–844.

Long, P. M. and Tan, L. (1998). PAC learning of axis-aligned rectangles with respect to product distributions from multiple-instance examples. Machine Learning, 30:7–21.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. (2009). Nonextensive information theoretical kernels on measures. Journal of Machine Learning Research, 10:935–975.
SLIDE 50

Nielsen, F. and Nock, R. (2012). A closed-form expression for the Sharma-Mittal entropy of exponential families. Journal of Physics A: Mathematical and Theoretical, 45:032003.

Oliva, J. B., Neiswanger, W., Póczos, B., Schneider, J., and Xing, E. (2014). Fast distribution to real regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 33:706–714.

Póczos, B., Rinaldo, A., Singh, A., and Wasserman, L. (2013). Distribution-free distribution regression. International Conference on Artificial Intelligence and Statistics (AISTATS; JMLR W&CP), 31:507–515.

Póczos, B., Xiong, L., and Schneider, J. (2011). Nonparametric divergence estimation with applications to machine learning on distributions.
SLIDE 51

In Uncertainty in Artificial Intelligence (UAI), pages 599–608.

Sabato, S. and Tishby, N. (2012). Multi-instance learning with any hypothesis class. Journal of Machine Learning Research, 13:2999–3039.

Thomson, B. S., Bruckner, J. B., and Bruckner, A. M. (2008). Real Analysis. Prentice-Hall.

Wang, F., Syeda-Mahmood, T., Vemuri, B. C., Beymer, D., and Rangarajan, A. (2009). Closed-form Jensen-Rényi divergence for mixture of Gaussians and applications to group-wise shape registration. Medical Image Computing and Computer-Assisted Intervention, 12:648–655.

Wang, J. and Zucker, J.-D. (2000). Solving the multiple-instance problem: A lazy learning approach.
SLIDE 52

In International Conference on Machine Learning (ICML), pages 1119–1126.

Wu, O., Gao, J., Hu, W., Li, B., and Zhu, M. (2010). Identifying multi-instance outliers. In SIAM International Conference on Data Mining (SDM), pages 430–441.

Zhang, D., He, J., Si, L., and Lawrence, R. D. (2013). MILEAGE: Multiple Instance LEArning with Global Embedding. International Conference on Machine Learning (ICML; JMLR W&CP), 28:82–90.

Zhang, M.-L. and Zhou, Z.-H. (2009). Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence, 31:47–68.

Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2014).
SLIDE 53

Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. Technical report, University of California, Berkeley. (http://arxiv.org/abs/1305.5029).

Zhou, S. K. and Chellappa, R. (2006). From sample similarity to ensemble similarity: Probabilistic distance measures in reproducing kernel Hilbert space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28:917–929.