
Vector-valued Distribution Regression: A Simple and Consistent Approach. Zoltán Szabó, joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), and Bharath K. Sriperumbudur (PSU). Statistical Science Seminars, October 9, 2014.


  1. Vector-valued Distribution Regression: A Simple and Consistent Approach. Zoltán Szabó. Joint work with Arthur Gretton (UCL), Barnabás Póczos (CMU), Bharath K. Sriperumbudur (PSU). Statistical Science Seminars, October 9, 2014.

  2. Outline: Motivation. Previous work. High-level goal. Definitions, algorithm, error guarantee, consistency. Numerical illustration.

  3. Problem: regression on distributions. Given samples $\{(x_i, y_i)\}_{i=1}^{\ell}$, find $f \in H$ such that $f(x_i) \approx y_i$. Our interest: the $x_i$ are distributions, but (challenge!) only samples from the $x_i$ are given: $\{x_{i,n}\}_{n=1}^{N_i}$.
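
A minimal sketch of this two-stage sampled setting (hypothetical toy data, assuming Gaussian bags whose label depends only on the latent mean of the distribution, never on the observed samples):

```python
import numpy as np

# Hypothetical toy data: l bags, each with N two-dimensional samples drawn from
# a latent Gaussian distribution x_i; the label y_i is a function of x_i alone.
rng = np.random.default_rng(0)
l, N, d = 50, 100, 2                                    # number of bags, bag size, dimension

bags, labels = [], []
for _ in range(l):
    mean = rng.uniform(-1.0, 1.0, size=d)               # latent parameter of x_i (unobserved)
    bag = rng.normal(loc=mean, scale=0.5, size=(N, d))  # observed samples {x_{i,n}}_{n=1}^N ~ x_i
    bags.append(bag)
    labels.append(np.linalg.norm(mean))                 # y_i depends on x_i, not on the bag
```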

  4. Two-stage sampled setting = bag-of-features. Examples: image = set of patches/visual descriptors, document = bag of words/sentences/paragraphs, molecule = different configurations/shapes, group of people on a social network = bag of friendship graphs, customer = his/her shopping records, user = set of trial time-series.

  5. Distribution regression: wider context. Several problems in machine learning and statistics are covered: multi-instance learning, point estimation tasks without an analytical formula.

  6. Existing methods. Idea: (1) estimate distribution similarities, (2) plug them into a learning algorithm. Approaches: (1) parametric approaches: Gaussian, MOG, exponential family [Jebara et al., 2004, Wang et al., 2009, Nielsen and Nock, 2012]; (2) kernelized Gaussian measures [Jebara et al., 2004, Zhou and Chellappa, 2006].

  7. Existing methods+. (1) (Positive definite) kernels: [Cuturi et al., 2005, Martins et al., 2009, Hein and Bousquet, 2005]. (2) Divergence measures (KL, ...): [Póczos et al., 2011]. (3) Set metric based algorithms: (a) the Hausdorff metric [Edgar, 1995], and (b) its variants [Wang and Zucker, 2000, Wu et al., 2010, Zhang and Zhou, 2009, Chen and Wu, 2012].

  8. Existing methods: summary. MIL dates back to [Haussler, 1999, Gärtner et al., 2002]. There are several multi-instance methods and applications.

  9. Existing methods: summary. MIL dates back to [Haussler, 1999, Gärtner et al., 2002]. There are several multi-instance methods and applications. One ’small’ open question: does any of these techniques make sense?

  10. Existing methods: “exceptions”. APR (axis-parallel rectangles) and its variants, classification [Auer, 1998, Long and Tan, 1998, Blum and Kalai, 1998, Babenko et al., 2011, Zhang et al., 2013, Sabato and Tishby, 2012]: $y_i = \max(I_R(x_{i,1}), \ldots, I_R(x_{i,N})) \in \{0, 1\}$, where $R$ is an unknown rectangle.

  11. Existing methods: “exceptions”. APR (axis-parallel rectangles) and its variants, classification [Auer, 1998, Long and Tan, 1998, Blum and Kalai, 1998, Babenko et al., 2011, Zhang et al., 2013, Sabato and Tishby, 2012]: $y_i = \max(I_R(x_{i,1}), \ldots, I_R(x_{i,N})) \in \{0, 1\}$, where $R$ is an unknown rectangle. Density based approaches, regression: KDE + kernel smoothing [Póczos et al., 2013, Oliva et al., 2014]; densities live on a compact Euclidean domain; density estimation is a nuisance step.
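
A minimal sketch of the APR labeling rule above; the rectangle corners (lower, upper) are hypothetical inputs, since in the learning problem the rectangle R is unknown:

```python
import numpy as np

def apr_label(bag, lower, upper):
    """APR label: y = max(I_R(x_1), ..., I_R(x_N)) = 1 iff some instance of the bag
    lies in the axis-parallel rectangle R = [lower_1, upper_1] x ... x [lower_d, upper_d]."""
    inside = np.all((bag >= lower) & (bag <= upper), axis=1)  # I_R(x_n) for each instance
    return int(inside.any())                                  # max over the bag
```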

  12. High-level goal: set kernel. Given two bags: $B_i := \{x_{i,n}\}_{n=1}^{N_i} \sim x_i$, $B_j := \{x_{j,m}\}_{m=1}^{N_j} \sim x_j$. Similarity of the bags (set/multi-instance/ensemble/convolution kernel [Haussler, 1999, Gärtner et al., 2002]): $K(B_i, B_j) = \frac{1}{N_i N_j} \sum_{n=1}^{N_i} \sum_{m=1}^{N_j} k(x_{i,n}, x_{j,m})$.
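
A minimal sketch of this set kernel, assuming a Gaussian base kernel k with bandwidth theta (both are illustrative choices, not fixed by the slide):

```python
import numpy as np

def gaussian_gram(A, B, theta=1.0):
    """Base-kernel matrix k(a, b) = exp(-||a - b||^2 / (2 theta^2)) for all pairs."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2))

def set_kernel(B_i, B_j, theta=1.0):
    """K(B_i, B_j) = (1 / (N_i N_j)) * sum_{n,m} k(x_{i,n}, x_{j,m})."""
    return gaussian_gram(B_i, B_j, theta).mean()
```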

  13. High-level goal: consistency of set kernels. Are set kernels consistent when plugged into some regression scheme? Our focus: ridge regression. Motivation (ridge scheme): (1) simple algorithm; (2) recently proved parallelizations [Zhang et al., 2014].

  14. Story. $H$: assumed function class to capture the $(x, y)$ relation. $f_\rho$: true regression function (might not be in $H$). $f_H$: “best” function from $H$ ($\ell = \infty$, $N := N_i = \infty$). $\hat{f}$: estimated function from $H$ based on $\{(\{x_{i,n}\}_{n=1}^{N}, y_i)\}_{i=1}^{\ell}$. Aim: high-probability error guarantees ($\lambda$: regularization, $\mathcal{E}$: risk): $\mathcal{E}[\hat{f}] - \mathcal{E}[f_H] \le r_1(\ell, N, \lambda)$ (1), $\|\hat{f} - f_\rho\|_{L^2} \le r_2(\ell, N, \lambda) + r_3(\text{richness of } H)$ (2). Consistency: $(\ell, N, \lambda) = ?$ such that $r_i(\ell, N, \lambda) \to 0$ ($i = 1, 2$).

  15. Distribution regression: definition, solution idea. $z = \{(x_i, y_i)\}_{i=1}^{\ell}$: $x_i \in M_1^+(D)$, $y_i \in Y$. $\hat{z} = \{(\{x_{i,n}\}_{n=1}^{N}, y_i)\}_{i=1}^{\ell}$: $x_{i,1}, \ldots, x_{i,N} \overset{\text{i.i.d.}}{\sim} x_i$. Goal: learn the relation between $x$ and $y$ based on $\hat{z}$. Idea: (1) embed the distributions (using $\mu$ defined by $k$), (2) apply ridge regression (determined by $K$): $M_1^+(D) \xrightarrow{\mu} X \subseteq H(k) \xrightarrow{f \in H(K)} Y$.

  16. Kernel part ($k$, $K$): RKHS. $k: D \times D \to \mathbb{R}$ is a kernel on $D$ if there exists a feature map $\varphi: D \to H$ (Hilbert space) with $k(a, b) = \langle \varphi(a), \varphi(b) \rangle_H$ ($\forall a, b \in D$). Kernel examples on $D = \mathbb{R}^d$ ($p > 0$, $\theta > 0$): $k(a, b) = (\langle a, b \rangle + \theta)^p$: polynomial; $k(a, b) = e^{-\|a - b\|_2^2 / (2\theta^2)}$: Gaussian; $k(a, b) = e^{-\theta \|a - b\|_2}$: Laplacian. In the RKHS $H = H(k)$ ($\exists!$): $\varphi(u) = k(\cdot, u)$.
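
The three example kernels written out as a minimal sketch (single-pair evaluations; the parameter names theta and p follow the slide):

```python
import numpy as np

def polynomial_kernel(a, b, theta=1.0, p=2):
    """k(a, b) = (<a, b> + theta)^p."""
    return (np.dot(a, b) + theta) ** p

def gaussian_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-||a - b||_2^2 / (2 theta^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * theta**2))

def laplacian_kernel(a, b, theta=1.0):
    """k(a, b) = exp(-theta * ||a - b||_2)."""
    return np.exp(-theta * np.linalg.norm(a - b))
```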

  17. Kernel part: example domains ($D$). Euclidean space: $D = \mathbb{R}^d$. Strings, time series, graphs, dynamical systems. Distributions.

  18. Embedding step: $M_1^+(D) \xrightarrow{\mu} X \subseteq H(k)$. Given: kernel $k: D \times D \to \mathbb{R}$. Mean embedding of a distribution $x \in M_1^+(D)$: $\mu_x = \int_D k(\cdot, u) \, dx(u) \in H(k)$. Mean embedding of the empirical distribution $\hat{x}_i = \frac{1}{N} \sum_{n=1}^{N} \delta_{x_{i,n}} \in M_1^+(D)$: $\mu_{\hat{x}_i} = \int_D k(\cdot, u) \, d\hat{x}_i(u) = \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_{i,n}) \in H(k)$.
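
A point-evaluation sketch of the empirical mean embedding, assuming a Gaussian base kernel; the RKHS element is represented here by the function it defines, u -> mu(u):

```python
import numpy as np

def empirical_mean_embedding(bag, theta=1.0):
    """Return the function u -> (1/N) * sum_n k(u, x_n), the empirical mean
    embedding of the bag under a Gaussian base kernel (an assumed choice)."""
    def mu(u):
        sq = np.sum((bag - u) ** 2, axis=1)             # ||x_n - u||^2 for each n
        return np.exp(-sq / (2.0 * theta**2)).mean()    # average of k(u, x_n)
    return mu
```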

  19. Objective function: $X \xrightarrow{f \in H = H(K)} Y$. Optimal ($H$/measurable) in expected risk ($\mathcal{E}$) sense: $\mathcal{E}[f_H] = \inf_{f \in H} \mathcal{E}[f] = \inf_{f \in H} \int_{X \times Y} \|f(\mu_a) - y\|_Y^2 \, d\rho(\mu_a, y)$, $f_\rho(\mu_a) = \mathbb{E}[y \mid \mu_a] = \int_Y y \, d\rho(y \mid \mu_a)$ ($\mu_a \in X$). One-stage ($\to z$), two-stage difficulty ($z \to \hat{z}$): $f_z^\lambda = \arg\min_{f \in H} \frac{1}{\ell} \sum_{i=1}^{\ell} \|f(\mu_{x_i}) - y_i\|_Y^2 + \lambda \|f\|_H^2$ (3), $f_{\hat{z}}^\lambda = \arg\min_{f \in H} \frac{1}{\ell} \sum_{i=1}^{\ell} \|f(\mu_{\hat{x}_i}) - y_i\|_Y^2 + \lambda \|f\|_H^2$ (4).

  20. Algorithmically: ridge regression $\Rightarrow$ analytical solution. Given: training sample $\hat{z}$, test distribution $t$. Prediction: $(f_{\hat{z}}^\lambda \circ \mu)(t) = [y_1, \ldots, y_\ell](\mathbf{K} + \ell \lambda I_\ell)^{-1} \mathbf{k}$ (5), where $\mathbf{K} = [K_{ij}] = [K(\mu_{\hat{x}_i}, \mu_{\hat{x}_j})] \in L(Y)^{\ell \times \ell}$ (6) and $\mathbf{k} = [K(\mu_{\hat{x}_1}, \mu_t); \ldots; K(\mu_{\hat{x}_\ell}, \mu_t)] \in L(Y)^{\ell}$ (7). Specially: $Y = \mathbb{R} \Rightarrow L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d \Rightarrow L(Y) = \mathbb{R}^{d \times d}$.
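
A minimal sketch of the analytic prediction (5)-(7), assuming the linear kernel K(mu_a, mu_b) = <mu_a, mu_b>_{H(k)} on Gaussian mean embeddings (so the set kernel of slide 12 fills the Gram matrix); the bandwidth theta, regularization lam, and function names are illustrative assumptions:

```python
import numpy as np

def _set_kernel(A, B, theta=1.0):
    """<mu_A, mu_B>_{H(k)} with a Gaussian base kernel = mean of k(a, b) over both bags."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * theta**2)).mean()

def drr_fit_predict(bags, Y, test_bags, lam=1e-3, theta=1.0):
    """Two-stage distribution regression prediction, following (5)-(7) in the case
    where the entries of L(Y) reduce to scalars (e.g. Y = R)."""
    l = len(bags)
    G = np.array([[_set_kernel(bags[i], bags[j], theta) for j in range(l)]
                  for i in range(l)])                       # K in (6), an l x l matrix
    alpha = np.linalg.solve(G + l * lam * np.eye(l),
                            np.asarray(Y, dtype=float))     # (K + l*lambda*I)^{-1} applied to the labels
    k_test = np.array([[_set_kernel(b, t, theta) for b in bags]
                       for t in test_bags])                 # k in (7), one row per test distribution
    return k_test @ alpha                                   # [y_1, ..., y_l](K + l*lambda*I)^{-1} k

# For instance, with the toy bags sketched earlier one could call
# drr_fit_predict(bags[:40], labels[:40], bags[40:]) to predict on held-out bags.
```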

  21. Assumption-1. $D$: separable, topological. $Y$: separable Hilbert. $k$: bounded ($\sup_{u \in D} k(u, u) \le B_k \in (0, \infty)$), continuous. $X = \mu\left(M_1^+(D)\right) \in \mathcal{B}(H)$.

  22. Assumption-1 – continued. $K$ [$K_{\mu_a} := K(\cdot, \mu_a)$]: (1) bounded: $\|K_{\mu_a}\|_{HS}^2 = \mathrm{Tr}\left(K_{\mu_a}^* K_{\mu_a}\right) \le B_K \in (0, \infty)$ ($\forall \mu_a \in X$); (2) Hölder continuous: $\exists L > 0$, $h \in (0, 1]$ such that $\|K_{\mu_a} - K_{\mu_b}\|_{L(Y, H)} \le L \|\mu_a - \mu_b\|_H^h$ $\forall (\mu_a, \mu_b) \in X \times X$. $y$ is bounded: $\exists C < \infty$ such that $\|y\|_Y \le C$ almost surely.

  23. Assumption-1: remarks (before the $\rho$ assumptions). $k$ bounded, continuous $\Rightarrow$ $\mu: (M_1^+(D), \mathcal{B}(\tau_w)) \to (H, \mathcal{B}(H))$ is measurable. $\mu$ measurable, $X \in \mathcal{B}(H)$ $\Rightarrow$ $\rho$ on $X \times Y$ is well-defined. If (*) := $D$ is compact metric and $k$ is universal, then $\mu$ is continuous and $X \in \mathcal{B}(H)$. If $Y = \mathbb{R}$, we get the traditional boundedness of $K$: $K(\mu_a, \mu_a) \le B_K$ ($\forall \mu_a \in X$).
