Distribution Regression


1. Distribution Regression
Zoltán Szabó (École Polytechnique). Joint work with Bharath K. Sriperumbudur (Department of Statistics, PSU), Barnabás Póczos (ML Department, CMU), and Arthur Gretton (Gatsby Unit, UCL).
Dagstuhl Seminar 16481, December 1, 2016.

2. Example: sustainability
Goal: aerosol prediction → climate.
Prediction using labelled bags: bag := multi-spectral satellite measurements over an area; label := local aerosol value.

3. Example: existing methods
Multi-instance learning [Haussler, 1999, Gärtner et al., 2002] (set kernel):
(1) sensible methods in regression are few, with restrictive technical conditions;
(2) a super-high-resolution satellite image would be needed.

4. One-page summary
Contributions:
(1) Practical: state-of-the-art accuracy (aerosol).
(2) Theoretical: general bags (graphs, time series, texts, ...); consistency of the set kernel in regression (a 17-year-old open problem); how many samples per bag? → [Szabó et al., 2016].

6. Objects in the bags
Time-series modelling: user = set of time series.
Computer vision: image = collection of patch vectors.
NLP: corpus = bag of documents.
Network analysis: group of people = bag of friendship graphs. ...

7. Regression on labelled bags
Given: labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{\ell}$, where $\hat{P}_i$ is a bag from $P_i$ and $N := |\hat{P}_i|$; test bag: $\hat{P}$.

8. Regression on labelled bags
Given: labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{\ell}$, where $\hat{P}_i$ is a bag from $P_i$ and $N := |\hat{P}_i|$; test bag: $\hat{P}$.
Estimator: $f_{\hat{z}}^{\lambda} = \arg\min_{f \in H} \frac{1}{\ell} \sum_{i=1}^{\ell} \left[ f(\mu_{\hat{P}_i}) - y_i \right]^2 + \lambda \| f \|_H^2$, where $\mu_{\hat{P}_i}$ is the feature of $\hat{P}_i$.

9. Regression on labelled bags
Given: labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{\ell}$, where $\hat{P}_i$ is a bag from $P_i$ and $N := |\hat{P}_i|$; test bag: $\hat{P}$.
Estimator: $f_{\hat{z}}^{\lambda} = \arg\min_{f \in H(K)} \frac{1}{\ell} \sum_{i=1}^{\ell} \left[ f(\mu_{\hat{P}_i}) - y_i \right]^2 + \lambda \| f \|_H^2$.
Prediction: $\hat{y}(\hat{P}) = g^T (G + \ell \lambda I)^{-1} y$, where $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.

10. Regression on labelled bags
Given: labelled bags $\hat{z} = \{(\hat{P}_i, y_i)\}_{i=1}^{\ell}$, where $\hat{P}_i$ is a bag from $P_i$ and $N := |\hat{P}_i|$; test bag: $\hat{P}$.
Estimator: $f_{\hat{z}}^{\lambda} = \arg\min_{f \in H(K)} \frac{1}{\ell} \sum_{i=1}^{\ell} \left[ f(\mu_{\hat{P}_i}) - y_i \right]^2 + \lambda \| f \|_H^2$.
Prediction: $\hat{y}(\hat{P}) = g^T (G + \ell \lambda I)^{-1} y$, where $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.
Challenge: how many samples per bag?
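
To make the estimator and prediction formula concrete, here is a minimal numerical sketch, assuming a Gaussian inner kernel $k$ and a linear outer kernel $K(\mu_P, \mu_Q) = \langle \mu_P, \mu_Q \rangle$ (so $K$ reduces to the set kernel of the next slide). The helper names (gauss_k, set_kernel, fit_predict) and the toy data are illustrative, not taken from the authors' code.

```python
import numpy as np

def gauss_k(A, B, sigma=1.0):
    """Gaussian kernel matrix between the points of two bags (N_A x N_B)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def set_kernel(A, B, sigma=1.0):
    """Linear outer kernel K(mu_A, mu_B): mean of k(a_i, b_j) over all pairs."""
    return gauss_k(A, B, sigma).mean()

def fit_predict(bags, y, test_bags, lam=1e-3, sigma=1.0):
    """Ridge regression over mean embeddings: y_hat = g^T (G + ell*lam*I)^{-1} y."""
    ell = len(bags)
    G = np.array([[set_kernel(bags[i], bags[j], sigma) for j in range(ell)]
                  for i in range(ell)])
    alpha = np.linalg.solve(G + ell * lam * np.eye(ell), y)
    g = np.array([[set_kernel(t, bags[j], sigma) for j in range(ell)]
                  for t in test_bags])
    return g @ alpha

# Toy usage: the label of each bag is the mean of its first coordinate.
rng = np.random.default_rng(0)
bags = [rng.normal(m, 1.0, size=(50, 2)) for m in rng.uniform(-2, 2, 30)]
y = np.array([b[:, 0].mean() for b in bags])
print(fit_predict(bags, y, bags[:3]))  # in-sample predictions on 3 bags
```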

11. Regression on labelled bags: similarity
Let us define an inner product on distributions [$\tilde{K}(P, Q)$].
Set kernel: for $A = \{a_i\}_{i=1}^{N}$ and $B = \{b_j\}_{j=1}^{N}$, $\tilde{K}(A, B) = \frac{1}{N^2} \sum_{i,j=1}^{N} k(a_i, b_j) = \left\langle \frac{1}{N} \sum_{i=1}^{N} \varphi(a_i), \frac{1}{N} \sum_{j=1}^{N} \varphi(b_j) \right\rangle$; the first factor is the feature of bag $A$.

12. Regression on labelled bags: similarity
Let us define an inner product on distributions [$\tilde{K}(P, Q)$].
Set kernel: for $A = \{a_i\}_{i=1}^{N}$ and $B = \{b_j\}_{j=1}^{N}$, $\tilde{K}(A, B) = \frac{1}{N^2} \sum_{i,j=1}^{N} k(a_i, b_j) = \left\langle \frac{1}{N} \sum_{i=1}^{N} \varphi(a_i), \frac{1}{N} \sum_{j=1}^{N} \varphi(b_j) \right\rangle$; the first factor is the feature of bag $A$.
Taking the 'limit' [Berlinet and Thomas-Agnan, 2004, Altun and Smola, 2006, Smola et al., 2007]: for $a \sim P$, $b \sim Q$, $\tilde{K}(P, Q) = E_{a,b}\, k(a, b) = \langle E_a \varphi(a), E_b \varphi(b) \rangle$, where $\mu_P := E_a \varphi(a)$ is the feature of distribution $P$.
Example (Gaussian kernel): $k(a, b) = e^{-\| a - b \|_2^2 / (2 \sigma^2)}$.
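
The identity above (double sum over pairs = inner product of mean features) can be checked numerically. In the sketch below, the feature map $\varphi$ of the Gaussian kernel is approximated with random Fourier features; that finite-dimensional approximation is our illustrative addition, not part of the talk.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, D, d = 1.0, 5000, 3
A, B = rng.normal(size=(40, d)), rng.normal(size=(60, d))

# Exact double sum: (1 / (N_A * N_B)) * sum_ij k(a_i, b_j).
d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
K_sum = np.exp(-d2 / (2 * sigma**2)).mean()

# Approximate feature map phi(x) = sqrt(2/D) * cos(W x + b),
# with rows of W ~ N(0, sigma^{-2} I): random Fourier features.
W = rng.normal(scale=1.0 / sigma, size=(D, d))
b = rng.uniform(0, 2 * np.pi, D)
phi = lambda X: np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

# Inner product of the bags' mean features ("feature of bag A").
K_feat = phi(A).mean(0) @ phi(B).mean(0)
print(K_sum, K_feat)  # agree up to Monte Carlo error in the features
```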

13. Regression on labelled bags: baseline
Quality of the estimator, baseline: $R(f) = E_{(\mu_P, y) \sim \rho} [f(\mu_P) - y]^2$; $f_\rho$ = best regressor.
How many samples per bag to get the accuracy of $f_\rho$? Possible?
Assume (for a moment): $f_\rho \in H(K)$.

14. Our result: how many samples/bag
Known [Caponnetto and De Vito, 2007]: best achievable rate $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\left( \ell^{-\frac{bc}{bc+1}} \right)$, where $b$ is the size of the input space and $c$ is the smoothness of $f_\rho$.

15. Our result: how many samples/bag
Known [Caponnetto and De Vito, 2007]: best achievable rate $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\left( \ell^{-\frac{bc}{bc+1}} \right)$, where $b$ is the size of the input space and $c$ is the smoothness of $f_\rho$.
Let $N = \tilde{O}(\ell^a)$, where $N$ is the bag size and $\ell$ the number of bags.
Our result: if $a \geq 2$, then $f_{\hat{z}}^{\lambda}$ attains the best achievable rate.

16. Our result: how many samples/bag
Known [Caponnetto and De Vito, 2007]: best achievable rate $R(f_{\hat{z}}^{\lambda}) - R(f_\rho) = O\left( \ell^{-\frac{bc}{bc+1}} \right)$, where $b$ is the size of the input space and $c$ is the smoothness of $f_\rho$.
Let $N = \tilde{O}(\ell^a)$, where $N$ is the bag size and $\ell$ the number of bags.
Our result: if $a \geq 2$, then $f_{\hat{z}}^{\lambda}$ attains the best achievable rate.
In fact, $a = \frac{b(c+1)}{bc+1} < 2$ is enough.
Consequence: regression with the set kernel is consistent.
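
A quick arithmetic illustration of the exponent (ignoring the constants hidden in the $\tilde{O}(\cdot)$; the $(b, c)$ values below are made up): the sufficient bag size $\ell^{a}$ with $a = b(c+1)/(bc+1)$ stays well below the naive quadratic choice $\ell^2$.

```python
def sufficient_exponent(b, c):
    """a = b(c+1) / (bc+1): the bag-size exponent that already suffices."""
    return b * (c + 1) / (b * c + 1)

ell = 10_000  # number of bags
for b, c in [(1, 1), (2, 1), (4, 2)]:  # hypothetical (b, c) pairs
    a = sufficient_exponent(b, c)
    print(f"b={b}, c={c}: a={a:.3f}, N ~ ell^a = {ell**a:,.0f} "
          f"(vs ell^2 = {ell**2:,})")
```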

17. Extensions
(1) $K$: linear → Hölder, e.g. RBF [Christmann and Steinwart, 2010].

18. Extensions
(1) $K$: linear → Hölder, e.g. RBF [Christmann and Steinwart, 2010].
(2) Misspecified setting ($f_\rho \in L^2 \setminus H$): consistency, i.e. convergence to $\inf_{f \in H} \| f - f_\rho \|_{L^2}$; smoothness on $f_\rho$: computational & statistical tradeoff.
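
For extension (1), an RBF outer kernel on embeddings needs only the RKHS distance $\| \mu_A - \mu_B \|_H$, which the kernel trick reduces to three set-kernel evaluations: $\| \mu_A - \mu_B \|_H^2 = \tilde{K}(A,A) - 2 \tilde{K}(A,B) + \tilde{K}(B,B)$. A minimal sketch (helper names are illustrative; set_kernel restates the empirical set kernel from the earlier snippet for self-containment):

```python
import numpy as np

def gauss_k(A, B, sigma=1.0):
    """Gaussian inner kernel matrix between the points of two bags."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

def set_kernel(A, B, sigma=1.0):
    """Empirical set kernel K~(A, B): mean of k(a_i, b_j) over all pairs."""
    return gauss_k(A, B, sigma).mean()

def rbf_outer(A, B, sigma=1.0, eta=1.0):
    """RBF outer kernel K(mu_A, mu_B) = exp(-||mu_A - mu_B||_H^2 / (2 eta^2))."""
    d2 = (set_kernel(A, A, sigma) - 2 * set_kernel(A, B, sigma)
          + set_kernel(B, B, sigma))
    return np.exp(-d2 / (2 * eta**2))
```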

19. Extensions
(3) Vector-valued output: $Y$ a separable Hilbert space ⇒ $K(\mu_P, \mu_Q) \in L(Y)$.
Prediction on a test bag $\hat{P}$: $\hat{y}(\hat{P}) = g^T (G + \ell \lambda I)^{-1} y$, with $g = [K(\mu_{\hat{P}}, \mu_{\hat{P}_i})]$, $G = [K(\mu_{\hat{P}_i}, \mu_{\hat{P}_j})]$, $y = [y_i]$.
Specifically: $Y = \mathbb{R}$ ⇒ $L(Y) = \mathbb{R}$; $Y = \mathbb{R}^d$ ⇒ $L(Y) = \mathbb{R}^{d \times d}$.
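
For extension (3) with $Y = \mathbb{R}^d$, one simple operator-valued choice, used here purely for illustration, is $K(\mu_P, \mu_Q) = \tilde{K}(P, Q)\, I_d$. The output coordinates then decouple, and prediction becomes the same linear solve applied to a label matrix $Y \in \mathbb{R}^{\ell \times d}$:

```python
import numpy as np

def predict_vector(G, g, Y, lam):
    """Vector-valued prediction for the illustrative choice K = K~ * I_d.

    G: ell x ell Gram matrix on training bags; g: similarities of the test
    bag to the training bags (length ell); Y: ell x d label matrix.
    Returns y_hat in R^d."""
    ell = G.shape[0]
    return g @ np.linalg.solve(G + ell * lam * np.eye(ell), Y)

# Toy usage: a random positive-definite Gram matrix and random labels.
rng = np.random.default_rng(2)
M = rng.normal(size=(20, 20))
G = M @ M.T
print(predict_vector(G, G[0], rng.normal(size=(20, 3)), lam=1e-2))
```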

20. Aerosol prediction result ($100 \times$ RMSE)
We perform on par with the state-of-the-art, hand-engineered method.
[Wang et al., 2012]: 7.5–8.5, hand-crafted features.
Our prediction accuracy: 7.81, with no expert knowledge.
Code in ITE: https://bitbucket.org/szzoli/ite/

21. Summary
Problem: distribution regression.
Contribution: computational & statistical tradeoff analysis; specifically, the set kernel is consistent, and the minimax optimal rate is achievable with sub-quadratic bag size.
Open question: optimal bag size.

22. Thank you for your attention!
Acknowledgments: This work was supported by the Gatsby Charitable Foundation and by NSF grants IIS1247658 and IIS1250350. Part of the work was carried out while Bharath K. Sriperumbudur was a research fellow at the Statistical Laboratory, Department of Pure Mathematics and Mathematical Statistics, University of Cambridge, UK.

23. References
Altun, Y. and Smola, A. (2006). Unifying divergence minimization and statistical inference via convex duality. In Conference on Learning Theory (COLT), pages 139–153.
Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer.
Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368.
Christmann, A. and Steinwart, I. (2010). Universal kernels on non-standard input spaces. In Advances in Neural Information Processing Systems (NIPS), pages 406–414.
Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. (2002). Multi-instance kernels. In International Conference on Machine Learning (ICML), pages 179–186.
Haussler, D. (1999). Convolution kernels on discrete structures. Technical report, Department of Computer Science, University of California at Santa Cruz. (http://cbse.soe.ucsc.edu/sites/default/files/convolutions.pdf)
Smola, A., Gretton, A., Song, L., and Schölkopf, B. (2007). A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT), pages 13–31.
Szabó, Z., Sriperumbudur, B., Póczos, B., and Gretton, A. (2016). Learning theory for distribution regression. Journal of Machine Learning Research, 17(152):1–40.
Wang, Z., Lan, L., and Vucetic, S. (2012). Mixture model for multiple instance regression and applications in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 50:2226–2237.
