Regression: Probabilistic perspective (CE-717: Machine Learning)


  1. Regression: Probabilistic perspective
     CE-717: Machine Learning
     Sharif University of Technology
     M. Soleymani, Fall 2018

  2. Curve fitting: probabilistic perspective
     - Describing uncertainty over the value of the target variable as a probability distribution
     - Example (figure): the fitted curve $f(x; \mathbf{w})$, its value $f(x_0; \mathbf{w})$ at an input $x_0$, and the conditional density $p(t \mid x_0, \mathbf{w}, \sigma)$ around it

  3. The learning diagram including noisy target
     - Unknown target function $f: \mathcal{X} \to \mathcal{Y}$, observed through a noisy target distribution $P(t \mid \mathbf{x})$
     - Training examples $(\mathbf{x}^{(1)}, t^{(1)}), \ldots, (\mathbf{x}^{(N)}, t^{(N)})$ are drawn from the joint distribution $P(\mathbf{x}, t) = P(\mathbf{x})\, P(t \mid \mathbf{x})$, where $P(\mathbf{x})$ is the distribution on the features
     - Learning produces a final hypothesis $g(\mathbf{x}) \approx f(\mathbf{x})$
     [Y. S. Abu-Mostafa, 2012]

  4. Curve fitting: probabilistic perspective (example)
     - Special case: observed output = function + noise
       $t = f(\mathbf{x}; \mathbf{w}) + \varepsilon$, e.g., $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
     - Noise: whatever we cannot capture with our chosen family of functions
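As a concrete illustration of this noisy-target model, here is a minimal NumPy sketch (mine, not the slides'): it samples data from a hypothetical linear $f$ with assumed weights and noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth: f(x; w) = w0 + w1 * x (assumed for illustration)
w_true = np.array([1.0, 2.0])
sigma = 0.3                               # assumed noise standard deviation

N = 50
x = rng.uniform(-1.0, 1.0, size=N)
eps = rng.normal(0.0, sigma, size=N)      # epsilon ~ N(0, sigma^2)
t = w_true[0] + w_true[1] * x + eps       # observed output = function + noise
```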

  5. Curve fitting: probabilistic perspective (example)
     - Best regression: $\mathbb{E}[t \mid \mathbf{x}] = \mathbb{E}[f(\mathbf{x}; \mathbf{w}) + \varepsilon] = f(\mathbf{x}; \mathbf{w})$, with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
     - $f(\mathbf{x}; \mathbf{w})$ is trying to capture the mean of the observations $t$ given the input $\mathbf{x}$
     - $\mathbb{E}[t \mid \mathbf{x}]$: conditional expectation of $t$ given $\mathbf{x}$, evaluated according to the model (not according to the underlying distribution $P$)

  6. Curve fitting using probabilistic estimation
     - Maximum Likelihood (ML) estimation
     - Maximum A Posteriori (MAP) estimation
     - Bayesian approach

  7. Maximum likelihood estimation
     - Given observations $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
     - Find the parameters that maximize the (conditional) likelihood of the outputs:
       $L(\mathcal{D}; \boldsymbol{\theta}) = p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\theta}) = \prod_{i=1}^{N} p(t^{(i)} \mid \mathbf{x}^{(i)}, \boldsymbol{\theta})$
     - Here $\mathbf{t} = [t^{(1)}, \ldots, t^{(N)}]^\top$ and $\mathbf{X}$ is the design matrix whose $i$-th row is $[1, x_1^{(i)}, \ldots, x_d^{(i)}]$
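A small sketch of how the design matrix above can be assembled, assuming raw inputs stacked in a NumPy array; the helper name design_matrix is mine.

```python
import numpy as np

def design_matrix(x_raw: np.ndarray) -> np.ndarray:
    """Prepend a bias column of ones so row i is [1, x_1^(i), ..., x_d^(i)]."""
    x_raw = np.atleast_2d(x_raw).reshape(len(x_raw), -1)
    return np.hstack([np.ones((x_raw.shape[0], 1)), x_raw])

# Example: three one-dimensional inputs yield a 3 x 2 design matrix
X = design_matrix(np.array([0.5, -1.0, 2.0]))
```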

  8. Maximum likelihood estimation (Cont'd)
     - $t = f(\mathbf{x}; \mathbf{w}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$
     - $t$ given $\mathbf{x}$ is normally distributed with mean $f(\mathbf{x}; \mathbf{w})$ and variance $\sigma^2$: we model the uncertainty in the predictions, not just the mean
       $p(t \mid \mathbf{x}, \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}\left(t - f(\mathbf{x}; \mathbf{w})\right)^2\right\}$

  9. Maximum likelihood estimation (Cont'd)
     - Example: univariate linear function
       $p(t \mid x, \mathbf{w}, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2\sigma^2}\left(t - w_0 - w_1 x\right)^2\right\}$
     - (Figure: a line far from most of the points.) Why is such a line a bad fit according to the likelihood criterion? Because $p(t \mid x, \mathbf{w}, \sigma^2)$ will be near zero for most of the points, as they are far from the line
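To make the likelihood criterion tangible, a hedged sketch (toy data and parameter values assumed) that scores two candidate lines by their Gaussian log-likelihood; the distant line receives a far lower score because most points have near-zero density under it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=50)   # assumed toy data

def gaussian_loglik(w0, w1, x, t, sigma):
    """Sum over points of ln p(t | x, w, sigma^2) for the line w0 + w1 * x."""
    resid = t - (w0 + w1 * x)
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

print(gaussian_loglik(1.0, 2.0, x, t, sigma=0.3))    # line near the data
print(gaussian_loglik(-3.0, 0.0, x, t, sigma=0.3))   # distant line: much lower
```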

  10. Maximum likelihood estimation (Cont'd)
      - Maximize the likelihood of the outputs (i.i.d.):
        $L(\mathcal{D}; \mathbf{w}, \sigma^2) = \prod_{i=1}^{N} p(t^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2)$
        $\widehat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} L(\mathcal{D}; \mathbf{w}, \sigma^2) = \operatorname{argmax}_{\mathbf{w}} \prod_{i=1}^{N} p(t^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2)$

  11. Maximum likelihood estimation (Cont'd)
      - It is often easier (but equivalent) to maximize the log-likelihood:
        $\widehat{\mathbf{w}} = \operatorname{argmax}_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)$
        $\ln \prod_{i=1}^{N} p(t^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \sigma^2) = \sum_{i=1}^{N} \ln \mathcal{N}(t^{(i)} \mid f(\mathbf{x}^{(i)}; \mathbf{w}), \sigma^2)$
        $= -N \ln \sigma - \frac{N}{2} \ln 2\pi - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left(t^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2$
      - The last term is, up to scale, the sum-of-squares error

  12. Maximum likelihood estimation (Cont'd)
      - Maximizing the log-likelihood (when we assume $t = f(\mathbf{x}; \mathbf{w}) + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$) is equivalent to minimizing SSE
      - Let $\widehat{\mathbf{w}}$ be the maximum likelihood (here least squares) setting of the parameters
      - What is the maximum likelihood estimate of $\sigma^2$? Setting $\frac{\partial \log L(\mathcal{D}; \mathbf{w}, \sigma^2)}{\partial \sigma^2} = 0$ gives
        $\widehat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} \left(t^{(i)} - f(\mathbf{x}^{(i)}; \widehat{\mathbf{w}})\right)^2$
        i.e., the mean squared prediction error
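A minimal sketch of both maximum likelihood estimates for a linear model, assuming the same toy data as above: np.linalg.lstsq returns the least-squares (hence ML) weights, and the ML noise variance is the mean squared residual.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=50)   # assumed toy data

X = np.column_stack([np.ones_like(x), x])           # design matrix [1, x]
w_hat, *_ = np.linalg.lstsq(X, t, rcond=None)       # ML = least-squares weights
sigma2_hat = np.mean((t - X @ w_hat) ** 2)          # ML estimate of sigma^2

print(w_hat)       # close to the assumed [1.0, 2.0]
print(sigma2_hat)  # close to 0.3 ** 2
```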

  13. Maximum likelihood estimation (Cont'd)
      - Generally, maximizing the log-likelihood is equivalent to minimizing the empirical loss when the loss is defined as
        $\mathrm{Loss}\left(t^{(i)}, f(\mathbf{x}^{(i)}; \mathbf{w})\right) = -\ln p(t^{(i)} \mid \mathbf{x}^{(i)}, \mathbf{w}, \boldsymbol{\theta})$
      - Loss: negative log-probability
      - More general distributions for $p(t \mid \mathbf{x})$ can be considered
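As one example of the last point (my illustration, not the slides'): swapping the Gaussian noise model for a Laplace one turns the negative log-probability from a squared-error loss into an absolute-error loss, so ML fitting becomes least-absolute-deviations regression.

```python
import numpy as np

def nll_gaussian(t, pred, sigma=1.0):
    # -ln N(t | pred, sigma^2): squared error plus a constant
    return 0.5 * np.log(2 * np.pi * sigma**2) + (t - pred) ** 2 / (2 * sigma**2)

def nll_laplace(t, pred, b=1.0):
    # -ln Laplace(t | pred, b): absolute error plus a constant
    return np.log(2 * b) + np.abs(t - pred) / b

print(nll_gaussian(1.0, 0.0), nll_laplace(1.0, 0.0))
```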

  14. Maximum A Posteriori (MAP) estimation
      - MAP: given observations $\mathcal{D}$, find the parameters that maximize the probability of the parameters after observing the data (the posterior):
        $\boldsymbol{\theta}_{MAP} = \operatorname{argmax}_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \mathcal{D})$
      - Since $p(\boldsymbol{\theta} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$:
        $\boldsymbol{\theta}_{MAP} = \operatorname{argmax}_{\boldsymbol{\theta}} p(\mathcal{D} \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta})$

  15. Maximum A Posteriori (MAP) estimation
      - Given observations $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$:
        $\max_{\mathbf{w}} p(\mathbf{w} \mid \mathbf{X}, \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{X}, \mathbf{w})\, p(\mathbf{w})$
      - Gaussian prior on the weights:
        $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \alpha^2 \mathbf{I}) = \left(\frac{1}{\sqrt{2\pi}\,\alpha}\right)^{d+1} \exp\left\{-\frac{1}{2\alpha^2}\, \mathbf{w}^\top \mathbf{w}\right\}$

  16. Maximum A Posteriori (MAP) estimation
      - Given observations $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$:
        $\max_{\mathbf{w}} \ln \left[ p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \sigma^2)\, p(\mathbf{w}) \right]$
        $\equiv \min_{\mathbf{w}} \frac{1}{\sigma^2} \sum_{i=1}^{N} \left(t^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w})\right)^2 + \frac{1}{\alpha^2}\, \mathbf{w}^\top \mathbf{w}$
      - Equivalent to regularized SSE with $\lambda = \frac{\sigma^2}{\alpha^2}$
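A minimal sketch of the MAP solution under these Gaussian assumptions, using the closed form $\mathbf{w} = (\mathbf{X}^\top\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top\mathbf{t}$ with $\lambda = \sigma^2/\alpha^2$; the data and the variance values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=50)   # assumed toy data

X = np.column_stack([np.ones_like(x), x])
sigma2, alpha2 = 0.3**2, 1.0**2                     # assumed noise/prior variances
lam = sigma2 / alpha2                               # regularization strength

# MAP weights: regularized least squares (ridge regression)
w_map = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
print(w_map)   # shrunk slightly toward zero relative to the ML solution
```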

  17. Bayesian approach
      - Given observations $\mathcal{D} = \{(\mathbf{x}^{(i)}, t^{(i)})\}_{i=1}^{N}$
      - Instead of a point estimate of the parameters, average the predictions over their posterior:
        $p(t \mid \mathbf{x}, \mathcal{D}) = \int p(t \mid \mathbf{w}, \mathbf{x})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}$
      - Example of prior distribution: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \alpha^2 \mathbf{I})$
      - In this case the posterior is Gaussian, $p(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{m}_N, \mathbf{T}_N^{-1})$, with
        $\mathbf{m}_N = \frac{1}{\sigma^2}\, \mathbf{T}_N^{-1} \mathbf{X}^\top \mathbf{t}, \quad \mathbf{T}_N = \frac{1}{\alpha^2}\, \mathbf{I} + \frac{1}{\sigma^2}\, \mathbf{X}^\top \mathbf{X}$

  18. Bayesian approach
      - Likelihood: $p(\mathcal{D} \mid \mathbf{w}) = \prod_{i=1}^{N} p(t^{(i)} \mid \mathbf{w}^\top \mathbf{x}^{(i)}, \boldsymbol{\theta}) = \prod_{i=1}^{N} \mathcal{N}(t^{(i)} \mid \mathbf{w}^\top \mathbf{x}^{(i)}, \sigma^2)$
      - Prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \alpha^2 \mathbf{I})$
      - Posterior: $p(\mathbf{w} \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}) = \mathcal{N}(\mathbf{m}_N, \mathbf{T}_N^{-1})$ with
        $\mathbf{m}_N = \frac{1}{\sigma^2}\, \mathbf{T}_N^{-1} \mathbf{X}^\top \mathbf{t}, \quad \mathbf{T}_N = \frac{1}{\alpha^2}\, \mathbf{I} + \frac{1}{\sigma^2}\, \mathbf{X}^\top \mathbf{X}$
      - Predictive distribution:
        $p(t \mid \mathbf{x}, \mathcal{D}) = \int p(t \mid \mathbf{w}, \mathbf{x})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w} = \mathcal{N}\left(t \mid \mathbf{m}_N^\top \mathbf{x},\ \sigma_N^2(\mathbf{x})\right)$
        $\sigma_N^2(\mathbf{x}) = \sigma^2 + \mathbf{x}^\top \mathbf{T}_N^{-1} \mathbf{x}$
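Putting the posterior and predictive formulas together, a sketch under the same assumed toy data and variances; S_N denotes the posterior covariance $\mathbf{T}_N^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
t = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=50)   # assumed toy data

X = np.column_stack([np.ones_like(x), x])
sigma2, alpha2 = 0.3**2, 1.0**2                     # assumed variances

# Posterior precision T_N, covariance S_N, and mean m_N
T_N = np.eye(X.shape[1]) / alpha2 + X.T @ X / sigma2
S_N = np.linalg.inv(T_N)
m_N = S_N @ X.T @ t / sigma2

# Predictive mean and variance at a new input x* = 0.5 (with bias feature)
x_star = np.array([1.0, 0.5])
pred_mean = m_N @ x_star
pred_var = sigma2 + x_star @ S_N @ x_star
print(pred_mean, pred_var)
```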

  19. Predictive distribution: example
      - Sinusoidal data, 9 Gaussian basis functions
      - Red curve shows the mean of the predictive distribution
      - Pink region spans one standard deviation on either side of the mean
      [Bishop]
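A sketch that mirrors this example under assumed settings (the exact noise and prior values used in the figure are not given here): sinusoidal data, 9 Gaussian basis functions, and the predictive mean and standard deviation computed from the previous slide's formulas. Plotting mean over x_grid with a band of mean ± std reproduces the red curve and pink region described above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2, alpha2 = 25, 0.2**2, 1.0                 # assumed settings
x = rng.uniform(0.0, 1.0, size=N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, np.sqrt(sigma2), size=N)

centers = np.linspace(0.0, 1.0, 9)                  # 9 Gaussian basis functions
width = 0.1                                         # assumed basis width

def phi(x):
    """Gaussian basis features, shape (len(x), 9)."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width**2))

Phi = phi(x)
T_N = np.eye(9) / alpha2 + Phi.T @ Phi / sigma2
S_N = np.linalg.inv(T_N)
m_N = S_N @ Phi.T @ t / sigma2

x_grid = np.linspace(0.0, 1.0, 100)
Phi_g = phi(x_grid)
mean = Phi_g @ m_N                                  # predictive mean
std = np.sqrt(sigma2 + np.sum(Phi_g @ S_N * Phi_g, axis=1))  # predictive std
```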

  20. Predictive distribution: example
      - Functions whose parameters are sampled from the posterior $p(\mathbf{w} \mid \mathcal{D})$
      [Bishop]

  21. Resource
      - C. Bishop, "Pattern Recognition and Machine Learning", Chapter 3.3.
