Probabilistic Graphical Models: introduction to learning
Siamak Ravanbakhsh, Fall 2019


  1. Probabilistic Graphical Models: introduction to learning (Siamak Ravanbakhsh, Fall 2019)

  2. Learning objectives
  - different goals of learning a graphical model
  - effect of goals on the learning setup

  3. Where does a graphical model come from?
  - designed by domain experts: more suitable for directed models, since conditional probabilities are more intuitive than unnormalized factors and there is no need to estimate the partition function
  - learned from data: with a fixed structure (easy for directed models) or an unknown structure; from fully or partially observed data (hidden variables)
  image: http://blog.londolozi.com/

  6. Goals of learning: density estimation
  - assumption: the data is an IID sample from $P^*$: $\mathcal{D} = \{ x^{(1)}, \ldots, x^{(M)} \} \sim P^*$
  - empirical distribution: $P_{\mathcal{D}}(x) = \frac{1}{|\mathcal{D}|} \mathbb{I}(x \in \mathcal{D})$
  - objective: learn a $\hat{P} \in \mathcal{P}$ close to $P^*$:
    $\hat{P} = \arg\min_{P \in \mathcal{P}} \mathrm{KL}(P^* \| P) = \mathbb{E}_{P^*}[\log P^*] - \mathbb{E}_{P^*}[\log P]$
    the first term is the negative entropy of $P^*$ and does not depend on $P$
  - substituting $P_{\mathcal{D}}$ for $P^*$: $\hat{P} = \arg\max_{P \in \mathcal{P}} \sum_{x \in \mathcal{D}} \log P(x)$
    this objective is the log-likelihood; its negative is called the log loss
  - how do we compare two log-likelihood values?
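The maximum-likelihood view above can be sketched in a few lines of Python. This is a minimal illustration with made-up data and names of our own choosing: the empirical distribution itself maximizes the log-likelihood, so any other candidate (here, a uniform distribution) scores no higher.

```python
# Sketch: maximum-likelihood density estimation for one discrete variable.
# Minimizing KL divergence to the empirical distribution is equivalent to
# maximizing the log-likelihood, whose maximizer is the empirical
# distribution itself. Data and names are illustrative.
from collections import Counter
import math

data = ["a", "a", "b", "a", "c", "b", "a", "a"]

# empirical distribution P_D(x) = count(x) / |D|
counts = Counter(data)
P_D = {x: c / len(data) for x, c in counts.items()}

def log_likelihood(P, D):
    """sum over x in D of log P(x)."""
    return sum(math.log(P[x]) for x in D)

# the empirical distribution scores at least as high as any other
# candidate, e.g. the uniform distribution over {a, b, c}
uniform = {x: 1 / 3 for x in "abc"}
assert log_likelihood(P_D, data) >= log_likelihood(uniform, data)
```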

  11. Goals of learning: prediction
  - given $\mathcal{D} = \{ (x^{(m)}, y^{(m)}) \}$, we are interested in learning $\hat{P}(X \mid Y)$
  - the output of our prediction is structured, e.g. in image segmentation
  - making a prediction: $\hat{X}(Y) = \arg\max_x \hat{P}(x \mid Y)$
  - error measures:
    0/1 loss (unforgiving): $\mathbb{E}_{(X,Y) \sim P^*}[\mathbb{I}(X \neq \hat{X}(Y))]$
    Hamming loss: $\mathbb{E}_{(X,Y) \sim P^*}[\sum_i \mathbb{I}(X_i \neq \hat{X}(Y)_i)]$
    conditional log-likelihood: $\mathbb{E}_{(X,Y) \sim P^*}[\log \hat{P}(X \mid Y)]$, which takes prediction uncertainty into account
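The difference between the unforgiving 0/1 loss and the per-component Hamming loss can be seen on a toy structured prediction (think of a short run of segmentation labels). The function names and data here are illustrative, not from the slides.

```python
# Sketch: 0/1 loss vs. Hamming loss for a structured output.
# A single wrong component costs a full error under the 0/1 loss,
# but only a small fraction under the Hamming loss.
def zero_one_loss(x_true, x_pred):
    # unforgiving: any mistaken component counts as a complete error
    return 0.0 if x_true == x_pred else 1.0

def hamming_loss(x_true, x_pred):
    # fraction of components that disagree
    return sum(t != p for t, p in zip(x_true, x_pred)) / len(x_true)

x_true = [1, 0, 1, 1, 0, 1]
x_pred = [1, 0, 1, 0, 0, 1]   # one of six components is wrong

print(zero_one_loss(x_true, x_pred))  # 1.0
print(hamming_loss(x_true, x_pred))   # 1/6
```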

  13. Goals of learning: knowledge discovery
  - given $\mathcal{D} = \{ x^{(m)} \}$, we are interested in learning the structure $\mathcal{G}$ or $\mathcal{H}$
  - finding conditional independencies or causal relationships, e.g. in a gene regulatory network
  - the structure is not always uniquely identifiable
  - recall: two DAGs are I-equivalent ($I(\mathcal{G}) = I(\mathcal{G}')$) iff they have the same undirected skeleton and the same immoralities
  image credit: Chen et al., 2014
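The I-equivalence criterion recalled above (same skeleton, same immoralities) is easy to check mechanically. Below is a small sketch, with our own edge-list representation: an immorality is a v-structure $x \to z \leftarrow y$ whose parents $x, y$ are not adjacent.

```python
# Sketch: checking I-equivalence of two DAGs via the slide's criterion:
# identical undirected skeleton and identical immoralities.
# Graphs are given as lists of directed edges (parent, child).
from itertools import combinations

def skeleton(edges):
    # undirected version of the edge set
    return {frozenset(e) for e in edges}

def immoralities(edges):
    # v-structures x -> z <- y with x and y non-adjacent
    parents = {}
    for u, v in edges:
        parents.setdefault(v, set()).add(u)
    skel = skeleton(edges)
    imms = set()
    for child, ps in parents.items():
        for x, y in combinations(sorted(ps), 2):
            if frozenset((x, y)) not in skel:
                imms.add((frozenset((x, y)), child))
    return imms

def i_equivalent(e1, e2):
    return skeleton(e1) == skeleton(e2) and immoralities(e1) == immoralities(e2)

g1 = [("A", "B"), ("B", "C")]   # A -> B -> C
g2 = [("C", "B"), ("B", "A")]   # C -> B -> A: same skeleton, no immorality
g3 = [("A", "B"), ("C", "B")]   # A -> B <- C: an immorality
print(i_equivalent(g1, g2))     # True
print(i_equivalent(g1, g3))     # False
```

This is exactly why the structure is not uniquely identifiable from data: g1 and g2 encode the same conditional independencies despite different edge directions.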

  15. bias-variance trade-off
  - learning ideally minimizes some risk (expected loss): $\mathbb{E}_{X \sim P^*}[\mathrm{loss}(X)]$
  - in reality we minimize the empirical risk: $\mathbb{E}_{x \in \mathcal{D}}[\mathrm{loss}(x)]$
  - if our model is expressive, we can overfit (high variance):
    low empirical risk does not translate to low risk
    the model does not generalize to samples outside $\mathcal{D}$, as measured by a validation set
    different choices of $\mathcal{D} \sim P^*$ produce very different models $\hat{P}$ (overfitting in density estimation)
  - simple models cannot fit the data (high bias): the model has a bias, and even a large dataset $\mathcal{D}$ cannot help
  - a solution: penalize model complexity (regularization)
  image: http://ipython-books.github.io
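Overfitting in density estimation has a stark concrete form: the raw maximum-likelihood estimate assigns probability zero to any value unseen in $\mathcal{D}$, so its held-out log-likelihood can be negative infinity. A simple complexity penalty, Laplace smoothing, fixes this. The data and names below are illustrative.

```python
# Sketch: overfitting in density estimation and a simple regularizer.
# alpha = 0 gives the raw MLE (overfits: zero mass on unseen values);
# alpha > 0 gives Laplace smoothing, a basic form of regularization.
from collections import Counter
import math

vocab = ["a", "b", "c"]
train = ["a", "a", "b", "a", "b"]   # "c" never observed in D
valid = ["a", "b", "c"]             # held-out sample contains "c"

def estimate(data, alpha):
    counts = Counter(data)
    total = len(data) + alpha * len(vocab)
    return {x: (counts[x] + alpha) / total for x in vocab}

def held_out_ll(P, data):
    # held-out log-likelihood; -inf if P gives some point zero mass
    return sum(math.log(P[x]) if P[x] > 0 else float("-inf") for x in data)

mle = estimate(train, alpha=0.0)
smoothed = estimate(train, alpha=1.0)
print(held_out_ll(mle, valid))       # -inf: the MLE overfit to D
print(held_out_ll(smoothed, valid))  # finite
```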

  19. Discriminative vs. generative training
  - if the goal is prediction of $\hat{P}(X \mid Y)$:
    generative: learn $\hat{P}(X, Y)$ and condition on $Y$ (e.g., MRF)
    discriminative: directly learn $\hat{P}(X \mid Y)$ (e.g., CRF)
  - example: naive Bayes vs. logistic regression
  - naive Bayes, $P(X \mid Y) \propto P(X) P(Y \mid X)$, trained generatively (log-likelihood):
    works better on small datasets (higher bias)
    makes unnecessary conditional independence assumptions about $Y$
    can deal with missing values and learn from unlabeled data
  - logistic regression, $P(X = 1 \mid Y) = \sigma(W^T Y + b)$, trained discriminatively (conditional log-likelihood):
    works better on large datasets
    makes no assumptions about conditional independence in $Y$
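The generative route for naive Bayes can be sketched directly: learn (or, here, just write down) the joint $P(X, Y) = P(X)\prod_i P(Y_i \mid X)$, then condition on the observed features $Y$ to get the label posterior. The CPT numbers below are made up for illustration; they are not estimates from any dataset.

```python
# Sketch: generative prediction with naive Bayes.
# Learn the joint P(X, Y) = P(X) * prod_i P(Y_i | X), then condition on
# the observed features Y. CPT values are illustrative, not learned.
P_X = {0: 0.6, 1: 0.4}              # label prior P(X)
# P(Y_i = 1 | X) for two binary features, indexed by label
P_Y_given_X = [{0: 0.2, 1: 0.7},    # feature 0
               {0: 0.5, 1: 0.9}]    # feature 1

def posterior(y):
    # unnormalized score P(x) * prod_i P(y_i | x), then normalize
    scores = {}
    for x, prior in P_X.items():
        s = prior
        for i, yi in enumerate(y):
            p1 = P_Y_given_X[i][x]
            s *= p1 if yi == 1 else (1 - p1)
        scores[x] = s
    z = sum(scores.values())
    return {x: s / z for x, s in scores.items()}

post = posterior([1, 1])
print(post)   # both features on: the posterior favors X = 1
```

A discriminative model such as logistic regression would instead parameterize $P(X = 1 \mid Y) = \sigma(W^T Y + b)$ directly and never model $P(Y)$ at all.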

  22. Discriminative vs. generative training
  figure: naive Bayes vs. logistic regression on UCI datasets (from: Ng & Jordan, 2001)

  23. summary
  - learning can have different objectives:
    density estimation: calculating $P(x)$, sampling from $P$ (generative modeling)
    prediction (conditional density estimation): discriminative and generative modeling
    knowledge discovery
  - learning is expressed as empirical risk minimization
  - bias-variance trade-off: regularize the model
