SLIDE 1

FINAL EXAM REVIEW

Will cover:

  • All content from the course (Units 1-5)
  • Most points concentrated on Units 3-5 (mixture models, HMMs, MCMC)

Logistics

  • Take-home exam, maximum 2 hour time limit
  • Exam release late afternoon Fri 5/1
  • Exam due NOON (11:59am ET) on Fri 5/8
  • Can use: Any notes, any textbook, any Python code (run locally)
  • Cannot use: The internet to search for answers, other people
  • We will provide most needed formulas or give textbook reference
SLIDE 2

Takeaway Messages

1) When uncertain about a variable, don't condition on it, integrate it away!
2) Model performance is only as good as your fitting algorithm, initialization, and hyperparameter selection.
3) MCMC is a powerful way to estimate posterior distributions (and resulting expectations) even when the model is not analytically tractable.

SLIDE 3

Takeaway 1!

When uncertain about a parameter, better to INTEGRATE AWAY than CONDITION ON

OK: Using a point estimate

$p(x_* \mid \hat{w})$

BETTER: Integrate away $w$ via the sum rule

$p(x_* \mid X) = \int_w p(x_*, w \mid X)\, dw$
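A minimal sketch of the contrast, using a Beta-Bernoulli model where the integral has a closed form (the counts and Beta(1, 1) prior below are made up for illustration):

```python
# Toy Beta-Bernoulli illustration (made-up counts and prior):
# observe n1 heads and n0 tails, with a Beta(a, b) prior on the heads proba w.
a, b = 1.0, 1.0   # assumed uniform Beta(1, 1) prior
n1, n0 = 2, 1

# OK: condition on a point estimate (here the ML estimate w_hat).
w_hat = n1 / (n1 + n0)
p_plugin = w_hat                              # p(x* = heads | w_hat)

# BETTER: integrate w away under the Beta posterior (closed form here).
p_integrated = (n1 + a) / (n1 + n0 + a + b)   # p(x* = heads | X)

print(p_plugin, p_integrated)   # 0.667 vs 0.6
```

With only 3 observations, the integrated predictive (0.6) is pulled toward the prior, while the plug-in estimate (0.667) is overconfident in the small sample.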
SLIDE 4

Takeaway 2

  • Initialization: remember CP3 (GMMs), as well as CP5 (coming!)
  • Algorithm: remember the difference between LBFGS and EM in CP3
  • Hyperparameter: remember the poor performance in CP2

The difference between purple and blue is 0.01 on the log scale. When normalized over 400 pixels (20x20) per image, this means the purple model says the average validation set image is exp(0.01 * 400) ≈ 54.6 times more likely than under the blue model.
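The arithmetic behind that factor, as a one-liner (per-pixel gap and image size taken from the slide):

```python
import math

# Per-pixel log-likelihood gap between the two models, times pixels per image.
gap_per_pixel = 0.01
n_pixels = 20 * 20
ratio = math.exp(gap_per_pixel * n_pixels)
print(round(ratio, 1))  # 54.6
```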

SLIDE 5

Takeaway 3

  • Can use MCMC to do posterior predictive

$p(x_* \mid X) = \int_w p(x_*, w \mid X)\, dw = \int_w p(x_* \mid w)\, p(w \mid X)\, dw \approx \frac{1}{S} \sum_{s=1}^{S} p(x_* \mid w^s), \qquad w^s \overset{iid}{\sim} p(w \mid X)$
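A minimal sketch of this Monte Carlo average, assuming we already have S posterior samples (here faked with draws from a normal, since no specific model is fixed on the slide) and a Gaussian likelihood; averaging in log space avoids underflow:

```python
import numpy as np

def normal_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

rng = np.random.default_rng(0)
# Hypothetical: S samples w^s, standing in for the output of an MCMC run
# targeting p(w | X).
w_samples = rng.normal(loc=1.0, scale=0.2, size=1000)

x_star = 1.5
# Monte Carlo estimate: p(x* | X) ~= (1/S) sum_s p(x* | w^s).
logps = normal_logpdf(x_star, mu=w_samples, sigma=1.0)
log_pred = np.logaddexp.reduce(logps) - np.log(len(w_samples))
print(np.exp(log_pred))
```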
SLIDE 6

You are capable of so many things now!

Given a proposed probabilistic model, you can do:

  • ML estimation of parameters
  • MAP estimation of parameters
  • EM to estimate parameters
  • MCMC estimation of posterior
  • Heldout likelihood computation
  • Hyperparameter selection via CV
  • Hyperparameter selection via evidence

SLIDE 7

Unit 1

Probabilistic Analysis Skills

  • Discrete and continuous r.v.
  • Sum rule and product rule
  • Bayes rule (derived from above)
  • Expectations
  • Independence

Distributions

  • Bernoulli distribution
  • Beta distribution
  • Gamma function
  • Dirichlet distribution

Data analysis

  • Beta-Bernoulli for binary data
      • ML estimation of "proba. heads"
      • MAP estimation of "proba. heads"
      • Estimating the posterior
      • Predicting new data
  • Dirichlet-Categorical for discrete data
      • ML estimation of unigram probas
      • MAP estimation of unigram probas
      • Estimating the posterior
      • Predicting new data

Optimization Skills

  • Finding extrema by zeros of first derivative
  • Handling Constraints via Lagrange multipliers
SLIDE 8

Example Unit 1 Question

1) True or False: Bayes Rule can be proved using the Sum Rule and the Product Rule

2) You're modeling the wins/losses of your favorite sports team with a Beta-Bernoulli model.

   a) You assume each game's binary outcome (win=1/loss=0) is iid.
   b) You observe in preseason play: 5 wins and 3 losses
   c) Suggest a prior to use for the win probability
   d) Identify 2 or more assumptions about this model that may not be valid in the real world (with concrete reasons)
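One possible answer sketch in code for the estimation side of this question, assuming a Beta(2, 2) prior (one reasonable choice; the question does not fix one):

```python
# Beta-Bernoulli fit for the sports-team question: 5 wins, 3 losses.
wins, losses = 5, 3
a, b = 2.0, 2.0   # assumed prior: Beta(2, 2), weakly favoring 0.5

p_ml = wins / (wins + losses)                          # ML estimate
p_map = (wins + a - 1) / (wins + losses + a + b - 2)   # MAP (posterior mode)
p_mean = (wins + a) / (wins + losses + a + b)          # posterior mean

print(p_ml, p_map, p_mean)  # 0.625, 0.6, 0.583...
```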

SLIDE 9

Example Unit 1 Answer

SLIDE 10

Unit 2

Probabilistic Analysis Skills

  • Joints, conditionals, marginals
  • Covariance matrices (pos. definite, symmetric)
  • Gaussian conjugacy rules

Linear Algebra Skills

  • Determinants
  • Positive definite
  • Invertibility

Distributions

  • Univariate Gaussian distribution
  • Multivariate Gaussian distribution

Data analysis

  • Gaussian-Gaussian for regression
  • ML estimation of weights
  • MAP estimation of weights
  • Estimating the posterior over weights
  • Predicting new data

Optimization Skills

  • Convexity and second derivatives
  • Finding extrema by zeros of first derivative
  • First and second order gradient descent
SLIDE 11

Example Unit 2 Question

You are doing regression with the following model

  • Normal prior on the weights
  • Normal likelihood:

a. Consider the following two estimators for t_*. What's the difference?
b. Suggest at least 2 ways to pick a value for the hyperparameter \sigma

$p(t_n \mid x_n) = \text{NormPDF}(w\, x_n, \sigma^2)$

$\hat{t}_* = w_{\text{MAP}}\, x_* \qquad \tilde{t}_* = \mathbb{E}_{t \sim p(t \mid x_*, X)}[t]$
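A rough numerical sketch of the two estimators under one concrete assumed setup (1-D inputs, known noise $\sigma$, prior $w \sim N(0, \tau^2)$; none of these specifics are fixed by the slide):

```python
import numpy as np

# Minimal 1-D Bayesian linear regression sketch (assumed setup).
rng = np.random.default_rng(1)
sigma, tau = 0.5, 1.0
x = rng.normal(size=20)
t = 2.0 * x + rng.normal(scale=sigma, size=20)

# Gaussian posterior over w has closed-form precision and mean.
prec = 1.0 / tau**2 + np.sum(x**2) / sigma**2
w_map = (np.sum(x * t) / sigma**2) / prec   # posterior mean = MAP (Gaussian)

x_star = 1.0
t_hat = w_map * x_star          # plug-in point prediction
t_tilde = w_map * x_star        # posterior predictive mean E[t | x*, X]
# Same center here, but the predictive also carries variance:
pred_var = sigma**2 + x_star**2 / prec
print(t_hat, t_tilde, pred_var)
```

For this linear-Gaussian model the two point predictions coincide; what the integrated version adds is the extra predictive variance term from uncertainty in w.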
SLIDE 12

Example Unit 2 Answer

SLIDE 13

Unit 3: K-Means and Mixture Models

Distributions

  • Mixtures of Gaussians (GMMs)
  • Mixtures in general
  • Can use any likelihood (not just Gauss)

Numerical Methods

  • logsumexp

Data analysis

  • K-means or GMM for a dataset
  • How to pick K hyperparameter
  • Why multiple inits matter
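The logsumexp trick listed under Numerical Methods can be sketched as: subtract the max before exponentiating, so sums of very small likelihoods don't underflow.

```python
import numpy as np

def logsumexp(logvals):
    # Subtracting the max keeps at least one exp() argument at 0.
    m = np.max(logvals)
    return m + np.log(np.sum(np.exp(logvals - m)))

logs = np.array([-1000.0, -1001.0, -1002.0])
print(logsumexp(logs))               # ~ -999.59, finite
print(np.log(np.sum(np.exp(logs))))  # naive version underflows to -inf
```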

Optimization Skills

  • K-means objective and algorithm
  • Coordinate ascent / descent algorithms
  • Optimization objectives with hidden vars
  • Complete likelihood: p(x, z | \theta)
  • Incomplete likelihood: p( x | \theta)
  • Expectations of complete likelihood
  • How to derive it
  • Why it is important
  • Expectation-Maximization algorithm
  • Lower bound objective
  • What E-step does
  • What M-step does
SLIDE 14

Example Unit 3 Question

Consider two possible models for clustering 1-dim. data

  • K-Means
  • Gaussian mixtures

Name ways that the GMM is more flexible as a model:

  • How is the GMM’s treatment of assignments more flexible?
  • How is the GMM’s parameterization of a “cluster” more flexible?

Under what limit does the GMM likelihood reduce to the K-means objective?
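One way to see the limit numerically: a sketch of the E-step responsibilities for a 1-D, equal-weight, shared-variance GMM (all assumptions for illustration), where shrinking the variance recovers K-means-style hard assignments:

```python
import numpy as np

def responsibilities(x, mus, var):
    # E-step for a 1-D GMM with equal weights and shared variance var.
    log_r = -0.5 * (x - mus)**2 / var
    log_r -= np.max(log_r)          # stabilize before exponentiating
    r = np.exp(log_r)
    return r / r.sum()

x, mus = 1.2, np.array([0.0, 2.0])
print(responsibilities(x, mus, var=1.0))    # soft: both clusters get weight
print(responsibilities(x, mus, var=1e-4))   # var -> 0: hard assignment
```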

SLIDE 15

Example Unit 3 Answer

SLIDE 16

Unit 4: Markov models and HMMs

Probabilistic Analysis Skills

  • Markov conditional independence
  • Stationary distributions
  • Deriving independence properties
  • Like HW4 problem 1

Linear Algebra Skills

  • Eigenvectors/values for stationary distributions

Distributions

  • Discrete Markov models
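The eigenvector computation for a stationary distribution can be sketched as follows (the 2-state transition matrix is made up for illustration): the stationary distribution of a row-stochastic matrix A is the left eigenvector with eigenvalue 1, normalized to sum to 1.

```python
import numpy as np

# Made-up row-stochastic transition matrix for a 2-state Markov chain.
A = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Left eigenvectors of A are eigenvectors of A.T; pick eigenvalue 1.
vals, vecs = np.linalg.eig(A.T)
i = np.argmin(np.abs(vals - 1.0))
pi = np.real(vecs[:, i])
pi = pi / pi.sum()          # normalize (also fixes an overall sign flip)
print(pi)                   # [0.833..., 0.166...], satisfies pi @ A == pi
```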

Algorithm Skills

  • Forward algorithm
  • Backward algorithm
  • Viterbi algorithm

(all examples of dynamic programming)

Optimization Skills

  • EM for HMMs
  • E-step
  • M-step
SLIDE 17

Example Unit 4 Question

  • Describe how the Viterbi algorithm is an instance of dynamic programming

Identify all the key parts:

  • What is the fundamental problem being solved?
  • How is the final solution built from solutions to smaller problems?
  • How to describe all the solutions as a big “table” that should be filled in?
  • What is the “base case” update (the simplest subproblem)?
  • What is the recursive update?
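A minimal Viterbi sketch in log space, illustrating the table, base case, and recursive update asked about above (the 2-state example HMM is made up):

```python
import numpy as np

def viterbi(log_pi, log_A, log_lik):
    """Most likely state path; log_lik[t, k] = log p(x_t | z_t = k)."""
    T, K = log_lik.shape
    # "Table" of subproblems: best log-prob of any path ending in state k at t.
    delta = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    delta[0] = log_pi + log_lik[0]                 # base case
    for t in range(1, T):                          # recursive update
        scores = delta[t - 1][:, None] + log_A     # [prev state, next state]
        back[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_lik[t]
    # Backtrack to recover the argmax path.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny worked example (hypothetical 2-state HMM, T=3 observations):
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_lik = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.05, 0.95]]))
print(viterbi(log_pi, log_A, log_lik))  # [0, 0, 1]
```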
SLIDE 18

Example Unit 4 Answer

SLIDE 19

Unit 5: Markov Chain Monte Carlo

Probabilistic Analysis Skills

  • Inverse CDF rule for sampling
  • Transformations of random variables
  • Ancestral sampling
  • Stationary distributions
  • Remember: always a unique stationary distribution if the Markov chain is ergodic

  • Detailed balance

Linear Algebra Skills

  • Eigenvectors/values for stationary distributions

MCMC algorithms

  • Metropolis
  • Metropolis-Hastings
  • Gibbs sampling

Data Analysis

  • Using MCMC to estimate a posterior
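A minimal random-walk Metropolis sketch (the target density, step size, and chain length are illustrative choices, not from the slides): the sampler needs only an unnormalized log density, which is exactly why MCMC works for analytically intractable posteriors.

```python
import numpy as np

def metropolis(log_target, x0, n_steps, step=0.5, seed=0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    rng = np.random.default_rng(seed)
    x, lp = x0, log_target(x0)
    samples = []
    for _ in range(n_steps):
        prop = x + step * rng.normal()             # symmetric proposal
        lp_prop = log_target(prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # accept w.p. min(1, ratio)
            x, lp = prop, lp_prop
        samples.append(x)
    return np.array(samples)

# Example target: unnormalized standard normal log-density.
samples = metropolis(lambda x: -0.5 * x**2, x0=0.0, n_steps=20000, step=1.0)
print(samples[5000:].mean(), samples[5000:].std())  # near 0 and 1
```

Discarding the first chunk of the chain as burn-in (here 5000 steps) is the usual practice before computing expectations from the samples.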
SLIDE 20

Example Unit 5 Question

  • 5a. Can we use the inverse CDF rule for sampling from a univariate Normal analytically? Can we do it numerically? If so, how?
  • 5b. How would you use ancestral sampling to sample from a Bayesian linear regression model?
  • 5c. T/F: We only need to run one MCMC chain in practice and we can use all samples from that chain
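For 5a, one numerical route (a sketch, not the only answer): the Normal CDF has no closed-form inverse, but Python's statistics.NormalDist evaluates it numerically, so we can apply the inverse CDF rule with u ~ Uniform(0, 1).

```python
import random
from statistics import NormalDist

# Inverse CDF rule: draw u ~ Uniform(0, 1), return x = F^{-1}(u).
rng = random.Random(0)
dist = NormalDist(mu=0.0, sigma=1.0)
samples = [dist.inv_cdf(rng.random()) for _ in range(10000)]
print(sum(samples) / len(samples))   # near 0
```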

SLIDE 21

Example Unit 5 Answer