Clustering and Prediction: Probability and Statistics for Data Science (PowerPoint transcript)



SLIDE 1

Clustering and Prediction

Probability and Statistics for Data Science CSE594 - Spring 2016

SLIDE 2

But first,

One final useful statistical technique from Part II

SLIDE 3

Confidence Intervals

Motivation: p-values tell a nice succinct story but neglect a lot of information. When estimating a point that is approximately normal (e.g. an error or a mean), find the CI% range based on the standard normal distribution (e.g. for CI% = 95, z = 1.96).
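As an illustrative sketch (not from the slides; the function name `mean_ci` is mine), a normal-approximation 95% CI for a sample mean:

```python
import math

def mean_ci(xs, z=1.96):
    """Normal-approximation CI for the mean: mean +/- z * standard error."""
    n = len(xs)
    m = sum(xs) / n
    var = sum((x - m) ** 2 for x in xs) / (n - 1)  # sample variance
    se = math.sqrt(var / n)                        # standard error of the mean
    return m - z * se, m + z * se

lo, hi = mean_ci([3, 4, 4, 5, 6, 7, 7, 8])
```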

SLIDE 5

Resampling Techniques Revisited

The bootstrap

  • What if we don’t know the distribution?
  • Resample many potential datasets based on the observed data and find

the range that CI% of the resampled estimates (e.g. of the mean) fall in. Resample: for each i in n observations, put all observations in a hat and draw one with replacement (all observations are equally likely).
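A minimal percentile-bootstrap sketch of the procedure above (the function name `bootstrap_ci` is mine, not the course's):

```python
import random

def bootstrap_ci(xs, stat, n_boot=10000, ci=95, seed=0):
    """Percentile bootstrap: resample with replacement, collect the statistic,
    and return the central CI% range of the resampled estimates."""
    rng = random.Random(seed)
    n = len(xs)
    stats = sorted(stat([rng.choice(xs) for _ in range(n)]) for _ in range(n_boot))
    alpha = (100 - ci) / 200            # e.g. 0.025 for a 95% CI
    return stats[int(alpha * n_boot)], stats[int((1 - alpha) * n_boot) - 1]

mean = lambda v: sum(v) / len(v)
lo, hi = bootstrap_ci([3, 4, 4, 5, 6, 7, 7, 8], mean)
```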


SLIDE 7

Clustering and Prediction

(now back to our regularly scheduled program)

  • I. Probability Theory
  • II. Discovery: Quantitative Research Methods

  • III. Prediction (clustering and prediction)

SLIDE 8

Clustering and Prediction

[Figure: design matrix with a few predictors X1, X2, X3 and an outcome Y]

SLIDE 9

Clustering and Prediction

[Figure: the discovery setting adds many predictors X1, X2, ... Xm alongside the outcome Y]

SLIDE 10

Clustering and Prediction

[Figure: as above] m < ~5 or m << n (much less) for the small design; m > ~100, m ≈ n, or m >> n for the discovery design

SLIDE 11

Clustering and Prediction

[Figure: many predictors X1, X2, ... Xm with no outcome Y (the unsupervised setting)]

SLIDE 13

Overfitting (1-d example)

[Figure: Underfit (high bias) vs. Overfit (high variance); image credit: Scikit-learn; in practice data are rarely this clear]

SLIDE 14

Common Goal: Generalize to new data

Original Data New Data?

Does the model hold up? Model

SLIDE 15

Common Goal: Generalize to new data

Training Data Testing Data

Does the model hold up? Model

SLIDE 16

Common Goal: Generalize to new data

Training Data Testing Data

Does the model hold up? Model. Development Data: set training parameters for the model.

SLIDE 17

Feature Selection / Subset Selection

Forward Stepwise Selection:

  • start with: current_model just has the intercept (mean)

remaining_predictors = all_predictors

  • for i in range(k):

# find the best p to add to current_model: for p in remaining_predictors, refit current_model with p # add the best p, based on RSS_p, to current_model # remove that p from remaining_predictors
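The loop above can be sketched with NumPy least squares (a hedged illustration, not the course's code; `rss` and `forward_stepwise` are names I am introducing):

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

def forward_stepwise(X, y, k):
    n, m = X.shape
    current = []                        # indices in the current model
    remaining = list(range(m))
    intercept = np.ones((n, 1))         # start from intercept-only (the mean)
    for _ in range(k):
        # find the best p to add to current_model, judged by RSS_p
        best_p = min(remaining,
                     key=lambda p: rss(np.hstack([intercept, X[:, current + [p]]]), y))
        current.append(best_p)          # add best p to current_model
        remaining.remove(best_p)        # remove p from remaining_predictors
    return current

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 2] + rng.normal(scale=0.1, size=200)
selected = forward_stepwise(X, y, 2)
```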

SLIDE 18

Regularization (Shrinkage)

[Figure: coefficient comparison: no selection (weight = beta) vs. forward stepwise]

Why just keep or discard features?

SLIDE 19

Regularization (L2, Ridge Regression)

Idea: Impose a penalty on the size of the weights. Ordinary least squares objective: min_β Σ_i (y_i − x_i·β)². Ridge regression: min_β Σ_i (y_i − x_i·β)² + λ Σ_j β_j²


SLIDE 21

Regularization (L2, Ridge Regression)

Idea: Impose a penalty on the size of the weights. Ordinary least squares objective: min_β Σ_i (y_i − x_i·β)². Ridge regression: min_β Σ_i (y_i − x_i·β)² + λ Σ_j β_j². In matrix form:

β̂ = (XᵀX + λI)⁻¹ Xᵀy

I: m x m identity matrix
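The closed form can be checked directly (a sketch under the standard formulation; `ridge` is an illustrative name):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam * I)^-1 X'y."""
    m = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
b_ols = ridge(X, y, 0.0)      # lam = 0 recovers ordinary least squares
b_pen = ridge(X, y, 100.0)    # heavy penalty shrinks weights toward 0
```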

SLIDE 22

Regularization (L1, The “Lasso”)

Idea: Impose a penalty that zeroes out some weights. The Lasso objective: min_β Σ_i (y_i − x_i·β)² + λ Σ_j |β_j|. No closed-form matrix solution, but often solved with coordinate descent.

Application: m ≈ n or m >> n
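A minimal coordinate-descent sketch for the Lasso objective above (illustrative only; `lasso_cd` and `soft_threshold` are names I am introducing, and the update assumes no intercept):

```python
import numpy as np

def soft_threshold(z, g):
    """Shrink z toward 0 by g; exactly 0 inside [-g, g]."""
    return np.sign(z) * max(abs(z) - g, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for min_b ||y - Xb||^2 + lam * sum_j |b_j|."""
    n, m = X.shape
    b = np.zeros(m)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(m):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual, excluding feature j
            b[j] = soft_threshold(X[:, j] @ r_j, lam / 2) / col_sq[j]
    return b

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)
b = lasso_cd(X, y, lam=50.0)   # the penalty zeroes out the two irrelevant features
```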

SLIDE 23

Regularization Comparison

SLIDE 24

Review, 3/31 - 4/5

  • Confidence intervals
  • Bootstrap
  • Prediction Framework: Train, Development, Test
  • Overfitting: Bias versus Variance
  • Feature Selection: Forward Stepwise Regression
  • Ridge Regression (L2 regularization)
  • Lasso Regression (L1 regularization)
SLIDE 25

Common Goal: Generalize to new data

Training Data Testing Data

Does the model hold up? Model. Development Data: set parameters for the model.

SLIDE 26

N-Fold Cross-Validation

Goal: Decent estimate of model accuracy

[Figure: all data split into train / dev / test folds; the test and dev folds rotate across Iter 1, Iter 2, Iter 3, ...]
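The rotating folds can be sketched in plain Python (the function name `n_fold_indices` is mine, for illustration):

```python
def n_fold_indices(n_items, n_folds):
    """Yield (train, test) index lists; each fold serves as the test set once."""
    idx = list(range(n_items))
    fold = n_items // n_folds
    for i in range(n_folds):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, test

folds = list(n_fold_indices(10, 5))
```

In practice a dev fold is carved out of each training split to tune parameters before touching the test fold.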

SLIDE 27

Supervised vs. Unsupervised

Supervised

  • Predicting an outcome
  • Loss function used to characterize quality of prediction
SLIDE 28

Supervised vs. Unsupervised

Supervised

  • Predicting an outcome
  • Loss function used to characterize quality of prediction

Unsupervised

  • No outcome to predict
  • Goal: Infer properties of the data without a supervised loss function.
  • Often larger data.
  • Don’t need to worry about conditioning on another variable.
SLIDE 29

K-Means Clustering

Clustering: Group similar observations, often over unlabeled data. K-means: A “prototype” method (i.e. not based on an algebraic model).

Euclidean distance d. centers = a random selection of k cluster centers; until centers converge:

  • 1. For all xi, find the closest center (according to d)
  • 2. Recalculate each center as the mean of the points assigned to it
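The two-step loop is Lloyd's algorithm; a NumPy sketch (illustrative, with an added guard for empty clusters):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: assign points to the nearest center, then re-average."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 1. for all x_i, find the closest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 2. recalculate each center as the mean of its assigned points
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

X = np.vstack([np.random.default_rng(1).normal(0, 0.5, (50, 2)),
               np.random.default_rng(2).normal(5, 0.5, (50, 2))])
centers, labels = kmeans(X, 2)
```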
SLIDE 30

Review 4-7

  • Cross-validation
  • Supervised Learning
  • Euclidean distance in m-dimensional space
  • K-Means clustering
SLIDE 31

K-Means Clustering

Understanding K-Means

(source: Scikit-Learn)

SLIDE 32

Dimensionality Reduction - Concept

SLIDE 33

Dimensionality Reduction - PCA

Linear approximation of the data in q dimensions. Found via Singular Value Decomposition:

X = UDVᵀ
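A small sketch of PCA via SVD on centered data (illustrative; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# data stretched along the first axis; center before the SVD
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
Xc = X - X.mean(axis=0)

U, D, Vt = np.linalg.svd(Xc, full_matrices=False)   # X = U D V^T
components = Vt                          # rows of V^T are the principal directions
var_explained = D ** 2 / (D ** 2).sum()  # percentage variance explained
Xq = Xc @ Vt[:1].T                       # projection to q = 1 dimensions
```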

SLIDE 34

Review 4-11

  • K-Means Issues
  • Dimensionality Reduction
  • PCA

○ What is V (the components)? ○ Percentage variance explained

SLIDE 35

[Figure only]

SLIDE 36

Classification: Regularized Logistic Regression

SLIDE 37

Classification: Naive Bayes

Bayes classifier: choose the class most likely according to P(y|X). (y is a class label)

SLIDE 38

Classification: Naive Bayes

Bayes classifier: choose the class most likely according to P(y|X). (y is a class label) Naive Bayes classifier: Assumes all predictors are independent given y.

SLIDE 39

Classification: Naive Bayes

Bayes Rule: P(A|B) = P(B|A)P(A) / P(B)

SLIDE 40

Classification: Naive Bayes

Posterior Prior Likelihood

SLIDE 41

Classification: Naive Bayes

Maximum a Posteriori (MAP): Pick the class with the maximum posterior probability. Posterior Prior Likelihood

SLIDE 42

Classification: Naive Bayes

Maximum a Posteriori (MAP): Pick the class with the maximum posterior probability. Unnormalized Posterior Posterior Prior Likelihood

SLIDE 43

Gaussian Naive Bayes

Assume P(X|Y) is Normal

SLIDE 44

Gaussian Naive Bayes

Assume P(X|Y) is Normal. Then, training is: 1. Estimate P(Y = k) = count(Y = k) / count(Y = *). 2. MLE to find parameters (μ, σ) for each class of Y. (the “class conditional distribution”)
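The two training steps, plus MAP prediction from the later slides, can be sketched as follows (illustrative code, not the course's implementation; function names are mine):

```python
import math

def train_gnb(X, y):
    """Estimate P(Y=k) and a per-class, per-feature Normal(mu, sigma) by MLE."""
    model = {}
    for k in set(y):
        rows = [x for x, yi in zip(X, y) if yi == k]
        prior = len(rows) / len(y)                       # P(Y = k)
        mus = [sum(col) / len(col) for col in zip(*rows)]
        sigmas = [math.sqrt(sum((v - mu) ** 2 for v in col) / len(col))
                  for col, mu in zip(zip(*rows), mus)]   # MLE sigma
        model[k] = (prior, mus, sigmas)
    return model

def log_norm_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def predict(model, x):
    """MAP: argmax_k log P(Y=k) + sum_j log P(x_j | Y=k)."""
    return max(model, key=lambda k: math.log(model[k][0]) +
               sum(log_norm_pdf(v, mu, s)
                   for v, mu, s in zip(x, model[k][1], model[k][2])))

X = [[1.0], [1.2], [0.8], [5.0], [5.2], [4.8]]
y = [0, 0, 0, 1, 1, 1]
m = train_gnb(X, y)
```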


SLIDE 47

Example Project

https://docs.google.com/presentation/d/1jD-FQhOTaMh82JRc-p81TY1QCUbtpKZGwe5U4A3gml8/

SLIDE 48

Review: 4-14, 4-19

  • Types of machine learning problems
  • Regularized Logistic Regression
  • Naive Bayes Classifier
  • Implementing a Gaussian Naive Bayes
  • Application of probability, statistics, and prediction for measuring county

mortality rates from Twitter.

SLIDE 49

Gaussian Naive Bayes

Assume P(X|Y) is Normal. Then, training is: 1. Estimate P(Y = k) = count(Y = k) / count(Y = *). 2. MLE to find parameters (μ, σ) for each class of Y. (the “class conditional distribution”) Maximum a Posteriori (MAP): Pick the class with the maximum posterior probability.

SLIDE 50

MLE: the parameters under which the observed data have the highest probability.

Gaussian Naive Bayes

Unnormalized Posterior Maximum a Posteriori (MAP): Pick the class with the maximum posterior probability.

SLIDE 51

Gaussian Naive Bayes

Assume P(X|Y) is Normal. Then, training is: 1. Estimate P(Y = k) = count(Y = k) / count(Y = *). 2. MLE to find parameters (μ, σ) for each class of Y. (the “class conditional distribution”) Unnormalized Posterior. Without knowing P(X), can we turn this into the (normalized) posterior? Maximum a Posteriori (MAP): Pick the class with the maximum posterior probability.

SLIDE 52

Use the Law of Total Probability: for all i = 1 ... k, where A1 ... Ak partition Ω, P(B) = Σ_i P(B|Ai) P(Ai):

Gaussian Naive Bayes

Unnormalized Posterior Without knowing P(X), can we turn this into the (normalized) posterior? Maximum a Posteriori (MAP): Pick the class with the maximum posterior probability.


SLIDE 54

Use the Law of Total Probability: for all i = 1 ... k, where A1 ... Ak partition Ω, P(B) = Σ_i P(B|Ai) P(Ai):

Gaussian Naive Bayesian Inference

Unnormalized Posterior. Without knowing P(X), can we turn this into the (normalized) posterior? (discrete and continuous cases) A is “marginalized out”.
SLIDE 55

Q: What distinguishes Bayesian inference? A: Assume a prior.

Gaussian Naive Bayesian Inference

SLIDE 56

Bayesian Inference

Given: a prior and a likelihood. Goal: Compute the posterior.

SLIDE 57

Bayesian Inference

Given: a prior and a likelihood. Goal: Compute the posterior. Types of priors:

  • Uninformative (Improper: not a probability (e.g. constant))
  • Belief-based
  • Conjugate to a likelihood: the posterior is in the same family as the prior.

SLIDE 58

Bayesian Inference

Given: a prior and a likelihood. Goal: Compute the posterior. Types of priors:

  • Uninformative (Improper: not a probability (e.g. constant))
  • Belief-based
  • Conjugate to a likelihood: the posterior is in the same family as the prior.

Example: Beta(α, β) is conjugate to a Bernoulli likelihood.

https://en.wikipedia.org/wiki/Conjugate_prior#Table_of_conjugate_distributions


SLIDE 62

Bayesian Inference

Given: a prior and a likelihood. Goal: Compute the posterior.

  • Posterior predictive distribution

Like a posterior-weighted average of P(Znew|θ)

SLIDE 63

Review, 4-21

  • How to turn an unnormalized posterior into a normalized posterior
  • What is Bayesian Inference?
  • Typical definition of a posterior
  • Predictive Distribution
SLIDE 64

Bayesian Vs. Frequentist

Bayesian

  • Probability is degree of belief

=> can derive probability of many things

  • Can estimate probability of parameters
  • Can draw inferences about parameter

probability distribution, point estimates, intervals

Frequentist

  • Limiting relative frequencies => probability is an observed property
  • Parameters fixed and unknown => no need for probability of parameter
  • Procedures for long-run frequencies (e.g. 95% CI)

SLIDE 67

Bayesian Vs. Frequentist

Pro Bayes:

  • Estimating distributions => uncertainty built in
  • No need to choose model; always “admissible”
  • Automatic regularization

Con:

  • Need to assume a prior (even if none obviously works)
  • Approximate solutions: tend to be a little less accurate for simple classification

/ regression problems

There is at least one situation where the model performs at least as well as any other model.

SLIDE 68

Revisiting N-Fold Cross-Validation

Goal: Decent estimate of model accuracy

[Figure: all data split into rotating train / dev / test folds across iterations]

SLIDE 69

Revisiting N-Fold Cross-Validation

[Figure: the fold diagram alongside the Training / Development / Testing pipeline: does the model hold up? Development data are used to set training parameters]

SLIDE 71

Revisiting N-Fold Cross-Validation

Goal: Select a super-reliable penalty (alpha) (this is overkill)

[Figure: within each training fold, dev folds select alpha; then pick the best model and predict on the test fold]

SLIDE 73

Revisiting N-Fold Cross-Validation

Example: Assignment 3

SLIDE 74

Introduction to Time Series Analysis

Goal: Understanding temporal patterns of data (or real world events) Common tasks:

  • Trend Analysis: Extrapolate patterns over time (typically descriptive).
  • Forecasting: Predicting a future event (predictive).

(contrasts with “cross-sectional” prediction -- predicting a different group)

SLIDE 75

Introduction to Causal Inference (Revisited)

X causes Y as opposed to X is associated with Y

Changing X will change the distribution of Y. [Diagram: X causes Y vs. Y causes X]

SLIDE 76

Spurious Correlations

Extremely common in time-series analysis.

SLIDE 77

Spurious Correlations

Extremely common in time-series analysis. http://tylervigen.com/spurious-correlations

SLIDE 78

Introduction to Causal Inference (Revisited)

X causes Y as opposed to X is associated with Y

Changing X will change the distribution of Y. [Diagram: X causes Y vs. Y causes X] Counterfactual Model: Exposed or Not Exposed: X = 1 or 0. Causal Odds Ratio:

SLIDE 79

Simpson’s “Paradox”

         Z = men            Z = women
         Y=1     Y=0        Y=1     Y=0
X=1      .15     .225       .1      .025
X=0      .0375   .0875      .2625   .1125

http://vudlab.com/simpsons/
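The reversal in the table can be checked directly (an illustrative sketch; `p_y1_given_x` is a name I am introducing):

```python
# joint probabilities P(X, Y, Z) from the slide's table
p = {  # (x, y, z): probability
    (1, 1, "men"): .15,    (1, 0, "men"): .225,
    (0, 1, "men"): .0375,  (0, 0, "men"): .0875,
    (1, 1, "women"): .1,   (1, 0, "women"): .025,
    (0, 1, "women"): .2625, (0, 0, "women"): .1125,
}

def p_y1_given_x(x, z=None):
    """P(Y=1 | X=x), optionally within stratum Z=z."""
    keep = {k: v for k, v in p.items()
            if k[0] == x and (z is None or k[2] == z)}
    return sum(v for k, v in keep.items() if k[1] == 1) / sum(keep.values())

# within each stratum, X=1 looks better (0.4 > 0.3 and 0.8 > 0.7) ...
within = (p_y1_given_x(1, "men"), p_y1_given_x(0, "men"),
          p_y1_given_x(1, "women"), p_y1_given_x(0, "women"))
# ... but aggregated over Z the direction reverses (0.5 < 0.6)
agg = (p_y1_given_x(1), p_y1_given_x(0))
```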

SLIDE 80

Autocorrelation

(a.k.a. serial correlation). Quantifying the strength of a temporal pattern in serial data. Requirements:

  • Assume regular measurement (hourly, daily, monthly...etc..)
SLIDE 81

Autocorrelation

Quantifying the strength of a temporal pattern in serial data.

Which have temporal patterns?

SLIDE 82

Autocorrelation

Quantifying the strength of a temporal pattern in serial data.

Which have temporal patterns?

[Panel labels: white noise; strong autocorrelation; weak autocorrelation; sinusoidal]

SLIDE 83

Autocorrelation

Quantifying the strength of a temporal pattern in serial data. Q: HOW?

SLIDE 84

Autocorrelation

Quantifying the strength of a temporal pattern in serial data. Q: HOW? A: Correlate with a copy of self, shifted slightly. ….

SLIDE 85

Autocorrelation

Quantifying the strength of a temporal pattern in serial data. Q: HOW? A: Correlate with a copy of self, shifted slightly. Y = [3, 4, 4, 5, 6, 7, 7, 8] correlate(Y[:-1], Y[1:]) #lag=1 correlate(Y[:-2], Y[2:]) #lag=2 ….
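A self-contained sketch of the lagged correlation (the function name `autocorr` is mine; it is a plain Pearson correlation of the series with its shifted copy):

```python
def autocorr(y, lag):
    """Pearson correlation of the series with itself shifted by `lag`."""
    a, b = y[:-lag], y[lag:]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (z - mb) for x, z in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((z - mb) ** 2 for z in b) ** 0.5
    return cov / (sa * sb)

Y = [3, 4, 4, 5, 6, 7, 7, 8]
r1 = autocorr(Y, 1)   # a steadily rising series is strongly autocorrelated
```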


SLIDE 87

Review, 4-26 and 4-28

  • Bayesian versus Frequentist learning
  • Why / when to use Dev within folds of N-Fold CV
  • Time series: what distinguishes it
  • Causal Inference
  • Autocorrelation

○ Types of univariate time series ○ Lag plots

SLIDE 88

Autoregressive Model

AR Models: Linear AR model:

SLIDE 89

Autoregressive Model

AR Models: Linear AR model: Notation:
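The slide's equation did not survive extraction; the standard linear AR(p) model it presumably showed is:

```latex
Y_t = \beta_0 + \sum_{i=1}^{p} \beta_i \, Y_{t-i} + \varepsilon_t,
\qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)
```

Notation: AR(p) denotes a model with p lagged terms of Y.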


SLIDE 91

Moving Average

Based on error; (a “smoothing” technique). Q: Best estimator of random data (i.e. white noise)?

SLIDE 92

Moving Average

Based on error; (a “smoothing” technique). Q: Best estimator of random data (i.e. white noise)? A: The mean

SLIDE 93

Moving Average

Based on error; (a “smoothing” technique). Q: Best estimator of random data (i.e. white noise)? A: The mean Simple Moving Average
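A simple moving average replaces each point with the mean of its window; a one-line sketch (the function name is mine):

```python
def simple_moving_average(y, w):
    """Mean of each length-w window; the mean is the best constant
    estimator of white noise, so averaging smooths the series."""
    return [sum(y[i:i + w]) / w for i in range(len(y) - w + 1)]

Y = [3, 4, 4, 5, 6, 7, 7, 8]
sma = simple_moving_average(Y, 3)
```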

SLIDE 94

Moving Average Model

In a regression model (ARMA or ARIMA), we consider error terms


SLIDE 96

Moving Average Model

In a regression model (ARMA or ARIMA), we consider error terms attributed to “shocks”: independent draws from a normal distribution. Notation: MA(q) denotes a model with q lagged error terms.
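The equation lost from the slide is presumably the standard MA(q) model:

```latex
Y_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \, \varepsilon_{t-i},
\qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2)
```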

SLIDE 97

ARMA Models

AutoRegressive (AR) Moving Average (MA) Model ARMA(p, q): ARMA(1, 1): example: Y is sales; error may be effect from coupon or advertising (credit: Ben Lambert)
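The ARMA(p, q) equation the slide showed combines the two pieces above; in standard form:

```latex
Y_t = \beta_0 + \sum_{i=1}^{p} \beta_i \, Y_{t-i}
      + \varepsilon_t + \sum_{j=1}^{q} \theta_j \, \varepsilon_{t-j}
```

so ARMA(1, 1) is Y_t = β0 + β1 Y_{t-1} + ε_t + θ1 ε_{t-1}.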

SLIDE 98

Time-series Applications

  • ARMA

○ Economic indicators ○ System performance ○ Trend analysis (often situations where there is a general trend and random “shocks”)

  • Univariate Models in General

○ Anomaly Detection ○ Forecasting ○ Season Trends ○ Signal Processing

  • Integration as predictors within multivariate models

statsmodels.tsa.arima_model

SLIDE 99

Review: 5-3

  • Autoregressive Model
  • Notation
  • Simple Moving Average
  • Moving Average Model
  • ARMA
  • Applications of Time Series Models