SLIDE 1

CS6220: DATA MINING TECHNIQUES

Instructor: Yizhou Sun

yzsun@ccs.neu.edu January 19, 2016

Matrix Data: Prediction

SLIDE 2

Announcements

  • Team formation due next Wednesday
  • Homework 1 out by tomorrow

SLIDE 3

Today’s Schedule

  • Course Project Introduction
  • Linear Regression Model
  • Decision Tree

SLIDE 4

Methods to Learn

Data types: matrix, text, set, sequence, time series, graph & network, images

  • Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix) · HMM (sequence) · Label Propagation (graph & network) · Neural Network (images)
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix) · PLSA (text) · SCAN; Spectral Clustering (graph & network)
  • Frequent Pattern Mining: Apriori; FP-growth (set) · GSP; PrefixSpan (sequence)
  • Prediction: Linear Regression (matrix) · Autoregression (time series) · Collaborative Filtering
  • Similarity Search: DTW (time series) · P-PageRank (graph & network)
  • Ranking: PageRank (graph & network)

SLIDE 5

How to learn these algorithms?

  • Three levels
  • When is it applicable?
  • Input, output, strengths, weaknesses, time complexity
  • How does it work?
  • Pseudo-code, workflows, major steps
  • Can work out a toy problem with pen and paper
  • Why does it work?
  • Intuition, philosophy, objective, derivation, proof

SLIDE 6

Matrix Data: Prediction

  • Matrix Data
  • Linear Regression Model
  • Model Evaluation and Selection
  • Summary

SLIDE 7

Example

A matrix of n × p:

  • n data objects / points
  • p attributes / dimensions

    X = ( x11 … x1f … x1p
           ⋮     ⋮     ⋮
          xi1 … xif … xip
           ⋮     ⋮     ⋮
          xn1 … xnf … xnp )
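In code, such matrix data is naturally an n × p array, one row per object and one column per attribute. A minimal sketch; the attribute meanings and values here are made up for illustration:

```python
import numpy as np

# n = 4 objects (rows), p = 3 attributes (columns); values are illustrative
X = np.array([
    [1.70, 25.0, 50000.0],
    [1.62, 31.0, 62000.0],
    [1.81, 40.0, 58000.0],
    [1.75, 22.0, 41000.0],
])

n, p = X.shape     # n data objects, p attributes
obj_i = X[1]       # one data object: a point in R^p
attr_f = X[:, 2]   # one attribute across all n objects
```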
SLIDE 8

Attribute Type

  • Numerical
  • E.g., height, income
  • Categorical / discrete
  • E.g., Sex, Race

SLIDE 9

Categorical Attribute Types

  • Nominal: categories, states, or “names of things”
  • Hair_color = {auburn, black, blond, brown, grey, red, white}
  • marital status, occupation, ID numbers, zip codes
  • Binary
  • Nominal attribute with only 2 states (0 and 1)
  • Symmetric binary: both outcomes equally important
  • e.g., gender
  • Asymmetric binary: outcomes not equally important.
  • e.g., medical test (positive vs. negative)
  • Convention: assign 1 to most important outcome (e.g., HIV positive)
  • Ordinal
  • Values have a meaningful order (ranking), but the magnitude between successive values is not known
  • Size = {small, medium, large}, grades, army rankings

SLIDE 10

Matrix Data: Prediction

  • Matrix Data
  • Linear Regression Model
  • Model Evaluation and Selection
  • Summary

SLIDE 11

Linear Regression

  • Ordinary Least Square Regression
  • Closed form solution
  • Gradient descent
  • Linear Regression with Probabilistic Interpretation

SLIDE 12

The Linear Regression Problem

  • Any Attributes to Continuous Value: x ⇒ y
  • {age; major ; gender; race} ⇒ GPA
  • {income; credit score; profession} ⇒ loan
  • {college; major ; GPA} ⇒ future income
  • ...

SLIDE 13

Illustration

SLIDE 14

Formalization

  • Data: n independent data objects
  • y_i, i = 1, …, n
  • x_i = (x_i0, x_i1, x_i2, …, x_ip)^T, i = 1, …, n
  • A constant factor is added to model the bias term, i.e., x_i0 = 1
  • Model:
  • y: dependent variable
  • x: explanatory variables
  • β = (β0, β1, …, βp)^T: weight vector
  • y = x^T β = β0 + x1 β1 + x2 β2 + ⋯ + xp βp
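The convention x_i0 = 1 simply prepends a column of ones to the data matrix so that β0 acts as the intercept. A minimal sketch; the data and weights are made up:

```python
import numpy as np

X_raw = np.array([[2.0, 3.0],
                  [1.0, 5.0],
                  [4.0, 0.5]])            # n = 3 objects, p = 2 attributes
n = X_raw.shape[0]

# Prepend the constant column x_i0 = 1 to model the bias term
X = np.hstack([np.ones((n, 1)), X_raw])   # shape (n, p + 1)

beta = np.array([0.5, 2.0, -1.0])         # (beta0, beta1, beta2)
y_hat = X @ beta                          # y = beta0 + x1*beta1 + x2*beta2
```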

SLIDE 15

A 2-step Process

  • Model Construction
  • Use training data to find the best parameter β, denoted as β̂
  • Model Usage
  • Model Evaluation
  • Use validation data to select the best model
  • Feature selection
  • Apply the model to unseen data (test data): ŷ = x^T β̂

SLIDE 16

Least Square Estimation

  • Cost function (total squared error):
  • J(β) = Σ_i (x_i^T β − y_i)²
  • Matrix form:
  • J(β) = (Xβ − y)^T (Xβ − y) = ‖Xβ − y‖²

X: n × (p + 1) matrix, where row i is (1, x_i1, …, x_ip)
y = (y_1, …, y_n)^T: n × 1 vector

SLIDE 17

Ordinary Least Squares (OLS)

  • Goal: find β̂ that minimizes J(β)
  • J(β) = (Xβ − y)^T (Xβ − y) = β^T X^T X β − y^T X β − β^T X^T y + y^T y
  • Ordinary least squares
  • Set the first derivative of J(β) to 0
  • ∂J/∂β = 2 β^T X^T X − 2 y^T X = 0
  • ⇒ β̂ = (X^T X)^{-1} X^T y
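The normal-equation solution β̂ = (X^T X)^{-1} X^T y in a few lines. This sketch mirrors the formula on simulated noise-free data, so OLS recovers the weights exactly; in practice `np.linalg.lstsq` is the numerically safer route:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2
X_raw = rng.normal(size=(n, p))
X = np.hstack([np.ones((n, 1)), X_raw])   # bias column x_i0 = 1
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true                         # noise-free targets

# beta_hat = (X^T X)^{-1} X^T y, via solve() rather than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```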

SLIDE 18

Gradient Descent

  • Minimize the cost function by moving down in the steepest descent direction
SLIDE 19

Batch Gradient Descent

  • Move in the direction of steepest descent

Repeat until convergence {
    β^(t+1) := β^(t) − η · ∂J/∂β |_{β = β^(t)}
}

where J(β) = Σ_i (x_i^T β − y_i)² = Σ_i J_i(β), and
∂J/∂β = Σ_i ∂J_i/∂β = Σ_i 2 x_i (x_i^T β − y_i); e.g., η = 0.1
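The batch update above can be sketched as follows; the data, learning rate η = 0.001, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
beta_true = np.array([0.5, 2.0])
y = X @ beta_true                     # noise-free, so GD converges to beta_true

beta = np.zeros(2)
eta = 0.001                           # step size; must be small enough to converge
for _ in range(5000):
    grad = 2 * X.T @ (X @ beta - y)   # sum_i 2 x_i (x_i^T beta - y_i)
    beta = beta - eta * grad          # move against the gradient
```

Each iteration uses the full dataset to compute the gradient, which is why this is called batch gradient descent.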

SLIDE 20

Stochastic Gradient Descent

  • When a new observation i comes in, update the weight immediately (extremely useful for large-scale datasets):

Repeat {
    for i = 1 : n {
        β^(t+1) := β^(t) + 2η (y_i − x_i^T β^(t)) x_i
    }
}

If the prediction for object i is smaller than the real value, β should move toward the direction of x_i.
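The stochastic update β := β + 2η (y_i − x_i^T β) x_i processes one object at a time. A sketch on simulated noise-free data; η and the epoch count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
beta_true = np.array([-1.0, 3.0])
y = X @ beta_true

beta = np.zeros(2)
eta = 0.01
for _ in range(50):                       # epochs: repeat { for i = 1..n { ... } }
    for i in range(n):
        err = y[i] - X[i] @ beta          # positive error => prediction too small
        beta = beta + 2 * eta * err * X[i]  # so move beta toward x_i
```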

SLIDE 21

Other Practical Issues

  • What if X^T X is not invertible?
  • Add a small multiple of the identity matrix, λI, to it (ridge regression*)
  • What if some attributes are categorical?
  • Set dummy variables
  • E.g., x = 1 if sex = F; x = 0 if sex = M
  • Nominal variable with multiple values?
  • Create more dummy variables for one variable
  • What if non-linear correlation exists?
  • Transform features, say, x to x²
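The λI fix can be seen on a deliberately singular design matrix (here, a duplicated column). A minimal sketch of the ridge solution β̂ = (X^T X + λI)^{-1} X^T y; the λ value is illustrative:

```python
import numpy as np

X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0]])   # last two columns identical => X^T X singular
y = np.array([3.0, 5.0, 8.0])

lam = 0.1
A = X.T @ X + lam * np.eye(X.shape[1])   # X^T X + lambda*I is invertible
beta_ridge = np.linalg.solve(A, X.T @ y)
```

Because the two duplicated columns are interchangeable, ridge splits their weight equally between them instead of failing.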

SLIDE 22

Probabilistic Interpretation

  • Review of normal distribution
  • X ~ N(μ, σ²) ⇒ f(X = x) = (1 / √(2πσ²)) exp{ −(x − μ)² / (2σ²) }

SLIDE 23

Probabilistic Interpretation

  • Model: y_i = x_i^T β + ε_i
  • ε_i ~ N(0, σ²)
  • y_i | x_i, β ~ N(x_i^T β, σ²)
  • E[y_i | x_i] = x_i^T β
  • Likelihood:
  • L(β) = Π_i p(y_i | x_i, β) = Π_i (1 / √(2πσ²)) exp{ −(y_i − x_i^T β)² / (2σ²) }
  • Maximum Likelihood Estimation
  • Find β̂ that maximizes L(β)
  • arg max L = arg min J: equivalent to OLS!
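Since the Gaussian log-likelihood is a constant minus J(β)/(2σ²), the β maximizing L(β) is exactly the β minimizing J(β). A one-dimensional grid search illustrating this; the data, σ², and grid are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(scale=0.5, size=30)   # slope-only model with Gaussian noise
sigma2 = 0.25

betas = np.linspace(0.0, 4.0, 401)
# J(beta): total squared error at each candidate slope
sse = np.array([np.sum((x * b - y) ** 2) for b in betas])
# Gaussian log-likelihood: constant - sse / (2 sigma^2)
loglik = np.array([-0.5 * len(y) * np.log(2 * np.pi * sigma2)
                   - np.sum((y - x * b) ** 2) / (2 * sigma2) for b in betas])

best_ols = betas[np.argmin(sse)]   # least-squares choice
best_mle = betas[np.argmax(loglik)]  # maximum-likelihood choice
```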

SLIDE 24

Matrix Data: Prediction

  • Matrix Data
  • Linear Regression Model
  • Model Evaluation and Selection
  • Summary

SLIDE 25

Model Selection Problem

  • Basic problem:
  • How to choose between competing linear regression models
  • Model too simple:
  • “Underfits” the data; poor predictions; high bias; low variance
  • Model too complex:
  • “Overfits” the data; poor predictions; low bias; high variance
  • Model just right:
  • Balances bias and variance to get good predictions

SLIDE 26

Bias and Variance

  • Bias: E[f̂(x)] − f(x)
  • How far is the expectation of the estimator from the true value? The smaller the better.
  • Variance: Var(f̂(x)) = E[(f̂(x) − E[f̂(x)])²]
  • How variable is the estimator? The smaller the better.
  • Reconsider the mean squared error:
  • J(β̂)/n = Σ_i (x_i^T β̂ − y_i)² / n
  • Can be considered as
  • E[(f̂(x) − f(x) − ε)²] = bias² + variance + noise

Note: E[ε] = 0, Var(ε) = σ²
True predictor f(x): x^T β; estimated predictor f̂(x): x^T β̂
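Bias and variance can be estimated empirically by refitting the estimator on many training sets drawn from the same process and examining its predictions at a fixed test point. A small simulation sketch; all settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
beta_true, sigma = 2.0, 1.0
x0 = 1.5                                  # fixed test point
f_x0 = beta_true * x0                     # true predictor f(x0)

preds = []
for _ in range(2000):                     # many training sets from the same process
    x = rng.normal(size=20)
    y = beta_true * x + rng.normal(scale=sigma, size=20)
    beta_hat = (x @ y) / (x @ x)          # one-parameter OLS fit (no intercept)
    preds.append(beta_hat * x0)           # estimated predictor f_hat(x0)

preds = np.array(preds)
bias = preds.mean() - f_x0                # E[f_hat(x0)] - f(x0): near 0, OLS is unbiased
variance = preds.var()                    # Var(f_hat(x0)): small but positive
```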

SLIDE 27

Bias-Variance Trade-off

SLIDE 28

Cross-Validation

  • Partition the data into K folds
  • Use K − 1 folds for training, and 1 fold for testing
  • Calculate the average accuracy based on the K training-testing pairs
  • Accuracy on the validation/test dataset!
  • Mean squared error can again be used: Σ_i (x_i^T β̂ − y_i)² / n
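K-fold cross-validation for the OLS model, written out directly; K and the simulated data are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 5
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 1))])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=n)

idx = rng.permutation(n)
folds = np.array_split(idx, K)            # K disjoint folds

mses = []
for k in range(K):
    test = folds[k]                       # 1 fold for testing
    train = np.concatenate([folds[j] for j in range(K) if j != k])  # K-1 for training
    beta_hat = np.linalg.solve(X[train].T @ X[train], X[train].T @ y[train])
    mses.append(np.mean((X[test] @ beta_hat - y[test]) ** 2))

cv_mse = float(np.mean(mses))             # average over the K training-testing pairs
```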

SLIDE 29

AIC & BIC*

  • AIC and BIC can be used to assess the quality of statistical models
  • AIC (Akaike information criterion)
  • AIC = 2k − 2 ln(L̂)
  • where k is the number of parameters in the model and L̂ is the likelihood under the estimated parameters
  • BIC (Bayesian information criterion)
  • BIC = k ln(n) − 2 ln(L̂)
  • where n is the number of objects
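For linear regression with Gaussian errors, the maximized log-likelihood has a closed form, so AIC and BIC reduce to simple expressions. A sketch under the assumptions that σ̂² = SSE/n is used as the noise-variance MLE and that k counts the p + 1 coefficients plus σ² (parameter-counting conventions vary):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 80
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sse = np.sum((X @ beta_hat - y) ** 2)
sigma2_hat = sse / n                      # MLE of the noise variance

# Maximized Gaussian log-likelihood: -n/2 * (ln(2 pi sigma2_hat) + 1)
loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)

k = X.shape[1] + 1                        # coefficients + noise variance
aic = 2 * k - 2 * loglik
bic = k * np.log(n) - 2 * loglik          # ln(n) > 2 here, so BIC penalizes more
```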

SLIDE 30

Stepwise Feature Selection

  • Avoid brute-force selection over all 2^p feature subsets
  • Forward selection
  • Start with the best single feature
  • Always add the feature that improves the performance most
  • Stop if no feature further improves the performance
  • Backward elimination
  • Start with the full model
  • Always remove the feature whose removal improves the performance most
  • Stop if removing any feature makes the performance worse
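Forward selection can be sketched directly; here candidates are scored by mean squared error on a held-out validation split (AIC/BIC or cross-validation would also work), and the data, split, and feature count are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)  # only 0 and 2 matter

train, val = np.arange(0, 150), np.arange(150, 200)

def val_mse(feats):
    """Fit OLS (with intercept) on training rows using `feats`; score on validation rows."""
    A = np.hstack([np.ones((len(train), 1)), X[np.ix_(train, feats)]])
    B = np.hstack([np.ones((len(val), 1)), X[np.ix_(val, feats)]])
    beta = np.linalg.solve(A.T @ A, A.T @ y[train])
    return np.mean((B @ beta - y[val]) ** 2)

selected, remaining = [], list(range(p))
best = val_mse([])                          # intercept-only baseline
while remaining:
    scores = {f: val_mse(selected + [f]) for f in remaining}
    f_best = min(scores, key=scores.get)    # feature that improves performance most
    if scores[f_best] >= best:              # stop: no feature further improves it
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best = scores[f_best]
```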

SLIDE 31

Matrix Data: Prediction

  • Matrix Data
  • Linear Regression Model
  • Model Evaluation and Selection
  • Summary

SLIDE 32

Summary

  • What is matrix data?
  • Attribute types
  • Linear regression
  • OLS
  • Probabilistic interpretation
  • Model Evaluation and Selection
  • Bias-Variance Trade-off
  • Mean square error
  • Cross-validation, AIC, BIC, step-wise feature selection
