CS6220: Data Mining Techniques
Matrix Data: Prediction
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
January 19, 2016
Announcements
- Team formation due next Wednesday
- Homework 1 out by tomorrow
Today’s Schedule
- Course Project Introduction
- Linear Regression Model
- Decision Tree
Methods to Learn (by task and data type)
- Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation (graph & network); Neural Network (images)
- Clustering: k-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data); PLSA (text data); SCAN; Spectral Clustering (graph & network)
- Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
- Prediction: Linear Regression; Collaborative Filtering (matrix data); Autoregression (time series)
- Similarity Search: DTW (time series); P-PageRank (graph & network)
- Ranking: PageRank (graph & network)
How to learn these algorithms?
- Three levels
  - When is it applicable?
    - Input, output, strengths, weaknesses, time complexity
  - How does it work?
    - Pseudo-code, workflows, major steps
    - Can work out a toy problem with pen and paper
  - Why does it work?
    - Intuition, philosophy, objective, derivation, proof
Matrix Data: Prediction
- Matrix Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary
Example
A data matrix of size n × p:

    x_11 ... x_1f ... x_1p
    ...
    x_i1 ... x_if ... x_ip
    ...
    x_n1 ... x_nf ... x_np

- n data objects / points (rows)
- p attributes / dimensions (columns)
Attribute Type
- Numerical
- E.g., height, income
- Categorical / discrete
- E.g., Sex, Race
Categorical Attribute Types
- Nominal: categories, states, or “names of things”
- Hair_color = {auburn, black, blond, brown, grey, red, white}
- marital status, occupation, ID numbers, zip codes
- Binary
- Nominal attribute with only 2 states (0 and 1)
- Symmetric binary: both outcomes equally important
- e.g., gender
- Asymmetric binary: outcomes not equally important.
- e.g., medical test (positive vs. negative)
- Convention: assign 1 to most important outcome (e.g., HIV positive)
- Ordinal
- Values have a meaningful order (ranking) but magnitude between
successive values is not known.
- Size = {small, medium, large}, grades, army rankings
Matrix Data: Prediction
- Matrix Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary
Linear Regression
- Ordinary Least Square Regression
- Closed form solution
- Gradient descent
- Linear Regression with Probabilistic
Interpretation
The Linear Regression Problem
- Any Attributes to Continuous Value: x ⇒ y
- {age; major ; gender; race} ⇒ GPA
- {income; credit score; profession} ⇒ loan
- {college; major ; GPA} ⇒ future income
- ...
Illustration
Formalization
- Data: n independent data objects
  - y_i, i = 1, …, n
  - x_i = (x_i0, x_i1, x_i2, …, x_ip)^T, i = 1, …, n
  - A constant factor is added to model the bias term, i.e., x_i0 = 1
- Model:
  - y: dependent variable
  - x: explanatory variables
  - β = (β_0, β_1, …, β_p)^T: weight vector
  - y = x^T β = β_0 + x_1 β_1 + x_2 β_2 + ⋯ + x_p β_p
A 2-step Process
- Model Construction
  - Use training data to find the best parameter β, denoted as β̂
- Model Usage
  - Model Evaluation
    - Use validation data to select the best model
    - Feature selection
  - Apply the model to unseen data (test data): ŷ = x^T β̂
Least Square Estimation
- Cost function (total squared error):
  - J(β) = Σ_i (x_i^T β − y_i)²
- Matrix form:
  - J(β) = (Xβ − y)^T (Xβ − y) = ‖Xβ − y‖²
X: the n × (p + 1) design matrix, whose first column is all 1s:

    1  x_11 ... x_1f ... x_1p
    1  ...
    1  x_i1 ... x_if ... x_ip
    1  ...
    1  x_n1 ... x_nf ... x_np

y: the n × 1 vector (y_1, …, y_i, …, y_n)^T
Ordinary Least Squares (OLS)
- Goal: find β̂ that minimizes J(β)
- J(β) = (Xβ − y)^T (Xβ − y) = β^T X^T X β − y^T X β − β^T X^T y + y^T y
- Ordinary least squares
  - Set the first derivative of J(β) to 0:
  - ∂J/∂β = 2β^T X^T X − 2y^T X = 0
  - ⇒ β̂ = (X^T X)^{-1} X^T y
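The closed-form solution can be checked on a toy dataset with NumPy (a minimal sketch; the data and variable names are ours, not from the slides):

```python
import numpy as np

# Toy data: n = 5 objects, p = 1 attribute; y is roughly 2x
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Add the constant column x_i0 = 1 to model the bias term
X = np.hstack([np.ones((x.shape[0], 1)), x])

# Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y   # [intercept, slope]
```

In practice `np.linalg.lstsq` (or `np.linalg.solve` on the normal equations) is preferred over forming the explicit inverse, for numerical stability.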
Gradient Descent
- Minimize the cost function by moving in the steepest descent direction
Batch Gradient Descent
- Move in the direction of steepest descent:

Repeat until convergence {
    β^(t+1) := β^(t) − η ∂J/∂β |_{β = β^(t)}
}

where J(β) = Σ_i (x_i^T β − y_i)² = Σ_i J_i(β),
∂J/∂β = Σ_i ∂J_i/∂β = Σ_i 2 x_i (x_i^T β − y_i), and e.g. η = 0.1.
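The batch update rule can be sketched in NumPy (illustrative only; the learning rate, iteration count, and toy data are our choices):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Minimize J(beta) = sum_i (x_i^T beta - y_i)^2 with full-batch updates."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ beta - y)   # dJ/dbeta, summed over all objects i
        beta = beta - eta * grad
    return beta

# Toy data with an intercept column of 1s
X = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
beta = batch_gradient_descent(X, y)   # approaches the closed-form OLS solution
```

If η is too large the iterates diverge; too small and convergence is slow, which is why η is usually tuned per dataset.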
Stochastic Gradient Descent
- When a new observation i comes in, update the weights immediately (extremely useful for large-scale datasets):

Repeat {
    for i = 1 : n {
        β^(t+1) := β^(t) + 2η (y_i − x_i^T β^(t)) x_i
    }
}

If the prediction for object i is smaller than the real value, β should move in the direction of x_i.
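The per-observation update can be sketched as follows (a hedged illustration with our own toy data; a fixed step size only hovers near the optimum rather than converging exactly):

```python
import numpy as np

def sgd(X, y, eta=0.01, n_epochs=200):
    """Per-observation update: beta := beta + 2*eta*(y_i - x_i^T beta)*x_i."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in range(X.shape[0]):
            beta = beta + 2 * eta * (y[i] - X[i] @ beta) * X[i]
    return beta

X = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
beta = sgd(X, y)   # close to the least-squares solution
```

In practice the step size η is often decayed over epochs so that the iterates settle down instead of cycling around the minimum.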
Other Practical Issues
- What if X^T X is not invertible?
  - Add a small multiple of the identity matrix, λI, to it (ridge regression*)
- What if some attributes are categorical?
  - Set dummy variables
  - E.g., x = 1 if sex = F; x = 0 if sex = M
  - Nominal variable with multiple values?
    - Create more dummy variables for one variable
- What if a non-linear correlation exists?
  - Transform features, say, x to x²
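The ridge fix and dummy-variable encoding can be sketched as follows (λ = 0.1 is an arbitrary choice of ours; the example deliberately uses a duplicated column so that X^T X is singular and plain OLS would fail):

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Solve (X^T X + lam*I) beta = X^T y; invertible for any lam > 0."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Duplicated column -> X^T X is singular, so the plain OLS inverse breaks down
X = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
beta = ridge_fit(X, y)

# Dummy variable example: categorical sex 'F'/'M' encoded as 1/0
sex = np.array(['F', 'M', 'F'])
x_sex = (sex == 'F').astype(float)
```

Ridge splits the weight evenly across the two identical columns, which is exactly the regularization effect of the λI term.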
Probabilistic Interpretation
- Review of the normal distribution
  - X ~ N(μ, σ²) ⇒ f(X = x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²))
Probabilistic Interpretation
- Model: y_i = x_i^T β + ε_i
  - ε_i ~ N(0, σ²)
  - y_i | x_i, β ~ N(x_i^T β, σ²)
  - E[y_i | x_i] = x_i^T β
- Likelihood:
  - L(β) = Π_i p(y_i | x_i, β) = Π_i (1 / √(2πσ²)) exp{−(y_i − x_i^T β)² / (2σ²)}
- Maximum Likelihood Estimation
  - Find β̂ that maximizes L(β)
  - arg max L = arg min J: equivalent to OLS!
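The equivalence can be checked numerically (a sketch with our own toy data, σ² fixed at 1): the OLS solution also maximizes the Gaussian log-likelihood, so perturbing it can only lower the likelihood.

```python
import numpy as np

def log_likelihood(beta, X, y, sigma2=1.0):
    """log L(beta) = sum_i log N(y_i | x_i^T beta, sigma2)."""
    r = y - X @ beta
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - (r @ r) / (2 * sigma2)

X = np.column_stack([np.ones(5), [1.0, 2.0, 3.0, 4.0, 5.0]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
ll_ols = log_likelihood(beta_ols, X, y)
```

Since the log-likelihood is a constant minus J(β)/(2σ²), minimizing J and maximizing L pick out the same β̂.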
Matrix Data: Prediction
- Matrix Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary
Model Selection Problem
- Basic problem:
  - How to choose between competing linear regression models
- Model too simple:
  - "Underfits" the data; poor predictions; high bias; low variance
- Model too complex:
  - "Overfits" the data; poor predictions; low bias; high variance
- Model just right:
  - Balances bias and variance to get good predictions
Bias and Variance
- Bias: E[f̂(x)] − f(x)
  - How far is the expectation of the estimator from the true value? The smaller the better.
- Variance: Var(f̂(x)) = E[(f̂(x) − E[f̂(x)])²]
  - How variable is the estimator? The smaller the better.
- Reconsider the mean squared error: J(β̂)/n = Σ_i (x_i^T β̂ − y_i)² / n
  - Can be considered as E[(f̂(x) − f(x) − ε)²] = bias² + variance + noise
  - Note: E[ε] = 0, Var(ε) = σ²
- True predictor f(x): x^T β; estimated predictor f̂(x): x^T β̂
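The decomposition can be explored with a small Monte Carlo simulation (our own setup, not from the slides): a correctly specified linear model refit on many noisy training samples shows near-zero bias and a small variance at a fixed query point.

```python
import numpy as np

rng = np.random.default_rng(0)

# True predictor f(x) = 2x; observations carry N(0, 0.5^2) noise
def sample_data(n=20):
    x = rng.uniform(0.0, 1.0, n)
    return x, 2.0 * x + rng.normal(0.0, 0.5, n)

# Refit OLS on 500 training sets; record f_hat(x0) at x0 = 0.5
x0, preds = 0.5, []
for _ in range(500):
    x, y = sample_data()
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    preds.append(b[0] + b[1] * x0)
preds = np.array(preds)

bias = preds.mean() - 2.0 * x0   # E[f_hat(x0)] - f(x0), near 0
variance = preds.var()           # small for this simple, well-specified model
```

Repeating the experiment with a deliberately misspecified model (e.g., intercept-only) would show the opposite pattern: larger bias, smaller variance.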
Bias-Variance Trade-off
Cross-Validation
- Partition the data into K folds
- Use K − 1 folds for training and 1 fold for testing
- Calculate the average accuracy based on the K training-testing pairs
- Accuracy on the validation/test dataset!
  - Mean squared error can again be used: Σ_i (x_i^T β̂ − y_i)² / n
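The K-fold procedure can be sketched as follows (illustrative; the function name, seed, and toy data are ours):

```python
import numpy as np

def kfold_mse(X, y, K=5, seed=0):
    """Average test MSE over K folds, refitting OLS on the other K-1 folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]
        r = y[test] - X[test] @ beta
        errors.append(float(r @ r) / len(test))
    return float(np.mean(errors))

# Toy check on nearly linear data with noise sd 0.1
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, 50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 * x + rng.normal(0.0, 0.1, 50)
cv_mse = kfold_mse(X, y)
```

Because every object is held out exactly once, the average is an estimate of generalization error rather than training error.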
AIC & BIC*
- AIC and BIC can be used to assess the quality of statistical models
- AIC (Akaike information criterion)
  - AIC = 2k − 2 ln(L̂)
  - where k is the number of parameters in the model and L̂ is the likelihood under the estimated parameters
- BIC (Bayesian information criterion)
  - BIC = k ln(n) − 2 ln(L̂)
  - where n is the number of data objects
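For a Gaussian linear model both criteria can be computed from the residual sum of squares (a sketch; we plug the MLE σ̂² = RSS/n into the likelihood and count σ² as a parameter, one common convention among several):

```python
import numpy as np

def aic_bic(X, y):
    """AIC = 2k - 2 ln L_hat, BIC = k ln(n) - 2 ln L_hat for Gaussian OLS."""
    n, p = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = float(np.sum((y - X @ beta) ** 2))
    sigma2 = rss / n                       # MLE of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    k = p + 1                              # p coefficients plus sigma^2
    return 2 * k - 2 * loglik, k * np.log(n) - 2 * loglik

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 3.0 * x + rng.normal(0.0, 0.2, 100)
aic, bic = aic_bic(X, y)
```

Since ln(n) > 2 once n > 7, BIC penalizes each extra parameter more heavily than AIC and thus favors smaller models.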
Stepwise Feature Selection
- Avoid brute-force selection over all 2^p feature subsets
- Forward selection
  - Start with the best single feature
  - Always add the feature that improves the performance most
  - Stop if no feature further improves the performance
- Backward elimination
  - Start with the full model
  - Always remove the feature whose removal improves the performance most
  - Stop if removing any feature worsens the performance
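Forward selection can be sketched as greedy RSS reduction (our own code; in practice the "improvement" test should use validation error or AIC/BIC, since training RSS almost always decreases when a feature is added):

```python
import numpy as np

def forward_selection(X, y, max_features=2):
    """Greedily add the feature that most reduces training RSS."""
    n, p = X.shape
    selected = []
    best_rss = float(np.sum((y - y.mean()) ** 2))  # intercept-only model
    while len(selected) < max_features:
        scores = []
        for j in (c for c in range(p) if c not in selected):
            A = np.column_stack([np.ones(n)] + [X[:, c] for c in selected + [j]])
            beta = np.linalg.lstsq(A, y, rcond=None)[0]
            scores.append((float(np.sum((y - A @ beta) ** 2)), j))
        rss, j = min(scores)
        if rss >= best_rss:                        # no improvement: stop
            break
        selected.append(j)
        best_rss = rss
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
y = 3.0 * X[:, 2] + rng.normal(0.0, 0.1, 60)   # only feature 2 matters
chosen = forward_selection(X, y)
```

Backward elimination is the mirror image: start from the full model and greedily drop the feature whose removal hurts the fit least.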
Matrix Data: Prediction
- Matrix Data
- Linear Regression Model
- Model Evaluation and Selection
- Summary
Summary
- What is matrix data?
- Attribute types
- Linear regression
- OLS
- Probabilistic interpretation
- Model Evaluation and Selection
- Bias-Variance Trade-off
- Mean square error
- Cross-validation, AIC, BIC, stepwise feature selection