
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 - Lecture 3



  1. Data Mining Techniques CS 6220 - Section 2 - Spring 2017 - Lecture 3. Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams)

  2. Linear Regression

  3. Linear Regression. Assume f is a linear combination of D features: y = f(x) + ε = w^T x + ε, where ε ∼ Norm(0, σ²). For N points we write y_n = w^T x_n + ε_n. Learning task: estimate w.

  4. Linear Regression

  5. Error Measure: Sum of Squares. Mean Squared Error (MSE): E(w) = (1/N) Σ_{n=1}^N (w^T x_n − y_n)² = (1/N) ‖Xw − y‖², where X is the matrix with rows x_1^T, x_2^T, …, x_N^T and y = (y_1, y_2, …, y_N)^T.

  6. Minimizing the Error. E(w) = (1/N) ‖Xw − y‖². Setting the gradient ∇E(w) = (2/N) X^T (Xw − y) = 0 gives X^T X w = X^T y, so w = X† y, where X† = (X^T X)^{−1} X^T is the 'pseudo-inverse' of X.

  7. Minimizing the Error. E(w) = (1/N) ‖Xw − y‖². Setting the gradient ∇E(w) = (2/N) X^T (Xw − y) = 0 gives X^T X w = X^T y, so w = X† y, where X† = (X^T X)^{−1} X^T is the 'pseudo-inverse' of X. Matrix Cookbook (on course website)

  8. Ordinary Least Squares. Construct the matrix X and the vector y from the dataset {(x_1, y_1), (x_2, y_2), …, (x_N, y_N)} (each x includes x_0 = 1) by stacking the rows x_1^T, …, x_N^T into X and the targets y_1, …, y_N into y. Compute X† = (X^T X)^{−1} X^T. Return w = X† y.
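
As an illustration (not part of the original slides), a minimal NumPy sketch of this procedure on a small synthetic dataset; the data-generating weights are made up:

    import numpy as np

    # Synthetic data: y = 1 + 2*x1 - 3*x2 + noise  (hypothetical example)
    rng = np.random.default_rng(0)
    N, D = 100, 2
    X_raw = rng.normal(size=(N, D))
    y = 1 + 2 * X_raw[:, 0] - 3 * X_raw[:, 1] + 0.1 * rng.normal(size=N)

    # Prepend the constant feature x_0 = 1
    X = np.column_stack([np.ones(N), X_raw])

    # Pseudo-inverse solution w = (X^T X)^{-1} X^T y
    # (solving the normal equations is numerically safer than forming the inverse)
    w = np.linalg.solve(X.T @ X, X.T @ y)
    print(w)  # approximately [1, 2, -3]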

  9. Basis Function Regression. Linear regression: y_n = w^T x_n + ε_n. Basis function regression: y_n = w^T φ(x_n) + ε_n for N samples. Polynomial regression uses the basis φ(x) = (1, x, x², …, x^M).
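
A short NumPy sketch of basis function regression with a polynomial basis (illustrative, not from the slides; the degree M and the sine-plus-noise data are assumptions):

    import numpy as np

    def poly_features(x, M):
        """Map scalar inputs x to the polynomial basis (1, x, x^2, ..., x^M)."""
        return np.vander(x, M + 1, increasing=True)

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, size=30)
    t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)

    M = 3
    Phi = poly_features(x, M)                    # N x (M+1) design matrix
    w = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)  # least squares in feature space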

  10. Polynomial Regression [Figure: polynomial fits of degree M = 0, 1, 3, and 9 to the same data, t plotted against x]

  11. Polynomial Regression: Underfit [Figure: same four fits; the low-degree models M = 0 and M = 1 underfit the data]

  12. Polynomial Regression: Overfit [Figure: same four fits; the degree M = 9 model overfits the data]

  13. Regularization. L2 regularization (ridge regression) minimizes E(w) = (1/N) ‖Xw − y‖² + λ ‖w‖², where λ ≥ 0 and ‖w‖² = w^T w. L1 regularization (LASSO) minimizes E(w) = (1/N) ‖Xw − y‖² + λ |w|_1, where λ ≥ 0 and |w|_1 = Σ_{i=1}^D |w_i|.

  14. Regularization

  15. Regularization. L2: closed-form solution w = (X^T X + λ I)^{−1} X^T y. L1: no closed-form solution; use quadratic programming: minimize ‖Xw − y‖² subject to ‖w‖_1 ≤ s.
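
A minimal NumPy sketch of the ridge closed-form solution (illustrative; the data, true weights, and the value of λ below are assumptions):

    import numpy as np

    def ridge_fit(X, y, lam):
        """Closed-form ridge solution w = (X^T X + lam*I)^{-1} X^T y."""
        D = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

    # Example usage with synthetic data (hypothetical)
    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)
    w_ridge = ridge_fit(X, y, lam=0.1)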

  16. Review: Probability

  17. Examples: Independent Events. 1. What is the probability of getting the sequence 1, 2, 3, 4, 5, 6 if we roll a die six times? 2. A school survey found that 9 out of 10 students like pizza. If three students are chosen at random with replacement, what is the probability that all three students like pizza?
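
Since the rolls and the three draws are independent, the answers follow by multiplying the individual probabilities; a quick check in Python:

    # 1. P(sequence 1,2,3,4,5,6 in six rolls of a fair die) = (1/6)^6
    p_sequence = (1 / 6) ** 6
    print(p_sequence)   # ~2.14e-05, i.e. 1/46656

    # 2. P(all three like pizza) = (9/10)^3, since sampling is with replacement
    p_pizza = 0.9 ** 3
    print(p_pizza)      # 0.729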

  18. Dependent Events [Figure: apples and oranges in a red bin and a blue bin] If I take a fruit from the red bin, what is the probability that I get an apple?

  19. Dependent Events [Figure: apples and oranges in a red bin and a blue bin] Conditional Probability: P(fruit = apple | bin = red) = 2/8

  20. Dependent Events [Figure: apples and oranges in a red bin and a blue bin] Joint Probability: P(fruit = apple, bin = red) = 2/12

  21. Dependent Events [Figure: apples and oranges in a red bin and a blue bin] Joint Probability: P(fruit = apple, bin = blue) = ?

  22. Dependent Events [Figure: apples and oranges in a red bin and a blue bin] Joint Probability: P(fruit = apple, bin = blue) = 3/12

  23. Dependent Events [Figure: apples and oranges in a red bin and a blue bin] Joint Probability: P(fruit = orange, bin = blue) = ?

  24. Dependent Events [Figure: apples and oranges in a red bin and a blue bin] Joint Probability: P(fruit = orange, bin = blue) = 1/12

  25. Two Rules of Probability. 1. Sum Rule (Marginal Probabilities): P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = ?

  26. Two Rules of Probability. 1. Sum Rule (Marginal Probabilities): P(fruit = apple) = P(fruit = apple, bin = blue) + P(fruit = apple, bin = red) = 3/12 + 2/12 = 5/12

  27. Two Rules of Probability. 2. Product Rule: P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = ?

  28. Two Rules of Probability. 2. Product Rule: P(fruit = apple, bin = red) = P(fruit = apple | bin = red) P(bin = red) = 2/8 × 8/12 = 2/12

  29. Two Rules of Probability. 2. Product Rule (reversed): P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = ?

  30. Two Rules of Probability. 2. Product Rule (reversed): P(fruit = apple, bin = red) = P(bin = red | fruit = apple) P(fruit = apple) = 2/5 × 5/12 = 2/12
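
The probabilities above are consistent with the red bin holding 2 apples and 6 oranges and the blue bin holding 3 apples and 1 orange (12 fruits in total; these counts are inferred from the stated fractions, not given explicitly in the text). A small check of the sum and product rules with these assumed counts:

    from fractions import Fraction

    # Counts inferred from the slides' probabilities (an assumption)
    counts = {("apple", "red"): 2, ("orange", "red"): 6,
              ("apple", "blue"): 3, ("orange", "blue"): 1}
    total = sum(counts.values())   # 12 fruits in total

    def p_joint(fruit, bin_):
        return Fraction(counts[(fruit, bin_)], total)

    def p_bin(bin_):
        return sum(p_joint(f, bin_) for f in ("apple", "orange"))

    # Sum rule: P(apple) = sum over bins of P(apple, bin)
    p_apple = p_joint("apple", "red") + p_joint("apple", "blue")   # 5/12

    # Product rule: P(apple, red) = P(apple | red) * P(red)
    p_apple_given_red = p_joint("apple", "red") / p_bin("red")     # 2/8
    assert p_apple_given_red * p_bin("red") == p_joint("apple", "red")
    print(p_apple, p_apple_given_red)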

  31. Bayes' Rule: P(A | B) = P(B | A) P(A) / P(B), where P(A | B) is the posterior, P(B | A) the likelihood, and P(A) the prior. Sum Rule: P(B) = Σ_A P(A, B). Product Rule: P(A, B) = P(B | A) P(A).

  32. Bayes' Rule: posterior ∝ likelihood × prior. Probability of a rare disease: 0.005. Probability of detection (test positive given disease): 0.98. Probability of a false positive: 0.05. What is the probability of the disease when the test is positive?

  33. Bayes' Rule: posterior ∝ likelihood × prior. P(positive | disease) P(disease) = 0.98 × 0.005 = 0.0049. P(positive) = 0.98 × 0.005 + 0.05 × 0.995 ≈ 0.0547. P(disease | positive) = 0.0049 / 0.0547 ≈ 0.09.
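
A quick numerical check of this calculation, using the numbers stated on the previous slide:

    p_disease = 0.005
    p_pos_given_disease = 0.98       # detection rate (sensitivity)
    p_pos_given_healthy = 0.05       # false-positive rate

    # Sum rule for the evidence, then Bayes' rule for the posterior
    p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
    p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
    print(round(p_disease_given_pos, 3))   # ~0.09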

  34. Normal Distribution. x ∼ Norm(μ, σ²) with density p(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)).

  35. Multivariate Normal. Density: p(x) = (2π)^{−D/2} |Σ|^{−1/2} exp(−½ (x − μ)^T Σ^{−1} (x − μ)). Parameters: mean μ and covariance Σ, with Σ_ij = E[(x_i − μ_i)(x_j − μ_j)].
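
A small NumPy illustration (not from the slides; μ and Σ below are made up): drawing samples from a two-dimensional normal and checking that the sample covariance approximates Σ:

    import numpy as np

    mu = np.array([0.0, 1.0])
    Sigma = np.array([[1.0, 0.5],
                      [0.5, 2.0]])

    rng = np.random.default_rng(3)
    samples = rng.multivariate_normal(mu, Sigma, size=100_000)

    # Empirical estimate of Sigma_ij = E[(x_i - mu_i)(x_j - mu_j)]
    print(np.cov(samples, rowvar=False))   # close to Sigma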

  36. Covariance Matrices [Figure: three candidate 2×2 covariance matrices shown alongside three Gaussian density contour plots] Question: Which covariance matrix Σ corresponds to which plot?

  37. Marginals and Conditionals. Suppose that x and y are jointly Gaussian: z = (x, y) ∼ Norm((a, b), [[A, C], [C^T, B]]). Question: What are the marginal distributions p(x) and p(y)? Answer: x ∼ Norm(a, A) and y ∼ Norm(b, B). [Figure: joint density with its marginals]

  38. Marginals and Conditionals. Suppose that x and y are jointly Gaussian: z = (x, y) ∼ Norm((a, b), [[A, C], [C^T, B]]). Question: What are the conditional distributions p(x | y) and p(y | x)? Answer: x | y ∼ Norm(a + C B^{−1}(y − b), A − C B^{−1} C^T) and y | x ∼ Norm(b + C^T A^{−1}(x − a), B − C^T A^{−1} C). [Figure: joint density with a conditional slice]
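
A NumPy sketch of these conditioning formulas for a concrete one-dimensional-block example (the values of a, b, A, B, C below are made up for illustration):

    import numpy as np

    # Joint Gaussian over (x, y) with mean (a, b) and covariance [[A, C], [C^T, B]]
    a, b = np.array([0.0]), np.array([1.0])
    A, B, C = np.array([[2.0]]), np.array([[1.0]]), np.array([[0.8]])

    def conditional_x_given_y(y):
        """Mean and covariance of x | y: a + C B^{-1} (y - b), A - C B^{-1} C^T."""
        B_inv = np.linalg.inv(B)
        mean = a + C @ B_inv @ (y - b)
        cov = A - C @ B_inv @ C.T
        return mean, cov

    print(conditional_x_given_y(np.array([2.0])))   # mean 0.8, covariance 1.36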

  39. Maximum Likelihood

  40. Regression: Probabilistic Interpretation. What is the probability p(y | x, w) of observing y given the input x and weights w?

  41. Regression: Probabilistic Interpretation. Least Squares Objective: E(w) = (1/N) Σ_n (w^T x_n − y_n)². Likelihood: p(y | X, w, σ) = Π_{n=1}^N Norm(y_n | w^T x_n, σ²).

  42. Maximum Likelihood. Least Squares Objective: E(w) = (1/N) Σ_n (w^T x_n − y_n)². Log-Likelihood: log p(y | X, w, σ) = −(1/(2σ²)) Σ_n (y_n − w^T x_n)² − (N/2) log(2πσ²). Maximizing the likelihood minimizes the sum of squares.
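
To see this numerically, one can minimize the negative log-likelihood directly and compare the result to the closed-form least-squares solution (a sketch using scipy.optimize on synthetic data; the true weights and noise level are assumptions):

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(4)
    N, D = 200, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
    w_true = np.array([0.5, 2.0, -1.0])
    sigma = 0.3
    y = X @ w_true + sigma * rng.normal(size=N)

    def neg_log_likelihood(w):
        # Gaussian noise model: -log p(y | X, w, sigma), dropping w-independent constants
        return np.sum((y - X @ w) ** 2) / (2 * sigma ** 2)

    w_mle = minimize(neg_log_likelihood, x0=np.zeros(D)).x
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(np.allclose(w_mle, w_ols, atol=1e-4))   # True: same minimizer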

  43. Maximum a Posteriori

  44. Regression with Priors. Can we maximize p(w | X, y)? (This is known as maximum a posteriori estimation.)

  45. Regression with Priors. From Bayes' rule: p(w | X, y) = p(y | X, w) p(w) / p(y | X) ∝ p(y | X, w) p(w).

  46. Maximum a Posteriori Maximum a Posteriori is Equivalent to Ridge Regression

  47. Maximum a Posteriori Maximum a Posteriori is Equivalent to Ridge Regression
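
To make the equivalence concrete: with a Gaussian prior w ∼ Norm(0, τ² I) and Gaussian noise of variance σ², the negative log-posterior is proportional to ‖Xw − y‖² + (σ²/τ²) ‖w‖², so the MAP estimate is the ridge solution with λ = σ²/τ². A quick numerical sketch (synthetic data; the values of σ and τ are assumptions):

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.normal(size=(100, 4))
    y = X @ np.array([1.0, -0.5, 0.0, 2.0]) + 0.2 * rng.normal(size=100)

    sigma2, tau2 = 0.2 ** 2, 1.0   # noise variance and prior variance (assumed)
    lam = sigma2 / tau2            # equivalent ridge penalty

    # MAP / ridge estimate: w = (X^T X + lam I)^{-1} X^T y
    w_map = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
    print(w_map)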

  48. Basis Function Regression [Figure: degree M = 3 polynomial fit, t plotted against x]

  49. Basis Function Regression [Figure: degree M = 3 polynomial fit, t plotted against x]

  50. Predictive Posterior

  51. Priors on Functions [Figure: functions w^T φ(x) sampled from the prior, for polynomial degrees M = 0, 1, 2, 3, 5, and 17] Idea: sampling w ~ p(w) defines a function w^T φ(x), so p(w) is equivalent to a prior on functions. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
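
A minimal sketch of this idea (illustrative; the polynomial degree, grid, and unit-variance prior are assumptions):

    import numpy as np

    rng = np.random.default_rng(6)
    M = 5                                        # polynomial degree (assumed)
    xs = np.linspace(-1, 2, 200)
    Phi = np.vander(xs, M + 1, increasing=True)  # rows are phi(x) for each grid point

    # Each draw w ~ Norm(0, I) defines a random function f(x) = w^T phi(x)
    sampled_functions = [Phi @ rng.normal(size=M + 1) for _ in range(3)]
    # e.g. plot each sampled function against xs with matplotlib to reproduce the figure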

  52. Posterior Uncertainty. Can we reason about the posterior on functions? [Figure: three panels of functions sampled from the posterior, for increasing λ] Idea: sample w ~ p(w | X, y) and plot the resulting functions. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

  53. The Predictive Distribution. Prior: w ∼ Norm(0, Σ_p). Predictive distribution on the function value: f* | x*, X, y ∼ Norm(σ_n^{−2} φ(x*)^T A^{−1} Φ y, φ(x*)^T A^{−1} φ(x*)), where Φ = Φ(X) and A = σ_n^{−2} Φ Φ^T + Σ_p^{−1}. The predictive distribution on a new observation has the same mean with the noise variance σ_n² added to the variance. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
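
A NumPy sketch of this predictive distribution for Bayesian linear regression in a polynomial feature space (illustrative; the data, noise level σ_n, prior covariance Σ_p, and test point x* are assumptions):

    import numpy as np

    rng = np.random.default_rng(7)
    x = rng.uniform(-4, 4, size=20)
    y = np.sin(x) + 0.3 * rng.normal(size=20)

    def phi(x, M=4):
        # Polynomial features as columns: each column is phi(x_n)
        return np.vander(np.atleast_1d(x), M + 1, increasing=True).T

    sigma_n = 0.3                    # noise std (assumed)
    Phi = phi(x)                     # (M+1) x N feature matrix
    Sigma_p = np.eye(Phi.shape[0])   # prior covariance of w (assumed identity)

    A = Phi @ Phi.T / sigma_n**2 + np.linalg.inv(Sigma_p)
    A_inv = np.linalg.inv(A)

    x_star = 0.5
    phi_star = phi(x_star)                                 # (M+1) x 1
    mean = (phi_star.T @ A_inv @ Phi @ y) / sigma_n**2     # predictive mean of f*
    var = phi_star.T @ A_inv @ phi_star                    # predictive variance of f*
    print(mean.item(), var.item())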

  54. The Predictive Distribution. Idea: average over all possible values of w. [Figure: three panels of the predictive distribution, for increasing λ] Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

  55. The Kernel Trick

  56. Cost of Feature Computation Example: Mapping with linear and quadratic terms

  57. Cost of Feature Computation Example: Mapping with linear and quadratic terms gives roughly 1 + d + d²/2 terms.

  58. Cost of Feature Computation Example: Mapping with polynomial terms, for d = 100 features:

      Polynomial                   ϕ(x)             Cost          Cost at d = 100
      Quadratic (up to degree 2)   > d²/2 terms     d² N² / 4     2,500 N²
      Cubic (up to degree 3)       > d³/6 terms     d³ N² / 12    83,000 N²
      Quartic (up to degree 4)     > d⁴/24 terms    d⁴ N² / 48    1,960,000 N²
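
For intuition, the exact number of monomials in d variables up to degree k is C(d + k, k); the slide's d^k/k! figures are the leading-order term of this count. A quick computation for d = 100 (illustrative, not from the slides):

    from math import comb, factorial

    d = 100
    for k in (2, 3, 4):
        exact = comb(d + k, k)        # exact number of monomials of degree <= k
        approx = d**k / factorial(k)  # leading-order estimate ~ d^k / k!
        print(k, exact, round(approx))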

  59. The Kernel Trick. Define a kernel function k(x, x') = φ(x)^T φ(x') such that k can be cheaper to evaluate than φ!
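
For example (an illustration, not from the slides), the polynomial kernel k(x, x') = (1 + x^T x')² equals the inner product of explicit degree-2 feature maps but costs only O(d) to evaluate:

    import numpy as np

    def phi_quadratic(x):
        """Explicit feature map whose inner product equals (1 + x.x')^2."""
        d = len(x)
        feats = [1.0]
        feats += list(np.sqrt(2.0) * x)                        # linear terms
        feats += [x[i] * x[i] for i in range(d)]               # squared terms
        feats += [np.sqrt(2.0) * x[i] * x[j]                   # cross terms
                  for i in range(d) for j in range(i + 1, d)]
        return np.array(feats)

    def k_poly2(x, z):
        """Quadratic kernel: O(d) evaluation instead of O(d^2) explicit features."""
        return (1.0 + x @ z) ** 2

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([-1.0, 0.5, 2.0])
    print(phi_quadratic(x) @ phi_quadratic(z), k_poly2(x, z))   # identical values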
