Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 4
Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams, Percy Liang)
Review: linear regression, basis function regression, polynomial regression.
[Figure: polynomial regression example for N samples, fitting a degree M = 3 polynomial to data (x, t).]
Define a kernel function k(x, x′) := φ(x)^⊤φ(x′); k can be cheaper to evaluate than φ!
MAP / expected value for the weights (requires inversion of a D×D matrix):
Φ := Φ(X),   A := Φ^⊤Φ + λI,   E[w | y] = A^{-1} Φ^⊤ y
Alternate representation (requires inversion of an N×N matrix):
K := ΦΦ^⊤,   A^{-1} Φ^⊤ = Φ^⊤ (K + λI)^{-1}
Predictive posterior (using the kernel function):
E[f(x*) | y] = φ(x*)^⊤ E[w | y] = φ(x*)^⊤ Φ^⊤ (K + λI)^{-1} y = Σ_{n,m} k(x*, x_n) [(K + λI)^{-1}]_{nm} y_m
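As a concrete illustration of the N×N (dual) form above, here is a minimal NumPy sketch of kernelized ridge regression; the rbf_kernel helper, the regularization value, and the toy data are illustrative choices, not part of the slides.

```python
import numpy as np

def rbf_kernel(A, B, sigma=0.6):
    # illustrative kernel choice; any valid kernel k(x, x') works here
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def fit_dual(X, y, lam=0.1):
    # solve the N x N system (K + lam I) alpha = y instead of the D x D one
    K = rbf_kernel(X, X)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict(X_star, X, alpha):
    # E[f(x*) | y] = sum_n k(x*, x_n) [(K + lam I)^{-1} y]_n
    return rbf_kernel(X_star, X) @ alpha

# toy usage
X = np.random.randn(50, 1)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(50)
alpha = fit_dual(X, y)
y_star = predict(np.linspace(-2, 2, 5)[:, None], X, alpha)
```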
Equivalent view: regularized least squares in an RKHS H,
f* = argmin_{f ∈ H} ( Σ_{i=1}^n (y_i − ⟨f, φ(x_i)⟩_H)² + λ ‖f‖²_H ).
[Figure: kernel ridge regression fits with σ = 0.6 for λ = 0.1, λ = 10, and λ = 1e−07.]
Closed-Form Solution
adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
[Figures: Gaussian process regression examples; axes: input x vs. function value.]
p(y* | x*, x, y) = N( k(x*, x)^⊤ [K + σ²_noise I]^{-1} y ,  k(x*, x*) + σ²_noise − k(x*, x)^⊤ [K + σ²_noise I]^{-1} k(x*, x) )
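A small NumPy sketch of this predictive distribution, assuming a squared-exponential kernel; the se_kernel helper and its hyperparameters are illustrative assumptions rather than anything fixed by the slides.

```python
import numpy as np

def se_kernel(A, B, lengthscale=1.0, sigma_f=1.0):
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq / lengthscale**2)

def gp_predict(x_star, x, y, sigma_noise=0.1):
    K = se_kernel(x, x)                                  # N x N
    k_star = se_kernel(x, x_star)                        # N x M
    A = K + sigma_noise**2 * np.eye(len(y))
    mean = k_star.T @ np.linalg.solve(A, y)              # predictive mean
    cov = (se_kernel(x_star, x_star)                     # predictive covariance
           + sigma_noise**2 * np.eye(len(x_star))
           - k_star.T @ np.linalg.solve(A, k_star))
    return mean, cov
```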
Characteristic Lengthscales
[Figure: the mean posterior predictive function plotted for three different lengthscales, labeled "too long", "about right", and "too short"; axes: input x vs. function value y.]
Borrowing from: Arthur Gretton (Gatsby, UCL)
Definition (Inner product). Let H be a vector space over R. A function ⟨·,·⟩_H : H × H → R is an inner product on H if
1. Linear: ⟨α₁f₁ + α₂f₂, g⟩_H = α₁⟨f₁, g⟩_H + α₂⟨f₂, g⟩_H
2. Symmetric: ⟨f, g⟩_H = ⟨g, f⟩_H
3. ⟨f, f⟩_H ≥ 0, and ⟨f, f⟩_H = 0 if and only if f = 0.
Norm induced by the inner product: ‖f‖_H := √⟨f, f⟩_H
Fourier modes define a vector space
Definition (Kernel). Let X be a non-empty set. A function k : X × X → R is a kernel if there exists an R-Hilbert space H and a map φ : X → H such that ∀x, x′ ∈ X, k(x, x′) := ⟨φ(x), φ(x′)⟩_H.
Almost no conditions on X (e.g., X itself doesn't need an inner product; X could be a set of documents). A single kernel can correspond to several possible feature maps. A trivial example for X := R: φ₁(x) = x and φ₂(x) = (x/√2, x/√2).
Theorem (Sums of kernels are kernels). Given α > 0 and k, k₁, k₂ all kernels on X, then αk and k₁ + k₂ are kernels on X. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?)
Theorem (Mappings between spaces). Let X and X̃ be sets, and let A : X → X̃ be a map. Given a kernel k on X̃, then k(A(x), A(x′)) is a kernel on X.
Example: k(x, x′) = x² (x′)².
Theorem (Products of kernels are kernels). Given k₁ on X₁ and k₂ on X₂, then k₁ × k₂ is a kernel on X₁ × X₂. If X₁ = X₂ = X, then k := k₁ × k₂ is a kernel on X. (Proof: main idea only!)
Theorem (Polynomial kernels). Let x, x′ ∈ R^d for d ≥ 1, let m ≥ 1 be an integer and c ≥ 0 a non-negative real. Then k(x, x′) := (⟨x, x′⟩ + c)^m is a valid kernel. To prove: expand into a sum (with non-negative scalars) of kernels ⟨x, x′⟩ raised to integer powers. These individual terms are valid kernels by the product rule.
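To make the expansion argument concrete, the following sketch checks numerically (for m = 2 and an assumed c) that (⟨x, x′⟩ + c)² equals the inner product of explicit monomial features; the feature map phi below is one illustrative choice.

```python
import numpy as np

x, xp = np.array([1.0, 2.0]), np.array([0.5, -1.0])
c = 1.0

k_direct = (x @ xp + c) ** 2

def phi(v):
    # explicit features for the degree-2 polynomial kernel: all pairwise
    # products, the linear terms scaled by sqrt(2c), and the constant c
    return np.concatenate([np.outer(v, v).ravel(), np.sqrt(2 * c) * v, [c]])

assert np.isclose(k_direct, phi(x) @ phi(xp))
```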
Definition. The space ℓ² (square-summable sequences) comprises all sequences a := (a_i)_{i≥1} for which ‖a‖²_{ℓ²} = Σ_{i=1}^∞ a_i² < ∞.
Definition. Given a sequence of functions (φ_i(x))_{i≥1} in ℓ², where φ_i : X → R is the i-th coordinate of φ(x), then k(x, x′) := Σ_{i=1}^∞ φ_i(x) φ_i(x′) is a kernel. (1)
Why square-summable? By Cauchy-Schwarz, |Σ_{i=1}^∞ φ_i(x) φ_i(x′)| ≤ ‖φ(x)‖_{ℓ²} ‖φ(x′)‖_{ℓ²}, so the sequence defining the inner product converges for all x, x′ ∈ X.
Definition (Taylor series kernel). For r ∈ (0, ∞], with a_n ≥ 0 for all n ≥ 0, let f(z) = Σ_{n=0}^∞ a_n z^n for |z| < r, z ∈ R. Define X to be the √r-ball in R^d, so ‖x‖ < √r. Then k(x, x′) = f(⟨x, x′⟩) = Σ_{n=0}^∞ a_n ⟨x, x′⟩^n is a kernel.
Example (Exponential kernel). k(x, x′) := exp(⟨x, x′⟩).
Example (Gaussian kernel). The Gaussian kernel on R^d is defined as k(x, x′) := exp(−γ² ‖x − x′‖²). Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.
(Also known as the Radial Basis Function (RBF) kernel; related variants: Squared Exponential (SE), Automatic Relevance Determination (ARD).)
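The exercise hinted at above can be checked numerically: the sketch below writes the Gaussian kernel as the exponential kernel times a mapping applied to each argument, i.e. a product of valid kernels. The constant and test vectors are arbitrary illustrative choices.

```python
import numpy as np

gamma = 0.7                      # any positive constant (standing in for the gamma^2 above)
x, xp = np.random.randn(3), np.random.randn(3)

# direct evaluation of the Gaussian / RBF kernel
k_direct = np.exp(-gamma * np.sum((x - xp) ** 2))

# same value via the construction from the slide: the exponential kernel
# exp(2*gamma*<x, x'>) times the map x -> exp(-gamma*||x||^2) applied to
# each argument (products and mappings of kernels are kernels)
k_composed = (np.exp(-gamma * x @ x)
              * np.exp(-gamma * xp @ xp)
              * np.exp(2 * gamma * x @ xp))

assert np.isclose(k_direct, k_composed)
```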
Base kernels (one-dimensional inputs):
Squared-exp (SE):  k(x, x′) = σ_f² exp( −(x − x′)² / (2ℓ²) )
Periodic (Per):    k(x, x′) = σ_f² exp( −(2/ℓ²) sin²(π (x − x′) / p) )
Linear (Lin):      k(x, x′) = σ_f² (x − c)(x′ − c)
Products of base kernels are also kernels: Lin × Lin, SE × Per, Lin × SE, Lin × Per.
[Figure: plots of the base kernels and their products, as functions of x − x′ (or of x with x′ = 1).]
source: David Duvenaud (PhD Thesis)
Definition (Positive definite functions). A symmetric function k : X × X → R is positive definite if ∀n ≥ 1, ∀(a₁, ..., a_n) ∈ R^n, ∀(x₁, ..., x_n) ∈ X^n,
Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) ≥ 0.
The function k(·,·) is strictly positive definite if, for mutually distinct x_i, equality holds only when all the a_i are zero.
Theorem. Let H be a Hilbert space, X a non-empty set, and φ : X → H. Then ⟨φ(x), φ(y)⟩_H =: k(x, y) is positive definite.
Proof.
Σ_{i=1}^n Σ_{j=1}^n a_i a_j k(x_i, x_j) = Σ_{i=1}^n Σ_{j=1}^n ⟨a_i φ(x_i), a_j φ(x_j)⟩_H = ‖ Σ_{i=1}^n a_i φ(x_i) ‖²_H ≥ 0.
The reverse also holds: a positive definite k(x, x′) is the inner product in a unique H (Moore-Aronszajn theorem: coming later!).
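A quick numerical illustration of the theorem: build a Gram matrix from an explicit feature map and check the positive-definiteness condition. The feature map and the random data below are arbitrary choices made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

# Gram matrix of an explicit feature map phi(x) = (x, x^2): inner products in H
Phi = np.hstack([X, X**2])
K = Phi @ Phi.T

# positive definiteness: sum_ij a_i a_j k(x_i, x_j) >= 0 for any coefficients a
a = rng.standard_normal(20)
assert a @ K @ a >= -1e-10

# equivalently, the symmetric Gram matrix has no negative eigenvalues
assert np.all(np.linalg.eigvalsh(K) >= -1e-10)
```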
Borrowing from: Percy Liang (Stanford)
Idea: project a high-dimensional vector onto a lower-dimensional subspace, e.g. x ∈ R^361 → z = U^⊤x ∈ R^10.
Given n data points in d dimensions: x₁, ..., x_n ∈ R^d, collected as X = (x₁ ··· x_n) ∈ R^{d×n}. (Note: this is the transpose of the X used in regression!)
Want to reduce the dimensionality from d to k.
Choose k directions u₁, ..., u_k, collected as U = (u₁ ··· u_k) ∈ R^{d×k}.
For each u_j, compute the "similarity" z_j = u_j^⊤ x.
Project x down to z = (z₁, ..., z_k)^⊤ = U^⊤x. How to choose U?
Optimize two equivalent objectives
U serves two functions:
- Project: z_j = u_j^⊤ x, i.e. z = U^⊤x.
- Reconstruct: x̃ = Uz = Σ_{j=1}^k z_j u_j.
Want the reconstruction error ‖x − x̃‖ to be small.
Objective: minimize the total squared reconstruction error
min_{U ∈ R^{d×k}} Σ_{i=1}^n ‖x_i − UU^⊤x_i‖².
Empirical distribution: uniform over x₁, ..., x_n.
Expectation (think sum over data points): Ê[f(x)] = (1/n) Σ_{i=1}^n f(x_i).
Variance (think sum of squares if centered): v̂ar[f(x)] + (Ê[f(x)])² = Ê[f(x)²] = (1/n) Σ_{i=1}^n f(x_i)².
Assume the data is centered: Ê[x] = 0 (what is Ê[U^⊤x]?).
Objective: maximize the variance of the projected data
max_{U ∈ R^{d×k}, U^⊤U = I} Ê[‖U^⊤x‖²].
Key intuition: variance of data (fixed) = captured variance (want large) + reconstruction error (want small).
Pythagorean decomposition: x = UU^⊤x + (I − UU^⊤)x, with ‖UU^⊤x‖² + ‖(I − UU^⊤)x‖² = ‖x‖².
Take expectations; note that the rotation U does not affect length:
Ê[‖x‖²] = Ê[‖U^⊤x‖²] + Ê[‖x − UU^⊤x‖²].
Minimize reconstruction error ↔ Maximize captured variance.
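This decomposition can be verified directly on data; in the sketch below the orthonormal U is taken (arbitrarily, for illustration) from an SVD, but the identity holds for any U with orthonormal columns.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))      # rows are data points here, for convenience
X -= X.mean(axis=0)                    # center, so E_hat[x] = 0

# any U with orthonormal columns works; here: the top-2 right singular vectors
U = np.linalg.svd(X, full_matrices=False)[2].T[:, :2]    # d x k, U^T U = I

total    = np.mean(np.sum(X**2, axis=1))                 # E_hat[||x||^2]
captured = np.mean(np.sum((X @ U)**2, axis=1))           # E_hat[||U^T x||^2]
residual = np.mean(np.sum((X - X @ U @ U.T)**2, axis=1)) # E_hat[||x - U U^T x||^2]

assert np.isclose(total, captured + residual)
```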
Data: X = (x₁ ··· x_n) ∈ R^{d×n}.  Orthonormal basis: U = (u₁ ··· u_k) ∈ R^{d×k}.
Change of basis: z = U^⊤x, i.e. z_j = u_j^⊤x, giving z = (z₁, ..., z_k)^⊤.
Inverse change of basis: x̃ = Uz = Σ_{j=1}^k z_j u_j.
Eigenvectors of the covariance: eigen-decomposition (1/n) XX^⊤ = UΛU^⊤, with Λ = diag(λ₁, λ₂, ..., λ_d).
Claim: eigenvectors of a symmetric matrix are orthogonal (argument from Stack Exchange).
Idea: take the top-k eigenvectors to maximize the captured variance.
Truncated basis: U = (u₁ ··· u_k) ∈ R^{d×k}, with truncated eigenvalues Λ^(k) = diag(λ₁, λ₂, ..., λ_k).
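A minimal sketch of this recipe, assuming the columns of X have already been centered (the variable names are illustrative):

```python
import numpy as np

def pca_eig(X, k):
    """PCA of a d x n matrix X with centered columns, via the covariance eigendecomposition."""
    d, n = X.shape
    C = (X @ X.T) / n                    # d x d covariance matrix
    lam, U = np.linalg.eigh(C)           # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:k]    # indices of the top-k eigenvalues
    U_k, lam_k = U[:, order], lam[order]
    Z = U_k.T @ X                        # k x n projected coordinates
    return U_k, lam_k, Z
```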
Example. Data: three varieties of wheat (Kama, Rosa, Canadian). Attributes: area, perimeter, compactness, length of kernel, width of kernel, asymmetry coefficient, length of groove.
[Figure: projections onto the top 2 principal components vs. the bottom 2 components.]
Computing the truncated basis U = (u₁ ··· u_k) ∈ R^{d×k} from the data X = (x₁ ··· x_n) ∈ R^{d×n}: using the eigen-value decomposition of the covariance matrix.
Alternative: using the singular-value decomposition (computed, e.g., with the power method).
Idea: decompose a d × d matrix M into (unitary matrix)(diagonal matrix)(unitary matrix).
For the d × n data matrix: X = U_{d×d} Σ_{d×n} V^⊤_{n×n}, with U and V unitary and Σ a (rectangular) diagonal matrix of singular values.
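The same truncated basis can be read off the SVD of the (centered) data matrix; a small sketch, with np.linalg.svd standing in for the power-method computation mentioned above:

```python
import numpy as np

def pca_svd(X, k):
    """Same U_k and Z as the eigendecomposition route, via X = U Sigma V^T (X is d x n, centered)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k = U[:, :k]                 # top-k left singular vectors = top-k eigenvectors of XX^T
    Z = np.diag(s[:k]) @ Vt[:k]    # k x n, equal to U_k^T X
    return U_k, Z                  # note: eigenvalues of (1/n) XX^T are s**2 / n
```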
Eigen-faces: X_{d×n} ≈ U_{d×k} Z_{k×n}.
Idea: z_i is a more "meaningful" representation of the i-th face than x_i.
Can use z_i for nearest-neighbor classification.
Much faster: O(dk + nk) time instead of O(dn) when n, d ≫ k.
Why no time savings for a linear classifier?
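A sketch of the nearest-neighbor idea in the reduced space; U_k, Z_train, and labels are assumed to come from a previous PCA fit on the training faces, and the function name is illustrative.

```python
import numpy as np

def nearest_face(x_query, U_k, Z_train, labels):
    # project the query once: O(dk); then compare against n stored k-vectors: O(nk)
    z_query = U_k.T @ x_query                                # k-dimensional code
    dists = np.sum((Z_train - z_query[:, None])**2, axis=0)  # distances to all training codes
    return labels[np.argmin(dists)]
```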
[Figure: scree plot of the eigenvalues λ_i against the component index i.]
Example: word-document count data, X_{d×n} ≈ U_{d×k} Z_{k×n}, where the rows of X count word occurrences (e.g. "stocks", "chairman", "the", "wins", "game") in each of the n documents.
How to measure similarity between two documents? z₁^⊤z₂ is probably better than x₁^⊤x₂.
Applications: information retrieval. Note: no computational savings here; the original x is already sparse.
Summary: PCA minimizes reconstruction error; it can be computed from the covariance matrix or via the SVD (which differ in time complexity); applications include anomaly detection, etc.
Probabilistic PCA: generative model [Tipping and Bishop, 1999].
For each data point i = 1, ..., n:
- Draw the latent vector: z_i ∼ N(0, I_{k×k})
- Create the data point: x_i ∼ N(U z_i, σ² I_{d×d})
PCA finds the U that maximizes the likelihood of the data: max_U p(X | U).
Advantages: handles missing data (e.g. for collaborative filtering); extends to factor analysis (replace σ² I_{d×d} with an arbitrary diagonal matrix).
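A minimal sketch of sampling from this generative model (the parameter values and function name are placeholders, not part of the slides):

```python
import numpy as np

def sample_ppca(U, sigma, n, seed=0):
    """Draw n points from the Tipping & Bishop generative model."""
    rng = np.random.default_rng(seed)
    d, k = U.shape
    Z = rng.standard_normal((k, n))                   # z_i ~ N(0, I_{k x k})
    X = U @ Z + sigma * rng.standard_normal((d, n))   # x_i ~ N(U z_i, sigma^2 I_{d x d})
    return X, Z
```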
[Figure: a dataset where PCA is effective vs. one where PCA is ineffective.]
Problem: the PCA subspace is linear, S = {x = Uz : z ∈ R^k}. In this example: S = {(x₁, x₂) : x₂ = (u₂/u₁) x₁}.
Desired solution: S = {(x₁, x₂) : x₂ = (u₂/u₁) x₁²}, which the broken (linear) solution cannot represent.
We can get this as S = {φ(x) = Uz} with the feature map φ(x) = (x₁², x₂)^⊤.
Linear dimensionality reduction in φ(x) space ⇔ nonlinear dimensionality reduction in x space.
Idea: use kernels.
Representer theorem: the principal direction lies in the span of the data, u = Xα = Σ_{i=1}^n α_i x_i (from the eigenvalue problem XX^⊤u = λu).
Kernel function: k(x₁, x₂) such that the kernel matrix K, with K_ij = k(x_i, x_j), is positive semi-definite.
Substituting u = Xα:
max_{‖u‖=1} u^⊤XX^⊤u = max_{α^⊤X^⊤Xα=1} α^⊤(X^⊤X)(X^⊤X)α = max_{α^⊤Kα=1} α^⊤K²α.
Direct method: the kernel PCA objective max_{α^⊤Kα=1} α^⊤K²α ⇒ the kernel PCA eigenvalue problem Kα = λ′α.
Modular method (if you don't want to think about kernels): find vectors x′₁, ..., x′_n such that x′_i^⊤x′_j = K_ij = φ(x_i)^⊤φ(x_j).
Key: use any vectors that preserve inner products. One possibility is the Cholesky decomposition K = X′^⊤X′.
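A sketch of the direct method; the centering of the kernel matrix is a standard practical detail that the slides do not spell out and is included here as an assumption.

```python
import numpy as np

def kernel_pca(K, k):
    """Direct method: eigenvectors of the (centered) n x n kernel matrix give the alphas."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                            # center the implicit features
    lam, A = np.linalg.eigh(Kc)
    order = np.argsort(lam)[::-1][:k]         # top-k eigenpairs (assumed > 0)
    lam_k, A_k = lam[order], A[:, order]
    A_k = A_k / np.sqrt(lam_k)                # scale so each u = X alpha has unit norm
    Z = Kc @ A_k                              # n x k projections of the training points
    return A_k, Z
```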
Often, each data point consists of two views:
- x: pixels (or other visual features); y: text around the image
- x: signal at time t; y: signal at time t + 1
- x: features of a word/object, etc.; y: features of the context in which it appears
Goal: reduce the dimensionality of the two views jointly.
Setup: input data (x₁, y₁), ..., (x_n, y_n) (matrices X, Y). Goal: find a pair of projections (u, v).
[Figure: independent vs. joint dimensionality reduction; in this example x and y are paired by brightness.]
Definitions:
Variance: v̂ar(u^⊤x) = u^⊤XX^⊤u
Covariance: ĉov(u^⊤x, v^⊤y) = u^⊤XY^⊤v
Correlation: ĉorr(u^⊤x, v^⊤y) = ĉov(u^⊤x, v^⊤y) / ( √v̂ar(u^⊤x) √v̂ar(v^⊤y) )
Objective: maximize the correlation between the projected views, max_{u,v} ĉorr(u^⊤x, v^⊤y).
Properties: the correlation is invariant to rescaling of u and v.
PCA on the views separately: no covariance term
max_{u,v} [ u^⊤XX^⊤u / (u^⊤u) + v^⊤YY^⊤v / (v^⊤v) ]
PCA on the concatenation (X^⊤, Y^⊤)^⊤: includes the covariance term
max_{u,v} [ u^⊤XX^⊤u + 2 u^⊤XY^⊤v + v^⊤YY^⊤v ] / ( u^⊤u + v^⊤v )
Maximum covariance: drop the variance terms
max_{u,v} u^⊤XY^⊤v / ( √(u^⊤u) √(v^⊤v) )
Maximum correlation (CCA): divide out the variance terms
max_{u,v} u^⊤XY^⊤v / ( √(u^⊤XX^⊤u) √(v^⊤YY^⊤v) )
Extreme examples of degeneracy: projections with correlation 1 and with correlation 0.
Problem: if X or Y has rank n, then any (u, v) is optimal (correlation 1, with u = X^{†⊤}Yv) ⇒ CCA is meaningless!
Solution: regularization (interpolate between maximum covariance and maximum correlation):
max_{u,v} u^⊤XY^⊤v / ( √(u^⊤(XX^⊤ + λI)u) √(v^⊤(YY^⊤ + λI)v) )
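A sketch of solving this regularized objective for the top direction pair, by whitening each view and taking the leading singular pair of the cross term; X and Y are assumed to be centered d_x × n and d_y × n matrices, and the function name is illustrative.

```python
import numpy as np

def regularized_cca(X, Y, lam=1e-3):
    """Top pair (u, v) for the regularized CCA objective; X, Y are centered, one column per data point."""
    Cxx = X @ X.T + lam * np.eye(X.shape[0])
    Cyy = Y @ Y.T + lam * np.eye(Y.shape[0])
    Cxy = X @ Y.T
    # whiten each view, then take the leading singular pair of the cross term
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))     # Cxx = L L^T, Wx = L^{-1}
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    A, s, Bt = np.linalg.svd(Wx @ Cxy @ Wy.T)
    u = Wx.T @ A[:, 0]                              # undo the whitening change of variables
    v = Wy.T @ Bt[0]
    return u, v, s[0]                               # s[0] is the (regularized) correlation
```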