Data Mining Techniques
CS 6220 - Section 3 - Fall 2016
Lecture 13
Jan-Willem van de Meent (credit: David Lopez-Paz, David Duvenaud, Laurens van der Maaten)
Homework
Homework 3 is out today (due 4 Nov)
Homework 1 has been graded (we will grade Homework 2 a little faster)
If you believe there were grading problems (e.g. code not running), you can ask for regrading
The final grade can be lower than before
Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
Orthonormal basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$

Change of basis, from $x \in \mathbb{R}^d$ to $z = (z_1, \ldots, z_k)^\top$:
$z = U^\top x, \qquad z_j = u_j^\top x$

Inverse change of basis:
$\tilde{x} = Uz = \sum_{j=1}^{k} z_j u_j = \sum_{j=1}^{k} u_j u_j^\top x$
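As a quick illustration (not from the slides), here is a minimal numpy sketch of the change of basis and its inverse; the orthonormal basis U is generated with a QR decomposition purely for the example.

```python
import numpy as np

d = 5
x = np.random.randn(d)                  # a single data point in R^d

# Random orthonormal basis (columns of U), obtained from a QR decomposition
U, _ = np.linalg.qr(np.random.randn(d, d))

z = U.T @ x                             # change of basis: z_j = u_j^T x
x_rec = U @ z                           # inverse change of basis: x~ = U z

print(np.allclose(x, x_rec))            # True: a full orthonormal basis loses no information
```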
Eigenvectors of Covariance: Eigen-decomposition
Eigenvectors of Covariance: Eigen-decomposition
Claim: Eigenvectors of a symmetric matrix are orthogonal (proof from Stack Exchange)
Proof sketch for distinct eigenvalues: if $Su = \lambda u$, $Sv = \mu v$, and $S = S^\top$, then $\lambda\, u^\top v = (Su)^\top v = u^\top (Sv) = \mu\, u^\top v$, so $(\lambda - \mu)\, u^\top v = 0$ and hence $u^\top v = 0$.
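A quick numerical check of this claim, as a small numpy sketch (the symmetric matrix here is arbitrary):

```python
import numpy as np

A = np.random.randn(4, 4)
S = A + A.T                              # any symmetric matrix

eigvals, U = np.linalg.eigh(S)           # eigh is the eigendecomposition for symmetric matrices
print(np.allclose(U.T @ U, np.eye(4)))   # True: the eigenvectors are orthonormal
```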
Data: $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$
Truncated basis: $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$
Eigenvectors of Covariance: Truncated decomposition
Projection / Encoding:
$z = U^\top x$

Reconstruction / Decoding:
$\tilde{x} = Uz = \sum_{j=1}^{k} u_j u_j^\top x$
Data: three varieties of wheat (Kama, Rosa, Canadian)
Attributes: Area, Perimeter, Compactness, Length of Kernel, Width of Kernel, Asymmetry Coefficient, Length of Groove
[Figure: projection onto the top 2 components vs. the bottom 2 components]
Using eigen-value decomposition
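A minimal sketch of PCA via an eigendecomposition of the covariance matrix, assuming (as in the slides) that data points are the columns of X; the function name and placeholder data are mine:

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance. X is d x n (one point per column)."""
    d, n = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)     # center the data
    C = (Xc @ Xc.T) / n                        # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)       # ascending eigenvalues, orthonormal eigenvectors
    top = np.argsort(eigvals)[::-1][:k]        # indices of the k largest eigenvalues
    U = eigvecs[:, top]                        # truncated basis, d x k
    Z = U.T @ Xc                               # projection / encoding, k x n
    return U, Z

# Hypothetical usage on seeds-like data: 7 attributes, 210 samples, top 2 components
X = np.random.randn(7, 210)                    # placeholder data, not the real seeds dataset
U, Z = pca_eig(X, k=2)
X_rec = U @ Z + X.mean(axis=1, keepdims=True)  # reconstruction / decoding (lossy for k < d)
```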
Using singular-value decomposition
(with power method)
Idea: Decompose a $d \times d$ matrix $M$ into
$M = U \Sigma V^\top$
with $U$ (unitary matrix), $\Sigma$ (diagonal matrix), $V^\top$ (unitary matrix).

Idea: Decompose the $d \times n$ data matrix $X$ into
$X = U_{d \times d}\, \Sigma_{d \times n}\, V^\top_{n \times n}$
with $U$ (unitary matrix), $\Sigma$ (diagonal projection), $V^\top$ (unitary matrix).
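Equivalently, the truncated basis can be read off the SVD of the centered data matrix: the left singular vectors of X are the eigenvectors of the covariance, and the squared singular values divided by n are its eigenvalues. A sketch using numpy's svd rather than the power method mentioned above:

```python
import numpy as np

def pca_svd(X, k):
    """PCA via the SVD of the centered data matrix X (d x n, one point per column)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U @ diag(s) @ Vt
    Uk = U[:, :k]             # top-k left singular vectors = top-k eigenvectors of the covariance
    Z = Uk.T @ Xc             # encoding; equivalently s[:k, None] * Vt[:k]
    return Uk, Z

# Sanity check: the covariance eigenvalues equal s**2 / n (up to numerical error),
# so both routes give the same subspace (columns may differ in sign).
```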
Borrowing from: David Lopez-Paz & David Duvenaud
Fast, efficient, and distance-preserving dimensionality reduction!
Map $x_1, x_2 \in \mathbb{R}^{40500}$ to $y_1, y_2 \in \mathbb{R}^{1000}$ with a single random matrix $W \in \mathbb{R}^{40500 \times 1000}$, so that
$(1 - \epsilon)\|x_1 - x_2\|^2 \le \|y_1 - y_2\|^2 \le (1 + \epsilon)\|x_1 - x_2\|^2$
This result is formalized in the Johnson-Lindenstrauss Lemma.
The proof is a great example of Erdős's probabilistic method.
Paul Erdős (1913-1996), Joram Lindenstrauss (1936-2012), William B. Johnson (1944-)
For any $0 < \epsilon < 1/2$ and any integer $m > 4$, let $k = \frac{20 \log m}{\epsilon^2}$. Then, for any set $V$ of $m$ points in $\mathbb{R}^N$, there exists $f : \mathbb{R}^N \to \mathbb{R}^k$ such that for all $u, v \in V$:
$(1 - \epsilon)\|u - v\|^2 \le \|f(u) - f(v)\|^2 \le (1 + \epsilon)\|u - v\|^2$
The lemma holds when $f$ is a linear function with random coefficients:
$f(x) = \frac{1}{\sqrt{k}} A x, \quad A \in \mathbb{R}^{k \times N}, \; k < N, \; A_{ij} \sim \mathcal{N}(0, 1)$
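A sketch of this random linear map (function and variable names are mine; the dimensions are illustrative):

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Map rows of X (m points in R^N) to R^k with f(x) = (1/sqrt(k)) A x, A_ij ~ N(0, 1)."""
    m, N = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((k, N))
    return X @ A.T / np.sqrt(k)

# Illustrative check of distance preservation
X = np.random.randn(100, 10_000)
Y = random_projection(X, k=1_000)
d_high = np.linalg.norm(X[0] - X[1]) ** 2
d_low = np.linalg.norm(Y[0] - Y[1]) ** 2
print(d_high, d_low)   # the two squared distances should agree up to a (1 ± eps) factor
```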
Data: 20-newsgroups, from 100,000 features to 300 (0.3%)
Data: 20-newsgroups, from 100,000 features to 1,000 (1%)
Data: 20-newsgroups, from 100,000 features to 10,000 (10%)
Conclusion: Random projections preserve distances like PCA, but are much faster than PCA when the number of dimensions is very large.
Borrowing from: Laurens van der Maaten (Delft -> Facebook AI)
Idea: Perform a non-linear dimensionality reduction in a manner that preserves proximity (but not distances)
Euclidean distance is not always a good notion of proximity
Bad projection: relative position to neighbors changes
Intuition: Want to preserve local neighborhood
Similarity in high dimension:
$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$

Similarity in low dimension:
$q_{j|i} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq i} \exp(-\|y_i - y_k\|^2)}$
Idea: Optimize yi via gradient descent on C
Cost function:
$C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}}$
The gradient has a surprisingly simple form:
$\frac{\partial C}{\partial y_i} = 2 \sum_{j \neq i} (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$

The gradient update with a momentum term is given by:
$Y^{(t)} = Y^{(t-1)} + \eta \frac{\partial C}{\partial Y} + \beta(t)\,(Y^{(t-1)} - Y^{(t-2)})$
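A compact numpy sketch of the SNE similarities, cost, and gradient, assuming for brevity a single global σ rather than the per-point σ_i discussed below (helper names are mine):

```python
import numpy as np

def squared_dists(A):
    """Pairwise squared Euclidean distances between rows of A."""
    s = (A ** 2).sum(axis=1)
    return s[:, None] + s[None, :] - 2 * A @ A.T

def conditional_probs(D, sigma2):
    """Rows of p_{j|i}: softmax of -D_ij / (2 sigma^2) over j != i."""
    P = np.exp(-D / (2 * sigma2))
    np.fill_diagonal(P, 0.0)                          # exclude j = i
    return P / P.sum(axis=1, keepdims=True)

def sne_cost_and_grad(X, Y, sigma2=1.0):
    """KL cost C = sum_i KL(P_i || Q_i) and its gradient with respect to the embedding Y."""
    P = conditional_probs(squared_dists(X), sigma2)   # high-dimensional similarities
    Q = conditional_probs(squared_dists(Y), 0.5)      # low-dim: exponent is -||y_i - y_j||^2
    mask = ~np.eye(len(P), dtype=bool)
    # small constants guard against log(0) when a p_{j|i} underflows
    C = np.sum(P[mask] * np.log((P[mask] + 1e-12) / (Q[mask] + 1e-12)))
    # dC/dy_i = 2 sum_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j}) (y_i - y_j)
    W = P - Q + P.T - Q.T
    grad = 2 * (W.sum(axis=1, keepdims=True) * Y - W @ Y)
    return C, grad
```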
Problem: $p_{j|i}$ is not equal to $p_{i|j}$
Solution: Minimize a single KL divergence between joint probability distributions $P$ and $Q$:
$C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_{j \neq i} p_{ij} \log \frac{p_{ij}}{q_{ij}}$

The obvious way to redefine the pairwise similarities is:
$p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}, \qquad q_{ij} = \frac{\exp(-\|y_i - y_j\|^2)}{\sum_{k \neq l} \exp(-\|y_l - y_k\|^2)}$
Problem: How should we choose σ?
Bad σ: neighborhood is not local on the manifold
Good σ: Neighborhood contains 5-50 points
Problem: optimal σ may vary if density not uniform
Solution: Define $\sigma_i$ per point.
$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \qquad p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}$
Set σi to ensure constant perplexity
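A common implementation choice is a per-point binary search over σ_i so that the perplexity $2^{H(P_i)}$ of the conditional distribution matches a target value (typically in the 5-50 range, matching the neighborhood size above). A sketch, with function names of my own:

```python
import numpy as np

def perplexity(p):
    """Perplexity 2^H(p) of a discrete distribution p."""
    p = np.maximum(p, 1e-12)
    return 2.0 ** (-np.sum(p * np.log2(p)))

def sigma_for_point(dists_i, target_perplexity=30.0, tol=1e-4, max_iter=60):
    """Binary search for sigma_i so that P_i = (p_{j|i})_j has the target perplexity.
    dists_i: squared distances from point i to all other points (point i itself excluded).
    The target should be smaller than the number of other points."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = 0.5 * (lo + hi)
        p = np.exp(-dists_i / (2 * sigma ** 2))
        p /= p.sum()
        if perplexity(p) > target_perplexity:
            hi = sigma          # neighborhood too broad -> shrink sigma
        else:
            lo = sigma          # neighborhood too narrow -> grow sigma
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```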
Similarity in high dimension:
$p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma^2)}{\sum_{k \neq l} \exp(-\|x_l - x_k\|^2 / 2\sigma^2)}$

Similarity in low dimension:
$q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}}$

Gradient:
$\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})(1 + \|y_i - y_j\|^2)^{-1}(y_i - y_j)$
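Putting the pieces together, a sketch of one t-SNE update: compute the Student-t similarities q_ij, the gradient above, and a momentum step. Step sizes are illustrative, and the symmetric p_ij are assumed precomputed (e.g. via the perplexity search above followed by p_ij = (p_{j|i} + p_{i|j}) / 2N):

```python
import numpy as np

def tsne_step(Y, Y_prev, P, eta=100.0, beta=0.5):
    """One t-SNE gradient step with momentum.
    Y, Y_prev: current and previous embeddings (n x 2); P: symmetric p_ij (n x n)."""
    # Student-t similarities: q_ij = (1 + ||y_i - y_j||^2)^-1 / sum_{k != l} (1 + ||y_k - y_l||^2)^-1
    s = (Y ** 2).sum(axis=1)
    D = s[:, None] + s[None, :] - 2 * Y @ Y.T
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()
    # Gradient: dC/dy_i = 4 sum_j (p_ij - q_ij) (1 + ||y_i - y_j||^2)^-1 (y_i - y_j)
    A = (P - Q) * W
    grad = 4 * (A.sum(axis=1, keepdims=True) * Y - A @ Y)
    # Descend along the gradient (we are minimizing C) and add the momentum term
    return Y - eta * grad + beta * (Y - Y_prev)
```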
[Figure: MNIST digits embedded in two dimensions, points colored by digit class (1-9)]