http://cs246.stanford.edu High dimensional == many features Find - PowerPoint PPT Presentation

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

High-dimensional == many features
• Find concepts/topics/genres:
  • Documents: features are thousands of words, millions of word pairs
  • Surveys, e.g. Netflix: 480k users x 177k movies
1/25/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets

Compress / reduce dimensionality:
• 10^6 rows; 10^3 columns; no updates
• Random access to any cell(s); small error: OK

Assumption: Data lies on or near a low d-dimensional subspace
• The axes of this subspace are an effective representation of the data

Why reduce dimensions?
• Discover hidden correlations/topics
  • Words that commonly occur together
• Remove redundant and noisy features
  • Not all words are useful
• Interpretation and visualization
• Easier storage and processing of the data

$A_{[m \times n]} = U_{[m \times r]} \, \Sigma_{[r \times r]} \, (V_{[n \times r]})^T$
• A: Input data matrix
  • m x n matrix (e.g., m documents, n terms)
• U: Left singular vectors
  • m x r matrix (m documents, r concepts)
• Σ: Singular values
  • r x r diagonal matrix (strength of each 'concept'), where r is the rank of the matrix A
• V: Right singular vectors
  • n x r matrix (n terms, r concepts)
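The shapes listed above can be checked with a minimal sketch. It assumes NumPy and a small made-up matrix (not from the slides); `np.linalg.svd` returns min(m, n) singular values, so the sketch trims the result to the r non-negligible ones to match the m x r, r x r, n x r form.

```python
import numpy as np

# Hypothetical 4 x 3 "documents x terms" matrix with rank 2
# (the third column is the sum of the first two).
A = np.array([
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 2.0],
    [3.0, 0.0, 3.0],
    [0.0, 1.0, 1.0],
])

# full_matrices=False returns the thin SVD: U is m x min(m, n).
U_full, s_full, Vt_full = np.linalg.svd(A, full_matrices=False)

# Keep only the r columns that belong to non-negligible singular values.
r = int(np.sum(s_full > 1e-10 * s_full[0]))
U = U_full[:, :r]              # m x r
Sigma = np.diag(s_full[:r])    # r x r
V = Vt_full[:r, :].T           # n x r

A_rebuilt = U @ Sigma @ V.T
print("rank r =", r)
print("shapes:", U.shape, Sigma.shape, V.shape)
print("max reconstruction error:", np.abs(A - A_rebuilt).max())
```

The relative threshold `1e-10` is an arbitrary choice for this sketch; any cutoff that separates true zeros from floating-point noise would do.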

[Figure: the m x n matrix A drawn as the product of the blocks U, Σ and V^T]

$A = \sum_{i=1}^{r} \sigma_i u_i v_i^T = \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \dots$
• σ_i … scalar
• u_i … vector
• v_i … vector
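The sum-of-rank-one-terms view can be sketched as follows, assuming NumPy and a small made-up matrix chosen only for illustration. Each term is the outer product of a left and a right singular vector, scaled by its singular value.

```python
import numpy as np

# Hypothetical 3 x 2 matrix, used only to illustrate the identity.
A = np.array([
    [3.0, 1.0],
    [1.0, 3.0],
    [2.0, 2.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Add up sigma_i * u_i * v_i^T, one rank-one term per singular value.
A_sum = np.zeros_like(A)
for i in range(len(s)):
    A_sum += s[i] * np.outer(U[:, i], Vt[i, :])

print(np.allclose(A, A_sum))
```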

It is always possible to decompose a real matrix A into $A = U \Sigma V^T$, where
• Σ is unique; U and V are unique only up to sign flips of matching column pairs (and only when the singular values are distinct)
• U, V: column orthonormal
  • $U^T U = I$; $V^T V = I$ (I: identity matrix)
  • (Columns are orthogonal unit vectors)
• Σ: diagonal
  • Entries (singular values) are positive and sorted in decreasing order ($\sigma_1 \geq \sigma_2 \geq \dots \geq 0$)
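These properties can be checked numerically. The sketch below assumes NumPy and a random matrix with a fixed seed (both are choices made here, not part of the slides). It also shows the sign ambiguity: flipping a matching pair of columns in U and V leaves the product unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed, arbitrary choice
A = rng.normal(size=(6, 4))         # hypothetical real matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

orthonormal_U = np.allclose(U.T @ U, np.eye(4))   # U^T U = I
orthonormal_V = np.allclose(V.T @ V, np.eye(4))   # V^T V = I
sorted_desc = bool(np.all(s[:-1] >= s[1:]))       # sigma_1 >= sigma_2 >= ...
non_negative = bool(np.all(s >= 0))

# Sign ambiguity: flip u_1 and v_1 together and rebuild A.
U2, V2 = U.copy(), V.copy()
U2[:, 0] *= -1
V2[:, 0] *= -1
same_product = np.allclose(U2 @ np.diag(s) @ V2.T, A)

print(orthonormal_U, orthonormal_V, sorted_desc, non_negative, same_product)
```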

$A = U \Sigma V^T$ example:

A (rows: users, columns: movies)

          Matrix  Alien  Serenity  Casablanca  Amelie
SciFi        1      1       1          0          0
             2      2       2          0          0
             1      1       1          0          0
             5      5       5          0          0
Romance      0      0       0          2          2
             0      0       0          3          3
             0      0       0          1          1

U is the "user-to-concept" similarity matrix (columns: SciFi-concept, Romance-concept)

    0.18  0
    0.36  0
    0.18  0
    0.90  0
    0     0.53
    0     0.80
    0     0.27

Σ is diagonal (9.64 is the 'strength' of the SciFi-concept)

    9.64  0
    0     5.29

V is the "movie-to-concept" similarity matrix; V^T is shown (rows: SciFi-concept, Romance-concept)

    0.58  0.58  0.58  0     0
    0     0     0     0.71  0.71

'movies', 'users' and 'concepts':
• U: user-to-concept similarity matrix
• V: movie-to-concept similarity matrix
• Σ: its diagonal elements give the 'strength' of each concept
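The three matrices can be recomputed from the ratings in the example. This is a sketch assuming NumPy, and assuming the column order Matrix, Alien, Serenity, Casablanca, Amelie. Singular vectors are only determined up to sign, so the sketch prints magnitudes.

```python
import numpy as np

# Ratings from the example: 7 users (rows) x 5 movies (columns).
# Assumed column order: Matrix, Alien, Serenity, Casablanca, Amelie.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2  # two non-zero singular values, i.e. two concepts
strengths = np.round(s[:r], 2)
user_to_concept = np.round(np.abs(U[:, :r]), 2)
movie_to_concept = np.round(np.abs(Vt[:r, :].T), 2)

print("strengths:", strengths)
print("user-to-concept:\n", user_to_concept)
print("movie-to-concept:\n", movie_to_concept)
```

The strengths come out as 9.64 and 5.29 (the square roots of 93 and 28), matching Σ in the example.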

SVD gives the 'best' axis to project on:
• 'best' = minimizes the sum of squares of projection errors
• In other words, minimum reconstruction error
[Figure: points plotted by Movie 1 rating and Movie 2 rating, with the first singular vector v_1 drawn as the projection axis]
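The 'best axis' claim can be spot-checked with a small sketch. It assumes NumPy and six made-up two-movie ratings (hypothetical data); `projection_error` is a helper introduced here for illustration. Without centering the data, the SVD gives the best line through the origin, so the comparison sweeps over directions through the origin.

```python
import numpy as np

# Hypothetical ratings of two movies by six users (one row per user).
points = np.array([
    [1.0, 1.2],
    [2.0, 1.9],
    [3.0, 3.1],
    [4.0, 3.8],
    [5.0, 5.2],
    [1.5, 1.4],
])

def projection_error(data, axis):
    """Sum of squared distances from each row to the line spanned by axis."""
    axis = axis / np.linalg.norm(axis)
    projected = np.outer(data @ axis, axis)
    return float(np.sum((data - projected) ** 2))

_, s, Vt = np.linalg.svd(points, full_matrices=False)
v1 = Vt[0]  # first right singular vector

err_v1 = projection_error(points, v1)

# Compare against a sweep of other directions through the origin.
angles = np.linspace(0.0, np.pi, 181)
err_sweep = [projection_error(points, np.array([np.cos(a), np.sin(a)]))
             for a in angles]

print("error along v1:", err_v1)
print("best error in sweep:", min(err_sweep))
```

In two dimensions the error along v_1 equals the square of the second singular value, which is what remains after the first concept is removed.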

$A = U \Sigma V^T$ example (same decomposition as above):
• v_1, the first right singular vector (first row of V^T: 0.58 0.58 0.58 0 0), is the projection axis
• σ_1 = 9.64 measures the variance ('spread') on the v_1 axis
• $U \Sigma$ gives the coordinates of the points in the projection axis
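That U Σ holds the coordinates follows from A V = U Σ V^T V = U Σ. A minimal sketch, assuming NumPy and the ratings matrix from the example (the column order does not matter here):

```python
import numpy as np

# Ratings matrix from the example: 7 users x 5 movies.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
coords = U[:, :k] * s[:k]       # U Sigma: one row of concept coordinates per user
projected = A @ Vt[:k, :].T     # A V: each user's ratings projected onto v_1, v_2

print(np.allclose(coords, projected))
print(np.round(np.abs(coords), 2))
```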

More details
• Q: How exactly is dim. reduction done?
• A: Set the smallest singular values to zero

In the example, setting σ_2 = 5.29 to zero leaves only the first column of U, σ_1 = 9.64, and the first row of V^T:

    0.18
    0.36
    0.18
    0.90    x    9.64    x    0.58  0.58  0.58  0  0
    0
    0
    0

The product B approximates A:

A =                B =
1 1 1 0 0          1 1 1 0 0
2 2 2 0 0          2 2 2 0 0
1 1 1 0 0          1 1 1 0 0
5 5 5 0 0          5 5 5 0 0
0 0 0 2 2          0 0 0 0 0
0 0 0 3 3          0 0 0 0 0
0 0 0 1 1          0 0 0 0 0

Frobenius norm: $\|M\|_F = \sqrt{\sum_{ij} M_{ij}^2}$
$\|A - B\|_F = \sqrt{\sum_{ij} (A_{ij} - B_{ij})^2}$ is "small"
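Zeroing the smallest singular value can be sketched on the example ratings as follows, assuming NumPy. The variable names (`s_trunc`, `B`, `error`) are choices made for this sketch.

```python
import numpy as np

# Ratings matrix from the example: 7 users x 5 movies.
A = np.array([
    [1, 1, 1, 0, 0],
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 1                      # number of concepts to keep
s_trunc = s.copy()
s_trunc[k:] = 0.0          # set the smallest singular values to zero

B = U @ np.diag(s_trunc) @ Vt
error = np.linalg.norm(A - B, ord="fro")

print(np.round(B, 2))
print("||A - B||_F =", round(float(error), 2))
```

Here the error is the square root of 28, about 5.29, which is exactly the singular value that was dropped.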

Theorem: Let $A = U \Sigma V^T$ ($\sigma_1 \geq \sigma_2 \geq \dots$, rank(A) = r). Then $B = U S V^T$,
• where S is a diagonal matrix of the same size as Σ with $s_i = \sigma_i$ for $i = 1 \dots k$ and $s_i = 0$ otherwise,
is a best rank-k approximation to A:
• B is a solution to $\min_B \|A - B\|_F$ where rank(B) = k

We will need 2 facts:
• $\|M\|_F^2 = \sum_i q_{ii}^2$, where $M = P Q R$ is the SVD of M
  (we apply: P column orthonormal, R row orthonormal, Q diagonal)
• $U \Sigma V^T - U S V^T = U (\Sigma - S) V^T$
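The two facts, and the error formula they lead to, can be checked numerically. This is a spot check, not a proof. It assumes NumPy, a random matrix with a fixed seed, and k = 2 (all arbitrary choices). The comparison matrices are projections of A onto random k-dimensional column spaces, so each has rank at most k.

```python
import numpy as np

rng = np.random.default_rng(1)          # fixed seed, arbitrary choice
A = rng.normal(size=(8, 5))             # hypothetical data matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Fact 1: squared Frobenius norm equals the sum of squared singular values.
fact1 = bool(np.isclose(np.sum(A ** 2), np.sum(s ** 2)))

# Fact 2: U Sigma V^T - U S V^T = U (Sigma - S) V^T, with S keeping the top k.
k = 2
Sigma = np.diag(s)
S = np.diag(np.where(np.arange(len(s)) < k, s, 0.0))
B = U @ S @ Vt
fact2 = bool(np.allclose(A - B, U @ (Sigma - S) @ Vt))

# Consequence: the error of B is carried by the dropped singular values.
err_B = np.linalg.norm(A - B, "fro")
err_expected = np.sqrt(np.sum(s[k:] ** 2))

# Other rank-<=k candidates: project A onto random k-dimensional subspaces.
err_others = []
for _ in range(200):
    Q, _ = np.linalg.qr(rng.normal(size=(8, k)))   # orthonormal basis, 8 x k
    C = Q @ (Q.T @ A)
    err_others.append(np.linalg.norm(A - C, "fro"))

print(fact1, fact2, np.isclose(err_B, err_expected), err_B <= min(err_others))
```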
