CS345a: Data Mining (Jure Leskovec and Anand Rajaraman, Stanford University)


  1. CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University.

  2. Homework 2 is out: due Monday the 15th at midnight; submit PDFs. Talk: Yehuda Koren, winner of the Netflix challenge! http://rain.stanford.edu, Wed at 12:30 in Terman 453.

  3. Text - LSI: find 'concepts'.

  4. Compress / reduce dimensionality: 10^6 rows, 10^3 columns, no updates; random access to any cell(s); a small error is OK.

  5. (Figure slide; no extractable text.)

  6. (Figure slide; no extractable text.)

  7. SVD: A [n x m] = U [n x r] Σ [r x r] (V [m x r])^T
     - A: n x m matrix (e.g., n documents, m terms)
     - U: n x r matrix (n documents, r concepts)
     - Σ: r x r diagonal matrix (strength of each 'concept'; r: rank of the matrix A)
     - V: m x r matrix (m terms, r concepts)
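
A minimal numpy sketch of this definition (illustrative, not part of the slides; numpy's economy-size SVD returns r = min(n, m) components, with trailing zero singular values if A has lower rank):

    import numpy as np

    n, m = 7, 5
    A = np.random.rand(n, m)

    # Economy-size SVD: U is n x r, S holds the r singular values, Vt is r x m
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    print(U.shape, S.shape, Vt.shape)              # (7, 5) (5,) (5, 5)
    print(np.allclose(A, U @ np.diag(S) @ Vt))     # True: A = U Sigma V^T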

  8. (Diagram: the n x m matrix A factored into U (n x r) times Σ (r x r) times V^T (r x m).)

  9. Equivalently, A is a sum of rank-1 matrices, one per concept: A = σ1 u1 v1^T + σ2 u2 v2^T + ...

  10. THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Σ V^T, where:
      - U, Σ, V: unique
      - U, V: column orthonormal: U^T U = I; V^T V = I (I: identity matrix); columns are orthogonal unit vectors
      - Σ: diagonal; its entries (the singular values) are positive and sorted in decreasing order
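
A short sketch (illustrative) checking the stated properties on a random matrix:

    import numpy as np

    A = np.random.rand(7, 5)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Column-orthonormal factors and sorted, non-negative singular values
    print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # U^T U = I
    print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # V^T V = I
    print(np.all(S[:-1] >= S[1:]), np.all(S >= 0))      # decreasing, >= 0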

  11. A = U Σ V^T, example. Columns (terms): data, inf., retrieval, brain, lung; rows: 4 CS documents followed by 3 MD documents.

      A = 1 1 1 0 0     U = 0.18 0        Σ = 9.64 0       V^T = 0.58 0.58 0.58 0    0
          2 2 2 0 0         0.36 0            0    5.29          0    0    0    0.71 0.71
          1 1 1 0 0         0.18 0
          5 5 5 0 0         0.90 0
          0 0 0 2 2         0    0.53
          0 0 0 3 3         0    0.80
          0 0 0 1 1         0    0.27
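
This decomposition can be reproduced with numpy (a sketch; the signs of a singular-vector pair may come out flipped, which is an equivalent SVD):

    import numpy as np

    # Rows: 4 CS documents, 3 MD documents
    # Columns: data, inf., retrieval, brain, lung
    A = np.array([[1, 1, 1, 0, 0],
                  [2, 2, 2, 0, 0],
                  [1, 1, 1, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2],
                  [0, 0, 0, 3, 3],
                  [0, 0, 0, 1, 1]], dtype=float)

    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    print(np.round(S[:2], 2))        # [9.64 5.29]
    print(np.round(U[:, :2], 2))     # ~ [0.18 0.36 0.18 0.90 0 0 0] and [0 0 0 0 0.53 0.80 0.27]
    print(np.round(Vt[:2], 2))       # ~ [0.58 0.58 0.58 0 0] and [0 0 0 0.71 0.71]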

  12. (Same example; the first column of U and first row of V^T form the 'CS-concept', the second column/row the 'MD-concept'.)

  13. (Same example; U is the document-to-concept similarity matrix.)

  14. (Same example; the diagonal of Σ gives the 'strength' of each concept: 9.64 for the CS-concept, 5.29 for the MD-concept.)

  15. (Same example; V, shown here as V^T, is the term-to-concept similarity matrix.)

  16. (Same example; term-to-concept similarity matrix, continued.)

  17. 'Documents', 'terms' and 'concepts':
      - U: document-to-concept similarity matrix
      - V: term-to-concept similarity matrix
      - Σ: its diagonal elements give the 'strength' of each concept

  18. SVD gives the best axis to project on: the first singular vector v1, where 'best' means minimal sum of squared projection errors (minimum RMS error).
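
A small sketch of what 'best axis' means, using an arbitrary random point set (not from the slides): among all single projection directions, v1 minimizes the total squared projection error:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 5))        # rows = points

    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    v1 = Vt[0]                           # first right singular vector

    def sq_proj_error(A, w):
        # Sum of squared distances from the rows of A to their projections on span(w)
        w = w / np.linalg.norm(w)
        return np.sum((A - np.outer(A @ w, w)) ** 2)

    print(sq_proj_error(A, v1))                      # the minimum achievable
    print(sq_proj_error(A, rng.normal(size=5)))      # any other axis does no better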

  19. (Same example; v1, the first right singular vector, is the best axis to project the document points on.)

  20. (Same example; the first singular value measures the variance ('spread') of the points along the v1 axis.)

  21. (Same example; U Σ gives the coordinates of the points along the projection axes.)
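
In code (a sketch), the coordinates U Σ are the same thing as projecting the rows of A onto the concept axes V:

    import numpy as np

    A = np.array([[1, 1, 1, 0, 0],
                  [2, 2, 2, 0, 0],
                  [1, 1, 1, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2],
                  [0, 0, 0, 3, 3],
                  [0, 0, 0, 1, 1]], dtype=float)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    coords = U * S                          # same as U @ np.diag(S)
    print(np.allclose(coords, A @ Vt.T))    # True: U Sigma = A V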

  22. More details. Q: how exactly is the dimensionality reduction done?

  23. More details. Q: how exactly is the dimensionality reduction done? A: set the smallest singular values to zero.
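
A minimal sketch of that reduction step: keep the k largest singular values, zero out the rest, and rebuild A, giving the best rank-k approximation in the least-squares sense:

    import numpy as np

    def truncated_svd(A, k):
        # Zero out all but the k largest singular values and reconstruct
        U, S, Vt = np.linalg.svd(A, full_matrices=False)
        S = S.copy()
        S[k:] = 0.0
        return U @ np.diag(S) @ Vt

    A = np.array([[1, 1, 1, 0, 0],
                  [2, 2, 2, 0, 0],
                  [1, 1, 1, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2],
                  [0, 0, 0, 3, 3],
                  [0, 0, 0, 1, 1]], dtype=float)
    A1 = truncated_svd(A, k=1)      # the rank-1 approximation used on the next slides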

  24. (Same example with the smaller singular value set to zero: Σ becomes diag(9.64, 0), so A is now only approximated, A ~ U Σ V^T.)

  25. (Same as slide 24; with σ2 = 0, the second column of U, the second row/column of Σ, and the second row of V^T no longer contribute and can be dropped.)

  26. Dropping the zeroed concept leaves a rank-1 factorization:

            0.18
            0.36
            0.18
      A  ~  0.90   x  [ 9.64 ]  x  [ 0.58  0.58  0.58  0  0 ]
            0
            0
            0

  27. The resulting rank-1 approximation of A:

      1 1 1 0 0        1 1 1 0 0
      2 2 2 0 0        2 2 2 0 0
      1 1 1 0 0        1 1 1 0 0
      5 5 5 0 0   ~    5 5 5 0 0
      0 0 0 2 2        0 0 0 0 0
      0 0 0 3 3        0 0 0 0 0
      0 0 0 1 1        0 0 0 0 0
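
A quick numerical check of this approximation (a sketch): rebuilding A from the strongest concept alone reproduces the CS rows and zeroes out the MD rows, and because this A has rank 2 the error left behind is exactly the dropped singular value:

    import numpy as np

    A = np.array([[1, 1, 1, 0, 0],
                  [2, 2, 2, 0, 0],
                  [1, 1, 1, 0, 0],
                  [5, 5, 5, 0, 0],
                  [0, 0, 0, 2, 2],
                  [0, 0, 0, 3, 3],
                  [0, 0, 0, 1, 1]], dtype=float)
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    A1 = S[0] * np.outer(U[:, 0], Vt[0])           # keep only the strongest concept
    print(np.round(A1, 2))                         # CS rows unchanged, MD rows ~ 0
    print(np.linalg.norm(A - A1, 'fro'), S[1])     # both ~ 5.29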
