CS345a: Data Mining
Jure Leskovec and Anand Rajaraman, Stanford University

Announcements:
Homework 2 is out: due Monday the 15th at midnight. Submit PDFs.
Talk: Yehuda Koren, winner of the Netflix Prize. Wednesday at 12:30 in Terman 453 (http://rain.stanford.edu).
2/9/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 2
Text ‐ LSI: find ‘concepts’
Compress / reduce dimensionality
SVD definition: A = U Σ V^T, where
A: n x m matrix (e.g., n documents, m terms)
U: n x r matrix (n documents, r concepts)
Σ: r x r diagonal matrix (strength of each 'concept'; r is the rank of A)
V: m x r matrix (m terms, r concepts)
[Figure: A (n x m) = U (n x r) x Σ (r x r) x V^T (r x m)]
[Figure: equivalently, A = σ1 u1 v1^T + σ2 u2 v2^T + ..., a sum of rank-1 matrices]
THEOREM [Press+92]: it is always possible to decompose a matrix A into A = U Σ V^T, where:
U, Σ, V: unique
U, V: column-orthonormal (U^T U = I and V^T V = I, with I the identity matrix)
Σ: diagonal, with entries (the singular values) non-negative and sorted in decreasing order (σ1 ≥ σ2 ≥ ... ≥ 0)
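As a quick sanity check, the theorem's properties can be verified numerically. A minimal sketch with NumPy; the small matrix here is made up purely for illustration:

```python
import numpy as np

# A small made-up 4 x 3 matrix, just for illustration.
A = np.array([[1., 1., 1.],
              [2., 2., 2.],
              [0., 0., 3.],
              [0., 0., 1.]])

# Thin SVD: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# U and V are column-orthonormal: U^T U = I and V^T V = I.
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))

# Singular values are non-negative and sorted in decreasing order.
assert np.all(s >= 0) and np.all(s[:-1] >= s[1:])

# The factors reconstruct A (up to floating-point error).
assert np.allclose(U @ np.diag(s) @ Vt, A)
```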
A = U Σ V^T, example:
Rows of A: seven documents, the first four from CS, the last three from MD. Columns of A: the terms data, inf., retrieval, brain, lung.

          data inf. retrieval brain lung
   CS   [  1    1      1        0     0 ]        [ 0.18  0    ]
   CS   [  2    2      2        0     0 ]        [ 0.36  0    ]
   CS   [  1    1      1        0     0 ]        [ 0.18  0    ]     [ 9.64  0    ]     [ 0.58  0.58  0.58  0     0    ]
   CS   [  5    5      5        0     0 ]    =   [ 0.90  0    ]  x  [ 0     5.29 ]  x  [ 0     0     0     0.71  0.71 ]
   MD   [  0    0      0        2     2 ]        [ 0     0.53 ]
   MD   [  0    0      0        3     3 ]        [ 0     0.80 ]
   MD   [  0    0      0        1     1 ]        [ 0     0.27 ]

Reading the factors:
U is the doc-to-concept similarity matrix. Its first column is the 'CS-concept', its second the 'MD-concept': the CS documents load on the first, the MD documents on the second.
Σ holds the 'strength' of each concept: 9.64 for the CS-concept, 5.29 for the MD-concept.
V^T is the term-to-concept similarity matrix: data, inf., and retrieval belong to the CS-concept (0.58 each); brain and lung belong to the MD-concept (0.71 each).
To summarize:
U: document-to-concept similarity matrix
V: term-to-concept similarity matrix
Σ: its diagonal elements give the 'strength' of each concept
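The example above can be reproduced numerically. A sketch with NumPy (sign flips in U and V are possible, since the decomposition is unique only up to simultaneous sign changes, so we compare absolute values):

```python
import numpy as np

# The document-term matrix from the example:
# rows = 7 documents (4 CS, 3 MD); columns = data, inf., retrieval, brain, lung.
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A has rank 2: the two non-zero singular values are the concept strengths.
assert np.allclose(np.round(s[:2], 2), [9.64, 5.29])
assert s[2] < 1e-8

# First column of U: the CS-concept loadings of the documents.
assert np.allclose(np.abs(U[:, 0]), [0.18, 0.36, 0.18, 0.90, 0, 0, 0], atol=0.01)
# First row of V^T: the CS-concept loadings of the terms.
assert np.allclose(np.abs(Vt[0]), [0.58, 0.58, 0.58, 0, 0], atol=0.01)
```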
SVD gives the best axis to project on: the first singular vector v1.
('Best' = minimizes the sum of squared projection errors, i.e., minimum RMS error.)
[Figure: 2-d point cloud with the first singular vector v1 as the projection axis]
In the running example, v1 = [0.58 0.58 0.58 0 0] is this best projection axis: it points in the direction of maximal variance ('spread') of the document points, and the entries of U Σ give the coordinates of the points on the projection axes.
Q: how exactly is dimensionality reduction done?
A: set the smallest singular values to zero. In the example, that means zeroing σ2 = 5.29 and keeping only σ1 = 9.64.
Zeroing σ2 drops the second column of U and the second row of V^T, leaving a rank-1 decomposition:

  [ 1 1 1 0 0 ]        [ 0.18 ]
  [ 2 2 2 0 0 ]        [ 0.36 ]
  [ 1 1 1 0 0 ]        [ 0.18 ]
  [ 5 5 5 0 0 ]    ~   [ 0.90 ]  x  [ 9.64 ]  x  [ 0.58  0.58  0.58  0  0 ]
  [ 0 0 0 2 2 ]        [ 0    ]
  [ 0 0 0 3 3 ]        [ 0    ]
  [ 0 0 0 1 1 ]        [ 0    ]
Multiplying out gives the rank-1 approximation: the CS block is reproduced exactly, while the MD rows collapse to zero:

  [ 1 1 1 0 0 ]        [ 1 1 1 0 0 ]
  [ 2 2 2 0 0 ]        [ 2 2 2 0 0 ]
  [ 1 1 1 0 0 ]        [ 1 1 1 0 0 ]
  [ 5 5 5 0 0 ]    ~   [ 5 5 5 0 0 ]
  [ 0 0 0 2 2 ]        [ 0 0 0 0 0 ]
  [ 0 0 0 3 3 ]        [ 0 0 0 0 0 ]
  [ 0 0 0 1 1 ]        [ 0 0 0 0 0 ]
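The truncation step is one line in code. A sketch with NumPy (the example matrix is restated so the snippet is self-contained):

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the largest singular value (set the rest to zero):
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The CS block is reproduced exactly; the MD rows become all zeros.
assert np.allclose(A_k[:4], A[:4])
assert np.allclose(A_k[4:], 0)
```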
Equivalently, write the decomposition with the singular vectors as columns (the 'spectral' view of the example):

  A = [ u1 u2 ]  x  [ σ1  0  ]  x  [ v1^T ]
                    [ 0   σ2 ]     [ v2^T ]
Written as a sum of at most r rank-1 terms, where each u_i is an n x 1 column vector and each v_i^T is a 1 x m row vector:

  A = σ1 u1 v1^T + σ2 u2 v2^T + ...

Assume σ1 ≥ σ2 ≥ ... Since every u_i and v_i has unit length, the σ_i set the size of each term, so zeroing the smallest singular values discards the smallest terms of the sum.
Q: how many σ's should we keep?
A: rule of thumb: keep enough to retain 80-90% of the 'energy' (= Σ_i σ_i^2).
Computing the SVD costs O(n m^2) or O(n^2 m), whichever is less.
But: it is cheaper if we only need the singular values, or only the first k singular vectors, or if the matrix is sparse.
Implemented in standard linear-algebra packages: SPlus, Mathematica, ...
SVD conclusions so far:
SVD: A = U Σ V^T: unique (*)
U: document-to-concept similarities
V: term-to-concept similarities
Σ: strength of each concept
Dimensionality reduction: keep the first few strongest singular values (80-90% of the 'energy').
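The 80-90% energy rule of thumb translates directly into code. A sketch (the helper name rank_for_energy is ours, not from the lecture):

```python
import numpy as np

def rank_for_energy(s, frac=0.9):
    """Smallest k such that the top-k singular values retain at least
    `frac` of the total energy sum_i s_i^2."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(energy, frac) + 1)

# Singular values from the running example:
s = np.array([9.64, 5.29, 0.0, 0.0, 0.0])
print(rank_for_energy(s, 0.90))   # 2  (one concept alone holds only ~77% of the energy)
print(rank_for_energy(s, 0.70))   # 1
```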
SVD gives us: A = U Σ V^T.
Compare with the eigen-decomposition of a symmetric matrix: B = X Λ X^T.
What is A A^T? What is A^T A?
A A^T = U Σ V^T V Σ U^T = U Σ^2 U^T
A^T A = V Σ U^T U Σ V^T = V Σ^2 V^T
(A^T A)^k = V Σ^2k V^T
(A^T A)^k ~ σ1^2k v1 v1^T for large k, since the first term dominates the sum
(A^T A)^k x ~ (constant) v1, for (almost) any starting vector x
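This observation is exactly the power method: repeatedly multiplying a starting vector by A^T A and normalizing converges to v1. A sketch on the running example:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

rng = np.random.default_rng(0)
x = rng.standard_normal(A.shape[1])   # (almost) any starting vector works

# Each step multiplies by A^T A; normalizing avoids overflow/underflow.
for _ in range(50):
    x = A.T @ (A @ x)
    x /= np.linalg.norm(x)

# x converges to +/- v1, the top right singular vector [0.58 0.58 0.58 0 0].
assert np.allclose(np.abs(x), [0.58, 0.58, 0.58, 0, 0], atol=0.01)
```

Convergence is fast here because the other components shrink by (σ2/σ1)^2 ≈ 0.30 per step.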
Case study: how do we answer queries with LSI?
Example: in the term-document setting above (terms: data, inf., retrieval, brain, lung), find documents related to the term 'data'.
Take the one-term query 'data', written as a vector over the terms (data, inf., retrieval, brain, lung):

  q = [1 0 0 0 0]

Q: how do we map the query (or any new document) into the same concept space?
A: take the inner product (cosine similarity) of q with each 'concept' vector v_i; the first coordinate is q . v1.
[Figure: q in term space (term1/term2 axes) with the concept directions v1 and v2; its projection on v1 is q . v1]
Concretely, map the query into concept space with the term-to-concept similarity matrix V:

                                  [ 0.58  0    ]
                                  [ 0.58  0    ]
  q_concept = q V = [1 0 0 0 0] x [ 0.58  0    ] = [ 0.58  0 ]
                                  [ 0     0.71 ]
                                  [ 0     0.71 ]

The query 'data' relates to the CS-concept with strength 0.58, and not at all to the MD-concept.
The same mapping works for documents, e.g. d = [0 1 1 0 0] (one occurrence each of 'inf.' and 'retrieval'):

  d_concept = d V = [ 1.16  0 ]

so d relates to the CS-concept with strength 1.16.
Observation: q = [1 0 0 0 0] and d = [0 1 1 0 0] share no terms at all, yet their concept-space images q_concept = [0.58 0] and d_concept = [1.16 0] point in exactly the same direction. LSI therefore judges them highly similar, which is the point of retrieving by concept rather than by term.
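The whole query pipeline fits in a few lines. A sketch with the V matrix from the example:

```python
import numpy as np

# Term-to-concept similarity matrix V from the example
# (terms: data, inf., retrieval, brain, lung; concepts: CS, MD).
V = np.array([[0.58, 0.00],
              [0.58, 0.00],
              [0.58, 0.00],
              [0.00, 0.71],
              [0.00, 0.71]])

q = np.array([1, 0, 0, 0, 0], dtype=float)   # query: 'data'
d = np.array([0, 1, 1, 0, 0], dtype=float)   # document: 'inf.', 'retrieval'

q_concept = q @ V   # [0.58 0]
d_concept = d @ V   # [1.16 0]

# No shared terms, yet perfectly aligned in concept space:
cos = q_concept @ d_concept / (np.linalg.norm(q_concept) * np.linalg.norm(d_concept))
print(round(cos, 2))   # 1.0
```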
SVD, summing up:
+ Optimal low-rank approximation (in Frobenius norm).
- Interpretability problem: a singular vector is a linear combination of all input columns or rows.
- Sparsity problem: singular vectors are dense even when A is sparse.
[Figure: A = U Σ V^T with dense factors]
CUR decomposition. Goal: express A as a product of three matrices C, U, R so that ||A - C U R|| is small, under the 'constraints' that C contains actual columns of A and R contains actual rows of A; U is the pseudo-inverse of the intersection of C and R.
Error guarantee, informally: with probability at least 1 - δ, by picking enough columns and rows, ||A - C U R||_F comes within an additive ε ||A||_F of the error of the best rank-k approximation.
Sampling columns (similarly for rows): pick columns at random with probability proportional to their squared Euclidean norms, rescaling the picked columns so that the approximation is unbiased.
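One common variant of this sampling step can be sketched as follows (the rescaling by 1/sqrt(c p_i) keeps the sampled product unbiased; exact details vary between papers, so treat this as an assumption-laden sketch):

```python
import numpy as np

def sample_columns(A, c, rng):
    # Probability of each column: proportional to its squared Euclidean norm.
    p = np.sum(A ** 2, axis=0)
    p = p / p.sum()
    # Draw c columns i.i.d. (duplicates allowed) and rescale each pick.
    idx = rng.choice(A.shape[1], size=c, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])
    return C, idx

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

rng = np.random.default_rng(0)
C, idx = sample_columns(A, c=3, rng=rng)
print(C.shape)   # (7, 3)
```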
Let W be the 'intersection' of the sampled columns and rows, i.e., the submatrix of A at the chosen row and column indices. Then:
U = W^+, the pseudoinverse of W: compute the SVD W = X Z Y^T and invert its non-zero singular values, (Z^+)_ii = 1/Z_ii; this is the Moore-Penrose pseudoinverse.
[Figure: A ~ C x U x R, with U = W^+]
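Putting the pieces together, a minimal CUR sketch. Here the column and row indices are fixed by hand so the arithmetic is easy to follow; a real implementation would sample them by squared norm as described above:

```python
import numpy as np

A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

col_idx = [0, 3]   # one 'CS' column (data) and one 'MD' column (brain)
row_idx = [3, 5]   # one CS document and one MD document

C = A[:, col_idx]                  # actual columns of A
R = A[row_idx, :]                  # actual rows of A
W = A[np.ix_(row_idx, col_idx)]    # the 'intersection' of C and R
U = np.linalg.pinv(W)              # Moore-Penrose pseudoinverse

A_cur = C @ U @ R

# A has rank 2 and the chosen columns/rows span its column/row spaces,
# so in this toy case CUR reconstructs A exactly.
assert np.allclose(A_cur, A)
```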
CUR pros and cons:
+ Easy interpretation: the basis is made of actual columns and rows of A (while a singular vector blends all of them).
+ Sparse basis: an actual column or row is sparse when A is; a singular vector is dense.
- Duplicates: columns (and rows) of large norm will be sampled many times.
If we want to get rid of the duplicates: keep a single copy of each repeated column/row and scale it by the square root of the number of duplicates.
[Figure: A ~ Cd x U x Rd with duplicates, reduced to scaled Cs and Rs; then construct a small U]
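The scaling trick can be sketched as below (dedupe_scale is a hypothetical helper, not from the lecture). Scaling by sqrt(t) preserves C C^T, since t copies of a column c contribute t c c^T = (sqrt(t) c)(sqrt(t) c)^T:

```python
import numpy as np

def dedupe_scale(C):
    # Keep one copy of each duplicated column, scaled by the square root
    # of its multiplicity. This preserves the product C @ C.T.
    cols, counts = np.unique(C, axis=1, return_counts=True)
    return cols * np.sqrt(counts)

# Column [1, 0] was sampled twice, column [2, 3] once:
C = np.array([[1., 1., 2.],
              [0., 0., 3.]])
Cs = dedupe_scale(C)

assert Cs.shape[1] == 2                   # duplicates removed
assert np.allclose(Cs @ Cs.T, C @ C.T)    # Gram matrix unchanged
```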
SVD vs. CUR at a glance:
SVD: A (huge but sparse) = U (big and dense) x Σ (dense but small) x V^T (big and dense)
CUR: A (huge but sparse) = C (big but sparse) x U (dense but small) x R (big but sparse)
Further reading:
Drineas et al.: Fast Monte Carlo Algorithms for Matrices III: Computing a Compressed Approximate Matrix Decomposition. SIAM Journal on Computing, 2006.
J. Sun, Y. Xie, H. Zhang, C. Faloutsos: Less is More: Compact Matrix Decomposition for Large Sparse Graphs. SDM 2007.
P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, et al.: Intra- and interpopulation genotype reconstruction from tagging SNPs (2007).
M. W. Mahoney, M. Maggioni, P. Drineas: Tensor-CUR Decompositions for Tensor-Based Data. Proc. 12th Annual SIGKDD, 327-336 (2006).
Acknowledgement: slides borrowed from Jimeng Sun and Christos Faloutsos.