CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

- High dim. data: Locality sensitive hashing; Clustering; Dimensionality reduction
- Graph data: PageRank, SimRank; Community Detection; Spam Detection
- Infinite data: Filtering data streams; Web advertising; Queries on streams
- Machine learning: SVM; Decision Trees; Perceptron, kNN
- Apps: Recommender systems; Association Rules; Duplicate document detection

2/5/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets

Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]

Connections between political blogs: polarization of the network [Adamic-Glance, 2005]

Citation networks and maps of science [Börner et al., 2012]

[Figure: Internet: a router connecting domain1, domain2, and domain3]

Seven Bridges of Königsberg [Euler, 1735]: Return to the starting point by traveling each link of the graph once and only once.

Web as a directed graph:
- Nodes: Webpages
- Edges: Hyperlinks

[Figure: four example pages connected by hyperlinks: "I teach a class on Networks." / "CS224W: Classes are in the Gates building" / "Computer Science Department at Stanford" / "Stanford University"]

How to organize the Web?
- First try: Human-curated Web directories
  - Yahoo, DMOZ, LookSmart
- Second try: Web Search
  - Information Retrieval investigates: find relevant docs in a small and trusted set
    - Newspaper articles, patents, etc.
  - But: the Web is huge, full of untrusted documents, random things, web spam, etc.

2 challenges of web search:
- (1) The Web contains many sources of information. Who to "trust"?
  - Trick: Trustworthy pages may point to each other!
- (2) What is the "best" answer to the query "newspaper"?
  - No single right answer
  - Trick: Pages that actually know about newspapers might all be pointing to many newspapers

- Not all web pages are equally "important": www.joe-schmoe.com vs. www.stanford.edu
- There is large diversity in the web-graph node connectivity. Let's rank the pages by the link structure!

We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
- PageRank
- Hubs and Authorities (HITS)
- Topic-Specific (Personalized) PageRank
- Web Spam Detection Algorithms

Idea: Links as votes
- A page is more important if it has more links
  - In-coming links? Out-going links?
- Think of in-links as votes:
  - www.stanford.edu has 23,400 in-links
  - www.joe-schmoe.com has 1 in-link
- Are all in-links equal?
  - Links from important pages count more
  - Recursive question!

- Each link's vote is proportional to the importance of its source page
- If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
- Page $j$'s own importance is the sum of the votes on its in-links

[Figure: page $i$ (3 out-links) and page $k$ (4 out-links) both link to page $j$, contributing $r_i/3$ and $r_k/4$; page $j$ has 3 out-links, each carrying $r_j/3$.]

$$r_j = \frac{r_i}{3} + \frac{r_k}{4}$$
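The voting rule can be sketched in a few lines. The rank values below are made up for illustration (the slide gives none); only the out-degrees 3 and 4 come from the figure.

```python
from fractions import Fraction

# Hypothetical ranks for the two pages that link to j (not from the slide).
rank = {"i": Fraction(3, 10), "k": Fraction(4, 10)}
# Out-degrees as drawn in the figure: i has 3 out-links, k has 4.
out_degree = {"i": 3, "k": 4}
in_links_of_j = ["i", "k"]

# Each in-link carries rank[source] / out_degree[source] votes,
# and r_j is the sum of the votes on j's in-links.
r_j = sum(rank[src] / out_degree[src] for src in in_links_of_j)
print(r_j)  # 1/5
```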

- A "vote" from an important page is worth more
- A page is important if it is pointed to by other important pages
- Define a "rank" $r_j$ for page $j$:

$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$

where $d_i$ is the out-degree of node $i$.

[Figure: "The web in 1839": pages y, a, m. y links to itself and to a (each link carries y/2); a links to y and to m (each link carries a/2); m links to a (carrying m).]

"Flow" equations:

$$r_y = r_y/2 + r_a/2$$
$$r_a = r_y/2 + r_m$$
$$r_m = r_a/2$$

Flow equations:

$$r_y = r_y/2 + r_a/2$$
$$r_a = r_y/2 + r_m$$
$$r_m = r_a/2$$

- 3 equations, 3 unknowns, no constants
  - No unique solution
  - All solutions equivalent modulo the scale factor
- Additional constraint forces uniqueness:
  - $r_y + r_a + r_m = 1$
  - Solution: $r_y = \frac{2}{5}$, $r_a = \frac{2}{5}$, $r_m = \frac{1}{5}$
- Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
- We need a new formulation!
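For a graph this small, the constrained system can be solved exactly. The sketch below drops the third flow equation, which is redundant given the other two, in favor of the normalization constraint. It then runs a hand-written Gauss-Jordan elimination over exact fractions. The helper `solve` is a hypothetical name and assumes the system has a unique solution.

```python
from fractions import Fraction as F

def solve(A, b):
    """Gauss-Jordan elimination with exact fractions; assumes a unique solution."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Unknowns are ordered (r_y, r_a, r_m).
A = [
    [F(1, 2), F(-1, 2), F(0)],  # r_y = r_y/2 + r_a/2, moved to one side
    [F(-1, 2), F(1), F(-1)],    # r_a = r_y/2 + r_m, moved to one side
    [F(1), F(1), F(1)],         # r_y + r_a + r_m = 1
]
b = [F(0), F(0), F(1)]

r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

Elimination on a dense system costs on the order of N^3 operations, which is why it stops being practical at web scale.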

Stochastic adjacency matrix $M$
- Let page $i$ have $d_i$ out-links
- If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
- $M$ is a column stochastic matrix
  - Columns sum to 1

Rank vector $r$: vector with an entry per page
- $r_i$ is the importance score of page $i$
- $\sum_i r_i = 1$

The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written

$$r = M \cdot r$$
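A minimal sketch of building $M$ from out-link lists, using the y/a/m graph implied by the flow equations. The function name `column_stochastic` is hypothetical, and the sketch assumes every page has at least one out-link and no duplicate links.

```python
from fractions import Fraction

# Out-links of the y/a/m example: y -> {y, a}, a -> {y, m}, m -> {a}.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def column_stochastic(out_links):
    """Return (nodes, M) with M[j][i] = 1/d_i if i links to j, else 0."""
    nodes = list(out_links)
    index = {node: pos for pos, node in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in out_links.items():
        d_i = len(targets)  # out-degree of page i
        for j in targets:
            M[index[j]][index[i]] = Fraction(1, d_i)
    return nodes, M

nodes, M = column_stochastic(out_links)
for node, row in zip(nodes, M):
    print(node, [str(x) for x in row])

# Every column sums to 1, so M is column stochastic.
print([sum(M[j][i] for j in range(len(nodes))) == 1 for i in range(len(nodes))])
```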

- Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
- Flow equation in matrix form: $M \cdot r = r$
- Suppose page $i$ links to 3 pages, including $j$

[Figure: in $M \cdot r = r$, column $i$ of $M$ has the entry $1/3$ in row $j$, so $r_i$ contributes $r_i/3$ to $r_j$.]

- The flow equations can be written $r = M \cdot r$
- So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
  - In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
  - The largest eigenvalue of $M$ is 1 since $M$ is column stochastic
    - We know $r$ is unit length and each column of $M$ sums to one, so $\|M r\|_1 \le 1$
- NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $A x = \lambda x$
- We can now efficiently solve for $r$! The method is called power iteration

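$$M = \begin{array}{c|ccc} & y & a & m \\ \hline y & 1/2 & 1/2 & 0 \\ a & 1/2 & 0 & 1 \\ m & 0 & 1/2 & 0 \end{array}$$

$r = M \cdot r$:

$$\begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix} = \begin{bmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{bmatrix} \begin{bmatrix} r_y \\ r_a \\ r_m \end{bmatrix}$$

$$r_y = r_y/2 + r_a/2$$
$$r_a = r_y/2 + r_m$$
$$r_m = r_a/2$$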
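As a quick check of the matrix form, the sketch below confirms that the columns of $M$ sum to 1 and that the vector $(2/5, 2/5, 1/5)$ is left unchanged by $M$, which makes it an eigenvector with eigenvalue 1. The helper `matvec` is a hypothetical name.

```python
from fractions import Fraction as F

# Rows and columns are ordered y, a, m.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]
r = [F(2, 5), F(2, 5), F(1, 5)]

def matvec(M, v):
    """Plain matrix-vector product M * v."""
    return [sum(m_ji * v_i for m_ji, v_i in zip(row, v)) for row in M]

column_sums = [sum(row[c] for row in M) for c in range(3)]
print(column_sums == [1, 1, 1])  # True
print(matvec(M, r) == r)         # True
```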

- Given a web graph with $N$ nodes, where the nodes are pages and edges are hyperlinks
- Power iteration: a simple iterative scheme
  - Initialize: $r^{(0)} = [1/N, \dots, 1/N]^T$
  - Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node $i$
  - Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
- $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the $L_1$ norm
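The scheme above can be sketched with plain Python lists. The tolerance `eps` and the safety cap `max_iter` are assumptions added here; the slide only specifies the $L_1$ stopping rule.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n  # r^(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        # r_next[j] = sum over i of M[j][i] * r[i]
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        l1_change = sum(abs(a - b) for a, b in zip(r_next, r))
        r = r_next
        if l1_change < eps:
            break
    return r

# Column-stochastic matrix of the y/a/m example (rows and columns: y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
ranks = power_iteration(M)
print([round(x, 4) for x in ranks])  # [0.4, 0.4, 0.2]
```

Each step is one matrix-vector product, so with a sparse representation of $M$ its cost grows with the number of links rather than with $N^2$.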

y a m  Power Iteration: y y ½ ½ 0  Set 𝑠 𝑘 = 1 /N a ½ 0 1 a m 𝑠 𝑗 m 0 ½ 0  1: 𝑠′ 𝑘 = 𝑗→𝑘 𝑒 𝑗 r y = r y /2 + r a /2  2: 𝑠 = 𝑠′ r a = r y /2 + r m  Goto 1 r m = r a /2  Example: r y 1/3 1/3 5/12 9/24 6/15 11/24 … r a = 1/3 3/6 1/3 6/15 r m 1/3 1/6 3/12 1/6 3/15 Iteration 0, 1, 2, … 2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 25

y a m  Power Iteration: y y ½ ½ 0  Set 𝑠 𝑘 = 1 /N a ½ 0 1 a m 𝑠 𝑗 m 0 ½ 0  1: 𝑠′ 𝑘 = 𝑗→𝑘 𝑒 𝑗 r y = r y /2 + r a /2  2: 𝑠 = 𝑠′ r a = r y /2 + r m  Goto 1 r m = r a /2  Example: r y 1/3 1/3 5/12 9/24 6/15 11/24 … r a = 1/3 3/6 1/3 6/15 r m 1/3 1/6 3/12 1/6 3/15 Iteration 0, 1, 2, … 2/5/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 26

- Power iteration: a method for finding the dominant eigenvector (the vector corresponding to the largest eigenvalue)
  - $r^{(1)} = M \cdot r^{(0)}$
  - $r^{(2)} = M \cdot r^{(1)} = M (M r^{(0)}) = M^2 \cdot r^{(0)}$
  - $r^{(3)} = M \cdot r^{(2)} = M (M^2 r^{(0)}) = M^3 \cdot r^{(0)}$
- Claim: the sequence $M \cdot r^{(0)}, M^2 \cdot r^{(0)}, \dots, M^k \cdot r^{(0)}, \dots$ approaches the dominant eigenvector of $M$
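A sketch of why the claim is plausible, under assumptions the slide does not state. First, $M$ has $N$ linearly independent eigenvectors $x_1, \dots, x_N$ with eigenvalues $1 = \lambda_1 > |\lambda_2| \ge \dots \ge |\lambda_N|$. Second, the start vector $r^{(0)}$ has a nonzero coefficient $c_1$ on $x_1$.

```latex
\begin{aligned}
r^{(0)} &= c_1 x_1 + c_2 x_2 + \dots + c_N x_N, \qquad M x_i = \lambda_i x_i, \\
M^k r^{(0)} &= c_1 \lambda_1^k x_1 + c_2 \lambda_2^k x_2 + \dots + c_N \lambda_N^k x_N \\
            &= c_1 x_1 + \sum_{i \ge 2} c_i \lambda_i^k x_i
            \;\longrightarrow\; c_1 x_1 \quad (k \to \infty).
\end{aligned}
```

Every term with $|\lambda_i| < 1$ shrinks geometrically, so the iterates line up with $x_1$. If several eigenvalues have magnitude 1, this argument does not apply as written.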
