
CS425: Algorithms for Web Scale Data

Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at www.mmds.org.

• Graph data overview
• Problems with early search engines
• PageRank Model
  ▪ Flow Formulation
  ▪ Matrix Interpretation
  ▪ Random Walk Interpretation
  ▪ Google’s Formulation
• How to Compute PageRank

Facebook social graph: 4 degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna, 2011]
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

Connections between political blogs: polarization of the network [Adamic-Glance, 2005]

Citation networks and maps of science [Börner et al., 2012]

[Figure: the Internet as a graph; labels: domain1, domain2, domain3, router]


• How to organize the Web?
• First try: Human-curated Web directories
  ▪ Yahoo, DMOZ, LookSmart
• Second try: Web Search
  ▪ Information Retrieval investigates: find relevant docs in a small and trusted set
  ▪ Newspaper articles, patents, etc.
  ▪ But: the Web is huge, full of untrusted documents, random things, web spam, etc.

2 challenges of web search:
• (1) The Web contains many sources of information. Who to “trust”?
  ▪ Trick: Trustworthy pages may point to each other!
• (2) What is the “best” answer to the query “newspaper”?
  ▪ No single right answer
  ▪ Trick: Pages that actually know about newspapers might all be pointing to many newspapers

Early Search Engines
• Inverted index
  ▪ Data structure that returns pointers to all pages in which a term occurs
• Which page to return first?
  ▪ Where do the search terms appear in the page?
  ▪ How many occurrences of the search terms are in the page?
• What if a spammer tries to fool the search engine?
CS 425 – Lecture 1 Mustafa Ozdal, Bilkent University
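As a rough illustration of the inverted index described above, the following minimal sketch maps each term to the set of pages containing it. The function name `build_inverted_index`, the whitespace tokenization, and the toy corpus are assumptions made for this example, not part of the original slides.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each term to the set of page ids in which it occurs."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        # Naive tokenization (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(page_id)
    return index


# Hypothetical toy corpus, for illustration only.
pages = {
    "p1": "new movies and movie reviews",
    "p2": "local newspaper headlines",
    "p3": "movies showing this week",
}

index = build_inverted_index(pages)
movie_pages = sorted(index["movies"])
print(movie_pages)
```

The index answers "which pages contain this term?" but says nothing about which page to return first, which is the ranking question the slide raises.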

Fooling Early Search Engines
• Example: A spammer wants his page to be in the top search results for the term “movies”.
• Approach 1:
  ▪ Add thousands of copies of the term “movies” to your page.
  ▪ Make them invisible.
• Approach 2:
  ▪ Search the term “movies”.
  ▪ Copy the contents of the top page to your page.
  ▪ Make it invisible.
• Problem: Ranking is based only on page contents
• Early search engines were almost useless because of spam.
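To see why content-only ranking is easy to fool, here is a small sketch of Approach 1 against a ranker that simply counts term occurrences. The scoring function `term_count_score` and both page texts are hypothetical; real early engines used more elaborate content signals, but the weakness is the same in spirit.

```python
def term_count_score(text, term):
    """Score a page by how many times the query term occurs in it."""
    return text.lower().split().count(term)


# Hypothetical pages: one honest, one stuffed with the query term.
candidates = {
    "honest": "reviews of new movies and classic movies",
    "spam": "buy cheap pills " + "movies " * 1000,
}

# Rank pages by descending term count for the query "movies".
ranked = sorted(
    candidates,
    key=lambda name: term_count_score(candidates[name], "movies"),
    reverse=True,
)
top = ranked[0]
print(ranked)
```

Because the score depends only on what the page says about itself, the stuffed page wins regardless of its actual relevance.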

Google’s Innovations
• Basic idea: The search engine believes what other pages say about you instead of what you say about yourself.
• Main innovations:
  1. Define the importance of a page based on:
     ▪ How many pages point to it?
     ▪ How important are those pages?
  2. Judge the contents of a page based on:
     ▪ Which terms appear in the page?
     ▪ Which terms are used to link to the page?

• Not all web pages are equally “important”
  ▪ www.joe-schmoe.com vs. www.stanford.edu
• There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure!

• We will cover the following Link Analysis approaches for computing the importance of nodes in a graph:
  ▪ PageRank
  ▪ Topic-Specific (Personalized) PageRank
  ▪ Web Spam Detection Algorithms

• Think of in-links as votes:
  ▪ www.stanford.edu has 23,400 in-links
  ▪ www.joe-schmoe.com has 1 in-link
• Are all in-links equal?
  ▪ Links from important pages count more
  ▪ Recursive question!

[Figure: example graph with PageRank scores: A 3.3, B 38.4, C 34.3, D 3.9, E 8.1, F 3.9; five unlabeled nodes with score 1.6 each]

• Each link’s vote is proportional to the importance of its source page
• If page $j$ with importance $r_j$ has $n$ out-links, each link gets $r_j / n$ votes
• Page $j$’s own importance is the sum of the votes on its in-links

[Figure: page $i$ (3 out-links) and page $k$ (4 out-links) both link to page $j$, which has 3 out-links, each carrying $r_j/3$]

$$r_j = \frac{r_i}{3} + \frac{r_k}{4}$$
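The voting rule above can be sketched in a few lines. The helper name `importance_from_inlinks` and the numeric importances for pages i and k are hypothetical values chosen for illustration; only the out-degrees 3 and 4 come from the slide.

```python
from fractions import Fraction


def importance_from_inlinks(inlinks):
    """Sum the votes on a page's in-links.

    inlinks: list of (importance of source page, out-degree of source page).
    Each source splits its importance evenly across its out-links.
    """
    return sum(Fraction(r) / d for r, d in inlinks)


# Hypothetical importances: r_i = 3/10 with 3 out-links, r_k = 2/5 with 4 out-links.
r_i, r_k = Fraction(3, 10), Fraction(2, 5)
r_j = importance_from_inlinks([(r_i, 3), (r_k, 4)])  # r_i/3 + r_k/4
print(r_j)
```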

• A “vote” from an important page is worth more
• A page is important if it is pointed to by other important pages
• Define a “rank” $r_j$ for page $j$:

$$r_j = \sum_{i \to j} \frac{r_i}{d_i}$$

  where $d_i$ is the out-degree of node $i$

[Figure: graph on pages y, a, m with links y→y, y→a, a→y, a→m, m→a]

“Flow” equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$

Flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$
• 3 equations, 3 unknowns, no constants
  ▪ No unique solution
  ▪ All solutions are equivalent modulo the scale factor
• Additional constraint forces uniqueness:
  ▪ $r_y + r_a + r_m = 1$
  ▪ Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
• Gaussian elimination works for small examples, but we need a better method for large web-size graphs
• We need a new formulation!
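For a graph this small, the system can be solved directly. The sketch below is one possible way to do it, not the slides' own code: it replaces the third (redundant) flow equation with the normalization constraint and runs a plain Gauss-Jordan elimination over exact fractions. The helper name `solve` is an assumption.

```python
from fractions import Fraction as F


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A square and non-singular)."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a non-zero pivot and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


# Unknowns ordered (r_y, r_a, r_m).
A = [
    [F(1, 2), F(-1, 2), F(0)],  # r_y - r_y/2 - r_a/2 = 0
    [F(-1, 2), F(1), F(-1)],    # r_a - r_y/2 - r_m   = 0
    [F(1), F(1), F(1)],         # r_y + r_a + r_m     = 1
]
b = [F(0), F(0), F(1)]

r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)
```

Elimination costs on the order of N³ operations for N pages, which is why it does not scale to web-size graphs.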

• Stochastic adjacency matrix $M$
  ▪ Let page $i$ have $d_i$ out-links
  ▪ If $i \to j$, then $M_{ji} = \frac{1}{d_i}$, else $M_{ji} = 0$
• Rank vector $r$: vector with an entry per page
  ▪ $r_i$ is the importance score of page $i$
  ▪ $\sum_i r_i = 1$
• The flow equations $r_j = \sum_{i \to j} \frac{r_i}{d_i}$ can be written as

$$r = M \cdot r$$
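As a sketch of the definition above, the matrix can be built from out-link lists. The function name `stochastic_matrix` and the dictionary representation are assumptions for this example; the graph is the y, a, m graph from the flow equations.

```python
from fractions import Fraction


def stochastic_matrix(out_links, order):
    """Build M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    idx = {page: k for k, page in enumerate(order)}
    n = len(order)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in out_links.items():
        d_i = len(targets)  # out-degree of page i
        for j in targets:
            M[idx[j]][idx[i]] = Fraction(1, d_i)
    return M


# Links: y -> y, a; a -> y, m; m -> a.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(out_links, ["y", "a", "m"])
for row in M:
    print([str(v) for v in row])
```

Note that column i describes where page i sends its importance, so each column sums to 1 as long as every page has at least one out-link.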

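Matrix form for the y, a, m example, $r = M \cdot r$:

$$\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}$$

Rows and columns are ordered y, a, m. This is equivalent to the flow equations:
  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$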

• Remember the flow equation: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
• Flow equation in matrix form: $M \cdot r = r$
  ▪ Suppose page $i$ links to 3 pages, including $j$. Then column $i$ of $M$ has the entry $M_{ji} = 1/3$, so page $i$ contributes $r_i/3$ to $r_j$.

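Exercise: Matrix Formulation
[Figure: four-node graph on pages A, B, C, D, with the system $r = M \cdot r$ for $r = (r_A, r_B, r_C, r_D)^T$ and a 4×4 matrix $M$ whose entries are 0, 1/3, 1/2, and 1; the edge layout and matrix arrangement are not recoverable from the extracted text]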

Linear Algebra Reminders
• A is a column stochastic matrix iff each of its columns adds up to 1 and there are no negative entries.
  ▪ Our adjacency matrix M is column stochastic. Why?
• If there exist a vector x and a scalar λ such that Ax = λx, then:
  ▪ x is an eigenvector and λ is an eigenvalue of A
• The principal eigenvector is the one that corresponds to the largest eigenvalue.
  ▪ The largest eigenvalue of a column stochastic matrix is 1.
  ▪ Ax = x, where x is the principal eigenvector
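Both reminders can be checked numerically on the y, a, m example. The following is a minimal sketch with hypothetical helper names `is_column_stochastic` and `mat_vec`; it uses the matrix M and the solution (2/5, 2/5, 1/5) from the earlier slides.

```python
from fractions import Fraction as F


def is_column_stochastic(A):
    """True iff A has no negative entries and every column sums to 1."""
    n = len(A)
    non_negative = all(v >= 0 for row in A for v in row)
    columns_sum_to_one = all(
        sum(A[i][j] for i in range(n)) == 1 for j in range(n)
    )
    return non_negative and columns_sum_to_one


def mat_vec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


# M for the y, a, m graph (rows and columns ordered y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]
x = [F(2, 5), F(2, 5), F(1, 5)]

print(is_column_stochastic(M))  # column sums and sign check
print(mat_vec(M, x) == x)       # does M x = 1 * x hold?
```

M is column stochastic because column i holds 1/d_i in exactly d_i rows, provided page i has at least one out-link.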

• PageRank flow formulation: $r = M \cdot r$
• So the rank vector $r$ is an eigenvector of the stochastic web matrix $M$
  ▪ In fact, it is the first or principal eigenvector, with corresponding eigenvalue 1
• NOTE: $x$ is an eigenvector with the corresponding eigenvalue $\lambda$ if $Ax = \lambda x$
• We can now efficiently solve for $r$! The method is called power iteration

• Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks
• Power iteration: a simple iterative scheme
  ▪ Suppose there are N web pages
  ▪ Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
  ▪ Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node $i$
  ▪ Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
• $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm; any other vector norm, e.g., Euclidean, can be used
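The scheme above can be sketched as follows. This is a minimal dense-matrix version under stated assumptions: the function name `power_iteration`, the default threshold `eps=1e-10`, and the `max_iter` safety cap are choices made for this example and do not come from the slides.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n  # r^(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        # r_next[j] = sum over i of M[j][i] * r[i]
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        # L1 norm of the difference between successive iterates.
        if sum(abs(a - b) for a, b in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r  # max_iter reached without meeting the threshold


# M for the y, a, m graph (rows and columns ordered y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r = power_iteration(M)
print(r)
```

Each iteration is a single matrix-vector product, so with a sparse representation the cost per step is proportional to the number of links rather than N².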

• Power Iteration:
  ▪ Set $r_j = 1/N$
  ▪ 1: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
  ▪ 2: $r = r'$
  ▪ Goto 1
• Example, using the y, a, m graph:

$$M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}$$

  $r_y = r_y/2 + r_a/2$
  $r_a = r_y/2 + r_m$
  $r_m = r_a/2$

  Iteration   0     1     2      3       …   limit
  r_y         1/3   1/3   5/12   9/24    …   6/15
  r_a         1/3   3/6   1/3    11/24   …   6/15
  r_m         1/3   1/6   3/12   1/6     …   3/15
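The first few columns of the example can be reproduced exactly with rational arithmetic. A small sketch follows; the helper name `step` is an assumption, and fractions print in lowest terms, so 9/24 appears as 3/8 and 3/12 as 1/4.

```python
from fractions import Fraction as F

# M for the y, a, m graph (rows and columns ordered y, a, m).
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0), F(1)],
    [F(0), F(1, 2), F(0)],
]


def step(r):
    """One power-iteration step: r'_j = sum over i of M[j][i] * r[i]."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]


r = [F(1, 3)] * 3  # iteration 0: uniform vector
iterates = [r]
for _ in range(3):
    r = step(r)
    iterates.append(r)

for t, vec in enumerate(iterates):
    print(t, [str(v) for v in vec])
```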
