Web and PageRank
Lecture 4, CSCI 4974/6971
12 Sep 2016

Today’s Biz
1. Review MPI
2. Reminders
3. Structure of the web
4. PageRank Centrality
5. More MPI
6. Parallel PageRank Tutorial

MPI Review
◮ Basic functions
  ◮ MPI_Init(&argc, &argv)
  ◮ MPI_Comm_rank(MPI_COMM_WORLD, &rank)
  ◮ MPI_Comm_size(MPI_COMM_WORLD, &size)
  ◮ MPI_Finalize()
  ◮ MPI_Barrier(MPI_COMM_WORLD)
◮ Point-to-point communication
  ◮ MPI_Send(sbuf, count, MPI_TYPE, to, tag, MPI_COMM_WORLD)
  ◮ MPI_Recv(rbuf, count, MPI_TYPE, from, tag, MPI_COMM_WORLD, &status)
◮ Reductions
  ◮ MPI_Reduce(sbuf, rbuf, count, MPI_TYPE, MPI_OP, root, MPI_COMM_WORLD)
  ◮ MPI_Allreduce(sbuf, rbuf, count, MPI_TYPE, MPI_OP, MPI_COMM_WORLD)

Reminders
◮ Assignment 1: Monday 19 Sept 16:00
◮ Project Proposal: Thursday 22 Sept 16:00
◮ Office hours: Tuesday & Wednesday 14:00-16:00, Lally 317
  ◮ Or email me for other availability
◮ Class schedule (for next month):
  ◮ Web analysis methods
  ◮ Social net analysis methods
  ◮ Bio net analysis methods
  ◮ Random networks and usage

Structure of the Web
Slides from Jure Leskovec and Anand Rajaraman, Stanford University

CS345a: Data Mining
Jure Leskovec and Anand Rajaraman, Stanford University

◮ TheFind.com
  ◮ Large set of products (~6GB compressed)
  ◮ For each product
    ◮ Attributes
    ◮ Related products
◮ Craigslist
  ◮ About 3 weeks of data (~7.5GB compressed)
  ◮ Text of posts, plus category metadata
  ◮ e.g., match buyers and sellers

◮ How big is the Web?
  ◮ Technically, infinite
  ◮ Much duplication (30-40%)
  ◮ Best estimate of “unique” static HTML pages comes from search engine claims
    ◮ Google = 8 billion(?), Yahoo = 20 billion
◮ What is the structure of the Web? How is it organized?
1/26/10, Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

Directed graph
◮ In the early days of the Web, links were navigational
◮ Today many links are transactional

◮ Two types of directed graphs:
  ◮ DAG – directed acyclic graph:
    ◮ Has no cycles: if u can reach v, then v cannot reach u
  ◮ Strongly connected:
    ◮ Any node can reach any node via a directed path
◮ Any directed graph can be expressed in terms of these two types

◮ A strongly connected component (SCC) is a set of nodes S such that:
  ◮ Every pair of nodes in S can reach each other
  ◮ There is no larger set containing S with this property

◮ Take a large snapshot of the Web and try to understand how its SCCs “fit” as a DAG.
◮ Computational issues:
  ◮ Say we want to find the SCC containing a specific node v?
◮ Observation:
  ◮ Out(v) … nodes that are reachable from v (BFS out)
  ◮ In(v) … nodes that can reach v (BFS in)
  ◮ SCC containing v:
      = Out(v, G) ∩ In(v, G)
      = Out(v, G) ∩ Out(v, Ḡ)
    where Ḡ is G with the directions of all edges flipped
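
The observation above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the web-scale experiments: the adjacency-dict representation, the helper names (`reachable`, `reverse_graph`, `scc_containing`) and the toy graph are all assumptions made for the example.

```python
from collections import deque

def reachable(adj, start):
    """Return the set of nodes reachable from start by BFS (start included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def reverse_graph(adj):
    """Build the graph with the direction of every edge flipped."""
    rev = {u: [] for u in adj}
    for u, outs in adj.items():
        for w in outs:
            rev.setdefault(w, []).append(u)
    return rev

def scc_containing(adj, v):
    """SCC of v = Out(v, G) intersected with Out(v, reversed G)."""
    return reachable(adj, v) & reachable(reverse_graph(adj), v)

# Hypothetical toy graph: a 3-cycle a -> b -> c -> a with a tail c -> d.
toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
scc_a = scc_containing(toy, "a")   # the cycle
scc_d = scc_containing(toy, "d")   # d alone
```

Two BFS passes cost O(nodes + edges) each, which is why this works for a single query node; finding all SCCs at once would normally use a dedicated algorithm instead.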

[Broder et al., ’00]
◮ There is a giant SCC
◮ Broder et al., 2000:
  ◮ Giant weakly connected component: 90% of the nodes
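
A weakly connected component ignores edge directions, so the "90% of the nodes" figure is the size of the largest such component divided by the node count. A rough sketch of that measurement follows; the function name `weak_components` and the toy node and edge lists are assumptions for illustration only.

```python
from collections import deque

def weak_components(nodes, edges):
    """Connected components of a directed graph with edge directions ignored."""
    undirected = {u: set() for u in nodes}
    for u, w in edges:
        undirected[u].add(w)
        undirected[w].add(u)
    seen, components = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            for w in undirected[u]:
                if w not in seen:
                    seen.add(w)
                    comp.add(w)
                    queue.append(w)
        components.append(comp)
    return components

# Hypothetical toy data: 5 nodes, one of them isolated.
nodes = [0, 1, 2, 3, 4]
edges = [(0, 1), (2, 1), (2, 3)]
components = weak_components(nodes, edges)
giant_fraction = max(len(c) for c in components) / len(nodes)
```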

[Broder et al., ’00]
◮ 250 million webpages, 1.5 billion links [AltaVista]

[Albert et al., ’99]
◮ Diameter (average directed shortest path length) is 19 (in 1999)

[Broder et al., ’00]
◮ Average distance: 75% of the time there is no directed path from start to finish page
  ◮ Follow in-links (directed): 16.12
  ◮ Follow out-links (directed): 16.18
  ◮ Undirected: 6.83
◮ Diameter of SCC (directed):
  ◮ At least 28
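
The quantities on this slide (average distance over pairs that are connected, and the share of pairs with no directed path) can be illustrated on a toy graph. The sketch below runs BFS from every node, which is only feasible for small graphs; measurements at web scale would rely on a sample of start nodes. The helper names and the toy graph are assumptions, and `path_stats` assumes at least one connected pair exists.

```python
from collections import deque

def bfs_distances(adj, start):
    """Hop distance from start to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def path_stats(adj):
    """Return (average distance over connected ordered pairs,
    fraction of ordered pairs with no path)."""
    total, connected = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                connected += 1
    pairs = len(adj) * (len(adj) - 1)
    return total / connected, 1 - connected / pairs

def symmetrize(adj):
    """Undirected version of the graph: every edge usable both ways."""
    und = {u: set(outs) for u, outs in adj.items()}
    for u, outs in adj.items():
        for w in outs:
            und[w].add(u)
    return und

# Hypothetical toy graph: directed cycle 0 -> 1 -> 2 -> 3 -> 0 plus 4 -> 0.
toy = {0: [1], 1: [2], 2: [3], 3: [0], 4: [0]}
directed_avg, directed_missing = path_stats(toy)
undirected_avg, undirected_missing = path_stats(symmetrize(toy))
```

On this toy graph the directed average is longer than the undirected one, and some ordered pairs have no directed path at all, which mirrors the qualitative pattern reported on the slide.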

◮ Take a real network and plot a histogram of p_k vs. k
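
As a small illustration of the histogram, assuming p_k denotes the fraction of nodes whose degree is exactly k (the slide does not spell this out), the values can be computed from a list of node degrees. The function name and the degree list are made up for the example.

```python
from collections import Counter

def degree_distribution(degrees):
    """Map each degree k to p_k, the fraction of nodes with degree k."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical in-degrees for 8 nodes.
in_degrees = [1, 1, 1, 1, 2, 2, 3, 5]
p = degree_distribution(in_degrees)
```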

◮ Plot the same data on log-log axes:

[Figure: log-log plot contrasting a power law, Y ~ X^-2, with an exponential, Y ~ e^-X]
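
The point of the log-log plot is that a power law becomes a straight line whose slope is the exponent, while an exponential keeps bending downward. A minimal numeric sketch of that contrast follows; the sample points and the helper `loglog_slope` are assumptions, and a least-squares fit on log-log data is only a crude way to estimate an exponent.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

xs = [1, 2, 4, 8, 16, 32]
power_law = [x ** -2 for x in xs]          # Y ~ X^-2
exponential = [math.exp(-x) for x in xs]   # Y ~ e^-X

# The power law fits a line of slope -2 on log-log axes.
slope_power = loglog_slope(xs, power_law)

# Local log-log slope of the exponential between neighbouring points:
# it keeps getting steeper, so no single straight line fits.
local_exp = [
    (math.log(exponential[i + 1]) - math.log(exponential[i]))
    / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
```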

◮ Power-law degree exponent is typically 2 < α < 3
  ◮ Web graph [Broder et al. 00]:
    ◮ α_in = 2.1, α_out = 2.4
  ◮ Autonomous systems [Faloutsos et al. 99]:
    ◮ α = 2.4
  ◮ Actor collaborations [Barabasi-Albert 00]:
    ◮ α = 2.3
  ◮ Citations to papers [Redner 98]:
    ◮ α ≈ 3
  ◮ Online social networks [Leskovec et al. 07]:
    ◮ α ≈ 2

◮ Random network (Erdos-Renyi random graph)
  ◮ Degree distribution is Binomial
◮ Scale-free (power-law) network
  ◮ Degree distribution is Power-law
  ◮ Function is scale free if: f(ax) = c f(x)
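
To see why a power law satisfies f(ax) = c f(x): for f(x) = x^-α, rescaling x by a multiplies f by the constant a^-α no matter which x is chosen. The check below is a sketch with an arbitrarily chosen exponent (2.5) and scale factor (3); an exponential is included for contrast, since its ratio depends on x.

```python
import math

ALPHA = 2.5   # assumed exponent, chosen only for illustration
a = 3.0       # assumed rescaling factor

def power_law(x):
    return x ** -ALPHA

xs = [1.0, 2.0, 10.0, 50.0]

# For the power law, f(ax) / f(x) is the same constant c = a^-ALPHA for every x.
power_ratios = [power_law(a * x) / power_law(x) for x in xs]
expected_c = a ** -ALPHA

# For the exponential e^-x the ratio is e^-(a-1)x, which changes with x.
exp_ratios = [math.exp(-a * x) / math.exp(-x) for x in xs]
```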

Structure of the Web – Revisited
Slides from Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer, Universität Mannheim

Graph Structure in the Web Revisited
Robert Meusel, Sebastiano Vigna, Oliver Lehmberg, Christian Bizer
Graph Structure of the Web – Meusel/Vigna/Lehmberg/Bizer – WWW 2014 (Version: 4.2.2014)

Textbook Knowledge about the Web Graph
◮ Broder et al.: Graph structure in the Web. WWW2000.
  ◮ used two AltaVista crawls (200 million pages, 1.5 billion links)
◮ Results
  ◮ Power Laws
  ◮ Bow-Tie
