PageRank CS16: Introduction to Data Structures & Algorithms - - PowerPoint PPT Presentation



slide-1
SLIDE 1

PageRank

CS16: Introduction to Data Structures & Algorithms Spring 2020

slide-2
SLIDE 2

Outline

  • The WWW & Search Engines
  • Basic PageRank
  • (Real) PageRank
  • PageRank in practice
2
slide-3
SLIDE 3

The World Wide Web

  • Created by Tim Berners-Lee in 1989
  • Collection of “pages”
  • Pages are
  • identified by Uniform Resource Locator (URL)
  • composed of text & hyperlinks (pointers to other pages)
3
slide-4
SLIDE 4

Hypertext

  • Hypertext and hyperlinks predate the WWW
  • Hypertext Editing System (HES) in 1967
  • Ted Nelson, Andy van Dam + Brown students
  • File Retrieval and Editing System (FRESS) in 1968
  • Andy van Dam + Brown students (including Bob Wallace)
  • used in Brown’s “Introduction to Poetry” in 1975 & 1976
  • oN-Line System (NLS) in 1968
  • Douglas Engelbart
4
slide-5
SLIDE 5

Growth of the Web

5

[Figure: number of websites by year, 1991 to 2020; y-axis ticks from 450,000,000 to 1,800,000,000]
slide-6
SLIDE 6

Growth of the Web

6

[Figure: number of websites by year, 1991 to 2000; y-axis ticks from 3,000,000 to 18,000,000; annotations: Yahoo (2K), Altavista (23K), Google (2.4M)]
slide-7
SLIDE 7

Search Engines

  • The Web is great, but how do we find what we need?
  • Search engine
  • system that indexes collection of web pages
  • returns relevant pages when queried with keyword(s)
  • Q: how do we build a search engine?
7
slide-8
SLIDE 8

Search Engines

  • Idea #1
  • build a dictionary that maps keywords to URLs
  • use hash tables or binary search trees (see Lecture 05)
  • what’s the problem with this approach?
  • some keywords will have too many URLs to check
  • let’s rank the pages by relevance!
  • Q: how do we rank pages by relevance?
8
slide-9
SLIDE 9

Search Engines

  • Rank by frequency
  • build a dictionary that maps keywords to URLs
  • use hash tables or binary search trees (see end of Lecture 05)
  • store URLs ranked by the # of times keyword appears in page
  • Q: Is this a good idea?
  • Why or why not?
9
slide-10
SLIDE 10

Search Engines

  • Rank by frequency
  • build a dictionary that maps keywords to URLs
  • use hash tables or binary search trees (see end of Lecture 05)
  • store URLs ranked by the # of times keyword appears in page
  • Q: Is this a good idea?
  • Why or why not?
  • Imagine searching for “turtle”
10

turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle turtle
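The weakness of ranking by frequency can be seen in a minimal sketch. The page URLs and texts below are hypothetical, and `build_index` is an illustrative helper, not something from the lecture: it maps each keyword to its URLs using a Python dict (a hash table), sorted by how often the keyword appears.

```python
from collections import Counter, defaultdict

def build_index(pages):
    """Map each keyword to a list of (url, count) pairs, most frequent first."""
    index = defaultdict(list)
    for url, text in pages.items():
        for word, count in Counter(text.lower().split()).items():
            index[word].append((url, count))
    for postings in index.values():
        postings.sort(key=lambda pair: pair[1], reverse=True)
    return index

# Hypothetical pages: one informative, one stuffed with the keyword.
pages = {
    "example.org/turtle-facts": "the turtle is a reptile and a turtle has a shell",
    "example.org/spam": "turtle " * 45 + "buy cheap watches",
}
index = build_index(pages)
top_url, top_count = index["turtle"][0]
print(top_url, top_count)  # the keyword-stuffed page ranks first
```

A page that simply repeats the keyword beats the page that is actually about turtles, which is why frequency alone is a poor measure of relevance.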

slide-11
SLIDE 11
  • Sergey Brin & Larry Page
  • PhD students doing research in information retrieval
  • noticed that links were important too!
  • intuition that links conveyed information about importance
  • But what exactly? and how can you make use of links?
11
slide-12
SLIDE 12

PageRank

  • How does PageRank work?
  • Why does it work?
  • How do you implement it efficiently?
  • Google indexes “hundreds of billions” of pages
  • answers and ranks in 0.5 seconds
  • processes 40,000 queries a second
  • 3.5 billion per day
  • Using clever algorithms and data structures!
12

From Lecture 01

slide-13
SLIDE 13

The PageRank Algorithm

  • Views WWW as a directed graph G=(V,E)
  • web pages are the vertices
  • hyperlinks are the edges
  • High-level idea
  • algorithm works by rounds
  • think of the pagerank of a page as some amount of fluid
  • at each round a page pushes its pagerank/fluid to the pages it links to

13
slide-14
SLIDE 14

The Basic PageRank Algorithm

  • At every round
  • each vertex splits its PR evenly among its outgoing edges
  • each vertex receives PR from all its incoming edges
  • this is done using an update rule which is run on every vertex
  • The update rule for Basic PageRank is:
14

$$\mathrm{PR}(v) = \sum_{u \in \mathrm{in}(v)} \frac{\mathrm{PR}(u)}{|\mathrm{out}(u)|}$$
slide-15
SLIDE 15

Basic PageRank: Example 1

15

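Round 1

[Figure: graph on nodes A, B, C, D, E. Node pageranks: A = .2, B = .2, C = .2, D = .2, E = .2. Pagerank pushed along the edges: .2, .2, .2, .2, .1, .1]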

slide-16
SLIDE 16

Basic PageRank: Example 1

16

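Round 2

[Figure: graph on nodes A, B, C, D, E. Node pageranks: A = .1, B = .4, C = .1, D = .4, E = 0. Pagerank pushed along the edges: .1, .1, .4, .2, .2]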

slide-17
SLIDE 17

Basic PageRank: Example 1

17

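Round 3

[Figure: graph on nodes A, B, C, D, E. Node pageranks: A = .2, B = .5, C = .2, D = .1, E = 0. Pagerank pushed along the edges: .2, .2, .1, .25, .25]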

slide-18
SLIDE 18

Basic PageRank: Example 1

18

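Round 4

[Figure: graph on nodes A, B, C, D, E. Node pageranks: A = .25, B = .3, C = .25, D = .2, E = 0. Pagerank pushed along the edges: .25, .25, .2, .15, .15]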

slide-19
SLIDE 19

Basic PageRank: Example 1

19

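Round 5

[Figure: graph on nodes A, B, C, D, E. Node pageranks: A = .15, B = .45, C = .15, D = .25, E = 0. Pagerank pushed along the edges: .15, .15, .25, .225, .225]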

slide-20
SLIDE 20

Basic PageRank: Example 1

20

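Round 6

[Figure: graph on nodes A, B, C, D, E. Node pageranks: A = .225, B = .40, C = .225, D = .15, E = 0. Pagerank pushed along the edges: .225, .225, .15, .20, .20]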

slide-21
SLIDE 21

Basic PageRank: Example 1

21

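Round 7

[Figure: graph on nodes A, B, C, D, E. Node pageranks: A = .20, B = .375, C = .20, D = .225, E = 0. Pagerank pushed along the edges: .20, .20, .225, .1875, .1875]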

slide-22
SLIDE 22

Basic PageRank: Example 1

22

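Round 8

[Figure: graph on nodes A, B, C, D, E. Node pageranks: A = .1875, B = .425, C = .1875, D = .20, E = 0. Pagerank pushed along the edges: .1875, .1875, .20, .2125, .2125]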

slide-23
SLIDE 23

Basic PageRank

  • At Round 8
  • B has pagerank .425
  • D has pagerank .2
  • A has pagerank .1875
  • C has pagerank .1875
  • E has pagerank 0
  • What happens if we keep going?
  • does the ranking stabilize or will C have pagerank higher than D?
  • can E end up with non-zero pagerank?
  • for certain graphs, if we keep going long enough pageranks will stabilize
23
slide-24
SLIDE 24

Observations

  • The sum of all pageranks always equals 1
  • pagerank is moved around but never created or destroyed
  • Pages with many incoming edges accumulate more pagerank (e.g., A, D)
  • even better if incoming edges from pages with high pagerank (e.g., B)
  • Pages with no incoming edges lose all their pagerank (e.g., E)
  • Intuitively
  • the more a page is linked to, the higher its rank
  • the more a page is linked to by high-ranked pages, the higher it ranks
24
slide-25
SLIDE 25

Basic PageRank

25

BasicPageRank(G, k):  # k is number of “rounds”
  for v in V:
    v.rank = 1/|V|
  for i from 1 to k:
    for v in V:
      v.prevrank = v.rank
    for v in V:
      v.rank = 0
      for u in v.incoming:
        v.rank = v.rank + u.prevrank/|u.outgoing|

  • Runtime: O(|V|+k(|V|+|E|)) = O(|V|+|E|), assuming k is a constant
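The pseudocode above can be turned into a short runnable sketch. The dict-of-lists graph representation and the name `basic_pagerank` are choices made here for illustration. The edge set for Example 1 is an assumption: the original figure is not available, so the edges were inferred from the per-round values on the Example 1 slides (A and C are interchangeable in that inference).

```python
def basic_pagerank(out_edges, k):
    """Run k rounds of the basic update rule.

    out_edges maps every vertex to the list of vertices it links to.
    """
    vertices = list(out_edges)
    rank = {v: 1 / len(vertices) for v in vertices}
    # Precompute incoming edges so each round costs O(|V| + |E|).
    incoming = {v: [] for v in vertices}
    for u, targets in out_edges.items():
        for v in targets:
            incoming[v].append(u)
    for _ in range(k):
        prev = rank  # plays the role of v.prevrank
        rank = {
            v: sum(prev[u] / len(out_edges[u]) for u in incoming[v])
            for v in vertices
        }
    return rank

# Assumed edge set, inferred from the per-round values of Example 1.
example1 = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["B"], "E": ["D"]}
ranks = basic_pagerank(example1, 7)  # 7 updates take Round 1 to Round 8
print({v: round(r, 4) for v, r in ranks.items()})
```

Under that assumed edge set, seven updates should reproduce the Round 8 values: B is about .425, D about .2, A and C about .1875, and E is 0.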
slide-26
SLIDE 26

Basic PageRank: Example 2

26

A B C D E

slide-27
SLIDE 27

Basic PageRank on Example 2

27

3 min

Activity #1 k = 3

BasicPageRank(G, k):  # k is number of “rounds”
  for v in V:
    v.rank = 1/|V|
  for i from 1 to k:
    for v in V:
      v.prevrank = v.rank
    for v in V:
      v.rank = 0
      for u in v.incoming:
        v.rank = v.rank + u.prevrank/|u.outgoing|
slide-28
SLIDE 28

Basic PageRank: Example 2

28

A B C D E

.2 .2 .2 .2 .2

.2 .1 .1 .2 .2

Round 1

slide-29
SLIDE 29

Basic PageRank: Example 2

29

A B C D E

.7 .3

.3

Round 2

slide-30
SLIDE 30

Basic PageRank: Example 2

30

A B C D E

1 Round 3

slide-31
SLIDE 31

What’s going on?

slide-32
SLIDE 32

Basic PageRank: Example 2

  • All the pagerank got trapped at B
  • happens if graph has sinks (i.e., nodes w/ no outgoing edges)
32

[Figure: graph on nodes A, B, C, D, E; B holds all the pagerank (1)]
slide-33
SLIDE 33

Basic PageRank: Example 3

33

A C B D

.25 .25 .25 .25

slide-34
SLIDE 34

Basic PageRank: Example 3

34

A C B D

1/4 1/4 1/4 1/4

1/8 1/4 1/8 1/4 1/4
slide-35
SLIDE 35

Basic PageRank: Example 3

35

A C B D

1/8 1/2 3/8

1/8 1/2 3/8
slide-36
SLIDE 36

Basic PageRank: Example 3

36

A C B D

1/2 1/2

1/2 1/2
slide-37
SLIDE 37

Basic PageRank: Example 3

  • All the pagerank got trapped at C and D

37

A C B D

1/2 1/2

1/2 1/2
slide-38
SLIDE 38

Basic PageRank

  • Basic PageRank doesn’t work for certain graphs
  • e.g.: graphs with sinks or with cycles with no outgoing edges
  • all the pagerank gets trapped there
  • How do we handle “rank traps”?
  • Water flows down from high elevation to low elevation
  • why doesn’t all the water end up at the lowest points on Earth?
  • because some of the water evaporates…
  • …and rains back down on the high elevation points
38
slide-39
SLIDE 39

Handling Rank Traps

  • Let’s make some of the pagerank evaporate!
  • We need a new update rule
  • In basic update rule, nodes gave all their pagerank to neighbors
  • In new update rule, a node will
  • give a d fraction of its PR to its neighbors (split evenly)
  • give a 1-d fraction of its PR to all nodes (split evenly)
  • including self + neighbors
  • this guarantees that pagerank doesn’t accumulate anywhere
  • d is usually set to .85
39

What happens if the node is a sink?
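One round of the new update rule can be sketched as follows, for a graph that has no sinks (sinks need the extra handling covered on the following slides). The function name `damped_round` and the graph are assumptions made for illustration: the graph is a small rank trap in which C and D link only to each other, similar in spirit to Example 3.

```python
def damped_round(out_edges, rank, d=0.85):
    """One round of the evaporation rule on a graph without sinks."""
    n = len(out_edges)
    # Every node puts a (1 - d) fraction of its pagerank into a shared pool
    # that is split evenly among all n nodes.
    pool = sum((1 - d) * r for r in rank.values()) / n
    new_rank = {v: pool for v in out_edges}
    # The remaining d fraction is split evenly among the node's out-edges.
    for u, targets in out_edges.items():
        for v in targets:
            new_rank[v] += d * rank[u] / len(targets)
    return new_rank

# Assumed graph with a rank trap: C and D only link to each other.
trap = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["C"]}
rank = {v: 1 / len(trap) for v in trap}
for _ in range(50):
    rank = damped_round(trap, rank)
print({v: round(r, 3) for v, r in rank.items()})
```

C and D still end up with most of the pagerank, but A and B keep a small non-zero share because every node receives part of the evaporated pool each round.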

slide-40
SLIDE 40

Disappearing PageRank

40

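[Figure: two nodes, A and B, each with pagerank 1/2, and an edge from A to B. A pushes d/2 to B; A and B each give (1-d)/4 to each of the two nodes. B is a sink, so its d/2 share is not pushed anywhere.]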
slide-41
SLIDE 41

Disappearing PageRank

  • The pageranks no longer sum to 1:

$$\frac{1-d}{2} + \frac{1}{2} = 1 - \frac{d}{2} < 1$$

  • since 0 < d < 1
  • We lost d/2 of B’s pagerank when we updated
41

[Figure: after the update, A has pagerank (1-d)/2 and B has pagerank 1/2]
slide-42
SLIDE 42

Handling Sinks

  • There are several ways to handle sinks
  • The simplest is to modify the graph as follows
  • if v is a sink, add an edge from v to every node in the graph
  • This includes an edge from v to itself
  • Then use the update rule we described on slide #39
42
slide-43
SLIDE 43

The Real PageRank Algorithm

  • Add edges connecting every sink to every node
  • At every round each vertex
  • splits a d fraction of its PR evenly among its outgoing edges
  • splits a (1-d) fraction of its PR evenly among all nodes
  • receives PR from its incoming edges & from its share of the “evaporated” pageranks of all nodes

  • d is called the damping factor & is usually set to .85
43
slide-44
SLIDE 44

The Real PageRank Algorithm

  • At every round the PR of each vertex v is updated using:
44

$$
\begin{aligned}
\mathrm{PR}(v) &= \left( \sum_{u \in \mathrm{in}(v)} \frac{d \cdot \mathrm{PR}(u)}{|\mathrm{out}(u)|} \right) + \sum_{u \in V} \frac{(1-d) \cdot \mathrm{PR}(u)}{|V|} \\
&= \left( d \cdot \sum_{u \in \mathrm{in}(v)} \frac{\mathrm{PR}(u)}{|\mathrm{out}(u)|} \right) + \frac{1-d}{|V|} \cdot \sum_{u \in V} \mathrm{PR}(u) \\
&= \left( d \cdot \sum_{u \in \mathrm{in}(v)} \frac{\mathrm{PR}(u)}{|\mathrm{out}(u)|} \right) + \frac{1-d}{|V|} \\
&= \frac{1-d}{|V|} + d \cdot \sum_{u \in \mathrm{in}(v)} \frac{\mathrm{PR}(u)}{|\mathrm{out}(u)|}
\end{aligned}
$$

  • 1. u ∈ in(v): nodes with edges pointing to v
  • 2. d · PR(u): d fraction of u’s PR
  • 3. |out(u)|: number of edges leaving u
  • 4. (1-d) · PR(u): (1-d) fraction of u’s PR
  • The third line uses the fact that the pageranks of all nodes sum to 1
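A minimal sketch of the update rule with sink edges added, assuming a dict-of-lists graph and a fixed number of rounds k. The function name `pagerank` and its parameters are illustrative, not taken from the lecture. The two-node test graph (A links to B, B is a sink) is the one the "Disappearing PageRank" slides appear to use.

```python
def pagerank(out_edges, k, d=0.85):
    """Run k rounds of the damped update rule after adding sink edges."""
    vertices = list(out_edges)
    n = len(vertices)
    # Sink handling: a vertex with no outgoing edges links to every vertex,
    # including itself.
    edges = {u: (list(t) if t else vertices) for u, t in out_edges.items()}
    rank = {v: 1 / n for v in vertices}
    for _ in range(k):
        # Every vertex starts the round with the evaporated share (1 - d) / |V|.
        new_rank = {v: (1 - d) / n for v in vertices}
        for u, targets in edges.items():
            share = d * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# Assumed two-node graph: A -> B, and B is a sink.
ranks = pagerank({"A": ["B"], "B": []}, k=1)
print(ranks, sum(ranks.values()))
```

With the sink edges in place, the pageranks sum to 1 again after the round instead of losing d/2.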

slide-45
SLIDE 45

The Real PageRank

  • Runtime of a round O(|E|)
  • How many rounds should we run?
  • Ideally until the pageranks “stabilize”
  • pageranks stop changing even though we run more rounds
  • We can prove that
  • if we run for a large enough number of rounds, the pageranks will stabilize

  • that number could be very large for some graphs…
  • …but in practice it’s usually reasonable
45
slide-46
SLIDE 46

Alternative Sink Handling

  • You can also handle sinks without modifying the graph
  • but you need a slightly different update rule
46

$$\mathrm{PR}(v) = \frac{1-d}{|V|} + d \cdot \left( \sum_{u \in \mathrm{in}(v)} \frac{\mathrm{PR}(u)}{|\mathrm{out}(u)|} + \sum_{u \in \mathrm{sinks}(G)} \frac{\mathrm{PR}(u)}{|V|} \right)$$
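This alternative rule can be sketched without touching the graph, and combined with the "run until the pageranks stabilize" idea. The stopping threshold `tol` and the cap `max_rounds` are hypothetical parameters chosen here for illustration; the lecture does not specify a stopping criterion.

```python
def pagerank_until_stable(out_edges, d=0.85, tol=1e-10, max_rounds=1000):
    """Apply the sink-aware update rule until no pagerank moves more than tol."""
    vertices = list(out_edges)
    n = len(vertices)
    rank = {v: 1 / n for v in vertices}
    for rounds in range(1, max_rounds + 1):
        # Sinks share their pagerank evenly with every vertex.
        sink_share = sum(rank[u] for u in vertices if not out_edges[u]) / n
        link_sum = {v: 0.0 for v in vertices}
        for u, targets in out_edges.items():
            for v in targets:
                link_sum[v] += rank[u] / len(targets)
        new_rank = {
            v: (1 - d) / n + d * (link_sum[v] + sink_share) for v in vertices
        }
        change = max(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if change < tol:
            break
    return rank, rounds

# Assumed two-node graph: A -> B, and B is a sink.
ranks, rounds = pagerank_until_stable({"A": ["B"], "B": []})
print(ranks, rounds)
```

Giving each sink's pagerank to all nodes in equal parts has the same effect as adding an edge from the sink to every node, so the two approaches should produce the same pageranks.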
slide-47
SLIDE 47

PageRank requirements

  • Storage
  • Need to store a copy of the entire web graph
  • Google is estimated to store about 50 billion web pages
  • Average size of a page is 2 MB
  • …that’s about 100 petabytes (1 PB = 2^50 bytes)!
  • Hard to compute on such a large data set
  • need to store data on clusters of 1000’s of machines
  • need to coordinate all these machines to execute PageRank
  • Google File System, MapReduce, BigTable, etc.
47
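The storage estimate can be checked with a quick calculation, assuming decimal megabytes (1 MB = 10^6 bytes), which the slide does not state:

$$50 \times 10^{9}\ \text{pages} \times 2 \times 10^{6}\ \frac{\text{bytes}}{\text{page}} = 10^{17}\ \text{bytes} = 100 \times 10^{15}\ \text{bytes}$$

With the binary convention 1 PB = 2^50 bytes (about 1.126 × 10^15 bytes), the same total comes to roughly 89 PB, so "about 100 petabytes" holds as an order-of-magnitude figure under either convention.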
slide-48
SLIDE 48

PageRank in Practice

  • Google continually computes the pagerank of every webpage
  • When you query
  • your keyword narrows down the pages to return
  • then Google ranks them by their precomputed pagerank
  • We don’t know if Google still uses the original PageRank or some variant

48
slide-49
SLIDE 49

Other Applications of PageRank

  • PageRank is also used in other fields
  • Biology (studying protein interactions)
  • Neuroscience (finding importance of brain regions)
  • Engineering (finding anomalies)
  • Mathematics (analyzing graphs)
  • Sports (ranking sports teams)
  • Literature (importance/influence of books)
  • Bibliometrics (which authors are more influential)
49
slide-50
SLIDE 50

Readings

  • A film from 1976 about Andy’s FRESS system and its use in Introduction to Poetry
  • https://archive.org/details/AndyVanDamHypertextFilm
  • The book Networks, Crowds, and Markets by Easley and Kleinberg has a great overview of PageRank
  • the evaporation metaphor comes from there!
  • Other applications of PageRank
  • https://arxiv.org/pdf/1407.5107.pdf
  • Size of the web
  • http://www.worldwidewebsize.com
50