Srijan Kumar, Georgia Tech, CSE6240 Spring 2020: Web Search and Text Mining
CSE 6240: Web Search and Text Mining. Spring 2020
Web as a Network
- Prof. Srijan Kumar
Project Resources
Compute Resources: Got everyone access to PACE COC-ICE
Biology
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0040961
Connections between political blogs
Polarization of the network [Adamic-Glance, 2005]
A drug is likely to treat a disease if it is
[Figure labels: proteins targeted by a drug; proteins targeted by a disease]
[Figure: node v and the set Out(v)]
database generated pages
time from various sources, including voluntary submissions.
Tomkins, Broder, and Kumar
In(v) = {w | w can reach v}
Out(v) = {w | v can reach w}
[Figure: directed graph on nodes A, B, C, D, E, F, G]
For example: In(A) = {A,B,C,E,G}, Out(A) = {A,B,C,D,F}
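The two sets can be computed with a graph search: Out(v) is a search along the edges, In(v) is the same search on the graph with every edge reversed. The sketch below is a minimal illustration. The edge list EXAMPLE_ADJ is an assumption (the edges of the slide's figure are not recoverable from the text); it was chosen only so that it reproduces the In(A) and Out(A) sets quoted above.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of nodes reachable from start (start included), via BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def reverse(adj):
    """Return the graph with the direction of every edge flipped."""
    rev = {v: [] for v in adj}
    for v, ws in adj.items():
        for w in ws:
            rev.setdefault(w, []).append(v)
    return rev


def out_set(adj, v):
    """Out(v) = {w | v can reach w}."""
    return reachable(adj, v)


def in_set(adj, v):
    """In(v) = {w | w can reach v}: reachability in the reversed graph."""
    return reachable(reverse(adj), v)


# Hypothetical directed edges (an assumption, not the slide's actual figure).
EXAMPLE_ADJ = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["A", "F"],
    "D": [],
    "E": ["A"],
    "F": [],
    "G": ["A"],
}

print(sorted(in_set(EXAMPLE_ADJ, "A")))
print(sorted(out_set(EXAMPLE_ADJ, "A")))
```

Each call costs one traversal, O(nodes + edges).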
[Figure: two example graphs on nodes A, B, C, D, E]
[Figure: directed graph on nodes A, B, C, D, E, F, G]
Strongly connected components of the graph: {A,B,C,G}, {D}, {E}, {F}
[Figure: directed graph G on nodes A, B, C, D, E, F, G]
(1) Strongly connected components of graph G: {A,B,C,G}, {D}, {E}, {F}
(2) G' is a DAG:
[Figure: graph G' with one node per component: {A,B,C,G}, {E}, {D}, {F}]
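Both steps can be sketched in a few lines. The component containing v is the set of nodes that v can reach and that can also reach v, and G' gets one node per component with an edge wherever an original edge crosses between components. The edge list EXAMPLE_ADJ is hypothetical (an assumption, since the figure's edges are not in the text); it was picked only so that its components match the four sets listed above. This simple version runs one pair of searches per component; linear-time alternatives such as Tarjan's or Kosaraju's algorithm exist.

```python
def reachable(adj, start):
    """Set of nodes reachable from start (start included), via iterative DFS."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def strongly_connected_components(adj):
    """Map each node to its component (a frozenset of nodes)."""
    rev = {v: [] for v in adj}
    for v, ws in adj.items():
        for w in ws:
            rev[w].append(v)
    comp_of = {}
    for v in adj:
        if v not in comp_of:
            # Nodes v can reach, intersected with nodes that can reach v.
            scc = frozenset(reachable(adj, v) & reachable(rev, v))
            for u in scc:
                comp_of[u] = scc
    return comp_of


def condensation_edges(adj, comp_of):
    """Edges of G': pairs of distinct components joined by an original edge."""
    return {
        (comp_of[v], comp_of[w])
        for v, ws in adj.items()
        for w in ws
        if comp_of[v] != comp_of[w]
    }


# Hypothetical directed edges (an assumption, not the slide's actual figure).
EXAMPLE_ADJ = {
    "A": ["B"],
    "B": ["C", "D"],
    "C": ["G"],
    "D": [],
    "E": ["A"],
    "F": [],
    "G": ["A", "F"],
}

COMP_OF = strongly_connected_components(EXAMPLE_ADJ)
DAG_EDGES = condensation_edges(EXAMPLE_ADJ, COMP_OF)
```

G' cannot contain a cycle: a cycle through two components would make their nodes mutually reachable, so they would have been one component.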
[Figure: node v with Out(v); node A with In(A)]
[Figure: graph on nodes A, B, C, D, E, F, G, H with Out(A) and In(A) marked]
[Figure: Giant SCC1 and Giant SCC2]
x-axis: rank; y-axis: number of reached nodes
203 million pages, 1.5 billion links [Broder et al. 2000]
The degree distribution: $P(k) \sim k^{-2}$
[Plot: P(k) versus k; x-axis ticks 1 to 4, y-axis ticks 0.1 to 0.6]
k: degree
$N_k$: number of nodes with degree k
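With these counts, the degree distribution is the normalized histogram P(k) = N_k / N, where N is the number of nodes. A minimal sketch for an undirected edge list follows. The edge list EXAMPLE_EDGES is made up for illustration, and nodes that appear in no edge are not counted; for a directed graph such as the web, in-degree and out-degree would be tallied separately.

```python
from collections import Counter


def degree_distribution(edges):
    """Return {k: P(k)} with P(k) = N_k / N for an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n_k = Counter(degree.values())  # N_k: number of nodes with degree k
    n = len(degree)                 # N: number of nodes that appear in an edge
    return {k: n_k[k] / n for k in sorted(n_k)}


# Hypothetical undirected edges (an assumption, for illustration only).
EXAMPLE_EDGES = [(1, 2), (1, 3), (1, 4), (2, 3)]

P = degree_distribution(EXAMPLE_EDGES)
print(P)
```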
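Average path length: $\bar{h} = \frac{1}{2 E_{\max}} \sum_{i,\, j \neq i} h_{ij}$
where $h_{ij}$ is the distance from node i to node j and $E_{\max}$ is the maximum number of edges (the total number of node pairs), $n(n-1)/2$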
neighbors w into the queue and mark $h_u(w) = h_u(v) + 1$
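The queue step above is the core of breadth-first search. A minimal sketch: bfs_distances returns the hop count h_u(w) for every node w reachable from u, and path_statistics runs it from every node to get the average distance over ordered pairs of distinct nodes and the largest distance (the diameter). The function names and the four-node path graph EXAMPLE_ADJ are assumptions made for illustration, and the graph is assumed to be connected.

```python
from collections import deque


def bfs_distances(adj, u):
    """Return {w: h_u(w)}, the hop distance from u to every reachable node w."""
    h = {u: 0}
    queue = deque([u])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in h:          # w is still unmarked
                h[w] = h[v] + 1     # mark h_u(w) = h_u(v) + 1
                queue.append(w)
    return h


def path_statistics(adj):
    """Return (average path length, diameter) over reachable ordered pairs."""
    total, pairs, diameter = 0, 0, 0
    for u in adj:
        for w, d in bfs_distances(adj, u).items():
            if w != u:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return total / pairs, diameter


# Hypothetical undirected path graph a - b - c - d (an assumption).
EXAMPLE_ADJ = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
}

AVG_PATH_LENGTH, DIAMETER = path_statistics(EXAMPLE_ADJ)
```

Running a search from every node costs O(nodes x (nodes + edges)), which is why measurements on web-scale graphs typically start from a sample of nodes instead.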
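[Figure: three example nodes i with $C_i = 0$, $C_i = 1/3$, and $C_i = 1$]
$C_i = \frac{2 e_i}{k_i (k_i - 1)}$
where $e_i$ is the number of edges between the neighbors of node i and $k_i$ is the degree of node i
Average clustering coefficient: $C = \frac{1}{N} \sum_{i=1}^{N} C_i$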
[Figure: example graph on nodes A, B, C, D, E, F, G, H]
$k_B = 2$, $e_B = 1$, $C_B = 2/2 = 1$
$k_D = 4$, $e_D = 2$, $C_D = 4/12 = 1/3$
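The arithmetic for B and D follows the pattern C_i = 2 e_i / (k_i (k_i - 1)): count the edges among a node's neighbors and divide by the number of neighbor pairs. A minimal sketch that reproduces both values is below. The edge list EXAMPLE_EDGES is hypothetical (the figure's edges are not in the text) and was chosen only so that B and D have the degrees and neighbor-edge counts quoted above; returning 0 for nodes with fewer than two neighbors is a common convention, assumed here.

```python
from fractions import Fraction
from itertools import combinations


def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, i):
    """C_i = 2 * e_i / (k_i * (k_i - 1)), as an exact fraction."""
    neighbors = adj[i]
    k = len(neighbors)
    if k < 2:
        return Fraction(0)  # convention (assumption): no neighbor pairs
    # e_i: number of edges between the neighbors of node i
    e = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return Fraction(2 * e, k * (k - 1))


def average_clustering(adj):
    """C = (1/N) * sum of C_i over all nodes."""
    return sum(clustering_coefficient(adj, i) for i in adj) / len(adj)


# Hypothetical undirected edges (an assumption, not the slide's actual figure).
EXAMPLE_EDGES = [
    ("A", "B"), ("A", "C"), ("B", "C"),
    ("C", "D"), ("C", "E"), ("D", "E"),
    ("D", "G"), ("D", "H"), ("G", "H"),
    ("F", "G"),
]

ADJ = build_adjacency(EXAMPLE_EDGES)
print(clustering_coefficient(ADJ, "B"), clustering_coefficient(ADJ, "D"))
```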
Note: We plotted the same data as on the previous slide, just the axes are now logarithmic.
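Logarithmic axes are the usual way to inspect a power law such as the k^-2 degree distribution quoted earlier in these slides: if N_k is proportional to k^(-alpha), then log N_k is linear in log k with slope -alpha. A minimal sketch with synthetic data follows. The counts are made up for illustration, and a least-squares line through log-log points is only a rough estimate of the exponent on real, noisy data.

```python
import math


def loglog_slope(ks, counts):
    """Least-squares slope of log(count) against log(k)."""
    xs = [math.log(k) for k in ks]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic counts that follow N_k = 1000 * k^-2 exactly (an assumption).
KS = [1, 2, 4, 8, 16]
COUNTS = [1000 * k ** -2 for k in KS]

SLOPE = loglog_slope(KS, COUNTS)
print(round(SLOPE, 6))
```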
Average clustering coefficient of nodes with degree k:
$C_k = \frac{1}{N_k} \sum_{i:\, k_i = k} C_i$
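The per-degree average, taking the mean of C_i over the N_k nodes whose degree is exactly k, can be sketched as follows. The graph EXAMPLE_ADJ (a triangle with one extra node attached) is an assumption for illustration, and nodes with fewer than two neighbors are given C_i = 0 by convention.

```python
from collections import defaultdict
from itertools import combinations


def clustering_by_degree(adj):
    """Return {k: C_k}, the mean clustering coefficient of nodes with degree k."""
    per_degree = defaultdict(list)
    for i, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            c = 0.0  # convention (assumption): no neighbor pairs
        else:
            # e_i: edges between the neighbors of node i
            e = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
            c = 2 * e / (k * (k - 1))
        per_degree[k].append(c)
    # C_k = (1 / N_k) * sum of C_i over nodes i with k_i = k
    return {k: sum(cs) / len(cs) for k, cs in sorted(per_degree.items())}


# Hypothetical undirected graph: triangle a, b, c plus node d attached to c.
EXAMPLE_ADJ = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

C_K = clustering_by_degree(EXAMPLE_ADJ)
print(C_K)
```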
Number of links between pairs of nodes
Average path length: 6.6
90% of the nodes can be reached in < 8 hops

Steps   #Nodes
0       1
1       10
2       78
3       3,96
4       8,648
5       3,299,252
6       28,395,849
7       79,059,497
8       52,995,778
9       10,321,008
10      1,955,007
11      518,410
12      149,945
13      44,616
14      13,740
15      4,476
16      1,542
17      536
18      167
19      71
20      29
21      16
22      10
23      3
24      2
25      3
# nodes as we do BFS out of a random node