A foray into graph mining
Neil Shah April 15th, 2019
(Graph) data is prevalent
2.5 exabytes of data produced every day
90% generated in the last 2 years
Data is produced as the product of a highly interconnected world
1.3 billion users, 1 billion daily mobile views
244 million users, 480 million products
187 million daily actives, 3.5 billion daily snaps
Movie recommendation
Search engine ranking
Product purchasing
Social platform interactions
[Figure: users-by-users graph over nodes u1 to u11]
adjacent edge weights
communicate”
[Figure: weighted users-by-users graph over nodes u1 to u11; numeric labels 3, 4, 1, 2, 9, 1, 6]
edges within one “hop” from ego
path between them.
pairs are connected.
node and/or edge types.
[Figure: users-by-products bipartite graph; users u1 to u6, products p1 to p5]
[Figure: users-by-users graph over nodes u1 to u11 and its users-by-users adjacency matrix with entries of 1]
seemingly random.
world graphs are far from random.
Lyon ’03 Trace-route paths
(binom.)
Babaoglu ’18
[Figure: log-log plots of log(# posts) vs. log(# users), log(# visitors) vs. log(# sites), and log(# peers) vs. log(# routers)]
Faloutsos ‘99 Viswanath ‘09 Adamic ‘02
distribution obeys a power-law:
proportional scaling of the whole function
Newman ‘05
connecting it to $m$ already existing nodes
the degree $k_i$ as
academic citations, recommendation, virality
the “six degrees of separation”
Boston was 6.2 (sample size 64)
has mode 6 (sample size 180M nodes and 1.3B edges)
connected to hubs, which facilitate paths
$e(t) \propto n(t)^{a}$ (power law!) with $1 \le a \le 2$, generally
Power-law in # edges vs. # nodes (over time)
between node pairs)
actually shrinks over time, instead
tend to get closer
prevalence and growth of hubs
… the list goes on
hyperlinked body of pages by importance and relevance?
rank page importance according to connectivity patterns
Backlinks and Forward links:
A and B are C’s backlinks
C is A and B’s forward link
Content adapted from Li ‘09
probability distribution
walk on the graph; a surfer keeps clicking successive pages at random.
Idea: each page equally distributes its own PageRank to its forward-links recursively. “An important page has many important pages pointing to it”
PageRank Calculation: first iteration
[Figure: adjacency matrix, transposed and column-normalized (accounts for equal neighbor distribution), over Yahoo, Amzn, MS, applied to the initial PageRank scores. Read as “Amazon gives ½ of its own PageRank to Yahoo and Microsoft each”.]
PageRank Calculation: second iteration
Convergence after some iterations
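The iterations above can be sketched as repeated multiplication by the column-normalized matrix. This is a minimal sketch, not the presenter's code: only Amazon's column (½ to Yahoo, ½ to Microsoft) is stated on the slides, so the Yahoo and Microsoft columns, the function name `pagerank_simple`, and the tolerance are assumptions for illustration.

```python
import numpy as np

# Column j holds the shares that page j hands to the pages it links to.
# Assumed link structure: Yahoo -> {Yahoo, Amzn}, Amzn -> {Yahoo, MS}, MS -> {Amzn}.
PAGES = ["Yahoo", "Amzn", "MS"]
M = np.array([
    [0.5, 0.5, 0.0],  # rank received by Yahoo
    [0.5, 0.0, 1.0],  # rank received by Amzn
    [0.0, 0.5, 0.0],  # rank received by MS
])


def pagerank_simple(M, tol=1e-12, max_iter=1000):
    """Power iteration r <- M r, starting from uniform scores."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)  # initial PageRank scores
    for _ in range(max_iter):
        r_next = M @ r
        if np.abs(r_next - r).sum() < tol:  # L1 change between iterations
            return r_next
        r = r_next
    return r


ranks = pagerank_simple(M)
for page, score in zip(PAGES, ranks):
    print(f"{page}: {score:.3f}")
```

Under these assumed links the scores settle at 2/5 for Yahoo, 2/5 for Amazon, and 1/5 for Microsoft, which is the fixed point of r = M r.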
A loop:
During each iteration, the loop accumulates rank but never distributes rank to other pages!
[Figure: adjacency matrix, transposed and column-normalized (accounts for equal neighbor distribution), over Yahoo, Amzn, MS, applied to the initial PageRank scores. Read as “Microsoft gives all of its PageRank to Microsoft”.]
All roads lead to Microsoft
configured/normalized adjacency matrix, due to Markov chain theory! Cool!
the surfer having a random jump probability.
$E(u)$: a distribution of ranks of web pages that the surfer can jump to when he/she “gets bored” after clicking on successive links.
[Figure: adjacency matrix, transposed and column-normalized (accounts for equal neighbor distribution), over Yahoo, Amzn, MS, applied to the initial PageRank scores with a 20% random jump probability]
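A minimal sketch of how the random jump repairs the loop, assuming a uniform jump distribution over the three pages and the link structure Yahoo -> {Yahoo, Amzn}, Amzn -> {Yahoo, MS}, MS -> {MS}. Only the Microsoft self-loop and the 20% jump probability come from the slides; the other links and the name `pagerank` are hypothetical.

```python
import numpy as np

# Spider-trap variant: Microsoft's column sends all of its rank back to itself.
M_trap = np.array([
    [0.5, 0.5, 0.0],  # rank received by Yahoo
    [0.5, 0.0, 0.0],  # rank received by Amzn
    [0.0, 0.5, 1.0],  # rank received by MS
])


def pagerank(M, jump=0.2, tol=1e-12, max_iter=1000):
    """Power iteration with a random jump: r <- (1 - jump) * M r + jump * e."""
    n = M.shape[0]
    e = np.full(n, 1.0 / n)  # assumed uniform jump distribution
    r = e.copy()
    for _ in range(max_iter):
        r_next = (1.0 - jump) * (M @ r) + jump * e
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r


no_jump = pagerank(M_trap, jump=0.0)    # the loop absorbs all rank
with_jump = pagerank(M_trap, jump=0.2)  # the 20% jump keeps rank circulating
print(no_jump, with_jump)
```

Without the jump the scores collapse to (0, 0, 1). With a 20% jump they settle at (7/33, 5/33, 21/33), so Microsoft still ranks highest but no longer holds everything.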
what nodes should we recommend a user u to promote engagement?
neighborhood!
Liben-Nowell ‘04
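A minimal sketch of neighborhood-based recommendation: score each non-neighbor of u by how much its neighborhood overlaps with u's. Common neighbors, Jaccard, and Adamic-Adar are standard scores of this kind; the toy graph, the function names, and the tie-breaking rule are assumptions for illustration.

```python
import math


def build_adjacency(edges):
    """Undirected adjacency sets from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def common_neighbors(adj, u, v):
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    # Shared neighbors with few connections count more than shared hubs.
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[u] & adj[v] if len(adj[z]) > 1)


def recommend(adj, u, score=common_neighbors, k=3):
    """Top-k nodes not yet linked to u, ranked by the given score."""
    candidates = [v for v in adj if v != u and v not in adj[u]]
    ranked = sorted(candidates, key=lambda v: (-score(adj, u, v), v))
    return ranked[:k]


# Hypothetical toy graph.
edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u3"),
         ("u2", "u4"), ("u3", "u4"), ("u4", "u5")]
adj = build_adjacency(edges)
print(recommend(adj, "u1", k=1))
```

In this toy graph u4 shares both of u1's neighbors while u5 shares none, so u4 is the recommendation under all three scores.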
what nodes should we recommend a user u to promote engagement?
factor model/embedding that compactly encodes “interests”
blocks/communities
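$A_{n \times m} \approx U_{n \times k} \, \Sigma_{k \times k} \, V^{\top}_{k \times m}$
$\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k)$, with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k$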
blocks/communities
n users m videos
“music lovers” “artist spotlights” “adrenaline junkies” “action movies” “dabbling cooks” “baking shows”
$A \approx \sigma_1 u_1 v_1^{\top} + \sigma_2 u_2 v_2^{\top} + \sigma_3 u_3 v_3^{\top} + \dots$
space which “summarize” user/item affinities towards latent factors
(depending on application)
Koren ‘09
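A minimal sketch of the latent-factor idea using a truncated SVD of a small users-by-videos matrix. The matrix values, the choice k = 2, and the names `truncated_svd`, `user_factors`, and `video_factors` are assumptions for illustration; production recommenders (as in Koren ’09) typically fit factors with regularized objectives rather than a plain SVD.

```python
import numpy as np

# Hypothetical 6 users x 4 videos interaction matrix with two planted blocks:
# users 0-2 favor videos 0-1, users 3-5 favor videos 2-3.
A = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 4, 3],
    [0, 0, 4, 4],
], dtype=float)


def truncated_svd(A, k):
    """Keep the k largest singular values and their singular vectors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    return U[:, :k], s[:k], Vt[:k, :]


U_k, s_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ np.diag(s_k) @ Vt_k   # rank-2 approximation of A
user_factors = U_k * s_k          # n x k: each user's affinity to each factor
video_factors = Vt_k.T            # m x k: each video's affinity to each factor

print(np.round(s_k, 3))
print(np.round(user_factors, 2))
```

Each retained factor lines up with one block here: the first factor is nonzero only for users 0 to 2 and the second only for users 3 to 5, which is the "blocks/communities" reading of the decomposition.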
interactions? Are there natural “clusters” of behaviors?
can indicate community behaviors. These are useful for
advertiser query
Content adapted from Leskovec ‘10
algorithms are common
n: nodes in S
m: edges in S
c: edges pointing outside S
to graph “cut”)
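A minimal sketch of scoring a candidate cluster S from the quantities n, m, and c, using conductance c / (2m + c), one common cluster quality measure where lower is better. It assumes an undirected edge list and takes c to be the number of edges with exactly one endpoint in S; the toy graph and function names are hypothetical.

```python
def cut_stats(edges, S):
    """Return (n, m, c) for node set S in an undirected edge list."""
    S = set(S)
    m = sum(1 for u, v in edges if u in S and v in S)      # edges inside S
    c = sum(1 for u, v in edges if (u in S) != (v in S))   # edges crossing the cut
    return len(S), m, c


def conductance(edges, S):
    """c / (2m + c): share of S's edge endpoints that lead outside S."""
    _, m, c = cut_stats(edges, S)
    volume = 2 * m + c
    return c / volume if volume else 0.0


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
print(cut_stats(edges, {1, 2, 3}))
print(conductance(edges, {1, 2, 3}))
```

For the triangle {1, 2, 3} this gives n = 3, m = 3, c = 1 and a conductance of 1/7, while a set that splits a triangle scores noticeably worse.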
uncoarsen the graph
Interestingly, Local Spectral clusters are more compact and tighter, despite having higher (worse) conductance than METIS!
similar
ODF prefers large clusters
clusters are very sparse)
analyze them
external connectivity)
looking clusters according to human intuition
and can we find such anomalies automatically?
building null/normal models and penalizing excessive deviation
a real community)
Content adapted from Akoglu ‘10
Near-star: telemarketer, port scanner, people adding friends indiscriminately, etc.
Near-clique: tightly connected people, terrorist groups?, discussion group, etc.
Heavy vicinity: too much money wrt number of donors, etc.
Dominant heavy link: single-minded, tight company
α: Differentiates “dense” from “sparse” neighborhoods
β: Differentiates “heavy” from “light” neighborhoods
γ: Differentiates “uniform” distribution from “dominant” heavy edges
Anomaly ≈ violates our “laws”; far away from most points
score_dist = distance to fitting line
score_outl = outlierness score
score = func(score_dist, score_outl)
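A minimal sketch of how the two scores might be computed and combined, assuming each egonet is summarized by a feature pair (x, y) such as node count and edge count. The vertical log-space residual used for score_dist, the k-nearest-neighbor distance used as a stand-in for score_outl, the max-normalized sum used for func, and all data values are assumptions for illustration, not the scoring of Akoglu ‘10.

```python
import numpy as np


def anomaly_scores(x, y, k=2):
    """Return (score, score_dist, score_outl) for feature pairs (x, y)."""
    lx, ly = np.log(x), np.log(y)

    # Fit the power law y = C * x^theta as a least-squares line in log-log space.
    theta, log_c = np.polyfit(lx, ly, 1)
    score_dist = np.abs(ly - (theta * lx + log_c))  # distance to fitting line

    # Outlierness stand-in: mean distance to the k nearest other points.
    pts = np.column_stack([lx, ly])
    pairwise = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    # After sorting each row, column 0 is the zero distance to the point itself.
    score_outl = np.sort(pairwise, axis=1)[:, 1:k + 1].mean(axis=1)

    def normalize(s):
        top = s.max()
        return s / top if top > 0 else s

    score = normalize(score_dist) + normalize(score_outl)
    return score, score_dist, score_outl


# Hypothetical egonets: node counts and edge counts.
# The last egonet is a near-clique: 10 nodes with all 45 possible edges.
nodes = np.array([8, 9, 10, 11, 12, 13, 14, 15, 10], dtype=float)
edge_counts = np.array([10, 11, 12, 14, 15, 16, 18, 19, 45], dtype=float)

score, score_dist, score_outl = anomaly_scores(nodes, edge_counts)
print(np.argsort(-score))  # egonet indices, most anomalous first
```

Sorting by the combined score puts the near-clique first, and looking at which side of the fitted line a node falls on indicates the type of anomaly.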
✓ can interpret the type of anomaly
✓ can sort nodes wrt their outlierness scores
Part of a group of posts that all link to each other
Post linking to many other posts indiscriminately
Has published 40 papers, but to the same conference (and nowhere else)
Has published hundreds of papers, to almost as many conferences!
Alice
Content adapted from Shin ‘16
[Figure: accounts-by-restaurants adjacency matrix]
are sparse on the time axis (formed gradually)
blocks are also dense
synchronous behavior)
blocks are denser than natural dense blocks in the tensor model
[Figure: accounts × restaurants × timestamp tensor with sparse and dense blocks]
A cell indicates that account i rates restaurant j at time t
Adjacency Tensor
TCP Dumps: Src IP × Dst IP × Timestamp
Time-evolving Social Network: Src User × Dst User × Timestamp
Wikipedia Revision History: User × Page × Timestamp
“suspiciousness” metric)
Assume a block (subtensor) $B$ in a 3-way tensor $R$
= *+ *- *. (
Some notable choices:
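Traditional Density: $\rho(B, R) = \mathrm{Mass}(B) / \mathrm{Vol}(B)$ (maximized by single entry with max. value)
Arithmetic Avg. Degree: $\rho_{\mathrm{ari}}(B, R) = \mathrm{Mass}(B) / \mathrm{Size}(B)$
Geometric Avg. Degree: $\rho_{\mathrm{geo}}(B, R) = \mathrm{Mass}(B) / \sqrt[3]{\mathrm{Vol}(B)}$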
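A minimal sketch of the three density choices for a block of a 3-way tensor, taking Mass as the sum of the block's entries, Vol as the product of its side lengths, and Size as their sum. The dense NumPy representation, the function names, and the toy tensor are assumptions for illustration.

```python
import numpy as np


def block_measures(T, rows, cols, times):
    """Mass, volume and size of the block selected by three index lists."""
    block = T[np.ix_(rows, cols, times)]
    mass = float(block.sum())
    vol = len(rows) * len(cols) * len(times)
    size = len(rows) + len(cols) + len(times)
    return mass, vol, size


def densities(T, rows, cols, times):
    mass, vol, size = block_measures(T, rows, cols, times)
    return {
        "traditional": mass / vol,
        "arithmetic": mass / size,
        "geometric": mass / vol ** (1.0 / 3.0),
    }


# Hypothetical tensor: a 2 x 2 x 2 block of 3s plus one isolated entry of 9.
T = np.zeros((4, 4, 4))
T[:2, :2, :2] = 3.0
T[3, 3, 3] = 9.0

dense_block = densities(T, [0, 1], [0, 1], [0, 1])
single_cell = densities(T, [3], [3], [3])
print(dense_block)
print(single_cell)
```

The single cell wins under traditional density (9 vs. 3) but loses under both average-degree measures (3 vs. 4 and 9 vs. 12), which illustrates why traditional density is a poor objective for finding large suspicious blocks.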
ρ = 2.9
ρ = 3
ρ = 3.3
ρ = 3.6
ρ = 3.6
Find & Remove → Find & Remove → Find & Remove → Restore
Among slices in the same mode, removing the slice with minimum mass is always best
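The observation above suggests a greedy search: at each step, look only at the minimum-mass slice of each mode, remove whichever of those candidates leaves the densest block, and remember the best block seen along the way. This is a minimal single-block sketch under assumptions (dense NumPy tensor, density taken as Mass / Size, hypothetical function name and toy data), not the M-Zoom implementation of Shin ‘16.

```python
import numpy as np


def greedy_dense_block(T):
    """Greedy slice removal; returns (best density, index lists per mode)."""
    T = np.asarray(T, dtype=float)
    keep = [np.ones(n, dtype=bool) for n in T.shape]  # slices still in the block
    best_rho, best_keep = -1.0, None

    while all(k.any() for k in keep):
        block = T[np.ix_(*keep)]
        mass = block.sum()
        size = sum(int(k.sum()) for k in keep)
        rho = mass / size
        if rho > best_rho:
            best_rho, best_keep = rho, [k.copy() for k in keep]

        # One candidate per mode: the remaining slice with minimum mass.
        candidates = []
        for mode in range(T.ndim):
            other_axes = tuple(a for a in range(T.ndim) if a != mode)
            slice_mass = block.sum(axis=other_axes)
            local = int(np.argmin(slice_mass))
            index = int(np.flatnonzero(keep[mode])[local])
            new_rho = (mass - slice_mass[local]) / (size - 1)
            candidates.append((new_rho, mode, index))

        # Remove the candidate whose removal leaves the densest block.
        _, mode, index = max(candidates, key=lambda cand: cand[0])
        keep[mode][index] = False

    return best_rho, [np.flatnonzero(k).tolist() for k in best_keep]


# Hypothetical tensor: a planted 3 x 3 x 3 block of 5s plus one stray entry.
T = np.zeros((5, 5, 5))
T[1:4, 0:3, 2:5] = 5.0
T[0, 4, 0] = 1.0

best_rho, block = greedy_dense_block(T)
print(best_rho, block)
```

On this toy tensor the search peels away the empty and near-empty slices first and reports the planted block with density 135 / 9 = 15. Finding several blocks would repeat the search after removing each found block's entries, as in the Find & Remove steps.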
Density metric Input Tensor Order Densest Block
TCP connections forming the densest blocks are network attacks
[Figure: first three blocks found; modes Src IP, Dst IP, Timestamp]
Page edit wars: 11 users revised 10 pages 2,305 times within 16 hours
[Figure: first three blocks found by M-Zoom; modes User, Page, Timestamp]
leverage large-scale interaction patterns to