DS504/CS586: Big Data Analytics, Graph Mining II. Prof. Yanhua Li



slide-1
SLIDE 1

Welcome to DS504/CS586: Big Data Analytics
Graph Mining II

  • Prof. Yanhua Li
  • Time: 6-8:50PM Thursday
  • Location: AK233
  • Spring 2018

slide-2
SLIDE 2

Logistics

Course Project I has been graded.

  • Grading was based on:
      1. Project report
      2. Project team presentation
      3. Self- and cross-evaluation form
      4. In-class survey/evaluation form
  • I also provided comments on your project reports in the Canvas discussion forum.
  • If you are interested in publishing your results, talk to me. (Totally optional.)

slide-3
SLIDE 3

Logistics

Course Project II

  • Projects will be in groups!
  • 4-6 students per group, depending on enrollment
  • "Research-oriented" team project timeline:
      • Starting date: Week 8 (R), 3/1
      • Project proposal due date: Week 10 (R), 3/15
      • Project progress presentation: TBD, 15 mins per team
      • Project due date: Week 16 (R), 4/26
      • Project final presentation: Week 16 (R), 4/26

slide-4
SLIDE 4

Graph Data

Graphs are everywhere.

  • Program flow
  • Ecological network
  • Biological network
  • Social network
  • Chemical network
  • Web graph

slide-5
SLIDE 5

Complex Graphs

Real-life graphs contain complex content: labels associated with nodes, edges, and graphs.

Node labels: Location, Gender, Charts, Library, Events, Groups, Journal, Tags, Age, Tracks.

slide-6
SLIDE 6

Large Graphs

Network       # of Users     # of Links
Facebook      400 Million    52K Million
Twitter       105 Million    10K Million
LinkedIn      60 Million     0.9K Million
Last.FM       40 Million     2K Million
LiveJournal   25 Million     2K Million
del.icio.us   5.3 Million    0.7K Million
DBLP          0.7 Million    8 Million

Large Scale Graphs.

slide-7
SLIDE 7

Mining in Big Graphs

  • Network statistic analysis (last lecture)
      • Network size
      • Degree distribution
  • Node ranking (this lecture)
      • Identifying the most important/influential nodes
      • Viral marketing, resource allocation

slide-8
SLIDE 8

Characterize Node Importance

  • Rank webpages in a search engine.
  • Viral marketing, resource allocation
  • Open a new restaurant: find the optimal location
  • …

slide-9
SLIDE 9

Brainstorming

  • Node importance

[Figure: example graph with nodes 1-6]

They are equivalent.

slide-10
SLIDE 10

Ranking nodes on an undirected graph

[Figure: connected undirected graph with nodes 1-6; |V| = 6, |E| = 7]

Node   Degree (local importance)   Stationary distribution (global importance)
5      d(5) = 4                    π(5) = 4/14
3      d(3) = 3                    π(3) = 3/14
4      d(4) = 2                    π(4) = 2/14
2      d(2) = 2                    π(2) = 2/14
1      d(1) = 2                    π(1) = 2/14
6      d(6) = 1                    π(6) = 1/14

On connected graphs the two rankings are equivalent, since $\pi(i) = d(i) / (2|E|) = d(i)/14$.

slide-11
SLIDE 11

Ranking nodes on a directed graph

[Figure: directed graph with nodes 1-6]

Node in- and out-degree (local importance), each column sorted from highest to lowest:

In-degree      Out-degree
din(3) = 3     dout(5) = 3
din(5) = 2     dout(3) = 2
din(1) = 2     dout(1) = 2
din(2) = 2     dout(4) = 2
din(4) = 1     dout(2) = 1
din(6) = 1     dout(6) = 1

Stationary distribution (global importance), for strongly connected and aperiodic graphs:

π(5) = ?, π(4) = ?, π(3) = ?, π(2) = ?, π(1) = ?, π(6) = ?

Are they equivalent?

slide-12
SLIDE 12

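Random Walk (Undirected Graph)

[Figure: undirected graph with nodes 1, 2, 3, 4]

  • Adjacency matrix (symmetric, because the graph is undirected) and degree matrix:

$$A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 3 & & & \\ & 2 & & \\ & & 3 & \\ & & & 2 \end{pmatrix}$$

  • Transition probability matrix, with $P_{ij} = 1/k_i$ for every neighbor j of node i:

$$P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix}$$

  • |E|: number of links (here |E| = 5)
  • One step of the walk:

$$x_{t,i} = \sum_j x_{t-1,j}\, p_{ji}$$

  • Stationary distribution:

$$\pi_i = \frac{d_i}{2|E|}$$

    so π(1) = 3/10, π(3) = 3/10, π(2) = 2/10, π(4) = 2/10.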
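As a quick check of the degree formula, the following is a minimal sketch in plain Python. It assumes the four-node graph has edges 1-2, 1-3, 1-4, 2-3 and 3-4 (the only simple graph with degrees 3, 2, 3, 2); the function names are illustrative and not taken from the slides. It builds the row-stochastic transition matrix and confirms that $\pi_i = d_i / (2|E|)$ is unchanged by one step of the walk.

```python
from fractions import Fraction

# Assumed edge list: the only simple graph on nodes 1-4 with degrees 3, 2, 3, 2.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4)]
NODES = [1, 2, 3, 4]


def build_neighbors(nodes, edges):
    """Map each node to the set of its neighbors in an undirected graph."""
    nbrs = {v: set() for v in nodes}
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs


def transition_matrix(nbrs):
    """Row-stochastic P with P[i][j] = 1/k_i when j is a neighbor of i."""
    return {
        i: {j: Fraction(1, len(nbrs[i])) if j in nbrs[i] else Fraction(0)
            for j in nbrs}
        for i in nbrs
    }


def walk_step(x, P):
    """One step of the walk: x_t[i] = sum_j x_{t-1}[j] * P[j][i]."""
    return {i: sum(x[j] * P[j][i] for j in P) for i in P}


neighbors = build_neighbors(NODES, EDGES)
P = transition_matrix(neighbors)

# Degree-based candidate for the stationary distribution: pi_i = d_i / (2|E|).
pi = {i: Fraction(len(neighbors[i]), 2 * len(EDGES)) for i in NODES}

# pi is stationary exactly when one more step leaves it unchanged.
is_stationary = walk_step(pi, P) == pi
print(pi, is_stationary)
```

Exact fractions are used instead of floats so that the stationarity check is an equality test rather than a tolerance comparison.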

slide-13
SLIDE 13

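Random Walk (directed graph)

[Figure: directed graph with nodes 1, 2, 3, 4; strongly connected and aperiodic]

  • Adjacency matrix (asymmetric) and out-degree matrix:

$$A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 2 & & & \\ & 1 & & \\ & & 3 & \\ & & & 1 \end{pmatrix}$$

  • Transition probability matrix, with $P_{ij} = 1/k_{out,i}$ for every out-neighbor j of node i:

$$P = D^{-1} A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

  • |E|: number of directed links (here |E| = 7)
  • One step of the walk:

$$x_{t,i} = \sum_j x_{t-1,j}\, p_{ji}$$

  • Stationary distribution: in general $\pi_i \neq d_i / (2|E|)$. For this graph:
      • π(1) = 6/18 = 1/3
      • π(2) = 4/18 = 2/9
      • π(3) = 3/18 = 1/6
      • π(4) = 5/18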
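The stationary distribution of the directed example can be approximated by repeating the walk step until it stops changing (power iteration). The sketch below is illustrative only: it assumes the out-links 1→{2, 3}, 2→{4}, 3→{1, 2, 4}, 4→{1}, which match the out-degrees 2, 1, 3, 1 and reproduce the stated values of π; the function names and the iteration count are assumptions.

```python
# Assumed out-links of the four-node directed example (out-degrees 2, 1, 3, 1).
OUT_LINKS = {1: [2, 3], 2: [4], 3: [1, 2, 4], 4: [1]}


def walk_step(x, out_links):
    """x_t[i] = sum_j x_{t-1}[j] * p_ji, with p_ji = 1/k_out(j) if j links to i."""
    nxt = {v: 0.0 for v in out_links}
    for j, targets in out_links.items():
        share = x[j] / len(targets)  # node j splits its mass over its out-links
        for i in targets:
            nxt[i] += share
    return nxt


def stationary_distribution(out_links, iterations=200):
    """Power iteration from the uniform distribution.

    Converges when the graph is strongly connected and aperiodic.
    """
    x = {v: 1.0 / len(out_links) for v in out_links}
    for _ in range(iterations):
        x = walk_step(x, out_links)
    return x


pi = stationary_distribution(OUT_LINKS)
for node in sorted(pi):
    print(node, round(pi[node], 4))
```

Under these assumptions the result agrees with π = (6, 4, 3, 5)/18, which is not proportional to either the in-degrees or the out-degrees.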

slide-14
SLIDE 14

Ranking nodes in a directed graph

[Figure: directed graph with nodes 1-6; strongly connected and aperiodic]

Node in- and out-degree (local importance), each column sorted from highest to lowest:

In-degree      Out-degree
din(3) = 3     dout(5) = 3
din(5) = 2     dout(3) = 2
din(1) = 2     dout(1) = 2
din(2) = 2     dout(4) = 2
din(4) = 1     dout(2) = 1
din(6) = 1     dout(6) = 1

Stationary distribution (global importance):

π(1) = 5/16, π(3) = 1/4, π(2) = 3/16, π(4) = 1/8, π(5) = 3/32, π(6) = 1/32

They are no longer equivalent.

slide-15
SLIDE 15

Directed graphs

  • Periodic vs. aperiodic graphs
      • A graph is aperiodic if the greatest common divisor of the lengths of its cycles is one, and periodic otherwise
  • Disconnected vs. connected graphs
      • Strongly connected vs. weakly connected
  • Ergodic: strongly connected and aperiodic

[Figures: two example directed graphs on nodes 1-4, one of them strongly connected and aperiodic]
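The ergodicity conditions can be checked programmatically. The sketch below does not come from the slides: it uses a standard breadth-first-search technique in which, for a strongly connected graph, the period equals the gcd of `level[u] + 1 - level[v]` over all edges u→v. The two example graphs are assumptions chosen for illustration: the four-node graph with out-links 1→{2, 3}, 2→{4}, 3→{1, 2, 4}, 4→{1}, and a plain directed 4-cycle.

```python
from collections import deque
from math import gcd


def bfs_levels(links, root):
    """Breadth-first distances from root along the given links."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in links[u]:
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
    return level


def is_strongly_connected(out_links):
    """Every node reachable from the root in both the graph and its reverse."""
    nodes = list(out_links)
    reverse = {v: [] for v in nodes}
    for u, targets in out_links.items():
        for v in targets:
            reverse[v].append(u)
    root = nodes[0]
    return (len(bfs_levels(out_links, root)) == len(nodes)
            and len(bfs_levels(reverse, root)) == len(nodes))


def period(out_links):
    """gcd of all cycle lengths; assumes the graph is strongly connected."""
    level = bfs_levels(out_links, next(iter(out_links)))
    g = 0
    for u, targets in out_links.items():
        for v in targets:
            g = gcd(g, level[u] + 1 - level[v])
    return g


def is_ergodic(out_links):
    return is_strongly_connected(out_links) and period(out_links) == 1


example = {1: [2, 3], 2: [4], 3: [1, 2, 4], 4: [1]}  # cycles of length 2 and 3
ring = {1: [2], 2: [3], 3: [4], 4: [1]}              # single cycle of length 4
print(is_ergodic(example), period(ring))
```

The first graph has cycles of lengths 2 and 3, so its period is 1; the ring has a single cycle of length 4, so a walk on it never settles into a stationary distribution from an arbitrary start.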

slide-16
SLIDE 16

Why This Order?

slide-17
SLIDE 17

Ranking nodes in a directed graph (II)

[Figure: directed graph with nodes 1-6]

  • PageRank: random walk with random jumps
      • R(3) = ?, R(5) = ?, R(1) = ?, R(2) = ?, R(4) = ?, R(6) = ?
  • HITS: hub & authority
      • Ra(3) = ?, Ra(5) = ?, Ra(1) = ?, Ra(2) = ?, Ra(4) = ?, Ra(6) = ?
      • Rh(5) = ?, Rh(3) = ?, Rh(1) = ?, Rh(4) = ?, Rh(2) = ?, Rh(6) = ?

They are no longer equivalent.

slide-18
SLIDE 18

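Naïve PageRank

[Figure: directed graph with nodes 1, 2, 3, 4]

  • Adjacency matrix and out-degree matrix:

$$A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 2 & & & \\ & 1 & & \\ & & 3 & \\ & & & 1 \end{pmatrix}$$

  • Transition probability matrix, with $P_{ij} = 1/k_{out,i}$ for every out-neighbor j of node i:

$$P = D^{-1} A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

  • Stationary distribution: the rank is the stationary distribution of the random walk,

$$R_i = \sum_j R_j\, p_{ji}, \qquad R_i = \pi_i$$

      • π(1) = 6/18 = 1/3
      • π(2) = 4/18 = 2/9
      • π(3) = 3/18 = 1/6
      • π(4) = 5/18
  • Disconnected graphs & random surfing behaviors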

slide-19
SLIDE 19

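Standard PageRank

[Figure: directed graph with nodes 1, 2, 3, 4]

  • Adjacency matrix and out-degree matrix:

$$A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 2 & & & \\ & 1 & & \\ & & 3 & \\ & & & 1 \end{pmatrix}$$

  • Transition probability matrix (d = 0.85), with $P_{ij} = 1/k_{out,i}$ for every out-neighbor j of node i:

$$P = D^{-1} A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1 & 0 & 0 & 0 \end{pmatrix}$$

  • Stationary distribution (J is the all-ones matrix, n the number of nodes):

$$R_i = d \sum_j R_j\, p_{ji} + (1 - d)\frac{1}{n}, \qquad R_i = \pi_{pr,i}$$

$$P_{pr} = d \cdot P + (1 - d)\frac{1}{n} J = \begin{pmatrix} 0.0375 & 0.4625 & 0.4625 & 0.0375 \\ 0.0375 & 0.0375 & 0.0375 & 0.8875 \\ 0.3208 & 0.3208 & 0.0375 & 0.3208 \\ 0.8875 & 0.0375 & 0.0375 & 0.0375 \end{pmatrix}$$

  • Convergence
      • The ranks converge to the leading eigenvector of $P_{pr}$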
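The PageRank recursion with damping can be sketched as a short power iteration. This is a minimal illustration, not a reference implementation: it assumes the four-node graph has out-links 1→{2, 3}, 2→{4}, 3→{1, 2, 4}, 4→{1} (consistent with the matrix $P_{pr}$ above), that every node has at least one out-link, and the function names and tolerance are made up for the example.

```python
# Assumed out-links of the four-node example; every node has an out-link.
OUT_LINKS = {1: [2, 3], 2: [4], 3: [1, 2, 4], 4: [1]}
DAMPING = 0.85


def pagerank_matrix(out_links, d=DAMPING):
    """P_pr = d * P + (1 - d) / n * J as a list of rows (nodes in sorted order)."""
    nodes = sorted(out_links)
    n = len(nodes)
    return [
        [d * (1.0 / len(out_links[i]) if j in out_links[i] else 0.0) + (1 - d) / n
         for j in nodes]
        for i in nodes
    ]


def pagerank(out_links, d=DAMPING, tol=1e-12, max_iter=1000):
    """Iterate R_i = d * sum_j R_j * p_ji + (1 - d) / n until R stops changing."""
    nodes = sorted(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        nxt = {v: (1 - d) / n for v in nodes}   # random-jump term
        for j in nodes:
            share = d * rank[j] / len(out_links[j])
            for i in out_links[j]:              # link-following term
                nxt[i] += share
        if max(abs(nxt[v] - rank[v]) for v in nodes) < tol:
            return nxt
        rank = nxt
    return rank


P_pr = pagerank_matrix(OUT_LINKS)
ranks = pagerank(OUT_LINKS)
for row in P_pr:
    print([round(p, 4) for p in row])
print({v: round(r, 4) for v, r in ranks.items()})
```

Because every row of $P_{pr}$ is strictly positive, the chain is ergodic for any graph without dangling nodes, which is why the random jump removes the connectivity requirement of the naive version.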

slide-20
SLIDE 20

  • How can we quantify importance as a hub and as an authority separately?

slide-21
SLIDE 21

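Hub & Authority (HITS)

[Figure: directed graph with nodes 1, 2, 3, 4]

  • Adjacency matrix and out-degree matrix:

$$A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \qquad D = \begin{pmatrix} 2 & & & \\ & 1 & & \\ & & 3 & \\ & & & 1 \end{pmatrix}$$

  • Hub and authority
      • Initial step:

$$\mathrm{hub}(p) = 1; \quad \mathrm{auth}(p) = 1$$

      • Each step, with normalization:

$$\mathrm{hub}(p) = \sum_{i=1}^{n} \mathrm{auth}(i); \quad \mathrm{auth}(p) = \sum_{i=1}^{n} \mathrm{hub}(i)$$

$$\mathrm{hub}(p) = \frac{\mathrm{hub}(p)}{\sqrt{\sum_{i=1}^{n} \mathrm{hub}(i)^2}}; \quad \mathrm{auth}(p) = \frac{\mathrm{auth}(p)}{\sqrt{\sum_{i=1}^{n} \mathrm{auth}(i)^2}}$$

        In the update sums, i ranges over the pages that p links to (for hub) and over the pages that link to p (for auth). In the normalization, i ranges over all pages.

  • Convergence
      • The hub and authority scores converge to the leading left and right singular vectors of the adjacency matrix A.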
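The HITS iteration can be sketched as follows. This is an illustrative version under stated assumptions: the graph is the four-node example with out-links 1→{2, 3}, 2→{4}, 3→{1, 2, 4}, 4→{1}; each round updates the hub scores first and then the authority scores from the new hub scores (the slide does not fix this order, and either order leads to the same limit); the function names and the fixed iteration count are made up for the example.

```python
from math import sqrt

# Assumed out-links of the four-node example graph.
OUT_LINKS = {1: [2, 3], 2: [4], 3: [1, 2, 4], 4: [1]}


def normalize(scores):
    """Divide by the Euclidean norm so the squared scores sum to one."""
    norm = sqrt(sum(s * s for s in scores.values()))
    return {p: s / norm for p, s in scores.items()}


def hits(out_links, iterations=50):
    nodes = sorted(out_links)
    hub = {p: 1.0 for p in nodes}
    auth = {p: 1.0 for p in nodes}
    for _ in range(iterations):
        # hub(p): sum of the authority scores of the pages p links to
        hub = {p: sum(auth[q] for q in out_links[p]) for p in nodes}
        # auth(p): sum of the hub scores of the pages linking to p
        new_auth = {p: 0.0 for p in nodes}
        for q in nodes:
            for p in out_links[q]:
                new_auth[p] += hub[q]
        hub, auth = normalize(hub), normalize(new_auth)
    return hub, auth


hub, auth = hits(OUT_LINKS)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

On this graph the strongest hub and the strongest authority are different nodes, which is the point of scoring the two roles separately.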

slide-22
SLIDE 22

A Note on Maximizing the Spread of Influence in Social Networks

  • E. Even-Dar and A. Shapira
slide-23
SLIDE 23

Social Influence

slide-24
SLIDE 24

Social Influence

  • Collaboration networks
  • Microblogs
  • Social networks
  • Location-based services
  • Sharing sites
  • Instant messaging

slide-25
SLIDE 25

Social Influence

slide-26
SLIDE 26

Voter Influence Model

  • Word of mouth effect!
  • Opinion diffusion
  • Nodes switch opinions back and forth
  • At each step, a node randomly selects one neighbor and adopts its opinion

[Figure: example with Bob, Alice, and David]

[1] P. Clifford and A. Sudbury. A model for spatial conflict. Biometrika, 60(3):581, 1973.

slide-27
SLIDE 27

Influence Maximization

  • Goal: Maximize the number of future red nodes
  • Budget: Select k individuals as initial red seeds
  • Assumption: Uniform cost of selecting each initial seed

[15] E. Even-Dar and A. Shapira. A note on maximizing the spread of influence in social networks. In WINE, 2007.

slide-28
SLIDE 28

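Formulation

[Figure: example graph with nodes 1-6; at step t, node i is red with probability $x_t(i)$ and not red with probability $1 - x_t(i)$]

  • Probability of node i being red at step t: $x_t(i)$
  • Update from step t > 0 to step t+1:

$$x_{t+1}(i) = \sum_{j:\, a_{ij} > 0} x_t(j)\, p_{ji}, \qquad p_{ij} = \frac{a_{ij}}{\sum_{j \in V} a_{ij}}$$

  • Influence at step t:

$$f_t(x_0) = \sum_{i \in V} x_t(i)$$

  • Influence contribution:
      • Long term: $\max_{x_0} \lim_{t \to \infty} f_t(x_0) - f_0(x_0)$
      • Short term: $\max_{x_0} f_t(x_0) - f_0(x_0)$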

slide-29
SLIDE 29

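Formulation (Random Walk)

  • Influence at step t: $\mathbf{1} x_t^T$, where $x_t^T$ is a column vector, the transpose of the row vector $x_t$
  • Influence contribution:
      • Long term: $\max_{x_0} \lim_{t \to \infty} \mathbf{1} x_t^T$
      • Short term: $\max_{x_0} \mathbf{1} x_t^T$
  • Matrix form:

$$x_t = x_0 P^t, \qquad \lim_{t \to \infty} x_t = \lim_{t \to \infty} x_0 P^t = \pi$$

  • Influence contribution in terms of the seed vector $x_0$:
      • Long term: $\max_{x_0} \lim_{t \to \infty} f_t(x_0) - f_0(x_0) = x_0 \pi^T - f_0(x_0)$
      • Short term: $\max_{x_0} f_t(x_0) - f_0(x_0)$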
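To see why the stationary distribution shows up in the long-term objective, the voter rule can be iterated directly. The sketch below is illustrative and rests on assumptions: it uses a small undirected graph with edges 1-2, 1-3, 1-4, 2-3, 3-4 (connected and containing a triangle, so the walk is aperiodic), it applies the rule "each node adopts the opinion of a uniformly random neighbor" to the red probabilities, and all names are hypothetical. Under these assumptions every node's red probability approaches $x_0 \pi^T$, the total stationary mass of the seed set.

```python
# Assumed undirected example graph (edges 1-2, 1-3, 1-4, 2-3, 3-4).
NEIGHBORS = {1: [2, 3, 4], 2: [1, 3], 3: [1, 2, 4], 4: [1, 3]}


def voter_step(x, neighbors):
    """Red probability of i next step = average red probability of its neighbors."""
    return {i: sum(x[j] for j in nbrs) / len(nbrs) for i, nbrs in neighbors.items()}


def red_probabilities(seeds, neighbors, steps):
    """Start with the seeds red (probability 1) and iterate the voter rule."""
    x = {i: 1.0 if i in seeds else 0.0 for i in neighbors}
    for _ in range(steps):
        x = voter_step(x, neighbors)
    return x


# Stationary distribution of the random walk on an undirected graph: d_i / (2|E|).
total_degree = sum(len(nbrs) for nbrs in NEIGHBORS.values())
pi = {i: len(nbrs) / total_degree for i, nbrs in NEIGHBORS.items()}

seeds = {1}
x_short = red_probabilities(seeds, NEIGHBORS, steps=2)
x_long = red_probabilities(seeds, NEIGHBORS, steps=200)
predicted = sum(pi[i] for i in seeds)  # x_0 . pi for an indicator seed vector

print("after 2 steps:", {i: round(p, 3) for i, p in x_short.items()})
print("after 200 steps:", {i: round(p, 3) for i, p in x_long.items()})
print("x_0 . pi:", round(predicted, 3))
```

The short-term values still depend on where the seed sits, while the long-term values depend only on the seed's stationary mass, which is why the long-term problem reduces to picking nodes with large $\pi_i$.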

slide-30
SLIDE 30

Influence Maximization

  • Goal: Maximize the number of future red nodes
  • Budget: A total budget C for selecting initial red seeds
  • Assumption: Heterogeneous costs $c_i$ of selecting different initial seeds
  • This is a knapsack problem

[15] E. Even-Dar and A. Shapira. A note on maximizing the spread of influence in social networks. In WINE, 2007.

slide-31
SLIDE 31

Knapsack problem

  • Weight = influence value of a node (its stationary distribution value)
  • Size = cost $c_i$ of choosing node $n_i$
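With heterogeneous costs, seed selection can be sketched as a 0/1 knapsack solved by dynamic programming. The example below is a hedged illustration: the influence values are the stationary probabilities 3/10, 2/10, 3/10, 2/10 of a four-node graph, while the integer costs, the budget, and the function name `select_seeds` are hypothetical and chosen only to make the example concrete.

```python
from fractions import Fraction


def select_seeds(values, costs, budget):
    """0/1 knapsack: maximize total value subject to total cost <= budget.

    best[c] holds (best total value, chosen nodes) using cost at most c.
    Costs are assumed to be positive integers.
    """
    best = [(Fraction(0), ())] * (budget + 1)
    for node in values:
        cost, value = costs[node], values[node]
        # Go through budgets downwards so each node is used at most once.
        for c in range(budget, cost - 1, -1):
            candidate = best[c - cost][0] + value
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - cost][1] + (node,))
    return best[budget]


# Influence value of each node: its stationary probability.
influence = {1: Fraction(3, 10), 2: Fraction(2, 10),
             3: Fraction(3, 10), 4: Fraction(2, 10)}
# Hypothetical seeding costs c_i and total budget C.
cost = {1: 3, 2: 1, 3: 2, 4: 1}
BUDGET = 3

total_influence, chosen = select_seeds(influence, cost, BUDGET)
print(chosen, total_influence)
```

With these made-up costs, the most influential node is not worth its price: two cheaper nodes together reach more stationary mass within the same budget.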

slide-32
SLIDE 32

What if the social graph is directed?

  • One-way connection
  • Randomly select an out-going neighbor
  • Adopt the opinion of one of the outgoing neighbors.

slide-33
SLIDE 33

Directed and Signed Networks

  • Friend and foe relationships
  • One-way signed connection
  • Randomly select an out-going neighbor
  • Adopt the opposite opinion of a foe, the same opinion of a friend

[WSDM'13] Yanhua Li, Wei Chen, Yajun Wang, Zhi-Li Zhang. Influence Diffusion Dynamics and Influence Maximization in Social Networks with Friend and Foe Relationships. The 6th ACM International Conference on Web Search and Data Mining, February 4-8, 2013, Rome, Italy.

slide-34
SLIDE 34

Any Comments & Critiques?

slide-35
SLIDE 35

Logistics

Next week: guest lecture

  • Prof. Kyumin Lee
  • Topic: Big Data on Social Networks
  • http://web.cs.wpi.edu/~kmlee/

Schedule for next week:

  • 6-7:10PM: Guest lecture from Prof. Lee
  • 7:10-7:20PM: Break
  • 7:20-8:30PM: Team 4 presentation and Q&A
  • 8:30-8:50PM (if time allows): New research topics in Prof. Lee's group, and Q&A

slide-36
SLIDE 36

Project Example

USPS: https://ribbs.usps.gov/intelligentmail_package/documents/tech_guides/PUB199IMPBImpGuide.pdf