DS504/CS586: Big Data Analytics, Graph Mining II (Prof. Yanhua Li)

Welcome to DS504/CS586: Big Data Analytics
Graph Mining II
Prof. Yanhua Li
Time: 6-8:50 PM Thursday
Location: AK233
Spring 2018

Logistics: Course Project I
Course Project I has been graded.
- Grading was based on:
  1. Project report
  2. Project team presentation
  3. Self- and cross-evaluation form
  4. In-class survey/evaluation form
- I also provided comments on your project reports in the Canvas discussion forum.
- If you are interested in publishing your results, talk to me. (Totally optional.)

Logistics: Course Project II
- Projects will be in groups: 4-6 students per group, depending on enrollment.
- "Research-oriented" team project timeline:
  - Starting date: Week 8 (R), 3/1
  - Project proposal due date: Week 10 (R), 3/15
  - Project progress presentation: TBD, 15 minutes per team
  - Project due date: Week 16 (R), 4/26
  - Project final presentation: Week 16 (R), 4/26

Graph Data
Graphs are everywhere: ecological networks, biological networks, social networks, chemical networks, program flow graphs, and the web graph.

Complex Graphs
Real-life graphs contain complex content: labels associated with nodes, edges, and graphs.
Node labels: Location, Gender, Charts, Library, Events, Groups, Journal, Tags, Age, Tracks.

Large Graphs
Large-scale graphs:

Network       # of Users (millions)   # of Links (millions)
Facebook      400                     52,000
Twitter       105                     10,000
LinkedIn      60                      900
Last.FM       40                      2,000
LiveJournal   25                      2,000
del.icio.us   5.3                     700
DBLP          0.7                     8

Mining in Big Graphs
- Network statistic analysis (last lecture)
  - Network size
  - Degree distribution
- Node ranking (this lecture)
  - Identifying the most important/influential nodes
  - Viral marketing, resource allocation

Characterize Node Importance
- Rank the webpages in a search engine.
- Viral marketing, resource allocation.
- Open a new restaurant: find the optimal location.
- …

Brainstorming: Node Importance
[Figure: example graph with nodes 1 to 6]

Ranking nodes on an undirected graph (connected graphs)
[Figure: the six-node example graph, |V| = 6, |E| = 7]

Node   Degree d(v) (local importance)   Stationary distribution π(v) (global importance)
5      4                                4/14
3      3                                3/14
4      2                                2/14
2      2                                2/14
1      2                                2/14
6      1                                1/14

They are equivalent: both measures rank the nodes in the same order.
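The equivalence on this slide can be checked with a few lines of Python. This is a minimal sketch that takes only the degrees listed on the slide as input (the edge list itself is not given in the transcript) and applies π(v) = d(v) / (2|E|); the variable names are illustrative.

```python
from fractions import Fraction

# Degrees read off the slide's six-node example graph.
degree = {1: 2, 2: 2, 3: 3, 4: 2, 5: 4, 6: 1}

# Handshake lemma: the degrees sum to 2|E|.
num_edges = sum(degree.values()) // 2

# Stationary probability of a simple random walk on a connected undirected graph.
pi = {v: Fraction(d, 2 * num_edges) for v, d in degree.items()}

# Ranking by degree and ranking by pi coincide, because pi is degree rescaled.
ranking_by_degree = sorted(degree, key=lambda v: degree[v], reverse=True)
ranking_by_pi = sorted(pi, key=lambda v: pi[v], reverse=True)

for v in ranking_by_degree:
    print(f"node {v}: d = {degree[v]}, pi = {pi[v]}")
```

Fractions are printed in lowest terms, so 4/14 appears as 2/7.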

Ranking nodes on a directed graph (strongly connected and aperiodic graphs)
[Figure: the six-node example graph with directed edges]
Local importance: node in- and out-degree
- In-degree: d_in(3)=3; d_in(5)=2; d_in(1)=2; d_in(2)=2; d_in(4)=1; d_in(6)=1
- Out-degree: d_out(5)=3; d_out(3)=2; d_out(1)=2; d_out(4)=2; d_out(2)=1; d_out(6)=1
Global importance: stationary distribution
- π(5)=?; π(4)=?; π(3)=?; π(2)=?; π(1)=?; π(6)=?
Are they equivalent?

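Random Walk (Undirected Graph)
[Figure: four-node example graph with nodes 1, 2, 3, 4]
- Adjacency matrix (symmetric) and degree matrix:
\[
A = \begin{pmatrix} 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \end{pmatrix},
\qquad
D = \begin{pmatrix} 3 & 0 & 0 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 2 \end{pmatrix}
\]
- Transition probability matrix, with $P_{ij} = 1/k_i$ whenever $i$ and $j$ are linked ($k_i$ is the degree of node $i$), so each row of $A$ is divided by the degree of its node:
\[
P = D^{-1} A = \begin{pmatrix} 0 & 1/3 & 1/3 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1/2 & 0 & 1/2 & 0 \end{pmatrix},
\qquad
x_{t,i} = \sum_j x_{t-1,j}\, p_{ji}
\]
- Stationary distribution, where $|E|$ is the number of links:
\[
\pi_i = \frac{d_i}{2|E|}
\]
  Here π(1) = 3/10, π(3) = 3/10, π(2) = 2/10, π(4) = 2/10.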
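To see the stationary distribution emerge from the update rule x_{t,i} = Σ_j x_{t-1,j} p_ji, the walk can be iterated numerically. The sketch below uses the slide's four-node adjacency matrix, builds P by dividing each row of A by the node's degree (which reproduces the P matrix shown on the slide), and compares the result with d_i / (2|E|). The starting distribution and the iteration count are arbitrary choices made for this example.

```python
import numpy as np

# Adjacency matrix of the slide's undirected four-node example.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)          # degrees: 3, 2, 3, 2
P = A / deg[:, None]         # row-stochastic: P[i, j] = 1/k_i if i and j are linked

# Start the walker at node 1 and apply x_t = x_{t-1} P repeatedly.
x = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(200):
    x = x @ P

pi_closed_form = deg / deg.sum()   # d_i / (2|E|), since the degrees sum to 2|E|

print("iterated    :", np.round(x, 4))
print("closed form :", pi_closed_form)
```

The iteration converges here because the graph is connected and contains a triangle (nodes 1, 2, 3), so the walk is aperiodic.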

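Random Walk (directed graph; strongly connected and aperiodic)
[Figure: four-node example graph with nodes 1, 2, 3, 4 and directed edges]
- Adjacency matrix (asymmetric) and out-degree matrix:
\[
A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix},
\qquad
D = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\]
- Transition probability matrix, with $P_{ij} = 1/k_{out,i}$ whenever there is a link from $i$ to $j$:
\[
P = D^{-1} A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1 & 0 & 0 & 0 \end{pmatrix},
\qquad
x_{t,i} = \sum_j x_{t-1,j}\, p_{ji}
\]
- Stationary distribution, where $|E|$ is the number of directed links. It is no longer proportional to degree:
\[
\pi_i \neq \frac{d_i}{2|E|}
\]
  Here π(1) = 6/18 = 1/3, π(2) = 4/18 = 2/9, π(3) = 3/18 = 1/6, π(4) = 5/18.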
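For a directed graph the stationary distribution has to be computed rather than read off the degrees. The sketch below solves π = πP as a left-eigenvector problem. The edge list (1→2, 1→3, 2→4, 3→1, 3→2, 3→4, 4→1) is an assumption made for this example: it is the four-node graph that reproduces the stationary values stated on the slide.

```python
import numpy as np

# Assumed directed edges: 1->2, 1->3, 2->4, 3->1, 3->2, 3->4, 4->1.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 1],
              [1, 0, 0, 0]], dtype=float)

out_deg = A.sum(axis=1)
P = A / out_deg[:, None]               # P[i, j] = 1/k_out,i for each link i -> j

# pi = pi P means pi is an eigenvector of P^T with eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(vals - 1.0))      # pick the eigenvalue closest to 1
v = np.real(vecs[:, k])
pi = v / v.sum()                       # normalize to a probability vector

in_deg = A.sum(axis=0)
print("pi               :", np.round(pi, 4))
print("in-degree / |E|  :", np.round(in_deg / in_deg.sum(), 4))
```

The two printed vectors differ, which is the point of the slide: on a directed graph, degree no longer determines the stationary distribution.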

Ranking nodes in a directed graph (strongly connected and aperiodic graphs)
[Figure: the six-node example graph with directed edges]
Local importance: node in- and out-degree
- In-degree: d_in(3)=3; d_in(5)=2; d_in(1)=2; d_in(2)=2; d_in(4)=1; d_in(6)=1
- Out-degree: d_out(5)=3; d_out(3)=2; d_out(1)=2; d_out(4)=2; d_out(2)=1; d_out(6)=1
Global importance: stationary distribution
- π(1)=5/16; π(3)=1/4; π(2)=3/16; π(4)=1/8; π(5)=3/32; π(6)=1/32
They are no longer equivalent.

Directed graphs: strongly connected and aperiodic
[Figures: two four-node example graphs with nodes 1, 2, 3, 4]
- Periodic vs. aperiodic graphs
  - A graph is aperiodic if the greatest common divisor of the lengths of its cycles is one; otherwise it is periodic.
- Disconnected vs. connected graphs
  - Strongly connected vs. weakly connected
- Ergodic: strongly connected and aperiodic
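Ergodicity can be tested directly from the adjacency matrix. Rather than enumerating cycle lengths and taking their greatest common divisor, the sketch below relies on a standard fact about nonnegative matrices (Wielandt's bound): an n-node directed graph is strongly connected and aperiodic exactly when every entry of A raised to the power (n-1)² + 1 is positive. The function name `is_ergodic` and the three test graphs are illustrative and not taken from the slides.

```python
import numpy as np

def is_ergodic(A):
    """Return True if the directed graph with adjacency matrix A is
    strongly connected and aperiodic (i.e. A is a primitive matrix)."""
    B = (np.asarray(A) > 0).astype(np.int64)   # keep only the 0/1 link pattern
    n = B.shape[0]
    M = np.eye(n, dtype=np.int64)
    # After k rounds, M[i, j] == 1 iff there is a walk of length exactly k from i to j.
    for _ in range((n - 1) ** 2 + 1):
        M = ((M @ B) > 0).astype(np.int64)     # re-binarize to avoid overflow
    return bool(M.all())

# A directed 4-cycle: strongly connected, but every cycle has length 4 (periodic).
cycle = [[0, 1, 0, 0],
         [0, 0, 1, 0],
         [0, 0, 0, 1],
         [1, 0, 0, 0]]

# Cycles 1->3->1 (length 2) and 1->2->4->1 (length 3): gcd is 1, so aperiodic.
mixed = [[0, 1, 1, 0],
         [0, 0, 0, 1],
         [1, 1, 0, 1],
         [1, 0, 0, 0]]

# Only weakly connected: node 2 cannot reach node 1.
weak = [[0, 1],
        [0, 0]]

print(is_ergodic(cycle), is_ergodic(mixed), is_ergodic(weak))
```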

Why This Order?

Ranking nodes in a directed graph (II)
[Figure: the six-node example graph with directed edges]
PageRank: random walk with random jumps
- R(3)=?; R(5)=?; R(1)=?; R(2)=?; R(4)=?; R(6)=?
HITS: hub and authority
- Authority: R_a(3)=?; R_a(5)=?; R_a(1)=?; R_a(2)=?; R_a(4)=?; R_a(6)=?
- Hub: R_h(5)=?; R_h(3)=?; R_h(1)=?; R_h(4)=?; R_h(2)=?; R_h(6)=?
They are no longer equivalent.

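Naïve PageRank
[Figure: four-node example graph with nodes 1, 2, 3, 4 and directed edges]
- Adjacency matrix and out-degree matrix:
\[
A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix},
\qquad
D = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\]
- Transition probability matrix, with $P_{ij} = 1/k_{out,i}$ whenever there is a link from $i$ to $j$:
\[
P = D^{-1} A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1 & 0 & 0 & 0 \end{pmatrix},
\qquad
R_i = \sum_j R_j\, p_{ji}
\]
- Stationary distribution: the rank is the stationary probability, $R_i = \pi_i$. Here π(1) = 6/18 = 1/3, π(2) = 4/18 = 2/9, π(3) = 3/18 = 1/6, π(4) = 5/18.
- Issues not handled by this naïve version: disconnected graphs and random surfing behaviors.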

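Standard PageRank
[Figure: four-node example graph with nodes 1, 2, 3, 4 and directed edges]
- Adjacency matrix and out-degree matrix:
\[
A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix},
\qquad
D = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\]
- Transition probability matrix (damping factor $d = 0.85$), with $P_{ij} = 1/k_{out,i}$ whenever there is a link from $i$ to $j$ and $n$ the number of nodes:
\[
P = D^{-1} A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 \\ 0 & 0 & 0 & 1 \\ 1/3 & 1/3 & 0 & 1/3 \\ 1 & 0 & 0 & 0 \end{pmatrix},
\qquad
R_i = d \sum_j R_j\, p_{ji} + \frac{1 - d}{n}
\]
- Stationary distribution ($J$ is the all-ones matrix), with $R_i = \pi_{pr,i}$:
\[
P_{pr} = d \cdot P + \frac{1 - d}{n} J = \begin{pmatrix} 0.0375 & 0.4625 & 0.4625 & 0.0375 \\ 0.0375 & 0.0375 & 0.0375 & 0.8875 \\ 0.3208 & 0.3208 & 0.0375 & 0.3208 \\ 0.8875 & 0.0375 & 0.0375 & 0.0375 \end{pmatrix}
\]
- Convergence: the ranks converge to the leading eigenvector of $P_{pr}$.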
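The PageRank matrix and its leading eigenvector can be reproduced with power iteration. This is a minimal sketch, not a production implementation: it assumes every node has at least one out-link (true for this example, so no dangling-node handling is included), and the edge list 1→2, 1→3, 2→4, 3→1, 3→2, 3→4, 4→1 is the one that matches the P_pr entries shown on the slide.

```python
import numpy as np

# Assumed directed edges: 1->2, 1->3, 2->4, 3->1, 3->2, 3->4, 4->1.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 1],
              [1, 0, 0, 0]], dtype=float)

d = 0.85                                   # damping factor from the slide
n = A.shape[0]

P = A / A.sum(axis=1, keepdims=True)       # P[i, j] = 1/k_out,i for each link i -> j
P_pr = d * P + (1 - d) / n * np.ones((n, n))

# Power iteration: R_i = d * sum_j R_j p_ji + (1 - d)/n, written as R = R P_pr.
R = np.full(n, 1.0 / n)
for _ in range(200):
    R = R @ P_pr

print(np.round(P_pr, 4))
print("PageRank:", np.round(R, 4))
```

Because the random jump makes every entry of P_pr positive, the iteration converges from any starting distribution, even on graphs that are not strongly connected.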

How to quantify the importance as a hub and as an authority separately?

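Hub & Authority (HITS)
[Figure: four-node example graph with nodes 1, 2, 3, 4 and directed edges]
- Adjacency matrix and out-degree matrix:
\[
A = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix},
\qquad
D = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 3 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
\]
- Hub and authority scores
  - Initial step: $hub(p) = 1$; $auth(p) = 1$.
  - Each step, with normalization. The first sum runs over the $n$ pages that $p$ links to, the second over the $n$ pages that link to $p$:
\[
hub(p) = \sum_{i=1}^{n} auth(i), \qquad hub(p) = \frac{hub(p)}{\sqrt{\sum_{i=1}^{n} hub(i)^2}}
\]
\[
auth(p) = \sum_{i=1}^{n} hub(i), \qquad auth(p) = \frac{auth(p)}{\sqrt{\sum_{i=1}^{n} auth(i)^2}}
\]
    In the normalization, the sums run over all pages.
- Convergence: the hub and authority vectors converge to the principal left and right singular vectors of the adjacency matrix $A$.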
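The HITS updates and the singular-vector claim can be checked numerically. The sketch below writes the two sums in matrix form (auth = Aᵀ·hub, hub = A·auth, with A[i, j] = 1 for a link i→j), normalizes by the Euclidean norm at each step, and compares the result with NumPy's SVD. The edge list is the same assumed four-node example used for PageRank, and the iteration count is an arbitrary choice.

```python
import numpy as np

# Assumed directed edges: 1->2, 1->3, 2->4, 3->1, 3->2, 3->4, 4->1.
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 1],
              [1, 1, 0, 1],
              [1, 0, 0, 0]], dtype=float)

n = A.shape[0]
hub = np.ones(n)     # initial step: hub(p) = 1
auth = np.ones(n)    # initial step: auth(p) = 1

for _ in range(100):
    auth = A.T @ hub                 # sum of hub scores of pages linking to p
    auth /= np.linalg.norm(auth)     # divide by sqrt(sum of squares)
    hub = A @ auth                   # sum of authority scores of pages p links to
    hub /= np.linalg.norm(hub)

# Principal singular vectors; SVD fixes them only up to sign, hence abs().
U, s, Vt = np.linalg.svd(A)
print("hub :", np.round(hub, 4), " vs ", np.round(np.abs(U[:, 0]), 4))
print("auth:", np.round(auth, 4), " vs ", np.round(np.abs(Vt[0]), 4))
```

In this example node 3 has the most out-links and ends up with the largest hub score, while its authority score is comparatively low, which illustrates why the two roles are scored separately.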

A Note on Maximizing the Spread of Influence in Social Networks E. Even-Dar and A. Shapira

Social Influence

Social Influence
[Figure: platforms where social influence spreads: instant messaging, collaboration networks, sharing sites, location-based services, microblogs, social networks]

Social Influence

Voter Influence Model
[Figure: example network with Bob, Alice, and David]
- Opinion diffusion: the word-of-mouth effect.
- Each node randomly selects one neighbor and adopts its opinion.
- Nodes can switch opinions back and forth.
[1] P. Clifford and A. Sudbury. A model for spatial conflict. Biometrika, 60(3):581, 1973.
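The voter model's update rule is simple enough to simulate directly. The following is a minimal sketch under stated assumptions: all nodes update synchronously in each round, and the three-person network (Bob and David both connected to Alice) is a hypothetical topology that borrows the names from the slide's figure. The helper names `voter_step` and `simulate` are illustrative.

```python
import random

def voter_step(neighbors, opinion, rng):
    """One synchronous round: every node copies the current opinion of
    one neighbor chosen uniformly at random."""
    return {v: opinion[rng.choice(nbrs)] for v, nbrs in neighbors.items()}

def simulate(neighbors, opinion, steps, seed=0):
    """Run the voter model for a fixed number of rounds and return the final opinions."""
    rng = random.Random(seed)          # seeded for reproducibility
    for _ in range(steps):
        opinion = voter_step(neighbors, opinion, rng)
    return opinion

# Hypothetical network: Bob - Alice - David.
neighbors = {
    "Alice": ["Bob", "David"],
    "Bob": ["Alice"],
    "David": ["Alice"],
}
start = {"Alice": "red", "Bob": "blue", "David": "blue"}

state = dict(start)
rng = random.Random(0)
for t in range(5):
    state = voter_step(neighbors, state, rng)
    print(t + 1, state)
```

On this small network the opinions swap sides from round to round (Alice takes a neighbor's opinion while both neighbors take hers), which is the back-and-forth switching the slide mentions. A state in which everyone holds the same opinion is absorbing: once reached, it never changes.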

Influence Maximization
- Budget: select k individuals as initial red seeds.
- Assumption: uniform cost of selecting each initial seed.
- Goal: maximize the number of future red nodes.
[15] E. Even-Dar and A. Shapira. A note on maximizing the spread of influence in social networks. In WINE, 2007.

Formulation
[Figure: the six-node example graph]
- Probability of node $i$ being red at step $t$: $x_t(i)$. The probability of it not being red is $1 - x_t(i)$.
- Update from step $t$ to step $t+1$:
\[
x_{t+1}(i) = \sum_{j:\, a_{ij} > 0} x_t(j)\, p_{ji}, \qquad p_{ij} = \frac{a_{ij}}{\sum_{j \in V} a_{ij}}
\]
- Influence at step $t$:
\[
f_t(x_0) = \sum_{i \in V} x_t(i)
\]
- Influence contribution:
  - Short term: $\max_{x_0} \; f_t(x_0) - f_0(x_0)$
  - Long term: $\max_{x_0} \; \lim_{t \to \infty} f_t(x_0) - f_0(x_0)$

Formulation (Random Walk)
- Influence at step $t$: $\mathbf{1}\, x_t^T$, where $x_t^T$ is a column vector, the transpose of the row vector $x_t$.
- Influence contribution:
  - Short term: $\max_{x_0} \; \mathbf{1}\, x_t^T$
  - Long term: $\max_{x_0} \; \lim_{t \to \infty} \mathbf{1}\, x_t^T$
- Matrix form:
\[
x_t = x_0 P^t, \qquad \lim_{t \to \infty} x_t = \lim_{t \to \infty} x_0 P^t = \pi
\]
- Influence contribution:
  - Short term: $\max_{x_0} \; f_t(x_0) - f_0(x_0)$
  - Long term: $\max_{x_0} \; \lim_{t \to \infty} f_t(x_0) - f_0(x_0) = x_0 \pi^T - f_0(x_0)$
