
SLIDE 1

CMU SCS

Mining Large Social Networks: Patterns and Anomalies

Christos Faloutsos CMU

SLIDE 2

Thank you

  • The Department of Informatics
  • Happy 20th!
  • Prof. Yannis Manolopoulos
  • Prof. Kostas Tsichlas
  • Mrs. Nina Daltsidou

AUTH, May 30, 2012

  • C. Faloutsos (CMU)

SLIDE 3

International-caliber friends among AUTH alumni

  • Prof. Evimaria Terzi (Boston U.)
  • Prof. Kyriakos Mouratidis (SMU)
  • Dr. Michalis Vlachos (IBM)

SLIDE 4

Outline

  • Introduction – Motivation
  • Problem#1: Patterns in graphs
  • Problem#2: Tools
  • Problem#3: Scalability
  • Conclusions

SLIDE 5

Graphs - why should we care?

[Figures: Internet Map [lumeta.com]; Food Web [Martinez ’91]]

$10s of BILLIONS revenue, >500M users

SLIDE 6

Graphs - why should we care?

  • IR: bi-partite graphs (doc-terms)
  • web: hyper-text graph
  • ... and more:

[Figure: bipartite graph linking documents D1 ... DN to terms T1 ... TM]

SLIDE 7

Graphs - why should we care?

  • web-log (‘blog’) news propagation
  • computer network security: email/IP traffic and anomaly detection
  • ....
  • [subject-verb-object: graph]
  • Graph == relational table with 2 columns (src, dst)
  • BIG DATA – big graphs
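The "relational table with 2 columns (src, dst)" view can be sketched in a few lines of Python. This is only an illustration: the node names and the helper functions `out_degrees` and `in_degrees` are hypothetical and do not come from the talk.

```python
from collections import Counter

# Hypothetical toy edge table: each row is one (src, dst) pair.
edges = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "alice"),
]

def out_degrees(edge_table):
    """Count how many rows each node appears in as src."""
    return Counter(src for src, _ in edge_table)

def in_degrees(edge_table):
    """Count how many rows each node appears in as dst."""
    return Counter(dst for _, dst in edge_table)

out_deg = out_degrees(edges)
in_deg = in_degrees(edges)
print(out_deg["alice"], in_deg["carol"])
```

Because the graph is just a list of rows, degree counting is a group-by on one column, which is the kind of operation that spreads easily over many machines.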

SLIDE 8

Outline

  • Introduction – Motivation
  • Problem#1: Patterns in graphs

    – Static graphs
    – Weighted graphs
    – Time evolving graphs

  • Problem#2: Tools
  • Problem#3: Scalability
  • Conclusions

SLIDE 9

Problem #1 - network and graph mining

  • What does the Internet look like?
  • What does Facebook look like?
  • What is ‘normal’/‘abnormal’?
  • Which patterns/laws hold?

SLIDE 10

Graph mining

  • Are real graphs random?

SLIDE 11

Laws and patterns

  • Are real graphs random?
  • A: NO!!

    – Diameter
    – in- and out-degree distributions
    – other (surprising) patterns

  • So, let’s look at the data

SLIDE 13

Solution# S.1

  • Power law in the degree distribution [SIGCOMM99]

[Figure: log(degree) vs. log(rank) for internet domains, with att.com and ibm.com labeled; slope -0.82]
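One way to check for such a power law is to fit a straight line to log(degree) against log(rank). The sketch below is a minimal least-squares fit, not the procedure used in [SIGCOMM99]; the function name `rank_exponent` and the synthetic data are hypothetical, and reading the slide's 0.82 as a slope of -0.82 is an assumption.

```python
import math

def rank_exponent(degrees):
    """Least-squares slope of log(degree) against log(rank).

    Nodes are ranked by decreasing degree, rank 1 being the largest.
    Zero-degree nodes are skipped because log(0) is undefined.
    """
    ranked = sorted((d for d in degrees if d > 0), reverse=True)
    xs = [math.log(r) for r in range(1, len(ranked) + 1)]
    ys = [math.log(d) for d in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic degrees that follow degree = 1000 * rank^(-0.82) exactly.
synthetic = [1000.0 * r ** -0.82 for r in range(1, 201)]
slope = rank_exponent(synthetic)
print(round(slope, 2))
```

On real data the points only roughly follow a line, so the fitted slope is an estimate and depends on how the tail is handled.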

SLIDE 14

But:

How about graphs from other domains?

SLIDE 15

More power laws:

  • web hit counts [w/ A. Montgomery]

[Figure: Web Site Traffic; axes: in-degree (log scale), Count (log scale); Zipf; users, sites; ‘ebay’ labeled]

SLIDE 16

And numerous more

  • Who-trusts-whom (epinions.com)
  • Income [Pareto]: ‘80-20 distribution’
  • Duration of downloads [Bestavros+]
  • Duration of UNIX jobs (‘mice and elephants’)

  • Size of files of a user
  • ‘Black swans’

SLIDE 17

Outline

  • Introduction – Motivation
  • Problem#1: Patterns in graphs

– Static graphs

  • degree, diameter, eigen,
  • Triangles

– Time evolving graphs

  • Problem#2: Tools

SLIDE 19

Solution# S.3: Triangle ‘Laws’

  • Real social networks have a lot of triangles

– Friends of friends are friends

  • Any patterns?

SLIDE 20

Triangle Law: #S.3

[Tsourakakis ICDM 2008]

[Figure: panels SN, Reuters, Epinions; X-axis: degree, Y-axis: mean # triangles]

  • n friends -> ~n^1.6 triangles
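The quantity on the Y-axis, the mean number of triangles among nodes of a given degree, can be computed exactly on a small graph as sketched below. This brute-force version is for illustration only and is not the method of [Tsourakakis ICDM 2008]; the function names and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def triangles_per_node(edges):
    """Return ({node: degree}, {node: triangle count}) for an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    # A triangle at u is a pair of u's neighbors that are themselves linked.
    triangles = {
        u: sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        for u, nbrs in adj.items()
    }
    return degree, triangles

def mean_triangles_by_degree(degree, triangles):
    """Average triangle count over all nodes sharing the same degree."""
    buckets = defaultdict(list)
    for u, d in degree.items():
        buckets[d].append(triangles[u])
    return {d: sum(ts) / len(ts) for d, ts in sorted(buckets.items())}

# Toy graph: a 4-clique on a, b, c, d plus one pendant node e attached to d.
toy = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
       ("c", "d"), ("d", "e")]
degree, triangles = triangles_per_node(toy)
means = mean_triangles_by_degree(degree, triangles)
```

Plotting `means` on log-log axes for a real social network is how one would look for the ~n^1.6 trend; the pairwise neighbor check costs time quadratic in the degree, so it does not scale to the graph sizes discussed later in the talk.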

SLIDE 21

Triangle counting for large graphs?

Anomalous nodes in Twitter (~3 billion edges) [U Kang, Brendan Meeder, +, PAKDD’11]

SLIDE 24

Outline

  • Introduction – Motivation
  • Problem#1: Patterns in graphs

    – Static graphs
    – Time evolving graphs

  • Problem#2: Tools

SLIDE 25

Problem: Time evolution

  • with Jure Leskovec (CMU -> Stanford)
  • and Jon Kleinberg (Cornell – sabb. @ CMU)

SLIDE 27

T.1 Evolution of the Diameter

  • Prior work on Power Law graphs hints at slowly growing diameter:
    – diameter ~ O(log N)
    – diameter ~ O(log log N)
  • What is happening in real data?
  • Diameter shrinks over time

SLIDE 28

T.1 Diameter – “Patents”

  • Patent citation network
  • 25 years of data
  • @1999
    – 2.9 M nodes
    – 16.5 M edges

[Figure: diameter vs. time [years]]

SLIDE 29

Outline

  • Introduction – Motivation
  • Problem#1: Patterns in graphs
  • Problem#2: Tools

– Belief Propagation

  • Problem#3: Scalability
  • Conclusions

SLIDE 30

E-bay Fraud detection

w/ Polo Chau & Shashank Pandit, CMU [WWW’07]

SLIDE 33

E-bay Fraud detection - NetProbe

SLIDE 34

Popular press

And less desirable attention:

  • E-mail from ‘Belgium police’ (‘copy of your code?’)

SLIDE 35

Outline

  • Introduction – Motivation
  • Problem#1: Patterns in graphs
  • Problem#2: Tools
  • Problem#3: Scalability – PEGASUS
  • Conclusions

SLIDE 36

Scalability

  • Google: > 450,000 processors in clusters of ~2000 processors each [Barroso, Dean, Hölzle, “Web Search for a Planet: The Google Cluster Architecture” IEEE Micro 2003]
  • Yahoo: 5Pb of data [Fayyad, KDD’07]
  • Problem: machine failures, on a daily basis
  • How to parallelize data mining tasks, then?
  • A: map/reduce – hadoop (open-source clone) http://hadoop.apache.org/
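The map/reduce idea can be illustrated with a single-process Python stand-in that counts node degrees from an edge list. This is a sketch of the programming model only: it does not use the Hadoop API, and `map_edge`, `reduce_counts` and `run_map_reduce` are hypothetical names.

```python
from collections import defaultdict

def map_edge(edge):
    """Map step: emit (node, 1) for both endpoints of one edge record."""
    src, dst = edge
    yield src, 1
    yield dst, 1

def reduce_counts(node, values):
    """Reduce step: sum the partial counts collected for one node."""
    return node, sum(values)

def run_map_reduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # "shuffle": group values by key
    return dict(reducer(k, vs) for k, vs in groups.items())

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
degree = run_map_reduce(edges, map_edge, reduce_counts)
print(degree)
```

Each mapper call sees one record and each reducer call sees one key, so a framework can rerun any failed piece on another machine, which is what makes the model tolerant of daily machine failures.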

SLIDE 37

Outline

  • Introduction – Motivation
  • Problem#1: Patterns in graphs
  • Problem#2: Tools
  • Problem#3: Scalability – PEGASUS

– Radius plot

  • Conclusions

SLIDE 38

HADI for diameter estimation

  • Radius Plots for Mining Tera-byte Scale Graphs. U Kang, Charalampos Tsourakakis, Ana Paula Appel, Christos Faloutsos, Jure Leskovec, SDM’10
  • Naively: diameter needs O(N**2) space and up to O(N**3) time – prohibitive (N~1B)
  • Our HADI: linear on E (~10B)
    – Near-linear scalability wrt # machines
    – Several optimizations -> 5x faster
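To make the radius computation concrete, here is a small sketch that repeatedly passes over the edges and grows, for every node, the set of nodes reachable within h hops. It is not the HADI implementation, which the slides do not describe: exact Python sets stand in for whatever compact counters a billion-node system would need, and the function name `effective_radii` and the 90% `coverage` threshold are assumptions.

```python
from collections import Counter, defaultdict

def effective_radii(edges, coverage=0.9):
    """Per-node effective radius: the smallest hop count h at which a node
    reaches at least `coverage` of everything it can ever reach."""
    adj = defaultdict(set)
    for u, v in edges:  # treat the graph as undirected
        adj[u].add(v)
        adj[v].add(u)

    reach = {u: {u} for u in adj}   # nodes within 0 hops
    sizes = {u: [1] for u in adj}   # size of the reach set after each hop
    changed = True
    while changed:
        changed = False
        nxt = {}
        for u in adj:
            # One pass over the edges: merge the neighbors' reach sets.
            merged = set(reach[u])
            for v in adj[u]:
                merged |= reach[v]
            nxt[u] = merged
            if len(merged) != len(reach[u]):
                changed = True
        reach = nxt
        for u in adj:
            sizes[u].append(len(reach[u]))

    radii = {}
    for u, counts in sizes.items():
        target = coverage * counts[-1]
        radii[u] = next(h for h, c in enumerate(counts) if c >= target)
    return radii

# Path graph 1-2-3-4-5: the middle node reaches everything fastest.
radii = effective_radii([(1, 2), (2, 3), (3, 4), (4, 5)])
radius_plot = Counter(radii.values())  # radius -> number of nodes
```

Each round touches every edge once, so the work per round is linear in E; the exact sets, however, can grow to N entries per node, which is the O(N**2) space problem the slide mentions.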

SLIDE 39

[Figure: radius plot (axes: Radius, Count) marked ‘????’; 19+ [Barabasi+] (~1999, ~1M nodes)]

SLIDE 40

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

  • Largest publicly available graph ever studied.

[Figure: radius plot (axes: Radius, Count) marked ‘????’ and ‘??’; 19+ [Barabasi+] (~1999, ~1M nodes)]

SLIDE 41

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

  • Largest publicly available graph ever studied.

[Figure: radius plot (axes: Radius, Count) marked ‘????’; 19+? [Barabasi+]; 14 (dir.), ~7 (undir.)]

SLIDE 42

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

  • 7 degrees of separation (!)
  • Diameter: shrunk

[Figure: radius plot (axes: Radius, Count) marked ‘????’; 19+? [Barabasi+]; 14 (dir.), ~7 (undir.)]

SLIDE 43

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

Q: Shape?

[Figure: radius plot (axes: Radius, Count) marked ‘????’; ~7 (undir.)]

SLIDE 44

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

  • effective diameter: surprisingly small.
  • Multi-modality (?!)

SLIDE 45

Radius Plot of GCC of YahooWeb.

SLIDE 46

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

  • effective diameter: surprisingly small.
  • Multi-modality: probably mixture of cores.

SLIDE 47

YahooWeb graph (120Gb, 1.4B nodes, 6.6 B edges)

  • effective diameter: surprisingly small.
  • Multi-modality: probably mixture of cores.

[Figure annotations: ~7; Conjecture: EN, DE, BR]

SLIDE 49

Outline

  • Introduction – Motivation
  • Problem#1: Patterns in graphs
  • Problem#2: Tools
  • Problem#3: Scalability
  • Conclusions

SLIDE 50

OVERALL CONCLUSIONS – low level:

  • Several new patterns (shrinking diameters, triangle-laws, etc)
  • New tools:
    – Fraud detection (belief propagation)
  • Scalability: PEGASUS / hadoop

SLIDE 51

OVERALL CONCLUSIONS – medium-level

  • BIG DATA: Large datasets reveal patterns/outliers that are invisible otherwise

SLIDE 52

Project info

www.cs.cmu.edu/~pegasus

  • Akoglu, Leman
  • Chau, Polo
  • Kang, U
  • McGlohon, Mary
  • Tong, Hanghang
  • Prakash, Aditya
  • Koutra, Danae

Thanks to: NSF IIS-0705359, IIS-0534205, CTA-INARC; Yahoo (M45), LLNL, IBM, SPRINT, Google, INTEL, HP, iLab

SLIDE 53

Thank you for the honor!

  • Congratulations on the 20th anniversary and…

SLIDE 54

High-level conclusion: Collaborations

  • Sociology + CS (triangles)
  • Civil engineering + CS (sensor placement)
  • fMRI/medical + graphs (medical db’s)

SLIDE 55

Never stop learning


[Figure: Socrates, Plato, Aristotle]

GHRASKW AEI DIDASKOMENOS (“I grow old, always learning”)