Graph-based Learning, Larry Holder



SLIDE 1

Graph-based Learning

Larry Holder
Computer Science and Engineering
University of Texas at Arlington

SLIDE 2

Graph-based Learning

Multi-relational data mining and learning
SUBDUE graph-based relational learner
  • Discovery
  • Clustering
  • Graph grammar learning
  • Supervised learning

SLIDE 3

Multi-Relational Data Mining

Looking for patterns involving multiple tables (relations) in a relational database

Person
ID   Last   First   Age  Income
P1   Doe    John    30   80000
P2   Doe    Sally   29   90000
P3   Smith  Robert  35   100000

Married
Person1  Person2
P1       P2
P3       P7

RichCouple(X,Y) ← Person(X,LastX,FirstX,AgeX,IncX) & Person(Y,LastY,FirstY,AgeY,IncY) & Married(X,Y) & (IncX + IncY) > 150000.
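
The RichCouple rule joins the Person and Married tables. A minimal Python sketch of that join, using only the rows shown on this slide; the names `Person`, `PERSON`, `MARRIED` and `rich_couples` are illustrative and not part of any SUBDUE or ILP tool:

```python
from typing import NamedTuple


class Person(NamedTuple):
    id: str
    last: str
    first: str
    age: int
    income: int


# Rows copied from the Person and Married tables on the slide.
PERSON = [
    Person("P1", "Doe", "John", 30, 80000),
    Person("P2", "Doe", "Sally", 29, 90000),
    Person("P3", "Smith", "Robert", 35, 100000),
]
MARRIED = [("P1", "P2"), ("P3", "P7")]


def rich_couples(people, married, threshold=150000):
    """Return married pairs (X, Y) whose combined income exceeds threshold."""
    by_id = {p.id: p for p in people}
    couples = []
    for x, y in married:
        # Both partners must appear in Person for the rule body to hold.
        if x in by_id and y in by_id:
            if by_id[x].income + by_id[y].income > threshold:
                couples.append((x, y))
    return couples


print(rich_couples(PERSON, MARRIED))
```

The pair (P3, P7) is skipped because P7 has no row in the Person table shown here.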

SLIDE 4

Multi-Relational Data Mining

Approaches
  • Transform to non-relational problem
  • First-order logic based
      • Inductive Logic Programming (ILP)
  • Graph based

SLIDE 5

Graph-based Data Mining

Finding all subgraphs g within a set of graph transactions G such that

  $\frac{freq(g)}{|G|} > t$

  • where t is the minimum support
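
The support test can be sketched by brute force for very small graphs. This sketch assumes that freq(g) counts the transactions containing g as a subgraph under a label-preserving, injective vertex mapping (not necessarily induced). The representation (a dict of vertex labels plus a set of directed labeled edges) and all function names are illustrative; the systems on the next slide use far more efficient matching:

```python
from itertools import permutations


def contains(pattern, graph):
    """True if pattern occurs in graph under some injective, label-preserving map."""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_nodes = list(p_labels)
    # Try every injective assignment of pattern vertices to graph vertices.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[mapping[v]] for v in p_nodes)
        edges_ok = all((mapping[u], mapping[v], lab) in g_edges
                       for u, v, lab in p_edges)
        if labels_ok and edges_ok:
            return True
    return False


def support(pattern, transactions):
    """freq(g): number of graph transactions that contain the pattern."""
    return sum(contains(pattern, g) for g in transactions)


def is_frequent(pattern, transactions, t):
    """Support test from the slide: freq(g) / |G| > t."""
    return support(pattern, transactions) / len(transactions) > t


# Pattern: an "a" vertex with an "x" edge to a "b" vertex.
pattern = ({0: "a", 1: "b"}, {(0, 1, "x")})
transactions = [
    ({1: "a", 2: "b"}, {(1, 2, "x")}),
    ({1: "b", 2: "a", 3: "c"}, {(2, 1, "x"), (1, 3, "y")}),
    ({1: "a", 2: "b"}, {(2, 1, "x")}),  # edge direction reversed: no match
]
print(support(pattern, transactions), is_frequent(pattern, transactions, 0.5))
```

The permutation search is exponential in the pattern size, so it is only meant to make the definition concrete.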

SLIDE 6

Graph-based Data Mining

Systems
  • Apriori-based Graph Mining (AGM)
      • Inokuchi, Washio and Motoda, 2003
  • Frequent Sub-Graph discovery (FSG)
      • Kuramochi and Karypis, 2001
  • Graph-based Substructure pattern mining (gSpan)
      • Yan and Han, 2002
Focus on pruning and fast, code-based graph matching

SLIDE 7

Graph-based Relational Learning

Finding patterns in graph(s)
  • Discovery
  • Clustering
  • Supervised learning

[Figure: graph with three Person vertices, each with Last, First, Age and Income edges (Doe, John, 30, 80000; Doe, Sally, 29, 90000; Smith, Robert, 35, 100000), and Married edges]

SLIDE 8

Graph-based Relational Learning

Graph-Based Induction (GBI)
  • Yoshida, Motoda and Indurkhya, 1994
SUBstructure Discovery Using Examples (SUBDUE)
  • Cook and Holder, 1994
Focus on efficient subgraph generation and compression-based heuristic search

SLIDE 9

SUBDUE Graph-based Discovery

Graph representation
Graph compression and MDL
Discovery algorithm
Inexact graph match
Background knowledge
Parallel/distributed discovery

SLIDE 10

Graph Representation

Input is a labeled (vertices and edges) directed graph
A substructure is a connected subgraph
An instance of a substructure is an isomorphic subgraph of the input graph
Input graph compressed by replacing instances with vertex representing substructure

[Figure: Input Database (objects R1, C1, T1, S1, T2, S2, T3, S3, T4, S4); Substructure S1 in graph form (an object with shape triangle, on an object with shape square); Compressed Database (R1, C1 and S1 vertices)]
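
The compression step can be sketched as follows. This is a simplified illustration, not SUBDUE's implementation: it assumes the instances are given as disjoint sets of vertex ids and drops every edge whose endpoints both lie inside one instance. The function name `compress` and the `S1#i` ids are hypothetical.

```python
def compress(vertices, edges, instances, label="S1"):
    """Replace each instance (a set of vertex ids) by one vertex with `label`.

    vertices: dict mapping vertex id -> label
    edges: list of (src, dst, edge_label) tuples
    instances: list of disjoint sets of vertex ids
    """
    owner = {}          # original vertex id -> id of the replacing vertex
    new_vertices = {}
    for i, instance in enumerate(instances):
        new_id = f"{label}#{i}"
        new_vertices[new_id] = label
        for v in instance:
            owner[v] = new_id
    for v, vertex_label in vertices.items():
        if v not in owner:
            new_vertices[v] = vertex_label
    new_edges = []
    for src, dst, edge_label in edges:
        s, d = owner.get(src, src), owner.get(dst, dst)
        if src in owner and s == d:
            continue    # edge lies inside an instance and is absorbed
        new_edges.append((s, d, edge_label))
    return new_vertices, new_edges


# Two "triangle on square" instances that both rest on a circle object.
vertices = {1: "object", 2: "object", 3: "triangle", 4: "square",
            5: "object", 6: "object", 7: "triangle", 8: "square",
            9: "object", 10: "circle"}
edges = [(1, 3, "shape"), (2, 4, "shape"), (1, 2, "on"),
         (5, 7, "shape"), (6, 8, "shape"), (5, 6, "on"),
         (9, 10, "shape"), (2, 9, "on"), (6, 9, "on")]
new_v, new_e = compress(vertices, edges, [{1, 2, 3, 4}, {5, 6, 7, 8}])
print(new_v)
print(new_e)
```

Edges that cross an instance boundary are kept and re-attached to the new vertex, which is what lets later iterations build on S1.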

SLIDE 11

Graph Representation

[Figure: graph with vertices labeled S1 and S2]

SLIDE 12

Graph Compression and MDL

Minimum Description Length (MDL) principle
  • Best theory minimizes description length of theory and the data given theory
Best substructure S minimizes description length of substructure definition DL(S) and compressed graph DL(G|S)

  $\min_{S} \left( DL(S) + DL(G|S) \right)$
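
A rough sketch of the selection criterion, with a large simplification: description length is approximated by size (number of vertices plus number of edges) rather than by a bit-level encoding. It also assumes non-overlapping instances, each replaced by a single vertex. The counts are taken from the sample.g listing in these slides (20 vertices, 19 edges); the function and candidate names are illustrative.

```python
def size(num_vertices, num_edges):
    """Size-based stand-in for description length: vertices plus edges."""
    return num_vertices + num_edges


def total_description_length(graph, sub, num_instances):
    """Return DL(S) + DL(G|S) under the size-based approximation.

    graph, sub: (num_vertices, num_edges) pairs.
    Each instance loses its own vertices and edges and gains one vertex.
    """
    gv, ge = graph
    sv, se = sub
    dl_sub = size(sv, se)
    dl_graph_given_sub = size(gv - num_instances * sv + num_instances,
                              ge - num_instances * se)
    return dl_sub + dl_graph_given_sub


graph = (20, 19)  # vertex and edge counts of sample.g
candidates = {
    # name: ((vertices, edges) of the substructure, number of instances)
    "triangle-on-square": ((4, 3), 4),
    "object-shape-triangle": ((2, 1), 4),
}
best = min(candidates,
           key=lambda name: total_description_length(graph, *candidates[name]))
print(best, total_description_length(graph, *candidates[best]))
```

Under this approximation the uncompressed graph has size 39, the larger substructure brings the total down to 22, and the smaller one only reaches 34.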

SLIDE 13

Discovery Algorithm

1. Create substructure for each unique vertex label

Substructures: triangle (4), square (4), circle (1), rectangle (1)

[Figure: input graph of circle, rectangle, triangle and square vertices joined by "on" edges]

SLIDE 14

Discovery Algorithm

2. Expand best substructures by an edge or edge + neighboring vertex

Substructures:

[Figure: candidate substructures formed by adding an "on" edge between shapes (for example triangle on square, rectangle on triangle), shown alongside the input graph]

SLIDE 15

Discovery Algorithm

3. Keep only best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures > limit
5. Compress graph and repeat to generate hierarchical description
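
The control loop of steps 1 to 4 can be sketched on a one-dimensional analogue. This is deliberately not a graph implementation: substrings stand in for substructures, occurrences for instances, and "extend by one character" for "expand by an edge", so that no graph matching is needed. The value function is an assumed size-based compression score; all names are illustrative.

```python
def value(pattern, text):
    """Size saved by replacing each non-overlapping occurrence with one
    symbol, minus the cost of storing the pattern definition once."""
    occurrences = text.count(pattern)  # str.count is non-overlapping
    return occurrences * (len(pattern) - 1) - len(pattern)


def expand(pattern, text):
    """All one-character right-extensions of pattern that occur in text."""
    extensions = set()
    start = text.find(pattern)
    while start != -1:
        end = start + len(pattern)
        if end < len(text):
            extensions.add(text[start:end + 1])
        start = text.find(pattern, start + 1)
    return extensions


def discover(text, beam_width=3, limit=50):
    queue = sorted(set(text))                       # step 1: one per symbol
    discovered = []
    while queue and len(discovered) <= limit:       # step 4: termination
        children = set()
        for pattern in queue:
            discovered.append(pattern)
            children |= expand(pattern, text)       # step 2: expand
        # step 3: keep only the best beam_width candidates
        queue = sorted(children,
                       key=lambda p: (-value(p, text), p))[:beam_width]
    return max(discovered, key=lambda p: (value(p, text), p))


text = "abcxabcyabcz"
best = discover(text)
print(best, text.replace(best, "S"))
```

The final `replace` corresponds to step 5: the compressed text could be fed back into `discover` to build a hierarchy. In the graph setting, `expand` and `value` would be replaced by subgraph extension and the MDL measure.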

SLIDE 16

DNA Example

SLIDE 17

Sample SUBDUE Input

sample.g:

    v 1 object
    v 2 object
    v 3 object
    v 4 object
    v 5 object
    v 6 object
    v 7 object
    v 8 object
    v 9 object
    v 10 object
    v 11 triangle
    v 12 triangle
    v 13 triangle
    v 14 triangle
    v 15 square
    v 16 square
    v 17 square
    v 18 square
    v 19 circle
    v 20 rectangle
    e 1 11 shape
    e 2 12 shape
    e 3 13 shape
    e 4 14 shape
    e 5 15 shape
    e 6 16 shape
    e 7 17 shape
    e 8 18 shape
    e 9 19 shape
    e 10 20 shape
    e 1 5 on
    e 2 6 on
    e 3 7 on
    e 4 8 on
    e 5 10 on
    e 9 10 on
    e 10 2 on
    e 10 3 on
    e 10 4 on

[Figure: input database with objects R1, C1, T1, S1, T2, S2, T3, S3, T4, S4]
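
A small reader for this format, as a sketch. It handles only the two record types visible on the slide (`v id label` and `e source target label`) and treats edges as directed, as stated on the Graph Representation slide. The function name `parse_subdue` is hypothetical and not part of the SUBDUE distribution.

```python
def parse_subdue(text):
    """Parse 'v' and 'e' records into (vertices, edges).

    vertices: dict mapping vertex id -> label
    edges: list of (source, target, label) tuples, in file order
    """
    vertices = {}
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue                      # skip blank lines
        if parts[0] == "v":
            vertices[int(parts[1])] = parts[2]
        elif parts[0] == "e":
            edges.append((int(parts[1]), int(parts[2]), parts[3]))
    return vertices, edges


# Excerpt of sample.g: one triangle object on one square object.
excerpt = """
v 1 object
v 5 object
v 11 triangle
v 15 square
e 1 11 shape
e 5 15 shape
e 1 5 on
"""
vertices, edges = parse_subdue(excerpt)
print(vertices)
print(edges)
```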

SLIDE 18

Inexact Graph Match

Some variations may occur between instances
Want to abstract over minor differences
Difference = cost of transforming one graph to make it isomorphic to another
Match if cost/size < threshold

SLIDE 19

Inexact Graph Match

[Figure: two small graphs, one with vertices 1 and 2 and one with vertices 3, 4 and 5, with vertex labels A, B and edge labels a, b]

Search tree of partial mappings, with the cost of each:

  (1,3) 1:  (2,4) 7   (2,5) 6   (2,λ) 10
  (1,4) 0:  (2,3) 3   (2,5) 6   (2,λ) 9
  (1,5) 1:  (2,3) 7   (2,4) 7   (2,λ) 10
  (1,λ) 1:  (2,3) 9   (2,4) 10  (2,5) 9   (2,λ) 11

Least-cost match is {(1,4), (2,3)}

SLIDE 20

Inexact Graph Match

Vertices considered by degree
Polynomially constrained
  • Greedy after n^k partial mappings considered
  • Suboptimal mappings rare for k>2

SLIDE 21

Background Knowledge

User-defined substructures
Two alternative uses
  • Prime search queue
  • Initial graph compression
Variant of discovery algorithm used to generate instances

SLIDE 22

Parallel/Distributed Discovery

Divide graph into P partitions
Distribute to P processors
Each processor performs serial discovery on local partition
Broadcast best substructures, evaluate on other processors
Master processor stores best global substructures

SLIDE 23

Graph-based Clustering

Hierarchical, conceptual clustering
Previous work defined classification trees
  • Inadequate in relational domains
Better hierarchical description: classification lattice
  • A cluster can have more than one parent
  • A parent can be at any level (not only one level above)
Use iterative graph-based discovery

SLIDE 24

Clustering: DNA

SLIDE 25

Clustering: DNA

Coverage
  • 61%
  • 68%
  • 71%

[Figure: clustering hierarchy rooted at DNA; nodes contain molecular fragments such as C-N, C-C, O=P-OH and CH2]

SLIDE 26

Learning Graph Grammars

Graph grammar production: S → P
  • S is a non-terminal
  • P is a graph containing terminals and/or non-terminals
  • S → P1 | P2 | … | Pn
Recursive production: S → P S | P
  • P linked to S via a single edge
  • Algorithm exponential in number of linking edges

SLIDE 27

Example Graph Grammar

[Figure: example productions over non-terminals S2 and S3 and terminal vertices a, b, c, d, e, f]

SLIDE 28

Graph Grammar Learning

SUBDUE Extensions (SubdueGL)
  • Each iteration results in a graph grammar production substructure
  • Production used to compress graph
      • Replace instances of right-hand side with new vertex labeled with non-terminal on left-hand side
  • Iterations continue until entire graph compressed to single non-terminal

SLIDE 29

SubdueGL Example

Input graph
  • Edge labels: ‘t’, ‘s’, ‘next’

[Figure: input graph with vertex labels a, b, c, d, e, f, k, q, r, x, y, z]

SLIDE 30

SubdueGL Example

First production rule

[Figure: production rule for S1 over vertices x, q, z, y]

Input graph parsed by first production

[Figure: graph with vertices a, b, c, d, e, f, r, k and S1]

SLIDE 31

SubdueGL Example

Second and third production rules
Input graph parsed by productions

[Figure: production rules for S2 (over a, b, S3) and S3 (over c, d, e, f); parsed graph with vertices r, k, S1 and S2]

SLIDE 32

Graph-Based Supervised Learning

Input now a set of positive graphs and a set of negative graphs

[Figure: Input graphs and Hypothesis, drawn with object vertices, shape edges to triangle and square, and "on" edges]

SLIDE 33

Graph-Based Supervised Learning

Solution 1
  • Find substructure compressing positive graphs, but not negative graphs
  • Compress graphs and iterate until no further compression
Problem
  • Compressing, instead of removing, partially-covered positive graphs leads to overly-specific hypotheses

SLIDE 34

Graph-Based Supervised Learning

Solution 2
  • Find substructure covering (i.e., subgraph of) positive graphs, but not negative graphs
  • Remove covered positive graphs and iterate until all covered
Substructure value = 1 - Error

  $Error = \frac{\#PosEgsNotCovered + \#NegEgsCovered}{\#PosEgs + \#NegEgs}$
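
The value computation is simple arithmetic. A sketch, assuming coverage has already been decided for each example graph; the boolean-list representation and the name `substructure_value` are illustrative:

```python
def substructure_value(pos_covered, neg_covered):
    """Return 1 - Error for a candidate substructure.

    pos_covered: one boolean per positive graph (True if the substructure
                 is a subgraph of it)
    neg_covered: one boolean per negative graph
    """
    pos_not_covered = sum(not covered for covered in pos_covered)
    neg_cov = sum(neg_covered)
    error = (pos_not_covered + neg_cov) / (len(pos_covered) + len(neg_covered))
    return 1 - error


# Four positive graphs (one missed) and four negative graphs (one covered).
value = substructure_value([True, True, True, False],
                           [False, False, True, False])
print(value)
```

Here Error = (1 + 1) / (4 + 4) = 0.25, so the substructure value is 0.75.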

SLIDE 35

Supervised Learning: Cancer

Chemical toxicity
SUBDUE achieved 62% accuracy classifying carcinogenic vs. non-carcinogenic compounds

[Figure: compound graph and learned substructures; labels include compound, atom, element, type, charge, contains, six_ring, in_group, halide10, ashby_alert, positive, ames, di227, cytogen_ca, drosophila_slrl, chromaberr, amine and has_group]

SLIDE 36

Application Domains

Biochemical domains
  • Protein data
  • DNA data
  • Toxicology (cancer) data
Spatial-temporal domains
  • Earthquake data
  • Aircraft Safety and Reporting System
Web topology and search
  • Social network analysis
…

[Figure: web graph with web_page vertices, hyperlink edges and a home label]

SLIDE 37

Summary

Multi-relational data mining and learning
Graph-based relational learning
  • Discovery
  • Clustering
  • Graph grammar learning
  • Supervised learning

SLIDE 38

Future Directions

Efficient graph-based learning from incremental streaming data
Supervised graphs
  • All examples in one, connected graph
Graph-based anomaly detection
Improved scalability
  • Graph and subgraph isomorphism

SLIDE 39

Further Information

Graph-based Data Mining
  • http://banzai.uta.edu/gdm
SUBDUE Project
  • http://ailab.uta.edu/subdue