Frequent Subgraph Mining Frequent Subgraph Mining (FSM) Outline FSM - PowerPoint PPT Presentation

Frequent Subgraph Mining

Frequent Subgraph Mining (FSM) Outline • FSM Preliminaries • FSM Algorithms – gSpan – complete FSM on labeled graphs – SUBDUE – approximate FSM on labeled graphs – SLEUTH – FSM on trees • Review

FSM In a Nutshell • Discovery of graph structures that occur a significant number of times across a set of graphs • Ex.: Common occurrences of hydroxide-ion • Other instances: – Finding common biological pathways among species. – Recurring patterns of humans interaction during an epidemic. Carbonic Acid – Highlighting similar data to reveal data set as a whole. H O H Sulfuric Acid Acetic Acid O O O C O H O S H C C Ammonia O O H H O H H H H N H

FSM Preliminaries • Support is some integer or frequency • Frequent graphs occur more than support number of times. O-H present in ¾ inputs � frequent if support <= 3 Carbonic Acid H O H Sulfuric Acid Acetic Acid O O O C O H O S H C C Ammonia O O H H O H H H H N H

What Makes FSM So Hard? • Isomorphic graphs have same structural properties even though they may look different. • Subgraph isomorphism problem : Does a graph contain a subgraph isomorphic to another graph? • FSM algorithms encounter this problem while buildings graphs. • This problem is known to be NP-complete ! A B A B Isomorphic under A,B,C,D labeling C D D C

Pattern Growth Approach • Underlying strategy of both traditional frequent pattern mining and frequent subgraph mining • General Process: – candidate generation : which patterns will be considered? For FSM, – candidate pruning : if a candidate is not a viable frequent pattern, can we exploit the pattern to prevent unnecessary work? • subgraphs and subsets exponentiate as size increases! – support counting : how many of a given pattern exist? • These algorithms work in a breadth-first or depth-first way. – Joins smaller frequent sets into larger ones. – Checks the frequency of larger sets.

Pattern Growth Approach – Apriori • Apriori principle: if an itemset is frequent, then all of its subsets are also frequent. – Ex. if itemset {A, B, C, D} is frequent, then {A, B} is frequent. – Simple proof: With respect to frequency, all sets trivially contain their subsets, thus frequency of subset >= frequency of set. – Same property applies to (sub)graphs! • Apriori algorithm exploits this to prune huge sections of the search space! ∅ If A is infrequent, no A B C supersets with A can be frequent! AB AC BC ABC

FSM Algorithms Discussed • gSpan – complete frequent subgraph mining – improves performance over straightforward apriori extensions to graphs through DFS Code representation and aggressive candidate pruning • SUBDUE – approximate frequent subgraph mining – uses graph compression as metric for determining a “frequently occuring” subgraph • SLEUTH – complete frequent subgraph mining – built specifically for trees

FSM – R package • R package for FSM is called subgraphMining • To import: install.packages(“subgraphMining”) • Package contains: gSpan, SUBDUE, SLUETH. • Also contains the following data sets: – cslogs – metabolicInteractions. • To load the data, use the following code: # The cslogs data set data(cslogs) # The matabolicInteractions data data(metabolicInteractions)

FSM Outline • FSM Preliminaries • FSM Algorithms – gSpan – SUBDUE – SLEUTH • Review

gSpan: Graph-Based Substructure Pattern Mining • Written by Xifeng Yan & Jiawei Han in 2002. • Form of pattern-growth mining algorithm. – Adds edges to candidate subgraph – Also known as, edge extension • Avoid cost intensive problems like – Redundant candidate generation – Isomorphism testing • Uses two main concepts to find frequent subgraphs – DFS lexicographic order – minimum DFS code

gSpan Inputs • Set of graphs, support • Graph of form � = ( � , � , � � , � � ) – � , � – vertex and edge sets – � � – vertex labels – � � – edge labels – label sets need not be one-to-one H O H � � = { � , � , � } � � = { single−bond, double−bond } O C O

gSpan Components Strategy: • build frequent subgraphs bottom-up , using DFS code as regularized representation • eliminate redundancies via minimal DFS codes based on code lexicographic ordering Depth-first Search (DFS) Code structured graph representation for building, comparing DFS Lexicographic minimal DFS code Order selection, pruning of canonical comparison subgraphs of graphs

Depth First Search Primer Todo…?

gSpan: DFS codes Code Edge # DFS Code: sequence of edges traversed during DFS 0 (0,1,X,a,Y) 1 (1,2,Y,b,X) Vertex discovery 2 (2,0,X,a,X) times 3 (2,3,X,c,Z) 0 X 4 (3,1,Z,b,Y) 5 (1,4,Y,d,Z) a a 1 Y d Format: ( � , � , � � , � ( � , � ) , � � ) 4 b Z � , � – vertices by time of discovery b X 2 � � , � � - vertex labels of � � , � � c � ( � , � ) – edge label between � � , � � 3 Z � < � : forward edge � > � : back edge

DFS Code: Edge Ordering • Edges in code ordered in very specific manner, corresponding to DFS process • � � = ( � � , � � ), � � = ( � � , � � ) • � � ≺ � � � � � appears before � � in code • Ordering rules: 1. if � � = � � and � � < � � � � � ≺ � � • from same source vertex, � � traversed before � � in DFS 2. if � � < � � and � � = � � � � � ≺ � � • � � is a forward edge and � � traversed as result of � � traversal if � � ≺ � � and � � ≺ � � , � � � ≺ � � 3. • ordering is transitive

DFS Code: Edge Ordering Example 0 Code Edge # X • Rule applications by edge # a a 0 (0,1,X,a,Y) • 0 ≺ 1 (Rule 2) 1 Y 1 (1,2,Y,b,X) d • 1 ≺ 2 (Rule 2) 2 (2,0,X,a,X) 4 b Z • 0 ≺ 2 (Rule 3) 3 (2,3,X,c,Z) b 2 X 4 (3,1,Z,b,Y) • 2 ≺ 3 (Rule 1) 5 (1,4,Y,d,Z) c • Exercise: what 3 Z others? Edge ordering can be recorded easily during the DFS!

Graphs have multiple DFS Codes! Exercise: Write the 2 rightmost graphs using DFS code 0 0 0 X Y X X d 4 b a a a a a a Z 1 1 X 1 X Y Y b d d 4 a a b b Z Z c b b b Y 2 X 2 X 2 X d b c c c Z 3 3 3 Z Z Z Z 4 solution to redundant DFS codes: lexical ordering, minimal code!

DFS Lexicographic Ordering vs. DFS Code • DFS code: Ordering of edge sequence of a particular DFS – E.g. DFS’s that start at different vertices may have different DFS codes • Lexicographic ordering: ordering between different DFS codes

DFS Lexicographic Ordering • Given lexicographic ordering of label set � , ≺ � • Given graphs � � , � � (equivalent label sets). • Given DFS codes – � = code � � , � � = � � , � � , … , � � – � = code � � , � � = � � , � � , … , � � – (assume � ≥ � ) • � ≤ � iff either of the following are true: – ∃ � , 0 ≤ � ≤ min � , � such that • � � = � � for � < � and • � � ≺ � � � – � � = � � �� 0 ≤ � ≤ �

DFS Lex. Ordering: Edge Comparison • Given DFS codes – � = code � � , � � = � � , � � , … , � � – � = code � � , � � = � � , � � , … , � � – (assume � ≥ � ) • Given � such that � � = � � for � < � • Given � � = � � , � � , � � � , � � � , � � , � � � , � � = � � , � � , � � � , � � � , � � , � � � , • � � ≺ � � � if one of the following cases Case 1 : Both forward edges, AND… Case 3 : � � back , � � forward � � � ≺ � � � Case 2 : Both back edges, AND…

Edge Comparison: Case 1 (both forward) • Both forward edges, AND one of the following: – � � < � � (edge starts from a later visited vertex) • Why is this (think about DFS process)? – � � = � � AND labels of � lexicographically less than labels of � , in order of tuple. • Ex: Labels are strings, � � = __, __, m, e, x , � � = (__, __, m, u, x) – m = m, e < u � � � ≺ � � � • Note: if both forward edges, then � � = � � – Reasoning: all previous edges equal, target vertex discovery times are the same

Edge Comparison: Case 2 (both back) • Both back edges, AND one of the following: – � � < � � (edge refers to earlier vertex) – � � = � � AND edge label of � lexicographically less than � • Note: given that all previous edges equal, vertex labels must also be equal • Note: if both back edges, then � � = � � – Reasoning: all previous edges equal, source vertex discovery times are the same.

Code (A) Code (B) Code (C) Edge # 0 (0,1,X,a,Y) (0,1,Y,a,X) (0,1,X,a,X) 1 (1,2,Y,b,X) (1,2,X,a,X) (1,2,X,a,Y) 2 (2,0,X,a,X) (2,0,X,b,Y) (2,0,Y,b,X) 3 (2,3,X,c,Z) (2,3,X,c,Z) (2,3,Y,b,Z) 4 (3,1,Z,b,Y) (3,1,Z,b,X) (3,0,Z,c,X) 5 (1,4,Y,d,Z) (0,4,Y,d,Z) (2,4,Y,d,Z) 0 0 0 X X Y d a 4 a b a a Z 1 1 Y 1 X X d b 4 b a a Z b b c X 2 X 2 Y 2 d c c b Z 3 3 3 Z Z Z 4

Frequent Subgraph Mining Frequent Subgraph Mining (FSM) Outline FSM - PowerPoint PPT Presentation

Frequent Subgraph Mining Frequent Subgraph Mining (FSM) Outline FSM Preliminaries FSM Algorithms gSpan complete FSM on labeled graphs SUBDUE approximate FSM on labeled graphs SLEUTH FSM on trees Review FSM In

Frequent Pattern Mining Frequent Sequence Mining Frequent Tree Mining Christian Borgelt

FSM FSM Full Service Model Initiative Full Service Model Initiative September 2017 Advisors

FSM FSM Full Service Model Initiative Full Service Model Initiative Advisors February 2017

Topic II.1: Frequent Subgraph Mining Discrete Topics in Data Mining Universitt des Saarlandes,

Mining Large Single Networks under Subgraph Mining Large Single Networks under Subgraph

Scope Constrained Frequent Pattern Mining: Constrained Frequent Pattern Mining: A A

Frequent Itemset Mining Stony Brook University CSE545, Fall 2016 Frequent Itemset Mining aka

Lecture 5-2: Sequential Circuit Design continued FSM design Design steps for FSM: Draw state

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Chlo-Agathe Azencott &

Data Mining 2018 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 10, 2018

Introduction to Data Mining Frequent Pattern Mining and Association Analysis Li Xiong Slide

Introduction to Data Mining Frequent Pattern Mining and Association Analysis Li Xiong Slide

CORE DECOMPOSITION AND DENSEST SUBGRAPH IN MULTILAYER NETWORKS CORE DECOMPOSITION AND DENSEST

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Frequent Pattern Mining Overview Basic Concepts and Challenges Data Mining Techniques:

A Multiphase Microreactor for A Multiphase Microreactor for Organic Nitration Organic Nitration

Acid Base Chemistry Notes 1. is the

Lecture 7 Professor Hicks Inorganic Chemistry (CHE152) substances that Strong acids Acids

Thinking Like a Chemist About Acids and Bases UNIT 6 DAY 5 What are we going to learn today?

EVOLUTION OF TRACE METAL REMOVAL PRODUCTS IN FIELD-SCALE VERTICAL FLOW BIOREACTORS Julie LaBar

Finding Gene and Chemical Names in Patent Text Supervised Learning with no Training Data slides

CONTEXT: SPECIALTY CHEMICALS The chemical industries manufacture a wide spectrum of products:

BIOCONVERSION TECHNOLOGIES PTT203 BIOCHEMICAL ENGINEERING PUAN NURUL AIN HARMIZA ABDULLAH

Frequent Subgraph Mining Frequent Subgraph Mining (FSM) Outline FSM - PowerPoint PPT Presentation

Frequent Subgraph Mining Frequent Subgraph Mining (FSM) Outline FSM Preliminaries FSM Algorithms gSpan complete FSM on labeled graphs SUBDUE approximate FSM on labeled graphs SLEUTH FSM on trees Review FSM In

Frequent Pattern Mining Frequent Sequence Mining Frequent Tree Mining Christian Borgelt

FSM FSM Full Service Model Initiative Full Service Model Initiative September 2017 Advisors

FSM FSM Full Service Model Initiative Full Service Model Initiative Advisors February 2017

Topic II.1: Frequent Subgraph Mining Discrete Topics in Data Mining Universitt des Saarlandes,

Mining Large Single Networks under Subgraph Mining Large Single Networks under Subgraph

Scope Constrained Frequent Pattern Mining: Constrained Frequent Pattern Mining: A A

Frequent Itemset Mining Stony Brook University CSE545, Fall 2016 Frequent Itemset Mining aka

Lecture 5-2: Sequential Circuit Design continued FSM design Design steps for FSM: Draw state

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Chlo-Agathe Azencott &amp;

Data Mining 2018 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 10, 2018

Introduction to Data Mining Frequent Pattern Mining and Association Analysis Li Xiong Slide

Introduction to Data Mining Frequent Pattern Mining and Association Analysis Li Xiong Slide

CORE DECOMPOSITION AND DENSEST SUBGRAPH IN MULTILAYER NETWORKS CORE DECOMPOSITION AND DENSEST

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Frequent Pattern Mining Overview Basic Concepts and Challenges Data Mining Techniques:

A Multiphase Microreactor for A Multiphase Microreactor for Organic Nitration Organic Nitration

Acid Base Chemistry Notes 1. is the

Lecture 7 Professor Hicks Inorganic Chemistry (CHE152) substances that Strong acids Acids

Thinking Like a Chemist About Acids and Bases UNIT 6 DAY 5 What are we going to learn today?

EVOLUTION OF TRACE METAL REMOVAL PRODUCTS IN FIELD-SCALE VERTICAL FLOW BIOREACTORS Julie LaBar

Finding Gene and Chemical Names in Patent Text Supervised Learning with no Training Data slides

CONTEXT: SPECIALTY CHEMICALS The chemical industries manufacture a wide spectrum of products:

BIOCONVERSION TECHNOLOGIES PTT203 BIOCHEMICAL ENGINEERING PUAN NURUL AIN HARMIZA ABDULLAH

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Chlo-Agathe Azencott &