Mining Algorithms for New Applications: the case of Depth-First - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Mining Algorithms for New Applications: the case of Depth-First Search
Sanjoy Dasgupta, Russell Impagliazzo, Ragesh Jaiswal
Credit: Some of today's slides are due to Miles Jones
CSE 101, Spring 2020, Week 2

slide-2
SLIDE 2

Algorithm Mining

  • Algorithms designed for one problem are often usable for a number of other computational tasks, some of which seem unrelated to the original goal
  • Today, we are going to look at how to use the depth-first search algorithm to solve a variety of graph problems

slide-3
SLIDE 3

Algorithm Mining techniques

  • Deeper Analysis: What else does the algorithm already give us?
  • Augmentation: What additional information could we glean just by keeping track of the progress of the algorithm?
  • Modification: How can we use the same idea to solve new problems in a similar way?
  • Reduction: How can we use the algorithm as a black box to solve new problems?

slide-4
SLIDE 4

Graph Reachability and DFS

  • Graph reachability: Given a directed graph G and a starting vertex v, return an array that specifies for each vertex u whether u is reachable from v
  • Depth-First Search (DFS): An efficient algorithm for graph reachability
  • Breadth-First Search (BFS): Another efficient algorithm for graph reachability

slide-5
SLIDE 5

DFS as recursion

  • procedure explore(G,v)
  • Input: graph G = (V,E); node v in V
  • Output: array visited[u], set to true for every u reachable from v
  • 1. visited[v] = true
  • 2. for each edge (v,u) in E do:
  •      if not visited[u]: explore(G,u)
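
The recursive procedure above can be sketched in Python roughly as follows. This is a minimal sketch, assuming the graph is stored as a dictionary of adjacency lists. The name `example_graph` and its edge set are assumptions made for illustration, not taken from the slides.

```python
def explore(graph, v, visited=None):
    """Return the set of vertices reachable from v.

    graph is assumed to be a dict: vertex -> list of out-neighbors.
    """
    if visited is None:
        visited = set()
    visited.add(v)                      # step 1: mark v as visited
    for u in graph.get(v, []):          # step 2: scan every edge (v, u)
        if u not in visited:
            explore(graph, u, visited)
    return visited


# Hypothetical example graph (edge set assumed for illustration).
example_graph = {
    "A": ["C"],
    "B": ["D", "A"],
    "C": ["B", "E", "F"],
    "D": [],
    "E": ["F", "G", "H"],
    "F": [],
    "G": ["H"],
    "H": [],
}

print(sorted(explore(example_graph, "E")))  # vertices reachable from E
```

A set plays the role of the visited array here. On very deep graphs the recursion could hit Python's recursion limit, which is one reason to use the iterative form.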
slide-6
SLIDE 6

Key Points of DFS

  • No matter how the recursions are nested, for each vertex u, we only run explore(u) ONCE, because after that, it is marked visited. (We need this for termination and efficiency.)
  • On the other hand, when we discover a path to a new destination, we always explore all new vertices reachable from it. (We need this for correctness, to guarantee that we find ALL the reachable vertices.)

slide-7
SLIDE 7

DFS as iterative algorithm

GRAPH REACHABILITY:

procedure DFS(G: directed graph, v: vertex)
  Initialize array visited[u] to False for every vertex u
  Initialize stack of vertices F; Push v; visited[v] = True
  While F is not empty:
    v = Pop
    For each neighbor u of v (in reverse order):
      If not visited[u]: Push u; visited[u] = True
  Return visited

For comparison, the recursive version:

procedure explore(G = (V,E), s)
  visited(s) = true
  for each edge (s,u):
    if not visited(u): explore(G,u)
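
A minimal Python sketch of the stack-based procedure is given here, assuming an adjacency-list dictionary. A Python list serves as the stack F. The function name `dfs_reachable` and the example edge set are assumptions for illustration.

```python
def dfs_reachable(graph, start):
    """Iterative reachability: return the set of vertices reachable from start."""
    visited = {start}                    # visited[start] = True
    stack = [start]                      # the stack F
    while stack:
        v = stack.pop()
        # Scan neighbors in reverse order, as on the slide, marking each
        # vertex when it is pushed so it is never pushed twice.
        for u in reversed(graph.get(v, [])):
            if u not in visited:
                visited.add(u)
                stack.append(u)
    return visited


# Hypothetical example graph (edge set assumed for illustration).
example_graph = {
    "A": ["C"],
    "B": ["D", "A"],
    "C": ["B", "E", "F"],
    "D": [],
    "E": ["F", "G", "H"],
    "F": [],
    "G": ["H"],
    "H": [],
}

print(sorted(dfs_reachable(example_graph, "G")))  # vertices reachable from G
```

Because a vertex is marked when pushed, each vertex enters the stack at most once, which is what the running-time analysis relies on.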

slide-8
SLIDE 8

DFS on Directed Graphs

[Figure: directed graph on vertices A through H]

F = A

slide-9
SLIDE 9

DFS on Directed Graphs

[Figure: directed graph on vertices A through H]

F = A. Pop A. Neighbors of A = (C). Push C; visited[C] = True. F = C

slide-10
SLIDE 10

DFS on Directed Graphs

[Figure: directed graph on vertices A through H]

F = C. Pop C. Neighbors of C = (F, E, B). Push F, Push E, Push B. F = B, E, F

slide-11
SLIDE 11

DFS on Directed Graphs

[Figure: directed graph on vertices A through H]

F = B, E, F. Pop B. Neighbors of B = (D, A). Push D. F = E, F, D

slide-12
SLIDE 12

DFS on Directed Graphs

[Figure: directed graph on vertices A through H]

F = E, F, D. Pop E. Neighbors of E = (H, G, F). Push G, H. F = F, D, G, H. Pop, Pop, Pop, Pop

slide-13
SLIDE 13

DFS as iterative algorithm

GRAPH REACHABILITY:

procedure DFS(G: directed graph, v: vertex)
  Initialize array visited[u] to False                         O(|V|)
  Initialize stack of vertices F; Push v; visited[v] = True    O(1)
  While F is not empty:                                        done at most |V| times, once per v
    v = Pop
    For each neighbor u of v (in reverse order):               O(1 + deg(v)) = O(|V|)
      If not visited[u]: Push u; visited[u] = True
  Return visited

Correct: the loop takes |V| * O(|V|), the rest O(|V|), total O(|V|^2)

slide-14
SLIDE 14

DFS as iterative algorithm

GRAPH REACHABILITY:

procedure DFS(G: directed graph, v: vertex)
  Initialize array visited[u] to False                         O(|V|)
  Initialize stack of vertices F; Push v; visited[v] = True    O(1)
  While F is not empty:                                        done at most |V| times, once per v
    v = Pop
    For each neighbor u of v (in reverse order):               O(1 + deg(v)) = O(|V|)
      If not visited[u]: Push u; visited[u] = True
  Return visited

Tighter: the loop body runs once for each v and takes O(1 + deg(v)) time on that iteration. So total time is at most O(Σ_v (1 + deg(v))) = O(|V| + |E|)
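
The step from the per-vertex cost to the O(|V| + |E|) bound can be written out as below, assuming deg(v) denotes the out-degree of v, so that every directed edge is counted exactly once, at its tail.

```latex
\sum_{v \in V} \bigl(1 + \deg(v)\bigr)
  = \sum_{v \in V} 1 + \sum_{v \in V} \deg(v)
  = |V| + |E|
```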

slide-15
SLIDE 15

Complete DFS

  • DFS actually just costs O(number of reachable nodes + number of reachable edges). Parts of the graph that weren't found don't cost anything.
  • So, still in total O(|V|+|E|) time, we can keep on running explore from undiscovered vertices until we've found the whole graph. We usually keep track of which iteration each vertex was discovered in.
  • Alternative viewpoint: Add a new vertex with edges to all vertices. Run DFS from the new vertex.
slide-16
SLIDE 16

Depth first search

Basic version:

procedure DFS(G)
  cc = 0
  for each vertex v: visited(v) = false
  for each vertex v:
    if not visited(v):
      cc++
      explore(G,v)

With a clock for pre and post numbers:

procedure DFS(G)
  cc = 0
  clock = 1
  for each vertex v: visited(v) = false
  for each vertex v:
    if not visited(v):
      cc++
      explore(G,v)

procedure previsit(v)
  pre(v) = clock
  clock++

procedure postvisit(v)
  post(v) = clock
  clock++
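
One way to sketch the complete DFS with a clock in Python is shown here, under some assumptions. The graph is a dictionary with every vertex as a key. The previsit and postvisit steps are inlined in `explore`. A `component` map recording the value of cc for each vertex is an addition for illustration. The example edge set is hypothetical.

```python
def dfs(graph):
    """Run DFS over the whole graph; return (pre, post, component)."""
    visited = set()
    pre, post, component = {}, {}, {}
    clock = 1
    cc = 0

    def explore(v):
        nonlocal clock
        visited.add(v)
        component[v] = cc          # which outer-loop iteration found v
        pre[v] = clock             # previsit
        clock += 1
        for u in graph[v]:
            if u not in visited:
                explore(u)
        post[v] = clock            # postvisit
        clock += 1

    for v in graph:
        if v not in visited:
            cc += 1
            explore(v)
    return pre, post, component


# Hypothetical example graph (edge set assumed for illustration).
example_graph = {
    "A": ["C"],
    "B": ["D", "A"],
    "C": ["B", "E", "F"],
    "D": [],
    "E": ["F", "G", "H"],
    "F": [],
    "G": ["H"],
    "H": [],
}

pre, post, component = dfs(example_graph)
for v in sorted(example_graph):
    print(v, pre[v], post[v], component[v])
```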

slide-17
SLIDE 17

All reachable vertices, not all paths

  • While DFS finds all the reachable vertices, it doesn't consider all paths between them. No feasible algorithm could.

[Figure: graph on vertices A1, A2, A3, ..., An]

How many paths from A1 to An?

slide-18
SLIDE 18

All reachable vertices, not all paths

  • While DFS finds all the reachable vertices, it doesn't consider all paths between them. No feasible algorithm could.

[Figure: graph on vertices A1, A2, A3, ..., An]

2^(n-1) paths from A1 to An

slide-19
SLIDE 19

Finding paths: the DFS tree

  • After the DFS, we know which vertices are reachable, but not how to get there.

How long could a path in a graph be? How about a simple path? How many paths do we have to find?

slide-20
SLIDE 20

Finding paths: the DFS tree

  • After the DFS, we know which vertices are reachable, but not how to get there.

We have up to |V|-1 paths to find, and each path can be up to length |V|.

slide-21
SLIDE 21

Synergy

  • After the DFS, we know which vertices are reachable, but not how to get there.

We have up to |V|-1 paths to find, and each path can be up to length |V|.

Sometimes, doing something similar many times costs less than doing it from scratch each time. For DFS, the paths overlap, and form a tree with at most |V|-1 edges.

slide-22
SLIDE 22

DFS augmented to create DFS tree

  • procedure explore(G,v)
  • Input: graph G = (V,E); node v in V
  • Output: arrays visited[u], parent[u]
  • 1. visited[v] = true
  • 2. for each edge (v,u) in E do:
  •      if not visited[u]: parent[u] = v; explore(G,u)
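
A short Python sketch shows how the parent array yields actual paths. It assumes an adjacency-list dictionary. The helper `path_to` and the example edge set are hypothetical additions for illustration. Following parent pointers back from a target and reversing gives the tree path from the root.

```python
def dfs_tree(graph, root):
    """Return parent pointers of the DFS tree rooted at root."""
    parent = {root: None}          # doubles as the visited set

    def explore(v):
        for u in graph[v]:
            if u not in parent:
                parent[u] = v      # tree edge (v, u)
                explore(u)

    explore(root)
    return parent


def path_to(parent, target):
    """Rebuild the tree path root -> target, or None if unreachable."""
    if target not in parent:
        return None
    path = []
    while target is not None:
        path.append(target)
        target = parent[target]
    return path[::-1]


# Hypothetical example graph (edge set assumed for illustration).
example_graph = {
    "A": ["C"],
    "B": ["D", "A"],
    "C": ["B", "E", "F"],
    "D": [],
    "E": ["F", "G", "H"],
    "F": [],
    "G": ["H"],
    "H": [],
}

parent = dfs_tree(example_graph, "A")
print(path_to(parent, "H"))
```

The parent map stores one entry per reachable vertex, so all |V|-1 paths share O(|V|) space.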
slide-23
SLIDE 23

keeping track of paths

slide-24
SLIDE 24

DFS augmented with pre, post numbers

  • procedure explore(G,v)
  • Input: graph G = (V,E); node v in V; global counter count, which starts at 1
  • Output: arrays visited[u], parent[u], pre[u], post[u]
  • 1. visited[v] = true; pre[v] = count; count++
  • 2. for each edge (v,u) in E do:
  •      if not visited[u]: parent[u] = v; explore(G,u)
  • 3. post[v] = count; count++
slide-26
SLIDE 26

keeping track of paths

slide-27
SLIDE 27

Inferring relative position in tree

u is below v in the DFS tree (u is a descendant of v) iff pre(v) < pre(u) and post(u) < post(v). In this case, an edge from u to v creates a cycle.

u is to the right of v iff pre(v) < pre(u) and post(v) < post(u).

slide-28
SLIDE 28
  • Tree edge: solid edge included in the DFS output tree
  • Back edge: leads to an ancestor
  • Forward edge: leads to a descendant
  • Cross edge: leads to neither an ancestor nor a descendant; always from right to left
  • Note that a back edge is slightly different in directed and undirected graphs.

Edge types (directed graph)

slide-29
SLIDE 29

DFS on Directed Graphs

[Figure: DFS on the directed graph with vertices A through H, annotated step by step with pre and post numbers]

Vertex: pre, post
A: 1, 16
C: 2, 15
B: 3, 6
D: 4, 5
E: 7, 14
F: 8, 9
G: 10, 13
H: 11, 12

slide-30
SLIDE 30

The different types of edges can be determined from the pre/post numbers for the edge (u, v):

  • (u, v) is a tree/forward edge: pre(u) < pre(v) < post(v) < post(u)
  • (u, v) is a back edge: pre(v) < pre(u) < post(u) < post(v)
  • (u, v) is a cross edge: pre(v) < post(v) < pre(u) < post(u)

Edge types and pre/post numbers
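
The three interval tests can be sketched as a small Python function. The pre and post values below are the ones shown for the A through H example. The function name `classify_edge`, the self-loop case, and the pair (F, D) used to show a cross edge are assumptions for illustration, not edges confirmed by the slides.

```python
def classify_edge(u, v, pre, post):
    """Classify edge (u, v) from pre/post numbers of a completed DFS."""
    if u == v:
        return "back"                       # a self-loop is a trivial cycle
    if pre[u] < pre[v] < post[v] < post[u]:
        return "tree/forward"               # v is a descendant of u
    if pre[v] < pre[u] < post[u] < post[v]:
        return "back"                       # v is an ancestor of u
    if pre[v] < post[v] < pre[u] < post[u]:
        return "cross"                      # v finished before u started
    raise ValueError("numbers are not consistent with a DFS containing this edge")


# pre/post numbers from the A..H example.
pre = {"A": 1, "C": 2, "B": 3, "D": 4, "E": 7, "F": 8, "G": 10, "H": 11}
post = {"A": 16, "C": 15, "B": 6, "D": 5, "E": 14, "F": 9, "G": 13, "H": 12}

print(classify_edge("B", "A", pre, post))   # edge into an ancestor
print(classify_edge("E", "H", pre, post))   # edge into a descendant
print(classify_edge("F", "D", pre, post))   # hypothetical pair, for illustration
```

Tree and forward edges satisfy the same inequalities, so telling them apart needs extra information such as the parent array.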

slide-31
SLIDE 31
slide-32
SLIDE 32
  • A cycle in a directed graph is a path that starts and ends with the same vertex: v_0 → v_1 → v_2 → ⋯ → v_k → v_0

Example: A → C → E → A

Cycles in Directed Graphs

slide-33
SLIDE 33

Proof (→): Suppose G has a cycle: v_0 → v_1 → v_2 → ⋯ → v_k → v_0

A directed graph has a directed cycle iff its dfs output tree has a back edge

slide-34
SLIDE 34

Proof (→): Suppose G has a cycle: v_0 → v_1 → v_2 → ⋯ → v_k → v_0. Suppose v_0 is the first vertex to be discovered. (What does that mean about v_0?)

A directed graph has a directed cycle iff its dfs output tree has a back edge

slide-35
SLIDE 35

Proof (→): Suppose G has a cycle: v_0 → v_1 → v_2 → ⋯ → v_k → v_0. Suppose v_0 is the first vertex to be discovered (the vertex with the lowest pre-number). All other v_i are reachable from it and therefore they are all descendants in the DFS tree.

A directed graph has a directed cycle iff its dfs output tree has a back edge

slide-36
SLIDE 36

Proof (→): Suppose G has a cycle: v_0 → v_1 → v_2 → ⋯ → v_k → v_0. Suppose v_0 is the first vertex to be discovered (the vertex with the lowest pre-number). All other v_i are reachable from it and therefore they are all descendants in the DFS tree. Therefore the edge (v_k, v_0) is a back edge.

A directed graph has a directed cycle iff its dfs output tree has a back edge

slide-37
SLIDE 37

Proof (←): Suppose (b, a) is a back edge.

A directed graph has a directed cycle iff its dfs output tree has a back edge

slide-38
SLIDE 38

Proof (←): Suppose (b, a) is a back edge. Then by definition a is an ancestor of b, so there is a path from a to b in the DFS output tree.

A directed graph has a directed cycle iff its dfs output tree has a back edge

slide-39
SLIDE 39

Proof (←): Suppose (b, a) is a back edge. Then by definition a is an ancestor of b, so there is a path from a to b in the DFS output tree. Along with the back edge, this path completes a cycle.

A directed graph has a directed cycle iff its dfs output tree has a back edge

slide-40
SLIDE 40
  • A directed graph without a cycle is called acyclic. (DAG)

Corollary: A directed graph G is a DAG if and only if its DFS output tree does not have any back edges.

Directed Acyclic Graphs (DAG)

slide-41
SLIDE 41

Step 1: perform DFS on the graph
Step 2: loop through each edge to see if it is a back edge, i.e.:
  for each edge (u,v):
    if pre(v) < pre(u) < post(u) < post(v): return "not DAG"
  return "DAG"

How to spot a DAG?
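
The two steps can be sketched in Python as follows, assuming an adjacency-list dictionary with every vertex as a key. The explicit self-loop check is an addition: the strict inequalities alone would not flag an edge (u, u). The example graphs are hypothetical.

```python
def is_dag(graph):
    """Return True iff the directed graph has no back edge (no cycle)."""
    pre, post = {}, {}
    clock = 1

    def explore(v):
        nonlocal clock
        pre[v] = clock
        clock += 1
        for u in graph[v]:
            if u not in pre:           # pre doubles as the visited set
                explore(u)
        post[v] = clock
        clock += 1

    # Step 1: complete DFS with pre/post numbers.
    for v in graph:
        if v not in pre:
            explore(v)

    # Step 2: look for a back edge.
    for u in graph:
        for v in graph[u]:
            if u == v or pre[v] < pre[u] < post[u] < post[v]:
                return False
    return True


# Hypothetical example graphs (edge sets assumed for illustration).
cyclic = {"A": ["C"], "B": ["D", "A"], "C": ["B"], "D": []}
acyclic = {"A": ["C"], "B": ["D"], "C": ["B"], "D": []}

print(is_dag(cyclic), is_dag(acyclic))
```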

slide-42
SLIDE 42
  • Is it possible to order the vertices such that all edges go in only one direction?

  • For what types of DAGs is this possible?
  • How do we find such an ordering?

Linearization aka Topological Sort

slide-43
SLIDE 43

Theorem: every edge in a DAG goes from a higher post number to lower post number.

Property of DAGS

slide-44
SLIDE 44

Theorem: every edge in a DAG goes from a higher post number to a lower post number.

Proof: suppose (u,v) is an edge in a DAG. Then it can't be a back edge, so it can only be a tree edge, a forward edge or a cross edge, all of which have the property that post(v) < post(u).

Corollary: Sorting by decreasing post numbers is a topological sort.

Property of DAGS

slide-45
SLIDE 45

Linearization of a DAG: Since we know that edges go in the direction of decreasing post numbers, if we order the vertices by decreasing post numbers then we will have a linearization.

procedure linearize(DAG G = (V,E))
  run DFS(G)
  return list of vertices in decreasing order of post numbers
  (by putting each vertex at the start of the list when its post number is assigned)

Property of DAGS
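
A possible Python sketch of linearize is shown here, assuming the input is a DAG stored as an adjacency-list dictionary. Instead of storing post numbers, it appends each vertex when it finishes and reverses the list at the end, which gives the same decreasing-post order. The example DAG is hypothetical.

```python
def linearize(dag):
    """Return the vertices of a DAG in decreasing order of post number."""
    visited = set()
    finished = []                      # vertices in increasing post order

    def explore(v):
        visited.add(v)
        for u in dag[v]:
            if u not in visited:
                explore(u)
        finished.append(v)             # the moment post(v) would be assigned

    for v in dag:
        if v not in visited:
            explore(v)
    finished.reverse()                 # decreasing post order
    return finished


# Hypothetical example DAG (edge set assumed for illustration).
example_dag = {
    "A": ["C"],
    "B": ["D"],
    "C": ["B", "E", "F"],
    "D": [],
    "E": ["F", "G", "H"],
    "F": [],
    "G": ["H"],
    "H": [],
}

order = linearize(example_dag)
print(order)
```

Appending and reversing once is a common alternative to inserting at the front of the list, which would cost O(|V|) per insertion with a Python list.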

slide-46
SLIDE 46
  • Since all DAGs can be linearized, that means the first vertex in the ordering does not have any edges coming in and the last vertex does not have any edges going out.
  • Definitions:
  • A vertex with no incoming edges is called a source
  • A vertex with no outgoing edges is called a sink
  • Theorem: All DAGs have at least one source and one sink.

Sources and sinks

slide-47
SLIDE 47

Strongly connected vertices

Two vertices u and v in a directed graph are strongly connected if there exists a path from u to v and a path from v to u.

Which vertices are strongly connected to J?

[Figure: directed graph on vertices A through M]

slide-48
SLIDE 48

Strongly connected vertices

Two vertices u and v in a directed graph are strongly connected if there exists a path from u to v and a path from v to u.

Which vertices are strongly connected to J? J, K, L, M

[Figure: directed graph on vertices A through M, with J, K, L, M highlighted]

slide-49
SLIDE 49

Strongly connected Graph

A graph is called strongly connected if for each pair of vertices v, u there is a path from v to u and a path from u to v.

Is this a strongly connected graph?

[Figure: directed graph on vertices A through M]

slide-50
SLIDE 50

Consider the relation u R v if u is strongly connected to v. Then R is an equivalence relation: it is reflexive, symmetric and transitive. So R partitions V, the set of vertices, into equivalence classes. These equivalence classes are called strongly connected components.

Strongly connected components

slide-51
SLIDE 51

Strongly connected component

What are the strongly connected components of this graph?

[Figure: directed graph on vertices A through M]

slide-52
SLIDE 52

Strongly connected components as vertices (meta-graph)

Meta-graph vertices: {A}, {B, C, F, I}, {D}, {G}, {J, K, L, M}, {E}, {H}

[Figure: the directed graph on vertices A through M and its meta-graph]

slide-53
SLIDE 53

Directed Graphs as DAGs of SCCs

Every Directed graph is a DAG of its strongly connected components. Some SCCs are sink SCCs and some are source SCCs.

slide-54
SLIDE 54

There is a linear time algorithm that decomposes a directed graph into its strongly connected components.

Decomposition

If explore is performed on a vertex u, then it will visit only the vertices that are reachable from u. What vertices will be visited when explore is performed on u if u is in a sink SCC?

slide-55
SLIDE 55

Sink SCCs

If explore is performed on a vertex that is in a sink SCC, then only the vertices from that SCC will be visited.

This suggests a way to look for SCCs.

  • Start explore on a vertex in a sink SCC and visit its SCC.
  • Remove the sink SCC from the graph and repeat.
slide-56
SLIDE 56

Ideally we would like to find a vertex in a sink SCC. Unfortunately, there is not a direct way to do this.

Source SCCs

slide-57
SLIDE 57

However, there is a way to find a vertex in a source SCC.

Source SCCs

The vertex with the greatest post number in any DFS output tree belongs to a source SCC. The vertex with the least post number in a dfs output does not necessarily belong to a sink SCC.

slide-58
SLIDE 58

Example of low post number not in a sink.

A C B D E F G

slide-59
SLIDE 59

The vertex with the greatest post number in any DFS output tree belongs to a source SCC.

To prove this, we will state a more general property: If C and C′ are strongly connected components and there is an edge from a vertex in C to a vertex in C′, then the highest post number in C is greater than the highest post number in C′.

Vertices in Source SCCs

slide-60
SLIDE 60

The vertex with the greatest post number in any DFS output tree belongs to a source SCC.

To prove this, we will state a more general property: If C and C′ are strongly connected components and there is an edge from a vertex in C to a vertex in C′, then the highest post number in C is greater than the highest post number in C′.

Vertices in Source SCCs

slide-61
SLIDE 61

Case 1: DFS searches C before C′. Then at some point DFS will cross into C′ and visit every vertex in C′, assigning each a post number. Then it will retrace its steps until it gets back to the first node in C it started with, and assign it the highest post number.

Proof

[Figure: components C and C′]

slide-62
SLIDE 62

Case 2: DFS searches C′ before C. Then DFS will visit all vertices of C′ before getting stuck, and assign a post number to all vertices of C′. Only later will it visit some vertex of C and assign post numbers to the vertices of C, which are therefore higher.

Proof

[Figure: components C and C′]

slide-63
SLIDE 63

The strongly connected components can be linearized by arranging them in decreasing order of their highest post numbers.

Corollary

slide-64
SLIDE 64

Given a graph G, let G^R be the reverse graph of G. Then the source SCCs of G^R are the sink SCCs of G. So if we perform DFS on G^R, the vertex with the highest post number is in a source SCC of G^R. This means that this vertex is in a sink SCC of G. So start with this vertex and explore its SCC. Then the unvisited vertex with the next greatest post number in G^R is in the next SCC in linear order, so start with that one next.

How to find sink SCCs

slide-65
SLIDE 65
  • Construct G^R.
  • Run DFS on G^R and keep track of the post numbers.
  • Run DFS on G, processing the vertices in decreasing order of the post numbers from the previous step.

Every time DFS increments cc, you have found a new SCC!

How to decompose a graph into its SCCs:
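
The three steps can be sketched in Python along these lines, assuming an adjacency-list dictionary in which every vertex appears as a key. The function name, the `component` map and the small example graph are assumptions for illustration. The finish order from the first pass stands in for explicit post numbers.

```python
def strongly_connected_components(graph):
    """Map each vertex to an SCC number (1, 2, ...), found sink-first."""
    # Step 1: construct the reverse graph G^R.
    reverse = {v: [] for v in graph}
    for u in graph:
        for v in graph[u]:
            reverse[v].append(u)

    # Step 2: DFS on G^R, recording vertices in increasing post order.
    visited = set()
    finished = []

    def explore_reverse(v):
        visited.add(v)
        for u in reverse[v]:
            if u not in visited:
                explore_reverse(u)
        finished.append(v)

    for v in reverse:
        if v not in visited:
            explore_reverse(v)

    # Step 3: DFS on G in decreasing post order; each new cc is one SCC.
    component = {}
    cc = 0

    def explore(v):
        component[v] = cc
        for u in graph[v]:
            if u not in component:
                explore(u)

    for v in reversed(finished):
        if v not in component:
            cc += 1
            explore(v)
    return component


# Hypothetical example graph (edge set assumed for illustration).
example_graph = {
    "A": ["B"], "B": ["C"], "C": ["A", "D"],
    "D": ["E"], "E": ["D"], "F": [],
}

print(strongly_connected_components(example_graph))
```

Each pass is a single DFS, and building the reverse graph touches each edge once, so the whole sketch runs in O(|V| + |E|) time.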

slide-66
SLIDE 66

A B C D E F G H I J K L M

slide-67
SLIDE 67

[Figure: step-by-step DFS on the example graph with vertices A through M, annotated with pre and post numbers 1 through 26]

slide-68
SLIDE 68

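[Figure: DFS on the example graph with vertices A through M, processed in the order below]

Vertex order: B, F, I, H, C, D, G, J, K, M, L, E, A

cc = 1 cc = 2 cc = 3 cc = 4 cc = 5 cc = 6

Components found, in order: {B, C, F, I}, {H}, {D}, {G}, {J, K, L, M}, {E}, {A}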

slide-69
SLIDE 69
  • Run DFS on G^R and keep track of the post numbers.
  • Run DFS on G, processing the vertices in decreasing order of the post numbers from the previous step.

Every time DFS increments cc, you have found a new SCC!

How long does this take? I claim it is linear time for each step, and so it is linear time in general.

How to decompose a graph into its SCCs:

slide-70
SLIDE 70

DFS is good for

slide-71
SLIDE 71
  • Find what vertices can be reached from a given vertex
  • Divide an undirected graph into connected components
  • Find cycles in graphs (directed or undirected)
  • Find sinks and sources in DAGs
  • Topologically sort a DAG
  • Make a directed graph into a DAG of its SCCs

DFS is good for

slide-72
SLIDE 72

DFS is not good for

slide-73
SLIDE 73
  • Finding shortest distances between vertices.

DFS is not good for