slide-1
SLIDE 1

Falcon: A Graph Manipulation Language for Heterogeneous Systems

Unnikrishnan C, IISc, Bangalore
Rupesh Nasre, IIT, Madras
Y N Srikant, IISc, Bangalore
January 20, 2016

slide-6
SLIDE 6

Introduction Experimental Results Conclusion backup slides Introduction Language-Data Types and Constructs

1 Falcon is a Domain Specific Language (DSL) for writing graph algorithms.
2 Falcon
  i) extends the C programming language.
  ii) provides additional data types for graph processing.
  iii) provides constructs for writing explicitly parallel graph algorithms.
3 Supports heterogeneous backends (CPU and GPU).
4 Supports parallel execution of different algorithms on multiple devices.
5 Supports partitioning of Graph objects and execution of a single algorithm using multiple devices. Used when the graph object does not fit in a single device.
6 Supports mutation of Graph objects.
7 Allows viewing a Graph in different ways (say, as a collection of triangles).

slide-7
SLIDE 7

Language constructs for parallelization and synchronization in Falcon

single(t1) {stmt block1} else {stmt block2}
    The thread that gets a lock on item t1 executes stmt block1 and other threads execute stmt block2.
single(coll) {stmt block1} else {stmt block2}
    The thread that gets a lock on all elements in the collection executes stmt block1 and others execute stmt block2.

Table 1. single statement (synchronization) in Falcon

Data Type    Iterator    Description
Graph        points      iterate over all points in graph
Graph        edges       iterate over all edges in graph
Graph        pptyname    iterate over all elements in new property
Point        nbrs        iterate over all neighboring points
Point        outnbrs     iterate over dst point of outgoing edges (Directed Graph)
Edge         nbrs        iterate over neighbor edges
Set          item        iterate over all items in Set
Collection   item        iterate over all items in Collection

Table 2. Iterators for foreach (parallelization) statement in Falcon

parallel sections: for multiple parallel regions on different devices.
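The semantics of the single statement in Table 1 can be pictured with ordinary threads. The sketch below is only an illustration in Python, not Falcon's implementation: the names Item, try_claim, try_claim_all and run_single are hypothetical, and it assumes a claim on one item lasts for the whole parallel region.

```python
import threading


class Item:
    """A graph element that at most one thread may claim (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()

    def try_claim(self):
        # Non-blocking acquire: returns True for exactly one caller while the
        # claim is held, and False for every other caller.
        return self._lock.acquire(blocking=False)

    def release(self):
        self._lock.release()


def try_claim_all(items):
    """Claim every item or none of them, mirroring single(coll)."""
    claimed = []
    for item in items:
        if item.try_claim():
            claimed.append(item)
        else:
            # Back out so that a failed attempt leaves no item locked.
            for held in claimed:
                held.release()
            return False
    return True


def run_single(item, num_threads):
    """Run num_threads workers through: single(item) {block1} else {block2}."""
    results = [None] * num_threads

    def worker(tid):
        if item.try_claim():
            results[tid] = "block1"  # the one thread that got the lock
        else:
            results[tid] = "block2"  # all other threads

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the claim in run_single is never released, exactly one worker takes the first branch regardless of scheduling.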

slide-8
SLIDE 8

After the last session of HiPEAC, you want to reach home through the shortest path?!

    int <GPU> changed = 0;  // variable on GPU

    // Relax every outgoing edge of p; MIN sets changed when t.dist is lowered.
    relaxgraph(Point <GPU>p, Graph <GPU>graph) {
        foreach (t In p.outnbrs)
            MIN(t.dist, p.dist + graph.getweight(p, t), changed);
    }

    main(int argc, char *argv[]) {
        Graph hgraph;                         // graph on CPU
        hgraph.addPointProperty(dist, int);
        hgraph.getType() <GPU>graph;          // graph on GPU
        hgraph.read(argv[1]);                 // read graph on CPU
        graph = hgraph;                       // copy graph to GPU
        foreach (t In graph.points) t.dist = MAX_INT;  // infinity
        graph.points[0].dist = 0;             // source has dist 0
        while (1) {
            changed = 0;
            foreach (t In graph.points) relaxgraph(t, graph);
            if (changed == 0) break;          // reached fixed point
        }
        for (int i = 0; i < graph.npoints; ++i)
            printf("i=%d dist=%d\n", i, graph.points[i].dist);
    }

A SIMPLE EXAMPLE (figure): graph over PRG, CDG, FRA and DEL with edge weights 100, 75, 400 and 6. Distance values shown across iterations: INF INF INF, then 100 75 INF, then 100 75 500.
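For readers without a Falcon compiler, the same fixed-point relaxation can be followed in a sequential Python sketch. It assumes that MIN(a, b, flag) lowers a to b when b is smaller and sets flag when it does so. The names sssp and relax_min and the four-point test graph are made up for illustration and are not the graph from the slide.

```python
INF = 2**31 - 1  # stand-in for MAX_INT


def relax_min(dist, t, candidate, flags):
    """Sequential stand-in for MIN(t.dist, candidate, changed)."""
    if candidate < dist[t]:
        dist[t] = candidate
        flags["changed"] = 1


def sssp(num_points, edges, source=0):
    """Shortest distances from source; edges are (src, dst, weight) triples."""
    # Outgoing-neighbor lists, playing the role of p.outnbrs.
    out = {p: [] for p in range(num_points)}
    for u, v, w in edges:
        out[u].append((v, w))

    dist = [INF] * num_points
    dist[source] = 0
    flags = {"changed": 0}

    while True:
        flags["changed"] = 0
        for p in range(num_points):        # foreach (t In graph.points)
            if dist[p] == INF:
                continue                   # unreached point: nothing to relax
            for t, w in out[p]:            # foreach (t In p.outnbrs)
                relax_min(dist, t, dist[p] + w, flags)
        if flags["changed"] == 0:          # reached fixed point
            break
    return dist


if __name__ == "__main__":
    # Hypothetical graph: 0->1 (4), 0->2 (1), 2->1 (2), 1->3 (5)
    print(sssp(4, [(0, 1, 4), (0, 2, 1), (2, 1, 2), (1, 3, 5)]))
```

The loop stops on the first pass that lowers no distance, which is the same termination test as the changed == 0 check in the Falcon program.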

slide-14
SLIDE 14

Falcon Compiler Code Generation (Synchronization and parallelization constructs)

Input: Falcon DSL code. The compiler checks for Falcon synchronization / parallel constructs:

single statement
    on one item: convert to Compare And Swap (CAS) based code
    on a collection: convert to code with a barrier for the entire parallel region
parallel sections statement
    convert to OpenMP parallel sections
foreach statement
    on a Collection (yes), if outermost foreach statement:
        GPU: convert to CUDA kernel call with Thrust library
        CPU: convert to parallel code using Galois Worklist
    not on a Collection (no), if outermost foreach statement:
        GPU: convert to CUDA kernel call
        CPU: convert to OpenMP pragma
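The code-generation choices on this slide amount to a small decision table. The Python sketch below encodes that table as a lookup. It is not the compiler's code: the function name lower_construct and its parameters are hypothetical, and it assumes the "yes" branch of the On Collection test is the one that leads to Thrust (GPU) and the Galois Worklist (CPU).

```python
def lower_construct(construct, device=None, on_collection=False, outermost=True):
    """Return the target the slide names for a Falcon construct.

    construct:     "single", "parallel sections" or "foreach"
    device:        "GPU" or "CPU" (only consulted for foreach)
    on_collection: whether the construct operates on a Collection
    outermost:     whether a foreach is the outermost one
    """
    if construct == "single":
        if on_collection:
            return "code with barrier for entire parallel region"
        return "Compare And Swap (CAS) based code"

    if construct == "parallel sections":
        return "OpenMP parallel sections"

    if construct == "foreach":
        if not outermost:
            return None  # the slide only describes the outermost foreach
        if device not in ("GPU", "CPU"):
            raise ValueError("device must be 'GPU' or 'CPU'")
        table = {
            ("GPU", True): "CUDA kernel call with Thrust library",
            ("CPU", True): "parallel code using Galois Worklist",
            ("GPU", False): "CUDA kernel call",
            ("CPU", False): "OpenMP pragma",
        }
        return table[(device, on_collection)]

    raise ValueError(f"unknown construct: {construct!r}")
```

For example, lower_construct("foreach", device="CPU") returns "OpenMP pragma", the non-Collection CPU path.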

slide-36
SLIDE 36

Introduction Experimental Results Conclusion backup slides GPUs and MultiGPUs CPU Dynamic(Morph) Algorithms Heterogeneous execution

1 Using the Falcon compiler, we wrote algorithms such as BFS, SSSP and Boruvka's MST.

2 We also wrote dynamic algorithms in Falcon: Survey Propagation (SP), Delaunay Mesh Refinement (DMR) and Dynamic-SSSP.

3 Performance of the Falcon codes was compared with

i) LonestarGPU (ISS group, University of Texas at Austin)
ii) Galois (ISS group, University of Texas at Austin)
iii) Green-Marl DSL (PP Laboratory, Stanford University)
iv) Totem (NetSysLab, University of British Columbia)

4 Totem: for comparing performance on CPU, GPU and heterogeneous execution.

5 Galois and Green-Marl: for comparing performance on CPU.

6 LonestarGPU: for comparing performance on GPU.

7 We obtained performance close to, and sometimes better than, the above systems.

8 Tested on a machine with a 12-core CPU and 4 GPUs (1 Kepler and 3 Tesla).

slide-37
SLIDE 37

Input graphs (Points, Edges): rand1 (16M, 64M), rand2 (32M, 128M), rmat1 (10M, 100M), rmat2 (20M, 200M), road1 (14M, 34M), road2 (23M, 58M)

[Figure: Speedup of SSSP on GPU. Bars: LonestarGPU (baseline), Falcon-GPU, Totem-GPU. x-axis: rand1, rand2, rmat1, rmat2, road1, road2. y-axis: Speedup.]

[Figure: Speedup of SSSP, BFS and MST in parallel on 3 GPUs. Bars: LonestarGPU (baseline), Falcon-MultiGPU. x-axis: rand1, rand2, rmat1, rmat2, road1, road2. y-axis: Speedup.]

slide-38
SLIDE 38

[Figure (a): SSSP speedup over Galois single. Bars: Galois-12, Falcon-12, Totem-12, GreenMarl-12. x-axis: rand1, rand2, rmat1, rmat2, road1, road2. y-axis: Speedup.]

[Figure (b): BFS speedup over Galois single. Bars: Galois-12, Falcon-12, Totem-12, GreenMarl-12. x-axis: rand1, rand2, rmat1, rmat2, road1, road2. y-axis: Speedup.]

slide-42
SLIDE 42

1 single on a Collection and barrier on GPU.

2 Shucai Xiao and Wu-chun Feng. Inter-Block GPU Communication via Fast Barrier Synchronization. IPDPS 2010.

3 Importance of a good runtime for Collection/Worklist.

4 Ulrich Meyer and Peter Sanders. Delta-Stepping: A Parallel Single Source Shortest Path Algorithm. ESA 1998.
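The slide cites delta-stepping as the kind of worklist-driven algorithm for which the runtime matters, without describing it. As a rough illustration only, the sequential Python sketch below shows the bucket idea behind it: points are kept in buckets of width delta, light edges (weight at most delta) are relaxed repeatedly within the current bucket, and heavy edges once afterwards. The function name `delta_stepping` and its parameters are assumptions for this example; it is not the worklist implementation used by Falcon, Galois or any other system named here.

```python
import math


def delta_stepping(npoints, edges, source, delta):
    """Sequential sketch of delta-stepping SSSP for non-negative weights.

    edges is a list of (src, dst, weight) triples.
    """
    adj = [[] for _ in range(npoints)]
    for u, v, w in edges:
        adj[u].append((v, w))
    dist = [math.inf] * npoints
    buckets = {}  # bucket number -> set of points waiting in that bucket

    def relax(v, d):
        if d < dist[v]:
            if dist[v] < math.inf:
                # Move v out of the bucket of its old tentative distance.
                buckets.get(int(dist[v] // delta), set()).discard(v)
            dist[v] = d
            buckets.setdefault(int(d // delta), set()).add(v)

    relax(source, 0)
    while buckets:
        i = min(buckets)  # smallest bucket still present
        settled = set()
        # Light-edge phase: repeat until bucket i receives no more points.
        while buckets.get(i):
            frontier = buckets.pop(i)
            settled |= frontier
            for u in frontier:
                for v, w in adj[u]:
                    if w <= delta:
                        relax(v, dist[u] + w)
        buckets.pop(i, None)  # drop the bucket if it was left empty
        # Heavy edges are relaxed once per settled point.
        for u in settled:
            for v, w in adj[u]:
                if w > delta:
                    relax(v, dist[u] + w)
    return dist
```

The choice of delta trades extra re-relaxations (large delta) against more, smaller buckets (small delta).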

slide-43
SLIDE 43

[Figure (a): DMR speedup over LonestarGPU. Bars: LonestarGPU (baseline), Falcon-GPU. y-axis: Speedup.]

[Figure (b): DMR speedup over Galois single. Bars: Galois-12, Falcon-12. y-axis: Speedup.]

[Figure (c): Dynamic-SSSP self-relative speedup. Bars: Falcon-GPU, LonestarGPU, Falcon-CPU. x-axis: rand1, rand2, rmat1, rmat2, road1, road2.]

Morph algorithm results: speedup for DMR (Delaunay Mesh Refinement) and Dynamic-SSSP

slide-44
SLIDE 44

[Figure: Speedup on two GPUs. x-axis: rand64, rmat50, rmat60. y-axis: Speedup.]

[Figure: Speedup on two GPUs and one CPU. x-axis: rand128, rmat80. y-axis: Speedup.]

Legend: Falcon-sssp, Totem-sssp, Falcon-bfs, Totem-bfs

Heterogeneous execution: SSSP and BFS speedup

slide-45
SLIDE 45

Partitioned Execution of one Algorithm on Multiple Devices

    // update function: merge an incoming copy of a point into the original
    fun1(Point orig, Point incom) {
      if (orig.dist > incom.dist)
        orig.dist = incom.dist;
    }
    // relax every out-edge of p; MIN also sets changed[0] when a distance drops
    relaxgraph(Point p, HGraph hgraph) {
      foreach (t In p.outnbrs)
        MIN(t.dist, p.dist + hgraph.getWeight(p, t), hgraph.changed[0]);
    }
    main(int argc, char *argv[]) {
      HGraph hgraph;
      hgraph.addPointProperty(dist, int);
      hgraph.addProperty(changed, int);
      hgraph.read(argv[1]);
      hgraph.makePartition(1, 1, ORDERED);
      hgraph.updateFunction(fun1);
      foreach (t In hgraph.points) t.dist = 1234567890;
      hgraph.points[0].dist = 0;
      while (1) {
        hgraph.changed[0] = 0;
        foreach (t In hgraph.points) relaxgraph(t, hgraph);
        if (hgraph.changed[0] == 0) break;
        hgraph.updatePartition();  // exchange and merge copies using fun1
      }
      for (int i = 0; i < hgraph.npoints; i++)
        printf("%d\n", hgraph.points[i].dist);
    }

A SIMPLE EXAMPLE

[Figure: graph with points PRG, CDG, DEL and FRA; edge labels 100, 75, 400, 6]

Distances held by each partition, step by step:

    Part1(PRG, CDG)         Part2(FRA, DEL)
    0, INF, INF, INF        0, INF, INF, INF
    0, 100, 75, INF         0, INF, INF, INF
    0, 100, 75, INF         0, 100, 75, INF
    0, 100, 75, 500         0, 100, 75, 675
    0, 100, 75, 500         0, 100, 75, 500
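The step-by-step distances in the example can be reproduced with a small simulation. The Python sketch below is an assumption-laden illustration, not Falcon's runtime: each partition keeps a full copy of the distances, relaxes only the out-edges of the points it owns, and the copies are then merged with a min-based update function in the role of `fun1`. The edge directions, the column order (PRG, CDG, DEL, FRA) and the DEL to FRA weight of 600 are assumptions chosen so that the numbers match the table (675 = 75 + 600); the slide itself only shows the labels 100, 75, 400 and 6.

```python
INF = float("inf")


def update_min(orig, incom):
    """Counterpart of fun1: keep the smaller of two copies of a distance."""
    return incom if orig > incom else orig


def partitioned_sssp(points, edges, parts, source):
    """Simulate SSSP where every partition holds its own copy of dist."""
    copies = [{p: INF for p in points} for _ in parts]
    for dist in copies:
        dist[source] = 0

    def snapshot():
        return tuple([dist[p] for p in points] for dist in copies)

    trace = [snapshot()]
    while True:
        changed = False
        for owned, dist in zip(parts, copies):
            old = dict(dist)  # all points read the values from the round start
            for p in owned:
                for t, w in edges.get(p, []):
                    if old[p] + w < dist[t]:
                        dist[t] = old[p] + w
                        changed = True
        if not changed:
            break
        trace.append(snapshot())
        # Role of updatePartition(): merge the copies point by point.
        for p in points:
            best = copies[0][p]
            for dist in copies[1:]:
                best = update_min(best, dist[p])
            for dist in copies:
                dist[p] = best
        trace.append(snapshot())
    return copies[0], trace


# Assumed example graph (see the lead-in for which values are assumptions).
points = ["PRG", "CDG", "DEL", "FRA"]
edges = {"PRG": [("CDG", 100), ("DEL", 75)],
         "CDG": [("FRA", 400)],
         "DEL": [("FRA", 600)]}
parts = [["PRG", "CDG"], ["FRA", "DEL"]]
final, trace = partitioned_sssp(points, edges, parts, "PRG")
```

Under these assumptions `trace` holds five rows, one per row of the table: the two partitions disagree on FRA (500 versus 675) until the last merge picks the minimum.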

slide-55
SLIDE 55

1 We have introduced a new DSL for graph algorithms which targets heterogeneous architectures.

2 The programmer does not have to worry about the target architecture or about thread and memory management.

3 Future work

i) Extending it to CPU clusters.
ii) Making the DSL simpler (say, removing the <GPU> tag).
iii) Supporting non-Nvidia GPUs by providing an OpenCL backend.
iv) Adding optimizations to the compiler.

4 For queries, email me at unni c@csa.iisc.ernet.in.

5 Some sample autogenerated code can be downloaded from my home page (http://clweb.csa.iisc.ernet.in/unni c).

6 The Falcon compiler will be made available online.

slide-56
SLIDE 56

Questions??

slide-57
SLIDE 57

THANK YOU

DĚKUJI

slide-58
SLIDE 58

Code generated for relaxgraph function of SSSP example

Falcon code (GPU):

    relaxgraph(Point <GPU>p, Graph <GPU>graph) {
      foreach (t In p.outnbrs)
        MIN(t.dist, p.dist + graph.getweight(p, t), changed);
    }

Code generated for GPU:

    #define t (((struct hgraph *)(graph.extra)))
    __global__ void relaxgraph(GGraph graph, int x) {
      int id = blockIdx.x * blockDim.x + threadIdx.x + x;  // point handled by this thread
      if (id < graph.npoints) {
        int falcft0 = graph.index[id];                       // offset of the first out-edge
        int falcft1 = graph.index[id + 1] - graph.index[id]; // number of out-edges
        for (int falcft2 = 0; falcft2 < falcft1; falcft2++) {
          int ut0 = 2 * (falcft0 + falcft2);   // edge index
          int ut1 = graph.edges[ut0].ipe;      // dest point
          int ut2 = graph.edges[ut0 + 1].ipe;  // value added to the distance (edge weight)
          GMIN(&t->dist[ut1], t->dist[id] + ut2, changed);
        }
      }
    }

Falcon code (CPU):

    relaxgraph(Point p, Graph graph) {
      foreach (t In p.outnbrs)
        MIN(t.dist, p.dist + graph.getweight(p, t), changed);
    }

Code generated for CPU:

    #define t (((struct hgraph *)(graph.extra)))
    void relaxgraph(int &p, HGraph &graph) {
      int falcft0 = graph.index[p];                      // offset of the first out-edge
      int falcft1 = graph.index[p + 1] - graph.index[p]; // number of out-edges
      for (int falcft2 = 0; falcft2 < falcft1; falcft2++) {
        int ut0 = 2 * (falcft0 + falcft2);   // edge index
        int ut1 = graph.edges[ut0].ipe;      // dest point
        int ut2 = graph.edges[ut0 + 1].ipe;  // value added to the distance (edge weight)
        HMIN(&t->dist[ut1], t->dist[p] + ut2, ut1, changed);
      }
    }
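The generated code walks a compressed adjacency layout: `index` gives each point's first out-edge and the `edges` array stores two entries per edge. The Python sketch below mirrors that traversal sequentially, assuming (from how the values are used above) that the first entry of a pair is the destination and the second is the weight. The names `relax_point` and `sssp` are hypothetical, and the plain comparison stands in for the atomic GMIN/HMIN operations, whose definitions are not shown on the slide.

```python
def relax_point(p, index, edges, dist):
    """Relax all out-edges of point p; return True if any distance dropped.

    index: offsets into the edge list, one per point plus a final sentinel.
    edges: flat list with two entries per edge, destination then weight,
           mirroring edges[2*e] and edges[2*e + 1] in the generated code.
    """
    changed = False
    first = index[p]                # plays the role of falcft0
    count = index[p + 1] - first    # plays the role of falcft1
    for k in range(count):
        e = 2 * (first + k)         # plays the role of ut0
        dest, weight = edges[e], edges[e + 1]
        if dist[p] + weight < dist[dest]:
            dist[dest] = dist[p] + weight
            changed = True
    return changed


def sssp(index, edges, source=0):
    """Repeat the relaxation over all points until nothing changes."""
    npoints = len(index) - 1
    dist = [1234567890] * npoints   # same 'infinity' constant as the slides
    dist[source] = 0
    while True:
        changed = False
        for p in range(npoints):
            if relax_point(p, index, edges, dist):
                changed = True
        if not changed:
            return dist
```

On the GPU each value of `p` would be handled by its own thread, which is why the generated code needs an atomic minimum rather than the plain comparison used here.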

slide-59
SLIDE 59

Falcon code for BFS on GPU

    int <GPU>changed=0, lev=0;
    BFS(Point <GPU>p, Graph <GPU>graph, int lev) {
      foreach (t In p.outnbrs) {
        if (t.dist > (lev+1)) {
          t.dist = lev+1; changed = 1;
        }
      }
    }
    main(int argc, char *name[]) {
      Graph hgraph;
      hgraph.addPointProperty(dist, int);
      hgraph.read(name[1]);
      hgraph.getType() <GPU>graph;  // GPU graph of the same type as hgraph
      graph = hgraph;               // copy the graph to the GPU
      foreach (t In graph.points) t.dist = 1234567890;
      graph.points[0].dist = 0;
      while (1) {
        changed = 0;
        foreach (t In graph.points) (t.dist==lev) BFS(t, graph, lev);
        if (changed == 0) break;
        lev++;
      }
      for (int i = 0; i < graph.npoints; i++)
        printf("%d\n", graph.points[i].dist);
    }

Falcon code for BFS on CPU

    int changed=0, lev=0;
    BFS(Point p, Graph graph, int lev) {
      foreach (t In p.outnbrs) {
        if (t.dist > (lev+1)) {
          t.dist = lev+1; changed = 1;
        }
      }
    }
    main(int argc, char *name[]) {
      Graph hgraph;
      hgraph.addPointProperty(dist, int);
      hgraph.read(name[1]);
      foreach (t In hgraph.points) t.dist = 1234567890;
      hgraph.points[0].dist = 0;
      while (1) {
        changed = 0;
        foreach (t In hgraph.points) (t.dist==lev) BFS(t, hgraph, lev);
        if (changed == 0) break;
        lev++;
      }
      for (int i = 0; i < hgraph.npoints; i++)
        printf("%d\n", hgraph.points[i].dist);
    }
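For readers who want to trace the level-by-level logic without a Falcon toolchain, the following Python sketch follows the same structure sequentially: only points whose distance equals the current level are expanded, and the loop stops when a pass changes nothing. The function name `bfs_levels` and the adjacency-list input are assumptions for this example; the parallel foreach is replaced by an ordinary loop.

```python
def bfs_levels(adj, source=0):
    """Level-synchronous BFS over a list of out-neighbour lists."""
    unreached = 1234567890          # same initial value as the Falcon code
    dist = [unreached] * len(adj)
    dist[source] = 0
    lev = 0
    while True:
        changed = False
        for p in range(len(adj)):
            if dist[p] == lev:      # the foreach condition (t.dist==lev)
                for t in adj[p]:
                    if dist[t] > lev + 1:
                        dist[t] = lev + 1
                        changed = True
        if not changed:
            break
        lev += 1
    return dist
```

Points that are never reached keep the initial value 1234567890, exactly as they would in the printed output of the Falcon program.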

slide-63
SLIDE 63

Dynamic Graph algorithms

1 Falcon provides addPoint() and addEdge() functions on the Graph class.

2 The Falcon compiler checks for calls to these functions during code generation.

3 If such a call is found in the DSL code, the Graph is pre-allocated with more space (say, 3 times).

4 Deletion of Points/Edges is managed by marking. There is no garbage collection.
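The two ideas on this slide, pre-allocating extra space and deleting by marking, can be sketched in a few lines. The Python class below is a toy illustration under stated assumptions: the class and method names are hypothetical, the factor of 3 follows the "say 3 times" on the slide, edges are kept in an ordinary list rather than a pre-allocated array, and marking the edges of a deleted point is a choice made for this example only.

```python
class DynamicGraph:
    """Toy graph with pre-allocated point slots and deletion by marking."""

    def __init__(self, npoints, factor=3):
        self.capacity = npoints * factor         # room reserved for add_point()
        self.npoints = npoints
        self.deleted = [False] * self.capacity   # mark bits, never compacted
        self.edges = []                          # [src, dst, alive] entries

    def add_point(self):
        if self.npoints == self.capacity:
            raise MemoryError("pre-allocated space exhausted")
        self.npoints += 1
        return self.npoints - 1                  # id of the new point

    def add_edge(self, src, dst):
        self.edges.append([src, dst, True])

    def delete_point(self, p):
        self.deleted[p] = True                   # the slot is kept, not reused
        for e in self.edges:
            if e[0] == p or e[1] == p:
                e[2] = False                     # mark incident edges as dead

    def live_points(self):
        return [p for p in range(self.npoints) if not self.deleted[p]]

    def live_edges(self):
        return [(s, d) for s, d, alive in self.edges if alive]
```

Because nothing is compacted, point ids stay stable after a deletion, at the cost of carrying dead slots until the graph is rebuilt.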

slide-67
SLIDE 67

Graph member function addProperty()

1 graph.addProperty(struct node,triangle);

2 After this, triangle can be used in the same way as Edge and Point in DSL code.

3 It provides the programmer with

i) an iterator triangle
ii) a variable ntriangle, which stores the number of triangles.

4 This function can be used to view a Graph as a Collection of triangles (DMR) or as a Collection of clauses (Survey Propagation).
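What addProperty() offers, a named collection with an iterator and a count, can be imitated in Python to make the idea concrete. The sketch below is an analogy only and does not reflect how Falcon implements the feature; `PropertyGraph`, `add_property` and `add_item` are hypothetical names, and the `n` prefix for the count simply copies the `ntriangle` naming on the slide.

```python
class PropertyGraph:
    """Toy graph that exposes user-defined collections by name."""

    def __init__(self):
        self._collections = {}

    def add_property(self, name):
        """Register a collection, e.g. 'triangle'."""
        self._collections[name] = []

    def add_item(self, name, item):
        """Hypothetical helper: append one element to a collection."""
        self._collections[name].append(item)

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails.
        cols = self.__dict__.get("_collections", {})
        if attr in cols:
            return iter(cols[attr])         # graph.triangle -> iterator
        if attr.startswith("n") and attr[1:] in cols:
            return len(cols[attr[1:]])      # graph.ntriangle -> count
        raise AttributeError(attr)
```

After `g.add_property("triangle")`, a loop such as `for t in g.triangle` plays the role of a foreach over triangles, and `g.ntriangle` gives their number.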

slide-68
SLIDE 68

Code for Parallel Execution of SSSP and BFS on two GPUs in Falcon

    int <GPU>changed;
    SSSPBFS(char *name) { //begin SSSPBFS
      Graph hgraph; //Graph object on CPU
      hgraph.addPointProperty(dist,int);
      hgraph.addProperty(changed,int);
      hgraph.getType() <GPU>graph0; //Graph on GPU0
      hgraph.getType() <GPU>graph1; //Graph on GPU1
      hgraph.addPointProperty(dist1,int);
      hgraph.read(name); //read Graph from file to CPU
      graph0=hgraph; //copy entire Graph to GPU0
      graph1=hgraph; //copy entire Graph to GPU1
      foreach(t In graph0.points)t.dist=1234567890;
      foreach(t In graph1.points)t.dist=1234567890;
      graph0.points[0].dist=0;
      parallel sections { //do in parallel
        section { //compute BFS on GPU1
          while(1){
            graph1.changed[0]=0;
            foreach(t In graph1.points)BFS(t,graph1);
            if(graph1.changed[0]==0) break;
          }
        }
        section { //compute SSSP on GPU0
          while(1){
            graph0.changed[0]=0;
            foreach(t In graph0.points)SSSP(t,graph0);
            if(graph0.changed[0]==0) break;
          }
        }
      } //end parallel sections
    } //end SSSPBFS
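The structure of this program, two independent fixed-point loops running side by side on separate copies of the graph, can be mimicked with two threads. The Python sketch below is an analogy under assumptions: `relax_until_stable` and `sssp_and_bfs` are hypothetical names, BFS is modelled as the same loop with unit weights, and both searches start from point 0. Python threads only illustrate the control structure; they do not provide the device-level parallelism of two GPUs.

```python
from concurrent.futures import ThreadPoolExecutor

UNREACHED = 1234567890  # same initial distance as the Falcon code


def relax_until_stable(adj, unit_weights, source=0):
    """Relax every edge repeatedly until no distance changes.

    adj[p] is a list of (neighbour, weight) pairs.
    """
    dist = [UNREACHED] * len(adj)
    dist[source] = 0
    while True:
        changed = False
        for p, nbrs in enumerate(adj):
            if dist[p] == UNREACHED:
                continue
            for t, w in nbrs:
                cand = dist[p] + (1 if unit_weights else w)
                if cand < dist[t]:
                    dist[t] = cand
                    changed = True
        if not changed:
            return dist


def sssp_and_bfs(adj):
    """Run SSSP and BFS concurrently, each on its own copy of the graph."""
    graph0 = [list(nbrs) for nbrs in adj]   # copy used by SSSP
    graph1 = [list(nbrs) for nbrs in adj]   # copy used by BFS
    with ThreadPoolExecutor(max_workers=2) as pool:
        sssp_job = pool.submit(relax_until_stable, graph0, False)
        bfs_job = pool.submit(relax_until_stable, graph1, True)
        return sssp_job.result(), bfs_job.result()
```

Giving each task its own copy means the two loops share no mutable state, which is the property that lets the two sections run without synchronization.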