

SLIDE 1

Roomy: A System for Space Limited Computations

Dan Kunkle

Ph.D. Student College of Computer and Information Science Northeastern University Advisor: Gene Cooperman

PASCO ’10: July 21, 2010

Dan Kunkle Roomy; Space Limited Computation PASCO ’10: July 21, 2010 1 / 53

SLIDE 2

Outline

1. Overview: Roomy and Parallel Disk-based Computation
2. Roomy: Goals, Design, and Programming Model
3. Example Programming Constructs
4. Ten Keys to Using Roomy
5. Applications of Roomy: Pancake Sorting; Binary Decision Diagrams
6. Conclusions



SLIDE 4

Problem Statement

Goal: solve space limited problems without significantly increasing hardware costs or radically altering algorithms and data structures. A space limited problem is one where existing solutions quickly exceed available memory.

Solution: Roomy
- A new programming model that extends a programming language with transparent disk-based computing support.
- An open source C/C++ library implementing this new programming language extension.


SLIDE 5

Definition: Parallel Disk-based Computation

Parallel disk-based computation: using disks as the main working memory of a computation, instead of RAM. This provides several orders of magnitude more space for the same price.

Performance Issues and Solutions
- Bandwidth: the bandwidth of a disk is roughly 50 times less than RAM (100 MB/s versus 5 GB/s). Solution: use many disks in parallel.
- Latency: even worse, the latency of disk is many orders of magnitude worse than RAM. Solution: avoid latency penalties by using streaming access.


SLIDE 6

Other Approaches to Space-limited Problems

Other approaches to space limited problems include:
- New algorithmic techniques that reduce space usage (e.g., Bloom filters). Issue: usually problem specific; not always applicable.
- Increasing RAM using large shared-memory machines. Issue: expensive (non-commodity hardware).
- Distributed memory clusters. Issue: RAM per CPU is the same – still runs out of RAM quickly.
- Disks of a single machine. Issue: low bandwidth relative to RAM.


SLIDE 7

Implications of Disk-based Computation

By replacing RAM with disks, a cluster of 50 computers, each with 8 cores and 1 TB of disk space, can substitute for a shared memory computer with 400 cores and a single 50 TB memory subsystem.

Algorithm and Software Engineering Issues
Unfortunately, writing programs that use many disks in parallel and avoid random access is often a difficult task. Our group has over five years of case histories applying this to computational group theory – but each case requires months of development and debugging.

Rubik’s Cube in 26 moves, 2007, 8 TB of aggregate storage.



SLIDE 9

Goals of Roomy

The primary goals of Roomy are:
- Minimally invasive: common data structures in user sequential code are replaced by Roomy data structures (lists, arrays, and hash tables).
- Performance: the interface biases programmers toward approaches with high-performance parallel disk-based implementations.
- Choice of architectures: can use shared or distributed memory; locally attached disks or storage area networks (SANs).
- Scalability: the size of data structures is limited only by aggregate disk space; performance generally scales linearly with increasing parallelism.


SLIDE 10

Design of Roomy

Roomy is designed in layers (top to bottom):

Applications: A.I. search (pancake sorting, Rubik's Cube); SAT solver; binary decision diagrams; explicit state model checking.
Algorithm Library: breadth-first search; parallel depth-first search; dynamic programming.
API: RoomyArray (update, predicates, delayed read, map, reduce); RoomyList (add, remove, addAll, removeAll, removeDupes, map, reduce).
Foundation: file management; remote I/O; external sorting; synchronization and barriers.


SLIDE 11

Roomy Programming Model

The Roomy programming model:
- Provides basic data structures (arrays, unordered lists, and hash tables).
- Transparently distributes data structures across many disks and performs operations on that data in parallel.
- Immediately processes streaming access operators.
- Delays processing random access operators until they can be performed efficiently in batch (e.g., collecting and sorting updates to an array).


SLIDE 12

Example: Delayed Processing of Hash Table Insertions

(Figure: hash table insertions are buffered and then processed in batch when the table is synchronized.)

SLIDE 13

Programming Interface

There are three Roomy data structures:
- RoomyArray: a fixed-size, indexed array of elements (elements can be as small as one bit).
- RoomyHashTable: a dynamically sized structure mapping keys to values.
- RoomyList: a dynamically sized, unordered list of elements.

There are two types of Roomy operations: delayed and immediate. Operations requiring random access are delayed; other operations are performed immediately. Processing of delayed operations is initiated explicitly by the user, by making a call to synchronize a data structure.


SLIDE 14

RoomyArray Data Structure

RoomyArray Delayed Operations
- access – apply a user-defined function to an element
- update – update an element using a user-defined function

RoomyArray Immediate Operations
- sync – process outstanding delayed operations
- size – return the number of elements
- map – apply a user-defined function to each element
- reduce – return a value based on a combination of all elements
- predicateCount – return the number of elements that satisfy a property


SLIDE 15

RoomyHashTable Data Structure

RoomyHashTable Delayed Operations
- insert – insert a (key, value) pair in the table
- remove – remove a (key, value) pair from the table
- access – apply a user-defined function to a (key, value) pair
- update – update the value of a (key, value) pair

RoomyHashTable Immediate Operations (same as RoomyArray)
- sync – process outstanding delayed operations
- size – return the number of elements
- map – apply a user-defined function to each element
- reduce – return a value based on a combination of all elements
- predicateCount – return the number of elements that satisfy a property


SLIDE 16

RoomyList Data Structure

RoomyList Delayed Operations
- add – add an element to the list
- remove – remove all occurrences of an element from the list

RoomyList Immediate Operations (the last five are the same as RoomyArray)
- addAll – add all elements from one list to another
- removeAll – remove all elements in one list from another
- removeDupes – remove duplicate elements from a list
- sync – process outstanding delayed operations
- size – return the number of elements
- map – apply a user-defined function to each element
- reduce – return a value based on a combination of all elements
- predicateCount – return the number of elements that satisfy a property



SLIDE 18

Example Programming Constructs

The use of data structures similar to traditional programming models allows many common programming constructs to be implemented in Roomy. This section gives Roomy code for: map, reduce, predicates, permutation multiplication, set operations, chain reduction, pair reduction, and breadth-first search.


SLIDE 19

Programming Construct: Map (complete code)

Map: apply a function to every element of a data structure Example: add all elements in a RoomyArray to a RoomyList

RoomyArray* ra;
RoomyList* rl;

// Function to map over ra.
void mapFunc(uint64 i, void* val) {
    RoomyList_add(rl, val);
}

int main(int argc, char** argv) {
    Roomy_init(&argc, &argv);
    ra = RoomyArray_makeBytes("array", sizeof(uint64), 100);
    rl = RoomyList_make("list", sizeof(uint64));
    /* ... code that modifies ra ... */
    RoomyArray_map(ra, mapFunc);  // execute map
    RoomyList_sync(rl);           // sync rl to complete delayed 'add'
    Roomy_finalize();
}


SLIDE 20

Programming Construct: Map (abridged)

Map: apply a function to every element of a data structure Example: add all elements in a RoomyArray to a RoomyList

RoomyArray* ra;
RoomyList* rl;

// Function to map over ra.
void mapFunc(uint64 i, void* val) {
    RoomyList_add(rl, val);
}

RoomyArray_map(ra, mapFunc);  // execute map
RoomyList_sync(rl);           // sync rl to complete delayed 'add'


SLIDE 21

Programming Construct: Reduce

Reduce: produce a result based on a combination of all elements in a data structure Example: compute the sum of squares of the elements in a RoomyList

RoomyList* rl;  // elements of type int

// Add the square of an element to the sum.
void mergeElt(int* sum, int* element) {
    *sum += *element * *element;
}

// Compute the sum of two partial answers.
void mergeResults(int* sum1, int* sum2) {
    *sum1 += *sum2;
}

int sum = 0;
RoomyList_reduce(rl, &sum, sizeof(int), mergeElt, mergeResults);


SLIDE 22

Programming Construct: Predicates

Predicates: count the number of elements in a data structure that satisfy a Boolean function Example: count the number of elements in a RoomyList greater than 42

RoomyList* rl;

// Predicate: return 1 if element is greater than 42.
uint8 predFunc(int* val) {
    if (*val > 42) return 1;
    else return 0;
}

RoomyList_attachPredicate(rl, predFunc);
// ... code that modifies rl ...
uint64 gt42 = RoomyList_predicateCount(rl, predFunc);


SLIDE 23

Programming Construct: Permutation Multiplication

Permutation multiplication: given arrays X, Y, Z of length N, compute

for i = 0 to N-1: Z[i] = Y[X[i]]

RoomyArray *X, *Y, *Z;

// access X[i]
void accessX(uint64 i, uint64* x_i) {
    RoomyArray_access(Y, *x_i, &i, accessY);
}

// access Y[X[i]]
void accessY(uint64 x_i, uint64* y_x_i, uint64* i) {
    RoomyArray_update(Z, *i, y_x_i, setZ);
}

// set Z[i] = Y[X[i]]
void setZ(uint64 i, uint64* z_i, uint64* y_x_i, uint64* z_i_NEW) {
    *z_i_NEW = *y_x_i;
}

RoomyArray_map(X, accessX);  // access X[i]
RoomyArray_sync(Y);          // access Y[X[i]]
RoomyArray_sync(Z);          // set Z[i] = Y[X[i]]


SLIDE 24

Programming Construct: Set Operations

Set operations: sets can be represented using a RoomyList. (A RoomySet data structure is planned for the future.)

Convert a list to a set:

RoomyList* A;             // can contain duplicate elements
RoomyList_removeDupes(A); // now a set

Union: A = A ∪ B

RoomyList *A, *B;
RoomyList_addAll(A, B);
RoomyList_removeDupes(A);

Difference: A = A − B

RoomyList *A, *B;
RoomyList_removeAll(A, B);


SLIDE 25

Programming Construct: Set Operations

Intersection: C = A ∩ B, implemented as C = (A ∪ B) − (A − B) − (B − A)

// input sets
RoomyList *A, *B;
// initially empty sets
RoomyList *AandB, *AminusB, *BminusA, *C;

// create three temporary sets
RoomyList_addAll(AandB, A);
RoomyList_addAll(AandB, B);
RoomyList_removeDupes(AandB);
RoomyList_addAll(AminusB, A);
RoomyList_removeAll(AminusB, B);
RoomyList_addAll(BminusA, B);
RoomyList_removeAll(BminusA, A);

// compute intersection
RoomyList_addAll(C, AandB);
RoomyList_removeAll(C, AminusB);
RoomyList_removeAll(C, BminusA);


SLIDE 26

Programming Construct: Chain Reduction

Chain reduction: combine each element in a sequence with the element after it. Example: given an array a of N integers,

for (i = 1 to N-1) a[i] = a[i] + a[i-1]

where the values on the right-hand side are those from before any update.

RoomyArray* ra;  // array of ints, length N

// Function to be mapped over ra; issues updates.
void callUpdate(uint64 iMinus1, int* val_iMinus1) {
    uint64 i = iMinus1 + 1;
    if (i < N)
        RoomyArray_update(ra, i, val_iMinus1, doUpdate);
}

// Function to complete updates.
void doUpdate(uint64 i, int* val_i, int* val_iMinus1, int* val_i_NEW) {
    *val_i_NEW = *val_i + *val_iMinus1;
}

RoomyArray_map(ra, callUpdate);  // issue updates
RoomyArray_sync(ra);             // complete updates


SLIDE 27

Programming Construct: Pair Reduction

Pair reduction: apply a function to each pair of elements. For an array a of length N:

for i = 0 to N-1
    for j = 0 to N-1
        f(a[i], a[j]);


SLIDE 28

Programming Construct: Pair Reduction

Example: insert each pair of elements from a RoomyArray in a RoomyList

RoomyArray* ra;  // array of int, length N
RoomyList* rl;   // list containing Pair(int, int)

// Map function: sends an access to all other elts.
void callAccess(uint64 outerIndex, int* outerVal) {
    for innerIndex = 0 to N-1
        RoomyArray_access(ra, innerIndex, outerVal, doAccess);
}

// Access function: adds a pair to the list.
void doAccess(uint64 innerIndex, int* innerVal, int* outerVal) {
    RoomyList_add(rl, new Pair(*innerVal, *outerVal));
}

RoomyArray_map(ra, callAccess);
RoomyArray_sync(ra);  // perform delayed accesses
RoomyList_sync(rl);   // perform delayed adds


SLIDE 29

Programming Construct: Breadth-first Search

Breadth-first search: enumerate all elements of a graph, exploring elements closer to the starting point first.

The graph is implicit, defined by a starting element and a generating function that returns the neighbors of a given element.

Initialize the search:

// Lists for all elts, current, and next level
RoomyList* all  = RoomyList_make("allLev", eltSize);
RoomyList* cur  = RoomyList_make("lev0", eltSize);
RoomyList* next = RoomyList_make("lev1", eltSize);

// Function to produce next level from current
void genNext(T elt) {
    /* User-defined code to compute neighbors ... */
    for nbr in neighbors
        RoomyList_add(next, nbr);
}

// Add start element
RoomyList_add(all, startElt);
RoomyList_add(cur, startElt);


SLIDE 30

Programming Construct: Breadth-first Search

Perform search

// Generate levels until no new states are found
while (RoomyList_size(cur)) {
    // generate next level from current
    RoomyList_map(cur, genNext);
    RoomyList_sync(next);
    // detect duplicates within next level
    RoomyList_removeDupes(next);
    // detect duplicates from previous levels
    RoomyList_removeAll(next, all);
    // record new elements
    RoomyList_addAll(all, next);
    // rotate levels
    RoomyList_destroy(cur);
    cur = next;
    next = RoomyList_make(levName, eltSize);
}



SLIDE 32

KEY #1: Parallelism

Multi-process and multi-threading:
- It is anticipated that most applications will use one Roomy process per compute node.
- If disk bandwidth is not fully utilized and there is excess CPU power, one node can start multiple Roomy processes.
- Roomy is multi-threaded, but user code is usually serial. Currently, user code can be multi-threaded, but only if one thread makes all calls to Roomy.


SLIDE 33

KEY #2: Data Structure Size

Maximum data structure size is not limited by Roomy: it is limited only by aggregate available disk space.
- Typically, each Roomy process stores about the same amount of data.
- If nodes have significantly different amounts of free space, multiple Roomy processes can be started on the nodes with more space.


SLIDE 34

KEY #3: Choice of Roomy Data Structure

RoomyArray is often the most efficient data structure.
+ Minimizes data stored (elements can be as small as one bit)
+ Does not need a hash function to determine element location
+ Does not use sorting
− Fixed size

RoomyHashTable is good when keys cannot be mapped to integers, or the structure size is not predetermined.
+ Variable size
+ Arbitrary types for keys
+ Does not use sorting
− Empty slots take up additional space
− Hash function adds some CPU overhead

RoomyList should usually be chosen only if there is no alternative solution.
+ Variable size
+ No need for element indexes or keys
− Sorting causes a significant slowdown
− Hash function adds some CPU overhead


SLIDE 35

KEY #4: Synchronization Costs

Minimizing synchronization costs: the number of sync operations should be minimized; i.e., the number of outstanding delayed operations per sync should be maximized. Synchronization cost is due to:
- a small number of delayed operations causes random access;
- a large number of delayed operations requires the entire data structure to be accessed;
- all compute nodes must wait for all others to finish.


SLIDE 36

KEY #5: Load Balancing

Load balancing: the even distribution of data is handled by Roomy.
- A RoomyArray element at index i is stored on node i mod N.
- RoomyHashTable and RoomyList elements are distributed using a hash function.

The load can become unbalanced if there are a small number of hot elements. Load balancing is important because all nodes must wait for the slowest node on a sync. Watch for other causes of slow nodes, particularly certain hardware problems (e.g., a disk with a high error rate).


SLIDE 37

KEY #6: Peak Disk Usage

Peak disk usage is one of the statistics printed by Roomy_printStats. All Roomy data is stored on disk:
- data structures;
- delayed operations (including an 8-byte index for RoomyArrays).

Disk space can be freed by synchronizing delayed operations.


SLIDE 38

KEY #7: Peak RAM Usage

Typically, buffers for delayed operations are the bulk of RAM usage. To minimize RAM usage, minimize the number of Roomy data structures that have delayed operations outstanding at one time. A future version of Roomy is planned that will use free RAM as a cache for frequently used data.


SLIDE 39

KEY #8: Local vs. Shared Disks

Local disks vs. storage area network (SAN):

Local disks
+ processing of delayed operations does not use the network
+ typically higher performance
− less reliable than an array of disks

SAN
+ may provide significantly more disk space
+ more reliable (e.g., RAID)
− possibly lower performance: it may be used by many other users and can cause a network bottleneck


SLIDE 40

KEY #9: Network Bottlenecks

Aggregate network bandwidth should be at least as large as aggregate disk bandwidth. All delayed operations are written to disk once and read from disk once.
- With local disks, delayed operations cross the network once.
- With a SAN, delayed operations cross the network twice.


SLIDE 41

KEY #10: Other High Latency Storage

Roomy is appropriate for any high-latency storage:
- Solid state drives (SSDs), i.e., flash storage, provide much better random-read performance than disk, but still have very poor random-write performance.
- Distributed RAM can also be high latency, due to the network. Roomy can be run with a RAM disk for distributed-memory computations.



SLIDE 43

Pancake Sorting Problem

Pancake sorting: sort using prefix reversals. The goal is to minimize the number of reversals used.

Example: 3142 → 1342 → 4312 → 2134 → 1234

Question: what is the maximum number of reversals needed to sort N elements?


SLIDE 44

Pancake Sorting Graph

(Figure: the pancake sorting graph for N = 4, with the 24 permutations as vertices and prefix reversals as edges.)

SLIDE 45

Pancake Sorting Graph: Breadth-first Search Levels

(Figure: the same N = 4 pancake graph, with vertices grouped by breadth-first search level from 1234.)

SLIDE 46

Experimental Results for Pancake Sorting

Roomy was used for a breadth-first search of the 13-pancake graph:
- The graph has approximately 6.2 billion vertices and 74 billion edges.
- The computation completed in 11.5 minutes using 64 compute nodes.
- Peak disk usage was 200 GB; average disk bandwidth was over 1.5 GB/s.
- This replicated the best result as of 2006. Writing the Roomy program took less than a day.

Distribution of Elements

Level   Number of Elements
0       1
1       12
2       132
3       1451
4       14556
5       130096
6       1030505
7       7046318
8       40309555
9       184992275
10      639768688
11      1525115582
12      2183056185
13      1458670200
14      186883243
15      2001


SLIDE 47

Binary Decision Diagrams: Problem Statement

A binary decision diagram (BDD) is a compact representation of a Boolean function. One of the primary practical uses of BDDs is in symbolic model checking, particularly circuit verification.

Problem: BDD packages typically run out of space very quickly.
- They can fill RAM in a matter of minutes to hours.
- Traditional approaches make heavy use of random access patterns.


SLIDE 48

Example of a Binary Decision Diagram

(Figure: a BDD representing (x0 ∨ ¬x1 ∨ x2 ∨ x4 ∨ x5) ∧ (¬x0 ∨ x3 ∨ x1 ∨ x4 ∨ x5).)

The Roomy-based BDD package implements three algorithms:
- apply: the application of a Boolean operator to two BDDs (and, or, xor, etc.)
- any-SAT: return a satisfying assignment
- SAT-count: count the number of satisfying assignments

SLIDE 49

Counting Solutions to the N-Queens Problem

Problem: determine the number of ways N non-attacking queens can be placed on an N × N chess board.

Size of state space: N!

Boolean representation:
- N² variables: x_i,j is true iff there is a queen at row i, column j.
- N² square constraints: S_i,j is true iff there is a non-attacked queen on (i, j): S_i,j = x_i,j ∧ ¬x_i1,j1 ∧ ¬x_i2,j2 ∧ …, over the squares (i1, j1), (i2, j2), … attacked from (i, j).
- N row constraints: R_i is true iff row i has exactly one queen: R_i = S_i,1 ∨ S_i,2 ∨ … ∨ S_i,N.
- Board constraint: B is true iff the board has one queen in each row: B = R_1 ∧ R_2 ∧ … ∧ R_N.

Solution: count the number of satisfying assignments of B.


SLIDE 50

N-Queens Results

The Roomy-based package increased the size of the state space handled by 240 times over a RAM-based package (BuDDy) using 56 GB of RAM.

See PASCO ’10: Parallel Disk-Based Computation for Large, Monolithic Binary Decision Diagrams; Kunkle, Slavici, Cooperman.

(Figure: time in seconds, log scale from 1 to 100000, versus board dimension 8 to 16, comparing Roomy on 8, 32, and 64 nodes with BuDDy using 1 GB, 8 GB, and 56 GB of RAM.)


SLIDE 52

Conclusions

Roomy is a new programming model and open source library for parallel disk-based computation.
- Roomy can provide orders of magnitude more space than RAM-based methods.
- The Roomy programming model extends sequential programs in a minimally invasive manner.


SLIDE 53

See roomy.sourceforge.net for a beta release of the library and user documentation.
