CnC for Tuning Hints on OCR
Nick Vrvilo, Rice University
The 7th Annual CnC Workshop
September 8, 2015
Acknowledgements
This work was done as part of my internship with the OCR team, part of Intel Federal, LLC at Jones Farm (Hillsboro, OR).
- Mentors (Intel): Josh Fryman and Romain Cledat
- Habanero Team (Rice): Vivek Sarkar, Kath Knobe, Zoran Budimlić, and Sanjay Chatterjee
2
Objective
Demonstrate the effectiveness of OCR tuning hints by way of code generation from a higher-level programming model (CnC).
3
Objective
[Diagram labels: OCR Tunings, CnC Graph, CnC App Code, CnC-OCR Scaffolding, hints handler]
4
Open Community Runtime (OCR)*
OCR project goals:
- Provide effective abstraction for diverse hardware
- Typify future task-based execution models
- Handle large-scale parallelism efficiently
- Maintain a separation of concerns (application/scheduling/resources)
- Open source (encourage collaboration)
* OCR ==> X-Stack Traleika Glacier project’s implementation
5
Outline
- Introduction
- OCR Hints API
- CnC on OCR
- Tuning Hints Implementation and Analysis
6
CnC / OCR Concept Mapping
Concept | OCR construct | CnC construct
Task classes (code) | EDT template | Step collection
Task instance | EDT | Step instance
Data classes | All DBs have type void* (keeping track of individual DBs' types is the app programmer's responsibility) | Item collection
Data instance | Datablock | Item instance
Unique instance identifier | GUID | Tag (step tag / item key)
Dependence registration | Event add dependence | Item get
Dependence satisfaction | Event satisfy | Item put
7
OCR Hints API: Example
// Assume we have a template and a datablock
ocrGuid_t edt;
ocrEdtCreate(&edt, template, 0, NULL, 1, NULL,
             EDT_PROP_NONE, NULL_GUID, NULL);
{ // Set an OCR hint
    ocrHint_t stepHints;
    ocrHintInit(&stepHints, OCR_HINT_EDT_T);
    ocrGetHint(edt, &stepHints);   // read the hints currently set on the EDT
    ocrSetHintValue(&stepHints, OCR_HINT_EDT_PRIORITY, 100);
    ocrSetHint(edt, &stepHints);   // write the updated hints back to the EDT
}
ocrAddDependence(datablock, edt, 0, DB_DEFAULT_MODE);
8
OCR Hints API:
Pros
- Generic
- Conceptually decoupled
- Light-weight
Cons
- Verbose
- Placed in app source code
- Limited expressiveness
9
CnC-OCR Developer Workflow
1. Write graph spec
2. Run translator tool (produces skeleton project)
3. Flesh out skeleton code
4. Run program (functionality check), then debug
5. Write tuning spec(s)
6. Re-run translator tool (updates scaffolding code)
7. Re-run program (performance check), then fine-tune
11
CnC-OCR + Tuning
[Diagram labels: OCR Tunings, CnC Graph, CnC App Code, CnC-OCR Scaffolding, hints handler]
12
Separation of Concerns in CnC
- Graph specification can be written without implementation details
- Step function implementations written without knowledge of the external graph (only its own inputs and outputs)
- Tuning specification given in a separate file
- Easy to mix in different tunings for performance testing
- Try combinations of tunings until you find the ideal configuration
13
Tuning Hints Overview
1. Step / item distribution
2. Step affinity with input
3. Step priority
4. Scheduler throttling
5. Partial item requests
15
Hint #1: Step / Item Distribution Functions
- What?
Declare a function for mapping individual step / item instances from a collection onto the set of OCR policy domains.
- Why?
– Distributed OCR currently lacks advanced scheduling/placement heuristics.
– Need control of distribution for a reasonable baseline.
16
Smith-Waterman Sequence Alignment
- Each input sequence length ~200k
- Dynamic programming optimization on ~40-billion-cell matrix
- Tiles of 177x153 cells
- Total of 1138x1322 tiles
17
Smith-Waterman Specification
Graph Specification
[ int above[] : i, j ];
[ int left[] : i, j ];
[ SeqData *data : () ];
( swStep: i, j )
    <- [ data: () ],
       [ above: i, j ] $when(i > 0),
       [ left: i, j ] $when(j > 0)
    -> [ below @ above: i+1, j ],
       [ right @ left: i, j+1 ],
       ( swStep: i+1, j ) $when(i+1 < #nth);
Tuning Specification
[ above ]: { distfn: (i / 16) % $RANKS };
[ left ]: { distfn: (i / 16) % $RANKS };
( swStep ): { distfn: (i / 16) % $RANKS };
18
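The row-block distribution function in the tuning spec can be read as plain integer arithmetic. The following is a minimal sketch in C, not the code the CnC-OCR translator generates; the helper name row_block_rank and its parameters are hypothetical, with block_rows standing in for the constant 16 and ranks for $RANKS.

```c
#include <stdint.h>

/* Hypothetical helper (not part of CnC-OCR): maps a tile's row index i
 * to a rank, mirroring the tuning expression (i / 16) % $RANKS.
 * Consecutive groups of block_rows rows land on the same rank, and the
 * groups are dealt out to the ranks round-robin. */
static uint32_t row_block_rank(uint32_t i, uint32_t block_rows, uint32_t ranks) {
    return (i / block_rows) % ranks;
}
```

With block_rows = 16 and ranks = 4, rows 0 to 15 map to rank 0, rows 16 to 31 to rank 1, and row 64 wraps back to rank 0. Because the same expression is used for the above items, the left items, and swStep, everything tagged with row i ends up on one rank.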
Smith-Waterman Sequence Alignment
- Each input sequence length ~200k
- Dynamic programming optimization on ~40-billion-cell matrix
- Tiles of 177x153 cells
- Total of 1138x1322 tiles
- Default: CnC default distribution
- Row-block: Rows in blocks of 16
- 10 runs per configuration
[Chart: average execution time (seconds) vs. node count (1, 2, 4, 8) for CnC-OCR Default, CnC-OCR Row-Block, and iCnC Row-Block; axis range 10 to 50, with data labels 115.40 and 141.49]
19
Hint #2: Step Affinity with Input Item
- What?
Declare that a step instance be affinitized with one of its input items.
- Why?
– OCR can use this affinity to improve scheduling heuristics.
– More expressive way to specify tunings like hint #1.
20
Smith-Waterman Specification
Graph Specification
[ int above[] : i, j ];
[ int left[] : i, j ];
[ SeqData *data : () ];
( swStep: i, j )
    <- [ data: () ],
       [ above: i, j ] $when(i > 0),
       [ left: i, j ] $when(j > 0)
    -> [ below @ above: i+1, j ],
       [ right @ left: i, j+1 ],
       ( swStep: i+1, j ) $when(i+1 < #nth);
Tuning Specification
[ above ]: { distfn: (i / 16) % $RANKS };
[ left ]: { distfn: (i / 16) % $RANKS };
( swStep ): { placeWith: above };
21
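One way to picture placeWith is that the step no longer carries its own distribution function and instead inherits the placement of the named input item. The sketch below is an illustration under that assumption only; above_item_rank and sw_step_rank are hypothetical names, and the real runtime expresses affinity through OCR hints rather than a direct rank computation. It also assumes the item's distfn is still evaluated for steps with i == 0, which have no above input in the graph.

```c
#include <stdint.h>

#define BLOCK_ROWS 16u  /* the constant 16 from the tuning spec */

/* Hypothetical: the distfn declared for [ above ], (i / 16) % $RANKS.
 * The column index j is accepted but does not affect placement. */
static uint32_t above_item_rank(uint32_t i, uint32_t j, uint32_t ranks) {
    (void)j;
    return (i / BLOCK_ROWS) % ranks;
}

/* Hypothetical: with { placeWith: above }, step (i, j) runs wherever
 * its `above` input item at the same tag is placed. */
static uint32_t sw_step_rank(uint32_t i, uint32_t j, uint32_t ranks) {
    return above_item_rank(i, j, ranks);
}
```

The practical difference from hint #1 is that changing the distribution of the above collection now moves the steps with it, without editing the step's own tuning entry.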
Hint #3: Step Priority Weights
- What?
Express a priority weight for a given CnC step, such that steps with heavier weights should execute earlier.
- Why?
– Search problems: prioritize paths likely to find the answer sooner
– Enable concurrency: prefer task with high-demand output (many consumers)
22
N-Queens Puzzle
- Board size: 13x13
- Solutions possible: 73,712
[Figure: chessboard with eight queens]
23
N-Queens Specification
- Graph:
[ u64 solutions[4]: i ];
( placeQueen: row, board )
    -> ( placeQueen: row+1, board_prime ),
       [ solutions: ? ];
- Tuning:
( placeQueen /* row, board */ ): { priority: row };
24
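The placeQueen recursion in the graph can be sketched sequentially in C. This is an illustrative stand-in, not the CnC step code: place_queen and count_solutions are hypothetical names, the three bitmasks together play the role of board, and each recursive call corresponds to prescribing a placeQueen step at row+1. In the CnC version those calls become separate step instances, and the tuning { priority: row } gives deeper rows heavier weights.

```c
#include <stdint.h>

/* Hypothetical sequential stand-in for the placeQueen step.
 * cols, diag1, diag2 mark the columns attacked in the current row by
 * queens already placed (vertically and along both diagonals). */
static uint64_t place_queen(int n, int row, uint32_t cols,
                            uint32_t diag1, uint32_t diag2) {
    if (row == n) {
        return 1;  /* every row holds a queen: one complete solution */
    }
    uint64_t count = 0;
    uint32_t all = (1u << n) - 1u;                 /* n valid columns */
    uint32_t free_cols = all & ~(cols | diag1 | diag2);
    while (free_cols) {
        uint32_t bit = free_cols & (~free_cols + 1u);  /* lowest free column */
        free_cols ^= bit;
        /* Corresponds to prescribing ( placeQueen: row+1, board_prime ). */
        count += place_queen(n, row + 1, cols | bit,
                             (diag1 | bit) << 1, (diag2 | bit) >> 1);
    }
    return count;
}

/* Counts all solutions for an n x n board (intended for n < 32). */
static uint64_t count_solutions(int n) {
    return place_queen(n, 0, 0u, 0u, 0u);
}
```

For example, count_solutions(8) returns 92, the well-known count for the classic 8x8 puzzle.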
Implementation of Step Priority Weights
Description | Default Scheduler | Priority Scheduler | Location
Base data structure | deque | bin-heap | utils/
Scheduler interface wrapper | deque | bin-heap | scheduler-object/
Scheduler (aggregate) root object | wst | pr-wsh | scheduler-object/
Scheduler heuristic behavior | hc | priority | scheduler-heuristic/
25
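The priority scheduler swaps the deque for a binary heap as its base data structure. The sketch below shows a generic fixed-capacity max-heap in C to illustrate why that structure fits: the heaviest-weight task is always at the root. It is not the OCR implementation; the type and function names, the fixed capacity, and the integer task_id payload are all assumptions made for illustration.

```c
#include <stddef.h>

#define HEAP_CAPACITY 1024  /* arbitrary bound chosen for this sketch */

typedef struct { int priority; int task_id; } heap_entry_t;
typedef struct { heap_entry_t data[HEAP_CAPACITY]; size_t size; } task_heap_t;

static void heap_swap(heap_entry_t *a, heap_entry_t *b) {
    heap_entry_t tmp = *a; *a = *b; *b = tmp;
}

/* Inserts a task; returns 0 on success, -1 if the heap is full. */
static int heap_push(task_heap_t *h, int priority, int task_id) {
    if (h->size == HEAP_CAPACITY) return -1;
    size_t i = h->size++;
    h->data[i].priority = priority;
    h->data[i].task_id = task_id;
    while (i > 0) {  /* sift up until the parent is at least as heavy */
        size_t parent = (i - 1) / 2;
        if (h->data[parent].priority >= h->data[i].priority) break;
        heap_swap(&h->data[parent], &h->data[i]);
        i = parent;
    }
    return 0;
}

/* Removes the heaviest task into *out; returns 0, or -1 if empty. */
static int heap_pop(task_heap_t *h, heap_entry_t *out) {
    if (h->size == 0) return -1;
    *out = h->data[0];
    h->data[0] = h->data[--h->size];
    size_t i = 0;
    for (;;) {  /* sift down toward the heavier child */
        size_t left = 2 * i + 1, right = left + 1, largest = i;
        if (left < h->size && h->data[left].priority > h->data[largest].priority)
            largest = left;
        if (right < h->size && h->data[right].priority > h->data[largest].priority)
            largest = right;
        if (largest == i) break;
        heap_swap(&h->data[i], &h->data[largest]);
        i = largest;
    }
    return 0;
}
```

Both operations take O(log n) time, compared with O(1) for a deque push or pop, so the priority ordering is bought with a small per-task cost.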
N-Queens Puzzle
- Board size: 13x13
- Solutions possible: 73,712
- Solutions sought: 5,000
- DEQ: Default work-stealing deque
- DFS: Prioritize deep rows
- BFS: Prioritize shallow rows
- 50 runs per configuration
[Chart: average execution time (seconds), axis 1 to 4, for the DEQ, DFS, and BFS configurations]
26
Hint #4: Stoker Step (Scheduler Throttling)
- What?
Annotate the work-creating steps (which we call stokers) so that the runtime can differentiate them from non-work-creating steps (which we call quenchers).
- Why?
– If the scheduler has plenty of work to do, we can throttle by not running any more stoker steps for the time being.
– For work stealing, we can prioritize stoker steps for stealing, which mitigates the need for more stealing in the near term.
27
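The throttling idea can be sketched as a selection policy over a ready list. Everything in the block below is an assumption made for illustration: the task_t type, the pick_next function, the high_water threshold, and the newest-first (LIFO) default are hypothetical and are not taken from the OCR scheduler.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record: is_stoker marks work-creating steps. */
typedef struct { bool is_stoker; int id; } task_t;

/* Returns the index in ready[0..n) of the task to run next, or -1 if
 * the list is empty. ready[n-1] is the newest task.
 * When more than high_water tasks are pending, the newest quencher is
 * chosen so that no additional work is created; otherwise the newest
 * task runs regardless of kind. */
static int pick_next(const task_t *ready, size_t n, size_t high_water) {
    if (n == 0) return -1;
    if (n > high_water) {
        for (size_t i = n; i-- > 0; ) {
            if (!ready[i].is_stoker) return (int)i;
        }
        /* Only stokers remain: run one anyway so the program progresses. */
    }
    return (int)(n - 1);
}
```

The fallback matters: a scheduler that refused to run stokers while over the threshold could stall once the quenchers are drained, so throttling here only reorders work and never blocks it.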
Task-Bomb (Synthetic Example)
- Root step creates Z=32 stoker steps
- Each stoker creates
  - Y=100 quencher tasks
  - One stoker task
- Recursion creates X=200 levels
- Since the stoker is always created last, we would expect all of the stokers to run in a depth-first manner when using the standard work-stealing deque scheduler
[Diagram: task tree]
$initialize
    stoker(0,0)
        quencher(0,0,0) … quencher(0,0,Y)
        stoker(0,1)
            quencher(0,1,0) … quencher(0,1,Y)
            stoker(0,2) …
    …
    stoker(Z,0)
        quencher(Z,0,0) … quencher(Z,0,Y)
        stoker(Z,1) …
28
Task-Bomb CnC Graph Spec
[ void *done: () ];
( stoker: i, j )
    -> ( quencher: i, j, $rangeTo(Y) ),
       ( stoker: i, j+1 ) $when(j<X);
( quencher: i, j, k )
    -> [ done: () ] $when(i==0 && j==X && k==Y);
( $initialize: () ) -> ( stoker: $range(Z), 0 );
( $finalize: () ) <- [ done: () ];
29
Task-Bomb CnC Tunings
Alternative 1: Stoker / Quencher
( stoker ): { stoker: true };
Alternative 2: Priorities
( stoker ): { priority: -1 };
30
Task-Bomb (Synthetic Example)
- Root step creates Z=32 stoker steps
- Each stoker creates
  - Y=100 quencher tasks
  - One stoker task
- Recursion creates X=200 levels
- Default scheduler dies (deque overflow)
- Stoker hint allows for throttling
- Similar performance via priorities
[Chart: average execution time (seconds), axis 0.5 to 4, for the Default, Priority, and Stoker configurations; a ☠ marker appears on the chart]
31
Hint #5: Partial Item Inputs
- What?
Allow the programmer to specify that a step only accesses a sub-range of the bytes of an input item.
- Why?
– For distributed memory, when an item is an input to a remote step, we can transfer just the part that will be accessed.
- Work In Progress
32
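Since this hint is described as work in progress, the following is only a sketch of the underlying idea: copy a declared byte sub-range of an item instead of the whole item. The byte_range_t type and the copy_item_range function are hypothetical, and a local memcpy stands in for whatever remote transfer the runtime would actually perform.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical description of the bytes a step will access. */
typedef struct { size_t offset; size_t length; } byte_range_t;

/* Copies only the requested sub-range of an item into dst.
 * Returns the number of bytes copied, or 0 if the range does not fit
 * inside the item (dst is left untouched in that case). */
static size_t copy_item_range(void *dst, const void *item,
                              size_t item_size, byte_range_t range) {
    if (range.offset > item_size || range.length > item_size - range.offset) {
        return 0;
    }
    memcpy(dst, (const char *)item + range.offset, range.length);
    return range.length;
}
```

The bounds check is written as a subtraction on the item size rather than an addition of offset and length, so a very large requested length cannot wrap around and pass the test.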
Summary
- OCR hints demonstrated via CnC tuning and code generation:
– Step / item distribution
– Step affinity with input
– Step priority
– Scheduler throttling
– Partial item requests
- CnC provides benefits of a high-level paradigm to OCR:
– Expressiveness
– Separation of concerns
- OCR design strategy makes it possible to add new hint handlers with a reasonable amount of effort
33