SLIDE 1

M.Frank CERN/LHCb

Event Data Processing Frameworks for the Future

❍ The Vision
❍ The Model
❍ The Guinea Pig
❍ Results

SLIDE 2

The Problem

❍ Resources are scarce
  - Process parallelization does not address modern CPU technology
    - Many cores [Intel Many Integrated Core Architecture: 80]
    - Scarce memory per CPU core
    - Number of open files per node: castor, hpms, Oracle, …
  - Minimize resource usage (memory, files)
  - Let multiple threads use the same resources
    - I/O buffers, detector description, magnetic field map, histograms, static storage, …
    - ~1-2 threads per hardware thread
  - Pipelined Data Processing (PDP)

SLIDE 3

Pipelined Data Processing

❍ Two parallelization concepts
  - Event parallelization: simultaneous processing of multiple events
  - Algorithm parallelization for a given event: simultaneous execution of multiple Algorithms
❍ Both concepts may coexist
❍ Additional benefit:
  - Processing a given set of events may be faster
❍ Glossary (Gaudi-speak):
  - Events are processed by a sequence of Algorithms
  - An Algorithm is a considerable amount of code acting on the data of one event [not just sqrt(x)]

SLIDE 4

Amdahl's Law

❍ What is the possible gain that can be achieved?
  - Speedup = 1 / ( serial + parallel / Nthread )
    - serial and parallel are the fractions of the single-thread run time (serial + parallel = 1)
  - In which area are we navigating?
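
The formula above can be evaluated directly. The following is a minimal sketch; the function name and the serial fractions in the loop are illustrative assumptions, since the slides do not quote a measured serial fraction.

```python
def amdahl_speedup(serial_fraction: float, n_threads: int) -> float:
    """Speedup = 1 / (serial + parallel / Nthread), with parallel = 1 - serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_threads)


if __name__ == "__main__":
    # Illustrative serial fractions only (assumption, not from the slides).
    for serial in (0.0, 0.01, 0.05, 0.10):
        row = [round(amdahl_speedup(serial, n), 1) for n in (1, 8, 40)]
        print(f"serial={serial:.2f}: speedup for 1, 8, 40 threads = {row}")
```

Even a small serial fraction caps the gain well below the thread count, which is why the question "in which area are we navigating?" matters.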

SLIDE 5

Answers required

❍ Using the Pipelined Data Processing paradigm:
  - Which speedup can be achieved?
  - Which parameters will the model have?
  - What amount of work is required to transform an existing program?
    - Framework
    - Physics code

SLIDE 6

Pipelined Data Processing

[Figure: threads T0..T7 versus time in “clock cycles”, with Input, Processing and Output stages; Processing = Algorithm, Algorithm, Algorithm, Algorithm, …]

❍ Internal parallelization within an Algorithm
  - is NOT explicitly ruled out
  - but not taken into consideration

SLIDE 7

Pipelined Data Processing: Event Parallelism

[Figure: threads T0..T12 versus time]

❍ Multiple instances of single event queues
❍ Filling up threads up to some configurable limit

SLIDE 8

Pipelined Data Processing: Algorithm Parallelization

❍ Algorithms consume data from the TES (transient event data store, the blackboard for event data)
❍ Algorithms post data to the TES

Basic assumptions:

❍ The execution order of any 2 algorithms with the same input data does not matter
❍ They can be executed in parallel
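
The assumption above amounts to a simple rule: an algorithm may start as soon as all of its inputs are in the TES. A minimal sketch of that rule follows; the function, the algorithm names and the data item names are hypothetical and do not come from Gaudi or Brunel.

```python
from typing import Dict, FrozenSet, List, Set


def runnable_algorithms(
    inputs: Dict[str, FrozenSet[str]],
    tes_content: Set[str],
    already_started: Set[str],
) -> List[str]:
    """Return the algorithms whose required inputs are all present in the TES."""
    return sorted(
        name
        for name, required in inputs.items()
        if name not in already_started and required <= tes_content
    )


# Hypothetical algorithms and data items, for illustration only.
EXAMPLE_INPUTS = {
    "DecodeA": frozenset({"RawEvent"}),
    "DecodeB": frozenset({"RawEvent"}),
    "FitTracks": frozenset({"ClustersA", "ClustersB"}),
}

if __name__ == "__main__":
    # Both decoders share the same input, so they can run in parallel.
    print(runnable_algorithms(EXAMPLE_INPUTS, {"RawEvent"}, set()))
```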

SLIDE 9

Consequence

[Figure: threads T0..T7 versus time]

❍ Can keep more threads busy at a time
❍ Hence:
  - Fewer events in memory
  - Less memory used
❍ Example
  - First massage raw data for each subdetector (parallel)
  - Then fit track…

SLIDE 10

The Guinea Pig Model

❍ Paragon: LHCb reconstruction program “Brunel”
❍ Implement Pipelined Data Processing model
❍ With input from real event execution
  - Which algorithms are executed
  - Average wall time each algorithm requires
  - List of required input data items for each algorithm
❍ The Model
  - Replace execution with “sleep”
  - Not entirely accurate, but a reasonable approximation

SLIDE 11

Pipelined Data Processing: Configuration

❍ Start with a sea of algorithms
  - Match inputs with outputs
    - Algorithm dependencies
    - Execution order
  - Model dependencies obtained by snooping on TES

[Figure: unordered “sea” of modules, each with In and Out ports: Input Module, Algorithm 1, Algorithm 2, Algorithm 3, Histogram 1, …]

SLIDE 12

Pipelined Data Processing: Configuration

❍ Resolved Algorithm queue after snooping

[Figure: the same modules (Input Module, Algorithm 1, Algorithm 2, Algorithm 3, Histogram 1, …) connected through their In and Out ports and labelled 1 to 5]
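
Matching inputs with outputs yields a dependency graph, and any topological order of that graph is a valid execution order. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the data item names and the wiring between modules are assumptions, since the connections in the figure are not recoverable from the slide text.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List


def resolve_order(inputs: Dict[str, List[str]], outputs: Dict[str, List[str]]) -> List[str]:
    """Match inputs with outputs and return one valid execution order."""
    # Which algorithm produces each data item?
    producer = {item: alg for alg, items in outputs.items() for item in items}
    graph = {}
    for alg, required in inputs.items():
        missing = [item for item in required if item not in producer]
        if missing:
            raise ValueError(f"{alg}: no producer for {missing}")
        # Predecessors of alg are the producers of its inputs.
        graph[alg] = {producer[item] for item in required}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


# Hypothetical wiring of the modules named on the slide.
INPUTS = {
    "InputModule": [],
    "Algorithm1": ["raw"],
    "Algorithm2": ["raw"],
    "Algorithm3": ["a1", "a2"],
    "Histogram1": ["a3"],
}
OUTPUTS = {
    "InputModule": ["raw"],
    "Algorithm1": ["a1"],
    "Algorithm2": ["a2"],
    "Algorithm3": ["a3"],
    "Histogram1": [],
}

if __name__ == "__main__":
    print(resolve_order(INPUTS, OUTPUTS))
```

In this wiring Algorithm1 and Algorithm2 depend only on the input module, so their relative order is free and they are candidates for parallel execution.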

SLIDE 13

Conceptual Model: Executors, Workers and Manager

❍ Formal workload given to a worker
❍ As long as there is work and there are idle workers:
  - schedule an algorithm
  - acquire worker from idle queue
  - attach algorithm to worker
  - submit worker
❍ Once a Worker is finished
  - put worker back to idle queue
  - Algorithm back to “sea”
  - Evaluate TES content to reschedule workers

[Figure: the Dataflow Manager moves Workers between an idle queue and a busy queue; Algorithms from the waiting work are attached to busy Workers; several events are in flight, each with its own TES]

SLIDE 14

Conceptual Model: Executors, Workers and Manager

❍ Machinery implemented using GCD (Grand Central Dispatch)
❍ But: a standalone implementation is simple (it was the predecessor)
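
The manager loop of the conceptual model can be sketched as a small discrete-event simulation. This is not the GCD implementation from the talk: it replaces both the dispatch machinery and the “sleep” with a virtual clock, and the function name, its parameters and the toy algorithm chain are assumptions made for illustration.

```python
import heapq
from collections import Counter


def simulate(algorithms, n_events, n_threads, max_events, max_instances):
    """Return the virtual wall time needed to process n_events.

    algorithms maps name -> (duration, inputs, outputs). Durations stand in
    for the "sleep" of the model, so no real time passes.
    """
    clock = 0.0
    idle_workers = n_threads
    running = []            # heap of (finish_time, seq, event_id, name)
    instances = Counter()   # running instances per algorithm
    in_flight = {}          # event_id -> per-event TES and bookkeeping
    next_event = 0
    seq = 0
    while True:
        # Event parallelism: admit new events up to the configured limit.
        while len(in_flight) < max_events and next_event < n_events:
            in_flight[next_event] = {"tes": set(), "started": set(), "done": 0}
            next_event += 1
        # Algorithm parallelism: start whatever has its inputs in the TES.
        for event_id, state in in_flight.items():
            for name, (duration, inputs, _outputs) in algorithms.items():
                if idle_workers == 0:
                    break
                if name in state["started"] or not set(inputs) <= state["tes"]:
                    continue
                if instances[name] >= max_instances:
                    continue
                state["started"].add(name)
                instances[name] += 1
                idle_workers -= 1
                heapq.heappush(running, (clock + duration, seq, event_id, name))
                seq += 1
        if not running:
            if in_flight:
                raise RuntimeError("some algorithms can never get their inputs")
            return clock
        # Advance to the next finishing worker, free it, post its outputs.
        clock, _, event_id, name = heapq.heappop(running)
        idle_workers += 1
        instances[name] -= 1
        state = in_flight[event_id]
        state["tes"].update(algorithms[name][2])
        state["done"] += 1
        if state["done"] == len(algorithms):
            del in_flight[event_id]


# Toy chain (hypothetical): A feeds B and C, which both feed D.
TOY_ALGORITHMS = {
    "A": (1.0, [], ["a"]),
    "B": (2.0, ["a"], ["b"]),
    "C": (2.0, ["a"], ["c"]),
    "D": (1.0, ["b", "c"], ["d"]),
}

if __name__ == "__main__":
    t1 = simulate(TOY_ALGORITHMS, 20, 1, 1, 1)
    for threads, events in ((2, 1), (4, 4), (8, 4)):
        t = simulate(TOY_ALGORITHMS, 20, threads, events, events)
        print(f"threads={threads} events={events}: speedup {t1 / t:.2f}")
```

Setting `max_instances` to 1 reproduces the non-reentrant case; setting it equal to `max_events` corresponds to full reentrancy.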

SLIDE 15

The Guinea Pig Model: Parameter Space

❍ All parameters “within reason”
❍ Global model parameters
  - Maximal number of threads allowed: max ~40
❍ Event parallelization parameters
  - Maximal number of events processed in parallel: max 10 events
❍ Algorithmic parallelization parameters
  - Maximal number of instances of a given Algorithm
  - By definition <= number of parallel events
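
The three parameter groups and the constraint between them could be captured as follows. This is only a sketch: the class and field names are hypothetical, and the defaults simply repeat the upper limits quoted on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParameters:
    """Hypothetical container for the model parameters named on the slide."""

    max_threads: int = 40                  # global: "Max ~ 40"
    max_parallel_events: int = 10          # event parallelization limit
    max_instances_per_algorithm: int = 10  # algorithmic parallelization limit

    def __post_init__(self) -> None:
        limits = (
            self.max_threads,
            self.max_parallel_events,
            self.max_instances_per_algorithm,
        )
        if min(limits) < 1:
            raise ValueError("all limits must be at least 1")
        # By definition: instances of one Algorithm <= number of parallel events.
        if self.max_instances_per_algorithm > self.max_parallel_events:
            raise ValueError("instances per algorithm cannot exceed parallel events")


if __name__ == "__main__":
    print(ModelParameters())
    print(ModelParameters(max_threads=8, max_parallel_events=2, max_instances_per_algorithm=1))
```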

SLIDE 16

Model Result: Assuming full reentrancy

❍ Max 10 events in parallel
❍ Max 10 instances/algorithm
❍ All algorithms reentrant

[Figure: model result plot with the annotations:]
  - Theoretical limit: t = t1 / Nthread
  - Max evts > 3: speedup up to ~30
  - Max 2 events: 1 event * 2
  - Max 1 event: algorithmic parallel limit, speedup ~7
  - One thread = classic processing (t1)

SLIDE 17

Model Result: Assuming full reentrancy

❍ The result only shows that the model works
❍ However, such an implementation would be
  - Not practical in the presence of (a lot of) existing code, since all of it must be reentrant
  - A hell of a lot of work, if possible at all
❍ Measures are necessary
  - Not only for a transition phase
  - Some algorithms cannot be made reentrant
  - Exercise: only make the top N algorithms reentrant

SLIDE 18

What does this really mean?

Vary a cutoff that defines which algorithms must be reentrant

SLIDE 19

Model Result: The top 7 time consuming algorithms

Algorithm                  Time/event   Fraction   Cutoff group
Average proc. time/event   580 msec     100 %
FitBest                     58 msec     10.0 %     top 1
CreateOfflinePhotons        40 msec      6.8 %
RichOfflineGPIDLLIt0        28 msec      5.0 %
RichOfflineGPIDLLIt1        29 msec      4.8 %     top 4
CreateOfflineTracks         14 msec      2.4 %
PatForward                  10 msec      1.7 %
TrackAddLikelihood          10 msec      1.7 %     top 7
Top 7 total                189 msec     32.6 %
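
The totals and the cutoff groups can be checked with a few lines of arithmetic. The timings below are copied from the slide; the helper name `reentrant_set` is hypothetical, and per-algorithm percentages recomputed this way may differ slightly from the rounded values printed on the slide.

```python
AVERAGE_EVENT_MSEC = 580

# Average wall time per event in msec, as listed on the slide.
TOP7_MSEC = {
    "FitBest": 58,
    "CreateOfflinePhotons": 40,
    "RichOfflineGPIDLLIt0": 28,
    "RichOfflineGPIDLLIt1": 29,
    "CreateOfflineTracks": 14,
    "PatForward": 10,
    "TrackAddLikelihood": 10,
}


def reentrant_set(cut_msec):
    """Algorithms at or above the time cutoff, i.e. those to be made reentrant."""
    return [name for name, msec in TOP7_MSEC.items() if msec >= cut_msec]


top7_total = sum(TOP7_MSEC.values())
top7_share = 100.0 * top7_total / AVERAGE_EVENT_MSEC

if __name__ == "__main__":
    print(f"top 7: {top7_total} msec = {top7_share:.1f} % of {AVERAGE_EVENT_MSEC} msec")
    for cut in (50, 25, 10):
        print(f"cut {cut} msec -> {len(reentrant_set(cut))} algorithm(s)")
```

The cutoffs of 50, 25 and 10 msec select 1, 4 and 7 algorithms respectively, matching the "top 1", "top 4" and "top 7" scenarios of the model results.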

SLIDE 20

Model Result Top 7: Max. 10 instances of top 7 algorithms

❍ Max 10 events in parallel
❍ Top 7 algorithms reentrant with max. 10 instances
❍ Cut 10 msec [1.7 %]

[Figure: model result plot with the annotations:]
  - Theoretical limit
  - Max evts > 3: speedup up to ~30
  - Max 2 events: 1 event * 2
  - Max 1 event: algorithmic parallel limit, speedup ~7
  - One thread = classic processing (t1)

SLIDE 21

Model Result Top 4: Max. 10 instances of top 4 algorithms

❍ Max 10 events in parallel
❍ Top 4 algorithms reentrant with max. 10 instances
❍ Cut 25 msec [4.3 %]

[Figure: model result plot with the annotations:]
  - Theoretical limit
  - Max evts > 3: speedup up to ~30
  - Max 2 events: 1 event * 2
  - Max 1 event: algorithmic parallel limit, speedup ~7
  - One thread = classic processing (t1)

SLIDE 22

Model Result Top 1: Max. 10 instances of top algorithm

❍ Max 10 events in parallel
❍ Top 1 algorithm reentrant with max. 10 instances
❍ Cut 50 msec [10 %]

[Figure: model result plot with the annotations:]
  - Theoretical limit
  - Max evts > 3: no improvement, not sufficient
  - Max 2 events: speedup ~ 1 event * 2
  - Max 1 event: algorithmic parallel limit, speedup ~7
  - One thread = classic processing (t1)

SLIDE 23

Model Result: Importance of Algorithm Reentrancy

❍ Max 10 events in parallel
❍ Max 1 instance/algorithm

[Figure: model result plot with the annotations:]
  - Theoretical limit
  - Allowing for more events will not improve things anymore: dominated by execution time of slowest algorithm
  - Max 1 event: algorithmic parallel limit, speedup ~7
  - One thread = classic processing (t1)
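
Why the slowest algorithm dominates can be made concrete with a throughput bound: if an algorithm taking t msec may run in at most k instances, no more than k events can pass through it every t msec, whatever the number of threads or events. The estimate below is not a number from the slides; it is a sketch under that assumption, uses only the seven timings listed on the top 7 slide, and ignores thread limits and data dependencies.

```python
AVERAGE_EVENT_MSEC = 580

# Timings in msec as listed on the top 7 slide; all other algorithms are unknown.
TOP7_MSEC = {
    "FitBest": 58,
    "CreateOfflinePhotons": 40,
    "RichOfflineGPIDLLIt0": 28,
    "RichOfflineGPIDLLIt1": 29,
    "CreateOfflineTracks": 14,
    "PatForward": 10,
    "TrackAddLikelihood": 10,
}


def throughput_speedup_bound(total_msec, algorithm_msec, instances):
    """Upper bound on speedup when algorithms have limited instance counts.

    instances maps algorithm name -> allowed parallel instances (default 1).
    """
    bottleneck = max(
        msec / instances.get(name, 1) for name, msec in algorithm_msec.items()
    )
    return total_msec / bottleneck


if __name__ == "__main__":
    # One instance per algorithm: the 58 msec algorithm is the bottleneck.
    print(throughput_speedup_bound(AVERAGE_EVENT_MSEC, TOP7_MSEC, {}))
    # Only the top algorithm reentrant (10 instances): the next one takes over.
    print(throughput_speedup_bound(AVERAGE_EVENT_MSEC, TOP7_MSEC, {"FitBest": 10}))
```

Under these assumptions the bound is 10 with one instance per algorithm and 14.5 when only the top algorithm is reentrant, which is consistent with the observation that making a single algorithm reentrant is not sufficient.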

SLIDE 24

Results: Summary

❍ Provided both parallelization mechanisms are applied
  - Large wall time gains could be achieved
  - Factor 30 not out of reach
  - Framework infrastructure resources reduced by this factor
❍ Many changes are only internal to the framework
  - Multiple event processing
  - Thread safe data access
❍ Only top time consuming algorithms must be closely watched and made reentrant

SLIDE 25

Implications for existing frameworks

❍ Can the implementation of such a processing framework be applied to existing code?
  - It depends…
  - Algorithms and framework components must be able to deal with several events in parallel
    - e.g. a single “blackboard” would not do
  - The state of Algorithm instances may not depend on the current event
  - The Algorithm chain must be divisible into “self-contained” units
  - Locking is typically not supported by existing frameworks
  - Spaghetti code is a killer…
  - Otherwise: yes, this can be applied to existing frameworks

SLIDE 26

Program of Work

v  Build a prototype
v  Test with dummy Algorithms
v  Measure possible gains
x  Get more physics code and core software developers on board
x  Develop framework changes to support parallelization
x  Apply to existing reconstruction program
x  Start to convert existing physics code base
   First goal: 7 algorithms reentrant => work out mechanisms
x  Measure performance

SLIDE 27

Conclusions

❍ Only event and algorithm parallelization combined show the full potential of many core Algorithms
❍ Not all of the physics code base must be changed at once
❍ A smooth transition phase is provided, if most of the implications can be hidden by the framework
❍ Still: a lot of work coming up
❍ Migration cannot be transparent
  - It has to be agreed / prepared / scheduled by the code developers of the whole collaboration

SLIDE 28

Backup Slides

SLIDE 29

Dataflow Manager V2: GCD implementation

[Figure: class diagram of the GCD implementation.
 Classes: DataflowMgr, Worker, Executor, Factory, EventContext, ContextQue, IdleQue, WorkQue, rtl::Lock, IOMask / BitField, AlgMask / BitField.
 Members and multiplicities shown: m_idleQue: 0..n, m_work: 0..n, m_maxWorkers: int, m_input: 1, m_output: 1, m_executor: 1, m_master: 1, m_lock: 1, m_event: 1, m_events: 0..n, m_instances: 1..n, m_executed: 1, m_factory: 1, ID: int.
 Annotations: “need to be mutexed”, “prioritized list, static”, “hold/unhold”.]
