Parallelization of the AliRoot Event-Reconstruction, Stefan B. Lohn



SLIDE 1

Parallelization of the AliRoot Event-Reconstruction

Stefan B. Lohn CERN, 6. October 2011

SLIDE 2

Outline

  • 1. Introduction
  • 2. Transformation of Thread-Safety
  • 3. Transformation of ROOT & AliRoot
  • 4. Critical section: CInt
  • 5. Multi-threaded execution
  • 6. Sharing resources to reduce the footprint
  • 7. Testing Event-Reconstruction
  • 8. Conclusions

SLIDE 3

Introduction

This work is based on tools and techniques developed for the parallelization of sequential C/C++ source-code, which were successfully applied to the parallel version of the Monte Carlo simulation Geant4.

More and more computing units are available: Intel's Westmere already has 12 cores plus Hyper-Threading. But resources such as caches, IO bandwidth and internal data links remain limited.

(Origin: Dennis Schmitz, Wikimedia Foundation)

Two ways of parallel processing:

1) Multi-Processing

  • Slow context switch
  • Sharing memory is complicated
  • Example: PROOF-Lite

2) Multi-Threading

  • Depends on support for thread-safety

SLIDE 4

Introduction

Question: Can we introduce a parallel AliRoot Event-Reconstruction using multi-threading?

  • AliRoot: event reconstruction (physics analysis)
  • ROOT: physics analysis for huge amounts of data
  • CInt: C/C++ interpreter

Thread-safety: access to resources shared among threads must not interfere with the processing of other threads, even in an unpredicted way. This is also called unconditional thread-safety.

The basic steps are:

  • 1. Introducing thread-safety
  • 2. Keeping performance and scalability
  • 3. Reducing the memory footprint

SLIDE 5

Transformation of Thread-Safety

Source-to-source transformation: static analysis followed by source-code transformation.

[Figure: source-code files -> parsing -> abstract syntax tree (AST) -> rewriting (pretty-printing) -> source-code files. The sample file in the figure includes <iostream>, <TROOT.h> and <TRint.h>.]

What are we looking for to obtain thread-safety?

Searching:

  • Global declarations
  • Static declarations
  • Extern declarations

Adding the thread-local specifier:

  a) __thread int Variable;
  b) static __thread int Variable;
  c) extern __thread int Variable;
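The effect of the thread-local specifier can be illustrated with a minimal sketch. It uses the standard C++11 keyword `thread_local` instead of the GCC-specific `__thread` shown on the slide; the variable and function names are hypothetical and not taken from ROOT or AliRoot.

```cpp
#include <thread>
#include <vector>

// Hypothetical former global, now one instance per thread.
// thread_local is the standard C++11 spelling of GCC's __thread.
thread_local int g_private_counter = 0;

// Each worker increments only its own copy and returns the final value.
int count_in_thread(int increments) {
    for (int i = 0; i < increments; ++i) {
        ++g_private_counter;
    }
    return g_private_counter;
}

// Runs several workers concurrently. Every result equals `increments`,
// because no worker sees another worker's copy of the counter.
std::vector<int> run_workers(int num_threads, int increments) {
    std::vector<int> results(num_threads, 0);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&results, t, increments] {
            results[t] = count_in_thread(increments);
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return results;
}
```

Calling `run_workers(4, 1000)` should yield 1000 for every thread, while the copy owned by the calling thread stays untouched.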

SLIDE 6

Transformation of Thread-Safety

Searching, as before, for global, static and extern declarations.

But non-PODs need to be changed, e.g. std::string Var;

  • 1. Replace the object by a thread-local pointer: __thread std::string* Var_Ptr;
  • 2. Correct the accesses from functions accordingly.
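A minimal sketch of the non-POD case, assuming a lazily initialized accessor function as the way to "correct the access". The accessor `Var()` and the helper `tag_in_new_thread` are hypothetical names; the slides do not show how the generated access code looks. `thread_local` is used as the portable spelling of `__thread`.

```cpp
#include <string>
#include <thread>

// Original (shared by all threads):  std::string Var;
// Transformed: the pointer itself is a POD and can be thread-local.
static thread_local std::string* Var_Ptr = nullptr;

// Hypothetical accessor: creates the per-thread object on first use.
std::string& Var() {
    if (Var_Ptr == nullptr) {
        Var_Ptr = new std::string();  // one object per thread; not freed in this sketch
    }
    return *Var_Ptr;
}

// Former uses such as `Var = "x";` become `Var() = "x";`.
std::string tag_in_new_thread(const std::string& tag) {
    std::string seen;
    std::thread worker([&seen, &tag] {
        Var() = tag;   // writes only this thread's copy
        seen = Var();
    });
    worker.join();
    return seen;
}
```

The pointer indirection is needed because `__thread` only accepts types without non-trivial constructors; standard `thread_local` would also accept `std::string` directly.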

SLIDE 7

Transformation of Thread-Safety

Patching the GCC parser yields information about declarations in the source-code files.

SLIDE 8

Transformation of Thread-Safety

Patching the GCC parser yields information about declarations. Unfortunately, it offers no interaction with the abstract syntax tree, and the GCC plugin support is of no use for our case.

SLIDE 9

Transformation of Thread-Safety

Candidates for parsing and rewriting:

  • 1. ROSE compiler with the EDG frontend
  • 2. LLVM with Clang as C/C++ frontend

Both are capable of performing the proposed transformation with high precision. But EDG does not accept the whole AliRoot code and is commercially licensed.

In Clang, the RecursiveASTVisitor template is used for traversing the AST, together with statement, expression and type visitors to access its nodes. The Rewriter object can be used for replacing and adding own source-code. The implementation is not finished.

SLIDE 10

Transformation of ROOT & AliRoot

Converting statics/globals/extern decl. -> TLS:

Finally, around 1000 TLS specifiers have been added in ROOT and 366 in AliRoot. 3000 lines in ROOT and 1660 in AliRoot have been added for initialization. => About 6000 lines were added automatically, with some extraordinary exceptions that were treated manually.

             Statics   Globals   Extern
  AliRoot       1724       196      220
  ROOT           897         7      554
  CINT           749       715      941

SLIDE 11

Critical section: CInt

As demonstrated, in its current state the transformation lacks access to more precise and reliable information from the AST. CInt is not transformed yet. In addition, CInt is not only executed itself, but also generates source-code that is meant to be executed, the so-called dictionaries. This keeps it thread-unaware.

Q.: Can we work around this issue?

  • 1. Using ACLiC, which means compiling macros first.
  • 2. Avoiding concurrent write access to type information in the interpreter by building it in advance.
  • 3. Locking critical sections where CInt is called.
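The third point can be sketched with a mutex around every call into the interpreter. `FakeInterpreter` below is a stand-in written for this example and is not the CINT or TCint API; only the locking pattern is the point.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Stand-in for a thread-unaware interpreter (NOT the real CINT API).
struct FakeInterpreter {
    int calls = 0;
    int ProcessLine(const std::string& line) {
        ++calls;  // unsynchronized internal state
        return static_cast<int>(line.size());
    }
};

FakeInterpreter g_interpreter;   // single shared instance
std::mutex g_interpreter_mutex;  // guards every call into the interpreter

// All threads go through this wrapper, so only one thread at a time
// executes inside the thread-unaware code.
int LockedProcessLine(const std::string& line) {
    std::lock_guard<std::mutex> lock(g_interpreter_mutex);
    return g_interpreter.ProcessLine(line);
}

// Issues calls from several threads at once; thanks to the lock the
// call counter is updated without data races.
void run_locked_calls(int num_threads, int calls_per_thread) {
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) {
                LockedProcessLine("int x = 1;");
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
}
```

The lock serializes the interpreter calls, so this only scales if the time spent inside the interpreter is small compared to the rest of the reconstruction.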
SLIDE 12

Critical section: CInt

Following these three steps, CInt and its interface TCint can be used as singletons and stay thread-unaware. But TROOT accesses TCint and should stay a singleton too.

Q.: Can TROOT be used as a singleton?

The lists held by TROOT (ListOfFiles, ListOfMappedFiles, ListOfCanvases, ListOfStyles and so on) will be replaced by lists on the thread-private heap.
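One way to picture a shared singleton with thread-private lists is the sketch below. `Registry` is a simplified stand-in and not the TROOT class; using a function-local `thread_local` container is an assumption made for this example, the slides do not state how the thread-private heap is implemented.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Simplified stand-in for a TROOT-like singleton (not the ROOT class).
class Registry {
public:
    static Registry& Instance() {
        static Registry instance;  // one shared object for all threads
        return instance;
    }

    // Each thread gets its own list, created on first access.
    std::vector<std::string>& ListOfFiles() {
        thread_local std::vector<std::string> files;
        return files;
    }

private:
    Registry() = default;
};

// Registers a file from a new thread and reports how many entries
// that thread sees in its private list.
std::size_t files_seen_by_worker(const std::string& name) {
    std::size_t count = 0;
    std::thread worker([&count, &name] {
        Registry::Instance().ListOfFiles().push_back(name);
        count = Registry::Instance().ListOfFiles().size();
    });
    worker.join();
    return count;
}
```

A worker sees only the file it registered itself, even if the calling thread has already put entries into its own list.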

SLIDE 13

Multi-threaded execution

[Figure: stages Extract, Initialization, Termination, Merging, Exit]

Simple test setup:

  • 1. No interference between threads
  • 2. Most parts stay almost the same
  • 3. Minor changes in the code for steering the event-reconstruction
  • 4. An extraction step needs to distribute the required information, and
  • 5. a merging step needs to fuse the results
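The extraction and merging steps can be sketched as follows. The `Event` type and the `reconstruct` function are placeholders invented for this example; the real steering code of the event-reconstruction is not shown on the slides.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Placeholder event type and reconstruction step, for illustration only.
using Event = int;
long reconstruct(Event e) { return static_cast<long>(e) * e; }

long reconstruct_parallel(const std::vector<Event>& events, unsigned num_threads) {
    if (num_threads == 0) {
        num_threads = 1;
    }
    // Extraction: every thread gets a contiguous slice of the events.
    const std::size_t chunk = (events.size() + num_threads - 1) / num_threads;
    std::vector<long> partial(num_threads, 0);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&events, &partial, chunk, t] {
            const std::size_t begin = std::min(events.size(), t * chunk);
            const std::size_t end = std::min(events.size(), begin + chunk);
            long sum = 0;  // per-thread state, no interference with other threads
            for (std::size_t i = begin; i < end; ++i) {
                sum += reconstruct(events[i]);
            }
            partial[t] = sum;  // each thread writes only its own slot
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    // Merging: fuse the per-thread results.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

The result is independent of the number of threads, which is a cheap sanity check for a setup like this.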

Concurrent processing:

  • 1. Additional runtime for extraction, merging and initialization
  • 2. The original initialization is repeated per thread, which wastes time
  • 3. With many cores, IO gets worse

=> To fix this, an IO-managing thread is proposed for implementation, BUT

SLIDE 14

Multi-threaded execution

… Investigating scalability:

Creation of 1M random numbers, stored into separate files of 900 MB in total. The speedup grows up to 9.64 with 12 threads. The gap is caused by IO usage (~100 MB/s).

Test machine: 12-core Intel Westmere, 2.6 GHz, 12 MB LL cache.
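As a rough reading of these numbers, the parallel efficiency E follows directly from the speedup S = 9.64 on N = 12 threads. The serial fraction s is an estimate that assumes Amdahl's law, S = 1 / (s + (1 - s)/N), applies; the slide itself attributes the gap to IO.

```latex
\[
E = \frac{S}{N} = \frac{9.64}{12} \approx 0.80,
\qquad
s = \frac{N/S - 1}{N - 1} = \frac{12/9.64 - 1}{11} \approx 0.022
\]
```

Under that assumption, about 2% of non-parallel work would already explain the observed gap.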

SLIDE 15

Sharing resources

The value of this approach lies not just in using multi-threading, but in using shared memory to reduce the overall memory consumption. In the same way that we shared TROOT among all threads, we can share other classes as well.

Member fields of a shared class:

  • Relatively read-only fields (after initialization) => stay on the global heap
  • Transitory fields (read-write) => go to the thread-private heap
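The split of member fields can be sketched like this. `SharedDetector` and its fields are hypothetical; placing the transitory part in a `thread_local` object is an assumption of this example and only one possible realization of a thread-private heap.

```cpp
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Transitory (read-write) fields: one copy per thread.
struct Scratch {
    double sum = 0.0;
    int calls = 0;
};

// Hypothetical shared class; one instance is used by all threads.
class SharedDetector {
public:
    explicit SharedDetector(std::vector<double> positions)
        : positions_(std::move(positions)) {}

    // Reads the shared read-only data, writes only thread-private state.
    double Accumulate(std::size_t index) {
        Scratch& s = scratch();
        s.sum += positions_.at(index);
        ++s.calls;
        return s.sum;
    }

    int CallsInThisThread() const { return scratch().calls; }

private:
    // Note: in this sketch the scratch area is per thread but common to
    // all SharedDetector instances.
    static Scratch& scratch() {
        thread_local Scratch s;
        return s;
    }

    const std::vector<double> positions_;  // relatively read-only after construction
};

// Calls Accumulate n times from a new thread and returns the call
// count seen by that thread.
int calls_seen_by_worker(SharedDetector& det, int n) {
    int calls = 0;
    std::thread worker([&det, &calls, n] {
        for (int i = 0; i < n; ++i) {
            det.Accumulate(0);
        }
        calls = det.CallsInThisThread();
    });
    worker.join();
    return calls;
}
```

Only the read-only vector is shared, so its memory is paid once, while each thread pays only for its small scratch area.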

SLIDE 16

Sharing resources

  • 1. General classification by using a profiler, e.g. Massif.
  • 2. Then one must roughly classify the member fields.
  • 3. ptrace and memory protection can then be used to verify whether they are relatively read-only or transitory fields.

(Origin: X. Dong, G. Cooperman, J. Apostolakis, Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software)
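The memory-protection part of the verification can be sketched in a simplified, in-process form. This is not the ptrace-based tooling mentioned above: the example assumes Linux, uses `mprotect` plus a SIGSEGV handler, and the function name is hypothetical. A field that is really read-only after initialization never triggers the handler; a transitory field does.

```cpp
#include <cstddef>
#include <setjmp.h>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf g_jump;

// Jumps back to the checkpoint when a protected page is written.
static void on_segv(int) { siglongjmp(g_jump, 1); }

// Returns true if a write after the initialization phase is trapped
// and the stored value stays intact.
bool write_is_trapped() {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* mem = mmap(nullptr, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        return false;
    }
    volatile int* field = static_cast<volatile int*>(mem);
    *field = 42;  // initialization phase: writes are allowed

    if (mprotect(mem, page, PROT_READ) != 0) {  // from now on: read-only
        munmap(mem, page);
        return false;
    }

    struct sigaction handler {};
    struct sigaction previous {};
    handler.sa_handler = on_segv;
    sigemptyset(&handler.sa_mask);
    sigaction(SIGSEGV, &handler, &previous);

    bool trapped = false;
    if (sigsetjmp(g_jump, 1) == 0) {
        *field = 7;      // faults, because the page is read-only now
    } else {
        trapped = true;  // reached via siglongjmp from the handler
    }
    const bool value_intact = (*field == 42);

    sigaction(SIGSEGV, &previous, nullptr);
    munmap(mem, page);
    return trapped && value_intact;
}
```

In a real classification run one would log the faulting address instead of just returning a flag, in order to see which member field was written.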

SLIDE 17

Testing Event-Reconstruction

Preliminary results for PPBench raw-reconstruction (proton-proton collisions):

Test with 200 events and ITS only. 4 threads run with a speedup of 2.5, but use only twice as much memory as a single-threaded reconstruction. Only CInt and TROOT are shared.

Test machine: 12-core Intel Westmere, 2.6 GHz, 12 MB LL cache.
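Put into numbers, with M_1 as the memory of a single-threaded reconstruction. The comparison with four independent processes assumes that each process would need the full M_1 and that nothing is shared between processes, which is not measured on the slide.

```latex
\[
E = \frac{S}{N} = \frac{2.5}{4} \approx 0.63,
\qquad
\frac{M_{\text{4 threads}}}{M_{\text{4 processes}}} = \frac{2\,M_1}{4\,M_1} = 0.5
\]
```

Under this assumption the multi-threaded run needs about half the memory of a multi-process run with the same degree of parallelism.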

SLIDE 18

Conclusions

  • 1. Simple way of parallelization that works for AliRoot.
  • 2. Reducing time in development and maintenance.
  • 3. Introducing multi-threading without expert knowledge.
  • 4. Keeping memory consumption under control.
  • 5. Providing an analysis technique to investigate candidates for shared classes
  • 6. and to investigate concerns of correctness.

Further efforts:

  • 1. Analyze correctness for this approach.
  • 2. Find sharable classes to reduce memory consumption (e.g. ITSgeom).
  • 3. Investigate further needs for massive multithreading.
SLIDE 19

Sources

Many thanks to Xin Dong and Axel Naumann for their advice.

  • X. Dong, G. Cooperman, J. Apostolakis. Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software.
  • F. Carminati and G. Bruckner. The ALICE Offline Bible, 2007.
  • F. Carminati, P. V. Vyvre, L. B., et al. Technical Design Report of Computing, 2005.
  • G. de Oliveira Costa and A. Oliva. Speeding up thread-local storage access in dynamic libraries in the ARM platform. Red Hat, 2006.
  • S. Ghemawat. Thread-Caching Malloc (TCMalloc).
  • M. Tadel and F. Carminati. Parallelization of ALICE simulation - a jump through the looking-glass. Journal of Physics: Conference Series, Volume 219, 2010.
  • S. Jarp, A. Lazzaro and A. Nowak. Scalability of Westmere-EP and Nehalem-EX platforms. CERN openlab.

SLIDE 20

Thanks for your attention. Any further questions?