Parallelization of the AliRoot Event-Reconstruction
Stefan B. Lohn
CERN, 6. October 2011
Outline
- 1. Introduction
- 2. Transformation of Thread-Safety
- 3. Transformation of ROOT & AliRoot
- 4. Critical section: Cint
- 5. Multi-Threaded execution
- 6. Sharing resources to reduce the footprint
- 7. Testing Event-Reconstruction
- 8. Conclusions
Introduction
This work is based on tools and techniques developed for the parallelization of sequential C/C++ source code, which have been successfully applied to the parallel Monte Carlo simulation Geant4.
(Origin: Dennis Schmitz, Wikimedia Foundation)
The number of computing units keeps growing: Intel's Westmere machines already offer 12 cores plus Hyper-Threading. But resources such as caches, I/O bandwidth and internal data links remain limited.
Two ways of parallel processing:
1) Multi-Processing
- Slow context switch
- Sharing memory is complicated
- PROOF-Lite
2) Multi-Threading
- Depends on support of thread-safety
Question: Can we introduce a parallel AliRoot Event-Reconstruction using multi-threading?
- AliRoot: event reconstruction (physics analysis)
- ROOT: physics analysis for huge amounts of data
- CInt: C/C++ interpreter
Thread-safety: Access to resources shared among threads must not interfere with the processing of other threads, even in unpredicted ways. This is also called unconditional thread-safety.
The basic steps are:
- 1. Introducing thread-safety
- 2. Keeping performance and scalability
- 3. Reducing the memory footprint
Transformation of Thread-Safety
Source-to-source transformation (figure): source-code files are parsed into an abstract syntax tree (AST); static analysis and source-code transformation work on the AST; rewriting (pretty-printing) produces the new source files.
What are we looking for to obtain thread-safety?
Searching for:
- Global declarations
- Static declarations
- Extern declarations
Adding the thread-local specifier:
a) __thread int Variable;
b) static __thread int Variable;
c) extern __thread int Variable;
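The effect of the thread-local specifier can be sketched as follows. This is a minimal illustration, assuming GCC or Clang (`__thread` is a compiler extension; standard C++11 offers `thread_local`). The names `Counter`, `bumpLocal` and `bumpInThreads` are hypothetical and not taken from ROOT or AliRoot.

```cpp
#include <cassert>
#include <thread>

// Before the transformation this would be a plain global:  int Counter;
// After the transformation each thread owns a private copy.
__thread int Counter = 0;

// Increments the calling thread's own copy and returns its value.
static int bumpLocal(int times) {
    for (int i = 0; i < times; ++i) ++Counter;
    return Counter;
}

// Runs two workers and reports what each of them observed.
void bumpInThreads(int& seenByA, int& seenByB) {
    std::thread a([&] { seenByA = bumpLocal(1000); });
    std::thread b([&] { seenByB = bumpLocal(2000); });
    a.join();
    b.join();
}
```

Each worker sees only its own increments, and the copy belonging to the calling thread stays untouched.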
But non-POD types need to be changed, e.g. std::string Var;
- 1. Replace the variable by a thread-local pointer: __thread std::string* Var_Ptr;
- 2. Correct the accesses from functions
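The two steps for non-POD types might look like the sketch below. `__thread` only accepts types with trivial construction and destruction, which is why the string is replaced by a pointer. The accessor `Var()` with lazy allocation and the helper `tagFromThread` are assumptions made for illustration; the slides do not show how the accesses are actually rewritten.

```cpp
#include <cassert>
#include <string>
#include <thread>

// Original declaration:  std::string Var;
// Transformed: a thread-local pointer, which is a POD and valid with __thread.
static __thread std::string* Var_Ptr = nullptr;

// Hypothetical accessor: allocates the per-thread object on first use.
// The object is intentionally never freed in this sketch.
static std::string& Var() {
    if (Var_Ptr == nullptr) Var_Ptr = new std::string();
    return *Var_Ptr;
}

// Former uses of `Var` are rewritten to go through the accessor.
std::string tagFromThread(const std::string& tag) {
    std::string result;
    std::thread worker([&] {
        Var() = tag;      // writes the worker's private string
        result = Var();
    });
    worker.join();
    return result;
}
```

The lazy allocation corresponds to the kind of initialization code that such a transformation has to add for every converted variable.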
First attempt: patching the GCC parser to obtain information about declarations.
Unfortunately, this gives no interaction with the abstract syntax tree, and the GCC plugin support is of no use for our case.
- 1. Rose Compiler with EDG frontend
- 2. LLVM with Clang as C/C++ frontend
Both are capable of performing the proposed transformation with high precision. But: EDG does not accept the whole AliRoot code and is commercially licensed. In Clang, the RecursiveASTVisitor template is used to traverse the AST, and statement, expression and type visitors are used to access its nodes. The Rewriter object can be used for replacing and adding own source code. The implementation is not finished.
Transformation of ROOT & AliRoot
Converting statics/globals/extern decl. -> TLS:
In total, around 1000 TLS specifiers have been added in ROOT and 366 in AliRoot. 3000 lines in ROOT and 1660 in AliRoot have been added for initialization. => Roughly 6000 lines were added automatically, with a few unusual exceptions that were treated manually.
          Statics  Globals  Extern
AliRoot      1724      196     220
ROOT          897        7     554
CINT          749      715     941
Critical section: CInt
As demonstrated, the transformation in its current state lacks access to more precise and reliable information from the AST. CInt is not transformed yet. In addition, CInt is not only executed itself but also generates source code that is executed, the so-called dictionaries. It therefore remains thread-unaware. Q.: Can we work around this issue?
- 1. Use ACLiC, i.e. compile macros first.
- 2. Avoid concurrent write access to type information in the interpreter by building it in advance.
- 3. Lock critical sections where CInt is called.
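The third step, locking every entry into thread-unaware code, can be sketched with a single global mutex. The counter `interpreterCalls` and the wrapper `callInterpreterLocked` are hypothetical stand-ins; no real CInt or TCint API is used here.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for thread-unaware interpreter state (hypothetical, not a CInt API).
static long interpreterCalls = 0;

// One global lock guards every entry into the thread-unaware code.
static std::mutex interpreterMutex;

// Hypothetical wrapper: all interpreter calls are funnelled through here.
void callInterpreterLocked() {
    std::lock_guard<std::mutex> guard(interpreterMutex);
    ++interpreterCalls;   // the critical section
}

// Calls the wrapper from several threads and returns the total call count.
long runWorkers(int threads, int callsPerThread) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([callsPerThread] {
            for (int i = 0; i < callsPerThread; ++i) callInterpreterLocked();
        });
    for (auto& th : pool) th.join();
    return interpreterCalls;
}
```

A coarse lock like this serializes the interpreter calls, so it only pays off when such calls are rare compared to the actual reconstruction work.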
Following these three steps, CInt and its interface TCint can be used as singletons and stay thread-unaware. But TROOT accesses TCint and should stay a singleton too. Q.: Can TROOT be used as a singleton? The lists held by TROOT (ListOfFiles, ListOfMappedFiles, ListOfCanvases, ListOfStyles and so on) will be replaced by lists on the thread-private heap.
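The idea of a singleton whose lists are private to each thread can be sketched as below. `Registry` is a hypothetical stand-in for TROOT, not the real class. The sketch uses C++11 `thread_local`, so only the ownership of the list is per thread; a truly thread-private heap would additionally need a per-thread allocator, which is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for TROOT; not the real class.
class Registry {
public:
    // One shared instance for the whole process.
    static Registry& Instance() {
        static Registry instance;   // initialization is thread-safe since C++11
        return instance;
    }
    // The list lives in thread-local storage, so each thread
    // sees and modifies only its own entries.
    std::vector<std::string>& ListOfFiles() {
        thread_local std::vector<std::string> files;
        return files;
    }
private:
    Registry() = default;
};

// Registers one file from a worker thread and returns the list size it sees.
std::size_t filesSeenByWorker() {
    std::size_t count = 0;
    std::thread worker([&] {
        Registry::Instance().ListOfFiles().push_back("worker.root");
        count = Registry::Instance().ListOfFiles().size();
    });
    worker.join();
    return count;
}
```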
Multi-threaded execution
Simple test setup (figure): Extract, Initialization, concurrent processing, Termination, Merging, Exit.
- 1. No interference between threads
- 2. Most parts stay almost the same
- 3. Minor changes in the code for steering the event reconstruction
- 4. An extraction step needs to distribute the required information
- 5. A merging step needs to fuse the results
Concurrent processing:
- 1. Additional runtime for extraction, merging and initialization
- 2. The original initialization is repeated per thread, which wastes time
- 3. With many cores, I/O performance gets worse
=> To fix this, an I/O-managing thread is proposed for implementation, but:
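The extract / process / merge structure of the test setup can be sketched as follows. The function `reconstruct` is a hypothetical placeholder for the work done on one event, and the round-robin distribution of events is an assumption; the slides do not state how events are assigned to threads.

```cpp
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical per-event work; stands in for reconstructing one event.
static long reconstruct(int eventId) { return 2L * eventId; }

long reconstructAll(int nEvents, int nThreads) {
    // Extraction: give each thread its own slot for partial results.
    std::vector<long> partial(nThreads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back([&partial, t, nEvents, nThreads] {
            // Concurrent processing: thread t handles events t, t + nThreads, ...
            for (int e = t; e < nEvents; e += nThreads)
                partial[t] += reconstruct(e);
        });
    // Termination: wait for all workers.
    for (auto& th : pool) th.join();
    // Merging: fuse the per-thread results.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Because every thread writes only to its own slot, no locking is needed until the merging step.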
Investigating scalability:
Creation of 1M random numbers, stored in separate files of 900 MB in total. The speedup grows up to 9.64 with 12 threads. Test machine: 12-core Intel Westmere, 2.6 GHz, 12 MB last-level cache.
Gap caused by I/O usage of ~100 MB/s.
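To put the measured speedup into perspective, the parallel efficiency and the serial fraction implied by Amdahl's law can be computed as in the sketch below. Reading the result through Amdahl's law is an added interpretation and not a measurement from the slides; the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Parallel efficiency: speedup divided by the number of threads.
double efficiency(double speedup, int threads) {
    return speedup / threads;
}

// Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / N),
// solved for f. This is an interpretation, not a measured value.
double impliedSerialFraction(double speedup, int threads) {
    return (threads / speedup - 1.0) / (threads - 1.0);
}
```

For a speedup of 9.64 with 12 threads, the efficiency is about 0.80 and the implied serial fraction is about 2.2%.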
Sharing resources
The value of this approach lies not just in using multi-threading, but in using shared memory to reduce the overall memory consumption. In the same way that we shared TROOT among all threads, we can share other classes as well. The member fields of a shared class fall into two groups:
- Relatively read-only fields (after initialization) => stay on the global heap
- Transitory fields (read-write) => go to the thread-private heap
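The split of a shared class into relatively read-only and transitory fields might look like the sketch below. `SharedGeometry`, `layerRadii` and `Scratch` are hypothetical names chosen for illustration, and `thread_local` stands in for the thread-private heap of the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical shared class; names are illustrative only.
struct SharedGeometry {
    // Relatively read-only: filled once during initialization, then only read.
    const std::vector<double> layerRadii;
    explicit SharedGeometry(std::vector<double> radii)
        : layerRadii(std::move(radii)) {}

    // Transitory: read-write scratch data, one copy per thread.
    static std::vector<double>& Scratch() {
        thread_local std::vector<double> scratch;
        return scratch;
    }
};

// Each worker reads the shared radii and writes only to its own scratch buffer.
std::vector<std::size_t> scratchSizes(const SharedGeometry& geo, int nThreads) {
    std::vector<std::size_t> sizes(nThreads, 0);
    std::vector<std::thread> pool;
    for (int t = 0; t < nThreads; ++t)
        pool.emplace_back([&geo, &sizes, t] {
            for (double r : geo.layerRadii)
                SharedGeometry::Scratch().push_back(r * (t + 1));
            sizes[t] = SharedGeometry::Scratch().size();
        });
    for (auto& th : pool) th.join();
    return sizes;
}
```

Only one copy of the read-only data exists regardless of the number of threads, which is where the memory saving comes from.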
1) General classification using a profiler, e.g. Massif.
2) Then one must roughly classify the member fields.
3) ptrace and memory protection can then be used to verify whether they are relatively read-only or transitory fields.
(Origin: X. Dong, G. Cooperman, J. Apostolakis, Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software)
Testing Event-Reconstruction
Test with 200 events and ITS only. 4 threads run with a speedup of 2.5, but use only 2 times the memory of a single-threaded reconstruction. Only CInt and TROOT are shared. Preliminary results for PPBench raw reconstruction (proton-proton collisions).
Test machine: 12-core Intel Westmere, 2.6 GHz, 12 MB last-level cache.
Conclusions
- 1. Simple way of parallelization that works for AliRoot.
- 2. Reducing time in development and maintenance.
- 3. Introducing multi-threading without expert knowledge.
- 4. Keeping memory consumption under control.
- 5. Providing an analysis technique to investigate candidates for shared classes
- 6. and to investigate concerns of correctness.
Further efforts:
- 1. Analyze the correctness of this approach.
- 2. Find sharable classes to reduce memory consumption (e.g. ITSgeom).
- 3. Investigate further needs for massive multithreading.
Sources
Many thanks to Xin Dong and Axel Naumann for their advice.
- X. Dong, G. Cooperman and J. Apostolakis. Multithreaded Geant4: Semi-automatic Transformation into Scalable Thread-Parallel Software.
- F. Carminati and G. Bruckner. The ALICE Offline Bible, 2007.
- F. Carminati, P. V. Vyvre, L. B., et al. Technical Design Report of Computing, 2005.
- G. de Oliveira Costa and A. Oliva. Speeding up thread-local storage access in dynamic libraries in the ARM platform. Red Hat, 2006.
- S. Ghemawat. Thread-Caching Malloc (TCMalloc).
- M. Tadel and F. Carminati. Parallelization of ALICE simulation - a jump through the looking-glass. Journal of Physics: Conference Series, Volume 219, 2010.
- S. Jarp, A. Lazzaro and A. Nowak. Scalability of Westmere-EP and Nehalem-EX platforms. CERN openlab.