Track Reconstruc.on on GPUs MOHAMMAD AL-TURANY GSI Darmstadt - PowerPoint PPT Presentation

Why ¡CUDA? 9  CUDA ¡development ¡tools ¡work ¡alongside ¡the ¡convenMonal ¡C/C++ ¡ compiler, ¡so ¡one ¡can ¡mix ¡GPU ¡code ¡with ¡general-‑purpose ¡code ¡for ¡ the ¡host ¡CPU.  CUDA ¡AutomaMcally ¡Manages ¡Threads:  It ¡does ¡NOT ¡require ¡explicit ¡management ¡for ¡threads ¡in ¡the ¡convenMonal ¡ sense, ¡which ¡greatly ¡simplifies ¡the ¡programming ¡model. ¡ ¡  Stable, ¡available ¡(for ¡free), ¡documented ¡and ¡supported ¡for ¡ windows, ¡Linux ¡and ¡Mac ¡OS  Low ¡learning ¡curve:  Just ¡a ¡few ¡extensions ¡to ¡C ¡  No ¡knowledge ¡of ¡graphics ¡is ¡required ¡ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Cuda ¡(2.3) ¡Toolkit 10  NVCC ¡ ¡C ¡compiler  CUDA ¡FFT ¡and ¡BLAS ¡libraries ¡for ¡the ¡GPU  CUDA-‑gdb ¡hardware ¡debugger ¡  CUDA ¡Visual ¡Profiler ¡  CUDA ¡run.me ¡driver ¡(also ¡available ¡in ¡the ¡ standard ¡NVIDIA ¡GPU ¡driver)  CUDA ¡programming ¡manual 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

CUDA ¡in ¡FairRoot 11  FindCuda.cmake ¡ ¡( Abe ¡Stephens ¡SCI ¡InsMtute )  Integrate ¡CUDA ¡into ¡FairRoot ¡very ¡smoothly  CMake ¡create ¡shared ¡libraries ¡for ¡CUDA ¡part  FairCuda ¡is ¡a ¡class ¡which ¡wraps ¡CUDA ¡implemented ¡ ¡ funcMons ¡so ¡that ¡they ¡can ¡be ¡used ¡directly ¡from ¡ROOT ¡ CINT ¡or ¡compiled ¡code 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

ReconstrucMon ¡chain ¡ ¡ 12 ....... Hits Track Finder Track Track Fitter candidates Tracks Task CPU ....... 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

ReconstrucMon ¡chain ¡ ¡ 12 ....... Hits Track Finder Track Track Fitter candidates Tracks ....... 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

ReconstrucMon ¡chain ¡ ¡ 12 ....... Hits Track Finder Track Track Fitter candidates Tracks Task GPU ....... 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

CUDA ¡programming ¡model 13 • Kernel: One ¡kernel ¡is ¡executed ¡at ¡a ¡Mme ¡ • Kernel ¡launches ¡a ¡grid ¡of ¡thread ¡blocks ¡ • • Thread ¡block: A ¡batch ¡of ¡thread. ¡ • Threads ¡in ¡a ¡block ¡cooperate ¡together, ¡ • efficiently ¡share ¡data. Thread/block ¡have ¡unique ¡id • • Grid: A ¡batch ¡of ¡thread ¡blocks ¡that ¡execute ¡the ¡ • same ¡kernel. Threads ¡in ¡different ¡blocks ¡in ¡the ¡same ¡grid ¡ • cannot ¡directly ¡communicate ¡with ¡each ¡ other 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

CUDA ¡memory ¡model 14  There ¡is ¡6 ¡different ¡ memory ¡regions ¡ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Register ¡Memory 15  The ¡fastest ¡form ¡of ¡memory ¡ on ¡the ¡mul.-‑processor.  Is ¡only ¡accessible ¡by ¡the ¡ thread.  Has ¡the ¡life.me ¡of ¡the ¡ thread 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Shared ¡Memory 16  Can ¡be ¡as ¡fast ¡as ¡a ¡register ¡ when ¡there ¡are ¡no ¡bank ¡ conflicts ¡or ¡when ¡reading ¡ from ¡the ¡same ¡address.  Accessible ¡by ¡any ¡thread ¡of ¡ the ¡block ¡from ¡which ¡it ¡was ¡ created.  Has ¡the ¡life.me ¡of ¡the ¡block. 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Global ¡Memory 17  Poten.ally ¡150x ¡slower ¡than ¡ register ¡or ¡shared ¡memory ¡.  Accessible ¡from ¡either ¡the ¡ host ¡or ¡device.  Has ¡the ¡life.me ¡of ¡the ¡ applica.on. 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Local ¡Memory 18  Resides ¡in ¡global ¡memory ¡ and ¡can ¡be ¡150x ¡slower ¡than ¡ register ¡or ¡shared ¡memory  Is ¡only ¡accessible ¡by ¡the ¡ thread  Has ¡the ¡life.me ¡of ¡the ¡ thread. 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Constant ¡Memory 19  in DRAM  cached  per grid  read-only 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Texture ¡Memory 20  in DRAM  cached  per grid  read-only 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

21 Global, ¡local, ¡ texture, ¡and ¡ constant ¡memory ¡ are ¡physically ¡the ¡ same ¡memory. ¡ They ¡differ ¡only ¡in ¡ caching ¡algorithms ¡ ¡ and ¡access ¡models. ¡ CPU can refresh and access only: global, constant, and texture memory. 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Scalability ¡in ¡CUDA 22 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Comparisons ¡between ¡CPU ¡and ¡GPU ¡code! ¡ 23 Tracks Candidates Tracks 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Comparisons ¡between ¡CPU ¡and ¡GPU ¡code! ¡ 23 Tracks Candidates CPU Fitting Tracks 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Comparisons ¡between ¡CPU ¡and ¡GPU ¡code! ¡ 23 Tracks Candidates CPU Fitting Tracks Copy To Copy To GPU Copy To C Array GPU Fitting Host Using the GPUs include some overhead in data processing which has to be considered in the comparisons to CPU code 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Example ¡(Texture ¡Memory) U S I N G ¡ T E X T U R E ¡ M E M O RY ¡ ¡ FO R ¡ F I E L D ¡ M A P S Friday, March 26, 2010

Field ¡Maps 25 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Field ¡Maps 25  Usually ¡a ¡three ¡dimensional ¡array ¡(XYZ, ¡Rθ ϕ , etc ) 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Field ¡Maps 25  Usually ¡a ¡three ¡dimensional ¡array ¡(XYZ, ¡Rθ ϕ , etc )  Used ¡as ¡a ¡lookup ¡table ¡with ¡some ¡interpolaMon ¡ ¡ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Field ¡Maps 25  Usually ¡a ¡three ¡dimensional ¡array ¡(XYZ, ¡Rθ ϕ , etc )  Used ¡as ¡a ¡lookup ¡table ¡with ¡some ¡interpolaMon ¡ ¡  For ¡performance ¡and ¡mulM-‑access ¡issues, ¡many ¡people ¡try ¡ to ¡parameterize ¡it. 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Field ¡Maps 25  Usually ¡a ¡three ¡dimensional ¡array ¡(XYZ, ¡Rθ ϕ , etc )  Used ¡as ¡a ¡lookup ¡table ¡with ¡some ¡interpolaMon ¡ ¡  For ¡performance ¡and ¡mulM-‑access ¡issues, ¡many ¡people ¡try ¡ to ¡parameterize ¡it. Drawback:  Specific for certain maps  Hard to do with good accuracy  Not possible for all maps 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Texture ¡Memory ¡for ¡field ¡maps 26  Three ¡dimensional ¡arrays ¡can ¡be ¡bind ¡to ¡texture ¡directly  Accessible ¡from ¡all ¡threads ¡in ¡a ¡grid  Linear ¡interpolaMon ¡is ¡done ¡by ¡dedicated ¡hardware  Cashed ¡and ¡allow ¡mulMple ¡random ¡access 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Texture ¡Memory ¡for ¡field ¡maps 26  Three ¡dimensional ¡arrays ¡can ¡be ¡bind ¡to ¡texture ¡directly  Accessible ¡from ¡all ¡threads ¡in ¡a ¡grid  Linear ¡interpolaMon ¡is ¡done ¡by ¡dedicated ¡hardware  Cashed ¡and ¡allow ¡mulMple ¡random ¡access Ideal ¡for ¡field ¡maps! 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Using ¡Texture ¡Memory 27  Host ¡(CPU) ¡code:  Allocate/obtain ¡memory ¡(global ¡linear/pitch ¡linear, ¡or ¡CUDA ¡array)  Create ¡a ¡texture ¡reference ¡object ¡(Currently ¡must ¡be ¡at ¡file-‑scope)  Bind ¡the ¡texture ¡reference ¡to ¡memory/array  When ¡done: ¡Unbind ¡the ¡texture ¡reference, ¡free ¡resources  Device ¡(kernel) ¡code:  Fetch ¡using ¡texture ¡reference ¡  Linear ¡memory ¡textures: ¡tex1Dfetch() ¡  Array ¡textures: ¡tex1D() ¡or ¡tex2D() ¡or ¡tex3D() ¡  Pitch ¡linear ¡textures: ¡tex2D() 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Texture ¡Filtering 28  CudaFilterModePoint: ¡ ¡The ¡returned ¡value ¡is ¡the ¡texel ¡ (Texture ¡Element) ¡whose ¡texture ¡coordinates ¡are ¡the ¡ closest ¡to ¡the ¡input ¡texture ¡coordinates; ¡  CudaFilterModeLinear: ¡ ¡The ¡returned ¡value ¡is ¡the ¡ linear ¡interpolaMon ¡of ¡the ¡two ¡(for ¡a ¡one-‑dimensional ¡ texture), ¡four ¡(for ¡a ¡two-‑dimensional ¡texture), ¡or ¡eight ¡ (for ¡a ¡three-‑dimensional ¡texture) ¡texels ¡whose ¡texture ¡ coordinates ¡are ¡the ¡closest ¡to ¡the ¡input ¡texture ¡ coordinates 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Texture ¡Address ¡Mode 29  How ¡out-‑of-‑range ¡texture ¡coordinates ¡are ¡handled; ¡  Clamp: ¡ ¡ ¡Out-‑of-‑range ¡texture ¡coordinates ¡are ¡clamped ¡to ¡the ¡valid ¡ range. ¡(Values ¡below ¡0 ¡are ¡set ¡to ¡0 ¡and ¡values ¡greater ¡or ¡equal ¡to ¡N ¡ are ¡set ¡to ¡N-‑1)  Wrap: ¡ ¡Out-‑of-‑ ¡range ¡texture ¡coordinates ¡are ¡wrapped ¡to ¡the ¡valid ¡ range ¡(only ¡for ¡normalized ¡coordinates). ¡Wrap ¡addressing ¡is ¡usually ¡ used ¡when ¡the ¡texture ¡contains ¡a ¡periodic ¡signal. ¡It ¡uses ¡only ¡the ¡ fracMonal ¡part ¡of ¡the ¡texture ¡coordinate; ¡for ¡example, ¡1.25 ¡is ¡treated ¡ the ¡same ¡as ¡0.25 ¡and ¡-‑1.25 ¡is ¡treated ¡the ¡same ¡as ¡0.75 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Runge-‑KuIa ¡propagator 30  The ¡Geant3 ¡Runge-‑KuIa ¡propagator ¡was ¡re-‑wriIen ¡inside ¡ ¡ a ¡cuda ¡kernel  ¡Runge-‑KuIa ¡method ¡for ¡tracking ¡a ¡parMcle ¡through ¡a ¡magneMc ¡ ¡field. ¡ Uses ¡Nystroem ¡algorithm ¡(See ¡Handbook ¡Nat. ¡Bur. ¡Of ¡ ¡ ¡Standards, ¡ procedure ¡25.5.20)  The ¡algorithm ¡it ¡self ¡is ¡hardly ¡parallelizable, ¡but ¡one ¡can ¡ propagate ¡all ¡tracks ¡in ¡an ¡event ¡in ¡parallel  For ¡each ¡track, ¡a ¡block ¡of ¡8 ¡threads ¡is ¡created, ¡the ¡parMcle ¡ data ¡is ¡copied ¡by ¡all ¡threads ¡at ¡once, ¡then ¡one ¡thread ¡do ¡ the ¡propagaMon 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Using ¡GPUs ¡in ¡HADES 31  Field ¡Map ¡is ¡converted ¡to ¡XYZ ¡map ¡  Event ¡where ¡generated ¡with ¡0.2-‑.0.8 ¡GeV ¡(protons)  Tracks ¡are ¡propagated ¡from ¡the ¡first ¡layer ¡in ¡the ¡MDC1 ¡to ¡ the ¡sixth ¡layer ¡in ¡MDC4 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Hades ¡Magnet Friday, March 26, 2010

HADES Field Map 33  HADES field map is saved as half sector in cylindrical coordinates  Each call to the map include conversion from Cartesian to Cylindrical coordinates  For simulation and tracking this is an overhead! which can be removed by transforming the map ones to cartesian coordinates  Drawback:  Size of the map is then 3 times larger (about 75 MB for one sector in xyz) 6/17/09 M.Al-Turany, Panda CM, Turin Friday, March 26, 2010

ρθZ ¡and ¡XYZ ¡MAPS ¡for ¡Hades ¡(in ¡kG) 34 Bz(r) RTZ Bz(r) XYZ Bx(r) RTZ Bx(r) RTZ Bx(r) XYZ Bx(r) XYZ By(r) RTZ Diff Bx(r) By(r) XYZ Diff By(r) Diff Bx(r) Friday, March 26, 2010

ρθZ ¡and ¡XYZ ¡MAPS ¡for ¡Hades ¡(in ¡kG) 34 Bz(r) RTZ Bz(r) XYZ Bx(r) RTZ Bx(r) RTZ Bx(r) XYZ Integrals over the Path (T/m): Bx(r) XYZ Bx By Bz RTZ -0.0247085 0.0259996 -0.0012521 XYZ -0.0247093 0.0260001 -0.0012534 Diff 8E-7 5E-7 12E-7 By(r) RTZ Diff Bx(r) By(r) XYZ Diff By(r) Diff Bx(r) Friday, March 26, 2010

Hades ¡Detector 35 35 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Timing ¡ 36 36 ¡ ¡ ¡unsigned ¡int ¡Mmer ¡= ¡0; ¡ ¡ ¡cuMlCheckError(cutCreateTimer(&Mmer)); ¡ ¡ ¡cuMlCheckError(cutStartTimer(Mmer)); ¡ ¡ ¡cudaMalloc((void**)&d_vecRKIn, ¡bytes*8); ¡ ¡ ¡ ¡ ¡cuMlSafeCall(cudaHostGetDevicePointer((void ¡**)&d_vecRKOut,(void ¡*)vecRKOut,0)); ¡ ¡ ¡checkCUDAError("Device ¡Pointers"); ¡ ¡ ¡cudaMemcpy ¡(d_vecRKIn, ¡ ¡ ¡ ¡vecRKIn, ¡ ¡ ¡bytes*8, ¡ ¡cudaMemcpyHostToDevice); ¡ ¡ ¡ ¡int ¡threads=8 ¡; ¡int ¡tracks=TRK; ¡ ¡ ¡dim3 ¡dimBlock(threads, ¡1); ¡ ¡ ¡dim3 ¡dimGrid(tracks,1); ¡ ¡ ¡ ¡PropagateToPlane<<< ¡dimGrid, ¡dimBlock ¡>>>(d_vecRKIn,d_vecRKOut); ¡ ¡ ¡cudaThreadSynchronize(); ¡ ¡ ¡cuMlCheckError(cutStopTimer(Mmer)); ¡ ¡ ¡prinz("Processing ¡Mme ¡on ¡GPU ¡: ¡%f ¡(ms) ¡\n", ¡cutGetTimerValue(Mmer)); ¡ ¡ ¡cuMlCheckError(cutDeleteTimer(Mmer)); 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Timing ¡ 36 36 Allocate Memory ¡ ¡ ¡unsigned ¡int ¡Mmer ¡= ¡0; on Card ¡ ¡ ¡cuMlCheckError(cutCreateTimer(&Mmer)); ¡ ¡ ¡cuMlCheckError(cutStartTimer(Mmer)); ¡ ¡ ¡cudaMalloc((void**)&d_vecRKIn, ¡bytes*8); ¡ ¡ ¡ ¡ ¡cuMlSafeCall(cudaHostGetDevicePointer((void ¡**)&d_vecRKOut,(void ¡*)vecRKOut,0)); ¡ ¡ ¡checkCUDAError("Device ¡Pointers"); ¡ ¡ ¡cudaMemcpy ¡(d_vecRKIn, ¡ ¡ ¡ ¡vecRKIn, ¡ ¡ ¡bytes*8, ¡ ¡cudaMemcpyHostToDevice); ¡ ¡ ¡ ¡int ¡threads=8 ¡; ¡int ¡tracks=TRK; ¡ ¡ ¡dim3 ¡dimBlock(threads, ¡1); ¡ ¡ ¡dim3 ¡dimGrid(tracks,1); ¡ ¡ ¡ ¡PropagateToPlane<<< ¡dimGrid, ¡dimBlock ¡>>>(d_vecRKIn,d_vecRKOut); ¡ ¡ ¡cudaThreadSynchronize(); ¡ ¡ ¡cuMlCheckError(cutStopTimer(Mmer)); ¡ ¡ ¡prinz("Processing ¡Mme ¡on ¡GPU ¡: ¡%f ¡(ms) ¡\n", ¡cutGetTimerValue(Mmer)); ¡ ¡ ¡cuMlCheckError(cutDeleteTimer(Mmer)); 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Get pointer to Timing ¡ pinned memory for output 36 36 ¡ ¡ ¡unsigned ¡int ¡Mmer ¡= ¡0; ¡ ¡ ¡cuMlCheckError(cutCreateTimer(&Mmer)); ¡ ¡ ¡cuMlCheckError(cutStartTimer(Mmer)); ¡ ¡ ¡cudaMalloc((void**)&d_vecRKIn, ¡bytes*8); ¡ ¡ ¡ ¡ ¡cuMlSafeCall(cudaHostGetDevicePointer((void ¡**)&d_vecRKOut,(void ¡*)vecRKOut,0)); ¡ ¡ ¡checkCUDAError("Device ¡Pointers"); ¡ ¡ ¡cudaMemcpy ¡(d_vecRKIn, ¡ ¡ ¡ ¡vecRKIn, ¡ ¡ ¡bytes*8, ¡ ¡cudaMemcpyHostToDevice); ¡ ¡ ¡ ¡int ¡threads=8 ¡; ¡int ¡tracks=TRK; ¡ ¡ ¡dim3 ¡dimBlock(threads, ¡1); ¡ ¡ ¡dim3 ¡dimGrid(tracks,1); ¡ ¡ ¡ ¡PropagateToPlane<<< ¡dimGrid, ¡dimBlock ¡>>>(d_vecRKIn,d_vecRKOut); ¡ ¡ ¡cudaThreadSynchronize(); ¡ ¡ ¡cuMlCheckError(cutStopTimer(Mmer)); ¡ ¡ ¡prinz("Processing ¡Mme ¡on ¡GPU ¡: ¡%f ¡(ms) ¡\n", ¡cutGetTimerValue(Mmer)); ¡ ¡ ¡cuMlCheckError(cutDeleteTimer(Mmer)); 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Timing ¡ 36 36 ¡ ¡ ¡unsigned ¡int ¡Mmer ¡= ¡0; ¡ ¡ ¡cuMlCheckError(cutCreateTimer(&Mmer)); copy data to the ¡ ¡ ¡cuMlCheckError(cutStartTimer(Mmer)); card memory ¡ ¡ ¡cudaMalloc((void**)&d_vecRKIn, ¡bytes*8); ¡ ¡ ¡ ¡ ¡cuMlSafeCall(cudaHostGetDevicePointer((void ¡**)&d_vecRKOut,(void ¡*)vecRKOut,0)); ¡ ¡ ¡checkCUDAError("Device ¡Pointers"); ¡ ¡ ¡cudaMemcpy ¡(d_vecRKIn, ¡ ¡ ¡ ¡vecRKIn, ¡ ¡ ¡bytes*8, ¡ ¡cudaMemcpyHostToDevice); ¡ ¡ ¡ ¡int ¡threads=8 ¡; ¡int ¡tracks=TRK; ¡ ¡ ¡dim3 ¡dimBlock(threads, ¡1); ¡ ¡ ¡dim3 ¡dimGrid(tracks,1); ¡ ¡ ¡ ¡PropagateToPlane<<< ¡dimGrid, ¡dimBlock ¡>>>(d_vecRKIn,d_vecRKOut); ¡ ¡ ¡cudaThreadSynchronize(); ¡ ¡ ¡cuMlCheckError(cutStopTimer(Mmer)); ¡ ¡ ¡prinz("Processing ¡Mme ¡on ¡GPU ¡: ¡%f ¡(ms) ¡\n", ¡cutGetTimerValue(Mmer)); ¡ ¡ ¡cuMlCheckError(cutDeleteTimer(Mmer)); 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Timing ¡ 36 36 ¡ ¡ ¡unsigned ¡int ¡Mmer ¡= ¡0; ¡ ¡ ¡cuMlCheckError(cutCreateTimer(&Mmer)); ¡ ¡ ¡cuMlCheckError(cutStartTimer(Mmer)); ¡ ¡ ¡cudaMalloc((void**)&d_vecRKIn, ¡bytes*8); ¡ ¡ ¡ ¡ ¡cuMlSafeCall(cudaHostGetDevicePointer((void ¡**)&d_vecRKOut,(void ¡*)vecRKOut,0)); Launch the ¡ ¡ ¡checkCUDAError("Device ¡Pointers"); Kernel ¡ ¡ ¡cudaMemcpy ¡(d_vecRKIn, ¡ ¡ ¡ ¡vecRKIn, ¡ ¡ ¡bytes*8, ¡ ¡cudaMemcpyHostToDevice); ¡ ¡ ¡ ¡int ¡threads=8 ¡; ¡int ¡tracks=TRK; ¡ ¡ ¡dim3 ¡dimBlock(threads, ¡1); ¡ ¡ ¡dim3 ¡dimGrid(tracks,1); ¡ ¡ ¡ ¡PropagateToPlane<<< ¡dimGrid, ¡dimBlock ¡>>>(d_vecRKIn,d_vecRKOut); ¡ ¡ ¡cudaThreadSynchronize(); ¡ ¡ ¡cuMlCheckError(cutStopTimer(Mmer)); ¡ ¡ ¡prinz("Processing ¡Mme ¡on ¡GPU ¡: ¡%f ¡(ms) ¡\n", ¡cutGetTimerValue(Mmer)); ¡ ¡ ¡cuMlCheckError(cutDeleteTimer(Mmer)); 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Timing ¡ 36 36 ¡ ¡ ¡unsigned ¡int ¡Mmer ¡= ¡0; ¡ ¡ ¡cuMlCheckError(cutCreateTimer(&Mmer)); ¡ ¡ ¡cuMlCheckError(cutStartTimer(Mmer)); ¡ ¡ ¡cudaMalloc((void**)&d_vecRKIn, ¡bytes*8); ¡ ¡ ¡ ¡ ¡cuMlSafeCall(cudaHostGetDevicePointer((void ¡**)&d_vecRKOut,(void ¡*)vecRKOut,0)); ¡ ¡ ¡checkCUDAError("Device ¡Pointers"); ¡ ¡ ¡cudaMemcpy ¡(d_vecRKIn, ¡ ¡ ¡ ¡vecRKIn, ¡ ¡ ¡bytes*8, ¡ ¡cudaMemcpyHostToDevice); ¡ ¡ ¡ ¡int ¡threads=8 ¡; ¡int ¡tracks=TRK; ¡ ¡ ¡dim3 ¡dimBlock(threads, ¡1); ¡ ¡ ¡dim3 ¡dimGrid(tracks,1); ¡ ¡ ¡ ¡PropagateToPlane<<< ¡dimGrid, ¡dimBlock ¡>>>(d_vecRKIn,d_vecRKOut); ¡ ¡ ¡cudaThreadSynchronize(); ¡ ¡ ¡cuMlCheckError(cutStopTimer(Mmer)); ¡ ¡ ¡prinz("Processing ¡Mme ¡on ¡GPU ¡: ¡%f ¡(ms) ¡\n", ¡cutGetTimerValue(Mmer)); ¡ ¡ ¡cuMlCheckError(cutDeleteTimer(Mmer)); 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

37 Track ¡Propaga+on ¡ Trk/Event CPU GPU Tesla (Time ¡per ¡event) emu C1060 (240) In ¡HADES ¡case ¡the ¡ number ¡of ¡Tracks ¡ 10 1.0 0.35 0.09 here ¡should ¡be ¡taken ¡ 50 2.8 1.54 0.18 as ¡the ¡number ¡of ¡ 100 5.2 2.97 0.35 propagaMons ¡per ¡ 200 10.0 6.15 0.42 events 500 22.6 16.7 0.66 700 30.3 22.4 0.74 (In ¡HADES ¡fiQng ¡each ¡Track ¡is ¡propagated ¡6 ¡ Mmes ¡for ¡each ¡iteraMon ¡in ¡the ¡fit) 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Track ¡PropagaMon ¡ ( µs/propagation) 38 38 Trk/Event CPU GPU Tesla emu C1060 (240) 10 100 35 9.0 50 56 31 3.6 100 52 30 3.5 200 50 31 2.0 500 45 33 1.3 700 43 32 1.1 Time in µs needed to propagate one track from MDC1 layer1 to MDC 4 layer 6 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Speedup ¡factor ¡ 39 39 Trk/Event GPU Tesla emu 11 10 ¡ 2.9 50 1.9 15 100 1.8 15 200 1.6 24 500 1.4 34 700 1.4 41 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Possible (?) next steps for HADES 40  Optimize the block size (number of threads)  Integrate CUDA in HADES building system (Makefiles) OR, move HADES to CMAKE!  CUDA is integrated  Test system, with nightly and continues builds and Dashboard (see http://fairroot.gsi.de/CDash/ )  Re-write the fitting in CUDA 6/17/09 M.Al-Turany, Panda CM, Turin Friday, March 26, 2010

Example ¡(Zero ¡Copy) U S I N G ¡ T H E ¡ P I N N E D ¡ ( PAG E D -‑ LO C K E D ) ¡ M E M O RY ¡ TO ¡ M A K E ¡ T H E ¡ DATA ¡ AVA I L A B L E ¡ TO ¡ T H E ¡ G P U Friday, March 26, 2010

¡Zero ¡Copy 42  Zero ¡copy ¡was ¡introduced ¡in ¡CUDA ¡Toolkit ¡2.2 ¡  It ¡enables ¡GPU ¡threads ¡to ¡directly ¡access ¡host ¡memory, ¡ and ¡it ¡requires ¡mapped ¡pinned ¡(non-‑pageable) ¡memory  Zero ¡copy ¡can ¡be ¡used ¡in ¡place ¡of ¡streams ¡because ¡kernel-‑ originated ¡data ¡transfers ¡automaMcally ¡overlap ¡kernel ¡ execuMon ¡without ¡the ¡overhead ¡of ¡seQng ¡up ¡and ¡ determining ¡the ¡opMmal ¡number ¡of ¡streams 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Pinned ¡Memory 43  On ¡discrete ¡GPUs, ¡mapped ¡pinned ¡memory ¡is ¡ advantageous ¡only ¡in ¡certain ¡cases. ¡Because ¡the ¡data ¡is ¡ not ¡cached ¡on ¡the ¡GPU, ¡mapped ¡pinned ¡memory ¡should ¡ be ¡read ¡or ¡wriIen ¡only ¡once, ¡and ¡the ¡global ¡loads ¡and ¡ stores ¡that ¡read ¡and ¡write ¡the ¡memory ¡should ¡be ¡ coalesced. ¡  On ¡integrated ¡GPUs, ¡mapped ¡pinned ¡memory ¡is ¡always ¡a ¡ performance ¡gain ¡because ¡it ¡avoids ¡superfluous ¡copies ¡as ¡ integrated ¡GPU ¡and ¡CPU ¡memory ¡are ¡physically ¡the ¡same. ¡ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Track ¡+ ¡vertex ¡fiQng ¡on ¡CPU ¡and ¡GPU 44 CPU Time/GPU Time 30.00 GPU GPU Zero Copy Track/Event 50 100 1000 2000 22.50 GPU ¡ 3.0 4.2 18 18 15.00 GPU ¡(Zero ¡Copy) 15 13 22 20 7.50 0 50 100 1000 2000 Time needed per event (ms) 50 100 1000 2000 CPU 3.0 5.0 120 220 GPU ¡ 1.0 1.2 6.5 12.5 GPU ¡(Zero ¡Copy) 0.2 0.4 5.4 10.5 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Resource ¡usage ¡in ¡this ¡Test 45 Qaudro ¡NVS ¡ ¡290 ¡ GeForce GeForce Tesla ¡C1060 8400 ¡GT 8800 ¡GT Warps/MulMprocessor 24 24 24 32 Occupancy 33% 33% 33% 25% AcMve ¡Threads 128 256 896 1920 ¡ ¡Limited ¡by ¡Max ¡Warps ¡/ 8 8 8 8 ¡ ¡MulMprocessor 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Resource ¡usage ¡in ¡this ¡Test 45 Qaudro ¡NVS ¡ ¡290 ¡ GeForce GeForce Tesla ¡C1060 8400 ¡GT 8800 ¡GT Warps/MulMprocessor 24 24 24 32 Occupancy 33% 33% 33% 25% AcMve ¡Threads 128 256 896 1920 ¡ ¡Limited ¡by ¡Max ¡Warps ¡/ 8 8 8 8 ¡ ¡MulMprocessor AcMve ¡threads ¡= ¡Warps ¡x ¡32 ¡x ¡ mulMprocessor ¡x ¡occupancy ¡ AcMve ¡threads ¡in ¡Tesla ¡= ¡ 8x32x30x0.25 ¡= 1920 ¡ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Parallelization on CPU/GPU (PANDA track fitting) 46 6/17/09 M.Al-Turany, Panda CM, Turin Friday, March 26, 2010

Parallelization on CPU/GPU (PANDA track fitting) 46 No. of Track/ 50 2000 Process Event (Float) (Float) 1 CPU 1.7 E4 Track/s 9.1 E2 Track/s 1 CPU + GPU (T (Tesla) 5.0 E4 Track/s 6.3 E5 Track/s 4 CPU + GPU (T (Tesla) 1.2 E5 Track/s 2.2 E6 Track/s 6/17/09 M.Al-Turany, Panda CM, Turin Friday, March 26, 2010

Summary ¡ 47 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Summary ¡ 47  Cuda ¡is ¡an ¡easy ¡to ¡learn ¡and ¡to ¡use ¡tool. 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Summary ¡ 47  Cuda ¡is ¡an ¡easy ¡to ¡learn ¡and ¡to ¡use ¡tool.  Cuda ¡allows ¡heterogeneous ¡programming. 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Summary ¡ 47  Cuda ¡is ¡an ¡easy ¡to ¡learn ¡and ¡to ¡use ¡tool.  Cuda ¡allows ¡heterogeneous ¡programming.  Depending ¡on ¡the ¡use ¡case ¡one ¡can ¡win ¡factors ¡in ¡performance ¡ compared ¡to ¡CPU 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Summary ¡ 47  Cuda ¡is ¡an ¡easy ¡to ¡learn ¡and ¡to ¡use ¡tool.  Cuda ¡allows ¡heterogeneous ¡programming.  Depending ¡on ¡the ¡use ¡case ¡one ¡can ¡win ¡factors ¡in ¡performance ¡ compared ¡to ¡CPU  Texture ¡memory ¡can ¡be ¡used ¡to ¡solve ¡problems ¡that ¡require ¡ lookup ¡tables ¡effecMvely 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Summary ¡ 47  Cuda ¡is ¡an ¡easy ¡to ¡learn ¡and ¡to ¡use ¡tool.  Cuda ¡allows ¡heterogeneous ¡programming.  Depending ¡on ¡the ¡use ¡case ¡one ¡can ¡win ¡factors ¡in ¡performance ¡ compared ¡to ¡CPU  Texture ¡memory ¡can ¡be ¡used ¡to ¡solve ¡problems ¡that ¡require ¡ lookup ¡tables ¡effecMvely  Pinned ¡Memory ¡simplify ¡some ¡problems, ¡gives ¡also ¡beIer ¡ performance. ¡ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Summary ¡ 47  Cuda ¡is ¡an ¡easy ¡to ¡learn ¡and ¡to ¡use ¡tool.  Cuda ¡allows ¡heterogeneous ¡programming.  Depending ¡on ¡the ¡use ¡case ¡one ¡can ¡win ¡factors ¡in ¡performance ¡ compared ¡to ¡CPU  Texture ¡memory ¡can ¡be ¡used ¡to ¡solve ¡problems ¡that ¡require ¡ lookup ¡tables ¡effecMvely  Pinned ¡Memory ¡simplify ¡some ¡problems, ¡gives ¡also ¡beIer ¡ performance. ¡  The ¡results ¡for ¡HADES ¡are ¡preliminary, ¡and ¡sMll ¡a ¡lot ¡of ¡space ¡for ¡ improvement! 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Backup ¡Slides Friday, March 26, 2010

NVIDIA’s ¡Next ¡Genera.on ¡CUDA ¡ Architecture FERMI ¡ Friday, March 26, 2010

Features: Support ¡a ¡true ¡cache ¡ hierarchy ¡in ¡combinaMon ¡ with ¡on-‑chip ¡shared ¡ memory ¡ Improves ¡bandwidth ¡and ¡ reduces ¡latency ¡through ¡ L1 ¡cache’s ¡configurable ¡ shared ¡memory Fast, ¡coherent ¡data ¡ sharing ¡across ¡the ¡GPU ¡ through ¡unified ¡L2 ¡cache Fermi Tesla http://www.behardware.com/art/imprimer/772/ 26.03.2010 Mohammad Al-Turany, Hades Meeting 50 Friday, March 26, 2010

NVIDIA ¡GigaThread™ ¡ Engine ¡ Increased ¡efficiency ¡with ¡ concurrent ¡kernel ¡execuMon Dedicated, ¡bi-‑direcMonal ¡ data ¡transfer ¡engines Intelligently ¡manage ¡tens ¡of ¡ thousands ¡of ¡threads http://www.behardware.com/art/imprimer/772/ 26.03.2010 Mohammad Al-Turany, Hades Meeting 51 Friday, March 26, 2010

ECC ¡Support 52  First ¡GPU ¡architecture ¡to ¡support ¡ECC  Detects ¡and ¡corrects ¡errors ¡before ¡system ¡is ¡affected  Protects ¡register ¡files, ¡shared ¡memories, ¡L1 ¡and ¡L2 ¡cache, ¡ and ¡DRAM 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Unified ¡address ¡space 53 Groups local, shared and global memory in the same address space. This unified address space means support for pointers and object references that are necessary for high-level languages such as C++. http://www.behardware.com/art/imprimer/772/ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

CPU ¡vs ¡GPU ¡code 54 __shared__ float4 field; Double_t h2, h4, f[4]; float h2, h4, f[4]; Double_t xyzt[3], a, b, c, ph,ph2; float xyzt[3], a, b, c, ph,ph2; Double_t secxs[4],secys[4],seczs[4],hxp[3]; float secxs[4],secys[4],seczs[4],hxp[3]; Double_t g1, g2, g3, g4, g5, g6, ang2, dxt, float g1, g2, g3, g4, g5, g6, ang2, dxt, dyt, dzt; dyt, dzt; Double_t est, at, bt, ct, cba; float est, at, bt, ct, cba; Double_t f1, f2, f3, f4, rho, tet, hnorm, hp, float f1, f2, f3, f4, rho, tet, hnorm, hp, rho1, sint, cost; rho1, sint, cost; Double_t x; float x; Double_t y; float y; Double_t z; float z; Double_t xt; float xt; Double_t yt; float yt; Double_t zt; float zt; Double_t maxit = 10; float maxit= 10; Double_t maxcut = 11; float maxcut= 11; const Double_t hmin = 1e-4; float hmin = 1e-4; const Double_t kdlt = 1e-3; float kdlt = 1e-3; const Double_t kdlt32 = kdlt/32.; float kdlt32 = kdlt/32.; const Double_t kthird = 1./3.; float kthird = 1./3.; …… …. 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

CPU ¡vs ¡GPU ¡code 55 ¡do ¡{ ¡ ¡ ¡ ¡ ¡do ¡{ ¡ ¡ ¡ ¡ rest ¡ ¡= ¡step ¡-‑ ¡tl; ¡ ¡ ¡ ¡ rest ¡ ¡= ¡step ¡-‑ ¡tl; ¡ ¡ ¡ ¡ if ¡(TMath::Abs(h) ¡> ¡TMath::Abs(rest)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ if ¡(fabs(h) ¡> ¡fabs(rest)) ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡h ¡= ¡rest; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡h ¡= ¡rest; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ fMagField-‑>GetFieldValue( ¡vout, ¡f); ¡ ¡ ¡ ¡ field=GetField(vout[0],vout[1],vout[2]); ¡ ¡ ¡ ¡ ¡f[0] ¡= ¡-‑1.0*f[0]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡f[0] ¡= ¡-‑field.x; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡f[1] ¡= ¡-‑1.0*f[1]; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡f[1] ¡= ¡-‑field.y; ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡ ¡f[2] ¡= ¡-‑1.0*f[2]; ¡ ¡ ¡ ¡ ¡f[2] ¡= ¡-‑field.z; ……….. ……….. if ¡(step ¡< ¡0.) ¡rest ¡= ¡-‑rest; ¡ ¡ ¡ ¡ ¡if ¡(step ¡< ¡0.) ¡rest ¡= ¡-‑rest; ¡ ¡ ¡ if ¡(rest ¡< ¡1.e-‑5*TMath::Abs(step)) ¡return; ¡ ¡ ¡if ¡(rest ¡< ¡1.e-‑5*fabs(step)) ¡return; ¡ ¡ } ¡while(1); } ¡while(1); 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Panda ¡Detector 56 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Magnet ¡and ¡Field 57 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Field ¡Map 58 ¡Field ¡map ¡grid ¡: ¡Bx, ¡By, ¡Bz  ¡x ¡= ¡0.0 ¡to ¡ ¡ ¡ ¡ ¡ ¡ ¡158 ¡ ¡ ¡ ¡cm, ¡ ¡ ¡80 ¡ ¡ ¡ ¡points, ¡ ¡Δx ¡= ¡2.0 ¡cm  ¡y ¡= ¡0.0 ¡to ¡ ¡ ¡ ¡ ¡ ¡ ¡51 ¡ ¡ ¡ ¡ ¡ ¡cm, ¡ ¡ ¡52 ¡ ¡ ¡ ¡ ¡points, ¡ ¡Δy ¡= ¡1.0 ¡cm  ¡z ¡= ¡342.0 ¡to ¡ ¡ ¡602 ¡ ¡ ¡cm, ¡ ¡131 ¡ ¡ ¡ ¡points, ¡ ¡Δz ¡= ¡2.0 ¡cm 4-‑fold ¡ ¡symmetry ¡ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Cards ¡used ¡in ¡this ¡Test 59 Qaudro ¡NVS ¡ ¡ GeForce GeForce Tesla ¡C1060 290 ¡ 8400 ¡GT 8800 ¡GT CUDA ¡ ¡cores 16 ¡(2 ¡x ¡8) 32 ¡(4 ¡x ¡8) 112 ¡(14 ¡x ¡8) 240 ¡(30 ¡x ¡8) Memory ¡(MB) ¡256 128 ¡ 512 ¡ ¡4000 Frequency ¡of ¡processor ¡cores ¡ 0.92 0.94 ¡ 1.5 1.3 (GHz) Compute ¡capability ¡ 1.1 1.1 1.1 1.3 Warps/MulMprocessor 24 24 24 32 Max. ¡No. ¡of ¡threads 1536 3072 10752 30720 Max ¡Power ¡ConsumpMon ¡(W) 21 ¡ 71 ¡ 105 ¡ 200 ¡ 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Features ¡available ¡only ¡in ¡1.3 ¡compu.ng ¡capabili.es ¡ 60  Support ¡for ¡atomic ¡funcMons ¡operaMng ¡in ¡shared ¡memory ¡ and ¡on ¡64-‑bit ¡words ¡in ¡global ¡memory ¡(for ¡1.1 ¡only ¡32-‑bit ¡ words)  Support ¡for ¡warp ¡vote ¡funcMons  The ¡number ¡of ¡registers ¡per ¡mulMprocessor ¡is ¡16384 ¡(8192 ¡ in ¡1.1)  The ¡maximum ¡number ¡of ¡acMve ¡warps ¡per ¡mulMprocessor ¡is ¡ 32 ¡(24 ¡in ¡1.1)  The ¡maximum ¡number ¡of ¡acMve ¡threads ¡per ¡mulMprocessor ¡ is ¡1024 ¡(768 ¡in ¡1.1)  Support ¡for ¡double-‑precision ¡floaMng-‑point ¡numbers 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Track ¡PropagaMon ¡(Mme ¡per ¡event) 61 Trk/ CPU GPU Quadro GeForce GeForce Tesla Event emu NVS ¡290 8400GT 8800 ¡GT C1060 (16) (32) (112) (240) 10 ¡ 2.4 1.9 0.9 0.8 0.7 0.4 50 11 7 2.5 1.8 1.0 0.4 100 21 16 4.4 2.9 1.7 0.5 200 42 25 8.9 5.6 2.9 0.9 500 104 86 23 13.2 5.6 1.3 1000 210 177 42 25.7 10.1 1.9 2000 412 356 82 52.2 19.5 3.0 5000 1054 886 200 125 50.0 6.0 Time in ms needed to propagate all tracks in event 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Track ¡PropagaMon ¡(Mme ¡per ¡track) 62 Trk/ CPU GPU Quadro GeForce GeForce Tesla Event emu NVS ¡290 8400GT 8800 ¡GT C1060 (16) (32) (112) (240) 10 ¡ 240 190 90 80 70 40 50 220 140 50 36 20 8.0 100 210 160 44 29 17 5.0 200 210 125 45 28 15 4.3 500 208 172 46 26 11 2.6 1000 210 177 42 26 10 1.9 2000 206 178 41 26 10 1.5 5000 211 177 40 25 10 1.2 Time in µs needed to propagate one track 1.5 m in a dipole field 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010

Track Reconstruc.on on GPUs MOHAMMAD AL-TURANY GSI Darmstadt - PowerPoint PPT Presentation

Track Reconstruc.on on GPUs MOHAMMAD AL-TURANY GSI Darmstadt Friday, March 26, 2010 Outline 2 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010 Outline 2 Few words about GPU

LAr TPC reconstruc/on and systema/cs for SBN* Andy Furmanski

Why use GPUs for graph processing? FOSDEM 2020 2 GPUs and Graphs Graphs GPUs Found

Status on positron fraction Multi-track event CC fitted Multi-track event 1 track Multi-Track

Session B.3 Spectrum Efficient Technologies Track Chair: Mr. Tom Young Track Chair Track

Track Filtering/Quality/Merging A proposal for data format of track quality and track merging in

Track & Field Parent Meeting BT Track & Field Mission Bartram Trail Track & Field

Track fitting, vertex fitting and Track fitting, vertex fitting and Track fitting, vertex fitting

Mul$scale Molecular Systems Biology: Reconstruc$on and Model

Reconstruc*on of the MOS 6502 on the Cyclone II

Frac%ona%on, rearrangement, consolida%on, reconstruc%on, &

ADC simula+on tools ProtoDUNE Simula+on and Reconstruc+on David

Dense Planar-Iner,al SLAM for Large Indoor 3D Reconstruc,on

Robust reconstruc,on of flow in MRI C. Bertoglio, H. Carrillo, J. Garay, D. Nolte, A. Osses,

DataPrep Status DUNE FD simula@on and reconstruc@on David

Recent Advances in Adap.ve Sampling and Reconstruc.on for Monte

Scott Le Grand Some Things Never Change (GPUs vs the World) How Best to Exploit GPUs

Declaring Array Variables Chapter 6 Arrays datatype[] arrayRefVar; Preferred! Example:

Principles of Programming Languages h"p://www.di.unipi.it/~andrea/Dida2ca/PLP-16/ Prof.

CPE 335 CPE 335 Computer Organization MIPS ISA Dr. Iyad Jafar Adapted from Dr. Gheith Abandah

CSC 4181 Compiler Construction Overview of Compilers Part 1 Introduction 1 1 What is a

Java: Learning to Program with Robots Chapter 10: Arrays Chapter Objectives After studying this

COMP 1402 Winter 2008 Tutorial #5 Arrays Overview of Tutorial #5 What is an array?

Neural Networks (Perceptrons) A mathematical perspective Textbook reference: Sections 11.1-11.2

Applied Cryptography Lecture 1 Applied Cryptography Lecture 1 Our first encounter with secrecy:

Track Reconstruc.on on GPUs MOHAMMAD AL-TURANY GSI Darmstadt - PowerPoint PPT Presentation

Track Reconstruc.on on GPUs MOHAMMAD AL-TURANY GSI Darmstadt Friday, March 26, 2010 Outline 2 26.03.2010 Mohammad Al-Turany, Hades Meeting Friday, March 26, 2010 Outline 2 Few words about GPU

LAr TPC reconstruc/on and systema/cs for SBN* Andy Furmanski

Why use GPUs for graph processing? FOSDEM 2020 2 GPUs and Graphs Graphs GPUs Found

Status on positron fraction Multi-track event CC fitted Multi-track event 1 track Multi-Track

Session B.3 Spectrum Efficient Technologies Track Chair: Mr. Tom Young Track Chair Track

Track Filtering/Quality/Merging A proposal for data format of track quality and track merging in

Track &amp; Field Parent Meeting BT Track &amp; Field Mission Bartram Trail Track &amp; Field

Track fitting, vertex fitting and Track fitting, vertex fitting and Track fitting, vertex fitting

Mul$scale Molecular Systems Biology: Reconstruc$on and Model

Reconstruc*on of the MOS 6502 on the Cyclone II

Frac%ona%on, rearrangement, consolida%on, reconstruc%on, &amp;

ADC simula+on tools ProtoDUNE Simula+on and Reconstruc+on David

Dense Planar-Iner,al SLAM for Large Indoor 3D Reconstruc,on

Robust reconstruc,on of flow in MRI C. Bertoglio, H. Carrillo, J. Garay, D. Nolte, A. Osses,

DataPrep Status DUNE FD simula@on and reconstruc@on David

Recent Advances in Adap.ve Sampling and Reconstruc.on for Monte

Scott Le Grand Some Things Never Change (GPUs vs the World) How Best to Exploit GPUs

Declaring Array Variables Chapter 6 Arrays datatype[] arrayRefVar; Preferred! Example:

Principles of Programming Languages h&quot;p://www.di.unipi.it/~andrea/Dida2ca/PLP-16/ Prof.

CPE 335 CPE 335 Computer Organization MIPS ISA Dr. Iyad Jafar Adapted from Dr. Gheith Abandah

CSC 4181 Compiler Construction Overview of Compilers Part 1 Introduction 1 1 What is a

Java: Learning to Program with Robots Chapter 10: Arrays Chapter Objectives After studying this

COMP 1402 Winter 2008 Tutorial #5 Arrays Overview of Tutorial #5 What is an array?

Neural Networks (Perceptrons) A mathematical perspective Textbook reference: Sections 11.1-11.2

Applied Cryptography Lecture 1 Applied Cryptography Lecture 1 Our first encounter with secrecy:

Track & Field Parent Meeting BT Track & Field Mission Bartram Trail Track & Field

Frac%ona%on, rearrangement, consolida%on, reconstruc%on, &

Principles of Programming Languages h"p://www.di.unipi.it/~andrea/Dida2ca/PLP-16/ Prof.