
Technology, Innovation and Education for Industry: The SENAI CIMATEC Campus - PowerPoint PPT Presentation



  1. Technology, Innovation and Education for Industry

  2. The SENAI CIMATEC Campus
     Highlights:
     • 4 buildings
     • More than 35,000 m²
     • Over US$200 million of investment
     • 42 competence areas
     • More than 800 employees

  3. SENAI CIMATEC Supercomputing Center

  4. Supercomputing Center Timeline
     0. Cloud (Datacenter)
     1. Yemoja - Oil & Gas: 405 TeraFlops
     2. Fiocruz OMOLU: 50 TFlops
     3. HPC FINEP: 180 TFlops
     4. HPC Oil & Gas (CPU): 800 TeraFlops
     5. HPC Oil & Gas (GPU): 1.8 PetaFlops
     6. HPC Industrial CIMATEC
     7. Quantum
     Timeline axis: 2012, 2015, 2016, 2018, 2019, 2020, 2021
     Team: PhDs, Researchers, Masters, Specialists
     Areas of Activity: Services, Projects, Innovation, Models

  5. HPC Ògún: Heterogeneous Computing (CS2I, SINAPAD)
     • Intel Xeon: 59 nodes, Xeon 6148. Total: 127 TF
     • Xeon Phi: 4 nodes. 8 TFlops
     • NVidia GPU: 2 nodes x 2 P100, NVLink. Total: 13 TF
     • FPGA (+ARM, +GPU): 2 x Arria10. Total: 2 TFlops

  6. Performance and Energy Efficiency Analysis of a Reverse Time Migration Design on FPGA
     Summary:
     • RTM Brief Review
     • Main Computational Challenges
     • Reducing Memory Requirements
     • Hardware-based Acceleration
     • RTMCore's Architecture
     • Performance Tests
     • Conclusions
     Authors: João Carlos Bittencourt, Joaquim Oliveira, Anderson Nascimento, Rodrigo Tutu, Lauê Jesus, Georgina Rojas, Deusdete Matos, Leonardo Fialho, André Lima, Erick Nascimento, João Marcelo Souza, Adhvan Furtado, and Wagner Oliveira

  7. Overview on Reverse Time Migration (RTM)
     • RTM is a Seismic Migration technique for accurate imaging of subsurfaces with great structural and velocity complexities
     • Largely used in Seismic Imaging Flow for refining boundaries in velocity model building processes (FWI, PSO, Tomography, etc.)
     Figure labels: Impulse Source, Input Velocity Model, Seismogram Data, RTM, Enhanced Subsurface Image

  8. Overview on Reverse Time Migration (RTM)
     • Project's specifics:
       • 2D RTM
       • Point Source and Receiver
       • Second-order acoustic wave
       • Finite-difference based solution
       • P-waves only
     Figure labels: Wave Propagation, Geometric Layout, Imaging Condition
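The project specifics above (2D, point source, second-order acoustic wave, finite differences) can be illustrated with a minimal NumPy sketch. It is not the RTMCore kernel: the second-order spatial stencil, the zero-valued borders, the unscaled source injection, and the names `step` and `propagate` are all assumptions made for illustration.

```python
import numpy as np

def step(p_prev, p_curr, vel, dt, dx):
    """Advance the pressure field by one time step.

    Assumption: second-order central differences in both time and space;
    border points get a zero Laplacian (no absorbing boundary here).
    """
    lap = np.zeros_like(p_curr)
    lap[1:-1, 1:-1] = (
        p_curr[2:, 1:-1] + p_curr[:-2, 1:-1]
        + p_curr[1:-1, 2:] + p_curr[1:-1, :-2]
        - 4.0 * p_curr[1:-1, 1:-1]
    ) / dx**2
    # Discretized acoustic wave equation: p_next = 2p - p_prev + (v*dt)^2 * lap
    return 2.0 * p_curr - p_prev + (vel * dt) ** 2 * lap

def propagate(vel, src_pos, wavelet, dt, dx):
    """Forward-propagate a point source through the velocity model `vel`."""
    p_prev = np.zeros_like(vel)
    p_curr = np.zeros_like(vel)
    for amp in wavelet:
        p_next = step(p_prev, p_curr, vel, dt, dx)
        p_next[src_pos] += amp  # inject the point source sample
        p_prev, p_curr = p_curr, p_next
    return p_curr
```

Each interior point reads four neighbours plus itself per update, which is the stencil memory-access pattern the later slides identify as a bottleneck.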

  9. Main Computational Challenges
     • RTM requires a massive computation power, memory and storage to migrate even small fields
     • Finite-difference (Stencil) operators require several memory accesses
     • Migration time and associated energy costs may be prohibitive on production scale

  10. Main Computational Challenges
     • Optimization Goals:
       • Reducing memory requirements
       • Reducing migration time and energy consumption
     • Design Strategy:
       • Choosing memory efficient algorithms
       • Optimizing memory access
       • Efficient design of heterogeneous computing accelerators on FPGA and GPU

  11. Reducing Memory Requirements
     • Focus on boundary treatment strategies:
       • Traditional Check Point strategy [1]
       • Random Boundary Condition (RBC) [2]
       • Hybrid Boundary Condition (HBC) [3]
     • HBC:
       • During forward propagation, a slice of the pressure field upper border is saved for each time step
       • On backward propagation, the border slices are used for source wave reconstruction
     • Test specification:
       • Pluto 2D model (6,960 x 1,201)
       • Number of Shots: 1
       • Time Steps: 12,860
     Boundary Condition Mem. Requirements:
       Strategy     Required Memory (GB)   Image Quality
       Checkpoint   311.4                  High
       RBC          0.25                   Low
       HBC          1.04                   High
     Figure: 2D Pluto Velocity Model (6,960 indexes)

  12. Reducing Memory Requirements
     24-bit Fixed-point Numeric Representation*
     • Fixed-point representation
       • Fixed-point operations generally require fewer clock cycles
       • Word length fixed in 24 bits
       • Memory efficiency is increased
     • HW/SW Validation
       • A fixed-point reference software model was developed and its outputs were verified
     Word layout (1 + 23 bits):
       - Bit 0: Sign bit
       - Bits 1-23: Fraction part
     *No integer part, all values between -1 and 1
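As a rough illustration of the 24-bit word described above (one sign bit, 23 fraction bits, no integer part), the following sketch quantizes values to that format. The slides do not say how negative numbers are encoded or how rounding and overflow are handled, so two's complement, round-to-nearest, and saturation are assumptions here, as are the helper names.

```python
FRAC_BITS = 23                       # fraction bits stated on the slide
SCALE = 1 << FRAC_BITS               # 2**23 quantization steps per unit
Q_MIN, Q_MAX = -SCALE, SCALE - 1     # assumed two's-complement range, [-1, 1)

def to_fixed(x: float) -> int:
    """Quantize a float to the 24-bit format (assumed round-to-nearest, saturating)."""
    q = round(x * SCALE)
    return max(Q_MIN, min(Q_MAX, q))

def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return q / SCALE

def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values and rescale back to 23 fraction bits."""
    return max(Q_MIN, min(Q_MAX, (a * b) >> FRAC_BITS))
```

A software model along these lines lets hardware outputs be compared bit for bit, which is the kind of check the HW/SW validation bullet describes. Relative to 32-bit words, a 24-bit word occupies 25% less memory.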

  13. Hardware-based Acceleration
     • Complete solution is a hw/sw co-design:
       • RTM CPU-based host application
       • RTM FPGA-based acceleration kernel
     • The Host application is responsible for:
       • Configuring kernel parameters
       • Processing input and output data
       • Distributing shots among multiple FPGAs
       • Stacking output images
     • Each kernel performs a full image migration
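The host responsibilities listed above (distributing shots and stacking output images) might look like the following sketch. The round-robin assignment, stacking by summation, and the function names are assumptions; `migrate_shot` is a hypothetical stand-in for one FPGA kernel run, not a real API.

```python
import numpy as np

def distribute_shots(shot_ids, n_devices):
    """Assign shots to devices round-robin (assumed policy)."""
    return [shot_ids[d::n_devices] for d in range(n_devices)]

def migrate_shot(shot_id, shape):
    """Hypothetical stand-in for one kernel run: returns a per-shot image."""
    return np.full(shape, float(shot_id))

def stack_images(assignments, shape):
    """Sum the per-shot images from all devices into one stacked image."""
    image = np.zeros(shape)
    for device_shots in assignments:
        for shot_id in device_shots:
            image += migrate_shot(shot_id, shape)
    return image
```

Because each kernel performs a full image migration for its shot, the shots are independent and the only cross-device step is the final summation.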

  14. Co-design Architecture

  15. RTMCore's Architecture
     • Space Parallelism:
       • All pressure fields of the same time step can be updated simultaneously
       • Multiple Processing Elements update up to 21 pressure points per iteration
     • Time Parallelism:
       • Consecutive time steps can be computed in pipeline
       • A total of 24 cascading Pipelined Staged Modules (PSM) stream time iterations

  16. RTMCore's Architecture
     Proposed Kernel Architecture
     • The design model is based on research presented in [4]

  17. Performance Evaluation
     • Evaluation of the FPGA performance against traditional acceleration alternatives, such as GPU and Multithreading
     • Two aspects were considered:
       • Migration Time: how fast is a seismic shot migrated?
         $T_{mig} = T_{cpu} + T_{write} + T_{read} + T_{kernel}$
       • Energy efficiency: which accelerator delivers more performance while requiring less energy?
         Consumed Energy, with $T = T_{mig}$ (hours), $N$ = number of power samples, $P(i)$ = instantaneous power (W)
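The two metrics above can be sketched in a few lines. The migration-time sum follows the slide directly. The energy formula itself did not survive in the transcript, so the form used here (migration time in hours multiplied by the mean of the N power samples) is an assumption consistent with the listed symbols, and the function names are illustrative.

```python
def migration_time(t_cpu, t_write, t_read, t_kernel):
    """T_mig = T_cpu + T_write + T_read + T_kernel (all in the same unit)."""
    return t_cpu + t_write + t_read + t_kernel

def consumed_energy_wh(power_samples_w, t_mig_s):
    """Estimate energy in Wh as T (hours) times mean sampled power (W).

    Assumption: samples are evenly spaced over the whole migration.
    """
    n = len(power_samples_w)
    if n == 0:
        raise ValueError("at least one power sample is required")
    t_hours = t_mig_s / 3600.0
    return t_hours * sum(power_samples_w) / n
```

For example, a one-minute migration sampled at 10 Hz yields 600 samples; at a constant 100 W that is 100 W x (1/60) h, or about 1.67 Wh.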

  18. Performance Evaluation
     • RTM implementations for performance comparison:
       A. Serial CPU: used as target reference for speed up analysis
       B. Multithread CPU: 40 CPUs computing pressure fields in parallel for each time step (space parallelism)
       C. GPU CUDA: NVidia's Titan X (11 TFLOPs) exploring massive space parallelism
       D. FPGA: RTM kernel exploring both space and time parallelism
     Figure labels: Multithread CPU, NVidia's Titan X, Intel's Arria 10 Dev. Kit

  19. Performance Evaluation
     Energy Measuring Setup
     • Power Measuring Methodology
       • A power meter device was placed between power supply and host
       • Both host and device power were measured during RTM executions
       • Power meter device was configured to collect samples at 10 Hz
       • Only GPU and FPGA power were measured
     Figure: Example: 1 Min. Migration Samples

  20. Performance Results
     • Input Parameters
       • Pluto 2D (6,960 x 1,201)
       • 12,860 Time steps
       • Shot Position: 3,480 x 0
       • Number of Shots: 1
       • Overall Workload: 1.4 GB
     • Efficiency measured in Speedup/Wh
     Performance Results:
       Implementation    Runtime (s)   Speed up   Energy (Wh)   Efficiency
       Serial CPU        21,873.85     1          -             -
       Multithread       2,429.5       9          -             -
       GPU Titan X       182.7         124        36            3.44
       FPGA Arria 10     194           112        20            5.60
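The efficiency column can be recomputed from the reported speed up and energy figures. The sketch below assumes speed up is the serial runtime divided by the implementation's runtime and uses the slide's stated definition of efficiency (Speedup/Wh); the function names are illustrative.

```python
def speedup(serial_runtime_s, runtime_s):
    """Speed up relative to the serial CPU reference (assumed definition)."""
    return serial_runtime_s / runtime_s

def efficiency(speedup_value, energy_wh):
    """Efficiency in Speedup/Wh, as defined on the slide."""
    return speedup_value / energy_wh

# Reported speed up and energy values from the results table.
gpu_eff = efficiency(124, 36)
fpga_eff = efficiency(112, 20)
```

With the reported values, the FPGA delivers a slightly lower speed up than the GPU but at roughly half the energy, which is why its Speedup/Wh figure is higher.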

  21. Concluding Remarks
     • Scalability of the solution lies in the parallelization of shots
       • Multiple FPGA boards in one or more compute nodes
     • Higher scalability can be achieved by exploring temporal parallelism
       • Increasing the number of Pipeline Stage Modules
       • More iterations could be computed in parallel
     • Exploration of fixed-point computation
       • Possibility to explore such method in 3D stencil operators
