
Introduction to [Big] Data Analytics Frameworks
Data Sciences (pilot) Training EC
Sébastien Varrette, PhD
Parallel Computing and Optimization Group (PCOG), University of Luxembourg (UL), Luxembourg
Feb. 7th and Apr. 1st, 2019, Luxembourg


  1. Introduction - Different HPC Needs per Domain
     Domains compared:
       → Material Science & Engineering
       → Biomedical Industry / Life Sciences
       → Deep Learning / Cognitive Computing
       → IoT, FinTech
       → ALL Research Computing Domains
     [Figure: per-domain requirements along six axes: #Cores, Flops/Core, Network Bandwidth, Network Latency, Storage Capacity, I/O Performance]
     Sebastien Varrette (University of Luxembourg) Introduction to [Big] Data Analytics Frameworks 9 / 126

  2. Introduction - New Trends in HPC
     Continued scaling of scientific, industrial & financial applications
       → ... well beyond Exascale
     New trends changing the landscape for HPC
       → Emergence of Big Data analytics
       → Emergence of (Hyperscale) Cloud Computing
       → Data-intensive Internet of Things (IoT) applications
       → Deep learning & cognitive computing paradigms
     [Figure: covers of two reports]
       → "Eurolab-4-HPC Long-Term Vision on High-Performance Computing", editors: Theo Ungerer, Paul Carpenter; funded by the European Union Horizon 2020 Framework Programme (H2020-EU.1.2.2. - FET Proactive) [Source: EuroLab-4-HPC]
       → "Special Study: Analysis of the Characteristics and Development Trends of the Next-Generation of Supercomputers in Foreign Countries", Earl C. Joseph, Ph.D., Robert Sorensen, Steve Conway, Kevin Monroe; study carried out for RIKEN [Source: IDC RIKEN report, 2016]

  3. Introduction - Toward Modular Computing
     Aiming at scalable, flexible HPC infrastructures
       → Primary processing on CPUs and accelerators
           • HPC & Extreme Scale Booster modules
       → Specialized modules for:
           • HTC & I/O intensive workloads
           • [Big] Data Analytics & AI
     [Source: "Towards Modular Supercomputing: The DEEP and DEEP-ER projects", 2016]

  4. Introduction - Summary
     1 Introduction
         HPC & BD Trends
         Reviewing the Main HPC and BD Components
     2 [Big] Data Management in HPC Environment: Overview and Challenges
         Performance Overview in Data transfer
         Data transfer in practice
         Sharing Data
     3 Big Data Analytics with Hadoop, Spark etc.
         Apache Hadoop
         Batch vs Stream vs Hybrid Processing
         Apache Spark
     4 [Brief] Overview of other useful Data Analytics frameworks
         Python Libraries
         R – Statistical Computing
     5 Conclusion

  7. Introduction - HPC Computing Hardware
     Base:
     CPU (Central Processing Unit): highest software flexibility
       → High performance across all computational domains
       → Ex: Intel Core i9-9900K (Q4'18), R_peak ≃ 922 GFlops (DP)
           • 8 cores @ 3.6 GHz (14 nm, 95 W, ≃ 3.5 billion transistors) + integrated graphics
     Accelerators:
     GPU (Graphics Processing Unit): ideal for ML/DL workloads
       → Ex: Nvidia Tesla V100 SXM2 (Q2'17), R_peak ≃ 7.8 TFlops (DP)
           • 5120 cores @ 1.3 GHz (12 nm, 250 W, 21 billion transistors)
     Intel MIC (Many Integrated Core) Accelerator
     ASIC (Application-Specific Integrated Circuits), FPGA (Field Programmable Gate Array)
       → least software flexibility
       → highest performance for specialized problems
           • Ex: AI, Mining, Sequencing...
     ⇒ toward hybrid platforms with DL-enabled accelerators
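The R_peak figure quoted for the CPU can be approximated with the usual rule of thumb: cores × clock frequency × floating-point operations per cycle per core. The sketch below is only an illustration: the function name is hypothetical, and the value of 32 flops per cycle per core is an assumption picked because it reproduces the slide's ≃ 922 GFlops; the real per-cycle count depends on vector width, number of FMA units and precision.

```python
def peak_gflops(cores: int, freq_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFlops: cores x GHz x flops per cycle per core."""
    return cores * freq_ghz * flops_per_cycle


# Core count and frequency come from the slide (8 cores @ 3.6 GHz).
# flops_per_cycle=32 is an ASSUMPTION chosen to match the quoted figure.
cpu_peak = peak_gflops(cores=8, freq_ghz=3.6, flops_per_cycle=32)
print(f"Estimated CPU peak: {cpu_peak:.1f} GFlops")
```

Halving the assumed flops per cycle halves the estimate, so this kind of figure should always be read together with the precision and instruction set it presumes.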

  8. Introduction - HPC Components: Local Memory
     Memory hierarchy (going down: larger, slower and cheaper); CPU connected through the memory bus and the I/O bus
         Level   Component                             Size             Speed
         1       Registers                             500 bytes        sub ns
         2       L1-cache (SRAM)                       64 KB to 8 MB    1-2 cycles
                 L2-cache (SRAM)                                        10 cycles
                 L3-cache (DRAM)                                        20 cycles
         3       Memory (DRAM)                         1 GB             hundreds of cycles
         4       Disk                                  1 TB             tens of thousands of cycles
     SSD (SATA3): R/W 550 MB/s; 100000 IOPS; 450 €/TB
     HDD (SATA3 @ 7.2 krpm): R/W 227 MB/s; 85 IOPS; 54 €/TB
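The gap between the IOPS and MB/s figures above explains why small I/Os hurt so much on spinning disks. A minimal sketch, assuming a simple model where throughput is capped both by the sequential rate and by IOPS × block size; the model, the function name and the 4 KiB and 1 MiB block sizes are illustrative assumptions, while the IOPS and MB/s values are the ones from the slide.

```python
def effective_mb_per_s(block_bytes: int, iops: float, seq_mb_per_s: float) -> float:
    """Throughput bounded by the sequential rate and by IOPS x block size."""
    return min(seq_mb_per_s, iops * block_bytes / 1e6)


# Device figures taken from the slide; block sizes are assumed for illustration.
DEVICES = {"SSD (SATA3)": (100_000, 550.0), "HDD (SATA3)": (85, 227.0)}

for name, (iops, seq) in DEVICES.items():
    for block in (4096, 1024 * 1024):
        rate = effective_mb_per_s(block, iops, seq)
        print(f"{name:12s} block={block:>8d} B -> {rate:8.2f} MB/s")
```

Under this model the HDD delivers well under 1 MB/s with 4 KiB random accesses, far below its sequential rate, whereas the SSD stays within the same order of magnitude as its sequential rate.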

  10. Introduction - HPC Components: Interconnect
     latency: time to send a minimal (0 byte) message from A to B
     bandwidth: max amount of data communicated per unit of time
         Technology              Effective Bandwidth        Latency
         Gigabit Ethernet        1 Gb/s      125 MB/s       40 µs to 300 µs
         10 Gigabit Ethernet     10 Gb/s     1.25 GB/s      4 µs to 5 µs
         Infiniband QDR          40 Gb/s     5 GB/s         1.29 µs to 2.6 µs
         Infiniband EDR          100 Gb/s    12.5 GB/s      0.61 µs to 1.3 µs
         Infiniband HDR          200 Gb/s    25 GB/s        0.5 µs to 1.1 µs
         100 Gigabit Ethernet    100 Gb/s    12.5 GB/s      30 µs
         Intel Omnipath          100 Gb/s    12.5 GB/s      0.9 µs
     [Figure: interconnect family share: 10G 40.8 %, Infiniband 32.6 %, Custom 13.4 %, Omnipath 7 %, Gigabit Ethernet 4.8 %, Proprietary 1.4 %. Source: www.top500.org, Nov. 2017]
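The two definitions above combine into a common first-order cost model for a message: time ≈ latency + size / bandwidth. The sketch below applies it to two rows of the table, using the lower latency bound of each; the model itself, the function name and the two message sizes are assumptions for illustration, not measurements.

```python
def transfer_time_s(message_bytes: float, latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """First-order model: fixed latency plus serialization time."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# (latency in s, bandwidth in bytes/s), from the table (lower latency bounds).
LINKS = {
    "Gigabit Ethernet": (40e-6, 125e6),
    "Infiniband EDR": (0.61e-6, 12.5e9),
}

for name, (latency, bandwidth) in LINKS.items():
    for size in (1_000, 1_000_000_000):  # assumed sizes: 1 KB and 1 GB
        t = transfer_time_s(size, latency, bandwidth)
        print(f"{name:18s} {size:>13,d} B  {t * 1e6:16.2f} us")
```

For the small message, latency dominates the total; for the large one, bandwidth does. This is why latency-sensitive parallel codes benefit from low-latency fabrics even when they move little data.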

  13. Introduction - Network Topologies
     Direct vs. Indirect interconnect
       → direct: each network node attaches to at least one compute node
       → indirect: compute nodes attached at the edge of the network only
           • many routers only connect to other routers
     Main HPC Topologies
     CLOS Network / Fat-Trees [Indirect]
       → can be fully non-blocking (1:1) or blocking (x:1)
       → typically enables best performance
           • non-blocking bandwidth, lowest network latency
     Mesh or 3D-torus [Direct]
       → Blocking network, cost-effective for systems at scale
       → Great performance for applications with locality
       → Simple expansion for future growth

  14. Introduction - HPC Components: Operating System
     Exclusively Linux-based (really 100 %)
     Reasons:
       → stability
       → development flexibility
     [Figure: operating system share, Linux 100 %. Source: www.top500.org, Nov. 2017]

  15. Introduction - HPC Components: Software Stack
     Remote connection to the platform: SSH
     Identity Management / SSO: LDAP, Kerberos, IPA...
     Resource management: job/batch scheduler
       → SLURM, OAR, PBS, MOAB/Torque...
     (Automatic) Node Deployment:
       → FAI, Kickstart, Puppet, Chef, Ansible, Kadeploy...
     (Automatic) User Software Management:
       → Easybuild, Environment Modules, LMod
     Platform Monitoring:
       → Nagios, Icinga, Ganglia, Foreman, Cacti, Alerta...

  16. Introduction - [Big]Data Management
     Storage architectural classes & I/O layers
     [Figure: path from the application to the disks for three classes of storage]
       → DAS: application and file system reach the disks through a DAS interface (SATA, SAS, Fiber Channel)
       → SAN: application and file system reach the disks through a SAN interface over a Fiber Channel / Ethernet network (iSCSI, FC...)
       → NAS: application reaches a remote file system through a NAS interface over a Fiber Channel / Ethernet network (NFS, CIFS, AFP...)

  17. Introduction - [Big]Data Management: Disk Enclosures
     ≃ 120 k€ per enclosure, 48-60 disks (4U)
       → incl. redundant (i.e. 2) RAID controllers (master/slave)

  19. Introduction - [Big]Data Management: File Systems
     File System (FS): logical manner to store, organize, manipulate & access data
     (local) Disk FS: FAT32, NTFS, HFS+, ext{3,4}, {x,z,btr}fs...
       → manage data on permanent storage devices
       → poor performance: read 100 → 400 MB/s | write 10 → 200 MB/s

  21. Introduction - [Big]Data Management: File Systems
     Networked FS: NFS, CIFS/SMB, AFP
       → disk access from remote nodes via network access
       → poorer performance for HPC jobs, especially parallel I/O
           • read: only 381 MB/s on a system capable of 740 MB/s (16 tasks)
           • write: only 90 MB/s on a system capable of 400 MB/s (4 tasks)
     [Source: LISA'09, Ray Paden: How to Build a Petabyte Sized Storage System]
     [scale-out] NAS
       → aka Appliances: OneFS...
       → Focus on CIFS, NFS
       → Integrated HW/SW
       → Ex: EMC (Isilon), IBM (SONAS), DDN...

  22. Introduction - [Big]Data Management: File Systems
     Basic Clustered FS: GPFS
       → File access is parallel
       → File system overhead operations are distributed and done in parallel
           • no metadata servers
       → File clients access file data through file servers via the LAN

  23. Introduction - [Big]Data Management: File Systems
     Multi-Component Clustered FS: Lustre, Panasas
       → File access is parallel
       → File system overhead operations run on dedicated components
           • metadata server (Lustre) or director blades (Panasas)
       → Multi-component architecture
       → File clients access file data through file servers via the LAN

  26. Introduction - [Big]Data Management: FS Summary
     File System (FS): logical manner to store, organize & access data
       → (local) Disk FS: FAT32, NTFS, HFS+, ext4, {x,z,btr}fs...
       → Networked FS: NFS, CIFS/SMB, AFP
       → Parallel/Distributed FS: SpectrumScale/GPFS, Lustre
           • typical FS for HPC / HTC (High Throughput Computing)
     Main characteristic of Parallel/Distributed File Systems:
     capacity and performance increase with #servers
         Name             Type                       Read* [GB/s]   Write* [GB/s]
         ext4             Disk FS                    0.426          0.212
         nfs              Networked FS               0.381          0.090
         gpfs (iris)      Parallel/Distributed FS    11.25          9.46
         lustre (iris)    Parallel/Distributed FS    12.88          10.07
         gpfs (gaia)      Parallel/Distributed FS    7.74           6.524
         lustre (gaia)    Parallel/Distributed FS    4.5            2.956
     * maximum random read/write, per IOZone or IOR measures, using concurrent nodes for networked FS.

  28. Introduction - HPC Components: Data Center
     Definition (Data Center): facility to house computer systems and associated components
       → Basic storage component: rack (height: 42 RU)
     Challenges: Power (UPS, battery), Cooling, Fire protection, Security
     Power/Heat dissipation per rack:
       → HPC computing racks: 30-120 kW
       → Storage racks: 15 kW
       → Interconnect racks: 5 kW
     Power Usage Effectiveness:
         PUE = Total facility power / IT equipment power
     Various cooling technologies
       → Airflow
       → Direct-Liquid Cooling, Immersion...
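As a quick worked instance of the PUE ratio defined above, the sketch below computes it for a made-up facility. Both power figures are hypothetical values chosen for illustration, not data from the slides, and the function name is likewise an assumption.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# HYPOTHETICAL facility: 1000 kW of IT load plus 300 kW for cooling, UPS losses, etc.
example_pue = pue(total_facility_kw=1300.0, it_equipment_kw=1000.0)
print(f"PUE = {example_pue:.2f}")
```

A value of 1.0 would mean every watt goes to IT equipment; anything above it quantifies the overhead spent on cooling and power distribution.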

  29. Introduction - Software/Modules Management
     https://hpc.uni.lu/users/software/
     Based on Environment Modules / LMod
       → convenient way to dynamically change the user's environment ($PATH)
       → makes it easy to load software through the module command
     Currently on UL HPC:
       → > 200 software packages, in multiple versions, within 18 categories
       → reworked software set for the iris cluster, now deployed everywhere
           • RESIF v2.0, allowing [real] semantic versioning of released builds
       → hierarchical organization. Ex: toolchain/{foss,intel}
         $> module avail                                   # List available modules
         $> module load <category>/<software>[/<version>]

  31. Introduction - Software/Modules Management
     Key module variable: $MODULEPATH (where to look for modules)
       → altered with module use <path>. Ex:
         export EASYBUILD_PREFIX=$HOME/.local/easybuild
         export LOCAL_MODULES=$EASYBUILD_PREFIX/modules/all
         module use $LOCAL_MODULES
     Main module commands:
         Command                          Description
         module avail                     List all the modules available to be loaded
         module spider <pattern>          Search for <pattern> among available modules (Lmod only)
         module load <mod1> [mod2...]     Load a module
         module unload <module>           Unload a module
         module list                      List loaded modules
         module purge                     Unload all modules (purge)
         module display <module>          Display what a module does
         module use <path>                Prepend the directory to the MODULEPATH environment variable
         module unuse <path>              Remove the directory from the MODULEPATH environment variable

  32. Introduction - Software/Modules Management
     http://hpcugent.github.io/easybuild/
     Easybuild: open-source framework to (automatically) build scientific software
     Why?: "Could you please install this software on the cluster?"
       → Scientific software is often difficult to build
           • non-standard build tools / incomplete build procedures
           • hardcoded parameters and/or poor/outdated documentation
       → EasyBuild helps to facilitate this task
           • consistent software build and installation framework
           • includes a testing step that helps validate builds
           • automatically generates LMod modulefiles
         $> module use $LOCAL_MODULES
         $> module load tools/EasyBuild
         # Search for recipes for a given software
         $> eb -S Spark
         $> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -Dr   # Dry-run install
         $> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -r

  35. [Big] Data Management in HPC Environment: Overview and Challenges - Data Intensive Computing
     Data volumes are increasing massively
       → Clusters and storage capacity are increasing massively
     Disk speeds are not keeping pace.
     Seek speeds are even worse than read/write speeds.

  37. [Big] Data Management in HPC Environment: Overview and Challenges - Speed Expectation on Data Transfer
     http://fasterdata.es.net/
     How long to transfer 1 TB of data across various speed networks?
         Network     Time
         10 Mbps     300 hrs (12.5 days)
         100 Mbps    30 hrs
         1 Gbps      3 hrs
         10 Gbps     20 minutes
     (Again) small I/Os really kill performance
       → Ex: transferring 80 TB for the backup of ecosystem_biology
       → same rack, 10 Gb/s: 4 weeks → 63 TB transferred...
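The table and the backup anecdote above can be cross-checked with simple arithmetic. The sketch below computes the ideal (overhead-free) time to move 1 TB at each line rate, then the average throughput implied by 63 TB in 4 weeks. It assumes 1 TB = 10^12 bytes and a perfectly saturated link; the function names are illustrative. The ideal times come out somewhat lower than the table's figures, which presumably allow for protocol overhead and realistic efficiency.

```python
def ideal_hours(size_bytes: float, rate_bits_per_s: float) -> float:
    """Transfer time in hours at full line rate, ignoring all overheads."""
    return size_bytes * 8 / rate_bits_per_s / 3600


def effective_mb_per_s(bytes_moved: float, elapsed_s: float) -> float:
    """Average throughput in MB/s actually achieved over a period."""
    return bytes_moved / elapsed_s / 1e6


TB = 1e12  # assumption: decimal terabyte

for label, rate in [("10 Mbps", 10e6), ("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]:
    print(f"{label:>8s}: {ideal_hours(TB, rate):8.2f} h (ideal)")

# Backup example from the slide: 63 TB moved in 4 weeks over a 10 Gb/s link.
four_weeks_s = 4 * 7 * 24 * 3600
backup_rate = effective_mb_per_s(63 * TB, four_weeks_s)
link_share = backup_rate * 8e6 / 10e9 * 100
print(f"Backup: {backup_rate:.2f} MB/s on average, about {link_share:.1f} % of the 10 Gb/s link")
```

The backup averaged roughly 26 MB/s, a small fraction of what the link could carry, which is consistent with the slide's point that many small I/Os, not the network, were the bottleneck.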

  38. [Big] Data Management in HPC Environment: Overview and Challenges - Speed Expectation on Data Transfer
     http://fasterdata.es.net/

  40. [Big] Data Management in HPC Environment: Overview and Challenges - Storage Performances: GPFS

  41. [Big] Data Management in HPC Environment: Overview and Challenges - Storage Performances: Lustre

  42. [Big] Data Management in HPC Environment: Overview and Challenges - Storage Performances
     Based on IOR or IOZone, reference I/O benchmarks: Read
       → tests performed in 2013
     [Figure: read I/O bandwidth (MiB/s, axis ticks from 64 to 65536 in powers of two) vs. number of threads (0 to 15), for SHM / Bigmem, Lustre / Gaia, NFS / Gaia, SSD / Gaia, Hard Disk / Chaos]

  43. [Big] Data Management in HPC Environment: Overview and Challenges - Storage Performances
     Based on IOR or IOZone, reference I/O benchmarks: Write
       → tests performed in 2013
     [Figure: write I/O bandwidth (MiB/s, axis ticks from 64 to 32768 in powers of two) vs. number of threads (0 to 15), for SHM / Bigmem, Lustre / Gaia, NFS / Gaia, SSD / Gaia, Hard Disk / Chaos]

  44. [Big] Data Management in HPC Environment: Overview and Challenges - Understanding Your Storage Options
     Where can I store and manipulate my data?
     Shared storage
       → NFS: not scalable, ≃ 1.5 GB/s (R), O(100 TB)
       → GPFS/SpectrumScale: scalable, ≃ 10-500 GB/s (R), O(10 PB)
       → Lustre: scalable, ≃ 10-500 GB/s (R), O(10 PB)
     Local storage
       → local file system (/tmp), O(1 TB)
           • over HDD ≃ 100 MB/s, over SSD ≃ 400 MB/s
       → RAM (/dev/shm), ≃ 30 GB/s (R), O(100 GB)
     Distributed storage
       → HDFS, Ceph, GlusterFS, BeeGFS: scalable, ≃ 1 GB/s
     ⇒ In all cases: small I/Os really kill storage performance

  46. [Big] Data Management in HPC Environment: Overview and Challenges - Data Transfer in Practice
     Transfer from FTP/HTTP[S]: wget or (better) curl
       → can also serve to send HTTP POST requests
       → supports HTTP cookies (useful for JDK download)
         $> wget [-O <output>] <url>   # download file from <url>
         $> curl [-o <output>] <url>   # download file from <url>

  47. [Big] Data Management in HPC Environment: Overview and Challenges - Data Transfer in Practice
     [Secure] transfer between two remote machines over SSH
       → scp or (better) rsync (transfers only what is required)
         $> scp [-P <port>] <src> <user>@<host>:<path>
         $> rsync -avzu [-e 'ssh -p <port>'] <src> <user>@<host>:<path>
     Assumes you have understood and appropriately configured SSH!

  50. [Big] Data Management in HPC Environment: Overview and Challenges - SSH: Secure Shell
     Ensure secure connection to remote (UL) server
       → establish an encrypted tunnel using asymmetric keys
           • Public id_rsa.pub vs. Private id_rsa (without .pub)
           • typically on a non-standard port (Ex: 8022), which limits script kiddies
           • Basic rule: 1 machine = 1 key pair
       → the private key is SECRET: never send it to anybody
           • Can be protected with a passphrase
     SSH is used as a secure backbone channel for many tools
       → Remote shell, i.e. remote command line
       → File transfer: rsync, scp, sftp
       → versioning synchronization (svn, git), github, gitlab etc.
     Authentication:
       → password (disable if possible)
       → (better) public key authentication

  53. [Big] Data Management in HPC Environment: Overview and Challenges - SSH: Public Key Authentication
     Client (local machine), local homedir ~/.ssh/
       → id_rsa: local private key, owned by the client
       → id_rsa.pub: local public key
       → known_hosts: logs known servers
     Server (remote machine), remote homedir ~/.ssh/
       → authorized_keys: knows the granted (public) keys
     SSH server config, /etc/ssh/
       → sshd_config
       → ssh_host_rsa_key
       → ssh_host_rsa_key.pub

  55. [Big] Data Management in HPC Environment: Overview and Challenges - SSH: Public Key Authentication
     Client (local machine, ~/.ssh/): owns the local private key id_rsa and the public key id_rsa.pub
     Server (remote machine, ~/.ssh/): authorized_keys knows the granted (public) key
     Exchange:
       1. Client: initiate connection
       2. Server: create a random challenge, "encrypt" it using the public key
       3. Client: solve the challenge using the private key, return the response
       4. Server: allow the connection iff response == challenge
     Restrict to public key authentication, in /etc/ssh/sshd_config:
         PermitRootLogin no
         # Enable Public key auth.
         RSAAuthentication yes
         PubkeyAuthentication yes
         # Disable Passwords
         PasswordAuthentication no
         ChallengeResponseAuthentication no
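The four-step exchange above can be mimicked with textbook RSA on tiny numbers. This is a toy illustration of the simplified description on the slide only: it is not how OpenSSH is implemented, the key (n = 3233, e = 17, d = 2753) is a classroom example with no security whatsoever, and all function names are hypothetical.

```python
import secrets

# Toy textbook RSA key pair (p = 61, q = 53): for illustration ONLY.
N = 3233   # public modulus
E = 17     # public exponent  (server side, as stored in authorized_keys)
D = 2753   # private exponent (client side, never leaves the client)


def server_make_challenge() -> tuple[int, int]:
    """Step 2: draw a random challenge and 'encrypt' it with the public key."""
    challenge = secrets.randbelow(N - 2) + 2   # value in [2, N-1]
    return challenge, pow(challenge, E, N)


def client_solve(encrypted: int) -> int:
    """Step 3: recover the challenge using the private key."""
    return pow(encrypted, D, N)


challenge, encrypted = server_make_challenge()   # steps 1-2
response = client_solve(encrypted)               # step 3
granted = response == challenge                  # step 4
print("connection allowed" if granted else "connection refused")
```

Only the holder of the private exponent can turn the encrypted value back into the challenge, which is why the server can authenticate the client without the private key ever crossing the network.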

  59. [Big] Data Management in HPC Environment: Overview and Challenges - SSH Setup on Linux / Mac OS
     OpenSSH natively supported; configuration directory: ~/.ssh/
       → package openssh-client (Debian-like) or ssh (Redhat-like)
     SSH key pair (public vs. private) generation: ssh-keygen
       → specify a strong passphrase
           • protects your private key from being stolen, i.e. impersonation
           • drawback: passphrase must be typed to use your key (ssh-agent avoids retyping it)
     DSA and RSA 1024 bit are deprecated now!
         $> ssh-keygen -t rsa -b 4096 -o -a 100   # 4096 bits RSA
         $> ssh-keygen -t ed25519 -o -a 100       # new sexy Ed25519 (better)
     Public key:             ~/.ssh/id_{rsa,ed25519}.pub
     Private (identity) key: ~/.ssh/id_{rsa,ed25519}

  60. [Big] Data Management in HPC Environment: Overview and Challenges
      SSH Setup on Windows
      Use MobaXterm! http://mobaxterm.mobatek.net/
        → [tabbed] session management
        → X11 server with enhanced X extensions
        → graphical SFTP browser
        → SSH gateway / tunnel wizards
        → [remote] text editor
        → ...

  61. [Big] Data Management in HPC Environment: Overview and Challenges
      Summary
      1 Introduction
          HPC & BD Trends
          Reviewing the Main HPC and BD Components
      2 [Big] Data Management in HPC Environment: Overview and Challenges
          Performance Overview in Data transfer
          Data transfer in practice
          Sharing Data
      3 Big Data Analytics with Hadoop, Spark etc.
          Apache Hadoop
          Batch vs Stream vs Hybrid Processing
          Apache Spark
      4 [Brief] Overview of other useful Data Analytics frameworks
          Python Libraries
          R – Statistical Computing
      5 Conclusion

  63. [Big] Data Management in HPC Environment: Overview and Challenges
      Sharing Code and Data
      Before doing Big Data, manage and version your normal data correctly.
      What kinds of systems are available?
        Good: NAS, Cloud
          → NextCloud, Dropbox, {Google,iCloud} Drive, Figshare...
        Better: Version Control Systems (VCS)
          → SVN, Git and Mercurial
        Best: Version Control Systems on the Public/Private Cloud
          → GitHub, Bitbucket, GitLab
      Which one?
        → Depends on the level of privacy you expect
            ... but you probably already know these tools
        → Few handle GB-sized files... unless you use Git LFS (Large File Storage)
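      Before moving a project to Git LFS, it can help to see which files are actually large. The sketch below walks a directory tree and lists files at or above a size threshold. The function name large_files and the 100 MiB default are arbitrary choices made for illustration, not a limit stated in the slides.

```python
import os


def large_files(root, threshold_bytes=100 * 1024 * 1024):
    """Return (relative path, size) pairs for files of at least
    threshold_bytes under root, largest first.

    The default threshold is an arbitrary illustrative value.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own object store.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symbolic links
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file
            if size >= threshold_bytes:
                found.append((os.path.relpath(path, root), size))
    return sorted(found, key=lambda item: item[1], reverse=True)


for rel_path, size in large_files("."):
    print(f"{size:>14d}  {rel_path}")
```

      Files reported this way are candidates for tracking through Git LFS rather than as regular Git objects.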

  65. [Big] Data Management in HPC Environment: Overview and Challenges
      Centralized VCS - CVS, SVN
      [Figure: Computer A and Computer B each check out a file from a central VCS server, which holds the version database (Version 1, Version 2, Version 3).]

  66. [Big] Data Management in HPC Environment: Overview and Challenges
      Distributed VCS - Git
      [Figure: the server computer, Computer A and Computer B each hold a full version database (Version 1, Version 2, Version 3); Computers A and B also hold the working file.]
      Everybody has the full history of commits

  72. [Big] Data Management in HPC Environment: Overview and Challenges
      Tracking changes (most VCS)
      Delta storage: each checkin records a delta (Δ) for every file it changes.
      Checkins over time   C1   C2   C3   C4   C5
      file A                    Δ1        Δ2
      file B                              Δ1   Δ2
      file C                    Δ1   Δ2        Δ3
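      Delta storage can be sketched in a few lines: keep the first version of a file in full, store every later checkin as a list of line edits, and rebuild any version by replaying those edits. This is an illustrative toy, not the on-disk format of any particular VCS. The names make_delta, apply_delta and DeltaFile are hypothetical, and the file contents are made up.

```python
from difflib import SequenceMatcher


def make_delta(old_lines, new_lines):
    """Describe new_lines as a list of (start, end, replacement) edits on old_lines."""
    matcher = SequenceMatcher(a=old_lines, b=new_lines)
    return [(i1, i2, new_lines[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def apply_delta(old_lines, delta):
    """Apply edits from last to first so earlier indices stay valid."""
    result = list(old_lines)
    for start, end, replacement in reversed(delta):
        result[start:end] = replacement
    return result


class DeltaFile:
    """One file's history: a full base version plus one delta per later checkin."""

    def __init__(self, lines):
        self.base = list(lines)
        self.deltas = []  # list of (checkin label, delta)

    def checkin(self, label, lines):
        self.deltas.append((label, make_delta(self.content(), lines)))

    def content(self, upto=None):
        """Rebuild the file by replaying deltas, up to and including checkin upto."""
        lines = list(self.base)
        for label, delta in self.deltas:
            lines = apply_delta(lines, delta)
            if label == upto:
                break
        return lines


# File A from the table: created at C1, changed at C2 (delta 1) and C4 (delta 2).
file_a = DeltaFile(["alpha", "beta"])
file_a.checkin("C2", ["alpha", "beta", "gamma"])
file_a.checkin("C4", ["alpha", "BETA", "gamma"])
print(file_a.content(upto="C2"))
print(file_a.content())
```

      The cost of this layout is visible in content(): reading a recent version means replaying every delta since the base.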

  82. [Big] Data Management in HPC Environment: Overview and Challenges
      Tracking changes (Git)
      Snapshot (DAG) storage: each checkin records a snapshot of every file; unchanged files reuse the previous version.
      Checkins over time   C1   C2   C3   C4   C5
      file A               A    A1   A1   A2   A2
      file B               B    B    B    B1   B2
      file C               C    C1   C2   C2   C3
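      The snapshot table can be reproduced with a small content-addressed store. Every file version is saved once under a hash of its content, and each checkin is just a mapping from file names to hashes, so unchanged files cost nothing extra. The blob_id function follows the way Git computes blob object ids (SHA-1 over a "blob <size>\0" header plus the content). The SnapshotStore class is a hypothetical simplification that leaves out Git's tree and commit objects and the parent links that form the DAG.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Git-style blob id: SHA-1 over a 'blob <size>' header, a NUL byte, then the content."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class SnapshotStore:
    """Toy snapshot storage: content-addressed blobs plus one name-to-id map per checkin."""

    def __init__(self):
        self.blobs = {}    # blob id -> content, each distinct content stored once
        self.commits = []  # list of (checkin label, {file name: blob id})

    def checkin(self, label, files):
        tree = {}
        for name, content in files.items():
            oid = blob_id(content)
            self.blobs.setdefault(oid, content)  # identical content is not stored twice
            tree[name] = oid
        self.commits.append((label, tree))

    def checkout(self, label):
        for commit_label, tree in self.commits:
            if commit_label == label:
                return {name: self.blobs[oid] for name, oid in tree.items()}
        raise KeyError(label)


# The five checkins from the table, using the version labels as file content.
history = [
    ("C1", {"A": b"A", "B": b"B", "C": b"C"}),
    ("C2", {"A": b"A1", "B": b"B", "C": b"C1"}),
    ("C3", {"A": b"A1", "B": b"B", "C": b"C2"}),
    ("C4", {"A": b"A2", "B": b"B1", "C": b"C2"}),
    ("C5", {"A": b"A2", "B": b"B2", "C": b"C3"}),
]
store = SnapshotStore()
for label, files in history:
    store.checkin(label, files)

entries = sum(len(tree) for _, tree in store.commits)
print(f"{entries} file entries, {len(store.blobs)} stored blobs")
```

      Checking out any version is a direct lookup, with no deltas to replay, which is the main practical difference from delta storage.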

  83. [Big] Data Management in HPC Environment: Overview and Challenges
      VCS Taxonomy
      [Figure: VCS and backup tools classified by storage model (delta storage vs snapshot (DAG) storage) and by architecture (local, centralized, distributed).]
      delta storage
        → local: rcs, Mac OS File Versions
        → centralized: cvs, svn (Subversion)
        → distributed: hg (mercurial)
      snapshot (DAG) storage
        → local / centralized: cp -r, rsync, time machine, bontmia, backupninja, duplicity
        → distributed: bitkeeper, bzr (bazaar), git
