
Introduction to [Big] Data Analytics Frameworks

Data Sciences (pilot) Training EC

Sébastien Varrette, PhD Parallel Computing and Optimization Group (PCOG), University of Luxembourg (UL), Luxembourg

  • Feb. 7th and Apr. 1st, 2019, Luxembourg

1 / 126 Sebastien Varrette (University of Luxembourg) Introduction to [Big] Data Analytics Frameworks


Short CV

https://varrette.gforge.uni.lu

Experienced Research Scientist at the University of Luxembourg (UL)
  → 15 years of experience in the management & deployment of HPC systems
  → Ph.D. in Computer Science with great distinction (2007, INPG/UL)
  → Research interests: security and performance of parallel & distributed computing platforms (HPC/Cloud/IoT); 590 citations, 4 books, 80 articles in peer-reviewed journals/conferences

Deputy Head, Uni.lu HPC for Research
Management committee member representing Luxembourg:
  → ETP4HPC, EU COST NESUS, PRACE (acting Advisor)
National / EU HPC project involvement:
  → National HPC and Big Data competence center (MECO)
  → NVidia cooperation agreement on AI and HPC (SMC)
  → EuroHPC JU / Hosting Entities for Supercomputers program

Welcome!

(Pilot) Data Sciences Training for EC: Introduction to [Big] Data Analytics
  → starts with the data processing facilities enabling analytics: HPC (High Performance Computing) and HTC (High Throughput Computing)
  → continues with daily data management... before speaking about Big Data management in particular: data transfer (over SSH), data versioning with Git
  → introduction to the classical Big Data analytics frameworks: Distributed File Systems (DFS), MapReduce, and the main processing engines (batch, stream, hybrid), aka Hadoop, Storm, Samza, Flink, Spark
  → (very brief) review of other useful data analytics frameworks (Python, R, etc.) and their effective usage on HPC platforms

Disclaimer: Acknowledgements

Part of these slides were borrowed, with permission, from:
  → Prof. Martin Theobald (Big Data and Data Science Research Group), UL

Part of the slide material was adapted from:
  → Advanced Analytics with Spark, O'Reilly
  → Data Analytics with HPC courses, © CC Attribution-NonCommercial-ShareAlike 4.0

General/hands-on material adapted from:
  → (of course) the Uni.lu HPC Tutorials; credits: UL HPC team (S. Varrette, V. Plugaru, S. Peter, H. Cartiaux, C. Parisot), R section: A. Ginolhac
  → similar GitHub projects: Jonathan Dursi's hadoop-for-hpcers-tutorial

Summary

1. Introduction
     HPC & BD Trends
     Reviewing the Main HPC and BD Components
2. [Big] Data Management in HPC Environment: Overview and Challenges
     Performance Overview in Data Transfer
     Data Transfer in Practice
     Sharing Data
3. Big Data Analytics with Hadoop, Spark etc.
     Apache Hadoop
     Batch vs Stream vs Hybrid Processing
     Apache Spark
4. [Brief] Overview of Other Useful Data Analytics Frameworks
     Python Libraries
     R – Statistical Computing
5. Conclusion



Prerequisites: Metrics

HPC: High Performance Computing; BD: Big Data

Main HPC/BD Performance Metrics

Computing capacity: often measured in flops (or flop/s)
  → floating-point operations per second (often in double precision, DP)
  → GFlops = 10^9, TFlops = 10^12, PFlops = 10^15, EFlops = 10^18

Storage capacity: measured in multiples of bytes (1 byte = 8 bits)
  → GB = 10^9 bytes, TB = 10^12, PB = 10^15, EB = 10^18
  → GiB = 1024^3 bytes, TiB = 1024^4, PiB = 1024^5, EiB = 1024^6

Transfer rate on a medium: measured in Mb/s or MB/s
Other metrics: sequential vs random R/W speed, IOPS...
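The decimal vs binary prefix distinction above matters in practice: a disk sold as "1 TB" holds noticeably less than 1 TiB. A minimal sketch of the conversion:

```python
# Decimal (SI) vs binary (IEC) storage prefixes, as defined above.
TB = 10**12        # terabyte: 10^12 bytes
TiB = 1024**4      # tebibyte: 1024^4 = 2^40 bytes

one_tb_disk = 1 * TB
print(one_tb_disk / TiB)   # a "1 TB" disk is only ~0.909 TiB
```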


Why HPC and BD?

Essential tools for science, society and industry
  → data-driven economy context
  → all scientific disciplines are becoming computational today: they require very high computing power and must handle huge volumes of data

Industry and SMEs increasingly rely on HPC
  → to invent innovative solutions
  → ... while reducing cost & decreasing time to market

HPC is a global race (strategic priority); the EU takes up the challenge:
  → PRACE / EuroHPC / IPCEI on HPC and Big Data (BD) Applications

"To out-compete you must out-compute" (Andy Grant, Head of Big Data and HPC, Atos UK&I)

Increasing competition, heightened customer expectations and shortening product development cycles are forcing the pace of acceleration across all industries.


Different HPC Needs per Domain

Material Science & Engineering, Biomedical Industry / Life Sciences, IoT and FinTech, Deep Learning / Cognitive Computing: in short, ALL research computing domains.

[Radar charts: each domain profiled along six axes: #cores, network bandwidth, I/O performance, storage capacity, flops/core, network latency]

New Trends in HPC

Continued scaling of scientific, industrial & financial applications
  → ... well beyond exascale

New trends changing the landscape for HPC:
  → emergence of Big Data analytics
  → emergence of (hyperscale) Cloud Computing
  → data-intensive Internet of Things (IoT) applications
  → deep learning & cognitive computing paradigms

[Source: Eurolab-4-HPC Long-Term Vision on High-Performance Computing, editors Theo Ungerer and Paul Carpenter; funded by the European Union Horizon 2020 Framework Programme (H2020-EU.1.2.2, FET Proactive)]
[Source: "Analysis of the Characteristics and Development Trends of the Next-Generation of Supercomputers in Foreign Countries", a special study carried out for RIKEN by Earl C. Joseph, Ph.D., Robert Sorensen, Steve Conway and Kevin Monroe; IDC RIKEN report, 2016]

Toward Modular Computing

Aiming at scalable, flexible HPC infrastructures
  → primary processing on CPUs and accelerators: HPC & Extreme-Scale Booster modules
  → specialized modules for HTC & I/O-intensive workloads and for [Big] Data Analytics & AI

[Source: "Towards Modular Supercomputing: The DEEP and DEEP-ER projects", 2016]



HPC Computing Hardware

Base: CPU (Central Processing Unit), the highest software flexibility
  → high performance across all computational domains
  → Ex: Intel Core i9-9900K (Q4'18), Rpeak ≃ 922 GFlops (DP); 8 cores @ 3.6 GHz (14 nm, 95 W, ≃3.5 billion transistors) + integrated graphics

Accelerators:

GPU (Graphics Processing Unit): ideal for ML/DL workloads
  → Ex: Nvidia Tesla V100 SXM2 (Q2'17), Rpeak ≃ 7.8 TFlops (DP); 5120 cores @ 1.3 GHz (12 nm, 250 W, 21 billion transistors)

Intel MIC (Many Integrated Core) accelerators, ASICs (Application-Specific Integrated Circuits), FPGAs (Field-Programmable Gate Arrays)
  → least software flexibility
  → highest performance for specialized problems: AI, mining, sequencing...

⇒ toward hybrid platforms with DL-enabled accelerators
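The Rpeak figures quoted above follow from a simple formula: peak performance = #cores × clock rate × flops per cycle per core. A sketch for the Core i9-9900K figure; note that the 32 DP flops/cycle/core value is the assumption implied by the slide's ≃922 GFlops number, not something stated in the deck:

```python
def rpeak_gflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in GFlops: cores x clock (GHz) x flops/cycle/core."""
    return cores * clock_ghz * flops_per_cycle

# Intel Core i9-9900K as quoted: 8 cores @ 3.6 GHz,
# assuming 32 DP flops/cycle/core (the value the slide's figure implies)
print(rpeak_gflops(8, 3.6, 32))   # -> 921.6, i.e. ~922 GFlops
```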

HPC Components: Local Memory

Memory hierarchy, from registers to disk (each level larger, slower and cheaper than the one before):
  → Registers: ~500 bytes, sub-ns access
  → Caches (L1/L2/L3, SRAM): 64 KB to 8 MB; roughly 1-2, 10 and 20 cycles respectively
  → Memory (DRAM): ~1 GB, hundreds of cycles
  → Disk: ~1 TB, tens of thousands of cycles

Typical disk figures:
  SSD (SATA3): R/W 550 MB/s; 100,000 IOPS; ~450 €/TB
  HDD (SATA3 @ 7.2 krpm): R/W 227 MB/s; 85 IOPS; ~54 €/TB
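The SSD/HDD rates above translate directly into wall-clock time for bulk reads. A quick sketch, streaming 1 TB sequentially at the quoted rates (ideal, sustained throughput assumed):

```python
def read_hours(size_bytes, rate_mb_s):
    """Ideal sequential read time in hours at a given MB/s rate."""
    return size_bytes / (rate_mb_s * 1e6) / 3600

TB = 10**12
print(read_hours(TB, 550))  # SSD (SATA3) at 550 MB/s: ~0.51 h
print(read_hours(TB, 227))  # HDD (SATA3) at 227 MB/s: ~1.22 h
```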

HPC Components: Interconnect

latency: time to send a minimal (0-byte) message from A to B
bandwidth: maximum amount of data communicated per unit of time

Technology             Effective Bandwidth       Latency
Gigabit Ethernet       1 Gb/s    (125 MB/s)      40 to 300 µs
10 Gigabit Ethernet    10 Gb/s   (1.25 GB/s)     4 to 5 µs
Infiniband QDR         40 Gb/s   (5 GB/s)        1.29 to 2.6 µs
Infiniband EDR         100 Gb/s  (12.5 GB/s)     0.61 to 1.3 µs
Infiniband HDR         200 Gb/s  (25 GB/s)       0.5 to 1.1 µs
100 Gigabit Ethernet   100 Gb/s  (12.5 GB/s)     30 µs
Intel Omnipath         100 Gb/s  (12.5 GB/s)     0.9 µs

Top500 interconnect share: 40.8% 10G Ethernet, 32.6% Infiniband, 13.4% custom, 7% Omnipath, 4.8% Gigabit Ethernet, 1.4% proprietary.

[Source: www.top500.org, Nov. 2017]
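Latency and bandwidth as defined above combine into a simple first-order transfer model, t(n) = latency + n / bandwidth: small messages are latency-bound, large ones bandwidth-bound. A sketch using the table's Infiniband EDR figures:

```python
def transfer_time_us(msg_bytes, latency_us, bandwidth_gb_s):
    """First-order model: time = latency + size / bandwidth, in microseconds."""
    return latency_us + msg_bytes / (bandwidth_gb_s * 1e9) * 1e6

# Infiniband EDR from the table: ~0.61 us latency, 12.5 GB/s bandwidth
print(transfer_time_us(64, 0.61, 12.5))     # 64 B message: latency dominates
print(transfer_time_us(10**7, 0.61, 12.5))  # 10 MB message: bandwidth dominates
```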



Network Topologies

Direct vs. indirect interconnect
  → direct: each network node attaches to at least one compute node
  → indirect: compute nodes are attached only at the edge of the network; many routers connect only to other routers

Main HPC topologies:

CLOS network / fat trees [indirect]
  → can be fully non-blocking (1:1) or blocking (x:1)
  → typically enables the best performance: non-blocking bandwidth, lowest network latency

Mesh or 3D torus [direct]
  → blocking network, cost-effective for systems at scale
  → great performance for applications with locality
  → simple expansion for future growth

HPC Components: Operating System

Exclusively Linux-based (really: 100%)
Reasons:
  → stability
  → development flexibility

[Source: www.top500.org, Nov. 2017: 100% Linux]

HPC Components: Software Stack

Remote connection to the platform: SSH
Identity management / SSO: LDAP, Kerberos, IPA...
Resource management: job/batch scheduler
  → SLURM, OAR, PBS, MOAB/Torque...
(Automatic) node deployment:
  → FAI, Kickstart, Puppet, Chef, Ansible, Kadeploy...
(Automatic) user software management:
  → EasyBuild, Environment Modules, Lmod
Platform monitoring:
  → Nagios, Icinga, Ganglia, Foreman, Cacti, Alerta...

[Big] Data Management

Storage architectural classes & I/O layers:
  → DAS (Direct-Attached Storage): disks (SATA, SAS, Fibre Channel) reached directly through a DAS interface
  → NAS (Network-Attached Storage): a file system exported over the network through a NAS interface (NFS, CIFS, AFP...)
  → SAN (Storage Area Network): block-level access (iSCSI, Fibre Channel...) over a dedicated network, with a [distributed] file system on top

[Big] Data Management: Disk Enclosures

≃120 k€ per enclosure, 48 to 60 disks (4U)
  → incl. redundant (i.e. 2) RAID controllers (master/slave)


[Big] Data Management: File Systems

File System (FS): a logical manner to store, organize, manipulate & access data

(local) Disk FS: FAT32, NTFS, HFS+, ext{3,4}, {x,z,btr}fs...
  → manage data on permanent storage devices
  → poor performance; read: 100 to 400 MB/s, write: 10 to 200 MB/s


[Big] Data Management: File Systems

Networked FS: NFS, CIFS/SMB, AFP
  → disk access from remote nodes via network access
  → poorer performance for HPC jobs, especially parallel I/O:
      read: only 381 MB/s on a system capable of 740 MB/s (16 tasks)
      write: only 90 MB/s on a system capable of 400 MB/s (4 tasks)

[Source: LISA'09, Ray Paden: "How to Build a Petabyte Sized Storage System"]

[scale-out] NAS, aka appliances (OneFS...)
  → focus on CIFS, NFS
  → integrated hardware/software
  → Ex: EMC (Isilon), IBM (SONAS), DDN...

[Big] Data Management: File Systems

Basic clustered FS: GPFS
  → file access is parallel
  → file system overhead operations are distributed and done in parallel: no metadata servers
  → file clients access file data through file servers via the LAN

[Big] Data Management: File Systems

Multi-component clustered FS: Lustre, Panasas
  → file access is parallel
  → file system overhead operations run on dedicated components: metadata servers (Lustre) or director blades (Panasas)
  → multi-component architecture
  → file clients access file data through file servers via the LAN


[Big] Data Management: FS Summary

File System (FS): a logical manner to store, organize & access data
  → (local) Disk FS: FAT32, NTFS, HFS+, ext4, {x,z,btr}fs...
  → Networked FS: NFS, CIFS/SMB, AFP
  → Parallel/Distributed FS: SpectrumScale/GPFS, Lustre; the typical FS for HPC / HTC (High Throughput Computing)

Main characteristic of parallel/distributed file systems: capacity and performance increase with the number of servers.

Name            Type                       Read* [GB/s]   Write* [GB/s]
ext4            Disk FS                    0.426          0.212
nfs             Networked FS               0.381          0.090
gpfs (iris)     Parallel/Distributed FS    11.25          9.46
lustre (iris)   Parallel/Distributed FS    12.88          10.07
gpfs (gaia)     Parallel/Distributed FS    7.74           6.524
lustre (gaia)   Parallel/Distributed FS    4.5            2.956

* maximum random read/write, per IOZone or IOR measures, using concurrent nodes for networked FS.


HPC Components: Data Center

Definition (Data Center): a facility to house computer systems and associated components
  → basic storage component: the rack (height: 42 RU)

Challenges: power (UPS, batteries), cooling, fire protection, security

Power/heat dissipation per rack:
  → HPC computing racks: 30-120 kW
  → storage racks: 15 kW
  → interconnect racks: 5 kW

Various cooling technologies:
  → airflow
  → direct liquid cooling, immersion...

Power Usage Effectiveness: PUE = Total facility power / IT equipment power
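The PUE ratio above is straightforward to compute. A minimal sketch with made-up figures (a hypothetical facility drawing 1500 kW in total for 1200 kW of IT load):

```python
def pue(total_facility_power_kw, it_equipment_power_kw):
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_power_kw / it_equipment_power_kw

# Hypothetical figures: 1500 kW total draw, 1200 kW of IT load
print(pue(1500, 1200))   # -> 1.25; a value of 1.0 would mean zero overhead
```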

Software/Modules Management

https://hpc.uni.lu/users/software/

Based on Environment Modules / Lmod
  → a convenient way to dynamically change a user's environment ($PATH etc.)
  → permits easy loading of software through the module command

Currently on UL HPC:
  → > 200 software packages, in multiple versions, within 18 categories
  → reworked software set for the iris cluster, now deployed everywhere: RESIF v2.0, allowing [real] semantic versioning of released builds
  → hierarchical organization; Ex: toolchain/{foss,intel}

$> module avail                                    # List available modules
$> module load <category>/<software>[/<version>]


Software/Modules Management

Key module variable: $MODULEPATH, where to look for modules
  → altered with module use <path>. Ex:

export EASYBUILD_PREFIX=$HOME/.local/easybuild
export LOCAL_MODULES=$EASYBUILD_PREFIX/modules/all
module use $LOCAL_MODULES

Main module commands:

Command                          Description
module avail                     List all modules available to be loaded
module spider <pattern>          Search among available modules (Lmod only)
module load <mod1> [mod2...]     Load a module
module unload <module>           Unload a module
module list                      List loaded modules
module purge                     Unload all modules (purge)
module display <module>          Display what a module does
module use <path>                Prepend a directory to $MODULEPATH
module unuse <path>              Remove a directory from $MODULEPATH

Software/Modules Management

http://hpcugent.github.io/easybuild/

EasyBuild: an open-source framework to (automatically) build scientific software.
Why? "Could you please install this software on the cluster?"
  → scientific software is often difficult to build: non-standard build tools, incomplete build procedures, hardcoded parameters and/or poor/outdated documentation
  → EasyBuild helps to facilitate this task:
      consistent software build and installation framework
      includes a testing step that helps validate builds
      automatically generates Lmod modulefiles

$> module use $LOCAL_MODULES
$> module load tools/EasyBuild
$> eb -S Spark                                   # Search for recipes for a given software
$> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -Dr     # Dry-run install
$> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -r      # Install, resolving dependencies


[Big] Data Management in HPC Environment: Overview and Challenges

Data Intensive Computing

Data volumes are increasing massively
  → cluster and storage capacities are increasing massively too

Disk speeds are not keeping pace, and seek speeds are even worse than read/write speeds.


Speed Expectations on Data Transfer

http://fasterdata.es.net/

How long does it take to transfer 1 TB of data across various networks?

Network     Time
10 Mbps     300 hrs (12.5 days)
100 Mbps    30 hrs
1 Gbps      3 hrs
10 Gbps     20 minutes

(Again) small I/Os really kill performance
  → Ex: transferring 80 TB for the backup of ecosystem_biology
  → same rack, 10 Gb/s: after 4 weeks, only 63 TB transferred...
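The table's figures follow from time = size / rate; a quick sanity check of the ideal (zero-overhead) times, which come out somewhat below the table's values since real transfers lose throughput to protocol overhead and latency:

```python
def transfer_hours(size_bytes, rate_mbps):
    """Ideal transfer time in hours for a given link speed in Mbit/s."""
    rate_bytes_per_s = rate_mbps * 1e6 / 8   # Mbit/s -> bytes/s
    return size_bytes / rate_bytes_per_s / 3600

one_tb = 1e12
for rate in (10, 100, 1000, 10000):
    print(f"{rate:>6} Mbps: {transfer_hours(one_tb, rate):8.2f} h")
# 10 Mbps gives ~222 h ideal; the table's 300 hrs reflects real-world throughput
```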


Storage Performances: GPFS

[Figure: GPFS storage performance]

slide-58
SLIDE 58

[Big] Data Management in HPC Environment: Overview and Challenges

Storage Performances: Lustre


slide-59
SLIDE 59

[Big] Data Management in HPC Environment: Overview and Challenges

Storage Performances

Based on IOR and IOZone, reference I/O benchmarks: Read

↪ tests performed in 2013

[Figure: read I/O bandwidth (MiB/s) vs. number of threads (64 to 65536) for SHM/Bigmem, Lustre/Gaia, NFS/Gaia, SSD/Gaia, Hard Disk/Chaos]

slide-60
SLIDE 60

[Big] Data Management in HPC Environment: Overview and Challenges

Storage Performances

Based on IOR and IOZone, reference I/O benchmarks: Write

↪ tests performed in 2013

[Figure: write I/O bandwidth (MiB/s) vs. number of threads (64 to 32768) for SHM/Bigmem, Lustre/Gaia, NFS/Gaia, SSD/Gaia, Hard Disk/Chaos]

slide-61
SLIDE 61

[Big] Data Management in HPC Environment: Overview and Challenges

Understanding Your Storage Options

Where can I store and manipulate my data?

Shared storage

↪ NFS: not scalable, ≃ 1.5 GB/s (R), O(100 TB)
↪ GPFS/SpectrumScale: scalable, ≃ 10-500 GB/s (R), O(10 PB)
↪ Lustre: scalable, ≃ 10-500 GB/s (R), O(10 PB)

Local storage

↪ local file system (/tmp): O(1 TB)

  • over HDD ≃ 100 MB/s, over SSD ≃ 400 MB/s

↪ RAM (/dev/shm): ≃ 30 GB/s (R), O(100 GB)

Distributed storage

↪ HDFS, Ceph, GlusterFS, BeeGFS, ...: scalable, ≃ 1 GB/s

⇒ In all cases: small I/Os really kill storage performance

slide-62
SLIDE 62

[Big] Data Management in HPC Environment: Overview and Challenges

Summary

1 Introduction
    HPC & BD Trends
    Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
    Performance Overview in Data Transfer
    Data Transfer in Practice
    Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
    Apache Hadoop
    Batch vs Stream vs Hybrid Processing
    Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
    Python Libraries
    R - Statistical Computing
5 Conclusion

slide-63
SLIDE 63

[Big] Data Management in HPC Environment: Overview and Challenges

Data Transfer in Practice

$> wget [-O <output>] <url>    # download file from <url>
$> curl [-o <output>] <url>    # download file from <url>

Transfer from FTP/HTTP[S]: wget or (better) curl

↪ can also serve to send HTTP POST requests
↪ supports HTTP cookies (useful for JDK download)

slide-64
SLIDE 64

[Big] Data Management in HPC Environment: Overview and Challenges

Data Transfer in Practice

$> scp [-P <port>] <src> <user>@<host>:<path>
$> rsync -avzu [-e 'ssh -p <port>'] <src> <user>@<host>:<path>

[Secure] transfer between two remote machines over SSH

↪ scp or (better) rsync (transfers only what is required)

Assumes you have understood and configured SSH appropriately!



slide-67
SLIDE 67

[Big] Data Management in HPC Environment: Overview and Challenges

SSH: Secure Shell

Ensure secure connection to remote (UL) server

↪ establish an encrypted tunnel using asymmetric keys

Public id_rsa.pub vs. Private id_rsa (without .pub); typically on a non-standard port (Ex: 8022)

limits script-kiddie attacks

Basic rule: 1 machine = 1 key pair

↪ the private key is SECRET: never send it to anybody

Can be protected with a passphrase

SSH is used as a secure backbone channel for many tools

↪ Remote shell, i.e. remote command line
↪ File transfer: rsync, scp, sftp
↪ versioning synchronization (svn, git), GitHub, GitLab etc.

Authentication:

↪ password (disable if possible)
↪ (better) public key authentication


slide-70
SLIDE 70

[Big] Data Management in HPC Environment: Overview and Challenges

SSH: Public Key Authentication

[Figure: Server (Remote Machine): ~/.ssh/authorized_keys in the remote homedir knows the granted (public) keys; /etc/ssh/ holds the SSH server config (sshd_config, ssh_host_rsa_key, ssh_host_rsa_key.pub). Client (Local Machine): ~/.ssh/ in the local homedir owns the local key pair (id_rsa.pub, id_rsa) and logs known servers in known_hosts]


slide-72
SLIDE 72

[Big] Data Management in HPC Environment: Overview and Challenges

SSH: Public Key Authentication

Server (Remote Machine): ~/.ssh/authorized_keys knows the granted (public) key
Client (Local Machine): ~/.ssh/id_rsa.pub, id_rsa; the client owns the local private key

1. Client initiates the connection
2. Server creates a random challenge and "encrypts" it using the public key
3. Client solves the challenge using the private key and returns the response
4. Server allows the connection iff response == challenge

Restrict to public key authentication in /etc/ssh/sshd_config:

PermitRootLogin no
# Disable Passwords
PasswordAuthentication no
ChallengeResponseAuthentication no
# Enable Public key auth.
RSAAuthentication yes
PubkeyAuthentication yes
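The challenge-response steps above can be sketched with textbook RSA. This is a toy illustration only: the tiny primes, the raw modular exponentiation and the variable names are illustrative assumptions; real SSH uses full-size padded keys and signatures over the session exchange.

```python
import random

# Toy RSA key pair (tiny primes for illustration only -- never use in practice)
p, q = 61, 53
n, e = p * q, 17                      # public key (n, e)
d = pow(e, -1, (p - 1) * (q - 1))     # private exponent d

# 2. the server creates a random challenge, "encrypts" it with the public key
challenge = random.randrange(2, n)
encrypted = pow(challenge, e, n)

# 3. the client solves it using the private key and returns the response
response = pow(encrypted, d, n)

# 4. the server allows the connection iff response == challenge
assert response == challenge
```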




slide-76
SLIDE 76

[Big] Data Management in HPC Environment: Overview and Challenges

SSH Setup on Linux / Mac OS

OpenSSH natively supported; configuration directory: ~/.ssh/

↪ package openssh-client (Debian-like) or ssh (Redhat-like)

SSH Key Pair (public vs private) generation: ssh-keygen

↪ specify a strong passphrase

protects your private key from being stolen, i.e. impersonation
drawback: the passphrase must be typed to use your key; use ssh-agent

DSA and RSA 1024 bit are deprecated now!

$> ssh-keygen -t rsa -b 4096 -o -a 100    # 4096-bit RSA
(better) $> ssh-keygen -t ed25519 -o -a 100    # new sexy Ed25519

Private (identity) key: ~/.ssh/id_{rsa,ed25519}
Public key: ~/.ssh/id_{rsa,ed25519}.pub

slide-77
SLIDE 77

[Big] Data Management in HPC Environment: Overview and Challenges

SSH Setup on Windows

Use MobaXterm!

http://mobaxterm.mobatek.net/

↪ [tabbed] Sessions management
↪ X11 server w. enhanced X extensions
↪ Graphical SFTP browser
↪ SSH gateway / tunnels wizards
↪ [remote] Text Editor
↪ ...

slide-78
SLIDE 78

[Big] Data Management in HPC Environment: Overview and Challenges

Summary

1 Introduction
    HPC & BD Trends
    Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
    Performance Overview in Data Transfer
    Data Transfer in Practice
    Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
    Apache Hadoop
    Batch vs Stream vs Hybrid Processing
    Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
    Python Libraries
    R - Statistical Computing
5 Conclusion


slide-80
SLIDE 80

[Big] Data Management in HPC Environment: Overview and Challenges

Sharing Code and Data

Before doing Big Data, manage and version "normal" data correctly

What kinds of systems are available?

Good: NAS, Cloud

↪ NextCloud, Dropbox, {Google,iCloud} Drive, Figshare...

Better: Version Control Systems (VCS)

↪ SVN, Git and Mercurial

Best: Version Control Systems on the Public/Private Cloud

↪ GitHub, Bitbucket, GitLab

Which one?

↪ Depends on the level of privacy you expect

...but you probably already know these tools

↪ Few handle GB files... or only with Git LFS (Large File Storage)


slide-82
SLIDE 82

[Big] Data Management in HPC Environment: Overview and Challenges

Centralized VCS - CVS, SVN

[Figure: a central VCS server holds the Version Database (Versions 1-3); Computers A and B each check out a single file version from the server]

slide-83
SLIDE 83

[Big] Data Management in HPC Environment: Overview and Challenges

Distributed VCS - Git

[Figure: the server computer and Computers A and B each hold a full Version Database (Versions 1-3) alongside their working files]

Everybody has the full history of commits

slide-84
SLIDE 84

[Big] Data Management in HPC Environment: Overview and Challenges

Tracking changes (most VCS)

[Figure: delta storage; check-in C1 stores files A, B, C, and each later check-in (C2-C5) stores only the deltas (Δ) of the files that changed over time]


slide-90
SLIDE 90

[Big] Data Management in HPC Environment: Overview and Challenges

Tracking changes (Git)

[Figure: snapshot (DAG) storage, contrasted with delta storage; each Git check-in (C1-C5) records a snapshot of every file (A, B, C), with unchanged files simply referenced again]
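Snapshot storage can be sketched in a few lines of content-addressed bookkeeping. The helper names below (`blob_id`, `commit`, `store`, `history`) are hypothetical, and real Git additionally delta-compresses objects inside packfiles; the point is only that a full snapshot per commit is cheap when identical content hashes to the same blob.

```python
import hashlib

store = {}     # object database: content hash -> blob
history = []   # commits: list of {filename: blob id} trees

def blob_id(content: bytes) -> str:
    # content-addressed: identical content yields an identical id
    return hashlib.sha1(content).hexdigest()

def commit(files: dict) -> None:
    tree = {}
    for name, content in files.items():
        oid = blob_id(content)
        store[oid] = content          # no-op if the blob is already stored
        tree[name] = oid
    history.append(tree)

commit({"A": b"alpha v1", "B": b"beta v1", "C": b"gamma v1"})
commit({"A": b"alpha v2", "B": b"beta v1", "C": b"gamma v1"})  # only A changed

assert history[0]["B"] == history[1]["B"]   # unchanged files share one blob
assert len(store) == 4                      # two versions of A, one B, one C
```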


slide-100
SLIDE 100

[Big] Data Management in HPC Environment: Overview and Challenges

VCS Taxonomy

[Figure: taxonomy of versioning tools by storage model (delta vs. snapshot/DAG) and topology (local / centralized / distributed): rcs, cvs, svn (Subversion), bitkeeper, git, mercurial (hg), bazaar (bzr), plus backup tools such as cp -r, rsync, duplicity, bontmia, backupninja, Time Machine and Mac OS File Versions]

slide-101
SLIDE 101

[Big] Data Management in HPC Environment: Overview and Challenges

Git at the heart of BD

http://git-scm.org

slide-102
SLIDE 102

[Big] Data Management in HPC Environment: Overview and Challenges

So what makes Git so useful?

(almost) Everything is local

everything is fast
every clone is a backup
you work mainly offline

Ultra Fast, Efficient & Robust

Snapshots, not patches (deltas)
Cheap branching and merging

↪ Strong support for thousands of parallel branches

Cryptographic integrity everywhere


slide-104
SLIDE 104

[Big] Data Management in HPC Environment: Overview and Challenges

Other Git features

Git does not delete

↪ Immutable objects; Git generally only adds data
↪ If you mess up, you can usually recover your stuff

Recovery can be tricky though

Git Tools / Extensions

  • cf. Git submodules or subtrees

Introducing git-flow

↪ a workflow with a strict branching model
↪ offers git commands to follow the workflow

$> git flow init
$> git flow feature { start, publish, finish } <name>
$> git flow release { start, publish, finish } <version>


slide-106
SLIDE 106

[Big] Data Management in HPC Environment: Overview and Challenges

Git in practice

Basic Workflow

1. Pull latest changes: git pull
2. Edit files: vim / emacs / subl ...
3. Stage the changes: git add
4. Review your changes: git status
5. Commit the changes: git commit

For cheaters: A Basicerer Workflow

1. Pull latest changes: git pull
2. Edit files: vim / emacs / subl ...
3. Stage & commit all the changes: git commit -a

slide-107
SLIDE 107

[Big] Data Management in HPC Environment: Overview and Challenges

Git Summary

Advice: Commit early, commit often!

↪ commits = save points

Use descriptive commit messages

↪ Do not get out of sync with your collaborators
↪ Commit the sources, not the derived files

Not covered here (for lack of time)

↪ which does not mean you should not dig into it!
↪ Resources:

https://git-scm.com/
tutorial: IT/Dev[op]s Army Knives Tools for the Researcher
tutorial: Reproducible Research at the Cloud Era
Using Git-crypt to Protect Sensitive Data

slide-108
SLIDE 108

Big Data Analytics with Hadoop, Spark etc.

Summary

1 Introduction
    HPC & BD Trends
    Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
    Performance Overview in Data Transfer
    Data Transfer in Practice
    Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
    Apache Hadoop
    Batch vs Stream vs Hybrid Processing
    Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
    Python Libraries
    R - Statistical Computing
5 Conclusion



slide-111
SLIDE 111

Big Data Analytics with Hadoop, Spark etc.

What is a Distributed File System?

Straightforward idea: separate logical from physical storage.

↪ Not all files reside on a single physical disk,
↪ or the same physical server,
↪ or the same physical rack,
↪ or the same geographical location, ...

Distributed file system (DFS):

↪ a virtual file system that enables clients to access files
  ...as if they were stored locally.

Major DFS distributions:

↪ NFS: originally developed by Sun Microsystems, started in 1984
↪ AFS/CODA: originally prototypes at Carnegie Mellon University
↪ GFS: Google paper published in 2003, not available outside Google
↪ HDFS: designed after GFS, part of Apache Hadoop since 2006


slide-113
SLIDE 113

Big Data Analytics with Hadoop, Spark etc.

Distributed File System Architecture?

Master-Slave Pattern

A single (or a few) master node(s) maintain state information about clients.
All client read & write requests go through the global master node.
Ex: GFS, HDFS

Peer-to-Peer Pattern

No global state information; each node may both serve and process data.

slide-114
SLIDE 114

Big Data Analytics with Hadoop, Spark etc.

Google File System (GFS) (2003)

Radically different architecture compared to NFS, AFS and CODA

↪ specifically tailored towards large-scale and long-running analytical processing tasks
↪ over thousands of storage nodes

Basic assumptions:

↪ client nodes (aka chunk servers) may fail at any time! (bugs or hardware failures)
↪ special tools for monitoring, periodic checks
↪ large files (multiple GBs or even TBs) are split into 64 MB chunks
↪ data modifications are mostly append operations to files
↪ even the master node may fail at any time!

An additional shadow master provides fallback with read-only data access.

Two types of reads: large sequential reads & small random reads

slide-115
SLIDE 115

Big Data Analytics with Hadoop, Spark etc.

Google File System (GFS) (2003)


slide-116
SLIDE 116

Big Data Analytics with Hadoop, Spark etc.

GFS Consistency Model

Atomic File Namespace Mutations

↪ File creations/deletions are centrally controlled by the master node
↪ Clients typically create and write an entire file,
  then add the file name to the file namespace stored at the master

Atomic Data Mutations

↪ only 1 atomic modification of 1 replica (!) at a time is guaranteed

Stateful Master

↪ Master sends regular heartbeat messages to the chunk servers
↪ Master keeps chunk locations of all files (+ replicas) in memory
↪ locations are not stored persistently...
  but are polled from the clients at startup

Session Semantics

↪ Weak consistency model for file replicas and client caches only
↪ Multiple clients may read and/or write the same file concurrently
↪ The client that last writes to a file wins

slide-117
SLIDE 117

Big Data Analytics with Hadoop, Spark etc.

Fault Tolerance & Fault Detection

Fast Recovery

↪ master & chunk servers can restore their states and (re-)start in seconds,
  regardless of previous termination conditions

Master Replication

↪ the shadow master provides read-only access when the primary master is down;
  it switches back to read/write mode when the primary master is back
↪ the master node does not keep persistent state information about its clients,
  but rather polls clients for their states when started

Chunk Replication & Integrity Checks

↪ each chunk is divided into 64 KB blocks, each with its own 32-bit checksum,
  verified at read and write times
↪ higher replication factors can be configured for more intensively requested chunks (hotspots)
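The per-block integrity check can be sketched as follows. CRC-32 stands in here as an illustrative 32-bit checksum, and the helper names are hypothetical; the source only specifies 64 KB blocks with a 32-bit checksum each, verified on read and write.

```python
import zlib

BLOCK = 64 * 1024   # 64 KB blocks within a chunk

def block_checksums(chunk: bytes) -> list:
    # one 32-bit checksum per 64 KB block of the chunk
    return [zlib.crc32(chunk[i:i + BLOCK])
            for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, sums: list) -> bool:
    # recomputed at read/write time; any mismatch flags a corrupted block
    return block_checksums(chunk) == sums

chunk = bytes(200 * 1024)            # a toy 200 KB chunk spans 4 blocks
sums = block_checksums(chunk)
assert verify(chunk, sums)
assert not verify(b"\x01" + chunk[1:], sums)   # a single flipped byte is caught
```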

slide-118
SLIDE 118

Big Data Analytics with Hadoop, Spark etc.

Map-Reduce

Breaks the processing into two main phases:

1. the map phase
2. the reduce phase

Each phase has key-value pairs as input and output,

↪ the types of which may be chosen by the programmer
↪ the programmer also specifies the map and reduce functions
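The two phases, plus the shuffle between them, can be sketched in miniature with the classic word count. This is a single-process sketch under the stated key-value contract; a real framework runs the map and reduce tasks on different nodes.

```python
from collections import defaultdict

def map_phase(record):
    # map: emit a (word, 1) pair for every word in one input record
    for word in record.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # shuffle: group all values by key, as the framework does across nodes
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: aggregate all values observed for one key
    return key, sum(values)

records = ["the quick brown fox", "the lazy dog"]
pairs = (p for r in records for p in map_phase(r))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["the"])   # 2
```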


slide-120
SLIDE 120

Big Data Analytics with Hadoop, Spark etc.

Hadoop

Initially started as a student project at Yahoo! labs in 2006

↪ Open-source Java implementation of the GFS and MapReduce frameworks

Switched to Apache in 2009. Now consists of three main modules:

1. HDFS: Hadoop Distributed File System
2. YARN: Hadoop job scheduling and resource allocation
3. MapReduce: Hadoop adaptation of the MapReduce principle

Basis for many other open-source Apache toolkits:

↪ PIG/PigLatin: file-oriented data storage & script-based query language
↪ HIVE: distributed SQL-style data warehouse
↪ HBase: distributed key-value store
↪ Cassandra: fault-tolerant distributed database, etc.

HDFS still mostly follows the original GFS architecture.

slide-121
SLIDE 121

Big Data Analytics with Hadoop, Spark etc.

Hadoop Ecosystem


slide-122
SLIDE 122

Big Data Analytics with Hadoop, Spark etc.

Scale-Out Design

HDD streaming speed ~50 MB/s

→ 3 TB ≈ 17.5 hrs
→ 1 PB ≈ 8 months

Scale-out (weak scaling)

→ the FS distributes data on ingest

Seeking is too slow

→ ~10 ms for a seek
→ enough time to read half a megabyte

Batch processing: go through the entire data set in one (or a small number of) passes


slide-123
SLIDE 123

Big Data Analytics with Hadoop, Spark etc.

Combining Results

Each node preprocesses its local data

→ then shuffles its data to a small number of other nodes

Final processing and output are done there


slide-124
SLIDE 124

Big Data Analytics with Hadoop, Spark etc.

Fault Tolerance

Data is also replicated upon ingest
The runtime watches for dead tasks and restarts them on live nodes
Lost blocks are re-replicated


slide-125
SLIDE 125

Big Data Analytics with Hadoop, Spark etc.

Data Distribution: Disk

Hadoop and related architectures handle the hardest part of parallelism for you

→ namely, data distribution

On disk:

→ HDFS distributes and replicates data as it comes in
→ and keeps track of computations local to the data


slide-126
SLIDE 126

Big Data Analytics with Hadoop, Spark etc.

Data Distribution: Network

On the network: MapReduce works in terms of key-value pairs

→ the preprocessing (map) phase ingests data and emits (k, v) pairs
→ the shuffle phase assigns reducers and gets all pairs with the same key onto that reducer
→ the programmer does not have to design communication patterns


slide-127
SLIDE 127

Big Data Analytics with Hadoop, Spark etc.

Reasons for Hadoop's Success

The hardest parts of parallel programming with HPC tools:

→ decomposing the problem, and
→ getting the intermediate data where it needs to go

Hadoop does that for you, automatically, for a wide range of problems

Built a reusable substrate

→ HDFS and the MapReduce layer were very well architected

Enables many higher-level tools: data analysis, machine learning, NoSQL DBs...

→ an extremely productive environment

...Hadoop 2.x (YARN) is now much more than MapReduce


slide-128
SLIDE 128

Big Data Analytics with Hadoop, Spark etc.

The Hadoop Filesystem

HDFS is a distributed parallel filesystem

→ not a general-purpose file system:
  it does not implement POSIX; you cannot just mount it and view files

Access via hdfs dfs commands or programmatic APIs. Security is slowly improving.

$> hdfs dfs -[cmd]
cat chgrp chmod chown copyFromLocal copyToLocal cp du dus expunge get
getmerge ls lsr mkdir moveFromLocal mv put rm rmr setrep stat tail
test text touchz

slide-129
SLIDE 129

Big Data Analytics with Hadoop, Spark etc.

The Hadoop Filesystem

Required to be:

→ able to deal with large files and large amounts of data
→ scalable & reliable in the presence of failures
→ fast at reading contiguous streams of data
→ only needs to write to new files or append to files
→ requires only commodity hardware

As a result:

→ replication
→ supports mainly high bandwidth, not especially low latency
→ no caching (what would be the point, if primarily for streaming reads?)
→ poor support for seeking around files
→ poor support for zillions of files
→ have to use a separate API to see the filesystem
→ modelled after the Google File System (2003 GFS paper)


slide-130
SLIDE 130

Big Data Analytics with Hadoop, Spark etc.

Hadoop vs HPC

HDFS is a block-based FS

→ a file is broken into blocks
→ these blocks are distributed across nodes

Blocks are large:

→ 64 MB is the default
→ many installations use 128 MB or larger

Large block size:

→ the time to stream a block is much larger than the disk time to access the block

# List all blocks in all files:
$> hdfs fsck / -files -blocks
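The block-based layout can be illustrated with a short sketch (a hypothetical helper for illustration, not part of any Hadoop API):

```python
def split_into_blocks(file_size, block_size=64 * 1024 * 1024):
    """Divide a file of `file_size` bytes into HDFS-style fixed-size
    blocks; only the final block may be smaller than block_size."""
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append((offset, min(block_size, file_size - offset)))
        offset += block_size
    return blocks

# A 130 MB file with 64 MB blocks -> blocks of 64 MB, 64 MB and 2 MB,
# each of which HDFS would place (and replicate) on different nodes.
```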


slide-131
SLIDE 131

Big Data Analytics with Hadoop, Spark etc.

Datanodes and Namenode

Two types of nodes in the filesystem:

1. Namenode
→ stores all metadata / block locations in memory
→ metadata updates are stored to a persistent journal

2. Datanodes
→ store/retrieve blocks for clients and the namenode

Newer versions of Hadoop: federation

→ i.e. separate namenodes for /user, /data...
→ High-Availability namenode pairs


slide-132
SLIDE 132

Big Data Analytics with Hadoop, Spark etc.

Writing a file

Writing a file is a multiple-stage process:

→ create the file
→ get nodes for the blocks
→ start writing
→ data nodes coordinate replication
→ get acks back (while writing)
→ complete



slide-138
SLIDE 138

Big Data Analytics with Hadoop, Spark etc.

Where to Replicate?

Tradeoff in choosing replication locations:

→ closer: faster updates, less network bandwidth
→ further: better failure tolerance

Default strategy:

1. first copy in a different location on the same node
2. second copy on a different rack (switch)
3. third copy on the same rack as the second, but on a different node

The strategy is configurable

→ you need to configure the Hadoop file system to know the location of nodes
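The default placement strategy above can be sketched as follows (a simplified illustration; the function and the `nodes_by_rack` layout are hypothetical, not the actual HDFS implementation):

```python
import random

def choose_replica_targets(writer, writer_rack, nodes_by_rack):
    """Sketch of the default 3-replica placement: first copy on the
    writer's node, second on a different rack, third on the same rack
    as the second but on a different node."""
    targets = [writer]
    # Second replica: any node on a rack other than the writer's
    other_racks = [r for r in nodes_by_rack if r != writer_rack]
    second_rack = random.choice(other_racks)
    second = random.choice(nodes_by_rack[second_rack])
    # Third replica: same rack as the second, different node
    third = random.choice([n for n in nodes_by_rack[second_rack] if n != second])
    targets += [second, third]
    return targets
```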


slide-139
SLIDE 139

Big Data Analytics with Hadoop, Spark etc.

Reading a file

Reading a file:

→ open call
→ get the block locations
→ read from a replica



slide-142
SLIDE 142

Big Data Analytics with Hadoop, Spark etc.

Hadoop / HDFS Deployment

Target: a Hadoop multi-node cluster

→ to exploit HPC nodes


slide-143
SLIDE 143

Big Data Analytics with Hadoop, Spark etc.

Configuring Hadoop Cluster

Templates in ${HADOOP_HOME}/etc/hadoop

→ Set the NameNode location (the HDFS master) in core-site.xml
  fs.defaultFS: NameNode location

→ Set the paths for HDFS in hdfs-site.xml
  dfs.namenode.name.dir: namespace/transaction logs
  dfs.datanode.data.dir: local storage of blocks
  dfs.replication: how many times data is replicated in the cluster

→ Set YARN as the job scheduler in mapred-site.xml
  mapred.job.tracker: JobTracker (MapReduce master)

→ conf/masters (master only): secondary NameNodes
  the primary NameNode is where bin/start-dfs.sh is executed

→ conf/slaves (master only): list of Hadoop slave daemons (DataNodes and TaskTrackers)

→ Configure memory allocation in yarn-site.xml


slide-144
SLIDE 144

Big Data Analytics with Hadoop, Spark etc.

Running Hadoop Cluster

1. Format the HDFS filesystem via the NameNode

→ this simply initializes the directory specified by dfs.name.dir

$> hadoop namenode -format

2. Start the multi-node cluster (on the NameNode)

→ Start the HDFS daemons: $HADOOP_HOME/bin/start-dfs.sh
  Monitor your HDFS cluster: hdfs dfsadmin -report
→ Start the MapReduce daemons: $HADOOP_HOME/bin/start-mapred.sh
→ Start YARN: $HADOOP_HOME/bin/start-yarn.sh
  Monitor YARN: yarn node -list


slide-145
SLIDE 145

Big Data Analytics with Hadoop, Spark etc.

Using HDFS

Once the file system is up and running,

→ ...you can copy files back and forth

$> hadoop fs -{get|put|copyFromLocal|copyToLocal} [...]

The default directory is /user/${username}

→ there is nothing like a cd

$> hdfs dfs -mkdir /home/vagrant/hdfs-test
$> hdfs dfs -ls /home/vagrant
$> hdfs dfs -ls /home/vagrant/hdfs-test
$> hdfs dfs -put data.dat /home/vagrant/hdfs-test
$> hdfs dfs -ls /home/vagrant/hdfs-test


slide-146
SLIDE 146

Big Data Analytics with Hadoop, Spark etc.

Using HDFS

In general, the data files you send to HDFS will be large

→ or else, why bother with Hadoop?

You do not want to be constantly copying back and forth

→ view and append in place

Several APIs for accessing HDFS:

→ Java, C++, Python


slide-147
SLIDE 147

Big Data Analytics with Hadoop, Spark etc.

Back to Map-Reduce

Map processes one element at a time

→ and emits results as (key, value) pairs

All results with the same key are gathered onto the same reducer

→ reducers process a list of values
→ and emit results as (key, value) pairs


slide-148
SLIDE 148

Big Data Analytics with Hadoop, Spark etc.

Map

All coupling is done during the shuffle phase

→ an embarrassingly parallel task is "all map"

Take input, map it to output, done. Famous case:

→ the NYT using Hadoop to convert 11 million image files to PDFs
  almost a pure serial farm job


slide-149
SLIDE 149

Big Data Analytics with Hadoop, Spark etc.

Reduce

Reducing gives the coupling. In the case of the NYT task:

→ not quite embarrassingly parallel:
  images come from multi-page articles; convert a page at a time, then gather images with the same article id onto a node for conversion


slide-150
SLIDE 150

Big Data Analytics with Hadoop, Spark etc.

Shuffle

The shuffle is part of the Hadoop magic

→ by default, keys are hashed
→ the hash space is partitioned between reducers

On the reducer:

→ the gathered (k, v) pairs from the mappers are sorted by key,
→ then merged together by key
→ the reducer then runs on one (k, [v]) tuple at a time

You can supply your own partitioner

→ e.g. to assign similar keys to the same node
→ the reducer still only sees one (k, [v]) tuple at a time
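The default hash-based shuffle can be sketched in a few lines of Python (a single-process illustration, not Hadoop's actual implementation):

```python
from collections import defaultdict

def partition(key, num_reducers):
    # Default behaviour: hash the key and split the hash space among reducers
    return hash(key) % num_reducers

def shuffle(mapper_outputs, num_reducers):
    """Route each (k, v) pair to a reducer via the partitioner, then sort
    and merge by key so each reducer sees one (k, [v]) tuple at a time."""
    per_reducer = [defaultdict(list) for _ in range(num_reducers)]
    for k, v in mapper_outputs:
        per_reducer[partition(k, num_reducers)][k].append(v)
    # Each reducer processes its keys in sorted order
    return [sorted(groups.items()) for groups in per_reducer]
```

A custom partitioner only changes which reducer a key lands on; the per-reducer sort and merge stay the same.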


slide-151
SLIDE 151

Big Data Analytics with Hadoop, Spark etc.

Example: Wordcount

Was used as an example in the original MapReduce paper

→ now basically the "hello world" of MapReduce

Problem description: given a set of documents,

→ count the occurrences of words within these documents


slide-152
SLIDE 152

Big Data Analytics with Hadoop, Spark etc.

Example: Wordcount

How would you do this with a huge document?

→ Each time you see a word:
  if the word is already in your list, add a tick mark beside it,
  otherwise add the new word with one tick

...but this is hard to parallelize

→ problem when updating the list


slide-153
SLIDE 153

Big Data Analytics with Hadoop, Spark etc.

Example: Wordcount

The MapReduce way

→ all the hard work is done automatically by the shuffle

Map:

→ just emit a 1 for each word you see

Shuffle:

→ assigns keys (words) to each reducer,
→ sends the (k, v) pairs to the appropriate reducer

Reduce:

→ just sum up the ones
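The whole wordcount pipeline fits in a few lines of plain Python (a single-process sketch for illustration; a real job would run distributed under Hadoop or Spark):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word seen
    for word in document.split():
        yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: gather all values with the same key onto the same "reducer"
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups.items()

def reduce_phase(key, values):
    # Reduce: just sum up the ones
    return (key, sum(values))

docs = ["the quick brown fox", "the lazy dog"]
pairs = (p for d in docs for p in map_phase(d))
counts = dict(reduce_phase(k, vs) for k, vs in shuffle_phase(pairs))
# counts["the"] == 2, every other word appears once
```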



slide-156
SLIDE 156

Big Data Analytics with Hadoop, Spark etc.

Summary

1. Introduction: HPC & BD Trends; Reviewing the Main HPC and BD Components
2. [Big] Data Management in HPC Environment: Overview and Challenges; Performance Overview in Data Transfer; Data Transfer in Practice; Sharing Data
3. Big Data Analytics with Hadoop, Spark etc.: Apache Hadoop; Batch vs Stream vs Hybrid Processing; Apache Spark
4. [Brief] Overview of other useful Data Analytics frameworks: Python Libraries; R – Statistical Computing
5. Conclusion


slide-157
SLIDE 157

Big Data Analytics with Hadoop, Spark etc.

Hadoop 0.1x

The original Hadoop was basically HDFS + infrastructure for MapReduce

→ a very faithful implementation of the Google MapReduce paper
→ job tracking and orchestration were all very tied to the M/R model

This made it difficult to run other sorts of jobs


slide-158
SLIDE 158

Big Data Analytics with Hadoop, Spark etc.

YARN and Hadoop 2

YARN: Yet Another Resource Negotiator

→ looks a lot more like a cluster scheduler/resource manager
→ allows arbitrary jobs

Allows for new compute and data tools



slide-161
SLIDE 161

Big Data Analytics with Hadoop, Spark etc.

Batch vs Stream vs Hybrid Processing

Batch processing: Hadoop

→ operates over a large, static dataset
→ returns the result when the computation is complete

Stream processing: Storm, Samza

→ computes over streamed data (as it enters the system)
→ can handle a nearly unlimited amount of data
  ...but only processes one or very few items at a time
  ...with minimal state maintained between records

Hybrid processing: Spark, Flink
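The stream model above can be sketched with a Python generator (an illustrative single-process sketch in the spirit of Storm/Samza, not either framework's API):

```python
from collections import defaultdict

def stream_wordcount(records):
    """Process one record at a time, emitting an updated count immediately
    instead of waiting for a complete batch; only the running counts
    (minimal state) are kept between records."""
    counts = defaultdict(int)
    for word in records:
        counts[word] += 1
        yield word, counts[word]   # result available as soon as data arrives
```

Contrast with batch: a batch job would only return the final counts after a full pass over the dataset.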


slide-162
SLIDE 162

Big Data Analytics with Hadoop, Spark etc.

Batch processing Model (Hadoop)

Typical workflow:

→ read the dataset from the HDFS filesystem
→ divide the dataset into chunks distributed among the available nodes
→ (map) apply the computation on each node to its subset of the data
  intermediate results are written back to HDFS
→ redistribute the intermediate results to group them by key
→ (reduce) combine the values of each key by summarizing the results calculated by the individual nodes
→ write the calculated final results back to HDFS

Advantages: can handle BIG datasets; runs on inexpensive hardware
Drawbacks: slow


slide-163
SLIDE 163

Big Data Analytics with Hadoop, Spark etc.

Stream processing Model

Processing is event-based

→ ...and does not "end" until explicitly stopped

Results are immediately available... and updated as new data arrives
Near real-time processing
Tied to Kafka messaging


slide-164
SLIDE 164

Big Data Analytics with Hadoop, Spark etc.

Hybrid processing Model

Can handle both batch and stream workloads

Spark: focuses primarily on speeding up batch processing workloads

→ full in-memory computation and processing optimization

Flink: for complex stream processing

→ similar characteristics (in-memory engine...) to Spark, yet:
  real streaming processing (Spark treats streams as micro-batches)
  better support for cyclical and iterative processing
  lower latency and higher throughput
  enhanced memory management (pages out to local disk if RAM is full)



slide-166
SLIDE 166

Big Data Analytics with Hadoop, Spark etc.

Apache Spark

Next-generation batch processing framework

→ ...with stream processing capabilities

Built using many of the principles of Hadoop's MapReduce engine

→ everything you can do in Hadoop, you can also do in Spark

In contrast to Hadoop:

→ Spark's computation paradigm is not just the MapReduce job
→ key feature: in-memory analyses
→ based on a Directed Acyclic Graph (DAG) representation
→ a multi-stage, in-memory dataflow graph based on Resilient Distributed Datasets (RDDs)


slide-167
SLIDE 167

Big Data Analytics with Hadoop, Spark etc.

Apache Spark

Spark is implemented in Scala, running in a Java Virtual Machine

→ Spark supports different languages for application development:
  Java, Scala, Python, R, and SQL

Originally developed in the AMPLab (UC Berkeley) from 2009

→ donated to the Apache Software Foundation in 2013
→ top-level project as of 2014

Latest release: 2.4.0 (Nov 2018)

Key features:

→ impressive speed
→ simpler directed acyclic graph model
→ rich set of libraries: machine learning, graph processing, SQL...


slide-168
SLIDE 168

Big Data Analytics with Hadoop, Spark etc.

Main Building Blocks

The Spark Core API provides the general execution layer on top of which all other functionality is built. Four higher-level components (in the Spark ecosystem):

1. Spark SQL (formerly Shark)
2. Spark Streaming, to build scalable fault-tolerant streaming applications
3. MLlib, for machine learning
4. GraphX, the API for graphs and graph-parallel computation



slide-170
SLIDE 170

Big Data Analytics with Hadoop, Spark etc.

Resilient Distributed Dataset (RDD)

Default fault tolerance in MapReduce: output everything to disk

→ ...and always. Effective, but extremely costly.
→ How to maintain fault tolerance without sacrificing in-memory performance for truly large-scale analyses?

Solution:

→ record the lineage of an RDD (think version control)
→ if a container or node goes down, reconstruct the RDD from scratch:
  either from the beginning,
  or from (occasional) checkpoints, which the user has some control over
→ the user can suggest caching the current state of an RDD in memory, persisting it to disk, or both
→ you can also save an RDD to disk, or replicate partitions across nodes for other forms of fault tolerance


slide-171
SLIDE 171

Big Data Analytics with Hadoop, Spark etc.

Resilient Distributed Dataset (RDD)

Resilient Distributed Dataset (RDD)

→ partitioned collections across nodes: lists, maps...
→ a set of operations on these RDDs: map, filter, join, reduce

Effective fault tolerance with RDDs:

→ storing and reconstructing lineage
→ replication (optional)
→ persistence to disk (optional)
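Lineage-based recovery can be sketched with a toy class (purely illustrative — this is not the Spark API): each dataset remembers its parent and the transformation that produced it, so a lost partition is recomputed rather than read back from disk.

```python
class ToyRDD:
    """Toy sketch of lineage-based fault tolerance."""

    def __init__(self, data=None, parent=None, transform=None):
        self._cache = data          # None until computed/cached
        self.parent = parent        # lineage: where this dataset came from
        self.transform = transform  # how to rebuild it from the parent

    def map(self, f):
        return ToyRDD(parent=self, transform=lambda xs: [f(x) for x in xs])

    def filter(self, pred):
        return ToyRDD(parent=self, transform=lambda xs: [x for x in xs if pred(x)])

    def compute(self):
        # Replay the lineage from the last cached ancestor
        if self._cache is not None:
            return self._cache
        return self.transform(self.parent.compute())

base = ToyRDD(data=[1, 2, 3, 4])
doubled_evens = base.filter(lambda x: x % 2 == 0).map(lambda x: 2 * x)
# Losing doubled_evens costs nothing: compute() rebuilds it from `base`
# doubled_evens.compute() == [4, 8]
```

Checkpointing and caching amount to filling `_cache` somewhere along the chain, so the replay starts closer to the lost result.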


slide-172
SLIDE 172

Big Data Analytics with Hadoop, Spark etc.

Building Spark

Trivial on HPC systems with EasyBuild:

# LOCAL_MODULES=$HOME/.local/easybuild/modules/all
$> module use $LOCAL_MODULES
$> module load tools/EasyBuild
# Search for recipes for a given software
$> eb -S Spark
# Select and install one of the recipes as part of your local modules
$> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -Dr   # Dry-run
$> eb Spark-2.4.0-Hadoop-2.7-Java-1.8.eb -r


slide-173
SLIDE 173

Big Data Analytics with Hadoop, Spark etc.

Running Spark - Interactive usage

Better to reserve exclusive node(s). Several interactive interfaces:

→ pyspark (Python), spark-shell (Scala), sparkR (R)

$> module load devel/Spark
$> module list
Currently Loaded Modules:
  1) lang/Java/1.8.0_162   2) devel/Spark/2.4.0-Hadoop-2.7-Java-1.8
$> pyspark
[... Spark banner, version 2.4.0 ...]
Using Python version 2.7.5 (default, Oct 30 2018 23:45:53)
SparkSession available as 'spark'.
>>>


slide-174
SLIDE 174

Big Data Analytics with Hadoop, Spark etc.

Running Spark - Cluster mode


Spark applications run as independent sets of processes on a cluster

→ coordinated by the SparkContext object in your main program
→ Spark is agnostic to the underlying cluster manager

Spark currently supports several cluster managers:

→ Standalone: a simple cluster manager included with Spark
→ Apache Mesos / Hadoop YARN


slide-175
SLIDE 175

Big Data Analytics with Hadoop, Spark etc.

Running Spark - Standalone Cluster

1. Create the master and the workers
→ (eventually) check the web interface of the master
2. Submit a Spark application to the cluster with spark-submit
3. Let the application run and collect the result
4. Stop the cluster at the end

sbin/start-master.sh: starts a master instance on the machine the script is executed on
sbin/start-slaves.sh: starts a slave instance on each machine specified in the conf/slaves file
sbin/start-slave.sh: starts a slave instance on the machine the script is executed on
sbin/start-all.sh: starts both a master and a number of slaves as described above
sbin/stop-master.sh: stops the master that was started via the sbin/start-master.sh script
sbin/stop-slaves.sh: stops all slave instances on the machines specified in the conf/slaves file
sbin/stop-all.sh: stops both the master and the slaves as described above

slide-176
SLIDE 176

Big Data Analytics with Hadoop, Spark etc.

Running Spark - Standalone Cluster

see the Uni.lu HPC Tutorial: Big Data – launcher.Spark.sh

# (eventually) interactive job
$> srun -N 3 --ntasks-per-node 1 -c 28 --exclusive --pty bash -i
$> ./runs/launcher.Spark.sh -i
SLURM_JOBID = 256624
SLURM_JOB_NODELIST = iris-[094,096,098]
[...]
============== Spark Master ==============
url:    spark://iris-094:7077
Web UI: http://iris-094:8082
============ 3 Spark Workers =============
export SPARK_HOME=$EBROOTSPARK
export MASTER_URL=spark://iris-094:7077
export SPARK_DAEMON_MEMORY=4096m
export SPARK_WORKER_CORES=28
export SPARK_WORKER_MEMORY=110592m
export SPARK_EXECUTOR_MEMORY=110592m


slide-177
SLIDE 177

Big Data Analytics with Hadoop, Spark etc.

Running Spark - Standalone Cluster


slide-178
SLIDE 178

Big Data Analytics with Hadoop, Spark etc.

Running Spark - Standalone Cluster

# MASTER=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
$> spark-submit \
     --master spark://${MASTER}:7077 \
     --conf spark.driver.memory=${SPARK_DAEMON_MEMORY} \
     --conf spark.executor.memory=${SPARK_EXECUTOR_MEMORY} \
     --conf spark.python.worker.memory=${SPARK_WORKER_MEMORY} \
     $SPARK_HOME/examples/src/main/python/pi.py 1000
[...]
Pi is roughly 3.141368
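The bundled pi.py example estimates π by Monte Carlo sampling; the core logic can be sketched serially in plain Python (Spark simply parallelizes the sampling loop with a map over partitions — this sketch is not the actual pi.py source):

```python
import random

def estimate_pi(samples, seed=42):
    """Sample points in the unit square and count how many fall inside
    the quarter circle: the fraction approaches pi/4 as samples grow."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

# With 100,000 samples the estimate is roughly 3.14
```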



slide-181
SLIDE 181

[Brief] Overview of other useful Data Analytics frameworks

Python

Effective for fast prototype coding

→ simple (now used as a reference language in schools)
→ easy creation of reproducible and isolated environments
  pip: the Python package manager
  virtualenv: creates virtual environments

$> pip install --user <package>
$> pip freeze -l > requirements.txt    # Dump the python environment
$> pip install -r requirements.txt     # Restore a saved environment

slide-182
SLIDE 182

[Brief] Overview of other useful Data Analytics frameworks

Virtualenv / venv

Virtualenv allows you to create several environments

→ each will contain its own list of Python packages
→ Python 3 has built-in support for virtual environments with venv

Best practice: create one virtual environment per project

$> cd /path/to/project
# /!\ ADAPT the name of the virtual env below ('myproject')
# Python 2 (deprecated)
$> module load lang/Python/2.7.14-intel-2018a
$> pip install --user virtualenv
$> virtualenv myproject
$> source myproject/bin/activate
# Python 3
$> module load lang/Python/3.6.4-intel-2018a
# Hint: centralize all your venvs in ~/venv/
$> python3 -m venv ~/venv/myproject
$> source ~/venv/myproject/bin/activate


slide-183
SLIDE 183

[Brief] Overview of other useful Data Analytics frameworks

Useful Python libraries / Data Sciences

numpy: fundamental package for scientific computing
jupyter (notebook): create and share documents that contain live code
keras, TensorFlow, PyTorch: machine learning frameworks
Scoop: Scalable COncurrent Operations in Python
pythran / Cython: Python-to-C++ compilation, C bindings


slide-184
SLIDE 184

[Brief] Overview of other useful Data Analytics frameworks

Summary

1 Introduction
  HPC & BD Trends
  Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
  Performance Overview in Data transfer
  Data transfer in practice
  Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
  Apache Hadoop
  Batch vs Stream vs Hybrid Processing
  Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
  Python Libraries
  R – Statistical Computing
5 Conclusion


slide-185
SLIDE 185

[Brief] Overview of other useful Data Analytics frameworks

R – Statistical Computing

R – shorthand for GNU R

→ An interactive programming language derived from S
→ Appeared in 1993, created by R. Ihaka and R. Gentleman
→ Focus on data analysis and plotting

Also shorthand for the ecosystem around this language:

→ Book authors
→ Package developers
→ Ordinary useRs


slide-187
SLIDE 187

[Brief] Overview of other useful Data Analytics frameworks

Why use R?

It’s free and open-source

→ easy to install / maintain
→ multi-platform (Windows, macOS, GNU/Linux)
→ can process big files and analyse huge amounts of data (db tools)
→ integrated data visualization tools, even dynamic
→ fast, and even faster with C++ integration via Rcpp

Easy to get help

→ huge R community on the web
→ stackoverflow with a lot of tags like r, ggplot2 etc.
→ rbloggers

YET R is hard to learn

→ R base is complex, has a long history and many contributors
→ Prefer to learn it through the tidyverse


slide-188
SLIDE 188

[Brief] Overview of other useful Data Analytics frameworks

R Packages / Daily Work

CRAN: reliable

→ packages checked during the submission process
→ MRAN for Windows users

GitHub: easy install via devtools

→ could be a security issue

Daily Development: RStudio

# CRAN install
install.packages("ggplot2")

# GitHub install -- assumes 'install.packages("devtools")'
devtools::install_github("tidyverse/readr")


slide-189
SLIDE 189

[Brief] Overview of other useful Data Analytics frameworks

R in HPC environment

$> srun [...] -c 4 --pty bash
$> module load lang/R
$> R

R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
[...]
> sessionInfo()   # information about R version, loaded libraries

The R environment itself is not parallelized. Effective use of a multi-threaded environment:

→ certain workloads (mostly linear algebra) can use multiple threads

  typically through OpenMP / Intel Math Kernel Library (MKL), cf. OMP_NUM_THREADS

Better: rely on the future package

→ "Unified Parallel and Distributed Processing in R for Everyone"
→ run expressions asynchronously

Future expressions are run according to a plan defined by the user

For runs distributed across nodes: use Rmpi


slide-190
SLIDE 190

[Brief] Overview of other useful Data Analytics frameworks

R in HPC environment / future Package

> install.packages(c("future", "furrr", "purrr", "dplyr", "dbscan", "tictoc", "Rtsne"),
+                  repos = "https://cran.rstudio.com")
> library(furrr)

### Sequential run
> plan(sequential)
> tictoc::tic()
> nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x), .progress = TRUE)
> tictoc::toc()
13.614 sec elapsed

### Parallel run
> plan(multiprocess)
> tictoc::tic()
> nothingness <- future_map(c(2, 2, 2), ~Sys.sleep(.x), .progress = TRUE)
> tictoc::toc()
4.14 sec elapsed
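For readers more at home in Python, the same sequential-vs-parallel pattern can be sketched with the standard-library concurrent.futures module (an analogue chosen for illustration; it is not part of the R future ecosystem):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def work(seconds):
    # Stand-in for a real task; time.sleep releases the GIL,
    # so threads can genuinely overlap here
    time.sleep(seconds)
    return seconds

delays = [0.5, 0.5, 0.5]

# Sequential run: total time ~ sum of the delays (about 1.5 s)
t0 = time.perf_counter()
results_seq = [work(d) for d in delays]
t_seq = time.perf_counter() - t0

# Parallel run: tasks overlap, total time ~ the longest delay (about 0.5 s)
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(delays)) as pool:
    results_par = list(pool.map(work, delays))
t_par = time.perf_counter() - t0

print(f"sequential: {t_seq:.2f}s  parallel: {t_par:.2f}s")
```

As with plan(sequential) vs plan(multiprocess) above, only the execution strategy changes; the mapped function and its results stay the same.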


slide-191
SLIDE 191

Conclusion

Summary

1 Introduction
  HPC & BD Trends
  Reviewing the Main HPC and BD Components
2 [Big] Data Management in HPC Environment: Overview and Challenges
  Performance Overview in Data transfer
  Data transfer in practice
  Sharing Data
3 Big Data Analytics with Hadoop, Spark etc.
  Apache Hadoop
  Batch vs Stream vs Hybrid Processing
  Apache Spark
4 [Brief] Overview of other useful Data Analytics frameworks
  Python Libraries
  R – Statistical Computing
5 Conclusion


slide-193
SLIDE 193

Conclusion

Conclusion

Effective Data Analytics requires an appropriate processing facility

→ HPC: High Performance Computing
→ HTC: High Throughput Computing

Before speaking about Big Data management:

→ understand performance bottlenecks
→ ensure best practices for daily data management
→ effective code/data sharing (Git etc.)

Classical Big Data analytics frameworks

→ Batch vs. Streaming vs. Hybrid Processing engines
→ Hadoop, Storm, Samza, Flink, Spark

Other useful Data Science environments

→ Python, R, [Matlab etc.]
→ ... require adaptation for effective usage on HPC platforms


slide-194
SLIDE 194

Thank you for your attention...

Questions?

http://hpc.uni.lu

  • Dr. Sebastien Varrette

University of Luxembourg, Belval Campus
Maison du Nombre, 4th floor
2, avenue de l'Université
L-4365 Esch-sur-Alzette
mail: sebastien.varrette@uni.lu

High Performance Computing @ Uni.lu

  • Prof. P. Bouvry, Dr. S. Varrette
  • V. Plugaru, S. Peter, H. Cartiaux, C. Parisot
