SLIDE 1

Will They Blend?: Exploring Big Data Computation atop Traditional HPC NAS Storage

Ellis H. Wilson III (1,2), Mahmut Kandemir (1), Garth Gibson (2,3)

(1) Department of Computer Science and Engineering, The Pennsylvania State University
(2) Panasas, Inc.
(3) Department of Computer Science, Carnegie Mellon University

July 3rd, 2014

SLIDE 2

Introduction/Background Converged Architectures Evaluation

Before We Begin: Get the Slides and Paper

Slides and Paper are Available At:

www.ellisv3.com

www.ellisv3.com Hadoop on NAS

SLIDE 3

1. Introduction and Background
   • From 10,000 Feet: Considering Hadoop’s Fit in HPC
   • Goals of this Research: MapReduce in HPC?

2. Converged Architectures for Hadoop on NAS
   • Overview of Architectures
   • Reliability and Performance Implications
   • RainFS

3. Performance Evaluation of Converged Architectures
   • Setup and Benchmarks
   • Performance Results


SLIDE 4

Motivation

The divide between HPC and Big Data is increasingly foggy:

• The Big Data processing framework MapReduce (MR) promises faster time-to-solution for data-intensive science
• But MR often comes tightly coupled with the Hadoop Distributed File System (HDFS)
• Standard HDFS requires disks local to the compute nodes for distributed storage

HPC typically already has its own Parallel File System (PFS) solutions in place:

• Adopting Hadoop threatens to require large capital and maintenance investments
• Totally dropping MPI and similar solutions for MR is impossible
• Copying massive amounts of data from Network-Attached Storage (NAS) to HDFS and back is a common problem
• Dividing storage into two pools, NAS and HDFS, will exacerbate the compute-storage gap


SLIDE 5

Hurdles to Adoption of Hadoop in HPC

• Loss of Infrastructure Consolidation
• Forced Import/Export
• I/O Performance Degradation
• Loss of High-Availability
• No Modification to Files
• Inefficient Compute-Storage Coupling


SLIDE 6

Goals of this Research

Three Main Goals/Contributions:

1. Explore if/how one can enable MR to run on traditional NAS
   • Enables reuse of existing storage: infrastructure consolidation

2. Explore whether one can use MR alongside MPI and others without copying
   • Improves utility of capacity, reduces network contention, fights the I/O gap

3. Identify the relative efficiencies and reliabilities of potential solutions
   • Examine four different architectural approaches


SLIDE 7

First: Consider Traditional Hadoop

Typical Hadoop Architecture: Example of Write Path


SLIDE 8

Exploration of Four Possible Architectures

Possible Architectures:

1. Traditional HDFS Pointed at a PFS
   • Configure HDFS with PFS paths rather than local disks

2. HDFS as a Wire Protocol in the PFS NAS Heads
   • Run DataNodes (DNs) on the NAS heads instead of on all clients

3. No HDFS, MR Directly to the PFS
   • Run MR configured to send data directly to the PFS

4. RainFS: Replicating Array of Independent NAS File Systems
   • A new Hadoop filesystem designed specifically to intermediate between MR and the PFS
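The first approach above amounts to an ordinary HDFS deployment whose DataNode storage directories sit on a PFS mount instead of local disks. A minimal sketch in the Hadoop 1.x configuration style of the era; the mount point /mnt/pfs and directory layout are assumptions, not the paper's exact setup:

```xml
<!-- hdfs-site.xml: point each DataNode's storage at a per-node
     directory on the shared PFS mount rather than a local disk -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/mnt/pfs/hdfs/data</value>  <!-- hypothetical PFS path -->
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>  <!-- HDFS replication atop already-RAIDed NAS -->
  </property>
</configuration>
```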


SLIDE 9

Architecture Details: Traditional HDFS

Pros:
• Simplicity

Cons:
• Performance Degradation: one full replica in network contention
• Reliability Limits: duplication is the ceiling
• Copy Required: distinct namespace


SLIDE 10

Architecture Details: HDFS as a Wire Protocol

Pros:
• HDFS becomes Yet Another Protocol
• Reliability limits go away

Cons:
• Performance Bottleneck: NAS head limits throughput
• NAS Invasion: may not be possible (or easy) with many NAS solutions
• Copy Required: distinct namespace


SLIDE 11

Architecture Details: No HDFS

Pros:
• High-Performance: alleviates overheads and bottlenecks
• No Copies: operates in the typical POSIX namespace

Cons:
• Requires Single Namespace: no HDFS to intermediate between distinct NAS
• No Replication: must rely solely on RAID
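The No-HDFS approach can be sketched as simply making Hadoop's default filesystem the POSIX-mounted PFS, so job input/output paths resolve directly on the shared mount. A minimal, hypothetical snippet (Hadoop 1.x style; not the paper's exact configuration):

```xml
<!-- core-site.xml: bypass HDFS entirely by using the local (POSIX)
     filesystem interface over the PFS mount as the default filesystem -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
```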


SLIDE 12

Hadoop vs. HPC Storage: A Reliability Divergence

HPC Storage:
• Enterprise storage solutions
• RAID 5/6
• ECC-enabled hardware (sometimes end-to-end)
• Redundant hardware (PSU/NIC/etc.)

Hadoop Storage (HDFS):
• Commodity hard drives in compute nodes
• Replication performed across nodes/racks
• No ECC
• No redundant hardware


SLIDE 13

Converged Reliability Guarantees

Failures tolerated per architecture, shown as "disk failures / rack failures":

                  Repl. 1            Repl. 2            Repl. 3
                  RAID 5   RAID 6    RAID 5   RAID 6    RAID 5   RAID 6
DN-on-Client      1 / 0    2 / 0     3 / 1    5 / 1     – / –    – / –
DN-on-NAS Node    1 / 0    2 / 0     3 / 1    5 / 1     5 / 2    8 / 2
No HDFS           1 / 0    2 / 0     – / –    – / –     – / –    – / –
RainFS            1 / 0    2 / 0     3 / 1    5 / 1     5 / 2    8 / 2

Two main failure modes for converged HDFS/HPC storage: failure of a disk, and failure of a rack.
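The table's entries follow a simple pattern. As a plausibility check (an illustration, not the paper's derivation): with r replicas, each stored on a RAID group that tolerates k disk losses, data is lost only when every replica's group loses k+1 disks, so at least r*(k+1) - 1 arbitrary disk failures and r - 1 rack failures are always survivable.

```python
def tolerated_failures(replicas, raid_parity):
    """Worst-case failures guaranteed survivable when each of `replicas`
    copies lives on a RAID group tolerating `raid_parity` disk losses.
    A copy dies only after raid_parity + 1 disks in its group fail."""
    disks = replicas * (raid_parity + 1) - 1
    racks = replicas - 1
    return disks, racks

# RAID 5 tolerates 1 disk per group, RAID 6 tolerates 2.
print(tolerated_failures(2, 1))  # Repl. 2 + RAID 5 -> (3, 1)
print(tolerated_failures(3, 2))  # Repl. 3 + RAID 6 -> (8, 2)
```

These reproduce the replicating rows of the table above.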


SLIDE 14

Locality Confusion: Write Transport

Errant Pass-Through Behavior on Write

[Figure: network throughput (MB/s), received vs. sent, over time since start of the write]


SLIDE 15

Read Transport

Errant Pass-Through Behavior on Read

[Figure: network throughput (MB/s), received vs. sent, over time since start of the read]


SLIDE 16

Design Desiderata

Four main goals for RainFS:

1. Client-Level Federation of NAS Systems: enable performance of all available NAS systems concurrently while maintaining discrete failure domains

2. Full Replication: restore replication ability in MapReduce

3. No Data Pass-Throughs: writes/reads should never go through another client node

4. A Fair Namespace: create a framework-agnostic namespace where no imports or exports are required


SLIDE 17

Main Implementation Mechanisms

Symbolic Links (symlinks):
• Symlinks on the master failure domain point at replica zero on one of the NAS systems
• Placement of replica zero is randomly chosen; following replicas are round-robined
• MR can read from MPI output; MPI can read from MR output
• Key algorithms and their synchronization issues are covered in the paper

Hidden Metadata File:
• Stored beside, and named similarly to, the symlink
• Manages where replicas exist, up/down state, etc.
• Avoids a dedicated, centralized metadata manager daemon
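The two mechanisms above can be sketched in a few lines. This is a minimal illustration of the placement scheme, not RainFS itself (which is implemented as a Hadoop filesystem); function and field names here are invented for the sketch:

```python
import json
import os
import random

def rainfs_create(name, data, master_dir, nas_mounts, replicas=2):
    """Write `replicas` copies of `data` across independent NAS mounts,
    expose the file via a symlink on the master failure domain, and track
    replica state in a hidden metadata file beside the symlink."""
    start = random.randrange(len(nas_mounts))            # replica zero: random NAS
    chosen = [nas_mounts[(start + i) % len(nas_mounts)]  # the rest: round-robin
              for i in range(replicas)]
    paths = []
    for mount in chosen:
        path = os.path.join(mount, name)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    # Symlink on the master failure domain points at replica zero, so any
    # POSIX reader (e.g. an MPI job) sees the file under one fair namespace.
    os.symlink(paths[0], os.path.join(master_dir, name))
    # Hidden metadata file beside the symlink: where replicas live, up/down state.
    meta_path = os.path.join(master_dir, "." + name + ".meta")
    with open(meta_path, "w") as f:
        json.dump({"replicas": paths, "up": [True] * replicas}, f)
    return paths
```

Because the metadata lives beside each symlink, no centralized metadata daemon is needed.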


SLIDE 18

Setup and Benchmarks in Use

Hardware Environment:
• Cluster of 50 multi-core machines at Carnegie Mellon
• CentOS 5.5 running as VMs on KVM
• DirectFlow(tm) network-attached protocol to 5 shelves of Panasas ActiveStor 12

Benchmarks in Use: the ubiquitous Yahoo! TeraSort Benchmark Suite
• TeraGen: write-intensive
• TeraSort: mixed, CPU-intensive
• TeraValidate: read-intensive


SLIDE 19

Impact of Architecture on Throughput Performance

Yahoo! TeraSort Benchmark (50 clients, 500 GB of data)

[Figure: throughput (MB/s) for DN-on-Client, DN-on-NAS, No-DN, and RainFS, in four panels:
(a) Rep. Level 1: Write- and Read-Intensive (TeraGen, TeraValidate)
(b) Rep. Level 1: Mixed (TeraSort)
(c) Rep. Level 2: Write- and Read-Intensive (TeraGen, TeraValidate)
(d) Rep. Level 2: Mixed (TeraSort)]


SLIDE 20

Impact of Replication on Performance

Relative Throughput Slow-Downs for Replication per Architecture (TeraGen):

DN-on-Client: 2.02x
DN-on-NAS:    2.38x
RainFS:       1.52x


SLIDE 21

Conclusion

Conclusions:
• Convergence of Big Data and HPC is happening: compute is easy, storage is hard
• There are numerous pitfalls/caveats, especially relating to reliability
• Four different architectures using Hadoop MapReduce on NAS were explored; each has its own pros/cons
• RainFS demonstrates optimal performance with the highest reliability


SLIDE 22

Questions?


SLIDE 23

– Begin Backup Slides –


SLIDE 24

Super-Moore’s Law CPU Scaling in Supercomputers

Top 500 Supercomputer HPLinpack Results over Time

[Figure: HPLinpack Rmax vs. year (1993-2014) for the #1, mean, and #500 systems]
Doubling Performance Every 13.5 Months

[Figure: mean Rmax per CPU and normalized mean CPU count vs. year (1993-2014)]
Doubling CPU Count Every 22 Months
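Since exponential growth rates add in log space, the two doubling times above imply a doubling time for per-CPU performance. This quick check is an illustration derived from the slide's numbers, not a figure from the slides:

```python
def combined_doubling_time(t_total, t_cpus):
    """If total performance doubles every t_total months and CPU count
    doubles every t_cpus months, then performance *per CPU* doubles every
    1 / (1/t_total - 1/t_cpus) months (growth rates subtract in log space)."""
    return 1.0 / (1.0 / t_total - 1.0 / t_cpus)

# 13.5-month performance doubling with 22-month CPU-count doubling
# implies per-CPU performance doubles only every ~35 months.
print(round(combined_doubling_time(13.5, 22), 1))
```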


SLIDE 25

HDD Capacity and Bandwidth Scaling

Historical Data from over 1500 Hard Drives

[Figure: HDD capacity (GB) vs. year (1980-2014); fit line: exp(0.452*x - 901.432)]
Doubling Capacity every 18 Months

[Figure: maximum read bandwidth (MB/s) vs. capacity (GB); fit line: 20.5088 * ln(x) - 34.0089]
Doubling Bandwidth only once per decade!
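The 18-month figure falls directly out of the capacity fit line above: an exponential exp(a*x + b) doubles whenever a*x grows by ln 2. A quick check:

```python
import math

# Capacity fit from the slide: capacity_GB(year) = exp(0.452*year - 901.432),
# so capacity doubles every ln(2) / 0.452 years (~18.4 months).
doubling_years = math.log(2) / 0.452
print(round(doubling_years * 12, 1))  # in months
```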


SLIDE 26

Taken In Perspective: A Grim Reality

Time taken to access all on-disk data:
• 1996: 30 seconds
• 2006: 30 minutes
• 2016: 1 day
