SLIDE 1

Dynamic Hadoop clusters on HPC scheduling systems

Michele Muggiri, Luca Pireddu*, Simone Leo, Gianluigi Zanetti

CRS4

August 27, 2013

luca.pireddu@crs4.it (CRS4) Hadoocca August 27, 2013 1 / 37

SLIDE 2

Outline

1. Introduction
2. Hadoocca – Dynamic MapReduce allocation
3. Conclusion

SLIDE 3

Rising interest in Hadoop

Hadoop provides an effective and scalable way to process large quantities of data
MapReduce suitable for many types of problems
Hadoop ecosystem also growing in other directions
  e.g., fast DB-style queries on very large datasets
Growing number of applications
Success confirmed by the growing number of users

Image by Datamere

SLIDE 4

Hadoop’s goals

Hadoop has two main goals
  scalable storage
  scalable computation
Storage provided through Hadoop Distributed File System (HDFS)
Computation provided by Hadoop MapReduce and other systems

Within the scope of this work, we focus on MapReduce for computation

SLIDE 7

Hadoop 1.x architecture

Two main subsystems, HDFS and MapReduce, each with a master-slave architecture
HDFS has many DataNodes
  store data blocks locally
MapReduce has many TaskTrackers
  run computation locally

Image courtesy of mplsvpn.info

SLIDE 8

Hadoop 1.x architecture

Normally DataNodes and TaskTrackers are deployed together
Quite complementary resource requirements
Take advantage of data locality

Image courtesy of MSDN

SLIDE 9

Hadoop’s use of resources

Hadoop assumes it has exclusive and long-term use of its nodes
It has its own job submission, queueing, and scheduling system
This arrangement can make it complicated to adopt in some circumstances
An important example: HPC centers, with shared clusters accessed via batch systems
  Probably still one of the most common ways to access private computing resources

Hadoop’s approach to resource acquisition is decidedly in contrast with batch systems!

SLIDE 10

Adopting Hadoop

Large, committed operations have the possibility of deploying dedicated clusters
Others may not have the resources for a Hadoop cluster
Some aren’t sure about investing in one
And what about experimenting?
Even setting up a temporary, reasonably sized cluster
  At worst will require sysadmin approval and intervention
  At best will still require specific skills, which may not be easily accessible

SLIDE 11

Example application: DNA sequencing

An example of a user who has a lot of data to process but may not have Hadoop administration skills: a bioinformatician!
An interesting application of Hadoop is in processing genomic data
Typical genomic processing workflow:
  embarrassingly parallel problems
  mostly I/O bound
  well suited for Hadoop
Increasing number of Hadoop-based software tools for this type of work

SLIDE 12

Example application: DNA sequencing

How much data?

Details depend on technology
e.g., one run on an Illumina high-throughput platform:
  10 days
  ≈ 400 Gbases
  ≈ 4 billion fragments
  ≈ 1 TB of sequence data
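The run figures above imply a few derived quantities that help when sizing storage and I/O. The following is a minimal sketch that only does arithmetic on the numbers quoted on the slide; it assumes decimal units (1 TB = 10**12 bytes), and the function and variable names are illustrative, not part of any tool mentioned in the talk.

```python
# Approximate figures quoted on the slide for one sequencing run.
RUN_DAYS = 10
BASES = 400e9        # ~400 Gbases
FRAGMENTS = 4e9      # ~4 billion fragments
DATA_BYTES = 1e12    # ~1 TB, assuming 1 TB = 10**12 bytes


def run_summary(days, bases, fragments, data_bytes):
    """Derive per-fragment, per-base and per-day quantities from run totals."""
    return {
        "bases_per_fragment": bases / fragments,
        "bytes_per_base": data_bytes / bases,
        "gb_per_day": data_bytes / days / 1e9,
    }


summary = run_summary(RUN_DAYS, BASES, FRAGMENTS, DATA_BYTES)
for name, value in summary.items():
    print(f"{name}: {value:g}")
```

With these inputs the sketch gives about 100 bases per fragment, about 2.5 bytes per base, and about 100 GB of sequence data per day of the run.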

SLIDE 13

CRS4

CRS4 sequencing center

CRS4 - largest sequencing center in Italy
capacity of generating 5 TBases/month
  i.e., about 25 TB of raw data

Most processing performed with the Hadoop-based Seal toolkit

CRS4 computational capacity

3200 cores in its main HPC cluster
About 5 PB of storage, most of which in a shared GPFS volume
Managed with Grid Engine
Available to everyone at CRS4
Runs a lot of MPI and standard batch jobs
  cluster cannot be entirely dedicated to Hadoop

SLIDE 14

Hadoop allocation strategies

How can we allow Hadoop to exist in such a typical HPC setting?
Various possible static and dynamic Hadoop allocation strategies
Some may provide a suitable solution

SLIDE 15

Static allocation

Partition cluster: allocate part to HPC and part to Hadoop
Works well if both partitions have regular, relatively high load
Provides a static/stable HDFS volume
But not well suited for variable workloads
  easily results in underutilization

SLIDE 16

Dynamic allocation

Only occupy nodes when needed
  Seems a more reasonable strategy in shared HPC environments
Not straightforward because HDFS uses node-local storage
HDFS cluster cannot be reduced in size easily
  data needs to be transferred off the nodes to be freed – slow!
Number of nodes must always be sufficient to provide required storage space
  idle cluster still occupies nodes
Yet, there are various possible flavours of dynamic allocation

SLIDE 17

Hadoop-on-Demand (HOD)

Blocks of nodes allocated through a standard batch system
HDFS and MapReduce started on those nodes
  HDFS volume is temporary, so only useful for intermediate/temporary data
Desired size of cluster must be decided at allocation time
Cluster must be deallocated manually

SLIDE 18

Hadoop-on-Demand (HOD)

Allocation strategy exposed to human factors
  given the overhead/latency in allocating a cluster, users may be tempted to keep it allocated for longer than strictly necessary

[Figure: CPU usage (% of total) and memory usage (% of total) over time (days)]

SLIDE 19

Alternative approach

Alternative approach: decouple Hadoop MapReduce and HDFS
MapReduce and HDFS may use different sets of nodes
  can even choose to completely forego HDFS and use other storage systems
More allocation strategies open up this way
Drawback: risk losing data-locality

SLIDE 20

HDFS allocation

Cluster-wide HDFS
  Run HDFS daemons on all cluster nodes, alongside other task processes
Dedicated block of machines to host an HDFS volume
  Can even recycle older machines whose CPUs or RAM size are no longer competitive
No HDFS: use some other parallel shared storage
  use whatever is already in place
  in addition to HDFS, Hadoop can natively access any mounted file system and Amazon S3

SLIDE 21

No HDFS

What’s the price of foregoing HDFS? YMMV

[Figure: Throughput per node. Mean MB/s by copy direction (E7 −> E7, HDFS −> HDFS)]

Use hadoop distcp to copy 1.1 TB of data
59 nodes, HDFS replication factor of 2
Each bar is the mean of 3 runs

Warning!

HDFS scales to 1000s of nodes
This test only uses ∼60 nodes
Our nodes only have 1 disk

SLIDE 22

MapReduce allocation: per-job

Acquire nodes, start JobTracker and TaskTrackers, run job, shut down and clean-up

Such a solution was implemented for SGE by Sun

The lack of a static JobTracker node makes this approach less simple for users, and it will not work with higher-level applications (e.g., Pig, Hive)

SLIDE 23

Static JobTracker, on-demand slaves

Static JobTracker, dynamic cluster
We’ve built a solution based on this strategy: Hadoocca

SLIDE 24

Outline

1. Introduction
2. Hadoocca – Dynamic MapReduce allocation
3. Conclusion

SLIDE 25

Hadoocca

Hadoop MapReduce natively supports dynamically adding and removing slave nodes (Task Trackers)

a feature normally used to handle node failures

Keep a static JobTracker server
Monitor its queues
  allocate task trackers as capacity is needed

Two main components: Load Monitor, Task Tracker manager

SLIDE 26

Load monitor

Monitors Hadoop JobTracker
Periodically polls it for its map and reduce task counts:
  1. capacity
  2. running
  3. queued

Currently implemented using JobTracker’s command line interface

hadoop jobs program

Based on the number of queued tasks, decides how many task trackers to launch
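The polling behaviour described on this slide can be sketched as follows. This is only an illustration under stated assumptions: `TaskCounts`, `ClusterLoad` and `poll_load` are hypothetical names, and the `fetch` callable stands in for whatever actually queries the JobTracker (the slides only say the command line interface is used, and its output format is not reproduced here).

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class TaskCounts:
    """Counts for one task type (map or reduce), as polled from the JobTracker."""
    capacity: int  # task slots currently offered by the running task trackers
    running: int   # tasks currently executing
    queued: int    # tasks waiting for a free slot


@dataclass(frozen=True)
class ClusterLoad:
    """One snapshot of the load, covering both task types."""
    maps: TaskCounts
    reduces: TaskCounts


def poll_load(fetch: Callable[[], ClusterLoad], interval_s: float,
              iterations: int,
              sleep: Callable[[float], None] = time.sleep
              ) -> Iterator[ClusterLoad]:
    """Yield one snapshot per iteration, pausing interval_s between polls."""
    for i in range(iterations):
        yield fetch()
        if i < iterations - 1:
            sleep(interval_s)


# Demo with a fake data source (hypothetical numbers, no real JobTracker).
def fake_fetch() -> ClusterLoad:
    return ClusterLoad(maps=TaskCounts(8, 8, 40), reduces=TaskCounts(4, 0, 6))


for load in poll_load(fake_fetch, interval_s=30.0, iterations=2,
                      sleep=lambda seconds: None):
    print(load.maps.queued, load.reduces.queued)
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a cluster; a real monitor would plug in a function that parses the JobTracker's reported counts.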

SLIDE 27

Scheduling formula

Scheduling decision is currently simple and intuitive
Calculate the number of nodes required to bring all tasks into the running state
Try to allocate them, capping at a limit per scheduler iteration
Iterate again after a delay and repeat the process

SLIDE 28

Scheduling formula

Roughly boils down to the following:

  map_nodes = ceil((n_map_tasks_queued + n_map_tasks_running) / n_map_tasks_per_node)
  red_nodes = ceil((n_red_tasks_queued + n_red_tasks_running) / n_red_tasks_per_node)
  total_nodes = max(map_nodes, red_nodes)
  capped = min(node_limit, total_nodes)
  new_nodes = capped - (nodes_running + nodes_queued)
  to_allocate = min(new_nodes, max_nodes_per_iteration)

The system then queues to_allocate new nodes and sleeps until the next iteration
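As a worked example, the formula above can be transcribed into Python. This is a sketch, not the Hadoocca source: the function name is made up, and the final clamp to zero is an added assumption (the slide's formula can go negative when more nodes are already running or queued than are needed).

```python
import math


def nodes_to_allocate(*, map_queued, map_running, red_queued, red_running,
                      map_tasks_per_node, red_tasks_per_node,
                      node_limit, nodes_running, nodes_queued,
                      max_nodes_per_iteration):
    """Number of new task tracker nodes to request in this iteration."""
    # Nodes needed so that every queued or running task has a slot.
    map_nodes = math.ceil((map_queued + map_running) / map_tasks_per_node)
    red_nodes = math.ceil((red_queued + red_running) / red_tasks_per_node)
    total_nodes = max(map_nodes, red_nodes)
    # Never exceed the overall cluster limit.
    capped = min(node_limit, total_nodes)
    # Subtract nodes that are already running or waiting in the batch queue.
    new_nodes = capped - (nodes_running + nodes_queued)
    # Assumption (not on the slide): never return a negative request.
    return max(0, min(new_nodes, max_nodes_per_iteration))


request = nodes_to_allocate(
    map_queued=90, map_running=10, red_queued=6, red_running=0,
    map_tasks_per_node=4, red_tasks_per_node=2,
    node_limit=20, nodes_running=5, nodes_queued=3,
    max_nodes_per_iteration=4)
print(request)
```

In this example 25 nodes would be needed for the map tasks, the limit caps that at 20, 8 nodes are already running or queued, and the per-iteration cap reduces the remaining 12 to a request of 4.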

SLIDE 29

Scheduling formula

Limiting the number of nodes requested per iteration slows down growth
Play nice with neighbours and avoid flooding the cluster
Avoid starting nodes unless they’re really necessary
  they may turn out to be unnecessary due to excessively quick tasks or errors causing crashes
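To illustrate how the per-iteration cap slows growth, here is a small simulation. It is a sketch under simplifying assumptions: demand stays constant while the cluster grows, every requested node is granted, and `ramp_up` is a hypothetical helper rather than part of Hadoocca.

```python
def ramp_up(nodes_needed, max_nodes_per_iteration):
    """Cumulative number of nodes requested after each scheduler iteration."""
    if max_nodes_per_iteration <= 0:
        raise ValueError("max_nodes_per_iteration must be positive")
    requested = 0
    history = []
    while requested < nodes_needed:
        # Each iteration asks for at most max_nodes_per_iteration more nodes.
        requested += min(nodes_needed - requested, max_nodes_per_iteration)
        history.append(requested)
    return history


print(ramp_up(20, 8))   # [8, 16, 20]
print(ramp_up(20, 20))  # [20]
```

With a cap of 8, a demand for 20 nodes takes three iterations to satisfy. If the queued tasks finish or fail within the first iteration, only 8 nodes have been requested instead of 20.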

SLIDE 30

Running task tracker nodes

When the Load Monitor tries to allocate a node, it actually queues a job through the batch system. Once the job starts running, it performs these steps:
  Verify that there are still tasks outstanding
  Use the default Hadoop mechanism to start the task tracker
  Keep running, monitoring the task tracker process
  When no tasks have been running on the task tracker for some time, terminate it
    tell the JobTracker that the node is to be excluded
    use the standard Hadoop tasktracker shutdown command
    clean up scratch space
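The life cycle of the batch job described above can be sketched as a small control loop. Everything here is an assumption made for illustration: the function and parameter names are hypothetical, and the four callables stand in for the real actions (querying the JobTracker, starting the task tracker, checking for activity, and the exclude/shutdown/clean-up sequence), none of which are specified in detail on the slides.

```python
import time
from typing import Callable


def run_tasktracker_job(tasks_outstanding: Callable[[], bool],
                        start_tracker: Callable[[], None],
                        tracker_busy: Callable[[], bool],
                        shutdown_tracker: Callable[[], None],
                        idle_timeout_s: float,
                        poll_interval_s: float,
                        clock: Callable[[], float] = time.monotonic,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if a task tracker was started, False if the job exited early."""
    # Step 1: the work may have disappeared while the job sat in the queue.
    if not tasks_outstanding():
        return False
    # Step 2: start the task tracker on this node.
    start_tracker()
    # Step 3: keep running until the tracker has been idle long enough.
    last_busy = clock()
    while True:
        if tracker_busy():
            last_busy = clock()
        elif clock() - last_busy >= idle_timeout_s:
            break
        sleep(poll_interval_s)
    # Step 4: exclude the node, stop the tracker, clean up scratch space.
    shutdown_tracker()
    return True
```

Injecting `clock` and `sleep` lets the loop be exercised with a fake clock, so the idle timeout can be checked without waiting in real time.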

SLIDE 31

Running task tracker nodes

We don’t have a simple way to monitor the task tracker’s operations
Instead:
  monitor daemon’s local scratch space
  specific directories are created while a task is running
  by seeing which paths exist we know which tasks are running
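The directory-based check described above might look like the following sketch. The layout is an assumption: it supposes one subdirectory per running task directly under a scratch root, which is a simplification and not necessarily the layout Hadoop uses. `running_task_ids` and the directory names in the demo are hypothetical.

```python
import os
import tempfile


def running_task_ids(scratch_root):
    """Names of per-task directories currently present under scratch_root."""
    try:
        with os.scandir(scratch_root) as entries:
            # Only directories count as running tasks; stray files are ignored.
            return sorted(entry.name for entry in entries if entry.is_dir())
    except FileNotFoundError:
        # No scratch space yet means no task has started on this node.
        return []


# Demo on a throwaway directory with made-up task names.
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, "task_0001"))
    os.mkdir(os.path.join(demo_root, "task_0002"))
    print(running_task_ids(demo_root))
```

A node would be considered idle once this list has stayed empty for longer than the configured timeout.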

SLIDE 32

Deployment at CRS4

Use shared GPFS for storage
Job tracker web interface accessible to all users
Multi-user setup
  use Hadoop’s own mechanism for running multi-user cluster
  LinuxTaskController with accompanying setuid-root binary
    Task processes with client’s EUID
    Service daemons (JobTracker, TaskTrackers) run as a system user

SLIDE 33

Patches to Hadoop

Multi-user with GPFS required some small patches to Hadoop
Hadoop MapReduce uses staging directories on shared FS to pass job info to task processes
Some code assumes Hadoop runs as super-user and has full access to file system
  enforces permissions that are too restrictive for Hadoop user and task user to both access data
Patched Hadoop to relax those checks

SLIDE 34

Deployment layout

Cluster configuration and binaries stored on a shared volume mounted on all cluster nodes
Paths provided via environment variables loaded with “module”

e.g.,
  $ module load hadoocca
  $ hadoop jar MyJob.jar \
      file:///home/pireddu/data/input \
      file:///home/pireddu/data/output

SLIDE 35

Outline

1. Introduction
2. Hadoocca – Dynamic MapReduce allocation
3. Conclusion

SLIDE 36

Hadoocca

An allocation strategy for dynamic Hadoop MapReduce clusters

  doesn’t require HDFS
  analogous to Amazon’s EMR
Open source implementation
Suitable for HPC centres that
  don’t want to dedicate a portion of their cluster exclusively to Hadoop
  are happy with a small- to medium-sized installation
  have a suitable storage infrastructure

SLIDE 37

In production use

We’ve been using Hadoocca at CRS4 for about 8 months
Born as a prototype, but it’s still running!
Static JobTracker makes running Hadoop programs really easy
Compatible with Pig and Hive
Increased adoption of Hadoop at CRS4
  With the addition of tools such as Pydoop script, it has become a common way to write simple parallel programs

Without this type of solution it would have been quite difficult to establish Hadoop as a steady fixture at CRS4

SLIDE 38

Future development

Release current prototype code
Rewrite
  support for Hadoop 1.x and 2.x
  generalized queuing code (maybe DRMAA)
  modular scheduling

SLIDE 39

Thank you

Thank you! Questions?
