Plug-and-play Virtual Appliance Clusters Running Hadoop - Dr. Renato Figueiredo, ACIS Lab, University of Florida

SLIDE 1

Advanced Computing and Information Systems laboratory

Plug-and-play Virtual Appliance Clusters Running Hadoop

  • Dr. Renato Figueiredo

ACIS Lab - University of Florida

SLIDE 2

Introduction

• You have so far learned about how to use Hadoop clusters
• Up to now, you have used resources configured by others
• In this lecture you will learn about ways of deploying your own software stack using virtual appliances
• And we will overview a system that makes for simple configuration of groups of virtual appliances – i.e. virtual clusters

SLIDE 3

Objectives

• Concepts you will learn:
  • What is a virtual appliance?
  • What is a GroupVPN?
  • What is a virtual cluster?
• Demonstrations, software that you will be able to take and follow on your own
  • Deploy your Hadoop cluster (and beyond)
  • On clouds – e.g. FutureGrid, EC2, private cloud
  • On your own local resources – desktops
  • Even across institutions

SLIDE 4

Outline

• Virtual appliances and the Grid appliance
• GroupVPN – easy to use, social VPNs
• Case study and demonstration: creating your own Hadoop cluster
  • Local resources
  • Cloud resources
  • Across providers

SLIDE 5

What is an appliance?

• Physical appliances
  • Webster – “an instrument or device designed for a particular use or function”

SLIDE 6

What is an appliance?

• Hardware/software appliances
  • TV receiver + computer + hard disk + Linux + user interface
  • Computer + network interfaces + FreeBSD + user interface

SLIDE 7

What is a virtual appliance?

• An appliance that packages software and configuration needed for a particular purpose into a virtual machine “image”
• The virtual appliance has no hardware – just software and configuration
• The image is a (big) file
• It can be instantiated on hardware

SLIDE 8

Virtual appliance example

• Linux + Apache + MySQL + PHP

[Figure: a LAMP image is copied and instantiated on a virtualization layer to create a web server, then another web server, and so on (repeat…)]

SLIDE 9

We were talking about Hadoop?

• Replace Apache, MySQL, PHP with the middleware of your choice

[Figure: a Hadoop image is copied and instantiated on a virtualization layer to create a Hadoop worker, then another Hadoop worker, and so on (repeat…)]

SLIDE 10

What about the network?

• Multiple Web servers might be completely independent from each other
• Hadoop workers are not
  • Need to communicate and coordinate with each other
  • Each worker needs an IP address, uses TCP/IP sockets
• Cluster middleware stacks assume a collection of machines, typically on a LAN (Local Area Network)
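
As a minimal sketch of the point that each worker needs an IP address and talks over TCP/IP sockets, the Python below has one "worker" register with a "coordinator" over a plain TCP connection. The message format and the names `run_coordinator` and `register_worker` are illustrative assumptions, not Hadoop's protocol; 127.0.0.1 stands in for a LAN or virtual LAN address so the sketch runs on a single machine.

```python
import socket
import threading

def run_coordinator(server_sock, received):
    """Accept one worker connection and record its registration message."""
    conn, _addr = server_sock.accept()
    with conn:
        received.append(conn.recv(1024).decode())
        conn.sendall(b"ack")

def register_worker(host, port, name):
    """Connect to the coordinator over TCP and announce this worker."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(f"register {name}".encode())
        return sock.recv(1024).decode()

def demo():
    # On a real cluster the host would be the coordinator's (virtual) LAN address.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    received = []
    thread = threading.Thread(target=run_coordinator, args=(server, received))
    thread.start()
    reply = register_worker("127.0.0.1", port, "worker-1")
    thread.join()
    server.close()
    return received, reply
```

The worker code only needs a reachable IP address and port; it does not care whether that address lives on a physical LAN or on a virtual network.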

SLIDE 11

Enter virtual networks

• NOWs, COWs (physical machines on a switched network, deployed from an installation image)
  • Local-area
  • Physical machines
  • Self-organizing switching (e.g. Ethernet spanning tree)
• “WOWs” (virtual machines deployed from a VM image)
  • Wide-area
  • Virtual machines (VMs)
  • Self-organizing overlay: IP tunnels, P2P routing

SLIDE 12

Virtual cluster appliances

• Virtual appliance + virtual network

[Figure: a “Hadoop + Virtual Network” image (virtual machine plus virtual network) is copied and instantiated to create a Hadoop worker, then another Hadoop worker, and so on (repeat…)]

SLIDE 13

Virtual network architecture

[Figure: on each of two hosts, an application sends traffic through a VNIC to a virtual router; the two virtual routers are connected by a (wide-area) overlay network. Labels:]

• Isolated, private virtual address space (10.10.1.1, 10.10.1.2)
• Unmodified applications, e.g. Connect(10.10.1.2,80)
• Capture/tunnel; scalable, resilient, self-configuring routing and object store

SLIDE 14

Demonstration

• A virtual appliance cluster

SLIDE 15

Q & A

SLIDE 16

Background

• Virtual appliances
  • Encapsulate software environment in image
  • Virtual disk file(s) and virtual hardware configuration
• The Grid appliance
  • Encapsulates cluster software environments
  • Current examples: Condor, MPI, Hadoop
  • Homogeneous images at each node
  • Virtual LAN connecting nodes to form a cluster
  • Deploy within or across domains

SLIDE 17

Grid appliance in a nutshell

• Plug-and-play clusters with a pre-configured software environment
  • Linux + (Hadoop, Condor, MPI, …)
  • Scripts for zero-configuration
  • “Virtual machine” appliance; open-source software runs on Linux, Windows, Mac
• Hands-on examples, bootstrap infrastructure, and zero-configuration software – you’re off to a quick start

SLIDE 18

Grid appliance in a nutshell

• Creating an equivalent Grid on your own resources, or on cloud providers, is also easy
• Deploy image on FutureGrid, Amazon EC2
• Copy the same appliance to clusters, PC labs
• Simple deployment and management of ad-hoc clusters
  • Opportunistic computing
  • Testing, evaluation
  • Education, training

SLIDE 19

Example: Desktop Grids

• Reuse wealth of O/S tools:
  • VM image = files
  • Copy, compress, transfer
  • VM instance = process
• Easy install on typical systems
  • KVM, VirtualBox: open-source
  • VMware Player/Server/Workstation
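
Because a VM image is just a file, ordinary file tools are enough to clone or compress an appliance. The sketch below illustrates this with Python's standard library; the file names and the zero-filled placeholder content are assumptions for illustration, not a real appliance image.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

def clone_image(src: Path, dst: Path) -> Path:
    """Copy an image file; the clone can back a second, independent VM instance."""
    shutil.copyfile(src, dst)
    return dst

def compress_image(src: Path) -> Path:
    """Write a gzip-compressed copy next to the image, e.g. before transferring it."""
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst

def demo():
    workdir = Path(tempfile.mkdtemp())
    image = workdir / "appliance.img"        # hypothetical file name
    image.write_bytes(b"\0" * 1_000_000)     # placeholder bytes, not a real disk image
    clone = clone_image(image, workdir / "worker1.img")
    archive = compress_image(clone)
    with gzip.open(archive, "rb") as f:
        restored = f.read()
    identical = image.read_bytes() == clone.read_bytes() == restored
    return identical, archive.stat().st_size
```

Real images are much larger and compress less dramatically than this zero-filled placeholder, but the workflow (copy, compress, transfer) is the same.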

SLIDE 20

Appliance/GroupVPN Example

1. Download appliance
   • Free pre-packaged Archer virtual appliances run on free VMMs (VMware, VirtualBox, KVM)
2. Create/join VPN group; download config
3. Boot appliances
   • Automatic connection to group VPN – self-configuring DHCP

[Figure labels:]

• Archer Global Virtual Network; Archer seed resources: 450 cores, 5 sites
• Middleware: Condor scheduler, NFS file systems
• CMS, Wiki, YouTube: community-contributed content (applications, datasets, tutorials)

SLIDE 21

Cloud deployment

• Cloud meaning Infrastructure-as-a-Service
  • Pay as needed
  • Elasticity – you typically only need cycles near conference deadlines
  • 100 nodes for two weeks vs 4 nodes for a year?
  • Management, cooling, power costs are not an issue
  • Amazon EC2 pricing today makes it a viable option
  • On-demand: $0.085/hour (1 core, 1.7GB), $0.34/hour for large (2 cores, 7.5GB)
  • $2856 for 100 small nodes for 2 weeks
  • Reserved: $228 fee, then $0.03/hour
  • Research credits available through grants
  • Research infrastructures
  • FutureGrid; Science Clouds
  • Private clouds
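
The pricing figures above can be checked with a short calculation. The sketch below uses only the rates quoted on the slide; treating the $228 reserved fee as a per-node, one-year fee is an assumption, since the slide does not say how it is charged.

```python
HOURS_PER_DAY = 24

def on_demand_cost(nodes, days, rate_per_hour):
    """Total on-demand cost: every node is billed for every hour it runs."""
    return nodes * days * HOURS_PER_DAY * rate_per_hour

def reserved_cost(nodes, days, fee_per_node, rate_per_hour):
    """Reserved cost, assuming (not stated on the slide) one up-front fee per node."""
    return nodes * (fee_per_node + days * HOURS_PER_DAY * rate_per_hour)

SMALL_RATE = 0.085  # $/hour for a small instance, from the slide

burst = on_demand_cost(nodes=100, days=14, rate_per_hour=SMALL_RATE)
steady = on_demand_cost(nodes=4, days=365, rate_per_hour=SMALL_RATE)
steady_reserved = reserved_cost(nodes=4, days=365, fee_per_node=228, rate_per_hour=0.03)

print(f"100 small nodes for 2 weeks, on-demand: ${burst:,.2f}")
print(f"4 small nodes for 1 year, on-demand:    ${steady:,.2f}")
print(f"4 small nodes for 1 year, reserved:     ${steady_reserved:,.2f}")
```

The first figure reproduces the $2856 on the slide (100 × 14 × 24 × $0.085). Four small nodes running all year on demand cost roughly the same, which is the point of the "100 nodes for two weeks vs 4 nodes for a year" comparison.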

SLIDE 22

Example – FutureGrid

[Figure: an appliance image deployed on FutureGrid through Nimbus or Eucalyptus, for education and training]

SLIDE 23

Grid appliance: under the hood

• VM instances + GroupVPN + Grid/cloud middleware
  • VM instances (Xen, VMware, KVM, …) provide:
    • Sandboxing; software packaging; decoupling
    • Can be provisioned ad-hoc or through Cloud middleware
  • Virtual network (UF’s GroupVPN) provides:
    • Virtual private LAN over WAN; self-configuring and capable of firewall/NAT traversal
  • Grid/cloud middleware (Condor, Hadoop, MPI):
    • Scheduling, data transfers, …
    • Unmodified

SLIDE 24

Virtual network: GroupVPN

• Key technique: IP-over-P2P (IPOP) tunneling
  • Interconnect VM appliances
  • VMs perceive a virtual LAN environment
• Self-configuring
  • Avoid administrative overhead of typical VPNs
  • NAT and firewall traversal
• Scalable and robust
  • P2P routing deals with node joins and leaves
• Networks are isolated
  • One or more private IP address spaces
  • Decentralized DHCP serves addresses for each space
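
One way a decentralized address allocator can work is for each node to pick a candidate address and try to claim it in a shared key-value store that only accepts the first writer. The sketch below illustrates that idea with an in-memory stand-in for the store. It is not IPOP's or GroupVPN's actual code: the `ToyDHT` class, the key format, and the retry policy are all assumptions made for illustration.

```python
import ipaddress
import random

class ToyDHT:
    """In-memory stand-in for a distributed hash table with create-if-absent semantics."""

    def __init__(self):
        self._store = {}

    def create(self, key, value):
        """Store value under key only if the key is unclaimed; return True on success."""
        if key in self._store:
            return False
        self._store[key] = value
        return True

def claim_address(dht, namespace, subnet, node_id, rng, max_tries=100):
    """Pick random host addresses in the subnet until one is claimed for this node."""
    hosts = list(ipaddress.ip_network(subnet).hosts())
    for _ in range(max_tries):
        candidate = rng.choice(hosts)
        # The namespace prefix keeps separate virtual networks isolated from each other.
        if dht.create(f"{namespace}:{candidate}", node_id):
            return candidate
    raise RuntimeError("no free address found")

dht = ToyDHT()
rng = random.Random(0)
addresses = [claim_address(dht, "groupA", "10.10.1.0/24", f"node{i}", rng) for i in range(50)]
```

Because claims are keyed by namespace, a second group can reuse the same private range (for example 10.10.1.0/24) without colliding with the first.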

SLIDE 25

GroupVPN Overview

Bootstrapping private links through Web 2.0 interfaces and IP-over-P2P overlay tunneling

[Figure: users Alice, Bob, and Carol use a Web interface to a social network (e.g. XMPP, group site). Through the social network API, a messaging layer/information system holds Alice’s, Bob’s, and Carol’s public keys. Their nodes (node0.ipop at 10.10.0.2, node1.ipop at 10.10.0.3, node2.ipop) are linked by the overlay network (IPOP).]

SLIDE 26

Creating your own GroupVPN

• Setting up and managing typical VPNs can be daunting
  • VPN server(s), key distribution, NAT traversal
• GroupVPN makes it simple for users to create and manage virtual cluster VPNs
• Key insights:
  • Web 2.0 interface: create/manage user groups
  • All the complexity of setting up and managing VPN links is automated

SLIDE 27

GroupVPN Web interface

• You can request to join or create your own VPN group
  • Determines who is allowed to connect to the virtual network
• You can request to join or create your own appliance group
  • Determines priorities of users on resources owned by their groups

SLIDE 28

Demonstration

• GroupVPN user interface

SLIDE 29

Q & A

SLIDE 30

Deploying virtual clusters

• Same image, different VPNs

[Figure: a “Hadoop + Virtual Network” image (virtual machine plus GroupVPN) is copied and instantiated, together with GroupVPN credentials from the Web site, to create a Hadoop worker, then another Hadoop worker, and so on (repeat…). Each worker obtains its virtual IP through DHCP, e.g. 10.10.1.1 and 10.10.1.2.]

SLIDE 31

GroupVPN architecture

[Figure: on each of two hosts, Grid/cloud apps/middleware send traffic through a VNIC (“tap” device, addresses 10.10.1.1 and 10.10.1.2) to a virtual router (the GroupVPN router); the two routers are connected by the GroupVPN overlay.]

SLIDE 32

Under the hood: overlay architecture

• Bi-directional structured overlay (Brunet library)
• Self-configured NAT traversal
• Self-optimized links
  • Direct, relay
• Self-healing structure

[Figure: overlay routers connected by a multi-hop path and by a direct path]
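
The difference between a multi-hop path and a direct path can be illustrated with a toy ring overlay. In the sketch below every node links to its two ring neighbors, messages are forwarded greedily to the neighbor closest to the destination, and an optional shortcut link plays the role of a direct connection. This is a simplified illustration under assumed parameters (16-bit addresses, hand-picked node IDs), not the Brunet library's algorithm or API.

```python
RING_SIZE = 2 ** 16  # toy address space, chosen only for illustration

def ring_distance(a, b):
    """Shortest distance between two addresses going either way around the ring."""
    d = abs(a - b) % RING_SIZE
    return min(d, RING_SIZE - d)

def build_ring(addresses, shortcuts=()):
    """Link each node to its ring neighbors (bi-directionally), plus optional shortcuts."""
    nodes = sorted(addresses)
    links = {n: set() for n in nodes}
    for i, n in enumerate(nodes):
        nxt = nodes[(i + 1) % len(nodes)]
        links[n].add(nxt)
        links[nxt].add(n)
    for a, b in shortcuts:
        links[a].add(b)
        links[b].add(a)
    return links

def greedy_route(links, src, dst):
    """Forward hop by hop to the linked node closest to dst; return the path taken."""
    path = [src]
    current = src
    while current != dst:
        nxt = min(links[current], key=lambda n: ring_distance(n, dst))
        if ring_distance(nxt, dst) >= ring_distance(current, dst):
            raise RuntimeError("greedy routing made no progress")
        path.append(nxt)
        current = nxt
    return path

NODES = [100, 5000, 12000, 20000, 33000, 41000, 52000, 60000]
multi_hop = greedy_route(build_ring(NODES), 100, 33000)
with_shortcut = greedy_route(build_ring(NODES, shortcuts=[(100, 41000)]), 100, 33000)
```

With only ring neighbors the message takes four hops; adding one shortcut link cuts that to two, which is the intuition behind letting the overlay optimize its own links.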

SLIDE 33

Cloud deployment approach

• Generate virtual floppies
  • Through GroupVPN and GroupAppliance Web interface
• Deploy appliance image(s)
  • FutureGrid (Nimbus/Eucalyptus), EC2
  • GUI or command line tools
  • Use APIs to copy virtual floppy to image
• Submit jobs; terminate VMs when done

SLIDE 34

FutureGrid example - Nimbus

• Example using Nimbus:

    workspace.sh --deploy \
      --mdUserdata /tmp/floppy-worker.zip.b64 \
      --service https://f1r.idp.ufl.futuregrid.org:8443/wsrf/services/WorkspaceFactoryService \
      --file /tmp/output.xml \
      --metadata /tmp/grid-appliance.xml \
      --deploy-mem 1000 --deploy-duration 100 \
      --trash-at-shutdown Trash --exit-state Running \
      --displayname grid-appliance \
      --sshfile /home/renato/.ssh/id_dsa.pub

  • --mdUserdata: GroupVPN floppy image
  • --service: Nimbus service endpoint
  • --metadata: points to image on Nimbus server
  • --sshfile: SSH public key to log in to instance
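
The `.zip.b64` suffix of the user-data file suggests that the GroupVPN floppy zip is base64-encoded before being passed to Nimbus. A minimal sketch of that encoding step follows, assuming plain base64 without line wrapping; the slides do not say whether Nimbus expects a particular wrapping, and the helper name `encode_floppy` is hypothetical.

```python
import base64
import tempfile
from pathlib import Path

def encode_floppy(zip_path: Path) -> Path:
    """Write a base64-encoded copy of the floppy zip next to it (adds a .b64 suffix)."""
    out = zip_path.with_name(zip_path.name + ".b64")
    out.write_bytes(base64.b64encode(zip_path.read_bytes()))
    return out

def demo():
    workdir = Path(tempfile.mkdtemp())
    floppy = workdir / "floppy-worker.zip"
    # Placeholder content: the 22 bytes of an empty zip archive, not a real GroupVPN floppy.
    floppy.write_bytes(b"PK\x05\x06" + b"\0" * 18)
    encoded = encode_floppy(floppy)
    round_trip_ok = base64.b64decode(encoded.read_bytes()) == floppy.read_bytes()
    return encoded.name, round_trip_ok
```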

SLIDE 35

FutureGrid example - Eucalyptus

• Example using Eucalyptus (or ec2-run-instances on Amazon EC2):

    euca-run-instances ami-fd4aa494 -f floppy.zip --instance-type m1.large -k keypair

  • -f floppy.zip: GroupVPN floppy image
  • ami-fd4aa494: image ID on Eucalyptus server
  • -k keypair: SSH public key to log in to instance
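
When launching several workers from a script, it can help to build the command as an argument list rather than a hand-edited string. The sketch below only assembles and prints the command shown above, using exactly the flags from the slide; it does not contact any cloud. The helper name `euca_run_command` is hypothetical.

```python
import shlex

def euca_run_command(image_id, floppy_path, instance_type, keypair):
    """Assemble the euca-run-instances argument list using only the flags from the slide."""
    return [
        "euca-run-instances", image_id,
        "-f", floppy_path,                 # GroupVPN floppy image
        "--instance-type", instance_type,
        "-k", keypair,                     # key pair used to log in to the instance
    ]

command = euca_run_command("ami-fd4aa494", "floppy.zip", "m1.large", "keypair")
print(shlex.join(command))
```

The resulting list could be handed to `subprocess.run` on a machine where the Eucalyptus tools are installed and configured.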

SLIDE 36

Demonstration

• Deploying virtual appliance node on FutureGrid
• Configuring Hadoop cluster

SLIDE 37

Q & A

SLIDE 38

Local appliance deployments

• Two possibilities:
  • Share our “bootstrap” infrastructure, but run a separate GroupVPN
    • Simplest to set up
  • Deploy your own “bootstrap” infrastructure
    • More work to set up
    • Especially if across multiple LANs
    • Potential for faster connectivity

SLIDE 39

PlanetLab bootstrap

• Shared virtual network bootstrap
  • Runs 24/7 on 100s of machines on the public Internet
  • Connect machines across multiple domains, behind NATs

SLIDE 40

PlanetLab bootstrap: approach

• Create GroupVPN and GroupAppliance on the Grid appliance Web site
• Download configuration floppy
• Point users to the interface; allow users you trust into the group
• Trusted users can download configuration floppies and boot up appliances

SLIDE 41

Private bootstrap: General approach

• Good choice for single-domain pools
• Create GroupVPN and GroupAppliance on the Grid appliance Web site
• Deploy a small IPOP/GroupVPN bootstrap P2P pool
  • Can be on a physical machine, or appliance
  • Detailed instructions at grid-appliance.org
• The remaining steps are the same as for the shared bootstrap

SLIDE 42

Connecting external resources

• GroupVPN can run directly on a physical machine, if desired
  • Provides a VPN network interface
  • Useful for example if you already have a local Condor pool
    • Can “flock” to Archer
  • Also allows you to install Archer stack directly on a physical machine if you wish

SLIDE 43

Demonstration

• Connecting a local appliance to FutureGrid cluster

SLIDE 44

Where to go from here?

• Tutorials on FutureGrid and Grid appliance Web sites for various middleware stacks
  • Condor, MPI, Hadoop
• A community resource for educational virtual appliances
  • Success hinges on users effectively getting involved
  • If you are happy with the system, let others know!
  • Contribute with your own content – virtual appliance images, tutorials, etc.

SLIDE 45

Questions?

• More information:
  • http://www.futuregrid.org
  • http://grid-appliance.org
• This document was developed with support from the National Science Foundation (NSF) under Grant No. 0910812 to Indiana University for "FutureGrid: An Experimental, High-Performance Grid Test-bed." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.