SLIDE 1

On the Energy (In)efficiency of Hadoop: Scale-down Efficiency

Jacob Leverich and Christos Kozyrakis

Stanford University

SLIDE 2

The current design of Hadoop precludes scale-down of commodity clusters.


SLIDE 3

Outline

• Hadoop crash-course
• Scale-down efficiency
• How Hadoop precludes scale-down
• How to fix it
• Did we fix it?
• Future work

SLIDE 4

Hadoop crash-course


Hadoop == Distributed Processing Framework

1000s of nodes, PBs of data

Hadoop MapReduce ≈ Google MapReduce

Tasks are automatically distributed by the framework.

Hadoop Distributed File System (HDFS) ≈ Google File System

Files are divided into large (64 MB) blocks, which amortizes per-block overheads. Blocks are replicated for availability and durability.
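To make the programming model concrete, here is a minimal word-count job written against Hadoop's Java MapReduce API. This is an illustration, not code from the talk; the class names are mine.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Classic word count: the framework splits input files into blocks,
// schedules one map task per split (favoring data locality), and routes
// each key to a reducer.
public class WordCount {
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);   // emit (word, 1)
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(word, new IntWritable(sum)); // emit (word, total)
    }
  }
}
```

The programmer writes only the map and reduce functions; distribution, retries, and shuffling are handled by the framework, as the slide says.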

SLIDE 5

Scale-down motivation

[Figure: power vs. utilization for an HP ProLiant DL140 G3 server, with the typical utilization range of datacenter servers highlighted. After Barroso and Hölzle, 2007.]

SLIDE 6

Scale-down for energy proportionality

Four nodes at 40% utilization: 4 × P(40%) = 4 × 325 W = 1300 W

Two nodes at 80% utilization, the other two asleep at ~0 W: 2 × P(80%) = 2 × 365 W = 730 W

Same total work, roughly 44% less power: because servers are far from energy-proportional, consolidating load and sleeping the idle nodes wins.
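A minimal sketch of the arithmetic above, assuming a linear power model P(u) = P_idle + (P_peak - P_idle) · u. The idle and peak wattages are assumptions chosen to reproduce the slide's 325 W and 365 W figures, not measurements from the talk.

```java
// Consolidation arithmetic under an assumed linear power model.
public class ScaleDownMath {
  static final double P_IDLE = 285.0; // watts at 0% utilization (assumed)
  static final double P_PEAK = 385.0; // watts at 100% utilization (assumed)

  static double power(double utilization) {
    return P_IDLE + (P_PEAK - P_IDLE) * utilization;
  }

  public static void main(String[] args) {
    double spreadOut = 4 * power(0.40);     // 4 x 325 W = 1300 W
    double consolidated = 2 * power(0.80);  // 2 x 365 W = 730 W (sleepers ~0 W)
    System.out.printf("spread: %.0f W, consolidated: %.0f W, saved: %.0f%%%n",
        spreadOut, consolidated, 100 * (1 - consolidated / spreadOut));
  }
}
```

Running it prints the 1300 W vs. 730 W comparison and the ~44% saving.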

SLIDE 7

The problem: storage consolidation


The Hadoop Distributed File System…

Consolidate computation? Easy. Consolidate storage? Not (as) easy.

“All servers must be available, even during low-load periods.” [Barroso and Hölzle, 2007]

Hadoop inherited this “feature” from the Google File System. :-(

SLIDE 8

HDFS and block replication

[Figure: “block replication table” mapping blocks A–H to nodes 1–9]

Placement policy: 1st replica local, others remote. Allocate evenly. Replication factor = 3.
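A hypothetical sketch of this placement policy; this is not Hadoop's actual BlockPlacementPolicy code, and the names are illustrative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of HDFS-style replica placement: first replica on the writer's
// local node, remaining replicas spread across randomly chosen remote nodes.
public class ReplicaPlacement {
  static List<Integer> placeReplicas(int localNode, List<Integer> allNodes,
                                     int replicationFactor) {
    List<Integer> targets = new ArrayList<>();
    targets.add(localNode);                   // 1st replica: local
    List<Integer> remotes = new ArrayList<>(allNodes);
    remotes.remove(Integer.valueOf(localNode));
    Collections.shuffle(remotes);             // random targets spread load evenly
    // Assumes the cluster has at least replicationFactor nodes.
    targets.addAll(remotes.subList(0, replicationFactor - 1));
    return targets;
  }
}
```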

SLIDE 9

Attempted scale-down

[Figure: block replication table, nodes 1–9 × blocks A–H, with several nodes put to sleep]

Problems:

• Scale-down vs. self-healing: HDFS treats sleeping replicas as if they were lost replicas and re-replicates them, wasting capacity in a flurry of net & disk activity!
• Which nodes to disable? Must maintain data availability.

SLIDE 10

How to fix it

Problem: Scale-down vs. self-healing (sleeping replicas != lost replicas; wasted capacity; flurry of net & disk activity!)
Fix: self-non-healing

Problem: Which nodes to disable? (must maintain data availability)

SLIDE 11

Self-non-healing

[Figure: block replication table, nodes 1–9 × blocks A–H; sleeping nodes marked “Zzzzz…”]

Coordinate with Hadoop when we put a node to sleep: prevent block re-replications.

SLIDE 12

New RPCs in the HDFS primary node (NameNode)


sleepNode(String hostname)

Similar to node decommissioning, but without re-replicating the node’s blocks.

% hadoop dfsadmin -sleepNode 10.10.1.80:50020

Save its blocks to a “sleeping blocks” map for bookkeeping. Ignore heartbeats and block reports from this node.

wakeNode(String hostname)

Watch for heartbeats; force the node to send a block report. Execute arbitrary wake-up commands (e.g., send a wake-on-LAN packet).

wakeBlock(Block target)

Wake a sleeping node that has a particular block
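Taken together, the three RPCs might be declared on the primary node roughly as follows. The interface name and the stub Block type are illustrative, not actual Hadoop APIs; the signatures follow the slide.

```java
import java.io.IOException;

// Hypothetical NameNode-side interface for the scale-down RPCs above.
public interface ScaleDownProtocol {
  /** Like decommissioning, but without re-replicating the node's blocks.
      The node's blocks move to a "sleeping blocks" map, and its heartbeats
      and block reports are ignored until it wakes. */
  void sleepNode(String hostname) throws IOException;

  /** Watch for heartbeats again and force a fresh block report; may also
      trigger a wake-up action such as a wake-on-LAN packet. */
  void wakeNode(String hostname) throws IOException;

  /** Wake some sleeping node that holds a replica of the target block. */
  void wakeBlock(Block target) throws IOException;

  /** Stand-in for HDFS's internal block descriptor. */
  final class Block {
    final long blockId;
    Block(long blockId) { this.blockId = blockId; }
  }
}
```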

SLIDE 13

How to fix it

Problem: Scale-down vs. self-healing (sleeping replicas != lost replicas; wasted capacity; flurry of net & disk activity!)
Fix: self-non-healing

Problem: Which nodes to disable? (must maintain data availability)
Fix: the “covering subset” replication invariant

SLIDE 14

Replication placement invariants

Hadoop uses simple invariants to direct block placement.

Example: rack-aware block placement. Protects against common-mode failures (e.g., switch failure, power-delivery failure).

Invariant: blocks must have replicas on at least 2 racks.
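For illustration, a minimal check of this rack-aware invariant, assuming a node-to-rack lookup map. Hadoop actually resolves racks through its configured topology plugin; this sketch is not its code.

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Check that a block's replicas span at least two racks.
public class RackInvariant {
  static boolean holds(Set<String> replicaNodes, Map<String, String> rackOf) {
    Set<String> racks = replicaNodes.stream()
        .map(rackOf::get)            // node -> rack id
        .collect(Collectors.toSet());
    return racks.size() >= 2;        // replicas must span >= 2 racks
  }
}
```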

Is there some energy-efficient replication invariant? It must inform our decision about which nodes we can disable.

SLIDE 15

Covering subset replication invariant


Goal:

Maximize the number of servers that can simultaneously sleep.

Strategy:

Aggregate live data onto a “covering subset” of nodes. Never turn off a node in the covering subset.

Invariant:

Every block must have one replica in the covering subset.
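A minimal sketch of checking this invariant, assuming a map from each block to the set of nodes holding its replicas. Illustrative only; not code from the talk or from Hadoop.

```java
import java.util.Map;
import java.util.Set;

// Verify the covering-subset invariant over a replica map.
public class CoveringSubset {
  /** True iff every block keeps at least one replica on a covering node,
      so all blocks stay available while non-covering nodes sleep. */
  static boolean invariantHolds(Map<String, Set<String>> replicasByBlock,
                                Set<String> coveringNodes) {
    for (Set<String> replicaNodes : replicasByBlock.values()) {
      boolean covered = replicaNodes.stream().anyMatch(coveringNodes::contains);
      if (!covered) return false;  // this block would vanish at full scale-down
    }
    return true;
  }
}
```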

SLIDE 16

Covering subset replication invariant

[Figure: block replication table, nodes 1–9 × blocks A–H; every block keeps one replica on the covering subset while the remaining nodes sleep (“Zzzzz…”)]

SLIDE 17

How to fix it

Problem: Scale-down vs. self-healing (sleeping replicas != lost replicas; wasted capacity; flurry of net & disk activity!)
Fix: self-non-healing

Problem: Which nodes to disable? (must maintain data availability)
Fix: the “covering subset” replication invariant

SLIDE 18

Evaluation


SLIDE 19

Methodology


Disable n nodes, compare Hadoop job energy & perf.

• Individual runs of webdata_sort/webdata_scan from GridMix
• 30-minute job batches (with some idle time!)

Cluster

• 36 nodes, HP ProLiant DL140 G3
• 2 quad-core Xeon 5335s each, 32 GB RAM, 500 GB disk
• 9-node covering subset (1/4 of the cluster)

Energy model

• Validated estimate based on CPU utilization (sketched below)
• Disabled node = 0 watts
• Possible to evaluate hypothetical hardware
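A hedged sketch of such a utilization-based energy model: integrate a linear power curve over per-node CPU-utilization samples and charge zero to disabled nodes. The wattage constants are assumptions, not the paper's validated parameters.

```java
// Utilization-driven energy estimate for one node.
public class EnergyModel {
  static final double P_IDLE = 285.0, P_PEAK = 385.0; // watts (assumed)

  /** Energy in joules given utilization samples in [0,1] taken every
      intervalSeconds; a disabled node contributes nothing. */
  static double nodeEnergy(double[] utilSamples, double intervalSeconds,
                           boolean disabled) {
    if (disabled) return 0.0;                 // disabled node = 0 W
    double joules = 0.0;
    for (double u : utilSamples) {
      joules += (P_IDLE + (P_PEAK - P_IDLE) * u) * intervalSeconds;
    }
    return joules;
  }
}
```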

SLIDE 20

Results: Performance

[Figure: peak performance of each benchmark vs. number of disabled nodes]

It slows down (obviously). Sort (network intensive) is worse off than Scan: Amdahl’s Law.

SLIDE 21

Results: Energy

Less energy consumed for the same amount of work: 9% to 51% saved.

Some nodes consume more energy than the performance they contribute.

Slower systems are usually more efficient; high performance is a trade-off!

SLIDE 22

Results: Power

Excellent knob for cluster-level power capping.

Much larger dynamic range than tweaking frequency/voltage at the server level.

SLIDE 23

Results: The Bottom Line


Operational Hadoop clusters can scale down. We reduce energy consumption at the expense of single-job latency.

SLIDE 24

Continuing Work


SLIDE 25

Covering subset: mechanism vs. policy

The replication invariant is a mechanism. Which nodes constitute the subset is policy (an open question).

Size trade-off

• Too small: low capacity and a performance bottleneck
• Too large: wasted energy on idle nodes
• 1 / (replication factor) is a reasonable starting point

How many covering subsets?

Invariant: blocks must have a replica in each covering subset (see the sketch below).
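For illustration, the multi-subset form of the check sketched earlier: every block must keep a replica inside every covering subset. Names and data layout are assumptions, not code from the talk.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Generalized covering-subset invariant over several subsets.
public class MultiCoveringSubset {
  static boolean invariantHolds(Map<String, Set<String>> replicasByBlock,
                                List<Set<String>> coveringSubsets) {
    for (Set<String> replicaNodes : replicasByBlock.values()) {
      for (Set<String> subset : coveringSubsets) {
        // The block needs at least one replica inside this subset.
        if (replicaNodes.stream().noneMatch(subset::contains)) return false;
      }
    }
    return true;
  }
}
```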

SLIDE 26

Quantify Trade-offs


Random fault-injection experiments:

• What happens when a covering subset node fails?
• How much do you trust idle disks?

Trade-offs to quantify: performance, availability, energy consumption, durability.

SLIDE 27

Dynamic Power Management

Algorithmically decide which nodes to sleep or wake up. What signals to use?

CPU utilization? Disk/net utilization? Job Queue length?

MapReduce and HDFS must cooperate (e.g., idle nodes may host transient Map outputs).
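A hypothetical sketch of such a policy using two of the signals above (cluster CPU utilization and job-queue length). The thresholds and decision rule are assumptions; the talk leaves the actual policy as future work.

```java
// Toy dynamic power-management policy for a covering-subset cluster.
public class PowerManager {
  enum Action { WAKE_NODE, SLEEP_NODE, NONE }

  static final double HIGH_UTIL = 0.80, LOW_UTIL = 0.30; // assumed thresholds

  static Action decide(double avgCpuUtilization, int queuedJobs,
                       int awakeNodes, int coveringSubsetSize) {
    if (avgCpuUtilization > HIGH_UTIL || queuedJobs > 0) {
      return Action.WAKE_NODE;   // demand exceeds current capacity
    }
    if (avgCpuUtilization < LOW_UTIL && awakeNodes > coveringSubsetSize) {
      return Action.SLEEP_NODE;  // never sleep nodes in the covering subset
    }
    return Action.NONE;
  }
}
```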

SLIDE 28

Workloads


Benchmarks

• HBase/BigTable vs. MapReduce: short, unpredictable data access vs. long streaming access; quality of service and throughput are both important.
• Pig vs. Sort+Scan
• Recorded job traces vs. random job traces
• Peak performance vs. fractional utilization: what are typical usage patterns?

SLIDE 29

Scale

36 nodes to 1000 nodes: emergent behaviors?

• Network hierarchy
• Hadoop framework inefficiencies
• Computational overhead (must process many block reports!)

Experiments on Amazon EC2

Awarded an Amazon Web Services grant. Can’t measure power! Must use a model.

Any Amazonians here? Let’s make a validated energy model.