SLIDE 1

Sierra: practical power-proportionality for data center storage

Eno Thereska, Austin Donnelly, Dushyanth Narayanan

Microsoft Research Cambridge, UK

SLIDE 2

Our workloads have peaks and troughs

[Chart: Hotmail and Messenger load over time, normalized to 0-100%, showing pronounced peaks and troughs]

Servers are not fully utilized; they are provisioned for peak load.
A zero-load server still draws ~60% of the power of a fully loaded server!
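To make the waste concrete, here is a small worked example under an assumed linear power model in which an idle server draws 60% of peak; the 300 W figure and the function name are illustrative, not from the talk:

    #include <stdio.h>

    /* Assumed linear server power model: idle draws ~60% of peak. */
    static double server_power_w(double peak_w, double utilization)
    {
        const double idle_fraction = 0.6;   /* zero-load power as a fraction of peak */
        return peak_w * (idle_fraction + (1.0 - idle_fraction) * utilization);
    }

    int main(void)
    {
        /* At 20% load, a 300 W server still draws ~204 W... */
        printf("power at 20%% load: %.0f W\n", server_power_w(300.0, 0.2));
        /* ...whereas a truly power-proportional server would draw only 60 W. */
        printf("proportional ideal: %.0f W\n", 300.0 * 0.2);
        return 0;
    }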

SLIDE 3

Goal: power-proportional data center

  • Hardware is not power proportional 

– CPUs have dynamic voltage scaling (DVS), but other components don’t

  • Power-proportionality in software

– Turn off servers, rebalance CPU and I/O load

[Chart: power drawn vs. load]

SLIDE 4

Storage is the elephant in the room

  • CPU & network state can be migrated

– Computation state: VM migration
– Network: Chen et al. [NSDI’08]

  • Storage state cannot be migrated

– Terabytes per server, petabytes per DC
– Diurnal patterns → migrate at least twice a day!

  • Turn servers off, but keep data available?

– and consistent, and fault-tolerant

SLIDE 5

Context: Azure-like system

Chunk servers: CPU & storage co-located

Metadata Service (MDS):
– Chunk location and namespace
– Highly available (replicated)
– Scalable & lightweight
– Not on data path

Client library:

  • Object-based

read(), write(), create(), delete()

read(chunk ID, offset, size...) write(chunk ID, offset, size, data...)

– NTFS as the file system
– Objects striped into chunks
– Fixed-size (e.g., 64 MB) chunks, replicated
– Primary-based concurrency control
– Updates in place allowed
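As a rough illustration, the object/chunk interface sketched on this slide might look like the following C declarations; the type and function names are assumptions for illustration, not the actual client library:

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t chunk_id_t;

    /* Object namespace operations (handled via the MDS). */
    int obj_create(const char *name, size_t size_hint);
    int obj_delete(const char *name);

    /* Chunk-level data path, sent to the chunk's primary replica.
     * Chunks are fixed size (e.g. 64 MB) and replicated r ways. */
    int chunk_read(chunk_id_t chunk, uint64_t offset, size_t size, void *buf);
    int chunk_write(chunk_id_t chunk, uint64_t offset, size_t size, const void *data);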

SLIDE 6

Challenges

  • Availability, (strong) consistency
  • Recovery from transient failures
  • Fast rebuild after permanent failure
  • Good performance
  • Gear up/down without losing any of these
SLIDE 7

Sierra: storage subsystem with “gears”

  • Gear level g → g replicas available

– 0 ≤ g ≤ r = 3
– (r-g)/r of the servers are turned off
– Gear level chosen based on load
– At coarse time scale (hours)
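A minimal sketch of the gear arithmetic above, assuming N chunk servers and the (r-g)/r standby fraction from this slide (the function name and example numbers are illustrative):

    /* With N chunk servers, r-way replication and gear level g (0 <= g <= r),
     * Sierra keeps g replicas' worth of servers awake. */
    static int servers_in_standby(int n_servers, int r, int g)
    {
        return n_servers * (r - g) / r;   /* (r-g)/r of the servers sleep */
    }

    /* Example: N = 600, r = 3, g = 1  ->  400 servers in standby. */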

SLIDE 8

Sierra in a nutshell

  • Exploit R-way replication for read availability
  • Careful layout to maximize #servers in standby
  • Distributed virtual log for write availability & read/write consistency

  • Good power savings

– Hotmail: 23% - 50%

SLIDE 9

Outline

  • Motivation
  • Design
  • Evaluation
  • Future work and conclusion
SLIDE 10

Sierra design features

  • Power-aware layout
  • Distributed virtual log
  • Load prediction and gear scheduling policies
SLIDE 11

Power-aware layout

[Diagram: objects O1–O4 placed under the naïve random, naïve grouped, and Sierra layouts; powering down r - g of the r gear groups puts N(r - g)/r servers in standby]

Rebuild parallelism: naïve random = N, naïve grouped = 1, Sierra = N/r
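A minimal sketch of the gear-group placement idea, assuming replica i of each chunk is assigned to gear group i so that powering down a group removes at most one replica of any chunk while rebuild can still fan out over N/r servers; the hashing scheme is an illustrative assumption:

    #include <stdint.h>

    /* Place replica 'replica_idx' (0 <= replica_idx < r) of a chunk:
     * pick the gear group by the replica index, then spread chunks
     * across the servers of that group. */
    static int placement_server(uint64_t chunk_id, int replica_idx,
                                int n_servers, int r)
    {
        int group_size   = n_servers / r;             /* servers per gear group */
        int group_base   = replica_idx * group_size;  /* gear group = replica index */
        int within_group = (int)(chunk_id % (uint64_t)group_size);
        return group_base + within_group;
    }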

SLIDE 12

Rack and switch layout

[Diagram: gear groups 1, 2, 3 laid out across racks: rack-aligned (each rack holds a single gear group) vs. rotated (gear groups interleaved across racks)]

  • Rack-aligned → switch off entire racks
  • Rotated → better thermal balance
SLIDE 13

What about write availability?

  • Distributed virtual log (DVL)

  • Offloading mode (low gear)

[Diagram: the client write(C) goes to the primary P, which logs it to loggers L while the secondaries S are powered down]

  • Reclaim mode (highest gear)

[Diagram: logged updates are reclaimed from the loggers L onto the now-awake secondaries S]
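A sketch of the offloading and reclaim paths, under the assumption that the low-gear primary applies a write locally and appends a versioned copy to the loggers, and that gearing up drains the log onto the secondaries; all types and helper functions here are illustrative, not the DVL's real interface:

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative types and helpers; these are assumptions, not the DVL's API. */
    typedef struct { uint64_t chunk_id, version, offset; size_t len; } log_entry_t;
    typedef struct logger_set logger_set_t;

    int apply_local(uint64_t chunk_id, uint64_t off, size_t len, const void *data);
    int log_append(logger_set_t *lg, const log_entry_t *e, const void *data);
    int log_next(logger_set_t *lg, uint64_t chunk_id, log_entry_t *e, void *buf);
    int apply_to_secondaries(uint64_t chunk_id, const log_entry_t *e, const void *buf);

    /* Offloading mode (low gear): the primary applies the write locally and
     * logs a versioned copy on behalf of the powered-down secondaries. */
    int write_offloaded(logger_set_t *loggers, uint64_t chunk_id,
                        uint64_t *latest_version,
                        uint64_t off, size_t len, const void *data)
    {
        log_entry_t e = { chunk_id, ++*latest_version, off, len };
        if (apply_local(chunk_id, off, len, data) != 0)
            return -1;
        return log_append(loggers, &e, data);
    }

    /* Reclaim mode (highest gear): drain logged updates onto the secondaries. */
    void reclaim_chunk(logger_set_t *loggers, uint64_t chunk_id, void *buf)
    {
        log_entry_t e;
        while (log_next(loggers, chunk_id, &e, buf) == 0)
            apply_to_secondaries(chunk_id, &e, buf);
    }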

SLIDE 14

Distributed virtual log

  • Builds on past work [FAST’08,OSDI’08]
  • Evolved as a distributed system component

– Available, consistent, recoverable, fault-tolerant
– Location-aware (network locality, fault domains)
– “Pick r closest loggers that are uncorrelated” (see the sketch at the end of this slide)

  • All data eventually reclaimed

– Versioned store is for short-term use
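A sketch of the “pick r closest loggers that are uncorrelated” rule: scan loggers in order of network distance and skip any that share a fault domain with one already chosen. The data structures are illustrative assumptions:

    #include <stdbool.h>

    typedef struct {
        int id;
        int fault_domain;   /* e.g. rack or power domain */
        int distance;       /* network distance from the writer; smaller = closer */
    } logger_t;

    /* 'loggers' must be sorted by ascending distance; picks up to r loggers
     * with pairwise-distinct fault domains and returns how many were chosen.
     * 'out' receives indices into 'loggers'. */
    static int pick_loggers(const logger_t *loggers, int n, int r, int *out)
    {
        int chosen = 0;
        for (int i = 0; i < n && chosen < r; i++) {
            bool correlated = false;
            for (int j = 0; j < chosen; j++)
                if (loggers[out[j]].fault_domain == loggers[i].fault_domain)
                    correlated = true;
            if (!correlated)
                out[chosen++] = i;
        }
        return chosen;
    }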

SLIDE 15

Rack and switch layout

[Diagram: placement of chunk servers (C) and loggers (L) across racks under the dedicated-logger and co-located-logger schemes]

  • Dedicated loggers → avoid contention
  • Co-located loggers → better multiplexing
SLIDE 16

Handling new failure modes

  • Failure detected using heartbeats
  • On chunkserver failure during low-gear

– MDS wakes up all peers and migrates primaries (sketched at the end of this slide)
– In g = 1 there is a short unavailability, roughly the wake-up time
– Trade power savings for availability by using g = 2

  • Logger failures

– Wake up servers, reclaim data

  • Failures from powering off servers

– Power off servers only a few times a day
– Rotate which servers are geared down
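A sketch of the low-gear failure handling above; the helper names and the MDS entry point are assumptions for illustration:

    #define REPLICATION_FACTOR 3

    void wake_peers_of(int server);
    void migrate_primaries_from(int server);
    void schedule_rebuild(int server);

    /* Runs in the MDS when a chunk server misses its heartbeat deadline. */
    void on_chunkserver_timeout(int failed_server, int gear)
    {
        if (gear < REPLICATION_FACTOR) {
            /* Low gear: some replicas are asleep, so wake the failed server's
             * peers before re-assigning its primaries. In g = 1 this causes a
             * short unavailability, roughly the wake-up time. */
            wake_peers_of(failed_server);
        }
        migrate_primaries_from(failed_server);
        schedule_rebuild(failed_server);
    }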

SLIDE 17

Load prediction and gear scheduling

  • Use past to predict future (very simple)

– History in 1-hour buckets → predict for the next day (see the sketch at the end of this slide)
– Schedule gear changes (at most once per hour)
– Load metric considers random/sequential reads and writes

  • A hybrid predictive + reactive approach is likely to be superior for other workloads
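A sketch of such a predictor: per-hour load buckets from the previous day drive the gear choice for the same hour, picking the lowest gear whose capacity covers the prediction plus headroom. The headroom factor and the linear capacity model are assumptions:

    /* Per-hour load observed yesterday, in whatever units the load metric uses
     * (the real metric weights random/sequential reads and writes). */
    static double hourly_load[24];

    /* Capacity at gear g, assuming it scales with the fraction of servers on. */
    static double capacity_at_gear(int g, int r, double full_capacity)
    {
        return full_capacity * (double)g / (double)r;
    }

    /* Pick the lowest gear whose capacity covers predicted load with headroom. */
    static int gear_for_hour(int hour, int r, double full_capacity)
    {
        const double headroom = 1.25;   /* illustrative safety margin */
        double predicted = hourly_load[hour] * headroom;
        for (int g = 1; g <= r; g++)
            if (capacity_at_gear(g, r, full_capacity) >= predicted)
                return g;
        return r;   /* peak load: all replicas stay on */
    }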

SLIDE 18

Implementation status

  • User-level, event-based implementation in C
  • Chunk servers + MDS + client library = 11kLOC
  • DVL is 7.6 kLOC
  • 17 kLOC of support code (RPC libraries etc)
  • +NTFS (no changes)
  • MDS is not replicated yet
SLIDE 19

Summary of tradeoffs and limitations

(see paper for interesting details)

  • New power-aware placement mechanism

– Power savings vs. rebuild speed vs. load balancing

  • New service: distributed virtual log

– Loggers co-located with chunk servers vs. dedicated loggers

  • Availability vs. power savings

– 1 new failure case exposes this tradeoff

  • Spectrum of tradeoffs for gear scheduler

– Predictive vs. reactive vs. hybrid

SLIDE 20

Outline

  • Motivation
  • Design
  • Evaluation
  • Future work and conclusion
SLIDE 21

Evaluation map

  • Analysis of 1-week large-scale load traces from Hotmail and Messenger

– Can we predict load patterns?
– What is the power savings potential?

  • 48-hour I/O request traces + hardware testbed

– Does gear shifting hurt performance?
– Power savings (current and upper bound)

SLIDE 22

Hotmail I/O traces

  • 8 Hotmail backend servers, 48 hours
  • 3-way replication
  • Block I/O traces
  • Data (msg files) accesses only
  • 1 MB chunk size (to fit trace)
SLIDE 23

Testbed: MSR cluster

  • Lots of effort spent on provisioning correctly (see paper for details)
  • 6x3 chunk servers in baseline; 5x3 chunk servers + 1x3 loggers in Sierra; clients on other servers; 1 MDS

[Diagram: testbed racks K11-K12, K17-K18, and K25-K27 of nodes, each with a Cisco Catalyst top-of-rack switch, connected to a Cisco Nexus core over 10 Gbps Ethernet]
SLIDE 24

Load, gears, interesting periods

SLIDE 25

Performance during key stages

[Chart: mean response time (ms) for reads (RD), writes (WR), and total, baseline vs. Sierra, during steady state, up shift, and down shift]

SLIDE 26

One power savings curve

(more curves in paper)

SLIDE 27

Summary

  • A step towards better power-proportionality

– In software, no reliance on HW-based approaches
– Storage state as the main challenge (vs. computation)

  • Several challenges addressed

– Layout, availability, consistency, performance

  • Working prototype

– Lots of interesting technical details in the paper

SLIDE 28

Future work

  • Filling in troughs with useful work
  • Interactions with CPU/network-based consolidation
  • Quorum-based & Byzantine fault-tolerant systems & erasure codes

SLIDE 29

Related work

  • Power savings in RAID arrays

– Write off-loading [FAST’08], PARAID [FAST’07]
– Multi-speed disks in Hibernator [SOSP’05]

  • A possible improvement over our layout for read-only workloads [SOCC’10]
  • Hot vs. Cold data

– Popular data concentration [ISC’04]

  • Power savings for other resources

– CPU voltage scaling & VM migration
– Network considerations [NSDI’08]

SLIDE 30

Thank you

  • http://research.microsoft.com/sierra/