Sierra: practical power-proportionality for data center storage
Eno Thereska, Austin Donnelly, Dushyanth Narayanan
Microsoft Research Cambridge, UK
Our workloads have peaks and troughs
[Figure: normalized load over time for Hotmail and Messenger, showing pronounced peaks and troughs]
- Servers not fully utilized, provisioned for peak
- A zero-load server draws ~60% of the power of a fully loaded server!
Goal: power-proportional data center
- Hardware is not power proportional
– CPUs have DVS, but other components don’t
- Power-proportionality in software
– Turn off servers, rebalance CPU and I/O load
[Figure: power vs. load]
Storage is the elephant in the room
- CPU & network state can be migrated
– Computation state: VM migration
– Network: Chen et al. [NSDI’08]
- Storage state cannot be migrated
– Terabytes per server, petabytes per DC
– Diurnal patterns would mean migrating at least twice a day!
- Turn servers off, but keep data available?
– and consistent, and fault-tolerant
Context: Azure-like system
- Chunk servers: CPU & storage co-located
- Metadata Service (MDS)
– Chunk location and namespace
– Highly available (replicated)
– Scalable & lightweight
– Not on data path
- Client library
- Object-based interface
– read(chunk ID, offset, size...), write(chunk ID, offset, size, data...), create(), delete()
- NTFS as file system
- Object striped into chunks
- Fixed-size (e.g., 64 MB) chunks, replicated
- Primary-based concurrency control
- Allows updates in place
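As a minimal sketch of the striping scheme above (an illustration, not Sierra's code; the 64 MB size is the example from the slide, and the helper name is mine):

```python
# Sketch: mapping a byte offset within an object to a chunk.
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB fixed-size chunks, per the slide

def chunk_for(offset):
    """Return (chunk index, offset within that chunk) for a byte offset."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

assert chunk_for(0) == (0, 0)
assert chunk_for(CHUNK_SIZE + 5) == (1, 5)   # second chunk, byte 5
```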
Challenges
- Availability, (strong) consistency
- Recovery from transient failures
- Fast rebuild after permanent failure
- Good performance
- Gear up/down without losing any of these
Sierra: storage subsystem with “gears”
- Gear level g ⇒ g replicas available
– 0 ≤ g ≤ r = 3
– (r − g)/r of the servers are turned off
– Gear level chosen based on load
– At coarse time scale (hours)
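The gear arithmetic above can be sketched directly (the helper name is mine, not Sierra's API):

```python
# Minimal sketch of the gear model: with r-way replication, gear level g keeps
# g replicas' worth of servers on, so a fraction (r - g)/r of the N chunk
# servers can be put in standby.

def servers_in_standby(n, r, g):
    """Servers that can be turned off at gear level g, where 0 <= g <= r."""
    assert 0 <= g <= r
    return n * (r - g) // r

# Example: 48 chunk servers, 3-way replication (r = 3)
assert servers_in_standby(48, 3, 3) == 0    # highest gear: everything on
assert servers_in_standby(48, 3, 1) == 32   # gear 1: two thirds in standby
```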
Sierra in a nutshell
- Exploit R-way replication for read availability
- Careful layout to maximize #servers in standby
- Distributed virtual log for write availability & read/write consistency
- Good power savings
– Hotmail: 23% - 50%
Outline
- Motivation
- Design
- Evaluation
- Future work and conclusion
Sierra design features
- Power-aware layout
- Distributed virtual log
- Load prediction and gear scheduling policies
Power-aware layout
[Figure: replicas of objects O1–O4 laid out across N servers; r − g of every r server groups (N(r − g)/r servers total) are powered down: replica groups in the naïve grouped layout, gear groups in Sierra]
Rebuild parallelism: Naïve random = N, Naïve grouped = 1, Sierra = N/r
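The gear-group idea can be sketched as follows; the grouping and function names are my own illustration, not Sierra's actual placement code:

```python
# Hypothetical sketch of power-aware placement: each server belongs to one of
# r gear groups, and the i-th replica of every chunk is placed in gear group i.
# Powering down all but g gear groups then still leaves g replicas of every
# chunk on running servers, while spreading replicas within each group keeps
# rebuild parallelism high.
import random

def place_chunk(chunk_id, servers_by_group):
    """Pick one server per gear group for the chunk's r replicas."""
    rng = random.Random(chunk_id)             # deterministic per chunk
    return [rng.choice(group) for group in servers_by_group]

# 6 servers, r = 3: three gear groups of 2 servers (server id mod 3 = group)
groups = [[0, 3], [1, 4], [2, 5]]
replicas = place_chunk(42, groups)
assert len({s % 3 for s in replicas}) == 3    # one replica in every gear group
```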
Rack and switch layout
[Figure: rack-aligned layout (each rack holds servers from a single gear group) vs. rotated layout (gear groups rotated across racks)]
- Rack-aligned: can switch off entire racks
- Rotated: better thermal balance
What about write availability?
- Distributed virtual log (DVL)
– Offloading mode (low gear): write(C) goes to loggers (L) while replica servers are off
– Reclaim mode (highest gear): logged data is reclaimed back to the chunk replicas
[Figure: write(C) paths through primary (P), secondaries (S), and loggers (L) in the two modes]
Distributed virtual log
- Builds on past work [FAST’08,OSDI’08]
- Evolved as a distributed system component
– Available, consistent, recoverable, fault-tolerant
– Location-aware (network locality, fault domains)
– “Pick r closest loggers that are uncorrelated”
- All data eventually reclaimed
– Versioned store is for short-term use
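A toy illustration of off-loading and reclaim (class and method names are my own; the real DVL is a durable, replicated, location-aware distributed component):

```python
# Sketch: in low gear, writes append to a log instead of updating powered-down
# replicas; at reclaim time the log is replayed in order so later writes win,
# after which the short-term versioned store is truncated.

class VirtualLog:
    def __init__(self):
        self.entries = []          # append-only (chunk_id, offset, data)

    def offload_write(self, chunk_id, offset, data):
        """Low gear: buffer the write instead of updating replicas."""
        self.entries.append((chunk_id, offset, data))

    def reclaim(self, chunk_store):
        """Highest gear: replay in log order, then truncate the log."""
        for chunk_id, offset, data in self.entries:
            chunk_store.setdefault(chunk_id, {})[offset] = data
        self.entries.clear()

log = VirtualLog()
store = {}
log.offload_write(1, 0, b"old")
log.offload_write(1, 0, b"new")    # later write to the same offset
log.reclaim(store)
assert store[1][0] == b"new"       # replay order preserves write ordering
assert log.entries == []           # all data eventually reclaimed
```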
Rack and switch layout
[Figure: dedicated layout (a logger node L alongside chunkservers C in each rack) vs. co-located layout (loggers share servers with chunkservers)]
- Dedicated loggers: avoid contention
- Co-located loggers: better multiplexing
Handling new failure modes
- Failure detected using heartbeats
- On chunkserver failure during low gear
– MDS wakes up all peers, migrates primaries
– With g = 1 there is a short unavailability, ~O(time to wake up)
– Using g = 2 trades power savings for availability
- Logger failures
– Wake up servers, reclaim data
- Failures from powering off servers
– Power off only a few times a day
– Rotate gearing
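Heartbeat-based detection can be sketched as below (the timeout value and function names are illustrative assumptions, not from the paper); on a miss the MDS would wake the failed server's powered-down peers and migrate primaries:

```python
# Minimal sketch of heartbeat-based failure detection: a chunkserver is
# treated as failed once its heartbeats stop arriving within a timeout.

HEARTBEAT_TIMEOUT = 10.0   # seconds; illustrative value

def failed_servers(last_heartbeat, now):
    """Servers whose last heartbeat is older than the timeout."""
    return [s for s, t in last_heartbeat.items()
            if now - t > HEARTBEAT_TIMEOUT]

beats = {"cs1": 100.0, "cs2": 95.0, "cs3": 82.0}
assert failed_servers(beats, now=100.0) == ["cs3"]   # 18 s since last beat
```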
Load prediction and gear scheduling
- Use past to predict future (very simple)
– History in 1-hour buckets, predict for the next day
– Schedule gear changes (at most once per hour)
– Load metric considers random/sequential reads and writes
- A hybrid predictive + reactive approach is likely to be superior for other workloads
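The simple predictor described above might look like the following; the max-over-history rule and per-gear capacity model are my assumptions, not the paper's exact load metric:

```python
# Sketch: per-hour load history over past days predicts the next day,
# and the lowest gear whose capacity covers the predicted load is scheduled.

def predict_next_day(history):
    """history: one 24-entry load vector per past day; predict per-hour max."""
    return [max(day[h] for day in history) for h in range(24)]

def gear_for_load(load, capacity_per_gear, r=3):
    """Lowest gear g in 1..r whose g servers' worth of capacity covers load."""
    g = 1
    while g < r and load > g * capacity_per_gear:
        g += 1
    return g

history = [[0.2] * 8 + [0.9] * 10 + [0.4] * 6,    # day 1: busy 08:00-18:00
           [0.3] * 8 + [0.8] * 10 + [0.3] * 6]    # day 2
forecast = predict_next_day(history)
schedule = [gear_for_load(l, capacity_per_gear=0.4) for l in forecast]
assert schedule[2] == 1 and schedule[12] == 3     # low gear at night, high midday
```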
Implementation status
- User-level, event-based implementation in C
- Chunk servers + MDS + client library = 11kLOC
- DVL is 7.6 kLOC
- 17 kLOC of support code (RPC libraries, etc.)
- Plus NTFS (no changes)
- MDS is not replicated yet
Summary of tradeoffs and limitations
(see paper for interesting details)
- New power-aware placement mechanism
– Power savings vs. rebuild speed vs. load balancing
- New service: distributed virtual log
– Co-located with vs. dedicated chunk servers
- Availability vs. power savings
– One new failure case exposes this tradeoff
- Spectrum of tradeoffs for gear scheduler
– Predictive vs. reactive vs. hybrid
Outline
- Motivation
- Design
- Evaluation
- Future work and conclusion
Evaluation map
- Analysis of 1-week large-scale load traces from Hotmail and Messenger
– Can we predict load patterns?
– What is the power savings potential?
- 48-hour I/O request traces + hardware testbed
– Does gear shifting hurt performance?
– Power savings (current and upper bound)
Hotmail I/O traces
- 8 Hotmail backend servers, 48 hours
- 3-way replication
- Block I/O traces
- Data (msg files) accesses only
- 1 MB chunk size (to fit trace)
Testbed: MSR cluster
- Lots of effort spent on provisioning correctly (see paper for details)
- 6×3 chunk servers in baseline; 5×3 chunk servers + 1×3 loggers in Sierra; clients on other servers; 1 MDS
[Figure: testbed topology: node racks K11–K12, K17–K18, K25–K27 connected via Cisco Nexus 10 Gbps Ethernet]
Load, gears, interesting periods
Performance during key stages
[Figure: mean response time (ms) for reads, writes, and total, Baseline vs. Sierra, during steady state, up shift, and down shift]
One power savings curve
(more curves in paper)
Summary
- A step towards better power-proportionality
– In software, no reliance on hardware-based approaches
– Storage state as the main challenge (vs. computation)
- Several challenges addressed
– Layout, availability, consistency, performance
- Working prototype
– Lots of interesting technical details in the paper
Future work
- Filling in troughs with useful work
- Interactions with CPU/network-based consolidation
- Quorum-based & Byzantine fault-tolerant systems & erasure codes
Related work
- Power savings in RAID arrays
– Write off-loading [FAST’08], PARAID [FAST’07]
– Multi-speed disks in Hibernator [SOSP’05]
- A possible improvement over our layout for read-only workloads [SOCC’10]
- Hot vs. Cold data
– Popular data concentration [ICS’04]
- Power savings for other resources
– CPU voltage scaling & VM migration
– Network considerations [NSDI’08]
Thank you
- http://research.microsoft.com/sierra/