Storage Fabric
CS6453
Last week: NVRAM is going to change the way we think about storage.
Today: challenges of storage layers (SSDs, HDs) built to handle massive amounts of data; slowdowns in HDs and SSDs; enforcing policies for IO operations in Cloud architectures.
One disk is not enough to handle massive amounts of data.
Last week: efficient datacenter networks using a large number of cheap commodity switches.
Solution here: efficient IO performance using a large number of commodity storage devices.
RAID 0 achieves Nx performance, where N is the number of disks.
Is this for free?
When N becomes large, the probability of a disk failure becomes large as well.
RAID 0 does not tolerate failures.
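As a rough illustration of why this is not free, here is a minimal sketch (not from the slides) that assumes an independent failure probability per disk and shows how quickly the chance of losing a RAID 0 array grows with N:

```python
# Minimal sketch (assumption: independent per-disk failure probability p).
# A RAID 0 stripe of N disks survives only if every disk survives, so the
# array failure probability grows quickly with N.
def raid0_failure_prob(n_disks: int, p_disk: float = 0.02) -> float:
    """Probability that at least one of n_disks fails (array data loss)."""
    return 1.0 - (1.0 - p_disk) ** n_disks

for n in (1, 10, 100, 1000):
    print(f"N={n:4d}: P(array failure) = {raid0_failure_prob(n):.3f}")
```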
RAID 1 (K-way mirroring) achieves (K-1)-fault tolerance with Kx disks.
Is this for free?
It requires Kx more disks (e.g., to tolerate 1 failure you need 2x more disks than RAID 0).
RAID 1 does not utilize resources efficiently.
Parity/erasure-coded RAID (e.g., RAID 6) achieves K-fault tolerance with N+K disks.
Efficient utilization of disks (not as great as RAID 0).
Fault tolerance (not as great as RAID 1).
Is this for free?
Reconstruction cost: the number of disks that must be read in case of failure(s).
RAID 6 has a reconstruction cost of N: rebuilding a failed drive requires reading from essentially all the other drives in the group.
Erasure Coding in Windows Azure Storage [Huang, 2012]
Exploit point:
Prob[1 failure] ≫ Prob[2 or more failures]
Solution: construct an erasure-coding technique with a low reconstruction cost for 1 failure.
1.33x storage overhead (relatively low). Tolerates up to 3 failures among 16 storage devices. Reconstruction cost of 6 for 1 failure and 12 for 2+ failures.
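The arithmetic behind these numbers can be sketched as follows; the layout parameters (12 data fragments in 2 local groups of 6, one local parity per group, plus 2 global parities) are the LRC(12,2,2) configuration from [Huang, 2012], and the code itself is only an illustration:

```python
# Minimal sketch of the arithmetic, assuming the LRC(12,2,2) layout from
# [Huang, 2012]: 12 data fragments in 2 local groups of 6, one local parity
# per group, plus 2 global parities.
data_fragments = 12
local_parities = 2            # one per local group
global_parities = 2
total = data_fragments + local_parities + global_parities   # 16 devices

storage_overhead = total / data_fragments                   # 16/12 = 1.33x

# Rebuilding a single lost data fragment only needs its local group:
# the 5 surviving data fragments plus the local parity = 6 reads.
single_failure_cost = data_fragments // 2                   # 6

print(f"overhead = {storage_overhead:.2f}x, "
      f"single-failure reconstruction cost = {single_failure_cost}")
```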
We have seen how failures are handled with reconstruction. What about slowdowns in HDs (or SSDs)?
A slowdown of a disk (with no failure) can have a significant impact on overall performance.
Questions:
Do HDs or SSDs exhibit transient slowdowns?
Are disk slowdowns frequent enough to affect overall performance?
What causes slowdowns?
How do we deal with slowdowns?
[Figure: a RAID group with data drives D … D and parity drives P, Q]

Dataset studied:
                         Disk          SSD
#RAID groups             38,029        572
#Data drives per group   3-26          3-22
#Data drives             458,482       4,069
Total drive hours        857,183,442   7,481,055
Total RAID hours         72,046,373    1,072,690
[Figure: CDF of Slowdown (Disk), x-axis: slowdown from 1x to 8x]
Hourly average I/O latency per drive: L_i
Slowdown: S_i = L_i / L_median
Tail: T = S_max (the largest slowdown within a RAID group in that hour)
Slow drive hours: S_i ≥ 2

S ≥ 2 at the 99.8th percentile
S ≥ 1.5 at the 99.3th percentile
T ≥ 2 at the 97.8th percentile
T ≥ 1.5 at the 95.2th percentile
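As a minimal sketch (not the paper's code), these metrics can be computed per RAID group and hour from the hourly average latencies; the example latencies below are made up:

```python
# Minimal sketch: slowdown and tail metrics for one RAID group in one hour.
# `hourly_latency` holds the hourly average I/O latency L_i of each drive.
import statistics

def slowdowns(hourly_latency):
    """Per-drive slowdown S_i = L_i / L_median within the RAID group."""
    l_median = statistics.median(hourly_latency)
    return [l / l_median for l in hourly_latency]

def tail(hourly_latency):
    """Tail T = the largest slowdown in the group for this hour."""
    return max(slowdowns(hourly_latency))

latencies_ms = [4.1, 4.3, 4.0, 4.2, 12.9]   # hypothetical: one drive ~3x slower
s = slowdowns(latencies_ms)
print([round(x, 2) for x in s], "tail =", round(tail(latencies_ms), 2))
```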
SSDs exhibit even more slowdowns
[Figure: CDF of slowdown interval length in hours, Disk vs. SSD]
Slowdowns are transient.
40% of HD slowdowns last ≥ 2 hours.
12% of HD slowdowns last ≥ 10 hours.
Many slowdowns happen in consecutive hours (i.e., they persist for a while).
[Figure: CDF of inter-arrival period between slowdowns in hours, Disk vs. SSD]
90% of Disk slowdowns are within 24 hours of another slowdown of the same Disk.
More than 80% of SSD slowdowns are within 24 hours of another slowdown of the same SSD.
Slowdowns recur on the same drives relatively close together in time.
[Figure: CDF of rate imbalance (RI) within slow drive hours (S_i ≥ 2), Disk vs. SSD]
Rate imbalance: RI_i = IORate_i / IORate_median
Rate imbalance does not seem to be the main cause of slow disks.
[Figure: CDF of size imbalance (SI) within slow drive hours (S_i ≥ 2), Disk vs. SSD]
Size imbalance: SI_i = IOSize_i / IOSize_median
Size imbalance does not seem to be the main cause of slow disks.
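A minimal sketch of this check (not the paper's code): for each slow drive hour, compare the drive's I/O rate and I/O size against the group medians; the per-drive tuples below are hypothetical:

```python
# Minimal sketch: do slow drive hours coincide with rate or size imbalance?
# `drives` is a hypothetical list of (latency, io_rate, io_size) per drive
# for one hour in one RAID group.
import statistics

def imbalance(values):
    """Per-drive imbalance relative to the group median (RI or SI)."""
    med = statistics.median(values)
    return [v / med for v in values]

def slow_hour_imbalances(drives, threshold=2.0):
    lat, rate, size = zip(*drives)
    s = imbalance(lat)     # slowdown S_i
    ri = imbalance(rate)   # rate imbalance RI_i
    si = imbalance(size)   # size imbalance SI_i
    # report RI and SI only for drives that are slow (S_i >= threshold)
    return [(round(ri[i], 2), round(si[i], 2))
            for i in range(len(drives)) if s[i] >= threshold]

drives = [(4.0, 100, 32), (4.2, 98, 32), (12.6, 101, 30), (4.1, 99, 33)]
print(slow_hour_imbalances(drives))   # the slow drive has RI and SI near 1x
```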
[Figure: CDF of slowdown vs. drive age in years (1-10), Disk]
Disk age shows some correlation with slowdowns, but the correlation is not strong.
No correlation of slowdowns with the time of day (00:00-24:00).
No explicit drive events around slow hours
Unplugging disks and plugging them back in does not particularly help.
There are significant differences between SSD vendors.
Create tail-tolerant RAIDs: treat slow disks as failed disks.
Reactive
Detect slow disks: those that take much longer than the others to answer (> 2x). If a disk is slow, reconstruct its answer from the other disks using RAID redundancy. In the best case, latency is around 3x that of a read from an average disk.
Proactive
Always issue the additional reads enabled by RAID redundancy and take the fastest answer. Uses much more I/O bandwidth.
Adaptive
A combination of both approaches that takes the findings into account. Use the reactive approach until a slowdown is detected; after that, switch to the proactive approach, since slowdowns are repetitive and last many hours.
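A minimal sketch of the adaptive policy under some assumed parameters (the one-hour proactive window and the 2x slow-read threshold are illustrative choices, not values from the paper):

```python
# Minimal sketch (not from the paper): adaptive tail-tolerant reads.
# Stay reactive until a drive is observed to be slow, then issue proactive
# (redundant) reads for that drive for a fixed window, since slowdowns tend
# to recur and persist for hours.
import time

PROACTIVE_WINDOW_S = 3600   # assumed: stay proactive for one hour
SLOW_FACTOR = 2.0           # assumed: a read > 2x typical latency is "slow"

class AdaptiveReader:
    def __init__(self):
        self.proactive_until = {}   # drive id -> timestamp

    def is_proactive(self, drive):
        return time.time() < self.proactive_until.get(drive, 0)

    def record_latency(self, drive, latency, typical_latency):
        # reactive path: mark the drive once a slowdown is detected
        if latency > SLOW_FACTOR * typical_latency:
            self.proactive_until[drive] = time.time() + PROACTIVE_WINDOW_S

    def plan_read(self, drive, redundancy_drives):
        # proactive path: also read the redundant copies / parity and
        # take whichever answer arrives first
        if self.is_proactive(drive):
            return [drive, *redundancy_drives]
        return [drive]
```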
More research is required on the possible causes of Disk and SSD slowdowns. Tail-tolerant RAIDs are needed to reduce the overhead caused by slowdowns.
Since reconstruction of data is the way to deal with slowdowns, and since Prob[1 slowdown] ≫ Prob[2 or more slowdowns], the Azure paper [Huang, 2012] becomes even more relevant.
General-purpose applications. Separate VM-VM connections from VM-Storage connections.
Storage is virtualized
Many layers from application to actual storage
Resources are shared across multiple tenants
Cannot support end-to-end policies (e.g., minimum IO bandwidth from application to storage).
Applications do not have any way of expressing their storage policies.
Shared infrastructure where aggressive applications tend to get more IO bandwidth.
No existing enforcement mechanism for controlling IO rates.
Aggregate performance policies
Non-performance policies
Admission control
Dynamic enforcement
Support for unmodified applications and VMs
<VM, Destination> -> Bandwidth (static, compute side)
<VM, Destination> -> Min Bandwidth (dynamic, compute side)
<VM, Destination> -> Sanitize (static, compute or storage side)
<VM, Destination> -> Priority Level (static, compute and storage side)
<Set of VMs, Set of Destinations> -> Bandwidth (dynamic, compute side)
Policies:
<VM1, Server X> -> B1
<VM2, Server X> -> B2

The controller sends the following to the SMBc (the SMB client driver on the compute side) of the physical server containing VM1 and VM2:

createQueueRule(<VM1, Server X>, Q1)
createQueueRule(<VM2, Server X>, Q2)
createQueueRule(<*, *>, Q0)
configureQueueService(Q1, <B1, low, S>), where S is the size of the queue
configureQueueService(Q2, <B2, low, S>)
configureQueueService(Q0, <C - B1 - B2, low, S>), where C is the capacity of Server X
Policies:
<VM1-VM3, Server X> -> 900 Mbps (aggregate)

Demand:
VM1 -> 600 Mbps, VM2 -> 400 Mbps, VM3 -> 200 Mbps

Result (max-min fair allocation of the 900 Mbps):
VM1 -> 350 Mbps, VM2 -> 350 Mbps, VM3 -> 200 Mbps
VM3's demand of 200 Mbps is fully satisfied; the remaining 700 Mbps is split evenly between VM1 and VM2, since each of them demands more than that share.
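The result above can be reproduced with a small max-min fair allocation routine; this is only an illustrative sketch, not IOFlow's actual controller logic:

```python
# Minimal sketch (not IOFlow's implementation): max-min fair division of an
# aggregate bandwidth guarantee among VMs with different demands.
def max_min_fair(total, demands):
    """Return {vm: allocation} for a max-min fair split of `total` Mbps."""
    alloc = {}
    remaining = dict(demands)
    budget = total
    while remaining:
        share = budget / len(remaining)
        # VMs whose demand fits within the fair share are fully satisfied
        satisfied = {vm: d for vm, d in remaining.items() if d <= share}
        if not satisfied:
            for vm in remaining:
                alloc[vm] = share
            break
        for vm, d in satisfied.items():
            alloc[vm] = d
            budget -= d
            del remaining[vm]
    return alloc

print(max_min_fair(900, {"VM1": 600, "VM2": 400, "VM3": 200}))
# -> {'VM3': 200, 'VM1': 350.0, 'VM2': 350.0}
```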
Windows-based IO stack.
10 hypervisors with 12 VMs each (120 VMs total).
4 tenants using 30 VMs each (3 VMs per hypervisor for each tenant).
1 Storage Server:
6.4 Gbps IO Bandwidth
1 Controller
1s interval between dynamic enforcements of policies
Tenant     Policy
Index      {VM 1-30, X}    -> Min 800 Mbps
Data       {VM 31-60, X}   -> Min 800 Mbps
Message    {VM 61-90, X}   -> Min 2500 Mbps
Log        {VM 91-120, X}  -> Min 1500 Mbps
Contributions
First software-defined storage approach. Fine-grained control over IO operations in the Cloud.
Limitations
The network or other resources might be the bottleneck.
Need to take care of placing VMs close to their data (spatial locality); Flat Datacenter Storage [Nightingale, 2012] provides solutions for this problem.
Guaranteed latencies cannot be expressed by the current policies; only a best-effort approach by setting priorities.
HDFS [Shvachko, 2009] and GFS [Ghemawat, 2003] work well for Hadoop MapReduce applications.
Facebook's Photo Storage [Beaver, 2010] exploits workload characteristics to design and implement a better storage system.