17-June-2011 TERENA TF on Storage

PERSISTENT I/O CHALLENGES & APPROACHES
Angelos Bilas, FORTH (bilas@ics.forth.gr)

Outline: modern application stacks
Example stream-processing stack:
- Applications: credit-card fraud detection, SLA compliance and violation detection, telephony fraud detection, COIs, aggregation queries, fraud profiles, monitoring, fraud-detection queries
- StreamCloud / StreamMine: parallel stream operators, parallel DB operators, stream MapReduce operators, fault tolerance, self-provisioning, MapReduce state machine, dynamic graphs
- Communication & storage: compressed SSD queue, mem-to-mem communication, persistent streaming, silent error detection
Example transactional stack:
- JEE application server: JBoss + Hibernate; object cache: CumuloCache
- Query engine: Derby; distributed file system: HDFS; column-oriented data store & block cache: HBase
- Transaction management: transactions, concurrency controllers, commit sequencers, loggers
- Elasticity management: self-provisioner, monitors, load balancers
- Underlying storage and communication layers
Typical data-centre I/O architecture:
- 1000s of application servers (different flavors of PCs/blades): multicores + memory + I/O throughput
- High-speed interconnect: 10-40 Gbit/s links, up to 10-100 Gbit/s
- 100s of file servers: disk controllers at ~2 GB/s; 12-36 SATA disks per node, each ~100 MB/s and ~2 TB, plus ~10% SSD cache
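A back-of-envelope calculation makes the imbalance in such a file-server node concrete. All figures below are the rough slide numbers above, not measurements; this is a sketch, not a sizing tool.

```python
# Where does the bottleneck sit in the file-server node sketched above?
# All constants are the approximate figures from the slide.

DISK_MBPS = 100             # sustained throughput of one SATA disk (MB/s)
DISKS_PER_NODE = 36         # upper end of the 12-36 disks/node range
CTRL_MBPS = 2000            # ~2 GB/s disk controller
NET_MBPS = 40 * 1000 // 8   # 40 Gbit/s link expressed in MB/s

aggregate_disk = DISK_MBPS * DISKS_PER_NODE     # 3600 MB/s of raw disk bandwidth
bottleneck = min(aggregate_disk, CTRL_MBPS, NET_MBPS)

print(f"disks: {aggregate_disk} MB/s, controller: {CTRL_MBPS} MB/s, "
      f"network: {NET_MBPS} MB/s -> bottleneck: {bottleneck} MB/s")
```

With these numbers the disk controller, not the disks or the network, caps the node at about 2 GB/s, which motivates looking closely at the controller and the host I/O path.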
Servers must be able to perform more I/O, taking advantage of modern storage devices.
Approaches:
1. Performance and scaling analysis
2. Hybrid hierarchies to take advantage of potential
3. Design for memory and synchronization issues
4. Parallelism in the lower part of the networked I/O stack
The virtualized I/O path traverses many layers:
- Guest user space: applications, middleware, system calls
- Guest OS kernel: VFS + FS, block devices, virtual drivers
- Host OS: VFS + FS, block devices, SCSI layers, HW device drivers, PCI driver
- Hardware: PCI Express interconnect; storage controller, disk controller, network controller
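To see why this layering matters, here is a toy per-request cost model of the path above. Every cycle count is an assumption chosen for illustration, not a measurement from the talk:

```python
# Toy model of how per-request overhead accumulates across the
# virtualized I/O path. Cycle counts are illustrative assumptions.

LAYERS = [
    ("guest syscall + VFS/FS", 4_000),
    ("guest block + virt drv",  3_000),
    ("VM exit / hypervisor",    8_000),
    ("host VFS/FS + block",     4_000),
    ("SCSI + driver + PCIe",    3_000),
]

total = sum(cost for _, cost in LAYERS)
for name, cost in LAYERS:
    print(f"{name:24s} {cost:6d} cycles ({100 * cost / total:.0f}%)")
print(f"{'total per I/O':24s} {total:6d} cycles")
```

Even with generous assumptions, a single small I/O can consume tens of thousands of CPU cycles before it ever reaches a device, which is the per-I/O software cost the rest of the talk examines.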
Figure: impact of host-issued PIO on DMA throughput (host-to-HBA and HBA-to-host directions, transfer sizes 4-64 KB, MB/s). The controller initiates DMA over the PCIe interconnect, consuming valid elements from a head/tail queue shared with the host; the plot compares 2-way throughput with host-issued PIO off and on.

(2) Outlook: I/O controllers should be fused with the host CPU.
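The head/tail queue in the figure can be sketched as a simple descriptor ring. This is an illustrative model of the mechanism, not a driver: real HBAs use doorbell registers and the controller pulls payloads with DMA.

```python
# Minimal sketch of a head/tail descriptor ring: the host enqueues
# valid elements at the tail, the controller dequeues at the head.

class DescriptorRing:
    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0   # next element the controller consumes
        self.tail = 0   # next free slot for the host

    def host_enqueue(self, payload):
        nxt = (self.tail + 1) % len(self.slots)
        if nxt == self.head:
            return False              # ring full: host must back off
        self.slots[self.tail] = payload
        self.tail = nxt               # publishing the new tail acts as the doorbell
        return True

    def controller_dequeue(self):
        if self.head == self.tail:
            return None               # ring empty
        payload = self.slots[self.head]
        self.slots[self.head] = None
        self.head = (self.head + 1) % len(self.slots)
        return payload

ring = DescriptorRing(4)
ring.host_enqueue("req-0")
ring.host_enqueue("req-1")
print(ring.controller_dequeue())      # req-0
```

The point of the measurement above is that when the host touches the device with PIO while such DMA transfers are in flight, the two traffic types contend on the PCIe interconnect.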
Figure: sequential I/O throughput (seq. reads and seq. writes, MB/s) and random I/O operations (read and write IOPS) as the number of controllers grows from 1 to 4; in some configurations interference drops throughput almost to half. Configurations shown: TLOR0, TROR0, TLORPRIL, TLORPLIL.
API: we include VFS in our analysis (as in traditional systems), which adds complexity.
Figure: kfsmark CPU breakdown (1 MB files, 64 application threads, 1 log per process) for CREAT and READ on 1-16 CPUs. Bars break %CPU into IO-WAIT, USER, SYSTEM, INTERRUPT, and IDLE; secondary axes show operations in thousands.
HDD vs. SSD characteristics:

                          HDD (WD5001AALS-00L3B2)   SSD (Intel X25-E)
Price/capacity ($/GB)     $0.3                      $3
Response time (ms)        12.6                      0.17
Throughput R/W (MB/s)     100/90                    277/202
IOPS R/W                  150/150                   30,000/3,500
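A few ratios derived from the table make the trade-off explicit (prices and rates are the slide's 2011 figures):

```python
# Derived metrics from the HDD/SSD table above.

hdd = {"usd_per_gb": 0.3, "read_iops": 150,    "read_ms": 12.6}
ssd = {"usd_per_gb": 3.0, "read_iops": 30_000, "read_ms": 0.17}

print(f"SSD is {ssd['usd_per_gb'] / hdd['usd_per_gb']:.0f}x more expensive per GB,")
print(f"but delivers {ssd['read_iops'] // hdd['read_iops']}x the read IOPS")
print(f"and roughly {hdd['read_ms'] / ssd['read_ms']:.0f}x lower response time.")
```

The SSD costs 10x more per GB but delivers 200x the read IOPS, which is why using a small SSD as a cache in front of disks, rather than replacing them, is attractive.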
Compressed SSD I/O caching [EuroSys 2010]: improvements over disks only of up to 99% for TPC-H, up to 20% for PostMark, and up to 11% for SPECsfs2008. Low-concurrency, small-I/O workloads remain problematic.
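The idea can be illustrated with a small sketch: compressing blocks before they reach the cache device lets the same SSD capacity hold more blocks. This is a toy model under assumed semantics (zlib compression, LRU eviction), not the EuroSys 2010 system.

```python
import zlib
from collections import OrderedDict

# Illustrative compressed SSD block cache: blocks are compressed
# before being stored, so effective cache capacity grows with the
# compressibility of the data. LRU eviction keeps usage bounded.

class CompressedBlockCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.blocks = OrderedDict()   # block_no -> compressed bytes (LRU order)

    def put(self, block_no, data):
        comp = zlib.compress(data)
        if block_no in self.blocks:
            self.used -= len(self.blocks.pop(block_no))
        while self.used + len(comp) > self.capacity and self.blocks:
            _, evicted = self.blocks.popitem(last=False)   # evict LRU entry
            self.used -= len(evicted)
        self.blocks[block_no] = comp
        self.used += len(comp)

    def get(self, block_no):
        comp = self.blocks.get(block_no)
        if comp is None:
            return None               # cache miss: caller goes to disk
        self.blocks.move_to_end(block_no)
        return zlib.decompress(comp)

cache = CompressedBlockCache(capacity_bytes=1 << 16)
cache.put(7, b"A" * 4096)             # a highly compressible 4 KB block
print(cache.get(7) == b"A" * 4096)    # True
```

The sketch also hints at why small-I/O, low-concurrency workloads are problematic: every hit pays a decompression step, which is hard to amortize without request parallelism.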
Overheads are both fundamental and related to the system I/O architecture [OSDI'10].
Request/response
The protocol implementation induces overheads stemming from storage-specific semantics; I/O semantics are not simple.
Virtualized application stacks under study:
- Guest user space: applications (TPC-W, SPECjAppServer, RUBiS with replication, Linear Road streaming, Tariff Advisor, TPC-H, TPC-C), middleware, PostgreSQL
- Guest OS kernel: system calls, VFS + FS, block devices, virtio
- Hypervisor: QEMU/KVM with Split-X on/off-load modules
- Host OS: system calls, VFS + FS, block devices, SCSI layers, HW device drivers, PCI driver
- Storage/network controllers

(2) Extensibility and effort: on/off-loading at any point in this stack is remarkably complex.
(Recap: the data-centre I/O architecture of application servers, high-speed interconnect, and file servers shown earlier.)
Failures are common. Scaling from one to many nodes should be seamless; this is not true today, by far.
We need distributed I/O paths that use resources more efficiently: buffering should not starve any other resource, especially with high-throughput devices in future systems. Who should do what in the distributed I/O path?
I/O Path Design & Implementation

How should functionality be divided between file servers and storage nodes? (2) What about a traditional SAN/NAS split?
- File servers: NAS (NFS/CIFS), FS layer, block I/O stack
- Network
- Storage nodes: block-level stack
(Recap: the JEE transactional application stack shown earlier.)
Key-value stores handle variable-size values, but this structure is lost during mapping to flat files. Last-write recovery is expensive and provides no data consistency guarantees when scaling to large amounts of storage and high rates.
Approach: support structures that incur frequent updates; perform all operations required (packing, cleanup) for dealing with variable-size items over fixed block-size persistent devices; adapt to configurable workload requirements. Synergies can be important for performance, e.g. the recovery mechanism.
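The packing step above can be sketched concretely. The record format (a 2-byte length prefix per item, zero-padded blocks) is a hypothetical simplification for illustration; a real store would also handle items larger than a block, updates, and cleanup.

```python
import struct

# Sketch of packing variable-size key-value items into fixed-size
# blocks, as a block-based persistent device requires. Each item is
# assumed to fit in one block (<= BLOCK_SIZE - 2 bytes).

BLOCK_SIZE = 4096

def pack_items(items):
    """Pack variable-size byte strings into zero-padded fixed-size blocks."""
    blocks, current = [], b""
    for item in items:
        rec = struct.pack(">H", len(item)) + item   # 2-byte length prefix
        if len(current) + len(rec) > BLOCK_SIZE:
            blocks.append(current.ljust(BLOCK_SIZE, b"\0"))
            current = b""
        current += rec
    if current:
        blocks.append(current.ljust(BLOCK_SIZE, b"\0"))
    return blocks

def unpack_items(blocks):
    """Recover the original items; a zero length marks end of data in a block."""
    items = []
    for block in blocks:
        off = 0
        while off + 2 <= BLOCK_SIZE:
            (n,) = struct.unpack_from(">H", block, off)
            if n == 0:
                break
            items.append(block[off + 2 : off + 2 + n])
            off += 2 + n
    return items

data = [b"key1=value1", b"x" * 3000, b"y" * 2000]
blocks = pack_items(data)
print(len(blocks), unpack_items(blocks) == data)
```

Even this toy version shows where the costs come from: padding wastes space, and any in-place update or deletion forces repacking or cleanup of whole blocks.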
This results in more work per I/O and hurts resource utilization.
Metadata footprint is equally important to data: middleware and applications handle metadata as well, multiplying these inefficiencies several times over.
The I/O path should be more efficient; simplification helps. How many cycles of processing per I/O does a data-centric application need?
Virtualization adds overhead and protection boundaries at each level to achieve scaling, which incurs high overhead and impacts efficiency.
Yiannis Klonatos
Manolis Marazakis