SLIDE 1

PERSISTENT I/O CHALLENGES & APPROACHES

Angelos Bilas, FORTH, bilas@ics.forth.gr
17-June-2011, TERENA TF on Storage

SLIDE 2

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
  • Remarks

SLIDE 3

Application Stacks

  • STREAM
  • CumuloNimbo

SLIDE 4

Stream Global Architecture Picture

[Figure: the STREAM architecture. Applications (credit card fraud detection, SLA compliance and violation detection, telephony fraud detection, COIs, aggregation queries, fraud profiles, monitoring) run over StreamCloud and StreamMine (parallel stream operators, parallel DB operators, stream MapReduce operators; fault tolerance, self-provisioning, MapReduce state machine, dynamic graphs), on top of a communication and storage layer (compressed SSD queue, mem-to-mem communication, persistent streaming, silent error detection).]

SLIDE 5

CumuloNimbo Global Architecture

[Figure: JEE application server (JBoss + Hibernate) over an object cache (CumuloCache), query engine (Derby), column-oriented data store and block cache (HBase), and distributed file system (HDFS); transaction management (commit sequencers, concurrency controllers, loggers), elasticity management (self-provisioner, load balancers, monitors), and communication and storage run alongside.]

SLIDE 6

Application Stacks

  • They tend to be complex
  • Each layer adds substantial protocol "machinery"
  • E.g. transactions, global name space
  • Today I/O is a significant bottleneck
  • Hard to know what all layers do
  • Questionable what can be modified realistically
  • How can modern storage systems best support these?

SLIDE 7

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
  • Remarks

SLIDE 8

Dimension Infrastructure Properly

[Figure: 1000s of application servers (different flavors of PCs/blades; multicores + memory + I/O xput) connected over a high-speed interconnect (10-40 Gbit/s) to 100s of file servers; each file server has disk controllers (~2 GB/s, networked at 10-100 Gbit/s) driving 12-36 SATA disks/node (100 MBy/s, ~2 TBytes each) plus a ~10% SSD cache.]

  • Dimensioning issues are not straightforward today (see the back-of-envelope example below)
  • I/O application overheads not understood
  • Do you balance thin or fat?
  • Other factors besides performance: power

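A back-of-envelope example of the balance question, using only the figures in the diagram above (an illustrative sketch, not a measurement): 36 SATA disks × 100 MBy/s ≈ 3.6 GBytes/s of raw disk bandwidth per file server, already above the ~2 GBytes/s the disk controllers can move, while a single 10 Gbit/s link carries only ~1.25 GBytes/s. A "fat" node is therefore controller- or network-bound unless it aggregates several links; a "thin" node with a couple of disks keeps the network busy but strands CPU, memory, and controller capacity, and pays more power per disk served.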

SLIDE 9

Scaling I/O on multicore CPUs

  • Observation
  • As the number of cores increases in modern systems, we are not able to perform more I/O
  • Target: 1M IOPS, 10 GBytes/s
  • Goal
  • Provide a scalable I/O stack (virtualized) over direct and networked storage devices
  • Go over:
  1. Performance and scaling analysis
  2. Hybrid hierarchies to take advantage of potential
  3. Design for memory and synchronization issues
  4. Parallelism in the lower part of the networked I/O stack

SLIDE 10

(1) Performance and Scaling Analysis

[Figure: the virtualized I/O stack, bottom-up: applications and middleware in guest user space; system calls; guest OS kernel VFS + FS and virtual drivers; host OS VFS + FS and block devices; SCSI layers, HW device drivers, PCI driver; PCI Express interconnect; storage, disk, and network controllers.]

  • Bottom-up
  • Controller
  • Actual controller
  • PCI
  • Host drivers
  • Block layer
  • SCSI
  • Block
  • Filesystem
  • xfs (a well-accepted FS)
  • vfs (an integral part of Linux)

SLIDE 11

I/O Controller [Systor'10]

  • (1) A queue protocol over PCI (sketched below)
  • Many parameters and quite complex
  • Requires decisions: tune for high throughput
  • (2) Request translation on the controller
  • Memory management: balance between speed and waste
  • (3) Request issue/completion towards devices
  • Use existing mechanisms but do careful scheduling
  • Prototype comparable to commercial products

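The queue protocol in (1) is at heart a producer/consumer ring shared between host and controller over PCI, with the head/tail bookkeeping shown on the next slide. Below is a minimal single-producer/single-consumer sketch in C; the descriptor layout, ring geometry, and all names are illustrative assumptions, not the Systor'10 prototype's actual interface.

```c
/* Minimal host->controller descriptor ring over a memory-mapped PCI
 * BAR. Illustrative sketch only: descriptor fields and ring size are
 * assumptions, and a real implementation adds doorbells, completion
 * queues, and batching (the tuning decisions the slide mentions). */
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 256                    /* power of two */

struct io_desc {                         /* one I/O request */
    uint64_t lba;                        /* starting block address */
    uint32_t len;                        /* length in blocks */
    uint32_t flags;                      /* e.g. read/write */
    uint64_t dma_addr;                   /* host buffer for DMA */
};

struct ring {                            /* lives in the shared BAR */
    volatile uint32_t head;              /* advanced by controller */
    volatile uint32_t tail;              /* advanced by host */
    struct io_desc slot[RING_SIZE];
};

/* Host side: enqueue a request; false when the ring is full. The
 * controller learns the new tail by polling or via a doorbell (PIO)
 * write, which slide 12 shows can hurt concurrent DMA throughput. */
static bool host_submit(struct ring *r, const struct io_desc *d)
{
    uint32_t tail = r->tail;
    if (((tail + 1) & (RING_SIZE - 1)) == r->head)
        return false;                    /* full: head not advanced yet */
    r->slot[tail] = *d;
    __sync_synchronize();                /* descriptor visible before tail */
    r->tail = (tail + 1) & (RING_SIZE - 1);
    return true;
}

/* Controller side: dequeue the next valid element, if any. */
static bool ctrl_fetch(struct ring *r, struct io_desc *out)
{
    uint32_t head = r->head;
    if (head == r->tail)
        return false;                    /* empty */
    *out = r->slot[head];
    __sync_synchronize();
    r->head = (head + 1) & (RING_SIZE - 1);
    return true;
}

int main(void)                           /* smoke test in ordinary memory */
{
    static struct ring r;
    struct io_desc d = { .lba = 0, .len = 8, .flags = 0, .dma_addr = 0 }, out;
    return (host_submit(&r, &d) && ctrl_fetch(&r, &out)) ? 0 : 1;
}
```

The "many parameters" the slide mentions (queue depth, batching, when to signal the other side vs. poll) all live around these two functions, which is why tuning for throughput vs. latency is a real decision.
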
SLIDE 12

Results and Outlook

[Figures: DMA throughput (host-to-HBA and HBA-to-host, MB/sec vs. transfer size in KB); impact of host-issued PIO on DMA throughput (PIO off/on; 2-way, to-host, from-host); queue diagram with head ("valid queue element") and tail ("valid element to dequeue") indices on host and controller sides, the controller initiating DMA over the PCIe interconnect.]

  • Controller needs to know the tail at the Host
  • Host needs to know the head at the Controller
  • xput: each controller can achieve 2 GBytes/s bi-dir
  • IOPs: each controller can achieve ~80K IOPs
  • 50K for commercial controllers with full I/O processing
  • Controller CPU is an important limitation
  • Outlook
  • (1) Scale throughput and IOPs by using multiple controllers
  • (2) I/O controllers should be fused with the host CPU

SLIDE 13

Block Layer

  • I/O request protocol translation, e.g. SCSI
  • Buffer management and placement
  • Other layers involved; essentially a block-type operation
  • Modern architecture trends create significant problems

SLIDE 14

Results and Outlook

[Figures: sequential I/O throughput (MB/s, seq. reads and writes) and random I/O operations (read and write IOPS) vs. number of controllers (1-4) and number of benchmark instances; placement series labelled TLOR0, TROR0, TLORPRIL, TLORPLIL.]

  • Translation processing scales with the number of cores
  • Both throughput and IOPs
  • I/O translation incurs overhead
  • Affinity is an important problem (see the pinning sketch below)
  • Wrong placement can reduce throughput almost to half

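Given that wrong placement can halve throughput, the practical remedy is to pin I/O work near the controller it serves. A minimal user-level sketch with pthread_setaffinity_np follows; the core-to-controller mapping is a hypothetical topology, where a real system would derive it from the PCI/NUMA layout (e.g. the controller's local_cpulist in sysfs).

```c
/* Sketch: pin each I/O worker to a core near "its" controller to
 * avoid the cross-placement penalty shown above. The topology here
 * (controller 0 near cores 0-3, controller 1 near cores 4-7) is a
 * hypothetical assumption for illustration. Linux-specific. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *io_worker(void *arg)
{
    int core = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... submit/complete I/O against the controller local to 'core' ... */
    printf("worker pinned to core %d\n", core);
    return NULL;
}

int main(void)
{
    int cores[2] = { 0, 4 };         /* one worker per controller */
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, io_worker, &cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```
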
SLIDE 15

Filesystem

  • Complex layer
  • Many complain about FS performance on multicores
  • Translates from a (request, file, offset, size) API to a (request, block#) API
  • Responsible for recovery (first layer to include extensive metadata in traditional systems)
  • We include VFS in our analysis (additional complexity)
  • Detailed analysis with extensive modifications to the kernel
  • Required non-trivial instrumentation to measure lock and wait times (see the sketch below)
  • Extensive tuning to ensure that we measure "meaningful" cases

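The lock- and wait-time instrumentation was done inside the kernel; the idea itself is simple enough to show with a user-level analogue (timestamp before and after acquiring a lock, accumulate the difference). The sketch below only illustrates the measurement, and the lock name is made up.

```c
/* Measure time spent waiting on a contended lock, as opposed to time
 * spent holding it. Illustrative user-level analogue of the in-kernel
 * instrumentation described above; 'journal_lock' is a made-up name. */
#include <pthread.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t journal_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t wait_ns;             /* per-thread in real use */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void locked_update(void)
{
    uint64_t t0 = now_ns();
    pthread_mutex_lock(&journal_lock);
    wait_ns += now_ns() - t0;        /* waiting, not holding */
    /* ... update journal / dentry state ... */
    pthread_mutex_unlock(&journal_lock);
}

int main(void)
{
    locked_update();
    printf("lock wait: %llu ns\n", (unsigned long long)wait_ns);
    return 0;
}
```
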
SLIDE 16

Results and Outlook

[Figures: kfsmark CPU breakdown (1 MB files, 64 app threads; IO-WAIT / USER / SYSTEM / INTERRUPT / IDLE) and CREAT / READ throughput (thousands of ops/sec, one log per process) vs. number of CPUs (1-16).]

  • Most FS operations do not scale with # of cores
  • Two main scaling problems
  • (1) vfs locking
  • vfs uses a structure for maintaining directory entry and inode information (dentry and inode caches)
  • Synchronization over the dentry cache is problematic due to vfs design
  • (2) FS journaling
  • All modern FSs need to worry about recovery
  • Most use a journaling scheme that is integrated with the lookup/update path
  • Synchronization over this journal is hindering scaling
  • Outlook
  • There is significant potential from both (1) and (2)
  • (1) is being discussed and (a) people are working on it, (b) there is potential to bypass
  • (2) is more fundamental: our goal is to target this

SLIDE 17

Summary of Analysis

  • (1) Fundamentally, I/O performance should scale
  • (2) Controller: use spatial parallelism and go with technology trends
  • (3) Block: worry about placement and affinity problems
  • (4) FS: worry about synchronization at specific points
  • Both (3) and (4) are due to current trends in multicores
  • Not broadly known problems yet

SLIDE 18

(2) Hybrid Device Hierarchies

  • To take advantage of this potential
  • Need hybrid device hierarchies using disks and SSDs
  • Otherwise, raw performance will not be adequate
  • [FlashCache'06, BPLRU'08, ...]
  • Designed and evaluated such a base hierarchy (a minimal sketch follows below)
  • Significant improvement
  • Over disks only
  • Over disks + SSDs, due to our policies

                            HDD (WD5001AALS-00L3B2)   SSD (Intel X25-E)
    Price/capacity ($/GB)   $0.3                      $3
    Response time (ms)      12.6                      0.17
    Throughput R/W (MB/s)   100/90                    277/202
    IOPS (R/W)              150/150                   30,000/3,500

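At its simplest, such a base hierarchy is a block-level lookup table in front of the HDD: hits are served at SSD latency (0.17 ms vs. 12.6 ms in the table above), misses go to disk and may be installed in the cache. The direct-mapped sketch below is a deliberately minimal illustration and omits the policies credited above with the additional gains.

```c
/* Minimal block-level SSD cache in front of an HDD: a direct-mapped
 * table from HDD block number to SSD slot. Geometry and names are
 * illustrative assumptions; real designs add associativity, dirty
 * tracking, and admission/eviction policies. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define CACHE_SLOTS 4096             /* SSD cache size in blocks */

struct slot { uint64_t blkno; bool valid; };
static struct slot map[CACHE_SLOTS]; /* slot = blkno % CACHE_SLOTS */

/* SSD slot to read on a hit, or -1 on a miss (caller reads the HDD
 * and may then install the block). */
static int cache_lookup(uint64_t blkno)
{
    struct slot *s = &map[blkno % CACHE_SLOTS];
    return (s->valid && s->blkno == blkno) ? (int)(blkno % CACHE_SLOTS) : -1;
}

static void cache_install(uint64_t blkno)
{
    struct slot *s = &map[blkno % CACHE_SLOTS];
    s->blkno = blkno;                /* silently evicts the old mapping */
    s->valid = true;
}

int main(void)
{
    cache_install(42);
    printf("lookup(42)=%d lookup(7)=%d\n", cache_lookup(42), cache_lookup(7));
    return 0;
}
```
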
SLIDE 19

Summary [EuroSys'10, NAS'11]: Compressed SSD I/O Caching

  • Transparent SSD caching is promising for improving performance
  • Improve SSD caching efficiency using online compression (see the sketch below)
  • Trade (cheap) CPU cycles for (expensive) I/O performance
  • Address challenges in online block-level compression for SSDs
  • Our techniques mitigate CPU and additional I/O overheads
  • Results in increased performance with realistic workloads
  • TPC-H up to 99%, PostMark up to 20%, SPECsfs2008 up to 11%
  • Cache hit ratio improves between 22%-145%
  • CPU utilization increases by up to 4.5x
  • Low-concurrency, small-I/O workloads are problematic
  • Overall our approach is worthwhile, but adds complexity...
  • Future work
  • Power-performance implications interesting; hardware off-loading
  • Improving compression efficiency by grouping similar blocks

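The heart of the CPU-for-I/O trade is compressing each block before it enters the SSD cache, so the same device capacity holds more cached blocks (hence the 22%-145% hit-ratio improvement). A minimal sketch using zlib follows (link with -lz); the 4 KB block size and the "store raw if incompressible" fallback are assumptions for illustration, not the published design.

```c
/* Compress a block before caching it on the SSD; store it raw when
 * compression does not help. Illustrative sketch of the technique,
 * not the EuroSys'10 implementation. Link with -lz. */
#include <zlib.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Returns the number of bytes to write to the SSD cache slot and
 * sets *compressed accordingly. */
static unsigned long maybe_compress(const unsigned char *block,
                                    unsigned char *out, int *compressed)
{
    unsigned long out_len = BLOCK_SIZE + 64; /* >= compressBound(4096) */
    if (compress(out, &out_len, block, BLOCK_SIZE) == Z_OK &&
        out_len < BLOCK_SIZE) {
        *compressed = 1;
        return out_len;
    }
    *compressed = 0;                 /* incompressible: avoid negative gain */
    memcpy(out, block, BLOCK_SIZE);
    return BLOCK_SIZE;
}

int main(void)
{
    unsigned char block[BLOCK_SIZE] = {0};   /* trivially compressible */
    unsigned char out[BLOCK_SIZE + 64];
    int c;
    unsigned long n = maybe_compress(block, out, &c);
    printf("stored %lu bytes (%s)\n", n, c ? "compressed" : "raw");
    return 0;
}
```
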
SLIDE 20

(3) Buffer Mgmt and Recovery Issues

  • Revisit
  • Buffer mgmt in DRAM required to stage/cache I/Os
  • Recovery required due to volatility of DRAM
  • Both fundamental and related to the system I/O architecture
  • We design a new DRAM buffer+cache mechanism (sketched below)
  • (1) Allow isolation and partitioning
  • (2) Allow control over placement
  • (3) Deal with both fixed and variable size items
  • Similar techniques recently used for other structures in the kernel [OSDI'10]
  • Use it with a kernel-level FS that is stateless

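A rough illustration of properties (1)-(3): give each client (e.g. each VM or workload) its own DRAM partition, serve both fixed-size blocks and variable-size items from it, and fail allocations locally instead of stealing from other partitions. The bump-allocator sketch below is purely illustrative and not the actual mechanism.

```c
/* Partitioned DRAM buffer pool: isolation (per-partition regions),
 * placement control (caller picks the partition), fixed- and
 * variable-size allocations. Illustrative sketch only; a real
 * mechanism needs freeing, eviction, and concurrency control. */
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>

struct partition {
    uint8_t *base;                   /* this partition's private region */
    size_t   size, used;             /* bump allocation, for brevity */
};

static int part_init(struct partition *p, size_t size)
{
    p->base = malloc(size);
    p->size = size;
    p->used = 0;
    return p->base ? 0 : -1;
}

/* Variable-size allocation from one partition; a fixed-size block
 * cache would call this with a constant size. */
static void *part_alloc(struct partition *p, size_t n)
{
    n = (n + 7) & ~(size_t)7;        /* 8-byte alignment */
    if (p->used + n > p->size)
        return NULL;                 /* isolation: never spill elsewhere */
    void *r = p->base + p->used;
    p->used += n;
    return r;
}

int main(void)
{
    struct partition vm0, vm1;       /* e.g. one partition per VM */
    if (part_init(&vm0, 1 << 20) || part_init(&vm1, 1 << 20))
        return 1;
    void *blk  = part_alloc(&vm0, 4096);  /* fixed-size block */
    void *item = part_alloc(&vm1, 123);   /* variable-size item */
    printf("vm0 block=%p vm1 item=%p\n", blk, item);
    free(vm0.base);
    free(vm1.base);
    return 0;
}
```
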
SLIDE 21

(4) Networked I/O Stack

  • Host overhead for network processing is significant
  • We would like to push the limits for networked I/O
  • Related: TCP/IP overhead at 10 GigE, xATA over Ethernet
  • Use spatial parallelism in the network
  • Multiple 10 GBit/s controllers
  • Total 80 GBit/s bi-dir over Ethernet
  • Treat as a transparent link between target and initiator
  • Storage protocols are not arbitrary (see the message sketch below)
  • Request/response
  • Fixed-size buffers
  • How well can we do?

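Because storage traffic is request/response over fixed-size buffers, the wire format can be two flat structures: both sides can pre-post receive buffers and stripe consecutive requests across multiple NICs without generic-protocol framing or reassembly. The layout below is an assumed illustration, not the project's actual protocol.

```c
/* Fixed-size request/response messages for a storage transport.
 * Field names and the 4 KB payload are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define IO_DATA_SIZE 4096            /* fixed payload: one block */

enum io_op { IO_READ = 0, IO_WRITE = 1 };

struct io_request {                  /* initiator -> target */
    uint64_t tag;                    /* matches response to request */
    uint64_t lba;
    uint32_t op;                     /* enum io_op */
    uint8_t  data[IO_DATA_SIZE];     /* used only for writes */
};

struct io_response {                 /* target -> initiator */
    uint64_t tag;
    int32_t  status;                 /* 0 on success */
    uint8_t  data[IO_DATA_SIZE];     /* used only for reads */
};

int main(void)
{
    /* Fixed sizes are what allow pre-posted buffers per link. */
    printf("request=%zu bytes, response=%zu bytes\n",
           sizeof(struct io_request), sizeof(struct io_response));
    return 0;
}
```
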
SLIDE 22

Results and Outlook

  • Base net protocol design and implementation
  • Preliminary numbers (latest)
  • Over 4.5 GBytes/s
  • Writes, 4x 10GigE NICs
  • Read is about 2 GBytes/s
  • Over 160K IOPs
  • Insight: using a traditional, generic comm protocol induces overheads
  • Able to design a comm protocol that benefits from storage-specific semantics
  • Target vs. initiator
  • I/O semantics not simple
  • Buffer management happens high up in the stack
  • Initiator less important (?)
  • Results very encouraging

SLIDE 23

IOLanes

[Figure: virtualized I/O stack with workloads (TPC-W, SPECjAppServ, RUBiS, Linear Road streaming, Tariff Advisor, TPC-H, TPC-C, PostgreSQL, replication) in guest user space, over the guest OS kernel VFS + FS, block devices with virtio/QEMU and Split-X, KVM, and the host OS VFS + FS with an on/off-load module above the SCSI layers, HW device drivers, PCI driver, and storage/network controllers.]

  • Overall, data-intensive applications are increasing
  • Distributed, data-center type applications
  • I/O subsystem an important building block
  • Main challenges
  • (1) Performance and scalability
  • (2) Extensibility and effort
  • Today
  • Few disks per cpu/core (e.g. two)
  • Any new feature or adaptation in the stack is remarkably complex
  • IOLanes
  • (1) Identify bottlenecks
  • (2) Build a better stack
  • (3) Allow for easier extensibility

SLIDE 24

Specific Challenges

  • Scaling the I/O stack across all system layers on multicore CPUs
  • Interaction of the I/O paths of multiple isolated virtual machines
  • Use cycles offered by multicores to offer more "machinery" and optimize online
  • Evaluation with realistic workloads
  • Full-stack monitoring and analysis

SLIDE 25

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Parallel I/O (SCALUS)
  • Abstractions for modern applications (CumuloNimbo)
  • Remarks

SLIDE 26

Dimension Infrastructure Properly

[Figure, as on slide 8: 1000s of application servers (different flavors of PCs/blades; multicores + memory + I/O xput) connected over a high-speed interconnect (10-40 Gbit/s) to 100s of file servers; each file server has disk controllers (~2 GB/s, networked at 10-100 Gbit/s) driving 12-36 SATA disks/node (100 MBy/s, ~2 TBytes each) plus a ~10% SSD cache.]

  • Dimensioning issues are not straightforward today
  • I/O application overheads not understood
  • Do you balance thin or fat?
  • Other factors besides performance: power

SLIDE 27

Scaling Beyond a Single Node Requires

  • Namespace management
  • Distributed recovery, mostly for metadata
  • Distributed DRAM caching, at the client side
  • Understanding scaling overheads (efficiency)

SLIDE 28

Namespace Management

  • Need to go from (filename, offset) to (node, device, object block) (see the sketch below)
  • This requires translation metadata
  • Metadata cannot be co-located with file/object data, if we need to scale single-file performance
  • This requires distributed lookup
  • Also, updates can be complicated
  • Would be interesting to separate from the rest of data storage

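The translation in the first bullet can be pictured as a pure function from (filename, offset) to (node, device, object block). The sketch below hashes per stripe, so consecutive stripes of one file land on different nodes, which is exactly what allows single-file performance to scale. Stripe size, node and device counts, and the hash are illustrative assumptions.

```c
/* (filename, offset) -> (node, device, object block) via per-stripe
 * hashing. Illustrative sketch: geometry and hash are assumptions,
 * and real systems also need the metadata/update path the slide
 * mentions, not just this lookup. */
#include <stdint.h>
#include <stdio.h>

#define STRIPE_SIZE   (1 << 20)      /* 1 MB stripes */
#define NUM_NODES     8
#define DEVS_PER_NODE 4

struct location { int node, device; uint64_t block; };

static uint64_t fnv1a(const char *s)  /* tiny string hash */
{
    uint64_t h = 1469598103934665603ull;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ull; }
    return h;
}

static struct location translate(const char *filename, uint64_t offset)
{
    uint64_t stripe = offset / STRIPE_SIZE;
    uint64_t h = fnv1a(filename) + stripe;   /* spread stripes around */
    struct location loc = {
        .node   = (int)(h % NUM_NODES),
        .device = (int)((h / NUM_NODES) % DEVS_PER_NODE),
        .block  = stripe,            /* object-relative block */
    };
    return loc;
}

int main(void)
{
    struct location l = translate("/data/file1", 5 * (uint64_t)STRIPE_SIZE);
    printf("node %d, device %d, block %llu\n",
           l.node, l.device, (unsigned long long)l.block);
    return 0;
}
```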

SLIDE 29

Distributed Recovery

  • Single-node recovery is not enough when data is spread out
  • Some layer will need to do it
  • Part of the storage system or the application middleware
  • It probably means that storage nodes and application nodes will need to be separate tiers
  • Fewer storage nodes and more application nodes
  • Recovery protocol will (hopefully) only involve storage nodes
  • Some form of transactional API to storage seems right (sketched below)
  • Not simply read/write any more
  • Versioning vs. logging approaches
  • Will involve some agreement protocol for all nodes involved in an operation, due to striping, replication, metadata/data, etc.
  • New mechanism for the common path
  • Much more complicated compared to traditional systems
  • Either centralized controllers or centralized metadata servers

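What "not simply read/write any more" could look like, in the versioning style mentioned above: writes build a shadow version, and commit publishes it atomically. The single-node mock below is only an illustration; in the distributed case, commit would first run the agreement protocol across all storage nodes touched by striping, replication, and metadata/data placement.

```c
/* Versioning-style transactional storage API, single-node mock.
 * Names and the two-version scheme are illustrative assumptions;
 * a real store would version at finer grain and persist the log. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define OBJ_BLOCKS 16
#define BLOCK_SIZE 512

static uint8_t versions[2][OBJ_BLOCKS][BLOCK_SIZE];
static int current;                  /* index of the visible version */

typedef int txn_t;

static txn_t txn_begin(void)
{   /* shadow-copy the current version */
    int shadow = 1 - current;
    memcpy(versions[shadow], versions[current], sizeof(versions[0]));
    return shadow;
}

static void txn_write(txn_t t, int blk, const void *buf, int len)
{   /* invisible until commit */
    memcpy(versions[t][blk], buf, len);
}

static void txn_commit(txn_t t)
{   /* atomic publish; the distributed case needs agreement first */
    current = t;
}

int main(void)
{
    txn_t t = txn_begin();
    txn_write(t, 0, "hello", 6);
    txn_commit(t);
    printf("%s\n", (char *)versions[current][0]);
    return 0;
}
```
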
SLIDE 30

Distributed DRAM Caching

  • Traditionally, a cache exists as close to the application node as possible
  • In the file client
  • This is problematic
  • For recovery
  • For scaling to many application nodes
  • Two possibilities
  • (1) Do client-side caching but avoid write-back
  • (2) Do not do client-side caching; use a single-object-owner approach at the next (storage) tier
  • Both seem good approaches
  • (1) relies on a "smarter" I/O path
  • (2) relies on "smarter/faster" networks between the application/file client and the storage node

SLIDE 31

Efficiency: Ultimately it is all about power

  • Today, people do not pay much attention to the cost of scaling
  • Goal is to scale performance
  • Experimental setups with 1-2 disks per node and many nodes for scaling I/O are common
  • This is very poor efficiency (CPU-to-disk ratio; consider power)
  • How much are you willing to pay for scaling?
  • Start from a base, optimized I/O stack like the one I have described
  • If we can scale and each I/O subsystem operates at its best rate, we are fine
  • Essentially, the cost of scaling from one to many nodes should not be too high (ideally not even visible)
  • This is not true today, by far...
  • Ultimately, power will force everyone to look into this
  • Or only a few applications will be able to pay for it
  • Analogy: SANs today work, but they cost

SLIDE 32

"Machinery" for distribution

  • All the previous mechanisms require "machinery" that is expensive
  • We need to come up with distributed I/O approaches that do all processing more efficiently
  • We have (or can assume) a lot of concurrency, so there is always work
  • This is more about being asynchronous all the time and using DRAM as a buffer, so as not to starve any other resource
  • Design systems that wait only when I/O xput is exhausted
  • No application should be I/O bound!
  • ...with high-throughput devices and system interconnects in modern and future systems
  • Efficiency will matter at some point
  • Even for apps that are able to scale and achieve their perf goals
  • We need to understand
  • Mechanisms required for scaling and their overheads
  • Who should do what in the distributed I/O path
  • Different appl domains will resolve tradeoffs in overheads, semantics

SLIDE 33

Where Should Each Op Go in the I/O Path?

  • (1) Everything in the file system (most prevalent today)
  • Has to be provided by every filesystem
  • The world will have many filesystems
  • Some problems, e.g. consistent client caching, are inherently difficult (not scalable)
  • Try using GPFS (not to mention extending it...)
  • (2) Why not be closer to traditional SAN/NAS?
  • Let's do reliability and availability as SAN
  • File operations and scaling as NAS
  • Requires distributed block-level consistency and atomicity
  • ...at the infrastructure level (kernel, firmware, ...)
  • Not clear this is the way to go...
  • (3) Other alternatives? Who knows...

[Figure: I/O path design and implementation. File servers expose NAS (NFS/CIFS) with an FS layer over a block I/O stack, connected over the network to storage nodes running a block-level stack.]

SLIDE 34

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Parallel I/O (SCALUS)
  • Abstractions for modern applications (CumuloNimbo)
  • Summary

SLIDE 35

CumuloNimbo Global Architecture

[Figure, as on slide 5: JEE application server (JBoss + Hibernate) over an object cache (CumuloCache), query engine (Derby), column-oriented data store and block cache (HBase), and distributed file system (HDFS); transaction management (commit sequencers, concurrency controllers, loggers), elasticity management (self-provisioner, load balancers, monitors), and communication and storage run alongside.]

SLIDE 36

State of the Art

  • Key-value data stores are gaining significance
  • Supporting arbitrary variable-size keys and values
  • Distributed key-value stores used increasingly
  • HBase is a component of the CumuloNimbo architecture
  • Also, other s/w stacks are built on top of key-value stores
  • To access persistent storage, such systems are built today on top of traditional file systems
  • However, the semantics of the underlying system differ in fundamental ways

SLIDE 37

Key-value Store vs. FS Mismatch

  • Hard to map mutable, variable-size keys/values to files
  • Key-based indexing vs. offset-based indexing, in the presence of variable-size values
  • Data placement on local/networked storage devices cannot take advantage of the semantics of key/value stores
  • Information that has been provided by the application is thrown away during mapping to flat files
  • Local file systems offer limited recovery/availability guarantees
  • Last-write recovery is expensive; no data consistency guarantees
  • Significant performance overheads and scalability limitations
  • When scaling to large amounts of storage and high rates

SLIDE 38

Our Goal

  • Raise the abstraction of traditional locally managed persistent storage using a native key-value API (see the interface sketch below)
  • Support mutable, variable-length items; important for workloads that incur frequent updates
  • Perform all operations required (packing, cleanup) for dealing with variable-size items over fixed block-size persistent devices
  • Optimize device use based on the importance of data items
  • Ensure consistency of the data store after a failure, based on configurable workload requirements
  • Use tunable data replication for availability purposes
  • Separate distributed aspects from efficiency at the local level
  • Synergies can be important for performance, e.g. the recovery mechanism

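An interface sketch of such a native key-value API over raw block storage, reflecting the goals above (mutable variable-length items, packing and cleanup handled internally, tunable durability and replication). Declarations only; every signature is an assumption for illustration, not the CumuloNimbo storage layer's actual API.

```c
/* Hypothetical native key-value interface over a raw block device
 * (no filesystem beneath). All names and signatures are illustrative. */
#include <stddef.h>

struct kv_store;                     /* opaque: owns device, index, log */

/* Open a store directly on a block device. */
struct kv_store *kv_open(const char *blkdev, int create);

/* Mutable variable-length items: an update may grow or shrink the
 * value; packing into fixed-size device blocks happens inside. */
int kv_put(struct kv_store *s, const void *key, size_t klen,
           const void *val, size_t vlen);
int kv_get(struct kv_store *s, const void *key, size_t klen,
           void *val, size_t *vlen);
int kv_del(struct kv_store *s, const void *key, size_t klen);

/* Consistency after failure tuned per workload, as in the bullets
 * above: e.g. make operations durable every N ops or every T ms. */
int kv_set_sync_policy(struct kv_store *s, unsigned every_n_ops,
                       unsigned every_ms);

/* Tunable replication for availability. */
int kv_set_replication(struct kv_store *s, unsigned n_replicas);

void kv_close(struct kv_store *s);
```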

SLIDE 39

Storage Layer Architecture

SLIDE 40

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
  • Remarks

SLIDE 41

The role of persistent I/O

  • Required to keep user data
  • Data generated and used at different times (and over long periods)
  • Tolerate failures
  • Persistence of control information (metadata)
  • Both emerge as problems

SLIDE 42

Data

  • Many applications today in data centers require large amounts of data
  • "Waste" in today's architectures
  • Getting data from persistent devices to memory
  • Requires complex namespace operations, which lead to significant resource utilization
  • Contrast this to memory accesses, which are simpler in nature
  • Systems have been built to tolerate high response times
  • Results in more work per I/O
  • Virtualization introduces significant overheads for I/O
  • But it is important for isolation among workloads and environments

SLIDE 43

Metadata

  • Examples
  • In a filesystem: inodes and dentries
  • In a tuplestore: hash tables and B-trees for indexing
  • At block level (e.g. FTL): logical-to-physical (re)mapping tables
  • Equally important to data
  • In some cases even more so
  • Many systems can afford to be sloppy about data, but not metadata
  • Footprint
  • Metadata needs to be kept in memory for performance purposes
  • Sophisticated (and application-specific) caching techniques
  • Otherwise the number of I/Os per user I/O increases dramatically
  • Persistence
  • Remaining consistent across failures is of paramount importance
  • But DRAM is not persistent => complex write management techniques
  • Many system, middleware, and application layers need to handle metadata, multiplying these inefficiencies

SLIDE 44

Today

  • Persistence is "heavy" due to device/controller technology
  • Persistence not designed with multicores in mind
  • Persistence inefficient when scaling across nodes
  • Persistence incurs overheads in multiple layers

SLIDE 45

What can we do?

  • Persistent I/O should "get closer" to the CPU
  • Namespace issues should be simpler
  • Transfers between persistent and non-persistent stages of memory should be more efficient
  • Role of access granularity
  • Architectures should better support persistence for metadata
  • Treating data and metadata the same is a very inefficient simplification
  • Understand overheads and scaling characteristics on modern systems
  • How many cycles of processing per I/O does a data-centric application need?

SLIDE 46

Summary

  • (1) Memory hierarchy work to bring persistence closer to the CPU
  • Profound changes: impact all layers
  • Achieving efficiency with device technology
  • (2) I/O path evolution to scale with # cores
  • Current systems not designed with this in mind
  • As cores increase, base I/O performance does not scale
  • Virtualization overheads/contention exacerbate this
  • Energy proportionality
  • (3) Persistent I/O needs to scale efficiently with # nodes
  • Extensive additional "machinery" today at the system and middleware level to achieve scaling => incurs high overhead and impacts efficiency
  • E.g. heartbeats and replication are not compatible with energy efficiency

SLIDE 47

Acknowledgements

  • People
  • Shoaib Akram
  • Konstantinos Chassapis
  • Michail Flouris
  • Markos Foundoulakis
  • Dhiraj Gulati
  • Yiannis Klonatos
  • Kostas Magoutis
  • Thanos Makatos
  • Manolis Marazakis
  • Stelios Mavridis
  • Zoe Sebepou
  • Many partners and colleagues
  • Funding agencies
  • EC: SIVSS, SCALUS, IOLANES, CumuloNimbo, STREAM, HiPEAC
  • GSRT: national research office