SLIDE 1

PERSISTENT I/O CHALLENGES & APPROACHES

Angelos Bilas, FORTH, bilas@ics.forth.gr
17-June-2011, TERENA TF on Storage

SLIDE 2

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
  • Remarks

SLIDE 3

Application Stacks

  • STREAM
  • CumuloNimbo

SLIDE 4

Stream Global Architecture Picture

[Figure: the STREAM architecture. Applications (credit card fraud detection, SLA compliance and violation detection, telephony fraud detection, COIs, aggregation queries, fraud profiles, monitoring) run over StreamCloud and StreamMine (parallel stream operators, parallel DB operators, stream MapReduce operators; fault tolerance, self-provisioning, MapReduce state machine, dynamic graphs), on top of a communication and storage layer (compressed SSD queue, mem-to-mem communication, persistent streaming, silent error detection).]

SLIDE 5

CumuloNimbo Global Architecture

[Figure: JEE application server (JBoss + Hibernate) over an object cache (CumuloCache), query engine (Derby), column-oriented data store and block cache (HBase), and distributed file system (HDFS); transaction management (commit sequencers, concurrency controllers, loggers), elasticity management (self-provisioner, load balancers, monitors), and communication and storage run alongside.]

SLIDE 6

Application Stacks

  • They tend to be complex
  • Each layer adds substantial protocol "machinery"
  • E.g. transactions, global name space
  • Today I/O is a significant bottleneck
  • Hard to know what all layers do
  • Questionable what can be modified realistically
  • How can modern storage systems best support these?

SLIDE 7

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
  • Remarks

SLIDE 8

Dimension Infrastructure Properly

[Figure: 1000s of application servers (different flavors of PCs/blades; multicores + memory + I/O xput) connected over a high-speed interconnect (10-40 Gbit/s) to 100s of file servers; each file server has disk controllers (~2 GB/s, networked at 10-100 Gbit/s) driving 12-36 SATA disks/node (100 MBy/s, ~2 TBytes each) plus a ~10% SSD cache.]

  • Dimensioning issues are not straightforward today (see the back-of-envelope example below)
  • I/O application overheads not understood
  • Do you balance thin or fat?
  • Other factors besides performance: power

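A back-of-envelope example of the balance question, using only the figures in the diagram above (an illustrative sketch, not a measurement): 36 SATA disks × 100 MBy/s ≈ 3.6 GBytes/s of raw disk bandwidth per file server, already above the ~2 GBytes/s the disk controllers can move, while a single 10 Gbit/s link carries only ~1.25 GBytes/s. A "fat" node is therefore controller- or network-bound unless it aggregates several links; a "thin" node with a couple of disks keeps the network busy but strands CPU, memory, and controller capacity, and pays more power per disk served.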

SLIDE 9

Scaling I/O on multicore CPUs

  • Observation
  • As the number of cores increases in modern systems, we are not able to perform more I/O
  • Target: 1M IOPS, 10 GBytes/s
  • Goal
  • Provide a scalable I/O stack (virtualized) over direct and networked storage devices
  • Go over:
  1. Performance and scaling analysis
  2. Hybrid hierarchies to take advantage of potential
  3. Design for memory and synchronization issues
  4. Parallelism in the lower part of the networked I/O stack

SLIDE 10

(1) Performance and Scaling Analysis

[Figure: the virtualized I/O stack, bottom-up: applications and middleware in guest user space; system calls; guest OS kernel VFS + FS and virtual drivers; host OS VFS + FS and block devices; SCSI layers, HW device drivers, PCI driver; PCI Express interconnect; storage, disk, and network controllers.]

  • Bottom-up
  • Controller
  • Actual controller
  • PCI
  • Host drivers
  • Block layer
  • SCSI
  • Block
  • Filesystem
  • xfs (a well-accepted FS)
  • vfs (an integral part of Linux)

SLIDE 11

I/O Controller [Systor'10]

  • (1) A queue protocol over PCI (sketched below)
  • Many parameters and quite complex
  • Requires decisions: tune for high throughput
  • (2) Request translation on the controller
  • Memory management: balance between speed and waste
  • (3) Request issue/completion towards devices
  • Use existing mechanisms but do careful scheduling
  • Prototype comparable to commercial products

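The queue protocol in (1) is at heart a producer/consumer ring shared between host and controller over PCI, with the head/tail bookkeeping shown on the next slide. Below is a minimal single-producer/single-consumer sketch in C; the descriptor layout, ring geometry, and all names are illustrative assumptions, not the Systor'10 prototype's actual interface.

```c
/* Minimal host->controller descriptor ring over a memory-mapped PCI
 * BAR. Illustrative sketch only: descriptor fields and ring size are
 * assumptions, and a real implementation adds doorbells, completion
 * queues, and batching (the tuning decisions the slide mentions). */
#include <stdint.h>
#include <stdbool.h>

#define RING_SIZE 256                    /* power of two */

struct io_desc {                         /* one I/O request */
    uint64_t lba;                        /* starting block address */
    uint32_t len;                        /* length in blocks */
    uint32_t flags;                      /* e.g. read/write */
    uint64_t dma_addr;                   /* host buffer for DMA */
};

struct ring {                            /* lives in the shared BAR */
    volatile uint32_t head;              /* advanced by controller */
    volatile uint32_t tail;              /* advanced by host */
    struct io_desc slot[RING_SIZE];
};

/* Host side: enqueue a request; false when the ring is full. The
 * controller learns the new tail by polling or via a doorbell (PIO)
 * write, which slide 12 shows can hurt concurrent DMA throughput. */
static bool host_submit(struct ring *r, const struct io_desc *d)
{
    uint32_t tail = r->tail;
    if (((tail + 1) & (RING_SIZE - 1)) == r->head)
        return false;                    /* full: head not advanced yet */
    r->slot[tail] = *d;
    __sync_synchronize();                /* descriptor visible before tail */
    r->tail = (tail + 1) & (RING_SIZE - 1);
    return true;
}

/* Controller side: dequeue the next valid element, if any. */
static bool ctrl_fetch(struct ring *r, struct io_desc *out)
{
    uint32_t head = r->head;
    if (head == r->tail)
        return false;                    /* empty */
    *out = r->slot[head];
    __sync_synchronize();
    r->head = (head + 1) & (RING_SIZE - 1);
    return true;
}

int main(void)                           /* smoke test in ordinary memory */
{
    static struct ring r;
    struct io_desc d = { .lba = 0, .len = 8, .flags = 0, .dma_addr = 0 }, out;
    return (host_submit(&r, &d) && ctrl_fetch(&r, &out)) ? 0 : 1;
}
```

The "many parameters" the slide mentions (queue depth, batching, when to signal the other side vs. poll) all live around these two functions, which is why tuning for throughput vs. latency is a real decision.
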
SLIDE 12

Results and Outlook

[Figures: DMA throughput (host-to-HBA and HBA-to-host, MB/sec vs. transfer size in KB); impact of host-issued PIO on DMA throughput (PIO off/on; 2-way, to-host, from-host); queue diagram with head ("valid queue element") and tail ("valid element to dequeue") indices on host and controller sides, the controller initiating DMA over the PCIe interconnect.]

  • Controller needs to know the tail at the Host
  • Host needs to know the head at the Controller
  • xput: each controller can achieve 2 GBytes/s bi-dir
  • IOPs: each controller can achieve ~80K IOPs
  • 50K for commercial controllers with full I/O processing
  • Controller CPU is an important limitation
  • Outlook
  • (1) Scale throughput and IOPs by using multiple controllers
  • (2) I/O controllers should be fused with the host CPU

SLIDE 13

Block Layer

  • I/O request protocol translation, e.g. SCSI
  • Buffer management and placement
  • Other layers involved; essentially a block-type operation
  • Modern architecture trends create significant problems

SLIDE 14

Results and Outlook

[Figures: sequential I/O throughput (MB/s, seq. reads and writes) and random I/O operations (read and write IOPS) vs. number of controllers (1-4) and number of benchmark instances; placement series labelled TLOR0, TROR0, TLORPRIL, TLORPLIL.]

  • Translation processing scales with the number of cores
  • Both throughput and IOPs
  • I/O translation incurs overhead
  • Affinity is an important problem (see the pinning sketch below)
  • Wrong placement can reduce throughput almost to half

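Given that wrong placement can halve throughput, the practical remedy is to pin I/O work near the controller it serves. A minimal user-level sketch with pthread_setaffinity_np follows; the core-to-controller mapping is a hypothetical topology, where a real system would derive it from the PCI/NUMA layout (e.g. the controller's local_cpulist in sysfs).

```c
/* Sketch: pin each I/O worker to a core near "its" controller to
 * avoid the cross-placement penalty shown above. The topology here
 * (controller 0 near cores 0-3, controller 1 near cores 4-7) is a
 * hypothetical assumption for illustration. Linux-specific. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *io_worker(void *arg)
{
    int core = *(int *)arg;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    /* ... submit/complete I/O against the controller local to 'core' ... */
    printf("worker pinned to core %d\n", core);
    return NULL;
}

int main(void)
{
    int cores[2] = { 0, 4 };         /* one worker per controller */
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, io_worker, &cores[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```
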
SLIDE 15

Filesystem

  • Complex layer
  • Many complain about FS performance on multicores
  • Translates from a (request, file, offset, size) API to a (request, block#) API
  • Responsible for recovery (first layer to include extensive metadata in traditional systems)
  • We include VFS in our analysis (additional complexity)
  • Detailed analysis with extensive modifications to the kernel
  • Required non-trivial instrumentation to measure lock and wait times (see the sketch below)
  • Extensive tuning to ensure that we measure "meaningful" cases

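The lock- and wait-time instrumentation was done inside the kernel; the idea itself is simple enough to show with a user-level analogue (timestamp before and after acquiring a lock, accumulate the difference). The sketch below only illustrates the measurement, and the lock name is made up.

```c
/* Measure time spent waiting on a contended lock, as opposed to time
 * spent holding it. Illustrative user-level analogue of the in-kernel
 * instrumentation described above; 'journal_lock' is a made-up name. */
#include <pthread.h>
#include <time.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t journal_lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t wait_ns;             /* per-thread in real use */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

static void locked_update(void)
{
    uint64_t t0 = now_ns();
    pthread_mutex_lock(&journal_lock);
    wait_ns += now_ns() - t0;        /* waiting, not holding */
    /* ... update journal / dentry state ... */
    pthread_mutex_unlock(&journal_lock);
}

int main(void)
{
    locked_update();
    printf("lock wait: %llu ns\n", (unsigned long long)wait_ns);
    return 0;
}
```
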
SLIDE 16

Results and Outlook

[Figures: kfsmark CPU breakdown (1 MB files, 64 app threads; IO-WAIT / USER / SYSTEM / INTERRUPT / IDLE) and CREAT / READ throughput (thousands of ops/sec, one log per process) vs. number of CPUs (1-16).]

  • Most FS operations do not scale with # of cores
  • Two main scaling problems
  • (1) vfs locking
  • vfs uses a structure for maintaining directory entry and inode information (dentry and inode caches)
  • Synchronization over the dentry cache is problematic due to vfs design
  • (2) FS journaling
  • All modern FSs need to worry about recovery
  • Most use a journaling scheme that is integrated with the lookup/update path
  • Synchronization over this journal is hindering scaling
  • Outlook
  • There is significant potential from both (1) and (2)
  • (1) is being discussed and (a) people are working on it, (b) there is potential to bypass
  • (2) is more fundamental: our goal is to target this

SLIDE 17

Summary of Analysis

  • (1) Fundamentally, I/O performance should scale
  • (2) Controller: use spatial parallelism and go with technology trends
  • (3) Block: worry about placement and affinity problems
  • (4) FS: worry about synchronization at specific points
  • Both (3) and (4) are due to current trends in multicores
  • Not broadly known problems yet

SLIDE 18

(2) Hybrid Device Hierarchies

  • To take advantage of this potential
  • Need hybrid device hierarchies using disks and SSDs
  • Otherwise, raw performance will not be adequate
  • [FlashCache'06, BPLRU'08, ...]
  • Designed and evaluated such a base hierarchy (a minimal sketch follows below)
  • Significant improvement
  • Over disks only
  • Over disks + SSDs, due to our policies

                            HDD (WD5001AALS-00L3B2)   SSD (Intel X25-E)
    Price/capacity ($/GB)   $0.3                      $3
    Response time (ms)      12.6                      0.17
    Throughput R/W (MB/s)   100/90                    277/202
    IOPS (R/W)              150/150                   30,000/3,500

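At its simplest, such a base hierarchy is a block-level lookup table in front of the HDD: hits are served at SSD latency (0.17 ms vs. 12.6 ms in the table above), misses go to disk and may be installed in the cache. The direct-mapped sketch below is a deliberately minimal illustration and omits the policies credited above with the additional gains.

```c
/* Minimal block-level SSD cache in front of an HDD: a direct-mapped
 * table from HDD block number to SSD slot. Geometry and names are
 * illustrative assumptions; real designs add associativity, dirty
 * tracking, and admission/eviction policies. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define CACHE_SLOTS 4096             /* SSD cache size in blocks */

struct slot { uint64_t blkno; bool valid; };
static struct slot map[CACHE_SLOTS]; /* slot = blkno % CACHE_SLOTS */

/* SSD slot to read on a hit, or -1 on a miss (caller reads the HDD
 * and may then install the block). */
static int cache_lookup(uint64_t blkno)
{
    struct slot *s = &map[blkno % CACHE_SLOTS];
    return (s->valid && s->blkno == blkno) ? (int)(blkno % CACHE_SLOTS) : -1;
}

static void cache_install(uint64_t blkno)
{
    struct slot *s = &map[blkno % CACHE_SLOTS];
    s->blkno = blkno;                /* silently evicts the old mapping */
    s->valid = true;
}

int main(void)
{
    cache_install(42);
    printf("lookup(42)=%d lookup(7)=%d\n", cache_lookup(42), cache_lookup(7));
    return 0;
}
```
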
SLIDE 19

Summary [EuroSys'10, NAS'11]: Compressed SSD I/O Caching

  • Transparent SSD caching is promising for improving performance
  • Improve SSD caching efficiency using online compression (see the sketch below)
  • Trade (cheap) CPU cycles for (expensive) I/O performance
  • Address challenges in online block-level compression for SSDs
  • Our techniques mitigate CPU and additional I/O overheads
  • Results in increased performance with realistic workloads
  • TPC-H up to 99%, PostMark up to 20%, SPECsfs2008 up to 11%
  • Cache hit ratio improves between 22%-145%
  • CPU utilization increases by up to 4.5x
  • Low-concurrency, small-I/O workloads are problematic
  • Overall our approach is worthwhile, but adds complexity...
  • Future work
  • Power-performance implications interesting; hardware off-loading
  • Improving compression efficiency by grouping similar blocks

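The heart of the CPU-for-I/O trade is compressing each block before it enters the SSD cache, so the same device capacity holds more cached blocks (hence the 22%-145% hit-ratio improvement). A minimal sketch using zlib follows (link with -lz); the 4 KB block size and the "store raw if incompressible" fallback are assumptions for illustration, not the published design.

```c
/* Compress a block before caching it on the SSD; store it raw when
 * compression does not help. Illustrative sketch of the technique,
 * not the EuroSys'10 implementation. Link with -lz. */
#include <zlib.h>
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096

/* Returns the number of bytes to write to the SSD cache slot and
 * sets *compressed accordingly. */
static unsigned long maybe_compress(const unsigned char *block,
                                    unsigned char *out, int *compressed)
{
    unsigned long out_len = BLOCK_SIZE + 64; /* >= compressBound(4096) */
    if (compress(out, &out_len, block, BLOCK_SIZE) == Z_OK &&
        out_len < BLOCK_SIZE) {
        *compressed = 1;
        return out_len;
    }
    *compressed = 0;                 /* incompressible: avoid negative gain */
    memcpy(out, block, BLOCK_SIZE);
    return BLOCK_SIZE;
}

int main(void)
{
    unsigned char block[BLOCK_SIZE] = {0};   /* trivially compressible */
    unsigned char out[BLOCK_SIZE + 64];
    int c;
    unsigned long n = maybe_compress(block, out, &c);
    printf("stored %lu bytes (%s)\n", n, c ? "compressed" : "raw");
    return 0;
}
```
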
SLIDE 20

(3) Buffer Mgmt and Recovery Issues

  • Revisit
  • Buffer mgmt in DRAM required to stage/cache I/Os
  • Recovery required due to volatility of DRAM
  • Both fundamental and related to the system I/O architecture
  • We design a new DRAM buffer+cache mechanism (sketched below)
  • (1) Allow isolation and partitioning
  • (2) Allow control over placement
  • (3) Deal with both fixed and variable size items
  • Similar techniques recently used for other structures in the kernel [OSDI'10]
  • Use it with a kernel-level FS that is stateless

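A rough illustration of properties (1)-(3): give each client (e.g. each VM or workload) its own DRAM partition, serve both fixed-size blocks and variable-size items from it, and fail allocations locally instead of stealing from other partitions. The bump-allocator sketch below is purely illustrative and not the actual mechanism.

```c
/* Partitioned DRAM buffer pool: isolation (per-partition regions),
 * placement control (caller picks the partition), fixed- and
 * variable-size allocations. Illustrative sketch only; a real
 * mechanism needs freeing, eviction, and concurrency control. */
#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>

struct partition {
    uint8_t *base;                   /* this partition's private region */
    size_t   size, used;             /* bump allocation, for brevity */
};

static int part_init(struct partition *p, size_t size)
{
    p->base = malloc(size);
    p->size = size;
    p->used = 0;
    return p->base ? 0 : -1;
}

/* Variable-size allocation from one partition; a fixed-size block
 * cache would call this with a constant size. */
static void *part_alloc(struct partition *p, size_t n)
{
    n = (n + 7) & ~(size_t)7;        /* 8-byte alignment */
    if (p->used + n > p->size)
        return NULL;                 /* isolation: never spill elsewhere */
    void *r = p->base + p->used;
    p->used += n;
    return r;
}

int main(void)
{
    struct partition vm0, vm1;       /* e.g. one partition per VM */
    if (part_init(&vm0, 1 << 20) || part_init(&vm1, 1 << 20))
        return 1;
    void *blk  = part_alloc(&vm0, 4096);  /* fixed-size block */
    void *item = part_alloc(&vm1, 123);   /* variable-size item */
    printf("vm0 block=%p vm1 item=%p\n", blk, item);
    free(vm0.base);
    free(vm1.base);
    return 0;
}
```
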
SLIDE 21

(4) Networked I/O Stack

  • Host overhead for network processing is significant
  • We would like to push the limits for networked I/O
  • Related: TCP/IP overhead at 10 GigE, xATA over Ethernet
  • Use spatial parallelism in the network
  • Multiple 10 GBit/s controllers
  • Total 80 GBit/s bi-dir over Ethernet
  • Treat as a transparent link between target and initiator
  • Storage protocols are not arbitrary (see the message sketch below)
  • Request/response
  • Fixed-size buffers
  • How well can we do?

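Because storage traffic is request/response over fixed-size buffers, the wire format can be two flat structures: both sides can pre-post receive buffers and stripe consecutive requests across multiple NICs without generic-protocol framing or reassembly. The layout below is an assumed illustration, not the project's actual protocol.

```c
/* Fixed-size request/response messages for a storage transport.
 * Field names and the 4 KB payload are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define IO_DATA_SIZE 4096            /* fixed payload: one block */

enum io_op { IO_READ = 0, IO_WRITE = 1 };

struct io_request {                  /* initiator -> target */
    uint64_t tag;                    /* matches response to request */
    uint64_t lba;
    uint32_t op;                     /* enum io_op */
    uint8_t  data[IO_DATA_SIZE];     /* used only for writes */
};

struct io_response {                 /* target -> initiator */
    uint64_t tag;
    int32_t  status;                 /* 0 on success */
    uint8_t  data[IO_DATA_SIZE];     /* used only for reads */
};

int main(void)
{
    /* Fixed sizes are what allow pre-posted buffers per link. */
    printf("request=%zu bytes, response=%zu bytes\n",
           sizeof(struct io_request), sizeof(struct io_response));
    return 0;
}
```
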
SLIDE 22

Results and Outlook

  • Base net protocol design and implementation
  • Preliminary numbers (latest)
  • Over 4.5 GBytes/s
  • Writes, 4x 10GigE NICs
  • Read is about 2 GBytes/s
  • Over 160K IOPs
  • Insight: using a traditional, generic comm protocol induces overheads
  • Able to design a comm protocol that benefits from storage-specific semantics
  • Target vs. initiator
  • I/O semantics not simple
  • Buffer management happens high up in the stack
  • Initiator less important (?)
  • Results very encouraging

SLIDE 23

IOLanes

[Figure: virtualized I/O stack with workloads (TPC-W, SPECjAppServ, RUBiS, Linear Road streaming, Tariff Advisor, TPC-H, TPC-C, PostgreSQL, replication) in guest user space, over the guest OS kernel VFS + FS, block devices with virtio/QEMU and Split-X, KVM, and the host OS VFS + FS with an on/off-load module above the SCSI layers, HW device drivers, PCI driver, and storage/network controllers.]

  • Overall, data-intensive applications are increasing
  • Distributed, data-center type applications
  • I/O subsystem an important building block
  • Main challenges
  • (1) Performance and scalability
  • (2) Extensibility and effort
  • Today
  • Few disks per cpu/core (e.g. two)
  • Any new feature or adaptation in the stack is remarkably complex
  • IOLanes
  • (1) Identify bottlenecks
  • (2) Build a better stack
  • (3) Allow for easier extensibility

SLIDE 24

Specific Challenges

  • Scaling the I/O stack across all system layers on multicore CPUs
  • Interaction of the I/O paths of multiple isolated virtual machines
  • Use cycles offered by multicores to offer more "machinery" and optimize online
  • Evaluation with realistic workloads
  • Full-stack monitoring and analysis

SLIDE 25

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Parallel I/O (SCALUS)
  • Abstractions for modern applications (CumuloNimbo)
  • Remarks

SLIDE 26

Dimension Infrastructure Properly

[Figure, as on slide 8: 1000s of application servers (different flavors of PCs/blades; multicores + memory + I/O xput) connected over a high-speed interconnect (10-40 Gbit/s) to 100s of file servers; each file server has disk controllers (~2 GB/s, networked at 10-100 Gbit/s) driving 12-36 SATA disks/node (100 MBy/s, ~2 TBytes each) plus a ~10% SSD cache.]

  • Dimensioning issues are not straightforward today
  • I/O application overheads not understood
  • Do you balance thin or fat?
  • Other factors besides performance: power

SLIDE 27

Scaling Beyond a Single Node Requires

  • Namespace management
  • Distributed recovery, mostly for metadata
  • Distributed DRAM caching, at the client side
  • Understanding scaling overheads (efficiency)

SLIDE 28

Namespace Management

  • Need to go from (filename, offset) to (node, device, object block) (see the sketch below)
  • This requires translation metadata
  • Metadata cannot be co-located with file/object data, if we need to scale single-file performance
  • This requires distributed lookup
  • Also, updates can be complicated
  • Would be interesting to separate from the rest of data storage

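The translation in the first bullet can be pictured as a pure function from (filename, offset) to (node, device, object block). The sketch below hashes per stripe, so consecutive stripes of one file land on different nodes, which is exactly what allows single-file performance to scale. Stripe size, node and device counts, and the hash are illustrative assumptions.

```c
/* (filename, offset) -> (node, device, object block) via per-stripe
 * hashing. Illustrative sketch: geometry and hash are assumptions,
 * and real systems also need the metadata/update path the slide
 * mentions, not just this lookup. */
#include <stdint.h>
#include <stdio.h>

#define STRIPE_SIZE   (1 << 20)      /* 1 MB stripes */
#define NUM_NODES     8
#define DEVS_PER_NODE 4

struct location { int node, device; uint64_t block; };

static uint64_t fnv1a(const char *s)  /* tiny string hash */
{
    uint64_t h = 1469598103934665603ull;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ull; }
    return h;
}

static struct location translate(const char *filename, uint64_t offset)
{
    uint64_t stripe = offset / STRIPE_SIZE;
    uint64_t h = fnv1a(filename) + stripe;   /* spread stripes around */
    struct location loc = {
        .node   = (int)(h % NUM_NODES),
        .device = (int)((h / NUM_NODES) % DEVS_PER_NODE),
        .block  = stripe,            /* object-relative block */
    };
    return loc;
}

int main(void)
{
    struct location l = translate("/data/file1", 5 * (uint64_t)STRIPE_SIZE);
    printf("node %d, device %d, block %llu\n",
           l.node, l.device, (unsigned long long)l.block);
    return 0;
}
```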

SLIDE 29

Distributed Recovery

  • Single-node recovery is not enough when data is spread out
  • Some layer will need to do it
  • Part of the storage system or the application middleware
  • It probably means that storage nodes and application nodes will need to be separate tiers
  • Fewer storage nodes and more application nodes
  • Recovery protocol will (hopefully) only involve storage nodes
  • Some form of transactional API to storage seems right (sketched below)
  • Not simply read/write any more
  • Versioning vs. logging approaches
  • Will involve some agreement protocol for all nodes involved in an operation, due to striping, replication, metadata/data, etc.
  • New mechanism for the common path
  • Much more complicated compared to traditional systems
  • Either centralized controllers or centralized metadata servers

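What "not simply read/write any more" could look like, in the versioning style mentioned above: writes build a shadow version, and commit publishes it atomically. The single-node mock below is only an illustration; in the distributed case, commit would first run the agreement protocol across all storage nodes touched by striping, replication, and metadata/data placement.

```c
/* Versioning-style transactional storage API, single-node mock.
 * Names and the two-version scheme are illustrative assumptions;
 * a real store would version at finer grain and persist the log. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define OBJ_BLOCKS 16
#define BLOCK_SIZE 512

static uint8_t versions[2][OBJ_BLOCKS][BLOCK_SIZE];
static int current;                  /* index of the visible version */

typedef int txn_t;

static txn_t txn_begin(void)
{   /* shadow-copy the current version */
    int shadow = 1 - current;
    memcpy(versions[shadow], versions[current], sizeof(versions[0]));
    return shadow;
}

static void txn_write(txn_t t, int blk, const void *buf, int len)
{   /* invisible until commit */
    memcpy(versions[t][blk], buf, len);
}

static void txn_commit(txn_t t)
{   /* atomic publish; the distributed case needs agreement first */
    current = t;
}

int main(void)
{
    txn_t t = txn_begin();
    txn_write(t, 0, "hello", 6);
    txn_commit(t);
    printf("%s\n", (char *)versions[current][0]);
    return 0;
}
```
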
SLIDE 30

Distributed DRAM Caching

  • Traditionally, a cache exists as close to the application node as possible
  • In the file client
  • This is problematic
  • For recovery
  • For scaling to many application nodes
  • Two possibilities
  • (1) Do client-side caching but avoid write-back
  • (2) Do not do client-side caching; use a single-object-owner approach at the next (storage) tier
  • Both seem good approaches
  • (1) relies on a "smarter" I/O path
  • (2) relies on "smarter/faster" networks between the application/file client and the storage node

SLIDE 31

Efficiency: Ultimately it is all about power

  • Today, people do not pay much attention to the cost of scaling
  • Goal is to scale performance
  • Experimental setups with 1-2 disks per node and many nodes for scaling I/O are common
  • This is very poor efficiency (CPU-to-disk ratio; consider power)
  • How much are you willing to pay for scaling?
  • Start from a base, optimized I/O stack like the one I have described
  • If we can scale and each I/O subsystem operates at its best rate, we are fine
  • Essentially, the cost of scaling from one to many nodes should not be too high (ideally not even visible)
  • This is not true today, by far...
  • Ultimately, power will force everyone to look into this
  • Or only a few applications will be able to pay for it
  • Analogy: SANs today work, but they cost

SLIDE 32

"Machinery" for distribution

  • All the previous mechanisms require "machinery" that is expensive
  • We need to come up with distributed I/O approaches that do all processing more efficiently
  • We have (or can assume) a lot of concurrency, so there is always work
  • This is more about being asynchronous all the time and using DRAM as a buffer, so as not to starve any other resource
  • Design systems that wait only when I/O xput is exhausted
  • No application should be I/O bound!
  • ...with high-throughput devices and system interconnects in modern and future systems
  • Efficiency will matter at some point
  • Even for apps that are able to scale and achieve their perf goals
  • We need to understand
  • Mechanisms required for scaling and their overheads
  • Who should do what in the distributed I/O path
  • Different appl domains will resolve tradeoffs in overheads, semantics

SLIDE 33

Where Should Each Op Go in the I/O Path?

  • (1) Everything in the file system (most prevalent today)
  • Has to be provided by every filesystem
  • The world will have many filesystems
  • Some problems, e.g. consistent client caching, are inherently difficult (not scalable)
  • Try using GPFS (not to mention extending it...)
  • (2) Why not be closer to traditional SAN/NAS?
  • Let's do reliability and availability as SAN
  • File operations and scaling as NAS
  • Requires distributed block-level consistency and atomicity
  • ...at the infrastructure level (kernel, firmware, ...)
  • Not clear this is the way to go...
  • (3) Other alternatives? Who knows...

[Figure: I/O path design and implementation. File servers expose NAS (NFS/CIFS) with an FS layer over a block I/O stack, connected over the network to storage nodes running a block-level stack.]

SLIDE 34

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Parallel I/O (SCALUS)
  • Abstractions for modern applications (CumuloNimbo)
  • Summary

SLIDE 35

CumuloNimbo Global Architecture

[Figure, as on slide 5: JEE application server (JBoss + Hibernate) over an object cache (CumuloCache), query engine (Derby), column-oriented data store and block cache (HBase), and distributed file system (HDFS); transaction management (commit sequencers, concurrency controllers, loggers), elasticity management (self-provisioner, load balancers, monitors), and communication and storage run alongside.]

SLIDE 36

State of the Art

  • Key-value data stores are gaining significance
  • Supporting arbitrary variable-size keys and values
  • Distributed key-value stores used increasingly
  • HBase is a component of the CumuloNimbo architecture
  • Also, other s/w stacks are built on top of key-value stores
  • To access persistent storage, such systems are built today on top of traditional file systems
  • However, the semantics of the underlying system differ in fundamental ways

SLIDE 37

Key-value Store vs. FS Mismatch

  • Hard to map mutable, variable-size keys/values to files
  • Key-based indexing vs. offset-based indexing, in the presence of variable-size values
  • Data placement on local/networked storage devices cannot take advantage of the semantics of key/value stores
  • Information that has been provided by the application is thrown away during mapping to flat files
  • Local file systems offer limited recovery/availability guarantees
  • Last-write recovery is expensive; no data consistency guarantees
  • Significant performance overheads and scalability limitations
  • When scaling to large amounts of storage and high rates

SLIDE 38

Our Goal

  • Raise the abstraction of traditional locally managed persistent storage using a native key-value API (see the interface sketch below)
  • Support mutable, variable-length items; important for workloads that incur frequent updates
  • Perform all operations required (packing, cleanup) for dealing with variable-size items over fixed block-size persistent devices
  • Optimize device use based on the importance of data items
  • Ensure consistency of the data store after a failure, based on configurable workload requirements
  • Use tunable data replication for availability purposes
  • Separate distributed aspects from efficiency at the local level
  • Synergies can be important for performance, e.g. the recovery mechanism

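An interface sketch of such a native key-value API over raw block storage, reflecting the goals above (mutable variable-length items, packing and cleanup handled internally, tunable durability and replication). Declarations only; every signature is an assumption for illustration, not the CumuloNimbo storage layer's actual API.

```c
/* Hypothetical native key-value interface over a raw block device
 * (no filesystem beneath). All names and signatures are illustrative. */
#include <stddef.h>

struct kv_store;                     /* opaque: owns device, index, log */

/* Open a store directly on a block device. */
struct kv_store *kv_open(const char *blkdev, int create);

/* Mutable variable-length items: an update may grow or shrink the
 * value; packing into fixed-size device blocks happens inside. */
int kv_put(struct kv_store *s, const void *key, size_t klen,
           const void *val, size_t vlen);
int kv_get(struct kv_store *s, const void *key, size_t klen,
           void *val, size_t *vlen);
int kv_del(struct kv_store *s, const void *key, size_t klen);

/* Consistency after failure tuned per workload, as in the bullets
 * above: e.g. make operations durable every N ops or every T ms. */
int kv_set_sync_policy(struct kv_store *s, unsigned every_n_ops,
                       unsigned every_ms);

/* Tunable replication for availability. */
int kv_set_replication(struct kv_store *s, unsigned n_replicas);

void kv_close(struct kv_store *s);
```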

SLIDE 39

Storage Layer Architecture

SLIDE 40

Outline

  • Modern application stacks
  • Stream processing (STREAM)
  • Transaction processing (CumuloNimbo)
  • Storage technologies
  • Storage virtualization and scaling with multicores (IOLanes)
  • Abstractions for modern applications (CumuloNimbo)
  • Parallel I/O (SCALUS)
  • Remarks

SLIDE 41

The role of persistent I/O

  • Required to keep user data
  • Data generated and used at different times (and over long periods)
  • Tolerate failures
  • Persistence of control information (metadata)
  • Both emerge as problems

SLIDE 42

Data

  • Many applications today in data centers require large amounts of data
  • "Waste" in today's architectures
  • Getting data from persistent devices to memory
  • Requires complex namespace operations, which lead to significant resource utilization
  • Contrast this to memory accesses, which are simpler in nature
  • Systems have been built to tolerate high response times
  • Results in more work per I/O
  • Virtualization introduces significant overheads for I/O
  • But it is important for isolation among workloads and environments

SLIDE 43

Metadata

  • Examples
  • In a filesystem: inodes and dentries
  • In a tuplestore: hash tables and B-trees for indexing
  • At block level (e.g. FTL): logical-to-physical (re)mapping tables
  • Equally important to data
  • In some cases even more so
  • Many systems can afford to be sloppy about data, but not metadata
  • Footprint
  • Metadata needs to be kept in memory for performance purposes
  • Sophisticated (and application-specific) caching techniques
  • Otherwise the number of I/Os per user I/O increases dramatically
  • Persistence
  • Remaining consistent across failures is of paramount importance
  • But DRAM is not persistent => complex write management techniques
  • Many system, middleware, and application layers need to handle metadata, multiplying these inefficiencies

SLIDE 44

Today

  • Persistence is "heavy" due to device/controller technology
  • Persistence not designed with multicores in mind
  • Persistence inefficient when scaling across nodes
  • Persistence incurs overheads in multiple layers

SLIDE 45

What can we do?

  • Persistent I/O should "get closer" to the CPU
  • Namespace issues should be simpler
  • Transfers between persistent and non-persistent stages of memory should be more efficient
  • Role of access granularity
  • Architectures should better support persistence for metadata
  • Treating data and metadata the same is a very inefficient simplification
  • Understand overheads and scaling characteristics on modern systems
  • How many cycles of processing per I/O does a data-centric application need?

SLIDE 46

Summary

  • (1) Memory hierarchy work to bring persistence closer to the CPU
  • Profound changes: impact all layers
  • Achieving efficiency with device technology
  • (2) I/O path evolution to scale with # cores
  • Current systems not designed with this in mind
  • As cores increase, base I/O performance does not scale
  • Virtualization overheads/contention exacerbate this
  • Energy proportionality
  • (3) Persistent I/O needs to scale efficiently with # nodes
  • Extensive additional "machinery" today at the system and middleware level to achieve scaling => incurs high overhead and impacts efficiency
  • E.g. heartbeats and replication are not compatible with energy efficiency

SLIDE 47

Acknowledgements

  • People
  • Shoaib Akram
  • Konstantinos Chassapis
  • Michail Flouris
  • Markos Foundoulakis
  • Dhiraj Gulati
  • Yiannis Klonatos
  • Kostas Magoutis
  • Thanos Makatos
  • Manolis Marazakis
  • Stelios Mavridis
  • Zoe Sebepou
  • Many partners and colleagues
  • Funding agencies
  • EC: SIVSS, SCALUS, IOLANES, CumuloNimbo, STREAM, HiPEAC
  • GSRT: national research office