ICDCS 2009
Motivation
- Media servers, scientific data applications
– Write‐once, read‐many workloads
– Large sequential files: media (HD video), scientific data
– Parallel retrieval of sequential I/O streams from disks
- Sequential access: simple & efficient for disks
- Challenge
– Maintain maximum read throughput while scaling to a large number of I/O streams per disk
- Disk capacity increase means fewer spindles per stream
– 2TByte disk holds 440 full‐size DVD movies
Linux I/O Schedulers
[Figure: parallel reading of sequential streams on 1 SATA disk. 1 stream: 60 MB/sec; 256 streams: 10-15 MB/sec]
Traditional Solutions
- Caching & aggressive/static prefetching
- Efficient I/O schedulers
– Anticipatory, Fair‐queuing
- Work well
– Small number of streams
– Prefetching buffers fit in memory
- However
– Various workloads need a large number of streams
– Storage controllers: many disks and limited memory
Other Solutions
- SSDs: expensive & low capacity
– Behavior with high‐performance workloads not well understood
– Used as a prefetching buffer?
- Data placement not a practical solution
– Predict which streams are read together?
– Stream playout short‐lived vs. time to reorganize data
Overview
- Motivation
- Related work & contributions
- Disk & controller‐level prefetching
- Our approach
- Evaluation
- Conclusions
Related Work
- Modeling & optimizing disks
– [Ganger95], [Jacobson & Wilkes 91], [Ruemmler & Wilkes 94], [Shriver 97], [Varki et al. 04], [Zhu & Hu 02]
- I/O performance & scheduling optimizations
– [Bachmat02], [Iyer & Druschel 01], [Kim et al. 06], [Mokbel et al.04], [Shenoy & Vin 98], [Wijayaratne & Reddy 01], [Hsu & Smith 04], [Carrera & Bianchini 02], [Coloma et al. 05], [Yu et al. 06]
- Prefetching
– [Shriver et al. 99], [Cao et al. 95], [Kimbrel & Karlin 00], [Li et al. 07], [Patterson et al. 95], [Ding et al. 07]
- Storage caching (non‐sequential workloads)
– [Chen et al. 03], [Dahlin et al. 94], [Johnson & Shasha 94], [Zhou et al. 02]
- I/O for multimedia applications
– [Chen et al. 94], [Dey‐Sircar et al. 94], [Rangan & Vin 91], [Reddy & Wyllie 94], [Dan et al. 95]
Contributions
- Analysis of the problem
- Solution at the host level
– Up to 4x higher throughput with 100 streams per disk
– Improved disk utilization with limited memory
- Our approach relies on
– Identifying & separating sequential streams
– Buffering & coalescing small requests in host memory
– Notion of a working set for servicing multiple I/O streams
- Validation through
– Disksim simulation and real system experiments
– Multiple disk & controller configurations
I/O Path
- I/O path components that perform caching & queuing
- Caches become smaller towards the bottom of the path
- Disk cache: limited size, divided into fixed segments
Disk‐level Prefetching
- Achieved by
– Increasing application request size
– Increasing disk segment size to prefetch full segments
- Measurements with Disksim and microbenchmarks
- Larger request sizes improve throughput, if there is enough disk cache for all I/O streams
- When number of streams × request size > cache size, throughput degrades dramatically
- Increasing disk cache size and prefetching improves throughput for a large number of streams
- However, disk cache size is fixed by the manufacturer
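The degradation condition above is simple arithmetic; the cache and request sizes below are hypothetical examples for illustration, not measurements from the talk:

```python
def cache_fits(num_streams, request_size, cache_size):
    """Thrashing condition from the slide: prefetched data for all
    streams must fit in the disk cache at the same time."""
    return num_streams * request_size <= cache_size

MB = 2**20
# Hypothetical 16 MB disk cache with 256 KB requests:
assert cache_fits(64, 256 * 1024, 16 * MB)        # 16 MB of prefetch: fits
assert not cache_fits(100, 256 * 1024, 16 * MB)   # 25.6 MB: cache thrashes
```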
Controller‐level Prefetching
- Prefetching at the controller level is effective when there is enough memory for all streams
- Not a solution: one controller may have 4‐16 disks and should handle thousands of streams (needs GBytes of memory)
Host‐level Approach
[Figure: block diagram of the server. Incoming block I/O requests pass through a classifier; non‐sequential requests go directly to the disks, while sequential requests go to a scheduler]
- Block‐level operation, file system agnostic
- System receives block I/O requests
- Classifier detects sequential requests using a bitmap
- Non‐sequential requests sent directly to disks
- Requests in sequential streams sent to scheduler
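As a rough illustration of bitmap‐based detection (a minimal sketch, not the authors' implementation; the block numbering and one‐block lookback are simplifying assumptions):

```python
class SequentialClassifier:
    """Sketch: mark each accessed block in a bitmap; a request whose
    immediately preceding block was already seen is treated as part of
    a sequential stream."""

    def __init__(self, num_blocks):
        self.bitmap = bytearray((num_blocks + 7) // 8)

    def _is_set(self, block):
        return bool(self.bitmap[block // 8] & (1 << (block % 8)))

    def classify(self, block):
        # Sequential if the predecessor block was accessed before.
        sequential = block > 0 and self._is_set(block - 1)
        self.bitmap[block // 8] |= 1 << (block % 8)  # record this access
        return "sequential" if sequential else "non-sequential"
```

For example, two back‐to‐back block reads are classified as one sequential stream, while a random access is sent straight to the disks:

```python
c = SequentialClassifier(1024)
c.classify(10)   # first touch: non-sequential
c.classify(11)   # continues block 10: sequential
```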
Scheduling
[Figure: the scheduler (D, R, N) with a round‐robin policy issues I/O requests to the disks; disk request completions complete the corresponding block I/O requests]
- Dispatch Set (D): the set of streams currently issuing I/O in the scheduler
- Read‐ahead size (R): size of requests actually issued to disks
- Streams remain in D until they have issued N disk requests
- Replacement policy for streams in D: Round‐Robin
- On disk request completion, the scheduler completes the block I/O request
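The (D, R, N) mechanics can be sketched as follows (a hypothetical structure for illustration, not the authors' code; real dispatch would hand requests to asynchronous disk queues):

```python
from collections import deque

class RoundRobinScheduler:
    """Streams in the dispatch set (at most D) are served round-robin;
    each issues requests of size R and leaves the set after N issues."""

    def __init__(self, D, R, N):
        self.D, self.R, self.N = D, R, N
        self.dispatch = deque()  # each entry: [stream_id, next_offset, issued]

    def admit(self, stream_id, offset):
        """Add a stream to the dispatch set if there is room."""
        if len(self.dispatch) < self.D:
            self.dispatch.append([stream_id, offset, 0])
            return True
        return False

    def next_request(self):
        """Return the next (stream_id, offset, size) to issue, or None."""
        if not self.dispatch:
            return None
        stream = self.dispatch.popleft()
        req = (stream[0], stream[1], self.R)
        stream[1] += self.R       # sequential read-ahead advances by R
        stream[2] += 1
        if stream[2] < self.N:    # stream stays in D until N requests issued
            self.dispatch.append(stream)
        return req
```

With D = 2 and N = 2, two admitted streams alternate in strict round‐robin order and each leaves the dispatch set after issuing two requests.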
Staging prefetched data
[Figure: streams leaving the dispatch set are staged in a buffered set (total memory M); the classifier looks up incoming requests in the buffered set before sending them on]
- Streams removed from D are staged in the buffered set, until their prefetched data are used by new requests or a timeout expires
- Classifier looks up request data in the buffered set and completes the request if found
- Overall memory space (M): size of buffered set & dispatch set (D)
- At all times M ≥ D·R·N
- Periodically garbage collect inactive/non‐sequential streams
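The buffered set behaves like a keyed staging cache with timeout‐based garbage collection; a minimal sketch (hypothetical names and timeout value, not the authors' code):

```python
import time

class BufferedSet:
    """Prefetched blocks from streams evicted from the dispatch set are
    kept until consumed by a later request or until a timeout expires."""

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.staged = {}  # (stream_id, offset) -> (data, stage_time)

    def stage(self, stream_id, offset, data, now=None):
        self.staged[(stream_id, offset)] = (data, now if now is not None
                                            else time.monotonic())

    def lookup(self, stream_id, offset):
        """Complete a request from staged data if present (consumes it)."""
        entry = self.staged.pop((stream_id, offset), None)
        return entry[0] if entry else None

    def garbage_collect(self, now=None):
        """Drop entries older than the timeout; return how many."""
        now = now if now is not None else time.monotonic()
        expired = [k for k, (_, t) in self.staged.items()
                   if now - t > self.timeout_s]
        for k in expired:
            del self.staged[k]
        return len(expired)
```

A lookup hit consumes the staged data, so the same prefetched block is returned to the block layer exactly once; anything never consumed is reclaimed by the periodic garbage collection pass.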
Implementation
- Implemented on Linux
- User‐space I/O server & stream generators
- Using asynchronous I/O, not threads
- Direct I/O to bypass kernel buffer cache
Evaluation Setup
- One storage node
– Dual Opteron machine, 1GB memory
– Broadcom RAID controller for 8 SATA disks
– WD 7200rpm SATA disks (55‐60 MBytes/sec)
- Multiple client nodes
– Necessary to saturate 8 disks
– Issue many sequential stream requests over 1 GigE link
– Data are not transferred over the network
Read‐ahead (R)
- S: number of input streams
- M = S·R·N and S = D (all streams fit in the dispatch set)
- Substantial amount of memory required
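Plugging the slide's own parameters into M = S·R·N shows how quickly the memory requirement grows (N = 1 as in the figure):

```python
def required_memory(S, R, N=1):
    """Memory needed when all S streams sit in the dispatch set
    (M = S * R * N, as on the slide)."""
    return S * R * N

MB = 2**20
# 100 streams per disk, as in the figure:
assert required_memory(100, 8 * MB) == 800 * MB          # R = 8 MB  -> ~800 MB
assert required_memory(100, 128 * 1024) == 12800 * 1024  # R = 128 KB -> ~12.5 MB
```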
[Figure: throughput (MBytes/s) vs. number of streams per disk (10-100), with M = D·R·N, D = #S, N = 1. Curves: no read‐ahead, and R from 128 KBytes (M ≈ 12 MBytes) up to 8 MBytes (M ≈ 800 MBytes)]
Memory Size
- Interested in many streams that need much memory
- For a fixed R, increasing S lowers throughput
- Increased R important for high throughput
[Figure: throughput (MBytes/s) vs. memory size (8-256 MBytes), with D = M/(R·N), N = 1. Curves: S = 1, 10, 100 for read‐ahead RA = 256 KBytes, 1 MByte, 8 MBytes]
Multiple disks
- Throughput for 8 disks as S per disk increases
- Throughput drops regardless of read‐ahead value R
- Bottleneck: controller due to buffer management
- Need to separate dispatched from staged streams
[Figure: aggregate throughput (MBytes/s) vs. number of streams per disk (10-100) on 8 disks, with D = S, M = D·R·N, N = 1. Curves: no read‐ahead, R = 512 KBytes, 1 MByte, 2 MBytes]
Dispatched vs. staged
- 8‐disk setup with dispatched < staged streams
- Better behavior with a small amount of memory because of lower buffer management overhead
- Potential for high utilization by tuning R, D, N and M
[Figure: throughput (MBytes/s) vs. number of streams per disk (10-100) on 8 disks; R = 512 KBytes, D = #disks, N = 128, M = staged·N·R, compared with the R = 512 KBytes curve from the previous figure]
Single‐disk Throughput
- 1 disk with dispatched < staged streams
- Better behavior with a small amount of memory compared to the S = D case (all streams fit in the dispatch set)
[Figure: throughput (MBytes/s) vs. number of streams per disk (10-100) on 1 disk; R = 512 KBytes, D = 1, N = 128, M = staged·N·R, compared with the R = 2 MBytes and R = 8 MBytes curves from the S = D figure]
Response Time
- Mainly interested in improving disk utilization
- Increasing S has a high impact on response time
- Increasing R improves response time
- Average request response time does not differ much among streams because of the round‐robin policy
[Figure: average latency (msec, log scale) vs. read‐ahead size (256-8192 KBytes). Curves: S = 1, 10, 100 for M = 8, 64, 256 MBytes]
Conclusions
- Analyze performance of many sequential I/O streams on disk
- Examine the effect of I/O subsystem parameters
- Find that certain parameters can improve performance
- Propose a solution at the host level that
– Identifies the structures needed & parameterizes each
– Allows setting these parameters (D, R, N, M) independently
- Implement & measure the solution on a real system
– Up to 4x higher throughput with 100 streams per disk
– Makes the I/O subsystem insensitive to the number of streams
– Approach works with limited memory
– Response time affected by number of streams, not read‐ahead
Thank you! Questions?
"Reducing Disk I/O Performance Sensitivity for Large Numbers of Sequential Streams"
George Panagiotakis, Michail Flouris & Angelos Bilas
Foundation for Research & Technology ‐ Hellas
http://www.ics.forth.gr/carv/scalable