Reducing Disk I/O Performance Sensitivity for Large Numbers of Sequential Streams

ICDCS 2009


SLIDE 1

Title slide

SLIDE 2

Motivation

  • Media servers, scientific data applications
    – Write-once, read-many workloads
    – Large sequential files: media (HD video), scientific data
    – Parallel retrieval of sequential I/O streams from disks
  • Sequential access: simple & efficient for disks
  • Challenge
    – Maintain maximum read throughput while scaling to a large number of I/O streams per disk
  • Disk capacity increases → fewer spindles per stream
    – A 2-TByte disk holds 440 full-size DVD movies

SLIDE 3

Linux I/O Schedulers

  • Parallel reading of sequential streams on 1 SATA disk
    – 1 stream: 60 MB/sec; 256 streams: 10–15 MB/sec

[Figure: throughput of Linux I/O schedulers as the number of parallel sequential streams on one SATA disk grows]

SLIDE 4

Traditional Solutions

  • Caching & aggressive/static prefetching
  • Efficient I/O schedulers
    – Anticipatory, fair-queuing
  • Work well when
    – The number of streams is small
    – Prefetching buffers fit in memory
  • However
    – Various workloads need a large number of streams
    – Storage controllers: many disks and limited memory

SLIDE 5

Other Solutions

  • SSDs: expensive & low capacity
    – Behavior with high-performance workloads not well understood
    – Used as a prefetching buffer?
  • Data placement is not a practical solution
    – Predict which streams are read together?
    – Stream playout is short-lived vs. the time needed to reorganize data

SLIDE 6

Overview

  • Motivation
  • Related work & contributions
  • Disk & controller-level prefetching
  • Our approach
  • Evaluation
  • Conclusions

SLIDE 7

Related Work

  • Modeling & optimizing disks
    – [Ganger 95], [Jacobson & Wilkes 91], [Ruemmler & Wilkes 94], [Shriver 97], [Varki et al. 04], [Zhu & Hu 02]
  • I/O performance & scheduling optimizations
    – [Bachmat 02], [Iyer & Druschel 01], [Kim et al. 06], [Mokbel et al. 04], [Shenoy & Vin 98], [Wijayaratne & Reddy 01], [Hsu & Smith 04], [Carrera & Bianchini 02], [Coloma et al. 05], [Yu et al. 06]
  • Prefetching
    – [Shriver et al. 99], [Cao et al. 95], [Kimbrel & Karlin 00], [Li et al. 07], [Patterson et al. 95], [Ding et al. 07]
  • Storage caching (non-sequential workloads)
    – [Chen et al. 03], [Dahlin et al. 94], [Johnson & Shasha 94], [Zhou et al. 02]
  • I/O for multimedia applications
    – [Chen et al. 94], [Dey-Sircar et al. 94], [Rangan & Vin 91], [Reddy & Wyllie 94], [Dan et al. 95]

SLIDE 8

Contributions

  • Analysis of the problem
  • Solution at the host level
    – Up to 4x higher throughput with 100 streams per disk
    – Improved disk utilization with limited memory
  • Our approach relies on
    – Identifying & separating sequential streams
    – Buffering & coalescing small requests in host memory
    – A notion of working set for servicing multiple I/O streams
  • Validation through
    – Disksim simulation and real-system experiments
    – Multiple disk & controller configurations

SLIDE 9

I/O Path

  • I/O path components that perform caching & queuing
  • Caches become smaller towards the bottom of the path
  • Disk cache: limited size, divided into fixed segments

SLIDE 10

Disk-level Prefetching

  • Achieved by
    – Increasing the application request size
    – Increasing the disk segment size to prefetch full segments
  • Measurements with Disksim and microbenchmarks
  • Larger request sizes improve throughput, if there is enough disk cache for all I/O streams
  • When (number of streams × request size) > cache size, throughput degrades dramatically
  • Increasing disk cache size and prefetching improves throughput for a large number of streams
  • However, the disk cache size is fixed by the manufacturer
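The thrashing condition above reduces to a simple inequality. A minimal sketch, with an illustrative on-disk cache size chosen for the example (the slide does not give specific numbers):

```python
def disk_cache_thrashes(num_streams, request_size, cache_size):
    """True when the concurrent streams' outstanding requests exceed the
    disk's cache, so prefetched segments are evicted before reuse."""
    return num_streams * request_size > cache_size

# Illustrative assumption: a 16 MB on-disk cache.
CACHE = 16 * 2**20
assert not disk_cache_thrashes(16, 2**20, CACHE)   # 16 streams x 1 MB fits
assert disk_cache_thrashes(256, 2**20, CACHE)      # 256 streams x 1 MB thrashes
```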

SLIDE 11

Controller-level Prefetching

  • Prefetching at the controller level is effective when there is enough memory for all streams
  • Not a solution, because one controller may have 4–16 disks and should handle thousands of streams (would need GBytes of memory)
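The "GBytes of memory" estimate follows from simple arithmetic. The per-stream buffer size below is an illustrative assumption, not a figure from the talk:

```python
# One controller: 8 disks, ~1000 streams per disk,
# 1 MB of prefetch buffer per stream (assumed for illustration).
disks, streams_per_disk, buffer_per_stream = 8, 1000, 2**20
total = disks * streams_per_disk * buffer_per_stream
print(round(total / 2**30, 1))  # 7.8 -> multiple GBytes, as the slide notes
```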

SLIDE 12

Host-level Approach

[Diagram: the server receives block I/O requests; a classifier routes sequential requests to the scheduler and non-sequential requests directly to the disks]

  • Block-level operation, file-system agnostic
  • The system receives block I/O requests
  • A classifier detects sequential requests using a bitmap
  • Non-sequential requests are sent directly to the disks
  • Requests in sequential streams are sent to the scheduler
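A minimal sketch of the classifier idea: mark each requested block in a bitmap and treat a request as sequential if the immediately preceding block was seen recently. This is an assumption about how such a bitmap could be used, not the authors' implementation:

```python
class SequentialClassifier:
    """Toy bitmap classifier: a request for block b is classified as part
    of a sequential stream if block b-1 was previously requested."""

    def __init__(self, num_blocks):
        self.bitmap = bytearray((num_blocks + 7) // 8)  # one bit per block

    def _test(self, blk):
        return self.bitmap[blk >> 3] & (1 << (blk & 7))

    def _set(self, blk):
        self.bitmap[blk >> 3] |= 1 << (blk & 7)

    def classify(self, blk):
        seq = blk > 0 and bool(self._test(blk - 1))
        self._set(blk)  # remember this block for future requests
        return "sequential" if seq else "non-sequential"

c = SequentialClassifier(1024)
print(c.classify(100))  # non-sequential (first touch)
print(c.classify(101))  # sequential (follows block 100)
print(c.classify(500))  # non-sequential (random jump)
```

Non-sequential requests would bypass the scheduler; sequential ones would join a stream's queue.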

SLIDE 13

Scheduling

[Diagram: the classifier feeds the scheduler (D, R, N), which issues I/O to the disks under a round-robin (RR) policy and handles request completions]

  • Dispatch set (D): the set of streams currently in the scheduler that issue I/O
  • Read-ahead size (R): the size of requests actually issued to the disks
  • Streams remain in D until they have issued N disk requests
  • Replacement policy for streams in D: round-robin
  • On disk request completion, the scheduler completes the block I/O request
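The (D, R, N) parameters can be sketched as a tiny round-robin dispatcher. This is a hypothetical structure based on the slide's description, not the authors' code:

```python
from collections import deque

class RRScheduler:
    """Sketch of the (D, R, N) scheduler: at most D streams are in the
    dispatch set; each dispatched stream issues N read-ahead requests of
    size R, then rotates out round-robin."""

    def __init__(self, D, R, N):
        self.D, self.R, self.N = D, R, N
        self.waiting = deque()  # streams not currently dispatched

    def submit(self, stream_id):
        self.waiting.append(stream_id)

    def dispatch(self):
        """Yield (stream_id, request_index, size) for one scheduling round."""
        batch = [self.waiting.popleft()
                 for _ in range(min(self.D, len(self.waiting)))]
        for s in batch:              # each stream in D issues N requests of R
            for i in range(self.N):
                yield (s, i, self.R)
        self.waiting.extend(batch)   # rotate streams back to the tail

sched = RRScheduler(D=2, R=512 * 1024, N=2)
for s in ("A", "B", "C"):
    sched.submit(s)
print(list(sched.dispatch()))
# A and B each issue N=2 requests of R bytes; C waits for the next round
```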

SLIDE 14

Staging prefetched data

[Diagram: streams leaving the dispatch set are staged in a buffered set (M); the classifier looks up incoming requests in the buffered set before passing them to the scheduler]

  • Streams removed from D are staged in a buffered set, until their prefetched data are used by new requests or a timeout expires
  • The classifier looks up request data in the buffered set and completes the request if found
  • Overall memory space (M): size of the buffered set & the dispatch set (D)
  • At all times, M ≥ D·R·N
  • Inactive/non-sequential streams are periodically garbage-collected
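The buffered set with timeout-based garbage collection can be sketched as follows. Names and the block-granularity interface are hypothetical, inferred from the slide's description:

```python
class BufferedSet:
    """Sketch of staging: prefetched data from streams rotated out of the
    dispatch set are kept until consumed by a later request or until a
    timeout expires."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.staged = {}  # block -> (data, stage_time)

    def stage(self, block, data, now):
        self.staged[block] = (data, now)

    def lookup(self, block):
        """Complete a request from staged data if present (and consume it)."""
        entry = self.staged.pop(block, None)
        return entry[0] if entry else None

    def garbage_collect(self, now):
        """Evict staged blocks older than the timeout; return how many."""
        expired = [b for b, (_, t) in self.staged.items()
                   if now - t > self.timeout]
        for b in expired:
            del self.staged[b]
        return len(expired)

bs = BufferedSet(timeout=5.0)
bs.stage(7, b"prefetched", now=0.0)
print(bs.lookup(7))                  # b'prefetched' -- hit, served from memory
print(bs.lookup(7))                  # None -- already consumed
bs.stage(8, b"stale", now=0.0)
print(bs.garbage_collect(now=10.0))  # 1 expired block evicted
```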

SLIDE 15

Implementation

  • Implemented on Linux
  • User-space I/O server & stream generators
  • Uses asynchronous I/O, not threads
  • Direct I/O to bypass the kernel buffer cache

SLIDE 16

Evaluation Setup

  • One storage node
    – Dual Opteron machine, 1 GB memory
    – Broadcom RAID controller for 8 SATA disks
    – WD 7200-rpm SATA disks (55–60 MBytes/sec)
  • Multiple client nodes
    – Necessary to saturate 8 disks
    – Issue many sequential stream requests over a 1-GigE link
    – Data are not transferred over the network

SLIDE 17

Read-ahead (R)

  • S: number of input streams
  • M = S·R·N and S = D (all streams fit in the dispatch set)
  • A substantial amount of memory is required

[Figure: throughput (MBytes/s, 10–60) vs. number of streams per disk (10, 30, 60, 100), with M = D·R·N, D = #S, N = 1; curves for R = 8 MBytes (M ≈ 800 MBytes), R = 2 MBytes (M ≈ 200 MBytes), R = 1 MByte (M ≈ 100 MBytes), R = 512 KBytes (M ≈ 50 MBytes), R = 128 KBytes (M ≈ 12 MBytes), and no read-ahead]
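The memory totals labeled in the figure follow directly from M = S·R·N with N = 1, and can be checked with a few lines:

```python
def memory_needed(S, R, N=1):
    """M = S * R * N: memory when every stream fits in the dispatch set (S = D)."""
    return S * R * N

MB = 2**20
# 100 streams at R = 8 MB needs ~800 MB, matching the figure's labels.
print(memory_needed(100, 8 * MB) // MB)      # 800
print(memory_needed(100, 128 * 1024) // MB)  # 12 -> the "~12 MBytes" label
```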

SLIDE 18

Memory Size

  • Interested in many streams, which need much memory
  • Fixed R value: increasing S → lower throughput
  • Increased R is important for high throughput

[Figure: throughput (MBytes/s, 10–60) vs. memory size (8, 16, 64, 128, 256 MBytes), with D = M/(R·N), N = 1; curves for S = 1, 10, 100 at RA = 8 MBytes, RA = 1 MByte, and RA = 256 KBytes]

SLIDE 19

Multiple disks

  • Throughput for 8 disks as S per disk increases
  • Throughput drops regardless of the read-ahead value R
  • Bottleneck: the controller, due to buffer management
  • Need to separate dispatched from staged streams

[Figure: throughput (MBytes/s, 100–400) vs. number of streams per disk (10, 30, 60, 100), with D = S, M = D·R·N, N = 1; curves for no read-ahead, R = 512 KBytes, R = 1 MByte, and R = 2 MBytes]

SLIDE 20

Dispatched vs. staged

  • 8-disk setup with dispatched < staged streams
  • Better behavior with a small amount of memory, because of lower buffer-management overhead
  • Potential for high utilization by tuning R, D, N, and M

[Figure: throughput (MBytes/s, 100–400) vs. number of streams per disk (10, 30, 60, 100); R = 512 KBytes, D = #disks, N = 128, M = staged·N·R, compared against R = 512 KBytes from the previous figure]

SLIDE 21

Single-disk Throughput

  • 1 disk with dispatched < staged streams
  • Better behavior with a small amount of memory, compared to the S = D case (all streams fit in the dispatch set)

[Figure: throughput (MBytes/s, 10–60) vs. number of streams per disk (10, 30, 60, 100); R = 512 KBytes, D = 1, N = 128, M = staged·N·R, compared against R = 2 MBytes and R = 8 MBytes from the S = D figure (slide 17)]

SLIDE 22

Response Time

  • Mainly interested in improving disk utilization
  • Increasing S → high impact on response time
  • Increasing R improves response time
  • Average request response time does not differ much among streams, because of the round-robin policy

[Figure: average latency (msec, log scale 1–500) vs. read-ahead size (256, 1024, 8192 KBytes); curves for S = 1, 10, 100 at M = 8, 64, and 256 MBytes]

SLIDE 23

Conclusions

  • Analyze the performance of many sequential I/O streams on disk
  • Examine the effect of I/O subsystem parameters
  • Find that certain parameters can improve performance
  • Propose a host-level solution that
    – Identifies the structures needed & parameterizes each
    – Allows setting these parameters (D, R, N, M) independently
  • Implement & measure the solution on a real system
    – Up to 4x higher throughput with 100 streams per disk
    – Makes the I/O subsystem insensitive to the number of streams
    – Works with limited memory
    – Response time is affected by the number of streams, not by read-ahead

SLIDE 24

Thank you! Questions?

"Reducing Disk I/O Performance Sensitivity for Large Numbers of Sequential Streams"
George Panagiotakis, Michail Flouris & Angelos Bilas
Foundation for Research & Technology – Hellas
http://www.ics.forth.gr/carv/scalable