FAST: Quick Application Launch on Solid-State Drives

SLIDE 1

FAST: Quick Application Launch on Solid-State Drives

Yongsoo Joo (1), Junhee Ryu (2), Sangsoo Park (1), and Kang G. Shin (1,3)

(1) Ewha Womans University, Korea

(2) Seoul National University, Korea

(3) University of Michigan, USA

SLIDE 2

Application Launch Delay

Elapsed time between two events

A user clicks the icon

The application becomes responsive

Important for interactive applications

Critically affects user satisfaction

SLIDE 3

Application Launch Performance

Moore’s law not applicable

Faster CPU and larger main memory not helpful

HDD seek and rotational latencies do not improve well

[Figure: (a) CPU performance (MIPS), (b) Peak bandwidth of DRAMs (Gbit/s), (c) Peak bandwidth of HDDs (Mbit/s), (d) Disk access latency (ms): average seek time and average rotational latency. CPU performance, DRAM throughput, and HDD throughput: exponential improvement. HDD access latency (seek, rotational): linear improvement.]

SLIDE 4

Application Launch Performance

Application launch breakdown

[Figure: launch time breakdown (0% to 100%) for Rhythmbox, Totem, Evolution, Firefox, and F-Spot. Legend: computation, seek and rotational latency, data transfer time.]

SLIDE 5

SW-Level Optimization

Many SW-level schemes deployed in OSes

Application defragmentation, Superfetch, readahead, BootCache, etc.

Sorted prefetch (e.g., Windows prefetch)

Obtain the set of accessed blocks for each application

Monitor I/O requests during an application launch

Pause the target application upon detection of its launch

Prefetch the predetermined set of blocks in their LBA order

Reduce the total seek distance of the disk head

Resume the launch after the prefetch completes
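The LBA-ordering step of sorted prefetch can be illustrated with a minimal sketch. It assumes a simplified seek model in which the head travels from the end of one request to the start of the next. The names `block_req`, `sort_by_lba`, and `total_seek_distance` are hypothetical and not part of any OS prefetcher.

```c
#include <stdlib.h>

/* One block request: starting LBA and length in blocks. */
struct block_req {
    long lba;
    long size;
};

/* qsort comparator: ascending start LBA. */
static int cmp_lba(const void *a, const void *b)
{
    const struct block_req *x = a;
    const struct block_req *y = b;
    return (x->lba > y->lba) - (x->lba < y->lba);
}

/* Reorder the requests into LBA order, as a sorted prefetcher would issue them. */
static void sort_by_lba(struct block_req *reqs, size_t n)
{
    qsort(reqs, n, sizeof reqs[0], cmp_lba);
}

/* Total head travel, in blocks, when the requests are served in array order.
 * Assumed model: after serving a request the head sits at its last block + 1. */
static long total_seek_distance(const struct block_req *reqs, size_t n)
{
    long total = 0;
    for (size_t i = 1; i < n; i++) {
        long prev_end = reqs[i - 1].lba + reqs[i - 1].size;
        total += labs(reqs[i].lba - prev_end);
    }
    return total;
}
```

For the requests (5, 2), (1, 1), (7, 1) served in that order, this model gives a travel of 11 blocks. After `sort_by_lba` the order becomes (1, 1), (5, 2), (7, 1) and the travel drops to 3 blocks.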

SLIDE 6

SW-Level Optimization

How sorted prefetch works

[Figure: HDD track position over time (x-axis not to scale), with CPU computation marked. Without sorted prefetch: launch start, then launch completion. With sorted prefetch: launch detection, prefetcher execution, launch resumption, then launch completion.]

Improvement (typ: 40%)

SLIDE 7

Flash-based SSD

The single most effective way to eliminate disk head positioning delay

Acrobat reader: 4.0s -> 0.8s (84% reduction)

Matlab: 16.0s -> 5.1s (68% reduction)

Characteristics

Consists of multiple NAND flash chips

No mechanical moving parts

Uniform access latency (a few hundred microseconds)

Prices now affordable

80 GB MLC SSD: less than $200 now

SLIDE 8

Motivation

Question: Are we satisfied with the app launch on SSD?

Yes for lightweight applications (e.g., less than 1 sec)

No for heavy applications (e.g., more than 5 sec)

Far from ultimate user satisfaction

Faster application launch is always good (at least, not bad)

Growing need for launch optimization on SSDs

Applications are getting HEAVIER

More blocks to be read

SSD random read performance improves slowly

Bounded by the single chip performance

SLIDE 9

HDD-Aware Optimizers on SSD

Question: Will traditional HDD optimizers work for SSDs?

Consensus: they will not be effective on SSDs

Rationale: they mostly optimize disk head movement

No disk head in SSDs

Often recommended not to use on SSDs

Microsoft Windows 7

HDD-aware optimizers disabled upon detection of SSD

Windows prefetch, Application defragmentation, Superfetch, ReadyBoost, etc.

SLIDE 10

Sorted Prefetch on SSDs

No benefit from LBA sorting

Uniform access latency of SSDs

Launch performance still improves

Increased effective queue depth (0.3 -> 3.4, app: Eclipse)

Observed 7% launch time reduction: better than nothing!

[Figure: queue depth (0 to 32) over time (sec). (a) Cold start (no prefetcher): average QD 0.3. (b) Baseline prefetcher and (c) baseline prefetcher (zoomed in, 0.1 to 0.6 sec): average QD 3.4.]
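How an average queue depth figure can be derived from a trace is sketched below. This assumes that "average QD" means the time-averaged number of outstanding requests over an observation window. The slides do not state the exact definition, and `io_span` and `average_queue_depth` are hypothetical names.

```c
#include <stddef.h>

/* Issue and completion timestamps (in seconds) of one I/O request. */
struct io_span {
    double issue;
    double complete;
};

/* Time-averaged queue depth over the window [t0, t1]:
 * the total time that requests spend outstanding inside the window,
 * divided by the window length. */
static double average_queue_depth(const struct io_span *io, size_t n,
                                  double t0, double t1)
{
    double outstanding = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Clip each request to the observation window. */
        double a = io[i].issue > t0 ? io[i].issue : t0;
        double b = io[i].complete < t1 ? io[i].complete : t1;
        if (b > a)
            outstanding += b - a;
    }
    return t1 > t0 ? outstanding / (t1 - t0) : 0.0;
}
```

Under this definition, a prefetcher that keeps several requests in flight at once raises the average, while an application that blocks on one read at a time stays below 1.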

SLIDE 11

FAST: Fast Application STarter

Overlap CPU computation with SSD accesses

[Figure: timelines of SSD accesses (s1 to s4) and CPU computations (c1 to c4), each marked with tlaunch. (a) Cold start scenario: the application alternates SSD accesses and computations. (b) Warm start scenario: the application runs c1 to c4 only. (c) Proposed prefetching (tcpu > tssd): a prefetcher issues s1 to s4 while the application runs c1 to c4.]
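The three scenarios can be compared with a small timing model. This is an illustrative sketch, not the authors' model: it assumes the prefetcher issues the SSD accesses back to back, and that computation c[i] can start only once block i has been fetched and c[i-1] has finished. All function names are hypothetical.

```c
#include <stddef.h>

/* Cold start: every SSD access s[i] blocks the application before c[i] runs. */
static double cold_start_time(const double *s, const double *c, size_t n)
{
    double t = 0.0;
    for (size_t i = 0; i < n; i++)
        t += s[i] + c[i];
    return t;
}

/* Warm start: all blocks are already cached, so only CPU time remains. */
static double warm_start_time(const double *c, size_t n)
{
    double t = 0.0;
    for (size_t i = 0; i < n; i++)
        t += c[i];
    return t;
}

/* Prefetched launch: SSD accesses run concurrently with computation. */
static double prefetched_time(const double *s, const double *c, size_t n)
{
    double fetched = 0.0; /* time at which block i is in the page cache */
    double done = 0.0;    /* time at which computation i finishes */
    for (size_t i = 0; i < n; i++) {
        fetched += s[i];
        double start = done > fetched ? done : fetched;
        done = start + c[i];
    }
    return done;
}
```

With four steps of s[i] = 1 and c[i] = 2 (so tcpu > tssd), the model gives 12 for the cold start, 8 for the warm start, and 9 for the prefetched launch: only the first SSD access is left exposed.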

SLIDE 12

Application Launch Sequence

Deterministic block requests over repeated launches

[Figure: raw block request traces collected over repeated launches are reduced to the application launch sequence b1 b2 b3 b4 b5; block requests irrelevant to the application launch are dropped]

SLIDE 13

What to Do

Application launch sequence profiling

Using blktrace tool

Prefetcher generation

Replay block requests according to the application launch sequence

Prefetcher execution

Simultaneously with the original application

By wrapping the system call exec() (LD_PRELOAD)

SLIDE 14

Prefetcher Generation

Example application launch sequence

AB->C->D

Block-level I/O: (start LBA, size)

(5, 2)->(1, 1)->(7, 1) <- obtainable from blktrace

File-level I/O: (filename, offset, size)

(“b.so”, 2, 2)->(“a.conf”, 1, 1)->(“c.lib”, 0, 1)

[Figure: layout of "/dev/sda" (LBA 1 to 9) and the file offsets of "a.conf", "b.so", and "c.lib" stored on it, with the accessed blocks A, B, C, and D marked]

SLIDE 16

Prefetcher Generation

Block-level I/O replay

#define _GNU_SOURCE /* for O_LARGEFILE */
#include <fcntl.h>

int main(void)
{
    /* posix_fadvise(fd, LBA * 512, size * 512, advice) */
    int fd = open("/dev/sda", O_RDONLY | O_LARGEFILE);
    if (fd < 0)
        return 1;
    posix_fadvise(fd, 5 * 512, 2 * 512, POSIX_FADV_WILLNEED);
    posix_fadvise(fd, 1 * 512, 1 * 512, POSIX_FADV_WILLNEED);
    posix_fadvise(fd, 7 * 512, 1 * 512, POSIX_FADV_WILLNEED);
    return 0;
}

[Figure: layout of "/dev/sda" with the accessed blocks A, B, C, and D marked]

SLIDE 17

Page Cache Structure

[Figure: page cache contents]

inode       cached blocks
/dev/sda    A B C D
a.conf      (none)
b.so        (none)
c.lib       (none)

SLIDE 18

Page Cache Structure

[Figure: page cache contents]

inode       cached blocks
/dev/sda    A B C D
a.conf      (none) Miss!
b.so        (none) Miss!
c.lib       (none) Miss!

SLIDE 19

Page Cache Structure

[Figure: page cache contents]

inode       cached blocks
/dev/sda    A B C D
a.conf      C
b.so        A B
c.lib       D

What we need to construct: the cached blocks under a.conf, b.so, and c.lib

SLIDE 20

Prefetcher Generation

File-level I/O replay

#include <fcntl.h>

int main(void)
{
    /* open(file name, flags) */
    /* posix_fadvise(fd, file offset * 512, size * 512, advice) */
    int fd1 = open("b.so", O_RDONLY);
    posix_fadvise(fd1, 2 * 512, 2 * 512, POSIX_FADV_WILLNEED);
    int fd2 = open("a.conf", O_RDONLY);
    posix_fadvise(fd2, 1 * 512, 1 * 512, POSIX_FADV_WILLNEED);
    int fd3 = open("c.lib", O_RDONLY);
    posix_fadvise(fd3, 0 * 512, 1 * 512, POSIX_FADV_WILLNEED);
    return 0;
}

[Figure: layout of "/dev/sda" with the accessed blocks A, B, C, and D marked]

SLIDE 21

Block-to-File Level I/O Conversion

[Figure: layout of "/dev/sda" with the accessed blocks A, B, C, and D marked]

(5, 2) -> (“b.so”, 2, 2)

(1, 1) -> (“a.conf”, 1, 1)

(7, 1) -> (“c.lib”, 0, 1)

LBA-to-inode mapping

Not supported by the EXT file system

SLIDE 22

Block-to-File Level I/O Conversion

Inode-to-LBA map for a single file

Easy to build

LBA-to-inode map for the entire file system

Millions of files in a file system

Frequently changed

Only a few hundred files used by a single application

Our approach: build a partial map for each application

Determine the set of files used for the launch

Monitor system calls that take a filename as an argument
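The conversion through a partial map can be sketched as a lookup over per-file extents. This is a minimal illustration, assuming the inode-to-LBA map of each launch file has already been collected into an array. The `extent` and `file_req` types and the `block_to_file` function are hypothetical, and the extents used in the usage note cover only the blocks from the example slides.

```c
#include <stddef.h>

/* One extent of a file: `len` blocks, starting at file offset `file_off`,
 * stored contiguously on disk from `lba`. */
struct extent {
    const char *file;
    long file_off;
    long lba;
    long len;
};

/* A file-level request: (filename, offset, size). */
struct file_req {
    const char *file;
    long off;
    long size;
};

/* Convert a block-level request (lba, size) into a file-level request.
 * Returns 1 and fills *out if the request lies inside one known extent,
 * or 0 if it belongs to no file in the partial map. */
static int block_to_file(const struct extent *map, size_t n,
                         long lba, long size, struct file_req *out)
{
    for (size_t i = 0; i < n; i++) {
        const struct extent *e = &map[i];
        if (lba >= e->lba && lba + size <= e->lba + e->len) {
            out->file = e->file;
            out->off = e->file_off + (lba - e->lba);
            out->size = size;
            return 1;
        }
    }
    return 0;
}
```

With the extents {"b.so", 2, 5, 2}, {"a.conf", 1, 1, 1}, and {"c.lib", 0, 7, 1}, the request (5, 2) converts to ("b.so", 2, 2), (1, 1) to ("a.conf", 1, 1), and (7, 1) to ("c.lib", 0, 1). A request that falls outside every extent is reported as unmapped, which is how a block unrelated to the launch files would show up.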

SLIDE 23

Application Prefetcher

Automatically generated application prefetcher for Gimp

int main(void)
{
    ...
    readlink("/etc/fonts/conf.d/90-ttf-arphic-uming-embolden.conf", linkbuf, 256);
    int fd423;
    fd423 = open("/etc/fonts/conf.d/90-ttf-arphic-uming-embolden.conf", O_RDONLY);
    posix_fadvise(fd423, 0, 4096, POSIX_FADV_WILLNEED);
    posix_fadvise(fd351, 286720, 114688, POSIX_FADV_WILLNEED);
    int fd424;
    fd424 = open("/usr/share/fontconfig/conf.avail/90-ttf-arphic-uming-embolden.conf", O_RDONLY);
    posix_fadvise(fd424, 0, 4096, POSIX_FADV_WILLNEED);
    int fd425;
    fd425 = open("/root/.gnupg/trustdb.gpg", O_RDONLY);
    posix_fadvise(fd425, 0, 4096, POSIX_FADV_WILLNEED);
    dirp = opendir("/var/cache/");
    if (dirp)
        while (readdir(dirp))
            ;
    ...
    return 0;
}
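The open-then-advise pattern that the generated prefetcher repeats can be factored into one helper. The sketch below is an illustration only: `prefetch_range` is a hypothetical name, and closing the descriptor right after the hint is a choice made here, not something the generated code above does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* exposes posix_fadvise() */
#endif
#include <fcntl.h>
#include <unistd.h>

/* Hint the kernel to read `len` bytes of `path`, starting at byte `offset`,
 * into the page cache. Returns 0 on success, -1 if the file cannot be
 * opened, or the error number reported by posix_fadvise(). */
static int prefetch_range(const char *path, off_t offset, off_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int err = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    close(fd);
    return err;
}
```

A call such as `prefetch_range("/root/.gnupg/trustdb.gpg", 0, 4096)` corresponds to one open() and posix_fadvise() pair in the listing above. POSIX_FADV_WILLNEED is advisory, so the call returns without waiting for the data to arrive.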

SLIDE 24

CPU and SSD Usage

[Figure: CPU and SSD usage over time (sec) for Eclipse and Firefox under cold start, warm start, FAST, and sorted prefetch, with launch times tcold, twarm, tFAST, and tsorted marked. Annotation: low CPU usage.]

Reduction: 24%

SLIDE 25

CPU and SSD Usage

[Figure: CPU and SSD usage over time (sec) for Eclipse and Firefox under cold start, warm start, FAST, and sorted prefetch, with launch times tcold, twarm, tFAST, and tsorted marked. Annotation: low CPU usage.]

Reduction: 24%

Reduction: 37%

SLIDE 27

Measured Application Launch Time

Launch time reduction

Warm start: 37% (upper bound)

Proposed: 28% (min: 16%, max: 46%)

Sorted prefetch: 7% (min: -5%, max: 21%)

(Normalized to the cold start time.)

[Figure: normalized launch time (0% to 120%) of each application and the average, with bars for tcold, tsorted, tFAST, twarm, tssd, and tbound. Average labels: 93%, 72%, 63%, 27%. Each application is labeled with a launch time, listed below.]

Application       Launch time
Access            1.6s
Acrobat reader    0.8s
Designer-Qt4      1.9s
Eclipse           4.8s
Excel             2.1s
F-Spot            1.1s
Firefox           0.9s
Gimp              2.3s
Gnome             2.6s
Houdini           5.6s
Kdevdesigner      1.8s
Kdevelop          1.6s
Konqueror         1.2s
Labview           2.7s
Matlab            5.1s
OpenOffice        0.9s
Powerpoint        1.9s
Skype             1.0s
Thunderbird       1.0s
Visio             3.7s
Word              2.6s
Xilinx ISE        6.6s

SLIDE 28

Measured Application Launch Time

Launch time reduction

Warm start: 37% (upper bound)

Proposed: 28% (min: 16%, max: 46%)

Sorted prefetch: 7% (min: -5%, max: 21%)

[Figure: average launch time normalized to the cold start time (axis 0% to 120%): Cold 100%, Sorted 93%, FAST 72%, Warm 63%]

SLIDE 29

Applicability on Smartphones

Similarity to PCs with an SSD

Running various applications

Application launch performance does matter

NAND Flash-based storage

The same performance characteristic as SSDs

Slightly modified OSes and file systems designed for PCs

Easy to port

SLIDE 30

Applicability on Smartphones

Further benefits

More frequent launches of applications

Limited main memory capacity

Cold start scenario occurs more often

Slower CPU and flash storage speed

Relatively longer application launch time

SLIDE 31

Applicability on Smartphones

Measured cold & warm start time on iPhone 4

Average cold start time: 6.1 seconds

Warm start time: 63% of cold start time

[Figure: cold start and warm start launch time (sec, axis 0 to 15) for App1 to App14 and their average. Labels: 6.1s and 3.7s.]

SLIDE 32

Conclusion & Future Work

Introduced an application prefetcher designed for SSDs

Our ultimate goal

Warm start performance in the cold start scenario

Further improving FAST by exploiting the SSD parallelism

See our poster!

[Figure: queue depth (0 to 32) over time (sec). (a) Cold start (no prefetcher): average QD 0.3. (b) Baseline prefetcher and (c) baseline prefetcher (zoomed in): average QD 3.4. (d) Two-phase prefetcher and (e) two-phase prefetcher (zoomed in): average QD 30.6.]

[Figure: CPU and SSD timelines with queue depth (QD) for (a) no prefetcher, (b) baseline prefetcher, and (c) two-phase prefetcher (first phase, second phase), showing prefetcher execution and launch completion]

SLIDE 33

Backup Slides

SLIDE 34

Applicability on HDDs

FAST works as well on HDDs, but ...

Application launch on HDDs: I/O bound

Little room for overlapping CPU time and HDD access time

Launch time reduction: 15%

Sorted prefetch performs better

Launch time reduction: 40%

[Figure: normalized application launch time on HDD (axis 0% to 100%): Cold 100%, FAST 85%, Sorted 60%, Warm 15%]

SLIDE 35

Determinism on Multi-Core

We observed determinism even on multi-core CPUs

Only one core is active most of the time

SSD is mostly idle when two or more cores are active

[Figure: activity over time of the SSD and of CPU core 1 to CPU core 5]

SLIDE 36

Why Not Capture File I/O?

Why not simply capture all the file-level I/Os and replay them?

Ex) Capture all read() calls using strace

That’s possible, but the problem is...

The number of read() calls is extremely large

Only a few of them will cause a page fault, generating a block I/O

Replaying all the captured read() calls is inefficient

In terms of both prefetcher size and function call overhead

Not easy to determine which of them will actually cause page faults

May be more complicated than our approach (block-level to file-level I/O conversion)