

Slide 1

Pianola: A script-based I/O benchmark

Lawrence Livermore National Laboratory, P. O. Box 808, Livermore, CA 94551. This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

John May

PSDW08, 17 November 2008

LLNL-PRES-406688

Slide 2

I/O benchmarking: What’s going on here?

  • Is my computer’s I/O system “fast”?
  • Is the I/O system keeping up with my application?
  • Is the app using the I/O system effectively?
  • What tools do I need to answer these questions?
  • And what exactly do I mean by “I/O system” anyway?

  • For this talk, an I/O system is everything involved in storing data, from the filesystem down to the storage hardware

Slide 3

Existing tools can measure general or application-specific performance

  • IOzone automatically measures I/O system performance for different operations and parameters (see the example invocation below)
    • Relatively little ability to customize I/O requests
  • Many application-oriented benchmarks
    • SWarp, MADbench2…
  • Interface-specific benchmarks
    • IOR, //TRACE

[Figure: IOzone 3-D surface plot, “Write new file” — throughput (0–600000 KB/sec) as a function of file size (64 KB–4 GB) and record size (4 KB–16 MB)]
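As an example of IOzone’s automatic mode (the size limit and output file here are arbitrary choices, not taken from the talk):

    $ iozone -a -g 4G -R -b results.xls
    # -a  auto mode: sweep operations, record sizes, and file sizes
    # -g  upper bound on the file sizes swept
    # -R  produce a report of the kind plotted above
    # -b  write that report to a spreadsheet file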

Slide 4

Measuring the performance that matters

  • System benchmarks only measure general response, not application-specific response
  • Third-party application-based benchmarks may not generate the stimulus you care about
  • In-house applications may not be practical benchmarks
    • Difficult for nonexperts to build and run
    • Nonpublic source cannot be distributed to vendors and collaborators
  • Need benchmarks that…
    • Can be generated and used easily
    • Model application-specific characteristics
Slide 5

Script-based benchmarks emulate real apps

  • Capture trace data from an application and generate the same sequence of operations in a replay benchmark
  • We began with //TRACE from CMU (Ganger’s group)
    • Records I/O events and intervening “compute” times
    • Focused on parallel I/O, but much of the infrastructure is useful for our sequential I/O work

[Diagram: the application runs against a capture library, which logs a stream of events (open, compute, read, compute, …) to a replay script; a replay tool then executes that script as a script-based benchmark]
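To make the pipeline concrete, a captured script might look something like this (an illustrative format, not //TRACE’s actual syntax):

    open    fd=3  path=out.dat  flags=O_WRONLY|O_CREAT
    compute 0.0042
    write   fd=3  bytes=65536
    compute 0.0011
    write   fd=3  bytes=65536
    close   fd=3

The replay tool walks this list, sleeping for each compute interval and reissuing each I/O call.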

Slide 6

Challenges for script-based benchmarking: Recording I/O calls at the right level

    fprintf( … );  /* work */
    fprintf( … );  /* more work */
    fprintf( … );
    …
    write( … );

[Diagram: at the system-call level, the sequence above appears as just open, compute, write]

  • Instrumenting at high level
    + Easy with LD_PRELOAD (see the sketch below)
    − Typically generates more events, so logs are bigger
    − Need to replicate formatting
    − Timing includes computation
  • Instrumenting at low level
    + Fewer types of calls to capture
    + Instrumentation is at the I/O system interface
    − Cannot use LD_PRELOAD to intercept all calls
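To illustrate why the high-level approach is easy with LD_PRELOAD, here is a minimal interposer for fwrite(); this is my own sketch, not Pianola’s capture library:

    /* capture.c — minimal LD_PRELOAD interposer for fwrite() (illustration only).
       Build: gcc -shared -fPIC -o libcapture.so capture.c -ldl
       Run:   LD_PRELOAD=./libcapture.so ./my_app */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    size_t fwrite(const void *buf, size_t size, size_t nmemb, FILE *stream)
    {
        /* Locate the real fwrite on first use. */
        static size_t (*real_fwrite)(const void *, size_t, size_t, FILE *);
        if (!real_fwrite)
            real_fwrite = (size_t (*)(const void *, size_t, size_t, FILE *))
                              dlsym(RTLD_NEXT, "fwrite");

        /* Log a timestamped event with write(2) to avoid recursing into stdio. */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        char line[96];
        int n = snprintf(line, sizeof line, "%ld.%09ld fwrite %zu bytes\n",
                         (long)ts.tv_sec, ts.tv_nsec, size * nmemb);
        if (n > 0)
            write(STDERR_FILENO, line, (size_t)n);

        return real_fwrite(buf, size, nmemb, stream);
    }

The same trick cannot catch everything at the system-call level: calls issued inside libc itself never go through the dynamic linker, which is what motivates the strace and binary-instrumentation approaches on the next slides.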

Slide 7

First attempt at capturing system calls: Linux strace utility

  • Records any selected set of system calls
  • Easy to use: just add to command line
  • Produces easily parsed output

    $ strace -r -T -s 0 -e trace=file,desc ls
         0.000000 execve("/bin/ls", [...], [/* 29 vars */]) = 0 <0.000237>
         0.000297 open("/etc/ld.so.cache", O_RDONLY) = 3 <0.000047>
         0.000257 fstat64(3, {st_mode=S_IFREG|0644, st_size=64677, ...}) = 0 <0.000033>
         0.000394 close(3) = 0 <0.000015>
         0.000230 open("/lib/librt.so.1", O_RDONLY) = 3 <0.000046>
         0.000289 read(3, ""..., 512) = 512 <0.000028>
         ...

Slide 8

Strace results look reasonably accurate, but overall runtime is exaggerated

[Figure: cumulative read and write activity vs. execution time (0–450 sec.) for the original application and its strace-based replay; series: Application Read, Application Write, Replay Read, Replay Write]

                                 Read (sec.)   Write (sec.)   Elapsed (sec.)
    Uninstrumented                    —             —              324
    Instrumented application        41.8          11.8             402
    Replay                          37.0          11.8             390

Slide 9

For accurate recording, gather I/O calls using binary instrumentation

  • Can intercept and instrument specific system-level calls
  • Overhead of instrumentation is paid at program startup
  • Existing Jockey library works well for x86_32, but not ported to other platforms
  • Replay can be portable, though

Uninstrumented:

    mov  $0x4, %eax       # __NR_write
    mov  4(%esp), %ebx
    int  $0x80
    ret

Instrumented:

    mov  $0x4, %eax
    mov  4(%esp), %ebx
    jmp  trampoline       # replaces the int $0x80
    nop
    ret

    trampoline:
        save current state
        call my write function
        restore state
        jmp back to original code
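For illustration, the “my write function” the trampoline calls might look like the following sketch (invented names; not the actual Pianola/Jockey code):

    /* Hypothetical capture hook called from the trampoline in place of the
       original int $0x80 (illustration only). */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    ssize_t traced_write(int fd, const void *buf, size_t count)
    {
        /* Timestamp the event so the script can reproduce inter-event delays. */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        fprintf(stderr, "%ld.%09ld write fd=%d bytes=%zu\n",
                (long)ts.tv_sec, ts.tv_nsec, fd, count);

        /* Issue the real write through the raw syscall interface. */
        return syscall(SYS_write, fd, buf, count);
    }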

Slide 10

Issues for accurate replay

  • Replay engine must be able to read and parse events quickly
  • Reading the script must not interfere significantly with the I/O activities being replicated
  • Script must be portable across platforms

[Diagram: replayed I/O events (open, compute, write); accurate replay means minimizing the I/O impact of reading the script and correctly reproducing inter-event delays]

Slide 11

Accurate replay: Preparsing, compression, and buffering

  • Text-formatted output script is portable across platforms
  • Instrumentation output is parsed into binary format and compressed (~30:1)
    • Conversion done on the target platform
  • Replay engine reads and buffers script data during “compute” phases between I/O events (see the sketch below)

[Diagram: the text script is preparsed into a compressed binary form, which the replay engine buffers between the replayed I/O events (open, compute, write)]
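A minimal sketch of such a replay loop, with an invented event format and helper (not Pianola’s actual structures):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Invented binary event record, for illustration only. */
    struct event {
        uint8_t  op;        /* e.g. OP_OPEN, OP_READ, OP_WRITE, OP_CLOSE */
        int32_t  fd;
        uint64_t bytes;
        uint64_t delay_ns;  /* recorded "compute" gap before this event */
    };

    void perform_io(const struct event *ev);  /* assumed: reissues the recorded call */

    void replay(FILE *script)
    {
        struct event ev;
        while (fread(&ev, sizeof ev, 1, script) == 1) {
            /* Reproduce the recorded compute delay; a real engine would
               overlap script read-ahead and decompression with this sleep
               so that reading the script does not perturb the replayed I/O. */
            struct timespec gap = {
                (time_t)(ev.delay_ns / 1000000000ULL),
                (long)(ev.delay_ns % 1000000000ULL)
            };
            nanosleep(&gap, NULL);

            perform_io(&ev);
        }
    }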

Slide 12

Replay timing and profile match original application well

[Figure: cumulative read and write activity vs. execution time (0–400 sec.) for the original application and its replay; series: Application Read, Application Write, Replay Read, Replay Write; the curves track closely]

                                 Read (sec.)   Write (sec.)   Elapsed (sec.)
    Uninstrumented                    —             —              314
    Instrumented application        35.8          12.8             334
    Replay                          35.7          12.5             319

Slide 13

Things that didn’t help

Compressing the text script as it’s generated
  • Only 2:1 compression
  • The timing of the I/O events themselves is not what matters during the instrumentation phase

Replicating the memory footprint
  • Memory used by the application is taken from the same pool as the I/O buffer cache
  • A smaller application (like the replay engine) should therefore run faster, because more buffer space is available
  • Replicated the memory footprint by tracking brk() and mmap() calls (sketched below), but it made no difference!
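The replay-side footprint replication might have looked like this sketch (hypothetical helper, not the actual code):

    #include <stddef.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Hypothetical replay-side handler: when the script says the traced
       application grew its footprint by len bytes (via brk() or mmap()),
       reserve and touch the same amount so the replay competes with the
       OS buffer cache the way the real application did. */
    void replay_footprint_growth(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p != MAP_FAILED)
            memset(p, 1, len);  /* touch pages so they are actually resident */
    }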

Slide 14

Conclusions on script-based I/O benchmarking

  • Gathering accurate I/O traces is harder than it seems
    • Currently, no solution is both portable and efficient
  • Replay is easier, but efficiency still matters
  • Many possibilities for future work: which matter most?
    • File name transformation
    • Parallel trace and replay
    • More portable instrumentation
    • How to monitor mmap’d I/O?