
CAM I/O Scheduler, M. Warner Losh, Netflix, Inc., BSDCan 2015



  1. CAM I/O Scheduler M. Warner Losh Netflix, Inc. BSDCan 2015 http://people.freebsd.org/~imp/bsdcan2015/iosched-slides.pdf http://people.freebsd.org/~imp/bsdcan2015/paper.pdf

  2. Outline: Overview / Motivation; Graphs; Roadmap; Background and Context; Netflix OCA; NAND Physics and SSD; FreeBSD I/O Stack; Netflix I/O Scheduler; Recent Updates


  4. Motivations
     ◮ Big performance hit in our operations
     ◮ 20-50% drop in throughput
     ◮ devstat reporting 100% busy for SSD with tiny use
     ◮ Netflix’s QoE metrics show customer problems
     ◮ Traced to adding / deleting content
     ◮ Happened during content fill
     ◮ Work around by idling server before fill
     ◮ Sometimes an issue, other times not
     ◮ Some systems would have problems, others not
     ◮ No fixed pattern to failure
     ◮ Only affected flash caches
     ◮ Seems related to write activity

  5. Macro View

  6. Root Cause
     ◮ High write workload
     ◮ Read latencies can spike

  7. One Graph View [Figure: SSD Write Latency Cliff. X axis: Time (min), 0 to 40. Y axes: MB/s and Service Time (ms). Legend: r = Read MB/s, w = Write MB/s, s = Read Time (ms).] Houston, we have a problem

  8. Topics
     ◮ I/O Scheduler
     ◮ CAM
     ◮ I/O Stack
     ◮ SSDs
     ◮ NAND Physics
     ◮ Netflix Workload


  10. Netflix OCA: Overview of Netflix’s Open Connect Appliance (OCA)

  11. Netflix Open Connect Appliance (OCA) Source: Netflix

  12. Netflix Open Connect Appliance (OCA): Directing Clients to Caches
      • User routing is done by Netflix control servers, not dependent on client DNS configuration
      • Request is routed to the nearest available OCA; backup URLs are provided for failover
      • ISP controls client to OCA mapping/clustering/failover via BGP
      [Diagram labels: Netflix Control Servers; Broadband ISP; Open Connect Appliance (OCA); 3. Client connects to local cache; 4. Local cache delivers video stream]
      Source: Netflix

  13. Netflix Open Connect Appliance (OCA): Content Flow
      1. OCA registers with OCA control service
      2. OCA control service delivers content manifest
      3.a. Manifest Service compares manifest to what’s on disk
      3.b. Manifest Service creates script to fetch new content from origin
      3.c. Manifest Service deletes content no longer needed
      4. OCA submits new manifest to OCA CS
      5. OCA control service tells Client control service: a. OCA definition; b. Content to cache mapping
      6. Popularity service tells OCA control service about: a. Ordered list of popular downloadable URLs; b. Downloadable file size
      [Diagram labels: Popularity Service; OCA Control Service; Open Connect Appliance (OCA); Manifest Service; Client Control Service]
      Source: Netflix

  14. Typical Disk Activity (1 day)
      ◮ Lots of read traffic
      ◮ Little write traffic
      ◮ Read:write ratio of 100:1 or more
      ◮ “Fill Window” needed to avoid problem

  15. NAND Flash: A Brief Primer on Why and How NAND Flash “Presents Challenges” and How SSDs Hide That Through Lies and Deceit. Source: http://pretty-little-liars.wikia.com/wiki/Season_5

  16. NAND Geometry [Figure: NAND typical geometry. 16 chips/package, 4000 blocks/chip, 500 pages/block. Labels: Read / Write unit; Erase unit; CE; Parallel.] Pages are 4 KiB to 32 KiB (typically 16 KiB) plus OOB
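As a rough worked example of what this geometry implies, the arithmetic below uses the slide's figures and assumes the "typical" 16 KiB page size. The constant and function names are illustrative only and do not come from the talk.

```python
# Geometry figures from the slide; 16 KiB is the slide's "typical" page size.
CHIPS_PER_PACKAGE = 16
BLOCKS_PER_CHIP = 4000
PAGES_PER_BLOCK = 500
PAGE_SIZE_KIB = 16


def block_size_kib():
    """Size of one erase block: pages per block times page size."""
    return PAGES_PER_BLOCK * PAGE_SIZE_KIB


def package_capacity_kib():
    """Raw capacity of one package (ignores OOB area and spare blocks)."""
    return CHIPS_PER_PACKAGE * BLOCKS_PER_CHIP * block_size_kib()


if __name__ == "__main__":
    print(f"erase block: {block_size_kib() / 1024:.1f} MiB")
    print(f"package:     {package_capacity_kib() / 1024 ** 2:.1f} GiB")
```

Under these assumptions an erase block is about 7.8 MiB while the write unit is 16 KiB, so the erase unit is 500 times larger than the write unit. That mismatch is what forces the firmware to relocate data before erasing.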

  17. NAND Limitations
      ◮ Read page at a time
      ◮ Write page at a time
      ◮ Write pages in order
      ◮ Erase entire block
      ◮ Single Duplex, No queuing
      ◮ Low endurance (~3k P/E cycles)
      ◮ Unreliable Media (ECC / LDPC needed)
      ◮ Parallelism through Banks / CE lines
      ◮ NAND Flash and FreeBSD: https://www.youtube.com/watch?v=lj0XAE6C6-k and https://people.freebsd.org/~imp/bsdcan2014.pdf

  18. Typical SSD / NVMe [Figure: generalized block diagram of flash storage device. Blocks: Host Interconnect; Processor; DRAM Buffer; NAND Controller; four NAND Chips.]

  19. SSD / NVMe Firmware
      ◮ Flash Translation Layer (FTL)
        ◮ LBA to PA translation
        ◮ Metadata for log / NAND
      ◮ Wear Leveling
        ◮ Where to start writing at block boundary
        ◮ Which block to garbage collect
      ◮ Reliability
        ◮ Retention (data too old or read too much)
        ◮ Wear out (block too worn with high RBER)
        ◮ Program / Erase error processing
      ◮ Garbage Collection
        ◮ Moves data forward
        ◮ Extra reads and writes (Write Amplification)
        ◮ Can affect performance
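To make the link between garbage collection and write amplification concrete, here is a toy log-structured FTL. It is a minimal sketch under stated assumptions, not how any real SSD firmware works: the class `ToyFTL`, the greedy "fewest valid pages" victim choice, the tiny geometry, and the two-free-block reserve are all invented for illustration.

```python
import random


class ToyFTL:
    """Toy FTL: pages are programmed in order, whole blocks are erased.

    Assumption: the logical space stays below pages_per_block * (num_blocks - 2),
    otherwise garbage collection cannot make progress.
    """

    def __init__(self, num_blocks=16, pages_per_block=8):
        self.ppb = pages_per_block
        self.blocks = [[] for _ in range(num_blocks)]    # LBAs in program order
        self.valid = [set() for _ in range(num_blocks)]  # still-valid page indexes
        self.map = {}                                    # LBA -> (block, page)
        self.free = list(range(1, num_blocks))           # erased blocks
        self.active = 0                                  # block being filled
        self.host_writes = 0
        self.nand_writes = 0

    def _program(self, lba):
        if len(self.blocks[self.active]) == self.ppb:    # active block is full
            self.active = self.free.pop()
        old = self.map.get(lba)
        if old is not None:                              # invalidate the stale copy
            self.valid[old[0]].discard(old[1])
        page = len(self.blocks[self.active])
        self.blocks[self.active].append(lba)
        self.valid[self.active].add(page)
        self.map[lba] = (self.active, page)
        self.nand_writes += 1

    def _collect(self):
        # Greedy victim: the full block with the fewest valid pages.
        candidates = [b for b in range(len(self.blocks))
                      if b != self.active and b not in self.free]
        victim = min(candidates, key=lambda b: len(self.valid[b]))
        for page in sorted(self.valid[victim]):          # move live data forward
            self._program(self.blocks[victim][page])
        self.blocks[victim] = []                         # erase the whole block
        self.valid[victim] = set()
        self.free.append(victim)

    def write(self, lba):
        self.host_writes += 1
        self._program(lba)
        while len(self.free) < 2:                        # keep a small reserve
            self._collect()

    def write_amplification(self):
        return self.nand_writes / self.host_writes


if __name__ == "__main__":
    rng = random.Random(0)
    sequential, scattered = ToyFTL(), ToyFTL()
    for i in range(5000):
        sequential.write(i % 96)          # overwrite in the same order each pass
        scattered.write(rng.randrange(96))
    print("sequential WA:", sequential.write_amplification())
    print("random WA:    ", scattered.write_amplification())
```

In this model, overwriting in a fixed order leaves whole blocks invalid, so collection copies nothing. Scattered overwrites leave live pages in every block, and each copied page is an extra NAND write competing with host I/O.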

  20. Effects of Garbage Collection Source http://www.howtogeek.com/165542/


  22. FreeBSD I/O Stack [Figure: layered stack diagram. Labels in extraction order: System Call Interface; Active File Entries; OBJECT/VNODE; File Systems; Page Cache; Upper ↑; GEOM; Lower ↓; CAM Periph Driver; SD Card; CAM XPT; mmc bus; NVMe; NAND; CAM SIM Driver; sdhci; Newbus; Bus Space; busdma.] After Figure 7.1 in The Design and Implementation of the FreeBSD Operating System, 2015.

  23. FreeBSD I/O Stack
      ◮ Upper half of I/O Stack focus of VM system
        ◮ Buffer cache
        ◮ Memory mapped files / devices
        ◮ Loosely coupled user actions to device action
      ◮ GEOM handles partitioning, compression, encryption
        ◮ Filters data (compression, encryption)
        ◮ Muxes Many to one (partitioning)
        ◮ Muxes One to Many (striping / RAID)
        ◮ Limited Scheduling
      ◮ CAM handles queuing
        ◮ Shapes flows to device
        ◮ Limits requests to number of slots
        ◮ Enforces rules (e.g. tagged vs non-tagged)
        ◮ Multiplexes shared resources between devices

  24. CAM: Data I/O data path through CAM
      [Figure: flow diagram, reconstructed from the extracted labels]
      Submit path: schedule bio → dastrategy(bio) → bioq_disksort (enq onto bio queue / delete queue) → daschedule() → xpt_schedule() → while room in devq: dastart() (deq via bioq_first, use bio to fill in ccb) → xpt_action(ccb) → sim_action(ccb)
      Completion path: hw interrupt → sim_intr() → xpt_done(ccb) → dadone(ccb, bio) → bio_done(bio)
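The submit and completion flow on this slide can be sketched as a single queue feeding a fixed number of device slots. This is only a simplified model: the class `DefaultPath` and its method names are hypothetical stand-ins loosely named after the slide's labels, not the kernel code.

```python
from collections import deque


class DefaultPath:
    """Toy model: one bio queue, dispatched while there is room in the devq."""

    def __init__(self, devq_slots):
        self.bioq = deque()           # pending requests
        self.devq_slots = devq_slots  # how many requests the device accepts
        self.in_flight = []           # requests currently at the device

    def strategy(self, bio):
        """Roughly the dastrategy step: enqueue, then try to schedule."""
        self.bioq.append(bio)
        self.schedule()

    def schedule(self):
        """Roughly the schedule/start loop: dequeue while room in devq."""
        while self.bioq and len(self.in_flight) < self.devq_slots:
            self.in_flight.append(self.bioq.popleft())

    def done(self, bio):
        """Roughly the completion step: free a slot and refill it."""
        self.in_flight.remove(bio)
        self.schedule()


if __name__ == "__main__":
    path = DefaultPath(devq_slots=2)
    for name in ("a", "b", "c"):
        path.strategy(name)
    print(path.in_flight, list(path.bioq))
    path.done("a")
    print(path.in_flight, list(path.bioq))
```

The model makes no distinction between reads and writes. Whatever is at the head of the queue takes the next free slot.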


  26. FreeBSD Default I/O Scheduler
      ◮ No differentiation of I/O (except BIO_DELETE)
      ◮ Implements two ordering policies
        ◮ In order
        ◮ Elevator
      ◮ Implemented in the CAM PERIPH drivers
        ◮ Most flexible
        ◮ Duplicated code
        ◮ Partial lie: SIM drivers also involved (NCQ)
      ◮ Generally performs well for well behaved devices
      ◮ SSDs break the rules
        ◮ Different SSDs break different rules
        ◮ Assumed cost symmetry often not the case
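The two ordering policies can be illustrated with a small sketch. It assumes a one-way elevator that serves requests at or beyond the current position in ascending order and then wraps around. The function names and the `head` parameter are made up for the example, and this is not the kernel's implementation.

```python
def in_order(offsets):
    """In-order policy: dispatch requests exactly as they arrived."""
    return list(offsets)


def elevator_order(offsets, head):
    """One-way elevator: ascending from the head position, then wrap."""
    ahead = sorted(o for o in offsets if o >= head)
    behind = sorted(o for o in offsets if o < head)
    return ahead + behind


if __name__ == "__main__":
    pending = [50, 10, 90, 30]
    print(in_order(pending))
    print(elevator_order(pending, head=40))
```

Both policies order requests by arrival or by offset only. Neither takes into account whether a request is a read or a write, which is the cost-symmetry assumption the slide says SSDs often violate.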

  27. Another Look [Figure: SSD Write Latency Cliff. X axis: Time (min), 0 to 40. Y axes: MB/s and Service Time (ms). Legend: r = Read MB/s, w = Write MB/s, s = Read Time (ms).] Houston, we have a problem

  28. Netflix I/O Scheduler Theory
      ◮ Restricting write rate reduces write amp disturbance
      ◮ Fewer concurrent writes leaves more banks for reads
      ◮ Elevated latency OK within limits
      ◮ Lowest layer needed since PERIPH knows about device
      ◮ GEOM layer too high
      ◮ GEOM filters requests, but can’t force PERIPH

  29. I/O Scheduler Changes
      ◮ Create abstract interface to scheduler
      ◮ Convert da and ada PERIPHs to new interface
      ◮ Make sure no regressions

  30. Netflix I/O Scheduler Changes
      ◮ Separate read, write and delete queues
      ◮ Lots of statistics
      ◮ Ability to limit number of I/O in device at a time
      ◮ Adjustments needed for scheduling
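A minimal sketch of the first and third bullets: separate queues per request type plus a cap on writes in the device. The slides do not spell out the selection policy, so the "reads first, writes only under the cap, deletes last" order here is an assumption. The class `SplitQueueScheduler` and the `max_writes` parameter are likewise hypothetical.

```python
from collections import deque


class SplitQueueScheduler:
    """Toy scheduler with read/write/delete queues and a write cap."""

    def __init__(self, devq_slots, max_writes):
        self.queues = {"read": deque(), "write": deque(), "delete": deque()}
        self.devq_slots = devq_slots  # total requests the device accepts
        self.max_writes = max_writes  # assumed cap on writes in the device
        self.in_flight = []           # (kind, bio) pairs at the device

    def queue_work(self, kind, bio):
        self.queues[kind].append(bio)
        self.schedule()

    def _writes_in_flight(self):
        return sum(1 for kind, _ in self.in_flight if kind == "write")

    def next_bio(self):
        """Pick the next request; the priority order is an assumption."""
        if self.queues["read"]:
            return "read", self.queues["read"].popleft()
        if self.queues["write"] and self._writes_in_flight() < self.max_writes:
            return "write", self.queues["write"].popleft()
        if self.queues["delete"]:
            return "delete", self.queues["delete"].popleft()
        return None

    def schedule(self):
        while len(self.in_flight) < self.devq_slots:
            item = self.next_bio()
            if item is None:          # nothing eligible right now
                break
            self.in_flight.append(item)

    def done(self, kind, bio):
        self.in_flight.remove((kind, bio))
        self.schedule()


if __name__ == "__main__":
    sched = SplitQueueScheduler(devq_slots=4, max_writes=1)
    for w in ("w1", "w2", "w3"):
        sched.queue_work("write", w)
    for r in ("r1", "r2"):
        sched.queue_work("read", r)
    print(sched.in_flight, list(sched.queues["write"]))
```

With the cap at one write, a device slot can stay empty even though writes are waiting. That is the trade described in the theory slide: some added write latency in exchange for keeping more of the device available to reads.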

  31. Netflix I/O Scheduler Changes
      [Figure: flow diagram, reconstructed from the extracted labels; changes to the default scheduler are shown in red on the slide]
      Submit path: schedule bio → dastrategy(bio) → cam_iosched_queue_work (enq onto read queue / write queue / delete queue; annotates bioq_disksort) → daschedule() → xpt_schedule() → while room in devq: dastart() (deq via cam_iosched_next_bio; annotates bioq_first) → same as above
      Completion path: hw interrupt → sim_intr() → xpt_done(ccb) → dadone(ccb, bio) → bio_done(bio)
