SLIDE 1

Need for a Deeper Cross-Layer Optimization for Dense NAND SSD to Improve Read Performance of Big Data Applications: A Case for Melded Pages

Arpith K, Indian Institute of Science, Bangalore
K. Gopinath, Indian Institute of Science, Bangalore
SLIDE 2

Organization of a Flash Package

  • Die: smallest unit that can independently execute commands.
  • Plane: smallest unit that can serve an I/O request in parallel.
  • Block: smallest unit that can be erased.
  • Page: smallest unit that can be read or programmed.
  • Cell: stores the data (one or more bits, depending on the cell type).

SLIDE 3

Floating Gate Transistors

The presence of electrons in the floating gate increases the threshold voltage of the cell.

SLIDE 4

[Figure: threshold-voltage probability density of a single-bit cell, showing STATE 1 and STATE 0 within the threshold window]

SLIDE 5

Reads

The number of threshold voltage states determines how many bits a transistor can store.

[Figure: threshold-voltage state distributions for MLC and TLC cells]

SLIDE 6

Reads (TLC)

Reference voltages needed to read each page type:

  • LSB: V3
  • CSB: V1, V5
  • MSB: V0, V2, V4, V6

SLIDE 7

Organization of Transistors in a Block

Page: smallest unit that can be read or programmed.

SLIDE 8

Organization of Transistors in a Block

[Figure: each cell along a wordline stores an MSB, CSB, and LSB; the bits of the same significance across the wordline form the LSB page, CSB page, and MSB page]

SLIDE 9

Read Latency for TLC

Page       Latency (µs)
LSB Page   58
CSB Page   78
MSB Page   107

SLIDE 10

Sources of Read Overheads

  • Address translation
  • Accessing the wordline
  • Setting up the block that contains the requested data
  • Post-processing operations (such as detecting and correcting bit errors)

[Figure: a die with a block decoder selecting among Block 0 .. Block n-1, and a page decoder selecting the wordline]

SLIDE 11

Block Setup

[Figure: the selected wordline is driven at Vread while every other wordline in the block is driven at Vpass]

SLIDE 12

Sources of Read Overheads

  • Address translation
  • Accessing the wordline
  • Setting up the block that contains the requested data
  • Post-processing operations (such as detecting and correcting bit errors)

[Figure: same die diagram as Slide 10, with block decoder and page decoder]

SLIDE 13

Reads

Page       Cost       Latency (µs)
LSB Page   X + Y      58
CSB Page   X + 2Y     78
MSB Page   X + 4Y     107

X → Overhead. Includes the time to address a wordline, apply the pass-through voltage (to the other wordlines in that block), and post-process the data.

Y → Time required to apply one read reference voltage and sense the cell's conductivity.
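Under the model above, X and Y can be estimated from the measured page latencies. A rough sketch (the two parameters are fit from the LSB and CSB rows, so the MSB prediction only approximately matches the measured value):

```python
# Estimate overhead X and per-sense time Y from the measured TLC page
# latencies: LSB = X + Y, CSB = X + 2Y, MSB = X + 4Y (times in us).
lsb, csb, msb = 58, 78, 107

y = csb - lsb   # one extra reference voltage costs 20 us
x = lsb - y     # fixed per-command overhead: 38 us

print(f"X = {x} us, Y = {y} us")
print(f"Predicted MSB latency: {x + 4 * y} us (measured: {msb} us)")
```

The prediction (118 µs) overshoots the measured 107 µs, so the linear model should be read as an approximation rather than an exact timing formula.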

SLIDE 14

Melded Pages

[Figure: the LSB, CSB, and MSB pages of a wordline combined into a single melded page]

The total time to read all three pages reduces from (3X + 7Y) to (X + 7Y).

Page       Latency (µs)   Melded-page latency (µs)
LSB Page   58             166 (LSB + CSB + MSB in one read)
CSB Page   78
MSB Page   107
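The saving can be sanity-checked with rough fits of X ≈ 38 µs and Y ≈ 20 µs (derived from the LSB and CSB latencies on the previous slide; the model is approximate, so it lands near, not exactly on, the 166 µs reported above):

```python
# Reading LSB, CSB, MSB separately pays the fixed overhead X three times;
# a melded read pays it once. X and Y are rough fits (38 us, 20 us).
x, y = 38, 20

separate = 3 * x + 7 * y   # three read commands
melded = x + 7 * y         # one melded-page read

print(f"separate: {separate} us, melded: {melded} us, "
      f"saved: {separate - melded} us")
```

The model gives 254 µs vs 178 µs, in the same ballpark as the measured 243 µs (58 + 78 + 107) vs 166 µs.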

SLIDE 15

Melded Pages

[Figure: LSB, CSB, and MSB pages of a wordline forming one melded page]

Schedule the writes in such a way that, later, while reading, requests for the data in the LSB, CSB, and MSB pages are all present in the read request queue together.

SLIDE 16

Scheduling of Writes

[Figure: animation. Two 12 KB write requests, Req0 and Req1, are split into 4 KB chunks in the write request queue; the chunks are placed into the LSB, CSB, and MSB pages of wordlines WL 0, WL 1, and WL 2 of a block so that each request's chunks can later be read as one melded page]

SLIDES 17-36

[Further frames of the Slide 16 animation: the 4 KB chunks move one at a time from the write request queue into the LSB, CSB, and MSB pages of wordlines WL 0, WL 1, and WL 2]
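The placement policy in the animation can be sketched as follows. This is an illustrative toy (function and variable names are invented, not from the paper): requests are split into page-sized chunks, and consecutive chunks of one request fill the LSB, CSB, and MSB pages of a single wordline, so that the whole request is later readable as one melded page.

```python
# Toy write scheduler for melded pages (names hypothetical).
# A 12 KB request splits into three 4 KB chunks, which exactly fill the
# LSB, CSB, and MSB pages of one wordline.
PAGE_KB = 4
PAGE_TYPES = ("LSB", "CSB", "MSB")  # one page of each per wordline

def schedule(requests):
    """requests: list of (name, size_kb). Returns {(wordline, page): chunk}."""
    placement = {}
    wordline, slot = 0, 0  # slot = next free page within the wordline
    for name, size_kb in requests:
        nchunks = -(-size_kb // PAGE_KB)  # ceiling division
        for chunk in range(nchunks):
            placement[(wordline, PAGE_TYPES[slot])] = f"{name}.{chunk}"
            slot += 1
            if slot == len(PAGE_TYPES):
                slot, wordline = 0, wordline + 1
    return placement

layout = schedule([("Req0", 12), ("Req1", 12)])
for key in sorted(layout):
    print(key, "->", layout[key])
```

With two 12 KB requests, Req0's chunks fill WL 0 and Req1's fill WL 1, matching the final frame of the animation.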

SLIDE 37

It is only beneficial to use melded pages when a large amount of data needs to be read.

How large is large enough?

SLIDE 38

SSD configuration:

  • Number of channels: 8
  • Number of parallel units per channel: 8
  • Total number of parallel units: 64
  • Channel operating frequency: 800 MT/s
  • Page size: 4 KB

[Figure: read requests queued across the LUNs]

SLIDE 39

Time to fulfill a read request: Normal TLC vs. Melded TLC

Read size (bytes)   Normal TLC (µs)   Melded TLC (µs)
2^12                63                183
2^13                63                183
2^14                63                183
2^15                63                183
2^16                69                183
2^17                81                200
2^18                104               218
2^19                188               270
2^20                364               401
2^21                708               636
2^22                1406              1134
2^23                2791              2103
2^24                5572              4068
2^25                11124             7971
2^26                22236             15803
2^27                44452             31440

Improvement of 41.3% for large reads.
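The break-even point can be read off the table programmatically. A small sketch using the numbers above:

```python
# Latency table from this slide (microseconds), read sizes 2^12 .. 2^27 bytes.
normal = [63, 63, 63, 63, 69, 81, 104, 188, 364, 708,
          1406, 2791, 5572, 11124, 22236, 44452]
melded = [183, 183, 183, 183, 183, 200, 218, 270, 401, 636,
          1134, 2103, 4068, 7971, 15803, 31440]
sizes = [2 ** x for x in range(12, 28)]

# First read size where the melded-page scheme is faster.
break_even = next(s for s, n, m in zip(sizes, normal, melded) if m < n)
print(f"melded pages win from 2^{break_even.bit_length() - 1} bytes")

# Throughput improvement at the largest read size.
improvement = normal[-1] / melded[-1] - 1
print(f"{improvement:.1%}")
```

Melded pages start paying off at 2^21 bytes (2 MB); at 2^27 bytes the computed improvement is about 41.4%, consistent with the 41.3% reported on the slide (presumably from unrounded data).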

SLIDE 40

[Figure: distribution of read requests across the LUNs, shown alongside the same Normal vs. Melded TLC latency table as Slide 39]

SLIDE 41

It is only beneficial to use melded pages when a large amount of data needs to be read.

Problem: the decision to use melded pages has to be made at program (write) time.

How does the scheduler know the read pattern during writes?

SLIDE 42

Directives (Hints)

The host provides hints to the scheduler when submitting the write request.

NVMe Directives support (NVMe 1.3 and above):

  • Provides the ability to exchange extra metadata in the headers of ordinary NVMe commands.
  • The proposal is to add a new directive that enables the application to declare its read patterns.
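What such a hint might look like on the host side can be sketched as follows. This is purely illustrative: the directive values and field names below are invented for this sketch and are not part of the NVMe 1.3 specification or of the proposal itself.

```python
# Hypothetical host-side sketch of tagging a write with a read-pattern hint.
# ReadPattern values and WriteCommand fields are invented for illustration;
# they are NOT real NVMe directive types.
from dataclasses import dataclass
from enum import Enum

class ReadPattern(Enum):
    UNKNOWN = 0
    SEQUENTIAL_LARGE = 1  # candidate for melded-page layout
    RANDOM_SMALL = 2      # keep the normal page mapping

@dataclass
class WriteCommand:
    lba: int
    length_kb: int
    directive: ReadPattern = ReadPattern.UNKNOWN

def submit_write(lba, length_kb, hint=ReadPattern.UNKNOWN):
    cmd = WriteCommand(lba, length_kb, hint)
    # The SSD-side scheduler would inspect cmd.directive and, for
    # SEQUENTIAL_LARGE data, lay chunks out for melded-page reads.
    return cmd

cmd = submit_write(0x1000, 24, ReadPattern.SEQUENTIAL_LARGE)
print(cmd.directive.name)
```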

SLIDE 43

Generating Hints

The host provides hints to the scheduler when submitting the write request.

These hints can be explicitly provided by the developer or generated automatically from the access history.

SLIDE 44

Hadoop Distributed File System

  • Hadoop and Spark are open-source cluster-computing frameworks for large-scale data processing.
  • The data itself is managed using HDFS.
  • HDFS is designed to store very large files across machines in a large cluster.

SLIDE 45

Hadoop Distributed File System

NameNode

  • An HDFS cluster consists of a single NameNode.
  • Manages the file-system metadata.
  • Maintains the mapping of blocks to DataNodes.

DataNodes

  • Usually one per node in the cluster.
  • Store blocks of data.

SLIDE 46

When you store a file in HDFS, the system breaks it down into individual blocks and stores these blocks on various DataNodes in the Hadoop cluster.

In HDFS, the block size is 128 MB by default.

[Figure: a 513 MB file (test.txt) split into blocks a-e of sizes 128 MB, 128 MB, 128 MB, 128 MB, and 1 MB; the NameNode maps each block to the DataNodes (0-4) holding its replicas]
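The 513 MB example above can be reproduced with a few lines; the splitting is simply repeated subtraction of the block size:

```python
# Split a file size into HDFS-style blocks (default block size: 128 MB),
# reproducing the 513 MB example on this slide.
MB = 1024 * 1024
BLOCK = 128 * MB

def split_blocks(size):
    blocks = []
    while size > 0:
        blocks.append(min(size, BLOCK))
        size -= BLOCK
    return blocks

sizes_mb = [b // MB for b in split_blocks(513 * MB)]
print(sizes_mb)  # [128, 128, 128, 128, 1]
```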

SLIDE 47

To read a file, the HDFS client first asks the NameNode for the list of DataNodes that host replicas of the file's blocks.

The client then contacts a DataNode directly and requests the transfer of the desired block.

Why such a large block size?

[Figure: the same block-to-DataNode mapping as Slide 46]

SLIDE 48

Why such a large block size?

  • Assume we need to manage 1 TB of data.
  • Number of entries in the NameNode with a 4 KB block size: 268,435,456
  • Number of entries in the NameNode with a 128 MB block size: 8,192
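The two entry counts above follow directly from the sizes (1 TB = 2^40 bytes):

```python
# NameNode bookkeeping: block entries needed for 1 TB of data
# at a 4 KB block size vs. the HDFS default of 128 MB.
TB = 2 ** 40
entries_4k = TB // (4 * 2 ** 10)      # 2^40 / 2^12 = 2^28
entries_128m = TB // (128 * 2 ** 20)  # 2^40 / 2^27 = 2^13

print(f"{entries_4k:,}")    # 268,435,456
print(f"{entries_128m:,}")  # 8,192
```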

SLIDE 49

Read throughputs of the SSD (8 channels; 8 parallel units per channel)

Page size 2 KB (melded page 6 KB):

                     400 MT/s (8 bits)  800 MT/s (8 bits)  1600 MT/s (8 bits)  1600 MT/s (16 bits)
Normal TLC (MB/s)    1440               1490               1516                1530
Melded TLC (MB/s)    2038               2141               2196                2225
Improvement          41.5%              43.6%              44.8%               45.4%

Page size 4 KB (melded page 12 KB):

                     400 MT/s (8 bits)  800 MT/s (8 bits)  1600 MT/s (8 bits)  1600 MT/s (16 bits)
Normal TLC (MB/s)    2466               2879               2980                3033
Melded TLC (MB/s)    2691               4071               4279                4391
Improvement          9.1%               41.3%              43.5%               44.7%

Page size 8 KB (melded page 24 KB):

                     400 MT/s (8 bits)  800 MT/s (8 bits)  1600 MT/s (8 bits)  1600 MT/s (16 bits)
Normal TLC (MB/s)    2697               4930               5756                5960
Melded TLC (MB/s)    2691               5364               8100                8512
Improvement          -0.2%              8.8%               40.7%               42.8%

Page size 16 KB (melded page 48 KB):

                     400 MT/s (8 bits)  800 MT/s (8 bits)  1600 MT/s (8 bits)  1600 MT/s (16 bits)
Normal TLC (MB/s)    2698               5390               9849                11507
Melded TLC (MB/s)    2688               5357               10641               16060
Improvement          -0.4%              -0.6%              8.0%                39.5%

SLIDE 50

Read throughputs of the SSD (16 channels; 4 parallel units per channel)

Page size 2 KB (melded page 6 KB):

                     400 MT/s (8 bits)  800 MT/s (8 bits)  1600 MT/s (8 bits)  1600 MT/s (16 bits)
Normal TLC (MB/s)    1440               1490               1516                1530
Melded TLC (MB/s)    2040               2141               2196                2225
Improvement          41.6%              43.6%              44.8%               45.4%

Page size 4 KB (melded page 12 KB):

                     400 MT/s (8 bits)  800 MT/s (8 bits)  1600 MT/s (8 bits)  1600 MT/s (16 bits)
Normal TLC (MB/s)    2699               2880               2981                3033
Melded TLC (MB/s)    3721               4078               4282                4393
Improvement          37.8%              41.5%              43.6%               44.8%

Page size 8 KB (melded page 24 KB):

                     400 MT/s (8 bits)  800 MT/s (8 bits)  1600 MT/s (8 bits)  1600 MT/s (16 bits)
Normal TLC (MB/s)    4624               5398               5762                5963
Melded TLC (MB/s)    5357               7401               8109                8516
Improvement          15.8%              37.1%              40.7%               42.8%

Page size 16 KB (melded page 48 KB):

                     400 MT/s (8 bits)  800 MT/s (8 bits)  1600 MT/s (8 bits)  1600 MT/s (16 bits)
Normal TLC (MB/s)    5390               9241               10794               11531
Melded TLC (MB/s)    5357               10641              14715               16166
Improvement          -0.6%              15.1%              36.3%               40.1%

SLIDE 51

Thank You

Contact information of the authors:

  • arpith@iisc.ac.in
  • gopi@iisc.ac.in