

slide-1
SLIDE 1

Data Systems on Modern Hardware:

Multi-cores, Solid-State Drives, and Non-Volatile Memories

  • Prof. Manos Athanassoulis

https://midas.bu.edu/classes/CS591A1

CS 591: Data Systems Architectures

with slides developed at Harvard

slide-2
SLIDE 2

Some class logistics in light of … reality

https://forms.gle/hWHhu6YVbr4ZDicq8 (see also the Zoom chat for the link). Please respond now! Let's also try the raise-hand option in a different way:

Everyone who is connected via Ethernet (not WiFi), please raise your hand. Now, everyone who is connected via WiFi (not Ethernet), please do so.

slide-3
SLIDE 3

Compute, Memory, and Storage Hierarchy

Traditional von Neumann computer architecture:

(i) assumes the CPU is fast enough (for our applications)
(ii) assumes memory can keep up with the CPU and can hold all data

[diagram: CPU ↔ Memory]

Is this the case? For (i): applications are increasingly complex, with higher CPU demand. Is the CPU always going to be fast enough?

slide-4
SLIDE 4

Moore’s law

Often expressed as: "X doubles every 18-24 months", where X is:

  • "performance"
  • CPU clock speed
  • the number of transistors per chip

based on William Gropp’s slides

which one is it? (check Zoom chat)

slide-5
SLIDE 5

but … exponential growth!

slide-6
SLIDE 6
slide-7
SLIDE 7

Can a single CPU cope with increasing application complexity? No, because CPUs (cores) are not getting faster! ... but we are getting more and more of them (higher parallelism). Research challenges: how do we handle them? How do we program them in parallel?
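Not from the slides, but as a minimal illustration of what "parallel programming" means at the lowest level, here is a sketch in C (POSIX threads; compile with -pthread; all names are illustrative) that splits a sum over a few cores:

#include <pthread.h>
#include <stdio.h>

#define THREADS 4
#define N (1 << 24)

static int data[N];
static long partial[THREADS];

/* Each thread sums its own contiguous chunk of the array. */
static void *worker(void *arg) {
    long id = (long) arg;
    long sum = 0;
    for (long i = id * (N / THREADS); i < (id + 1) * (N / THREADS); i++)
        sum += data[i];
    partial[id] = sum;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) data[i] = 1;

    pthread_t tid[THREADS];
    for (long t = 0; t < THREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *) t);

    long total = 0;
    for (long t = 0; t < THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("total = %ld\n", total);  /* prints N */
    return 0;
}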

slide-8
SLIDE 8

Compute, Memory, and Storage Hierarchy

Traditional von Neumann computer architecture:

(i) assumes the CPU is fast enough (for our applications)
(ii) assumes memory can keep up with the CPU and can hold all data

[diagram: CPU ↔ Memory]

Is this the case? For (ii): is memory faster than the CPU (so it can deliver data in time)? Does it have enough capacity? Not always!

slide-9
SLIDE 9

Which one is faster? (check Zoom chat)

Memory Wall: as the gap grows, we need a deep memory hierarchy.

slide-10
SLIDE 10

A single level of main memory is not enough. We need a memory hierarchy.

slide-11
SLIDE 11

What is the memory hierarchy?

slide-12
SLIDE 12

[the memory/storage hierarchy, from smaller / faster / more expensive at the top to bigger / cheaper / slower at the bottom]

  L1 cache: <1 ns
  L2 cache: ~3 ns
  L3 cache: ~10 ns
  Main memory: ~100 ns
  SSD (Flash): ~100 μs
  HDD / Shingled HDD: ~2 ms

slide-13
SLIDE 13

Access Granularity

slide-14
SLIDE 14

[the same hierarchy, annotated with access granularities]

  Caches and main memory (<1 ns / ~3 ns / ~10 ns / ~100 ns): transfer unit is a block (cache line) of 64 B
  SSD (Flash, ~100 μs) and HDD / Shingled HDD (~2 ms): transfer unit is a page of ~4 KB

Bigger / cheaper / slower toward the bottom; smaller / more expensive / faster toward the top.
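For intuition, using the numbers above: a single HDD access (~2 ms) costs about as much as 20,000 main-memory accesses (~100 ns each), and every ~4 KB page brought into memory spans 4096 / 64 = 64 cache lines.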

slide-15
SLIDE 15

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 0; load 5 pages

slide-16
SLIDE 16

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 5

slide-17
SLIDE 17

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 5; send for consumption

slide-18
SLIDE 18

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 5; load the next 5 pages

slide-19
SLIDE 19

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 10

slide-20
SLIDE 20

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 10; load the next 5 pages

slide-21
SLIDE 21

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 15

slide-22
SLIDE 22

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 15; load the next 5 pages

slide-23
SLIDE 23

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 20

slide-24
SLIDE 24

IO cost: Scanning a relation to select 10%

[HDD → main memory (5-page buffer)] IO#: 20; send for consumption

slide-25
SLIDE 25

What if we had an oracle (index)?

slide-26
SLIDE 26

IO cost: Scanning a relation to select 10%

[HDD with an index → main memory (5-page buffer)] IO#: 0

slide-27
SLIDE 27

IO cost: Use an index to select 10%

[HDD with an index → main memory (5-page buffer)] IO#: 0; load the index

slide-28
SLIDE 28

IO cost: Use an index to select 10%

[HDD with an index → main memory (5-page buffer)] IO#: 1

slide-29
SLIDE 29

IO cost: Use an index to select 10%

[HDD with an index → main memory (5-page buffer)] IO#: 1; load only the useful pages

slide-30
SLIDE 30

IO cost: Use an index to select 10%

[HDD with an index → main memory (5-page buffer)] IO#: 3

slide-31
SLIDE 31

What if useful data is in all pages?

slide-32
SLIDE 32

Scan or Index?

[HDD with an index → main memory (5-page buffer)] IO#: 0

slide-33
SLIDE 33

Scan or Index?

[HDD with an index → main memory (5-page buffer)] IO#: 20 with a scan vs. IO#: 21 with the index
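A minimal back-of-the-envelope sketch of this comparison in C (assuming, as in the slides, a 20-page relation, a 1-page index, and that the index costs one IO plus one IO per data page holding at least one matching tuple; the function names are made up for illustration):

#include <stdio.h>

/* IO cost of a full scan: read every page of the relation. */
static int scan_io(int num_pages) { return num_pages; }

/* IO cost via the index: load the index, then only the pages
   that contain at least one matching tuple. */
static int index_io(int index_pages, int matching_pages) {
    return index_pages + matching_pages;
}

int main(void) {
    int pages = 20, index_pages = 1;

    /* 10% selectivity, matches clustered on 2 pages (slides 26-30) */
    printf("clustered: scan = %d IOs, index = %d IOs\n",
           scan_io(pages), index_io(index_pages, 2));

    /* matches spread over all 20 pages (slides 31-33) */
    printf("spread:    scan = %d IOs, index = %d IOs\n",
           scan_io(pages), index_io(index_pages, pages));
    return 0;
}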

slide-34
SLIDE 34

Same analysis for any two memory levels!!

[same diagram, for any two adjacent levels: the slower level can be L2 / L3 / main memory / HDD and the faster level (holding the 5-page buffer) L1 / L2 / L3 / main memory] IO#: 0

slide-35
SLIDE 35

Cache Hierarchy

slide-36
SLIDE 36

HDD / Shingled HDD SSD (Flash) Main Memory L3 L2 L1 ~2ms ~100μs ~100ns ~3ns <1ns ~10ns Bigger Cheaper Slower Faster Smaller More expensive

slide-37
SLIDE 37

Cache Hierarchy

[diagram: cache hierarchy with L1, L2, and L3]

What is a core? What is a socket?

slide-38
SLIDE 38

Cache Hierarchy

Shared cache: L3 (or LLC, the Last-Level Cache)

  • L3 is physically distributed across multiple sockets
  • L2 is physically distributed across every core of every socket
  • Each core has its own private L1 & L2 cache
  • All levels need to be kept coherent*

[diagram: one socket with cores 0-3, each with a private L1 and L2, all sharing one L3]
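Coherence is maintained at cache-line granularity (64 B, from the earlier slide), so two cores that update different variables sitting on the same line keep invalidating each other's copy. A minimal sketch of the usual workaround (padding each per-core counter to its own line; this example is not from the slides):

#include <stdalign.h>  /* alignas, C11 */
#include <stdio.h>

/* One counter per core, padded to a full 64 B cache line so that
   updates from different cores touch different lines. */
struct padded_counter {
    alignas(64) long value;
};

int main(void) {
    struct padded_counter counters[4];
    counters[0].value = 0;
    printf("%zu bytes per counter\n", sizeof counters[0]);  /* 64, not 8 */
    return 0;
}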

slide-39
SLIDE 39

Core 0 reads fastest when the data is in its L1. If the data does not fit there, it goes to L2, and then to L3. Can we control where data is placed? We would like to avoid going to L2 and L3 altogether, but at the very least we want to avoid going to a remote L2 or L3. And remember: this is only one socket, and we have multiple of those!

Non Uniform Memory Access (NUMA)

[diagram: one socket with cores 0-3 (private L1/L2 per core) and a shared L3]

slide-40
SLIDE 40

Non Uniform Memory Access (NUMA)

[diagram: two sockets, each with cores 0-3 (private L1/L2 per core), a shared L3, and main memory]

slide-41
SLIDE 41

Non Uniform Memory Access (NUMA)

[same two-socket diagram] Cache hit!

slide-42
SLIDE 42

Non Uniform Memory Access (NUMA)

[same two-socket diagram] Cache miss! Cache hit!

slide-43
SLIDE 43

Non Uniform Memory Access (NUMA)

[same two-socket diagram] Cache miss! Cache hit! Cache miss!

slide-44
SLIDE 44

Non Uniform Memory Access (NUMA)

[same two-socket diagram] Cache miss! LLC miss! Cache miss!

slide-45
SLIDE 45

Non Uniform Memory Access (NUMA)

[same two-socket diagram] Cache miss! NUMA access! Cache miss!
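To make the NUMA access above concrete, here is a small sketch (assuming Linux with libnuma installed; compile with -lnuma -pthread; illustrative only) that pins a thread to a core and allocates memory either on the local or on a remote NUMA node; touching the remote buffer is the "NUMA access" of this slide:

#define _GNU_SOURCE
#include <numa.h>       /* libnuma */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    if (numa_available() < 0) { printf("no NUMA support\n"); return 1; }

    /* Pin this thread to core 0. */
    cpu_set_t set; CPU_ZERO(&set); CPU_SET(0, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    int local  = numa_node_of_cpu(0);                 /* node that owns core 0 */
    int remote = (local + 1) % (numa_max_node() + 1); /* some other node, if any */

    size_t sz = 64UL * 1024 * 1024;
    char *near_buf = numa_alloc_onnode(sz, local);    /* local memory  */
    char *far_buf  = numa_alloc_onnode(sz, remote);   /* remote memory */

    memset(near_buf, 0, sz);  /* stays within the socket */
    memset(far_buf,  0, sz);  /* crosses the socket interconnect */

    numa_free(near_buf, sz);
    numa_free(far_buf, sz);
    return 0;
}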

slide-46
SLIDE 46

Why knowing the cache hierarchy matters

#include <stdlib.h>   /* malloc, free */

// Create arrays from 1 KB up to 2 GB and run a large, fixed number of accesses on each
size_t arraySize;
for (arraySize = 1024/sizeof(int); arraySize <= 2UL*1024*1024*1024/sizeof(int); arraySize *= 2) {
    int steps = 64 * 1024 * 1024;                          // arbitrary number of steps
    int *array = (int*) malloc(sizeof(int) * arraySize);   // allocate the array
    size_t lengthMod = arraySize - 1;
    // Time this loop for every arraySize
    int i;
    for (i = 0; i < steps; i++) {
        // stride of 16 ints = 64 B = one cache line;
        // (x & lengthMod) equals (x % arraySize) because arraySize is a power of two
        array[(i * 16) & lengthMod]++;
    }
    free(array);
}

[plot: time per access vs. array size; the latency jumps past 256 KB and past 16 MB]

This machine has 256 KB of L2 per core and 16 MB of L3 per socket; beyond that, accesses go to main memory (NUMA!).
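The comment in the code says to time the inner loop for every arraySize; one way to do that (a sketch, not part of the original code) is with the POSIX clock_gettime:

#include <time.h>

/* Returns a monotonic timestamp in nanoseconds. */
static double now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e9 + ts.tv_nsec;
}

/* Usage inside the loop above:
     double t0 = now_ns();
     ... the inner loop over `steps` accesses ...
     double ns_per_access = (now_ns() - t0) / steps;        */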

slide-47
SLIDE 47

Storage Hierarchy

slide-48
SLIDE 48

Why not just stay in memory?

slide-49
SLIDE 49

[comparison: memory vs. flash vs. HDD]

Cost! What else?

slide-50
SLIDE 50

Storage Hierarchy

Why not stay in memory? Rephrased: what is missing from the memory hierarchy?

  • Durability (data survives across restarts)
  • Capacity (enough capacity for data-intensive applications)

slide-51
SLIDE 51

Storage Hierarchy

[storage hierarchy: main memory → SSD (flash) → HDD → shingled disks → tape]

slide-52
SLIDE 52

Storage Hierarchy

[storage hierarchy: main memory → SSD (flash) → HDD → shingled disks → tape]

slide-53
SLIDE 53

Hard Disk Drives

Secondary durable storage that supports both random and sequential access.

  • Data is organized in pages/blocks (across tracks)
  • Multiple tracks create an (imaginary) cylinder
  • Disk access time: seek latency + rotational delay + transfer time = (0.5-2 ms) + (0.5-3 ms) + (<0.1 ms per 4 KB)
  • Sequential access >> random access (~10x)
  • Goal: avoid random access

slide-54
SLIDE 54

Seek time + rotational delay + transfer time

  • Seek time: the head moves to the right track. Short seeks are dominated by "settle" time (this holds for seek distances D on the order of hundreds of tracks or more).
  • Rotational delay: the platter rotates until the right sector is under the head. What is the min/max/average rotational delay for a 10,000 RPM disk? (See the worked answer below.)
  • Transfer time: <0.1 ms per page → more than 100 MB/s
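A quick worked answer to the question above: at 10,000 RPM the platter makes 10,000 / 60 ≈ 167 rotations per second, i.e., one rotation every ~6 ms. So the minimum rotational delay is ~0 ms (the sector is already under the head), the maximum is ~6 ms (a full rotation), and the average is half a rotation, ~3 ms.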

slide-55
SLIDE 55

Sequential vs. Random Access

Bandwidth for sequential access (assuming ~0.04 ms to transfer a 4 KB page): 4 KB / 0.04 ms → ~100 MB/s

Bandwidth for random access (4 KB): 0.5 ms (seek) + 1 ms (rotational delay) + 0.04 ms (transfer) = 1.54 ms per page; 4 KB / 1.54 ms → ~2.5 MB/s
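The same arithmetic as a small C sketch (the latency values are the assumptions stated on this slide):

#include <stdio.h>

/* Effective bandwidth in MB/s for page-sized transfers, given the total time per page in ms. */
static double bandwidth_mb_s(double page_kb, double ms_per_page) {
    return (page_kb / 1024.0) / (ms_per_page / 1000.0);
}

int main(void) {
    double transfer = 0.04;             /* ms to transfer one 4 KB page */
    double seek = 0.5, rotation = 1.0;  /* ms spent per random access   */

    printf("sequential: %.1f MB/s\n", bandwidth_mb_s(4, transfer));
    printf("random:     %.1f MB/s\n", bandwidth_mb_s(4, seek + rotation + transfer));
    return 0;
}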

slide-56
SLIDE 56

Flash

Secondary durable storage that supports both random and sequential access.

  • Data is organized in pages (similar to disks), which are further grouped into erase blocks
  • Main advantage over disks: random reads are now much more efficient
  • BUT: random writes are slow!
  • Goal: avoid random writes

slide-57
SLIDE 57

The internals of flash

  • interconnected flash chips
  • no mechanical limitations
  • maintains a block API, layout-compatible with disks
  • internal parallelism for both reads and writes
  • complex software driver

slide-58
SLIDE 58

Flash access time

… depends on:

  • device organization (internal parallelism)
  • software efficiency (driver)
  • bandwidth of the flash packages
  • the Flash Translation Layer (FTL), a complex device driver (firmware) which tunes performance and device lifetime

slide-59
SLIDE 59

[spectrum: high-performance, expensive memory ↔ low-performance, cheap memory]

Flash vs HDD

HDD

✓ Large - cheap capacity ✗ Inefficient random reads

Flash

✗ Small - expensive capacity ✓ Very efficient random reads ✗ Read/Write Asymmetry

slide-60
SLIDE 60

Storage Hierarchy

[storage hierarchy: main memory → flash → HDD → shingled disks → tape]

slide-61
SLIDE 61

Tapes

Data size grows exponentially! For cheaper capacity: increase density (bits/in²), build simpler devices.

Tapes: a magnetic medium that allows

  • only sequential access

(yes, like an old-school tape)!

slide-62
SLIDE 62

Increasing disk density

  • It becomes very difficult to differentiate between tracks, and "settle" time becomes more significant
  • Writing a track affects neighboring tracks
  • Create different readers/writers and interleave the written tracks

slide-63
SLIDE 63

Memory & Storage Walls


slide-64
SLIDE 64

Memory Wall


slide-65
SLIDE 65

Memory Wall


every byte counts

slide-66
SLIDE 66

Storage Wall

HDD: ✓ capacity ✓ sequential access ✗ random access ✗ latency plateaus


slide-67
SLIDE 67

Evolution of hard disks


[chart: evolution of hard disks, 1950-2020, log scale]

  Speed: 1.2K RPM → 7.2K-15K RPM
  Latency: 250 ms → 4-8 ms
  Price: $15.3M/GB → $0.08/GB
  Density: 0.1 GB/in² → 800 GB/in² (8000x denser)
  Time for a full read: 8 s → 2 h

disks become larger but, relatively, slower

slide-68
SLIDE 68

Storage Wall

HDD: ✓ capacity ✓ sequential access ✗ random access ✗ latency plateaus
SSD (Single-Level Cell): ✓ random reads ✓ low latency ✗ capacity ✗ endurance ✗ read/write asymmetry
SSD (Multi-Level Cell): ✓ capacity ✗ endurance (worse)


slide-69
SLIDE 69

“Tape is Dead, Disk is Tape, Flash is Disk”

  • Storage without mechanical limitations
  • Several technologies:
      • Flash
      • Phase Change Memory (IBM)
      • Memristor (HP)


flash vs. hard disks?

[Jim Gray 2007]

slide-70
SLIDE 70

Flash vs. HDD


[chart, log scale: Flash vs. HDD]

  Performance: ~100 IO/s (HDD) vs. ~100K IO/s (flash)
  Price: ~$0.1/GB (HDD) vs. ~$2/GB (flash)
  Endurance: ~10K writes (flash)
  Idle power: 5.5 W (HDD) vs. 0.85 W (flash)
  Max power: 8.5 W (HDD) vs. 1.21 W (flash)

flash = opportunity: read, update, storage tradeoffs

slide-71
SLIDE 71

Storage Wall

HDD: ✓ capacity ✓ sequential access ✗ random access ✗ latency plateaus
SSD (Single-Level Cell): ✓ random reads ✓ low latency ✗ capacity ✗ endurance ✗ read/write asymmetry
SSD (Multi-Level Cell): ✓ capacity ✗ endurance (worse)
HDD (Shingled Magnetic Recording): ✓ capacity ✗ read/write asymmetry


every byte counts

slide-72
SLIDE 72

Technology Trends & Research Challenges

(1) From fast single cores to increased parallelism
(2) From slow storage to efficient random reads
(3) From infinite endurance to limited endurance
(4) From symmetric to asymmetric read/write performance

slide-73
SLIDE 73

Technology Trends & Research Challenges

  • How to exploit increasing parallelism (in compute and storage)?
  • How to redesign systems for efficient random reads? (e.g., no need to aggressively minimize index height!)
  • How to reduce write amplification (physical writes per logical write)?
  • How to write algorithms for asymmetric storage?

slide-74
SLIDE 74


[chart: latency (μs), log scale, for HDD, SSD, and NVMe; device (H/W) latency is roughly 4000 μs for HDD, 50 μs for SSD, and 17 μs for NVMe, while the OS & file-system (S/W) latency stays at roughly 17 μs in every case]

Even faster devices are available (NVMe). How do we use them when the software stack is too slow?