NUMA-Aware Thread Migration for High Performance NVMM File Systems



SLIDE 1

NUMA-Aware Thread Migration for High Performance NVMM File Systems

Ying Wang, Dejun Jiang, Jin Xiong Institute of Computing Technology, CAS University of Chinese Academy of Sciences

SLIDE 2

Outline

  • Background & Motivation
  • NThread design
    – Reduce remote access
    – Reduce resource contention
    – Increase CPU cache sharing
  • Evaluation
  • Summary


SLIDE 4

Background

  • Non-Volatile Main Memories (NVMMs) provide low-latency, high-bandwidth, byte-addressable, and persistent storage
    – PCM, MRAM, RRAM, 3D XPoint [1]
  • Intel has released Optane DC Persistent Memory (Optane PMM)
  • File systems can be built directly on memory
    – Improves file system I/O performance

[Diagram: NVMM attached to the CPU via the memory bus, with the file system issuing I/O directly to NVMM]

[1] What is Intel Optane DC Persistent Memory. Intel.
[2] Data from our evaluation and from "Basic Performance Measurements of the Intel Optane DC Persistent Memory Module"

  [2]          R lat.   W lat.   R BW       W BW
  DRAM         60 ns    69 ns    20 GB/s    ~15 GB/s
  Optane PMM   305 ns   81 ns    ~6 GB/s    ~2 GB/s
  NVMe SSD     120 us   30 us    2 GB/s     500 MB/s
  HDD          10 ms    10 ms    0.1 GB/s   0.1 GB/s

SLIDE 8

Background

  • The Non-Uniform Memory Access (NUMA) architecture is widely used in data centers [1,2,3,4,5,6,7]
    – Multiple NUMA (memory) nodes
      • Each memory node contains its own CPU and memory
      • Each node can run in parallel without interference

[1] Lepers, ATC'2015 [2] Dashti, ASPLOS'2013 [3] Blagodurov, ATC'2011 [4] Tam, EuroSys'2007 [5] Yu, CS'2017 [6] Calciu, ASPLOS'2017 [7] Blagodurov, ACM Trans'2010

[Diagram: Node 0 (CPU0 + NVMM on a memory bus) and Node 1 (CPU1 + NVMM) connected by a QPI link]

SLIDE 12

Background

  • NUMA nodes share hardware resources: memory (DRAM, NVMM) and CPU
  • The I/O performance of an NVMM file system is affected by these factors

[Diagram: an NVMM file system spanning Node 0 and Node 1 over a QPI link, showing local memory access, remote memory access, and NVMM access contention]

SLIDE 16

Motivation

  • Existing NVMM file systems are not aware of NUMA
    – Remote memory access
      • File location is transparent to threads
      • Threads are randomly scheduled by the OS
      • Remote NVMM accesses increase the read latency of an NVMM file system by 65.6%

[Diagram: a thread on Node 0 performing remote memory access to file data on Node 1 through the NVMM file system and QPI link]

SLIDE 19

Motivation

  • Existing NVMM file systems are not aware of NUMA
    – Resource contention
      • Random placement of data leads to unbalanced data access among NUMA nodes
      • NVMM access contention can increase file access latency by 120.5%

[Diagram: threads on both nodes contending for the NVMM of one node under the NVMM file system]

SLIDE 22

Existing works

  • For memory applications
    – Allocating memory on the memory node where the thread runs
      • Cannot solve the problem of NVMM contention
    – Migrating threads and thread data (such as stack and heap) [1,2,3,4]
      • Reduces remote access
      • Reduces resource contention caused by unbalanced use of resources
      • Incurs a lot of data migration overhead

[1] Matthias, SBAC-PAD'14 [2] Lachaize, ATC'12 [3] Wu, Cluster'19 [4] Xu, ASPLOS'19

SLIDE 27

High data migration overhead on NVMM FS

  • NVMM has longer latency and lower bandwidth than DRAM
    – Migrating 16 KB of data takes 2.8x as long in NVMM as in DRAM
  • File systems must maintain consistency
    – Additional overhead, such as logging or journaling
  • File data is shared between threads
    – Difficult to decide which node to migrate data to
  • NVMM has low write endurance
    – Data migration reduces the lifetime of NVMM

SLIDE 32

Contribution

  • NThread: NUMA-aware thread migration for NVMM FS
    – Reduces remote access
    – Reduces resource contention
      • CPU
      • NVMM
    – Increases CPU cache sharing between threads
    – Transparent to applications

[Diagram: NThread sits between the application and the NVMM file system, which spans Node 0 and Node 1 over a QPI link]


SLIDE 37

Reduce remote access

  • How to reduce remote access
    – Write
      • Allocate new space to perform write operations
      • Write data on the node where the thread is running
    – Read
      • Count the amount each thread reads from each node
      • Migrate threads to the node holding the most data they read
  • How to avoid ping-pong migration
    – Migrate only when a thread's reads from one node exceed its reads from every other node by a threshold per period (such as 200 MB per second)

[Diagram: thread T1 reads 100 MB from Node 0 and 300 MB from Node 1, so T1 is migrated to Node 1]
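The read-counting rule above can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the `ReadTracker` class and `MIGRATE_THRESHOLD` name are assumptions, with the 200 MB/period threshold taken from the slide's example.

```python
from collections import defaultdict

MIGRATE_THRESHOLD = 200  # MB per period, as in the slide's example


class ReadTracker:
    """Per-thread read counters per NUMA node (hypothetical bookkeeping)."""

    def __init__(self, home_node):
        self.home_node = home_node
        self.read_mb = defaultdict(int)  # node id -> MB read this period

    def record_read(self, node, mb):
        self.read_mb[node] += mb

    def decide_target(self):
        """Return the node to migrate to, or None to stay put."""
        if not self.read_mb:
            return None
        best = max(self.read_mb, key=self.read_mb.get)
        if best == self.home_node:
            return None
        # Hysteresis: migrate only if the best node leads every other
        # node by the threshold, which prevents ping-pong migration.
        others = [mb for n, mb in self.read_mb.items() if n != best]
        gap = self.read_mb[best] - max(others, default=0)
        return best if gap >= MIGRATE_THRESHOLD else None

    def end_period(self):
        self.read_mb.clear()  # counters are accumulated per period
```

With the slide's numbers (100 MB from Node 0, 300 MB from Node 1), the gap is exactly 200 MB and the thread migrates to Node 1; a smaller gap keeps it in place.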


SLIDE 39

Reduce resource contention

  • Problems
    – How to find contention
    – How to reduce contention
    – How to avoid new contention

[Diagram: NVMM access contention across Node 0 and Node 1 in the NVMM file system]

SLIDE 44

Reduce NVMM contention

  • How to find contention
    – The NVMM access amount of a node exceeds a threshold while every other node's usage is less than 1/2 of that node's
    – How to define access amount?
      • Bandwidth
        – Compare the theoretical bandwidth of NVMM with its running bandwidth
        – Bandwidth = read bandwidth + write bandwidth
      • However
        – The write bandwidth of NVMM is about 1/3 of the read bandwidth
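The detection rule above (a node over a threshold while every other node sits below half of its load) can be sketched like this. A minimal sketch under assumed names; the function and its signature are illustrative, not from the paper.

```python
def find_contended_node(load_by_node, threshold):
    """Return the id of a contended node, or None.

    load_by_node: node id -> NVMM access load (e.g., bandwidth in GB/s).
    A node is contended when its load exceeds `threshold` AND every
    other node carries less than half of that node's load.
    """
    for node, load in load_by_node.items():
        if load <= threshold:
            continue  # not loaded enough to matter
        others = [l for n, l in load_by_node.items() if n != node]
        if all(l < load / 2 for l in others):
            return node
    return None
```

For example, with loads `{0: 4, 1: 1}` and a threshold of 3, Node 0 is flagged; with `{0: 4, 1: 3}` the load is balanced enough that nothing is flagged.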

SLIDE 49

Reduce NVMM contention

  • How to find contention
    – Bandwidth
      • It is inaccurate to measure NVMM access as the plain sum of read and write bandwidth
        – Read 1 GB/s + write 1 GB/s = 2 GB/s → low contention
        – Read 0 GB/s + write 2 GB/s = 2 GB/s → high contention
      • Solution
        – Weight read and write bandwidth differently
          » BW_N = BW_rN * 1/3 + BW_wN (refer to paper)

[Diagram: both mixes total 2 GB/s, but the write-only mix causes far higher contention]
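The weighting can be written out directly: since NVMM write bandwidth is roughly 1/3 of read bandwidth, a written byte costs about three times a read byte, so reads are down-weighted by 1/3. A one-line sketch of the slide's formula (the function name is ours):

```python
def weighted_bw(read_gbps, write_gbps):
    """BW_N = BW_rN * 1/3 + BW_wN from the slide."""
    return read_gbps / 3 + write_gbps

# The two 2 GB/s mixes from the slide no longer score the same:
#   read 1 + write 1 -> ~1.33 (low contention)
#   read 0 + write 2 ->  2.0  (high contention)
```

The metric ranks the write-only mix as more contended even though both total 2 GB/s, matching the slide's observation.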

SLIDE 54

Reduce NVMM contention

  • How to reduce contention
    – Access contention comes from both reads and writes
      • Read
        – Data location is fixed
      • Write
        – The node where data is written can be specified
        – But long remote write latency reduces performance by 65.5%

[Diagram: thread T1 on Node 1 performing a remote write to Node 0]

SLIDE 58

Reduce NVMM contention

  • How to reduce contention
    – Migrate threads with high write ratios to nodes with low access pressure
      • Reduces remote writes
      • Reduces NVMM contention

[Diagram: before migration, Node 0 serves all accesses (access 4, Node 1 at 0) from T1 (W:90%), T2 (W:70%), T3 (W:20%), T4 (W:10%); after migrating the write-heavy T1 and T2 to Node 1, access is balanced at 2.4 vs. 1.6, leaving 0.4 of remote read]
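Choosing which threads to move follows from the rule above: write-heavy threads benefit most from migration, because new writes follow the thread while already-placed read data stays where it is. A small illustrative sketch (the function is ours, not the paper's code):

```python
def pick_migration_candidates(threads, k):
    """Pick the k most write-heavy threads on a contended node.

    threads: list of (thread_id, write_ratio) pairs, e.g. ("T1", 0.9).
    Write-heavy threads are preferred because their future writes are
    redirected to the target node, relieving the contended node.
    """
    ranked = sorted(threads, key=lambda t: t[1], reverse=True)
    return [tid for tid, _ in ranked[:k]]
```

With the slide's threads, `pick_migration_candidates([("T1", 0.9), ("T2", 0.7), ("T3", 0.2), ("T4", 0.1)], 2)` selects T1 and T2, matching the figure.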

SLIDE 65

Reduce NVMM contention

  • How to avoid new contention
    – Migrating too many threads to a low-contention node simply moves the contention there
    – Determine the number of threads to migrate according to the current bandwidth of each node

[Diagram: Node 0 with access 4 and Node 1 with access 3, running threads T1-T7 with write ratios from 10% to 90%; migration is limited so each node approaches the average access of 3.5]
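One way to cap the number of migrated threads, consistent with the slide's "average access" figure, is to stop as soon as moving another thread would push the source node below the cross-node average. This is our guess at the policy, not the paper's algorithm; all names are illustrative.

```python
def plan_migrations(source_load, target_load, thread_loads):
    """Return how many threads to migrate from source to target.

    thread_loads: per-thread access load on the source node, already
    ordered by migration preference (e.g., descending write ratio).
    Threads are moved only while the source stays at or above the
    average load, so the target is not pushed into new contention.
    """
    average = (source_load + target_load) / 2
    moved = 0
    for load in thread_loads:
        if source_load - load < average:
            break  # moving one more would overshoot below the average
        source_load -= load
        target_load += load
        moved += 1
    return moved
```

With a fully unbalanced pair (source 4, target 0, four threads of load 1 each), two threads move and both nodes end at load 2; with the slide's 4-vs-3 case (average 3.5), no whole thread can move without overshooting.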

SLIDE 69

Reduce CPU contention

  • How to find contention
    – The CPU utilization of a node exceeds 90% and is 2x that of other nodes
  • How to reduce contention
    – Migrate threads from the NUMA node with high CPU utilization to other nodes with low CPU utilization
  • How to avoid new contention
    – Migrate a thread only if the thread's CPU utilization plus the target node's does not exceed 90%
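The three CPU rules above can be sketched directly; the thresholds (90%, 2x) come from the slide, while the function names and signatures are our own illustration.

```python
CPU_LIMIT = 0.9  # 90% utilization threshold from the slide


def cpu_contended(util_by_node, node):
    """A node is CPU-contended when its utilization exceeds 90%
    and is at least 2x that of every other node."""
    util = util_by_node[node]
    others = [u for n, u in util_by_node.items() if n != node]
    return util > CPU_LIMIT and all(util >= 2 * u for u in others)


def can_accept(target_util, thread_util):
    """Avoid creating new contention: migrate only if the target node's
    utilization plus the thread's stays at or below 90%."""
    return target_util + thread_util <= CPU_LIMIT
```

For example, a node at 95% next to one at 40% is contended, but the same node next to one at 60% is not (95% is under 2x of 60%).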


SLIDE 75

Increase CPU cache sharing

  • How to find threads that share data
    – Once a file is accessed by multiple threads, all threads accessing the file are treated as sharing data
  • How to increase CPU cache sharing
    – Reducing remote memory access: threads that access the same data end up on the same node and share its CPU cache
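The sharing rule above amounts to simple bookkeeping: any threads that touch the same file form a sharing group. A hypothetical sketch of that bookkeeping (the `SharingMap` class is our illustration, not the paper's data structure):

```python
from collections import defaultdict


class SharingMap:
    """Track which threads access which files; threads touching the
    same file are treated as data-sharing and should not be split
    across nodes, so they can keep sharing the CPU cache."""

    def __init__(self):
        self.readers = defaultdict(set)  # file id -> set of thread ids

    def record_access(self, file_id, thread_id):
        self.readers[file_id].add(thread_id)

    def sharers(self, thread_id):
        """All threads that share at least one file with thread_id."""
        out = set()
        for threads in self.readers.values():
            if thread_id in threads:
                out |= threads
        out.discard(thread_id)
        return out
```

A migration policy can then consult `sharers(tid)` and skip (or co-migrate) threads whose group would otherwise be split.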

SLIDE 79

Composing Optimizations together

  • Remote access, resource contention, and CPU cache sharing interact
    – Reducing remote access can increase CPU cache sharing
      • Threads accessing the same data run on the same node and share its CPU cache
    – Reducing resource contention may increase remote memory access and destroy CPU cache sharing
    – Reducing NVMM contention may increase CPU contention

[Diagram: the NVMM file system spanning Node 0 and Node 1 over a QPI link]

SLIDE 86

Composing Optimizations together

  • What-if analysis
    – Get information each second
      • Data access size, NVMM bandwidth, CPU utilization, and data sharing
    – Decide the initial target node
      • Reduce remote memory access
    – Decide the final target node
      • Reduce NVMM and CPU contention
      • NVMM takes priority over CPU (refer to paper)
      • Avoid migrating data-sharing threads
    – Migrate threads

[Flowchart: (1) get information → (2) decide initial target node to reduce remote access → (3) decide final target node, checking NVMM contention before CPU contention and avoiding the migration of data-sharing threads → (4) migrate threads]
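The per-second pipeline above can be sketched as one decision function per thread. This is a simplified illustration under assumed inputs and thresholds (the 0.8/0.9 cutoffs and the flat dictionaries are ours); the paper's actual what-if analysis is more involved.

```python
def what_if_step(read_mb_by_node, nvmm_load, cpu_util, shares_data,
                 nvmm_threshold=0.8, cpu_threshold=0.9):
    """Return the target node for one thread, or None to leave it alone.

    read_mb_by_node: node -> MB this thread read there last period.
    nvmm_load, cpu_util: node -> load fraction (illustrative inputs).
    shares_data: whether the thread belongs to a data-sharing group.
    """
    if shares_data:
        return None  # avoid splitting cache-sharing thread groups
    # Step 2: initial target minimizes remote access.
    target = max(read_mb_by_node, key=read_mb_by_node.get)
    # Step 3: revise for contention, NVMM before CPU.
    if nvmm_load[target] > nvmm_threshold:
        target = min(nvmm_load, key=nvmm_load.get)
    elif cpu_util[target] > cpu_threshold:
        target = min(cpu_util, key=cpu_util.get)
    return target  # step 4: caller performs the migration
```

For a thread reading mostly from Node 1, the initial pick is Node 1; if Node 1's NVMM is overloaded, the pick falls back to the least-loaded node instead.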


SLIDE 88

Evaluation

  • Platform
    – Two NUMA nodes
      • Intel Xeon 5214 CPU, 10 CPU cores per node
      • 64 GB DRAM, 128 GB Optane PMM
    – Four NUMA nodes
      • Intel Xeon 5214 CPU, 10 CPU cores per node
      • 4 GB DRAM, 12 GB emulated PMM
  • Compared systems
    – Existing FS: Ext4-dax, PMFS, NOVA
    – Modified FS: NOVA_n (a NOVA-based FS with multi-node support)

SLIDE 91

Micro-benchmark: fio

  • NThread_rl: reduce remote access only
    – Bandwidth is increased by 26.9% when the read ratio is 40%
  • NThread: reduce remote access, avoid contention, and increase CPU cache sharing
    – Bandwidth is increased by an average of 43.8%

[Chart: bandwidth (GB/s, 0-2.0) vs. read ratio (20%-80%) for ext4-dax, PMFS, NOVA, NOVA_N, NThread_rl, and NThread]

SLIDE 92

Application: RocksDB

  • NThread increases throughput by 88.6% on average when RocksDB runs on the NVMM file system

[Charts: RocksDB throughput (K ops/s) for PUT, GET, and MIX workloads on two NUMA nodes and four NUMA nodes, comparing ext4-dax, PMFS, NOVA, NOVA_n, and NThread]


SLIDE 102

Summary

  • The features of NVMM enable file systems to be built on the memory bus, improving file system performance
  • NUMA brings remote access and resource contention to NVMM FS
  • NThread is a NUMA-aware thread migration scheme
    – Migrates threads according to data access amount to reduce remote access
    – Reduces resource contention and avoids introducing new contention
    – Avoids migrating data-sharing threads to increase CPU cache sharing
    – Applies what-if analysis to decide the execution order of these optimizations
    – Increases application throughput by 88.6% on average

SLIDE 103

Thanks

Author email: wangying01@ict.ac.cn