LITE Kernel RDMA
Support for Datacenter Applications
Shin-Yeh Tsai, Yiying Zhang
Time (layers shown: Userspace, Kernel, Hardware)
1983: Berkeley Socket
1995: U-Net
2000s: TCP Offload engine
2014: Arrakis, mTCP, IX
RDMA in HPC
2017: RDMA in Datacenters?
– Low latency
– High throughput
– Low CPU utilization
[Figure: RDMA, with Memory, CPU, User, and Kernel labeled]
RSI [VLDB ’16]
DrTM+R [EuroSys ’16]
FaRM [NSDI ’14]
FaRM+Xact [SOSP ’15]
HERD [SIGCOMM ’14]
HERD-RPC [ATC ’16]
FaSST [OSDI ’16]
Octopus [ATC ’17]
Pilaf [ATC ’13]
Hotpot [SoCC ’17]
Wukong [OSDI ’16]
APUS [SoCC ’17]
DrTM [SOSP ’15]
NAM-DB [VLDB ’17]
Mojim [ASPLOS ’15]
Cell [ATC ’16]
Things have worked well in HPC
[Figure: native RDMA with Kernel Bypassing]
User Space: User-Level RDMA App and Library (Conn Mgmt, Mem Mgmt), managing Connections, Queues, Keys, Memory space
Kernel Space: OS (bypassed)
Hardware: RNIC with Permission check, Address mapping, Cached PTEs, lkey 1 … lkey n, rkey 1 … rkey n
Requests from user space to the RNIC: send, recv, with node, lkey, rkey, addr
Abstraction Mismatch
Developers want (Socket): High-level, Easy to use, Resource share, Isolation
RDMA: Low-level, Difficult to use, Difficult to share
Things have worked well in HPC
What about datacenters?
[Figure: Requests/us vs. Total Size (MB), from 1 to 1024, for Write-64B and Write-1K; the On-NIC SRAM region is marked]
Things have been good in HPC
What about datacenters?
Fat applications
No resource sharing
High-level abstraction
Protection
Resource sharing
Performance isolation
Butler Lampson
LITE
User Space: User-Level RDMA App, calling the LITE APIs (Memory APIs, RPC/Msg APIs, Sync APIs)
Kernel Space: LITE manages Connections, Queues, Keys, Memory space, performs Permission check and Address mapping, and reaches the RNIC through RDMA Verbs
Hardware: RNIC holds a Global lkey and a Global rkey in place of lkey 1 … lkey n, rkey 1 … rkey n, and Cached PTEs
Butler Lampson David Wheeler
1. Indirection only at the local node for one-sided RDMA
2. Avoid hardware indirection
3. Hide kernel cost
[Figure: data paths of Berkeley Socket, RDMA, and LITE across Memory, CPU, User, and Kernel]
[Figure: LITE (Kernel Space) and RNIC (Hardware), with Address mapping and Permission check]
[Figure: LITE overview]
Applications: User-Level App, Kernel App, User-Level RPC Function
LITE Abstraction, above the Verbs Abstraction and the OS RNIC Driver:
LITE APIs: mem, RPC, msging, synch, mgmt
LITE 1-Side RDMA: lh1, lh2 and addr1, addr2; Permission check; Address mapping
LITE RPC: RPC Client, RPC Server; send, poll, recv; Connections, Queues; RDMA Buffer Mgmt
Mgmt
RNIC: global lkey, global rkey
Perform address mapping and protection in kernel
[Figure: OS, LITE, and RNIC. LITE holds Connections, Queues, Keys, Memory space and performs Permission check and Address mapping. The RNIC's Cached PTEs and lkey 1 … lkey n, rkey 1 … rkey n are replaced by a Global lkey and a Global rkey.]
Challenge: How to eliminate hardware indirection without changing hardware?
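The slides answer this challenge with a Global lkey and a Global rkey on the RNIC, while address mapping and protection move into the kernel. The toy model below is only a sketch of that idea, not LITE's code: the class names, the 4096-byte page size, and the rule of one cached PTE per page are all assumptions made for illustration. It counts how much per-region state each scheme would place on the NIC.

```python
import math

PAGE_SIZE = 4096  # assumed page size for this toy model


class NativeRnicState:
    """Native RDMA (hypothetical model): each registered region adds NIC state."""

    def __init__(self):
        self.keys = 0         # lkey 1..n and rkey 1..n
        self.cached_ptes = 0  # page-table entries the NIC would want to cache

    def register(self, length):
        self.keys += 2  # one lkey and one rkey per region
        self.cached_ptes += math.ceil(length / PAGE_SIZE)

    def entries(self):
        return self.keys + self.cached_ptes


class GlobalKeyState:
    """Global-key scheme (hypothetical model): region table lives in host memory."""

    NIC_ENTRIES = 2  # global lkey + global rkey

    def __init__(self):
        self.kernel_table = {}  # handle -> (phys_base, length), kept by the kernel
        self._next = 1

    def register(self, phys_base, length):
        handle = self._next
        self._next += 1
        self.kernel_table[handle] = (phys_base, length)
        return handle

    def translate(self, handle, offset, size):
        """Bounds-check in software and return the physical address to use."""
        phys_base, length = self.kernel_table[handle]
        if offset < 0 or size < 0 or offset + size > length:
            raise ValueError("access outside the registered region")
        return phys_base + offset

    def nic_entries(self):
        return self.NIC_ENTRIES
```

In this model, NIC-resident state grows with the number and size of regions under the native scheme, and stays constant under the global-key scheme, where the growing table sits in host memory.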
[Figure: a Userspace application holds an lh that names an LMR. LITE performs Permission check and QoS and looks up the (Node, Phy Addr) rows for the lh, here (1, 0x45) and (4, 0x27). The request, with its Offset, then crosses the Network to the Remote nodes: Node 1 at 0x45 and Node 4 at 0x27.]
LITE_read(lh, offset, size)
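The lookup in this slide can be sketched as a small simulation. This is a minimal illustration, not LITE's implementation: the `LiteSketch` class, the `map_lmr` helper, the permission flags, and the chunk sizes are hypothetical, and it assumes the two (Node, Phy Addr) rows are two pieces of a single LMR. QoS is not modeled.

```python
from dataclasses import dataclass

PERM_READ, PERM_WRITE = 1, 2  # hypothetical permission bits


@dataclass(frozen=True)
class Chunk:
    node: int       # node that holds this piece of the LMR
    phys_addr: int  # physical address of the piece on that node
    size: int       # length of the piece in bytes


class LiteSketch:
    """Toy stand-in for the kernel-side lh table; not LITE's real code."""

    def __init__(self):
        self.memory = {}   # (node, phys_addr) -> bytearray, simulated remote memory
        self.handles = {}  # lh -> (permissions, [Chunk, ...])
        self.next_lh = 1

    def map_lmr(self, chunks, perms):
        for c in chunks:
            self.memory.setdefault((c.node, c.phys_addr), bytearray(c.size))
        lh = self.next_lh
        self.next_lh += 1
        self.handles[lh] = (perms, list(chunks))
        return lh

    def _locate(self, lh, offset, size, need):
        perms, chunks = self.handles[lh]
        if not perms & need:  # permission check happens before any transfer
            raise PermissionError("lh lacks the required permission")
        total = sum(c.size for c in chunks)
        if offset < 0 or size < 0 or offset + size > total:
            raise ValueError("access outside the LMR")
        pieces, base = [], 0
        for c in chunks:  # address mapping: LMR offset -> (chunk, offset in chunk)
            lo = max(offset, base)
            hi = min(offset + size, base + c.size)
            if lo < hi:
                pieces.append((c, lo - base, hi - lo))
            base += c.size
        return pieces

    def lite_write(self, lh, offset, data):
        pos = 0
        for c, off, n in self._locate(lh, offset, len(data), PERM_WRITE):
            self.memory[(c.node, c.phys_addr)][off:off + n] = data[pos:pos + n]
            pos += n

    def lite_read(self, lh, offset, size):
        out = bytearray()
        for c, off, n in self._locate(lh, offset, size, PERM_READ):
            out += self.memory[(c.node, c.phys_addr)][off:off + n]
        return bytes(out)
```

The application only ever names `lh`, `offset`, and `size`; node identifiers and physical addresses stay inside the lookup table, which is the indirection the slide illustrates.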
[Figure: Requests/us vs. Total Size (MB), from 1 to 1024, for Write-64B, LITE_write-64B, Write-1K, and LITE_write-1K]
[Figure: Latency (us) vs. Request Size (B), from 8 to 32K; series labeled kernel space and user space]
– Low latency
– Low memory utilization
– Low CPU utilization
Application        LOC     LOC using LITE   Student Days
LITE-Log           330     36               1
LITE-MapReduce     600*    49               4
LITE-Graph         1400    20               7
LITE-Kernel-DSM    3000    45               26
LITE-Graph-DSM     1300                     5
* LITE-MapReduce is ported from the 3000-LOC Phoenix, with 600 lines changed or added
[Figure: Runtime (sec) for Hadoop, Phoenix, and LITE; configurations: Phoenix, 2-node, 4-node, 8-node]
[1]: Ranger et al., “Evaluating MapReduce for Multi-core and Multiprocessor Systems” (HPCA ’07)
[Figure: Runtime (sec) for LITE-Graph, Grappa, and PowerGraph at 4 nodes x 4 threads and 7 nodes x 4 threads]
Get LITE at: https://github.com/Wuklab/LITE
wuklab.io