StRoM: Smart Remote Memory



SLIDE 1

StRoM: Smart Remote Memory

David Sidler*‡, Zeke Wang†‡, Monica Chiosa‡, Amit Kulkarni‡, Gustavo Alonso‡

* Microsoft Corporation
† Collaborative Innovation Center of Artificial Intelligence, Zhejiang University
‡ Systems Group, Department of Computer Science, ETH Zürich

SLIDE 2

Increasing Compute-Bandwidth Gap

[Figure: relative speedup (log scale, 0.1 to 100000) of CPU frequency and network bandwidth, 1980 to 2020; the difference between the two curves is labeled "Compute-Bandwidth Gap"]

▪ Increase in CPU cycles allocated towards network processing
▪ Context switches between OS network stack and application amplify the issue

SLIDE 3

RDMA (Remote Direct Memory Access)

Complete hardware offload => bypasses OS and CPU

[Diagram: two nodes, each with Memory, CPU, and NIC]

▪ Distributed key-value stores [1,2]
▪ Parallel database systems
▪ Distributed graph computation [3]

[1] C. Mitchell, et al., Using One-sided RDMA Reads to build a fast, CPU-efficient key-value store, ATC’13
[2] A. Dragojevic, et al., FaRM: Fast Remote Memory, NSDI’14
[3] M. Wu, et al., GRAM: Scaling graph computation to the trillions, SoCC’15

SLIDE 4

Get over RDMA: Two-sided vs One-sided

[Diagram, shown for both variants: Client; remote node with CPU, NIC, and Remote Memory holding the Hash Table and Value Store]

Two-sided (Send/Receive)

1. Client: Send GET
2. Remote CPU:
   ▪ Read hash entry
   ▪ Compare keys
   ▪ Read value
3. Remote node: Send Value

▪ Single round trip
▪ Simple client-server model
▪ Remote CPU involved

One-sided (Direct Access)

1. Client: Read Hash Table
2. Client: Compare keys
3. Client: Read Value

▪ Remote CPU not involved
▪ At least 2 round trips (RTs) necessary
▪ Handling of misses costly

No clear winner
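
The trade-off above can be sketched as a toy simulation. This is a minimal sketch, not the authors' code: `RemoteNode`, `get_two_sided`, and `get_one_sided` are hypothetical names, remote memory is modeled with Python dictionaries, hash collisions simply overwrite the bucket, and a "round trip" is just a counter.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteNode:
    """Toy remote memory: a hash table of (key, address) entries plus a value store."""
    num_buckets: int = 8
    hash_table: dict = field(default_factory=dict)   # bucket index -> (key, address)
    value_store: dict = field(default_factory=dict)  # address -> value

    def bucket_of(self, key: str) -> int:
        # Deterministic toy hash (assumption): sum of the key's bytes.
        return sum(key.encode()) % self.num_buckets

    def put(self, key: str, value: bytes) -> None:
        address = len(self.value_store)
        self.value_store[address] = value
        self.hash_table[self.bucket_of(key)] = (key, address)


def get_two_sided(node: RemoteNode, key: str):
    """Send/Receive: the remote CPU performs the whole lookup in one round trip."""
    entry = node.hash_table.get(node.bucket_of(key))   # read hash entry
    if entry is None or entry[0] != key:                # compare keys
        return None, 1
    return node.value_store[entry[1]], 1                # read value, send it back


def get_one_sided(node: RemoteNode, key: str):
    """Direct access: the client issues the reads and compares keys itself."""
    round_trips = 0
    entry = node.hash_table.get(node.bucket_of(key))   # 1. remote read of hash entry
    round_trips += 1
    if entry is None or entry[0] != key:                # 2. compare keys on the client
        return None, round_trips
    value = node.value_store[entry[1]]                  # 3. remote read of the value
    round_trips += 1
    return value, round_trips


node = RemoteNode()
node.put("a", b"x")
print(get_two_sided(node, "a"), get_one_sided(node, "a"))
```

In this model a hit costs one round trip two-sided and two round trips one-sided; a real one-sided design would also need extra reads to resolve collisions, which the sketch leaves out.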

SLIDE 5

StRoM: Deployment of Acceleration kernels on the NIC

[Diagram: node with Memory, CPU, and NIC; StRoM kernels deployed on the NIC]

StRoM kernel:
▪ Direct access to host memory
▪ Able to receive/transmit data over RDMA

StRoM: Smart Remote Memory

SLIDE 6

GET as StRoM Kernel

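[Diagram: Client; remote node with CPU, NIC running the GET kernel, and Remote Memory holding the Hash Table and Value Store]

1. Client: RDMA RPC
2. GET kernel on the remote NIC:
   ▪ Read hash entry
   ▪ Compare keys
   ▪ Read value
3. Remote NIC: Write Value

▪ Single round trip
▪ Remote CPU not involved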
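
A rough software model of this flow is given below. It is a sketch under stated assumptions: `SimulatedNic`, `rdma_rpc`, `bucket_of`, and `get_kernel` are hypothetical names, not the StRoM interface (the real kernels run on an FPGA), and host memory is modeled as a dictionary holding a list-based hash table and a value store.

```python
class SimulatedNic:
    """Stand-in for a NIC that hosts kernels with direct access to host memory."""

    def __init__(self, host_memory):
        self.host_memory = host_memory
        self.kernels = {}

    def register_kernel(self, name, kernel):
        self.kernels[name] = kernel

    def rdma_rpc(self, name, params):
        # One round trip: the request arrives, the kernel runs on the NIC,
        # and the result is written back to the client. No host CPU code runs.
        return self.kernels[name](self.host_memory, params)


def bucket_of(key, num_buckets):
    # Deterministic toy hash (assumption): sum of the key's bytes.
    return sum(key.encode()) % num_buckets


def get_kernel(memory, key):
    table = memory["hash_table"]
    entry = table[bucket_of(key, len(table))]   # read hash entry
    if entry is None or entry[0] != key:        # compare keys
        return None
    return memory["value_store"][entry[1]]      # read value


memory = {"hash_table": [None] * 8, "value_store": [b"hello"]}
memory["hash_table"][bucket_of("k1", 8)] = ("k1", 0)

nic = SimulatedNic(memory)
nic.register_kernel("GET", get_kernel)
print(nic.rdma_rpc("GET", "k1"))
```

The point of the model is that all three lookup steps sit behind a single `rdma_rpc` call, which is where the single round trip comes from.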

SLIDE 7

Acceleration Capabilities

[Diagram, shown for each capability: node with Memory, CPU, and NIC; StRoM kernel on the NIC]

Accelerating Data Access

Invoke one-sided RPCs on the remote NIC
▪ Traversal of remote data structures
▪ Verification of data objects
▪ Manipulation of simple data structures

Accelerating Data Processing

On-the-fly data processing when transmitting/receiving
▪ Data shuffling
▪ Filtering
▪ Pattern/event detection
▪ Aggregation
▪ Compression
▪ Statistics gathering
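
To illustrate why traversal of remote data structures benefits from running on the NIC, the sketch below walks a linked list stored in a toy remote memory. The memory layout (address mapped to a `(key, value, next_address)` tuple) and both function names are assumptions made for this example.

```python
def traverse_one_sided(memory, head, key):
    """Client-driven traversal: each visited node costs one remote read (one round trip)."""
    round_trips, address = 0, head
    while address is not None:
        node_key, value, next_address = memory[address]
        round_trips += 1
        if node_key == key:
            return value, round_trips
        address = next_address
    return None, round_trips


def traversal_kernel(memory, head, key):
    """NIC-side traversal: the same loop runs next to the memory, so the client pays one round trip."""
    address = head
    while address is not None:
        node_key, value, next_address = memory[address]
        if node_key == key:
            return value, 1
        address = next_address
    return None, 1


# Three-node list: a -> b -> c
remote_memory = {0: ("a", 1, 8), 8: ("b", 2, 16), 16: ("c", 3, None)}
print(traverse_one_sided(remote_memory, 0, "c"))
print(traversal_kernel(remote_memory, 0, "c"))
```

In this model the client-driven variant needs one round trip per node visited, while the kernel variant always needs one, so the gap grows with the length of the list.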

SLIDE 8

Use Case: Gathering Statistics

HyperLogLog (HLL) kernel to estimate cardinality of a data set

  • Bump-in-the-wire kernel
  • Cardinality estimation can augment the optimizer in data processing systems

HLL stages: Hash, Leading Zeros, Buckets, Harmonic mean

[Diagram: a Node sends an RDMA RPC WRITE to the remote NIC, where the HLL kernel sits in the data path; labels: data, statistics, Remote Memory, CPU]
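
The four HLL stages named on this slide (hash, leading zeros, buckets, harmonic mean) can be sketched in software as follows. This is a generic HyperLogLog sketch, not the FPGA kernel: the choice of SHA-1 truncated to 64 bits, the default of 2^12 buckets, and the function name `hll_estimate` are assumptions for illustration.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the number of distinct items using 2**p buckets."""
    m = 1 << p
    buckets = [0] * m
    for item in items:
        # Hash: a 64-bit value taken from SHA-1 (any well-mixing hash would do).
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        index = h >> (64 - p)                  # first p bits select the bucket
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Leading zeros: rank is the position of the first 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        # Buckets: keep the largest rank seen per bucket.
        buckets[index] = max(buckets[index], rank)
    # Harmonic mean of 2**rank over all buckets, scaled by the bias constant
    # (this closed form for alpha holds for m >= 128).
    alpha = 0.7213 / (1 + 1.079 / m)
    estimate = alpha * m * m / sum(2.0 ** -b for b in buckets)
    # Small-range correction (linear counting) while many buckets are still empty.
    zeros = buckets.count(0)
    if estimate <= 2.5 * m and zeros:
        estimate = m * math.log(m / zeros)
    return estimate


print(round(hll_estimate(range(100_000))))
```

With 4096 buckets the expected relative error is about 1.04 / sqrt(4096), roughly 1.6%. Each item only touches one bucket, so the state is small and the update is a single compare-and-store, which is what makes the algorithm a natural fit for a bump-in-the-wire kernel.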

SLIDE 9

Evaluation – StRoM NIC

  • FPGA-based prototype RDMA NIC
  • Extended RoCEv2 implementation with support for StRoM

▪ Alpha Data ADM-PCIE-7V3: StRoM at 10G
▪ Xilinx VCU118: StRoM at 100G

slide-10
SLIDE 10

Evaluation – GET kernel

[Figure: GET kernel evaluation plot; the only surviving annotation is "5μs"]

SLIDE 11

Evaluation – HLL kernel

SLIDE 12

Conclusion

StRoM: Smart Remote Memory

  • Deployment of acceleration kernels on the NIC
  • Acceleration of data access and data processing at up to 100G
  • Research platform

Open source at github.com/fpgasystems/fpga-network-stack