SLIDE 1

Spinning Relations: High-Speed Networks for Distributed Join Processing

Philip Frey, Romulo Goncalves, Martin Kersten, Jens Teubner

SLIDE 2

Problem Statement

We address a core database problem, but for large problem sizes: process a join R ⋈θ S (arbitrary join predicate θ), where R and S are large (many gigabytes, even terabytes).

Traditional approach:
◮ Use a big machine and/or suffer the severe disk I/O bottleneck of block nested loops join.
◮ Distributed evaluation works only for certain θ or certain data distributions (or suffers high network I/O cost).

Today:
◮ Assume a cluster of commodity machines only.
◮ Leverage modern high-speed networks (10 Gb/s and beyond).

Jens Teubner · Spinning Relations: High-Speed Networks for Distributed Join Processing 2 / 11

SLIDE 3

Modern Networks: High Speed?

It is actually very hard to saturate modern (e.g., 10 Gb/s) networks.

[Figure: two systems, each CPU–RAM–NIC, connected by an underutilized network]

High CPU demand

◮ Rule of thumb: 1 GHz CPU per 1 Gb/s network throughput (!)

Memory bus contention

◮ Data typically has to cross the memory bus three times

→ ≈ 3 GB/s bus capacity needed for 10 Gb/s network
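The bus-capacity estimate above can be checked with back-of-envelope arithmetic (an illustrative sketch; the three-crossings assumption is the slide's, the function name is mine):

```python
# Assumption from the slide: with a conventional NIC, each received byte
# crosses the memory bus about three times (NIC -> RAM, RAM -> CPU for
# protocol processing, CPU/RAM -> application buffer).

def bus_bandwidth_needed(link_gbps: float, crossings: int = 3) -> float:
    """Memory-bus bandwidth (GB/s) needed to sustain a given link rate."""
    link_gb_per_s = link_gbps / 8.0   # convert bits to bytes
    return link_gb_per_s * crossings

# A 10 Gb/s link carries 1.25 GB/s of payload; three crossings push
# about 3.75 GB/s over the bus, in line with the slide's "~3 GB/s".
print(bus_bandwidth_needed(10))  # 3.75
```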


SLIDE 4

RDMA: Remote Direct Memory Access

RDMA-capable network cards (RNICs) can saturate the link using direct data placement (avoid unnecessary bus transfers), OS bypassing (avoid context switches), and TCP offloading (avoid CPU load).

[Figure: two systems, each CPU–RAM–RNIC, connected by a fully utilized network]

Data is read/written on both ends using intra-host DMA. Asynchronous transfer after a work request is issued by the CPU.
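The "post a work request, transfer runs asynchronously" pattern can be mimicked in a toy model (illustrative Python only, not a real RNIC or verbs API; all names are mine):

```python
# Toy model of an RDMA-style one-sided write: the CPU only posts a work
# request; the "NIC" places the data directly into the remote buffer and
# signals completion through a queue, leaving the CPU free meanwhile.
import queue
import threading

class ToyRNIC:
    def __init__(self) -> None:
        self.completion_queue: "queue.Queue[str]" = queue.Queue()

    def post_write(self, local_buf: bytearray, remote_buf: bytearray) -> None:
        """Post a work request; the transfer runs off the CPU's critical path."""
        def dma() -> None:
            remote_buf[:] = local_buf          # direct data placement, no extra copies
            self.completion_queue.put("WRITE_DONE")
        threading.Thread(target=dma).start()

src = bytearray(b"tuple data")
dst = bytearray(len(src))
nic = ToyRNIC()
nic.post_write(src, dst)                       # returns immediately
# ... the CPU could run the local join here ...
assert nic.completion_queue.get() == "WRITE_DONE"
print(bytes(dst))                              # b'tuple data'
```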


SLIDE 5

Cyclo-Join Idea

[Figure: ring of hosts H0–H5 connected via RDMA; the fragments S0–S5 of input S stay pinned on their hosts while the fragments R0–R5 of input R rotate around the ring]

1 distribute   2 join locally   3 rotate

RDMA: join and rotate
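The rotation schedule above can be sketched in a few lines (a sequential simulation under my own naming, not the paper's implementation; real cyclo-join runs the hosts in parallel and overlaps rotation with the join):

```python
# Cyclo-join schedule: each host keeps its S fragment pinned while the
# R fragments rotate around the ring, so after n rounds every host has
# joined its S_i against every R_j.

def cyclo_join(r_fragments, s_fragments, local_join):
    n = len(s_fragments)              # one host per S fragment
    r = list(r_fragments)
    results = []
    for _ in range(n):
        for host in range(n):         # "join locally" on every host
            results.extend(local_join(r[host], s_fragments[host]))
        r = r[-1:] + r[:-1]           # "rotate": ship each R fragment onward
    return results

# Any in-memory algorithm and any predicate plug in; a toy equi-join:
def nested_loops(r, s):
    return [(x, y) for x in r for y in s if x == y]

out = cyclo_join([[1, 2], [3, 4]], [[2, 3], [1, 4]], nested_loops)
print(sorted(out))  # [(1, 1), (2, 2), (3, 3), (4, 4)]
```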


SLIDE 6

Analysis

Cyclo-join has similarities to block nested loops join:
◮ cut input data into blocks Ri and Sj,
◮ join all combinations Ri ⋈ Sj in memory.

As such, cyclo-join
◮ can be paired with any in-memory join algorithm,
◮ can be used to distribute the processing of any join predicate.

Cyclo-join fits into a “cloud-style” environment:
◮ additional nodes can be hooked in as needed,
◮ arbitrary assignment host ↔ task,
◮ cyclo-join consumes and produces distributed tables → n-way joins.


SLIDE 7

Cyclo-Join Put Into Practice

We implemented a prototype of cyclo-join:

four processing nodes
◮ Intel Xeon quad-core 2.33 GHz
◮ 6 GB RAM per node; memory bandwidth: 3.4 GB/s (measured)

10 Gb/s Ethernet
◮ Chelsio T3 RDMA-enabled network cards
◮ Nortel 10 Gb/s Ethernet switch

in-memory hash join

◮ hash phase physically re-organizes data (on each node)

→ better cache efficiency during join phase

◮ I/O complexity: O(|R| + |S|)
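The local algorithm described above can be sketched as follows (an illustrative partitioned hash join under my own naming, not the authors' code):

```python
# Hash phase: physically re-partition both inputs so the join phase
# touches one small, cache-resident partition pair at a time.
# Every tuple is read a constant number of times: O(|R| + |S|) I/O.
from collections import defaultdict

def partitioned_hash_join(r, s, key=lambda t: t, n_parts=8):
    parts_r, parts_s = defaultdict(list), defaultdict(list)
    for t in r:                        # hash phase re-organizes the data
        parts_r[hash(key(t)) % n_parts].append(t)
    for t in s:
        parts_s[hash(key(t)) % n_parts].append(t)
    out = []
    for p in range(n_parts):           # join phase: one partition pair at a time
        table = defaultdict(list)
        for t in parts_r[p]:
            table[key(t)].append(t)
        for t in parts_s[p]:
            out.extend((x, t) for x in table[key(t)])
    return out

print(sorted(partitioned_hash_join([1, 2, 3], [2, 3, 4])))  # [(2, 2), (3, 3)]
```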


SLIDE 8

Experiments

Experiment 1: Distributed evaluation of a join where |R| = |S| = 1.8 GB.

[Chart: wall-clock time [s] (0–80) for MonetDB (single-host) and for 1–4 hosts, each configuration with |S| = |R| = 1.8 GB; bars broken down into hash buildup, join execution, and synchronization]

Main benefit: reduced hash buildup time.


SLIDE 9

Experiments

Experiment 2: Scale up and join a larger S (hash buildup ignored here).

[Chart: wall-clock time [s] (0–4), split into join execution and synchronization:
1 host (|S| = |R| = 1.8 GB): 1.35 s join;
2 hosts (|S| = 3.6 GB, |R| = 1.8 GB): 2.08 s join + 0.80 s synchronization;
3 hosts (|S| = 5.4 GB, |R| = 1.8 GB): 2.83 s + 0.58 s;
4 hosts (|S| = 7.2 GB, |R| = 1.8 GB): 3.54 s + 0.26 s]

System scales like a machine with large RAM would. CPUs have to wait for network transfers (“synchronization”).


SLIDE 10

Memory Transfers

Need to wait for network: Does that mean RDMA doesn’t work at all?

[Figure: timeline for the 3-host configuration (|S| = 5.4 GB, |R| = 1.8 GB): joining Ri ⋈ Sj takes 2.83 s, shipping 1.8 GB at 10 Gb/s would take only 1.44 s, yet 0.58 s of synchronization remain; a second chart plots the memory-bus bandwidth (GB/s) consumed by the RDMA transfer alongside the join]

The culprit is the local memory bus! If RDMA hadn’t saved us some bus transfers, this would be worse.
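The argument follows from recomputing the slide's numbers (variable names are mine; the transfer time is idealized wire speed):

```python
# Figures from the 3-host configuration on the previous slide.
fragment_gb = 1.8        # size of one rotating R fragment
link_gbps = 10           # network link rate
join_s = 2.83            # measured local join time per fragment pair
sync_s = 0.58            # measured time the CPU actually waits

transfer_s = fragment_gb * 8 / link_gbps   # 1.8 GB at 10 Gb/s
print(round(transfer_s, 2))                # 1.44

# The 1.44 s transfer fits comfortably inside the 2.83 s join, yet the
# CPU still waits 0.58 s: RDMA traffic and the join compete for the
# local memory bus, so the bus, not the link, is the bottleneck.
assert transfer_s < join_s
```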


SLIDE 11

Conclusions

I demonstrated cyclo-join:
◮ ring topology to process large joins,
◮ use distributed memory to process arbitrary joins,
◮ hardware acceleration via RDMA is crucial: reduce CPU load and memory bus contention.

Cyclo-join is part of the Data Cyclotron project:
◮ support for more local join algorithms,
◮ process full queries in a merry-go-round setup.
