

SLIDE 1

OS Support for a Commodity Database on PC Clusters: Distributed Devices vs. Distributed File Systems

Felix Rauch (National ICT Australia)
Thomas M. Stricker (Google Inc., USA)
Laboratory for Computer Systems, ETH Zurich, Switzerland


SLIDE 2

Commodity Solutions for OLAP Workloads

TPC-D schema tables: Customer, Nation, Region, Part, PartSupp, Supplier, Order, LineItem

TPC-D data model. Database size: 10-100 GByte. What kind of system architecture is suitable for this type of workload?

SLIDE 3

Platforms

Traditionally: symmetric multiprocessors (SMP), e.g. the DEC 8400.
More recently: clusters of commodity PCs, e.g. the Patagonia multi-use cluster at ETH Zurich.

SLIDE 4

Killer SMPs vs. Clusters of PCs

Killer SMP

  • Killer performance!
  • Killing price...

Cluster of commodity PCs

  • Killer price!
  • Killing performance?

[Diagram: an SMP with processors (P), caches (C), memory (M), and disks (D) on a shared bus, versus cluster nodes, each with its own processor, cache, memory, and disk, connected by a network]

SLIDE 5

Overview

  • Introduction
  • Motivation
  • Distributed storage architectures
  • Evaluation
  • Analysis of results
  • Alternative: Middleware
  • Conclusion
SLIDE 6

Research Goal

Turn PC clusters into "killer SMPs" for OLAP: combine the excess storage and the high-speed network already available on cluster nodes, and provide a transparent distributed storage architecture as the database's storage backend for OLAP applications. We take the system architect's point of view and focus on performance and understanding.

SLIDE 7

Storage Architectures for Clusters of PCs

Traditional:

  • Big server with RAID
  • Storage-area networks (SAN)
  • Network-attached storage (NAS)

→ Additional hardware and costs.

Our proposed alternative: use the commodity hardware that is already available and distribute the data in software layers.

SLIDE 8

Why Should Such an Architecture Work?

Commodity hardware and software (OS) allow high cost effectiveness. Trends:

  • Disks becoming larger and cheaper
  • Built-in high-speed network
SLIDE 9

Large Hard-Disk Drives

[Chart: median disk size vs. full OS size (incl. applications), from yearly surveys 1998-2004, sizes in GByte]

SLIDE 12

High-Speed Network

[Chart: Ethernet throughput (Fast Ethernet, Gigabit Ethernet, 10 Gigabit Ethernet) vs. max. disk throughput, 1995-2005, in MByte/s]

→ Enough bandwidth to support distributed storage.
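A rough sanity check of that claim, with rounded figures and an assumed early-2000s disk throughput of about 40 MByte/s:

```latex
% Gigabit Ethernet alone already outpaces a single commodity disk:
B_{\mathrm{GbE}} \;=\; \frac{10^{9}\ \mathrm{bit/s}}{8\ \mathrm{bit/Byte}}
\;=\; 125\ \mathrm{MByte/s}
\;\gg\; B_{\mathrm{disk}} \;\approx\; 40\ \mathrm{MByte/s}
```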

SLIDE 13

Our Scenario

  • Parallel file systems for high-performance computing
  • Distributed file system (network RAID0)
  • Scalable (Lustre, PVFS)
  • Goal: boost DB performance

[Diagram: compute nodes and I/O nodes in a cluster; a DB node accesses the I/O nodes' disks over the network]

SLIDE 15

Alternative Systems

  • Petal [Lee & Thekkath, 1996]: Distributed virtual disks with special emphasis on dynamic reconfiguration and load balancing.
  • Frangipani [Thekkath, Mann & Lee, 1997]: Distributed file system that builds on Petal.
  • Lustre [Cluster File Systems, Inc.]: Object-oriented file system for large clusters.

SLIDE 16

Investigated Architectures

Fast Network Block Device (FNBD)

  • Maps hard-disk device over network
  • No intelligence, but highly optimised

Parallel Virtual File System (PVFS)

  • Integrates nodes' disks into parallel FS
  • Fully-featured file system
SLIDE 17

Fast Network Block Device (FNBD)

  • Loosely based on the Linux network block device
  • Implemented as kernel modules
  • Maps remote disk blocks over Gigabit Ethernet (from 3 servers)
  • Uses hardware features of the commodity network interface to implement zero copy
  • Multiple instances combine into a RAID0-like array of networked disks (see the striping sketch below)
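To illustrate the RAID0-like aggregation, here is a minimal sketch of round-robin block striping across networked disks. It is the idea only, not the actual FNBD kernel code; the stripe size and server names are assumptions.

```python
# Illustrative RAID0-style striping over networked disks -- a sketch of
# the idea, not the actual FNBD kernel code. Logical blocks are placed
# round-robin, one stripe unit at a time, across the servers.
STRIPE_BLOCKS = 128                          # blocks per stripe unit (assumed)
SERVERS = ["server0", "server1", "server2"]  # hypothetical server names

def locate(logical_block: int) -> tuple:
    """Map a logical block number to (server, block number on that server)."""
    stripe, offset = divmod(logical_block, STRIPE_BLOCKS)
    server = SERVERS[stripe % len(SERVERS)]
    # Each full round over the servers advances the local position by
    # one stripe unit on every server.
    local_block = (stripe // len(SERVERS)) * STRIPE_BLOCKS + offset
    return server, local_block

for b in (0, 127, 128, 256, 500):
    print(b, "->", locate(b))
```

Because consecutive stripe units land on different servers, large sequential reads fan out over all three network links at once.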

SLIDE 18

Parallel Virtual File System (PVFS)

  • Widely used on PC clusters
  • Implemented as a dynamically linked library
  • Fully featured distributed file system
  • Can be accessed by any participating node
  • Combines special directories on the server nodes into one large file system (see the sketch below)
  • 6 servers were used, due to disk-space limitations
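The following sketch shows the underlying idea of combining per-server directories into one striped file system. It is illustrative only, not the PVFS API; the stripe size and directory names are assumptions, and real servers are simulated with local paths.

```python
# Illustrative only -- not the PVFS API. Shows how a parallel file
# system can scatter one logical file into stripe chunks placed
# round-robin across per-server directories (simulated by local paths).
import os

STRIPE_SIZE = 64 * 1024                        # bytes per stripe (assumed)
SERVER_DIRS = ["srv%d" % i for i in range(6)]  # hypothetical directories

def write_striped(name: str, data: bytes) -> None:
    """Scatter `data` round-robin into one stripe file per server."""
    for d in SERVER_DIRS:
        os.makedirs(d, exist_ok=True)
    chunks = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
    for i, chunk in enumerate(chunks):
        # First chunk for a server truncates its stripe file, later
        # chunks append to it.
        mode = "ab" if i >= len(SERVER_DIRS) else "wb"
        with open(os.path.join(SERVER_DIRS[i % len(SERVER_DIRS)], name), mode) as f:
            f.write(chunk)

write_striped("table.dat", os.urandom(500 * 1024))
```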
SLIDE 19

Architecture of Reference Case

[Diagram: reference case, a single node; application(s) access the local disk through the file system and the disk driver in the OS kernel]

SLIDE 20

Architecture of FNBD

[Diagram: FNBD architecture. The client node runs the application(s), the file system, and the client part of the distributed device driver in its OS kernel; each server node runs the server part of the distributed device driver on top of its disk driver; client and servers communicate via the Fast Network Block Device]

SLIDE 21

Architecture of PVFS

[Diagram: PVFS architecture. The client node's application(s) link against the PVFS library; each server node runs the PVFS server daemon on top of its local file system and disk driver]

SLIDE 22

A Stream-Based Analytic Model

Presented at the EuroPar 2000 conference. The model considers the flow of a data stream and the limits of the building blocks along its path, which yields a set of (in)equations; solving them gives the maximal throughput of the stream. The model is simple and works well for large data streams. An illustrative formulation follows below.
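As an illustration of the general shape of such a model (not the exact equations from the paper): every building block i, with bandwidth B_i and traversed k_i times per payload byte, bounds the stream's throughput T, and the tightest bound wins.

```latex
% Illustrative form of a stream-throughput model: each building block i
% (disk, memory copy, network link, CPU) with bandwidth B_i, traversed
% k_i times per payload byte, bounds the stream throughput T.
T \le \frac{B_i}{k_i} \quad \text{for every building block } i,
\qquad
T_{\max} \;=\; \min_i \frac{B_i}{k_i}
```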

SLIDE 23

Modelling Workload

Need to know performance characteristics of all involved building blocks.

  • Easy for small and simple parts (HW, OS functionality): measurements or data sheets.
  • Very difficult for complex, closed software (RDBMS): black box. → Calibrate the model with known queries.

SLIDE 24

Calibration of Database Performance

Two cases:

  • "Simple" case: full table scan (find max.)
  • "Complex" case: scan including CPU work (sort)

Experimental calibration with the data in RAM (a sketch of such a calibration follows below):

  • 140 MByte/s throughput for the simple case
  • 7.75 MByte/s throughput for the complex case
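A minimal sketch of this kind of calibration, in the spirit of the slide but not the original measurement harness: time a simple full scan (max) against a complex scan (sort) over data that is already in RAM, and the payload-size assumption is a rough stand-in for real row data.

```python
# Illustrative calibration -- not the original harness. Times a simple
# full scan (max) vs. a complex scan (sort) over data already in RAM.
import time

N = 10_000_000
data = list(range(N, 0, -1))      # worst-case order for the sort
PAYLOAD = N * 8                   # assumed payload size in bytes

t0 = time.perf_counter()
_ = max(data)                     # "simple" case: full scan
simple_mbps = PAYLOAD / (time.perf_counter() - t0) / 1e6

t0 = time.perf_counter()
_ = sorted(data)                  # "complex" case: scan plus CPU (sort)
complex_mbps = PAYLOAD / (time.perf_counter() - t0) / 1e6

print(f"simple: {simple_mbps:.1f} MByte/s, complex: {complex_mbps:.1f} MByte/s")
```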
SLIDE 25

Modelling OLAP on FNBD

[Diagram: modelled FNBD data path. Server side: disk driver → FNBD driver (server part) → NIC driver, with DMA transfers and a special zero-copy path onto the Gigabit/s network, all inside the OS kernel. Client side: NIC driver → FNBD driver (client part) → file system in the OS kernel, then one copy and a pipe (reduced copy) up to the RDBMS application in user space.]

SLIDE 26

Modelling OLAP on PVFS

[Diagram: modelled PVFS data path. Server side: disk driver → file system in the OS kernel, a copy up to the PVFS daemon in user space, and a copy back down through TCP/IP and the NIC driver onto the Gigabit/s network, with DMA transfers. Client side: NIC driver → TCP/IP in the OS kernel, a copy up to the PVFS library, then a pipe (reduced copy) to the RDBMS application in user space. Note the additional copies compared to the FNBD path.]

SLIDE 27

Evaluation Criteria

Small microbenchmark "speed":

  • Throughput for large contiguous I/O operations with varying user-level block sizes (see the sketch below).

Application benchmark TPC-D:

  • Broad range of decision-support applications; long-running, complex ad-hoc queries.
  • The newer TPC-H and TPC-R include updates.
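A minimal sketch of such a "speed" microbenchmark, with the details assumed rather than taken from the paper: time large contiguous reads at several user-level block sizes. The test-file path is hypothetical, and for true disk numbers the page cache would have to be flushed between runs.

```python
# Sketch of a sequential-read microbenchmark at varying block sizes.
import os
import time

PATH = "/tmp/testfile"            # hypothetical large test file
TOTAL = 256 * 1024 * 1024         # bytes to read per block size

for kbyte in (4, 32, 256):
    fd = os.open(PATH, os.O_RDONLY)
    done = 0
    t0 = time.perf_counter()
    while done < TOTAL:
        buf = os.read(fd, kbyte * 1024)
        if not buf:               # end of file
            break
        done += len(buf)
    elapsed = time.perf_counter() - t0
    os.close(fd)
    print(f"{kbyte:4d} KByte blocks: {done / elapsed / 1e6:7.1f} MByte/s")
```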
SLIDE 28

Experimental Testbed

Multi-use cluster with 16 nodes, each with:

  • Two 1-GHz Pentium III CPUs
  • 512 MByte ECC SDRAM
  • 2 × 9 GByte of disk space
  • 2 Gigabit Ethernet adapters
  • Linux kernel 2.4.3
SLIDE 29

Microbenchmarks

[Chart: throughput in MByte/s vs. user-level block size (4-256 KByte) for the FNBD distributed devices (3 servers), the PVFS distributed file system (6 servers), and the reference case of a single local disk]

SLIDE 30

Experimental Evaluation with OLAP

TPC-D decision support benchmark on ORACLE

[Chart: speedup over the local disk for TPC-D queries 1, 2, 3, 4, 6, 9, 10, 12, 13, and 17; series: FNBD distributed devices (3 servers) and PVFS distributed file system (6 servers); reference case: single local disk (1 disk)]

SLIDE 31

Experimental Evaluation with OLAP

TPC-D decision support benchmark on ORACLE

[Same chart as on the previous slide, with the disk-limited queries highlighted]

SLIDE 32

Quantitative Performance: Model vs. Measurements

[Chart: modelled vs. measured speedup over the local disk for the simple query, the complex query, and TPC-D query 4; series: FNBD distributed devices and PVFS distributed file system; reference case: single local disk (1 disk)]

SLIDE 33

Analysis of Results

Performance was lower than expected. Aggregating the distributed disks did not increase application performance, and the fully-featured distributed file system failed to deliver decent performance. The stream-based analytic model proved too simple for such a complex workload.

SLIDE 34

Alternative: Performance with TP-Lite Middleware

Data distribution in the middleware layer: TP-Lite by [Böhm et al., 2000]

  • Distributes queries to multiple database servers in parallel (see the sketch below)
  • Needs multiple servers (costs)
  • Requires small changes to the application (not always possible)
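An illustrative sketch of TP-Lite-style query fan-out, not the system of Böhm et al.: a decomposable read-only query (here MAX) is sent to several database servers in parallel, each holding one data partition, and the partial results are merged on the client. The server names are hypothetical and the DB call is a stand-in.

```python
# Illustrative TP-Lite-style middleware fan-out (not the original system).
from concurrent.futures import ThreadPoolExecutor

SERVERS = ["db0", "db1", "db2"]   # hypothetical server names

def run_query(server: str, sql: str) -> float:
    # Stand-in for a real DB-API call; returns a dummy partial result
    # so that the sketch runs end to end.
    return float(abs(hash((server, sql))) % 1000)

def parallel_max(sql: str) -> float:
    """The global maximum is the maximum of the per-partition maxima."""
    with ThreadPoolExecutor(len(SERVERS)) as pool:
        partials = pool.map(lambda s: run_query(s, sql), SERVERS)
        return max(partials)

print(parallel_max("SELECT MAX(l_extendedprice) FROM lineitem"))
```

This works only for queries that decompose over partitions (max, sum, count); queries with cross-partition joins need application changes, which is the caveat in the list above.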

SLIDE 35

Modelling OLAP with TP-Lite

[Diagram: modelled TP-Lite data path. Server side: disk driver → file system in the OS kernel, a copy up to the RDBMS in user space, then down through TCP/IP and the NIC driver onto the Gigabit/s network; because the server-side RDBMS filters the data, the copies and DMA transfers are reduced. Client side: NIC driver → TCP/IP in the OS kernel, then a pipe (reduced copy) to the application in user space.]

SLIDE 36

Performance of TP-Lite

TPC-D decision support benchmark on ORACLE

[Chart: speedup over the local disk for TPC-D queries 1, 2, 3, 4, 6, 9, 10, 12, 13, and 17; series: FNBD distributed devices (3 servers), PVFS distributed file system (6 servers), and TP-Lite middleware (3 servers); reference case: single local disk (1 disk)]

SLIDE 37

Conclusions

We tried to turn clusters of PCs into "killer SMPs" as an architecture to boost OLAP performance. Our cost-effective approach uses the excess storage on the cluster nodes and a transparent parallelisation.

A simple network RAID cannot boost performance, and the fully-featured scalable parallel file system failed as well. Modelling the workload is almost impossible. We system architects cannot help the database community with system tricks (sorry).

SLIDE 38

Questions?

CoPs Project (Clusters of PCs), 1996-2004 @ ETH Zurich
http://www.cs.inf.ethz.ch/CoPs/

National ICT Australia (NICTA)
Embedded, Real-Time, and Operating Systems Research Program (ERTOS)
http://www.ertos.nicta.com.au/