

SLIDE 1

OS Support for a Commodity Database on PC Clusters: Distributed Devices vs. Distributed File Systems

Felix Rauch (National ICT Australia)
Thomas M. Stricker (Google Inc., USA)
Laboratory for Computer Systems, ETH Zurich, Switzerland


SLIDE 2

Commodity Solutions for OLAP Workloads

TPC-D schema tables: Customer, Nation, Region, Part, PartSupp, Supplier, Order, LineItem

TPC-D data model. Database size: 10-100 GByte. What kind of system architecture is suitable for this type of workload?

SLIDE 3

Platforms

Traditionally: symmetric multiprocessors (SMP), e.g. the DEC 8400.
More recently: clusters of commodity PCs, e.g. the Patagonia multi-use cluster at ETH Zurich.

SLIDE 4

Killer SMPs vs. Clusters of PCs

Killer SMP

  • Killer performance!
  • Killing price...

Cluster of commodity PCs

  • Killer price!
  • Killing performance?

[Diagram: an SMP with processors (P), caches (C), memory (M), and disks (D) on a shared bus, versus cluster nodes, each with its own processor, cache, memory, and disk, connected by a network]

SLIDE 5

Overview

  • Introduction
  • Motivation
  • Distributed storage architectures
  • Evaluation
  • Analysis of results
  • Alternative: Middleware
  • Conclusion
SLIDE 6

Research Goal

Turn PC clusters into "killer SMPs" for OLAP: combine the excess storage and the high-speed network already available on cluster nodes, and provide a transparent distributed storage architecture as the database's storage backend for OLAP applications. We take the system architect's point of view and focus on performance and understanding.

SLIDE 7

Storage Architectures for Clusters of PCs

Traditional:

  • Big server with RAID
  • Storage-area networks (SAN)
  • Network-attached storage (NAS)

→ Additional hardware and costs.

Our proposed alternative: use the commodity hardware that is already available and distribute the data in software layers.

SLIDE 8

Why Should Such an Architecture Work?

Commodity hardware and software (OS) allow high cost effectiveness. Trends:

  • Disks becoming larger and cheaper
  • Built-in high-speed network
SLIDE 9

Large Hard-Disk Drives

[Chart: median disk size vs. full OS size (incl. applications), from yearly surveys 1998-2004, sizes in GByte]

SLIDE 12

High-Speed Network

[Chart: Ethernet throughput (Fast Ethernet, Gigabit Ethernet, 10 Gigabit Ethernet) vs. max. disk throughput, 1995-2005, in MByte/s]

→ Enough bandwidth to support distributed storage.
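A rough sanity check of that claim, with rounded figures and an assumed early-2000s disk throughput of about 40 MByte/s:

```latex
% Gigabit Ethernet alone already outpaces a single commodity disk:
B_{\mathrm{GbE}} \;=\; \frac{10^{9}\ \mathrm{bit/s}}{8\ \mathrm{bit/Byte}}
\;=\; 125\ \mathrm{MByte/s}
\;\gg\; B_{\mathrm{disk}} \;\approx\; 40\ \mathrm{MByte/s}
```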

SLIDE 13

Our Scenario

  • Parallel file systems for high-performance computing
  • Distributed file system (network RAID0)
  • Scalable (Lustre, PVFS)
  • Goal: boost DB performance

[Diagram: compute nodes and I/O nodes in a cluster; a DB node accesses the I/O nodes' disks over the network]

SLIDE 15

Alternative Systems

  • Petal [Lee & Thekkath, 1996]: Distributed virtual disks with special emphasis on dynamic reconfiguration and load balancing.
  • Frangipani [Thekkath, Mann & Lee, 1997]: Distributed file system that builds on Petal.
  • Lustre [Cluster File Systems, Inc.]: Object-oriented file system for large clusters.

SLIDE 16

Investigated Architectures

Fast Network Block Device (FNBD)

  • Maps hard-disk device over network
  • No intelligence, but highly optimised

Parallel Virtual File System (PVFS)

  • Integrates nodes' disks into parallel FS
  • Fully-featured file system
SLIDE 17

Fast Network Block Device (FNBD)

  • Loosely based on the Linux network block device
  • Implemented as kernel modules
  • Maps remote disk blocks over Gigabit Ethernet (from 3 servers)
  • Uses hardware features of the commodity network interface to implement zero copy
  • Multiple instances combine into a RAID0-like array of networked disks (see the striping sketch below)
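To illustrate the RAID0-like aggregation, here is a minimal sketch of round-robin block striping across networked disks. It is the idea only, not the actual FNBD kernel code; the stripe size and server names are assumptions.

```python
# Illustrative RAID0-style striping over networked disks -- a sketch of
# the idea, not the actual FNBD kernel code. Logical blocks are placed
# round-robin, one stripe unit at a time, across the servers.
STRIPE_BLOCKS = 128                          # blocks per stripe unit (assumed)
SERVERS = ["server0", "server1", "server2"]  # hypothetical server names

def locate(logical_block: int) -> tuple:
    """Map a logical block number to (server, block number on that server)."""
    stripe, offset = divmod(logical_block, STRIPE_BLOCKS)
    server = SERVERS[stripe % len(SERVERS)]
    # Each full round over the servers advances the local position by
    # one stripe unit on every server.
    local_block = (stripe // len(SERVERS)) * STRIPE_BLOCKS + offset
    return server, local_block

for b in (0, 127, 128, 256, 500):
    print(b, "->", locate(b))
```

Because consecutive stripe units land on different servers, large sequential reads fan out over all three network links at once.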

SLIDE 18

Parallel Virtual File System (PVFS)

  • Widely used on PC clusters
  • Implemented as a dynamically linked library
  • Fully featured distributed file system
  • Can be accessed by any participating node
  • Combines special directories on the server nodes into one large file system (see the sketch below)
  • 6 servers were used, due to disk-space limitations
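The following sketch shows the underlying idea of combining per-server directories into one striped file system. It is illustrative only, not the PVFS API; the stripe size and directory names are assumptions, and real servers are simulated with local paths.

```python
# Illustrative only -- not the PVFS API. Shows how a parallel file
# system can scatter one logical file into stripe chunks placed
# round-robin across per-server directories (simulated by local paths).
import os

STRIPE_SIZE = 64 * 1024                        # bytes per stripe (assumed)
SERVER_DIRS = ["srv%d" % i for i in range(6)]  # hypothetical directories

def write_striped(name: str, data: bytes) -> None:
    """Scatter `data` round-robin into one stripe file per server."""
    for d in SERVER_DIRS:
        os.makedirs(d, exist_ok=True)
    chunks = [data[i:i + STRIPE_SIZE] for i in range(0, len(data), STRIPE_SIZE)]
    for i, chunk in enumerate(chunks):
        # First chunk for a server truncates its stripe file, later
        # chunks append to it.
        mode = "ab" if i >= len(SERVER_DIRS) else "wb"
        with open(os.path.join(SERVER_DIRS[i % len(SERVER_DIRS)], name), mode) as f:
            f.write(chunk)

write_striped("table.dat", os.urandom(500 * 1024))
```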
SLIDE 19

Architecture of Reference Case

[Diagram: reference case, a single node; application(s) access the local disk through the file system and the disk driver in the OS kernel]

SLIDE 20

Architecture of FNBD

[Diagram: FNBD architecture. The client node runs the application(s), the file system, and the client part of the distributed device driver in its OS kernel; each server node runs the server part of the distributed device driver on top of its disk driver; client and servers communicate via the Fast Network Block Device]

SLIDE 21

Architecture of PVFS

[Diagram: PVFS architecture. The client node's application(s) link against the PVFS library; each server node runs the PVFS server daemon on top of its local file system and disk driver]

SLIDE 22

A Stream-Based Analytic Model

Presented at the EuroPar 2000 conference. The model considers the flow of a data stream and the limits of the building blocks along its path, which yields a set of (in)equations; solving them gives the maximal throughput of the stream. The model is simple and works well for large data streams. An illustrative formulation follows below.
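As an illustration of the general shape of such a model (not the exact equations from the paper): every building block i, with bandwidth B_i and traversed k_i times per payload byte, bounds the stream's throughput T, and the tightest bound wins.

```latex
% Illustrative form of a stream-throughput model: each building block i
% (disk, memory copy, network link, CPU) with bandwidth B_i, traversed
% k_i times per payload byte, bounds the stream throughput T.
T \le \frac{B_i}{k_i} \quad \text{for every building block } i,
\qquad
T_{\max} \;=\; \min_i \frac{B_i}{k_i}
```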

SLIDE 23

Modelling Workload

Need to know performance characteristics of all involved building blocks.

  • Easy for small and simple parts (HW, OS functionality): measurements or data sheets.
  • Very difficult for complex, closed software (RDBMS): black box. → Calibrate the model with known queries.

SLIDE 24

Calibration of Database Performance

Two cases:

  • "Simple" case: full table scan (find max.)
  • "Complex" case: scan including CPU work (sort)

Experimental calibration with the data in RAM (a sketch of such a calibration follows below):

  • 140 MByte/s throughput for the simple case
  • 7.75 MByte/s throughput for the complex case
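A minimal sketch of this kind of calibration, in the spirit of the slide but not the original measurement harness: time a simple full scan (max) against a complex scan (sort) over data that is already in RAM, and the payload-size assumption is a rough stand-in for real row data.

```python
# Illustrative calibration -- not the original harness. Times a simple
# full scan (max) vs. a complex scan (sort) over data already in RAM.
import time

N = 10_000_000
data = list(range(N, 0, -1))      # worst-case order for the sort
PAYLOAD = N * 8                   # assumed payload size in bytes

t0 = time.perf_counter()
_ = max(data)                     # "simple" case: full scan
simple_mbps = PAYLOAD / (time.perf_counter() - t0) / 1e6

t0 = time.perf_counter()
_ = sorted(data)                  # "complex" case: scan plus CPU (sort)
complex_mbps = PAYLOAD / (time.perf_counter() - t0) / 1e6

print(f"simple: {simple_mbps:.1f} MByte/s, complex: {complex_mbps:.1f} MByte/s")
```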
SLIDE 25

Modelling OLAP on FNBD

[Diagram: modelled FNBD data path. Server side: disk driver → FNBD driver (server part) → NIC driver, with DMA transfers and a special zero-copy path onto the Gigabit/s network, all inside the OS kernel. Client side: NIC driver → FNBD driver (client part) → file system in the OS kernel, then one copy and a pipe (reduced copy) up to the RDBMS application in user space.]

SLIDE 26

Modelling OLAP on PVFS

[Diagram: modelled PVFS data path. Server side: disk driver → file system in the OS kernel, a copy up to the PVFS daemon in user space, and a copy back down through TCP/IP and the NIC driver onto the Gigabit/s network, with DMA transfers. Client side: NIC driver → TCP/IP in the OS kernel, a copy up to the PVFS library, then a pipe (reduced copy) to the RDBMS application in user space. Note the additional copies compared to the FNBD path.]

SLIDE 27

Evaluation Criteria

Small microbenchmark "speed":

  • Throughput for large contiguous I/O operations with varying user-level block sizes (see the sketch below).

Application benchmark TPC-D:

  • Broad range of decision-support applications; long-running, complex ad-hoc queries.
  • The newer TPC-H and TPC-R include updates.
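A minimal sketch of such a "speed" microbenchmark, with the details assumed rather than taken from the paper: time large contiguous reads at several user-level block sizes. The test-file path is hypothetical, and for true disk numbers the page cache would have to be flushed between runs.

```python
# Sketch of a sequential-read microbenchmark at varying block sizes.
import os
import time

PATH = "/tmp/testfile"            # hypothetical large test file
TOTAL = 256 * 1024 * 1024         # bytes to read per block size

for kbyte in (4, 32, 256):
    fd = os.open(PATH, os.O_RDONLY)
    done = 0
    t0 = time.perf_counter()
    while done < TOTAL:
        buf = os.read(fd, kbyte * 1024)
        if not buf:               # end of file
            break
        done += len(buf)
    elapsed = time.perf_counter() - t0
    os.close(fd)
    print(f"{kbyte:4d} KByte blocks: {done / elapsed / 1e6:7.1f} MByte/s")
```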
SLIDE 28

Experimental Testbed

Multi-use cluster with 16 nodes, each with:

  • Two 1-GHz Pentium III CPUs
  • 512 MByte ECC SDRAM
  • 2 × 9 GByte of disk space
  • 2 Gigabit Ethernet adapters
  • Linux kernel 2.4.3
SLIDE 29

Microbenchmarks

[Chart: throughput in MByte/s vs. user-level block size (4-256 KByte) for the FNBD distributed devices (3 servers), the PVFS distributed file system (6 servers), and the reference case of a single local disk]

SLIDE 30

Experimental Evaluation with OLAP

TPC-D decision support benchmark on ORACLE

[Chart: speedup over the local disk for TPC-D queries 1, 2, 3, 4, 6, 9, 10, 12, 13, and 17; series: FNBD distributed devices (3 servers) and PVFS distributed file system (6 servers); reference case: single local disk (1 disk)]

SLIDE 31

Experimental Evaluation with OLAP

TPC-D decision support benchmark on ORACLE

[Same chart as on the previous slide, with the disk-limited queries highlighted]

SLIDE 32

Quantitative Performance: Model vs. Measurements

[Chart: modelled vs. measured speedup over the local disk for the simple query, the complex query, and TPC-D query 4; series: FNBD distributed devices and PVFS distributed file system; reference case: single local disk (1 disk)]

SLIDE 33

Analysis of Results

Performance was lower than expected. Aggregating the distributed disks did not increase application performance, and the fully-featured distributed file system failed to deliver decent performance. The stream-based analytic model proved too simple for such a complex workload.

SLIDE 34

Alternative: Performance with TP-Lite Middleware

Data distribution in the middleware layer: TP-Lite by [Böhm et al., 2000]

  • Distributes queries to multiple database servers in parallel (see the sketch below)
  • Needs multiple servers (costs)
  • Requires small changes to the application (not always possible)
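An illustrative sketch of TP-Lite-style query fan-out, not the system of Böhm et al.: a decomposable read-only query (here MAX) is sent to several database servers in parallel, each holding one data partition, and the partial results are merged on the client. The server names are hypothetical and the DB call is a stand-in.

```python
# Illustrative TP-Lite-style middleware fan-out (not the original system).
from concurrent.futures import ThreadPoolExecutor

SERVERS = ["db0", "db1", "db2"]   # hypothetical server names

def run_query(server: str, sql: str) -> float:
    # Stand-in for a real DB-API call; returns a dummy partial result
    # so that the sketch runs end to end.
    return float(abs(hash((server, sql))) % 1000)

def parallel_max(sql: str) -> float:
    """The global maximum is the maximum of the per-partition maxima."""
    with ThreadPoolExecutor(len(SERVERS)) as pool:
        partials = pool.map(lambda s: run_query(s, sql), SERVERS)
        return max(partials)

print(parallel_max("SELECT MAX(l_extendedprice) FROM lineitem"))
```

This works only for queries that decompose over partitions (max, sum, count); queries with cross-partition joins need application changes, which is the caveat in the list above.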

SLIDE 35

Modelling OLAP with TP-Lite

[Diagram: modelled TP-Lite data path. Server side: disk driver → file system in the OS kernel, a copy up to the RDBMS in user space, then down through TCP/IP and the NIC driver onto the Gigabit/s network; because the server-side RDBMS filters the data, the copies and DMA transfers are reduced. Client side: NIC driver → TCP/IP in the OS kernel, then a pipe (reduced copy) to the application in user space.]

SLIDE 36

Performance of TP-Lite

TPC-D decision support benchmark on ORACLE

[Chart: speedup over the local disk for TPC-D queries 1, 2, 3, 4, 6, 9, 10, 12, 13, and 17; series: FNBD distributed devices (3 servers), PVFS distributed file system (6 servers), and TP-Lite middleware (3 servers); reference case: single local disk (1 disk)]

SLIDE 37

Conclusions

We tried to turn clusters of PCs into "killer SMPs" as an architecture to boost OLAP performance. Our cost-effective approach uses the excess storage on the cluster nodes and a transparent parallelisation.

A simple network RAID cannot boost performance, and the fully-featured scalable parallel file system failed as well. Modelling the workload is almost impossible. We system architects cannot help the database community with system tricks (sorry).

SLIDE 38

Questions?

CoPs Project (Clusters of PCs), 1996-2004 @ ETH Zurich
http://www.cs.inf.ethz.ch/CoPs/

National ICT Australia (NICTA)
Embedded, Real-Time, and Operating Systems Research Program (ERTOS)
http://www.ertos.nicta.com.au/