[PDF] - Data Storage and Disk Structure A Simple Implementation of DBMS PDF Document

SLIDE 1

Data Storage and Disk Structure

SLIDE 2

CMPT 454: Database II -- Storage and Disk Structure 2

A Simple Implementation of DBMS

One file per table

– Students(name, id, dept) in a file Students – A meta symbol “#” to separate attributes

Smith#123#CS Johnson#522#EE …

Database schema in a special file Schema

Students#name#STR#id#INT#dept#STR Depts#name#STR#office#Str …

SLIDE 3

CMPT 454: Database II -- Storage and Disk Structure 3

Naïve Query Answering

SELECT * FROM Students WHERE dept = ‘CS’ | CSStud

Read file Schema to determine the attributes of relation Student and

their types

Check that condition dept = ‘CS’ is semantically valid for Students
Create a new file CSStud
Read file Students, for each line

– Check condition dept = ‘CS’, if it is true then write the line as a tuple to file CSStud

Add to the file Schema a line about CSStud
Problems

– If we change EE to ECON in one tuple in Students, the entire file has to be rewritten – Even if we look for one student, we have to read the whole file – If multiple users read/write file Students simultaneously, what would happen?

SLIDE 4

CMPT 454: Database II -- Storage and Disk Structure 4

Handling Joins

SELECT office FROM Students, Depts WHERE Students.name = ‘Smith’ AND Students.dept = Depts.name;

Algorithm

FOR each tuple s in Students DO FOR each tuple d in depts DO IF s.name = ‘Smith’ AND s.dept = d.dept THEN write d.office as a tuple to the output

More problems

– Why do we need to match a student “Cindy” with all departments? – I/O Complexity: O(n2), costly! – What if the system crashes?

SLIDE 5

CMPT 454: Database II -- Storage and Disk Structure 5

Storage Device Hierarchy

How should we store

data on disks so that queries can be answered efficiently?

How can we organize

disks effectively so that a database built on top can be more efficient and robust?

SLIDE 6

CMPT 454: Database II -- Storage and Disk Structure 6

Where Is DMBS?

Disk Tertiary storage Main memory Cache Virtual memory File system DBMS

SLIDE 7

CMPT 454: Database II -- Storage and Disk Structure 7

Volatile/Nonvolatile, Trustworthy Storage

A volatile device “forgets” what is stored in it when the power goes off

– Main memory, cache

A nonvolatile device keeps its contents intact even for long periods

when the device is turned off or there is a power failure

– Disk, CD, USB flash memory, …

When a customer and a bank have controversy on a transaction … Let

us check the transaction record – oops, the records are stored on a disk in the bank …

Trustworthy: once the data is stored, it cannot be changed later

– WORM model: write once read many, e.g., CD ROM, ROM memory chips – WORM storage systems

SLIDE 8

CMPT 454: Database II -- Storage and Disk Structure 8

Moving Head Disk Mechanism

SLIDE 9

CMPT 454: Database II -- Storage and Disk Structure 9

Disk Controller

Control the actuator to move the head

assembly

Select a surface to read or write
Transfer the bits read/to be written between

main memory and the desired sector

Bus CPU Main memory Disk controller Disk Disk Disk Disk

SLIDE 10

CMPT 454: Database II -- Storage and Disk Structure 10

Disk Access Characteristics

How is a desired data block read?

– The heads are positioned at the cylinder containing the track on which the block is located – The sectors containing the block move under the disk head as the entire disk assembly rotates

Latency of the disk: the time taken between the

moment at which the command to read a block is issued and the time that the contents of the block appear in main memory

SLIDE 11

CMPT 454: Database II -- Storage and Disk Structure 11

Breakdown of Disk Latency

The time taken by processor and disk controller to process the request

– Typically a fraction of a millisecond, can be neglected

Seek time: the time to position the head assembly at the proper

cylinder

– Typically tens of milliseconds

Rotation latency: the time for the disk to rotate so the first of the sectors

containing the block reaches the head

– A typical disk rotates completely once very 10 milliseconds, thus, the expected rotational latency is about 5 milliseconds

Transfer time: the time it takes the sectors of the block and any gaps

between them to rotate past the head

– If a disk has 250 kb per track and rotates once in 10 milliseconds, the read rate is 25 mb per second

SLIDE 12

CMPT 454: Database II -- Storage and Disk Structure 12

Breakdown of Disk Latency

SLIDE 13

CMPT 454: Database II -- Storage and Disk Structure 13

Organizing Data by Cylinders

To reduce seek time, we can store data that is likely to be accessed

together on a single cylinder or several adjacent cylinders

If all bocks to read/write are on a single track or on a cylinder

consecutively, only one seek time and one rotational latency are needed

Writing blocks is quite analogous to reading a block

– If it is required to verify whether the written block is correct, wait for an additional rotation and read each sector back

Modifying blocks

– Read the block into main memory – Make changes to the block in the main memory copy of the block – Write the new contents of block back to the disk – If necessary, verify the write

SLIDE 14

CMPT 454: Database II -- Storage and Disk Structure 14

Stripping and Mirroring

Bit-level stripping

– Write bit i of each byte to disk i – Access is sped up for 8 times

Block-level stripping

– With an array of n disks, block I of the disk array is written to disk (i mod n) + 1 and use block ⎣i / n⎦ of the disk

Mirroring: maintaining multiple copies

SLIDE 15

CMPT 454: Database II -- Storage and Disk Structure 15

Stripping versus Mirroring

Block-level stripping

SLIDE 16

CMPT 454: Database II -- Storage and Disk Structure 16

RAID Levels 0, 1, 2, and 3

Redundant Array of Independent/Inexpensive Drives
RAID 0: striping at the block level, no redundancy
RAID 1 mirroring and block striping
RAID 2: byte level striping + error-correcting codes
RAID 3: bit-interleaved parity organization

SLIDE 17

CMPT 454: Database II -- Storage and Disk Structure 17

RAID Levels 4, 5, and 6

RAID 4: block-level striping + a parity block
RAID 5: parity distributed in all disks
RAID 6: storing more redundant information to recover

from multiple disk failures

SLIDE 18

CMPT 454: Database II -- Storage and Disk Structure 18

RAM Model and I/O Model

RAM Model: when data can be held in main memory, the

bottleneck of computation is CPU

– Each data item can be accessed using approximate the same amount of time

I/O Model: if a block needs to be moved between disk and

main memory, the time taken to perform the read/write is much longer than the time likely to be used to manipulate the data in main memory

– Databases often cannot fit into main memory

The number of block access is a good approximation to the

time needed by the algorithm and should be minimized

SLIDE 19

CMPT 454: Database II -- Storage and Disk Structure 19

Sorting in Main Memory

Sorting 5,000,000 tuples, each tuple takes 128

bytes, totally 640 Mb data – cannot fit into a computer with 512 Mb main memory

Each disk block has 16 kb and can contain 128

tuples – 39,063 disk blocks

Quicksort? Fastest if all data is in memory

– Need to randomly access data items, many accesses to disk blocks if data is on disk

SLIDE 20

CMPT 454: Database II -- Storage and Disk Structure 20

Merge Sort

Basis: if the length of the list is
ne, return
Induction: if the list is of length

more than one, divide the list into two lists that are either of the same length or as close as possible, recursively sort the two sublists and then merge the resulting sorted sublists into

ne sorted list
The time to merge two lists in

main memory is O(n1+n2)

Log2n phases in total, total cost

O(n log n)

SLIDE 21

CMPT 454: Database II -- Storage and Disk Structure 21

Two-Phase, Multiway Merge Sort

Phase 1 – sort main memory sized

partitions into sorted sublists

– Fill all available main memory with blocks from the original tuples to be sorted – Sort the records that are in main memory – Write the sorted records from main memory onto new blocks of disk, form

ne sorted list
Phase 2 – merge all the sorted

sublists into a single sorted list

– The number of sublists is smaller than the number of blocks in the available main memory

SLIDE 22

CMPT 454: Database II -- Storage and Disk Structure 22

How to Merge?

Find the smallest key among the first remaining elements of all the lists

– A linear search of the heads of all sublists which are in main memory

Move the smallest element to the first available position of the output

block

If the output block is full, write it to disk and reinitialize the same buffer

in main memory to hold the next output block

If the block from which the smallest element was just taken is now

exhausted of records, read the next block from the same sorted sublist into the same buffer

Cost analysis

– Blocks are read in an unpredictable order – Every block holding records from one of the sorted lists is read from disk

nly once

– The number of block written to disk is the same of the blocks holding the sublists – Each record in a sublist will be moved to the output buffer exactly once

SLIDE 23

CMPT 454: Database II -- Storage and Disk Structure 23

Generalization to Multiple Phases

A third phase can be used if there are too

many sublists

A small computer has 100 Mb for sorting
A block has 16 kb, 6400 blocks can be held

into main memory

A third phase is needed only if the size of

the data is more than 100 Mb * (6400-1) = 624.9 Gb

SLIDE 24

CMPT 454: Database II -- Storage and Disk Structure 24

First-Come-First-Serve Scheduling

Cylinder

f request

First time available Time completed 2000 10 20 30 4.42 6000 13.84 14000 27.26 4000 42.68 16000 60.10 10000 71.52 2000 2000 + |6000-2000| 4000 + |14000-6000| 8000 + |4000-14000| 10000 + |16000-4000| 12000 + |10000-16000| 6000 = 42000 2000 4000 6000 10000 14000 16000

SLIDE 25

CMPT 454: Database II -- Storage and Disk Structure 25

Elevator Algorithm

Cylinder

f request

First time available Time completed 2000 10 20 30 4.42 6000 13.84 14000 27.26 4000 57.52 16000 34.68 10000 46.10 2000 2000 + |6000-2000| 4000 + |14000-6000| 8000 + |16000-14000| 2000 + |10000-16000| 6000 + |4000-10000| 6000 = 28000 2000 4000 6000 10000 14000 16000

SLIDE 26

CMPT 454: Database II -- Storage and Disk Structure 26

Disk Scheduling

The elevator algorithm can achieve good

performance on average

– In our example, it saves 14000 (1/3 of 42000 in first-come-first-serve method)

The more different request, the better

performance the elevator algorithm

The elevator algorithm is not optimal

– Can you give an example where the elevator algorithm performs worse than the first-come- first serve method?

SLIDE 27

CMPT 454: Database II -- Storage and Disk Structure 27

Prefetching

SLIDE 28

CMPT 454: Database II -- Storage and Disk Structure 28

Accelerating Access to Disk

R/W in a predictable sequence Scattered R/W requests, unpredictable Cylinder-based organization X Multiple disks X X Mirroring X X Elevator algorithm X Prefetching X More disks (and thus money) are needed for using multiple disks and mirroring

SLIDE 29

CMPT 454: Database II -- Storage and Disk Structure 29

Fixed-Length Records and Free List

SLIDE 30

CMPT 454: Database II -- Storage and Disk Structure 30

Slotted-Page Structure

SLIDE 31

CMPT 454: Database II -- Storage and Disk Structure 31

Sequential File Organization

Search key: a set of attributes on which a

query is often conducted

SLIDE 32

CMPT 454: Database II -- Storage and Disk Structure 32

Insertion into a Sequential File

SLIDE 33

CMPT 454: Database II -- Storage and Disk Structure 33

Data Storage and Disk Structure

A Simple Implementation of DBMS

– Students(name, id, dept) in a file Students – A meta symbol “#” to separate attributes

Smith#123#CS Johnson#522#EE …

Students#name#STR#id#INT#dept#STR Depts#name#STR#office#Str …

Naïve Query Answering

Handling Joins

SELECT office FROM Students, Depts WHERE Students.name = ‘Smith’ AND Students.dept = Depts.name;

FOR each tuple s in Students DO FOR each tuple d in depts DO IF s.name = ‘Smith’ AND s.dept = d.dept THEN write d.office as a tuple to the output

– Why do we need to match a student “Cindy” with all departments? – I/O Complexity: O(n2), costly! – What if the system crashes?

Storage Device Hierarchy

data on disks so that queries can be answered efficiently?

disks effectively so that a database built on top can be more efficient and robust?

Where Is DMBS?

Volatile/Nonvolatile, Trustworthy Storage

Moving Head Disk Mechanism

Disk Controller

assembly

main memory and the desired sector

Disk Access Characteristics

– The heads are positioned at the cylinder containing the track on which the block is located – The sectors containing the block move under the disk head as the entire disk assembly rotates

moment at which the command to read a block is issued and the time that the contents of the block appear in main memory

Breakdown of Disk Latency

Breakdown of Disk Latency

Organizing Data by Cylinders

Stripping and Mirroring

Stripping versus Mirroring

RAID Levels 0, 1, 2, and 3

RAID Levels 4, 5, and 6

from multiple disk failures

RAM Model and I/O Model

bottleneck of computation is CPU

main memory, the time taken to perform the read/write is much longer than the time likely to be used to manipulate the data in main memory

time needed by the algorithm and should be minimized

Sorting in Main Memory

bytes, totally 640 Mb data – cannot fit into a computer with 512 Mb main memory

tuples – 39,063 disk blocks

– Need to randomly access data items, many accesses to disk blocks if data is on disk

Merge Sort

Two-Phase, Multiway Merge Sort

How to Merge?

Generalization to Multiple Phases

many sublists

into main memory

the data is more than 100 Mb * (6400-1) = 624.9 Gb

First-Come-First-Serve Scheduling

Elevator Algorithm

Disk Scheduling

performance on average

– In our example, it saves 14000 (1/3 of 42000 in first-come-first-serve method)

performance the elevator algorithm

– Can you give an example where the elevator algorithm performs worse than the first-come- first serve method?

Prefetching

Accelerating Access to Disk

R/W in a predictable sequence Scattered R/W requests, unpredictable Cylinder-based organization X Multiple disks X X Mirroring X X Elevator algorithm X Prefetching X More disks (and thus money) are needed for using multiple disks and mirroring

Fixed-Length Records and Free List

Slotted-Page Structure

Sequential File Organization

query is often conducted

Insertion into a Sequential File

Multitable Clustering File Structure