
MMDB-2 J. Teuhola 2012 19

2. Management of large objects

LOB = Large OBject. A 'normal' DBMS regards a LOB as one field with no internal structure. Traditional business-oriented relational DBMSs have a maximum field length of e.g. 255 or 32767 bytes, while media objects are usually considerably larger. Today's relational DBMSs support field lengths of several Gbytes, but:

  • It is wasteful to access the whole object if only a piece is needed.
  • The long object may not fit in main memory.
  • Piecewise processing should be supported.
  • The logical structure is handled by higher-level software.
  • A log file is needed for recovery from errors; logging a whole object is very ineffective if only a small part of it is affected.
  • Secondary storage management should be more flexible: multiple page sizes or multiple-size clusters of pages would enhance the I/O for variable-length objects.


SQL and long fields

Long data types:

  • Character large object (CLOB), content e.g. HTML, XML
  • Binary large object (BLOB), a sequence of 8-bit octets, content e.g. MP3 or JPEG
  • External, read-only file (BFILE), content e.g. AVI, MPEG

Operations:

  • Concatenation
  • Substring (from a start position for a given length)
  • Overlay (substring replacement)
  • Trim (remove given leading/trailing characters)
  • Length (function returning the number of characters)
  • Position (start position of a searched substring)

But: not GROUP BY, ORDER BY, join, set operations, etc.
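The semantics of these LOB operations can be sketched on Python byte strings; the function names mirror the SQL operations but are illustrative only, not an actual DBMS API (note the 1-based positions, as in SQL):

```python
def substring(lob: bytes, start: int, length: int) -> bytes:
    """SUBSTRING: bytes from a 1-based start position, for a given length."""
    return lob[start - 1:start - 1 + length]

def overlay(lob: bytes, replacement: bytes, start: int) -> bytes:
    """OVERLAY: replace bytes in place, beginning at a 1-based position."""
    return lob[:start - 1] + replacement + lob[start - 1 + len(replacement):]

def position(lob: bytes, pattern: bytes) -> int:
    """POSITION: 1-based start of the searched substring, 0 if absent."""
    return lob.find(pattern) + 1

clob = b"<html><body>hello</body></html>"
assert substring(clob, 13, 5) == b"hello"
assert position(clob, b"hello") == 13
assert overlay(clob, b"HELLO", 13) == b"<html><body>HELLO</body></html>"
```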


Tree-structured representation

B-tree-type multi-level directory, used e.g. in SQL Server, Oracle, etc. Example architecture: the EXODUS storage system (an extensible OODBMS). It offers very flexible management of large objects that can grow and shrink at arbitrary positions, but is not optimized for sequential processing speed (best for long text documents).

Each object has a unique OID = <page no, slot no>. Two kinds of objects:

  (1) Small objects fit in one page.
  (2) Large objects occupy multiple pages; the OID points to the header.

Two kinds of pages:

  (1) Slotted pages contain small objects and headers of large objects.
  (2) Other pages contain parts of large objects, each page being private to one object only.

When a small object grows larger than a page, it is converted automatically into a large object.


Page allocation schematically

[Figure: slotted pages hold several small objects plus the headers of LOBs x and y, with some free space; separate LOB pages hold the pages of LOB x and the pages of LOB y.]


Tree-structured representation (cont.)

Physical representation: a B+-tree, indexed on byte positions within the object.

The root is the header of the large object. Internal nodes contain a <count, pointer> pair for each child:

  • Count is the highest relative byte number (= offset) within the subtree rooted at that child.
  • Pointer is the page id (address) of the child.

The count of the rightmost child is thus the size of the subtree rooted at the current node. The number of <count, pointer> pairs in a node is between k and 2k+1 (i.e. nodes are at least about half-full), where the degree k is the B+-tree parameter. Internal nodes occupy one page each.

Leaves are blocks of one or more pages (a system parameter). Leaf blocks contain nothing but actual data. Leaves, too, can vary from half-full to full.


Tree-structured representation: Example

Maximal object sizes for 4-Kbyte pages, 4-byte pointers, 4-byte counts and 4-page leaf blocks:

  • 2-level tree: 8 Mbytes
  • 3-level tree: 4 Gbytes

[Figure: a two-level tree. The OID points to the root with counts 421 and 786; its children hold counts 120, 282, 421 and 192, 365; the leaf blocks store 120, 162, 139, 173 and 192 bytes.]
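The stated maxima follow directly from the parameters; a quick arithmetic check (illustrative only, using the slide's 4-Kbyte pages, 8-byte <count, pointer> entries and 4-page leaf blocks):

```python
PAGE = 4 * 1024                       # bytes per page
ENTRY = 4 + 4                         # one <count, pointer> pair: 4 + 4 bytes
FANOUT = PAGE // ENTRY                # entries per internal node = 512
LEAF = 4 * PAGE                       # data bytes per 4-page leaf block = 16 KB

two_level = FANOUT * LEAF             # root -> leaf blocks
three_level = FANOUT * FANOUT * LEAF  # root -> internal nodes -> leaf blocks

assert two_level == 8 * 1024 * 1024             # 8 Mbytes
assert three_level == 4 * 1024 * 1024 * 1024    # 4 Gbytes
```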


Tree-structured representation (cont.)

Notation: counts c[i] and pointers p[i], 1 ≤ i ≤ 2k+1. For convenience, c[0] = 0.

Retrieval algorithm: get a sequence of N bytes, starting at position S.

begin
  Read the root page P. Let start := S.
  while P is a non-leaf node do
    Save P to a stack.
    Find the smallest c[i] such that start ≤ c[i].  // e.g. by binary search
    Set start := start − c[i−1].                    // relative start index
    Read p[i] as the new page P.
  The first desired byte is at location start in P.  // now in a leaf
  For the rest of the N bytes, walk the tree in depth-first order using the stack.
end
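The retrieval algorithm can be sketched in Python. Recursion here plays the role of the explicit stack, and the node layout (a `(counts, children)` tuple for internal nodes, plain `bytes` for leaf blocks) is an assumption for illustration only:

```python
from bisect import bisect_left

def read_bytes(node, start, n):
    """Return n bytes of the object, starting at 1-based byte position start."""
    if isinstance(node, bytes):                  # leaf block: plain data
        return node[start - 1:start - 1 + n]
    counts, children = node                      # internal node
    out = b""
    while n > 0 and start <= counts[-1]:
        i = bisect_left(counts, start)           # smallest i with start <= c[i]
        prev = counts[i - 1] if i > 0 else 0     # c[i-1], with c[0] = 0
        piece = read_bytes(children[i], start - prev, n)
        out += piece                             # descend with a relative start
        n -= len(piece)
        start += len(piece)
    return out

# A toy object "abcdefghi" split over three leaves:
left = ([3, 5], [b"abc", b"de"])
right = ([4], [b"fghi"])
root = ([5, 9], [left, right])
```

For example, `read_bytes(root, 4, 4)` returns `b"defg"`: the search descends by counts to the leaf holding byte 4, then continues depth-first into the following leaves.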


Tree-structured representation (cont.)

Insert algorithm: add a sequence of N bytes after position S.

begin
  Search byte position S as above, but on the path down, update the byte
  counts to reflect the insertion and save the path in a stack.
  Denote the reached leaf by L.
  if the N bytes fit in L then
    do the insert within L
  else
    Allocate a sufficient number of new leaves, and distribute L's old
    bytes and the N new bytes evenly among the leaves.
    Propagate the new counts and pointers upwards (using the stack).
    If an internal node overflows, it is handled in the same way as a
    leaf overflow.
end

Note: Space utilization can be improved by inspecting the left and right neighbours of the found leaf and using their available free space.
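The leaf-overflow step above (even redistribution over newly allocated leaves) can be sketched as follows; `LEAF_CAPACITY` is an assumed parameter, and the propagation of counts up the tree is omitted:

```python
LEAF_CAPACITY = 4096   # assumed leaf-block size in bytes

def split_evenly(old: bytes, new: bytes, pos: int):
    """Insert `new` after byte position `pos` of `old` and return the
    list of leaf contents after an even redistribution."""
    data = old[:pos] + new + old[pos:]
    n_leaves = -(-len(data) // LEAF_CAPACITY)      # ceiling division
    base, extra = divmod(len(data), n_leaves)
    leaves, at = [], 0
    for i in range(n_leaves):
        size = base + (1 if i < extra else 0)      # sizes differ by at most 1
        leaves.append(data[at:at + size])
        at += size
    return leaves
```

Because the split is even, every resulting leaf is at least half-full, preserving the B+-tree invariant.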


Tree-structured representation (cont.)

Append algorithm: add N bytes to the end of an object (a special case of insert).

begin
  Walk the rightmost path of the tree, add N to the counts, and save the
  path in a stack.
  if the rightmost leaf R has N free bytes then
    do the appending there, and stop
  else
    Access R's left neighbour L. Allocate as many new leaves as required
    to accommodate L's and R's bytes plus the N new ones. Fill all but
    the last two pages completely, and the last two evenly (both become
    at least half-full).
    Propagate the counts and pointers upwards, using the stack. Handle
    internal node overflows as in insert.
end

Note: The advantage of this special-cased insert is that it allows large objects to be built in pieces; the next piece fills the last two non-full leaves.


Tree-structured representation: Observations

The organization is quite effective in practice:

  • Storage utilization is 70% for simple and 80% for advanced insertion.
  • The complexity of locating the correct position is theoretically O(log N), in practice almost constant.
  • Access speed is some tens of milliseconds, depending on disk speed and buffering; not the best choice for streaming media.

Extension: versioning of large objects.

  • Common parts of different versions can be shared.
  • Updates must not invalidate old versions: nodes on the update path must be copied before changing.
  • Old versions are not updated, but deletion should be allowed: avoid deleting nodes shared by other versions. An expensive way: mark the nodes of all other versions, and then discard the unmarked ones.


Advanced 2-level representation

Example architecture: the Starburst long field manager (an experimental DBMS, developed at the IBM research center). It suggests an elegant and extremely fast 2-level scheme for long fields.

Key idea: build the field by allocating variable-size (with sizes on an exponential scale), physically contiguous disk extents; not arbitrary sizes, nor arbitrary starting points.

Buddy system:

  • In a buddy space of 2^n pages, buddy segments can be allocated so that a segment of size 2^k can start at address 0, 2^k, 2×2^k, 3×2^k, …
  • Two same-sized (2^k) consecutive segments are buddies if their concatenation is a legal buddy segment of size 2^(k+1).
  • The address of a segment XORed with its size gives the address of its buddy.

Advantage: shorter pointers, because the repertoire of segment sizes is restricted.
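The XOR property above can be shown in a minimal sketch, with addresses and sizes expressed in pages:

```python
def buddy_address(addr: int, size: int) -> int:
    """Address of the buddy of the segment starting at `addr`.
    `size` must be a power of two and `addr` a multiple of `size`."""
    return addr ^ size

# A segment of size 4 at page 8 has its buddy at page 12, and vice versa;
# their concatenation (pages 8..15) is a legal buddy segment of size 8.
assert buddy_address(8, 4) == 12
assert buddy_address(12, 4) == 8
```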


Memory architecture in Starburst

The whole external memory is divided into database spaces, which may correspond to e.g. separate disks. Each database space contains an array of buddy spaces. A buddy space consists of:

  • An allocation page (a specially coded segment index)
  • 2^n data pages

Fragmentation (the normal problem of a buddy system) is partially avoided because a long field can be built from several segments: for any long field, less than one disk page is lost due to fragmentation.

[Figure: a buddy space of 2^n data pages with example buddy segments marked, at positions 2^(n-1), 2^(n-2), 3·2^(n-3) and 5·2^(n-4).]


Long field descriptor in Starburst

The descriptor is a directory to the field's components. It is at most 255 bytes in size and is stored in the record where the long field logically belongs.

The descriptor's components:

  • Database space id
  • Field size
  • Number of buddy segments
  • Sizes of the first and last segment
  • Pointers (= offsets) to the buddy segments

The key to keeping the field descriptor small is to have exponentially growing segment sizes.
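The descriptor layout can be sketched as a plain record; the field names and types below are assumptions, since the slides only list the components. Intermediate segment sizes need not be stored: under the doubling rule they follow from the first and last sizes.

```python
from dataclasses import dataclass

@dataclass
class LongFieldDescriptor:
    """Hypothetical layout of a Starburst-style long field descriptor."""
    db_space_id: int       # database space id
    field_size: int        # total field size in bytes
    num_segments: int      # number of buddy segments
    first_seg_size: int    # size of the first segment (pages)
    last_seg_size: int     # size of the last segment (pages)
    seg_offsets: list      # pointers (= offsets) to the buddy segments
```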


Descriptor usage: schematic example

[Figure: a 'Person' table with columns PID, Name, Addr, Photo and rows 12345/Jones/Miami, 23456/Smith/LA, 11223/Blake/Dallas, 33211/Brown/Denver, 54321/Clark/NYC; each Photo field holds a LOB descriptor of at most 255 bytes, pointing to the segments that store the photo.]


Long field creation in Starburst

If the size is known in advance, sufficiently large segment(s) are used right from the start.

If the size is not known a priori, successively allocated segments double in size: 1 page, 2 pages, 4 pages, … until either the field ends or the maximum segment size (2^n) is reached. A sequence of maximum-size segments is then allocated according to need.

The last segment is trimmed to the nearest page boundary; e.g. trimming a 16-page segment to 11 pages results in contiguous segments of sizes 8+2+1, leaving segments of sizes 4+1 free.

The maximum field size with 4-Kbyte pages and a maximum buddy segment size of 8 Mbytes is 448 Mbytes.

The Starburst long field manager is geared for fast sequential processing of e.g. images, sound and video.
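The doubling-and-trimming rule above can be sketched as follows (an illustrative sketch; the actual allocator also records offsets and returns the trimmed-off part to the free lists):

```python
def allocate(field_pages: int, max_seg: int):
    """Return the allocated segment sizes (in pages) for a field whose
    final size is not known in advance."""
    segments, size, left = [], 1, field_pages
    while left > 0:
        if size >= left:
            # trim: decompose the remainder into powers of two, largest
            # first, so the kept pieces are contiguous
            segments += [1 << b for b in range(left.bit_length() - 1, -1, -1)
                         if left & (1 << b)]
            left = 0
        else:
            segments.append(size)
            left -= size
            size = min(size * 2, max_seg)    # double up to the maximum
    return segments
```

For the 74-page example with a maximum segment size of 16 pages, this yields 1, 2, 4, 8, 16, 16, 16 and then 8+2+1 from the trimmed last segment.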


Example of segment allocation

Assume a long field size of 74 pages and a maximum segment size of 16 pages.

[Figure: allocated segments of 1, 2, 4, 8, 16, 16 and 16 pages, followed by a 16-page segment trimmed to 11 pages (contiguous segments of 8+2+1), totalling 74 pages; free segments of 4 and 1 pages remain.]


Free space management in Starburst

Free space lists:

  • n linked lists of free segments, plus an array of headers
  • Each list contains segments of the same size (1, 2, 4, 8, …, 2^n pages)

Space allocation:

  1. Choose the best-fit block.
  2. If the block is too large, split it in half, possibly several times, and add the remaining pieces to the related lists according to their size.

Releasing a segment:

  1. Inspect the buddy; if it is free, combine the segments (and repeat).
  2. The combined pieces are removed from their free space lists.
  3. The new (largest possible) free segment is added to the related list.

Update operations: only appending to long fields is supported; this is sufficient for streaming media.
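The release procedure can be sketched with free lists keyed by segment size; the dictionary-of-sets representation is an assumption standing in for the linked lists and header array:

```python
free_lists = {}   # segment size (pages) -> set of free segment addresses

def release(addr: int, size: int, space: int):
    """Return segment (addr, size) inside a buddy space of `space` pages,
    merging with the buddy (address XOR size) as long as it is free."""
    while size < space:
        buddy = addr ^ size
        if buddy not in free_lists.get(size, set()):
            break
        free_lists[size].remove(buddy)   # combine with the free buddy
        addr = min(addr, buddy)          # the merged segment starts lower
        size *= 2
    free_lists.setdefault(size, set()).add(addr)
```

Releasing pages 8..11 and then 12..15 of a 16-page space leaves one free 8-page segment at address 8, as the two size-4 buddies are combined.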


Free space management schematically

[Figure: an array of headers of free segment lists, each header pointing to a linked list of same-size free segments: segments of 1 page, 2 pages, 4 pages, 8 pages, …]


Recent approaches to ’Big Data’

Multimedia collections are one example of a Big Data application (others occur e.g. in meteorology, bioinformatics, business informatics and sensor networks).

Data volumes are of petabyte order (a PB is about 10^15 bytes); e.g. Facebook's disk space is 100 Petabytes = 100 000 Terabytes.

Distributed storage and access. Examples of large file systems:

  • Global File System (GFS2, by Red Hat)
  • General Parallel File System (GPFS, by IBM; used in supercomputers)
  • Google File System (GFS, GoogleFS; used in Google's cloud services)
  • Hadoop Distributed File System (HDFS, by Apache; used at Facebook)
  • Amazon S3 filesystem ('Simple Storage Service')


Features of the Google File System

Cluster of parallel storage servers:

  • One master server, handling the metadata (in main memory)
  • Several chunk servers, storing the actual data

Data is replicated for recovery; checksums are used to notice corruption.

Designed for large reading streams: random access and local updates are possible but not very efficient.

Designed for big files: the chunk size is 64 Mbytes; smaller files are grouped into chunks.

The main type of update is append (extending the end); concurrent appending to the same file is supported.

Lazy garbage collection.

Designed in 2003; later complemented by Google's BigTable 'database' and the MapReduce parallelization scheme.