SLIDE 1 MMDB-2 J. Teuhola 2012 19
2. Management of large objects
LOB = Large OBject
A ‘normal’ DBMS regards a LOB as one field with no internal structure.
Traditional business-oriented relational DBMSs:
- Maximum field length is e.g. 255 or 32767 bytes.
- Media objects are usually considerably larger.
Today’s relational DBMSs support field lengths of several Gbytes, but
- It is wasteful to access the whole object if only a piece is needed.
- The long object may not fit in main memory.
- Piecewise processing should be supported.
- The logical structure is handled by higher-level software.
- A log file is needed for recovery from errors. Logging a whole object is very inefficient if only a small part of it is affected.
- Secondary storage management should be more flexible: multiple page sizes or multiple-size clusters of pages would enhance the I/O for variable-length objects.
SLIDE 2
SQL and long fields
Long data types:
- Character large object (CLOB), content e.g. HTML, XML
- Binary large object (BLOB), sequence of 8-bit octets, content e.g. MP3 or JPEG
- External, read-only file (BFILE), content e.g. AVI, MPEG
Operations:
- Concatenation
- Substring (from a start position for a given length)
- Overlay (substring replacement)
- Trim (remove given leading/trailing characters)
- Length (function returning the number of characters)
- Position (start position of searched substring)
But: GROUP BY, ORDER BY, joins, set operations, etc. are not supported.
SLIDE 3
Tree-structured representation
B-tree-type multi-level directory: used e.g. in SQL Server, Oracle, …
Example architecture: EXODUS storage system (extensible OODBMS)
- Very flexible management of large objects that can grow and shrink at arbitrary positions.
- Not optimized for sequential processing speed (best for long text documents).
- Each object has a unique OID = <page no, slot no>.
Two kinds of objects:
(1) Small objects fit in one page.
(2) Large objects occupy multiple pages; the OID points to the header.
Two kinds of pages:
(1) Slotted pages contain small objects and headers of large objects.
(2) Other pages contain parts of large objects, each page being private to one object only.
When a small object grows larger than a page, it is converted automatically into a large object.
SLIDE 4
Page allocation schematically
[Figure: slotted pages hold the small objects, the headers of LOB x and LOB y, and free space; separate LOB pages hold the pages of LOB x and the pages of LOB y.]
SLIDE 5
Tree-structured representation (cont.)
Physical representation: B+-tree, indexed on byte positions within the object.
The root is a header for the large object.
Internal nodes hold a <count, pointer> pair for each child.
- Count means the highest relative byte number (= offset within the current node's subtree) covered by the subtree rooted at that child.
- Pointer means page id (address).
The count of the rightmost child is the size of the (sub)tree rooted at the current node.
The number of <count, pointer> pairs in a node is between k and 2k+1 (i.e. nodes are at least about half-full), where degree k is the B+-tree parameter.
Internal nodes occupy one page each.
Leaves are blocks of one or more pages (system parameter).
Leaf blocks contain nothing but actual data. Leaves, too, can vary from half-full to full.
SLIDE 6
Tree-structured representation: Example
Maximal object sizes for 4Kbyte pages, 4-byte pointers,
4-byte counts and 4-page leaf blocks:
- 2-level tree: 8 Mbytes
- 3-level tree: 4 Gbytes
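[Figure: example tree. The OID points to the root, which holds the counts 421 and 786. The left internal node holds the counts 120, 282, 421 and the right internal node the counts 192, 365. The five leaf blocks contain 120, 162, 139, 173 and 192 bytes.]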
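The maximal object sizes quoted on this slide can be checked with a short calculation. The sketch below is only an illustration: the function name and constants are chosen here, and it assumes that a whole page of every internal node (including the root) is available for <count, pointer> pairs, ignoring any header overhead.

```python
PAGE_SIZE = 4 * 1024      # 4 Kbyte pages
PAIR_SIZE = 4 + 4         # 4-byte count plus 4-byte pointer
LEAF_PAGES = 4            # pages per leaf block


def max_object_size(levels: int) -> int:
    """Upper bound in bytes for a tree with `levels` levels, leaves included."""
    fanout = PAGE_SIZE // PAIR_SIZE          # 512 pairs fit in one internal page
    leaf_bytes = LEAF_PAGES * PAGE_SIZE      # 16 Kbytes of data per leaf block
    return fanout ** (levels - 1) * leaf_bytes


for levels in (2, 3):
    print(levels, "levels:", max_object_size(levels) // 2**20, "Mbytes")
```

With these assumptions a 2-level tree (root plus leaves) holds 512 leaf blocks of 16 Kbytes, and every extra level multiplies the capacity by 512.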
SLIDE 7
Tree-structured representation (cont.)
Notations:
Counts: c[i], pointers p[i], 1 ≤ i ≤ 2k+1. For convenience, c[0] = 0.
Retrieval algorithm: Get a sequence of N bytes, starting at S.
begin
  Read the root page P.
  Let start = S.
  while P is a non-leaf node do
    Save P to a stack.
    Find the smallest c[i] such that start ≤ c[i].   // e.g. binary search
    Set start := start − c[i−1].                     // relative start index
    Read p[i] as the new page P.
  The first desired byte is at location start in P.  // P is now a leaf
  For the rest of the bytes, walk the tree in depth-first order using the stack.
end
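The retrieval algorithm can be sketched in Python as follows. This is a simplified illustration, not the EXODUS code: nodes are in-memory objects instead of disk pages, recursion replaces the explicit stack, and the names `Internal` and `read` are invented here. Byte positions are 1-based, as in the algorithm above.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Internal:
    counts: List[int]          # c[1..m]: cumulative byte counts of the children
    children: List["Node"]     # p[1..m]: object references stand in for page ids


Node = Union[Internal, bytes]  # a leaf is represented by its data bytes only


def read(node: Node, start: int, n: int) -> bytes:
    """Return up to n bytes beginning at 1-based position `start` within `node`."""
    if isinstance(node, bytes):                       # leaf: take the slice
        return node[start - 1:start - 1 + n]
    out = b""
    i = bisect_left(node.counts, start)               # smallest i with start <= c[i]
    while n > 0 and i < len(node.children):
        before = node.counts[i - 1] if i > 0 else 0   # c[i-1], with c[0] = 0
        piece = read(node.children[i], start - before, n)
        out += piece
        n -= len(piece)
        start += len(piece)                           # continue in the next child
        i += 1
    return out
```

The loop over the following children plays the role of the depth-first walk: once the first leaf is exhausted, reading continues at relative position 1 of the next subtree.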
SLIDE 8
Tree-structured representation (cont.)
Insert algorithm: Add a sequence of N bytes after position S.
begin
  Search byte position S, as above, but on the path down, update the byte counts to reflect the insertion and save the path in a stack.
  Denote the reached leaf by L.
  if N bytes fit in L then
    do the insert within L
  else
    Allocate a sufficient number of new leaves, and distribute L’s old bytes and the N new bytes evenly among the leaves.
    Propagate the new counts and pointers upwards (use the stack).
    If an internal node overflows, it is handled in a similar way as the leaf overflow.
end
Note: Space utilization can be improved by inspecting the left and right neighbours of the found leaf, and using the available free space.
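The leaf-level step of the insert can be illustrated with a small sketch. It covers only the simple variant (no inspection of neighbours) and leaves out the propagation of counts and pointers; the function name and the `capacity` parameter (bytes per leaf block) are assumptions made for this example.

```python
from typing import List


def insert_into_leaf(leaf: bytes, offset: int, new: bytes, capacity: int) -> List[bytes]:
    """Insert `new` after the first `offset` bytes of `leaf`.

    Returns one leaf if everything fits, otherwise several leaves among
    which the old and new bytes are distributed evenly.
    """
    merged = leaf[:offset] + new + leaf[offset:]
    if len(merged) <= capacity:
        return [merged]                               # the insert fits within L
    n_leaves = -(-len(merged) // capacity)            # ceiling division
    base, extra = divmod(len(merged), n_leaves)
    leaves, pos = [], 0
    for i in range(n_leaves):
        size = base + (1 if i < extra else 0)         # sizes differ by at most one byte
        leaves.append(merged[pos:pos + size])
        pos += size
    return leaves
```

Because the minimal number of leaves is used and the bytes are spread evenly, each resulting leaf is more than half-full whenever a split occurs.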
SLIDE 9
Tree-structured representation (cont.)
Append algorithm: Add N bytes to the end of an object. (Special case of insert)
begin
  Walk the rightmost path of the tree, add N to the counts, and save the path in a stack.
  if the rightmost leaf R has N free bytes then
    do the appending there, and stop
  else
    Access R’s left neighbour L.
    Allocate as many new leaves as required to accommodate L’s and R’s bytes plus the N new ones.
    Fill all but the last two pages completely, and the last two evenly (both become at least half-full).
    Propagate the counts and pointers upwards, using the stack.
    Handle internal node overflows as in insert.
end
Note: The advantage of this special insert is that it allows large objects to be built in pieces. The next piece fills the last two non-full leaves.
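The redistribution rule of the append algorithm ("fill all but the last two completely, and the last two evenly") can be expressed as a small calculation. The sketch below is illustrative only; the function name is invented, and `total_bytes` stands for the bytes of L and R plus the N new ones.

```python
from typing import List


def leaf_sizes_for_append(total_bytes: int, capacity: int) -> List[int]:
    """Sizes of the leaves that replace L and R after an append."""
    n_leaves = -(-total_bytes // capacity)        # ceiling division
    if n_leaves < 2:
        return [total_bytes]                      # everything fits in one leaf
    full = [capacity] * (n_leaves - 2)            # all but the last two are full
    rest = total_bytes - sum(full)                # capacity < rest <= 2 * capacity
    return full + [rest - rest // 2, rest // 2]   # the last two are filled evenly
```

Since the remainder for the last two leaves exceeds one leaf capacity, each of them ends up at least half-full, which is the invariant stated in the algorithm.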
SLIDE 10
Tree-structured representation: Observations
The organization is quite effective in practice:
- Storage utilization is 70% for simple and 80% for advanced insertion.
- Complexity of locating the correct position is theoretically O(log N), in practice almost constant.
- Access speed is some tens of milliseconds, depending on disk speed and buffering.
- Not the best choice for streaming media.
Extension: Versioning of large objects
- Common parts of different versions can be shared.
- Updates must not invalidate old versions: nodes on the update path must be copied before they are changed.
- Old versions are not updated, but deletion should be allowed: avoid deleting nodes shared by other versions. Expensive way: mark the nodes of all other versions, and then discard the unmarked ones.
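The idea of copying only the nodes on the update path can be sketched as follows. This is a generic path-copying illustration, not the actual EXODUS versioning code; the names `Inner` and `set_byte` are invented, nodes live in memory, and the update is restricted to overwriting a single byte so that counts do not change.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Inner:
    counts: Tuple[int, ...]         # cumulative byte counts of the children
    children: Tuple["Tree", ...]    # subtrees, shared between versions if unchanged


Tree = Union[Inner, bytes]          # a leaf is an immutable bytes object


def set_byte(tree: Tree, pos: int, value: int) -> Tree:
    """Return the root of a new version whose 1-based byte `pos` equals `value`."""
    if isinstance(tree, bytes):
        return tree[:pos - 1] + bytes([value]) + tree[pos:]   # copied leaf
    i = bisect_left(tree.counts, pos)                          # child holding `pos`
    before = tree.counts[i - 1] if i > 0 else 0
    kids = list(tree.children)
    kids[i] = set_byte(kids[i], pos - before, value)
    return Inner(tree.counts, tuple(kids))                     # copied path node
```

The old root still describes the old version, while the new root shares every subtree that lies off the update path.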
SLIDE 11
Advanced 2-level representation
Example architecture: Starburst long field manager.
(Experimental DBMS, developed at IBM research center.)
Suggests an elegant and extremely fast 2-level scheme for long fields.
Key idea: Build the field by allocating variable-size (with size units of exponential scale), physically contiguous disk extents.
Not arbitrary sizes, nor arbitrary starting points.
Buddy system:
- In a buddy space of 2^n pages, buddy segments can be allocated so that a segment of size 2^k can start at address 0, 2^k, 2×2^k, 3×2^k, …
- Two same-sized (2^k) consecutive segments are buddies if their concatenation is a legal buddy segment of size 2^(k+1).
- The address of a segment XORed with its size gives the address of its buddy.
Advantage: Shorter pointers, because the repertoire of segment sizes is restricted.
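The buddy addressing rule can be illustrated in a few lines. The function names are chosen for this example; addresses and sizes are measured in pages, and sizes are assumed to be powers of two.

```python
def is_legal_segment(addr: int, size: int) -> bool:
    """A segment of 2^k pages may start only at a multiple of 2^k."""
    is_power_of_two = size > 0 and (size & (size - 1)) == 0
    return is_power_of_two and addr % size == 0


def buddy_of(addr: int, size: int) -> int:
    """Address of the buddy of the segment starting at `addr`."""
    if not is_legal_segment(addr, size):
        raise ValueError("not a legal buddy segment")
    return addr ^ size        # flips the single bit that tells the two buddies apart
```

For example, the 4-page segments at addresses 8 and 12 are buddies of each other, and together they form the legal 8-page segment at address 8.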
SLIDE 12
Memory architecture in Starburst
The whole external memory is divided into database spaces, which may correspond to e.g. separate disks.
Each database space contains an array of buddy spaces.
A buddy space consists of
- An allocation page (specially coded segment index)
- 2^n data pages (buddies marked)
[Figure: a buddy space with buddy boundaries marked at 2^n, 2^(n-1), 2^(n-2), 3·2^(n-3) and 5·2^(n-4) pages.]
Fragmentation (normal problem of buddy system) is partially avoided because the long field can be built from several segments.
For any long field, less than one disk page is lost due to fragmentation.
SLIDE 13
Long field descriptor in Starburst
The descriptor is a directory to the field components.
The descriptor size is at most 255 bytes, and it is stored in the record to which the long field logically belongs.
The descriptor components:
- Database space id
- Field size
- Number of buddy segments
- Sizes of the first and last segment
- Pointers (= offsets) to the buddy segments
The key solution to keep the field descriptor small is to have exponentially growing segment sizes.
SLIDE 14
Descriptor usage: schematic example
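‘Person’ table:
PID    Name   Addr    Photo
12345  Jones  Miami   LOB descriptor, max 255 bytes
23456  Smith  LA      LOB descriptor, max 255 bytes
11223  Blake  Dallas  LOB descriptor, max 255 bytes
33211  Brown  Denver  LOB descriptor, max 255 bytes
54321  Clark  NYC     LOB descriptor, max 255 bytes
[Figure: a LOB descriptor in the Photo column points to the segments storing the photo.]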
SLIDE 15
Long field creation in Starburst
If the size is known in advance, sufficiently large segment(s) are used right at the start.
If the size is not known a priori, successively allocated segments double in size: 1 page, 2 pages, 4 pages, … until the field ends or the maximum segment size (2^n) is reached. After that, a sequence of maximum-size segments is allocated as needed.
The last segment is trimmed to the nearest page boundary; e.g. trimming a 16-page segment to 11 pages results in contiguous segments of sizes 8+2+1, leaving segments of sizes 4+1 free.
The maximum field size with 4 Kbyte pages and a maximum buddy segment size of 8 Mbytes is 448 Mbytes.
The Starburst long field manager is geared for fast sequential processing of e.g. images, sound and video.
SLIDE 16
Example of segment allocation
Assume: long field size 74 pages, max segment size 16 pages
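[Figure: allocated segments of 1, 2, 4, 8, 16, 16, 16 and 11 pages, 74 pages in total. The last one is a 16-page segment trimmed to 11 pages, which leaves segments of 1 and 4 pages free.]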
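The allocation pattern of this example can be reproduced with a short sketch. It is an illustration of the doubling and trimming rules only, with invented function names; sizes are in pages, and the field size is passed in merely to know when the field ends.

```python
from typing import List


def allocate_unknown_size(field_pages: int, max_segment: int) -> List[int]:
    """Segment sizes for a field that is written without knowing its size."""
    segments, size, remaining = [], 1, field_pages
    while remaining > 0:
        if remaining >= size:
            segments.append(size)
            remaining -= size
        else:
            segments.append(remaining)        # last segment, trimmed to a page boundary
            remaining = 0
        size = min(size * 2, max_segment)     # double until the maximum is reached
    return segments


def binary_parts(pages: int) -> List[int]:
    """Powers of two that sum to `pages`, largest first."""
    return [1 << b for b in range(pages.bit_length() - 1, -1, -1) if (pages >> b) & 1]


print(allocate_unknown_size(74, 16))   # segment sizes of the example
print(binary_parts(11))                # buddy segments kept after trimming 16 to 11
print(binary_parts(16 - 11))           # buddy segments returned to free space
```

The binary decomposition shows why trimming is cheap in a buddy system: both the kept prefix and the released suffix of a 2^k-page segment split into legal buddy segments.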
SLIDE 17
Free space management in Starburst
Free space lists:
- n linked lists of free segments, plus an array of headers
- Each list contains segments of the same size (1, 2, 4, 8, ... 2^n pages)
Space allocation:
1. Choose the best-fit block.
2. If the block is too large, split it in half, possibly several times, and add the remaining pieces to the related lists according to their size.
Releasing a segment:
1. Inspect the buddy; if it is free, then combine the segments (& repeat).
2. The combined pieces are removed from their free space lists.
3. The new (largest possible) free segment is added to the related list.
Update operations: Only appending to long fields is supported; this is sufficient for streaming media.
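The allocation and release steps can be sketched as a small in-memory buddy allocator. This is a simplification made for illustration: Python sets replace the linked free lists, the class and method names are invented, and a segment is identified by its start address and the exponent k of its size 2^k.

```python
class BuddySpace:
    """Free space bookkeeping for a buddy space of 2^n pages."""

    def __init__(self, n: int) -> None:
        self.n = n
        # free[k] holds the start addresses of free segments of 2^k pages
        self.free = {k: set() for k in range(n + 1)}
        self.free[n].add(0)

    def allocate(self, k: int) -> int:
        """Allocate a segment of 2^k pages and return its start address."""
        j = k
        while j <= self.n and not self.free[j]:
            j += 1                                # best fit: smallest sufficient size
        if j > self.n:
            raise MemoryError("no free segment is large enough")
        addr = self.free[j].pop()
        while j > k:                              # split in half, keep the lower half
            j -= 1
            self.free[j].add(addr + (1 << j))     # upper half goes to its free list
        return addr

    def release(self, addr: int, k: int) -> None:
        """Free a segment of 2^k pages, merging with free buddies."""
        while k < self.n:
            buddy = addr ^ (1 << k)
            if buddy not in self.free[k]:
                break
            self.free[k].remove(buddy)            # combined piece leaves its list
            addr = min(addr, buddy)
            k += 1
        self.free[k].add(addr)                    # largest possible free segment
```

Allocating one page from an empty 16-page space leaves free segments of 1, 2, 4 and 8 pages; releasing everything again merges them back into the single 16-page segment.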
SLIDE 18
Free space management schematically
[Figure: an array of headers of free segment lists, each header starting a linked list of free segments of one size: segments of 1 page, segments of 2 pages, segments of 4 pages, segments of 8 pages, …]
SLIDE 19
Recent approaches to ’Big Data’
Multimedia collections are one example of a Big Data application.
(Others occur e.g. in meteorology, bioinformatics, business informatics, sensor networks, etc.)
Data volumes are of petabyte order (1 PB is about 10^15 bytes).
(Facebook disk space 100 PetaBytes = 100 000 Terabytes!)
Distributed storage and access
Examples of large file systems:
- Global File System (GFS2 by RedHat)
- General Parallel File System (GPFS by IBM) (used in supercomputers)
- Google File System (GFS, GoogleFS) (used in Google cloud services)
- Hadoop Distributed File System (HDFS by Apache) (used by Facebook)
- Amazon S3 filesystem (’Simple Storage Service’)
SLIDE 20
Features of the Google File System
Cluster of parallel storage servers:
- One master server, handling the metadata (in main memory)
- Several chunk servers, storing the actual data
Replication of data for recovery; checksums to notice corruption.
Designed for large reading streams.
Random access and local update are possible but not very efficient.
Designed for big files: the chunk size is 64 Mbytes.
Smaller files are grouped into chunks.
The main type of update is append (extending at the end).
Concurrent appending to the same file is supported.
Lazy garbage collection.
Designed in 2003, complemented later by Google’s BigTable ’database’ and the MapReduce parallelization scheme.
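With a fixed chunk size, a byte offset within a file maps to a chunk by simple integer arithmetic. The sketch below only illustrates that arithmetic for the 64 Mbyte chunks mentioned above; the function names are invented and nothing here is part of an actual GFS client interface.

```python
from typing import List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024      # 64 Mbytes


def locate(offset: int) -> Tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset within the chunk)."""
    return divmod(offset, CHUNK_SIZE)


def chunks_touched(offset: int, length: int) -> List[int]:
    """Indexes of all chunks covered by a read of `length` bytes at `offset`."""
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))
```

A long sequential read therefore touches only a few chunks, so the master needs to be consulted rarely for metadata, which fits the design goal of large reading streams.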