Database Systems
Indexes
DBMSs and ”NoSQL”
1
Database Systems Indexes DBMSs and NoSQL 1 Quiz! How costly is - - PowerPoint PPT Presentation
Database Systems Indexes DBMSs and NoSQL 1 Quiz! How costly is this operation (naive solution)? course per weekday hour room TDA357 3 HC1 Monday 15:15 TDA357 3 HC1 Thursday 10:00 TDA357 2 HB1 Tuesday 08:00 n TDA357
1
SELECT * FROM Lectures WHERE course = ’TDA357’ AND period = 3;
course per weekday hour room
TDA357 3 HC1 Monday 15:15 TDA357 3 HC1 Thursday 10:00 TDA357 2 HB1 Tuesday 08:00 TDA357 2 HB1 Friday 13:15 TIN090 1 HC1 Wednesday 08:00 TIN090 1 HA3 Thursday 13:15
n
Go through all n rows, compare with the values for course and period = 2n comparisons
2
SELECT * FROM Lectures WHERE course = ’TDA357’ AND period = 3;
If rows were stored sorted according to the values course and period, we could get all rows with the given values faster (O(log n) for tree structure). Storing rows sorted is expensive, but we can use an index that given values of these attributes points out all sought rows (an index could be a hash map, giving O(1) complexity to lookups).
3
4
The asymptotic complexity works for data structures in main
running time of the data structure, once in main memory, is negligible compared to the time it takes to read data from
to minimize the number of disk blocks accessed (could use asymptotic complexity over disk block accessing though). Indexes help here too though. If a relation is stored over a number of disk blocks, knowing in which of these to look is helpful.
5
6
SELECT * FROM Lectures WHERE course = ’TDA357’ AND period = 3;
7
SELECT * FROM Lectures, Courses WHERE course = code;
Go through all n blocks in Lectures, compare the value for course from each row with the values for code in all rows of Courses, stored in all m
accessed disk blocks.
Lectures: n disk blocks Courses: m disk blocks
Index on code in Courses: No index: Go through all n blocks in Lectures, compare the value for course from each row with the index. Since course is a key, each value will exist at most once, so the cost is 2 * n + 1 accessed disk blocks (1 for fetching the index once).
8
9
10
11
12
SELECT * FROM Lectures WHERE course = ’TDA357’ AND period = 3; Lectures: n disk blocks SELECT * FROM Lectures WHERE weekday = ’Monday’ AND room = ’HC1’;
A multi-attribute index is typically organized hierarchically. First the rows are indexed according to the first attribute, then according to the second within each group, and so on. Thus the left query costs at most k + 1 where k is the number of rows matching the values. The right query can’t use the index, and thus costs n, where n is the size of the relation in disk blocks.
13
14
The amortized cost depends on the distribution of the operations. p1 is proportion of operations that are Q1 queries, p2 similarly for Q2, and thus the proportion of operations that are Ins modifications is 1 – p1 – p2. For some different values of p1 and p2 we get actual costs of:
6 – 3p1 + 2p2 4 + 16p1 + 4p2 4 – p1 + 16p2 2 + 18p1 + 18p2 cost 8 8 20 20 Q2 3 20 3 20 Q1 6 4 4 2 Ins Both indexes Index for room Index for (course, period, weekday) No index Indexes 2 + 18p1 + 18p2 4 – p1 + 16p2 4 + 16p1 + 4p2 6 – 3p1 + 2p2 p1 = p2 = 0.4 16.4 10 12 5.6 p1 = p2 = 0.1 5.6 5.5 6 5.9 p1 = 0.6, p2 = 0.3 18.2 8.2 14.8 4.8
Insert new lectures (Ins) List all lectures of a particular course (Q1) List all lectures in a given room (Q2)
15
16
17
18
Oracle 40% IBM DB2 30% Microsoft SQL Server 15% Sybase 3%
(…)
MySQL 1% PostgreSQL 0.5% (Microsoft Access)
19
20
– Subqueries (!) added in recent release – FK constraints only with non-standard backend – ACID transactions only (crudely) with non-standard backends – Not optimized for joins
– No CHECK constraints (including assertions) – No sequencing (WITH)
– Early support in PHP helped boost
21
– Efficient support for joins
– Full constraint support – except assertions! – Full ACID transactions – Sequencing (WITH)
22
23
http://xkcd.com/327/ The possibility for SQL injection attacks has lead development away from literal SQL, towards higher-level interfaces, tools and libraries.
24
– it’s a selling point!
25
26
– The archetypical RDBMS serves a bank.
– MASSIVE+++ amounts of data (e.g. eBay) – Super-fast indexing of documents (e.g. Google) – Serving pages on high-traffic websites (e.g. Facebook) – Streaming media (e.g. Spotify)
27
28
29
– Google originally; Hadoop (Apache), …
– BigTable (Google), Cassandra (Apache), …
– CouchDB, MongoDB, SimpleDB, …
– Neo4j, …
– (Native) XML databases
30
– Reader :: file → [input record] – Map :: input record → <key, value> – Reduce :: <key, [value]> → [output record] – Writer :: [output record] → file
– Consistency, atomicity, distribution and parallelism, ”glue”
– Running simple queries over all data at once
31
32
33
there isn’t anything new when I click the icon??!?”
34
35
36
– Nodes = ”entities” – Properties = ”tags”, attribute values – Edges connect
– All entities that share a common property – Computing association paths
37
– The type of each ”entity” is its own business. – Labels indicate meanings of substructures.
38
39
40
41