Topic for Thursday?
Miscellaneous Topics in Databases
PARALLEL DBMS
WHY PARALLEL ACCESS TO DATA?
At 10 MB/s: 1.2 days to scan 1 Terabyte
1,000 x parallel: 1.5 minutes to scan 1 Terabyte
Parallelism: divide a big
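The scan times above can be checked with a short calculation. This is a minimal sketch that assumes decimal units (1 TB = 10^12 bytes, 10 MB/s = 10^7 bytes/s) and a perfectly even split across disks; the function name is illustrative. It gives about 1.16 days and 1.67 minutes, in line with the rounded figures on the slide.

```python
# Back-of-the-envelope scan times. Decimal units are an assumption.
TERABYTE = 10**12      # bytes
RATE = 10 * 10**6      # bytes per second for one disk (10 MB/s)


def scan_seconds(size_bytes, rate_bytes_per_s, n_parallel=1):
    """Time to scan when the data is split evenly over n_parallel disks."""
    return size_bytes / (rate_bytes_per_s * n_parallel)


serial = scan_seconds(TERABYTE, RATE)
parallel = scan_seconds(TERABYTE, RATE, n_parallel=1000)

print(f"serial:   {serial / 86400:.2f} days")
print(f"parallel: {parallel / 60:.2f} minutes")
```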
Parallelism is natural to DBMS processing Pipeline parallelism: many machines each doing one
Partition parallelism: many machines doing the same
Both are natural in DBMS!
[Figure: sequential programs arranged as a pipeline and as partitions]
Speed-Up More resources means
Scale-Up If resources increased
Why Realistic <> Ideal?
[Figure: ideal vs. realistic curves for speed-up and scale-up]
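A small sketch of how the two measures are usually computed. The definitions used here are the standard ones (speed-up: same problem, more resources; scale-up: problem and resources grown by the same factor). The cost model with a per-node startup cost is purely an assumption, chosen to show one way a realistic curve can fall below the ideal one.

```python
def speedup(t_small_system, t_large_system):
    """Same problem on both systems; ideal value is the resource ratio N."""
    return t_small_system / t_large_system


def scaleup(t_small_problem_small_system, t_big_problem_big_system):
    """Problem and system both grown N times; ideal value is 1.0."""
    return t_small_problem_small_system / t_big_problem_big_system


def elapsed(work, n_nodes, startup_per_node=0.0):
    """Toy cost model (an assumption): divisible work plus a startup cost per node."""
    return work / n_nodes + startup_per_node * n_nodes


ideal = speedup(elapsed(1000, 1), elapsed(1000, 10))
realistic = speedup(elapsed(1000, 1, 1.0), elapsed(1000, 10, 1.0))
ideal_scaleup = scaleup(elapsed(1000, 1), elapsed(10000, 10))

print(ideal, round(realistic, 2), ideal_scaleup)
```

With the startup term switched on, ten nodes give a speed-up of about 9.1 instead of 10.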
Parallel machines are becoming quite common
Prices of microprocessors, memory and disks have
Recent desktop computers feature multiple processors
Databases are growing increasingly large
  large volumes of transaction data are collected and
  multimedia objects like images are increasingly stored
Large-scale parallel database systems
  storing large volumes of data
  processing time-consuming decision-support queries
  providing high throughput for transaction processing
Google data centers around the world, as of 2008
Data can be partitioned across multiple disks for
Individual relational operations (e.g., sort, join,
data can be partitioned and each processor can work
Results merged when done
Different queries can be run in parallel with each
Concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to
Horizontal partitioning (shard) involves putting different rows into different tables
  Ex: customers with ZIP codes less than 50000 are
Vertical partitioning involves creating tables with fewer columns and using
  partitions columns even when already normalized
  called "row splitting" (the row is split by its columns)
  Ex: split (slow to find) dynamic data from (fast to find)
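The two kinds of partitioning can be sketched on an in-memory table. This is only an illustration: the customer rows, the column names, and the choice of which columns count as static or dynamic are all hypothetical; only the ZIP boundary of 50000 comes from the slide.

```python
# Hypothetical customer rows used only for illustration.
customers = [
    {"id": 1, "name": "Ann", "zip": 10001, "last_login": "2024-01-05"},
    {"id": 2, "name": "Bob", "zip": 94105, "last_login": "2024-02-11"},
    {"id": 3, "name": "Cy", "zip": 49999, "last_login": "2024-03-02"},
]


def horizontal_partition(rows, zip_boundary=50000):
    """Split by rows: each row lands in exactly one of the two tables."""
    low = [r for r in rows if r["zip"] < zip_boundary]
    high = [r for r in rows if r["zip"] >= zip_boundary]
    return low, high


def vertical_partition(rows, key, static_cols, dynamic_cols):
    """Split by columns: both tables keep the key so rows can be rejoined."""
    static = [{key: r[key], **{c: r[c] for c in static_cols}} for r in rows]
    dynamic = [{key: r[key], **{c: r[c] for c in dynamic_cols}} for r in rows]
    return static, dynamic


low_zip, high_zip = horizontal_partition(customers)
static, dynamic = vertical_partition(customers, "id", ["name", "zip"], ["last_login"])
```

Keeping the key in both vertical partitions is what allows the original row to be rebuilt with a join.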
Evaluate how well partitioning techniques
E.g., r.A = 25.
E.g., 10 ≤ r.A < 25.
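One way to evaluate a partitioning technique is to count how many disks a query has to touch. The sketch below assumes the three usual techniques (round-robin, hash, range), which the surviving slide text does not name, along with four disks, an integer attribute, and a made-up range-partitioning vector.

```python
from bisect import bisect_right

N_DISKS = 4
# Hypothetical partitioning vector: <10, [10, 20), [20, 30), >=30
VECTOR = [10, 20, 30]


def disks_for_point(technique, value):
    """Disks that may hold tuples with r.A = value."""
    if technique == "round_robin":
        return set(range(N_DISKS))       # placement ignores the value
    if technique == "hash":
        return {value % N_DISKS}         # toy hash function
    if technique == "range":
        return {bisect_right(VECTOR, value)}
    raise ValueError(technique)


def disks_for_range(technique, lo, hi):
    """Disks that may hold tuples with lo <= r.A < hi (integer values assumed)."""
    if technique in ("round_robin", "hash"):
        return set(range(N_DISKS))       # matching values are scattered
    if technique == "range":
        first = bisect_right(VECTOR, lo)
        last = bisect_right(VECTOR, hi - 1)
        return set(range(first, last + 1))
    raise ValueError(technique)


for t in ("round_robin", "hash", "range"):
    print(t, sorted(disks_for_point(t, 25)), sorted(disks_for_range(t, 10, 25)))
```

Under these assumptions the point query touches one disk with hash or range partitioning, while the range query touches only two disks with range partitioning and all four otherwise.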
Balanced partitioning vector can be constructed from histogram in a relatively straightforward fashion
  Assume uniform distribution within each range of the histogram
Histogram can be constructed by scanning relation, or sampling (blocks containing) tuples of the relation
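A minimal sketch of that construction, assuming the histogram is given as (low, high, count) buckets and that values are spread uniformly inside each bucket. The bucket layout and the function name are illustrative assumptions.

```python
def balanced_partition_vector(histogram, n_partitions):
    """Return n_partitions - 1 cut points that give each partition roughly equal tuples.

    histogram: list of (low, high, count) buckets in increasing order.
    """
    total = sum(count for _, _, count in histogram)
    targets = [total * k / n_partitions for k in range(1, n_partitions)]
    vector = []
    seen = 0      # tuples in buckets already passed
    t = 0         # index of the next target to place
    for low, high, count in histogram:
        while t < len(targets) and count > 0 and seen + count >= targets[t]:
            # Uniformity assumption: interpolate linearly inside the bucket.
            fraction = (targets[t] - seen) / count
            vector.append(low + fraction * (high - low))
            t += 1
        seen += count
    return vector


hist = [(0, 10, 50), (10, 20, 30), (20, 30, 20)]
cuts = balanced_partition_vector(hist, 4)
print(cuts)
```

Half of the tuples sit in the first bucket, so two of the three cut points fall inside it.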
Queries/transactions execute in parallel with one
concurrent processing
Increases transaction throughput; used primarily
Easiest form of parallelism to support
Execution of a single query in parallel on multiple
Two complementary forms of intraquery
Intraoperation Parallelism – parallelize the
Interoperation Parallelism – execute the different
The join operation requires pairs of tuples to be
Parallel join algorithms attempt to split the pairs
In a final step, the results from each processor can
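The split, join locally, then collect pattern can be sketched as a partitioned equi-join. This is one possible scheme, not necessarily the one the slides go on to describe: both inputs are hash-partitioned on the join attribute, each partition pair is joined by a worker thread standing in for a processor, and the partial results are concatenated. The table contents are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, key, n):
    """Hash-partition rows on the join attribute into n buckets."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def local_join(r_part, s_part, r_key, s_key):
    """Hash join of one partition pair, as a single processor would do it."""
    index = {}
    for s in s_part:
        index.setdefault(s[s_key], []).append(s)
    return [(r, s) for r in r_part for s in index.get(r[r_key], [])]


def parallel_join(r, s, r_key, s_key, n=4):
    r_parts = partition(r, r_key, n)
    s_parts = partition(s, s_key, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(local_join, r_parts, s_parts, [r_key] * n, [s_key] * n)
        # Final step: collect the results from every worker.
        return [pair for part in results for pair in part]


employees = [{"emp": "a", "dept": 1}, {"emp": "b", "dept": 2},
             {"emp": "c", "dept": 1}, {"emp": "d", "dept": 9}]
departments = [{"dept": 1, "name": "R&D"}, {"dept": 2, "name": "Sales"},
               {"dept": 3, "name": "Ops"}]
joined = parallel_join(employees, departments, "dept", "dept", n=3)
pairs = sorted((r["emp"], s["name"]) for r, s in joined)
```

Because matching tuples always hash to the same partition, no worker needs data held by another.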
Query optimization in parallel databases is more complex than in sequential databases
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention
When scheduling an execution tree in a parallel system, we must decide:
  How to parallelize each operation and how many processors to use for it
  What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially
Determining the amount of resources to allocate for each
  E.g., allocating more processors than optimal can result in high communication overhead
Declarative Language
  Language to specify rules
Inference Engine (Deduction Machine)
  Can deduce new facts by interpreting the rules
Related to logic programming
  Prolog language (Prolog => Programming in logic)
  Uses backward chaining to evaluate
  Top-down application of the rules
Consists of:
  Facts
    Similar to relation specification without the necessity of including attribute names
  Rules
    Similar to relational views (virtual relations that are not stored)
Facts are provided as predicates
Predicate has
  a name
  a fixed number of arguments
Convention:
  Constants are numeric or character strings
  Variables start with upper case letters
E.g., SUPERVISE(Supervisor, Supervisee)
  States that Supervisor SUPERVISE(s) Supervisee
Rule
  Is of the form head :- body
  where :- is read as if and only if
  E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)
  E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)
Query
  Involves a predicate symbol followed by some variable
  E.g., SUPERIOR(james,Y)?
  E.g., SUBORDINATE(james,X)?
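The facts, rules, and queries above can be imitated in a few lines. This is a rough sketch, not a Datalog engine: the SUPERVISE facts are hypothetical, and the derived relations are computed bottom-up as sets, whereas Prolog would answer the same queries top-down by backward chaining. In a query pattern, None plays the role of a variable.

```python
# Hypothetical facts: SUPERVISE(Supervisor, Supervisee)
SUPERVISE = {("james", "franklin"), ("james", "jennifer"), ("mary", "james")}


def superior():
    """SUPERIOR(X,Y) :- SUPERVISE(X,Y)"""
    return {(x, y) for (x, y) in SUPERVISE}


def subordinate():
    """SUBORDINATE(Y,X) :- SUPERVISE(X,Y)"""
    return {(y, x) for (x, y) in SUPERVISE}


def query(relation, pattern):
    """Return the tuples matching pattern; None marks a variable position."""
    return sorted(
        t for t in relation
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


print(query(superior(), ("james", None)))      # SUPERIOR(james,Y)?
print(query(subordinate(), ("james", None)))   # SUBORDINATE(james,X)?
```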
Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%
Data mining is the exploration and analysis of large quantities
and ultimately understandable patterns in data.
Volume and dimensionality of the data
High data growth rate
Data Storage
Computational power
Off-the-shelf software
Expertise
Supervised learning
  Classification and regression
Unsupervised learning
  Clustering
Dependency modeling
  Associations, summarization, causality
Outlier and deviation detection
Trend analysis and change detection
32 32 32 32
Input data: 3 TB of image data with 2
Goal: Generate a catalog with all objects
Method: Use decision trees as data mining
Results:
  94% accuracy in predicting sky object classes
  Increased number of faint objects classified by
  Helped team of astronomers to discover 16 new
Two predictor attributes:
Age is ordered, Car-type is
Class label indicates
Dependent attribute is
Goals:
  To produce an accurate classifier/regression function
  To understand the structure of the problem
Requirements on the model:
  High accuracy
  Understandable by humans, interpretable
  Fast construction for very large training databases
A cluster is defined as a connected dense component.
Density is defined in terms of number of neighbors of
We can find clusters of arbitrary shape
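A compact sketch of density-based clustering in the style of DBSCAN. The slides do not name a specific algorithm, so the parameters `eps` (neighborhood radius) and `min_pts` (neighbors needed for a point to count as dense) are assumptions, as are the sample points.

```python
from math import dist


def density_clusters(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # A point counts as its own neighbor.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1               # not dense; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(nbrs)
        while frontier:                  # grow the connected dense component
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster      # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:
                frontier.extend(j_nbrs)
    return labels


pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11),
       (50, 50)]
labels = density_clusters(pts, eps=1.5, min_pts=3)
print(labels)
```

Because clusters grow by following dense neighborhoods rather than by distance to a center, they can take any shape.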
37 37 37 37
Consider a shopping cart filled with several items
Market basket analysis tries to answer the
  Who makes purchases?
  What do customers buy together?
  In what order do customers purchase items?
38 38 38 38
Co-occurrences
  80% of all customers purchase items X, Y and Z together.
Association rules
  60% of all customers who purchase X and Y also buy Z.
Sequential patterns
  60% of customers who first buy X also purchase Y within
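The first two kinds of pattern correspond to the usual support and confidence measures. A minimal sketch with made-up baskets, chosen so that the rule "X and Y implies Z" comes out at 60% confidence; the data and function names are illustrative only.

```python
# Hypothetical baskets: 3 contain X, Y and Z; 2 contain only X and Y.
baskets = [
    {"X", "Y", "Z"}, {"X", "Y", "Z"}, {"X", "Y", "Z"},
    {"X", "Y"}, {"X", "Y"},
    {"Z"}, {"X"}, {"Y", "Z"}, {"W"}, {"W", "Z"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item of itemset."""
    return sum(itemset <= b for b in baskets)


def support(itemset, baskets):
    """Co-occurrence: fraction of all baskets containing the itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Association rule: of the baskets with the antecedent, the fraction that also hold the consequent."""
    return count(antecedent | consequent, baskets) / count(antecedent, baskets)


print(support({"X", "Y", "Z"}, baskets))
print(confidence({"X", "Y"}, {"Z"}, baskets))
```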
Database that:
  Stores spatial objects
  Manipulates spatial objects just like other objects in
Data which describes either location or shape
In the abstract, reductionist view of the computer,
Not just interested in location, also interested in
The most common relationships are
  Proximity: distance
  Adjacency: “touching” and “connectivity”
  Containment: inside/overlapping
Geocodable addresses
Customer location
Store locations
Transportation
Statistical/Demographic
Cartography
Epidemiology
Crime patterns
Weather Information
Land holdings
Natural resources
City Planning
Environmental
Information
Hazard detection
52 52 52 52
Able to treat your spatial data like anything else
  transactions
  backups
  integrity checks
  less data redundancy
  fundamental organization and operations handled by
  multi-user support
  security/access control
  locking
53 53 53 53
Offset complicated tasks to the DB server
Significantly lowers the development time of
54 54 54 54
Spatial querying using SQL
  distance, adjacency, containment
  area, length, intersection, union, buffer
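To make a few of these operations concrete, here is a plain Python sketch of what distance, containment, and area compute on simple geometries. It is not the API of any spatial database; in a spatial DBMS these would be SQL functions, and the function names and the sample square here are assumptions.

```python
from math import dist


def area(polygon):
    """Shoelace formula for a simple polygon given as [(x, y), ...]."""
    n = len(polygon)
    total = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2


def contains(polygon, point):
    """Ray-casting test; points exactly on the boundary are not treated specially."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):         # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(area(square), contains(square, (2, 2)), contains(square, (5, 2)))
print(dist((0, 0), (3, 4)))
```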
Simple value of the proposed lot
  area(<my lot>) * <price per acre> + area(intersect(<my lot>, <forested area>)) * <wood value per acre>
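The valuation formula can be tried out on axis-aligned rectangles, where intersection is easy to compute by hand. This is a sketch under strong assumptions: lots and forest are rectangles (x1, y1, x2, y2), coordinates are scaled so that areas come out in acres, and all prices are invented.

```python
def rect_area(rect):
    """Area of (x1, y1, x2, y2); an empty or inverted rectangle has area 0."""
    x1, y1, x2, y2 = rect
    return max(0, x2 - x1) * max(0, y2 - y1)


def rect_intersect(a, b):
    """Overlap of two rectangles; may be inverted (area 0) if they do not meet."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def lot_value(lot, forest, price_per_acre, wood_value_per_acre):
    """area(lot) * price + area(intersect(lot, forest)) * wood value."""
    return (rect_area(lot) * price_per_acre
            + rect_area(rect_intersect(lot, forest)) * wood_value_per_acre)


my_lot = (0, 0, 4, 5)            # 20 acres (hypothetical)
forested_area = (2, 0, 10, 10)   # overlaps the lot on 10 acres
value = lot_value(my_lot, forested_area, price_per_acre=1000, wood_value_per_acre=200)
print(value)
```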
2001 election.
looking at voting in 1996.
the voting areas polygons.
voting and sum
voting
More advanced: also use demographic data.
Cost to implement can be high
Some inflexibility
Incompatibilities with some GIS software
Slower than local, specialized data structures
User/managerial inexperience and caution
Types: Basic Shapes, Multi-Shapes, Derived
[Figure: shape type diagram with Basic Shapes, Alternate Shapes, Multi-Shapes, Any Possible Shape, Derived Shapes, User Defined Shape]
Form an entity to hold county names, states,
Form an entity to hold river names, sources,