Miscellaneous Topics in Databases P ARALLEL DBMS W HY P ARALLEL A - - PDF document

miscellaneous topics in databases
SMART_READER_LITE
LIVE PREVIEW

Miscellaneous Topics in Databases P ARALLEL DBMS W HY P ARALLEL A - - PDF document

Topic for Thursday? 1 Miscellaneous Topics in Databases P ARALLEL DBMS W HY P ARALLEL A CCESS T O D ATA ? At 10 MB/s 1,000 x parallel 1.2 days to scan 1.5 minute to scan. 1 Terabyte 1 Terabyte Parallelism: 10 MB/s divide a big


slide-1
SLIDE 1

1

 Topic for Thursday?

Miscellaneous Topics in Databases

PARALLEL DBMS

slide-2
SLIDE 2

4

WHY PARALLEL ACCESS TO DATA?

1 Terabyte 10 MB/s At 10 MB/s 1.2 days to scan

1 Terabyte

1,000 x parallel 1.5 minute to scan.

Parallelism: divide a big problem into many smaller ones to be solved in parallel.

5

PARALLEL DBMS: INTRO

 Parallelism is natural to DBMS processing  Pipeline parallelism: many machines each doing one

step in a multi-step process.

 Partition parallelism: many machines doing the same

thing to different pieces of data.

 Both are natural in DBMS!

Pipeline Partition

Any Sequential Program Any Sequential Program Sequential Sequential Sequential Sequential Any Sequential Program Any Sequential Program

  • utputs split N ways, inputs merge

6

SOME || TERMINOLOGY

 Speed-Up  More resources means

proportionally less time for given amount of data.

 Scale-Up  If resources increased

in proportion to increase in data size, time is constant.

 Why Realistic <> Ideal?

degree of ||-ism Xact/sec. (throughput)

Ideal

degree of ||-ism sec./Xact (response time)

Ideal Realistic Realistic

slide-3
SLIDE 3

7

INTRODUCTION

 Parallel machines are becoming quite common

and affordable

 Prices of microprocessors, memory and disks have

dropped sharply

 Recent desktop computers feature multiple processors

and this trend is projected to accelerate

 Databases are growing increasingly large  large volumes of transaction data are collected and

stored for later analysis.

 multimedia objects like images are increasingly stored

in databases

 Large-scale parallel database systems

increasingly used for:

 storing large volumes of data  processing time-consuming decision-support queries  providing high throughput for transaction processing

8

 Google data centers around the world, as of 2008

9

PARALLELISM IN DATABASES

 Data can be partitioned across multiple disks for

parallel I/O.

 Individual relational operations (e.g., sort, join,

aggregation) can be executed in parallel

 data can be partitioned and each processor can work

independently on its own partition

 Results merged when done  Different queries can be run in parallel with each

  • ther.

 Concurrency control takes care of conflicts.  Thus, databases naturally lend themselves to

parallelism.

slide-4
SLIDE 4

10 10 10 10

PARTITIONING

 Horizontal partitioning (shard)  involves putting different rows into different tables  Ex: customers with ZIP codes less than 50000 are

stored in CustomersEast, while customers with ZIP codes greater than or equal to 50000 are stored in CustomersWest

 Vertical partitioning  involves creating tables with fewer columns and using

additional tables to store the remaining columns

 partitions columns even when already normalized  called "row splitting" (the row is split by its columns)  Ex: split (slow to find) dynamic data from (fast to find)

static data in a table where the dynamic data is not used as often as the static

11 11 11 11

COMPARISON OF PARTITIONING TECHNIQUES

 Evaluate how well partitioning techniques

support the following types of data access: 1.Scanning the entire relation. 2.Locating a tuple associatively – point queries.

 E.g., r.A = 25.

3.Locating all tuples such that the value of a given attribute lies within a specified range – range queries.

 E.g., 10  r.A < 25.

12 12 12 12

HANDLING SKEW USING HISTOGRAMS

 Balanced partitioning vector can be constructed from

histogram in a relatively straightforward fashion

 Assume uniform distribution within each range of the histogram  Histogram can be constructed by scanning relation, or

sampling (blocks containing) tuples of the relation

slide-5
SLIDE 5

13 13 13 13

INTERQUERY PARALLELISM

 Queries/transactions execute in parallel with one

another

 concurrent processing  Increases transaction throughput; used primarily

to scale up a transaction processing system to support a larger number of transactions per second.

 Easiest form of parallelism to support

14 14 14 14

INTRAQUERY PARALLELISM

 Execution of a single query in parallel on multiple

processors/disks; important for speeding up long- running queries

 Two complementary forms of intraquery

parallelism :

 Intraoperation Parallelism – parallelize the

execution of each individual operation in the query (each CPU runs on a subset of tuples)

 Interoperation Parallelism – execute the different

  • perations in a query expression in parallel.

(each CPU runs a subset of operations on the data)

15 15 15 15

PARALLEL JOIN

 The join operation requires pairs of tuples to be

tested to see if they satisfy the join condition, and if they do, the pair is added to the join output.

 Parallel join algorithms attempt to split the pairs

to be tested over several processors. Each processor then computes part of the join locally.

 In a final step, the results from each processor can

be collected together to produce the final result.

slide-6
SLIDE 6

16 16 16 16

QUERY OPTIMIZATION

 Query optimization in parallel databases is more complex

than in sequential databases

 Cost models are more complicated, since we must take into

account partitioning costs and issues such as skew and resource contention

 When scheduling execution tree in parallel system, must

decide:

 How to parallelize each operation  how many processors to use for it  What operations to pipeline  what operations to execute independently in parallel  what operations to execute sequentially

 Determining the amount of resources to allocate for each

  • peration is a problem

 E.g., allocating more processors than optimal can result

in high communication overhead

DEDUCTIVE DATABASES

18 18 18 18

OVERVIEW OF DEDUCTIVE DATABASES

 Declarative Language  Language to specify rules  Inference Engine (Deduction Machine)  Can deduce new facts by interpreting the rules  Related to logic programming

 Prolog language (Prolog => Programming in logic)  Uses backward chaining to evaluate  Top-down application of the rules

 Consists of:  Facts

 Similar to relation specification without the necessity of

including attribute names  Rules

 Similar to relational views (virtual relations that are not stored)

slide-7
SLIDE 7

19 19 19 19

PROLOG/DATALOG NOTATION

 Facts are provided as predicates  Predicate has  a name  a fixed number of arguments

 Convention:  Constants are numeric or character strings  Variables start with upper case letters

 E.g., SUPERVISE(Supervisor, Supervisee)

 States that Supervisor SUPERVISE(s) Supervisee

20 20 20 20

PROLOG/DATALOG NOTATION

 Rule  Is of the form head :- body

 where :- is read as if and only iff

 E.g., SUPERIOR(X,Y) :- SUPERVISE(X,Y)  E.g., SUBORDINATE(Y,X) :- SUPERVISE(X,Y)

21 21 21 21

PROLOG/DATALOG NOTATION

 Query  Involves a predicate symbol followed by some variable

arguments to answer the question

 where :- is read as if and only iff

 E.g., SUPERIOR(james,Y)?  E.g., SUBORDINATE(james,X)?

slide-8
SLIDE 8

22 22 22 22

Supervisory tree Prolog notation

23 23 23 23

PROVING A NEW FACT

24 24 24 24

slide-9
SLIDE 9

DATA MINING

26 26 26 26

DEFINITION

Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.

Example pattern (Census Bureau Data): If (relationship = husband), then (gender = male). 99.6%

27 27 27 27

DEFINITION (CONT.)

Data mining is the exploration and analysis of large quantities

  • f data in order to discover valid, novel, potentially useful,

and ultimately understandable patterns in data.

Valid: The patterns hold in general. Novel: We did not know the pattern beforehand. Useful: We can devise actions from the patterns. Understandable: We can interpret and comprehend the patterns.

slide-10
SLIDE 10

28 28 28 28

WHY USE DATA MINING TODAY?

Human analysis skills are inadequate:

 Volume and dimensionality of the data  High data growth rate

Availability of:

 Data  Storage  Computational power  Off-the-shelf software  Expertise

29 29 29 29

THE KNOWLEDGE DISCOVERY PROCESS

Steps:

Identify business problem

Data mining

Action

Evaluation and measurement

Deployment and integration into businesses processes

30 30 30 30

PREPROCESSING AND MINING

Original Data Target Data Preprocessed Data Patterns Knowledge Data Integration and Selection Preprocessing Model Construction Interpretation

slide-11
SLIDE 11

31 31 31 31

DATA MINING TECHNIQUES

 Supervised learning  Classification and regression  Unsupervised learning  Clustering  Dependency modeling  Associations, summarization, causality  Outlier and deviation detection  Trend analysis and change detection

32 32 32 32

EXAMPLE APPLICATION: SKY SURVEY

Input data: 3 TB of image data with 2

billion sky objects, took more than six years to complete

Goal: Generate a catalog with all objects

and their type

Method: Use decision trees as data mining

model

Results:  94% accuracy in predicting sky object classes  Increased number of faint objects classified by

300%

 Helped team of astronomers to discover 16 new

high red-shift quasars in one order of magnitude less observation time

33 33 33 33

CLASSIFICATION EXAMPLE

Example training

database

 Two predictor attributes:

Age and Car-type (Sport, Minivan and Truck)

 Age is ordered, Car-type is

categorical attribute

 Class label indicates

whether person bought product

 Dependent attribute is

categorical Age Car Class 20 M Yes 30 M Yes 25 T No 30 S Yes 40 S Yes 20 T No 30 M Yes 25 M Yes 40 M Yes 20 S No

slide-12
SLIDE 12

34 34 34 34

GOALS AND REQUIREMENTS

 Goals:  To produce an accurate classifier/regression function  To understand the structure of the problem  Requirements on the model:  High accuracy  Understandable by humans, interpretable  Fast construction for very large training databases

35 35 35 35

WHAT ARE DECISION TREES?

Minivan Age Car Type YES NO YES <30 >=30 Sports, Truck 30 60 Age YES YES NO Minivan Sports, Truck

36 36 36 36

DENSITY-BASED CLUSTERING

 A cluster is defined as a connected dense component.  Density is defined in terms of number of neighbors of

a point.

 We can find clusters of arbitrary shape

slide-13
SLIDE 13

37 37 37 37

MARKET BASKET ANALYSIS

 Consider shopping cart filled with several items  Market basket analysis tries to answer the

following questions:

 Who makes purchases?  What do customers buy together?  In what order do customers purchase items?

38 38 38 38

MARKET BASKET ANALYSIS (CONTD.)

 Coocurrences  80% of all customers purchase items X, Y and Z together.  Association rules  60% of all customers who purchase X and Y also buy Z.  Sequential patterns  60% of customers who first buy X also purchase Y within

three weeks.

SPATIAL DATA

slide-14
SLIDE 14

40 40 40 40 41 41 41 41

WHAT IS A SPATIAL DATABASE?

 Database that:  Stores spatial objects  Manipulates spatial objects just like other objects in

the database

42 42 42 42

WHAT IS SPATIAL DATA?

 Data which describes either location or shape

e.g.House or Fire Hydrant location Roads, Rivers, Pipelines, Power lines Forests, Parks, Municipalities, Lakes

 In the abstract, reductionist view of the computer,

these entities are represented as Points, Lines, and Polygons.

slide-15
SLIDE 15

43 43 43 43

Roads are represented as Lines Mail Boxes are represented as Points

44 44 44 44

TOPIC THREE

Land Use Classifications are represented as Polygons

45 45 45 45

TOPIC THREE

Combination of all the previous data

slide-16
SLIDE 16

46 46 46 46

SPATIAL RELATIONSHIPS

 Not just interested in location, also interested in

“Relationships” between objects that are very hard to model outside the spatial domain.

 The most common relationships are

 Proximity : distance  Adjacency : “touching” and “connectivity”  Containment : inside/overlapping

47 47 47 47

SPATIAL RELATIONSHIPS

Distance between a toxic waste dump and a piece of property you were considering buying.

48 48 48 48

SPATIAL RELATIONSHIPS

Distance to various pubs

slide-17
SLIDE 17

49 49 49 49

SPATIAL RELATIONSHIPS

Adjacency: All the lots which share an edge

50 50 50 50

Connectivity: Tributary relationships in river networks

51 51 51 51

MOST ORGANIZATIONS HAVE SPATIAL DATA

 Geocodable addresses  Customer location  Store locations  Transportation

tracking

 Statistical/Demograph

ic

 Cartography  Epidemiology  Crime patterns  Weather Information  Land holdings  Natural resources  City Planning  Environmental

planning

 Information

Visualization

 Hazard detection

slide-18
SLIDE 18

52 52 52 52

ADVANTAGES OF SPATIAL DATABASES

 Able to treat your spatial data like anything else

in the DB

 transactions  backups  integrity checks  less data redundancy  fundamental organization and operations handled by

the DB

 multi-user support  security/access control  locking

53 53 53 53

ADVANTAGES OF SPATIAL DATABASES

 Offset complicated tasks to the DB server 

  • rganization and indexing done for you

do not have to re-implement operators

do not have to re-implement functions

 Significantly lowers the development time of

client applications

54 54 54 54

ADVANTAGES OF SPATIAL DATABASES

 Spatial querying using SQL 

use simple SQL expressions to determine spatial relationships

 distance  adjacency  containment

use simple SQL expressions to perform spatial

  • perations

 area  length  intersection  union  buffer

slide-19
SLIDE 19

55 55 55 55

Original Polygons Union Intersection

56 56 56 56

Original river network Buffered rivers

57 57 57 57

ADVANTAGES OF SPATIAL DATABASES

… WHERE distance(<me>,pub_loc) < 1000 SELECT distance(<me>,pub_loc)*$0.01 + beer_cost … ... WHERE touches(pub_loc, street) … WHERE inside(pub_loc,city_area) and city_name = ...

slide-20
SLIDE 20

58 58 58 58

ADVANTAGES OF SPATIAL DATABASES

Simple value of the proposed lot Area(<my lot>) * <price per acre> + area(intersect(<my log>,<forested area>) ) * <wood value per acre>

  • distance(<my lot>, <power lines>) * <cost of power line laying>

59 59 59 59

New Electoral Districts

  • Changes in areas between 1996 and

2001 election.

  • Want to predict voting in 2001 by

looking at voting in 1996.

  • Intersect the 2001 district polygon with

the voting areas polygons.

  • Outside will have zero area
  • Inside will have 100% area
  • On the border will have partial area
  • Multiply the % area by 1996 actual

voting and sum

  • Result is a simple prediction of 2001

voting More advanced: also use demographic data.

60 60 60 60

DISADVANTAGES OF SPATIAL DATABASES

 Cost to implement can be high  Some inflexibility  Incompatibilities with some GIS software  Slower than local, specialized data structures  User/managerial inexperience and caution

slide-21
SLIDE 21

61 61 61 61

PICTOGRAMS - SHAPES

 Types: Basic Shapes, Multi-Shapes, Derived

Shapes, Alternate Shapes, Any possible Shape, User-Defined Shapes

Basic Shapes Alternate Shapes Multi-Shapes Any Possible Shape Derived Shapes User Defined Shape

N 0, N

* !

62 62 62 62

SPATIAL DATA ENTITY CREATION

 Form an entity to hold county names, states,

populations, and geographies

CREATE TABLE County( Name varchar(30), State varchar(30), Pop Integer, Shape Polygon);

 Form an entity to hold river names, sources,

lengths, and geographies

CREATE TABLE River( Name varchar(30), Source varchar(30), Distance Integer, Shape LineString);

63 63 63 63

EXTENDING THE ER DIAGRAM

Standard ER Diagram Spatial ER Diagram

slide-22
SLIDE 22

64 64 64 64