Analytical data bases Database lectures for mathematics students - - PowerPoint PPT Presentation

analytical data bases
SMART_READER_LITE
LIVE PREVIEW

Analytical data bases Database lectures for mathematics students - - PowerPoint PPT Presentation

Analytical data bases Database lectures for mathematics students Zbigniew Jurkiewicz, Institute of Informatics UW May 14, 2017 Analytical data bases Database lectures for mathematics Zbigniew Jurkiewicz, Institute of Informatics UW Decision


slide-1
SLIDE 1

Analytical data bases

Database lectures for mathematics students

Zbigniew Jurkiewicz, Institute of Informatics UW May 14, 2017

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-2
SLIDE 2

Decision support systems

From the perspective of the time span all decisions in the

  • rganization could be divided into three categories:
  • perational decisions within the scope of days or weeks;

tactical decisions, whose effects range from a few months to one year, strategic decisions, which impact the organization development for the next few years.

It has been observed that when moving from operational decisions towards the strategic ones, the procedures used become less and less algorithmic and formalized.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-3
SLIDE 3

Decision support systems

Initially in business activities computer systems had been used mostly for operational data processing with such applications as sales order management, invoicing, or magazine inventory. Gradually the computers has been used for less routine activities, called Decision Support Systems (DSS). They are also known under other popular names

BIS/BIT (Business Intelligence System/Technology) EIS (Executive Information System).

In addition to “mechanical” data processing they also provide various mechanisms for deducing new information from the facts contained in the database. This has lead to a division of “database” applications into

  • perational (transactional) and analytical.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-4
SLIDE 4

Requirements for decision support systems

Information should usually be presented in a summarized form. No standard access path, very varied methods of selection and formatting the information to be presented, dynamics. Associating the selected information which other computational resources (spreadsheets, specialized statistical packages).

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-5
SLIDE 5

Analytical data processing

Commonly known as On-Line Analytical Processing, abbreviated usually to OLAP . Typical applications: trend analysis, looking for patterns of behavior, looking for anomalies. Used interactively, so efficiency is very important, especially time efficiancy. If a user observes that some queries (e.g. based on 5 and more joins) are executed very slowly, she will try to avoid them. It is assumed that the answers for 90% of queries should be available within 10 seconds.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-6
SLIDE 6

Analytical data bases

Also called On-Line Analytical Processing (OLAP) databases. Growing in importance. From personal computers to large client-server configurations. Many buzzwords

roll-up and drill-down, drill-through, MOLAP , pivoting.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-7
SLIDE 7

Main issues

What is an analytical database? Models and operations Implementing analytical database Development trends

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-8
SLIDE 8

Matrix reports

Data analysis was first supported by matrix reports. Matrix reports look like spreadsheets. They are often concerned with finances or management. Sales system should for example contain a report about customers and their buying patterns, divided by country regions. However, instead of analysing buying patterns for each product, we divide product into categories. So the report would have product categories as columns, country regions as rows, and each report cell will show the number of items sold in this category in this region.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-9
SLIDE 9

Data Mart

Small data warehouse, sometimes called thematic database Covers only some areas (themes) of the enterprise, e.g.

marketing: customers, products, sales

Model adapted to the needes of a department. Usually the information is initially preaggregated

Elimination of unnecessary details Some critical level of details selected.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-10
SLIDE 10

Tools for querying and analysis

Query builders Report generators

comparisons: growth, decrease trends, graphs

Spreadsheets WWW interface Data mining

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-11
SLIDE 11

Other operations

Functions over time

e.g. averages on different periods

Computed attributes

e.g.. profit = sales * rate

Textual queries, e.g.

find all documents containing words A and B

  • rder documents by frequency of occurence for words X, Y

and Z

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-12
SLIDE 12

Data models and operators

Data models

relation star and snowflake cube: extension of spreadsheet idea (multidimensional tables, dimensions indexed by database values)

Operators

slice & dice roll-up, drill down pivoting

  • ther

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-13
SLIDE 13

Multidimensional data model

Multidimensional databases are most popular due to analytical data model of the form of multidimensional cube containing:

facts (also called measures), e.g. the number of cars sold; dimensions, e.g. months, regions of sale.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-14
SLIDE 14

Dimensions

Dimensions usually form hierarchies, e.g. for time dimension the hierarchy will be year-quarter-month-day. Hierarchies enable the interactive change of detail level (granularity) of the information presented. In more complex models the hierachies can branch, e.g. division into weeks is incompatible with division into months.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-15
SLIDE 15

Time

Time dimension needs a special treatment It is hidden — there is no separate table for time. As a dimension time is specific because it is sequential in nature.

We might ask to see the sales for May or the sales for the first three months of 2007. But we would rarely ask to see the sales for the first five goods (ever assuming they are ordered by name).

Method of aggregation for time depends on the meaning of the measure.

If a company sold 10 computers in January, 15 computers in February, and 10 computers in March, then typical query would ask for total number (i.e. sum) sold for the first quarter. On the other side, if a company had employeed 10 people in January, 7 in February, and 10 again in March, then we would usually ask about the average count for the quarter.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-16
SLIDE 16

Database

The data is usually taken from data warehouse (real or virtual). Direct storage of information for all facts and all levels of detail in database could be very costly in terms of space, so

Store only data for the most often used levels of hierarchies. Other data is computed from stored data on the fly if needed.

When aggregating measures it is important to take into account various rules of aggregation, e.g.

Sales amount is usually summed. Temperature or price will rather be averaged.

The analytical database stores as a rule only aggregated data. To see the detail data (drill-through) it is necessary to fetch it from data warehouse or operational database. Because this takes a lot of time, such need should not

  • ccur too often.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-17
SLIDE 17

Operations on data

Cutting and projecting on the cross-section surface (slice and dice) Change of detail level: drill-down and roll-up) Turning (pivot): changes the visible dimensions on the “image”.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-18
SLIDE 18

Approaches to building the OLAP database

1

ROLAP = “Relational OLAP”: we adapt the relational DBMS to star or snowflake schema.

2

MOLAP = “Multidimensional OLAP”: we use specialized DBMS based on “datacube” model.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-19
SLIDE 19

Star schema

Star schema is a typical method of data organization in relational database for OLAP . It is composed of:

Fact table: large set of facts such as informations about the amount of sale. Dimension tables: smaller, statical information about the

  • bjects that the facts deal with.

Generalization: snowflake model.

Hierarchies of tables for particular dimensions: dimension table normalization.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-20
SLIDE 20

Example star schema

We want to have in OLAP database informations about the selling of beers: pub, beer name, drinker who bought it, day, hour, and price. We take the following relation as our fact table:

Sales(pub,beer,drinker,day,hour,price)

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-21
SLIDE 21

Example, cont.

Dimension tables contain informations about pubs, beers and drinkers:

Pubs(pub, address, licence) Beers(beer, producer) Drinkers(drinker, address, phone)

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-22
SLIDE 22

Dimension attributes and dependent attributes

Two kinds of attributes exist in fact table:

Dimension attributes: the keys for dimension tables. Dependent attributes: the values associated with particular combinations of dimension attributes values.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-23
SLIDE 23

Example: dependent attribute

price is a dependent attribute in the example relation Sales. Its value is determined by the combination of dimension attributes: pub, beer, drinker and time (the combination of date and hour).

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-24
SLIDE 24

ROLAP optimization techniques

Bitmap indexes: for each value of the index key in a dimension table (e.g. for each beer in the Beers table) we create a bit vector showing which tuples in a fact table contain this value. Materialized views: the OLAP database (or ever the data warehouse) stores precomputed answers for some useful queries (perspectives).

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-25
SLIDE 25

Typical OLAP query

OLAP query often starts with “star join”: the natural join of the fact table with all or most dimension tables. Example:

SELECT * FROM Sales,Pubs,Beers,Drinkers WHERE Sales.pub = Beers.pub AND Sales.beer = Beers.beer AND Sales.drinker = Drinkers.drinker;

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-26
SLIDE 26

Typical OLAP query

Starts with a star join. Selects interesting tuples using data from dimension tables. Groups on one or more dimensions. Aggregates some attributes of the result.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-27
SLIDE 27

Example OLAP query

For each pub in Pozna´ n show the total sale of each beer produced by Anheuser-Busch brewery. Filter: address = “Pozna´ n” and producer = “Anheuser-Busch”. Grouping: by pub and beer. Aggregation: Sum over price.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-28
SLIDE 28

Example: SQL

SELECT pub, beer, SUM(price) FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers WHERE addr = ’Pozna´ n’ AND producer = ’Anheuser-Busch’ GROUP BY pub, beer;

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-29
SLIDE 29

Materialized views

Direct execution of our query for the table Sales and dimension tables may take a lot more time than we accept. If we would create a materialized view containg the appropriate information, we could give the answer much faster.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-30
SLIDE 30

Example: materialized view

Which view could help us? Basic requirements:

1

Must join at least Sales, Pubs and Beers.

2

Must group at least by pub and beer.

3

Does not need to select pubs in Pozna´ n nor beers from Anheuser-Busch.

4

Does not need to omit columns address and producer.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-31
SLIDE 31

Example

Here is a useful view:

CREATE VIEW PuBeS(pub, address, beer, prod, sale) AS SELECT pub, address, beer, prod, SUM(price) AS sale FROM Sales NATURAL JOIN Pubs NATURAL JOIN Beers GROUP BY pub, address, beer, producer; Because pub → address and beer → producer, some grouping is superficial, but it is necessary because address and producer

  • ccur in the SELECT phrase.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-32
SLIDE 32

Example — finale

The reformulated query (now it uses the materialized view BaBaS):

SELECT pub, beer, sale FROM PuBeS WHERE address = ’Pozna´ n’ AND producer = ’Anheuser-Busch’;

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-33
SLIDE 33

Materialization aspects

Type and frequency of queries Computing time for queries Storage costs Updating costs

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-34
SLIDE 34

MOLAP and datacubes

The (keys of) dimension tables become the dimensions of hypercube.

Example: for data from Sales table we have 4 dimensions: pub, beer, drinker i time.

Dependent attributes (e.g. price) are located in points (cells) of the hypercube.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-35
SLIDE 35

Visualization — hypercubes

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-36
SLIDE 36

Borders

Often a cube should also contain aggregations (usually SUM or AVG) along the hyperedges of the cube. Borders contain one-dimensional, two-dimensional, . . . aggregations.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-37
SLIDE 37

Example: borders

Our 4-dimensional hypercube Sales contains sums of price for each pub, each beer, each drinker and each time unit (probably days). It also contains sums of price for all pairs pub-beer, triples pub-drinker-day, . . .

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-38
SLIDE 38

Structure of the cube

We extend each dimension to have one additional value *. Internal cell with one or more coordinate being * contains aggregates for grouping by dimensions with *. Example: Sales(’Pod ˙ Zaglem’, ’Bud’, *, *) contains the sum

  • f the cost of the beer Bud which has been drunk in the

pub “Pod ˙ Zaglem” by all drinkers at any time.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-39
SLIDE 39

Drill-down

Drill-down = “deaggregation” — decompose the aggregation into its components. Example: after finding that “Pod ˙ Zaglem” sells few Okocim beer, one may try to decompose this sales into particular kinds of Okocim.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-40
SLIDE 40

Roll-up

Roll-up = additional aggregation on one or more dimensions. Example: having the table showing how much Okocim beer is drunk by each drinker in each pub, we roll it into a table giving the total amount of Okocim beer drunk by each

  • f drinkers.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-41
SLIDE 41

Roll-Up i Drill-Down

Anheuser-Busch for drinker/pub Jim Bob Mary Joe’s Pub 45 33 30 Nut-House 50 36 42 Blue Chalk 38 31 40 Rolling-up by Pubs A-B / drinker Jim Bob Mary 133 100 112

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-42
SLIDE 42

Roll-Up i Drill-Down

Drill-down by Beers Beers A-B / drinker Jim Bob Mary Bud 40 29 40 M’lob 45 31 37 Bud Light 48 40 35

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-43
SLIDE 43

Materialized views for datacubes

Useful materialized views for datacubes should aggregate by one or more dimensions. The dimensions should not be totally aggregated, but possibly grouped by some attribute from a dimension table.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-44
SLIDE 44

Example

A materialized view for our Sales hypercube could:

1

Aggregate totally by drinker.

2

Do not aggregate at all by beer.

3

Aggregate by time using week.

4

Aggregate by town for pubs.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-45
SLIDE 45

Indexes

Traditional techniques

B-trees, hashing tables, R-trees, grids, ...

Specific

inverted lists bitmap indexes join indexes

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-46
SLIDE 46

Using inverted lists

Query:

Find people with age = 20 and name = “Fred”

List for age = 20: r4, r18, r34, r35 List for name = “Fred”: r18, r52 The answer is obtained as intersection: r18

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-47
SLIDE 47

MDX

Multidimensional Expressions (MDX): query language for MOLAP , initially part OLE DB (Microsoft 1997). Then used by Microsoft OLAP Services 7.0 and Microsoft Analysis Services. XML for Analysis contains MDX as query language. Supported by Applix, Oracle, SAS, SAP , Panorama Software, Cognos, Hyperion Solutions and others. In 2001 XMLA Council (www.xmla.org) publishes the standard for XML for Analysis, with query language mdXML (MDX enclosed with <Statement> tag from XML.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-48
SLIDE 48

Example query in MDX

SELECT { [Measures].[Sales in shops] } ON COLUMNS, { [Date].[2002], [Date].[2003] } ON ROWS FROM Sales WHERE ( [Shop].[Europe].[Poland] )

The SELECT clause determines the “axes” of query as Sales in shops from Measures dimension and 2002 plus 2003 from Date dimension. The FROM clause indicates, that the data source is the hypercube Sales. The WHERE clause defines the “cross-section” as the element Poland of the dimension Shop.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-49
SLIDE 49

Trends

Oracle: Essbase (after taking over Hyperion), BI Server. IBM: Cognos 8 BI (together with ‘PowerPlay Studio’), database TM1 (Applix). Microsoft: database Panorama (included into SQL Server 7), two analysis tools (Maximal i ProClarity), integration with Excel, SharePoint and Visio. Planned in-memory tool Gemini.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-50
SLIDE 50

Data Mining

Automatic search for “interesting” patterns and trends in data. The term data mining is mostly used for describing the summarization of large data sets in a useful way.

Showing regularities, often written using rules

Inductive methods used

Consequence: the results are never universally guaranteed, they could be the effect of the momentary contents of the data base.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-51
SLIDE 51

Examples

Grouping all WWW Internet pages according to subjects. Preventing credit frauds: finding characteristic properties of illegal transctions with credit cards. Searching for associations, e.g. finding goods often bought together. Finding similar sequences of behavior, e.g. shares with similar oscillations of quotations.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-52
SLIDE 52

Characteristics

Basically a nontrivial automatic extraction of unknown and potentially useful information contained implicite in database. Based on searching for patterns in data, without previous construction of hypotheses

This differs from classical statistical approach, where analyst builds hypotheses and tries to verify them on a smaple from data bases.

More troubles in situations, when patterns are discovered in recursive decision process. Information in database is often disturbed and incomplete, so some statistic knowledge is necessary anyway.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-53
SLIDE 53

Technology

Generally artificial intelligence, machine learning, neural networks, association rules, rough sets. Classification and forecasting: building a classifier for categories given in advance. Cluster analysis: defining categories during analysis. Pattern recognition and searching. Decision trees.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-54
SLIDE 54

Clustering: issues

Partitioning data into into automatically generated categories Do we have the expected number of groups? How to find the ,,best” groups? Are groups semantically meaningful?

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-55
SLIDE 55

Market-basket analysis

Market baskets = sets of goods which are bought together by a customer during one visit in the shop. Summary of market-baskets: frequent sets of items — sets

  • f goods often found together.

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics

slide-56
SLIDE 56

Tools

Weka: New Zealand Rses i Rses-lib: MIMUW. SAS

Zbigniew Jurkiewicz, Institute of Informatics UW Analytical data bases Database lectures for mathematics