SLIDE 1

General Purpose Database Summarization

A web service architecture for on-line database summarization

Régis Saint-Paul (speaker), Guillaume Raschia, Noureddine Mouaddib

LINA - Polytech’Nantes - INRIA ATLAS-GRIM Group

VLDB Conference — Sept. 1st 2005

ATLAS-GRIM General Purpose Database Summarization VLDB 2005 1 / 28

SLIDE 2

Table of Contents

1 Introduction
    Generalities
    Related works
2 Summary model
    Description space
    Building the summaries
3 System architecture
    Web service organization
    Complexity and performances
4 Conclusion

SLIDE 4

Motivations

Provide small versions of very large databases.

Descriptive ability:
  • scientific studies (epidemiology);
  • commercial and marketing studies (customer segmentation);
  • log analysis (connection/operation profile);
  • data obfuscation;
  • data personalization and filtering.

Data size reduction ability:
  • approximate querying (hotel booking);
  • database browsing (image database);
  • storing a rough view of the data on devices with low memory capacity (tourism GPS data).

SLIDE 7

What is a summary?

Definition: a summary is a concise representation of a set of structured data. ⇒ Semantic compression

Tab.: Relation R

Occupation          Income
Ph.D. Student       1 000
Lecturer            2 000
Managing Director   8 500
Politician          xx xxx

Tab.: Summary R∗

Occupation   Income
Research     Miserable
Executive    Enormous

SLIDE 8

Aggregate computation

  • SDB, OLAP [Codd et al. 93], DataCubes [Gray et al. 93]
  • Datacube summarization: QuotientCube [Lakshmanan et al. 2002]

Limitations:
  • do not preserve the initial data schema;
  • subject oriented, has to be designed;
  • fixed and crisp granularity, threshold effect.

SLIDE 9

Clustering approaches for semantic compression

Intuition: describe groups rather than individual observations.

  • Clustering – ItCompress [Jagadish et al. 1999]
  • Bayesian network classifier – Spartan [Babu et al. 2001]
  • Association rules – Fascicule [Jagadish et al. 1999]

Limitations:
  • the shape of the classes depends on the selected criteria [Fasulo 1999];
  • single granularity of the compressed relation;
  • non-intuitive intentional description of classes.

SLIDE 10

Foundations of our approach

Intuition: trying to reproduce human learning mechanisms.

  • Formal concept analysis [Barbut et al. 1970, Wille 1982]
  • Conceptual clustering [Michalski and Stepp 1983]: Unimem [Lebowitz 1986], Cobweb [Fisher 1987], Fuzz [Chen & Lu 1997]

Limitations:
  • approaches were validated only on small data samples;
  • lack of maintenance capabilities.

SLIDE 12

Possibilistic Data Representation

Theoretical foundation: fuzzy-set theory (Zadeh 1965) and possibility theory (Zadeh 1978, Dubois & Prade 1985).

Management of uncertain, incomplete and gradual information: “John’s age should approximately be between 16 and 20, but that’s not sure.”

[Figure: possibility distribution over AGE (ticks at 16 and 20, possibility axis from 0.0 to 1.0), and a possibility distribution over a discrete domain Dom with values a to f.]
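
The statement about John's age can be sketched as a trapezoidal possibility distribution. This is only an illustration: the function name `trapezoid` and the outer breakpoints 14 and 22 are assumptions chosen for the example, since the slide only gives the values 16 and 20.

```python
def trapezoid(a: float, b: float, c: float, d: float):
    """Build a possibility distribution: 0 outside (a, d), 1 on [b, c], linear in between."""
    def pi(x: float) -> float:
        if b <= x <= c:
            return 1.0          # fully possible
        if x <= a or x >= d:
            return 0.0          # impossible
        if x < b:
            return (x - a) / (b - a)   # rising edge
        return (d - x) / (d - c)       # falling edge
    return pi

# Hypothetical breakpoints: fully possible on [16, 20], impossible at or below 14 and at or above 22.
johns_age = trapezoid(14, 16, 20, 22)

print(johns_age(18))  # 1.0
print(johns_age(15))  # 0.5
print(johns_age(30))  # 0.0
```

The gradual edges encode the "approximately", while values far from the interval keep a possibility of zero.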

SLIDE 13

Background knowledge

For each attribute $A$ with domain $D_A$, a set of linguistic labels is defined together with their membership functions over $D_A$.

Example, on attribute income: $D_{income} = [0, 200000]$ and $D^+_{income} = \{none, miserable, modest, \ldots\}$

[Figure: membership functions of the labels none, miserable, modest, reasonable, comfortable, enormous and outrageous over $D_{INCOME}$ (K$), with ticks at 20, 40, 60, 80 and 100 and membership degrees up to 1.]
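
A minimal sketch of how such background knowledge could rewrite a raw income into linguistic labels. The label names come from the slide, but every breakpoint below is a hypothetical value (in K$), and `trapezoid`, `INCOME_LABELS` and `describe` are illustrative names, not part of the presented system.

```python
def trapezoid(a, b, c, d):
    """Membership function: 0 outside (a, d), 1 on [b, c], linear in between."""
    def mu(x):
        if b <= x <= c:
            return 1.0
        if x <= a or x >= d:
            return 0.0
        return (x - a) / (b - a) if x < b else (d - x) / (d - c)
    return mu

# Hypothetical membership functions for a subset of the income labels (K$).
INCOME_LABELS = {
    "miserable": trapezoid(0, 0, 10, 20),
    "modest": trapezoid(10, 20, 30, 40),
    "reasonable": trapezoid(30, 40, 50, 60),
    "comfortable": trapezoid(50, 60, 70, 80),
}

def describe(income_k: float) -> dict:
    """Return every label that the value satisfies with a non-zero degree."""
    degrees = {label: mu(income_k) for label, mu in INCOME_LABELS.items()}
    return {label: deg for label, deg in degrees.items() if deg > 0.0}

print(describe(25))  # {'modest': 1.0}
print(describe(15))  # {'miserable': 0.5, 'modest': 0.5}
```

Overlapping labels let a borderline value belong partly to two descriptors, which is what avoids the threshold effect of crisp intervals.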

SLIDE 14

Summary representation space

Original tuple (raw data): $t = \langle t.A_1, \ldots, t.A_k \rangle$, $t \in R$, with

$$R(A_1, \ldots, A_k) = \prod_{i=1}^{k} D_{A_i}$$

[Diagram: raw tuples {t} over $D_A$ are mapped to summarized tuples {z} over $F(D_A^+)$.]

Summarized tuple: $z = \langle z.A_1, \ldots, z.A_k \rangle$, $z \in R^*$, with

$$R^*(A_1, \ldots, A_k) = \prod_{i=1}^{k} F(D_{A_i}^+)$$

SLIDE 15

Summary model

A summary is a 3-tuple $z = (I_z, R_z, E_z)$ with:
  • $I_z$: the intentional content;
  • $R_z$: the extensional content, a subset of the relation R;
  • $E_z$: a set of edges toward other summaries.

Example of a summary

                 Label                 satisfaction   support
intention I_z                                         1.83
  OCCUPATION     employee              0.2            1.25
                 manager               1.0            0.33
                 managing director     0.7            0.25
  INCOME         comfortable           1.0            1.50
                 high                  1.0            0.33
extension R_z    { t1, t2, t5, t13 }                  4
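
One possible in-memory representation of such a summary, given as a sketch only. The class and field names (`Summary`, `intent`, `extent`, `edges`) are assumptions; the satisfaction degrees and tuple identifiers are those of the example above, and the support column is left out for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Summary:
    # I_z: attribute -> {linguistic label: satisfaction degree}
    intent: dict
    # R_z: identifiers of the tuples of R covered by this summary
    extent: set
    # E_z: edges toward other summaries (here, child summaries)
    edges: list = field(default_factory=list)

z = Summary(
    intent={
        "OCCUPATION": {"employee": 0.2, "manager": 1.0, "managing director": 0.7},
        "INCOME": {"comfortable": 1.0, "high": 1.0},
    },
    extent={"t1", "t2", "t5", "t13"},
)

print(len(z.extent))                 # 4
print(sorted(z.intent["INCOME"]))    # ['comfortable', 'high']
```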

SLIDE 16

Partial order on summaries

Subsumption relation: $z \sqsubseteq z' \iff R_z \subseteq R_{z'}$

Hierarchical organization:
  • root: most general summary;
  • leaves: most specific summaries.

The user-defined Background Knowledge fixes the finest level and, consequently, the maximal hierarchy size.
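
The subsumption relation reduces to set inclusion on extents, which can be checked directly. The sketch below assumes a hierarchy stored as nested dictionaries with hypothetical keys `extent` and `children`; the tuple identifiers are made up for the example.

```python
def subsumed_by(extent_z: set, extent_z_prime: set) -> bool:
    """z is subsumed by z' if and only if R_z is included in R_z'."""
    return extent_z <= extent_z_prime

def is_consistent(node: dict) -> bool:
    """Check recursively that every child extent is included in its parent's extent."""
    return all(
        subsumed_by(child["extent"], node["extent"]) and is_consistent(child)
        for child in node["children"]
    )

# Hypothetical two-level hierarchy: a root and two leaves.
root = {
    "extent": {"t1", "t2", "t5", "t13"},
    "children": [
        {"extent": {"t1", "t2"}, "children": []},
        {"extent": {"t5", "t13"}, "children": []},
    ],
}

print(subsumed_by({"t1", "t2"}, root["extent"]))  # True
print(is_consistent(root))                        # True
```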

SLIDE 17

Algorithm outline

  • hierarchical conceptual classification
  • incremental process
  • top-down approach
  • selective local search

Advantages:
  • summary freshness through incremental maintenance;
  • linear time complexity w.r.t. the number of tuples.

Weaknesses:
  • sub-optimal model (dynamic environment);
  • order effect (use of bidirectional learning operators).

SLIDE 18

Process overview

[Figure: process overview diagram; its content could not be recovered from the extraction.]

SLIDE 19

Local search

The process looks for the learning operator which produces the highest-quality child partition.

Learning operators: affect, create, merge, split.
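
The operator selection can be sketched on a toy one-dimensional problem. This is not the presented system's algorithm: the quality measure below (negative within-cluster squared error minus a per-cluster penalty) is a stand-in chosen for the example, all names are hypothetical, and the split operator is omitted because it needs the grandchildren of the current node.

```python
from itertools import combinations

def sse(cluster):
    """Sum of squared deviations from the cluster mean."""
    mean = sum(cluster) / len(cluster)
    return sum((v - mean) ** 2 for v in cluster)

def quality(partition, penalty=1.0):
    """Stand-in quality measure: compact clusters, but not too many of them."""
    return -sum(sse(c) for c in partition) - penalty * len(partition)

def candidates(children, x):
    """Yield (operator name, resulting child partition) for a new value x."""
    for i in range(len(children)):          # affect: put x into an existing child
        yield "affect", [c + [x] if j == i else c for j, c in enumerate(children)]
    yield "create", children + [[x]]        # create: x starts a new child
    for i, j in combinations(range(len(children)), 2):   # merge two children around x
        rest = [c for k, c in enumerate(children) if k not in (i, j)]
        yield "merge", rest + [children[i] + children[j] + [x]]

def best_operator(children, x):
    """Return the candidate whose child partition has the highest quality."""
    return max(candidates(children, x), key=lambda cand: quality(cand[1]))

print(best_operator([[1, 2], [10, 11]], 1.5)[0])  # affect
print(best_operator([[1, 2], [10, 11]], 50)[0])   # create
print(best_operator([[1], [2], [3]], 2)[0])       # merge
```

Only the children of the current node are examined at each step, which is what keeps the search local.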

SLIDE 23

Multi-granularity summary

The summary hierarchy presents many different precision levels.

  • The trade-off between size and precision can be chosen a posteriori, depending on the user's need;
  • Analogy between the drill-down/roll-up operations on a datacube and the navigation of the summary hierarchy.
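
Choosing the precision level afterwards amounts to cutting the hierarchy at a given depth. A minimal sketch, assuming a tree of nested dictionaries with hypothetical keys `label` and `children`; the labels are illustrative.

```python
def cut(node: dict, depth: int) -> list:
    """Return the labels of the summaries that form the partition at the given depth."""
    if depth == 0 or not node["children"]:
        return [node["label"]]
    return [label for child in node["children"] for label in cut(child, depth - 1)]

# Hypothetical hierarchy over income descriptors.
tree = {
    "label": "any income",
    "children": [
        {"label": "low", "children": [
            {"label": "miserable", "children": []},
            {"label": "modest", "children": []},
        ]},
        {"label": "high", "children": [
            {"label": "comfortable", "children": []},
            {"label": "enormous", "children": []},
        ]},
    ],
}

print(cut(tree, 0))  # ['any income']
print(cut(tree, 1))  # ['low', 'high']
print(cut(tree, 2))  # ['miserable', 'modest', 'comfortable', 'enormous']
```

Increasing the depth plays the role of a drill-down and decreasing it the role of a roll-up.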

SLIDE 25

Process overview

  • Message-oriented application;
  • Each document has an autonomous specification (XSchema);
  • Possibility to benefit from Message-Oriented Middleware (MOM);
  • Each service may be used separately or composed with others;
  • Based on widespread standards (W3C, ECMA and ISO).

SLIDE 26

Concept formation performed by autonomous “agents”

[Figure: diagram; its content could not be recovered from the extraction.]

  • Memory management optimized through a specific pagination method;
  • Process parallelization;
  • Computation optimization through the use of a local cache with incremental maintenance (contrast matrix).

SLIDE 27

Process performance evaluation

Tests based on 1990 US census data [UCI KDD Archive]:
  • 1 billion tuples;
  • 14 attributes used for the summarization;
  • 5 to 14 modalities per attribute (prepared).

SLIDE 28

Dynamic performances

  • Process performance depends only on the hierarchy size.
  • $\text{depth} = \log_{\text{width}}(\text{leaves})$
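
A small worked example of the depth formula, assuming a balanced hierarchy in which every node has the same number of children. The function name and the sample values are illustrative; an integer loop is used instead of a floating-point logarithm so that the result is exact, and it rounds up when the number of leaves is not an exact power of the width.

```python
def hierarchy_depth(leaves: int, width: int) -> int:
    """Smallest depth d such that width ** d >= leaves (requires width >= 2)."""
    if width < 2:
        raise ValueError("width must be at least 2")
    depth, capacity = 0, 1
    while capacity < leaves:
        capacity *= width   # one more level multiplies the number of leaves by width
        depth += 1
    return depth

print(hierarchy_depth(4096, 4))    # 6, since 4 ** 6 == 4096
print(hierarchy_depth(14000, 4))   # 7, since 4 ** 6 < 14000 <= 4 ** 7
```

Because the depth grows only logarithmically with the number of leaves, the cost of classifying one tuple stays small even for large hierarchies.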

SLIDE 29

Comparison with a real-life dataset

The marketing department of CIC (a French banking group) provided customer data:
  • 33 700 records;
  • 70 attributes (10 of them used for the summary);
  • Background Knowledge defined with the bank's marketing experts;
  • 3 to 8 linguistic descriptors used per attribute.

SLIDE 30

Dynamic performances on real data

[Figure: two plots against the number of processed tuples (axis from 5 000 to 30 000): the number of leaves (axis from 1 000 to 14 000) and the processing rate in tuples/s (axis from 5 to 75).]

  • The number of leaves follows an asymptotic evolution;
  • The process tends toward a classification-only regime.

SLIDE 32

Conclusion

We presented:

A general-purpose multi-granularity summarization model:
  • an adaptive alternative to the GROUP BY;
  • simultaneous maintenance of several compression levels;
  • robust and intuitive classes thanks to human-like learning mechanisms and uncertainty handling.

The architecture of the system, which contributes to:
  • ease of coupling with DBMS (web services);
  • performance optimization and parallelization (use of autonomous agents).

Validation of the system performance on a test database and a real-life one.

SLIDE 33

Questions?

Web site of SaintEtiQ: http://www.simulation.fr/seq
  • Win32 prototype with test dataset available for download
  • Process available online as web service
  • References and documentation