provenance management in databases under schema evolution
play

Provenance Management in Databases Under Schema Evolution Shi Gao, - PDF document

Provenance Management in Databases Under Schema Evolution Shi Gao, Carlo Zaniolo Department of Computer Science University of California, Los Angeles { gaoshi,zaniolo } @cs.ucla.edu Abstract When the schema evolves, the database under old


  1. Provenance Management in Databases Under Schema Evolution Shi Gao, Carlo Zaniolo Department of Computer Science University of California, Los Angeles { gaoshi,zaniolo } @cs.ucla.edu Abstract When the schema evolves, the database under old schema is migrated into the new one conforming to new Since changes caused by database updates combine with schema. Therefore, the current database snapshot is the the internal changes caused by database schema evolu- combined result of (i) the external actions that entered the tion, an integrated provenance management for data and original information (e.g., via SQL inserts or updates), metadata represents a key requirement for modern infor- and (ii) the migration steps that have then transformed mation systems. In this paper, we introduce the Archived the data as part of the schema evolution process. Thus the Metadata and Provenance Manager (AM&PM) system history of schema changes since a piece of information which addresses this requirement by (i) extending the In- was first recorded becomes an integral part of its prove- formation Schema with the capability of representing the nance. A combined provenance management system for provenance of the schema and other metadata, (ii) pro- data and metadata can be used to meet many important viding a simple time-stamp based representation of the requirements [6, 15], including the following ones: provenance of the actual data, and (iii) supporting power- Provenance Tracing . Users may be interested in the ful queries on the provenance of the data and the history provenance of both data and metadata. The provenance of the metadata. of metadata allows users to audit the process of schema evolution and examine its history. 1 Introduction Provenance Verifying . The combined provenance of data and metadata gives users an effective way to audit The importance of recording the provenance , or lineage , suspectable data updates and schema changes. about information of significance is now widely recog- Source Data Recovery . The source database can be re- nized, and a large body of research has been produced covered from data and archived versions of database. on provenance management in scientific workflows and This can also be used to correct faulty source data. databases [4, 15, 19]. Existing provenance systems fo- The Archived Metadata and Provenance Manager cus on capturing the “why”, “where” and “how” facets of (AM&PM) proposed in this paper is the first system de- provenances [5, 12] and support a rich set of provenance- signed to address the problem of supporting provenance related functions and queries. Unfortunately, most previ- under schema evolution. AM&PM achieves this goal by ous works assume that the database schemas and work- (i) extending the SQL information schema facility with flows are fixed and do not change with time. timestamp based archival capability to store the prove- In reality, modern information systems, particularly nance of data and metadata, (ii) using Schema Modifica- big science projects, undergo frequent database schema tion Operators (SMOs) and Integrity Constraints Modifi- changes as illustrated by the UCLA testbed collecting the cation Operators (ICMOs) [9, 10] to express the schema schema history for 20 large information systems, includ- mapping between different versions of the database after ing Mediawiki/Wikipedia, Ensembl GeneticDB and var- database upgrades, and (iii) supporting efficiently pow- ious CERN Physics DBs [1]. For instance, the database erful queries on both data provenance and schema prove- of Mediawiki software supporting Wikipedia has expe- nance. rienced more than 300 schema versions in its nine years history and similar observations hold for the rest. Very 2 Background often, when a mistake is found in the latest version of the database, it is hard to trace the provenance of this mistake Schema Evolution Language The Schema Modifi- in early versions since the schema has changed. cation Operators (SMOs) were introduced in [10] as a 1

  2. Table 1: Schema Evolution Language: SMO and ICMO SMO CREATE TABLE R(a,b,c) DROP TABLE R RENAME TABLE R INTO T COPY TABLE R INTO T MERGE TABLE R, S INTO T Figure 1: Two schema versions for table customer PARTITION TABLE R INTO S WITH cond, T DECOMPOSE TABLE R INTO S(a,b), T(a,c) JOIN TABLE R,S INTO T WHERE cond Temporal Data Model The temporal data model ADD COLUMN d [AS constjfunc(a; b; c)] INTO R tracks the time when data is valid. The relational DROP COLUMN c FROM R schemas in the temporal data model have timestamp RENAME COLUMN b IN R TO d attributes to record the valid time intervals of tuples. ICMO Two temporal dimensions are discussed in past tempo- ALTER TABLE R ADD PRIMARY KEY pk1(a; b) [policy] ALTER TABLE R ADD FOREIGN KEY fk1(c; d) ral database research [14, 21]: valid time and transaction REFERENCES T(a; b) [policy] time . Valid time denotes the time period when entities are ALTER TABLE R ADD VALUE CONSTRAINT vc1 AS valid in the real world. Transaction time denotes the time R:e = 0 [policy] period when entities are valid in the database system. A ALTER TABLE R DROP PRIMARY KEY pk1 ALTER TABLE R DROP FOREIGN KEY fk1 temporal data model is called bi-temporal if both of them ALTER TABLE R DROP VALUE CONSTRAINT vc1 are recorded. In this paper, we focus on the transaction-time model and the transaction-time database. In the transaction- very effective tool for characterizing schema upgrades time database, the tuples created or updated are times- and automating the process of migrating the database tamped with the current system time. The deletion of and upgrading the query-based applications. Since in a tuple causes the tuple to be labeled as deleted and many situations, the schema integrity constraints are its time-stamp to be updated. Many transaction-time also changed along with schema structure, Integrity database implementations are now available, including Constraints Modification Operators (ICMOs) were those of IBM DB2 10, Teradata 13.10, PostgreSQL [18], introduced in [9] to characterize integrity constraints and ArchIS [20]. These systems provide automatic tem- and automate the upgrading of applications involving poral updates processing and support temporal queries. integrity constraint changes. Three types of integrity constraints are considered: primary key, foreign key, and value constraint. Extensive experience with the schema Information Schema In relational databases, the In- evolution testbed [1] shows that the combination of formation Schema [11], also called data dictionary or SMOs and ICMOs displayed in Table 1 can effectively system catalog , is a standard database which archives the capture and characterize the schema evolution history of metadata of the databases. information systems. We would next present a simple For instance, the table COLUMNS in the informa- example to show how our schema evolution language tion schema of MySQL 5.1 is defined with 19 at- works. Consider the table customer in a SALES database tributes. These attributes include Table Schema , Ta- shown in Figure 1. Underlines indicate the primary ble Name , Column Name , Column Default , Data Type , keys. The initial schema version is V 1 . Then the schema and others which are not discussed here since they are evolves into V 2 . Assuming that the data type of the not important for provenance. These attributes provide attribute city in V 2 is the same with the attribute location meta information about columns in tables. The informa- in V 1 , the transformation from V 1 to V 2 can be described tion schema of MySQL 5.1 uses 28 tables to store all as follows: kinds of structural information on database schema. In other words, the information schema is the database of 1. ADD COLUMN birth DATETIME INTO customer other databases and widely supported in the commercial 2. RENAME COLUMN location IN customer TO city database systems. The combination of these two SMOs defines the 3 Approach schema evolution script , which describes the mapping between different versions of database schemas in a con- In AM&PM, we archive the schema evolution and the cise and unambiguous way. Moreover, it is easy to con- provenance information using an enhanced information vert schema evolution scripts into standard SQL scripts, schema, in order to support the queries on the provenance as proved in [10]. of data and metadata discussed next. 2

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend