Description of the general reference scenario and presentation of metadata
Article · January 2002
Source: https://www.researchgate.net/publication/255580215




DaQuinCIS
Metodologie e Strumenti per la Qualità dei Dati in Sistemi Informativi Cooperativi
(Methodologies and Tools for Data Quality in Cooperative Information Systems)
http://www.dis.uniroma1.it/~dq/
Research programme co-funded by MIUR (year 2001)

Description of the general reference scenario and presentation of metadata

Cinzia Cappiello, Chiara Francalanci, Paolo Missier, Barbara Pernici, Pierluigi Plebani, Monica Scannapieco

Summary. This report is aimed at the definition of process-based metrics of data quality and at the definition of the metadata from which a certificate can be produced to certify the quality of data. In particular, data quality dimensions are classified into four categories: subject dimensions, object dimensions, process dimensions, and architectural dimensions. For each dimension, the metadata involved in the measurement process are described. A selection of data quality dimensions relevant in a CIS context is proposed in order to focus the project activity on this set. Finally, the structure of the quality certificate that can be associated with data flowing across different organizations is defined.

Date: 20 December 2002 · Type of product: Technical report · Number of pages: 38 · Responsible unit: MIP · Units involved: MIB, MIP, RM · Contact author:


Description of the general reference scenario and presentation of metadata

Cinzia Cappiello, Chiara Francalanci, Paolo Missier, Barbara Pernici, Pierluigi Plebani, Monica Scannapieco · 20 December 2002

Abstract. This report is aimed at the definition of process-based metrics of data quality and at the definition of the metadata from which a certificate can be produced to certify the quality of data. In particular, data quality dimensions are classified into four categories: subject dimensions, object dimensions, process dimensions, and architectural dimensions. For each dimension, the metadata involved in the measurement process are described. A selection of data quality dimensions relevant in a CIS context is proposed in order to focus the project activity on this set. Finally, the structure of the quality certificate that can be associated with data flowing across different organizations is defined.



1 Introduction

Data quality is an increasingly critical aspect of the quality of service in the majority of information-intensive businesses. In business contexts where each organization can only access internal data, the primary goal of data quality assurance is the continuous control of data values and, possibly, their improvement. In cooperative information systems (CIS), which involve multiple organizations that must share data to reach a common goal, quality assurance additionally requires objective measures and evaluations of data quality that can be exchanged along with the corresponding data. Moreover, in a context in which interacting organizations may not be familiar with each other, approaches that certify the quality of exchanged data are important for evaluating incoming information.

This report focuses on two fundamental aspects of data quality management in CISs. First, it classifies data quality dimensions and surveys their definitions and measures in previous literature. Second, it proposes a set of relevant dimensions to be analyzed in a CIS context. In Section 2, a model for exporting data and quality data is presented; in Sections 3-8, a whole set of data quality dimensions is described, together with the metadata that allow data quality evaluation. In Section 9, a basic set of frequently used dimensions is proposed as relevant for CIS, and a description of the data quality certificate is provided in Section 10.

2 The D2Q Model

All cooperating organizations export data quality dimension values evaluated for their application data according to a specific data model. The model for exporting data and quality data is referred to as the Data and Data Quality (D2Q) model. In defining the model, for simplicity we consider a set of only four dimensions, namely: accuracy, completeness, internal consistency and currency. Such a set is, however, representative of the considered dimension categories (i.e., object, process and architectural).

2.1 Data Model

The D2Q model is inspired by the data model underlying XML-QL [4]. A database view of XML is adopted: an XML document is a set of data items, and a Document Type Definition (DTD) is the schema of such data items, consisting of data and quality classes. In particular, a D2Q XML document contains both application data, in the form of a D2Q data graph, and the related data quality values, in the form of four D2Q quality graphs, one for each considered quality dimension. Specifically, nodes of the D2Q data graph are linked to the corresponding nodes of the D2Q quality graphs through links, as shown in Figure 1.

As a running example, consider the document citizens.xml, shown in Figure 2, which contains entries about citizens with the associated quality data. Such a document corresponds to a set of conceptual data items, which are instances of conceptual schema elements; schema elements are data and quality classes, and instances are data and quality objects. Specifically, an instance of Citizen and the related Accuracy values are depicted. Data classes and objects are straightforwardly represented as D2Q data graphs, as detailed in the following of this section, and quality classes and objects are represented as D2Q quality graphs, as detailed in Section 2.2.

In order to clarify our definition of data class in XML, we preliminarily recall a typical definition of data class from ODMG [5]. A data class δ(π1, . . . , πn) consists of:

  • a name δ;


  • a set of property tuples πi = <namei : typei>, i = 1 . . . n, n ≥ 1, where namei is the name of the property πi and typei can be:
    – either a basic type¹;
    – or a data class;
    – or a type set-of <X>, where <X> can be either a basic type or a data class.

A D2Q data graph G is a graph with the following features:

  • a set of nodes N; each node (i) is identified by an object identifier and (ii) is the source of 4 different links to quality objects, one for each quality dimension. A link is a pair attribute-value, in which attribute represents the specific quality dimension for the element tag and value is an IDREF link²;
  • a set of edges E ⊂ N × N; each edge is labeled by a string, which represents an element tag of an XML document;
  • a single root node R;
  • a set of leaves; leaves are nodes that (i) are not identified and (ii) are labeled by strings, which represent element tag values, i.e., the values of the element tags labeling edges to them.

Data class instances can be represented as D2Q data graphs, according to the following rules. Let δ(π1, . . . , πn) be a data class with n properties, and let O be a data object, i.e., an instance of the data class. Such an instance is represented by a D2Q data graph G as follows:

  • The root R of G is labeled with the object identifier of the instance O.
  • For each πi = <namei : typei> the following rules hold:
    – if typei is a basic type, then R is connected to a leaf lvi by the edge <R, lvi>; the edge is labeled with namei and the leaf lvi is labeled with the property value O.namei;
    – if typei is a data class, then R is connected to the D2Q data graph which represents the property value O′ = O.namei by an edge labeled with namei;
    – if typei is a set-of <X>, then:
      ∗ let C be the cardinality of O.namei; R is connected to C elements as follows: if (i) <X> is a basic type, then the elements are leaves (each of them labeled with a property value of the set); otherwise, if (ii) <X> is a data class, then the elements are D2Q data graphs, each of them representing a data object of the set;
      ∗ edges connecting the root to the elements are all labeled with namei.

In Figure 3, the D2Q data graph of the running example is shown: an object instance Maria Rossi of the data class Citizen is considered. The data class has Name and Surname as properties of basic types, a property of type set-of <TelephoneNumber> and another property of data-class type ResidenceAddress; the data class ResidenceAddress has all properties of basic types.

¹ Basic types are the ones provided by the most common programming languages and SQL, that is Integer, Real, Boolean, String, Date, Time, Interval, Currency, Any.

² The use of links will be further explained in Section 2.2, when quality graphs are introduced.
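The instantiation rules above can be sketched as a small Python helper. This is a hypothetical illustration, not part of the D2Q specification: basic-typed properties become edges to leaves, data-class properties become edges to sub-graphs, and set-of properties yield one edge per element.

```python
# Hypothetical sketch of the data-object-to-D2Q-data-graph rules; the
# function name and dictionary layout are assumptions for illustration only.

def build_graph(oid, obj, graph=None):
    """Represent a data object (a dict) as a D2Q data graph.

    The graph is {"nodes": {oid: obj}, "edges": [(source, label, target)]};
    leaves appear directly as plain values in the edge targets.
    """
    if graph is None:
        graph = {"nodes": {}, "edges": []}
    graph["nodes"][oid] = obj                    # root labeled with the object identifier
    for name, value in obj.items():
        if isinstance(value, dict):              # property of data-class type
            child_oid = f"{oid}.{name}"
            build_graph(child_oid, value, graph)
            graph["edges"].append((oid, name, child_oid))
        elif isinstance(value, list):            # set-of property: one edge per element
            for elem in value:
                graph["edges"].append((oid, name, elem))
        else:                                    # basic type: edge to a leaf
            graph["edges"].append((oid, name, value))
    return graph

citizen = {
    "Name": "Maria", "Surname": "Rossi",
    "TelephoneNumber": ["+390649918479", "+393391234567"],
    "ResidenceAddress": {"Street": "Via Salaria 113", "City": "Roma",
                         "ZIPCode": "00198", "Country": "Italy"},
}
g = build_graph("citizen1", citizen)
```

Running the sketch on the Maria Rossi instance yields the shape of Figure 3: two TelephoneNumber edges to leaves, and one ResidenceAddress edge to a sub-graph.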



[Figure: a D2Q data graph linked to four D2Q quality graphs, one per dimension (Accuracy, Completeness, Currency, Internal consistency).]

Figure 1: The generic structure of a D2Q XML document, returned as result by a service operation

<Citizens>
  <Citizen>
    <Name>Maria</Name>
    <Surname>Rossi</Surname>
    <BirthDate>16/07/1975</BirthDate>
    <Sex>F</Sex>
    <email>maria.rossi@daquincis.org</email>
    <ResidenceAddress>
      <Street>Via Salaria 113</Street>
      <City>Roma</City>
      <ZIPCode>00198</ZIPCode>
      <Country>Italy</Country>
    </ResidenceAddress>
    <TelephoneNumber>+390649918479</TelephoneNumber>
  </Citizen>
</Citizens>

Figure 2: The XML document of the running example
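As a quick illustration of navigating the running example, the standard library suffices. The snippet below embeds a trimmed copy of the document rather than reading citizens.xml from disk; it is not part of the D2Q machinery.

```python
# Illustrative only: navigating the citizens.xml running example with the
# Python standard library (a trimmed, inline copy of the document).
import xml.etree.ElementTree as ET

doc = """<Citizens><Citizen><Name>Maria</Name><Surname>Rossi</Surname>
<ResidenceAddress><City>Roma</City><ZIPCode>00198</ZIPCode></ResidenceAddress>
</Citizen></Citizens>"""

root = ET.fromstring(doc)
citizen = root.find("Citizen")
name = citizen.findtext("Name")                                  # "Maria"
zipcode = citizen.find("ResidenceAddress").findtext("ZIPCode")   # "00198"
```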

[Figure: the Citizen data graph, with edges Name (MARIA), Surname (ROSSI), two TelephoneNumber leaves (+390649918479, +393391234567), and a ResidenceAddress sub-graph with Street (VIA SALARIA 113), City (ROMA), ZIPCode (00198), Country (ITALY); nodes and leaves carry links to quality objects/values.]

Figure 3: The D2Q data graph of the running example


2.2 Quality Model

So far, the data portion of the D2Q model has been described. However, organizations export XML documents containing not only data objects, but also the related quality data. Quality data are represented as graphs, too; they correspond to a set of conceptual quality data items, which are instances of conceptual quality schema elements; quality schema elements are referred to as quality classes and instances as quality objects. A quality class models a specific quality dimension for a specific data class: the property values of a quality object represent the quality dimension values of the property values of a data object. Therefore, each data object (i.e., node) and value (i.e., leaf) of a D2Q data graph is linked to four quality objects and values, respectively.

Let δ(π1, . . . , πn) be a data class. A quality class δD(πD1, . . . , πDn) consists of:

  • a name δD, with D ∈ { Accuracy, Completeness, Currency, Internal Consistency };
  • a set of tuples πDi = <nameDi : typeDi>, i = 1 . . . n, n ≥ 1, where:
    – δD is associated with δ by a one-to-one relationship and corresponds to the quality dimension D evaluated for δ;
    – πDi is associated with πi of δ by a one-to-one relationship, and corresponds to the quality dimension D evaluated for πi;
    – typeDi is either a basic type, or a quality class, or a set-of type, according to the structure of the data class δ.

In order to represent quality objects, we define a D2Q quality graph as follows. A D2Q quality graph GD is a D2Q data graph with the following additional features:

  • no node nor leaf is linked to any other element;
  • labels of edges are strings of the form D_name (e.g., Accuracy_Citizen);
  • labels of leaves are strings representing quality values;
  • leaves are identified by object identifiers.

A quality class instance can be straightforwardly represented as a D2Q quality graph, on the basis of rules analogous to the ones previously presented for data objects and D2Q data graphs. As an example, in Figure 4 the D2Q quality graph concerning the accuracy of the running example is shown, and links are highlighted; for instance, the accuracy of Maria is 0.7.

Let { O1, . . . , Om } be a set of m objects which are instances of the same data class δ; a D2Q XML document is a graph consisting of:

  • a root node ROOT;
  • m D2Q data graphs Gi, i = 1 . . . m, each of them representing the data object Oi;
  • 4 ∗ m D2Q quality graphs GDi, i = 1 . . . m, each of them representing the quality graph related to Oi concerning the quality dimension D;


[Figure: the Accuracy D2Q quality graph of the running example, with leaves labeled Accuracy_Name, Accuracy_Surname, Accuracy_TelephoneNumber (twice) and an Accuracy_ResidenceAddress sub-graph with Accuracy_Street, Accuracy_City, Accuracy_ZIPCode, Accuracy_Country; the leaves carry quality values (0.7, 0.9, 0.3, 0.9, 0.9, 0.7, 0.5, 0.2) and receive links from the corresponding data objects/values.]

Figure 4: The accuracy D2Q quality graph of the running example

  • ROOT is connected to the m D2Q data graphs by edges labeled with the name of the data class, i.e., δ;
  • for each quality dimension D, ROOT is connected to the m D2Q quality graphs GDi by edges labeled with the name of the quality class, i.e., δD.

The model proposed in this work adopts several graphs instead of embedding metadata within the data graph. Such a decision increases the document size but, on the other hand, allows a modular and “fit-for-all” design: (i) extending the model to new dimensions is straightforward, as it only requires defining the quality graph of the new dimension, and (ii) specific applications, requiring only some dimension values, will adopt only the appropriate subset of the graphs.
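The modularity argument can be illustrated with a toy sketch; the dictionary layout and function name below are hypothetical, not mandated by the D2Q model. A consumer that needs only some dimensions simply drops the other quality graphs.

```python
# Sketch of the "fit-for-all" design: a D2Q document carries one data graph
# plus one quality graph per dimension, and a consumer keeps only the
# dimensions it needs. All names here are illustrative assumptions.

DIMENSIONS = ("Accuracy", "Completeness", "Currency", "InternalConsistency")

def select_quality_graphs(document, wanted):
    """Return only the quality graphs for the requested dimensions."""
    return {d: g for d, g in document["quality"].items() if d in wanted}

doc = {
    "data": {"Name": "Maria"},
    "quality": {d: ({"Accuracy_Name": 0.7} if d == "Accuracy" else {})
                for d in DIMENSIONS},
}
subset = select_quality_graphs(doc, {"Accuracy", "Currency"})
```

Because each dimension lives in its own graph, adding a fifth dimension would only add one more entry under "quality", leaving existing consumers untouched.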

3 Data Quality Dimensions

The data quality dimensions considered in the project are the ones proposed in [9], in which fifteen quality dimensions are grouped into four categories, namely: intrinsic, contextual, representational, accessibility. Considering the specificity of the CIS context, we revise this set of dimensions in two ways:

  • We consider a different grouping of dimensions, on the basis of the methods that will be used for their measurement.
  • We add some dimensions specific to the CIS context.

More specifically, four different categories are considered: subject, object, architectural and process. The subject category includes all the dimensions whose evaluation requires a user. The object category includes all the intrinsic dimensions, but also other dimensions measurable simply by considering data. The architectural category includes data quality dimensions related to the architecture supporting a CIS. The process category includes all the contextual dimensions, but also some dimensions which are specific to cooperative processes. In the following, we list the dimensions for each category and, for each of them, we recall the definition given in [9], or we provide a new definition for the dimensions we propose to characterize CIS contexts; these latter dimensions are marked with an asterisk.

3.1 Subject Dimensions

The subject category includes the following dimensions:

  • Interpretability. The extent to which data are in appropriate language and units and data definitions are clear.
  • Ease of understanding. The extent to which data are clear, without ambiguity, and easily comprehended.
  • Concise representation. The extent to which data are compactly represented without being overwhelming (i.e., brief in presentation, yet complete and to the point).
  • Accessibility. The extent to which data are available or easily and quickly retrievable.

3.2 Object Dimensions

The object category includes the following dimensions:

  • Believability. The extent to which data are accepted or regarded as true, real, and credible.
  • Accuracy. The extent to which data are correct, reliable and certified.
  • Objectivity. The extent to which data are unbiased (unprejudiced) and impartial.
  • Reputation. The extent to which data are trusted or highly regarded in terms of their source or content.
  • Representational consistency. The extent to which data are always presented in the same format and are compatible with previous data.
  • Internal Consistency*. The property by which values internal to an entity (e.g., a class) do not conflict with one another, according to entity-dependent semantic rules.
  • Data Completeness*. We consider: (i) the completeness of an attribute value, as dependent on whether the attribute is present or not; (ii) the completeness of an entity instance (e.g., an object), as dependent on the number of attribute values that are present.

3.3 Architectural Dimensions

  • Availability*. The ability of the architecture to provide the specific data upon request.
  • Responsiveness*. The ability of the architecture to quickly provide replies upon specific data requests.
  • Source Availability*. The ability of a data source to provide the specific data upon request.
  • Source Responsiveness*. The ability of a data source to quickly provide replies upon specific data requests.

3.4 Process Dimensions

The process category includes the following dimensions:

  • Relevancy. The extent to which data are applicable and helpful to the task at hand.
  • Timeliness. The extent to which the age of the data is appropriate for the task at hand.
  • Appropriate amount of data. The extent to which the quantity or volume of available data is appropriate.
  • Process Completeness. The extent to which data are of sufficient breadth, depth, and scope for the task at hand.
  • Value-added. The extent to which data are beneficial and provide advantage from their use.
  • Access Security. The extent to which access to data can be restricted and hence kept secure.
  • History*. It describes the history of the manipulations applied to data during exchanges within cooperative processes.
  • Cost*.

4 Data Quality Metadata

In the previous section we made the assumption that each leaf of the quality graph contains a “single” metadata item, consisting of the value of a dimension for a specific attribute value. In the next sections, we relax this assumption by introducing a “set” of quality metadata characterizing each data value. Specifically, we propose to describe these quality values using a general metadata format that accommodates a rating, that is, the result of a measurement, and that explicitly includes the description of the criteria adopted for the measurement. Note that this approach conforms to standard practices used in various scientific fields when presenting experimental results (e.g., the result of a medical test is presented along with the specific technique used to carry out the test).

Furthermore, for complex dimensions, the rating definition may be broken down into several facets, each covering a more specific aspect of that dimension. For instance, the Interpretability subject dimension is broken down into three rating values, for language appropriateness, data units clarity and data definitions clarity, respectively. Each rating value represents a facet of the Interpretability dimension.

The metadata descriptor is a set of triples of the form

  [QualityRating, RatingRangeDescriptor, RatingMethodDescriptor]

where each triple carries the quality rating for one of the facets that make up the dimension. Each triple consists of the following elements:

  • QualityRating is the quality value for the dimension, expressed as a number (a Real) in the normalized range 0-100;
  • RatingRangeDescriptor describes the meaning of the quality values, i.e., what the rating number stands for;
  • RatingMethodDescriptor describes the criteria used to obtain the rating, i.e., how the measurements are carried out.

Unless rating criteria are standardized, we may expect that different rating methods will be adopted by different data providers. A different RatingMethodDescriptor provides a description for each of those methods. At the same time, depending on the method adopted, each data provider may give different interpretations to the rating figures. The RatingRangeDescriptor is meant to explain to the data consumer what the rating means. The combination of these two descriptors provides data consumers with the information needed to interpret the raw rating figure.

This solution provides great flexibility in defining a customized rating system. Of course, should one or both descriptors become standard (this is not the case today), they can be left implicit and only the rating figures for each dimension need to be attached to the data value. Notice that, for each dimension, some further metadata can be required in order to make the specification of quality ranges, rating ranges and rating methods as accurate as possible. As an example, when considering an Accuracy evaluation by comparison with another data source, the data source used for the comparison has to be specified. In the following of this document we do not give all the details of such further metadata, as they are strictly related to the specific measurement activity that will be investigated in the next project activity.
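A minimal sketch of the metadata triple as a record type may help fix ideas. The Python class and snake_case field spellings below are illustrative assumptions; only the three element names come from the text, and the facet values are borrowed loosely from the Interpretability example.

```python
# Hypothetical sketch of the [QualityRating, RatingRangeDescriptor,
# RatingMethodDescriptor] triple; class and field names are assumptions.
from dataclasses import dataclass

@dataclass
class RatingTriple:
    quality_rating: float    # QualityRating: a Real in the normalized range 0-100
    range_descriptor: str    # RatingRangeDescriptor: what the rating number stands for
    method_descriptor: str   # RatingMethodDescriptor: how the measurement was carried out

# A complex dimension is described by one triple per facet, e.g. Interpretability:
interpretability = {
    "LanguageAppropriateness":
        RatingTriple(35.0, "0=inadequate ... 100=easy to interpret", "user survey"),
    "DataUnitsClarity":
        RatingTriple(22.0, "0=inappropriate units ... 100=units help", "usage statistics"),
    "DataDefinitionsClarity":
        RatingTriple(60.0, "0=unclear ... 100=clear definitions", "user survey"),
}
```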

5 Metadata for Subject Dimensions

In this section, we provide metadata definitions for the following quality dimensions:

  • Interpretability: the extent to which data are in appropriate language and units and the data definitions are clear.
  • Ease of understanding: the extent to which data are clear, without ambiguity, and easily comprehended.
  • Concise representation: the extent to which data are compactly represented without being overwhelming (i.e., brief in presentation, yet complete and to the point).
  • Accessibility: the extent to which data are available or easily and quickly retrievable.

We start by observing that the definitions given for these dimensions imply that their measurement is subjective, i.e., different criteria can be used to measure their value, yielding different results. Unlike with objective data dimensions, here it may be difficult to argue for or against one particular choice of measurement criteria. Therefore, we describe these quality values using the general metadata format of Section 4, which accommodates a rating together with an explicit description of the criteria adopted for the measurement; for complex subjective dimensions, the rating is broken down into facets, as in the Interpretability example above.

Using customized descriptors requires flexibility in the language used to describe them. Therefore, in this proposal the RatingRangeDescriptor and the RatingMethodDescriptor are defined using self-describing, semi-structured documents that follow the XML syntax. This means that documents will follow a common XML schema that may be extended for each specific descriptor (semi-structured), and that the extensions themselves are described in an XML Schema, possibly proprietary, that is referenced from within the descriptor itself (self-describing). Note that this arrangement follows the common standard for XML-based documents, whereby the schema for the document's structure is expressed as a DTD or as an XML schema, and can be either included in the document itself, or referred to using a URL.

Before we present the common schema, we illustrate this proposal by means of examples (the actual XML content and structures may vary) that apply to the specific dimensions of interest in this document. In the first example, Interpretability's rating value is broken down into the language appropriateness, data units clarity and data definitions clarity facets, while the Understandability dimension has only one (unnamed) facet. The rating is computed in different ways for the different facets, i.e., LanguageAppropriateness is computed from surveying a sample of users, while the clarity of data units is determined through usage statistics.

A DTD or XML Schema will be provided in a follow-up draft. An important point to note is that each descriptor consists of a common structure, described through a hopefully standard schema, which contains one or more custom and possibly proprietary structures.


...
<Interpretability facet="Language appropriateness">
  <Data name="allFile">
    <Value>35</Value>
    <Rating>
      <RatingRangeDescriptor
          MinDef="Language completely inadequate to express this data"
          MaxDef="Choice of language makes it simple to interpret data">
        <DataPoint value="75"
            description="Language helps understanding the data but is prone to ambiguity"/>
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="Acquisition"
            URLDescription="http://www.daquincis.com/acqmethod.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Data>
</Interpretability>
<Interpretability facet="DataUnitsClarity">
  <Data name="allFile">
    <Value>22</Value>
    <Rating>
      <RatingRangeDescriptor
          MinDef="Data units inappropriate for this type of data"
          MaxDef="Data units help interpret the data">
        <DataPoint value="50"
            description="Data units not very intuitive but can be used to help the user interpret the data"/>
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="Acquisition"
            URLDescription="http://www.daquincis.com/acqmethod.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Data>
</Interpretability>
<Understandability>
  <Data name="allFile">
    <Value>75</Value>
    <Rating>
      <RatingRangeDescriptor
          MinDef="Data value not understandable"
          MaxDef="Data easy to understand">
        <DataPoint value="25" description="low understandability"/>
        <DataPoint value="50" description="average understandability"/>
        <DataPoint value="75" description="good understandability"/>
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="Acquisition"
            URLDescription="http://www.daquincis.com/acqmethod.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Data>
</Understandability>
<!-- Conciseness and availability can be defined similarly -->
...

Figure 5: The XML document of the interpretability and understandability metadata


Following the detailed examples, a QualityDescriptor structure consists of a sequence of Rating structures, one for each of the facets that make up the overall quality value. When only one facet exists, the [facet] attribute of Rating may be undefined. The [value] attribute of Rating is mandatory. The RatingRangeDescriptor and RatingMethodDescriptor structures are both optional.

A RatingRangeDescriptor provides an informal definition of the range of possible rating values. The pair of Min, Max attributes label the extreme rating values and are mandatory. Additional optional DataPoint structures may be used to label intermediate values in the range. Thus, this descriptor may be used to map the normalized 0-100 value range for the rating, which is common to all quality dimensions, to descriptive labels that are significant for the specific dimension being defined.

The RatingMethodDescriptor provides details on how the rating value was obtained by the data provider. When present, this information may be used by data consumers to weight the relevance of the rating itself. The idea is that, ultimately, part or all of a rating method may be standardized and its definition shared between data providers and data consumers (this is actually what happens for many standard medical and other scientific tests). Thus, data consumers may be interested in the technique alone (e.g., UserSurvey vs. UsageStatistics), in the parameters for each technique (confidence factors, sample size), and so forth. The descriptor contains a common, timestamped AcquisitionMethod structure, and custom sub-structures like UserSurvey, UsageStatistics, and more. Data consumers that are interested in parsing the custom information may need to look up its Schema or DTD structure (either included in the document, or referenced through a URL).
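One way a consumer might exploit a RatingRangeDescriptor is to map a 0-100 rating onto the nearest descriptive label. The helper below is a hypothetical sketch (Min/Max are represented simply as the labels of the values 0 and 100), using the Understandability labels from Figure 5.

```python
# Hypothetical consumer-side sketch: map a normalized 0-100 rating onto the
# descriptive labels of a RatingRangeDescriptor (Min/Max plus DataPoints).

def describe(rating, min_def, max_def, data_points):
    """Return the label of the range point closest to the rating.

    data_points: {value: description}; 0 and 100 map to min_def/max_def.
    """
    points = {0: min_def, 100: max_def, **data_points}
    nearest = min(points, key=lambda v: abs(v - rating))
    return points[nearest]

label = describe(
    75,
    "Data value not understandable",
    "Data easy to understand",
    {25: "low understandability", 50: "average understandability",
     75: "good understandability"},
)
```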

6 Metadata for Object Dimensions

6.1 Accuracy

Accuracy is defined as “the extent to which data are correct, reliable and certified” [9].

The metadata QualityRating specifies the accuracy value of a data item, which may be 0, 50 or 100. The metadata RatingRangeDescriptor specifies the scale used for measurement; it can assume three different values, i.e., 0 = inaccurate, 50 = undecided, 100 = accurate. The metadata RatingMethodDescriptor describes the accuracy measurement method. Accuracy is usually evaluated through automated procedures. The simplest and most basic method is to compare data values with their real-world counterparts; this method can yield precise estimates of error rates, but it is expensive and time-consuming. A second method for error detection is database bashing, which consists in comparing records from two or more databases: data that agree are considered correct, and data that do not agree are flagged for further investigation and correction. This method is useful and convenient, since its costs are reasonable. In the DaQuinCIS project, this latter method will be applied to evaluate accuracy: we assume that each organization has a database containing benchmark values available for comparisons. An example of an XML file with Accuracy metadata related to a ResidenceAddress data item is shown in Figure 6.
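Database bashing can be sketched as a field-by-field comparison against the benchmark database, emitting the 0/50/100 scale described above. The function and the "undecided when no benchmark value exists" convention are illustrative assumptions, not prescribed by the report.

```python
# Hypothetical sketch of "database bashing": compare each field of a record
# with a benchmark database and emit the 0/50/100 accuracy rating
# (100 = accurate, 0 = inaccurate/flagged, 50 = undecided).

def accuracy_rating(record, benchmark):
    ratings = {}
    for field, value in record.items():
        if field not in benchmark:
            ratings[field] = 50      # undecided: nothing to compare with
        elif benchmark[field] == value:
            ratings[field] = 100     # values agree: considered correct
        else:
            ratings[field] = 0       # disagreement: flag for correction
    return ratings

r = accuracy_rating(
    {"ResidenceAddress": "Via Salaria 113", "ZIPCode": "00199", "Sex": "F"},
    {"ResidenceAddress": "Via Salaria 113", "ZIPCode": "00198"},
)
```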

6.2 Believability, Objectivity and Reputation: Source Reliability

In [9], Believability is defined as “the extent to which data are accepted or regarded as true, real, and credible”; Objectivity is defined as “the extent to which data are unbiased (unprejudiced) and impartial”;


...
<Accuracy>
  <Field name="ResidenceAddress">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="inaccurate" MaxDef="accurate">
        <DataPoint value="50" description="undecided"/>
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="DataBaseBashing"
            URLDescription="http://www.daquincis.org/dbbas.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</Accuracy>
...

Figure 6: An example of accuracy metadata

Reputation is defined as "the extent to which data are trusted or highly regarded in terms of their source or content".

We group these dimensions into a single one that we call Source Reliability, in order to characterize a source providing data by a unique dimension that captures the trust, credibility and objectivity of the source. The metadata QualityRating specifies the source reliability value of a data item, which may be Trusted, Untrusted, or Under Evaluation. The metadata RatingRangeDescriptor specifies that Trusted means that a source provides reliable data, Untrusted that a source provides unreliable data, and Under Evaluation that a source has occasionally provided unreliable data. The metadata RatingMethodDescriptor describes the adopted source reliability evaluation method, which allows assigning the values Trusted, Untrusted and Under Evaluation to each source. We further introduce the metadata DataSourceName, in order to identify the specific source providing data.

6.3 Representational consistency

Representational consistency is defined as "the extent to which data are always presented in the same format and are compatible with previous data" [9]. The metadata QualityRating can have two different values, namely 0 and 100. The metadata RatingRangeDescriptor specifies that the value 0 means that a format consistency verification has returned False, while 100 means that the verification has returned True. The metadata RatingMethodDescriptor describes the adopted representational consistency evaluation method; a possible method is the verification of the data value format against a reference one. The further metadata to consider is the ReferenceFormat metadata. An example of an XML file with Representational Consistency metadata, related to a Birth Date data item, is shown in Figure 8.
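A format verification of this kind can be sketched as a comparison against a reference pattern. The regular expression below (a DD/MM/YYYY birth date) is an illustrative stand-in for the ReferenceFormat metadata, not a format prescribed by the report.

```python
import re

# Sketch of a representational-consistency check: the data value is matched
# against an assumed ReferenceFormat; the result maps to the 0/100 rating.
REFERENCE_FORMAT = re.compile(r"^\d{2}/\d{2}/\d{4}$")

def representational_consistency(value: str) -> int:
    """QualityRating: 100 if the value matches the reference format, else 0."""
    return 100 if REFERENCE_FORMAT.match(value) else 0
```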


...
<SourceReliability source="MyDataBase">
  <Field name="ResidenceAddress">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="Untrusted" MaxDef="Trusted">
        <DataPoint value="50" description="The organization has occasionally provided not reliable data"/>
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="DynamicRatingService" URLDescription="http://www.daquincis.org/dynRating.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</SourceReliability>
...

Figure 7: An example of source reliability metadata

...
<RepresentationalConsistency>
  <Field name="BirthDate">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="The format consistency verification has given a False value" MaxDef="The format consistency verification has given a True value">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="ComparisonDate" URLDescription="http://www.daquincis.org/compar.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</RepresentationalConsistency>
...

Figure 8: An example of representational consistency metadata


...
<InternalConsistency>
  <Field name="Sex">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="The semantic rule verification has given a False value" MaxDef="The semantic rule verification has given a True value">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="RuleCheck" URLDescription="http://www.daquincis.org/checksex.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</InternalConsistency>
...

Figure 9: An example of internal consistency metadata

6.4 Internal Consistency

Internal consistency is defined as the property by which values internal to an entity (e.g., a class) do not conflict with one another according to entity-dependent semantic rules. The metadata QualityRating can have two different values, namely 0 and 100. The metadata RatingRangeDescriptor specifies that the values 0 and 100 correspond respectively to the values False and True returned by the verification of a specific semantic rule or set of semantic rules. The metadata RatingMethodDescriptor describes the adopted internal consistency evaluation method. The method we will adopt is data edits, i.e. computerized routines that verify whether data values and/or their representation satisfy predetermined constraints. Such constraints are called business rules, and they can involve a single data field, more than one data field, or probability considerations. In this case, the further metadata to consider are:

  • SemanticRuleDescription
  • IDREFS to the data values used for the consistency checking. As of now, D2Q does not yet support this kind of reference, but it may be extended to cope with this aspect too.

An example of an XML file with Internal Consistency metadata, related to a Sex data item, is shown in Figure 9.
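A data edit spanning two fields of the same record can be sketched as below. The business rule itself (Sex versus Title) is a made-up illustration of an entity-dependent semantic rule; it is not taken from the report.

```python
# Sketch of a "data edits" internal-consistency check: each business rule is a
# predicate over a record; the rating is 100 only if every rule holds.

def sex_title_rule(record: dict) -> bool:
    """True when the (hypothetical) Sex and Title fields do not conflict."""
    if record.get("Sex") == "M":
        return record.get("Title") not in {"Mrs.", "Ms."}
    if record.get("Sex") == "F":
        return record.get("Title") != "Mr."
    return False  # a Sex value outside the allowed domain violates the rule

def internal_consistency(record: dict, rules=(sex_title_rule,)) -> int:
    """QualityRating: 100 if every semantic rule holds, 0 otherwise."""
    return 100 if all(rule(record) for rule in rules) else 0
```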

6.5 Data Completeness

Data Completeness can be referred to an attribute, meaning whether the attribute value is present or not. The following metadata are introduced:

  • QualityRating: it can have two different values, namely 0 and 100.
  • RatingRangeDescriptor: in evaluating completeness, it is important to consider the meaning of null values of an attribute, depending on the attribute being mandatory, optional, or inapplicable: a null value for a mandatory attribute is associated with a lower completeness, whereas completeness is not affected by optional or inapplicable null values.


...
<Completeness>
  <Field name="email">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="The value is not present and there is a case of incompleteness" MaxDef="Either value is present or the value is not present but there is not a case of incompleteness">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="CompletenessCheck" URLDescription="http://www.daquincis.org/compl-email.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</Completeness>
...

Figure 10: An example of completeness metadata

As an example, consider the attribute E-mail of the Citizen schema element; a null value for E-mail may have different meanings, that is, (i) the specific citizen has no e-mail address, and therefore the attribute is inapplicable (this case has no impact on completeness), or (ii) the specific citizen has an e-mail address which has not been stored (in this case completeness is low). Therefore, the RatingRangeDescriptor metadata specifies that 100 means that either the value is present, or the value is not present but this is not a case of incompleteness; 0 means that the value is not present and this is a case of incompleteness.

  • RatingMethodDescriptor: it is the checking of the presence or absence of attribute values, with specific attention to the meaning of null values.

An example of an XML file with completeness metadata, related to an Email data item, is shown in Figure 10.
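The null-value semantics above can be sketched as follows; the attribute-kind table is an assumed piece of schema metadata, and the attribute names are illustrative.

```python
# Sketch of the completeness rating: a null lowers completeness only when the
# attribute is mandatory; optional or inapplicable nulls have no impact.

ATTRIBUTE_KIND = {"Name": "mandatory", "email": "optional"}  # assumed schema info

def completeness(field: str, value) -> int:
    """QualityRating: 0 only for a null value in a mandatory attribute."""
    if value is None and ATTRIBUTE_KIND.get(field) == "mandatory":
        return 0
    return 100
```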

7 Metadata for Architectural Dimensions

7.1 Availability and Source Availability

We distinguish two types of availability: the ability of the whole architecture (Availability) and of a single source (Source Availability) to provide the specific data upon request. A source is considered not available, with respect to a specific data request, when it does not return a reply within a timeout value (metadata timeout). For Availability the following metadata are introduced:

  • QualityRating: a value between 0 and 100.
  • RatingRangeDescriptor: represents the percentage of time, with respect to a specified time interval (metadata interval), during which at least one source exists that can return the data item.


...
<SourceAvailability source="CitizenDB" timeout="1" measureto="minute" interval="30" measurei="day">
  <Field name="Text">
    <Value>67</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="never available" MaxDef="always available">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="OfflinePush" URLDescription="http://www.daquincis.org/offlinepush.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</SourceAvailability>
...

Figure 11: An example of source availability metadata

  • RatingMethodDescriptor: specifies the method used to perform availability measurement. Some examples of such methods are based on an architectural component that performs availability measurement. This can be carried out off-line by the component itself in two styles: pull (the component periodically pings all the sources) or push (the sources periodically send a heartbeat to the component). Otherwise, the component could simply gather feedback from the users that requested the data item; the percentage can then be approximated as the ratio between the sum of all the positive responses to a specific data request and the sum of all the responses (both positive and negative).

For Source Availability the following metadata are introduced:

  • QualityRating: a value between 0 and 100.
  • RatingRangeDescriptor: represents the percentage of time, with respect to a specified time interval (metadata interval), during which the source (metadata source) can return the data item within a timeout value (metadata timeout).
  • RatingMethodDescriptor: the same as for Availability, but considered on a single source.

Moreover, the Source Availability metadata must specify the source subject to the measurement. An example of an XML file with availability metadata, related to a Residence Address data item, is shown in Figure 11.
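The feedback-based approximation above can be sketched in a few lines; the response log is invented for illustration.

```python
# Sketch of the availability rating from user feedback: the QualityRating is
# the percentage of positive responses (reply within the timeout) over all
# responses collected in the observation interval.

def availability_rating(responses):
    """responses: iterable of booleans, True = reply arrived within timeout."""
    responses = list(responses)
    if not responses:
        return 0
    return round(100 * sum(responses) / len(responses))

# Two requests out of three answered in time: rating about 67, as in Figure 11.
rating = availability_rating([True, True, False])
```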

7.2 Responsiveness and Source Responsiveness

We distinguish two types of Responsiveness: the ability of the whole architecture (Responsiveness) and of a single data source (Source Responsiveness) to quickly provide replies to specific data requests. For Responsiveness the following metadata are introduced:


...
<Responsiveness source="CitizenDB" timeout="1" measure="minute">
  <Field name="ResidenceAddress">
    <Value>12</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="highly responsive" MaxDef="lowly responsive">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="OfflinePush" URLDescription="http://www.daquincis.org/offlinepush.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</Responsiveness>
...

Figure 12: An example of source responsiveness metadata

  • QualityRating: a value between 0 and 100.
  • RatingRangeDescriptor: the value represents the average response time, calculated over all the sources with respect to a specific data request, and normalized to the timeout value (metadata timeout) after which a source is considered not available (see previous section).
  • RatingMethodDescriptor: specifies the method used to perform responsiveness measurement. Some examples of such methods are based on an architectural component that performs responsiveness measurement. This can be carried out off-line by periodically pinging all the sources that can return the specific data item and calculating the response time. Otherwise, the average time can be calculated only from data requests performed by users, who give feedback to the component after each request.

For Source Responsiveness the following metadata are introduced:

  • QualityRating: a value between 0 and 100.
  • RatingRangeDescriptor: the value represents the average response time, calculated for the source with respect to a specific data request, and normalized to the timeout value (metadata timeout) after which the source (metadata source) is considered not available (see previous section).
  • RatingMethodDescriptor: the same as for Responsiveness, but considered on a single source.

An example of an XML file with source responsiveness metadata, related to a Residence Address data item, is shown in Figure 12.
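The timeout-normalized average can be sketched as below. The measurement values are invented; under the assumption of a 1-minute timeout, an average response time of 0.12 minutes would yield a rating of 12, consistent with the Value in Figure 12.

```python
# Sketch of the responsiveness rating: the average measured response time is
# normalized to the timeout, so 0 means immediate replies and 100 means
# replies arriving right at the timeout (beyond which the source counts as
# unavailable).

def responsiveness_rating(response_times, timeout):
    """response_times and timeout expressed in the same unit (e.g. minutes)."""
    avg = sum(response_times) / len(response_times)
    return round(100 * min(avg, timeout) / timeout)
```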


8 Metadata for Process Dimensions

8.1 Currency

Currency is a quality dimension that does not have a standard definition in the literature. A common definition of currency is "the time interval between the latest update of a data value and the time it is used" [2]. In [7], currency does not represent a time interval but is a measure of the degree to which data are up-to-date. Data are out-of-date if they are incorrect at time t′ but were correct at time t < t′. This perspective emphasizes that data values change over time, but it gives no particular formula to estimate currency.

Ballou [1] defines an important framework in which he applies the rules for production control to the generation of information, providing an innovative contribution by proposing metrics for various time parameters. In this framework, currency measures how old data are and is calculated as the difference between the time when data are stored in the system and the time when users use them (delivery time), plus the age of data, which expresses how old data were when they were stored in the system:

Currency = Age + (DeliveryTime − InputTime) (1)

In [9], currency is simply defined as the time data are stored in the system; as a consequence, currency is considered a time point as opposed to a time interval. Overall, two approaches to the definition of currency can be distinguished. The first typically associates currency with a time measure, either a time point or a time interval, and defines currency for individual data values. The second considers currency as the degree to which a data set is up-to-date. Within the DaQuinCIS project, two indicators of currency will be defined to account for both approaches.

For the time approach, the definition of currency proposed by Ballou is adopted, and currency will be measured as the interval between the time when data are recorded in the system and the time when they are used; accordingly, age will be considered null. It is necessary to introduce metadata that enable the measurement process, as discussed in the following. The metadata QualityRating can assume any real number. The metadata RatingRangeDescriptor specifies the scale used for measurement: for currency it is a time unit, indicating the time interval between update time and delivery time; for coherence, every time measure has to be expressed in the same time unit. RatingMethodDescriptor describes the criteria used to obtain the rating: using Ballou's formula, currency is defined as the difference between update time and delivery time. A further metadata to consider is Evaluation Time, which specifies the last update time and has to be compared with the time at which the user uses the data.

The second indicator considered in the project is Currency Level, which specifies the degree to which a data set is up-to-date. The metadata QualityRating is expressed as a percentage. The metadata RatingRangeDescriptor specifies the meaning of the values assumed by this dimension: the validity of data values is directly proportional to the value assumed by the dimension. RatingMethodDescriptor specifies that Currency Level is the ratio between the data that are up-to-date and the whole data set. Three metadata have to be introduced to calculate the Currency Level value: Evaluation Time specifies the last update time, Access frequencies describes how many times in a time interval users access and modify the data, and Update period specifies the time interval associated with realignment operations. An example of an XML file with Currency Level metadata, related to a Telephone Number, is shown in Figure 13.

Note that Redman's definition of currency level is particularly useful in the case of a CIS. In a CIS, data are redundant and possibly duplicated in different organizations. Even if some


...
<CurrencyLevel source="CitizensDB">
  <Field name="TelephoneNumber">
    <Value>10</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="rarely updated" MaxDef="frequently updated">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="UptodateDataCount" URLDescription="http://www.daquincis.org/uptodatedatacount.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</CurrencyLevel>
...

Figure 13: An example of currency level metadata

realignment operations are usually performed periodically, between subsequent realignments data can have different values in different organizations.
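The two currency indicators above can be sketched directly from their definitions. The numbers are illustrative; with 1 up-to-date item in a set of 10, the currency level would be 10%, consistent with the Value in Figure 13.

```python
# Sketch of the two DaQuinCIS currency indicators.

def currency(delivery_time, input_time, age=0.0):
    """Ballou's formula (1); within DaQuinCIS, Age is taken as null (zero).
    All times must share one unit, as required for coherence."""
    return age + (delivery_time - input_time)

def currency_level(up_to_date_count, total_count):
    """Redman-style level: percentage of up-to-date items in the data set."""
    return round(100 * up_to_date_count / total_count)
```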

8.2 Volatility

Volatility is not a standard dimension in the data quality literature; it is rarely introduced in research. A first definition is given by Ballou [1]: volatility refers to how long an item remains valid, and it is an intrinsic property unrelated to the data management process. Another definition is given in [2], where the volatility of information is a measure of information instability; it is correlated with the frequency of change of the data value.

In the DaQuinCIS project, volatility is a measure of how long data remain valid. The metadata QualityRating is a time unit and it can assume any real number. The metadata RatingRangeDescriptor is the scale used for measurement: a high value of this dimension corresponds to a low level of volatility and means that data remain valid for a long time interval; a low value corresponds to a high level of volatility and means that data change with high frequency. The metadata RatingMethodDescriptor specifies that volatility is an intrinsic dimension of data, so it has a static value associated with data at the moment of their creation. To evaluate the volatility value we use the method discussed in ??, where volatility is derived from the dynamics of expiration. To determine data validity it is necessary to define the metadata Evaluation Time, which specifies the last update time: the time interval between Evaluation Time and the time at which the user uses the data has to be evaluated, and data are considered valid if this interval is smaller than the volatility value. An example of an XML file with Volatility metadata, related to a Telephone Number, is shown in Figure 14.


...
<Volatility evaluationTime="100" measure="minute">
  <Field name="TelephoneNumber">
    <Value>500</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="data remain valid for a long time interval" MaxDef="data change with high frequency">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="ValidityEvaluation" URLDescription="http://www.daquincis.org/ValidityEvaluation.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</Volatility>
...

Figure 14: An example of volatility metadata

8.3 Timeliness

In [9] timeliness is defined as "the extent to which the age of the data is appropriate for the task at hand". This is not a unique definition of timeliness; usually it is a calculated measure. In both [2] and [1], timeliness is measured as a function of elementary variables, currency and volatility respectively. A formula is defined in [1], where timeliness is measured on a continuous scale from 0 to 1:

Timeliness = {max[(1 − Currency/Volatility), 0]}^s (2)

The exponent s is a parameter that controls the sensitivity of timeliness to the currency-volatility ratio. The metadata QualityRating specifies the level of adequacy of the received data; it is a parameter that can assume a value in the interval [0,1]. The metadata RatingRangeDescriptor specifies that if timeliness equals 0 the data are out-of-date, while a value different from 0 specifies that the information is up-to-date and is received by the user while the data are still valid. The metadata RatingMethodDescriptor specifies that a possible method to calculate timeliness consists in applying the formula defined by Ballou, in which the dimensions Currency and Volatility described in the previous paragraphs are used as metadata. An example of an XML file with Timeliness metadata, related to a Residence Address, is shown in Figure 15.
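Formula (2) translates directly into code; s=1 is an assumed default, since the report does not fix a value for the sensitivity exponent.

```python
# Sketch of Ballou's timeliness formula (2): the currency/volatility ratio is
# subtracted from 1, clipped at zero, and raised to the sensitivity exponent s.

def timeliness(currency, volatility, s=1.0):
    """Both currency and volatility in the same time unit; result in [0, 1]."""
    return max(1.0 - currency / volatility, 0.0) ** s
```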

8.4 Appropriate Amount of Data

In [9] appropriate amount of data is defined as "the extent to which the quantity or volume of available data is appropriate". Relative to a given process, the volume of the extracted data set can be considered sufficient or not. For appropriate amount of data, the metadata QualityRating is defined as a percentage. The metadata RatingRangeDescriptor describes the meaning of the values assumed by this dimension; low values correspond to a poor


...
<Timeliness>
  <Field name="ResidenceAddress">
    <Value>90</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="data is out-of-date" MaxDef="data is up-to-date">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="BallouFormula" URLDescription="http://www.daquincis.org/BallouFormula.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</Timeliness>
...

Figure 15: An example of timeliness metadata

quantity of data, while high values mean that the amount of data is acceptable for the user request. The metadata RatingMethodDescriptor describes the method used to calculate this dimension. A possible method is to compare the amount of data relative to a specific process with a benchmark value that specifies the appropriate amount of information to be received by the user. An example of an XML file with Appropriate Amount of Data metadata is shown in Figure 16.

8.5 Process Completeness

In [9] Process Completeness is defined as "the extent to which data are of sufficient breadth, depth, and scope for the task at hand". Considering a given process, it is necessary to evaluate data completeness relative to that process, with the aim of establishing which data are necessary to execute it. The metadata QualityRating assumes a value from 0 to 100. The metadata RatingRangeDescriptor specifies the meaning of the values associated with the dimension. As a reference, three levels can be considered: 0 means that data relative to the process are not complete, 50 means that data are sufficiently complete, and 100 means that all the data required by the process are present. The metadata RatingMethodDescriptor specifies the method used for the calculation of this dimension. Further, it is necessary to define additional metadata, such as Source Data, which contains the name of the data source from which data are extracted; this is important to evaluate the completeness of the source data and compare it with process completeness. Besides, it is necessary to evaluate the completeness of all the data units required by the process. For a single data unit, two metadata are necessary: Data Unit, which specifies the name and source of the data unit to be extracted, and Required, which specifies whether the data unit is important for the process or not. An example of an XML file with Process Completeness metadata, related to a Residence Address, is shown in Figure 17.


...
<AppropriateAmountOfData>
  <Data name="AllData">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="amount of data is not enough" MaxDef="amount of data is enough">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="AmountOfDataCheck" URLDescription="http://www.daquincis.org/AmountOfDataCheck.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Data>
</AppropriateAmountOfData>
...

Figure 16: An example of appropriate amount of data metadata

...
<ProcessCompleteness source="CitizensDB" required="true">
  <Field name="ResidenceAddress">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="data relatively to the process are not complete" MaxDef="there are all the data required by the process">
        <DataPoint value="50" description="data required by the process are enough"/>
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="DataRequiredCheck" URLDescription="http://www.daquincis.org/DataRequiredCheck.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</ProcessCompleteness>
...

Figure 17: An example of Process Completeness metadata


...
<ValueAdded>
  <UsefulIndex>NONSOCHEMETTERE</UsefulIndex>
  <Field name="ResidenceAddress">
    <Value>80</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="completely useless" MaxDef="completely useful">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="UsefulnessDegree" URLDescription="http://www.daquincis.org/UsefulnessDegree.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</ValueAdded>
...

Figure 18: An example of Value Added metadata

8.6 Value Added

In [9] Value Added is defined as "the extent to which data are beneficial and provide advantages from their use". The metadata QualityRating specifies the value added of a data item, which may be a number between 0 and 100. The metadata RatingRangeDescriptor specifies the scale used for measurement: value added can assume a value from 0, which specifies that the data are completely useless, to 100, which specifies that the data are useful for the user. The metadata RatingMethodDescriptor specifies the method used in the measurement process. To calculate the value added of data in a process, it is possible to consider objective or subjective measures. An objective measure can be expressed through the metadata UsefulnessIndex, a static dimension associated with data that represents a general measure of data usefulness. An example of a subjective measure is the metadata UserPerception, which specifies the value added that data have for the user. An example of an XML file with Value Added metadata, related to a Residence Address, is shown in Figure 18.

8.7 Access Security

In [9] Access Security is defined as "the extent to which access to data can be restricted and hence kept secure". Information security in a database includes four main aspects: secrecy, integrity, availability, and non-repudiation. Secrecy is related to the prevention and detection of improper disclosure of information. Integrity is related to the prevention and detection of improper modification of information. System availability is related to the prevention and detection of improper denial of access to the services provided by the system [3]. Non-repudiation guarantees that a datum is provided by a particular source, and the source cannot deny it. In the DaQuinCIS project, measuring Access Security is important for the generation of a Quality Certificate. This dimension has the primary purpose of giving users an indication of the level of security of the source data: it indicates how many security mechanisms are implemented in the organization from which the data come. The metadata QualityRating specifies the level of access security of a data item


and it is expressed as a percentage. The metadata RatingRangeDescriptor describes the meaning of the values associated with Access Security, which refer to the number of security mechanisms implemented in the source database: a high percentage means that most security mechanisms are implemented, while a low percentage means that few are. The metadata RatingMethodDescriptor specifies the method used in the measurement process. The measurement method has to consider the security mechanisms that source databases implement to ensure an adequate level of security. For example, one can consider mechanisms that satisfy the following database protection requirements [3]:

  • Protection from improper access: it consists of granting access to the database only to authorized users.

  • Protection from inference: it consists of guaranteeing information privacy; it has to be impossible to obtain confidential information from non-confidential data.

  • Integrity of the database: databases have to be protected from unauthorized accesses that could modify the contents of data.

  • Operational integrity of data: a requirement that ensures the logical consistency of data in a database during concurrent transactions.

  • Accountability and auditing: the possibility of recording all accesses to data, for both read and write operations.

  • User authentication: the necessity of identifying each database user uniquely.

  • Multilevel protection: a requirement composed of a set of requirements. Data contained in a database may be classified in different classes; each class has its own level of protection.

Each mechanism associated with a security requirement listed above can be considered as a metadata, with an associated metadata Importance Level that measures the importance attributed to that mechanism by the user. Clearly, user perception depends on the type of process in execution. An example of an XML file with Access Security metadata, related to a Residence Address, is shown in Figure 19.

8.8 History

This process dimension is not a standard dimension in the data quality literature, but it is important in a CIS for generating a quality certificate. In a process, it is fundamental to know which quality improvement operations have been performed on data, and how much the data have been improved. History can be measured on the basis of information on the degree of quality improvement obtained with each improvement operation. The list of improvement operations executed on data is recorded in the system; for each operation the following have to be stored: the type of operation, the execution date, and an aggregate indicator of how much data quality has improved. Consequently, the value of History can be obtained as an improvement percentage calculated over the list of improvement operations executed from the time instant at which data are stored in the system to the time instant at which data are used. The metadata QualityRating is a percentage. The metadata RatingRangeDescriptor specifies that the measure is an aggregate index defining the degree of quality improvement obtained with the improvement operations. The metadata RatingMethodDescriptor specifies that History is a percentage calculated from the list of improvement operations. For each improvement operation, the following metadata have to be recorded:


...
<AccessSecurity source="CitizenDB">
  <Field name="ResidenceAddress">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="source is completely insecure" MaxDef="source is secure">
        <DataPoint value="50" description="source is sufficiently secure"/>
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="SecurityMechanismsEvaluation" URLDescription="http://www.daquincis.org/SecurityMechanismsEvaluation.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</AccessSecurity>
...

Figure 19: An example of Access Security metadata

  • Type: it specifies the type of operation that improves data quality.
  • Date: a time point that specifies when the improvement operation was executed.
  • Data quality values improvement: the set of improvement percentages for every data quality dimension.

An example of an XML file with History metadata, related to a Surname, is shown in Figure 20.
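Aggregating the recorded operations into a single History percentage can be sketched as below. Averaging the per-dimension improvement percentages is one possible aggregate index, assumed here; the operation records are invented.

```python
# Sketch of a History value: the recorded operations (Type, Date, per-dimension
# improvement percentages) are averaged into one aggregate percentage.

operations = [
    {"Type": "deduplication",  "Date": "2002-03-01",
     "improvements": {"accuracy": 20, "completeness": 0}},
    {"Type": "normalization", "Date": "2002-04-15",
     "improvements": {"accuracy": 10, "completeness": 30}},
]

def history_rating(ops):
    values = [v for op in ops for v in op["improvements"].values()]
    return round(sum(values) / len(values)) if values else 0
```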

8.9 Cost

The cost dimension considered in the DaQuinCIS project is defined as the cost of non-quality: it is fundamental to evaluate how much the errors due to bad quality data cost. The metadata QualityRating is a number, while the metadata RatingRangeDescriptor specifies that this dimension is an indication of how much the bad quality of data costs; the cost values will be expressed in a specific currency. The metadata RatingMethodDescriptor specifies the measurement method used to calculate the dimension. A possible measurement method consists in identifying a set of possible errors that can affect data. Each type of error occurs with a certain frequency, representable through the metadata ErrorFrequency. It is also necessary to define a metadata CostError that specifies the average error cost per data unit. The formula below shows a possible use of this information to calculate the dimension rate:

Cost = Σ(i=1..N) CostError_i × ErrorFrequency_i

where

  • Cost = overall cost of all the errors that affected the data set in an observed period.


... <History>
  <Field name="Surname">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="no improvement" MaxDef="high improvement">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="ImprovementEvaluation"
                          URLDescription="http://www.daquincis.org/histCitizens.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</History> ...

Figure 20: An example of History metadata

  • CostError_i = cost of a particular type of error i on a single data unit.
  • ErrorFrequency_i = average number of errors belonging to the i-th category that occur in the observed period.
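The cost formula can be sketched directly; the per-category figures below are hypothetical:

```python
# Cost = sum over error categories i of CostError_i * ErrorFrequency_i
def not_quality_cost(cost_error, error_frequency):
    """cost_error[i]: average cost of one category-i error per data unit;
    error_frequency[i]: average number of category-i errors in the observed period."""
    assert len(cost_error) == len(error_frequency)
    return sum(c * f for c, f in zip(cost_error, error_frequency))

# Hypothetical figures: three error categories observed over one period
print(not_quality_cost([1.0, 2.5, 0.5], [30, 4, 100]))  # 90.0
```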

It is also possible to consider a good-quality cost, i.e., the implementation cost of the hardware and software architectures that ensure a good treatment and a good quality of data. The measures of architectural cost can be direct or indirect. A cost is direct if it is expressible in monetary terms and is directly imputable to the process; examples are the costs of data quality tools or the costs of the basic technology that ensures a good data treatment. A cost is indirect if it is a consequence of a data quality initiative; an example is the cost derived from the loss of performance and other critical states. An example of an XML file with Cost metadata, related to a Residence Address, is shown in Figure 21.

9 Relevant data quality dimensions for CIS

The data quality literature provides a thorough classification of data quality dimensions, even if there is no general agreement on the definition of most dimensions. Discrepancies are partly related to the contextual nature of quality, whose definition changes with the purpose of quality assurance within a specific information system. However, in CISs, organizations have to overcome their own contextual definition of quality and achieve a common definition of relevant quality dimensions. The definitions proposed in this Section are founded on a survey of the quality dimensions proposed in the literature over the past 10 years [8]. This study has selected six classifications developed in different contexts and has identified a basic set of data quality dimensions, including accuracy, completeness, consistency, timeliness, interpretability and accessibility, which represent the dimensions considered by the majority of the authors. Timeliness is considered together with the other time-related dimensions: currency and volatility. In the present approach we assume that all organizations cooperating by means of a CIS agree to measure and disclose the quality of their own data


... <Cost currency="euro" ErrorFrequency="30" measureef="days" CostError="1">
  <Field name="ResidenceAddress">
    <Value>2000</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="gratis" MaxDef="high cost">
      </RatingRangeDescriptor>
      <RatingMethodDescriptor>
        <MethodDescriptor name="ErrorCostDefinition"
                          URLDescription="http://www.daquincis.org/ErrorCostDefinition.xml"/>
      </RatingMethodDescriptor>
    </Rating>
  </Field>
</Cost> ...

Figure 21: An example of Cost metadata

through a Quality Certificate that is associated with data whenever they are exchanged. As a general observation, a Quality Certificate should be objective, as it is shared among multiple organizations. Consequently, it should include quantitative measures of the relevant quality dimensions. To build a Quality Certificate, it is necessary to identify a set of data quality dimensions that are considered relevant by all cooperating organizations. In CISs, in addition to measuring the internal quality of data, new dimensions are needed to provide a context for evaluating the quality of exchanged data in the framework of collaboration processes. A fundamental characteristic of a CIS is the dynamic composition of services. Each organization requests services, which might be provided by one or more providers. In order to evaluate the suitability of received information, we propose to add to the previously listed dimensions History, Cost, Security/Access Security, Relevance, and Reliability. According to the classification considered in the previous section, Table 9 orders the relevant dimensions on the basis of the category that includes them.

Category                  Dimension        Sub-dimension
Subject Dimensions        Interpretability
                          Accessibility
Object Dimensions         Accuracy
                          Completeness
                          Consistency
Architectural Dimensions  Availability
Process Dimensions        Relevance
                          Access Security
                          Timeliness       Currency, Volatility, Currency level
                          History
                          Cost

In the previous sections we have described all dimensions, but only the set of dimensions listed above will be evaluated in the DaQuinCIS project, with the metrics already described.


10 Time impact on data accuracy, completeness, and currency in redundant contexts

According to the architectural choices in a CIS, we can have different degrees of data redundancy and data misalignment that influence three fundamental data quality dimensions: accuracy, completeness and currency. In Section 3 accuracy is defined as a measure of the proximity of a data value v to some other value v' that is considered correct; it can be defined as the ratio between the number of correct values and the total number of values available from a given source. The incorrectness of a data value depends on multiple factors, such as syntax errors or representational ambiguities. In redundant contexts, delays in data updates represent a major cause of inaccuracy. Let us consider a data value v stored in two databases at t0 and changed to v' in one database at t1. The misalignment between the two databases involves the inaccuracy of the data value v: according to the definition of accuracy presented in the previous section, only part of the data are modified to v' according to the changes that have occurred in the real world; other data values have a lower proximity to real-world values and, thus, are less accurate. When we estimate the accuracy of a single source, it is necessary to consider the updates that occur in the whole system and that are not propagated to the analyzed local database. Let us refer to the accuracy of an operational database as LocalAccuracy and to the fraction of inaccurate data units due to misalignment problems in a local database as OutofdateInaccurate. The time-related measure of the accuracy of an operational database is then:

Accuracy = LocalAccuracy − OutofdateInaccurate  (3)

As regards completeness, it is associated with data values and is defined as the degree to which a specific database includes all the values corresponding to a complete representation of a given set of real-world events as database entities. According to this definition, in redundant contexts, a relevant aspect of completeness is the degree to which a database includes all new events within a given time interval. In this context, a single-database perspective is not sufficient to evaluate completeness. Databases are periodically realigned and a new event may occur between subsequent realignments. This event should be simultaneously registered in all databases to guarantee completeness, but the application generating the new event may update its local database and rely on realignments to propagate data changes to the other databases. Between an event and the following realignment, databases are incomplete, as they do not include a data representation of all events. The calculation of completeness on a local database therefore has to include a corrective factor that allows the completeness to be calculated exactly.
Let us refer to the completeness of an operational database as LocalCompleteness, which indicates the percentage to which a local database contains all the data unit values corresponding to a complete representation of real-world events. The ratio between the number of data units contained in the local database and LocalCompleteness estimates the number of data units that should be contained in the local database. In a global perspective, the completeness of a local database is a correct measure right after realignments, but it becomes erroneous during refresh periods. To account for this time dependence of LocalCompleteness, we define the OutofdateIncomplete parameter as the average number of data units added to the portion shared between two databases within a refresh period that have not yet been copied into the examined local database. This parameter has to be considered in a measure of the completeness of the local database between subsequent realignments and can be used to obtain a correct measure of the completeness of an operational database as:

Completeness = |localdatabase| / (OutofdateIncomplete + |localdatabase| / LocalCompleteness)  (4)

Finally, misalignments among databases have an impact on currency too. Currency is usually defined as the time interval between the time when data are updated and the time when
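A minimal sketch of the time-adjusted completeness measure, equation (4); all figures are hypothetical:

```python
def time_adjusted_completeness(local_db_size: int,
                               local_completeness: float,
                               outofdate_incomplete: float) -> float:
    """Equation (4): |localdatabase| divided by the estimated true number of
    data units, i.e. the units the local completeness fraction implies plus
    the units added elsewhere and not yet copied into the local database."""
    should_contain = local_db_size / local_completeness
    return local_db_size / (outofdate_incomplete + should_contain)

# 600 units stored locally, local completeness 0.75 -> 800 units implied;
# on average 200 more units appear elsewhere between realignments
print(time_adjusted_completeness(600, 0.75, 200))  # 0.6
```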


data are used. The only research contribution that does not define currency as a time interval defines it as a measure of the degree to which data are up to date. This definition is used to introduce a new data quality dimension named Currency level, as discussed in Section 3. The misalignment among databases implies that the date of last update associated with a data value can be more recent in the real system than in a single database where updates are not propagated. We can define a lost-currency factor to describe this time impact on the currency dimension. Currency for a single operational database will be referred to as LocalCurrency. If data contained in a database A and shared with another database B are updated in database B, they have a more recent update time than the one indicated in database A. Therefore, an overall measure of currency, called GlobalCurrency, can be calculated, and the currency lost by local database A can be calculated as:

LostCurrency = LocalCurrency − GlobalCurrency  (5)
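Equations (3) and (5) are both simple differences and can be sketched together; the figures are hypothetical (accuracy as a percentage, currency in days):

```python
def time_adjusted_accuracy(local_accuracy: float,
                           outofdate_inaccurate: float) -> float:
    # Equation (3): subtract the fraction of data units that are inaccurate
    # only because updates made elsewhere in the CIS were not yet propagated
    return local_accuracy - outofdate_inaccurate

def lost_currency(local_currency: float, global_currency: float) -> float:
    # Equation (5): currency lost by the local database w.r.t. the whole CIS
    return local_currency - global_currency

print(time_adjusted_accuracy(95.0, 10.0))  # 85.0
print(lost_currency(12.0, 4.0))            # 8.0
```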

10.1 Data Quality Certificate

Whenever mutual knowledge among the organizations participating in a CIS is not given in advance, new mechanisms are needed to ensure that mutual trust is established during cooperative process executions. In this work we consider that organizations cooperating in a CIS can be of two

  • Trusted organizations: data transmission occurs among organizations which trust each other in a network due to organizational reasons (e.g., homogeneous work groups in a departmental structure, or supply-chain relationships among organizations forming a virtual enterprise);
  • External organizations: data are transmitted among cooperating entities in general, possibly accessing external data sources.

Trust mainly regards two aspects: (i) the quality of the data being exchanged, and (ii) a secure environment for information exchange that protects sensitive information. We define an exchange unit as data:

  • transmitted from one e-service to another in the cooperative process
  • associated with quality data
  • transmitted according to security rules

The Data Quality Certificate, produced by the Quality Factory for each data item exchanged, represents the last two required pieces of information. It specifies not only the quality degree of the data in terms of the dimensions described above, but also ensures a secure communication of the exchanged data. As illustrated in Figure 22, which represents the overall exchange unit format, the Data Quality Certificate is composed of two main parts:

  • Quality Data: this section stores the value associated with each adopted quality dimension, excluding only the security aspects;
  • Security Data (described in more detail in this section): composed of sensitivity information, an X.509-compliant digital certificate, and a digital signature.

All data have to be exchanged according to the exchange unit format, in order to ensure that they can all be adequately validated by the receiving organization. Whereas the Quality Data are an XML version of the features described above, this section deals with the security aspects of the exchange unit that need to be addressed, namely (i) integrity, (ii) authentication and (iii) confidentiality. Sensitivity data denotes the level of confidentiality of the data being transferred and, according to this level, the information useful for its encryption. The confidentiality level can be assigned to


Figure 22: Exchange unit format

data according to standard security policies, e.g., using data labelling [3]. Depending on the relevance level of the exchanged data, confidentiality can be ensured at different granularity levels: we can encrypt (i) only the data package, (ii) also the quality data, (iii) no data parts, or (iv) any possible combination thereof. To cope with these possibilities, the sensitivity data comprises:

  • Confidentiality: for each component of the exchange unit (i.e., data, quality data), we define a boolean value (confidentiality flag) indicating whether the component is confidential or not.
  • Encryption method: indicates the asymmetric encryption algorithm (e.g., RSA) and the hash algorithm (e.g., SHA-1) to be used to generate the digital signature.
  • Session key: the key to be used to encrypt the relevant information using symmetric cryptography (e.g., TripleDES, AES-Rijndael) in order to improve transmission performance.

As regards integrity and authentication, they are provided by creating a secure and efficient transmission channel using the following components:
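The per-component confidentiality flags can be sketched as follows; the XOR cipher is only a stand-in for the real symmetric algorithm (TripleDES/AES), and all names and values are hypothetical:

```python
from itertools import cycle

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Stand-in for a real symmetric cipher: XOR with a repeated key.
    Illustrative only, NOT secure."""
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

def apply_confidentiality(exchange_unit: dict[str, bytes],
                          flags: dict[str, bool],
                          session_key: bytes) -> dict[str, bytes]:
    """Encrypt only the components whose confidentiality flag is set,
    mirroring the per-component flags of the sensitivity data."""
    return {name: xor_cipher(part, session_key) if flags.get(name) else part
            for name, part in exchange_unit.items()}

unit = {"data": b"Rossi, via Roma 1", "quality_data": b"<Accuracy>90</Accuracy>"}
flags = {"data": True, "quality_data": False}   # encrypt only the data package
key = b"hypothetical-session-key"
sealed = apply_confidentiality(unit, flags, key)
# XOR is its own inverse, so applying the cipher again restores the data
assert xor_cipher(sealed["data"], key) == unit["data"]
```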

  • the digital certificate, owned by the source organization, which provides the authentication of the data source;
  • the digital signature of both the listed components of the exchange unit and the digital certificate, which provides the integrity of the transmitted data.

The digital certificate is issued by a Certification Authority, basically according to the X.509 format. The digital signature is created according to the PKCS#7 specification, thus allowing the destination organization to verify the integrity of the data and of the digital certificate. By also signing the certificate, we guarantee the association between the data and its creator. Authentication can be weak or strong:

  • Weak authentication, required for trusted organizations, means that the data destination checks the signature of the data source using the public key of the data source, but trusts the certificate of the data source. The advantage is that data transmission is fast and reliable: trusted organizations know each other by means of a list of certificates (in an ad-hoc certificate repository); the integrity and reliability of such lists are under the responsibility of the Certification Authority.


  • Strong authentication, required for untrusted/external organizations, uses a Public Key Infrastructure (PKI) [ref], and specifically a certificate revocation list, in order to validate the certificate of the data source.

Finally, as regards confidentiality, data, quality data and history are encrypted using the session key included in the sensitivity information part, according to the values of the confidentiality flags. To avoid disclosure of the session key, it is encrypted by the data source using the public key of the destination organization.
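The weak/strong decision can be sketched with certificate fingerprints standing in for real X.509 validation; modelling the repository and revocation list as plain sets is an assumption for illustration, not the project's actual mechanism:

```python
import hashlib

def fingerprint(cert: bytes) -> str:
    # Certificate fingerprint used to look certificates up in the repository
    return hashlib.sha256(cert).hexdigest()

def authenticate(cert: bytes,
                 trusted_repository: set,
                 revocation_list: set) -> str:
    """Decide which authentication path applies to the data source."""
    fp = fingerprint(cert)
    if fp in trusted_repository:
        return "weak"    # trusted organization: accept the certificate as-is
    if fp in revocation_list:
        raise ValueError("certificate revoked")
    # external organization: full PKI validation would happen here
    return "strong"

partner_cert = b"--hypothetical DER-encoded certificate--"
repo = {fingerprint(partner_cert)}
print(authenticate(partner_cert, repo, set()))  # weak
```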

References

[1] Ballou D.P., Wang R., Pazer H.L., Tayi G.K., Modelling Information Manufacturing Systems to Determine Information Product Quality. Management Science, vol. 44, no. 4, April 1998.

[2] Bovee M., Mak B., Srivastava R.P., A Conceptual Framework and Belief-Function Approach to Assessing Overall Information Quality. Proceedings of the Sixth International Conference on Information Quality, 2001.

[3] Castano S., Fugini M., Martella G., Samarati P., Database Security. Addison-Wesley Publishing Company, 1994.

[4] Deutsch A., Fernandez M., Florescu D., Levy A., Suciu D., XML-QL: A Query Language for XML. In Proceedings of the 8th International World Wide Web Conference (WWW8), Toronto, Canada, 1999.

[5] Cattell R.G.G., Barry D.K., The Object Database Standard: ODMG 2.0. Morgan Kaufmann Publishers, 1997.

[6] Pernici B., Scannapieco M., Data Quality in Web Information Systems. In Proceedings of the International Conference on Conceptual Modeling (ER 2002), Tampere, Finland, 2002.

[7] Redman T.C., Data Quality for the Information Age. Artech House, 1996.

[8] Scannapieco M., Catarci T., Data Quality under a Computer Science Perspective. "Archivi Computer" (Italian journal), 2002. To be published.

[9] Wang R.Y., Strong D.M., Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, vol. 12, no. 4, 1996.



Appendix A: XSD Schema

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:d2q="http://www.daquindicisstuff.org">

  <xs:simpleType name="valueType0-100">
    <xs:restriction base="xs:float">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="valueType0-Max">
    <xs:restriction base="xs:float">
      <xs:minInclusive value="0"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:complexType name="DataPointType0-100">
    <xs:attribute name="value" type="valueType0-100"/>
    <xs:attribute name="description" type="xs:string"/>
  </xs:complexType>

  <xs:complexType name="DataPointType0-Max">
    <xs:attribute name="value" type="valueType0-Max"/>
    <xs:attribute name="description" type="xs:string"/>
  </xs:complexType>

  <xs:complexType name="RatingRangeDescriptor0-100">
    <xs:sequence>
      <xs:element name="DataPoint" type="DataPointType0-100"
                  minOccurs="2" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="MinDef" type="xs:string" use="required"/>
    <xs:attribute name="MaxDef" type="xs:string" use="required"/>
  </xs:complexType>

  <xs:complexType name="RatingRangeDescriptor0-Max">
    <xs:sequence>
      <xs:element name="DataPoint" type="DataPointType0-Max"
                  minOccurs="2" maxOccurs="unbounded"/>
    </xs:sequence>
    <xs:attribute name="MinDef" type="xs:string" use="required"/>
    <xs:attribute name="MaxDef" type="xs:string" use="required"/>
  </xs:complexType>

  <xs:element name="MethodDescriptor">
    <xs:complexType>
      <xs:sequence>
        <xs:any namespace="##other" minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="name" type="xs:string" use="required"/>
      <xs:attribute name="URLDescription" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="RatingMethodDescriptor">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="MethodDescriptor"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:complexType name="Rating0-100">
    <xs:sequence>
      <xs:element name="Value" type="valueType0-100"/>
      <xs:element name="Rating">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="RatingRangeDescriptor" type="RatingRangeDescriptor0-100"/>
            <xs:element ref="RatingMethodDescriptor"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="name"/>
  </xs:complexType>

  <xs:complexType name="Rating0-Max">
    <xs:sequence>
      <xs:element name="Value" type="valueType0-Max"/>
      <xs:element name="Rating">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="RatingRangeDescriptor" type="RatingRangeDescriptor0-Max"/>
            <xs:element ref="RatingMethodDescriptor"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:sequence>
    <xs:attribute name="name"/>
  </xs:complexType>

  <xs:element name="Accuracy">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="SourceReliability">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="RepresentationalConsistency">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="InternalConsistency">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Completeness">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="ProcessCompleteness">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="source" type="xs:string" use="required"/>
      <xs:attribute name="required" type="xs:boolean" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="SourceAvailability">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="source" type="xs:string" use="required"/>
      <xs:attribute name="timeout" type="xs:integer"/>
      <xs:attribute name="measureto" type="xs:string"/>
      <xs:attribute name="interval" type="xs:integer"/>
      <xs:attribute name="measurei" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="Responsiveness">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="source" type="xs:string" use="required"/>
      <xs:attribute name="timeout" type="xs:integer"/>
      <xs:attribute name="measure" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="Interpretability">
    <xs:complexType>
      <xs:sequence>
        <xs:choice>
          <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
          <xs:element name="Data" type="Rating0-100"/>
        </xs:choice>
      </xs:sequence>
      <xs:attribute name="facet" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="Understandability">
    <xs:complexType>
      <xs:sequence>
        <xs:choice>
          <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
          <xs:element name="Data" type="Rating0-100"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Availability">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Conciseness">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Currency">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-Max" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="CurrencyLevel">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="source" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="Volatility">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-Max" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="evaluationTime" type="xs:float" use="required"/>
      <xs:attribute name="measureet" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="Timeliness">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="AppropriateAmountOfData">
    <xs:complexType>
      <xs:sequence>
        <xs:choice>
          <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
          <xs:element name="Data" type="Rating0-100"/>
        </xs:choice>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="ValueAdded">
    <xs:complexType>
      <xs:sequence>
        <xs:choice>
          <xs:element name="UsefulIndex" type="xs:string"/>
          <xs:element name="UserPerception" type="xs:string"/>
        </xs:choice>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="AccessSecurity">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="source" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="History">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-100" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <xs:element name="Cost">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Field" type="Rating0-Max" maxOccurs="unbounded"/>
      </xs:sequence>
      <xs:attribute name="currency" type="xs:string" use="required"/>
      <xs:attribute name="ErrorFrequency" type="xs:float" use="required"/>
      <xs:attribute name="CostError" type="xs:float" use="required"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="definitions">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="Accuracy"/>
        <xs:element ref="SourceReliability"/>
        <xs:element ref="RepresentationalConsistency"/>
        <xs:element ref="InternalConsistency"/>
        <xs:element ref="Completeness"/>
        <xs:element ref="ProcessCompleteness"/>
        <xs:element ref="SourceAvailability" maxOccurs="unbounded"/>
        <xs:element ref="Responsiveness" maxOccurs="unbounded"/>
        <xs:element ref="Interpretability" maxOccurs="unbounded"/>
        <xs:element ref="Understandability"/>
        <xs:element ref="Availability"/>
        <xs:element ref="Conciseness"/>
        <xs:element ref="Currency"/>
        <xs:element ref="CurrencyLevel"/>
        <xs:element ref="Volatility"/>
        <xs:element ref="Timeliness"/>
        <xs:element ref="AppropriateAmountOfData"/>
        <xs:element ref="ValueAdded"/>
        <xs:element ref="AccessSecurity"/>
        <xs:element ref="History"/>
        <xs:element ref="Cost"/>
      </xs:sequence>
      <xs:attribute name="source" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
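As a quick sanity check of the metadata format, the AccessSecurity fragment of Figure 19 can be parsed with Python's standard library and its rating checked against the [0, 100] range that valueType0-100 imposes. This is only a sketch (the fragment is trimmed of its RatingMethodDescriptor, and full XSD validation would need a schema-aware library):

```python
import xml.etree.ElementTree as ET

# Trimmed AccessSecurity fragment from Figure 19
doc = """
<AccessSecurity source="CitizenDB">
  <Field name="ResidenceAddress">
    <Value>100</Value>
    <Rating>
      <RatingRangeDescriptor MinDef="source is completely insecure"
                             MaxDef="source is secure">
        <DataPoint value="50" description="source is sufficiently secure"/>
      </RatingRangeDescriptor>
    </Rating>
  </Field>
</AccessSecurity>
"""

root = ET.fromstring(doc)
for field_el in root.findall("Field"):
    value = float(field_el.findtext("Value"))
    # valueType0-100 in the schema restricts ratings to [0, 100]
    assert 0 <= value <= 100, f"{field_el.get('name')} rating out of range"
    print(field_el.get("name"), value)
```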
