From Example Data Using ILP Manuel Fink and Heiner Stuckenschmidt - - PowerPoint PPT Presentation

โ–ถ
from example data using ilp
SMART_READER_LITE
LIVE PREVIEW

From Example Data Using ILP Manuel Fink and Heiner Stuckenschmidt - - PowerPoint PPT Presentation

Schema Mapping Discovery From Example Data Using ILP Manuel Fink and Heiner Stuckenschmidt Data and Web Science Group University of Mannheim Problem Description Target DB Source DB Transformation ? f T = f(S) S Learn Algorithm Example


slide-1
SLIDE 1

Schema Mapping Discovery From Example Data Using ILP

Manuel Fink and Heiner Stuckenschmidt Data and Web Science Group University of Mannheim

slide-2
SLIDE 2

Problem Description

?

Source DB

T = f(S)

Transformation

Algorithm

Learn

Target DB

S f

slide-3
SLIDE 3

Vegetable ID, Name, NID โˆง NutrionalValue NID, E, P, C, F โ†’ VegetarianIngredient(ID, Name, E, C, F, P)

Example 1 (GAV Mappings)

S T

AnimalProduct ID, Name, ๐Ÿ๐ก๐ก, NID โˆง NutrionalValue NID, E, P, C, F โ†’ VegetarianIngredient(ID, Name, E, C, F, P)

slide-4
SLIDE 4

Problem In ILP Framework

Components โ–ช Background Knowledge B: Tuples from relations of source DB โ–ช (Constraints like Key Relationships, Column Data Types/Domain) โ–ช Positive Examples E+: Tuples from relation(s) in target DB โ–ช Negative Examples E-: Any Tuples not in relations of target DB (CWA) โ–ช Hypothesis H: Set of Schema Mapping Rules

B = {Vegetable(โ€ž017โ€œ, โ€žMushroom (white)โ€œ, โ€ž11โ€œ),

NutritionalValue(โ€ž11โ€œ, โ€ž16โ€œ, โ€ž2.7gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ}

E+ = {VegetableIngredient(โ€ž017โ€œ, โ€žMushroom (white)โ€œ, โ€ž16โ€œ, โ€ž0.6gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ} E- = {VegetableIngredient(โ€ž017โ€œ, โ€žBeef (minced)โ€œ, โ€ž208โ€œ, โ€ž20.5gโ€œ, โ€ž0gโ€œ, โ€ž14gโ€œ),

VegetableIngredient(โ€ž032โ€œ, โ€žMushroom (white)โ€œ, โ€ž16โ€œ, โ€ž0.6gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ}

slide-5
SLIDE 5

Problem In ILP Framework

Components โ–ช Background Knowledge B: Tuples from relations of source DB โ–ช Positive Examples E+: Tuples from relation(s) in target DB โ–ช Negative Examples E-: Any Tuples not in relations of target DB โ–ช Hypothesis H: Set of Schema Mapping Rules Requirements โœ“ Prior Satisfiability: ๐ถ โ‹€ ๐นโˆ’ โŠญ false โœ“ Prior Necessity: B โŠญ ๐น+ Solution Constraints โœ“ Posterior Sufficiency: ๐ถ โ‹€ H โŠจ ๐น+ (Produce all target tuples) โœ“ Posterior Satisfiability: ๐ถ โ‹€ H โ‹€ ๐นโˆ’ โŠญ false (Produce only target tuples)

B = {Vegetable(โ€ž017โ€œ, โ€žMushroom (white)โ€œ, โ€ž11โ€œ),

NutritionalValue(โ€ž11โ€œ, โ€ž16โ€œ, โ€ž2.7gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ}

E+ = {VegetableIngredient(โ€ž017โ€œ, โ€žMushroom (white)โ€œ, โ€ž16โ€œ, โ€ž0.6gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ} E- = {VegetableIngredient(โ€ž017โ€œ, โ€žBeef (minced)โ€œ, โ€ž208โ€œ, โ€ž20.5gโ€œ, โ€ž0gโ€œ, โ€ž14gโ€œ), VegetableIngredient(โ€ž032โ€œ,

โ€žMushroom (white)โ€œ, โ€ž16โ€œ, โ€ž0.6gโ€œ, โ€ž0.2gโ€œ, โ€ž2.7gโ€œ), โ€ฆ}

slide-6
SLIDE 6

Problem In ILP Framework

Components โ–ช Background Knowledge B: Tuples from relations of source DB โ–ช Positive Examples E+: Tuples from relation(s) in target DB โ–ช Negative Examples E-: Any Tuples not in relations of target DB โ–ช Hypothesis H: Set of Schema Mapping Rules Language Bias โ–ช GLAV (Global-And-Local-As-View) Schema Mappings (Source-to-Target Tuple-Generating Dependencies)

โˆ€เดฅ ๐‘ฆ ๐œ’ าง ๐‘ฆ โ†’ โˆƒเดค ๐‘ง ๐œ”( าง ๐‘ฆ, เดค ๐‘ง)

slide-7
SLIDE 7

Example 2 (GLAV Mapping)

S T

Person(ID,PName,CID1,EID) โˆง Employer(EID,EName,CID2 โˆง City(CID1,CName1,Loc1) โˆง City(CID2,CName2,Loc2) โ†’ โˆƒ CoID,GID1,GID2 : Commute(CoID,PName,GID1,GID2) โˆง GeoLocation(GID1,Loc1) โˆง GeoLocation(GID2,Loc2)

slide-8
SLIDE 8

Interesting Problem Dimensions (Future Work)

โ–ช Learn Mappings with Constants in Body:

โžขAnimalProduct ID, Name, ๐Ÿ๐ก๐ก, NID โˆง NutrionalValue NID, E, P, C, F โ†’ VegetarianIngredient(ID, Name, E, C, F, P)

โ–ช Learn GLAV (not GAV) Mappings:

โžขPerson(ID,PName,CID1,EID) โˆง Employer(EID,EName,CID2 โˆง City(CID1,CName1,Loc1) โˆง City(CID2,CName2,Loc2) โ†’ โˆƒ CoID,GID1,GID2 : Commute(CoID,PName,GID1,GID2) โˆง GeoLocation(GID1,Loc1) โˆง GeoLocation(GID2,Loc2)

โ–ช Learn Mappings with Functions:

โžขVegetable ID, Name, NID โˆง NutrionalValue NID, E, P, C, F โ†’ VegetarianIngredient ID, Name, ๐  ๐… , C, F, P , ๐‘”: ๐‘“๐‘œ๐‘“๐‘ ๐‘•๐‘ง โ†ฆ ๐‘“๐‘œ๐‘“๐‘ ๐‘•๐‘ง โˆ˜ โ€ฒ๐‘™๐‘‘๐‘๐‘šโ€ฒ

slide-9
SLIDE 9

Challenges and Aspects

โ–ช Scalability

โ–ช At least O(100) different Tables on both sides โ–ช Broad Tables/Predicates (e.g. 80 Columnsโ€ฆ) โ–ช Long Tables (GigaBytes of Data)

โ–ช Instance Ambiguity

โ–ช Instances are String/Number constants

โ–ช Language Bias

โ–ช Atypical for ILP

โ–ช Existential Quantifier in Head โ–ช Multiple Predicates in Head โ–ช Functions?

slide-10
SLIDE 10

Work So Far

Variation of Problem โ–ช Fixed Schema โžข โˆ€๐‘ฆ1 โ€ฆ ๐‘ฆ๐‘œ: ๐‘ˆ๐‘‡๐‘๐‘ฃ๐‘ ๐‘‘๐‘“(๐‘ฆ1, โ€ฆ , ๐‘ฆ๐‘œ ) โ†ฆ ๐‘ˆ๐‘ˆ๐‘๐‘ ๐‘•๐‘“๐‘ข (๐‘”

1 ๐‘ฆ1 , โ€ฆ , ๐‘” ๐‘œ(๐‘ฆ๐‘œ ))

โ–ช Transformation Functions on Values (Prefixing, Scaling, Trimming etc.)

key user flag 000 mfink B 001 apeter A 002 fschn X 003 cdehn C 004 hfink B key user flag 000 mfink B 001 apeter A 002 fschn X 003 cdehn C 004 hfink B key user flag 903 dehn C 904 fink B 901 peter A 900 fink B 902 schn X 1 2 3 4 1 2 3 4 key user flag 900 fink B 901 peter A 902 schn X 903 dehn C 904 fink B key user flag 000 mfink B 001 apeter A 002 fschn X 003 cdehn C 004 hfink B source target

slide-11
SLIDE 11

Work So Far

Variation of Problem โ–ช Fixed Schema โ–ช Offset Columns (Violation of First Normal-Form)

ID TABNAME TABKEY FLAG 38500 VBAP 8000000001908000001 V 42100 VBAP 8000000001908000003 V 1780 MARA 8000000054765 X 179 MARA 800000000000000000015 Y ID TABNAME TABKEY FLAG 385 VBAP 9047700001908000001 V 421 VBAP 9047700001908000003 V 178 MARA 90499000054675 X 179 MARA 904000000000000000015 Y

TableTarget TableSource Domain Rules: [MANDT] 800 -> 904 [VBELN] .{2}x -> 77y [KUNNR] .{2}x -> 99x Field Rules:

Table.ID x{0}* -> x

Offset Rules:

Table.TABKEY[TABNAME/FLAG] VBAP / V: MANDT[0-2]|VBELN[3-12] MARA / X: MANDT[0-2]|KUNNR[3-12] MARA / X: MANDT[0-2]

slide-12
SLIDE 12

Work So Far

Data Set โ–ช Artifical Transformation

โ–ช 2 different Transformations defined on 12 domains

โ–ช 168MB โ–ช 84 Tables with up to 28000 lines โ–ช Solvable in ~1 Minute

โ–ช 26 Domain rules

โ–ช 14 Key Mappings โ–ช 4 Fixed Value Transformations โ–ช 1 Trimming Transformation โ–ช 7 Mask Overlay Transformations

โ–ช 1 Field rule

โ–ช Disable domain rule

โ–ช 5 Offset Rules

โ–ช 1 Control Column โ–ช Up to 4 different Segmentations with up to 2 transformed domains

slide-13
SLIDE 13

Work So Far

Example Solution โ€“ Domain Rules:

KNUMB: IdTableTransformation KUNNR: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] TDOBNAME: FrontMaskTransformation.{NUMERIC} : xxy" -> "99y" ATINN: IdTableTransformation CLINT: IdTableTransformation ADRNR: IdTableTransformation SYBIN2: IdTableTransformation SYBIN1: FixedValueTransformation.{NUMERIC,TEXT} : x -> "0" USNAM: FixedValueTransformation.{NUMERIC,TEXT} : x -> "ANONYM" J_OBJNR: FrontMaskTransformation.{NUMERIC,TEXT} : xxxxy" -> "xx99yโ€ž AD_SO_KEY: IdTableTransformation FPLNR: IdTableTransformation

slide-14
SLIDE 14

Work So Far

Example Solution โ€“ Domain Rules (cont):

INT2: RearTrimTransformation.{NUMERIC,TEXT} : xxx[.]{3} -> xxx AD_ADDRNUM: IdTableTransformation VBELN: FrontMaskTransformation.{NUMERIC} : xxy" -> "99y" PARNR: IdTableTransformation CK_KALNR: IdTableTransformation PERNR: IdTableTransformation XUBNAME: FixedValueTransformation.{NUMERIC,TEXT} : x -> "ANONYM" MANDT: FixedValueTransformation.{NUMERIC,TEXT} : x -> "904" NA_OBJKEY: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] AD_PERSNUM: IdTableTransformation QMNUM: IdTableTransformation NA_PARNR: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000] KUNDE: IntegerAdditionTransformation.{NUMERIC} : x -> [number + 9900000000]

slide-15
SLIDE 15

Work So Far

Example Solution (Offset Rules) โ–ช CDHDR.OBJECTID: [OBJECTCLAS] -> OBJECTID: [DEBI] KUNNR[0-10) | UnchangesSegments(11,..) [VERKBELEG] VBELN[0-10) | UnchangesSegments(11,..) [BANK] MANDT[0-3) | UnchangesSegments(4,..) [MATERIAL] UnchangedSegments

slide-16
SLIDE 16

Questions / Ideas?