multi column substring matching for database schema
play

Multi-column Substring Matching For Database Schema Translation - PowerPoint PPT Presentation

Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Multi-column Substring Matching For Database Schema Translation Robert H. Warren 1 Dr. Frank Wm Tompa 1 1 David R. Cheriton School of Computer Science


  1. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Multi-column Substring Matching For Database Schema Translation Robert H. Warren 1 Dr. Frank Wm Tompa 1 1 David R. Cheriton School of Computer Science University of Waterloo Waterloo, Canada The 32nd Very Large Database Conference, 2006 Warren & Tompa Multi-column Substring Matching

  2. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Database Integration Objective A generalisable method capable of resolving complex schema matches and the translation required to convert the instance data using substrings concatenation. Example 1 leftmost characters of column lastname + 2 rightmost characters of column birthdate → column userid Name in database D → First + Last in database D ′ 2005/05/29 in database D → 05/29/2005 in database D ′ PartNumber in database D → Number + PlantId + 2 rightmost digits in Year. Warren & Tompa Multi-column Substring Matching

  3. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Database Integration Why is this an important problem? Issues: ...the number and size of databases growing. (+10,000 tables, +1,600 columns) ...integration is an every day issue. (Semantic web, smart clients, dynamic data sources...) ...multiple standards in use. (22 Locales) ...previously we have used top-down approaches. Here we use a data-driven, bottom up approach. ⇒ Need automation to deal with this problem. Warren & Tompa Multi-column Substring Matching

  4. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Database Integration Previous work Rahm and Bernstein present a good taxonomy and discussion of matching problem. [RB01b, RB01a] Basic support for concatenating complete columns in the CUPID system. [MBR01] Embley et al. made use of ontologies to discover such translations. [EXD04] To deal with complex cases, Doan et al. proposed “format learners”. [DDH01] The IMAP system makes use of specific matchers for mathematical relationships. [DLD + 04] ⇒ No known generalisable solution to high-cardinality ( n -to-1), substring concatenations schema translations. Warren & Tompa Multi-column Substring Matching

  5. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Requirements and concerns Automated, un-supervised and data driven. 1 Offload as much of the work to the databases. [KMS04] 2 Client side discovery process, bandwidth << database 3 contents. Part of a larger, automated, database integration system: 4 partial notion of what could/should be a match. “Entity” overlap between database tables unknown but 5 present. Warren & Tompa Multi-column Substring Matching

  6. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Problem formalization Definition For a given target database table T 2 with a target column A ...and a source table T 1 with a set of likely source columns ( B 1 , B 2 , ..., B n ) Find a transformation such that: A = ω 1 + ω 2 + · · · + ω ν Where ω i represents a substring of column B i Translation model t ′ = t [ β x 1 ... y 1 + β x 2 ... y 2 + · · · + β x ν ... y ν ] (chars x ν . . . y ν of col B ν ) 1 2 ν Warren & Tompa Multi-column Substring Matching

  7. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . h kerry robert nawisema . . . kyle s norma jlmalton . . . a wiseman norma rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . josh a alderman ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  8. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . h kerry robert nawisema . . . kyle s norma jlmalton . . . a wiseman norma rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . josh a alderman ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  9. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example (Source) (Target) Table 1 Table 2 How to infer translation? select substring(first from 1 for 1) || first middle last ??? Login substring(middle from 1 for 1) || last as login . . . h kerry robert nawisema into target_table from source_table. . . . kyle s norma jlmalton . . . a wiseman norma rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . josh a alderman ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  10. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example (Source) (Target) Table 1 Table 2 Solution: first middle last ??? Login Iteratively select substrings from “best-fit” . . . h kerry robert nawisema columns while performing a simple form of . . . kyle s norma jlmalton record linkage. . . . a wiseman norma rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . josh a alderman ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  11. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (1) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton . . . norma a wiseman rhkerry . . . . . . . . . . . . . . . . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  12. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton 1 bi- . . . norma a wiseman rhkerry gram . . . . . . . . . . . . . . . match . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  13. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton 4 bi- . . . norma a wiseman rhkerry gram . . . . . . . . . . . . . . . matches . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  14. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton 2 bi- . . . norma a wiseman rhkerry gram . . . . . . . . . . . . . . . matches . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  15. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 first middle last ??? Login . . . robert h kerry nawisema . . . kyle s norma jlmalton 4 bi- . . . norma a wiseman rhkerry gram . . . . . . . . . . . . . . . matches . . . amy l case alcase . . . a alderman josh ksokmoan . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

  16. Research Motivation Proposed Approach Experimental Results Algorithmic Analysis Conclusion Basic example - Find initial column. (2) (Source) (Target) Table 1 Table 2 Column Scoring Formula q first middle last ??? Login   t HitCount ( j ) ScoreCol = � . . . robert h kerry nawisema (1) t ∗ length ( key j )   . . . kyle s norma j = 1 jlmalton . . . norma a wiseman rhkerry Where: . . . . . . . . . . . . . . . t is the number of sampled values from source column. . . . amy l case alcase key j is the j -th sampled value. . . . a alderman josh ksokmoan HitCount is the number of q -gram matches for key j . . . . john l j malton j ksnormanj Warren & Tompa Multi-column Substring Matching

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend