
Detecting Duplicates in Geoinformatics: From Intervals and Fuzzy Numbers to General Multi-D Uncertainty



  1. Detecting Duplicates in Geoinformatics: From Intervals and Fuzzy Numbers to General Multi-D Uncertainty
     Scott A. Starks, Luc Longpré, Roberto Araiza, Vladik Kreinovich
     University of Texas at El Paso, El Paso, Texas 79968, USA
     sstarks@utep.edu, vladik@utep.edu
     Hung T. Nguyen
     New Mexico State University, Las Cruces, New Mexico 88003, USA
     Contents: Outline; Geospatial Databases: . . .; Gravity . . .; Duplicates: Where . . .; Why duplicates Are a . . .; Duplicates and . . .; Duplicates . . .; Duplicates Are Not . . .; From Interval to Fuzzy . . .; What We Did in Our . . .; Formalization of the . . .; New Algorithm: . . .; Possibility of . . .; Acknowledgments

  2. 1. Outline
     • Fact: geospatial databases often contain duplicate records.
     • What are duplicates: two or more close records representing the same measurement result.
     • Problem: how to detect and delete duplicates.
     • Test case: measurements of anomalies in the Earth’s gravity field that we have compiled.
     • Previously analyzed case: closeness of two points $(x_1, y_1)$ and $(x_2, y_2)$ is described as closeness of both coordinates.
     • What was known: an $O(n \cdot \log(n))$ duplicate-deletion algorithm for this case.
     • New result: we extend this algorithm to the case when closeness is described by an arbitrary metric.

  3. 2. Geospatial Databases: General Description
     • Fact: researchers and practitioners have collected a large amount of geospatial data.
     • Examples: at different geographical points $(x, y)$, geophysicists measure values $d$ of:
       – the gravity fields,
       – the magnetic fields,
       – elevation,
       – reflectivity of electromagnetic energy for a broad range of wavelengths (visible, infrared, and radar).
     • How this data is stored: corresponding records $(x_i, y_i, d_i)$ are stored in a large geospatial database.
     • How this data is used: based on these measurements, geophysicists generate maps and images and derive geophysical models that fit these measurements.

  4. 3. Gravity Measurements: Case Study
     • Typical geophysical data (e.g., remote sensing images):
       – mainly reflect the conditions of the Earth’s surface;
       – cover a reasonably local area.
     • Gravity measurements:
       – gravitation comes from the whole Earth, including deep zones;
       – gravity measurements cover broad areas.
     • Conclusion: gravity measurements are one of the most important sources of information about subsurface structure and physical conditions.

  5. 4. Duplicates: Where They Come From
     • Fact: the existing geospatial databases contain many duplicate points.
     • Reason:
       – databases are rarely formed completely “from scratch”;
       – they are usually built by combining measurements from previous databases;
       – some measurements are represented in several of the combined databases.
     • Conclusion: after combining databases, we get duplicate records.

  6. 5. Why Duplicates Are a Problem
     • Main reason: duplicate values can corrupt the results of statistical data processing and analysis.
     • Example:
       – when we see several measurement results confirming each other,
       – we may get an erroneous impression that this measurement result is more reliable than it actually is.
     • Conclusion: detecting and eliminating duplicates is an important part of assuring and improving the quality of geospatial data.

  7. 6. Duplicates and Related Uncertainty
     • Ideal case: measurement results are simply stored in their original form.
     • In this case: duplicates are identical records, easy to detect and to delete.
     • In reality: databases use different formats and units.
     • Example: the latitude can be stored in degrees (as 32.1345) or in degrees, minutes, and seconds.
     • As a result: when a record $(x_i, y_i, d_i)$ is placed in a database, it is transformed into this database’s format.
     • Fact: transformations are approximate.
     • Result: records representing the same measurement in different formats get transformed into values which correspond to close but not identical points: $(x_i, y_i) \neq (x_j, y_j)$.
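
The effect of an approximate format conversion can be illustrated with a small Python sketch. It assumes, purely for illustration, a hypothetical database format that stores latitude as degrees, minutes, and whole seconds; the function names `decimal_to_dms` and `dms_to_decimal` are not from the presentation.

```python
def decimal_to_dms(deg):
    """Convert decimal degrees to (sign, degrees, minutes, whole seconds).

    Assumption: the target format keeps only whole seconds, so the
    conversion rounds and loses precision.
    """
    sign = -1 if deg < 0 else 1
    total_seconds = round(abs(deg) * 3600)  # rounding step: information is lost here
    d, rem = divmod(total_seconds, 3600)
    m, s = divmod(rem, 60)
    return sign, d, m, s


def dms_to_decimal(sign, d, m, s):
    """Convert (sign, degrees, minutes, seconds) back to decimal degrees."""
    return sign * (d + m / 60 + s / 3600)


lat = 32.1345                       # the latitude value used on the slide
dms = decimal_to_dms(lat)           # stored in the second format
round_trip = dms_to_decimal(*dms)   # read back as decimal degrees

print(dms)
print(round_trip, lat, round_trip == lat)
```

Under these assumptions the value read back is close to, but not identical with, the original latitude, so an exact-equality test would not recognize the two records as duplicates.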

  8. 7. Duplicates Corresponding to Interval Uncertainty
     Geophysicists produce a threshold $\varepsilon > 0$ such that $\varepsilon$-close points $(x_i, y_i)$ and $(x_j, y_j)$ are duplicates.
     [Figure: a point with the threshold $\varepsilon$ marked in the horizontal and vertical directions.]
     In other words, if a new point $(x_j, y_j)$ is within a 2D interval $[x_i - \varepsilon, x_i + \varepsilon] \times [y_i - \varepsilon, y_i + \varepsilon]$ centered at one of the existing points $(x_i, y_i)$, then this new point is a duplicate.
     [Figure: an existing point and a new point, with $\varepsilon$ marked in the horizontal and vertical directions around each.]
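
The interval criterion above can be written as a short Python check. This is a minimal sketch; the function name `is_eps_close` and the representation of points as `(x, y)` tuples are assumptions made for illustration.

```python
def is_eps_close(p, q, eps):
    """Return True if q lies in the 2D interval
    [p.x - eps, p.x + eps] x [p.y - eps, p.y + eps].

    Both coordinates must be within eps; this is a square (box) test,
    not a Euclidean-distance test.
    """
    return abs(p[0] - q[0]) <= eps and abs(p[1] - q[1]) <= eps


existing = (0.0, 0.0)
print(is_eps_close(existing, (0.5, -0.5), 0.5))   # inside the box
print(is_eps_close(existing, (0.75, 0.0), 0.5))   # x differs by more than eps
```

Note that a point near a corner of the box counts as a duplicate even when its Euclidean distance from the center exceeds the threshold, since each coordinate is compared separately.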

  9. 8. Duplicates Are Not Easy to Detect and Delete
     • Problem: detect and delete duplicates.
     • How this is done now: “by hand”, by a professional geophysicist looking at the raw measurement results (and at the preliminary results of processing these raw data).
     • Limitations: time-consuming.
     • Natural idea: use a computer to compare every record with every other record.
     • Analysis: this idea requires $n(n-1)/2 \sim n^2/2$ comparisons.
     • Limitation: this is impossible for large databases, with $n \approx 10^6$ records.
     • Conclusion: faster algorithms are needed.
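
The "compare every record with every other record" idea, and its cost, can be sketched in Python as follows. This is the naive approach described on the slide, not the faster algorithm the presentation refers to; the function names and the sample points are illustrative assumptions.

```python
from itertools import combinations


def brute_force_duplicates(points, eps):
    """Compare every pair of points once; return (duplicate index pairs, comparison count)."""
    pairs = []
    comparisons = 0
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        comparisons += 1
        # coordinate-wise closeness test with threshold eps
        if abs(p[0] - q[0]) <= eps and abs(p[1] - q[1]) <= eps:
            pairs.append((i, j))
    return pairs, comparisons


def comparison_count(n):
    """Number of pairwise comparisons for n records: n(n-1)/2."""
    return n * (n - 1) // 2


sample = [(0.0, 0.0), (0.05, 0.05), (1.0, 1.0), (1.0, 1.05)]  # made-up test data
print(brute_force_duplicates(sample, 0.1))
print(comparison_count(10**6))
```

For $n = 10^6$ the count is 499,999,500,000, roughly $5 \cdot 10^{11}$ comparisons, which is the quadratic growth the slide identifies as impractical.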

  10. 9. From Interval to Fuzzy Uncertainty
     • Typical situation: geophysicists provide several possible threshold values $\varepsilon_1 < \varepsilon_2 < \ldots < \varepsilon_m$ that correspond to decreasing levels of their certainty:
       – if two measurements are $\varepsilon_1$-close, we are 100% certain that they are duplicates;
       – if two measurements are $\varepsilon_2$-close, then with some degree of certainty, we can claim them to be duplicates, etc.
     • Objectives:
       – eliminate certain duplicates, and
       – mark possible duplicates (about which we are not 100% certain) with the corresponding degree of certainty.
     • Reduction to interval case: we need to solve the interval problem for several different values of $\varepsilon_i$.
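
One way to picture the reduction to several interval problems is to find, for a given pair of points, the smallest threshold under which the pair counts as close. The Python sketch below does this for a single pair; the function name `closeness_level`, the threshold values, and the choice to report a level index rather than a numeric degree of certainty are all assumptions, since the slides do not give concrete values.

```python
def closeness_level(p, q, thresholds):
    """Return the smallest 1-based index k such that p and q are eps_k-close
    (both coordinates within eps_k), or None if they are not close for any threshold.

    `thresholds` must be sorted in increasing order: eps_1 < eps_2 < ... < eps_m.
    Level 1 corresponds to certain duplicates; higher levels to less certain ones.
    """
    # largest coordinate difference decides coordinate-wise closeness
    gap = max(abs(p[0] - q[0]), abs(p[1] - q[1]))
    for k, eps in enumerate(thresholds, start=1):
        if gap <= eps:
            return k
    return None


thresholds = [0.25, 0.5, 1.0]  # hypothetical eps_1 < eps_2 < eps_3
print(closeness_level((0.0, 0.0), (0.125, 0.0), thresholds))  # certain duplicate
print(closeness_level((0.0, 0.0), (0.5, 0.25), thresholds))   # possible duplicate
print(closeness_level((0.0, 0.0), (2.0, 0.0), thresholds))    # not a duplicate
```

Pairs at level 1 would be eliminated as certain duplicates, while pairs at higher levels would only be marked, each with the degree of certainty the geophysicists attach to that threshold.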
