The Data Cleaning Problem: Some Key Issues & Practical Approaches

Ronald K. Pearson, Daniel Baugh Institute for Functional Genomics and Computational Biology, Department of Pathology, Anatomy, and Cell Biology, Thomas Jefferson University


  1. The Data Cleaning Problem: Some Key Issues & Practical Approaches. Ronald K. Pearson, Daniel Baugh Institute for Functional Genomics and Computational Biology, Department of Pathology, Anatomy, and Cell Biology, Thomas Jefferson University, Philadelphia, PA. DIMACS Workshop on Data Quality, Data Cleaning and Treatment of Noisy Data, November 3-4, 2003.

  2. Topics
     1. Outliers: an important data anomaly
        - types and working assumptions
        - some real data examples
     2. Detecting outliers
        - the popular 3σ edit rule
        - order statistics vs. moments
        - some alternative approaches
     3. Other data anomalies
        - missing data
        - misalignments
        - noninformative variables
        - comparing performance

  3. Example 1: Outlier in a microarray data sequence. [Figure: dye-swap average of log2 intensity ratios, gene 263. x-axis: sample; y-axis: log2 intensity ratio; plot labels: Control, EtOH; one point is marked as an outlier.]

  4. Example 2: Influence of outliers on a volcano plot. [Figure: log2 expression change vs. t-test p-value, genes 201 to 300. x-axis: t-test p-value (log scale); y-axis: log2 expression change.]

  5. Example 3: Bivariate outlier in a simulated dataset. NOTE: The outlier is not extreme with respect to either x or y individually. [Figure: scatter plot of y(k) vs. x(k), both on a 0 to 1 scale, with one point marked as the outlier.]
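
The note on slide 5 can be illustrated with a small simulation. The sketch below is not taken from the slides: the data, the outlier position (0.8, 0.4), the noise level, and the helper names `z_scores` and `mahalanobis_2d` are all assumptions chosen for illustration, and the Mahalanobis distance is only one of several ways to expose a bivariate outlier. It shows a point whose x and y values each sit comfortably inside their marginal ranges, yet which lies far from the joint pattern of the data.

```python
import random
import statistics


def z_scores(values):
    """Standardize each value using the sample mean and standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def mahalanobis_2d(xs, ys):
    """Mahalanobis distance of each (x, y) point from the sample mean,
    using the 2x2 sample covariance matrix inverted in closed form."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        dists.append(d2 ** 0.5)
    return dists


# Hypothetical simulated data: y tracks x closely on the unit interval.
rng = random.Random(0)
xs = [rng.random() for _ in range(100)]
ys = [x + rng.gauss(0, 0.02) for x in xs]

# Hypothetical outlier: both coordinates are unremarkable on their own,
# but the pair violates the y ~ x relationship.
xs.append(0.8)
ys.append(0.4)

zx = z_scores(xs)
zy = z_scores(ys)
dist = mahalanobis_2d(xs, ys)

outlier_index = len(xs) - 1
most_distant = max(range(len(dist)), key=dist.__getitem__)

print("marginal z-scores of the outlier:", zx[outlier_index], zy[outlier_index])
print("Mahalanobis distance of the outlier:", dist[outlier_index])
print("index of the most distant point:", most_distant)
```

Checking each variable separately misses the point here, while a distance that accounts for the correlation between x and y singles it out. Note that this sketch estimates the mean and covariance from data that include the outlier, so these moment-based estimates are themselves affected by it.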
