 
Data Quality Challenges
● ACM JDIQ EiC
● Open Knowledge Networks (Biomedicine)
● Data Science for Finance (DSfin)
Louiqa Raschid
Smith School of Business, Computer Science, and UMIACS
Technical Review of Data Quality
● Provenance, cleaning, annotation
● Data cleaning infrastructure and tools:
  ○ Robust first generation.
  ○ Big data and scalability.
  ○ Human-in-the-loop (HumInt).
● Process
  ○ Fitness to task.
  ○ Understanding workflows.
First Gen methodologies and products
Scenarios
● Lung cancer data (primary) generated by clinicians:
  ○ Patient entity identification in clinical notes, e.g., JM, J.M., etc. (cleaning)
  ○ Scale: Barthel 4
  ○ Stages: P0T0, Stage4, etc. (annotation)
● Analytics over (secondary) sources:
  ○ Drug-induced liver injury (DILI): phenotype includes elevated levels of liver enzymes, etc.
  ○ (fitness to task; HumInt): There are many causes for elevated liver enzymes, including transplants, some infants, etc.
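The patient-mention cleaning step can be illustrated with a minimal sketch. It assumes mentions are plain initials and that variants differ only in punctuation, spacing, and case. The function names are hypothetical, and real clinical notes would need surrounding context to tell apart patients who share initials.

```python
import re


def normalize_initials(mention: str) -> str:
    """Collapse variants such as 'JM', 'J.M.' and 'j. m.' into one key."""
    # Drop every character that is not a letter, then upper-case the rest.
    return re.sub(r"[^A-Za-z]", "", mention).upper()


def group_mentions(mentions):
    """Group raw mentions by their normalized key, keeping input order."""
    groups = {}
    for mention in mentions:
        groups.setdefault(normalize_initials(mention), []).append(mention)
    return groups


groups = group_mentions(["JM", "J.M.", "j. m.", "A.B."])
```

Grouping by a normalized key only proposes candidate clusters; a human-in-the-loop review would still decide whether two mentions are the same patient.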
Scenarios
● Privacy-preserving data mining:
  ○ Entity linkage in the de-identified space.
  ○ Different entries contribute hashed identifiers but they may be missing a variety of fields. (provenance; fitness to task)
● iASiS SEMANTIC Data Cleaning / Annotation Pipeline
● Finding patterns in OKN: DILI Case Study
  ○ (Provenance; fitness to task; HumInt; Annotation.)
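Linkage over hashed identifiers with missing fields might look like the sketch below. It is not the method used in the talk: the shared salt, the field names, and the rule "match when at least two fields are present on both sides and all of them agree" are assumptions made for illustration.

```python
import hashlib

SALT = b"shared-secret"  # hypothetical salt agreed on by all contributors


def hash_field(value):
    """Hash one field after light normalization; missing values stay None."""
    if value is None:
        return None
    normalized = value.strip().lower().encode("utf-8")
    return hashlib.sha256(SALT + normalized).hexdigest()


def deidentify(record):
    """Replace every field value with its salted hash."""
    return {field: hash_field(value) for field, value in record.items()}


def link_score(a, b):
    """Return (agreeing fields, fields present in both records)."""
    shared = [f for f in a if f in b and a[f] is not None and b[f] is not None]
    agree = sum(1 for f in shared if a[f] == b[f])
    return agree, len(shared)


def is_match(a, b, min_shared=2):
    """Assumed rule: enough shared fields, and all shared fields agree."""
    agree, shared = link_score(a, b)
    return shared >= min_shared and agree == shared


r1 = deidentify({"name": "Jane Doe", "dob": "1970-01-01", "zip": None})
r2 = deidentify({"name": "jane doe ", "dob": "1970-01-01", "zip": "20742"})
r3 = deidentify({"name": "John Roe", "dob": "1970-01-01", "zip": "20742"})
```

Exact hashes tolerate no typos, so the fewer fields two records share, the weaker the evidence for a link; the `min_shared` threshold makes that fitness-to-task trade-off explicit.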
iASiS
DILI Case Study
● Given a knowledge graph and a DILI phenotype (keywords) ...
● Create profiles, e.g., [Phenotype | Drug | Gene | Pathway]
● Rank the drugs most at risk for DILI.
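The profile-and-rank steps above can be sketched on a toy graph. Everything here is an assumption for illustration: the triples use placeholder names rather than real biomedical facts, the predicate names are hypothetical, and ranking by profile count stands in for whatever scoring the case study actually used.

```python
from collections import Counter

# Toy knowledge graph as (subject, predicate, object) triples.
# All entity and predicate names are placeholders, not real data.
TRIPLES = [
    ("drug_A", "targets", "gene_1"),
    ("drug_A", "targets", "gene_2"),
    ("drug_B", "targets", "gene_2"),
    ("drug_C", "targets", "gene_3"),
    ("gene_1", "in_pathway", "pathway_X"),
    ("gene_2", "in_pathway", "pathway_Y"),
    ("gene_3", "in_pathway", "pathway_Z"),
    ("gene_1", "associated_with", "elevated liver enzymes"),
    ("gene_2", "associated_with", "elevated liver enzymes"),
    ("gene_3", "associated_with", "headache"),
]


def objects(subject, predicate):
    """All objects linked to a subject by the given predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def build_profiles(keywords):
    """Build (phenotype, drug, gene, pathway) tuples matching any keyword."""
    profiles = []
    for drug, predicate, gene in TRIPLES:
        if predicate != "targets":
            continue
        for phenotype in objects(gene, "associated_with"):
            if not any(keyword in phenotype for keyword in keywords):
                continue
            for pathway in objects(gene, "in_pathway"):
                profiles.append((phenotype, drug, gene, pathway))
    return profiles


def rank_drugs(profiles):
    """Rank drugs by how many matching profiles they appear in."""
    return Counter(drug for _, drug, _, _ in profiles).most_common()


profiles = build_profiles(["liver enzymes"])
ranking = rank_drugs(profiles)
```

A count-based ranking ignores provenance and confounders such as other causes of elevated liver enzymes, which is where fitness to task and human review come in.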
Tamr: Understanding Workflows
Lessons learned
● First generation tools work well.
● Next generation needs to focus on processes and workflows and HumInt.
● Scientists still spend huge amounts of time on cleaning. How can we fix this problem?
● Is Open Knowledge Networks a solution?
● An unexpected case study ...