Data Quality Challenges





SLIDE 1

Data Quality Challenges

  • ACM JDIQ EiC
  • Open Knowledge Networks (Biomedicine)
  • Data Science for Finance (DSfin)

Louiqa Raschid

Smith School of Business, Computer Science, and UMIACS

SLIDE 2

Technical Review of Data Quality

  • Provenance, Cleaning, Annotation

  • Data cleaning infrastructure and tools:

    ○ Robust first generation.
    ○ Big data and scalability.
    ○ Human-in-the-loop (HumInt).

  • Process

    ○ Fitness to task.
    ○ Understanding workflows.

SLIDE 3

First-generation methodologies and products

SLIDE 4
SLIDE 5
SLIDE 6
SLIDE 7
SLIDE 8

Technical Review of Data Quality

  • Provenance, Cleaning, Annotation

  • Data cleaning infrastructure and tools:

    ○ Robust first generation.
    ○ Big data and scalability.
    ○ Human-in-the-loop (HumInt).

  • Process

    ○ Fitness to task.
    ○ Understanding workflows.

SLIDE 9

Scenarios

  • Lung cancer data (primary) generated by clinicians:

    ○ Patient entity identification in clinical notes, e.g., JM, J.M., etc. (cleaning)
    ○ Scale: Barthel 4
    ○ Stages: P0T0, Stage4, etc. (annotation)

  • Analytics over (secondary) sources:

    ○ Drug-induced liver injury (DILI): the phenotype includes elevated levels of liver enzymes, etc.
    ○ (Fitness to task; HumInt) Elevated liver enzymes have many causes, including transplants, and they also occur in some infants, etc.
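
The patient-identification item above (JM vs. J.M.) can be illustrated with a minimal sketch. It assumes mentions are initials-style strings and that stripping punctuation and case is enough to group them; the function names are hypothetical, and real clinical notes would need more context, since two patients can share initials.

```python
import re
from collections import defaultdict


def normalize_initials(mention: str) -> str:
    """Reduce an initials-style mention to its uppercase letters only."""
    return re.sub(r"[^A-Za-z]", "", mention).upper()


def group_mentions(mentions):
    """Group raw mentions that share the same normalized key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize_initials(mention)].append(mention)
    return dict(groups)


# Illustrative input only; not data from the presentation.
mentions = ["JM", "J.M.", "j m", "J. M.", "KL"]
groups = group_mentions(mentions)
```

Each group is only a candidate cluster. A human-in-the-loop step would still have to confirm that the grouped mentions refer to the same patient.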

SLIDE 10

Scenarios

  • Privacy preserving data mining:

    ○ Entity linkage in the de-identified space.
    ○ Different entries contribute hashed identifiers but they may be missing a variety of fields. (provenance; fitness to task)

  • iASiS SEMANTIC Data Cleaning / Annotation Pipeline
  • Finding patterns in OKN: DILI Case Study

○ (Provenance; fitness to task; HumInt; Annotation.)
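
A minimal sketch of entity linkage over hashed identifiers with missing fields, as in the privacy-preserving scenario above. The field names, the shared salt, and the "at least two shared fields" rule are assumptions made for illustration, not the method used in the presentation.

```python
import hashlib

SALT = "shared-secret"  # hypothetical salt agreed on by all contributing sites


def hash_value(value):
    """Hash one field value; a missing field (None) stays missing."""
    if value is None:
        return None
    cleaned = value.strip().lower()
    return hashlib.sha256((SALT + cleaned).encode("utf-8")).hexdigest()


def deidentify(record):
    """Replace every field of a record with its salted hash."""
    return {field: hash_value(value) for field, value in record.items()}


def compare(a, b, min_shared=2):
    """Compare two de-identified records on the fields both provide."""
    shared = [f for f in a if f in b and a[f] is not None and b[f] is not None]
    if len(shared) < min_shared:
        return "insufficient"  # too few common fields to decide
    return "match" if all(a[f] == b[f] for f in shared) else "non-match"


# Illustrative records only.
site_a = deidentify({"name": "Jane Doe", "dob": "1970-01-01", "zip": None})
site_b = deidentify({"name": "jane doe ", "dob": "1970-01-01", "zip": "20742"})
site_c = deidentify({"name": "Jane Doe", "dob": None, "zip": None})
site_d = deidentify({"name": "John Roe", "dob": "1970-01-01", "zip": "20742"})
```

The "insufficient" outcome is where provenance and fitness to task matter: whether a single shared field is enough depends on which source contributed the record and on what the linked data will be used for.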

SLIDE 11

iASiS

SLIDE 12

DILI Case Study

  • Given a knowledge graph and a DILI phenotype (keywords)

...

  • Create profiles, e.g., [Phenotype | Drug | Gene | Pathway]
  • Rank the drugs most at risk for DILI.
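
The profile-and-rank idea above can be sketched on a toy graph. Everything here is a placeholder: the edge list, the relation names, and the "count matching profiles" score are assumptions, and the steps elided on the slide are not reconstructed.

```python
from collections import Counter

# Toy knowledge graph as (source, relation, target) edges; all names are made up.
EDGES = [
    ("DrugA", "targets", "Gene1"),
    ("DrugA", "targets", "Gene2"),
    ("DrugB", "targets", "Gene2"),
    ("DrugC", "targets", "Gene3"),
    ("Gene1", "in_pathway", "PathwayX"),
    ("Gene2", "in_pathway", "PathwayX"),
    ("Gene2", "in_pathway", "PathwayY"),
    ("Gene3", "in_pathway", "PathwayZ"),
    ("PathwayX", "associated_with", "elevated liver enzymes"),
    ("PathwayY", "associated_with", "elevated liver enzymes"),
    ("PathwayZ", "associated_with", "hair loss"),
]


def neighbors(node, relation):
    """Targets reachable from node over one edge of the given relation."""
    return [t for s, r, t in EDGES if s == node and r == relation]


def build_profiles(keywords):
    """Collect (phenotype, drug, gene, pathway) paths whose phenotype matches a keyword."""
    profiles = []
    drugs = sorted({s for s, r, _ in EDGES if r == "targets"})
    for drug in drugs:
        for gene in neighbors(drug, "targets"):
            for pathway in neighbors(gene, "in_pathway"):
                for phenotype in neighbors(pathway, "associated_with"):
                    if any(k in phenotype for k in keywords):
                        profiles.append((phenotype, drug, gene, pathway))
    return profiles


def rank_drugs(profiles):
    """Rank drugs by how many matching profiles they appear in (assumed score)."""
    return Counter(drug for _, drug, _, _ in profiles).most_common()


profiles = build_profiles(["liver"])
ranking = rank_drugs(profiles)
```

Counting profiles is only one possible score. A ranking that is fit for the task would also need to weigh the provenance of each edge and have an expert review the keyword matches.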
SLIDE 13

DILI Case Study

SLIDE 14

DILI Case Study

SLIDE 15
SLIDE 16

Tamr: Understanding Workflows

SLIDE 17

Tamr: Understanding Workflows

SLIDE 18

Tamr: Understanding Workflows

SLIDE 19

Lessons learned

  • First generation tools work well.
  • Next generation needs to focus on processes, workflows, and HumInt.

  • Scientists still spend huge amounts of time on cleaning.

How can we fix this problem?

  • Is Open Knowledge Networks a solution?
  • An unexpected case study ...