Big Data, Big Problem Data-intensive systems are highly complex - - PowerPoint PPT Presentation



SLIDE 1

Towards Big Data Provenance

Daniel Deutch

Blavatnik School of Computer Science, Raymond and Beverly Sackler Faculty of Exact Sciences

Tel Aviv University

cartoon by T. Gregorius

SLIDE 2

Big Data, Big Problem

  • Data-intensive systems are highly complex

– Manipulate large-scale data in intricate ways
– Machine Learning, Data Mining systems are often black box, even black magic?

  • Error-prone

– Errors in input (measurements, crowd, text)
– Errors in processing (ambiguities, imperfect text understanding, “bugs”)

  • Inherent precision vs. recall tradeoff
  • Difficult to justify and explain (even correct) results
  • Essentially no one can know if the system has used only legitimate data

  • There’s a danger of these systems (being perceived as) getting out of control

SLIDE 3

Example: Leakage in Data Mining

  • “One concrete example we've seen occurred in a prostate cancer dataset. Hidden among hundreds of variables in the training data was a variable named PROSSURG. It turned out this represented whether the patient had received prostate surgery, an incredibly predictive but out-of-scope value.” (https://www.kaggle.com/wiki/Leakage)
  • “An account number feature used for predicting whether a potential customer would open an account at a bank.”

  • “An interviewer name feature, in a cellular company churn prediction problem.

– […] It turns out that a specific salesperson was assigned to take over cases where customers had already notified they intend to churn.”

“Leakage in Data Mining: Formulation, Detection, and Avoidance”, Kaufman, Rosset, Perlich, ACM Transactions on Knowledge Discovery from Data (TKDD) 6.4 (2012): 15
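
The leakage examples above share one symptom: a single feature predicts the target almost perfectly on its own. A minimal Python sketch of a screening heuristic for that symptom follows. It is not the detection method of the cited paper; the toy rows, the `AGE_BAND` and `CANCER` column names, and the 0.95 threshold are all assumptions made for illustration (only `PROSSURG` comes from the quoted example).

```python
from collections import Counter, defaultdict


def single_feature_accuracy(rows, feature, label):
    """Accuracy of predicting `label` from `feature` alone,
    using a majority vote per feature value (measured on the same rows)."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[label]] += 1
    # For each feature value, the majority label is the best single guess.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(rows)


def suspicious_features(rows, label, threshold=0.95):
    """Names of features that predict the label suspiciously well by themselves."""
    features = [name for name in rows[0] if name != label]
    return sorted(
        f for f in features
        if single_feature_accuracy(rows, f, label) >= threshold
    )


# Hypothetical toy data: PROSSURG mirrors the label, AGE_BAND does not.
rows = [
    {"AGE_BAND": "50s", "PROSSURG": 0, "CANCER": 0},
    {"AGE_BAND": "50s", "PROSSURG": 1, "CANCER": 1},
    {"AGE_BAND": "60s", "PROSSURG": 1, "CANCER": 1},
    {"AGE_BAND": "60s", "PROSSURG": 0, "CANCER": 0},
    {"AGE_BAND": "70s", "PROSSURG": 1, "CANCER": 1},
    {"AGE_BAND": "70s", "PROSSURG": 0, "CANCER": 0},
]

print(suspicious_features(rows, "CANCER"))
```

A flagged feature is only a candidate for review. High-cardinality columns such as an account number would also score highly under this check, so deciding whether a feature is legitimate still requires knowing where it came from.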

SLIDE 4

Provenance to the rescue

  • Tracking where data came from, how it was extracted, and how it was manipulated

  • Provenance leads to better applications and reliable data
  • Goal: seamless provenance tracking through tools

– Allow application owners to easily integrate provenance solutions
– With reasonable overhead in time and storage

  • Fundamental Challenge: Big data, even bigger provenance
  • Many modeling, algorithmic and implementation challenges
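
To make the idea of tracking where data came from concrete, here is a minimal Python sketch in which every value carries the set of input records it was derived from, and a join propagates that set. This is a simple lineage-style annotation chosen for illustration, not the provenance models of the papers cited in this talk; the `Annotated` class, the `load`, `join`, and `used_only` helpers, and the `clinic` and `billing` source names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    value: tuple
    sources: frozenset  # ids of the input records this value was derived from


def load(source_name, records):
    """Tag each input record with an id of the form '<source>#<position>'."""
    return [
        Annotated(tuple(rec), frozenset({f"{source_name}#{i}"}))
        for i, rec in enumerate(records)
    ]


def join(left, right, left_key, right_key):
    """Equi-join on the given column positions; output lineage is the union of both inputs'."""
    return [
        Annotated(l.value + r.value, l.sources | r.sources)
        for l in left
        for r in right
        if l.value[left_key] == r.value[right_key]
    ]


def used_only(results, allowed_sources):
    """True if every result was derived solely from records of the allowed sources."""
    return all(
        src.split("#")[0] in allowed_sources
        for res in results
        for src in res.sources
    )


# Hypothetical inputs: a clinical table and a billing table.
patients = load("clinic", [("p1", 64), ("p2", 71)])
surgeries = load("billing", [("p1", "PROSSURG")])
joined = join(patients, surgeries, 0, 0)

print(joined[0].sources)
print(used_only(joined, {"clinic"}))
```

With the annotations in place, the question of whether a result used only legitimate data becomes a lookup. The sketch also hints at the storage challenge: each output carries an annotation that grows with the number of inputs that contributed to it.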
SLIDE 5

What can we do? (Examples)

  • Develop provenance models for expressive languages

Bourhis, D., Moskovitch, Analyzing Data-Centric Applications: Why, What-if, and How-to, ICDE ‘16 (to appear)
D., Moskovitch, Tannen, A Provenance Framework for Data-Dependent Process Analysis, VLDB ‘14
D., Roy, Milo, Tannen, Provenance Circuits for Datalog, ICDT ‘14

  • Store only relevant parts of the provenance

D., Gilad, Moskovitch, Selective Provenance for Datalog Programs Using Top-K Queries, VLDB ‘15

  • Summarize the provenance

Ainy, Bourhis, Davidson, D., Milo, Approximated Summarization of Data Provenance, CIKM ‘15

  • Express it in Natural Language

D., Frost, Gilad, NLProv: Natural Language Provenance, Submitted

SLIDE 6

Thank you for listening

cartoon by T. Gregorius