
Data Provenance and Reproducibility
Tim Schmidt, Syeda Hiba Ahmad



  1. Data Provenance and Reproducibility
     Tim Schmidt, Syeda Hiba Ahmad
     23.06.2020 | Fachbereich Informatik | Software Engineering for Artificial Intelligence | Tim Schmidt, Syeda Hiba Ahmad

  2. Outline
     ● Reproducibility
       ▪ Importance
       ▪ Crisis
       ▪ In ML
       ▪ In Companies
       ▪ What to do about it?
     ● Data Provenance
       ▪ Importance
       ▪ Challenges
       ▪ Current Standards
       ▪ Different Approaches
       ▪ Provenance Taxonomy
       ▪ Examples
     ● Future Work

  3. Reproducibility
     “A measure of whether results can be attained by a different research team, using the same methods.”

  4. Importance of Reproducibility
     ● Shows that there are no confounding variables
       ○ Protects against fraud
       ○ Protects against human error

  5. Reproducibility Crisis
     A crisis of repeatability: “Of these 100 studies, just 68 reproductions provided [..] results that matched the original findings.”
     A crisis of description: Of 400 algorithms [..] He found that only 6% [..] shared the algorithm’s code. Only a third shared the data they tested their algorithms on, and just half shared “pseudocode”.

  6. Reproducibility in ML
     ● ML swaps heuristics for a black box to get better results
     ● Randomness between runs (need to fix meta-parameters)
     ● Need to store data
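The point about randomness between runs can be illustrated with a minimal sketch, assuming the random seed is one of the values that has to be fixed. The function `train_test_split` and its parameters are hypothetical names chosen for this example and are not taken from the slides or from any library.

```python
import random


def train_test_split(items, seed, test_fraction=0.25):
    """Shuffle with a private, seeded generator and split the result."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


data = list(range(20))
run_a = train_test_split(data, seed=42)
run_b = train_test_split(data, seed=42)

# Same data and same seed give the same split on every run.
print(run_a == run_b)
```

Passing the seed explicitly, instead of seeding a global generator, keeps the value visible so that it can be stored together with the other run parameters.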

  7. Reproducibility in Companies
     ● If a researcher drops out, somebody else should be able to step in
     ● Cope with changed requirements or platforms
       ○ Time saver in the long run
     ● Without it, trying different variations is a lot riskier

  8. What to do about it?
     Science:
     ● Provide all the info: code, data, description
     Companies:
     ● No wrong incentives
     ● Repetition of experiments
     ● Keep the team educated

  9. What to do about it?
     ML:
     ● Disable features that cause non-deterministic results
     ● Versioning of models
       ➔ (Jupyter Notebooks are a nightmare)
     ● Data versioning
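One common way to version data and run settings is to derive an identifier from the content itself. The sketch below assumes that approach; `content_version` and `run_record` are hypothetical helpers for illustration, not part of any tool named in the slides.

```python
import hashlib
import json


def content_version(payload: bytes) -> str:
    """Return a short identifier derived only from the bytes themselves."""
    return hashlib.sha256(payload).hexdigest()[:12]


def run_record(data: bytes, params: dict, code_version: str) -> dict:
    """Collect the versions that together pin down one training run."""
    # sort_keys makes the JSON text independent of dict insertion order
    params_bytes = json.dumps(params, sort_keys=True).encode("utf-8")
    return {
        "data_version": content_version(data),
        "params_version": content_version(params_bytes),
        "code_version": code_version,  # e.g. a commit id, supplied by the caller
    }


first = run_record(b"a,b\n1,2\n", {"lr": 0.1, "epochs": 5}, "abc123")
same = run_record(b"a,b\n1,2\n", {"epochs": 5, "lr": 0.1}, "abc123")
edited = run_record(b"a,b\n1,3\n", {"lr": 0.1, "epochs": 5}, "abc123")

print(first == same)                                   # identical inputs
print(first["data_version"] != edited["data_version"])  # data was changed
```

Because the identifier depends only on content, any change to the data produces a new version without anyone having to remember to bump a number.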

  10. Data Provenance
      “Data Provenance is the documentation of data in sufficient detail to allow reproducibility of a specific dataset.”

  11. Importance
      ● Compliance (e.g. GDPR, in German: DSGVO)
      ● Necessary if accused of fraud
      ● Prevents manual errors
      ● Changes in underlying database
      ● Data needs to be trustworthy
      ● Root cause analysis

  12. Challenges
      ● Large data sizes
      ● Provenance overhead
      ● Archiving (vs changes)

  13. Archiving (vs changes)
      [Figure; key: Added, Deleted, Modified]
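The three categories in the figure key (added, deleted, modified) can be computed between two archived snapshots. The following is a minimal sketch, assuming each snapshot is a plain mapping from record key to value; `diff_snapshots` is a hypothetical helper, not something shown on the slide.

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Classify record keys as added, deleted or modified between snapshots."""
    added = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    # keys present in both snapshots whose values differ
    modified = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "deleted": deleted, "modified": modified}


monday = {"a": 1, "b": 2, "c": 3}
tuesday = {"a": 1, "b": 20, "d": 4}

changes = diff_snapshots(monday, tuesday)
print(changes)
```

Storing only such a change set per archive step, rather than a full copy, is one way to limit the storage cost of archiving.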

  14. Current Standards
      “The PROV Family of Documents defines a model, corresponding serializations and other supporting definitions to enable the inter-operable interchange of provenance information in heterogeneous environments such as the Web.”
      https://www.w3.org/TR/prov-overview/
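The PROV model is built around entities, activities and agents, connected by relations such as `used`, `wasGeneratedBy`, `wasDerivedFrom` and `wasAssociatedWith`. The sketch below borrows only that vocabulary; the `ProvDocument` class, its methods and the file and agent names are hypothetical, and the structure is plain Python rather than any official PROV serialization.

```python
from dataclasses import dataclass, field


@dataclass
class ProvDocument:
    """Toy container for PROV-style statements (not a PROV serialization)."""
    entities: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    agents: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (relation, subject, object)

    def relate(self, relation: str, subject: str, obj: str) -> None:
        self.relations.append((relation, subject, obj))

    def sources_of(self, entity: str) -> list:
        """Follow wasDerivedFrom edges back to every upstream entity."""
        found, stack = [], [entity]
        while stack:
            current = stack.pop()
            for rel, subj, obj in self.relations:
                if rel == "wasDerivedFrom" and subj == current and obj not in found:
                    found.append(obj)
                    stack.append(obj)
        return found


doc = ProvDocument()
doc.entities |= {"raw.csv", "clean.csv", "model.bin"}
doc.activities |= {"cleaning", "training"}
doc.agents.add("alice")  # hypothetical agent

doc.relate("used", "cleaning", "raw.csv")
doc.relate("wasGeneratedBy", "clean.csv", "cleaning")
doc.relate("wasDerivedFrom", "clean.csv", "raw.csv")
doc.relate("used", "training", "clean.csv")
doc.relate("wasGeneratedBy", "model.bin", "training")
doc.relate("wasDerivedFrom", "model.bin", "clean.csv")
doc.relate("wasAssociatedWith", "training", "alice")

lineage = doc.sources_of("model.bin")
print(lineage)
```

A real exchange format would use one of the PROV serializations described in the linked overview; the point here is only how the relations form a graph that can be walked backwards.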

  15. Different Approaches
      ● Where-Provenance (Original Source), Why-Provenance (Contributing Source) and How-Provenance (Transformation)
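As a loose illustration of the three labels, a derived value can carry all three kinds of information along with it. This sketch follows the slide's informal reading (original source, contributing source, transformation) rather than the formal database definitions; `Annotated`, `total_above` and the file name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotated:
    value: float
    where: tuple  # original source(s) the input was read from
    why: tuple    # ids of the rows that contributed to the value
    how: str      # description of the transformation applied


def total_above(rows, threshold, source):
    """Sum the amounts above a threshold and record where/why/how."""
    contributing = [r for r in rows if r["amount"] > threshold]
    return Annotated(
        value=sum(r["amount"] for r in contributing),
        where=(source,),
        why=tuple(r["id"] for r in contributing),
        how=f"sum(amount) where amount > {threshold}",
    )


rows = [
    {"id": 1, "amount": 5},
    {"id": 2, "amount": 20},
    {"id": 3, "amount": 30},
]
result = total_above(rows, threshold=10, source="orders.csv")
print(result)
```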

  16. Provenance Taxonomy

  17. Use of Provenance
      ● Data Quality: Level of detail and error
      ● Audit Trail: Process with which data is produced
      ● Replication: Availability of similar sources
      ● Attribution: Pedigree (ownership)
      ● Informational: Metadata (descriptive)

  18. Subject of Provenance
      ● Data Oriented (Explicit) Model: Metadata from source data
      ● Process Oriented (Indirect): Metadata from process inputs and outputs
      ● Granularity: Level of detail

  19. Provenance Representation
      ● Annotation: Descriptions about source data and processes
      ● Inversion: Reverse-engineering queries
      ● Contents: Of annotation and inversion methods
      ● Syntactic Information: The form in which data is stored
      ● Semantic Information: The meaning given to the data

  20. Storing Provenance
      ● Tightly Coupled: Close relation with data
      ● Loosely Coupled: Slight relation with data
      ● Scalability: Growth of system
      ● Overhead: Management costs

  21. Provenance Dissemination

  22. Practical Examples
      ● DVC (for small projects)
        ○ Git extension for data
      ● Pachyderm (for bigger projects)
        ○ Runs on Kubernetes
      More information

  23. Data Version Control (DVC)
      ● Data Versioning: DVC File

  24. Data Pipelines & Experiments
      ● Link processing steps together
      ● Store versions and parameters
      ● Compare versions
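The three points above (linked steps, stored versions and parameters, comparison of versions) can be sketched generically. The code below does not use or imitate DVC's interface; `fingerprint`, `run_pipeline`, `changed_stages` and the two example stages are hypothetical names, and the data is assumed to be JSON-serializable.

```python
import hashlib
import json


def fingerprint(obj) -> str:
    """Short content hash of any JSON-serializable object."""
    text = json.dumps(obj, sort_keys=True)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:10]


def run_pipeline(data, params, stages):
    """Run stages in order and log a fingerprint after each one."""
    log = []
    for name, func in stages:
        data = func(data, params)
        log.append({
            "stage": name,
            "params": fingerprint(params),
            "output": fingerprint(data),
        })
    return data, log


def changed_stages(log_a, log_b):
    """Names of the stages whose output differs between two runs."""
    return [a["stage"] for a, b in zip(log_a, log_b) if a["output"] != b["output"]]


stages = [
    ("scale", lambda xs, p: [x * p["factor"] for x in xs]),
    ("clip", lambda xs, p: [min(x, p["limit"]) for x in xs]),
]

out_a, log_a = run_pipeline([1, 2, 3], {"factor": 2, "limit": 10}, stages)
out_b, log_b = run_pipeline([1, 2, 3], {"factor": 2, "limit": 5}, stages)

print(out_a, out_b)
print(changed_stages(log_a, log_b))
```

Comparing the per-stage logs shows which step first diverged between two experiments, which is usually the place to start when results differ.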

  25. Pachyderm
      ● Based on Kubernetes and Docker

  26. Pachyderm: Data Versioning
      [Figure: diagram with nodes Pre1, Net, Pre2]

  27. Pachyderm: Containerized Analysis
      [Figure: diagram with nodes Pre1, Net, Pre2]

  28. Pachyderm: Data Pipelines
      [Figure: diagram with nodes Pre1, Net, Pre2]

  29. Pachyderm: Scalable Stages
      [Figure: diagram with nodes Pre1, Net, Pre2]

  30. Pachyderm: Data Provenance
      [Figure: diagram with nodes Pre1, Net, Pre2]

  31. Google Dataset Search (GOODS)

  32. Google Dataset Search (GOODS)

  33. What is Next?
      ● How to rank datasets
      ● How to identify important datasets
      ● Handling missing metadata
      ● More work on data semantics
      ● Data citation
      ● Environment information
      ● Applications in ML, social media, blockchain, cybersecurity

  34. Questions?

  35. Thank you! :)
