Data-Intensive Workflows: A journey to a Holistic Framework for Data-Intensive Workflows

SLIDE 1

Data-Intensive Workflows

A journey to a Holistic Framework for Data-Intensive Workflows

INFORMATION MANAGEMENT AND TECHNOLOGY (IMT)

Ian Corner – Design and Implementation Lead – May 2016

CSIRO – Data-Intensive Workflows – Holistic Framework for Data-Intensive Workflows – Ian Corner 2016

SLIDE 2

CSIRO – Who we are

Commonwealth Scientific and Industrial Research Organisation

SLIDE 3

CSIRO – Our Mission

Strategy 2020 – Australia’s Innovation Catalyst

SLIDE 4

CSIRO – What we do

SLIDE 5

CSIRO – Our Collections

Commonwealth Scientific and Industrial Research Organisation

Australian National Insect Collection: 12,000,000 specimens (+100,000 per year)
Australian National Fish Collection: 5,000 species
Australian National Algae Culture Collection: 1,000 strains of more than 300 micro-algae species
Australian National Herbarium: 1,000,000 herbarium specimens (Captain Cook’s 1770 expedition to Australia)
Australian National Wildlife Collection: 200,000 irreplaceable specimens of wildlife

http://www.csiro.au/en/Research/Collections

SLIDE 6

CSIRO – Yesterday’s Collections

Physical collections, captured and preserved

http://www.csiro.au/en/Research/Collections/ANIC

SLIDE 7

CSIRO – Today’s Collections

We need collections that are digitised, discoverable and consumable

http://data.csiro.au/

SLIDE 8

CSIRO – Today’s Collections

Commonwealth Scientific and Industrial Research Organisation

RV Investigator is our state-of-the-art marine research vessel, supporting Australia’s atmospheric, oceanographic, biological and geosciences research from the tropical north to the Antarctic ice-edge.

http://www.csiro.au/en/Research/Facilities/Marine-National-Facility/RV-Investigator

SLIDE 9

Data-Intensive Workflows

Where we started

SLIDE 10

Data-Intensive Workflows

CSIRO started by asking what good is our data if it:
is unable to be found?
cannot speak?
only ever repeats the same story?
cannot repeat the same story twice?
speaks so slowly the message is lost?

As data growth and proliferation continued to outpace research-grade infrastructure, we considered a new approach.

SLIDE 11

Data-Intensive Workflows

Let's revisit the “monolithic approach”

SLIDE 12

Data-Intensive Workflows

We split the monolithic file systems into named and discoverable 'datasets.'

SLIDE 13

Data-Intensive Workflows

The 'dataset' approach delineated the 'responsibility' between infrastructure owners and dataset managers.

SLIDE 14

Data-Intensive Workflows

Within the dataset we developed 'categories' as a tool for data management.

SLIDE 15

Data-Intensive Workflows

Categories enabled mapping of each workflow to the technology of best fit.

SLIDE 16

Data-Intensive Workflows

Categories “kick-started” the discussion about workflows.

SLIDE 17

Data-Intensive Workflows

We established the ‘relationships’ between owners, domain specialists, users, consumers, and infrastructure.

SLIDE 18

Data-Intensive Workflows

As workflows matured, “science apps” evolved, enabling domain-specific datasets to be usable by non-domain consumers.

SLIDE 19

Data-Intensive Workflows – Science Applications

The Pyrotron - CSIRO National Bushfire Research Facility

http://www.csiro.au/en/Do-business/Services/Testing-and-technical-services/Enviro/Pyrotron

SLIDE 20

Data-Intensive Workflows – Science Applications

CSIRO – Workspace - Intuitive Workflow Development Tool

https://research.csiro.au/workspace/

SLIDE 21

Data-Intensive Workflows – Science Applications

CSIRO – SPARK – A wildfire simulation tool

SPARK – A wildfire simulation framework for researchers and experts in the disaster resilience field.

https://research.csiro.au/spark/

SLIDE 22

Data-Intensive Workflows

Our leading-edge researchers combined domain-specific workflows to produce higher-value layered products.

SLIDE 23

Data-Intensive Workflows

Our leading-edge researchers combined domain-specific workflows to produce higher-value layered products.

SLIDE 24

Data-Intensive Workflows

How we matured

SLIDE 25

Data-Intensive Workflows

Below the line, 'technology' is a consumable, replaceable, discardable commodity.

SLIDE 26

Data-Intensive Workflows

Below the line - the “fit for purpose” pool of generic infrastructure

SLIDE 27

Data-Intensive Workflows

CSIRO's value proposition is the “Workflow.”

SLIDE 28

Data-Intensive Workflows

Crossing the line, we deliver to the 'current' profile of the researcher's workflow.

SLIDE 29

Data-Intensive Workflows

Layers of abstraction enabled us to “scale up.”

SLIDE 30

Data-Intensive Workflows

Layers of abstraction enabled us to “scale up.”

SLIDE 31

Data-Intensive Workflows

Layers of abstraction enabled us to “scale out.”

SLIDE 32

Data-Intensive Workflows

Summary

SLIDE 33

Where we started

We came from a position where data, code and compute were isolated from one another by the approach to HPC infrastructure.

SLIDE 34

What we did – Brought Data to Life

We engineered a solution where data, code and compute are now directly connected.

SLIDE 35

High Value Information:

Discoverable, Assured, and Consumable.

SLIDE 36

Data-Intensive Workflows

CSIRO’s data-intensive workflows are a valuable source of information. How do we discover them, trust them and consume them?

SLIDE 37

Data-Intensive Workflows

CSIRO’s data-intensive workflows are a valuable source of information. How do we discover them, trust them and consume them?

METADATA

SLIDE 38

Data-Intensive Workflows

CSIRO’s data-intensive workflows are a valuable source of information. How do we discover them, trust them and consume them?

METADATA, PROVENANCE

SLIDE 39

Data-Intensive Workflows

CSIRO’s data-intensive workflows are a valuable source of information. How do we discover them, trust them and consume them?

METADATA, PROVENANCE, SEMANTIC WEB

SLIDE 40

Discoverable – Metadata

Metadata is a pathway to making data and workflows discoverable.

Let's look at Wikipedia: https://en.wikipedia.org/wiki/Metadata

Metadata is "data that provides information about other data". Two types of metadata exist: structural metadata and descriptive metadata. Structural metadata is data about the containers of data. Descriptive metadata uses individual instances of application data or the data content.
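
To make the structural/descriptive distinction concrete, here is a minimal Python sketch. It is an illustration only, not CSIRO's schema: the class names, field names and the example record are all assumptions. Structural metadata describes the container, descriptive metadata describes the content, and discovery searches the descriptive part.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructuralMetadata:
    """Describes the container: how the data is laid out, not what it means."""
    file_format: str
    size_bytes: int
    columns: List[str] = field(default_factory=list)


@dataclass
class DescriptiveMetadata:
    """Describes the content, so that the dataset can be found."""
    title: str
    keywords: List[str] = field(default_factory=list)


@dataclass
class DatasetRecord:
    """Hypothetical catalogue entry pairing both kinds of metadata."""
    name: str
    structural: StructuralMetadata
    descriptive: DescriptiveMetadata

    def matches(self, term: str) -> bool:
        """Case-insensitive search over the descriptive fields only."""
        term = term.lower()
        haystack = [self.descriptive.title, *self.descriptive.keywords]
        return any(term in text.lower() for text in haystack)


# Hypothetical example record; none of these values come from the slides.
record = DatasetRecord(
    name="example-dataset",
    structural=StructuralMetadata("csv", 2048, ["specimen_id", "collected_on"]),
    descriptive=DescriptiveMetadata("Insect specimen register", ["insects", "collection"]),
)
```

In this sketch a search for "insect" finds the record, while a search for "csv" does not, because the file format is structural rather than descriptive.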

SLIDE 41

Assured – Provenance

Dr Victoria Stodden at the CSIRO Computation Simulation Sciences and eResearch Annual Conference in Melbourne 2014

SLIDE 42

Consumable – Semantic Web

Pragmatic use of the web.

Let's look at Wikipedia: https://en.wikipedia.org/wiki/Semantic_Web

The Semantic Web is an extension of the Web through standards by the World Wide Web Consortium (W3C). The standards promote common data formats and exchange protocols on the Web, most fundamentally the Resource Description Framework (RDF).
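
RDF expresses every statement as a subject, predicate, object triple. The sketch below, in plain Python with no RDF library, serialises two such statements about a dataset in N-Triples syntax. The `example.org` URIs and the `ntriple` helper are hypothetical; the predicates are the `title` and `publisher` terms from the Dublin Core Terms vocabulary. Escaping is deliberately minimal, so treat this as a sketch rather than a complete serialiser.

```python
def ntriple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Format one RDF statement as a line of N-Triples.

    Subject and predicate are IRIs; the object is an IRI unless literal=True.
    Only backslash, double quote and newline are escaped in literals.
    """
    if literal:
        escaped = (
            obj.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        )
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


DCTERMS = "http://purl.org/dc/terms/"      # Dublin Core Terms namespace
dataset = "http://example.org/dataset/1"   # hypothetical dataset IRI

lines = [
    ntriple(dataset, DCTERMS + "title", "Example dataset", literal=True),
    ntriple(dataset, DCTERMS + "publisher", "http://example.org/org/csiro"),
]

if __name__ == "__main__":
    print("\n".join(lines))
```

Because both statements share the same subject IRI, any consumer that understands the vocabulary can merge them with statements published elsewhere about the same dataset.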

SLIDE 43

Linking Metadata, Provenance and Semantics

We need to link metadata, provenance and semantics in an automatic and extensible manner to increase our value.

SLIDE 44

Context Capture – The Future

Preserving Metadata
Establishing Provenance
Presenting via Semantic Web

SLIDE 45

Stripping Context – The Past

What are we currently losing?

SLIDE 46

Context Capture – The Future

Preserving the context of discrete events

SLIDE 47

Context Capture – The Future

Create a blank dataset

SLIDE 48

Context Capture – The Future

Create a blank dataset – Preserving metadata, establishing provenance, ...

SLIDE 49

Context Capture – The Future

Ingesting sensor data – Preserving metadata, establishing provenance, ...

SLIDE 50

Context Capture – The Future

Let's consider a real-world application - PlantScan

http://www.plantphenomics.org.au/services/plantscan/

PlantScan provides non-invasive analyses of plant structure (topology, surface orientation, number of leaves), morphology (leaf size, shape, colour, area, volume) and function by utilising cutting-edge information technology including high-resolution cameras and three-dimensional (3D) reconstruction software.

Plant surface mesh reconstruction
Morphological mesh segmentation
Accurate phenotypic data extraction
Longitudinal matching

SLIDE 51

Context Capture – The Future

PlantScan: Workflow with integrated metadata

EXAMPLE ONLY

SLIDE 52

Context Capture – The Future

Benefit 1 – Repeatable Analysis

Now there is no ‘simple’ fix to this issue. But if we issue the ‘birth certificate’ before the baby leaves the hospital, then we have at least improved our position. We reduce the size of the problem.
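
One way to read the ‘birth certificate’ idea is to record identifying context at the moment a file is created, before it leaves the workflow that produced it. The Python sketch below is an assumption about what such a record might hold (checksum, size, creator, workflow name, timestamp); the function names and fields are hypothetical and are not taken from the presentation.

```python
import hashlib
from datetime import datetime, timezone


def issue_birth_certificate(payload: bytes, created_by: str, workflow: str) -> dict:
    """Capture context for a newly created file at the moment of creation."""
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),  # fingerprint of the content
        "size_bytes": len(payload),
        "created_by": created_by,    # hypothetical field
        "workflow": workflow,        # hypothetical field
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


def verify(payload: bytes, certificate: dict) -> bool:
    """Check that the content still matches the fingerprint on its certificate."""
    return hashlib.sha256(payload).hexdigest() == certificate["sha256"]


if __name__ == "__main__":
    data = b"sensor reading 1\nsensor reading 2\n"
    cert = issue_birth_certificate(data, created_by="example-user", workflow="example-ingest")
    print(cert)
    print(verify(data, cert))         # unchanged content
    print(verify(data + b"x", cert))  # altered content
```

A later analysis can then be repeated against content whose identity and origin are known, instead of reconstructing that context after the fact.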

SLIDE 53

Context Capture – The Future

Benefit 2 – Quality Assurance

File_1 became File_2 using version 56 of Fred_1. Now, if File_1 had a calibration issue, or Fred_1 had an analysis bug, guess what? We reduce our problem more.
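
The following Python sketch shows how recorded provenance narrows the problem. Given a log of derivations, it finds everything downstream of a suspect input file, or everything produced by a suspect tool version. Only the first log entry (File_1, File_2, Fred_1 version 56) comes from the slide; the other entries, the `Derivation` record and both function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, NamedTuple, Set


class Derivation(NamedTuple):
    """One provenance record: `source` became `output` using `tool` at `tool_version`."""
    source: str
    output: str
    tool: str
    tool_version: int


def affected_by_file(derivations: List[Derivation], bad_file: str) -> Set[str]:
    """Return every file derived, directly or indirectly, from bad_file."""
    children: Dict[str, List[str]] = {}
    for d in derivations:
        children.setdefault(d.source, []).append(d.output)

    affected: Set[str] = set()
    queue = deque([bad_file])
    while queue:  # breadth-first walk down the derivation chain
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


def affected_by_tool(derivations: List[Derivation], tool: str, version: int) -> Set[str]:
    """Return outputs of the given tool version plus everything derived from them."""
    direct = {d.output for d in derivations
              if d.tool == tool and d.tool_version == version}
    result = set(direct)
    for name in direct:
        result |= affected_by_file(derivations, name)
    return result


LOG = [
    Derivation("File_1", "File_2", "Fred_1", 56),    # from the slide
    Derivation("File_2", "File_3", "Fred_2", 3),     # hypothetical
    Derivation("File_9", "File_10", "Fred_1", 55),   # hypothetical
]
```

With this log, a calibration issue in File_1 implicates File_2 and File_3 only, and a bug in version 56 of Fred_1 leaves the output of version 55 untouched.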

SLIDE 54

Context Capture – The Future

Other benefits

Benefit 3: Immediate Consumption

SLIDE 55

Context Capture – The Future

Other benefits

Benefit 3: Immediate Consumption
Benefit 4: Benchmarks

SLIDE 56

Context Capture – The Future

Other benefits

Benefit 3: Immediate Consumption
Benefit 4: Benchmarks
Benefit 5: Infrastructure Management

SLIDE 57

Context Capture – The Future

Other benefits

Benefit 3: Immediate Consumption
Benefit 4: Benchmarks
Benefit 5: Infrastructure Management
Benefit 6: Failures

SLIDE 58

Context Capture – The Future

Summary

An example from the past

In Unix everything is a file (pretty much). There is a set of well-written, simple tools which you can tie together in an ad-hoc way to produce high-value outcomes in a dynamic yet robust manner.

Moving to the future

Everything is a dataset. There is a set of published, well-proven and tested ‘research’ workflows which you can tie together in an ad-hoc way to produce high-value outcomes in a dynamic yet robust manner.
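
The Unix analogy can be sketched in Python: if every workflow takes a dataset and returns a dataset, workflows chain the way small tools chain in a pipe. Everything below is a hypothetical illustration (the `compose` helper, the two toy workflows and the dictionary-based dataset are assumptions), and each step appends its name to a history list so the result carries a record of how it was produced.

```python
from functools import reduce
from typing import Any, Callable, Dict

Dataset = Dict[str, Any]
Workflow = Callable[[Dataset], Dataset]


def compose(*steps: Workflow) -> Workflow:
    """Tie workflows together left to right, like tools in a Unix pipe."""
    def pipeline(dataset: Dataset) -> Dataset:
        return reduce(lambda current, step: step(current), steps, dataset)
    return pipeline


def drop_missing(ds: Dataset) -> Dataset:
    """Toy workflow: remove missing readings and note the step in the history."""
    values = [v for v in ds["values"] if v is not None]
    return {**ds, "values": values, "history": ds["history"] + ["drop_missing"]}


def summarise_mean(ds: Dataset) -> Dataset:
    """Toy workflow: add the mean of the readings and note the step in the history."""
    values = ds["values"]
    return {**ds, "mean": sum(values) / len(values),
            "history": ds["history"] + ["summarise_mean"]}


clean_then_summarise = compose(drop_missing, summarise_mean)
result = clean_then_summarise({"values": [1.0, None, 3.0], "history": []})

if __name__ == "__main__":
    print(result)
```

The shared dataset-in, dataset-out shape is what makes ad-hoc combination possible. Each workflow can be published and tested on its own, then reused in pipelines its author never anticipated.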

SLIDE 59

Summary

SLIDE 60

Context Capture – The Future

Summary

We made it to a place where data, code and compute are now tightly coupled, so researchers can focus on the workflow. As workflows proliferate, we want to make sure they exist in an ecosystem where they can be discovered, assessed and consumed.

SLIDE 61

INFORMATION MANAGEMENT AND TECHNOLOGY (IMT)

Thank You