Technologies, methods and challenges to effective public data sharing and aggregation



slide-1
SLIDE 1

Technologies, methods and challenges to effective public data sharing and aggregation

Mark D Wilkinson
Medical Genetics, UBC
PI Bioinformatics, Heart + Lung Institute at St. Paul’s Hospital
markw@illuminae.com
http://wilkinsonlab.ca

slide-2
SLIDE 2

Thanks in advance

(very few of these ideas are my own!)

Carole Goble – University of Manchester
Paul Gordon – Sun Center of Excellence, U Calgary
Charles Petrie – Stanford University

slide-3
SLIDE 3

We’ve come a long way!!

slide-4
SLIDE 4

In the beginning...

slide-5
SLIDE 5

Link Integration

slide-6
SLIDE 6
slide-7
SLIDE 7

Integration of What?

slide-8
SLIDE 8

Specially-formatted Files

EMBL Format

slide-9
SLIDE 9

Specially-formatted Files

GenBank Format

slide-10
SLIDE 10

Specially-formatted Files

FASTA Format

slide-11
SLIDE 11

Specially-formatted Files

GCG Format

slide-12
SLIDE 12

At least 20 different formats for representing DNA sequences...

Lord et al. 2004
slide-13
SLIDE 13

Many formats contained a wide variety of related, but different, information

DNA sequence
Sequence features
Translation
Date/time/method
Publication cross-references
...

slide-14
SLIDE 14

Each file format required its own parser...

slide-15
SLIDE 15

...and that problem wasn’t limited to DNA...

slide-16
SLIDE 16

XML

slide-17
SLIDE 17

What did XML do for us?

“...advent of XML meant that we didn’t have to write our own parsers anymore...”

Individual data elements in a file can be automatically located and extracted

slide-18
SLIDE 18

Predictable way to represent data
Makes it easier for machines to encode/extract
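A minimal sketch of what "automatically located and extracted" can look like in practice, using a generic XML parser from the Python standard library. The record and its element names are hypothetical and are not the real EMBL or GenBank XML layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element and attribute names are illustrative only.
RECORD = """
<entry accession="EXAMPLE0001">
  <description>BRCA1 example record</description>
  <feature type="CDS">
    <qualifier name="gene"><value>BRCA1</value></qualifier>
  </feature>
  <sequence>ATGGATTTATCTGCTCTTCG</sequence>
</entry>
"""


def extract_fields(xml_text):
    """Locate individual data elements without writing a format-specific parser."""
    root = ET.fromstring(xml_text.strip())
    return {
        "accession": root.get("accession"),
        "description": root.findtext("description"),
        "gene": root.findtext("feature/qualifier[@name='gene']/value"),
        "sequence": root.findtext("sequence"),
    }


fields = extract_fields(RECORD)
print(fields)
```

The parser is generic, but the paths passed to it are still specific to this one layout, which is the limitation the following slides turn to.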

slide-19
SLIDE 19

EMBL Record for BRCA1 In XML

slide-20
SLIDE 20

GenBank Record for BRCA1 In XML

slide-21
SLIDE 21

So now we can share and aggregate data!

slide-22
SLIDE 22

EMBL Record for BRCA1 In XML
GenBank Record for BRCA1 In XML

slide-23
SLIDE 23

So now we can share and aggregate data!

...because it isn’t (just) a parsing problem...

Various resources have various data models

slide-24
SLIDE 24

So... Let’s find a way to describe the data models!

slide-25
SLIDE 25

XML Schema

slide-26
SLIDE 26

XML Schema
There will be an element called “GBQualifier”
There will be a child element called “GBQualifier_name”
The content of that child element will be free-text
There will be a child element called “GBQualifier_value”
The content of that child element will be free-text

XML Schema
There will be an element called “qualifier”
It will have an attribute called “name”
The content of that attribute will be text
There will be a child element called “value”
The content of that child element will be free-text

slide-27
SLIDE 27

So now we can share and aggregate data!

slide-28
SLIDE 28

What did XML Schema do for us?

“...XML Schema (among other things) allowed us to ~automate the creation of (in-memory) Structures which could hold the given XML-formatted data...”

slide-29
SLIDE 29

Does not solve the integration or aggregation problem

slide-30
SLIDE 30

XML Schema
There will be an element called “GBQualifier”
There will be a child element called “GBQualifier_name”
The content of that child element will be free-text
There will be a child element called “GBQualifier_value”
The content of that child element will be free-text

XML Schema
There will be an element called “qualifier”
It will have an attribute called “name”
The content of that attribute will be text
There will be a child element called “value”
The content of that child element will be free-text

Because the “meaning” of each element is implicit, we resort to “Schema Mapping” to integrate the data
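A rough sketch of what “Schema Mapping” means for the two schemas described above: one hand-written mapping function per schema, each reducing its own structure to a shared (name, value) pair. The XML fragments and the function names are assumptions made up for illustration; only the element and attribute names come from the slide.

```python
import xml.etree.ElementTree as ET

# Illustrative fragments that follow the two schemas described on the slide.
GENBANK_STYLE = """
<GBQualifier>
  <GBQualifier_name>gene</GBQualifier_name>
  <GBQualifier_value>BRCA1</GBQualifier_value>
</GBQualifier>
"""

EMBL_STYLE = """
<qualifier name="gene">
  <value>BRCA1</value>
</qualifier>
"""


def map_genbank_qualifier(xml_text):
    """Name and value are both child elements in this schema."""
    node = ET.fromstring(xml_text.strip())
    return (node.findtext("GBQualifier_name"), node.findtext("GBQualifier_value"))


def map_embl_qualifier(xml_text):
    """Name is an attribute and value is a child element in this schema."""
    node = ET.fromstring(xml_text.strip())
    return (node.get("name"), node.findtext("value"))


genbank_pair = map_genbank_qualifier(GENBANK_STYLE)
embl_pair = map_embl_qualifier(EMBL_STYLE)
print(genbank_pair, embl_pair, genbank_pair == embl_pair)
```

Nothing in either schema says that “GBQualifier_name” and the “name” attribute mean the same thing; that knowledge lives only in the person who wrote the two functions, and a new function is needed for every additional schema.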

slide-31
SLIDE 31

Nevertheless...

slide-32
SLIDE 32

Web Services
“Service Oriented Architectures”
WSDL

(and many other 4-letter words)

slide-33
SLIDE 33

Web Services & SOAs

Allow you to expose software

(e.g. a database, analytical tool, or service)

on the Web

so that others can use it

(in their own analytical pipelines)

slide-34
SLIDE 34

Excellent!!

slide-35
SLIDE 35

But...

slide-36
SLIDE 36

XML Schema

slide-37
SLIDE 37

XML Schema
There will be an element called “GBQualifier”
There will be a child element called “GBQualifier_name”
The content of that child element will be free-text
There will be a child element called “GBQualifier_value”
The content of that child element will be free-text

XML Schema
There will be an element called “qualifier”
It will have an attribute called “name”
The content of that attribute will be text
There will be a child element called “value”
The content of that child element will be free-text

slide-38
SLIDE 38

“The phrase ‘practical Web Services’ is not intrinsically an oxymoron, but [I] argue that there are few in existence.”

slide-39
SLIDE 39

Why?

slide-40
SLIDE 40

XML Schema
There will be an element called “GBQualifier”
There will be a child element called “GBQualifier_name”
The content of that child element will be free-text
There will be a child element called “GBQualifier_value”
The content of that child element will be free-text

XML Schema
There will be an element called “qualifier”
It will have an attribute called “name”
The content of that attribute will be text
There will be a child element called “value”
The content of that child element will be free-text

Because this problem is so disruptive that there is little point in building “public” Web Services... They are simply too difficult to integrate with other “public” Web Services.

- adapted from Petrie, SWSIP 2009
slide-41
SLIDE 41

...and that’s pretty much where the world is right now...

slide-42
SLIDE 42

But there is hope!

slide-43
SLIDE 43

Two new technologies & communities

“Linked Data” movement
Resource Description Framework “RDF”

The “Semantic Web” movement
Web Ontology Language “OWL” (+ RDF)

slide-44
SLIDE 44

What does RDF do for us?

“...RDF replaces XML Schema, because RDF says that there is only one data model...”
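A minimal sketch of the “only one data model” idea: every statement is a (subject, predicate, object) triple, so two sources can be aggregated with a plain set union and no per-schema mapping code. The URIs and the two example sources below are made up for illustration, and the sketch assumes both sources already use the same identifiers for the same things.

```python
# Hypothetical namespace; not a real vocabulary.
EX = "http://example.org/"

# Two sources describing the same gene, each as a set of triples.
source_a = {
    (EX + "gene/BRCA1", EX + "symbol", "BRCA1"),
    (EX + "gene/BRCA1", EX + "organism", "Homo sapiens"),
}
source_b = {
    (EX + "gene/BRCA1", EX + "symbol", "BRCA1"),
    (EX + "gene/BRCA1", EX + "chromosome", "17"),
}

# Aggregation is a set union: the shared statement collapses into one triple.
merged = source_a | source_b


def objects(graph, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


print(len(merged))
print(objects(merged, EX + "gene/BRCA1", EX + "chromosome"))
```

The union works only because both sources agree on what each predicate means, which is the part OWL is meant to make explicit.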

slide-45
SLIDE 45

What does OWL do for us?

“...the semantics are no longer implicit in the data model...”

XML Schema
There will be an element called “GBQualifier”
There will be a child element called “GBQualifier_name”
The content of that child element will be free-text
There will be a child element called “GBQualifier_value”
The content of that child element will be free-text

XML Schema
There will be an element called “qualifier”
It will have an attribute called “name”
The content of that attribute will be text
There will be a child element called “value”
The content of that child element will be free-text
slide-46
SLIDE 46

So what?

slide-47
SLIDE 47

The Semantic Web

Gives us the opportunity to re-think how we build our health data infrastructures

slide-48
SLIDE 48

The Semantic Web isn’t “yet another layer of technology”

slide-49
SLIDE 49

The Semantic Web changes the way we write software

slide-50
SLIDE 50

The Semantic Web

What to do & How to do it is no longer encoded in your software

slide-51
SLIDE 51

The Semantic Web

What to do & How to do it is part of the data

slide-52
SLIDE 52

The Semantic Web

What to do & How to do it is part of a shared, expert understanding

slide-53
SLIDE 53

The Semantic Web

What to do & How to do it IS* PERSONAL!

* can be...
slide-54
SLIDE 54

One piece of software
Any question...
Any answer

slide-55
SLIDE 55

Let me demonstrate what I mean

slide-56
SLIDE 56

Semantic Automated Discovery and Integration (SADI)

http://sadiframework.org

Founding partner: Microsoft Research

slide-57
SLIDE 57

Semantic Health And Research Environment (SHARE)
(a Semantic Web question answerer...)

slide-58
SLIDE 58

Example #1

Show me the latest Blood Urea Nitrogen and Creatinine levels of patients who appear to be rejecting their transplants

SELECT ?patient ?bun ?creat
FROM <http://sadiframework.org/ontologies/patients.rdf>
WHERE {
    ?patient rdf:type patient:LikelyRejecter .
    ?patient l:latestBUN ?bun .
    ?patient l:latestCreatinine ?creat .
}

slide-59
SLIDE 59

Likely Rejecter:

A patient who has creatinine levels that are increasing over time

- Wilkinson “MD”
slide-60
SLIDE 60

Likely Rejecter:

…but there is no “likely rejecter” column or table in our database…

slide-61
SLIDE 61

Likely Rejecter:

Our database contains various blood chemistry measurements at various time-points

slide-62
SLIDE 62

SHARE determines by itself the need to do a Linear Regression analysis over Creatinine blood chemistry measurements
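To make the “Likely Rejecter” definition concrete, here is a small sketch of the kind of analysis it implies: fit a least-squares line to creatinine measurements over time and check whether the slope is positive. This is not SHARE's or SADI's implementation; the function names, the threshold of zero, and the measurement values are all assumptions made up for illustration.

```python
def regression_slope(points):
    """Least-squares slope of y over x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("need measurements at two or more distinct time points")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def is_likely_rejecter(measurements):
    """Assumed reading of the slide's definition: creatinine rising over time."""
    return regression_slope(measurements) > 0


# Hypothetical data: (day of measurement, creatinine level).
rising = [(0, 1.0), (7, 1.2), (14, 1.5)]
falling = [(0, 1.1), (7, 1.0), (14, 1.0)]

print(is_likely_rejecter(rising), is_likely_rejecter(falling))
```

In the talk's framing this logic would not be written into the query client at all; it would sit behind a service that SHARE discovers from the class definition.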

slide-63
SLIDE 63

SHARE determines by itself how and where that analysis can be done and does it

slide-64
SLIDE 64

The SHARE system utilizes Semantics (via SADI) to discover and access analytical services on the Web that do linear regression analysis

slide-65
SLIDE 65

VOILA!

slide-66
SLIDE 66

Neither SADI nor SHARE knows anything about blood chemistry, or mathematics

slide-67
SLIDE 67

Example #2

From a (contrived) integrated dataset, retrieve the blood pressure measurements

SELECT ?output ?unit ?value
FROM <http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl>
WHERE {
    ?output rdf:type sbp:BloodPressure .
    ?output local:hasCanonicalAttribute ?pr .
    ?pr sio:SIO_000221 ?unit .
    ?pr sio:SIO_000300 ?value .
}

slide-68
SLIDE 68

This should be extremely straightforward...

slide-69
SLIDE 69

...except for one problem...

slide-70
SLIDE 70

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance1">
    <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
    <resource:SIO_000300>0.137</resource:SIO_000300>
    <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/meter-of-mercury-column"/>
</owl:NamedIndividual>

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance2">
    <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
    <resource:SIO_000300>12.45</resource:SIO_000300>
    <resource:SIO_000221 rdf:resource="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#centi-meter-of-mercury-column"/>
</owl:NamedIndividual>

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance3">
    <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
    <resource:SIO_000300>5.3</resource:SIO_000300>
    <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/inch-of-mercury-column"/>
</owl:NamedIndividual>
slide-71
SLIDE 71

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance1">
    <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
    <resource:SIO_000300>0.137</resource:SIO_000300>
    <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/meter-of-mercury-column"/>
</owl:NamedIndividual>

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance2">
    <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
    <resource:SIO_000300>12.45</resource:SIO_000300>
    <resource:SIO_000221 rdf:resource=“&ucum;unit/framingham/sbpfeb.owl#centi-meter-of-mercury-colum
</owl:NamedIndividual>

<owl:NamedIndividual rdf:about="http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#pressureinstance3">
    <rdf:type rdf:resource="&galen;SystolicBloodPressure"/>
    <resource:SIO_000300>5.3</resource:SIO_000300>
    <resource:SIO_000221 rdf:resource="&ucum;unit/pressure/inch-of-mercury-column"/>
</owl:NamedIndividual>
slide-72
SLIDE 72

Example #2

From a (contrived) integrated dataset, retrieve the blood pressure measurements

SELECT ?output ?unit ?value
FROM <http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl>
WHERE {
    ?output rdf:type sbp:BloodPressure .
    ?output local:hasCanonicalAttribute ?pr .
    ?pr sio:SIO_000221 ?unit .
    ?pr sio:SIO_000300 ?value .
}

My semantic definition of “Blood Pressure” includes the units that I want...

slide-73
SLIDE 73

This is enough to trigger SHARE to automatically discover an online unit-conversion service...
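As an illustration of what such a unit-conversion service has to do, the sketch below normalizes the three measurements from slide 70 to a single unit. The target unit (millimetres of mercury) is an assumption, since the slides do not say which unit the “Blood Pressure” definition asks for, and the lookup table and function name are hypothetical rather than part of SADI or SHARE.

```python
# Conversion factors to millimetres of mercury, keyed by the last segment
# of the unit URI. The choice of mmHg as the target unit is an assumption.
MM_HG_PER_UNIT = {
    "meter-of-mercury-column": 1000.0,
    "centi-meter-of-mercury-column": 10.0,
    "inch-of-mercury-column": 25.4,
}


def to_mm_hg(value, unit_uri):
    """Convert a pressure value to mmHg based on the unit named in its URI."""
    unit = unit_uri.rsplit("/", 1)[-1].rsplit("#", 1)[-1]
    return value * MM_HG_PER_UNIT[unit]


# Values and unit references as they appear in the three example records.
measurements = [
    (0.137, "&ucum;unit/pressure/meter-of-mercury-column"),
    (12.45, "http://es-01.chibi.ubc.ca/~soroush/framingham/sbpfeb.owl#centi-meter-of-mercury-column"),
    (5.3, "&ucum;unit/pressure/inch-of-mercury-column"),
]

converted = [round(to_mm_hg(value, unit), 2) for value, unit in measurements]
print(converted)
```

Once expressed in one unit, the three values become directly comparable, which the raw numbers 0.137, 12.45 and 5.3 are not.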

slide-74
SLIDE 74
slide-75
SLIDE 75

Neither SADI nor SHARE knows anything about units or unit conversions

slide-76
SLIDE 76

Many of the challenges to data aggregation and sharing now have solutions that work!

slide-77
SLIDE 77

What, in my opinion, is the greatest remaining challenge?

slide-78
SLIDE 78

To a biologist... ...“data mining” means “this data is mine!”

slide-79
SLIDE 79

The challenge to us all

Move from Data Mine-ing to Data Ours-ing

- Len Silverston, 2007
slide-80
SLIDE 80

We’ve come a long way!!

XML

XML Schema
There will be an element called “GBQualifier”
There will be a child element called “GBQualifier_name”
The content of that child element will be free-text
There will be a child element called “GBQualifier_value”
The content of that child element will be free-text

XML Schema
There will be an element called “qualifier”
It will have an attribute called “name”
The content of that attribute will be text
There will be a child element called “value”
The content of that child element will be free-text
slide-81
SLIDE 81

Microsoft Research

TEAM:
Luke McCarthy
Benjamin Vandervalk
Soroush Samadian