Why Data Citation is a Computational Problem
Susan B. Davidson University of Pennsylvania
Work partially supported by NSF IIS 1302212, NSF ACI 1547360, and NIH 3-U01-EB-020954-02S1
Outline
The power of abstraction
And how it has helped with …
[Figure: campus map with building labels]
“Genomics is the next moon landing.” (1992)
Relational Databases, Object-Oriented Databases, Image Data, Array Data
Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology.
The C. elegans Sequencing Consortium. Science, Volume 282 (5396): 2012-2018, 11 Dec 1998.
The 97-megabase genomic sequence of the nematode Caenorhabditis elegans reveals over 19,000 genes. More than 40 percent of the predicted protein products find significant matches in other organisms. There is a variety […] repeats and highly conserved genes provides evidence for a regional organization of the chromosomes.
Entrez Medline
Integrating Query: What genes are involved in bipolar schizophrenia?
Sequence hits (columns: Name, P Value, Len)
Name: HT97683, Q62167, P16381, P24346, P066346
Len: 2182, 440, 440, 440, 423

Image index (columns: Id, Date & Time, Image)
spdfld13a  9/8/95 12:02:03
spdfld22a  9/8/95 12:02:04
spdfld22a  9/8/95 12:02:06

Array data
1.2 3.4 5.6 7.8 9.0
3.5 6.8 9.1 2.4 5.7
8.0 7.6 5.4 3.2 1.0
1.9 2.8 3.7 4.6 5.5
7.3 8.2 9.1 0.0 1.1
6.8 9.1 2.4 5.7 8.0
7.6 5.4 3.2 1.0 9.8
?
>gi|2580555|gb|AF000985.1|HSAF000985 Homo sapiens dead box, Y isoform (DBY) mRNA, alternative transcript 1, complete cds
CCAGTGTAAGAGTTCCGCTATTCGGTCTCACACCTACAGTGGACTACCCGATTTTTCGCTTCTCTTCAGG
GATGAGTCATGTGGTGGTGAAAAATGACCCTGAACTGGACCAGCAGCTTGCTAATCTGGACCTGAACTCT
GAAAAACAGAGTGGAGGAGCAAGTACAGCGAGCAAAGGGCGCTATATACCTCCTCACTTAAGGAACAAAG
AAGCATCTAAAGGATTCCATGATAAAGACAGTTCAGGTTGGAGTTGCAGCAAAGATAAGGATGCATATAG
CAGTTTTGGGTCTCGAGATTCTAGAGGAAAGCCTGGTTATTTCAGTGAACGTGGAAGTGGATCAAGGGGA
...
Entrez Sequence
P Value: 3.1e-234, 4.2e-230, 4.2e-214, 2.6e-127
Public data sources
[Figure: DNA sequence fragments, e.g. TGCCGTGTGGCTAAATGTCTGTGC…]
Biologist’s workspace Bioinformatics protocols
Buneman, Davidson, Frew: Why data citation is a computational problem.
[Figure: citation architecture for eagle-i]
Components: Citation Creator, Citation Dereferencer, Versioning Manager
Labels: eagle-i id, Query, Versioned Result, Citation, Formatted Citation, Human Readable Citation
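The three components named in this figure suggest a round trip: a citation is created against a specific version of a resource, and dereferencing it later returns the data as it was when cited. A minimal sketch of that idea follows, assuming an in-memory store. The class names, method names, resource id, and record contents are all hypothetical illustrations and are not the eagle-i API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    resource_id: str  # stands in for an eagle-i id (format assumed)
    version: int      # version of the data that was cited
    query: str        # query that produced the cited result


@dataclass
class VersioningManager:
    # resource_id -> list of snapshots; the list index is the version number
    _history: dict = field(default_factory=dict)

    def commit(self, resource_id, data):
        """Store a new snapshot and return its version number."""
        versions = self._history.setdefault(resource_id, [])
        versions.append(dict(data))
        return len(versions) - 1

    def latest_version(self, resource_id):
        return len(self._history[resource_id]) - 1

    def snapshot(self, resource_id, version):
        return self._history[resource_id][version]


def create_citation(vm, resource_id, query):
    """Citation creator: pin the citation to the current version."""
    return Citation(resource_id, vm.latest_version(resource_id), query)


def dereference(vm, citation):
    """Citation dereferencer: return the data as it was when cited."""
    return vm.snapshot(citation.resource_id, citation.version)


# Made-up example data.
vm = VersioningManager()
vm.commit("r-001", {"label": "antibody A", "lab": "Lab X"})
cite = create_citation(vm, "r-001", "label = 'antibody A'")
vm.commit("r-001", {"label": "antibody A (revised)", "lab": "Lab X"})
```

After the second commit, `dereference(vm, cite)` still returns the first snapshot, which is the property a versioned citation needs.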
To cite this family introduction, please use: Miller, Drucker, Bataille, Chan, Delagrange, Göke, Mayo, Thorens, Hills. Glucagon receptor family, introduction. Accessed on 08/05/2017. IUPHAR/BPS Guide to PHARMACOLOGY, http://www.guidetopharmacology.org/GRAC/FamilyIntroductionForward?FamilyId=1.

Database page citation: Miller, Drucker, Bataille, Chan, Delagrange, Göke, Mayo, Thorens, Hills. Glucagon receptor family. Accessed on 08/05/2017. IUPHAR/BPS Guide to PHARMACOLOGY, http://www.guidetopharmacology.org/GRAC/FamilyDisplayForward?FamilyId=1.
[Figure: database hierarchy with citation annotations]
Node labels: root, families, introduction, targets, tables, tuples, …
Annotations:
URI: .../family/1234 Collaborators: Harmar, Sharman, Miller
URI: .../intro/987 Contributors: Miller, Drucker
URI: .../target/1234 Contributors: Miller, Drucker, Salvatori
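One way to read this hierarchy is that citation information is attached to some nodes only, and a citation for any other node has to be computed from the tree. The sketch below illustrates one possible policy, using the nearest annotated ancestor; that policy, the tree shape, and the `Node` class are assumptions for illustration and are not stated on the slide. The URIs are kept truncated exactly as they appear.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    uri: Optional[str] = None          # set only on annotated nodes
    people: list = field(default_factory=list)

    def child(self, name, uri=None, people=()):
        """Create a child node that points back to this node."""
        return Node(name, parent=self, uri=uri, people=list(people))


def nearest_citation(node):
    """Walk up from node and return (uri, people) of the first annotated node."""
    current = node
    while current is not None:
        if current.uri is not None:
            return current.uri, current.people
        current = current.parent
    return None  # nothing on the path to the root carries an annotation


# Assumed tree shape; annotations copied from the figure.
root = Node("root")
families = root.child("families")
family = families.child("family", uri=".../family/1234",
                        people=["Harmar", "Sharman", "Miller"])
intro = family.child("introduction", uri=".../intro/987",
                     people=["Miller", "Drucker"])
targets = family.child("targets")
target = targets.child("target", uri=".../target/1234",
                       people=["Miller", "Drucker", "Salvatori"])
```

Under this policy the unannotated `targets` node inherits the family-level annotation, while `target` and `intro` use their own. Other policies, such as taking the union of everyone along the path, would fit the same structure.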
“Model for Fine-Grained Data Citation”, CIDR 2017
Citation: Miller, Drucker, Bataille, Chan, Delagrange, Göke, Mayo, Thorens, Hills. Glucagon receptor family. Accessed on 08/05/2017. IUPHAR/BPS Guide to PHARMACOLOGY, Family(F, N, Ty), F = 1
Citation:
Miller, Drucker, Bataille, Chan, Delagrange, Göke, Mayo, Thorens, Hills. Glucagon receptor family.
Miller, Drucker, Bataille, Chan, Delagrange, Göke, Mayo, Thorens, Hills. Glucagon receptor family, introduction.
Accessed on 08/05/2017. IUPHAR/BPS Guide to PHARMACOLOGY, Family(F, N, Ty), FamilyIntro(F, Tx), F = 1
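The two citations above differ only in which views the query touches: Family alone, or Family together with FamilyIntro. A minimal sketch of how such a citation could be assembled from per-view snippets is shown below. The function names, the dictionary layout, and the lookup table standing in for citation queries are hypothetical; only the view names, author list, and citation wording come from the slides.

```python
AUTHORS = "Miller, Drucker, Bataille, Chan, Delagrange, Göke, Mayo, Thorens, Hills"

# Toy stand-in for a citation query that looks up the family name for F.
FAMILY_NAMES = {1: "Glucagon receptor family"}


def cite_family(f):
    """Snippet contributed by the view Family(F, N, Ty)."""
    return f"{AUTHORS}. {FAMILY_NAMES[f]}."


def cite_family_intro(f):
    """Snippet contributed by the view FamilyIntro(F, Tx)."""
    return f"{AUTHORS}. {FAMILY_NAMES[f]}, introduction."


# Each citation view is paired with the function that produces its snippet.
CITATION_VIEWS = {
    "Family(F, N, Ty)": cite_family,
    "FamilyIntro(F, Tx)": cite_family_intro,
}


def cite_query(views_used, f, accessed):
    """Combine the snippets of every view the query was rewritten to use."""
    snippets = " ".join(CITATION_VIEWS[v](f) for v in views_used)
    view_list = ", ".join(views_used)
    return (f"{snippets} Accessed on {accessed}. "
            f"IUPHAR/BPS Guide to PHARMACOLOGY, {view_list}, F = {f}")


single = cite_query(["Family(F, N, Ty)"], 1, "08/05/2017")
joint = cite_query(["Family(F, N, Ty)", "FamilyIntro(F, Tx)"], 1, "08/05/2017")
```

Here the snippets are simply concatenated. A real system would also need a policy for merging snippets that overlap, for example collapsing the repeated author list.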
[Figure: citation system architecture]
Actors: DBA, Author, Reader
Components: Curated DB, Citation Views, Policies, Query Rewriting, Citation Generator, Citation Dereferencing, Versioning system
Flow labels: define; query; views used for rewriting query; applicable policies; citation queries; citation snippets; cited data; data (result set); citation; dereferencing citation
¤ (Data Management) ∩ (Machine Learning)
¤ “Data Engineering” akin to “Software Engineering”
¤ Collecting, cleaning and organizing data sets is reported to take nearly 80% of a data scientist's time, yet is the least enjoyable part of their job
¤ “Why Analysis” of Algorithms
¤ Ethical data management