
Why Data Citation is a Computational Problem, Susan B. Davidson



  1. Why Data Citation is a Computational Problem
     Susan B. Davidson, University of Pennsylvania
     Work partially supported by NSF IIS 1302212, NSF ACI 1547360, and NIH 3-U01-EB-020954-02S1.

  2. Outline
     ¤ The power of abstraction
     ¤ And how it has helped with two of my favorite problems in bioinformatics
     ¤ New problem: data citation
     ¤ Bigger picture: Data Science

  3. The power of abstraction
     ¤ The “right” abstraction is key to developing solutions to many practical problems.
        ¤ Data Integration
        ¤ Provenance
        ¤ …
        ¤ Data Citation
     ¤ Developing the right abstraction requires close collaboration between end-users, systems builders, and theoreticians.

  4. Databases meets bioinformatics
     “Genomics is the next moon landing.”
     [Figure: University of Pennsylvania campus map]

  5. Example 1: Data Integration
     [Figure: heterogeneous sources to be integrated: Image Data, Entrez Sequence, Relational Databases, Entrez Medline, Object-Oriented Databases, Array Data]
     Integrating Query: What genes are involved in bipolar schizophrenia?

  6. DOE “Impossible” Queries
     “Until a fully relationalized sequence database is available, none of the queries in this appendix can be answered.”

  7. Why would they say that?
     ¤ Needed to pose set-oriented queries against multiple, heterogeneous databases, files, and software packages.
     ¤ Most integration work at the time was based on the relational model.
     ¤ Embedded links in files: Clicking doesn't scale!
     ¤ Needed in-depth understanding of what data sources were available and what information they contained.

  8. Answering the “unanswerable”
     ¤ We were able to answer the “unanswerable queries” within about a month using our data integration system, Kleisli.
     ¤ Kleisli used a complex-object model of data, a language based on comprehension syntax, and optimizations that went beyond relational systems.
     ¤ Limsoon Wong
     ¤ Kyle Hart, Jonathan Crabtree, …
     ¤ Leonid Libkin, Dan Suciu, …
     BioGuideSRS (Cohen-Boulakia); The Q Query System (Ives); … ?
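
To give a rough feel for what a comprehension-style query over complex (nested) objects looks like, here is a minimal sketch in Python. It is not Kleisli or its query language; Python set comprehensions are used only as a stand-in. The data, field names, and the function `papers_by_organism` are all hypothetical and invented for illustration.

```python
# Hypothetical data: every name and value below is invented for illustration.

# Source 1: nested "complex objects", e.g. as parsed from a sequence flat file.
sequence_entries = [
    {"accession": "A1", "organism": "Homo sapiens",
     "features": [{"gene": "g1", "length": 440}, {"gene": "g2", "length": 423}]},
    {"accession": "A2", "organism": "Caenorhabditis elegans",
     "features": [{"gene": "g3", "length": 2182}]},
]

# Source 2: flat, relational-style rows, e.g. from a literature database.
gene_citations = [
    {"gene": "g1", "paper": "P-001"},
    {"gene": "g3", "paper": "P-002"},
    {"gene": "g3", "paper": "P-003"},
]


def papers_by_organism(entries, citations, organism):
    """Set-oriented query written as one comprehension.

    It unnests the 'features' list inside each entry and joins the result
    with the flat citation rows on the gene name, so no nested source has
    to be flattened into relational tables first.
    """
    return {
        (entry["accession"], feature["gene"], cite["paper"])
        for entry in entries
        if entry["organism"] == organism
        for feature in entry["features"]      # unnest the nested list
        for cite in citations                 # join with the second source
        if cite["gene"] == feature["gene"]
    }


result = papers_by_organism(sequence_entries, gene_citations,
                            "Caenorhabditis elegans")
print(sorted(result))
```

The query is posed over whole collections rather than by following links one record at a time, and the nesting in the first source is handled directly by the comprehension.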
