SDSC 2013 Summer Institute Biomedical data integration system and - - PowerPoint PPT Presentation


SDSC 2013 Summer Institute Biomedical data integration system and web search engine Julia Ponomarenko, PhD San Diego Supercomputer Center jpon@sdsc.edu 2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California SAN DIEGO


slide-1
SLIDE 1

2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

SDSC 2013 Summer Institute Biomedical data integration system and web search engine

Julia Ponomarenko, PhD San Diego Supercomputer Center jpon@sdsc.edu

slide-2
SLIDE 2

Big Data: “the zillionics realm”

In 2012, the amount of data created and replicated was:

  • 2,800,000 petabytes (PB)
  • 2,800,000,000 terabytes (TB)
  • 2,800,000,000,000 gigabytes (GB)
  • 2,800,000,000,000,000 megabytes (MB)
  • 2,800,000,000,000,000,000 kilobytes (KB)
  • 2,800,000,000,000,000,000,000 bytes

slide-3
SLIDE 3

Data from Wired, May 2013

  • Facebook’s content uploaded per year: 183 PB
  • Google’s search index: 98 PB
  • Kaiser’s medical records: 31 PB
  • E-mails sent per year: 2,986 PB
  • Large Hadron Collider's annual data output: 15 PB
  • YouTube’s video uploaded per year: 15 PB
  • National Climatic Data Center Database: 6 PB
  • Library of Congress digital collection: 5 PB
  • Nasdaq stock market database: 3 PB

slide-4
SLIDE 4

The number of sequences in NCBI GenBank

GenBank, April 2013: 164,136,731 sequences, 594 GB (0.0006 PB)

slide-5
SLIDE 5

High-throughput sequencing:

  • Illumina HiSeq 2500 can sequence a human genome, 20 exomes, or 150 RNA-seq samples per day.
  • NCBI Sequence Read Archive contains 4.1 PB of sequence reads.

slide-6
SLIDE 6

Gene expression data: 0.05 PB

  • ArrayExpress database (at EBI): 1,104,037 assays (0.014 PB of archived data)
  • NCBI GEO database: 931,577 samples
slide-7
SLIDE 7

Metabolic & Signaling pathways

slide-8
SLIDE 8

Protein-protein interaction & Transcriptional regulatory networks & Host-pathogen interactions

slide-9
SLIDE 9
  • KEGG: 245,393 pathways
  • REACTOME: 25,000+ pathways, 70,000+ reactions
  • PathGuide (survey of 325 pathway resources): 205+ million reactions for 28+ million genes/proteins (0.01 PB)

The image of interactions among interaction databases is from pathguide.org

slide-10
SLIDE 10

Data from Wired, May 2013

  • Facebook’s content uploaded per year: 183 PB
  • Google’s search index: 98 PB
  • Kaiser’s medical records: 31 PB
  • E-mails sent per year: 2,986 PB
  • Large Hadron Collider's annual data output: 15 PB
  • YouTube’s video uploaded per year: 15 PB
  • National Climatic Data Center Database: 6 PB
  • Library of Congress digital collection: 5 PB
  • Nasdaq stock market database: 3 PB
  • Molecular Biology data in public databases: ~4.5 PB

slide-11
SLIDE 11

Taxonomies Data Sequences Publications Structures Variations Expression Data Networks Annotations Biochemical Data Epigenetic Data

slide-12
SLIDE 12

Taxonomies Data Databases (2,000+) Sequences Publications Structures Variations Expression Data Networks Annotations Biochemical Data Epigenetic Data Web pages

How can a molecular biologist embrace such an amount of data in its entirety?

slide-13
SLIDE 13

Taxonomies Data Data Integration Resources Databases (2,000+) Sequences Publications Structures Variations Expression Data Networks Annotations Biochemical Data Epigenetic Data Web pages

slide-14
SLIDE 14
slide-15
SLIDE 15

Taxonomies Data Data Integration Resources Databases (2,000+) Sequences Publications Structures Variations Expression Data Networks Annotations Biochemical Data Epigenetic Data Web pages

slide-16
SLIDE 16

Taxonomies Data Data Integration Resources Databases (2,000+) Sequences Publications Structures Variations Expression Data Networks Annotations Biochemical Data Epigenetic Data

This leaves a molecular biologist to work with partial, incomplete, non-comprehensive data sets!

Web pages

slide-17
SLIDE 17

Taxonomies Data Databases (2,000+) Sequences Publications Structures Variations Expression Data Networks Annotations Biochemical Data Epigenetic Data Web pages

Data Warehouse

Biological Ontologies The Semantic Web technologies

slide-18
SLIDE 18

Other web pages Database web pages

slide-19
SLIDE 19

Other web pages Database web pages

slide-20
SLIDE 20

Other web pages Database web pages

slide-21
SLIDE 21

Other web pages Database web pages

slide-22
SLIDE 22

Other web pages Database web pages

slide-23
SLIDE 23

Other web pages Database web pages

slide-24
SLIDE 24

Other web pages Database web pages

slide-25
SLIDE 25

Other web pages Database web pages

slide-26
SLIDE 26

Other web pages Database web pages For each ontological term A and page X, calculate the relevance score of X to A.

slide-27
SLIDE 27

Part of the multiple alignment ontology (MAO)

Thompson et al., 2005, NAR, PMID: 16043635

Grey boxes represent concepts and colored arrows represent relationships: red, is_a; blue, part_of; green, is_attribute.

slide-28
SLIDE 28

Mao.obo

slide-29
SLIDE 29

Other web pages Database web pages

  • For each ontological term A and page X, calculate the relevance score of X to A.
  • Calculate the rank of each page.
  • Automatically extract data and map them into the internal database schema.

slide-30
SLIDE 30

User Community

Web-portal & API

integromeDB.org

Java-application

BiologicalNetworks.org

IntegromeDB

Public Data on the Web User’s Private Data

slide-31
SLIDE 31

integromedb.org

slide-32
SLIDE 32
slide-33
SLIDE 33
slide-34
SLIDE 34
slide-35
SLIDE 35
slide-36
SLIDE 36

Integromedb.org visit statistics (7/16/2012 – 7/15/2013)

slide-37
SLIDE 37

Web Crawler Data Mapping and Integration Large Databases

Tables Texts Databases Web pages

Real problems of data integration: #1

  • Collecting data and maintaining data consistency: it is infeasible to collect data from thousands of data sources either by downloading them or via web crawling. If a crawler sends a request to a website every 10 seconds, downloading the entire GenBank (163 million sequence records) from one IP address would require 50 years.
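The 50-year figure can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes one sequence record per request, strictly sequential requests from a single IP address, and a 365.25-day year. The function name `crawl_years` is made up for the illustration.

```python
# Back-of-the-envelope check of the crawl-time estimate.
# Assumptions (not from the slides): one record per request,
# no parallelism, and a 365.25-day year.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def crawl_years(records: int, seconds_per_request: float) -> float:
    """Years needed to fetch `records` pages one request at a time."""
    return records * seconds_per_request / SECONDS_PER_YEAR


years = crawl_years(163_000_000, 10)
print(f"Crawling GenBank record by record: about {years:.0f} years")
```

The result lands slightly above 50 years, which agrees with the rounded figure on the slide.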

slide-38
SLIDE 38

Web Crawler Data Mapping and Integration Large Databases

Tables Texts Databases Web pages

Real problems of data integration: #1 (solution)

  • Hybrid approach: large databases (100,000+ web pages) are downloaded as SQL, XML, or RDF models, while all other resources on the web are reached via web crawling.
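The hybrid rule amounts to a simple per-source decision. The sketch below is illustrative only: the `Source` class, the `acquisition_strategy` function, and the example sources are hypothetical, and the fallback to crawling when a large source offers no dump is an assumption that the slides do not state.

```python
from dataclasses import dataclass

# Size threshold quoted on the slide: 100,000+ web pages.
BULK_DOWNLOAD_THRESHOLD = 100_000


@dataclass
class Source:
    """Hypothetical description of one data source."""
    name: str
    page_count: int
    dump_formats: tuple = ()  # e.g. ("SQL", "XML", "RDF")


def acquisition_strategy(source: Source) -> str:
    """Return "download" for large sources with a dump, else "crawl"."""
    # Assumption: a large source without any dump still has to be crawled.
    if source.page_count >= BULK_DOWNLOAD_THRESHOLD and source.dump_formats:
        return "download"
    return "crawl"


# Made-up sources for illustration only.
sources = [
    Source("large_sequence_db", 5_000_000, ("XML",)),
    Source("small_lab_site", 300),
]
plan = {s.name: acquisition_strategy(s) for s in sources}
print(plan)
```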

slide-39
SLIDE 39

Web Crawler Data Mapping and Integration Large Databases

Tables Texts Databases Web pages

Real problems of data integration: #2

  • Data mapping:
    – No unified biological ontology (there are conflicts and inconsistencies across ontologies)
    – Data are heterogeneous (data integration requires mapping various types of data onto a set of stable gene and protein IDs)

slide-40
SLIDE 40

Real problems of data integration: #2

(solution)

  • IntegromeDB Ontology: The Open Biomedical Ontologies (OBO) consortium and the National Center for Biomedical Ontology (NCBO) provide mappings among different ontologies. These mappings (for 120 ontologies) were used to develop the IntegromeDB Ontology.

slide-41
SLIDE 41
slide-42
SLIDE 42
slide-43
SLIDE 43

Real problems of data integration: #2

(solution)

  • Each web page is integrated in the database as is, along with two calculated scores: PageRank and the Lucene score. The latter is calculated for each word from the database dictionaries (object names, IDs, synonyms) and the IntegromeDB Ontology.
  • The downloaded files (SQL, XML, and RDF) are mapped to the IntegromeDB database schema by transforming them into an RDF-compatible format. The mapping includes automatic determination of node IDs from the names and synonyms of biological entities, using a dictionary of 70 million gene aliases.
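One way to picture the alias-to-node-ID step is a dictionary lookup over normalized names. The sketch below is a toy illustration, not the IntegromeDB implementation: the node IDs, the aliases, and the normalization rule (lowercasing and collapsing whitespace) are all assumptions made for the example.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.lower().split())


def build_alias_index(entries: dict) -> dict:
    """Map each normalized alias to the set of node IDs that use it."""
    index = {}
    for node_id, aliases in entries.items():
        for alias in aliases:
            index.setdefault(normalize(alias), set()).add(node_id)
    return index


def resolve(name: str, index: dict) -> set:
    """Return candidate node IDs for a name; an empty set means unknown."""
    return index.get(normalize(name), set())


# Toy dictionary with hypothetical node IDs; the real one is said to hold
# 70 million gene aliases.
entries = {
    "NODE_1": ["TP53", "p53", "tumor protein p53"],
    "NODE_2": ["MDM2", "HDM2"],
    "NODE_3": ["P53"],  # deliberately ambiguous alias
}
index = build_alias_index(entries)
print(resolve("Tumor  Protein P53", index))
print(resolve("p53", index))
```

An alias shared by several entities returns several candidates, so a real mapper would need an extra disambiguation step, for example by organism or source database.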

slide-44
SLIDE 44

Web Crawler Data Mapping and Integration Large Databases IntegromeDB RDBMS

Tables Texts Databases Web pages

  • The database is a PostgreSQL database modeled as a typed, labeled meta-graph: nodes are objects (proteins, ligands, molecular complexes, and genes), edges are relationships between objects (up/down regulation, molecular transport, molecular synthesis, enzymatic activity), and the labels are described by their own schema.
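The typed, labeled graph idea can be sketched in memory as follows. This is not the PostgreSQL schema, which the slides do not show: the class names, field names, and example entities are hypothetical, and only the node and edge type names are taken from the slide.

```python
from dataclasses import dataclass, field

# Type vocabularies taken from the slide text.
NODE_TYPES = {"protein", "ligand", "molecular_complex", "gene"}
EDGE_TYPES = {
    "up_regulation", "down_regulation", "molecular_transport",
    "molecular_synthesis", "enzymatic_activity",
}


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str
    label: str


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str


@dataclass
class MetaGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        if node.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node.node_type}")
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        if edge.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge.edge_type}")
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def targets(self, node_id: str, edge_type: str) -> list:
        """Nodes reached from `node_id` through edges of the given type."""
        return [self.nodes[e.target] for e in self.edges
                if e.source == node_id and e.edge_type == edge_type]


# Made-up entities for illustration.
graph = MetaGraph()
graph.add_node(Node("n1", "gene", "geneA"))
graph.add_node(Node("n2", "protein", "proteinB"))
graph.add_edge(Edge("n1", "n2", "up_regulation"))
print([n.label for n in graph.targets("n1", "up_regulation")])
```

In a relational database the same structure would typically become a node table, an edge table, and lookup tables for the type vocabularies.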

slide-45
SLIDE 45

IntegromeDB Database

  • Currently, the database integrates data from a billion web pages populated from molecular biology databases listed in the NAR depository, with 100 major databases being directly downloaded and manually mapped to the IntegromeDB schema.
  • Using that manual mapping, algorithms have been trained to map SQL, XML, and RDF files automatically.

slide-46
SLIDE 46

Web Crawler Data Mapping and Integration Large Databases IntegromeDB RDBMS User Community API Web Interface

Tables Texts Databases Web pages

User Data

slide-47
SLIDE 47

Real problems of data integration: #3

  • User Interface: Which types of search to allow and how to represent the search results on the web page?
  • Search results for a gene or protein are organized on the web page into a dashboard of relevant attributes that are grouped by data source and by pairwise similarity, defined using the normalized Levenshtein distance and an empirical threshold.
  • The web pages are sorted by relevance. For each ontological term A and page X integrated into the system, the relevance score of X to A is calculated as follows:

RL(X,A) = K1×PO(A) + K2×PP(X) + K3×PT(X) + K4×PR(X)

where K1-K4 are empirical coefficients; PO is the frequency of the term A in the ontology containing A; PP is the frequency of A in X (corresponds to the Lucene score); PT is the frequency of the term A in different HTML tag fields of the page X; PR is the PageRank of X.
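The two computations on this slide can be sketched in a few lines. Everything beyond the formula itself is an assumption: the slides give neither the coefficient values nor the normalization used for the Levenshtein distance, so the sketch divides by the length of the longer string and uses placeholder coefficients of 1.0. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance using two rows of the dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled to [0, 1]; dividing by the longer length is an assumption."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def relevance(po: float, pp: float, pt: float, pr: float,
              k=(1.0, 1.0, 1.0, 1.0)) -> float:
    """RL(X,A) = K1*PO(A) + K2*PP(X) + K3*PT(X) + K4*PR(X).

    The default coefficients are placeholders, not the empirical values.
    """
    k1, k2, k3, k4 = k
    return k1 * po + k2 * pp + k3 * pt + k4 * pr


print(normalized_levenshtein("kitten", "sitting"))
print(relevance(po=0.5, pp=0.25, pt=0.125, pr=0.125))
```

Two attribute values would then be grouped together when their normalized distance falls below the empirical threshold, and pages would be sorted by descending `relevance`.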

slide-48
SLIDE 48
slide-49
SLIDE 49
slide-50
SLIDE 50

Real problems of data integration: #4

  • Serving the user requests instantly: large SQL join queries over a large relational database are time-consuming.
  • A noSQL solution based on the Hadoop architecture is under development. It will be used for two tasks: to store, update, process, and index crawled data; and to provide fast access to data from the IntegromeDB relational database.

slide-51
SLIDE 51

Web Crawler Data Mapping and Integration Large Databases IntegromeDB RDBMS User Community API Web Interface

Tables Texts Databases Web pages

noSQL database User Data

slide-52
SLIDE 52

Acknowledgments

SDSC

  • Michael Baitaluk
  • Sergey Kozhenkov
  • Amarnath Gupta
  • Yulia Dubinina

Funding

  • NIH R01GM084881 (to MB) & R01GM085325 (to JP)
  • SDSC Gordon seed grant
  • NSF XSEDE (Gordon and Data Oasis resources)