2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
SDSC 2013 Summer Institute
Biomedical data integration system and web search engine
Julia Ponomarenko, PhD
San Diego Supercomputer Center
jpon@sdsc.edu
Big Data: “the zillionics realm”
In 2012, the amount of data created and replicated was:
- 2,800,000 petabytes (PB)
- 2,800,000,000 terabytes (TB)
- 2,800,000,000,000 gigabytes (GB)
- 2,800,000,000,000,000 megabytes (MB)
- 2,800,000,000,000,000,000 kilobytes (KB)
- 2,800,000,000,000,000,000,000 bytes
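The unit ladder above uses decimal prefixes, where each step is a factor of 1,000. A minimal sketch of that conversion follows; the function name `petabytes_to` and the `UNIT_EXPONENT` table are illustrative and not part of the original slides.

```python
# Decimal (SI) units: 1 PB = 10**15 bytes, and each smaller unit is 1,000x finer.
UNIT_EXPONENT = {"bytes": 0, "KB": 3, "MB": 6, "GB": 9, "TB": 12, "PB": 15}


def petabytes_to(unit: str, petabytes: int) -> int:
    """Convert a whole number of petabytes into the given decimal unit."""
    return petabytes * 10 ** (UNIT_EXPONENT["PB"] - UNIT_EXPONENT[unit])


total_pb = 2_800_000  # figure quoted on the slide for 2012
for unit in ("PB", "TB", "GB", "MB", "KB", "bytes"):
    print(f"{petabytes_to(unit, total_pb):,} {unit}")
```

Integer arithmetic is used on purpose so that the 22-digit byte count is exact rather than a rounded float.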
Data from Wired, May 2013
- Facebook’s content uploaded per year: 183 PB
- Google’s search index: 98 PB
- Kaiser’s medical records: 31 PB
- E-mails sent per year: 2,986 PB
- Large Hadron Collider's annual data output: 15 PB
- YouTube’s video uploaded per year: 15 PB
- National Climatic Data Center Database: 6 PB
- Library of Congress digital collection: 5 PB
- Nasdaq stock market database: 3 PB
The number of sequences in NCBI GenBank
April 2013 GenBank 164,136,731 sequences 594 GB (0.0006 PB)
High-throughput sequencing:
- Illumina HiSeq 2500 can sequence a human genome, 20 exomes, or 150 RNA-seq samples per day
- NCBI Sequence Read Archive contains 4.1 PB of sequence reads
Gene expression data: 0.05 PB
- ArrayExpress database (at EBI): 1,104,037 assays (0.014 PB of archived data)
- NCBI GEO database: 931,577 samples
Metabolic & Signaling pathways
Protein-protein interaction & Transcriptional regulatory networks & Host-pathogen interactions
- KEGG: 245,393 pathways
- REACTOME: 25,000+ pathways, 70,000+ reactions
- PathGuide (survey of 325 pathway resources): 205+ million reactions for 28+ million genes/proteins (0.01 PB)
Image of interactions among interaction databases is from pathguide.org
Compared with the Wired (May 2013) figures above:
- Molecular Biology data in public databases: ~4.5 PB
Diagram: data types (taxonomies, sequences, publications, structures, variations, expression data, networks, annotations, biochemical data, epigenetic data) spread across databases (2,000+) and web pages
How can a molecular biologist take in such an amount of data in its entirety?
Diagram: data integration resources placed between the user and the same data types, databases (2,000+), and web pages
This leaves a molecular biologist working with partial, incomplete data sets!
Data Warehouse
Biological Ontologies The Semantic Web technologies
Other web pages Database web pages
For each ontological term A and page X, calculate the relevance score of X to A.
Part of the multiple alignment ontology (MAO)
Thompson et al., 2005, NAR, PMID: 16043635
Grey boxes represent concepts and colored arrows represent relationships: red, is_a; blue, part_of; green, is_attribute.
Mao.obo
Calculate a rank for each page. Automatically extract data and map them into the internal database schema.
User Community
Web-portal & API
integromeDB.org
Java-application
BiologicalNetworks.org
IntegromeDB
Public Data on the Web User’s Private Data
integromedb.org
Integromedb.org visit statistics (7/16/2012 – 7/15/2013)
Web Crawler Data Mapping and Integration Large Databases
Tables Texts Databases Web pages
Real problems of data integration: #1
- Collecting data and maintaining data consistency: it is infeasible to collect data from thousands of data sources, either by downloading them or by web crawling. For example, if a crawler sends one request to a website every 10 seconds from a single IP address, downloading the entire GenBank (163 million sequence records) would take about 50 years.
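The 50-year estimate follows from simple arithmetic, assuming one record is fetched per request. A short sketch of that calculation; the function name `crawl_years` is illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # average year, including leap days


def crawl_years(num_records: int, seconds_per_request: float) -> float:
    """Years needed to fetch every record sequentially, one record per request."""
    return num_records * seconds_per_request / SECONDS_PER_YEAR


# Figures from the slide: 163 million GenBank records, one request every 10 seconds.
years = crawl_years(163_000_000, 10)
print(f"about {years:.0f} years from a single IP address")
```

The result is slightly above 50 years, which matches the rounded figure quoted on the slide.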
Real problems of data integration: #1 (solution)
- Hybrid approach: large databases (100,000+ web pages) are downloaded as SQL, XML, or RDF models, while all other resources on the web are reached via web crawling.
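The hybrid rule can be pictured as a simple routing decision per data source. The sketch below is only an illustration: the `Source` class, the `has_dump` flag, and the fallback to crawling when no dump exists are assumptions, while the 100,000-page threshold comes from the slide.

```python
from dataclasses import dataclass

BULK_THRESHOLD_PAGES = 100_000  # threshold quoted on the slide


@dataclass
class Source:
    name: str
    estimated_pages: int
    has_dump: bool  # assumption: whether an SQL/XML/RDF dump is published


def acquisition_method(source: Source) -> str:
    """Return "download" for large sources with a dump, otherwise "crawl"."""
    if source.estimated_pages >= BULK_THRESHOLD_PAGES and source.has_dump:
        return "download"
    return "crawl"


# Hypothetical sources, used only to exercise the rule.
for src in (Source("large-db", 250_000, True), Source("lab-site", 40, False)):
    print(src.name, "->", acquisition_method(src))
```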
Real problems of data integration: #2
- Data mapping:
– No unified biological ontology (there are conflicts and inconsistencies across ontologies)
– Data are heterogeneous (data integration requires mapping various types of data onto a set of stable gene and protein IDs)
Real problems of data integration: #2 (solution)
- IntegromeDB Ontology: The Open Biomedical Ontologies (OBO)
consortium and the National Center for Biomedical Ontology (NCBO) provide the mapping among different ontologies. This mapping (for 120 ontologies) was used to develop IntegromeDB Ontology.
Real problems of data integration: #2 (solution)
- Each web page is integrated in the database as is along with two
calculated scores: PageRank and the Lucene score. The latter is calculated for each word from the database dictionaries (object names, IDs, synonyms) and IntegromeDB Ontology.
- The downloaded files (SQL, XML, and RDF) are mapped to the IntegromeDB database schema by transforming them into an RDF-compatible format. The mapping includes automatic determination of node IDs, such as names and synonyms of biological entities, using a dictionary of 70 million gene aliases.
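The alias-dictionary step described above can be sketched as a normalized lookup from names and synonyms to stable node IDs. This is a toy illustration, not the IntegromeDB implementation: the IDs `GENE:1` and `GENE:2`, the normalization rule, and the function names are all assumptions.

```python
from typing import Dict, List, Set


def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so spelling variants compare equal."""
    return " ".join(name.lower().split())


def build_alias_index(entries: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Invert {node_id: [aliases]} into {normalized alias: {node_ids}}."""
    index: Dict[str, Set[str]] = {}
    for node_id, aliases in entries.items():
        for alias in aliases:
            index.setdefault(normalize(alias), set()).add(node_id)
    return index


def resolve(name: str, index: Dict[str, Set[str]]) -> Set[str]:
    """Return every node ID whose alias matches the given name."""
    return index.get(normalize(name), set())


# Hypothetical two-entry dictionary; the real one holds about 70 million aliases.
ALIASES = {
    "GENE:1": ["TP53", "p53", "tumor protein p53"],
    "GENE:2": ["MDM2", "HDM2"],
}
index = build_alias_index(ALIASES)
print(resolve("  Tumor  Protein P53 ", index))
```

A set of IDs is returned rather than a single ID because one alias can be shared by several entities, which is one source of the mapping conflicts the slides mention.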
Web Crawler Data Mapping and Integration Large Databases IntegromeDB RDBMS
Tables Texts Databases Web pages
- The database is a PostgreSQL database modeled as a node-typed and edge-typed labeled meta-graph, where the labels are described by their own schema. Nodes are objects (proteins, ligands, molecular complexes, and genes); edges are relationships between objects (up/down regulation, molecular transport, molecular synthesis, enzymatic activity).
IntegromeDB Database
- Currently, the database integrates data from a billion web
pages populated from molecular biology databases listed in the NAR depository, with 100 major databases being directly downloaded and manually mapped to the IntegromeDB schema.
- Using that manual mapping, algorithms have been trained to map SQL, XML, and RDF files automatically.
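The typed, labeled graph model of the database can be illustrated with a small in-memory sketch. The slides describe a PostgreSQL schema; the classes below are a hypothetical stand-in that only shows the idea of restricting nodes and edges to declared types. The type names are taken from the slide, everything else is assumed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

NODE_TYPES = {"protein", "ligand", "molecular complex", "gene"}
EDGE_TYPES = {"up regulation", "down regulation", "molecular transport",
              "molecular synthesis", "enzymatic activity"}


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str
    label: str


@dataclass(frozen=True)
class Edge:
    source_id: str
    target_id: str
    edge_type: str


@dataclass
class TypedGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        if node.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node.node_type}")
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        if edge.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge.edge_type}")
        if edge.source_id not in self.nodes or edge.target_id not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def targets(self, source_id: str, edge_type: Optional[str] = None) -> List[Node]:
        """Nodes reached from source_id, optionally filtered by edge type."""
        return [self.nodes[e.target_id] for e in self.edges
                if e.source_id == source_id
                and (edge_type is None or e.edge_type == edge_type)]


# Placeholder entities, not real biological facts.
graph = TypedGraph()
graph.add_node(Node("n1", "gene", "gene A"))
graph.add_node(Node("n2", "protein", "protein B"))
graph.add_edge(Edge("n1", "n2", "up regulation"))
print([n.label for n in graph.targets("n1", "up regulation")])
```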
Web Crawler Data Mapping and Integration Large Databases IntegromeDB RDBMS User Community API Web Interface
Tables Texts Databases Web pages
User Data
Real problems of data integration: #3
- User Interface: Which types of search to allow and how to
represent the search results on the web page?
- Search results for a gene or protein are organized on the web page
into a dashboard of relevant attributes that are grouped by data sources and by pair-wise similarity, defined using the normalized Levenshtein distance and an empirical threshold.
- The web pages are sorted by relevance. For each ontological term A and page X integrated into the system, the relevance score of X to A is calculated as follows:
RL(X,A) = K1×PO(A) + K2×PP(X) + K3×PT(X) + K4×PR(X)
where K1-K4 are empirical coefficients; PO is the frequency of the term A in the ontology containing A; PP is the frequency of A in X (corresponds to the Lucene score); PT is the frequency of the term A in different HTML tag fields of the page X; PR is the PageRank of X.
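The two calculations described above, the normalized Levenshtein distance used for grouping attributes and the weighted relevance score RL(X,A), can be sketched as follows. Several details are assumptions because the slides do not give them: the distance is normalized by the longer string's length (one common convention), and the coefficient values are placeholders rather than the empirical K1-K4.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via row-wise DP."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled to [0, 1] by the longer length (assumed convention)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def relevance(po_a: float, pp_x: float, pt_x: float, pr_x: float,
              k: Sequence[float] = (1.0, 1.0, 1.0, 1.0)) -> float:
    """RL(X,A) = K1*PO(A) + K2*PP(X) + K3*PT(X) + K4*PR(X); k holds K1..K4."""
    k1, k2, k3, k4 = k
    return k1 * po_a + k2 * pp_x + k3 * pt_x + k4 * pr_x


print(levenshtein("kitten", "sitting"))
print(relevance(0.5, 2.0, 1.0, 0.25, k=(1, 2, 3, 4)))
```

Two attributes would be placed in the same group when their normalized distance falls below the empirical threshold; the threshold value is not stated in the slides, so none is hard-coded here.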
Real problems of data integration: #4
- Serving user requests instantly: large SQL join queries over a large relational database are time-consuming.
- A NoSQL solution based on the Hadoop architecture is under development. It will be used for two tasks: to store, update, process, and index crawled data; and to provide fast access to data from the IntegromeDB relational database.
Web Crawler Data Mapping and Integration Large Databases IntegromeDB RDBMS User Community API Web Interface
Tables Texts Databases Web pages
noSQL database User Data
Acknowledgments
SDSC
- Michael Baitaluk
- Sergey Kozhenkov
- Amarnath Gupta
- Yulia Dubinina
Funding
- NIH R01GM084881 (to MB) & R01GM085325 (to JP)
- SDSC Gordon seed grant
- NSF XSEDE (Gordon and Data Oasis resources)