
SDSC 2013 Summer Institute
Biomedical data integration system and web search engine
Julia Ponomarenko, PhD
San Diego Supercomputer Center
jpon@sdsc.edu
2013 Summer Institute: Discover Big Data, August 5-9, San Diego, California
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Big Data: “the zillionics realm”
Data created and replicated in 2012:
2,800,000 petabytes (PB)
= 2,800,000,000 terabytes (TB)
= 2,800,000,000,000 gigabytes (GB)
= 2,800,000,000,000,000 megabytes (MB)
= 2,800,000,000,000,000,000 kilobytes (KB)
= 2,800,000,000,000,000,000,000 bytes
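The unit ladder above can be checked with a few lines of Python. This is a small sketch that assumes decimal (SI) prefixes, where each step is a factor of 1,000; the names `UNITS`, `convert`, and `PB_2012` are illustrative and not from the slides.

```python
# Decimal (SI) prefixes: each unit is 1,000 times the previous one.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB"]


def convert(value, from_unit, to_unit):
    """Convert a quantity between decimal byte units."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    if steps >= 0:
        # Going to a smaller unit: multiply (stays an exact integer).
        return value * 1000 ** steps
    # Going to a larger unit: divide (result is a float).
    return value / 1000 ** (-steps)


PB_2012 = 2_800_000  # petabytes created and replicated in 2012 (slide figure)

# Print the same ladder as the slide, from PB down to bytes.
for unit in reversed(UNITS):
    print(f"{convert(PB_2012, 'PB', unit):,} {unit}")
```

The same helper converts in the other direction, for example `convert(594, "GB", "PB")` for a database whose size is quoted in gigabytes.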

Data volumes (data from Wired, May 2013):
• E-mails sent per year: 2,986 PB
• Facebook’s content uploaded per year: 183 PB
• Google’s search index: 98 PB
• Kaiser’s medical records: 31 PB
• YouTube’s video uploaded per year: 15 PB
• Large Hadron Collider's annual data output: 15 PB
• National Climatic Data Center database: 6 PB
• Library of Congress digital collection: 5 PB
• Nasdaq stock market database: 3 PB

The number of sequences in NCBI GenBank (April 2013): 164,136,731 sequences, 594 GB (0.0006 PB)

High-throughput sequencing
• An Illumina HiSeq 2500 can sequence a human genome, 20 exomes, or 150 RNA-seq samples per day.
• The NCBI Sequence Read Archive contains 4.1 PB of sequence reads.

Gene expression data: 0.05 PB
• ArrayExpress database (at EBI): 1,104,037 assays (0.014 PB of archived data)
• NCBI GEO database: 931,577 samples

Metabolic & Signaling pathways

Protein-protein interaction & Transcriptional regulatory networks & Host-pathogen interactions

• KEGG: 245,393 pathways
• REACTOME: 25,000+ pathways, 70,000+ reactions
• PathGuide (survey of 325 pathway resources): 205+ million reactions for 28+ million genes/proteins (0.01 PB)
The image of interactions among interaction databases is from pathguide.org.

Data volumes, with molecular biology added (data from Wired, May 2013):
• E-mails sent per year: 2,986 PB
• Facebook’s content uploaded per year: 183 PB
• Google’s search index: 98 PB
• Kaiser’s medical records: 31 PB
• YouTube’s video uploaded per year: 15 PB
• Large Hadron Collider's annual data output: 15 PB
• National Climatic Data Center database: 6 PB
• Library of Congress digital collection: 5 PB
• Molecular biology data in public databases: ~4.5 PB
• Nasdaq stock market database: 3 PB

Data types: sequences, networks, variations, publications, annotations, taxonomies, epigenetic data, expression data, structures, biochemical data

Databases and web pages (2,000+)
How can a molecular biologist embrace such an amount of data in its entirety?

Data integration resources
This leaves a molecular biologist to work with partial, incomplete, non-comprehensive data sets!

Biological ontologies, the Semantic Web technologies, data warehouse

Database web pages; other web pages
For each ontological term A and page X, calculate the relevance score of X to A.

Part of the multiple alignment ontology (MAO). Thompson et al., 2005, NAR, PMID: 16043635. Grey boxes represent concepts and colored arrows represent relationships: red, is_a; blue, part_of; green, is_attribute.

Mao.obo
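An ontology distributed as an OBO flat file stores each concept as a `[Term]` stanza, with `is_a` lines and `relationship:` lines carrying typed links such as part_of. The sketch below reads such stanzas into a dictionary. It is a minimal illustration only: the three sample terms and their `EX:` identifiers are invented for the example and are not taken from mao.obo, and the parser handles just the handful of tags shown.

```python
# Invented sample stanzas in OBO flat-file style (NOT the contents of mao.obo).
SAMPLE_OBO = """\
[Term]
id: EX:0000001
name: alignment

[Term]
id: EX:0000002
name: column
relationship: part_of EX:0000001 ! alignment

[Term]
id: EX:0000003
name: conserved column
is_a: EX:0000002 ! column
"""


def parse_obo_terms(text):
    """Return {term_id: {"name": ..., "relations": [(type, target_id), ...]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.split(" ! ")[0].strip()  # drop trailing "! comment" text
        if not line:
            continue
        if line.startswith("["):
            # Only [Term] stanzas are collected; other stanza types are skipped.
            current = {"relations": []} if line == "[Term]" else None
            continue
        if current is None or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        if tag == "id":
            terms[value] = current
        elif tag == "name":
            current["name"] = value
        elif tag == "is_a":
            current["relations"].append(("is_a", value))
        elif tag == "relationship":
            rel_type, target = value.split(None, 1)
            current["relations"].append((rel_type, target))
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
for term_id, term in terms.items():
    print(term_id, term["name"], term["relations"])
```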

Database web pages; other web pages
• For each ontological term A and page X, calculate the relevance score of X to A. Calculate a rank of each page.
• Automatically extract the data and map them into the internal database schema.

User community
• Web portal & API: integromeDB.org
• Java application: BiologicalNetworks.org
IntegromeDB
• Public data on the Web
• User’s private data

integromedb.org

Integromedb.org visit statistics (7/16/2012 – 7/15/2013)

Real problems of data integration: #1
• Collecting data and maintaining data consistency: it is infeasible to collect data from thousands of data sources, either by downloading them or via web crawling. If a crawler sends a request to a website every 10 seconds from one IP address, downloading the entire GenBank (163 million sequence records) would require about 50 years.
(Diagram: Databases, Large Databases, Web pages, Web Crawler, Tables, Texts, Data Mapping and Integration)
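The 50-year estimate follows directly from the record count and the request interval. A minimal check, assuming one record per request and a 365.25-day year (both assumptions made for this sketch; the function name is illustrative):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed average year length


def crawl_years(num_records, seconds_per_request):
    """Years needed to fetch records one at a time at a fixed request interval."""
    return num_records * seconds_per_request / SECONDS_PER_YEAR


# Figures from the slide: 163 million GenBank records, one request every 10 s.
years = crawl_years(163_000_000, 10)
print(f"{years:.1f} years")
```

The result lands slightly above 50 years, which matches the rounded figure on the slide.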

Real problems of data integration: #1 (solution)
• Hybrid approach: large databases (100,000+ web pages) are downloaded as SQL, XML, or RDF models, while all other resources on the web are reached via web crawling.
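The hybrid rule can be written as a small dispatch function. This is a sketch under stated assumptions: the `Source` record, its field names, and the extra condition that a dump must actually be available are hypothetical; only the 100,000-page threshold and the SQL/XML/RDF formats come from the slide.

```python
from dataclasses import dataclass

BULK_THRESHOLD = 100_000  # pages; threshold quoted on the slide


@dataclass
class Source:
    """Hypothetical description of one data source."""
    name: str
    page_count: int
    dump_formats: tuple = ()  # e.g. ("SQL", "XML", "RDF") if bulk dumps exist


def acquisition_method(source):
    """Pick 'download' for large sources with a bulk dump, otherwise 'crawl'."""
    # Assumption: a large source without any dump still has to be crawled.
    if source.page_count >= BULK_THRESHOLD and source.dump_formats:
        return "download"
    return "crawl"


sources = [
    Source("large-db", 2_500_000, ("XML", "RDF")),
    Source("small-lab-site", 1_200),
    Source("large-no-dump", 400_000),
]
for s in sources:
    print(s.name, "->", acquisition_method(s))
```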

Real problems of data integration: #2
• Data mapping:
– No unified biological ontology (there are conflicts and inconsistencies across ontologies)
– Data are heterogeneous (data integration requires mapping various types of data onto a set of stable gene and protein IDs)

Real problems of data integration: #2 (solution) • IntegromeDB Ontology: The Open Biomedical Ontologies (OBO) consortium and the National Center for Biomedical Ontology (NCBO) provide the mapping among different ontologies. This mapping (for 120 ontologies) was used to develop IntegromeDB Ontology.

Real problems of data integration: #2 (solution)
• Each web page is integrated into the database as is, along with two calculated scores: PageRank and the Lucene score. The latter is calculated for each word from the database dictionaries (object names, IDs, synonyms) and the IntegromeDB Ontology.
• The downloaded files (SQL, XML, and RDF) are mapped to the IntegromeDB database schema by transforming them into an RDF-compatible format. The mapping includes automatic determination of node IDs, such as names and synonyms of biological entities, from a dictionary of 70 million gene aliases.
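Resolving names and synonyms to node IDs is, at its simplest, a dictionary lookup over normalized aliases. The sketch below illustrates that idea only; it is not the IntegromeDB mapping algorithm. The node IDs are made up, the alias table is a toy stand-in for the 70-million-entry dictionary, and case-insensitive matching is an assumption.

```python
def build_alias_index(records):
    """Map each normalized alias to the set of node IDs that use it."""
    index = {}
    for node_id, names in records.items():
        for name in names:
            index.setdefault(name.casefold(), set()).add(node_id)
    return index


def resolve(index, name):
    """Return the node IDs for a name, or an empty set if it is unknown."""
    return index.get(name.strip().casefold(), set())


# Toy alias table with hypothetical node IDs.
RECORDS = {
    "node:1": ["TP53", "p53", "tumor protein p53"],
    "node:2": ["BRCA1", "RNF53"],
}

index = build_alias_index(RECORDS)
print(resolve(index, " P53 "))
print(resolve(index, "rnf53"))
print(resolve(index, "not-a-gene"))
```

Returning a set rather than a single ID keeps ambiguous aliases visible, since the same synonym can belong to more than one entity.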

• The database is a PostgreSQL database modeled as a node-typed and edge-typed labeled meta-graph, where the labels are described by their own schema. Node types are objects: proteins, ligands, molecular complexes, and genes. Edge types are relationships between objects: up/down regulation, molecular transport, molecular synthesis, and enzymatic activity.
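A typed, labeled graph of this kind can be pictured with a small in-memory model. The sketch below is only an illustration of the data structure, not the PostgreSQL schema; all class, field, and type names are hypothetical, and the type vocabularies are copied from the lists on the slide.

```python
from dataclasses import dataclass, field

NODE_TYPES = {"protein", "ligand", "molecular_complex", "gene"}
EDGE_TYPES = {
    "up_regulation", "down_regulation", "molecular_transport",
    "molecular_synthesis", "enzymatic_activity",
}


@dataclass(frozen=True)
class Node:
    node_id: str
    node_type: str  # must be one of NODE_TYPES
    label: str


@dataclass(frozen=True)
class Edge:
    source: str
    target: str
    edge_type: str  # must be one of EDGE_TYPES


@dataclass
class MetaGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        if node.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node.node_type}")
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        if edge.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge.edge_type}")
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def neighbors(self, node_id, edge_type=None):
        """Targets reachable from node_id, optionally filtered by edge type."""
        return [e.target for e in self.edges
                if e.source == node_id
                and (edge_type is None or e.edge_type == edge_type)]


graph = MetaGraph()
graph.add_node(Node("g1", "gene", "TP53"))
graph.add_node(Node("p1", "protein", "p53"))
graph.add_edge(Edge("g1", "p1", "molecular_synthesis"))
print(graph.neighbors("g1"))
```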

IntegromeDB Database
• Currently, the database integrates data from a billion web pages populated from molecular biology databases listed in the NAR depository, with 100 major databases being directly downloaded and manually mapped to the IntegromeDB schema.
• That manual mapping was used to train algorithms that automatically map SQL, XML, and RDF files.

(Architecture diagram: User Community, Web Interface, API, User Data, IntegromeDB RDBMS, Data Mapping and Integration, Web Crawler, Tables, Texts, Large Databases, Web pages, Databases)

Real problems of data integration: #3
• User interface: which types of search to allow, and how to represent the search results on the web page?
• Search results for a gene or protein are organized on the web page into a dashboard of relevant attributes that are grouped by data sources and by pair-wise similarity, defined using the normalized Levenshtein distance and an empirical threshold.
• The web pages are sorted by relevance. For each ontological term A and page X integrated into the system, the relevance score of X to A is calculated as follows:
RL(X,A) = K1×PO(A) + K2×PP(X) + K3×PT(X) + K4×PR(X)
where K1-K4 are empirical coefficients; PO is the frequency of the term A in the ontology containing A; PP is the frequency of A in X (corresponds to the Lucene score); PT is the frequency of the term A in different HTML tag fields of the page X; PR is the PageRank of X.
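Both the relevance score and the similarity grouping are simple to express in code. The sketch below is a hedged illustration: the coefficient values and input numbers are placeholders (the slides call K1-K4 empirical and give no values), and dividing the edit distance by the longer string's length is one common normalization, assumed here because the slides do not specify which one is used.

```python
def relevance(po, pp, pt, pr, k=(1.0, 1.0, 1.0, 1.0)):
    """RL(X,A) = K1*PO(A) + K2*PP(X) + K3*PT(X) + K4*PR(X)."""
    k1, k2, k3, k4 = k  # placeholder coefficients, not the real empirical ones
    return k1 * po + k2 * pp + k3 * pt + k4 * pr


def levenshtein(a, b):
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def normalized_levenshtein(a, b):
    """Distance scaled to [0, 1] by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


# Rank hypothetical pages for one term: (page, PO, PP, PT, PR), made-up numbers.
pages = [("page-a", 0.5, 0.2, 0.1, 0.3), ("page-b", 0.5, 0.6, 0.4, 0.1)]
ranked = sorted(pages, key=lambda p: relevance(*p[1:]), reverse=True)
print([name for name, *_ in ranked])

# Group two attribute strings if their distance is under a chosen threshold.
THRESHOLD = 0.5  # placeholder for the empirical threshold
print(normalized_levenshtein("kitten", "sitting") < THRESHOLD)
```

Grouping by a distance threshold merges near-duplicate attribute values that differ only by small spelling variations across data sources.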

Real problems of data integration: #4
• Serving user requests instantly: large SQL join queries over a large relational database are time-consuming.
• A NoSQL solution based on the Hadoop architecture is under development. It will be used for two tasks: storing, updating, processing, and indexing crawled data; and fast access to data from the IntegromeDB relational database.
