PostgreSQL and Omics Data: How omics data can be stored in Postgres

PostgreSQL and Omics Data
How omics data can be stored in a Postgres database
PostgreSQL SF Day, Jan 2020
Anson Abraham, Data Architect at Envisagenics Inc.

Omics predicted to be the “biggest of big data” by 2025
[Chart: projected data volumes for omics, astronomy, video, and Twitter; values shown: 17 PB, 1 EB, 1-2 EB, 40 EB]
Source: Challenges For Genomics In The Age of Big Data, July 2015, Forbes

The “Big Leap” of Biology: from molecule to computerized DNA
[Figure panels: the DNA molecule; the computerized DNA sequence]

Omics data made from human biopsies is used for therapeutics development
▪ Omics data can be generated from any human tissue
▪ Tissue-specific omics is used to compare across individuals (e.g., cancer patients vs. controls)
▪ Omics data can be stored, then analyzed by different algorithms (e.g., to find mutations or gene-level changes)
▪ Biopsies and data can both be stored, but data last longer and take less space!

Sequencing technology to computerize omics data
Sequencing facility at CSHL: 2.5 terabytes of genome data produced every week!
The cost of sequencing a genome went from ~$100M in 2000 to <$1K nowadays!

Hardware for sequencers is getting smaller, but data is getting larger

Storing omics data is important for therapeutic development
▪ Data sharing through partnerships
  ▪ Pharma company A extracts value from data, then partner B extracts additional value from the same dataset
  ▪ Pharma company A makes more data, then brings old data from the archive to compare
▪ To be shared as open access to the scientific community
  ▪ “Wisdom of the crowds”: thousands of brains working to cure cancer and genetic diseases
  ▪ E.g., The Cancer Genome Atlas (TCGA), a public data repository, facilitates cures for cancer.
▪ For personalized medicine
  ▪ Use your genome for wellness improvement
  ▪ Use your genome to treat cancer, ALS, Alzheimer's, etc.

OMICS data are stored in various file formats
▪ FASTQ: text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores

FASTQ files are large and tightly governed raw data files
Old paradigm: data are uploaded to different apps for analysis
New paradigm: data have grown in size and number, and governance is tighter, so the incentive is to deploy apps to the data

FASTQ file format
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Line 2 is the sequence read, line 3 is the "+" separator, and line 4 holds the quality scores encoded as ASCII characters.
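
A minimal sketch of how the FASTQ record above could be read in Python. It assumes each record is exactly four lines and that quality scores use the common Phred+33 encoding (the slide does not state the encoding); `parse_fastq` is a hypothetical helper name, not a library function.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) for each four-line record.

    Assumption: Phred+33 encoding, i.e. score = ord(character) - 33.
    """
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality strings differ in length")
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# The record shown on the slide.
record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
read_id, seq, scores = next(parse_fastq(record))
print(read_id, len(seq), scores[:5])
```

Each base gets one integer score, so a row for a database table could hold the read ID, the sequence, and either the raw quality string or the decoded scores.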

OMICS data are stored in various file formats
▪ FASTQ: text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores
▪ SAM (Sequence Alignment/Map): alignment information for short reads mapped against reference sequences
▪ BAM (Binary Alignment/Map): binary, compressed version of the SAM file

SAM file format
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL TAG1 TAG2
readID43GYAX15:7:1:1202:19894/1,256,contig43,613960,1,65M,*,0,0,CCAGCGCGAACGAAATCCGCATGCGTCTGGTCGTTGCACGGAACGGCGGCGGTGTGATGCACGGC,EDDEEDEE=EE?DE??DDDBADEBEFFFDBEFFEBCBC=?BEEEE@=:?::?7?:8-6?7?@??#,AS:i:0,AA:3:4
CREATE TABLE sam (
  qname varchar(100)
  ,flag int
  ,rname text   -- reference names can exceed 10 characters
  ,pos int
  ,mapq int
  ,cigar text   -- CIGAR strings are often longer than 5 characters
  ,rnext text   -- '*', '=', or a reference name
  ,pnext int
  ,tlen int
  ,seq text
  ,qual text
  ,tag jsonb    -- optional TAG:TYPE:VALUE fields
);
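
A hedged sketch of how one alignment line could be turned into a row for the `sam` table above. It assumes tab-separated input, as in standard SAM files (the slide shows the fields comma-separated for display), and it reuses the values from the slide with only the `AS:i:0` tag. `sam_line_to_row` is a hypothetical helper; loading the row into Postgres is left out.

```python
import json
from typing import Any, Dict

SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]
INT_COLUMNS = {"flag", "pos", "mapq", "pnext", "tlen"}


def sam_line_to_row(line: str) -> Dict[str, Any]:
    """Map the 11 mandatory SAM fields to columns and pack tags as JSON."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    row: Dict[str, Any] = {}
    for name, value in zip(SAM_COLUMNS, fields[:11]):
        row[name] = int(value) if name in INT_COLUMNS else value
    tags: Dict[str, Any] = {}
    for tag in fields[11:]:
        key, typ, value = tag.split(":", 2)  # TAG:TYPE:VALUE
        if typ == "i":
            tags[key] = int(value)
        elif typ == "f":
            tags[key] = float(value)
        else:
            tags[key] = value
    row["tag"] = json.dumps(tags)  # text suitable for a jsonb column
    return row


line = "\t".join([
    "readID43GYAX15:7:1:1202:19894/1", "256", "contig43", "613960", "1",
    "65M", "*", "0", "0",
    "CCAGCGCGAACGAAATCCGCATGCGTCTGGTCGTTGCACGGAACGGCGGCGGTGTGATGCACGGC",
    "EDDEEDEE=EE?DE??DDDBADEBEFFFDBEFFEBCBC=?BEEEE@=:?::?7?:8-6?7?@??#",
    "AS:i:0",
])
row = sam_line_to_row(line)
print(row["qname"], row["flag"], row["pos"], row["tag"])
```

Keeping the optional tags in one JSON document matches the `tag jsonb` column, since the number and kind of tags vary from read to read.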

SAM files can be stored in DBMS
Column   Description
QNAME    Query template name
FLAG     Bitwise flag
RNAME    Reference sequence name
POS      1-based leftmost mapping position
MAPQ     Mapping quality
CIGAR    CIGAR (concise idiosyncratic gapped alignment report) string
RNEXT    Reference sequence name of the primary alignment of the next read
PNEXT    Position of the primary alignment of the next read
TLEN     Observed template length
SEQ      Segment sequence
QUAL     Phred-scaled base quality
TAG      Tag:Type:Value
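
To make the "bitwise flag" column concrete, here is a small sketch that decodes a FLAG value into the bits defined by the SAM specification. The short label strings are my own naming, not official identifiers.

```python
from typing import List

# Bit values follow the SAM specification; the labels are illustrative.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


print(decode_flag(256))  # ['secondary']
print(decode_flag(99))
```

The example read on the SAM slide has FLAG 256, which marks a secondary alignment. The same test can be done in SQL with the integer bitwise AND operator, e.g. `flag & 256 <> 0`.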

Omics data are coordinate-based
▪ Chromosome, start, and end are the coordinates for omics data on the chromosome
▪ Genome Browser is a tool that queries an RDBMS of SAM data
Figure: RNA sequences as presented in the UCSC Genome Browser, the “Google Maps” of the genome
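
A coordinate query of the kind a genome browser issues boils down to an interval-overlap test. The sketch below shows that test in Python under the assumption of 1-based, inclusive coordinates; the `Feature` type and the sample reads are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Feature(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive
    name: str


def overlapping(features: Iterable[Feature], chrom: str,
                start: int, end: int) -> List[Feature]:
    """Return features on `chrom` that share at least one base with [start, end]."""
    return [f for f in features
            if f.chrom == chrom and f.start <= end and f.end >= start]


# Hypothetical aligned reads.
features = [
    Feature("chr1", 100, 200, "read_a"),
    Feature("chr1", 150, 300, "read_b"),
    Feature("chr1", 400, 500, "read_c"),
    Feature("chr2", 100, 200, "read_d"),
]
hits = overlapping(features, "chr1", 180, 250)
print([f.name for f in hits])  # ['read_a', 'read_b']
```

In PostgreSQL the same predicate can be written as `start <= :end AND "end" >= :start`, or with a range type such as `int4range` and the `&&` overlap operator.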

OMICS data are stored in various file formats
▪ FASTQ: text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores
▪ SAM (Sequence Alignment/Map): alignment information for short reads mapped against reference sequences
▪ BAM (Binary Alignment/Map): binary, compressed version of the SAM file
▪ VCF (Variant Call Format): contains information about variants at positions in the genome

VCF file format
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
CREATE TABLE vcf (
  chrom text    -- chromosome names include X, Y, and MT, so not int
  ,pos int
  ,var_id varchar(25)
  ,ref text     -- indel alleles can exceed 10 bases
  ,alt text
  ,qual real    -- QUAL may be fractional
  ,filter varchar(10)
  ,info text    -- INFO strings are often longer than 50 characters
  ,format varchar(25)
  ,sample jsonb
);

VCF file stored in database
Column   Description
CHROM    The chromosome
POS      The 1-based position of the variation on the given sequence
VAR_ID   The variation identifier
REF      The reference base at the given position on the reference sequence
ALT      The alternate alleles for this position
QUAL     A quality score for the inference of the given alleles
FILTER   A flag indicating which of a given set of filters the variation has passed
INFO     List of key-value pairs (fields) describing the variation
FORMAT   List of fields for describing the samples
SAMPLE   Samples described in the file
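
As an illustration of how the columns above map onto a data line, the following sketch splits the example VCF record into a row with the per-sample fields packed as JSON. It assumes tab-separated input and that sample names come from the `#CHROM` header line; `parse_info` and `vcf_line_to_row` are hypothetical helpers, not part of any VCF library.

```python
import json
from typing import Any, Dict, List


def parse_info(info: str) -> Dict[str, Any]:
    """Split INFO into key-value pairs; keys without '=' become True flags."""
    out: Dict[str, Any] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def vcf_line_to_row(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Map the 9 fixed VCF fields to columns and pack the samples as JSON."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, flt, info, fmt = fields[:9]
    keys = fmt.split(":")
    samples = {name: dict(zip(keys, value.split(":")))
               for name, value in zip(sample_names, fields[9:])}
    return {
        "chrom": chrom, "pos": int(pos), "var_id": var_id,
        "ref": ref, "alt": alt,
        "qual": None if qual == "." else float(qual),  # "." means missing
        "filter": flt, "info": info, "format": fmt,
        "sample": json.dumps(samples),  # text suitable for a jsonb column
    }


line = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
row = vcf_line_to_row(line, ["NA00001", "NA00002", "NA00003"])
print(row["chrom"], row["pos"], parse_info(row["info"]))
```

Storing the samples as one JSON object keyed by sample name keeps the table width fixed no matter how many samples a file contains.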

Compact & informative VCF files are ready to use for research
▪ Open-access database with thousands of VCF datasets from cancer patients
▪ COSMIC
Applications:
▪ Study cancer inheritance
▪ Study cancer progression
▪ Develop biomarkers
▪ Develop therapeutic compounds
▪ Up and coming: CRISPR as a therapeutic to correct DNA mutations

Example workflow to convert from FASTQ to VCF

RDBMSs are useful for therapeutic AI applications
[Figure labels: Current Clinical Trial; Data Modeling; Omics-based predictive features; New Patients Recruited; New Clinical Trial; Low response rate; High response rate; Responder to cancer treatment; Non-responders]

OMICS data file formats and PG…
▪ You can create Foreign Data Wrappers (FDWs). New formats always arise, and some may be unstructured
  ▪ FDW to read VCF files directly from Postgres
    ▪ https://github.com/smithijk/vcf_fdw_postgresql
  ▪ There is no Foreign Data Wrapper for FASTQ files.
    ▪ Should there be one?
▪ PostBIS: Michael Schneider
▪ TileDB.IO and Snowflake are examples of systems that can directly query VCF files stored in S3, at scale.

Questions or Thoughts?
Anson Abraham, Sr. Data/Cloud Architect at Envisagenics, Inc.
anson.abraham@gmail.com
@therealansonism
