PostgreSQL and Omics Data
How omics data can be stored in a PostgreSQL database
PostgreSQL SF Day
Jan 2020 Anson Abraham Data Architect at Envisagenics Inc.
Omics: predicted to be the biggest of big data by 2025
[Figure: data volumes compared for Omics, Astronomy, Video, and Twitter; values shown: 40 EB, 1-2 EB, 1 EB, 17 PB]
Source: Challenges For Genomics In The Age of Big Data, July 2015, Forbes
[Figure: the DNA molecule and its computerized DNA sequence]
▪ Omics data can be generated from any human tissue
▪ Tissue-specific omics is used to compare across individuals (e.g., cancer patients vs. controls)
▪ Omics data can be stored, then analyzed by different algorithms (e.g., to find mutations, to find gene-level changes)
▪ Biopsies and data can both be stored, but data last longer and take less space!
Sequencing facility at CSHL - 2.5 terabytes of genome data produced every week!
The cost of sequencing a genome went from ~$100M in 2000 to <$1K nowadays!
▪ Data sharing through partnerships
▪ Pharma company A extracts value from data, then partner B extracts additional value from the same dataset
▪ Pharma company A generates more data and brings old data back from the archive to compare
▪ T
▪ “Wisdom of the crowds”: thousands of brains working to cure cancer, and genetic diseases
▪ E.g. The Cancer Genome Atlas (TCGA), a public data repository, facilitates cures for cancer.
▪ For personalized medicine
▪ Use your genome for wellness improvement
▪ Use your genome to treat cancer, ALS, Alzheimer's, etc.
▪ FASTQ: a text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.
[Figure: FASTQ data and apps under each paradigm]
Old paradigm: data uploaded to different apps for analysis
New paradigm: data have grown in size & number, governance is deployed to data
Example FASTQ record (identifier, sequence read, separator, ASCII-encoded quality):
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
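To make the pairing of sequence and quality concrete, here is a minimal Python sketch that splits one four-line FASTQ record and decodes its quality string. The function name `parse_fastq_record` is hypothetical, and the decoding assumes the common Phred+33 ASCII offset, which the slide does not state.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read_id, sequence, scores)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so '!' (ASCII 33) means quality 0.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
read_id, sequence, scores = parse_fastq_record(record)
```

Each base ends up with one integer score, so the sequence and its scores could be stored side by side, for example as a text column and an integer array.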
▪ SAM (Sequence Alignment/Map) file: alignment information for short reads mapped against reference sequences
▪ BAM (Binary Alignment/Map): binary, compressed version of the SAM file
CREATE TABLE sam (
     qname varchar(100)
    ,flag int
    ,rname varchar(10)
    ,pos int
    ,mapq int
    ,cigar varchar(5)
    ,rnext varchar(1)
    ,pnext int
    ,tlen int
    ,seq text
    ,qual text
    ,tag jsonb
);

Example SAM record (fields shown comma-separated):
QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL TAG1 TAG2
readID43GYAX15:7:1:1202:19894/1,256,contig43,613960,1,65M,*,0,0,CCAGCGCGAACGAAATCCGCATGCGTCTGGTCGTTGCACGGAACGGCGGCGGTGTGATGCACGGC,EDDEEDEE=EE?DE??DDDBADEBEFFFDBEFFEBCBC=?BEEEE@=:?::?7?:8-6?7?@??#,AS:i:0,AA:3:4
Columns description
QNAME   Query template name
FLAG    Bitwise flag
RNAME   Reference sequence name
POS     1-based leftmost mapping position
MAPQ    Mapping quality
CIGAR   CIGAR (Concise Idiosyncratic Gapped Alignment Report) string
RNEXT   Reference sequence name of the primary alignment of the next read
PNEXT   Position of the primary alignment of the next read
TLEN    Observed template length
SEQ     Segment sequence
QUAL    Phred-scaled base quality
TAG     Optional fields in Tag:Type:Value form
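As a rough illustration of how one alignment line maps onto the `sam` table columns, the Python sketch below parses a record into a dictionary, with the optional tags serialized as JSON text for the `tag` column. The helper name `parse_sam_line` is hypothetical; the sketch assumes tab-delimited input, as in real SAM files, and uses only the `AS:i:0` tag from the slide's example.

```python
import json

SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]
INT_COLUMNS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_sam_line(line):
    """Turn one tab-delimited SAM alignment line into a dict keyed by column."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError("SAM line has fewer than 11 mandatory fields")
    row = {}
    for name, value in zip(SAM_COLUMNS, fields):
        row[name] = int(value) if name in INT_COLUMNS else value
    # Optional fields look like TAG:TYPE:VALUE; type 'i' marks an integer.
    tags = {}
    for field in fields[len(SAM_COLUMNS):]:
        tag, tag_type, value = field.split(":", 2)
        tags[tag] = int(value) if tag_type == "i" else value
    row["tag"] = json.dumps(tags)  # JSON text destined for the jsonb column
    return row


example_fields = [
    "readID43GYAX15:7:1:1202:19894/1", "256", "contig43", "613960", "1",
    "65M", "*", "0", "0",
    "CCAGCGCGAACGAAATCCGCATGCGTCTGGTCGTTGCACGGAACGGCGGCGGTGTGATGCACGGC",
    "EDDEEDEE=EE?DE??DDDBADEBEFFFDBEFFEBCBC=?BEEEE@=:?::?7?:8-6?7?@??#",
    "AS:i:0",
]
sam_row = parse_sam_line("\t".join(example_fields))

# FLAG is a bit field: bit 0x100 marks a secondary (not primary) alignment.
is_secondary = bool(sam_row["flag"] & 0x100)
```

The resulting dictionary could be passed to a parameterized INSERT by whichever PostgreSQL driver is in use; that step is left out here.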
▪ Chromosome, start, and end are the coordinates of omics data on the chromosome
▪ A genome browser is a tool that queries an RDBMS of SAM data
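A SAM row carries only a start position (POS), so an end coordinate has to be derived before start/end range queries are possible. The Python sketch below computes it from POS and the CIGAR string, counting only the operations that advance along the reference (M, D, N, =, X). The function names `alignment_end` and `overlaps` are hypothetical helpers, not part of any tool named in the slides.

```python
import re

# CIGAR operations that consume reference bases.
REFERENCE_CONSUMING = set("MDN=X")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def alignment_end(pos, cigar):
    """Return the 1-based inclusive end coordinate of an alignment."""
    if cigar == "*":
        raise ValueError("CIGAR unavailable")
    ops = CIGAR_OP.findall(cigar)
    # Reject strings that contain anything besides valid length/op pairs.
    if "".join(length + op for length, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    ref_len = sum(int(length) for length, op in ops if op in REFERENCE_CONSUMING)
    return pos + ref_len - 1


def overlaps(start, end, query_start, query_end):
    """True when two inclusive intervals share at least one position."""
    return start <= query_end and end >= query_start


end = alignment_end(613960, "65M")
hit = overlaps(613960, end, 614000, 615000)
```

A precomputed end value like this could be stored next to `pos` so that a browser-style region lookup becomes a simple comparison on two columns.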
RNA sequences as presented in the UCSC Genome Browser, the “Google Maps” of the genome
▪ VCF (Variant Call Format): contains information about variants at a position in the genome
CREATE TABLE vcf (
     chrom int
    ,pos int
    ,var_id varchar(25)
    ,ref varchar(10)
    ,alt varchar(10)
    ,qual int
    ,filter varchar(10)
    ,info varchar(50)
    ,format varchar(25)
    ,sample jsonb
);

Example VCF header and record:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
Columns description
CHROM    The chromosome
POS      The 1-based position of the variation on the given sequence
VAR_ID   The variation identifier
REF      The reference base at the given position on the reference sequence
ALT      The alternate alleles for this position
QUAL     A quality score for the inference of the given alleles
FILTER   A flag indicating which of a given set of filters the variation has passed
INFO     List of key-value pairs (fields) describing the variation
FORMAT   List of fields for describing the samples
SAMPLEs  The samples described in the file
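The per-sample columns of a VCF line vary in number from file to file, which is presumably why the `vcf` table keeps them in a single `sample` jsonb column. The Python sketch below shows one possible way to fold them into JSON keyed by sample name, using the FORMAT field to label each value. The helper name `parse_vcf_line` is hypothetical, and the integer conversions follow the slide's column types; chromosomes such as X or Y, or a missing QUAL of ".", would need text columns instead.

```python
import json

FIXED_COLUMNS = ["chrom", "pos", "var_id", "ref", "alt",
                 "qual", "filter", "info", "format"]


def parse_vcf_line(header_line, data_line):
    """Map one tab-delimited VCF record onto the vcf table's columns."""
    header = header_line.lstrip("#").rstrip("\n").split("\t")
    fields = data_line.rstrip("\n").split("\t")
    if len(fields) != len(header):
        raise ValueError("record and header have different field counts")
    row = dict(zip(FIXED_COLUMNS, fields[:9]))
    # Assumption: int columns as declared on the slide.
    for name in ("chrom", "pos", "qual"):
        row[name] = int(row[name])
    # FORMAT names the colon-separated values in every sample column.
    keys = row["format"].split(":")
    samples = {
        name: dict(zip(keys, value.split(":")))
        for name, value in zip(header[9:], fields[9:])
    }
    row["sample"] = json.dumps(samples)  # JSON text for the jsonb column
    return row


header_line = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                         "INFO", "FORMAT", "NA00001", "NA00002", "NA00003"])
data_line = "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS",
                       "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
                       "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."])
vcf_row = parse_vcf_line(header_line, data_line)
```

Keying the JSON by sample name means a genotype lookup for one individual does not depend on the column order of the original file.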
[Figure: Current Clinical Trial (low response rate) → Responders and non-responders to cancer treatment → Omics-based predictive features → Data Modeling → New Patients Recruited → New Clinical Trial (high response rate)]