PostgreSQL and Omics Data: How omics data can be stored in Postgres



slide-1
SLIDE 1

PostgreSQL and Omics Data

How omics data can be stored in postgres database

PostgreSQL SF Day

Jan 2020
Anson Abraham
Data Architect at Envisagenics Inc.

slide-2
SLIDE 2

Omics predicted to be the “biggest of big data” by 2025

Projected data volume by domain (figure):
▪ Omics: 40 EB
▪ Video: 1-2 EB
▪ Astronomy: 1 EB
▪ Twitter: 17 PB

Source: Challenges For Genomics In The Age of Big Data, Forbes, July 2015

slide-3
SLIDE 3

The “Big Leap” of Biology: from molecule to computerized DNA

Figure labels: the DNA molecule; the computerized DNA sequence

slide-4
SLIDE 4

Omics data made from human biopsies is used for therapeutics development

▪ Omics data can be generated from any human tissue
▪ Tissue-specific omics is used to compare across individuals (e.g., cancer patients vs. controls)
▪ Omics data can be stored, then analyzed by different algorithms (e.g., to find mutations or gene-level changes)
▪ Biopsies and data can be stored; data last longer and take less space!

slide-5
SLIDE 5

Sequencing technology to computerize omics data

Sequencing facility at CSHL: 2.5 terabytes of genome data produced every week!

The cost of sequencing a genome went from ~$100M in 2000 to <$1K nowadays!

slide-6
SLIDE 6

Hardware for sequencers is getting smaller but data is getting larger

slide-7
SLIDE 7

Storing omics data is important for therapeutic development

▪ Data sharing through partnerships
  ▪ Pharma company A extracts value from data, then partner B extracts additional value from the same dataset
  ▪ Pharma company A makes more data, then brings old data from the archive to compare

▪ To be shared as open access to the scientific community
  ▪ “Wisdom of the crowds”: thousands of brains working to cure cancer and genetic diseases
  ▪ E.g., The Cancer Genome Atlas (TCGA), a public data repository, facilitates cures for cancer.

▪ For personalized medicine
  ▪ Use your genome for wellness improvement
  ▪ Use your genome to treat cancer, ALS, Alzheimer's, etc.

slide-8
SLIDE 8

OMICS data are stored in various file formats

▪ FastQ: text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.

slide-9
SLIDE 9

FASTQ files are large and tightly governed raw data files

Old paradigm: data uploaded to different apps for analysis (one FASTQ sent out to many apps).

New paradigm: data have grown in size and number, and governance is tighter. There are incentives to have apps deployed to the data (many apps brought to the FASTQ).

slide-10
SLIDE 10

FASTQ file format

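A FASTQ record has four lines: the sequence ID, the sequence read, a "+" separator, and the quality scores encoded as ASCII characters.

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65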
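The four-line layout can be read with a few lines of Python. This is a minimal sketch, not something shown in the talk: it assumes strict four-line records (no wrapped sequences) and Phred+33 quality encoding, and the names `FastqRecord` and `parse_fastq` are illustrative.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class FastqRecord:
    seq_id: str
    sequence: str
    quality: str

    def phred_scores(self, offset: int = 33) -> List[int]:
        # Assumption: Phred+33 encoding, where '!' (ASCII 33) means score 0.
        return [ord(ch) - offset for ch in self.quality]


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one record per group of four non-empty lines."""
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input must contain a multiple of 4 lines")
    for i in range(0, len(cleaned), 4):
        header, seq, sep, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


# The record from the slide.
EXAMPLE = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
record = next(parse_fastq(EXAMPLE))
print(record.seq_id, record.phred_scores()[:5])  # SEQ_ID [0, 6, 6, 9, 7]
```

Each quality character maps to one base of the read, which is why the parser rejects records whose sequence and quality lines differ in length.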

slide-11
SLIDE 11

OMICS data are stored in various file formats

▪ FastQ: text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.
▪ SAM (Sequence Alignment/Map) file: alignment information of short reads mapped against reference sequences
▪ BAM (Binary Alignment/Map): binary, compressed version of the SAM file

slide-12
SLIDE 12

SAM file format

CREATE TABLE sam (
     qname varchar(100)
    ,flag  int
    ,rname text   -- reference names can exceed 10 characters
    ,pos   int
    ,mapq  int
    ,cigar text   -- CIGAR strings are often longer than 5 characters
    ,rnext text   -- '*', '=', or a reference sequence name
    ,pnext int
    ,tlen  int
    ,seq   text
    ,qual  text
    ,tag   jsonb  -- optional TAG:TYPE:VALUE fields
);

Columns: QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL TAG1 TAG2

Example record (comma-separated for loading):

readID43GYAX15:7:1:1202:19894/1,256,contig43,613960,1,65M,*,0,0,CCAGCGCGAACGAAATCCGCATGCGTCTGGTCGTTGCACGGAACGGCGGCGGTGTGATGCACGGC,EDDEEDEE=EE?DE??DDDBADEBEFFFDBEFFEBCBC=?BEEEE@=:?::?7?:8-6?7?@??#,AS:i:0,AA:3:4
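One way to turn an alignment line into a row for a table like `sam` is sketched below. It is an illustration under stated assumptions, not the speaker's loader: it assumes standard tab-separated SAM input (the slide shows a comma-separated variant), folds all optional fields into one JSON document for the `tag` column, and uses the hypothetical names `parse_tags` and `sam_line_to_row` and a made-up read `read1`.

```python
import json
from typing import Dict, List

# The 11 mandatory SAM fields, in file order, named like the table columns.
SAM_COLUMNS = ["qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual"]
INT_COLUMNS = {"flag", "pos", "mapq", "pnext", "tlen"}


def parse_tags(fields: List[str]) -> Dict[str, object]:
    """Convert TAG:TYPE:VALUE strings into a dict keyed by tag name."""
    tags: Dict[str, object] = {}
    for field in fields:
        name, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[name] = int(value)
        elif type_code == "f":
            tags[name] = float(value)
        else:
            tags[name] = value  # keep other types as plain strings
    return tags


def sam_line_to_row(line: str, sep: str = "\t") -> Dict[str, object]:
    """Map one alignment line onto the columns of the sam table."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) < len(SAM_COLUMNS):
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    row: Dict[str, object] = {}
    for column, value in zip(SAM_COLUMNS, fields[:11]):
        row[column] = int(value) if column in INT_COLUMNS else value
    row["tag"] = json.dumps(parse_tags(fields[11:]))  # text for a jsonb column
    return row


# Hypothetical, shortened alignment line for illustration.
line = "read1\t256\tcontig43\t613960\t1\t5M\t*\t0\t0\tCCAGC\tEDDEE\tAS:i:0"
row = sam_line_to_row(line)
print(row["rname"], row["pos"], row["tag"])
```

Splitting on commas is only safe when the quality string contains none, since "," is a valid quality character; tab-separated input avoids that ambiguity.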

slide-13
SLIDE 13

SAM files can be stored in DBMS

Columns description:

QNAME   Query template name
FLAG    Bitwise flag
RNAME   Reference sequence name
POS     1-based leftmost mapping position
MAPQ    Mapping quality
CIGAR   CIGAR (Concise Idiosyncratic Gapped Alignment Report) string
RNEXT   Reference sequence name of the primary alignment of the next read
PNEXT   Position of the primary alignment of the next read
TLEN    Observed template length
SEQ     Segment sequence
QUAL    Phred-scaled base quality
TAG     Optional fields in TAG:TYPE:VALUE form
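Because FLAG is a bitwise field, a single integer encodes several yes/no properties of an alignment. The sketch below decodes it using the bit values defined in the SAM specification; the names `SamFlag` and `decode_flag` are illustrative, not from the talk.

```python
from enum import IntFlag
from typing import List


class SamFlag(IntFlag):
    PAIRED = 0x1           # template has multiple segments
    PROPER_PAIR = 0x2      # each segment properly aligned
    UNMAPPED = 0x4         # segment unmapped
    MATE_UNMAPPED = 0x8    # next segment unmapped
    REVERSE = 0x10         # SEQ is reverse complemented
    MATE_REVERSE = 0x20    # SEQ of next segment is reverse complemented
    FIRST_IN_PAIR = 0x40   # first segment in the template
    SECOND_IN_PAIR = 0x80  # last segment in the template
    SECONDARY = 0x100      # secondary alignment
    QC_FAIL = 0x200        # failed quality checks
    DUPLICATE = 0x400      # PCR or optical duplicate
    SUPPLEMENTARY = 0x800  # supplementary alignment


def decode_flag(flag: int) -> List[str]:
    """Return the names of all bits set in a SAM FLAG value."""
    return [member.name for member in SamFlag if flag & member]


print(decode_flag(256))  # ['SECONDARY']
print(decode_flag(99))   # ['PAIRED', 'PROPER_PAIR', 'MATE_REVERSE', 'FIRST_IN_PAIR']
```

A FLAG of 256, for instance, has only the 0x100 bit set, which marks a secondary alignment. In SQL the same test is a bitwise AND on the `flag` column, e.g. `flag & 256 <> 0`.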

slide-14
SLIDE 14

Omics data have coordinate data

▪ Chromosome, start, and end are the coordinates for omics data on the chromosome
▪ A genome browser is a tool that queries an RDBMS of SAM data

Figure: RNA sequences as presented in the UCSC Genome Browser, the “Google Maps” of the genome
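A SAM record stores only the start position, so the end coordinate has to be derived from POS and CIGAR before a region query can be answered. A minimal sketch, assuming a known CIGAR string (not "*") and 1-based inclusive coordinates; `Interval`, `alignment_interval`, and `overlaps` are illustrative names.

```python
import re
from typing import NamedTuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that advance along the reference sequence (per the SAM spec).
REF_CONSUMING = set("MDN=X")


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def alignment_interval(rname: str, pos: int, cigar: str) -> Interval:
    """Compute the reference span covered by one alignment."""
    ref_len = sum(int(length)
                  for length, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    return Interval(rname, pos, pos + ref_len - 1)


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


# The alignment from the SAM slide: 65 matched bases starting at 613960.
read = alignment_interval("contig43", 613960, "65M")
print(read)  # Interval(chrom='contig43', start=613960, end=614024)
print(overlaps(read, Interval("contig43", 614000, 614100)))  # True
```

Storing the derived end next to POS lets a browser-style query filter on chromosome, start, and end directly.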

slide-15
SLIDE 15

OMICS data are stored in various file formats

▪ FastQ: text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.
▪ SAM (Sequence Alignment/Map) file: alignment information of short reads mapped against reference sequences
▪ BAM (Binary Alignment/Map): binary, compressed version of the SAM file
▪ VCF (Variant Call Format): contains information about variation at a position in the genome

slide-16
SLIDE 16

VCF file format

CREATE TABLE vcf (
     chrom  varchar(25)  -- chromosome names such as X, Y, and MT are not integers
    ,pos    int
    ,var_id varchar(25)
    ,ref    varchar(10)
    ,alt    varchar(10)
    ,qual   numeric      -- QUAL may be fractional or missing
    ,filter varchar(10)
    ,info   text         -- INFO often exceeds 50 characters
    ,format varchar(25)
    ,sample jsonb
);

Example record:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
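The per-sample columns vary from file to file, which is what makes a JSON column a natural fit for them. The sketch below maps one VCF data line onto the columns of a table like `vcf`. It is illustrative only: it assumes tab-separated input that includes a FORMAT column, and `vcf_line_to_row` is a hypothetical helper.

```python
import json
from typing import Dict, Optional

FIXED_COLUMNS = 9  # CHROM through FORMAT precede the per-sample columns


def vcf_line_to_row(header: str, line: str) -> Dict[str, object]:
    """Build one table row from the #CHROM header line and a data line."""
    names = header.rstrip("\n").lstrip("#").split("\t")
    values = line.rstrip("\n").split("\t")
    if len(names) != len(values):
        raise ValueError("header and data line have different column counts")
    fixed = dict(zip(names[:FIXED_COLUMNS], values[:FIXED_COLUMNS]))
    format_keys = fixed["FORMAT"].split(":")
    # One JSON object per sample, keyed by the FORMAT field names.
    samples = {
        sample_name: dict(zip(format_keys, sample_value.split(":")))
        for sample_name, sample_value
        in zip(names[FIXED_COLUMNS:], values[FIXED_COLUMNS:])
    }
    qual: Optional[float] = None if fixed["QUAL"] == "." else float(fixed["QUAL"])
    return {
        "chrom": fixed["CHROM"],
        "pos": int(fixed["POS"]),
        "var_id": fixed["ID"],
        "ref": fixed["REF"],
        "alt": fixed["ALT"],
        "qual": qual,
        "filter": fixed["FILTER"],
        "info": fixed["INFO"],
        "format": fixed["FORMAT"],
        "sample": json.dumps(samples),  # text for a jsonb column
    }


# The record from the slide, rebuilt with tab separators.
header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
                    "INFO", "FORMAT", "NA00001", "NA00002", "NA00003"])
line = "\t".join(["20", "14370", "rs6054257", "G", "A", "29", "PASS",
                  "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
                  "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,."])
row = vcf_line_to_row(header, line)
print(row["chrom"], row["pos"], json.loads(row["sample"])["NA00002"]["GT"])
```

Keying the JSON by sample name keeps a single row per variant regardless of how many samples a file contains.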

slide-17
SLIDE 17

VCF file stored in database

Columns description:

CHROM     The chromosome
POS       The 1-based position of the variation on the given sequence
VAR_ID    The variation identifier
REF       The reference base at the given position on the reference sequence
ALT       The alternate alleles for this position
QUAL      A quality score for the inference of the given alleles
FILTER    A flag indicating which of a given set of filters the variation has passed
INFO      List of key-value pairs (fields) describing the variation
FORMAT    List of fields for describing the samples
SAMPLE    The samples described in the file
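The INFO column packs its key-value pairs into one semicolon-separated string, and keys without a value act as flags. A minimal sketch of unpacking it follows; `parse_info` is an illustrative name, and values are left as strings because their types are declared in the VCF header, which this sketch does not read.

```python
from typing import Dict, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split a VCF INFO string into a dict; bare keys become True flags."""
    result: Dict[str, Union[str, bool]] = {}
    if info in ("", "."):  # "." marks a missing INFO field
        return result
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        result[key] = value if sep else True
    return result


fields = parse_info("NS=3;DP=14;AF=0.5;DB;H2")
print(fields)
# {'NS': '3', 'DP': '14', 'AF': '0.5', 'DB': True, 'H2': True}
```

Stored this way, INFO could also live in a JSON column and be filtered by key rather than by substring matching.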

slide-18
SLIDE 18

Compact & informative VCF files are ready to use for research

▪ Open-access database with thousands of VCF datasets from cancer patients
▪ COSMIC

Applications:

▪ Study cancer inheritance
▪ Study cancer progression
▪ Develop biomarkers
▪ Develop therapeutic compounds
▪ Up and coming: CRISPR as a therapeutic to correct DNA mutations

slide-19
SLIDE 19

OMICS data are stored in various file formats

▪ FastQ: text-based format for storing both a biological sequence (usually a nucleotide sequence) and its corresponding quality scores.
▪ SAM (Sequence Alignment/Map) file: alignment information of short reads mapped against reference sequences
▪ BAM (Binary Alignment/Map): binary, compressed version of the SAM file
▪ VCF (Variant Call Format): contains information about variation at a position in the genome

slide-20
SLIDE 20

Example workflow to convert from FASTQ to VCF

slide-21
SLIDE 21

RDBMSs are useful for therapeutic AI applications

Workflow (diagram):

▪ Current clinical trial: low response rate (responders vs. non-responders to cancer treatment)
▪ Data modeling with omics-based predictive features
▪ New patients recruited
▪ New clinical trial: high response rate

slide-22
SLIDE 22

OMICS data file formats and PG…

▪ You can create Foreign Data Wrappers (FDWs). New formats always arise, and some may be unstructured.
▪ FDW to read VCF files directly from Postgres
  ▪ https://github.com/smithijk/vcf_fdw_postgresql
▪ There is no Foreign Data Wrapper for FastQ files.
  ▪ Should there be one?
▪ PostBIS: Michael Schneider
▪ TileDB.IO and Snowflake are examples of systems that can query VCF files stored in S3 directly and at scale.

slide-23
SLIDE 23
slide-24
SLIDE 24

Questions or Thoughts?

Anson Abraham
Sr. Data/Cloud Architect at Envisagenics, Inc.
anson.abraham@gmail.com
@therealansonism