Biomedical Data I
Kelly Ruggles, PhD Methods in Quantitative Biology
Mass Spectrometry, Next Generation Sequencing, Imaging, Clinical
Molecular Data
Mutation calls, Copy Number, Gene Expression, DNA methylation/Epigenetics, MicroRNA, Metabolomics, Phenotype Data, Proteomics, Phosphoproteomics
storage
Ellis et al., Cancer Discovery 2013
Each nucleotide is attached to a removable fluorescent molecule and a chain-terminating chemical adduct (instead of a 3’ OH group)
Complementary nucleotide covalently incorporates and a picture is taken
Enzymatically wash away label and 3’-OH blocking group
Construct DNA library using PCR amplification of all DNA fragments in the
attached to a solid support. Each cluster contains about 1000 identical copies of a small piece of the genome.
Sanger, NGS
Illumina and NGS
Since 2005 the data output of NGS has more than doubled each year, and the $1000 genome makes genomics-integrated personalized medicine a real possibility
sequence library
simultaneously during a single sequencing run
fragment during library preparation so they can be identified during data analysis
Illumina HiSeq 4000, Illumina HiSeq 2500, Oxford Nanopore MinION, Illumina MiSeq
sequencing
sequencing (ChIP-Seq)
HiSeq 2500
sequencing
chromatin with high-throughput seq (ATAC-Seq)
HiSeq but fastest run times and longest Illumina read lengths
and extremely long reads but very high error rates
the fragment of any length
time
https://nanoporetech.com/applications/dna-nanopore-sequencing
tumor types
miRNA, mutation calls, etc.
IGSR: The International Genome Sample Resource
+ samples globally
functional elements in the human genome
ATAC-seq, methylation, etc.
LINCS: Library of Integrated Cellular Signatures
16,000+ genetic and environmental stressors
phosphorylation...
The Cancer Genome Atlas
Encyclopedia of DNA Elements
ChIP-Seq
recovered DNA is sequenced
proteins
DNase-Seq/FAIRE-Seq
(open chromatin = active genes)
Hi-C/5C
(promoter/enhancer regions)
Bisulfite Sequencing (WGBS, RRBS)
level
Tumor Sample
Single Nucleotide Polymorphisms (SNPs)
progression
Copy Number Variation (CNV)
deletion of large regions of DNA
Next Generation Sequencing workflow: Genomic DNA Isolation, Library Preparation, Load on Flow Cell, Sequence, Alignment
SNP T C
people with the disease and compared to those who do not have the disease.
for the increased risk (based on DNA linkage)
found in all human populations which manifest a given disease.
sequence variations across the genome to identify genetic risk factors for common diseases
genetic associations with drug metabolism
person (1+ million)
evaluated and chi-squared test used to identify variants associated with the trait
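The chi-squared test mentioned above can be sketched for a single SNP as a 2x2 table of allele counts (cases vs controls, alternate vs reference allele). This is a minimal illustration, not the pipeline used in any particular study: the function name and the counts are made up, and a real GWAS would also correct for testing a million or more variants.

```python
import math

def chi_squared_2x2(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)
    for a 2x2 table of allele counts."""
    observed = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in observed]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical allele counts for one SNP (not from any study).
stat, p = chi_squared_2x2(case_alt=300, case_ref=700,
                          control_alt=200, control_ref=800)
print(f"chi2 = {stat:.2f}, p = {p:.2e}")
```

Repeating this per variant and ranking by p-value gives the list of candidate associations.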
sequences are laid across the chip surface
marker is attached
relative amount bound
502,627 SNPs in over 1000 AD cases/controls
rs4420638 14 kb distal to APOE) as having association with late
Coon K et al. (2007) J Clin Psychiatry 68(4):613-8
VCF File Format # Meta-information lines Columns:
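As a rough sketch of how the layout above is read in practice: lines starting with "##" are meta-information, a single "#CHROM" line names the columns, and each remaining tab-separated line is one variant. The eight fixed column names below follow the VCF specification; the parser itself and the example record are illustrative assumptions, and real files are better handled with a dedicated library.

```python
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def parse_vcf(lines):
    """Split VCF text lines into meta-information lines and variant records."""
    meta, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line)            # meta-information line
        elif line.startswith("#"):
            continue                     # column header line
        else:
            fields = line.split("\t")
            record = dict(zip(VCF_FIXED_COLUMNS, fields[:8]))
            record["POS"] = int(record["POS"])
            record["ALT"] = record["ALT"].split(",")  # may list several alleles
            records.append(record)
    return meta, records

# Made-up example content, for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tT\tC\t50\tPASS\tDP=30",
]
meta, records = parse_vcf(example)
print(records[0]["CHROM"], records[0]["POS"], records[0]["REF"], records[0]["ALT"])
```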
dbSNP: full collection of all SNPs identified
IC: Database of somatic mutations in human cancer
nSNP: allows you to upload your SNP data (and make it publicly available!). Can be downloaded by researchers.
IGSR: Started as the 1000 Genomes Project, now contains data from over 3K individuals from around the world
exome: SNP database from lung, heart and blood disorder patients
reference
recalibration
Pipelines:
Ellrott et al., 2018
Gene Expression
Alternative Splicing
cancer
driver
Next Generation Sequencing workflow: Tumor Sample, RNA Isolation, Library Preparation, Load on Flow Cell, Sequence, Alignment
the expression of thousands of genes at a time
compare patterns of gene expression in different tissues, different times, different conditions.
1. Isolate mRNA.
2. Make cDNA by reverse transcription, using fluorescently labeled nucleotides.
3. Apply the cDNA mixture to a microarray, a different gene in each spot. The cDNA hybridizes with any complementary DNA on the microarray.
4. Rinse off excess cDNA; scan microarray for fluorescence. Each fluorescent spot represents a gene expressed in the tissue sample.
[Figure labels: tissue sample; mRNA molecules; labeled cDNA molecules (single strands); DNA fragments representing specific genes; DNA microarray with 2,400 human genes]
at a specific point in time
fusions, SNPs/mutations
RNAs are converted into a cDNA fragment library
Sequence adapters (blue) are added to cDNA fragments
Short sequence reads from each cDNA are obtained
Reads are aligned to the reference sequence and classified as exonic reads, junction reads or poly(A) end-reads
Used to generate a base-resolution expression profile for each gene
Wang et al, 2009
Paired-end short reads Alignment to genome De Novo Assembly
Exon 1 Exon 2 Transcript X Reference genome
RNA-Seq Data Analysis
NGS Category              Application               Recommended coverage (x) or reads (millions)
Whole Genome Sequencing   SNV detection             10-33x
                          CNV                       1-8x
Whole Exome Sequencing    SNV detection             100x
RNA-Seq                   Differential Expression   10-25 Million
                          Alternative splicing      50-100 Million
                          De novo assembly          >100 Million
https://genohub.com/recommended-sequencing-coverage-by-application/
base has been sequenced a certain number of times (10X, 20X, …)
reads
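Coverage figures such as 10X or 20X can be related to read counts with the usual average-depth estimate: number of reads times read length, divided by the size of the target. The sketch below assumes a human genome of roughly 3 billion bases and 150-base reads; both numbers and the function names are illustrative assumptions, not values from the slides.

```python
import math

def mean_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size

def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)

GENOME_SIZE = 3_000_000_000  # assumed approximate human genome size (bases)
READ_LENGTH = 150            # assumed read length (bases)

n_reads = reads_needed(30, READ_LENGTH, GENOME_SIZE)
print(n_reads, mean_coverage(n_reads, READ_LENGTH, GENOME_SIZE))
```

This is an average only; actual depth varies along the genome, which is why recommended coverage is usually stated with a safety margin.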
Wang et al, 2009
processes
BED File Format
Columns:
1. Chr
2. Start
3. End
4. Name
5. Score
6. Strand
7-9. Display info
10. # blocks
11. Block size
12. Block start
Example: chr5 11106617 111091469 NP_004763 1000 111091469 3 126,78,3 0,4509,24849
Block 1: 11106617 + 0 to 11106617 + 126
Block 2: 11106617 + 4509 to 11106617 + 4509 + 78
Block 3: 11106617 + 24849 to 11106617 + 24849 + 3
Junction 1: Block 1 to Block 2
Junction 2: Block 2 to Block 3
Junction 3: Block 1 to Block 3
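The block arithmetic in the example above (feature start plus block start, then plus block size) can be sketched as follows. The start coordinate and the block lists are copied as printed in the example row; the function names are illustrative assumptions. Possible junctions are taken here as every pair of blocks, each junction running from the end of the earlier block to the start of the later one.

```python
from itertools import combinations

def bed_blocks(chrom_start, block_sizes, block_starts):
    """Return (start, end) genome coordinates for each block of a BED feature."""
    sizes = [int(x) for x in block_sizes.strip(",").split(",")]
    starts = [int(x) for x in block_starts.strip(",").split(",")]
    return [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]

def possible_junctions(blocks):
    """Return (end of earlier block, start of later block) for every block pair."""
    return [(a[1], b[0]) for a, b in combinations(blocks, 2)]

# Values copied from the example row above.
blocks = bed_blocks(11106617, "126,78,3", "0,4509,24849")
junctions = possible_junctions(blocks)

for i, (start, end) in enumerate(blocks, 1):
    print(f"Block {i}: {start}-{end}")
for start, end in junctions:
    print(f"Junction: {start}-{end}")
```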
Fusion Location
.…AGAACTGGAAGAATTGG*AATGGTAGATAACGCAGATCATCT..…
Find consensus sequence
[Figure: gene fusion joining Gene X Exon 1 to Gene Y Exon 2 (Chr 1, Chr 2); mechanisms shown: chromosomal translocation, interstitial deletion, chromosomal inversion]
Stephens, et al. Complex landscape of somatic rearrangement in human breast cancer genomes. Nature 2009
[Figure: genome browser tracks: RNA-Seq expression, RNA-Seq coverage, junctions, Global PNNL, Global WashU, Phospho PNNL, Global Pep PNNL, Global Pep WashU, somatic variants, germline variants, RefSeq genes]
UCSC Genome Browser
Workflow: Tumor Sample, Lysis, Fractionation, Digestion, Peptides, Tandem Mass Spectrometry
Spectrum: intensity vs m/z
Output: Identity, Quantity
Discovery Proteomics:
expression (whole cell proteome)
to measure phosphorylation status
Targeted Proteomics:
representative peptides of these proteins to measure prior to run
Using antibodies to quantify proteins
Western Blot, RPPA, Immunofluorescence, Immunohistochemistry, ELISA
based assay
expression and phosphorylation
as microspots on glass slides and probed with ~200 antibodies
High throughput shotgun MS/MS
Requires no knowledge of peptides present, uses mass difference to determine next AA in peptide chain.
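The idea of reading a sequence from mass differences can be sketched like this: sort the peaks of one fragment-ion ladder, take the difference between neighbours, and look each difference up in a table of amino acid residue masses. The residue masses are standard monoisotopic values; the ladder peaks, the tolerance and the function name are made-up assumptions, and real spectra have missing peaks and noise that this toy version ignores.

```python
# Standard monoisotopic residue masses (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def read_ladder(peaks, tolerance=0.02):
    """Translate mass differences between consecutive singly charged
    ladder peaks into amino acid residues."""
    peaks = sorted(peaks)
    sequence = []
    for low, high in zip(peaks, peaks[1:]):
        delta = high - low
        matches = [aa for aa, mass in RESIDUE_MASS.items()
                   if abs(mass - delta) <= tolerance]
        # Leucine and isoleucine have identical mass and cannot be told apart.
        sequence.append("/".join(matches) if matches else "?")
    return sequence

# Hypothetical ladder of peaks (m/z, charge 1), for illustration only.
ladder = [100.0, 157.02146, 228.05857, 341.14263]
calls = read_ladder(ladder)
print(calls)
```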
Tandem Mass Spectrometry: protein sequence database search
Protein Sequence DB: Pick Protein
In silico digestion
Pick Peptide
All fragment masses (m/z)
Compare, Score, Test Significance
Repeat for all proteins/peptides to find best match
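A toy version of the database search loop above might look like the following sketch. It assumes trypsin as the digestion enzyme (cleaving after K or R unless followed by P, no missed cleavages), scores by a simple shared peak count, and skips the significance test entirely. The protein names and sequences are hypothetical, and real search engines use far more elaborate scoring.

```python
import re

# Standard monoisotopic residue masses (Da), water and proton.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728

def tryptic_digest(protein):
    """In silico digestion: cleave after K or R unless the next residue is P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def fragment_mz(peptide):
    """Singly charged b-ion (N-terminal) and y-ion (C-terminal) m/z values."""
    b = [sum(MONO[a] for a in peptide[:i]) + PROTON
         for i in range(1, len(peptide))]
    y = [sum(MONO[a] for a in peptide[i:]) + WATER + PROTON
         for i in range(1, len(peptide))]
    return b + y

def shared_peak_count(observed, theoretical, tol=0.02):
    """Number of theoretical fragments matched by an observed peak."""
    return sum(any(abs(o - t) <= tol for o in observed) for t in theoretical)

def best_match(observed, proteins):
    """Score every peptide of every protein and return the top (score, protein, peptide)."""
    candidates = [
        (shared_peak_count(observed, fragment_mz(pep)), name, pep)
        for name, seq in proteins.items()
        for pep in tryptic_digest(seq) if len(pep) > 1
    ]
    return max(candidates)

# Hypothetical two-protein database and a simulated spectrum of peptide VTLFR.
proteins = {"P1": "GASPK", "P2": "VTLFR"}
observed = fragment_mz("VTLFR")
print(best_match(observed, proteins))
```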
Computational issues with protein inference:
to is difficult
Huang T et al (2012) Protein inference: a review
Examples:
(1) Proteins 1 and 2 have the same set of identified peptides; with no other supporting information we cannot determine which protein is in the sample
(2) Protein 3 is a one-hit wonder and cannot be reliably mapped
(3) Protein 4 has two identified peptides which do not map to another protein, so we can assume that this protein is present
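The three example cases can be expressed as a small classification over a protein-to-peptide mapping. This is only a sketch of the reasoning in the examples, not a published inference algorithm; the labels, the peptide names and the rule that two unique peptides are enough are assumptions made for illustration.

```python
def classify_proteins(protein_to_peptides):
    """Label each protein according to the three example cases."""
    owners = {}
    for protein, peptides in protein_to_peptides.items():
        for pep in peptides:
            owners.setdefault(pep, set()).add(protein)

    labels = {}
    for protein, peptides in protein_to_peptides.items():
        unique = [p for p in peptides if owners[p] == {protein}]
        twins = [q for q, other in protein_to_peptides.items()
                 if q != protein and set(other) == set(peptides)]
        if twins:
            labels[protein] = "indistinguishable"   # case 1
        elif len(peptides) == 1:
            labels[protein] = "one-hit wonder"      # case 2
        elif len(unique) >= 2:
            labels[protein] = "present"             # case 3
        else:
            labels[protein] = "ambiguous"
    return labels

# Hypothetical identifications mirroring the examples above.
identified = {
    "Protein 1": {"pepA", "pepB"},
    "Protein 2": {"pepA", "pepB"},
    "Protein 3": {"pepC"},
    "Protein 4": {"pepD", "pepE"},
}
labels = classify_proteins(identified)
print(labels)
```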
Top down: Intact proteins are ionized and introduced to a mass analyzer
Bottom up: Proteins are enzymatically digested and then introduced to a mass analyzer
Fragmentation: Charge retained on the C-terminus; Charge retained on the N-terminus
[Figure: LC-MS/MS instrument schematic: LC, Ion Source, Mass Analyzer 1, Fragmentation, Mass Analyzer 2, Detector; spectra shown as intensity vs mass/charge]
the sample through a column filled with a solid adsorbent material.
rates for different peptides → separation
Dimensions: Time, Peptide m/z, Peptide intensity, Peptide fragment m/z, Peptide fragment intensity, ...
[Figure: LC-MS/MS data. Panels: peptide intensity vs m/z vs time (MS); peptide intensity vs m/z; fragment intensity vs m/z (MS/MS), plotted as % relative abundance vs m/z with the precursor labeled [M+2H]2+]
Lysis, Fractionation, Digestion, LC-MS/MS
MS/MS
Spectrum Library: Pick Spectrum; Compare, Score, Test Significance; Repeat for all spectra
Identified Proteins
measurement of proteins which already have antibodies available
to measure proteins that do not require antibodies and instead rely
interest
Nature Method of the Year 2012
Fractionation Digestion LC-MS Lysis
MS
Shotgun proteomics Targeted MS
abundance and fragments
MS/MS
peptide identification Data Dependent Acquisition (DDA) Uses predefined set of peptides
MS
MS/MS
pairs for identification
Domon B & Aebersold R. Nature Biotechnology. 28(7), 710-721 (2010)
Quadrupole
Often used in targeted MS/MS
Linear array of 4 symmetrical rods
Filters sample ions based on m/z
Examples:
Ion Trap
Often used in shotgun proteomics
Ring electrode and two end cap electrodes
Examples:
TOF (Time of Flight)
Ion m/z determined via a time measurement
Examples:
OT/ICR (Orbitrap)
Barrel-like electrode and co-axial inner electrode
Traps ions in an orbital motion around spindle
Examples:
Deutsch EW. Mol Cell Proteomics. 11(12), 1612-21 (2012)
All instruments collect profile-mode data, but the vendor raw files written out after each run can contain one or more of these types, depending on user input.
Deutsch EW. Mol Cell Proteomics. 11(12), 1612-21 (2012)
information for a single MS run including metadata about the spectra and the spectra themselves
profile mode
format
Deutsch EW. Proteomics. 8(14), 2776-7 (2008) http://www.psidev.info/mzml_1_0_0%20
formats, it was common to convert the output files into simple text files with only information on the spectra
MGF file
into one file via m/z intensity pairs separated by headers
is lost in this conversion, hindering the development of advanced proteomic tools
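To make the text-file layout above concrete, here is a minimal reader for MGF-style content: each spectrum sits between BEGIN IONS and END IONS, header lines have the form KEY=value, and every other line is an m/z and intensity pair. The function name and the example spectra are illustrative assumptions with made-up numbers; production code would normally rely on an established proteomics library instead.

```python
def parse_mgf(text):
    """Parse MGF-style text into a list of spectra with headers and peaks."""
    spectra, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:
                key, value = line.split("=", 1)   # header line, e.g. PEPMASS=...
                current["params"][key] = value
            else:
                mz, intensity = line.split()[:2]  # peak line: m/z intensity
                current["peaks"].append((float(mz), float(intensity)))
    return spectra

# Made-up example with two spectra, for illustration only.
example_mgf = """
BEGIN IONS
TITLE=spectrum 1
PEPMASS=762.4
CHARGE=2+
260.1 1200.0
389.2 800.5
END IONS
BEGIN IONS
TITLE=spectrum 2
PEPMASS=501.3
175.1 300.0
END IONS
"""
spectra = parse_mgf(example_mgf)
print(len(spectra), spectra[0]["params"]["PEPMASS"], spectra[0]["peaks"])
```

Note that only what is written in these headers and peak lists survives; anything else recorded by the instrument is absent from such a file.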
retention_time peptide_mass peak_area
Nesvizhskii AI, Methods Mol Biol. 367, 87-119 (2007)
genome and the upstream input from the environment
functional level
processes leave behind”
biological sample
transcriptome or proteome
molecules based on their affinity for the stationary phase
fragments and detects these using their mass to charge ratio
analytes
which transfer energy to nuclei of each molecule
returns to base level
spectrum which can be used to identify/measure the metabolite
http://www.hmdb.ca/
experiments and data
2056 Species
drug discovery and development
causal, then the drug target/pathway is known
companies to create a system for these predictions
is not sensitive enough but with better technology these methods will likely increase
Johnson, Ivanisevic and Siuzdak, 2016
Membrane Components
Choline glycerophospholipid (PC), Ethanolamine glycerophospholipid (PE), Phosphatidylinositol (PI), Phosphatidylglycerol (PG), Phosphatidic Acid (PA), Phosphatidylserine (PS), Cardiolipin (CL), Sphingomyelin (SM), Galactosylceramide (GalCer), Glucosylceramide (GluCer), Cholesterol, Glycolipids, Sulfatide, Gangliosides
Glycerophospholipids Sphingolipids
Energy Storage
Free Fatty Acid (FFA), Triacylglycerol (TAG), Diacylglycerol (DAG), Monoacylglycerol (MAG), Acyl-CoA, Acylcarnitine
Signaling
DAG, MAG, Acyl-CoA, Acylcarnitine, FFA, Eicosanoids, Steroids, Ceramide, Sphingosine, Sphingoid-1-phosphate (S1P)
Sphingolipids
subclasses that reflect changes in metabolism
Han, X. (2016) Lipidomics for studying metabolism
processing