Toward building an automated bioinformatician:
more accurate transcript assembly via parameter advising
with Kwanho Kim and Carl Kingsford
slides: dandeblasio.com/AutoAlg19
Dan DeBlasio
dandeblasio.com @ danfdeblasio
Quant
==========
Perform dual-phase, mapping-based estimation of transcript abundance from RNA-seq reads

salmon quant options:

basic options:
will be considered toward quantification estimates. The default behavior is to consider orphan mappings if no valid paired mappings exist. This flag is independent of the option to write the orphaned mappings to file (--writeOrphanLinks).
increase sensitivity (allow more reads to map and more transcripts to be detected), but may decrease specificity as orphaned alignments are more likely to be spurious.
from the true fragment origin. Setting this to 0 specifies that alignments that disagree with the library type should be "impossible", while setting it to 1 says that alignments that disagree with the library type are no less likely than those that do
quant.genes.sf files, where the latter contains aggregated gene-level abundance estimates. The transcript to gene mapping should be provided as either a GTF file, or a in a simple tab-delimited format where each line contains the name of a transcript and the gene to which it belongs separated by a tab. The extension of the file is used to determine how the file should be parsed. Files ending in '.gtf', '.gff' or '.gff3' are assumed to be in GTF format; files with any other extension are assumed to be in the simple format. In GTF / GFF format, the "transcript_id" is assumed to contain the transcript identifier and the "gene_id" is assumed to contain the corresponding gene identifier.
will be directed to stdout, but an alternative file name can be provided instead.
[Figure labels: Default Parameter Vector; Optimized Parameter Vector; Reference Transcriptome]
[Figure labels: Aligned RNA-seq Reads; Oracle; parameter choice; Scallop]
[Figure: parameter advising. Labels: input; advisor set (p1,p2,…,p18); Scientific Application; alternate solutions; advisor estimator; labelled alternate solutions; candidate solutions; accuracy; max; solution]
[DeBlasio and Kececioglu, Springer International, 2017]
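The advising loop in the figure can be sketched in a few lines: run the application once per parameter vector in the advisor set, score each alternate solution with the estimator, and return the highest-scoring one. This is a minimal sketch, not the authors' implementation; `advise`, `toy_application`, and `toy_estimator` are hypothetical names, and the toy application stands in for a real tool such as Scallop.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Input = TypeVar("Input")
Params = TypeVar("Params")
Solution = TypeVar("Solution")


def advise(
    data: Input,
    advisor_set: Iterable[Params],
    application: Callable[[Input, Params], Solution],
    estimator: Callable[[Solution], float],
) -> Tuple[Params, Solution, float]:
    """Run the application once per parameter vector; keep the best-scoring result."""
    best = None
    for params in advisor_set:
        solution = application(data, params)  # one alternate solution
        score = estimator(solution)           # label it with estimated accuracy
        if best is None or score > best[2]:
            best = (params, solution, score)
    if best is None:
        raise ValueError("advisor set is empty")
    return best


# Hypothetical stand-ins for a real application and accuracy estimator.
def toy_application(x: float, params: Tuple[float, float]) -> float:
    scale, offset = params
    return scale * x + offset


def toy_estimator(solution: float) -> float:
    # Higher is better: solutions closer to 10 score higher.
    return -abs(solution - 10.0)


choice = advise(4.0, [(1.0, 0.0), (2.0, 1.0), (3.0, 0.0)], toy_application, toy_estimator)
print(choice)
```

The advisor never needs the true accuracy of a solution; it only needs the estimator to rank the alternates well, which is why the choice of estimator and of advisor set both matter.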
[Figure labels: input; "New" Scientific Application; solution]
[DeBlasio and Kececioglu, Springer International, 2017]
A-GT-PNGNP
A-G--P-GNP
A-GTTPNGNP
ACGT-UNGNP
[DeBlasio and Kececioglu, Springer International, 2017]
[DeBlasio and Kececioglu, Springer International, 2017]
Facet (Feature-based ACcuracy EsTimator)
Exhaustive Enumeration
[Plot: Average Accuracy (51% to 60%) versus Advisor Set Cardinality (1 to 25); legend: Default, Opal advising, General advising]
A good advisor set:
[DeBlasio and Kececioglu, Springer International, 2017]
A good advisor estimator:
[DeBlasio and Kececioglu, Springer International, 2017]
Area Under the Curve
[Plot axes: Sensitivity; Precision]
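An area under a curve can be computed from sampled points with the trapezoid rule. The sketch below assumes the curve is given as (sensitivity, precision) pairs; the function name and the sample points are hypothetical and are not taken from the Scallop evaluation code.

```python
from typing import Sequence, Tuple


def area_under_curve(points: Sequence[Tuple[float, float]]) -> float:
    """Trapezoid-rule area under a curve given as (sensitivity, precision) points."""
    ordered = sorted(points)  # order by sensitivity (the x-axis)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical curve: perfect precision up to 50% sensitivity, then a linear drop.
curve = [(0.0, 1.0), (0.5, 1.0), (1.0, 0.0)]
auc = area_under_curve(curve)
print(auc)
```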
[Figure labels: training example (×3); Coordinate Ascent (×3); advisor set (p1,p2,…,p18)]
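Coordinate ascent improves one parameter at a time while holding the others fixed, and stops when no single-parameter move helps. The sketch below is a generic version under stated assumptions: a fixed step size per parameter and a toy objective. In the setting of these slides the objective would be the AUC of an assembly for one training example, which is not reproduced here; `coordinate_ascent`, `steps`, and `toy_objective` are hypothetical names.

```python
from typing import Callable, Dict


def coordinate_ascent(
    objective: Callable[[Dict[str, float]], float],
    start: Dict[str, float],
    steps: Dict[str, float],
    max_rounds: int = 100,
) -> Dict[str, float]:
    """Greedy coordinate ascent with a fixed step size per parameter."""
    current = dict(start)
    best = objective(current)
    for _ in range(max_rounds):
        improved = False
        for name, step in steps.items():
            # Try moving this one coordinate up, then down.
            for delta in (step, -step):
                trial = dict(current)
                trial[name] = current[name] + delta
                value = objective(trial)
                if value > best:
                    current, best, improved = trial, value, True
                    break  # accept the move and go to the next coordinate
        if not improved:
            break  # no single-coordinate move helps: a local optimum
    return current


# Hypothetical objective with its maximum at a = 3, b = -2.
def toy_objective(p: Dict[str, float]) -> float:
    return -((p["a"] - 3.0) ** 2) - (p["b"] + 2.0) ** 2


result = coordinate_ascent(toy_objective, {"a": 0.0, "b": 0.0}, {"a": 1.0, "b": 1.0})
print(result)
```

Each call to the objective corresponds to a full run of the application, so the number of rounds and the step sizes directly control the cost of the search.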
Average of 18.1% increase in AUC using Coordinate Ascent
[Plot: Increase in AUC over Default using Coordinate Ascent, per sample (SRR307903, SRR307911, SRR315323, SRR315334, SRR387661, SRR534291, SRR534307, SRR534319, SRR545695, SRR545723), for each aligner: Hisat, STAR, TopHat]
[Figure labels: i; j; $c(E_j, P_i)$; $c(E_i, P_j)$]
$c(e, p) := \mathrm{AUC}(\mathrm{Scallop}_{p(e)}(e)) - \mathrm{AUC}(\mathrm{Scallop}_{p}(e))$
[DeBlasio and Kececioglu, Springer International, 2017]
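The cost c(e, p) can be tabulated for every pair of training example and parameter vector. The sketch below assumes a precomputed table `auc[i][j]` holding the AUC obtained on example E_i with parameter vector P_j, and assumes that P_i is the vector optimized for E_i, so the diagonal entry plays the role of the first term of the cost. Both the table layout and the numbers are hypothetical.

```python
from typing import List


def cost_matrix(auc: List[List[float]]) -> List[List[float]]:
    """cost[i][j] = auc[i][i] - auc[i][j]: AUC lost on example i by using vector j."""
    return [[row[i] - value for value in row] for i, row in enumerate(auc)]


# Hypothetical AUC table for two examples and their two optimized vectors.
auc_table = [
    [0.50, 0.25],  # example E_0 under P_0, P_1
    [0.25, 0.75],  # example E_1 under P_0, P_1
]
costs = cost_matrix(auc_table)
print(costs)
```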
A point represents:
Find $S \subseteq \{1, \dots, n\}$, $|S| = k$
To minimize $\sum_{i} \min_{j \in S} c(E_i, P_j)$
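For small n and k the objective above can be solved by trying every subset of size k. The sketch below does exactly that on a hypothetical cost matrix, where `cost[i][j]` stands for c(E_i, P_j). The slides do not say which method was used for this step, so exhaustive search here is an illustration only, and it becomes impractical as the number of subsets grows.

```python
from itertools import combinations
from typing import Sequence, Tuple


def best_advisor_set(
    cost: Sequence[Sequence[float]], k: int
) -> Tuple[Tuple[int, ...], float]:
    """Exhaustively pick k parameter indices minimizing the sum of per-example minima."""
    n_params = len(cost[0])
    best_subset, best_total = None, float("inf")
    for subset in combinations(range(n_params), k):
        # Each example is charged the cost of its best vector inside the subset.
        total = sum(min(row[j] for j in subset) for row in cost)
        if total < best_total:
            best_subset, best_total = subset, total
    if best_subset is None:
        raise ValueError("k must be between 1 and the number of parameter vectors")
    return best_subset, best_total


# Hypothetical costs for three examples (rows) and three parameter vectors (columns).
toy_cost = [
    [0.0, 3.0, 5.0],
    [4.0, 0.0, 1.0],
    [6.0, 2.0, 0.0],
]
chosen, total_cost = best_advisor_set(toy_cost, 2)
print(chosen, total_cost)
```

In the toy matrix, vector 1 is the best choice for the second example, yet the optimal pair leaves it out because vector 2 covers that example almost as well while also covering the third.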
11.1% average increase in accuracy for Coordinate Ascent
[Plot: Increase in AUC Over Default using Coordinate Ascent, per sample (SRR307903, SRR307911, SRR315323, SRR315334, SRR387661, SRR534291, SRR534307, SRR534319, SRR545695, SRR545723), for each aligner: Hisat, STAR, TopHat]
[Pertea, et al., Nature Biotechnology 2015]
[Figure labels: Reference Transcriptome; Assembled Transcriptome; Sequencing Reads; Mapped Reads]
[Figure labels: Ground Truth Transcriptome; Reads; "Reference" Transcriptome]
AUC is the only tested method to increase recovery when optimized
[Plot: Relative Increase in AUC (0% to 160%) and Recovery, per example (SRR307903, SRR307911, SRR315323, SRR387661, SRR534291, SRR534307, SRR534319, SRR545695, SRR545723); legend: whole AUC, partial AUC, Transrate, reads, linear]
Especially: Mingfu Shao, Guillaume Marçais, Heewook Lee, Minh Hoang
Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative (GBMF4554)
US National Science Foundation (CCF-1256087 and CCF-1319998)
US National Institutes of Health (R01HG007104 and R01GM122935)
The Shurl and Kay Curci Foundation
Scallop Advising: https://github.com/Kingsford-Group/scallopadvising
Slides (and links): dandeblasio.com/AutoAlg19
The University of Texas at El Paso
Published at the Workshop on Computational Biology at ICML 2019