GeneQC Statistical Model
General Idea
- Reads can be mapped to multiple gene loci
- Leads to varying degrees of mapping uncertainty
- Potentially causes issues with inferences based on read counts:
  - Differentially expressed genes
  - Co-expression patterns
  - Various network analyses
Options
- Exclude ambiguous reads
- Multiple assignment
- Random assignment
- Probabilistic assignment
- Only considering local information
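
The four assignment options above can be contrasted on a single multi-mapped read. The following is a minimal sketch, not GeneQC code: the function name `assign_read`, the `strategy` labels, and the pseudocount are hypothetical, and the "probabilistic" branch assumes one simple local rule (weighting each candidate gene by its count of uniquely mapped reads).

```python
import random


def assign_read(candidates, strategy, unique_counts=None, rng=None):
    """Return {gene: weight} for one read mapping to the genes in `candidates`.

    Hypothetical helper for illustration only.
    """
    rng = rng or random.Random(0)
    unique_counts = unique_counts or {}

    if len(candidates) == 1:
        # Unambiguous read: the whole read goes to its only locus.
        return {candidates[0]: 1.0}
    if strategy == "exclude":
        # Drop the ambiguous read entirely.
        return {}
    if strategy == "multiple":
        # Count the read once at every candidate locus.
        return {gene: 1.0 for gene in candidates}
    if strategy == "random":
        # Pick one candidate locus uniformly at random.
        return {rng.choice(candidates): 1.0}
    if strategy == "probabilistic":
        # Assumed local rule: weight by uniquely mapped reads at each locus.
        # The pseudocount of 1.0 (an assumption) avoids all-zero weights.
        weights = [unique_counts.get(gene, 0) + 1.0 for gene in candidates]
        total = sum(weights)
        return {gene: w / total for gene, w in zip(candidates, weights)}
    raise ValueError(f"unknown strategy: {strategy}")
```

With `unique_counts={"A": 3, "B": 0}`, the probabilistic branch splits a read between `A` and `B` in a 4:1 ratio. It uses only information local to the candidate loci, which is the limitation noted in the last bullet.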
Co-expressed Genes
- Co-expressed genes provide an additional level of information
- Global data for more solid statistical evaluation
Goal
- Create a statistically sound model for assignment of ambiguous reads
- Use co-expression of genes
- Develop a method that produces a p-value or probability score for each ambiguous read assignment
- Provide a p-value signifying the confidence of each gene's read count
Previous Publications
- Faulkner, G.J., et al., A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics, 2008. 91(3): p. 281-288.
- Hashimoto, T., et al., Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. Bioinformatics, 2009. 25(19): p. 2613-2614.
- Wang, J., Huda, A., Lunyak, V.V., & Jordan, I.K., A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags. Bioinformatics, 2010. 26(20): p. 2501-2508.
Overall Direction
- Assign all unambiguous reads
- Use co-expression information of unambiguous reads to make a first probabilistic assignment of ambiguous reads
- Based on these assignments, recalculate probabilities for ambiguous reads
- Continue the iterative procedure until no/minimal changes occur
Additional parameters
- Similarity between a given read and each potential gene locus
  - Differences are generally minute
- Co-expression rate between each gene and its co-expressed genes
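
One way these two parameters could enter the assignment probability is as a product of weights, normalised over a read's candidate loci. The form below is an assumption for illustration, not a formula from the slides: $s_{r,g}$ denotes the similarity between read $r$ and locus $g$, $c_g^{(t)}$ the co-expression support for $g$ at iteration $t$, and $G_r$ the set of candidate loci for $r$ (all hypothetical symbols).

```latex
p_{r,g}^{(t+1)} \;=\;
  \frac{s_{r,g}\, c_g^{(t)}}
       {\sum_{g' \in G_r} s_{r,g'}\, c_{g'}^{(t)}},
  \qquad g \in G_r .
```

Under this form, when the similarities $s_{r,g}$ are nearly equal across the loci in $G_r$, they approximately cancel and the co-expression term determines the assignment.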
Concerns & Limitations
- Requires accurate co-expression information
- Limited sample size of co-expression information could skew the probability distribution
- Potentially highly computationally intensive
- May converge to a local optimum
- Does not currently consider dependence between read assignments
Our Future Plans
- Collect test data to verify increased performance using the statistical model
- Run the model with various validated probability assumptions
  - Normal, Poisson, etc.
- Develop an R package with the statistical model implementation