 
              GeneQC Statistical Model

General Idea
• Reads can map to multiple gene loci
• This leads to varying degrees of mapping uncertainty
• Mapping uncertainty can cause issues with inferences based on read counts:
  • Differentially expressed genes
  • Co-expression patterns
  • Various network analyses

Options
• Exclude ambiguous reads
• Multiple assignment
• Random assignment
• Probabilistic assignment
• Only local information is considered
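
The four options above can be compared side by side in a minimal Python sketch. The function name `assign_read`, the strategy labels, and the `weights` argument are hypothetical names chosen for illustration; none of them come from GeneQC itself.

```python
import random

def assign_read(candidates, strategy, weights=None, rng=None):
    """Return {gene: count contribution} for one read.

    candidates: gene loci the read maps to (hypothetical input format).
    strategy:   "exclude", "multiple", "random", or "probabilistic".
    weights:    optional {gene: non-negative weight}, used only by the
                probabilistic strategy; defaults to equal weights.
    """
    if len(candidates) == 1:
        # Unambiguous read: every strategy counts it once.
        return {candidates[0]: 1.0}
    if strategy == "exclude":
        # Ambiguous reads are dropped entirely.
        return {}
    if strategy == "multiple":
        # Every candidate locus receives a full count.
        return {gene: 1.0 for gene in candidates}
    if strategy == "random":
        # One candidate is picked uniformly at random.
        rng = rng or random.Random()
        return {rng.choice(candidates): 1.0}
    if strategy == "probabilistic":
        # The single count is split in proportion to the weights.
        weights = weights or {gene: 1.0 for gene in candidates}
        total = sum(weights[gene] for gene in candidates)
        return {gene: weights[gene] / total for gene in candidates}
    raise ValueError(f"unknown strategy: {strategy}")

# Example: a read mapping to genes A and B, with A weighted 3:1 over B.
split = assign_read(["A", "B"], "probabilistic", weights={"A": 3.0, "B": 1.0})
print(split)  # {'A': 0.75, 'B': 0.25}
```

In this sketch every strategy uses only the read's own candidate list and, at most, per-locus weights, which is the "local information" restriction noted on the slide.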

Co-expressed Genes
• Co-expressed genes provide an additional level of information
• Global data allow a more solid statistical evaluation

Goal
• Create a statistically sound model for the assignment of ambiguous reads
• Use co-expression of genes
• Develop a method that produces a p-value or probability score for each ambiguous read assignment
• Provide a p-value signifying the confidence of each gene's read count

Previous Publications
• Faulkner, G.J., et al., A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics, 2008. 91(3): p. 281-288.
• Hashimoto, T., et al., Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. Bioinformatics, 2009. 25(19): p. 2613-2614.
• Wang, J., Huda, A., Lunyak, V.V., & Jordan, I.K., A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags. Bioinformatics, 2010. 26(20): p. 2501-2508.

Overall Direction
• Assign all unambiguous reads
• Use co-expression information from the unambiguous reads to make a first probabilistic assignment of ambiguous reads
• Based on these assignments, recalculate the probabilities for ambiguous reads
• Continue the iterative procedure until no or only minimal changes occur
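
The iterative procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the GeneQC model: the slides do not specify how co-expression enters the probabilities, so the sketch assumes that a candidate gene is scored by the mean current count of its co-expressed partner genes. The function names, the `coexpressed` mapping, the `pseudocount`, and the stopping tolerance `tol` are all hypothetical.

```python
def partner_score(gene, counts, coexpressed, pseudocount):
    """Assumed scoring rule: mean current count of the gene's co-expressed partners."""
    partners = coexpressed.get(gene, [])
    if not partners:
        return pseudocount
    return sum(counts.get(p, 0.0) for p in partners) / len(partners) + pseudocount

def iterative_assignment(unambiguous_counts, ambiguous_reads, coexpressed,
                         tol=1e-6, max_iter=100, pseudocount=1e-3):
    """Return (estimated counts per gene, per-read assignment probabilities)."""
    genes = set(unambiguous_counts) | {g for read in ambiguous_reads for g in read}
    # Step 1: counts from unambiguous reads only.
    base = {g: float(unambiguous_counts.get(g, 0)) for g in genes}
    counts = dict(base)
    probs = []
    for _ in range(max_iter):
        new_counts = dict(base)
        new_probs = []
        for read in ambiguous_reads:
            # Steps 2 and 3: score each candidate, then normalise to probabilities.
            scores = {g: partner_score(g, counts, coexpressed, pseudocount) for g in read}
            total = sum(scores.values())
            p = {g: s / total for g, s in scores.items()}
            new_probs.append(p)
            for g, value in p.items():
                new_counts[g] += value  # fractional (expected) count
        # Step 4: stop once the largest change in any gene count is minimal.
        delta = max((abs(new_counts[g] - counts[g]) for g in genes), default=0.0)
        counts, probs = new_counts, new_probs
        if delta < tol:
            break
    return counts, probs

# Toy example: ten reads map to both A and B; A's partner X is highly expressed.
counts, probs = iterative_assignment(
    unambiguous_counts={"X": 90, "Y": 10},
    ambiguous_reads=[("A", "B")] * 10,
    coexpressed={"A": ["X"], "B": ["Y"]},
)
```

In the toy example most of the ambiguous reads shift toward A because its partner X has far more unambiguous reads than B's partner Y. The sketch assigns fractional counts rather than hard assignments and treats reads independently, which matches the limitation listed later in the deck.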

Additional Parameters
• Similarity between a given read and each potential gene locus
  • Differences are generally very small
• Co-expression rate between each gene and its co-expressed genes
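
One simple way to combine the two parameters above is to multiply them and normalise. The slides do not say how the parameters are combined, so the product rule, the function name `combined_probabilities`, and the example numbers below are assumptions made purely for illustration.

```python
def combined_probabilities(similarity, coexpression):
    """Combine per-locus read similarity with a co-expression score.

    similarity:   {gene: alignment similarity in [0, 1]} (hypothetical input)
    coexpression: {gene: non-negative co-expression score} (hypothetical input)
    Assumed rule: probability is proportional to similarity * coexpression.
    """
    raw = {g: similarity[g] * coexpression[g] for g in similarity}
    total = sum(raw.values())
    if total == 0:
        # No evidence at all: fall back to equal probabilities.
        return {g: 1.0 / len(raw) for g in raw}
    return {g: value / total for g, value in raw.items()}

# Similarities differ only slightly, so the co-expression score dominates.
p = combined_probabilities(
    similarity={"A": 0.99, "B": 0.98},
    coexpression={"A": 1.0, "B": 3.0},
)
```

Because similarity differences between candidate loci are usually very small, the co-expression term carries most of the weight under this rule.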

Concerns & Limitations
• Requires accurate co-expression information
• A limited sample size of co-expression information could skew the probability distribution
• Potentially highly computationally intensive
• The procedure may converge to a local optimum
• Does not currently consider dependence among read assignments

Our Future Plans
• Collect test data to verify that the statistical model improves performance
• Run the model under various validated probability assumptions
  • Normal, Poisson, etc.
• Develop an R package implementing the statistical model
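
To run a model under different probability assumptions, the distribution can be passed in as a function. The Python sketch below shows this for Poisson and Normal log-densities. How observed and expected counts would be defined in the actual model is not stated in the slides; here the "expected" count is assumed to be a value predicted from co-expressed genes, and all function names are hypothetical.

```python
import math

def poisson_log_pmf(count, mean):
    """Log of the Poisson probability of `count` given a positive `mean`."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)

def normal_log_pdf(value, mean, sd):
    """Log of the Normal density at `value` with the given mean and sd > 0."""
    z = (value - mean) / sd
    return -0.5 * math.log(2 * math.pi) - math.log(sd) - 0.5 * z * z

def candidate_weights(observed, expected, log_density):
    """Normalised weights from log_density(observed count, expected count).

    Assumption: a gene whose observed count agrees better with its expected
    count receives more weight.
    """
    logs = {g: log_density(observed[g], expected[g]) for g in observed}
    top = max(logs.values())  # subtract the maximum for numerical stability
    raw = {g: math.exp(v - top) for g, v in logs.items()}
    total = sum(raw.values())
    return {g: v / total for g, v in raw.items()}

observed = {"A": 9, "B": 30}
expected = {"A": 10.0, "B": 10.0}

w_poisson = candidate_weights(observed, expected, poisson_log_pmf)
# Normal approximation with variance equal to the mean (an assumption).
w_normal = candidate_weights(
    observed, expected, lambda k, m: normal_log_pdf(k, m, math.sqrt(m))
)
```

Swapping `log_density` is the only change needed to compare assumptions on the same data.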