GeneQC Statistical Model


Aug 05, 2023


SLIDE 1

GeneQC Statistical Model

SLIDE 2

General Idea

Reads can be mapped to multiple gene loci
Leads to varying degrees of mapping uncertainty
Potentially causes issues with inferences based on read counts
Differentially expressed genes

Co-expression patterns
Various network analyses

SLIDE 3

Options

Exclude ambiguous reads
Multiple assignment
Random assignment
Probabilistic assignment
Only considering local information
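
The four options above can be compared on a toy example. The following is a minimal sketch, not the GeneQC implementation: the read names, gene names, and function names are hypothetical, and the probabilistic variant assumes each ambiguous read is split in proportion to the unambiguous read counts of its candidate loci (local information only).

```python
import random
from collections import Counter

# Hypothetical toy input: each read maps to one or more candidate gene loci.
reads = {
    "r1": ["geneA"],
    "r2": ["geneA", "geneB"],
    "r3": ["geneB"],
    "r4": ["geneA", "geneB"],
    "r5": ["geneA"],
}

def exclude_ambiguous(reads):
    """Count only reads that map to exactly one locus."""
    counts = Counter()
    for loci in reads.values():
        if len(loci) == 1:
            counts[loci[0]] += 1
    return dict(counts)

def multiple_assignment(reads):
    """Count every read once for each locus it maps to."""
    counts = Counter()
    for loci in reads.values():
        for gene in loci:
            counts[gene] += 1
    return dict(counts)

def random_assignment(reads, seed=0):
    """Assign each read to one of its candidate loci uniformly at random."""
    rng = random.Random(seed)
    counts = Counter()
    for loci in reads.values():
        counts[rng.choice(loci)] += 1
    return dict(counts)

def probabilistic_assignment(reads):
    """Split each ambiguous read in proportion to unambiguous counts (assumed rule)."""
    unique = exclude_ambiguous(reads)
    counts = Counter({gene: float(n) for gene, n in unique.items()})
    for loci in reads.values():
        if len(loci) > 1:
            weights = [unique.get(gene, 0) for gene in loci]
            total = sum(weights)
            for gene, weight in zip(loci, weights):
                # Fall back to an even split when no candidate has unambiguous reads.
                counts[gene] += weight / total if total > 0 else 1 / len(loci)
    return dict(counts)

excluded = exclude_ambiguous(reads)
multiple = multiple_assignment(reads)
randomized = random_assignment(reads)
fractional = probabilistic_assignment(reads)
```

Excluding ambiguous reads discards two of the five reads, multiple assignment counts seven alignments for five reads, and the random and probabilistic variants both keep the total at five but distribute it differently.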

SLIDE 4

Co-expressed Genes

Co-expressed genes provide an additional level of information
Global data for more solid statistical evaluation

SLIDE 5

Goal

Create statistically sound model for assignment of ambiguous reads
Use co-expression of genes
Develop method that produces p-value or probability score for each ambiguous read assignment
Provide a p-value signifying the confidence of each gene’s read count

SLIDE 6

Previous Publications

Faulkner, G.J., et al., A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics, 2008. 91(3): p. 281-288.

Hashimoto, T., et al., Probabilistic resolution of multi-mapping reads in massively parallel sequencing data using MuMRescueLite. Bioinformatics, 2009. 25(19): p. 2613-2614.

Wang, J., et al., A Gibbs sampling strategy applied to the mapping of ambiguous short-sequence tags. Bioinformatics, 2010. 26(20): p. 2501-2508.

SLIDE 7

Overall Direction

Assign all unambiguous reads
Use co-expression information of unambiguous reads to make first probabilistic assignment of ambiguous reads
Based on assignments, recalculate probabilities for ambiguous reads

Continue iterative procedure until no/minimal changes occur
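
The iterative procedure above can be sketched as a simple fixed-point loop. This is an illustration under stated assumptions, not the model described in the slides: the function names, the `coexpressed` mapping, the pseudocount, and in particular the rule that a candidate gene's weight is the mean current count of its co-expressed partners are all hypothetical choices made to keep the example small.

```python
def coexpression_support(gene, counts, coexpressed, pseudo=1e-3):
    """Assumed weight: mean current count of the gene's co-expressed partners."""
    partners = coexpressed.get(gene, [])
    if not partners:
        # No co-expression information: fall back to the gene's own count.
        return counts.get(gene, 0.0) + pseudo
    return sum(counts.get(p, 0.0) for p in partners) / len(partners) + pseudo

def iterative_assignment(unique_counts, ambiguous_reads, coexpressed,
                         tol=1e-6, max_iter=100):
    """Alternate between assignment probabilities and expected counts."""
    genes = set(unique_counts)
    for loci in ambiguous_reads.values():
        genes.update(loci)
    # Step 1: start from the unambiguous read counts.
    counts = {g: float(unique_counts.get(g, 0)) for g in genes}
    probs = {}
    for _ in range(max_iter):
        # Step 2: probabilistic assignment of each ambiguous read.
        probs = {}
        for read, loci in ambiguous_reads.items():
            weights = [coexpression_support(g, counts, coexpressed) for g in loci]
            total = sum(weights)
            probs[read] = {g: w / total for g, w in zip(loci, weights)}
        # Step 3: recompute expected counts from the new assignments.
        new_counts = {g: float(unique_counts.get(g, 0)) for g in genes}
        for assignment in probs.values():
            for g, p in assignment.items():
                new_counts[g] += p
        # Step 4: stop when the counts no longer change appreciably.
        change = max((abs(new_counts[g] - counts[g]) for g in genes), default=0.0)
        counts = new_counts
        if change < tol:
            break
    return counts, probs

# Hypothetical toy data: 75 unambiguous reads and 3 ambiguous reads.
unique_counts = {"geneA": 10, "geneB": 10, "geneC": 50, "geneD": 5}
ambiguous_reads = {
    "r1": ["geneA", "geneB"],
    "r2": ["geneA", "geneB"],
    "r3": ["geneC", "geneD"],
}
coexpressed = {"geneA": ["geneC"], "geneB": ["geneD"],
               "geneC": ["geneA"], "geneD": ["geneB"]}

counts, probs = iterative_assignment(unique_counts, ambiguous_reads, coexpressed)
```

In this toy data geneA's partner (geneC) is far more highly expressed than geneB's partner (geneD), so reads `r1` and `r2` are assigned mostly to geneA. The `max_iter` cap is there because a loop of this kind is not guaranteed to converge in general.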

SLIDE 8

Additional parameters

Similarity between a given read and each potential gene locus
Differences in similarity are generally very small
Co-expression rate between genes and co-expressed genes
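
One way these two parameters could enter an assignment probability is as a normalized product. The form below is an assumption made for illustration; the slides do not give a formula. Here $s_{r,g}$ is the similarity between read $r$ and locus $g$, $c_g$ is a co-expression score for gene $g$, and $L(r)$ is the set of candidate loci for $r$; all three symbols are hypothetical.

```latex
P(r \to g) = \frac{s_{r,g}\, c_g}{\sum_{g' \in L(r)} s_{r,g'}\, c_{g'}}
```

Under this form, when the similarities $s_{r,g}$ are nearly equal across candidate loci they cancel, and the probability is driven almost entirely by the co-expression term.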

SLIDE 9

Concerns & Limitations

Requires accurate co-expression information
Limited sample size of co-expression information could skew probability distribution

Potentially highly computationally intensive
Iterative procedure may converge to a local optimum
Does not currently consider dependence between read assignments

SLIDE 10

Our Future Plans

Collect test data to verify increased performance using statistical model
Run model with various validated probability assumptions (Normal, Poisson, etc.)
Develop R package with statistical model implementation