SLIDE 1 A method for high throughput sequencing data analysis: application for mapping genome-wide protein-DNA binding sites (ChIPseq)
JC Andrau, Biostat, 15/01/2010
SLIDE 2 High throughput sequencing applications
Protein-DNA interaction
- Mapping of epigenetic marks and identification of sequences regulating gene
expression (ChIP-seq)
Genome sequencing
- Human gene mapping
- Qualitative (SNP) and quantitative (amplification) genetic variations
- de novo sequencing of model organisms and pathogens
Transcriptome (RNAseq)
- Identification and analysis of non coding RNAs (miRNA, etc.)
- Monitoring gene expression, covering all the alternative
messengers of a given locus, in a variety of contexts
SLIDE 3
ChIP-seq: Solexa procedure
- PCR + size exclusion (gel extraction)
- Loading in flowcell and cluster amplification
- Image acquisition and base calling
SLIDE 4 Sequencing and alignment
- Sequencing extremities of DNA
fragments
- RAW data files (sequences)
- Aligned against a reference genome
  – MAQ
  – Solexa…
SLIDE 5
First steps of data analysis
SLIDE 6
First steps of data analysis
SLIDE 7 DNA fragments VS sequences
- Only extremities of DNA fragments are
sequenced
- Enriched regions do not mark the exact
binding site
- In-silico process to elongate the tags
+ Strand
Binding Site
SLIDE 8
Elongation process
Figure labels: + strand, - strand, shifting (bp), overlap
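The elongation of tags can be sketched as follows. This is a minimal Python illustration, not the R code used for the talk; the function name `elongate_tag`, the 36 bp read length and the 150 bp fragment length are assumptions made for the example.

```python
def elongate_tag(start, read_length, strand, fragment_length):
    """Extend a sequenced tag to the estimated fragment length.

    Coordinates are 1-based and inclusive. A tag on the + strand is
    extended to the right of its start; a tag on the - strand is
    extended to the left of its end, because only the extremities of
    each DNA fragment are sequenced.
    """
    if fragment_length < read_length:
        raise ValueError("fragment_length must be >= read_length")
    end = start + read_length - 1
    if strand == "+":
        return start, start + fragment_length - 1
    if strand == "-":
        return max(1, end - fragment_length + 1), end
    raise ValueError("strand must be '+' or '-'")


# Two hypothetical tags flanking a binding site (assumed values)
plus_tag = elongate_tag(start=1000, read_length=36, strand="+", fragment_length=150)
minus_tag = elongate_tag(start=1200, read_length=36, strand="-", fragment_length=150)
print(plus_tag, minus_tag)
```

After elongation the two tags overlap between the + and - strand positions, which is where the binding site is expected to lie.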
SLIDE 9
Score per nucleotide
SLIDE 10
Score per nucleotide
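One way to obtain a score per nucleotide is sketched below in Python. It assumes the score is simply the number of elongated fragments covering each position; the slides do not state the exact definition, and `score_per_nucleotide` is a hypothetical name.

```python
def score_per_nucleotide(fragments, chrom_length):
    """Count, for each position, how many elongated fragments cover it.

    `fragments` holds (start, end) pairs, 1-based and inclusive.
    A difference array keeps the cost linear in the number of
    fragments plus the chromosome length.
    """
    diff = [0] * (chrom_length + 2)
    for start, end in fragments:
        start = max(1, start)
        end = min(chrom_length, end)
        if start > end:
            continue  # fragment lies entirely outside the chromosome
        diff[start] += 1
        diff[end + 1] -= 1
    scores = []
    running = 0
    for pos in range(1, chrom_length + 1):
        running += diff[pos]
        scores.append(running)
    return scores  # scores[i] is the score at position i + 1


scores = score_per_nucleotide([(2, 5), (4, 8)], chrom_length=10)
print(scores)
```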
SLIDE 11
Further analysis
SLIDE 12 Artefacts removal and normalisation
- An input experiment helps to locate regions that are problematic for alignment (duplications,
reference genome…)
  – We should not see enrichment in the input
  – These regions were removed from all datasets
- From the genome-wide average of the scores, we can estimate the background (BG) level
and then rescale all experiments to this level
- The last step is to subtract the input from each dataset, in order to reduce
variation effects and background in the data
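The three steps of this slide can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the fixed `input_cutoff` used to flag input-enriched regions and the `target_bg` value are hypothetical, and the slides do not say how enrichment in the input was detected.

```python
def mean_background(scores):
    """Estimate the background level as the mean of all unmasked scores."""
    kept = [s for s in scores if s is not None]
    return sum(kept) / len(kept)


def clean_and_normalise(chip, input_ctrl, input_cutoff, target_bg=1.0):
    """Mask artefact bins, rescale both tracks to a common background, subtract input."""
    # 1. Bins enriched in the input are treated as artefacts and masked in both tracks.
    masked = [v > input_cutoff for v in input_ctrl]
    chip = [None if m else v for v, m in zip(chip, masked)]
    input_ctrl = [None if m else v for v, m in zip(input_ctrl, masked)]

    # 2. Rescale each experiment so that its mean background equals target_bg.
    chip_scale = target_bg / mean_background(chip)
    input_scale = target_bg / mean_background(input_ctrl)

    # 3. Subtract the rescaled input from the rescaled ChIP signal.
    return [
        None if c is None else c * chip_scale - i * input_scale
        for c, i in zip(chip, input_ctrl)
    ]


# Toy binned tracks (assumed values): bin 5 is an alignment artefact seen in the input.
chip = [2, 2, 20, 2, 50, 2]
input_ctrl = [1, 1, 1, 1, 40, 1]
result = clean_and_normalise(chip, input_ctrl, input_cutoff=10)
print(result)
```

In this toy example the artefact bin is masked, the background bins end up slightly below zero, and only the genuinely enriched bin keeps a clearly positive score.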
SLIDE 13 Pipeline for ChIPseq data Analysis
- ChIP, QCs, sequencing and generation of the original files
- Alignment against a reference genome (Eland)
- Conversion to GFF format in R
- Removal of artefacts and multiple matches
- Elongation of tags, merging of both strands and data binning
- Subtraction of the input or mock dataset, data normalisation
- Data analysis and visualisation
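As an illustration of the binning step, the Python sketch below averages per-nucleotide scores over fixed-size bins. The bin size and the choice of the mean as summary are assumptions; the slides do not specify them, and `bin_scores` is a hypothetical name.

```python
def bin_scores(scores, bin_size):
    """Average per-nucleotide scores over consecutive, non-overlapping bins."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    binned = []
    for offset in range(0, len(scores), bin_size):
        window = scores[offset:offset + bin_size]  # last bin may be shorter
        binned.append(sum(window) / len(window))
    return binned


binned = bin_scores([0, 1, 1, 2, 2, 1, 1, 1, 0, 0], bin_size=5)
print(binned)
```

Binning reduces the size of the data while keeping the overall shape of the enrichment profile.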
SLIDE 14
ChIPseq and ChIP-on-Chip
SLIDE 15
Recruitment
CTD phosphorylations and transcription
The CTD is a heptapeptide repeat (YSPTSPS)n of the largest Pol II subunit, conserved from yeast (26 repeats) to human (52 repeats).
?
Initiation Elongation (productive)
SLIDE 16
Core et al, Science 2008
TSS profiling of CTD and S5P overlaps with sense/antisense transcription
Binding level Pol II Binding around TSSs
SLIDE 17 K-means clustering of top 20% Pol II S5-P
Figure: clusters 1 to 10 (two panels); groups: right of TSS, centered, left of TSS
Clustering indicates several populations of initiating Pol II around TSS
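The clustering idea can be sketched with a plain Lloyd's k-means in Python. This is not the implementation used for the figure: the toy profiles, the three-bin layout, the squared Euclidean distance and k=3 are all assumptions made for illustration (the slide shows 10 clusters).

```python
def kmeans(profiles, k, iterations=20):
    """Plain Lloyd's k-means on equal-length profiles (lists of numbers).

    Initial centres are the first k profiles, which keeps the sketch
    deterministic; real analyses usually use several random starts.
    """
    centres = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: attach each profile to its nearest centre.
        for idx, profile in enumerate(profiles):
            distances = [
                sum((a - b) ** 2 for a, b in zip(profile, centre))
                for centre in centres
            ]
            labels[idx] = distances.index(min(distances))
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Toy Pol II profiles over three bins around a TSS (assumed values).
profiles = [
    [5, 1, 0],  # signal left of the TSS
    [0, 6, 1],  # signal centered on the TSS
    [0, 1, 5],  # signal right of the TSS
    [6, 0, 0],
    [1, 5, 0],
    [0, 0, 6],
]
labels, centres = kmeans(profiles, k=3)
print(labels)
```

Profiles with the same label share a similar position of the signal relative to the TSS, which is the kind of grouping the slide describes.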
SLIDE 18
Many thanks to…
PF lab, CIML Marseille: Romain Fenouil, Fred Koch, Pierre Cauchy, Pierre Ferrier
CNG Evry: Ivo Gut, Marta Gut
GSF Cancer Institute, Munich: Dirk Eick, Martin Heidemann, Corinna Hintermair