
A method for high throughput sequencing data analysis: application

A method for high throughput sequencing data analysis: application for mapping genome-wide protein-DNA binding sites (ChIPseq). JC Andrau, Biostat, 15/01/2010.


  1. A method for high throughput sequencing data analysis: application for mapping genome-wide protein-DNA binding sites (ChIPseq)
     JC Andrau, Biostat, 15/01/2010
     [Figure: bases T G C T A C G A T at positions 1 to 9]

  2. High throughput sequencing applications
     • Genome sequencing
       – Human gene mapping
       – Qualitative (SNP) and quantitative (amplification) genetic variations
       – De novo sequencing of model organisms and pathogens
     • Transcriptome (RNAseq)
       – Identification and analysis of non-coding RNAs (miRNA, etc.)
       – Monitoring gene expression, covering all the alternative messengers to a given locus in a variety of contexts
     • Protein-DNA interaction (ChIP-seq)
       – Epigenetic marks mapping and identification of regulatory sequences of gene expression

  3. ChIP-seq: Solexa procedure
     • PCR + size exclusion (gel extraction)
     • Loading in flowcell and cluster amplification
     • Image acquisition and base calling

  4. Sequencing and alignment
     • Sequencing extremities of DNA fragments
     • RAW data files (sequences)
     • Aligned against a reference genome
       – MAQ
       – Solexa…

  5. First steps of data analysis

  6. First steps of data analysis

  7. DNA fragments vs sequences
     • Only the extremities of DNA fragments are sequenced
     • Enriched regions don't represent the exact binding site
     • In-silico process to elongate the tags
     [Figure: binding site with + strand and - strand tags]

  8. Elongation process
     [Figure: overlap between + strand and - strand tags as a function of shifting (bp)]
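The elongation and shifting steps can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: the function names `extend_tag` and `estimate_shift`, the half-open coordinate convention, and the idea of picking the shift that maximises the overlap between + strand and - strand tag positions are all assumptions made here to mirror the "overlap vs shifting (bp)" plot.

```python
from collections import Counter


def extend_tag(pos, strand, fragment_len):
    """Elongate a tag to the assumed fragment length.

    pos is the 0-based position of the read's 5' end. Returns a half-open
    (start, end) interval: '+' tags grow rightwards, '-' tags leftwards.
    """
    if strand == "+":
        return pos, pos + fragment_len
    if strand == "-":
        return max(0, pos - fragment_len + 1), pos + 1
    raise ValueError("strand must be '+' or '-'")


def estimate_shift(plus_starts, minus_starts, max_shift=300):
    """Return the shift (bp) that maximises + / - strand overlap.

    For each candidate shift, count how many '+' tag positions land exactly
    on a '-' tag position once moved by that shift (a crude overlap score).
    Ties are resolved in favour of the smallest shift.
    """
    minus_counts = Counter(minus_starts)
    best_shift, best_overlap = 0, -1
    for shift in range(max_shift + 1):
        overlap = sum(minus_counts[p + shift] for p in plus_starts)
        if overlap > best_overlap:
            best_shift, best_overlap = shift, overlap
    return best_shift


# Toy data: '-' strand tags sit 150 bp downstream of '+' strand tags.
plus_tags = [100, 200, 300]
minus_tags = [250, 350, 450]
shift = estimate_shift(plus_tags, minus_tags)
plus_interval = extend_tag(100, "+", 200)
minus_interval = extend_tag(500, "-", 200)
```

On real data the overlap would normally be computed on coverage profiles rather than exact position matches; the exact-match score is used here only to keep the sketch short.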

  9. Score per nucleotide

  10. Score per nucleotide
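One plausible way to obtain a score per nucleotide is to count, at each position, how many elongated tags cover it. The sketch below assumes half-open `(start, end)` intervals and a hypothetical `coverage` helper; the optional `bin_scores` step reflects the "data binning" mentioned in the pipeline slide, with averaging chosen here as an assumption.

```python
def coverage(intervals, chrom_len):
    """Per-nucleotide score: number of elongated tags covering each position.

    Uses a difference array so the cost is O(tags + chrom_len).
    Intervals are half-open (start, end) and are clipped to the chromosome.
    """
    diff = [0] * (chrom_len + 1)
    for start, end in intervals:
        start = max(0, start)
        end = min(chrom_len, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1
    scores = []
    running = 0
    for i in range(chrom_len):
        running += diff[i]
        scores.append(running)
    return scores


def bin_scores(scores, bin_size):
    """Average the per-nucleotide scores over consecutive bins."""
    binned = []
    for i in range(0, len(scores), bin_size):
        chunk = scores[i:i + bin_size]
        binned.append(sum(chunk) / len(chunk))
    return binned


# Two overlapping elongated tags on a 6 bp toy chromosome.
scores = coverage([(0, 3), (2, 5)], 6)
binned = bin_scores(scores, 2)
```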

  11. Further analysis

  12. Artefacts removal and normalisation
      • An input experiment helps to localize problematic regions in the alignment (duplications, reference genome…)
        – We shouldn't see enrichment in the input
        – These regions were removed from all datasets
      • Based on the average of the scores over the whole genome, we can estimate the background (BG) level and then rescale all experiments according to this level
      • The last step consists of subtracting the input from the datasets in order to reduce the effects of variation and the background in the data
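The three steps above (artefact removal, rescaling to the background level, input subtraction) might look like the sketch below. Several details are assumptions not stated on the slide: the function name `normalise`, the use of a fixed `artefact_threshold` on the input score to flag problematic positions, marking removed positions with `None`, and rescaling each dataset so that its genome-wide mean equals 1.

```python
def normalise(chip, control, artefact_threshold):
    """Mask artefacts, rescale to the background level, subtract the input.

    chip and control are per-position scores of equal length. Positions where
    the input (control) exceeds artefact_threshold are treated as artefacts
    and returned as None.
    """
    if len(chip) != len(control):
        raise ValueError("chip and control must have the same length")
    keep = [c <= artefact_threshold for c in control]
    if not any(keep):
        raise ValueError("every position was flagged as an artefact")

    # Background level estimated as the mean score over retained positions.
    n_kept = sum(keep)
    chip_bg = sum(s for s, k in zip(chip, keep) if k) / n_kept
    control_bg = sum(c for c, k in zip(control, keep) if k) / n_kept
    if chip_bg == 0 or control_bg == 0:
        raise ValueError("background level is zero; cannot rescale")

    result = []
    for s, c, k in zip(chip, control, keep):
        if not k:
            result.append(None)  # region removed from the dataset
        else:
            result.append(s / chip_bg - c / control_bg)
    return result


# Toy data: position 4 is enriched in the input, so it is removed.
chip_scores = [2, 2, 10, 2, 50, 2]
input_scores = [1, 1, 1, 1, 40, 1]
normalised = normalise(chip_scores, input_scores, artefact_threshold=10)
```

After this step, positions with a true ChIP enrichment stay positive while background positions fall around or below zero.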

  13. Pipeline for ChIPseq data analysis
      • ChIP, QCs, sequencing and original file genesis
      • Alignment against a reference genome (Eland)
      • Artefact and multiple matches removal
      • Conversion to GFF format in R
      • Elongation of tags, merge of both strands and data binning
      • Input or mock data set subtraction, data normalisation
      • Data analysis and visualisation
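The slide states that the conversion to GFF was done in R; the Python sketch below only illustrates what one GFF record looks like. GFF lines have nine tab-separated columns (seqname, source, feature, start, end, score, strand, frame, attribute) with 1-based inclusive coordinates. The function name `gff_line` and the `source` and `feature` labels are hypothetical.

```python
def gff_line(chrom, start, end, score, strand,
             source="ChIPseq", feature="elongated_tag"):
    """Format one interval as a GFF record.

    start/end are given as a 0-based half-open interval and converted to the
    1-based inclusive coordinates that GFF expects. Frame and attribute are
    left as '.' because they do not apply here.
    """
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    fields = [chrom, source, feature, str(start + 1), str(end),
              f"{score:g}", strand, ".", "."]
    return "\t".join(fields)


record = gff_line("chr1", 100, 300, 12.5, "+")
```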

  14. ChIPseq and ChIP-on-Chip

  15. CTD phosphorylations and transcription
      The CTD is a heptapeptide repeat (YSPTSPS)n of the largest Pol II subunit, conserved from yeast (26x) to human (52x).
      [Figure labels: ?, Recruitment, Initiation, Elongation (productive)]

  16. TSS profiling of CTD and S5P overlaps with sense/antisense transcription
      [Figure: Pol II binding around TSSs; axis: binding level. Core et al, Science 2008]

  17. Clustering indicates several populations of initiating Pol II around TSS
      K-means clustering of top 20% Pol II S5-P
      [Figure: clusters 1 to 10, grouped as right of TSS, centered, and left of TSS]
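A rough sketch of the clustering step: keep the top fraction of TSS profiles by total signal, then group them with K-means. The slide gives no implementation details, so the following are assumptions: the helper names, ranking profiles by summed signal, squared Euclidean distance, and a deterministic farthest-point initialisation (chosen here for reproducibility, not because the source uses it).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_fraction(profiles, fraction=0.2):
    """Keep the given fraction of profiles with the highest total signal."""
    n_keep = max(1, round(len(profiles) * fraction))
    return sorted(profiles, key=sum, reverse=True)[:n_keep]


def farthest_point_init(profiles, k):
    """Deterministic seeding: start from the first profile, then repeatedly
    add the profile farthest from its nearest already-chosen centroid."""
    centroids = [list(profiles[0])]
    while len(centroids) < k:
        nxt = max(profiles,
                  key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(profiles, k, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_point_init(profiles, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in profiles]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


# Toy profiles over 3 bins (left of TSS, centered, right of TSS).
toy_profiles = [
    [5, 1, 0], [0, 5, 0], [0, 1, 5],
    [6, 1, 0], [1, 6, 0], [0, 1, 6],
]
labels, centroids = kmeans(toy_profiles, k=3)
strongest = top_fraction([[1], [5], [3], [2], [4]], fraction=0.4)
```

In the toy data, the two left-shifted, two centered, and two right-shifted profiles each end up sharing a cluster label.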

  18. Many thanks to…
      • PF lab, CIML Marseille: Romain Fenouil, Fred Koch, Pierre Cauchy, Pierre Ferrier
      • CNG Evry: Ivo Gut, Marta Gut
      • GSF Cancer Institute, Munich: Dirk Eick, Martin Heidemann, Corinna Hintermair
