SLIDE 1

A method for high throughput sequencing data analysis: application for mapping genome-wide protein-DNA binding sites (ChIPseq)

JC Andrau, Biostat, 15/01/2010


SLIDE 2

High-throughput sequencing applications

Protein-DNA interaction
– Epigenetic mark mapping and identification of regulatory sequences of gene expression (ChIP-seq)

Genome sequencing
– Human gene mapping
– Qualitative (SNP) and quantitative (amplification) genetic variations
– De novo sequencing of model organisms and pathogens

Transcriptome (RNAseq)
– Identification and analysis of non-coding RNAs (miRNA, etc.)
– Monitoring gene expression, covering all the alternative messengers of a given locus in a variety of contexts

SLIDE 3

ChIP-seq: Solexa procedure

PCR + size exclusion (gel extraction)
Loading in flowcell and cluster amplification
Image acquisition and base calling

SLIDE 4

Sequencing and alignment

Sequencing extremities of DNA fragments

RAW data files (sequences)
Aligned against a reference genome
– MAQ
– Solexa…

SLIDE 5

First steps of data analysis

SLIDE 6

First steps of data analysis

SLIDE 7

DNA fragments vs. sequences

Only the extremities of the DNA fragments are sequenced

Enriched regions do not represent the exact binding site

In silico process to elongate the tags

[Figure: tags on the + strand and the - strand flanking the binding site]
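The in silico elongation mentioned on this slide can be sketched as follows. This is only an illustration, not the author's implementation: the function name `elongate_tag`, the 0-based end-exclusive coordinates and the `fragment_length` parameter are all assumptions. Each tag is extended from its 5' end in the direction of its strand until it reaches an assumed fragment length, so that the + and - strand tags converge on the binding site.

```python
from typing import Tuple

# Hypothetical tag representation: (start, end, strand), 0-based, end-exclusive.
Tag = Tuple[int, int, str]


def elongate_tag(tag: Tag, fragment_length: int, chrom_length: int) -> Tuple[int, int]:
    """Extend a sequenced tag to an assumed full fragment length.

    A + strand tag is extended to the right of its start; a - strand tag is
    extended to the left of its end. The result is clipped to the chromosome.
    """
    start, end, strand = tag
    if strand == "+":
        new_start, new_end = start, start + fragment_length
    elif strand == "-":
        new_start, new_end = end - fragment_length, end
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return max(0, new_start), min(chrom_length, new_end)


if __name__ == "__main__":
    # Two 36 bp tags from opposite strands, elongated to 200 bp.
    print(elongate_tag((100, 136, "+"), 200, 10_000))
    print(elongate_tag((500, 536, "-"), 200, 10_000))
```

The fragment length itself is not given on the slide; in practice it would come from the size selection step or from an estimate made on the data.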

SLIDE 8

Elongation process

[Figure: + strand and - strand; axes: shifting (bp) and overlap]
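The labels on this slide (shifting in bp, overlap) suggest that the elongation length is chosen by shifting one strand against the other and measuring how well the two overlap. The sketch below illustrates that idea under stated assumptions: the function `best_shift`, its inputs (5' positions of + strand tags and of - strand tags) and the simple "count of matching positions" overlap measure are hypothetical and are not taken from the presentation.

```python
from typing import Iterable, Tuple


def best_shift(plus_positions: Iterable[int],
               minus_positions: Iterable[int],
               max_shift: int) -> Tuple[int, int]:
    """Return (shift, overlap) for the shift that maximises strand overlap.

    For each candidate shift, overlap is the number of + strand positions
    that land exactly on a - strand position once moved right by `shift`.
    Ties are resolved in favour of the smallest shift.
    """
    plus = list(plus_positions)
    minus = set(minus_positions)
    best, best_overlap = 0, -1
    for shift in range(max_shift + 1):
        overlap = sum(1 for p in plus if p + shift in minus)
        if overlap > best_overlap:
            best, best_overlap = shift, overlap
    return best, best_overlap


if __name__ == "__main__":
    plus = [100, 260, 430]
    minus = [250, 410, 580, 999]  # each + position has a partner 150 bp away
    print(best_shift(plus, minus, max_shift=300))
```

A real dataset would need a tolerant overlap measure (for example, coverage correlation) rather than exact position matches; the exact-match count is used here only to keep the example short.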

SLIDE 9

Score per nucleotide

SLIDE 10

Score per nucleotide
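The slides do not define the score, so the following is a sketch under one common assumption: the score at a nucleotide is the number of elongated fragments that cover it. The function name `score_per_nucleotide` and the (start, end) end-exclusive fragment representation are illustrative choices.

```python
from typing import Iterable, List, Tuple


def score_per_nucleotide(fragments: Iterable[Tuple[int, int]],
                         chrom_length: int) -> List[int]:
    """Count, for every position, how many fragments cover it.

    Fragments are (start, end) pairs, 0-based and end-exclusive, and must lie
    within [0, chrom_length]. A difference array keeps the cost linear in the
    number of fragments plus the chromosome length.
    """
    diff = [0] * (chrom_length + 1)
    for start, end in fragments:
        diff[start] += 1   # coverage rises at the fragment start
        diff[end] -= 1     # and falls just past its last base
    scores = []
    running = 0
    for delta in diff[:chrom_length]:
        running += delta
        scores.append(running)
    return scores


if __name__ == "__main__":
    # Two overlapping fragments on a 6 bp toy chromosome.
    print(score_per_nucleotide([(0, 3), (2, 5)], 6))
```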

SLIDE 11

Further analysis

SLIDE 12

Artefact removal and normalisation

An input experiment helps to localise problematic regions in the alignment (duplications, reference genome…)
– We shouldn't see enrichment in the input
– These regions were removed from all datasets

Based on the average of the scores across the whole genome, we can estimate the background (BG) level and then rescale all experiments according to this level

The last step consists of subtracting the input from the datasets in order to reduce the effects of variation and the background in the data
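The three steps on this slide can be sketched on per-nucleotide score lists. Everything named here is an assumption made for illustration: the function names, the choice to set masked regions to zero, the target background of 1.0, and the flooring of negative values at zero after subtraction are not stated in the presentation.

```python
from typing import Iterable, List, Sequence, Tuple


def mask_regions(scores: Sequence[float],
                 regions: Iterable[Tuple[int, int]]) -> List[float]:
    """Zero out artefact regions (0-based, end-exclusive) found in the input."""
    masked = list(scores)
    for start, end in regions:
        for i in range(max(0, start), min(len(masked), end)):
            masked[i] = 0.0
    return masked


def rescale_to_background(scores: Sequence[float],
                          target_background: float = 1.0) -> List[float]:
    """Rescale so that the genome-wide mean score equals the target level."""
    background = sum(scores) / len(scores)  # mean score used as the BG estimate
    if background == 0:
        raise ValueError("cannot rescale a dataset with zero mean score")
    factor = target_background / background
    return [s * factor for s in scores]


def subtract_input(chip: Sequence[float],
                   input_scores: Sequence[float]) -> List[float]:
    """Subtract the input track position by position (assumption: floor at 0)."""
    return [max(0.0, c - i) for c, i in zip(chip, input_scores)]


if __name__ == "__main__":
    chip = rescale_to_background([2.0, 2.0, 4.0, 8.0])
    control = rescale_to_background([1.0, 1.0, 1.0, 1.0])
    print(subtract_input(chip, control))
```

Rescaling both the ChIP and the input track to the same background level before subtraction is what makes the difference comparable between experiments with different sequencing depths.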

SLIDE 13

Pipeline for ChIPseq data analysis

ChIP, QCs, sequencing and original file genesis
Alignment against a reference genome (Eland)
Conversion to GFF format in R
Artefact and multiple-match removal
Elongation of tags, merging of both strands and data binning
Input or mock dataset subtraction, data normalisation
Data analysis and visualisation
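Of the pipeline steps listed above, data binning is the one not illustrated elsewhere, so here is a minimal sketch of it. It is written in Python for illustration, although the slide indicates that the original pipeline used R; the function name `bin_scores` and the choice of the mean as the per-bin summary are assumptions.

```python
from typing import List, Sequence


def bin_scores(scores: Sequence[float], bin_size: int) -> List[float]:
    """Summarise per-nucleotide scores as the mean over fixed-size bins.

    The last bin may be shorter than `bin_size`; its mean is taken over the
    positions it actually contains.
    """
    if bin_size <= 0:
        raise ValueError("bin_size must be a positive integer")
    binned = []
    for start in range(0, len(scores), bin_size):
        window = scores[start:start + bin_size]
        binned.append(sum(window) / len(window))
    return binned


if __name__ == "__main__":
    # Strand-merged per-nucleotide scores, reduced to 2 bp bins.
    print(bin_scores([1, 1, 2, 1, 1, 0], 2))
```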

SLIDE 14

ChIPseq and ChIP-on-Chip

SLIDE 15

CTD phosphorylations and transcription

The CTD is a heptapeptide repeat (Y S P T S P S)n of the largest Pol II subunit, conserved from yeast (26x) to human (52x).

[Figure labels: Recruitment, ?, Initiation, Elongation (productive)]

SLIDE 16

Core et al, Science 2008

TSS profiling of CTD and S5P overlaps with sense/antisense transcription

[Figure: Pol II binding around TSSs; axis: binding level]

SLIDE 17

K-means clustering of the top 20% Pol II S5-P

[Figure: clusters 1 to 10, shown twice; groups: Right to TSS, Centered, Left to TSS]

Clustering indicates several populations of initiating Pol II around TSS
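To show what k-means clustering of binding profiles around the TSS involves, here is a small self-contained sketch. It is not the analysis behind the slide: the toy profiles, the function names and the use of explicit initial centroids (chosen so the run is deterministic) are all assumptions. Each profile is a short vector of binding scores at positions left of, at, and right of the TSS.

```python
from typing import List, Sequence, Tuple

Profile = Sequence[float]


def squared_distance(a: Profile, b: Profile) -> float:
    """Squared Euclidean distance between two profiles of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(profiles: Sequence[Profile]) -> List[float]:
    """Position-wise mean of a non-empty group of profiles."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]


def k_means(profiles: Sequence[Profile],
            initial_centroids: Sequence[Profile],
            max_iterations: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm: alternate assignment and centroid update steps.

    Stops when the assignment no longer changes or after `max_iterations`.
    A cluster that loses all members keeps its previous centroid.
    """
    centroids = [list(c) for c in initial_centroids]
    labels: List[int] = []
    for _ in range(max_iterations):
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in profiles
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [p for p, lab in zip(profiles, labels) if lab == j]
            if members:
                centroids[j] = mean_profile(members)
    return labels, centroids


if __name__ == "__main__":
    # Toy profiles: signal left of the TSS, centred on it, or right of it.
    profiles = [[3, 1, 0], [4, 1, 0], [0, 4, 0], [1, 5, 1], [0, 1, 3], [0, 1, 4]]
    labels, centroids = k_means(profiles, [[3, 1, 0], [0, 4, 0], [0, 1, 3]])
    print(labels)
    print(centroids)
```

With real data the initial centroids are usually drawn at random and the run is repeated several times, since k-means only finds a local optimum.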

SLIDE 18

PF lab, CIML Marseille: Romain Fenouil, Fred Koch, Pierre Cauchy, Pierre Ferrier
CNG Evry: Ivo Gut, Marta Gut
GSF Cancer Institute, Munich: Dirk Eick, Martin Heidemann, Corinna Hintermair

Many thanks to…