Importing data Peter Humburg Statistician, Macquarie University - - PowerPoint PPT Presentation

importing data
SMART_READER_LITE
LIVE PREVIEW

Importing data Peter Humburg Statistician, Macquarie University - - PowerPoint PPT Presentation

DataCamp ChIP-seq Workflows in R CHIP - SEQ WORKFLOWS IN R Importing data Peter Humburg Statistician, Macquarie University DataCamp ChIP-seq Workflows in R DataCamp ChIP-seq Workflows in R Handling sequence reads Usually stored in B inary


slide-1
SLIDE 1

DataCamp ChIP-seq Workflows in R

Importing data

CHIP-SEQ WORKFLOWS IN R

Peter Humburg

Statistician, Macquarie University

slide-2
SLIDE 2

DataCamp ChIP-seq Workflows in R

slide-3
SLIDE 3

DataCamp ChIP-seq Workflows in R

Handling sequence reads

Usually stored in Binary Sequence Alignment/Map (BAM) format files. BAM record fields: Read name: SRR1782620.7265769 Binary flag: 0 Reference sequence name and position of alignment: chr20 29803915 Mapping quality: 0 CIGAR string (alignment summary): 51M Reference sequence and position of paired read (not used here): 0 0 Read sequence: AATGAAATGGAA ... Read quality (ASCII encoded): CCCFFFFFHHHH ...

slide-4
SLIDE 4

DataCamp ChIP-seq Workflows in R

Importing mapped reads into R

Use Rsamtools package to interact with BAM files.

Rsamtools provides functions for indexing, reading, filtering and writing of BAM

files. Use readGAlignments to import mapped reads. Returns GAlignments object.

library(GenomicAlignments) reads <- readGAlignments(bam_file)

slide-5
SLIDE 5

DataCamp ChIP-seq Workflows in R

Importing selected regions

Use BamViews to define regions of interest. Then import reads as before. The BamViews function supports multiple BAM files.

library(GenomicRanges) library(Rsamtools) ranges <- GRanges(...) views <- BamViews(bam_file, bamRanges=ranges) reads <- readGAlignments(views)

slide-6
SLIDE 6

DataCamp ChIP-seq Workflows in R

Importing peak calls

Use import.bed to load peak calls from a BED file. Use peaks to define views into the BAM files.

library(rtracklayer) peaks <- import.bed(peak_bed, genome="hg19") bams <- BamViews(bam_file, bamRanges=peaks) reads <- readGAlignments(bams)

slide-7
SLIDE 7

DataCamp ChIP-seq Workflows in R

Let's practice!

CHIP-SEQ WORKFLOWS IN R

slide-8
SLIDE 8

DataCamp ChIP-seq Workflows in R

Taking a closer look at peaks

CHIP-SEQ WORKFLOWS IN R

Peter Humburg

Statistician, Macquarie University

slide-9
SLIDE 9

DataCamp ChIP-seq Workflows in R

Using Gviz

slide-10
SLIDE 10

DataCamp ChIP-seq Workflows in R

What are we plotting?

slide-11
SLIDE 11

DataCamp ChIP-seq Workflows in R

What are we plotting?

slide-12
SLIDE 12

DataCamp ChIP-seq Workflows in R

What are we plotting?

slide-13
SLIDE 13

DataCamp ChIP-seq Workflows in R

What are we plotting?

slide-14
SLIDE 14

DataCamp ChIP-seq Workflows in R

Setting-up coordinates

library(Gviz) ideogram <- IdeogramTrack("chr12", "hg19") axis <- GenomeAxisTrack() plotTracks(list(ideogram, axis), from=101360000, to=101380000)

slide-15
SLIDE 15

DataCamp ChIP-seq Workflows in R

Adding Data

cover_track <- DataTrack(cover_ranges,window=100000,type='h',name="Coverage") plotTracks(list(ideogram, cover_track, axis), from=101360000, to=101380000)

slide-16
SLIDE 16

DataCamp ChIP-seq Workflows in R

Adding Annotations

peak_track <- AnnotationTrack(peaks, name="Peaks") plotTracks(list(ideogram, cover_track, peak_track, axis), from=101360000, to=101380000)

slide-17
SLIDE 17

DataCamp ChIP-seq Workflows in R

Gene Annotations

library(TxDb.Hsapiens.UCSC.hg19.knownGene) tx <- GeneRegionTrack(TxDb.Hsapiens.UCSC.hg19.knownGene, chromosome="chr12", start=101360000, end=101380000, name="Genes") plotTracks(list(ideogram, cover_track, peak_track, tx, axis), from=101360000, to=101380000)

slide-18
SLIDE 18

DataCamp ChIP-seq Workflows in R

Let's practice!

CHIP-SEQ WORKFLOWS IN R

slide-19
SLIDE 19

DataCamp ChIP-seq Workflows in R

Cleaning ChIP-seq data

CHIP-SEQ WORKFLOWS IN R

Peter Humburg

Statistician, Macquarie University

slide-20
SLIDE 20

DataCamp ChIP-seq Workflows in R

Common Problems

Incorrectly mapped reads may produce false peaks. Genomic repeats.

slide-21
SLIDE 21

DataCamp ChIP-seq Workflows in R

Common Problems

Incorrectly mapped reads may produce false peaks. Genomic repeats. Incomplete reference sequence.

slide-22
SLIDE 22

DataCamp ChIP-seq Workflows in R

Common Problems

Incorrectly mapped reads may produce false peaks. Genomic repeats. Incomplete reference sequence. Low complexity regions.

slide-23
SLIDE 23

DataCamp ChIP-seq Workflows in R

slide-24
SLIDE 24

DataCamp ChIP-seq Workflows in R

Amplification Bias

DNA fragments extracted from cells are copied multiple times prior to sequencing. Not all fragments produce the same number of copies. Multiple copies of the same fragment may be sequenced. A single DNA fragment may inflate coverage and lead to incorrect peak calls.

slide-25
SLIDE 25

DataCamp ChIP-seq Workflows in R

Quality Control Reports

library(ChIPQC) qc_report <- ChIPQC(experiment="sample_info.csv", annotation="hg19") ChIPQCreport(qc_report)

slide-26
SLIDE 26

DataCamp ChIP-seq Workflows in R

Preparing input files

SampleID Factor Condition Tissue Treatment bamReads Peaks PeakCaller S1 AR primary primary prostate tumor gleason score: 3+4=7 S1.bam S1.bed macs S2 AR primary primary prostate tumor gleason score: 3+4=7 S2.bam S2.bed macs ... ... ... ... ... ... ... ...

slide-27
SLIDE 27

DataCamp ChIP-seq Workflows in R

Cleaning the Data

Remove duplicate reads. Remove reads with multiple hits. Remove reads with low mapping quality. Remove peaks in blacklisted regions.

slide-28
SLIDE 28

DataCamp ChIP-seq Workflows in R

Cleaning the Data

Remove duplicate reads. Remove reads with multiple hits. Remove reads with low mapping quality. Remove peaks in blacklisted regions.

slide-29
SLIDE 29

DataCamp ChIP-seq Workflows in R

Cleaning the Data

Remove duplicate reads. Remove reads with multiple hits. Remove reads with low mapping quality. Remove peaks in blacklisted regions. Blacklisted regions are available from the ENCODE project.

slide-30
SLIDE 30

DataCamp ChIP-seq Workflows in R

Let's practice!

CHIP-SEQ WORKFLOWS IN R

slide-31
SLIDE 31

DataCamp ChIP-seq Workflows in R

Assessing enrichment

CHIP-SEQ WORKFLOWS IN R

Peter Humburg

Statistician, Macquarie University

slide-32
SLIDE 32

DataCamp ChIP-seq Workflows in R

slide-33
SLIDE 33

DataCamp ChIP-seq Workflows in R

slide-34
SLIDE 34

DataCamp ChIP-seq Workflows in R

Extending reads

Load the data: Obtain average fragment length: Extend reads and compute coverage:

reads <- readGAlignments(bam) reads_gr <- granges(reads[[1]]) frag_length <- fragmentlength(qc_report)["GSM1598218"] reads_ext <- resize(reads_gr, width=frag_length) cover_ext <- coverage(reads_ext)

slide-35
SLIDE 35

DataCamp ChIP-seq Workflows in R

slide-36
SLIDE 36

DataCamp ChIP-seq Workflows in R

Coverage for peaks

Create 200 bp bins along the genome. Find all bins overlapping peaks. Count the number of reads overlapping each peak bin.

bins <- tileGenome(seqinfo(reads), tilewidth=200, cut.last.tile.in.chrom=TRUE) peak_bins_overlap <- findOverlaps(bins, peaks) peak_bins <- bins[from(peak_bins_overlap), ] peak_bins$score <- countOverlaps(peak_bins, reads)

slide-37
SLIDE 37

DataCamp ChIP-seq Workflows in R

Binned coverage function

count_bins <- function(reads, target, bins){ # Find all bins overlapping peaks

  • verlap <- from(findOverlaps(bins, target))

target_bins <- bins[overlap, ] # Count the number of reads overlapping each peak bin target_bins$score <- countOverlaps(target_bins, reads) target_bins }

slide-38
SLIDE 38

DataCamp ChIP-seq Workflows in R

Coverage for blacklisted regions

peak_bins <- count_bins(reads_ext, peaks, bins) bl_bins <- count_bins(reads_ext, blacklist.hg19, bins)

slide-39
SLIDE 39

DataCamp ChIP-seq Workflows in R

Background coverage

Remove all bins already accounted for. Count number of reads overlapping with each remaining bin.

bkg_bins <- subset(bins, !bins %in% peak_bins & !bins %in% bl_bins) bkg_bins$score <- countOverlaps(bkg_bins, reads_ext)

slide-40
SLIDE 40

DataCamp ChIP-seq Workflows in R

slide-41
SLIDE 41

DataCamp ChIP-seq Workflows in R

slide-42
SLIDE 42

DataCamp ChIP-seq Workflows in R

slide-43
SLIDE 43

DataCamp ChIP-seq Workflows in R

Let's practice!

CHIP-SEQ WORKFLOWS IN R