Daria Merkurjev Workshop logis3cs Disparate backgrounds Some will - - PowerPoint PPT Presentation

daria merkurjev workshop logis3cs
SMART_READER_LITE
LIVE PREVIEW

Daria Merkurjev Workshop logis3cs Disparate backgrounds Some will - - PowerPoint PPT Presentation

Daria Merkurjev Workshop logis3cs Disparate backgrounds Some will find this class easy Others will find it hard Hints to more advanced topics: I can stay longer aBer class for those interested Ques3ons Of interest to everybody:


slide-1
SLIDE 1

Daria Merkurjev

slide-2
SLIDE 2

Workshop logis3cs

Disparate backgrounds

  • Some will find this class easy
  • Others will find it hard
  • Hints to more advanced topics:

I can stay longer aBer class for those interested Ques3ons

  • Of interest to everybody: any*me
  • Personal curiosity OR requiring long discussion:

before or a/er class

slide-3
SLIDE 3

Mission and expecta3ons

➢ the collaboratory wants you, the wet lab scien3st who did the hard work, to be able to play with the data you generated:

  • understand the vocabulary of your analyst/bioinforma3cian
  • speak the language of your fellow analyst to provide sugges3ons

based on your wet lab choices

  • analyze the data yourself

➢ we want you to learn how to drive a car, not to build one ➢ 100% command line interface

slide-4
SLIDE 4

Schedule

  • Day 1

– Introduc3on – Cross-correla3on analysis and ENCODE QC with SPP – BigWig tracks with defined resolu3on and normaliza3on using Homer/ UCSC tools, and visualiza3on with Genome Browser/IGV

  • Day 2

– Peak calling with MACS2 – QC of replicates with ENCODE’s IDR – Differen3al peak calling with MAnorm

  • Day 3

– Loca3on annota3on with NGSPLOT – Mo3f analysis with HOMER – Func3onal annota3on with GREAT – (Unix tricks, tools installa3on)

slide-5
SLIDE 5

ChIP workflow

slide-6
SLIDE 6
slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10

NORMALIZATION

slide-11
SLIDE 11

Icebreaker: ChIP-qPCR normaliza3on

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target) 1/4 1/2

slide-12
SLIDE 12

qPCR by equal volumes

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target)

= 1 abundance

in qPCR

“DNA of interest is enriched by 4 fold in B compared to A”

4 1 4 4

slide-13
SLIDE 13

Icebreaker: ChIP-qPCR normaliza3on

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target) 1/4 1/2

slide-14
SLIDE 14

qPCR by equal mass

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target)

= 1 abundance

in qPCR

1 1 1 2

“DNA of interest is enriched by 2 fold in B compared to A”

slide-15
SLIDE 15

Why should I care for ChIP-seq?

This is you This is your library

In ChIP-seq, you only normalize by mass:

  • 10 nM
  • n mul3plexed samples
  • balanced ra3os
slide-16
SLIDE 16

qPCR seq by equal mass

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target)

slide-17
SLIDE 17

qPCR seq by equal mass

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target) 100 50

slide-18
SLIDE 18

qPCR seq by equal mass

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target) 100 50 100 50

slide-19
SLIDE 19

qPCR seq by equal mass

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target) 100 50 100 50 100 50

slide-20
SLIDE 20

qPCR seq by equal mass

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target) 100 50 100 50 100 50 100 50

slide-21
SLIDE 21

qPCR seq by equal mass

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target) 100 50 100 50 100 50 100 50 “DNA of interest is enriched by 2 fold in B compared to A”

slide-22
SLIDE 22

Workshops 2: Sequencing

slide-23
SLIDE 23

Workshop 2: Pre-processing

Single-end 50 bp best compromise between informa3on and cost in ChIP-seq.

  • Convert QSEQ to FASTQ
  • Demul3plex > FASTQC
  • End trimming > FASTQC

– Hard trimming – Quality-based trimming (e.g. Sickle) – Adapter clipping (e.g. Cutadapt, Scythe, Trimmoma3c)

  • Alignment (e.g. BWA, bow3e, bow3e2)

– QC (e.g. samtools flagstat, Picard tools, Qualimap) – Remove non-mappers and mul3-mappers (samtools) – Duplicates: usually remove (samtools, Picard tools), but stay tuned

If you need to refresh these concepts, I am happy to stay longer aBer class

mixed-length reads! can your tool handle them?

slide-24
SLIDE 24

Quality Control

slide-25
SLIDE 25

Quality Control

slide-26
SLIDE 26

Quality Control

slide-27
SLIDE 27

Other Quality Control

slide-28
SLIDE 28

CROSS-CORRELATION QC

slide-29
SLIDE 29

Strand cross-correla3on

100 200 300 400 500 650 850 1000

slide-30
SLIDE 30

Cross-correla3on to es3mate fragment length

k read length = k cross-correla3on = fragment length Don’t I know already?

  • gel/Bioanalyzer

But bias may occur during:

  • IP
  • library prep
  • sequencing
  • PE reads

5’ read end is being shiBed

slide-31
SLIDE 31

Input vs IP

IP

slide-32
SLIDE 32

Cross-correla3on for QC

NSC: Normalized Strand cross-correla3on Coefficient RSC: Rela3ve Strand cross-correla3on Coefficient

slide-33
SLIDE 33

How good is your sample for ENCODE?

But this is affected by target, an3body, cell type, fragment distribu3on…

slide-34
SLIDE 34

Phantom peak

  • Removing blacklisted regions from your alignments makes sense
  • However, that invalidates SPP QC: be careful when building workflows
  • If anybody is interested, I can show how to remove blacklisted regions using bedtools
slide-35
SLIDE 35

VISUALIZATION

slide-36
SLIDE 36

Many tools for the job

hjp://homer.salk.edu/

HOMER

deepTools bamCoverage bedtools genomecov

slide-37
SLIDE 37
slide-38
SLIDE 38
slide-39
SLIDE 39
slide-40
SLIDE 40
slide-41
SLIDE 41
slide-42
SLIDE 42

BED format

slide-43
SLIDE 43

BED detail format

slide-44
SLIDE 44

Let’s prac3ce

Work from your samples folder, e.g. /u/scratch/r/rspreafi/samples mkdir homer module load samtools/1.2 module load homer makeTagDirectory homer/IPr1 -keepAll -fragLength 200 -precision 3 IPr1.bam >homer/IPr1_tag_output.txt 2>&1 & You can repeat this for all your samples… but not needed now. highest

slide-45
SLIDE 45

BigWig format

  • Used by UCSC Genome Browser
  • Standard de facto
  • hgdownload.cse.ucsc.edu/admin/exe/

tail -n +2 IPr1.bedGraph | sort -k1,1 -k2,2n > IPr1_sorted.bedGraph fetchChromSizes.sh mm9 >mm9.chrom.sizes bedGraphToBigWig IPr1_sorted.bedGraph mm9.chrom.sizes IPr1.bw

slide-46
SLIDE 46

Example of visualiza3on

ENCODE MACS2 (disregard) (disregard) genomecov lores, 1e6 hires, 1e6 hires, 1e8

slide-47
SLIDE 47

Retrieve your track and visualize it with IGV

slide-48
SLIDE 48

Here we go! Good job!

  • When comparing mul3ple tracks, set same scale
  • Remember to match genome assembly (e.g. mm9) between browser and mapping

➢ you will not get any error when you forget!

slide-49
SLIDE 49
slide-50
SLIDE 50
slide-51
SLIDE 51

Copy data into your scratch folder

Work from your scratch folder, e.g. /u/scratch/r/rspreafi

slide-52
SLIDE 52

Let’s prac3ce

Work from your homer folder, e.g. /u/scratch/r/rspreafi/samples/homer makeUCSCfile IPr1 -o IPr1.bedGraph -fragLength 200 -norm 1000000 -res 1 -fsize 1e20 |& tee IPr1_bedGraph.txt gunzip IPr1.bedGraph.gz head IPr1.bedGraph You can repeat this for all your samples… but not needed now. 1 bp

slide-53
SLIDE 53

Retrieve your analysis

slide-54
SLIDE 54

Schedule

  • Day 1

– Introduc3on – Cross-correla3on analysis and ENCODE QC with SPP – BigWig tracks with defined resolu3on and normaliza3on using Homer/ UCSC tools, and visualiza3on with IGV

  • Day 2

– Peak calling with MACS2 – QC of replicates with ENCODE’s IDR – Differen3al peak calling with MAnorm

  • Day 3

– Loca3on annota3on with NGSPLOT – Mo3f analysis with HOMER – Func3onal annota3on with GREAT – (Unix tricks, tools installa3on)

slide-55
SLIDE 55

PEAK CALLING

slide-56
SLIDE 56

Goal

  • Peaks: chr, start, end, score/p-value
  • Summits: chr, posi3on
slide-57
SLIDE 57

Many tools for the job

github.com/taoliu/MACS/

HOMER

SPP

PeakSeq

slide-58
SLIDE 58

Let’s prac3ce

Work from your samples folder, e.g. /u/scratch/r/rspreafi/samples mkdir macs2 module load python/2.7.3 macs2 callpeak -t IPr1.bam -c input.bam -f BAM -n IPr1 -g mm -q 0.01 --keep-dup 1

  • -call-summits --nomodel --extsize 200 --outdir macs2 >macs2/IPr1_output.txt 2>&1 &

macs2 callpeak -t IPr2.bam -c input.bam -f BAM -n IPr2 -g mm -q 0.01 --keep-dup 1

  • -call-summits --nomodel --extsize 200 --outdir macs2 >macs2/IPr2_output.txt 2>&1 &

Mouse: mm Human: hs Or size of genome in bp n auto all q-value threshold alterna3vely, -p 0.01

slide-59
SLIDE 59

Call a peak in your head

IP Input A B

Now tell me how you made your decision

slide-60
SLIDE 60

How MACS works

  • MACS compares IP vs control
  • Comparisons are formalized

using the Poisson distribu3on

  • MACS computes both global

and local Poisson distribu3ons This is why good ChIP prac3ces recommend sequencing input more deeply than IP samples

slide-61
SLIDE 61

MACS Q&A

  • Do you support mixed-length reads?

– Github manual

  • s/--tsize

The size of sequencing tags. If you don't specify it, MACS will try to use the first 10 sequences from your input treatment file to determine the tag size.

– Google Groups forum: does not make much difference

  • Do you support calling broad peaks (e.g. certain histone marks)?

– Yes, but only MACS2 (not MACS 1.4) – Op3ons: --broad --broad-cutoff 0.1 (not compa3ble with --call-summits) – Other peak callers are specifically designed for broad marks: SICER, RSEG, ZINBA

  • Where can I find documenta3on?

– The original paper (Genome Biology 2008 9:R137) deals with MACS v1 – MACS2 is sparsely documented on Github and Google Groups, yet referred to in several publica3on. – Improvements in cross-correla3on, peaks called in log space, broad peak detec3on, removed several biases

slide-62
SLIDE 62

narrowPeak output

BED format: 0-based, half closed [start-1, end) 1. Chr 2. Start 3. End 4. ID 5. Score (1-1000) 6. Strand 7. Fold change 8.

  • Log10(p)

9.

  • Log10(q)
  • 10. Summit posi3on, offset from start

Score

  • Log10(p)

Fold change Score Don’t forget accuracy!

slide-63
SLIDE 63

Fold change vs p-value: example from RNAseq

A = Average CPM (Log2) M = Fold Change (Log2) significant p-value

slide-64
SLIDE 64

EVALUATE REPLICATES

slide-65
SLIDE 65

(Not) many tools for the job

Irreproducibility Discovery Rate (IDR) from ENCODE

sites.google.com/site/anshulkundaje/projects/idr New version in progress! github.com/nboley/idr

slide-66
SLIDE 66

Let’s prac3ce

Work from your samples folder, e.g. /u/scratch/r/rspreafi/samples mkdir idr module load python/2.7.3 macs2 callpeak -t IPr1.bam -c input.bam -f BAM -n IPr1 -g mm -p 1e-3 --keep-dup 1

  • -call-summits --nomodel --extsize 200 --outdir idr >idr/IPr1_macs.txt 2>&1 &

macs2 callpeak -t IPr2.bam -c input.bam -f BAM -n IPr2 -g mm -p 1e-3 --keep-dup 1

  • -call-summits --nomodel --extsize 200 --outdir idr >idr/IPr2_macs.txt 2>&1 &

No3ce the relaxed threshold: we do want false posi3ves!

slide-67
SLIDE 67

The unsophis3cated approach

  • Call peaks in R1 and R2. Retain those that are significant in

both.

– Variant: call peaks in R1 with FDR correc3on and retain those also significant in R2 by unadjusted p-value, and viceversa.

  • Can we do bejer?
  • Follow you intui3on. These peaks come from the same two
  • replicates. Which do you trust more?

a. R1: q=10-4; R2: q=0.23 b. R1: q=10-5; R2: q=10-4 c. R1: q=10-9; R2: q=10-2

slide-68
SLIDE 68

Let’s prac3ce

Work from your idr folder, e.g. /u/scratch/r/rspreafi/samples/idr module load R/3.1.1 cp -Rv /u/local/apps/idr/2010-10/* . cp -v genome_tables/genome_table.mm9.txt genome_table.txt # say yes to

  • verwri*ng here!

sort -k8,8nr IPr1_peaks.narrowPeak | head -n 100000 >IPr1_sorted.narrowPeak sort -k8,8nr IPr2_peaks.narrowPeak | head -n 100000 >IPr2_sorted.narrowPeak Rscript batch-consistency-analysis.r IPr1_sorted.narrowPeak IPr2_sorted.narrowPeak -1 R1vsR2 0 F p.value F = narrowPeak T = broadPeak 0-1 frac3onal overlap do not change

slide-69
SLIDE 69

How IDR works: rankings

R1 R2

Rank #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 A B C D E F G H I L A B D C E M N O P F good consistency: retain bad consistency: discard

Alterna3ves?

  • digital yes/no (“unsophis3cated approach”): retain if significant in both replicates. Peak “F” would be retained

with this approach.

  • absolute p-value: depends on scale, which in turn depends on many factors (such as background, stay tuned).

Rankings put p-values on a rela3ve scale. And on rela3ve scales you can use anything: p-value, q-values, fold change… and from different algorithms too!

In common 1 2 2 4 5 5 5 5 5 6

slide-70
SLIDE 70

Good replicates

slide-71
SLIDE 71

Bad replicates

slide-72
SLIDE 72

Let’s prac3ce

Work from your idr folder, e.g. /u/scratch/r/rspreafi/samples/idr Rscript batch-consistency-plot.r 1 R1vsR2 R1vsR2 head R1vsR2-overlapped-peaks.txt

slide-73
SLIDE 73

Retrieve your data

slide-74
SLIDE 74

The complete IDR pipeline

sites.google.com/site/anshulkundaje/projects/idr

pick best max(Np, Nt) peaks

slide-75
SLIDE 75

DIFFERENTIAL PEAK CALLING

slide-76
SLIDE 76

Let’s prac3ce

➢ Let’s pretend that the two replicates are different treatment condi3ons and let’s call differen3al peaks Work from your sample folder, e.g. /u/scratch/r/rspreafi/samples/ module load bedtools/2.23.0 awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' macs2/IPr1_peaks.narrowPeak >IPr1_peaks.bed & awk 'BEGIN{OFS="\t"}{print $1,$2,$3}' macs2/IPr2_peaks.narrowPeak >IPr2_peaks.bed & bamToBed -i IPr1.bam | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$6}' >IPr1.bed & bamToBed -i IPr2.bam | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$6}' >IPr2.bed &

slide-77
SLIDE 77

Let’s prac3ce

➢ Let’s pretend that the two replicates are different treatment condi3ons and let’s call differen3al peaks Work from your sample folder, e.g. /u/scratch/r/rspreafi/samples/ mkdir manorm cd manorm cp /u/local/apps/manorm/2012/MAnorm_Linux_R_Package/MAnorm.* . chmod 755 MAnorm.sh ./MAnorm.sh ../IPr1_peaks.bed ../IPr2_peaks.bed ../IPr1.bed ../IPr2.bed 200 200 >output.txt 2>&1 &

slide-78
SLIDE 78

Break

slide-79
SLIDE 79

qPCR seq by equal mass

A B Input IP

DNA not of interest DNA of interest (PCR target) Protein of interest (IP target) 100 50 100 50 100 50 100 50 “DNA of interest is enriched by 2 fold in B compared to A”

slide-80
SLIDE 80

A spike in approach for normaliza3on

Nov 6, 2014

see also: www.ac3vemo3f.com/catalog/1063/chip-seq-spike-in

slide-81
SLIDE 81

An even worse problem: differences in background

Autoscaled Same scale

slide-82
SLIDE 82

The importance of S/N ra3o

Score R1 Score R2 Score R1 Score R2 y=x y=0.5x

Same S/N Different S/N

e.g. IP efficiency e.g. washing steps

slide-83
SLIDE 83

Es3ma3ng S/N ra3o

FRiP (Frac3on of Reads in Peaks) or SPOT (from HOTSPOT peak caller)

Depends on

  • Ab
  • background

Depends on

  • condi3on
  • background

Depends on background only

slide-84
SLIDE 84

Many tools for the job

edgeR DESeq MMDiff

(shapes)

PePr csaw

(RNA pol II) bcb.dfci.harvard.edu/ ~gcyuan/MAnorm

DiffBind

slide-85
SLIDE 85

Two ways

  • Compare peaks (e.g. edgeR, MAnorm)
  • Discard peaks, start fresh and compare signal

window by window (e.g. PePr, csaw)

slide-86
SLIDE 86

Remember your friend, the MA plot

A = Average CPM (Log2) M = Fold Change (Log2)

slide-87
SLIDE 87

How MAnorm works

Assumes that most common peaks have equal intensity Assumes that the trended bias fizng common peaks applies to unique peaks too Assumes that bias has linear

  • trend. Note that scaling may not

be enough

slide-88
SLIDE 88

Retrieve your data

slide-89
SLIDE 89

Retrieve your data

slide-90
SLIDE 90
slide-91
SLIDE 91

Schedule

  • Day 1

– Introduc3on – Cross-correla3on analysis and ENCODE QC with SPP – BigWig tracks with defined resolu3on and normaliza3on using Homer/ UCSC tools, and visualiza3on with IGV

  • Day 2

– Peak calling with MACS2 – QC of replicates with ENCODE’s IDR – Differen3al peak calling with MAnorm

  • Day 3

– Loca3on annota3on with NGSPLOT – Mo3f analysis with HOMER – Func3onal annota3on with GREAT – (Unix tricks, tools installa3on)

slide-92
SLIDE 92

LOCATION-BASED ANALYSIS

slide-93
SLIDE 93

Example

slide-94
SLIDE 94

Let’s prac3ce

Work from your sample folder, e.g. /u/scratch/r/rspreafi/samples/ mkdir ngsplot module load ngsplot ngs.plot.r -G mm9 -SS both -FL 200 -R tss -L 5000 -C IPr1.bam -O ngsplot/IPr1_tss - F chipseq -T TSS hjps://github.com/shenlab-sinai/ngsplot/wiki/ProgramArguments101 both strands examine 5kb upstream and downstream the TSS

slide-95
SLIDE 95

Many tools for the job

deepTools

ngsplot

CEAS

code.google.com/p/ngsplot/ ChIPpeakAnno

slide-96
SLIDE 96
  • C sample1.bam:sample2.bam
slide-97
SLIDE 97
  • C config.txt

(two .bam files, same custom bed regions)

A B

slide-98
SLIDE 98
  • C config.txt

(same .bam file, different gene lists)

slide-99
SLIDE 99

Retrieve your data

slide-100
SLIDE 100

MOTIF ANALYSIS

slide-101
SLIDE 101

Let’s prac3ce

Work from your sample folder, e.g. /u/scratch/r/rspreafi/samples/ cp /u/scratch/r/rspreafi/samples/IP_mo*f.bam . module load python/2.7.3 macs2 callpeak -t IP_mo*f.bam --nolambda -f BAM -n IP_mo*f -g mm -q 0.01 -- keep-dup 1 --call-summits --nomodel --extsize 70 --outdir macs2 >macs2/ IP_mo*f_macs.txt 2>&1 &

slide-102
SLIDE 102

Some tools for the job

hjp://homer.salk.edu/

HOMER

slide-103
SLIDE 103

Let’s prac3ce

Work from your sample folder, e.g. /u/scratch/r/rspreafi/samples/ module load homer findMo*fsGenome.pl macs2/IP_mo*f_peaks.narrowPeak mm10 homer/IP_mo*f - mask -len 8 -S 5 -mis 2 -size 100 -preparsedDir homer/preparsed/mm10 >homer/ IP_mo*f_output.txt 2>&1 & mask repeats # mismatches higher = more sensi3ve, but slower # de novo mo3fs to find # bases for each peak; or

  • size given
slide-104
SLIDE 104

Overview

known de novo

slide-105
SLIDE 105

HOMER's findMo3fsGenome.pl

slide-106
SLIDE 106

HOMER's annotatePeaks.pl

  • m -hist
  • m
  • go
slide-107
SLIDE 107

Iden3fying mo3fs

Two separate steps: finding a mobf in foreground does not mean it is enriched over background!

1 2

slide-108
SLIDE 108

Defining a mo3f

posi*ons count % ln(%/0.25)

slide-109
SLIDE 109

Spozng mo3fs in sequences

slide-110
SLIDE 110

Is AGGGTACAT related to my mo3f?

0.18 - 0.91 + 1.02 + 1.38 + 1.38 + 0.87 - 0.91 - 0.22 + 0.87 = 3.66

  • > 0 more likely to be func3onal site (related to mo3f)
  • < 0 more likely to be random site (not related to mo3f)
  • = 0 equal probability of being either func3onal or random
  • thresholds are determined by HOMER to maximize enrichment over background

(de novo) or specified in internal DB (known) need to correct, e.g. pseudocounts

slide-111
SLIDE 111

Is mo3f enriched in IP'd regions?

slide-112
SLIDE 112

Hypergeometric distribu3on

Males Females UCLA students 5,000 5,000 World popula3on 3 billion 3 billion

slide-113
SLIDE 113

Hypergeometric distribu3on

Males Females UCLA students 5,000 5,000 World popula3on 3 billion 3 billion Males Females UCLA students 7,000 3,000 World popula3on 3 billion 3 billion

slide-114
SLIDE 114

Hypergeometric distribu3on

Males Females UCLA students 5,000 5,000 World popula3on 3 billion 3 billion Males Females UCLA students 7,000 3,000 World popula3on 3 billion 3 billion Mobf present Mobf absent Peaks 20 100 Genome (bg) 200 20,000

“not males”

slide-115
SLIDE 115

Hypergeometric vs Binomial

Drawn Not drawn Green marbles 20 100 Red marbles 180 19,900

  • Note the different (but equivalent) nota3on: two classes rather than one class

and total

  • Is sampling with replacement?
  • Yes = Binomial
  • No = Hypergeometric
  • HOMER suggests that the hypergeometric distribu3on may make more sense for

mo3f enrichment applica3ons

  • By default, however, it uses the binomial distribu3on because it is faster to

compute, and the error is small if background is large 120 20,080 200 20,000

slide-116
SLIDE 116

Background selec3on

  • From genomic regions (TSS +/- 50kb)

– But you can provide a custom background

  • With features similar to query sequences:

– %GC or %CpG – 2-mers, 3-mers

Males Females UCLA students 7,000 3,000 World popula3on 3 billion 3 billion

California?

slide-117
SLIDE 117

Retrieve your results: de novo mo3fs

slide-118
SLIDE 118

Retrieve your results: known mo3fs

slide-119
SLIDE 119

FUNCTIONAL ENRICHMENT

slide-120
SLIDE 120

Many tools for the job

bejerano.stanford.edu/great/public/html/

HOMER

slide-121
SLIDE 121

Nothing new under the sun

  • This is standard GO analysis, aka

gene set analysis

  • In transcriptomics, many tools:

– threshold-based (e.g. DAVID) – ranking-based (e.g. GSEA)

  • In ChIP-seq, just an extra step:

associate peaks to genes by proximity, then use genes to perform standard GO analysis

slide-122
SLIDE 122

Why GREAT?

  • Easy (web interface)
  • Great help
  • Ac3vely maintained
  • Smarter peak-gene associa3on rules
  • Queries many databases (ontologies)

simultaneously, not only GO

slide-123
SLIDE 123

You can use GREAT to associate genomic regions to genes

slide-124
SLIDE 124

Déjà vu

In pathway Not in pathway Genes with peaks 20 100 All genes 200 20,000 Mobf present Mobf absent Peaks 20 100 Genome (bg) 200 20,000

slide-125
SLIDE 125

Let’s prac3ce

Work from your macs2 folder, e.g. /u/scratch/r/rspreafi/samples/macs2 sort -k8,8nr IP_mo*f_peaks.narrowPeak | awk 'BEGIN{OFS="\t"}{print $1,$2,$3,$4}' | head -n 5000 >IP_mo*f_peaks.bed

  • Go to bejerano.stanford.edu/great/public/html/
  • Select:

– mouse mm10 – test regions: BED file, upload IP_mo3f_peak.bed – background regions: whole genome

  • be fair when defining background

– "basal plus extension" associa3on rule

slide-126
SLIDE 126

Results

bejerano.stanford.edu/help/display/GREAT/Sta3s3cs

slide-127
SLIDE 127

Results

slide-128
SLIDE 128

PE reads

  • With PE reads you know the fragment length, so you do not need to run cross-correla3on
  • MACS2 has a -f BAMPE op3on

– but it then uses only the mean fragment size

  • ngsplot automa3cally extends reads by using fragment size

– undocumented, as per Google Groups Q&A

  • MAnorm does not have dedicated op3ons
  • HOMER deals with PE reads

– but you need to specify -sspe (and possibly -flip) if the library is strand-specific (usually it is not in ChIP-seq)

  • when PE-specific op3ons are not available, you can always approximate by trea3ng PE reads as SE

reads

– Loss of fragment length informa3on – Fragments get double counted (1 count for each end)

  • look at manuals for more details on these op3ons
slide-129
SLIDE 129

The bioinforma3cs dilemma

slide-130
SLIDE 130

Do you feel you have a lot to digest?

No worries, it’s normal!

slide-131
SLIDE 131

Leave no trace (and protect your privacy)

  • Delete anything that remembers your

password from shared computers

– e.g. Cyberduck

  • Sign out of Gmail if you used it
  • Delete personal documents
slide-132
SLIDE 132

Please take 5 min for a brief survey

Your feedback is important to us: www.surveymonkey.com/r/w7april Thank you for coming! We hope you enjoyed it!