Statistical LeaRning Katja Nowick Bioinformatics group, Markus - - PowerPoint PPT Presentation

statistical learning
SMART_READER_LITE
LIVE PREVIEW

Statistical LeaRning Katja Nowick Bioinformatics group, Markus - - PowerPoint PPT Presentation

Statistical LeaRning Katja Nowick Bioinformatics group, Markus Kreuz IMISE What is R? Programming/scripting language Comprehensive statistical environment Strength : statistical data analysis + graphical display Why use R?


slide-1
SLIDE 1

Statistical LeaRning

Katja Nowick

Bioinformatics group,

Markus Kreuz

IMISE

slide-2
SLIDE 2

What is R?

  • Programming/scripting language
  • Comprehensive statistical environment
  • Strength: statistical data analysis

+ graphical display

slide-3
SLIDE 3

Why use R?

  • It's free!
  • Runs on a variety of platforms including

Windows, Unix and MacOS.

  • Complicated bioinformatics analyses made easy

by a huge collection of packages in Bioconductor

  • Potential to implement automated workflows
  • Big datasets
  • Advanced statistical routines
  • State-of-the-art graphics capabilities
slide-4
SLIDE 4

How to obtain and install R?

  • R can be downloaded from the Comprehensive R

Archive Network (CRAN): http://cran.r- project.org/

  • Installation instructions depend on your
  • perating system and should be accessible from

the R download page for you operating system.

  • For our course, R is already installed

We use R-studio as programming environment

slide-5
SLIDE 5

750 packages in Bioconductor

slide-6
SLIDE 6

Binding site detection

query(MotifDb, "DAL80") pfm.dal80.jaspar = query(MotifDb, "DAL80")[[1]] seqLogo(pfm.dal80.jaspar) dal1 = "YIR027C" chromosomal.loc = transcriptsBy(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene, by = "gene")[dal1] promoter.dal1 = getPromoterSeq(chromosomal.loc, Scerevisiae, upstream=1000, downstream=0) pcm.dal80.jaspar = round(100 * pfm.dal80.jaspar) matchPWM(pcm.dal80.jaspar, unlist(promoter.dal1)[[1]], "90%")

Finding binding sites for a transcription factor in the promoter of a gene With only 8 lines of code:

slide-7
SLIDE 7

Quality assessment of NGS data

files = list.files("fastq", full=TRUE) names(files) = sub(".fastq", "", basename(files)) qas = lapply(seq_along(files), function(i, files) qa(readFastq(files[i]), names(files)[i]), files) qa <- do.call(rbind, qas) save(qa, file=file.path("output", "qa.rda")) browseURL(report(qa))

With 6 lines of code: From a directory of FastQ files to a full quality report:

slide-8
SLIDE 8

Differential gene expression

Hcounts=read.table("segemehl.hg19.readCount", head=T, sep="\t", quote="", stringsAsFactor=FALSE, row.names="id") Ccounts=read.table("segemehl.panTro3.readCount", head=T, sep="\t", quote="", stringsAsFactor=FALSE, row.names="id") counts=cbind(Hcounts, Ccounts) colnames(counts)=c("H1", "H2", "H3", "C1", "C2", "C3") groups=c(0,0,0,1,1,1) counts_DESeq_obj=newCountDataSet(counts, groups) counts_DESeq_obj=estimateSizeFactors(counts_DESeq_obj) counts_DESeq_obj=estimateDispersions(counts_DESeq_obj) DESeq_DEgenes=nbinomTest(counts_DESeq_obj, "0", "1") plot(DESeq_DEgenes$baseMean,DESeq_DEgenes$log2FoldChange,log="x", pch=20, cex=.1,col=ifelse(DESeq_DEgenes$padj < 0.05, "red", "black" ) ) signDESeq_DEgenes=DESeq_DEgenes[DESeq_DEgenes$padj<0.05,] head(signDESeq_DEgenes[order(signDESeq_DEgenes$padj),])

From mapped and counted RNA-Seq data to differentially expressed genes

slide-9
SLIDE 9

Finding help

  • R mailing lists: https://stat.ethz.ch/mailman/listinfo/
  • Manuals and FAQs:

http://www.r-project.org/

  • Selected tutorials:

– http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html – http://www.statmethods.net/index.html – http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_Bi

  • CondManual.html
slide-10
SLIDE 10

Dec 3rd: Introduction to R Dec 10th: Statistics and Graphics Dec 17th: A small programming project Jan 14th: Analysis of gene expression data Jan 21st: Clustering and Gene Ontology

Goals for the next 5 x 5 hours

slide-11
SLIDE 11
  • R-Studio
  • R as a calculator (interactive R)
  • Variables: numeric, character, arrays, vectors, matrices
  • Loops
  • Apply
  • Conditional executions (if-else-statements)
  • Write your own functions

Multiple exercises in between

Goals for the first 5 hours

slide-12
SLIDE 12

Optional for today

If you are already familiar with R

  • Find all genes in the human genome that have

a binding site for the transcription factor EGR1:

  • Search for the EGR1 binding sites within the promoters of the human genes in a

window from 5000 bp upstream to 2000 bp downstream of the transcription start site

  • Which genes are targeted by EGR1 only in humans?
  • Search for EGR1 binding sites within the chimpanzee, orangutan, and rhesus

macaque genome to answer this question

  • What are the genes associated with the human specific binding sites?
  • Which genes are not targeted by EGR1 in humans?
  • Find out which binding sites are in the other three primate genomes but not in the

human genome

  • Which genes are associated with the human specifically lost binding sites?
slide-13
SLIDE 13
  • R packages
  • Help pages
  • Some more on functions
  • Graphics
  • Statistical tests

Multiple exercises in between

Goals for second 5 hours