statistical learning
play

Statistical LeaRning Katja Nowick Bioinformatics group, Markus - PowerPoint PPT Presentation

Statistical LeaRning Katja Nowick Bioinformatics group, Markus Kreuz IMISE What is R? Programming/scripting language Comprehensive statistical environment Strength : statistical data analysis + graphical display Why use R?


  1. Statistical LeaRning Katja Nowick Bioinformatics group, Markus Kreuz IMISE

  2. What is R? • Programming/scripting language • Comprehensive statistical environment • Strength : statistical data analysis + graphical display

  3. Why use R? • It's free ! • Runs on a variety of platforms including Windows, Unix and MacOS. • Complicated bioinformatics analyses made easy by a huge collection of packages in Bioconductor • Potential to implement automated workflows • Big datasets • Advanced statistical routines • State-of-the-art graphics capabilities

  4. How to obtain and install R? • R can be downloaded from the Comprehensive R Archive Network (CRAN): http://cran.r- project.org/ • Installation instructions depend on your operating system and should be accessible from the R download page for you operating system. • For our course, R is already installed We use R-studio as programming environment

  5. 750 packages in Bioconductor

  6. Binding site detection Finding binding sites for a transcription factor in the promoter of a gene With only 8 lines of code: query(MotifDb, "DAL80") pfm.dal80.jaspar = query(MotifDb, "DAL80")[[1]] seqLogo(pfm.dal80.jaspar) dal1 = "YIR027C" chromosomal.loc = transcriptsBy(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene, by = "gene")[dal1] promoter.dal1 = getPromoterSeq(chromosomal.loc, Scerevisiae, upstream=1000, downstream=0) pcm.dal80.jaspar = round(100 * pfm.dal80.jaspar) matchPWM(pcm.dal80.jaspar, unlist(promoter.dal1)[[1]], "90%")

  7. Quality assessment of NGS data From a directory of FastQ files to a full quality report: With 6 lines of code: files = list.files("fastq", full=TRUE) names(files) = sub(".fastq", "", basename(files)) qas = lapply(seq_along(files), function(i, files) qa(readFastq(files[i]), names(files)[i]), files) qa <- do.call(rbind, qas) save(qa, file=file.path("output", "qa.rda")) browseURL(report(qa))

  8. Differential gene expression From mapped and counted RNA-Seq data to differentially expressed genes Hcounts=read.table("segemehl.hg19.readCount", head=T, sep="\t", quote="", stringsAsFactor=FALSE, row.names="id") Ccounts=read.table("segemehl.panTro3.readCount", head=T, sep="\t", quote="", stringsAsFactor=FALSE, row.names="id") counts=cbind(Hcounts, Ccounts) colnames(counts)=c("H1", "H2", "H3", "C1", "C2", "C3") groups=c(0,0,0,1,1,1) counts_DESeq_obj=newCountDataSet(counts, groups) counts_DESeq_obj=estimateSizeFactors(counts_DESeq_obj) counts_DESeq_obj=estimateDispersions(counts_DESeq_obj) DESeq_DEgenes=nbinomTest(counts_DESeq_obj, "0", "1") plot(DESeq_DEgenes$baseMean,DESeq_DEgenes$log2FoldChange,log="x", pch=20, cex=.1,col=ifelse(DESeq_DEgenes$padj < 0.05, "red", "black" ) ) signDESeq_DEgenes=DESeq_DEgenes[DESeq_DEgenes$padj<0.05,] head(signDESeq_DEgenes[order(signDESeq_DEgenes$padj),])

  9. Finding help • R mailing lists : https://stat.ethz.ch/mailman/listinfo/ • Manuals and FAQs : http://www.r-project.org/ • Selected tutorials : – http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html – http://www.statmethods.net/index.html – http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_Bi oCondManual.html

  10. Goals for the next 5 x 5 hours Dec 3 rd : Introduction to R Dec 10 th : Statistics and Graphics Dec 17 th : A small programming project Jan 14 th : Analysis of gene expression data Jan 21 st : Clustering and Gene Ontology

  11. Goals for the first 5 hours R-Studio • R as a calculator (interactive R) • Variables: numeric, character, arrays, vectors, matrices • Loops • Apply • Conditional executions (if-else-statements) • Write your own functions • Multiple exercises in between

  12. Optional for today If you are already familiar with R • Find all genes in the human genome that have a binding site for the transcription factor EGR1: - Search for the EGR1 binding sites within the promoters of the human genes in a window from 5000 bp upstream to 2000 bp downstream of the transcription start site • Which genes are targeted by EGR1 only in humans? - Search for EGR1 binding sites within the chimpanzee, orangutan, and rhesus macaque genome to answer this question - What are the genes associated with the human specific binding sites? • Which genes are not targeted by EGR1 in humans? - Find out which binding sites are in the other three primate genomes but not in the human genome - Which genes are associated with the human specifically lost binding sites?

  13. Goals for second 5 hours R packages • Help pages • Some more on functions • Graphics • Statistical tests • Multiple exercises in between

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend