Statistical LeaRning Katja Nowick Bioinformatics group, Markus - PowerPoint PPT Presentation

Statistical LeaRning Katja Nowick Bioinformatics group, Markus Kreuz IMISE

What is R? • Programming/scripting language • Comprehensive statistical environment • Strength : statistical data analysis + graphical display

Why use R? • It's free ! • Runs on a variety of platforms including Windows, Unix and MacOS. • Complicated bioinformatics analyses made easy by a huge collection of packages in Bioconductor • Potential to implement automated workflows • Big datasets • Advanced statistical routines • State-of-the-art graphics capabilities

How to obtain and install R? • R can be downloaded from the Comprehensive R Archive Network (CRAN): http://cran.r- project.org/ • Installation instructions depend on your operating system and should be accessible from the R download page for you operating system. • For our course, R is already installed We use R-studio as programming environment

750 packages in Bioconductor

Binding site detection Finding binding sites for a transcription factor in the promoter of a gene With only 8 lines of code: query(MotifDb, "DAL80") pfm.dal80.jaspar = query(MotifDb, "DAL80")[[1]] seqLogo(pfm.dal80.jaspar) dal1 = "YIR027C" chromosomal.loc = transcriptsBy(TxDb.Scerevisiae.UCSC.sacCer3.sgdGene, by = "gene")[dal1] promoter.dal1 = getPromoterSeq(chromosomal.loc, Scerevisiae, upstream=1000, downstream=0) pcm.dal80.jaspar = round(100 * pfm.dal80.jaspar) matchPWM(pcm.dal80.jaspar, unlist(promoter.dal1)[[1]], "90%")

Quality assessment of NGS data From a directory of FastQ files to a full quality report: With 6 lines of code: files = list.files("fastq", full=TRUE) names(files) = sub(".fastq", "", basename(files)) qas = lapply(seq_along(files), function(i, files) qa(readFastq(files[i]), names(files)[i]), files) qa <- do.call(rbind, qas) save(qa, file=file.path("output", "qa.rda")) browseURL(report(qa))

Differential gene expression From mapped and counted RNA-Seq data to differentially expressed genes Hcounts=read.table("segemehl.hg19.readCount", head=T, sep="\t", quote="", stringsAsFactor=FALSE, row.names="id") Ccounts=read.table("segemehl.panTro3.readCount", head=T, sep="\t", quote="", stringsAsFactor=FALSE, row.names="id") counts=cbind(Hcounts, Ccounts) colnames(counts)=c("H1", "H2", "H3", "C1", "C2", "C3") groups=c(0,0,0,1,1,1) counts_DESeq_obj=newCountDataSet(counts, groups) counts_DESeq_obj=estimateSizeFactors(counts_DESeq_obj) counts_DESeq_obj=estimateDispersions(counts_DESeq_obj) DESeq_DEgenes=nbinomTest(counts_DESeq_obj, "0", "1") plot(DESeq_DEgenes$baseMean,DESeq_DEgenes$log2FoldChange,log="x", pch=20, cex=.1,col=ifelse(DESeq_DEgenes$padj < 0.05, "red", "black" ) ) signDESeq_DEgenes=DESeq_DEgenes[DESeq_DEgenes$padj<0.05,] head(signDESeq_DEgenes[order(signDESeq_DEgenes$padj),])

Finding help • R mailing lists : https://stat.ethz.ch/mailman/listinfo/ • Manuals and FAQs : http://www.r-project.org/ • Selected tutorials : – http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html – http://www.statmethods.net/index.html – http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_Bi oCondManual.html

Goals for the next 5 x 5 hours Dec 3 rd : Introduction to R Dec 10 th : Statistics and Graphics Dec 17 th : A small programming project Jan 14 th : Analysis of gene expression data Jan 21 st : Clustering and Gene Ontology

Goals for the first 5 hours R-Studio • R as a calculator (interactive R) • Variables: numeric, character, arrays, vectors, matrices • Loops • Apply • Conditional executions (if-else-statements) • Write your own functions • Multiple exercises in between

Optional for today If you are already familiar with R • Find all genes in the human genome that have a binding site for the transcription factor EGR1: - Search for the EGR1 binding sites within the promoters of the human genes in a window from 5000 bp upstream to 2000 bp downstream of the transcription start site • Which genes are targeted by EGR1 only in humans? - Search for EGR1 binding sites within the chimpanzee, orangutan, and rhesus macaque genome to answer this question - What are the genes associated with the human specific binding sites? • Which genes are not targeted by EGR1 in humans? - Find out which binding sites are in the other three primate genomes but not in the human genome - Which genes are associated with the human specifically lost binding sites?

Goals for second 5 hours R packages • Help pages • Some more on functions • Graphics • Statistical tests • Multiple exercises in between

Statistical LeaRning Katja Nowick Bioinformatics group, Markus - PowerPoint PPT Presentation

Statistical LeaRning Katja Nowick Bioinformatics group, Markus Kreuz IMISE What is R? Programming/scripting language Comprehensive statistical environment Strength : statistical data analysis + graphical display Why use R?

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Foundations of AI Why learning works 1 6 . Statistical Machine Learning Bayesian Learning and

Day 1: Introduction to Statistical Learning Lucas Leemann Essex Summer School Introduction to

Statistical graphics with Statistical graphics with ggplot2 ggplot2 Programming for Statistical

COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 23. PGM

Statistical Natural Language Processing Statistical models: learning, inference, estimation,

STK-IN4300 Statistical Learning Methods in Data Science Statistical Boosting Boosting as a

COMPLETE STATISTICAL THEORY OF LEARNING LEARNING USING STATISTICAL INVARIANTS Vladimir Vapnik

Statistical presentation Statistical presentation Statistical tabulations by age, sex and 3 digit

EFTA Statistical Cooperation & the European Statistical System EEA Seminar EEA Seminar

EFTA Statistical Cooperation & the European Statistical System EEA Seminar EEA Seminar

13 Jan, 2011 Statistical Literacy: Confounding UTSA Confounding 2011 1 2011 2 Statistical

STAT 401A - Statistical Methods for Research Workers Statistical Inference Jarad Niemi (Dr. J)

Statistics 435/535 Statistical Methods for Quality and Productivity Improvement / Statistical

Nov 2010 Statistical Literacy: Harper's Magazine Fall 2010 1 Fall 2010 2 Statistical

Statistical Machine Translation George Foster George Foster Statistical Machine Translation A

LibreOffjce: Educational practice in China Lv Linjie, Shirley shirley@ezgolinux.org

Proposals to RELOAD for Client Promotion and Peer Demotion IETF81@Quebec

Q1 2015 Supplementary Slides May 6, 2015 1 Forward-looking Statements This presentation for

Q2 2015 Supplementary Slides July 23, 2015 1 Forward-looking Statements This presentation for

SUMMER MEALS PLANNING FOR THE NEW YEAR 2020 Schools Out, Foods In Webinar Series

Hermes The Problem 44 million runners just in U.S.A Poor running form leads to injury

EVOLVABLE HARDWARE: The Darwin Chip Dream or Reality? Presented by: Deyasini Majumdar 1

Seek Novelty Personality Environment Predictable Unpredictable Seek Stability Seek Novelty

Sambuz

Useful Links

Newsletter

Mail Us

Statistical LeaRning Katja Nowick Bioinformatics group, Markus - PowerPoint PPT Presentation

Statistical LeaRning Katja Nowick Bioinformatics group, Markus Kreuz IMISE What is R? Programming/scripting language Comprehensive statistical environment Strength : statistical data analysis + graphical display Why use R?

Statistical Statistical Statistical Model Statistical Model Model Checking Model Checking

Foundations of AI Why learning works 1 6 . Statistical Machine Learning Bayesian Learning and

Day 1: Introduction to Statistical Learning Lucas Leemann Essex Summer School Introduction to

Statistical graphics with Statistical graphics with ggplot2 ggplot2 Programming for Statistical

COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 23. PGM

Statistical Natural Language Processing Statistical models: learning, inference, estimation,

STK-IN4300 Statistical Learning Methods in Data Science Statistical Boosting Boosting as a

COMPLETE STATISTICAL THEORY OF LEARNING LEARNING USING STATISTICAL INVARIANTS Vladimir Vapnik

Statistical presentation Statistical presentation Statistical tabulations by age, sex and 3 digit

EFTA Statistical Cooperation &amp; the European Statistical System EEA Seminar EEA Seminar

EFTA Statistical Cooperation &amp; the European Statistical System EEA Seminar EEA Seminar

13 Jan, 2011 Statistical Literacy: Confounding UTSA Confounding 2011 1 2011 2 Statistical

STAT 401A - Statistical Methods for Research Workers Statistical Inference Jarad Niemi (Dr. J)

Statistics 435/535 Statistical Methods for Quality and Productivity Improvement / Statistical

Nov 2010 Statistical Literacy: Harper's Magazine Fall 2010 1 Fall 2010 2 Statistical

Statistical Machine Translation George Foster George Foster Statistical Machine Translation A

LibreOffjce: Educational practice in China Lv Linjie, Shirley shirley@ezgolinux.org

Proposals to RELOAD for Client Promotion and Peer Demotion IETF81@Quebec

Q1 2015 Supplementary Slides May 6, 2015 1 Forward-looking Statements This presentation for

Q2 2015 Supplementary Slides July 23, 2015 1 Forward-looking Statements This presentation for

SUMMER MEALS PLANNING FOR THE NEW YEAR 2020 Schools Out, Foods In Webinar Series

Hermes The Problem 44 million runners just in U.S.A Poor running form leads to injury

EVOLVABLE HARDWARE: The Darwin Chip Dream or Reality? Presented by: Deyasini Majumdar 1

Seek Novelty Personality Environment Predictable Unpredictable Seek Stability Seek Novelty

Sambuz

Useful Links

Newsletter

Mail Us

EFTA Statistical Cooperation & the European Statistical System EEA Seminar EEA Seminar

EFTA Statistical Cooperation & the European Statistical System EEA Seminar EEA Seminar