Statistical LeaRning Katja Nowick, Lydia Mueller Bioinformatics - - PowerPoint PPT Presentation




SLIDE 1

Statistical LeaRning

Katja Nowick, Lydia Mueller

Bioinformatics group,

Markus Kreuz

IMISE

SLIDE 2

What is R?

  • Programming/scripting language
  • Comprehensive statistical environment
  • Strength: statistical data analysis + graphical display

SLIDE 3

Why use R?

  • It's free!
  • Runs on a variety of platforms including Windows, Unix and MacOS.
  • Complicated bioinformatics analyses made easy by a huge collection of packages in Bioconductor

  • Potential to implement automated workflows
  • Big datasets
  • Advanced statistical routines
  • State-of-the-art graphics capabilities
SLIDE 4

How to obtain and install R?

  • R can be downloaded from the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/
  • Installation instructions depend on your operating system and should be accessible from the R download page for your operating system
  • For our course, R is already installed

We use RStudio as the programming environment

SLIDE 5

~1000 packages in Bioconductor

http://www.bioconductor.org/packages/release/bioc/
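
A minimal sketch of how Bioconductor packages are typically installed in current releases, assuming the BiocManager package from CRAN (the slides do not say which installer the course used). `missing_pkgs` is a hypothetical helper name; the package names are the ones used in the examples on the next two slides. The install calls are left commented out because they need an internet connection.

```r
# Packages used in the examples on the following slides.
wanted <- c("MotifDb", "seqLogo", "ShortRead")

# Return the subset of `pkgs` that is not installed yet.
# (missing_pkgs is a hypothetical helper name, not part of R or Bioconductor.)
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(wanted)
print(to_install)

# Uncomment to install the missing packages (needs an internet connection):
# if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# BiocManager::install(to_install)
```

Checking first and installing only what is missing keeps the step safe to rerun at the start of every session.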

SLIDE 6

Binding site detection

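Find binding motifs for a transcription factor in a database and draw the sequence logo. With only 3 lines of code (after loading the packages):

    library(MotifDb); library(seqLogo)                # load the two Bioconductor packages
    query(MotifDb, "DAL80")                           # list all motifs matching "DAL80"
    pfm.dal80.jaspar = query(MotifDb, "DAL80")[[1]]   # keep the first matching matrix
    seqLogo(pfm.dal80.jaspar)                         # draw its sequence logo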

SLIDE 7

Quality assessment of NGS data

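From a directory of FastQ files to a full quality report, with 6 lines of code (after loading the package):

    library(ShortRead)                                              # provides readFastq(), qa(), report()
    files = list.files("fastq", full.names=TRUE)                    # all files in the "fastq" directory
    names(files) = sub(".fastq", "", basename(files), fixed=TRUE)   # sample names without the extension
    qas = lapply(seq_along(files), function(i, files)
        qa(readFastq(files[i]), names(files)[i]), files)            # quality assessment per file
    qa <- do.call(rbind, qas)                                       # combine into one object (reuses the name qa)
    save(qa, file=file.path("output", "qa.rda"))                    # assumes the "output" directory exists
    browseURL(report(qa))                                           # build the report and open it in the browser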

FastQ input (excerpt):

    @SEQ_ID_1
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    @SEQ_ID_2
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
    @SEQ_ID_3
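
To make the FastQ excerpt easier to read: each record spans four lines (identifier, sequence, a `+` separator, and one quality character per base). The sketch below decodes the quality string of the first record in base R. It assumes the common Phred+33 encoding, which the slides do not state, and the variable names are illustrative.

```r
# First record from the excerpt, one element per line of the record.
record <- c(
  "@SEQ_ID_1",
  "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
  "+",
  "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
)

read_id   <- sub("^@", "", record[1])     # identifier without the leading "@"
bases     <- record[2]                    # the nucleotide sequence
qualities <- utf8ToInt(record[4]) - 33L   # assumption: Phred+33 encoding

# There must be exactly one quality value per base.
stopifnot(length(qualities) == nchar(bases))

# Distribution of the per-base quality scores.
print(summary(qualities))
```

Packages such as ShortRead do this decoding internally; the sketch only illustrates what the quality line contains.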

SLIDE 8

Finding help

  • R mailing lists: https://stat.ethz.ch/mailman/listinfo/
  • Manuals and FAQs: http://www.r-project.org/
  • Selected tutorials:
      – http://www.math.ilstu.edu/dhkim/Rstuff/Rtutor.html
      – http://www.statmethods.net/index.html
      – http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html
SLIDE 9

Goals for the next 5 x 5 hours

  • Nov 3rd: Introduction to R
  • Nov 17th: Statistics and Graphics
  • Nov 24th: A small programming project
  • Dec 1st: Analysis of gene expression data
  • Dec 15th: Clustering and Gene Ontology

SLIDE 10
Goals for the first 5 hours

  • RStudio
  • R as a calculator (interactive R)
  • Variables: numeric, character, arrays, vectors, matrices
  • Loops
  • Apply
  • Conditional executions (if-else-statements)
  • Write your own functions

Multiple exercises in between
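
As a preview of these topics, here is a minimal base-R sketch that touches each one. All variable and function names are illustrative and not taken from the course material.

```r
# R as a calculator
2 + 3 * 4

# Variables
n    <- 5                          # numeric
name <- "gene"                     # character
v    <- c(1, 2, 3, 4)              # vector
m    <- matrix(1:6, nrow = 2)      # 2 x 3 matrix, filled column by column
a    <- array(1:24, dim = c(2, 3, 4))  # 3-dimensional array

# Loop: add up the elements of v
total <- 0
for (x in v) {
  total <- total + x
}

# Apply: sum of each row of m (margin 1 = rows)
row_sums <- apply(m, 1, sum)

# Conditional execution
size <- if (total > 5) "large" else "small"

# Write your own function and use it on every element of v
square  <- function(x) x^2
squares <- sapply(v, square)

print(list(total = total, row_sums = row_sums, size = size, squares = squares))
```

The loop is written out for illustration; in practice `sum(v)` and `v^2` do the same work, since most R operations are vectorised.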

SLIDE 11
Goals for the second 5 hours

  • R packages
  • Help pages
  • Some more on functions
  • Graphics
  • Statistical tests

Multiple exercises in between
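
A small sketch of how help pages, a statistical test and a plot fit together in base R. The data are simulated and the names (`group_a`, `group_b`, `plot_file`) are illustrative; the course exercises may use different data and functions.

```r
# Help pages: uncomment to open the documentation in an interactive session.
# ?t.test
# help.search("boxplot")

set.seed(1)                       # make the simulated data reproducible
group_a <- rnorm(20, mean = 0)    # 20 values drawn around 0
group_b <- rnorm(20, mean = 1)    # 20 values drawn around 1

# Statistical test: Welch two-sample t-test (the default of t.test).
res <- t.test(group_a, group_b)
print(res$p.value)

# Graphics: write a boxplot of both groups to a PDF in the temporary directory.
plot_file <- file.path(tempdir(), "groups.pdf")
pdf(plot_file)
boxplot(list(a = group_a, b = group_b), main = "Two simulated groups")
dev.off()
```

In RStudio the `pdf()` and `dev.off()` lines can be dropped, and the plot then appears in the Plots pane.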

SLIDE 12

Optional for today

  • If you already know R -