Practical Bioinformatics, Mark Voorhies, 5/15/2015

SLIDE 1

Practical Bioinformatics

Mark Voorhies 5/15/2015

SLIDE 2

Gotchas

Indentation matters
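A minimal sketch of why indentation matters in Python: the two functions below contain the same statements and differ only in how far the `return` line is indented, yet they compute different things. The function names are illustrative and not taken from the course material.

```python
def total_after_loop(values):
    """Sum every value: `return` sits at the function level, after the loop."""
    total = 0
    for v in values:
        total += v
    return total


def total_inside_loop(values):
    """Same statements, but `return` is indented into the loop body,
    so the function exits after adding the first value."""
    total = 0
    for v in values:
        total += v
        return total


print(total_after_loop([1, 2, 3]))   # 6
print(total_inside_loop([1, 2, 3]))  # 1
```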

SLIDE 3

Clustering exercises – Visualizing the distance matrix

SLIDE 4

Loading and re-loading your functions

    # Use import the first time you load a module
    # (and keep using import until it loads
    # successfully)
    import my_module
    my_module.my_function(42)

    # Once a module has been loaded, use reload to
    # force python to read your new code
    # (reload is a builtin in Python 2; in Python 3, use importlib.reload)
    reload(my_module)
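The import/reload cycle can be tried end to end with a throwaway module. The sketch below assumes Python 3, where `reload` lives in `importlib`; the module name `demo_reload_module` and the helper `write_module` are made up for the demonstration.

```python
import importlib
import os
import sys
import tempfile

# Skip writing .pyc files so a quick rewrite of the source is never masked
# by cached bytecode.
sys.dont_write_bytecode = True


def write_module(directory, body):
    """Write a tiny module named demo_reload_module.py into `directory`."""
    path = os.path.join(directory, "demo_reload_module.py")
    with open(path, "w") as handle:
        handle.write(body)
    return path


workdir = tempfile.mkdtemp()
sys.path.append(workdir)  # make the directory importable

# First load: use import.
write_module(workdir, "def my_function(x):\n    return x + 1\n")
importlib.invalidate_caches()
import demo_reload_module

first = demo_reload_module.my_function(42)

# Edit the source on disk, then force Python to read the new code.
write_module(workdir, "def my_function(x):\n    return x * 100\n")
importlib.reload(demo_reload_module)
second = demo_reload_module.my_function(42)

print(first, second)
```

Without the `reload` call, the second call would still run the old definition, because Python keeps the already-imported module in memory.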

SLIDE 5

Setting Canopy’s working/import directory

OS X

    Open a terminal
    cd path/to/working/directory
    env PYTHONPATH="$PYTHONPATH:$PWD" canopy

Windows (or OS X)

    Start canopy
    %cd path/to/working/directory
    import sys, os
    sys.path.append(os.getcwd())

SLIDE 6

Pearson distances

Pearson similarity

$$s(x, y) = \frac{1}{N} \sum_{i}^{N} \left( \frac{x_i - x_{\text{offset}}}{\phi_x} \right) \left( \frac{y_i - y_{\text{offset}}}{\phi_y} \right)$$

$$\phi_G = \sqrt{\frac{\sum_{i}^{N} (G_i - G_{\text{offset}})^2}{N}}$$

SLIDE 7

Pearson distances

Pearson similarity

$$s(x, y) = \sum_{i}^{N} \left( \frac{x_i - x_{\text{offset}}}{\phi_x} \right) \left( \frac{y_i - y_{\text{offset}}}{\phi_y} \right)$$

$$\phi_G = \sqrt{\sum_{i}^{N} (G_i - G_{\text{offset}})^2}$$

SLIDE 8

Pearson distances

Pearson similarity

$$s(x, y) = \sum_{i}^{N} \left( \frac{x_i - x_{\text{offset}}}{\sqrt{\sum_{i}^{N} (x_i - x_{\text{offset}})^2}} \right) \left( \frac{y_i - y_{\text{offset}}}{\sqrt{\sum_{i}^{N} (y_i - y_{\text{offset}})^2}} \right)$$

SLIDE 9

Pearson distances

Pearson similarity

$$s(x, y) = \frac{\sum_{i}^{N} (x_i - x_{\text{offset}})(y_i - y_{\text{offset}})}{\sqrt{\sum_{i}^{N} (x_i - x_{\text{offset}})^2} \, \sqrt{\sum_{i}^{N} (y_i - y_{\text{offset}})^2}}$$

SLIDE 10

Pearson distances

Pearson distance

$$d_{\text{uncentered}}(x, y) = 1 - s(x, y)$$

SLIDE 11

Pearson distances

Euclidean distance

$$\frac{\sum_{i}^{N} (x_i - y_i)^2}{N}$$
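The similarity and distance formulas above translate fairly directly into code. This is a minimal sketch, not the course's reference solution: the function names and the default offsets of 0 (the uncentered case) are assumptions, and the Euclidean distance is taken to be the sum of squared differences divided by N with no square root, since the slide text does not show one.

```python
import math


def pearson_similarity(x, y, x_offset=0.0, y_offset=0.0):
    """Pearson similarity s(x, y) in its single-fraction form.

    With both offsets at 0 this is the uncentered variant; passing the
    means of x and y as offsets gives the centered variant.
    """
    numerator = sum((xi - x_offset) * (yi - y_offset) for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum((xi - x_offset) ** 2 for xi in x))
    norm_y = math.sqrt(sum((yi - y_offset) ** 2 for yi in y))
    return numerator / (norm_x * norm_y)


def pearson_distance(x, y, x_offset=0.0, y_offset=0.0):
    """Pearson distance: 1 - s(x, y)."""
    return 1.0 - pearson_similarity(x, y, x_offset, y_offset)


def euclidean_distance(x, y):
    """Sum of squared differences divided by N (assumed form, no square root)."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]   # same direction as x
z = [3.0, 2.0, 1.0]   # reversed x

print(pearson_distance(x, y))            # close to 0: proportional profiles
print(pearson_distance(x, z, 2.0, 2.0))  # close to 2: anti-correlated once centered on the mean
print(euclidean_distance(x, y))
```

Note that the two measures disagree about `x` and `y`: their Pearson distance is essentially zero because the profiles have the same shape, while their Euclidean distance is not, because the magnitudes differ.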

SLIDE 12

Clustering exercises – Negative controls

Write functions to reproduce the shuffling controls in figure 3 of the Eisen paper (removing correlations among genes and/or arrays).

SLIDE 13

Clustering exercises – Negative controls

    def shuffleGenes(self, seed=None):
        """Shuffle expression matrix by row."""
        import random
        if seed is not None:
            random.seed(seed)
        # Build one permutation of the row indices and apply it to the
        # names, the annotations, and the data so they stay aligned
        indices = list(range(len(self.geneName)))
        random.shuffle(indices)
        genes = [self.geneName[i] for i in indices]
        self.geneName = genes
        annotations = [self.geneAnn[i] for i in indices]
        self.geneAnn = annotations
        num = [self.num[i] for i in indices]
        self.num = num

SLIDE 15

Clustering exercises – Negative controls

    def shuffleRows(self, seed=None):
        """Permute ratio values within rows."""
        import random
        if seed is not None:
            random.seed(seed)
        for i in self.num:
            random.shuffle(i)

SLIDE 16

Clustering exercises – Negative controls

    def shuffleCols(self, seed=None):
        """Permute ratio values within columns."""
        import random
        if seed is not None:
            random.seed(seed)
        # Transpose the expression matrix
        cols = []
        for col in range(len(self.num[0])):
            cols.append([row[col] for row in self.num])
        # Shuffle
        for i in cols:
            random.shuffle(i)
        # Transpose back to original orientation
        self.num = []
        for row in range(len(cols[0])):
            self.num.append([col[row] for col in cols])
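The row and column shuffles can also be tried outside the class, on a plain list of lists. This is a sketch under assumptions: the function names are hypothetical, the functions return shuffled copies instead of modifying the matrix in place, and each call uses its own `random.Random(seed)` instance instead of reseeding the global generator.

```python
import random


def shuffle_within_rows(matrix, seed=None):
    """Return a copy with values permuted independently inside each row."""
    rng = random.Random(seed)
    shuffled = [list(row) for row in matrix]
    for row in shuffled:
        rng.shuffle(row)
    return shuffled


def shuffle_within_cols(matrix, seed=None):
    """Return a copy with values permuted independently inside each column."""
    rng = random.Random(seed)
    cols = [list(col) for col in zip(*matrix)]   # transpose
    for col in cols:
        rng.shuffle(col)
    return [list(row) for row in zip(*cols)]     # transpose back


# Toy matrix: rows are genes, columns are arrays.
matrix = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
by_row = shuffle_within_rows(matrix, seed=0)
by_col = shuffle_within_cols(matrix, seed=0)
print(by_row)
print(by_col)
```

Each control keeps one set of values intact (every row, or every column, still holds the same numbers) while scrambling how they line up across the other dimension, which is what removes the correlations.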

SLIDE 17

Comparing all measurements for two genes

[Figure: "Comparing two expression profiles (r = 0.97)". Axes: TLC1 log2 relative expression and YFG1 log2 relative expression.]

SLIDE 18

Comparing all genes for two measurements

[Figure. Axes: Array 1, log2 relative expression and Array 2, log2 relative expression.]

SLIDE 19

Comparing all genes for two measurements

[Figure: "Euclidean Distance". Axes: Array 1, log2 relative expression and Array 2, log2 relative expression.]

SLIDE 20

Comparing all genes for two measurements

[Figure: "Uncentered Pearson". Axes: Array 1, log2 relative expression and Array 2, log2 relative expression.]

SLIDE 21

Measure all pairwise distances under distance metric
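As a rough illustration of this step, the sketch below fills a symmetric matrix with the distance between every pair of profiles. The function names and the toy profiles are made up for the example, the metric shown is the uncentered Pearson distance (any two-argument distance function could be passed instead), and the diagonal is simply set to 0.

```python
import math


def uncentered_pearson_distance(x, y):
    """1 - s(x, y) with both offsets set to 0."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / norm


def distance_matrix(profiles, distance):
    """Return the symmetric matrix D[i][j] = distance(profiles[i], profiles[j])."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):   # each pair is computed once
            d = distance(profiles[i], profiles[j])
            matrix[i][j] = d
            matrix[j][i] = d
    return matrix


# Made-up profiles: the second is a scaled copy of the first,
# the third is its mirror image.
profiles = [[1.0, 2.0, 3.0],
            [2.0, 4.0, 6.0],
            [-1.0, -2.0, -3.0]]
D = distance_matrix(profiles, uncentered_pearson_distance)
for row in D:
    print(["%.2f" % value for value in row])
```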

SLIDE 22

Hierarchical Clustering

SLIDE 27

Scripting Cluster

Running Cluster3 from the command line

    /Applications/Cluster.app/Contents/MacOS/Cluster
    /Program Files/Stanford University/Cluster3/Cluster.com

Command-line programs are like functions.
“man program” is like “help(function)”.
Use the subprocess module to run command-line programs from within Python.
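To make the "programs as functions" analogy concrete, here is a small sketch using two real `subprocess` calls, `check_call` and `check_output`. Since Cluster3 may not be installed, the program being run is the Python interpreter itself (`sys.executable`), which is an assumption made purely so the example runs anywhere.

```python
import subprocess
import sys

# "Call" a program: the arguments are passed as a list, and a non-zero
# exit status raises CalledProcessError, much like a function raising.
status = subprocess.check_call(
    [sys.executable, "-c", "print('hello from a subprocess')"])

# Capture what the program prints, like using a function's return value.
output = subprocess.check_output([sys.executable, "-c", "print(6 * 7)"])
answer = int(output.decode().strip())

print(status, answer)
```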

SLIDE 28

Programs as functions

    USAGE: cluster [options]

    -f filename   File loading
    -u jobname    Allows you to specify a different name for the output files
                  (default is derived from the input file name)
    -g [0..8]     Specifies the distance measure for gene clustering
                  0: No gene clustering
                  1: Uncentered correlation
                  2: Pearson correlation
                  3: Uncentered correlation, absolute value
                  4: Pearson correlation, absolute value
                  5: Spearman’s rank correlation
                  6: Kendall’s tau
                  7: Euclidean distance
                  8: City-block distance
                  (default: 0)
    -m [msca]     Specifies which hierarchical clustering method to use
                  m: Pairwise complete-linkage
                  s: Pairwise single-linkage
                  c: Pairwise centroid-linkage
                  a: Pairwise average-linkage
                  (default: m)

SLIDE 29

Scripting the Protocol

    from subprocess import check_call

    check_call(
        # Which program to run
        ("cluster",
         # Input file
         "-f", "supp2data.tdt",
         # Output prefix
         "-u", "supp2data.Uncentered.Complete",
         # Clustering method: complete linkage
         "-m", "m",
         # Distance function: uncentered Pearson
         "-g", "1"))

SLIDE 30

Using the Cluster3 GUI

SLIDE 31

Load your data

SLIDE 32

Choose distance function

SLIDE 33

Choose linking method

SLIDE 34

Using JavaTreeView

SLIDE 35

Adjust pixel settings for global view

SLIDE 37

Select annotation columns

SLIDE 39

Select URL for gene annotations

SLIDE 41

Activate and detach annotation window

SLIDE 44

Clustering exercises – Scripting Cluster

Modify the clustering protocol script to run Cluster3 multiple times on the same input, varying distance metric and/or clustering method. Be sure to give the output files distinct names.

SLIDE 45

Clustering exercises – Scripting Cluster

    metrics = ("None", "Uncentered", "Pearson", "UncenteredAbs", "PearsonAbs",
               "Spearman", "Kendall", "Euclidean", "City")
    linkage = (("Complete", "m"),
               ("Single", "s"),
               ("Centroid", "c"),
               ("Average", "a"))

    # Loop over all 32 possible methods
    print("Starting hierarchical clustering runs...")
    from subprocess import check_call
    # Start at 1: metric 0 means "no gene clustering"
    for metric in range(1, len(metrics)):
        print("  %s ..." % metrics[metric])
        for (linkname, link) in linkage:
            print("    %s" % linkname)
            check_call(("cluster",
                        "-f", "shuffled.txt",
                        # Distinct output name for each metric/linkage pair
                        "-u", ".".join(("shuffled", metrics[metric], linkname)),
                        "-m", link,
                        "-g", str(metric)))
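Before launching 32 clustering runs, it can help to build the command lines without executing them and check that every output name is distinct. The sketch below is one way to do that; the helper `build_commands` is hypothetical, and only the `-f`, `-u`, `-m`, and `-g` options from the usage text are used.

```python
METRICS = ("None", "Uncentered", "Pearson", "UncenteredAbs", "PearsonAbs",
           "Spearman", "Kendall", "Euclidean", "City")
LINKAGE = (("Complete", "m"), ("Single", "s"),
           ("Centroid", "c"), ("Average", "a"))


def build_commands(input_file, prefix):
    """Return one argument tuple per (metric, linkage) pair, without running anything."""
    commands = []
    for metric in range(1, len(METRICS)):   # skip 0: no gene clustering
        for linkname, link in LINKAGE:
            jobname = ".".join((prefix, METRICS[metric], linkname))
            commands.append(("cluster",
                             "-f", input_file,
                             "-u", jobname,
                             "-m", link,
                             "-g", str(metric)))
    return commands


commands = build_commands("shuffled.txt", "shuffled")
jobnames = [command[4] for command in commands]   # the value after "-u"

print(len(commands), "runs planned")
print("all output names distinct:", len(set(jobnames)) == len(jobnames))
```

Each tuple can then be passed to `subprocess.check_call` once the list looks right.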

SLIDE 46

Homework

1. If you haven’t done so already, read the PNAS paper.
2. Explore the figure 2 data with Cluster3 and JavaTreeView.

  • Can you find/reproduce the clusters described in the paper?
  • Are the annotations consistent with the current annotations in SGD?
  • Are there other patterns that you can find in the data?
  • What follow-up experiments are prompted by this analysis?
