Practical Bioinformatics, Mark Voorhies, 5/12/2015



slide-1
SLIDE 1

Practical Bioinformatics

Mark Voorhies 5/12/2015

Mark Voorhies Practical Bioinformatics

slide-6
SLIDE 6

Gotchas

Strings are quoted, names of things are not: mystring ≠ "mystring"

Case matters for variable names: mystring ≠ MyString

Case matters for string comparison: "atg" ≠ "ATG"

Normalize sequence comparison to uppercase:

    "ATGCTGTA".upper() == "ATgcTgTA".upper()

(And treat RNA as cDNA)
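A minimal sketch of the normalization above. The helper names `normalize` and `same_sequence` are hypothetical, and reading "treat RNA as cDNA" as "replace U with T" is an assumption, not something the slide spells out.

```python
def normalize(seq):
    """Uppercase a sequence and write RNA in the DNA alphabet (U -> T).

    The U -> T step is an assumed reading of "treat RNA as cDNA".
    """
    return seq.upper().replace("U", "T")


def same_sequence(a, b):
    """Compare two sequences after normalization."""
    return normalize(a) == normalize(b)


# Plain string comparison is case sensitive.
case_sensitive = ("atg" == "ATG")

# After normalization, mixed-case copies of a sequence compare equal.
mixed_case = same_sequence("ATGCTGTA", "ATgcTgTA")

# An RNA start codon matches its DNA spelling under the assumed U -> T rule.
rna_vs_dna = same_sequence("AUG", "atg")
```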

slide-9
SLIDE 9

Gotchas

Statements that precede code blocks (if, def, for, while, ...) end with a colon.

    def mean(x):
        s = 0.0
        for i in x:
            s += i
        return s / len(x)

You can use tab and shift-tab in IPython to indent/unindent blocks of code.

Loop variables retain their state after the loop is finished (so if you want to reuse the variable, you need to reinitialize it).
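A short illustration of the last gotcha, assuming Python 3. All variable names here are made up for the example. The loop variable stays bound after the loop, and an accumulator that is not reset carries its old value into the next loop.

```python
# The loop variable is still bound after the loop ends.
for i in range(3):
    pass
last_i = i  # i keeps the value from the final iteration

# First use of the accumulator.
s = 0.0
for v in [1.0, 2.0, 3.0]:
    s += v
first_total = s

# Reusing s without resetting it: the old total is carried over.
for v in [10.0, 20.0]:
    s += v
carried_total = s

# Reinitialize before reuse to get the intended result.
s = 0.0
for v in [10.0, 20.0]:
    s += v
second_total = s
```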


slide-10
SLIDE 10

Mean

    # Explicit loop
    def mean(x):
        s = 0.0
        for i in x:
            s += i
        return s / len(x)

    # Same calculation with the built-in sum()
    def mean(x):
        return sum(x) / float(len(x))
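A quick check that the two definitions agree, sketched for Python 3. The names `mean_loop` and `mean_sum` are hypothetical, used only so both versions can coexist. The `float()` call matters under Python 2, where dividing two integers truncates; in Python 3, `/` is always true division.

```python
import math
import statistics


def mean_loop(x):
    """Mean via an explicit accumulator loop."""
    s = 0.0
    for i in x:
        s += i
    return s / len(x)


def mean_sum(x):
    """Mean via the built-in sum()."""
    return sum(x) / float(len(x))


data = [1, 2, 4]  # integer input, where Python 2 division would truncate

agree = math.isclose(mean_loop(data), mean_sum(data))
matches_stdlib = math.isclose(mean_sum(data), statistics.mean(data))
```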

slide-12
SLIDE 12

Standard Deviation

$$\sigma_x = \sqrt{\frac{\sum_{i}^{N} (x_i - \bar{x})^2}{N - 1}}$$

    from math import sqrt

    def stdev(x):
        m = mean(x)
        s = 0.0
        for i in x:
            s += (i - m)**2
        return sqrt(s / (len(x) - 1))
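One way to sanity-check `stdev` is to compare it against the standard library, which also divides by N − 1. This is a sketch assuming Python 3; the sample data is invented.

```python
import math
import statistics
from math import sqrt


def mean(x):
    return sum(x) / float(len(x))


def stdev(x):
    """Sample standard deviation (N - 1 in the denominator)."""
    m = mean(x)
    s = 0.0
    for i in x:
        s += (i - m) ** 2
    return sqrt(s / (len(x) - 1))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up values

# statistics.stdev is the sample standard deviation, so the two should agree.
matches_stdlib = math.isclose(stdev(data), statistics.stdev(data))
```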

slide-14
SLIDE 14

Pearson’s Correlation Coefficient

$$r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

    from math import sqrt

    def pearson(x, y):
        mx = mean(x)
        my = mean(y)
        sxy = 0.0
        ssx = 0.0
        ssy = 0.0
        for i in range(len(x)):
            dx = x[i] - mx
            dy = y[i] - my
            sxy += dx*dy
            ssx += dx**2
            ssy += dy**2
        return sxy / sqrt(ssx*ssy)
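A small usage sketch for `pearson`, assuming Python 3 and invented data. Perfectly proportional vectors should give r = 1, and reversing one of them should give r = −1.

```python
import math
from math import sqrt


def mean(x):
    return sum(x) / float(len(x))


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx = mean(x)
    my = mean(y)
    sxy = 0.0
    ssx = 0.0
    ssy = 0.0
    for i in range(len(x)):
        dx = x[i] - mx
        dy = y[i] - my
        sxy += dx * dy
        ssx += dx ** 2
        ssy += dy ** 2
    return sxy / sqrt(ssx * ssy)


x = [1.0, 2.0, 3.0, 4.0]   # made-up values
y = [2.0, 4.0, 6.0, 8.0]   # y = 2x, perfectly correlated with x

r_same = pearson(x, y)
r_reversed = pearson(x, y[::-1])  # reversed order, perfectly anti-correlated
```

Note that a constant vector makes `ssx` or `ssy` zero, so this version raises `ZeroDivisionError` in that case.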

slide-15
SLIDE 15

Subject, verb that noun!

return value = object.function(parameter, ...)

"Object, do function to parameter"

    file = open("myfile.txt")
    file.read()
    file.readlines()
    for line in file:
    string.split() and string.join()
    file.write()
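The calls listed above can be strung together as follows. This is a sketch assuming Python 3; the scratch file and its contents are hypothetical.

```python
import os
import tempfile

# Hypothetical scratch file so the example does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "myfile.txt")

# file.write(): "file, write this string"
fp = open(path, "w")
fp.write("gene\tvalue\n")
fp.write("YFG1\t1.5\n")
fp.close()

# file.read(): the whole file as one string
fp = open(path)
whole = fp.read()
fp.close()

# file.readlines(): a list with one string per line
fp = open(path)
lines = fp.readlines()
fp.close()

# for line in file: iterate line by line, splitting each on tabs
rows = []
fp = open(path)
for line in fp:
    rows.append(line.rstrip("\n").split("\t"))
fp.close()

# string.join(): the separator string joins a list of strings
rejoined = "\t".join(rows[1])
```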


slide-16
SLIDE 16

Binary files are like genomic DNA

    hexdump -C computers.png

    # "rb" reads raw bytes without any text decoding
    fp = open("computers.png", "rb")
    fp.read(50)
    fp.close()
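A rough Python counterpart to looking at the first bytes with `hexdump -C`, assuming Python 3. The helpers `first_bytes` and `hex_line` are hypothetical names, and the demo writes the eight-byte PNG signature to a scratch file rather than reading a real image.

```python
import os
import tempfile


def first_bytes(path, n=50):
    """Return up to n raw bytes from the start of a file."""
    with open(path, "rb") as fp:
        return fp.read(n)


def hex_line(data):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join("{:02x}".format(b) for b in bytearray(data))


# Scratch file holding only the PNG file signature.
path = os.path.join(tempfile.mkdtemp(), "signature.png")
with open(path, "wb") as fp:
    fp.write(b"\x89PNG\r\n\x1a\n")

head = first_bytes(path)
dump = hex_line(head)
```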


slide-17
SLIDE 17

Text files are like ORFs

hexdump -C 3 4 2010.txt


slide-18
SLIDE 18

OS X sometimes uses CR newlines

    hexdump -C macfile.txt
    tr '\r' '\n' < macfile.txt > unixfile.txt


slide-19
SLIDE 19

Windows uses CRLF newlines

hexdump -C dosfile.txt
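The CR, CRLF, and LF conventions can also be told apart from Python by reading the file as bytes. A minimal sketch, assuming Python 3; `newline_style` and `to_unix_newlines` are hypothetical helpers, and the detection is deliberately naive (it does not try to classify files that mix conventions).

```python
def newline_style(data):
    """Guess the newline convention of a bytes object."""
    if b"\r\n" in data:
        return "CRLF"   # DOS/Windows
    if b"\r" in data:
        return "CR"     # old Macintosh style
    if b"\n" in data:
        return "LF"     # UNIX
    return "none"


def to_unix_newlines(data):
    """Rewrite CRLF and bare CR as LF.

    CRLF is replaced first so it becomes one newline, not two.
    """
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


mac = b"line1\rline2\r"
dos = b"line1\r\nline2\r\n"
unix = b"line1\nline2\n"

styles = [newline_style(d) for d in (mac, dos, unix)]
converted = [to_unix_newlines(d) for d in (mac, dos, unix)]
```

To apply this to a file, read it with `open(path, "rb")` so that Python does not translate the newlines before you see them.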


slide-20
SLIDE 20

supp2data.csv

CSV File

slide-21
SLIDE 21
    open("supp2data.csv")

CSV File → File object


slide-22
SLIDE 22
    open("supp2data.csv").next()

CSV File → File object → single line


slide-23
SLIDE 23
    open("supp2data.csv").read()

CSV File → File object → single line (.next()) or whole file (.read())


slide-24
SLIDE 24

    csv.reader(open("supp2data.csv")).next()

CSV File → File object → reader → list
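The `.next()` method used on these slides is the Python 2 spelling. Under Python 3 the built-in `next()` does the same job for both file objects and csv readers. The sketch below assumes Python 3 and uses a small hypothetical file as a stand-in for supp2data.csv.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for supp2data.csv, with one missing value.
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w", newline="") as fp:
    fp.write("gene,cond1,cond2\nYFG1,0.5,\n")

# File object -> single line
with open(path) as fp:
    single_line = next(fp)

# File object -> whole file
with open(path) as fp:
    whole_file = fp.read()

# File object -> reader -> list of fields per row
# (the csv documentation recommends newline="" when opening files for csv)
with open(path, newline="") as fp:
    reader = csv.reader(fp)
    header = next(reader)
    rows = list(reader)
```

The missing value comes back as an empty string, which the caller still has to convert.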


slide-25
SLIDE 25

    csv.reader(urlopen("http://example.com/csv")).next()

Web service (CSV File) → urllib object → reader → list


slide-26
SLIDE 26

The CDT file format

Minimal CLUSTER input
Cluster3 CDT output
Tab delimited (\t)
UNIX newlines (\n)
Missing values → empty cells
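As a sketch of what "tab delimited, missing values → empty cells" means in practice, the snippet below splits one line of numeric cells and maps empty cells to None. It assumes Python 3; `parse_values` and the sample line are hypothetical, and nothing here assumes a particular CDT column layout.

```python
def parse_values(cells):
    """Convert text cells to floats, using None for empty cells."""
    return [float(c) if c.strip() else None for c in cells]


# Made-up line of three data cells; the middle one is missing.
line = "0.12\t\t-1.5\n"

# Strip only the newline: a plain strip() would also remove trailing tabs
# and silently drop empty cells at the end of a row.
cells = line.rstrip("\n").split("\t")
values = parse_values(cells)

# A row that ends with a missing value keeps its final empty cell.
trailing = parse_values("2.0\t\n".rstrip("\n").split("\t"))
```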


slide-27
SLIDE 27

Homework

1. Try reading the first few bytes of different files on your computer. Can you distinguish binary files from text files?

2. Create a simple data table in your favorite spreadsheet program and save it in a text format (e.g., save as CSV or tab-delimited text from Excel [1]). Practice reading the data from Python.

3. Write a function to dissect supp2data.cdt into three lists of strings (gene names, gene annotations, and experimental conditions) and one matrix (list of lists) of log ratio values (as floats, using None or 0. to represent missing values).

4. If you are familiar with Python classes, write a CDT class based on the parse in the previous exercise. Provide methods for looking up annotations and log ratios by gene name.

[1] Note for Mac users: Excel will offer you Macintosh and DOS/Windows text formats. Choose DOS/Windows; otherwise, Python will think that the entire file is a single line.
