Programming in Python, Lecture 2: Sequences



SLIDE 1

1

Programming in Python

Lecture 2: Sequences

Michael Schroeder, Sven Schreiber

sven.schreiber@tu-dresden.de

Slides derived from Ian Holmes, Department of Statistics, University of Oxford

Updates by Andreas Henschel

SLIDE 2

2

Overview

  • Types of sequences and their properties

– Lists, Tuples, Strings, Range

  • Building, accessing and modifying sequences
  • List comprehensions
  • File operations
SLIDE 3

3

Types and Properties of Sequences

SLIDE 4

4

Lists vs tuples

  • Both are sequences (used to store collections of objects)
  • Tuples are immutable, lists are mutable
  • Lists are more flexible
  • Tuples provide better performance (they use less memory)
  • Rule of thumb: lists for objects of a similar kind, tuples for objects of different kinds

Construction (syntax)

    l = [1, 2, 3, 4]
    l2 = ['Apple', 'Banana', 'Orange']
    t = ('sebastian', 'm', 28)
    t2 = ('motif', 'ATTCG', 'E44')

Accessing elements

    l[0]    # 1
    t[0]    # 'sebastian'

Adding/modifying elements

    l.append(3)
    l[1] = 5
    t.append(3)    # fails: tuples are immutable!
    t[1] = 5       # fails: tuples are immutable!

Concatenating

    l3 = l + [3, 2]
    t3 = t + ('phd', 'biotec')
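The difference between mutable lists and immutable tuples can be checked directly. The following is a minimal sketch rather than part of the original slides; the variable names `numbers`, `person`, `errors` and `extended` are chosen here for illustration only.

```python
# Illustrative names, not from the slides.
numbers = [1, 2, 3, 4]
person = ('sebastian', 'm', 28)

numbers.append(3)      # fine: the list grows in place
numbers[1] = 5         # fine: the second element is replaced

errors = []
try:
    person[1] = 'f'    # tuples do not support item assignment
except TypeError as exc:
    errors.append(type(exc).__name__)

try:
    person.append(3)   # tuples have no append method
except AttributeError as exc:
    errors.append(type(exc).__name__)

# Concatenation works for both types because it builds a new object
# and leaves the operands untouched.
extended = person + ('phd', 'biotec')

print(numbers)    # [1, 5, 3, 4, 3]
print(errors)     # ['TypeError', 'AttributeError']
print(extended)   # ('sebastian', 'm', 28, 'phd', 'biotec')
```

Both failed operations raise an exception instead of silently doing nothing, which is why the slide marks them as errors.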

SLIDE 5

5

Range

  • Used to provide collections of consecutive integers
  • Allows iteration with loops
  • Numbers are not stored in memory, but generated only when needed (while looping)
  • Saves time and memory with larger sets of numbers

    for x in range(10000):
        print(x)

Output:

    0
    1
    2
    ...
    9998
    9999

Counting starts at 0 and the stop value (here 10000) is excluded!
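The properties listed above can be inspected without looping. This is a small sketch, not taken from the slides; the names `r`, `lazy` and `stored` are illustrative, and the size comparison relies on `sys.getsizeof`, whose exact numbers depend on the Python implementation (CPython is assumed here).

```python
import sys

r = range(10000)
print(r[0], r[-1], len(r))       # 0 9999 10000

# Converting to a list makes the generated numbers visible.
print(list(range(5)))            # [0, 1, 2, 3, 4]
print(list(range(1, 6)))         # [1, 2, 3, 4, 5]
print(list(range(0, 10, 2)))     # [0, 2, 4, 6, 8]

# A range only remembers start, stop and step, whereas a list stores
# a reference to every element (CPython assumed for the comparison).
lazy = range(1_000_000)
stored = list(lazy)
print(sys.getsizeof(lazy) < sys.getsizeof(stored))   # True
```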

SLIDE 6

6

Working with Lists

SLIDE 7

7

Lists

A list is a collection of values/objects.

    nucleotides = ['a', 'c', 'g', 't']
    print("Nucleotides: ", nucleotides)

Output:

    Nucleotides:  ['a', 'c', 'g', 't']

We can think of the above as a container with 4 entries:

    element 0   element 1   element 2   element 3
        a           c           g           t

The list is the collection of all four elements.

Note that the element indices start at zero!

SLIDE 8

8

List literals

  • There are several ways to create or obtain lists.

    a = [1,2,3,4,5]
    print("a = ",a)
    b = ['a','c','g','t']
    print("b = ",b)
    c = list(range(1,6))
    print("c = ",c)
    d = "a c g t".split()
    print("d = ", d)

Output:

    a =  [1, 2, 3, 4, 5]
    b =  ['a', 'c', 'g', 't']
    c =  [1, 2, 3, 4, 5]
    d =  ['a', 'c', 'g', 't']

The first form is the most common: a comma-separated list, delimited by square brackets.

SLIDE 9

9

Accessing lists

To access list elements, use square brackets, e.g. x[0] means "element zero of list x".

  • Remember, element indices start at zero!
  • Negative indices refer to elements counting from the end, e.g. x[-1] means "last element of list x"

    x = ['a', 'c', 'g', 't']
    i = 2
    print(x[0], x[i], x[-1])

Output:

    a g t
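As a short illustration of how positive and negative indices relate, and of what happens when an index does not exist, consider the sketch below. It is not part of the slides, and the variable `message` is a name introduced here.

```python
x = ['a', 'c', 'g', 't']

# A negative index -k addresses the same element as len(x) - k.
print(x[-1] == x[len(x) - 1])   # True
print(x[-4] == x[0])            # True

# Valid indices run from -len(x) to len(x) - 1; anything else fails.
message = None
try:
    x[4]
except IndexError:
    message = 'index 4 is out of range'
print(message)                  # index 4 is out of range
```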

SLIDE 10

10

List operations

  • You can sort and reverse lists...
  • You can add, delete and count elements

    x = ['a', 't', 'g', 'c']
    print("x =",x)
    x.sort()
    print("x =",x)
    x.reverse()
    print("x =",x)

Output:

    x = ['a', 't', 'g', 'c']
    x = ['a', 'c', 'g', 't']
    x = ['t', 'g', 'c', 'a']

    nums = [2,2,5,2,6]
    nums.append(8)
    print(nums)
    print(nums.count(2))
    nums.remove(5)
    print(nums)

Output:

    [2, 2, 5, 2, 6, 8]
    3
    [2, 2, 2, 6, 8]
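Two details of these operations are easy to miss: `sort()` changes the list in place and returns `None`, and `remove()` deletes only the first matching element. The sketch below illustrates both; it is an addition to the slides, and the names `ordered`, `before`, `result` and `descending` are illustrative.

```python
x = ['a', 't', 'g', 'c']

ordered = sorted(x)        # built-in sorted() returns a new list
before = list(x)           # x itself is still unsorted at this point
result = x.sort()          # sort() reorders x in place and returns None

print(ordered)             # ['a', 'c', 'g', 't']
print(before)              # ['a', 't', 'g', 'c']
print(result, x)           # None ['a', 'c', 'g', 't']

descending = sorted(x, reverse=True)
print(descending)          # ['t', 'g', 'c', 'a']

nums = [2, 5, 2, 6]
nums.remove(2)             # removes only the first 2
print(nums)                # [5, 2, 6]
del nums[0]                # del removes by position instead of by value
print(nums)                # [2, 6]
```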

SLIDE 11

11

More list operations

    >>> x = [1, 0] * 2     # multiplying lists
    >>> x
    [1, 0, 1, 0]
    >>> x.pop()            # pop() obtains and removes the last element of a list
    0
    >>> x
    [1, 0, 1]
    >>> x += x             # concatenating lists with + or +=
    >>> x
    [1, 0, 1, 1, 0, 1]
    >>> x.index(0)         # index(..) searches for the first occurrence of an element
    1
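A few related behaviours are worth knowing when using `pop()` and `index()`. The following sketch is an addition to the slides, with illustrative names (`last`, `first`, `target`, `position`, `caught`): `pop()` also accepts a position, and `index()` raises an exception when the element is missing, so a membership test with `in` is a common guard.

```python
x = [1, 0, 1, 0]

last = x.pop()       # removes and returns the last element
first = x.pop(0)     # an optional position removes that element instead
print(last, first, x)          # 0 1 [0, 1]

# index() raises ValueError if the element does not occur,
# so check membership first when the element may be absent.
target = 5
position = x.index(target) if target in x else None
print(position)                # None

caught = False
try:
    x.index(target)
except ValueError:
    caught = True
print(caught)                  # True
```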

SLIDE 12

12

Example: Reverse complementing DNA

A common operation due to the double-helix symmetry of DNA.

    # Start by making the string lower case again. This is generally good practice.
    dna = "accACgttAGgtct".lower()
    # Replace 'a' with 't', 'c' with 'g', 'g' with 'c' and 't' with 'a'
    replaced = dna.replace("a", "_a") \
                  .replace("t", "a").replace("_a", "t") \
                  .replace("g", "_g").replace("c", "g") \
                  .replace("_g", "c")
    # Convert to list and reverse
    replacedList = list(replaced)
    replacedList.reverse()
    # Convert back to string using join
    print("".join(replacedList))

Output:

    agacctaacgtggt
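The same result can be obtained more compactly. The sketch below is an alternative to the slide's chained `replace` calls, not the slide's own method: it uses the standard string methods `str.maketrans` and `translate` for the complement and a slice with step -1 for the reversal. The function name `reverse_complement` is introduced here for illustration.

```python
def reverse_complement(dna):
    """Return the reverse complement of a DNA string in lower case."""
    # Map every base to its partner in a single pass, so no
    # temporary placeholders such as "_a" are needed.
    table = str.maketrans('acgt', 'tgca')
    complement = dna.lower().translate(table)
    return complement[::-1]    # a slice with step -1 reverses the string


print(reverse_complement("accACgttAGgtct"))   # agacctaacgtggt
```

Applying the function twice returns the lower-case original, which is a quick sanity check for any reverse-complement implementation.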

SLIDE 13

13

Taking a slice of a list

  • The syntax x[i:j] returns a list containing elements i, i+1, …, j-1 of list x

    nucleotides = ['a', 'g', 'c', 't']
    print(nucleotides)
    print(nucleotides[0:2])    # nucleotides[:2] also works
    print(nucleotides[2:4])    # nucleotides[2:] also works
    print(nucleotides[-2:])    # takes last two elements
    print(nucleotides[::2])    # takes every second
    print(nucleotides[::-1])   # obtains reversed list

Output:

    ['a', 'g', 'c', 't']
    ['a', 'g']
    ['c', 't']
    ['c', 't']
    ['a', 'c']
    ['t', 'c', 'g', 'a']
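Slices have a few properties that the example above does not show. This short sketch is an addition to the slides (the name `duplicate` is illustrative): a slice is a new list, slice bounds outside the list are clipped rather than raising an error, and the same syntax works for strings and tuples.

```python
nucleotides = ['a', 'g', 'c', 't']

# A slice is a new list, so changing it leaves the original alone.
duplicate = nucleotides[:]
duplicate[0] = 'x'
print(nucleotides)          # ['a', 'g', 'c', 't']
print(duplicate)            # ['x', 'g', 'c', 't']

# Unlike single indices, slice bounds outside the list are clipped.
print(nucleotides[2:10])    # ['c', 't']
print(nucleotides[3:1])     # []

# Slicing works on every sequence type and keeps that type.
print("acgt"[1:3])          # cg
print((1, 2, 3, 4)[::-1])   # (4, 3, 2, 1)
```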

SLIDE 14

14

Lists and Strings

  • A string can be split into a list of strings

– Using the split method: string.split(separator)

  • A list of strings can be joined into one string

– Using the join method: separator.join(list)

    sentence = 'This is a complete sentence.'
    print(sentence.split())

Output:

    ['This', 'is', 'a', 'complete', 'sentence.']

    datarow = 'Apples,Bananas,Oranges'
    print(datarow.split(','))

Output:

    ['Apples', 'Bananas', 'Oranges']

    cities = ['Dresden', 'Munich', 'Hamburg', 'Cologne']
    print(' -> '.join(cities))

Output:

    Dresden -> Munich -> Hamburg -> Cologne
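Since split and join are inverse operations, a round trip returns the original string. The sketch below is an addition to the slides and uses illustrative names (`row`, `fields`, `numbers`, `joined`); it also shows that join only accepts strings and that `split()` without an argument treats runs of whitespace as one separator.

```python
row = 'Apples,Bananas,Oranges'
fields = row.split(',')
print(','.join(fields) == row)    # True: split and join undo each other

# join() only accepts strings, so other values are converted first.
numbers = [1, 2, 3]
joined = ', '.join(str(n) for n in numbers)
print(joined)                     # 1, 2, 3

# Without an argument, split() collapses runs of whitespace;
# with an explicit separator, empty strings are kept.
print('a  b'.split())             # ['a', 'b']
print('a  b'.split(' '))          # ['a', '', 'b']
```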

SLIDE 15

15

List Comprehensions

SLIDE 16

16

What are list comprehensions?

  • Very concise way to build and transform lists
  • Typically replaces a for loop and an if statement
  • Used very often in Python
  • Syntax: [expr(var) for var in sequence if condition]

Verbose construction of the list:

    newlist = []
    for x in range(1,11):
        if x % 2:
            newlist.append(x**2)

Construction with list comprehension:

    newlist = [x**2 for x in range(1,11) if x % 2]

Result (squares of all odd numbers between 1 and 10):

    [1, 9, 25, 49, 81]
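The expression and the condition of a comprehension are independent of each other, and the condition is optional. The following sketch, which is not from the slides and uses illustrative names, separates the two roles and confirms that the comprehension matches the verbose loop.

```python
values = range(1, 11)

squares = [x**2 for x in values]                # transform only
odds = [x for x in values if x % 2]             # filter only
odd_squares = [x**2 for x in values if x % 2]   # filter, then transform

# The verbose loop with the same meaning as odd_squares.
loop_result = []
for x in values:
    if x % 2:
        loop_result.append(x**2)

print(squares)                      # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(odds)                         # [1, 3, 5, 7, 9]
print(odd_squares == loop_result)   # True
```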

SLIDE 17

17

Examples: List comprehensions

    sentence = 'I like MySQL but not Python'
    print([(w.lower(), len(w)) for w in sentence.split()])

Output:

    [('i', 1), ('like', 4), ('mysql', 5), ('but', 3), ('not', 3), ('python', 6)]

Sum up all positive integers in a tuple:

    numbers = (1,0,-1,6,3,-2,3,4)
    total = sum([x for x in numbers if x > 0])   # "total" avoids shadowing the built-in sum
    print(total)

Output:

    17

SLIDE 18

18

File IO

SLIDE 19

Opening and reading a file

    f = open('myfile.txt', 'r')     # open() returns a file handle; 'r' is the file mode (r, w, a, ...)
    for line in f:                  # line-wise iteration over the file; line is the loop variable
        if not line.startswith('#'):
            print(line.rstrip())    # rstrip() drops the newline each line still carries
    f.close()

Contents of myfile.txt:

    #Old number
    1234
    # New number
    5555
    # Test
    1

Output:

    1234
    5555
    1

Shorter and better form (the file is closed after the block!):

    with open('myfile.txt', 'r') as f:
        for line in f:
            if not line.startswith('#'):
                print(line.rstrip())
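The other file modes mentioned above work the same way. The sketch below is an addition to the slides: it writes a small file with mode 'w', extends it with mode 'a' and reads it back with mode 'r'. To keep it self-contained it uses a temporary directory from the standard `tempfile` module; the names `path` and `kept` are illustrative.

```python
import os
import tempfile

# Illustrative location: a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')

with open(path, 'w') as f:      # 'w' creates the file or overwrites it
    f.write('#Old number\n')
    f.write('1234\n')

with open(path, 'a') as f:      # 'a' appends to the existing content
    f.write('# New number\n5555\n')

with open(path, 'r') as f:      # 'r' reads; iteration is line by line
    kept = [line.rstrip() for line in f if not line.startswith('#')]

print(kept)        # ['1234', '5555']
print(f.closed)    # True: the with block closed the file
```

Note that `write()` does not add a newline on its own, so each line ends with an explicit `\n`.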

SLIDE 20

20

Example: FASTA format

  • A format for storing multiple named sequences
  • This file (fly3utr.txt) contains 3' UTRs for Drosophila genes CG11604, CG11455 and CG11488
  • The name of each sequence is preceded by a > symbol
  • NB sequences can span multiple lines

fly3utr.txt:

    >CG11604
    TAGTTATAGCGTGAGTTAGT
    TGTAAAGGAACGTGAAAGAT
    AAATACATTTTCAATACC
    >CG11455
    TAGACGGAGACCCGTTTTTC
    TTGGTTAGTTTCACATTGTA
    AAACTGCAAATTGTGTAAAA
    ATAAAATGAGAAACAATTCT
    GGT
    >CG11488
    TAGAAGTCAAAAAAGTCAAG
    TTTGTTATATAACAAGAAAT
    CAAAAATTATATAATTGTTT
    TTCACTCT

SLIDE 21

21

Example: FASTA format

    with open('fly3utr.txt', 'r') as f:
        for line in f:
            if line.startswith('>'):
                print(line[1:].rstrip())   # drop the leading '>' and the trailing newline

Output:

    CG11604
    CG11455
    CG11488

What if we want to show the length of each sequence record?


SLIDE 22

22

Example: FASTA format

    name = None
    length = None
    with open('fly3utr.txt', 'r') as f:
        for line in f:
            line = line.rstrip()
            if line.startswith('>'):
                # None -> False, so nothing is printed before the first record
                if name:
                    print(name, length)
                name = line[1:]
                length = 0
            else:
                length += len(line)
    # The last record has no following header line, so print it after the loop
    print(name, length)

Output:

    CG11604 58
    CG11455 83
    CG11488 68
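When the sequences themselves are needed later, collecting them in a dictionary is a common variation on the loop above. The sketch below is not from the slides: the function name `read_fasta` and the constant `FASTA_TEXT` are hypothetical, and `io.StringIO` stands in for the file fly3utr.txt so that the example runs without any file on disk.

```python
import io

# Stand-in for fly3utr.txt (same content as shown on the slide).
FASTA_TEXT = """>CG11604
TAGTTATAGCGTGAGTTAGT
TGTAAAGGAACGTGAAAGAT
AAATACATTTTCAATACC
>CG11455
TAGACGGAGACCCGTTTTTC
TTGGTTAGTTTCACATTGTA
AAACTGCAAATTGTGTAAAA
ATAAAATGAGAAACAATTCT
GGT
>CG11488
TAGAAGTCAAAAAAGTCAAG
TTTGTTATATAACAAGAAAT
CAAAAATTATATAATTGTTT
TTCACTCT
"""


def read_fasta(handle):
    """Map each sequence name to its full sequence (hypothetical helper)."""
    parts = {}
    name = None
    for line in handle:
        line = line.rstrip()
        if line.startswith('>'):
            name = line[1:]            # a header line starts a new record
            parts[name] = []
        elif name is not None and line:
            parts[name].append(line)   # sequence lines belong to the current record
    # Join the pieces of every record into one string.
    return {n: ''.join(chunks) for n, chunks in parts.items()}


records = read_fasta(io.StringIO(FASTA_TEXT))
for name, sequence in records.items():
    print(name, len(sequence))
```

Because the function only iterates over its argument line by line, it accepts a real file handle obtained with `open('fly3utr.txt', 'r')` in the same way.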


SLIDE 23

23

Summary

  • Strings, lists, tuples and ranges are all sequences
  • Lists (usually for elements of same type)

– More flexible, more memory consumption

  • Tuples (usually store elements of different types)

– Immutable, less memory consumption

  • Ranges for fast numeric iteration

– Least memory consumption

  • List comprehensions are a concise way to transform sequences
  • Convert strings into lists and vice versa with split and join
  • File handles provide line-wise iteration