loops continued Genome 559: Introduction to Statistical and - PowerPoint PPT Presentation

loops continued Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Review • Pick variable names that are descriptive • Comment your code if it complex (# sign) for <element> in <object>: Use for loop to iterate over <statement> elements in a list, numbers, or <statement> characters in a string . . . <last statement> while (conditional test): Use while loop to run until <statement1> some condition is met <statement2> . . . <last statement>

Review Increment operator x += y # adds y to the value of x x *= y # multiplies x by the value y x -= y # subtracts y from the value of x Explicit program exit sys.exit() # exit program immediately Use to terminate when something is wrong - best to use print to provide user feedback before exit

Smart loop use Read a file and print the first ten lines import sys infile = open(sys.argv[1], "r") lineList = infile.readlines() counter = 0 for line in lineList: counter += 1 Does this work? if (counter > 10): break YES print line infile.close() Is it ideal? NO

What if the file has a million lines? (not uncommon in bioinformatics) import sys this statement reads infile = open(sys.argv[1], "r") all million lines!! lineList = infile.readlines() counter = 0 for line in lineList: counter += 1 if (counter > 10): break print line infile.close() How about this instead? import sys infile = open(sys.argv[1], "r") this version reads only for counter in range(10): the first ten lines, one line = infile.readline() at a time print line infile.close()

This while loop does the same thing, just as efficiently: import sys infile = open(sys.argv[1], "r") counter = 0 while counter < 10: line = infile.readline() print line counter += 1 infile.close() • The original readlines() approach not only takes much longer on large files it also has to store ALL the data in memory. • I ran original version and efficient version on a very large file. • Original version ran for 45 seconds and crashed when it ran out of memory. • Improved version ran successfully in the blink of an eye.

What if the file has fewer than ten lines? import sys infile = open(sys.argv[1], "r") for counter in range(10): line = infile.readline() print line infile.close() hint - when readline() reaches the end of a file, it returns "" It prints a blank line repeatedly Improved version: import sys infile = open(sys.argv[1], "r") for counter in range(10): line = infile.readline() if len(line) == 0: added code, tests for end of file break print line infile.close()

Sequential splitting of file contents Many problems in text or sequence parsing can employ this strategy: • First, chop file content into chunks (lines or fasta sequences etc.) • Second, extract needed data from each chunk • This can even be repeated - split each chunk into subchunks, extract needed data from subchunks import sys lineList = open(sys.argv[1], "r").readlines() for line in lineList: fieldList = line.strip().split("\t") for field in fieldList: <do something> What does this do?

Sample problem #1 Write a program read-N-lines.py that prints the first N lines from a file, where N is the first argument and filename is the second argument. Use a while loop and be sure it handles short and long files. >python read-N-lines.py 7 file.txt this file has five lines >

Solution #1 import sys infile = open(sys.argv[2], "r") max = int(sys.argv[1]) counter = 0 while counter < max: line = infile.readline() if len(line) == 0: # we reached end of file break print line counter += 1

Sample problem #2 Write a program count-fasta.py that counts the number of fasta sequences in a file specified on the command line. Make sure it can run on a huge file (don't read the entire file content at once). Fasta format: >identifier1 comment comment comment sequence on any number AAOSIUBOASIUETOAISOBUAOSIDUGOAIBUOABOIUAS of lines until next “>” AOSIUDTOAISUETOIGLKBJLZXCOITLJLBIULEIJLIJ >identifier2 comment comment TXDIGSIDJOIJEOITJOSIJOIGJSOIEJTSOE >identifier3 Etc. Two files are linked in News on the course web page – run your program on both: small.txt and large.txt

Solution #2 Not required, but a import sys good habit to get into # Make sure we got an argument on the command line. if (len(sys.argv) < 2): print("USAGE: count-fasta.py file argument required") sys.exit() # Open the file for reading. fasta_file = open(sys.argv[1], "r") lineList = fastaFile.readlines() num_seqs = 0 for line in lineList: # Increment if this is the start of a sequence. if (line[0] == ">"): num_seqs += 1 Not so good - will run out print num_seqs of memory if file is huge fasta_file.close()

Alternative solution #2 import sys # Make sure we got an argument on the command line. if (len(sys.argv) < 2): print("USAGE: count-fasta.py file argument required") sys.exit() # Open the file for reading. fasta_file = open(sys.argv[1], "r") wholeText = fastaFile.read() print wholeText.count(">") fasta_file.close() Not so good - will run out of memory if file is huge

Improved solution #2 import sys # Make sure we got an argument on the command line. if (len(sys.argv) < 2): print "USAGE: count-fasta.py file argument required" sys.exit() # Open the file for reading. fasta_file = open(sys.argv[1], "r") num_seqs = 0 line = fasta_file.readline() # read first line while len(line) > 0: # Increment if this is the start of a sequence. if line[0] == ">": # or if line.startswith(">"): num_seqs += 1 line = fasta_file.readline() # read next line print num_seqs fasta_file.close() Note - when readline() encounters the end-of-file (EOF) it returns "" (empty string)

Challenge problem Write a program seq-len.py that reads a file of fasta sequences and prints the name and length of each sequence and their total length. >seq-len.py seqs.fasta seq1 432 Here‟s what fasta sequences look like: seq2 237 >foo gatactgactacagttt seq3 231 ggatatcg Total length 900 >bar agctcacggtatcttag agctcacaataccatcc ggatac >etc… („>‟ followed by name, newline, sequence on any number of lines until next „>‟)

One solution import sys this version may have filename = sys.argv[1] problems with large files myFile = open(filename, "r") myLines = myFile.readlines() myFile.close() # we read the file, now close it cur_name = myLines[0] # initialize required variables cur_len = 0 total_len = 0 for index in range(1, len(myLines)): line = myLines[index] if (line.startswith(">")): # we reached a new fasta sequence print cur_name, cur_len # write values for previous sequence total_len += cur_len # increment total_len cur_name = line.strip() # record the name of the new sequence cur_len = 0 # reset cur_len else: # still in the current sequence, increment length cur_len += len(line.strip()) index += 1 print cur_name, cur_len # we need to write the last values print “Total length", total_len

Lea came up with a far more elegant solution. Here is my version using Lea‟s method: import sys filename = sys.argv[1] myFile = open(filename, "r") this version may have whole_string = myFile.read() problems with large files myFile.close() seqList = whole_string.split(">") total_len = 0 for seq in seqList: lineList = seq.split("\n") length = len("".join(lineList[1:])) total_len += length print lineList[0], length print "Total length", total_len What this does is split the text of the entire file on “>”, which gives a list of strings (each containing the sequence with its name). Each of these strings is split at “ \ n” characters, which gives a list of lines. The 0 th line in this list is the name, and the rest of the lines are sequence. The funky looking join statement just merges all the sequence lines into one long string and gets its length.

A solution that will handle large files without running out of memory import sys filename = sys.argv[1] myFile = open(filename, "r") cur_name = "" cur_len = 0 total_len = 0 for line in myFile: if (total_len > 0 and line.startswith(">")): # we reached a new fasta sequence print cur_name, cur_len # write values for previous sequence cur_name = line.strip() # record the name of the new sequence cur_len = 0 # reset cur_len elif line.startswith(">"): # this is the first fasta name, record it cur_name = line.strip() else: cur_len += len(line.strip()) total_len += cur_len print cur_name, cur_len # we need to write the last values print "Total length", total_len myFile.close()

One of the arts of programming is seeing how to write elegant loops that do complex things. It takes time and practice.

loops continued Genome 559: Introduction to Statistical and - PowerPoint PPT Presentation

loops continued Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Review Pick variable names that are descriptive Comment your code if it complex (# sign) for <element> in <object>: Use

loops continued and coding efficiently Genome 559: Introduction to Statistical and Computational

Tutorial 3 Loops Side Effects 1 CS 136 Spring 2020 Tutorial 3 Loops: for loops &

Loops! Flow of Control: Loops (Savitch, Chapter 4) TOPICS while Loops do while

LOOPS Loops Loops Loops! How can we repeat a piece of code without having to write it out over

Building Java Programs Chapter 5 Lecture 11: while Loops, Fencepost Loops, and Sentinel Loops

Building Java Programs Chapter 5 Lecture 5-1: while Loops, Fencepost Loops, and Sentinel Loops

Building Java Programs Chapter 5 Lecture 5-1: while Loops, Fencepost Loops, and Sentinel Loops

ARM Assembler Structure / Loops Structure / Loops p. 1/12 Loops Four parts to any loop

Topic 5 for loops and nested loops Always to see the general in the particular is the very

Loops! Loops! Loops! Lecture 5 COP 3014 Fall 2020 September 17, 2020 Repetition Statements

Loops! Loops! Loops! Lecture 10 COP 3014 Spring 2017 January 31, 2017 Repetition Statements

Loops Simone Campanoni simonec@eecs.northwestern.edu Outline Loops Identify loops

Building Java Programs Chapter 5 Lecture 10: while Loops, Fencepost Loops, and Sentinel Loops

Loops Most execution time in most programs is spent in loops: 90/10 is typical CS 4120

CPSC 231 - Lab LOOPS Based on Ryan Henry's Slides Loooooooooooo...oooop Sometimes we need to do

Topic 15 Indefinite Loops - While Loops "If you cannot grok [understand] the overall

TOPIC 7 MORE EXCEL Max Fowler (Computer Science)

ISAs and Y86-64 Z80 TI calculators lab Wednesday in-lab quiz C loops to assembly loops

Loops Data Set Analysis Thomas Schwarz, SJ Marquette University Loops Computer Science knows

While Loops 15-110 Friday 09/18 Learning Goals Use while loops when reading and writing

Types of loops Topic 15 definite loop : A loop that executes a known number of Indefinite

Control Structures Termination s = 0 while True: x = int(input("Enter a positive number,

COMP 204 For loops, nested loops: Putting it all together Mathieu Blanchette 1 / 6 Clinical

The Finite Embeddability Property for IP Loops and Local Embeddability of Groups into Finite IP

loops continued Genome 559: Introduction to Statistical and - PowerPoint PPT Presentation

loops continued Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Review Pick variable names that are descriptive Comment your code if it complex (# sign) for <element> in <object>: Use

loops continued and coding efficiently Genome 559: Introduction to Statistical and Computational

Tutorial 3 Loops Side Effects 1 CS 136 Spring 2020 Tutorial 3 Loops: for loops &amp;

Loops! Flow of Control: Loops (Savitch, Chapter 4) TOPICS while Loops do while

LOOPS Loops Loops Loops! How can we repeat a piece of code without having to write it out over

Building Java Programs Chapter 5 Lecture 11: while Loops, Fencepost Loops, and Sentinel Loops

Building Java Programs Chapter 5 Lecture 5-1: while Loops, Fencepost Loops, and Sentinel Loops

Building Java Programs Chapter 5 Lecture 5-1: while Loops, Fencepost Loops, and Sentinel Loops

ARM Assembler Structure / Loops Structure / Loops p. 1/12 Loops Four parts to any loop

Topic 5 for loops and nested loops Always to see the general in the particular is the very

Loops! Loops! Loops! Lecture 5 COP 3014 Fall 2020 September 17, 2020 Repetition Statements

Loops! Loops! Loops! Lecture 10 COP 3014 Spring 2017 January 31, 2017 Repetition Statements

Loops Simone Campanoni simonec@eecs.northwestern.edu Outline Loops Identify loops

Building Java Programs Chapter 5 Lecture 10: while Loops, Fencepost Loops, and Sentinel Loops

Loops Most execution time in most programs is spent in loops: 90/10 is typical CS 4120

CPSC 231 - Lab LOOPS Based on Ryan Henry's Slides Loooooooooooo...oooop Sometimes we need to do

Topic 15 Indefinite Loops - While Loops &quot;If you cannot grok [understand] the overall

TOPIC 7 MORE EXCEL Max Fowler (Computer Science)

ISAs and Y86-64 Z80 TI calculators lab Wednesday in-lab quiz C loops to assembly loops

Loops Data Set Analysis Thomas Schwarz, SJ Marquette University Loops Computer Science knows

While Loops 15-110 Friday 09/18 Learning Goals Use while loops when reading and writing

Types of loops Topic 15 definite loop : A loop that executes a known number of Indefinite

Control Structures Termination s = 0 while True: x = int(input(&quot;Enter a positive number,

COMP 204 For loops, nested loops: Putting it all together Mathieu Blanchette 1 / 6 Clinical

The Finite Embeddability Property for IP Loops and Local Embeddability of Groups into Finite IP

Tutorial 3 Loops Side Effects 1 CS 136 Spring 2020 Tutorial 3 Loops: for loops &

Topic 15 Indefinite Loops - While Loops "If you cannot grok [understand] the overall

Control Structures Termination s = 0 while True: x = int(input("Enter a positive number,