Class exercise: Single-nucleotide polymorphism
Single-nucleotide polymorphism
- A single-nucleotide polymorphism (SNP, pronounced "snip") is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present at a level of more than 1% in the population.
- For example, at a specific base position in the human genome, the C nucleotide may appear in most individuals, but in a minority of individuals, the position is occupied by an A. This means that there is a SNP at this specific position, and the two possible nucleotide variations (C or A) are said to be alleles for this position.
Objective
- Write a program that, given a sequence of data about SNPs, computes:
  - the number of transitions (A vs. G or C vs. T) within the data for each chromosome
  - the number of transversions (anything that is not a transition) within the data for each chromosome
- BUT FIRST YOU HAVE TO DESIGN THE SOFTWARE BY DEFINING CRC CARDS AND UML CLASS DIAGRAMS
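The transition/transversion rule above can be sketched as a pair of small helper functions. This is only an illustration of the rule, not the class design the exercise asks for; the names `TRANSITIONS`, `is_transition`, and `is_transversion` are assumptions made for this sketch.

```python
# Hypothetical helpers: these names are illustrative, not part of the exercise.
# Unordered pairs of bases that form a transition (A vs. G, C vs. T).
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def is_transition(ref, alt):
    """Return True if replacing ref with alt is a transition."""
    return frozenset((ref.upper(), alt.upper())) in TRANSITIONS


def is_transversion(ref, alt):
    """Per the objective above: anything that is not a transition."""
    return not is_transition(ref, alt)


print(is_transition("A", "G"))    # True
print(is_transition("C", "A"))    # False
print(is_transversion("C", "A"))  # True
```

Using unordered pairs means A/G and G/A are treated alike, so the direction of the substitution does not matter.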
Input data
- A dataset consisting of a VCF file representing a random sampling of SNPs from three people (a mother, a father, and their daughter) compared to the reference human genome.
- VCF is a tab-separated tabular format similar to CSV
- Each row of the dataset describes one SNP
Input data sample
- Column 1: chromosome #
- Column 2: SNP's position in the chromosome
- Column 3: SNP's ID
- Column 4: reference base at this position
- Column 5: alternative base found
What to do: SNP class (1)
- Implement a SNP class whose objects hold the relevant information about a single line in the VCF file.
- The SNP class is a derived class of AlleleVariation, which is an abstract class.
- AlleleVariation provides two abstract methods:
  - .isTransition() should return True if the variation is a transition and False otherwise, by looking at the two allele instance variables.
  - .isTransversion() should return True if the variation is not a transition and False otherwise.
- Instances of SNP include the following private attributes:
  - the reference allele (a one-character string in column 4, e.g., "A")
  - the alternative allele (a one-character string in column 5, e.g., "G")
  - the name of the chromosome on which it exists (a string in column 1, e.g., "1")
  - the reference position (an integer in column 2, e.g., 799739)
  - and the ID of the SNP (in column 3, e.g., "rs57181708" or ".").
- Because we’ll be parsing lines one at a time, all of this information can be provided in the
constructor.
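One way to express the abstract class in Python is with the standard `abc` module. The slides do not say how AlleleVariation should be implemented, so the use of `ABC` and `@abstractmethod` here is an assumption; treat this as a minimal sketch.

```python
from abc import ABC, abstractmethod


class AlleleVariation(ABC):
    """Abstract base class: subclasses must classify the variation."""

    @abstractmethod
    def isTransition(self):
        """Return True if the variation is a transition, False otherwise."""

    @abstractmethod
    def isTransversion(self):
        """Return True if the variation is not a transition, False otherwise."""


# A class with unimplemented abstract methods cannot be instantiated.
try:
    AlleleVariation()
except TypeError as error:
    print("Cannot instantiate AlleleVariation:", error)
```

Any derived class, such as SNP, has to override both methods before objects of it can be created.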
What to do: SNP class (2)
- SNP objects should be able to answer questions:
  - isTransition() should return True if the SNP is a transition and False otherwise, by looking at the two allele instance variables. A transition is A/G, G/A, C/T, or T/C.
  - isTransversion() should return True if the SNP is not a transition and False otherwise.
- Use inheritance and overriding to implement these methods, and encapsulation to hide all attributes of SNP.
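A possible shape for the SNP class is sketched below. The constructor parameter names, the `_TRANSITIONS` constant, and the use of double-underscore (name-mangled) attributes as "private" attributes are assumptions of this sketch. The sample object combines the example values quoted on the slides and is not necessarily a real row of the dataset.

```python
from abc import ABC, abstractmethod


class AlleleVariation(ABC):
    """Compact abstract base so that this sketch is self-contained."""

    @abstractmethod
    def isTransition(self): ...

    @abstractmethod
    def isTransversion(self): ...


class SNP(AlleleVariation):
    # Unordered base pairs that count as transitions (assumed constant name).
    _TRANSITIONS = ({"A", "G"}, {"C", "T"})

    def __init__(self, chromosome, position, snp_id, ref_allele, alt_allele):
        # Double-underscore names are name-mangled, which is the closest
        # Python comes to private attributes.
        self.__chromosome = chromosome
        self.__position = int(position)  # column 2 arrives as text
        self.__id = snp_id
        self.__ref_allele = ref_allele
        self.__alt_allele = alt_allele

    def isTransition(self):
        # Set comparison ignores order, so A/G and G/A both match.
        return {self.__ref_allele, self.__alt_allele} in self._TRANSITIONS

    def isTransversion(self):
        return not self.isTransition()


snp = SNP("1", "799739", "rs57181708", "A", "G")
print(snp.isTransition())    # True
print(snp.isTransversion())  # False
```

Defining isTransversion() in terms of isTransition() keeps the two answers consistent by construction.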
What to do: Chromosome class
- Implement a Chromosome class that provides four methods:
  - count_transitions(), which returns the number of transition SNPs
  - count_transversions(), which returns the number of transversion SNPs
  - addSNP(), which adds a SNP object to the array of SNPs associated with the current Chromosome
  - getName(), which returns the string representing the name of the Chromosome
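The four methods above could look roughly like this. To keep the sketch self-contained, it uses a hypothetical `StubSNP` stand-in rather than the real SNP class; the only thing Chromosome needs from its elements is isTransition() and isTransversion(). Storing the SNPs in a Python list is also an assumption.

```python
class Chromosome:
    def __init__(self, name):
        self.__name = name
        self.__snps = []  # SNP objects added so far

    def getName(self):
        return self.__name

    def addSNP(self, snp):
        self.__snps.append(snp)

    def count_transitions(self):
        return sum(1 for snp in self.__snps if snp.isTransition())

    def count_transversions(self):
        return sum(1 for snp in self.__snps if snp.isTransversion())


class StubSNP:
    """Hypothetical stand-in for SNP, used only to exercise Chromosome."""

    def __init__(self, transition):
        self._transition = transition

    def isTransition(self):
        return self._transition

    def isTransversion(self):
        return not self._transition


chromosome = Chromosome("1")
for flag in (True, True, False):
    chromosome.addSNP(StubSNP(flag))

print(chromosome.getName())              # 1
print(chromosome.count_transitions())    # 2
print(chromosome.count_transversions())  # 1
```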
Where to get the dataset
- The dataset can be downloaded here:
https://raw.githubusercontent.com/anuzzolese/genomics-unibo/master/2019-2020/data/trio.sample.vcf
How to read the dataset:
import csv

with open('trio.sample.vcf') as csv_file:
    # VCF columns are tab-separated
    csv_reader = csv.reader(csv_file, delimiter='\t')
    for row in csv_reader:
        # Skip blank lines and VCF header/metadata lines, which start with '#'
        if not row or row[0].startswith('#'):
            continue
        chromosomeName = row[0]
        snpPosition = row[1]
        snpId = row[2]
        refAllele = row[3]
        altAllele = row[4]
        print(chromosomeName + ", " + snpPosition + ", " + snpId + ", " + refAllele + ", " + altAllele)
https://github.com/anuzzolese/genomics-unibo/blob/master/2019-2020/exercises/trio-sample-vcf-reader.py
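To show how the reading loop connects to the per-chromosome counts, here is a deliberately class-free sketch that tallies transitions and transversions in a dictionary. The function name `count_by_chromosome` and the sample rows are made up for illustration and are not taken from trio.sample.vcf. In the actual exercise the same loop would create SNP objects and add them to Chromosome objects instead.

```python
import csv
import io
from collections import defaultdict

# Unordered base pairs that count as transitions.
TRANSITIONS = ({"A", "G"}, {"C", "T"})


def count_by_chromosome(lines):
    """Map chromosome name -> [transitions, transversions]."""
    counts = defaultdict(lambda: [0, 0])
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and header/metadata lines starting with '#'.
        if not row or row[0].startswith("#"):
            continue
        chromosome_name, ref_allele, alt_allele = row[0], row[3], row[4]
        index = 0 if {ref_allele, alt_allele} in TRANSITIONS else 1
        counts[chromosome_name][index] += 1
    return dict(counts)


# Made-up rows for illustration only.
sample = io.StringIO(
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "1\t100\t.\tA\tG\n"
    "1\t200\t.\tC\tA\n"
    "2\t300\t.\tT\tC\n"
)
print(count_by_chromosome(sample))
# {'1': [1, 1], '2': [1, 0]}
```

Because `csv.reader` accepts any iterable of lines, the same function can be called with an open file object, for example the one returned by `open('trio.sample.vcf')`.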