CSCI 490 Bioinformatics Part I: Introduction to Bioinformatics and Molecular Biology
Course description
- A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint.
- Prerequisite:
– Programming experience
– Strong background in algorithms and data structures
– Basic understanding of statistics and probability
– Appetite to learn some biology
- For other information, check course website
Why bioinformatics
- The advance of biomedical experimental technology has resulted in a huge amount of data
– The human genome is “finished”
– Even if it were, that’s only the beginning…
- The bottleneck is how to integrate and analyze the data
– Noisy
– Diverse
Growth of GenBank vs Moore’s law
Genome annotations
- The process of identifying the locations of genes and coding regions in a genome and determining what those genes do.
- Finding the structural elements and attaching their related functions to each genome location.
Genome annotations
- Gene structure prediction
– Identifying elements (introns/exons, coding region, stop codon, start codon) in the genome
- Gene function prediction
– Attaching biological information to these elements, e.g. which protein an exon codes for
Genome annotations
Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006
What is bioinformatics
- National Institutes of Health (NIH):
– Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
What is bioinformatics
- National Center for Biotechnology
Information (NCBI):
– the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.
[Figure: Bioinformatics at the intersection of Chemistry, Mathematics, Statistics, Computer Science, Informatics, Physics, Medicine, Biology, and Molecular Biology]
Computer Scientists vs Biologists
(courtesy Serafim Batzoglou, Stanford)
Biologists vs computer scientists
- (almost) Everything is true or false in
computer science
- (almost) Nothing is ever true or false in
Biology
Biologists vs computer scientists
- Biologists seek to understand the
complicated, messy natural world
- Computer scientists strive to build their own clean and organized virtual world
Biologists vs computer scientists
- Computer scientists are obsessed with
being the first to invent or prove something
- Biologists are obsessed with being the first
to discover something
Some examples of central role of CS in bioinformatics
- 1. Genome sequencing
Genome sequencing is figuring out the order of DNA nucleotides, or bases, in a genome: the order of As, Cs, Gs, and Ts that make up an organism's DNA. The human genome is made up of over 3 billion of these genetic letters.
Genome: 3x10^9 nucleotides; each sequenced piece: ~500 nucleotides
AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT
Computational Fragment Assembly
– Introduced ~1980
– 1995: assemble up to 1,000,000 long DNA pieces
– 2000: assemble whole human genome
– A big puzzle: ~60 million pieces
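To make the "big puzzle" idea concrete, here is a minimal Python sketch of fragment assembly by greedy overlap merging. This is a toy illustration under strong assumptions (error-free reads, no repeats, a single strand), not the method used by real assemblers. The function names `overlap` and `greedy_assemble` and the example reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap.

    Toy assumption: reads are error-free and come from one strand.
    Returns the list of contigs left when no overlaps remain.
    """
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


# Hypothetical overlapping reads taken from one short sequence
reads = ["AGTAGCACAG", "CACAGACTACG", "CTACGACGAGACG"]
print(greedy_assemble(reads))
```

The quadratic pair search is fine for three reads but would not scale to millions of pieces, which is part of why assembly is a computer science problem.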
Where are the genes?
- 2. Gene Finding
In humans: ~22,000 genes ~1.5% of human DNA
- 2. Gene Finding
Even in a familiar language it is difficult to pick out the meaning of the passage: The quick brown fox jumped over the lazy dog. The dog lay quietly dreaming of dinner.
And the genome is "written" in a far less familiar language, multiplying the difficulties involved in reading it.
[Figure: gene structure from 5’ to 3’: start codon (ATG), Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, stop codon (TAG/TGA/TAA); splice sites at the exon/intron boundaries]
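As a first approximation to gene finding, one can scan for stretches that begin with the start codon and end at a stop codon. The Python sketch below does this for the three forward reading frames only. It is a deliberately simplified assumption: it ignores introns, splice sites, and the reverse strand, and is not how real gene finders work. The name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end) index pairs of ATG...stop stretches.

    Scans the three forward reading frames. `end` is exclusive and
    includes the stop codon. Stretches with fewer than `min_codons`
    codons before the stop are skipped.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


# Hypothetical toy sequence with one open reading frame
seq = "CCATGGCTTGTTAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```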
- 2. Gene Finding
Hidden Markov Models (Well studied for many years in speech recognition)
- 3. Protein Folding
- The amino-acid sequence of a protein determines the 3D fold
- The 3D fold of a protein determines its function
- Can we predict the 3D fold of a protein given its amino-acid sequence?
– Holy grail of computational biology: a 40-year-old problem
– Molecular dynamics, computational geometry, machine learning
- 4. Sequence Comparison—Alignment
Unaligned sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Aligned (| = match, x = mismatch, - = gap):
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  ||x|||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Sequence Alignment
– Introduced ~1970
– BLAST: 1990, one of the most cited papers in history
– Still very active area of research
BLAST (query vs. DB)
– Efficient string matching algorithms
– Fast database index techniques
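The slides do not name a specific alignment algorithm. As an assumption, the Python sketch below uses the classic dynamic-programming global alignment of Needleman and Wunsch (1970). The scoring values (match +1, mismatch -1, gap -1) are arbitrary illustrative choices. BLAST itself is a faster heuristic and works differently.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    The scoring parameters are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


s, x, y = needleman_wunsch("ACGT", "AGT")
print(s)
print(x)
print(y)
```

The table has (n+1) x (m+1) cells, so time and memory grow with the product of the two lengths. That cost is what motivates indexed, heuristic search when the target is a whole database.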
…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).
Lipman & Pearson, 1985
Database size today (2007): 10^12 (a 2-million-fold increase). BLAST search: 1.5 minutes
- 5. Microarray data analysis
Example: Clinical prediction of Leukemia type
- 2 types of leukemia
– Acute lymphoid (ALL)
– Acute myeloid (AML)
- Different treatments & outcomes
- Predict type before treatment?
Bone marrow samples: ALL vs AML. Measure the expression level of each gene
Some goals of biology for the next 50 years
- List all molecular parts that build an organism
– Genes, proteins, other functional parts
- Understand the function of each part
- Understand how parts interact physically and functionally
- Study how function has evolved across all species
- Find genetic defects that cause diseases
- Design drugs rationally
- Sequence the genome of every human, use it for personalized medicine
- Bioinformatics is an essential component for all the goals above
A short introduction to molecular biology
Life
- Two main categories:
– Prokaryotes (e.g. bacteria)
- Unicellular
- No nucleus
– Eukaryotes (e.g. fungi, plant, animal)
- Unicellular or multicellular
- Has nucleus
Life
Prokaryote vs Eukaryote
- Eukaryotes have many membrane-bound compartments inside the cell
– Different biological processes occur at different cellular locations
Organism, Organ, Cell
Organism
Chemical contents of cell
- Water
- Macromolecules (polymers): “strings” made by linking monomers from a specified set (alphabet)
– Protein
– DNA
– RNA
– …
- Small molecules
– Sugar
– Ions (Na+, K+, Ca2+, Cl-, …)
– Hormone
– …
DNA
- DNA: forms the genetic material of all living organisms
– Can be replicated and passed to descendants
– Contains information to produce proteins
- To computer scientists, DNA is a string made from alphabet {A, C, G, T}
– e.g. ACAGAACGTAGTGCCGTGAGCG
- Each letter is a nucleotide
- Length varies from hundreds to billions
RNA
- Historically thought to be mainly an information carrier
– DNA => RNA => Protein
– Very important new roles have been found recently
- To computer scientists, RNA is a string made from alphabet {A, C, G, U}
– e.g. ACAGAACGUAGUGCCGUGAGCG
- Each letter is a nucleotide
- Length varies from tens to thousands
Protein
- Protein: the actual “worker” for almost all processes in the cell
– Enzymes: speed up reactions
– Signaling: information transduction
– Structural support
– Production of other macromolecules
– Transport
- To computer scientists, protein is a string made from an alphabet of 20 letters
– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP
- Each letter is called an amino acid
- Length varies from tens to thousands
DNA/RNA zoom-in
- Commonly referred to as Nucleic Acid
- DNA: Deoxyribonucleic acid
- RNA: Ribonucleic acid
- Found mainly in the nucleus of a cell (hence “nucleic”)
- Contain phosphoric acid as a component (hence “acid”)
- They are made up of a string of nucleotides
Nucleotides
- A nucleotide has 3 components
– Sugar ring (ribose in RNA, deoxyribose in DNA)
– Phosphoric acid
– Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Thymine (T) in DNA and Uracil (U) in RNA
Units of RNA: ribo-nucleotide
- A ribonucleotide has 3 components
– Sugar: ribose
– Phosphate group
– Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Uracil (U)
Units of DNA: deoxy-ribo-nucleotide
- A deoxyribonucleotide has 3 components
– Sugar: deoxyribose
– Phosphate group
– Nitrogen base: Adenine (A), Guanine (G), Cytosine (C), Thymine (T)
Polymerization: Nucleotides => nucleic acids
[Figure: nucleotides, each made of a phosphate, a sugar, and a nitrogen base, linked into a chain; sugar carbons numbered 1 to 5]
DNA: 5’-AGCGACTG-3’, or simply AGCGACTG
– Free phosphate at the 5’ (5 prime) end; the other end is the 3’ (3 prime) end
– Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc.
RNA: 5’-AGUGACUG-3’, or simply AGUGACUG
– Free phosphate at the 5’ (5 prime) end; the other end is the 3’ (3 prime) end
– Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.
Base pairs: A = T, G = C
5’-AGCGACTG-3’ Forward (+) strand
3’-TCGCTGAC-5’ Backward (-) strand
DNA usually exists in pairs. One strand is said to be reverse-complementary to the other.
DNA double helix
G-C pair is stronger than A-T pair
Reverse-complementary sequences
- 5’-ACGTTACAGTA-3’
- The reverse complement is:
3’-TGCAATGTCAT-5’ => 5’-TACTGTAACGT-3’
- Or simply written as
TACTGTAACGT
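The reverse complement is easy to compute once DNA is treated as a string. A minimal Python sketch, assuming the input contains only the letters A, C, G, T (the function name `reverse_complement` is illustrative):

```python
# Translation table mapping each base to its pairing partner (A=T, G=C)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Complement every base, then reverse so the result reads 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # the example from the slide
```

Applying the function twice returns the original sequence, which matches the point that each strand is the reverse complement of the other.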
Orientation of the double helix
- Double helix is anti-parallel
– 5’ end of one strand pairs with 3’ end of the other
– 5’ to 3’ motion in one strand is 3’ to 5’ in the other
- Double helix has no orientation
– Biology has no “forward” and “reverse” strand
– Relative to any single strand, there is a “reverse complement” or “reverse strand”
– Information can be encoded by either strand or both strands
5’-TTTTACAGGACCATG-3’
3’-AAAATGTCCTGGTAC-5’
RNA
- RNAs are normally single-stranded
- Form complex structure by self-base-pairing
- A=U, C=G
- Can also form RNA-DNA and RNA-RNA double strands.
– A=T/U, C=G
Protein zoom-in
Generic chemical form of amino acid: amino group, carboxyl group, side chain
- Protein is the actual “worker” for almost all processes in the cell
- A string built from 20 kinds of characters
– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH
- Each letter is called an amino acid
Generic structure: H2N--CH(R)--COOH, where R is the side chain
- 20 amino acids, only differ at side chains
– Each can be expressed by three letters
– Or a single letter: A-Y, except B, J, O, U, X, Z
– Alanine = Ala = A
– Histidine = His = H
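The single-letter rule above can be checked directly: taking the letters A to Y and removing the excluded ones leaves exactly 20. A small Python sketch, where the names `AMINO_ACIDS` and `is_protein` are illustrative assumptions:

```python
import string

# Letters that are not used for the 20 standard amino acids
EXCLUDED = set("BJOUXZ")

# A to Y (the first 25 uppercase letters) minus the excluded ones
AMINO_ACIDS = [c for c in string.ascii_uppercase[:25] if c not in EXCLUDED]


def is_protein(seq: str) -> bool:
    """True if every character is one of the 20 single-letter codes."""
    return all(c in AMINO_ACIDS for c in seq.upper())


print(len(AMINO_ACIDS), "".join(AMINO_ACIDS))
print(is_protein("MGDVEKGKKIFIMKCSQCHTVEKGGKH"))
```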
Units of Protein: Amino acid
Amino acids => peptide
H2N--CH(R)--COOH + H2N--CH(R)--COOH => H2N--CH(R)--CO--NH--CH(R)--COOH
Peptide bond: the CO--NH link between the two amino acids
Protein
- Has orientations
- Usually recorded from N-terminal to C-terminal
- Peptide vs protein: basically the same thing
- Conventions
– Peptide is shorter (< 50aa), while protein is longer
– Peptide refers to the sequence, while protein has 2D/3D structure
[Figure: chain of side chains (R) running from H2N at the N-terminal to COOH at the C-terminal]
Protein structure
- Linear sequence of amino acids folds to form a complex 3-D structure.
- The structure of a protein is intimately connected to its function.
Genome and chromosome
- Genome: the complete DNA sequences in the cell of an organism
– May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes
- Chromosome: a single large DNA molecule in the cell
– May be circular or linear
– Contain genes as well as “junk DNAs”
– Highly packed!
Gene
- Gene: unit of heredity in living organisms
– A segment of DNA with information to make a protein or a functional RNA
Some statistics
Organism          Chromosomes  Bases         Genes
Human             46           3 billion     20k-25k
Dog               78           2.4 billion   ~20k
Corn              20           2.5 billion   50-60k
Yeast             16           20 million    ~7k
E. coli           1            4 million     ~4k
Marbled lungfish  ?            130 billion   ?
Human genome
- 46 chromosomes: 22 pairs + X + Y
1 from mother, 1 from father
- Female: X + X
- Male: X + Y
Human genome
- Every cell contains the same genomic information
– Except sperm and eggs, which only contain half of the genome
- Otherwise your children would have 46 + 46 chromosomes …
Cell division: mitosis
- A cell duplicates its genome and divides into two identical cells
- These cells build up different parts of your body
Cell division: meiosis
- A reproductive cell divides into four cells, each containing only half of the genome
– Diploid => haploid
- Two haploid cells (sperm + egg) form a zygote
– Which will then develop into a multi-cellular organism by mitosis
Central dogma of molecular biology
DNA replication is critical in both mitosis and meiosis
DNA Replication
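- The process of copying a double-stranded DNA molecule
– Semi-conservative: each daughter molecule keeps one parental strand and gains one newly synthesized strand
Parent: 5’-ACATGATAA-3’ / 3’-TGTACTATT-5’
Daughter 1: 5’-ACATGATAA-3’ / 3’-TGTACTATT-5’
Daughter 2: 5’-ACATGATAA-3’ / 3’-TGTACTATT-5’
- Mutation: changes in DNA base-pairs
- Proofreading and error-correcting mechanisms exist to ensure extremely high fidelity
[Figure: nucleotide triphosphate (dNTP), with its three phosphate groups labeled p p p]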
Central dogma of molecular biology
Transcription
- The process by which a DNA sequence is copied to produce a complementary RNA
– Called messenger RNA (mRNA) if the RNA carries instructions on how to make a protein
– Called non-coding RNA if the RNA does not carry instructions on how to make a protein
– Only consider mRNA for now
- Similar to replication, but
– Only one strand is copied
Transcription
Coding strand (where genetic information is stored): 5’-ACGTAGACGTATAGAGCCTAG-3’
Template strand (for making mRNA): 3’-TGCATCTGCATATCTCGGATC-5’
mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’
Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.
DNA-RNA pairs: A=U, C=G, T=A, G=C
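The relationship between the three sequences above can be sketched in Python. This is a string-level illustration only, and the function names are hypothetical. The template strand is returned in 3’ to 5’ order so that it lines up under the coding strand as on the slide.

```python
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
# DNA-RNA pairing: template A pairs with U, T with A, C with G, G with C
DNA_TO_RNA_PAIR = str.maketrans("ACGT", "UGCA")


def transcribe(coding: str) -> str:
    """mRNA from the coding strand: same sequence with T replaced by U."""
    return coding.upper().replace("T", "U")


def template_strand(coding: str) -> str:
    """Template strand written 3' to 5', aligned under the coding strand."""
    return coding.upper().translate(DNA_COMPLEMENT)


def mrna_from_template(template_3to5: str) -> str:
    """mRNA (5' to 3') built by pairing against a template given 3' to 5'."""
    return template_3to5.upper().translate(DNA_TO_RNA_PAIR)


coding = "ACGTAGACGTATAGAGCCTAG"
template = template_strand(coding)
print(template)
print(transcribe(coding))
print(mrna_from_template(template) == transcribe(coding))
```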
Translation
- The process of making proteins from mRNA
- A gene uniquely encodes a protein
- There are four bases in DNA (A, C, G, T), and four in RNA (A, C, G, U), but 20 amino acids in protein
- How many nucleotides are required to encode an amino acid in order to ensure correct translation?
– 4^1 = 4 (too few)
– 4^2 = 16 (still fewer than 20)
– 4^3 = 64 (enough)
- The actual genetic code used by the cell is a triplet code.
– Each triplet is called a codon
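The counting argument above amounts to finding the smallest word length n with 4^n >= 20. A short Python check, where the function name `min_codon_length` is an illustrative assumption:

```python
def min_codon_length(alphabet_size: int = 4, n_amino_acids: int = 20) -> int:
    """Smallest n such that alphabet_size ** n covers all amino acids."""
    n = 1
    while alphabet_size ** n < n_amino_acids:
        n += 1
    return n


for n in (1, 2, 3):
    print(n, 4 ** n)
print(min_codon_length())
```

With 64 codons for 20 amino acids, several codons can map to the same amino acid, and some are left over to serve as stop signals.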
The Genetic Code
Third letter
Translation
- The sequence of codons is translated to a sequence of amino acids
- Gene: -GCT TGT TTA CGA ATT-
- mRNA: -GCU UGU UUA CGA AUU-
- Peptide: -Ala-Cys-Leu-Arg-Ile-
- Start codon: AUG
– Also codes for Met
- Stop codons: UGA, UAA, UAG
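Translation can be sketched as a table lookup over consecutive codons. The Python example below assumes the standard genetic code and assumes the reading frame starts at the first letter of the input, so it does not search for the start codon. It outputs single-letter amino acid codes, with `*` standing for a stop codon in the table. The names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, with codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon until a stop codon.

    Assumes the reading frame starts at index 0. Trailing bases that
    do not fill a whole codon are ignored.
    """
    mrna = mrna.upper()
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":  # stop codon: UGA, UAA, or UAG
            break
        peptide.append(aa)
    return "".join(peptide)


# The slide's example: Ala-Cys-Leu-Arg-Ile in single-letter codes
print(translate("GCUUGUUUACGAAUU"))
```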
Summary
- DNA: a string made from {A, C, G, T}
– Forms the basis of genes
– Has 5’ and 3’ ends
– Normally forms a double strand with its reverse complement
- RNA: a string made from {A, C, G, U}
– mRNA: messenger RNA
– tRNA: transfer RNA
– Other types of RNA: rRNA, miRNA, etc.
– Has 5’ and 3’ ends
– Normally single-stranded, but can form secondary structure
- Protein: made from 20 kinds of amino acids
– Actual worker in the cell
– Has N-terminal and C-terminal
– Sequence uniquely determined by its gene via the use of codons
– Sequence determines structure, structure determines function
- Central dogma: DNA is transcribed to RNA, RNA is translated to protein
– Both steps are regulated