CSCI 490 Bioinformatics Part I: Introduction to Bioinformatics and Molecular Biology



slide-1
SLIDE 1

CSCI 490 Bioinformatics

Part I: Introduction to Bioinformatics and Molecular Biology

slide-2
SLIDE 2

Course description

  • A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint.

  • Prerequisites:

– Programming experience
– Strong background in algorithms and data structures
– Basic understanding of statistics and probability
– Appetite to learn some biology

  • For other information, check the course website
slide-3
SLIDE 3

Why bioinformatics

  • The advance of biomedical experimental technology has resulted in a huge amount of data

– The human genome is “finished”
– Even if it were, that’s only the beginning…

  • The bottleneck is how to integrate and analyze the data

– Noisy
– Diverse

slide-4
SLIDE 4

Growth of GenBank vs Moore’s law

slide-5
SLIDE 5

Genome annotations

  • The process of identifying the locations of genes and coding regions in a genome and determining what those genes do.

  • Finding the structural elements of a genome and attaching the related function to each genomic location.

slide-6
SLIDE 6

Genome annotations

  • Gene structure prediction
  • Identifying elements (introns/exons, coding region, stop codon, start codon) in the genome

  • Gene function prediction
  • Attaching biological information to these elements, e.g. which protein an exon codes for

slide-7
SLIDE 7

Genome annotations

Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006

slide-8
SLIDE 8

What is bioinformatics

  • National Institutes of Health (NIH):

– Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

slide-9
SLIDE 9

What is bioinformatics

  • National Center for Biotechnology Information (NCBI):

– the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

slide-10
SLIDE 10

[Figure: Bioinformatics at the intersection of chemistry, mathematics, statistics, computer science, informatics, physics, medicine, biology, and molecular biology]

slide-11
SLIDE 11

Computer Scientists vs Biologists

(courtesy Serafim Batzoglou, Stanford)

slide-12
SLIDE 12

Biologists vs computer scientists

  • (almost) Everything is true or false in computer science

  • (almost) Nothing is ever true or false in biology

slide-13
SLIDE 13

Biologists vs computer scientists

  • Biologists seek to understand the complicated, messy natural world

  • Computer scientists strive to build their own clean and organized virtual world
slide-14
SLIDE 14

Biologists vs computer scientists

  • Computer scientists are obsessed with being the first to invent or prove something

  • Biologists are obsessed with being the first to discover something

slide-15
SLIDE 15

Some examples of central role of CS in bioinformatics

slide-16
SLIDE 16
  • 1. Genome sequencing

AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT

Figure labels: 3×10^9 nucleotides; ~500 nucleotides

Genome sequencing is figuring out the order of DNA nucleotides, or bases, in a genome: the order of As, Cs, Gs, and Ts that make up an organism's DNA. The human genome is made up of over 3 billion of these genetic letters.

slide-17
SLIDE 17

AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT

3×10^9 nucleotides

Computational Fragment Assembly
Introduced ~1980
1995: assemble up to 1,000,000 long DNA pieces
2000: assemble whole human genome

A big puzzle: ~60 million pieces

  • 1. Genome sequencing
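
The "big puzzle" view of fragment assembly can be illustrated with a toy greedy overlap merger. This is only a sketch under simplifying assumptions (error-free reads, a single strand, no repeats); the function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical, and real assemblers use far more elaborate methods.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap has at least min_len characters.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing overlaps any more; keep separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping toy reads taken from the start of the slide's sequence.
print(greedy_assemble(["AGTAGCAC", "GCACAGAC", "AGACTACG"]))
```

The quadratic pair search is fine for a handful of reads but would not scale to the ~60 million pieces mentioned above, which is one reason assembly is a hard computational problem.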
slide-18
SLIDE 18

Where are the genes?

  • 2. Gene Finding

In humans: ~22,000 genes, ~1.5% of human DNA

slide-19
SLIDE 19
  • 2. Gene Finding

Even in a familiar language it is difficult to pick out the meaning of the passage: The quick brown fox jumped over the lazy dog. The dog lay quietly dreaming of dinner.

And the genome is "written" in a far less familiar language, multiplying the difficulties involved in reading it.

slide-20
SLIDE 20

[Figure: gene structure drawn 5’ to 3’: start codon ATG, Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, stop codon TAG/TGA/TAA, with splice sites at the exon/intron boundaries]

  • 2. Gene Finding

Hidden Markov Models (Well studied for many years in speech recognition)
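
As a much simpler stand-in for the Hidden Markov Model approach named above, the start and stop codon signals alone can be scanned for directly. The sketch below is not an HMM and ignores introns, splice sites, and the reverse strand; the name `find_orfs` and the `min_codons` threshold are hypothetical choices made for illustration.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) indices of ATG...stop stretches on one strand.

    Each of the three reading frames is scanned codon by codon. The slice
    dna[start:end] includes both the start codon and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count the codons before the stop, including ATG itself.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


seq = "CCATGGCTTGTTAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```

In eukaryotic genomes a scan like this finds very little, because coding regions are interrupted by introns; that gap is what probabilistic gene models are meant to fill.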

slide-21
SLIDE 21
  • 3. Protein Folding
  • The amino-acid sequence of a protein determines the 3D fold
  • The 3D fold of a protein determines its function
  • Can we predict the 3D fold of a protein given its amino-acid sequence?

– Holy grail of computational biology, a 40-year-old problem
– Molecular dynamics, computational geometry, machine learning

slide-22
SLIDE 22
  • 4. Sequence Comparison—Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
 || |||||||  ||||x|  ||x|||  |||||
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence Alignment
Introduced ~1970
BLAST: 1990, one of the most cited papers in history
Still a very active area of research

[Figure: a query is searched against a database (DB) with BLAST]
Efficient string matching algorithms
Fast database index techniques
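
One classic way to score an alignment with gaps like the one above is dynamic programming over all prefixes of the two sequences (the Needleman-Wunsch global alignment recurrence). The sketch below computes only the optimal score, not the alignment itself, and the scoring values (+1 match, -1 mismatch, -1 gap) are assumptions chosen for illustration; BLAST uses a different, heuristic strategy.

```python
def global_alignment_score(s: str, t: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Optimal global alignment score of s and t, kept one row at a time."""
    # Row 0: aligning the empty prefix of s against t[:j] costs j gaps.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # s[:i] against the empty prefix of t
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = prev[j] + gap        # gap in t
            left = curr[j - 1] + gap  # gap in s
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))
```

The running time is proportional to the product of the two sequence lengths, which is why database-scale searches rely on faster string matching and indexing techniques.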

slide-23
SLIDE 23

“…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).”

Lipman & Pearson, 1985

Database size today (2007): 10^12 (a 2-million-fold increase). BLAST search: 1.5 minutes
slide-24
SLIDE 24
  • 5. Microarray data analysis

Example: Clinical prediction of leukemia type

  • 2 types of leukemia

– Acute lymphoid (ALL)
– Acute myeloid (AML)

  • Different treatments & outcomes
  • Predict type before treatment?

Bone marrow samples: ALL vs AML
Measure the expression level of each gene

slide-25
SLIDE 25
slide-26
SLIDE 26

Some goals of biology for the next 50 years

  • List all molecular parts that build an organism

– Genes, proteins, other functional parts

  • Understand the function of each part
  • Understand how parts interact physically and functionally
  • Study how function has evolved across all species
  • Find genetic defects that cause diseases
  • Design drugs rationally
  • Sequence the genome of every human, use it for personalized medicine

  • Bioinformatics is an essential component for all the goals above

slide-27
SLIDE 27

A short introduction to molecular biology

slide-28
SLIDE 28

Life

  • Two main categories:

– Prokaryotes (e.g. bacteria)

  • Unicellular
  • No nucleus

– Eukaryotes (e.g. fungi, plants, animals)

  • Unicellular or multicellular
  • Has nucleus
slide-29
SLIDE 29

Life

slide-30
SLIDE 30

Prokaryote vs Eukaryote

  • Eukaryotes have many membrane-bound compartments inside the cell

– Different biological processes occur at different cellular locations

slide-31
SLIDE 31

Organism, Organ, Cell

Organism

slide-32
SLIDE 32

Chemical contents of cell

  • Water
  • Macromolecules (polymers): “strings” made by linking monomers from a specified set (alphabet)

– Protein
– DNA
– RNA
– …

  • Small molecules

– Sugar
– Ions (Na+, K+, Ca2+, Cl-, …)
– Hormone
– …

slide-33
SLIDE 33

DNA

  • DNA: forms the genetic material of all living organisms

– Can be replicated and passed to descendants
– Contains information to produce proteins

  • To computer scientists, DNA is a string made from the alphabet {A, C, G, T}

– e.g. ACAGAACGTAGTGCCGTGAGCG

  • Each letter is a nucleotide
  • Length varies from hundreds to billions
slide-34
SLIDE 34

RNA

  • Historically thought to be mainly an information carrier

– DNA => RNA => Protein
– Very important new roles have been found recently

  • To computer scientists, RNA is a string made from the alphabet {A, C, G, U}

– e.g. ACAGAACGUAGUGCCGUGAGCG

  • Each letter is a nucleotide
  • Length varies from tens to thousands
slide-35
SLIDE 35

Protein

  • Protein: the actual “worker” for almost all processes in the cell

– Enzymes: speed up reactions
– Signaling: information transduction
– Structural support
– Production of other macromolecules
– Transport

  • To computer scientists, protein is a string made from an alphabet of 20 letters

– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP

  • Each letter is called an amino acid
  • Length varies from tens to thousands
slide-36
SLIDE 36

DNA/RNA zoom-in

  • Commonly referred to as Nucleic Acid
  • DNA: Deoxyribonucleic acid
  • RNA: Ribonucleic acid
  • Found mainly in the nucleus of a cell (hence “nucleic”)

  • Contain phosphoric acid as a component (hence “acid”)

  • They are made up of a string of nucleotides
slide-37
SLIDE 37

Nucleotides

  • A nucleotide has 3 components

– Sugar ring (ribose in RNA, deoxyribose in DNA)
– Phosphoric acid
– Nitrogen base

  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T) in DNA and Uracil (U) in RNA
slide-38
SLIDE 38

Units of RNA: ribo-nucleotide

  • A ribonucleotide has 3 components

– Sugar: ribose
– Phosphate group
– Nitrogen base

  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Uracil (U)
slide-39
SLIDE 39

Units of DNA: deoxy-ribo-nucleotide

  • A deoxyribonucleotide has 3 components

– Sugar: deoxyribose
– Phosphate group
– Nitrogen base

  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T)
slide-40
SLIDE 40

Polymerization: Nucleotides => nucleic acids

[Figure: chain of three nucleotides, each made of a phosphate, a sugar, and a nitrogen base]

slide-41
SLIDE 41

[Figure: DNA strand drawn as a chain of phosphate, sugar, and base units, with the sugar carbons numbered 1 to 5; the free phosphate marks the 5 prime end, opposite the 3 prime end]

5’-AGCGACTG-3’, or simply AGCGACTG

Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc.

slide-42
SLIDE 42

[Figure: RNA strand drawn as a chain of phosphate, sugar, and base units; the free phosphate marks the 5 prime end, opposite the 3 prime end]

5’-AGUGACUG-3’, or simply AGUGACUG

Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.

slide-43
SLIDE 43

[Figure: two paired DNA strands running in opposite 5’ to 3’ directions]

Base pairs: A = T, G = C

5’-AGCGACTG-3’   Forward (+) strand
3’-TCGCTGAC-5’   Backward (-) strand

Or simply AGCGACTG over TCGCTGAC

DNA usually exists in pairs. One strand is said to be reverse-complementary to the other.

slide-44
SLIDE 44

DNA double helix

G-C pair is stronger than A-T pair
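
A G-C pair is held together by three hydrogen bonds and an A-T pair by two, so the fraction of G and C in a sequence gives a rough indication of how tightly its two strands bind. The helper names `gc_content` and `hydrogen_bonds` below are hypothetical, and the sketch assumes an uppercase string over {A, C, G, T}.

```python
def gc_content(dna: str) -> float:
    """Fraction of bases in the strand that are G or C."""
    if not dna:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in dna) / len(dna)


def hydrogen_bonds(dna: str) -> int:
    """Hydrogen bonds between this strand and its complement.

    Counts 3 per G-C pair and 2 per A-T pair.
    """
    gc = sum(base in "GC" for base in dna)
    return 3 * gc + 2 * (len(dna) - gc)


print(gc_content("AGCGACTG"), hydrogen_bonds("AGCGACTG"))
```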

slide-45
SLIDE 45

Reverse-complementary sequences

  • 5’-ACGTTACAGTA-3’
  • The reverse complement is:

3’-TGCAATGTCAT-5’ => 5’-TACTGTAACGT-3’

  • Or simply written as

TACTGTAACGT
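
The reverse complement above can be computed in two steps: complement each base, then reverse the string so it again reads 5’ to 3’. A minimal sketch, assuming the input contains only A, C, G, and T (the name `reverse_complement` is chosen here for illustration):

```python
# Translation table mapping each base to its pairing partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string, written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ACGTTACAGTA"))  # the slide's example
```

Applying the function twice returns the original sequence, which is a quick sanity check for any implementation.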

slide-46
SLIDE 46

Orientation of the double helix

  • Double helix is anti-parallel

– 5’ end of one strand pairs with 3’ end of the other
– 5’ to 3’ motion in one strand is 3’ to 5’ in the other

  • Double helix has no orientation

– Biology has no “forward” and “reverse” strand
– Relative to any single strand, there is a “reverse complement” or “reverse strand”
– Information can be encoded by either strand or both strands

5’ TTTTACAGGACCATG 3’
3’ AAAATGTCCTGGTAC 5’

slide-47
SLIDE 47

RNA

  • RNAs are normally single-stranded

  • Form complex structures by self-base-pairing

  • A=U, C=G
  • Can also form RNA-DNA and RNA-RNA double strands.

– A=T/U, C=G

slide-48
SLIDE 48

Protein zoom-in

  • Protein is the actual “worker” for almost all processes in the cell

  • A string built from 20 kinds of characters

– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH

  • Each letter is called an amino acid

Generic chemical form of an amino acid (amino group H2N, carboxyl group COOH, side chain R):

     R
     |
H2N--C--COOH
     |
     H

slide-49
SLIDE 49
Units of Protein: Amino acid

  • 20 amino acids, which differ only in their side chains

– Each can be expressed by three letters
– Or a single letter: A-Y, except B, J, O, U, X, Z
– Alanine = Ala = A
– Histidine = His = H

slide-50
SLIDE 50

Amino acids => peptide

     R                R
     |                |
H2N--C--COOH  +  H2N--C--COOH
     |                |
     H                H

=>

     R          R
     |          |
H2N--C--CO--NH--C--COOH
     |          |
     H          H

Peptide bond: the CO--NH link

slide-51
SLIDE 51

Protein

  • Has orientations
  • Usually recorded from N-terminal to C-terminal
  • Peptide vs protein: basically the same thing
  • Conventions

– Peptide is shorter (< 50aa), while protein is longer
– Peptide refers to the sequence, while protein has 2D/3D structure

[Figure: chain of residues (R) running from the N-terminal (H2N) to the C-terminal (COOH)]

slide-52
SLIDE 52

Protein structure

  • Linear sequence of amino acids folds to form a complex 3-D structure.

  • The structure of a protein is intimately connected to its function.

slide-53
SLIDE 53

Genome and chromosome

  • Genome: the complete DNA sequences in the cell of an organism

– May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes

  • Chromosome: a single large DNA molecule in the cell

– May be circular or linear
– Contains genes as well as “junk DNA”
– Highly packed!

slide-54
SLIDE 54

Gene

  • Gene: unit of heredity in living organisms

– A segment of DNA with information to make a protein or a functional RNA

slide-55
SLIDE 55

Some statistics

Organism           Chromosomes   Bases         Genes
Human              46            3 billion     20k-25k
Dog                78            2.4 billion   ~20k
Corn               20            2.5 billion   50-60k
Yeast              16            20 million    ~7k
E. coli            1             4 million     ~4k
Marbled lungfish   ?             130 billion   ?

slide-56
SLIDE 56

Human genome

  • 46 chromosomes: 22 pairs of autosomes + 2 sex chromosomes (X and Y)

In each pair, 1 comes from the mother and 1 from the father

  • Female: X + X
  • Male: X + Y
slide-57
SLIDE 57

Human genome

  • Every cell contains the same genomic information

– Except sperm and egg cells, which only contain half of the genome

  • Otherwise your children would have 46 + 46 chromosomes …

slide-58
SLIDE 58

Cell division: mitosis

  • A cell duplicates its genome and divides into two identical cells

  • These cells build up different parts of your body

slide-59
SLIDE 59

Cell division: meiosis

  • A reproductive cell divides into four cells, each containing only half of the genome

– Diploid => haploid

  • Two haploid cells (sperm + egg) form a zygote

– Which will then develop into a multi-cellular organism by mitosis
slide-60
SLIDE 60

Central dogma of molecular biology

DNA replication is critical in both mitosis and meiosis

slide-61
SLIDE 61

DNA Replication

  • The process of copying a double-stranded DNA molecule

– Semi-conservative

5’-ACATGATAA-3’
3’-TGTACTATT-5’

=>

5’-ACATGATAA-3’       5’-ACATGATAA-3’
3’-TGTACTATT-5’   +   3’-TGTACTATT-5’
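
"Semi-conservative" means that each daughter molecule keeps one strand of the parent and pairs it with one newly synthesized strand. The sketch below models only that bookkeeping, not the enzymatic process; the function name `replicate`, the dictionary layout, and the "parental"/"new" labels are hypothetical. The bottom strand is written 3’ to 5’ so that it lines up under the top strand, as in the example above.

```python
PAIR = str.maketrans("ACGT", "TGCA")


def replicate(top: str, bottom: str) -> list[dict[str, tuple[str, str]]]:
    """Return the two daughter duplexes of a parent duplex.

    top is written 5'->3' and bottom 3'->5'. Each strand in the result is
    tagged "parental" or "new".
    """
    new_bottom = top.translate(PAIR)  # synthesized against the parental top strand
    new_top = bottom.translate(PAIR)  # synthesized against the parental bottom strand
    return [
        {"top": (top, "parental"), "bottom": (new_bottom, "new")},
        {"top": (new_top, "new"), "bottom": (bottom, "parental")},
    ]


for daughter in replicate("ACATGATAA", "TGTACTATT"):
    print(daughter)
```

Both daughters carry the same sequence as the parent, but each contains exactly one old strand.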

slide-62
SLIDE 62
  • Mutation: changes in DNA base-pairs
  • Proofreading and error-correcting mechanisms exist to ensure extremely high fidelity

[Figure: a nucleotide triphosphate (dNTP) with its three phosphate groups (p p p)]

slide-63
SLIDE 63

Central dogma of molecular biology

slide-64
SLIDE 64

Transcription

  • The process by which a DNA sequence is copied to produce a complementary RNA

– Called messenger RNA (mRNA) if the RNA carries instructions on how to make a protein
– Called non-coding RNA if the RNA does not carry instructions on how to make a protein
– Only consider mRNA for now

  • Similar to replication, but

– Only one strand is copied

slide-65
SLIDE 65

Transcription

Coding strand (where genetic information is stored):
5’-ACGTAGACGTATAGAGCCTAG-3’

Template strand (for making mRNA):
3’-TGCATCTGCATATCTCGGATC-5’

mRNA:
5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.

DNA-RNA pairs: A=U, C=G, T=A, G=C
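
The relationship between the three sequences above can be checked with two small functions: one derives the mRNA from the coding strand by replacing T with U, the other builds it from the template strand using the DNA-RNA pairing rules. The function names are hypothetical, and the template is assumed to be written 3’ to 5’ as on the slide.

```python
# DNA base on the template -> RNA base paired with it (A=U, C=G, G=C, T=A).
DNA_TO_RNA_PAIR = str.maketrans("ACGT", "UGCA")


def transcribe(coding_strand: str) -> str:
    """mRNA (5'->3') from the coding strand (5'->3'): replace T with U."""
    return coding_strand.upper().replace("T", "U")


def transcribe_from_template(template_3to5: str) -> str:
    """mRNA (5'->3') built by pairing against a template written 3'->5'."""
    return template_3to5.upper().translate(DNA_TO_RNA_PAIR)


print(transcribe("ACGTAGACGTATAGAGCCTAG"))
print(transcribe_from_template("TGCATCTGCATATCTCGGATC"))
```

Both calls print the same mRNA, which reflects the statement that the coding strand and the mRNA differ only in T versus U.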

slide-66
SLIDE 66

Translation

  • The process of making proteins from mRNA
  • A gene uniquely encodes a protein
  • There are four bases in DNA (A, C, G, T), and four in RNA (A, C, G, U), but 20 amino acids in protein

  • How many nucleotides are required to encode an amino acid in order to ensure correct translation?

– 4^1 = 4 (too few for 20 amino acids)
– 4^2 = 16 (still too few)
– 4^3 = 64 (enough)

  • The actual genetic code used by the cell is a triplet.

– Each triplet is called a codon
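
The counting argument above amounts to finding the smallest n with 4^n >= 20. A tiny sketch (the function name `min_codon_length` is hypothetical):

```python
def min_codon_length(alphabet_size: int, n_symbols: int) -> int:
    """Smallest word length n such that alphabet_size ** n >= n_symbols."""
    n = 1
    while alphabet_size ** n < n_symbols:
        n += 1
    return n


print(min_codon_length(4, 20))
```

With triplets there are 64 codons for 20 amino acids, so several codons can map to the same amino acid.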

slide-67
SLIDE 67

The Genetic Code

[Figure: the genetic code table, indexed by the first, second, and third letter of each codon]

slide-68
SLIDE 68

Translation

  • The sequence of codons is translated to a sequence of amino acids

  • Gene: -GCT TGT TTA CGA ATT-
  • mRNA: -GCU UGU UUA CGA AUU-
  • Peptide: - Ala - Cys - Leu - Arg - Ile -
  • Start codon: AUG

– Also codes for Met

  • Stop codons: UGA, UAA, UAG
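
The codon-by-codon reading above can be sketched in a few lines. The table is the standard genetic code written compactly, with bases ordered U, C, A, G and `*` marking a stop codon. The names `CODON_TABLE` and `translate` are chosen for illustration, and the sketch assumes translation starts at the first base of the input and uses one-letter amino acid codes (Ala = A, Cys = C, Leu = L, Arg = R, Ile = I).

```python
from itertools import product

BASES = "UCAG"
# One letter per codon, in the order UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon, stopping at a stop codon."""
    mrna = mrna.upper().replace(" ", "")
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # UGA, UAA, or UAG
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GCU UGU UUA CGA AUU"))  # the slide's mRNA example
```

A fuller version would first locate the start codon AUG; here the reading frame is simply taken from the first base.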

slide-69
SLIDE 69

Summary

  • DNA: a string made from {A, C, G, T}

– Forms the basis of genes
– Has 5’ and 3’
– Normally forms double-strand by reverse complement

  • RNA: a string made from {A, C, G, U}

– mRNA: messenger RNA
– tRNA: transfer RNA
– Other types of RNA: rRNA, miRNA, etc.
– Has 5’ and 3’
– Normally single-stranded, but can form secondary structure

  • Protein: made from 20 kinds of amino acids

– Actual worker in the cell
– Has N-terminal and C-terminal
– Sequence uniquely determined by its gene via the use of codons
– Sequence determines structure, structure determines function

  • Central dogma: DNA transcribes to RNA, RNA translates to Protein

– Both steps are regulated