Outline Administravia What is bioinformatics CS 5263 - - PDF document



slide-1
SLIDE 1

1

CS 5263 Bioinformatics

Lectures 1 & 2: Introduction to Bioinformatics and Molecular Biology

Outline

  • Administrivia
  • What is bioinformatics
  • Why bioinformatics
  • Course overview
  • Short introduction to molecular biology

Survey form

  • Your name
  • Email
  • Academic preparation
  • Interests
  • This helps me better design lectures and assignments

Course Info

  • Instructor: Jianhua Ruan

– Office: S.B. 4.01.48
– Phone: 458-6819
– Email: jruan@cs.utsa.edu
– Office hours: MW 2-3pm

  • Web: http://www.cs.utsa.edu/~jruan/teaching/cs5263_fall_2008/

slide-2
SLIDE 2

2

Course description

  • A survey of algorithms and methods in bioinformatics, approached from a computational viewpoint.
  • Prerequisites:

– Programming experience
– Some knowledge of algorithms and data structures
– Basic understanding of statistics and probability
– Appetite to learn some biology

Textbooks

  • An Introduction to Bioinformatics Algorithms, by Jones and Pevzner
  • Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, by Durbin, Eddy, Krogh and Mitchison
  • Additional resources

– Papers
– Handouts
– See course website

Grading

  • Attendance: 10%

– At most 2 classes may be missed without affecting the grade

  • Homework: 50%

– About 5 assignments
– Combination of theoretical and programming exercises
– No exams
– No late submissions accepted
– Read the collaboration policy!

  • Final project and presentation: 40%

Why bioinformatics

  • Advances in experimental technology have generated a huge amount of data

– The human genome is “finished”
– Even if it were, that’s only the beginning…

  • The bottleneck is how to integrate and analyze the data

– Noisy
– Diverse

slide-3
SLIDE 3

3

Growth of GenBank vs Moore’s law

Genome annotations

Meyer, Trends and Tools in Bioinfo and Compt Bio, 2006

What is bioinformatics

  • National Institutes of Health (NIH):

– Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

What is bioinformatics

  • National Center for Biotechnology Information (NCBI):

– The field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned.

slide-4
SLIDE 4

4

What is bioinformatics

  • Wikipedia

– Bioinformatics refers to the creation and advancement of algorithms, computational and statistical techniques, and theory to solve formal and practical problems posed by or inspired from the management and analysis of biological data.

[Figure: bioinformatics at the intersection of chemistry, mathematics, statistics, computer science, informatics, physics, medicine, biology, and molecular biology]

Course objectives

  • Learn the basics of sequence analysis and other computational biology algorithms
  • Become familiar with the research topics in bioinformatics
  • Be able to

– Read and critique bioinformatics research articles
– Identify subareas that best suit your background
– Communicate and exchange ideas with (computational) biologists

What will you learn?

  • Basic concepts in molecular biology and genetics
  • Algorithms to address selected problems in bioinformatics

– Dynamic programming, string algorithms, graph algorithms
– Statistical learning algorithms: HMM, EM, Gibbs sampling
– Data mining: clustering / classification

  • Applications to real data
slide-5
SLIDE 5

5

What will you not learn?

  • Designing / performing biological experiments (duh!)
  • Programming (in Perl, etc.)
  • Building bioinformatics software tools (GUI, database, Web, …)
  • Using existing tools / databases (well, not exactly true)

Covered topics

  • Biology
  • Sequence analysis

– Sequence alignment: pairwise, multiple, global, local, optimal, heuristic
– String matching
– Motif finding

  • Gene prediction
  • RNA structure prediction
  • Phylogenetic tree
  • Functional Genomics

– Microarray data analysis
– Biological networks

8 weeks 5 weeks 1 week

Computer Scientists vs Biologists

(courtesy Serafim Batzoglou, Stanford)

Biologists vs computer scientists

  • (almost) Everything is true or false in computer science
  • (almost) Nothing is ever true or false in biology

slide-6
SLIDE 6

6

Biologists vs computer scientists

  • Biologists seek to understand the complicated, messy natural world
  • Computer scientists strive to build their own clean and organized virtual world

Biologists vs computer scientists

  • Computer scientists are obsessed with being the first to invent or prove something
  • Biologists are obsessed with being the first to discover something

Some examples of central role of CS in bioinformatics

  • 1. Genome sequencing

AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT

Genome: 3×10^9 nucleotides; each fragment: ~500 nucleotides

slide-7
SLIDE 7

7

Computational Fragment Assembly

  • Introduced ~1980
  • 1995: assemble up to 1,000,000 long DNA pieces
  • 2000: assemble whole human genome
  • A big puzzle: ~60 million pieces, 3×10^9 nucleotides

Where are the genes?

  • 2. Gene Finding

In humans: ~22,000 genes, ~1.5% of human DNA

[Figure: gene structure from 5’ to 3’: start codon (ATG), Exon 1, Intron 1, Exon 2, Intron 2, Exon 3, stop codon (TAG / TGA / TAA); splice sites at the exon-intron boundaries]

  • 2. Gene Finding

Hidden Markov Models (Well studied for many years in speech recognition)
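As a rough illustration of the gene structure above (a start codon ATG followed, in the same reading frame, by a stop codon TAG, TGA, or TAA), here is a minimal open-reading-frame scan. It is only a sketch: it is not the HMM approach named on the slide, it looks at one strand only, and it ignores introns and splice sites. The function name `find_orfs`, the `min_codons` parameter, and the example string are made up for illustration.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches on one strand.

    min_codons is the minimum number of codons (including ATG) before the stop.
    """
    orfs = []
    for frame in range(3):                      # three possible reading frames
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3)) # end index includes the stop codon
                start = None                    # look for the next ATG
    return orfs

example = "CCATGGCTTGTTAGCC"                    # hypothetical sequence
for start, end in find_orfs(example):
    print(start, end, example[start:end])
```

Real gene finders have to model exons, introns, and sequence statistics, which is where the probabilistic models come in.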

  • 3. Protein Folding
  • The amino-acid sequence of a protein determines the 3D fold
  • The 3D fold of a protein determines its function
  • Can we predict the 3D fold of a protein given its amino-acid sequence?

– Holy grail of compbio: a 40-year-old problem
– Molecular dynamics, computational geometry, machine learning

slide-8
SLIDE 8

8

  • 4. Sequence Comparison: Alignment

AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

| | | | | | | | | | | | | x | | | | | | | | | | |

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Sequence Alignment

  • Introduced ~1970
  • BLAST: 1990, most cited paper in history
  • Still a very active area of research
  • BLAST (query vs. DB): efficient string matching algorithms, fast database index techniques

“…, comparison of a 200-amino-acid sequence to the 500,000 residues in the National Biomedical Research Foundation library would take less than 2 minutes on a minicomputer, and less than 10 minutes on a microcomputer (IBM PC).”

Lipman & Pearson, 1985

Database size today: 10^12 (a 2-million-fold increase). BLAST search: 1.5 minutes
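To make the alignment idea concrete, here is a minimal sketch of the classic dynamic-programming score for global alignment (Needleman-Wunsch style). It is not BLAST, which is a heuristic built for database search. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of strings a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]                         # a[:i] aligned against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap                   # gap in b
            left = curr[j - 1] + gap             # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("ACGT", "AGT"))
```

The table has (len(a)+1) × (len(b)+1) cells, so the running time grows with the product of the two lengths. That is fine for two short sequences but too slow for scanning a whole database, which is why heuristic tools exist.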
  • 5. Microarray analysis

Clinical prediction of Leukemia type

  • 2 types

– Acute lymphoid (ALL)
– Acute myeloid (AML)

  • Different treatments & outcomes
  • Predict type before treatment?

Bone marrow samples: ALL vs AML
Measure amount of each gene

Some goals of biology for the next 50 years

  • List all molecular parts that build an organism

– Genes, proteins, other functional parts

  • Understand the function of each part
  • Understand how parts interact physically and functionally
  • Study how function has evolved across all species
  • Find genetic defects that cause diseases
  • Design drugs rationally
  • Sequence the genome of every human, use it for personalized medicine
  • Bioinformatics is an essential component for all the goals above

slide-9
SLIDE 9

9

A short introduction to molecular biology

Life

  • Two categories:

– Prokaryotes (e.g. bacteria)

  • Unicellular
  • No nucleus

– Eukaryotes (e.g. fungi, plant, animal)

  • Unicellular or multicellular
  • Has nucleus

Prokaryote vs Eukaryote

  • Eukaryotes have many membrane-bound compartments inside the cell

– Different biological processes occur at different cellular locations

Organism, Organ, Cell

[Figure: organism, organ, cell]

slide-10
SLIDE 10

10

Chemical contents of cell

  • Water
  • Macromolecules (polymers): “strings” made by linking monomers from a specified set (alphabet)

– Protein
– DNA
– RNA
– …

  • Small molecules

– Sugar
– Ions (Na+, K+, Ca2+, Cl-, …)
– Hormone
– …

DNA

  • DNA: forms the genetic material of all living organisms

– Can be replicated and passed to descendants
– Contains information to produce proteins

  • To computer scientists, DNA is a string made from alphabet {A, C, G, T}

– e.g. ACAGAACGTAGTGCCGTGAGCG

  • Each letter is a nucleotide
  • Length varies from hundreds to billions
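Treating DNA as a string over {A, C, G, T} makes basic checks one-liners. The sketch below validates a sequence against the alphabet and counts its bases, using the example string from the slide; the helper names are made up for illustration.

```python
from collections import Counter

DNA_ALPHABET = set("ACGT")

def is_dna(seq):
    """True if every character of seq is one of A, C, G, T."""
    return set(seq) <= DNA_ALPHABET

def base_counts(seq):
    """Number of occurrences of each nucleotide letter in seq."""
    return Counter(seq)

example = "ACAGAACGTAGTGCCGTGAGCG"
print(is_dna(example), len(example), dict(base_counts(example)))
```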

RNA

  • Historically thought to be an information carrier only

– DNA => RNA => Protein
– New roles have since been found for RNA

  • To computer scientists, RNA is a string made from alphabet {A, C, G, U}

– e.g. ACAGAACGUAGUGCCGUGAGCG

  • Each letter is a nucleotide
  • Length varies from tens to thousands

Protein

  • Protein: the actual “worker” for almost all processes in the cell

– Enzymes: speed up reactions
– Signaling: information transduction
– Structural support
– Production of other macromolecules
– Transport

  • To computer scientists, protein is a string made from 20 kinds of characters

– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGP

  • Each letter is called an amino acid
  • Length varies from tens to thousands
slide-11
SLIDE 11

11

DNA/RNA zoom-in

  • Commonly referred to as Nucleic Acid
  • DNA: Deoxyribonucleic acid
  • RNA: Ribonucleic acid
  • Found mainly in the nucleus of a cell (hence “nucleic”)
  • Contain phosphoric acid as a component (hence “acid”)

  • They are made up of a string of nucleotides

Nucleotides

  • A nucleotide has 3 components

– Sugar ring (ribose in RNA, deoxyribose in DNA)
– Phosphoric acid
– Nitrogen base

  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T) or Uracil (U)

Monomers of RNA: ribo-nucleotide

  • A ribonucleotide has 3 components

– Sugar: ribose
– Phosphate group
– Nitrogen base

  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Uracil (U)

Monomers of DNA: deoxy-ribo-nucleotide

  • A deoxyribonucleotide has 3 components

– Sugar: deoxyribose
– Phosphate group
– Nitrogen base

  • Adenine (A)
  • Guanine (G)
  • Cytosine (C)
  • Thymine (T)
slide-12
SLIDE 12

12

Polymerization: Nucleotides => nucleic acids

[Figure: nucleotides, each made of a phosphate, a sugar, and a nitrogen base, linked into a chain; the sugar carbons are numbered 1 to 5]

DNA

5’-AGCGACTG-3’, or simply AGCGACTG

  • Free phosphate at the 5’ (5 prime) end; the other end is the 3’ (3 prime) end
  • Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. DNA replication, transcription, etc.

RNA

5’-AGUGACUG-3’, or simply AGUGACUG

  • Free phosphate at the 5’ (5 prime) end; the other end is the 3’ (3 prime) end
  • Often recorded from 5’ to 3’, which is the direction of many biological processes, e.g. translation.

5’-AGCGACTG-3’   Forward (+) strand
3’-TCGCTGAC-5’   Backward (-) strand

  • Base-pair: A = T, G = C
  • One strand is said to be reverse-complementary to the other
  • DNA usually exists in pairs.

slide-13
SLIDE 13

13

DNA double helix

G-C pair is stronger than A-T pair

Reverse-complementary sequences

  • 5’-ACGTTACAGTA-3’
  • The reverse complement is:

3’-TGCAATGTCAT-5’ => 5’-TACTGTAACGT-3’

  • Or simply written as

TACTGTAACGT
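The reverse complement above is easy to compute: complement each base (A with T, C with G), then reverse the result so it reads 5’ to 3’. A minimal sketch, with a function name chosen for illustration:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string written 5' to 3'."""
    # translate() swaps each base for its partner; [::-1] reverses the string
    return dna.translate(COMPLEMENT)[::-1]

print(reverse_complement("ACGTTACAGTA"))
```

Applying the function twice returns the original sequence, which matches the point that either strand can serve as the reference.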

Orientation of the double helix

  • Double helix is anti-parallel

– 5’ end of each strand pairs with 3’ end of the other
– 5’ to 3’ motion in one strand is 3’ to 5’ in the other

  • Double helix has no orientation

– Biology has no “forward” and “reverse” strand
– Relative to any single strand, there is a “reverse complement” or “reverse strand”
– Information can be encoded by either strand or both strands

5’ TTTTACAGGACCATG 3’
3’ AAAATGTCCTGGTAC 5’

RNA

  • RNAs are normally single-stranded
  • Form complex structures by self-base-pairing
  • A=U, C=G
  • Can also form RNA-DNA and RNA-RNA double strands

– A=T/U, C=G

slide-14
SLIDE 14

14

Protein zoom-in

[Figure: generic chemical form of an amino acid, showing the amino group, carboxyl group, and side chain]

  • Protein is the actual “worker” for almost all processes in the cell
  • A string built from 20 letters

– E.g. MGDVEKGKKIFIMKCSQCHTVEKGGKH

  • Each letter is called an amino acid

     R
     |
H2N--C--COOH
     |
     H

  • 20 amino acids, only differ at side chains

– Each can be expressed by three letters
– Or a single letter: A-Y, except B, J, O, U, X
– Alanine = Ala = A
– Histidine = His = H
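A quick check of the single-letter alphabet: removing B, J, O, U, X, and Z from the 26 capital letters leaves exactly 20 codes. The sketch below builds that set and tests a peptide string against it; the names are illustrative only.

```python
import string

EXCLUDED = set("BJOUXZ")   # letters not used for the 20 standard amino acids
AMINO_ACID_LETTERS = [c for c in string.ascii_uppercase if c not in EXCLUDED]

def uses_valid_letters(peptide):
    """True if the peptide string uses only the 20 standard one-letter codes."""
    return set(peptide) <= set(AMINO_ACID_LETTERS)

print(len(AMINO_ACID_LETTERS), "".join(AMINO_ACID_LETTERS))
print(uses_valid_letters("MGDVEKGKKIFIMKCSQCHTVEKGGKH"))
```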

Amino acid

     R              R
     |              |
H2N--C--COOH   H2N--C--COOH
     |              |
     H              H

Amino acids => peptide

     R          R
     |          |
H2N--C--CO--NH--C--COOH   (CO--NH: peptide bond)
     |          |
     H          H

Protein

  • Has orientation
  • Usually recorded from N-terminal to C-terminal
  • Peptide vs protein: basically the same thing
  • Conventions

– Peptide is shorter (< 50 aa), while protein is longer
– Peptide refers to the sequence, while protein has 2D/3D structure

N-terminal  H2N--R--R--R--R--R--R--COOH  C-terminal

slide-15
SLIDE 15

15

Protein structure

  • Linear sequence of amino acids folds to form a complex 3-D structure.
  • The structure of a protein is intimately connected to its function.

Genome and chromosome

  • Genome: the complete DNA sequence in the cell of an organism

– May contain one (in most prokaryotes) or more (in eukaryotes) chromosomes

  • Chromosome: a single large DNA molecule in the cell

– May be circular or linear
– Contains genes as well as “junk DNA”
– Highly packed!

Formation of chromosome

  • 50,000 times shorter than extended DNA
  • The total length of DNA present in one adult human is the equivalent of nearly 70 round trips from the earth to the sun

slide-16
SLIDE 16

16

Gene

  • Gene: unit of heredity in living organisms

– A segment of DNA with information to make a protein

Some statistics

Organism           Genes     Bases         Chromosomes
Human              20k-25k   3 billion     46
Yeast              ~7k       20 million    16
Dog                ~20k      2.4 billion   78
Corn               50-60k    2.5 billion   20
Marbled lungfish   ?         130 billion   ?
E. coli            ~4k       4 million     1

Human genome

  • 46 chromosomes: 22 pairs + 2 sex chromosomes (X and Y)
  • In each pair, 1 from mother, 1 from father
  • Female: X + X
  • Male: X + Y

Human genome

  • Every cell contains the same genomic information

– Except sperm and eggs, which only contain half of the genome
– Otherwise your children would have 46 + 46 chromosomes

slide-17
SLIDE 17

17

Cell division: mitosis

  • A cell duplicates its genome and divides into two identical cells
  • These cells build up different parts of your body

Cell division: meiosis

  • A reproductive cell divides into four cells, each containing only half of the genome

– Diploid => haploid

  • Two haploid cells (sperm + egg) form a zygote

– Which will then develop into a multicellular organism by mitosis

Central dogma of molecular biology

DNA replication is critical in both mitosis and meiosis

DNA Replication

  • The process of copying a double-stranded DNA molecule

– Semi-conservative

5’-ACATGATAA-3’
3’-TGTACTATT-5’

⇓

5’-ACATGATAA-3’     5’-ACATGATAA-3’
3’-TGTACTATT-5’     3’-TGTACTATT-5’

slide-18
SLIDE 18

18

  • Mutation: changes in DNA base-pairs
  • Proofreading and error-correcting mechanisms exist to ensure extremely high fidelity

Central dogma of molecular biology

Transcription

  • The process by which a DNA sequence is copied to produce a complementary RNA

– Called messenger RNA (mRNA) if the RNA carries instructions on how to make a protein
– Called non-coding RNA if the RNA does not carry instructions on how to make a protein
– Only consider mRNA for now

  • Similar to replication, but

– Only one strand is copied

Transcription

Coding strand (where genetic information is stored): 5’-ACGTAGACGTATAGAGCCTAG-3’
Template strand (for making mRNA): 3’-TGCATCTGCATATCTCGGATC-5’
mRNA: 5’-ACGUAGACGUAUAGAGCCUAG-3’

Coding strand and mRNA have the same sequence, except that T’s in DNA are replaced by U’s in mRNA.

DNA-RNA pair: A=U, C=G, T=A, G=C
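The relationship between the two DNA strands and the mRNA can be checked in a few lines. This sketch derives the mRNA in two ways: from the coding strand (replace T with U) and from the template strand written 3’ to 5’ (pair each base using A=U, T=A, G=C, C=G). The function names are made up for illustration.

```python
TEMPLATE_TO_RNA = str.maketrans("ACGT", "UGCA")

def transcribe_coding(coding_5to3):
    """mRNA (5'->3') from the coding strand (5'->3'): same letters, T becomes U."""
    return coding_5to3.replace("T", "U")

def transcribe_template(template_3to5):
    """mRNA (5'->3') from the template strand written 3'->5', base by base."""
    return template_3to5.translate(TEMPLATE_TO_RNA)

coding = "ACGTAGACGTATAGAGCCTAG"
template = "TGCATCTGCATATCTCGGATC"
print(transcribe_coding(coding))
print(transcribe_template(template))
```

Both calls print the same mRNA, which is the point of the slide: the template strand is what gets read, but the result looks like the coding strand.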

slide-19
SLIDE 19

19

Translation

  • The process of making proteins from mRNA
  • A gene uniquely encodes a protein
  • There are four bases in DNA (A, C, G, T), and four in RNA (A, C, G, U), but 20 amino acids in protein
  • How many nucleotides are required to encode an amino acid in order to ensure correct translation?

– 4^1 = 4
– 4^2 = 16
– 4^3 = 64

  • The actual genetic code used by the cell is a triplet.

– Each triplet is called a codon
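The counting argument can be written as a tiny search for the smallest word length k with 4^k ≥ 20. This is only a sketch of the arithmetic; the function name and parameters are illustrative.

```python
def min_codon_length(n_symbols=4, n_amino_acids=20):
    """Smallest k such that n_symbols**k words can label n_amino_acids items."""
    k = 1
    while n_symbols ** k < n_amino_acids:
        k += 1
    return k

print(min_codon_length())
```

With k = 3 there are 64 possible codons for 20 amino acids, so several codons can map to the same amino acid and a few are left over as stop signals.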

The Genetic Code

[Figure: genetic code table, indexed by the first, second, and third letter of each codon]

Translation

  • The sequence of codons is translated to a sequence of amino acids
  • Gene: -GCT TGT TTA CGA ATT-
  • mRNA: -GCU UGU UUA CGA AUU-
  • Peptide: -Ala - Cys - Leu - Arg - Ile-
  • Start codon: AUG

– Also codes for Met

  • Stop codons: UGA, UAA, UAG
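Translation can be sketched as a table lookup, three letters at a time. The block below builds the standard genetic code from a compact 64-character string (codons in U, C, A, G order, with `*` marking the three stop codons UAA, UAG, and UGA) and translates the mRNA from the example above. It returns one-letter amino acid codes, reads from the first base, and stops at the first stop codon. It does not search for the AUG start codon. The names are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon or the end."""
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":          # stop codon: release the peptide
            break
        peptide.append(aa)
    return "".join(peptide)

print(translate("GCUUGUUUACGAAUU"))   # Ala-Cys-Leu-Arg-Ile in one-letter codes
```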

Translation

  • Transfer RNA (tRNA): a different type of RNA.

– Freely float in the cell.
– Every amino acid has its own type of tRNA that binds to it alone.

  • Anti-codon to codon binding is crucial.

[Figure: tRNA-Leu and tRNA-Pro bound to the mRNA through their anti-codons, with the nascent peptide]

slide-20
SLIDE 20

20

Transcriptional regulation

[Figure: a gene and its promoter, showing the transcription starting site, RNA polymerase, and a transcription factor]

  • Will talk more later.
  • RNA polymerase binds to a certain location on the promoter to initiate transcription
  • Transcription factors bind to specific sequences on the promoter to regulate transcription

– Recruit RNA polymerase: induce
– Block RNA polymerase: repress
– Multiple transcription factors may coordinate

Splicing

[Figure: a gene with its promoter and transcription starting site is transcribed into pre-mRNA, which is spliced into mature mRNA; labels: 5’ UTR, exons, introns, 3’ UTR, start codon, stop codon, open reading frame (ORF)]

  • Pre-mRNA needs to be “edited” to form mature mRNA
  • Will talk more later.

Summary

  • Central dogma: DNA => RNA => Protein
  • DNA: a string made from {A, C, G, T}

– Forms the basis of genes
– Normally double-stranded with a reverse-complementary sequence
– Can replicate itself
– Transcribed into messenger RNA
– Transcription is regulated

  • RNA: a string made from {A, C, G, U}

– Translated into protein (possibly spliced)

  • Protein: made from 20 kinds of amino acids

– Actual worker in the cell
– Sequence uniquely determined by its gene via the use of nucleotide triplets (codons)
– Sequence determines structure
– Structure determines function

Experimental techniques to manipulate DNA

slide-21
SLIDE 21

21

DNA synthesis

  • Creating DNA synthetically in a laboratory
  • Chemical synthesis

– Chemical reactions
– Arbitrary sequences
– Maximum length 160-200 nucleotides

  • Cloning: make copies based on a DNA template

– Biological reactions
– Requires template
– Many copies of a long DNA in a short time

in vivo Cloning

  • Connect a piece of DNA to bacterial DNA, which can then be replicated together with the host DNA

in vitro Cloning

  • Polymerase chain reaction (PCR)

[Figure: PCR cycle: denature the double strand, bind primers (< 30 bases), then extend with DNA polymerase and dNTPs]

Some terms

  • Denaturation: a DNA double strand is separated into two strands

– By raising temperature

  • Renaturation: the process by which two denatured DNA strands re-form a double strand

– By cooling down slowly

  • Hybridization: two heterogeneous DNAs form a double-stranded DNA

– May have mismatches
– The rationale behind many molecular biological techniques including DNA microarray
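A toy way to think about hybridization with mismatches: pair one strand (5’ to 3’) against the other read in the opposite direction, and count positions where the bases are not complementary. This is a simplified sketch that assumes equal-length strands and no gaps or bulges; the function name is made up for illustration.

```python
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def count_mismatches(strand, other):
    """Mismatched positions when two 5'->3' DNA strands pair antiparallel."""
    if len(strand) != len(other):
        raise ValueError("strands must have equal length")
    # position i of `strand` faces position -(i+1) of `other`
    return sum(PAIR[x] != y for x, y in zip(strand, reversed(other)))

print(count_mismatches("AGCGACTG", "CAGTCGCT"))   # perfect reverse complement
print(count_mismatches("AGCGACTG", "CAGTCGCA"))   # one base changed
```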

slide-22
SLIDE 22

22

DNA sequencing technology

  • Read out the letters from a DNA sequence

1974, Frederick Sanger

GTGAGGCGCTGC

DNA sequencing: Basic idea

  • PCR

primer extension

5’-TTACAGGTCCATACTA ⇒
3’-AATGTCCAGGTATGATACATAGG-5’

  • We need to supply A, C, G, T for the synthesis to continue
  • Besides A, C, G, T, we add some A*, C*, G*, and T*

– Very similar to ACGT in all aspects, except that
– The extension will stop if used
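The chain-termination idea can be simulated in an idealized way. If a terminator (A*, C*, G*, or T*) can be incorporated at any position where its base occurs, then each base produces a set of fragment lengths, and sorting all fragments by length reads out the sequence. The sketch below assumes every possible termination is observed exactly once, which real experiments only approximate; the function names are hypothetical.

```python
def terminated_lengths(new_strand, base):
    """Lengths of extension products that end in a terminator of `base`."""
    return [i + 1 for i, b in enumerate(new_strand) if b == base]

def read_sequence(lengths_by_base):
    """Order fragments by length (as on a gel) and read each terminating base."""
    calls = sorted((n, base)
                   for base, lengths in lengths_by_base.items()
                   for n in lengths)
    return "".join(base for _, base in calls)

strand = "GTGAGGCGCTGC"
fragments = {base: terminated_lengths(strand, base) for base in "ACGT"}
print(fragments)
print(read_sequence(fragments))
```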

DNA sequencing, cont

slide-23
SLIDE 23

23

Advances in DNA sequencing

  • 1969: three years to sequence 115nt DNA
  • 1979: three years to sequence ~1650nt
  • 1989: one week to sequence ~1650nt
  • 1995: Haemophilus genome sequenced at TIGR - 1,830,138nt
  • 2000: Human Genome - working draft sequence, 3 billion bases
  • 2003: (near) completion of human genome

The bioinformatics landmark

  • Completion of human genome sequencing is a success made possible by

– Advancement in sequencing technology
– Speed of computation
– Algorithm development in bioinformatics

  • HGP (Human Genome Project) strategy

– Hierarchical sequencing
– Estimated 15 years (1990 – 2005), completed in 13 years
– $3 billion

  • Celera strategy

– Whole-genome shotgun sequencing
– Three years (1998-2001)
– $300 million

  • The key is the assembly algorithm

Whole-Genome shotgun sequencing

AGTAGCACAGA CTACGACGAGA CGATCGTGCGA GCGACGGCGTA GTGTGCTGTAC TGTCGTGTGTG TGTACTCTCCT

Genome: 3×10^9 nucleotides; each fragment: ~500 nucleotides

slide-24
SLIDE 24

24

Computational Fragment Assembly

  • Introduced ~1980
  • 1995: assemble up to 1,000,000 long DNA pieces
  • 2000: assemble whole human genome
  • A big puzzle: ~60 million pieces, 3×10^9 nucleotides
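To give a feel for the puzzle, here is a toy greedy assembler: repeatedly merge the two reads with the longest suffix-prefix overlap. This is a textbook-style sketch under strong assumptions (error-free reads, a single strand, no repeats). It is not the algorithm used by HGP or Celera. The function names, the `min_len` parameter, and the short reads are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b (>= min_len), else 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by largest overlap; returns the remaining contigs."""
    reads = list(dict.fromkeys(reads))           # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break                                # nothing overlaps any more
        merged = reads[i] + reads[j][n:]         # glue b onto a, skipping the overlap
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

reads = ["CTACGACGAGA", "AGTAGCACAG", "CACAGACTACG"]   # hypothetical reads
print(greedy_assemble(reads))
```

Using the numbers on the slides, about 60 million pieces of about 500 nucleotides add up to 3×10^10 nucleotides, roughly ten times the 3×10^9-nucleotide genome. That redundancy is what provides the overlaps.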

Genome sequencing

Now

  • Over 300 genomes have been sequenced
  • ~10^11 - 10^12 nt

2007

  • Genomes of three individual humans were sequenced

– James Watson
– Craig Venter
– TBN Chinese

  • Cost for sequencing Watson’s genome

– $3 million, 2 months
– Compared to $3 billion for HGP

  • Sequencing speed has been tremendously improved
  • High efficiency and relatively low cost make it possible to sequence the genome of any individual from any species

What’s next?

slide-25
SLIDE 25

25

Continue to sequence more species? More individuals? What to do with those sequences? Coming next: biological sequence analysis