Mining and Pattern Analysis in Large Data Sets for Biological - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Mining and Pattern Analysis in Large Data Sets for Biological Information.

David W. Mount Arizona Cancer Center

  • Analysis of gene expression microarray data sets with the goal of preventing or curing cancer
      – Statistical analysis of data
      – Using biological information to interpret data

  • Future types of genetic analyses
slide-2
SLIDE 2

My major objectives.

  • Develop hypotheses based on data analysis that can be tested in the laboratory or clinic
  • Use and develop new methods for data analysis - pattern analysis, clustering, data mining, biological models
  • Focus 1: early changes in colorectal and prostate cancer
  • Focus 2: drugs for pancreatic cancer
  • Major goal: to discover the unusual based on statistical and biological data analyses

slide-3
SLIDE 3

Using data from two types of microarrays for measuring gene expression of ~35,000 human genes.

[Figure: workflow for the two array types]

  • Spotted arrays: cDNA, EST collection; control sample mRNA and test sample mRNA converted to Cy3- and Cy5-labeled cDNA; mix hybridized to one slide; Cy5/Cy3 for each gene
  • Affymetrix arrays: oligo 1, oligo 2, oligo 3, … for each gene (matched oligos, mismatched oligos), synthesis on slide; control sample mRNA to one slide, test sample mRNA to another slide; biotin labeled cDNAs hybridized to oligos; slide1/slide2 for each gene

slide-4
SLIDE 4

Using gene expression microarrays for predicting genetic variation in tissues.

  • Michigan Prostate Study

[Figure legend: Green – down, Red – up (<2-4 fold). NAP: normal adjacent; BPH: benign hyperplasia; PCA: localized; MET: metastatic]

Underexpressed: predicting lost functions. Overexpressed: predicting new metabolism.

slide-5
SLIDE 5

Use data to find

  • An unusual gene product or gene expression value that indicates a good drug target
  • An early change that can help with early detection/diagnosis

slide-6
SLIDE 6

Microarrays provide new drug targets - 1

Over-expressed genes in metastatic tissue. What genes, what pathways, what functions, where in cell? Cancer cells need these additional proteins to support their abnormal metabolism.

[Diagram labels: Cancer cell; Normal cell; A; AAA; Inhibitor of A product]

slide-7
SLIDE 7

Microarrays provide new drug targets - 2

Cancer cells lose many gene functions by mutation (A-). They need backup functions to survive (B+). Target these backup functions. Geneticists call these overlapping gene functions synthetic lethals (A. Kamb).

[Diagram: Normal cell A+B+; Cancer cell A-B+; Inhibitor of B product]

slide-8
SLIDE 8

Careful Experimental Design and Statistical Analysis are Extremely Important

1. Plan experiment so as to identify sources of variation
2. Include biological replication
3. Perform data quality analysis
4. Find genes that are varying significantly using data model in 1.
5. Mine this gene list for biological information

Complications: genetic variability person to person, cancer stage, tissues are cell mixtures
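As a rough illustration of why biological replication matters when looking for genes that vary significantly, the sketch below computes an ordinary one-sample t-statistic on log2 ratios (M values) across replicate arrays. This is a simplified stand-in, not the LIMMA linear-model analysis named in the slides; the gene names and M values are hypothetical.

```python
import math
import statistics


def one_sample_t(m_values):
    """Ordinary one-sample t-statistic testing mean(M) == 0 across replicates."""
    n = len(m_values)
    if n < 2:
        raise ValueError("need at least two biological replicates")
    mean = statistics.mean(m_values)
    sd = statistics.stdev(m_values)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        # No spread at all: the statistic is unbounded unless the mean is also 0.
        return math.copysign(math.inf, mean) if mean != 0 else 0.0
    return mean / (sd / math.sqrt(n))


# Hypothetical log2(R/G) values for three genes on four replicate arrays.
replicates = {
    "gene_up": [1.9, 2.1, 2.0, 2.2],
    "gene_down": [-1.4, -1.6, -1.5, -1.7],
    "gene_flat": [0.3, -0.2, 0.1, -0.3],
}

t_stats = {gene: one_sample_t(m) for gene, m in replicates.items()}

# Rank genes by the absolute size of the statistic, largest first.
ranked = sorted(t_stats, key=lambda gene: abs(t_stats[gene]), reverse=True)

for gene in ranked:
    print(f"{gene}: t = {t_stats[gene]:.2f}")
```

With a single array per condition the standard deviation cannot be estimated at all, which is one reason step 2 comes before step 4.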

slide-9
SLIDE 9

Analysis of biological data with a variable genetic component is not new!

slide-10
SLIDE 10

We are using R statistical computing/BioConductor for data analysis combined with Perl/Bioperl for biological mining.

slide-11
SLIDE 11

R has tools for looking at data quality, etc.

Background varies slide to slide. [Figure panels: bad spotted array; good Affy array]

slide-12
SLIDE 12

Antibody used for immunochemical stain reveals which cells are producing a protein (cytokeratin)

Labeled cells Unlabeled cells

slide-13
SLIDE 13

Example of Pancreatic Cancer

  • 1/200 people get pancreatic cancer; 1/4 if they have pancreatitis
  • It is a very painful and debilitating disease
  • Death usually within 1-2 years of discovery
  • Few drugs available - gemcitabine is hopeful but only helps a small percentage of people
  • There is very little currently being spent on research into pancreatic cancer compared to other cancers
  • I will describe early results: 4 cancer tissues vs. one normal tissue on Agilent spotted arrays (24K genes).

slide-14
SLIDE 14

Boxplots of normalized data of 4 tissues reveal that between-slide comparisons should be valid.

Boxplots illustrate that distribution of M values in each sample is similar. Bars are 25% and 75% levels.
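The comparison that the boxplots make visually can be sketched numerically: compute the quartiles of the M values for each sample and check that they are similar. The sample names and M values below are hypothetical, and the quartile method (`inclusive`) is one common convention rather than the one used for the slides.

```python
import statistics


def five_number_summary(values):
    """Return min, 25% level, median, 75% level and max for one sample."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


# Hypothetical normalized M values for two tissue samples.
samples = {
    "tumor_1": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "tumor_2": [-1.2, -0.4, 0.1, 0.4, 1.1],
}

summaries = {name: five_number_summary(m) for name, m in samples.items()}

for name, summary in summaries.items():
    iqr = summary["q3"] - summary["q1"]
    print(f"{name}: median = {summary['median']:.2f}, IQR = {iqr:.2f}")
```

Similar medians and interquartile ranges across samples are what the slide means by the distributions being comparable.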

slide-15
SLIDE 15

Normalization within arrays corrects for labeling and label detection variation.

[Figure panels: MA plot with no normalization; MA plot with Loess normalization. Red - tumor; Blue - normal]

A = average of the R and G values on the log2 scale (log2 of the square root of their product). M = log2 of the R to G ratio.

MA plot from the first cancer tissue sample vs. control. Each point is one of approx. 24,000 genes. The crowd of spots in the lower part of the graph, two of which are labeled R25, are the +ve controls with a deliberately reduced R/G ratio; two -ve controls, which should not change, are on the center left near 0. Two values of VegF, of interest to project 1, and Fos, the most significantly over-expressed gene in these tissues, are also shown. Normalization restores M of most genes to approx. 0.
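The M and A quantities can be computed directly from the red and green intensities, as in the minimal sketch below. The intensities are hypothetical, and the median centering at the end is only a crude stand-in for within-array normalization: it applies one shift to the whole array, whereas Loess fits an intensity-dependent curve.

```python
import math
import statistics


def ma_values(red, green):
    """Return (M, A) lists: M = log2(R/G), A = 0.5 * log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def median_center(m):
    """Shift M so that its median is 0 (simplified normalization, not Loess)."""
    shift = statistics.median(m)
    return [value - shift for value in m]


# Hypothetical spot intensities with a global dye bias toward red.
red = [800.0, 1600.0, 400.0, 3200.0, 200.0]
green = [400.0, 800.0, 200.0, 400.0, 100.0]

m, a = ma_values(red, green)
m_centered = median_center(m)

for raw, centered, intensity in zip(m, m_centered, a):
    print(f"A = {intensity:.2f}  M = {raw:.2f}  M centered = {centered:.2f}")
```

After centering, most spots sit at M = 0 and only the spot with a genuinely different ratio stands out, which is the effect the normalized MA plot shows.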

slide-16
SLIDE 16

Top 100 genes that are statistically best supported are mostly down regulated.

Red - tumor; Blue - normal. A1 = average of the R and G values on the log2 scale (log2 of the square root of their product). M1 = log2 of the R to G ratio.

slide-17
SLIDE 17

Volcano plot of fold change (x axis) against log odds that a gene is differentially expressed (y axis) for the 100 most significantly varying genes.

This plot also shows that the most significantly varying genes in the pancreatic cancer tissues are down regulated, which probably means they are not functional. Some down regulated genes are also tumor suppressor genes and thus are candidates for project 2 drug screens in the Pancreatic PPG. Log odds of 5 means that the odds that these genes are NOT varying significantly from M=0 are e^-5 ≈ 1/148. This is a measure of the false discovery rate.
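The arithmetic behind a log-odds value can be checked with a few lines. This sketch assumes natural-log odds, as in the statement about a log odds of 5; the function names are illustrative only.

```python
import math


def odds_from_log_odds(b):
    """Odds in favor of differential expression for a natural-log odds value b."""
    return math.exp(b)


def probability_from_log_odds(b):
    """Probability of differential expression implied by log odds b."""
    return 1.0 / (1.0 + math.exp(-b))


for b in (0.0, 1.5, 5.0):
    odds = odds_from_log_odds(b)
    prob = probability_from_log_odds(b)
    print(f"log odds {b:>3}: odds = {odds:8.2f} to 1, probability = {prob:.4f}")
```

A log odds of 0 corresponds to even odds, and each additional unit multiplies the odds by e.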
slide-18
SLIDE 18

Example of genes varying significantly between 4 pancr. cancer tissues and a normal pancr. tissue sample. - TGen data - Agilent arrays.

Gb_accession  GeneName     Description                    M = log2(R/G)     A       t   p corr. for FDR      B
BC004490      FOS          V-fos transcr. factor                    3.6  10.3    36.9           0.00090   7.41
NM_033194     NM_033194.1  Heat shock pr B9                        -1.7   8.2   -30.0           0.00090   6.64
Y12661        VGF          VGF nerve growth factor                 -2.5  13.7   -28.3           0.00090   6.41
AF488739      GABABL       family G protein coupled rec.           -2.0  10.2   -26.1           0.00090   6.06
……
NM_015711     GLTSCR1      Glioma tumor suppressor                 -1.0  10.3   -15.3           0.00188   3.51
……
BC000311      COPEB        Kruppel-like transcr. factor             1.6  10.8    13.8           0.00250   2.99
NM_006999     POLS         DNA Poly. sigma                          0.8   9.0    13.8           0.00250   2.98
……
NM_001530     HIF1A        Hypoxia-ind factor 1                     1.4   7.4     7.6           0.00865  -0.19
NM_001530     HIF1A        Hypoxia-ind factor 1                     1.6   7.3     7.6           0.00867  -0.20
NM_001530     HIF1A        Hypoxia-ind factor 1                     1.5   7.3     7.6           0.00872  -0.21
NM_001530     HIF1A        Hypoxia-ind factor 1                     1.6   7.3     7.5           0.00888  -0.26

A is the average of the R and G values, as defined for the MA plots.

p-value adjusted for false discovery rate (Benjamini and Hochberg) for multiple hypothesis testing. FDR is the expected percent of false predictions in a set of predictions, in this case the percent of genes that are incorrectly reported to change.

B = log-odds that gene is differentially expressed, e.g. if B=1.5, odds is e^1.5 = 4.48, i.e., odds of correct prediction is 4.48/1. For B=0, odds = 1/1.
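The Benjamini and Hochberg adjustment mentioned above can be sketched in a few lines. This is a plain-Python version of the standard step-up procedure with hypothetical raw p-values; it is offered as an illustration of the idea, not as the code that produced the table.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(raw)

for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f}  adjusted p = {q:.5f}")
```

Reporting every gene whose adjusted p-value is below a chosen level keeps the expected fraction of false reports in that list at or below the level.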

slide-19
SLIDE 19

What do you do with a list of genes?

  • Influence on known metabolic and regulatory pathways (usually ~1/4 of genes)
  • Gene Ontology (GO) terms
  • Protein-protein and gene-gene interactions
  • Where located - genome amplification, rearrangements?
  • Agreement with models - biological and computational

slide-20
SLIDE 20

Local genome databases are maintained at AZCC

  • Local databases of human, rat, mouse, and model organisms
  • Direct links to genetic, proteomic, and regulatory/pathway databases
  • Information on protein-protein and gene-gene interactions
  • http://www.biorag.org is the public access Web site
slide-21
SLIDE 21

Pathway Miner

  • http://www.biorag.org/pathway.html
  • Pandey et al. 2004 Bioinformatics. 20:2156-8
  • Builds genetic network displays based on regulatory and metabolic relationships
  • Produces lists of genes in Excel format
slide-22
SLIDE 22

Genetic network analysis of pancreatic data with Pathway Miner

  • top 800 pancreatic genes - GenMAPP pathways

A Java interactive display that can be filtered in many ways. Click on gene names to retrieve all relevant information and on edges to view the pathways in common. Any list of genes can be uploaded for analysis.

slide-23
SLIDE 23

Five genes in the top 800 are in MAPK, including FOS
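One way to judge whether five pathway genes in a list of 800 is more than chance would give is a hypergeometric test, sketched below. The universe of 24,000 genes follows the array size quoted earlier; the pathway size of 90 is a hypothetical placeholder, since the slides do not give the number of MAPK genes on the array, and this test is not something the slides say Pathway Miner performs.

```python
from math import comb


def hypergeom_upper_tail(k, list_size, pathway_size, universe_size):
    """P(X >= k) for X = pathway genes in a random list of list_size genes."""
    total = comb(universe_size, list_size)
    upper = min(list_size, pathway_size)
    favorable = sum(
        comb(pathway_size, x) * comb(universe_size - pathway_size, list_size - x)
        for x in range(k, upper + 1)
    )
    return favorable / total


# 5 pathway genes observed among the top 800 of 24,000 genes.
# pathway_size=90 is an assumed value used only for illustration.
p_value = hypergeom_upper_tail(k=5, list_size=800, pathway_size=90, universe_size=24000)
print(f"P(at least 5 pathway genes by chance) = {p_value:.3f}")
```

With these assumed numbers about 3 pathway genes would be expected by chance, so the biological relevance of the specific genes (such as FOS) carries more weight than the count alone.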

slide-24
SLIDE 24

New target and drug strategy used in pancreatic cancer project.

  • Identify under-expressed tumor suppressor (TS) genes in pancreatic cancer tissues
  • Make isogenic pancreatic cell lines with combinations of these genes
  • Screen for differential sensitivity to a large NCI collection of drugs and chemicals/siRNA knockdowns

[Diagram: A-, A+; cancer cell TS+ vs. cancer cell TS-] Find drug or siRNA specific for TS- cells
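The screening step can be pictured as ranking agents by how much more they reduce viability in TS- cells than in the matching TS+ cells. The sketch below uses hypothetical agent names and viability fractions; the actual screen readout and scoring used in the project are not described in the slides.

```python
def differential_sensitivity(viability_ts_plus, viability_ts_minus):
    """Rank agents by (viability in TS+ cells) - (viability in TS- cells)."""
    differences = {
        agent: viability_ts_plus[agent] - viability_ts_minus[agent]
        for agent in viability_ts_plus
        if agent in viability_ts_minus
    }
    # Largest positive difference first: most selective for TS- cells.
    return sorted(differences.items(), key=lambda item: item[1], reverse=True)


# Hypothetical fraction of cells surviving treatment in each isogenic line.
ts_plus = {"agent_A": 0.95, "agent_B": 0.40, "agent_C": 0.90}
ts_minus = {"agent_A": 0.30, "agent_B": 0.35, "agent_C": 0.92}

ranking = differential_sensitivity(ts_plus, ts_minus)

for agent, difference in ranking:
    print(f"{agent}: difference in viability = {difference:+.2f}")
```

An agent that kills both lines equally (agent_B here) is generally toxic rather than specific, so it ranks low even though it is potent.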

slide-25
SLIDE 25

Pathway Miner used for siRNA analysis, TGen data. DPC4+/- cell lines.

Gene knockdowns showing largest effects.

[Figure legend: Red - greater killing DPC4-; Green - greater killing DPC4+]

Conclusions:

  • may assist in choice of drug targets
  • knockdown of genes of closely related function can have quite opposite effect.

slide-26
SLIDE 26

siRNA hits on Wnt pathway.

slide-27
SLIDE 27

siRNA effects on nuclear receptors.

slide-28
SLIDE 28

About prostate cancer

  • Men screened for a serum antigen - PSA
  • If levels go up -> biopsy specimens examined for evidence of cancer (black box -> Gleason score)
  • Decision made about prostatectomy (undesirable - incontinence, sexual dysfunction, etc.)
  • Survival about 2/3
  • Tissues collected from men used for gene expression analysis using Affymetrix arrays (about 12,500 genes)

slide-29
SLIDE 29

Analysis of a large Affy prostate data set (Singh et al. 2002, Cancer Cell)

  • 50 normal tissues
  • 52 staged tissues
  • Perform BioConductor Linear Models (LIMMA) analysis
  • Trying advanced statistical modeling and clustering of genes, e.g. independent component analysis (fastICA, MLICA), mixture models (nlme)
  • Test models of penetration, altered metabolism, etc.

slide-30
SLIDE 30

The data set: human Affy hgu95av2 chip - 1/3 of genome

  • 50 normal prostate tissues
  • 52 cancer tissues at different stages

      – 29 negative capsule penetration/20 positive
      – 13 positive resection surg. margin/37 negative
      – 9 non re-occurring/5 re-occurring
      – Gleason score available
      – no apparent dissection
      – no apparent pairing of N/C samples

Singh et al. Cancer Cell March 2002

slide-31
SLIDE 31

Results from prostate data set

  • Can find about 600-1,000 genes changing N/C depending on acceptable level of FDR
  • No significant changes for capsular, margin, or recurrence data (agrees with paper)
  • What next?
slide-32
SLIDE 32

Biocarta pathways

slide-33
SLIDE 33

Metabolic changes in prostate cancer - cells deprived of oxygen depend on these changes.

Cancer cells in general learn to survive with reduced oxygen and they make a factor (vascular endothelial growth factor or VEGF) that induces growth of blood vessels. This is clearly observed in gene expression data.

slide-34
SLIDE 34

Getting at the unknown gene relationships.

  • Try to identify sets of genes that are regulated independently of other sets
  • A new method is independent component analysis (vs. principal components analysis, etc.)
      – Can superimpose regulatory models and build a more detailed model
  • Genes interact to different degrees
  • Problem: find sets of genes that are statistically most different across the tissue samples
  • R provides resources
slide-35
SLIDE 35

Independent component analysis: suitable for building and testing regulatory models

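[Diagram: Matrix X (genes × samples, samples ordered NN…CC) = (genes × components) X (components × samples); the two factor matrices are labeled Matrix A and Matrix S. Number strings shown in the diagram: 33331111 31331311 31311311]

Do any of these gene groups better separate the sample classes C and N?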
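The question of which component best separates the C and N classes can be sketched as a scoring step applied after the decomposition. The code below does not perform ICA; it assumes the components × samples factor has already been estimated (for example with the fastICA package named earlier) and scores each component with a Welch-style t-statistic. The component values and labels are hypothetical.

```python
import statistics


def separation_score(values, labels):
    """Welch-style t-statistic comparing one component in C vs. N samples."""
    normal = [v for v, lab in zip(values, labels) if lab == "N"]
    cancer = [v for v, lab in zip(values, labels) if lab == "C"]
    standard_error = (
        statistics.variance(normal) / len(normal)
        + statistics.variance(cancer) / len(cancer)
    ) ** 0.5
    difference = statistics.mean(cancer) - statistics.mean(normal)
    if standard_error == 0:
        return 0.0 if difference == 0 else float("inf")
    return difference / standard_error


# Hypothetical per-sample values of two components for 3 N and 3 C samples.
labels = ["N", "N", "N", "C", "C", "C"]
components = {
    "component_1": [0.10, 0.20, 0.15, 0.90, 1.00, 0.95],
    "component_2": [0.50, -0.40, 0.10, 0.20, -0.30, 0.40],
}

scores = {name: separation_score(v, labels) for name, v in components.items()}
best = max(scores, key=lambda name: abs(scores[name]))

for name, score in scores.items():
    print(f"{name}: separation score = {score:.2f}")
print(f"best separating component: {best}")
```

A component with a large absolute score behaves differently in the two classes, while one with a score near zero looks like noise with respect to the N/C split.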

slide-36
SLIDE 36

Use of ICA in analysis of endometrial cancer

Noise Good separation

SA Saidi et al. Oncogene 2004

slide-37
SLIDE 37

Some samples of ICA: objective - can we find a set to discriminate gene and tissue classes in prostate ca.?

Hierarchical clustering (complete) of 102 prostate tissue samples Boxplots of 102 samples after ICA

slide-38
SLIDE 38

Another approach - use list of genes that are of biological interest during early stages of prostate Ca. and build model.

  • 14-3-3 sigma
  • actinin
  • BP180
  • BP230
  • cadherin
  • catenin
  • CD151
  • CD44
  • CD63
  • CD81
  • CD9
  • connexin 32
  • desmocollin
  • desmoglein
  • desmoplakin
  • ehm2
  • EWI
  • Ezrin
  • fascin
  • fibulin
  • HD1
  • keratin
  • laminin
  • MTA3
  • Nanos homolog 1
  • PKC-delta
  • plakoglobin
  • Plectin
  • SNAI1
  • tenascin
  • vinculin
  • Zona Occludens 1
  • Zona Occludens 2

Genes related to cell adhesion to intracellular matrix

If these genes change, then expect cells to be able to penetrate the capsule and invade surrounding tissues.

slide-39
SLIDE 39

The source of germline variability in humans

[Diagram: one of my pairs of chromosomes, maternal and paternal; what I pass on to our children; what my wife passes on to our children]

  • Hundreds of thousands of differences in sequence, called SNPs - single nucleotide polymorphisms
  • Inheritance is through haplotype blocks of 10s to 100s of kbases

slide-40
SLIDE 40

Genotype revealed in humans by haplotype structure of 5q31 (Daly et al. 2001)

slide-41
SLIDE 41

Goal: relationship between genotype and expression. Pomp et al. 2004 Large scale expression analysis mapped against genotype.

slide-42
SLIDE 42

Conclusions and future plans

  • gene expression data are used to identify drug targets
  • further analysis
      – ICA analysis - Maximum likelihood method
      – Examine all penetration related genes for possible variation

slide-43
SLIDE 43

Acknowledgements

Colleagues at UMC/AZCC/SWEHSC

  • Ritu Pandey, Greg Thomas, Rob Klein, Raghavendra Guru
  • Dave Alberts
  • Anne Cress
  • Gene Gerner
  • Serrine Lau - SWEHSC
  • Clark Lantz
  • Ray Nagle
  • Garth Powis
  • George Tsaprailis and the proteomics core
  • Bernie Futscher and George Watts of the genomics core

Colleagues at TGen, Phoenix

  • Dan Von Hoff
  • Jeff Trent
  • Phillip Stafford
  • Haiyong Han
  • Spyro Mousses