Mining and Pattern Analysis in Large Data Sets for Biological - PowerPoint PPT Presentation

Mining and Pattern Analysis in Large Data Sets for Biological Information. David W. Mount Arizona Cancer Center • Analysis of gene expression microarray data sets with goal of preventing or curing cancer – Statistical analysis of data – Using biological information to interpret data • Future types of genetic analyses

My major objectives. • Develop hypotheses based on data analysis that can be tested in the laboratory or clinic • Use and develop new methods for data analysis: pattern analysis, clustering, data mining, biological models • Focus 1: early changes in colorectal and prostate cancer • Focus 2: drugs for pancreatic cancer • Major goal: to discover the unusual based on statistical and biological data analyses

Using data from two types of microarrays for measuring gene expression of ~35,000 human genes. [Diagram: spotted arrays (cDNA, EST collection) vs. Affymetrix oligo arrays. Spotted arrays: Cy5- and Cy3-labeled cDNAs from the test sample mRNA and the control sample mRNA are mixed and hybridized to one slide, giving a Cy5/Cy3 ratio for each gene. Affymetrix arrays: matched and mismatched oligos (oligo 1, oligo 2, oligo 3, ...) are synthesized on the slide for each gene; biotin-labeled cDNA from the control sample mRNA is hybridized to one slide and from the test sample mRNA to another slide, giving slide1/slide2 for each gene.]

Using gene expression microarrays for predicting genetic variation in tissues. [Figure: Michigan Prostate Study. Green = down, red = up (<2-4 fold). Underexpressed genes: predicting lost functions. Overexpressed genes: predicting new metabolism. Tissue classes: NAP = normal adjacent, BPH = benign hyperplasia, PCA = localized, MET = metastatic.]

Use data to find • An unusual gene product or gene expression value that indicates a good drug target • An early change that can help with early detection/diagnosis

Microarrays provide new drug targets - 1. Over-expressed genes in metastatic tissue. What genes, what pathways, what functions, where in cell? Cancer cells need these additional proteins to support their abnormal metabolism. [Figure: normal cell (A) vs. cancer cell (AAA); inhibitor of A product.]

Microarrays provide new drug targets - 2. Cancer cells lose many gene functions by mutation (A-). They need backup functions to survive (B+). Target these backup functions. Geneticists call these overlapping gene functions synthetic lethals (A. Kamb). [Figure: normal cell (A+B+) vs. cancer cell (A-B+); inhibitor of B product.]

Careful Experimental Design and Statistical Analysis are Extremely Important. 1. Plan the experiment so as to identify sources of variation. 2. Include biological replication. 3. Perform data quality analysis. 4. Find genes that are varying significantly using the data model in 1. 5. Mine this gene list for biological information. Complications: genetic variability from person to person, cancer stage, tissues are cell mixtures.

Analysis of biological data with a variable genetic component is not new!

We are using R statistical computing/BioConductor for data analysis combined with Perl/Bioperl for biological mining.

R has tools for looking at data quality, etc. Background varies from slide to slide. [Figure panels: bad spotted array; good Affymetrix array.]

Antibody used for immunochemical stain reveals which cells are producing a protein (cytokeratin). [Figure labels: labeled cells; unlabeled cells.]

Example of Pancreatic Cancer • 1/200 people get pancreatic cancer; 1/4 if they have pancreatitis • It is a very painful and debilitating disease • Death usually occurs within 1-2 years of discovery • Few drugs are available; gemcitabine is hopeful but only helps a small percentage of people • Very little is currently being spent on research into pancreatic cancer compared to other cancers • I will describe early results: 4 cancer tissues vs. one normal tissue on Agilent spotted arrays (24K genes).

Boxplots of normalized data from the 4 tissues show that between-slide comparisons should be valid. The boxplots illustrate that the distribution of M values in each sample is similar. Bars are the 25% and 75% levels.
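The between-slide check described on this slide can be sketched numerically. This is a minimal illustration only, using Python's standard library rather than the R tools used in the talk; the function name `box_summary` and the M values are made up for the example.

```python
from statistics import quantiles

def box_summary(m_values):
    """Return the quartiles that a boxplot draws for one slide's M values."""
    q1, med, q3 = quantiles(m_values, n=4)  # 25%, 50%, 75% levels
    return {"q1": q1, "median": med, "q3": q3, "iqr": q3 - q1}

# Toy M values for two slides (hypothetical numbers, not the talk's data).
slides = {
    "tumor1": [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0],
    "tumor2": [-2.5, -2.0, -1.0, 0.0, 1.0, 2.0, 2.5],
}

summaries = {name: box_summary(values) for name, values in slides.items()}
for name, s in summaries.items():
    print(name, s)
```

If the medians and interquartile ranges are close from slide to slide, as in this toy case, comparing M values between slides is more defensible.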

Normalization within arrays corrects for labeling and label detection variation. [Figure: MA plots from the first cancer tissue sample vs. control, with no normalization and with Loess normalization. Red = tumor, blue = normal.] A = average of R and G values (square root of their product); M = log of the R to G ratio to the base 2. Each point is one of approx. 24,000 genes. The crowd of spots in the lower part of the graph, two of which are labeled R25, are the +ve controls with a deliberately reduced R/G ratio; two -ve controls, which should not change, are at the center left near 0. Two values of VegF, of interest to project 1, and Fos, the most significantly over-expressed gene in these tissues, are also shown. Normalization restores M of most genes to approx. 0.

Top 100 genes that are statistically best supported are mostly down-regulated. [Figure: MA plot. Red = tumor, blue = normal.] A1 = average of R and G values (square root of their product); M1 = log of the R to G ratio to the base 2.
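The M and A quantities defined on this slide can be computed as in the sketch below. Two assumptions are made that are not in the slides: A is taken on the log2 scale, i.e. half of log2(R×G), which is the usual MA-plot convention, and a simple median shift stands in for the Loess normalization, which fits an intensity-dependent curve instead of a single constant. The function names and intensities are invented for the example.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    # Assumption: A on the log2 scale, log2 of the square root of R*G.
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

def median_center(m):
    """Shift M so its median is 0 (a crude stand-in for Loess, not Loess itself)."""
    shift = median(m)
    return [x - shift for x in m]

# Toy intensities (made up): the red channel reads about 2x too bright overall.
red = [400.0, 800.0, 1600.0, 3200.0, 6400.0]
green = [200.0, 400.0, 800.0, 400.0, 3200.0]

m, a = ma_values(red, green)
m_centered = median_center(m)
print(m)           # raw M values, biased upward by the labeling difference
print(m_centered)  # most genes return to M = 0; one stands out
```

After centering, only the fourth spot keeps a large M, which is the behavior the slide describes: normalization restores M of most genes to roughly 0.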

Volcano plot of fold change (x axis) against log odds that a gene is differentially expressed (y axis) for the 100 most significantly varying genes. A log odds of 5 means that the odds that a gene is NOT varying significantly from M=0 are 1 to e^5, or about 1/148. This is a measure of the false discovery rate. This plot also shows that the most significantly varying genes in the pancreatic cancer tissues are down-regulated, which probably means they are not functional. Some down-regulated genes are also tumor suppressor genes and thus are candidates for project 2 drug screens in the Pancreatic PPG.
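The log-odds arithmetic on this slide can be checked with a few lines. This is only a sketch of the conversion between log odds, odds, and probability; the function names are hypothetical.

```python
import math

def log_odds_to_odds(b):
    """Odds in favor of differential expression for log odds b (natural log)."""
    return math.exp(b)

def log_odds_to_prob(b):
    """Probability of differential expression implied by log odds b."""
    odds = math.exp(b)
    return odds / (1.0 + odds)

odds_5 = log_odds_to_odds(5.0)            # about 148 to 1 in favor
prob_not_5 = 1.0 - log_odds_to_prob(5.0)  # chance the gene is not varying
print(round(odds_5, 1), round(prob_not_5, 4))
```

Odds of 1 to 148 against correspond to a probability of roughly 1 in 149, so the odds and the probability are nearly the same number when the log odds are this large.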

Example of genes varying significantly between 4 pancr. cancer tissues and a normal pancr. tissue sample. TGen data, Agilent arrays.

Gb accession  GeneName     Description                   M = log2(R/G)  A = √RG  t      p corr. for FDR  B
BC004490      FOS          V-fos transcr. factor          3.6           10.3     36.9   0.00090           7.41
NM_033194     NM_033194.1  Heat shock pr B9              -1.7            8.2    -30.0   0.00090           6.64
Y12661        VGF          VGF nerve growth factor       -2.5           13.7    -28.3   0.00090           6.41
AF488739      GABABL       family G protein coupled rec. -2.0           10.2    -26.1   0.00090           6.06
……
NM_015711     GLTSCR1      Glioma tumor suppressor       -1.0           10.3    -15.3   0.00188           3.51
……
BC000311      COPEB        Kruppel-like transcr. factor   1.6           10.8     13.8   0.00250           2.99
NM_006999     POLS         DNA Poly. sigma                0.8            9.0     13.8   0.00250           2.98
……
NM_001530     HIF1A        Hypoxia-ind factor 1α          1.4            7.4      7.6   0.00865          -0.19
NM_001530     HIF1A        Hypoxia-ind factor 1α          1.6            7.3      7.6   0.00867          -0.20
NM_001530     HIF1A        Hypoxia-ind factor 1α          1.5            7.3      7.6   0.00872          -0.21
NM_001530     HIF1A        Hypoxia-ind factor 1α          1.6            7.3      7.5   0.00888          -0.26

p corr. for FDR = p-value adjusted for false discovery rate (Benjamini and Hochberg) for multiple hypothesis testing. FDR is the expected percent of false predictions in a set of predictions, in this case the percent of genes that are incorrectly reported to change. B = log-odds that the gene is differentially expressed; e.g. if B=1.5, the odds are e^1.5 = 4.48, i.e. the odds of a correct prediction are 4.48 to 1. For B=0, the odds are 1 to 1.
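The Benjamini and Hochberg adjustment named in this table can be sketched as follows. This is a plain standard-library illustration of the step-up procedure, not the R/BioConductor code used to produce the table; the function name `bh_adjust` and the p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, so adjusted values never increase
    # as the raw p-values decrease.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Toy p-values for four genes (hypothetical).
raw = [0.01, 0.04, 0.03, 0.20]
adj = bh_adjust(raw)
print([round(p, 4) for p in adj])
```

Genes whose adjusted value falls below a chosen threshold form a list in which the expected fraction of false reports is at most that threshold, which is the sense in which the table's corrected p-values control the false discovery rate.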

What do you do with a list of genes? • Influence on known metabolic and regulatory pathways (usually ~1/4 of genes) • Gene Ontology (GO) terms • Protein-protein and gene-gene interactions • Where located - genome amplification, rearrangements? • Agreement with models - biological and computational

Local genome databases are maintained at AZCC • Local databases of human, rat, mouse, and model organisms • Direct links to genetic, proteomic, and regulatory/pathway databases • Information on protein-protein and gene-gene interactions • http://www.biorag.org is the public access Web site

Pathway Miner • http://www.biorag.org/pathway.html • Pandey et al. 2004 Bioinformatics. 20:2156-8 • Builds genetic network displays based on regulatory and metabolic relationships • Produces lists of genes in excel format

Genetic network analysis of pancreatic data with Pathway Miner: top 800 pancreatic genes, GenMAPP pathways. A Java interactive display that can be filtered in many ways. Click on gene names to retrieve all relevant information and on edges to view the pathways in common. Any list of genes can be uploaded for analysis.

Five genes in the top 800 are in MAPK, including FOS
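The kind of lookup behind a statement like this can be sketched as a set intersection between a gene list and pathway membership tables. This is not Pathway Miner's code or interface; the function name, the pathway tables, and every gene name other than FOS are placeholders invented for the example.

```python
def pathway_hits(gene_list, pathways):
    """Map each pathway name to the listed genes that belong to it."""
    genes = set(gene_list)
    hits = {name: sorted(genes & members) for name, members in pathways.items()}
    # Keep only pathways that contain at least one gene from the list.
    return {name: found for name, found in hits.items() if found}

# Hypothetical inputs: a short "top genes" list and toy pathway tables.
top_genes = ["FOS", "GENE_A", "GENE_B", "GENE_C"]
pathways = {
    "MAPK (toy)": {"FOS", "GENE_B", "GENE_X"},
    "Wnt (toy)": {"GENE_Y"},
}

hits = pathway_hits(top_genes, pathways)
print(hits)
```

Counting the genes returned for each pathway gives the sort of summary shown on the slide; real membership tables would come from a pathway resource such as GenMAPP.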

New target and drug strategy used in pancreatic cancer project. • Identify under-expressed tumor suppressor (TS) genes in pancreatic cancer tissues • Make isogenic pancreatic cell lines with combinations of these genes • Screen for differential sensitivity to a large NCI collection of drugs and chemicals/siRNA knockdowns [Figure: cancer cell TS+ (A+) vs. cancer cell TS- (A-); find drug or siRNA specific for TS- cells.]

Pathway Miner used for siRNA analysis, TGen data. DPC4+/- cell lines. Gene knockdowns showing largest effects. Red = greater killing of DPC4-, green = greater killing of DPC4+. Conclusions: • May assist in choice of drug targets • Knockdown of genes of closely related function can have quite opposite effects.

siRNA hits on Wnt pathway.

siRNA effects on nuclear receptors.
