Part 1: Advanced Computational Intelligence and Deep Machine Learning for Early Detection and Diagnosis of Diseases



SLIDE 1

Research Problem Definition Part 1

Tarek Khorshed

PhD Student, American University in Cairo

Advanced Computational Intelligence and Deep Machine Learning for Early Detection and Diagnosis of Diseases

SLIDE 2

Problem Definition

  • Detecting and diagnosing diseases remains a major challenge, with severe negative impacts on global health and the economy. Examples: cancer, H1N1, SARS and tuberculosis.
  • Early detection of diseases and outbreaks is an important area that has not yet exploited the full potential of computer science.

Disease Diagnosis

  • The increasing complexity of health information has made it difficult, if not impossible, to use traditional monitoring techniques to detect irregular signs of possible disease threats or outbreaks.
  • The complexity lies in the different interpretations of the data, for example by geographical location, weather and climate, and seasonal changes.

SLIDE 3

Challenges in Analysis of Biomedical Data

  • Analysis of biomedical gene expression data is extremely challenging given the complexity of biological networks and the high dimensionality of the data.
  • Current clustering techniques rely on preprocessing the data for feature extraction and dimensionality reduction, which affects the accuracy of disease diagnosis [4].

[4] R. Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120–154, 2010.

Data preprocessing and feature extraction [4]

  • The proposed research targets alternative solutions that can process high-dimensional data to achieve better accuracy.

Molecular Gene Expression data

SLIDE 4

Overview Of Gene Expression Data

Gene Expression Data Representation

  • Gene data is represented as a real-valued two-dimensional matrix.
  • Rows represent the expression patterns of genes.
  • Columns represent the expression profiles of samples.

Matrix representation of gene data [6].
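
The matrix layout described on this slide can be sketched in a few lines of Python. The gene names, sample names, and values below are invented for illustration; only the row and column convention comes from the slide.

```python
# Hypothetical toy data: 4 genes (rows) x 3 samples (columns).
gene_names = ["gene_a", "gene_b", "gene_c", "gene_d"]
sample_names = ["sample_1", "sample_2", "sample_3"]

expression = [
    [2.1, 2.4, 1.9],
    [0.3, 0.2, 0.5],
    [5.0, 4.8, 5.3],
    [1.1, 3.9, 1.0],
]


def gene_pattern(matrix, i):
    """Return row i: the expression pattern of one gene across all samples."""
    return list(matrix[i])


def sample_profile(matrix, j):
    """Return column j: the expression profile of one sample across all genes."""
    return [row[j] for row in matrix]
```

Gene-based clustering groups the rows of this matrix, while sample-based clustering groups its columns.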

SLIDE 5

Challenges in Gene Expression Data

  • Only a small subset of the gene data may influence the disease being monitored.
  • Interesting features of the disease are present in only a subset of the data, which further complicates pattern analysis.

Data Quality

  • The gene expression matrix contains many data anomalies, such as noise and missing values.
  • Preprocessing the gene data is a crucial step before attempting any data analysis task for disease diagnosis.
  • Preprocessing includes normalizing the data, estimating missing values, and filtering out gene expression data that are not relevant or significant to the biological process being analyzed.

Preprocessing Tasks
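
A minimal sketch of these three preprocessing tasks follows. The specific choices are assumptions made for illustration, not methods stated on the slide: missing values are marked as `None` and filled with the gene's mean, genes are filtered by a variance threshold, and normalization is a per-gene z-score.

```python
import math


def impute_row_mean(matrix):
    """Replace missing values (None) with the mean of the observed values in that gene's row."""
    result = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        result.append([mean if v is None else v for v in row])
    return result


def variance(row):
    """Population variance of one gene across samples."""
    mean = sum(row) / len(row)
    return sum((v - mean) ** 2 for v in row) / len(row)


def filter_low_variance(matrix, min_variance):
    """Keep only genes whose variance across samples reaches min_variance."""
    return [row for row in matrix if variance(row) >= min_variance]


def zscore_rows(matrix):
    """Scale each gene to zero mean and unit variance across samples."""
    result = []
    for row in matrix:
        mean = sum(row) / len(row)
        std = math.sqrt(variance(row))
        result.append([(v - mean) / std if std > 0 else 0.0 for v in row])
    return result


def preprocess(matrix, min_variance=0.01):
    """Impute, then filter, then normalize. The threshold is an arbitrary example value."""
    imputed = impute_row_mean(matrix)
    filtered = filter_low_variance(imputed, min_variance)
    return zscore_rows(filtered)
```

For example, `preprocess([[1.0, None, 3.0], [2.0, 2.0, 2.0]])` fills the gap in the first gene, drops the constant second gene, and returns one normalized row.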

SLIDE 6

Challenges in Gene Expression Data

Features >>> Samples

Typical example in cancer classification

  • The number of features is much larger than the number of samples.

G. Bontempi, "A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 2, pp. 293–300, April-June 2007.

Name: short identifier of the data set
Platform: type of microarray platform
N: number of samples (e.g. tumor samples: hundreds)
n: number of features (e.g. gene probes: thousands)

SLIDE 7

Cluster Analysis

Clustering Overview

  • The objective is to group data objects into a set of disjoint classes called clusters.
  • Clustering is a form of unsupervised learning because it does not depend on predefined class labels.

Classification of Clustering Techniques

  • 1. Hard Partitional Clustering: Attempts to find a K-partition of the data.
  • 2. Hierarchical Clustering: Attempts to build a tree structure in the form of a partition.
  • 3. Fuzzy Clustering: A data object can belong to a cluster with a degree of membership.
  • 4. Density Based Clustering: Defines core, border and noise points.

SLIDE 8

Cluster Analysis

  • 1. Hard Partitional Clustering: Attempts to find a K-partition of the data

K-Means Clustering
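
The slide names K-means without spelling out the steps. A minimal sketch of the standard iteration (Lloyd's algorithm) is given below, assuming squared Euclidean distance and initial centroids drawn at random from the data; both are illustrative choices, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate assignment and centroid update until labels stop changing.

    Returns (centroids, labels), where labels[i] is the cluster index of points[i].
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, labels
```

Each point ends up in exactly one cluster, which is what makes the result a hard K-partition.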

SLIDE 9

Cluster Analysis

  • 1. Hard Partitional Clustering: Attempts to find a K-partition of the data

Mixture Based Clustering: EM Algorithm for Mixture of Gaussians
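
As a rough illustration of the EM algorithm for a mixture of Gaussians, the sketch below handles one-dimensional data only. The caller-supplied initial means, the shared starting variance, the fixed iteration count, and the variance floor are all simplifying assumptions rather than details from the slide.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, means, n_iter=50, min_var=1e-6):
    """EM for a 1-D Gaussian mixture, starting from the given initial means.

    Returns (weights, means, variances, responsibilities).
    """
    k = len(means)
    n = len(data)
    means = list(means)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [max(overall_var, min_var)] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint] if total > 0 else [1.0 / k] * k)
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for c in range(k):
            nc = sum(r[c] for r in resp)
            if nc == 0:
                continue  # leave an unused component unchanged
            weights[c] = nc / n
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / nc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / nc
            variances[c] = max(var, min_var)  # floor avoids a collapsing component
    return weights, means, variances, resp
```

To obtain a hard partition from the fitted mixture, each point can be assigned to the component with the largest responsibility.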

SLIDE 10

Cluster Analysis

  • 2. Hierarchical Clustering: Organizes the data set into a hierarchical structure.

Dendrogram output for hierarchical clustering

  • 1. Agglomerative methods: Bottom-up approach where each element starts in its own cluster and pairs of clusters are then merged together.
  • 2. Divisive methods: Top-down approach where all elements start in one cluster and are then divided recursively.
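
The agglomerative (bottom-up) approach can be sketched as follows. Average linkage and Euclidean distance are assumptions chosen for the example; the slide does not say which linkage criterion is used. The recorded merge history is the information a dendrogram displays.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters, distance=euclidean):
    """Bottom-up clustering with average linkage.

    Returns (clusters, merges). Each cluster is a list of point indices;
    merges lists (cluster_a, cluster_b, linkage_distance) in merge order.
    """
    clusters = [[i] for i in range(len(points))]  # every element starts alone
    merges = []

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(distance(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges
```

This naive version recomputes every linkage on each pass, so it is only suitable for small examples.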

SLIDE 11

  • Proximity describes how we measure the distance or similarity between a pair of data objects: 1) distance (dissimilarity), 2) similarity.

Similarity Measures for Gene Expression Data

Gene Expression Data Cluster Analysis

Common similarity measures for continuous variables [4].

SLIDE 12

Similarity Measures for Gene Expression Data

Euclidean Distance

  • Performs well for many clustering applications, but produces poor results when used with gene data [10].
  • For gene data we are more interested in the overall similarity of the patterns than in the magnitude of each individual attribute.
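
A small example makes the magnitude problem concrete. The three expression patterns below are invented: `gene_b` has exactly the same shape as `gene_a` at ten times the level, while `gene_c` is flat.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression patterns of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [10.0, 20.0, 30.0, 20.0]  # same shape as gene_a, ten times the magnitude
gene_c = [2.0, 2.0, 2.0, 2.0]      # flat pattern at a similar level to gene_a

same_shape = euclidean_distance(gene_a, gene_b)   # sqrt(1458), about 38.2
flat_neighbor = euclidean_distance(gene_a, gene_c)  # sqrt(2), about 1.41
```

Euclidean distance ranks the flat pattern as far closer to `gene_a` than the pattern that rises and falls in step with it, which is the opposite of what pattern-based gene clustering wants.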

SLIDE 13

Pearson’s correlation coefficient

  • Measures the similarity between the shapes of two gene expression patterns. Commonly used in clustering gene data, where it has produced very good results.
  • Does not perform well in the presence of outliers: it can assign a high similarity score to a pair of dissimilar patterns if they share a common peak or valley [10].
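
Both properties can be checked with a short sketch. The patterns are made up for illustration, and returning 0.0 for a constant pattern is a convention chosen here, since the coefficient is undefined in that case.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length patterns."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for constant patterns
    return cov / (norm_x * norm_y)


# Same shape at different magnitudes: the correlation is 1 up to rounding.
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [10.0, 20.0, 30.0, 20.0]

# Two otherwise unrelated patterns that share a single large peak at the end.
noisy_x = [0.1, -0.2, 0.0, 0.2, 8.0]
noisy_y = [0.2, 0.1, -0.1, -0.2, 9.0]

with_peak = pearson(noisy_x, noisy_y)             # close to 1
without_peak = pearson(noisy_x[:4], noisy_y[:4])  # negative
```

The shared peak alone drives the score close to 1, even though the remaining four points are negatively correlated.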

Similarity Measures for Gene Expression Data

Jackknife correlation
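
The slide gives only the name of this measure. The sketch below assumes one common definition: the minimum of the Pearson correlation on the full data and the Pearson correlations obtained by leaving out each observation in turn. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant pattern."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


def jackknife_correlation(x, y):
    """Smallest Pearson correlation over the full data and all leave-one-out subsets."""
    scores = [pearson(x, y)]
    for i in range(len(x)):
        scores.append(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
    return min(scores)


# Two otherwise unrelated patterns that share a single large peak.
noisy_x = [0.1, -0.2, 0.0, 0.2, 8.0]
noisy_y = [0.2, 0.1, -0.1, -0.2, 9.0]

plain = pearson(noisy_x, noisy_y)
robust = jackknife_correlation(noisy_x, noisy_y)
```

Because one of the leave-one-out subsets excludes the shared peak, `robust` is much lower than `plain`, so a single outlier can no longer produce a high similarity score on its own.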

SLIDE 14

Gene Based Clustering Algorithms

Evaluation of Hard Partitional Clustering with Gene Expression Data

Advantages

  • Time complexity is O(NKd), which is practical for large data sets since the number of clusters K and the number of dimensions d are typically much smaller than N.
  • Similarity measures can be relatively simple to compute.
  • A variation called Bisecting K-means can enhance the performance of the algorithm [10].

Disadvantages

  • The correct number of clusters is not known in advance.
  • There is no standard method to define the initial set of clusters.
  • The algorithm must be run several times with random initial partitions, which is computationally expensive.
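
The cost of repeated runs can be illustrated with a sketch that restarts K-means from several random initializations and keeps the run with the lowest within-cluster sum of squares. Random initial centroids stand in here for the random initial partitions mentioned on the slide, and the restart count is an arbitrary example value.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_once(points, k, rng, max_iter=100):
    """One K-means run. Returns (within-cluster sum of squares, centroids, labels)."""
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(v) / len(members) for v in zip(*members)]
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return wcss, centroids, labels


def kmeans_restarts(points, k, n_restarts=10, seed=0):
    """Run K-means n_restarts times and keep the run with the lowest cost."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])
```

Each restart costs a full O(NKd) pass per iteration, so the total work grows linearly with the number of restarts, and the best run found is still not guaranteed to be the global optimum.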

SLIDE 15

References

[1] International Agency for Research on Cancer (IARC), World Cancer Report. http://www.iarc.fr
[2] National Cancer Institute (NCI), http://www.cancer.gov/
[3] P. Baldi and G. W. Hatfield, "DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modelling," Cambridge Univ. Press, 2002.
[4] R. Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120–154, 2010.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[6] T. Lee and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," J. Opt. Soc. Amer., vol. 20, pt. 7, pp. 1434–1448, 2003.
[7] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, pp. 1527–1554, 2006.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25, MIT Press, Cambridge, MA, 2012.
[9] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf
[10] G. Carneiro, J. C. Nascimento, and A. Freitas, "The Segmentation of the Left Ventricle of the Heart From Ultrasound Data Using Deep Learning Architectures," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 968–982, March 2012.
[11] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf