1 |
Research Problem Definition, Part 1
Advanced Computational Intelligence and Deep Machine Learning for Early Detection and Diagnosis of Diseases
Tarek Khorshed, PhD Student, American University in Cairo
2 |
Problem Definition
Detecting and diagnosing diseases is a major challenge with severe negative impacts on global health and the economy. Examples: cancer, H1N1, SARS and tuberculosis. One important area that has not yet exploited the full potential of Computer Science is the early detection of diseases and outbreaks.
Disease Diagnosis
The increasing complexity of health information has made it difficult, if not impossible, to use traditional monitoring techniques to detect irregular signs of possible disease threats or outbreaks. The complexity lies in the different ways the data can be interpreted, for example by geographical location, weather and climate, and seasonal changes.
3 |
Challenges in Analysis of Biomedical Data
Analysis of biomedical gene expression data is extremely challenging given the complexity of biological networks and the high dimensionality of the data. Current clustering techniques rely on preprocessing the data for feature extraction and dimensionality reduction, which affects the accuracy of disease diagnosis [4].
Figure: Data preprocessing and feature extraction [4].
[4] R. Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120–154, 2010.
The proposed research targets alternative solutions that are capable of processing high-dimensional data to achieve better accuracy.
Molecular Gene Expression data
4 |
Overview Of Gene Expression Data
Gene Expression Data Representation
Gene data is represented as a real-valued 2-dimensional matrix.
- Rows represent the patterns of genes.
- Columns represent the profiles of samples.
Matrix representation of gene data [6].
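A minimal sketch of this layout, with rows as genes and columns as samples. The gene names, sample names, and values are invented for illustration and do not come from the slides.

```python
# Toy gene expression matrix: rows are genes, columns are samples.
# All names and values below are hypothetical.
genes = ["gene_A", "gene_B", "gene_C"]
samples = ["sample_1", "sample_2", "sample_3", "sample_4"]
expression = [
    [2.1, 2.3, 0.4, 0.5],  # pattern of gene_A across the four samples
    [0.2, 0.1, 1.9, 2.2],  # pattern of gene_B
    [1.0, 1.1, 1.0, 0.9],  # pattern of gene_C
]


def gene_pattern(matrix, row):
    """Return one row: the pattern of a single gene across all samples."""
    return list(matrix[row])


def sample_profile(matrix, col):
    """Return one column: the profile of a single sample across all genes."""
    return [row[col] for row in matrix]
```

Gene-based clustering would compare rows of this matrix, while sample-based clustering would compare columns.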
5 |
Challenges in Gene Expression Data
Only a small subset of the gene data might influence the disease being monitored. The interesting features of the disease are present only in a subset of the data, which adds further complexity to pattern analysis techniques.
Data Quality
The gene expression matrix contains many data anomalies such as noise and missing values. Preprocessing the gene data is a crucial step before attempting any data analysis tasks for disease diagnosis.
Preprocessing Tasks
Normalizing the data, estimating missing values, and filtering out gene expression data that are not relevant or significant to the biological process being analyzed.
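The three preprocessing tasks can be sketched as follows. The specific choices here (row-mean imputation, a standard-deviation filter, and z-score normalization) are illustrative assumptions, not methods named in the slides, and the function names and the `min_std` threshold are hypothetical.

```python
from statistics import mean, pstdev


def impute_missing(row):
    """Replace missing values (None) with the mean of the observed values in the row."""
    observed = [v for v in row if v is not None]
    fill = mean(observed) if observed else 0.0
    return [fill if v is None else v for v in row]


def filter_low_variance(matrix, min_std):
    """Keep only genes whose standard deviation across samples is at least min_std."""
    return [row for row in matrix if pstdev(row) >= min_std]


def standardize(row):
    """Rescale a gene's pattern to zero mean and unit standard deviation."""
    mu, sigma = mean(row), pstdev(row)
    if sigma == 0:
        return [0.0 for _ in row]  # a constant pattern carries no shape information
    return [(v - mu) / sigma for v in row]


def preprocess(matrix, min_std=0.5):
    """Estimate missing values, filter uninformative genes, then normalize."""
    imputed = [impute_missing(row) for row in matrix]
    kept = filter_low_variance(imputed, min_std)
    return [standardize(row) for row in kept]
```

For example, `preprocess([[1.0, None, 3.0], [2.0, 2.0, 2.0], [0.0, 4.0, 8.0]])` fills the gap in the first gene, drops the constant second gene, and normalizes the remaining two.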
6 |
Challenges in Gene Expression Data
Features >>> Samples
Typical example in cancer classification
- The number of features is much larger than the number of samples.
G. Bontempi, "A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 2, pp. 293–300, April-June 2007.
Name: short identifier of the data set
Platform: type of microarray platform
N: number of samples (e.g. tumor samples: hundreds)
n: number of features (e.g. gene probes: thousands)
7 |
Cluster Analysis
Clustering Overview
The objective is to group data objects into a set of disjoint classes called clusters. Clustering is a form of unsupervised learning because it does not depend on predefined class labels.
Classification of Clustering Techniques
- 1. Hard Partitional Clustering: Attempts to find a K-partition of the data.
- 2. Hierarchical Clustering: Attempts to build a tree structure in the form of a partition.
- 3. Fuzzy Clustering: A data object can belong to a certain cluster with a degree of membership.
- 4. Density Based Clustering: Defines core, border and noise points.
8 |
Cluster Analysis
- 1. Hard Partitional Clustering: Attempts to find a K-partition of the data
K-Means Clustering
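A minimal sketch of K-means, assuming squared Euclidean distance and random selection of the initial centers from the data. The slide names the algorithm without giving details, so the function names, the `seed` parameter, and the empty-cluster handling are assumptions of this sketch.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def k_means(points, k, max_iter=100, seed=0):
    """Partition points into k clusters; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the partition has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return labels, centers
```

Each point ends up in exactly one cluster, which is what makes the partition "hard".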
9 |
Cluster Analysis
- 1. Hard Partitional Clustering: Attempts to find a K-partition of the data
Mixture Based Clustering: EM Algorithm for Mixture of Gaussians
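A minimal sketch of the EM algorithm for a mixture of Gaussians, restricted to one-dimensional data to keep it short. The starting parameters are supplied by the caller, and the variance floor `min_var` is an assumption added to avoid degenerate components; none of these details come from the slides.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, mus, variances, weights, n_iter=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    mus, variances, weights = list(mus), list(variances), list(weights)
    k, n = len(mus), len(data)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v)
                    for w, m, v in zip(weights, mus, variances)]
            total = sum(dens)
            if total == 0:  # numerical underflow: fall back to equal shares
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0:
                continue  # component received no weight; leave it unchanged
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, min_var)
            weights[j] = nj / n
    return mus, variances, weights
```

Unlike K-means, each point carries a responsibility for every component; a hard partition is obtained by assigning each point to the component with the largest responsibility.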
10 |
Cluster Analysis
- 2. Hierarchical Clustering:
Dendrogram output for hierarchical clustering
Organizes the data set into a hierarchical structure.
- 1. Agglomerative methods: Bottom-up approach where each element starts in its own cluster and then pairs of clusters are merged together
- 2. Divisive methods: Top-down approach where all elements start in one cluster and are then divided recursively
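The bottom-up (agglomerative) approach can be sketched as below. Single linkage with Euclidean distance is an assumption of this sketch; the slides do not name a linkage criterion, and the function names are hypothetical.

```python
def euclidean(a, b):
    """Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def single_linkage(cluster_a, cluster_b, points):
    """Cluster distance = distance between the closest pair of members."""
    return min(euclidean(points[i], points[j])
               for i in cluster_a for j in cluster_b)


def agglomerative(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns (clusters, merges), where clusters hold point indices and
    merges records each merge as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(points))]  # every point starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges
```

The `merges` list, read in order, is the information a dendrogram displays: which clusters were joined and at what distance.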
11 |
Proximity describes how we measure the distance or similarity between a pair of data objects.
- 1. Distance (Dissimilarity)
- 2. Similarity
Similarity Measures for Gene Expression Data
Gene Expression Data Cluster Analysis
Common similarity measures for continuous variables [4].
12 |
Similarity Measures for Gene Expression Data
Euclidean Distance
Performs well for many clustering applications, but produces poor results when used with gene data [10]. For gene data we are more interested in the overall similarity of the patterns than in the magnitude of each individual attribute.
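A small illustration of why magnitude-based distance can mislead on expression patterns. The three patterns are invented for this example.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equally long expression patterns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical expression patterns over five samples.
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape as gene_a, ten times larger
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # opposite shape, similar magnitude

d_ab = euclidean_distance(gene_a, gene_b)
d_ac = euclidean_distance(gene_a, gene_c)
```

Here `d_ac` is smaller than `d_ab`, so Euclidean distance ranks the oppositely shaped `gene_c` as closer to `gene_a` than the identically shaped `gene_b`.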
13 |
Pearson’s correlation coefficient
Measures similarity between the shapes of two gene expression patterns. Commonly used in clustering gene data and has produced very good results. Does not perform well in the presence of outliers: it can assign a high similarity score to a pair of dissimilar patterns if they have a common peak or valley [10].
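A minimal sketch of Pearson's correlation coefficient, together with an invented example of the outlier problem: two patterns that are otherwise unrelated but share a single peak.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant pattern")
    return cov / (norm_x * norm_y)


# Same shape at different scales: correlation is 1 up to rounding.
r_shape = pearson([1.0, 2.0, 3.0, 2.0, 1.0], [10.0, 20.0, 30.0, 20.0, 10.0])

# Hypothetical patterns that share one peak at index 3 and little else.
peak_1 = [1.0, 1.2, 0.9, 9.0, 1.1, 1.0]
peak_2 = [2.0, 1.8, 2.1, 9.5, 1.9, 2.2]
r_with_peak = pearson(peak_1, peak_2)

# The same two patterns with the shared peak removed.
r_without_peak = pearson(peak_1[:3] + peak_1[4:], peak_2[:3] + peak_2[4:])
```

The shared peak alone pushes `r_with_peak` close to 1, while the remaining points are negatively correlated (`r_without_peak` is below zero).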
Similarity Measures for Gene Expression Data
Jackknife correlation
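The slide gives no formula for the jackknife correlation. The sketch below assumes the commonly used definition: the minimum of the Pearson correlation over the full patterns and over every version with one sample left out. The example data are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long patterns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant pattern")
    return cov / (norm_x * norm_y)


def jackknife_correlation(x, y):
    """Smallest Pearson correlation over the full data and all leave-one-out subsets.

    Assumed definition; the slides name the measure without defining it.
    """
    scores = [pearson(x, y)]
    for i in range(len(x)):
        scores.append(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
    return min(scores)


# Hypothetical patterns that share a single peak at index 3.
peak_1 = [1.0, 1.2, 0.9, 9.0, 1.1, 1.0]
peak_2 = [2.0, 1.8, 2.1, 9.5, 1.9, 2.2]
plain = pearson(peak_1, peak_2)
robust = jackknife_correlation(peak_1, peak_2)
```

Because one of the leave-one-out subsets drops the shared peak, `robust` is negative even though `plain` is close to 1, so a single outlier can no longer produce a high similarity score.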
14 |
Gene Based Clustering Algorithms
Evaluation of Hard Partitional Clustering with Gene Expression Data
Advantages
- Complexity is O(NKd). Practical for large data since the number of clusters K and the number of dimensions d are typically much smaller than N.
- Similarity measures can be relatively simple to compute.
- A variation called Bisecting K-means can enhance the performance of the algorithm [10].
Disadvantages
- The correct number of clusters is not known in advance.
- There is no standard method to define the initial set of clusters.
- Requires running the algorithm several times with random initial partitions, which is computationally expensive.
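The need for several runs with random initial partitions can be sketched as follows: run K-means from several random starts and keep the run with the lowest within-cluster sum of squared errors (SSE). Selecting by SSE, and all function and parameter names, are assumptions of this sketch rather than details from the slides.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]


def k_means_once(points, k, rng, max_iter=100):
    """One K-means run from a random start; returns (sse, labels)."""
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return sse, labels


def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Repeat K-means n_restarts times and keep the lowest-SSE result."""
    rng = random.Random(seed)
    runs = [k_means_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])
```

Each restart costs a full O(NKd) pass per iteration, which is where the extra computational expense comes from.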
15 |
References
[1] International Agency for Research on Cancer (IARC), World Cancer Report. http://www.iarc.fr
[2] National Cancer Institute (NCI), http://www.cancer.gov/
[3] P. Baldi and G. W. Hatfield, "DNA Microarrays and Gene Expression: From Experiments to Data Analysis and Modelling," Cambridge Univ. Press, 2002.
[4] R. Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120–154, 2010.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[6] T. Lee and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," J. Opt. Soc. Amer., vol. 20, pt. 7, pp. 1434–1448, 2003.
[7] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, pp. 1527–1554, 2006.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing Systems 25, MIT Press, Cambridge, MA, 2012.
[9] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf
[10] G. Carneiro, J. C. Nascimento, and A. Freitas, "The Segmentation of the Left Ventricle of the Heart From Ultrasound Data Using Deep Learning Architectures," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 968–982, March 2012.
[11] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf