Part 1 Advanced Computational Intelligence and Deep Machine - PowerPoint PPT Presentation

Research Problem Definition, Part 1
Advanced Computational Intelligence and Deep Machine Learning for Early Detection and Diagnosis of Diseases
Tarek Khorshed, PhD Student, American University in Cairo

Problem Definition: Disease Diagnosis
• Detecting and diagnosing diseases is a major challenge, with severe negative impacts on global health and the economy. Examples: cancer, H1N1, SARS and tuberculosis.
• Early detection of diseases and outbreaks is an important area that has not yet exploited the full potential of computer science.
• The increasing complexity of health information has made it difficult, if not impossible, to use traditional monitoring techniques to detect irregular signs of possible disease threats or outbreaks.
• The complexity lies in the different interpretations of the data, for example by geographical location, weather and climate, and seasonal changes.

Challenges in Analysis of Biomedical Data
• Analysis of biomedical gene expression data is extremely challenging given the complexity of biological networks and the high dimensionality of the data.
• Current clustering techniques rely on preprocessing the data for feature extraction and dimensionality reduction, which affects the accuracy of disease diagnosis [4].
• The proposed research targets alternative solutions that can process high-dimensional data to achieve better accuracy.
[Figure: Molecular gene expression data; data preprocessing and feature extraction [4]]
[4] R. Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120-154, 2010.

Overview of Gene Expression Data: Gene Expression Data Representation
• Gene data is represented as a real-valued 2-dimensional matrix.
• Rows represent patterns of genes; columns represent profiles of samples.
[Figure: Matrix representation of gene data [6]]
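The matrix layout described on this slide can be sketched in a few lines of Python. The gene names, sample names and expression values below are made up for illustration, and the helper names `gene_pattern` and `sample_profile` are hypothetical, not taken from the presentation.

```python
# Toy gene expression matrix: 3 genes (rows) x 4 samples (columns).
# All names and values are hypothetical and only illustrate the layout.
gene_names = ["gene_a", "gene_b", "gene_c"]
sample_names = ["s1", "s2", "s3", "s4"]
expression = [
    [2.1, 2.4, 0.3, 0.2],  # pattern of gene_a across all samples
    [0.5, 0.4, 1.9, 2.2],  # pattern of gene_b
    [1.0, 1.1, 0.9, 1.0],  # pattern of gene_c
]


def gene_pattern(matrix, gene_index):
    """Return one row: the expression pattern of a gene across samples."""
    return list(matrix[gene_index])


def sample_profile(matrix, sample_index):
    """Return one column: the expression profile of a sample across genes."""
    return [row[sample_index] for row in matrix]
```

Reading a row gives the pattern used in gene-based clustering, while reading a column gives the profile used in sample-based clustering.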

Challenges in Gene Expression Data: Data Quality
• Only a small subset of the gene data might influence the disease being monitored.
• Interesting features of the disease are present in only a subset of the data, which further complicates pattern analysis.
• The gene expression matrix contains many data anomalies, such as noise and missing values.
• Preprocessing the gene data is a crucial step before attempting any data analysis for disease diagnosis.
Preprocessing Tasks
• Normalizing the data, estimating missing values, and filtering out gene expression data that is not relevant or significant to the biological process being analyzed.
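The three preprocessing tasks can be sketched as below. This is a minimal illustration under stated assumptions, not the method used in the presentation: missing values are assumed to be marked with `None` and filled with the row mean, filtering is assumed to mean dropping genes whose standard deviation falls below a hypothetical threshold `min_std`, and normalization is assumed to be a per-gene z-score.

```python
from statistics import mean, pstdev


def impute_row_mean(matrix):
    """Replace None entries with the mean of the observed values in the same gene row."""
    result = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = mean(observed) if observed else 0.0
        result.append([fill if v is None else v for v in row])
    return result


def filter_low_variance(matrix, min_std):
    """Keep only genes whose expression varies by at least min_std across samples."""
    return [row for row in matrix if pstdev(row) >= min_std]


def z_score_rows(matrix):
    """Normalize each gene row to zero mean and unit standard deviation."""
    normalized = []
    for row in matrix:
        mu, sigma = mean(row), pstdev(row)
        normalized.append([(v - mu) / sigma if sigma > 0 else 0.0 for v in row])
    return normalized
```

A typical order would be imputation first, then filtering, then normalization, for example `z_score_rows(filter_low_variance(impute_row_mean(raw), 0.1))`.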

Challenges in Gene Expression Data: Features >>> Samples
• Typical example in cancer classification: the number of features is much larger than the number of samples.
Legend:
• Name: short identifier of the data set
• Platform: type of microarray platform
• N: number of samples (e.g. tumor samples: hundreds)
• n: number of features (e.g. gene probes: thousands)
Source: G. Bontempi, "A Blocking Strategy to Improve Gene Selection for Classification of Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 2, pp. 293-300, April-June 2007.

Cluster Analysis: Clustering Overview
• The objective is to group data objects into a set of disjoint classes called clusters.
• Clustering is a form of unsupervised learning because it does not depend on predefined class labels.
Classification of Clustering Techniques
1. Hard Partitional Clustering: attempts to find a K-partition of the data.
2. Hierarchical Clustering: attempts to build a tree structure of nested partitions.
3. Fuzzy Clustering: a data object can belong to a cluster with a degree of membership.
4. Density-Based Clustering: defines core, border and noise points.

Cluster Analysis: 1. Hard Partitional Clustering
Attempts to find a K-partition of the data.
K-Means Clustering
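As a rough illustration of how K-means produces a K-partition, the sketch below alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its members. The function name `k_means`, the random choice of initial centroids and the fixed `seed` are assumptions made for this example, not details from the slides.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=100, seed=0):
    """Partition points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable, so the partition has converged
        labels = new_labels
    return labels, centroids
```

Each pass costs on the order of N K d operations for N points, K clusters and d dimensions, and the result depends on the initial centroids, so in practice the algorithm is often restarted with several seeds.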

Cluster Analysis: 1. Hard Partitional Clustering
Attempts to find a K-partition of the data.
Mixture-Based Clustering: EM Algorithm for Mixture of Gaussians
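The slide names the EM algorithm without giving its steps, so the following is only a minimal sketch for the simplest case: one-dimensional data and two Gaussian components. The initialization (means at the minimum and maximum of the data, shared initial variance) and the variance floor are assumptions chosen to keep the toy example stable.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def em_two_gaussians(data, iterations=50):
    """Fit a two-component 1-D Gaussian mixture; returns (weights, means, variances, resp)."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, 1e-6)
    # Assumed initialization: extreme points as means, shared variance, equal weights.
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            p = [weights[k] * gaussian_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append([p[0] / total, p[1] / total])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsing component
    return weights, means, variances, resp
```

The responsibilities are soft memberships; taking the component with the larger responsibility for each point turns them into a hard partition, which is how a mixture model fits under this heading.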

Cluster Analysis: 2. Hierarchical Clustering
• Organizes the data set into a hierarchical structure.
1. Agglomerative methods: bottom-up approach where each element starts in its own cluster and pairs of clusters are then merged.
2. Divisive methods: top-down approach where all elements start in one cluster, which is then divided recursively.
[Figure: Dendrogram output for hierarchical clustering]
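The agglomerative (bottom-up) approach can be sketched as follows. The slides do not say which linkage criterion is used, so average linkage with Euclidean distance is an assumption here, and the function names are hypothetical. The recorded merge list is the information a dendrogram would display.

```python
def euclidean(a, b):
    """Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters (index lists)."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative(points, num_clusters):
    """Merge the closest pair of clusters until num_clusters remain.

    Returns (clusters, merges), where clusters are lists of point indices and
    merges records each step as (cluster_a, cluster_b, distance).
    """
    clusters = [[i] for i in range(len(points))]  # every element starts alone
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters, merges
```

Running it down to a single cluster yields the full merge history. This naive version recomputes all pairwise linkages at every step, so it is only suitable for small examples.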

Gene Expression Data Cluster Analysis: Similarity Measures for Gene Expression Data
• Proximity describes how we measure the distance or similarity between a pair of data objects.
1) Distance (dissimilarity)
2) Similarity
[Table: Common similarity measures for continuous variables [4]]

Similarity Measures for Gene Expression Data: Euclidean Distance
• Performs well for many clustering applications, but produces poor results when used with gene data [10].
• For gene data, we are more interested in the overall similarity of the expression patterns than in the magnitude of each individual attribute.
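A small worked example may help show why magnitude-based distance can mislead on expression patterns. The three profiles below are made up: `gene_b` has exactly the same shape as `gene_a` at ten times the magnitude, while `gene_c` is flat but numerically close to `gene_a`.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two expression patterns of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical expression patterns over four samples.
gene_a = [1.0, 2.0, 3.0, 2.0]
gene_b = [10.0, 20.0, 30.0, 20.0]  # same rise-and-fall shape, larger magnitude
gene_c = [2.0, 2.0, 2.0, 2.0]      # no pattern at all, similar magnitude

distance_same_shape = euclidean_distance(gene_a, gene_b)
distance_flat = euclidean_distance(gene_a, gene_c)
```

Under Euclidean distance, `gene_a` ends up much closer to the flat `gene_c` than to `gene_b`, even though `gene_b` follows the same pattern.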

Similarity Measures for Gene Expression Data: Pearson’s Correlation Coefficient
• Measures the similarity between the shapes of two gene expression patterns. It is commonly used in clustering gene data and has produced very good results.
• Does not perform well in the presence of outliers: it can assign a high similarity score to a pair of dissimilar patterns if they share a common peak or valley [10].
Jackknife correlation
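The outlier problem and the jackknife remedy can be sketched as below. The slide only names jackknife correlation, so the definition used here is an assumption: the minimum of the Pearson correlation over the full data and over every leave-one-sample-out subset. The example patterns are made up so that they disagree everywhere except for one shared peak.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two patterns of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # correlation is undefined for a constant pattern
    return cov / (norm_x * norm_y)


def jackknife_correlation(x, y):
    """Assumed definition: minimum Pearson correlation over leave-one-out subsets."""
    scores = [pearson(x, y)]
    for i in range(len(x)):
        scores.append(pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:]))
    return min(scores)


# Hypothetical patterns that move in opposite directions except for one shared peak.
pattern_x = [0.1, 0.3, 0.2, 9.0, 0.1]
pattern_y = [0.3, 0.1, 0.2, 9.0, 0.3]
```

Here `pearson(pattern_x, pattern_y)` is close to 1 because the shared peak dominates, while `jackknife_correlation(pattern_x, pattern_y)` is negative once that single sample is left out.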

Gene Based Clustering Algorithms: Evaluation of Hard Partitional Clustering with Gene Expression Data
Advantages
• Complexity is O(N K d), which is practical for large data because the number of clusters K and the number of dimensions d are typically much smaller than N.
• Similarity measures can be relatively simple to compute.
• A variation called Bisecting K-means can enhance the performance of the algorithm [10].
Disadvantages
• The correct number of clusters is not known in advance.
• There is no standard method to define the initial set of clusters.
• The algorithm must be run several times with random initial partitions, which is computationally expensive.

References
[1] International Agency for Research on Cancer (IARC), World Cancer Report. http://www.iarc.fr
[2] National Cancer Institute (NCI), http://www.cancer.gov/
[3] P. Baldi and G. W. Hatfield, "DNA Microarrays and Gene Expression. From Experiments to Data Analysis and Modelling," Cambridge Univ. Press, 2002.
[4] R. Xu and D. C. Wunsch, "Clustering Algorithms in Biomedical Research: A Review," IEEE Reviews in Biomedical Engineering, vol. 3, pp. 120-154, 2010.
[5] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504-507, 2006.
[6] T. Lee and D. Mumford, "Hierarchical Bayesian inference in the visual cortex," J. Opt. Soc. Amer., vol. 20, pt. 7, pp. 1434-1448, 2003.
[7] G. E. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, pp. 1527-1554, 2006.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," Advances in Neural Information Processing 25, MIT Press, Cambridge, MA, 2012.
[9] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf
[10] G. Carneiro, J. C. Nascimento, and A. Freitas, "The Segmentation of the Left Ventricle of the Heart From Ultrasound Data Using Deep Learning Architectures," IEEE Transactions on Image Processing, vol. 21, no. 3, pp. 968-982, March 2012.
[11] http://www.darpa.mil/IPTO/solicit/baa/BAA09-40_PIP.pdf
