Ekaterina Nosova DMI Dept of Mathematics and Informatics, - PowerPoint PPT Presentation

Ekaterina Nosova DMI – Dept of Mathematics and Informatics, University of Salerno, Italy

Outline
- Introduction to the biclustering problem
- Data sets
- Biclustering
- Task of biclustering
- Bicluster definition
- Combinatorial algorithm: CBA theory
- Error definition
- Initial conditions
- Obtaining the combinatorial matrix
- Bimax
- Results
- Conclusions

Introduction: Data sets
- Data sets are provided, for example, by DNA microarray technology: the results of experiments carried out on genes under different conditions are the expression levels of their transcribed mRNA, stored in DNA chips.
- If two genes are related (they have similar functions or are co-regulated), their expression profiles should be similar.

Introduction
- Clustering (unsupervised): given a set of samples, partition them into groups containing similar samples according to some similarity criteria (CLASS DISCOVERY).
- Classification (supervised): find the classes of the test data set using the known classification of the training data set (CLASS PREDICTION).
- Feature selection (dimensionality reduction): select a subset of features responsible for creating the condition corresponding to the class (GENE SELECTION, BIOMARKER SELECTION).

Introduction: Biclustering
- If two genes are related, they may have similar expression patterns only under some conditions (e.g. they respond similarly to a certain external stimulus, but each of them has distinct functions at other times).
- Similarly, for two related conditions, some genes may exhibit different expression patterns (e.g. two tumor samples of different sub-types).
- As a result, each cluster may involve only a subset of genes and a subset of conditions.

Introduction: Biclustering
- Biclustering is the simultaneous clustering of both the rows and the columns of a data matrix.
- The concept can be traced back to the 1970s (Hartigan, 1972), although it has been rarely used or studied.
- The term was introduced by Cheng and Church (2000), who were the first to use it in gene expression data analysis.
- The technique is used in many fields, such as collaborative filtering, information retrieval and data mining.
- Other names: simultaneous clustering, co-clustering, two-way clustering, subspace clustering, bi-dimensional clustering.

Introduction
- Microarray data can be viewed as an m × n matrix X: $X = (x_{ij})_{m \times n}$
- Each of the m rows represents a gene (or a clone, ORF, etc.).
- Each of the n columns represents a condition (a sample, a time point, etc.).
- Each entry represents the expression level of a gene under a condition. It can be either an absolute value (e.g. Affymetrix GeneChip) or a relative expression ratio (e.g. cDNA microarrays).

Introduction: Biclustering
- An interesting criterion for evaluating a biclustering algorithm is the type of biclusters the algorithm is able to find.
- We identified three major classes of biclusters:
- Biclusters with constant values: $a_{ij} = \mu$
- Biclusters with constant values on rows or columns: $a_{ij} = \mu + \beta_j$
- Biclusters with coherent values: $a_{ij} = \mu + \alpha_i + \beta_j$ or $a_{ij} = \mu \times \alpha_i \times \beta_j$
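The bicluster classes above can be illustrated with a minimal sketch. The helper `make_bicluster` and all numeric values are hypothetical and chosen only for illustration; they do not come from the presentation.

```python
def make_bicluster(mu, alpha, beta, model="additive"):
    """Return the matrix a_ij for background mu, row effects alpha_i, column effects beta_j."""
    if model == "additive":
        return [[mu + a + b for b in beta] for a in alpha]
    if model == "multiplicative":
        return [[mu * a * b for b in beta] for a in alpha]
    raise ValueError(f"unknown model: {model}")


# Constant values: a_ij = mu (all row and column effects are zero).
constant = make_bicluster(2.0, [0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0])

# Constant values on columns: a_ij = mu + beta_j.
constant_columns = make_bicluster(2.0, [0.0, 0.0, 0.0], [1.0, 2.0, 3.0, 4.0])

# Coherent values, additive model: a_ij = mu + alpha_i + beta_j.
coherent_additive = make_bicluster(2.0, [0.0, 1.0, 5.0], [1.0, 2.0, 3.0, 4.0])

# Coherent values, multiplicative model: a_ij = mu * alpha_i * beta_j.
coherent_multiplicative = make_bicluster(
    2.0, [1.0, 2.0, 4.0], [1.0, 0.5, 2.0, 3.0], model="multiplicative"
)
```

Each class is a special case of the next one: setting all row effects to zero in the additive model gives constant columns, and setting all effects to zero gives a constant bicluster.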

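Bicluster definition
Let X be the bicluster of size $n_g \times n_c$, with the elements
$$x_{ij} = \mu + \alpha_i + \beta_j; \quad \mu = x_{IJ}; \quad \alpha_i = x_{iJ} - x_{IJ}; \quad \beta_j = x_{Ij} - x_{IJ}$$
where
$$x_{Ij} = \frac{1}{n_g} \sum_{i=1}^{n_g} x_{ij} \quad \text{(mean over the bicluster rows, for column } j\text{)}$$
$$x_{iJ} = \frac{1}{n_c} \sum_{j=1}^{n_c} x_{ij} \quad \text{(mean over the bicluster columns, for row } i\text{)}$$
$$x_{IJ} = \frac{1}{n_g n_c} \sum_{i=1}^{n_g} \sum_{j=1}^{n_c} x_{ij} \quad \text{(bicluster mean)}$$
Residue [Cheng & Church, 2000]:
$$d_{ij} = x_{ij} - (x_{IJ} + \alpha_i + \beta_j) = x_{ij} - x_{iJ} - x_{Ij} + x_{IJ}$$
Sum-squared residue and mean squared residue (MSR):
$$H = \sum_{i,j} d_{ij}^2; \quad G = \frac{1}{n_g n_c} \sum_{i,j} d_{ij}^2$$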
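As a rough illustration of the Cheng & Church residue, the sketch below computes the mean squared residue of a small matrix, taking the residue of an entry as the entry minus its row mean, minus its column mean, plus the overall mean. The function name `mean_squared_residue` and the example matrices are assumptions made for this example.

```python
def mean_squared_residue(x):
    """Mean squared residue of a bicluster given as a list of rows (genes x conditions)."""
    n_g, n_c = len(x), len(x[0])
    # Mean of each row (over the conditions) and of each column (over the genes).
    row_mean = [sum(row) / n_c for row in x]
    col_mean = [sum(x[i][j] for i in range(n_g)) / n_g for j in range(n_c)]
    total_mean = sum(sum(row) for row in x) / (n_g * n_c)
    # Sum-squared residue H, then the mean G = H / (n_g * n_c).
    h = sum(
        (x[i][j] - row_mean[i] - col_mean[j] + total_mean) ** 2
        for i in range(n_g)
        for j in range(n_c)
    )
    return h / (n_g * n_c)


# A perfectly additive bicluster: the residue vanishes up to rounding.
additive = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [11.0, 12.0, 13.0]]
# A checkerboard pattern is not additive, so its residue is positive.
checkerboard = [[1.0, 0.0], [0.0, 1.0]]

msr_additive = mean_squared_residue(additive)
msr_checkerboard = mean_squared_residue(checkerboard)
```

A residue close to zero indicates that the submatrix is well described by background, row effect and column effect alone.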

Overview of the Biclustering Methods
Method | Published | Cluster model | Goal
Cheng & Church | ISMB 2000 | Background + row effect + column effect | Minimize mean squared residue of biclusters
Getz et al. (CTWC) | PNAS 2000 | Depending on plugin clustering algorithm | Depending on plugin clustering algorithm
Lazzeroni & Owen (Plaid Models) | Bioinformatics 2000 | Background + row effect + column effect | Minimize modeling error
Ben-Dor et al. (OPSM) | RECOMB 2002 | All genes have the same order of expression values | Minimize the p-values of biclusters
Tanay et al. (SAMBA) | Bioinformatics 2002 | Maximum bounded bipartite subgraph | Minimize the p-values of biclusters
Yang et al. (FLOC) | BIBE 2003 | Background + row effect + column effect | Minimize mean squared residue of biclusters
Kluger et al. (Spectral) | Genome Res. 2003 | Background × row effect × column effect | Finding checkerboard structures

Combinatorial Biclustering Algorithm
Problems of other techniques:
1. Precision
2. Noise control
3. Initialization
4. Overlapping
5. Finding all biclusters
6. Multi-biclustering solutions

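CBA theory (1. Precision)
The elements of a bicluster follow the model $x_{ij} = \mu + \alpha_i + \beta_j$:
$$\begin{pmatrix}
\mu+\alpha_1+\beta_1 & \mu+\alpha_1+\beta_2 & \mu+\alpha_1+\beta_3 & \dots & \mu+\alpha_1+\beta_m \\
\mu+\alpha_2+\beta_1 & \mu+\alpha_2+\beta_2 & \mu+\alpha_2+\beta_3 & \dots & \mu+\alpha_2+\beta_m \\
\mu+\alpha_3+\beta_1 & \mu+\alpha_3+\beta_2 & \mu+\alpha_3+\beta_3 & \dots & \mu+\alpha_3+\beta_m \\
\dots & \dots & \dots & \dots & \dots \\
\mu+\alpha_n+\beta_1 & \mu+\alpha_n+\beta_2 & \mu+\alpha_n+\beta_3 & \dots & \mu+\alpha_n+\beta_m
\end{pmatrix}$$
Subtracting the second row from the first:
$$\begin{pmatrix} \alpha_1-\alpha_2 & \alpha_1-\alpha_2 & \dots & \alpha_1-\alpha_2 \end{pmatrix}$$
If we calculate the difference between any two rows of the bicluster, we obtain the same constant value in every column. So we construct the matrix
$$T = \begin{pmatrix} G_1 \\ \vdots \\ G_{N-1} \end{pmatrix}$$
where $G_k$ holds the differences between row $k$ and each later row.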

Error definition (2. Noise control)
For the matrix
$$\begin{pmatrix}
x_{11} & x_{12} & x_{13} & x_{14} \\
x_{21} & x_{22} & x_{23} & x_{24} \\
x_{31} & x_{32} & x_{33} & x_{34} \\
x_{41} & x_{42} & x_{43} & x_{44}
\end{pmatrix}$$
with the columns
$$a_1 = [x_{11}\; x_{21}\; x_{31}\; x_{41}],\quad a_2 = [x_{12}\; x_{22}\; x_{32}\; x_{42}],\quad a_3 = [x_{13}\; x_{23}\; x_{33}\; x_{43}],\quad a_4 = [x_{14}\; x_{24}\; x_{34}\; x_{44}]$$
the error is
$$\text{error} = \max \begin{pmatrix}
\max(a_1) - \min(a_1) \\
\max(a_2) - \min(a_2) \\
\max(a_3) - \min(a_3) \\
\max(a_4) - \min(a_4)
\end{pmatrix}$$
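A minimal sketch of this error measure, assuming it is the largest range (maximum minus minimum) found in any column of the matrix. The function name `bicluster_error` and the sample values are hypothetical.

```python
def bicluster_error(x):
    """Largest column range max(a_k) - min(a_k) over all columns a_k of x."""
    columns = list(zip(*x))  # transpose: one tuple per column
    return max(max(a) - min(a) for a in columns)


# Column ranges are 0.5, 0.5 and 0.25, so the error is 0.5.
noisy = [
    [1.0, 2.0, 3.0],
    [1.5, 2.0, 3.25],
    [1.25, 2.5, 3.0],
]
error = bicluster_error(noisy)
```

Comparing this value with a maximal error chosen in advance gives a simple way to accept or reject a candidate bicluster in the presence of noise.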

Initial conditions (3. Initialization)
[Figure: heat map; horizontal axis: genes (0 to 90), vertical axis: conditions (0 to 70), color scale 0 to 3]

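Obtaining the combinatorial matrix (4. Overlapping)
Data matrix:
$$X = \begin{pmatrix}
1 & 1 & 1 & 2 & x & x & x \\
1 & 1 & 1 & 2 & 3 & 4 & x \\
x & x & 0 & 1 & 2 & 3 & x
\end{pmatrix}$$
Matrix of row differences:
$$T = \begin{pmatrix} G_1 \\ \vdots \\ G_{N-1} \end{pmatrix} = \begin{pmatrix}
0 & 0 & 0 & 0 & x & x & x \\
x & x & 1 & 1 & x & x & x \\
x & x & 1 & 1 & 1 & 1 & x
\end{pmatrix}$$
Combinatorial matrix:
$$C = \begin{pmatrix}
1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 0
\end{pmatrix}$$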
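The example values on this slide are consistent with T being built from the differences between every pair of rows of X. The sketch below reproduces that under two assumptions that are not stated in the presentation: the x entries are treated as missing values (`None`), and any difference involving a missing value is itself missing. The name `difference_matrix` is hypothetical.

```python
from itertools import combinations

X_MISSING = None  # stands for the "x" entries of the slide (assumed to mean "no value")


def difference_matrix(x):
    """Stack the differences row_i - row_k for every pair i < k of rows of x."""
    t = []
    for i, k in combinations(range(len(x)), 2):
        t.append([
            X_MISSING if a is X_MISSING or b is X_MISSING else a - b
            for a, b in zip(x[i], x[k])
        ])
    return t


n = X_MISSING
data = [
    [1, 1, 1, 2, n, n, n],
    [1, 1, 1, 2, 3, 4, n],
    [n, n, 0, 1, 2, 3, n],
]
T = difference_matrix(data)
```

With three rows, the pairs come out in the order (1, 2), (1, 3), (2, 3), so the first two rows of the result correspond to the first block of differences and the last row to the second block.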

Obtaining the combinatorial matrix (4. Overlapping)
Let us take the first row of T, which contains 3 groups of constants $c_1$, $c_2$, $c_3$:
$$t_1 = \begin{pmatrix} c_1 & c_1 & c_1 & c_2 & c_2 & c_3 & c_3 \end{pmatrix}$$
We construct the matrix $C_1$ in the following way:
$$C_1 = \begin{pmatrix}
1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 1
\end{pmatrix}$$
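The construction of the indicator rows from one row of T could be sketched as follows. This is only an illustration: the name `indicator_rows` is hypothetical, constants are grouped by exact equality (noisy data would presumably need a tolerance tied to the maximal error), and missing entries are represented by `None`.

```python
def indicator_rows(t_row):
    """Return one 0/1 row per group of equal constants found in t_row."""
    groups = {}  # constant value -> list of column indices, in order of first appearance
    for j, value in enumerate(t_row):
        if value is None:  # missing entry: belongs to no group
            continue
        groups.setdefault(value, []).append(j)
    return [
        [1 if j in columns else 0 for j in range(len(t_row))]
        for columns in groups.values()
    ]


# Symbolic row with three groups of constants c1, c2, c3.
t1 = ["c1", "c1", "c1", "c2", "c2", "c3", "c3"]
C1 = indicator_rows(t1)

# A numeric row with a single constant group and missing entries.
single = indicator_rows([0, 0, 0, 0, None, None, None])
```

Each output row marks the columns on which the two compared rows of the data matrix differ by the same constant, which is the condition for those columns to belong to a common bicluster.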

Bimax (5. Finding all biclusters; 6. Multi-biclustering solutions)
- We divide the input matrix E into two smaller sub-matrices U and V.
- The set of columns is divided into two subsets $C_U$ and $C_V$, here by taking the first row as a template.
- The rows of E are re-sorted:
1. the genes that respond only to the conditions in $C_U$,
2. the genes that respond to conditions in both $C_U$ and $C_V$,
3. the genes that respond only to the conditions in $C_V$.
- The corresponding sets of genes are $G_U$, $G_W$ and $G_V$.
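One partitioning step of this kind might look like the sketch below, for a binary matrix in which 1 means that a gene responds to a condition. The function name `bimax_partition` and the example matrix are hypothetical, and rows without any 1 are simply left out of all three gene sets, which is an assumption of this example rather than something stated on the slide.

```python
def bimax_partition(e):
    """Split the columns by the first row of e, then sort the rows into G_U, G_W, G_V."""
    n_cols = len(e[0])
    template = e[0]
    # Columns where the template row has a 1 go to C_U, the rest to C_V.
    c_u = [j for j in range(n_cols) if template[j] == 1]
    c_v = [j for j in range(n_cols) if template[j] != 1]

    g_u, g_w, g_v = [], [], []
    for i, row in enumerate(e):
        in_u = any(row[j] == 1 for j in c_u)
        in_v = any(row[j] == 1 for j in c_v)
        if in_u and in_v:
            g_w.append(i)  # responds to conditions in both column sets
        elif in_u:
            g_u.append(i)  # responds only to conditions in C_U
        elif in_v:
            g_v.append(i)  # responds only to conditions in C_V
    return c_u, c_v, g_u, g_w, g_v


E = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
c_u, c_v, g_u, g_w, g_v = bimax_partition(E)
```

Row and column indices are zero-based here. In the example, rows 0 and 1 respond only within the template columns, row 2 spans both column sets, and row 3 responds only outside the template columns.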

Results: the 20 × 20 matrix
1. A simple 20 × 20 matrix with two biclusters

Results: the 100 × 100 matrix
- A 100 × 100 matrix that contains 3 biclusters

Results: the gastric cancer data
- 31 normal tissues
- 38 tumoral tissues: 19 MSS and 19 MSI
- 82 genes

Normal/tumoral

MSS/MSI

Conclusions
- As the experiments show, the combinatorial algorithm always gives better and more accurate results than the other algorithms, because it reaches the maximal precision in the analysis of the data sets.
- In every experiment, we decided a priori the maximal error and the minimal dimension of the desired biclusters.

Acknowledgments
I thank my co-workers and co-authors:
- Prof. Roberto Tagliaferri, PhD Francesco Napolitano, Prof. Giancarlo Raiconi (Dept. of Mathematics and Informatics, University of Salerno)
- PhD Roberto Amato, Prof. Gennaro Miele, Prof. Sergio Cocozza (Dipartimento di Scienze Fisiche, Università degli Studi di Napoli "Federico II", Napoli, Italy)
This work is partially supported by the Istituto Nazionale di Alta Matematica Francesco Severi (INdAM) with the scholarship N U 2007/000458 07/09/2007.

Thank you!!!
