SLIDE 1

A method for similarity-based grouping of biological data

Vaida Jakonienė, David Rundqvist, Patrick Lambrix

SLIDE 2
  • V. Jakonienė, D. Rundqvist, P. Lambrix. Linköpings universitet, Sweden

Outline

  • Environments for supporting grouping algorithms needed
  • Method for similarity-based grouping
  • Test cases
  • Summary and future work

SLIDE 3

Tools for biological data analysis

  • Hierarchical microarray clustering (J-Express Pro)
  • Classification of abstracts

SLIDE 4

Tools for biological data analysis

Other applications of grouping:

  • structuring search results
  • data cleaning
  • data integration

SLIDE 5

Similarity of biological data

  • Sequence alignment (BLAST)
  • Similarity between data entries
    Lord PW, Stevens RD, Brass A, Goble CA. Bioinformatics, 19(10):1275-83, 2003.
  • Basic task – computation of a similarity value between objects

SLIDE 6

Similarity-based grouping

  • Similarity-based grouping of biological data is needed
  • Not a trivial task
      • influenced by a number of aspects
      • the data is complex
      • a variety of grouping algorithms is available: which method performs best for which grouping task?
      • existing grouping algorithms may not be straightforward to apply

SLIDE 7

Similarity-based grouping

Environments that support comparison and evaluation of different grouping strategies are needed

SLIDE 8

Method for similarity-based grouping

[Figure: method overview. Components: data source, grouping attributes, library of similarity functions (domain-independent and domain-dependent), other knowledge, library of classifications. Steps: specification of grouping rules, pairwise grouping, grouping, evaluation, analysis.]

SLIDE 9

A toolKit for Evaluating Grouping Algorithms

SLIDE 10

Test cases

  • Grouping task: grouping of proteins with respect to
      • biological function
      • class of isozymes they belong to
  • Data source
      • human proteins involved in glycolysis
      • 190 data entries retrieved via Entrez

SLIDE 11

Test cases. Data entry

  • Entrez. Protein database
SLIDE 12

Test cases. Data entry

SLIDE 13

Test cases. Data entry

[Figure: data entry, with labels GOann and Sequence.]

SLIDE 14

Test cases. Data sources and mappings

GO Consortium mappings between data values and ontological terms:

  • ec2go – EC numbers translated into GO terms
  • spkw2go – SwissProt keywords translated into GO terms

Data sources:

  • DS1: GOann; 67 data entries
  • DS2: Keywords (via spkw2go), Ec_number (via ec2go) and GOann combined into GOcomb; 93 data entries
  • DS3: Ec_number (via ec2go) and GOann combined into GOcomb; 92 data entries

Restrictions:

  • only terms of GO function ontology analyzed
  • only data entries having GO terms
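
The construction of a combined GO attribute such as GOcomb can be sketched as follows. This is only an illustration under assumptions: the entry layout, the function name `combine_go_terms`, and all identifiers in the tiny mapping tables are hypothetical placeholders, not the contents of the real ec2go or spkw2go files.

```python
def combine_go_terms(entry, ec2go, spkw2go):
    """Return the combined GO term set for one data entry.

    The set is the union of the entry's own GO annotations and the GO terms
    reached by translating its EC numbers and keywords through the mappings.
    """
    terms = set(entry.get("GOann", []))
    for ec in entry.get("Ec_number", []):
        terms.update(ec2go.get(ec, []))
    for kw in entry.get("Keywords", []):
        terms.update(spkw2go.get(kw, []))
    return terms


# Hypothetical mapping tables with placeholder identifiers.
ec2go = {"EC:x.x.x.1": ["GO:term_b"]}
spkw2go = {"keyword_1": ["GO:term_c"]}

# Hypothetical data entries.
entries = [
    {"id": "P1", "GOann": ["GO:term_a"],
     "Ec_number": ["EC:x.x.x.1"], "Keywords": ["keyword_1"]},
    {"id": "P2", "GOann": [], "Ec_number": [], "Keywords": []},
]

gocomb = {e["id"]: combine_go_terms(e, ec2go, spkw2go) for e in entries}

# Keep only the data entries that end up with at least one GO term.
with_go = {entry_id: terms for entry_id, terms in gocomb.items() if terms}
```

Adding mappings can only enlarge the set of entries that have GO terms, which is consistent with the different entry counts reported for the data sources.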
SLIDE 15

Test cases. Other components

  • Library of similarity functions
      • EditDist(v1,v2)
      • SeqSim(v1,v2)
      • SemSim(v1,v2)
  • Other knowledge
      • GO ontology
  • Classifications: manual classification according to
      • biological function
      • classes of isozymes

[Figure: method overview, as on slide 8.]
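
The slides name EditDist(v1,v2) without defining it. A minimal sketch of a domain-independent similarity function of this kind, assuming Levenshtein distance normalized by the longer string (the normalization and the names `edit_distance` and `edit_sim` are assumptions), could look like this:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_sim(v1: str, v2: str) -> float:
    """Similarity in [0, 1]: 1.0 for identical strings, 0.0 for maximally different ones."""
    longest = max(len(v1), len(v2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(v1, v2) / longest
```

Mapping the distance into [0, 1] makes the value comparable against a threshold in the same way as other similarity functions.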

SLIDE 16

Method. Specification of grouping rules (DS3)

SLIDE 17

Method. Pairwise grouping (DS3)

  • all pairs of data entries compared
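
Comparing all pairs of data entries can be sketched as below. The threshold rule, the function names, and the use of Jaccard overlap as a stand-in similarity function are assumptions made for illustration; they are not taken from the slides.

```python
from itertools import combinations


def pairwise_grouping(entries, sim, threshold):
    """Compare all pairs of entries and return the pairs judged similar.

    entries maps an entry id to the value of the grouping attribute;
    a pair is kept when sim(v1, v2) reaches the threshold.
    """
    similar = []
    for (id1, v1), (id2, v2) in combinations(sorted(entries.items()), 2):
        if sim(v1, v2) >= threshold:
            similar.append((id1, id2))
    return similar


def jaccard(s1, s2):
    """Stand-in similarity: overlap between two sets of terms."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0


# Hypothetical entries described by sets of placeholder terms.
entries = {"a": {"t1", "t2"}, "b": {"t1", "t2", "t3"}, "c": {"t4"}}
pairs = pairwise_grouping(entries, jaccard, threshold=0.5)
```

With n entries this performs n(n-1)/2 comparisons, so the cost of the similarity function dominates for larger data sources.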

SLIDE 18

Method. Grouping

  • ConnectedComponents: data entries in a group are directly or transitively similar to each other
  • Cliques: all data entries in a group are similar to each other
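
The two grouping strategies can be illustrated on a small similarity graph, where nodes are data entries and edges are the pairs found similar. This is a sketch, not the toolkit's implementation: the function names are hypothetical, and the basic Bron–Kerbosch procedure is one standard way to enumerate maximal cliques that the slides do not mention.

```python
def build_graph(nodes, similar_pairs):
    """Undirected similarity graph as an adjacency mapping."""
    graph = {n: set() for n in nodes}
    for u, v in similar_pairs:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def connected_components(graph):
    """Groups whose members are directly or transitively similar."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(group)
    return groups


def maximal_cliques(graph):
    """Groups in which every member is similar to every other (basic Bron-Kerbosch)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(set(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


graph = build_graph("abcde", [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
components = connected_components(graph)
cliques = maximal_cliques(graph)
```

In this example the connected-components strategy puts a, b, c and d into one group, while the clique strategy separates {a, b, c} from {c, d}. Maximal cliques can overlap (here on c); how overlaps are resolved is not stated in the slides.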

SLIDE 19

Method. Grouping
SLIDE 20

Method. Evaluation

Types of quality measures:

  • internal – based on information obtained during the grouping
  • external – with respect to known classes of the grouped data
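
As an example of an external quality measure, purity compares the groups with a known classification. The slides do not say which measures are used, so the choice of purity, the function name, and the sample data are assumptions; the sketch also assumes that the groups do not overlap.

```python
from collections import Counter


def purity(groups, true_class):
    """Fraction of entries that belong to the majority class of their group.

    groups is a list of disjoint sets of entry ids; true_class maps each
    entry id to its class in the known classification.
    """
    total = sum(len(g) for g in groups)
    if total == 0:
        return 0.0
    majority = sum(
        Counter(true_class[e] for e in g).most_common(1)[0][1]
        for g in groups if g
    )
    return majority / total


# Hypothetical grouping result and manual classification.
groups = [{"p1", "p2", "p3"}, {"p4"}]
true_class = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
score = purity(groups, true_class)
```

A purity of 1.0 means every group contains entries of a single class; the measure does not penalize splitting one class over many groups, so it is usually read together with other measures.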
SLIDE 21

Method. Analysis

  • true positives
  • false positives
  • false negatives
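
One way to obtain these counts is to count pairs of data entries: a pair placed in the same group is a true positive if both entries share a class in the known classification and a false positive otherwise, and a same-class pair that is not grouped together is a false negative. The slides do not state how the counts are defined, so this pair-counting scheme and the names below are assumptions.

```python
from itertools import combinations


def pair_counts(groups, true_class):
    """Return (true positives, false positives, false negatives) over entry pairs."""
    grouped = set()
    for g in groups:
        grouped.update(frozenset(p) for p in combinations(g, 2))
    same_class = {
        frozenset((a, b))
        for a, b in combinations(true_class, 2)
        if true_class[a] == true_class[b]
    }
    tp = len(grouped & same_class)
    fp = len(grouped - same_class)
    fn = len(same_class - grouped)
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Precision and recall derived from the pair counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical grouping result and manual classification.
groups = [{"p1", "p2", "p3"}, {"p4"}]
true_class = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
tp, fp, fn = pair_counts(groups, true_class)
precision, recall = precision_recall(tp, fp, fn)
```

Inspecting the individual false-positive and false-negative pairs, rather than only the totals, shows which entries a grouping strategy tends to merge or split incorrectly.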

SLIDE 22

Method. Analysis
SLIDE 23

Method. Analysis

Studied aspects, e.g.

  • use of different data sources, grouping algorithms, and classifications
  • grouping on different attributes
  • impact of threshold

SLIDE 24

Test cases. Observations

  • Best suited grouping approaches for data source Glyc-Funct-AnnEc-onlyGO (DS3)
      • SemSim(GOcomb) for grouping on biological function
      • SeqSim(Sequence) for grouping on classes of isozymes
  • Suitability of mappings for the used grouping approaches
      • spkw2go – too general, e.g. 'Glycolysis'
      • ec2go – specific enough, e.g. '6-phosphofructokinase activity'

SLIDE 25

Summary and future work

  • Motivated the need for environments that support the development and evaluation of similarity-based grouping procedures
  • Proposed a method that identifies the main components and steps that are important for such environments
  • Illustrated the grouping method by test cases based on different strategies and classifications
  • Future work: extend the Kitega implementation