Detecting Outliers with Ensemble of Profile HMMs (Xilin Yu, UIUC)

SLIDE 1

Detecting Outliers with Ensemble of Profile HMMs

Xilin Yu¹

UIUC

December 11, 2018

¹ Under the supervision of Tandy Warnow

Xilin Yu (UIUC) Detecting Outliers Dec 11, 2018 1 / 19

SLIDE 2

Table of Contents

1. Introduction
2. Method Overview
3. Experiments
     Data Sets
     Methods and Parameters
4. Evaluation Results
5. Summary and Future Work

SLIDE 4

Introduction

Problem

Given a set of sequences in which most sequences are homologous to each other, detect the few (say ≤ 5%) outliers.

  • Outlier: a sequence not homologous to the majority of sequences
  • Harm of outliers in a set of sequences: propagation and magnification of error
  • Difficulty: homology is defined in terms of evolutionary history
      - No ground truth
      - The near-ground-truth judgment of a human expert is hard to summarize into a step-by-step algorithm

SLIDE 5

Introduction

Different approaches for different goals:

  • TreeShrink: flags unexpectedly long branches; goal is to decrease gene tree discordance for a better species tree
  • OD-seq: distance metric based on the gappiness of the alignment; goal is to reduce the level of under-alignment
  • EDM: edit distance; goal is to increase the proximity of sequences

SLIDE 7

Method Overview: Ensemble of Profile HMMs

Key Ideas

  • A random sample is unlikely to contain any outlier
  • A profile HMM built on the sample generates outliers with low probability
  • Profile HMMs are more accurate on closely related subsets of sequences: build HMMs on a hierarchy of subsets of the sample
  • Multiple independent runs reduce the chance of missing outliers
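The sampling argument behind the first and last key ideas can be checked with a little arithmetic. The sketch below is illustrative only: it assumes each sequence enters the sample independently with probability p (the slides do not state the exact sampling scheme), and it assumes that one way to miss an outlier is for it to be drawn into the sample in every trial. All function names are hypothetical.

```python
def expected_outliers_in_sample(num_outliers: int, p: float) -> float:
    """Expected number of outliers in the sample (each sequence kept with probability p)."""
    return num_outliers * p


def prob_sample_is_clean(num_outliers: int, p: float) -> float:
    """Probability that no outlier lands in the sample (independent sampling assumed)."""
    return (1.0 - p) ** num_outliers


def prob_outlier_sampled_every_trial(p: float, trials: int) -> float:
    """Probability that one fixed outlier is drawn into the sample in every trial."""
    return p ** trials


if __name__ == "__main__":
    num_outliers = 5  # e.g. about 5% of a family of 100 sequences
    for p in (0.05, 0.1):
        print(p,
              expected_outliers_in_sample(num_outliers, p),
              prob_sample_is_clean(num_outliers, p))
    print(prob_outlier_sampled_every_trial(0.1, 3))
```

Under these assumptions, lowering p makes a clean sample more likely, while adding trials shrinks the chance that a given outlier is sampled every time.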

SLIDE 8

Method Flow Chart

SLIDE 9

Method Flow Chart

Figure: Flowchart of HIPPI

SLIDE 11

Experiments

Experiment Design

  • Seeds of a Pfam protein family: human curated, taken as ground truth
  • Artificial outliers: seeds from other families
  • Evaluation: TP, FN, and FP
      - Precision: TP/(TP+FP)
      - Recall: TP/(TP+FN)
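A minimal sketch of the two evaluation measures, assuming predicted and true outliers are given as sets of sequence identifiers. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example, not something stated in the talk.

```python
def precision_recall(predicted: set, truth: set) -> tuple:
    """Return (precision, recall) for a set of predicted outliers against the true outliers."""
    tp = len(predicted & truth)   # true outliers that were flagged
    fp = len(predicted - truth)   # flagged sequences that are not outliers
    fn = len(truth - predicted)   # true outliers that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for empty predictions
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention for no true outliers
    return precision, recall


if __name__ == "__main__":
    flagged = {"s1", "s2", "s3"}
    injected = {"s2", "s3", "s4", "s5"}
    print(precision_recall(flagged, injected))
```

In the example, two of the three flagged sequences are true outliers and two of the four true outliers are found.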

SLIDE 12

Data Sets

  • Families with at least 100 sequences (around 3500 families)
  • Divided into small (≤ 200 sequences) and large (> 200 sequences) families
  • Divided into families of short, medium, and long sequence length
  • For each family A, randomly choose a family B as the source of outliers (average length of B between 50% and 200% of that of A)
  • Draw sequences uniformly at random from B such that E[number of outliers] = 5% of |A|
  • For each main + outlier family pair, run 3 independent experiments
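The outlier-injection step could be sketched as follows. This is one possible reading, not the author's code: it assumes families are held in a dict from a family identifier to a list of sequence strings, and it reaches the target expectation by keeping each sequence of B independently with probability q = 0.05·|A|/|B|. The names pick_outlier_family and draw_outliers are hypothetical.

```python
import random


def mean_length(seqs):
    """Average sequence length of a family."""
    return sum(len(s) for s in seqs) / len(seqs)


def pick_outlier_family(main_id, families, rng):
    """Randomly choose a family B whose average length is within 50%-200% of family A's."""
    a_len = mean_length(families[main_id])
    candidates = [
        fid for fid, seqs in families.items()
        if fid != main_id and 0.5 * a_len <= mean_length(seqs) <= 2.0 * a_len
    ]
    return rng.choice(candidates)  # raises IndexError if no family qualifies


def draw_outliers(main_seqs, source_seqs, rng, fraction=0.05):
    """Keep each sequence of B with probability q so that E[count] = fraction * |A|."""
    q = min(1.0, fraction * len(main_seqs) / len(source_seqs))
    return [s for s in source_seqs if rng.random() < q]


if __name__ == "__main__":
    rng = random.Random(0)
    families = {
        "A": ["ACDE" * 25] * 100,   # 100 sequences of length 100
        "B": ["ACDE" * 30] * 200,   # 200 sequences of length 120
        "C": ["AC" * 10] * 50,      # too short to qualify
    }
    b_id = pick_outlier_family("A", families, rng)
    outliers = draw_outliers(families["A"], families[b_id], rng)
    print(b_id, len(outliers))
```

With this scheme the number of injected outliers varies between experiments, which is one reason to repeat each family pair several times.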

SLIDE 13

Methods and Parameters

  • Ensemble method: sampling probability = 0.1, number of trials = 3 (expected number of outliers in sample << 1)
      - MAFFT: default
      - FastTree2: default
      - HIPPI: decomposition size = {10, 12, 15, 20}, min p-dist = default
      - Outliers: sequences unmatched by HIPPI

SLIDE 14

Methods and Parameters

  • Edit distance method: sampling probability = 0.05, number of standard deviations ℓ = {1, 2, 3}
      - d(x): average edit distance between the sample and x
      - Outliers: sequences with d(x) at least mean + ℓ · std
  • OD-seq: threshold = {0.001, 0.01, 0.02}
      - Outliers: sequences that are above the threshold in the distribution of distance scores
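The edit distance baseline can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the experiments: edit distance is taken to be plain Levenshtein distance, the mean and standard deviation are computed over d(x) for all sequences, and the population standard deviation is used. The function names are hypothetical.

```python
import random
import statistics


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


def edit_distance_outliers(seqs, sample, num_std):
    """Flag sequences whose average distance to the sample is >= mean + num_std * std."""
    d = [statistics.mean(edit_distance(x, s) for s in sample) for x in seqs]
    cutoff = statistics.mean(d) + num_std * statistics.pstdev(d)  # population std assumed
    return [x for x, dx in zip(seqs, d) if dx >= cutoff]


if __name__ == "__main__":
    rng = random.Random(1)
    seqs = ["ACDEFGHIKL"] * 19 + ["W" * 20]
    # Keep each sequence with probability 0.05; fall back to one sequence if the sample is empty.
    sample = [s for s in seqs if rng.random() < 0.05] or [seqs[0]]
    print(edit_distance_outliers(seqs, sample, num_std=2))
```

Each d(x) costs one quadratic-time alignment per sampled sequence, which is consistent with this method being the slowest of the three. Note that if all d(x) are equal, the cutoff equals the mean and every sequence is flagged.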

SLIDE 16

Evaluation

Method, parameter   Precision   Recall   Running time
Ensemble, 10        0.871       0.955    <= 0.2s
Ensemble, 12        0.855       0.954
Ensemble, 15        0.841       0.948
Ensemble, 20        0.832       0.959
OD-seq, 0.001       0.430       0.995    <= 0.02s
OD-seq, 0.01        0.386       0.997
OD-seq, 0.02        0.371       0.998

Table: Averaged results of the ensemble method and OD-seq on the protein families with 100 - 200 seed sequences and average length 100 - 200

SLIDE 18

Summary

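  • The ensemble method has much higher precision than OD-seq (0.83 to 0.87 versus 0.37 to 0.43).
  • Both have over 90% recall, and OD-seq does slightly better (above 0.99 versus about 0.95).
  • Both the ensemble method and OD-seq are efficient, while the edit distance method is much slower.
  • For the ensemble method, the best decomposition size seems to be 10 (highest precision, 0.871).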

SLIDE 19

Future work

  • Add the edit distance method to the comparison
  • Compare to the method of using one HMM
  • Run on all groups of data
  • Differentiate between in-clan and out-clan outliers
  • Add more evaluation criteria
