Computational Pan-Genomics with Elastic-Degenerate Strings

Computational Pan-Genomics with Elastic-Degenerate Strings (a case study of my research) NADIA PISANTI (this Department)


SLIDE 1

Computational Pan-Genomics with Elastic-Degenerate Strings

(a case study of my research)

NADIA PISANTI

(this Department)

5/12/2019 PhD Day

SLIDE 2

The pan-Genome

Some definitions of pan-genome:

  • ... describes the full complement of genes [...] which can have large variation in gene content among closely related strains [Wikipedia]
  • a collection of genomic sequences to be analyzed jointly or to be used as a reference [The Computational Pan-Genomics Consortium, 2016]

Traditionally, a reference genome is:

  • a genome of a single selected individual, or
  • a consensus drawn from a population, or
  • a "functional" genome, or
  • a maximal genome capturing all ever-detected sequences
  • ...

SLIDE 3

ED-strings

Elastic Degenerate string as a natural representation of a pan-genome

It corresponds to the Variant Call Format (.vcf) standard [e.g. data from the 1000 Genomes project]
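To make the link between variant data and ED-strings concrete, here is a minimal sketch, assuming an ED-string is encoded as a list of segments, each segment being a list of alternative strings (possibly including the empty string). The function names `build_ed_string` and `spelled_sequences`, and the simplified `(start, ref_allele, alt_alleles)` variant tuples, are hypothetical illustrations: this is not a real .vcf parser, and actual VCF records follow different conventions (for example, anchor bases for indels).

```python
from itertools import product


def build_ed_string(reference, variants):
    """Build an ED-string from a reference and a simplified variant list.

    variants: list of (start, ref_allele, alt_alleles) tuples, with 0-based
    start positions, sorted and non-overlapping (an assumption of this sketch).
    """
    segments = []
    cursor = 0
    for start, ref_allele, alt_alleles in variants:
        if start > cursor:
            # Stretch shared by all genomes: a segment with a single string.
            segments.append([reference[cursor:start]])
        # Variant site: the reference allele plus every alternative allele.
        segments.append([ref_allele] + list(alt_alleles))
        cursor = start + len(ref_allele)
    if cursor < len(reference):
        segments.append([reference[cursor:]])
    return segments


def spelled_sequences(ed_string):
    """All plain strings obtained by picking one alternative per segment."""
    return {"".join(choice) for choice in product(*ed_string)}


reference = "ACGTAT"
variants = [(1, "C", ["T"]), (3, "TA", ["", "TAA"])]
ed_string = build_ed_string(reference, variants)
# ed_string == [["A"], ["C", "T"], ["G"], ["TA", "", "TAA"], ["T"]]
print(ed_string)
print(sorted(spelled_sequences(ed_string)))
```

The "elastic" part is visible in the fourth segment, whose alternatives have different lengths (including an empty one for a deletion). Enumerating all spelled sequences is exponential in the number of variant sites and is only meant for tiny examples.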

SLIDE 4

Reference Pan-Genome

  • Cheaper sequencing: re-sequencing became a common task.
  • In genome analysis workflows, downstream of re-sequencing there is the task of mapping reads (a string) on a reference genome (a longer string)

It's PATTERN MATCHING: read is P, reference genome is T
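As a small illustration of read mapping viewed as exact pattern matching, the sketch below reports every position where a read P occurs in a plain (non-degenerate) reference T. The function name `find_occurrences` and the toy strings are assumptions for this example; real read mappers use indexes and tolerate errors.

```python
def find_occurrences(read, reference):
    """Return the 0-based start positions of all occurrences of read in reference."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Restart one position later so overlapping occurrences are found too.
        start = reference.find(read, start + 1)
    return positions


print(find_occurrences("ATA", "CATATATG"))  # [1, 3]
```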

SLIDE 5

EDSM problem

P = CGGGTATA

ELASTIC DEGENERATE STRING MATCHING (EDSM)
Input: a string P of length m, an ED string T of length n and total size N
Output: all positions in T where at least one occurrence of P ends
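The EDSM statement can be made concrete with a minimal sketch. It is a simple left-to-right scan that tracks which prefixes of P are still "alive" at each segment boundary; it is not the algorithm of any of the cited papers. The list-of-lists encoding of the ED string, the name `edsm_naive`, the 0-based segment indices, and the example text are assumptions of this sketch, and P is assumed to be non-empty.

```python
def edsm_naive(pattern, ed_text):
    """Return the 0-based indices of segments where an occurrence of pattern ends.

    ed_text is a list of segments; each segment is a list of alternative
    strings, possibly including the empty string.
    """
    m = len(pattern)
    active = set()  # lengths j (0 < j < m) of pattern prefixes matched so far
    ends = []
    for i, segment in enumerate(ed_text):
        next_active = set()
        found = False
        for s in segment:
            # Case 1: the whole pattern lies inside one alternative.
            if pattern in s:
                found = True
            # Case 2: extend a partial match coming from earlier segments.
            for j in active:
                rest = pattern[j:]
                if len(s) >= len(rest):
                    if s.startswith(rest):
                        found = True  # the occurrence ends inside s
                elif rest.startswith(s):
                    next_active.add(j + len(s))  # s is consumed entirely
            # Case 3: a new partial match starts as a suffix of s.
            for k in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:k]):
                    next_active.add(k)
        if found:
            ends.append(i)
        active = next_active
    return ends


ed_text = [["AC"], ["G", "GG", ""], ["GGTAT"], ["A", "C"]]
print(edsm_naive("CGGGTATA", ed_text))  # [3]
```

In the example, the only occurrence takes the suffix "C" of the first segment, then "G", then "GGTAT", and ends with "A" in the last segment.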

SLIDE 6

Lower bounds & upper bounds [ICALP 2019]

In [CPM 2017] we solved EDSM in O(N + n*m^2) time
In [CPM 2018] they solve it in O(N + n*m^1.5 √(log m)) time
Can EDSM be improved further?
In [ICALP 2019] we solve EDSM in O(N + n*m^1.381) time ... with an algebraic method!
We show one can’t do better with combinatorial methods

SLIDE 7

Pattern Matching on ED-string with errors [SPIRE 2017]

Reads carry sequencing errors: how can we represent them?

Hamming Distance: Given two strings X and Y on the same alphabet and having the same length, the Hamming Distance dH(X,Y) between X and Y is the number of positions in which they differ.

X = CGGGTATA
Y = CAGGCATA
dH(X,Y)=2

Edit Distance: Given two strings X and Y on the same alphabet, the Edit Distance dE(X,Y) is the minimum number of substitutions, insertions, or deletions of a letter needed to transform X into Y (or vice versa, as dE(X,Y)=dE(Y,X)).

X = CGGGTAT-A
Y = CCGG-ATTA
dE(X,Y)=3
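The two distances can be checked on the slide's examples with a short sketch. The function names are assumptions; the edit distance uses the standard dynamic-programming recurrence over prefixes, which is a textbook method and not the approach of the [SPIRE 2017] paper.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x, y):
    """Minimum number of substitutions, insertions and deletions turning x into y."""
    # previous[j] holds the distance between the current prefix of x and y[:j].
    previous = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        current = [i]
        for j, b in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,             # delete a from x
                current[j - 1] + 1,          # insert b into x
                previous[j - 1] + (a != b),  # match or substitute
            ))
        previous = current
    return previous[-1]


print(hamming_distance("CGGGTATA", "CAGGCATA"))  # 2
print(edit_distance("CGGGTATA", "CCGGATTA"))     # 3
```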

SLIDE 8

Degenerate Strings Comparison

STRING COMPARISON among (E)D-strings is a basic tool for many other problems:

Are two degenerate strings the same? Or similar?
Or share sub-(E)D-strings? Motifs?
Is one (E)D-string a substring of another (E)D-string? A Reverse? A Palindrome?

SLIDE 9

Degenerate Strings Comparison: our result [WABI 2018]

A definition of a match among D-strings

(a step into formal languages and automata problems)

A linear (O(N+M)) algorithm to tell whether two D-strings X (of size N) and Y (of size M) do match

(“accidentally” solving an open formal languages and automata problem)

An application of such D-strings comparison to the design of two algorithms to decompose a D-string into palindromes (a proof-of-concept on real RNA data)
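The slide does not spell out the definition of a match, so the sketch below rests on an assumption: two D-strings are taken to match when at least one plain string can be spelled by both. Under that assumption, a brute-force check simply enumerates both sets of spelled strings. This takes exponential time and is not the linear O(N+M) algorithm of [WABI 2018]; the names `language` and `d_strings_match` and the list-of-lists encoding are hypothetical.

```python
from itertools import product


def language(d_string):
    """All plain strings spelled by picking one alternative per segment."""
    return {"".join(choice) for choice in product(*d_string)}


def d_strings_match(x, y):
    """Assumed definition: x and y match if they spell a common string."""
    return not language(x).isdisjoint(language(y))


x = [["A", "C"], ["GT"]]          # spells AGT, CGT
y = [["AG", "TT"], ["T", "A"]]    # spells AGT, AGA, TTT, TTA
z = [["TT"], ["T"]]               # spells TTT

print(d_strings_match(x, y))  # True, both spell AGT
print(d_strings_match(x, z))  # False
```

Note that x and y match even though their segment boundaries fall at different places, which is what makes the comparison non-trivial.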

SLIDE 10

References

The Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics 19(1): 118-135 (2018)

R. Grossi, C.S. Iliopoulos, C. Liu, N. Pisanti, S.P. Pissis, A. Retha, G. Rosone, F. Vayani, L. Versari: On-Line Pattern Matching on Similar Texts. CPM 2017: 9:1-9:14

G. Bernardini, N. Pisanti, S.P. Pissis, G. Rosone: Pattern Matching on Elastic-Degenerate Text with Errors. SPIRE 2017: 74-90 [extended version in press in Theoretical Computer Science journal]

M. Alzamel, L.A.K. Ayad, G. Bernardini, R. Grossi, C.S. Iliopoulos, N. Pisanti, S.P. Pissis, G. Rosone: Degenerate String Comparison and Applications. WABI 2018: 21:1-21:14

G. Bernardini, P. Gawrychowski, N. Pisanti, S.P. Pissis, G. Rosone: Even Faster Elastic-Degenerate String Matching via Fast Matrix Multiplication. ICALP 2019: 21:1-21:15