Computational Pan-Genomics with Elastic-Degenerate Strings
(a case study of my research)
NADIA PISANTI
(this Department)
5/12/2019 PhD Day 1
Some definitions of pan-genome
content among closely related strains
[Wikipedia]
[The Computational Pan-Genomics Consortium, 2016]
Traditionally, a reference genome is:
detected sequences
Elastic-degenerate string as a natural representation of a pan-genome
It corresponds to the Variant Call Format (.vcf) standard [e.g. data from the 1000 Genomes Project]
Cheaper sequencing: re-sequencing became a common task.
In genome analysis workflows, downstream of re-sequencing there is the task of mapping reads (a string) on a reference genome (a longer string).
It's PATTERN MATCHING: the read is P, the reference genome is T.
P = CGGGTATA
ELASTIC DEGENERATE STRING MATCHING (EDSM)
Input: a string P of length m, an ED string T of length n and total size N
Output: all positions in T where at least one occurrence of P ends
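The EDSM problem statement can be illustrated with a naive Python sketch. This is not the ICALP 2019 algorithm (which relies on fast matrix multiplication); it is only a direct reading of the input/output specification. It assumes the ED string is given as a list of segments, each segment being a set of strings (possibly including the empty string), that "positions" are 0-based segment indices, and that P is non-empty. The function name edsm_naive is hypothetical.

```python
def edsm_naive(pattern, ed_text):
    """Return the 0-based segment indices of ed_text where an occurrence of pattern ends.

    ed_text is assumed to be a list of sets of strings (one set per segment).
    """
    m = len(pattern)
    ends = []
    # active holds lengths k (0 < k < m) such that pattern[:k] ends exactly
    # at the end of the previous segment for some choice of strings.
    active = set()
    for i, segment in enumerate(ed_text):
        found = False
        new_active = set()
        for s in segment:
            # Case 1: the whole pattern lies inside a single string.
            if pattern in s:
                found = True
            # Case 2: extend a partial occurrence carried over from earlier segments.
            for k in active:
                rest = pattern[k:]
                if len(s) < len(rest):
                    # s is consumed entirely; the occurrence is still partial.
                    if rest.startswith(s):
                        new_active.add(k + len(s))
                elif s.startswith(rest):
                    # The occurrence is completed inside (or at the end of) s.
                    found = True
            # Case 3: a new partial occurrence starts inside s and reaches its end.
            for length in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:length]):
                    new_active.add(length)
        if found:
            ends.append(i)
        active = new_active
    return ends


# Hypothetical ED text in which C + GG + GT + ATA spells P = CGGGTATA.
example_text = [{"AC", "C"}, {"GG", "G"}, {"GT"}, {"ATA", "CTA"}, {"C"}]
print(edsm_naive("CGGGTATA", example_text))
```

On this hypothetical text the only occurrence of P ends in the fourth segment (index 3). The sketch takes time proportional to m times N in the worst case per active prefix set, so it is meant for checking small examples rather than for genome-scale data.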
[ICALP 2019]
Reads carry sequencing errors: how can we represent them?
Hamming Distance: Given two strings X and Y on the same alphabet and having the same length, the Hamming distance dH(X,Y) between X and Y is the number of positions in which they differ.
X = CGGGTATA
Y = CAGGCATA
dH(X,Y) = 2
Edit Distance: Given two strings X and Y on the same alphabet, the edit distance dE(X,Y) is the minimum number of substitutions, insertions, or deletions of letters needed to transform X into Y (or vice versa, as dE(X,Y) = dE(Y,X)).
X = CGGGTAT-A
Y = CCGG-ATTA
dE(X,Y) = 3
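The two distances defined above can be checked on the slide examples with a short Python sketch. The function names hamming_distance and edit_distance are illustrative, and the edit distance uses the textbook dynamic-programming recurrence with unit costs, which is an assumption about the cost model rather than something stated on the slide.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x, y):
    """Minimum number of unit-cost substitutions, insertions and deletions turning x into y."""
    # prev[j] is the distance between the prefix of x processed so far and y[:j].
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, b in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a from x
                curr[j - 1] + 1,         # insert b into x
                prev[j - 1] + (a != b),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


print(hamming_distance("CGGGTATA", "CAGGCATA"))
print(edit_distance("CGGGTATA", "CCGGATTA"))
```

With the gap symbols removed from the aligned strings, the slide examples give 2 for the Hamming distance and 3 for the edit distance, matching the values stated above.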
(a step into formal languages and automata problems)
(“accidentally” solving an open formal languages and automata problem)
The Computational Pan-Genomics Consortium: Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics 19(1): 118-135 (2018)
R. Grossi, C.S. Iliopoulos, C. Liu, N. Pisanti, S.P. Pissis, A. Retha, G. Rosone, F. Vayani, L. Versari: On-Line Pattern Matching on Similar Texts. CPM 2017: 9:1-9:14
G. Bernardini, N. Pisanti, S.P. Pissis, G. Rosone: Pattern Matching on Elastic-Degenerate Text with Errors. SPIRE 2017: 74-90 [extended version in press in Theoretical Computer Science journal]
M. Alzamel, L.A.K. Ayad, G. Bernardini, R. Grossi, C.S. Iliopoulos, N. Pisanti, S.P. Pissis, G. Rosone: Degenerate String Comparison and Applications. WABI 2018: 21:1-21:14
G. Bernardini, P. Gawrychowski, N. Pisanti, S.P. Pissis, G. Rosone: Even Faster Elastic-Degenerate String Matching via Fast Matrix Multiplication. ICALP 2019: 21:1-21:15