Space-efficient construction of succinct de Bruijn graphs




  1. Space-efficient construction of succinct de Bruijn graphs
     Felipe A. Louza, University of São Paulo, Brazil
     Joint work with Lavinia Egidi and Giovanni Manzini.
     LSD/LAW, London, 6-7 Feb. 2019

  2. Outline
     1. Introduction
     2. BOSS construction
     3. Merging dBGs
     4. Space-efficient BOSS construction
     5. References

  3. de Bruijn graphs (dBGs)
     Definitions:
     ◮ Given a collection of strings S, a de Bruijn graph of order k is a directed graph containing:
       ◮ a node v for every unique k-mer v[1] ... v[k] in S;
       ◮ an edge (u, v) with label v[k] if there is a (k+1)-mer u[1] ... u[k] v[k] in S.
     Example:
     ◮ S = { TACACT, TACTCA, GACTCG }
     [Figure: the de Bruijn graph of S for k = 3, with nodes TAC, ACA, CAC, ACT, CTC, TCA, GAC and TCG.]

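The definition above can be tried out on the example collection. The following is a minimal sketch, not the authors' code: the function name `de_bruijn_graph` and the representation of edges as `(u, v, label)` triples are assumptions made for illustration.

```python
def de_bruijn_graph(strings, k):
    """Return (nodes, edges) of the order-k de Bruijn graph of `strings`.

    nodes: set of k-mers occurring in the strings.
    edges: set of (u, v, label) triples, one per distinct (k+1)-mer
           u[1..k] + label, where v = u[2..k] + label.
    """
    nodes, edges = set(), set()
    for s in strings:
        # every k-mer is a node
        for i in range(len(s) - k + 1):
            nodes.add(s[i:i + k])
        # every (k+1)-mer is an edge labeled with its last symbol
        for i in range(len(s) - k):
            kp1 = s[i:i + k + 1]
            edges.add((kp1[:k], kp1[1:], kp1[-1]))
    return nodes, edges


nodes, edges = de_bruijn_graph(["TACACT", "TACTCA", "GACTCG"], k=3)
print(sorted(nodes))
print(sorted(edges))
```

For the example S with k = 3 this yields the eight nodes shown in the figure. Using sets means repeated k-mers and (k+1)-mers collapse into a single node or edge, which is what "unique k-mer" in the definition asks for.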

  6. Succinct representation of dBGs: BOSS*
     ◮ [Bowe et al., WABI 2012] introduced a succinct representation of dBGs that uses O(|E| log σ) bits of space.
     ◮ BOSS representation:
       ◮ The outgoing edges of each node v_i are encoded in the substring W_i = v_j[k] ... v_k[k], e.g. W_i = AT, W_j = AG.
       ◮ The substrings W_i are concatenated following the order of the reversed labels ←v_i = v_i[k] ... v_i[1], e.g. ←TAC ≺ ←CTC.
     Example:
     ◮ S = { TACACT, TACTCA, GACTCG }
     [Figure: the de Bruijn graph of S, including the nodes $$$, $$T, $TA, $$G and $GA and edges labeled $.]
     * for the authors' initials

  7. Succinct representation of dBGs: BOSS (continued)
     Example:
     ◮ S = { $$$TACACT, $$$TACTCA, $$$GACTCG }
     For convenience, we add k copies of a symbol $ at the beginning of each string s_i.

  8. Succinct representation of dBGs: BOSS (continued)
     The label of every node can be recovered.

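The grouping of outgoing edge labels into the strings W_i, ordered by reversed node label, can be sketched as follows. This is an illustrative sketch under assumptions, not the construction algorithm of the talk: the name `boss_edge_strings` is hypothetical, the whole graph is held in a dictionary (so it is not space-efficient), and nodes without outgoing edges are given the label $, as the $-labeled edges in the slides' figure suggest.

```python
def boss_edge_strings(strings, k, pad="$"):
    """Return [(node, W_node)] with nodes ordered by reversed label."""
    out = {}
    for s in strings:
        t = pad * k + s  # prepend k copies of the pad symbol
        for i in range(len(t) - k):
            node, label = t[i:i + k], t[i + k]
            out.setdefault(node, set()).add(label)
        # the last k-mer of a string may have no outgoing edge
        out.setdefault(t[-k:], set())
    # '$' sorts before A, C, G, T in ASCII, as the ordering requires
    order = sorted(out, key=lambda v: v[::-1])
    return [(v, "".join(sorted(out[v])) or pad) for v in order]


table = boss_edge_strings(["TACACT", "TACTCA", "GACTCG"], k=3)
for node, w in table:
    print(node, w)
```

On the example, TAC gets W = AT and CTC gets W = AG, and TAC precedes CTC because its reversed label CAT is smaller than CTC.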

  11. Succinct representation of dBGs: BOSS
      ◮ Nodes v_i = v_i[1] ... v_i[k] are sorted by their reversed labels ←v_i = v_i[k] ... v_i[1].
      ◮ We mark the position of the last outgoing edge of each node.
      ◮ Among the incoming edges of a node, which all carry the same label, we mark every edge except the first as negative (−).

      last  Nodes  W
      0     $$$    G
      1     $$$    T
      1     ACA    C
      1     TCA    $
      1     $GA    C
      1     $TA    C
      1     CAC    T
      1     GAC    T-
      0     TAC    A
      1     TAC    T-
      0     CTC    A
      1     CTC    G
      1     $$G    A
      1     TCG    $
      1     $$T    A
      1     ACT    C

      [Figure: the de Bruijn graph of S, including the nodes $$$, $$T, $TA, $$G and $GA.]

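The three rules above (reversed-label order, last-edge marks, negative marks) are enough to rebuild the last/Nodes/W table for the example. The sketch below is an assumption-laden illustration rather than the talk's algorithm: `boss_table` is a hypothetical name, the graph is built naively in memory, and nodes without outgoing edges receive a single unflagged $ edge.

```python
def boss_table(strings, k, pad="$"):
    """Return rows (last, node, w) of a BOSS-style table."""
    out = {}
    for s in strings:
        t = pad * k + s  # k pad symbols in front of each string
        for i in range(len(t) - k):
            out.setdefault(t[i:i + k], set()).add(t[i + k])
        out.setdefault(t[-k:], set())  # possible sink node
    rows, seen_targets = [], set()
    for node in sorted(out, key=lambda v: v[::-1]):
        labels = sorted(out[node]) or [pad]
        for j, c in enumerate(labels):
            target = node[1:] + c
            # flag every edge into an already-reached node, except pad edges
            flag = "-" if c != pad and target in seen_targets else ""
            seen_targets.add(target)
            last = int(j == len(labels) - 1)  # 1 on the node's last edge
            rows.append((last, node, c + flag))
    return rows


rows = boss_table(["TACACT", "TACTCA", "GACTCG"], k=3)
for last, node, w in rows:
    print(last, node, w)
```

For the example this prints 16 rows; the edges GAC → ACT and TAC → ACT come out as T- because CAC → ACT is reached first in the sorted order.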

  14. Succinct representation of dBGs: BOSS
      ◮ LF-mapping between the positive symbols in W and the symbols Nodes[k] (with last = 1).
      ◮ Fast navigation operations: Outdegree, Outgoing, Indegree and Incoming.
      ◮ Small space: O(m log σ) + m + o(m) bits for rank and select operations.
      Similar to the BWT and XBW.

  15. Succinct representation of dBGs: BOSS (continued)
      [Figure annotation: Select operation.]
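To make the LF-mapping and two of the navigation operations concrete, here is a rough sketch on the example table. It is not the succinct implementation: the arrays `LAST` and `W` are transcribed from the slides' table, `FIRST` (the number of nodes whose last symbol is smaller than c) and all function names are assumptions for illustration, and rank and select are replaced by naive linear scans instead of o(m)-bit structures.

```python
NODES = ["$$$", "ACA", "TCA", "$GA", "$TA", "CAC", "GAC",
         "TAC", "CTC", "$$G", "TCG", "$$T", "ACT"]  # for printing only
LAST = [0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1]
W = ["G", "T", "C", "$", "C", "C", "T", "T-",
     "A", "T-", "A", "G", "A", "$", "A", "C"]
# FIRST[c]: number of nodes whose last symbol is smaller than c
FIRST = {"$": 0, "A": 1, "C": 5, "G": 9, "T": 11}


def edge_range(v):
    """Half-open range [lo, hi) of rows of W that belong to node index v."""
    ones = [i for i, b in enumerate(LAST) if b == 1]  # naive select on last
    lo = ones[v - 1] + 1 if v > 0 else 0
    return lo, ones[v] + 1


def outdegree(v):
    """Number of outgoing edges of node v ($ rows mark sinks, not edges)."""
    lo, hi = edge_range(v)
    return sum(1 for i in range(lo, hi) if W[i] != "$")


def outgoing(v, c):
    """Index of the node reached from v by label c, or None."""
    if c == "$":
        return None
    lo, hi = edge_range(v)
    for i in range(lo, hi):
        if W[i].rstrip("-") == c:
            # naive rank: positive occurrences of c in W[0..i]
            r = sum(1 for x in W[:i + 1] if x == c)
            # LF-mapping: r-th positive c <-> r-th node ending in c
            return FIRST[c] + r - 1
    return None


print(outdegree(7), NODES[outgoing(7, "T")])
```

A negative edge resolves to the same node as the closest preceding positive edge with that label, which is why counting only positive symbols in the rank step is sufficient in this sketch.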
