PETER Fast similarity searches and similarity joins in Oracle DB - PowerPoint PPT Presentation

PETER Fast similarity searches and similarity joins in Oracle DB Astrid Rheinländer, Ulf Leser Humboldt-Universität zu Berlin Department of Computer Science Knowledge Management in Bioinformatics

Motivation Approximately searching DNA sequences is important many fields of ● modern genomics 1,160,000 citations of BLAST in Google Scholar ● ESTs (Expressed Sequence Tags) are small portions ● of DNA Find (near-)duplicates ● Find homolog sequences ● → Similarity based search and join algorithms needed dbEST strings contains more than 60 million records ● → efficient execution is crucial Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 2 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Our contribution - PETER Prefix tree based index structure ● Combines many tricks for query speedup ● Exact and similarity based search and join queries ● Similarity measures: Hamming and Edit distance ● Evaluated on real data from dbEST [NCBI, online, 1992] ● Real software ● standalone Unix command line tool ● Plugin for Oracle DB ● Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 3 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Outline ● PETER-Index ● Pruning strategies ● Integration to Oracle DB ● Results + Conclusion Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 4 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Idea of PETER Idea: manage ESTs according to shared prefixes → prefix tree index [Shang et al, IEEE TKDE, 1996] strings with common prefixes occur in one subtree ● Nodes are labeled with some substring of the EST ● any two children of some node start with different characters ● Goal: minimize query response time → exclude whole subtrees early from search space → add information on length and character frequency for pruning Operations are based on DFS traversal Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 5 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Example: Search Hamming distance search for p = CTGAAATTGGT, k=1 d(C,A)=1 Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 6 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Example: Search Hamming distance search for p = CTGAAATTGGT, k=1 d(CT..,AC..) > 1 d(CT..,AA..) > 1 Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 7 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Example: Search Hamming distance search for p = CTGAAATTGGT, k=1 d(CTGA,CTGA)=0 Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 8 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Example: Search Hamming distance search for p = CTGAAATTGGT, k=1 d(CTGAAATTG..., CTGAATTT...) > 1 Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 9 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Example: Search Hamming distance search for p = CTGAAATTGGT, k=1 d(CTGAAATTGGT, CTGAGATTGGT)= 1 Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 10 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Example: Search Hamming distance search for p = CTGAAATTGGT, k=1 Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 11 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Filter by string length Assumptions Hamming distance ● two strings p,t are only worth examining if they are of equal length ● Edit distance ● p and t are worth examining only if |t|−|p| ≤ k. ● Idea ● include min/max string lengths in the index ● at each node x, we know the length of the longest and shortest string that starts with ● prefix t[1..x] → E.g., Hamming distance: ∨ if (|max(x)| < |p|) (|min(x)| > |p|) then prune subtree x Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 12 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Filter by character frequency Basically evaluates the character frequencies in the strings ● Gives a lower bound for Hamming and edit distance ● [Aghili et al., String processing and Information Retrieval, 2003] Not very effective due to small alphabet size ● Details omitted ● Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 13 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Filter by q-grams • No. of mismatching q-grams gives lower bound to edit distance [Xiao et al., VLDB, 2008] • computing q-grams is on avg. cheaper than computing edit distance directly • used for suffix pre-selection • Computed on the fly when EST node is reached • ≥ 90% of our strings have long, unique suffixes • edit distance will probably exceed threshold in the suffix part • Not evaluated for Hamming distance queries Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 14 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Lessons learned runtime improvements by filtering increases with threshold → we achieve a speed up of query response time up to 80 % Best configuration for PETER: → combination of length and q-gram filter Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 15 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Integration into Oracle DB ● integrated as shared library using ODCI in Oracle Express 10g ● provides functionality for user-defined indexes and table-functions ● ● two steps compile code as shared library and save it in $ORA_HOME/bin ● PL/SQL scripts that defines and registers the index and functions in the DBS ● straight-forward ● Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 16 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Execution in Oracle ● first access in a session PL/SQL locates PETER via Data dictionary ● listener invokes extproc and passes procedure parameters ● – extproc remains alive during session – initialization costs emerge only once Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 17 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Execution in Oracle ● extproc loads PETER-lib and invokes function calls ● → opens index and performs desired operation inside PETER return values are passed back via extproc ● Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 18 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Results - Setup Competitors for similarity operations: ● Unix tools agrep, nrgrep and Flamingo library ● build-in and UDFs inside the Oracle ● EST sets extracted from dbEST ● Index creation and optimization was done in advance, not included in the measured times ● time for index creation grows linear with the number of indexed strings ● Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 19 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

Results - Similarity Search Agrep: bounded with ● pattern length of 32 chars PETER vs. Unix tools, search of patterns from T3 in indexed set T1 nrgrep: arbitrary patterns ● allowed → PETER is orders of magnitudes faster in both cases threshold Comparison is a bit unfair ● Agrep and nrgrep do not build an index ● Indexing amortizes fast ● Takes 15 searches to outperform agrep, 105 searches for nrgrep ● → PETER more suited for searching the same set of strings multiple times Similarity Search and Join on Genomic Data Based on Prefix Tree Indexing in RDBMS Fast Similarity Searches and Similarity Joins in Oracle DB 20 A.Rheinländer, M.Knobloch, N.Hochmuth, U.Leser A. Rheinländer, U. Leser

PETER Fast similarity searches and similarity joins in Oracle DB - PowerPoint PPT Presentation

PETER Fast similarity searches and similarity joins in Oracle DB Astrid Rheinlnder, Ulf Leser Humboldt-Universitt zu Berlin Department of Computer Science Knowledge Management in Bioinformatics Motivation Approximately searching DNA

L0ng-SUffeRing 1 Peter 3:8 1 Peter 3:8- -12 12 1 Peter 3:8 1 Peter 3:8 - - 12 12 MOVIE

Autumn Sermon Series in 1 Peter 1 Peter 3.13-22 Revd Matthew Dietz 1 Peter 3.13-22 What is

2 Peter Series Lesson #001 May 2, 2019 Dean Bible Ministries www.deanbibleministries.org Dr.

The Original Peter Rabbit Presentation Box 1 23 R I The Original Peter Rabbit Presentation Box 1

Peter Thorpe Peter Thorpe We would like you to be able to paint a space themed picture in the

distress Peter Gotham Gotham Consulting Ltd peter.gotham@gmail.com Who am I? Peter Gotham

Drupal Core Auto-Update Drupal Core Auto-Update Architecture Architecture Peter Wolanin Peter

2 Peter Series Lesson #033 October 22, 2020 Dean Bible Ministries www.deanbibleministries.org

2 Peter Series Lesson #034 November 5, 2020 Dean Bible Ministries www.deanbibleministries.org

LISA-QEMU Peter Puhov and Rob Foley INTRODUCTIONS Peter Puhov Chief Architect @

The Bio-Screening Core Facility PETER BANKS PETER.BANKS@NCL.AC.UK CARMEN MARTIN-RUIZ

SharePoint 2010 Intranet Case Study Presented by Peter Carson President, Envision IT Peter

Customs and International Trade Law By Peter Quinter www.gray-robinson.com Peter Quinter,

ROI stands for Rethink OTT Intelligence Peter White, Principal Analyst and Founder, Rethink

Kim Verbon Peter Hofstede @spilgames Spil Games @hofstede Peter Hofstede @digimini Kim

1. Everyone can and should experience spiritual growth. 2 Peter 1:1 Simon Peter, a slave and

Autoreferat w jzyku angielskim Za cznik 3 do wniosku o przeprowadzenie post powania

Promote Equity Prevent Violence Karen Sheehan, MD, MPH Medical Director Lurie Childrens

South Deep, A world class ore body in Perspective Presentation transcript 15 April 2014 South

OUC-China lasmid dventurer OUTLINE Project Overview Conjugation Transfection Model Policy

Improving Success for Diverse Students in STEM Prof. Michael A. McKibben Divisional Dean of

Different kinds of robustness in genetic and metabolic networks Peter Schuster Institut fr

Investor Presentation - October 2018 Forward Looking Statements This presentation contains

OPCW SAB TWG OPCW SAB TWG OPCW SAB TWG OPCW SAB TWG Convergence in Chemistry and Biology