
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 1, NO. 4, OCTOBER-DECEMBER 2004 159

An $O(N^2)$ Algorithm for Discovering Optimal Boolean Pattern Pairs

Hideo Bannai, Heikki Hyyrö, Ayumi Shinohara, Masayuki Takeda, Kenta Nakai, and Satoru Miyano

Abstract—We consider the problem of finding the optimal combination of string patterns, which characterizes a given set of strings that have a numeric attribute value assigned to each string. Pattern combinations are scored based on the correlation between their occurrences in the strings and the numeric attribute values. The aim is to find the combination of patterns which is best with respect to an appropriate scoring function. We present an $O(N^2)$ time algorithm for finding the optimal pair of substring patterns combined with Boolean functions, where $N$ is the total length of the sequences. The algorithm looks for all possible Boolean combinations of the patterns, e.g., patterns of the form $p \wedge \neg q$, which indicates that the pattern pair is considered to occur in a given string $s$, if $p$ occurs in $s$, AND $q$ does NOT occur in $s$. An efficient implementation using suffix arrays is presented, and we further show that the algorithm can be adapted to find the best $k$-pattern Boolean combination in $O(N^k)$ time. The algorithm is applied to mRNA sequence data sets of moderate size combined with their turnover rates for the purpose of finding regulatory elements that cooperate, complement, or compete with each other in enhancing and/or silencing mRNA decay.

Index Terms—Pattern discovery, Boolean patterns, suffix tree, suffix array.

1 INTRODUCTION

ALTHOUGH recent genome sequencing projects have revealed the whole DNA sequence of several organisms, there is still much that is unknown concerning what and how the information is encoded in these blueprints of life. Pattern discovery from such biological sequences is thus an important topic in bioinformatics that has been studied heavily with numerous variations and applications (see [1] for a survey on earlier work). To extract meaning from biological sequences, the general goal of these methods is to find patterns which are conserved across a set of biologically related sequences. The existence of such sequence elements suggests that those elements are central to the functions and characteristics of the sequence set. Computational analyses which provide such candidates can be a very helpful guide for biologists in the task of experimentally confirming the actual sequence elements in play, as well as their functions.

Although finding the most significant sequence element conserved across multiple sequences has important applications, it is known that more than one sequence element will affect the biological characteristics of the sequences in many actual cases. There are several methods which address this observation, focusing on finding composite patterns. In [2], they develop a suffix tree-based approach for discovering structured motifs, which are two or more patterns separated by a certain distance, similar to text associative patterns [3]. MITRA [4] is another method that looks for composite patterns using mismatch trees. Bioprospector [5] applies the Gibbs sampling strategy to find gapped motifs. Multiple unordered motifs are considered in [6].

In this paper, we assume that we are given a set of sequences that have numeric attribute values associated with each sequence as input. We present a new formulation of composite pattern discovery where the problem is to find pairs of patterns combined with any Boolean function. The main contribution is an $O(N^2)$ algorithm (where $N$ is the total length of the input strings) and implementation based on suffix arrays, for finding the optimal Boolean substring pattern pair with respect to some suitable scoring function.

Note that the methods mentioned above for finding composite patterns can be viewed as being limited to finding pattern pairs which use only the $\wedge$ (AND) operation (with an extra distance constraint in the case of gapped motifs). In other words, the algorithms find combinations of two patterns $p$, $q$ where both $p$ AND $q$ occur in each string. The use of any Boolean function permits the use of the $\neg$ (NOT) operation, allowing combinations such as $p \wedge \neg q$. This makes it possible to find not only sequence elements that cooperate with each other, but those with competing functions, i.e., not only the presence of one element, but the absence of the other is crucial for their functions. The pattern pairs discovered by our algorithm are optimal in that they are guaranteed to be the highest scoring pair of substring patterns with respect to a given scoring function and, also, a limit on the lengths of the patterns in the pair is not assumed. Our algorithm can be adjusted to handle several common problem formulations of pattern discovery, for example, pattern discovery from positive and negative

• H. Bannai, K. Nakai, and S. Miyano are with the Human Genome Center, Institute of Medical Science, The University of Tokyo, 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan. E-mail: {bannai, knakai, miyano}@ims.u-tokyo.ac.jp.
• H. Hyyrö is with PRESTO, Japan Science and Technology Agency (JST), Kawaguchi-shi, Saitama, Japan. E-mail: heikki.hyyro@gmail.com.
• A. Shinohara is with PRESTO, Japan Science and Technology Agency (JST) and the Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan. E-mail: ayumi@i.kyushu-u.ac.jp.
• M. Takeda is with SORST, Japan Science and Technology Agency (JST) and the Department of Informatics, Graduate School of Information Science and Electrical Engineering, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, Fukuoka 812-8581, Japan. E-mail: takeda@i.kyushu-u.ac.jp.

Manuscript received 3 Oct. 2004; revised 3 Dec. 2004; accepted 14 Dec. 2004.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBBSI-0163-1004.

1536-1233/04/$20.00 © 2004 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM
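
To make the notion of a Boolean pattern pair concrete, the following is a minimal brute-force sketch in Python. It is not the suffix array-based $O(N^2)$ algorithm of the paper: it simply enumerates every pair of substrings and tests each against every string. The names `BOOLEAN_FUNCTIONS`, `all_substrings`, `score`, and `best_pattern_pair` are hypothetical, only four of the possible Boolean functions are listed, and the scoring function (absolute difference between the mean attribute values of matched and unmatched strings) is an assumption chosen for illustration, since the text above leaves the scoring function open.

```python
from itertools import product
from statistics import mean

# A few illustrative Boolean combinations of "p occurs" (a) and "q occurs" (b).
# Assumption: the full method considers all Boolean functions; this list is partial.
BOOLEAN_FUNCTIONS = {
    "p AND q": lambda a, b: a and b,
    "p AND NOT q": lambda a, b: a and not b,
    "NOT p AND q": lambda a, b: (not a) and b,
    "p OR q": lambda a, b: a or b,
}


def all_substrings(strings):
    """Return the set of all non-empty substrings of the input strings."""
    return {
        s[i:j]
        for s in strings
        for i in range(len(s))
        for j in range(i + 1, len(s) + 1)
    }


def score(matched, values):
    """Hypothetical score: |mean(matched values) - mean(unmatched values)|."""
    inside = [v for m, v in zip(matched, values) if m]
    outside = [v for m, v in zip(matched, values) if not m]
    if not inside or not outside:
        return 0.0  # a pattern pair that does not split the set carries no signal
    return abs(mean(inside) - mean(outside))


def best_pattern_pair(strings, values):
    """Exhaustively search substring pairs; far slower than the paper's algorithm."""
    # Precompute, for each candidate pattern, which strings it occurs in.
    occ = {p: [p in s for s in strings] for p in all_substrings(strings)}
    best_score, best_pair = -1.0, None
    for (p, occ_p), (q, occ_q) in product(occ.items(), repeat=2):
        for name, f in BOOLEAN_FUNCTIONS.items():
            matched = [f(a, b) for a, b in zip(occ_p, occ_q)]
            current = score(matched, values)
            if current > best_score:
                best_score, best_pair = current, (p, q, name)
    return best_score, best_pair
```

For example, with the toy strings `["AUUUA", "AUUUAGG", "CCGG", "CC"]` and attribute values `[9.0, 1.0, 1.0, 1.0]`, no single substring isolates the first string, because every substring of `"AUUUA"` also occurs in `"AUUUAGG"`. A pair such as `"A"` AND NOT `"G"` does isolate it, which is the kind of competing-element relationship the NOT operation is meant to capture.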
