The Dynamic Bloom Filters
Deke Guo, Member, IEEE, Jie Wu, Fellow, IEEE, Honghui Chen, Ye Yuan, and Xueshan Luo
Abstract—A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants focus only on how to represent a static set and decrease the false positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets as well as static sets, and design the necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter can keep the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter, due to infrequent reconstruction, when addressing dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in stand-alone applications as well as distributed applications.
Index Terms—Bloom filters, dynamic Bloom filters, information representation.
1 INTRODUCTION
INFORMATION representation and processing of membership queries are two associated issues that encompass the core problems in many computer applications. Representation means organizing information based on a given format and mechanism such that the information is operable by a corresponding method. The processing of membership queries involves making decisions based on whether an item with a specific attribute value belongs to a given set. A standard Bloom filter (SBF) is a space-efficient data structure for representing a set and answering membership queries within a constant delay [1]. The space efficiency is achieved at the cost of false positives in membership queries, and for many applications, the space savings outweigh this drawback when the probability of an error is sufficiently low.
The SBF has been extensively used in many database applications [2], for example, the Bloom join [3]. Recently, it has started receiving more widespread attention in the networking literature [4]. An SBF can be used as a summarizing technique to aid global collaboration in peer-to-peer (P2P) networks [5], [6], [7], support probabilistic algorithms for routing and locating resources [8], [9], [10], [11], and share Web cache information [12]. In addition, SBFs have great potential for representing a set in main memory [13] in stand-alone applications. For example, SBFs have been used to provide a probabilistic approach for explicit state model checking of finite-state transition systems [13], to summarize the contents of stream data in memory [14], [15], to store the states of flows in the on-chip memory at networking devices [16], and to store the statistical values of tokens to speed up statistical-based Bayesian filters [17].
The SBF has been modified and improved from different aspects for a variety of specific problems. The most important variations include compressed Bloom filters [18], counting Bloom filters [12], distance-sensitive Bloom filters [19], Bloom filters with two hash functions [20], space-code Bloom filters [21], spectral Bloom filters [22], generalized Bloom filters [23], Bloomier filters [24], and Bloom filters based on partitioned hashing [25]. Compressed Bloom filters can improve performance in terms of bandwidth saving when an SBF is passed on as a message. Counting Bloom filters deal mainly with the item deletion operation. Distance-sensitive Bloom filters, using locality-sensitive hash functions, can answer queries of the form, “Is x close to an item of S?” Bloom filters with two hash functions use a standard technique in hashing to simplify the implementation of SBFs significantly.
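To make the SBF and the two-hash-function idea concrete, the following is a minimal Python sketch, not the implementation from [1] or [20]. It assumes a bit vector of m bits and k probe positions derived from two base hashes as g_i(x) = (h1(x) + i * h2(x)) mod m. The class name StandardBloomFilter, the helper _positions, and the use of SHA-256 to obtain h1 and h2 are illustrative assumptions.

```python
import hashlib


class StandardBloomFilter:
    """Illustrative SBF sketch: m bits, k probe positions per item."""

    def __init__(self, m: int, k: int) -> None:
        if m <= 0 or k <= 0:
            raise ValueError("m and k must be positive")
        self.m = m
        self.k = k
        self.bits = bytearray((m + 7) // 8)  # every bit starts at 0

    def _positions(self, item: str) -> list[int]:
        # Assumption: one SHA-256 digest supplies both base hashes h1 and h2.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        # Double hashing: g_i(x) = (h1(x) + i * h2(x)) mod m
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, item: str) -> None:
        # Set the k bits selected for this item.
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def query(self, item: str) -> bool:
        # True means "possibly in the set" (false positives can occur);
        # False means "definitely not in the set" (no false negatives).
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    sbf = StandardBloomFilter(m=1024, k=4)
    for word in ("alpha", "beta", "gamma"):
        sbf.insert(word)
    print(sbf.query("alpha"))  # an inserted item is always reported as present
```

Because bits are only ever set, this sketch cannot delete items, and its false positive probability grows as more items are inserted into the fixed m bits.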
Space-code Bloom filters and spectral Bloom filters focus on multisets, which support queries of the form, “How many occurrences of an item are there in a given multiset?”
The SBF and its mainstream variations are suitable for representing static sets whose cardinality is known prior to design and deployment. Although the SBF and its variations have found suitable applications in different fields, the following three obstacles still lack suitable and practical solutions:
1. For stand-alone applications that know the upper bound on set cardinality for a dynamic set in advance, a large number of bits are allocated for an SBF to represent all possible items of the dynamic set at the outset. This approach diminishes the space
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 22, NO. 1, JANUARY 2010
• D. Guo, H. Chen, and X. Luo are with the Key Laboratory of C4ISR Technology, National University of Defense Technology, Changsha 410073, China. E-mail: {guodeke, chh0808}@gmail.com, xsluo@nudt.edu.cn.
• J. Wu is with the Department of Computer and Information Sciences, Temple University, 1805 N. Broad Street, Philadelphia, PA 19122. E-mail: jiewu@temple.edu.
• Y. Yuan is with the Institute of Computer Systems, Northeastern University, 132#, Shen Yang City, Liao Ning Province 110004, China. E-mail: linuxyy@gmail.com.
Manuscript received 26 May 2007; revised 19 July 2008; accepted 10 Feb. 2009; published online 18 Feb. 2009.
Recommended for acceptance by D. Gunopulos.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer.org, and reference IEEECS Log Number TKDE-2007-05-0239.
Digital Object Identifier no. 10.1109/TKDE.2009.57.
1041-4347/10/$26.00 © 2010 IEEE. Published by the IEEE Computer Society.