CS6100: Topics in Design and Analysis of Algorithms Range Searching


Oct 05, 2023


SLIDE 1

CS6100: Topics in Design and Analysis of Algorithms

Range Searching John Augustine

CS6100 (Even 2012): Range Searching

SLIDE 2

The Range Searching Problem

Given a set P of n points in R^d, for a fixed integer d ≥ 1, we want to preprocess and store it in a data structure so that, given a query range (typically an axis-parallel rectangle), we can quickly report all the points in the range.

For 1D range searching, we will study (i) balanced binary search trees and (ii) skip lists.

For 2D point sets, we will study (i) kd-trees and (ii) range trees, both of which can be extended to arbitrary d-dimensional point sets.

[Figure: a range query on employee records, with date of birth between 19,500,000 and 19,559,999 and salary between 3,000 and 4,000. Example record: G. Ometer, born Aug 19, 1954, salary $3,500.]


SLIDE 3

Balanced Binary Search Trees (BBST)

Given a set P of n points in R stored in a sorted array A, we can construct a tree of depth O(log n). For simplicity, we begin with the assumption that n = 2^k for some integer k.

The data nodes are in the leaves. The internal nodes store values that guide the search. The root node stores the 2^(k−1)-th element of A. While searching for a value x in the query phase, if x is less than or equal to the value stored in the root, the search is guided to the left subtree. Otherwise, the search is guided to the right subtree.

The left subtree is constructed recursively over the points stored in A at positions 1 through 2^(k−1). The right subtree is constructed over the points in A at positions 2^(k−1) + 1 through 2^k. When constructing an internal node on a 2-element point set, the left subtree simply points to the smaller of the two points and the right subtree to the larger, thus terminating the recursion.

SLIDE 4

The construction can be easily adapted for arbitrary n. See below for an example.

[Figure: a balanced binary search tree whose leaves store the points 3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105, with a query range [µ, µ′] marked.]

Lemma 1. If the set of points is sorted, we can construct the BBST in O(n) time. If not, it takes O(n log n) time, as we first have to sort the point set. The BBST data structure requires O(n) storage space.
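The construction can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the names `Node`, `build_bbst`, `contains` and `depth` are hypothetical, arrays are 0-indexed, and for odd sizes the left half receives the extra element.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    key: float                      # leaf: the data point; internal: guiding value
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def build_bbst(a: Sequence[float], lo: int = 0, hi: Optional[int] = None) -> Node:
    """Build a leaf-oriented balanced BST over the sorted, non-empty range a[lo:hi]."""
    if hi is None:
        hi = len(a)
    if hi - lo == 1:
        return Node(a[lo])          # a single point becomes a leaf
    mid = (lo + hi + 1) // 2        # the left half gets the extra element when odd
    # The internal node stores the largest key of the left half, so that
    # "x <= key" sends the search to the left subtree.
    return Node(a[mid - 1], build_bbst(a, lo, mid), build_bbst(a, mid, hi))


def contains(root: Node, x: float) -> bool:
    """Walk from the root to a leaf and test whether that leaf stores x."""
    node = root
    while not node.is_leaf():
        node = node.left if x <= node.key else node.right
    return node.key == x


def depth(node: Node) -> int:
    """Number of edges on the longest root-to-leaf path."""
    if node.is_leaf():
        return 0
    return 1 + max(depth(node.left), depth(node.right))


points = [3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105]
root = build_bbst(points)
```

Because the recursion passes index bounds instead of copying slices, each of the at most 2n − 1 nodes is created in constant time, which matches the O(n) bound for sorted input.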


SLIDE 5

To search for a single value µ, we start at the root node and ask if µ is greater than the value stored in the root. If it is, we move to the right subtree; otherwise, we move to the left. We continue recursively until we reach a leaf, where we can report whether µ is present.

To query a range [µ, µ′], we traverse the tree for both µ and µ′ until we find the node where the two search paths split; call it ν_split.

[Figure: the search paths for µ and µ′ from root(T), the split node ν_split, and the selected subtrees between the two paths.]

At ν_split, the paths to µ and µ′ part ways. As we traverse towards µ (past ν_split), each time we move to a left subtree, we report all points in the corresponding right subtree. We deal with µ′ symmetrically. Finally, we check the two leaves at which the paths end.
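The range query described above can be sketched as follows. This is a hedged illustration, not the course's reference code: the names `Node`, `build`, `report_subtree` and `range_query` are hypothetical, the points are assumed to be distinct, and the compact builder copies slices (so it runs in O(n log n) rather than O(n)).

```python
from typing import List, Sequence


class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

    def is_leaf(self) -> bool:
        return self.left is None


def build(a: Sequence[float]) -> Node:
    """Leaf-oriented balanced BST over a sorted, non-empty sequence."""
    if len(a) == 1:
        return Node(a[0])
    mid = (len(a) + 1) // 2
    return Node(a[mid - 1], build(a[:mid]), build(a[mid:]))


def report_subtree(node: Node, out: List[float]) -> None:
    """Append every point stored in the leaves below node, left to right."""
    if node.is_leaf():
        out.append(node.key)
    else:
        report_subtree(node.left, out)
        report_subtree(node.right, out)


def range_query(root: Node, lo: float, hi: float) -> List[float]:
    """Report all points p with lo <= p <= hi, in sorted order."""
    out: List[float] = []
    # Find the split node: the first node where the paths to lo and hi diverge.
    split = root
    while not split.is_leaf() and (hi <= split.key or lo > split.key):
        split = split.left if hi <= split.key else split.right
    if split.is_leaf():
        if lo <= split.key <= hi:
            out.append(split.key)
        return out
    # Path towards lo: each time we turn left, the right subtree is in range.
    pending = []                     # right subtrees, collected top-down
    node = split.left
    while not node.is_leaf():
        if lo <= node.key:
            pending.append(node.right)
            node = node.left
        else:
            node = node.right
    if lo <= node.key <= hi:
        out.append(node.key)
    for sub in reversed(pending):    # deeper subtrees hold the smaller points
        report_subtree(sub, out)
    # Path towards hi: each time we turn right, the left subtree is in range.
    node = split.right
    while not node.is_leaf():
        if hi > node.key:
            report_subtree(node.left, out)
            node = node.right
        else:
            node = node.left
    if lo <= node.key <= hi:
        out.append(node.key)
    return out


points = [3, 10, 19, 23, 30, 37, 49, 59, 62, 70, 80, 89, 100, 105]
root = build(points)
print(range_query(root, 18, 77))   # [19, 23, 30, 37, 49, 59, 62, 70]
```

The two walks below the split node visit O(log n) nodes, and every other visited node lies inside a reported subtree.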


SLIDE 6

Lemma 2. The time to report the points in a range [µ, µ′] is O(k + log n), where k is the number of points in [µ, µ′].

Proof. The tree traversal requires O(log n) time. Reporting the points in each selected subtree requires O(k′) time, where k′ is the number of points on which that particular subtree is built (a subtree with k′ leaves has fewer than 2k′ nodes). Therefore, O(k) time is required to report all k points.

Quantity                          Bound
Preprocessing time                O(n log n)
Space                             O(n)
Searching for 1 element           O(log n)
Reporting a range with k items    O(k + log n)
Insertion                         O(log n)
Deletion                          O(log n)

Table 1: Performance bounds of a BBST containing n points.


SLIDE 7

Skip List

While the static implementation of a binary search tree is straightforward, making the data structure dynamic (i.e., inserting points into and deleting points from the point set) is non-trivial. The skip list is a randomized data structure that is easy to implement, including updates (insertions and deletions). In expectation, it has the same performance bounds as BBSTs (see Table 1).

[Figure: a skip list, with the head pointer marked.]

SLIDE 8

Construction

Again, we assume that the set P of n points is given to us in sorted order. We denote the i-th element of P in the sorted list by p_i. In our data structure, we use nodes with four pointers: left, right, top and bottom.

We first construct the bottom level (or level 0), which is a linked list over the sorted points using the four-pointer node structure. The bottom pointers are set to null. For each p_i, we toss a fair coin repeatedly until we get Heads. Let ℓ_i be the number of Tails before we obtain the first Heads.

Vertical Pointers. In addition to its level 0 node, we make ℓ_i identical nodes containing p_i, one for each level from 1 up to ℓ_i, and we chain them up as follows. For 0 ≤ j < ℓ_i, the top pointer of the j-th node points to the (j+1)-th node and the bottom pointer of the (j+1)-th node points to the j-th node. The top pointer of the ℓ_i-th node is null.


SLIDE 9

The topmost level is ℓ = max_i ℓ_i. For each level, we have two special boundary nodes, one to the left of all nodes in that level and the other to the right. The boundary nodes are also chained up vertically.

Horizontal Pointers. We establish horizontal links at each level j, starting from j = 1 up to j = ℓ. We start from the left boundary of level j. For each node η in level j (starting from the left boundary), we step down to its copy in level j − 1 and traverse to the right until we come to a node in level j − 1 that has a copy η′ in level j. We establish bidirectional links between η and η′ and continue this process from η′ until we reach the right boundary.

The head pointer points to the left boundary of level ℓ.
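A minimal Python sketch of this construction follows. The names `SkipNode`, `tails_before_heads`, `make_tower` and `build_skip_list` are hypothetical, and one step is simplified: instead of stepping down and walking right to find η′, the sketch links, at each level, the towers that are tall enough to reach it, which produces the same pointers.

```python
import random
from typing import List, Optional, Sequence


class SkipNode:
    """A node with the four pointers: left, right, top and bottom."""

    def __init__(self, value: float):
        self.value = value
        self.left: Optional["SkipNode"] = None
        self.right: Optional["SkipNode"] = None
        self.top: Optional["SkipNode"] = None
        self.bottom: Optional["SkipNode"] = None


def tails_before_heads(rng: random.Random) -> int:
    """Toss a fair coin until Heads; return the number of Tails seen."""
    tails = 0
    while rng.random() < 0.5:        # this outcome plays the role of Tails
        tails += 1
    return tails


def make_tower(value: float, height: int) -> List[SkipNode]:
    """Create copies of value for levels 0..height, chained by top/bottom."""
    tower = [SkipNode(value) for _ in range(height + 1)]
    for lower, upper in zip(tower, tower[1:]):
        lower.top, upper.bottom = upper, lower
    return tower


def build_skip_list(points: Sequence[float], rng: random.Random) -> SkipNode:
    """Build a skip list over sorted points and return the head pointer."""
    heights = [tails_before_heads(rng) for _ in points]
    top_level = max(heights, default=0)
    # Boundary towers span every level and store -infinity / +infinity.
    towers = ([make_tower(float("-inf"), top_level)]
              + [make_tower(p, h) for p, h in zip(points, heights)]
              + [make_tower(float("inf"), top_level)])
    # Horizontal links: at each level, chain the towers that reach that level.
    for level in range(top_level + 1):
        row = [t[level] for t in towers if len(t) > level]
        for a, b in zip(row, row[1:]):
            a.right, b.left = b, a
    return towers[0][top_level]      # left boundary of the top level


head = build_skip_list([3, 10, 19, 23, 30, 37, 49], random.Random(1))
```

Passing an explicit `random.Random` instance keeps the coin tosses reproducible, which is convenient when experimenting with the exercises on adversaries.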


SLIDE 10

Searching for a Point p

Here, given p, we want to report whether P (stored using the skip list data structure) contains p. For simplicity, assume that the left boundary nodes store −∞ and the right boundary nodes store +∞. Start from the head pointer. Repeat the following steps:

1. Moving right along the current level, find the last node whose value is at most p. If the value is exactly p, we have found it, so we can terminate.
2. Else, if we have reached level 0, then report that p is not in P and terminate.
3. Else, step directly down one level.
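The three steps above can be sketched in Python as follows. The names `Node`, `build` and `search` are hypothetical; the compact builder keeps only the right and bottom pointers because the search needs no others, and the query value p is assumed to be finite.

```python
import random
from typing import Optional, Sequence

INF = float("inf")


class Node:
    def __init__(self, value: float, bottom: Optional["Node"] = None):
        self.value = value
        self.right: Optional["Node"] = None
        self.bottom = bottom


def build(points: Sequence[float], rng: random.Random) -> Node:
    """Compact skip list over sorted points; returns the head pointer."""
    heights = []
    for _ in points:
        h = 0
        while rng.random() < 0.5:    # count Tails before the first Heads
            h += 1
        heights.append(h)
    top = max(heights, default=0)
    # Boundary items reach every level.
    items = [(-INF, top)] + list(zip(points, heights)) + [(INF, top)]
    below = [None] * len(items)      # node of each item on the level below
    head = None
    for level in range(top + 1):
        prev = None
        for i, (value, h) in enumerate(items):
            if h < level:
                continue             # this item has no copy on this level
            node = Node(value, bottom=below[i])
            below[i] = node
            if prev is None:
                head = node          # left boundary of the current level
            else:
                prev.right = node
            prev = node
    return head


def search(head: Node, p: float) -> bool:
    """Report whether the skip list contains the finite value p."""
    node = head
    while True:
        # Step 1: move right to the last node on this level with value <= p.
        while node.right.value <= p:
            node = node.right
        if node.value == p:
            return True
        # Step 2: at level 0 there is nowhere left to look.
        if node.bottom is None:
            return False
        # Step 3: step directly down one level.
        node = node.bottom


points = [3, 10, 19, 23, 30]
head = build(points, random.Random(7))
print(search(head, 19), search(head, 11))   # True False
```

The +∞ boundary node guarantees that the inner loop stops before running off the right end of a level.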


SLIDE 11

Exercises

1. How do we search for points in a range?
2. How do we insert a new node?
3. How do we delete a node?
4. Suppose you are given a skip list. Can you strategically add and delete points so that the query times become bad (i.e., ω(log n))? Note that you will have to play the role of an adaptive adversary that can see the coin tosses (and therefore see the data structure as it evolves).
5. Suppose the coin tosses are hidden from you and you can’t measure the actual query times. Can you still strategically add and delete points so that the query times become bad? (Such an adversary that cannot see the coin tosses is called an oblivious adversary.)


SLIDE 12

6. An alternative way to ask the previous question is the following. How do we prove that, under an oblivious adversary, the expected performance bounds of a skip list match Table 1?


SLIDE 13

kd-Trees

Recall that we now want to perform 2D range searches.

[Figure: a 2D range query on employee records, with date of birth between 19,500,000 and 19,559,999 and salary between 3,000 and 4,000. Example record: G. Ometer, born Aug 19, 1954, salary $3,500.]

So, we need a data structure that considers both the x AND the y coordinates. kd-trees achieve this by alternating between x and y.

Let us now recursively construct the kd-tree given a set P of n points in 2D. As in BBSTs, the data is stored in the leaves. The internal nodes serve the purpose of guiding searches to the required leaves. The root node (level 0) of the kd-tree corresponds to the entire data set.


SLIDE 14

To construct the level 1 nodes, i.e., the left and right children of the root, we split the data along the x median. The subtree rooted at the left child of the root node stores all points with x coordinate values no more than the x median. The rest are stored in the right subtree of the root node.

[Figure: a vertical splitting line ℓ divides the point set P into a left part and a right part.]

To construct level 2 nodes, we again split the points stored in the subtrees rooted at each of the level 1 nodes into two roughly equal halves. However, this time, we split along the y median. We continue recursively alternating between splitting along x and y medians.


SLIDE 15

[Figure: a kd-tree on ten points p1 to p10 with splitting lines ℓ1 to ℓ9, shown as a subdivision of the plane and as the corresponding binary tree.]

Algorithm BUILDKDTREE(P, depth)
Input. A set of points P and the current depth depth.
Output. The root of a kd-tree storing P.
1. if P contains only one point
2.    then return a leaf storing this point
3.    else if depth is even
4.            then Split P into two subsets with a vertical line ℓ through the median x-coordinate of the points in P. Let P_1 be the set of points to the left of ℓ or on ℓ, and let P_2 be the set of points to the right of ℓ.
5.            else Split P into two subsets with a horizontal line ℓ through the median y-coordinate of the points in P. Let P_1 be the set of points below ℓ or on ℓ, and let P_2 be the set of points above ℓ.
6.         ν_left ← BUILDKDTREE(P_1, depth + 1)
7.         ν_right ← BUILDKDTREE(P_2, depth + 1)
8.         Create a node ν storing ℓ, make ν_left the left child of ν, and make ν_right the right child of ν.
9.         return ν
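BUILDKDTREE translates fairly directly into Python. The sketch below makes assumptions that are not in the slides: the names `Leaf`, `Inner`, `build_kd_tree` and `leaves` are hypothetical, no two points share an x- or a y-coordinate, and the median is found by sorting at every level (simpler than linear-time selection, at the cost of O(n log² n) construction time).

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Point = Tuple[float, float]


@dataclass
class Leaf:
    point: Point


@dataclass
class Inner:
    axis: int          # 0: vertical line x = split; 1: horizontal line y = split
    split: float       # position of the splitting line
    left: "KdTree"     # points on or to the left of / below the line
    right: "KdTree"    # points to the right of / above the line


KdTree = Union[Leaf, Inner]


def build_kd_tree(points: List[Point], depth: int = 0) -> KdTree:
    """Build a kd-tree over a non-empty list of points with distinct coordinates."""
    if len(points) == 1:
        return Leaf(points[0])
    axis = depth % 2                         # even depth: split on x; odd: on y
    pts = sorted(points, key=lambda p: p[axis])
    mid = (len(pts) + 1) // 2                # left part gets the median point
    split = pts[mid - 1][axis]
    return Inner(axis, split,
                 build_kd_tree(pts[:mid], depth + 1),
                 build_kd_tree(pts[mid:], depth + 1))


def leaves(tree: KdTree) -> List[Point]:
    """Collect the points stored in the leaves, left to right."""
    if isinstance(tree, Leaf):
        return [tree.point]
    return leaves(tree.left) + leaves(tree.right)


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kd_tree(points)
```

In this example the root splits on x at 5, the median x-coordinate, and its two children split on y.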


SLIDE 16

Preprocessing Time and Storage

At each internal node, we have to split P into two sets. This requires O(n) time if the internal node is built on n elements (for example, using linear-time median selection). Subsequently, two recursive calls are made on point sets that contain roughly n/2 elements. Therefore, the recurrence relation for the preprocessing time on n elements is T(n) = O(n) + 2T(n/2), which evaluates to T(n) = O(n log n).

To analyse the space required by a kd-tree that stores n points, first note that if a binary tree T has n leaves and each of its internal nodes has exactly two children, then T has n − 1 internal nodes. Since any kd-tree is such a tree, the space required is O(n).
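The preprocessing recurrence can be checked by unrolling it. The derivation below is a sketch under two assumptions that are not in the slides: n is a power of two, and the O(n) splitting cost is written as cn for some constant c > 0.

```latex
\begin{align*}
T(n) &= 2\,T(n/2) + cn \\
     &= 4\,T(n/4) + 2cn \\
     &= \cdots \\
     &= 2^{i}\,T(n/2^{i}) + i\,cn \\
     &= n\,T(1) + cn\log_2 n \qquad (i = \log_2 n) \\
     &= O(n \log n).
\end{align*}
```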


SLIDE 17

Region of a node

Note that each node in the kd-tree has a region associated with it. The region associated with the root is the entire plane. Subsequently, the region gets divided based on where the points are split.

[Figure: a kd-tree with splitting lines ℓ1, ℓ2, ℓ3, and the region region(ν) of a node ν bounded by those lines.]


SLIDE 18

Query Procedure

Traverse the kd-tree, but only visit nodes whose regions intersect the query rectangle.

When a region is fully contained in the query rectangle, just report all points in the subtree.

When the traversal reaches a leaf, check whether its point lies in the query rectangle and report it if so.

Lemma 3. A query with an axis-parallel rectangle in a kd-tree of n points takes O(√n + k) time, where k is the number of points reported.

Proof Sketch. Reporting all points in a region fully contained in the query rectangle takes time linear in the number of points in the region. Therefore, reporting all points in regions contained within the query rectangle takes O(k) time.

Consider the nodes that were visited, but whose regions were not fully contained in the query rectangle. We only spend O(1) time in each such node.

SLIDE 19

Therefore, we can account for the remaining running time by (asymptotically) counting the number of such nodes.

The region of each such node is cut by one of the four boundaries of the query rectangle. Therefore, the number of such nodes is asymptotically upper bounded by the maximum number of intersections of a line with regions in the kd-tree.

We build a recurrence Q(n) that captures the maximum number of regions in a kd-tree on n points that an axis-parallel line can intersect.

Since the kd-tree alternates between vertical and horizontal splits, Q(n) must be defined across two levels. A vertical line, say, meets only one of the two regions created by a vertical split, but both regions created by the following horizontal split, so it can intersect two of the four regions two levels down, each holding about n/4 points. In particular, Q(n) = 2 + 2Q(n/4), which evaluates to Q(n) = O(√n). Thus the total query time is O(√n + k).
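The query procedure from the previous slide can be sketched in Python as follows. All names (`build`, `report_all`, `contains`, `intersects`, `query`, `range_query`) are hypothetical, the tuple-based node layout is chosen only for brevity, and regions are tracked as closed rectangles (x_lo, x_hi, y_lo, y_hi), a slight over-approximation that does not affect correctness.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]     # (x_lo, x_hi, y_lo, y_hi), closed


def build(points: List[Point], depth: int = 0):
    """Compact kd-tree: a leaf is a point (x, y); an inner node is
    (axis, split, left, right)."""
    if len(points) == 1:
        return points[0]
    axis = depth % 2
    pts = sorted(points, key=lambda p: p[axis])
    mid = (len(pts) + 1) // 2
    return (axis, pts[mid - 1][axis],
            build(pts[:mid], depth + 1), build(pts[mid:], depth + 1))


def report_all(node, out: List[Point]) -> None:
    """Report every point in the subtree without further tests."""
    if len(node) == 2:
        out.append(node)
    else:
        report_all(node[2], out)
        report_all(node[3], out)


def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]


def query(node, region: Rect, q: Rect, out: List[Point]) -> None:
    if len(node) == 2:                       # leaf: test the point itself
        x, y = node
        if q[0] <= x <= q[1] and q[2] <= y <= q[3]:
            out.append(node)
        return
    axis, split, left, right = node
    # Child regions: cut the current region at the splitting line.
    lo_region = list(region)
    lo_region[2 * axis + 1] = split          # upper bound on the split axis
    hi_region = list(region)
    hi_region[2 * axis] = split              # lower bound on the split axis
    for child, reg in ((left, tuple(lo_region)), (right, tuple(hi_region))):
        if contains(q, reg):
            report_all(child, out)           # region fully inside the query
        elif intersects(q, reg):
            query(child, reg, q, out)        # region partially inside


def range_query(root, q: Rect) -> List[Point]:
    out: List[Point] = []
    query(root, (-math.inf, math.inf, -math.inf, math.inf), q, out)
    return out


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
root = build(points)
print(sorted(range_query(root, (4, 8, 1, 5))))   # [(5, 4), (7, 2), (8, 1)]
```

Nodes handled by `report_all` account for the O(k) term; the nodes where `query` recurses are exactly those whose regions are cut by the boundary of the query rectangle.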


SLIDE 20

Range Trees

The range tree is a data structure for range searching whose query time (more precisely, its non-output-sensitive term) is polylogarithmic in n instead of O(√n). Its preprocessing time and space complexity are O(n log n).

The key to designing multi-dimensional range searching data structures is to combine searching along multiple coordinate axes. While we alternated between the x and y coordinates in kd-trees, in range trees we first build a tree on the x-coordinates and then, for each internal node of the x-coordinate tree, we build a separate tree on the y-coordinates.

To construct the range tree, it is helpful to store two copies of the set of points (at each recursive call), one sorted according to the x-coordinates and the other sorted according to the y-coordinates.


SLIDE 21

Recall 1D Range Searching

Before we see how 2D range trees can be constructed, we first recall 1D BBSTs.

[Figure: a 1D BBST with the search paths for µ and µ′ diverging at ν_split.]

We store the data as leaves in a balanced binary search tree. The canonical subset P(v) of a node v is the data stored in the leaves of the subtree rooted at v.

In 2D range trees, the primary tree is a 1D BBST based on the x-coordinates of the points. For each internal node v, we additionally store an associated tree based on the y-coordinates of the canonical subset P(v) of v.


SLIDE 22

2D Range Tree

[Figure: a 2D range tree. The primary tree T is a binary search tree on x-coordinates; each node ν points to its associated structure T_assoc(ν), a binary search tree on y-coordinates that stores the canonical subset P(ν).]

Algorithm BUILD2DRANGETREE(P)
Input. A set P of points in the plane.
Output. The root of a 2-dimensional range tree.
1. Construct the associated structure: Build a binary search tree T_assoc on the set P_y of y-coordinates of the points in P. Store at the leaves of T_assoc not just the y-coordinates of the points in P_y, but the points themselves.
2. if P contains only one point
3.    then Create a leaf ν storing this point, and make T_assoc the associated structure of ν.
4.    else Split P into two subsets; one subset P_left contains the points with x-coordinate less than or equal to x_mid, the median x-coordinate, and the other subset P_right contains the points with x-coordinate larger than x_mid.
5.         ν_left ← BUILD2DRANGETREE(P_left)
6.         ν_right ← BUILD2DRANGETREE(P_right)
7.         Create a node ν storing x_mid, make ν_left the left child of ν, make ν_right the right child of ν, and make T_assoc the associated structure of ν.
8. return ν

SLIDE 23

Lemma 4. A 2D range tree on n data points takes O(n log n) storage.

Proof. A data point p is stored only in the associated trees attached to the nodes of the first-level tree on the path from the root to p. At a given level, a point p is stored in only one associated structure. Since each associated tree uses storage linear in the size of its canonical subset, each data point contributes O(1) to the storage at each of the O(log n) levels. Therefore, the total space is O(n log n).

Algorithm 2DRANGEQUERY(T, [x : x′] × [y : y′])
Input. A 2-dimensional range tree T and a range [x : x′] × [y : y′].
Output. All points in T that lie in the range.
1. ν_split ← FINDSPLITNODE(T, x, x′)
2. if ν_split is a leaf
3.    then Check if the point stored at ν_split must be reported.
4.    else (∗ Follow the path to x and call 1DRANGEQUERY on the subtrees right of the path. ∗)
5.         ν ← lc(ν_split)
6.         while ν is not a leaf
7.            do if x ≤ x_ν
8.                  then 1DRANGEQUERY(T_assoc(rc(ν)), [y : y′])
9.                       ν ← lc(ν)
10.                 else ν ← rc(ν)
11.        Check if the point stored at ν must be reported.
12.        Similarly, follow the path from rc(ν_split) to x′, call 1DRANGEQUERY with the range [y : y′] on the associated structures of subtrees left of the path, and check if the point stored at the leaf where the path ends must be reported.
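The build and query algorithms can be combined into one short Python sketch. It deliberately swaps in a simpler associated structure: instead of a BBST on y-coordinates, each node keeps its canonical subset as a list sorted by y and answers the 1D query by binary search, which has the same O(log n + k) cost. The names `Node`, `query_1d`, `check_leaf` and `range_query` are hypothetical, x-coordinates are assumed to be distinct, and each node re-sorts its subset, so construction here takes O(n log² n) time rather than O(n log n).

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class Node:
    """Primary-tree node built over a non-empty list of points sorted by x."""

    def __init__(self, pts_by_x: List[Point]):
        # Associated structure: the canonical subset sorted by y.
        self.assoc = sorted(pts_by_x, key=lambda p: p[1])
        self.ys = [p[1] for p in self.assoc]
        self.left: Optional["Node"] = None
        self.right: Optional["Node"] = None
        mid = (len(pts_by_x) + 1) // 2
        self.x_mid = pts_by_x[mid - 1][0]    # median x-coordinate
        if len(pts_by_x) > 1:
            self.left = Node(pts_by_x[:mid])
            self.right = Node(pts_by_x[mid:])

    def is_leaf(self) -> bool:
        return self.left is None


def query_1d(node: Node, y1: float, y2: float, out: List[Point]) -> None:
    """Report the points of node's canonical subset with y1 <= y <= y2."""
    out.extend(node.assoc[bisect_left(node.ys, y1):bisect_right(node.ys, y2)])


def check_leaf(node: Node, x1, x2, y1, y2, out: List[Point]) -> None:
    px, py = node.assoc[0]
    if x1 <= px <= x2 and y1 <= py <= y2:
        out.append((px, py))


def range_query(root: Node, x1, x2, y1, y2) -> List[Point]:
    out: List[Point] = []
    split = root                              # find the split node
    while not split.is_leaf() and (x2 <= split.x_mid or x1 > split.x_mid):
        split = split.left if x2 <= split.x_mid else split.right
    if split.is_leaf():
        check_leaf(split, x1, x2, y1, y2, out)
        return out
    node = split.left                         # follow the path to x1
    while not node.is_leaf():
        if x1 <= node.x_mid:
            query_1d(node.right, y1, y2, out)
            node = node.left
        else:
            node = node.right
    check_leaf(node, x1, x2, y1, y2, out)
    node = split.right                        # follow the path to x2
    while not node.is_leaf():
        if x2 > node.x_mid:
            query_1d(node.left, y1, y2, out)
            node = node.right
        else:
            node = node.left
    check_leaf(node, x1, x2, y1, y2, out)
    return out


points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
root = Node(sorted(points))
print(sorted(range_query(root, 4, 8, 1, 5)))   # [(5, 4), (7, 2), (8, 1)]
```

Each call to `query_1d` handles one of the O(log n) selected canonical subsets, which is where the O(log² n) term in the query time comes from.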

SLIDE 24

Theorem 1. A 2D range tree on n data points can be constructed in O(n log n) time and occupies O(n log n) space. A range search query on a range with k points in it takes O(k + log^2 n) time.

Proof. The construction time can be proved using ideas from the proof of Lemma 4.

On the primary BBST (based on points sorted according to x-coordinates), we perform a 1D range search for nodes whose canonical subsets have x-coordinates that lie entirely within the x-extent of the query range. There are O(log n) such nodes. For each of these nodes, we look at the associated BBST (based on the canonical subset sorted according to the y-coordinate) and perform a 1D range search for points whose y-coordinates fall within the query range. Overall, these traversals require O(log^2 n) time.

In these associated BBSTs, we look for subtrees that are fully contained within our search range and report all points in such subtrees. Since such reporting is linear in the number of points stored in those subtrees, this adds an O(k) term to the query time. Therefore, the total query time is O(k + log^2 n).
