Closest pair of points problem

The closest pair of points problem (also closest pair problem) consists of finding the two points of a set that lie closest together in a plane.

Formal Description

Given is a set P of points; sought are two points p, q ∈ P whose Euclidean distance dist(p, q) is minimal.

The naive approach is simply to compute the distances between all possible pairs of points and to select the smallest among them:

MinDist = infinity
for all p in P:
    for all q in P:
        if p ≠ q and dist(p, q) < MinDist:
            MinDist = dist(p, q)
            ClosestPair = (p, q)
return ClosestPair

Since both the inner and the outer loop are executed n times, the running time of the algorithm is in O(n²).
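For illustration, a direct Python version of this pseudocode could look as follows (points are assumed to be given as (x, y) tuples, and the function name is an illustrative choice; iterating over unordered instead of ordered pairs halves the work but does not change the quadratic running time):

from math import hypot, inf

def closest_pair_brute_force(points):
    # Compare every pair of points; O(n^2) distance computations.
    min_dist = inf
    closest = None
    for i, p in enumerate(points):
        for q in points[i + 1:]:          # each unordered pair is checked exactly once
            d = hypot(p[0] - q[0], p[1] - q[1])
            if d < min_dist:
                min_dist = d
                closest = (p, q)
    return closest, min_dist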

Divide-and-conquer algorithm

First, the given point set is sorted once with respect to the x-coordinates and once with respect to the y-coordinates; this yields two sorted lists, one for each coordinate.

In the divide step, the point set is split at the median into a left and a right half.

The recursion is now applied to each of the two halves; the closest pairs obtained within the two halves, however, do not take into account any pairs with one point in each half.

The conquer step now combines the two halves by selecting the minimum of the distances found. Two cases have to be distinguished: either the closest pair lies entirely within one of the two halves, in which case it has already been found by the recursive calls, or it consists of one point from the left and one point from the right half.

In principle, it would be necessary to check all such pairs of points individually. However, since there are still quadratically many of them, this would yield no advantage in running time (the resulting recurrence would be T(n) = 2·T(n/2) + n²/4, which is still in O(n²)). It is more sensible to exploit the fact that a certain bound δ for the smallest distance is already known. On the one hand, only points whose distance to the dividing line in the x-direction does not exceed δ have to be considered; moreover, for these points only partners whose distance in the y-direction is less than δ have to be considered. If one imagines a grid of width δ/2 in the vicinity of the dividing line, then each grid cell contains at most one point, since otherwise a distance of less than δ would already have been found. Thus, for each point P to be tested, only a constant number of potential partners has to be examined. Since only those cells need to be considered that are at a distance of at most δ from P (at most 12 each above and below), at most 24 additional distances have to be computed for each point, so that instead of quadratic running time only linear running time is required. Overall, the recurrence is now T(n) = 2·T(n/2) + 24·n, which is in O(n log n).
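As an illustration, the following Python sketch follows the scheme just described. It uses the common variant in which the candidate points near the dividing line are kept sorted by y and each is compared only with a constant number of successors, rather than the explicit δ/2 grid mentioned above; all helper names are illustrative, and at least two distinct points are assumed.

from math import hypot, inf

def closest_pair_dc(points):
    # Divide-and-conquer closest pair, O(n log n); assumes >= 2 distinct (x, y) tuples.
    px = sorted(points)                       # sorted by x-coordinate
    py = sorted(points, key=lambda p: p[1])   # sorted by y-coordinate
    return _closest(px, py)

def _dist(p, q):
    return hypot(p[0] - q[0], p[1] - q[1])

def _closest(px, py):
    n = len(px)
    if n <= 3:                                # small instances: brute force
        pairs = [(_dist(px[i], px[j]), (px[i], px[j]))
                 for i in range(n) for j in range(i + 1, n)]
        return min(pairs)
    mid = n // 2
    median_x = px[mid][0]
    left = set(px[:mid])
    py_left = [p for p in py if p in left]
    py_right = [p for p in py if p not in left]
    # Recurse on both halves and keep the smaller of the two distances found.
    best = min(_closest(px[:mid], py_left), _closest(px[mid:], py_right))
    delta = best[0]
    # Conquer step: only points within delta of the dividing line are candidates.
    strip = [p for p in py if abs(p[0] - median_x) < delta]
    for i, p in enumerate(strip):
        for q in strip[i + 1:]:
            if q[1] - p[1] >= delta:          # later points are too far away in y
                break
            best = min(best, (_dist(p, q), (p, q)))
    return best

The inner loop over the strip runs only a constant number of times per point, since within the strip each half contains no two points closer than δ.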

Randomized algorithm

Operation

The randomized algorithm is based on inserting the points in random order into a hash table, where inserting a point, as well as testing whether it undercuts the previously known minimum distance, works in expected O(1). The hash table maps each grid cell of size (δ/2)² to the point that may lie in it. Although the hash table has to be rebuilt each time δ is updated, the total number of insertions into all these hash tables is still in O(n). Leaving aside for the moment how the hash table is used concretely, this yields the following algorithm:

Form a random order p_1, …, p_n of the points
δ := dist(p_1, p_2)
Insert p_1 and p_2 into the hash table (grid width δ/2)
For i = 3 … n:
    If p_i has a distance δ' < δ to one of the points already inserted:
        δ := δ'
        Rebuild the hash table (grid width δ'/2)
    Insert p_i into the hash table

Grid structure and hash function

For simplicity, it is assumed that all points have coordinates between (0,0) and (1,1); if necessary, a coordinate transformation can be applied beforehand. The plane in which the points lie is then covered by a grid of width δ/2. What is needed now is a universal hash function: it makes it possible, on the one hand, to determine in O(1) for a given grid cell whether one of the points lies in it, and on the other hand, to insert new points into the existing hash table in O(1).
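For illustration, a Python dictionary can stand in for the hash table; the cell of a point is obtained by dividing its coordinates by the grid width δ/2 and rounding down. The helper names cell_of and build_table are illustrative, and a production implementation would use a universal hash function as described above.

def cell_of(point, delta):
    # Grid cell (of width delta/2) containing the point; coordinates assumed in [0, 1).
    width = delta / 2
    return (int(point[0] // width), int(point[1] // width))

def build_table(points, delta):
    # Maps each occupied grid cell to the single point lying in it.
    return {cell_of(p, delta): p for p in points}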

Inserting new points

Each time a new point P is inserted, it has to be checked whether the already known minimum distance δ is undercut. From the coordinates of P it is immediately apparent into which grid cell P will fall. Because of the division into grid cells of size (δ/2)², it only has to be determined whether a point lies in one of the surrounding grid cells. Surrounding are all grid cells that could give rise to such a distance, that is, only the surrounding 5x5 block: all points outside this area can have no distance < δ to P, since due to the grid size they are at least 2·(δ/2) away from P. Since each grid cell can contain at most one point (otherwise a distance < δ would already have been found earlier), at most 25 points have to be checked. If there are no points in this area, P can safely be inserted into the existing hash table. If, however, there are already points in this area, the distance δ is updated to the newly found minimum distance. As a consequence, the previous hash table becomes useless, because its grid width no longer matches δ/2. A new hash table with the correct δ is therefore needed: one simply builds a table for all points up to P from scratch. For the sake of efficiency, the distance test can be skipped during the rebuild, since the minimum distance δ of all points up to P is already known. In the end, after inserting a new point, one always has a correct hash table with grid width δ/2, and each grid cell contains at most one point.
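Putting the pieces together, a self-contained Python sketch of the randomized algorithm could look as follows. The shuffle, the dict-based table, and all names are illustrative choices; coordinates in [0, 1) and at least two distinct points are assumed.

import random
from math import floor, hypot

def closest_pair_randomized(points):
    # Expected linear time; assumes >= 2 distinct points with coordinates in [0, 1).
    pts = list(points)
    random.shuffle(pts)                       # insert the points in random order
    delta = hypot(pts[0][0] - pts[1][0], pts[0][1] - pts[1][1])
    best = (pts[0], pts[1])

    def cell(p, d):
        # Grid cell of width d/2 containing point p.
        return (floor(p[0] / (d / 2)), floor(p[1] / (d / 2)))

    table = {cell(q, delta): q for q in pts[:2]}   # hash table: cell -> point
    for i in range(2, len(pts)):
        p = pts[i]
        cx, cy = cell(p, delta)
        new_delta, partner = delta, None
        # Only the surrounding 5x5 block of cells can contain a point closer than delta.
        for dx in range(-2, 3):
            for dy in range(-2, 3):
                q = table.get((cx + dx, cy + dy))
                if q is not None:
                    d = hypot(p[0] - q[0], p[1] - q[1])
                    if d < new_delta:
                        new_delta, partner = d, q
        if partner is not None:
            # Minimum distance improved: rebuild the table for all points up to p
            # with the new grid width (no distance tests needed during the rebuild).
            delta, best = new_delta, (p, partner)
            table = {cell(q, delta): q for q in pts[:i]}
        table[cell(p, delta)] = p             # finally insert the new point itself
    return best, delta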

Runtime Analysis

Only the number of insert operations for new points is relevant for the running time. Trivially, each point has to be inserted at least once, so the running time is in Ω(n).

The question, however, is how much weight the occasional rebuilds of the hash table carry, since all previously inserted points have to be reinserted during a rebuild. It is therefore necessary to determine the probability that inserting the i-th point makes a rebuild necessary, and to take the associated cost into account. Intuitively, one sees that with a growing number of points the cost of a rebuild becomes ever larger, but its probability becomes ever smaller. Considered mathematically:

The probability that the i-th inserted point establishes a new distance δ' < δ is at most 2/i, because for this it would have to be exactly one of the two points that give rise to δ'. Since there are i points in total, two of which meet this requirement, the probability is at most 2/i.

We define a random variable X_i: it is 1 if δ' < δ holds when the i-th point is inserted, and 0 otherwise.

Then the total number of insert operations is n + Σ_{i=3…n} X_i·i: in any case the n inserted points, plus the insert operations caused by the rebuilds (a rebuild triggered at step i reinserts the points known up to that step).

Due to its linearity, the expected value of the rebuild term can be bounded by Σ_{i=3…n} E[X_i]·i ≤ Σ_{i=3…n} (2/i)·i = 2(n − 2) ≤ 2n.

That is, the expected total number of insertions across all rebuilds of the hash table is only 2n. Overall, the expected running time is thus O(n + 2n), i.e. linear.
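Written out as a single calculation, using the indicator variables X_i defined above, the bound from the preceding paragraphs reads:

\[
\mathrm{E}\!\left[ n + \sum_{i=3}^{n} X_i \cdot i \right]
  = n + \sum_{i=3}^{n} \mathrm{E}[X_i] \cdot i
  \le n + \sum_{i=3}^{n} \frac{2}{i} \cdot i
  = n + 2(n-2)
  \le 3n \in O(n).
\]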
