Posts tagged ‘Nearest Neighbour’

March 11, 2012

“Point-of-Maximal Distance” queries (AKA Variations on a Theme of Nearest-Neighbours)

Nearest neighbour queries (or, for my international readers, nearest-neighbor queries), are a pretty common query pattern used in location-based applications and spatial analysis:- given a set of known locations of restaurants, customers, stations etc.,  “What/where is the nearest X to a given location?” I’ve already written about performing nearest-neighbour queries with SQL Server spatial datatypes here and here.

Occasionally, you’ll see this pattern reversed – “What is the X furthest away from my current location?” – which could be referred to as a farthest-neighbour query.

However, there’s another variant that I’ve not often seen discussed, which is: “Given a set of known locations, where is the point on Earth that maximises the distance to the closest X.” In other words, where would you have to go to get as far away as possible from any item in the given set of data? The key difference to appreciate about this query structure is that, unlike the first two examples, the answer is not selected from one of a set of known locations – it is location that must be determined programmatically to maximise the distance from X. This makes the query a lot more complicated, but also more interesting. (I don’t know if there is a generic name for this type of query or not…  a “point-of-maximal distance” query?).

I can think of several potential practical applications for this type of query – a fire response service may want to identify the location at increased risk because it is furthest away from any of their response stations, for example. Or, a mobile telecommunications operator might want to identify the “darkest spot” in their network coverage – the point that lies furthest from any of their transmitters.

As ever, I think a practical example might help demonstrate. One of my recent posts described how to import a set of point data representing Tescos stores in the UK. People that know me will know that I’m a Green, tree-hugging, anti-capitalist hippie, so let’s suppose that I wanted to move house to get as far away from Tescos as possible (for my international readers, Tesco is a multi-billion pound supermarket chain in the UK often presented as one of the faces of modern global capitalism – I think you have similar companies like Wal-Mart or Subway).

I’m going to set an additional constraint that, while I want to live far from Tescos, I still want to live in mainland UK. So, my problem is to find the point that lies somewhere inside this Polygon that is furthest away from any of the dots (which are the locations of Tescos stores):

image

So…. how do that?

Method 1: Iterative T-SQL

To start, I’m going to create the shape of the UK as a geography MultiPolygon called @UK. I’ve already got a table containing various parts of the UK that I can aggregate, but I want to exclude the Scottish Islands and other small polygon geometries that will confuse matters (Scottish islands = good for single malt whiskey, bad for spatial data… they’re just too craggy!). So, I’ll create a simple numbers table in a CTE and join to that table in order to loop through each element in the UK table and only aggregate those polygons that are greater than 10,000km2 in area:

DECLARE @UK geography;

-- Create a MultiPolygon of the UK, omitting any islands
-- with area less than 10,000km2
WITH Numbers(n) AS ( 
  SELECT ROW_NUMBER() OVER (ORDER BY number) FROM master..spt_values
)
SELECT @UK = geography::UnionAggregate(Shape.STGeometryN(n))
FROM UK JOIN Numbers ON Numbers.n  10000000000;

I’ll also create a geography MultiPoint of all Tescos stores called @Tescos. The only slight complication here is that the Garmin POI dataset I’m using seems to have some erroneous data, so I’ll put a WHERE condition to filter only those records that lie within the approximate geographic bounds of the UK:

DECLARE @Tescos geography;

-- Create a MultiPoint of Tescos stores
SELECT @Tescos = geography::UnionAggregate(Location)
FROM TescoStores
WHERE Location.Long BETWEEN -10 AND 3
AND Location.Lat BETWEEN 50 AND 60;

Now that we’ve got the basics set up, how do we go about finding the part of @UK that lies furthest from @Tescos? To explain one approach, consider an analogy in which the shape of the UK is cut out of blotting paper. You then arrange a set of pipettes filled with ink, so that each is positioned above the location of a Tescos store. And then, all at the same time, you start dropping ink from the pipettes onto the location of the stores.

The ink blots centred on the location of each store will expand outwards at the same rate, gradually filling up the blotting paper shape of the UK. The last part of the paper to be covered by ink is that which lies furthest from any of the ink sources, and must be the solution.

We can recreate this approach using SQL Server’s spatial functions:

  • To mimic the expanding ink blots we can call STBuffer() on the @Tescos MultiPoint with increasing buffer sizes to create expanding circular zones centred around each store.
  • To determine the area of blotting paper that is still not covered by ink, we can use STDifference() to compare the inked area to the original shape.
  • If the part of the UK yet to be covered by ink is still of a sufficiently large area, we simply add more ink by looping through again with an increased buffer size. If, however, the whole UK has been covered by ink then we’ve put too much ink on and we need to reduce the buffer size slightly.
  • When the area yet to be covered reaches a sufficiently small size (this approach is unlikely to ever converge on a single Point), declare that to be the solution.

Here’s the query:

DECLARE @x float = 100000; -- Declare an initial buffer size. I'm guessing that the furthest point in the UK lies more than 100km from Tescos
WHILE @UK.STDifference(@Tescos.STBuffer(@x)).STArea() > 10000 -- Stop when we find an area less than 10km2 in area
BEGIN

  -- Print a little debug message to see what's happening on this iteration
  PRINT CAST(@UK.STDifference(@Tescos.STBuffer(@x)).STArea() AS varchar(32)) + 
        'm2 of the UK lies more than ' + CAST(@x AS varchar(32)) + 'm from the closest Tescos';

  -- Amount by which to increase the buffer size
  DECLARE @y float;

  -- Fairly arbitrary, but enlarge the buffer by the square root of the remaining area
  -- Intention here is to close in on the appropriate buffer value relatively quickly
  SET @y = SQRT(@UK.STDifference(@Tescos.STBuffer(@x)).STArea());

  -- Check whether increasing the buffer by this amount would be too great
  -- and lead to the whole UK being covered in ink!
  WHILE @UK.STDifference(@Tescos.STBuffer(@x + @y)).STArea() = 0 BEGIN

    -- Reduce the amount by which the buffer is increased
    SET @y = @y/2;
  END

  -- Apply the increase in buffer
  SET @x = @x + @y;

END

Once this loop has finished, @x will contain a value, measured in metres, that represents the radius to which each of the ink blots must expand so that the area left uncovered has an area of less than 10km2 . Here’s what the map looks like when a circular buffer of radius @x is created centred on the location of each Tesco store:

SELECT @Tescos.STBuffer(@x)
UNION ALL SELECT @Tescos
UNION ALL SELECT @UK;

image

As you can tell, the whole of the UK is pretty much covered – the iteration only stops when the area remaining uncovered is less than 10km2 in in size (WHILE @UK.STDifference(@Tescos.STBuffer(@x)).STArea() > 10000), which means that, at this zoom level, it’s going to be pretty hard to perceive.

To find where the remaining un-inked area is, we can SELECT @UK.STDifference(@Tescos.STBuffer(@x)), and, to make it more obvious, I’ll also add a buffer to enlarge the solution area. And the answer? (as you might have been able to tell from the previous image) is the small blue polygon shown below:

image

Perhaps unsurprisingly, it’s an area at the tip of the highlands of Scotland, and it’s 119,591 metres from the closest Tesco store. Ok, so it’s not an exact point, but the way that this iteration works it would be very hard to reach such a precise location.

Method 2: SQLCLR Triangulation?

My gut instinct told me that there should be a way of coming up with the exact solution to the problem, in a much simpler manner, using triangulation. For example, suppose you were to create the Delaunay triangulation of the set of points such that the vertices of each triangle were formed from the location of the Tesco stores:

image

Since the triangles are formed by connecting vertices of Tesco stores, in those areas where stores are further apart, the circumcircle (the point of equal distance from all three vertices) of the triangle created between them will be larger. In fact, you’d think the point that lies at the centre of the triangle of largest circumradius would be the point that lies furthest from any Tesco store. And this would be true, were it not for (quite literally) “edge cases”…

… when, as in this case, the solution lies on the edge of the area in question, it is not covered by the extent of the triangulation (which only creates triangles that are interior to the set of data). Only by artificially extending the triangles out from the set of Tesco stores towards infinity could we ensure that the whole UK had been covered, and it turns out the SQL Server is not too happy about creating infinite-sized triangles (who’d have thought?).

So that’s a bit disappointing… the iterative T-SQL method still works and, with a bit of tweaking to choose how the buffer should be refined on each iteration, could probably be made to work more efficiently. However, I was hoping to be able to get something exact using SQLCLR.

I should just point out that the fact the triangulation method doesn’t extend beyond the bounds of the supplied dataset can still make it useful for lots of other purposes. For example, creating an aggregate union of the triangles creates a concave hull around the location of all Tescos stores in the UK (not a convex hull as could be created by STConvexHull()). The result looks scarily like the shape of the UK itself:

image

And, deriving a Voronoi tessellation from the set of triangles creates the “catchment area” around each store – the area formed from all those points closest to that store than any other. You can think of this as the materialised form of the results of a nearest-neighbour query:

image

(Note that, as expected, the point that lies furthest away from a Tescos store is contained within the largest Voronoi polygon – the pinkish one at the top of the map). At first, the Voronoi tessellation and the Delaunay triangulation may appear similar – indeed, they are “dual structures” – while Tesco stores form the vertices of each triangle in the Delaunay triangulation, they are the sites around which each polygon is created in the Voronoi tessellation.

If anybody can suggest any alternative approaches to identify the “point of maximal distance” from a set of data, I’d love to hear it…

Note: The SQLCLR code for creating the Delaunay triangulation, concave hull, and Voronoi tessellation demonstrated in this post is taken from “Pro Spatial with SQL Server 2012”, which also contains a chapter explaining these topics in more detail.

February 22, 2012

Updating a SQL Server Table with Nearest Neighbours (Optimised for SQL Server 2012 / SQL Azure)

So I’ve been trying to think of some examples that make use of the Garmin POI data from my last post. Seeing as POI data naturally lends itself to nearest-neighbour queries, I thought I’d write a little about those.

The general form of the nearest neighbor query is: “Show me the closest x restaurants/pubs/train stations to this location”; this is the query pattern used by location-aware applications to find POIs close to the user’s current location, for example. However, another pattern of which I’ve seen much fewer examples is how to update an entire table of records in order to determine and populate the nearest neighbour for each row in the table (indeed, this very question came up on the MSDN spatial forum just last week).

With this in mind, I thought I’d demonstrate a practical scenario: let’s say you’re planning going for a swim at the local pool and, after you’ve swum 50 lengths, you’ll probably feel a bit peckish – perhaps in the need for some fish ‘n’ chips. So, in this example I’m firstly going to import the Garmin POI data for swimming pools and fish ‘n’ chip shops in the UK, each being represented by a geography Point instance. I’ll then add an extra column to the swimming pool table and populate it with the closest fish and chip shop to every swimming pool.

To start with, create tables containing the data for fish and chip shops and swimming pools (download the data from here). Notice that I’ve included an IDENTITY primary key field in each table, which I’ll need in order to add a spatial index later:

SELECT
  Id = IDENTITY(int,1,1),
  CAST(Name AS varchar(255)) AS Name,
  CAST(Details AS varchar(255)) AS Details,
  geography::Point(Latitude,Longitude,4326) AS Location
INTO SwimmingPools
FROM OPENROWSET(
 BULK 'C:\Users\Alastair\Downloads\GarminPOI\Swimming Pools.csv',
 FORMATFILE='C:\temp\garminpoi.xml',
 FIRSTROW=2
) AS SwimmingPools;
GO

ALTER TABLE SwimmingPools ADD PRIMARY KEY(Id);
GO

SELECT
  Id = IDENTITY(int,1,1),
  CAST(Name AS varchar(255)) AS Name,
  CAST(Details AS varchar(255)) AS Details,
  geography::Point(Latitude,Longitude,4326) AS Location
INTO FishAndChips
FROM OPENROWSET(
 BULK 'C:\Users\Alastair\Downloads\GarminPOI\Fish and Chips.csv',
 FORMATFILE='C:\temp\garminpoi.xml',
 FIRSTROW=2
) AS FishAndChips;
GO

ALTER TABLE FishAndChips ADD PRIMARY KEY(Id);
GO

Now we’ll add some extra columns to the Swimming Pools table, which we’ll populate with information about the closest fish ‘n’ chip shop:

ALTER TABLE SwimmingPools
ADD ClosestFishbar varchar(255),
    Distance decimal(18,2);

Now, you don’t need to add a spatial index to perform the update query, but it’ll be a lot faster with one than without. SQL Server can only make use of one spatial index in a join between tables, so there’s no need to add an index to both tables. We’ll just add one to the Fish And Chips table, as follows:

CREATE SPATIAL INDEX sidx_FishAndChips ON FishAndChips(Location)
USING  GEOGRAPHY_GRID 
WITH (
  GRIDS =(LEVEL_1 = HIGH,LEVEL_2 = HIGH,LEVEL_3 = HIGH,LEVEL_4 = HIGH), 
  CELLS_PER_OBJECT = 16
);

Then, run an UPDATE query to populate the ClosestFishBar and Distance columns, as follows:

UPDATE s
SET ClosestFishbar = fName,
    Distance = Location.STDistance(fLocation)  
FROM (
  SELECT
    SwimmingPools.*,
    fnc.Name AS fName,
    fnc.Location AS fLocation
  FROM SwimmingPools 
  CROSS APPLY (
    SELECT TOP 1
     Name,
     Location
    FROM FishAndChips WITH(index(sidx_FishAndChips))
    WHERE FishAndChips.Location.STDistance(SwimmingPools.Location) IS NOT NULL
    ORDER BY FishAndChips.Location.STDistance(SwimmingPools.Location) ASC
  ) fnc
) s;

This query probably needs some explanation:

  • Generally, you might expect to use a correlated subquery to obtain the closest fish and chip shop to each swimming pool. However, correlated subqueries can only return a single value, and I want to retrieve both the name and the distance to the closest fish bar, so I've used a CROSS APPLY instead.
  • In the function being applied, I’m sorting the table of fish and chip shops by ascending order of their distance from the current swimming pool (ORDER BY FishAndChips.Location.STDistance(SwimmingPools.Location) ASC), and then selecting the TOP 1 Name and Location (i.e. the closest).
  • To make the query efficient, I’ve also added an index hint to use the spatial index on the Fish and Chips table – WITH(index(sidx_FishAndChips)), and I’ve included an extra predicate, WHERE FishAndChips.Location.STDistance(SwimmingPools.Location) IS NOT NULL. This extra predicate will help the query optimiser choose the dedicated nearest neighbour query plan introduced in SQL Server 2012 and SQL Azure.
  • Note that the spatial index-optimised nearest neighbour query plan is only available in SQL Server 2012/SQL Azure. If you try to execute the UPDATE query as written above in SQL Server 2008/R2 then you’ll get an error:

    The query processor could not produce a query plan for a query with a spatial index hint.  Reason: Spatial indexes do not support the comparator supplied in the predicate.  Try removing the index hints or removing SET FORCEPLAN.

    As stated, to resolve this you’ll have to remove the WITH(index(sidx_FishAndChips)) index hint (and also expect your query to run a lot slower!). Instead, you might need to try one of the alternative approaches to efficiently finding nearest neighbours in SQL Server 2008/R2.

  • Finally, in order to perform the UPDATE, I’ve wrapped the whole lot in a table alias, s, and then updated the two columns as required.

This query should take a few seconds to run (or, if you don’t create/use the spatial index, a few minutes), after which you can check the results as follows:

SELECT * FROM SwimmingPools;

And there you have the complete table of swimming pools, each listed with the name and distance to its closest fish and chip shop:

image 

Seeing as it’s less than 100 metres from getting out of the water to a bag of chips, I’ll think I’ll be taking my next dip at the Farnworth swimming pool….

Follow

Get every new post delivered to your Inbox.

Join 53 other followers