Archive for April, 2011

April 29, 2011

In Defence of 3D Charts…

There was a time, not so long ago, when pretty much all business graphics looked like this:

image  image

The global hegemony of Microsoft Excel (particularly, Excel 97), deployed onto millions and millions of users’ desktop PCs meant that the world was awash with poorly-labelled, hideously-coloured, and sometimes just plain wrong line charts and bar graphs. I’m not blaming Excel itself – it was a fantastically successful product – but the problem was that a huge number of people were producing business reports who had no training on the software, little background in statistics itself, and certainly no explicit knowledge of data visualisation methods (which, at the time, were unheard of as an academic discipline).

But then things got even worse, as people discovered the ability to create “3d” charts. All of a sudden, the unattractive but essentially harmless business graphics above morphed into things like this:

image  image

The general criticisms of these “3d” charts are quite well known, so I’m not going to repeat them here. However, to give two specific comments on the examples above:

  • The artificial “perspective” added to the pie chart distorts the true relationship between the values in the series. For example, what is the relative size of the large purple slice compared to the blue slice? You can’t tell because the purple slice is shown closer to the viewer, enlarging its relative area and giving it distorted prominence.
  • In the line chart, the shadow effects added to the extruded ends of each column make it difficult to read across to the axis to find out the exact value of an item. Furthermore, in several places the blue columns in the foreground (inquiries) obscure the red columns behind (sales).

“3d” charts enjoyed a brief life as the popular visualisation of choice, seen to make data more exciting and, thus, making managers pay more attention to it (also perhaps in the hope that adding “depth” to a graphic added depth to the corresponding data analysis?!).

However, before too long, they were debunked as being nothing more than marketing gloss. In fact, much more than that, 3d charts were proclaimed as being misleading and dangerous, and were wholeheartedly rejected by most members of the data visualisation community. In a write-up of one of the Jen Stirrup’s sessions from SQLBits8 conference last month, Luke Merrett describes the learning point he takes away as “…you should never, EVER use 3D graphs for reporting…”.

I wonder if 3d charts, which not so long ago were being enthusiastically embraced and adopted, have now been wholeheartedly disregarded with equal lack of consideration. So, I’d like to take this opportunity to stick up for 3d graphs and explain why, in certain situations, I believe they are an appropriate way to visualise sets of data.

“3D” Graphs or 3D Graphs?

Before I do, I need to clarify something. In the preceding paragraphs, you might have noticed that I carefully referred to “3d” graphs, not 3d graphs. Why?

As far as I’m concerned, the examples of the tilted pie chart and the extruded column chart with shadows shown above are not 3d charts. They are 2d charts that have had a graphic effect applied to them to give them a sense of depth. That effect is not in any way linked to the underlying data set – it has no corresponding dimension (from a data point of view). These arbitrary graphic effects take up space in the chart without adding any further information about the data, which is pretty much the definition of “chart junk”.

I referred to these images as “3d” charts, but they should really be described as two-and-a-half dimensional charts. Clearly (unless you’ve got some sort of very expensive holographic-projection monitor), it’s impossible to create a chart that is displayed in three-dimensional space. However, it is possible to create a 2-dimensional representation of a chart that represents data in 3 dimensions, and that is what I regard as a 3d chart.

As an example, yesterday I was doing some performance testing of spatial indexes in SQL Server Denali. There are lots of factors that can influence the performance of SQL Server spatial  queries, but I decided to focus on two factors – the number of rows in the underlying base table, and the selectivity of the query window (measured as a percentage of the bounding box). I ran a test query repeatedly, changing these two independent variables and measuring the execution time of the query (my dependent variable).

At the end of my tests, I had gathered over 15,000 individual test results, and I decided to plot them using a 3d surface chart as follows:

image

Why do I think this 3d graph is suitable for this data?

  • It’s clear to see, at a glance, the way in which, separately, the number of rows in the table and the size of the query window affects execution time, and also the way in which they affect it in combination. Since I’m using only a single set of axes for all my data, you can easily see which of the independent variables has a greater effect on the dependent variable.
  • With the exception of a few spikes, the dependent variable (execution time) increases monotonically with both independent variables. That is to say, as the number of rows in the base table increases and the size of the query window increases, the execution time will always increase (not necessarily at a consistent rate, but it doesn’t fall). This means that, so long as the axes are oriented correctly, there is no risk of a higher value in the foreground of the chart obscuring a smaller value behind it.
  • I don’t need to know exact values corresponding to any point on this graph. The results illustrated in this chart were obtained from a test system and are intended to be illustrative of the shape of the graph – the exact number of milliseconds taken to execute a query will depend on the exact configuration of the server and the query in question anyway, so it is not important for people to be able to tell these. (In fact, I could remove the z axis labels altogether, but I’ve left them in to demonstrate that it a linear scale has been used). At a broad level, colour is used to double-encode the z value into one of a number of distinct categories so you can see the “bands” of execution times.
  • This graph plots execution times of a spatial query, and the audience who are interested in the results are familiar with examining representations of spatial data, including 3d terrain maps. I’d therefore argue that the presentation method of this data has been chosen with consideration of the likely end user (another thing often forgotten by people presenting data), and vaguely resembles a terrain map at the base of a mountain, for example.

It has to be said that the Excel implementation of a 3d surface chart has a few deficiencies – I’d like to have been able to set a semi-transparent fill on the surface, for example. Also, the two independent variables are treated as distinct categories and must be equally spaced along the x and y axes. If you’re plotting continuous independent variables, as I was in this case, you therefore need to ensure that your data is sampled at consistent intervals.

My next challenge, of course, is that I’m currently only considering two of the variables that effect performance of spatial queries. I now need to think about the effect of other factors: the size of the bounding box, the cells_per_object limit, and the resolution of the grid cells in the index, for example. To display the results taking these factors into consideration, I’m thinking about a panel of 3d surface charts… if any “3d chart”-deniers have any comments, or can propose a better solution, I’d love to hear from you!

April 27, 2011

Categorising Open Street Map Road Types For Display and Routefinding

At the end of one my previous posts, I displayed a set of ways imported from Open Street Map in SQL Server Management Studio. It looked like this:

image_thumb1

Now, as I stated previously, OSM ways don’t equate to roads. Ways can be any arbitrary series of nodes, so although at first glance it may appear to be so, the map above does not represent a roadmap. A single way may represent several roads, or only a single segment of one road. Many ways are nothing to do with roads at all – they may be rivers, or railway lines. Ways may also denote the boundaries of an area, such as a county, a park, or a building.

To create a dataset of OSM roads (or footpaths, tracks etc.) suitable for routefinding or display in a road map, it is necessary to retrieve only those ways that contain a tag element with a k attribute of “highway”. The corresponding v attribute describes the type of highway. Examples of possible values include:

  • cycleway
  • footway
  • motorway
  • path
  • pedestrian
  • primary
  • residential
  • road
  • secondary
  • service
  • steps
  • tertiary
  • track
  • unclassified

Note that this list isn’t exhaustive – the design of the OSM schema means that editors can tag ways or nodes with any values, but this is a list of some of the commonly-used tags.

There are many reasons why you might want to categorise each of these highway types separately.

  • Consider access for different modes of transport, for example; clearly, a car can’t go down a cycleway. Nor can a tractor go on a motorway, or a cycle go down steps (unless you’re planning some sort of stunt bike ride).
  • If you’re designing a route-finding application, you might want to consider and compare the relative costs of travelling down different routes. Motorway segments generally have a higher speed limit than primary roads, which in turn have higher average speed than secondary roads, etc. Therefore, a route that maximises the percentage travelled on higher-status roads may well be shorter in time, even if it covers a longer distance.
  • When drawing features onto a map, it’s usual to display different categories of highways with different styles (e.g. motorways coloured blue, primary roads thicker than secondary, tracks as dotted/dashed lines).

As a simple example, to style the Spatial Results tab view of my OSM map to show different categories of roads with different thicknesses I created a table to attach a weight to each highway type, as follows:

CREATE TABLE RoadWeights (
  highwaytype varchar(32),
  roadweight int
);
INSERT INTO RoadWeights VALUES
(‘motorway’, 15),
(‘motorway_Link’, 15),
(‘trunk’, 10),
(‘trunk_Link’, 10),
(‘primary’, 8),
(‘primary_Link’, 8),
(‘secondary’, 5),
(‘tertiary’, 3),
(‘residential’, 1),
(‘unclassified’, 0);

Then, I edited my SELECT query to only display rows from my Ways table that were tagged with one of my chosen highway types, and buffered the geography LineString representing each road by the corresponding weight from the RoadWeights table:

SELECT
  w.wayid,
  wt.TagValue AS highwaytype,
  w.geog4326.STBuffer(ISNULL(roadweight,0))
FROM
  ways w
  INNER JOIN waytags wt ON w.wayid = wt.wayid AND wt.TagName = ‘Highway’
  LEFT JOIN RoadWeights rw ON wt.TagValue = rw.highwaytype
WHERE
  wt.TagValue IN (‘motorway’, ‘motorway_Link’, ‘trunk’, ‘trunk_Link’, ‘primary’, ‘primary_Link’, ‘secondary’, ‘tertiary’, ‘residential’, ‘unclassified’)

Even when viewed in the SSMS Spatial Results tab, the map already becomes much cleaner than the result shown at the top of this post – with the Norwich inner ring road and its primary arterial roads now clearly visible (and also, the train tracks coming into the railway station on the south east of the city are no longer shown)

image

Clearly you wouldn’t normally optimise your dataset just for the purposes of display in the SSMS spatial results tab, but you can apply this same technique to attach any other properties that correspond to the type of road – the average road speed limit, accessibility, or styling options that should be passed to a front-end display, for example.

image

April 25, 2011

Nearest-Neighbours, Voronoi Diagrams, and Finding your nearest SQL Server Usergroup

As somebody interested in all things spatial, I was delighted that the community corner at SQLBits8 featured an interactive mapping application. Many people commented on how much work had obviously gone into the user interface: the ability to add “pushpins” was particularly realistic, and the base dataset, despite being centred on the UK, was designed to be modular and extendable – with users adding in further layers for Italy, Germany, Holland, Sweden, Norway, Denmark, Austria and Australia.

If you’re wondering, the application front-end interface in question looked like this:  (red pins are usergroup locations, blue pins are attendees)

image

The purpose of the map was obviously both to visualise the geographic distribution of attendees coming to SQLBits, as well as to show people the location of their closest usergroup. Seeing as this was a SQL Server conference, it gave me an idea on how you could analyse this information with SQL Server spatial…

Displaying Locations of UK SQL Server Usergroups

To start with, we can create a table with geography points representing regional SQL Server usergroups in the UK (information kindly provided by tsqltidy).

image

To show the locations where each usergroup is held, we can then simply select each point, buffer them by a reasonable amount (10,000m, in this case), and display them against a background map of Great Britain.

SELECT name, location.STBuffer(10000) FROM SQLServerUGs;

Here’s what the results of that query look like in the SSMS Spatial Results tab:

image

That’s the objective of visualising the locations of all UK usergroups completed…. now, what about the other purpose of the map – to let users located their closest usergroup. This is commonly known as a “nearest-neighbour” query. So how would we go about doing this?

Nearest-Neighbour Queries

I don’t know the actual locations of all the attendees of SQLBits, and I can’t be bothered to trace their locations from the little blue pins in the first photo of this post, so let’s create some dummy data instead. The following code will create a table containing the locations of 800 fictional attendees, all situated somewhere on GB mainland (maybe there are some SQL Server DBAs out in the North Sea on oil rigs or pirate ships, but let’s just assume not for the sake of simplicity):

CREATE TABLE #SQLBitsAttendees (
id int identity(1,1),
location geography
);
SET NOCOUNT ON;
DECLARE @i int = 0;
WHILE @i < 800 BEGIN
DECLARE @p geography;
SET @p = geography::Point(RAND()*10+50, RAND()*8-6, 4326);
IF (@p.STIntersects(@GB)=1)
BEGIN
INSERT INTO #SQLBitsAttendees(location) VALUES (@p)
SET @i = @i + 1;
END
END

And here they all are  (including a fictional attendee from the Shetland Islands, from the looks of it):

image

Now, to find out the location of the closest usergroup for each attendee, you could use the STDistance() method to measure and sort the distance from each usergroup, then select the TOP 1 closest in a subselect, as follows:

SELECT
id,
location,
(SELECT TOP 1 name
FROM #SQLServerUGs ug
ORDER BY ug.location.STDistance(a.location)
) AS nearestUG
FROM #SQLBitsAttendees a

image

The problem with this approach is that, in order to determine the closest usergroup, the subselect statement has to evaluate the distance from every usergroup, only to select the top one. What’s more, in SQL Server 2008/R2 this query cannot take advantage of an index (spatial or otherwise), so it involves a full scan of the usergroups table for every attendee. Not good.

In SQL Server Denali, there’s a new nearest-neighbour query plan that does allow the query optimiser to use a spatial index for these types of queries, so long as they are expressed using a particular pattern. To use the nearest-neighbour plan in SQL Server Denali for this example, we have to add a IS NOT NULL condition to the STDistance() method, as follows:

SELECT
id,
location,
(SELECT TOP 1 name
FROM #SQLServerUGs ug
WHERE ug.location.STDistance(a.location) IS NOT NULL
ORDER BY ug.location.STDistance(a.location)
) AS nearestUG
FROM #SQLBitsAttendees a

This improves the performance significantly, but it’s still not an ideal query design.

One alternative approach could be to place a constraint on the subselect query, by assuming that nobody wants to travel more than 100km to attend a usergroup:

SELECT
id,
location.STAsText(),
(SELECT TOP 1 name
FROM #SQLServerUGs ug
WHERE ug.location.STDistance(a.location) < 100000
ORDER BY ug.location.STDistance(a.location)
) AS nearestUG
FROM #SQLBitsAttendees a

The advantage of this approach is that a spatial index can be used to fulfil the STDistance()query predicate (in both SQL Server 2008 and Denali), so this can efficiently reduce the number of rows that need to be evaluated and sorted. An added benefit is that it also allows us to identify all those users who are further than 100km from their closest usergroup, since the subselect will return NULL for those records:

image

(Perhaps users with a NULL nearestUG should consider starting their own local group?)

This approach is still not perfect though, since it relies on a fairly arbitrary limit of 100km for the limit within which to search. Setting this limit too high and we leave ourselves with a lot of data still to sort. But set it too low and some keen DBAs and Devs might be willing to travel further than the specified limit, and would miss out on knowing what their closest UG would be.

Another alternative solution, first proposed by Isaac Kunen, is to make use of a numbers table to create a query that looks for nearest neighbours in an expanding series of search ranges. The initial search area is set small, but then expands exponentially outwards until a neighbour is found. A query using this logic looks something like this:

DECLARE @start FLOAT = 1000;
WITH NearestUGs AS
(
SELECT TOP(1) WITH TIES *,  T.g.STDistance(@x) AS dist
FROM Numbers JOIN T WITH(INDEX(spatial_index))
ON T.g.STDistance(@x) < @start*POWER(2,Numbers.n)
ORDER BY n
)
SELECT TOP(1) * FROM NearestUGs
ORDER BY n, dist

(for explanation of what’s going on here, I suggest you refer to Isaac’s post).

Empirically, this query works pretty well – but it’s not exactly pretty to look at and it’s pretty daunting unless you’re proficient with fairly advanced T-SQL as well as with the nuisances of dealing with spatial data. So, there’s still scope for a better solution.

Enter Voronoi diagrams…

Voronoi Diagrams

A Voronoi diagram is a formed from a set of underlying points (called Voronoi sites). Around each site is a Voronoi cell, which is the area made up of all those points that lie closest to that site than to any other site. The edges at which two Voronoi cells meet is therefore constructed from all those points that lie equidistant between the two sites.

The complete set of Voronoi cells form a tessellating, non-overlapping set of of Polygons that cover the full extent of a set of data, known as a Voronoi tessellation.

Voronoi diagrams are closely related to Delauney triangulations (which I also used in my post about alpha shapes). In fact, if you connect the circumcenters of all the triangles in a Delauney triangulation, you’ll create a Voronoi tessellation of the original set of points, as shown below:

image

Delauney triangulation (black lines) and Voronoi tessellation (red lines) of a set of points. Image from http://en.wikipedia.org/wiki/File:Delaunay_Voronoi.png

The simplest (although not necessarily the most efficient) way to construct the Voronoi tessellation of a set of points is to take advantage of this duality. First, create the Delauney triangulation of the points, and then connect the circumcenters of all those triangles that have a common vertex to create the Voronoi cell around the site at that vertex. You can find more details on the algorithms required to create Delauney Triangulations and Voronoi tessellations at http://www.cs.cmu.edu/~quake/triangle.html and http://paulbourke.net/papers/triangulate/, as well as plenty of other resources on the internet. You know where to look.

What’s all this got to do with nearest neighbour queries? Well, suppose we create a Voronoi tessellation based on the set of points representing the location of SQL Server UGs. The result will be a set of (convex) polygons, in which each polygon contains exactly one usergroup, and all those points closest to that usergroup than to any other. If we store each Voronoi cell as a geography Polygon and clip them to the outline map of Great Britain, we get something a bit like this:

image

You can think of the coloured polygons as representing the “catchment area” for each of the usergroups. In order to work out the closest usergroup for any attendee, all we therefore have to do is to work out which polygon their location lies in, which can be done with a simple STIntersects() query:

SELECT
a.id,
a.location,
ug.name AS nearestUG
FROM #SQLBitsAttendees a
JOIN SQLServerUGs ug ON a.location.STIntersects(ug.area) = 1

And here they all are, assigned to the correct UG, and much quicker than any of the preceding queries (as the STIntersects() predicate is fully capable of taking advantage of any spatial index):

image

image

The only thing to remember is that you will have to recreate the underlying Voronoi tessellation every time the distribution of usergroups changes (i.e. every time a usergroup is added/removed/or starts meeting at a new location). However, we can probably safely assume that new usergroups don’t pop up every day, and it doesn’t take that long to recreate the tessellation anyway. For nearest-neighbour queries where the “neighbours” are relatively static, Voronoi tessellations are therefore a very good choice.

You can download the code used in the examples above from here.

Tags: ,
April 21, 2011

The Bing Read Write World

During his session yesterday at the Where 2.0 conference, Blaise Aguera y Arcas showed the first public demo of where Microsoft are going in the mapping space, monikered (slightly oddly, I find) the Read/ Write world. Described as “an indexing, unification, and connection of the world’s geo-linked media”, it links together elements of Bing Maps, Photosynth, Streetside, Azure, Deep zoom and others into a common framework, with potentially very interesting consequences. You can read more, and see some demonstration applications of the technology at http://readwriteworld.cloudapp.net/ and be sure to check out the following video

It’s obviously very early days, but some of my initial thoughts are as follows:

Finally, some cohesion!

Internally, the MS Photosynth team has been part of the Bing Maps team for several years, but with not much obvious link between the two technologies (other than a fairly trivial way of linking the location of photosynths to a map). More recently, the Bing Maps team became part of the Bing Mobile team. Again, other than the fact that Bing Maps was a component of the WP7 OS, there wasn’t much obvious reason why…

…finally, we’re now starting to see publicly some of the reasoning behind these decisions, and the strategic direction that MS is taking to try to integrate these technologies. For an example, check out the section of the video above, from about 00:45 to 01:30. You’ll see a (3D) Bing Maps street view, with overlaid thumbnails from streetside (car) imagery. This then changes to a horizontal pan along the street images (streetslide), before cutting to a video link entering into the shop. These video transitions link two photo panoramas shot inside the building, which incidentally, could have been recorded by almost anyone using the new iPhone photosynth app….

image

So, that’s Bing Mobile, photosynth, and maps working together for the first time in an application…

Licensing

It’s interesting that, even though this technology has only just been publicly previewed for the first time and is far away from being production-ready, Microsoft have already published an initial explanation of the licensing rights involved in the project. This is obviously something they want to make sure they get right.

Another interesting thing to note is that Microsoft have publicly stated a strong support for the Creative Commons licences throughout the project, with content creators of photos, videos, or panoramas used in Read-Write world retaining “complete control over whether others, including Microsoft, can display it, mash it up, or otherwise present it.”.

I wonder if this is any way a response to some of the recent criticism levelled at the Google MapMaker product (also demo’ed at Where 2.0), which encourages “citizen cartographers” to add their own data to Google Maps but, in doing so, relinquish their rights to that data to Google. (Compare this to Open Street Maps, which has been enabling citizen cartography for many years and uses the Creative Commons ShareAlike 2.0 licence, which allows anyone to use OSM data so long as it is credited appropriately).

There’s already announcements that Bing will be working with 360 Cities to use (and re-use) CC panoramas, as well as CC images through Flickr.

Technical

There are a few demo apps showing examples of web services based on the technology at http://readwriteworld.cloudapp.net/, and there’s also a brief description of the technical spec of the new RML (Reality Markup Language) used to bind the whole lot together. However, at this stage, the demos are only based on a very limited set of data. I’ll certainly be going back to play with these when there’ a bit more data available (MS – how about adding Flickr photos from Norwich rather than base all your demos on Seattle? We have beautiful buildings here! Two cathedrals, a castle, the UK’s largest permanent outdoor market, a pub established in 1249AD…)

Follow

Get every new post delivered to your Inbox.

Join 53 other followers