Archive for July, 2011

July 31, 2011

Spatio-temporal Event Processing with StreamInsight, SQL Server Denali, and Bing Maps – Part 1

Following a recent conversation with Isaac Kunen from the StreamInsight team, I decided that I should find out more about Microsoft’s complex event processing engine. Technology moves so fast that there are probably a hundred topics that I wish I’d know more about and, although I had a superficial idea what StreamInsight does, I’d never really looked at it in much depth because I didn’t think I had any particular use for it. So this post captures my first baby steps with using StreamInsight – hopefully it should provide some guidance for anyone else who hasn’t tried it yet, and perhaps a few chuckles for those people who are already experts in the field.

What Is StreamInsight, anyway?

The first hurdle I had to overcome was not a technical one, but a mental one: what exactly is StreamInsight and what it it for? You can read all the release material from Microsoft which I won’t bother repeating here, but these are the things I had to get my head around:

  • Firstly, although it’s branded and licensed as part of SQL Server, StreamInsight really has nothing to do with SQL Server, either in terms of architecture or purpose. You don’t need to have SQL Server installed on the machine on which StreamInsight runs, for example, and you won’t see a line of T-SQL code in StreamInsight.
  • StreamInsight is an entirely in-memory engine for processing streams of input (i.e. a time-ordered series of events) – it can write output to SQL Server, but it can also write to a Console app, WCF service etc. It doesn’t have to persist any data at all.
  • Although the examples tend to focus on scenarios in which you’d have live streams of event data coming from e.g. monitoring systems, stock trackers etc., you can also use StreamInsight to analyse any temporal data, by streaming it into the StreamInsight engine using an input adaptor from e.g. a text/CSV file or database. The timestamp used to order the events in a stream can be supplied from historical data and does not have to bear any relation to the actual time at which it is received by the engine.
  • You write queries using LINQ to analyse those events passing through the engine. There are different types of temporal query pattern – for example, analysing those events that occur in rolling windows of a fixed time length, or at a particular snapshot in time.
  • Every event carries an associated payload of data – user-defined fields of information attached to that event. These are based on fundamental .NET entity types – string, binary etc. Of interest to me, particularly, is that means you can have spatial information in an event payload, either as WKT, WKB, or just using the native SqlGeography/SqlGeometry serialisation format.
  • The payload information of those events falling within a specified window (or filtered using some other query template) can be aggregated, manipulated, or have other query logic applied. This information can then be routed through to one or more output adaptors, which can write the results to a file, database, or service.

StreamInsight and Spatial. An Example Application Plan.

So, how could I put this all together and make an example related to spatial? Well, here was my plan (explained using my limited knowledge of StreamInsight terminology):

  1. I would create an input adaptor that would load a set of spatial point data that also has an associated temporal value into the StreamInsight engine. The sort of thing I’ve got in mind is a dataset of the location and time at which cases of a particular virus outbreak (e.g. H5N1/Foot and Mouth etc.) were reported.
  2. Since I’m considering notifications to occur at a singular moment in time, I’d use the point event model. (If, instead, I wanted to consider the period of time for which a farm was declared “infected”, I might use an edge model instead)
  3. The payload attached to each event would contain a geography Point instance representing the location at which the occurrence occurred (possibly serialised as Well-Known Binary, or maybe just WKT)
  4. In SI, I’d define a hopping window that would consider, say, all those events in the preceding 3 days leading up to any day.
  5. Then, in my query I’d create a User-Defined Aggregate that took the events in the current window and created a convex hull around them (using the geography ConvexHull() method introduced in SQL Server Denali Microsoft.SqlServer.Types.dll).
  6. Finally, I’d create an output adaptor that would send (via a WCF service?) the resulting geography Polygon of the current event window to be visualised on Bing Maps. The map would refresh to show a time-based spatial analysis of the spread of the virus.

That’s my plan, at least, and if you follow over the next few blog posts you’ll see how I get on in actually achieving it. I haven’t prepared these posts in advance, so there’s a very real chance that I’ll get halfway through and give up, crying. We’ll see.

For this post, I’ll start right at the beginning and just jot down a few notes about getting StreamInsight installed and configured.

Installing and Configuring StreamInsight

  • The latest version of StreamInsight (v1.2), can be downloaded from the StreamInsight Download Centre at http://www.microsoft.com/download/en/details.aspx?id=26720.
  • There are x86 and x64 versions available. Since StreamInsight is not a component of SQL Server, there is no need to choose a version that correlates with any other SQL Server tools you’ve already got installed. I’m running a x64 system so I downloaded the x64 StreamInsight package, even though I’ve got an x86 SQL Server installation on this machine.
  • The package comes in two flavours: one is for client only (i.e. allows you to connect to existing StreamInsight services) whereas the other is the full package. I went with the full shebang option (which, curiously, is exactly the same filesize as the client-only version).
  • The package is very small and installation is quick. What I found a bit odd is that, after installation, I was prompted to install version 3.5 SP2 of SQL Server Compact. Why should a product that is licensed as part of SQL Server 2008 R2 have a dependency on SQL Server Compact? What’s even more odd is that, if install the x64 platform version of StreamInsight, you need to install both the x86 and x64 versions of SQL Server Compact. Fortunately, these are both included in the redist folder of StreamInsight, and only take a few minutes to set up, so it’s not too much of a big deal.
  • After installation, there’s not much to see. Stream Insight is a service and a SDK rather that an “application”. There is one executable added to the Start Menu, which is the Event Flow Debugger. Because I’m the kind of person who clicks on programs that are newly-installed, I tried firing this up – it had already pre-filled in the instance name I supplied during installation, so I clicked OK:

image

Hmm. No Dice:

image

I wondered whether the StreamInsight service was actually running. After all, the installation was very short, and I don’t remember setting up anything relating to service accounts etc. unlike in a SQL Server installation, for example. Sure enough, StreamInsight was set to be a manual service, and wasn’t currently running, so I tried to manually start it:

image

image

Hmmm. Let’s just leave that for a while and look at what else was added to the start menu. There’s a link to the MSDN documentation and a link to some codeplex samples and, well, that’s it, actually.

Documentation is boring so I downloaded and browsed through the codeplex samples instead. The one entitled “PatternDetector” looked interested, so I loaded it up. I changed the server connection string where indicated to provide the name of my StreamInsight server specified during installation:

image

Then I built the package, hit F5 and, what do you know, it worked! At least, I think it worked – here’s the output:

image

So that’s the end of my first experience with StreamInsight. In my next post, I’ll look at how to create a Input Adaptor to load events into the engine containing a spatial payload.

July 29, 2011

sp_help_spatial_geography_histogram and Indexing geography data in SQL Server Denali

Working with spatial data involves many dark arts, and performance tuning spatial indexes is perhaps the blackest of all of them. Although there are only a few parameters to configure: GRIDS, CELLS_PER_OBJECT, and BOUNDING_BOX, the number of different index combinations possible and the way in which these settings interact with each other is pretty complex. This is particularly true when dealing with geography data, since even the fundamental concept of defining a “grid” becomes quite tricky when dealing with an ellipsoidal surface.

Fortunately, there are two new system stored procedures in SQL Server Denali that make the process of spatial index-tuning much simpler, by allowing you to recreate (and therefore visualise) polygons representing the decomposition of a set of data under different index settings. These are sp_help_spatial_geography_histogram, and its geometry sister, sp_help_spatial_geometry_histogram. If you Google either of these terms right now, you’ll get only 7 results (Hopefully after I publish this post there might be 8 results…). I find this pretty surprising because, to me, these procedures (particularly the geography flavour) are pretty awesome.

Why? Well, when creating spatial indexes you generally want each feature to be described using the greatest possible accuracy (in order to increase the primary filter efficiency), but requiring the least number of cells to do so (in order to ensure that features are fully tessellated without blowing the CELLS_PER_OBJECT limit, or the index becoming large and unwieldy). But, when you create a index on a spatial column, how do you know in which grid cell(s) a given feature lies?

For the geometry datatype, this is pretty easy. LOW resolution is a 4×4 grid, MEDIUM resolution corresponds to an 8×8 grid, and HIGH resolution is a 16×16 grid. There are four levels of nested grids. So, to work out the size of each cell at Level 1 in a MEDIUM geometry index, you take the extent of the index bounding box in both x and y dimensions, and divide by 8. If the next grid resolution is also MEDIUM, then each cell at Level 2 will be 1/8th of a Level 1 cell, i.e. 1/64th of the height and width of the bounding box of the index. etc. etc.

So, if you have a bounding box that covers a map of the whole globe, from (–180 –90) to (180 90), then the level 1 cells in a HIGH resolution grid look like this:

image

Since the geometry datatype operates on a simple, planar surface, it’s actually always been pretty easy to work out this grid of cells programmatically, and determine which (and, crucially for performance-tuning, how many) cells a given feature intersects. But, the sp_help_spatial_geometry_histogram  makes this even easier. You can now create the above grid for yourself using the following T-SQL:

CREATE TABLE #geometry_histogram_test (
  geom geometry
);
INSERT INTO #geometry_histogram_test VALUES (
  geometry::STPolyFromText('POLYGON((-180 -90, 180 -90, 180 90, -180 90, -180 -90))', 4326)
  );

DECLARE @histogram_results table (
CellID int,
Cell geometry,
IntersectionCount int
);
INSERT INTO @histogram_results
EXEC
sp_help_spatial_geometry_histogram 
@tabname=#geometry_histogram_test,
@colname=geom,
@xmin=-180,
@xmax=180,
@ymin=-90,
@ymax=90,
@resolution=16;

SELECT CELL.STBoundary().STBuffer(0.05) FROM @histogram_results;

Tweak the @resolution value to get the grids at a given level and resolution. For example, to determine the set of Level 1 cells at HIGH resolution, supply a @resolution parameter of 16. To determine the set of Level 2 cells in a MEDIUM, MEDIUM grid, the @resolution parameter should be (8*8) = 64. But what’s more – the output of the procedure won’t just create the grid for you, but will give you a count of each feature from the underlying table that intersects each feature. For this example, I just created a single polygon spanning the full extent of the bounding box so the intersection count of every cell is just 1, but for distribution analysis of a real dataset, this is very useful.

But what’s even more useful is the sp_help_spatial_geography_histogram counterpart, because, despite the explanation in the  MSDN spatial indexing overview, it’s always been hard to visualise exactly how an index grid is applied to an ellipsoidal surface. Now, using code very similar to the above, you can see exactly how the geography index cells are created (and, again, determine how many features intersect each cell). For example, the following code listing demonstrates the set of Level 1 HIGH geography index cells:

CREATE TABLE #histogram_test (
  geog geography
);

INSERT INTO #histogram_test VALUES (
  geography::STGeomFromText('FULLGLOBE', 4326)
  );

DECLARE @histogram_results table (
CellID int,
Cell geography,
IntersectionCount int
);

INSERT INTO @histogram_results
EXEC
sp_help_spatial_geography_histogram 
@tabname=#histogram_test,
@colname=geog,
@resolution=16;

SELECT CELL.RingN(1).STBuffer(10000) FROM @histogram_results;

I’ve run the procedure against a table containing a single FULLGLOBE geometry to ensure that every cell intersects some feature (cells that don’t intersect anything in the base table aren’t included in the output).  Since geography indexes have no bounding box and are implicitly assumed to tessellate the whole globe, these Level 1 cells will be the same for every HIGH Level 1 geography index. And here’s what they look like:

image

And the Level 3 cells in a LOW, LOW, LOW geography grid (at which point, the resolution is 4*4*4 = 64)? They look slightly more like this:

image

Pretty, huh? But more importantly, these procedures make visualising, analysing, and ultimately tuning geography indexes much easier.

July 28, 2011

CaSe-SEnsItIvtY when Handling Multiple Spatial Columns in Safe FME Transformations

Some spatial data formats can store only a single logical set of features. Any given ESRI shapefile, for example, can store only POINTs, LINESTRINGs, or POLYGONs – not a mixture of each type of geometry. What’s more, all of those geometries must all be defined using the same spatial reference system, as defined in the associated .PRJ file.

Some other formats are more flexible. A SQL Server table, for example, can have many separate geometry or geography columns, and the values can contain both a mixture of geometry types, and even different spatial reference systems within the same column. I frequently use multiple spatial columns as a way of storing pre-calculated geometries relating to each feature at different levels of approximation, or different coordinate systems. For example, suppose that I have a table of features that were originally supplied in EPSG:27700 (OS National Grid of Great Britain). In addition to keeping the original geometry, I re-project those features into EPSG:4326 (WGS84) for displaying on Bing Maps. I also frequently simplify those features (using Reduce()) to create simpler approximations for display at far zoom levels. I keep each of these modified versions of the geometry in separate columns of the same table, named according to the datatype, SRID, and any modifications that have been performed, such as this:

CREATE TABLE (
  FeatureID int,
  geom27700 geometry,
  geog4326 geography,
  geog4326_reduce1000 geography
);

I came across a situation earlier today trying out Safe FME against a table with this structure that held me up for a few hours, so I thought I’d write about it here.

If you create a Feature Reader from a SQL Server table that has only a single geometry or geography column (as, I suppose, is probably the norm), you don’t need to explicitly specify the name of the column, as Safe FME is intelligent enough to work it out for you. Looking in the log, you’ll see a line that says

MS SQL Server (Spatial) Reader: Spatial column not specified for Feature Type 'dbo.Mastermap'. 
Using Column 'GEOMETRY' with type 'geometry'

If you’ve only ever used tables with one spatial column, you’ve probably not even noticed this as it shoots past in the log screen – you just drag your MS SQL Feature Reader onto the workspace and use it just as you would use a ESRI shapefile reader, or any other single data source. FME loads the values in the spatial column as features, and the values of all other columns of the table as attributes of those features.

But what if, like me, your source table has more than one spatial column? How do you specify which geometry/geography column should be used in the MSSQL reader to populate the feature type? Well, if you go to the parameters tab of the feature type you’ll see a couple of options:

  • “Specify Geography (geodetic) Column” – Yes if you’re using the geography datatype, or No if you’re using a column of the geometry datatype.
  • Geometry Column Name / Geography Column Name – The name of the column to retrieve.

In my dataset, the source column was of the geometry datatype and named Geom27700, so I set the parameters in my reader feature type as follows:

clip_image001[7]

I then carried on with building my (rather simple) workflow, as follows – check for valid data, reproject to WGS84, orient polygons using the left-hand rule, and then output to a writer:

image

Rather than insert the records into a new dataset, the writer in this case was designed to update the same table from which I’d read the features, but populate the geog4326 column (remember that features had been read from the geom27700 column).  So, I set the Writer Mode to UDPATE, selected the FEATURE_ID (the primary key of the table) as the key on which features would be matched, and set the other  parameters as follows:

 clip_image001[5]

Then, I clicked execute.

10 hours later, and FME was reporting that the transformation had been a success, with 48 million rows loaded into the destination table. But when I came to look at the table in SQL Server Management Studio, the geog4326 column that should have been populated by my writer was empty. Blank. NULL. Zilch. Zip.

So, what had gone wrong? I went back and looked through the FME log – everything seemed sound. I tried searching the internet and found some old posts about whether Safe FME supported multiple spatial columns per feature (like this one), so wondered if this was a limitation of the software, but nothing concrete came up. So then I contacted Safe FME’s “Doctors” over twitter. They asked me to send them a simple repro script, so I ditched my original package and started again. I built up a simple script, this time just looking at a test table with 100 rows, and…. it worked.

Much head-scratching later and I found the reason – when you specify the Geometry Column Name or Geography Column Name in a Feature Reader, the entered column is CASE-SENSITIVE, and must match exactly the definition of that column in the database. Frustratingly, there is no drop-down list of columns from which to choose this value – it must be manually typed. What’s more, if you enter the correct column name but using the wrong case, no error will be thrown. The package will run quite happily, in fact, loading all the attribute values from the table, but not the spatial data from the geometry/geography column itself. So your transformation will still load ‘000s of rows from the reader, but not with the (generally rather crucial) spatial feature itself.

For some reason, in this particular case, I’d named the column from which I was reading data GEOM27700 rather than geom27700. This is why, in my earlier example, the geog4326 column was NULL – it’s not that my UPDATE query hadn’t executed – the column had in fact been updated, but with NULL values that had been carried right the way through the FME transformation, since the reader had never successfully retrieved any values from the geom27700 column (since it was actually called GEOM27700).

I found this slightly odd behaviour, since my SQL Server collation itself is not case-sensitive, and will quite happily accept either GEOM27700 or geom27700 in a SELECT statement. What’s also odd is that this same behaviour is not mirrored in the Safe FME MS SQL Writer. In the writer, case sensitivity of the spatial column name is still significant, but if you attempt to specify a column with the wrong case, an error is thrown straight away and you’ll be unable to run the transformation, rather than the “let you carry on but don’t read anything” approach of the reader.

So, a slightly frustrating experience, and 10hrs of processing spent updating a table with a load of NULLs, but never mind – fortunately I’m only still learning Safe FME and this was just an experiment, so no angry clients ringing up asking where their data is (this time) Smile

And, to give credit where it’s due, having found the problem myself I then contacted the FME doctors again over twitter and they immediately replied, acknowledging the problem and informing that an update was in the works. I can think of a few software companies who could learn a thing or two about customer service like that….

Tags:
July 26, 2011

Determining the Geographic Extent of Spatial Features in a SQL Server Table

I’ve been using the SQL Server Denali CTP(s) for just over 6 months now. As a result, I’ve become accustomed to some of the new features and forget that they aren’t generally available in production environments yet… it always comes as a bit of a surprise when you have to do some work on a SQL Server 2008/R2 database and suddenly discover half of the feature you’ve come to rely on aren’t there any more (I pity those consultants who turn up at a client’s site to find SQL Server 2000/7/6.5 instances – fortunately, if you’re dealing with spatial data, this doesn’t really happen ;)

One such example happened today, when I was trying to determine the extent of the features in a column of spatial data, in order to create a bounding box for a spatial index covering all the data. In SQL Server Denali, this can be done very easily using an EnvelopeAggregate. Although not documented, any envelope created by SQL Server (i.e. using STEnvelope() or EnvelopeAggregate()) will be always have its first point at the bottom left of the envelope, and its third point at the opposing, top right corner. You can therefore calculate the minimum and maximum extents of the features in the MyGeomColumn of geometry data as follows:

SELECT
  geometry::EnvelopeAggregate(MyGeomColumn).STPointN(1).STX AS MinX,
  geometry::EnvelopeAggregate(MyGeomColumn).STPointN(1).STY AS MinY,
  geometry::EnvelopeAggregate(MyGeomColumn).STPointN(3).STX AS MaxX,
  geometry::EnvelopeAggregate(MyGeomColumn).STPointN(3).STY AS MaxX
FROM BigOlTableOfData;

So what do you do in SQL Server 2008/R2, where the EnvelopeAggregate() is not available? Here’s a couple of options…

Option #1 : With a Cursor

Well, you could come up with your own envelope aggregate using a (dreaded) cursor. Something like this:

DECLARE
  @geom geometry,
  @geomaggregate geometry = 'GEOMETRYCOLLECTION EMPTY';
DECLARE @i int = 0;

DECALRE geocursor cursor fast_forward
FOR SELECT MyGeomColumn FROM BigOlTableOfData;
OPEN geoCursor;
FETCH NEXT FROM geoCursor INTO @geom;
WHILE @@FETCH_STATUS = 0 BEGIN
  SET @geomaggregate = @geomagregate.STUnion(@geom);
  FETCH NEXT FROM geoCursor INTO @geom;
  SET @i += 1;
END;
CLOSE geocursor;
deallocate geoCursor;

Then, create an envelope around @geomaggregate using STEnvelope(), and select the first and third points as before.

Option #2 : CLR Function

Alternatively, you could choose not to explicitly create the envelope, but create a CLR function that looped through all the points of each geometry (using STPointN() and STNumPoints()) , and keep track of the minimum and maximum X and Y values in the loop. Something like this:

double xMin = double.MaxValue;
double xMax = double.MinValue;
double yMin = double.MaxValue;
double yMax = double.MinValue;

// Loop through each geometry in the dataset
foreach(SqlGeometry geom in MyGeomColumn) {

  // Loop through each point in this geometry
  for(int i=1; i<= geom.STNumPoints(); i++)
  {
    SqlGeometry point = geom.STPointN(i);

    // Check whether this point is a new min/max value
    if (point.X < xMin)
    { xMin = point.X; }
    else if (point.X > xMax)
    { xMax = point.X; }
  
    if (point.Y < yMin)
    { yMin = point.Y; }
    else if (point.Y > yMax)
    { yMax = point.Y; }
  }
}

Option #3 : CTE

Here’s a nice alternative using two CTEs, which I first saw described here. The first CTE creates a table, ENVELOPE, containing the individual envelopes of each feature in the table. The second table selects just the corner points of each of those envelopes, with a UNION ALL query (note that this is a regular UNION, not a STUnion()).

Then, you simply select the minimum and maximum X and Y values of any of the points listed in the CORNERS table:

WITH
  ENVELOPE as 
  (
    SELECT MyGeomColumn.STEnvelope() as envelope from BigOlTableOfData
  ),
  CORNERS as 
  (
    SELECT envelope.STPointN(1) as point from ENVELOPE 
    UNION ALL select envelope.STPointN(3) from ENVELOPE
  )

SELECT
  MIN(point.STX) as MinX,
  MIN(point.STY) as MinY,
  MAX(point.STX) as MaxX,
  MAX(point.STY) as MaxY 
FROM
  CORNERS;

Option #4 : Persisted Envelopes

Another option, and the one I chose to use today, is to alter your table structure to include computed columns containing the minimum and maximum X and Y values of the envelope around each feature on a per-row basis. I chose to make my columns PERSISTED, so that the computed values would actually be materialised and saved in the database rather than computed on-the-fly:

ALTER TABLE BigOlTableOfData
ADD
  MinX AS (CONVERT(int, MyGeomColumn.STEnvelope().STPointN((1)).STX, 0)) PERSISTED,
  MinY AS (CONVERT(int, MyGeomColumn.STEnvelope().STPointN((1)).STY, 0)) PERSISTED,
  MaxX AS (CONVERT(int, MyGeomColumn.STEnvelope().STPointN((3)).STX, 0)) PERSISTED,
  MaxY AS (CONVERT(int, MyGeomColumn.STEnvelope().STPointN((3)).STY, 0)) PERSISTED;

This requires a little bit more processing up-front, (and on each additional INSERT into the table), but the data in my table was going to remain relatively static. The advantage of this approach is that you can now work out the full extent of values in the dataset at any time quickly and simply by:

SELECT
  MIN(MinX) AS MinX,
  MIN(MinY) AS MinY,
  MAX(MaxX) AS MaxX,
  MAX(MaxY) AS MaxY;

Seeing as these are just integer columns, this is much quicker than relying on the more complex spatial methods, of having to employ any looping-type behaviour. You can also add an index to the integer coordinate values in the computed columns, which might enable you to perform faster search and retrieval queries of e.g. values lying within a rectangular area of interest (such as what features are visible in a given Bing Maps view) than using the more powerful, but frequently slower, built-in spatial indexes.

Follow

Get every new post delivered to your Inbox.

Join 53 other followers