Parsing Free-Text Addresses and a UK Postcode Regular Expression Pattern

In my upcoming session at SQLBits this Saturday, I’ll be attempting to replicate the functionality of Google Maps using nothing but freely-available tools and data – SQL Server Express, OS Open Data, and a dash of Silverlight.

One of the features I’ll be demonstrating is a basic geocoding function – i.e. given an address, placename, or landmark, how do you look up and return the coordinates representing the location so that the map can centre on that place? This is not really a spatial question at all – it’s a question of parsing a free-text user input and using that as the basis of a text search of the database.

The simplest way of doing this is to force your users to enter Street Number, Street Name, Town, and Postcode in separate input elements (and these match the fields in your database). In this case, your query becomes straightforward:

SELECT X, Y FROM AddressDatabase WHERE StreetNumber = ‘10’ AND StreetName = ‘Downing Street’ AND Town=’London’

Most databases don’t contain the location of every individual address. If there is no exact matching StreetNumber record, then you typically find the closest matching properties on the same road and interpolate between them (it seems reasonable to assume that Number 10 Downing Street will be somewhere between Number 9 and Number 11).

Forcing users to enter each element of the address separately doesn’t necessarily create the most attractive UI, however. What’s more common is to use a single free-text search box into which users can type whatever they’re searching for – a placename, address, landmark, postcode etc. Nice UI, but horrible to make sense of the input. In these cases, the user might supply:

“10 Downing Street, London”

“Downing Street, St James’, LONDON”

“10, Downing St. SW1A 2AA”

…not to mention “10 Downig Street. London”, and any other many of misspellings or alternative formats.

One approach you might want to take in these cases is to use a RegEx pattern matcher to determine if any part of the string supplied is a postcode. The UK postcode format is defined by British Standard BS7666, and can be described using the following regular expression pattern:

(GIR 0AA|[A-PR-UWYZ]([0-9][0-9A-HJKPS-UW]?|[A-HK-Y][0-9][0-9ABEHMNPRV-Y]?) [0-9][ABD-HJLNP-UW-Z]{2})

Matching the supplied address string against this RegEx doesn’t prove that a valid postcode was supplied, but just that some part of the user input matched the format for a postcode. The matching substring can then be looked up (say, against the CodePoint Open dataset) to confirm that it is real.

Once you’ve identified the postcode, you can then run a query to retrieve a list of roadnames that lie in that postcode, from something like the OSLocator dataset, and scan the remainder of the input to see if it contains any of those names. You can also scan for any numeric characters in the first part of the text input, which might represent a house number. If you find a matching property, with the same road name and valid postcode, you can be pretty sure you’ve found a match.

If you find more than valid match, or possibly several partial matches only, then you can of course present a disambiguation dialogue box – “Is this the 10 Downing Street you meant?”. For example, there are many “10 Downing Street”s in the UK – from Liverpool to Llanelli and Farnham to Fishwick…. without knowing either the town or the postcode, it could have referred to any of the following:

image

This entry was posted in Spatial, SQL Server and tagged , , . Bookmark the permalink.

5 Responses to Parsing Free-Text Addresses and a UK Postcode Regular Expression Pattern

  1. Paul K says:

    alastair … saw your presentation at SQLBits yesterday and found it was a brilliant introduction to an aspect of SQL I’ve not yet used.

    Been playing with the OSLocator data all afternoon.🙂

  2. Pingback: SQLBits 8 | Alastair Aitchison

  3. Thomas Sykes says:

    Alastair, also saw your presentation at SQLBits on Saturday and it was great, thanks. It has prompted me to also look further into spatial, sql etc.

  4. Irma says:

    Great presentation, Alastair! Could you tell me if you know how to convert from geometry data types to geography? I would like to add another layer in SSRS to my existing postcode location, which will be the buil-t in Bing layer.
    Many Thanks,
    Irma

    • alastaira says:

      Hi Irma,
      Thanks, and I’m glad you liked the presentation. Converting from geometry to geography involves unprojecting the coordinate data, from a projected a coordinate system (such as 27700) to a geographic coordinate system (such 4326). This can be a little complex to explain – (there’s a chapter on it in my next book!) – but I have written about how to do it using OGR2OGR in another post in this blog.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s