PHP and UTF-8 BOM (Or, why do my webpages start with ï»¿)

Like many developers, I write code for a variety of platforms, using a variety of platforms. I write C# on an iBook, PHP on a Linux VM running under Windows, JavaScript on my desktop, and HTML blog posts on my mobile phone. I’ve been known to write C++ on my microwave. And I write SQL snippets on the back of my hand.

Unsurprisingly, this sometimes causes a few headaches, both from configuration differences between the various development environments and from the limited amount of memory in my brain for remembering the nuances of all these languages. To quote A. A. Milne’s Winnie the Pooh: “I am a bear of very little brain…”.

I recently wrote some PHP for the first time in ages, and noticed that some of my pages were appearing on one development machine, in some browsers, preceded by the characters ï»¿. These characters didn’t show up when editing the pages, and they didn’t show up at all when served from a different server or when viewed in some other browsers.

Initially, I thought it had something to do with not having configured the correct character set in the response header (which is generally the main cause of garbled characters appearing in webpages). But when I checked the response header, it seemed OK – I was outputting UTF-8 as desired:

header('Content-Type: text/html; charset=UTF-8');

And browsers viewing the page were correctly auto-detecting the character encoding as UTF-8:

image

Then I checked the configuration of the server, which was also set up correctly with Unicode support. And then I checked the encoding of the PHP scripts themselves, which were all encoded as UTF-8 (Windows code page 65001). So far, everything seemed consistent, so where were those garbled characters coming from?

UTF-8 with or without signature – your choice. (Or not).

The reason, as I found out, was that one of my development environments (Visual Studio, from which I’d made the most recent edits to the affected pages) was configured to save UTF-8 encoded files with signature. Here are the options for Unicode character encoding in Visual Studio, showing UTF-8 both with and without signature (notice that they’re both the same code page – 65001):

image

There seems to be very little convention or standardisation as to the use of this “signature”. I hadn’t really come across this problem before because I generally use Eclipse for PHP development. The encoding options there are shown below:

image

Notice that, although there are several flavours of UTF-16 available in Eclipse, there is only one version of UTF-8, which is equivalent to Visual Studio’s “without signature” option.

Then again, here are the options in Windows Notepad (yes, I use that sometimes as well). As in Eclipse, there is only one choice of UTF-8, but this time the sole option available does the opposite – it always saves UTF-8 with signature:

image

BOM BOM BOOM!

The optional “signature” in question is the byte order mark, or BOM. Encodings whose code units span more than one byte, such as UTF-16, use a BOM to indicate big-endianness or little-endianness – the order in which the bytes of each unit are arranged. All of the save dialogs above let you specify the byte order for UTF-16, since with multibyte code units the byte order matters. UTF-8, however, is built from single-byte code units (that’s what the “8” stands for – 8 bits = 1 byte). A character may still take anywhere from one to four of those bytes, but they always appear in the same order, so a BOM is not required and doesn’t really make sense.
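To make the byte-order point concrete, here is a minimal sketch using only core PHP functions (it assumes PHP 7 or later for the "\u{...}" escape). The helper names utf16be_hex and utf16le_hex are my own, made up for illustration. It prints the bytes of U+00E9 (é) and of U+FEFF (the BOM character) in both UTF-16 byte orders and in UTF-8:

```php
<?php
// Hypothetical helpers: hex dump of one 16-bit code unit in each byte order.
// pack('n') writes an unsigned 16-bit value big-endian, pack('v') little-endian.
function utf16be_hex(int $codeUnit): string
{
    return bin2hex(pack('n', $codeUnit));
}

function utf16le_hex(int $codeUnit): string
{
    return bin2hex(pack('v', $codeUnit));
}

// U+00E9: the two UTF-16 forms differ only in byte order.
printf("U+00E9 UTF-16BE: %s\n", utf16be_hex(0x00E9)); // 00e9
printf("U+00E9 UTF-16LE: %s\n", utf16le_hex(0x00E9)); // e900
// In UTF-8 the same character is two single-byte code units, in a fixed order.
printf("U+00E9 UTF-8:    %s\n", bin2hex("\u{00E9}")); // c3a9

// U+FEFF is the BOM character itself.
printf("U+FEFF UTF-16BE: %s\n", utf16be_hex(0xFEFF)); // feff
printf("U+FEFF UTF-16LE: %s\n", utf16le_hex(0xFEFF)); // fffe
printf("U+FEFF UTF-8:    %s\n", bin2hex("\u{FEFF}")); // efbbbf
```

A UTF-16 reader can tell the two byte orders apart by whether the file starts FE FF or FF FE. In UTF-8 there is only one possible sequence, so the same character can only act as a marker that says "this is UTF-8".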

Even though UTF-8 always uses the same byte order, a UTF-8 encoded file can begin with the bytes EF BB BF (the UTF-8 encoding of U+FEFF, the same character that serves as the BOM in UTF-16), which merely signifies that the file is in UTF-8 format. It carries no byte-order information here, which is why Visual Studio calls it a “signature”. The problem is that some clients don’t expect UTF-8 to have a BOM and, as it turns out, the PHP engine is one of them. At least, some builds of the PHP engine. One of my PHP servers, running on a Linux machine, interpreted the UTF-8 file with signature fine, whereas another, running under Windows, tried to display the leading bytes as content on the page, which is how you end up with ï»¿.
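As a rough sketch of both halves of that, the snippet below checks whether a string starts with the signature bytes, and shows what those three bytes turn into if something decodes them one byte at a time as ISO-8859-1 (which is my assumption about what the affected browsers were doing). The function names are hypothetical, and it sticks to core PHP so that it doesn’t depend on the mbstring or iconv extensions:

```php
<?php
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true if the raw bytes begin with EF BB BF.
function starts_with_utf8_bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Hypothetical helper: reinterpret each byte as an ISO-8859-1 character and
// re-encode it as UTF-8. Bytes below 0x80 are plain ASCII and pass through;
// bytes 0x80-0xFF become two-byte UTF-8 sequences.
function latin1_to_utf8(string $bytes): string
{
    $out = '';
    foreach (str_split($bytes) as $byte) {
        $code = ord($byte);
        $out .= $code < 0x80
            ? $byte
            : chr(0xC0 | ($code >> 6)) . chr(0x80 | ($code & 0x3F));
    }
    return $out;
}

$page = UTF8_BOM . "<html>";

var_dump(starts_with_utf8_bom($page));      // bool(true)
var_dump(starts_with_utf8_bom("<html>"));   // bool(false)

// The three signature bytes, misread one byte per character:
echo latin1_to_utf8(UTF8_BOM), "\n";        // ï»¿
```

To check a real script you could pass it the first few bytes of the file, for example starts_with_utf8_bom(file_get_contents('index.php', false, null, 0, 3)), where index.php is just a placeholder name.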

Different default encoding behaviours across editors, combined with different server and browser behaviours when interpreting UTF-8 files with a BOM, mean that this problem can be a little tricky to diagnose.

This is reported as a PHP bug at http://bugs.php.net/bug.php?id=22108, but the workarounds are actually quite straightforward (once you know what the problem is!):

  • If you’re using Visual Studio, make sure you save your PHP files as UTF-8 without signature. If you’re using Eclipse, this is the default anyway.
  • Compile your PHP with the --enable-zend-multibyte option, which will correctly parse the BOM at the start of the file.
  • If you don’t need Unicode at all, you could use ISO-8859-1, or another non-UTF-8 encoding.
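If the files have already been saved with a signature, another option is to strip it after the fact. The following is a minimal sketch rather than a polished tool: the function names are my own, it assumes PHP 7 or later, and it rewrites the file in place, so try it on a copy first.

```php
<?php
// Hypothetical helper: return the contents without a leading EF BB BF.
function strip_utf8_bom(string $contents): string
{
    if (strncmp($contents, "\xEF\xBB\xBF", 3) === 0) {
        return substr($contents, 3);
    }
    return $contents;
}

// Hypothetical helper: rewrite a file in place if it starts with the
// signature. Returns true if the file was changed, false if it was already
// clean.
function strip_bom_from_file(string $path): bool
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Could not read $path");
    }

    $stripped = strip_utf8_bom($contents);
    if ($stripped === $contents) {
        return false; // no signature, nothing to do
    }

    if (file_put_contents($path, $stripped) === false) {
        throw new RuntimeException("Could not write $path");
    }
    return true;
}
```

Calling strip_bom_from_file('index.php') (a placeholder path) on each affected script should leave the PHP source otherwise byte-for-byte identical.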

5 Responses to PHP and UTF-8 BOM (Or, why do my webpages start with ï»¿)

  1. Roger Morgan says:

    “If you don’t need unicode at all, you could use ISO-8566-1” … I think that’s bad advice. You never know whether there might be a future requirement for characters not in the ANSI set, and in that case you should go straight to utf-8, which takes care of all possible characters, and does it in a reasonably efficient way. One nice thing about utf-8 is that files which just contain the original ANSI characters are already automatically valid utf-8 files.

    With utf-8 you don’t have to worry about which characters are supported. They all are.

    • alastaira says:

      I don’t think I was offering “bad advice” – I’m suggesting possible alternatives to a problem. Also, bear in mind that I’m not talking about the content of a page (which, I agree, should be UTF-8 encoded for possible future internationalisation even if you currently only have content in one codepage) – I’m talking about the PHP script itself.
      PHP’s inbuilt functions, operators and control structures (for, if, while, etc.) are not named using unicode characters and, unless you use unicode characters in your variable or class names, you don’t need to use it either. You can still retrieve UTF8 data from a database, or include UTF8 content from an HTML template to send to the client without the PHP script file being saved with UTF8 encoding.

  2. spigolo says:

    Do you mean ISO 8859-1 rather than ISO 8566-1 ?
