Character Encodings

Note: I probably need to re-write some of this. It’s a Monday morning, so go easy on me, yeah?

“Surely I don’t need to worry about character encodings, all my code is in ASCII plain-text and the output all works fine on my PC. It’s not like I’m writing in Japanese or anything.”

If you thought something along those lines then you probably need to read this. If you want your content to display accurately and consistently across browsers, and especially if you want even a hope of supporting other languages, then you need to understand character encodings and which one(s) to use.

What are character encodings?
Let’s start with the basics. You know that computers deal with numbers, not letters, right? So you’ve probably worked out that computers store text as numbers: each letter has a number which corresponds to it. You’ve probably heard of ASCII. In ASCII, 65-90 are upper-case A-Z and 97-122 are lower-case a-z. However, computers aren’t even very good with numbers; they really only like simple “yes/no” data. In the days of ASCII most computer systems dealt in groups of eight 0s and/or 1s (each 0 or 1 is called a “bit” and a group of eight is a “byte”, but you know this, right?). Eight bits gives us a total of 256 combinations to play with. The ASCII standard contains only 128 characters, so it fits in 7 bits. If you stored ASCII in 8-bit bytes (which everyone did) then you had 128 unused numbers which could be mapped to anything.
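If you want to check those numbers for yourself, here is a minimal Python sketch. It only uses built-ins (ord, chr, str.encode); the variable names are made up for illustration.

```python
# Each character maps to a number.
upper_a = ord("A")   # 65
lower_z = ord("z")   # 122
letter = chr(72)     # "H"

# One byte is eight bits, so it can hold 2**8 different values.
byte_values = 2 ** 8                        # 256
ascii_values = 2 ** 7                       # 128, all that ASCII defines
spare_values = byte_values - ascii_values   # the "unused" half

# Encoding text as ASCII gives one byte per character.
hello_bytes = "Hello".encode("ascii")
print(list(hello_bytes))  # [72, 101, 108, 108, 111]
```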

And so people started mapping them to things. The character encoding used by DOS contained some accented letters, a pair of smiley faces and some structural characters that were used to make boxes and other structures in text-based programs.

So now we have a bunch of different encodings which work fine for American English, because the first 128 characters are standardised ASCII (the A in ASCII stands for “American”), but which may or may not work for, say, French because the accented characters may or may not be where you expect. Worse, you could write something in French and it would look fine, but then transfer it to another computer which is using a different character encoding and the standard a-z would work fine, but any accented letters would be replaced by, well, anything really. And what if you were trying to write in Japanese? And so they made Unicode.
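The “looks fine here, broken over there” problem is easy to reproduce. The sketch below assumes the text was saved as Windows-1252 and read back as DOS code page 437; that particular pairing is just my choice for illustration, and any two different 8-bit encodings would show the same sort of thing.

```python
original = "caf\u00e9"  # "café"

# Save the text using one 8-bit encoding (Windows-1252 here)...
raw = original.encode("cp1252")

# ...then read the same bytes on a machine that assumes a different one.
garbled = raw.decode("cp437")

# The plain a-z letters survive; the accented letter does not.
print(original, "->", garbled)
```

The first three letters come through untouched because both encodings agree on the ASCII range. Only the byte above 127 gets reinterpreted.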

Bear in mind that so far these encodings have all used one byte per letter (there are 8 bits to the byte). For web-based work this shouldn’t matter, but imagine writing a word processor. You would be dealing with text data on a letter-by-letter basis. If you wanted your software to skip to the next letter you just advance 1 byte. Want to go back a character? Just go back 1 byte. Simple, right? Yes. Bear this in mind for the next bit…

In 1987, Joe Becker from Xerox and Lee Collins and Mark Davis from Apple started looking at creating a unified character set. In 1988 Becker published a draft proposal for a 16-bit character set “tentatively called Unicode.” It was designed to incorporate all modern languages. Being 16-bit it had room for 65,536 different characters, but being 16-bit also gave it one drawback: every character would take up twice as much space (in terms of data) as its 8-bit equivalent. This seemed a bit excessive.

Also, Unicode text almost, almost matched up with the ASCII standard. For example the Unicode for “Hello” would be:
U+0048 U+0065 U+006C U+006C U+006F

The ASCII for “Hello” would be:

48 65 6C 6C 6F

Notice any similarities? The actual numbers were the same, but because each 16-bit Unicode character carried a leading zero byte, the two couldn’t be used interchangeably. That is to say, you couldn’t take a bit of ASCII text and run it through an interpreter that was expecting 16-bit Unicode, or vice versa. It just wouldn’t work. An ASCII interpreter would see the Unicode as being 10 characters long, with every other character being “00”.
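You can see the mismatch by looking at the raw bytes. This sketch stands in for the original 16-bit design with Python’s "utf-16-be" codec (big-endian, no byte order mark), which is an assumption on my part; for plain letters like these the bytes come out the same.

```python
text = "Hello"

ascii_bytes = text.encode("ascii")
wide_bytes = text.encode("utf-16-be")  # two bytes per character here

print(ascii_bytes.hex(" "))  # 48 65 6c 6c 6f
print(wide_bytes.hex(" "))   # 00 48 00 65 00 6c 00 6c 00 6f

# Every other byte is zero, and the rest match the ASCII bytes.
zero_bytes = wide_bytes[0::2]
letter_bytes = wide_bytes[1::2]
```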

And so in 1992 Dave Prosser of Unix System Laboratories proposed a system which made a lot of sense, and later that year Ken Thompson and Rob Pike refined the idea into what we now call UTF-8. In UTF-8 each character is represented by a string of 1 to 4 bytes. If the character is 1 byte long then it corresponds with ASCII; if it is longer than 1 byte then it corresponds to Greek, Cyrillic, Hebrew or something else. In this way you can have the standard ASCII characters encoded in just 1 byte (which saves space) but you also have the ability to encode any character that is defined in the Unicode standard. And there are a lot of those.
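Here is a small sketch of the 1-to-4-byte idea using Python’s built-in "utf-8" codec. The four sample characters are just ones I picked to hit each length.

```python
# A, e-acute, the euro sign and an emoji, written as escapes to be unambiguous.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[ch] = len(encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {encoded.hex(' ')}")

# Pure ASCII text produces identical bytes in ASCII and in UTF-8.
same_as_ascii = "Hello".encode("utf-8") == "Hello".encode("ascii")
```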

And so…
So what does this mean for you? Well for a start when someone says “plain-text” you now know that this has no real meaning. Plain text encoded as what? ASCII? UTF-8?

What should you use for making web pages? If you’re writing in English ASCII is good enough, right? Er, no, not really. Say you are making a forum which will be used by lots of different people, potentially French people, or maybe Korean. They will need the ability to speak in their own language. So if you write a forum which only handles ASCII then they won’t be able to use it. Similarly you might be making a personal website that’s all in English but you want to use the word “Résumé”. Well, if you’re using standard ASCII then you can’t. Sure, it might appear to work, because the 8-bit encodings that extend ASCII often include accented characters, but they don’t all agree on where. UTF-8 isn’t perfect. If you want the smallest possible file size you should use an encoding that’s designed for the language in hand. However, UTF-8 keeps anything in the ASCII range (which includes all of your HTML markup) at one byte per character whilst still supporting every Unicode character. Basically, you’re most likely going to want to use UTF-8 for web pages!
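Both points can be checked in a few lines. This is only a sketch: the Korean sample word and the choice of EUC-KR as the “encoding designed for the language” are my own assumptions for the comparison.

```python
word = "R\u00e9sum\u00e9"  # "Résumé"

# Strict ASCII has no e-acute, so encoding fails.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 handles it: 6 characters become 8 bytes.
utf8_bytes = word.encode("utf-8")

# A language-specific encoding can be smaller for that language.
korean = "\ud55c\uad6d\uc5b4"  # three Hangul syllables
korean_utf8 = korean.encode("utf-8")
korean_euckr = korean.encode("euc_kr")

print(ascii_ok, len(utf8_bytes), len(korean_utf8), len(korean_euckr))
```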

OK, how do I do that then?
So by this point you’re hopefully thinking “Well that makes a lot of sense but how do I go about doing that?” Well, ideally your web server, before sending your web page, would send a header saying “Here comes a page, prepare yourself! It’s in UTF-8. Here it comes!” and then it would send the HTML. This happens in the form of an HTTP header which looks like this:

Content-Type: text/html; charset=UTF-8
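To give a feel for what the receiving end does with a header like this, here is a rough Python sketch. The helper name charset_from_header and its fallback value are hypothetical; the parsing itself leans on the standard library’s email.message.Message, which understands this style of header.

```python
from email.message import Message


def charset_from_header(content_type, default="iso-8859-1"):
    """Pull the charset out of a Content-Type value (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the name in lower case, or None if absent.
    return msg.get_content_charset() or default


body = "R\u00e9sum\u00e9".encode("utf-8")  # the bytes that came over the wire
charset = charset_from_header("text/html; charset=UTF-8")
text = body.decode(charset)
print(charset, text)
```

The fallback is only there so the function always returns something; what a real browser does when the header is missing is more involved.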

The trouble is that web servers often host more than one website. And not all sites will use the same encoding. So it would be nice to be able to send the character encoding in the HTML. But wait a minute. This means that the browser will have to actually read the HTML before it knows what it is! Think about this for a minute. It doesn’t make sense!

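However, this does actually work, because the tag itself is made up of ASCII characters and those look the same in almost every encoding a browser is likely to guess. The code that you use looks something like this: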

<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />

…and it should go inside the head section of your HTML. In fact it should go right at the top of the head section (the HTML standard expects it within the first 1024 bytes of the file) because once the browser gets this far and figures out what’s going on it’s going to back up and start again, now interpreting it all as UTF-8.
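Here is a much-simplified Python sketch of that “peek, then start again” trick. Real browsers follow a more elaborate procedure; the function name sniff_meta_charset, the regular expression and the UTF-8 fallback are all my own inventions for illustration.

```python
import re

# Look for charset=... inside a <meta ...> tag, working on raw bytes.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_meta_charset(raw_html, limit=1024):
    """Return the declared charset from the first `limit` bytes, or None."""
    match = META_CHARSET.search(raw_html[:limit])
    return match.group(1).decode("ascii").lower() if match else None


page = (
    "<html><head>"
    '<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />'
    "<title>R\u00e9sum\u00e9</title></head><body></body></html>"
).encode("utf-8")

# First pass: peek at the bytes. Second pass: decode the whole thing properly.
encoding = sniff_meta_charset(page)
html = page.decode(encoding or "utf-8")
print(encoding)
```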

Handy Hint
Pick one character encoding at the start of your project and stick to it! (Hint: choose UTF-8!) Next, make sure that every step in your process uses this same encoding, i.e. make sure that your database tables are using it, and make sure your HTML, CSS and PHP (or whatever) files are encoded in it. If this isn’t possible you might have to pay extra attention to where conversions take place… this is unlikely to be fun. But at least now you’ll know what might be causing these problems when they crop up.
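In code, “stick to it” mostly means naming the encoding every time text crosses a boundary instead of trusting the platform default. A minimal Python sketch, with a made-up file name and Windows-1252 standing in for “some other step that didn’t get the memo”:

```python
import os
import tempfile

text = "R\u00e9sum\u00e9 \u20ac5"

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "page.html")  # hypothetical file name

    # Write and read with the same explicit encoding: the text survives.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        round_trip = f.read()

    # Keep the raw bytes so we can see what a mismatched step would do.
    with open(path, "rb") as f:
        raw = f.read()

# Reading UTF-8 bytes as Windows-1252 is the classic source of garbage.
mismatched = raw.decode("cp1252")
print(round_trip == text, mismatched == text)
```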

That’s all.
There’s a whole lot more to say about character encoding, but I’m not going to say it. It’s only Monday morning and the sun’s barely come up.

For an example of what happens when character encodings go wrong see this article on BOMs and the trouble they can cause.

Extra! Extra! Read all about it!

Character Set Encoding Basics – A fairly technical look at character encodings. Uses phrases like “When 16-bit or 32-bit data units are brought into an 8-bit byte context, the data units can easily be split into 8-bit chunks since their size is an integer multiple of 8 bits.”

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – A less formal introduction. Uses phrases like “if you are a programmer working in 2003 and you don’t know the basics of characters, character sets, encodings, and Unicode, and I catch you, I’m going to punish you by making you peel onions for 6 months in a submarine.”

Character Encoding – Wikipedia knows all. Uses phrases like “Citation needed”.

Update: Here are some more links:
Character Encoding Issues
UTF-8 With or without BOM


About Mr Chimp

I make music, draw pictures, browse the internet, programme, and make sweet, sweet cups of tea until the early hours.