Tangible Bytes

A Web Developer’s Blog

Unicode

A couple of things about UTF-8 have eluded me for a while …

I knew that the first part of ASCII (the part people agreed on) is the same in ASCII and UTF-8

I knew that the rest of Unicode needs more than one byte (2, 3 or, as it turns out, 4)

But I wasn’t clear how you could tell how many bytes needed to be read at a time

And mostly I didn’t need to because the computer does it all for me - but those bits of vagueness can catch you out and so I went down the rabbit hole and it turns out to be fairly short.

First off: it’s only the first 128 characters (the 7 bit values 0 to 127) that are the same.

In binary these all start with a 0

So in UTF-8 if you read a byte and it starts with a 0 - you just need that one byte.

Any UTF-8 byte that starts with a 1 is part of a multi-byte sequence

The leading bits of each UTF-8 byte have the following meaning

0 : single byte

110 : start of a two byte sequence

1110 : start of a three byte sequence

11110 : start of a four byte sequence

10 : a continuation byte (the second or later byte of a multi-byte sequence)
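To make the table concrete, here is a minimal sketch in Go that classifies a byte purely by its leading bits. The function name sequenceLength is my own invention, and it only looks at the marker bits: it does not check the extra rules real decoders enforce (some lead bytes that match these patterns are still never valid UTF-8).

```go
package main

import "fmt"

// sequenceLength reports how long the UTF-8 sequence starting with b claims
// to be, judging only by b's leading bits. It returns 0 for a continuation
// byte (10xxxxxx) and -1 for a byte matching none of the patterns.
// Hypothetical helper for illustration; not a full validator.
func sequenceLength(b byte) int {
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx
		return 1
	case b&0xC0 == 0x80: // 10xxxxxx
		return 0
	case b&0xE0 == 0xC0: // 110xxxxx
		return 2
	case b&0xF0 == 0xE0: // 1110xxxx
		return 3
	case b&0xF8 == 0xF0: // 11110xxx
		return 4
	default: // 11111xxx
		return -1
	}
}

func main() {
	// Print every byte of a short string alongside its classification.
	for _, b := range []byte("o£€") {
		fmt.Printf("%08b -> %d\n", b, sequenceLength(b))
	}
}
```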

The consequences of this are

  • all of the less well defined “extended ASCII” chars (128 to 255) start with a 1 and are not valid UTF-8 single byte characters

  • you can’t just read any random byte and know what character it is (it might be part of a multi byte character)

  • But you can read any byte and know whether it is a single byte char or part of a multi byte char (and, if so, which way to go to read the rest of it)
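The last point can be sketched in Go: landing on an arbitrary byte, step backwards over continuation bytes until you hit one that can start a character. The helper name runeStartBefore is made up for this example; utf8.RuneStart is the standard library check for “not a 10xxxxxx byte”.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeStartBefore walks backwards from byte offset i until it reaches a byte
// that is not a continuation byte, and returns that offset.
// Hypothetical helper; it assumes 0 <= i < len(s).
func runeStartBefore(s string, i int) int {
	for i > 0 && !utf8.RuneStart(s[i]) {
		i--
	}
	return i
}

func main() {
	s := "Ǵo£lang"
	start := runeStartBefore(s, 4) // offset 4 is the second byte of £
	r, size := utf8.DecodeRuneInString(s[start:])
	fmt.Printf("start=%d char=%c bytes=%d\n", start, r, size)
}
```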

A more subtle point concerns characters in extended ASCII, like ‘£’: decimal 163, hex A3, binary 10100011 (that byte may mean other characters in other flavours of extended ASCII - but being British this is the one I’m used to)

In UTF-8 this is encoded as the two bytes

Decimal [194 163]

Binary [11000010 10100011]

Note that the second byte is the same as in extended ASCII

But it is not valid UTF-8 as a single byte - the leading 10 tells you it is part of a multi byte character
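A quick way to convince yourself of this in Go is to ask the standard library. This is only a small sketch; validAlone is a hypothetical wrapper I added so the single byte case reads clearly.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validAlone reports whether a single byte, taken on its own, is valid UTF-8.
// Hypothetical helper for illustration.
func validAlone(b byte) bool {
	return utf8.Valid([]byte{b})
}

func main() {
	pair := []byte{194, 163} // 11000010 10100011

	fmt.Println(validAlone(163))  // the lone continuation byte
	fmt.Println(utf8.Valid(pair)) // the full two byte sequence
	fmt.Println(string(pair))     // the character the pair encodes
}
```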

Taking the two bytes

11000010 10100011

The 110 at the start says “this is a two byte character” and those bits don’t form part of the code value

The 10 at the start of the second byte says “this isn’t the only byte” and isn’t part of the code value either

What is left is 00010 from the first byte and 100011 from the second: 00010100011 or 163
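The same arithmetic can be done with bit masks. This is a rough sketch of the two byte case only, assuming the bytes are already known to be a well formed pair; decodeTwoByte is a name I made up, and in real code utf8.DecodeRune does this job (with proper validation).

```go
package main

import "fmt"

// decodeTwoByte combines a two byte UTF-8 sequence into its code point by
// dropping the 110 marker from the first byte and the 10 marker from the
// second. Hypothetical helper: it does no validation at all.
func decodeTwoByte(first, second byte) rune {
	high := rune(first & 0x1F) // keep the low 5 bits of 110xxxxx
	low := rune(second & 0x3F) // keep the low 6 bits of 10xxxxxx
	return high<<6 | low
}

func main() {
	r := decodeTwoByte(194, 163)
	// Show the 11 payload bits, the decimal value and the character.
	fmt.Printf("%011b %d %c\n", r, r, r)
}
```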

So 8 bit numbers don’t stay the same in UTF-8 - only 7 bit ones.

But the 8 bit bytes may be there and if you search for a byte you might find it - but then it wouldn’t mean what you think it does.

As illustrated by this comment

	// Needs "fmt" and "strings" imported.
	str := "Ǵo£lang"
	data := []byte(str)
	fmt.Println(data)
	// £ is code point 163, so naively search for the byte 163
	fmt.Println(strings.IndexByte(str, 163))
	// Output:
	// [199 180 111 194 163 108 97 110 103]
	// 4

Note that this naive search by byte tells us the byte is at position 4 (which is the second byte of the £) - when what we want is the character (or rune as Go calls them) at position 2
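For comparison, here is one way to get the character position instead. It is a small sketch using strings.IndexRune and utf8.RuneCountInString from the standard library; runeIndex is my own helper name, not something Go provides.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeIndex returns the position of r in s counted in characters rather than
// bytes, or -1 if r does not occur. Hypothetical helper for illustration.
func runeIndex(s string, r rune) int {
	byteIdx := strings.IndexRune(s, r) // byte offset of the first byte of r
	if byteIdx < 0 {
		return -1
	}
	// Count the characters that come before that byte offset.
	return utf8.RuneCountInString(s[:byteIdx])
}

func main() {
	str := "Ǵo£lang"
	fmt.Println(strings.IndexRune(str, '£')) // byte offset where £ starts
	fmt.Println(runeIndex(str, '£'))         // position counted in characters
}
```

Searching for the character rather than the byte also avoids a false match on a continuation byte that belongs to some other character.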

If we tried to pass IndexByte a constant that doesn’t fit in 8 bits (anything above 255) we would get a compile error, because the argument has to be a byte.

The Wikipedia page on UTF-8 is very good

And Joel Spolsky provides a good intro: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Many years ago I battled with Reading a Unicode Excel file in PHP but UTF-16 is a slightly different beast.