Unicode
A couple of things about UTF-8 have eluded me for a while …
I knew that the first bit of ASCII (the bit people agreed on) is the same in ASCII and UTF-8
I knew that the rest of Unicode needs more than one byte (2, 3 or 4 in UTF-8)
But I wasn’t clear how you could tell how many bytes needed to be read at a time
And mostly I didn’t need to because the computer does it all for me - but those bits of vagueness can catch you out and so I went down the rabbit hole and it turns out to be fairly short.
First off : it’s only the first 128 characters (7 bits) that are the same.
In binary these all start with a 0
So in UTF-8 if you read a byte and it starts with a zero - you just need that one byte.
Any UTF-8 byte that starts with a 1 is part of a multi-byte sequence
Starting codes in UTF-8 have the following meaning
0 : single byte
110 : start of a two byte sequence
1110 : start of a three byte sequence
11110 : start of a four byte sequence
10 : a continuation byte (the second, third or fourth byte of a multi-byte sequence)
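As a rough sketch, the table above can be turned into a few bit masks. The function name byteKind and the labels it returns are my own invention, and it only looks at the leading bits, so it will happily accept a few lead bytes (such as 0xC0) that never occur in valid UTF-8.

```go
package main

import "fmt"

// byteKind reports what a single UTF-8 byte is, judged only by its leading bits.
// The name and the returned labels are illustrative, not from any library.
func byteKind(b byte) string {
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx
		return "single byte"
	case b&0xC0 == 0x80: // 10xxxxxx
		return "continuation"
	case b&0xE0 == 0xC0: // 110xxxxx
		return "start of 2"
	case b&0xF0 == 0xE0: // 1110xxxx
		return "start of 3"
	case b&0xF8 == 0xF0: // 11110xxx
		return "start of 4"
	default: // 11111xxx is not one of the patterns listed above
		return "invalid"
	}
}

func main() {
	// Prints the bits of each byte of "o£" next to its classification.
	for _, b := range []byte("o£") {
		fmt.Printf("%08b %s\n", b, byteKind(b))
	}
}
```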
The consequences of this are
all of the less well defined (extended) ASCII chars, 128 to 255, start with a 1 and are not valid UTF-8 single byte characters
you can’t just read any random byte and know what character it is (it might be part of a multi byte character)
But you can read any byte and know whether it is a single byte char, the start of a multi byte char (read forwards for the rest) or a continuation byte (go backwards to find the start)
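To make the "which way to go" point concrete, here is a minimal sketch that lands on an arbitrary byte and walks backwards to the start of the character. runeAt is a hypothetical helper of mine; it assumes the string is valid UTF-8 and the offset is in range. utf8.RuneStart and utf8.DecodeRuneInString are from Go's standard unicode/utf8 package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeAt returns the rune that covers byte offset i of s, together with the
// offset where that rune starts. Hypothetical helper for illustration: it
// assumes s is valid UTF-8 and 0 <= i < len(s).
func runeAt(s string, i int) (r rune, start int) {
	// A continuation byte (10xxxxxx) means "keep going left".
	for i > 0 && !utf8.RuneStart(s[i]) {
		i--
	}
	r, _ = utf8.DecodeRuneInString(s[i:])
	return r, i
}

func main() {
	s := "Ǵo£lang"
	r, start := runeAt(s, 4) // byte 4 is the second byte of £
	fmt.Printf("%c starts at byte %d\n", r, start)
}
```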
A more subtle point concerns characters in extended ASCII like ‘£’: decimal 163, hex A3, binary 10100011 (the same byte may be a different character in other flavours of extended ASCII - but being British this is the one I’m used to)
This is encoded as the two bytes
Decimal [194 163]
Binary [11000010 10100011]
Note that the second byte is the same as in extended ASCII
But it is not valid UTF-8 as a single byte - the leading 10 tells you it is part of a multi byte character
Taking the two bytes
11000010 10100011
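The 110 to start says “this is a two byte character” and doesn’t form part of the code value
the 10 of the second byte says “this isn’t the only byte” and doesn’t form part of the value either
What is left is 00010 from the first byte and 100011 from the second - together 00010100011 or 163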
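The same arithmetic as a sketch in Go. decodeTwo is a made-up helper that does no validation at all: it assumes it is handed a proper 110xxxxx lead byte followed by a 10xxxxxx continuation byte.

```go
package main

import "fmt"

// decodeTwo extracts the code point from a two-byte UTF-8 sequence by
// masking off the marker bits. Hypothetical helper: it assumes b0 is a
// 110xxxxx lead byte and b1 a 10xxxxxx continuation byte, with no checking.
func decodeTwo(b0, b1 byte) rune {
	hi := rune(b0 & 0x1F) // drop 110, keep 5 payload bits
	lo := rune(b1 & 0x3F) // drop 10, keep 6 payload bits
	return hi<<6 | lo
}

func main() {
	r := decodeTwo(0xC2, 0xA3) // 194, 163
	fmt.Printf("%011b = %d = %c\n", r, r, r)
	// 00010100011 = 163 = £
}
```

In real code utf8.DecodeRune does this for every sequence length and checks the bytes as it goes.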
So 8 bit numbers don’t stay the same in UTF-8 - only 7 bit ones.
But the 8 bit bytes may be there and if you search for a byte you might find it - but then it wouldn’t mean what you think it does.
As illustrated by this comment
str := "Ǵo£lang"
data := []byte(str)
fmt.Println(data)
// £ = 163
fmt.Println(strings.IndexByte(str, 163))
// Output:
// [199 180 111 194 163 108 97 110 103]
// 4
Note that this naive search by byte tells us the byte is at position 4 - when what we want is the character (or rune as Go calls them) at position 2
If we tried to pass IndexByte a constant that doesn’t fit in 8 bits (say a rune with a value above 255) we would get a compile error.
The Wikipedia page on UTF-8 is very good
And Joel Spolsky provides a good intro: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Many years ago I battled with Reading a Unicode Excel file in PHP but UTF-16 is a slightly different beast.