This is the first part of a series about decoding UTF-8 encoded text.
I assume you have some basic knowledge of Unicode and its terminology. If not, I strongly recommend reading up on it before continuing with this series. Some good texts I'm aware of include:
In this article, we’ll focus on converting sequences of UTF-8 encoded octets (we will call them “bytes” from now on) to UTF-32 encoded Unicode code points. Strictly speaking, code points are at most 21 bits wide, but a 32-bit value is used to store a single code point; the highest 11 bits are always zero.
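To make the “highest 11 bits are always zero” claim concrete, here is a small Python sketch. It takes the largest valid code point, U+10FFFF, and looks at its 32-bit representation. The name MAX_CODE_POINT is just an illustrative choice, and big-endian byte order is assumed for readability.

```python
# Largest valid Unicode code point.
MAX_CODE_POINT = 0x10FFFF

# Store it as a single big-endian UTF-32 code unit (4 bytes, no BOM).
utf32_bytes = chr(MAX_CODE_POINT).encode("utf-32-be")

# The same value written out as 32 binary digits.
as_bits = format(MAX_CODE_POINT, "032b")

print(utf32_bytes.hex())            # 0010ffff
print(as_bits[:11])                 # 00000000000 (the unused top 11 bits)
print(MAX_CODE_POINT.bit_length())  # 21
```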
In practice, decoding UTF-8 often means direct conversion from UTF-8 to UTF-16, or even just validating a UTF-8 sequence.
Illustrating Decoding
Let’s represent the significant 21 bits of a code point with letters of the alphabet, a through u, where each letter has a binary value of 0 or 1.
A UTF-32 encoded code point looks like this:
00000000 000abcde fghijklm nopqrstu
To encode a code point as UTF-8, anywhere between one and four bytes may be needed:
Code points from U+0000 to U+007F:
0opqrstu
Code points from U+0080 to U+07FF:
110klmno 10pqrstu
Code points from U+0800 to U+FFFF:
1110fghi 10jklmno 10pqrstu
Finally, code points from U+010000 to U+10FFFF:
11110abc 10defghi 10jklmno 10pqrstu
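The four layouts above translate fairly directly into shifts and masks. The following is a minimal Python sketch of the encoding direction, not a production encoder; the function name encode_utf8 is made up for this illustration. Rejecting the surrogate range U+D800 to U+DFFF is an addition that the layouts themselves do not show.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point using the four bit layouts listed above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption added here: surrogates are not encodable in UTF-8.
        raise ValueError("surrogate code point")

    if cp <= 0x7F:      # 0opqrstu
        return bytes([cp])
    if cp <= 0x7FF:     # 110klmno 10pqrstu
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:    # 1110fghi 10jklmno 10pqrstu
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110abc 10defghi 10jklmno 10pqrstu
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# One sample from each of the four ranges, checked against Python's own encoder.
for sample in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(sample) == chr(sample).encode("utf-8")
```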
Decoding UTF-8 consists of identifying the a-u bits, extracting them, and laying them out into the lower 21 bits of a 32-bit UTF-32 code point.
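As a rough illustration of that idea, here is a minimal Python sketch that decodes a single code point. The function name decode_utf8_char and its (code_point, length) return shape are assumptions made for this example. It only performs the bit extraction described above: it checks the byte prefixes, but it does not reject overlong forms, surrogates, or values above U+10FFFF, so it should not be mistaken for a validating decoder.

```python
from typing import Tuple


def decode_utf8_char(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (code_point, length) for the sequence starting at data[pos]."""
    lead = data[pos]

    if lead < 0x80:                   # 0opqrstu
        return lead, 1
    if (lead & 0xE0) == 0xC0:         # 110klmno
        length, cp = 2, lead & 0x1F
    elif (lead & 0xF0) == 0xE0:       # 1110fghi
        length, cp = 3, lead & 0x0F
    elif (lead & 0xF8) == 0xF0:       # 11110abc
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")

    if pos + length > len(data):
        raise ValueError("truncated sequence")

    # Each continuation byte (10xxxxxx) contributes six more bits.
    for byte in data[pos + 1:pos + length]:
        if (byte & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (byte & 0x3F)

    return cp, length


print(decode_utf8_char(b"\xe2\x82\xac"))  # (8364, 3), i.e. U+20AC
```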
Manual Decoding
For instance, let’s consider a 3-byte sequence of UTF-8 encoded text: 0xe6, 0xb0, 0xb4.
The first byte, in binary, is: 1110 0110. It starts with 1110, which is the prefix of a three-byte sequence. The rest of the bits are the “fghi” ones: f=0, g=1, h=1, i=0. The bits that come before f (a-e) are all zeros.
The second byte, in binary, is: 1011 0000. It starts with 10, which is the prefix for the continuation bytes. Then we have: j=1, k=1, l=0, m=0, n=0, o=0.
The third byte is: 1011 0100. Again, it starts with 10, as expected. The rest of the bits are: p=1, q=1, r=0, s=1, t=0, u=0.
We now have all the bits we need. Laying them out into a UTF-32 code point leads to:
00000000 00000000 01101100 00110100
That is 0x6c34 in hex, i.e. the Unicode code point U+6C34. It looks like this: 水 and (if the Internet is to be believed) it means “water” in Chinese.
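The same manual steps can be replayed with masks and shifts. This is just a sketch of the walkthrough above in Python; the variable names (b1, fghi, and so on) mirror the letters used in this article, and the final line cross-checks the result against Python's built-in decoder.

```python
b1, b2, b3 = 0xE6, 0xB0, 0xB4

# Check the prefixes: 1110 for the lead byte, 10 for each continuation byte.
assert (b1 >> 4) == 0b1110
assert (b2 >> 6) == 0b10 and (b3 >> 6) == 0b10

fghi = b1 & 0x0F     # 0110
jklmno = b2 & 0x3F   # 110000
pqrstu = b3 & 0x3F   # 110100

# Lay the bits out into the low 16 bits; a-e stay zero for a 3-byte sequence.
code_point = (fghi << 12) | (jklmno << 6) | pqrstu

print(format(code_point, "032b"))  # 00000000000000000110110000110100
print(hex(code_point))             # 0x6c34

# Cross-check against the built-in decoder.
assert bytes([b1, b2, b3]).decode("utf-8") == chr(code_point)
```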