Decoding UTF-8. Part I: Manual Decoding

What does it mean to decode UTF-8?

Jun 22, 2025

This is the first part of a series about decoding UTF-8 encoded text.

I assume you have some basic knowledge of Unicode and its terminology. If not, I strongly recommend reading up on it before continuing with this series. Some good texts I'm aware of include:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) – Joel on Software
The Absolute Minimum Every Software Developer Must Know About Unicode in 2023 (Still No Excuses!)

In this article, we’ll focus on converting sequences of UTF-8 encoded octets (8-bit values; we will call them “bytes” from now on) to UTF-32 encoded Unicode code points. Strictly speaking, code points are at most 21 bits wide, but UTF-32 stores each code point in a 32-bit value whose highest 11 bits are always zero.
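As a quick sanity check of the 21-bit claim, here is a small sketch that uses Python's built-in `utf-32-be` codec to print the 32-bit form of a code point. The helper name `utf32_bits` is made up for this illustration.

```python
# U+10FFFF is the largest Unicode code point.
MAX_CODE_POINT = 0x10FFFF


def utf32_bits(code_point: int) -> str:
    """Return the big-endian UTF-32 form of a code point as bits, grouped by byte."""
    raw = chr(code_point).encode("utf-32-be")  # always 4 bytes, no byte order mark
    return " ".join(f"{byte:08b}" for byte in raw)


print(MAX_CODE_POINT.bit_length())  # 21
print(utf32_bits(MAX_CODE_POINT))   # 00000000 00010000 11111111 11111111
```

Even for the largest code point, the top 11 bits of the 32-bit value stay zero.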

In practice, decoding UTF-8 often means direct conversion from UTF-8 to UTF-16, or even just validating a UTF-8 sequence.
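To illustrate those two practical variants, here is a minimal sketch that leans on Python's built-in codecs rather than hand-written decoding. The function names are invented for this example, and the UTF-16 conversion goes through a Python `str` instead of being a truly direct UTF-8 to UTF-16 transcoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Validate only: report whether data is well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


def utf8_to_utf16_units(data: bytes) -> list:
    """Convert UTF-8 bytes to a list of 16-bit UTF-16 code units."""
    utf16 = data.decode("utf-8").encode("utf-16-le")  # little-endian, no BOM
    return [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]


print(is_valid_utf8(b"\xe6\xb0\xb4"))  # True
print(is_valid_utf8(b"\xe6\xb0"))      # False: the sequence is truncated
print([hex(unit) for unit in utf8_to_utf16_units(b"\xe6\xb0\xb4")])  # ['0x6c34']
```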

Illustrating Decoding

Let’s label the 21 significant bits of a code point with the letters a through u, where each letter stands for a single bit with a value of 0 or 1.

A UTF-32 encoded code point looks like this:

00000000 000abcde fghijklm nopqrstu

To encode a code point as UTF-8, anywhere between one and four bytes may be needed:

Code points from U+0000 to U+007F: 0opqrstu
Code points from U+0080 to U+07FF: 110klmno 10pqrstu
Code points from U+0800 to U+FFFF: 1110fghi 10jklmno 10pqrstu
Finally, code points from U+010000 to U+10FFFF: 11110abc 10defghi 10jklmno 10pqrstu
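The four layouts above translate fairly directly into shifts and masks. The following is a sketch of an encoder, not a production implementation. The name `encode_utf8` is made up for illustration, and the surrogate check is an addition not covered by the list above: UTF-8 does not allow encoding U+D800 to U+DFFF.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a single code point using the bit layouts listed above."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Not in the list above: surrogates are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x7F:    # 0opqrstu
        return bytes([code_point])
    if code_point <= 0x7FF:   # 110klmno 10pqrstu
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:  # 1110fghi 10jklmno 10pqrstu
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110abc 10defghi 10jklmno 10pqrstu
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


print(encode_utf8(0x6C34).hex())  # e6b0b4
```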

Decoding UTF-8 consists of identifying the a-u bits, extracting them and laying them out into the lower 21 bits of a 32-bit UTF-32 code point.
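That description can be sketched as a small function that inspects the lead byte, then folds in six bits from each continuation byte. The name `decode_one` is hypothetical. It is a simplified sketch: it checks the byte prefixes, but it does not reject overlong encodings, surrogates, or values above U+10FFFF, all of which a conforming decoder must reject.

```python
from typing import Tuple


def decode_one(data: bytes) -> Tuple[int, int]:
    """Decode the first code point in data; return (code_point, bytes_consumed)."""
    if not data:
        raise ValueError("empty input")
    lead = data[0]
    if (lead & 0x80) == 0x00:    # 0opqrstu
        length, value = 1, lead
    elif (lead & 0xE0) == 0xC0:  # 110klmno
        length, value = 2, lead & 0x1F
    elif (lead & 0xF0) == 0xE0:  # 1110fghi
        length, value = 3, lead & 0x0F
    elif (lead & 0xF8) == 0xF0:  # 11110abc
        length, value = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if (byte & 0xC0) != 0x80:  # continuation bytes must look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value, length


code_point, consumed = decode_one(bytes([0xE6, 0xB0, 0xB4]))
print(hex(code_point), consumed)  # 0x6c34 3
```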

Manual Decoding

For instance, let’s consider a 3-byte sequence of UTF-8 encoded text:

0xe6, 0xb0, 0xb4.

The first byte, in binary, is: 1110 0110. It starts with 1110, which is the prefix of a three-byte sequence. The remaining four bits are the “fghi” ones: f=0, g=1, h=1, i=0. The bits that come before f (a through e) are all zero, because a three-byte sequence carries only 16 significant bits.

The second byte in binary is: 1011 0000. It starts with 10 which is the prefix for the continuation bytes. Then we have: j=1, k=1, l=0, m=0, n=0, o=0.

The third byte is: 1011 0100. Again, it starts with 10 as expected. The rest of the bits are: p=1, q=1, r=0, s=1, t=0, u=0.

We now have all the bits we need. Laying them out into a UTF-32 code point leads to:

00000000 00000000 01101100 00110100

That is 0x6c34 in hex, i.e. the Unicode code point U+6C34. It looks like this: 水 and (if the Internet is to be believed) it means “water” in Chinese.
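The same walk-through can be reproduced with a few masks and shifts. This is only a sketch of this one three-byte example, with variable names chosen to match the bit labels used above.

```python
sequence = bytes([0xE6, 0xB0, 0xB4])

# Strip the prefixes: 1110 from the first byte, 10 from the continuation bytes.
fghi = sequence[0] & 0x0F
jklmno = sequence[1] & 0x3F
pqrstu = sequence[2] & 0x3F

# Lay the 16 extracted bits out into the low bits of the code point.
code_point = (fghi << 12) | (jklmno << 6) | pqrstu

print(f"{fghi:04b} {jklmno:06b} {pqrstu:06b}")  # 0110 110000 110100
print(f"U+{code_point:04X}")                    # U+6C34
print(chr(code_point))                          # 水
```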
