Decoding UTF-8. Part II: Determining Sequence Length - a Straightforward Approach
Analysis of a simple algorithm to determine the length of a valid UTF-8 sequence.
This is the second part in my series on decoding UTF-8. The first part is: Decoding UTF-8. Part I: Manual Decoding.
As we saw, the length of a valid UTF-8 sequence can be determined just by inspecting its first byte:
If the byte starts with a zero bit, the sequence length is one byte.
If it starts with
110, the length is two bytes.If it starts with
1110, the length is three bytes.If it starts with
11110, the length is four bytes.
A straightforward implementation of this algorithm in C could look like:
int utf8_sequence_length(unsigned char lead_byte) {
if ((lead_byte & 0x80) == 0) {
// Single byte (0opqrstu)
return 1;
} else if ((lead_byte & 0xE0) == 0xC0) {
// Two-byte sequence (110xxxxx)
return 2;
} else if ((lead_byte & 0xF0) == 0xE0) {
// Three-byte sequence (1110xxxx)
return 3;
} else if ((lead_byte & 0xF8) == 0xF0) {
// Four-byte sequence (11110xxx)
return 4;
} else {
// Invalid UTF-8 lead byte
return 0;
}
}Let’s take a closer look at what we’re doing here:
The first branch checks whether the leading byte has bit 7 cleared. We do this by performing a bitwise AND with 0x80 (1000 0000); if bit 7 is zero, the result is zero regardless of the other bits.
Generated assembly (aarch64 Ubuntu clang 18.1.3) looks even more straightforward than the C code:
tbnz w0, #0x7, 0xaaaaaaaa07e4 ; if bit 7 is set, go to the next branch
mov w0, #0x1 ; otherwise, set return value to 1
b 0x7a8 ; jump to the function epilogInstead of doing full bitwise AND, Clang uses tbnz (Test bit and Branch if Nonzero) to test bit 7 directly.
Of course, we could also write:
if (lead_byte < 0x80) {and the generated assembly is exactly the same.
Interestingly, the GNU C compiler generated different code1, that uses left shift instead of tbnz to determine the value of bit 7:
lsls r3, r0, #24 ; shift left r0 by 24 - result in r3
bpl .L3 ; if r3 is positive or zero, go to .L3
...
.L3:
movs r0, #1 ; set return value to 1
bx lr ; returnMicrosoft’s C++ compiler v19 for arm64 produces code that is very similar to Clang’s.
The second branch checks whether the lead byte starts with 110. To do that, we do a bitwise AND with 0xE0 (1110 000) to zero all the bits that we are not interested in and then compare the result to 0xC0 (1100 0000).
} else if ((lead_byte & 0xE0) == 0xC0) {
// Two-byte sequence
return 2;Here is the generated assembly with clang 18.1.3 on aarch64 Ubuntu:
and w8, w8, #0xff ; mask low byte of w8
and w9, w8, #0xe0 ; keep only top 3 bits
cmp w9, #0xc0 ; compare against 11000000
b.ne 0x784 ; if not equal go to the next branch
mov w1, #0x2 ; it is equal - the length is 2
b 0x7a8 ; jump to epilogGCC 15.1 and MSVC produce similar sequences for this branch.
The third branch looks for 1110 at the start of the lead byte. We zero all bits but the first four ones by doing a bitwise AND with 0xF0 (1111 0000) and then compare the result to E0 (1110 000). Generated machine code is equivalent to the second branch.
The final branch is looking whether the lead byte starts with the 11110 sequence. By now, it is pretty clear what we are doing: we zero the lower three bits by applying bitwise AND with 0xF8 (1111 1000) and then compare the result to 0xF0 (1111 0000).
} else if ((lead_byte & 0xF8) == 0xF0) {
// Four-byte sequence
return 4;The generated assembly with clang 18.1.3 on aarch64 Ubuntu:
and w8, w8, #0xf8 ; mask off lower three bits
cmp w8, #0xf0 ; compare against 11110000
cset w8, eq ; if equal, set w8 to 1 - else to 0
lsl w1, w8, #2 ; multiply content of w8 by 4 and store to w1GCC’s variant uses clz (count leading zeros) and shifts:
and r0, r0, #248 ; bitwise and with 0xf8
sub r0, #240 ; subtract 0xf0
clz r0, r0 ; if r0 was 0, now it is 32 - otherwise less than 32
lsrs r0, r0, #5 ; if r0 was 32, now it is 1 - otherwise 0
lsls r0, r0, #2 ; multiply by 4
bx lrMSVC 19 uses csel:
and w0,w0,#0xF8 ; bitwise and with 0xf8
cmp w0,#0xF0 ; is w0 equal to 0xf0?
mov w0,#4 ; assign 4 to w0
cseleq w0,w0,wzr ;
retThe three compilers produce virtually identical code for the “if” condition, but then differently set the result: clang and msvc use variants of cset and csel instructions, and gcc resorts to counting leading zeros and shifting.
At first glance, the approach described here is a bit naive - it is simple and straightforward but introduces quite a few branches which are theoretically bad for instruction pipelining. In practice, the effect of branching on performance is hard to estimate due to branch predictions performed by modern CPUs.
In some of the future posts, we’ll examine ways to reduce branching when determining UTF-8 sequence length.
For the assembly output of GCC and MSVC, I used Matt Godbolt's excellent Compiler Explorer. For Clang, I simply compiled the code on my local machine (Arm64 Microsoft Surface laptop with Ubuntu Linux on WSL2) and inspected the assembly with LLDB.

