Decoding UTF-8. Part II: Determining Sequence Length - a Straightforward Approach

Analysis of a simple algorithm to determine the length of a valid UTF-8 sequence.

Jul 27, 2025

This is the second part in my series on decoding UTF-8. The first part is: Decoding UTF-8. Part I: Manual Decoding.

As we saw, the length of a valid UTF-8 sequence can be determined just by inspecting its first byte:

If the byte starts with a zero bit, the sequence length is one byte.
If it starts with 110, the length is two bytes.
If it starts with 1110, the length is three bytes.
If it starts with 11110, the length is four bytes.

A straightforward implementation of this algorithm in C could look like:

int utf8_sequence_length(unsigned char lead_byte) {
    if ((lead_byte & 0x80) == 0) {
        // Single byte (0opqrstu)
        return 1;
    } else if ((lead_byte & 0xE0) == 0xC0) {
        // Two-byte sequence (110xxxxx)
        return 2;
    } else if ((lead_byte & 0xF0) == 0xE0) {
        // Three-byte sequence (1110xxxx)
        return 3;
    } else if ((lead_byte & 0xF8) == 0xF0) {
        // Four-byte sequence (11110xxx)
        return 4;
    } else {
        // Invalid UTF-8 lead byte
        return 0;
    }
}

Let’s take a closer look at what we’re doing here:

The first branch checks whether the leading byte has bit 7 cleared. We do this by performing a bitwise AND with 0x80 (1000 0000); if bit 7 is zero, the result is zero regardless of the other bits.

Generated assembly (aarch64 Ubuntu clang 18.1.3) looks even more straightforward than the C code:

tbnz   w0, #0x7, 0xaaaaaaaa07e4 ; if bit 7 is set, go to the next branch
mov    w0, #0x1 ; otherwise, set return value to 1
b      0x7a8    ; jump to the function epilog

Instead of doing full bitwise AND, Clang uses tbnz (Test bit and Branch if Nonzero) to test bit 7 directly.

Of course, we could also write:

if (lead_byte < 0x80) {

and the generated assembly is exactly the same.

Interestingly, the GNU C compiler generated different code1, that uses left shift instead of tbnz to determine the value of bit 7:

  lsls    r3, r0, #24  ; shift left r0 by 24 - result in r3
  bpl     .L3          ; if r3 is positive or zero, go to .L3
...
.L3:
  movs    r0, #1       ; set return value to 1
  bx      lr           ; return

Microsoft’s C++ compiler v19 for arm64 produces code that is very similar to Clang’s.

The second branch checks whether the lead byte starts with 110. To do that, we do a bitwise AND with 0xE0 (1110 000) to zero all the bits that we are not interested in and then compare the result to 0xC0 (1100 0000).

} else if ((lead_byte & 0xE0) == 0xC0) {
   // Two-byte sequence
   return 2;

Here is the generated assembly with clang 18.1.3 on aarch64 Ubuntu:

and    w8, w8, #0xff ; mask low byte of w8
and    w9, w8, #0xe0 ; keep only top 3 bits
cmp    w9, #0xc0     ; compare against 11000000
b.ne   0x784         ; if not equal go to the next branch
mov    w1, #0x2      ; it is equal - the length is 2
b      0x7a8         ; jump to epilog

GCC 15.1 and MSVC produce similar sequences for this branch.

The third branch looks for 1110 at the start of the lead byte. We zero all bits but the first four ones by doing a bitwise AND with 0xF0 (1111 0000) and then compare the result to E0 (1110 000). Generated machine code is equivalent to the second branch.

The final branch is looking whether the lead byte starts with the 11110 sequence. By now, it is pretty clear what we are doing: we zero the lower three bits by applying bitwise AND with 0xF8 (1111 1000) and then compare the result to 0xF0 (1111 0000).

} else if ((lead_byte & 0xF8) == 0xF0) {
        // Four-byte sequence
        return 4;

The generated assembly with clang 18.1.3 on aarch64 Ubuntu:

and    w8, w8, #0xf8  ; mask off lower three bits
cmp    w8, #0xf0      ; compare against 11110000
cset   w8, eq         ; if equal, set w8 to 1 - else to 0
lsl    w1, w8, #2     ; multiply content of w8 by 4 and store to w1

GCC’s variant uses clz (count leading zeros) and shifts:

and     r0, r0, #248  ; bitwise and with 0xf8
sub     r0, #240      ; subtract 0xf0
clz     r0, r0        ; if r0 was 0, now it is 32 - otherwise less than 32
lsrs    r0, r0, #5    ; if r0 was 32, now it is 1 - otherwise 0 
lsls    r0, r0, #2    ; multiply by 4
bx      lr

MSVC 19 uses csel:

and         w0,w0,#0xF8 ; bitwise and with 0xf8
cmp         w0,#0xF0    ; is w0 equal to 0xf0?
mov         w0,#4       ; assign 4 to w0
cseleq      w0,w0,wzr   ; 
ret

The three compilers produce virtually identical code for the “if” condition, but then differently set the result: clang and msvc use variants of cset and csel instructions, and gcc resorts to counting leading zeros and shifting.

At first glance, the approach described here is a bit naive - it is simple and straightforward but introduces quite a few branches which are theoretically bad for instruction pipelining. In practice, the effect of branching on performance is hard to estimate due to branch predictions performed by modern CPUs.

In some of the future posts, we’ll examine ways to reduce branching when determining UTF-8 sequence length.

For the assembly output of GCC and MSVC, I used Matt Godbolt's excellent Compiler Explorer. For Clang, I simply compiled the code on my local machine (Arm64 Microsoft Surface laptop with Ubuntu Linux on WSL2) and inspected the assembly with LLDB.

Programming at the right level

Decoding UTF-8. Part II: Determining Sequence Length - a Straightforward Approach

Analysis of a simple algorithm to determine the length of a valid UTF-8 sequence.