Decoding UTF-8. Part IV: Determining Sequence Length - Counting Leading Bits

Sep 07, 2025

In the first part of this series, we saw how to manually decode a UTF-8 sequence. In part two, we started looking into determining sequence length. Part three was about avoiding branches when calculating sequence length with a lookup table.

Now, we will see how we can use some help from the hardware to accomplish the same task: based on the lead byte, calculate the UTF-8 sequence length. I recommend reading my recent post Counting Leading Zeros in a Byte to get familiar with the standard C and C++ functions for counting leading zeroes in a byte and the assembly instructions they use.

We will use a C++ 20 std::countl_one() function from <bit> header to get the number of consecutive set bits, starting from the most significant bit. As we saw in the part one:

Having 0 set bits on the left side means the most significant bit is zero. If a lead byte starts with zero, it means it is the only byte in the sequence - the length is 1.
1 set bit means the byte starts with the 10 sequence - these are continuation bytes, so we report an invalid lead
2 set bits mean the lead byte starts with 110; the sequence length is 2.
3 set bits mean the lead byte starts with 1110; the sequence length is 3.
4 set bits mean the lead byte starts with 11110; the sequence length is 4
Anything else indicates an invalid lead byte.

Without further ado, here is the code:

#include <bit>
int utf8_sequence_length(unsigned char lead_byte) {
    switch (std::countl_one(lead_byte)) {
        case 0: return 1;
        case 2: return 2;
        case 3: return 3;
        case 4: return 4;
        default: return 0; // invalid lead
    }
}

Note that we do not catch all invalid values of the lead byte here. It is OK, as long as we do proper validation at some point.

Let’s have a look at the generated assembly (clang 18.1, aarch64, -O2 --std=c++20):

utf8_sequence_length(unsigned char):
        mov     w8, #255
        bics    w9, w8, w0
        clz     w9, w9
        bics    wzr, w8, w0
        mov     w8, #8
        sub     w9, w9, #24
        csel    w8, w8, w9, eq
        cmp     w8, #4
        b.hi    .LBB0_2
        adrp    x9, .Lswitch.table.utf8_sequence_length(unsigned char)
        add     x9, x9, :lo12:.Lswitch.table.utf8_sequence_length(unsigned char)
        ldr     w0, [x9, w8, uxtw #2]
        ret
.LBB0_2:
        mov     w0, wzr
        ret

.Lswitch.table.utf8_sequence_length(unsigned char):
        .word   1
        .word   0
        .word   2
        .word   3
        .word   4

Like x86-64, Aarch64 instruction set does not have an instruction for counting leading ones - therefore we need to invert the bits and count zeros. That’s what CLZ instruction does. The switch-case statement generated a small lookup table and there is no branching except to check for the invalid value (something that a branch predictor will be able to deal with).

In the next post in the series, I plan to discuss some benchmarks that compare the three methods of calculating a UTF-8 sequence length.

Programming at the right level

Ready for more?