Decoding UTF-8. Part IV: Determining Sequence Length - Counting Leading Bits
In the first part of this series, we saw how to manually decode a UTF-8 sequence. In part two, we started looking into determining sequence length. Part three was about avoiding branches when calculating sequence length with a lookup table.
Now, we will see how we can use some help from the hardware to accomplish the same task: based on the lead byte, calculate the UTF-8 sequence length. I recommend reading my recent post Counting Leading Zeros in a Byte to get familiar with the standard C and C++ functions for counting leading zeroes in a byte and the assembly instructions they use.
We will use a C++ 20 std::countl_one() function from <bit> header to get the number of consecutive set bits, starting from the most significant bit. As we saw in the part one:
Having 0 set bits on the left side means the most significant bit is zero. If a lead byte starts with zero, it means it is the only byte in the sequence - the length is 1.
1 set bit means the byte starts with the 10 sequence - these are continuation bytes, so we report an invalid lead
2 set bits mean the lead byte starts with 110; the sequence length is 2.
3 set bits mean the lead byte starts with 1110; the sequence length is 3.
4 set bits mean the lead byte starts with 11110; the sequence length is 4
Anything else indicates an invalid lead byte.
Without further ado, here is the code:
#include <bit>
int utf8_sequence_length(unsigned char lead_byte) {
switch (std::countl_one(lead_byte)) {
case 0: return 1;
case 2: return 2;
case 3: return 3;
case 4: return 4;
default: return 0; // invalid lead
}
}Note that we do not catch all invalid values of the lead byte here. It is OK, as long as we do proper validation at some point.
Let’s have a look at the generated assembly (clang 18.1, aarch64, -O2 --std=c++20):
utf8_sequence_length(unsigned char):
mov w8, #255
bics w9, w8, w0
clz w9, w9
bics wzr, w8, w0
mov w8, #8
sub w9, w9, #24
csel w8, w8, w9, eq
cmp w8, #4
b.hi .LBB0_2
adrp x9, .Lswitch.table.utf8_sequence_length(unsigned char)
add x9, x9, :lo12:.Lswitch.table.utf8_sequence_length(unsigned char)
ldr w0, [x9, w8, uxtw #2]
ret
.LBB0_2:
mov w0, wzr
ret
.Lswitch.table.utf8_sequence_length(unsigned char):
.word 1
.word 0
.word 2
.word 3
.word 4Like x86-64, Aarch64 instruction set does not have an instruction for counting leading ones - therefore we need to invert the bits and count zeros. That’s what CLZ instruction does. The switch-case statement generated a small lookup table and there is no branching except to check for the invalid value (something that a branch predictor will be able to deal with).
In the next post in the series, I plan to discuss some benchmarks that compare the three methods of calculating a UTF-8 sequence length.

