Decoding UTF-8. Part III: Determining Sequence Length - A Lookup Table
Avoiding branching with the help of a lookup table
In the first part of this series on decoding UTF-8, we discussed what it means to decode a UTF-8 sequence. In part two, we saw how to determine sequence length and mentioned that there are ways to reduce branching.
The obvious way to avoid branching in the code is to use a lookup table. As we have only 256 possible values for the first byte of a UTF-8 sequence, we can hard-code a simple table that would map values of the lead byte to sequence lengths. And because the possible values of a byte are contiguous (0-255
), we can implement the table as a simple array, where the index is the value of the lead byte, and array elements are lengths of the sequence.
As we saw in the part one, any byte with its highest bit set to 0
is a lead byte for a one-byte sequence. That means the values of the first 128
positions in the lookup table (0x00 - 0x7F
) will be 1
.
Lead bytes of two-bytes sequences have three highest bits set to 110
. That means they take the range 1100 0000
- 1101 1111
, or in hex: 0xC0
- 0xDF
(decimal 192
- 223
). Therefore, values at positions 192
to 223
in the lookup table are 2
.
Three-bytes sequences have the lead byte starting with 1110
. That makes the range of 1110 0000
to 1110 1111
, or 0xE0 - 0xEF
(decimal 224 - 239
). Values at positions 224
to 239
in the lookup table are 3
.
Four-bytes sequences have the lead byte starting with 11110
. The range of the values that have these bits set is from 1111 0000
to 1111 0111
, or 0xF0 to 0xF7 (decimal 240 - 247
). Values at positions 240 - 247
are 4
.
Note two additional ranges not covered above: the first one is 129 - 191
, hex 0x81 - 0x8F
, binary 1000 0001 - 10111111
. As we can see, all these bytes have the top two bits set to 10, which is the characteristic of continuation bytes in UTF-8 encoded sequences. They cannot be valid lead bytes, and we can use a value outside the 1–4
range to denote invalid lead bytes. Here, we’ll use 0, just to be consistent with the code from part 1. That makes the positions 129 - 191
have value 0
.
Another range we did not cover is at the high end of the table: 248 - 255
, or 0xF8 - 0xFF
, binary: 1111 1000 - 1111 1111
. These are also invalid lead bytes, and in fact these bytes cannot be found in any valid UTF-8 string.
What we have so far is enough to make a function equivalent to the one we had in part one. Two more subtleties that we did not address there:
- Range 0xC0–0xC1
: These start overlong sequences for U+0000…U+007F
and must be invalid lead bytes (length = 0).
- Range 0xF5–0xF7
: Implies code points above U+10FFFF
; also invalid (length = 0).
I usually leave the validation for later (indeed, it will be a subject of a separate future post) but in this case, addressing them here is free, so I decided to include these two cases.
With all that said, here is a lookup-table based function that returns sequence length for a lead byte.
int utf8_sequence_length(unsigned char lead_byte) {
// Hard-coded lookup table for UTF-8 lead byte lengths
static const unsigned char lookup[256] = {
/* 0x00–0x7F */ [0 ... 0x7F] = 1,
/* 0x80–0xBF */ [0x80 ... 0xBF] = 0,
/* 0xC0–0xC1 */ [0xC0 ... 0xC1] = 0,
/* 0xC2–0xDF */ [0xC2 ... 0xDF] = 2,
/* 0xE0–0xEF */ [0xE0 ... 0xEF] = 3,
/* 0xF0–0xF4 */ [0xF0 ... 0xF4] = 4,
/* 0xF5–0xFF */ [0xF5 ... 0xFF] = 0,
};
// Access the hard-coded table to determine the sequence length
return lookup[lead_byte];
}
Note the use of ranges in designated initializers to initialize the lookup table; it is a GNU extension that also works with Clang but is not standard C; in portable code, we would fill the table manually.
Generated assembly (clang 18.1.0, aarch64):
utf8_sequence_length:
and x8, x0, #0xff // x8 = lower 8 bits of input
adrp x9, utf8_sequence_length.lookup
// x9 = page base address of lookup table
add x9, x9, :lo12:utf8_sequence_length.lookup
// x9 = full address of lookup table
ldrb w0, [x9, x8] // w0 = table[x8]
ret // return, length in w0
utf8_sequence_length.lookup:
.ascii "\001\001\001...\004\004" // lookup table - first 245 bytes
.zero 11 // pad to exactly 256 entries
Avoiding branching doesn’t come for free: a 256-byte array is added to the .rodata
section, which may negatively affect caching.
In the next post, we’ll try to figure out a way to reduce branching without using a lookup table.