Decoding UTF-8. Part VI: Simplistic Non-Validating Decoder
Reconstructing Code Points Without Safety Checks
In parts 2-5 we spent some time determining the length of a UTF-8 sequence. Now we move on to actual decoding. We’ll start not just simple, but simplistic: without validation.
Validation is important. Not every stream of bytes is valid UTF-8, and assuming that it is leads not only to functional bugs but also to security problems. A UTF-8 decoder needs to operate under the assumption that the input stream may not be valid UTF-8, and be ready to abort decoding gracefully and report an error when it detects an invalid encoding.
That said, there are some scenarios in which it is OK to skip validation and save some CPU cycles:
The input has already been validated. It is OK to do validation once and then non-validated decoding if the input bytes are immutable.
The input is produced by a trusted source. If you are processing strings that you generate internally, e.g. string literals and various resource files that you know are valid UTF-8, it is OK to skip validation.
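Outside these scenarios the risk is easy to demonstrate. The sketch below is only an illustration, not part of the decoder: `decode_two_bytes_unchecked` and `overlong_demo` are hypothetical names introduced here. It applies the two-byte bit arithmetic to the bytes C0 AF, which are not valid UTF-8 (an overlong encoding), and still gets the code point of '/' back. That is the kind of result that can slip past a filter looking for the literal byte 2F.

```c
#include <assert.h>

/* Hypothetical helper: the two-byte arithmetic only, with no checks at all. */
static unsigned int decode_two_bytes_unchecked(const unsigned char *s) {
    return ((unsigned int)(s[0] & 0x1F) << 6) | (unsigned int)(s[1] & 0x3F);
}

/* C0 AF is an overlong, invalid sequence; unchecked decoding yields U+002F. */
static void overlong_demo(void) {
    const unsigned char overlong_slash[] = { 0xC0, 0xAF };
    unsigned int cp = decode_two_bytes_unchecked(overlong_slash);
    assert(cp == 0x2F);
    assert(cp == '/');
}
```

A validating decoder would reject this input instead of returning a code point.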
Here is a minimal C function that performs UTF‑8 decoding without any validation:
/* A sequence length function we covered in previous posts */
int utf8_sequence_length(unsigned char lead_byte);
/* Returns the code point and advances *p by the number of bytes consumed. */
unsigned int utf8_decode(const unsigned char **p) {
    const unsigned char *s = *p;
    unsigned int cp;
    switch (utf8_sequence_length(s[0])) {
    case 1:
        cp = s[0];
        *p += 1;
        break;
    case 2:
        cp = ((s[0] & 0x1F) << 6) |
              (s[1] & 0x3F);
        *p += 2;
        break;
    case 3:
        cp = ((s[0] & 0x0F) << 12) |
             ((s[1] & 0x3F) << 6)  |
              (s[2] & 0x3F);
        *p += 3;
        break;
    default:
        cp = ((s[0] & 0x07) << 18) |
             ((s[1] & 0x3F) << 12) |
             ((s[2] & 0x3F) << 6)  |
              (s[3] & 0x3F);
        *p += 4;
        break;
    }
    return cp;
}

The function does what we described in Decoding UTF-8. Part I: Manual Decoding.
A one-byte sequence is decoded by taking the low 7 bits from the byte (simply converting the unsigned char to unsigned int).
A two-byte sequence is decoded by laying out the five low bits from the first byte (s[0] & 0x1F) to bits 6-10 (<< 6) and the six low bits from the second byte (s[1] & 0x3F) to bits 0-5.
For a three-byte sequence we lay out the four low bits from the first byte (s[0] & 0x0F) to bits 12-15 (<< 12), then the six low bits from the second byte (s[1] & 0x3F) to bits 6-11 (<< 6), and finally the six low bits from the third byte (s[2] & 0x3F) to bits 0-5.
For a four-byte sequence we lay out the three low bits from the first byte (s[0] & 0x07) to bits 18-20 (<< 18), then the six low bits from the second byte (s[1] & 0x3F) to bits 12-17 (<< 12), then the six low bits from the third byte (s[2] & 0x3F) to bits 6-11 (<< 6), and finally the six low bits from the fourth byte (s[3] & 0x3F) to bits 0-5.
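To make the bit layouts concrete, here is a small self-contained sketch that runs the decoder over one sequence of each length: "A" (41), "é" (C3 A9), "€" (E2 82 AC) and U+1F600 (F0 9F 98 80). The body of `utf8_sequence_length` is an assumption, a stand-in for the function from the earlier parts that trusts the lead byte, and `decode_demo` is a hypothetical name used only for this example.

```c
#include <assert.h>

/* Assumption: stand-in for the length function from the earlier parts.
   It trusts the lead byte and performs no validation. */
static int utf8_sequence_length(unsigned char lead_byte) {
    if (lead_byte < 0x80) return 1;            /* 0xxxxxxx */
    if ((lead_byte & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead_byte & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    return 4;                                  /* 11110xxx, assumed valid */
}

/* Same non-validating decoder as in the article. */
static unsigned int utf8_decode(const unsigned char **p) {
    const unsigned char *s = *p;
    unsigned int cp;
    switch (utf8_sequence_length(s[0])) {
    case 1:
        cp = s[0];
        *p += 1;
        break;
    case 2:
        cp = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        *p += 2;
        break;
    case 3:
        cp = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        *p += 3;
        break;
    default:
        cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12) |
             ((s[2] & 0x3F) << 6)  |  (s[3] & 0x3F);
        *p += 4;
        break;
    }
    return cp;
}

/* Decodes a trusted byte string containing one sequence of each length. */
static void decode_demo(void) {
    static const unsigned char text[] = {
        0x41,                   /* U+0041  "A" */
        0xC3, 0xA9,             /* U+00E9  "é" */
        0xE2, 0x82, 0xAC,       /* U+20AC  "€" */
        0xF0, 0x9F, 0x98, 0x80  /* U+1F600     */
    };
    const unsigned char *p = text;
    unsigned int a = utf8_decode(&p);
    unsigned int b = utf8_decode(&p);
    unsigned int c = utf8_decode(&p);
    unsigned int d = utf8_decode(&p);
    assert(a == 0x41 && b == 0xE9 && c == 0x20AC && d == 0x1F600);
    assert(p == text + sizeof text);  /* all ten bytes were consumed */
}
```

For the three-byte case, the arithmetic works out as 0x2 << 12 = 0x2000, 0x02 << 6 = 0x80 and 0x2C, which OR together to 0x20AC.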
Assembly (clang 18.1 on aarch64 with -O2) for the three-byte sequence:
// The three-byte sequence: 1110fghi 10jklmno 10pqrstu
// Starts with: w0 - zero-extended first byte (1110fghi),
// x20 - start of the sequence
// Ends with: w0 - final codepoint: 00000000 00000000 fghijklm nopqrstu
and w8, w0, #0xf // Extract fghi bits from first byte
lsl w0, w8, #12 // Into bits 12-15
ldrb w8, [x20, #1] // Load second byte
bfi w0, w8, #6, #6 // Insert jklmno bits into bits 6-11
ldrb w8, [x20, #2] // Load third byte
bfxil w0, w8, #0, #6 // Insert pqrstu bits into bits 0-5
mov w8, #3 // length = 3
b .utf8_decode_epilog // go to the function epilog

Pretty much straightforward. Note how a different instruction is used for each byte:
lsl: moves bits upward into an empty register.
bfi: inserts bits into a middle part of the register.
bfxil: extracts low bits and inserts them into a low part of the register.
One interesting thing I learned from inspecting this assembly snippet is that the BFI and BFXIL instructions are aliases for BFM. The former inserts a bitfield into a destination register at a specified position, and the latter extracts a bitfield from the source and places it into the low-order bits of the destination [1].
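For readers who think more easily in C than in AArch64 mnemonics, the two aliases can be modeled roughly as follows. This is a sketch of the 32-bit behavior described above, not a statement about how any compiler lowers code; `bfi32`, `bfxil32` and `three_byte_like_asm` are hypothetical names, and the helpers assume 1 <= width and lsb + width <= 32.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low `width` bits set (assumes 1 <= width <= 32). */
static uint32_t low_mask(unsigned width) {
    return (width < 32) ? ((UINT32_C(1) << width) - 1) : UINT32_MAX;
}

/* Model of BFI: copy the low `width` bits of src into dst starting at
   bit `lsb`, leaving all other bits of dst unchanged. */
static uint32_t bfi32(uint32_t dst, uint32_t src, unsigned lsb, unsigned width) {
    uint32_t mask = low_mask(width);
    return (dst & ~(mask << lsb)) | ((src & mask) << lsb);
}

/* Model of BFXIL: take `width` bits of src starting at bit `lsb` and copy
   them into the low bits of dst, leaving the upper bits of dst unchanged. */
static uint32_t bfxil32(uint32_t dst, uint32_t src, unsigned lsb, unsigned width) {
    uint32_t mask = low_mask(width);
    return (dst & ~mask) | ((src >> lsb) & mask);
}

/* Replays the three-byte assembly snippet step by step. */
static uint32_t three_byte_like_asm(const unsigned char *s) {
    uint32_t w0 = (uint32_t)(s[0] & 0x0F) << 12;  /* and + lsl           */
    w0 = bfi32(w0, s[1], 6, 6);                   /* bfi   w0, w8, #6, #6 */
    w0 = bfxil32(w0, s[2], 0, 6);                 /* bfxil w0, w8, #0, #6 */
    assert(w0 <= 0xFFFF);  /* a three-byte sequence carries 16 payload bits */
    return w0;
}
```

Because both operations preserve the untouched bits of the destination, the compiler can build the code point in place without separate mask, shift and OR instructions for each continuation byte.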
In the next part, we’ll extend the simple decoder with validation.
[1] Raymond Chen in his blog entry The AArch64 processor (aka arm64), part 7: Bitfield manipulation complains how the (U)BFM instruction “hurts his brain” and is thankful for the aliases. Frankly, I am not a fan of aliases - in my mind assembly should map faithfully to machine code.

