Why is utf-8 overlong encoding even a thing?
So in utf-8, if you have the codepoint 0100100 (U+0024, '$'), you can represent it either literally as the single byte 00100100 or as the two-byte sequence 11000000 10100100. The latter is called an "overlong encoding" because it uses more bytes than it needs. That means the two-byte form wastes a 0x80-sized chunk of its range re-encoding codepoints the one-byte form already covers. Why didn't they just add an offset, so that 11000000 10000000 means 0x80 instead of 0x00? Then overlong encodings couldn't exist, because the ranges would never overlap. I doubt the offset would be that much more expensive to process, given that most of the text you'll encounter is in the basic multilingual plane anyway.
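Here's a minimal sketch of what I mean. The first function is the standard two-byte UTF-8 decode; the second is the hypothetical biased variant I'm describing (the function names and the +0x80 offset are my own illustration, not anything from the spec):

```c
#include <stdio.h>

/* Standard UTF-8: 110xxxxx 10xxxxxx -> 11 payload bits, range 0x000-0x7FF.
 * Values 0x00-0x7F are also reachable this way, which is the overlong overlap. */
static unsigned decode2_standard(unsigned char b1, unsigned char b2)
{
    return ((b1 & 0x1Fu) << 6) | (b2 & 0x3Fu);
}

/* Hypothetical biased variant: same bits plus an offset of 0x80, so the
 * two-byte range would be 0x080-0x87F and could never collide with one byte. */
static unsigned decode2_biased(unsigned char b1, unsigned char b2)
{
    return decode2_standard(b1, b2) + 0x80u;
}

int main(void)
{
    /* 0xC0 0xA4 is the overlong two-byte sequence for '$' (U+0024). */
    printf("standard: %#x\n", decode2_standard(0xC0, 0xA4)); /* 0x24: duplicates the one-byte form */
    printf("biased:   %#x\n", decode2_biased(0xC0, 0xA4));   /* 0xA4: something one byte can't express */
    return 0;
}
```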
The git developers figured this shit out with their variable-length integer encoding: each multibyte range begins at the lowest value that can't be represented by the next size down.
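As I remember it, the decode for git's delta-base offsets in pack files looks roughly like this (reconstructed from memory of the pack format, not copied from git's source): each continuation byte implicitly adds 1 before shifting, so an n-byte encoding starts exactly one past the largest value an (n-1)-byte encoding can hold.

```c
#include <stdio.h>
#include <stddef.h>

/* Biased varint decode in the style of git's pack offset encoding. */
static unsigned long decode_offset(const unsigned char *buf, size_t *used)
{
    size_t i = 0;
    unsigned char c = buf[i++];
    unsigned long value = c & 0x7Fu;
    while (c & 0x80u) {
        value += 1;                       /* the bias: skip everything shorter encodings cover */
        c = buf[i++];
        value = (value << 7) + (c & 0x7Fu);
    }
    *used = i;
    return value;
}

int main(void)
{
    const unsigned char one[] = { 0x7F };        /* largest one-byte encoding */
    const unsigned char two[] = { 0x80, 0x00 };  /* smallest two-byte encoding */
    size_t used;
    printf("one-byte max: %lu\n", decode_offset(one, &used)); /* 127 */
    printf("two-byte min: %lu\n", decode_offset(two, &used)); /* 128, no overlap */
    return 0;
}
```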
I'm so fucking irritated that they didn't think of this in the design, and we're stuck with utf-8 despite this stupid, easily corrected flaw. utf-8 is effectively a sequence of variable-length integers, and instead of removing the redundancy from the encoding itself, the fix got tacked on afterward: decoders are simply required to treat any overlong encoding as an error.