Why is utf-8 overlong encoding even a thing?
So in utf-8, if you have the codepoint 0100100 (U+0024, '$'), you can represent it either literally as the single byte 00100100 or as the two-byte sequence 11000000 10100100. The latter is called an "overlong encoding" because it uses more bytes than it needs. That means the two-byte form wastes a 0x80-sized chunk of its range re-encoding codepoints the one-byte form already covers. Why didn't they just add an offset, so that 11000000 10000000 means 0x80 instead of 0x00? Then overlong encodings couldn't exist, because the ranges would never overlap. I doubt the offset would be that much more expensive to process, given that most of the text you'll encounter is in the basic multilingual plane anyway.
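Here's a minimal sketch of what I mean. The first function is the standard two-byte UTF-8 decode; the second is the hypothetical biased variant I'm describing (the function names and the +0x80 offset are my own illustration, not anything from the spec):

```c
#include <stdio.h>

/* Standard UTF-8: 110xxxxx 10xxxxxx -> 11 payload bits, range 0x000-0x7FF.
 * Values 0x00-0x7F are also reachable this way, which is the overlong overlap. */
static unsigned decode2_standard(unsigned char b1, unsigned char b2)
{
    return ((b1 & 0x1Fu) << 6) | (b2 & 0x3Fu);
}

/* Hypothetical biased variant: same bits plus an offset of 0x80, so the
 * two-byte range would be 0x080-0x87F and could never collide with one byte. */
static unsigned decode2_biased(unsigned char b1, unsigned char b2)
{
    return decode2_standard(b1, b2) + 0x80u;
}

int main(void)
{
    /* 0xC0 0xA4 is the overlong two-byte sequence for '$' (U+0024). */
    printf("standard: %#x\n", decode2_standard(0xC0, 0xA4)); /* 0x24: duplicates the one-byte form */
    printf("biased:   %#x\n", decode2_biased(0xC0, 0xA4));   /* 0xA4: something one byte can't express */
    return 0;
}
```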
The git developers figured this shit out with their variable-length integer encoding: each multibyte range begins at the lowest value that can't be represented by the next size down.
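As I remember it, the decode for git's delta-base offsets in pack files looks roughly like this (reconstructed from memory of the pack format, not copied from git's source): each continuation byte implicitly adds 1 before shifting, so an n-byte encoding starts exactly one past the largest value an (n-1)-byte encoding can hold.

```c
#include <stdio.h>
#include <stddef.h>

/* Biased varint decode in the style of git's pack offset encoding. */
static unsigned long decode_offset(const unsigned char *buf, size_t *used)
{
    size_t i = 0;
    unsigned char c = buf[i++];
    unsigned long value = c & 0x7Fu;
    while (c & 0x80u) {
        value += 1;                       /* the bias: skip everything shorter encodings cover */
        c = buf[i++];
        value = (value << 7) + (c & 0x7Fu);
    }
    *used = i;
    return value;
}

int main(void)
{
    const unsigned char one[] = { 0x7F };        /* largest one-byte encoding */
    const unsigned char two[] = { 0x80, 0x00 };  /* smallest two-byte encoding */
    size_t used;
    printf("one-byte max: %lu\n", decode_offset(one, &used)); /* 127 */
    printf("two-byte min: %lu\n", decode_offset(two, &used)); /* 128, no overlap */
    return 0;
}
```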
I'm so fucking irritated that they didn't think of this in the design, and we're stuck with utf-8 despite this stupid, easily corrected flaw. utf-8 is effectively a sequence of variable-length integers, and instead of removing the redundancy from the encoding itself, the fix got tacked on afterward: decoders are simply required to treat any overlong encoding as an error.