
/tech/ - Technology



 No.985628

Why is utf-8 overlong encoding even a thing?

So in UTF-8, the code point 0100100 (0x24) can be represented either as the single byte 00100100 or as the two bytes 11000000 10100100. The latter is called an "overlong encoding" because it uses more bytes than it needs. This means the two-byte form wastes 0x80 of its possible values on code points that already fit in one byte. Why didn't they just shift it so that 11000000 10000000 represents 0x80 instead of 0x00? Then there would be no overlong encoding at all, because the ranges could never overlap. I doubt it would be much more expensive to process, given that the majority of characters you'll usually get are in the Basic Multilingual Plane anyway.
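
To make the overlap concrete, here is a minimal Python sketch. The helpers `naive_decode_2byte` and `shifted_decode_2byte` are hypothetical names written only for this illustration; the second one implements the shifted scheme proposed above, which is not real UTF-8.

```python
def naive_decode_2byte(b: bytes) -> int:
    """Decode a 110xxxxx 10xxxxxx pair with no overlong check at all."""
    if len(b) != 2 or b[0] >> 5 != 0b110 or b[1] >> 6 != 0b10:
        raise ValueError("not a two-byte sequence")
    return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F)


def shifted_decode_2byte(b: bytes) -> int:
    """Hypothetical variant: the two-byte range starts where one byte ends."""
    return naive_decode_2byte(b) + 0x80


overlong = bytes([0b11000000, 0b10100100])  # C0 A4

# Without a check, the two-byte form collides with the single byte 0x24.
print(hex(naive_decode_2byte(overlong)))

# With the shift, the same bytes name a value no single byte can reach.
print(hex(shifted_decode_2byte(overlong)))

# Python's built-in decoder follows the standard and refuses the overlong form.
try:
    overlong.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

Under the shifted scheme every two-byte sequence would decode to at least 0x80, so there would be nothing left for a validator to reject on length grounds.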

The git developers figured this shit out with their variable-length integer encoding: each multibyte range begins at the lowest value that can't be represented with one byte fewer.
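
A rough Python sketch of that kind of redundancy-free varint follows. It is modeled on my understanding of the offset encoding git uses in pack files (big-endian 7-bit groups, with 1 added for every continuation byte); the function names are my own, and this is not git's actual code.

```python
def encode_offset_varint(value: int) -> bytes:
    """Encode a non-negative integer so that every value has exactly one encoding."""
    if value < 0:
        raise ValueError("value must be non-negative")
    out = [value & 0x7F]            # last byte: continuation bit clear
    value >>= 7
    while value:
        value -= 1                  # skip the values shorter encodings already cover
        out.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(out))


def decode_offset_varint(data: bytes) -> int:
    """Inverse of encode_offset_varint; expects exactly one encoded integer."""
    it = iter(data)
    c = next(it)
    val = c & 0x7F
    while c & 0x80:
        val += 1                    # undo the per-byte offset
        c = next(it)
        val = (val << 7) + (c & 0x7F)
    return val


# One byte covers 0..127, so the two-byte range starts at 128, not at 0.
print(encode_offset_varint(127).hex())
print(encode_offset_varint(128).hex())
print(decode_offset_varint(bytes([0x80, 0x00])))
```

Because the smallest two-byte sequence, 80 00, decodes to 128 instead of 0, there is no such thing as an overlong form here.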

I'm so fucking irritated that they didn't think of this in the design, and now we're stuck with UTF-8 and this stupid, easily corrected flaw. UTF-8 is effectively a sequence of variable-length integers, and instead of fixing the encoding to remove the redundancy, they tacked on a rule afterward that "overlong encoding" is to be regarded as an error.

 No.985634

Oh, so this is like eval(base10(float 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000))=int 2


 No.985642>>985648 >>985679

The only way to use UTF-8 sanely is to program as if a glyph can't be more than one codepoint.
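
For context on what that assumption gives up, a small Python sketch using the standard `unicodedata` module. The sample strings are my own picks, not from the thread.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: one glyph, two code points.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # code point counts before and after NFC
print(composed == "\u00e9")             # NFC folds the pair into U+00E9

# Some glyphs have no precomposed form, so normalization cannot help:
# MAN + ZERO WIDTH JOINER + WOMAN + ZERO WIDTH JOINER + GIRL
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(unicodedata.normalize("NFC", family)))
```

Normalizing to NFC collapses the cases that have a precomposed code point; sequences like the last one stay multi-code-point regardless.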


 No.985648>>985649

>>985642

But then you get fucked if you get invalid data. You need to program as if a glyph can't be more than one codepoint and also error out on overlong.
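
A minimal sketch of what "error out on overlong" could look like, assuming a hand-rolled decoder. `decode_one` and `MIN_FOR_LENGTH` are hypothetical names; in practice a library decoder such as Python's `bytes.decode("utf-8")` already does this check.

```python
from typing import Tuple

# Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Return (code_point, next_pos), or raise ValueError on invalid input."""
    first = data[pos]
    if first < 0x80:
        length, cp = 1, first
    elif first >> 5 == 0b110:
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid start byte")

    if pos + length > len(data):
        raise ValueError("truncated sequence")

    for b in data[pos + 1 : pos + length]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)

    # The check the thread is about: the value must need this many bytes.
    if cp < MIN_FOR_LENGTH[length]:
        raise ValueError("overlong encoding")
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("code point out of range")
    return cp, pos + length


print(decode_one(b"\x24"))
try:
    decode_one(b"\xc0\xa4")
except ValueError as exc:
    print("error:", exc)
```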


 No.985649>>985651

>>985648

Well, of course. I should have specified that this means discarding those.


 No.985651

>>985649

My point is just that utf-8 fucked up variable-length integer encoding. It's a stupid mistake.


 No.985677

this is the least of the problems with UTF-8 tbh


 No.985679

>>985642

>not supporting skin tone modifiers on the kissing couple emoji



