▶ No.917388>>917392 >>917470 >>917781 >>918913 >>919325 >>923403 >>925213
If we had another chance at a character encoding standard, what would you propose?
▶ No.917390>>918660
▶ No.917391>>917439
>useless 1 line thread
kys
UTF-8 is the only standard we need
▶ No.917392
>>917388 (OP)
An extremely grammatically simple language designed for usage on a computer. Each of 256 letters (including spaces et al) is encoded as one character. Because it is intended for computer usage, things like line breaks, EOFs, and even advanced markup could be represented with words that fit very nicely into the language's grammar. Simpler renderers could ignore special words and it would still be human-readable.
People would learn the language in addition to learning how to use a computer. Because of how global the Internet is, it would make sense for such a universal language to exist for computers.
▶ No.917406>>917425
Unicode + UTF-8 is about as good as it gets. Maybe some more characters should be differentiated - for example, 'I'.lower() != 'i' in a Turkish locale - but my impression is that most problems now are related to compatibility and the inherent difficulty of formalizing human notation.
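For reference, a quick Python sketch of that Turkish-I quirk. Python's str methods are locale-independent, so the Turkish behaviour only shows up through the dedicated dotted/dotless code points (the examples are just illustrations):
# Plain str.lower()/upper() always use the default (English) mapping.
print('I'.lower())        # 'i' -- correct for English, wrong for Turkish
# Turkish has its own dotted capital I and dotless small i:
print('\u0130'.lower())   # 'i' + U+0307 COMBINING DOT ABOVE (capital I with dot)
print('\u0131'.upper())   # 'I' -- small dotless i uppercases to plain I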
▶ No.917425>>917439
>>917406
>UTF-8 is about as good as it gets
Seconding this. We don't need to stick all the fucking memes and emojis into unicode, but UTF-8 is a really good format.
▶ No.917439>>917641 >>918660
>>917391
unicode is dickballs
>>917425
the emojis aren't even the problem with unicode. it has about 3000 problems and emojis amount to about 0.1% of the problem. of course seeing them add a bunch of pointless emojis is still insufferable
▶ No.917447>>917498 >>917888 >>918660
UTF-8 is cancer. It's ambiguous and can encode invalid strings. No one handles normalization correctly and only a few people even understand what the problem is. Normalization isn't reversible, so there is information loss. Normalization requires massive tables of bullshit that will have to be kept up to date forever, which is why there are dozens of copies of the massive ICU.DLL on your Windows system, dozens of libicu on your smartphones, and even a couple redundant copies on a Linux distro. How did they fuck up something so badly that should have been so simple?
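To make those two complaints concrete, here is a minimal Python sketch (the overlong byte sequence and the normalization forms are just example cases):
import unicodedata

# Overlong encoding: 0xC0 0xAF is a non-shortest-form encoding of '/',
# which a conforming UTF-8 decoder must reject.
try:
    bytes([0xC0, 0xAF]).decode('utf-8')
except UnicodeDecodeError as e:
    print('rejected:', e)

# Normalization: the same visible character as two different code point sequences.
a = '\u00e9'     # e-acute, precomposed (NFC form)
b = 'e\u0301'    # 'e' + COMBINING ACUTE ACCENT (NFD form)
print(a == b)                                  # False without normalization
print(unicodedata.normalize('NFC', a) ==
      unicodedata.normalize('NFC', b))         # True after normalization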
▶ No.917453>>917470 >>918660
ASCII did literally nothing wrong and oughtta be enuff for anybody
▶ No.917470>>921386 >>925215
>>917388 (OP)
>If we had another chance at a character encoding standard, what would you propose?
I'm going to assume you mean character code and not encoding. Unicode had the right idea, but they fucked up by adding everything and the kitchen sink (there is probably a kitchen sink emoji). Do what Unicode did, but limit it to only semantic characters, which excludes emoji and similar shit.
>>917453
You are probably just shitposting, but there are people out there who really seem to be under the impression that English is the only language that matters.
▶ No.917498>>917930
Like everyone else, pretty much UTF-8.
>>917447
>It's ambiguous and can code invalid strings
So can pretty much every other encoding
>No one handles normalization correctly and only a few people even understand what the problem is
Normalization is necessary for any encoding which wishes to span a large character set. It's an unavoidable complexity.
>Normalization isn't reversible so there is information loss
and this is a problem why? If you really need it unnormalized, just keep it that way until you do a compare operation or whatever.
>icu libraries will be scattered throughout your system
They aren't on my system. I only have two copies of them, a 32-bit and a 64-bit version.
▶ No.917501>>917502 >>917570 >>917613 >>918660
There should be nothing more than extended ASCII. If you do not write in a latin alphabet you're shit.
Also just make your own encoding for non-latin alphabets. It doesn't have to be universal.
▶ No.917502>>917628 >>918711
>>917501
How do you write APL then faggot.
▶ No.917570>>917615 >>918660
>>917501
this, ascii is more than enough. let the niche moonrune languages deal with it themselves if it matters so much.
▶ No.917607>>918660
Plain ASCII is good. It even works on the smallest 8-bit computers.
▶ No.917613>>917627
>>917501
What if I want to use mathematical notation, or mix my Plato with my English, in the same document or even the same sentence?
Do you really think it's less messy to have a trillion encodings?
▶ No.917615>>918660
>>917570
You mean unicode?
▶ No.917627>>917629 >>918660
>>917613
I think it's less messy to use a markup language (or Emacs, or Word) for special cases like that. Having a small, easy-to-understand *normal*, *canonical*, *default* encoding is important if you want to accept even usernames without going completely mad. Unicode's allowance of mixing what should logically be completely separate encodings has only resulted in bullshit--unicode smilies, unicode that leaves vertical streaks down your screen, unicode that phishes you by being visually identical to something else. It's an understandable desire, like adding CJK to Unicode, and (like adding CJK to Unicode) it's a bad desire and pursuing it has only given us bad things.
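The phishing point is easy to demonstrate; the domain below is just an illustration:
# Two strings that usually render identically but are different code points.
real = 'paypal.com'
fake = 'p\u0430ypal.com'     # second letter is U+0430 CYRILLIC SMALL LETTER A
print(real == fake)          # False
print([hex(ord(c)) for c in fake[:3]])   # ['0x70', '0x430', '0x79']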
▶ No.917628
>>917502
Like all the other APL programmers, in a windows IDE.
▶ No.917629>>917698
>>917627
Allowing only ASCII in usernames and domain names is fine enough, but why not allow unicode in the places where you would allow that markup language?
▶ No.917641>>917664 >>917702
>>917439
>UTF-8 = Unicode
This is the state of /tech/.
▶ No.917664
>>917641
Is that post actually equating the two? I don't see it.
▶ No.917698>>917703 >>917704 >>918660
>>917629
If you're going to use unicode at all, at least keep it contained to the contents of actual documents, and preferably documents that must be in or contain a non-English language. If you start using it for everything, including OS level stuff, you're bound to run into problems eventually. One example someone gave on openbsd-misc was copying files from one system to another. If the filenames are plain ASCII, there are no problems, but if they're in unicode, you can't be sure what they'll end up like. Actually I've often seen files inside ZIP files get extracted with filenames full of ?????? characters all over the place. So it's not just a hypothetical, it's actually happening out there right now.
▶ No.917702
>>917641
I want to see something BESIDES unicode. All I see are a shit ton of ways to encode unicode.
▶ No.917703
>>917698
That wouldn't be a problem if there weren't multiple encodings floating around.
Solving that problem by never using unicode isn't actually more feasible than solving that problem by always using UTF-8.
▶ No.917704>>917709
>>917698
System level shit should just use raw bytes for everything instead of a particular encoding. If someone wants to name their file with a jpeg of a pepe, fuck it, let them.
▶ No.917709>>917716 >>917861 >>918660 >>918715
>>917704
Filenames are part of the user interface, and they should take that into account. End users will be confronted with them. Arbitrary bytes is a bad idea, case insensitive unicode with a lot of (control) characters blacklisted might be good.
Anything that requires the ability to use arbitrary bytes in file names is probably a really bad idea in the first place, and will still have to avoid using path separators and (usually) null bytes. Newlines alone make correct shell scripting a lot harder.
Forcing the entire world to use the latin alphabet in filenames might seem like a good idea if you're an edgy imageboard poster but it doesn't fly in real life.
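The newline point above is easy to demonstrate on a Unix filesystem (the filename is made up; Windows would reject it outright):
import os, tempfile

# Create a file whose name contains a newline, then look at what a naive
# line-oriented consumer of `ls`-style output would see.
d = tempfile.mkdtemp()
open(os.path.join(d, 'evil\nname.txt'), 'w').close()
listing = '\n'.join(os.listdir(d))
print(listing.splitlines())   # ['evil', 'name.txt'] -- one file looks like two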
▶ No.917716>>917719
>>917709
> but it doesn't fly in real life.
You want to know about real life? It's Windows 7, Photoshop, Ubuntu Server, and JavaScript.
▶ No.917719>>917727
>>917716
What's your point? I honestly don't understand what you're trying to say.
For what it's worth, I think Windows's filename convention is closest to what I described.
▶ No.917727>>917773
>>917719
Please go back to the "real world" of macs, windows, iphones, and other shit designed to restrict users.
▶ No.917732
SYSTEMD encoding.
Decoding requires the file to be piped through SYSTEMD-chard
Binary only files encrypted and compressed at creation time and linked to hardware addresses
▶ No.917751
OK here's my idea.
Null terminated characters.
▶ No.917773
>>917727
I think unicode is good in general.
"In general" includes things that are designed to restrict you. It also includes things that are designed to give you freedom.
Freedom is orthogonal to unicode support.
▶ No.917781>>917807 >>917889
>>917388 (OP)
Each character is a 64-bit number. That number stores the black and white pixels of an 8x8 pixel square. Each character code is equivalent to the 64-bit number that prints a pattern that looks like the character. An optional pretty font layer is added on top for vector rendering/whatever.
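A toy Python sketch of that scheme; the GLYPH_A value is made up and not part of any real standard:
GLYPH_A = 0x182442427E424242   # hypothetical bit pattern roughly shaped like 'A'

def render(glyph):
    # Decode a 64-bit glyph: the top byte is the top row, MSB is the left pixel.
    rows = []
    for row in range(8):
        bits = (glyph >> (8 * (7 - row))) & 0xFF
        rows.append(''.join('#' if bits & (1 << (7 - col)) else '.' for col in range(8)))
    return '\n'.join(rows)

print(render(GLYPH_A))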
▶ No.917807
▶ No.917825
All you need is TTS+opus.
▶ No.917861>>917875 >>917915
>>917709
Well it did work like that IRL for quite a long time, since that's how Unix did things for decades. In fact people even tended to shy away from using whitespace or shell metacharacters in filenames (but the system didn't enforce that).
▶ No.917875>>918721
>>917861
Unix went from "anything other than '/'" in the very early days to "anything other than '/' or '\0'" (probably once it was rewritten in C).
That's pretty bad for what's supposed to be a human-readable identifier. It's typical for Unix. Keeping it that simple made sense fifty years ago at Bell Labs, but not today with kernels that are already huge and complex.
▶ No.917888
>>917447
>No one handles normalization correctly and only a few people even understand what the problem is.
This is the worst fucking thing. I have had to handle character conversion several times, in several languages, and I could not tell you anything about it. It always devolves into a copy-paste job followed by trying things until it seems to work. I understand many things but the verbiage and everything else around character encoding is bonkers to me.
>a character is represented not as a char but as this other thing which is guaranteed to fit inside a char except when it doesn't and if you go from this language make sure you do ...
fucking stop
☃☃☃☃
and then people keep fucking adding shit
▶ No.917889
>>917781
literally this would be less aids than unicode. and it would still be highly compressible. running a simple compression algo is much less bad than having a unicode-supporting behemoth
▶ No.917915>>917946
>>917861
>Well it did work like that IRL
If you allow arbitrary data like Unix you also allow unicode, not just the latin alphabet.
▶ No.917930
>>917498
>and this is a problem why?
It creates security issues. Again, I don't expect any of you to understand this. When you need two systems to agree on something like the permissions of a path, the question becomes how to get them to agree on the strings being equivalent. There is only one way to go due to there being information loss in the translation, towards fully normalized string comparisons, but you can't fully normalize in a future-proof way since it requires tables that will be changed over time by the committee. One side of that normalization is some day going to be a newer version that understands newer characters and it's going to open a hole.
Linux developers encountering this stupid clusterfuck raged about it and then decided to just say fuck it and do raw, unnormalized memory comparisons of UTF-8 strings since any other option was deemed unworkable. Other operating systems decided to half-ass something and partially handle normalization (which will likely require pinning compatibility to some old version of unicode since I doubt they can ever add new normalized forms without destroying existing filesystems and servers). This causes a lot of pain for projects that have to talk between the two. SMB shares on Linux and Mac OS have a lot of issues because of this as each OS handles this differently and there are infinite bugs in the middle since the future will add them even if they aren't there today.
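A toy version of the "two systems disagree about whether paths are equal" problem described above; the filename is made up:
import unicodedata

path_from_client = 'caf\u00e9.txt'                               # precomposed e-acute (NFC)
path_on_server = unicodedata.normalize('NFD', path_from_client)  # 'e' + combining accent
print(path_from_client == path_on_server)    # False -- byte-for-byte different
print(path_from_client.encode('utf-8'), path_on_server.encode('utf-8'))
# Whether these name "the same file" depends on which side normalizes, and when.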
▶ No.917936>>918684
I don't like UTF-8 but in practice I don't believe it can be improved. The ASCII-compatibility is somewhat of a cancer (for the same reason UTF-16 is shit), but without it most software wouldn't support international text at all. Most developers are too stupid to deal with international text, so they had to be tricked.
▶ No.917946
>>917915
People didn't allow that, even though Unix did. They followed naming conventions, even to the point of avoiding whitespace and using underscore characters instead.
I think that self-control started to vanish when Win95 long filenames got popular and began to show up everywhere.
▶ No.918660>>918665 >>918688 >>919143
>>917390
>>917453
>>917501
>>917570
>>917607
NO
>>917439
>>917615
Unicode is half-shit. They fucked up with CJKV.
t. Chinese person who wants this BS to end
>>917447
Is there a multilingual standard that is less fucked?
Big Five + HKSCS and Shift-JIS are good enough (GB is for Niggers)
>>917627
Agreed, diacritics and right-to-left are cancer and are reserved for kikes, poos and sandniggers.
>>917698
>>917709
English-only for system-programming level, multi-lingual for document level
▶ No.918665
>>918660
THIS. Reordering everything to TRON would make the world less unbearable https://en.wikipedia.org/wiki/TRON_(encoding)
▶ No.918670
▶ No.918684
>>917936
>ASCII-compatibility is somewhat of a cancer
<7-bit characters define ASCII
<All 7-bit characters of ASCII are defined within 1 8-bit byte in UTF-8.
What are you talking about? UTF-8 is actually pretty easy to handle and implement.
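A quick illustration of that point (the sample characters are chosen arbitrarily):
print('A'.encode('utf-8'))        # b'A' -- identical to the ASCII byte 0x41
print('\u00e9'.encode('utf-8'))   # two bytes, both with the high bit set
print('\u732b'.encode('utf-8'))   # three bytes; ASCII byte values never appear inside them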
▶ No.918688>>918689
>>918660
>right-to-left are cancer and are reserved for kikes, poos and sandniggers.
Get a load of this chink, all smug thinking he is more civilized for switching to left to right in the last hundred(or so) years.
>diacritics
It's a good thing your main input method is full of them then.
▶ No.918689>>918697
>>918688
>thinking he is more civilized for switching to left to right in the last hundred(or so) years
Right-to-left is Mudslime or (((Jewish))), and we are not going there
>Diacritics is okay
What are you, Indian? Vietnamese?
▶ No.918697>>918712
>>918689
>What are you, Indian? Vietnamese?
Huh? I was saying that pinyin is full of diacritics.
Are you even Chinese? Because I'm starting to doubt it.
▶ No.918711
>>917502
you draw it and let the compiler parse the .bmp
▶ No.918712>>918963 >>919139
▶ No.918721>>919414
>>917875
Then what's the alternative? Keeping a copy of ICU in the kernel? That way at least you can use any language in filenames and the kernel doesn't need to care.
▶ No.918913>>918957
>>917388 (OP)
>If we had another chance at a character encoding standard, what would you propose?
UTF-8 but without emoji, beep and other shit that exists only to cause trouble.
▶ No.918957>>918968 >>919313
>>918913
That's not part of what UTF-8 does. UTF-8 is a way to represent unicode code points, the meaning of those code points is defined elsewhere.
▶ No.918963>>919139
>>918712
Do you know of any Chinese imageboards? I feel like the general autism that posting to imageboards commands, combined with the cultural differences, would be rather amusing.
▶ No.918968
>>918957
We need an alternative to unicode not utf8
▶ No.919139
>>918712
In Cantonese: "The fuck you using diacritics for?"
>>918963
No, not really (BBS is rare as well)
https://www.hkgolden.com/
http://www.discuss.com.hk/
▶ No.919143>>919276 >>919322 >>922476 >>923416
>>918660
turn cjkv into something sensible.
make all the han characters into a series of composed radicals like it should be.
turn hangul blocks into hangul letters and have a combining mark or something better
also a mark that turns a han character into its corresponding emoji
▶ No.919276>>919322
>>919143
>Combining marks and Composed radicals
The bytes will balloon in that case. At least 5~7 bytes per character on average.
▶ No.919313>>919318
>>918957
I was talking about both.
UTF-8 is a way to represent unicode code points, and it's not used for representing anything else. Of course if the Unicode allocation table is revisited, UTF-8 would also decode differently.
And I also insist that all variants of UTF-16 must vanish, too. They are cancer and the worst of both ends (UTF-8 and fixed-width UTF-32): byte-order dependent and variable length => maximum complexity, and for no real gain (only pain).
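A small sketch of both UTF-16 pain points (variable length via surrogate pairs, plus byte-order dependence); the sample string is arbitrary:
s = 'a\U0001F41F'                        # 'a' plus a code point above U+FFFF
print(len(s))                            # 2 code points
print(len(s.encode('utf-16-le')) // 2)   # 3 UTF-16 code units -- a surrogate pair is needed
print(s.encode('utf-16-le'))             # same text...
print(s.encode('utf-16-be'))             # ...different bytes, depending on byte order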
In hindsight my proposal could be phrased a little more cleanly:
1: The one and only character-to-number mapping should be the modified Unicode, which is as follows:
1.1: All emoji and all characters that are only used for purposes other than displaying themselves as 2D monochrome glyphs or modifying adjacent characters' appearance while keeping them as 2D monochrome glyphs (that is, combining characters) should be discarded, and their code points shall be considered "unassigned" (these numbers are unused).
1.2: All other (valid) characters shall keep their code points as is, so that good software won't need to be altered to stay correct.
2: The one and only variable length encoding method of the one and only character-to-number mapping shall be UTF-8 (modified if possible to strip any excess complexities if they become obsolete after p.1)
3: Trying to press people to add any nonsensical characters to the new Unicode (that is, anything other than commonly used typographic symbols and real world languages) shall be a criminal offense.
▶ No.919318>>919357
>>919313
The only actionable suggestion here is to drop support for characters you don't like.
Encodings other than UTF-8 are already almost exclusively used for legacy reasons or not exposed.
All other parts of your suggestion are either noops or inane.
▶ No.919322>>922476 >>923045
>>919143
Don't make it "absolute radical" because we can cut everything down to less than 2500 codepoints and ~3 character average when encoding rebuses. "Absolute radical" would go like >>919276
▶ No.919325>>919339
▶ No.919339>>919404
>>919325
Python fucked up.
▶ No.919354
▶ No.919357
▶ No.919404>>919410 >>921258
>>919339
Did it?
It only uses UTF-32 for strings that contain four-byte code points. Strings that are purely ASCII (or even LATIN-1) take up one byte per character.
>>> import sys
>>> size = lambda s: sys.getsizeof(s * 10_000) // 10_000
>>> size('a')
1
>>> size('ß')
1
>>> size('ij')
2
>>> size('\N{fish}')
4
Unfortunately, even a single character is enough to blow up the size:
>>> sys.getsizeof('a' * 9_999 + '\N{fish}') // 10_000
4
But that seems hard to avoid if you want O(1) indexing, which is very desirable.
▶ No.919410>>919416 >>919426 >>919437 >>921391
>>919404
>But that seems hard to avoid if you want O(1) indexing, which is very desirable.
Daily reminder that big O does not correlate with reality in many situations. Non-constant-time access that stays in cache is much faster than constant-time access that misses it. UTF-32 is 4x bigger than UTF-8 in most situations. That's 4 GB vs 1 GB. Your constant-time access will be slower under most access patterns.
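Rough numbers behind that size claim; the sample text is made up and mostly ASCII:
text = 'mostly ASCII with a few accents: \u00fc\u00e9 ' * 1000
print(len(text.encode('utf-8')))      # close to one byte per character
print(len(text.encode('utf-32-le')))  # exactly four bytes per character, ~4x larger here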
▶ No.919414
>>918721
The kernel 'solved' the problem by saying that you can use UTF-8 but it will be treated like a binary string instead of unicode. It's a giant fuck you to the committee, but breaking compatibility is the correct engineering decision when given a broken standard. Microsoft even agreed unicode was fucked and uses UTF-16 as a binary string similar to what Linux does, however they do have mystery proprietary normalizations across their software like in SQL Server but maybe those are just Microsoft being Microsoft. Apple is all over the place with older versions of their OS doing the impossible task of normalization in kernel, then they started doing normalization at the application level instead, and I have no idea what the status of that mess is today but I'm glad I don't have to deal with it.
There's nothing worse than a flawed standard.
▶ No.919416>>919434
>>919410
It makes more sense for python to have O(1) indexing as python programmers don't understand algorithms; that better protects them from creating poorly performing code if they handle strings in insane ways.
▶ No.919426>>919430 >>919437 >>919492 >>921180 >>921392
>>919410
>Non constant time access to cache is much faster than constant time outside of cache.
You're an absolute retard. O(N), when N is 10,000, will be slower than O(1), regardless of the cache. Now kill yourself.
▶ No.919430>>919437
>>919426
What about O(2*N) when N is 10000?
▶ No.919432
▶ No.919434>>919437
>>919416
>python programmers don't understand algorithms
this is factually incorrect, and I will also kick you in the nuts for this if you dare to say it in my face.
▶ No.919437>>921178 >>922583
>>919410
Most of your strings will just be ASCII or LATIN-1 and take up a single byte per character.
If you do have a 1 GB string then O(1) indexing is likely to be useful.
>>919426
That's not true, and not how complexity works. For any value of N, there exist an O(N) algorithm and an O(1) algorithm such that the O(N) algorithm is faster than the O(1) algorithm for that value of N (even in the worst case). However, for any such pair, there also exists a value M such that the O(1) algorithm is faster than the O(N) algorithm for all N > M.
>>919430
O(2*N) and O(N) are the same thing.
>>919434
Many Python programmers do understand algorithms (plenty of them don't), but there's a valid point buried in there, which is that Python programmers shouldn't have to worry about the underlying implementation too much. A lot of Python's design is about reducing mental load so programmers can focus on the program instead of the programming language.
There are cases in which memory is very expensive and it's reasonable to expect the programmer to know every implementation detail of the language and functions they're working with so they can micromanage efficiency, but Python doesn't aim to cover those cases. For Python it's more sensible to build up expectations like "indexing is cheap" (O(1) for lists and strings, amortized O(1) for dicts), so
value = obj[ind]
if value is None:
    return default
return value
and
if obj[ind] is None:
    return default
return obj[ind]
are about equally performant for non-weird types of obj.
▶ No.919492
>>919426
>Now kill yourself
...sez the tripfag. Oh the irony.
▶ No.921178>>921258
>>919437
>Most of your strings will just be ASCII or LATIN-1 and take up a single byte per character. If you do have a 1 GB string then O(1) indexing is likely to be useful.
You totally misread the post or simply don't understand how it works. That is absolutely not the case. UTF32 and UTF8 DO NOT have the same properties.
▶ No.921180
>>919426
>He does not understand how complexity actually works
▶ No.921258
>>921178
I think you don't understand how CPython's strings work.
It picks an encoding based on the content (this is ok because they're immutable). If your string's characters fit in LATIN-1, it uses LATIN-1. If they don't all fit in LATIN-1 it uses more bytes per character. Strings always use fixed-width encodings, but not all strings use the same encoding, so the memory use tends to be ok in most cases.
See >>919404
▶ No.921386>>925215 >>925216
>>917470
>You are probably just shitposting, but there are people out there who really seem to be under the impression that English is the only language that matters.
Ascii has letters with diacritical marks for all of the other languages that matter. We're not programming in Korean you fucking gooklover.
▶ No.921391>>921400 >>922580
>>919410
Look at this idiom:
for i in range(len(s)):
    if s[i] == 'a':
        n += 1
Notice two things: 1. If indexing is O(1), there are no cache misses. 2. if indexing is O(n), then this runs in O(n^2).
So, for a one gigabyte string, assuming 1ns to read and compare a byte, this goes from taking one second to run to taking half a billion seconds.
▶ No.921392
>>919426
>schemeposter
>stupid
stop larping anytime
▶ No.921400>>921436
>>921391
>caring about string indexing when you're analyzing an algorithm
That's not how Big O works, shut the fuck up
▶ No.921436
>>921400
If indexing the string takes i operations, then the whole process takes 0.5*len(s)*(len(s)+1) of such operations, which is O(n^2).
▶ No.922129
▶ No.922476
>>919143
>>919322
Can anyone confirm this?
▶ No.922500>>922551 >>922608
All english characters come first, then symbols, then nippon's moonrunes. Anything else goes into later parts of the encoding as to fuck with people I don't like, such as needing several times the storage that one english letter would need for one arabic or hebrew character.
▶ No.922551>>922608
>>922500
What about Gook runes and Taiwan runes?
▶ No.922580
>>921391
Nobody would write it that badly.
It's actually:
n = sum(1 for l in s if l == 'a')
▶ No.922583>>922594 >>922653
>>919437
>For Python it's more sensible to build up expectations like "indexing is cheap"
Horse shit, it's even more sensible to not expect any specific optimization if you don't actually need it. (And if you need it, you go and check the docs)
The typical example --- if you just need to go through all items, use a fucking iterator. It will always work in an optimal way unless the author of that goddamn library deliberately hates you.
▶ No.922594>>923392
>>922583
What's an iterator, anon? Is that a bloated way of saying map or fold?
▶ No.922608>>922620
>>922551
>>922500
You are both retarded if you haven't realized that ~90% are shared. The only thing that varied was that the gooks simplified some characters differently than the chinks (or that taiwan and places speaking cantonese didn't simplify at all).
▶ No.922620>>923045
>>922608
We should just get rid of all characters except for the Chinese ones. Everyone will be speaking Chinese soon anyways.
▶ No.922636>>922641
ASCII, fuck foreigners.
This is American property!
▶ No.922641
>>922636
But I need muh chinese cartoon games and I don't want xir to localize them.
▶ No.922653>>923392
>>922583
This particular expectation is actually ingrained in the language. If you define a __getitem__ and a __len__ then an __iter__ is implicit, unless you provide it yourself.
Making indexing expensive would make only certain strings more compact, at the cost of breaking the expectations of people coming from Python 2 (where strings are bytes), or people coming from C, while also bothering people who want to go through two strings character-by-character and don't know how to use zip yet.
It has a huge cost and little gain.
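A minimal sketch of that fallback; the Squares class is just an illustration:
class Squares:
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        # The old sequence protocol: integer indexes from 0, IndexError to stop.
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i

print(list(Squares(5)))   # [0, 1, 4, 9, 16] -- iterable with no __iter__ defined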
▶ No.923045
▶ No.923128
standards are for noobs who complain like brats
allow all and getgud
▶ No.923348
Unicode and UTF-8
>B-but muh emojis!
yeah, don't use them.
▶ No.923392>>923393
>>922594
iterator is anything that can be iterated.
>>922653
>breaking the expectations of people coming from Python 2 (where strings are bytes), or people coming from C, while also bothering people who want to go through two strings character-by-character and don't know how to use zip yet.
nothing wrong with that.
but I agree that the gain would be small anyway.
▶ No.923393>>923394
>>923392
>iterator is anything that can be iterated.
I mean anything that can be asked if it has another item, and can be asked to give that item to you.
▶ No.923394
>>923393
…and it's a special case of a more general abstraction, a zipper
▶ No.923403
>>917388 (OP)
It would obviously have to be done with inline Javascript and plenty of remote server calls.
▶ No.923416>>924503
>>919143
Hangul blocks are already composed out of individual Jamo 'letters' without needing any kind of combining mark since Hangul is formulaic enough.
Turning Han characters into radicals only would be really difficult. While there are a lot of similar radicals, there are many characters which have pretty unique strokes that would be hard to modularize. That also isn't even thinking about how the layout of the character would be specified, which I can only imagine would be a nightmare.
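For what it's worth, the Hangul composition really is algorithmic; Unicode normalization already composes conjoining Jamo into syllable blocks:
import unicodedata

jamo = '\u1112\u1161\u11ab'             # HIEUH + A + NIEUN as separate conjoining Jamo
syllable = unicodedata.normalize('NFC', jamo)
print(syllable, hex(ord(syllable)))     # the precomposed syllable U+D55C
print(len(jamo), len(syllable))         # 3 code points collapse to 1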
▶ No.924503
>>923416
Well, for Cantonese at least, if you do "semi-radical", i.e. semantic-phonetic blocks, you can trim down the character space by a long shot.
▶ No.924521
▶ No.924730
>>918715
I bet you hate significant whitespace too fag.
▶ No.925213
>>917388 (OP)
ASCII. 2nd choice is EBCDIC
▶ No.925215>>925219
>>917470
>>921386
Or you could just encode other languages in ASCII instead of tainting your character space with homoglyph malware and cancer.
▶ No.925216
>>921386
>Ascii has letters with diacritical marks for all of the other languages that matter.
No, it doesn't. Some extensions of ASCII do.
▶ No.925219>>925220 >>925222
>>925215
Can't for Chinese
▶ No.925220>>925222 >>925238 >>925291
>>925219
Fuck the Chinese, having a specific character for each word is counterintuitive to building a language that's compatible with modern complexities.
The japanese realized this and adapted to romanji/hiragana/katakana
▶ No.925222
>>925219
>>925220
This is exactly what base64 encoding is for. ASCII has 95 usable characters, reserve whatever 30 punctuation shit for controls and you've still got 3x unicode width.
▶ No.925238
>>925220
Come back when you know the difference between Japanese/Korean and Chinese languages. Tonal vs non-tonal
▶ No.925291
>>925220
>japanese
>romanji
You don't know anything about Japanese.
▶ No.925433
I saved a Japanese site with wget. The live version appeared in Japanese but the text was mojibake in the source. The downloaded version appeared in mojibake but was Japanese in the source. Adding Shift-JIS as the character set fixed it. Character encoding is a hell of a thing.
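That round trip is easy to reproduce in Python; the sample text below is made up, not the actual page:
data = '\u65e5\u672c\u8a9e'.encode('shift_jis')   # "Japanese" (nihongo) as Shift-JIS bytes
print(data.decode('latin-1'))     # mojibake when the charset is guessed wrong
print(data.decode('shift_jis'))   # the original text once the right charset is declared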