▶ No.917388>>917392 >>917470 >>917781 >>918913 >>919325 >>923403 >>925213
If we had another chance at a character encoding standard, what would you propose?
▶ No.917390>>918660
▶ No.917391>>917439
>useless 1 line thread
kys
UTF-8 is the only standard we need
▶ No.917392
>>917388 (OP)
An extremely grammatically simple language designed for usage on a computer. Each of 256 letters (including spaces et al) is encoded as one character. Because it is intended for computer usage, things like line breaks, EOFs, and even advanced markup could be represented with words that fit very nicely into the language's grammar. Simpler renderers could ignore special words and it would still be human-readable.
People would learn the language in addition to learning how to use a computer. Because of how global the Internet is, it would make sense for such a universal language to exist for computers.
▶ No.917406>>917425
Unicode + UTF-8 is about as good as it gets. Maybe some more characters should be differentiated - for example, 'I'.lower() != 'i' in a Turkish locale - but my impression is that most problems now are related to compatibility and the inherent difficulty of formalizing human notation.
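For reference, a quick Python sketch of that Turkish-I quirk. Python's str methods are locale-independent, so the Turkish behaviour only shows up through the dedicated dotted/dotless code points (the examples are just illustrations):
# Plain str.lower()/upper() always use the default (English) mapping.
print('I'.lower())        # 'i' -- correct for English, wrong for Turkish
# Turkish has its own dotted capital I and dotless small i:
print('\u0130'.lower())   # 'i' + U+0307 COMBINING DOT ABOVE (capital I with dot)
print('\u0131'.upper())   # 'I' -- small dotless i uppercases to plain I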
▶ No.917425>>917439
>>917406
>UTF-8 is about as good as it gets
Seconding this. We don't need to stick all the fucking memes and emojis into unicode, but UTF-8 is a really good format.
▶ No.917439>>917641 >>918660
>>917391
unicode is dickballs
>>917425
the emojis aren't even the problem with unicode. it has about 3000 problems and emojis amount to about 0.1% of the problem. of course seeing them add a bunch of pointless emojis is still insufferable
▶ No.917447>>917498 >>917888 >>918660
UTF-8 is cancer. It's ambiguous and can encode invalid strings. No one handles normalization correctly and only a few people even understand what the problem is. Normalization isn't reversible, so there is information loss. Normalization requires massive tables of bullshit that will have to be kept up to date forever, which is why there are dozens of copies of the massive ICU.DLL on your Windows system, dozens of libicu on your smartphones, and even a couple redundant copies on a Linux distro. How did they fuck up something so badly that should have been so simple?
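To make those two complaints concrete, here is a minimal Python sketch (the overlong byte sequence and the normalization forms are just example cases):
import unicodedata

# Overlong encoding: 0xC0 0xAF is a non-shortest-form encoding of '/',
# which a conforming UTF-8 decoder must reject.
try:
    bytes([0xC0, 0xAF]).decode('utf-8')
except UnicodeDecodeError as e:
    print('rejected:', e)

# Normalization: the same visible character as two different code point sequences.
a = '\u00e9'     # e-acute, precomposed (NFC form)
b = 'e\u0301'    # 'e' + COMBINING ACUTE ACCENT (NFD form)
print(a == b)                                  # False without normalization
print(unicodedata.normalize('NFC', a) ==
      unicodedata.normalize('NFC', b))         # True after normalization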
▶ No.917453>>917470 >>918660
ASCII did literally nothing wrong and oughtta be enuff for anybody
▶ No.917470>>921386 >>925215
>>917388 (OP)
>If we had another chance at a character encoding standard, what would you propose?
I'm going to assume you mean character code and not encoding. Unicode had the right idea, but they fucked up by adding everything and the kitchen sink (there is probably a kitchen sink emoji). Do what Unicode did, but limit it to only semantic characters, which excludes emoji and similar shit.
>>917453
You are probably just shitposting, but there are people out there who really seem to be under the impression that English is the only language that matters.
▶ No.917498>>917930
Like everyone else, pretty much UTF-8.
>>917447
>It's ambiguous and can code invalid strings
So can pretty much every other encoding
>No one handles normalization correctly and only a few people even understand what the problem is
Normalization is necessary for any encoding which wishes to span a large character set. It's an unavoidable complexity.
>Normalization isn't reversible so there is information loss
and this is a problem why? If you really need it unnormalized, just keep it that way until you do a compare operation or whatever.
>icu libraries will be scattered throughout your system
They aren't on my system. I only have two copies of them, a 32-bit and a 64-bit version.
▶ No.917501>>917502 >>917570 >>917613 >>918660
There should be nothing more than extended ASCII. If you do not write in a latin alphabet you're shit.
Also just make your own encoding for non-latin alphabets. It doesn't have to be universal.
▶ No.917502>>917628 >>918711
>>917501
How do you write APL then faggot.
▶ No.917570>>917615 >>918660
>>917501
this, ascii is more than enough. let the niche moonrune languages deal with it themselves if it matters so much.
▶ No.917607>>918660
Plain ASCII is good. It even works on the smallest 8-bit computers.
▶ No.917613>>917627
>>917501
What if I want to use mathematical notation, or mix my Plato with my English, in the same document or even the same sentence?
Do you really think it's less messy to have a trillion encodings?
▶ No.917615>>918660
>>917570
You mean unicode?
▶ No.917627>>917629 >>918660
>>917613
I think it's less messy to use a markup language (or Emacs, or Word) for special cases like that. Having a small, easy-to-understand *normal*, *canonical*, *default* encoding is important if you want to accept even usernames without going completely mad. Unicode's allowance of mixing what should logically be completely separate encodings has only resulted in bullshit--unicode smilies, unicode that leaves vertical streaks down your screen, unicode that phishes you by being visually identical to something else. It's an understandable desire, like adding CJK to Unicode, and (like adding CJK to Unicode) it's a bad desire and pursuing it has only given us bad things.
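The phishing point is easy to demonstrate; the domain below is just an illustration:
# Two strings that usually render identically but are different code points.
real = 'paypal.com'
fake = 'p\u0430ypal.com'     # second letter is U+0430 CYRILLIC SMALL LETTER A
print(real == fake)          # False
print([hex(ord(c)) for c in fake[:3]])   # ['0x70', '0x430', '0x79']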
▶ No.917628
>>917502
Like all the other APL programmers, in a windows IDE.
▶ No.917629>>917698
>>917627
Allowing only ASCII in usernames and domain names is fine enough, but why not allow unicode in the places where you would allow that markup language?
▶ No.917641>>917664 >>917702
>>917439
>UTF-8 = Unicode
This is the state of /tech/.
▶ No.917664
>>917641
Is that post actually equating the two? I don't see it.
▶ No.917698>>917703 >>917704 >>918660
>>917629
If you're going to use unicode at all, at least keep it contained to the contents of actual documents, and preferably documents that must be in or contain a non-English language. If you start using it for everything, including OS level stuff, you're bound to run into problems eventually. One example someone gave on openbsd-misc was copying files from one system to another. If the filenames are plain ASCII, there are no problems, but if they're in unicode, you can't be sure what they'll end up like. Actually I've often seen files inside ZIP files get extracted with filenames full of ?????? characters all over the place. So it's not just a hypothetical, it's actually happening out there right now.
▶ No.917702
>>917641
I want to see something BESIDES unicode. All I see are a shit ton of ways to encode unicode.
▶ No.917703
>>917698
That wouldn't be a problem if there weren't multiple encodings floating around.
Solving that problem by never using unicode isn't actually more feasible than solving that problem by always using UTF-8.
▶ No.917704>>917709
>>917698
System level shit should just use raw bytes for everything instead of a particular encoding. If someone wants to name their file with a jpeg of a pepe, fuck it, let them.
▶ No.917709>>917716 >>917861 >>918660 >>918715
>>917704
Filenames are part of the user interface, and they should take that into account. End users will be confronted with them. Arbitrary bytes is a bad idea, case insensitive unicode with a lot of (control) characters blacklisted might be good.
Anything that requires the ability to use arbitrary bytes in file names is probably a really bad idea in the first place, and will still have to avoid using path separators and (usually) null bytes. Newlines alone make correct shell scripting a lot harder.
Forcing the entire world to use the latin alphabet in filenames might seem like a good idea if you're an edgy imageboard poster but it doesn't fly in real life.
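The newline point above is easy to demonstrate on a Unix filesystem (the filename is made up; Windows would reject it outright):
import os, tempfile

# Create a file whose name contains a newline, then look at what a naive
# line-oriented consumer of `ls`-style output would see.
d = tempfile.mkdtemp()
open(os.path.join(d, 'evil\nname.txt'), 'w').close()
listing = '\n'.join(os.listdir(d))
print(listing.splitlines())   # ['evil', 'name.txt'] -- one file looks like two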
▶ No.917716>>917719
>>917709
> but it doesn't fly in real life.
You want to know about real life? It's Windows 7, Photoshop, Ubuntu Server, and JavaScript.
▶ No.917719>>917727
>>917716
What's your point? I honestly don't understand what you're trying to say.
For what it's worth, I think Windows's filename convention is closest to what I described.
▶ No.917727>>917773
>>917719
Please go back to the "real world" of macs, windows, iphones, and other shit designed to restrict users.
▶ No.917732
SYSTEMD encoding.
Decoding requires the file to be piped through SYSTEMD-chard
Binary only files encrypted and compressed at creation time and linked to hardware addresses
▶ No.917751
OK here's my idea.
Null terminated characters.
▶ No.917773
>>917727
I think unicode is good in general.
"In general" includes things that are designed to restrict you. It also includes things that are designed to give you freedom.
Freedom is orthogonal to unicode support.
▶ No.917781>>917807 >>917889
>>917388 (OP)
Each character is a 64-bit number. That number stores the black and white pixels of an 8x8 pixel square. Each character code is equivalent to the 64-bit number that prints a pattern that looks like the character. An optional pretty font layer is added on top for vector rendering/whatever.
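A toy Python sketch of that scheme; the GLYPH_A value is made up and not part of any real standard:
GLYPH_A = 0x182442427E424242   # hypothetical bit pattern roughly shaped like 'A'

def render(glyph):
    # Decode a 64-bit glyph: the top byte is the top row, MSB is the left pixel.
    rows = []
    for row in range(8):
        bits = (glyph >> (8 * (7 - row))) & 0xFF
        rows.append(''.join('#' if bits & (1 << (7 - col)) else '.' for col in range(8)))
    return '\n'.join(rows)

print(render(GLYPH_A))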
▶ No.917807
▶ No.917825
All you need is TTS+opus.
▶ No.917861>>917875 >>917915
>>917709
Well it did work like that IRL for quite a long time, since that's how Unix did things for decades. In fact people even tended to shy away from using whitespace or shell metacharacters in filenames (but the system didn't enforce that).
▶ No.917875>>918721
>>917861
Unix went from "anything other than '/'" in the very early days to "anything other than '/' or '\0'" (probably once it was rewritten in C).
That's pretty bad for what's supposed to be a human-readable identifier. It's typical for Unix. Keeping it that simple made sense fifty years ago at Bell Labs, but not today with kernels that are already huge and complex.
▶ No.917888
>>917447
>No one handles normalization correctly and only a few people even understand what the problem is.
This is the worst fucking thing. I have had to handle character conversion several times, in several languages, and I could not tell you anything about it. It always devolves into a copy-paste job followed by trying things until it seems to work. I understand many things but the verbiage and everything else around character encoding is bonkers to me.
>a character is represented not as a char but as this other thing which is guaranteed to fit inside a char except when it doesn't and if you go from this language make sure you do ...
fucking stop
☃☃☃☃
and then people keep fucking adding shit
▶ No.917889
>>917781
literally this would be less aids than unicode. and it would still be highly compressible. running a simple compression algo is much less bad than having a unicode-supporting behemoth
▶ No.917915>>917946
>>917861
>Well it did work like that IRL
If you allow arbitrary data like Unix you also allow unicode, not just the latin alphabet.
▶ No.917930
>>917498
>and this is a problem why?
It creates security issues. Again, I don't expect any of you to understand this. When you need two systems to agree on something like the permissions of a path, the question becomes how to get them to agree on the strings being equivalent. There is only one way to go due to there being information loss in the translation, towards fully normalized string comparisons, but you can't fully normalize in a future-proof way since it requires tables that will be changed over time by the committee. One side of that normalization is some day going to be a newer version that understands newer characters and it's going to open a hole.
Linux developers encountering this stupid clusterfuck raged about it and then decided to just say fuck it and do raw, unnormalized memory comparisons of UTF-8 strings since any other option was deemed unworkable. Other operating systems decided to half-ass something and partially handle normalization (which will likely require pinning compatibility to some old version of unicode since I doubt they can ever add new normalized forms without destroying existing filesystems and servers). This causes a lot of pain for projects that have to talk between the two. SMB shares on Linux and Mac OS have a lot of issues because of this as each OS handles this differently and there are infinite bugs in the middle since the future will add them even if they aren't there today.
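A toy version of the "two systems disagree about whether paths are equal" problem described above; the filename is made up:
import unicodedata

path_from_client = 'caf\u00e9.txt'                               # precomposed e-acute (NFC)
path_on_server = unicodedata.normalize('NFD', path_from_client)  # 'e' + combining accent
print(path_from_client == path_on_server)    # False -- byte-for-byte different
print(path_from_client.encode('utf-8'), path_on_server.encode('utf-8'))
# Whether these name "the same file" depends on which side normalizes, and when.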
▶ No.917936>>918684
I don't like UTF-8 but in practice I don't believe it can be improved. The ASCII-compatibility is somewhat of a cancer (for the same reason UTF-16 is shit), but without it most software wouldn't support international text at all. Most developers are too stupid to deal with international text, so they had to be tricked.
▶ No.917946
>>917915
People didn't allow that, even though Unix did. They followed naming conventions, even to the point of avoiding whitespace and using underscore characters instead.
I think that self-control started to vanish when Win95 long filenames got popular and began to show up everywhere.
▶ No.918660>>918665 >>918688 >>919143
>>917390
>>917453
>>917501
>>917570
>>917607
NO
>>917439
>>917615
Unicode is half-shit. They fucked up with CJKV.
t. Chinese person who wants this BS to end
>>917447
Is there a multilingual standard that is less fucked?
Big Five + HKSCS and Shift-JIS are good enough (GB is for Niggers)
>>917627
Agreed, diacritics and right-to-left are cancer and are reserved for kikes, poos and sandniggers.
>>917698
>>917709
English-only for system-programming level, multi-lingual for document level
▶ No.918665
>>918660
THIS. Reordering everything to TRON would make the world less unbearable https://en.wikipedia.org/wiki/TRON_(encoding)
▶ No.918670
▶ No.918684
>>917936
>ASCII-compatibility is somewhat of a cancer
<7-bit characters define ASCII
<All 7-bit characters of ASCII are defined within 1 8-bit byte in UTF-8.
What are you talking about? UTF-8 is actually pretty easy to handle and implement.
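A quick illustration of that point (the sample characters are chosen arbitrarily):
print('A'.encode('utf-8'))        # b'A' -- identical to the ASCII byte 0x41
print('\u00e9'.encode('utf-8'))   # two bytes, both with the high bit set
print('\u732b'.encode('utf-8'))   # three bytes; ASCII byte values never appear inside them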
▶ No.918688>>918689
>>918660
>right-to-left are cancer and are reserved for kikes, poos and sandniggers.
Get a load of this chink, all smug thinking he is more civilized for switching to left to right in the last hundred(or so) years.
>diacritics
It's a good thing your main input method is full of them then.
▶ No.918689>>918697
>>918688
>thinking he is more civilized for switching to left to right in the last hundred(or so) years
Right-to-left is Mudslime or (((Jewish))), and we are not going there
>Diacritics is okay
What are you, Indian? Vietnamese?
▶ No.918697>>918712
>>918689
>What are you, Indian? Vietnamese?
Huh? I was saying that pinyin is full of diacritics.
Are you even Chinese? Because I'm starting to doubt it.
▶ No.918711
>>917502
you draw it and let the compiler parse the .bmp
▶ No.918712>>918963 >>919139
▶ No.918721>>919414
>>917875
Then what's the alternative? Keeping a copy of ICU in the kernel? That way at least you can use any language in filenames and the kernel doesn't need to care.
▶ No.918913>>918957
>>917388 (OP)
>If we had another chance at a character encoding standard, what would you propose?
UTF-8 but without emoji, beep and other shit that exists only to cause trouble.
▶ No.918957>>918968 >>919313
>>918913
That's not part of what UTF-8 does. UTF-8 is a way to represent unicode code points, the meaning of those code points is defined elsewhere.
▶ No.918963>>919139
>>918712
Do you know of any Chinese imageboards? I feel like the general autism that posting to imageboards commands, combined with the cultural differences, would be rather amusing.
▶ No.918968
>>918957
We need an alternative to unicode not utf8
▶ No.919139
>>918712
In Cantonese: "The fuck you using diacritics for?"
>>918963
No, not really (BBS is rare as well)
https://www.hkgolden.com/
http://www.discuss.com.hk/
▶ No.919143>>919276 >>919322 >>922476 >>923416
>>918660
turn cjkv into something sensible.
make all the han characters into a series of composed radicals like it should be.
turn hangul blocks into hangul letters and have a combining mark or something better
also a mark that turns a han character into its corresponding emoji
▶ No.919276>>919322
>>919143
>Combining marks and Composed radicals
The bytes will balloon in that case. At least 5~7 bytes per character on average.
▶ No.919313>>919318
>>918957
I was talking about both.
UTF-8 is a way to represent unicode code points, and it's not used for representing anything else. Of course if the Unicode allocation table is revisited, UTF-8 would also decode differently.
And I also insist that all variants of UTF-16 must vanish, too. They are cancer and the worst of both ends (UTF-8 and fixed-width UTF-32): byte-order dependent and variable length => maximum complexity, and for no real gain (only pain).
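A small sketch of both UTF-16 pain points (variable length via surrogate pairs, plus byte-order dependence); the sample string is arbitrary:
s = 'a\U0001F41F'                        # 'a' plus a code point above U+FFFF
print(len(s))                            # 2 code points
print(len(s.encode('utf-16-le')) // 2)   # 3 UTF-16 code units -- a surrogate pair is needed
print(s.encode('utf-16-le'))             # same text...
print(s.encode('utf-16-be'))             # ...different bytes, depending on byte order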
In hindsight my proposal could be phrased a little more cleanly:
1: The one and only character-to-number mapping should be the modified Unicode, which is as follows:
1.1: All emoji and all characters that are only used for purposes other than displaying themselves as 2D monochrome glyphs or modifying adjacent characters' appearance while keeping them as 2D monochrome glyphs (that is, combining characters) should be discarded, and their code points shall be considered "unassigned" (these numbers are unused).
1.2: All other (valid) characters shall keep their code points as is, so that good software won't need to be altered to stay correct.
2: The one and only variable length encoding method of the one and only character-to-number mapping shall be UTF-8 (modified if possible to strip any excess complexities if they become obsolete after p.1)
3: Trying to press people to add any nonsensical characters to the new Unicode (that is, anything other than commonly used typographic symbols and real world languages) shall be a criminal offense.
▶ No.919318>>919357
>>919313
The only actionable suggestion here is to drop support for characters you don't like.
Encodings other than UTF-8 are already almost exclusively used for legacy reasons or not exposed.
All other parts of your suggestion are either noops or inane.
▶ No.919322>>922476 >>923045
>>919143
Don't make it "absolute radical" because we can cut everything down to less than 2500 codepoints and ~3 character average when encoding rebuses. "Absolute radical" would go like >>919276
▶ No.919325>>919339
▶ No.919339>>919404
>>919325
Python fucked up.
▶ No.919354
▶ No.919357
▶ No.919404>>919410 >>921258
>>919339
Did it?
It only uses UTF-32 for strings that contain four-byte code points. Strings that are purely ASCII (or even LATIN-1) take up one byte per character.
>>> import sys
>>> size = lambda s: sys.getsizeof(s * 10_000) // 10_000
>>> size('a')
1
>>> size('ß')
1
>>> size('ij')
2
>>> size('\N{fish}')
4
Unfortunately, even a single character is enough to blow up the size:
>>> sys.getsizeof('a' * 9_999 + '\N{fish}') // 10_000
4
But that seems hard to avoid if you want O(1) indexing, which is very desirable.
▶ No.919410>>919416 >>919426 >>919437 >>921391
>>919404
>But that seems hard to avoid if you want O(1) indexing, which is very desirable.
Daily reminder that big O does not correlate with reality in many situations. Non-constant-time access that stays in cache is much faster than constant-time access that misses it. UTF-32 is 4x bigger than UTF-8 in most situations. That's 4 GB vs 1 GB. Your constant-time access will be slower under most access patterns.
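Rough numbers behind that size claim; the sample text is made up and mostly ASCII:
text = 'mostly ASCII with a few accents: \u00fc\u00e9 ' * 1000
print(len(text.encode('utf-8')))      # close to one byte per character
print(len(text.encode('utf-32-le')))  # exactly four bytes per character, ~4x larger here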
▶ No.919414
>>918721
The kernel 'solved' the problem by saying that you can use UTF-8 but it will be treated like a binary string instead of unicode. It's a giant fuck you to the committee, but breaking compatibility is the correct engineering decision when given a broken standard. Microsoft even agreed unicode was fucked and uses UTF-16 as a binary string similar to what Linux does, however they do have mystery proprietary normalizations across their software like in SQL Server but maybe those are just Microsoft being Microsoft. Apple is all over the place with older versions of their OS doing the impossible task of normalization in kernel, then they started doing normalization at the application level instead, and I have no idea what the status of that mess is today but I'm glad I don't have to deal with it.
There's nothing worse than a flawed standard.
▶ No.919416>>919434
>>919410
It makes more sense for python to have O(1) indexing as python programmers don't understand algorithms; that better protects them from creating poorly performing code if they handle strings in insane ways.
▶ No.919426>>919430 >>919437 >>919492 >>921180 >>921392
>>919410
>Non constant time access to cache is much faster than constant time outside of cache.
You're an absolute retard. O(N), when N is 10,000, will be slower than O(1), regardless of the cache. Now kill yourself.
▶ No.919430>>919437
>>919426
What about O(2*N) when N is 10000?
▶ No.919432
▶ No.919434>>919437
>>919416
>python programmers don't understand algorithms
this is factually incorrect, and I will also kick you in the nuts for this if you dare to say it in my face.
▶ No.919437>>921178 >>922583
>>919410
Most of your strings will just be ASCII or LATIN-1 and take up a single byte per character.
If you do have a 1 GB string then O(1) indexing is likely to be useful.
>>919426
That's not true, and not how complexity works. For any value of N, there exist an O(N) algorithm and an O(1) algorithm such that the O(N) algorithm is faster than the O(1) algorithm for that value of N (even in the worst case). However, for any such pair, there also exists a value M such that the O(1) algorithm is faster than the O(N) algorithm for all N > M.
>>919430
O(2*N) and O(N) are the same thing.
>>919434
Many Python programmers do understand algorithms (plenty of them don't), but there's a valid point buried in there, which is that Python programmers shouldn't have to worry about the underlying implementation too much. A lot of Python's design is about reducing mental load so programmers can focus on the program instead of the programming language.
There are cases in which memory is very expensive and it's reasonable to expect the programmer to know every implementation detail of the language and functions they're working with so they can micromanage efficiency, but Python doesn't aim to cover those cases. For Python it's more sensible to build up expectations like "indexing is cheap" (O(1) for lists and strings, amortized O(1) for dicts), so
value = obj[ind]
if value is None:
    return default
return value
and
if obj[ind] is None:
    return default
return obj[ind]
are about equally performant for non-weird types of obj.
▶ No.919492
>>919426
>Now kill yourself
...sez the tripfag. Oh the irony.
▶ No.921178>>921258
>>919437
>Most of your strings will just be ASCII or LATIN-1 and take up a single byte per character. If you do have a 1 GB string then O(1) indexing is likely to be useful.
You totally misread the post or simply don't understand how it works. That is absolutely not the case. UTF32 and UTF8 DO NOT have the same properties.
▶ No.921180
>>919426
>He does not understand how complexity actually works
▶ No.921258
>>921178
I think you don't understand how CPython's strings work.
It picks an encoding based on the content (this is ok because they're immutable). If your string's characters fit in LATIN-1, it uses LATIN-1. If they don't all fit in LATIN-1 it uses more bytes per character. Strings always use fixed-width encodings, but not all strings use the same encoding, so the memory use tends to be ok in most cases.
See >>919404
▶ No.921386>>925215 >>925216
>>917470
>You are probably just shitposting, but there are people out there who really seem to be under the impression that English is the only language that matters.
Ascii has letters with diacritical marks for all of the other languages that matter. We're not programming in Korean you fucking gooklover.
▶ No.921391>>921400 >>922580
>>919410
Look at this idiom:
for i in range(len(s)):
    if s[i] == 'a':
        n += 1
Notice two things: 1. If indexing is O(1), there are no cache misses. 2. if indexing is O(n), then this runs in O(n^2).
So, for a one gigabyte string, assuming 1ns to read and compare a byte, this goes from taking one second to run to taking half a billion seconds.
▶ No.921392
>>919426
>schemeposter
>stupid
stop larping anytime
▶ No.921400>>921436
>>921391
>caring about string indexing when you're analyzing an algorithm
That's not how Big O works, shut the fuck up
▶ No.921436
>>921400
If indexing the string takes i operations, then the whole process takes 0.5*len(s)*(len(s)+1) of such operations, which is O(n^2).
▶ No.922129
▶ No.922476
>>919143
>>919322
Can anyone confirm this?
▶ No.922500>>922551 >>922608
All english characters come first, then symbols, then nippon's moonrunes. Anything else goes into later parts of the encoding as to fuck with people I don't like, such as needing several times the storage that one english letter would need for one arabic or hebrew character.
▶ No.922551>>922608
>>922500
What about Gook runes and Taiwan runes?
▶ No.922580
>>921391
Nobody would write it that badly.
It's actually:
n = sum(1 for l in s if l == 'a')
▶ No.922583>>922594 >>922653
>>919437
>For Python it's more sensible to build up expectations like "indexing is cheap"
Horse shit, it's even more sensible to not expect any specific optimization if you don't actually need it. (And if you need it, you go and check the docs)
The typical example --- if you just need to go through all items, use a fucking iterator. It will always work in an optimal way unless the author of that goddamn library deliberately hates you.
▶ No.922594>>923392
>>922583
What's an iterator, anon? Is that a bloated way of saying map or fold?
▶ No.922608>>922620
>>922551
>>922500
You are both retarded if you haven't realized that ~90% are shared. The only thing that varied was that the gooks simplified some characters differently than the chinks (or that taiwan and places speaking cantonese didn't simplify at all).
▶ No.922620>>923045
>>922608
We should just get rid of all characters except for the Chinese ones. Everyone will be speaking Chinese soon anyways.
▶ No.922636>>922641
ASCII, fuck foreigners.
This is American property!
▶ No.922641
>>922636
But I need muh chinese cartoon games and I don't want xir to localize them.
▶ No.922653>>923392
>>922583
This particular expectation is actually ingrained in the language. If you define a __getitem__ and a __len__ then an __iter__ is implicit, unless you provide it yourself.
Making indexing expensive would make only certain strings more compact, at the cost of breaking the expectations of people coming from Python 2 (where strings are bytes), or people coming from C, while also bothering people who want to go through two strings character-by-character and don't know how to use zip yet.
It has a huge cost and little gain.
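A minimal sketch of that fallback; the Squares class is just an illustration:
class Squares:
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, i):
        # The old sequence protocol: integer indexes from 0, IndexError to stop.
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i * i

print(list(Squares(5)))   # [0, 1, 4, 9, 16] -- iterable with no __iter__ defined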
▶ No.923045
▶ No.923128
standards are for noobs who complain like brats
allow all and getgud
▶ No.923348
Unicode and UTF-8
>B-but muh emojis!
yeah, don't use them.
▶ No.923392>>923393
>>922594
iterator is anything that can be iterated.
>>922653
>breaking the expectations of people coming from Python 2 (where strings are bytes), or people coming from C, while also bothering people who want to go through two strings character-by-character and don't know how to use zip yet.
nothing wrong with that.
but I agree that the gain would be small anyway.
▶ No.923393>>923394
>>923392
>iterator is anything that can be iterated.
I mean anything that can be asked if it has another item, and can be asked to give that item to you.
▶ No.923394
>>923393
…and it's a special case of a more general abstraction, a zipper
▶ No.923403
>>917388 (OP)
It would obviously have to be done with inline Javascript and plenty of remote server calls.
▶ No.923416>>924503
>>919143
Hangul blocks are already composed out of individual Jamo 'letters' without needing any kind of combining mark since Hangul is formulaic enough.
Turning Han characters into radicals only would be really difficult. While there are a lot of similar radicals, there are many characters which have pretty unique strokes that would be hard to modularize. That also isn't even thinking about how the layout of the character would be specified, which I can only imagine would be a nightmare.
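For what it's worth, the Hangul composition really is algorithmic; Unicode normalization already composes conjoining Jamo into syllable blocks:
import unicodedata

jamo = '\u1112\u1161\u11ab'             # HIEUH + A + NIEUN as separate conjoining Jamo
syllable = unicodedata.normalize('NFC', jamo)
print(syllable, hex(ord(syllable)))     # the precomposed syllable U+D55C
print(len(jamo), len(syllable))         # 3 code points collapse to 1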
▶ No.924503
>>923416
Well, for Cantonese at least, if you do "semi-radical", i.e. semantic-phonetic blocks, you can trim down the character space by a long shot.
▶ No.924521
▶ No.924730
>>918715
I bet you hate significant whitespace too fag.
▶ No.925213
>>917388 (OP)
ASCII. 2nd choice is EBCDIC
▶ No.925215>>925219
>>917470
>>921386
Or you could just encode other languages in ASCII instead of tainting your character space with homoglyph malware and cancer.
▶ No.925216
>>921386
>Ascii has letters with diacritical marks for all of the other languages that matter.
No, it doesn't. Some extensions of ASCII do.
▶ No.925219>>925220 >>925222
>>925215
Can't for Chinese
▶ No.925220>>925222 >>925238 >>925291
>>925219
Fuck the Chinese, having a specific character for each word is counterintuitive to building a language that's compatible with modern complexities.
The japanese realized this and adapted to romanji/hiragana/katakana
▶ No.925222
>>925219
>>925220
This is exactly what base64 encoding is for. ASCII has 95 usable characters, reserve whatever 30 punctuation shit for controls and you've still got 3x unicode width.
▶ No.925238
>>925220
Come back when you know the difference between Japanese/Korean and Chinese languages. Tonal vs non-tonal
▶ No.925291
>>925220
>japanese
>romanji
You don't know anything about Japanese.
▶ No.925433
I saved a Japanese site with wget. The live version appeared in Japanese but the text was mojibake in the source. The downloaded version appeared in mojibake but was Japanese in the source. Adding Shift-JIS as the character set fixed it. Character encoding is a hell of a thing.
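That round trip is easy to reproduce in Python; the sample text below is made up, not the actual page:
data = '\u65e5\u672c\u8a9e'.encode('shift_jis')   # "Japanese" (nihongo) as Shift-JIS bytes
print(data.decode('latin-1'))     # mojibake when the charset is guessed wrong
print(data.decode('shift_jis'))   # the original text once the right charset is declared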