[rescue] UTF-8 [was T5220 update]
jp at celestrion.net
Thu Nov 2 09:14:11 CDT 2017
On Wed, 1 Nov 2017, Mouse wrote:
> Storage compactness is a completely spurious claim except for those
> using mostly-ASCII characters. UTF-8, as compared to a stream of
> 16-bit codepoints, does not save storage for anything except ASCII,
This was a design goal, and I don't think it's as bad as all that. Many
of the scripts using those wider characters have a greater information
density per glyph than the western scripts do. Cyrillic gets shafted, and
that's probably somewhere between coincidence and politics.
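To put rough numbers on that, here is a small Python sketch comparing encoded sizes. The sample words and the encoded_sizes helper are only illustrative, not measurements of real corpora. Note that Cyrillic costs two bytes per character either way, against one byte in a legacy 8-bit set such as KOI8-R.

```python
# Bytes needed for a few sample words in UTF-8 versus UTF-16 (no BOM).
# The samples are illustrative assumptions, not a representative corpus.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

for name, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{name}: {len(text)} chars, UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

The kana sample is the only one of the three that grows under UTF-8 (15 bytes against 10), and it arguably carries more per glyph than the Latin one.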
> In my opinion - and that's all it is, my opinion, and it's probably
> worth about what you paid for it - UTF-8 is an abomination. The
> benefits of each character being the same size in memory far outweighs,
> to me, the storage compaction UTF-8 provides for ASCII text (or, if you
> use 24- instead of 16-bit codepoints, the handful of writing systems
> outlined above).
My instinct is to agree, but for applications where that matters (nearly
any in-memory processing), there's UTF-32/UCS-4. You do the code-point
processing in bulk once instead of iteratively, and your in-core view of
the file has characters of the same size. UTF-8's compactness is intended
for transfer and storage primarily.
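A minimal sketch of that "decode once, then work on fixed-width values" idea, in Python. The to_code_points name is made up for illustration, and a real program would more likely just use the language's native string type.

```python
import struct

def to_code_points(utf8_bytes):
    """Decode UTF-8 once and return a list of fixed-width code points."""
    utf32 = utf8_bytes.decode("utf-8").encode("utf-32-le")
    count = len(utf32) // 4          # every code point is exactly 4 bytes
    return list(struct.unpack(f"<{count}I", utf32))

wire = "na\u00efve".encode("utf-8")  # 6 bytes in storage or on the wire
points = to_code_points(wire)        # 5 same-sized values in core
print(len(wire), len(points), hex(points[2]))
```

After the one-time conversion, indexing the third character is a plain array lookup rather than a scan from the start of the buffer.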
Further, a system that defaults to single-byte storage ends arguments
about byte order; expand the bytes into core however you see fit, but
serialization with other systems won't depend on byte-order marking.
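The byte-order point is easy to see with a single non-ASCII character. This is just an illustrative Python sketch (the encodings helper is a made-up name): the two UTF-16 serializations differ, while UTF-8 has only one.

```python
def encodings(ch):
    """Return the serialized bytes of ch under three encodings."""
    return {
        "utf-8": ch.encode("utf-8"),
        "utf-16-le": ch.encode("utf-16-le"),
        "utf-16-be": ch.encode("utf-16-be"),
    }

euro = encodings("\u20ac")  # EURO SIGN, U+20AC
for name, data in euro.items():
    print(name, data.hex())
```

A receiver of the UTF-16 bytes needs a BOM or outside knowledge to pick between the two orderings; a receiver of the UTF-8 bytes does not.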
>> For its faults, UTF-8 and Unicode are _FAR_ better than their
> Maybe they would have been if there were no installed base - though I
> still consider variable-sized (in storage) characters an abomination.
The Big Win for the notion of variable-width characters, if we're talking
about installed bases, is that UTF-8 software can correctly process all
7-bit ASCII text--including control codes. This is, by far, the single
largest set of legacy electronic textual data.
That facilitates support for wide characters being introduced into
software without a Flag Day when all characters need to be 24 or 32 bits.
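The compatibility claim can be checked directly. A short Python sketch, with variable names chosen only for illustration: every 7-bit value, control codes included, decodes identically as ASCII and as UTF-8, and no byte of a multi-byte sequence falls in the ASCII range.

```python
# Every 7-bit value, control codes included.
ascii_bytes = bytes(range(128))

as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")

# Same characters out, and the same bytes back in.
same_text = as_ascii == as_utf8
round_trip = as_utf8.encode("utf-8") == ascii_bytes

# Bytes of multi-byte sequences all have the high bit set, so they can
# never be mistaken for ASCII characters such as '/' or NUL.
multi = "\u00e9\u20ac".encode("utf-8")
high_bit_only = all(b >= 0x80 for b in multi)

print(same_text, round_trip, high_bit_only)
```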
>> Thompson and Pike were presenting talks on UTF-8 in the early-to-mid
> So? I can't see that as relevant, unless your stance is something
> like, UTF-8 is the best encoding of the best character set for all
> users and purposes, so it is reasonable to expect everyone/everything
> to support it as soon as it was introduced (modulo implementation
At 24 years on, that delay could involve conceiving the programmer who
would later implement UTF-8 support and sending him/her through
university. "My system is more than a year or two old," would be a
perfectly valid excuse if Unicode were a passing fad with niche
applicability and a majority of the planet well-serviced by ASCII.
> Perhaps that is your stance, in which case, I have the painful duty to
> break it to you that it's not so. There are lots of users and purposes
> for which Unicode, never mind UTF-8, is a wrong answer, even today.
> Many of them involve the sort of hardware and software this list
> focuses on, hence my remark.
The thing about the network is that something doesn't have to be the best
to be nigh-universal, which is how we got Unix to begin with. There will
probably never be a best-in-all-cases-ever incidence of any technology,
but there will usually be one that's pretty reasonable to support by
That used to be ASCII. These days, it really looks to be Unicode, for
better or worse. Looping all the way back to the start of this
divergence, if software needs ASCII, iconv is a much better input filter
than &= 127.
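To show the difference, a hedged Python sketch. Python's codecs stand in here for what an iconv-style conversion does, and both function names are made up for illustration.

```python
def mask_to_ascii(data):
    """The '&= 127' approach: clear the high bit of every byte."""
    return bytes(b & 0x7F for b in data)

def convert_to_ascii(data):
    """Decode as UTF-8, then encode to ASCII, substituting '?'."""
    return data.decode("utf-8").encode("ascii", errors="replace")

wire = "caf\u00e9".encode("utf-8")  # b'caf\xc3\xa9'
print(mask_to_ascii(wire))          # b'cafC)'
print(convert_to_ascii(wire))       # b'caf?'
```

Masking turns one accented letter into two unrelated ASCII characters with no sign that anything was lost; a real conversion at least yields one visible placeholder per character it could not represent.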
 Filesystem support for lz4 and similar compression schemes makes even
this a hard claim to defend today, but in 1993 the relative processing
overhead of compression was much higher.
 I very likely have a bias in my perception as to how valuable a
universal character set is due to most of my coworkers speaking
English (or any Western language) as a second or third language.