2009-06-12

unicode, encoding, font, foreign chars

On Jun 12, 7:54 am, ken wrote:
> B) It would be helpful if the code which does the decoding of a file and
> renders it into the buffer display, if that part of it would throw an
> error message when it encounters a character it doesn't know how to
> display, i.e., when a little box character is displayed. After all,
> isn't it an error when a little box is displayed in lieu of the correct
> character? Possible error messages would be something like: "decoding
> process can't find /path/to/charset.file" or "decoding process doesn't
> have requisite permission to read /path/to/charset.file" or "invalid
> character: [hex/decimal value]" or other.

some of the reasoning in the above is not correct.

In general, a program just reads a text file as a byte stream and uses an encoding scheme to interpret it. The program has little way to determine whether the encoding is correct. Theoretically, it could check the result against common phrases, but that is generally not done by the software we use daily. (some programs do scan the text to guess an encoding, but the guess is not always correct)

here are some general technical issues and experiences with using foreign chars:
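To make this concrete, here is a small python sketch (an illustration only, assuming python 3 and its standard codec names; the sample text is made up). The same bytes are decoded under two encoding assumptions, and the wrong one raises no error at all:

```python
# The same bytes, interpreted under two different encoding assumptions.
text = "中文 and English"
raw = text.encode("utf-8")          # the byte stream as it would sit in a file

as_utf8 = raw.decode("utf-8")       # correct assumption
as_latin1 = raw.decode("latin-1")   # wrong assumption, but it still "succeeds"

# latin-1 assigns a character to every possible byte value, so decoding
# never fails and the program gets no signal that the result is gibberish.
print(ascii(as_utf8))
print(ascii(as_latin1))
```

The latin-1 result has one char per byte, so each chinese char turns into three unrelated chars. Nothing in the decoding step knows that this is wrong.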

• the software needs to know which encoding & char set are used in order to interpret the binary stream. If you don't set it explicitly, it typically assumes ascii or some iso latin char set. (for software in the USA anyway)

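• today's software generally doesn't contain any extra heuristics to check whether the encoding used is actually correct. There is no technical way to check that in general. A decoder can sometimes tell that an encoding is wrong (e.g. a byte sequence that is invalid in utf-8), but it cannot prove that one is right. Anything beyond that can only be heuristics, i.e. guesses. e.g. browsers will often guess when reading a page that doesn't have encoding info.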

• even when the encoding is correct, the software needs all the proper fonts to display the text. Or, it relies on some font-replacement technology, e.g. when it finds a char which the current font doesn't have, it uses another font for that char. (in the case of Chinese, this often results in ugly text of mixed char styles: some appear thin, some thick, some square (like sans-serif), some calligraphic, some bit-mapped) Windows and OS X both have font-replacement technology, as do all the major browsers on both OS X and Windows. This font-replacement technology, however, is not perfect. So, sometimes you'll see squares or question marks here or there, especially for chars that are not widely used (e.g. math symbols in unicode, double right arrow, tech symbols such as Apple's command key and option key, triple asterisk, etc.).

• when writing a file, the software needs to use an encoding to write it. Just like reading, if you haven't explicitly set it, it typically uses ascii or some iso latin char set, in most western-language countries.

• when you use software to open a text file with the wrong encoding info, the result is gibberish.
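As a rough illustration of what "guessing" means, here is a python sketch (again an illustration, assuming python 3; `guess_decode` and its candidate list are made-up names, not any real program's method). It tries candidate encodings in order and takes the first one that decodes without error:

```python
def guess_decode(raw, candidates=("utf-8", "gb18030", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    This is only a heuristic: decoding without error does not prove
    that the encoding is the one the author of the file used.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


# Right guess: these bytes are invalid as utf-8, so the next candidate is tried.
right = guess_decode("中文".encode("gb18030"))

# Wrong guess: the latin-1 bytes of the two chars "Ã©" happen to be
# valid utf-8, so the function reports utf-8 and returns a different text.
wrong = guess_decode("Ã©".encode("latin-1"))
```

The second call shows why an error message is not always possible: the bytes are perfectly valid in more than one encoding, and only a human (or more context) can tell which reading was meant.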

the above applies not just to emacs, but to all apps. Some of the commentary is based on my experiences with browsers, web pages, word processors, online forums, mailing lists, email apps, instant messaging chat apps, etc., on both mac and windows.

technically, the issues involved are char set, encoding, and font. (the concepts of char set and encoding are independent but are often mixed together in a spec, esp. earlier ones.)

i often use mixed chinese & english in a single file, on both mac os x and windows. It works well. On the mac, my emacs is version 22.x. On windows, it is emacs 23. My encoding in emacs is set to utf-8.

I've written a lot about these issues; the following docs might be helpful.

• Emacs and Unicode Tips
http://xahlee.org/emacs/emacs_n_unicode.html

• Unicode Characters Example
http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

• the Journey of a Foreign Character thru Internet
http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

• Converting a File's Encoding with Python
http://xahlee.org/perl-python/charset_encoding.html

• Character Sets and Encoding in HTML
http://xahlee.org/js/html_chars.html

• The Complexity And Tedium of Software Engineering (parts about unicode problem with unison and emacs)
http://xahlee.org/UnixResource_dir/writ/programer_frustration.html

• Mac and Windows File Conversion (parts about unicode filename issues)
http://xahlee.org/mswin/mac_windows_file_conv.html

• Windows Font and Unicode
http://xahlee.org/mswin/windows_font_unicode.html

the above articles contain dozens of links to Wikipedia in appropriate places. Wikipedia has massive info in digestible form about these issues; one can spend a month on the above foreign char issues ...

for some examples of mixed chinese & english text i work with, see:

• Chinese Core Simplified Chars
http://xahlee.org/lojban/simplified_chars.html

• Ethology, Ethnology, and Lyrics
http://xahlee.org/Periodic_dosage_dir/sanga_pemci/sanga_pemci.html

Xah
∑ http://xahlee.org/




On Jun 12, 9:48 am, "B. T. Raven" wrote:

> I wouldn't be surprised if the gaps and overlaps in the CJK ranges of
> glyphs weren't so complicated that many characters from the following
> encodings may not be included in utf-8, especially if they are not
> precomposed. Try some of these encodings to see if some of the empty
> boxes are resolved into characters:
>
> chinese-big5
> chinese-hz
> chinese-iso-7bit
> chinese-iso-8bit
> chinese-iso-8bit-with-esc
> cn-big5
> cn-gb
> cn-gb-2312
> iso-2022-cjk
> iso-2022-cn
> iso-2022-cn-ext

the char sets behind most chinese encodings are subsets of, or identical to, unicode's char set.

In particular, GB 18030, the current, most widely used chinese charset, actually covers all of unicode: it can encode every unicode code point.

see http://en.wikipedia.org/wiki/GB_18030

Note also that this means china's GB 18030 contains the entirety of the traditional chars in unicode too. (though i don't know how big5 relates to unicode)
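One way to check this kind of claim is with python's built-in codecs (a sketch, assuming python 3 and that its "gb18030" and "gb2312" codecs follow the respective standards; the sample strings and the helper name `can_encode` are made up for illustration):

```python
def can_encode(text, encoding):
    """True if every char of text is representable in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# simplified, traditional, symbols, and a char outside the basic plane
samples = ["简体", "繁體", "→ ⌘ ∑", "\U0001F600"]

for s in samples:
    # gb18030 is expected to round-trip every sample unchanged
    assert s.encode("gb18030").decode("gb18030") == s

# the older gb2312 char set has simplified chars but not traditional 體
print(can_encode("简体", "gb2312"))
print(can_encode("體", "gb2312"))
```

So text that shows up as empty boxes under utf-8 would not gain any chars by being re-encoded in one of the older chinese encodings. If the bytes decode correctly, the remaining problem is fonts.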

is the list you gave above from emacs? emacs's list always seems strange to me... i haven't really looked into it. maybe emacs's list really does encompass every encoding that has ever existed, but it could also be just screwed up like many open source things. For example, it invents its own names by mixing char set encoding with the concept of EOL convention.

btw, who actually coded the low-level char encoding support in emacs? especially unicode, since it came after the time when richard stallman was still doing the bulk of the emacs work. That person deserves admiration. lol.

Xah
∑ http://xahlee.org/

1 comment:

  1. Great.
    Interesting articles.
    I have a problem when generating a font with some features using standard encoding. The font (OT), especially its features, works well in some software applications on Windows (e.g. Adobe Photoshop/InDesign) but does not appear in others (e.g. Corel Draw). When I install it on a Mac, the features do nothing at all, and some glyphs do not even appear. This happened when I generated it in T1. Just now I tried to create a pro font with FL. The font looks good so far when tested in FL itself, but I hesitate to generate it before I am sure that it will work well.
    Just a little info: I'm an autodidact in this field. :-)) So where should I send a question? :-)) Several of my free fonts can be found on dafont.com.
    I have copied this article to read later. Thank you in advance.
    Best regards.
    Andi AW. Masry
