2009-06-12

font problems

On Jun 12, 3:23 pm, ken wrote:
> On 06/12/2009 01:53 PM Xah Lee wrote:
>
> > On Jun 12, 7:54 am, ken wrote:
> >> B) It would be helpful if the code which does the decoding of a file and
> >> renders it into the buffer display, if that part of it would throw an
> >> error message when it encounters a character it doesn't know how to
> >> display, i.e., when a little box character is displayed. After all,
> >> isn't it an error when a little box is displayed in lieu of the correct
> >> character? Possible error messages would be something like: "decoding
> >> process can't find /path/to/charset.file" or "decoding process doesn't
> >> have requisite permission to read /path/to/charset.file" or "invalid
> >> character: [hex/decimal value]" or other.
>
> > some thought process in the above is not correct.
>
> Yet emacs puts a little box in the place of a character it cannot find
> (or, per your explanation) possibly confused about. The fact remains
> that the little box is not a correct rendering of the code. It is an
> error... at least it is for me, because that's not what I typed in. So
> it is an error. As an error, there should be a corresponding error
> message, hopefully one (or more) which would help diagnose the problem.
> It seems obvious that, given the long thread on this issue with no
> resolution, we could use some help-- like an error message-- which would
> help in diagnosis.
>
> Thanks for the information and the links though.

I think displaying an error for each character Emacs cannot find a font for is just not feasible. The app can't know whether it used the right encoding, and even if the encoding is correct, it can't do anything about fonts that are missing some of the characters in the character set.

I don't have experience in this, but imagine: an app gets a byte stream together with a declared charset/encoding. With that, it can decode the bytes into code points of the character set (e.g. UTF-8 and UTF-16 both use a variable number of bytes per character). After that's done, you get a sequence of code points (i.e. a sequence of integers). At this point, given an integer, you need to map it to a character in a font. There are many issues here... a font, I guess, is a set of glyphs, ultimately indexed by integers. I'm not sure what spec or standard specifies what each integer means (i.e. suppose your app now has an integer that represents B, and suppose your app is set to use the font Arial; Arial is a set of glyphs indexed by integers, but by what standard is one of those the glyph for B?). Part of this step is deciding what happens when Arial doesn't have that character. (I'm guessing a font also carries data about which character set it contains...)
But in any case, finally we'll have a B glyph from the font Arial. Then it goes through the whole display process...
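
The two steps above can be sketched as a toy model in plain Python. To be clear, this is not Emacs's actual code: the byte decoding is real, but the "font" here is a fake dict standing in for a real font's character-to-glyph table (in TrueType/OpenType fonts, the `cmap` table plays that role, mapping code points to glyph indices, with glyph 0, `.notdef`, usually drawn as the little box):

```python
# Step 1: bytes + declared encoding -> sequence of code points.
raw = "naïve".encode("utf-8")   # 6 bytes for 5 characters: UTF-8 is variable-length
code_points = [ord(ch) for ch in raw.decode("utf-8")]
print(code_points)              # [110, 97, 239, 118, 101]

# Step 2: code point -> glyph. A real renderer consults the font's
# character-to-glyph map; here a dict fakes it. When the font has no
# glyph for a code point, the fallback is .notdef -- the little box.
fake_font = {ord("n"): "glyph_n", ord("a"): "glyph_a",
             ord("v"): "glyph_v", ord("e"): "glyph_e"}   # no glyph for "ï"
NOTDEF = "□"

glyphs = [fake_font.get(cp, NOTDEF) for cp in code_points]
print(glyphs)   # ['glyph_n', 'glyph_a', '□', 'glyph_v', 'glyph_e']
```

Note that the box shows up without any error being raised anywhere: each step "succeeded", there was just no glyph at the end.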

Overall, I think the technologies we have today that actually display fonts and Unicode text are extremely complex, not to mention vector-based fonts, anti-aliasing, font substitution, and other such techniques.

Some interesting reading here:

http://en.wikipedia.org/wiki/Computer_font
http://en.wikipedia.org/wiki/Anti-aliasing
http://en.wikipedia.org/wiki/Font_rasterization
http://en.wikipedia.org/wiki/Subpixel_rendering
http://en.wikipedia.org/wiki/Font-substitution

Most modern apps, like browsers, I think, just call the OS's APIs to handle all this. A glimpse at the emacs-devel list seems to suggest that Emacs implements its own display system... On one hand that's bad, because Emacs misses out on two decades of techniques developed by Apple, Adobe, Microsoft, and open-source projects; on the other hand it's admirable that it does it all on its own...

Sorry, I'm rambling a bit. You are right that the bottom line is that some things just render as little squares, and that is a problem. But my point was that it is infeasible to issue an error for missing fonts or misinterpreted encodings. Partly because, theoretically, there's no way to know that the chosen encoding is correct; partly because, in practice, missing fonts and wrongly chosen encodings are very common. If we all stick with ASCII, everything is pretty good. If we stick to Western languages, things are still not too bad. But once you have Chinese, Japanese, or Korean text, the occasional math symbols and Greek letters, or Cyrillic or Arabic alphabets... the chances of a missing font or missing encoding info are very high.
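
To illustrate why there's no way to know the chosen encoding is correct: many byte sequences decode without any error under several encodings, just to different characters, so there is no failure to report. A quick demonstration in plain Python:

```python
# The same two bytes decode "successfully" under two different encodings,
# producing different text. Neither raises UnicodeDecodeError, so a
# program has no mechanical way to tell which reading was intended.
data = b"\xc3\xa9"

print(data.decode("utf-8"))     # é  (one character, U+00E9)
print(data.decode("latin-1"))   # Ã© (two characters)
```

This is exactly the mojibake you see when a UTF-8 file is opened as Latin-1: every byte is "valid", so no error fires, and the user just sees wrong characters.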

I think a large part of the problem is that charset and encoding info is not part of the file. Things have gotten better in the past decade with MIME types and the Unicode standard, but given a byte stream, even after you are lucky enough to know it is text, there's still little way to know how to interpret it. Charset and encoding metadata often get lost, implementations are often not robust, fonts covering multiple languages usually aren't installed, and font-substitution technology has only just started. (According to Wikipedia, IE before version 7 doesn't even do font substitution, which means you really need such a beast as a “unicode font”, i.e. a single font containing tens or hundreds of thousands of glyphs.)
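
Because the encoding metadata is gone, a program that guesses wrong has only two options: die loudly, or paper over the damage. In Python terms (as a stand-in for the choice any text-handling app must make):

```python
# A Latin-1 file read as UTF-8: the note saying "this is latin-1"
# was never stored in the file, so the reader has to guess.
data = "café".encode("latin-1")      # b'caf\xe9'

try:
    data.decode("utf-8")             # strict mode: the wrong guess fails loudly
except UnicodeDecodeError as e:
    print("decode error:", e.reason)

# Or fail quietly: replace undecodable bytes with U+FFFD (�),
# the Unicode cousin of the little box.
print(data.decode("utf-8", errors="replace"))   # caf�
```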

I think all these issues only started to be addressed in the past decade, with globalization driven partly by the internet. Before, English speakers just stuck with ASCII, and that was pretty sufficient. Each Western language region stuck with its particular encoding for the few special characters in its alphabet. Only when things started to mix did it get more complex, and now we have Chinese, Japanese, etc. as well. With Unicode, the use of math symbols has also become more common. Before that, it was just ASCII markup...

Speaking of which: Emacs and FSF docs still stick with the 1980s `quote hack', and ASCII arrows like -> and => ... which is extremely stupid. Of course I've filed polite bug reports, and have argued here too heatedly, but it has basically fallen on deaf ears. Some things are just impossible to change in the FSF world.

Xah
∑ http://xahlee.org/
