
#1 2010-04-12 18:38:52

sokuban
Member
Registered: 2006-11-11
Posts: 412

Unicode Language Tagging

Well, I just found out that Unicode is supposed to have a language tagging feature that lets you tag languages in plain text!

Unfortunately, I can barely find any documentation on how this feature works, I'm not even sure whether it is properly implemented anywhere, and it seems the Unicode peeps don't want people to use it.

This feature, if I could get it working, would be a godsend for me. Basically, my problem is that I use Japanese, Traditional & Simplified Chinese, and Korean on my computer, often at the same time. This is a big problem for fonts, because "pan-CJK" fonts are quite rare. I only know of one (iYaHei), and I don't like it, but I reluctantly use it because it is the only font that lets me read all the characters I need without graphical glitches or visually jarring borrowings from other fonts. (I suppose you could argue that Arial Unicode MS, Bitstream Cyberbit et al. are also "pan-CJK", but they don't have enough CJK coverage for my uses, and they look even uglier than iYaHei.)

This isn't a problem when the text has some sort of markup language or rich text format that allows me to specify the font and the language, but it is a big problem when the text is stuck as plain text, mainly in my filesystem and music collection. If I could tag all of that text with a language tag, and then somehow set up the system so it uses a different font for a different language tag, I would be one very happy man. The "Unicode Standard Ver. 5.0" seems to suggest that this is possible:

page 166 wrote:

Language information can be useful in certain operations, such as spell-checking or hyphenating a mixed-language document. It is also useful in choosing the default font for a run of unstyled text; for example, the ellipsis character may have a very different appearance in Japanese fonts than in European fonts. Modern font and layout technologies produce different results based on language information. For example, the angle of the acute accent may be different for French and Polish.

However, they don't seem to like my idea for my exact use:

page 166-167 wrote:

A common misunderstanding about Unicode Han unification is the mistaken belief that Han characters cannot be rendered properly without language information. This idea might lead an implementer to conclude that language information must always be added to plain text using the tags. However, this implication is incorrect. The goal and methods of Han unification were to ensure that the text remained legible. Although font, size, width, and other format specifications need to be added to produce precisely the same appearance on the source and target machines, plain text remains legible in the absence of these specifications.

There should never be any confusion in Unicode, because the distinctions between the unified characters are all within the range of stylistic variations that exist in each country. No unification in Unicode should make it impossible for a reader to identify a character if it appears in a different font. Where precise font information is important, it is best conveyed in a rich text format.

Indeed, untagged CJK text can be readable, but only if you have a font like iYaHei that covers the entire CJK range. Take Kochi Gothic, for example, and you /won't be able to read/ Chinese text because of visual glitches. Even with a font that has no glitches, it is pretty ugly to have different fonts mixed in the same text because the first one doesn't support a character. And of course, they are forgetting that with untagged text you can only get the right font style for one of the four scripts. In my case that is Simplified Chinese because of iYaHei, but I personally would prefer Japanese if I had the choice. But no... I don't have that choice. (I suppose you could argue I could use Arial Unicode MS or Bitstream Cyberbit and have my wish. I used to do that, but the font is so ugly, with lines so squished as to be unreadable, that I actually prefer iYaHei.)

So, now that my rant is over: does anyone know how to use this feature, and whether it can be used at all?
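
For reference, here is a rough sketch of what the tagging looks like at the character level, as far as I understand it. The Tags block puts LANGUAGE TAG at U+E0001, followed by invisible "tag characters" in U+E0020..U+E007E that mirror printable ASCII, with CANCEL TAG at U+E007F. The helper names tag_text and read_tag are made up for illustration, and nothing here selects a font; whether any renderer actually honours the tag is exactly what I don't know.

```python
# Sketch of Unicode language tag characters (Tags block, U+E0000..U+E007F).
# tag_text and read_tag are hypothetical helper names, not part of any library.

LANGUAGE_TAG = "\U000E0001"   # U+E0001 LANGUAGE TAG
CANCEL_TAG = "\U000E007F"     # U+E007F CANCEL TAG
TAG_OFFSET = 0xE0000          # tag character = ASCII code point + 0xE0000


def tag_text(text, lang):
    """Prefix text with an invisible language tag such as 'ja' or 'zh-TW'."""
    if not all(0x20 <= ord(c) <= 0x7E for c in lang):
        raise ValueError("language tag must be printable ASCII")
    tag = LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(c)) for c in lang)
    return tag + text


def read_tag(text):
    """Return (lang, rest); lang is None if the text has no leading tag."""
    if not text.startswith(LANGUAGE_TAG):
        return None, text
    i = 1
    chars = []
    # Consume tag characters and map them back to ASCII.
    while i < len(text) and 0xE0020 <= ord(text[i]) <= 0xE007E:
        chars.append(chr(ord(text[i]) - TAG_OFFSET))
        i += 1
    return "".join(chars), text[i:]


tagged = tag_text("日本語", "ja")
lang, rest = read_tag(tagged)
print(lang, rest, len(tagged))
```

The tagged string is three code points longer than the original here (the LANGUAGE TAG plus one tag character per letter of "ja"), so anything that counts characters or compares file names byte for byte would see a different string.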

Last edited by sokuban (2010-04-12 18:45:02)
