I think I need some advice on the topic of encodings. My system right now is a mess: one old Windows partition (NTFS), one shared FAT32 partition, both with files from a long time ago in I don't know which encoding, filenames with and without non-ASCII characters, etc.
Then there are the Linux partitions with mostly new files created under my current locale, en_GB, although I tried out en_GB.utf8 for a while.
My files are partly English, partly Norwegian/Swedish, so I need to accommodate the special characters of those languages, but I have no need for Arabic, Korean, Japanese, etc.
The reason I would still like to switch to a fully Unicode system is partly to avoid the mess, but mostly that some of the programs I need and use on a daily basis require Unicode files (lilypond, etc.). I've had some programs complain about incorrect file name encoding (gwenview, I think, or perhaps gqview), and my file names with non-ASCII characters show up in several different ways and, it seems, differently in X apps, in consoles, in terminals, in vim, etc.
So, there you have my situation. Does anyone have experience - good or bad - with this kind of situation? Ideally, I suppose, I'd like all my files to be properly unicoded, both regarding file names and contents. But I assume this involves a potentially "dangerous" situation with a batch job with iconv?
What are the pitfalls I need to keep in mind before I do this? Are there other solutions I might consider? What about apps which don't support unicode - would I be better off staying in latin1 - perhaps converting everything to that instead of unicode? As I said, in terms of characters, I really don't need more than what latin1 can offer, but some apps require unicode for the file encoding.
Any help, feedback, pointers to good sites, etc. are appreciated.
Offline
apps not supporting unicode tend to be rare these days, as most apps are built on qt, gtk, python, java, mono, readline, etc... and thus support utf8 OOTB.
IIRC (some part of) zsh does not support unicode.
you should also make clear distinction on multiple things:
1. display/input utf8
2. utf8 filenames in filesystem(s)
3. utf8 file contents
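To make point 2 concrete: a filename is just a byte string on disk, stored independently of the file's contents, so it can be checked on its own. Below is a minimal Python sketch, not a tested tool; the function names `is_valid_utf8` and `non_utf8_names` are made up for illustration. It lists names under a directory that are not valid utf8, without renaming anything.

```python
import os


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (pure ASCII also passes)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def non_utf8_names(root: bytes):
    """Yield paths (as bytes) under root whose last component is not valid UTF-8.

    root is passed as bytes so that os.walk hands back the raw on-disk
    names instead of trying to decode them with the current locale.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            if not is_valid_utf8(name):
                yield os.path.join(dirpath, name)
```

Something like `for p in non_utf8_names(b"/home"): print(p)` would then give a list of candidates to look at by hand before any renaming.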
gentoo has good reference:
http://gentoo-wiki.com/HOWTO_Create_an_ … led_system
http://gentoo-wiki.com/HOWTO_Make_your_ … code/utf-8
as for my advice, I'd say go for utf8 all the way. when you're on the rails, it's just transparent, and if you happen to have some file/contents having some 'weird' character in it, you'll always be okay ('weird' being relative, you don't need to go as far as Arabic: that could simply be an accented character not present in your ISO-* set). why be restricted when you can just as well be open? you never know what your needs might be tomorrow: e.g. I'm french (accented chars), I have started learning japanese (obvious chars), and I may occasionally be in contact with at least spanish (tilde chars) and german (umlauts and ß) someday in my work. I wouldn't want to have to handle 5 charsets, or to convert to utf8 at that time, so I took things in hand right now, ready and set.
personally, the first thing I do when I install arch is generate en_US.UTF-8 with locale-gen and set the locale to it.
as for windows compatibility, NTFS stores filenames as UTF-16 rather than utf8, but the linux drivers can translate them to utf8 through a mount option. fat32 long filenames are likewise stored as UTF-16 and can be presented as utf8 (with vfat, the utf8 mount option; IIRC it is iocharset=utf8 that makes the fs case sensitive). regarding file contents, most not too old windows software usually handles utf8 correctly (at least I never had a problem when I still had windows). issues arise when files are stored in a non-iso windows encoding (cp1252 and the like), which I was never able to convert successfully, though that may just be my fault for not trying hard enough (the few non-utf8 files I had were not that important).
EDIT: some additions
Last edited by lloeki (2007-04-23 12:30:28)
To know recursion, you must first know recursion.
Offline
Thanks a lot - I'll read up and take a day off to effect the conversion... :-)
lloeki wrote:
apps not supporting unicode tend to be rare these days, as most apps are built on qt, gtk, python, java, mono, readline, etc... and thus support utf8 OOTB.
IIRC (some part of) zsh does not support unicode.
I hope that's not a dealbreaker for me - I'll have to check that out: I use zsh...
Offline
from the gentoo wiki:
Zsh handles UTF-8 perfectly since version 4.3.1. Older versions are not yet unicode aware. It still works as long as you don't use Backspace on unicode characters. (This deletes parts of the utf-8 character bytewise and confuses zle assumptions about the cursor position.)
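The backspace problem described there can be reproduced outside zsh. Here is a small Python sketch (the helper names are invented for illustration, and this is not how zle is implemented) comparing a byte-wise backspace with a character-wise one on a string ending in 'å', which takes two bytes in utf8:

```python
def byte_backspace(raw: bytes) -> bytes:
    """Drop the last byte, as a non-unicode-aware line editor would."""
    return raw[:-1]


def char_backspace(raw: bytes) -> bytes:
    """Drop the last character, however many bytes it occupies."""
    return raw.decode("utf-8")[:-1].encode("utf-8")


line = "blå".encode("utf-8")          # 3 characters, 4 bytes
broken = byte_backspace(line)         # leaves half of 'å' behind
clean = char_backspace(line)          # leaves b"bl"

try:
    broken.decode("utf-8")
except UnicodeDecodeError:
    print("byte-wise backspace left an invalid utf8 sequence:", broken)
print("character-wise backspace result:", clean)
```

The leftover lead byte is what then confuses the terminal and the editor about where the cursor is.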
Offline
Yes, I saw that. I was a little bit discouraged by a note at the zsh FAQ, but it apparently hasn't been updated since 4.2.3 was the latest version (http://zsh.dotsrc.org/FAQ/zshfaq02.html#l16)
I was wondering, about the gentoo wiki: how much of what is listed there has to be done also in Arch, and how much is taken care of by the general settings in rc.conf?
Also: if I use some script or iconv + options to convert a whole file system, what about files which are already in unicode - will they be converted as well, or is there a check so that I don't corrupt my files instead of converting them...?
Offline
AFAIK what is to be done in arch: nothing, except generate and set locale to en_US.UTF-8, and choose a unicode-aware font for vt. KISS in all its glory
and AFAIK, regarding conversion: iconv takes (multi)byte sequences, interprets them in the source encoding you give it and converts them to (multi)byte sequences in the target encoding. it cannot tell whether the input really was in that source encoding, so a file that is already utf8 and gets converted from latin1 again ends up mangled. so yes, you have to take special care and only apply the conversion to files that are not converted yet. be aware that utf8 and ascii are identical for their common part, so pure-ascii files need no conversion at all, and the whole filesystem certainly does not. only some files need it, most likely only in /home and only plain text files, and not even all of those. be sure to do some testing on a few selected files first, and make a backup just to be on the safe side.
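One way to get the "only convert what is not converted yet" behaviour is to test each file for valid utf8 first. The following Python sketch is one possible approach, not a replacement for testing and backups; the function names, the `.orig` backup suffix and the iso-8859-1 default are all assumptions chosen for illustration. It is a heuristic: a legacy file could in theory happen to be valid utf8, and binary files must be kept away from it.

```python
from pathlib import Path


def to_utf8(raw: bytes, legacy: str = "iso-8859-1") -> bytes:
    """Return raw unchanged if it is already valid UTF-8 (or plain ASCII),
    otherwise reinterpret it in the assumed legacy encoding and re-encode."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy).encode("utf-8")


def convert_file(path: Path, legacy: str = "iso-8859-1") -> bool:
    """Convert one text file in place; return True if it was changed.

    The original bytes are saved next to the file with a '.orig' suffix
    (a hypothetical naming choice) before anything is overwritten.
    """
    raw = path.read_bytes()
    converted = to_utf8(raw, legacy)
    if converted == raw:
        return False
    path.with_name(path.name + ".orig").write_bytes(raw)
    path.write_bytes(converted)
    return True
```

Running `convert_file` twice on the same file is harmless: the second pass sees valid utf8 and leaves it alone, which is exactly the check a blind iconv batch job lacks.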
Last edited by lloeki (2007-04-23 21:10:36)
Offline
Thanks a lot; I imagine it can't be that easy. I particularly fear all those combinations of apps whose config files I have tweaked and tested to make them work, such as the sequence
mrxvt - screen - mutt - vim/elinks/etc.
(I'll change mrxvt to urxvt, I suppose)
There's bound to be trouble there... isn't there...?
But I will go ahead in any case. Thanks again for your help and your patience!
Offline
I'm not sure if I should go UTF-8, right now everything is ISO-8859-1 on my computer.
The issue is IRC - in Sweden (and probably Denmark/Norway etc) everyone is using ISO-8859-1 (well Windows-1252), and they can't read some of my text properly when I use UTF-8. It doesn't stop me from using UTF-8 for the rest of my system, but I'm not completely sure about speed either.
http://sam.zoy.org/writings/utf8/
The other issue is that I'm not completely sure how to interpret UTF-8 in code. There is no "this is how you support UTF-8 properly" guide, or at least none that I've found. If I were writing an IRC bot, how would I know the charset of messages? Some people use UTF-8, some use ISO-8859-1, and both have to be supported. I must be stupid or something because people say it's very easy, but I honestly can't grasp it.
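For the IRC bot case, a common heuristic (not an official rule of the protocol, which does not declare a charset per message) is to try UTF-8 first and fall back to Windows-1252. It works because ISO-8859-1/Windows-1252 text containing å, ä or ö almost never forms a valid UTF-8 byte sequence. A minimal Python sketch, with `decode_irc_line` as a made-up name:

```python
def decode_irc_line(raw: bytes) -> str:
    """Decode one raw IRC message, guessing its charset.

    Strict UTF-8 is tried first; if the bytes are not valid UTF-8,
    they are assumed to be Windows-1252, with undecodable bytes
    replaced rather than raising an error.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp1252", errors="replace")


print(decode_irc_line("smörgåsbord".encode("utf-8")))
print(decode_irc_line("smörgåsbord".encode("cp1252")))
```

Both calls should print the same word, whichever encoding the sender used. Sending is the harder half, since the bot has to pick one encoding per channel.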
Running: Arch Linux i686, x86_64, ppc
Offline
that guy is a moron. stating that ISO satisfies his needs doesn't mean it satisfies everyone, and forever. people did not invent utf8 or ipv6 for the sake of it.
second: http://mail.nl.linux.org/linux-utf8/200 … 00041.html
cause is a stupid implementation in grep (see reply from markus http://mail.nl.linux.org/linux-utf8/200 … 00050.html )
one wicked implementation is no proof something is flawed.
Last edited by lloeki (2007-04-27 14:46:15)
Offline
I use utf-8 and have no problem, but then, I'm using gnome (yes, even gnome-terminal), that's basically built on top of utf-8.
Offline
The issue is IRC - in Sweden (and probably Denmark/Norway etc) everyone is using ISO-8859-1 (well Windows-1252), and they can't read some of my text properly when I use UTF-8. It doesn't stop me from using UTF-8 for the rest of my system, but I'm not completely sure about speed either.
If you are using irssi, check out /recode
When death smiles at you, all you can do is smile back!
Offline
... choose a unicode-aware font for vt. KISS in all its glory
Where can I find those? Can you recommend one? Thanks.
Offline