Filenames charset broken

Dinth · 2009-08-28 07:26:34

Hello. I have a website with thousands of downloadable files whose names contain national characters. I was recently forced to reinstall the server from an FTP backup that was made on a Windows machine. Unfortunately, after the reinstall the national characters in all my filenames are broken.
I tried to restore them with the 'convmv' utility, but it doesn't help.

convmv -r -f cp1250 -t utf-8 ./                                              
Starting a dry run without changes...                                                        
Skipping, already UTF-8: ./deleted/6/a/p/home/dinth/public_html/centrumseriali.pl/modules/tinymce/tinymce/jscripts/tiny_mce/plugins/fullpage/langs/home/wikipasy/en/images/0/00/Hymn_MaleĹ„czuk_2.mp3                 
Skipping, already UTF-8: ./deleted/6/a/p/home/dinth/public_html/centrumseriali.pl/modules/tinymce/tinymce/jscripts/tiny_mce/plugins/fullpage/langs/home/wikipasy/en/images/0/00/Oktawiusz_MarciĹ„czak.jpg             
Skipping, already UTF-8: ./deleted/6/a/p/home/dinth/public_html/centrumseriali.pl/modules/tinymce/tinymce/jscripts/tiny_mce/plugins/fullpage/langs/home/wikipasy/en/images/0/01/WĹ‚adysĹ‚aw_Ĺach.jpg

(...)

But these files aren't in proper UTF-8. My console font and system locale are both UTF-8. Manually renaming "WĹ‚adysĹ‚aw_Ĺach" to "Władysław_Lach" helps, but there are thousands of files, so I cannot do them all by hand.

I've also tried a UTF-8 to UTF-8 conversion, but then convmv doesn't see any files to convert.

convmv -r -f utf-8 -t utf-8 ./                                              
Starting a dry run without changes...                                                       
No changes to your files done. Use --notest to finally rename the files.

How can I repair the broken UTF-8 in the filenames?

atordo · 2009-08-28 07:50:04

If you just need support for one language you can stick with the old locale that worked for you:

* Look for your old locale in /etc/locale.gen and uncomment its line.
* Run locale-gen as root.
* Export the appropriate LANG variable in .bashrc or equivalent.
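
The three steps above could look roughly like the sketch below. The thread does not name the old locale, so `pl_PL ISO-8859-2` is only an assumed example, and `enable_locale` is a hypothetical helper, not part of any package. It relies on GNU sed's `-i` option.

```shell
#!/bin/sh
# enable_locale FILE NAME (hypothetical helper): uncomment the line in
# FILE that starts with "#NAME", the way entries appear in /etc/locale.gen.
enable_locale() {
    sed -i "s/^#\($2\)/\1/" "$1"
}

# Typical use, as root. pl_PL is only an assumed example locale:
#   enable_locale /etc/locale.gen 'pl_PL ISO-8859-2'
#   locale-gen
#   echo 'export LANG=pl_PL' >> ~/.bashrc
```

The helper takes the file as an argument so it can be tried on a copy of locale.gen before touching the real one.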

Note though that if you expect to receive filenames in several languages/scripts, UTF-8 is the way to go.

Dinth · 2009-08-28 07:53:46

The old filenames were in UTF-8, and I now have a working UTF-8 locale on the system. Maybe the problem is that these files were copied via FTP on a Windows system from the previous server to the new one?

atordo · 2009-08-28 13:43:47

Sorry, I didn't read the OP with due attention. If you're sure about the Windows charset you could try something like this:

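#!/bin/sh

# Re-encode every file path under /the/path from cp1250 to UTF-8.
# Paths are read line by line and quoted, so spaces in names are safe.
find /the/path -type f | while IFS= read -r a
do
        b=`printf '%s' "$a" | recode cp1250..utf-8`
        [ "$a" = "$b" ] || mv -- "$a" "$b"
done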

Last edited by atordo (2009-08-28 13:46:16)

Procyon · 2009-08-28 15:20:24

So the filename is UTF-8 encoded, but what it encodes is how the cp1250-encoded name would look if it were presumed to be UTF-8.

I have an SJIS-encoded filename; it would be something like this:

--> echo *LET\ IT*
´üä- LET IT OUT(Low quality).mp3
--> echo *LET\ IT* | iconv -f sjis -t utf8
福原美穂- LET IT OUT(Low quality).mp3
--> echo *LET\ IT* | hexdump -C           
00000000  95 9f 8c b4 94 fc 95 e4  2d 20 4c 45 54 20 49 54  |........- LET IT|
00000010  20 4f 55 54 28 4c 6f 77  20 71 75 61 6c 69 74 79  | OUT(Low quality|
00000020  29 2e 6d 70 33 0a                                 |).mp3.|
00000026
--> echo *LET\ IT* | iconv -f sjis -t utf8 | hexdump -C
00000000  e7 a6 8f e5 8e 9f e7 be  8e e7 a9 82 2d 20 4c 45  |............- LE|
00000010  54 20 49 54 20 4f 55 54  28 4c 6f 77 20 71 75 61  |T IT OUT(Low qua|
00000020  6c 69 74 79 29 2e 6d 70  33 0a                    |lity).mp3.|
0000002a
--> echo '´üä- LET IT OUT(Low quality).mp3' | hexdump -C
00000000  c2 b4 c3 bc c3 a4 2d 20  4c 45 54 20 49 54 20 4f  |......- LET IT O|
00000010  55 54 28 4c 6f 77 20 71  75 61 6c 69 74 79 29 2e  |UT(Low quality).|
00000020  6d 70 33 0a                                       |mp3.|
00000024

If you have a filename like the last one, I think you're out of luck. You can't encode or recode it; you have to decode it, which is ambiguous, and I don't think there are tools for that.

Depending on the alphabet of the language in question, you're better off converting everything character by character. Like the script above by atordo, but using
b=`echo "$a" | sed 's/Ĺ‚/ł/g;s/More_errors/Fix/g'`
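
A minimal sketch of that character-by-character approach, wrapped in a function so it can be tried on sample names before any file is renamed. `fix_polish` is a hypothetical name. Only two pairs are listed: the one from the sed line above, and `Ĺ„` to `ń`, which is inferred from "MaleĹ„czuk" in the convmv output. Any further pairs would have to be collected from the actual filenames.

```shell
#!/bin/sh
# fix_polish (hypothetical helper): replace known mojibake sequences on
# stdin with the intended letter. Extend the sed script with more pairs
# as they turn up in the real filenames.
fix_polish() {
    sed 's/Ĺ‚/ł/g;s/Ĺ„/ń/g'
}

# Try it on two names taken from the thread.
fix_polish <<'EOF'
MaleĹ„czuk
WĹ‚adysĹ‚aw
EOF
```

Inside a rename loop this would be used as b=`printf '%s' "$a" | fix_polish`.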

Dinth · 2009-08-29 07:42:50

Procyon wrote:

Depending on the alphabet of the language in question, you're better off converting everything character by character. Like the script above by atordo, but using
b=`echo "$a" | sed 's/Ĺ‚/ł/g;s/More_errors/Fix/g'`

Thanks for the reply. I will probably have to do it this way, but it isn't very convenient, because the files contain national letters from all European alphabets.

Procyon · 2009-08-29 10:54:05

But if you do
echo 'Władysław_Lach' | iconv -f utf8 -t cp1250
The result is not the expected WĹ‚adysĹ‚aw_Ĺach, so maybe it's a different situation.

Can you post the hex values with echo W*ach.jpg | hexdump -C (unquoted, so that the shell expands the glob)?

Maybe the values of Ĺ and ‚ together form ł, and you just need to interpret them as \x??\x?? => \u????
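
That guess can be checked against the code page: in cp1250, Ĺ is byte 0xC5 and ‚ is byte 0x82, and C5 82 is the UTF-8 encoding of ł. The sketch below rests on the assumption, not confirmed anywhere in the thread, that every broken name is UTF-8 bytes that were misread as cp1250 and then stored as UTF-8 again. Under that assumption, encoding the name back to cp1250 recovers the original bytes. `undo_cp1250` and `preview_fixes` are hypothetical helper names, and nothing here renames any file.

```shell
#!/bin/sh
# undo_cp1250: read a mojibake name on stdin, write the repaired name.
# Encoding the text back to cp1250 yields the original bytes, which are
# valid UTF-8 if the assumption above holds.
undo_cp1250() {
    iconv -f UTF-8 -t CP1250
}

# preview_fixes DIR (hypothetical helper): print "old -> new" for every
# regular file under DIR whose basename would change. Dry run only.
preview_fixes() {
    find "$1" -type f | while IFS= read -r path
    do
        dir=$(dirname "$path")
        old=$(basename "$path")
        # Skip names containing characters that cp1250 cannot represent.
        new=$(printf '%s' "$old" | undo_cp1250 2>/dev/null) || continue
        # Skip results that are not valid UTF-8: such names were most
        # likely correct to begin with and must not be touched.
        printf '%s' "$new" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || continue
        [ "$old" = "$new" ] || printf '%s -> %s/%s\n' "$path" "$dir" "$new"
    done
    return 0
}
```

Running `preview_fixes /the/path` only lists the proposed renames. Names that iconv cannot convert are skipped and would still need manual attention.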

thisoldman · 2009-08-29 15:09:22

This may help (or not). I just found the program 'enca' today. It's used to identify the character encoding of a text file. It seems to work best with the '-L' option. For example, "-L zh" specifies Chinese.

$ enca -L zh 中国姑娘.txt
Simplified Chinese National Standard; GB2312
  CRLF line terminators

Arch Linux
