Hello. I have a website with thousands of downloadable files whose names contain national characters. Recently I was forced to reinstall the server from an FTP backup, which had been made on a Windows machine. Unfortunately, after the reinstall the national characters in all my filenames are broken.
I tried to restore them with the 'convmv' utility, but it doesn't help.
convmv -r -f cp1250 -t utf-8 ./
Starting a dry run without changes...
Skipping, already UTF-8: ./deleted/6/a/p/home/dinth/public_html/centrumseriali.pl/modules/tinymce/tinymce/jscripts/tiny_mce/plugins/fullpage/langs/home/wikipasy/en/images/0/00/Hymn_Maleńczuk_2.mp3
Skipping, already UTF-8: ./deleted/6/a/p/home/dinth/public_html/centrumseriali.pl/modules/tinymce/tinymce/jscripts/tiny_mce/plugins/fullpage/langs/home/wikipasy/en/images/0/00/Oktawiusz_Marcińczak.jpg
Skipping, already UTF-8: ./deleted/6/a/p/home/dinth/public_html/centrumseriali.pl/modules/tinymce/tinymce/jscripts/tiny_mce/plugins/fullpage/langs/home/wikipasy/en/images/0/01/WĹ‚adysĹ‚aw_Ĺach.jpg
(...)
But these files aren't in proper UTF-8. My console font and system locale are UTF-8. Manually renaming "WĹ‚adysĹ‚aw_Ĺach" to "Władysław_Lach" helps, but there are thousands of files, so I cannot do them all manually.
I've also tried a UTF-8 to UTF-8 conversion, but then convmv doesn't see any files to convert.
convmv -r -f utf-8 -t utf-8 ./
Starting a dry run without changes...
No changes to your files done. Use --notest to finally rename the files.
How can I repair the broken UTF-8 in filenames?
Offline
If you just need support for one language you can stick with the old locale that worked for you:
* Look for your old locale in /etc/locale.gen and uncomment its line.
* Run locale-gen as root.
* Export the appropriate LANG variable in .bashrc or equivalent.
Note though that if you expect to receive filenames in several languages/scripts, UTF-8 is the way to go.
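The three steps above could be scripted roughly as follows. This is only a sketch: enable_locale is a hypothetical helper name, it relies on GNU sed's -i option, and the 'pl_PL ISO-8859-2' entry is an assumed example, since the thread does not say which locale was used before.

```shell
#!/bin/sh
# enable_locale FILE ENTRY
# Uncomment the line "#ENTRY" in a locale.gen-style file (hypothetical helper).
# ENTRY is used as a basic regular expression, so keep it simple.
enable_locale() {
    sed -i "s/^#[[:space:]]*\($2\)[[:space:]]*\$/\1/" "$1"
}

# Assumed usage, as root (the entry name is only an illustration):
#   enable_locale /etc/locale.gen 'pl_PL ISO-8859-2'
#   locale-gen
# Then, in ~/.bashrc or equivalent, something like:
#   export LANG=pl_PL
```

Running `locale -a` afterwards lists the locales that were actually generated, which is the safest way to pick the value for LANG.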
Offline
The old filenames were in UTF-8, and now I have a working UTF-8 locale on the system. Maybe the problem is that these files were copied by FTP through a Windows system from the previous server to the new one?
Offline
Sorry, I didn't read the OP with due attention. If you're sure about the Windows charset, you could try something like this:
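#!/bin/sh
# Rename every file under /the/path, recoding its path from cp1250 to UTF-8.
# Reading line by line and quoting keeps names with spaces intact.
find /the/path -depth -type f | while IFS= read -r a
do
    b=$(printf '%s' "$a" | recode cp1250..utf-8)
    # Only rename when the recoded name actually differs.
    if [ "$a" != "$b" ]; then mv -- "$a" "$b"; fi
done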
Last edited by atordo (2009-08-28 13:46:16)
Offline
So the filename is a UTF-8 encoding of what the cp1250-encoded name looks like when it is presumed to be UTF-8.
I have an SJIS-encoded filename; it would look something like this:
--> echo *LET\ IT*
´üä- LET IT OUT(Low quality).mp3
--> echo *LET\ IT* | iconv -f sjis -t utf8
福原美穂- LET IT OUT(Low quality).mp3
--> echo *LET\ IT* | hexdump -C
00000000 95 9f 8c b4 94 fc 95 e4 2d 20 4c 45 54 20 49 54 |........- LET IT|
00000010 20 4f 55 54 28 4c 6f 77 20 71 75 61 6c 69 74 79 | OUT(Low quality|
00000020 29 2e 6d 70 33 0a |).mp3.|
00000026
--> echo *LET\ IT* | iconv -f sjis -t utf8 | hexdump -C
00000000 e7 a6 8f e5 8e 9f e7 be 8e e7 a9 82 2d 20 4c 45 |............- LE|
00000010 54 20 49 54 20 4f 55 54 28 4c 6f 77 20 71 75 61 |T IT OUT(Low qua|
00000020 6c 69 74 79 29 2e 6d 70 33 0a |lity).mp3.|
0000002a
--> echo '´üä- LET IT OUT(Low quality).mp3' | hexdump -C
00000000 c2 b4 c3 bc c3 a4 2d 20 4c 45 54 20 49 54 20 4f |......- LET IT O|
00000010 55 54 28 4c 6f 77 20 71 75 61 6c 69 74 79 29 2e |UT(Low quality).|
00000020 6d 70 33 0a |mp3.|
00000024
If you have a filename that is like the last one, I think you're out of luck. You can't encode/recode it, you have to decode it and that's ambiguous and I don't think there are tools for that.
Depending on the alphabet of the language in question, you're better off converting everything character by character. Like the script above by atordo, but using
b=`echo "$a" | sed 's/Ĺ‚/ł/g;s/More_errors/Fix/g'`
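To avoid typing each broken pair by hand, the sed rules could be generated from a list of the letters you expect. The sketch below assumes the damage really is "UTF-8 bytes read as cp1250" and that iconv knows CP1250; make_sed_rules is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# make_sed_rules LETTER...
# For each letter, work out how it looks after its UTF-8 bytes were misread
# as cp1250, and print a sed rule that maps the broken form back.
make_sed_rules() {
    for letter in "$@"; do
        # Letters containing a byte that cp1250 does not define are skipped.
        mangled=$(printf '%s' "$letter" | iconv -f CP1250 -t UTF-8 2>/dev/null) || continue
        printf 's/%s/%s/g\n' "$mangled" "$letter"
    done
}

# Assumed usage:
#   make_sed_rules ł ń ó ą ę > rules.sed
#   b=$(printf '%s\n' "$a" | sed -f rules.sed)
```

The rules are only as complete as the letter list, so it is worth printing the proposed new names first and renaming afterwards.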
Offline
Thanks for the reply. I will probably have to do it this way, but it isn't very convenient, because the files contain national letters from all the European alphabets.
Offline
But if you do
echo 'Władysław_Lach' | iconv -f utf8 -t cp1250
The result is not the expected WĹ‚adysĹ‚aw_Ĺach, so maybe it's a different situation.
Can you post the hex values with echo W*ach.jpg | hexdump -C (unquoted, so that the shell expands the wildcard)?
Maybe the values of Ĺ and ‚ together form ł, and you just need to interpret them as \x??\x?? => \u????
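That guess can be checked against the code tables: in cp1250, Ĺ is byte 0xC5 and ‚ is byte 0x82, and C5 82 is the UTF-8 encoding of ł. If that is what happened, converting the broken name from UTF-8 to cp1250 gives back the original UTF-8 bytes. A minimal sketch under that assumption follows; fix_name and preview_fixes are hypothetical helper names, and the rename itself is left commented out.

```shell
#!/bin/sh
# fix_name NAME
# Undo one round of "UTF-8 bytes misread as cp1250, then stored as UTF-8".
# Prints the repaired name, or fails if the name does not fit that pattern.
fix_name() {
    # Encoding the broken text as cp1250 recovers the original bytes.
    fixed=$(printf '%s' "$1" | iconv -f UTF-8 -t CP1250 2>/dev/null) || return 1
    # Accept the result only if it is valid UTF-8 itself; a name that was
    # already correct (e.g. containing a real ń) fails here and is left alone.
    printf '%s' "$fixed" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1 || return 1
    printf '%s\n' "$fixed"
}

# preview_fixes DIR
# Dry run: print "old -> new" for every entry whose name would change.
preview_fixes() {
    find "$1" -depth | while IFS= read -r path; do
        dir=$(dirname "$path")
        old=$(basename "$path")
        new=$(fix_name "$old") || continue
        if [ "$old" != "$new" ]; then
            printf '%s -> %s\n' "$path" "$dir/$new"
            # mv -- "$path" "$dir/$new"   # enable only after checking the preview
        fi
    done
}
```

Names in which a byte was lost along the way will be skipped rather than fixed. For instance Ł is C5 81 in UTF-8 and 0x81 has no assignment in cp1250, so a name like the "Ĺach" one above probably still needs manual attention.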
Offline
This may help (or not). I just found the program 'enca' today. It's used to identify the character encoding of a text file. It seems to work best with the '-L' option. For example, "-L zh" specifies Chinese.
$ enca -L zh 中国姑娘.txt
Simplified Chinese National Standard; GB2312
CRLF line terminators
Offline