#1 2009-08-25 05:20:37

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

I'm testing with a file called "柯有伦-零.mp3", which contains Chinese characters. (Maybe your browser won't show it correctly, and the font might be crappy...)
Well, here's a screenshot of the correct filename:
[Screenshot: vMjdueQ]

My locale: en_US.utf8
Downloader I tested with: wget, aria2c
Target filesystem I tested with: ext4, ntfs

I find it strange the same filename has two forms in two urls:

%BF%C2%D3%D0%C2%D7-%C1%E3.mp3
%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3

I don't know why... Must have something to do with character set/encoding. Somebody explain this to me please.

---------------------------------------------experiment--with--wget--------------------------------------------------

wget "%BF%C2%D3%D0%C2%D7-%C1%E3.mp3" to ext4 partition:

$ wget 'http://down.jsharer.com/user/userAction.do?method=download&urlpath=ftp://-552790109:693110381@58.215.91.170:2022/22487/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3'
--2009-08-25 12:17:41--  http://down.jsharer.com/user/userAction.do?method=download&urlpath=ftp://-552790109:693110381@58.215.91.170:2022/22487/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3
Resolving down.jsharer.com... 222.73.163.168
Connecting to down.jsharer.com|222.73.163.168|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: ftp://-552790109:693110381@58.215.91.170:2022/22487/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3 [following]
--2009-08-25 12:17:41--  ftp://-552790109:*password*@58.215.91.170:2022/22487/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3
           => `¿ÂÓÐÂ×-Áã.mp3'
Connecting to 58.215.91.170:2022... connected.
Logging in as -552790109 ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /22487/200908 ... done.
==> SIZE \277\302\323\320\302\327-\301\343.mp3 ... 5211995
==> PASV ... done.    ==> RETR \277\302\323\320\302\327-\301\343.mp3 ... done.
Length: 5211995 (5.0M)

100%[=====================================================>] 5,211,995   89.1K/s   in 56s     

2009-08-25 12:18:37 (91.2 KB/s) - `¿ÂÓÐÂ×-Áã.mp3' saved [5211995]

wget "%BF%C2%D3%D0%C2%D7-%C1%E3.mp3" to ntfs partition:

$ wget 'http://down.jsharer.com/user/userAction.do?method=download&urlpath=ftp://1368144520:-1398555672@58.215.91.170:2022/41161/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3'
--2009-08-25 12:26:29--  http://down.jsharer.com/user/userAction.do?method=download&urlpath=ftp://1368144520:-1398555672@58.215.91.170:2022/41161/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3
Resolving down.jsharer.com... 222.73.163.168
Connecting to down.jsharer.com|222.73.163.168|:80... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: ftp://1368144520:-1398555672@58.215.91.170:2022/41161/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3 [following]
--2009-08-25 12:26:29--  ftp://1368144520:*password*@58.215.91.170:2022/41161/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3
           => `¿ÂÓÐÂ×-Áã.mp3'
Connecting to 58.215.91.170:2022... connected.
Logging in as 1368144520 ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD /41161/200908 ... done.
==> SIZE \277\302\323\320\302\327-\301\343.mp3 ... 5211995
==> PASV ... done.    ==> RETR \277\302\323\320\302\327-\301\343.mp3 ... done.
¿ÂÓÐÂ×-Áã.mp3: Invalid or incomplete multibyte or wide character

wget  "%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3" to ext4 partition:

$ wget 'http://mp3.tktt.com/eec38e543bc6c0e4/15/%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3'
--2009-08-25 12:29:59--  http://mp3.tktt.com/eec38e543bc6c0e4/15/%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3
Resolving mp3.tktt.com... 58.215.81.44
Connecting to mp3.tktt.com|58.215.81.44|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 126229 (123K) [audio/x-ms-wma]
Saving to: `æ%9F¯æ%9C%89伦-é%9B¶.mp3'

100%[=====================================================>] 126,229      129K/s   in 1.0s    

2009-08-25 12:30:03 (129 KB/s) - `æ%9F¯æ%9C%89伦-é%9B¶.mp3' saved [126229/126229]

wget "%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3" to ntfs partition:

$ wget 'http://mp3.tktt.com/eec38e543bc6c0e4/15/%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3'
--2009-08-25 12:37:30--  http://mp3.tktt.com/eec38e543bc6c0e4/15/%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3
Resolving mp3.tktt.com... 58.215.81.44
Connecting to mp3.tktt.com|58.215.81.44|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 126229 (123K) [audio/x-ms-wma]
æ%9F¯æ%9C%89伦-é%9B¶.mp3: Invalid or incomplete multibyte or wide character

Cannot write to `æ%9F¯æ%9C%89伦-é%9B¶.mp3' (Invalid or incomplete multibyte or wide character).

---------------------------------------------let's--try--with--aria2c--------------------------------------------------

aria2c "%BF%C2%D3%D0%C2%D7-%C1%E3.mp3" to ext4 partition:

$ aria2c 'http://down.jsharer.com/user/userAction.do?method=download&urlpath=ftp://1368144520:-1398555672@58.215.91.170:2022/41161/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3'

2009-08-25 12:47:52.079592 NOTICE - #1 - Download has already completed: /home/canti/Desktop/¿ÂÓÐÂ×-Áã.mp3

2009-08-25 12:47:52.080014 NOTICE - Download complete: /home/canti/Desktop/¿ÂÓÐÂ×-Áã.mp3

Download Results:
gid|stat|avg speed  |path/URI
===+====+===========+===========================================================
  1|  OK|        n/a|/home/canti/Desktop/¿ÂÓÐÂ×-Áã.mp3

Status Legend:
 (OK):download completed.

aria2c "%BF%C2%D3%D0%C2%D7-%C1%E3.mp3" to ntfs partition:

$ aria2c 'http://down.jsharer.com/user/userAction.do?method=download&urlpath=ftp://1368144520:-1398555672@58.215.91.170:2022/41161/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3'
[#1 SIZE:0B/0B CN:1 SPD:0Bs]                                                                   
2009-08-25 12:46:06.942979 ERROR - Exception caught
Exception: [RequestGroup.cc:528] Download aborted.
  -> [AbstractDiskWriter.cc:115] Failed to open the file /media/20G/¿ÂÓÐÂ×-Áã.mp3, cause: Invalid or incomplete multibyte or wide character

Download Results:
gid|stat|avg speed  |path/URI
===+====+===========+===========================================================
  1| ERR|        n/a|/media/20G/¿ÂÓÐÂ×-Áã.mp3

Status Legend:
(ERR):error occurred.

aria2 will resume download if the transfer is restarted.
If there are any errors, then see the log file. See '-l' option in help/man page for details.

aria2c "%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3" to ext4 partition:

$ aria2c 'http://mp3.tktt.com/eec38e543bc6c0e4/15/%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3'
[#1 SIZE:96.0KiB/123.2KiB(77%) CN:1 SPD:135.1KiBs]                                             
2009-08-25 12:30:17.753443 NOTICE - Download complete: /home/canti/Desktop/柯有伦-零.mp3

Download Results:
gid|stat|avg speed  |path/URI
===+====+===========+===========================================================
  1|  OK| 136.0KiB/s|/home/canti/Desktop/柯有伦-零.mp3

Status Legend:
 (OK):download completed.

aria2c "%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3" to ntfs partition:

$ aria2c 'http://mp3.tktt.com/eec38e543bc6c0e4/15/%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3'
[#1 SIZE:0B/123.2KiB(0%) CN:1 SPD:0Bs]                                                         
2009-08-25 12:42:14.735797 NOTICE - Download complete: /media/20G/柯有伦-零.mp3

Download Results:
gid|stat|avg speed  |path/URI
===+====+===========+===========================================================
  1|  OK|  20.0MiB/s|/media/20G/柯有伦-零.mp3

Status Legend:
 (OK):download completed.

---------------------------------------------------------------------------------------------------------------------------

So I'm asking:

1. Why is '%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3' interpreted correctly by aria2c but not by wget?
2. Why is '%BF%C2%D3%D0%C2%D7-%C1%E3.mp3' interpreted as '¿ÂÓÐÂ×-Áã.mp3' by both aria2c and wget?
3. Can I tune aria2c/wget to get those characters interpreted right? I have some experience with FTP clients such as filezilla and gftp; they both have a charset option for correctly displaying filenames from servers that use an encoding other than utf-8.
4. Although wget and aria2c both interpret '%BF%C2%D3%D0%C2%D7-%C1%E3.mp3' wrongly, they can still write a file called '¿ÂÓÐÂ×-Áã.mp3' to EXT4 partitions, but not to NTFS partitions. (PCManFM displays Chinese/Japanese filenames on NTFS partitions correctly, just to point that out.) How can I tune ntfs-3g's options to fix this?
5. What I asked in the beginning: why does the same filename have two versions in two URLs?

Last edited by lolilolicon (2009-08-25 05:33:18)


This silver ladybug at line 28...


#2 2009-08-27 04:19:33

thisoldman
Member
From: Pittsburgh
Registered: 2009-04-25
Posts: 1,172

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

I do not speak or read Chinese, my native language is American English and I can read and speak French haltingly.

Different Chinese character code systems exist. I was able to find that %BF%C2%D3%D0%C2%D7-%C1%E3.mp3 was coded in GB2312(old) or GBK(new), used in mainland China-- two bytes per character.  BFC2=柯; D3D0=有; C2D7=伦; C1E3=零.  Here's a table for conversions that I found for GB2312: http://www.ansell-uebersetzungen.com/gborder.html. The Wikipedia article on GB2312 is here: http://en.wikipedia.org/wiki/GB_2312. The GBK article is here: http://en.wikipedia.org/wiki/GBK.
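A small Python sketch of the lookup described above, assuming the percent-escapes are plain GBK bytes (the variable names are only for illustration). It undoes the URL escaping and then decodes the raw bytes as GBK:

```python
from urllib.parse import unquote_to_bytes

escaped = "%BF%C2%D3%D0%C2%D7-%C1%E3.mp3"

# Step 1: undo the percent-escaping. This gives raw bytes, not text yet.
raw = unquote_to_bytes(escaped)

# Step 2: interpret the bytes as GBK (assumption: the server used GBK/GB2312).
name = raw.decode("gbk")

# Show the two-byte code of every non-ASCII character.
pairs = [(ch, ch.encode("gbk").hex().upper()) for ch in name if ord(ch) > 127]

print(name)
for ch, code in pairs:
    print(code, "=", ch)
```

The URL itself carries no label saying which charset the escapes belong to, so the "gbk" here is a guess that happens to produce a sensible name.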

I haven't been able to pin down the 3-byte per char code; I found no hex to char conversion table that fits.

Your use of the ntfs partition and the ntfs-3g driver makes what little I know about the various disk mount options useless. If you have a spare USB flash drive you could format it with ntfs and start experimenting with different mount options. Copy a file called "柯有伦-零.txt" to the drive and see which mount options allow you to see the correct name in both Windows and GNU Linux. The Windows machine will have to have a usable Chinese font and the correct codepage set. I have no problems displaying the Chinese characters using utf8 on my ext3 partitions in Linux.

I haven't used it but convmv in extra can be used to convert file names from one charset to another. There's also fuse-convmvfs in the AUR.

Perhaps if you change the subject of this thread to include the Chinese characters, someone with knowledge of the charset encodings would pick up on it.


#3 2009-08-27 05:21:45

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

thisoldman wrote:

I do not speak or read Chinese, my native language is American English and I can read and speak French haltingly.

Different Chinese character code systems exist. I was able to find that %BF%C2%D3%D0%C2%D7-%C1%E3.mp3 was coded in GB2312(old) or GBK(new), used in mainland China-- two bytes per character.  BFC2=柯; D3D0=有; C2D7=伦; C1E3=零.  Here's a table for conversions that I found for GB2312: http://www.ansell-uebersetzungen.com/gborder.html. The Wikipedia article on GB2312 is here: http://en.wikipedia.org/wiki/GB_2312. The GBK article is here: http://en.wikipedia.org/wiki/GBK.

I haven't been able to pin down the 3-byte per char code; I found no hex to char conversion table that fits.

Thanks very much for the info. I've found "the 3-byte per char code" to be UTF-8. This is why aria2c interpreted it right. It seems aria2c has a better utf-8 support than wget.
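To convince myself, here is a minimal Python sketch that percent-encodes the same filename under both encodings. It only assumes the two sites used GBK and UTF-8 respectively; the variable names are mine:

```python
from urllib.parse import quote

name = "柯有伦-零.mp3"

# Same characters, two different byte encodings, hence two different URLs.
gbk_form = quote(name, encoding="gbk")
utf8_form = quote(name, encoding="utf-8")

print(gbk_form)
print(utf8_form)

# GBK uses 2 bytes for each of these characters, UTF-8 uses 3.
gbk_len = len("柯".encode("gbk"))
utf8_len = len("柯".encode("utf-8"))
print(gbk_len, utf8_len)
```

So the two URLs name the same file; they just disagree about which bytes stand for the characters.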

I have no problem displaying correct Chinese filenames on ntfs partitions. The problem is:
1. The downloader interprets filenames incorrectly.
2. Those incorrect filenames can't be written to ntfs partitions, yet they do get written to ext4 partitions even though no one on earth knows what those characters mean.
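The odd-looking name in point 1 can be reproduced with a short Python sketch. It assumes the downloader passes the GBK bytes through untouched and that something on the display side reads them one byte per character, as ISO-8859-1 (Latin-1) would:

```python
# The bytes the GBK-encoded URL stands for.
raw = "柯有伦-零.mp3".encode("gbk")

# Reading those bytes as Latin-1 maps every byte to one accented character.
shown = raw.decode("latin-1")

print(shown)  # ¿ÂÓÐÂ×-Áã.mp3
```

If that assumption holds, the downloader is not inventing characters; the bytes are the original ones and only the interpretation is wrong.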

Last edited by lolilolicon (2009-08-27 05:24:17)



#4 2009-08-27 11:03:01

thisoldman
Member
From: Pittsburgh
Registered: 2009-04-25
Posts: 1,172

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

That was a nice little oversight on my part --not checking the utf-8 charset.

I experimented a little bit. I tried your link to download the file with wget; of course it doesn't work for me. Does the wget option --restrict-file-names=nocontrol help? That option turns off wget's escaping of control characters, a class in which wget includes bytes 128-159, values that also occur inside UTF-8 sequences.

For a download example, here's a Russian language Wikipedia page,  http://ru.wikipedia.org/wiki/%D0%97%D0% … 1%86%D0%B0

Without --restrict-file-names=nocontrol

$ wget http://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0
--2009-08-27 06:11:35--  http://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0
... ... ...
Length: 77305 (75K) [text/html]
Saving to: `Ð%97аглавнаÑ%8F_Ñ%81Ñ%82Ñ%80аниÑ%86а'
... ... ...
2009-08-27 06:11:36 (292 KB/s) - `Ð%97аглавнаÑ%8F_Ñ%81Ñ%82Ñ%80аниÑ%86а' saved [77305/77305]

With --restrict-file-names=nocontrol

$ wget --restrict-file-names=nocontrol http://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0
--2009-08-27 06:12:37--  http://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B3%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0%D1%8F_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B8%D1%86%D0%B0
... ... ...
Length: 77304 (75K) [text/html]
Saving to: `Заглавная_страница'
... ... ...
2009-08-27 06:12:37 (291 KB/s) - `Заглавная_страница' saved [77304/77304]
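The difference between the two runs can be modelled in a few lines of Python. This is only a rough sketch of the behaviour the wget manual describes (escaping "control" bytes in the ranges 0-31 and 128-159), not wget's actual code; escape_controls is a made-up helper:

```python
from urllib.parse import unquote_to_bytes

def escape_controls(raw: bytes) -> bytes:
    """Rough model of wget's default: turn bytes 0-31 and 128-159 into %XX."""
    out = bytearray()
    for b in raw:
        if b < 0x20 or 0x80 <= b <= 0x9F:
            out += b"%%%02X" % b
        else:
            out.append(b)
    return bytes(out)

# The UTF-8 encoded file name from the tktt URL earlier in the thread.
utf8_name = unquote_to_bytes("%E6%9F%AF%E6%9C%89%E4%BC%A6-%E9%9B%B6.mp3")

# Default behaviour: some bytes inside the UTF-8 sequences get escaped,
# which breaks those sequences apart.
escaped = escape_controls(utf8_name)
print(escaped)

# With nocontrol the bytes are left alone, so the name is still valid UTF-8.
print(utf8_name.decode("utf-8"))
```

In this model the character 伦 (bytes E4 BC A6) survives because none of its bytes fall in 128-159, which matches the half-broken name in the first post.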

Wget also has the --header option. The man page gives an example that might be useful.

I'm still looking at the ntfs problem.

Last edited by thisoldman (2009-08-27 11:11:49)


#5 2009-08-27 15:11:39

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

The --restrict-file-names=nocontrol option works great with utf8 encoded urls!
So now wget can do it as well as aria2c at this point.
But I can't get the --header option to work for GBK encoded urls:

$ wget --header='Accept-Charset: gbk'  'http://down.jsharer.com/user/userAction.do?method=download&urlpath=ftp://1090929347:800983720@58.215.91.170:2022/08792/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3'
... ... ...
--2009-08-27 22:38:42--  ftp://1090929347:*password*@58.215.91.170:2022/08792/200908/%BF%C2%D3%D0%C2%D7-%C1%E3.mp3
           => `¿ÂÓÐÂ×-Áã.mp3'
... ... ...

Also tried wget --restrict-file-names=nocontrol and LANG='zh_CN.GBK' ; wget --restrict-file-names=nocontrol , both failed.

Sorry I forgot to paste the download page link:
http://jsharer.com/download/main/1367623/1.htm
Right click on the filename "柯有伦-零.mp3" and copy the url for wget. (If the DDL link dies, simply reload the page and copy the new url.)

Thanks, Tom (aka thisoldman :P)

Last edited by lolilolicon (2009-08-27 15:13:22)



#6 2009-08-28 02:04:01

thisoldman
Member
From: Pittsburgh
Registered: 2009-04-25
Posts: 1,172

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

I found an example that shows many of the combinations for the wget "header=" options at http://www.askapache.com/tools/wget-header-trick.html. I'll have to experiment with it when I get time, if I remember.

But now that the file can be downloaded with correct characters displayed, my curiosity is piqued by the copy failure to the ntfs drive. We know everything is encoded using utf8 -- the ntfs-3g driver permits nothing else. Does "cp -v 柯有伦-零.mp3 ntfs_dir/" return anything useful? If it doesn't report an error, you have to assume it was successful.

Perhaps uncommenting the "#zh_CN.UTF-8 UTF-8" line in /etc/locale.gen and running "sudo locale-gen" would make the disk-to-disk transfer work or make the invisible characters visible.

GBK might not be correct for wget. There are other Chinese codesets, and there is probably significant overlap among them. I suggested GBK, but I realize I stopped after finding the match for just five characters and did not look for other matches.

The non-utf8 character problem for wget might also be a Microsoft Chinese codeset problem. We know the display of characters like 'Ä' in Latin-based alphabets causes conflict between Windows and Linux. This could be similar.


#7 2009-08-28 04:00:56

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

I already did locale-gen for zh_CN.* stuff.

"cp -v 柯有伦-零.mp3 ntfs_dir/" -- surely it works. Any normal Chinese character can be written to an ntfs partition and displayed with no problem. Only the mis-translated names like "¿ÂÓÐÂ×-Áã.mp3" cannot be written.
"the ntfs-3g driver permits nothing else" -- so strings like "¿ÂÓÐÂ×-Áã.mp3" are not permitted because they're not utf8? I don't quite understand what utf8 is... I thought it covered everything?

==> The thing seems to be more complicated, read "Edit" below.

This codeset thing is annoying. I'm thinking about picking up some perl/python because I find they've got some little modules/libraries just for charset conversion.

Edit::
Oops, I find it strange!

$ echo $LANG
en_US.utf8
$ touch '¿ÂÓÐÂ×-Áã.mp3' 
$ ls
¿ÂÓÐÂ×-Áã.mp3
$ wget 'URL'
... ... ...
`¿ÂÓÐÂ×-Áã.mp3' saved
$ ls
??????-??.mp3   ¿ÂÓÐÂ×-Áã.mp3
$ LANG='zh_CN.gbk' ; ls
¿ÂÓÐÂ×-Áã.mp3   ¿ÂÓÐÂ×-Áã.mp3

I did test this on EXT4 partition because wget won't download to NTFS.
I can do $ touch '¿ÂÓÐÂ×-Áã.mp3' on NTFS.
The very odd part is that while wget says "`¿ÂÓÐÂ×-Áã.mp3' saved", I get "??????-??.mp3".
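A possible explanation, sketched in Python. It assumes wget wrote the raw GBK bytes from the URL, while touch in my UTF-8 terminal wrote the UTF-8 encoding of the look-alike Latin-1 characters. ls_view is a made-up helper that only approximates how ls prints undecodable bytes on a UTF-8 terminal:

```python
# What wget saved (assumption): the raw GBK bytes taken from the URL.
saved = "柯有伦-零.mp3".encode("gbk")

# What touch created: the same glyphs typed as text in a UTF-8 terminal,
# so every accented character became a two-byte UTF-8 sequence.
looks_like = saved.decode("latin-1")
typed = looks_like.encode("utf-8")

def ls_view(name: bytes) -> str:
    """Approximate ls on a UTF-8 terminal: undecodable bytes show as '?'."""
    return name.decode("utf-8", errors="replace").replace("\ufffd", "?")

print(len(saved), len(typed))  # two different byte strings
print(ls_view(saved))          # ??????-??.mp3
print(ls_view(typed))
```

If this is right, the two entries in the directory are two different files whose names merely look the same when the bytes are read as Latin-1.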

Eh, I gave the wrong link, sorry! That link requires registration... Here's the public one.
http://jsharer.com/file/1367623.htm
Click the big blue button down there, the left of the two. And then right click the "柯有伦-零.mp3" at the top and copy the url.

Last edited by lolilolicon (2009-08-28 04:03:45)



#8 2009-08-28 10:24:42

thisoldman
Member
From: Pittsburgh
Registered: 2009-04-25
Posts: 1,172

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

kokoko3k has posted a similar ntfs character display problem in this thread. I've asked him to join us in this thread.

Unfortunately, forum.ntfs-3g.org, is down at the moment. Can't search there.

It appears the non-Latin and double-wide characters cause a problem with ntfs-3g, as found by Googling "character set ntfs-3g". Looking through the listings for the "solved" situations gives many different solutions. One person unmounts and remounts the disk, another uses the "-o locale=xxx" option for mounting, and a third person modifies PolicyKit/HAL.

Unmounting and remounting the drive with ntfs-3g should be easy to test. But first I have to create a partition and format it and install ntfs-3g. I won't have time to run tests before I leave for work. If a solution isn't found before 11:00 PM EDT I'll start running tests.

Should "chkdsk /f /r" be run from a Windows system on the drive? Just to make sure hardware or disk problems aren't being created or are causing a complication?

Last edited by thisoldman (2009-08-28 10:26:17)


#9 2009-08-28 11:09:35

kokoko3k
Member
Registered: 2008-11-14
Posts: 2,393

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

From version 1.5222-RC
Change: The 'locale=' mount option is not used anymore for filename characterset conversion. Instead filenames are always converted to UTF-8.

So, in arch, the latest ntfs-3g version which supports the locale= parameter is: 1.5130-1
Any solution found on google which points to a locale=xxx (by using policykit/hal or whatever) will not work with any of the recent ntfs-3g versions.

I never really wanted to fully understand the alchemy of character encodings, but it seems that the only way to have new ntfs-3g deal with "strange" characters is to set your own locales to xx_XX.utf8 and that the tools you are using (starting from the shell till wget or aria2c, in the lolilolicon case), have to support utf8 too.
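A quick way to check the first half of that (whether the environment names a UTF-8 locale at all) is sketched below in Python. The helpers effective_codeset and is_utf8_locale are hypothetical; they only read the variable names in the usual LC_ALL, LC_CTYPE, LANG order and do not verify that the locale is actually generated on the system:

```python
def effective_codeset(env: dict) -> str:
    """Return the codeset part of the locale that decides character handling."""
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            break
    else:
        return ""
    # Locale names look like language[_territory][.codeset][@modifier].
    value = value.split("@", 1)[0]
    if "." not in value:
        return ""
    return value.split(".", 1)[1].lower().replace("-", "")

def is_utf8_locale(env: dict) -> bool:
    return effective_codeset(env) == "utf8"

print(is_utf8_locale({"LANG": "en_US.utf8"}))   # True
print(is_utf8_locale({"LANG": "zh_CN.GBK"}))    # False
```

To test the real environment, pass dict(os.environ) after importing os.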

All solved, you may think, well... no.
What charset does my italian windows installation use?
I bet it uses it_IT.8859-1, because mounting that partition with ntfs-3g-5130 seems the only way to access those filenames with "strange characters".

Don't miss this nice thing here:
http://fuse-convmvfs.sourceforge.net/
It should be possible to have a layer on top of ntfs mounted partition which should convert on the fly everything from your own charset to utf-8, to make ntfs-3g happy, but that way, i bet windows will have serious filename problems then.

By the way, ntfs-3g forums are up, and here is the relevant info, with some people complaining about the issue:
http://tuxera.com/forum/viewtopic.php?f … 369bc5591c

...Hope this will help somehow, at least in understanding wth is happening.

Last edited by kokoko3k (2009-08-28 11:15:07)


Help me to improve ssh-rdp !
Retroarch User? Try my koko-aio shader !


#10 2009-08-28 14:29:17

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

If I read that ntfs-3g forum link right, ntfs-3g now *automatically* converts filenames in whatever character set to utf8.
If you want those filenames displayed correctly in terminal, your terminal must be in utf8.
If you write some filenames to the ntfs partition, ntfs-3g always assumes it to be utf8. If it *can't* map it to utf8, then it says 'Invalid or incomplete multibyte or wide character'.
Hence, you must make sure the filenames you're writing to the ntfs partition are correctly converted to utf8, which means the tool which is reporting the name to ntfs-3g must report it in *good* utf8.

Just guessing though. Especially that "*automatically*" sounds good but I really don't know how this is done... But well, my windows partition displays very well with ntfs-3g anyway. The problem I'm experiencing is more of a downloader issue than ntfs-3g.

The reason why wget can write those "really strange" characters to EXT4 but not to NTFS is (I think):
EXT4 accepts those characters even if they're out of the scope of utf8 code map.(*bad* utf8 accepted)
NTFS-3g rejects anything out of the range of utf8 code map.(*good* utf8 only)
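My guess can be written down as a toy model in Python. Both functions are hypothetical stand-ins, not real driver code. The model assumes ext4 treats a name as an opaque byte string, and that ntfs-3g has to read the name as UTF-8 so it can store it as UTF-16 on disk:

```python
def ext4_accepts(name: bytes) -> bool:
    """Toy model: any bytes are fine except '/' and NUL, up to 255 bytes."""
    return 0 < len(name) <= 255 and b"/" not in name and b"\0" not in name

def ntfs3g_accepts(name: bytes) -> bool:
    """Toy model: the name must be valid UTF-8 to be converted to UTF-16."""
    try:
        name.decode("utf-8").encode("utf-16-le")
    except UnicodeError:
        return False
    return True

utf8_name = "柯有伦-零.mp3".encode("utf-8")
gbk_name = "柯有伦-零.mp3".encode("gbk")   # the bytes wget/aria2c tried to write

for label, raw in (("utf-8", utf8_name), ("gbk", gbk_name)):
    print(label, "ext4:", ext4_accepts(raw), "ntfs-3g:", ntfs3g_accepts(raw))
```

In this model the GBK bytes pass on ext4 and fail on NTFS, which is the same pattern as in my downloads.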

szaka wrote:

You must have an UTF-8 capable Linux distribution. Most of them is doing this already for a long time. "Santa Fè" and anything else is working absolutely perfectly on the latest Ubuntu, openSUSE, Mandriva and many other distributions.
If it doesn't work then the reason is that your environment is misconfigured: http://ntfs-3g.org/support.html#utf8a

Correct me if I'm wrong.

Last edited by lolilolicon (2009-08-28 14:31:08)



#11 2009-08-28 16:12:22

kokoko3k
Member
Registered: 2008-11-14
Posts: 2,393

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

Nope, i guess you're right.
I thought that setting the locale variables would be enough to make the current console (and children) utf8-compliant, but i was wrong: setting utf8 in rc.conf and rebooting leads me to different results (i'll check what rc.sysinit does), and now ntfs-3g works as expected.
Thanks :)

Last edited by kokoko3k (2009-08-28 16:13:25)



#12 2009-08-28 19:50:41

drtoki
Member
From: {x ∈ A | p(x) = 1}
Registered: 2009-07-22
Posts: 95

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

I don't understand a thing about character encodings so this whole thread is beyond me. I was curious though because I have never encountered this problem using wget. I don't have a jsharer account so I was unable to test that particular file, however this is what I get with the tktt link under ext3, ext4, ntfs. I partitioned a flash drive with the three file systems.

All three resulted as follows

[toki@zr-1 ext4]$ wget http://mp3.tktt.com/eec38e543bc6c0e4/15/%E6%9F%AF%E6%9C%89%E4%BC%A6-%
--2009-08-28 13:43:22--  http://mp3.tktt.com/eec38e543bc6c0e4/15/%E6%9F%AF%E6%9C%89%E4%BC%A6-%25
Resolving mp3.tktt.com... 58.215.81.44
Connecting to mp3.tktt.com|58.215.81.44|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 126229 (123K) [audio/x-ms-wma]
Saving to: `柯有伦-零'

100%[=============================================>] 126,229     93.2K/s   in 1.5s    

2009-08-28 13:12:36 (33.2 KB/s) - `柯有伦-零' saved [126229/126229]

[toki@zr-1 ext4]$ ls
柯有伦-零

##### for sake of cutting repetition.. rinse and repeat on other partitions#####

[toki@zr-1 ext3]$ ls && ls ../ext4 && ls ../ntfs
柯有伦-零
柯有伦-零
柯有伦-零

So, I can't seem to reproduce your problem..

EDIT: the only difference I see is that I don't get the .mp3 extension that shows up in your example. This should be the case seeing as how it identifies as a wma. However, the file seems much too small to be valid.  (I've always had issues with tktt and their surplus of bogus links)

Last edited by drtoki (2009-08-28 20:03:49)


#13 2009-08-29 03:01:20

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

Hello, drtoki. Welcome to join us.

Well, your link is wrong.
http://mp3.tktt.com/eec38e543bc6c0e4/15 … %9B%B6.mp3
But anyway, your result is different from mine.
What is your locale? And, do you have a .wgetrc file with predefined options?

The jsharer link I posted was fixed for public access. But that file will die soon. So here's another link for "中国姑娘.mp3"
http://jsharer.com/file/1374265.htm
You'll have to click the left big blue button down there. And right click the "中国姑娘.mp3" to copy the URI.

Regarding the validity of that file... I don't know, I didn't even try to play it. Anyway only the name matters. ;)



#14 2009-08-29 12:35:33

thisoldman
Member
From: Pittsburgh
Registered: 2009-04-25
Posts: 1,172

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

Okay, I screwed up the character display in Firefox trying to display all the different character sets correctly! ;)

With some package update, enca was installed on my machine. From the man page:

enca -- detect and convert encoding of text files

If  you  are  lucky  enough, the only two things you will ever need to know are: 'enca FILE' will tell you which encoding file FILE uses (without changing it), and 'enconv FILE' will convert file FILE to your locale native encoding.

The command "enca 西山红 .txt" failed gracefully with an error message, but "enca -L zh 西山红 .txt", specifying Chinese returned "Universal transformation format 8 bits; UTF-8".

From a little experimentation, it was easy to see I would be more often unlucky than lucky unless I specified the language. It's still a useful tool.


#15 2009-08-29 13:03:41

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

I sometimes use file for that, but:

$ file goodies.txt 
goodies.txt: ISO-8859 text, with CRLF line terminators

$ enca -L zh goodies.txt
Simplified Chinese National Standard; GB2312
  CRLF line terminators

enca is right. Heh, strange.
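My guess at why the two tools disagree, as a Python sketch: bytes alone often fit several encodings, so a tool without a language hint can only say "some 8-bit text". decodable_as is a made-up helper and the sample text is just the filename from this thread, not the contents of goodies.txt:

```python
def decodable_as(data: bytes, candidates):
    """Return the candidate encodings under which data decodes without error."""
    ok = []
    for enc in candidates:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok

# Chinese text saved as GBK (a superset of GB2312) with CRLF line ends.
sample = "柯有伦-零\r\n".encode("gbk")

matches = decodable_as(sample, ["ascii", "utf-8", "gbk", "latin-1"])
print(matches)
```

Latin-1 accepts any byte sequence, so it always "matches"; knowing the language is Chinese is what lets a tool prefer the GB family over a generic ISO-8859 answer.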



#16 2009-08-30 11:50:34

drtoki
Member
From: {x ∈ A | p(x) = 1}
Registered: 2009-07-22
Posts: 95

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

lolilolicon wrote:

Hello, drtoki. Welcome to join us.

Well, your link is wrong.
http://mp3.tktt.com/eec38e543bc6c0e4/15 … %9B%B6.mp3
But anyway, your result is different from mine.
What is your locale? And, do you have a .wgetrc file with predefined options?

The jsharer link I posted was fixed for public access. But that file will die soon. So here's another link for "中国姑娘.mp3"
http://jsharer.com/file/1374265.htm
You'll have to click the left big blue button down there. And right click the "中国姑娘.mp3" to copy the URI.

Regarding the validity of that file... I don't know, I didn't even try to play it. Anyway only the name matters. ;)

Sorry, my fault for not correctly copying a link..  the tktt file download again shows up correctly. 中国姑娘 unfortunately doesn't display correctly. It instead comes up as

=> `Öйú¹ÃÄï.mp3'
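For what it's worth, a name like that can usually be turned back into the intended one after the fact. The Python sketch below assumes the real bytes are GBK and that they were merely displayed as Latin-1; repair_mojibake is a hypothetical helper, and it works on the name as a string, not on files on disk:

```python
def repair_mojibake(shown: str, actual: str = "gbk", assumed: str = "latin-1") -> str:
    """Re-encode with the wrongly assumed charset, then decode with the real one."""
    return shown.encode(assumed).decode(actual)

original = "中国姑娘.mp3"

# Simulate the damage: GBK bytes read one byte per character as Latin-1.
garbled = original.encode("gbk").decode("latin-1")

repaired = repair_mojibake(garbled)
print(repaired)
```

The round trip only works while every byte is still intact; if a terminal or web page has already merged or replaced some of the bytes, the original name cannot be recovered this way.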

I do not have a .wgetrc file.
My locale is as follows

[toki@zr-1 ~]$ locale
LANG=zh_CN.utf8
LC_CTYPE=zh_CN.utf8
LC_NUMERIC="en_US.utf8"
LC_TIME="zh_CN.utf8"
LC_COLLATE=C
LC_MONETARY="en_US.utf8"
LC_MESSAGES="zh_CN.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="zh_CN.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="zh_CN.utf8"

Anyways I don't know if you've figured things out or not, though I don't think my post has helped you at all... Sorry I couldn't be of more help to you in this.


#17 2009-08-30 12:53:29

thisoldman
Member
From: Pittsburgh
Registered: 2009-04-25
Posts: 1,172

Re: Nonlatin characters in URLs;NTFS Invalid or incomplete multibyte error

My interest is producing accurate translations: technical papers for work, and now my family uses three different languages in daily life (we're scattered over four continents).

This post by Micah Cowan, the maintainer of wget, shows he is aware of its shortcomings. Wget "cannot handle certain shifting or multibyte character sets, notably ISO-2022-JP..." Looks like we have to wait for wget version 2.0.
