Unicode composition and decomposition

bender02 · 2010-01-22 21:00:57

A Unicode character can often be encoded in more than one way. For instance, 'ě' can be written in UTF-8 as (in hex) '65 CC 8C' [which is just the character 'e' followed by a combining code saying that the preceding character should have a caron over it], or as 'C4 9B' [which is a single code for that character]. I have a fairly big text in the former form (usually called "decomposed") and I need to convert it into the latter form ("composed"). [It's an e-book, and my reader doesn't support the decomposed form of Unicode.]

Could anyone help, perhaps with a script?

I've found Perl's Unicode::Normalize module, but my first try (my first try with Perl altogether) did not do the job...
And my googling skills probably aren't what they used to be.

Thanks.

EDIT: if vim could do this, it would be amazing (although not that surprising).

Last edited by bender02 (2010-01-22 21:03:10)

Xyne · 2010-01-22 21:15:14

Are we using the same unicode?

65 cc 8c: eÌ
65 cc8c: e첌

Post some example "before" and "after" text.

wuischke · 2010-01-23 08:43:51

>>> print unicode("\x65\xcc\x8c","utf-8")
ě
>>> print unicode("\x65\xcc\x8c","utf-8")[0]
e
>>> print hex(ord(unicode("\x65\xcc\x8c","utf-8")[1]))
0x30c

Unicode is really complex. Python will convert your format to an entirely different one...

On my computer, if I copy and paste the above character, I get the following hexdump:

$ hexdump test
0000000 9bc4
0000002

So copying and pasting converts it to yet another form.

Well, since Python already does a (different) conversion with one line of code, could you check whether your reader supports the variant that uses 0x30c to add the caron?

Edit: It appears Python uses UTF-16 internally, so 0x30c is probably related to UTF-16 rather than any conversions.
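
For reference, here is a small Python 3 sketch of what the session above is showing (the thread itself uses Python 2, so the syntax differs). The value 0x30c is the Unicode code point U+030C, COMBINING CARON, which does not depend on how the interpreter stores strings internally. The variable names are only illustrative.

```python
import unicodedata

# The three bytes from the original post, decoded as UTF-8.
decomposed = b"\x65\xcc\x8c".decode("utf-8")

# Two code points: the letter 'e' and the combining mark.
print([hex(ord(c)) for c in decomposed])
print(unicodedata.name(decomposed[1]))

# NFC composes the pair into the single precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)
print(hex(ord(composed)), composed.encode("utf-8").hex())
```

The last line prints the code point of the composed character and its UTF-8 bytes, which are the 'C4 9B' form asked for in the first post.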

Last edited by wuischke (2010-01-23 08:52:12)

bender02 · 2010-01-23 21:11:35

@xyne: I suppose the problem could be with the order of bytes; I used hexcurse and it groups every 8 bits. hexdump apparently groups 16 bits.

@wuischke: your 'hexdump test' actually prints exactly the "composed" form, just with the two bytes of each 16-bit group swapped (9b c4 <-> c4 9b).

Some more comments: I tried "copy and paste" from and to different programs, but for some reason the form does not change. However, if I manually enter the 'ě' character on my keyboard, I get the "composed" form. Well, maybe I should try copy/paste with some other apps or more advanced editors.

Some examples: here's an excerpt of my file:

To věděl každý.

$ hexdump test
0000000 6f54 7620 cc65 648c cc65 6c8c 6b20 7a61
0000010 8ccc 7964 81cc 0a2e
0000018

Now if I type the same text manually into an editor (vim), I get

$ hexdump test2
0000000 6f54 7620 9bc4 c464 6c9b 6b20 c561 64be
0000010 bdc3 0a2e
0000014

test: http://pastebin.com/m54719d22
test2: http://pastebin.com/mc01bf4f
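
As a cross-check of the two dumps above, the following Python 3 sketch reproduces them from the excerpt. It assumes that plain `hexdump` shows the data as 16-bit words on a little-endian machine, which is why each pair of bytes appears swapped; `hexdump -C` prints the bytes in file order instead. The helper name `hexdump_words` is made up for this illustration.

```python
import struct
import unicodedata

def hexdump_words(data):
    """Format bytes as 16-bit little-endian words, like plain hexdump."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input (an assumption of this sketch)
    count = len(data) // 2
    return ["%04x" % word for word in struct.unpack("<%dH" % count, data)]

excerpt = "To věděl každý.\n"

# Force each form explicitly so the result does not depend on how
# the literal above happens to be stored in the source file.
decomposed = unicodedata.normalize("NFD", excerpt).encode("utf-8")
composed = unicodedata.normalize("NFC", excerpt).encode("utf-8")

print(" ".join(hexdump_words(decomposed)))
print(" ".join(hexdump_words(composed)))
```

The decomposed form is 24 bytes long and the composed form 20, matching the final offsets 0000018 and 0000014 in the dumps.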

@wuischke: I tried a simple python script (your first 'print' command), it does print the right character, but I wasn't able to pipe the output into a file:

Traceback (most recent call last):
  File "./try.py", line 2, in <module>
    print unicode("\x65\xcc\x8c","utf-8");
UnicodeEncodeError: 'ascii' codec can't encode character u'\u030c' in position 1: ordinal not in range(128)
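
The error comes from the implicit encoding step: when the output is piped, Python 2 falls back to the ASCII codec for unicode objects, so encoding explicitly (in Python 2, `print u.encode("utf-8")`) avoids it. Below is a minimal Python 3 sketch of the same idea, writing already-encoded bytes to a binary stream. The function name `write_utf8` is hypothetical.

```python
import io

def write_utf8(text, stream):
    """Encode text as UTF-8 and write the bytes to a binary stream."""
    data = text.encode("utf-8")
    stream.write(data)
    return len(data)

# Demonstration with an in-memory binary stream.
buffer = io.BytesIO()
write_utf8("e\u030c\n", buffer)
print(buffer.getvalue())
```

In a real script the stream would be `sys.stdout.buffer` or a file opened in binary mode, so the result no longer depends on the terminal or pipe settings.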

juster · 2010-01-23 22:15:26

My guess is that your problems with Perl's Unicode::Normalize came from encoding. When you read a string of raw data, it must first be decoded from UTF-8 (in your case) into Perl's internal text format, which is what Unicode::Normalize seems to operate on. After normalizing, you have to encode from the internal format back to raw UTF-8 bytes.

You can accomplish this manually using the Encode module with its "decode" and "encode" functions. More easily for files, you can tell perl that a file handle should do this automatically for you like so:

#!/usr/bin/perl

use warnings;
use strict;
use Unicode::Normalize qw(compose);

# Tell perl we want output automatically encoded to utf8
binmode STDOUT, ':encoding(UTF-8)';

# '>:encoding(UTF-8)' opens the file for output and encodes text to UTF-8 automatically
open my $goodfile, '>:encoding(UTF-8)', 'goodfile.txt'  or die "open goodfile: $!";
# '<:encoding(UTF-8)' opens the file for reading and decodes text from UTF-8 automatically
open my $badfile,  '<:encoding(UTF-8)', 'm54719d22.txt' or die "open badfile: $!";
while ( my $text = <$badfile> ) {
    print $text;
    $text = compose( $text );
    print $goodfile $text;
}
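
For comparison, roughly the same decode, normalize, encode pipeline can be sketched in Python 3 with the standard `unicodedata` module. This is not a translation of the script above: it uses the full NFC normalization form rather than a bare compose step, and the function name `compose_file` is made up for the example.

```python
import unicodedata

def compose_file(src_path, dst_path):
    """Copy src_path to dst_path, converting each line to NFC."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(unicodedata.normalize("NFC", line))
```

Calling `compose_file("m54719d22.txt", "goodfile.txt")` would mirror the file names used in the Perl script. Normalizing line by line is safe here because a combining mark never composes across a line break.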

edit: I forgot to mention I found a tool you might use called charlint that normalizes Unicode text.

Last edited by juster (2010-01-23 22:23:30)

bender02 · 2010-01-25 09:51:26

Thanks y'all, mainly juster!

Juster's script works nicely, and so does charlint. While I was at it, I packaged charlint; it's in the AUR.
