Handling unicode strings in C.

iTwenty · 2012-01-05 09:51:18

I am making a C application that makes use of TagLib library for reading ID3 tags from audio files. Here's the relevant piece of code:

    .....
    TagLib_Tag  *tag = taglib_file_tag( tagFile );
    if( tag != NULL )
    {
        audio->title = taglib_tag_title( tag );    //audio->title is a char *
        audio->year  = taglib_tag_year( tag );
    }
    ......

the relevant header file has this to say about taglib_tag_title() function

/*!
 * Returns a string with this tag's title.
 *
 * \note By default this string should be UTF8 encoded and its memory should be
 * freed using taglib_tag_free_strings().
 */

So, from my understanding, audio->title contains a UTF-8 encoded string which, if I try to print using priintf(), displays gibbersih instead of actual title. But, isn't UTF-8 supposed to be compatible with ASCII encoding which C uses (atleast for the usual A-Z and 0-9)? So, why is printf() string displaying weirdness? More importantly, what is the proper way to deal with UTF-8/16/32 encoded strings in C?

TheLemonMan · 2012-01-05 15:29:01

wprintf should do the trick

juster · 2012-01-05 22:12:07

UTF8 is a multibyte 8-bit encoding so yes, printf should work. wprintf takes wide-character input instead. Does your terminal print UTF8 fine in other programs? Are you messing with the strings before printing them? Do you use setlocale?

http://www.cl.cam.ac.uk/~mgk25/unicode.html#c
or type 'info libc', see character/locale sections for long explanations

#include <stdio.h>
main(){ printf("\xE2\x98\x83\n"); }

iTwenty · 2012-01-06 08:25:29

I noticed that if I printf() strings with accented characters, they show up fine in terminal. But if the same string comes from a call to taglib_tag_title(), then I see garbage. So, I am guessing I have to deal with some library weirdness here. BTW, if I don't make a call to setlocale(), does the process run with locale determined by my environment variables or does it default to C/POSIX locale?

TheLemonMan · 2012-01-06 11:52:38

I just think that the library internally uses wide chars

juster · 2012-01-06 22:31:37

If you do not explicitly call setlocale then yes your program uses the POSIX/C locale.

I was playing with the tagreader_c.c example bundled with taglib. At first it was printing out ISO-8859 and I went aha! It sucks! Then I saw the line:

  taglib_set_strings_unicode(FALSE);

Oh! I just changed the FALSE to 1 (TRUE didn't work?!) and viola, UTF8. Are you calling this function before you are retrieving the strings?

TheLemonMan:
wprintf and friends all take wide characters as arguments. This is fine if you convert from multibyte to wide characters by using mbtowc but is not appropriate for an array of UTF8 encoded bytes.

edit: Oops, nevermind I thought you were talking about the GNU standard library, not TagLib.

Last edited by juster (2012-01-06 22:33:10)

iTwenty · 2012-01-11 06:49:11

Thanks for the responses. I eventually implemented the program in Python with mutagen library as it handles unicode strings far better than C. Speed isn't really a concern for me since the program anyway spends 95% of it's time waiting for a remote server to respond. But it was nevertheless a good experience coding in C since it made me learn about character encodings other than ASCII.

Last edited by iTwenty (2012-01-11 06:49:34)

Blµb · 2012-01-16 10:11:23

You have to understand that in C "strings" as such don't really exist. If you make a 'char*', all you do is create a pointer to a (bunch of) bytes. This, essentially, "is" a string, except when you start using unicode you have one big problem: representation.
Unicode strings can be represented in different ways: wide-characters (what wprintf would be used for), multibyte encodings (utf-8), and all the other useless ( ) encodings out there.
Now, the C functions used on strings don't actually see the data as "strings", but also just as "data", meaning, strlen() could return 8 for a 5-letter utf-8 string, because some letters could take up more bytes.

The same applies to printf(), it will simply dump the data to stdout. So in the case of utf-8, whether or not the result will be garbage or not, solely depends on whether your terminal (or whatever program the output will be passed to) expects utf-8 characters or not.

utf-8 being "compatible" means not that programs which allow you to handle ASCII strings will handle utf-8 strings correct in *all* circumstances. It mainly removes the problem wide-characters would have - for example that with a 16-bit encoding you'd often have a null-byte after
your letters since the upper 8 bits are 0 for codepoints below or equal to 255, which would result in every strlen() call returning *at most* 1, or just 0 - depending on byte-ordering - on strings which don't start with special characters
In other words, utf-8 and ascii are "compatible" as long as a program which only understands ascii only reads utf8 without trying to modify the contents.

strlen("ø") = 2

iTwenty · 2012-01-20 10:15:49

I think I get the picture now...C stores only byte representation of the string, determined by the encoding in which you save the source file, in case the string is hardcoded in the source. How this bytes get mapped to characters when you display the string is left up to the encoding scheme of the terminal. As an example, the following code:

#include <stdio.h>
int main( int argc, char **argv )
{
    char utf8[]  = { 0xc4, 0x80, 0x00 }; //0x00 is for NUL termination
    printf( "%s\n", utf8 );
    return 0;
}

instead of outputting two unintelligible characters, correctly outputs "Ā" (codepoint: U+0100) in a UTF-8 terminal. Why UTF8 represents 0100 as c480 has to do with initial few bits it reserves for it's multi-byte magic.
BTW, is there a way to change my system locale to en-US.UTF-16 instead of en-US.UTF-8, just for the kicks? As it is, if I change the encoding in my terminal to UTF-16 and press some innocuous character like "A", it displays a long series of mojibake which I suppose is due to my system locale and console encoding mismatch

Last edited by iTwenty (2012-01-20 10:26:00)

Blµb · 2012-01-20 10:29:15

Never tried it, but at least locale-gen doesn't seem to like utf-16...
Remember that what applies to C, doesn't only apply to the language, but to all programs written in a language where strings work in a similar way (even if only internally).

Meaning, if you were to use a utf-16 terminal, and were to type "echo Hello", you might only get it to print "H", or garbage, since it probably outputs { 'H', '\n' }. Unless the locale is understood by echo, or rather the shell (since usually the shell doesn't actually execute a *program* named echo...)

toofishes · 2012-01-20 17:58:41

If you want to have more fun, you can use the wide character class of functions and look at the raw bytes those stream out; sizeof(wchar_t) on Linux is 4 bytes, so it uses the full 32-bit Unicode value in that form.

In pacman we have to do some funky things to find the actual length of a string in "columns used for display". Some 1 byte characters take 1 column, while other 3 byte characters take 1 column, and yet others take 2 columns. You can see a bunch of stuff we have to do to treat Unicode and UTF-8 special here: http://projects.archlinux.org/pacman.gi … til.c#n455

tavianator · 2012-01-20 19:16:12

toofishes wrote:

In pacman we have to do some funky things to find the actual length of a string in "columns used for display". Some 1 byte characters take 1 column, while other 3 byte characters take 1 column, and yet others take 2 columns. You can see a bunch of stuff we have to do to treat Unicode and UTF-8 special here: http://projects.archlinux.org/pacman.gi … til.c#n455

static size_t string_length(const char *s)
{
	int len;
	wchar_t *wcstr;

	if(!s || s[0] == '\0') {
		return 0;
	}
	/* len goes from # bytes -> # chars -> # cols */
	len = strlen(s) + 1;
	wcstr = calloc(len, sizeof(wchar_t));
	len = mbstowcs(wcstr, s, len);
	len = wcswidth(wcstr, len);
	free(wcstr);

	return len;
}

Looks like you guys don't check the return value of calloc() there.

juster · 2012-01-22 03:22:58

Here's the same thing using mblen:

int n, i, c;
if(s == NULL){
    return 0; /* paranoia */
}
mblen(NULL, 0); /* paranoia */
for(i = 0, n = strlen(s); n > 0; i++){
    c = mblen(s, n);
    if(c <= 0){
        return i; /* -1 is error, 0 is bad news */
    }
    n -= c;
    s += c;
}
return i;

edit: changed to for loop

Last edited by juster (2012-01-22 03:27:06)

Arch Linux

#1 2012-01-05 09:51:18

Handling unicode strings in C.

#2 2012-01-05 15:29:01

Re: Handling unicode strings in C.

#3 2012-01-05 22:12:07

Re: Handling unicode strings in C.

#4 2012-01-06 08:25:29

Re: Handling unicode strings in C.

#5 2012-01-06 11:52:38

Re: Handling unicode strings in C.

#6 2012-01-06 22:31:37

Re: Handling unicode strings in C.

#7 2012-01-11 06:49:11

Re: Handling unicode strings in C.

#8 2012-01-16 10:11:23

Re: Handling unicode strings in C.

#9 2012-01-20 10:15:49

Re: Handling unicode strings in C.

#10 2012-01-20 10:29:15

Re: Handling unicode strings in C.

#11 2012-01-20 17:58:41

Re: Handling unicode strings in C.

#12 2012-01-20 19:16:12

Re: Handling unicode strings in C.

#13 2012-01-22 03:22:58

Re: Handling unicode strings in C.

Board footer