[solved] grep german umlaut using regex? use iconv

demian · 2010-04-27 08:18:07

Hi guys,

when grepping
"<condition data="Bewölkt"/><temp_f data="45"/>"
i don't get any output.
The same grep sequence works fine for words like "Leichter Regen" or "Klar", so i suspect grep fails because of the german umlaut ("ö" in this case).

Now, this is my current locale:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

If I export LC_ALL=de_DE I no longer get empty output, but I get "Bew" instead of "Bewölkt", or "Meistens" instead of "Meistens bewölkt".

Any ideas how to teach grep to use iso-8859-15 or something like that?

Here's the relevant part of the function:

declare -a wetter
wetter=($(curl --silent "http://www.google.de/ig/api?weather=$1+$2+$3" | grep -o 'condition data="[A-Z].*"/><temp_f' | grep -o '[A-Z][a-z]*'))
echo Wetter: $wetter

Last edited by demian (2010-04-27 16:47:26)

skanky · 2010-04-27 08:44:56

If you use [[:alpha:]] or [[:alnum:]] it *should* be character set and encoding independent.
Trying to test it, with UK locale, I get a match on the umlauted o using [a-z] though.

Actually, my LC_ALL is not set so I may be using the C locale?
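
For what it's worth, here is a small offline sketch of how a byte-oriented locale cuts a word at the umlaut. It assumes grep ends up in the C locale (which is one plausible explanation for the "Bew" output, e.g. if the requested locale is not generated); the variable names are made up for the example.

```shell
# UTF-8 bytes for "Bewölkt": the ö is the two bytes 0xC3 0xB6 (octal 303 266).
sample=$(printf 'Bew\303\266lkt')

# In the C locale only ASCII letters belong to [[:alpha:]], so the match
# stops at the first byte of ö and a second match starts after it.
c_matches=$(printf '%s\n' "$sample" | LC_ALL=C grep -o '[[:alpha:]]*')
printf '%s\n' "$c_matches"
```

This prints "Bew" and "lkt" on separate lines, so [[:alpha:]] only helps when the locale grep runs under matches the encoding of the data.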

kazuo · 2010-04-27 08:55:10

I think your problem is the encoding.

You use a UTF-8 locale, so grep reads characters as UTF-8, but the data you need to grep is iso8859-15. Try something like:

curl blabla | LC_ALL=en_US.iso885915 grep '[a-zA-Z]'

(correct the locale name, I don't know how they are named for -15)

Remember that ö in latin1 is encoded completely differently from ö in UTF-8....

EDIT: To be more precise, you don't need to set LC_ALL, only LC_CTYPE.

With the curl command and the link you gave, plus the grep with LC_ALL set, I get a match for the ü that appears in the output (the word is vorübergehen), and my locale is set to en_GB.UTF-8.
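
To make the encoding difference concrete, a quick sketch that dumps the bytes of ö in both encodings. The variable names are only for illustration.

```shell
# ö in ISO-8859-1 / ISO-8859-15 is the single byte 0xF6 (octal 366).
latin1_hex=$(printf '\366' | od -An -tx1 | tr -d ' \n')

# ö in UTF-8 is the two-byte sequence 0xC3 0xB6 (octal 303 266).
utf8_hex=$(printf '\303\266' | od -An -tx1 | tr -d ' \n')

echo "latin1: $latin1_hex  utf8: $utf8_hex"
```

A lone 0xF6 byte is not a valid UTF-8 sequence, which would explain why a grep running in a UTF-8 locale stumbles at exactly that position.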

Last edited by kazuo (2010-04-27 09:03:16)

skanky · 2010-04-27 09:24:30

Incidentally, you could replace the double grep call with this sed call (which might need a tweak, but it worked with my single test).

sed -nr 's/^.*condition data="([^"]+)".*$/\1/p'

It's GNU sed so it might not be portable as is.
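
A minimal way to try the expression without hitting the network, assuming a sample line shaped like the one quoted in the first post. The sample data and variable names are made up; -E is used here as the more portable spelling of -r.

```shell
# Hypothetical sample line in the shape quoted at the top of the thread.
line='<condition data="Klar"/><temp_f data="45"/>'

# -n suppresses the default output; the p flag prints only lines where
# the substitution matched. \1 is the text captured between the quotes.
condition=$(printf '%s\n' "$line" | sed -nE 's/^.*condition data="([^"]+)".*$/\1/p')
echo "$condition"
```

Note that the leading ^.* is greedy, so on a line containing several condition elements this picks the last one.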

demian · 2010-04-27 09:29:23

Yes, I'm only just getting into regular expressions and sed, so my functions will be very dirty for the time being. I'll try the sed command though, thank you.

I tried iconv as well as your suggestions but the problem remains.
When I save the .xml file on my hard drive I can grep "bewölkt", but not when I use curl to view the data. When I use curl, grep simply ignores everything after the umlaut:

curl --silent "http://www.google.de/ig/api?weather=Berlin" | grep -o 'condition data="[[:alpha:]]*.*'

condition data="Meistens bew
condition data="Wind: W mit Windgeschwindigkeiten von 23 km/h"/>[...]

Maybe the problem is that I don't seem to have any iso8859* locales.

 > fd /usr en_US
/usr/lib/tupac/localization/en_US.php
/usr/share/X11/locale/en_US.UTF-8
/usr/share/X11/locale/en_US.UTF-8/Compose
/usr/share/X11/locale/en_US.UTF-8/XI18N_OBJS
/usr/share/X11/locale/en_US.UTF-8/XLC_LOCALE
/usr/share/i18n/locales/en_US
/usr/share/locale/en_US
/usr/share/locale/en_US/LC_MESSAGES
/usr/share/locale/en_US/LC_MESSAGES/wget.mo
/usr/share/vim/vim72/spell/en/en_US.diff
 > fd /usr de_DE
/usr/lib/tupac/localization/de_DE.php
/usr/share/i18n/locales/de_DE
/usr/share/i18n/locales/de_DE@euro
/usr/share/vim/vim72/lang/menu_de_de.latin1.vim
/usr/share/vim/vim72/lang/menu_de_de.utf-8.vim
/usr/share/vim/vim72/spell/de/de_DE.diff

Sorry, I've never localized the system before. I always found it convenient just to stick with en_US.utf8 until now.
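
Searching /usr only shows the locale source files, not which locales are actually generated. A rough sketch of checking that with locale -a follows; have_locale is a hypothetical helper name, and the locale name tested is an assumption about how glibc spells the latin9 German locale.

```shell
# List the locales that are actually generated on this system.
locale -a

# Hypothetical helper: succeed if a locale with exactly this name
# (compared case-insensitively) shows up in the list.
have_locale() {
    locale -a 2>/dev/null | grep -qixF "$1"
}

if have_locale de_DE.iso885915@euro; then
    echo "latin9 German locale is available"
else
    echo "latin9 German locale is not generated"
fi
```

On Arch, missing locales are generated by uncommenting them in /etc/locale.gen and running locale-gen.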

I've already changed the function to use google.com so i don't have the umlaut-problem anymore.

Regards,
demian

Last edited by demian (2010-04-27 10:04:31)

skanky · 2010-04-27 10:09:27

Yeah, I think my localization may be a bit off too.
Using the Berlin option for the weather seems to break my sed script. I'm a bit busy at the moment, but I'll try to work out what's wrong if I get some time later today.

demian · 2010-04-27 10:20:16

Thanks for taking interest.
I've localized my system now and use de_DE@euro as default. The script now shows the first part of the word, so I can guess the rest ^_^.

Edit:
When I use an ISO format as default, many characters do not get displayed correctly (system-wide), so for now I'll stick with en_US.utf8 and just do something else to get the function out of my head.

Last edited by demian (2010-04-27 10:40:28)

raf_kig · 2010-04-27 13:36:06

The problem is that the XML output is encoded as latin1. Converting the charset (e.g. with iconv) gets rid of the problem.

Example:

wget -O- 'http://www.google.de/ig/api?weather=Berlin' | iconv -f latin1 -t utf8 | grep -o 'condition data="[[:alpha:]]*.*'
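
The same idea can be checked offline. This is only a sketch with a hand-built latin1 line standing in for the feed; the sample data and variable names are invented for the example.

```shell
# Simulate one element of the feed: "Bewölkt" encoded as latin1,
# where ö is the single byte 0xF6 (octal 366).
latin1_line=$(printf '<condition data="Bew\366lkt"/>')

# Convert latin1 to UTF-8 first, then extract the attribute.
# After the conversion the ö is the valid UTF-8 sequence 0xC3 0xB6.
converted=$(printf '%s\n' "$latin1_line" \
    | iconv -f ISO-8859-1 -t UTF-8 \
    | grep -o 'condition data="[^"]*"')
printf '%s\n' "$converted"
```

The direction matters: the data arrives as latin1 and the terminal and grep expect UTF-8, so the conversion goes from latin1 to UTF-8, not the other way round.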

Regards,

raf

skanky · 2010-04-27 13:39:26

Also fixes it for my sed script, thanks.
Really must look into this stuff more.

demian · 2010-04-27 16:47:02

That works for me too. I had actually tried the iconv command before, but was under the illusion that I had to convert utf8 to iso8859. I used the wrong codes anyway.

So.. thanks. Issue solved.

Arch Linux
