#1 2010-04-27 08:18:07

demian
Member
From: Frankfurt, Germany
Registered: 2009-05-06
Posts: 709

[solved] grep german umlaut using regex? use iconv

Hi guys,

When grepping
"<condition data="Bewölkt"/><temp_f data="45"/>"
I don't get any output.
The same grep sequence works fine for words like "Leichter Regen" or "Klar", so I suspect grep fails because of the German umlaut ("ö" in this case).



Now, this is my current locale:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

If I export LC_ALL=de_DE, I no longer get empty output but rather "Bew" instead of "Bewölkt", or "Meistens" instead of "Meistens bewölkt".

Any ideas how to teach grep to use ISO-8859-15 or something like that?

Here's the relevant part of the function:

declare -a wetter
wetter=(`curl --silent "http://www.google.de/ig/api?weather=$1+$2+$3" | grep -o 'condition data="[A-Z].*"/><temp_f' | grep -o '[A-Z][a-z]*'`)
echo Wetter: $wetter

Last edited by demian (2010-04-27 16:47:26)


no place like /home
github

#2 2010-04-27 08:44:56

skanky
Member
From: WAIS
Registered: 2009-10-23
Posts: 1,847

If you use [[:alpha:]] or [[:alnum:]], it *should* match letters according to the current locale's character set and encoding rather than a fixed ASCII range.
Trying to test it with a UK locale, though, I get a match on the umlauted o using [a-z].

Actually, my LC_ALL is not set so I may be using the C locale?
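
For what it's worth, the bracket classes follow LC_CTYPE, so the result changes with the locale. A minimal sketch, assuming GNU grep and a test string built with printf (the variable name is only for illustration): in the C locale the two UTF-8 bytes of "ö" do not count as letters, so the match splits around them, similar to the truncated "Bew" output.

```shell
# "ö" is written as its two UTF-8 bytes (octal 303 266) so the example does
# not depend on how this snippet itself is saved.
# -a guards against some GNU grep versions treating the high bytes as binary.
c_locale_words=$(printf 'Bew\303\266lkt\n' | LC_ALL=C grep -ao '[[:alpha:]][[:alpha:]]*')
printf '%s\n' "$c_locale_words"
# prints "Bew" and "lkt" on separate lines
```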


"...one cannot be angry when one looks at a penguin."  - John Ruskin
"Life in general is a bit shit, and so too is the internet. And that's all there is." - scepticisle

#3 2010-04-27 08:55:10

kazuo
Member
From: São Paulo/Brazil
Registered: 2008-03-18
Posts: 413
Website

I think your problem is the encoding.

You use a UTF-8 locale, so grep interprets its input as UTF-8, but the data you need to grep is ISO-8859-15. Try something like

curl blablbal | LC_ALL=en_US.iso885915 grep '[a-zA-Z]'

(correct the locale name if needed; I don't know exactly how they are named for -15)

Remember that ö in Latin-1 is a completely different byte sequence from ö in UTF-8: the single byte 0xF6 versus the two bytes 0xC3 0xB6.

EDIT: To be more precise, you don't need to set LC_ALL, only LC_CTYPE.

With curl and the link you gave, plus the grep with LC_ALL, I get a match for the ü that appears in the output (the word is vorübergehen), and my locale is set to en_GB.UTF-8.
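
To make the byte difference visible, here is a small sketch, assuming od and iconv are available (the variable names are made up for illustration). It dumps "ö" once as UTF-8 and once after converting to ISO-8859-1:

```shell
# Start from the UTF-8 encoding of "ö" (octal 303 266) and dump it as hex.
utf8_bytes=$(printf '\303\266' | od -An -tx1 | tr -d ' \n')

# Convert the same character to ISO-8859-1 (Latin-1) and dump it again.
latin1_bytes=$(printf '\303\266' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

echo "UTF-8: $utf8_bytes  Latin-1: $latin1_bytes"
# prints: UTF-8: c3b6  Latin-1: f6
```

A grep running in a UTF-8 locale that meets a lone 0xF6 byte sees an invalid sequence, not a letter.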

Last edited by kazuo (2010-04-27 09:03:16)

#4 2010-04-27 09:24:30

skanky
Member
From: WAIS
Registered: 2009-10-23
Posts: 1,847

Incidentally, you could replace the double grep call with this sed call (which might need a tweak, but it worked with my single test).

sed -nr 's/^.*condition data="([^"]+)".*$/\1/p'

It's GNU sed so it might not be portable as is.
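
One thing to watch, sketched here on a made-up single-line sample (the sample string and the variable names are assumptions, not real feed output): if several condition elements share one line, the greedy ^.* makes the sed expression return the last one. -E is the more portable spelling of GNU sed's -r.

```shell
# Hypothetical input: two condition elements on a single line.
sample='<condition data="Klar"/><temp_f data="45"/><condition data="Regen"/>'

# Greedy ^.* swallows everything up to the LAST 'condition data="'.
last_match=$(printf '%s\n' "$sample" | sed -n -E 's/^.*condition data="([^"]+)".*$/\1/p')

# grep -o lists every match on its own line, so the first one can be picked.
first_match=$(printf '%s\n' "$sample" | grep -o 'condition data="[^"]*"' | head -n 1 | cut -d'"' -f2)

echo "sed: $last_match  grep: $first_match"
# prints: sed: Regen  grep: Klar
```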

#5 2010-04-27 09:29:23

demian
Member
From: Frankfurt, Germany
Registered: 2009-05-06
Posts: 709

Yes, I was just now getting into regular expressions and sed, so my functions will be very dirty for the time being. I'll try the sed command though, thank you.

I tried iconv as well as your suggestions, but the problem remains.
When I save the .xml file on my hard drive I can grep "bewölkt", but not when I use curl to view the data. When I use curl, grep simply ignores everything after the umlaut:

curl --silent "http://www.google.de/ig/api?weather=Berlin" | grep -o 'condition data="[[:alpha:]]*.*'
condition data="Meistens bew
condition data="Wind: W mit Windgeschwindigkeiten von 23 km/h"/>[...]

Maybe the problem is that I don't seem to have iso8859* locales.

 > fd /usr en_US
/usr/lib/tupac/localization/en_US.php
/usr/share/X11/locale/en_US.UTF-8
/usr/share/X11/locale/en_US.UTF-8/Compose
/usr/share/X11/locale/en_US.UTF-8/XI18N_OBJS
/usr/share/X11/locale/en_US.UTF-8/XLC_LOCALE
/usr/share/i18n/locales/en_US
/usr/share/locale/en_US
/usr/share/locale/en_US/LC_MESSAGES
/usr/share/locale/en_US/LC_MESSAGES/wget.mo
/usr/share/vim/vim72/spell/en/en_US.diff
 > fd /usr de_DE
/usr/lib/tupac/localization/de_DE.php
/usr/share/i18n/locales/de_DE
/usr/share/i18n/locales/de_DE@euro
/usr/share/vim/vim72/lang/menu_de_de.latin1.vim
/usr/share/vim/vim72/lang/menu_de_de.utf-8.vim
/usr/share/vim/vim72/spell/de/de_DE.diff

Sorry, I've never localized the system before. I always found it convenient just to stick with en_US.utf8 until now.
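
Searching the file system only shows the locale source files. Which locales are actually generated can be checked with locale -a. A rough sketch, assuming a system that ships the locale command (list_locales is a hypothetical helper name):

```shell
# Hypothetical helper: list installed locales whose name contains the
# given string, ignoring case. Returns non-zero if nothing matches.
list_locales() {
    locale -a 2>/dev/null | grep -iF -- "$1"
}

# Look for any ISO-8859 locale; fall back to a message if none is installed.
list_locales 8859 || echo "no ISO-8859 locales installed"
```

On glibc-based distributions, missing locales are typically enabled by uncommenting them in /etc/locale.gen and running locale-gen. A single command can then be run with, for example, LC_CTYPE set just for that command, without changing the system default.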

I've already changed the function to use google.com, so I don't have the umlaut problem anymore.

Regards,
demian

Last edited by demian (2010-04-27 10:04:31)

#6 2010-04-27 10:09:27

skanky
Member
From: WAIS
Registered: 2009-10-23
Posts: 1,847

Yeah, I think my localization may be a bit off too.
Using the Berlin option for the weather seems to break my sed script. I'm a bit busy at the moment, but I'll try to work out what's wrong if I get some time later today.

#7 2010-04-27 10:20:16

demian
Member
From: Frankfurt, Germany
Registered: 2009-05-06
Posts: 709

Thanks for taking interest.
I've localized my system now and use de_DE@euro as default. The script now shows the first part of the word so I can guess the rest ^_^.

Edit:
When I use an ISO format as default, many characters do not get displayed correctly (system-wide), so for now I'll stick with en_US.utf8 and just do something else to get the function out of my head.

Last edited by demian (2010-04-27 10:40:28)

#8 2010-04-27 13:36:06

raf_kig
Member
Registered: 2008-11-28
Posts: 143

The problem is that the XML output's encoding is Latin-1; converting the charset (e.g. with iconv) gets rid of the problem.

Example:

wget -O- 'http://www.google.de/ig/api?weather=Berlin' | iconv -f latin1 -t utf8 | grep -o 'condition data="[[:alpha:]]*.*'
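
The same idea can be tried without network access. This is only a sketch: extract_conditions is a hypothetical helper, and the input line is a hand-built stand-in for the feed, assumed to be Latin-1 encoded as described above.

```shell
# Hypothetical helper: read Latin-1 XML on stdin, convert it to UTF-8,
# then print the value of every condition element.
extract_conditions() {
    iconv -f ISO-8859-1 -t UTF-8 |
        grep -o 'condition data="[^"]*"' |
        cut -d'"' -f2
}

# Stand-in for the feed: octal 366 (0xF6) is "ö" in Latin-1.
result=$(printf '<condition data="Meistens bew\366lkt"/><temp_f data="45"/>\n' | extract_conditions)
printf '%s\n' "$result"
# prints: Meistens bewölkt
```

Matching on [^"]* instead of letter classes keeps the pattern independent of which characters count as alphabetic in the current locale.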

Regards,

raf

#9 2010-04-27 13:39:26

skanky
Member
From: WAIS
Registered: 2009-10-23
Posts: 1,847

Also fixes it for my sed script, thanks.
Really must look into this stuff more.

#10 2010-04-27 16:47:02

demian
Member
From: Frankfurt, Germany
Registered: 2009-05-06
Posts: 709

That works for me too. I actually used the iconv command before but was under the illusion that I had to convert utf8 to iso8859. I used the wrong codes anyway.

So, thanks. Issue solved.
