Hi guys,
when grepping
"<condition data="Bewölkt"/><temp_f data="45"/>"
I don't get any output.
The same grep sequence works fine for words like "Leichter Regen" or "Klar", so I suspect grep fails because of the German umlaut ("ö" in this case).
Now, this is my current locale:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
If I export LC_ALL=de_DE, the output is no longer empty, but I get "Bew" instead of "Bewölkt", or "Meistens" instead of "Meistens bewölkt".
Any ideas how to teach grep to use ISO-8859-15 or something like that?
Here's the relevant part of the function:
declare -a wetter
wetter=(`curl --silent "http://www.google.de/ig/api?weather=$1+$2+$3" | grep -o 'condition data="[A-Z].*"/><temp_f' | grep -o '[A-Z][a-z]*'`)
echo Wetter: $wetter
Last edited by demian (2010-04-27 16:47:26)
no place like /home
github
Offline
If you use [[:alpha:]] or [[:alnum:]] it *should* be character set and encoding independent.
Trying to test it, with UK locale, I get a match on the umlauted o using [a-z] though.
Actually, my LC_ALL is not set so I may be using the C locale?
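For what it's worth, here is a rough sketch of how much the locale matters, using a hard-coded sample instead of the live feed. The sample string and the variable names are made up for illustration, and it assumes GNU grep (for -o and -a). It does not reproduce the UK-locale result above; it only shows that in the C locale [[:alpha:]] covers ASCII letters alone, so the match stops in front of the two bytes that encode ö in UTF-8:

```shell
# "Bewölkt" written as UTF-8 bytes (ö = 0xC3 0xB6 = octal 303 266),
# so the example does not depend on the encoding of this script.
sample=$(printf 'Bew\303\266lkt')

# Force the C locale: [[:alpha:]] then matches ASCII letters only.
# -a keeps grep from treating the non-ASCII bytes as binary data.
# grep -o prints each match on its own line; head keeps the first one.
first_word=$(printf '%s\n' "$sample" | LC_ALL=C grep -a -o '[[:alpha:]]*' | head -n 1)

echo "$first_word"
```

That prints Bew, the same kind of truncation as in the first post, although there the cause may well be different.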
"...one cannot be angry when one looks at a penguin." - John Ruskin
"Life in general is a bit shit, and so too is the internet. And that's all there is." - scepticisle
Offline
I think your problem is the encoding.
You use a UTF-8 locale, so grep interprets the input as UTF-8, but the data you are grepping is ISO-8859-15. Try something like
curl blablbal|LC_ALL=en_US.iso889515 grep [a-zA-Z]
(correct the locale name; I don't know how the -15 locales are named)
Remember that ö in latin1 is encoded completely differently from ö in UTF-8.
EDIT: To be more precise, you don't need to set LC_ALL, only LC_CTYPE.
With the curl command and the link you gave, plus grep with LC_ALL set, I get a match for the ü that appears in the output (the word is vorübergehen), and my locale is set to en_GB.UTF-8.
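To make the encoding point concrete, here is a small sketch that prints the bytes behind ö in both encodings. The helper name hex and the variable names are invented for this example; od, tr and iconv are the usual tools. ISO-8859-15 uses the same byte for ö as latin1 (ISO-8859-1), so the sketch sticks with ISO-8859-1:

```shell
# Print the bytes of stdin as hex, without spaces or newlines.
hex() { od -An -tx1 | tr -d ' \n'; }

# ö as a single Latin-1 (ISO-8859-1) byte: 0xF6 (octal 366).
latin1_hex=$(printf '\366' | hex)

# The same character as UTF-8: two bytes, 0xC3 0xB6 (octal 303 266).
utf8_hex=$(printf '\303\266' | hex)

# iconv turns the Latin-1 byte into the UTF-8 sequence.
converted_hex=$(printf '\366' | iconv -f ISO-8859-1 -t UTF-8 | hex)

echo "latin1: $latin1_hex  utf8: $utf8_hex  converted: $converted_hex"
```

A lone 0xF6 byte is not valid UTF-8, which is why tools running in a UTF-8 locale can stumble over latin1 input.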
Last edited by kazuo (2010-04-27 09:03:16)
Offline
Incidentally, you could replace the double grep call with this sed call (which might need a tweak, but it worked with my single test).
sed -nr 's/^.*condition data="([^"]+)".*$/\1/p'
It's GNU sed, so it might not be portable as is.
Offline
Yes, I've only just started getting into regular expressions and sed, so my functions will be very dirty for the time being. I'll try the sed command though, thank you.
I tried iconv as well as your suggestions, but the problem remains.
When I save the .xml file on my hard drive I can grep "bewölkt", but not when I use curl to view the data. When I use curl, grep simply ignores everything after the umlaut:
curl --silent "http://www.google.de/ig/api?weather=Berlin" | grep -o 'condition data="[[:alpha:]]*.*'
condition data="Meistens bew
condition data="Wind: W mit Windgeschwindigkeiten von 23 km/h"/>[...]
Maybe the problem is that I don't seem to have any iso8859* locales.
> fd /usr en_US
/usr/lib/tupac/localization/en_US.php
/usr/share/X11/locale/en_US.UTF-8
/usr/share/X11/locale/en_US.UTF-8/Compose
/usr/share/X11/locale/en_US.UTF-8/XI18N_OBJS
/usr/share/X11/locale/en_US.UTF-8/XLC_LOCALE
/usr/share/i18n/locales/en_US
/usr/share/locale/en_US
/usr/share/locale/en_US/LC_MESSAGES
/usr/share/locale/en_US/LC_MESSAGES/wget.mo
/usr/share/vim/vim72/spell/en/en_US.diff
> fd /usr de_DE
/usr/lib/tupac/localization/de_DE.php
/usr/share/i18n/locales/de_DE
/usr/share/i18n/locales/de_DE@euro
/usr/share/vim/vim72/lang/menu_de_de.latin1.vim
/usr/share/vim/vim72/lang/menu_de_de.utf-8.vim
/usr/share/vim/vim72/spell/de/de_DE.diff
Sorry, I've never localized the system before. I always found it convenient just to stick with en_US.utf8 until now.
I've already changed the function to use google.com, so I don't have the umlaut problem anymore.
Regards,
demian
Last edited by demian (2010-04-27 10:04:31)
Offline
Yeah, I think my localization may be a bit off too.
Using the Berlin option for the weather seems to break my sed script. I'm a bit busy at the moment, but I'll try to work out what's wrong if I get some time later today.
Offline
Thanks for taking interest.
I've localized my system now and use de_DE@euro as the default. The script now shows the first part of the word, so I can guess the rest ^_^.
Edit:
When I use an ISO format as the default, many characters are not displayed correctly (system-wide), so for now I'll stick with en_US.utf8 and just do something else to get the function out of my head.
Last edited by demian (2010-04-27 10:40:28)
Offline
The problem is that the XML output's encoding is latin1; converting the charset (e.g. with iconv) gets rid of the problem.
Example:
wget -O- 'http://www.google.de/ig/api?weather=Berlin' | iconv -f latin1 -t utf8 | grep -o 'condition data="[[:alpha:]]*.*'
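If you want to try this without hitting the server, the following is a minimal offline sketch of the same idea. The function sample_feed and its one-line payload are made up to stand in for the feed (it is not real API output), and the extraction uses the sed expression posted earlier in the thread, with -E in place of -r. Treat it as an illustration under those assumptions rather than the definitive fix:

```shell
# Made-up stand-in for the feed: one line in the same shape, with ö sent as
# the single Latin-1 byte 0xF6 (octal 366). Not real API output.
sample_feed() {
  printf '<condition data="Meistens bew\366lkt"/><temp_f data="45"/>\n'
}

# Convert Latin-1 to UTF-8 first, then extract the quoted value.
# -n plus the trailing p prints only lines where the substitution matched;
# \1 is the text between the quotes.
condition=$(sample_feed \
  | iconv -f ISO-8859-1 -t UTF-8 \
  | sed -nE 's/^.*condition data="([^"]+)".*$/\1/p')

echo "$condition"
```

The point is the order of the pipeline: the conversion has to run before any locale-aware matching, so the tools only ever see valid UTF-8.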
Regards,
raf
Offline
That also fixes it for my sed script, thanks.
I really must look into this stuff more.
Offline
That works for me too. I actually used the iconv command before, but was under the illusion that I had to convert UTF-8 to ISO-8859. I used the wrong codes anyway.
So... thanks. Issue solved.
Offline