problem with grep

sullivanva · 2007-11-29 00:12:12

Broken:

boogie:~> pacman -Q | grep grep
grep 2.5.3-2
boogie:~> echo j | grep "[A-Z]"
j
boogie:~>

Not broken:

froggie:~> pacman -Q | grep grep
grep 2.5.1a-2
froggie:~> echo j | grep "[A-Z]"
froggie:~>

Allan · 2007-11-29 00:22:39

I cannot replicate this here.

Cerebral · 2007-11-29 00:24:50

$ pacman -Q grep
grep 2.5.3-2
$ echo j | grep "[A-Z]"
$

Also can't reproduce.

-edit-
Totally off-topic, but sed works fine too.

$ echo j | sed -n "/[A-Z]/ p"
$ echo A | sed -n "/[A-Z]/ p"
A

Last edited by Cerebral (2007-11-29 00:27:25)

nj · 2007-11-29 00:47:38

Any chance that you have grep aliased to 'grep -i' to ignore case?

delphiki · 2007-11-29 01:52:10

@sullivanva

$ pacman -Q grep
grep 2.5.3-2
$ echo j | grep "[A-Z]"
$

I cannot reproduce your error either.

Cheers,

drakosha · 2007-11-29 06:29:36

You might try "egrep", which is part of the grep package and understands extended regular expressions. "which grep" should show which grep you are using; it might be some other executable, not /usr/bin/grep.
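
Both suggestions so far (an alias, or a different executable) can be checked with something like the sketch below. It assumes a POSIX-style shell, and the variable names are only for illustration. "type" reports whether grep is an alias, function, or file, "command -v" prints what the shell would run, and "command grep" skips any alias or function for that one call.

```shell
# Report how the shell resolves "grep": alias, function, builtin, or file.
grep_kind=$(type grep)
echo "$grep_kind"

# Print the alias definition or path that would actually be used.
grep_path=$(command -v grep)
echo "$grep_path"

# Run grep while skipping any alias or shell function, with the C locale
# forced so that [A-Z] covers only the uppercase ASCII letters.
# -c prints the number of matching lines; "|| true" absorbs grep's
# exit status 1 when nothing matches.
bypass_hits=$(printf 'j\n' | LC_ALL=C command grep -c '[A-Z]' || true)
echo "matching lines: $bypass_hits"
```

If the last line reports 0 while a plain grep still prints j, the difference comes from the alias or from the locale rather than from the pattern.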

kishd · 2007-11-29 08:31:19

[kishd@dozer ~]$ pacman -Q | grep grep
grep 2.5.3-2

[kishd@dozer ~]$ echo j | grep "[A-Z]"
j

same here

mucknert · 2007-11-29 08:33:27

Hilarious! I have to try that when I get home.

Nihathrael · 2007-11-29 09:33:03

[nihathrael@reaper ~]$ pacman -Q | grep grep
grep 2.5.3-2
ngrep 1.45-3
[nihathrael@reaper ~]$ echo j | grep "[A-Z]"
j

mucknert · 2007-11-29 10:19:55

Works as intended here. No problem at all.

[~] % pacman -Q | grep grep
grep 2.5.3-2
[~] % echo j | grep "[A-Z]" 
[~] %

Last edited by mucknert (2007-11-29 10:21:26)

Sekre · 2007-11-29 12:30:24

Tried as well, no problems here either

~  $  pacman -Q | grep grep
grep 2.5.3-2
~  $  echo j | grep "[A-Z]"
~  $

shining · 2007-11-29 12:43:25

sullivanva and kishd, could you paste the output of locale?
And try again with LANG=C grep.

dolby · 2007-11-29 12:48:37

I get a j too.

LANG=en_US.utf8
LC_CTYPE="en_US.utf8"
LC_NUMERIC="en_US.utf8"
LC_TIME="en_US.utf8"
LC_COLLATE="en_US.utf8"
LC_MONETARY="en_US.utf8"
LC_MESSAGES="en_US.utf8"
LC_PAPER="en_US.utf8"
LC_NAME="en_US.utf8"
LC_ADDRESS="en_US.utf8"
LC_TELEPHONE="en_US.utf8"
LC_MEASUREMENT="en_US.utf8"
LC_IDENTIFICATION="en_US.utf8"
LC_ALL=en_US.utf8

shining · 2007-11-29 12:54:28

It's because of LC_ALL, it wasn't set here, and it worked fine.
I set it and I got the bug.

> echo $LC_ALL
en_US.utf8
> echo j | grep "[A-Z]"
j
> unset LC_ALL
> echo j | grep "[A-Z]"
<nothing>

Last edited by shining (2007-11-29 12:55:45)

MrWeatherbee · 2007-11-29 14:00:21

shining wrote:

It's because of LC_ALL, it wasn't set here, and it worked fine.
I set it and I got the bug.

I don't think it's a bug based on this documentation:

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, `[a-d]' is equivalent to `[abcd]'. Many locales sort characters in dictionary order, and in these locales `[a-d]' is typically not equivalent to `[abcd]'; it might be equivalent to `[aBbCcDd]', for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value `C'.

LC_COLLATE, I believe, is the really critical value. So, for example, in Dolby's case, unsetting LC_ALL by itself won't do any good if he has LC_COLLATE set to 'en_US.utf8'. LC_COLLATE needs to be set to LC_COLLATE=C, for example, to get the behavior that everyone is expecting from the posted 'grep' statement. Of course, setting LC_ALL=C would do the trick as well, but it might not really be what you want.
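
As a minimal sketch of the point in the quoted documentation: forcing the C locale for a single command gives the traditional meaning of [A-Z], and the POSIX character class [[:upper:]] is an alternative that goes by character type instead of collation order. The variable names are only for illustration.

```shell
# Count matching lines with the C locale forced for this one command.
# In the C locale, [A-Z] covers only the 26 uppercase ASCII letters.
# "|| true" absorbs grep's exit status 1 when nothing matches.
lower_hits=$(printf 'j\n' | LC_ALL=C grep -c '[A-Z]' || true)
upper_hits=$(printf 'J\n' | LC_ALL=C grep -c '[A-Z]' || true)

# [[:upper:]] is defined by character classification (LC_CTYPE),
# not by where a letter sorts, so it is run in the current locale here.
class_hits=$(printf 'j\n' | grep -c '[[:upper:]]' || true)

echo "j against [A-Z] in C locale: $lower_hits"
echo "J against [A-Z] in C locale: $upper_hits"
echo "j against [[:upper:]]:       $class_hits"
```

The prefix form (LC_ALL=C grep ...) changes the locale for that one grep process only, so the rest of the session keeps its normal settings.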

Edit:

Forgot to post link to above quote:

http://www.gnu.org/software/grep/doc/grep_8.html#IDX178

Last edited by MrWeatherbee (2007-11-29 14:46:11)

kishd · 2007-11-29 14:22:34

[kishd@dozer ~]$ locale
LANG=en_ZA.UTF8
LC_CTYPE="en_ZA.UTF8"
LC_NUMERIC="en_ZA.UTF8"
LC_TIME="en_ZA.UTF8"
LC_COLLATE="en_ZA.UTF8"
LC_MONETARY="en_ZA.UTF8"
LC_MESSAGES="en_ZA.UTF8"
LC_PAPER="en_ZA.UTF8"
LC_NAME="en_ZA.UTF8"
LC_ADDRESS="en_ZA.UTF8"
LC_TELEPHONE="en_ZA.UTF8"
LC_MEASUREMENT="en_ZA.UTF8"
LC_IDENTIFICATION="en_ZA.UTF8"
LC_ALL=
[kishd@dozer ~]$

Unsetting LC_COLLATE or LC_ALL does not seem to help

Last edited by kishd (2007-11-29 14:25:25)

MrWeatherbee · 2007-11-29 14:29:15

kishd wrote:

Unsetting LC_COLLATE or LC_ALL does not seem to help

Don't just unset LC_COLLATE. As you have noticed, that doesn't work.

From my first post, you need to set LC_COLLATE=C (or set it to some other locale that provides the same collation scheme). Alternatively, you can affect all the variables by setting LC_ALL=C.
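
Why unsetting does not help can be sketched with the POSIX lookup order for a locale category: LC_ALL if it is non-empty, otherwise the category variable (here LC_COLLATE), otherwise LANG. The helper effective_collate below is hypothetical and not part of any tool, and its final fallback to C is an assumption, since POSIX leaves that default to the implementation.

```shell
# Hypothetical helper: which value ends up governing collation?
# Arguments: $1 = LC_ALL, $2 = LC_COLLATE, $3 = LANG ("" means unset or empty).
effective_collate() {
    if [ -n "$1" ]; then
        echo "$1"          # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        echo "$2"          # then the category-specific variable
    elif [ -n "$3" ]; then
        echo "$3"          # then LANG
    else
        echo "C"           # assumed default when nothing is set
    fi
}

# LC_ALL and LC_COLLATE unset, LANG still en_ZA.UTF8: collation follows LANG.
unset_case=$(effective_collate "" "" "en_ZA.UTF8")

# LC_COLLATE explicitly set to C: collation is C even though LANG is not.
set_case=$(effective_collate "" "C" "en_ZA.UTF8")

# LC_ALL set: LC_COLLATE=C is ignored.
override_case=$(effective_collate "en_US.utf8" "C" "en_US.utf8")

echo "unset:    $unset_case"
echo "set to C: $set_case"
echo "LC_ALL:   $override_case"

# For the current session:
effective_collate "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}"
```

With nothing but LANG left, the collation locale is still the LANG value, which matches what was reported above after unsetting.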

Last edited by MrWeatherbee (2007-11-29 14:47:22)

mico · 2007-11-29 14:46:50

MrWeatherbee wrote:

shining wrote:

It's because of LC_ALL, it wasn't set here, and it worked fine.
I set it and I got the bug.

I don't think it's a bug based on this documentation:

Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale's collating sequence and character set. For example, in the default C locale, `[a-d]' is equivalent to `[abcd]'. Many locales sort characters in dictionary order, and in these locales `[a-d]' is typically not equivalent to `[abcd]'; it might be equivalent to `[aBbCcDd]', for example. To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value `C'.

LC_COLLATE, I believe, is the really critical value. So, for example, in Dolby's case, unsetting LC_ALL by itself won't do any good because he has LC_COLLATE set to 'en_US.utf8'. LC_COLLATE needs to be set to LC_COLLATE=C, for example, to get the behavior that everyone is expecting from the posted 'grep' statement. Of course, setting LC_ALL=C would do the trick as well, but it might not really be what you want.

Edit:

Forgot to post link to above quote:

http://www.gnu.org/software/grep/doc/grep_8.html#IDX178

Despite this, I think there is a bug in grep:

$ pacman -Q grep
grep 2.5.3-2
$ echo J | grep --color=always [A-Z]
J
$ echo j | grep --color=always [A-Z]
j

The matched string is supposed to be colored (red in particular) when grep prints matching lines. The uppercase J is printed in red, as it should be, but the lowercase j is printed in the normal terminal color.
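
One way to probe this from another angle, offered only as a sketch: -o prints just the text that matched, so comparing plain grep with grep -o under the same locale would show whether line selection and match extraction agree. The example forces the C locale so its output is predictable; what 2.5.3 prints for j with -o under en_US.utf8 is not something this thread establishes.

```shell
# -o prints only the text that the pattern matched, one match per line.
# The C locale is forced so the result does not depend on collation rules.
# "|| true" absorbs grep's exit status 1 when nothing matches.
only_upper=$(printf 'aJb\n' | LC_ALL=C grep -o '[A-Z]' || true)
only_lower=$(printf 'j\n' | LC_ALL=C grep -o '[A-Z]' || true)

echo "matched in aJb: '$only_upper'"
echo "matched in j:   '$only_lower'"
```

Dropping the LC_ALL=C prefix and rerunning both lines in the affected locale gives the comparison of interest.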

Sekre · 2007-11-29 15:11:25

About LC_COLLATE:
setting LC_COLLATE=C does indeed make the problem go away; unsetting it makes [A-Z] match lowercase as well.

mico: I can confirm your test as well. It did not color the lowercase 'j' with the [A-Z] range.

I'm in deep water here, but I hope someone knows what's going on "behind the scenes". I could probably learn something from this, or at least I hope so.

$ echo J | grep --color=always [A-Z]
J (orange)
$ echo j | grep --color=always [A-Z]
j (white)
$ echo j | grep --color=always -i [A-Z]
j (orange)
$ echo j | grep --color=always [a-z]
j (orange)

kishd · 2007-11-29 15:19:22

[kishd@dozer ~]$ locale
LANG=en_US.UTF8
LC_CTYPE="en_US.UTF8"
LC_NUMERIC="en_US.UTF8"
LC_TIME="en_US.UTF8"
LC_COLLATE="en_US.UTF8"
LC_MONETARY="en_US.UTF8"
LC_MESSAGES="en_US.UTF8"
LC_PAPER="en_US.UTF8"
LC_NAME="en_US.UTF8"
LC_ADDRESS="en_US.UTF8"
LC_TELEPHONE="en_US.UTF8"
LC_MEASUREMENT="en_US.UTF8"
LC_IDENTIFICATION="en_US.UTF8"
LC_ALL=
[kishd@dozer ~]$ echo j | grep "[A-Z]"
j
[kishd@dozer ~]$

LC_ALL= unset but still get the j

Last edited by kishd (2007-11-29 15:19:57)

shining · 2007-11-29 15:21:51

MrWeatherbee wrote:

shining wrote:

It's because of LC_ALL, it wasn't set here, and it worked fine.
I set it and I got the bug.

I don't think it's a bug based on this documentation:

You are right, my mistake. I actually already learned this in the past and then forgot.
It's indeed only LC_COLLATE which matters, and it's not a bug.

Cerebral · 2007-11-29 15:35:38

kishd wrote:

LC_ALL= unset but still get the j

You're not supposed to UNSET it. You're supposed to set it to C.

LC_COLLATE=C
or
LC_ALL=C
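
To make the change for a whole session rather than one command, something along these lines should work. It is a sketch that runs inside a command substitution so it does not alter the calling shell; in a real session the same three lines would go in the shell itself or in a startup file.

```shell
session_hits=$(
    # LC_ALL, if set, would override LC_COLLATE, so clear it first.
    unset LC_ALL
    # Change only the collation category; messages, dates, and so on
    # keep following LANG.
    LC_COLLATE=C
    export LC_COLLATE
    # Only the uppercase line should be selected now.
    printf 'j\nJ\n' | grep '[A-Z]'
)
echo "$session_hits"
```

Setting only LC_COLLATE is the narrower change; LC_ALL=C also switches messages and character handling to the C locale.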

Last edited by Cerebral (2007-11-29 15:36:53)

Bison · 2007-11-29 16:39:52

I wasn't aware that [A-Z] was a valid grep pattern. I believe egrep (or simply grep -E) handles POSIX extended regular expressions.

F · 2007-11-29 16:47:54

Looks fine on my end.

[f|~]% pacman -Q | grep grep
grep 2.5.3-1
ngrep 1.45-3.1
[f|~]% echo j
j
[f|~]% echo j | grep "[A-Z]"
[f|~]% echo j | grep "[a-z]"
j
[f|~]%

byte · 2007-11-29 16:48:59

So far I only use egrep when I need to match alternative strings (OR), like egrep 'this|that' file; for everything else, normal grep suffices.
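
A small sketch of that alternation case, with made-up sample text. grep -E is the option form of egrep, while lowercase -e only introduces a pattern and can be repeated to get the same OR effect with basic syntax. Bracket expressions such as [A-Z] are valid in both basic and extended syntax.

```shell
# Made-up sample input, three lines.
sample='this line
other line
that line'

# Extended syntax: | separates alternatives.
with_E=$(printf '%s\n' "$sample" | grep -E 'this|that')

# Basic syntax: each -e supplies one more pattern; a line matching any is kept.
with_e=$(printf '%s\n' "$sample" | grep -e 'this' -e 'that')

echo "$with_E"
echo "---"
echo "$with_e"
```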

Arch Linux
