[Solved] RegExp and grep

jiehong · 2010-09-22 20:54:22

Hello,

I have a file called roman in the working directory which contains a novel. I want to count how many times «le» and «la», each surrounded by spaces (so " le " and " la "), occur in the file.

So I tried:

grep -c ' le \| la ' roman

I got 1742.

I wanted to verify this by searching for « le » and « la » separately, and I got 1164 and 1389, which do not add up to 1742!

So I tried:

grep -c [l][ea] roman

This gives me 2443, which I took to be the sum of the two previous counts, so it seemed right.

The problem is: why does my first form not give the right answer? I thought my last command would also find «le» in "lens" and «la» in "land" and so on, which does not seem to be the case…

Last question: is there a difference between " and ' when quoting a regexp? They work the same here, so I'd like to know.

Thank you

Last edited by jiehong (2010-09-22 21:48:40)

livibetter · 2010-09-22 21:09:44

From man:

       -c, --count
              Suppress normal output; instead print a count of matching lines for each input file.  With the -v, --invert-match option (see below), count non-matching lines.  (-c is specified by POSIX.)

So -c counts matching lines, not individual matches.

I think

grep -o ' le \| la ' roman | wc -l

should give you 2443.

       -o, --only-matching
              Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
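
A small sketch of the difference, using a made-up three-line sample instead of the roman file (the temporary file and its contents are assumptions for illustration; `\|` in the default basic syntax is a GNU grep extension):

```shell
# Hypothetical sample; the real file in this thread is called roman.
sample=$(mktemp)
printf '%s\n' 'il voit le chat et la souris' 'elle lit' 'tu vois la lune' > "$sample"

# -c counts matching LINES: lines 1 and 3 contain a match.
lines=$(grep -c ' le \| la ' "$sample")

# -o prints every match on its own line, so wc -l counts OCCURRENCES.
# tr strips the padding that some wc implementations add.
matches=$(grep -o ' le \| la ' "$sample" | wc -l | tr -d ' ')

echo "lines=$lines matches=$matches"   # lines=2 matches=3
rm -f "$sample"
```

The first line holds two matches but is counted once by -c, which is why the line count comes out lower than the number of occurrences.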

I would rather use

egrep -o ' (le|la) '

which reads more clearly.

Last edited by livibetter (2010-09-22 21:13:00)

Nichollan · 2010-09-22 21:12:17

I'm not sure about your setup, but on my setup grep uses basic regular expressions by default, where an unescaped | is an ordinary character. You could use grep -E or egrep. I find that it works when I enclose the expression in apostrophes or quotation marks and don't escape the vertical bar, like so:

$ egrep -c 'word1|word2' < textfile
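
To compare the two syntaxes side by side, here is a minimal sketch on a made-up three-line file (the file contents are assumptions; the `\|` form relies on GNU grep):

```shell
# Hypothetical sample file, for illustration only.
sample=$(mktemp)
printf '%s\n' 'word1 here' 'word2 there' 'neither' > "$sample"

# Basic syntax (grep's default): alternation is written \| (GNU extension).
bre=$(grep -c 'word1\|word2' "$sample")

# Extended syntax (grep -E): a plain | is alternation.
ere=$(grep -Ec 'word1|word2' "$sample")

# Basic syntax with an unescaped |: the bar is a literal character,
# so grep looks for the text "word1|word2" and finds nothing.
# grep exits with status 1 on no match; "|| true" keeps strict scripts going.
literal=$(grep -c 'word1|word2' "$sample" || true)

echo "bre=$bre ere=$ere literal=$literal"   # bre=2 ere=2 literal=0
rm -f "$sample"
```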

jiehong · 2010-09-22 21:47:57

I see, so I missed many occurrences when several were on the same line…

grep -o ' le \| la ' roman | wc -l

works well and I get 5382… I understand why now: I was counting the number of lines containing at least one le or la, and so missing a lot of occurrences.

So thank you. Any differences between " and ' ?

livibetter · 2010-09-22 21:53:13

jiehong wrote:

So thank you. Any differences between " and ' ?

Read http://tldp.org/LDP/abs/html/quoting.html and http://www.gnu.org/software/bash/manual … ml#Quoting

Long story short:

MYVAR=abc
echo '$MYVAR'
echo "$MYVAR"

Outputs

$MYVAR
abc
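
The same rule applies to a pattern handed to grep. A minimal sketch, assuming a made-up two-line sample and a hypothetical variable named word:

```shell
# Hypothetical sample and variable, for illustration only.
sample=$(mktemp)
printf '%s\n' 'il voit le chat' 'tu vois la lune' > "$sample"
word=le

# Double quotes: the shell expands $word, so grep receives the pattern " le ".
double=$(grep -c " $word " "$sample")

# Single quotes: grep receives the literal characters $word, which do not
# occur in the sample. "|| true" absorbs grep's exit status 1 on no match.
single=$(grep -c ' $word ' "$sample" || true)

echo "double=$double single=$single"   # double=1 single=0
rm -f "$sample"
```

For a pattern with no $, backtick or backslash in it, such as ' le \| la ' would be without the backslash, both quote styles pass the same text to grep. Leaving a pattern unquoted, as in grep -c [l][ea] roman, lets the shell try filename expansion on it first, so a file named le or la in the working directory would change what grep receives.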

jiehong · 2010-09-22 22:06:20

There's an issue : ' le|la ' will look for ' le ' and ' la ' but also for ' lela ' which is not good.

livibetter · 2010-09-22 22:06:52

One more thing I forgot to mention: that regular expression has a flaw if you want to match "le blah..." or "blah le!", that is, occurrences at the beginning or end of a line, with or without punctuation. I don't know about French (it's French, right?).

The regular expression should be

egrep -io '\b(le|la)\b'

I also added `-i` to ignore case.

Again, I don't know French, so such cases might not occur in your roman text.
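
A rough comparison of the two patterns on a made-up sample (the three lines below are assumptions, and `\b` is a GNU grep extension rather than POSIX):

```shell
# Hypothetical sample with matches at line start, before punctuation,
# and directly next to each other.
sample=$(mktemp)
printf '%s\n' 'Le chat voit la souris.' 'Elle parle, il appelle la.' 'le la le' > "$sample"

# Space-delimited pattern: misses "Le" (line start, capital), "la." (no
# trailing space) and the outer words of "le la le".
spaces=$(grep -Eo ' (le|la) ' "$sample" | wc -l | tr -d ' ')

# Word boundaries consume no characters, so every standalone le/la counts,
# while the "le" inside "Elle", "parle" and "appelle" does not.
bounds=$(grep -Eio '\b(le|la)\b' "$sample" | wc -l | tr -d ' ')

echo "spaces=$spaces bounds=$bounds"   # spaces=2 bounds=6
rm -f "$sample"
```

In "le la le" the space-delimited pattern finds only the middle word, because a space cannot belong to two matches at once. Note that `\b` also treats a hyphen as a boundary, so a form like "donne-le" would be counted too.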

livibetter · 2010-09-22 22:10:58

jiehong wrote:

There's an issue : ' le|la ' will look for ' le ' and ' la ' but also for ' lela ' which is not good.

You forgot parentheses.

Use ' (le|la) ', not ' le|la '. The latter is roughly equivalent to '( le|la )' and will try to match ' le' or 'la '. The former will try to match ' le ' or ' la '.
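
The effect of the parentheses can be checked on a single made-up line (the test sentence is an assumption for illustration):

```shell
# Hypothetical test line: "voila" ends in "la ", "lettre" starts with " le".
line='voila une lettre pour la poste'

# Without parentheses the alternatives are ' le' and 'la ':
# matches "la " (in voila), " le" (in lettre) and "la " (the real word).
ungrouped=$(printf '%s\n' "$line" | grep -Eo ' le|la ' | wc -l | tr -d ' ')

# With parentheses both spaces apply to each alternative,
# so only the standalone " la " matches.
grouped=$(printf '%s\n' "$line" | grep -Eo ' (le|la) ' | wc -l | tr -d ' ')

echo "ungrouped=$ungrouped grouped=$grouped"   # ungrouped=3 grouped=1
```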

jiehong · 2010-09-22 22:16:59

Thank you, it works.

Yes, it's French. Actually le and la should not be followed by a string in that case. I should also count the ones preceded by a dot, but it's OK for now.

Thank you everybody!!
