#1 2010-09-22 20:54:22

jiehong
Member
Registered: 2009-03-19
Posts: 63

[Solved] RegExp and grep

Hello,

I have a file called roman in the working directory; it contains a novel. I want to count how many times «le» and «la» occur surrounded by spaces (so " le " and " la ").

So I tried:

grep -c ' le \| la ' roman

I got 1742.

I wanted to verify this by searching for « le » and « la » separately, and I got 1164 and 1389, which do not add up to 1742!

So I tried:

grep -c [l][ea] roman

This gives me the sum of the two earlier counts: 1164+1389 = 2443, so it seems right.

My question is: why doesn't my first command give the right answer? I also thought the last command would find le in lens, la in land, and so on, but that does not seem to be the case…

Last question: is there a difference between " and ' when quoting a regexp? They behave the same here, so I'd like to know.

Thank you :)

Last edited by jiehong (2010-09-22 21:48:40)

#2 2010-09-22 21:09:44

livibetter
Member
From: Taipei
Registered: 2008-05-14
Posts: 95

Re: [Solved] RegExp and grep

From man:

       -c, --count
              Suppress normal output; instead print a count of matching lines for each input file.  With the -v, --invert-match option (see below), count non-matching lines.  (-c is specified by POSIX.)

So -c counts on a per-line basis: a line that contains several matches is still counted only once.

I think

grep -o ' le \| la ' roman | wc -l

should give you 2443.

       -o, --only-matching
              Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.

I would also rather use

egrep -o ' (le|la) ' roman

which reads more clearly.
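To make the difference between -c and -o concrete, here is a small sketch. It assumes GNU grep (the `\|` alternation in basic regexes is a GNU extension), and the three-line sample file is made up for illustration; it is not the novel from the question.

```shell
# Hypothetical three-line sample standing in for the novel.
sample=$(mktemp)
printf '%s\n' \
  'il voit le chat et la souris' \
  'rien ici' \
  'elle prend la clef' > "$sample"

# -c counts matching LINES: the first and third lines each match.
lines=$(grep -c ' le \| la ' "$sample")

# -o prints every match on its own line, so wc -l counts OCCURRENCES.
hits=$(grep -o ' le \| la ' "$sample" | wc -l)

echo "matching lines: $lines"
echo "occurrences:    $hits"

rm -f "$sample"
```

On this sample the first line holds two matches, so the occurrence count (3) is higher than the line count (2), which is the same effect seen with the novel.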

Last edited by livibetter (2010-09-22 21:13:00)

#3 2010-09-22 21:12:17

Nichollan
Member
From: Stavanger, Norway
Registered: 2010-05-18
Posts: 110

Re: [Solved] RegExp and grep

I'm not sure about your setup, but on my setup grep doesn't interpret extended regular expressions by default. You could use grep -E or egrep. It works for me when I enclose the expression in apostrophes or quotation marks and leave the vertical bar unescaped, like so:

$ egrep -c 'word1|word2' < textfile
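For what it's worth, here is a sketch of how the two syntaxes compare, assuming GNU grep: plain grep uses basic regular expressions, where GNU grep accepts `\|` for alternation and treats a bare `|` as an ordinary character, while `grep -E` (the same syntax egrep uses) treats a bare `|` as alternation. The sample lines are invented for the demonstration.

```shell
# Hypothetical sample; the last line contains a literal "word1|word2".
sample=$(mktemp)
printf '%s\n' \
  'word1 here' \
  'word2 there' \
  'neither' \
  'literal word1|word2' > "$sample"

# Basic regex (grep default): \| is alternation in GNU grep.
bre_alt=$(grep -c 'word1\|word2' "$sample")

# Basic regex: an unescaped | only matches a literal vertical bar.
bre_literal=$(grep -c 'word1|word2' "$sample")

# Extended regex (-E): an unescaped | is alternation.
ere_alt=$(grep -E -c 'word1|word2' "$sample")

echo "BRE with \\|: $bre_alt"
echo "BRE with |:  $bre_literal"
echo "ERE with |:  $ere_alt"

rm -f "$sample"
```

So the escaped form from the first post and the unescaped form with -E should select the same lines; they are two spellings of the same alternation.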

#4 2010-09-22 21:47:57

jiehong
Member
Registered: 2009-03-19
Posts: 63

Re: [Solved] RegExp and grep

I see, so I was missing many occurrences on lines with more than one match…

grep -o ' le \| la ' roman | wc -l

works well and I get 5382. I understand why now: I was counting the number of lines containing at least one le or la, and so missing a lot of occurrences.

So thank you. Is there any difference between " and '?

#5 2010-09-22 21:53:13

livibetter
Member
From: Taipei
Registered: 2008-05-14
Posts: 95

Re: [Solved] RegExp and grep

jiehong wrote:

So thank you. Any differences between " and ' ?

Read http://tldp.org/LDP/abs/html/quoting.html and http://www.gnu.org/software/bash/manual … ml#Quoting

Long story short:

MYVAR=abc
echo '$MYVAR'
echo "$MYVAR"

Outputs

$MYVAR
abc
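For grep patterns in particular, leaving the pattern unquoted (as in `grep -c [l][ea] roman` from the first post) carries one more risk: the shell may expand it as a filename pattern before grep ever sees it. The sketch below shows this in a scratch directory; the file named `la` and the contents of `roman` are invented for the demonstration.

```shell
# Scratch directory with a made-up roman file and a file named "la".
workdir=$(mktemp -d)
printf '%s\n' 'le chat' 'la souris' 'un chien' > "$workdir/roman"
: > "$workdir/la"   # empty file whose name matches the glob [l][ea]

# Quoted: grep receives the regex [l][ea]; two lines match.
quoted=$(cd "$workdir" && grep -c '[l][ea]' roman)

# Unquoted: the shell expands [l][ea] to the filename "la" first,
# so grep searches roman for the plain text "la"; one line matches.
unquoted=$(cd "$workdir" && grep -c [l][ea] roman)

# Double quotes block globbing too, but still expand variables.
word=la
double=$(cd "$workdir" && grep -c "$word" roman)

echo "quoted: $quoted, unquoted: $unquoted, double-quoted \$word: $double"

rm -rf "$workdir"
```

If no file in the directory matches the glob, bash passes the pattern through unchanged by default, which is why the unquoted form can appear to work. Single quotes are the safe default for a fixed regex.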

#6 2010-09-22 22:06:20

jiehong
Member
Registered: 2009-03-19
Posts: 63

Re: [Solved] RegExp and grep

There's an issue: ' le|la ' will look for ' le ' and ' la ' but also for ' lela ', which is not good.

#7 2010-09-22 22:06:52

livibetter
Member
From: Taipei
Registered: 2008-05-14
Posts: 95

Re: [Solved] RegExp and grep

One more thing I forgot to mention: that regular expression has a flaw if you also want to match "le blah..." or "blah le!", that is, occurrences at the beginning or end of a line, with or without punctuation. I don't know French (it is French, right?).

The regular expression should be

egrep -io '\b(le|la)\b' roman

The `-i` option is added to ignore case.

Again, I don't know French, so such occurrences might not appear in your roman text.
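Here is a rough sketch of what the space-delimited pattern misses compared with the word-boundary one. It assumes GNU grep, since `\b` is a GNU extension, and it uses `grep -E`, which is the equivalent spelling of egrep. The three sample lines are invented.

```shell
# Hypothetical sample: a capitalised "Le" at line start, "le la" side
# by side at line end, and a "la" at line start.
sample=$(mktemp)
printf '%s\n' \
  'Le chat dort' \
  'il donne le la' \
  'la souris, elle, parle' > "$sample"

# Space-delimited: only " le " in the second line matches. The "la"
# after it is missed because the space before it was already consumed
# by the previous match and nothing follows it.
spaced=$(grep -E -o ' (le|la) ' "$sample" | wc -l)

# Word boundaries, case-insensitive: Le, le, la, la. The "le" inside
# "elle" and "parle" is not counted because no boundary precedes it.
bounded=$(grep -E -i -o '\b(le|la)\b' "$sample" | wc -l)

echo "space-delimited: $spaced"
echo "word-boundary:   $bounded"

rm -f "$sample"
```

Whether the larger count is the one you want depends on the text: `\b` also treats an apostrophe or a hyphen as a boundary, so it is worth spot-checking a few matches in the real file.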

#8 2010-09-22 22:10:58

livibetter
Member
From: Taipei
Registered: 2008-05-14
Posts: 95

Re: [Solved] RegExp and grep

jiehong wrote:

There's an issue : ' le|la ' will look for ' le ' and ' la ' but also for ' lela ' which is not good.

You forgot parentheses.

Use ' (le|la) ', not ' le|la '. The latter is roughly equivalent to '( le|la )' and will try to match ' le' or 'la '. The former will try to match ' le ' or ' la '.
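A quick way to see the precedence difference, as a sketch with made-up sample lines (`grep -E` is used here as the equivalent of egrep):

```shell
# Hypothetical sample: one real " le ", one word ending in "la",
# and one word starting with "le".
sample=$(mktemp)
printf '%s\n' \
  'voici le chat' \
  'tralala oui' \
  'un train lent' > "$sample"

# Without parentheses the alternatives are ' le' and 'la ', so this
# also hits the end of "tralala " and the start of " lent".
loose=$(grep -E -o ' le|la ' "$sample" | wc -l)

# With parentheses both spaces apply to both alternatives.
strict=$(grep -E -o ' (le|la) ' "$sample" | wc -l)

echo "without parentheses: $loose"
echo "with parentheses:    $strict"

rm -f "$sample"
```

Only the first line contains a standalone word, so the parenthesised pattern reports one match while the unparenthesised one reports three.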

#9 2010-09-22 22:16:59

jiehong
Member
Registered: 2009-03-19
Posts: 63

Re: [Solved] RegExp and grep

Thank you, it works.

Yes, it's French. Actually, le and la should not be followed by a string in that case. I should also count the ones preceded by a dot, but it's OK for now.

Thank you everybody!!
