Hello,
I have a file called roman in the working directory which contains a novel. My goal is to count how many occurrences of «le» and «la» surrounded by spaces (so " le " and " la ") the file contains.
So I tried:
grep -c ' le \| la ' roman
I got 1742.
I wanted to verify this by looking for « le » and « la » separately, and I got 1164 and 1389, which does not add up to 1742!
So I tried:
grep -c [l][ea] roman
This gives me 2443, which I took to be the sum of the two previous counts (1164 and 1389), so it seemed right.
The question is: why doesn't my first command give the right answer? I also thought that my last command would find le in lens and la in land and so on, which does not seem to be the case…
Last question: is there a difference between " and ' when used around a regexp? They work the same here, so I'd like to know.
Thank you
Last edited by jiehong (2010-09-22 21:48:40)
From man:
-c, --count
Suppress normal output; instead print a count of matching lines for each input file. With the -v, --invert-match option (see below), count non-matching lines. (-c is specified by POSIX.)
So -c counts on a per-line basis: a line with several matches is still counted once.
I think
grep -o ' le \| la ' roman | wc -l
should give you 2443.
-o, --only-matching
Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.
I would rather use
egrep -o ' (le|la) '
since it reads more clearly.
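To make the line-versus-match distinction concrete, here is a minimal sketch on a made-up three-line sample. The sample text and the variable names are assumptions for illustration, not taken from the novel. It uses grep -E so the alternation needs no backslash, and relies on -o as provided by GNU and BSD grep.

```shell
# Hypothetical sample standing in for the novel.
sample=$(mktemp)
printf '%s\n' \
  'il voit le chat et la souris' \
  'elle prend la clef' \
  'rien ici' > "$sample"

# -c counts LINES that contain at least one match.
line_count=$(grep -Ec ' (le|la) ' "$sample")

# -o prints every match on its own line, so wc -l counts MATCHES.
match_count=$(grep -Eo ' (le|la) ' "$sample" | wc -l | tr -d ' ')

echo "lines: $line_count, matches: $match_count"   # lines: 2, matches: 3
rm -f "$sample"
```

The first sample line holds two matches but only counts once with -c, which is the gap seen on the real file.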
Last edited by livibetter (2010-09-22 21:13:00)
I'm not sure about your setup, but on mine plain grep uses basic regular expressions by default, where an unescaped | is an ordinary character rather than alternation. You could use grep -E or egrep instead. I find that it works when I enclose the expression in apostrophes or quotation marks and don't escape the vertical bar, like so:
$ egrep -c 'word1|word2' < textfile
I see, so I was losing the extra occurrences on lines that contain more than one…
grep -o ' le \| la ' roman | wc -l
works well and I get 5382. I understand why now: before, I was counting the number of lines containing at least one le or la, which misses a lot of occurrences.
So thank you. Is there any difference between " and ' ?
Read http://tldp.org/LDP/abs/html/quoting.html and http://www.gnu.org/software/bash/manual … ml#Quoting
Long story short: single quotes keep everything literal, while double quotes still let the shell expand variables:
MYVAR=abc
echo '$MYVAR'
echo "$MYVAR"
Outputs
$MYVAR
abc
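Applied to grep, the same rule decides what pattern grep actually receives. The following is a minimal sketch; the sample lines and the variable names are made up for illustration, and -F is used in the second call so that grep treats the pattern as a fixed string.

```shell
# Hypothetical two-line sample.
sample=$(mktemp)
printf '%s\n' 'il voit la mer' 'prix: $word' > "$sample"

word=la

# Double quotes: the shell expands $word, so grep searches for "la".
double_quoted=$(grep "$word" "$sample")

# Single quotes: grep receives the literal text $word.
single_quoted=$(grep -F '$word' "$sample")

echo "double: $double_quoted"
echo "single: $single_quoted"
rm -f "$sample"
```

For a pattern such as ' le \| la ' that contains no $, backquote, or shell-special backslash sequence, both kinds of quotes pass the same text to grep, which is why they behaved identically in this thread.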
There's an issue: ' le|la ' will look for ' le ' and ' la ' but also for ' lela ', which is not good.
One more thing I forgot to mention: that regular expression has a flaw if you want to match "le blah..." or "blah le!", where the occurrences are at the beginning or end of a line, with or without punctuation. I don't know about French (it's French, right?).
The regular expression should be
egrep -io '\b(le|la)\b'
I also added `-i` to ignore case.
Again, I don't know French, so these cases might not occur in your roman text.
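As a rough illustration of what the space-delimited pattern misses, here is a sketch on a made-up sample (the text and variable names are assumptions). Note that \b is a GNU grep extension, so this assumes GNU grep or a grep that supports it.

```shell
# Hypothetical sample with matches at line start, line end, and after a comma.
sample=$(mktemp)
printf '%s\n' \
  'Le chat mange' \
  'il joue le la' \
  'la table, le livre' > "$sample"

# Space-delimited, case-sensitive: misses "Le" and "la" at line edges,
# and "la" right after " le " because the shared space is already consumed.
spaced=$(grep -Eo ' (le|la) ' "$sample" | wc -l | tr -d ' ')

# Word boundaries plus -i: catches every standalone le/la, any case.
bounded=$(grep -Eio '\b(le|la)\b' "$sample" | wc -l | tr -d ' ')

echo "spaced: $spaced, bounded: $bounded"   # spaced: 2, bounded: 5
rm -f "$sample"
```

The "le" inside "table" is not counted by either pattern, since there is no word boundary before its "l".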
jiehong wrote: There's an issue : ' le|la ' will look for ' le ' and ' la ' but also for ' lela ' which is not good.
You forgot the parentheses.
Use ' (le|la) ', not ' le|la '. The latter is roughly equivalent to '( le|la )' and will try to match ' le' or 'la '. The former will try to match ' le ' or ' la '.
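The difference is easy to check on a line that contains neither word on its own. This is a small sketch with a made-up sample line; the variable names are hypothetical.

```shell
# Hypothetical line with no standalone "le" or "la".
sample=$(mktemp)
printf '%s\n' 'voila un lent chat' > "$sample"

# Without parentheses: matches "la " inside "voila " and " le" inside " lent".
loose=$(grep -Eo ' le|la ' "$sample" | wc -l | tr -d ' ')

# With parentheses: requires a space on both sides, so nothing matches here.
grouped=$(grep -Eo ' (le|la) ' "$sample" | wc -l | tr -d ' ')

echo "loose: $loose, grouped: $grouped"   # loose: 2, grouped: 0
rm -f "$sample"
```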
Thank you, it works.
Yes, it's French. Actually, le and la should not be immediately followed by other letters in that case. I should also count the ones preceded by a period, but it's OK for now.
Thank you everybody!!