[SOLVED] Count lines with same string between 2 files (BASH)

equalizer876 · 2021-01-16 21:29:25

I want to compare 2 filter lists. Those have a lot of lines. Somehow my script does not work. Could you please help?
I'm not an expert in scripting.

#!/bin/bash

counter=0

while read line; do
    if [$(grep -c $line FILE2.txt) > 0]; then
        ((counter++))
    fi
done < FILE1.txt

echo "Amount of similar strings: ${counter}"

Last edited by equalizer876 (2021-01-16 21:58:51)

Trilby · 2021-01-16 21:42:45

At the very least you'd need to quote "$line" in the grep command, use `grep -q` rather than `grep -c`, and get rid of the extra subshell in each loop. But really this is just completely the wrong approach. Use comm:

#!/bin/bash

comm -12 <(sort FILE1.txt) <(sort FILE2.txt) | wc -l

Particularly for large files, this could be orders of magnitude more efficient. The running time of yours grows quadratically with the number of input lines, while mine grows roughly as n log n (dominated by the two sorts; comm itself is a single linear pass): never loop through a file many times if you can avoid it; and then realize that you can (almost) always avoid it.
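For reference, the minimal repairs mentioned at the top of this post might look like the sketch below. In the original script, `[` also needs spaces around it, and `>` inside `[ ]` is a redirection rather than a numeric comparison. The `-x`, `-F` and `--` options are additions not mentioned above, assumed here so that each line is compared literally and in full; the sample data and temporary directory are purely illustrative. It is still the quadratic approach.

```bash
#!/bin/bash
# Hypothetical demo data; substitute your real FILE1.txt and FILE2.txt.
dir=$(mktemp -d)
printf '%s\n' '0.0.0.0 ads.example.com' '0.0.0.0 track.example.net' \
    '0.0.0.0 only-in-one.example' > "$dir/FILE1.txt"
printf '%s\n' '0.0.0.0 track.example.net' '0.0.0.0 ads.example.com' \
    '0.0.0.0 only-in-two.example' > "$dir/FILE2.txt"

counter=0
# IFS= and -r keep leading whitespace and backslashes in each line intact.
while IFS= read -r line; do
    # -q: exit status only, no output (no subshell or counting needed)
    # -F: treat the line as a fixed string, not a regex
    # -x: the whole line must match
    # --: protects against lines that start with "-"
    if grep -qxF -- "$line" "$dir/FILE2.txt"; then
        counter=$((counter + 1))
    fi
done < "$dir/FILE1.txt"

echo "Amount of matching lines: ${counter}"
rm -r "$dir"
```

grep is still started once per line of FILE1, so this only fixes correctness, not the scaling.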

EDIT: note my version finds matching lines - if you mean something else by "similar" you'd need to first define what that means.

EDIT 2: if you use `comm` from coreutils (which you almost certainly will) you could also replace the pipe to `wc` with the GNU-specific comm option `--total`, but this would then require some parsing and formatting of the output, so really `wc -l` might still be best.
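A rough sketch of the `--total` variant, assuming GNU comm from a reasonably recent coreutils. The sample files and variable names are made up for illustration, and the inputs are sorted into temporary files only to keep the sketch portable across shells.

```bash
#!/bin/bash
# Hypothetical demo data, sorted up front because comm expects sorted input.
dir=$(mktemp -d)
printf '%s\n' 'a.example' 'b.example' 'c.example' | sort > "$dir/one.sorted"
printf '%s\n' 'b.example' 'c.example' 'd.example' | sort > "$dir/two.sorted"

# -123 suppresses all three regular columns, so only the summary line is left.
# That line holds tab-separated counts: only in file 1, only in file 2,
# in both, followed by the word "total". Field 3 is the count of common lines.
common=$(comm --total -123 "$dir/one.sorted" "$dir/two.sorted" | cut -f3)

echo "Lines in both files: ${common}"
rm -r "$dir"
```

The `cut -f3` step is the "parsing" referred to above, which is why piping `comm -12` to `wc -l` is arguably just as simple.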

Last edited by Trilby (2021-01-16 21:55:54)

equalizer876 · 2021-01-16 21:58:31

Thank you very much. This is really more efficient.

edit: So it will only count up when both files have exactly the same line, like "0.0.0.0 googleadserver...",
and not when one file has "googleadserver..." and the other file has "127.0.0.1 googleadserver..."? Well, that's ok.

Last edited by equalizer876 (2021-01-16 22:03:52)

lambdarch · 2021-01-16 22:10:37

Another possibility:

grep -cxFf FILE1.txt FILE2.txt

Should be quite fast also.

EDIT: It's not quite the same as what Trilby proposes, however; it can be adapted with other grep options or by passing the result to another command, depending on what exactly you want…
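One way the two can differ is with duplicate lines: `grep -c` counts every matching line of FILE2.txt, while comm pairs duplicates one to one. A small sketch with made-up data; the `sort -u` adaptation is just one possible choice, not necessarily what is needed here.

```bash
#!/bin/bash
# Hypothetical demo data: FILE2 lists ads.example.com twice.
dir=$(mktemp -d)
printf '%s\n' 'ads.example.com' 'track.example.net' > "$dir/FILE1.txt"
printf '%s\n' 'ads.example.com' 'ads.example.com' 'other.example.org' > "$dir/FILE2.txt"

# Counts every line of FILE2 that equals some line of FILE1, duplicates included.
grep_count=$(grep -cxFf "$dir/FILE1.txt" "$dir/FILE2.txt")

# Deduplicating the matches counts each distinct shared line only once.
distinct_count=$(grep -xFf "$dir/FILE1.txt" "$dir/FILE2.txt" | sort -u | wc -l)

echo "grep -c: ${grep_count}, distinct: ${distinct_count}"
rm -r "$dir"
```

If neither list contains duplicates, the two counts agree.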

Last edited by lambdarch (2021-01-16 22:23:50)

Trilby · 2021-01-16 22:39:57

equalizer876 wrote:

edit: So it will only count up when both files have exactly the same line, like "0.0.0.0 googleadserver...",
and not when one file has "googleadserver..." and the other file has "127.0.0.1 googleadserver..."? Well, that's ok.

Correct, but if the latter is your intended goal, that's very easy too. That's why I asked you to define the goal. Are these lists made up of IP address and server name pairs, and do you want the count of server name matches independent of the IP address? If so, what's the separator (tab, space, inconsistent)?
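A possible sketch of that variant, assuming the server name is always the last whitespace-separated field on a line (tabs or spaces, with or without a leading IP) and that the lists contain no comment lines. The file contents and variable names are invented for illustration.

```bash
#!/bin/bash
# Hypothetical demo data with differing or missing IP prefixes.
dir=$(mktemp -d)
printf '%s\n' '0.0.0.0 ads.example.com' '0.0.0.0 track.example.net' > "$dir/FILE1.txt"
printf '%s\n' '127.0.0.1 ads.example.com' 'track.example.net' \
    'other.example.org' > "$dir/FILE2.txt"

# Keep only the last field (the host name) of each non-empty line, then sort
# and deduplicate so that comm gets the sorted input it expects.
awk 'NF { print $NF }' "$dir/FILE1.txt" | sort -u > "$dir/hosts1"
awk 'NF { print $NF }' "$dir/FILE2.txt" | sort -u > "$dir/hosts2"

host_matches=$(comm -12 "$dir/hosts1" "$dir/hosts2" | wc -l)
echo "Host names in both lists: ${host_matches}"
rm -r "$dir"
```

With a different or inconsistent separator, the awk step is the only part that would need to change.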

And what is the REAL goal? Are you comparing blocklists for different adblocking services / sources? For what purpose? Why not just concatenate them and get the best blocking of both of them?
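If merging is the goal, a minimal sketch could be a single `sort -u` over both files, which concatenates and removes exact duplicate lines in one step. The output name `merged.txt` and the sample data are assumptions for illustration.

```bash
#!/bin/bash
# Hypothetical demo data with one entry present in both lists.
dir=$(mktemp -d)
printf '%s\n' '0.0.0.0 ads.example.com' '0.0.0.0 track.example.net' > "$dir/FILE1.txt"
printf '%s\n' '0.0.0.0 track.example.net' '0.0.0.0 other.example.org' > "$dir/FILE2.txt"

# sort reads both files as one stream; -u keeps one copy of each identical line.
sort -u "$dir/FILE1.txt" "$dir/FILE2.txt" > "$dir/merged.txt"

merged_lines=$(wc -l < "$dir/merged.txt")
echo "Entries in merged list: ${merged_lines}"
rm -r "$dir"
```

This only removes exact duplicates; lines that differ in the IP prefix would both be kept.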

Note on variants of "grep -f PATTERN_FILE": that will also avoid looping through the actual file(s) multiple times and can be fairly efficient in some circumstances. But it also requires the entire pattern file to be maintained in memory and then still looped through in memory. If the file actually is large, this may not be possible*; but in any case where it is possible, it's actually not that much different from the original shell loop, as the shell loop wouldn't actually re-read the file from disk each time through the loop, but would rather work with the cached data in memory. Comm carries no memory constraint and does not loop at all (sort kind of does, but it is very well optimized to use an efficient sorting algorithm rather than a naive full-file loop).

* Grep might still try to plow forward with your requested search even if the pattern file can't be held in memory. In that case it'd be thrashing to disk pretty horrifically, either via swap space or by actually rereading chunks of the pattern file that are no longer available in the cache. I'd wager the end result could even be less efficient than the original shell loop in such circumstances.

Last edited by Trilby (2021-01-16 22:52:01)

Arch Linux
