Hello, if you could just spare a moment, please. I have two databases (simply customer details on their own lines), and some lines in the two databases will be the same. I am trying to output only the non-matching lines to a third file. I usually use grep, but it simply cannot handle the size of the databases this time: it uses up all available RAM and then kills itself, leaving the output blank.
To explain better here is the command I use:
grep -v -f data2 data1 > newdata
I know this command works (I've used it in the past), but grep cannot handle the size of these latest databases, so I was looking to finally make something proper that can do this more professionally and efficiently. Does anyone know how I would do it with a script? I could follow bash or Python if anyone would like to run me through it, or share a way in your favourite language and I'll try to follow the process. Thanks.
cat + sort + uniq:
$ cat data1
1
2
4
$ cat data2
2
3
4
$ cat data1 data2 >tmp1
$ sort tmp1 >tmp2
$ uniq tmp2
1
2
3
4
Don't know how efficient it will be on really big files with complex lines though.
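The same steps should also work as a single pipeline, if you want to skip the temporary files:
$ sort data1 data2 | uniq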
But whether the Constitution really be one thing, or another, this much is certain - that it has either authorized such a government as we have had, or has been powerless to prevent it. In either case, it is unfit to exist.
-Lysander Spooner
Try comm.
cat + sort + uniq:
$ cat data1
1
2
4
$ cat data2
2
3
4
$ cat data1 data2 >tmp1
$ sort tmp1 >tmp2
$ uniq tmp2
1
2
3
4
Don't know how efficient it will be on really big files with complex lines though.
This worked fast, but it seems to leave one copy of each duplicated line in the output file. I need something that will only output a line from data1 if it is not found in data2.
Try comm.
This seemed only to format the output differently; I don't know how I could use that to sort things any quicker.
need something that will only output a line from data1 if it is not found in data2.
$ cat data1
1
2
4
$ cat data2
2
3
4
$ comm -23 data1 data2
1
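(Without flags, comm prints three columns: lines only in the first file, lines only in the second, and lines common to both; -23 suppresses the second and third columns, leaving just the lines unique to data1.)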
I have no idea if there's a faster way.
Last edited by karol (2013-07-01 16:17:45)
$ cat data1 data2 >tmp1
$ awk '!_[$0]++' tmp1 >tmp2
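(The awk prints a line only the first time it is seen, so tmp2 ends up as the merged file with duplicates removed.)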
Last edited by FitzJac (2013-07-01 17:00:23)
I'd rather try touching the moon
than take on a whore's thinking.
Al Swearengen
$ cat data1
1
2
4
$ cat data2
2
3
4
$ comm -23 data1 data2
1
I have no idea if there's a faster way.
This did the trick, thank you.
$ cat data1 data2 >tmp1
$ awk '!_[$0]++' tmp1 >tmp2
Again, this merged the two files and then removed the dupes; I need the two files to remain separate.
comm would be the way to go, but note that it requires the input files to already be sorted. If this ever is an issue you can do
comm -23 <(sort data1) <(sort data2)
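With the file names from the first post, the whole thing would be something like:
comm -23 <(sort data1) <(sort data2) > newdata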
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
comm would be the way to go, but note that it requires the input files to already be sorted. If this ever is an issue you can do
comm -23 <(sort data1) <(sort data2)
comm was the answer, as karol said. In most cases my files are sorted, but it seemed to work fine even when they weren't; it just informed me that they weren't sorted, with no ramifications that I saw.
Trilby is right, I usually use comm the way he suggested.
$ cat data1
2
4
1
$ cat data2
4
3
2
$ comm -23 data1 data2
2
comm: file 1 is not in sorted order
comm: file 2 is not in sorted order
1
As you can see, unsorted files return the wrong answer; the right answer is just '1': https://bbs.archlinux.org/viewtopic.php … 2#p1294902
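Sorting first, as Trilby showed, gives the expected result:
$ comm -23 <(sort data1) <(sort data2)
1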
It seems you guys were right as usual, thanks again. Can anyone explain why grep was so slow and hungry compared to comm/sort which are both near instant?
Just think about what it has to do. Your grep command has to read every single line of data2 into memory, then it reads data1 line by line and compares each line of data1 to every line of data2 stored in memory*, and it does each of these comparisons as a regex match.
Comm does not need to read the files into memory; it opens each one, reads them line by line, and does only "simple" string comparisons.
*- This is made much worse if the files are too big to fit in RAM, as grep then has to swap the data2 contents to disk, and it has to swap between memory and disk for *every* line of data1 so that each of those lines can be compared to every line of data2. (Edit: this assumed the data files were too big to be read completely into RAM, which doesn't seem to be the case, but I thought it was implied in the original description.)
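If you want to see the difference yourself, GNU time (the standalone /usr/bin/time from the time package, not the shell builtin) can report peak memory use with -v; something along these lines should do it (the comm line assumes the files are already sorted):
/usr/bin/time -v grep -v -f data2 data1 > newdata
/usr/bin/time -v comm -23 data1 data2 > newdata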
Last edited by Trilby (2013-07-04 18:53:20)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Okay, but why does it blow simple text files (mere kilobytes or megabytes in size) up into gigabytes of memory use, and possibly beyond, just to do this?
Does it? I just tried with some package lists from pacman -Qq (3.4K) and -Ssq (75K) and the comm command was 20 times faster (.03s vs .6s) but I didn't have any drastic spike in memory or processor use with the grep command.
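Something along these lines should reproduce the comparison (the list file names are just for illustration):
pacman -Qq > installed
pacman -Ssq > available
time grep -v -f available installed > /dev/null
time comm -23 <(sort installed) <(sort available) > /dev/null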
Last edited by Trilby (2013-07-03 18:20:23)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
It definitely does; the task manager shows it: http://i.imgur.com/mAGrCr7.png
There's a small spike when running the command, which is understandable, but it consumes an entire core, and after a while you can see a steady climb in memory use. I killed it shortly afterwards because I know it would have continued until it consumed all available RAM and eventually killed itself. The two files in this example are 1.4 MB total, so imagine what the result would be working with files of hundreds or thousands of MB. The comm and sort combination you provided is effectively instant with no footprint.
I'm not sure what's supposed to be seen in that *highly* 'redacted' image. The highlighted line, which I can only assume is the grep process, is using 1% CPU and an amount of memory in the range of the size of one of the files you describe.
Anyhow, I'm glad it's working. I don't think I can help with your curiosity about the grep command as it's not clear what you are describing.
EDIT: D'Oh, sorry, I completely missed the first line in that image somehow. I think I must have mistaken the blue highlighting for a title bar or WM decoration and then looked below it.
Last edited by Trilby (2013-07-03 19:13:55)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
The image is of the same grep command as in the original post. You can see it is actually using 25% CPU (an entire core) and almost 1 GB of RAM, and climbing, just to process 1.4 MB of data. The graphs at the top are CPU on the left and RAM on the right, scrolling from right to left; grep alone is causing all of that activity. I don't know if it is a bug or if there is a reasonable explanation behind it, but it definitely isn't ideal. We've gone off topic, but I was just looking for an explanation; I'll assume it's a bug for now.
grep -F uses the Aho-Corasick algorithm, I believe. This requires creating a trie-like data structure with one node per character of the pattern file, and the nodes are reasonably heavy-weight. It still shouldn't be quite that bad though; I use my own implementation of that algorithm for many megabytes worth of data and it works fine. Maybe there's a bug in grep, I'll have a look.
EDIT: Actually, I just re-read your command. You probably want "grep -F -f".
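I.e., for the original command, something along the lines of (adding -x as well so that only whole-line matches count, which seems to be what's wanted here):
grep -v -F -x -f data2 data1 > newdata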
Last edited by tavianator (2013-07-04 15:03:18)
Can anyone explain why grep was so slow and hungry compared to comm/sort which are both near instant?
What takes the time is compiling the regex. Compiling a regex the size of a large text file is a huge task, and the regex engine is likely optimised for making later searches efficient rather than for compiling quickly. Normally regexes are quite short but tend to be applied a large number of times.
grep -F searches for fixed strings, not regexes, which is what you want in this case, as tavianator suggested.
An experiment along the lines Trilby mentioned:
$ pacman -Ss "" > /tmp/allpkgs
$ wc /tmp/allpkgs
12126 56347 505075 /tmp/allpkgs
$ time echo | grep -i -f /tmp/allpkgs
echo 0.00s user 0.00s system 0% cpu 0.001 total
grep -i -f /tmp/allpkgs 40.41s user 0.16s system 99% cpu 40.589 total
$ time echo | grep -i -F -f /tmp/allpkgs
echo 0.00s user 0.00s system 0% cpu 0.001 total
grep -i -F -f /tmp/allpkgs 0.19s user 0.03s system 98% cpu 0.219 total
The time until grep -F is ready for input is 1/5 second, versus 40 seconds for grep without -F.
Officer, I had to drive home - I was way too drunk to teleport!
deepsoul, that might be more interesting using a disk file. /tmp/ is (at least by default) stored in RAM.
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Why? I thought the question was about CPU use (and memory). The CPU time should have been the same reading from disk; only the wall-clock time might have been larger, and I'd have had to wait longer before I could paste the result. And the file would likely still have been in the disk cache anyway.
Officer, I had to drive home - I was way too drunk to teleport!
Ah, you're completely right. Sorry.
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman