Python? No, no, no. This is clearly a job for a Java applet, or perhaps Visual Basic ... or Fortran.
My understanding is this: the combination of the shell and coreutils is best suited to system-administration tasks, NOT complex text processing. The very reason Perl/Python exist is the complexity that text-processing tasks bring, such as multilingual input. I am sure coreutils can even handle multilingual text nowadays, but the code will be hard to understand and maintain. Beyond text processing, Python has good object-oriented support (Perl does not), which gives the opportunity to scale the program up. Perl and Python are also far more integrated with the Unix/Linux environment and philosophy than Java or Visual Basic are. Fortran is a compiled language; it requires much more effort for a beginner to master.
Offline
@solskog, this isn't complex, and it's exactly the kind of everyday use case that tools like sed, cut, grep, awk, sort, comm and diff were designed for, which is why it can be done with two calls and a simplistic regexp (depending on the actual data, the inverse grep might need a second regexp; the posted one assumes that the basenames in the list are unique patterns, i.e. there's no "foo [abcd].mp4" alongside "foo_1 [efgh].mp4")
That being said, the task could also be covered in pure bash (no external tools, not even python, required); it is just MUCH more efficient in human and CPU time to use the specialized tools.
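A minimal pure-bash sketch of that claim, with hypothetical file names in the same "name [hash].mp4" / "name.mp4" layout as in the thread: hash file2's basenames in an associative array, then keep every file1 line whose basename is not in the array.

```shell
# Hypothetical sample data in the thread's layout.
printf '%s\n' 'foo [abcd].mp4' 'bar [efgh].mp4' > file1
printf '%s\n' 'foo.mp4' > file2

# Collect file2's basenames as associative-array keys.
declare -A have
while IFS= read -r line; do
    have["${line%.mp4}"]=1
done < file2

# Keep file1 lines whose basename is not among those keys.
missing=()
while IFS= read -r line; do
    base=${line% \[*\].mp4}          # strip the " [hash].mp4" suffix
    [[ -v have[$base] ]] || missing+=("$line")
done < file1

printf '%s\n' "${missing[@]}"        # prints: bar [efgh].mp4
```

As noted, the external-tool pipelines are far faster on large inputs; this only shows it can be done in the shell itself (bash 4.3+ for `-v` on array elements).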
Online
sed "s;.mp4;;" <(sort file2) | join -v2 - <(sed "s;\(.*\)\( \[.*\].mp4\);\\1 \\2;" <(sort file1))
Last edited by solskog (2024-04-13 13:36:21)
Offline
Congrats, now you've caught up to and replicated the solutions at the very start of this thread... almost. That's slightly more complex than the original suggestions that also failed to include the bracketed hash-like codes in the output.
It seems it's contagious: the OP isn't the only one who ignores all replies.
Last edited by Trilby (2024-04-13 12:46:11)
"UNIX is simple and coherent" - Dennis Ritchie; "GNU's Not Unix" - Richard Stallman
Offline
ignores all replies
Thanks! Using "join" is the way to go.
Offline
I am not ignoring your replies and I am reading each and every post posted by you.
Offline
Posting disjoint stuff without ever picking up on other posts = you've ignored them (by definition)
Online
From the accidental report:
Sorry from now onwards strictly I will follow all your suggestions.
FWIW you shouldn't blindly do that either, but in the case at hand there was good advice given that would've solved your issue in a fraction of the work you've put in.
Offline
I realize this has already effectively been solved many times over, and python solutions were met with some derision - but with humility here's another python version I cobbled together. Admittedly overkill and overwrought, but while the one-liners are efficient and effective I'm working on learning python so this was a good exercise. Any thoughts are welcomed.
#!/usr/bin/env python
"""Compare two files and write lines in first but not in second to new file."""
from pathlib import Path
from typing import Set
import re
import argparse

parser: argparse.ArgumentParser = argparse.ArgumentParser()
parser.add_argument(
    "SOURCE",
    help="Source file: a portion of each of its lines is tested to see if it exists in the target file.",
)
parser.add_argument(
    "TARGET",
    help="Target file: the file that a portion of each of the source file's lines is checked against.",
)
parser.add_argument(
    "-m",
    "--missing",
    help="Output file containing lines existing in the source file but missing from the target file. Defaults to 'missing.txt' if not specified.",
    required=False,
    default="missing.txt",
)
args = parser.parse_args()

source: Path = Path(args.SOURCE)
target: Path = Path(args.TARGET)
missing: Path = Path(args.missing)

# Split each source line into basename, bracketed code, and extension.
pattern: re.Pattern = re.compile(
    r"^(?P<filename>.+)(?P<brackets>\[.+\])(?P<extension>\..+$)"
)

missing_lines: dict[str, str] = dict()
source_set: Set[str] = set()
target_set: Set[str] = set()

with source.open() as lhs:
    for whole_line in lhs:
        result: re.Match[str] | None = pattern.match(whole_line.rstrip())
        if result is None:  # skip lines without a bracketed code
            continue
        partial_line: str = result.group("filename").rstrip() + result.group(
            "extension"
        )
        missing_lines[partial_line] = whole_line
        source_set.add(partial_line)

with target.open() as rhs:
    for line in rhs:
        target_set.add(line.rstrip())

for matched_line in source_set.intersection(target_set):
    missing_lines.pop(matched_line, None)

with missing.open(mode="w") as f:
    for line in missing_lines.values():
        f.write(line)
"the wind-blown way, wanna win? don't play"
Offline
Now can we get a 200 line python script that will list all the files in the current working directory? Because `ls` is so over-rated.
Offline
overkill and overwrought, but while the one-liners are efficient and effective I'm working on learning python so this was a good exercise
There's nothing wrong w/ that. There're many ways to skin a cat and learning how to skin cats by skinning cats is an effective way to learn to skin cats.
Though cats probably don't approve.
A rather funny game on SO/SE: somebody asks a basic question, gets bombarded with all sorts of implementations, and then has to choose (so the next question is "should I prefer awk or sed?")
The problem is only when people insist on specific implementations because of para-religious motives - or because their homework instructions say so.
@Trilby, too late…
Online
The problem is only when people insist on specific implementations because of para-religious motives
I hope the "because ..." clause is intended to be a restrictive clause. There may be many equally good solutions to a given problem, but that doesn't mean all solutions are equally good. Some solutions really are crap - even if they happen to get the proper result.
Offline
Q. How can I do X with sed
A. You don't, use awk
Q. Ok, but I want to do this with sed
A. Why?
Q. Can you please answer the question?
A. Yes, use awk
Q. That doesn't help me
…
Of course there're better ways to skin a cat and really dumb ways (eg. asking the cat to do it itself, obviously your first ever encounter with a cat…) and you'll insist on them being dumb, because you have to, because that's reality.
But at times the petitioner insists on a specific (and specifically dumb) approach for, I assume, religious motives - because it's not "reasons".
(I guess you could make up scenarios where, in this case, sed is the only available tool but in reality it's either their homework or "awk is too complicated")
Online
$ echo skin > cat
$ cat cat
skin
Offline
here's another python version
Performance comparison shows the following results when file1 has one million lines and file2 has half a million:
Python using dictionary hash.
real 0m1.787s
user 0m1.525s
sys 0m0.067s
Python using re.Match pattern matching
real 0m2.935s
user 0m2.884s
sys 0m0.040s
sed "s;.mp4;;" <(sort file2) | join -v2 - <(sed "s;\(.*\)\( \[.*\].mp4\);\\1 \\2;" <(sort file1))
As for my own code, when the line count goes up I always get the complaint "join: input is not in sorted order"
Last edited by solskog (2024-04-15 12:59:36)
Offline
As for my own code, when the line count goes up I always get the complaint "join: input is not in sorted order"
Because you are not sorting the input to join. You did have an extra unneeded sort of the input to sed though:
#wrong
sed "s;.mp4;;" <(sort file2) | join -v2 - <(sed "s;\(.*\)\( \[.*\].mp4\);\\1 \\2;" <(sort file1))
#right (at least the sorting will be right, the rest not confirmed)
sed "s;.mp4;;" file2 | sort | join -v2 - <(sed "s;\(.*\)\( \[.*\].mp4\);\\1 \\2;" file1 | sort)
Last edited by Trilby (2024-04-15 13:06:33)
Offline
#wrong
sed "s;.mp4;;" <(sort file2) | join -v2 - <(sed "s;\(.*\)\( \[.*\].mp4\);\\1 \\2;" <(sort file1))
#right (at least the sorting will be right, the rest not confirmed)
sed "s;.mp4;;" file2 | sort | join -v2 - <(sed "s;\(.*\)\( \[.*\].mp4\);\\1 \\2;" file1 | sort)
I am sorry, I forgot that I had added "sort --debug" in my code and got the warning "join: input is not in sorted order". Once the "--debug" flag was removed, no more warnings.
Both lines gave the same result, and believe it or not the #wrong one is even faster with a huge input file. Maybe it has fewer Unix pipes/forks.
sed 's/\.mp4$//g' input2.txt | grep -vFf - input1.txt
two calls and a simplistic regexp (depending on the actual data, the inverse grep might need a second regexp; the posted one assumes that the basenames in the list are unique patterns, i.e. there's no "foo [abcd].mp4" alongside "foo_1 [efgh].mp4")
There is yet another exception to consider, i.e. "abcd.mp4" and "bar [abcd].mp4": because "grep -vFf - input1.txt" uses "bar [abcd].mp4" as a key, the "abcd.mp4" will be ignored, which is not correct.
Offline
The key would be "abcd" and grepping for the proper regexp to cover the mentioned case would capture this (rather unlikely, given the shown patterns) construct as well.
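A sketch of that fix with made-up data: anchoring each key to the start of the line keeps the fixed-string collision from firing. The file names here are hypothetical.

```shell
# Hypothetical data reproducing the collision: file2's "abcd.mp4"
# yields the key "abcd", which also occurs inside a hash in file1.
printf '%s\n' 'bar [abcd].mp4' 'abcd [wxyz].mp4' > input1.txt
printf '%s\n' 'abcd.mp4' > input2.txt

# Fixed strings: "abcd" also hits the hash of "bar [abcd].mp4",
# so nothing survives the inverse grep (grep exits non-zero).
wrong=$(sed 's/\.mp4$//' input2.txt | grep -vFf - input1.txt || true)

# Regexps anchored as "^abcd \[" only match at the basename position.
right=$(sed 's/\.mp4$/ \\[/; s/^/^/' input2.txt | grep -vf - input1.txt)
echo "$right"    # bar [abcd].mp4
```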
Online
Which is why join or comm based solutions are really the "right" solutions. These tools literally do exactly what is sought - they're not a workaround that might work in some cases, but when invoked properly they implement exactly the intended goal.
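For illustration, a minimal comm-based sketch (hypothetical file names): comm does a line-wise set comparison of two sorted inputs, so run on the extracted basenames it directly yields the entries of file1 that have no counterpart in file2.

```shell
# Hypothetical sample data in the thread's layout.
printf '%s\n' 'foo [abcd].mp4' 'bar [efgh].mp4' > file1
printf '%s\n' 'foo.mp4' > file2

# comm needs sorted input; -1 drops lines unique to the first file,
# -3 drops common lines, leaving lines unique to the second file.
only_in_file1=$(comm -13 <(sed 's/\.mp4$//' file2 | sort) \
                         <(sed 's/ \[.*\]\.mp4$//' file1 | sort))
echo "$only_in_file1"    # bar
```

Note this prints only the basenames; recovering the full file1 lines (with the bracketed codes) is what the join -v2 variants in this thread handle.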
Offline
These tools literally do exactly what is sought - they're not a workaround that might work in some cases, but when invoked properly they implement exactly the intended goal.
I agree!
Yet I still have one question about performance, from some tests I did with big input files: how come the coreutils performance decreases when the input becomes large? Are the pipes/forks or memory allocation the limiting factor?
Dictionary hash in Python
real 0m0.554s
user 0m0.487s
sys 0m0.067s
Sed back reference in Shell
real 0m1.071s
user 0m0.310s
sys 0m0.013s
Pattern matching in Python
real 0m0.931s
user 0m0.883s
sys 0m0.047s
Offline
Because GNU's coreutils are completely bloated garbage (have you seen my forum signature?). They are frequently orders of magnitude slower than other implementations of POSIX tools (and built from source code that is often several orders of magnitude longer). If you want performance, do not use GNU's implementations.
Also note that as the inputs get larger, the python implementation quickly becomes memory-limited as the full input must be stored in a dictionary. The sed / join implementation scales indefinitely without having a notable impact on memory use.
Last edited by Trilby (2024-04-16 15:14:45)
Offline
I realized that by changing the field delimiter, the problem became easier.
sed "s;[[:space:]]*\.mp4$; [;" <(sort -k1b input2) | join -t '[' -v2 - <(sort -k1b input1)
sed "s;[[:space:]]*\.mp4$; [;" input2 | grep -vFf - input1
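To sketch why that delimiter rewrite makes the fixed-string variant safe (hypothetical data, including the earlier hash-collision case): replacing ".mp4" with " [" turns each file2 entry into a string that can only match at a basename boundary.

```shell
# Hypothetical data including the earlier "abcd" collision case.
printf '%s\n' 'foo [abcd].mp4' 'xyz [abcd].mp4' > input1
printf '%s\n' 'foo.mp4' 'abcd.mp4' > input2

# "foo.mp4" -> "foo [", "abcd.mp4" -> "abcd [": neither fixed string
# can match inside a bracketed hash, only before one.
result=$(sed 's/[[:space:]]*\.mp4$/ [/' input2 | grep -vFf - input1)
echo "$result"    # xyz [abcd].mp4
```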
Last edited by solskog (2024-04-18 03:37:42)
Offline
Nice. That second one does seem to be the best option.
Offline