project page: http://xyne.archlinux.ca/projects/rmdupes/
AUR page: https://aur.archlinux.org/packages/rmdupes/
It's also available in my repos.
I wrote a script to recursively scan target directories for duplicates of files in a reference directory, with the option to move duplicates to a backup directory. I thought it might be useful for others so I have cleaned it up and added several options (e.g. optional reference directory, regular expression file filters, automatically keep the oldest, newest or alphabetically first file, multiple target dirs, shallow file comparisons).
Check the project page for details and usage examples. rmdupes is written in Python, but it uses the filecmp library for comparisons, so it's relatively fast. The library compares files by content (not by hash), so there should never be any false positives (unless you use the --shallow option), and it avoids reading entire files pre-emptively to hash them.
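For readers unfamiliar with filecmp, here is a minimal sketch of a content-based comparison. It is not taken from rmdupes itself; the function names same_content and duplicates_of are made up for illustration, and only filecmp.cmp and os.path.getsize are real library calls.

```python
import filecmp
import os


def same_content(path_a, path_b):
    """Return True only if both files are byte-for-byte identical."""
    # shallow=False: sizes are checked first, then the contents are read
    # and compared chunk by chunk. With shallow=True, identical os.stat
    # signatures (type, size, mtime) are accepted without reading the data.
    return filecmp.cmp(path_a, path_b, shallow=False)


def duplicates_of(reference, candidates):
    """Hypothetical helper: list the candidates that duplicate `reference`."""
    ref_size = os.path.getsize(reference)
    return [
        path for path in candidates
        # Cheap size check first, so most non-duplicates are never opened.
        if os.path.getsize(path) == ref_size and same_content(reference, path)
    ]
```

No hashes are computed here, so two different files can never be reported as equal, and a comparison stops at the first chunk that differs.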
Bug reports, suggestions and general feedback are appreciated. Let me know if you find it useful.
My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone
Hi Xyne,
Thanks for sharing. I have been using fdupes on a large folder tree, and I don’t feel comfortable with it.
I like to be prompted to choose which file(s) to keep, and I really need the backup dir.
I was about to ask you what --script is for, before (re)reading your project page. I don’t know why, but I like the idea too!
Two problems (maybe my fault):
When I only give a target directory (like rmdupes Images/), duplicate files are found correctly, but after the selection nothing happens and one CPU stays at 100%.
When using
rmdupes /home/antoine/Images/test /home/antoine/Images
I've got a non-duplicated file which appears twice in the selection prompt
I don’t know if it’s a misunderstanding or a bug. Feel free to tell me how to debug it.
I can’t get it fully working yet, but I already like it.
Edit: oh, the problem is on my side! On my laptop it works very well (and fast!), but it fails on my desktop. I have no idea why. I’ll investigate as soon as possible. Do you have an idea of how to debug it?
The second problem, however, occurs on both machines (maybe I am misusing it?).
Edit2: hmm, in fact I see the same problem on my laptop. It is intermittent. Sorry for the noise and for how little information I can provide right now.
Last edited by SubS0 (2017-03-04 20:19:09)
Thanks for the feedback.
The second error should be fixed now. Note, however, that rmdupes scans directories recursively so passing it nested target directories just leads to redundant os.stat calls.
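If you want to avoid the redundant scanning yourself, nested directories can be dropped before they are passed on. This is only a sketch of one way to do that, not something rmdupes is known to do; prune_nested is a hypothetical helper name.

```python
import os


def prune_nested(directories):
    """Drop every directory that lies inside another directory in the list."""
    # Resolve symlinks and relative paths, remove exact repeats, and sort.
    # Sorting puts a parent before any of its descendants.
    resolved = sorted({os.path.realpath(d) for d in directories})
    kept = []
    for path in resolved:
        # `path` is nested if its common path with a kept directory
        # is that kept directory itself.
        if not any(os.path.commonpath([path, parent]) == parent for parent in kept):
            kept.append(path)
    return kept
```

With the two directories from the report above, only the outer Images directory would remain, and a recursive scan of it still covers Images/test.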
I don't know what could be causing the first error. Are you using any other options (e.g. a backup directory) when it occurs? What is in the directories that it's scanning? If you can find a way to create a minimal working example then I can debug it.
I also fixed a bug that swapped the intended behavior of "oldest" and "newest" when using the --keep option.
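To make the intended meaning of "oldest" and "newest" concrete, here is a rough sketch of how a keeper could be picked from a group of duplicates. It assumes that age is judged by modification time, which this thread does not confirm, and both choose_keeper and the "first" value are names invented for the example.

```python
import os


def choose_keeper(paths, keep):
    """Pick the one file to keep from a group of duplicate paths."""
    if keep == "oldest":
        # Smallest modification time = least recently modified file.
        return min(paths, key=os.path.getmtime)
    if keep == "newest":
        # Largest modification time = most recently modified file.
        return max(paths, key=os.path.getmtime)
    if keep == "first":
        # Alphabetically first path (hypothetical value name).
        return min(paths)
    raise ValueError(f"unknown keep mode: {keep!r}")
```

Swapping min and max in the first two branches gives exactly the kind of inverted behavior described above.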
edit
There is now a --list option to list duplicate files along with size and counts.
edit2
I found the error that leads to an endless loop and high CPU usage. Expect an update shortly.
edit3
The new version is up. See the changelog and help message on the project page for changes (there are several).
Last edited by Xyne (2017-03-05 06:12:58)
Here is what I get now (2017.3.5), with a directory containing 2 identical files:
$ rmdupes /tmp/rmdupes_tests
Select the file(s) to keep. Separate numbers by spaces.
Enter nothing to keep them all.
1: /tmp/rmdupes_tests/dwm.png
2: /tmp/rmdupes_tests/dwm2.png
> 1
Traceback (most recent call last):
  File "/usr/bin/rmdupes", line 759, in <module>
    main()
  File "/usr/bin/rmdupes", line 714, in main
    selected.update((d, (size,keepers[0])) for d in dupes if d not in keepers)
  File "/usr/bin/rmdupes", line 714, in <genexpr>
    selected.update((d, (size,keepers[0])) for d in dupes if d not in keepers)
TypeError: 'generator' object is not subscriptable
------
Just for information: before you released the update, it seems that the files involved in the endless loop were slightly different versions of the same picture, for example a copy of the original dwm.png with modified colors (edited with pinta or gimp), when both the original and the modified file were present. I still have an strace log, but it may not be pertinent anymore.
The selection dialogue error should be fixed. If the other error persists, please post the trace log. I would be surprised if the loop was due to similarity between the images: on a binary level, they are likely completely different, and even if they were almost identical, the filecmp.cmp algorithm would have to be really bad to end up in a loop when comparing them.
I'm curious about what it was, but I'm mostly hoping that the issue is fixed with the new code in the current version.
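For anyone who runs into the same TypeError elsewhere: it is what Python raises when a generator is indexed. The sketch below reproduces the message with made-up data and shows one common remedy, which is to materialize the generator into a list first. It is not the actual rmdupes code or fix; pick_keepers and the size value are invented for the example.

```python
def pick_keepers(dupes, chosen_numbers):
    """Hypothetical stand-in for the selection prompt (1-based numbers)."""
    # A generator expression is lazy and cannot be indexed.
    return (dupes[n - 1] for n in chosen_numbers)


dupes = ["/tmp/rmdupes_tests/dwm.png", "/tmp/rmdupes_tests/dwm2.png"]
size = 12345  # placeholder file size, for illustration only

keepers = pick_keepers(dupes, [1])
try:
    keepers[0]
except TypeError as err:
    message = str(err)  # the same message as in the traceback above

# Turning the generator into a list allows indexing, and also allows
# repeated `in` tests, which would otherwise consume the generator.
keepers = list(pick_keepers(dupes, [1]))
selected = {}
selected.update((d, (size, keepers[0])) for d in dupes if d not in keepers)
```

After this runs, selected maps dwm2.png to the size and the kept file dwm.png.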
Yes, it now works well and fast! I will definitely adopt it, as it's far more comfortable and reliable than fdupes.
The options are very useful (--symlink is a great idea, and --keep is nice).
I just can't get the -r option to work.
rmdupes -r foo bar
replies that <target directory> is required. I tried several variations, with no success.
But with ' -b /bak/dir/ ' inserted between foo and bar, it works.
Am I missing something?
Last edited by SubS0 (2017-03-05 17:56:16)
You can pass it multiple reference directories, so "-r" consumes arguments until the next option or the end. Use "--" to end option parsing, e.g.:
# reference directory: foo
# target directory: bar
rmdupes -r foo -- bar
# reference directories: foo, fub
# target directories: bar, baz
rmdupes -r foo fub -- bar baz
or just invert the arguments:
rmdupes bar -r foo
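The same behavior can be reproduced with Python's argparse, which is sketched below. Whether rmdupes actually uses argparse is an assumption on my part; the parser name, dest and metavar values are invented, and only the "-r" flag and the target directories come from this thread.

```python
import argparse


def build_parser():
    """Hypothetical parser that mimics the -r behavior described above."""
    parser = argparse.ArgumentParser(prog="rmdupes-sketch")
    # nargs="+" makes -r take every following argument up to the next
    # option, a "--" separator, or the end of the command line.
    parser.add_argument("-r", dest="reference", nargs="+", default=[], metavar="DIR")
    # At least one target directory is required.
    parser.add_argument("targets", nargs="+", metavar="TARGET")
    return parser


def parse_cli(argv):
    return build_parser().parse_args(argv)


# "--" stops -r from taking the target directories.
args = parse_cli(["-r", "foo", "fub", "--", "bar", "baz"])
# Putting the targets first works as well.
inverted = parse_cli(["bar", "-r", "foo"])
```

With this parser, "-r foo bar" leaves nothing for the targets, so argparse exits with a "required" error, which matches the symptom reported above.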
OK, it’s clear now. I could have found that myself!
Thanks a lot for this useful tool.
To anyone actually using this besides me, I've refactored the way rmdupes displays information and prompts for confirmations to make it less tedious (e.g. no more double confirmations in some cases, lists are now displayed in tables, default log level set to WARN).
Feel free to make further usability suggestions.