
#1 2017-03-04 04:16:21

Xyne
Moderator/TU
Registered: 2008-08-03
Posts: 6,544
Website

rmdupes: another duplicate file remover

project page: http://xyne.archlinux.ca/projects/rmdupes/
AUR page: https://aur.archlinux.org/packages/rmdupes/
It's also available in my repos.

I wrote a script to recursively scan target directories for duplicates of files in a reference directory, with the option to move duplicates to a backup directory. I thought it might be useful for others so I have cleaned it up and added several options (e.g. optional reference directory, regular expression file filters, automatically keep the oldest, newest or alphabetically first file, multiple target dirs, shallow file comparisons).

Check the project page for details and usage examples. rmdupes is written in Python, but it uses the filecmp library for comparisons, so it's relatively fast. The library compares files by content (not by hash), so there should never be any false positives (unless you use the --shallow option), and it avoids reading entire files pre-emptively to hash them.
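As an illustration of content-based comparison, here is a minimal sketch built on filecmp.cmp. It is not the rmdupes source: the helper name find_duplicates and the bucket-by-size step are assumptions made for the example.

```python
import filecmp
import os
from collections import defaultdict


def find_duplicates(paths, shallow=False):
    """Group paths into lists of identical files (hypothetical helper).

    Files are bucketed by size first, so only same-sized files are compared.
    With shallow=False, filecmp.cmp reads both files in chunks and stops at
    the first chunk that differs. With shallow=True, files whose os.stat
    signatures (type, size, mtime) match are treated as equal without
    reading them, which is where false positives can come from.
    """
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    groups = []
    for remaining in by_size.values():
        while len(remaining) > 1:
            first, rest = remaining[0], remaining[1:]
            dupes = [p for p in rest if filecmp.cmp(first, p, shallow=shallow)]
            if dupes:
                groups.append([first] + dupes)
            # Files that did not match "first" may still match each other.
            remaining = [p for p in rest if p not in dupes]
    return groups
```

No hashes are computed here, and files with a unique size are never opened.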

Bug reports, suggestions and general feedback are appreciated. Let me know if you find it useful.


My Arch Linux Stuff • Forum Etiquette • Community Ethos - Arch is not for everyone

#2 2017-03-04 19:20:21

SubS0
Member
Registered: 2015-02-10
Posts: 37

Re: rmdupes: another duplicate file remover

Hi Xyne,

thanks for sharing. I have been using fdupes on a large folder tree, and I don’t feel comfortable with it.

I like being prompted to choose which file(s) to keep, and the backup dir is a real need.
I was about to ask you why there is a --script option before (re)reading your project page. I don’t know why, but I like the idea too!

Two problems (maybe on my end):

  1. When I give just a target directory (like rmdupes Images/), duplicate files are found correctly, but after the selection nothing happens and a CPU stays at 100%.

  2. When using

    rmdupes /home/antoine/Images/test /home/antoine/Images

    I've got a non-duplicated file which appears twice in the selection prompt

I don’t know if it’s a misunderstanding or a bug. Feel free to tell me how to debug it.
I can't get it fully working yet, but I already like it :D

Edit: oh, the problem is on my side! On my laptop, it works very well (and fast!). But it fails on my desktop... I have no idea why. I’ll investigate as soon as possible. Do you have an idea of how to debug it?

But the second problem occurs on both (maybe I'm misusing it?).

Edit2: hmm... in fact, I see the same problem on my laptop. It is intermittent. Sorry for the noise and for how little information I can provide right now.

Last edited by SubS0 (2017-03-04 20:19:09)


#3 2017-03-04 22:01:22

Xyne
Moderator/TU
Registered: 2008-08-03
Posts: 6,544
Website

Re: rmdupes: another duplicate file remover

Thanks for the feedback.

The second error should be fixed now. Note, however, that rmdupes scans directories recursively so passing it nested target directories just leads to redundant os.stat calls.
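If you want to avoid the redundant scans on the calling side, nested targets can be dropped before they are passed in. A minimal sketch, assuming POSIX-style paths; prune_nested is a hypothetical helper and not part of rmdupes.

```python
import os


def prune_nested(directories):
    """Drop directories that lie inside another listed directory."""
    # Normalize and deduplicate. An ancestor is a string prefix of its
    # descendants, so it always sorts before them.
    normalized = sorted({os.path.abspath(d) for d in directories})
    kept = []
    for path in normalized:
        # A path is nested if an already-kept directory is its ancestor.
        if not any(os.path.commonpath([path, parent]) == parent for parent in kept):
            kept.append(path)
    return kept
```

With the two directories from the report above, only /home/antoine/Images would remain, since a recursive scan of it already covers Images/test.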

I don't know what could be causing the first error. Are you using any other options (e.g. backup directory) when it occurs? What is in the directories that it's scanning? If you can find a way to create a minimal working example then I can debug it.

I also fixed a bug that swapped the intended behavior of "oldest" and "newest" when using the --keep option.
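For clarity about what "oldest" and "newest" are meant to select, here is a rough sketch. It assumes that age is judged by modification time; the function name choose_keeper and the "first" value for the alphabetical mode are hypothetical and may not match the actual option values.

```python
import os


def choose_keeper(paths, keep):
    """Return the one path to keep from a group of duplicates."""
    if keep == "oldest":
        # Smallest modification timestamp = least recently modified.
        return min(paths, key=os.path.getmtime)
    if keep == "newest":
        # Largest modification timestamp = most recently modified.
        return max(paths, key=os.path.getmtime)
    if keep == "first":
        # Alphabetically first path (hypothetical mode name).
        return min(paths)
    raise ValueError("unknown keep mode: {!r}".format(keep))
```

Swapping min and max in the first two branches gives exactly the kind of inversion described above.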

edit
There is now a --list option to list duplicate files along with size and counts.

edit2

I found the error that leads to an endless loop and high CPU usage. Expect an update shortly.

edit3

The new version is up.  See the changelog and help message on the project page for changes (there are several).

Last edited by Xyne (2017-03-05 06:12:58)



#4 2017-03-05 14:24:37

SubS0
Member
Registered: 2015-02-10
Posts: 37

Re: rmdupes: another duplicate file remover

Here is what I get now (2017.3.5), with a directory containing 2 identical files:

$ rmdupes /tmp/rmdupes_tests

Select the file(s) to keep. Separate numbers by spaces.
Enter nothing to keep them all.
1: /tmp/rmdupes_tests/dwm.png
2: /tmp/rmdupes_tests/dwm2.png
> 1
Traceback (most recent call last):
  File "/usr/bin/rmdupes", line 759, in <module>
    main()
  File "/usr/bin/rmdupes", line 714, in main
    selected.update((d, (size,keepers[0])) for d in dupes if d not in keepers)
  File "/usr/bin/rmdupes", line 714, in <genexpr>
    selected.update((d, (size,keepers[0])) for d in dupes if d not in keepers)
TypeError: 'generator' object is not subscriptable

------

Just for information: before you released the update, it seems that the files involved in the endless loop were slightly different versions of the same picture. For example, a color-modified copy of the original dwm.png (edited with pinta or gimp), when both the original and the modified file were present. I still have an strace log, but it may not be relevant anymore.

#5 2017-03-05 15:21:34

Xyne
Moderator/TU
Registered: 2008-08-03
Posts: 6,544
Website

Re: rmdupes: another duplicate file remover

The selection dialogue error should be fixed. If the other error persists, please post the strace log. I would be surprised if the loop was due to similarity between the images: on a binary level, they are likely completely different, and even if they were almost identical, the filecmp.cmp algorithm would have to be really bad to end up in a loop when comparing them.

I'm curious about what it was, but I'm mostly hoping that the issue is fixed with the new code in the current version.
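For anyone who runs into a similar traceback: the TypeError reported above is what Python raises when a generator is indexed. Below is a minimal reproduction with one possible fix. The variable names mirror the traceback, but the pick helper and the surrounding code are assumed for illustration and are not taken from rmdupes.

```python
def pick(dupes, choices):
    """Return a generator over the files whose 1-based number was chosen."""
    return (dupes[i - 1] for i in choices)


dupes = ["/tmp/rmdupes_tests/dwm.png", "/tmp/rmdupes_tests/dwm2.png"]

# A generator cannot be indexed, so keepers[0] fails.
keepers = pick(dupes, [1])
try:
    keepers[0]
except TypeError as err:
    message = str(err)

# Materializing the generator makes indexing work. It also makes the
# membership test reliable: "d in generator" consumes the generator.
keepers = list(pick(dupes, [1]))
selected = {d: keepers[0] for d in dupes if d not in keepers}
```

After the fix, selected maps each file to be removed to the file that is kept in its place.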



#6 2017-03-05 17:52:27

SubS0
Member
Registered: 2015-02-10
Posts: 37

Re: rmdupes: another duplicate file remover

Yes, it now works well and fast! I will definitely adopt it, as it's way more comfortable and reliable than fdupes.

The options are very useful (--symlink is a great idea, and --keep is nice).

I just can't get the -r option to work.

rmdupes -r foo bar

reports that <target directory> is required. I tried several variations, with no success.

But with ' -b /bak/dir/ ' inserted between foo and bar, it works.

Am I missing something?

Last edited by SubS0 (2017-03-05 17:56:16)


#7 2017-03-05 18:10:16

Xyne
Moderator/TU
Registered: 2008-08-03
Posts: 6,544
Website

Re: rmdupes: another duplicate file remover

You can pass it multiple reference directories, so "-r" consumes arguments until the next option or the end. Use "--" to end option parsing, e.g.

# reference directory: foo
# target directory: bar
rmdupes -r foo -- bar


# reference directories: foo, fub
# target directories: bar, baz
rmdupes -r foo fub -- bar baz

or just invert the arguments:

rmdupes bar -r foo
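The same behavior can be reproduced with a small argparse sketch. This assumes a parser in which -r accepts one or more values (nargs="+"); the long option name --reference and the program name are hypothetical, and the real rmdupes parser may be set up differently.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="rmdupes-sketch")
    # One or more reference directories: "-r" keeps consuming arguments
    # until the next option, "--", or the end of the command line.
    parser.add_argument("-r", "--reference", nargs="+", default=[], metavar="DIR")
    # At least one target directory is required.
    parser.add_argument("targets", nargs="+", metavar="<target directory>")
    return parser


parser = build_parser()

# "-r foo bar": both paths go to -r, no target is left, argparse exits.
try:
    parser.parse_args(["-r", "foo", "bar"])
except SystemExit:
    greedy_failed = True
else:
    greedy_failed = False

# "--" stops -r from consuming "bar".
explicit = parser.parse_args(["-r", "foo", "--", "bar"])

# Putting the target first works as well.
inverted = parser.parse_args(["bar", "-r", "foo"])
```

In this sketch, inserting any other option between foo and bar also ends the -r list, which matches the observation that adding -b /bak/dir/ in between made the command work.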


#8 2017-03-05 18:52:31

SubS0
Member
Registered: 2015-02-10
Posts: 37

Re: rmdupes: another duplicate file remover

Ok, it’s clear now. I could have found it myself!

Thanks a lot for this useful tool.

#9 2020-04-23 23:18:13

Xyne
Moderator/TU
Registered: 2008-08-03
Posts: 6,544
Website

Re: rmdupes: another duplicate file remover

To anyone actually using this besides me, I've refactored the way rmdupes displays information and prompts for confirmations to make it less tedious (e.g. no more double confirmations in some cases, lists are now displayed in tables, default log level set to WARN).
Feel free to make further usability suggestions.


