
#26 2012-05-14 10:53:33

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

stqn wrote:

Thanks for rmlint. It would be useful to be able to specify the directory where duplicates must be found so that nothing is deleted somewhere else. I tried “rmlint ///mnt/disk1/ ///mnt/disk2/ .” but it didn’t work, sometimes duplicates were found in disk1 or disk2 and originals kept in “.”.

I also wanted to exclude “.svn”, “.hg” and other directories, but it wasn’t obvious how to do that. Also not delete empty files, but I seem to remember seeing such a feature in the git log…

Hi,

1) You probably want 'rmlint ///mnt/disk1 /mnt/disk2'; files in disk1 will then be treated as the originals.
2) Empty files can be omitted with -K or --no-emptyfiles.
3) Hidden files (files/dirs starting with a '.') are usually ignored by default. Do you have -G enabled, or do you want to exclude only "{.svn,.hg}"?

I admit that the manpage is not the clearest on earth.
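For example, combining 1) and 2) might look like this (a sketch only; flag names have changed between rmlint versions, so check rmlint --help first):

$ rmlint ///mnt/disk1 /mnt/disk2 -K
# the leading '//' marks /mnt/disk1 as the preferred source of originals,
# -K (--no-emptyfiles) leaves empty files out of the results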

Roken wrote:

This looks very useful. Ran it quickly in default mode on my data drive and was surprised to find that nearly 1,700 files are duplicated (many of them several times), though I'm not quite ready to trust the generated script just yet. Many of the duplicates are there for a reason other than redundancy.

The drive has 384 GB of used space, and the total run took only a few minutes. Very impressive.

Thanks!
You might want to use the option that replaces suspicious files with a symlink.

Offline

#27 2012-05-14 15:13:16

stqn
Member
Registered: 2010-03-19
Posts: 1,191
Website

Re: rmlint - a fast and light duplicate/lint finder

SahibBommelig wrote:

1) You probably want 'rmlint ///mnt/disk1 /mnt/disk2'; files in disk1 will then be treated as the originals.
2) Empty files can be omitted with -K or --no-emptyfiles.
3) Hidden files (files/dirs starting with a '.') are usually ignored by default. Do you have -G enabled, or do you want to exclude only "{.svn,.hg}"?

I admit that the manpage is not the clearest on earth.

1) I think you’re right. I wanted to scan all my disks at the same time to find duplicates in the current directory only, but it should work better by doing each disk separately.

2) I installed the “stable” rmlint package 1.0.8, so I didn’t have this option. Didn’t want to have trouble with a potentially buggy development version!

3) Hidden files were definitely not ignored, and I didn’t use the “-G” option. My command line was: “rmlint ///mnt/disk1/ ///mnt/disk2/ ///mnt/disk3/ . -Y -s”.

And finally, about the patterns, I would have liked a couple of examples in the man page, yes smile. Or maybe just the exact name of the regex syntax in use. For example, you seem to suggest I should write “{.svn,.hg}”, but according to Wikipedia it should be “\.svn|\.hg”… I could have tried, of course… wink

Offline

#28 2012-08-11 12:52:45

Dieter@be
Forum Fellow
From: Belgium
Registered: 2006-11-05
Posts: 2,000
Website

Re: rmlint - a fast and light duplicate/lint finder

SahibBommelig wrote:

Shameless bump.

At the request of bencahill:
- Added the possibility to mark a directory as the source (i.e. where the originals come from) when more than one directory is given.
  Just prepend the path with '//' to 'prefer' it.

  Example:

$ rmlint 'testdir/recursed_a' '//testdir/recursed_b'
    ls testdir/recursed_b/one
    rm testdir/recursed_a/one

    ..while normally...

$ rmlint 'testdir/recursed_a' 'testdir/recursed_b'
    ls testdir/recursed_a/one
    rm testdir/recursed_b/one
  

- Also sped up the --paranoid option (simply by using mmap()).

Why not just treat the order of the arguments as the order of preference for "originality"?
Example:

rmlint dir-A dir-B
-> if there are duplicates, dir-A has the originals because it's listed first

< Daenyth> and he works prolifically
4 8 15 16 23 42

Offline

#29 2012-09-13 02:37:48

Narga
Member
From: Hải Phòng, Việt Nam
Registered: 2011-10-06
Posts: 60
Website

Re: rmlint - a fast and light duplicate/lint finder

I have two files:
wpa_supplicant-1.0-1-x86_64.pkg.tar.xz
wpa_supplicant-1.0-2-x86_64.pkg.tar.xz
How can I tell rmlint to find them? I'm confused by the regex options.

Offline

#30 2012-09-13 10:47:20

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

What exactly do you want? To compare these two files? Just pass them to the command:

rmlint wpa_supplicant-1.0-1-x86_64.pkg.tar.xz wpa_supplicant-1.0-2-x86_64.pkg.tar.xz

Offline

#31 2012-09-14 00:42:21

Narga
Member
From: Hải Phòng, Việt Nam
Registered: 2011-10-06
Posts: 60
Website

Re: rmlint - a fast and light duplicate/lint finder

I store all of Arch Linux's pkg files as a backup, but some packages have multiple versions. I've tried using rmlint to search for packages that have multiple versions and then remove the old ones. As I mentioned above:
+/path/to/folder/wpa_supplicant-1.0-1-x86_64.pkg.tar.xz
+/another/path/to/folder/wpa_supplicant-1.0-2-x86_64.pkg.tar.xz
I want to find both packages and then delete wpa_supplicant-1.0-1-x86_64.pkg.tar.xz as the older one.

Offline

#32 2012-09-14 07:22:00

SanskritFritz
Member
From: Budapest, Hungary
Registered: 2009-01-08
Posts: 1,923
Website

Re: rmlint - a fast and light duplicate/lint finder

Narga, you may try pkgcacheclean.


zʇıɹɟʇıɹʞsuɐs AUR || Cycling in Budapest with a helmet camera || Revised log levels proposal: "FYI" "WTF" and "OMG" (John Barnette)

Offline

#33 2012-09-14 09:40:19

Narga
Member
From: Hải Phòng, Việt Nam
Registered: 2011-10-06
Posts: 60
Website

Re: rmlint - a fast and light duplicate/lint finder

No, I'm not cleaning the pkg folder. I store a lot of packages, but there are multiple versions of some of them after merging two folders. I want to search for duplicate files with the same name and delete the old versions (keeping the newer ones), like the two examples above. I've tried a regex like rmlint /path/to/folder -u -r [a-zA-Z0-9]-* but it doesn't work.

Offline

#34 2012-09-14 12:38:17

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Well, something like this should work for you:

$ rmlint path1 path2 -v5 | grep 'wpa_*' > rm_pkgs.sh
$ cat rm_pkgs.sh 
echo  'path1/wpa_supplicant-1.0-1-x86_64.pkg.tar.xz' # original
rm -f 'path2/wpa_supplicant-2.0-1-x86_64.pkg.tar.xz' # duplicate
rm -f 'path1/wpa_supplicant-42.0-1-x86_64.pkg.tar.xz' # duplicate

I should look at the regex options.
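Once the generated script looks right, it can simply be reviewed and executed (assuming a POSIX shell):

$ less rm_pkgs.sh    # double-check which files would be removed
$ sh rm_pkgs.sh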

Offline

#35 2012-09-17 03:18:16

Narga
Member
From: Hải Phòng, Việt Nam
Registered: 2011-10-06
Posts: 60
Website

Re: rmlint - a fast and light duplicate/lint finder

Thanks, with your command I found a way to find all packages with a regex:

rmlint path -r '[a-z0-9]-*.'

rmlint will then search for all files with the same name and display the items that have different versions, like the example I mentioned above.

Offline

#36 2012-09-23 17:03:51

Narga
Member
From: Hải Phòng, Việt Nam
Registered: 2011-10-06
Posts: 60
Website

Re: rmlint - a fast and light duplicate/lint finder

How can I tell rmlint to search just by file name, and not compare file sizes?

Offline

#37 2012-09-29 08:19:13

Narga
Member
From: Hải Phòng, Việt Nam
Registered: 2011-10-06
Posts: 60
Website

Re: rmlint - a fast and light duplicate/lint finder

I have a list of files in a folder:

qt-4.8.3-2-x86_64.pkg.tar.xz
qt-4.8.3-3-x86_64.pkg.tar.xz
qt-4.8.3-4-x86_64.pkg.tar.xz

The files have different sizes and times, and the filenames differ only in a few characters. How can I use rmlint to keep only qt-4.8.3-4-x86_64.pkg.tar.xz after the scan? I've tried many regexes to match all files of that kind despite the differing filenames, but it doesn't work.

Last edited by Narga (2012-09-29 08:20:50)

Offline

#38 2012-09-29 08:21:25

karol
Archivist
Registered: 2009-05-06
Posts: 25,440

Re: rmlint - a fast and light duplicate/lint finder

Narga wrote:

I have a list of files in a folder:

qt-4.8.3-2-x86_64.pkg.tar.xz
qt-4.8.3-3-x86_64.pkg.tar.xz
qt-4.8.3-4-x86_64.pkg.tar.xz

The files have different sizes and times, and the filenames differ only in a few characters. How can I use rmlint to keep only qt-4.8.3-4-x86_64.pkg.tar.xz after the scan? I've tried many regexes to match all files of that kind despite the differing filenames, but it doesn't work.

Why that file? Is it the newest, the smallest, or ...?

Offline

#39 2012-09-29 08:30:32

progandy
Member
Registered: 2012-05-17
Posts: 5,184

Re: rmlint - a fast and light duplicate/lint finder

Narga wrote:

I have a list of files in a folder:

qt-4.8.3-2-x86_64.pkg.tar.xz
qt-4.8.3-3-x86_64.pkg.tar.xz
qt-4.8.3-4-x86_64.pkg.tar.xz

The files have different sizes and times, and the filenames differ only in a few characters. How can I use rmlint to keep only qt-4.8.3-4-x86_64.pkg.tar.xz after the scan? I've tried many regexes to match all files of that kind despite the differing filenames, but it doesn't work.

It seems you are looking for functionality similar to cacheclean.


| alias CUTF='LANG=en_XX.UTF-8@POSIX ' |

Offline

#40 2012-09-29 09:36:02

Narga
Member
From: Hải Phòng, Việt Nam
Registered: 2011-10-06
Posts: 60
Website

Re: rmlint - a fast and light duplicate/lint finder

Yes, it's similar, but I guess rmlint can do the same thing. I hope it works.

Offline

#41 2012-09-29 10:20:45

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Narga wrote:

The files have different sizes and times, and the filenames differ only in a few characters. How can I use rmlint to keep only qt-4.8.3-4-x86_64.pkg.tar.xz after the scan? I've tried many regexes to match all files of that kind despite the differing filenames, but it doesn't work.

If the files have different sizes and different names, rmlint will not do anything with them.
I think you're better off with cacheclean as mentioned by progandy, or with a more general find command.
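In that spirit, a plain shell sketch (not rmlint, and only an illustration: it assumes the usual name-version-release-arch.pkg.tar.xz naming, GNU sort/head, and package names without regex metacharacters, and it only prints the older versions rather than deleting them):

#!/bin/sh
# For each package name in the current directory, list every file except the newest version.
ls *.pkg.tar.* 2>/dev/null \
  | sed -E 's/-[^-]+-[^-]+-[^-]+\.pkg\.tar\..*$//' \
  | sort -u \
  | while read -r name; do
        ls *.pkg.tar.* \
          | grep -E "^$name-[^-]+-[^-]+-[^-]+\.pkg\.tar\." \
          | sort -V \
          | head -n -1      # everything but the newest, version-sorted last
    done

The output could then be reviewed and passed to rm.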

Offline

#42 2012-10-13 02:39:58

likytau
Member
Registered: 2012-09-02
Posts: 142

Re: rmlint - a fast and light duplicate/lint finder

Is there any way to get rmlint to read sums from a file instead of calculating them? I use digup to keep track of the sha1sums of a whole 124 GB directory hierarchy, and it seems pretty silly for rmlint to churn through calculating all that data when the sums are already there in the sha1sum.txt file (a mere 40 MB).

Offline

#43 2012-10-13 09:44:13

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

likytau wrote:

Is there any way to get rmlint to read sums from a file instead of calculating them? I use digup to keep track of the sha1sums of a whole 124 GB directory hierarchy, and it seems pretty silly for rmlint to churn through calculating all that data when the sums are already there in the sha1sum.txt file (a mere 40 MB).

Nope, sorry.

Offline

#44 2012-10-18 01:44:56

likytau
Member
Registered: 2012-09-02
Posts: 142

Re: rmlint - a fast and light duplicate/lint finder

For anyone else who wanted to do the same thing: here's a script that produces a simple report on the duplicate files listed in a sha1sum.txt / md5sum.txt.

#!/bin/dash
# Report duplicate checksums from a sha1sum/md5sum-style file.
FNAME="$1"
# Checksums that appear more than once (comment lines starting with '#' are skipped).
DUPED=$(grep -Eve "^#" "$FNAME" | cut -f 1 -d ' ' | sort | uniq -d)

for SUM in $DUPED; do
  echo "# $SUM:"
  # Print the filenames recorded for this checksum, indented and quoted.
  # Note: filenames containing spaces will be truncated at the first space.
  grep "^$SUM  " "$FNAME" | cut -f 3 -d " " | sed -Ee 's/^([^"]+)$/         "\1"/g'
  echo
done
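Usage is straightforward (report_dupes.sh is just a made-up name for the script above):

$ sh report_dupes.sh sha1sum.txt > duplicates.txt
$ less duplicates.txt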

Offline

#45 2014-01-12 14:10:22

kozaki
Member
From: London >. < Paris
Registered: 2005-06-13
Posts: 671
Website

Re: rmlint - a fast and light duplicate/lint finder

Hi. I get the same issue that samhain ran into, https://bbs.archlinux.org/viewtopic.php … 5#p1009315 :
i.e. searching for duplicates with plain rmlint-git segfaults, while if I compile with

DEBUG=true make

I don't get a segfault and rmlint runs fine.
This only happens on my netbook (Intel Atom N450 with 1 GB RAM); rmlint-git works just fine on a bigger system with an Intel Core i3 and 4 GB RAM.

Should I go further, edit the optimization step and try again, just like samhain did?

BTW, rmlint is a great help here! E.g. it makes photo directories much cleaner after coming back from trips where we backed them up on various media, sometimes under different names, usually in a hurry and without much thinking :-} The various options added since 2011 are also extremely helpful for "concatenating" backups made on various dates into lighter and cleaner ones. Thank you Sahib!


Seeded last month: Arch 50 gig, derivatives 1 gig
Desktop @3.3GHz 8 gig RAM, linux-ck
laptop #1 Atom 2 gig RAM, Arch linux stock i686 (6H w/ 6yrs old battery smile) #2: ARM Tegra K1, 4 gig RAM, ChrOS
Atom Z520 2 gig RAM, OMV (Debian 7) kernel 3.16 bpo on SDHC | PGP Key: 0xFF0157D9

Offline

#46 2014-01-13 11:49:36

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Hello kozaki,

yes, please try the same procedure with different optimization levels.
But it would be even more helpful if you (or someone else) could provide a minimal set of files that makes rmlint crash at a certain optimisation level.
I could never reproduce this segfault myself, sadly (although I'm pretty sure it's my fault smile).
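If anyone hits the crash again, a backtrace from the crashing build would also help; a minimal sketch, assuming gdb is installed and using whatever path triggers the segfault:

$ gdb --args rmlint /path/that/crashes
(gdb) run
... wait for the segfault ...
(gdb) bt full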

Offline

#47 2014-01-13 21:47:30

kozaki
Member
From: London >. < Paris
Registered: 2005-06-13
Posts: 671
Website

Re: rmlint - a fast and light duplicate/lint finder

Hi Sahib,
OK, I will do that as soon as it crashes again. But since I ran it with valgrind, rmlint has been working like a dream, on directories where it always crashed before (on the netbook). Strange as it is, I'm positive about this, as I've launched it a few dozen times in the last few weeks (standard, then git version).

What I'm saying is that it now works fine on the exact same set of directories where it crashed until yesterday; until then, launching it on any directory with more than a few hundred megabytes (mixed text and pics) or quite a few gigabytes (audio & video) resulted in rmlint crashing, with and without options :-O


Seeded last month: Arch 50 gig, derivatives 1 gig
Desktop @3.3GHz 8 gig RAM, linux-ck
laptop #1 Atom 2 gig RAM, Arch linux stock i686 (6H w/ 6yrs old battery smile) #2: ARM Tegra K1, 4 gig RAM, ChrOS
Atom Z520 2 gig RAM, OMV (Debian 7) kernel 3.16 bpo on SDHC | PGP Key: 0xFF0157D9

Offline

#48 2014-06-14 02:11:00

SeeSpotRun
Member
Registered: 2014-06-14
Posts: 2

Re: rmlint - a fast and light duplicate/lint finder

This is a very useful program with a very powerful and fast algorithm; thanks Sahib.
I wanted a bit more control with regard to "original" folders and have started modifying a fork of the code - see https://github.com/SeeSpotRun/rmlint. Mainly: (1) added a new option "-O --keepallorig", which means no files in the 'original' path(s) prefixed by // will be deleted, and (2) added a new option "-M --mustmatchorig", which limits the search to only find matches in which at least one file is in the 'original' path(s). You can now prefix several paths with // and they will all be treated as 'originals'.
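Based on that description, usage of the fork might look like this (a sketch with made-up paths; these flags exist only in the fork, not in stock rmlint):

$ rmlint -O -M //backup/photos /home/user/Pictures
# -O: never delete anything under the //-prefixed 'original' path(s)
# -M: only report groups that contain at least one file from those path(s)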

If there is interest in helping with debugging this fork and/or further development, I'm open to feedback.

Offline

#49 2014-07-26 08:41:32

motmaos
Member
Registered: 2010-01-20
Posts: 9

Re: rmlint - a fast and light duplicate/lint finder

Hi Sahib,
this piece of software is great; I'm using it to clean up the horrible mess I have on my laptop and it is saving me SO MUCH time. Yet there is something strange I've just noticed. I've used the // prefix before to select the original directory, and now I'm trying to compare two different directories, named R and R2. I want R to be the one with the original files, so I give the command

rmlint //R R2

but it keeps considering the files in R as duplicates and the ones in R2 as originals. I can't really understand why; I hope someone can help me with this.
Thanks!

Mot

Offline

#50 2014-07-26 08:47:22

motmaos
Member
Registered: 2010-01-20
Posts: 9

Re: rmlint - a fast and light duplicate/lint finder

OK, I found a way to convince it to work as I want, but I suppose there is some kind of bug here.

This works:

rmlint  //./R R2

and this does not:

rmlint  //./R ./R2

I'm using the stable version, but the git one gives me the same error.

Offline
