
#1 2011-03-21 15:13:57

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

rmlint - a fast and light duplicate/lint finder

UPDATE 2014: rmlint was rewritten and has a new thread. Please use the new one.

I was asked to make a PKGBUILD, so here it is. :)

rmlint is a command-line tool that cleans your filesystem of various sorts of lint (duplicates, empty files/directories, ...).
It is written in pure C and tends to be much faster than fdupes, which seems to be the standard on (e.g.) Ubuntu.
Additionally, it is able to find empty files/directories, non-stripped binaries, files with the same basename (nameclusters), old temp data, strange filenames and bad links.
Because it dumps a log and a script, it is easy to adapt the result to your needs.

Development happens on: https://github.com/sahib/rmlint
See also the README for a detailed list of features.

Obligatory screenshot.

Any feedback, criticism or patches are welcome!

Last edited by SahibBommelig (2014-12-19 14:44:25)


#2 2011-03-21 15:28:13

SanskritFritz
Member
From: Budapest, Hungary
Registered: 2009-01-08
Posts: 1,615
Website

Re: rmlint - a fast and light duplicate/lint finder

Amazing little program! Thanks, really appreciate it.
Check your merge requests :)

Last edited by SanskritFritz (2014-03-05 10:04:35)


zʇıɹɟʇıɹʞsuɐs AUR || Cycling in Budapest with a helmet camera || Revised log levels proposal: "FYI" "WTF" and "OMG" (John Barnette)


#3 2011-03-21 17:02:08

bencahill
Member
Registered: 2011-03-20
Posts: 40

Re: rmlint - a fast and light duplicate/lint finder

@Sahib:

What's the best way to "uninstall" the rmlint I made from the .tar, so that I can install the AUR one (so pacman is keeping track of everything)? (first time building from source manually, i.e. without a PKGBUILD and makepkg)

Thanks.


#4 2011-03-21 17:06:25

igndenok
Member
From: Sidoarjo, Indonesia
Registered: 2010-06-07
Posts: 160

Re: rmlint - a fast and light duplicate/lint finder

Nice, it's time to install and find duplicate files. Thanks for this amazing application, Sahib :)


Ask, and it shall be given you.
Seek, and ye shall find.
Knock, and it shall be opened unto you.


#5 2011-03-21 17:22:44

SanskritFritz
Member
From: Budapest, Hungary
Registered: 2009-01-08
Posts: 1,615
Website

Re: rmlint - a fast and light duplicate/lint finder

bencahill wrote:

What's the best way to "uninstall" the rmlint I made from the .tar, so that I can install the AUR one (so pacman is keeping track of everything)? (first time building from source manually, i.e. without a PKGBUILD and makepkg)

The easiest way is
sudo make uninstall
Also if you installed the program to the same location as the PKGBUILD, then
pacman -Uf
overwrites all files.



#6 2011-03-21 18:27:48

bencahill
Member
Registered: 2011-03-20
Posts: 40

Re: rmlint - a fast and light duplicate/lint finder

SanskritFritz wrote:

The easiest way is
sudo make uninstall
Also if you installed the program to the same location as the PKGBUILD, then
pacman -Uf
overwrites all files.

sudo make uninstall didn't work (make: *** No rule to make target `uninstall'.  Stop.), but -Uf worked fine.

Thanks. :)


#7 2011-03-21 18:28:35

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

bencahill wrote:

What's the best way to "uninstall" the rmlint I made from the .tar, so that I can install the AUR one (so pacman is keeping track of everything)? (first time building from source manually, i.e. without a PKGBUILD and makepkg)

The simplest approach would probably be:
'sudo rm $DESTDIR/bin/rmlint $DESTDIR/share/man/man1/rmlint.1.gz' where DESTDIR is most likely /usr (..or /usr/local)
rmlint only installs two files...
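
A minimal sketch of the manual removal above, assuming the two files sit under a single install prefix (/usr by default, possibly /usr/local). The function name remove_rmlint and its prefix argument are made up for this illustration; they are not part of rmlint or its Makefile.

```shell
# Hypothetical helper: remove the two files that a manual install of
# rmlint placed under PREFIX (assumed to be /usr unless given).
remove_rmlint() {
    prefix=${1:-/usr}
    for f in "$prefix/bin/rmlint" "$prefix/share/man/man1/rmlint.1.gz"; do
        if [ -e "$f" ]; then
            # Report each file only if rm actually succeeded.
            rm -f -- "$f" && echo "removed $f"
        else
            echo "not found: $f"
        fi
    done
}

# Usage (needs root for system prefixes):
#   remove_rmlint /usr/local
```

Printing "not found" instead of failing makes it easy to try both /usr and /usr/local if you are not sure where the files ended up.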

The Makefile didn't have an 'uninstall' target, but I just added one.
But pacman -Uf will most probably do fine too.


#8 2011-03-22 19:31:59

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Shameless bump.

On the request of bencahill:
- Added a way to mark a directory as the source (i.e. where the originals come from) when passing more than one directory.
  Just prepend '//' to the path to 'prefer' it.

  Example:

$ rmlint 'testdir/recursed_a' '//testdir/recursed_b'
    ls testdir/recursed_b/one
    rm testdir/recursed_a/one

    ..while normally...

$ rmlint 'testdir/recursed_a' 'testdir/recursed_b'
    ls testdir/recursed_a/one
    rm testdir/recursed_b/one
  

- Also sped up the --paranoid option (simply by using mmap())


#9 2011-03-26 18:00:01

menta
Member
From: Hungary
Registered: 2010-07-01
Posts: 5

Re: rmlint - a fast and light duplicate/lint finder

Hi Sahib,

Nice to see the new '//' feature! :)

I was wondering whether the package should be named rmlint-git. The VCS PKGBUILD Guidelines state:

Properly suffix pkgname with -cvs, -svn, -hg, -darcs, -bzr or -git. If the package tracks a moving development trunk it should be given a suffix. If the package fetches a release from a VCS tag then it should not be given a suffix. Use this rule of thumb: if the output of the package depends on the time at which it was compiled, append a suffix; otherwise do not.

Thank you for rmlint! :)
Attila


#10 2011-03-26 23:22:27

bencahill
Member
Registered: 2011-03-20
Posts: 40

Re: rmlint - a fast and light duplicate/lint finder

SahibBommelig wrote:

On the request of bencahill:
- Added a way to mark a directory as the source (i.e. where the originals come from) when passing more than one directory.
  Just prepend '//' to the path to 'prefer' it.

I apologize for not answering... I really enjoy the new feature, and have used it at least a dozen times already :). Yes, I have that many duplicates. :P

Thanks again for the wonderful software :).

!give SahibBommelig cookie

;)


#11 2011-03-27 13:21:45

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Hey Attila, nice to meet you again ;)

Thanks for the hint, I must have overlooked those guidelines.
I made a new package 'rmlint-git', which is basically the same as before.
I also consider rmlint to be mostly finished software, so the '-git' suffix doesn't say much either.

Please vote if you like it.

@bencahill: I'll enjoy my cookie, thanks. Hope I don't already have a copy of it.


#12 2011-04-07 15:52:09

mentat
Member
From: France
Registered: 2009-01-13
Posts: 138
Website

Re: rmlint - a fast and light duplicate/lint finder

rmlint is a great piece of software, thanks a lot!

If I run it directly in my home folder, it returns the error:

FATAL: nftw():: Value too large for defined data type
No files in cache to search through => No duplicates.


#13 2011-04-07 19:07:04

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Hi,

This was already reported by someone else and should be (..hopefully..) already fixed.
This should only happen on 32-bit systems, and only with files larger than 2 GB.

Thanks for the report though :)

edit: typo.

Last edited by SahibBommelig (2011-04-08 11:23:07)


#14 2011-04-08 05:59:08

mentat
Member
From: France
Registered: 2009-01-13
Posts: 138
Website

Re: rmlint - a fast and light duplicate/lint finder

Hi Sahib,

It's working like a charm now!

Next time I'll update first before asking a question ;)
Otherwise, big kudos for rmlint, it rocks!

PS: glyr seems very interesting. Do you plan to open a topic in this forum for questions about it?


#15 2011-04-18 16:02:00

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Little update.

  • Some bugfixes are in; for example, the duplicate counter was not always exact.

  • Some crashes have also been fixed (thanks to Micheal and rider).

  • -c/-C now behave a bit more smoothly:

With -c you can specify a command that is executed for each duplicate found.
In the given command, '<orig>' and '<dupl>' are replaced with the paths of the original and the duplicate, respectively.

Example to simulate the behaviour of '-m link' without removing anything:

$ rmlint testdir -v5
echo  '/tmp/testcase2/a' # original
rm -f '/tmp/testcase2/a.copy' # duplicate
echo  '/tmp/testcase2/b.copy' # original
rm -f '/tmp/testcase2/b' # duplicate

$ rmlint testdir -v5 -c "rm '<dupl>' && ln -s '<orig>' '<dupl>'"
echo  '/tmp/testcase2/a' # original
rm '/tmp/testcase2/a.copy' && ln -s '/tmp/testcase2/a' '/tmp/testcase2/a.copy'
echo  '/tmp/testcase2/b.copy' # original
rm '/tmp/testcase2/b' && ln -s '/tmp/testcase2/b.copy' '/tmp/testcase2/b'

Also, -c replaces rmlint's default command in the script.
There is also -C, which does the same for originals, except that <dupl> is not expanded there.
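
To picture the placeholder expansion described above, here is a small shell sketch. It is only an illustration: rmlint does this internally in C, and the helper names esc and expand_template are invented for this example. It assumes paths without newlines or single quotes.

```shell
# Escape characters that are special in a sed replacement (\ and &)
# as well as the | delimiter used below.
esc() {
    printf '%s' "$1" | sed 's/[\\&|]/\\&/g'
}

# Hypothetical helper: expand <orig> and <dupl> in a command template.
#   $1 = template, $2 = path of the original, $3 = path of the duplicate
expand_template() {
    template=$1
    orig=$(esc "$2")
    dupl=$(esc "$3")
    printf '%s\n' "$template" | sed "s|<orig>|$orig|g; s|<dupl>|$dupl|g"
}

# Prints the expanded command for one original/duplicate pair.
expand_template "rm '<dupl>' && ln -s '<orig>' '<dupl>'" \
    /tmp/testcase2/a /tmp/testcase2/a.copy
```

For the pair used here, the result is the same line as in the -c example output above: the duplicate is removed and replaced by a symlink to the original.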

At the moment both packages (rmlint and rmlint-git) will result in the same binary.
Edit: rmlint should also work again on 32-bit systems. (Just fixed, sorry.)

P.S: Oh I have more than 10 votes? :-)

Last edited by SahibBommelig (2011-04-18 18:44:23)


#16 2011-10-28 11:55:28

samhain
Member
Registered: 2007-07-19
Posts: 39

Re: rmlint - a fast and light duplicate/lint finder

Hi,

I'm getting the following error when I run rmlint. It happens only in some directories, and always after rmlint starts to list the duplicates.

FATAL: Rmlint crashed due to a Segmentation fault! :(

No other detail is given in the output, just the Seg fault after listing some duplicates.

I have a 32-bit system.

I will be happy to follow your guidelines to locate the origin of the bug.

Cheers.


Arch is to Linux as Jeet Kune Do is to martial arts.


#17 2011-10-28 13:35:38

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Hello samhain,

Are you using the git version or the older tar.gz?
Could you please do the following:

git clone git://github.com/sahib/rmlint.git
cd rmlint
DEBUG=true make
valgrind ./rmlint <additional args>

...and post the (full) output of the last command?

Thanks for your help.


#18 2011-10-28 18:05:17

samhain
Member
Registered: 2007-07-19
Posts: 39

Re: rmlint - a fast and light duplicate/lint finder

I guess you missed the configure command in your instructions :)

I am using the rmlint-git package from AUR.

I followed your instructions and I noticed something strange.

If I compile with

DEBUG=true make

I don't get a Seg fault and rmlint runs fine:
(I guess that just the last lines of the output are enough)

=> In total 4684 files, whereof 716 are duplicate(s)
=> 6 other suspicious items found [0 B]
=> Totally  263.50 MB  [276297238 Bytes] can be removed.
=> Nothing removed yet!

A log has been written to rmlint.log.
A ready to use shellscript to rmlint.sh.
==12415== 
==12415== HEAP SUMMARY:
==12415==     in use at exit: 0 bytes in 0 blocks
==12415==   total heap usage: 23,942 allocs, 23,942 frees, 25,869,232 bytes allocated
==12415== 
==12415== All heap blocks were freed -- no leaks are possible
==12415== 
==12415== For counts of detected and suppressed errors, rerun with: -v
==12415== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 17 from 8)

BUT, if I compile rmlint with just

make

I get the following output:

valgrind ./rmlint /home/samhain/rpg/
==10978== Memcheck, a memory error detector
==10978== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==10978== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==10978== Command: ./rmlint /home/samhain/rpg/
==10978== 

# Empty file(s): 
   rm /home/samhain/rpg/heroquest/yeoldeInn/UKVersion/kellar/list
   rm /home/samhain/rpg/infinity/ora/ora-3.14/lanzaORA.log

# Empty dir(s): 
   rmdir /home/samhain/rpg/workshop/40k/game01
   rmdir /home/samhain/rpg/workshop/40k/Rosters
   rmdir /home/samhain/rpg/workshop/40k/campaignSystem/game01
   rmdir /home/samhain/rpg/workshop/40k/campaignSystem/Rosters

# Duplicate(s):==10978== Thread 3:
==10978== Invalid read of size 1
==10978==    at 0x804B703: ??? (in /home/samhain/dwork/rmlint/rmlint/rmlint)
==10978==    by 0x804C27B: ??? (in /home/samhain/dwork/rmlint/rmlint/rmlint)
==10978==    by 0x804D24C: ??? (in /home/samhain/dwork/rmlint/rmlint/rmlint)
==10978==    by 0x416DDED: clone (in /lib/libc-2.14.so)
==10978==  Address 0x6dd5000 is not stack'd, malloc'd or (recently) free'd
==10978== 
FATAL: Rmlint crashed due to a Segmentation fault! :(
FATAL: Please file a bug report (See rmlint -h)
==10978== 
==10978== HEAP SUMMARY:
==10978==     in use at exit: 290,798 bytes in 3,063 blocks
==10978==   total heap usage: 10,490 allocs, 7,427 frees, 19,848,237 bytes allocated
==10978== 
==10978== LEAK SUMMARY:
==10978==    definitely lost: 41,360 bytes in 260 blocks
==10978==    indirectly lost: 247,607 bytes in 2,781 blocks
==10978==      possibly lost: 272 bytes in 2 blocks
==10978==    still reachable: 1,559 bytes in 20 blocks
==10978==         suppressed: 0 bytes in 0 blocks
==10978== Rerun with --leak-check=full to see details of leaked memory
==10978== 
==10978== For counts of detected and suppressed errors, rerun with: -v
==10978== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 17 from 8)

Is this enough or do you need more info?



#19 2011-10-28 18:49:15

Awebb
Member
Registered: 2010-05-06
Posts: 4,437

Re: rmlint - a fast and light duplicate/lint finder

This is a very nice tool. I just used it to reduce the space taken up by duplicates (which I actually need): I took the script your tool generated and changed the echo and rm lines for the duplicates so that they create a hardlink instead.
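
A rough sketch of that kind of replacement, for anyone who wants to do the same. The function name hardlink_dupe is made up here and is not something rmlint generates; it assumes both files are on the same filesystem, since hard links cannot cross filesystems.

```shell
# Hypothetical helper: replace DUPL with a hard link to ORIG.
#   $1 = original, $2 = duplicate (same filesystem assumed)
hardlink_dupe() {
    orig=$1
    dupl=$2
    # Refuse to touch anything unless the contents are byte-identical.
    cmp -s -- "$orig" "$dupl" || return 1
    # -f removes the existing duplicate before creating the link.
    ln -f -- "$orig" "$dupl"
}

# Usage:
#   hardlink_dupe /path/to/original /path/to/duplicate
```

Keep in mind that both names then point at the same data, so editing one of them in place changes the other as well.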

Are you open to feature requests? :)


#20 2011-10-28 21:17:22

SahibBommelig
Member
From: Germany
Registered: 2010-05-28
Posts: 80

Re: rmlint - a fast and light duplicate/lint finder

Hello samhain,

I guess you missed the configure command in your instructions :)

Yupp - Pardon.

The valgrind error you are getting is very strange...
But I'd guess it's caused by the very high optimization level (although it has never caused any trouble until now).
Could you try changing the OPTI= line in the Makefile to something nicer like:

OPTI=-march=native -Os -s -finline-functions

Recompile (via plain 'make') and see if the problem persists. If so I will change the level in upstream.

Awebb:

Thanks.
You may post feature requests, but it might take a very long time until they're ready for use (I have very little time and mostly work on glyr/gmpc in my free time).
My guess would be that your feature request is "Can it do the hardlinks automagically?" - Yupp. (See the -C / -c option)

Some fun day I will probably rewrite this with GLib and some better IO..


#21 2011-10-29 12:49:45

samhain
Member
Registered: 2007-07-19
Posts: 39

Re: rmlint - a fast and light duplicate/lint finder

SahibBommelig wrote:

Could you try changing the OPTI= line in the Makefile to something nicer like:

OPTI=-march=native -Os -s -finline-functions

Recompile (via plain 'make') and see if the problem persists. If so I will change the level in upstream.

Done, this is the error I get:

==31568== Memcheck, a memory error detector
==31568== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==31568== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==31568== Command: ./rmlint /home/samhain/rpg/
==31568== 

# Empty file(s): 
   rm /home/samhain/rpg/heroquest/yeoldeInn/UKVersion/kellar/list
   rm /home/samhain/rpg/infinity/ora/ora-3.14/lanzaORA.log

# Empty dir(s): 
   rmdir /home/samhain/rpg/workshop/40k/game01
   rmdir /home/samhain/rpg/workshop/40k/Rosters
   rmdir /home/samhain/rpg/workshop/40k/campaignSystem/game01
   rmdir /home/samhain/rpg/workshop/40k/campaignSystem/Rosters

# Duplicate(s):==31568== Thread 3:
==31568== Invalid read of size 1
==31568==    at 0x804B3C4: ??? (in /home/samhain/dwork/rmlint/rmlint/rmlint)
==31568==    by 0x804B723: ??? (in /home/samhain/dwork/rmlint/rmlint/rmlint)
==31568==    by 0x804BF84: ??? (in /home/samhain/dwork/rmlint/rmlint/rmlint)
==31568==    by 0x405DCA6: start_thread (in /lib/libpthread-2.14.so)
==31568==    by 0x416DDED: clone (in /lib/libc-2.14.so)
==31568==  Address 0x6dd5000 is not stack'd, malloc'd or (recently) free'd
==31568== 
FATAL: Rmlint crashed due to a Segmentation fault! :(
FATAL: Please file a bug report (See rmlint -h)
==31568== 
==31568== HEAP SUMMARY:
==31568==     in use at exit: 290,798 bytes in 3,063 blocks
==31568==   total heap usage: 10,490 allocs, 7,427 frees, 19,848,237 bytes allocated
==31568== 
==31568== LEAK SUMMARY:
==31568==    definitely lost: 41,252 bytes in 259 blocks
==31568==    indirectly lost: 247,219 bytes in 2,776 blocks
==31568==      possibly lost: 768 bytes in 8 blocks
==31568==    still reachable: 1,559 bytes in 20 blocks
==31568==         suppressed: 0 bytes in 0 blocks
==31568== Rerun with --leak-check=full to see details of leaked memory
==31568== 
==31568== For counts of detected and suppressed errors, rerun with: -v
==31568== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 17 from 8)


#22 2011-10-29 13:40:06

Cloudef
Member
Registered: 2010-10-12
Posts: 636

Re: rmlint - a fast and light duplicate/lint finder

It uses threads?

valgrind --tool=helgrind ./rmlint

Might help you guys too.
Also try running it under gdb and getting a backtrace.


#23 2011-11-13 20:35:02

MartinZ
Member
From: Chiloé, Chile
Registered: 2005-06-10
Posts: 378

Re: rmlint - a fast and light duplicate/lint finder

I just wanna thank you for this amazing app. Please keep up the good work.


All your base are belong to us


#24 2012-05-13 21:40:22

stqn
Member
Registered: 2010-03-19
Posts: 1,189
Website

Re: rmlint - a fast and light duplicate/lint finder

Thanks for rmlint. It would be useful to be able to specify the directory where duplicates must be found, so that nothing is deleted anywhere else. I tried “rmlint ///mnt/disk1/ ///mnt/disk2/ .”, but it didn’t work: sometimes duplicates were found in disk1 or disk2 and the originals were kept in “.”.

I also wanted to exclude “.svn”, “.hg” and other directories, but it wasn’t obvious how to do that. I would also like empty files not to be deleted, but I seem to remember seeing such a feature in the git log…
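
For comparison, the two filters mentioned above (skip VCS directories, ignore empty files) can be sketched outside of rmlint with find. This is a plain-shell illustration and not an rmlint option; the function name list_dupes is invented, and it assumes GNU coreutils (md5sum) and filenames without newlines. It only lists candidates and deletes nothing.

```shell
# Hypothetical helper: list files with identical MD5 sums below DIR,
# skipping .svn/.hg directories and empty files.
list_dupes() {
    find "$1" \( -name .svn -o -name .hg \) -prune -o \
        -type f -size +0 -exec md5sum {} + |
        sort |
        awk '
            # Same checksum as the previous line: print the whole group.
            $1 == prev { if (!shown) print prevline; print; shown = 1 }
            $1 != prev { shown = 0 }
            { prev = $1; prevline = $0 }
        '
}

# Usage:
#   list_dupes /mnt/disk1
```

A matching checksum is strong evidence that two files are identical, but not proof, so it is worth comparing the candidates with cmp before removing anything.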


#25 2012-05-14 10:34:48

Roken
Member
From: UK
Registered: 2012-01-16
Posts: 670

Re: rmlint - a fast and light duplicate/lint finder

This looks very useful. I ran it quickly in default mode on my data drive and was surprised to find that nearly 1,700 files are duplicated (many of them several times), though I'm not quite ready to trust the generated script just yet. Many of the duplicates are there for a reason other than redundancy.

The drive has 384 GB in use, and the whole run took only a few minutes. Very impressive.



Nvidia GTX 670 2Gb, AMD Phenom II X4 (965BE) @ 3.6 Ghz (Overclocked) 8GB RAM
Linux user #545703

