#1 2011-02-27 16:41:54

whoops
Member
Registered: 2009-03-19
Posts: 891

[solved]bash help please: sort files & recognize duplicates

Hi!

I'm trying to move/rename all files in a folder and its subfolders by age, and move duplicate files somewhere else. The output looks mostly fine, and so do the tmp files... but I can't figure out why the duplicate file handling isn't working. edit: the line that sets the new name (&& newname="$1/.dupes/$dname/$part1_$part2_$fname" \) just doesn't seem to get executed here...

#!/bin/bash
pushd "$1" && {
    rafi="/tmp/$RANDOM.size.md5.list" && echo > "$rafi"
    find $1 -type f -print0 | while read -d $'\0' file;
        do \
            fname=$(basename "$file")
            dname=$(dirname "$file")
            part1=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $1"-"$2"-"$3}')
            part2=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $4"-"$5"-"$6}')
            echo "$bname" | grep -q "$part1" && part1=""
            echo "$bname" | grep -q "$part2" && part2=""
            newname="$1/$part1/$part2_$fname"
            size="size$(stat -c%s "$file")"
            grep -q "$size" "$rafi" \
                && {
                    md5="md5$(md5sum -b "$file" | sed "s/ .*//")"
                    grep -q "$md5" "$rafi" \
                        && newname="$1/.dupes/$dname/$part1_$part2_$fname" \
                        || {
                            echo "$md5" >> "$rafi"
                        } \
                } || {
                    echo "$size" >> "$rafi"
                }
            newname=$(echo "$newname" | sed "s/\/\/*/\//g" | sed "s/\.~.~//g")
            if [ "$newname" != "$file" ];
                then if ! [ -d $(dirname "$newname") ]
                        then echo mkdir -p $(dirname "$newname") -v
                    fi
                    echo mv "$file" "$newname" -v --backup=numbered
                fi
        done
    popd
}

What am I doing wrong? Been trying for hours - I just don't get it...


Thanks!

Last edited by whoops (2011-03-01 10:19:48)


#2 2011-02-27 17:38:55

steve___
Member
Registered: 2008-02-24
Posts: 452

Re: [solved]bash help please: sort files & recognize duplicates

Can you supply sample data as well as the desired output?


#3 2011-02-27 19:12:33

whoops
Member
Registered: 2009-03-19
Posts: 891

Re: [solved]bash help please: sort files & recognize duplicates

Those files are duplicates:

LC_ALL=C md5sum /etc/X11/xorg.*
98b654bf1ab8c1fc6f5fe8b4a7d585e4  /etc/X11/xorg.conf
98b654bf1ab8c1fc6f5fe8b4a7d585e4  /etc/X11/xorg.conf.bak
md5sum: /etc/X11/xorg.conf.d: Is a directory
98b654bf1ab8c1fc6f5fe8b4a7d585e4  /etc/X11/xorg.conf.dub

Testing the script on /etc/X11, only showing lines for "dupes". There should be two "mv" lines (for two of the three files), but there's only one.

$ sort_by_date /etc/X11/ | grep dupe
mkdir -p /etc/X11/.dupes/etc/X11 -v
mv /etc/X11/xorg.conf.dub /etc/X11/.dupes/etc/X11/xorg.conf.dub -v --backup=numbered

Strange that in /etc/X11 parts of it work. It doesn't find any of the duplicate files in my photo / digicam folder, despite there being a lot of them and a lot of md5s showing up in the tmp file.

full output:

 sort_by_date /etc/X11/ 
/etc/X11 ~
mkdir -p /etc/X11/2011-02-27 -v
mv /etc/X11/xorg.conf.bak /etc/X11/2011-02-27/xorg.conf.bak -v --backup=numbered
mkdir -p /etc/X11/2010-06-21 -v
mv /etc/X11/xorg.conf.d/20-nvidia.conf /etc/X11/2010-06-21/20-nvidia.conf -v --backup=numbered
mkdir -p /etc/X11/2010-08-24 -v
mv /etc/X11/xorg.conf.d/10-evdev.conf /etc/X11/2010-08-24/10-evdev.conf -v --backup=numbered
mkdir -p /etc/X11/2010-09-01 -v
mv /etc/X11/xorg.conf.d/20-nvidia.conf.pacnew /etc/X11/2010-09-01/20-nvidia.conf.pacnew -v --backup=numbered
mkdir -p /etc/X11/2010-11-30 -v
mv /etc/X11/xorg.conf.d/20-keyboard.conf /etc/X11/2010-11-30/20-keyboard.conf -v --backup=numbered
mkdir -p /etc/X11/2010-05-26 -v
mv /etc/X11/xorg.conf.d/10-quirks.conf /etc/X11/2010-05-26/10-quirks.conf -v --backup=numbered
mkdir -p /etc/X11/2010-01-03 -v
mv /etc/X11/xorg.conf /etc/X11/2010-01-03/xorg.conf -v --backup=numbered
mkdir -p /etc/X11/.dupes/etc/X11 -v
mv /etc/X11/xorg.conf.dub /etc/X11/.dupes/etc/X11/xorg.conf.dub -v --backup=numbered
mkdir -p /etc/X11/2010-07-10 -v
mv /etc/X11/Xsession /etc/X11/2010-07-10/Xsession -v --backup=numbered
mkdir -p /etc/X11/2010-12-22 -v
mv /etc/X11/xinit/xinitrc.d/30-dbus /etc/X11/2010-12-22/30-dbus -v --backup=numbered
mkdir -p /etc/X11/2010-11-28 -v
mv /etc/X11/xinit/xinitrc.d/40-libcanberra-gtk-module /etc/X11/2010-11-28/40-libcanberra-gtk-module -v --backup=numbered
mkdir -p /etc/X11/2010-12-19 -v
mv /etc/X11/xinit/xserverrc /etc/X11/2010-12-19/xserverrc -v --backup=numbered
mkdir -p /etc/X11/2010-11-15 -v
mv /etc/X11/xinit/xinitrc /etc/X11/2010-11-15/xinitrc -v --backup=numbered
~

So, the rest seems to work... almost: $part2 isn't attached to the name because that does not work with the underscore (without underscore it does work) and I don't know how to fix that either.

newname="$1/$part1/$part2_$fname"


#4 2011-02-27 22:03:48

juster
Forum Fellow
Registered: 2008-10-07
Posts: 195

Re: [solved]bash help please: sort files & recognize duplicates

whoops wrote:
newname="$1/$part1/$part2_$fname"

You can force parameter names to start and stop by using curly brackets. So your example above would be:

newname="$1/$part1/${part2}_${fname}"

Or something similar.
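
A minimal sketch of the difference, using made-up sample values for part2 and fname (they are assumptions for illustration, not taken from the real run):

```shell
#!/bin/bash
# Hypothetical sample values, chosen only for illustration.
part2="19-43-36"
fname="xorg.conf"

# "$part2_" is read as the (unset) variable named part2_, so it expands to nothing.
without_braces="$part2_$fname"

# Braces end the variable name explicitly, so the underscore stays literal.
with_braces="${part2}_${fname}"

printf 'without braces: %s\n' "$without_braces"   # without braces: xorg.conf
printf 'with braces:    %s\n' "$with_braces"      # with braces:    19-43-36_xorg.conf
```

Underscores are valid in variable names, which is why bash keeps reading past "part2"; a slash or a dot would have ended the name on its own.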

I don't usually say this for bash but your style is hard to read. I gave up trying to figure out what you are doing. || and && are handy for one-liners but they are really ugly for large blocks of code. Most people keep then and do's on the same line as if or while but this is not as big a deal. That is of course only my opinion so please don't feel bad. In exchange for my cruelness here is something I came up with that you might consider using for finding duplicates with md5sum:

md5sum * 2>/dev/null | sort -k 1 | uniq -d -w 32 | cut -c 35-

Your bash programming is only as strong as your knowledge of builtin unix commands ;-). Didn't some wise man once say all computer operations are a bastardized form of sorting? You might consider a similar technique for grouping files by their same dates. Instead of md5sums preface the filename with its time, sort the list, ?, profit!
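
The "preface the filename with its time, then sort" idea could look roughly like this. It is only a sketch: list_by_date is a hypothetical helper name, and it assumes GNU find (for -printf) and filenames without newlines.

```shell
#!/bin/bash
# list_by_date is a hypothetical helper; it relies on GNU find's -printf.
list_by_date() {
    # Print "YYYY-MM-DD<TAB>path" for every regular file under $1,
    # then sort so that files modified on the same day end up adjacent.
    find "$1" -type f -printf '%TY-%Tm-%Td\t%p\n' | sort
}
```

Calling list_by_date /etc/X11 would then print one line per file, grouped by modification date, which a later loop could turn into mkdir/mv commands.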

edit: s/similiar/similar/ gah!

Last edited by juster (2011-02-27 22:08:15)


#5 2011-02-28 12:26:15

whoops
Member
Registered: 2009-03-19
Posts: 891

Re: [solved]bash help please: sort files & recognize duplicates

thx, that brought me a little closer!

juster wrote:

I don't usually say this for bash but your style is hard to read.

Oops... thought so - never had an easy time writing readable code. Or readable anything... Maybe this is not as bad:

#!/bin/bash
if pushd "$1"
then
    # random tmp file for filesize + md5 
    rafi="/tmp/$RANDOM.size.md5.list" && echo > "$rafi"
    # for all files in folder $1 + subfolders...
    find $1 -type f -print0 | while read -d $'\0' file
    do
        fname=$(basename "$file")
        dname=$(dirname "$file")
        # Get year-month-day
        part1=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $1"-"$2"-"$3}')
        # Get hour-minute-second
        part2=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $4"-"$5"-"$6}')
        # Don't add this info to new filename if it is already in old filename
        echo "$bname" | grep -q "$part1" && part1=""
        echo "$bname" | grep -q "$part2" && part2=""
        # Destination filename
        newname="$1/$part1/${part2}_$fname"
        size="size$(stat -c%s "$file")"
        # have files with the same size already been processed?
        if grep -q $size "$rafi" 
        then
            md5="md5$(md5sum -b "$file" | sed "s/ .*//")"
            # have files with the same md5 already been processed?
            if grep -q $md5 "$rafi" 
                # set different destination for duplicate files
                then newname="$1/.dupes/$dname/${part1}_${part2}_$fname"
                # write md5 to tmp file
                else echo $md5 >> "$rafi"
            fi
        # write size to tmp file
        else echo "$size" >> "$rafi"
        fi
        # remove double slashes
        newname=$(echo "$newname" | sed "s/\/\/*/\//g" | sed "s/\.~.~//g")
        if [ "$newname" != "$file" ]
        then if ! [ -d $(dirname "$newname") ]
            then echo mkdir -p $(dirname "$newname") -v
            fi
            echo mv "$file" "$newname" -v --backup=numbered
        fi
    done
    popd
fi

Renaming seems to work correctly now, but it's still only giving me one "dupe mv line" for my test in /etc/X11 and it's not giving me any for the many duplicates in my huge digicam folder.

$ sort_by_date /etc/X11/ 
~
mkdir -p /etc/X11/2011-02-27 -v
mv /etc/X11/xorg.conf.bak /etc/X11/2011-02-27/19-43-36_xorg.conf.bak -v --backup=numbered
mkdir -p /etc/X11/2010-06-21 -v
mv /etc/X11/xorg.conf.d/20-nvidia.conf /etc/X11/2010-06-21/10-12-35_20-nvidia.conf -v --backup=numbered
mkdir -p /etc/X11/2010-08-24 -v
mv /etc/X11/xorg.conf.d/10-evdev.conf /etc/X11/2010-08-24/15-45-35_10-evdev.conf -v --backup=numbered
mkdir -p /etc/X11/2010-09-01 -v
mv /etc/X11/xorg.conf.d/20-nvidia.conf.pacnew /etc/X11/2010-09-01/12-16-10_20-nvidia.conf.pacnew -v --backup=numbered
mkdir -p /etc/X11/2010-11-30 -v
mv /etc/X11/xorg.conf.d/20-keyboard.conf /etc/X11/2010-11-30/10-27-00_20-keyboard.conf -v --backup=numbered
mkdir -p /etc/X11/2010-05-26 -v
mv /etc/X11/xorg.conf.d/10-quirks.conf /etc/X11/2010-05-26/21-51-03_10-quirks.conf -v --backup=numbered
mkdir -p /etc/X11/2010-01-03 -v
mv /etc/X11/xorg.conf /etc/X11/2010-01-03/10-56-25_xorg.conf -v --backup=numbered
mkdir -p /etc/X11/.dupes/etc/X11 -v
mv /etc/X11/xorg.conf.dub /etc/X11/.dupes/etc/X11/2011-02-27_19-43-52_xorg.conf.dub -v --backup=numbered
mkdir -p /etc/X11/2010-07-10 -v
mv /etc/X11/Xsession /etc/X11/2010-07-10/20-01-30_Xsession -v --backup=numbered
mkdir -p /etc/X11/2010-12-22 -v
mv /etc/X11/xinit/xinitrc.d/30-dbus /etc/X11/2010-12-22/15-39-41_30-dbus -v --backup=numbered
mkdir -p /etc/X11/2010-11-28 -v
mv /etc/X11/xinit/xinitrc.d/40-libcanberra-gtk-module /etc/X11/2010-11-28/02-44-08_40-libcanberra-gtk-module -v --backup=numbered
mkdir -p /etc/X11/2010-12-19 -v
mv /etc/X11/xinit/xserverrc /etc/X11/2010-12-19/11-58-47_xserverrc -v --backup=numbered
mkdir -p /etc/X11/2010-11-15 -v
mv /etc/X11/xinit/xinitrc /etc/X11/2010-11-15/11-39-17_xinitrc -v --backup=numbered
~

Still don't get where this problem could be coming from...


md5sum * 2>/dev/null | sort -k 1 | uniq -d -w 32 | cut -c 35-

That's beautiful, I just have no idea how to use it in the context of what I'm trying to do.

Last edited by whoops (2011-02-28 13:00:27)


#6 2011-02-28 16:45:56

freak
Member
Registered: 2009-04-15
Posts: 17

Re: [solved]bash help please: sort files & recognize duplicates

This script is a mess, to be honest. It would be a lot more helpful to have a better explanation of what it's trying to do in English, with sample input and output.


#7 2011-02-28 16:58:28

steve___
Member
Registered: 2008-02-24
Posts: 452

Re: [solved]bash help please: sort files & recognize duplicates

Building on what juster wrote, these commands will generate a list of files which have duplicates:

shopt -s globstar
for i in /path/to/files/**; do
    [[ ! -d $i ]] && md5sum "$i"
done | sort | uniq -w 32 -D
shopt -u globstar

I hope this helps.

Last edited by steve___ (2011-02-28 17:01:07)


#8 2011-02-28 17:44:50

freak
Member
Registered: 2009-04-15
Posts: 17

Re: [solved]bash help please: sort files & recognize duplicates

I think you need /path/to/files/**/*

I don't know if you're gonna be looking at duplicates in the same directory, or scattered.  And I don't know how he wants to determine which file to keep.


#9 2011-02-28 18:22:46

whoops
Member
Registered: 2009-03-19
Posts: 891

Re: [solved]bash help please: sort files & recognize duplicates

freak wrote:

It would be a lot more helpful to have a better explanation of what it's trying to do in English, with sample input and output.

I'm not sure I get what's missing... I took "/etc/X11" as sample input. Two extra copies of the xorg.conf (see first post) are in there, no other duplicate files. Sample output is the list of mkdir / mv commands. It is supposed to move/rename all files in directory $1 and its subdirectories to a new destination (folder/date/time_filename) and move duplicate files somewhere else (.dupes/folder/date_time_filename). There is no "determining which file is kept": the first one is moved to the normal new folder, all subsequent ones should be moved to the special ".dupes" folder.
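
For reference, the behaviour described above can be sketched in a few lines. This is not the script from this thread: plan_moves is a hypothetical name, it assumes bash 4+ (associative arrays), GNU find/sort/date, it hashes every file instead of checking sizes first, it puts duplicates directly under .dupes rather than rebuilding the old folder path, and it only prints the mv commands.

```shell
#!/bin/bash
# plan_moves is a hypothetical helper: a dry run that prints mv commands.
plan_moves() {
    local root=${1%/} file sum day time name
    local -A seen=()                      # md5 -> first file seen with that content
    while IFS= read -r -d '' file; do
        name=${file##*/}
        day=$(date -r "$file" '+%Y-%m-%d')    # modification date
        time=$(date -r "$file" '+%H-%M-%S')   # modification time
        sum=$(md5sum < "$file")
        sum=${sum%% *}                        # keep only the hash
        if [[ -n ${seen[$sum]} ]]; then
            # Same content as an earlier file: send it to the dupes tree.
            echo mv "$file" "$root/.dupes/${day}_${time}_${name}"
        else
            # First file with this content: normal destination.
            seen[$sum]=$file
            echo mv "$file" "$root/$day/${time}_${name}"
        fi
    done < <(find "$root" -type f -print0 | sort -z)
}
```

Running plan_moves /etc/X11/ would print one mv line per file; only the first file of each set of identical files gets the normal date folder.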

edit: Should I take a different folder as sample input? Is there a good one somewhere with duplicate files that looks about the same on every arch machine?

I'm trying to figure out what's wrong with that script... Been reading it again & again + testing different parts, it just doesn't work but to me the code looks as if it should do exactly what I want. Guess I'll try to make it less of a mess & put in more useful comments now again... just not sure how.

steve___ wrote:

Building on what juster wrote, these commands will generate a list of files which have duplicates:

shopt -s globstar
for i in /path/to/files/**; do
    [[ ! -d $i ]] && md5sum "$i"
done | sort | uniq -w 32 -D
shopt -u globstar

I hope this helps.

Thx! I'll keep that one in mind in case I have to give up getting mine to work without understanding what I did wrong!

Last edited by whoops (2011-02-28 18:29:29)


#10 2011-02-28 20:06:46

steve___
Member
Registered: 2008-02-24
Posts: 452

Re: [solved]bash help please: sort files & recognize duplicates

freak wrote:

I think you need /path/to/files/**/*

With globstar set, ** matches files and directories at any depth;
**/ matches directories only.
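
A quick way to check this, sketched with a throwaway directory (the file names top.txt and deep.txt are made up for the test):

```shell
#!/bin/bash
shopt -s globstar nullglob

# Build a small hypothetical tree to glob against.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/top.txt" "$demo/sub/deep.txt"

all=( "$demo"/** )     # files and directories, at every depth
dirs=( "$demo"/**/ )   # trailing slash: directories only

# Print the matches with the temp-dir prefix stripped for readability.
printf 'all:  %s\n' "${all[@]#"$demo"}"
printf 'dirs: %s\n' "${dirs[@]#"$demo"}"

shopt -u globstar nullglob
rm -r "$demo"
```

Because ** already descends into subdirectories, the [[ ! -d $i ]] test in the loop is what filters the directories out.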


#11 2011-03-01 02:11:44

milomouse
Member
Registered: 2009-03-24
Posts: 940

Re: [solved]bash help please: sort files & recognize duplicates

i rewrote it without changing the majority of your style so you could easily distinguish what changed, but personally i would change the formatting.
but anyway, this works for me so hopefully it should work for you too. currently having it 'echo' instead of creating or moving, such as in your example. and with this, try your previous example of /etc/X11:

#!/bin/bash
if [[ -d "$1" ]]; then
  # random tmp file for filesize + md5 
  rafi=/tmp/${RANDOM}.size.md5.list
  # for all files in folder $1 + subfolders...
  find "$1" -type f -print0 | while read -d $'\0' file; do
    # get filename (fname) and dirname (dname)
    fname=${file##*/}
    dname=${file%/*}
    # get year-month-day
    fsize=$(stat -c%y "$file")
    # for redirecting output to descriptor 6
    exec 6>/dev/null
    part1=${fsize::(1,10)}
    # get hour-minute-second
    part2=$(echo ${fsize:11:8}>&6;echo ${_//:/-})
    # don't add this info to new filename if it is already in old filename
    [[ ${#fname} != $(echo ${fname/${part1}}>&6;echo ${#_}) ]] && part1=""
    [[ ${#fname} != $(echo ${fname/${part2}}>&6;echo ${#_}) ]] && part2=""
    # destination filename
    newname=$(echo "$1/${part1}/${part2}_${fname}">&6;echo ${_//\/\///})
    size="size$(stat -c%s "$file")"
    # write size to tmp file
    echo "$size" >> "$rafi"
    # write md5 sum to tmp file
    md5=md5$(md5sum -b "$file")
    md5=${md5::(1,35)}
    echo $md5 >> "$rafi"
    # are there files with duplicate md5 sums?
    if [[ $(grep -c $md5 "$rafi") -gt 1 ]]; then
      newname="${dname}/.dupes/${part1}_${part2}_${fname}"
    fi
    if [[ "$newname" != "$file" ]]; then
      [[ ! -d ${newname%/*} ]] && echo mkdir -p "${newname%/*}" -v
      echo mv "$file" "$newname" -v --backup=numbered
    fi
    # close/free descriptor 6
    exec 6<&-
  done
fi

NOTE: i suck at bash and find it terrifying that bash cannot nest parameter expansions, hence redirecting descriptor and using a couple temp params, and also static variable splitting.
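
for what it's worth, a sketch of one way around the nesting limit: do one expansion per assignment, with an intermediate variable. the mtime value below is a made-up sample of what stat -c%y prints, and hms is a hypothetical helper variable.

```shell
#!/bin/bash
# Hypothetical sample of `stat -c%y` output, for illustration only.
mtime="2011-02-27 19:43:36.000000000 +0100"

part1=${mtime:0:10}    # first 10 characters: the date
hms=${mtime:11:8}      # 8 characters from offset 11: hh:mm:ss
part2=${hms//:/-}      # second step: replace every colon with a dash

printf '%s %s\n' "$part1" "$part2"   # 2011-02-27 19-43-36
```

this needs no extra file descriptor and no $_ trick, at the cost of one more variable.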

so, even after reviewing your script i couldn't tell exactly what was wrong. maybe it was how you checked the md5sums. also, i noticed you set some variables but didn't use them. i think they were typos. bname should equal dname, etc. not sure if you need to keep the /tmp/$RANDOM.. file at the end of operation or not.

please let me know if i accidentally removed something you needed and if this is or isn't what you were trying to accomplish.

EDIT: i just noticed that if part1 is in filename it won't create that directory(?)

Last edited by milomouse (2011-03-01 02:18:39)


#12 2011-03-01 10:16:47

whoops
Member
Registered: 2009-03-19
Posts: 891

Re: [solved]bash help please: sort files & recognize duplicates

Great, many thanks!
After reading that rewritten version I finally get it!

Argh, I dropped an essential part of the script days ago (I skipped getting the md5 of the first same-size file of every "pack"!) while trying to simplify the code, and I just couldn't see that it was missing because it was still in my head (also, there was a less important typo that you found). Sort of like this:

#!/bin/bash
if pushd "$1"
then
    # random tmp file for filesize + md5 
    rafi="/tmp/$RANDOM.size.md5.list" && echo > "$rafi"
    # for all files in folder $1 + subfolders...
    find "$1" -type f -print0 | sort -z | while read -d $'\0' file
    do
        fname=$(basename "$file")
        dname=$(dirname "$file")
        # Get year-month-day
        part1=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $1"-"$2"-"$3}')
        # Get hour-minute-second
        part2=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $4"-"$5"-"$6}')
        # Don't add this info to new filename if it is already in old filename
        echo "$fname" | grep -q "$part1" && part1=""
        echo "$fname" | grep -q "$part2" && part2=""
        # Destination filename
        newname="$1/$part1/${part2}_$fname"
        size="size$(stat -c%s "$file")"
        # have files with the same size already been processed?

        if dupe=$(grep $size "$rafi") 
        then
            md5="md5$(md5sum -b "$file" | sed "s/ .*//")"
            # have files with the same md5 already been processed?
            if grep -q $md5 "$rafi" 
            then 
                # automatic hit
                md5dupe=$md5
            else 
#### that was missing!!!
                # get md5 from samesize file & write to tmp file
                dupe=$(echo $dupe | sed "s/.* //")
                md5dupe="md5$(md5sum -b "$dupe" | sed "s/ .*//")"
                echo $md5dupe >> "$rafi"
#### and stuff
            fi
            if [ "$md5" = "$md5dupe" ]
            # set different destination for duplicate files
            then newname="$1/.dupes/$dname/${part1}_${part2}_$fname"
            fi
        # write size to tmp file
        else echo "$size $file" >> "$rafi"
        fi
        # remove double characters
        newname=$(echo "$newname" | sed "s/\/\/*/\//g" | sed "s/\.~.~//g")
        newname=$(echo "$newname" | sed "s/__*/_/g" | sed "s/\.~.~//g")
        if [ "$newname" != "$file" ]
        then if ! [ -d $(dirname "$newname") ]
            then echo mkdir -p $(dirname "$newname") -v
            fi
            echo mv "$file" "$newname" -v --backup=numbered
        fi
    done
    popd
fi

(That version might still not be working - no time to test & fix right now.  But I got past the point where I was stuck! :D)

Last edited by whoops (2011-03-01 10:20:48)


#13 2011-03-01 15:22:57

milomouse
Member
Registered: 2009-03-24
Posts: 940

Re: [solved]bash help please: sort files & recognize duplicates

it's your script, do with it what you see fit. glad you're headed in the right direction, anyway :P take care
