Hi!
I'm trying to move and rename all files in a folder and its subfolders by age, and to move duplicate files somewhere else. The output looks mostly fine, and so do the tmp files, but I can't figure out why the duplicate file handling isn't working. edit: the line that sets the new name (&& newname="$1/.dupes/$dname/$part1_$part2_$fname" \) just doesn't seem to get executed here...
#!/bin/bash
pushd "$1" && {
rafi="/tmp/$RANDOM.size.md5.list" && echo > "$rafi"
find $1 -type f -print0 | while read -d $'\0' file;
do \
fname=$(basename "$file")
dname=$(dirname "$file")
part1=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $1"-"$2"-"$3}')
part2=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $4"-"$5"-"$6}')
echo "$bname" | grep -q "$part1" && part1=""
echo "$bname" | grep -q "$part2" && part2=""
newname="$1/$part1/$part2_$fname"
size="size$(stat -c%s "$file")"
grep -q "$size" "$rafi" \
&& {
md5="md5$(md5sum -b "$file" | sed "s/ .*//")"
grep -q "$md5" "$rafi" \
&& newname="$1/.dupes/$dname/$part1_$part2_$fname" \
|| {
echo "$md5" >> "$rafi"
} \
} || {
echo "$size" >> "$rafi"
}
newname=$(echo "$newname" | sed "s/\/\/*/\//g" | sed "s/\.~.~//g")
if [ "$newname" != "$file" ];
then if ! [ -d $(dirname "$newname") ]
then echo mkdir -p $(dirname "$newname") -v
fi
echo mv "$file" "$newname" -v --backup=numbered
fi
done
popd
}
What am I doing wrong? Been trying for hours - I just don't get it...
Thanks!
Last edited by whoops (2011-03-01 10:19:48)
Offline
Can you supply sample data as well as the desired output?
Offline
Those files are duplicates:
LC_ALL=C md5sum /etc/X11/xorg.*
98b654bf1ab8c1fc6f5fe8b4a7d585e4 /etc/X11/xorg.conf
98b654bf1ab8c1fc6f5fe8b4a7d585e4 /etc/X11/xorg.conf.bak
md5sum: /etc/X11/xorg.conf.d: Is a directory
98b654bf1ab8c1fc6f5fe8b4a7d585e4 /etc/X11/xorg.conf.dub
Testing the script on /etc/X11, only showing lines for "dupes". There should be two "mv lines" for two of the three files, but there's only one.
$ sort_by_date /etc/X11/ | grep dupe
mkdir -p /etc/X11/.dupes/etc/X11 -v
mv /etc/X11/xorg.conf.dub /etc/X11/.dupes/etc/X11/xorg.conf.dub -v --backup=numbered
Strange that part of it works in /etc/X11. It doesn't find any of the duplicate files in my photo / digicam folder, even though there are a lot of them and a lot of md5's show up in the tmp file.
full output:
sort_by_date /etc/X11/
/etc/X11 ~
mkdir -p /etc/X11/2011-02-27 -v
mv /etc/X11/xorg.conf.bak /etc/X11/2011-02-27/xorg.conf.bak -v --backup=numbered
mkdir -p /etc/X11/2010-06-21 -v
mv /etc/X11/xorg.conf.d/20-nvidia.conf /etc/X11/2010-06-21/20-nvidia.conf -v --backup=numbered
mkdir -p /etc/X11/2010-08-24 -v
mv /etc/X11/xorg.conf.d/10-evdev.conf /etc/X11/2010-08-24/10-evdev.conf -v --backup=numbered
mkdir -p /etc/X11/2010-09-01 -v
mv /etc/X11/xorg.conf.d/20-nvidia.conf.pacnew /etc/X11/2010-09-01/20-nvidia.conf.pacnew -v --backup=numbered
mkdir -p /etc/X11/2010-11-30 -v
mv /etc/X11/xorg.conf.d/20-keyboard.conf /etc/X11/2010-11-30/20-keyboard.conf -v --backup=numbered
mkdir -p /etc/X11/2010-05-26 -v
mv /etc/X11/xorg.conf.d/10-quirks.conf /etc/X11/2010-05-26/10-quirks.conf -v --backup=numbered
mkdir -p /etc/X11/2010-01-03 -v
mv /etc/X11/xorg.conf /etc/X11/2010-01-03/xorg.conf -v --backup=numbered
mkdir -p /etc/X11/.dupes/etc/X11 -v
mv /etc/X11/xorg.conf.dub /etc/X11/.dupes/etc/X11/xorg.conf.dub -v --backup=numbered
mkdir -p /etc/X11/2010-07-10 -v
mv /etc/X11/Xsession /etc/X11/2010-07-10/Xsession -v --backup=numbered
mkdir -p /etc/X11/2010-12-22 -v
mv /etc/X11/xinit/xinitrc.d/30-dbus /etc/X11/2010-12-22/30-dbus -v --backup=numbered
mkdir -p /etc/X11/2010-11-28 -v
mv /etc/X11/xinit/xinitrc.d/40-libcanberra-gtk-module /etc/X11/2010-11-28/40-libcanberra-gtk-module -v --backup=numbered
mkdir -p /etc/X11/2010-12-19 -v
mv /etc/X11/xinit/xserverrc /etc/X11/2010-12-19/xserverrc -v --backup=numbered
mkdir -p /etc/X11/2010-11-15 -v
mv /etc/X11/xinit/xinitrc /etc/X11/2010-11-15/xinitrc -v --backup=numbered
~
So, the rest seems to work... almost: $part2 isn't attached to the name because that does not work with the underscore (without underscore it does work) and I don't know how to fix that either.
newname="$1/$part1/$part2_$fname"
Offline
newname="$1/$part1/$part2_$fname"
You can force parameter names to start and stop by using curly brackets. So your example above would be:
newname="$1/$part1/${part2}_${fname}"
Or something similar.
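A minimal sketch of the difference, using made-up values for the two variables. The underscore is a valid character in a variable name, so bash reads $part2_ as a variable called part2_ (unset here), not as $part2 followed by an underscore:

```bash
#!/bin/bash
# Made-up sample values, only for illustration.
part2="19-43-36"
fname="xorg.conf.bak"

# Bash looks up a variable named "part2_", which is unset, so it expands to nothing.
broken="$part2_$fname"

# Braces end the variable name explicitly, so the underscore stays literal.
fixed="${part2}_${fname}"

printf 'broken: %s\n' "$broken"
printf 'fixed:  %s\n' "$fixed"
```

The first line comes out without the time part, which matches the symptom in the output posted earlier in the thread.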
I don't usually say this for bash but your style is hard to read. I gave up trying to figure out what you are doing. || and && are handy for one-liners but they are really ugly for large blocks of code. Most people keep then and do's on the same line as if or while but this is not as big a deal. That is of course only my opinion so please don't feel bad. In exchange for my cruelness here is something I came up with that you might consider using for finding duplicates with md5sum:
md5sum * 2>/dev/null | sort -k 1 | uniq -d -w 32 | cut -c 35-
Your bash programming is only as strong as your knowledge of builtin unix commands ;-). Didn't some wise man once say all computer operations are a bastardized form of sorting? You might consider a similar technique for grouping files by their same dates. Instead of md5sums preface the filename with its time, sort the list, ?, profit!
edit: s/similiar/similar/ gah!
Last edited by juster (2011-02-27 22:08:15)
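The "preface the filename with its time and sort" idea could look roughly like this. It is only a sketch: list_by_date is a made-up function name, and it assumes GNU find, whose -printf action can print the modification date in front of each path:

```bash
#!/bin/bash
# Hypothetical helper: list all files under DIR as "YYYY-MM-DD<TAB>path",
# sorted so that files with the same modification date end up next to each other.
list_by_date() {
    local dir=$1
    # %TY, %Tm, %Td = year, month, day of the last modification; %p = path
    find "$dir" -type f -printf '%TY-%Tm-%Td\t%p\n' | sort
}
```

Calling list_by_date /etc/X11 would then give one line per file, grouped by day, which a loop could turn into mkdir / mv commands.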
Offline
thx, that brought me a little closer!
I don't usually say this for bash but your style is hard to read.
Oops... thought so - never had an easy time writing readable code. Or readable anything... Maybe this is not as bad:
#!/bin/bash
if pushd "$1"
then
# random tmp file for filesize + md5
rafi="/tmp/$RANDOM.size.md5.list" && echo > "$rafi"
# for all files in folder $1 + subfolders...
find $1 -type f -print0 | while read -d $'\0' file
do
fname=$(basename "$file")
dname=$(dirname "$file")
# Get year-month-day
part1=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $1"-"$2"-"$3}')
# Get hour-minute-second
part2=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $4"-"$5"-"$6}')
# Don't add this info to new filename if it is already in old filename
echo "$bname" | grep -q "$part1" && part1=""
echo "$bname" | grep -q "$part2" && part2=""
# Destination filename
newname="$1/$part1/${part2}_$fname"
size="size$(stat -c%s "$file")"
# have files with the same size already been processed?
if grep -q $size "$rafi"
then
md5="md5$(md5sum -b "$file" | sed "s/ .*//")"
# have files with the same md5 already been processed?
if grep -q $md5 "$rafi"
# set different destination for duplicate files
then newname="$1/.dupes/$dname/${part1}_${part2}_$fname"
# write md5 to tmp file
else echo $md5 >> "$rafi"
fi
# write size to tmp file
else echo "$size" >> "$rafi"
fi
# remove double slashes
newname=$(echo "$newname" | sed "s/\/\/*/\//g" | sed "s/\.~.~//g")
if [ "$newname" != "$file" ]
then if ! [ -d $(dirname "$newname") ]
then echo mkdir -p $(dirname "$newname") -v
fi
echo mv "$file" "$newname" -v --backup=numbered
fi
done
popd
fi
Renaming seems to work correctly now, but it's still only giving me one "dupe mv line" for my test in /etc/X11 and it's not giving me any for the many duplicates in my huge digicam folder.
$ sort_by_date /etc/X11/
~
mkdir -p /etc/X11/2011-02-27 -v
mv /etc/X11/xorg.conf.bak /etc/X11/2011-02-27/19-43-36_xorg.conf.bak -v --backup=numbered
mkdir -p /etc/X11/2010-06-21 -v
mv /etc/X11/xorg.conf.d/20-nvidia.conf /etc/X11/2010-06-21/10-12-35_20-nvidia.conf -v --backup=numbered
mkdir -p /etc/X11/2010-08-24 -v
mv /etc/X11/xorg.conf.d/10-evdev.conf /etc/X11/2010-08-24/15-45-35_10-evdev.conf -v --backup=numbered
mkdir -p /etc/X11/2010-09-01 -v
mv /etc/X11/xorg.conf.d/20-nvidia.conf.pacnew /etc/X11/2010-09-01/12-16-10_20-nvidia.conf.pacnew -v --backup=numbered
mkdir -p /etc/X11/2010-11-30 -v
mv /etc/X11/xorg.conf.d/20-keyboard.conf /etc/X11/2010-11-30/10-27-00_20-keyboard.conf -v --backup=numbered
mkdir -p /etc/X11/2010-05-26 -v
mv /etc/X11/xorg.conf.d/10-quirks.conf /etc/X11/2010-05-26/21-51-03_10-quirks.conf -v --backup=numbered
mkdir -p /etc/X11/2010-01-03 -v
mv /etc/X11/xorg.conf /etc/X11/2010-01-03/10-56-25_xorg.conf -v --backup=numbered
mkdir -p /etc/X11/.dupes/etc/X11 -v
mv /etc/X11/xorg.conf.dub /etc/X11/.dupes/etc/X11/2011-02-27_19-43-52_xorg.conf.dub -v --backup=numbered
mkdir -p /etc/X11/2010-07-10 -v
mv /etc/X11/Xsession /etc/X11/2010-07-10/20-01-30_Xsession -v --backup=numbered
mkdir -p /etc/X11/2010-12-22 -v
mv /etc/X11/xinit/xinitrc.d/30-dbus /etc/X11/2010-12-22/15-39-41_30-dbus -v --backup=numbered
mkdir -p /etc/X11/2010-11-28 -v
mv /etc/X11/xinit/xinitrc.d/40-libcanberra-gtk-module /etc/X11/2010-11-28/02-44-08_40-libcanberra-gtk-module -v --backup=numbered
mkdir -p /etc/X11/2010-12-19 -v
mv /etc/X11/xinit/xserverrc /etc/X11/2010-12-19/11-58-47_xserverrc -v --backup=numbered
mkdir -p /etc/X11/2010-11-15 -v
mv /etc/X11/xinit/xinitrc /etc/X11/2010-11-15/11-39-17_xinitrc -v --backup=numbered
~
Still don't get where this problem could be coming from...
md5sum * 2>/dev/null | sort -k 1 | uniq -d -w 32 | cut -c 35-
That's beautiful, just I have no idea how to use it in the context of what I'm trying to do.
Last edited by whoops (2011-02-28 13:00:27)
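One possible way to fit that pipeline into the task, as a sketch only: instead of printing one file per duplicate group (what uniq -d does), let awk print every file except the first of each group. Those are the files that would go to .dupes. The function name later_dupes is made up, the sketch only looks at one directory level, and it assumes filenames without newlines or backslashes (md5sum escapes those):

```bash
#!/bin/bash
# Hypothetical helper: print every regular file directly inside DIR whose content
# matches that of a file listed before it. The first file of each group is kept out.
later_dupes() {
    local dir=$1
    find "$dir" -maxdepth 1 -type f -exec md5sum {} + |
        sort |
        # $1 is the checksum; the filename starts at column 35 (32 hex digits + 2 separators)
        awk 'seen[$1]++ { print substr($0, 35) }'
}
```

For the three identical xorg.conf files from the example, this would print two of them, which is the "two mv lines" behaviour described above.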
Offline
This script is a mess, to be honest. It would be a lot more helpful to have a better explanation of what it's trying to do in English, with sample input and output.
Offline
Building on what juster wrote, these commands will generate a list of files which have duplicates:
shopt -s globstar
for i in /path/to/files/**; do
[[ ! -d $i ]] && md5sum "$i"
done | sort | uniq -w 32 -D
shopt -u globstar
I hope this helps.
Last edited by steve___ (2011-02-28 17:01:07)
Offline
I think you need /path/to/files/**/*
I don't know if you're gonna be looking at duplicates in the same directory, or scattered. And I don't know how he wants to determine which file to keep.
Offline
It would be a lot more helpful to have a better explanation of what it's trying to do in English, with sample input and output.
I'm not sure I get what's missing... I took "/etc/X11" as sample input. Two extra copies of the xorg.conf (see first post) are in there, no other duplicate files. Sample output is the list of mkdir / mv commands. It is supposed to move/rename all files in directory $1 and its subdirectories to a new destination (folder/date/time_filename) and move duplicate files somewhere else (.dupes/folder/date_time_filename). There is no "determining which file is kept": the first one is moved to the normal new folder, all subsequent ones should be moved to the special ".dupes" folder.
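The naming scheme described above could be sketched like this. dest_name is a hypothetical helper, not part of the script in this thread; it assumes GNU date (for -r FILE) and leaves out the "skip the date if it is already in the filename" step:

```bash
#!/bin/bash
# Hypothetical helper: build the destination path for one file.
#   dest_name ROOT FILE        -> ROOT/date/time_filename
#   dest_name ROOT FILE dupe   -> ROOT/.dupes/<original dir>/date_time_filename
dest_name() {
    local root=${1%/} file=$2 kind=${3:-}
    local day clock dir base
    day=$(date -r "$file" +%Y-%m-%d)      # modification date of the file
    clock=$(date -r "$file" +%H-%M-%S)    # modification time of the file
    base=$(basename "$file")
    dir=$(dirname "$file")
    dir=${dir#/}                          # drop a leading slash to avoid "//"
    if [[ $kind == dupe ]]; then
        printf '%s/.dupes/%s/%s_%s_%s\n' "$root" "$dir" "$day" "$clock" "$base"
    else
        printf '%s/%s/%s_%s\n' "$root" "$day" "$clock" "$base"
    fi
}
```

The caller decides whether a file counts as a duplicate and passes "dupe" as the third argument in that case.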
edit: Should I take a different folder as sample input? Is there a good one somewhere with duplicate files that looks about the same on every arch machine?
I'm trying to figure out what's wrong with that script... Been reading it again & again + testing different parts, it just doesn't work but to me the code looks as if it should do exactly what I want. Guess I'll try to make it less of a mess & put in more useful comments now again... just not sure how.
Building on what juster wrote, these commands will generate a list of files which have duplicates:
shopt -s globstar
for i in /path/to/files/**; do
[[ ! -d $i ]] && md5sum "$i"
done | sort | uniq -w 32 -D
shopt -u globstar
I hope this helps.
Thx! I'll keep that one in mind in case I have to give up getting mine to work without understanding what I did wrong!
Last edited by whoops (2011-02-28 18:29:29)
Offline
I think you need /path/to/files/**/*
** matches files and directories, recursing into subdirectories
**/ matches directories only
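A small sketch of the difference, run against a throwaway directory created with mktemp (the file names are made up). With globstar enabled, the bash manual describes ** as matching all files and zero or more directories and subdirectories, and a trailing slash as restricting the match to directories:

```bash
#!/bin/bash
# Build a tiny throwaway tree: one file at the top, one in a subdirectory.
demo=$(mktemp -d)
mkdir -p "$demo/sub"
touch "$demo/top.txt" "$demo/sub/nested.txt"

shopt -s globstar
cd "$demo" || exit 1

all=( ** )      # files and directories, at any depth
dirs=( **/ )    # directories only, each with a trailing slash

printf 'all:  %s\n' "${all[*]}"
printf 'dirs: %s\n' "${dirs[*]}"
shopt -u globstar
```

Since ** already reaches sub/nested.txt, the loop posted earlier does not need /path/to/files/**/* to find files in subdirectories.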
Offline
i rewrote it without changing the majority of your style so you could easily distinguish what changed, but personally i would change the formatting.
but anyway, this works for me so hopefully it should work for you too. currently having it 'echo' instead of creating or moving, such as in your example. and with this, try your previous example of /etc/X11:
#!/bin/bash
if [[ -d "$1" ]]; then
# random tmp file for filesize + md5
rafi=/tmp/${RANDOM}.size.md5.list
# for all files in folder $1 + subfolders...
find "$1" -type f -print0 | while read -d $'\0' file; do
# get filename (fname) and dirname (dname)
fname=${file##*/}
dname=${file%/*}
# get year-month-day
fsize=$(stat -c%y "$file")
# for redirecting ouput to descriptor 6
exec 6>/dev/null
part1=${fsize::(1,10)}
# get hour-minute-second
part2=$(echo ${fsize:11:8}>&6;echo ${_//:/-})
# don't add this info to new filename if it is already in old filename
[[ ${#fname} != $(echo ${fname/${part1}}>&6;echo ${#_}) ]] && part1=""
[[ ${#fname} != $(echo ${fname/${part2}}>&6;echo ${#_}) ]] && part2=""
# destination filename
newname=$(echo "$1/${part1}/${part2}_${fname}">&6;echo ${_//\/\///})
size="size$(stat -c%s "$file")"
# write size to tmp file
echo "$size" >> "$rafi"
# write md5 sum to tmp file
md5=md5$(md5sum -b "$file")
md5=${md5::(1,35)}
echo $md5 >> "$rafi"
# are there files with duplicate md5 sums?
if [[ $(grep -c $md5 "$rafi") -gt 1 ]]; then
newname="${dname}/.dupes/${part1}_${part2}_${fname}"
fi
if [[ "$newname" != "$file" ]]; then
[[ ! -d ${newname%/*} ]] && echo mkdir -p "${newname%/*}" -v
echo mv "$file" "$newname" -v --backup=numbered
fi
# close/free descriptor 6
exec 6<&-
done
fi
NOTE: i suck at bash and find it terrifying that bash cannot nest parameter expansions, hence redirecting descriptor and using a couple temp params, and also static variable splitting.
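Bash indeed can't nest parameter expansions, but a plain two-step assignment gets the same result without the extra file descriptor. A minimal sketch, with a made-up sample of what stat -c%y prints:

```bash
#!/bin/bash
# Made-up sample of `stat -c%y` output; the value is only for illustration.
mtime="2011-02-27 19:43:36.000000000 +0100"

part1=${mtime:0:10}     # first 10 characters: year-month-day

part2=${mtime:11:8}     # 8 characters starting at offset 11: hour:minute:second
part2=${part2//:/-}     # second step on the intermediate value: colons -> dashes

printf '%s\n%s\n' "$part1" "$part2"
```

Reusing the same variable for the intermediate value keeps the number of temporary names down.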
so, even after reviewing your script i couldn't tell exactly what was wrong. maybe it was how you checked the md5sums. also, i noticed you set some variables but didn't use them. i think they were typos. bname should equal dname, etc. not sure if you need to keep the /tmp/$RANDOM.. file at the end of operation or not.
please let me know if i accidentally removed something you needed and if this is or isn't what you were trying to accomplish.
EDIT: i just noticed that if part1 is in filename it wont create that directory(?)
Last edited by milomouse (2011-03-01 02:18:39)
Offline
Great, many thanks!
After reading that rewritten version I finally get it!
Argh, I dropped an essential part of the script days ago (I skip getting the md5's from the first same size file of every "pack"!) while trying to simplify the code and I just couldn't see that it's missing because it was still in my head (also, there was a less important typo that you found). Sort of like that:
#!/bin/bash
if pushd "$1"
then
# random tmp file for filesize + md5
rafi="/tmp/$RANDOM.size.md5.list" && echo > "$rafi"
# for all files in folder $1 + subfolders...
find "$1" -type f -print0 | sort -z | while IFS= read -r -d '' file
do
fname=$(basename "$file")
dname=$(dirname "$file")
# Get year-month-day
part1=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $1"-"$2"-"$3}')
# Get hour-minute-second
part2=$(stat -c%y "$file" | awk -F " |:|\\\.|-" '{printf $4"-"$5"-"$6}')
# Don't add this info to new filename if it is already in old filename
echo "$fname" | grep -q "$part1" && part1=""
echo "$fname" | grep -q "$part2" && part2=""
# Destination filename
newname="$1/$part1/${part2}_$fname"
size="size$(stat -c%s "$file")"
# have files with the same size already been processed?
if dupe=$(grep $size "$rafi")
then
md5="md5$(md5sum -b "$file" | sed "s/ .*//")"
# have files with the same md5 already been processed?
if grep -q $md5 "$rafi"
then
# automatic hit
md5dupe=$md5
else
#### that was missing!!!
# get md5 from samesize file & write to tmp file
dupe=$(echo $dupe | sed "s/.* //")
md5dupe="md5$(md5sum -b "$dupe" | sed "s/ .*//")"
echo $md5dupe >> "$rafi"
#### and stuff
fi
if [ "$md5" = "$md5dupe" ]
# set different destination for duplicate files
then newname="$1/.dupes/$dname/${part1}_${part2}_$fname"
fi
# write size to tmp file
else echo "$size $file" >> "$rafi"
fi
# remove double characters
newname=$(echo "$newname" | sed "s/\/\/*/\//g" | sed "s/\.~.~//g")
newname=$(echo "$newname" | sed "s/__*/_/g" | sed "s/\.~.~//g")
if [ "$newname" != "$file" ]
then if ! [ -d $(dirname "$newname") ]
then echo mkdir -p $(dirname "$newname") -v
fi
echo mv "$file" "$newname" -v --backup=numbered
fi
done
popd
fi
(That version might still not be working - no time to test & fix right now. But I got past the point where I was stuck! )
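For comparison, here is a compact sketch of the same idea (only hash a file once a second file of the same size turns up, and then also hash the first one). It is not the script from this thread: find_dupes is a made-up name, it only prints the duplicates instead of moving them, and it assumes bash 4 or newer (associative arrays) plus GNU stat and md5sum:

```bash
#!/bin/bash
# Hypothetical helper: print every file under DIR that has the same content
# as a file processed before it (files are processed in sorted order).
find_dupes() {
    local dir=$1 file size sum first
    local -A first_by_size=()   # size -> first file seen with that size ("" once hashed)
    local -A seen_md5=()        # md5  -> 1 once a file with that checksum was seen

    while IFS= read -r -d '' file; do
        size=$(stat -c%s "$file")
        if [[ -z ${first_by_size[$size]+set} ]]; then
            # First file of this size: remember it, but do not hash it yet.
            first_by_size[$size]=$file
            continue
        fi
        first=${first_by_size[$size]}
        if [[ -n $first ]]; then
            # Second file of this size: now the first one needs its checksum too.
            sum=$(md5sum -b "$first")
            seen_md5[${sum%% *}]=1
            first_by_size[$size]=""
        fi
        sum=$(md5sum -b "$file")
        sum=${sum%% *}              # keep only the checksum, drop the filename
        if [[ -n ${seen_md5[$sum]+set} ]]; then
            printf '%s\n' "$file"   # same content as an earlier file
        else
            seen_md5[$sum]=1
        fi
    done < <(find "$dir" -type f -print0 | sort -z)
}
```

Feeding the loop through process substitution instead of a pipe keeps it in the current shell, so the arrays are still filled in when the loop ends.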
Last edited by whoops (2011-03-01 10:20:48)
Offline
it's your script, do with it what you see fit. glad you're headed in the right direction. anyway, take care
Offline