#1 2014-07-26 19:26:01

Andy_Crowd
Member
From: 延雪平縣 Sweden
Registered: 2013-12-28
Posts: 119

[SOLVED]Find duplicates in an array?

Hi!
I have made this script to find duplicate entries in an array, but it takes a very long time to run. Is there a faster way to compare array entries? I have about 15210 entries that must be compared to each other. Should I use another algorithm or programming language to make it much faster? I am still a beginner.

while [ "$ZZ" != "$TotalItems" ]; do
    # Extract width and height of the current entry
    TMPA="${ArrayOfFiles[$ZZ]}"
    TMPwB="${TMPA/|H|*/}"
    Width="${TMPwB/*W|/}"
    TMPhB="${TMPA/|Format|*/}"
    Height="${TMPhB/*|/}"
    DupCount=0
    TotalDupes=0    # reset for each entry so the count is per entry
    # Compare against every entry in the array (including itself)
    while [ "$DupCount" != "$TotalItems" ]; do
        TMPA="${ArrayOfFiles[$DupCount]}"
        TMPwB="${TMPA/|H|*/}"
        W="${TMPwB/*W|/}"
        TMPhB="${TMPA/|Format|*/}"
        H="${TMPhB/*|/}"
        if [[ "$Width" -eq "$W" && "$Height" -eq "$H" ]]; then
            TotalDupes=$((TotalDupes+1))
        fi
        DupCount=$((DupCount+1))
    done
    CollectDupes[$ZZ]="${ArrayOfFiles[$ZZ]}Duplicates|$TotalDupes"
    #echo ${CollectDupes[$ZZ]} >> /tmp/tmpXX.txt
    ZZ=$((ZZ+1))
done

Here are the timestamps:

19:17:59
DONE FILL IN ARRAY
19:18:00
Total: 15210
21:53:25

It took about two and a half hours (19:18:00 to 21:53:25) just to calculate, and bash used only one of the four cores, at 100%. I used date '+%T' to get the time for each task.

Last edited by Andy_Crowd (2014-07-26 21:25:56)


Help to make Arora bug free!!
日不落 | Year 2081 | 笑傲江湖 | One more a really good book in my collection the Drystoll.

#2 2014-07-26 19:56:26

2ManyDogs
Forum Moderator
Registered: 2012-01-15
Posts: 4,645

Re: [SOLVED]Find duplicates in an array?

Exactly what do the array entries look like, and exactly what do you want the output to be?

A search for "bash find duplicates in an array" gave me this page: http://stackoverflow.com/questions/2205 … ment-array

and this code does return the duplicate entries in my test array, but I'm not sure it is what you want, or whether it will work with your array:

#!/bin/bash

array=( 1 2 3 4 5 7 2 6 7 )
printf '%s\n' "${array[@]}"|awk '!($0 in seen){seen[$0];next} 1'

There are many people here with more awk and bash skills, so if you can be a little more precise about what you have and what you want, I'm sure someone can give you a better answer. But in the meantime, there are many web pages with examples that may help you.

Last edited by 2ManyDogs (2014-07-26 20:05:11)


How to post. A sincere effort to use modest and proper language and grammar is a sign of respect toward the community.

#3 2014-07-26 20:06:39

Andy_Crowd
Member
From: 延雪平縣 Sweden
Registered: 2013-12-28
Posts: 119

Re: [SOLVED]Find duplicates in an array?

The array entries look like:
./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|.
I extract only 119 for W and 170 for H. The script compares only these numbers (W = width, H = height), which give the image resolution. I am using bash built-ins for the string handling instead of external commands/programs like awk, sed, or echo, because those slow the script down very much.
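A minimal sketch of another built-in way to pull out the two numbers, assuming every entry uses the '|'-separated layout shown above. The variable names (path, wlabel, hlabel, rest) are made up for illustration:

```shell
#!/bin/bash
# Sample entry copied from the line above (without the trailing period).
line='./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|'

# Split on '|' with the read builtin; no external program is started.
# Fields: path, the literal "W", width, the literal "H", height, everything else.
IFS='|' read -r path wlabel W hlabel H rest <<EOF
$line
EOF

echo "W=$W H=$H"
```

Because the IFS assignment is a prefix to read, it applies only to that one command and does not change word splitting in the rest of the script.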

Last edited by Andy_Crowd (2014-07-26 20:11:10)


#4 2014-07-26 20:13:18

2ManyDogs
Forum Moderator
Registered: 2012-01-15
Posts: 4,645

Re: [SOLVED]Find duplicates in an array?

Can you give me a better example of what the array entries look like (two or three actual text entries)? And do you want a list of duplicates, a count, both, and/or to remove the duplicates from the array? Sorry, but I still don't know what you have and what you want, or if the code I gave you even comes close. I guess I'll leave this for others who will be able to give you a better answer.

Last edited by 2ManyDogs (2014-07-26 20:18:01)


#5 2014-07-26 20:19:39

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: [SOLVED]Find duplicates in an array?

Andy_Crowd, your code is barely readable to me. (A form of tl;dr but worse). When asking questions, your chance of getting an answer is greatly increased if you make the effort to make your question as brief and precise as possible. When it includes code, reduce it to the essential form.

Again, like 2ManyDogs, I assume you want to remove duplicates; another way to do that,

#!/bin/bash
mapfile -t array < <(printf '%s\n' "${array[@]}"|sort -u)

or if you can use zsh, simply use

${(u)array}

or

declare -Ua array
array=(... ...)
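
Which tool fits depends on whether the duplicates should be removed, listed, or counted. A small sketch of the three cases with sort and uniq, using made-up sample data:

```shell
#!/bin/bash
# Hypothetical sample data, one entry per line.
items='aa
dd
aa
ss
dd'

# Remove duplicates: each value appears once.
unique=$(printf '%s\n' "$items" | sort -u)

# List duplicates: only the values that occur more than once.
dupes=$(printf '%s\n' "$items" | sort | uniq -d)

# Count: each value prefixed by its number of occurrences.
counts=$(printf '%s\n' "$items" | sort | uniq -c)

printf 'unique:\n%s\n' "$unique"
printf 'duplicated:\n%s\n' "$dupes"
printf 'counts:\n%s\n' "$counts"
```

uniq only compares adjacent lines, which is why the input is sorted first.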

This silver ladybug at line 28...

#6 2014-07-26 20:20:24

Andy_Crowd
Member
From: 延雪平縣 Sweden
Registered: 2013-12-28
Posts: 119

Re: [SOLVED]Find duplicates in an array?

This is what I use in the script to fill the array from a file:

ArrayFillCount=0
Count=0
# IFS= and -r keep leading whitespace and backslashes in each line intact
while IFS= read -r line; do
    ArrayFillCount=$((ArrayFillCount+1))
    ArrayOfFiles[$ArrayFillCount]="$line"
done < myinfo.txt
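
For comparison, a sketch of filling the array in a single call with the mapfile builtin, assuming bash 4 or newer. The temporary file and its two sample lines are made up and stand in for myinfo.txt:

```shell
#!/bin/bash
# Hypothetical stand-in for myinfo.txt with two sample lines.
infile=$(mktemp)
printf '%s\n' \
  './a.jpg|W|119|H|170|Format|jpeg|Errors|0|' \
  './b.jpg|W|640|H|480|Format|jpeg|Errors|0|' > "$infile"

# mapfile reads every line of the file into the array in one call;
# -t strips the trailing newline from each element.
mapfile -t ArrayOfFiles < "$infile"
TotalItems=${#ArrayOfFiles[@]}

echo "Total: $TotalItems"
rm -f "$infile"
```

Note that mapfile numbers the elements from 0, whereas the loop above stores the first line at index 1; mapfile -O 1 would start at index 1 instead.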

And I wrote above what each string looks like.

if [[ "$Width" -eq "$W" && "$Height" -eq "$H" ]]; then
    TotalDupes=$((TotalDupes+1))
fi

It compares only the numbers for W and H from the whole line.
And this is how I separate the parts of a line:

TMPA="${ArrayOfFiles[$DupCount]}"
TMPwB="${TMPA/|H|*/}"
W="${TMPwB/*W|/}"
TMPhB="${TMPA/|Format|*/}"
H="${TMPhB/*|/}"

Instead of external programs or commands that write to standard output, like:

# Does the same as above.
# This is an example of extracting the values from a string in the array with echo and awk.
W=$(echo "./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|" | awk -F'|' '{print $3}')
H=$(echo "./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|" | awk -F'|' '{print $5}')

Is zsh faster than bash?
Does it have better ways to handle arrays?
Is programming much different between them?

Last edited by Andy_Crowd (2014-07-26 20:40:40)


#7 2014-07-26 20:34:42

lolilolicon
Member
Registered: 2009-03-05
Posts: 1,722

Re: [SOLVED]Find duplicates in an array?

So basically, any entries with identical width and height are duplicates? What are you trying to do, exactly? Do you want to count the number of duplicates of each entry? Or simply remove duplicates? Or do you really just want to find duplicate images? For the last one, you can look at a tool called findimagedupes.

Andy_Crowd wrote:

This I am using to fill in array in the script from a file:
Is zsh faster than bash?
Does it have better ways to handle arrays?
Is programming much different between them?

I have no idea whether zsh is any faster or slower. zsh is somewhat more powerful in array handling and has a lot more sugar, e.g. in its parameter expansion, but these are really just a convenience to me. zsh is a good interactive shell, but spending time learning to program in it is not very useful, certainly not before you're comfortable with the basic, portable shell stuff; learn bash first. Also, don't shun external tools: they're often very efficient at what they do, and you can be sure that a bad implementation in shell will be slower than using external specialty tools.

Last edited by lolilolicon (2014-07-26 20:44:21)


#8 2014-07-26 20:40:13

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,441

Re: [SOLVED]Find duplicates in an array?

Eh ... ok, can we go back to the start to avoid the big XY problem here?

You say your goal is to compare two arrays - and any language that can do it will be fine.  But then the only description of what you really want is a potential solution in bash.  Then as we move along we find that these arrays are filled from files.

So can you please describe what you actually want to achieve?

I will start by saying that bash will most likely not be good for such large arrays.  It could do it ... but it shouldn't.

If you have the list in a file, why not just `sort -u`?

EDIT: I just reread your bash version - no wonder it takes hours: for all that weird variable processing, you are repeating the exact same processing on every single array element over 15 thousand times: you extract Width and Height from each element, then for each element you extract W and H from every element - so you end up extracting width and height from each array element one more time than there are elements in the array!  Why on earth are you doing that?  Preprocess every array element once into the parts you care about:

Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.

EDIT: as for awk and sed slowing down the script "very much", that's because you are using them completely wrong.  Don't launch a new awk subprocess for each variable of interest for each element of the array for every other element of the array (which would mean you're launching about 15000*15000*2 subprocesses with awk).  Just use one awk process to preprocess the input into a format that can be easily compared.  Then go through that input only once.
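
The same "extract once, then count" idea can also be sketched in bash itself. This is only an illustration, assuming bash 4 or newer for associative arrays; the sample entries and the names Count and Keys are made up:

```shell
#!/bin/bash
# Hypothetical sample entries in the format shown earlier in the thread.
ArrayOfFiles=(
  './a.jpg|W|119|H|170|Format|jpeg|Errors|0|'
  './b.jpg|W|119|H|170|Format|jpeg|Errors|0|'
  './c.png|W|640|H|480|Format|png|Errors|0|'
)

declare -A Count   # "WxH" key -> number of entries with that size
Keys=()            # per-entry key, extracted exactly once

# Pass 1: split each entry once and tally its size.
for i in "${!ArrayOfFiles[@]}"; do
  IFS='|' read -r _path _wlabel W _hlabel H _rest <<< "${ArrayOfFiles[i]}"
  key="${W}x${H}"
  Keys[i]=$key
  Count[$key]=$(( ${Count[$key]:-0} + 1 ))
done

# Pass 2: append the count for each entry's size to the entry.
CollectDupes=()
for i in "${!ArrayOfFiles[@]}"; do
  CollectDupes[i]="${ArrayOfFiles[i]}Duplicates|${Count[${Keys[i]}]}|"
done

printf '%s\n' "${CollectDupes[@]}"
```

Each entry is parsed once and looked up once, so the work grows linearly with the number of entries instead of quadratically. As in the original nested loop, the count includes the entry itself.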

Last edited by Trilby (2014-07-26 20:53:08)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman

#9 2014-07-26 20:49:55

Andy_Crowd
Member
From: 延雪平縣 Sweden
Registered: 2013-12-28
Posts: 119

Re: [SOLVED]Find duplicates in an array?

I want to find identical values in the array and map them.
I hope this is simple enough:
A=(aa dd aa ss dd)
if A[1] = X[2] then
T=T+1
fi

X[2]="${A[1]}$T"

echo ${X[2]}
aa10

It counts how many entries are the same and adds the count to the string in a new array.

Last edited by Andy_Crowd (2014-07-26 20:54:13)


#10 2014-07-26 20:54:07

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,441

Re: [SOLVED]Find duplicates in an array?

I have no idea what that pseudocode is supposed to mean.  If the input was "aa dd aa ss dd" what would you want for output?

Your edit doesn't help.  SHOW what kind of output you would want for this hypothetical input.

In any case, from everything I can gather about what you are describing, the following awk script would work, and would only read the data once.  This assumes each element of the list is separated by newlines (which it looks like the original was before you smashed it into an ugly bash array) and it gives the output separated by newlines:

#!/bin/bash

awk '
// {
	W=...;
	H=...;
	WxH = W "x" H;
	Count[WxH]++;
}
END {
	for (i in Count) {
		printf "%s=%s\n", i, Count[i];
	}
}
' /path/to/your/input

You just need to fill in the ellipses for the W and H, or preferably just use one string function to extract WxH.

Last edited by Trilby (2014-07-26 21:06:33)

#11 2014-07-26 21:07:17

Andy_Crowd
Member
From: 延雪平縣 Sweden
Registered: 2013-12-28
Posts: 119

Re: [SOLVED]Find duplicates in an array?

As I said, I am using built-in bash functions for string handling, so there is no need to launch awk.
For strings like this:
The string
./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|
Code:
find and count duplicates
Output:
./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|Duplicates|12|

Last edited by Andy_Crowd (2014-07-26 21:39:58)


#12 2014-07-26 21:11:47

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,441
Website

Re: [SOLVED]Find duplicates in an array?

It's even easier:

awk -F '|' '// { Count[$3 "|" $5]++; } END { for (i in Count) { printf "%s|%s\n", i, Count[i]; }}' /path/to/file

As for not wanting to use awk, you started the thread by saying you could use any tool that could be good for this.  Bash is not good for this.  When you included awk it was much worse, yes - but that was because you completely misused it.  If you want to do this in bash - good luck.
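
The one-liner prints one line per width|height pair. If each original line should instead get its count appended, as in the "Duplicates|12|" example earlier in the thread, one possible variation is to let awk read the file twice. This is a sketch; the temporary file and its sample lines are made up and stand in for /path/to/file:

```shell
#!/bin/bash
# Hypothetical sample input standing in for /path/to/file.
input=$(mktemp)
cat > "$input" <<'EOF'
./a.jpg|W|119|H|170|Format|jpeg|Errors|0|
./b.jpg|W|119|H|170|Format|jpeg|Errors|0|
./c.png|W|640|H|480|Format|png|Errors|0|
EOF

# The file is named twice so awk reads it twice:
#   first read (NR == FNR): count the entries per width|height key
#   second read: print each line with the count for its key appended
result=$(awk -F '|' '
NR == FNR { Count[$3 "|" $5]++; next }
{ print $0 "Duplicates|" Count[$3 "|" $5] "|" }
' "$input" "$input")

printf '%s\n' "$result"
rm -f "$input"
```

This still starts a single awk process, and the output keeps the order of the input lines.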

Last edited by Trilby (2014-07-26 21:15:16)

#13 2014-07-26 21:20:16

Andy_Crowd
Member
From: 延雪平縣 Sweden
Registered: 2013-12-28
Posts: 119

Re: [SOLVED]Find duplicates in an array?

Thank you, you solved my headache.


#14 2014-07-26 21:27:06

2ManyDogs
Forum Moderator
Registered: 2012-01-15
Posts: 4,645

Re: [SOLVED]Find duplicates in an array?

lolilolicon and Trilby -- thank you both. You have added to my bash and awk knowledge. I will refrain from my comments to Andy_Crowd at this point.

Last edited by 2ManyDogs (2014-07-26 21:29:40)


#15 2014-07-26 21:33:00

Trilby
Inspector Parrot
Registered: 2011-11-29
Posts: 29,441

Re: [SOLVED]Find duplicates in an array?

Mastering the effective use of associative arrays in awk is one of the highest forms of text/data processing wizardry an archer can strive for ;)


"UNIX is simple and coherent..." - Dennis Ritchie, "GNU's Not UNIX" -  Richard Stallman
