awk -F '|' '// { Count[$3 "|" $5]++; } END { for (i in Count) { printf "%s|%s\n", i, Count[i]; }}' /path/to/file
As for not wanting to use awk: you started the thread by saying you could use any tool that would be good for this. Bash is not good for this. Yes, your version that included awk was even slower - but that was because you completely misused it. If you want to do this in pure bash, good luck.
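To show what the one-liner produces, here is a run against hypothetical sample input in the field layout shown later in the thread (the file paths are made up):

```shell
# Three hypothetical lines in the thread's format:
cat > /tmp/sample.txt <<'EOF'
./a.jpg|W|119|H|170|Format|jpeg|Errors|0|
./b.jpg|W|119|H|170|Format|jpeg|Errors|0|
./c.jpg|W|640|H|480|Format|png|Errors|0|
EOF

# Field 3 is the width, field 5 the height; count each W|H pair:
awk -F '|' '{ Count[$3 "|" $5]++ } END { for (i in Count) printf "%s|%s\n", i, Count[i] }' /tmp/sample.txt
```

This prints `119|170|2` and `640|480|1` (awk's `for (i in ...)` traversal order is unspecified, so the two lines may come out in either order).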
Your edit doesn't help. SHOW what kind of output you would want for this hypothetical input.
In any case, from everything I can gather about what you are describing, the following awk script would work, and would read the data only once. It assumes the elements of the list are newline-separated (which it looks like the original was, before you smashed it into an ugly bash array), and it prints the output one entry per line:
#!/bin/bash
awk '
// {
    W=...;
    H=...;
    WxH = W "x" H;
    Count[WxH]++;
}
END {
    for (i in Count) {
        printf "%s=%s\n", i, Count[i];
    }
}
' /path/to/your/input
You just need to fill in the ellipses for W and H, or preferably use a single string operation to extract WxH directly.
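One way those ellipses could be filled in, assuming the field layout shown elsewhere in the thread (`path|W|<width>|H|<height>|Format|...`, so with `-F'|'` the width is field 3 and the height field 5); the sample file and its contents here are hypothetical:

```shell
# Hypothetical sample input in the thread's format:
printf '%s\n' \
    './a.jpg|W|119|H|170|Format|jpeg|Errors|0|' \
    './b.jpg|W|119|H|170|Format|jpeg|Errors|0|' \
    './c.jpg|W|640|H|480|Format|png|Errors|0|' > /tmp/input.txt

awk -F '|' '
// {
    WxH = $3 "x" $5;    # build "119x170" straight from the fields
    Count[WxH]++;
}
END {
    for (i in Count) {
        printf "%s=%s\n", i, Count[i];
    }
}
' /tmp/input.txt
```

For this input it prints `119x170=2` and `640x480=1`, in unspecified order.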
X[2]="${A[1]}$T"
echo ${X[2]}
aa10
calculates how many are similar and adds that count as part of a string to a new array
You say your goal is to compare two arrays, and that any language that can do it will be fine. But then the only description of what you really want is a potential solution in bash. Then, as we move along, we find that these arrays are filled from files.
So can you please describe what you actually want to achieve?
I will start by saying that bash will most likely not be good for such large arrays. It could do it ... but it shouldn't.
If you have the list in a file, why not just `sort -u`?
EDIT: I just reread your bash version - no wonder it takes hours. With all that weird variable processing, you repeat the exact same work on every single array element over 15 thousand times: you extract Width and Height from each element, and then, for each element, you extract W and H from every element again - so you end up extracting width and height from each array element one more time than there are elements in the array. Why on earth are you doing that? Preprocess every array element once into the parts you care about:
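A minimal sketch of that preprocessing, assuming the `|`-delimited format shown elsewhere in the thread (the sample strings and the `Widths`/`Heights` names are my own):

```shell
# Hypothetical sample data in the thread's format:
ArrayOfFiles=(
    './a.jpg|W|119|H|170|Format|jpeg|Errors|0|'
    './b.jpg|W|640|H|480|Format|png|Errors|0|'
)
Widths=() ; Heights=()
for i in "${!ArrayOfFiles[@]}"; do
    # Parse each element exactly once: field 3 is W, field 5 is H.
    IFS='|' read -r _ _ w _ h _ <<< "${ArrayOfFiles[$i]}"
    Widths[$i]=$w
    Heights[$i]=$h
done
# All later comparisons use ${Widths[$i]} and ${Heights[$i]} directly,
# instead of re-extracting them from the full string every time.
```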
Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
EDIT: as for awk and sed slowing down the script "very much" - it's because you are using them completely wrong. Don't launch a new awk subprocess for each variable of interest, for each element of the array, for every other element of the array (which would mean launching around 15000*15000*2 subprocesses). Use one awk process to preprocess the input into a format that can be easily compared, then go through that input only once.
This is what I am using to fill the array in the script from a file:
Is zsh faster than bash?
Does it have better ways to handle arrays?
Is programming much different in them?
I have no idea whether zsh is any faster or slower. zsh is somewhat more powerful in array handling, and has a lot more sugar in e.g. its parameter expansion stuff, but these are really just conveniences to me. zsh is a good interactive shell, but spending time learning to program in it is not all that useful, certainly not before you're comfortable with the basic, portable shell stuff -- learn bash first. Also, don't shun external tools: they're often very efficient at what they do, and you can be sure that a bad implementation in shell will be slower than using external specialty tools.
ArrayFillCount=0
Count=0
while read line ; do
    ArrayFillCount=$((ArrayFillCount+1))
    ArrayOfFiles[$ArrayFillCount]="$line"
done < myinfo.txt
And I wrote above what each string looks like.
if [[ "$Width" -eq "$W" && "$Height" -eq "$H" ]]; then
    TotalDupes=$((TotalDupes+1))
fi
It compares only the W and H numbers from the whole line.
And this is the way I separate the parts of a line:
TMPA="${ArrayOfFiles[$DupCount]}";
TMPwB="${TMPA/|H|*/}";
W=${TMPwB/*W|/}
TMPhB="${TMPA/|Format|*/}";
H="${TMPhB/*|/}"
Instead of other external programs or commands that write output to the display, like:
# Does the same as above.
# This is an example of extracting the values from a string in the array using echo and awk:
W=$(echo "./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|" | awk -F'|' '{print $3}')
H=$(echo "./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|" | awk -F'|' '{print $5}')
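To make those parameter expansions concrete, here is what each step yields for the sample string above:

```shell
TMPA='./Sorted/jpeg/ImgSize_W_119_H_170/all-web-images_50934_f124382360.jpg|W|119|H|170|Format|jpeg|Errors|0|'
TMPwB="${TMPA/|H|*/}"        # everything before "|H|":      ...jpg|W|119
W=${TMPwB/*W|/}              # strip through the last "W|":  119
TMPhB="${TMPA/|Format|*/}"   # everything before "|Format|": ...|W|119|H|170
H="${TMPhB/*|/}"             # strip through the last "|":   170
echo "${W}x${H}"             # prints 119x170
```

Note this only works because the literal substrings `|H|`, `W|`, and `|Format|` never occur inside the path itself (the path uses `_W_` and `_H_`).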
Is zsh faster than bash?
Does it have better ways to handle arrays?
Is programming much different in them?
Again, like 2ManyDogs, I assume you want to remove duplicates; another way to do that:
#!/bin/bash
mapfile -t array < <(printf '%s\n' "${array[@]}"|sort -u)
or if you can use zsh, simply use
${(u)array}
or
declare -Ua array
array=(... ...)
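A quick sanity check of the bash `mapfile` approach, using a throwaway array:

```shell
array=( 1 2 3 4 5 7 2 6 7 )
mapfile -t array < <(printf '%s\n' "${array[@]}" | sort -u)
echo "${array[@]}"   # prints: 1 2 3 4 5 6 7
```

The duplicates (2 and 7) are gone, at the cost of the array coming back sorted rather than in its original order.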
A search for "bash find duplicates in an array" gave me this page: http://stackoverflow.com/questions/2205 … ment-array
and this code does work to show the duplicate entries in my test array, but I'm not sure it is what you want, or that it will work with your array:
#!/bin/bash
array=( 1 2 3 4 5 7 2 6 7 )
printf '%s\n' "${array[@]}"|awk '!($0 in seen){seen[$0];next} 1'
There are many people here with more awk and bash skills, so if you can be a little more precise about what you have and what you want, I'm sure someone can give you a better answer. But in the meantime, there are many web pages with examples that may help you.
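For the test array above, the awk filter prints each value the second (and any later) time it appears:

```shell
array=( 1 2 3 4 5 7 2 6 7 )
printf '%s\n' "${array[@]}" | awk '!($0 in seen){seen[$0];next} 1'
# prints:
# 2
# 7
```

First occurrences are recorded in `seen` and skipped by `next`; anything already in `seen` falls through to the bare `1`, which prints the line.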
while [ $ZZ != $TotalItems ]; do
    TMPA="${ArrayOfFiles[$ZZ]}"
    TMPwB="${TMPA/|H|*/}"
    Width=${TMPwB/*W|/}
    TMPhB="${TMPA/|Format|*/}"
    Height="${TMPhB/*|/}"
    DupCount="0"
    while [ $DupCount != $TotalItems ]; do
        TMPA="${ArrayOfFiles[$DupCount]}"
        TMPwB="${TMPA/|H|*/}"
        W=${TMPwB/*W|/}
        TMPhB="${TMPA/|Format|*/}"
        H="${TMPhB/*|/}"
        if [[ "$Width" -eq "$W" && "$Height" -eq "$H" ]]; then
            TotalDupes=$((TotalDupes+1))
        fi
        DupCount=$((DupCount+1))
    done
    CollectDupes[$ZZ]="${ArrayOfFiles[$ZZ]}Duplicates|$TotalDupes"
    #echo ${CollectDupes[$ZZ]} >> /tmp/tmpXX.txt
    ZZ=$((ZZ+1))
done
Here are the time stamps:
19:17:59
DONE FILL IN ARRAY
19:18:00
Total: 15210
21:53:25
It took two and a half hours (19:18:00 to 21:53:25) just to calculate; bash used only one of four cores, at 100%. I also used date '+%T' to get the time for each task.