Hello,
I have a plain text database in more or less the following format (rows are items A, B, C...; columns are features 1, 2, 3...):
  1 2 3
A a a a
B a b b
C b a a
and I'd like to have the items grouped by the number of common features:
  1 2 3
A a a a
C b a a
B a b b
(A and C have two features in common, A and B have one, and B and C have none.)
Now, my question is this: does a tool that can do this for me already exist?
Last edited by caminoix (2011-07-12 21:41:46)
Offline
If it's a simple a/b, 0/1 or yes/no distinction, maybe you can just sort the file?
With regard to "common features", where would
D a b a
go? It has 2 features in common with both A and B and 1 with C.
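To make the counting concrete, here is a minimal Python sketch that tallies the shared features for every pair of rows in the example, D included. The dict layout and the names `rows`, `common_count` and `pair_counts` are assumptions made for illustration; nothing here is an existing tool.

```python
from itertools import combinations

# Rows from the thread's example; the dict layout is an assumption.
rows = {
    "A": ["a", "a", "a"],
    "B": ["a", "b", "b"],
    "C": ["b", "a", "a"],
    "D": ["a", "b", "a"],
}


def common_count(x, y):
    """Number of columns in which two rows hold the same value."""
    return sum(1 for u, v in zip(x, y) if u == v)


# Count shared features for every unordered pair of items.
pair_counts = {
    (p, q): common_count(rows[p], rows[q])
    for p, q in combinations(sorted(rows), 2)
}

# List the most similar pairs first.
for (p, q), n in sorted(pair_counts.items(), key=lambda kv: -kv[1]):
    print(p, q, n)
```

For these four rows, the pairs A/C, A/D and B/D each share two features, A/B and C/D share one, and B/C share none, which agrees with the counts quoted in the posts.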
Offline
Thanks, but simple sorting puts B before C, which is exactly what I don't want. Reordering the columns is acceptable, but it only works for a minimal example like the one here, not for my actual task.
I think I might have oversimplified my problem somewhat for the sake of this post. Anyway, including your line, my dream grouping would be this:
-- common in total
A a a a -- two, group 1
C b a a
D a b a
B a b b -- two, group 2
D a b a
A a a a -- one
B a b b
-- common in row
A a a a -- two, group 1
C b a a
B a b b -- two, group 2
D a b a
A a a a -- one
B a b b
The ordering of subgroups (ACD before BD, AC before BD) is absolutely unimportant to me. All I need is to find what these subgroups are.
I'll appreciate any ideas that can get me at least in the vicinity of this result.
Offline
Common in a row:
[karol@black test]$ cat abcd
A a a a
B a b b
C b a a
D a b a
[karol@black test]$ awk '$3 == "a" && $4 == "a"' abcd
A a a a
C b a a
[karol@black test]$ awk '$2 == "a" && $3 == "b"' abcd
B a b b
D a b a
This approach is pretty flexible but would involve scripting all possible combinations.
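One way to avoid writing each combination by hand might be to derive the patterns from the data itself: for every pair of rows, record the (column, value) positions on which they agree, then collect every row that fits that pattern. The following is only a sketch under that assumption; `shared_signature`, `matches` and `groups` are hypothetical names, and columns are numbered from 0 over the feature columns (so awk's `$3` is index 1 here).

```python
from itertools import combinations

# Same rows as in the abcd file; the dict layout is an assumption.
rows = {
    "A": ["a", "a", "a"],
    "B": ["a", "b", "b"],
    "C": ["b", "a", "a"],
    "D": ["a", "b", "a"],
}


def shared_signature(x, y):
    """The (column index, value) pairs on which two rows agree."""
    return frozenset((i, u) for i, (u, v) in enumerate(zip(x, y)) if u == v)


def matches(row, signature):
    """True if the row has the required value in every column of the signature."""
    return all(row[i] == value for i, value in signature)


# Each distinct non-empty signature defines one group of rows.
groups = {}
for p, q in combinations(sorted(rows), 2):
    sig = shared_signature(rows[p], rows[q])
    if sig and sig not in groups:
        groups[sig] = sorted(name for name, row in rows.items() if matches(row, sig))

# Print the largest signatures first, with 1-based feature numbers.
for sig, members in sorted(groups.items(), key=lambda kv: (-len(kv[0]), kv[1])):
    pattern = " ".join(f"{i + 1}={v}" for i, v in sorted(sig))
    print(len(sig), pattern, members)
```

This reproduces the two awk results above (A and C on features 2 and 3, B and D on features 1 and 2) and also reports patterns that were not scripted by hand, such as A and D agreeing on features 1 and 3. The work grows with the number of row pairs rather than with the number of possible column combinations.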
Common in total:
[karol@black test]$ grep -e ". a a" -e "a . a" -e "a a ." abcd
A a a a
C b a a
D a b a
[karol@black test]$ grep "a b ." abcd
B a b b
D a b a
Seems like grep is more concise and it can do the 'common in a row' part too.
Last edited by karol (2011-07-10 14:43:47)
Offline
Perhaps I should have mentioned that my database is going to contain probably around fifty items and some three hundred features, each with fifteen or twenty options, not just a/b, some with more than one at the same time...
Offline
What do you mean by "more than one at the same time"? Can you give an example?
Offline
Well, in reality it'll be more like this:
            word 1  word 2  word 3  ...
language 1  p       p       pp
language 2  r       p,r     p
...
which means that language 2 / word 2 has both p and r. The ordering of p and r is not important; it could just as well be r,p. Also note that pp (word 3, language 1) is a different value from p.
So thank you for trying, but awking through all the possible combinations by hand is out of the question at this scale.
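A small Python sketch of how such cells could be read: each cell is split on commas into a set, so that "p,r" and "r,p" compare equal while "pp" stays distinct from "p". Whether a partial overlap (p against p,r) should count as a common feature is not settled in the thread, so the `exact` switch below is an assumption, as are the names `parse_cell`, `table` and `common_count`.

```python
def parse_cell(cell):
    """Turn a cell such as 'p,r' into an order-independent set of values."""
    return frozenset(part.strip() for part in cell.split(",") if part.strip())


# Sample data mirroring the two rows shown above; the layout is an assumption.
table = {
    "language 1": ["p", "p", "pp"],
    "language 2": ["r", "p,r", "p"],
}
parsed = {name: [parse_cell(c) for c in cells] for name, cells in table.items()}


def common_count(x, y, exact=True):
    """Count columns where two rows agree.

    exact=True  : the two value sets must be identical.
    exact=False : sharing at least one value is enough (an assumed reading).
    """
    if exact:
        return sum(1 for u, v in zip(x, y) if u == v)
    return sum(1 for u, v in zip(x, y) if u & v)


lang1, lang2 = parsed["language 1"], parsed["language 2"]
print(common_count(lang1, lang2))               # strict comparison
print(common_count(lang1, lang2, exact=False))  # overlap comparison
```

With the two sample rows, the strict comparison finds no common column, while the overlap comparison counts word 2, where both languages have p.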
Last edited by caminoix (2011-07-10 14:58:40)
Offline
I don't know of a ready-made tool to do this. That doesn't mean you can't roll your own. You probably want to do this yourself, but I got a little carried away and wrote a solution in Perl. It doesn't use any fancy algorithm, and I think it is rather straightforward.
https://gist.github.com/1078348
This is difficult to sort because similarities between languages are not transitive. Then again, you only asked to group them, and I think the code does that much. But who knows? Grouping is a vague concept that can be tweaked in different ways.
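The non-transitivity is easy to see on the four-row a/b example from earlier in the thread. The sketch below is not the code from the gist; it is a separate Python illustration in which two rows count as "similar" when they share at least `THRESHOLD` features. The threshold value and all names are assumptions.

```python
from itertools import combinations

# The a/b example from earlier in the thread; the dict layout is an assumption.
rows = {
    "A": ["a", "a", "a"],
    "B": ["a", "b", "b"],
    "C": ["b", "a", "a"],
    "D": ["a", "b", "a"],
}

THRESHOLD = 2  # assumed cut-off: "similar" means at least two shared features


def common_count(x, y):
    """Number of columns in which two rows hold the same value."""
    return sum(1 for u, v in zip(x, y) if u == v)


# Build the symmetric "is similar to" relation.
similar = {name: set() for name in rows}
for p, q in combinations(sorted(rows), 2):
    if common_count(rows[p], rows[q]) >= THRESHOLD:
        similar[p].add(q)
        similar[q].add(p)

# Triples (p, q, r) where p~q and q~r hold but p~r does not.
violations = [
    (p, q, r)
    for q in sorted(rows)
    for p, r in combinations(sorted(similar[q]), 2)
    if r not in similar[p]
]
print(violations)
```

Here C is similar to A and A is similar to D, yet C and D share only one feature. This is why the rows cannot be placed in a single sorted order that keeps every similar pair adjacent, and why overlapping groups are the more natural output.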
Offline
@juster:
Oh dear me, that's so nice of you! Thanks a lot!
You're right, my original intention was to roll one myself, unless of course someone had already done this and shared it. It seems that your solution gives me a solid base for my own tweaking and personalizing; I'll just need to learn some more Perl.
Thanks again!
Offline