[SOLVED] Grouping a plain text database

caminoix · 2011-07-10 12:10:27

Hello,

I have a plain text database in more or less the following format (rows are items A, B, C...; columns are features 1, 2, 3...):

     1    2    3
A    a    a    a
B    a    b    b
C    b    a    a

and I'd like to have the items grouped by the number of common features:

     1    2    3
A    a    a    a
C    b    a    a
B    a    b    b

(A and C have two features in common, A and B have one, and B and C have none.)

Now, my question is this: does a tool that can do this for me already exist?

Last edited by caminoix (2011-07-12 21:41:46)

karol · 2011-07-10 12:30:55

If the values are simple pairs (a / b, 0 / 1, yes / no), maybe you can just sort the file?

With regard to "common features", where would

D    a    b    a

go? It has 2 features in common with both A and B and 1 with C.

caminoix · 2011-07-10 13:41:35

Thanks, but simple sorting puts B before C, which is exactly what I don't want. Reordering the columns is acceptable, but that only works for a minimal example such as this one, not for my actual task.

I think I might have oversimplified my problem somewhat for the sake of this post. Anyway, including your line, my dream grouping would be this:

-- common in total

A    a    a    a        -- two, group 1
C    b    a    a
D    a    b    a

B    a    b    b        -- two, group 2
D    a    b    a

A    a    a    a        -- one
B    a    b    b


-- common in row

A    a    a    a        -- two, group 1
C    b    a    a

B    a    b    b        -- two, group 2
D    a    b    a

A    a    a    a        -- one
B    a    b    b

The ordering of subgroups (ACD before BD, AC before BD) is absolutely unimportant to me. All I need is to find what these subgroups are.

I'll appreciate any ideas that can get me at least in the vicinity of this result.

karol · 2011-07-10 14:19:11

Common in a row:

[karol@black test]$ cat abcd
A a a a
B a b b
C b a a
D a b a
[karol@black test]$ awk '$3 == "a" && $4 == "a"' abcd
A a a a
C b a a
[karol@black test]$ awk '$2 == "a" && $3 == "b"' abcd
B a b b
D a b a

This approach is pretty flexible but would involve scripting all possible combinations.

Common in total:

[karol@black test]$ grep -e ". a a" -e "a . a" -e "a a ." abcd
A a a a
C b a a
D a b a
[karol@black test]$ grep "a b ." abcd
B a b b
D a b a

Seems like grep is more concise and it can do the 'common in a row' part too.
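
A small script can enumerate the column combinations instead of typing each pattern by hand. The Python sketch below is only an illustration, not an existing tool: `parse` and `groups_sharing` are made-up helper names, and it assumes whitespace-separated rows of equal width with the item name in the first field.

```python
from collections import defaultdict
from itertools import combinations

# Same content as the abcd file above.
ROWS = """\
A a a a
B a b b
C b a a
D a b a
"""

def parse(text):
    """Map each item name to its list of feature values."""
    fields = (line.split() for line in text.splitlines())
    return {f[0]: f[1:] for f in fields if f}

def groups_sharing(table, k, adjacent=True):
    """Group items that carry identical values in the same k columns.

    adjacent=True  -> only runs of k neighbouring columns ("common in a row")
    adjacent=False -> any k columns ("common in total")
    """
    width = len(next(iter(table.values())))
    if adjacent:
        column_sets = [tuple(range(i, i + k)) for i in range(width - k + 1)]
    else:
        column_sets = list(combinations(range(width), k))
    found = {}
    for cols in column_sets:
        buckets = defaultdict(list)
        for name, values in table.items():
            buckets[tuple(values[c] for c in cols)].append(name)
        for key, names in buckets.items():
            if len(names) > 1:  # a "group" needs at least two items
                found[(cols, key)] = sorted(names)
    return found

table = parse(ROWS)
for (cols, key), names in groups_sharing(table, 2).items():
    print(cols, key, names)  # finds B D and A C, as with the awk commands
```

With `adjacent=False` the same call also reports A and D, which share columns 1 and 3. Note that this groups by identical values in the same columns, which is one possible reading of "common in total"; it does not merge overlapping pairs into a bigger group such as A C D.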

Last edited by karol (2011-07-10 14:43:47)

caminoix · 2011-07-10 14:42:15

Perhaps I should have mentioned that my database is going to contain probably around fifty items and some three hundred features, each with fifteen or twenty options, not just a/b, some with more than one at the same time...

karol · 2011-07-10 14:49:57

caminoix wrote:

Perhaps I should have mentioned that my database is going to contain probably around fifty items and some three hundred features, each with fifteen or twenty options, not just a/b, some with more than one at the same time...

What do you mean by "more than one at the same time"? Can you give an example?

caminoix · 2011-07-10 14:58:00

karol wrote:

What do you mean by "more than one at the same time"? Can you give an example?

Well, in reality it'll be more like this:

              word 1    word 2    word 3    ...
language 1    p         p         pp
language 2    r         p,r       p
...

which means that language 2 / word 2 has both p and r. The ordering of p and r is not important; it could just as well be r,p. Also note that pp (word 3, language 1) is a different value from p.

So, thank you for trying, but awking through all the possible combinations by hand is out of the question with this quantity.
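
The cell format described above could be handled by treating every cell as a set of values. This is a rough Python sketch under assumptions of mine: `cell` and `shared` are hypothetical names, values inside a cell are separated by commas, and whether a partial overlap (p against p,r) counts as "common" is left as a switch because that has not been decided here.

```python
def cell(text):
    """Turn 'p,r' into frozenset({'p', 'r'}).

    Order inside the cell is irrelevant, and 'pp' stays distinct from 'p'.
    """
    return frozenset(part.strip() for part in text.split(",") if part.strip())

def shared(row_x, row_y, partial=False):
    """Count the columns in which two rows agree.

    partial=False: the two value sets must be identical.
    partial=True:  one value in common is enough (an assumption).
    """
    if partial:
        return sum(bool(cell(u) & cell(v)) for u, v in zip(row_x, row_y))
    return sum(cell(u) == cell(v) for u, v in zip(row_x, row_y))

# The two example rows from the table above.
language_1 = ["p", "p", "pp"]
language_2 = ["r", "p,r", "p"]

print(cell("p,r") == cell("r,p"))                    # True
print(shared(language_1, language_2))                # 0
print(shared(language_1, language_2, partial=True))  # 1 (word 2 overlaps on p)
```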

Last edited by caminoix (2011-07-10 14:58:40)

juster · 2011-07-12 16:32:33

I don't know of a ready-made tool to do this. That doesn't mean you can't roll your own. You probably want to do this yourself, but I got a little carried away and wrote a solution in Perl. There is no fancy algorithm used and I think it is rather straightforward.

https://gist.github.com/1078348

This is difficult to sort because similarities between languages are not transitive. Then again, you only asked to group them and I think the code does that much. But who knows? Grouping is a vague concept that can be tweaked in different ways.
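
To illustrate the transitivity problem on the four-row example from earlier in the thread, here is a short Python sketch. It is not the code from the gist; `similar_pairs` and `broken_triples` are names invented for this illustration, and "similar" is assumed to mean "shares at least `threshold` features".

```python
from itertools import combinations

TABLE = {
    "A": ["a", "a", "a"],
    "B": ["a", "b", "b"],
    "C": ["b", "a", "a"],
    "D": ["a", "b", "a"],
}

def similar_pairs(table, threshold):
    """Return the unordered item pairs sharing at least `threshold` features."""
    return {
        frozenset((x, y))
        for x, y in combinations(table, 2)
        if sum(u == v for u, v in zip(table[x], table[y])) >= threshold
    }

def broken_triples(table, pairs):
    """List (x, y, z) where x~y and y~z hold but x~z does not."""
    bad = []
    for y in table:
        neighbours = sorted(n for n in table if frozenset((n, y)) in pairs)
        for x, z in combinations(neighbours, 2):
            if frozenset((x, z)) not in pairs:
                bad.append((x, y, z))
    return bad

pairs = similar_pairs(TABLE, 2)
print(sorted("".join(sorted(p)) for p in pairs))  # ['AC', 'AD', 'BD']
print(broken_triples(TABLE, pairs))  # [('C', 'A', 'D'), ('A', 'D', 'B')]
```

C and D are each similar to A but not to each other, so the items cannot be split into clean, non-overlapping groups; any grouping has to allow an item to appear more than once.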

caminoix · 2011-07-12 21:41:23

@juster:
Oh dear me, that's so nice of you! Thanks a lot!
You're right, my original intention was to roll one myself, unless, of course, someone had already done this and shared it. It seems that your solution gives me a solid base to do my own tweaking and personalizing. I'll just need to learn some more Perl.
Thanks again!
