Align matching rows

ibendiben · 2008-08-06 11:47:43

Hmm
I have this type of data:
#=key

# master # trans1 # trans2 # trans3 # trans4 # trans5
1 Abisenio 7 Adam 2 Abessinië 3 Abraham 2 Abessinien 2 Abyssinie
2 Abisenujo 8 Gulf of Aden 3 Abraham 6 Azores 4 Abchasien 3 Abraham
3 Abrahamo 10 Addis Ababa 4 Abchazië 7 Adán 6 Azoren 4 Abkhasie
4 Ab{azio
5 Ab{azujo
6 Acoroj
7 Adamo
8 Adena Golfo
9 Adeno
10 Adisabebo

As you see every column has a # column followed by a column with words.
The first one I call the master column. It contains all #s, and words.
The other columns contain some translations of the words in the master column, together with their #.
I want to align the translations with their # words in the master column. So it looks like this:

# master # trans1 # trans2 # trans3 # trans4 # trans5
1 Abisenio
2 Abisenujo 2 Abessinië 2 Abessinien 2 Abyssinie
3 Abrahamo 3 Abraham 3 Abraham 3 Abraham
4 Ab{azio 4 Abchazië 4 Abchasien 4 Abkhasie
5 Ab{azujo
6 Acoroj 6 Azores 6 Azoren
7 Adamo 7 Adam 7 Adán
8 Adena Golfo 8 Gulf of Aden
9 Adeno
10 Adisabebo 10 Addis Ababa

If you know a solution, I'd be very pleased!

/edit: changed tab character to "=", because tabs are converted to spaces.
I figured it may be useful to provide the data separated by "=" where normally a tab is used. So if needed you can use it more easily:

1=Abisenio=7=Adam=2=Abessinië=3=Abraham=2=Abessinien=2=Abyssinie
2=Abisenujo=8=Gulf of Aden=3=Abraham=6=Azores=4=Abchasien=3=Abraham
3=Abrahamo=10=Addis Ababa=4=Abchazië=7=Adán=6=Azoren=4=Abkhasie
4=Ab{azio==========
5=Ab{azujo==========
6=Acoroj==========
7=Adamo==========
8=Adena Golfo==========
9=Adeno==========
10=Adisabebo==========

Thanks in advance!

Last edited by ibendiben (2008-08-06 14:10:48)

Cerebral · 2008-08-06 13:09:03

There are solutions, and I could probably write you up an app that'd do it pretty quickly, but in general it seems like somewhat of a haphazard way to store your data. You might want to reconsider your data representation first.

Why do you need this? What language did you want to use?

ibendiben · 2008-08-06 13:55:03

I agree, it's not very useful this way. But it's easy to paste it into a spreadsheet program, so that's not the problem.
I'm not very familiar with excel macro's or whatever, so I thought let's convert it in a text file so I can ask for any solution here.
The layout doesn't at all matter to me. It's just that the data needs to be aligned and I wondered if someone knew an easy way out.
In the mean time I'm investigating excel's programming capabilities.

If you know an easy way to align the data, tell me what's most easy for you. To me scripting languages are most easy to comprehense...

I need this for a program I want to make that let's me quickly learn the most common words in other languages.

/edit
It may be a good thing to know that the actual datasheets contain over 80000 rows! I hope this doesn't complicate things to much but I think this is important when deciding about the programming language.

Last edited by ibendiben (2008-08-06 13:57:52)

Procyon · 2008-08-06 15:06:10

I have an idea:

First divide the file in different files per language. With cut -d'<TAB>' -f1,2 (and 3,4 ; 5,6 ; 7,8 etc) > langfile.txt

We will get
original.txt
1 a
2 b
3 c

And e.g.
english.txt
2 b_eng

Then we make a template file:
template.txt
1[tab][tab]
2[tab][tab]
3[tab][tab]

Combine each lang file with template, sort, get rid of empty items we have translation for
(cat template; cat english.txt) | sort > lang_eng.txt
[EDITED]
sed -e '{:more;N;s/\([0-9]\+\)[[:blank:]]*\n\1/\1/;Tmore}' lang_eng.txt

Once done, for all languages:
paste orig.txt lang_eng.txt lang_french.txt > new_db.txt

Get rid of empty entries if necessary
paste orig.txt lang_eng.txt | sed 's/[0-9]\+<TAB><TAB>/<TAB><TAB>/g'

EDIT:
error in sed command. looking for a way around

EDIT:
improved sed command:
old: sed -ie 'N;s/\([0-9]\+\)[[:blank:]]*\n\1/\1/' lang_eng.txt
new: sed -e '{:more;N;s/\([0-9]\+\)[[:blank:]]*\n\1/\1/;Tmore}' lang_eng.txt

See what goes wrong? It will skip over the line that was N'ed so if it was a duplicate entry it was overlooked.

Last edited by Procyon (2008-08-06 15:19:52)

ibendiben · 2008-08-06 15:19:30

I get the idea
I'll see if sed can handle my files. grep refused with earlier probes, so I just hope it works.
Think I know the problem with the command but hang on, I'll have to try first.

ibendiben · 2008-08-06 15:54:33

More simple would be to only combine the key columns and then use "uniq -u" to only keep the uniq lines. Resulting in a list of keys that don't have translation.
Combining this new list with the keys + translation, and sorting it gives me the same result, gonna see if that works. (it takes aaages;)

ibendiben · 2008-08-06 18:54:13

done!

Steps:

-created separated files for every key + translation with awk:

awk -F'\t; '{print $3"\t"$4} >trans1.txt

-created separated files for every key column

awk -F'\t' '{print $3} >trans1_keys.txt

-added every key column to the masterkey column, sorted it and then only kept the uniq lines. The result I add to the key + translation files.

sort -n master_keys.txt trans1_keys.txt | uniq -u >>trans1.txt

-after that I only had to sort the files again

sort -n trans1.txt

-with paste I could join the files together again.

paste master.txt trans1.txt trans2.txt trans3.txt trans4.txt trans5.txt

Hopefully I made no mistakes in writing this down.
BTW I found that sed works alot better removing blanks out of huge files. Grep wouldn't react for a long time. Sed took only seconds.

Thanks for all the imput!

Arch Linux

#1 2008-08-06 11:47:43

Align matching rows

#2 2008-08-06 13:09:03

Re: Align matching rows

#3 2008-08-06 13:55:03

Re: Align matching rows

#4 2008-08-06 15:06:10

Re: Align matching rows

#5 2008-08-06 15:19:30

Re: Align matching rows

#6 2008-08-06 15:54:33

Re: Align matching rows

#7 2008-08-06 18:54:13

Re: Align matching rows

Board footer