You are not logged in.
Pages: 1
Hmm
I have this type of data:
#=key
# master # trans1 # trans2 # trans3 # trans4 # trans5
1 Abisenio 7 Adam 2 Abessinië 3 Abraham 2 Abessinien 2 Abyssinie
2 Abisenujo 8 Gulf of Aden 3 Abraham 6 Azores 4 Abchasien 3 Abraham
3 Abrahamo 10 Addis Ababa 4 Abchazië 7 Adán 6 Azoren 4 Abkhasie
4 Ab{azio
5 Ab{azujo
6 Acoroj
7 Adamo
8 Adena Golfo
9 Adeno
10 Adisabebo
As you see every column has a # column followed by a column with words.
The first one I call the master column. It contains all #s, and words.
The other columns contain some translations of the words in the master column, together with their #.
I want to align the translations with their # words in the master column. So it looks like this:
# master # trans1 # trans2 # trans3 # trans4 # trans5
1 Abisenio
2 Abisenujo 2 Abessinië 2 Abessinien 2 Abyssinie
3 Abrahamo 3 Abraham 3 Abraham 3 Abraham
4 Ab{azio 4 Abchazië 4 Abchasien 4 Abkhasie
5 Ab{azujo
6 Acoroj 6 Azores 6 Azoren
7 Adamo 7 Adam 7 Adán
8 Adena Golfo 8 Gulf of Aden
9 Adeno
10 Adisabebo 10 Addis Ababa
If you know a solution, I'd be very pleased!
/edit: changed tab character to "=", because tabs are converted to spaces.
I figured it may be useful to provide the data separated by "=" where normally a tab is used. So if needed you can use it more easily:
1=Abisenio=7=Adam=2=Abessinië=3=Abraham=2=Abessinien=2=Abyssinie
2=Abisenujo=8=Gulf of Aden=3=Abraham=6=Azores=4=Abchasien=3=Abraham
3=Abrahamo=10=Addis Ababa=4=Abchazië=7=Adán=6=Azoren=4=Abkhasie
4=Ab{azio==========
5=Ab{azujo==========
6=Acoroj==========
7=Adamo==========
8=Adena Golfo==========
9=Adeno==========
10=Adisabebo==========
Thanks in advance!
Last edited by ibendiben (2008-08-06 14:10:48)
Offline
There are solutions, and I could probably write you up an app that'd do it pretty quickly, but in general it seems like somewhat of a haphazard way to store your data. You might want to reconsider your data representation first.
Why do you need this? What language did you want to use?
Offline
I agree, it's not very useful this way. But it's easy to paste it into a spreadsheet program, so that's not the problem.
I'm not very familiar with excel macro's or whatever, so I thought let's convert it in a text file so I can ask for any solution here.
The layout doesn't at all matter to me. It's just that the data needs to be aligned and I wondered if someone knew an easy way out.
In the mean time I'm investigating excel's programming capabilities.
If you know an easy way to align the data, tell me what's most easy for you. To me scripting languages are most easy to comprehense...
I need this for a program I want to make that let's me quickly learn the most common words in other languages.
/edit
It may be a good thing to know that the actual datasheets contain over 80000 rows! I hope this doesn't complicate things to much but I think this is important when deciding about the programming language.
Last edited by ibendiben (2008-08-06 13:57:52)
Offline
I have an idea:
First divide the file in different files per language. With cut -d'<TAB>' -f1,2 (and 3,4 ; 5,6 ; 7,8 etc) > langfile.txt
We will get
original.txt
1 a
2 b
3 c
And e.g.
english.txt
2 b_eng
Then we make a template file:
template.txt
1[tab][tab]
2[tab][tab]
3[tab][tab]
Combine each lang file with template, sort, get rid of empty items we have translation for
(cat template; cat english.txt) | sort > lang_eng.txt
[EDITED]
sed -e '{:more;N;s/\([0-9]\+\)[[:blank:]]*\n\1/\1/;Tmore}' lang_eng.txt
Once done, for all languages:
paste orig.txt lang_eng.txt lang_french.txt > new_db.txt
Get rid of empty entries if necessary
paste orig.txt lang_eng.txt | sed 's/[0-9]\+<TAB><TAB>/<TAB><TAB>/g'
EDIT:
error in sed command. looking for a way around
EDIT:
improved sed command:
old: sed -ie 'N;s/\([0-9]\+\)[[:blank:]]*\n\1/\1/' lang_eng.txt
new: sed -e '{:more;N;s/\([0-9]\+\)[[:blank:]]*\n\1/\1/;Tmore}' lang_eng.txt
See what goes wrong? It will skip over the line that was N'ed so if it was a duplicate entry it was overlooked.
Last edited by Procyon (2008-08-06 15:19:52)
Offline
I get the idea
I'll see if sed can handle my files. grep refused with earlier probes, so I just hope it works.
Think I know the problem with the command but hang on, I'll have to try first.
Offline
More simple would be to only combine the key columns and then use "uniq -u" to only keep the uniq lines. Resulting in a list of keys that don't have translation.
Combining this new list with the keys + translation, and sorting it gives me the same result, gonna see if that works. (it takes aaages;)
Offline
done!
Steps:
-created separated files for every key + translation with awk:
awk -F'\t; '{print $3"\t"$4} >trans1.txt
-created separated files for every key column
awk -F'\t' '{print $3} >trans1_keys.txt
-added every key column to the masterkey column, sorted it and then only kept the uniq lines. The result I add to the key + translation files.
sort -n master_keys.txt trans1_keys.txt | uniq -u >>trans1.txt
-after that I only had to sort the files again
sort -n trans1.txt
-with paste I could join the files together again.
paste master.txt trans1.txt trans2.txt trans3.txt trans4.txt trans5.txt
Hopefully I made no mistakes in writing this down.
BTW I found that sed works alot better removing blanks out of huge files. Grep wouldn't react for a long time. Sed took only seconds.
Thanks for all the imput!
Offline
Pages: 1