Hi everyone,
I'm a programming enthusiast studying Linguistics. I've been toying with an idea for a side project and I'd like your input on a few points. I want to write a program that reads user input, matches it with noun and verb repertoires and builds grammatical sentences in French.
Sample entries, stored in separate files:
== common nouns ==
maison; fs; maisons; fp;
enfant; ms; enfants; mp;
cheval; ms; chevaux; mp;
== intransitive verbs ==
manger; mange; manges; mange; mangeons; mangez; mangent;
exploser; explose; exploses; explose; explosons; explosez; explosent;
dormir; dors; dors; dort; dormons; dormez; dorment;
A user would, for instance, select 'chevaux' and 'manger' and the program would read the gender and number (masculine, plural), pick the appropriate article and conjugate the verb accordingly by printing: Les chevaux mangent. Eventually I'd like to expand entries to allow for more complex sentences like Les chevaux mangent bruyamment les carottes que le fermier leur a offertes.
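To make the idea concrete, here is a minimal sketch in Python of the lookup described above. It assumes the two sample formats shown earlier (noun lines as singular; tag; plural; tag; and verb lines as infinitive followed by the six present-tense forms). The function names are made up for illustration, and the elision rule is deliberately naive: it only checks for an initial vowel and ignores mute versus aspirated h.

```python
NOUNS_TXT = """\
maison; fs; maisons; fp;
enfant; ms; enfants; mp;
cheval; ms; chevaux; mp;
"""

VERBS_TXT = """\
manger; mange; manges; mange; mangeons; mangez; mangent;
exploser; explose; exploses; explose; explosons; explosez; explosent;
dormir; dors; dors; dort; dormons; dormez; dorment;
"""

VOWELS = "aeiouéèêàâîôû"  # assumption: 'h' and 'y' are not handled


def fields(line):
    """Split one semicolon-separated entry and drop empty fields."""
    return [f.strip() for f in line.split(";") if f.strip()]


def load_nouns(text):
    """Map every noun form to its gender/number tag, e.g. 'chevaux' -> 'mp'."""
    nouns = {}
    for line in text.splitlines():
        parts = fields(line)
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        singular, sing_tag, plural, plur_tag = parts
        nouns[singular] = sing_tag
        nouns[plural] = plur_tag
    return nouns


def load_verbs(text):
    """Map each infinitive to its six present-tense forms (je ... ils)."""
    verbs = {}
    for line in text.splitlines():
        parts = fields(line)
        if len(parts) != 7:
            continue
        verbs[parts[0]] = parts[1:]
    return verbs


def definite_article(noun, tag):
    """Pick les / l' / le / la from the tag and the noun's first letter."""
    gender, number = tag[0], tag[1]
    if number == "p":
        return "les "
    if noun[0].lower() in VOWELS:
        return "l'"  # elision, no space before the noun
    return "le " if gender == "m" else "la "


def sentence(noun, infinitive, nouns, verbs):
    """Build 'Article noun verb.' with third-person agreement."""
    tag = nouns[noun]
    form = verbs[infinitive][2 if tag[1] == "s" else 5]
    text = f"{definite_article(noun, tag)}{noun} {form}."
    return text[0].upper() + text[1:]


NOUNS = load_nouns(NOUNS_TXT)
VERBS = load_verbs(VERBS_TXT)
print(sentence("chevaux", "manger", NOUNS, VERBS))  # Les chevaux mangent.
```

Keeping the tag next to each form means the plural never has to be derived by rule, which matters for irregular pairs such as cheval/chevaux.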
I'd also integrate syntactic and semantic data so that the program can assess the sentence's grammaticality.
That part is all fine and dandy, but I'm not exactly sure how to get started. I have some knowledge of C and Java and I'm willing to learn (a) new language(s) if necessary.
My questions are:
What language would you recommend for this? I'd like to build a GUI at some point but it's definitely not a priority.
What format (CSV, XML, etc.) should I use to store lexical units? I've read good things about XML as far as readability and parsing go, but it seems to be less efficient with large datasets. I don't expect my repertoires to grow that much but you never know! :-)
I'm also open to IDE/text editor recommendations. I've been using Geany for C and Eclipse for Java.
Thanks in advance for reading and/or replying. I look forward to your input.
"The problem is, after a week of intense googling, we’ve started to burn out on knowing the answer to everything. God must feel that way all the time. I think people in the year 2020 are going to be nostalgic for the sensation of feeling clueless." Douglas Coupland - jPod
Offline
I assume you've explored the background:
http://en.wikipedia.org/wiki/Machine_translation
My general impression is that machine translation is a hard problem. Having a key:item translation dictionary is the easy part. The difficulty is idioms:
http://en.wikipedia.org/wiki/Idiom
A literal word-for-word translation often does not fit the destination language, because the meaning is carried not by the individual words but by the phrase as a whole, and that meaning is culturally specific.
For example, the TV show "Big Brother" has an Orwellian meaning in the West, which we understand culturally. However, when an equivalent show was made for Middle Eastern audiences, that idiom had not developed there, so the name of the show, roughly translated, was "The Boss." The semantics are close in domain, but an exact translation is not possible because the underlying meanings are not equivalent between the cultures.
Anyway, if you figure out machine translation there are a few hundred million dollars waiting for you.
Edit: And I believe I misunderstood you? You are not looking for translation?
Last edited by headkase (2013-02-01 17:02:06)
Offline
You did misunderstand, but thanks for the great reply. I'm not tech savvy enough to tackle translation software yet. I want to write a program that lets the user select components (subject, verb, direct object, etc.), and then fills in the gaps to build a full sentence. This is substantially harder in French than in English because determiners, among other things, can assume many forms, depending on the noun's gender and number.
Example: If this were in English, you'd select 'horse' and 'sleep' and the output would be The horse sleeps.
Offline
The difficulty is idioms:
I still chuckle at an early experiment in natural language translation. For fun, an idiom was sent on a round trip through an early English-to-Russian translator, and the result was fed back through a Russian-to-English translator:
Out of sight, out of mind --> Invisible imbecile.
I don't know firsthand whether it is true or a legend, but it highlights the problem.
To the OP: This is a really good case for using a database. Don't even think about trying to hard code this.
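As a rough illustration of what that could look like, here is a sketch using SQLite through Python's built-in sqlite3 module. The table names, column names, and helper functions are invented for the example, and only third-person verb forms are loaded; treat it as one possible layout under those assumptions, not a recommended schema.

```python
import sqlite3


def build_lexicon():
    """Create an in-memory database holding a few nouns and verb forms."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE noun (
            form   TEXT PRIMARY KEY,
            lemma  TEXT NOT NULL,
            gender TEXT NOT NULL,   -- 'm' or 'f'
            number TEXT NOT NULL    -- 's' or 'p'
        );
        CREATE TABLE verb_form (
            infinitive TEXT NOT NULL,
            person     INTEGER NOT NULL,  -- 1, 2 or 3
            number     TEXT NOT NULL,     -- 's' or 'p'
            form       TEXT NOT NULL,
            PRIMARY KEY (infinitive, person, number)
        );
        """
    )
    conn.executemany(
        "INSERT INTO noun VALUES (?, ?, ?, ?)",
        [
            ("maison", "maison", "f", "s"),
            ("maisons", "maison", "f", "p"),
            ("cheval", "cheval", "m", "s"),
            ("chevaux", "cheval", "m", "p"),
        ],
    )
    conn.executemany(
        "INSERT INTO verb_form VALUES (?, ?, ?, ?)",
        [
            ("manger", 3, "s", "mange"),
            ("manger", 3, "p", "mangent"),
            ("dormir", 3, "s", "dort"),
            ("dormir", 3, "p", "dorment"),
        ],
    )
    return conn


def agreeing_verb(conn, noun_form, infinitive):
    """Return the third-person verb form that agrees in number with the noun."""
    row = conn.execute(
        """
        SELECT v.form
        FROM noun AS n
        JOIN verb_form AS v ON v.number = n.number
        WHERE n.form = ? AND v.infinitive = ? AND v.person = 3
        """,
        (noun_form, infinitive),
    ).fetchone()
    return row[0] if row else None


conn = build_lexicon()
print(agreeing_verb(conn, "chevaux", "manger"))  # mangent
```

The agreement rule lives in the join condition rather than in program logic, so adding words is a matter of inserting rows.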
Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way
Offline
This is substantially harder in French than in English because determiners, among other things, can assume many forms, depending on the noun's gender and number.
If you really wanted a challenge, you'd build it for Welsh...
I'm not sure if the same complications would apply to Breton or not. (The languages are very close.)
CLI Paste | How To Ask Questions
Arch Linux | x86_64 | GPT | EFI boot | refind | stub loader | systemd | LVM2 on LUKS
Lenovo x270 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz | Intel Wireless 8265/8275 | US keyboard w/ Euro | 512G NVMe INTEL SSDPEKKF512G7L
Offline
ewaller and cfr, thanks for your input. I apologize for not replying in a timely manner.
To the OP: This is a really good case for using a database. Don't even think about trying to hard code this.
I will definitely look into databases. I'm only (vaguely) familiar with MySQL, so once again I'm open to suggestions. I read that Python and Perl are suitable languages for interacting with a MySQL DB, so I might just try learning them, since they're also pretty useful in NLP.
If you really wanted a challenge, you'd build it for Welsh... :)
I think I'll stick to French for now. Building it for Welsh would probably drive me nuts within days. :) Thankfully it doesn't have cases. I'm simply trying to put into practice what I've learned in my introductory linguistics courses and most of it pertains to French. Is Welsh your mother tongue / a language you've learned? After a little googling, I'm tempted to try and learn it at some point, along with Basque. Perhaps Breton as well if it's fairly close.
Offline
Second language (partly as child, partly as adult). It has mutations of three different kinds. For example:
cat: cath
a cat: cath
the cat: y gath
the black cat: y gath ddu
his cat: ei gath
her cat: ei chath
my cat: fy nghath
dog: ci
a dog: ci
the dog: y ci
the black dog: y ci du
his dog: ei gi
her dog: ei chi
my dog: fy nghi
in Penarth: ym Mhenarth
in Caerdydd: yng Nghaerdydd
in Gwent: yng Ngwent
So you can't just plug things into gaps because the first letters of words change in context. It is interesting but a bit complex. (And something of a nightmare if you are trying to use a dictionary with no idea of mutations. Even worse if you don't know the alphabet differs from the English one!)
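For the programmers in the thread, the letter changes in the examples above can be sketched as lookup tables. The three kinds are usually called soft, nasal and aspirate mutation. This is only a sketch: the function name is made up, it changes the initial consonant and nothing else, and it does not decide which mutation a given context triggers (that depends on the preceding word, gender and so on) or change yn to ym/yng.

```python
# Initial-consonant changes for each mutation type.
SOFT = {"p": "b", "t": "d", "c": "g", "b": "f", "d": "dd",
        "g": "", "m": "f", "ll": "l", "rh": "r"}
NASAL = {"p": "mh", "t": "nh", "c": "ngh", "b": "m", "d": "n", "g": "ng"}
ASPIRATE = {"p": "ph", "t": "th", "c": "ch"}

TABLES = {"soft": SOFT, "nasal": NASAL, "aspirate": ASPIRATE}

# These digraphs are single letters in the Welsh alphabet and are left
# unchanged; without this check 'ch...' would be treated as 'c' + 'h'.
UNMUTATED_DIGRAPHS = ("ch", "dd", "ff", "ng", "ph", "th")


def mutate(word, kind):
    """Apply the named mutation to the first letter of word, if it has one."""
    table = TABLES[kind]
    lower = word.lower()
    if lower.startswith(UNMUTATED_DIGRAPHS):
        return word
    # Try two-letter initials (ll, rh) before single letters.
    for initial in sorted(table, key=len, reverse=True):
        if lower.startswith(initial):
            mutated = table[initial] + word[len(initial):]
            if word[0].isupper():
                # Keep place names capitalised: Penarth -> Mhenarth
                mutated = mutated[:1].upper() + mutated[1:]
            return mutated
    return word  # initial letter does not take this mutation


print("ei", mutate("cath", "soft"))       # his cat
print("ei", mutate("cath", "aspirate"))   # her cat
print("fy", mutate("cath", "nasal"))      # my cat
print("ym", mutate("Penarth", "nasal"))   # in Penarth
```

A dictionary lookup tool would need the reverse direction as well, which is harder because several mutated forms map back to more than one possible base letter.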
Offline