
#1 2013-02-01 16:27:25

ximun
Member
From: Montreal, QC
Registered: 2012-08-14
Posts: 26

HOWTO Write a sentence-building program?

Hi everyone,

I'm a programming enthusiast studying Linguistics. I've been toying with an idea for a side project and I'd like your input on a few points. I want to write a program that reads user input, matches it with noun and verb repertoires and builds grammatical sentences in French.

Sample entries, stored in separate files:

== common nouns ==
maison; fs; maisons; fp;
enfant; ms; enfants; mp;
cheval; ms; chevaux; mp;
== intransitive verbs ==
manger; mange; manges; mange; mangeons; mangez; mangent;
exploser; explose; exploses; explose; explosons; explosez; explosent;
dormir; dors; dors; dort; dormons; dormez; dorment;

A user would, for instance, select 'chevaux' and 'manger' and the program would read the gender and number (masculine, plural), pick the appropriate article and conjugate the verb accordingly by printing: Les chevaux mangent. Eventually I'd like to expand entries to allow for more complex sentences like Les chevaux mangent bruyamment les carottes que le fermier leur a offertes.
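As a rough starting point, the lookup described above could be sketched in Python along these lines. This is only a minimal sketch under a few assumptions: the entries use exactly the semicolon-separated layout shown in the samples, the verb forms are listed in the order je, tu, il/elle, nous, vous, ils/elles, and only the definite article is handled. The function names (`load_nouns`, `load_verbs`, `definite_article`, `build_sentence`) are made up for illustration, and elision is handled naively (any initial vowel triggers l', and words starting with h are not treated at all).

```python
# Sample entries in the same layout as the repertoires above.
NOUN_LINES = """\
maison; fs; maisons; fp;
enfant; ms; enfants; mp;
cheval; ms; chevaux; mp;
"""

VERB_LINES = """\
manger; mange; manges; mange; mangeons; mangez; mangent;
exploser; explose; exploses; explose; explosons; explosez; explosent;
dormir; dors; dors; dort; dormons; dormez; dorment;
"""

# Assumption: an initial vowel always triggers elision of le/la.
VOWELS = "aeiouâàéèêëîïôûù"


def parse_fields(line):
    """Split a 'a; b; c;' line into stripped, non-empty fields."""
    return [field.strip() for field in line.split(";") if field.strip()]


def load_nouns(text):
    """Map every noun form to its gender/number tag, e.g. 'chevaux' -> 'mp'."""
    nouns = {}
    for line in text.splitlines():
        fields = parse_fields(line)
        # Fields alternate: form, tag, form, tag, ...
        for form, tag in zip(fields[0::2], fields[1::2]):
            nouns[form] = tag
    return nouns


def load_verbs(text):
    """Map each infinitive to its six present-tense forms."""
    verbs = {}
    for line in text.splitlines():
        fields = parse_fields(line)
        if len(fields) == 7:
            verbs[fields[0]] = fields[1:]
    return verbs


def definite_article(form, tag):
    """Pick le / la / l' / les from the noun's tag and first letter."""
    if tag.endswith("p"):
        return "les "
    if form[0].lower() in VOWELS:
        return "l'"
    return "le " if tag.startswith("m") else "la "


NOUNS = load_nouns(NOUN_LINES)
VERBS = load_verbs(VERB_LINES)


def build_sentence(noun_form, infinitive):
    """Build 'article + noun + third-person verb' for the selected words."""
    tag = NOUNS[noun_form]
    forms = VERBS[infinitive]
    # Index 2 is third person singular, index 5 is third person plural.
    verb = forms[5] if tag.endswith("p") else forms[2]
    sentence = definite_article(noun_form, tag) + noun_form + " " + verb + "."
    return sentence[0].upper() + sentence[1:]


print(build_sentence("chevaux", "manger"))
```

Selecting 'enfant' and 'dormir' instead would go through the elision branch of `definite_article`. Direct objects, adverbs and relative clauses would need a richer representation than this flat lookup.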

I'd also integrate syntactic and semantic data so that the program can assess the sentence's grammaticality.

That part is all fine and dandy, but I'm not exactly sure how to get started. I have some knowledge of C and Java and I'm willing to learn (a) new language(s) if necessary.

My questions are:

  1. What language would you recommend for this? I'd like to build a GUI at some point but it's definitely not a priority.

  2. What format (CSV, XML, etc.) should I use to store lexical units? I've read good things about XML as far as readability and parsing go, but it seems to be less efficient with large datasets. I don't expect my repertoires to grow that much but you never know! :-)

I'm also open to IDE/text editor recommendations. I've been using geany for C and Eclipse for Java.

Thanks in advance for reading and/or replying. I look forward to your input.


"The problem is, after a week of intense googling, we’ve started to burn out on knowing the answer to everything. God must feel that way all the time. I think people in the year 2020 are going to be nostalgic for the sensation of feeling clueless." Douglas Coupland - jPod


#2 2013-02-01 16:55:32

headkase
Member
Registered: 2011-12-06
Posts: 1,975

Re: HOWTO Write a sentence-building program?

I assume you've explored the background:

http://en.wikipedia.org/wiki/Machine_translation

My general impression is that machine translation is a hard problem. Having a key:item translation dictionary is the easy part. The difficulty is idioms:

http://en.wikipedia.org/wiki/Idiom

A literal word-for-word translation often does not fit the destination language, because the meaning is carried not by the individual words but by the phrase as a whole, and that meaning differs from culture to culture.

For example, the TV show "Big Brother" has an Orwellian meaning in the West, which we understand culturally. However, when an equivalent show was made for Middle Eastern audiences, that idiom was not established there, so the name of the show, roughly translated, was "The Boss." The semantics are close in domain, but an exact translation is not possible because the underlying meanings are not equivalent between the cultures.

Anyway, if you figure out machine translation there are a few hundred million dollars waiting for you... ;)

Edit: And I believe I misunderstood you? You are not looking for translation? ;)

Last edited by headkase (2013-02-01 17:02:06)


#3 2013-02-01 20:49:41

ximun
Member
From: Montreal, QC
Registered: 2012-08-14
Posts: 26

Re: HOWTO Write a sentence-building program?

You did misunderstand, but thanks for the great reply. I'm not tech savvy enough to tackle translation software yet. I want to write a program that lets the user select components (subject, verb, direct object, etc.), and then fills in the gaps to build a full sentence. This is substantially harder in French than in English because determiners, among other things, can assume many forms, depending on the noun's gender and number.

Example: If this were in English, you'd select 'horse' and 'sleep' and the output would be The horse sleeps.



#4 2013-02-02 00:52:52

ewaller
Administrator
From: Pasadena, CA
Registered: 2009-07-13
Posts: 19,740

Re: HOWTO Write a sentence-building program?

headkase wrote:

The difficulty is idioms:

I still chuckle at an early experiment in natural language translation.  For fun, an idiom was sent round trip through an early English to Russian translator, and the result fed back through a Russian to English translator:

Out of sight, out of mind  --> Invisible imbecile.

I don't know first hand if it is true or a legend; but it highlights the problem.


To the OP:  This is a really good case for using a database.  Don't even think about trying to hard code this.
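To make the database idea a little more concrete, here is a minimal sketch using SQLite through Python's standard `sqlite3` module. SQLite is an assumption chosen only because it needs no server; the table and column names (`noun_form`, `verb_form`, and so on) and the helper functions are hypothetical, and the schema is just one possible layout for the sample entries from the first post.

```python
import sqlite3

# Sample data taken from the first post: (form, lemma, gender, number).
NOUN_ROWS = [
    ("maison", "maison", "f", "s"), ("maisons", "maison", "f", "p"),
    ("enfant", "enfant", "m", "s"), ("enfants", "enfant", "m", "p"),
    ("cheval", "cheval", "m", "s"), ("chevaux", "cheval", "m", "p"),
]

# Present-tense forms in the order je, tu, il/elle, nous, vous, ils/elles.
CONJUGATIONS = {
    "manger": ["mange", "manges", "mange", "mangeons", "mangez", "mangent"],
    "dormir": ["dors", "dors", "dort", "dormons", "dormez", "dorment"],
}


def make_db():
    """Create an in-memory database holding the sample repertoires."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE noun_form (
            form   TEXT PRIMARY KEY,
            lemma  TEXT NOT NULL,
            gender TEXT NOT NULL,   -- 'm' or 'f'
            number TEXT NOT NULL    -- 's' or 'p'
        );
        CREATE TABLE verb_form (
            infinitive TEXT NOT NULL,
            person     INTEGER NOT NULL,  -- 1, 2 or 3
            number     TEXT NOT NULL,     -- 's' or 'p'
            form       TEXT NOT NULL,
            PRIMARY KEY (infinitive, person, number)
        );
    """)
    conn.executemany("INSERT INTO noun_form VALUES (?, ?, ?, ?)", NOUN_ROWS)
    verb_rows = [
        # Positions 0-2 are singular, 3-5 plural; person cycles 1, 2, 3.
        (infinitive, index % 3 + 1, "s" if index < 3 else "p", form)
        for infinitive, forms in CONJUGATIONS.items()
        for index, form in enumerate(forms)
    ]
    conn.executemany("INSERT INTO verb_form VALUES (?, ?, ?, ?)", verb_rows)
    return conn


def third_person_form(conn, noun_form, infinitive):
    """Return the third-person verb form agreeing in number with the noun."""
    row = conn.execute(
        """
        SELECT v.form
        FROM noun_form AS n
        JOIN verb_form AS v ON v.number = n.number
        WHERE n.form = ? AND v.infinitive = ? AND v.person = 3
        """,
        (noun_form, infinitive),
    ).fetchone()
    return row[0] if row else None


conn = make_db()
print(third_person_form(conn, "chevaux", "dormir"))
```

The point of the join is that agreement becomes a query rather than hand-written branching, so adding words means adding rows, not code.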


Nothing is too wonderful to be true, if it be consistent with the laws of nature -- Michael Faraday
Sometimes it is the people no one can imagine anything of who do the things no one can imagine. -- Alan Turing
---
How to Ask Questions the Smart Way


#5 2013-02-02 01:01:50

cfr
Member
From: Cymru
Registered: 2011-11-27
Posts: 7,131

Re: HOWTO Write a sentence-building program?

ximun wrote:

This is substantially harder in French than in English because determiners, among other things, can assume many forms, depending on the noun's gender and number.

If you really wanted a challenge, you'd build it for Welsh... :)

I'm not sure if the same complications would apply to Breton or not. (The languages are very close.)


CLI Paste | How To Ask Questions

Arch Linux | x86_64 | GPT | EFI boot | refind | stub loader | systemd | LVM2 on LUKS
Lenovo x270 | Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz | Intel Wireless 8265/8275 | US keyboard w/ Euro | 512G NVMe INTEL SSDPEKKF512G7L


#6 2013-02-04 16:30:05

ximun
Member
From: Montreal, QC
Registered: 2012-08-14
Posts: 26

Re: HOWTO Write a sentence-building program?

ewaller and cfr, thanks for your input. I apologize for not replying in a timely manner.

ewaller wrote:

To the OP:  This is a really good case for using a database.  Don't even think about trying to hard code this.

I will definitely look into databases. I'm only (vaguely) familiar with MySQL, so once again I'm open to suggestions. I read that Python and Perl are suitable languages for interacting with a MySQL DB, so I might just try learning them, since they're also pretty useful in NLP.

cfr wrote:

If you really wanted a challenge, you'd build it for Welsh... :)

I think I'll stick to French for now. Building it for Welsh would probably drive me nuts within days. :) Thankfully it doesn't have cases. I'm simply trying to put into practice what I've learned in my introductory linguistics courses and most of it pertains to French. Is Welsh your mother tongue / a language you've learned? After a little googling, I'm tempted to try and learn it at some point, along with Basque. Perhaps Breton as well if it's fairly close.



#7 2013-02-05 01:09:44

cfr
Member
From: Cymru
Registered: 2011-11-27
Posts: 7,131

Re: HOWTO Write a sentence-building program?

Second language (partly as child, partly as adult). It has mutations of three different kinds. For example:

cat: cath
a cat: cath
the cat: y gath
the black cat: y gath ddu
his cat: ei gath
her cat: ei chath
my cat: fy nghath

dog: ci
a dog: ci
the dog: y ci
the black dog: y ci du
his dog: ei gi
her dog: ei chi
my dog: fy nghi

in Penarth: ym Mhenarth
in Caerdydd: yng Nghaerdydd
in Gwent: yng Ngwent

So you can't just plug things into gaps because the first letters of words change in context.  It is interesting but a bit complex. (And something of a nightmare if you are trying to use a dictionary with no idea of mutations. Even worse if you don't know the alphabet differs from the English one!)
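For anyone curious what that would mean for a sentence builder, here is a small Python sketch of the three mutations as lookup tables on the initial letter. The tables come from standard descriptions of Welsh grammar rather than from anything posted in this thread, the names (`SOFT`, `NASAL`, `ASPIRATE`, `mutate`) are made up for illustration, and the sketch deliberately ignores the hard part, which is deciding when a mutation applies (trigger word, gender of the noun, and so on).

```python
# Initial-consonant changes for each mutation (assumed from standard
# grammar descriptions; an empty string means the consonant is dropped).
SOFT = {"ll": "l", "rh": "r", "c": "g", "p": "b", "t": "d",
        "g": "", "b": "f", "d": "dd", "m": "f"}
NASAL = {"c": "ngh", "p": "mh", "t": "nh", "g": "ng", "b": "m", "d": "n"}
ASPIRATE = {"c": "ch", "p": "ph", "t": "th"}

# Digraphs that count as single letters in the Welsh alphabet and must
# not be mistaken for a mutable 'c', 'd', 'p' or 't'.
NON_MUTATING_DIGRAPHS = ("ch", "dd", "ff", "ng", "ph", "th")


def mutate(word, table):
    """Apply one mutation table to the start of a word, if it matches."""
    lower = word.lower()
    if lower.startswith(NON_MUTATING_DIGRAPHS):
        return word
    # Try two-letter initials ('ll', 'rh') before single letters.
    for size in (2, 1):
        initial = lower[:size]
        if initial in table:
            result = table[initial] + word[size:]
            if word[0].isupper():
                # Keep place names capitalised: Penarth -> Mhenarth.
                result = result[:1].upper() + result[1:]
            return result
    return word


print("ei", mutate("cath", SOFT))      # his cat
print("ei", mutate("cath", ASPIRATE))  # her cat
print("fy", mutate("cath", NASAL))     # my cat
```

Even with the tables in place, the builder still has to know that 'ei' means two different things with two different mutations, which is where a flat fill-in-the-gaps approach breaks down.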


