Traduki

Traduki is an open source machine translation program, developed with the Lua programming language and released under the GNU General Public License. It is a tool being developed to give free speech and translation to everyone. Traduki means "to translate" in Esperanto.

Development was suspended in mid-2002 but restarted in 2003.

Machine translation is a complex task. The following are preliminary ideas.

Input

Input is the reading of the original English text. This can come from a simple console, GUI, or web interface, but it can also come from more complicated sources such as OCR, handwriting recognition or speech recognition.

Tokenization

Tokenization is the division of the text into sentences and of sentences into words and punctuation. The division of the text into sentences can be done using "!", "?" and "." as separators. But "." is also used in numbers (e.g. 10.233), abbreviations (e.g. Dr.) and initials (e.g. A. C. Doyle), so not every "." ends a sentence. The punctuation marks ",", ";", ":", quotation marks, "()" and "[]" can also be used to separate semi-independent parts of a sentence.

The article "What is a word, What is a sentence? Problems of Tokenization" is a good discussion of tokenization problems.
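
The sentence-splitting rules described in this section can be sketched in Python as follows. This is only an illustration, not Traduki's code: the function names, the regular expressions and the very short abbreviation list are assumptions made for the example.

```python
import re

# Hypothetical abbreviation list; a usable system would need a much larger one.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "etc."}


def split_sentences(text):
    """Split text on '.', '!' and '?', skipping periods that do not end a sentence."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        if match.group() == ".":
            words = text[start:end].split()
            last_word = words[-1] if words else ""
            if text[end:end + 1].isdigit():
                continue  # period inside a number, e.g. 10.233
            if last_word.lower() in ABBREVIATIONS:
                continue  # abbreviation, e.g. Dr.
            if re.fullmatch(r"[A-Z]\.", last_word):
                continue  # initial, e.g. A. C. Doyle
        sentence = text[start:end].strip()
        if sentence:
            sentences.append(sentence)
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


def split_words(sentence):
    """Split a sentence into numbers, words and single punctuation marks."""
    return re.findall(r"\d+(?:\.\d+)*|\w+(?:'\w+)?|[^\w\s]", sentence)


if __name__ == "__main__":
    for s in split_sentences("Dr. Watson met A. C. Doyle. He paid 10.233 pounds!"):
        print(split_words(s))
```

The sketch still fails on sentences that really do end with an initial or an abbreviation, which is one of the problems the article above discusses.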

Morphological analysis

Each word must be analyzed to identify derived words. Dictionaries used in machine translation usually list only root words, not the words derived from them, so derived words must be identified by the program itself. Verbal forms and plurals are the most common derived words.
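
A minimal sketch of this idea, assuming a small root-word dictionary and a handful of English suffix rules. The dictionary contents, the rule table and the feature labels are hypothetical, and irregular forms such as "went" or "mice" are not handled.

```python
# Hypothetical root-word dictionary; a real one could be built from a source such as WordNet.
ROOT_WORDS = {"cat", "box", "city", "walk", "stop", "make", "translate"}

# (suffix, replacement, feature) rules. Illustrative only, far from complete.
SUFFIX_RULES = [
    ("ies", "y", "plural noun or third-person verb"),
    ("es", "", "plural noun or third-person verb"),
    ("s", "", "plural noun or third-person verb"),
    ("ing", "", "present participle"),
    ("ing", "e", "present participle"),
    ("ed", "", "past tense or past participle"),
    ("ed", "e", "past tense or past participle"),
]


def analyze(word):
    """Return every (root, feature) analysis whose root is in the dictionary."""
    word = word.lower()
    analyses = []
    if word in ROOT_WORDS:
        analyses.append((word, "base form"))
    for suffix, replacement, feature in SUFFIX_RULES:
        if not word.endswith(suffix) or len(word) <= len(suffix):
            continue
        stem = word[: -len(suffix)]
        candidates = [stem + replacement]
        # Undo consonant doubling: "stopped" -> "stopp" -> "stop".
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1] + replacement)
        for candidate in candidates:
            if candidate in ROOT_WORDS and (candidate, feature) not in analyses:
                analyses.append((candidate, feature))
    return analyses


if __name__ == "__main__":
    for w in ["cities", "stopped", "making", "walks"]:
        print(w, analyze(w))
```

The function returns a list because a form can be ambiguous: "walks" may be a plural noun or a verb form, and that ambiguity is left for the later disambiguation steps.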

The Natural Language Toolkit project [1] has some Python code that could be reused in Traduki. However, the Natural Language Toolkit is released under the IBM Common Public License 0.5, so it is an open question whether its code can be used in Traduki.

Syntactical analysis

Syntactical analysis is the determination of the syntactic function of the words. The program should discover if a word is a "verb" or a "noun". A dictionary with the syntactic classification of all root words must be used. WordNet [1] is a good source of data to build a good English dictionary.

Disambiguation

A word can have more than one syntactic function. For example, "fat" can be an adjective ("The fat boy eats hamburgers") and can be a noun ("Hamburgers have lots of fat"). So, how do we know that "fat" in the sentence "Hamburgers have lots of fat" is a noun? There are two methods:

Statistical methods use large annotated corpora. An annotated corpus could tell us, for example, that "lots of" is typically followed by a noun. Traduki should not use this method because all useful annotated corpora are proprietary.
Constraint Grammar methods use grammar rules to exclude invalid combinations of syntactic functions. For example, "the" is never followed by a verb. More than 1000 such rules can be used to disambiguate a sentence.
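
The Constraint Grammar approach can be illustrated with a toy example. The sketch below is an assumption-laden simplification: the lexicon, the three rules and the rule format are invented for this example and are not the CG-2 formalism or any real rule set. Each rule removes one candidate category from a word when the neighbouring words are already unambiguous.

```python
# Hypothetical lexicon: every word maps to all syntactic categories it can take.
LEXICON = {
    "the": {"determiner"},
    "fat": {"adjective", "noun"},
    "boy": {"noun"},
    "eats": {"verb"},
    "hamburgers": {"noun"},
    "have": {"verb"},
    "lots": {"noun"},
    "of": {"preposition"},
}

BOUNDARY = "boundary"  # pseudo-category marking the start and end of a sentence

# Toy rules: (category to remove, [(offset, category required at that offset), ...]).
RULES = [
    ("verb", [(-1, "determiner")]),                      # "the" is never followed by a verb
    ("adjective", [(-1, "preposition"), (1, BOUNDARY)]),
    ("noun", [(-1, "determiner"), (1, "noun")]),
]


def disambiguate(words):
    """Return one set of remaining categories per word after applying RULES."""
    readings = [{BOUNDARY}] + [set(LEXICON[w.lower()]) for w in words] + [{BOUNDARY}]
    changed = True
    while changed:  # repeat until no rule removes anything
        changed = False
        for i in range(1, len(readings) - 1):
            for category, context in RULES:
                if category not in readings[i] or len(readings[i]) == 1:
                    continue  # never remove the last remaining reading
                if all(readings[i + off] == {req} for off, req in context):
                    readings[i].discard(category)
                    changed = True
    return readings[1:-1]


if __name__ == "__main__":
    print(disambiguate("The fat boy eats hamburgers".split()))
    print(disambiguate("Hamburgers have lots of fat".split()))
```

With these rules "fat" is left as an adjective in the first sentence and as a noun in the second. Words that still carry more than one category after all rules have run are passed on to semantic disambiguation.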

Semantic Disambiguation

Sometimes, some ambiguity may remain after the application of the methods described above. Semantic information may be used to solve the problem. That is why a good dictionary must have some semantic information. For example, words related to music should be marked as such.
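
One way such semantic marks might be used is sketched below: each sense of an ambiguous word carries a domain label, and the sense whose domain is shared by the most context words is chosen. The word "bass", the domain labels and both tables are hypothetical examples, not data from Traduki.

```python
from collections import Counter

# Hypothetical semantic marks attached to dictionary words.
WORD_DOMAINS = {
    "guitar": {"music"},
    "band": {"music"},
    "concert": {"music"},
    "river": {"fishing"},
    "caught": {"fishing"},
    "boat": {"fishing"},
}

# Hypothetical ambiguous word: one sense per semantic domain.
SENSES = {
    "bass": {"music": "bass (instrument)", "fishing": "bass (fish)"},
}


def choose_sense(word, context_words):
    """Pick the sense whose domain is shared by the most context words."""
    senses = SENSES[word]
    votes = Counter()
    for context_word in context_words:
        for domain in WORD_DOMAINS.get(context_word.lower(), ()):
            if domain in senses:
                votes[domain] += 1
    if not votes:
        return None  # the context gives no semantic evidence
    best_domain, _ = votes.most_common(1)[0]
    return senses[best_domain]


if __name__ == "__main__":
    print(choose_sense("bass", "He played the bass in a band".split()))
    print(choose_sense("bass", "She caught a bass in the river".split()))
```

When no context word carries a relevant mark the function returns None, which leaves the ambiguity unresolved instead of guessing.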

Translation to an interlanguage

All the syntactic, morphological and semantic information should be encoded in an interlanguage. All source-language root words should be translated to interlanguage root words. Esperanto is often used as an intermediate language (including in Traduki) because nearly all Esperanto words (about 99%) have only one sense and because Esperanto is already something of an interlanguage.

Ergane, a free-to-use multilingual dictionary that uses Esperanto as an interlanguage, could be useful for Traduki.
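
As a rough illustration of what an interlanguage representation could look like, the sketch below pairs each Esperanto root word with the syntactic category and morphological features found in the earlier steps. The token structure, the field names and the tiny English-to-Esperanto table are assumptions made for this example and do not describe Traduki's actual format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterlanguageToken:
    root: str        # Esperanto root word
    category: str    # syntactic category from the analysis steps
    features: tuple  # morphological features, e.g. ("plural",)


# Hypothetical bilingual dictionary keyed by (English root, category),
# so that "fat" the adjective and "fat" the noun get different Esperanto words.
EN_TO_EO = {
    ("the", "determiner"): "la",
    ("fat", "adjective"): "dika",
    ("fat", "noun"): "graso",
    ("boy", "noun"): "knabo",
    ("eat", "verb"): "manĝi",
    ("hamburger", "noun"): "hamburgero",
}


def to_interlanguage(analyzed_words):
    """Convert (english_root, category, features) triples into interlanguage tokens."""
    return [
        InterlanguageToken(EN_TO_EO[(root, category)], category, tuple(features))
        for root, category, features in analyzed_words
    ]


if __name__ == "__main__":
    sentence = [
        ("the", "determiner", ()),
        ("fat", "adjective", ()),
        ("boy", "noun", ("singular",)),
        ("eat", "verb", ("present", "third person")),
        ("hamburger", "noun", ("plural",)),
    ]
    for token in to_interlanguage(sentence):
        print(token)
```

Keying the dictionary on both the root and its category is what lets the result of disambiguation select the right Esperanto word.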

Destination language synthesis

Synthesis of the destination language from the interlanguage is a relatively easy step. There are, however, some problems:

there is a need for a verb conjugator
there is a need for a plural generator
translation from Esperanto to the destination language can be ambiguous because an Esperanto word may correspond to more than one word in the destination language. Semantic information from the source text can be used to disambiguate.
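
The three problems listed above can be sketched as three small functions. The article does not name a destination language, so Spanish is used here purely as an assumption for illustration. The dictionary entries and function names are hypothetical, and only regular plurals and regular present-tense verbs are covered.

```python
# Hypothetical Esperanto-to-Spanish dictionary. Each entry lists candidates as
# (semantic domain, word); a domain of None marks the default choice.
EO_TO_ES = {
    "knabo": [(None, "niño")],
    "hamburgero": [(None, "hamburguesa")],
    "manĝi": [(None, "comer")],
    "ludi": [("music", "tocar"), (None, "jugar")],
}


def choose_word(esperanto_word, domains=frozenset()):
    """Pick a destination word, preferring one whose domain occurs in the source text."""
    candidates = EO_TO_ES[esperanto_word]
    for domain, word in candidates:
        if domain in domains:
            return word
    for domain, word in candidates:
        if domain is None:
            return word
    return candidates[0][1]


def pluralize(noun):
    """Regular plural only: -s after a vowel, -es after a consonant."""
    return noun + ("s" if noun[-1] in "aeiou" else "es")


def conjugate_present_3sg(infinitive):
    """Third person singular present of a regular -ar, -er or -ir verb."""
    stem, ending = infinitive[:-2], infinitive[-2:]
    return stem + {"ar": "a", "er": "e", "ir": "e"}[ending]


if __name__ == "__main__":
    print(choose_word("ludi", {"music"}))                  # instrument sense
    print(pluralize(choose_word("hamburgero")))
    print(conjugate_present_3sg(choose_word("manĝi")))
```

Irregular and stem-changing forms (for example, the present tense of "jugar") would need an exception table on top of these rules, which is why a full conjugator and plural generator are listed as open problems.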

External links and references

Useful resources for the Traduki project

Traduki page on SourceForge
Pytalk: English parser and spellchecker
- http://sourceforge.net/projects/pytalk
- http://starship.python.net/crew/hennesy/index.html
WordNet - A Lexical Database for English
- official page: http://www.cogsci.princeton.edu/~wn/
- A Python interface to the WordNet lexical database: http://www.cs.brandeis.edu/~steele/sources/python.html
GPLtrans
- http://sourceforge.net/projects/gpltrans/
Natural Language Toolkit
- http://nltk.sourceforge.net/
vortaro
- http://sourceforge.net/projects/vortaro
linguaphile
- http://linguaphile.sourceforge.net/
The VISL Constraint Grammar Compiler is a natural language parser generator. It is an implementation of Pasi Tapanainen's CG-2 constraint grammar formalism.
- http://sourceforge.net/projects/vislcg/
The VISL Phrase Structure Grammar Compiler is an implementation of a parser generator for ambiguous context-free grammars, ambiguous input, and ambiguous output.
- http://sourceforge.net/projects/vislpsg/

Online articles

A PhD thesis: "The present project has as its goal to incorporate a semantic component into an English Constraint Grammar parser so as to augment parser's performance."
- http://www.hum.sdu.dk/institut/isk/kasch/PhD-thesis.htm
Should I use machine translation?
- http://www.ttt.org/theory/mt4me/index.html
Why Can't a Computer Translate More Like a Person?
- http://www.ttt.org/theory/barker.html
"Types of Semantic Information Necessary in a Machine Translation Lexicon"
- http://talana.linguist.jussieu.fr/taln99/ps/A77/A77.pdf (PDF file)

Books

Constraint Grammar : A Language-Independent System for Parsing Unrestricted Text (Natural Language Processing, No 4) ISBN 3110141795

books
- http://www.astuk.com/products-by-keywords/natural-language.html