defi is a five-year project that started in October 1995 and aims to provide a prototype of an online reading comprehension tool. As a research project, it is not end-user-oriented and is not bound by the constraints (most notably the elegance of the user interface) that would characterize an effort to produce a commercial tool within the same time frame.
It involves two languages, English and French, and both directions: each language serves as source and as target language. The French-to-English direction is likely to lag somewhat behind because fewer NLP analysis tools are available for French on our development platform (Windows).
defi is meant to act as a filter on a bilingual dictionary (a merge of the Oxford/Hachette (OH) and Robert/Collins (RC) English-French and French-English bilinguals) to provide the user with the most likely translation(s) of the item for which help was requested.
The tasks involved are the following:
The first of these tasks is essential to the quality of the tool. Although not all mwu's (multi-word units) are monosemic (far from it), the first piece of help to offer a user who has requested the translation of a word belonging to an mwu is that mwu and its translation(s). Restricting the range of translations is, again, a matter of matching the context against the constraints or preferences that the dictionary associates with the source item under a specific translation.
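The idea of restricting translations by matching the context against dictionary preferences can be illustrated with a minimal Python sketch. It is not defi's matcher (which is a Prolog program); the `Translation` class, the `context_cues` field and the toy entries for "bank" are assumptions made purely for illustration, and simple word overlap stands in for whatever constraints the dictionary actually records.

```python
from dataclasses import dataclass, field


@dataclass
class Translation:
    """One candidate translation with the cues that favour it (hypothetical format)."""
    target: str
    context_cues: set = field(default_factory=set)


def rank_translations(translations, context_words):
    """Order candidates by how many of their cues occur in the context.

    sorted() is stable, so candidates with equal overlap keep
    their original dictionary order.
    """
    context = set(context_words)
    return sorted(translations, key=lambda t: -len(t.context_cues & context))


# Toy entry, invented for this example only.
BANK = [
    Translation("banque", {"money", "account", "loan"}),
    Translation("rive", {"river", "water"}),
]

ranked = rank_translations(BANK, "she sat on the bank of the river".split())
print([t.target for t in ranked])
```

With no matching cue in the context, the candidates are simply returned in dictionary order, so the filter never hides a translation; it only reorders them.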
The recognition of mwu's should not depend in any way on which item within the mwu the user has selected. An electronic dictionary tool ought to be free of physical storage considerations and should not depend on a specific indexing scheme that forces the user to guess which word the mwu is most likely to be stored under (assuming the user is even able to recognize that the string of interest contains an mwu).
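One way to make recognition independent of the selected word is to index each mwu under every one of its component words. The Python sketch below illustrates that idea only; it is not defi's Prolog implementation, the entries in `MWU_ENTRIES` are invented, and it assumes exact word forms (no lemmatization or parsing) for brevity.

```python
from collections import defaultdict

# Hypothetical toy entries: mwu (as a tuple of words) -> translation(s).
MWU_ENTRIES = {
    ("kick", "the", "bucket"): ["casser sa pipe"],
    ("red", "tape"): ["paperasserie"],
}


def build_index(entries):
    """Map every component word to the set of mwu's that contain it."""
    index = defaultdict(set)
    for mwu in entries:
        for word in mwu:
            index[word].add(mwu)
    return index


def occurs_in_order(mwu, tokens):
    """True if the words of the mwu appear in the tokens in the same order."""
    remaining = iter(tokens)
    # 'word in remaining' consumes the iterator up to the match,
    # so each later word must be found after the previous one.
    return all(word in remaining for word in mwu)


def find_mwus(selected, tokens, index):
    """Return the mwu's that contain the selected word and occur in the tokens."""
    candidates = index.get(selected, ())
    return sorted(m for m in candidates if occurs_in_order(m, tokens))


index = build_index(MWU_ENTRIES)
tokens = "there is too much red tape here".split()
print(find_mwus("red", tokens, index))
print(find_mwus("tape", tokens, index))
```

Selecting either "red" or "tape" leads to the same mwu, so the user never has to guess which word the entry is stored under.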
These tasks are carried out by a dictionary/text matcher implemented as a Prolog program. This program has access to binary trees stored in Prolog idb's (internal databases). These binary trees are constructed on the basis of:
The tools used in the project are the following:
defi's strategies are based on the following observations:
CIDE = Paul Procter, Editor-in-Chief, Cambridge International Dictionary of English, CUP, 1995 (first ed.)
COBUILD = John Sinclair, Editor-in-Chief, Collins Cobuild English Dictionary, Collins, 1987 (first ed.)
LDOCE = Paul Procter, Editor-in-Chief, The Longman Dictionary of Contemporary English, 1979 (first ed.)
OH = M.H. Corréard and V. Grundy (eds): The Oxford-Hachette French Dictionary (Oxford: OUP 1994)
RC = Beryl T. Atkins et al. (eds): Collins-Robert French/English English/French Dictionary (4th edition, Glasgow: HarperCollins 1995).
WordNet = WordNet Prolog Package, downloadable from the Princeton University WWW site. See also Miller 1990.
The surface parser: ENGCG was developed at the General Linguistics Department of the University of Helsinki. It is marketed by Lingsoft Inc. (http://www.lingsoft.fi/).
Awk: MKS and Thompson implementations for Windows and associated documentation; see also Aho et al. 1988
Prolog: Arity implementation for Windows; Arity Corporation, Damonmill Square, Concord, Mass.
Aho, A.V., Kernighan, B.W. and Weinberger, P.J. 1988. The AWK Programming Language, Addison-Wesley, Reading, Mass.
Miller, G. A. (ed.) 1990. ‘WordNet: An On-Line Lexical Database’, International Journal of Lexicography, Volume 3, Number 4.
Montemagni, S., Federici, S. and Pirrelli, V. 1996. ‘Example-based Word Sense Disambiguation: a Paradigm-driven Approach’, Euralex’96 Proceedings, Göteborg University, 151-160.