New Developments in the defi Matcher

Archibald Michiels: Department of English, University of Liège (3, Place Cockerill, B-4000 Liège, Belgium)  (amichiels@ulg.ac.be)

 

Abstract

 

This paper focuses on the treatment of an information field found in bilingual dictionaries such as the Robert-Collins (rc) and Oxford-Hachette (oh) and meant to guide the human reader towards the selection of the appropriate translation in context. The challenge is to make use of the information contained therein in defi, a computer tool aiming at ranking translations according to their degree of relevance to the context in which the item to be translated occurs.

In addition to the two bilingual dictionaries, the process calls on a number of lexical resources, namely WordNet, Roget’s Thesaurus, three monolingual dictionaries (ldoce, cobuild and cide) and a database of metalinguistic information derived from the bilingual dictionaries.

1. Introduction to defi

defi is a suite of computer programs building up a prototype tool designed to help the reader of an English text select the right French translation of any item in context, be it a multi-word unit (henceforth: mwu) or a single-word lexeme. defi does not itself select the most appropriate translation; what it does is to rank the translations provided by its bilingual dictionary (a merge of the Robert-Collins and Oxford-Hachette English-to-French dictionaries) according to how well they match the context, and present them to the user in order of decreasing relevance. The core of defi is a matcher calling on various linguistic resources, readily assignable to two classes: published dictionaries and thesauri on the one hand (WordNet, Roget's Thesaurus, ldoce, cobuild and cide), and on the other the databases derived from the bilingual dictionaries by the defi team.

The defi matcher is a Prolog program making use of a rather large internal database (matcher.idb: about 600 million bytes) that houses all the lexical information just described. Prolog was selected because it offers non-determinism through its backtracking policies and is therefore well suited to problems with multiple solutions (the item-translation pairs to be ranked can best be viewed as solutions to the matching problem). Besides, Prolog is built around the concept of unification, which makes a very good basis for any matching procedure. The internal database provides fast access to the dictionaries and thesauri making up the lexical resources the matcher calls on.
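
By way of illustration, the skeleton of such a generate-and-rank loop might be sketched as follows. This is not the actual defi code: match_weight/3 stands in for the whole battery of matching procedures described in this paper, and the sdic format is the one presented in section 3 below.

% Sketch of the ranking loop (not the actual defi code); match_weight/3
% stands in for the matching procedures described in the body of the paper.
rank_translations(Lemma, Chunk, Ranked) :-
    findall(Weight-tr(Id, Origin, Trans),
            ( sdic(Id, Lemma, _Pos, _Sc, _Oc, ind(Inds), _Lab, _Env,
                   _Sst, _Head, _Sf, _St, _Xr, _Gt, tr(Trans),
                   _Rat, _Gl, Origin),               % one candidate pair
              match_weight(Inds, Chunk, Weight) ),   % its weight in context
            Pairs),                                  % all pairs, via backtracking
    keysort(Pairs, Ascending),                       % sort on the weight key
    reverse(Ascending, Ranked).                      % highest weight first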


defi is described in somewhat more detail on the defi web page (http://engdep1.philo.ulg.ac.be/michiels/efdefi.htm), where results and performance data on a test suite can also be found. It should be emphasized that defi is meant to work on totally free input, and is not in any way restricted to a selected corpus or text type. This concern explains the choice of the Lingsoft engcg parser, which is extremely robust, and the harnessing of full-size dictionaries and thesauri rather than purpose-built or corpus-bound lexicons.

 

2. The experiment carried out in 1998

In 1998 defi was submitted to its first real-life test. First-year and final-year university students in English were confronted with lexically fairly demanding texts and asked to underline the words they would have looked up in the dictionary. The textual chunks containing the underlined words were then submitted to defi, with each underlined word playing the part of the clicked-on word, i.e. the word whose translations defi is supposed to rank according to their relevance to the context the item appears in. The text selected for the final-year students is a passage from Angus Wilson's masterpiece, Hemlock and After (1952, Penguin 1956), near the beginning of the first chapter (pp. 9-10). It includes the following excerpt, where in the last sentence quite a few students underlined the word bogey:

 

All the same, the Treasury letter was a pleasant reminder of the esteem in which he was held. To meet authority at its most impersonal level had been a new experience for him, and he had felt a certain interested speculation in how far his deep convictions would carry him against the world of Kafka’s ‘they’. And now that little bogy had been exorcized with the rest!

The results provided by defi in 1998 were irritatingly disappointing:

 And now that little bogy had been exorcised with the rest! 

Processing_time(0,0,0,28)

22 - 47255, efm, bogey = crotte {coll} {f} de nez
22 - 47256, efm, bogey = bogey {m}, bogée {m}
22 - 47257, efm, bogey = épouvantail {m}, démon {m}

20 - 47258, ohef, bogey = croquemitaine {m}
20 - 47259, ohef, bogey = spectre {m}
20 - 47260, rcef, bogey = bête noire

Each line opens with the weight assigned to the selected translation, followed by the identification number of the entry in the bilingual dictionary, the origin of the pair (efm = merged entry, ohef = Oxford-Hachette English-to-French (oh), rcef = Robert-Collins English-to-French (rc)), the source item, the equal sign, and the selected translation. The processing time displays the following format: hours, minutes, seconds, hundredths of a second. The bogy-sentence takes 28/100 sec. (user time).

Disappointing the results certainly are. In spite of the presence of the highly disambiguating exorcized, the three top-ranked translations included crotte de nez and bogée, which are both totally wrong, whereas croquemitaine, arguably the best translation, was left out in the cold. This paper will concentrate on the enhancements to the matcher that were necessary to obtain the results now yielded by defi:

And now that little bogey had been exorcized with the rest! 

Processing_time(0,0,0,33)

35 - 47258, ohef, bogey, croquemitaine {m}, [pos(20),vb(0),i2(spirit,demon,rg(0),mt(2),wn(0)),i2(spirit,memory,rg(0),mt(2),wn(0)),i2(spirit,3,dfexcb), evil,spirit,man,evil,spirit]
26 - 47257, efm, bogey, épouvantail {m}, démon {m},
 [pos(20),vb(0),evil,spirit]
22 - 47255, efm, bogey, crotte {coll} {f} de nez,
 [pos(20),vb(0),i2(nose,animal,rg(0),mt(2),wn(0))]
20 - 47256, efm, bogey, bogey {m}, bogée {m},
 [pos(20),vb(0)]
20 - 47259, ohef, bogey, spectre {m},
 [pos(20),vb(0)]
20 - 47260, rcef, bogey, bête noire,
 [pos(20),vb(0)]

The results as presented above include debugging information (the bits inside the square brackets), i.e. they show the grounds for ranking. The nature of the debugging information will be explained below. Suffice it here to note that croquemitaine is now an easy winner, and that démon comes in second position. Crotte de nez still rears its ugly nose, although much less perceptibly. We shall explain why in the body of the paper.

 

3. The treatment of the Indicator field

 

We start by looking at what our merged bilingual dictionary has to offer on the clicked-on word, bogey. defidic is the human-readable version of the bilingual dictionary. Its various fields should be self-explanatory. Lemma and Headword are identical here because we are dealing with a single-word lexeme. The Origin field has already been explained.

Fig. 1: BOGEY in defidic

IDNUM=     47255
HEADWORD=  bogey
LEMMA=     bogey
LEMMATYPE= standard
POS=       n
S/STYLE=   sl Br
INDICATOR= in nose
TRANS=     crotte {coll} {f} de nez
ORIGIN=    efm

IDNUM=     47256
HEADWORD=  bogey
LEMMA=     bogey
LEMMATYPE= standard
POS=       n
LABELS=    golf
INDICATOR= in golf
TRANS=     bogey {m}, bogée {m}
ORIGIN=    efm

IDNUM=     47257
HEADWORD=  bogey
LEMMA=     bogey
LEMMATYPE= standard
POS=       n
INDICATOR= to frighten people + frightening
TRANS=     épouvantail {m}, démon {m}
ORIGIN=    efm

IDNUM=     47258
HEADWORD=  bogey
LEMMA=     bogey
LEMMATYPE= standard
POS=       n
INDICATOR= evil spirit
TRANS=     croquemitaine {m}
ORIGIN=    ohef

IDNUM=     47259
HEADWORD=  bogey
LEMMA=     bogey
LEMMATYPE= standard
POS=       n
INDICATOR= imagined fear
TRANS=     spectre {m}
ORIGIN=    ohef

IDNUM=     47260
HEADWORD=  bogey
LEMMA=     bogey
LEMMATYPE= standard
POS=       n
INDICATOR= bugbear
TRANS=     bête noire
ORIGIN=    rcef

Cross-references

IDNUM=     47268
HEADWORD=  bogie
LEMMA=     bogie
LEMMATYPE= standard
POS=       n
GOTHERE=   bogey^2, bogey^3
TRANS=    
XREF=      bogey
ORIGIN=    efm

IDNUM=     47277
HEADWORD=  bogy
LEMMA=     bogy
LEMMATYPE= standard
POS=       n
GOTHERE=   bogey
TRANS=    
ORIGIN=    rcef

 

The disambiguating information is mainly to be found in the Indicator field (which was not made use of in the 1998 version of defi), a ragbag of metalinguistic information that has not found its place in the other fields, which are choosier about the type of information they are supposed to house. We have a synonym (bugbear) and near-synonyms (spirit, fear), as well as prepositional phrases containing label information (in golf) and other information (note how different the two in's are in in nose and in golf), in addition to a to-phrase alongside a near-synonymic adjective (to frighten people + frightening). The same information is to be found in the Prolog-format entries of sdic, the dictionary for single-word entries. sdic is one of the lexical resources built into matcher.idb, the internal Prolog database available to the defi matcher.

BOGEY in sdic

Pattern: sdic(Identification Number, Lemma, Part of Speech, Subject Collocates, Object Collocates, Indicator, Labels, Environment, Register, Head, Semantic Features, Countable/Uncountable Status, Cross References, GoThere, Translation, Translation Frequency, Gloss, Origin)

sdic(47255,'bogey','n',sc([nil]),oc([nil]),ind(['in nose']),lab([nil]),env([nil]),sst(['sl','Br']),head([nil]),sf([nil]),st(nil),xr([nil]),gt(nil),tr($crotte {coll} {f} de nez$),rat(2,9),gl(nil),efm).

sdic(47256,'bogey','n',sc([nil]),oc([nil]),ind(['in golf']),lab(['golf']),env([nil]),sst([nil]),head([nil]),sf([nil]),st(nil),xr([nil]),gt(nil),tr($bogey {m}, bogée {m}$),rat(2,9),gl(nil),efm).

sdic(47257,'bogey','n',sc([nil]),oc([nil]),ind(['to frighten people','frightening']),lab([nil]),env([nil]),sst([nil]),head([nil]),sf([nil]),st(nil),xr([nil]),gt(nil),tr($épouvantail {m}, démon {m}$),rat(2,9),gl(nil),efm).

sdic(47258,'bogey','n',sc([nil]),oc([nil]),ind(['evil spirit']),lab([nil]),env([nil]),sst([nil]),head([nil]),sf([nil]),st(nil),xr([nil]),gt(nil),tr($croquemitaine {m}$),rat(1,9),gl(nil),ohef).

sdic(47259,'bogey','n',sc([nil]),oc([nil]),ind(['imagined fear']),lab([nil]),env([nil]),sst([nil]),head([nil]),sf([nil]),st(nil),xr([nil]),gt(nil),tr($spectre {m}$),rat(1,9),gl(nil),ohef).

sdic(47260,'bogey','n',sc([nil]),oc([nil]),ind(['bugbear']),lab([nil]),env([nil]),sst([nil]),head([nil]),sf([nil]),st(nil),xr([nil]),gt(nil),tr($bête noire$),rat(1,9),gl(nil),rcef).
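
By way of example, a straightforward Prolog query against these facts (the query and variable names are ours) lets backtracking enumerate the six candidate readings with their indicators, translations and origins:

?- sdic(Id, bogey, n, _, _, ind(Inds), _, _, _, _, _, _, _, _,
        tr(Trans), _, _, Origin).

Id = 47255, Inds = ['in nose'], Trans = $crotte {coll} {f} de nez$, Origin = efm ;
Id = 47256, Inds = ['in golf'], Trans = $bogey {m}, bogée {m}$, Origin = efm ;
etc.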

Since exorcized is the item that offers the greatest potential for the disambiguation of bogey in our sentence, it is worth having a look at what defidic (and consequently sdic) has to offer on exorcize:

Fig. 2: EXORCIZE in defidic

From Oxford-Hachette :
IDNUM=     92101
HEADWORD=  exorcize
LEMMA=     exorcize
LEMMATYPE= standard
POS=       vtr
POSTCOLL=  demon, memory, past
TRANS=     exorciser
ORIGIN=    ohef

From Robert-Collins :
IDNUM=     92098
HEADWORD=  exorcise
LEMMA=     exorcise
LEMMATYPE= standard
POS=       vtr
TRANS=     exorciser
ORIGIN=    rcef

sdic(92098,'exorcise','vtr',sc([nil]),oc([nil]),ind([nil]),lab([nil]),env([nil]),sst([nil]),head([nil]),sf([nil]),st(nil),xr([nil]),gt(nil),tr($exorciser$),rat(1,1),gl(nil),rcef).

sdic(92101,'exorcize','vtr',sc([nil]),oc(['demon','memory','past']),ind([nil]),lab([nil]),env([nil]),sst([nil]),head([nil]),sf([nil]),st(nil),xr([nil]),gt(nil),tr($exorciser$),rat(1,1),gl(nil),ohef).

Note that the Robert-Collins provides no disambiguating information at all, a policy that cannot be faulted here, in so far as exorcize is provided with one translation only. Exorcise is to be translated as exorciser, everywhere, point à la ligne, period. The postcoll list in the Oxford-Hachette is what I’d like to call reassurance information, something along the lines of: “yes, we have thought of the whole range of objects that exorcize can take, and, yes, everywhere the translation exorciser is ok.” Not only for demons, but also for memory, past, and the like (every collocate list should be understood to include such an extension, hence the use of WordNet, Roget’s and coocdb in the matching of collocate against text).

 

Besides, we should remember that the clicked-on word is bogey, not exorcized. The first thing to ensure is that the matcher has access to the verb-object relation linking exorcized and bogey. As stated above, the defi matcher makes use of the parses provided by the engcg parser. Below is the parse for our sentence: 

Fig. 3: engcg parse for And now that little bogey had been exorcized with the rest!

"<*and>"
               "and" <*> CC @CC
"<now>"
               "now" ADV  @ADVL
"<that>"
               "that" PRON DEM SG  @SUBJ @OBJ
               "that" DET CENTRAL DEM SG @DN>
               "that" <**CLB> CS @CS
"<little>"
               "little" <NonMod> <Quant> PRON ABS SG  @OBJ
               "little" <Quant> DET POST ABS SG @QN>
               "little" A ABS  @AN>
"<bogey>"
               "bogey" N NOM SG  @SUBJ
"<had>"
               "have" <SVO> <SVOC/A> V PAST VFIN  @+FAUXV
"<been>"
               "be" <SVC/N> <SVC/A> PCP2  @-FAUXV
"<exorcized>"
               "exorcize" <SVO> PCP2  @-FMAINV
"<with>"
               "with" PREP  @ADVL
"<the>"
               "the" <Def> DET CENTRAL ART SG/PL @DN>
"<rest>"
               "rest" N NOM SG  @<P
"<$\!>"

We note that the parser tends to err on the side of caution: three parses are provided for both that and little. That is parsed as a demonstrative pronoun, a determiner and a subordinator; little is given as a pronoun quantifier, liable to play the part of object, a determiner quantifier and an adjective. In context, only the second parse for that and the third for little are appropriate. We note too that bogey is parsed as a subject, not an object: the engcg parser is a surface parser, and it is not its job to retrieve deep relations.

 

Tagtxt, the engcg-enhancer written in awk and devised by the defi team[1], does what it can to retrieve such relations, so that it does indicate the verb-object relationship we are after, namely exorcize-bogey. But it also gives little as object of exorcize, because it does not go so far as to attempt to correct what the parser yields, only to enhance the results. Tagtxt works out the passivity feature of the clause and is able to assign the deep object role to the surface subject of a passive clause. It produces a list of w(ord)-structures, one devoted to each word. In each we find information about the textual variant (text), its lemmatisation (lem), its morph(ological) and syn(tactic) features, with a weight assigned to each feature, ready to be reaped by the matcher in its assessment of the quality of the match between the textual chunk and a candidate mwu.
Below the w-list is to be found an np-list, referring to textual positions in the w-structure, a list of computed syntactic relations (among which the one we are interested in), the polarity and passivity flags and the structural hypothesis for the whole chunk, namely that it spans a whole clause (s). 

Fig. 4: Enhanced engcg parse:

Pattern: txt(Word-structure List, Punctuation, NP List, Syntactic Relations List, Polarity Flag, Voice Flag, Structural Hypothesis)

Word-structure Pattern: Starting Position in String, End Position in String, text(Textual Variant, Typographical Case), lem(Lemma, Typographical Case), morph(List of Morphological Features), syn(List of Syntactic Features)

txt([
w(0,1,text('and',u),lem('and',u),morph([m(pos,cconj,2)]),syn([s(type,cc,0,_)])),
w(1,2,text('now',l),lem('now',l),morph([m(pos,adv,5)]),syn([s(func,adv,5,_)])),
w(2,3,text('that',l),lem('that',l),morph([m(pos,pron,2),m(type,dem,2),m(num,sg,2)]),syn([s(func,subj,5,_),s(func,obj,5,_)])),
w(2,3,text('that',l),lem('that',l),morph([m(pos,det,2),m(type,central,0),m(type,dem,2),m(num,sg,2)]),syn([s(type,dn,0,r)])),
w(2,3,text('that',l),lem('that',l),morph([m(type,clb,2),m(pos,sconj,2)]),syn([s(type,cs,0,_)])),
w(3,4,text('little',l),lem('little',l),morph([m(type,nonmod,0),m(type,quant,2),m(pos,pron,2),m(degree,abs,0),m(num,sg,2)]),syn([s(func,obj,5,_)])),
w(3,4,text('little',l),lem('little',l),morph([m(type,quant,2),m(pos,det,2),m(type,post,0),m(degree,abs,0),m(num,sg,2)]),syn([s(type,qn,3,r)])),
w(3,4,text('little',l),lem('little',l),morph([m(pos,adj,5),m(degree,abs,0)]),syn([s(type,an,3,r)])),
w(4,5,text('bogey',l),lem('bogey',l),morph([m(pos,n,5),m(case,nom,0),m(num,sg,2)]),syn([s(func,subj,5,_)])),
w(5,6,text('had',l),lem('have',l),morph([m(pos,v,5),m(tense,past,1),m(type,finite,2)]),syn([s(type,aux,3,f)])),
w(6,7,text('been',l),lem('been',l),morph([m(pos,edform,2)]),syn([s(type,aux,3,nf)])),
w(6,7,text('been',l),lem('be',l),morph([m(pos,edform,2)]),syn([s(type,aux,3,nf)])),
w(7,8,text('exorcized',l),lem('exorcize',l),morph([m(pos,edform,2)]),syn([s(type,main,3,nf)])),
w(8,9,text('with',l),lem('with',l),morph([m(pos,prep,2)]),syn([s(func,adv,5,_)])),
w(9,10,text('the',l),lem('the',l),morph([m(type,def,3),m(pos,det,2),m(type,central,0),m(type,art,3),m(num,sg_or_pl,1)]),syn([s(type,dn,0,r)])),
w(10,11,text('rest',l),lem('rest',l),morph([m(pos,n,5),m(case,nom,0),m(num,sg,2)]),syn([s(type,p,3,l)])),
punct(11,12,exmark)],         /* final punctuation */
[np(2,3,c(2,3)),np(3,4,c(3,4)),np(3,5,c(4,5)),np(9,11,c(10,11))],  /* NP list (the c-structure identifies the head, e.g. 10-11 (rest) is the head of 9-11 (the rest)) */
[cadj('little','bogey'),cdobj('bogey','exorcize'),cdobj('little','exorcize'),cprep(8,'with','rest'),cdobj('rest','exorcize')],  /* list of recoverable syntactic relations, e.g. bogey is the direct object of exorcize */
neg(0),     /* polarity flag */
passive(1), /* passivity flag */
s). /* structural hypothesis – the chunk functions syntactically as a whole clause */
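
Tagtxt itself is an awk script; by way of illustration only, the rule it applies to passives can be restated in Prolog over the w-structures above (the predicate name is ours, and the filtering of competing readings is ignored):

% Sketch: in a chunk flagged passive(1), a lemma carrying the subj
% function is recoverable as deep object of the lemma carrying the
% main-verb type.
recover_deep_object(Words, passive(1), cdobj(Subj, Verb)) :-
    member(w(_, _, _, lem(Subj, _), _, syn(SynS)), Words),
    member(s(func, subj, _, _), SynS),              % e.g. bogey
    member(w(_, _, _, lem(Verb, _), _, syn(SynV)), Words),
    member(s(type, main, _, _), SynV).              % e.g. exorcize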

So we do have the link we were looking for. What can we do with it? We can first attempt to match exorcize and bogey, that is to say check whether the (semantic/lexical) distance between the two is not too great for a link to be established. We can do that by calling on procedures making use of coocdb, WordNet and Roget’s Thesaurus. We can specify a maximum distance that can be covered in the WordNet hierarchy, we can specify the delicacy level of Roget’s category sharing, and we can check if the two words occur together, and how frequently they do so if they do, in coocdb. But what would be the point? None at all.
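
The three tests just mentioned might be pictured as follows; wn_distance/3, roget_share/3 and cooc/3, as well as the thresholds, are assumptions standing in for the actual access predicates to WordNet, Roget’s and coocdb:

% Sketch of the connectedness tests (assumed predicate names and thresholds).
connected(A, B, wn(D)) :-          % close enough in the WordNet hierarchy
    wn_distance(A, B, D),
    D =< 4.                        % assumed maximum distance
connected(A, B, rg(Level)) :-      % shared Roget category
    roget_share(A, B, Level).      % Level records the delicacy of the node
connected(A, B, cooc(F)) :-        % attested co-occurrence in coocdb
    cooc(A, B, F),
    F >= 1.                        % assumed frequency threshold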

 

Indeed, we already know that there is a link between exorcize and bogey – what we do not know is with which reading of bogey the link can be established. The only way we can come to know that is to try to match, not the item bogey itself, but its various indicators.

 

The defi team started by collecting all the indicators occurring in the merged bilingual dictionary, computing their frequencies and examining their structure. Some of the indicators are very much like labels, and were rewritten and copied to the label field (e.g. in court --> jur, in geometry --> geometry, in golf --> golf). Others were so general that it was thought appropriate to turn them into semantic features and house them in a new defidic field created for the purpose (e.g. person --> hum, process --> proc, act --> proc, action --> proc). Some very frequent indicators have no discriminatory power at all, at least not the type of power that can be harnessed by a computer program (a case in point is the very frequent ‘all contexts’ indicator). The indicators that were preserved had to lose whatever structure they had, so that in nose was turned into nose, in the same way as in golf was turned into golf. The reason for reducing all indicators to a single item is to enable the matching procedure described above (calling on WordNet, Roget’s and coocdb) to be applied to the pair textual item / indicator, so as to measure the lexical/semantic connectedness between the two members of the pair. Applied to our entries for bogey, this cleaning and selection procedure yields a list of six indicators: nose, golf, frightening, spirit, fear and bugbear. It is these items that we must try to match against elements found in the textual chunk, or related to it in certain ways.
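
The clean-up can be pictured as the following sketch; the clauses are illustrative samples only, the real inventory having been drawn up by the defi team over the full set of indicators:

% Sketch of the indicator clean-up (illustrative clauses only).
clean_indicator(Raw, label(L))   :- ind_to_label(Raw, L), !.     % moved to label field
clean_indicator(Raw, feature(F)) :- ind_to_feature(Raw, F), !.   % turned into semantic feature
clean_indicator(Raw, drop)       :- no_power(Raw), !.            % discarded
clean_indicator(Raw, ind(Item))  :- reduce_to_item(Raw, Item).   % kept, stripped of structure

ind_to_label('in court', jur).
ind_to_label('in golf', golf).
ind_to_feature(person, hum).
ind_to_feature(act, proc).
no_power('all contexts').
reduce_to_item('in nose', nose).
reduce_to_item('evil spirit', spirit).
reduce_to_item('imagined fear', fear).
reduce_to_item(bugbear, bugbear).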

 

We can try to match the indicator against all the lemmas of the textual chunk that are not themselves high-frequency items such as tool words. This procedure is meant to cash in on the redundancy of natural language as captured by the concept of isotopie (Greimas 1968, passim). In our example sentence, the procedure attempts to match each indicator against each member of the following list, collected from the lem structures of the enhanced parse: little, exorcize and rest. The procedure does not succeed, which is not too surprising, seeing that the elements to be matched belong to different parts of speech, and that both WordNet and coocdb have a POS-oriented organization. Spirit and exorcize look connected to us, human readers, but we are not shackled by POS bonds...
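
In Prolog terms, this type-1 match amounts to the following sketch, connected/3 being the battery of tests pictured in the previous section:

% Type-1 indicator match (sketch): try the indicator against every
% full-lexemic lemma of the chunk and record the resource that linked them.
i1_match(Ind, ChunkLemmas, i1(Ind, Lemma, How)) :-
    member(Lemma, ChunkLemmas),        % here: little, exorcize, rest
    connected(Ind, Lemma, How).        % fails here: the POS barrier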

 

The matching of an indicator against all full-lexemic textual lemmas enables the matcher to recover links that the parsing procedure is powerless to catch. Consider the sentence

It is not an arduous task, it's a light one. (as contrasted with, say, It’s not a dark room, it’s a light one).
The defi results are the following:

25 - 158220, efm, light, léger/-ère, [pos(20),vb(0),i1(heavy,arduous,rg(0),mt(0),wn(5))][2]

25 - 158226, efm, light, peu fatigant, [pos(20),vb(0),i1(strenuous,arduous,rg(0),mt(0),wn(5))]

20 - 158222, efm, light, clair, [pos(20),vb(0)]

etc.

defi is able to select the right translations, in spite of the fact that the item light modifies one, and that the parser (wisely enough) does not attempt to retrieve the anaphor’s antecedent. A reasonable line to follow is to rule that the matcher cannot be tied to the structures built by the parser, although it must give them priority[3]. The debugging information tells us that the indicators heavy and strenuous (associated with the relevant reading of light, although negated by not) matched the textual element arduous (the indicator matching is of type 1, as indicated by i1, when the match is with an element retrieved from the lem structures of the textual w-list). The match is due to WordNet (wn) and is worth a weight of 5 (the assignment of weights is purely heuristic).

 

In the case of bogey, we can get much nearer to what we want by matching the indicator, not against the textual lemmas themselves, but against the collocate lists recorded for the collocate bearer, here the object collocates the Oxford-Hachette supplies for exorcize: demon, memory and past. This type-2 indicator match, flagged i2 in the debugging information, calls on the same resources as the type-1 match; it is responsible for the i2(spirit,demon,...) and i2(spirit,memory,...) structures in the winning bogey-croquemitaine pair above.
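
A sketch of this second match type, again with connected/3 standing in for the resource calls:

% Type-2 indicator match (sketch): the indicator is tried against each
% collocate recorded for the collocate bearer.
i2_match(Ind, BearerCollocates, i2(Ind, Coll, How)) :-
    member(Coll, BearerCollocates),    % here: demon, memory, past
    connected(Ind, Coll, How).         % spirit-demon and spirit-memory succeed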

 

Finally, we have decided to complete our attempts at indicator matching by turning to the resources offered by our three monolingual dictionaries. All three are primarily learners’ dictionaries, in which the lexicographers have made special efforts to provide clear and easy-to-understand definitions, and to coin or select examples that are highly typical of the environment in which the word under the selected reading is most likely to occur. We selected all the one-word lexemes bearing the relevant parts of speech: adjective, noun and verb. We carried out frequency studies in order to be able to eliminate from the definitions and examples the words whose high frequency detracts from their discriminating power in the task of reading selection. These are not only members of the usual class of tool words, but include items that are heavily used in the lexicographic practice of writers of learners’ dictionaries (a short fragment of the high-frequency item list runs as follows: made, main, mainly, make, makes, making, many, mark, may, me, mean, means, mentioned, might).

The preserved items make up the deflex (definitions) and exlex (examples) structures of the monolingual dictionary (entries from ldoce, cobuild and cide are merged; the ori feature keeps track of where they come from: ci for cide, co for cobuild, and lg for ldoce). The gw (guideword) slot is particularly important. It is derived from the guideword information provided by cide, but also from the thesauric links information to be found in cobuild. Together with all the items of the deflex and exlex structures, it provides the information necessary for selecting the relevant items in the monolingual dictionary. We select the items whose deflex, exlex and gw slots provide an element that directly matches the indicator; direct matching is simply string identity. Below are the entries for bogey in the monolingual defi dictionary, with the matches with the bogey indicators printed in bold:

Pattern: mono( Lemma, Origin, Part of Speech, Labels, GuideWord List, Items in Definition, Items in Examples). Each w-structure identifies position in string (enabling the retrieval of adjacency relations) and text form.

mono(lem('bogey'),ori('ci'),pos('n'),lab(['nil']),gw([fear]),deflex([w(3,'fear'),w(7,'based'),w(9,'reason')]),exlex([w(1,'committing'),w(2,'himself'),w(5,'relationship'),w(8,'biggest')])). /* 1 */

mono(lem('bogey'),ori('ci'),pos('n'),lab(['nil']),gw([nose]),deflex([w(4,'dried'),w(5,'mucus'),w(7,'inside'),w(9,'nose')]),exlex(['nil'])). /* 2 */

mono(lem('bogey'),ori('ci'),pos('n'),lab(['nil']),gw([shot]),deflex([w(2,'name'),w(5,'score'),w(12,'take'),w(15,'attempt'),w(19,'supposed'),w(24,'ball'),w(27,'hole')]),exlex([w(1,'shell'),w(2,'win'),w(4,'match'),w(9,'doesnt'),w(10,'pick'),w(14,'bogeys'),w(17,'remaining'),w(18,'holes'),w(1,'jackson'),w(2,'bogeyed'),w(4,'last'),w(8,'lost'),w(10,'tournament')])). /* 3 */

mono(lem('bogey'),ori('ci'),pos('v'),lab(['nil']),gw([shot]),deflex([w(2,'name'),w(5,'score'),w(12,'take'),w(15,'attempt'),w(19,'supposed'),w(24,'ball'),w(27,'hole')]),exlex([w(1,'shell'),w(2,'win'),w(4,'match'),w(9,'doesnt'),w(10,'pick'),w(14,'bogeys'),w(17,'remaining'),w(18,'holes'),w(1,'jackson'),w(2,'bogeyed'),w(4,'last'),w(8,'lost'),w(10,'tournament')])). /* 4 */

mono(lem('bogey'),ori('co'),pos('n'),lab(['nil']),gw(['monster']),deflex([w(5,'bogeyman'),w(8,'imaginary'),w(9,'frightening'),w(10,'evil'),w(11,'spirit')]),exlex([w(2,'see'),w(6,'bush'),w(4,'threaten'),w(7,'bogeymen')])). /* 5 */

mono(lem('bogey'),ori('co'),pos('n'),lab(['nil']),gw(['nil']),deflex([w(4,'dried'),w(5,'mucus'),w(9,'inside'),w(11,'nose')]),exlex(['nil'])). /* 6 */

mono(lem('bogey'),ori('co'),pos('n'),lab(['nil']),gw(['worry']),deflex([w(8,'worried'),w(10,'perhaps'),w(12,'cause'),w(14,'reason')]),exlex([w(5,'rest'),w(7,'old'),w(10,'military'),w(11,'expenditure'),w(13,'vital'),w(15,'national'),w(16,'security')])). /* 7 */

mono(lem('bogey'),ori('lg'),pos('n'),lab(['gf']),gw(['nil']),deflex([w(2,'golf'),w(4,'act'),w(5,'bogeying')]),exlex(['nil'])). /* 8 */

mono(lem('bogey'),ori('lg'),pos('n'),lab(['nil']),gw(['nil']),deflex([w(2,'imaginary'),w(3,'fear')]),exlex([w(1,'state'),w(2,'ownership'),w(4,'industry'),w(7,'political')])). /* 9 */

mono(lem('bogey'),ori('lg'),pos('n'),lab(['nil']),gw(['nil']),deflex([w(3,'man'),w(4,'--'),w(10,'children'),w(12,'imaginary'),w(13,'evil'),w(14,'spirit'),w(17,'threatening')]),exlex(['nil'])).
/* 10 */

mono(lem('bogey'),ori('lg'),pos('n'),lab(['nil']),gw(['nil']),deflex([w(5,'children'),w(9,'dirty'),w(10,'mucus'),w(13,'nose')]),exlex(['nil'])). /* 11 */

mono(lem('bogey'),ori('lg'),pos('v'),lab(['gf']),gw(['nil']),deflex([w(2,'golf'),w(4,'hit'),w(6,'ball'),w(9,'hole'),w(12,'stroke'),w(16,'average')]),exlex(['nil'])). /* 12 */
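
Since direct matching is plain string identity, the selection of the relevant monolingual entries can be sketched as a single Prolog clause over the mono structures above (the clause is ours; the filtering of nil slots is omitted):

% Sketch: select the mono entries whose gw, deflex or exlex contains
% an item identical to the (cleaned) indicator.
mono_match(Lemma, Ind, Entry) :-
    Entry = mono(lem(Lemma), _Ori, _Pos, _Lab, gw(GW),
                 deflex(Def), exlex(Ex)),
    call(Entry),                       % retrieve a candidate entry
    (  member(Ind, GW)                 % fear matches entry 1, nose entry 2
    ;  member(w(_, Ind), Def)          % spirit matches entries 5 and 10
    ;  member(w(_, Ind), Ex)
    ).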

 

Once we have selected the matching items, we can explore the whole wordlist made up of the concatenation of the guideword, deflex and exlex structures, and attempt to match each element against the indicator. We can also retrieve from the monolingual dictionary the entries for the collocate bearer (in casu exorcise) and attempt to match the indicator as against the deflex and exlex structures of the collocate bearer. This yields a successful match for the item spirit, the indicator associated with the bogey-croquemitaine pair. Such information is derivable from the dfexcb marker in the i2 structure printed in bold in the debugging information slot:

35 - 47258, ohef, bogey, croquemitaine {m}, [pos(20),vb(0),i2(spirit,demon,rg(0),mt(2),wn(0)),i2(spirit,memory,rg(0),mt(2),wn(0)),i2(spirit,3,dfexcb), evil,spirit,man,evil,spirit]


To complete the grand tour, defi attempts to match the relevant collocate list, deflex and exlex of the collocate bearer (exorcise) as against the guideword, deflex and exlex of the selected entries of the selected item (bogey). The matching items are recorded in a list tacked on at the end of the debugging information slot and printed in bold below in the two item-translation pairs where they contribute to weight assignment (each item is worth a weight of 3):

35 - 47258, ohef, bogey, croquemitaine {m}, [pos(20),vb(0),i2(spirit,demon,rg(0),mt(2),wn(0)),i2(spirit,memory,rg(0),mt(2),wn(0)),i2(spirit,3,dfexcb), evil,spirit,man,evil,spirit]

26 - 47257, efm, bogey, épouvantail {m}, démon {m},  [pos(20),vb(0),evil,spirit]
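
This final pass boils down to intersecting wordlists, each shared item contributing 3 to the weight. A sketch, with entry_items/2 an assumed helper concatenating the guideword, deflex and exlex items of a mono entry (nil filtering again omitted):

% Sketch of the final matching pass (assumed helper names).
cross_match(ItemEntry, BearerEntry, Matches, Weight) :-
    entry_items(ItemEntry, Is),        % wordlist of the bogey entry
    entry_items(BearerEntry, Bs),      % wordlist of the exorcize entry
    findall(W, (member(W, Is), member(W, Bs)), Matches),
    length(Matches, N),
    Weight is 3 * N.                   % each shared item is worth 3

% Concatenate the three wordlists of a mono entry.
entry_items(mono(_, _, _, _, gw(GW), deflex(D), exlex(E)), Items) :-
    findall(X, member(w(_, X), D), DW),
    findall(X, member(w(_, X), E), EW),
    append(GW, DW, Tmp),
    append(Tmp, EW, Items).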

The monolingual entries for exorcise/ze are given below; the matching items (evil, man, spirit) are printed in bold italics:

mono(lem('exorcise'),ori('ci'),pos('v'),lab(['nil']),gw(['nil']),deflex([w(2,'force'),w(5,'evil'),w(6,'spirit'),w(14,'praying'),w(16,'magic')]),exlex([w(3,'priest'),w(4,'exorcised'),w(6,'spirit'),w(8,'strange'),w(9,'noises'),w(10,'stopped'),w(6,'house'),w(6,'child'),w(3,'take'),w(10,'memory'),w(13,'accident')])). /* this entry does not match for the trivial reason that it is spelled -ise instead of -ize: this can be taken care of by a simple morphology routine */

mono(lem('exorcize'),ori('co'),pos('v'),lab(['nil']),gw(['expel']),deflex([w(4,'evil'),w(5,'spirit'),w(7,'demon'),w(10,'force'),w(13,'leave'),w(23,'prayers'),w(25,'ceremonies')]),exlex(['nil'])). /* 1 */

mono(lem('exorcize'),ori('co'),pos('v'),lab(['nil']),gw(['free']),deflex([w(9,'force'),w(11,'evil'),w(12,'spirit'),w(14,'demon'),w(16,'leave'),w(22,'prayers'),w(24,'ceremonies')]),exlex(['nil'])). /* 2 */

mono(lem('exorcize'),ori('co'),pos('v'),lab(['nil']),gw(['remove']),deflex([w(5,'painful'),w(7,'unhappy'),w(8,'memory'),w(10,'succeed'),w(12,'removing'),w(16,'mind')]),exlex(['nil'])). /* 3 */

mono(lem('exorcize'),ori('lg'),pos('v'),lab(['nil']),gw(['nil']),deflex([w(3,'rid'),w(7,'bad'),w(8,'thought'),w(10,'feeling')]),exlex([w(6,'memory'),w(10,'misdeeds')])). /* 4 */

mono(lem('exorcize'),ori('lg'),pos('v'),lab(['oc']),gw(['nil']),deflex([w(2,'drive'),w(5,'evil'),w(6,'spirit'),w(17,'solemn'),w(18,'command')]),exlex([w(3,'once'),w(4,'believed'),w(7,'man'),w(11,'devil'),w(17,'priest'),w(23,'prayer')])). /* 5 */

mono(lem('exorcize'),ori('lg'),pos('v'),lab(['oc']),gw(['nil']),deflex([w(2,'free'),w(9,'evil'),w(10,'spirit')]),exlex([w(2,'mad'),w(3,'girls'),w(4,'parents'),w(5,'took'),w(9,'priest')])). /* 6 */

Note that the items evil and spirit appear twice in the list associated with the croquemitaine translation. This is because the indicator spirit associated with the croquemitaine translation of bogey allows the retrieval of two entries for bogey in the monolingual (entries /* 5 */ and /* 10 */), and each of them is given a chance to see its deflex and exlex structures matched against the deflex and exlex of the various entries for exorcize.

4. Conclusions

The indicator field cannot be neglected by a sophisticated word disambiguation or target translation selection program. Although very much of a ragbag, it is often the only field providing discriminating information: if we do not make use of it, we can often do no more than list all the translations recorded in the bilingual dictionary, with the same weight assigned to all, i.e. without any ranking whatsoever; defi would then be nothing more than an automatic look-up program.

The indicator field is only partly usable by a computer procedure. The structure that it exhibits (prepositional phrase, to-phrase, etc.) cannot be interpreted in the present state of natural language processing (at least not on a large scale, and certainly not if we plan to deal with any type of text, as we do in defi). We have to focus on one single unit within the indicator field, and the selection procedure must be automatic.

The matching of the indicator must cash in on the redundancy of natural language, i.e. the recurrence of the feature captured by the indicator in the remainder of the textual chunk. But the process cannot be too expensive computationally, and should not be allowed to run wild, putting everything in relation with everything else.

Consequently, we try to retrieve from the textual chunk the items that are most tightly connected with the selected word, i.e. the items where more information is likely to be found on the link between the textual element and the selected item under the relevant reading pointed at by the indicator. The collocate bearer and its associated collocate lists are a most promising starting point.

The links between collocate bearer, collocate list and indicator can be reinforced by looking at the monolingual dictionary in so far as the entries for the selected item and the collocate bearer are likely to bring to the fore the most typical environments the items fit in, both paradigmatically (the definitions) and syntagmatically (the examples). For the matching procedure to be computationally tractable, we cannot afford to do more than simple string matching – but we do have sizeable word lists here, not just one or two items.

The indicator matching reported here has a reasonable computational cost. But it should be emphasized that the procedure is kept deterministic, with Prolog clause ordering and the use of the findall predicate ensuring that the best matches are not passed over. Even so, the ‘worst’ sentences can take more than ten seconds user time on a fairly powerful PC, and a good deal of that time is gobbled up by indicator matching. But there is not a shadow of a doubt that the results justify the computational cost, which is likely to take up less and less user time as desktop machines keep getting more sophisticated and powerful.

References

A. Dictionaries and thesauri

cide = Procter, P. (ed.) 1995. Cambridge International Dictionary of English. Cambridge: Cambridge University Press.

cobuild = Sinclair, J. (ed.) 1987. Collins Cobuild English Dictionary. London and Glasgow: Collins.

ldoce = Procter, P. (ed.) 1979. The Longman Dictionary of Contemporary English. Harlow: Longman.

oh = Corréard, M.H. and Grundy, V. (eds.) 1994. The Oxford-Hachette French Dictionary. Oxford: Oxford University Press.

rc = Atkins, B. T. (ed.) 1995. Collins-Robert English/French Dictionary. Glasgow: HarperCollins.

Roget’s Thesaurus: public-domain version, downloadable from various websites.

WordNet = WordNet Prolog Package, downloadable from the Princeton University Website (http://www.cogsci.princeton.edu/~wn/). See also Miller 1990.

B. Tools

The surface parser: engcg was developed at the General Linguistics Department of the University of Helsinki. It is marketed by Lingsoft Inc. (http://www.lingsoft.fi).

Awk: MKS and Thompson implementations for OS/2 and Win95 and associated documentation; see also Aho et al. 1988.

Prolog: Arity implementation for OS/2 and Win95; Arity Corporation, Damonmill Square, Concord, Mass. 

C. Other literature

Aho, A.V., Kernighan, B. W. and  Weinberger, P.J. 1988. The AWK Programming Language. Reading, Mass.: Addison-Wesley.

Greimas, A. J. 1968. Sémantique structurale. Paris: Larousse.

Miller, G.A. (ed) 1990. ‘WordNet: An On-Line Lexical Database’, International Journal of Lexicography 3.4. (whole issue devoted to WordNet)

Montemagni, S., Federici, S. and Pirrelli, V. 1996. ‘Example-based Word Sense Disambiguation: a Paradigm-driven Approach’, Euralex’96 Proceedings, Göteborg University, 151-160.

 





[1] The defi team was first made up of Nicolas Dufour and Archibald Michiels (1995-1998). When the former left the project (1998), he was replaced by Nathalie Leclair. Now that funding has come to an end (2000), the team is reduced to a single individual, namely the author, who intends to go on developing the defi matcher.

[2] Pos is Part of Speech, and vb Verb Bonus, a bonus assigned to the items that the parser clearly identifies as verbs (as contrasted with participles and gerunds).

[3] It is worth noting that one of the sentences in its own test suite where the defi matcher is at its poorest is Cast a glance at the bird, because the engcg parser does not assign verb as POS value for cast, and the other features taken into account by defi (such as the envir feature, here the specification of the at-phrase) are unable to repair the damage attributable to the parser’s decision.