Lexdis, a Prolog program for measuring lexical proximity


Lexical resources

·        semdic : dictionary clauses derived from Cide, Cobuild, Ldoce and the WordNet Synsets and Synset Glosses

·        mt : data base of RC/OH collocates - the pivotal property is copresence within the same collocate field

·        roget : db of Roget's Thesaurus Categories (three levels)

·        indic: data base of RC(Robert/Collins)/OH(Oxford/Hachette) indicators (in these two bilingual E-F/F-E dictionaries, only the E->F direction is explored)  

·        coll : data base of RC/OH collocates - the pivotal element is the collocate bearer

·        envir : data base of environments derived from RC/OH 'extended' lemmas i.e. including phrases and examples

·        weights : data base recording the lexical weight of lemmas

 

(Cide = Cambridge International Dictionary of English

  Cobuild = Cobuild dictionary of English, based on the Cobuild Corpus

  Ldoce = Longman Dictionary of Contemporary English)

 

Sample data

Semdic

 

Cide entry

 

idnum 23101

hdwd   epic

ident   1* 2

lemma            epic

pos     adj

selpos            subj = activity/event

def      epic can also be used of events that happen over a long period and involve a lot of action and difficulty.

ex1      an epic journey

ex2      an epic struggle

 

=>

 

mono(

  lem('epic'),

  ori('ci'),

  idnum('ci23101'),

  pos('adj'),

  lab([]),

  gw([]),

  deflex(['events','happen','period','involve','difficulty']),

  exlex(['journey','struggle']),

  def(' epic can also be used of events that happen over a long period and involve a lot of action and difficulty')

    ).

 

 

Entries for ‘CAT/cat’

 

mono(lem('CAT'), ori(wn), idnum('cat%1:04:00::'), pos(n), lab([]), gw([]),

deflex(['computerized tomography', 'computed tomography', 'CT', 'computerized axial tomography', 'computed axial tomography', 'CAT', method, examining, body, organs, scanning, x, rays, computer, construct, series, 'cross-sectional', scans, single, axis]),

exlex([]),

def('a method of examining body organs by scanning them with x rays and using a computer to construct a series of cross-sectional scans along a single axis')).

 

mono(lem('cat'),ori('ci'),idnum('ci10278'),pos('n'),lab([]),gw([]),

deflex(['small','four-legged','furry','animal', 'tail', 'claws','pet','catching','mice','member','biologically','similar','animals','lion']),

exlex(['pet','stray','feed','holiday']),

def(' a small four-legged furry animal with a tail and claws usually kept as a pet or for catching mice or any member of the group of biologically similar animals such as the lion')).

 

mono(lem('cat'),ori('co'),idnum('co8704'),pos('n'),lab([]),gw(['moggy','puss']),

deflex(['small','furry','animal','tail','whiskers','sharp','claws','kills','smaller','animals','mice','birds','pets']),

exlex(['hand','stroked','softly','domestic','animals','dogs']),

def('a cat is a small furry animal with a tail whiskers and sharp claws that kills smaller animals such as mice and birds cats are often kept as pets')).

 

mono(lem('cat'),ori('co'),idnum('co8705'),pos('n'),lab([]),gw([]),

deflex(['animal','belonging','family','includes','lions','tigers','see','big']),

exlex(['lions','hunt','team','members','family']),

def('any animal belonging to the family that includes lions and tigers see also big cat')).

 

mono(lem('cat'),ori('lg'),idnum('lg20694'),pos('n'),lab(['am']),gw([]),

deflex(['small','animal','soft','fur','sharp','teeth','claws','nails','pet','buildings','catch','mice','rats']),

exlex([]),

def(' a small animal with soft fur and sharp teeth and claws nails often kept as a pet or in buildings to catch mice and rats')).

 

mono(lem('cat'),ori('lg'),idnum('lg20695'),pos('n'),lab(['mdzb']),gw([]),

deflex(['animals','lion','tiger']),

exlex([]),

def(' any of various types of animals related to this such as the lion or tiger')).

 

mono(lem('cat'),ori('lg'),idnum('lg20696'),pos('n'),lab([]),gw([]),

deflex(['nasty','woman']),

exlex([]),

def(' derog a nasty woman')).

 

mono(lem('cat'),ori('lg'),idnum('lg20697'),pos('n'),lab(['na']),gw([]),

deflex(['strong','apparatus','life','heavy','objects','anchors','ship']),

exlex([]),

def(' a strong apparatus used to life heavy objects esp anchors  onto a ship')).

 

mono(lem('cat'),ori('lg'),idnum('lg20698'),pos('n'),lab([]),gw([]),

deflex(['becoming','man']),

exlex(['hear','new','records']),

def(' sl becoming rare a man')).

 

mono(lem('cat'),ori('lg'),idnum('lg20699'),pos('n'),lab(['sm']),gw([]),

deflex(['cat-o-nine','tails']),

exlex([]),

def(' infml cat-o-nine tails')).

 

mono(lem('cat'),ori('lg'),idnum('lg20700'),pos('n'),lab(['sm']),gw([]),

deflex(['cat-o-nine','tails']),

exlex([]),

def(' infml cat-o-nine tails')).

 

mono(lem('cat'),ori('lg'),idnum('lg20701'),pos('n'),lab(['sozc']),gw([]),

deflex(['burglar']),

exlex([]),

def(' bre infml cat burglar')).

 

mono(lem('cat'),ori('lg'),idnum('lg20702'),pos('n'),lab(['eg','ag']),gw([]),

deflex(['caterpillar','tractor']),

exlex([]),

def(' infml caterpillar tractor')).

 

Mt

Connectedness through co-occurrence in Robert/Collins-Oxford/Hachette collocate lists

 the hypothesis that words appearing together in the same collocate list are connected was put forward by Montemagni et al.

 

cf. Montemagni, S., Federici, S. and Pirrelli, V. 1996.

‘Example-based Word Sense Disambiguation: a Paradigm-driven Approach’,

Euralex’96 Proceedings, Göteborg University, 151-160. 

 

 

each co-occurrence pair is recorded only once, under whichever of the two lexical items comes first in alphabetical order

- it is therefore the entry of the alphabetically 'smaller' word that should be explored

 

an mt line looks like the following:

 

mt(

    digestion,

     [

       [growth,1],

       [machine,1],

       [mind,1],

       [movement,1],

       [reaction,1],

       [recovery,1],

      [stomach,4]]).

 

this means that the word 'digestion' co-occurs once with 'growth' in a collocate list ... and 4 times with 'stomach'

a co-occurrence of 'digestion' with a word that precedes 'digestion' alphabetically should be looked for under that word
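
The lookup rule above can be sketched as follows. This is only an illustration: mt_count/3 is a hypothetical helper name, not part of Lexdis, and it assumes that Prolog's standard order of atoms (@<) matches the alphabetical ranking used when the mt data base was built, which may not hold for capitalised items such as 'CAT'.

```prolog
% Sample fact copied from the text above.
mt(digestion, [[growth,1],[machine,1],[mind,1],[movement,1],
               [reaction,1],[recovery,1],[stomach,4]]).

% mt_count(+W1, +W2, -N): hypothetical helper (not from Lexdis).
% The pair is stored under the alphabetically smaller word,
% so we pick that word as the key before consulting mt/2.
mt_count(W1, W2, N) :-
    (   W1 @< W2
    ->  Key = W1, Other = W2
    ;   Key = W2, Other = W1
    ),
    mt(Key, Pairs),
    memberchk([Other, N], Pairs).
```

With this sketch, the queries mt_count(stomach, digestion, N) and mt_count(digestion, stomach, N) both consult the 'digestion' entry and bind N to 4.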

 

 

Cat and dog

 

Defidic entry for curl up

 

IDNUM=     72554

HEADWORD=  curl

LEMMA=     curl up

LEMMATYPE= phrasalverb

POS=       vi

PATTERNS=  4

PRECOLL=   cat, dog

TRANS=     se mettre en rond

TRADRATIO= 1/16

ORIGIN=    ohef

 

=>

 

mt(cat,[[child,2],[chin,1],[cow,1],[dog,21],[engine,2],[fur,1],[hand,1], [house,1],[lion,2],[lizard,1],[person,14],[phone,1],[prowler,1],[rabbit,3],[servant,1],[tiger,2]]).

 

mt(dog,[[dogs,13],[entry,1],[estate,1],[expression,1],[face,1],[fan,1],[fans,1],[field,1],[fish,2], [flower,3],[flowers,2],[fox,1],[friend,1],[gangster,2],[gangsters,2],[good,1],[happiness,1], [horse,21],[horses,1],[house,4],[insect,1],[insects,1],[juggler,1],[lion,1],[machine,1],[man,1], [page,1],[pages,1],[people,1],[person,27],[pet,1],[picture,1],[pig,1],[plant,1],[plumage,1], [property,1],[pursuer,1],[pursuers,1],[rabbit,2],[relative,1],[servant,1],[sheep,1],[shop,1],[shot,1], [situation,1],[spouse,1],[taxi,1],[team,1],[terrorist,2],[terrorists,2],[thief,1],[thieves,1],[thunder,2], [tiger,2],[town,1],[vampire,1],[vegetable,1],[vegetables,1],[vehicle,2],[wind,2],[wolf,2],[wolves,1],[woman,1]]).

 

Roget

Connectedness through the sharing of Roget's categories

three levels of delicacy in thesaurus organisation

 

an r line looks like the following:

 

r(

  'antiquarian',

  [

   ['n','122','4','4'],

   ['n','492','4','2']

   ]

).

 

which means that the word 'antiquarian' is a noun that belongs to two category triples,

122/4/4 and 492/4/2; in each triple the broadest category comes first (e.g. 492), followed by the sub-category (4) and the sub-sub-category (2)

 

we retrieve the list of categories associated with each of the two items

and then we compute their intersection, accumulating weights according to the type of category matched

 

Cat and dog

 

r('cat',[['n','273','4','7'],['n','366','12','12'],['n','407','5','2'],['n','441','7','3'],['n','975','2','3']]).

r('dog',[['n','366','12','12'],['n','366','13','2'],['n','373','2','2'],['n','846','3','4'],['n','949','10','5'],['v','281','5','2'],['v','622','7','2']]).
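
A minimal sketch of the category comparison described above, using the cat and dog lines. The predicate names (level_weight/2, match_depth/3, roget_score/3) and the weight values are assumptions made for illustration; the text does not give the actual weights, nor does it say whether the part of speech must agree, which this sketch assumes.

```prolog
r('cat',[['n','273','4','7'],['n','366','12','12'],['n','407','5','2'],
         ['n','441','7','3'],['n','975','2','3']]).
r('dog',[['n','366','12','12'],['n','366','13','2'],['n','373','2','2'],
         ['n','846','3','4'],['n','949','10','5'],['v','281','5','2'],
         ['v','622','7','2']]).

% Hypothetical weights: a deeper match counts for more.
level_weight(1, 1).   % same category only
level_weight(2, 2).   % same category and sub-category
level_weight(3, 4).   % same full triple

% match_depth(+Entry1, +Entry2, -Depth): both entries must share
% POS and broadest category; Depth is the deepest level that agrees.
match_depth([P,A,B1,C1], [P,A,B2,C2], Depth) :-
    (   B1 \== B2 -> Depth = 1
    ;   C1 \== C2 -> Depth = 2
    ;   Depth = 3
    ).

% roget_score(+W1, +W2, -Score): sum of the weights of all matching pairs.
roget_score(W1, W2, Score) :-
    r(W1, Cs1),
    r(W2, Cs2),
    findall(Wt,
            ( member(X, Cs1), member(Y, Cs2),
              match_depth(X, Y, D), level_weight(D, Wt) ),
            Wts),
    sum_list(Wts, Score).
```

Under these assumed weights, cat and dog match twice: 366/12/12 against 366/12/12 (full triple, weight 4) and 366/12/12 against 366/13/2 (category only, weight 1), so roget_score(cat, dog, S) gives S = 5.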

Indic

Connectedness through indic sharing in the RC/OH indicator data base

db structure (indic.pl)

 

ind(

      lemma('abacus'),

      pos(n),

      indic(['counting','frame'])).

 

Cat and dog

 

ind(lemma('cat'),pos(n),indic(['catalytic','converter'])).

ind(lemma('cat'),pos(n),indic(['domestic'])).

ind(lemma('cat'),pos(n),indic(['feline','species'])).

ind(lemma('cat'),pos(n),indic(['female'])).

ind(lemma('cat'),pos(n),indic(['guy'])).

ind(lemma('cat'),pos(n),indic(['man','woman'])).

ind(lemma('cat'),pos(n),indic(['man'])).

ind(lemma('cat'),pos(n),indic(['woman'])).

 

Defidic entry

 

IDNUM=     81946

HEADWORD=  dog

LEMMA=     dog

LEMMATYPE= standard

POS=       n

LABELS=    tech, gen

INDICATOR= clamp

TRANS=     crampon {m}

TRADRATIO= 2/16

ORIGIN=    efm

 

=>

 

ind(lemma('dog'),pos(n),indic(['clamp'])).

ind(lemma('dog'),pos(n),indic(['female'])).

ind(lemma('dog'),pos(n),indic(['male','fox','wolf'])).

ind(lemma('dog'),pos(n),indic(['pawl','clamp'])).

ind(lemma('dog'),pos(n),indic(['roof'])).

ind(lemma('dog'),pos(n),indic(['unattractive','woman'])).

ind(lemma('dog'),pos(v),indic(['follow','closely'])).

ind(lemma('dog'),pos(v),indic(['harass'])).

ind(lemma('dog'),pos(v),indic(['plague'])).
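
Indicator sharing for the two items can be sketched as below. shared_indics/3 is a hypothetical name, and the sketch assumes that all indicator words of a lemma are pooled regardless of sense and part of speech; how Lexdis actually scores the overlap is not stated here.

```prolog
:- use_module(library(lists)).
:- use_module(library(ordsets)).

% Facts copied from the cat and dog samples above.
ind(lemma('cat'),pos(n),indic(['catalytic','converter'])).
ind(lemma('cat'),pos(n),indic(['domestic'])).
ind(lemma('cat'),pos(n),indic(['feline','species'])).
ind(lemma('cat'),pos(n),indic(['female'])).
ind(lemma('cat'),pos(n),indic(['guy'])).
ind(lemma('cat'),pos(n),indic(['man','woman'])).
ind(lemma('cat'),pos(n),indic(['man'])).
ind(lemma('cat'),pos(n),indic(['woman'])).
ind(lemma('dog'),pos(n),indic(['clamp'])).
ind(lemma('dog'),pos(n),indic(['female'])).
ind(lemma('dog'),pos(n),indic(['male','fox','wolf'])).
ind(lemma('dog'),pos(n),indic(['pawl','clamp'])).
ind(lemma('dog'),pos(n),indic(['roof'])).
ind(lemma('dog'),pos(n),indic(['unattractive','woman'])).
ind(lemma('dog'),pos(v),indic(['follow','closely'])).
ind(lemma('dog'),pos(v),indic(['harass'])).
ind(lemma('dog'),pos(v),indic(['plague'])).

% indic_words(+Lemma, -Set): all indicator words of a lemma,
% as a sorted, duplicate-free list.
indic_words(Lemma, Set) :-
    findall(I, ( ind(lemma(Lemma), _, indic(Is)), member(I, Is) ), All),
    sort(All, Set).

% shared_indics(+W1, +W2, -Shared): indicator words common to both.
shared_indics(W1, W2, Shared) :-
    indic_words(W1, S1),
    indic_words(W2, S2),
    ord_intersection(S1, S2, Shared).
```

For the sample data, shared_indics(cat, dog, S) gives S = [female, woman].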

 

 

Coll

Connectedness through collocate sharing in RC/OH collocate data base

 

db structure (coll.pl)

coll(

       lemma('abandonment'),

       pos(n),

       coll(['property','right'])).

 

here, in contrast to metameet (mt), two items are related if their own collocate lists have elements in common,

whereas in metameet what counts is the co-presence of the two items within one collocate list, whichever item that list is attached to

 

Cat and dog

 

Defidic entry

 

IDNUM=     81949

HEADWORD=  dog

LEMMA=     dog

LEMMATYPE= standard

POS=       n

INDICATOR= male fox, wolf, etc

POSTCOLL=  fox

TRANS=     mâle {m}

TRADRATIO= 2/16

ORIGIN=    efm

 

=>

 

coll(lemma('dog'),pos(n),coll(['fox'])).

 

Envir

Connectedness through envir sharing in RC/OH envir data base

db structure (envir.pl)

 

 e(

    hdwd('dative'),

    envir(['case','ending'])).

 

part of speech (POS) is not significant here

Cat and dog

e(hdwd('cat'), envir(['big','cats','burglar','cat-basket','cat-lick','cat-onine-tails','catbird', 'seat','catfood','catgut','cathouse','catmint', 'bag','dogs','mice','play','cats-cradle', 'cats-eye','cats-paw', 'cats-whisker','catsuit','door','family','fight','dog','flap','give', 'grin','hardly','room','swing','hot','bricks', 'tin','roof','jump','kill','laugh','lead','life','let','look', 'brought','dragged','king','pigeons','cat-and-mouse','game','mouse','rain','see','jumps', 'skin','take','catnap','think','meow','pajamas','whiskers','thinks','wait'])).

 

Lemmas

 

'dog Latin'

'dog basket'

'dog biscuit'

'dog breeder'

'dog cart'

'dog collar'

'dog days'

'dog dirt'

'dog fancier'

'dog food'

'dog fox'

'dog guard'

'dog handler'

'dog handling'

'dog in the manger'

'dog leg'

'dog licence'

'dog mess'

'dog muck'

'dog owner'

'dog paddle'

'dog rose'

'dog shit'

=>

e(hdwd('dog'),envir(['basket','biscuit','breeder','case','eat','collar','crafty','old','day','dirty', 'dog-cart', 'dog-eared', 'dog-paddle', 'dog-tired', 'dog-watch','dogfight','dogfish','dogfood','dogged','controversy','ill','fortune','misfortune','uncertainty', 'doghouse','dogs','breakfast','chance','dinner','food','footsteps','life', 'dogshow','dogtrot','dressed', 'fancier','fox','gay','general', 'dogsbody','give','bad','name','hang','guard','handler','lead','led', 'leg','licence','love','lucky', 'manger','night','poor','health','childhood','see','man','sly','track','vile','wolf'])).

 

Weights

We record the lexical weight of each lexical item and use it for weighting (“pondération”): the computed proximity is scaled down for ‘heavy’ lexical items. Lexical weight is computed from the length of the item's entry in the monolingual dictionary.

 

Cat and dog

 

w(lem('cat'),weight(4)).

w(lem('dog'),weight(7)).
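
One way such a down-weighting could look is sketched below. The formula (dividing the raw proximity by the mean of the two weights) and the name damped/4 are purely assumptions for illustration; the text only says that proximity is decreased for heavy items, not how.

```prolog
% Facts copied from the sample above.
w(lem('cat'),weight(4)).
w(lem('dog'),weight(7)).

% damped(+W1, +W2, +Raw, -Adjusted): hypothetical damping rule.
% Divides the raw proximity by the mean weight of the two items,
% so that heavier items yield a smaller adjusted value.
damped(W1, W2, Raw, Adjusted) :-
    w(lem(W1), weight(A)),
    w(lem(W2), weight(B)),
    Adjusted is Raw * 2 / (A + B).
```

Under this assumed rule, a raw proximity of 22 between cat and dog would be reduced to 4 (22 divided by the mean weight 5.5).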