LKP: a User's Guide

Nicolas Dufour, University of Liège, DEFI Project





1. General Description


LKP is a simple interface that allows human users to query DEFIDIC, the online English-to-French dictionary developed in the framework of the DEFI word sense discrimination project. The 350,000 entries of DEFIDIC can be searched either directly (lemma search), via keywords or, within a restricted list of keyword-accessed lemmas, through a substring.


2. The Dictionary Behind the Machine

2.1 Creating a single machine-tractable dictionary from two 'raw tapes'

DEFI's bilingual resources consist of two general-use dictionaries, the Oxford-Hachette (OH) and Collins-Robert (CR) bilingual dictionaries. We obtained from the publishers the electronic files they use as the basis for the publication of their print versions, so that the two files we got contain exactly the same information as the dictionaries you have on your desk.

These 'machine-readable' versions, however, were not even remotely fit for use in a natural language processing (NLP) context, so that we had to spend quite a few months of harmless drudgery turning them into a format that suited our purposes better. In that intermediary state, which serves as the basis for the LKP database, both dictionaries are split into 210,000-odd records similar to the following:

IDNUM= 58985
HEADWORD= exile
LEMMA= exile
LEMMATYPE= standard
POS= n
ENVIR= e(from,de). #
INDICATOR= expulsion
TRANS= exil {m}
ORIGIN= OH

IDNUM= 58991
HEADWORD= exile
LEMMA= the Exile
LEMMATYPE= example
POS= pr n
LABELS= rel
TRANS= l"Exil {m}
ORIGIN= OH

IDNUM= 58994
HEADWORD= exile
LEMMA= to @exile sb from a country
LEMMATYPE= struc
POS= vtr
TRANS= bannir qn d"un pays
ORIGIN= OH

In this format, each record contains a single English (source) item, be it a single word, a multi-word unit (MWU) or an example sentence, its French translation and all the relevant linguistic and metalinguistic information (part of speech, field and style labels, prepositional and clausal environments, collocations...). Each record is independent from all others, so that the notion of a structured and hierarchical 'entry' that governs printed dictionaries no longer applies. The only remnant of the original entries is the headword, which retains a special tagging (@) in the lemma only until the merging of the two dictionaries into one. The relevant source item, however, is the lemma.
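
To make the record layout more concrete, here is a minimal Python sketch that reads records of the kind shown above into dictionaries. It is not part of LKP or of the DEFI tools; the function name parse_records and the assumption that records are separated by blank lines are ours, based only on the sample above.

```python
def parse_records(text):
    """Split a dump of 'FIELD= value' lines into one dict per record."""
    records = []
    current = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the record being read, if any.
            if current:
                records.append(current)
                current = {}
            continue
        name, sep, value = line.partition("=")
        if sep:  # skip lines that do not have the 'FIELD= value' shape
            current[name.strip()] = value.strip()
    if current:
        records.append(current)
    return records


SAMPLE = """\
IDNUM= 58985
HEADWORD= exile
LEMMA= exile
LEMMATYPE= standard
POS= n
ENVIR= e(from,de). #
INDICATOR= expulsion
TRANS= exil {m}
ORIGIN= OH

IDNUM= 58994
HEADWORD= exile
LEMMA= to @exile sb from a country
LEMMATYPE= struc
POS= vtr
TRANS= bannir qn d"un pays
ORIGIN= OH
"""

records = parse_records(SAMPLE)
for rec in records:
    # Optional fields such as INDICATOR are simply absent from the dict.
    print(rec["LEMMA"], "->", rec["TRANS"], "| indicator:", rec.get("INDICATOR"))
```

Since every record is independent of all others, a flat list of dictionaries is enough; no 'entry' hierarchy needs to be rebuilt.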

Two dictionaries, of course, are more time-consuming to query than only one. The next stage therefore consisted in merging CR and OH into a single dictionary, DEFIDIC. For this we developed an automatic procedure for comparing records from the two dictionaries and 'merging' those that were deemed to be equivalent. Records that could not be merged were simply added to the new dictionary, creating a lexical database of 354,078 entries - compared to a total of 419,118 for the two dictionaries before merging. It should be no surprise, however, that DEFIDIC still suffers from a vast amount of redundancy: the two dictionaries contain very similar information, and only 32% of the records could be merged safely (increasing the merging ratio, while possible, is bound to lead to errors and discrepancies).


2.2 The metalinguistic information in DEFIDIC

DEFIDIC records have room for 22 different types of information, even though the most complex ones feature only a dozen at a time. This section describes in some detail the most important types of information (or 'fields'), which are also the ones provided by LKP. 'Listing tags' are letters that identify the various types of information in the LKP output. They are described in more detail later on, and listed in Appendix B.

The lemma is the English item to be translated or glossed. Lemmas can be single words, MWUs (compound nouns, phrasal verbs, infinitive clauses) or example sentences. Lemma, part of speech and 'origin' are the only obligatory fields; all others are optional.

The translation hardly needs commenting on. It may be absent if the lemma is glossed or if the record simply contains a reference to some other place in the dictionary (see cross-reference and gothere fields). Sometimes records were merged whose translations were not perfectly identical; in such cases the two original translations are kept, and are separated by a '#' sign.

The gloss (listing tag g) is an explanation in French that complements or replaces the translation. Glosses are used mainly for 'cultural' (political, historical, culinary, etc.) concepts that lack an exact counterpart in French.

A part of speech (noun, verb...) is almost always specified. Even MWUs have a part of speech, which applies to the headword under which the MWU was originally recorded (to bear the brunt is n because it was found under brunt, for instance). Parts of speech have no listing tag because they are mandatory in the LKP output (the 'unknown' POS is listed as x), and their location is frozen. For a list of the parts of speech as displayed by LKP, please refer to Appendix C.

The indicator (listing tag i) is the only truly semantic information available. As can be guessed from its name, it 'indicates' how the lemma should be understood. It is typically a synonym of the lemma, a short usage note or a sense restriction. Note that in multi-word units, the indicator often applies only to the headword under which the MWU was found. Consider as to the manner born, whose indicator [mode,way] applies only to manner.

Collocates (listing tags s for subject, o for object) are words that, for a given translation, are most often associated with the lemma. Here are a few examples of lemma-collocate associations, with the corresponding translations:

break forth (storm) éclater
break forth (sun, water) jaillir
cascade (water, fireworks) cascade
cascade (sparks) pluie
collapse (regime, system, empire, bank, currency, economy, hopes, plan, prices, defences) s'effondrer
collapse (bike, umbrella, table, chairs) se plier

A collocate is said to be subject if it is the 'major member' of the lemma-collocate pair. Subject collocates can be:
- the noun in an adjective-noun pair (rain or burden as subject collocates of heavy)
- the subject of a verb (storm as subject collocate of break forth)
- the adjective modified by an adverb (rude as subject collocate of horribly)
- the verb modified by an adverb (enquire as subject collocate of coldly)

On the other hand, we could (broadly) say that object collocates act as 'modifiers' in their lemma-collocate pairings. Object collocates can be:
- the object of a verb (emotion as object collocate of control)
- N2 in N1 of N2 compounds, or N1 in N1N2 (river as object collocate of branch)

Heads (listing tag h) are a particular kind of subject collocates applying to nouns. Heads represent N2 in N1N2 compounds, where N1 is the lemma and receives the corresponding translation strictly as modifier in N1N2. Note that in such cases N1 is listed as adjective (a) by LKP. Here are a few examples of heads associated with abortion, and the corresponding translations:

abortion [law, debate] sur l'avortement
abortion [rights] à l'avortement
abortion [pill] abortif/-ive

Field labels (listing tag l) point out the particular domain(s) in which the lemma is used or receives a specific translation. The 268 different field labels of DEFIDIC are listed in Appendix D.

The environment field (listing tag e) lists the prepositions typically used with the lemma, or the typical clausal constructions that follow it, together with their French translation. A few examples of environment fields in context:

bribe (with avec; to do de faire) soudoyer
bridge (over sur; across au-dessus de) pont
broadcast (to à) diffuser

Cross-references (listing tag x) and gothere fields (listing tag gt) refer to some other place (headword) in the dictionary where the user will find more information about what he/she needs. While cross-references may be infinitely vague (they basically mean 'go have a look at that word, you'll find something that might interest you'), gothere fields refer more specifically to a synonym of the queried lemma.

The expand field (listing tag xp) contains the full forms of acronyms and abbreviations (American Automobile Association is 'expand' for AAA, for instance).

And finally, the origin field (no listing tag) tells you in which dictionary a translation was found. An origin marker is always present in the LKP translation output, where it is listed after all other items of information. There are three possible 'origins':
- 'o' stands for Oxford-Hachette;
- 'c' stands for Collins-Robert;
- 'm' (merged) is used for entries containing elements from both dictionaries.


3. Querying DEFIDIC with LKP

LKP opens with the following 'initial prompt':

    Enter lemma or keyword(s):

There are three different ways to access lemmas and their translations: either by typing in the lemma as such, by entering one or more keywords that give access to a list of lemmas, or - at a later stage - by entering a substring to be looked for within a pre-determined list of lemmas.


3.1 Basic query: lemma search

The simplest way to use LKP is to type in directly the lemma you want to get translated. You can enter any lemma present in the dictionary, even complex example sentences such as this horse is lame in one leg (under lame in CR) or he acted the part of the perfect host (under act in OH). Don't forget that example sentences do not start with an upper case letter, and do not end with a period! Entering example sentences is something you will seldom or never have to do, but it is useful to know that anything that is in the dictionary can be retrieved - this is more interesting in the case of 'canonical' multi-word lexemes such as give up or in accordance with.

LKP answers a lemma query by listing all possible translations of that lemma, together with their relevant metalinguistic information and sorted alphabetically on parts of speech (light is translated first as a, then n, then vi, then vt). Note that display order within the same part of speech group is arbitrary, and reflects neither the organization of original entries nor the frequency of individual translations in everyday use.

Translations are listed by groups of 10 per screen. So for instance the first screen answering the query light:

Lemma = light

1 de lampe
  [a,[h,switch,shade,socket],o]
2 clair
  [a,[i,bright],[s,evening,room,house],m]
3 enjoué
  [a,[i,cheerful],[s,mood,laugh],o]
4 léger/-ère
  [a,[i,delicate],[s,knock,tap,footsteps],o]
5 facile
  [a,[i,easy],[l,_fig],[s,work,task],c]
6 allégé, light
  [a,[i,low-fat],[l,culin],[s,product],o]
7 léger # léger/-ère
  [a,[i,not heavy],[s,parcel,weapon,clothes,sleep,meal,wine,soil,material,substance,mist,snow,wind,clothing,plane,beer,cake],m]
8 pas sérieux/-ieuse
  [a,[i,not important],[s,affair],o]
9 léger/-ère
  [a,[i,not intellectually demanding],[s,music,verse],o]
10 léger # léger/-ère
  [a,[i,not severe],[l,fig,gen],[s,play,music,breeze,punishment,shower,damage,sentence],m]

Strike any key to scroll down the translation list, 'y' to scroll up or 'x' to go back to the initial prompt. The complementary information appears in the form of a list ([...]) below the translation it is attached to. The first element of that list is the part of speech (x if none was specified), followed optionally by sub-lists containing specific types of metalinguistic information and beginning with the relevant listing tags (cf. above). Consider the information list provided for light as allégé, light:

       [a,[i,low-fat],[l,culin],[s,product],o]

It tells you that light translated as allégé, light
- is an adjective (a);
- means 'low-fat' (indicator = low-fat);
- is a term used in cooking (label = culin);
- is typically said of a product (subject collocate = product);
- was found in Oxford-Hachette (origin = o).
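
If you save such information lists to a file and want to process them automatically, the reading described above can be sketched in a few lines of Python. This is only an illustration, not LKP code: the function name parse_info is ours, and we assume that sub-lists are never nested and that individual items contain neither commas nor square brackets.

```python
import re

# Origin markers as described in section 2.2.
ORIGINS = {"o": "Oxford-Hachette", "c": "Collins-Robert", "m": "merged"}


def parse_info(info):
    """Turn '[a,[i,low-fat],[l,culin],[s,product],o]' into a dict."""
    text = info.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("not an information list: %r" % info)
    inner = text[1:-1]
    # The first element is the part of speech, the last one the origin.
    pos, _, rest = inner.partition(",")
    body, _, origin = rest.rpartition(",")
    fields = {}
    # Each remaining sub-list starts with a listing tag (i, l, s, o, h...).
    for sub in re.findall(r"\[([^\[\]]*)\]", body):
        tag, *values = [item.strip() for item in sub.split(",")]
        fields[tag] = values
    origin = origin.strip()
    return {"pos": pos.strip(), "fields": fields,
            "origin": ORIGINS.get(origin, origin)}


info = parse_info("[a,[i,low-fat],[l,culin],[s,product],o]")
print(info["pos"], info["fields"], info["origin"])
```

Because the part of speech always comes first and the origin always comes last, only the middle of the list needs to be searched for tagged sub-lists.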


3.2 Don't know or want it all: keyword search

A keyword search returns a list of all the lemmas that include the keyword(s) you enter. In order to perform a keyword search, enter a word followed by '+' or several words linked by '+' signs, for instance:

light+            all (490) lemmas containing light
light+year        all (6) lemmas containing both light and year
light+year+away   all (2) lemmas containing light, year and away

Keyword search answers two different needs:
- It allows you to retrieve an MWU whose exact form you don't know. Is it bear the brunt of, to bear the brunt of sth, or maybe just to bear the brunt? Type in brunt+ and you get it anyway.
- By listing all MWUs and examples sharing a number of keywords, it is an abundant source of language-in-use information (bear+witness will tell you all DEFIDIC has to say about the MWU to bear witness, either literally or figuratively).
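
The difference between a lemma query and a keyword query lies entirely in the '+' sign. The following Python sketch shows one way such input could be told apart; it is a guess at the logic, not LKP's actual code, and the function name parse_query and the returned labels are hypothetical. The special query bof is the one that ends a session (see 3.6).

```python
def parse_query(raw):
    """Classify what was typed at the initial prompt.

    Returns a (kind, terms) pair where kind is 'quit', 'keyword' or 'lemma'.
    """
    query = raw.strip()
    if query == "bof":
        return ("quit", [])
    if "+" in query:
        # 'light+' and 'light+year' both yield a list of keywords;
        # the empty piece after a trailing '+' is dropped.
        keywords = [part.strip() for part in query.split("+") if part.strip()]
        return ("keyword", keywords)
    # Anything else is looked up as a lemma, which may be several words long.
    return ("lemma", [query])


for typed in ["light", "light+", "light+year+away", "give up", "bof"]:
    print(repr(typed), "->", parse_query(typed))
```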

LKP answers a keyword search by listing all the matching lemmas on screen. Lemmas are displayed by batches of 23 on each screen, with roughly the same scrolling and exit keys as in the translation display mode (most keys scroll down, 'y' = up, 'x' = exit). Here however the letters a-w have their own use, namely to display the translation(s) of the corresponding lemma (lemmas are lettered from a to w on each screen). Striking the letter associated with a lemma opens the translation display mode of that lemma, in which 'x' exits not to the initial prompt but to the lemma display mode.
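
The lettering of lemmas on each screen can be pictured as follows. This Python sketch only mirrors the behaviour described above (23 lemmas per screen, lettered a to w); the function names page and screen_range are ours and say nothing about how LKP is actually implemented.

```python
import string

PAGE_SIZE = 23
LETTERS = string.ascii_lowercase[:PAGE_SIZE]  # the letters 'a' to 'w'


def page(lemmas, number):
    """Return {letter: lemma} for screen `number` (counting from 1)."""
    start = (number - 1) * PAGE_SIZE
    return dict(zip(LETTERS, lemmas[start:start + PAGE_SIZE]))


def screen_range(total, number):
    """Return the (first, last) match numbers shown on a given screen."""
    start = (number - 1) * PAGE_SIZE
    return (start + 1, min(start + PAGE_SIZE, total))


# Placeholder lemma list, just to exercise the paging.
lemmas = ["lemma %d" % i for i in range(1, 31)]
second = page(lemmas, 2)
print(second["a"])            # first lemma of the second screen
print(screen_range(340, 1))   # range covered by the first of 340 matches
```

Striking a letter then amounts to a simple dictionary look-up on the current screen, which is why the same letters can be reused from one screen to the next.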

Note that not all words can be used as keywords: most MWUs have been indexed only on the basis of their 'content' words, excluding toolwords such as prepositions, pronouns, auxiliaries (can, do, may, be, have...), determiners and conjunctions. Toolwords give access only to the lemmas that were originally recorded in their own entry: can+ gives you all lemmas that contain can and that were recorded under can in the print versions of the dictionaries, for instance. This restriction is justified partly by the need to avoid the unmanageable growth of database indexes, but also by the fact that toolwords used as keywords would otherwise introduce 'noise', i.e. return too much irrelevant information. Imagine a user entering the query a+: he/she obviously wants to know about the typical uses of the indefinite article (including an), not to be drowned under the 30,000-odd lemmas that happen to contain the word 'a'.

Keywords must always be de-capitalised, even those referring to proper names or acronyms. If you want to retrieve the real McCoy, for instance, the typical query would be mccoy+.

As a rule, only base forms (stems) can be used as keywords: no plurals, no conjugated forms, etc. The LKP indexes have been built in such a way as to give access to inflected forms via the base form, so that bear+ will also give you lemmas containing bears, bore, born etc.

As an illustration, here is the first screen of results for bear+:

Displaying matches 1-23 out of 340 for keyword list [bear]

a] bear
b] 'OK?'_'I'm bearing up'
c] -born
d] 12-bore shotgun
e] 3 sons born to her
f] a born housewife
g] a born liar
h] a born loser
i] a born poet
j] a crashing bore
k] a Parisian born and bred
l] after a long illness bravely borne
m] after much suffering patiently borne
n] all men are born equal
o] anger born of frustration
p] ant bear
q] as naked as the day he was born
r] as to the manner born
s] Australian-born
t] bear along
u] bear away
v] bear cub
w] bear down on the plank

Exceptions to the base form rule are due mainly to idiosyncrasies of the parser we use for base form reduction:

- gerunds and past participles retain their endings if they are regarded as lexicalized forms or as parts of larger units (as in United_States, for_granted or printing_press).
- similarly, some adjectives or adverbs retain their degree endings when they appear in fixed phrases (e.g. sooner or later, for better or worse).
- some plurals are left untouched, either because they belong to highly lexicalized MWUs or because they are never used in the singular (e.g. crocodile_tears, United_States, scissors). Note that our parser is not always consistent and that some plurals (like scissors) appear sometimes in their base form, sometimes not. Lexicalized MWUs such as crocodile_tears or United_States can still be accessed via any one of their constituents (crocodile+ will also return lemmas containing crocodile tears), but these constituents are simply not reduced to base form (so that tear+ will not return crocodile tears, while tears+ will).
- tense variants of modal auxiliaries, such as could and might, are left intact.

Apart from these debatable choices, you should keep in mind that the parser (like all parsers) sometimes makes plain mistakes, as in the lemmas d and j above. These two lemmas are actually out of place, since the base form of bore in both contexts is not bear but bore.
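
The way keywords, base forms and toolwords interact can be illustrated with a toy index in Python. Everything here is a simplified assumption: the base-form table and the toolword list are tiny hand-made stand-ins for the parser's output, toolwords are simply skipped (whereas LKP still indexes them for lemmas recorded under their own entry), and the names build_index and keyword_search are ours.

```python
from collections import defaultdict

# Hypothetical base-form table; 'tears' is deliberately left unreduced,
# as happens with lexicalized MWUs such as crocodile_tears.
BASE_FORMS = {"bears": "bear", "bore": "bear", "born": "bear", "borne": "bear"}

# Hypothetical (and very incomplete) list of toolwords left out of the index.
TOOLWORDS = {"a", "the", "to", "of", "as", "sth"}


def build_index(lemmas):
    """Map each base form to the set of lemmas that contain it."""
    index = defaultdict(set)
    for lemma in lemmas:
        for word in lemma.lower().split():
            if word in TOOLWORDS:
                continue
            index[BASE_FORMS.get(word, word)].add(lemma)
    return index


def keyword_search(index, keywords):
    """Return the lemmas that contain every keyword (set intersection)."""
    sets = [index.get(key, set()) for key in keywords]
    return sorted(set.intersection(*sets)) if sets else []


LEMMAS = ["bear cub", "a born liar", "bear away",
          "to bear the brunt of sth", "crocodile tears"]
index = build_index(LEMMAS)

print(keyword_search(index, ["bear"]))           # bear+
print(keyword_search(index, ["bear", "brunt"]))  # bear+brunt
print(keyword_search(index, ["tear"]))           # tear+
print(keyword_search(index, ["tears"]))          # tears+
```

In this sketch bear+ also reaches a born liar because born is indexed under bear, while tear+ finds nothing because tears was never reduced to its base form.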

3.3 More restrictive: substring search

After reaching the lemma display mode, you can further restrict the size of your lemma sample by performing a substring search. Such a search returns only those lemmas containing the exact sequence of characters (including blank spaces) you have entered. To perform a substring search, strike '+' when in lemma display mode. LKP answers with the following prompt:

 String search: enter key string... 

Note that the 'key string' must not be made up of base forms. LKP returns only the lemmas that match the key string character for character, so that a base form query will ignore inflected forms. Conversely, an inflected form returns only the same inflected form, allowing you, for instance, to restrict your search to plurals - or to the negative uses of can, by entering can't or cannot.

Substring search is particularly useful for finding phrasal verbs, since their particles cannot be used as keywords. Imagine you want to know everything about put up with. Entering the lemma query put up with will give you all the translations of the verb itself, but no examples of its use in context. The only way to access the lemmas containing put up with is to perform a keyword search on put (put+) and then a substring search on ' up with' (not forgetting to insert a blank space before up, though that is not so vital in this case).
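
In programming terms, a substring search is nothing more than a character-for-character filter applied to the lemma list already on display. The Python sketch below illustrates this; the lemma list is invented for the example and the function name substring_search is ours.

```python
def substring_search(lemmas, key):
    """Keep only the lemmas containing `key` exactly, blanks included."""
    return [lemma for lemma in lemmas if key in lemma]


# Invented sample of what a keyword search on put+ might return.
put_lemmas = ["put", "put off", "put up a fight", "put up with",
              "to put up with sb", "I can't put up with it"]

print(substring_search(put_lemmas, " up with"))

# No base-form reduction takes place: 'bear' does not match 'born'.
print(substring_search(["bear cub", "a born liar"], "bear"))
```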

3.4 Saving results to a file

While in translation or lemma display mode, you can choose to save the results of your query to a file for later analysis. It is only possible to save whole lists, be they lists of translations or of lemmas. Press '!' to perform output redirection, and LKP will ask you either
a) to enter the name of the file you want to become your current output file;
b) to confirm that you still wish to use the current output file.

You can keep two output files open simultaneously, one for lemma lists, one for translation lists. The first output file you create is used by default for both types of lists, but you can change that setting at any time.

The file names you enter must comply with the following restrictions:
- No extension. The extension '.out' is added automatically, so that entering the name lemma will in fact open/create a file called lemma.out
- No drive or path specification. Most LKP versions redirect output to pre-determined directories, such as c:\lkp or a:\ (don't forget to bring a floppy!).
- No more than eight characters - 'long' file names are not supported.
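
These three restrictions can be summed up in a short Python check. It is only a sketch of the rules as stated above: the function name output_file_name is hypothetical, and the set of characters treated as drive or path markers (backslash, slash and colon) is our assumption.

```python
def output_file_name(name):
    """Validate a user-supplied name and return the file name actually used."""
    if "." in name:
        raise ValueError("no extension allowed: '.out' is added automatically")
    if any(ch in name for ch in "\\/:"):
        raise ValueError("no drive or path specification allowed")
    if not 1 <= len(name) <= 8:
        raise ValueError("file names must be one to eight characters long")
    return name + ".out"


print(output_file_name("lemma"))

for bad in ["lemma.txt", "c:\\lkp\\lemma", "verylongname"]:
    try:
        output_file_name(bad)
    except ValueError as err:
        print(repr(bad), "rejected:", err)
```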

<!> File users beware: entering the name of an output file for the first time always creates it, so that any file bearing the same name in your output directory will be deleted. Within one LKP session, however, a newly created file can be opened and closed several times without loss of data.

3.5 To get the hang of it: a few query suggestions

parser, or any word in this user's guide
get+hang
father+son and mother+daughter
bear+resemblance
resemblance+
a+
for+, with substring search on for all
query
heavy
can+, with substring search on can't
bear+, with substring search on born
god+
rich+poor
old+young
brunt+
put+, with substring search on off

3.6 Terminating an LKP session

To put+end your LKP work session, just express your utter lack of interest by entering bof as query. The program expects this to happen sooner+later, so you can rest+assure that it won't harbour+grudge against you for it.




APPENDIX A - Hot Keys


1. Scrolling and display mode



                      Translation          Lemma display        Lemma display mode
                      display mode         mode                 (after substring search)

Scroll down           any unassigned key   any unassigned key   any unassigned key
Scroll up             y                    y                    y
Go back               x                    x                    x
Display translations  not available        a-w                  a-w
Substring search      not available        +                    not available
Output to file        !                    !                    !


2. Lemma and keyword search

lemma          returns all translations of lemma (lemma can be several words)
word+          returns all lemmas containing word (also inflected forms)
word1+word2    returns all lemmas containing both word1 and word2


3. Substring search

string         returns list of lemmas containing string. Only as a complement to keyword search


4. Terminating an LKP session

bof           entering bof as query puts an end to the current session





APPENDIX B - Listing Tags

e   environment
g   gloss
gt  gothere
i   indicator
h   head
l   field label
o   object collocate
s   subject collocate
x   cross-reference
xp  full form




APPENDIX C - Main Parts of Speech

a         adjective
av        adverb
c         compound
cj        conjunction
dt        determiner
excl      exclamation
interj    interjection
n         noun
particle  particle
poss_adj  possessive adjective
pr        preposition
pref      prefix
pro       pronoun
quantif   quantifier
suf       suffix
v         verb
vi        intransitive verb
vt        transitive verb
vti       transitive or intransitive verb
x         unspecified



APPENDIX D - Field Labels

accounting        accounting
acoustics         acoustics
admin             administration
advert            advertising
aerosp            aerospace
agr               agriculture
airforce          air force
anat              anatomy
anthrop           anthropology
antiquity         antiquity
archeol           archaeology
archery           archery
archit            architecture
army              army
art               art
astrol            astrology
astron            astronomy
athletics         athletics
audio             audio
audiorecording    audio recording
aut               automobile
aviat             aviation
bacteriology      bacteriology
ballet            ballet
banking           banking
baseball          baseball
basketball        basketball
betting           betting
bible             Bible
billiards         billiards
bio               biology
biochem           biochemistry
boatracing        boat racing
bookbinding       book binding
bookkeeping       book keeping
bookmaking        bookmaking
bot               botany
bowling           bowling
bowls             bowls
boxing            boxing
brewing           brewing
bridge            bridge
business          business
butchery          butchery
camping           camping
cards             card games
carpentry         carpentry
casino            casino
cb                CB (radio)
chem              chemistry
chess             chess
cine              cinema
climbing          climbing
cockfighting      cockfighting
cockney_sl        Cockney slang
comm              commerce
commonwealth      Commonwealth
comput            computer science
confectionery     confectionery
constr            construction
conveyancing      conveyancing
cosmetics         cosmetics
cribbage          cribbage
cricket           cricket
croquet           croquet
culin             cooking
customs           customs
cybernetics       cybernetics
cycling           cycling
dancing           dancing
darts             darts
demography        demography
dentistry         dentistry
dice              dice games
diplomacy         diplomacy
dominoes          dominoes
draughts          draughts
dress             dress
drug_sl           drug slang
drugs             drugs
ecol              ecology
econ              economics
educ              education
elec              electricity
electron          electronics
embroidery        embroidery
engineering       engineering
engraving         engraving
equitation        equitation
ethnology         ethnology
europe            Europe
fashion           fashion
fencing           fencing
fig               figurative use
fin               finance
fishing           fishing
freemasonry       freemasonry
gambling          gambling
games             games
gangster_sl       gangster slang
gardening         gardening
gen               general use
genetics          genetics
geog              geography
geol              geology
geom              geometry
glassmaking       glassmaking
glassware         glassware
golf              golf
govt              government
gram              grammar
gym               gymnastics
hairdressing      hairdressing
handball          handball
handwriting       handwriting
herald            heraldry
hist              history
hockey            hockey
horseracing       horseracing
horseriding       horseriding
hort              horticulture
hotel             hostelry, catering
hunting           hunting
hydraulics        hydraulics
hypnosis          hypnosis
ind               industry
insurance         insurance
jazz              jazz
job               employment
judo              judo
jur               judicial matters
knitting          knitting
lacrosse          lacrosse
law               law
library           library science
ling              linguistics
lit               literal meaning
literat           literature
localgovt         local government
logic             logic
marine            marine
marines_sl        marine slang
marketing         marketing
math              mathematics
meas              measurements
measure           measure
mech              mechanics
mechengineering   mechanical engineering
med               medicine
merchantnavy      merchant navy
metal             metallurgy
metalworking      metalworking
meteo             meteorology
mgmt              management
mil               military
military_sl       military slang
min               mining
miner             mineralogy
morphology        morphology
morse             morse
motor_racing      motor racing
mountaineering    mountaineering
mus               music
myth              mythology
naut              nautical matters
navy              navy
nucl_phys         nuclear physics
opt               optics
orn               ornithology
painting          painting
papermaking       papermaking
parachuting       parachuting
parl              parliament
pharm             pharmacology
philat            philately
philo             philosophy
phon              phonetics
phot              photography
phys              physics
physiol           physiology
plumbing          plumbing
poetry            poetry
poker             poker
pol               politics
police            police
police_sl         police slang
post              post
pottery           pottery
press             press
printing          printing
prison            prison
prisoner_sl       prisoner slang
prov              proverb
psych             psychology
publicity         publicity
publishing        publishing
punctuation       punctuation
r                 registered trademark
racing            racing
rad               radio
radar             radar
rail              railways
recording         recording
rel               religion
rhetoric          rhetoric
roulette          roulette
rowing            rowing
rugby             rugby
sailing           sailing
schoolboy_sl      schoolboy slang
sci               science
scol              school
scouting          scouting
sculp             sculpture
semantics         semantics
sewing            sewing
shipbuilding      shipbuilding
shooting          shooting
showjumping       showjumping
skating           skating
ski               skiing
snooker           snooker
soc               society
soccer            soccer
sociol            sociology
space             space
spec              specialist's term
spinning          spinning
spiritualism      spiritualism
sport             sport
spy_sl            spy slang
statistics        statistics
stex              stock exchange
student_sl        student slang
surgery           surgery
surveying         surveying
swimming          swimming
tapestry          tapestry
tax               taxation
taxidermy         taxidermy
tech              technology
telec             telecommunications
telegraphy        telegraphy
tennis            tennis
tex               textile
theat             theatre
theol             theology
tourism           tourism
transp            transports
travel            travelling
turf              horse racing
tv                television
typ               typography
typing            typing
underground       underground
univ              university
ussr              USSR
vet               veterinary medicine
video             video
volleyball        volleyball
weightlifting     weight lifting
whist             whist
windsurfing       windsurfing
wine              oenology
wrestling         wrestling
yoga              yoga
zool              zoology