From text to dictionary

A. Michiels, Ulg, 2000-2001

 

Introduction

Two ways of looking at the issues raised by the ‘journey’ from text to dictionary:

 

- from the user’s point of view (confronted with a text, he turns to the dictionary for help, either to understand or to translate - monolingual vs. bilingual). The issue is to find out the meaning or the translation of an item (single word lemma or mwu, multi-word unit) that fits the context (linguistic and extra-linguistic)

 

- from the dictionary-maker’s point of view (the lexicographer’s): how to distil from text (corpora, massaged into KWIC - Key Word in Context - lines, etc.) a range of readings (meanings) and/or a range of translations (target language equivalents)

Why is there a problem?

1. typographical definition of the word (any sequence of letters flanked by blanks or punctuation signs) is different from intuitive AND lexicographical definitions:

he is writing a letter (write - letter)
pommes de terre (pomme de terre)
to distinguish structures (to write a letter) from complex lexical items (mwu’s - multi-word units) we have to manipulate the strings (morphosyntactic manipulations) and enquire whether they admit lexical variation:

pommes de terre (* pommes de terres, de la terre, en terre, de mer,...)
write a letter (write some letters, long letters, read letters, read books,...)
Variation is always permissible; the point is to see whether it’s meaning-preserving for the bits that are left untouched. If meaning is not preserved, we conclude that we have to deal with a mwu. But what does meaning preservation mean?
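A minimal sketch of the typographical definition, assuming that "letters" means any Unicode letter and that everything else counts as a separator. The function name typographic_words is ours, not part of any tool mentioned in these notes.

```python
import re

def typographic_words(text: str) -> list[str]:
    """Return every maximal run of letters; blanks, digits and punctuation separate."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore,
    # i.e. a letter (accented letters included).
    return re.findall(r"[^\W\d_]+", text)

print(typographic_words("he is writing a letter"))
print(typographic_words("pommes de terre"))
```

The tokenizer returns three typographical words for pommes de terre; nothing in its output tells us that the lexicographer wants a single item there, nor that writing should be looked up under write.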

2. the dictionary does not store all possible wordforms

- Variants are due to inflectional morphology (write, writes, writing, wrote, written). New lexical items are generatable (!) by means of productive derivational or compositional morphology:

nation - nationalise - denationalise - denationalisation - redenationalisation - ...
Productive derivational morphology: -ize, re-, de-, un- (as opposed to -ance)
The dictionary user must be able to go back to either morphemes or at least morpheme groupings that are deemed frequent (typical...) enough to be included in the dictionary
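How a user (or a program) might go back from an unlisted wordform to a listed base can be sketched as naive affix stripping. Everything here is an assumption made for illustration: the one-word LEXICON, the affix lists, and the "restore a final e" spelling rule. The result is a flat list of morphemes, not a structural analysis.

```python
PREFIXES = ("re", "de", "un")
SUFFIXES = ("ation", "ise", "ize", "al")
LEXICON = {"nation"}  # hypothetical: the only base the toy dictionary stores

def analyse(word):
    """Return the morphemes of word if it reduces to a lexicon entry, else None."""
    if word in LEXICON:
        return [word]
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix):
            rest = analyse(word[len(prefix):])
            if rest:
                return [prefix + "-"] + rest
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            # Try the bare stem, then the stem with a restored final e
            # (nationalis + ation -> nationalise).
            for candidate in (stem, stem + "e"):
                rest = analyse(candidate)
                if rest:
                    return rest + ["-" + suffix]
    return None

print(analyse("redenationalisation"))
print(analyse("writes"))
```

Irregular inflection (wrote, written) is out of reach for this kind of stripping, which is one reason why dictionaries list such forms.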

3. A MUCH BIGGER PROBLEM

a) meanings and translations are not given, they are not something that we can simply observe and then record.
A first piece of evidence : the number of meanings (and translations) for a given item depends on dictionary size (granularity).
A more serious problem: there is often no way of regrouping the more precise meanings (definitions) under the headings that would be provided by the more general ones, i.e. we do not have:

BIGDIC
X -->   A (1,2,3),
        B (1,2),
        C

and

SMALLDIC
X -->   I,
        II

where I : A (1,2,3), B (1,2)
      II: C

Sometimes the relations are far from simple inclusion; overlaps occur; sometimes the relations are simply undecidable.
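One way to make the inclusion / overlap distinction concrete, assuming (purely for the sake of the sketch) that each sense can be represented by the set of corpus citations a lexicographer files under it. The citation numbers and sense labels below are invented.

```python
def relate(fine, coarse):
    """Classify each fine-grained sense against the coarse-grained ones."""
    report = {}
    for sense, citations in fine.items():
        # Coarse senses that share at least one citation with this fine sense.
        hits = [c for c, c_cit in coarse.items() if citations & c_cit]
        if len(hits) == 1 and citations <= coarse[hits[0]]:
            report[sense] = ("included", hits)
        elif hits:
            report[sense] = ("overlap", hits)
        else:
            report[sense] = ("unmatched", hits)
    return report

# Hypothetical data: citation ids filed under each sense.
SMALLDIC = {"I": {1, 2, 3, 4, 5}, "II": {6, 7}}
BIGDIC = {"A1": {1, 2}, "B1": {4, 5}, "C": {5, 6, 7}}

for sense, verdict in relate(BIGDIC, SMALLDIC).items():
    print(sense, verdict)
```

Here A1 and B1 nest under I, but C straddles I and II: the tidy tree drawn above cannot be built.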

Word senses do not exist; they are a construct, an artefact (more on this below)

4. A problem specific to bilingual lexicography

Polysemy does not always run on the same lines in the target and source languages. The bilingual dictionary cuts up the semantic space covered by the source item according to the lexical structure of the target language.

 

Illustration: Polysemy in the source does not necessarily run parallel with polysemy in the target
It doesn’t : e.g. dent

Monolingually

oed

4. A hollow or impression in a surface, such as is made by a blow with a sharp or edged instrument; an indentation, dint.

1565 Jewel Repl. Harding Wks. (1611) 425 We haue thrust our fingers into the dents of his nailes.

1612 Brinsley Lud. Lit. 16 Mark it with a dent with the nayle, or a pricke with a pen.

1620 Shelton Quix. iv. xix. II. 233 O the most noble and obedient Squire that ever had Sword at a Girdle..or Dent in a Nose.

1691 T. H[ale] Acc. New Invent. p. viii, Taking his Hammer, he again beat out the dent.

1722 Chamberlayne in Phil. Trans. XXXII. 98 The fat Particles had such a Pinch, or Dent, in them, as I have shewn, that there were in the Globules of Flower of Wheat.

1848 Thoreau Maine W. i. (1867) 51 The rocks..were covered with the dents made by the spikes in the lumberers’ boots.

1857 Geo. Eliot Scenes Cler. Life, Janet’s Repent. ii, Dents and disfigurements in an old family tankard.

 

cide

dent (v,n) (to make) a small hollow mark in the surface of something caused by pressure or being hit

Somebody made a large dent in the back of my car while it was parked outside the house.

I dropped a hammer on the floor, and it dented the floorboard

to make or put a dent in an amount, esp. of money, is to reduce it : Buying a new television has made a big dent in our savings

Tax increases this year have put a huge dent in the company’s profits.

(fig) His confidence/ego/pride was dented (=reduced) when he didn’t get into the football team.

 

Note: the cide definition says small, yet the first example has large: a large small hollow mark?
We do not read definitions as we read ‘normal’ language.

Bilingually

lkp

Translations for lemma = dent

1 bosse {f}, bosselure {f} # bosse {f}

       [n,[i,in metal],[l,metal,gen],m]

2 entaille {f}

       [n,[i,in wood],m]

3 bosseler, cabosser # cabosser, bosseler

       [vt,[o,car],m]

4 cabosser

       [vt,[o,hat],c]

5 entamer

       [vt,[o,pride],o]

6 entailler

       [vt,[o,wood],c]

7 faire une entaille dans

       [vt,[o,wood],o]

It does: e.g. cell

monolingually

gw (guide words)  in cide

 

cell ROOM

cell ORGANISM

cell PART

cell ELECTRICAL DEVICE

bilingually

lkp

Translations for lemma = cell

1 cellule {f}

       [n,[i,for prisoner, monk],[l,police,gen],m]

2 alvéole {m}

       [n,[i,in honeycomb],o]

3 élément {m} -de pile- # élément {m}

       [n,[l,elec,chem],m]

4 cellule {f}

       [n,[l,gen,bio,bot,phot],m]

5 cellule {f}

       [n,[l,pol],o]

6 []

       [n,[l,_police],[x,condemn/death],c]

 


 

Displaying lemmas for keyword list [cell]

 

1] cell

2] basal cell carcinoma

3] blood cell

4] cell biologist

5] cell culture

6] cell division

7] cell formation

8] cell wall

9] daughter cell

10] death cell

11] dry cell

12] generative cell

13] germ cell

14] he spent the night in the cells

15] he was removed to the cells

16] nerve cell

17] padded cell

18] passenger cell

19] photoelectric cell

20] photoelectrical cell

21] police cell

22] red blood cell

23] sickle cell anaemia

24] single-celled

25] so there we were in the same cell

26] solar cell

27] soma cell

28] the cell wall

29] the condemned cell

30] the guards walked him back to his cell

31] the prisoners were jammed into a small cell

32] they led him away to the cells

33] to form a cell

34] to lead sb to his cell

35] to saw through the bars of a cell

36] type A and B cells

37] wet cell

38] when they got him back to his cell

39] white blood cell

 

 

 

Very often the polysemy runs parallel where one language has borrowed from the other (at a time when the borrowed item was already polysemic), or they have both borrowed from a common source. Or the items have developed (sense extensions) along the same lines; this is the case when the sense extensions are standard, either through

- metonymy (based on contiguity, but metaphorically defined!!!), especially the cases of regular polysemy (lots of literature on this topic):

the village > the inhabitants of the village
place or place name > the population

The village is situated on a hill.
The town is/are likely to protest.

 

- metaphor (based on similarity):
table (piece of furniture) > table (presentation of figures, data)
(similarity is rectangular shape)

The dictionary : an all too familiar object

 

The dictionary is something we tend to take for granted. We sometimes forget that it is produced by human beings. We believe that the information types it offers, and the way it provides that information, are given once and for all. This is due to the fact that dictionary-making is based on a long and respectable tradition, and that as users we have been consulting the dictionary since we were children - we don’t remember the first time we opened a dictionary...

 

However, neither the information types, nor their contents, nor their presentation, are immutable.

New information types: the reliance on vast corpora makes it possible to include frequency as an information type. But the frequency information the user is interested in is not only lemma frequency, but also meaning frequency: to know that table is frequent is one thing, but what about its two readings, piece of furniture and data table? To give meaning-related frequency, we need disambiguated corpora. If they are disambiguated by hand, there is a lot of disagreement between coders; if by machine, the error rate is high and discrimination poor.
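The difference between lemma frequency and meaning frequency can be shown on a toy sense-tagged corpus. The (lemma, sense) pairs and the sense labels are invented; a real corpus would first have to be disambiguated, with the problems just mentioned.

```python
from collections import Counter

# Hypothetical sense-tagged corpus: one (lemma, sense label) pair per token.
TAGGED = [
    ("table", "furniture"),
    ("table", "data"),
    ("table", "furniture"),
    ("table", "furniture"),
    ("eye", "organ"),
]

def lemma_frequency(lemma):
    """Number of tokens of the lemma, whatever the reading."""
    return Counter(lem for lem, _ in TAGGED)[lemma]

def sense_shares(lemma):
    """Share of each reading among the tokens of the lemma."""
    counts = Counter(sense for lem, sense in TAGGED if lem == lemma)
    total = sum(counts.values())
    return {sense: n / total for sense, n in counts.items()}

print(lemma_frequency("table"))
print(sense_shares("table"))
```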

 

New way of presenting the information: the dictionary in book format is tied to the alphabetical order (note: not the same for all languages: cf. CH and LL in Spanish lexicography) for practical reasons. Two words that follow one another in alphabetical order may be closely linked (cup and cupboard in CIDE), or have nothing whatsoever to do with each other (cuneiform and cunnilingus in CIDE again). A computerized dictionary should be able to allow any lexeme or constituent of a multi-word unit to become the centre from which to explore the whole lexicon. From eye I should be able to immediately reach, not only eye-doctor but also ophthalmologist and all other words which are built on the Latin and Greek lexemes for eye such as ocular; see eye to eye, turn a blind eye to, and all other mwu’s that include a variant of eye (eye or eyes; eye, eyed, eyeing, etc.); eve and all other lexemes whose distance from eye can be described in terms of number and position of letters (eye and eve have both three letters, and start and end with the letter E); eye is a body-part and should lead us to other body-parts (meronymy); it is a sense organ, and it should show us the way to the other senses and sense organs, etc.

Not only that: the dictionary should be explorable, not only from its lexemes and mwu constituents, but also from the metalinguistic information it provides - does it record the agreement property for staff, cattle, police, etc. - namely, that they take plural agreement although they do not feature the inflectional plural marking for nouns? In that case, it should be able to give us a list of all such nouns when we query the code that records the specific agreement type. It is not the business of grammars to provide such lists - grammars should mention the type of discrepancy between form and agreement, illustrate it with a few examples such as the cattle are grazing in peace, the police are after him, the staff are unhappy about the new plans, and then stop. The dictionary should then be able to provide full lists, since it is supposed to record the full language (which it never does nor can do since the vocabulary of a given language is forever growing and is open to nonce creations, which may or may not use the word-building power provided by derivational morphology)  
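A sketch of what multiple access could look like on a handful of entries. The code labels ("mwu", "plural_agreement") and the three access functions are hypothetical and are not taken from any of the dictionaries discussed; inflectional variants (eyes, eyed) and etymological links (ocular) are not handled.

```python
# Hypothetical mini-lexicon: entry -> set of metalinguistic codes.
ENTRIES = {
    "eye": set(),
    "eve": set(),
    "see eye to eye": {"mwu"},
    "turn a blind eye to": {"mwu"},
    "staff": {"plural_agreement"},
    "cattle": {"plural_agreement"},
    "police": {"plural_agreement"},
}

def by_constituent(word):
    """Multi-word entries that contain the word as one of their constituents."""
    return sorted(e for e in ENTRIES if " " in e and word in e.split())

def by_code(code):
    """All entries carrying a given metalinguistic code."""
    return sorted(e for e, codes in ENTRIES.items() if code in codes)

def one_letter_neighbours(word):
    """Entries of the same length that differ from the word in exactly one position."""
    return sorted(
        e for e in ENTRIES
        if len(e) == len(word) and sum(a != b for a, b in zip(e, word)) == 1
    )

print(by_constituent("eye"))
print(by_code("plural_agreement"))
print(one_letter_neighbours("eye"))
```

The point of the design is that no access path is privileged: the alphabetical headword is just one key among several.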

An introduction to LKP

(see Nicolas Dufour’s paper: LKP: a User’s Guide: http://engdep1.philo.ulg.ac.be/michiels/lkpuser.htm)

 

An introduction to KWICs and lexicographical work on corpora

monolingual and bilingual (the alignment issue) : typicality and frequency, target audience, space constraints.
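A minimal KWIC generator, to fix ideas. It is a sketch only: tokens are split on blanks, the keyword is matched on the wordform (no lemmatization), and the window width is counted in words. The sample text reuses two CIDE examples quoted earlier.

```python
def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) triples for each hit."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively, ignoring punctuation glued to the token.
        if token.lower().strip(".,;:!?") == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

TEXT = ("Somebody made a large dent in the back of my car. "
        "Buying a new television has made a big dent in our savings.")

for left, key, right in kwic(TEXT, "dent", width=2):
    print(f"{left:>12}  {key}  {right}")
```

Sorting the lines on the right or left context is what makes recurrent patterns (a dent in + amount) visible to the lexicographer.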

 

The lexicographer’s three sources of information:

·       the lexicographical tradition

·       his own intuition as a speaker and/or translator

·       corpora and computer-based tools to explore them

 

Looking at the entry LIGHT and derivatives in monolinguals (CIDE, COBUILD) and bilinguals (OH, RC)

 

- comments on illustrations : recognizing rather than learning; artefacts and natural classes (fauna, flora) as opposed to abstract notions, actions, qualities, etc. Illustrating size and grammatical notions such as plurality, etc.

 

- comments on the GUIDEWORDS in CIDE : brightness is less frequent than light; the problem of defining the commonest words...


- comments on POS; is the POS distinction to be kept as the first to be considered in the organization of an entry or a bunch of related entries (make light of - how do we know that light is an adjective? it does not behave as one; compare umbrage in take umbrage, dint in by dint of, French fur in au fur et à mesure)

 

 


Extracting Contextual Information from Dictionaries and Corpora

From selection restrictions to multi-word units

 

Dictionaries to be explored:

 

Monolingual:

                LDOCE (Longman Dictionary of Contemporary English)

                COBUILD (Collins Cobuild English Language Dictionary)

                CIDE (Cambridge International Dictionary of English)

                LDOEI (Longman Dictionary of English Idioms)

                LDPV (Longman Dictionary of Phrasal Verbs)

                ODCIE (Oxford Dictionary of Current Idiomatic English, Vols I and II; second edition of Vol I: Oxford Dictionary of Phrasal Verbs, 1993)

Bilingual

                RC (Le Robert / Collins)

                OH (Oxford / Hachette)

                CK (Collins / Kletts)?

               

General issues to be commented on

·       what is a lexical item?

·       description of the environment is generalization from distribution statements (actual vs potential distribution; intuition vs corpus). Note that POS assignment belongs here; it sums up distributional statements about the item it is assigned to

·       multi-word units and their anchor points

·       the longest / densest match first principle

·       how can we say that behaviour is not predictable (non-compositional, distributionally aberrant) when that behaviour is part of the data that are used to define the predictable cases (is it a question of frequency?)

·       is semantic behaviour predictable from syntax? syntactic behaviour predictable from semantic make-up?
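The longest match first principle listed above can be sketched as a greedy left-to-right segmenter. The multi-word units come from the cell list shown earlier; the segmenter itself is an illustration, not the procedure of any particular tool, and it takes "longest" in the plain sense of number of words (the "densest" variant is not modelled).

```python
# Multi-word units known to the (toy) dictionary, as tuples of wordforms.
MWUS = {("blood", "cell"), ("red", "blood", "cell"), ("cell", "wall")}
MAX_LEN = max(len(m) for m in MWUS)

def segment(tokens):
    """Greedy segmentation: at each position try the longest candidate first."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            # A single token is always accepted as a fallback.
            if n == 1 or candidate in MWUS:
                out.append(" ".join(candidate))
                i += n
                break
    return out

print(segment("a red blood cell wall".split()))
```

On a red blood cell wall the greedy strategy keeps red blood cell and therefore never sees cell wall: the principle settles conflicts, it does not guarantee the right analysis.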

Types of information

There are various sources of contextual information (information on the environment a given lexical item can or must fit in) to be found associated with lexical entries in a dictionary. We shall first consider monolingual dictionaries.

 

POS: the part of speech offers a broad specification of the item’s environment. To say that something is a noun is to say that it can function as head of a noun phrase, for instance. It can be modified by an adjective, it can determine number agreement in the vp if it is used as subject of that vp, etc. What should be borne in mind is that the POS information does not make sense if it does not relate to an accompanying grammar. Lexicon and grammar cannot be devised independently of each other. We tend to forget this for POS, because POS looks uncontroversial. Any grammar will have to recognise nouns. True, but what it does with the concept is not so uncontroversial. And what it does with the concept is precisely how it defines it. Categories are squishy (cf. Ross 1973).

 

Grammatical information (grammatical codes, as in LDOCE and COBUILD). The prefatory material attempts to explain the way the codes should be used. It is obvious that they have to relate to a presupposed grammar. This grammar is uncontroversial and surfacy, perhaps. But the codes refer to a canonical order (deep functions), and not necessarily to what is found in text. An item coded as transitive verb can appear without its object (in passive S’s, or in case of object ellipsis, either for definite or for indefinite nps).

 

Definition. COBUILD-style definitions attempt in their left-hand part to introduce the definiendum in a typical context, and research projects (cf. Sinclair et al. 1995) have tried to cash in on this new definition style to retrieve contextual information. The Firthian principle (meaning is use; words shall be known by the company they keep). The question arises as to whether we have to do with constraints or preferences (grammatical codes are usually interpreted as embodying constraints).

 

Digression on headness: syntactic versus thesauric heads in definitions; controlled vs free defining vocabulary (trade-off lexical / syntactic simplicity; depth of thesauric hierarchies retrievable from dictionary definitions; the problem of salience and prototypes - cf. Rosch et al.); combining the use of a dictionary and a thesaurus to enrich both

 

Definitions are an obvious source for paradigmatic information (genus words in the definiens), but they can also be used to retrieve syntagmatic information, i.e. contextual information. This is also true for traditional, non-COBUILD-like defs. The definition will include phrases that reflect the environment of the word. This reflection is not a reference to textual bits (words), but conceptual units. If the items in the definition are taken as heads in a thesauric classification, the match between textual data (the word as used in context) and the conceptual data offered by the definition can help solve the meaning assignment (disambiguation) problem. See Lesk in ref.
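A much simplified variant of Lesk-style overlap scoring, with one addition: words are first mapped to a thesauric head when one is known, so that the match is conceptual rather than purely textual. The glosses paraphrase the CIDE entry for dent quoted earlier; the HEADS table, the stop list and the sense labels are invented for the sketch.

```python
STOP = {"a", "the", "of", "in", "to", "is", "or", "by", "it", "and", "on", "i"}

# Hypothetical thesauric heads for a few context words.
HEADS = {"savings": "money", "profits": "money", "floorboard": "surface"}

SENSES = {
    "dent": {
        "hollow": "a small hollow mark in the surface of something "
                  "caused by pressure or being hit",
        "reduction": "to make or put a dent in an amount especially "
                     "of money is to reduce it",
    }
}

def bag(text, target):
    """Content words of text, replaced by their thesauric head when known."""
    words = set(text.lower().split()) - STOP - {target}
    return {HEADS.get(w, w) for w in words}

def lesk(target, context):
    """Pick the sense whose gloss shares most items with the context."""
    ctx = bag(context, target)
    scores = {s: len(ctx & bag(g, target)) for s, g in SENSES[target].items()}
    return max(scores, key=scores.get), scores

print(lesk("dent", "buying a new television has made a big dent in our savings"))
print(lesk("dent", "i dropped a hammer on the floor and it dented the floorboard"))
```

Without the HEADS mapping both contexts score zero on both glosses: savings and money, floorboard and surface only meet at the conceptual level.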

 

Definitions are meant to abstract away from particular uses to concentrate on the most salient features (not necessarily necessary features). Does the inclusion of contextual information within definition schemata represent the best way to tackle the problem of that type of information? What is pedagogically suitable may not be linguistically adequate.

 

Associated phrases. It does not really matter whether these appear associated with a given item, or are raised to entry status. The distinction is not theoretical, but one of ease of access. Similarly, it does not matter which item of the phrase we use as anchor point (where we house it). It is probably most often the least frequent item of the phrase, or it may be the object of the function in the lexical function paradigm, rather than the exponent of the function itself. Porter plainte would then be housed under plainte on both counts: because plainte is less frequent than porter and because we have an f(x) where f is a lexical function, porter its exponent (collocate) and plainte the x, i.e. the object (base).

 

Since in a true ldb format the access key is a multiple one the problem is immaterial. What is relevant is the strength and nature of the link.
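The "least frequent item" heuristic for choosing an anchor point can be sketched in a few lines. The frequency figures and the list of function words are invented; words missing from the frequency table are treated as rarest.

```python
# Hypothetical corpus frequencies.
FREQ = {"porter": 900, "déposer": 120, "plainte": 40}
FUNCTION_WORDS = {"un", "une", "le", "la", "de", "du"}

def anchor(phrase):
    """Return the least frequent content word of the phrase."""
    content = [w for w in phrase.split() if w not in FUNCTION_WORDS]
    return min(content, key=lambda w: FREQ.get(w, 0))

print(anchor("porter plainte"))
print(anchor("déposer une plainte"))
```

With a multiple access key the choice only matters for the printed book; the function would then merely decide where the full description is printed and where a cross-reference suffices.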

Most phrases have a description attached. The description has either or both of two sides:

 

meaning: the meaning of the whole phrase is taken not to be a compositional function of its component parts (the lexical function paradigm is a way of specifying the links that can bind the two elements; its effect is to tone down the contribution made by the lexical function exponent - this raises problems when the lexical function paradigm is used to solve the translational problems raised by mwu’s -cf. the support verb trend)

Compositionality can only be defined with respect to pre-existing senses. But senses are distinguished on the basis of distributional phenomena, among which the existence of the phrase in question. X and Y should be described independently of XY, but XY is a context for X and XY is a context for Y. The argument can therefore be fully circular. Cf. appeal to non-compositionality in Nunberg et al. 1994 and criticism in A. Michiels, attached paper.

 

fixity: fixity is resistance to manipulations that the structural description of the phrase would predict are applicable. The problem is first that the phrase is entered as a string, not a structural description (which cannot be offered independently of selection of a grammatical framework, however uncontroversial). Second, are the restrictions to be stated negatively or positively? (we assume the string is fixed except for ...; or we assume the string is regular except that it is resistant to such and such a manipulation). Can we classify manipulations (deletion, movement, etc.)? Is there a frozenness hierarchy à la Fraser? What about lexical manipulations (paradigmatic shifts and internal modifications)?

 

Digression: internal vs external modification: the proverbial bucket vs the fatal bucket vs a red bucket

 

Examples. Either genuine or coined. If genuine, generally edited to gain generalization (example schemata rather than examples). Example schemata are like phrases. Examples should illustrate, not convey new information on the environment. They should give us typicality at most. Preferences and constraints should be recorded elsewhere. Digression on using dictionary examples as text corpus. Tagged examples (we know how to tag the entry word in the example, in terms of meaning assignment etc. (bootstrap procedure)).

 

How to interpret the environment?

The main distinction is between words as words and words as thesauric heads.

Words as words: even so, it can be words as types or words as tokens. Cf. LDOEI making an attempt at recording the possibility of  morphological (inflectional) variation. Also consider derivational variation. Le dépôt d’une plainte vs * le port d’une plainte.

Words as thesauric heads. At the top, we have place holders that do not in any way specify the slot. We notice the lack of a marker for something or somebody, which causes something to be used instead as a space-saving device (?). Give something one’s attention in COBUILD, where something is something or somebody, as appears from the way the lexicographers themselves use the phrase.

 

Then we have place holders that give the broadest type of semantic feature: sth, sby. They are just a little bit more precise than the NP slot recoverable from the transitivity code, for instance. Below this, we have references to thesauric heads without having the thesaurus! The thesaurus ought to be retrievable from the dictionary (filiation through head of definiendum). We have references like fruit, seat, etc.

 

We often have typical heads, or incomplete head lists, marked or not marked as such (etc. is a marker of incompleteness; its presence or absence does not seem to make much difference). If we have plums, cherries, etc. then we are tempted to move one up, and replace by fruit. But the lexicographer did not write fruit; does he have in mind pit-bearing fruit as opposed to other types?
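"Moving one up" amounts to looking for the lowest head that dominates all the listed items. A sketch on an invented fragment of a thesaurus (the PARENT table is hypothetical, including the intermediate node pit-bearing fruit):

```python
# Hypothetical thesaurus fragment: child -> parent.
PARENT = {
    "plum": "pit-bearing fruit",
    "cherry": "pit-bearing fruit",
    "pit-bearing fruit": "fruit",
    "apple": "fruit",
    "fruit": "food",
}

def ancestors(node):
    """The node followed by all the heads above it, lowest first."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_head(words):
    """Lowest node that dominates (or equals) every word in the list."""
    chains = [ancestors(w) for w in words]
    for candidate in chains[0]:
        if all(candidate in chain for chain in chains[1:]):
            return candidate
    return None

print(lowest_common_head(["plum", "cherry"]))
print(lowest_common_head(["plum", "cherry", "apple"]))
```

The answer depends entirely on which intermediate nodes the thesaurus happens to have, which is exactly the lexicographer's dilemma with plums, cherries, etc.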

 

The limiting case is heads that do not point to anything besides their own referents, i.e. are no longer heads, but the leaves of a thesauric organization. Even here, the distinction between words as words and words as heads is crucial, in that the former will not admit a synonym, whereas the latter should not object. Déposer une plainte (une réclamation). Porter plainte (* porter réclamation)

 

Note that thesauric heads and other types of information on the phrase’s environment are always assigned to the phrase as structure, not as string, unless otherwise stated. In déposer une plainte, the object can be manipulated. In porter plainte, it can’t.

 

Here, we should investigate the surface manifestations of structural frozenness. For instance, in French, we have the zero article, hardly to be found anywhere else in such a structural pattern. We could also use such surface clues for the automatic retrieval of mwu’s from corpora.
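As an illustration of such a surface clue, the sketch below looks for a verb immediately followed by a noun, with no determiner in between, in POS-tagged French text. The tag set (PRO, AUX, V, DET, N) and the tagged sentences are assumptions; a real run would need a tagger and would return many false candidates to be filtered by hand or by frequency.

```python
def bare_noun_objects(tagged):
    """Return (verb, noun) pairs where the noun directly follows the verb."""
    return [
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if t1 == "V" and t2 == "N"
    ]

# Hypothetical tagger output.
S1 = [("il", "PRO"), ("a", "AUX"), ("porté", "V"), ("plainte", "N")]
S2 = [("elle", "PRO"), ("a", "AUX"), ("déposé", "V"), ("une", "DET"), ("plainte", "N")]

print(bare_noun_objects(S1))
print(bare_noun_objects(S2))
```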

 

Meaning is seen as giving rise to the meaning assignment problem, but that is an oversimplification. Meaning gives rise to the meaning construction problem. Each variation in the object of a transitive verb causes a variation in the meaning of the verb, or very nearly so. Compare

He forgot his wounds / his sister / his homework.

If we decide to have a reading of forget as forget to bring, we can account for the ambiguity of

He has forgotten his poem (say in a poetry class).

But is the problem one of ambiguity and meaning assignment (cf. qualia structure of nouns - Pustejovsky 1995)?

 

The bilingual perspective.

The semantic space of the source language is cut up according to the needs of the target language. A monolingual distinction in the source is not made if it runs parallel to the same distinction in the target. Polysemy is introduced in the source if there are several target translations for a given source item.

A bilingual dictionary gives no unbiased information about distinctions relevant for the source. But trying to put two languages into relation sheds light on both.

The problem of mwu’s crops up in translation when structure is lacking in the target equivalent; cf. prendre du retard and lag behind (Le retard que l’Europe a pris ne pourra pas être comblé aussi facilement que nous le souhaitons).

Cf. also porter plainte / lay or lodge a complaint / vs déposer une plainte

A complaint was lodged / * Plainte a été portée / Une plainte a été déposée

we can contrast:

lodge/lay + thesauric head (complaint)

déposer + thesauric head (plainte, réclamation)

porter + word (plainte)

Does it help to identify a lexical function f such as

f(plainte) = porter

f(complaint) = lodge / lay?

Digression on the COBUILD style of dictionary definitions

Justification, rationale, ... in Sinclair 1987.

Firthian principle: you shall know a word by the company it keeps / meaning is use (Wittgenstein II)

Good on pedagogical grounds as learners may not be aware of lexicographical conventions (shouldn’t they be?)

 

Objections

mixing definition and typical environment is conducive to over-restrictive definitions

definiendum in context or true phraseological unit (typical or obligatory environment?). Is it always clear which is the case?

Examples: be a function of under function

                be in a frenzy under frenzy

Wouldn’t it be better to isolate them as phrases? One would then be led to state the amount of admissible variation (be the function of / be in a dreadful frenzy???)

Looking too close at environments may lead to over-polysemy (cf. entry use). Does the variation in the object necessarily lead to polysemy (same problem with forget, enjoy, etc.)

‘Simpler vocabulary’ in the definiens than the definiendum: this problem is acute when a controlled defining vocabulary is used; but even when it is not, we find resolve explained by deal with; function explained by work; the first is a prepositional verb, the second is heavily polysemic (cf. act defined as take action in LDOCE!)

Proposals

From the definition we should be able to retrieve the position of the definiendum in the thesauric hierarchy built by the dictionary (through the genus word). The easiest way to proceed would be to have a specific information field for genus, so that retrieval is trivial. The definition should also provide a broad description of the syntagmatic universe the item fits in, so that matching the text of the definition with the textual bit that the item occurs in in discourse should lead to a certain degree of disambiguation. The match should not be textual, but thesauric.

 

Phrases should be entered as such only if their definition is linguistic, i.e. their constituent elements are defined as words or word lists, not heads of thesauric classes; they should exhibit restrictions on their syntactic potential; they should be given as much structure as their behaviour in text warrants (cf. attached paper).

 

The description of the environment should admit of various degrees of delicacy or fine-grainedness:

·       POS

·       Place holders for broad semantic classes: sth, sby

·       Semantic features (near the top of a semantic hierarchy, below is a matter for the thesauric organization): concrete, abstract, etc.

·       Thesauric heads (dictionary internal)

·       References to individuals (in so far as this needs to be distinguished from words, which would occur in the phraseology component)

 

The environment would be the args for verbs and certain nouns; it would be the modified item in case of modifiers such as adjectives and adverbs

 

·       Examples and example schemata

Examples are citations; example schemata are edited citations where the irrelevant bits are dropped or parametrized; examples should be illustrative only and should not have to be scanned for complementary information on the context of the item to be described

Papers

Papers on the dictionaries

Sum up relevant parts of prefatory material

Study a few entries (among which a pre-defined group[1] to be studied in each dictionary)

Where are phrases housed?

How are they described (semantically, pragmatically, syntactically, lexically)?

Critical comments and comparison with other dictionaries

 

Papers on John Le Carré’s The Little Drummer Girl (ldg)

Identify all mwu’s (very broadly defined) in the chapter; take the most promising (difficult) ones and study them in the dictionaries belonging to our dictionary list

Show how the ldg context is accounted for or not; abstract from it to a canonical representation

Offer critical comments; make your own proposals: adjustments to existing entries or new entries

 

Language and literature

Lexical frozenness and creativity

 

Clichés are a choice target for the display of creativity (harping on a known structure; cf. let’s call it a night; it’s been a field night, etc.)

Cf. Julien Gracq - Le Rivage des Syrtes p.93 prendre les fièvres (les eaux)

Paradigmatic relationship; syntagmatic one; the importance of water and corruption all through the chapter

allusions, references to other (literary) texts

cf. C.P. Snow, The Masters, Penguin ed. p. 185

Even the idyllic spectacle of the lion lying down with the lamb does not entirely reconcile me to the Dean’s ingenious idea (reference to Isaiah 11:6: The wolf also shall dwell with the lamb, and the leopard shall lie down with the kid (King James Version))

Should be accounted for; as opposed to:

p. 217 The silence of the infinite spaces did not terrify him. (reference to Pascal’s Pensées)

Why? Because the first is metaphorical and the use of the definite article cannot be explained otherwise?

Corpus-based work

available corpora

                for English

                for French, Dutch, German, ...

 

types of corpora

                raw

                edited

                tagged

 

types of research

                dictionary making

                collocational research


Tools

Dictionary tools

 

LDOCE : data base format and exploration tool

RC : enriched data base (T. Fontenelle) and exploration tool; awk tool to study the examples

COBUILD : awk format and exploration tool

 

Corpus tools

 

Standard Unix tools : grep, awk, etc.

Syntactic manipulations

Is it possible to organize them into a hierarchy according to the amount of distortion that they inflict on a given frozen or half-frozen structure (Fraser’s claim)?

Is it possible to relate the syntactic manipulations to the semantic make-up of the idiom? For instance passivization would not be possible unless the NP in object position in the idiom has a certain independence; should it therefore have a semantic as well as a syntactic node?

Syntactic manipulations should include derivational morphology. Three types of nominalizations should be tracked:

Lodging a complaint / To lodge a complaint / To judge a course of action

The lodging of a complaint / John’s lodging of a complaint / The judging of a course of action

La  déposition d’une plainte / The judgment of a course of action

On Marat, bloodbaths and Oper1 in the lexical function paradigm

In the lexical function paradigm prendre in prendre un bain and have or take in have (take) a bath (a swim) are regarded as exponents of Oper1, the lexical function that is most depleted from a semantic point of view. Similarly, in the support verb paradigm, prendre and take/have would be regarded as carriers of tense and aspectual information, as well as locus for the np-vp agreement in the vp, but devoid of semantic import (« semantically (almost) empty support verbs », to quote Heid 1994, p. 235 ; see Danlos and Samvelian 1992). Mel’chuk and Wanner 1994 offer the following characterization of Oper1 (p.326) :  provides for its keyword L2 (which is a predicate noun, i.e. a noun denoting an action, an event, a state, etc.) a verb L1 with the meaning ‘perform’, ‘undergo’, ‘be in a state’, etc.

 

Schematically we have

 

f (bain) = prendre

f (swim) = have / take

f (bath) = have / take

 

where f is the Oper1 lexical function, bain, swim and bath the bases and prendre, have and take the collocates.

 

Besides, the following bilingual equivalences would be taken to hold :

 

f (bain, bath) = prendre, have/take

f (bain, swim) = prendre, have/take

 

the distinction revolving round a possible polysemy of bain, where bain would be found equivalent to bath when the purpose of the ‘bain’ is to wash, as opposed to swim, when the purpose is sport, fitness, leisure, etc.

 

However, we have here a double oversimplification :

 

1) prendre in prendre un bain is not semantically empty

 

2) the bilingual equivalence does not hold when bain is further specified in a « bain de X » pattern, where X is either the type of bath (boue, foule, jouvence, etc.) or the body-part that the ‘bath’ is restricted to (pieds, bouche, etc.)

 

Let’s begin with the first point. At first blush, it seems that prendre collocates with bain, whatever modifications we introduce, i.e. that it is genuinely empty, being there only to make a verb (or rather provide the skeleton of a vp) out of a noun:

 

a) prendre un bon bain, un bain trop chaud, etc.

b) prendre un bain de mer, un bain de minuit, etc.

c) prendre un bain de jouvence, un bain de fraîcheur, un bain de langue, un bain de foule

d) prendre un bain de siège, un bain de pieds

 

But note that faire can be used instead of prendre in the d series, and that only faire can be used in the e series:

 

e) faire un bain de bouche, un bain d’yeux

 

It does not take too long to see that prendre implies a feature like [+ immersion] applied to the whole body or to the body-part to which the ‘bain’ is restricted. When there is no such immersion, prendre is out :

 

e’) * prendre un bain de bouche, un bain d’yeux

 

Note also that faire has its own semantic feature, something like [+ intention], so that 3 and 4 are not necessarily synonymous :

 

3) prendre un bain de pieds

4) faire un bain de pieds

 

Possible contextualisations where the two are not interchangeable :

 

Il n’a pas vu la flaque et il a pris un bon bain de pieds.

On lui a recommandé de faire des bains de pieds au tercinol pour se débarrasser de ce champignon.

 

Consider now bain de sang. Because of the immersion feature in prendre, we can write

 

Charlotte Corday fit prendre à Marat un bain de sang (Charlotte Corday had Marat take a bloodbath).

 

This is very different from the faire un bain de sang in :

 

Le bon vieux Stallone fit un bain de sang tout autour de lui (Good old Stallone made a bloodbath all around him).

 

Turning now to our second point, the lack of equivalence between prendre and have/take in prendre/faire un bain de X, suffice it to refer the reader to the new Oxford/Hachette and the third edition of the Robert/Collins, where the following translational pairs are given:

 

French                         | English                           | Source
faire des bains de bouche      | to rinse one’s mouth (out)        | O/H
prendre un bain de foule       | to mingle with the crowd          | O/H R/C
                               | to go on a walkabout              | O/H R/C
prendre un bain de minuit      | to go for a midnight swim         | O/H
faire des bains de pieds       | to soak one’s feet                | O/H
prendre un bain de soleil      | to sunbathe                       | O/H R/C
faire des bains d’yeux         | to bathe one’s eyes with eyewash  |
j’ai pris un bain de jouvence  | it was a rejuvenating experience  | R/C
                               | it made me feel years younger     | R/C

 

Summary and conclusions

In prendre un bain, faire un bain de X, neither prendre nor faire can be regarded as totally depleted. It would be interesting to look at other exponents of Oper1 to ascertain whether they are reducible to the role of verb- or vp-maker. As for the translational equivalences posited by the support verb paradigm, they cannot be extended to the cases where bain is further specified as bain de X. In other words, we cannot postulate the following translational pattern:

 

bain (Oper1 : prendre) : bath (Oper1 : take/have)

bain (Oper1 : prendre) : swim, bathe (Oper1 : take/have)

 

Instead, we need to look at the whole NP whose head is bain.
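
A minimal sketch of what looking at the whole NP could mean in a lookup procedure: the key is the verb together with the full noun phrase, and the Oper1-to-Oper1 pattern is kept for the bare noun only. The table layout and the function name translate are assumptions; the pairs are copied from the dictionary examples quoted above.

```python
# Verb + whole NP, with the translations quoted from O/H and R/C.
WHOLE_NP = {
    ("prendre", "bain de foule"): ["to mingle with the crowd", "to go on a walkabout"],
    ("prendre", "bain de minuit"): ["to go for a midnight swim"],
    ("prendre", "bain de soleil"): ["to sunbathe"],
    ("faire", "bains de bouche"): ["to rinse one's mouth (out)"],
    ("faire", "bains de pieds"): ["to soak one's feet"],
}

# The support verb pattern, restricted to the unspecified noun.
BARE_NOUN = {("prendre", "bain"): ["to have a bath", "to take a bath"]}

def translate(verb, np):
    """Look up the whole NP first; the bare-noun pattern applies only to 'bain' itself."""
    key = (verb, np)
    if key in WHOLE_NP:
        return WHOLE_NP[key]
    return BARE_NOUN.get(key, [])

print(translate("prendre", "bain de soleil"))
```

An NP that is not listed returns nothing rather than a composed guess such as "take a bath of X".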


Multi-word Units in Horatio

Summary

This paper deals with the lexical entry types that need to be devised to account for multi-word units (mwu’s) in horatio, a parser for a subset of English. We begin by introducing horatio, then move on to a characterization of mwu’s in terms of their potential for syntactic and lexical manipulations, and finally put forward prototypical entries for various classes of mwu’s.

A. Introducing Horatio

horatio is a parser for a subset of English based on a definite clause grammar belonging to the slot grammar framework (cf. the work of Michael McCord and associates; cf. e.g. McCord 1987, a presentation of the framework in half-tutorial fashion; see also McCord 1982, as well as McCord 1989a, 1989b and 1990 for recent developments).

horatio remains a ‘toy’ system in that it is oriented towards the teaching of Prolog for Natural Language Processing rather than any real life application. It is geared to the parsing of ‘linguistic’ rather than ‘real’ sentences (in the sense of Tomita 1991, i.e. sentences made up for the purpose of testing linguistic hypotheses rather than utterances occurring in actual text).

The parsing algorithm is top-down, left-to-right and depth-first. This is of course the parsing algorithm that Prolog itself uses, its ‘native’ parsing algorithm as it were. horatio is written in Arity Prolog (Version 5.1 for DOS) and runs on a 386/486 PC under DOS or OS/2 (for OS/2, version 6 of Arity Prolog was selected). It is not “tied” to Arity Prolog but is easily convertible to standard Edinburgh Prolog notation, and as a matter of fact also exists in a Yap Prolog version running on Sun. Arity Prolog has been selected because it is both reasonably fast and available on PC platforms (DOS, Windows and OS/2).

The parse shown below was produced on an IBM Model 70, with 4 megabytes of core memory and a 120-megabyte hard disk. The operating system is DOS 6.0.

A first question that we need to tackle concerns the nature of parsing. Obviously the nature and depth of the parses produced are a crucial issue. Parsing ranges from tagging (the association of a form with grammatical tags reflecting Part of Speech (POS)) to deep analysis, looking for the semantic invariant behind different phrasings.

The level chosen here is the one that is deemed to be adequate for translation between English and a related language, such as French. In terms of depth the type of parse produced is not very different from those in the IS (Interface Structure) in the EC Eurotra project, with which the author was associated[2]. The backbone remains syntactic.

In order to give an idea of the type of parses produced by horatio, we shall look at the parse returned by the system for the following sentence, which illustrates -inter alia- the use of the multi-word unit take place: The workshop is believed to have taken place in the library I wanted her to go to.

It will be seen that the parse is uncontroversial. Any application that needs to rely on a linguistic analysis of the sentences it is confronted with (i.e. an application such as machine translation, for which template matching or keyword search, however refined, are not good enough) will at least have to be able to retrieve the information provided by the horatio parse. I tend to agree with McCord, who writes: “It also appears reasonable to use syntactic analysis (embodying some semantic choices, such as word sense disambiguation) in machine translation systems.” (in McCord 1987, p. 325)


 28
  clause
   pred_arg_mod_structure
   prop(vce : passive,asp : none,mod : none,tns : present)
    predicate(believe_1,agr(en_passive))
     object
      clause
       pred_arg_mod_structure
       prop(vce : active,asp : [perfect],mod : none,tns : present)
        predicate(take_place_1,agr(en_active))
         subject
          nounphrase
          index(_0508)
          agr(third,sing)
            det(the)
            noun(workshop_1,agr(sing))
         pp_arg
          prepphrase
          index(_09EC)
          prep(in)
           np_arg_of_prep
            nounphrase
            index(_09F4)
            agr(third,sing)
              det(the)
              noun(library_1,agr(sing))
             relative_clause
              clause
               pred_arg_mod_structure
               prop(vce : active,asp : none,mod : none,tns : past)
                predicate(want_1,agr(finite,past,sing,first))
                 subject
                  nounphrase
                  index(_0C28)
                  agr(first,sing)
                  ppro(first,sing,_0CAC)
                 object
                  clause
                   pred_arg_mod_structure
                   prop(vce : active,asp : none,mod : none,tns : present)
                    predicate(go_1,agr(infinitive))
                     subject
                      nounphrase
                      index(_0E10)
                      agr(third,sing)
                      ppro(third,sing,fem)
                     pp_arg
                      prepphrase
                      index(_1010)
                      prep(to)
                       np_arg_of_prep
                        nounphrase
                        index(_09F4)
                        agr(third,_1074)


The first line of the returned parse is the preference (28). In the case of multiple parses, the one with the highest preference index is to be preferred.
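
Selecting among multiple parses then amounts to taking a maximum over the preference index. The sketch below assumes that each parse is available as a (preference, tree) pair; the representation, the function name and the competing scores 17 and 21 are invented for the example.

```python
def best_parse(parses):
    """Pick the parse with the highest preference index.

    Each parse is a (preference, tree) pair; the tree is left opaque here.
    """
    if not parses:
        return None
    return max(parses, key=lambda parse: parse[0])

# Hypothetical candidates: only the value 28 comes from the sample parse.
candidates = [(17, "parse A"), (28, "parse B"), (21, "parse C")]
print(best_parse(candidates))
```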

The parse is best conceived of as a set of clause parses each headed by a clause header of the following form:

             clause
               pred_arg_mod_structure

This means that the parser has found a clause and that it is going to display its structure in terms of its predicate, the arguments pertaining to that predicate and the clause modifiers, if any (the latter are not tied to the lexically-determined argument structure opened up by the predicate).

We then have a line devoted to the properties of the clause: voice (active/passive), aspect (none/perfect/progressive), modality (none/modal aux), and tense (present/past). Have taken place yields the following prop line:

prop(vce : active,asp : [perfect],mod : none,tns : present)

The predicate has its own property line, made up of the lexeme (with reading number) and of an agreement structure. The predicate line for wanted is the following:

predicate(want_1,agr(finite,past,sing,first))

The values sing and first (person) are obviously not computed on the basis of wanted, but on the basis of the surface subject I.

We then get the list of arguments, in canonical order. Unspecified arguments (such as the subject of believe) are left out. The relationships between the four clauses as displayed by the parse are the following:

clause 1
    predicate: believe
    args:      subject: unspecified
               object: clausal (clause 2)

clause 2
    predicate: take_place
    args:      subject: workshop
               pp_arg: in library (index X)
                   np modifier: rel clause (clause 3)

clause 3
    predicate: want
    args:      subject: I
               object: clausal (clause 4)

clause 4
    predicate: go
    args:      subject: she
               pp_arg: to library (index X)

Prepositional phrases and noun phrases bear an index that is used for coindexing. In the sample parse, the missing np governed by the preposition to is coindexed with the np the library: (index(_09F4)). Such coindexing is crucial for the treatment of gapping and long distance dependencies.
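
The role of the shared index can be sketched as follows. This is an illustration only, not the mechanism used in horatio (where coindexing presumably falls out of Prolog variable sharing): the class NounPhrase and the function resolve_gap are hypothetical, and only the index values are taken from the sample parse.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NounPhrase:
    index: str                   # e.g. "_09F4" in the sample parse
    head: Optional[str] = None   # None stands for a gap (no surface material)

def resolve_gap(gap, candidates):
    """Return the overt noun phrase that shares the gap's index, if any."""
    for np in candidates:
        if np.head is not None and np.index == gap.index:
            return np
    return None

library = NounPhrase(index="_09F4", head="library")
workshop = NounPhrase(index="_0508", head="workshop")
gap_after_to = NounPhrase(index="_09F4")  # the missing np governed by 'to'

print(resolve_gap(gap_after_to, [workshop, library]).head)
```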

            Noun phrases also display an agreement structure. For her we find the following two lines:

                      agr(third,sing)
                      ppro(third,sing,fem)

They indicate that we have a personal pronoun whose gender is feminine, number singular and person third. The agreement structures are part of the information that the horatio parses keep about surface structure to make it possible for the generator horgen to retrieve the surface forms from the raw Prolog terms corresponding to the parses.

However, the adequacy of this type of parsing for translation purposes is not proven: the reader is given a program that parses and generates, not one that translates. On a more positive note, the structures arrived at are presumably usable for purposes other than translation from and into a related language.

We claim that the real touchstone in horatio is the ability to disambiguate between the various readings of the lexical items belonging to the string to be parsed. Such reading assignment can be seen as one of the central tasks of any parsing system geared towards high quality translation. But of course this is not a rigorous test, because there is no way to decide on the number of readings an item has: the granularity depends on the purposes that are set for the lexicon in the system, as it does on the size of the dictionary and the targeted audience in lexicographical practice.

B. The lexicon in Horatio

A main principle of horatio is that information which belongs to the lexicon should be stored in the lexicon. A prime example is frame information, i.e. information on the syntactic (and/or semantic) environment a given item can or must fit into. The lexical entries themselves contain the relevant frames; they do not refer to information stored elsewhere. Consider the entries for ALLOW in horatio:

m_verb(verbtr,allow_1,allow,allow,allow,allows,allowing,
        allowed,allowed,allowed,trans,abstract,
        [np(oblig,posprec(1,Wnp),object,abstract)]).
/* the facts allow the explanation */

m_verb(vthat,allow_2,allow,allow,allow,allows,allowing,
        allowed,allowed,allowed,trans,human,
        [s(oblig,posprec(1,Precs),object)]).
/* she allows that he is good */

m_verb(vio,allow_3,allow,allow,allow,allows,allowing,
        allowed,allowed,allowed,trans,human,
        [np(oblig,posprec(2,Wnp1),object,thing),
         io(oblig,posprec(1,W2),indirect_object,human,_)]).
/* the teacher allows the boys money for books */

m_verb(vinf,allow_4,allow,allow,allow,allows,allowing,
        allowed,allowed,allowed,trans,_,
        [np(oblig,posprec(1,Wnp),surf_object,_),
         np_vp(oblig,to_inf,object)]).
/* they allowed him to teach linguistics */

m_verb(vobjadv,allow_5,allow,allow,allow,allows,allowing,
        allowed,allowed,allowed,trans,human,
        [np(oblig,posprec(1,Wnp),object,human),
         pp(oblig,posprec(1,Wpp),pp_arg,_,direction,_)]).
/* he allowed the girl into the library */

m_verb(vtrprep,allow_for_1,allow,allow,allow,allows,allowing,
        allowed,allowed,allowed,trans,human,
        [pp(oblig,posprec(1,Wpp),pp_arg,_,_,for)]).
/* he allowed for the oversimplifications */

m_verb(vtrprep,allow_for_1,allow,allow,allow,allows,allowing,
        allowed,allowed,allowed,trans,human,
        [string(oblig,posprec(1,0),[for]),
         np(oblig,posprec(2,Wnp),object,_)]).
/* he allowed for the oversimplifications */

(The existence of two m_verb clauses for the same reading of ALLOW is explained below, in the section on double analysis.)

The lexical predicate’s frame takes the form of a list and appears as the last argument of the predicate m_verb, which acts as macro-clause. The first argument is the class the predicate belongs to, the second is the lexeme value (including reading number), positions 3 to 10 take care of inflectional morphology, position 11 is the value for the transitivity feature, and position 12 is a semantic restriction on the deep subject[3]. Each element of the frame opens with the value for the optionality feature, either oblig(atory) or opt(ional). The posprec structure is used to establish linear precedence. It takes into account both canonical order and structural weight, giving priority to the latter. The nature of a given argument in the lexical predicate’s argument list is given by the functor of the structure (such as string, np, pp, etc. in the entries for ALLOW). A common feature is the one for surface or deep gf (grammatical function).
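
The positional layout just described can be made explicit with named fields. The sketch below is a reading aid, not part of horatio: the field names are assumptions, the frame is simplified (the posprec structure is left out), and the 13-argument shape is that of the ALLOW entries.

```python
from typing import NamedTuple, Tuple

class VerbEntry(NamedTuple):
    verb_class: str           # position 1
    lexeme: str               # position 2, includes the reading number
    forms: Tuple[str, ...]    # positions 3 to 10: eight inflected forms
    transitivity: str         # position 11
    subject_restriction: str  # position 12: restriction on the deep subject
    frame: list               # last argument: the argument frame

def make_entry(args):
    """Build a VerbEntry from the 13 arguments of an m_verb macro-clause."""
    if len(args) != 13:
        raise ValueError("expected 13 arguments, got %d" % len(args))
    return VerbEntry(args[0], args[1], tuple(args[2:10]), args[10], args[11], args[12])

allow_1 = make_entry([
    "verbtr", "allow_1",
    "allow", "allow", "allow", "allows", "allowing", "allowed", "allowed", "allowed",
    "trans", "abstract",
    [("np", "oblig", "object", "abstract")],  # simplified frame element
])
print(allow_1.lexeme, allow_1.transitivity, allow_1.subject_restriction)
```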

The advantage of putting lexical information in the lexicon is obvious: additions, changes or enhancements in the argument structure of lexical predicates (whether individual predicates or whole classes) do not entail changes in the grammar.

An alleged disadvantage is the size of the lexicon, which very soon grows rather bulky. However, this disadvantage is not a real one because lexical entries need not be produced as such by the linguist or lexicographer; they can result from the expansion of macro-clauses, either within or outside Prolog. Besides, lexical entries can be imported from a machine-readable dictionary (MRD), as in the importation from LDOCE (The Longman Dictionary of Contemporary English) to horatio, discussed in Michiels forthcoming. The task of the linguist or lexicographer is then reduced to selecting retrieval criteria and checking and expanding the resulting entries. A string manipulation language such as AWK (cf. Aho et al. 1988) is an ideal tool for performing the necessary format transformations.
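
A minimal sketch of such an expansion outside Prolog, here in Python rather than AWK. It assumes a regular verb whose stem needs no spelling change, and it copies the pattern of the eight inflectional slots from the ALLOW entries; the function names and the compact specification are hypothetical.

```python
def regular_forms(stem):
    """Eight inflectional slots for a regular verb, following the pattern of
    the ALLOW entries (the text does not label the individual slots)."""
    return [stem, stem, stem, stem + "s", stem + "ing",
            stem + "ed", stem + "ed", stem + "ed"]

def expand(verb_class, lexeme, stem, transitivity, subject, frame):
    """Expand a compact specification into the text of an m_verb clause."""
    fields = [verb_class, lexeme] + regular_forms(stem) + [transitivity, subject, frame]
    return "m_verb(" + ",".join(fields) + ")."

entry = expand("verbtr", "allow_1", "allow", "trans", "abstract",
               "[np(oblig,posprec(1,Wnp),object,abstract)]")
print(entry)
```

Irregular verbs such as take would need their forms listed explicitly instead of being generated.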

C. Approaches to the Treatment of Mwu’s

There are basically three ways of looking at multi-word units:

a)  syntactically and lexically, in terms of the restricted potential for manipulation that they exhibit;

b)  semantically, in terms of their non-compositionality ;

c)  rhetorically, in terms of the availability to present-day speakers and writers of the figure of speech that gave rise to the mwu.

In an ideal linguistic theory these three ways of looking at mwu’s hang tightly together and it is possible from the description of a mwu in terms of any of the three to derive its description in terms of the other two. We could go as far as to say that the treatment of mwu’s should be a touchstone for the coherence of these three basic components of a linguistic theory.

However, the three approaches do not display the same number of observational correlates. Syntactic and lexical manipulations are observable in discourse, and we can elicit native speakers’ judgments (even if often debatable and contradictory) as to the potential of a mwu for such manipulations.

Semantic compositionality is considerably harder to get at. In order to say that a construction is non-compositional, we ought to know what it means for it to be compositional. In the case of a mwu, we have to know the meanings of the component parts and the semantic import of the rule that seems to apply to the structure exhibited by the mwu. As pointed out above, lexical units do not have a fixed number of readings and the number of readings we are ready to posit depends on the granularity of our lexicon. We do not know much about semantic compositional rules except that they are semantic and compositional, i.e. they merge the readings of their components in regular ways. Consequently, non-compositionality is very often approached by paraphrasing. We look for a single term paraphrasing a mwu and therefore decide that the mwu is a semantic unit, or we look for a paraphrase with the same number of components as the mwu that seems to be compositional and we decide that the mwu is likewise compositional. It need hardly be pointed out that such a procedure is open to severe methodological criticism. However, it is still very much in favour, and is used in such a recent and comprehensive paper as Nunberg et al. 1994: take hold is paraphrasable as grasp and is therefore non-compositional; take stock is paraphrasable as make an assessment, and is therefore compositional. According to the same line of reasoning, however, take stock is non-compositional because it is paraphrasable as assess.

The rhetorical approach to mwu’s is likewise suggestive, but difficult to build on. It appears that the best way to determine whether an idiom is still felt to be ‘alive’ is precisely to rely on operational criteria such as openness to lexical and syntactic manipulations. The problem is made more complex by the need to take into account semi-creative attempts to revive ‘dead’ idioms (flog half-dead horses?). Kick the bucket may appear dead, but it is still alive and kicking:

The proverbial / fatal bucket I’m sure none of us is really in a hurry to kick.

The metalinguistic adjective (proverbial / fatal) signals a jocular use of the idiom. It should not be confused with adjectives providing true internal modification (as in take drastic measures).

In conclusion, we believe that in NLP we should give priority to the observational correlates of idiomaticity, namely the potential of mwu’s for syntactic and lexical manipulations. It is to these that we now turn, looking at how they are made use of in horatio.

D. Multi-word units in Horatio

In horatio we call mwu’s the pieces of structure which develop ties (a degree of internal cohesiveness) that go beyond what the grammar predicts. They are dealt with according to the degree of morphological, syntactic and lexical frozenness that they exhibit.

Mwus illustrate the non-givenness of the lexicon. More than single word units, they are theoretical constructs. Their recognition, and the structure that they are assigned, should result from their behaviour in discourse, more precisely from their potential for manipulation. The main principle adhered to in horatio is that mwus should be assigned as little structure as their behaviour warrants. It is this amount of assigned structure which determines the appropriate techniques to be used for the recognition of mwus from their manifestation in discourse. To give an example: in order to recognize the mwu take place we look for a morphological form of the verb take immediately followed by the string p-l-a-c-e; we do not look for an object NP whose realization is the noun place; we do not look for the noun place either. Consequently, the entry for take place runs as follows:

m_verb(vidiomintr,take_place_1,take,take,take,takes,taking,
        took,took,taken,intrans,abstract,
        [string(oblig,posprec(1,0),[place]),
         pp(opt,posprec(2,Wpp),pp_arg,_,location,_)]).
/* the workshop took place in the university */
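
The recognition strategy for take place (a form of take immediately followed by the literal string place) can be sketched outside the parser as a scan over the word list. The function name and the tokenization by whitespace are assumptions; the verb forms are those listed in the entry.

```python
# The distinct forms listed in the m_verb entry for take_place_1.
TAKE_FORMS = {"take", "takes", "taking", "took", "taken"}

def match_take_place(tokens):
    """Return the positions where a form of 'take' is immediately
    followed by the literal string 'place'."""
    hits = []
    for i in range(len(tokens) - 1):
        if tokens[i] in TAKE_FORMS and tokens[i + 1] == "place":
            hits.append(i)
    return hits

print(match_take_place("the workshop took place in the university".split()))
```

No noun phrase is built for place, so a sequence such as took the place is not matched.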

Mwus also illustrate the arbitrariness of the grammar-lexis distinction. In horatio there is no linguistically motivated border between syntax and lexis. We can choose to say that unit clauses (a Prolog concept) make up the dictionary of the system, but then the term dictionary is no longer used in a sense that is relevant to linguistic theory.

In order to assess the degree of internal cohesion of mwus we explore three classes of manipulation:

a) Insertion

Insertion of material into the lexical unit; compare:

play a role           ---> play an important role
set fire to  ---> * set dangerous fire to

This type of insertion (insertion of modifiers attached to elements belonging to a piece of the mwu) should be distinguished from:

a)         interruption of the mwu by foreign material:

he paid, if I may say so, attention to the problem
* the match took, if I may say so, place in the library

b)         insertion into the mwu of one or several of its arguments:

he took the problems into account (insertion of the object ‘the problems’ into the mwu take into account)

b) Extraction

Extraction of an element from its position within the canonical representation of the lexical unit ; this basic manipulation subsumes all standard transformations effecting movement or deletion ; compare:

pay attention to           ---> attention was paid to every single detail
make a fool of                ---> * a fool was made of the new head

c) Proformation

Replacement of a node in the mwu by a suitable pro-form: personal or indefinite pronoun for NP, so for S, do so for (certain classes of) VP, there for PPs functioning as place adjunct, etc. Compare:

play a role                       ---> play it again
pay attention to           ---> * pay some again / * don’t pay any to him

In horatio we distinguish (in a hierarchy from frozen to open):

a)  completely frozen mwus

A standard example is the adverb by and large.

These mwus have no internal structure. They should be regarded as objects of type string, with their various elements bound by the adjacency operator (i.e. white space). In particular, there is no reason whatsoever for trying to assign a part of speech to any of the constitutive elements: for example, by is not a preposition here (or whatever else for that mat­ter: it is no more than the se­quence of letters b-y) and large is not an adjective.

b)  mwus that allow only inflectional morphology variation (in one or several of their constituents)

Examples in horatio are take place and shoot the breeze, in which take and shoot can be inflected. Only the complete configurations are assigned structures. We have already given the entry for take place. Here is the one for shoot the breeze:

m_verb(vidiomintr,part0 :'the breeze',
        shoot_the_breeze_1,shoot,shoot,shoot,shoots,shooting,
        shot,shot,shot,intrans,human,
        [string(oblig,posprec(1,0),[the,breeze])]).

c)  mwus which can be interrupted by one or several of their arguments.

An example in horatio is take into account. Take can be inflected. Take and into account can be separated by the object of the mwu:

he took the problems into account
he took into account the problems that she had seen

(the relevant feature for position of the object is its weight). Here is the entry for take into account:

m_verb(vobjfixedpp,part1 :'into account',
        take_into_account_1,take,take,take,takes,taking,
        took,took,taken,trans,human,
        [string(oblig,posprec(1,3),[into,account]),
         np(oblig,posprec(1,Wnp),object,_)]).
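
The effect of weight on the position of the object can be sketched as follows. The text does not say how horatio computes the weight Wnp, so the sketch assumes, purely for illustration, that the weight of a noun phrase is its number of words and that the lighter of two complements sharing a canonical position comes first; the value 3 is the weight of the fixed string in the entry above.

```python
STRING_WEIGHT = 3  # from posprec(1,3) in the entry for take into account

def np_weight(np_tokens):
    """Assumed weight measure: the number of words in the noun phrase."""
    return len(np_tokens)

def order_complements(np_tokens, fixed=("into", "account")):
    """Place the lighter of the two complements first.

    Ties go to the object, an arbitrary choice made for this sketch.
    """
    if np_weight(np_tokens) <= STRING_WEIGHT:
        return list(np_tokens) + list(fixed)
    return list(fixed) + list(np_tokens)

print(" ".join(["he", "took"] + order_complements(["the", "problems"])))
print(" ".join(["he", "took"] + order_complements("the problems that she had seen".split())))
```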

d)  collocations: these are mwus whose elements are free to behave as the normal (i.e. with respect to a given grammar) structure assignment predicts. An example in horatio is take measure, where both take and the NP whose head is the noun measure behave as predicted by the ‘normal’ structure assignment:

VP [ V [TAKE] NP [ ... Head N [MEASURE]]]

The link between take and measure is collocational, i.e. take is the preferred verb to express what it expresses here. The implementation of such a lexical affinity in horatio is achieved through a feature on the object of take, namely [measure], a feature which is assigned to the noun measure under one of its readings. Such features can be regarded as hyperspecialised semantic features, i.e. it is hypothesized that they will not be needed alongside semantic features, and that consequently they can share the same slot[4]. The entry for take measure looks like this:

m_verb(verbtr,_,take_measure_1,take,take,take,takes,taking,
        took,took,taken,trans,human,
        [np(oblig,posprec(1,Wnp),object,measure)]).

            horatio also has the corresponding entry for the noun measure when used in the take measure collocation:

m_noun(measure_1,measure,measures,[measure],[]).
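
The four classes (a) to (d) can be summarized as a decision over three observed behaviours. This is a simplified sketch: the three boolean parameters and the function name are assumptions that compress the manipulation tests discussed above, and the examples are the ones used in the text.

```python
def mwu_class(inflects, argument_can_interrupt, free_syntax):
    """Map three observed behaviours onto the four classes (a) to (d)."""
    if free_syntax:
        return "d: collocation"
    if argument_can_interrupt:
        return "c: interruptible by arguments"
    if inflects:
        return "b: inflection only"
    return "a: completely frozen"

EXAMPLES = {
    "by and large": mwu_class(inflects=False, argument_can_interrupt=False, free_syntax=False),
    "take place": mwu_class(inflects=True, argument_can_interrupt=False, free_syntax=False),
    "take into account": mwu_class(inflects=True, argument_can_interrupt=True, free_syntax=False),
    "take measure": mwu_class(inflects=True, argument_can_interrupt=False, free_syntax=True),
}
for mwu, label in EXAMPLES.items():
    print(mwu, "->", label)
```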

In connection with the implementation of mwu’s it should be noted that when we satisfy a fixed string (i.e. match it against the input word list), we return no parse tree, as the fixed string is included in both the predicate’s lexical entry (as in look_down_on_1) and the predicate’s arglist:

satisfy(P0,P1,[],0,Posprec,Rel,Intrel,[],
                string(Type,Posprec,String),_,_) :-
append(String, P1, P0).

The String appended to the remaining list should yield the input list. In the lexical entry, String is a list as in:

string(oblig,posprec(1,0),[down])

part of the entry for look down on:

m_verb(vtrphrprep,part0 :down,look_down_on_1,look,look,look,looks,looking,
        looked,looked,looked,trans,human,
        [string(oblig,posprec(1,0),[down]),
         pp(oblig,posprec(2,Wpp),pp_arg,_,_,on)]).
/* the teacher looked down on his students */
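
The append relation used in satisfy amounts to a prefix test on the input word list. A minimal Python analogue, assuming that word lists are plain Python lists and that failure is signalled by None (the function name is hypothetical):

```python
def satisfy_string(fixed, p0):
    """Analogue of append(String, P1, P0) with String and P0 known:
    succeed only if the input list p0 begins with the fixed string,
    and return the remaining list P1."""
    n = len(fixed)
    if p0[:n] == fixed:
        return p0[n:]
    return None  # failure: the fixed string is not a prefix of the input

print(satisfy_string(["down"], ["down", "on", "his", "students"]))
```

Nothing is returned for the fixed string itself, in line with the remark that no parse tree is built for it.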

Double analysis

Quirk et al. 1985 and Bresnan 1981 cogently argue that some English lexical constructions can be parsed in two ways. Such a double analysis is necessary to account for the syntactic manipulations that these constructions admit of.

A case in point for English is the verb+preposition combination, as in look at. We can regard look at as a transitive verb like any other, or we can regard it as the verb look governing a prepositional phrase headed by at. Schematically:

1) look at + NP
2) look +PP (at)

The following sentences illustrate two of the syntactic manipulations (WH-movement and passivization) that lead one to postulate the need for a double analysis. Others can be found in Quirk et al. 1985 and Bresnan 1981.

1: What are you looking at?
    The man he was looking at ...
    The problem has been looked at from every angle

2: The text at which we have been looking for too long ...

Pulman in Alshawi et al. 1992 (p. 74) points out that if take advantage of is treated as a complex V only one passive can be derived in the GPSG meta-rule treatment of the passive, because advantage will not be available as an NP node for the meta-rule to apply to. Consequently, only the first of the following two passive S’s will be generated:

Kim was taken advantage of.
Advantage was taken of Kim.

This leads Pulman to reject the GPSG treatment. But the problem disappears if a double analysis is provided, evidence for which is precisely the availability of two passives.

It should be noted that the need for double analysis of some lexical constructions is not limited to English. Consider avoir l’air in French. We need to assign the following two analyses:

1) avoir l’air + ADJ
2) avoir + NP (air + ADJ)

on account of the two ways in which agreement can take place (either with air or with the subject of the whole phrase avoir l’air):

Elle a l’air idiote.
Elle a l’air idiot.

In horatio the appropriate lexicon file holds two macro-clauses for prepositional verbs such as look at. The first caters for the analysis in which the preposition belongs to the prepositional phrase rather than to the verb (analysis 2 in our account). The arglist contains a prepositional phrase specified in terms of the preposition heading it (at in the case of look at):

m_verb(vtrprep,_,look_at_1,look,look,look,looks,looking,
        looked,looked,looked,trans,living,
        [pp(oblig,posprec(1,Wpp),pp_arg,_,_,at)]).
/* they were looking at her
    the girl at whom they had been looking
*/

            The second macro-clause identifies at as a particle to be appended immediately to the right of the verb look (second argument of the macro-clause). The arglist opens with a string (at) and further contains the np playing the object role:

m_verb(vtrprep,part0 :at,look_at_1,look,look,look,looks,looking,
        looked,looked,looked,trans,living,
        [string(oblig,posprec(1,0),[at]),
         np(oblig,posprec(2,Wnp),object,_)
]).
/* they were looking at her
    whom are they looking at?
*/

            It should be noted that in both m_verb clauses the lexeme value is the same, viz. look_at_1. We are dealing with the same lexical item.

In the analysis of sentences such as They were looking at her, both m_verb clauses will succeed, and two parses will be returned. Such redundancy is not felt to be a negative feature, as the relationship between verb and preposition is truly indeterminate in such cases.
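
How two entries with the same lexeme value yield two parses can be sketched with a toy matcher over the words that follow the verb. Everything here is a simplification made for the example: frames are reduced to (kind, word) pairs, a noun phrase is any non-empty remainder, and gaps (as in whom are they looking at?) are not handled.

```python
ENTRIES = [
    # (lexeme, frame); both entries share the lexeme value look_at_1
    ("look_at_1", [("pp", "at")]),
    ("look_at_1", [("string", "at"), ("np", None)]),
]

def match(frame, tokens):
    """Match a frame against the words following the verb; return the
    analysis as a list of (slot, words) pairs, or None on failure."""
    analysis, rest = [], list(tokens)
    for kind, word in frame:
        if kind == "pp":
            if len(rest) < 2 or rest[0] != word:
                return None
            analysis.append(("pp_arg", rest))
            rest = []
        elif kind == "string":
            if not rest or rest[0] != word:
                return None
            analysis.append(("string", rest[:1]))
            rest = rest[1:]
        elif kind == "np":
            if not rest:
                return None
            analysis.append(("object", rest))
            rest = []
    return analysis if not rest else None

def parse_after_look(tokens):
    """Try every entry and collect all the analyses that succeed."""
    parses = []
    for lexeme, frame in ENTRIES:
        analysis = match(frame, tokens)
        if analysis is not None:
            parses.append((lexeme, analysis))
    return parses

for parse in parse_after_look(["at", "her"]):
    print(parse)
```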

            Consider now the phrase pay attention to. Attention is modifiable, and the np whose head it is plays a functional role, namely that of object of the verb pay:

You should pay more attention to the problems he has mentioned.
Too much attention has been paid to these pseudo-problems.

            Nevertheless pay attention to is a multi-word unit: the preposition is lexically determined and the sense of pay is not assignable without considering the object.

            The solution adopted for such mwu’s in horatio is again to use the slot reserved for semantic restrictions to code lexical restrictions. We have one entry for pay, whose lexeme value is the phrase pay_attention_1, in which we specify that the object must bear the semantic feature attention. We also have a reading of attention which bears the required feature in its semantic feature list:

m_verb(vobjfreepp,_,pay_attention_1,pay,pay,pay,pays,paying,
        paid,paid,paid,trans,human,
        [np(oblig,posprec(1,Wnp),object,attention),
         pp(oblig,posprec(1,Wpp),pp_arg,_,_,to)]).
/* they should pay attention to the problem he has seen */


m_noun(attention_1,attention,[attention],[]).

            Alongside the entry for pay_attention_1 given above, we need another one, in which attention to is parsed as a particle attached to the verb. This is achieved by including the string value attention to in the arglist. This entry is needed to account for passives such as The problem was paid attention to, where the object is not attention but the problem, i.e. the np filling the object argument in the arglist of this second entry:

m_verb(vtrphrprep,part0:'attention to',pay_attention_to_1_a,
        pay,pay,pay,pays,paying,
        paid,paid,paid,trans,human,
        [string(oblig,posprec(1,0),[attention,to]),
         np(oblig,posprec(2,Wnp),object,_)]).
/* the problem should be paid attention to */

            We would also need a further entry for the noun attention, to account for its uses outside the mwu pay attention to. The following would be appropriate:

m_noun(attention_2,attention,[abstract],[]).

Note that here abstract is a true semantic feature, not a lexical one.
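
How the semantic-restriction slot does double duty as a lexical restriction can be sketched in a few lines of Prolog. This is an illustration under simplifying assumptions, not horatio's actual matching procedure: noun/3, verb_object/3, has_feature/2 and licensed/4 are hypothetical predicates, and the entry for problem (with its feature abstract) is invented for the example.

```prolog
% Hypothetical, simplified lexicon (illustration only).
% noun(Lexeme, Form, FeatureList)
noun(attention_1, attention, [attention]).  % reading reserved for the mwu
noun(attention_2, attention, [abstract]).   % reading for all other uses
noun(problem_1,   problem,   [abstract]).   % invented entry

% verb_object(VerbLexeme, Verb, RequiredFeature)
% The feature required of the object is the 'lexical' feature attention.
verb_object(pay_attention_1, pay, attention).

% has_feature(?Feature, +FeatureList): plain list membership.
has_feature(F, [F|_]).
has_feature(F, [_|T]) :- has_feature(F, T).

% licensed(+Verb, +NounForm, -VerbLexeme, -NounLexeme)
% The object noun is accepted only in a reading whose feature list
% contains the feature the verb entry requires.
licensed(Verb, Form, VLex, NLex) :-
    verb_object(VLex, Verb, Required),
    noun(NLex, Form, Features),
    has_feature(Required, Features).

% ?- licensed(pay, attention, V, N).
% V = pay_attention_1, N = attention_1 (the only solution).
% ?- licensed(pay, problem, V, N).
% false.
```

Only the reading attention_1 satisfies the restriction, so the mwu entry for pay never combines with attention_2 or with any other noun.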

References

Dictionaries:

LDOCE = P. Procter (ed.), Longman Dictionary of Contemporary English, Longman, London, 1978

Literary material:

Gracq 1951 = Gracq, J., Le Rivage des Syrtes, José Corti, Paris, 1951

Snow 1951 = Snow, C. P., The Masters, Penguin edition, 1951

Other publications referred to:

Aho et al. 1988 = Aho, A. V., Kernighan, B. W., Weinberger, P. J., The AWK Programming Language, Addison-Wesley, Reading, Mass., 1988

Alshawi et al. 1992 = Alshawi, H. et al., The Core Language Engine, The MIT Press, Cambridge, Mass. and London, 1992

Bresnan 1981 = Bresnan, J., A Realistic Transformational Grammar, in Halle, M., Bresnan, J. and Miller, G.A. (eds), Linguistic Theory and Psychological Reality, The MIT Press, Cambridge, Mass., 1981

Danlos and Samvelian 1992 = Danlos, L. and Samvelian, P., Translation of the predicative element of a sentence: category switching, aspect and diathesis, in TMIMT-92, Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montréal, 1992

Fraser 1970 = Fraser, B., Idioms Within a Transformational Grammar, Foundations of Language, Vol. 6, pp. 22-42, 1970

Heid 1994 = Heid, U., On Ways Words Work Together - Topics in Lexical Combinatorics, Euralex ’94 Proceedings, 1994

Lesk = Lesk, M., Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone, in Proc. 1986 SIGDOC Conference, Toronto, Canada

Mackin 1978 = Mackin, R., On collocations: ‘words shall be known by the company they keep’, in P. Strevens (ed), In honour of A.S. Hornby, Oxford University Press, 1978

McCord 1982 = McCord, M. C., Using slots and modifiers in logic grammars for natural language, Artificial Intelligence, Vol. 18, pp. 327-367, 1982

McCord 1987 = McCord, M. C., Chapter 5 of Walker et al. 1987

McCord 1989a = McCord, M. C., A New Version of the Machine Translation System LMT, Journal of Literary and Linguistic Computing, 4, pp. 218-229

McCord 1989b = McCord, M. C., LMT and Slot Grammar, paper read at the IBM Europe Institute, August 1989, Garmisch-Partenkirchen

McCord 1990 = McCord, M. C., SLOT GRAMMAR: A System for Simpler Construction of Practical Natural Language Grammars, in Studer, R. (ed), International Symposium on Natural Language and Logic, Lecture Notes in Computer Science, Springer-Verlag

Mel’chuk and Wanner 1994 = Mel’chuk, I. and Wanner, L., Towards an Efficient Representation of Restricted Lexical Cooccurrence, in Euralex ’94 Proceedings, 1994

Michiels 1977 = Michiels, A., Idiomaticity in English, Revue des Langues Vivantes, XLIII, 2, 1977

Michiels 1978 = Michiels, A., A New Dictionary of Idiomatic English, Revue des Langues Vivantes, XLIV, 1, 1978

Michiels 1994 = Michiels, A., Feeding LDOCE entries into HORATIO, Studies in Machine Translation and Natural Language Processing, Volume 8, Lexical Issues in machine translation, edited by Paulo Alberto and Paul Bennett, European Commission, Luxemburg, 1994, 93-115

Nunberg et al. 1994 = Nunberg, G., Sag, I., Wasow, T., Idioms, Language, 1994

Pustejovsky 1995 = Pustejovsky, James, The Generative Lexicon, Cambridge, Mass., 1995

Quirk et al. 1985 = Quirk, R., Greenbaum, S., Leech, G., Svartvik, J., A Comprehensive Grammar of the English Language, Longman, 1985

Rosch 1978 = Rosch, E., Principles of Categorization, in E. Rosch and B. Lloyd (eds), Cognition and Categorization, L. Erlbaum Associates, Hillsdale, New Jersey, 1978

Ross 1973 = Ross, J. R., A Fake NP Squish, in Bailey, Ch.-J. N. and Shuy, R.W., (eds), New Ways of Analyzing Variation in English, Georgetown University Press, Washington, 1973

Sinclair 1987 = Sinclair, J. M. (ed), Looking Up, Cobuild, Collins ELT, London and Glasgow, 1987

Sinclair et al. 1995 = Sinclair, J., Hoelter, M. and Peters, C., (eds), The Languages of Definition: The Formalisation of Dictionary Definitions for Natural Language Processing, Studies in Machine Translation and Natural Language Processing, Vol. 7, 1995

Tomita 1991 = Tomita, M., Why Parsing Technologies, in Tomita, M. (ed), Current Issues in Parsing Technology, Kluwer Academic Publishers, Boston, 1991

Walker et al. 1987 = Walker, A. et al., Knowledge Systems and Prolog, Addison-Wesley, Reading, Mass., 1987

Weinreich 1969 = Weinreich, U., Problems in the Analysis of Idioms, reprinted in U. Weinreich, On Semantics, University of Pennsylvania Press, 1980

 

 



[1] Determining the pre-defined group may be the topic of a paper, too.

[2] It should be emphasized here that horatio has never been part of 'standard' Eurotra, but has always remained a sideline. More information on Eurotra can be found in the various volumes in the Studies in Machine Translation and Natural Language Processing series published by the Commission of the European Communities in Luxemburg, especially in Volumes I and II, 1991 (I: The Eurotra Linguistic Specifications; II: The Eurotra Formal Specifications). Both volumes are edited by C. Copeland, J. Durand, S. Krauwer and B. Maegaard.

[3] In agreement with McCord (see McCord 1987, p. 338), I do not include the subject in the predicate's argument list. As McCord writes, "Since every verb has a subject (in a finite clause), we will not actually list the subject slot by name, but will just put the subject marker in a separate argument of the lexical entry" (in McCord's system a marker is a semantic restriction).

[4] Nunberg et al. 1994 similarly argue that nouns which occur only in idioms, such as heed and dint, exhibit dependencies which are 'simply the limiting case of selectional restrictions'.