Language & Information Lab.

English Multi-Word Lexemes in a Lexical Database

Cornelia Tschichold
English Seminar, University of Basel (Switzerland)
email: tschichold@ubaclu.unibas.ch

Introduction

From a morphological point of view, English may not be a language that offers much challenge to computational lexicography, but it is certainly a language very rich in multi-word units. Its verbal inflection is largely periphrastic; most adjectives have an analytic comparative and superlative; there is a rich stock of phrasal and prepositional verbs, and there are innumerable idioms and collocations. It therefore seems evident that any computational lexicon for English needs to address the problem of how to treat multi-word units and to find a solution for it.
The lexical database system I would like to present consists of two major parts. The first, WordManager (WM), maps single text words onto their lexemes, e.g. it associates wrote with its place in the lexeme write. The second, PhraseManager (PM), maps multi-word units onto their lexemes and recognizes idioms: it will map, e.g., had been writing onto write with tense = past perfect continuous, and recognize to keep a stiff upper lip as an idiom. Here, I would like to concentrate on PM. The system is language-independent and has been tested for several (European) languages; the examples used for illustration are from English. For English, the two major groups within the multi-word lexemes are idiomatic expressions of various kinds and periphrastic inflection for verbs, adjectives and adverbs. In the following description I will concentrate on the way linguistic knowledge is formalized in the database. For a brief account of the system's general design, see ten Hacken et al. (1994); further references are listed below.

Idioms

In linguistics, an idiom is generally defined as a multi-word lexeme whose meaning is not a compositional function of the meanings of the component words. Idioms also have limited morphosyntactic flexibility. In PM, the term idiom is used in a rather broad sense. "Idiom rules" can be used for a variety of multi-word lexemes, including phrasal and prepositional verbs and collocations. All these multi-word lexemes display a range of flexibility from completely frozen to (almost) fully productive combinations of words. Possible modifications include the paradigmatic variation of one or more of the components (e.g. the inflection of a verb), the insertion of optional elements, or the choice between a number of alternative words. Furthermore, some expressions can be interrupted by elements not belonging to the multi-word lexeme, and transformations can change the word order of the elements in the phrase. In the context of NLP, it is vital to know which modifications are possible for each idiom, as this will determine whether both an idiomatic and a literal reading are possible or whether the idiomatic reading can be ruled out. PM offers a transparent formalism to handle all of these modifications and to specify which are possible for each expression. Classes of multi-word lexemes are organized in a hierarchical structure in a tree window. The members of such a phrasal class all share the same phrase structure tree. The rules for each class are specified in separate text windows. Global and individual modifications allow a flexible description of each expression.
A few examples will illustrate this procedure. Some English idioms do not have a literal reading because their internal structure is syntactically ill-formed, e.g. by and large. These completely frozen expressions could simply be specified in the following way:


SYNTAX-TREE
(AdvP [Prep Conj Adj])

MODIFICATIONS
-

TRANSFORMATIONS
-

EXAMPLE
by and large

by (Cat Prep)
and (Cat Conj)
large (Cat Adj)
-
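As an illustration only, the frozen class above can be mimicked in a few lines of Python. This is a minimal sketch, not PM's actual matching code: the names FrozenIdiom and find_frozen are hypothetical, and the only matching strategy assumed is a contiguous, case-insensitive comparison of the component words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrozenIdiom:
    """Hypothetical stand-in for a completely frozen idiom entry."""
    category: str      # category of the whole phrase, e.g. "AdvP"
    components: tuple  # (word, category) pairs in their fixed order


# Mirrors the window above: (AdvP [Prep Conj Adj]), no modifications,
# no transformations.
BY_AND_LARGE = FrozenIdiom(
    "AdvP", (("by", "Prep"), ("and", "Conj"), ("large", "Adj"))
)


def find_frozen(idiom, tokens):
    """Return the start indices where the idiom occurs contiguously."""
    words = [word for word, _ in idiom.components]
    lowered = [token.lower() for token in tokens]
    n = len(words)
    return [i for i in range(len(lowered) - n + 1)
            if lowered[i:i + n] == words]


print(find_frozen(BY_AND_LARGE, "By and large the plan worked".split()))
```

Because no modification or transformation is licensed, any interruption or reordering of the three words simply fails to match.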


Most multi-word lexemes do show some variation, however, and this variation can be formalized using the same framework. For expressions of the type to spill the beans, the verb can be inflected in the usual way (as shown by the angle brackets < > in the example), it can be modified by an adverb (He always spills the beans.), and the whole expression can undergo passivization (The beans were spilled at the meeting.). Global modifications such as the ones in the window below specify modifications that are possible for all the idioms in the idiom class, while individual modifications that apply only to particular phrases can be introduced by the lexicographer (using the lexicographer's interface).

SYNTAX-TREE
(VP [V NP])

MODIFICATIONS
V <

TRANSFORMATIONS
Passive

EXAMPLE
<spill> the beans
spill (Cat V)
the (Cat Det)
beans (Cat N)(Number Pl)
-
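The variation licensed for this class can be sketched in Python as follows. This is a rough illustration under stated assumptions, not PM's formalism: the word lists are hand-written and hypothetical (in the real system the inflected forms of spill would come from WM), and match_spill_the_beans is an invented name. The sketch accepts the active pattern with an inflected verb and the passive pattern, optionally with an adverb before the participle.

```python
# Hypothetical hand-listed forms; a real system would obtain them from
# the morphological component rather than from literals like these.
SPILL_FORMS = {"spill", "spills", "spilled", "spilt", "spilling"}
PAST_PARTICIPLES = {"spilled", "spilt"}
BE_FORMS = {"is", "are", "was", "were", "be", "been", "being"}
ADVERBS = {"always", "never", "often"}


def match_spill_the_beans(tokens):
    """Return 'active', 'passive', or None for a tokenized sentence."""
    t = [token.lower().strip(".,!?") for token in tokens]
    for i in range(len(t)):
        # Active: <spill> the beans (any inflected form of the verb).
        if t[i] in SPILL_FORMS and t[i + 1:i + 3] == ["the", "beans"]:
            return "active"
        # Passive transformation: the beans + form of be + participle,
        # with an optional adverb before the participle.
        if t[i:i + 2] == ["the", "beans"] and i + 2 < len(t) \
                and t[i + 2] in BE_FORMS:
            j = i + 3
            if j < len(t) and t[j] in ADVERBS:
                j += 1
            if j < len(t) and t[j] in PAST_PARTICIPLES:
                return "passive"
    return None


print(match_spill_the_beans("He always spills the beans.".split()))
print(match_spill_the_beans("The beans were spilled at the meeting.".split()))
```

An idiom class without the Passive transformation would simply omit the second branch, so that The beans were spilled could only receive a literal reading.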


Periphrastic inflection

The term periphrastic inflection covers the phenomenon of verbs and other word classes (i.e. one-word lexemes) having word forms that consist of more than one text word. Thus, having done will be mapped onto do and identified as the perfect gerund of do. Even such unlikely but possible forms as (she) had been being taken can be correctly identified as the passive past perfect continuous tense of take.
As all periphrastic tenses have to be analyzed into binary trees, the analysis of such a form proceeds in several substeps. The fact that the tree is right-branching allows the linguist to introduce a rule ensuring that adverbs are accepted only after the first auxiliary: (she) has never been taking will be identified as the present perfect continuous, whereas the analysis of I have been with divorced friends of mine as the present perfect of the passive will be blocked, because the string been divorced cannot be interrupted in such an analysis.
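The right-branching, step-by-step analysis can be sketched as a small recursive function. The following Python is an assumption-laden illustration rather than PM's rule language: FORMS, RULES, ADVERBS and analyse are hypothetical names, the form table is a tiny hand-written stand-in for WM, and the three combination rules (have + past participle, be + -ing form, be + past participle) are the textbook ones.

```python
# Hypothetical mini-lexicon: text word -> (lexeme, form tag).
FORMS = {
    "has": ("have", "pres"), "have": ("have", "pres"), "had": ("have", "past"),
    "been": ("be", "ppart"), "being": ("be", "ing"),
    "taking": ("take", "ing"), "taken": ("take", "ppart"),
    "divorced": ("divorce", "ppart"),
}
ADVERBS = {"never", "always"}

# (auxiliary lexeme, form of the subtree to its right) -> feature added.
RULES = {
    ("have", "ppart"): "perfect",
    ("be", "ing"): "continuous",
    ("be", "ppart"): "passive",
}


def analyse(tokens, first=True):
    """Analyse a right-branching verb group [Aux Rest].

    Returns (lexeme, form of the first word, features) or None.
    An adverb is skipped only directly after the first auxiliary.
    """
    entry = FORMS.get(tokens[0]) if tokens else None
    if entry is None:
        return None
    lemma, form = entry
    rest = tokens[1:]
    if not rest:
        return lemma, form, []
    if first and rest[0] in ADVERBS:
        rest = rest[1:]
    sub = analyse(rest, first=False)  # one substep per binary node
    if sub is None:
        return None
    main, rest_form, features = sub
    feature = RULES.get((lemma, rest_form))
    if feature is None:
        return None
    return main, form, [feature] + features


print(analyse("had been being taken".split()))
print(analyse("has never been taking".split()))
print(analyse("have been with divorced".split()))
```

In this sketch the first call yields take with a past first auxiliary and the features perfect, continuous and passive; the second yields take, present, perfect and continuous; the third returns None because the material between been and divorced does not follow the first auxiliary.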

The text window where these rules are specified also contains the possible transformations (in the case of English, inversion) and an example. The structure of the window is basically the same as for the idiom classes, except for the additional rule for periphrastic inflection. The window name shows the path from the top node of the periphrastic inflection tree down to the specific tense.

(PIClass V)(PIClass Aux)(PIClass Finite)(PIClass Pass_Perf_Cont)


SYNTAX-TREE
(VGr [V-aux V])

MODIFICATIONS
V <

TRANSFORMATIONS
Inversion

PERIPHR-INFL
Aux-Pass-Perf-Cont

EXAMPLE
<have> pass-perf-cont
have (Cat V)
pass-perf-cont (Cat V)(Tense Pass_Perf_Cont_Part)
-


Information about person and number of the verb form can easily be percolated (PERC 1) to the citation form (CFORM 2) of the lexeme by specifying the rule for periphrastic inflection as follows:


(Cat V)(Tense Past) + (Cat V)(Tense Perf_Part) =
(CFORM 2)(PERC 1)(Cat V)(Tense Past_Perf_Cont)
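One possible reading of this rule can be sketched in Python. The sketch assumes that (CFORM 2) means the citation form is taken from the second constituent and that (PERC 1) means person and number are copied from the first; the feature names Person, Number and CForm, the class PeriphrasticRule and the function apply_rule are all hypothetical. The feature values of the rule itself are copied from the listing above.

```python
from dataclasses import dataclass


@dataclass
class PeriphrasticRule:
    """Hypothetical encoding of one binary rule."""
    left: dict    # features required of constituent 1
    right: dict   # features required of constituent 2
    result: dict  # features of the combined form
    cform: int    # constituent supplying the citation form (CFORM n)
    perc: int     # constituent whose person/number percolate (PERC n)


PERCOLATED = ("Person", "Number")

# Feature values as given in the rule above.
RULE = PeriphrasticRule(
    left={"Cat": "V", "Tense": "Past"},
    right={"Cat": "V", "Tense": "Perf_Part"},
    result={"Cat": "V", "Tense": "Past_Perf_Cont"},
    cform=2,
    perc=1,
)


def apply_rule(rule, first, second):
    """Combine two feature bundles, or return None if the rule fails."""
    parts = (first, second)
    for required, actual in zip((rule.left, rule.right), parts):
        if any(actual.get(key) != value for key, value in required.items()):
            return None
    combined = dict(rule.result)
    combined["CForm"] = parts[rule.cform - 1]["CForm"]
    for feature in PERCOLATED:
        if feature in parts[rule.perc - 1]:
            combined[feature] = parts[rule.perc - 1][feature]
    return combined


aux = {"Cat": "V", "Tense": "Past", "CForm": "have",
       "Person": "3", "Number": "Sg"}
rest = {"Cat": "V", "Tense": "Perf_Part", "CForm": "write"}
print(apply_rule(RULE, aux, rest))
```

Keeping the indices for CFORM and PERC as data rather than hard-coding them means the same function could serve other binary rules in which the roles of the two constituents differ.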


Furthermore, rules for periphrastic inflection and for idioms can interact, as in the following example:

They were dropping a few hints about the affair.

PM will correctly identify the minimal form of the idiom to drop a hint and the past continuous tense of to drop, as can be seen in the retrieval dialog in the appendix.

Concluding remarks

The WM/PM system offers a lexical database and matching formalism which allows the linguist to specify the internal structure of multi-word lexemes to the desired extent. Idioms and periphrastic inflection make use of very similar frameworks for specification. Global and individual modification rules allow a precise description of the limited morphosyntactic flexibility of phrase classes and their members in a straightforward way.

References

Cowie, A.P. & R. Mackin (1993). Oxford Dictionary of Current Idiomatic English. Oxford: Oxford University Press.

Domenig, M. & P. ten Hacken (1992). Word Manager. Hildesheim: Olms.

ten Hacken, P., S. Bopp, M. Domenig, D. Holz, A. Hsiung & S. Pedrazzini (1994). "A Knowledge Acquisition and Management System for Morphological Dictionaries". COLING Proceedings. Kyoto. (1284-1288).

ten Hacken, P. & M. Domenig (1996). "Reusable dictionaries for NLP: The Word Manager approach". Lexicology.

Palmer, F.R. (1987). The English Verb. London: Longman.

Pedrazzini, S. (1994). Phrase Manager: A System for Phrasal and Idiomatic Dictionaries. Hildesheim: Olms.

06-Sep-95 Cornelia TSCHICHOLD