Language & Information Lab.
From a morphological point of view, English may not be a language that
offers much challenge to computational lexicography, but it is certainly
a language very rich in multi-word units. Its verbal inflection is largely
periphrastic; most adjectives have an analytic comparative and superlative;
there is a rich stock of phrasal and prepositional verbs, and there are
innumerable idioms and collocations. It therefore seems evident that any
computational lexicon for English needs to address the problem of how to
treat multi-word units and to find a solution for it.
The lexical database system I would like to present consists of two major
parts: one, called WordManager (WM), for mapping single text words onto
their lexemes, e.g. associating "wrote" with its place in the lexeme
"write", and one, called PhraseManager (PM), for mapping multi-word units
onto their lexemes and for recognizing idioms. This second part will
map, e.g., "had been writing" onto "write" (tense = past perfect continuous)
and recognize "to keep a stiff upper lip" as an idiom. Here, I would
like to concentrate on PM. The system is language-independent and has been
tested for several (European) languages. The examples I will use for illustration
are from English. For English, the two major groups within the multi-word
lexemes are idiomatic expressions of various kinds, and periphrastic inflection
for verbs, adjectives and adverbs. In the following description I will
concentrate on the way linguistic knowledge is formalized in the database.
For a brief account of the system's general design, see ten Hacken et al. (1994);
and for further references see below.
In linguistics, an idiom is generally defined as a multi-word lexeme
whose meaning is not a compositional function of the meanings of the component
words. Idioms also have limited morphosyntactic flexibility. In PM, the
term "idiom" is used in a rather broad sense. "Idiom rules"
can be used for a variety of multi-word lexemes, including phrasal and
prepositional verbs and collocations. All these multi-word lexemes display
a range of flexibility from completely frozen to (almost) fully productive
combinations of words. Possible modifications include the paradigmatic
variation of one or more of the components (e.g. the inflection of a verb),
the insertion of optional elements, or the choice between a number of alternative
words. Furthermore, some expressions can be interrupted by elements not
belonging to the multi-word lexeme, and transformations can change the
word order of the elements in the phrase. In the context of NLP, it is
vital to know which modifications are possible for each idiom, as this
will determine whether both an idiomatic and a literal reading are possible
or whether the idiomatic reading can be ruled out. PM offers a transparent
formalism to handle all of these modifications and to specify which are
possible for each expression. Classes of multi-word lexemes are organized
in a hierarchical structure in a tree window. The members of such a phrasal
class all share the same phrase structure tree. The rules for each class
are specified in separate text windows. Global and individual modifications
allow a flexible description of each expression.
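The kinds of modification listed above can be made concrete with a small sketch. It is not PM's formalism: the `Slot` class, the `slot` and `matches` helpers, and the toy `LEMMAS` table (a stand-in for WordManager) are all hypothetical names introduced for illustration. The optional "very" is an assumed modification, not a claim about what the PM lexicon licenses. Each slot states which lexemes it accepts (alternatives), whether the word may inflect (paradigmatic variation), and whether it may be omitted (optional elements).

```python
from dataclasses import dataclass

# Toy stand-in for WordManager (hypothetical data): text word -> lexeme.
LEMMAS = {
    "keep": "keep", "keeps": "keep", "kept": "keep", "keeping": "keep",
    "a": "a", "very": "very", "stiff": "stiff", "upper": "upper",
    "lip": "lip", "lips": "lip", "by": "by", "and": "and", "large": "large",
}


@dataclass(frozen=True)
class Slot:
    lexemes: frozenset       # alternative lexemes accepted in this position
    inflects: bool = False   # True: any word form; False: citation form only
    optional: bool = False   # True: the position may be left empty


def slot(*lexemes, inflects=False, optional=False):
    """Convenience constructor for a Slot."""
    return Slot(frozenset(lexemes), inflects, optional)


def matches(slots, tokens):
    """True if the tokens realise the slot sequence from start to end."""
    if not slots:
        return not tokens
    head, rest = slots[0], slots[1:]
    if tokens:
        word = tokens[0].lower()
        lexeme = LEMMAS.get(word)
        fits = lexeme in head.lexemes and (head.inflects or word == lexeme)
        if fits and matches(rest, tokens[1:]):
            return True
    # An optional slot may also be skipped without consuming a token.
    return head.optional and matches(rest, tokens)


# A completely frozen expression: no slot inflects, none is optional.
BY_AND_LARGE = [slot("by"), slot("and"), slot("large")]

# A more flexible one: the verb inflects; "very" is an assumed optional element.
STIFF_UPPER_LIP = [
    slot("keep", inflects=True),
    slot("a"),
    slot("very", optional=True),
    slot("stiff"), slot("upper"), slot("lip"),
]

print(matches(STIFF_UPPER_LIP, "kept a stiff upper lip".split()))
print(matches(STIFF_UPPER_LIP, "kept a stiff upper lips".split()))
```

In this sketch a choice between alternative words would be written as a slot with several lexemes, e.g. `slot("a", "the")`. Interruption by foreign material and word-order transformations are deliberately left out.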
A few examples will illustrate this procedure. Some English idioms do not
have a literal reading because their internal structure is syntactically
ill-formed, e.g. "by and large". These completely frozen expressions
could simply be specified in the following way:
The term "periphrastic inflection" covers the phenomenon of verbs
and other word classes (i.e. one-word lexemes) having word forms that consist
of more than one text word. Thus, "having done" will be mapped onto
"do" and identified as the perfect gerund of "do". Even such
unlikely but possible forms as "(she) had been being taken" can be
correctly identified as the passive past perfect continuous tense of "take".
As all periphrastic tenses have to be analyzed into binary trees, the analysis
of such a form proceeds through several substeps. Because the tree is
right-branching, the linguist can introduce a rule ensuring that adverbs
are accepted only after the first auxiliary: "(she) has never been taking"
will be identified as the present perfect continuous, whereas the analysis
of "I have been with divorced friends of mine" as the present perfect of
the passive will be blocked because the string "been divorced" cannot be
interrupted in such an analysis.
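The right-branching binary analysis and the adverb restriction can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not PM's rule language: the `FORMS` table is a toy stand-in for WordManager with one reading per word form, `parse_group` and `analyse` are hypothetical names, and only finite verb groups are covered.

```python
# Toy stand-in for WordManager (hypothetical data): word -> (lexeme, form).
# Each word form gets a single reading, which a real lexicon would not do.
FORMS = {
    "has": ("have", "pres"), "have": ("have", "pres"), "had": ("have", "past"),
    "been": ("be", "pastpart"), "being": ("be", "prespart"),
    "taken": ("take", "pastpart"), "taking": ("take", "prespart"),
    "divorced": ("divorce", "pastpart"),
}
ADVERBS = {"never"}
TENSE = {"pres": "present", "past": "past"}


def parse_group(words):
    """Analyse [Aux [Aux ... V]] as a right-branching binary tree.

    Returns (lexeme, features, form of the leftmost word) or None.
    """
    if not words or words[0] not in FORMS:
        return None
    lexeme, form = FORMS[words[0]]
    if len(words) == 1:
        return lexeme, [], form              # the main verb closes the tree
    sub = parse_group(words[1:])             # right daughter of this node
    if sub is None:
        return None
    sub_lexeme, sub_features, sub_form = sub
    if lexeme == "have" and sub_form == "pastpart":
        feature = "perfect"
    elif lexeme == "be" and sub_form == "prespart":
        feature = "continuous"
    elif lexeme == "be" and sub_form == "pastpart" and len(words) == 2:
        feature = "passive"                  # "be" directly before the main verb
    else:
        return None
    return sub_lexeme, [feature] + sub_features, form


def analyse(tokens):
    """Return (lexeme, tense name) for a finite verb group, else None."""
    words = [t.lower() for t in tokens]
    # An adverb is accepted only directly after the first auxiliary.
    if len(words) > 2 and words[1] in ADVERBS:
        words = [words[0]] + words[2:]
    result = parse_group(words)
    if result is None or result[2] not in TENSE:
        return None
    lexeme, features, form = result
    parts = (["passive"] if "passive" in features else []) + [TENSE[form]]
    parts += [f for f in ("perfect", "continuous") if f in features]
    return lexeme, " ".join(parts)


print(analyse("had been being taken".split()))
print(analyse("has never been taking".split()))
print(analyse("have been with divorced".split()))
```

Each recursive call corresponds to one binary node, so the substeps of the analysis are simply the nested calls. Because the adverb is removed only at the top level, any material deeper in the chain, such as "with" between "been" and "divorced", makes the whole analysis fail.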
The text window where these rules are specified also contains the possible
transformations (in the case of English, inversion) and an example. The
structure of the window is basically the same as for the idiom classes,
except for the additional rule for periphrastic inflection. The window
name shows the path from the top node of the periphrastic inflection tree
down to the specific tense.
(PIClass V)(PIClass Aux)(PIClass Finite)(PIClass Pass_Perf_Cont)
SYNTAX-TREE
(VGr [V-aux V])
MODIFICATIONS
V <
TRANSFORMATIONS
Inversion
PERIPHR-INFL
Aux-Pass-Perf-Cont
EXAMPLE
<have> pass-perf-cont
have (Cat V)
pass-perf-cont (Cat V)(Tense Pass_Perf_Cont_Part)
-
Information about person and number of the verb form can easily be
percolated (PERC 1) to the citation form (CFORM 2) of the lexeme
by specifying the rule for periphrastic inflection as follows:
They were dropping a few hints about the affair.
PM will correctly identify the minimal form of the idiom "to drop a hint" and the past continuous tense of "to drop", as can be seen in the retrieval dialog in the appendix.
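How an inflected and modified occurrence might be reduced to the minimal form of the idiom plus a tense label can be sketched as follows. This is a deliberately narrow illustration, not the PM retrieval procedure: the `FORMS` table, the `IDIOM` entry with its list of determiner alternatives, and the `find_idiom` function are hypothetical, and only continuous verb groups of the shape "be" + present participle are handled.

```python
# Toy stand-in for WordManager (hypothetical data): word -> (lexeme, form).
FORMS = {
    "was": ("be", "past"), "were": ("be", "past"),
    "is": ("be", "pres"), "are": ("be", "pres"),
    "dropping": ("drop", "prespart"),
    "hint": ("hint", "sg"), "hints": ("hint", "pl"),
}
TENSE = {"pres": "present", "past": "past"}

# Hypothetical idiom entry: the determiner slot accepts alternatives and
# the noun may inflect, but the minimal form is fixed.
IDIOM = {
    "verb": "drop",
    "determiners": [("a",), ("a", "few"), ("some",)],
    "noun": "hint",
    "minimal": "drop a hint",
}


def find_idiom(sentence, idiom=IDIOM):
    """Return (minimal form, tense) for the first occurrence, else None."""
    words = [w.strip(".,;!?").lower() for w in sentence.split()]
    for i in range(len(words) - 1):
        aux = FORMS.get(words[i])
        verb = FORMS.get(words[i + 1])
        if aux is None or aux[0] != "be" or aux[1] not in TENSE:
            continue
        if verb != (idiom["verb"], "prespart"):
            continue
        rest = words[i + 2:]
        for det in idiom["determiners"]:
            noun = rest[len(det):len(det) + 1]   # the word after the determiner
            if (tuple(rest[:len(det)]) == det and noun
                    and FORMS.get(noun[0], (None,))[0] == idiom["noun"]):
                return idiom["minimal"], TENSE[aux[1]] + " continuous"
    return None


print(find_idiom("They were dropping a few hints about the affair."))
```

The point of the sketch is the separation of concerns: the verb group contributes the tense, the idiom entry contributes the minimal form, and the variation in the determiner and noun is absorbed by the entry's own modification options.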
The WM/PM system offers a lexical database and matching formalism which allows the linguist to specify the internal structure of multi-word lexemes to the desired extent. Idioms and periphrastic inflection make use of very similar frameworks for specification. Global and individual modification rules allow a precise description of the limited morphosyntactic flexibility of phrase classes and their members in a straightforward way.
Cowie, A.P. & R. Mackin (1993). Oxford Dictionary of Current
Idiomatic English. Oxford: Oxford University Press.
Domenig, M. & P. ten Hacken (1992). Word Manager. Hildesheim:
Olms.
ten Hacken, P., S. Bopp, M. Domenig, D. Holz, A. Hsiung & S. Pedrazzini
(1994). "A Knowledge Acquisition and Management System for
Morphological Dictionaries". COLING Proceedings. Kyoto. (1284-1288).
ten Hacken, P. & M. Domenig (1996). "Reusable dictionaries
for NLP: The Word Manager approach". Lexicology.
Palmer, F.R. (1987). The English Verb. London: Longman.
Pedrazzini, S. (1994). Phrase Manager: A System for Phrasal and
Idiomatic Dictionaries. Hildesheim: Olms.