Applications of Computational Linguistics, WS 01/02

Answer Extraction

Alexander Deubelbeiss, February 1, 2002

An answer extraction system: ExtrAns

An answer extraction system accepts a question formulated in natural language as its input. The answers are not generated by the system from a "knowledge database" but extracted from a body of human-formulated texts. These texts typically belong to a highly restricted domain but were originally written for other humans in syntactically unrestricted language.

Aim: to provide better precision (relevance of results to the query) while maintaining the recall (completeness of results) of Information Retrieval methods that do not use NLP, without the cost and difficulty of developing a language-understanding question-answering system or of porting an existing one to a new domain. According to the developers of ExtrAns, answer extraction is well suited to medium-sized static corpora.
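The two measures named above can be made concrete with a small sketch. The function name precision_recall and the sentence identifiers are hypothetical and serve only as illustration; the definitions are the standard set-based ones.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of the returned results that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of all relevant items that were actually returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 sentences returned, 3 of them relevant, 6 relevant overall.
p, r = precision_recall(
    {"s1", "s2", "s3", "s4"},
    {"s1", "s2", "s3", "s5", "s6", "s7"},
)
print(p, r)  # 0.75 0.5
```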

The answers are not entire documents but only those parts that seem to provide a direct answer to the specific question the user asks. All questions are treated separately: the system stores no information about earlier questions, and it can't make sense of elliptical follow-up questions ("Which of these commands is suitable for text files?"). Answer extraction aims to supply all relevant answers (which may contradict each other) to each question, not to create the impression of having a conversation with a computer.

ExtrAns (University of Zürich, Institute of Computational Linguistics) extracts its answers from Unix man (manual) pages, i.e., the documentation of a computer operating system's commands. A new version that uses the (substantially larger) technical manuals of the Airbus A320 as its data is currently being developed.

ExtrAns converts its source data into "logical forms", a representation of both the syntactic structure and the semantic content of a sentence. This explains why it is useful to have a static corpus: the conversion needs to be done only once. Queries are subjected to the same treatment. The source text database is then searched for sentences whose logical form is equivalent to the logical form of the query.
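The matching step can be illustrated with a minimal sketch. It assumes, purely for illustration, that a logical form is a flat list of predicates written as tuples and that query variables are strings starting with "?". The predicate names, the sentence identifiers cp.1 and ls.1, and the functions unify and match are hypothetical and not taken from ExtrAns. The sketch also tests containment (every query predicate must be found in the sentence) rather than strict equivalence.

```python
def is_var(term):
    # Assumption for this sketch: variables are strings starting with "?".
    return isinstance(term, str) and term.startswith("?")


def unify(q_pred, s_pred, bindings):
    """Match one query predicate against one sentence predicate.

    Returns the extended bindings, or None if the two do not match.
    """
    if len(q_pred) != len(s_pred):
        return None
    new = dict(bindings)
    for q, s in zip(q_pred, s_pred):
        if is_var(q):
            # A variable must be bound to the same value everywhere.
            if new.setdefault(q, s) != s:
                return None
        elif q != s:
            return None
    return new


def match(query, sentence, bindings=None):
    """Return bindings if every query predicate is found in the sentence, else None."""
    bindings = {} if bindings is None else bindings
    if not query:
        return bindings
    first, rest = query[0], query[1:]
    for pred in sentence:
        new = unify(first, pred, bindings)
        if new is not None:
            result = match(rest, sentence, new)
            if result is not None:
                return result
    return None


# Hypothetical pre-computed logical forms of two man-page sentences.
corpus = {
    "cp.1": [("object", "cp", "x1"), ("evt", "copy", "e1", "x1", "x2"),
             ("object", "file", "x2")],
    "ls.1": [("object", "ls", "x3"), ("evt", "list", "e2", "x3", "x4"),
             ("object", "directory", "x4")],
}

# Hypothetical logical form of the query "Which command copies files?"
query = [("evt", "copy", "?e", "?who", "?what"), ("object", "file", "?what")]

answers = [sid for sid, lf in corpus.items() if match(query, lf) is not None]
print(answers)  # ['cp.1']
```

Because the corpus is static, the corpus dictionary only has to be built once; each query then costs one conversion plus the matching itself.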

Problems:

Domain-specific language: The system needs to distinguish, for instance, between the command eject and the common verb eject; between the system's printing a warning message on the screen and the user's printing a document on paper; between filename as a placeholder in a command description and the ordinary word filename; and so on.
Ambiguity: Not all ambiguities can be resolved, either in the source text or in the query. ExtrAns simply stores the different interpretations as alternatives and uses them to score its results: the more interpretations of a source text sentence match an interpretation of the query, the higher that sentence is scored. This score is translated into the highlighting colour in which the answer sentence is displayed. The alternative interpretations themselves carry a probability score from the syntactic disambiguation process, which uses empirical rules to mark certain syntactic structures as more likely than others.
Incomplete analysis: Since the source text is unrestricted language, the parser often fails to analyse a sentence completely; in other cases, the source text contains incomplete sentences (e.g. in headings). With the specific data representation and matching methods used in ExtrAns, partial analyses can still be matched to their counterparts in more completely analysed sentences. In those extreme cases where no structure whatsoever is found, the individual part-of-speech-tagged content words are used as search terms for a string-match search.
Synonymy and hyponymy: If a search does not return a minimum number of results, the system extends its search by allowing first synonyms and then hyponyms of certain search terms. A special domain-specific dictionary has been developed for ExtrAns for this purpose.
Performance: ExtrAns uses a traditional Information Retrieval search to pre-select the pages to search: only the logical forms of documents that contain the content words of the query are passed on to the actual NLP-based search procedure.
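Three of the strategies listed above (keyword pre-selection, scoring by the number of matching interpretations, and stepwise widening to synonyms and hyponyms) can be sketched as follows. This is a simplified illustration, not the ExtrAns implementation: pages and sentences are reduced to sets of lemmas, an interpretation is a frozenset of ground predicates, and all function names, identifiers, and dictionary entries are hypothetical.

```python
def preselect(pages, content_words):
    """Keyword filter: keep pages that contain every content word of the query."""
    return [name for name, words in pages.items() if content_words <= words]


def score(sentence_readings, query_readings):
    """Count the sentence readings that match at least one query reading.

    Simplification: a sentence reading matches when it contains the whole
    query reading (subset test on frozensets of predicates).
    """
    return sum(any(q <= s for q in query_readings) for s in sentence_readings)


def find(sentences, alternatives):
    """Ids of sentences that contain, for every query term, one accepted word."""
    return [sid for sid, words in sentences.items()
            if all(words & alts for alts in alternatives.values())]


def search_with_fallback(sentences, terms, synonyms, hyponyms, min_results=1):
    """Search with the literal terms, then allow synonyms, then hyponyms."""
    alternatives = {t: {t} for t in terms}
    results = find(sentences, alternatives)
    for table in (synonyms, hyponyms):  # first synonyms, then hyponyms
        if len(results) >= min_results:
            break
        for t in terms:
            alternatives[t] |= table.get(t, set())
        results = find(sentences, alternatives)
    return results


# Hypothetical data: pages and sentences as sets of lemmas.
pages = {"rm": {"rm", "remove", "file", "directory"},
         "ls": {"ls", "list", "directory"}}
print(preselect(pages, {"remove", "file"}))  # ['rm']

# A sentence with two readings, both of which contain the single query reading.
query_readings = [frozenset({("remove", "x", "y"), ("file", "y")})]
sentence_readings = [
    frozenset({("remove", "x", "y"), ("file", "y"), ("command", "x")}),
    frozenset({("remove", "x", "y"), ("file", "y"), ("option", "x")}),
]
print(score(sentence_readings, query_readings))  # 2

sentences = {"rm.1": {"rm", "remove", "file"}, "cp.1": {"cp", "copy", "file"}}
synonyms = {"delete": {"remove", "erase"}}
hyponyms = {"file": {"textfile"}}
print(search_with_fallback(sentences, ["delete", "file"], synonyms, hyponyms))  # ['rm.1']
```

In the last call the literal search for "delete" finds nothing, the synonym step finds rm.1, and the hyponym step is never reached because the minimum number of results has already been met.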

References

Berri, Jawad; Mollá Aliod, Diego; Hess, Michael. Extraction automatique de réponses: implémentations du système ExtrAns. http://www.ifi.unizh.ch/CL/berri/taln98.ps.gz, retrieved in January 2002.

Hess, Michael. Mixed-Level Knowledge Representations and Variable-Depth Inference in Natural Language Processing. http://www.ifi.unizh.ch/CL/hess/ijait.ps.gz, retrieved in January 2002.

Mollá Aliod, Diego; Hess, Michael. On the Scalability of the Answer Extraction System "ExtrAns". http://www.ifi.unizh.ch/CL/molla/klagenfurt.ps.gz, retrieved in January 2002.

Mollá Aliod, Diego; Berri, Jawad; Hess, Michael. A Real World Implementation of Answer Extraction. http://www.ifi.unizh.ch/CL/hess/nlis.ps.gz, retrieved in January 2002.

The ExtrAns project

Information about the project
Demo