Daniela Goecke [i], Harald Lungen [ii], Felix Sasaki [i], Andreas Witt [i], & Scott O. Farrar [iii]
[i] Bielefeld University, [ii] Justus-Liebig-University Gießen, [iii] University of Bremen

GOLD and Discourse: Domain- and Community-Specific Extensions

This proposed paper deals with GOLD, the General Ontology for Linguistic Description, and its extension to discourse categories. The central aim of GOLD is to specify in a formal language all of the categories necessary for broad-coverage description of linguistic data. Initially, GOLD focused narrowly on morphosyntax, since this sub-discipline was widely represented in linguistic field data. Recently, there have been efforts to extend GOLD to other sub-domains, particularly text linguistics and discourse analysis. The paper introduces applications envisaged in the research group "Text-technological modelling of information", funded by the German Research Foundation (DFG). The research group deals with various areas of modelling and processing XML-annotated data, most of which is textual.

A central application domain of the research group is the processing of discourse relations, for example in discourse parsing or in the automatic resolution of anaphoric relations, by integrating heterogeneous knowledge resources such as annotated corpora, lexica or ontologies. In these applications, GOLD plays a crucial role in two tasks. (1) GOLD supplies domain-specific knowledge during the processing of scientific articles from the discipline of linguistics, which naturally contain a large amount of linguistic terminology. In this task, the role of GOLD can be compared to that of WordNet in the anaphora resolution algorithm proposed by Vieira and Poesio (2000). (2) GOLD allows for the representation of discourse categories and relations in order to make their similarities and differences explicit on an ontological level. This task can be compared to the approach taken by Cimiano and Handschuh (2003). The proposed paper will deal in detail with the representation of discourse categories and discourse relations.

GOLD has so far focused on morphosyntactic relations. Creating a module of discourse relations for GOLD is less straightforward than modelling morphosyntax, because it is difficult to take "as much of a theory-neutral approach as possible" (Farrar to appear). Discourse relations are highly diverse and often specific to individual discourse theories. A typical discourse category is the so-called 'discourse entity', but a text cannot be exhaustively segmented into discourse entities in the way that it can be segmented into morphosyntactic constituents. To deal with these characteristics of discourse, we focus on a small set of discourse-related categories that are represented as classes in the ontology: discourse units, discourse relations, discourse markers and discourse entities. In addition, we rely on relation taxonomies to describe the relations between these classes. Our community of practice extension (COPE) of GOLD focuses on three kinds of relations: referential relations (between discourse entities), rhetorical relations as described in the framework of Rhetorical Structure Theory (RST, Mann and Thompson 1988), and topic development relations (between discourse units).
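The four classes and three relation kinds named above can be illustrated with a minimal Python sketch. It is not the GOLD formalisation or the COPE itself: all identifiers (`RelationKind`, `ARGUMENT_TYPE`, the relation name "coreference") are hypothetical, and the restriction of rhetorical relations to discourse units is an assumption made only for this illustration.

```python
from dataclasses import dataclass
from enum import Enum


class RelationKind(Enum):
    """The three kinds of relations the extension focuses on."""
    REFERENTIAL = "referential"              # between discourse entities
    RHETORICAL = "rhetorical"                # RST (Mann and Thompson 1988)
    TOPIC_DEVELOPMENT = "topic_development"  # between discourse units


@dataclass(frozen=True)
class DiscourseUnit:
    uid: str
    text: str


@dataclass(frozen=True)
class DiscourseEntity:
    eid: str
    label: str


@dataclass(frozen=True)
class DiscourseMarker:
    form: str


# Argument class accepted by each relation kind.
# ASSUMPTION: the domain of rhetorical relations is not stated in the text;
# discourse units are used here purely for illustration.
ARGUMENT_TYPE = {
    RelationKind.REFERENTIAL: DiscourseEntity,
    RelationKind.RHETORICAL: DiscourseUnit,
    RelationKind.TOPIC_DEVELOPMENT: DiscourseUnit,
}


@dataclass(frozen=True)
class DiscourseRelation:
    kind: RelationKind
    name: str        # hypothetical label from a relation taxonomy
    source: object
    target: object

    def __post_init__(self):
        # Reject arguments whose class does not match the relation kind.
        expected = ARGUMENT_TYPE[self.kind]
        for arg in (self.source, self.target):
            if not isinstance(arg, expected):
                raise TypeError(
                    f"{self.kind.value} relation expects {expected.__name__}, "
                    f"got {type(arg).__name__}"
                )


# Usage: a referential relation between two discourse entities.
stick = DiscourseEntity("e1", "one stick")
blue_one = DiscourseEntity("e2", "the blue (stick)")
link = DiscourseRelation(RelationKind.REFERENTIAL, "coreference", blue_one, stick)
```

Typing the arguments per relation kind makes the distinction between entity-level and unit-level relations explicit, which is the kind of constraint an ontological representation would be expected to state declaratively.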

The integration of discourse relations into GOLD allows endangered languages to be described not only with respect to morphosyntactic categories but also with respect to discourse. The envisaged categories have already been applied to several typologically diverse languages, including Japanese and Kilivila, an Austronesian language (SIL code KIJ). For example, Sasaki et al. (2002) describe the function of classificatory particles, which mark nominal classes, in establishing referential relations:

kei–        ta   kaii   ku–  kau
CP.wooden–  one  stick  2.–  take
‘take one stick...’

kei–        bwabwau
CP.wooden–  blue
‘the blue (stick)...’
Example: Classificatory Particles in Kilivila
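How classifier agreement could help establish a referential relation can be sketched as follows. This is a deliberately naive illustration, not the analysis of Sasaki et al. (2002): the `Mention` structure, the function name and the nearest-match strategy are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    gloss: str            # free translation of the mention
    classifier: str       # nominal class marked by the classificatory particle
    has_head_noun: bool   # False for elliptical mentions such as 'the blue (one)'


def resolve_by_classifier(mentions: List[Mention], index: int) -> Optional[Mention]:
    """Return the nearest preceding mention that has a head noun and carries
    the same classifier as mentions[index], or None if there is none.

    ASSUMPTION: nearest-match on classifier identity is a simplification
    chosen for this sketch only.
    """
    target = mentions[index]
    for candidate in reversed(mentions[:index]):
        if candidate.has_head_noun and candidate.classifier == target.classifier:
            return candidate
    return None


# The two mentions from the Kilivila example above, in text order.
mentions = [
    Mention("one stick", "CP.wooden", True),
    Mention("the blue (stick)", "CP.wooden", False),
]
antecedent = resolve_by_classifier(mentions, 1)
```

In this sketch the elliptical second mention is linked to "one stick" because both carry the classifier glossed as CP.wooden; a realistic resolver would combine this cue with other knowledge sources.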

The full paper will describe in detail the kinds of relations, their integration into the GOLD core module, and their benefit for the endangered languages community.

