Gary F. Simons, SIL International
Baden Hughes, University of Melbourne

GOLD as a Standard for Linguistic Data Interoperation: A road map for development

GOLD, the General Ontology for Linguistic Description [1], has somewhat unexpectedly emerged from the EMELD project. Originally conceived of as a morphosyntactic annotation inventory and label mapping scheme, GOLD has now been formalized as an ontology by which disparate data sets can be integrated through a common representation of the basic linguistic features.

The overall vision of the GOLD Community is that:

"By agreeing on a shared ONTOLOGY of linguistic concepts and on a shared infrastructure for INTEROPERATION, the linguistics community will be able to produce RESOURCES that describe individual languages in a comparable way, to develop TOOLS that produce these comparable resources, and to query SERVICES that aggregate as many comparable resources as are available." [2]

In the EMELD context, a significant amount of effort has been invested in the development of GOLD in the first dimension of this vision, namely a shared collection of linguistic concepts. Initial surveying work has been completed to glean linguistic concepts and their definitions from published materials. This survey work has been complemented by web data mining activities [3] to further increase the coverage of GOLD. GOLD has been instantiated in several formal versions, and a range of proof-of-concept implementations have featured at previous EMELD events [4, 5, 6] and other venues [7].

However, the latter four items from the GOLD Community vision (to achieve interoperation through resources, tools, and services) remain largely unaddressed, and thus considerable effort remains to be expended in achieving the vision in its entirety. Upon reflection, we believe that there are presently three significant barriers to the widespread adoption of GOLD and the subsequent realization of the interoperation goals, namely:
  • the complexity of the dissemination format which in effect places the threshold for engagement with GOLD at too high a level;
  • the absence of a well defined change process through which GOLD can evolve into a standard that is truly community grounded;
  • the lack of compelling GOLD-enabled applications which provide traction amongst end user communities.
In this paper we briefly discuss each of these problems in turn and offer a number of concrete suggestions as to how they can be addressed.

The current expression of GOLD is as an OWL document [8], a relatively complex representation grounded in formal description logic. This we have identified as the first impediment. We argue that this particular expression is but one way that GOLD can be expressed. At its core GOLD is the set of linguistic concepts and their corresponding definitions that have been associated with globally unique identifiers (URIs). It is these identifiers that are the basis of interoperation, not a particular technology for rendering the association between identifiers, concepts, and definitions. Therefore, GOLD can be expressed in a variety of forms including comma- or tab-delimited files, relational (SQL) database tables, various XML forms recommended by the W3C (e.g. RDFS, SKOS, OWL), and so on. We believe that such expression in multiple forms is critical to the adoption of GOLD by software developers, since it significantly lowers the barrier to entry imposed by having only the single representation in OWL. Adoption by more software developers should in turn promote wider adoption among linguists.
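
As a minimal illustrative sketch of this point (not an actual GOLD distribution), the following Python fragment renders the same small set of concepts both as a tab-delimited file and as a relational table, and checks that the identifiers survive unchanged in each rendering. The URIs, labels, definitions, and the table and column names are all hypothetical placeholders chosen for the example.

```python
import csv
import io
import sqlite3

# Hypothetical concept records: (URI, label, definition).
# The example.org namespace and the definitions are placeholders, not real GOLD data.
CONCEPTS = [
    ("http://example.org/gold/PastTense", "PastTense",
     "Locates a situation before the moment of speech."),
    ("http://example.org/gold/Noun", "Noun",
     "A part of speech whose members typically denote entities."),
]


def to_tsv(concepts):
    """Render the concepts as tab-delimited text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["uri", "label", "definition"])
    writer.writerows(concepts)
    return buf.getvalue()


def to_sqlite(concepts):
    """Load the concepts into an in-memory relational table keyed by URI."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE concept (uri TEXT PRIMARY KEY, label TEXT, definition TEXT)"
    )
    conn.executemany("INSERT INTO concept VALUES (?, ?, ?)", concepts)
    return conn


def uris_from_tsv(text):
    """Read the identifiers back out of the tab-delimited rendering."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["uri"] for row in reader}


tsv_text = to_tsv(CONCEPTS)
conn = to_sqlite(CONCEPTS)
tsv_uris = uris_from_tsv(tsv_text)
sql_uris = {row[0] for row in conn.execute("SELECT uri FROM concept")}
```

Whichever rendering a developer consumes, the set of URIs is the same, and it is that set, rather than the file format, on which interoperation would rest.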

The second issue we have identified is the absence of a well defined change process through which GOLD can evolve into a standard that is truly community grounded. In order to address this issue, we propose that GOLD requires a formalized process for managing change, perhaps modeled after practices in other communities [9]. Without a constrained change management process, a GOLD adopter would have little confidence that an incremental version update would not break their implementation. In addition, a community-centric change management process allows for review of proposed changes by the wider community, thus fostering greater acceptance and wider adoption. In similar community-based efforts, the actual work of developing a standard is performed in working groups [10]. We propose that in the absence of dedicated funding for the expansion of GOLD's coverage, the working group model would allow interested parties to contribute to the development of GOLD in a lightweight but effective manner.

The third impediment, the lack of compelling GOLD-enabled applications which provide traction amongst end user communities, we believe can be addressed in several ways. The idea of language profiles [11] as an application for GOLD has been introduced into the community; a language profile is an account of the morphosyntactic categories and features of a specific language in terms of the concepts in GOLD. Language profiles are foundational for interoperation of textual and lexical data. Thus tools to facilitate their creation and services to facilitate their comparison are an obvious first step. Additionally, there are very few examples of linguistic data that are annotated with GOLD concepts, despite notable efforts [12]. We propose to start a repository for such data, and to publish the collection metadata as an OLAC data provider, thus enabling discovery of richly annotated and freely available data for interested researchers to work with. In addition to sample data, we need to provide examples of 'embedded' GOLD in commonly used linguistic analysis tools. This will allow documentary linguists to start to use GOLD without even realizing it; consider the value of GOLD-based rangesets in Shoebox [13], for example. Such stealth mode adoption, based on the collaboration of GOLD maintainers and application developers, will significantly advance the currency of GOLD.
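
To make the notion of a language profile and its comparison concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that a profile can be modelled as a mapping from a project's own annotation labels to concept URIs; the two profiles, their labels, and the example.org namespace are hypothetical and do not reflect the profile format proposed in [11].

```python
# Placeholder namespace standing in for real GOLD concept URIs.
GOLD = "http://example.org/gold/"

# Two hypothetical language profiles: each maps a project's own labels
# to the concept identifiers those labels stand for.
PROFILE_A = {
    "PST": GOLD + "PastTense",
    "N": GOLD + "Noun",
    "PL": GOLD + "Plural",
}
PROFILE_B = {
    "past": GOLD + "PastTense",
    "noun": GOLD + "Noun",
    "du": GOLD + "Dual",
}


def shared_concepts(a, b):
    """Return the concept URIs that both profiles make use of."""
    return set(a.values()) & set(b.values())


def label_correspondences(a, b):
    """Map each label in profile a to the label in profile b with the same URI."""
    label_by_uri = {uri: label for label, uri in b.items()}
    return {label: label_by_uri[uri] for label, uri in a.items() if uri in label_by_uri}


common = shared_concepts(PROFILE_A, PROFILE_B)
correspondences = label_correspondences(PROFILE_A, PROFILE_B)
```

Under these assumptions, the labels "PST" and "past" are recognized as equivalent only because both resolve to the same identifier, which is the kind of comparison a profile service might offer across data sets that use different local tag sets.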

It is our view that these steps are necessary to ensure both the development of GOLD and its adoption in the wider community. This agenda is strongly motivated by the goal of interoperation across a wide variety of resources and services. Once such interoperation is achieved in the short to medium term, individual services will be able to harvest resources and exploit the inferencing capabilities of semantically rich expressions of GOLD (like an OWL rendition) to achieve the more ambitious goal of a linguistic knowledgebase in the future.