Scott Farrar, University of Arizona

A Universal Data Model for Linguistic Annotation Tools

An important goal of the E-MELD enterprise is to recommend best-practice standards and resources in support of the digitization of endangered language data. As a result, several proposals for best-practice data models have been put forth at E-MELD-sponsored events, in particular for dictionaries, paradigms, and interlinear text. While these proposals emphasize structural and encoding compatibility, mostly by recommending XML and Unicode, respectively, the resulting data models are not necessarily interoperable with respect to content. One suggestion for going beyond mere structural and encoding compatibility was to include in the various data models a reference to a common markup ontology, such as the General Ontology for Linguistic Description (GOLD). As yet, however, there have been few suggestions on how to implement a model that emphasizes content interoperability via an ontology. This paper attempts to fill the gap by describing a common data exchange format intended for use with a variety of data digitization tools.

The data exchange format is primarily a means to provide continuity and interoperability among tools with very different purposes, such that different aspects of the same data can be manipulated by each kind of tool. For example, consider a lexicon creation tool such as FIELD, whose output is a highly structured lexicon. If the data were structured according to a universally recognized model, the results could be loaded into another kind of tool, for instance, one that produces interlinear text based on the lexicon, or one that adds detailed phonetic annotation to each entry. The most important requirement is that the data exchange format accommodate the common linguistic data types, both of the traditional print variety (e.g., 'dictionary entries' and 'interlinear text') and of a more technical nature, such as those used in natural language processing applications (e.g., 'treebanks' and 'electronic dictionaries'). Another important requirement is that the model be 'conversion friendly', not only to accommodate the various tools, but also to ensure that the data can be displayed in a human-friendly format. Thus, the main design issues surrounding display-oriented versus content-oriented data structures are discussed in detail. We also discuss the role of current markup standards and how they can be leveraged to create a state-of-the-art data exchange format, in particular XML Schema, the Resource Description Framework, and the Extensible Stylesheet Language. Finally, the data exchange format is discussed as the basis for the data-centric component of the GOLD Community of Practice, with particular attention to the migration of primary data to a framework for knowledge-based applications.
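
The distinction between content-oriented and display-oriented structures can be illustrated with a minimal sketch. It is not the format proposed in the paper: the element names (entry, form, gloss, feature), the LexEntry class, and the namespace http://example.org/gold# are all hypothetical placeholders chosen for illustration. The sketch assumes only that each grammatical feature of an entry points at a concept in a shared ontology, that one tool can write the entry as XML, and that a second tool can read it back and derive a human-friendly rendering from the same content.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Placeholder namespace standing in for a shared ontology (an assumption,
# not the actual GOLD URI scheme).
ONTOLOGY_NS = "http://example.org/gold#"


@dataclass
class LexEntry:
    """Content-oriented record: features name ontology concepts, not display labels."""
    form: str
    gloss: str
    concepts: dict = field(default_factory=dict)  # feature name -> concept name


def to_xml(entry: LexEntry) -> ET.Element:
    """Serialize an entry for exchange; each feature carries an ontology reference."""
    el = ET.Element("entry")
    ET.SubElement(el, "form").text = entry.form
    ET.SubElement(el, "gloss").text = entry.gloss
    for name, concept in sorted(entry.concepts.items()):
        ET.SubElement(el, "feature", name=name, ref=ONTOLOGY_NS + concept)
    return el


def from_xml(el: ET.Element) -> LexEntry:
    """Rebuild the entry in a second tool, stripping the namespace prefix."""
    concepts = {
        f.get("name"): f.get("ref")[len(ONTOLOGY_NS):]
        for f in el.findall("feature")
    }
    return LexEntry(el.findtext("form"), el.findtext("gloss"), concepts)


def to_display(entry: LexEntry) -> str:
    """Display-oriented view derived from the content, never stored in it."""
    tags = ".".join(entry.concepts[k] for k in sorted(entry.concepts))
    return f"{entry.form}  '{entry.gloss}'  [{tags}]"


if __name__ == "__main__":
    entry = LexEntry("casa", "house", {"partOfSpeech": "Noun", "number": "Singular"})
    wire = ET.tostring(to_xml(entry), encoding="unicode")  # what one tool exports
    restored = from_xml(ET.fromstring(wire))               # what another tool imports
    print(wire)
    print(to_display(restored))
```

Because the display string is computed from the content rather than stored alongside it, a different tool could render the same entry as an interlinear gloss line or a dictionary entry without changing the exchanged data.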