Manuela Noske, Microsoft Corporation
Laura A. Otaala, University of Namibia at Windhoek
A Database of Interlinear Glossed Ateso Texts
This paper reports on the creation of an electronic corpus and an XML-based text database for Ateso, an under-documented Eastern Nilotic language spoken in Uganda by about 1 million people. Over the past two years we have compiled an electronic Ateso corpus of roughly 760,000 tokens. The corpus comprises the online newspaper Etop (http://www.etop.co.ug), the 1961 United Bible Societies translation of the four gospels of the New Testament, and smaller samples of Ateso texts found on the internet.
This corpus serves as the basis for different types of inquiry into Ateso morphology and syntax. The aspect we focus on in this paper is the creation of a database of interlinear glossed Ateso texts in an XML format. The overall architecture adheres to the general-purpose hierarchical model proposed in Bow, Bird and Hughes (2003), which distinguishes four levels: Text, Phrase, Word and Morpheme. Our implementation differs from the Bow, Bird and Hughes framework, however, in that it provides glosses at three levels of representation: the Morpheme, Word and Phrase levels.
Glossing at both the Word and the Phrase level leads to redundancy whenever the context-free Word level translation does not differ significantly from the Phrase level translation. We have found, however, that the benefit of having the additional information in the non-redundant cases far outweighs the cost of encoding it in XML. This is especially true given the flexibility afforded by the XML model and by rendering via XSLT style sheets.
The inclusion of glosses at all three levels has also proven advantageous for the overall workflow, since a Phrase and/or Word level translation is often available before a word has been morphologically analyzed and glossed. Using different style sheets, a wide or narrow view of the data can be taken. Another advantage is that the available, albeit incomplete, information can be shared with other linguists at an earlier stage in the project.
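The following is a minimal sketch of how such a four-level structure with glosses at three levels might be encoded, and how a wide or a narrow view might be derived from it. It is illustrative only: the element names (text, phrase, word, morpheme, form, gloss) and the render function are assumptions rather than the project's actual schema, the forms and glosses are placeholders and not Ateso data, and Python stands in here for the XSLT style sheets. The second word carries a Word level gloss but no morphemes yet, mirroring the workflow described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema; all forms and glosses are placeholders, not Ateso data.
SAMPLE = """
<text id="t1">
  <phrase>
    <word>
      <form>w1</form>
      <morpheme><form>m1</form><gloss>GLOSS1</gloss></morpheme>
      <morpheme><form>m2</form><gloss>GLOSS2</gloss></morpheme>
      <gloss>wordgloss1</gloss>
    </word>
    <word>
      <form>w2</form>
      <gloss>wordgloss2</gloss>
    </word>
    <gloss>phrase translation</gloss>
  </phrase>
</text>
"""


def render(text, view="wide"):
    """Return an interlinear rendering of a <text> element.

    view="wide":   word forms plus the Phrase level gloss.
    view="narrow": additionally the Morpheme and Word level glosses.
    """
    lines = []
    for phrase in text.findall("phrase"):
        words = phrase.findall("word")
        lines.append(" ".join(w.findtext("form", default="?") for w in words))
        if view == "narrow":
            morpheme_line = []
            for w in words:
                morphemes = w.findall("morpheme")
                if morphemes:
                    glosses = (m.findtext("gloss", default="?") for m in morphemes)
                    morpheme_line.append("-".join(glosses))
                else:
                    # Word not yet morphologically analyzed.
                    morpheme_line.append("?")
            lines.append(" ".join(morpheme_line))
            lines.append(" ".join(w.findtext("gloss", default="?") for w in words))
        lines.append("'" + phrase.findtext("gloss", default="") + "'")
    return "\n".join(lines)


root = ET.fromstring(SAMPLE)
print(render(root, "wide"))
print(render(root, "narrow"))
```

In this sketch the redundancy between Word and Phrase level glosses costs one extra element per word, while the choice of view is made entirely at rendering time and leaves the stored data untouched.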
In this talk we discuss