Gary Holton, Alaska Native Language Center

Approaches to digitization and annotation: A survey of language documentation materials in the Alaska Native Language Center Archive

Development of standards for the digitization of endangered language data must be undertaken with knowledge of the structures which underlie those languages. Perhaps more crucially, such standards must be based on the ways in which linguists have come to understand and represent linguistic structures, as manifested most clearly in existing linguistic documentation. To that end, this paper details the results of a brief survey of the types of documentation contained in the Alaska Native Language Center (ANLC) Archive. While the documentation contained in the ANLC Archive is typologically narrow in scope (most of Alaska’s 20 Native languages belong to one of two major language families), the depth and comprehensive nature of the archive provide insight into the range of existing approaches to endangered language documentation.

Even if attention is restricted to recorded and transcribed texts, the range of materials and documentary approaches is vast. Recordings include magnetic media such as reel-to-reel tapes and cassettes, as well as digital audio files in a variety of both open and proprietary formats. Transcriptions may or may not include annotation. Where annotation is included, its granularity varies enormously, from the morpheme to the word, the sentence, or even the paragraph. Multiple levels of annotation are often found within a single text. The existence of such a wide variety of annotation techniques in ANLC materials argues for flexibility in the design of annotation standards. Of particular interest is the observation that a number of texts in the archive contain handwritten comments on the transcriptions. This suggests that standards for digitization and annotation should accommodate multiple authors and allow one author to refine another’s transcription.
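
As a rough illustration of these requirements, the following Python sketch shows one way a digital text record might hold annotations at several levels, by several authors, with later annotations refining earlier ones without erasing them. It is not drawn from the ANLC Archive or from any existing standard: the class names, field names, level labels, and placeholder strings are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Annotation:
    """One annotation on a span of the transcription (hypothetical model)."""
    level: str                     # e.g. "morpheme", "word", "sentence", "paragraph"
    start: int                     # character offset into the transcription (inclusive)
    end: int                       # character offset into the transcription (exclusive)
    value: str                     # gloss, translation, or comment
    author: str                    # who wrote this annotation
    revises: Optional[int] = None  # index of an earlier annotation this one refines


@dataclass
class Text:
    """A transcription plus any number of annotations at any level."""
    transcription: str
    annotations: List[Annotation] = field(default_factory=list)

    def annotate(self, level: str, start: int, end: int, value: str,
                 author: str, revises: Optional[int] = None) -> int:
        """Append an annotation and return its index; earlier ones are never deleted."""
        if not (0 <= start < end <= len(self.transcription)):
            raise ValueError("span lies outside the transcription")
        if revises is not None and not (0 <= revises < len(self.annotations)):
            raise ValueError("revises must refer to an existing annotation")
        self.annotations.append(Annotation(level, start, end, value, author, revises))
        return len(self.annotations) - 1

    def current(self, level: Optional[str] = None) -> List[Annotation]:
        """Return annotations not superseded by a revision, optionally for one level."""
        superseded = {a.revises for a in self.annotations if a.revises is not None}
        return [a for i, a in enumerate(self.annotations)
                if i not in superseded and (level is None or a.level == level)]


# Usage with placeholder strings (not real language data).
text = Text("aaa bbb")
first = text.annotate("word", 0, 3, "gloss-1", "author_a")
text.annotate("sentence", 0, 7, "free translation", "author_a")
text.annotate("word", 0, 3, "gloss-1-revised", "author_b", revises=first)
print([a.value for a in text.current("word")])
```

Because a revision is stored alongside the annotation it refines rather than replacing it, the record preserves the same layering that handwritten comments leave on a paper transcription.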