Thorsten Trippel, Bielefeld University

The Missing Links in Documentary Linguistics: An approach to bridging the gap between annotation tools

This paper focuses on ways of making linguistic annotation tools for documentary linguistics and archiving interoperable without restricting the features of the individual tools. It is assumed that no single tool is used exclusively in the process of documenting and archiving a language, but rather a suite of tools chosen according to their strengths and the user's needs. Developers and archivists have frequently discussed the portability of individual tools, but the interoperability of different tools has generally been neglected, even though XML is now accepted for data structure specifications.

Examples of widely used annotation applications are Transcriber (signal-based rapid broad transcription), the Linguist's Toolbox (word-based morphosyntactic interlinear annotation), Praat, Wavesurfer, TASX-Annotator, and ELAN. The applications support various combinations of multitier text and signal annotations, as well as the metadata administration used for archiving and distributing corpora.

These applications can potentially be combined into a suite of tools in a common workflow:
  1. A linguist could use Transcriber for immediate transcription in fieldwork situations, for rapid data collection on a turn or sentence basis, roughly aligned to the signal.
  2. The sentences from Transcriber serve as the basis for morphosyntactic annotation with Toolbox.
  3. The roughly signal-aligned transcription can be used as the input for a detailed analysis at the word or segment level in phonetic software, by taking the Transcriber annotation as a sentence/turn tier and adding more tiers.
  4. The Toolbox interlinearization, which is also multitier, can be used to add morphosyntactic information to the annotation from the phonetic software, provided a word-level annotation exists there.
At present, this workflow requires either a programmer to write file conversion programs or a patient person to copy and paste the text between the applications.
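
Step 4 of this workflow can be sketched in a few lines of Python. The sketch is only an illustration under stated assumptions: the `Interval` type, the function `add_gloss_tier`, and the sample data are hypothetical and do not reflect the file formats of Toolbox or of any phonetic software. It assumes that the word-level tier and the interlinear glosses have already been loaded and that both list the same words in the same order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    """One time-aligned annotation: start and end in seconds plus a label."""
    start: float
    end: float
    label: str


def add_gloss_tier(words: List[Interval],
                   glosses: List[Tuple[str, str]]) -> List[Interval]:
    """Copy the time stamps of a word tier onto the matching glosses.

    `glosses` holds (word, gloss) pairs taken from an interlinear annotation.
    Assumption: both inputs list the same words in the same order.
    """
    if len(words) != len(glosses):
        raise ValueError("word tier and interlinear annotation differ in length")
    tier = []
    for interval, (word, gloss) in zip(words, glosses):
        if interval.label != word:
            raise ValueError(f"mismatch: {interval.label!r} vs {word!r}")
        tier.append(Interval(interval.start, interval.end, gloss))
    return tier


# Invented sample data, for illustration only.
word_tier = [Interval(0.00, 0.42, "the"), Interval(0.42, 0.90, "dogs")]
interlinear = [("the", "DET"), ("dogs", "dog-PL")]
gloss_tier = add_gloss_tier(word_tier, interlinear)
```

The resulting `gloss_tier` carries the time stamps of the word tier, so it could be stored as an additional tier next to the phonetic annotation. Real data would need a more tolerant matching step, since transcriptions in the two tools rarely agree token for token.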

A suite of this kind is hybrid and therefore needs import and export functionality for all of its components. The developers, however, usually work on different aspects and features of their open source software, and interoperability cannot be required of them. When a standard exchange format becomes available, this situation might change. Another option is to have separate tools for converting between file formats. However, installing a variety of such tools and learning a new interface, new sets of instructions, and new limitations for each of them is not ergonomically feasible.

One solution is to provide a website for file conversion, where the user selects the source data format and the output format. The linguist only needs a connection to the Internet and a browser, while the conversion is done on the server side, allowing fast and efficient debugging and improvement of the conversion routines. Such a tool, initially supporting the source formats Praat, TASX, and Transcriber and the output formats Praat, TASX, and plain text, is included. It uses the TASX format as a generic 'lingua franca' format for reasons of simplicity, because it permits preservation of all application-specific metadata.
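
The benefit of a single intermediate format can be illustrated with a small sketch: each supported format needs only one reader into and one writer out of the intermediate representation, instead of a dedicated converter for every pair of formats. The Python code below is a hypothetical illustration and not the conversion service itself. The in-memory `Annotation` model, the `READERS` and `WRITERS` registries, and the toy `tsv` source format are assumptions made for this example; no actual Praat, TASX, or Transcriber parsing is shown.

```python
from typing import Callable, Dict, List, Tuple

# Hub model (assumption): tier name -> list of (start, end, label) intervals.
Annotation = Dict[str, List[Tuple[float, float, str]]]

READERS: Dict[str, Callable[[str], Annotation]] = {}
WRITERS: Dict[str, Callable[[Annotation], str]] = {}


def read_tsv(text: str) -> Annotation:
    """Toy source format: one 'tier<TAB>start<TAB>end<TAB>label' per line."""
    annotation: Annotation = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        tier, start, end, label = line.split("\t")
        annotation.setdefault(tier, []).append((float(start), float(end), label))
    return annotation


def write_plain_text(annotation: Annotation) -> str:
    """Plain-text output: labels of each tier in time order, one tier per line."""
    lines = []
    for tier, intervals in annotation.items():
        labels = " ".join(label for _, _, label in sorted(intervals))
        lines.append(f"{tier}: {labels}")
    return "\n".join(lines)


READERS["tsv"] = read_tsv
WRITERS["text"] = write_plain_text


def convert(data: str, source: str, target: str) -> str:
    """Convert by reading into the hub model and writing from it."""
    return WRITERS[target](READERS[source](data))


# Invented sample input with intervals deliberately out of time order.
sample = "words\t0.5\t0.9\tdogs\nwords\t0.0\t0.5\tthe"
result = convert(sample, "tsv", "text")
```

Supporting a further format in this design means registering one additional reader and, if output is wanted, one additional writer; the existing entries remain untouched.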

Future perspectives for extending the tool, and the problems these extensions raise, are outlined in the full paper. They include the use of Toolbox interlinearizations as a source format, provided that one corresponding annotation level is already available from one of the signal-based tools.