Christopher Manning, Stanford University

Delivering the Benefits of Best Practice Data: Lexicon exploration in Kirrkirr

Recent years have seen a greatly increased awareness of the importance of language documentation, and of the need for archival data formats for such documentation, including, among other formats, well-structured text formats such as XML for textual data (Bird and Simons 2003, Austin in press). This emphasis is correct, but, notwithstanding increasingly good tools for the production of such data, producing well-structured archival data is almost inevitably more work for the producer than producing inconsistent or unstructured data with tools such as text editors. I argue that it is thus essential to show the creator the benefits that come from having well-structured archival data. One way to do that is to provide a good pathway from this data to a static view of it on a web page, as in a typical online lexicon. In this paper I will describe the Kirrkirr project, which has tried to push this goal further within the domain of indigenous language lexicons. The central premise is that once you have well-structured dictionary data, computer software can automatically (and cheaply) transform and visualize that data in ways that satisfy user needs, and in ways that go well beyond what online dictionaries typically offer. That is, for native speakers or semi-speakers, the dictionary can provide such things as easy lookup via spelling correction, maps of word associations, and automatically produced activities such as word games, which can promote language learning. For the linguist, the software can not only allow flexible searching via textual patterns, but can also support easy searching and visualization of semantic domains and other lexical chaining, and offer a way to find lexical gaps and inconsistencies.

This paper will focus on new capabilities added to Kirrkirr in the last few years. These include: (i) generalization of the data model and data access so that Kirrkirr can work directly over lexicons encoded in the FIELD format or SIL's MDF (MultiDictionaryFormatter), once the latter has been converted to XML by a tool such as Toolbox; (ii) a new visualization that allows exploration of dictionary content via semantic domain browsing; (iii) leveraging an L1-L2 dictionary to automatically produce a rough L2-L1 dictionary; and (iv) automatic deployment of word games based on dictionary content.
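
To make capability (iii) concrete, the following is a minimal sketch of how a rough L2-L1 index might be derived from structured L1-L2 entries. It is an illustration under stated assumptions, not Kirrkirr's actual implementation: the function name reverse_lexicon, the dictionary-of-lists data shape, and the placeholder headwords are all hypothetical, and real entries would first have to be read from the XML lexicon.

```python
from collections import defaultdict


def reverse_lexicon(entries):
    """Invert a mapping of L1 headword -> L2 glosses into L2 gloss -> L1 headwords.

    Hypothetical helper for illustration only. Glosses are trimmed and
    lower-cased so that trivially different spellings share one reversed entry.
    """
    index = defaultdict(set)
    for headword, glosses in entries.items():
        for gloss in glosses:
            key = gloss.strip().lower()
            if key:  # skip empty glosses rather than index them
                index[key].add(headword)
    # Sort glosses and headwords so the reversed dictionary is deterministic.
    return {gloss: sorted(heads) for gloss, heads in sorted(index.items())}


# Placeholder headwords (not real lexical data), each with its L2 glosses.
sample = {
    "alpha": ["dog", "Hound"],
    "beta": ["dog"],
    "gamma": ["water "],
}
reversed_index = reverse_lexicon(sample)
for gloss, headwords in reversed_index.items():
    print(gloss, "->", ", ".join(headwords))
```

The result is only "rough" in the sense used above: every gloss becomes a reversed headword regardless of sense distinctions, so consistent gloss fields in the source data directly determine the quality of the reversed dictionary.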

Taken together, these tools not only transform a dictionary into a more captivating, dynamic environment, but also give the creator a clear sense of what well-structured lexical data has enabled.

