William Lewis, California State University Fresno

Mining and Migrating Interlinear Text

Interlinear text (IL) is a standard format for presenting linguistic data, and it has been used consistently in scholarly linguistic papers for nearly a century. IL text generally consists of three lines: a line of language data, often broken down by morpheme; a line of grammatical and gloss information aligned with the first line; and a line giving the translation. Although the IL format is widely used, it is intended for human consumption, and IL text is difficult for automated agents to access.
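
The three-line structure described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: words are separated by whitespace, morphemes by hyphens, and the example is a made-up English sentence. The names `AlignedWord` and `parse_interlinear` are hypothetical and do not come from any existing tool; real IL text in papers is far less regular.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    morphemes: list[str]  # morphemes of one source-language word
    glosses: list[str]    # gloss or grammatical tag for each morpheme


def parse_interlinear(block: str) -> tuple[list[AlignedWord], str]:
    """Split a three-line IL block into aligned words plus the translation.

    Assumes whitespace between words and hyphens between morphemes.
    """
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError("expected three lines: text, gloss, translation")
    text_line, gloss_line, translation = lines
    words = text_line.split()
    word_glosses = gloss_line.split()
    if len(words) != len(word_glosses):
        raise ValueError("text and gloss lines have different word counts")
    aligned = []
    for word, gloss in zip(words, word_glosses):
        morphemes = word.split("-")
        glosses = gloss.split("-")
        if len(morphemes) != len(glosses):
            raise ValueError(f"morpheme/gloss mismatch: {word!r} vs {gloss!r}")
        aligned.append(AlignedWord(morphemes, glosses))
    return aligned, translation


# Made-up English example, for illustration only.
EXAMPLE = """
the dog-s bark-ed
DET dog-PL bark-PST
'The dogs barked.'
"""

aligned, translation = parse_interlinear(EXAMPLE)
```

The strict length checks are a deliberate choice in this sketch: a misaligned example raises an error instead of being silently paired with the wrong gloss.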

This paper will focus on two issues: (1) defining methods for isolating the "semantics" of interlinear text, essentially identifying the tags commonly used in the second line of IL text and the linguistic knowledge they represent, and (2) designing tools to migrate the data contained in IL text to a Best Practice (BP) format. The first is part of a larger effort to define an ontology of linguistic knowledge for use by automated agents (such as search engines). The second aims to ensure the long-term survival of language data. Because it is confined to individual examples in scholarly papers, the enriched language data contained in IL text is not readily accessible. Migrating it to a standard encoding format for dissemination on the Web will ensure its long-term survival.
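
As a rough sketch of the two tasks, the Python below collects the grammatical tags from the gloss line and serializes one example to XML. Several things here are assumptions, not the paper's method: grammatical tags are taken to be glosses written in capitals, the element names (`phrase`, `word`, `morpheme`, `translation`) are a hypothetical layout and not the BP format itself, and the data is a made-up English sentence.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up example: one (morphemes, glosses) pair per word.
WORDS = [
    (["the"], ["DET"]),
    (["dog", "s"], ["dog", "PL"]),
    (["bark", "ed"], ["bark", "PST"]),
]
TRANSLATION = "The dogs barked."


def grammatical_tags(words):
    """Count glosses that look like grammatical tags.

    Assumption: a gloss in capitals (e.g. PL, PST) is a tag, while a
    lowercase gloss is a lexical translation.
    """
    return Counter(g for _, glosses in words for g in glosses if g.isupper())


def to_xml(words, translation):
    """Serialize one IL example to a hypothetical XML layout."""
    phrase = ET.Element("phrase")
    for morphemes, glosses in words:
        word = ET.SubElement(phrase, "word")
        for form, gloss in zip(morphemes, glosses):
            # Each morpheme keeps its gloss as an attribute.
            morph = ET.SubElement(word, "morpheme", gloss=gloss)
            morph.text = form
    ET.SubElement(phrase, "translation").text = translation
    return ET.tostring(phrase, encoding="unicode")


tags = grammatical_tags(WORDS)
xml_text = to_xml(WORDS, TRANSLATION)
```

Tag counts gathered this way across many examples would give a first inventory of the labels an ontology needs to cover, while the XML output keeps each morpheme explicitly paired with its gloss instead of relying on visual alignment.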