Alison Alvarez, Lori Levin, Robert Frederking, Erik Peterson, & Simon Fung
Carnegie Mellon University

Tools for Elicitation Corpus Creation

The AVENUE project has produced a package of tools for MInor Language Elicitation (MILE). The tools enable a researcher to identify syntactic or semantic features that are of interest, specify which combinations of features should be examined, and create a questionnaire or elicitation corpus representing those features and feature combinations. There is also an elicitation tool used by language informants to translate and align elicitation corpus sentences.

The output of MILE is a list of sentences with a feature structure for each sentence. The feature structure represents the meaning or grammar that is meant to be conveyed by the sentence. The feature structure is meant to be language-independent. The sentence can be in any language that a fieldworker might use for elicitation. If the elicitation language (e.g., English) does not cover all of the meanings encoded in the feature structure, it may be necessary to include a context field to explicitly state this information.
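As a rough illustration of the output described above, one elicitation corpus entry could be modeled as a sentence, a feature structure, and an optional context field. This is a minimal sketch only: the class name, field names, and feature values are hypothetical and do not reflect MILE's actual file format.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ElicitationEntry:
    """One corpus entry (hypothetical layout, not MILE's real format)."""
    sentence: str                  # sentence in the elicitation language
    features: Dict[str, str]       # language-independent feature structure
    context: Optional[str] = None  # states meaning the sentence cannot express


# English "you" does not mark number, so the context field carries
# the dual meaning encoded in the feature structure.
entry = ElicitationEntry(
    sentence="You sang.",
    features={"person": "second", "number": "dual", "tense": "past"},
    context="Addressed to exactly two people.",
)
```

The context field is filled in only when the elicitation language leaves part of the feature structure unexpressed.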

The tools include the following:

XML Schema and XSLT Script for Feature Specification: A feature specification is a list of features such as tense, person, and identifiability with values for each feature, such as past tense, third person, and identifiable. The schema also allows for default values to be identified for each feature. It is also possible to state co-occurrence restrictions on values (e.g., no first person inanimate nouns).

Interface for Designing and Generating Feature Structures (see above): This tool uses the feature specification to create a compact representation for a large set of sentences. The representation is composed of features and values chosen by the user and constrained by the feature specification. For example, a fieldworker can design a feature structure set to explore all combinations of person, number, gender, and tense. The set of feature structures is generated automatically from the compact representation.

Graphical Interface for Reading and Annotating Feature Structures (see above): This tool lets users display complex feature structures in a variety of ways, making them easier to read and to annotate with sentences and context fields.

Elicitation Tool (see above): This interface is designed to gather translations and alignments of the elicitation corpus from language informants.
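The feature specification and the generation step described in the list above can be sketched as follows. This is a simplified illustration in Python, not the project's XML Schema or its interface: the dictionary layout, the convention that the first listed value is the default, and the names `FEATURE_SPEC`, `RESTRICTIONS`, and `expand` are all assumptions made for the example.

```python
from itertools import product

# Hypothetical feature specification: each feature maps to its allowed
# values, and the first value listed is treated as the default.
FEATURE_SPEC = {
    "person": ["third", "first", "second"],
    "number": ["singular", "plural"],
    "animacy": ["animate", "inanimate"],
    "tense": ["present", "past"],
}

# Co-occurrence restrictions: value combinations that must not appear
# together (here: no first person inanimate nouns).
RESTRICTIONS = [{"person": "first", "animacy": "inanimate"}]


def violates(structure, restriction):
    """True if the structure contains every value named in the restriction."""
    return all(structure.get(f) == v for f, v in restriction.items())


def expand(compact):
    """Expand a compact representation into full feature structures.

    `compact` maps a feature to the list of values to explore; any
    feature it omits is fixed at its default value.
    """
    for feature, values in compact.items():
        unknown = set(values) - set(FEATURE_SPEC[feature])
        if unknown:
            raise ValueError(f"{feature}: unknown values {sorted(unknown)}")
    features = list(FEATURE_SPEC)
    choices = [compact.get(f, [FEATURE_SPEC[f][0]]) for f in features]
    structures = []
    for combo in product(*choices):
        structure = dict(zip(features, combo))
        if not any(violates(structure, r) for r in RESTRICTIONS):
            structures.append(structure)
    return structures


# Explore every person/animacy combination; number and tense stay at defaults.
structures = expand({
    "person": ["first", "second", "third"],
    "animacy": ["animate", "inanimate"],
})
```

In this sketch, three person values and two animacy values give six combinations, and the restriction removes the first person inanimate one, leaving five feature structures.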

These tools can be used to create elicitation corpora that will bootstrap morphological analyzers, facilitate the discovery of language features (either through machine learning or human analysis), ensure maximum syntactic and morphological coverage for translation corpora, or build questionnaires that can be administered by non-linguists.

The elicitation tool has been tested on at least ten languages. The XML schema, feature structure generation, and feature structure interface have been used in a project for the US Government to produce a 70,000-word elicitation corpus. This corpus will be translated into seven languages a year for the next five years.