Marc Kemps-Snijders, Max Planck Institute for Psycholinguistics
Sebastian Drude, Freie Universität Berlin
Peter Wittenburg, Max Planck Institute for Psycholinguistics

Standards and Tools for Lexicon Manipulation

As long as dictionaries were available only in printed form, evaluating lexica and relating them to other documents was left entirely to the researchers' interpretation. Differences in structure and style, as described for example by Bell & Bird, did not pose problems. When computer-aided analysis started, however, the intellectual freedom of choosing arbitrary formats, structures and terminology became a severe obstacle: interesting material was practically inaccessible. Based on an investigation of many computer-based lexica, Peters, Wittenburg and Drude demonstrated the wide variety of lexical structures in language documentation, natural language processing and other linguistic sub-disciplines. While the formats are mainly determined by the tools being used, the choice of a certain lexical structure and linguistic terminology is mainly influenced by three factors: (1) the language being analyzed, (2) the linguistic theory in mind and (3) the intention associated with the lexicon.

For a few years now, researchers have been working on harmonizing lexica. Harmonization would help researchers to access the material easily and allow tool developers to focus on implementing functionality rather than addressing conversion problems again and again. The GENELEX schema was one of the early attempts at a generic NLP lexicon structure. Although it offered deep insight into lexica used in NLP, GENELEX was not truly generic. Follow-up projects such as PAROLE and MILE elaborated further on this line. In a number of conference papers by researchers such as Ide, Erjavec, Grishman, Kilgarriff, Romary and Veronis, a variety of structural phenomena, including for example inheritance mechanisms, were carefully analyzed. In 2003, finally, a first workshop on a truly generic model for lexical structures was organized, focusing on a Lego-brick-like approach: only a model with the expressive power to accommodate every known lexical structure would be sufficient.

Most recently, ISO TC37/SC4 started developing the Lexical Markup Framework (LMF). LMF is a flexible model based on a simple core model and on flexible, recursive extension mechanisms that allow the specification of any kind of tree structure and, on top of that, any kind of typed relation between lexical units. Semantic interoperability is facilitated by re-using data categories from a central ISO Data Category Registry. A similar approach was chosen by E-MELD with its FIELD tool, which links up to the GOLD ontology.
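The combination of a small core, recursive extensions and typed relations can be illustrated with a minimal sketch (in Python). This is our own illustration in the spirit of the LMF design, not the normative ISO model; all class and field names here are invented for exposition.

```python
# Illustrative sketch of an LMF-style lexicon model (names are ours, not ISO's):
# - a recursive Component forms arbitrary tree structures,
# - each node carries data categories (which, in a real system, would
#   reference entries in a central data category registry),
# - typed Relations connect lexical units across the trees.
from dataclasses import dataclass, field

@dataclass
class Component:
    """A recursive building block: data categories plus child components."""
    name: str
    categories: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add(self, child: "Component") -> "Component":
        self.children.append(child)
        return child

@dataclass
class Relation:
    """A typed relation between two lexical units, e.g. 'antonym'."""
    rel_type: str
    source: Component
    target: Component

class Lexicon:
    def __init__(self):
        self.entries = []
        self.relations = []

    def new_entry(self, lemma: str) -> Component:
        entry = Component("LexicalEntry", {"lemma": lemma})
        self.entries.append(entry)
        return entry

    def relate(self, rel_type: str, source: Component, target: Component):
        self.relations.append(Relation(rel_type, source, target))

# Build a tiny two-entry lexicon: forms and senses are just child components.
lex = Lexicon()
big = lex.new_entry("big")
big.add(Component("Form", {"writtenForm": "big", "partOfSpeech": "adjective"}))
big.add(Component("Sense", {"definition": "of great size"}))
small = lex.new_entry("small")
lex.relate("antonym", big, small)

print(big.children[0].categories["partOfSpeech"])  # adjective
print(lex.relations[0].rel_type)                   # antonym
```

Because the tree is built from one generic, recursive node type, any depth of sub-structure (forms, senses, examples within senses, and so on) can be expressed without changing the model, which is the essence of the "Lego-brick" approach.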

Based on the LMF model, the LEXUS tool was developed. It allows the creation and manipulation of all types of schemas and content, and imports Shoebox, CHAT and XML lexica of all kinds. By supporting APIs to the ISO DCR and to the Shoebox MDF categories (support for GOLD is forthcoming), it allows users to re-use existing data categories and thus tackle the semantic interoperability problem. Due to its flexible design, data categories can also include links to images, sound and video fragments. So far we have been able to represent all kinds of existing lexical structures. LEXUS also offers operations across different lexica, i.e., the interoperability problem is tackled at both the structural and the semantic level.
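To give an impression of the kind of import step such a tool performs, the following sketch parses one Shoebox record in MDF backslash notation into a generic structure. The marker meanings (\lx lexeme, \ps part of speech, \ge English gloss) follow standard MDF conventions; the target dictionary structure is our own illustration and not the actual LEXUS-internal format.

```python
# Sketch of importing a Shoebox/MDF record: each line starts with a
# backslash marker (\lx, \ps, \ge, ...); repeated markers (e.g. several
# glosses) are collected into lists. The output structure is illustrative.
record = """\\lx kamu
\\ps n
\\ge water
\\ge liquid"""

def parse_mdf(text: str) -> dict:
    """Collect backslash-coded fields of one MDF record into a dict."""
    entry = {}
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # skip continuation or blank lines
        marker, _, value = line[1:].partition(" ")
        entry.setdefault(marker, []).append(value.strip())
    return entry

parsed = parse_mdf(record)
print(parsed["lx"])  # ['kamu']
print(parsed["ge"])  # ['water', 'liquid']
```

Once a record is in such a generic form, its markers can be mapped onto registered data categories (e.g. \ps onto a part-of-speech category from the ISO DCR), which is what turns a format conversion into a step towards semantic interoperability.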