William Lewis, University of Washington

Querying a Large Multi-lingual Database

In this session we present ODIN, the Online Database of INterlinear text, and in particular its new Advanced Search features. ODIN is a database of Interlinear Glossed Text (IGT) containing data harvested from online resources. At the time of this writing, it houses over 27,000 instances of IGT from over 625 different languages, extracted from nearly 1600 documents. ODIN continues to grow at a rate of about 250-500 instances per week.

Until recently, ODIN's search facility was limited to search by language. For instance, a linguist interested in data for Warlpiri could consult ODIN's website (http://www.csufresno.edu/odin), select Warlpiri from the list of languages, and receive a list of several dozen papers that are about Warlpiri or contain Warlpiri data. The linguist could then view each of these papers in turn, or view the Warlpiri data extracted from them (ODIN displays only data for which a citation can be displayed).

ODIN's Advanced Search extends this capability by providing search over the data itself. Using Advanced Search, the linguist can look for specific instances of markup, such as Ergative Case, Perfective Aspect, or Causative Voice, across all languages or across the languages of a specific family. All search terms are normalized to a common terminology set, the terms defined in the General Ontology for Linguistic Description (GOLD), so that glosses like PERF, PFV, PF, and 3SGPERF all map to a common form, gold:PerfectiveAspect. In addition to concept queries, a list of pre-packaged queries is provided, allowing linguists to search for data on particular constructions and features of interest. For instance, using the Construction/Feature query facility, the linguist can look for IGT instances that contain Passive, Counterfactual, or Conditional constructions, and can also find examples of IGT that have Negation, Sentential Negation, Possessives, Reflexive Anaphors, etc. All of these queries tap into the rich structure explicitly provided in interlinear examples, further enhanced by post-extraction part-of-speech tagging and text alignment.
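
The gloss normalization and concept query described above can be illustrated with a minimal Python sketch. It is not ODIN's implementation: the function names, the suffix-matching heuristic for fused glosses such as 3SGPERF, the identifier gold:ErgativeCase, and the two toy IGT records are all assumptions made for illustration. Only the mapping of PERF, PFV, PF, and 3SGPERF to gold:PerfectiveAspect comes from the description above.

```python
import re

# Gloss abbreviation -> GOLD concept. The perfective entries follow the text;
# "gold:ErgativeCase" is an assumed identifier used only for illustration.
GLOSS_TO_GOLD = {
    "PERF": "gold:PerfectiveAspect",
    "PFV": "gold:PerfectiveAspect",
    "PF": "gold:PerfectiveAspect",
    "ERG": "gold:ErgativeCase",
}

# Try longer abbreviations first so "PERF" is checked before "PF".
_ABBREVS = sorted(GLOSS_TO_GOLD, key=len, reverse=True)


def normalize_gloss(gloss):
    """Return the set of GOLD concepts found in one gloss token."""
    concepts = set()
    # Split on common morpheme delimiters: "see-PFV" -> ["see", "PFV"].
    for piece in re.split(r"[-.=]", gloss):
        if piece in GLOSS_TO_GOLD:
            concepts.add(GLOSS_TO_GOLD[piece])
            continue
        # Simplifying heuristic for fused glosses such as "3SGPERF":
        # accept a known abbreviation that appears as a suffix.
        for abbr in _ABBREVS:
            if piece.endswith(abbr):
                concepts.add(GLOSS_TO_GOLD[abbr])
                break
    return concepts


def concepts_in_line(gloss_line):
    """Union of the GOLD concepts over all tokens of a gloss line."""
    found = set()
    for token in gloss_line.split():
        found |= normalize_gloss(token)
    return found


def search(instances, concept):
    """Return the instances whose gloss line contains the given concept."""
    return [inst for inst in instances if concept in concepts_in_line(inst["gloss"])]


# Invented toy records, not data drawn from ODIN.
TOY_INSTANCES = [
    {"id": 1, "language": "LangA", "gloss": "man-ERG dog see-PFV"},
    {"id": 2, "language": "LangB", "gloss": "woman sleep-3SGPERF"},
    {"id": 3, "language": "LangB", "gloss": "child run"},
]

perfective_hits = search(TOY_INSTANCES, "gold:PerfectiveAspect")
```

Because every query goes through the normalized concept rather than the surface abbreviation, the two toy records glossed with PFV and 3SGPERF are both returned by a single search for gold:PerfectiveAspect. A production system would need a more careful segmentation of fused glosses than the suffix match used here.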