Harvesting Uriel Weinreich’s Archive of the Yiddish Language and Culture Atlas
Presented by: Ulrike Kiefer / Robert Neumann, Förderverein für Jiddische Sprache und Kultur e.V.  
Project / Software Title: Evidence of Yiddish Documented in European Societies (EYDES)  
Project / Software URL: http://www.eydes.org  
Metadata Harvesting
Because metadata guide archive users in their search for relevant resources, they are a precondition for successful archive harvesting. The more precise and distinctive the metadata, the more efficient the discovery of a resource and access to it. This is particularly true of endangered languages, where little prior knowledge can be presumed. Scholars in this field feel an even more pressing need to document their data collections and to create gateways for their successful use. In line with the OLAC guidelines, the Description is the place for relevant information about the content of a repository. In the case of the electronic archive of The Language and Culture Atlas of Ashkenazic Jewry (LCAAJ), we subsume the following items under Description:

1. Questionnaire (in SGML/XML)
2. List of locations (place names and index numbers)
3. Classic indices, such as:
3.1 Questionnaire Index (object language)
3.2 Transcript Index (subject language, with links to location, question, and sound)
3.3 Transliteration Index (of transcripts)
3.4 Index to the Dialectology (reference to Questionnaire, secondary index)
3.5 Subject Index to the Dialectology (reference to Questionnaire, secondary index)
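
As an illustration of how the items above could be exposed to harvesters, the following minimal sketch builds a metadata record that carries each item in a Dublin Core description element. The record layout, the `build_record` helper, and the wording of the item strings are assumptions made for illustration; they are not the actual EYDES record format.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for Dublin Core and OLAC 1.1 metadata.
DC_NS = "http://purl.org/dc/elements/1.1/"
OLAC_NS = "http://www.language-archives.org/OLAC/1.1/"

# Hypothetical wording of the Description items listed above.
DESCRIPTION_ITEMS = [
    "Questionnaire (SGML/XML)",
    "List of locations (place names and index numbers)",
    "Questionnaire Index",
    "Transcript Index",
    "Transliteration Index",
    "Index to the Dialectology",
    "Subject Index to the Dialectology",
]


def build_record(title, descriptions):
    """Return an element holding one title and one description per item."""
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("olac", OLAC_NS)
    record = ET.Element(f"{{{OLAC_NS}}}olac")
    ET.SubElement(record, f"{{{DC_NS}}}title").text = title
    for text in descriptions:
        ET.SubElement(record, f"{{{DC_NS}}}description").text = text
    return record


# The title string is a placeholder, not the archive's registered title.
record = build_record("LCAAJ electronic archive", DESCRIPTION_ITEMS)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Keeping each item in its own description element, rather than in one long text, lets a harvester index and display the items separately.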

Data Harvesting
On the basis of the metadata, archive users will choose which of the available materials they are going to ‘order’. The provider has to be in a position to ‘deliver’ the identified items via the Internet, precisely and in a standard format (XML). It is understood that questions of rights, contract conditions, coverage of costs, etc. play a crucial role; regulations in these matters may change and therefore have to remain flexible.

During an experimental phase, we will limit the quantity of data delivered from the electronic LCAAJ archive and require acceptance of a standard contract (user identification, etc.). The user alone will be responsible for processing and evaluating the data.

It is typical of archive collections that many of their inherent features are not obvious to users ‘at first sight’ but need processing and computing to be made manifest. Such features are, for instance, the contextual qualities of words in a corpus, where one set of words may stand out against another. By definition, contextual features occur in multiple ways and in large quantities, so they cannot be extracted in a single exhaustive procedure. Consequently, archive users need methods and tools with which they can detect and display such features. For the electronic LCAAJ such a tool has been created: LARA, a tool for statistical exploitation (text mining). It computes and ‘delivers’ lexical features in data segments that a user defines via a set of specified contextual categories.
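
The general idea of contrasting a user-defined data segment with the rest of a corpus can be sketched as follows. This is not LARA's actual procedure, which is not described here; the record fields (`location`, `question`, `tokens`), the function names, and the scoring by difference in relative frequency are all assumptions chosen for illustration, and the sample tokens are placeholders rather than archive data.

```python
from collections import Counter


def matches(record, criteria):
    """True if the record has the requested value for every category."""
    return all(record.get(key) == value for key, value in criteria.items())


def relative_frequencies(records):
    """Map each token to its share of all tokens in the given records."""
    counts = Counter(tok for rec in records for tok in rec["tokens"])
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()} if total else {}


def distinctive_words(records, **criteria):
    """Rank the segment's tokens by how much more frequent they are
    inside the segment than in the remaining records."""
    segment = [rec for rec in records if matches(rec, criteria)]
    rest = [rec for rec in records if not matches(rec, criteria)]
    seg_freq = relative_frequencies(segment)
    rest_freq = relative_frequencies(rest)
    scores = {tok: f - rest_freq.get(tok, 0.0) for tok, f in seg_freq.items()}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Placeholder records; "w1", "w2", "w3" stand in for transcript tokens.
records = [
    {"location": "A", "question": "q1", "tokens": ["w1", "w2", "w1"]},
    {"location": "A", "question": "q2", "tokens": ["w1", "w3"]},
    {"location": "B", "question": "q1", "tokens": ["w2", "w3", "w3"]},
]

ranking = distinctive_words(records, location="A")
for token, score in ranking:
    print(f"{token}\t{score:+.3f}")
```

Because the segment is defined by keyword criteria, the same function can be called with any combination of categories, for example `distinctive_words(records, location="A", question="q1")`.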

