William Lewis, University of Washington

Tools, Fair Use and Linguistics in the Internet Age

A recurring theme in talks and discussions at the E-MELD workshops held over the past few years has been the need for best-practice standards and for tools that adhere to them. Although most of the discussions have focused on tool development for field linguists, an increasingly frequent topic has been the need for tools that migrate and manipulate legacy data. We have come to realize that the abundance of legacy data, and the reasonable reluctance of linguists to re-enter that data, may stymie efforts to promote the use of best practice. In other words, the following motto carries some weight: "if you migrate it, they will come." Equally important, however, is the need for search, even if the data being searched is encoded in legacy formats. A second motto may therefore be of equal importance: "if they can find it, they will come."

Searching for linguistic data on the Web with general-purpose search engines such as Google is fairly unsophisticated but surprisingly effective. Even searches for the most obscure languages can return hundreds of hits, many containing linguistically relevant data. Some fiddling with the search string can improve the quality and relevance of the results. One could even envision a search engine custom built for linguists that would distinguish what is linguistically relevant from what is not, listing only those documents that are most relevant to the query. A step beyond this might be search tools that return not just documents and web pages but the data itself, displayed independently of the source documents but with links back to them.

In such a world, what constitutes fair use? It is accepted practice in the field of linguistics to reuse language data found elsewhere, such as snippets of language data found in a fellow linguist's paper, as long as the source of the data is acknowledged and cited. This custom has a long tradition. But what of tools that can do the same thing on a massive scale, extracting potentially hundreds of examples from thousands of documents? If a search engine returns a list of data extracted from one or more linguistic documents discovered on the Web, how should that data be presented? Even more important, what does the linguist who crafted the query, and who can now merely cut and paste the results into a file on their own computer, owe to the source of the data? Put differently, what are the responsibilities of tool developers, and what are the responsibilities of the linguists who use such tools?

This paper explores these issues in light of tools that can search for and manipulate existing linguistic data. It addresses the practices that tool developers should follow and, in turn, the rights and responsibilities of the linguists who use these tools.