Text mining

What is text analysis and what is a corpus?

Text mining and text analysis are terms for analyzing documents (books, tweets, news reports, etc.) with the aid of software. Text analysis is a methodological approach and is discipline-agnostic. It is performed on corpora: collections of machine-readable text that are designed to answer specific kinds of questions. Examples of corpora include:

  • All the works of a particular author (all of Jane Austen’s books)
  • All the works of a particular subgenre (Early English books)
  • All the works by a population of authors (Enron corpus of emails)
  • A sample of every mention of a particular concept (a corpus of #indyref on Twitter)
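
To make the idea of a corpus concrete, here is a minimal Python sketch that treats a corpus as a set of named, machine-readable documents and counts word frequencies across all of them. The document names, the sample sentences, and the helper functions `tokenize` and `corpus_term_counts` are hypothetical and exist only for illustration; a real project would load its texts from files or a licensed data source.

```python
import re
from collections import Counter

# Hypothetical two-document corpus; the names and sentences are placeholders.
corpus = {
    "doc_a": "The letter arrived late. The reply was short.",
    "doc_b": "A short letter can say a great deal.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def corpus_term_counts(docs):
    """Count how often each token occurs across every document in the corpus."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts

counts = corpus_term_counts(corpus)
print(counts.most_common(3))
```

Word counts are only one of many possible analyses, but most text mining methods start from a similar step: turning each document into tokens that software can count and compare.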

Your goals

Text and data mining is highly customized work, with varying timelines from start to conclusion.  To carry out a successful project, you will need both access to data and the skills to interact with that data.  The skills needed are determined by the nature of the data and what you want to do with it.

When starting a project, you need to consider:

  • What are the goals of my project?
  • What data sources are available that meet my needs?
  • What funding needs may this project incur?
  • What skills are needed to carry out this project? 

Text mining is a "non-consumptive" use of the materials provided.  That means that you are using the words from within the text but not necessarily the text as presented on a written page.  Therefore, a researcher should not assume that when mining the text, a full-text article or book will also be available for reading or other consumption.
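
As a rough illustration of non-consumptive use, the Python sketch below reduces a passage to sorted word counts (a "bag of words"). This is one common kind of derived data, not necessarily what any particular provider supplies, and the helper `bag_of_words` and the sample sentences are hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Reduce a passage to alphabetically sorted (word, count) pairs."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted(Counter(tokens).items())

passage = "The dog chased the cat."
reordered = "The cat chased the dog."

# Word order is discarded, so two different sentences can yield identical
# features and the original wording cannot be read back from the counts.
print(bag_of_words(passage))
print(bag_of_words(passage) == bag_of_words(reordered))
```

The counts support computational analysis, but they are not a readable copy of the text.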

Appropriate Use of Library Resources for Text Mining

Appropriate Use of Purchased or Licensed Resources

A library subscription DOES NOT imply that text mining is permitted. Some licenses include text mining provisions; others require you to request permission.

Most of the library's electronic resources are governed by license agreements that limit use to the University of Arkansas, Fayetteville community or to individuals who are physically present at the Libraries'  facilities.

  • Each user is responsible for ensuring that he or she uses these products solely for noncommercial, educational, scholarly or research use.
  • Systematic downloading, distribution of content to non-authorized users or indefinite retention of substantial portions of information is strictly prohibited. 
  • The use of software such as scripts, agents, or robots, is generally prohibited and may result in loss of access to these resources for the entire University of Arkansas community.

Regardless of licensing permissions, some text mining techniques can strain a provider's servers. Make sure your methodology follows the provider's preferences. Some preferred methods may also require assistance from the provider.

You may need to contact the service provider.  Here are some details to communicate in your request:

  • Define types of information being mined.
  • Define method.
  • Is it a one-time occurrence or ongoing? (And if ongoing, at what frequency?)

Need advice on the permission letter or information about what our licenses permit? Contact the Data Services librarian or your subject librarian for assistance.

 

*Language adapted from similar guides at Yale and Emory.

Copyright and Text Mining

Before you begin any data mining project, you should be aware of the limitations surrounding copyright and fair use (especially if you are dealing with data that may be under copyright). This area of copyright law is still developing, and support is growing for non-consumptive use of resources for computational analysis.

The Association of Research Libraries (ARL) and the International Federation of Library Associations (IFLA) both provide advice and statements on data and text mining, which you can find below.

In the news: