Research Guides: Text mining: Text Analysis Basics

Thanks

Thank you to the Yale, Berkeley, South Carolina, and Emory libraries, whose content has been adapted and adopted in this guide.

Building Legal Literacies for Text Data Mining
This book explores the legal literacies covered during the virtual Building Legal Literacies for Text Data Mining Institute, including copyright (both U.S. and international law), technological protection measures, privacy, and ethical considerations.

Data Services Librarian

Lora Lennertz

Contact:

University of Arkansas Libraries

MULN 415

365 N. McIlroy Ave

479-575-7197

Subjects: Copyright, Data Science, Data Services, OEDP, Open Educational Resources, Statistics

What is text analysis and what is a corpus?

Text mining and text analysis are terms for analyzing documents (books, tweets, news reports, etc.) with the aid of software. Text analysis is a methodological approach and is discipline agnostic. It is performed on corpora: collections of machine-readable text that are assembled to answer specific kinds of questions. Examples of corpora include:

All the works of a particular author (all of Jane Austen’s books)
All the works of a particular subgenre (Early English books)
All the works by a population of authors (Enron corpus of emails)
A sample of every mention of a particular concept (a corpus of #indyref on Twitter)
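
To make the idea of a corpus concrete, here is a minimal Python sketch that treats a corpus as a set of named documents and counts how often each word appears across all of them. The two sample documents, the names `corpus`, `tokenize`, and `corpus_frequencies`, and the simple tokenizing rule are assumptions made for illustration; a real project would load its own texts and would likely need more careful tokenization.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: document id -> machine-readable text.
corpus = {
    "letter_1": "The committee met on Monday. The vote was close.",
    "letter_2": "The vote passed, and the committee adjourned.",
}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def corpus_frequencies(docs):
    """Count how often each word occurs across every document."""
    counts = Counter()
    for text in docs.values():
        counts.update(tokenize(text))
    return counts

freqs = corpus_frequencies(corpus)

# Show the five most frequent words in the whole corpus.
for word, count in freqs.most_common(5):
    print(word, count)
```

Word counts like these are often only a first step; the question the corpus was built to answer determines which measures are worth computing.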

Your goals

Text and data mining is highly customized work, with varying timelines from start to conclusion. To carry out a successful project, you will need both access to data and the skills to interact with that data. The skills needed are determined by the nature of the data and what you want to do with it.

When starting a project, you need to consider:

What are the goals of my project?
What data sources are available that meet my needs?
What costs might this project incur, and how will they be funded?
What skills are needed to carry out this project?

Text mining is a "non-consumptive" use of the materials provided. That means that you are using the words from within the text but not necessarily the text as presented on a written page. Therefore, a researcher should not assume that when mining the text, a full-text article or book will also be available for reading or other consumption.
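
One way to picture non-consumptive use is a "bag of words": counts of words with their order discarded. The Python sketch below is an illustration only, with made-up sentences and a hypothetical `bag_of_words` helper; it does not describe how any particular provider prepares its data.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts only; word order and page layout are discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

original = "The dog chased the cat."
reordered = "The cat chased the dog."

features_a = bag_of_words(original)
features_b = bag_of_words(reordered)

# The two sentences say different things but produce identical counts,
# so the readable text cannot be rebuilt from the counts alone.
print(features_a == features_b)
```

Because the counts do not preserve the original sentences, data of this kind supports computational analysis without providing a copy to read.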

Appropriate Use of Library Resources for Text Mining

Appropriate Use of Purchased or Licensed Resources

A library subscription DOES NOT imply that text mining is permitted. Some licenses have text mining language, and some will require permission.

Most of the library's electronic resources are governed by license agreements that limit use to the University of Arkansas, Fayetteville community or to individuals who are physically present at the Libraries' facilities.

Each user is responsible for ensuring that he or she uses these products solely for noncommercial, educational, scholarly or research use.
Systematic downloading, distribution of content to non-authorized users, and indefinite retention of substantial portions of information are strictly prohibited.
The use of software such as scripts, agents, or robots is generally prohibited and may result in loss of access to these resources for the entire University of Arkansas community.

Even when a license permits text mining, some techniques can strain a provider's servers. Make sure the method you plan to use follows the provider's preferences. Some preferred methods may also require the provider's assistance.

You may need to contact the service provider. Here are some details to communicate in your request:

Define the types of information to be mined.
Define the method to be used.
State whether the request is for a one-time occurrence or is ongoing (and if ongoing, at what frequency).

Need advice on the permission letter or information about what our licenses permit? Contact the Data Services librarian or your subject librarian for assistance.

*Language adapted from similar guides at Yale and Emory.

Find your librarian

Copyright and Text Mining

Before you begin any data mining project, you should be aware of the limitations imposed by copyright and the scope of fair use, especially if you are working with data that may be under copyright. This area of copyright law is still developing, and support is growing for non-consumptive use of resources for computational analysis.

The Association of Research Libraries (ARL) and the International Federation of Library Associations and Institutions (IFLA) both provide advice and statements on text and data mining, which you can find below.

IFLA Statement on Text and Data Mining
The International Federation of Library Associations and Institutions' 2013 statement regarding non-consumptive use of data resources.
ARL: Text and Data Mining and Fair Use in the United States
A 2015 Issue Brief from the Association of Research Libraries
Law and Literacy in Non-Consumptive Text Mining
Samberg, R. G., & Hennesy, C. (2019). Law and Literacy in Non-Consumptive Text Mining: Guiding Researchers Through the Landscape of Computational Text Analysis. In Copyright Conversations: Rights Literacy in a Digital World. UC Berkeley. Retrieved from https://escholarship.org/uc/item/55j0h74g

In the news:

Text and Data Mining Exemption to Digital Millennium Copyright Act Would Advance Knowledge of Diverse Works
A blog post from the Association of Research Libraries
Authors Alliance Files Comment in Support of New Exemption to Section 1201 of the DMCA to Enable Text and Data Mining Research
A blog post from the Authors Alliance