Research Guides: Text mining: Creating and Locating Your Corpus

Fun and Learning

What Is Text Analysis and What Is a Corpus?

Text mining and text analysis are terms for analyzing documents (books, tweets, news reports, etc.) with the aid of software. Text analysis is a methodological approach and is discipline agnostic. It is performed on corpora: collections of machine-readable text that are designed to answer specific kinds of questions. Examples of corpora include:

All the works of a particular author (all of Jane Austen’s books)
All the works of a particular subgenre (Early English books)
All the works by a population of authors (Enron corpus of emails)
A sample of the mentions of a particular concept (a corpus of #indyref on Twitter)
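
To make the idea of a corpus concrete, here is a minimal Python sketch that treats a corpus as a set of named documents and counts word frequencies across them. The two documents, the `tokenize` rule, and the function names are illustrative assumptions, not part of any particular tool; a real project would load its texts from files, a database, or an API.

```python
import re
from collections import Counter

# Hypothetical in-memory corpus: document name -> text.
# In a real project these strings would be read from files or an API.
corpus = {
    "doc1.txt": "It is a truth universally acknowledged.",
    "doc2.txt": "The truth is rarely pure and never simple.",
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def word_counts(documents):
    """Count how often each token appears across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


counts = word_counts(corpus)
print(counts.most_common(3))
```

The tokenizer here is deliberately simple (ASCII letters only); corpora in other languages or with specialized vocabulary would likely need a different rule.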

Developing Your Corpus

Corpora are often obtainable from webpages, data sets, and databases. Each site or database often has its own rules and restrictions on what is and is not permissible when applying text and data mining methods to its data.

Broadly, databases and websites fall into four categories:

Freely Available Resources. These databases are open access and either openly allow text mining or allow it broadly, but with specific (generally minor) restrictions.
Resources Accessible through Purchase. These are resources we may have access to in some capacity, but that do not (presently) allow access to the kinds of data needed for text data mining. However, if access to these data is needed, it may be purchased. Plan for the cost of purchasing these data when developing your grant, or speak with your department for assistance.
Restricted Resources. These resources either forbid the use of their data for any and all mining projects, or we do not have sufficient access to these databases to permit mining usage.
Purchased Resources. These are resources the Libraries have either purchased or for which the Libraries hold a Perpetual Access License (PAL). While there may be some restrictions on how data can be used, especially when it comes to publication, generally speaking these databases allow text and data mining in some capacity.

PLEASE NOTE: The University of Arkansas Libraries currently do not provide data purchasing services.

As a general rule, check with your subject librarian or the data services librarian before beginning any project that involves text data mining.

Find your librarian

What are APIs?

An Application Programming Interface (API) is a set of clearly defined methods of communication that allows two applications to talk to each other. Just as humans need a structured mechanism to share information (i.e., spoken or written language), so do computers. In practice, an API specifies how a particular machine or information source shares the data it contains.

APIs are often used to extract the data used for text and data mining. Many databases and publishers have their own APIs that allow researchers to access information. These APIs are often needed because the databases and publishers prohibit web scrapers and crawlers. Always check the individual policies for text mining and API use before starting your project.
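
As a rough illustration of what working with an API looks like, the Python sketch below builds a request URL, sends an API key, and pulls the text out of a JSON response. Everything specific in it is an assumption made for the example: the `api.example.org` endpoint, the `q`, `page`, and `per_page` parameters, the `results` and `full_text` field names, and the bearer-token header. Each database or publisher documents its own URL, parameters, authentication, and response layout.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint; a real provider's documentation defines its own.
BASE_URL = "https://api.example.org/v1/articles"


def build_url(query, page=1, per_page=50):
    """Encode the (assumed) search parameters into a request URL."""
    params = {"q": query, "page": page, "per_page": per_page}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_texts(payload):
    """Pull the assumed full-text field out of each record in a decoded response."""
    return [record["full_text"] for record in payload.get("results", [])]


def fetch_page(query, api_key, page=1):
    """Request one page of results, sending the key as a bearer token (an assumed scheme)."""
    request = Request(
        build_url(query, page),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Offline demonstration using a hand-written response in the assumed layout.
sample = json.loads('{"results": [{"id": 1, "full_text": "First article."}]}')
print(build_url("text mining"))
print(extract_texts(sample))
```

`fetch_page` is defined but not called here, because the endpoint does not exist. With a real API, the same pattern would usually also need to respect the provider's rate limits and terms of use.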

Should I or Shouldn't I ... Scrape a Website?

Thank you to Berkeley for providing this model
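
One technical check that can feed into this decision is a site's robots.txt file, which states which paths automated crawlers are asked to avoid. The Python sketch below uses the standard library's `urllib.robotparser` to test URLs against a robots.txt; the file content, the `example.org` URLs, and the `my-research-bot` user agent are made up for illustration. A robots.txt file is not a substitute for the site's terms of use or license, which still need to be read separately.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content, written here for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def can_crawl(robots_text, url, user_agent="my-research-bot"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl(ROBOTS_TXT, "https://example.org/about"))
print(can_crawl(ROBOTS_TXT, "https://example.org/search/results"))
```

For a live site, `RobotFileParser` can instead be pointed at the site's own robots.txt with `set_url()` followed by `read()`.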

OCR - When a text is not available digitally

Researchers in the humanities will often have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine readable. Using Optical Character Recognition (OCR), you can convert images of scanned text pages into machine-readable text that you can copy and paste, search, or edit. Please note that research-level OCR requires high accuracy, support for multiple languages, and bulk processing.

For simple projects, the Libraries' scanners will output OCR text for most English-language documents.

ABBYY FineReader
Tesseract OCR: useful for batch OCR projects
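
For a sense of what a batch OCR job with Tesseract might look like, the Python sketch below loops over a folder of page images and calls the `tesseract` command-line program on each one. It assumes Tesseract is installed and on the system path, that the pages are PNG files, and that the English language data (`eng`) is available; the folder names and function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_base, lang="eng"):
    """Build the command line: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_folder(image_dir, output_dir, lang="eng", pattern="*.png"):
    """Run Tesseract on every matching image and return the text files it writes."""
    image_dir, output_dir = Path(image_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image_path in sorted(image_dir.glob(pattern)):
        output_base = output_dir / image_path.stem
        # Tesseract appends ".txt" to the output base name itself.
        subprocess.run(build_command(image_path, output_base, lang), check=True)
        written.append(Path(f"{output_base}.txt"))
    return written


# Example call (hypothetical folder names), left commented out so the
# sketch can be read and imported without Tesseract installed:
# ocr_folder("scans", "ocr_output", lang="eng")
```

OCR output generally needs spot-checking before analysis, since accuracy varies with scan quality, typeface, and language.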