Text mining

The HathiTrust Research Center

The HathiTrust Research Center (HTRC) enables computational analysis of the HathiTrust corpus. It is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with HathiTrust, to help meet the technical challenges researchers face when dealing with massive amounts of digital text. It develops cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge.

HTRC provides analytic tools for text materials held in the HathiTrust Digital Library. Materials do not need to be available for full-text viewing in order to be analyzed.

The HTRC requires a login separate from the Digital Library's to access its tools. Go to the HTRC website and log in using the same process you used for the Digital Library.

Previous Account?

If you already have a HathiTrust Research Center (HTRC) account, we suggest creating your new account using OpenAthens. This gives you greater flexibility and additional privileges in using the HTRC.

When you select your OpenAthens account, you will see a dialog box prompting you to merge your previous account into this one.

Merging HathiTrust Research Center Accounts

Merging the accounts allows you to load any of your previous datasets and analyses.

Out-of-the-box analytics

HTRC provides several out-of-the-box tools to help you analyze your text collections. These are found in the Algorithms section of the website.


InPhO Topic Model Explorer (v1.0b225)

The InPhO Topic Model Explorer trains multiple Latent Dirichlet Allocation (LDA) topic models and allows you to export files containing the word-topic and topic-document distributions, along with an interactive visualization. It can be run on worksets of fewer than 3000 volumes, as long as the total size of the workset is less than 3 GB.


Named Entity Recognizer (v2.0)

Generate a list of all of the names of people and places, as well as dates, times, percentages, and monetary terms, found in a workset. You can choose which entities you would like to extract. It can be run on worksets of fewer than 3000 volumes, as long as the total size of the workset is less than 3 GB.


Token Count and Tag Cloud Creator (v2.0)

Identify the tokens (words) that occur most often in a workset and the number of times they occur. Create a tag cloud visualization of the most frequently occurring words in a workset, where each word is sized in proportion to the number of times it occurs. It can be run on worksets of fewer than 3000 volumes, as long as the total size of the workset is less than 3 GB.
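To make the idea concrete, here is a minimal Python sketch of token counting and proportional tag sizing. It is not the HTRC algorithm itself: the function names (count_tokens, tag_sizes), the simple regular-expression tokenizer, and the font-size range are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def count_tokens(texts, top_n=10):
    """Return the top_n most frequent tokens across a list of text strings.

    Tokenization here is a deliberately simple assumption: lowercase the
    text and keep runs of letters and apostrophes.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(top_n)


def tag_sizes(top_tokens, min_size=10, max_size=48):
    """Map each token to a font size proportional to its count.

    The most frequent token gets max_size; no token drops below min_size.
    """
    if not top_tokens:
        return {}
    biggest = max(count for _, count in top_tokens)
    return {
        token: max(min_size, round(max_size * count / biggest))
        for token, count in top_tokens
    }


if __name__ == "__main__":
    # Hypothetical two-document "workset" used only for demonstration.
    sample = ["The whale, the whale!", "Call me Ishmael. The sea."]
    top = count_tokens(sample, top_n=2)
    print(top)
    print(tag_sizes(top))
```

A real tag cloud would usually also drop very common words such as "the" before sizing, which this sketch leaves out to stay short.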


HTRC Extracted Features Dataset

The HTRC Extracted Features Dataset v.2.0 is composed of page-level features for 17.1 million volumes in the HathiTrust Digital Library. This version contains non-consumptive features for both public-domain and in-copyright books.

Features include part-of-speech tagged term token counts, header/footer identification, marginal character counts, and much more.
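As a rough illustration of working with page-level features, the Python sketch below sums part-of-speech tagged token counts across the pages of one volume. The record is a hypothetical, heavily simplified stand-in built in memory: the field names (pages, body, tokenPosCount), the volume identifier, and the helper functions are assumptions for this example, so check the dataset documentation for the actual file layout before relying on them.

```python
from collections import Counter

# Hypothetical, simplified volume record. The real files carry more
# metadata and more per-page fields; names here are assumptions.
volume = {
    "id": "example.000001",
    "pages": [
        {"seq": 1, "body": {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}}},
        {"seq": 2, "body": {"tokenPosCount": {"whale": {"NN": 1}, "sail": {"VB": 1, "NN": 1}}}},
    ],
}


def volume_token_counts(vol, section="body"):
    """Sum token counts over all pages, ignoring part-of-speech tags."""
    totals = Counter()
    for page in vol["pages"]:
        pos_counts = (page.get(section) or {}).get("tokenPosCount", {})
        for token, by_pos in pos_counts.items():
            totals[token.lower()] += sum(by_pos.values())
    return totals


def counts_for_pos(vol, pos_tag, section="body"):
    """Sum counts only for tokens tagged with the given part of speech."""
    totals = Counter()
    for page in vol["pages"]:
        pos_counts = (page.get(section) or {}).get("tokenPosCount", {})
        for token, by_pos in pos_counts.items():
            if pos_tag in by_pos:
                totals[token.lower()] += by_pos[pos_tag]
    return totals


if __name__ == "__main__":
    print(volume_token_counts(volume))
    print(counts_for_pos(volume, "NN"))
```

Because the features are counts rather than running text, this kind of aggregation works the same way for public-domain and in-copyright volumes.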

Explore the following pre-built datasets:

Word Frequencies in English-Language Literature, 1700-1922

Geographic Locations in English-Language Literature, 1701-2011

Or work on your own using the Extracted Features Download Helper (v3.1), found under Algorithms.

Build a Data Capsule

The HTRC Data Capsule environment provides individual, secure computing environments for analyzing content in the HathiTrust Digital Library. Researchers create virtual machines (called Capsules), import HathiTrust text data into them, and analyze it there. Computational analysis can be performed only within the secure Data Capsule environment; researchers then export the results of their analysis. Data products leaving a Capsule must undergo results review prior to release to ensure they meet the HTRC's policy for non-consumptive data exports.

Capsules are Ubuntu virtual machines with increased security settings. Researchers have the option to set certain parameters for their Capsule when they create it. Capsules come with standard data analysis tools pre-installed, ranging from Anaconda and R to Voyant Tools, and can be configured with sample public domain data already loaded for testing. Any other data or tools the researcher plans to use will need to be brought into the Capsule by the researcher. A Capsule is an almost blank slate that can be customized for each researcher's needs!