Text mining and text analysis are terms for analyzing documents (books, tweets, news reports, etc.) with the aid of software. Text analysis is a methodological approach, and it is discipline-agnostic. It is performed on corpora: collections of machine-readable text that are designed to answer specific kinds of questions.
Corpora can often be obtained from webpages, data sets, and databases. Each site or database often has its own rules and restrictions on what is and is not permissible when it comes to applying text and data mining methods to its data.
Broadly, databases and websites fall into four categories:
As a general rule, check with your subject librarian or the data services librarian before beginning any project that involves text and data mining.
An Application Programming Interface (API) is a set of clearly defined methods of communication that allows two applications to talk to each other. Just as humans need a structured mechanism to share information (i.e. spoken or written language), so do computers. An API is a set of instructions for how a particular machine/information source is able to share the data it contains.
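As a rough illustration of what "talking to" an API can look like, the sketch below builds a request URL, sends it with an access key, and pulls the document text out of a JSON reply. Everything specific here is an assumption made for the example: the address `https://api.example.org/v1/articles`, the parameter names `q`, `page`, and `per_page`, the `results` and `full_text` field names, and the bearer-token header are all hypothetical. A real provider's API documentation defines its own addresses, parameters, and response format.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://api.example.org/v1/articles"


def build_query_url(base_url, query, page=1, per_page=25):
    """Combine the endpoint and search parameters into one request URL."""
    params = {"q": query, "page": page, "per_page": per_page}
    return base_url + "?" + urlencode(params)


def extract_texts(response_body):
    """Return the non-empty 'full_text' values from a JSON response body.

    The 'results' and 'full_text' field names are assumptions; real APIs
    name their fields differently.
    """
    payload = json.loads(response_body)
    return [
        item["full_text"]
        for item in payload.get("results", [])
        if item.get("full_text")
    ]


def fetch_texts(query, api_key, page=1):
    """Send one request to the (hypothetical) API and return document texts."""
    request = Request(
        build_query_url(BASE_URL, query, page),
        headers={
            "Authorization": "Bearer " + api_key,  # assumed auth scheme
            "Accept": "application/json",
        },
    )
    with urlopen(request, timeout=30) as response:
        return extract_texts(response.read().decode("utf-8"))
```

Keeping URL building and response parsing in separate functions means both can be checked against the provider's documentation without sending any network requests.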
APIs are often used to extract the data used for text and data mining. Many databases and publishers have their own APIs that allow researchers to access information. These APIs are often needed because the databases and publishers prohibit web scrapers and crawlers. Always check the individual policies for text mining and API use before starting your project.
Thank you to Berkeley for providing this model.
Researchers in the humanities often have a large number of PDFs, photos of archival documents, or other images of text that are not yet machine-readable. Using Optical Character Recognition (OCR), you can convert images of scanned text pages into machine-readable text that you can copy and paste, search, or edit. Please note that research-level OCR requires high accuracy, support for multiple languages, and bulk processing.
For simple projects, the Libraries' scanners will output OCR text for most English-language documents.
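For larger projects, bulk processing usually means looping over a folder of page images and saving one text file per image. The following is a minimal sketch of that loop, not a recommended tool. The function names `ocr_folder` and `tesseract_recognize` and the folder names are hypothetical, and the OCR step is passed in as a function so that any engine can be used. The optional `tesseract_recognize` helper assumes that the Tesseract engine and the third-party `pytesseract` and `Pillow` packages are installed.

```python
from pathlib import Path


def ocr_folder(image_dir, out_dir, recognize,
               suffixes=(".png", ".jpg", ".jpeg", ".tif", ".tiff")):
    """Run `recognize` on every image in image_dir and save the text.

    `recognize` is any function that takes an image path and returns a
    string. Returns a dict mapping image file name to character count,
    which gives a quick way to spot pages that came back empty.
    """
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    allowed = {s.lower() for s in suffixes}
    char_counts = {}
    for image_path in sorted(image_dir.iterdir()):
        if image_path.suffix.lower() not in allowed:
            continue  # skip files that are not page images
        text = recognize(image_path)
        (out_dir / (image_path.stem + ".txt")).write_text(text, encoding="utf-8")
        char_counts[image_path.name] = len(text)
    return char_counts


def tesseract_recognize(image_path, lang="eng"):
    """One possible OCR step (assumes Tesseract, pytesseract, and Pillow)."""
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


# Example call, with hypothetical folder names:
# counts = ocr_folder("scans", "ocr_text", tesseract_recognize)
```

Output quality still depends on the scans and the engine, so it is worth reading a sample of the resulting text files before relying on them for analysis.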