Text mining

Terminology

API (Application Programming Interface) - Software intermediary that allows two applications to talk to each other. In our case, it is used to access the features or data of an operating system, application, or other service.

Geographical Analysis - Using mapping tools along with text analysis to plot terms in geographic space

Lemmatization - Identifying the dictionary base form (lemma) of a word, such as "run" for run, ran, running
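
A minimal sketch using NLTK's WordNet lemmatizer, one common implementation (the word list is illustrative):

```python
import nltk
nltk.download("wordnet", quiet=True)   # dictionary data the lemmatizer needs
nltk.download("omw-1.4", quiet=True)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ["run", "ran", "running"]:
    # pos="v" tells the lemmatizer to treat each word as a verb
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))
# run -> run, ran -> run, running -> run
```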

Named Entity Recognition - Identifying proper names in a text, such as people, organizations, and places
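
A minimal sketch with spaCy, assuming its small English model has been installed (python -m spacy download en_core_web_sm); the sentence is illustrative:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ada Lovelace worked with Charles Babbage in London.")
for ent in doc.ents:
    # prints each entity with its predicted type, e.g. "Ada Lovelace PERSON"
    print(ent.text, ent.label_)
```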

Natural Language Processing - Ability of a machine or program to understand human text or speech

N-grams - Sequences of n adjacent items (syllables, letters, words, etc.); probabilistic models in computational linguistics use them to identify which sequences can be expected in a sample of text
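
A minimal plain-Python sketch of extracting word bigrams (n = 2); libraries such as NLTK offer equivalent helpers:

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
print(ngrams(tokens, 2))
# [('to', 'be'), ('be', 'or'), ('or', 'not'), ('not', 'to'), ('to', 'be')]
```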

Parts of Speech Tagging - Identifying the syntactic role of each word, such as noun, verb, or adjective
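
A minimal sketch using NLTK's default tagger (the sentence is illustrative):

```python
import nltk
nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model

tokens = "The quick brown fox jumps over the lazy dog".split()
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'),
#  ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
```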

Relation Extraction - Identifying the relationships between entities such as "daughter of" or "town in ? state"

Sentiment Analysis - Using software to identify attitudinal information in a text, such as whether its tone is positive or negative
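
A minimal sketch using NLTK's VADER analyzer, which scores a text from negative to positive (the sentence is illustrative):

```python
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I absolutely loved this book!"))
# {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}; compound > 0 means positive
```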

Stemming - Applying processing rules to cut a word down to its base form or stem; unlike lemmatization, the result is not always a dictionary word
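
A minimal sketch using NLTK's Porter stemmer; note that stems such as "studi" are not dictionary words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "studies", "relational"]:
    print(word, "->", stemmer.stem(word))
# running -> run, studies -> studi, relational -> relat
```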

Tokenization - Process of separating a string of characters into tokens, which may be words, phrases, or sentences. In the process, punctuation is typically removed or split into separate tokens.
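
A minimal sketch using NLTK's word tokenizer (the sentence is illustrative):

```python
import nltk
nltk.download("punkt", quiet=True)  # tokenizer model

print(nltk.word_tokenize("Dr. Smith's cat isn't here."))
# ['Dr.', 'Smith', "'s", 'cat', 'is', "n't", 'here', '.']
```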

Topic modeling - Statistically discovering the clusters of co-occurring words ("topics") in a collection of texts, which can then be used to code texts into meaningful categories
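
A minimal sketch using scikit-learn's LDA topic model; the four toy documents are illustrative, and real collections need far more text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs chase each other in the yard",
    "my dog barks at the neighbour's cat",
    "stock markets fell sharply on weak earnings",
    "investors sold shares as earnings disappointed",
]
vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(docs)                      # document-term matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-4:][::-1]]  # 4 strongest words
    print(f"topic {i}:", ", ".join(top))
```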

Web Scraping (crawling, spidering) - Copying website information in order to extract large amounts of data and save it to a local file.
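
A minimal sketch using requests and BeautifulSoup; the URL and filename are placeholders, and a real scraper should respect a site's robots.txt and terms of use:

```python
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com", timeout=10)  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

text = soup.get_text(separator=" ", strip=True)         # visible text only
with open("page.txt", "w", encoding="utf-8") as f:      # save to a local file
    f.write(text)
```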

Methods

This box offers several ways to perform these counts and explains their strengths. Different software packages implement these methods differently, so your choice of platform may affect the kinds of analyses you can run.

Keyword-in-Context (KWIC) Analysis: Provides a list of each occurrence of a specific word or phrase in its surrounding context (a window of up to seven words in each direction is common). Best for pattern identification and close reading.
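
A minimal plain-Python sketch over a token list (tools such as NLTK's concordance method provide a ready-made version):

```python
def kwic(tokens, target, window=7):
    """Print each occurrence of `target` with up to `window` words of context."""
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left} [{token}] {right}")

tokens = "the whale dove deep and the whale rose again".split()
kwic(tokens, "whale", window=3)
# the [whale] dove deep and
# deep and the [whale] rose again
```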

Lexical co-occurrence or collocation: Observes clusters of terms that are likely to appear together in a given population, based on statistical relationships. Good for getting a sense of 'aboutness' for a specific term or population, or for detecting specific word associations. Topic modeling is based on this principle.
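
A minimal sketch using NLTK's collocation finder; corpus.txt is a hypothetical input file of plain text:

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("punkt", quiet=True)
text = open("corpus.txt", encoding="utf-8").read()  # hypothetical corpus file
tokens = nltk.word_tokenize(text.lower())

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(3)                         # drop pairs seen < 3 times
print(finder.nbest(BigramAssocMeasures().pmi, 10))  # 10 strongest collocations
```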

Word Vector modeling: Like lexical co-occurrence, this looks for terms that are likely to appear together in a given population, and projects terms into multi-dimensional space to model semantic relationships between words at scale. Especially good for discovering how words, and the texts that use them, relate to each other.
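
A minimal sketch using gensim's Word2Vec; the two toy sentences are illustrative, and meaningful vectors require a large corpus:

```python
from gensim.models import Word2Vec

sentences = [   # placeholder: use tokenized sentences from your own corpus
    ["the", "ship", "sailed", "from", "the", "harbor"],
    ["the", "boat", "left", "the", "harbor", "at", "dawn"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)
print(model.wv.most_similar("ship", topn=3))  # nearest neighbors in vector space
```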

N-grams: Observes sequences of terms that appear directly together in a given population. Very good for identifying common phrases in a particular genre, as well as stylistic features that are unique to a specific author.
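
Building on the n-gram definition above, a minimal sketch of counting bigram frequencies to surface common phrases (the text is illustrative):

```python
from collections import Counter

tokens = "to be or not to be that is the question to be".split()
bigrams = zip(tokens, tokens[1:])       # adjacent word pairs
print(Counter(bigrams).most_common(3))
# [(('to', 'be'), 3), ...]
```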

Keyness: Sets up a comparison between Set A and Set B: does this word appear MORE or LESS frequently in Set A than in Set B? Usually computed with a statistical measure called log-likelihood.
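
A minimal sketch of the log-likelihood (G2) calculation, following the standard keyness formula; the counts in the example are invented for illustration:

```python
from math import log

def log_likelihood(a, b, size_a, size_b):
    """G2 for a word occurring `a` times in a corpus of `size_a` tokens
    and `b` times in a corpus of `size_b` tokens."""
    e1 = size_a * (a + b) / (size_a + size_b)  # expected count in Set A
    e2 = size_b * (a + b) / (size_a + size_b)  # expected count in Set B
    g2 = 0.0
    if a > 0:
        g2 += a * log(a / e1)
    if b > 0:
        g2 += b * log(b / e2)
    return 2 * g2

# "whale": 120 hits in 50,000 tokens of Set A vs 15 hits in 60,000 tokens of Set B
print(round(log_likelihood(120, 15, 50_000, 60_000), 2))
```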

Most Frequent Word Analysis: Finds the most frequent terms in a given population and highlights the small function words that make up the bulk of language. Good for identifying unique stylistic fingerprints and for keeping track of a term's presence or absence.
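
A minimal sketch using Python's Counter; novel.txt is a hypothetical input file:

```python
from collections import Counter

tokens = open("novel.txt", encoding="utf-8").read().lower().split()  # hypothetical file
print(Counter(tokens).most_common(10))
# function words such as "the", "of", and "and" typically top the list
```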