Text mining and analysis is a form of data mining performed on text-based data sets. It involves the computational analysis of large quantities of digital information.
Using specialized software, researchers can extract data, identify trends, look for patterns, and better understand the relationships among terms within and between documents. Analysis might focus on word frequency, words that frequently appear near each other, contextual information for keywords, common phrases, and other patterns.
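As a rough illustration of these kinds of analysis, the following Python sketch counts word frequencies, common two-word phrases, and words that appear near each other, and shows a keyword in its surrounding context. It uses only the Python standard library. The function names, the sample sentence, and the simple tokenizing rule are assumptions made for this example; dedicated text analysis tools handle punctuation, stop words, and word forms far more carefully.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how often each word occurs."""
    return Counter(tokens)


def ngrams(tokens, n=2):
    """Count runs of n consecutive words (common phrases)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def cooccurrences(tokens, window=2):
    """Count unordered pairs of different words that appear
    within `window` tokens of each other."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs


def keyword_in_context(tokens, keyword, width=3):
    """Return each occurrence of `keyword` with up to `width`
    words of context on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width) : i])
            right = " ".join(tokens[i + 1 : i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical sample text, invented for this illustration.
sample = (
    "The whale surfaced. The white whale dived, "
    "and the crew watched the white whale."
)
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(ngrams(tokens, 2).most_common(3))
print(cooccurrences(tokens, window=2).most_common(3))
print(keyword_in_context(tokens, "dived", width=2))
```

In a real project the sample string would be replaced by the text of the documents being studied, and the window and context widths would be chosen to suit the research question.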
Materials to be analyzed range from websites (such as publicly available Facebook posts) to 16th-century manuscripts.
"Text Mining is the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources... The difference between regular data mining and text mining is that in text mining the patterns are extracted from natural language text rather than from structured databases of facts." (Marti Hearst, What is Text Mining?)
Illinois University Library and Penn State University Libraries have good brief overviews of text analysis methods.
See the Terminology & Projects page for more information on text mining concepts and practices, as well as example projects.
See the Tools & Tutorials page for more hands-on guides to doing text analysis.
If you wish to undertake a text or data mining project with content from the Library's licensed databases, please contact a librarian to investigate options, which may include negotiating with the vendor or purchasing access to the data. Many database licenses prohibit text and data mining and the use of software such as scripts, agents, or robots, but we are actively negotiating text mining rights with database vendors. Unauthorized text or data mining in violation of our licenses can result in loss of access for the entire Wellesley College community.
Please also see our Best Practice Tips for mining licensed databases.