Wellesley College Research Guides

Text Analysis & Text Mining

Free Sources for Text Mining

This is a selective list of the many open access databases and repositories that can be used for text mining and text analysis. If you contact us with information about what you would like to find, we can help you search for other sources. See Licensed Databases for Mining for more possibilities.

Resource Details
Caselaw Access Project Based at Harvard University, this is a fully downloadable database of 360 years of United States caselaw. Browse and download cases via API or bulk data download.
Chronicling America: Historic American Newspapers The Chronicling America API provides access to information about historic U.S. newspapers and millions of digitized newspaper pages.
corpus.byu.edu A number of large corpora compiled by Prof. Mark Davies (Linguistics) of Brigham Young University, including corpora of historical and contemporary American English and global English, as well as TV, movie, and American soap opera corpora. There is an online query tool, but you can also download the corpora. Davies also has a Corpus del Español and a Corpus do Português.
Digital Public Library of America Download and analyze a wide range of digitized content and metadata from museums, libraries, archives, and other cultural heritage institutions across the United States. 
Documenting the American South Digitized primary materials that offer Southern perspectives on American history and culture. See the DocSouth Data page for information on bulk downloading data from some of their collections.

Europeana Millions of artworks, artifacts, books, films, and music from European museums, galleries, libraries, and archives. The research site offers curated thematic datasets, API options, and a bulk downloader.
Google Books Ngram Viewer Charts the frequency of any word or phrase in a chosen corpus of books over time. The Ngram Viewer is optimized for quick inquiries into the usage of small sets of phrases. For large-scale analysis, the raw data is available for download.
HathiTrust Digital Library HathiTrust's collections include over 16 million volumes that span the history of printed text, primarily in English, but also in over 400 other languages. The HathiTrust Research Center (HTRC) provides a number of tools that enable researchers to run text analysis on its texts online. The HTRC also provides datasets for download. You can use HathiTrust Bookworm to quickly search for trends in public domain texts in HathiTrust.
Internet Archive eBooks and Texts Over 20 million freely downloadable books and texts. There are currently two methods for bulk downloading, both of which require some familiarity with working in a Unix environment.

Project Gutenberg A library of over 60,000 free e-books, most of which are older works in the public domain. If you want to download more than 100 books per day (either manually or with automated download software), use one of the Project Gutenberg mirror sites, not the main site.
PubMed Central Text Mining Tools and Text Mining Collections
University of Oxford Text Archive Thousands of full-text literary and linguistic sources in more than 25 languages.
Wikisource Online libraries of public domain texts available in several languages.
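
Example: building a Chronicling America search request

Several of the sources above offer an API. As a rough illustration of what that looks like in practice, the sketch below builds a search URL for the Chronicling America page search. The endpoint path and the parameter names (andtext, date1, date2, dateFilterType, rows, page, format) are assumptions based on the API documentation published on the Chronicling America site; please check the current documentation before relying on them. The function name build_search_url is our own and not part of any library.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Chronicling America page search; verify against
# the current API documentation before use.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(terms, start_year, end_year, rows=20, page=1):
    """Return a search URL that asks for JSON results (hypothetical helper)."""
    params = {
        "andtext": terms,               # words that must all appear on the page
        "date1": start_year,            # first year of the range
        "date2": end_year,              # last year of the range
        "dateFilterType": "yearRange",  # interpret date1/date2 as years
        "rows": rows,                   # results per page
        "page": page,                   # which page of results to fetch
        "format": "json",               # ask for JSON instead of HTML
    }
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("women suffrage", 1900, 1920)
print(url)
```

The URL can then be fetched with any HTTP client. Building the URL separately from fetching it makes it easy to inspect a query before sending many requests.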
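
Example: reading Google Books Ngram raw data

For large-scale work with the Google Books Ngram raw data, a small script is usually enough to pull out the rows you need. The sketch below assumes the tab-separated layout used by the 2012 (version 2) files, where each row holds an n-gram, a year, a match count, and a volume count; newer releases use a different layout, so check the dataset documentation for the files you download. The function name yearly_counts is our own, and the sample rows are placeholders, not real counts.

```python
from collections import defaultdict


def yearly_counts(lines, phrase):
    """Sum match counts per year for one phrase (hypothetical helper).

    Assumes rows of the form: ngram<TAB>year<TAB>match_count<TAB>volume_count
    """
    totals = defaultdict(int)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed rows
        ngram, year, match_count, _volume_count = fields
        if ngram == phrase:
            totals[int(year)] += int(match_count)
    return dict(totals)


# Placeholder rows for illustration only; these are not real Ngram counts.
sample = [
    "corpus\t1900\t10\t4\n",
    "corpus\t1901\t12\t5\n",
    "text\t1900\t7\t3\n",
]
print(yearly_counts(sample, "corpus"))
```

Because the function accepts any iterable of lines, you can pass it an open file object and process a large download one row at a time without loading it all into memory.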