Sometimes the biggest hurdle in starting a text processing project is finding a good dataset to work with.
We went ahead and compiled a list of awesome text data resources to save you a lot of time and energy!
- Free Text
  - Project Gutenberg is a great one-stop shop for downloading free full texts of literary works. It is probably the best place to begin if you want to play around with text analysis on large text files.
  - The Tufts libraries put together a directory of online text resources ranging from ancient texts to modern ones, and from novels to newspapers. It is especially useful if you are looking for digitized versions of older or rare texts.
  - Google Books is pretty hit-or-miss when it comes to full-text access, but it’s worth checking if you’re after something specific.
- Annotated Text
  - Sketch Engine has tonnes of high-quality annotated corpora for a wide range of applications in many languages. It is particularly useful for building semantic models. Most corpora require a paid subscription, but many are available for free or with a trial membership.
  - Lionbridge has a collection of large NLP datasets for supervised machine learning. These are mostly links out to other data collections, but Lionbridge has gathered and curated them in one convenient place.
  - Kaggle offers a wide variety of datasets (not just text data), almost all of which can be downloaded once you create a free account.
- Application-specific Datasets
  - If you are interested in parallel language data (typically used for training and testing machine translation, though it can also be used for comparative text analysis across languages), OPUS is by far the best resource.
  - A great starting point for training sentiment analysis models is this set of sentiment-classified movie reviews from IMDb.
  - This dataset from Kaggle is great for anyone who wants to do US presidential election analysis! It is a set of transcripts from all 2020 Democratic debates, annotated for speaker, length of speech, debate name and debate section.
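
As a rough illustration of what annotations like speaker and length of speech make possible, here is a minimal sketch that totals speaking time per speaker using only Python's standard library. The column names (`debate_name`, `debate_section`, `speaker`, `speaking_time_seconds`) and the sample rows are placeholders made up for this example, not the real dataset's schema, so check the header of the file you actually download before adapting it.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows standing in for a downloaded transcript file.
# Column names and values are assumptions, not the real dataset's schema.
SAMPLE_CSV = """debate_name,debate_section,speaker,speaking_time_seconds
Debate 1,Part 1,Speaker A,30
Debate 1,Part 1,Speaker B,45
Debate 1,Part 2,Speaker A,20
"""


def total_speaking_time(csv_file):
    """Sum speaking time (in seconds) per speaker from an open CSV file."""
    totals = defaultdict(float)
    for row in csv.DictReader(csv_file):
        totals[row["speaker"]] += float(row["speaking_time_seconds"])
    return dict(totals)


# Swap io.StringIO(...) for open("your_file.csv", newline="") with real data.
totals = total_speaking_time(io.StringIO(SAMPLE_CSV))
for speaker, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{speaker}: {seconds:.0f} s")
```

The same pattern (read rows, group by an annotation column, aggregate) carries over to the other annotated datasets above, as long as the column names are adjusted to match the file.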