How can I learn more about text processing and NLP?
- General NLP
- This is an excellent collection of lecture slides from the Accelerated Natural Language Processing course at the University of Edinburgh. These cover all of the most important subfields of NLP in a thorough yet concise way. This material is easier to digest if you have some solid math and probability theory skills under your belt.
- This book (yes, an actual book) has long served, and continues to serve, as the first point of reference for many scholars engaging with NLP. This is absolutely the place to look for an in-depth explanation of the theory behind, and the applications of, every subfield of NLP, for both speech and text processing. It is available at Tisch Library. The text of the third edition draft (which will contain extra material on machine translation and chatbots, among other topics) is available online.
- Basic Text Processing
- Here is an introduction and tutorial for NLTK (a Python library for text processing) and basic text processing/visualization.
- This is a really awesome tool for exploratory text analysis that doesn’t require any coding! You can read more about the tool and the reasons for using it here.
- Voyant and NVivo are also good tools for exploratory analysis and for generating some cool, simple visualizations without writing any code. Be aware that NVivo can take some getting used to because of its rather complex interface, while Voyant, on the contrary, is very easy to start using but doesn’t really let you “look under the hood” to see how it’s working.
- If you prefer to use R, you can check out this tutorial for text analysis.
- Computational Semantics
- Project Gutenberg is a great one-stop shop for downloading free texts of works of literature. This is probably the best place to start if you want to play around with text analysis on large text files.
- The Tufts libraries put together a directory of different online text resources ranging from ancient texts to modern texts, novels to newspapers, etc. This is great especially if you are looking for digitized versions of older or rare texts.
- Google Books is pretty hit-or-miss when it comes to getting access to full-text versions of books, but it’s worth a look if you need something specific.
- If you are interested in parallel language data (typically used for machine translation training/testing, but can also be used for comparative text analysis across languages), OPUS is by far the best resource.
- Accessing Twitter data is not a trivial task. This video walkthrough (the first of a really good series) takes you slowly through every step of accessing, streaming, storing and using data from Twitter.
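If you'd like to try a first bit of exploratory analysis on a plain-text file (for example, one downloaded from Project Gutenberg), a minimal sketch in Python might look like the following. It uses only the standard library rather than NLTK, and the sample sentence, the function names `tokenize` and `top_words`, and the tiny stopword list are all assumptions made for illustration, not part of any of the resources above.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was", "is"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(text, n=3):
    """Return the n most frequent non-stopword tokens with their counts."""
    tokens = [t for t in tokenize(text) if t not in STOPWORDS]
    return Counter(tokens).most_common(n)


# Stand-in for the contents of a downloaded text file.
sample = "It was the best of times, it was the worst of times."
print(top_words(sample, n=1))  # [('times', 2)]
```

To run this on a real file, you could replace `sample` with the result of reading the file, e.g. `open("my_book.txt", encoding="utf-8").read()`, where `my_book.txt` is a hypothetical file name. A library such as NLTK offers more careful tokenization and fuller stopword lists once you outgrow this kind of quick count.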
And don’t forget to check out the workshops available through the DataLab! There will be an introductory text analysis workshop as well as a sentiment analysis workshop in the Fall semester.