June 09, 2025
Stats for Humanists
How Words Become Numbers and Data Cleaning
Pre-session work
Tutorials
(In order of importance)
- Become familiar with regular expressions with at least one of the following resources:
- See an example of regular expressions being used by researcher Melanie Walsh in her tutorial, “Web Scraping – Part 2” in Introduction to Cultural Analytics and Python
- Review the regular expressions documentation for Python. Note: In newer versions of Python, a regular expression written as an ordinary string, like "\W+", triggers a warning about an invalid escape sequence (a SyntaxWarning as of Python 3.12). To avoid this, write the letter r before the regular expression string to make it a raw string: r"\W+"
- (Optional) Skim the tutorial: “Understanding Regular Expressions” (Knox, 2020).
- Watch Python for DH lessons 5-8 by William Mattingly. If you’re on a roll, you can keep going through the additional lessons.
- Review this brief tutorial on stopwords: “Creating Stopwords Lists” (Kelber & Lawless). The focus is on English-language stopwords. You can download the file by clicking “download” in the upper-right corner of the GitHub window, then opening the file in Jupyter Lab. Judith: Italian-language stopword lists are available elsewhere to use and remix. See, for example, the stopwords-it text file on GitHub from stopwords-iso. You can also try the steps in the tutorial using spaCy’s Italian-language model.
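To connect the regular expression and stopword readings, here is a minimal sketch of how the two fit together. It is not taken from any of the tutorials above: the sample sentence and the three-word stopword list are made up for illustration, and a real project would load a fuller list such as the ones linked in the stopwords tutorial.

```python
import re

# Hypothetical sample text and stopword list, for illustration only.
text = "The quick brown fox jumps over the lazy dog, and the dog sleeps."
stopwords = {"the", "and", "over"}

# The r prefix makes this a raw string, so Python hands \W+ to the
# regular expression engine as written instead of treating \W as a
# string escape. \W+ matches one or more non-word characters.
tokens = re.split(r"\W+", text.lower())

# re.split can leave empty strings at the edges of the list
# (here, after the final period), so drop them.
tokens = [t for t in tokens if t]

# Keep only the tokens that are not in the stopword list.
filtered = [t for t in tokens if t not in stopwords]
print(filtered)
```

Lowercasing before splitting means "The" and "the" are treated as the same word, which is a common choice but one you may want to reconsider if capitalization matters in your corpus.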
Hour one, Guest Speaker: Matt Thomas, Cornell Statistical Consulting Unit
Hour two: Data cleaning
- Pair up: Juan Pablo with Judith; Anne-Solène with Stephanie
- Show your data to your partner. Open up your text files and show the formatting. Describe your data cleaning issues and questions to your partner. You can refer back to your discussion board posts as needed. If partners have any insights or tips, they are welcome to share!
- Matt C. and I will go around and share tips for preparing and cleaning your data for analysis. You should walk away with some next steps on ways to clean your texts.
After the session:
- Discussion prompt
- Preparation for tomorrow’s session