June 09, 2025

Stats for Humanists

How Words Become Numbers and Data Cleaning

Pre-session work

Tutorials

(In order of importance)

  1. Become familiar with regular expressions with at least one of the following resources:
  2. Watch Python for DH lessons 5-8 by William Mattingly. If you’re on a roll, you can keep going through the additional lessons.

  3. Review this brief tutorial on stopwords: “Creating Stopwords Lists” (Kelber & Lawless). The focus is on English-language stopwords. You can download the file by clicking “download” in the upper-right corner of the GitHub window, then open the file in Jupyter Lab. Judith: Italian-language stopword lists are available elsewhere to use and remix. See, for example, the stopwords-it text file on GitHub from stopwords-iso. You can also try the steps in the tutorial using spaCy’s Italian-language model.
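If it helps to see the basic idea before opening the tutorial, here is a minimal sketch of stopword filtering in plain Python. It is not the tutorial's code: the `STOPWORDS` set and the `remove_stopwords` function are hypothetical names, and the word list is deliberately tiny. A real list would come from the tutorial or from a source such as stopwords-iso.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "a", "of", "and", "to", "in"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"\w+", text.lower())
    return [token for token in tokens if token not in stopwords]

sample = "The history of the book and the reader in Italy"
print(remove_stopwords(sample))

# The same function works with any list you supply, e.g. a few Italian words.
print(remove_stopwords("Il libro e la storia", {"il", "e", "la"}))
```

Passing the stopword list as an argument makes it easy to swap in, extend, or trim a list for your own corpus.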

Hour one, Guest Speaker: Matt Thomas, Cornell Statistical Consulting Unit

Hour two: Data cleaning

  1. Pair up: Juan Pablo with Judith; Anne-Solène with Stephanie
  2. Show your data to your partner. Open up your text files and show the formatting. Describe your data cleaning issues and questions to your partner. You can refer back to your discussion board posts as needed. If partners have any insights or tips, they are welcome to share!
  3. Matt C. and I will go around and share tips for preparing and cleaning your data for analysis. You should walk away with some next steps on ways to clean your texts.
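As a starting point for those next steps, here is a rough sketch of how regular expressions can handle a few common cleaning problems. The three issues it targets (words hyphenated across line breaks, lines containing only a page number, and runs of extra spaces) are assumptions about what your files might contain, and `clean_text` is a hypothetical name. Check each pattern against your own texts before relying on it, since a rule like the page-number one will also delete any line that is legitimately just a number.

```python
import re

def clean_text(raw):
    """Apply a few common cleaning steps to a plain-text transcription."""
    # Rejoin words hyphenated across a line break: "docu-\nment" -> "document".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines that contain nothing but a number (assumed to be page numbers).
    text = re.sub(r"^[ \t]*\d+[ \t]*\n", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()

raw = "A docu-\nment   with\n12\nextra   spaces.\n"
print(clean_text(raw))
```

Keeping each step as its own line makes it easy to comment one out, or add another, as you learn more about your texts.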

After the session:

  1. Discussion prompt
  2. Preparation for tomorrow’s session