June 09, 2025

Stats for Humanists

How Words Become Numbers and Data Cleaning

Pre-session work

Tutorials

(In order of importance)

  1. Become familiar with regular expressions using at least one of the following resources:
  2. Watch Python for DH lessons 5-8 by William Mattingly. If you’re on a roll, you can keep going through the additional lessons.

  3. Review this brief tutorial on stopwords: “Creating Stopwords Lists” (Kelber & Lawless). The focus is on English-language stopwords. You can download the file by clicking “download” in the upper-right corner of the GitHub window, then opening the file in Jupyter Lab. Judith: Italian-language stopword lists are available elsewhere to use and remix. See, for example, the stopwords-it text file on GitHub from stopwords-iso. You can also try the steps in the tutorial using the Italian-language spaCy model.
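To make the idea of a stopword list concrete, here is a minimal sketch in plain Python. It is not taken from the tutorial above: the tiny `STOPWORDS` set, the `tokenize` and `remove_stopwords` helpers, and the sample sentence are all hypothetical, chosen only to show the basic pattern of tokenizing with a regular expression and then filtering.

```python
import re

# Hypothetical mini stopword list for illustration only. A real project would
# load a fuller list (English, Italian, ...) and edit it to suit the corpus.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it"}


def tokenize(text):
    """Lowercase the text and return its runs of letters as tokens.

    [^\\W\\d_]+ matches Unicode letters only, so accented characters are kept,
    while digits, punctuation, and apostrophes act as separators.
    """
    return re.findall(r"[^\W\d_]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set, in original order."""
    return [t for t in tokens if t not in stopwords]


sample = "The history of the book is a history of reading."
content_words = remove_stopwords(tokenize(sample))
print(content_words)
```

Because the stopword list is an ordinary set, adding or removing words for a particular corpus is a one-line change, which is the kind of remixing the tutorial encourages.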

Hour one, Guest Speaker: Matt Thomas, Cornell Statistical Consulting Unit

Slides from the presentation can be found at https://docs.google.com/presentation/d/1OjfZnWpzJUynWta0a40VsGdkWmcWb6miHUHJawnnCHU/edit?usp=sharing.

If you’d like to look through some of the code examples, you can find:

  • Some setup code here: ./setup.R (which defines some functions)
  • Examples of uses of word embeddings here: ./embeddings.R (which uses the functions above)
  • Topic modeling here: largely drawn from https://www.tidytextmining.com/.
  • The GloVe word embeddings here: ./glove.csv
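For a rough sense of how word embeddings let us compare words numerically, the sketch below computes cosine similarity between vectors. It is written in Python and is independent of the R scripts listed above. The three-dimensional `toy_vectors` are made up for illustration (real GloVe vectors have many more dimensions and are learned from a corpus), and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math

# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "apple": [0.1, 0.2, 0.95],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product over product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector has the highest cosine similarity."""
    candidates = (w for w in vectors if w != word)
    return max(candidates, key=lambda w: cosine_similarity(vectors[word], vectors[w]))


print(most_similar("king", toy_vectors))
```

Cosine similarity ignores vector length and compares direction only, which is one common reason it is used for comparing word vectors.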

Hour two: Data cleaning

  1. Pair up: Juan Pablo with Judith; Anne-Solène with Stephanie
  2. Show your data to your partner. Open up your text files and show the formatting. Describe your data cleaning issues and questions to your partner. You can refer back to your discussion board posts as needed. If partners have any insights or tips, they are welcome to share!
  3. Matt C. and I will go around and share tips for preparing and cleaning your data for analysis. You should walk away with some next steps on ways to clean your texts.
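As a starting point for the cleaning step, here is a minimal sketch of a regex-based cleanup function in Python. It assumes some problems that are common in scanned or exported texts (words hyphenated across line breaks, stray page-number lines, irregular spacing); your own files may need different rules. The `clean_text` name and the sample string are hypothetical.

```python
import re


def clean_text(raw):
    """Apply a few conservative regex fixes to a raw text string."""
    # Rejoin words split by a hyphen at a line break: "manu-\nscript" -> "manuscript".
    # Caution: this also merges genuine hyphenated compounds that happen to
    # break across lines, so spot-check the results.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Drop lines that contain only digits (often page numbers).
    text = re.sub(r"^[ \t]*\d+[ \t]*$\n?", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = "The manu-\nscript   was\tcopied.\n  12  \n\n\n\nNext page."
cleaned = clean_text(raw)
print(cleaned)
```

Each rule is a separate `re.sub` call so that you can comment out or reorder steps while comparing the output against the original file.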

After the session:

  1. Discussion prompt
  2. Preparation for tomorrow’s session