June 10, 2025

Word Vectors and Topic Modelling

Before the Session

Readings

  1. “Topic Modeling – Overview” (Walsh, 2021)
  2. Walsh, Melanie, and Maria Antoniak. “The Goodreads ‘Classics’: A Computational Study of Readers, Amazon, and Crowdsourced Amateur Criticism.” Post45: Peer Reviewed, Apr. 2021. post45.org.
  3. Read through (skimming code): “TF-IDF with Scikit-Learn” (Walsh, 2021)
  4. Soni, Sandeep, et al. “Abolitionist Networks: Modeling Language Change in Nineteenth-Century Activist Newspapers.” Journal of Cultural Analytics, vol. 6, no. 1, Jan. 2021, p. 18841. culturalanalytics.org, https://doi.org/10.22148/001c.18841.

Tutorials

  1. If you have time: Watch Python for DH lessons 9-12 by William Mattingly.

Hour one: TF-IDF, Word Vectors and Topic Modelling

Comparing the methods

Similarities:

  • Both methods are based on co-occurrence patterns
  • Both are unsupervised methods in text analysis

But there are also key differences:

Topic Modelling:

  • Exploratory method for identifying themes across documents in a corpus

Word Embeddings:

  • Looks at the relationships between words in a smaller window of context across the corpus
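
To make the difference in context size concrete, here is a minimal sketch in plain Python. It counts the same word pairs in two ways: once when the words share a whole document (the scale topic modelling works at) and once when they fall within a small window of each other (the scale word embeddings work at). The toy corpus, the function names, and the window size are all invented for illustration. Raw pair counting is only a simplified stand-in for the idea of co-occurrence; it is not how any particular topic model or embedding algorithm is implemented.

```python
from collections import Counter
from itertools import combinations

# Toy corpus, invented for illustration only.
docs = [
    "the whale swam under the ship",
    "the captain watched the whale",
    "the ship left the harbour",
]


def document_cooccurrence(docs):
    """Count word pairs that appear anywhere in the same document."""
    counts = Counter()
    for doc in docs:
        vocab = sorted(set(doc.split()))
        # Every unordered pair of distinct words in the document counts once.
        counts.update(combinations(vocab, 2))
    return counts


def window_cooccurrence(docs, window=2):
    """Count word pairs that appear within `window` tokens of each other."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for i, word in enumerate(tokens):
            # Look only at the next `window` tokens to the right.
            for other in tokens[i + 1 : i + 1 + window]:
                if other != word:
                    counts[tuple(sorted((word, other)))] += 1
    return counts


doc_counts = document_cooccurrence(docs)
win_counts = window_cooccurrence(docs, window=1)

# "ship" and "whale" share a document but are never adjacent.
print(doc_counts[("ship", "whale")])  # 1
print(win_counts[("ship", "whale")])  # 0
```

In this sketch, "ship" and "whale" are linked at the document level but not at the window level, which is one way to see why the two methods can surface different relationships in the same corpus.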

Readings

Hour two: Voyant Tools

Topic Modelling

Voyant Tools

Let’s spend some time working with Voyant, a web-based suite of tools that help facilitate text analysis and create visualizations.

Content to cover:

  • Uploading files
  • Including and editing stopwords

You can start by exploring tools that correspond to the methods we’ve been discussing in our recent sessions:

Or, you can browse the visualizations and their documentation in the Voyant documentation, on the sidebar under “topics.”

Consider the following when looking at the texts:

  • What questions do you have as you explore the visualizations?
  • Are the visualizations effective in helping you learn about the texts?
  • What stopwords might help or hinder your exploration of the texts?
  • Overall, do the tools help you learn anything new or unexpected about your texts?

Further learning

  • Text as Data:
    • Ch. 7: The Vector Space Model and Similarity Metrics
    • Ch. 8: Distributed Representations of Words
    • Ch. 13: Topic Models