Resources for Weeks 3-6 (and beyond)
Tutorials
Processing different file types
- CSV file types and pandas:
- “Pandas Basics” series (Walsh, 2021)
- JSON file format
- Intro to JSON files
- Song Genius API tutorial: See an example of JSON data, filtering using pandas, and looping to retrieve metadata.
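As a minimal sketch of the two formats above, the following reads a small CSV and a small JSON record set with pandas, filters rows, and loops over the records to pull out one piece of metadata. The song titles, column names, and values are invented for illustration; they do not come from the Genius API or from the tutorials listed.

```python
import io
import json

import pandas as pd

# Hypothetical CSV content; in practice you would pass a file path to pd.read_csv.
csv_text = (
    "title,artist,year\n"
    "Song A,Artist X,1999\n"
    "Song B,Artist Y,2005\n"
    "Song C,Artist X,2012\n"
)
songs = pd.read_csv(io.StringIO(csv_text))

# Filter rows with a boolean mask: keep only the songs by one artist.
artist_x = songs[songs["artist"] == "Artist X"]

# Hypothetical JSON content with one nested field per record.
json_text = (
    '[{"title": "Song A", "stats": {"pageviews": 120}},'
    ' {"title": "Song B", "stats": {"pageviews": 45}}]'
)
records = json.loads(json_text)

# json_normalize flattens nested dictionaries into columns such as "stats.pageviews".
stats = pd.json_normalize(records)

# Loop over the records to collect one metadata value per song.
pageviews = {}
for record in records:
    pageviews[record["title"]] = record["stats"]["pageviews"]

print(artist_x["title"].tolist())
print(stats.columns.tolist())
print(pageviews)
```

With real files, the same pattern applies once the in-memory strings are swapped for file paths.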
Regex
- Regex refresher: Python RegEx
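For a quick reminder of what this looks like in practice, here is a small sketch using Python's built-in re module. The sample sentence is made up for illustration.

```python
import re

# Made-up sample text.
text = "Letters dated 1848 and 1851 mention the 1848 uprising."

# \b marks a word boundary and \d{4} matches exactly four digits,
# so this finds every standalone four-digit number.
years = re.findall(r"\b\d{4}\b", text)

# re.sub replaces every match; here each run of whitespace becomes a single space.
cleaned = re.sub(r"\s+", " ", "too    many   spaces")

print(years)
print(cleaned)
```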
Topic modelling
Guidance with methods
- “Topic Modelling for the People” (Antoniak, 2023)
- “Is topic modelling obsolete?” (O’Sullivan, 2025)
- Ted Underwood’s blog: “topic modeling” category
Topic modelling options
- jsLDA: in-browser topic modelling tool, with accompanying instructions.
- Authorless topic modelling: Laure Thompson and David Mimno. 2018. Authorless Topic Models: Biasing Models Away from Known Structure. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3903–3914, Santa Fe, New Mexico, USA. Association for Computational Linguistics. Read the article, then review the code.
Topic modelling with Mallet
- Mallet installation using Little Mallet Wrapper
- Little Mallet Wrapper documentation
- For questions on time series visualisation: see the brief explanation of the plot_topics_over_time function: “Creates lineplots, one for each topic, showing the mean topic probability over document segments.” The y-axis shows the “mean topic probability” across the document chunks in your corpus for a given year, that is, the average probability the model assigns to that topic in those chunks.
- Topic modelling text files tutorial
- Topic modelling CSV files tutorial
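To make the “mean topic probability” on the y-axis concrete, here is a rough sketch of the underlying calculation using pandas alone: average one topic's probability across all document chunks from the same year. This is not Little Mallet Wrapper's own code, and the table, column names, and probabilities are invented for illustration.

```python
import pandas as pd

# Hypothetical data: one row per document chunk, with the year of its source
# document and the probability the model assigned to a single topic.
chunks = pd.DataFrame(
    {
        "year": [1990, 1990, 1991, 1991],
        "topic_0": [0.10, 0.30, 0.50, 0.70],
    }
)

# Group the chunks by year and average the topic probability within each group.
# Each resulting value corresponds to one point on the line for this topic.
mean_by_year = chunks.groupby("year")["topic_0"].mean()

print(mean_by_year)
```

A higher value for a year means that, on average, the chunks from that year devote a larger share of their words to the topic.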
Using LLMs
- “Working with Local LLMs (On Your Own Computer!) — Ollama and Llama 3”: “This code notebook (which should ideally be downloaded on your own computer) demonstrates how you can use local LLMs with Ollama to create structured data from unstructured text, as well as to chat and generate poems or create document embeddings.”
- See also: “code tutorials” and “prompt library” from the AI for Humanists project.
Text analysis: workflow and programming tips
Digital Publications
- The Data-sitters Club series
- Introduction to Text Analysis: A Coursebook (Walsh & Horowitz)
YouTube tutorials
- Python Tutorials for Digital Humanities, Python Programming
- Python full course for free (2024), BroCode
Large Language Models
- “Textual Data Processing with LLM in the Humanities,” video introductions by Gabor Toth (2025)
Creating public-facing websites
- Livemark: “Data presentation framework for Python that generates static sites from extended Markdown with interactive charts, tables, scripts, and other features.”
- Jupyter Book: Aimed more at long-form publications, but could be of interest. Melanie Walsh’s textbook was built using Jupyter Book. Version 2 is now also available.
Extra readings
These readings didn’t quite fit into our 2-week instruction schedule but could still be of interest!
Books and book chapters
Articles
Kusumegi, Keigo, and Yukie Sano. “Dataset of Identified Scholars Mentioned in Acknowledgement Statements.” Scientific Data, vol. 9, no. 1, Aug. 2022, p. 461. https://doi.org/10.1038/s41597-022-01585-y.
Yin, Yian, et al. “Coevolution of Policy and Science during the Pandemic.” Science, vol. 371, no. 6525, Jan. 2021, pp. 128–30. https://doi.org/10.1126/science.abe3084.