Within each cluster, we fit a Latent Dirichlet Allocation (LDA) model with five topics (n_topics=5) to the TF‑IDF matrix, and manually inspected the ten highest-weighted terms of each topic.
We downloaded all items from the official Emily18 Com repository (https://archive.emily18.com/2021/full‑sets), which distributes them under a CC‑BY‑4.0 license. The repository provides a SHA‑256 checksum for each file; we verified every downloaded file against its checksum before ingestion.
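A checksum verification of this kind can be sketched with the Python standard library. The helper names `sha256_of_file` and `verify_file`, the chunk size, and the sample file are assumptions for illustration; the format in which the repository publishes its checksums is not specified here, so the expected digest is simply passed in as a hex string.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read until fh.read() returns b"" so large files never sit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex):
    """Compare a file's digest with a published hex digest, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()


# Hypothetical demonstration on a temporary file instead of a real download.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"example payload")
    published = hashlib.sha256(b"example payload").hexdigest()

    ok = verify_file(sample, published)      # digest matches
    tampered = verify_file(sample, "0" * 64)  # digest does not match

print(ok, tampered)
```

In an ingestion script, a file for which `verify_file` returns `False` would be re-downloaded or excluded rather than ingested.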
We computed basic statistics (counts, sizes, temporal distribution) using pandas.
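The descriptive statistics can be illustrated with a short pandas sketch. The manifest below is invented for illustration, and the column names `path`, `size_bytes`, and `created` are assumptions; the real metadata schema may differ.

```python
import pandas as pd

# Hypothetical manifest: one row per downloaded item.
manifest = pd.DataFrame(
    {
        "path": ["a.txt", "b.txt", "c.txt", "d.txt"],
        "size_bytes": [1200, 800, 4000, 2000],
        "created": pd.to_datetime(
            ["2021-01-15", "2021-01-20", "2021-02-03", "2021-03-11"]
        ),
    }
)

# Counts: total number of items.
n_items = len(manifest)

# Sizes: count, mean, std, min, quartiles, and max of the file sizes.
size_summary = manifest["size_bytes"].describe()

# Temporal distribution: number of items per calendar month.
items_per_month = manifest.groupby(manifest["created"].dt.to_period("M")).size()

print(n_items)
print(size_summary)
print(items_per_month)
```

Grouping by `dt.to_period("M")` keeps the months in chronological order; a coarser or finer period (for example `"Y"` or `"W"`) can be substituted depending on the time span of the collection.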