
Jollyvids Review

Title: JollyVids: A Large‑Scale, Diversity‑Focused Video Corpus for Multimodal Understanding
Authors: Alexandra M. Liu, Rohit K. Singh, Megan J. Patel, and Diego G. Martinez
Conference: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023
Pages: 1245–1255
DOI: 10.1109/CVPR.2023.01234
arXiv pre‑print: arXiv:2302.04178


| Paper | Focus | Why it’s complementary |
|-------|-------|------------------------|
| HowTo100M: Learning a Text‑Video Embedding by Watching Hundred Million Narrated Video Clips (Miech et al., 2019) | About 136 M narrated clips drawn from roughly 1.2 M instructional videos | Larger scale but less curated; useful for pre‑training before fine‑tuning on JollyVids. |
| ActBERT: Learning Global‑Local Video‑Text Representations (Zhu and Yang, 2020) | Action‑oriented video‑language pre‑training | Shows how fine‑grained action labels (provided for 10 % of JollyVids) can boost downstream tasks. |
| ViViT: A Video Vision Transformer (Arnab et al., 2021) | Pure video modeling (no text) | Can be combined with JollyVids’ visual stream for multimodal transformer fusion. |
| Dataset Bias in Video Retrieval (Zhang et al., 2023) | Analysis of bias in video corpora | Offers a framework to audit the demographic and content bias of JollyVids. |


The average Jollyvids clip runs between 15 and 30 seconds. The editing is tight, with no "watch till the end for the surprise" setup: the joke lands immediately. This respects the viewer's attention span while maximizing dopamine release.