Explore Collections:
Utilize Wayback Machine for Websites:
If you don't have the local storage for 500GB of plain text, use the Internet Archive's "Search inside" feature.
The rec 2007 crawler began visiting websites at high speed. On many sites, it encountered:
The crawler, following its programming, sent an email to each address it found. When it emailed an auto-responder, that auto-responder sent a reply. The crawler then treated the reply as a new email address to contact, and emailed it back. This created an infinite loop.
Within hours, these loops were generating millions of emails per hour.
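The loop described above can be illustrated with a toy simulation. This is a minimal sketch, not the Archive's actual crawler code: the function `simulate`, its parameters, and the example addresses are all hypothetical, and the `max_messages` cap exists only so the unguarded run terminates.

```python
from collections import deque


def simulate(addresses, auto_responders, dedupe, max_messages=1000):
    """Count emails a hypothetical crawler sends before stopping or hitting the cap."""
    queue = deque(addresses)   # addresses waiting to be emailed
    contacted = set()          # addresses already emailed
    sent = 0
    while queue and sent < max_messages:
        addr = queue.popleft()
        if dedupe and addr in contacted:
            continue           # guard: never email the same address twice
        contacted.add(addr)
        sent += 1
        if addr in auto_responders:
            # The auto-reply arrives and is treated as a newly found address.
            queue.append(addr)
    return sent


addresses = ["alice@example.org", "vacation@example.org"]
responders = {"vacation@example.org"}

unguarded = simulate(addresses, responders, dedupe=False)
guarded = simulate(addresses, responders, dedupe=True)
print(unguarded, guarded)
```

Without the guard, a single auto-responder keeps the queue non-empty forever and the run only stops at the cap. With a simple record of contacted addresses, the same input ends after one email per address.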
Before you download the "rec 2007" set, a word of caution. Unlike modern social media, where you click "I agree" before posting, users in 2007 gave no such consent, and usernames often used real names (e.g., John.Doe@university.edu). Even though the Internet Archive believes these posts are in the public domain or covered by fair use (archiving purposes), researchers must consider PII (Personally Identifiable Information).
If you use rec.2007 to train a Large Language Model (LLM), the original authors cannot "opt out" their late-night arguments about Star Wars canon. Ethically, most researchers strip headers and anonymize email addresses before releasing derivative datasets.
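As a rough illustration of stripping headers and anonymizing addresses, here is a minimal sketch using only the Python standard library. It assumes each post is a plain-text message with RFC 822-style headers; the names `scrub_post` and `pseudonymize`, the salt value, and the sample message are hypothetical. Salted hashing is pseudonymization rather than full anonymization, so treat it as a starting point.

```python
import hashlib
import re
from email import message_from_string

# Deliberately simple pattern; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize(match, salt="change-me"):
    """Replace an address with a stable, salted pseudonym."""
    address = match.group(0).lower()
    digest = hashlib.sha256((salt + address).encode("utf-8")).hexdigest()
    return f"user-{digest[:10]}@example.invalid"


def scrub_post(raw):
    """Drop all headers and pseudonymize addresses left in the body."""
    msg = message_from_string(raw)
    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart messages are skipped in this sketch
    return EMAIL_RE.sub(pseudonymize, body)


raw = (
    "From: John.Doe@university.edu\n"
    "Subject: Star Wars canon\n"
    "\n"
    "Write to me at John.Doe@university.edu for details.\n"
)
clean = scrub_post(raw)
print(clean)
```

Because the pseudonym is derived from the address, the same author maps to the same token across posts, which preserves thread structure for analysis while removing the readable name.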
In late 2007, the Archive deployed a new crawler instance internally referred to as "rec 2007" (likely short for "record 2007" or a project code). This crawler was designed to be aggressive — to capture as much of the web as possible, including dynamic pages and email links.
The critical mistake: the crawler did not properly filter email addresses. It was set to harvest any email it found and, in some configurations, to send a confirmation or notification to those addresses — a standard practice for some types of crawlers, but disastrous here.
For researchers, 2007 is a "Goldilocks zone" for digital sociology. It represents the last breath of the old text-based internet before the mobile/smartphone revolution.