Instead of manually reading timestamps, you can scrape and parse the index. Here’s a reasonably robust way to find the most recently updated file in an Apache-style index:
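import requests
from bs4 import BeautifulSoup
from datetime import datetime

url = "http://example.com/data/"
response = requests.get(url, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Collect (name, modified time, size) for every file row in the listing table.
# Assumes the columns are name, last modified, size; adjust the indexes if
# your server's listing adds an icon column.
files = []
for row in soup.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) >= 3:
        name_elem = cols[0].find('a')
        # Skip the parent-directory link
        if name_elem and name_elem.get('href') != '../':
            name = name_elem.text
            mod_time_str = cols[1].text.strip()
            try:
                mod_time = datetime.strptime(mod_time_str, '%Y-%m-%d %H:%M')
            except ValueError:
                continue  # skip rows whose date does not match the expected format
            files.append((name, mod_time, cols[2].text.strip()))

if files:
    latest = max(files, key=lambda x: x[1])
    print(f"Latest updated file: {latest[0]} at {latest[1]}")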
This script is a useful starting point for building automated watchers over any "index of files" page.
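One way such a watcher could work is to keep the previous snapshot of the listing and compare it with the current one. The sketch below assumes each snapshot is a dict mapping file name to modification time, which could be built from the (name, mod_time, size) tuples collected by the scraper. The function name `diff_snapshots` and the sample file names are hypothetical.

```python
from datetime import datetime


def diff_snapshots(previous, current):
    """Return (added, updated) file names between two {name: mtime} snapshots."""
    added = sorted(name for name in current if name not in previous)
    updated = sorted(
        name for name, mtime in current.items()
        if name in previous and mtime > previous[name]
    )
    return added, updated


# Hypothetical snapshots taken on two consecutive polls
previous = {
    "a.csv": datetime(2024, 1, 1, 9, 0),
    "b.csv": datetime(2024, 1, 1, 9, 0),
}
current = {
    "a.csv": datetime(2024, 1, 1, 9, 0),    # unchanged
    "b.csv": datetime(2024, 1, 2, 14, 30),  # modified since the last poll
    "c.csv": datetime(2024, 1, 2, 14, 31),  # newly published
}

added, updated = diff_snapshots(previous, current)
print("new:", added)
print("updated:", updated)
```

A file that disappears from the listing is ignored here; a real watcher might want to report deletions as a third category.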
Automated data pipelines often publish CSV, JSON, or Parquet files to public or internal directories. An updated index helps:
If you need to programmatically check a remote "index of files" for updates, parsing the HTML table alone is fragile, because it breaks whenever the listing's layout changes. A more robust approach uses bash, curl, and grep to extract only the links and then read each file's HTTP headers:
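# Fetch the directory listing and extract every link target
curl -s http://example.com/files/ | \
  grep -oP '(?<=<a href=")[^"]+' | \
  # Skip subdirectories and Apache's column-sort links (?C=...)
  grep -v -e '/$' -e '^[?]' | \
  while IFS= read -r file; do
    # Fetch only the headers and print the Last-Modified line
    curl -sI "http://example.com/files/$file" | grep -i "last-modified"
  done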
This script ignores the layout of the visual table and queries the HTTP headers directly, returning the exact Last-Modified value the server reports for each file.
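The same header check can be sketched in Python using only the standard library. This is a minimal illustration, not a drop-in replacement for the shell script: the function names `parse_last_modified` and `fetch_last_modified` are hypothetical, and the sample header value is made up for the demonstration.

```python
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def parse_last_modified(header_value):
    """Convert an HTTP date such as 'Wed, 21 Oct 2015 07:28:00 GMT' to a datetime."""
    return parsedate_to_datetime(header_value)


def fetch_last_modified(url, timeout=10):
    """Send a HEAD request and return the Last-Modified time, or None if absent."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        value = response.headers.get("Last-Modified")
    return parse_last_modified(value) if value else None


# Offline demonstration of the parsing step (no network access needed)
stamp = parse_last_modified("Wed, 21 Oct 2015 07:28:00 GMT")
print(stamp.isoformat())
```

Parsing the header into a timezone-aware datetime, rather than comparing raw strings, makes it safe to compare timestamps across files.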
In the context of an index of files, the "updated" timestamp specifically refers to the mtime (modification time), not the creation time. Here is why this distinction is vital:
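The mtime changes whenever a file's contents are written, so it answers the question "is this data fresh?". On Unix-like systems, `st_ctime` records the last inode metadata change rather than creation, so it is not a reliable substitute. The sketch below reads a file's mtime locally with `os.stat`; the helper name `modified_time` and the chosen timestamp are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone


def modified_time(path):
    """Return the file's mtime as a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(os.stat(path).st_mtime, tz=timezone.utc)


# Create a throwaway file, then set its mtime to a known (hypothetical) instant
with tempfile.NamedTemporaryFile(delete=False) as handle:
    sample_path = handle.name
known_instant = 1_700_000_000  # seconds since the Unix epoch
os.utime(sample_path, (known_instant, known_instant))

print(modified_time(sample_path))
os.remove(sample_path)
```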
When a user visits a file directory, they are usually looking for one of two things:
By default, most web servers (Apache, Nginx) serve directory listings alphabetically. This means Annual_Report_2015.pdf sits right at the top, while Annual_Report_2024.pdf is buried halfway down the page.
A dynamic "Updated" index solves this by sorting by the Last Modified timestamp.
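As a rough illustration of that reordering, the sketch below sorts a small listing newest first. The helper name `sort_newest_first` and the timestamps are hypothetical; only the two report file names come from the example above.

```python
from datetime import datetime


def sort_newest_first(entries):
    """Sort (name, modified) pairs so the most recently modified comes first."""
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Hypothetical listing in the alphabetical order a server would return
listing = [
    ("Annual_Report_2015.pdf", datetime(2015, 3, 2, 10, 15)),
    ("Annual_Report_2019.pdf", datetime(2019, 3, 5, 9, 5)),
    ("Annual_Report_2024.pdf", datetime(2024, 2, 27, 16, 40)),
]

for name, modified in sort_newest_first(listing):
    print(f"{modified:%Y-%m-%d %H:%M}  {name}")
```

On listings generated by Apache's mod_autoindex, appending `?C=M;O=D` to the directory URL should request the same newest-first ordering from the server itself.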
Admins rely on updated indexes to:
lftp can mirror only new/modified files from an HTTP index:
lftp -c "mirror --only-newer --verbose http://example.com/files/ /local/mirror/"