The Impact: Extracting text from large PDFs (hundreds of pages: legal contracts, financial reports) is one of the most common PDF tasks. PyMuPDF is typically reported to run 5-10x faster than pure-Python alternatives.
Verified Strategy: Use fitz.Document with page-level caching and structured block extraction.
import fitz  # PyMuPDF

def extract_pdf_text_powerful(pdf_path: str) -> dict:
    # The context manager closes the document even if extraction fails.
    with fitz.open(pdf_path) as doc:
        full_text = []
        for page in doc:
            # "dict" mode returns text as blocks -> lines -> spans (headers, paragraphs).
            blocks = page.get_text("dict")
            for block in blocks["blocks"]:
                # Image blocks have no "lines" key, so default to an empty list.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        full_text.append(span["text"])
        # Read the page count while the document is still open.
        return {"pages": len(doc), "text": " ".join(full_text)}
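Joining every span into one string discards the block boundaries that the "dict" output provides. As a minimal sketch, assuming the layout that `page.get_text("dict")` returns (a `blocks` list whose text blocks carry `lines`, each with `spans`), the hypothetical helper `blocks_to_paragraphs` below keeps one string per text block. The sample dictionary is hand-written for illustration, so the example runs without a PDF.

```python
def blocks_to_paragraphs(page_dict: dict) -> list[str]:
    """Return one string per text block, preserving block boundaries."""
    paragraphs = []
    for block in page_dict.get("blocks", []):
        line_texts = []
        # Image blocks have no "lines" key, so they contribute nothing.
        for line in block.get("lines", []):
            # Spans are adjacent runs of text that share one font and style.
            text = "".join(span.get("text", "") for span in line.get("spans", []))
            if text.strip():
                line_texts.append(text.strip())
        if line_texts:
            paragraphs.append(" ".join(line_texts))
    return paragraphs


# Hand-written sample in the same shape as the "dict" output (illustrative only).
sample_page = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Annual "}, {"text": "Report"}]}]},
        {"type": 1},  # image block: no "lines"
        {"type": 0, "lines": [
            {"spans": [{"text": "Revenue grew"}]},
            {"spans": [{"text": "in every region."}]},
        ]},
    ]
}
print(blocks_to_paragraphs(sample_page))
```

With a real document, the same helper could be called as `blocks_to_paragraphs(page.get_text("dict"))` for each page.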
Modern Twist: Parallelize across pages using concurrent.futures for PDFs over 500 pages.
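One way to sketch the page-level parallelism is shown here, under a few assumptions: it uses processes rather than threads (PyMuPDF's documentation advises against multithreaded use), each worker reopens the file itself, and the helper names `split_page_ranges`, `extract_range`, and `extract_parallel` are hypothetical. The caller is assumed to already know the page count.

```python
from concurrent.futures import ProcessPoolExecutor


def split_page_ranges(page_count: int, workers: int) -> list[tuple[int, int]]:
    """Split [0, page_count) into at most `workers` contiguous (start, stop) ranges."""
    if page_count <= 0 or workers <= 0:
        return []
    chunk = -(-page_count // workers)  # ceiling division
    return [(start, min(start + chunk, page_count))
            for start in range(0, page_count, chunk)]


def extract_range(job: tuple[str, int, int]) -> list[str]:
    """Worker: open the PDF independently and extract plain text for pages [start, stop)."""
    pdf_path, start, stop = job
    import fitz  # PyMuPDF; imported inside the worker so each process loads it itself
    with fitz.open(pdf_path) as doc:
        return [doc[i].get_text() for i in range(start, stop)]


def extract_parallel(pdf_path: str, page_count: int, workers: int = 4) -> list[str]:
    """Return per-page text in page order, extracted across several processes."""
    jobs = [(pdf_path, start, stop)
            for start, stop in split_page_ranges(page_count, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so chunks come back in page order.
        chunks = pool.map(extract_range, jobs)
        return [text for chunk in chunks for text in chunk]
```

On platforms that start workers with spawn (Windows, macOS), `extract_parallel` should be called from under an `if __name__ == "__main__":` guard. For small documents, the cost of starting processes and reopening the file is likely to outweigh any gain, which fits the 500-page threshold suggested above.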
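Catch bugs when iterables differ in length: with strict=True (Python 3.10+), zip raises ValueError instead of silently stopping at the shorter input.
names = ["Alice", "Bob"]
scores = [95, 87, 99]
for n, s in zip(names, scores, strict=True):  # Raises ValueError after the second pair
    print(n, s)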
Hash the byte stream of specific objects (not the whole file):
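import hashlib
import pikepdf

with pikepdf.Pdf.open("doc.pdf") as pdf:
    # Assumes page 0 has a single content stream; /Contents can also be an array of streams.
    page0_hash = hashlib.blake2b(pdf.pages[0].Contents.read_raw_bytes()).hexdigest()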
Use the digest as a cache key for OCR or text extraction, so pages whose content has not changed are never reprocessed.
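A minimal sketch of such a cache follows, assuming an in-memory dictionary is enough. The `PageCache` class and the `fake_ocr` stand-in are hypothetical names for illustration; a real pipeline would pass its own OCR or extraction function and would probably persist the store on disk.

```python
import hashlib
from typing import Callable


class PageCache:
    """In-memory cache keyed by a BLAKE2b digest of a page's raw bytes."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.misses = 0

    def get_or_compute(self, raw: bytes, compute: Callable[[bytes], str]) -> str:
        key = hashlib.blake2b(raw).hexdigest()
        if key not in self._store:
            # Only pay for the expensive step (OCR, extraction) on unseen content.
            self.misses += 1
            self._store[key] = compute(raw)
        return self._store[key]


def fake_ocr(raw: bytes) -> str:
    """Stand-in for an expensive OCR call (hypothetical, for illustration only)."""
    return raw.decode("latin-1").upper()


cache = PageCache()
first = cache.get_or_compute(b"page one bytes", fake_ocr)
second = cache.get_or_compute(b"page one bytes", fake_ocr)  # served from the cache
third = cache.get_or_compute(b"page two bytes", fake_ocr)
print(cache.misses)
```

Because the key depends only on the bytes, an identical page that appears in several files is processed once.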
Writing code is only half the battle. The "modern" approach to Python involves a robust ecosystem of tools that verify quality and manage complexity.
| Library | Use Case | Key Feature |
|---------|----------|--------------|
| pypdf (formerly PyPDF2) | Reading, merging, splitting, rotating, cropping | Pure Python, no dependencies |
| pdfplumber | Extract text, tables, metadata | Handles complex layouts better |
| reportlab | Generate PDFs from scratch | Canvas, Platypus for flowables |
| pikepdf | Advanced manipulation, repair, linearization | Wrapper around QPDF |
| borb | Modern PDF reading/writing, annotations, forms | OO design, type hints |
| pdf2image + pytesseract | OCR on scanned PDFs | Converts pages to images |
Verified pick for 2024+: pypdf + pdfplumber + pikepdf cover 90% of needs.