400 Invoices, Two Pipelines: The PDF Trick That Cut My Claude Bill 58%

You've got a folder of supplier PDFs and a prompt that asks Claude to return JSON. It works. Until you do it 400 times a month and the invoice from Anthropic shows up. I ran the same 400 Serbian invoices through two paths — raw PDF upload vs. a 2-step extract-then-send pipeline — and the gap was bigger than I expected: 58% fewer tokens, accuracy from 71% to 96%, on the exact same model and prompt.
Why uploading PDFs to Claude burns tokens you don't see
When you attach a PDF to the Anthropic API, it doesn't parse the file. It rasterizes each page to an image and runs vision on it. Every page becomes thousands of vision tokens whether that page has 5 words or 500. A clean 2-page invoice with one table costs roughly the same as a 2-page wall of text.
On our 400-invoice sample (mixed digital + phone-scan PDFs, A4, 1-3 pages):
- Average input tokens per invoice: ~11,000
- Field-level accuracy (every required field correct, char-by-char): 71%
- Cost per invoice with Sonnet pricing: roughly $0.033 in input alone
71% sounds fine until you do the math the other way: about 1 in 3 invoices needs a human to fix something before the data hits the accounting system. At 400/month that's ~120 manual corrections. The token bill and the labor bill both hurt.
The "convert to markdown first" advice is half right
The standard YouTube fix is: convert PDF → markdown, send markdown. It works perfectly on clean digital PDFs where there's an embedded text layer. It silently fails the moment someone hands you a phone-photographed scan, because there's no text to extract. You get an empty string and a confused Claude.
A real pipeline has to detect which kind of PDF it's looking at and branch:
- Digital PDF → pull the text layer directly. Fast, free, lossless.
- Scanned/image PDF → render to image, OCR it, then send the text.
That branch is the whole trick. It's not a framework. It's an if-else with two good tools behind it.
The 25-line router that does the work
Here's the actual extractor we run in production. pdfplumber first, Tesseract fallback. For Serbian I load both Latin and Cyrillic language packs so mixed-script lines don't collapse.
import pdfplumber
import pytesseract
from pdf2image import convert_from_path
TEXT_THRESHOLD = 100 # chars of real text = "digital PDF"
def extract_text(pdf_path: str) -> str:
# Step 1: try the text layer
with pdfplumber.open(pdf_path) as pdf:
text = "\n".join((page.extract_text() or "") for page in pdf.pages)
if len(text.strip()) > TEXT_THRESHOLD:
return text # digital PDF, done
# Step 2: OCR fallback for scans
images = convert_from_path(pdf_path, dpi=300)
ocr_pages = [
pytesseract.image_to_string(img, lang="srp+srp_latn+eng")
for img in images
]
return "\n".join(ocr_pages)
Two things to call out:
- The 100-character threshold catches the common failure where
pdfplumberreturns a few stray ligatures from a scan and tricks you into thinking you got text. - DPI matters. At 200 DPI Tesseract misread enough digits to drop accuracy ~4 points. 300 DPI is the sweet spot; 400 DPI cost more RAM with no measurable gain.
What you send to Claude is now a plain text blob. No PDF. No image. Just text.
The numbers after the switch
Same 400 invoices, same prompt, same model (Sonnet). Only the input format changed.
| Metric | Raw PDF upload | Extract-then-send |
|---|---|---|
| Avg input tokens / invoice | ~11,000 | ~4,600 |
| Field-level accuracy | 71% | 96% |
| Cost / invoice (input) | ~$0.033 | ~$0.014 |
| Monthly cost @ 400 invoices | ~$13.20 | ~$5.60 |
The 58% token drop was the expected win. The accuracy jump from 71% to 96% wasn't, and it's the more interesting number. Here's what's actually happening: when Claude processes a rasterized page, it does two jobs at once — read the glyphs and understand the structure. When you hand it clean text, it only has to do the structure job. It stops hallucinating numbers because it's no longer guessing what a smudged digit on a phone scan is supposed to be.
Concrete example from the test set: one invoice came back from the raw-PDF path with the supplier name off by two characters (Cyrillic "ћ" read as "ћ" + space), one line item dropped entirely, and the VAT total off by a decimal. Same invoice through the pipeline: every field correct, supplier name included.
When you should still upload the PDF directly
I'm not telling you to delete the vision path. There are documents where vision still wins, and pretending otherwise is how tutorials get people in trouble:
- Handwritten margin notes you need to capture
- Stamps and signatures you need to verify exist
- Layout-critical docs — medical forms, legal docs with checkboxes, anything where the position of a mark carries meaning
- Multi-column scientific PDFs where reading order from OCR is garbage
For ~95% of business paperwork — invoices, receipts, contracts, statements, monthly reports — the extract-then-send pipeline is faster, cheaper, and more accurate. For the other 5%, pay the vision tax knowingly.
Cache the extraction or you'll do this work twice
In any real workflow, the same PDF gets processed more than once. Someone re-uploads it. A retry fires. A nightly job re-scans the inbox. OCR'ing the same 3-page scan twice is pure waste — Tesseract on a 300 DPI A4 page runs about 1.2-1.8 seconds on my box, and that adds up.
Hash the file, key on the hash, store the text. Trivial:
import hashlib, json, os
CACHE_DIR = "./extract_cache"
def extract_cached(pdf_path: str) -> str:
with open(pdf_path, "rb") as f:
digest = hashlib.sha256(f.read()).hexdigest()
cache_file = os.path.join(CACHE_DIR, f"{digest}.txt")
if os.path.exists(cache_file):
with open(cache_file, "r", encoding="utf-8") as f:
return f.read()
text = extract_text(pdf_path)
os.makedirs(CACHE_DIR, exist_ok=True)
with open(cache_file, "w", encoding="utf-8") as f:
f.write(text)
return text
On the 400-invoice month, cache hits saved another ~18% of total pipeline runtime because duplicates from supplier portals are more common than people admit.
Why bizflowai.io helps with this
This pipeline — pdfplumber-first, Tesseract-fallback, hash-cached, Claude-on-text-only — is the same one I deploy for clients who process supplier invoices, receipts, and contracts in volume. The setup includes language-pack tuning for whatever scripts they actually receive (Serbian Latin + Cyrillic in my case, but the pattern is the same for Greek, Arabic, mixed EU languages), a confidence threshold per extracted field, and a small human-review queue for the cases where vision still wins. It's not a product; it's a working pipeline I ship end-to-end so the operator stops re-typing numbers from PDFs.
Frequently asked questions
Why is uploading PDFs directly to Claude so expensive?
When you upload a PDF to Claude's API, it doesn't read it as text. It rasterizes each page into an image and processes it with vision, consuming thousands of vision tokens per page regardless of content. A two-page invoice with one table costs roughly the same as a two-page invoice packed with text — you're paying for pixels, not information. This averages around 11,000 input tokens per invoice.
How do I reduce token costs when extracting data from PDFs with Claude?
Use a 2-step pipeline before calling Claude. First, try extracting text directly with a library like pdfplumber in Python. If you get meaningful text (over 100 characters), send that to Claude. If not, the PDF is likely a scan — render each page to an image and run Tesseract OCR. Either way, send Claude plain text, not the PDF. This dropped costs from ~11,000 to ~4,600 tokens per invoice (58% reduction).
Why does converting PDFs to text improve Claude's extraction accuracy?
When Claude processes a rendered PDF page, it does two jobs at once: reading text and understanding structure. Giving it clean text upfront means it only handles the structure job, so it stops hallucinating numbers by guessing what smudged digits are. In one real-world invoicing pipeline, field-level accuracy jumped from 71% to 96% after switching from direct PDF uploads to pre-extracted text.
When should I use pdfplumber vs Tesseract OCR for PDF extraction?
Use pdfplumber first for any PDF — it extracts the embedded text layer from digital PDFs (like SaaS-generated invoices) almost perfectly and cheaply. Only fall back to Tesseract OCR when pdfplumber returns nothing or near-nothing (under ~100 characters), which signals a scanned image PDF with no text layer. For mixed-script documents like Serbian invoices, load both Latin and Cyrillic Tesseract language packs.
What is the problem with the 'convert PDF to markdown' advice for Claude pipelines?
The common advice to convert PDFs to markdown before sending to Claude only works for digital PDFs with embedded text. It completely fails on scanned PDFs because there is no text layer to extract — you'll get an empty string back. The real solution is a routing pipeline that tries text extraction first and falls back to OCR (like Tesseract) when no text layer exists.
Want more like this?
I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.
Subscribe to bizflowai.io on YouTube — never miss a new tutorial.
Planning an AI automation project or need a second opinion on your architecture?
Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.
Visit bizflowai.io for our services, case studies, and AI consulting.
Frequently asked questions
Why is uploading PDFs directly to Claude so expensive?
When you upload a PDF to Claude's API, it doesn't read it as text. It rasterizes each page into an image and processes it with vision, consuming thousands of vision tokens per page regardless of content. A two-page invoice with one table costs roughly the same as a two-page invoice packed with text — you're paying for pixels, not information. This averages around 11,000 input tokens per invoice.
How do I reduce token costs when extracting data from PDFs with Claude?
Use a 2-step pipeline before calling Claude. First, try extracting text directly with a library like pdfplumber in Python. If you get meaningful text (over 100 characters), send that to Claude. If not, the PDF is likely a scan — render each page to an image and run Tesseract OCR. Either way, send Claude plain text, not the PDF. This dropped costs from ~11,000 to ~4,600 tokens per invoice (58% reduction).
Why does converting PDFs to text improve Claude's extraction accuracy?
When Claude processes a rendered PDF page, it does two jobs at once: reading text and understanding structure. Giving it clean text upfront means it only handles the structure job, so it stops hallucinating numbers by guessing what smudged digits are. In one real-world invoicing pipeline, field-level accuracy jumped from 71% to 96% after switching from direct PDF uploads to pre-extracted text.
When should I use pdfplumber vs Tesseract OCR for PDF extraction?
Use pdfplumber first for any PDF — it extracts the embedded text layer from digital PDFs (like SaaS-generated invoices) almost perfectly and cheaply. Only fall back to Tesseract OCR when pdfplumber returns nothing or near-nothing (under ~100 characters), which signals a scanned image PDF with no text layer. For mixed-script documents like Serbian invoices, load both Latin and Cyrillic Tesseract language packs.
What is the problem with the 'convert PDF to markdown' advice for Claude pipelines?
The common advice to convert PDFs to markdown before sending to Claude only works for digital PDFs with embedded text. It completely fails on scanned PDFs because there is no text layer to extract — you'll get an empty string back. The real solution is a routing pipeline that tries text extraction first and falls back to OCR (like Tesseract) when no text layer exists.