RAG over 4M Regulatory PDFs

The Problem

A fintech compliance team needed to answer policy questions against 4M+ regulatory PDFs (SEC filings, FINRA rules, internal policy docs) in real time. The catch: none of this data could touch a public LLM API, because any leak would be a regulatory violation.

Each query took 3–4 hours of keyword search plus manual reading. At 20+ queries per day, that came to 60–80 analyst-hours per day spent on lookups.

What I Built

A fully self-hosted RAG pipeline running on Azure Private Endpoint infrastructure:

Ingestion — PDF parsing (pdfminer + custom layout heuristics for tables), chunking with overlap, embedding via text-embedding-3-large running on Azure OpenAI (no data leaves the tenant)
Indexing — Chroma on a dedicated VM, 4.2M chunks, metadata-filtered by document type, jurisdiction, and date
Retrieval — Hybrid BM25 + vector search, re-ranked with a fine-tuned cross-encoder
Generation — GPT-4 via Azure OpenAI, system prompt enforcing citation-only responses
Eval — Golden-set of 400 Q&A pairs, weekly automated regression runs
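
Two of the steps above, chunking with overlap and merging BM25 with vector results, can be sketched in a few lines. This is a minimal illustration, not the production code: the window size of 512 tokens, the overlap of 64, and the use of reciprocal rank fusion (RRF) to combine the two result lists are all assumptions, since the write-up does not say how the hybrid scores were merged.

```python
from collections import defaultdict


def chunk_with_overlap(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk ids into one list, best first.

    Assumed fusion method: each list contributes 1 / (k + rank) per chunk.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical ids: one ranking from BM25, one from the vector index.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
```

RRF only needs ranks, so it avoids having to put BM25 scores and cosine similarities on a common scale. The fused list would then go to the re-ranker.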

The Air-Gap Architecture

Everything runs inside the client’s Azure subscription:

Azure OpenAI (models deployed in their tenant)
Azure Blob for document storage
Private endpoints — no public internet egress
Audit logs on every query for compliance review
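
As an illustration of the per-query audit logging, here is one possible shape for a log record. The field names, the choice of JSON lines, and storing a digest of the answer instead of the full text are all assumptions made for this sketch; the actual log schema is not described here.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(user_id, query, chunk_ids, answer, now=None):
    """Build one JSON line describing a query, for an append-only audit log."""
    now = now or datetime.now(timezone.utc)
    record = {
        "timestamp": now.isoformat(),
        "user_id": user_id,
        "query": query,
        "retrieved_chunk_ids": list(chunk_ids),
        # Store a digest rather than the full answer to keep log lines small;
        # the answer itself could live elsewhere, keyed by this hash.
        "answer_sha256": hashlib.sha256(answer.encode("utf-8")).hexdigest(),
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical values, for illustration only.
line = audit_record(
    "analyst-17",
    "What is the margin requirement?",
    ["chunk-001", "chunk-042"],
    "See chunk-001.",
    now=datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
```

Recording the retrieved chunk ids alongside the query lets a reviewer trace which source passages an answer was built from.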

The Hard Part: Table Extraction

Regulatory PDFs are full of data tables. Standard PDF text extraction flattens them into garbage: cell values come out as a run of text with no row or column structure. I wrote a custom extractor that detects table regions (via pdfminer layout boxes), reconstructs row/column structure, and serializes to markdown before chunking. This alone moved accuracy from 71% to 88% on table-heavy queries.
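
The reconstruct-and-serialize step can be sketched roughly as follows. This is a simplified stand-in for the custom extractor, not the extractor itself: it assumes the table region has already been detected and each cell reduced to a hypothetical `(x0, y0, text)` tuple, it clusters coordinates with a fixed tolerance, and it treats the first row as the header.

```python
def cluster(values, tol):
    """Group sorted values into clusters whose neighbours differ by <= tol.

    Returns the mean of each cluster, in ascending order.
    """
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centers, v):
    """Index of the cluster center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def table_to_markdown(cells, tol=3.0):
    """cells: iterable of (x0, y0, text). Returns a markdown table string."""
    cells = list(cells)
    if not cells:
        return ""
    cols = cluster([x for x, _, _ in cells], tol)
    # PDF coordinates grow upward, so the top row has the largest y.
    rows = sorted(cluster([y for _, y, _ in cells], tol), reverse=True)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in cells:
        r, c = nearest(rows, y), nearest(cols, x)
        # Escape pipes so cell text cannot break the markdown row.
        cell = text.strip().replace("|", "\\|")
        grid[r][c] = (grid[r][c] + " " + cell).strip()
    lines = ["| " + " | ".join(row) + " |" for row in grid]
    # Insert the header separator after the first reconstructed row.
    lines.insert(1, "| " + " | ".join("---" for _ in cols) + " |")
    return "\n".join(lines)


# Hypothetical cell positions with slight jitter, as layout analysis tends to produce.
cells = [(10, 100, "Rule"), (80, 100.5, "Limit"),
         (10, 85, "A"), (80, 85, "5%"),
         (10.4, 70, "B")]
markdown = table_to_markdown(cells)
```

Keeping empty cells in the grid matters: a missing value stays in its column instead of shifting the rest of the row left, which is the kind of misalignment that flattened extraction produces.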

Results

92% accuracy on the golden-set eval (vs. 41% with keyword search baseline)
Query time from 3–4 hours to under 30 seconds
Zero data leaks — all processing inside the Azure tenant
Analyst capacity freed up: ~50 hours/week redirected to actual analysis
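
For context on the golden-set number, a weekly regression run can be as simple as the sketch below. The `answer_fn` and `is_correct` callables and the one-point tolerance are hypothetical; how answers were actually judged correct (exact match, citation match, human review) is not specified here.

```python
def golden_set_accuracy(golden, answer_fn, is_correct):
    """Fraction of golden Q&A pairs that the pipeline answers correctly."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(1 for q, expected in golden if is_correct(answer_fn(q), expected))
    return hits / len(golden)


def check_regression(accuracy, baseline, tolerance=0.01):
    """True if accuracy has not dropped more than `tolerance` below baseline."""
    return accuracy >= baseline - tolerance


# Toy stand-ins so the sketch runs end to end.
golden = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
canned = {"q1": "a1", "q2": "a2", "q3": "wrong", "q4": "a4"}
accuracy = golden_set_accuracy(golden, canned.get, lambda got, want: got == want)
passed = check_regression(accuracy, baseline=0.75)
```

Wiring `check_regression` into a scheduled job would turn a drop in accuracy into a failed run instead of something noticed later by analysts.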

Lessons Learned

Cross-encoders are underused. The retrieval quality improvement from adding a fine-tuned cross-encoder re-ranker was larger than any prompt engineering change. If you’re building RAG and haven’t benchmarked a re-ranker, do it first.
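
To make the re-ranking step concrete, here is a minimal sketch. The `rerank` helper and its `score_pairs` parameter are hypothetical names; the scorer is passed in as a function so the example runs without a model, and the commented lines show how a public sentence-transformers checkpoint (not the fine-tuned model from this project) could be plugged in.

```python
def rerank(query, candidates, score_pairs, top_k=5):
    """Re-order candidate passages by their relevance score for the query.

    score_pairs takes a list of (query, passage) tuples and returns one
    score per pair, higher meaning more relevant.
    """
    if not candidates:
        return []
    scores = score_pairs([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]


def word_overlap(pairs):
    """Toy scorer: number of shared lowercase words. Stands in for a model."""
    return [len(set(q.lower().split()) & set(p.lower().split())) for q, p in pairs]


candidates = [
    "margin rules for brokers",
    "holiday schedule",
    "broker margin requirements and rules",
]
top = rerank("margin requirements", candidates, word_overlap, top_k=2)

# With sentence-transformers installed, a real cross-encoder could be used instead:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # public checkpoint
#   top = rerank("margin requirements", candidates, model.predict, top_k=2)
```

Because the cross-encoder reads the query and passage together, it is slower than the first-stage retrievers, so it is typically applied only to the short candidate list they return.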