Fintech · Compliance 2024 Live

RAG over 4M Regulatory PDFs

Private, air-gapped RAG system letting a compliance team query 4M+ regulatory documents with 92% accuracy — zero data leaks to public LLMs.

4.2M
Docs indexed
92%
Eval accuracy
0
Data leaks
Stack LlamaIndexChromaFastAPIAzureHugging Face

The Problem

A fintech compliance team needed to answer policy questions against 4M+ regulatory PDFs — SEC filings, FINRA rules, internal policy docs — in real time. The catch: none of this data could touch a public LLM API. Any leak would be a regulatory violation.

They were spending 3–4 hours per query using keyword search + manual reading. At 20+ queries per day per analyst, that was 60–80 hours of analyst time wasted on lookups.

What I Built

A fully self-hosted RAG pipeline running on Azure Private Endpoint infrastructure:

  1. Ingestion — PDF parsing (pdfminer + custom layout heuristics for tables), chunking with overlap, embedding via text-embedding-3-large running on Azure OpenAI (no data leaves the tenant)
  2. Indexing — Chroma on a dedicated VM, 4.2M chunks, metadata-filtered by document type, jurisdiction, and date
  3. Retrieval — Hybrid BM25 + vector search, re-ranked with a fine-tuned cross-encoder
  4. Generation — GPT-4 via Azure OpenAI, system prompt enforcing citation-only responses
  5. Eval — Golden-set of 400 Q&A pairs, weekly automated regression runs

The Air-Gap Architecture

Everything runs inside the client’s Azure subscription:

  • Azure OpenAI (models deployed in their tenant)
  • Azure Blob for document storage
  • Private endpoints — no public internet egress
  • Audit logs on every query for compliance review

The Hard Part: Table Extraction

Regulatory PDFs are full of data tables. Standard PDF text extraction flattens them into garbage. I wrote a custom extractor that detects table regions (via pdfminer layout boxes), reconstructs row/column structure, and serializes to markdown before chunking. This alone moved accuracy from 71% to 88% on table-heavy queries.

Results

  • 92% accuracy on the golden-set eval (vs. 41% with keyword search baseline)
  • Query time from 3–4 hours to under 30 seconds
  • Zero data leaks — all processing inside the Azure tenant
  • Analyst capacity freed up: ~50 hours/week redirected to actual analysis

Lessons Learned

Cross-encoders are underused. The retrieval quality improvement from adding a fine-tuned cross-encoder re-ranker was larger than any prompt engineering change. If you’re building RAG and haven’t benchmarked a re-ranker, do it first.

← Previous Document Classification Pipeline Next → Operations Automation Suite

Want something
like this?

30 minutes, free, no deck. We'll figure out if I'm the right fit for your project.