Document Classification Pipeline

The Problem

A legal tech company was manually triaging incoming contracts: categorizing NDAs, MSAs, SOWs, employment agreements, and so on before routing them to the right review team. Two paralegals spent 3 hours per day doing nothing but reading the first page of each document and deciding which bucket it went in.

14 categories, 500+ documents per day, nearly zero ambiguity. A classic classification task.

What I Built

A fine-tuned text classification pipeline built on legal-bert-base-uncased, a BERT variant pre-trained on legal text and distributed through Hugging Face.

Data

8,400 labeled contracts (600 per class) from the client’s historical archive
80/10/10 train/val/test split, stratified by class
Data augmentation on underrepresented classes via back-translation
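
The stratified 80/10/10 split can be sketched in plain Python as below. This is a minimal illustration, not the code used in the project: the function name, the seed, and the placeholder class labels are all assumptions, and the counts simply mirror the 14 classes of 600 contracts described above.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, splitting each class by `fractions`."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Placeholder labels (hypothetical names): 14 classes x 600 contracts.
labels = [f"class_{c}" for c in range(14) for _ in range(600)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 6720 840 840
```

Splitting within each class keeps every category at the same 480/60/60 proportion, so no class is missing from validation or test.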

Model

Base: legal-bert-base-uncased from Hugging Face
Fine-tuned on the labeled dataset: 3 epochs, AdamW, cosine LR schedule
Classification head: 2-layer MLP over the [CLS] token embedding
Final accuracy: 99.1% on held-out test set
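
For readers unfamiliar with a cosine LR schedule, here is a rough sketch of the decay curve. The base learning rate (2e-5) and batch size (32) are assumptions chosen for illustration and are not stated in this write-up, and the step count assumes the 6,720-example training split with no warmup.

```python
import math

def cosine_lr(step, total_steps, base_lr=2e-5, min_lr=0.0):
    """Cosine-annealed learning rate at `step` out of `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Assumed run: 6,720 training examples, batch size 32, 3 epochs.
steps_per_epoch = math.ceil(6720 / 32)  # 210
total_steps = 3 * steps_per_epoch       # 630

# Learning rate at the start, midpoint, and end of training.
for step in (0, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps))
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the minimum on the final step.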

Pipeline

FastAPI service wrapping the model:

Document received (PDF or DOCX)
Text extracted (pdfminer + python-docx)
First 512 tokens fed to the model, which is BERT's maximum input length (the first page is almost always sufficient)
Classification + confidence score returned
High-confidence results (>0.95) auto-routed; low-confidence flagged for human review
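
The last two steps can be sketched as follows. This assumes the confidence score is the highest softmax probability over the model's output logits, which this write-up does not spell out. The function names, the return shape, and the three-class example are hypothetical.

```python
import math

AUTO_ROUTE_THRESHOLD = 0.95  # threshold from the pipeline description

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(logits, class_names, threshold=AUTO_ROUTE_THRESHOLD):
    """Return the predicted class, its confidence, and the routing decision."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    action = "auto_route" if confidence > threshold else "human_review"
    return {"label": class_names[best], "confidence": confidence, "action": action}

# Hypothetical three-class example; the real model has 14 outputs.
names = ["NDA", "MSA", "SOW"]
print(route([6.0, 0.5, 0.2], names))  # one dominant logit
print(route([1.2, 1.0, 0.3], names))  # close logits
```

The first call clears the threshold and is auto-routed; the second has no dominant class and is flagged for human review.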

Infrastructure

Containerized with Docker, deployed on AWS ECS
Model weights stored on S3, loaded at startup
Postgres for job queue and audit trail
Average inference time: 0.3s per document
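
As a back-of-the-envelope check on capacity, the arithmetic below converts the average inference time into total model time per day. It assumes documents are processed one at a time on a single worker and ignores text extraction and I/O, so it is a rough lower bound on daily runtime rather than a measured figure. The 2,000-document volume is a hypothetical higher load.

```python
AVG_INFERENCE_S = 0.3  # average inference time per document, from the list above

def daily_compute_minutes(docs_per_day, seconds_per_doc=AVG_INFERENCE_S):
    """Total model time per day if documents are processed sequentially."""
    return docs_per_day * seconds_per_doc / 60

# 500/day is the current volume; 2,000/day is a hypothetical higher load.
for volume in (500, 2000):
    print(volume, round(daily_compute_minutes(volume), 1))
```

At 500 documents per day this comes to about 2.5 minutes of inference, and about 10 minutes at four times that volume.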

Why BERT over GPT?

For classification, a discriminative model (BERT) fine-tuned on the specific task typically outperforms a generative model (GPT) prompted for the same task, especially when labeled training data is available. GPT prompting reached 94% accuracy on this task; fine-tuned BERT reached 99.1%. The extra 5 percentage points are worth it when errors mean misrouted legal documents.
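
To put the accuracy gap in concrete terms, the sketch below converts each accuracy figure into expected misclassifications per day at 500 documents. It assumes every document is routed on the model's prediction alone, so it is an upper bound: in the actual pipeline, low-confidence results go to human review first. The function name is hypothetical.

```python
def expected_errors_per_day(accuracy, docs_per_day=500):
    """Expected number of misclassified documents per day at a given accuracy."""
    return round(docs_per_day * (1 - accuracy), 1)

for name, acc in (("prompted GPT", 0.94), ("fine-tuned BERT", 0.991)):
    print(name, expected_errors_per_day(acc))
# prompted GPT 30.0
# fine-tuned BERT 4.5
```

That is roughly 30 misclassified documents per day versus 4 or 5.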

Results

99.1% accuracy on held-out test set, 98.7% in production (6-month monitoring)
Throughput: 500 documents/day with headroom to 2,000+
Paralegal time spent on routing: 3 hrs/day → 30 min/week, all of it reviewing low-confidence items
ROI: payback in 6 weeks