Legal Tech 2023 Live

Document Classification Pipeline

An ML pipeline that classifies 14 contract types with 99.1% test accuracy for a legal tech company, replacing a manual triage process.

99.1% classification accuracy
14 contract types
0.3s per document
Stack: scikit-learn, Hugging Face, FastAPI, Postgres, AWS S3

The Problem

A legal tech company was manually triaging incoming contracts: categorizing NDAs, MSAs, SOWs, employment agreements, and so on before routing them to the right review team. Two paralegals spent 3 hours per day doing nothing but reading the first page of each document and deciding which bucket it went in.

14 categories, 500+ documents per day, nearly zero ambiguity. A classic classification task.

What I Built

A fine-tuned text classification pipeline built on legal-bert-base-uncased (a BERT variant pre-trained on legal text, available on Hugging Face):

Data

  • 8,400 labeled contracts (600 per class) from the client’s historical archive
  • 80/10/10 train/val/test split, stratified by class
  • Data augmentation on underrepresented classes via back-translation
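
The stratified 80/10/10 split above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function name `stratified_split`, the placeholder labels, and the seed are assumptions. With scikit-learn in the stack, `train_test_split` with its `stratify` argument would be a natural alternative.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists, split class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Hypothetical stand-in for the archive: 14 classes x 600 documents.
labels = [f"class_{c}" for c in range(14) for _ in range(600)]
train, val, test = stratified_split(labels)
print(len(train), len(val), len(test))  # 6720 840 840
```

Splitting inside each class keeps every contract type at the same 480/60/60 proportion in all three sets, so per-class test accuracy is measured on the same number of documents for each type.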

Model

  • Base: legal-bert-base-uncased from Hugging Face
  • Fine-tuned on the labeled dataset: 3 epochs, AdamW, cosine LR schedule
  • Classification head: 2-layer MLP over the [CLS] token embedding
  • Final accuracy: 99.1% on held-out test set
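
For readers unfamiliar with a cosine LR schedule, here is a small sketch of the decay curve over 3 epochs. The peak learning rate (2e-5), the batch size (16), and the absence of a warmup phase are assumptions for illustration; the write-up does not state them. The 6,720 training documents follow from the 80% split of 8,400.

```python
import math


def cosine_lr(step, total_steps, peak_lr=2e-5, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Assumed training setup: 6,720 documents, batch size 16, 3 epochs.
steps_per_epoch = 6720 // 16        # 420 optimizer steps per epoch
total_steps = 3 * steps_per_epoch   # 1260 steps overall

for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps):.2e}")
```

The rate starts at the peak, passes through half the peak at the midpoint, and reaches the minimum on the final step, which gives large updates early and small, stable updates at the end of fine-tuning.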

Pipeline

FastAPI service wrapping the model:

  1. Document received (PDF or DOCX)
  2. Text extracted (pdfminer + python-docx)
  3. First 512 tokens fed to the model (first page is almost always sufficient)
  4. Classification + confidence score returned
  5. High-confidence results (>0.95) auto-routed; low-confidence flagged for human review
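
Steps 4 and 5 can be sketched as follows. The sketch assumes the confidence score is the largest softmax probability over the model's output logits, which the write-up does not confirm; the function names, action strings, and the three-class example are hypothetical. Only the 0.95 threshold comes from the pipeline description.

```python
import math

AUTO_ROUTE_THRESHOLD = 0.95  # threshold stated in the pipeline description


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, class_names):
    """Pick the top class and decide between auto-routing and human review."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best]
    action = "auto_route" if confidence > AUTO_ROUTE_THRESHOLD else "human_review"
    return {"label": class_names[best], "confidence": confidence, "action": action}


# Hypothetical three-class example; the real model has 14 outputs.
names = ["NDA", "MSA", "SOW"]
print(route([6.0, 1.0, 0.5], names))  # one dominant logit: auto-routed
print(route([2.0, 1.5, 0.1], names))  # close logits: flagged for review
```

Keeping the threshold as a single named constant makes it easy to trade automation rate against review load as production data accumulates.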

Infrastructure

  • Containerized with Docker, deployed on AWS ECS
  • Model weights stored on S3, loaded at startup
  • Postgres for job queue and audit trail
  • Average inference time: 0.3s per document

Why BERT over GPT?

For classification, a discriminative model (BERT) fine-tuned on the specific task typically outperforms a generative model (GPT) prompted for the same task, especially when you have labeled training data. On this task, GPT prompting reached 94% accuracy; fine-tuned BERT reached 99.1%. The extra 5 percentage points are worth it when errors mean misrouted legal documents.
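
To put the accuracy gap in operational terms, the quick calculation below converts both figures into expected misrouted documents per day. It assumes a volume of 500 documents per day and that the measured accuracies carry over unchanged to daily traffic; the function name is illustrative.

```python
DOCS_PER_DAY = 500  # daily volume quoted in the problem statement


def expected_errors_per_day(accuracy, docs_per_day=DOCS_PER_DAY):
    """Expected number of misclassified documents per day."""
    return (1.0 - accuracy) * docs_per_day


gpt_errors = expected_errors_per_day(0.94)    # prompted GPT
bert_errors = expected_errors_per_day(0.991)  # fine-tuned BERT

print(f"GPT prompting:   {gpt_errors:.1f} misrouted/day")   # 30.0
print(f"Fine-tuned BERT: {bert_errors:.1f} misrouted/day")  # 4.5
```

Under those assumptions, the fine-tuned model produces roughly one seventh as many misrouted documents as the prompted baseline.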

Results

  • 99.1% accuracy on held-out test set, 98.7% in production (6-month monitoring)
  • Throughput: 500 documents/day with headroom to 2,000+
  • Paralegal time on routing cut from 3 hrs/day to 30 min/week, now spent only on reviewing low-confidence items
  • ROI: payback in 6 weeks
