Tier-1 Support Deflection Agent

The Problem

A YC-backed B2B SaaS company had a 12-person support team handling a flood of repetitive Tier-1 tickets: password resets, billing questions, feature how-tos. With a 3× user growth target ahead, adding headcount was not an option.

65% of all tickets were answerable from the existing help-center docs. No judgment required. Just retrieval and a coherent reply.

What I Built

A production LangChain agent sitting in front of the existing Zendesk queue. The agent:

Classifies incoming tickets by intent (can-answer vs. needs-human)
Retrieves the most relevant help-center chunks from Pinecone (4M+ docs indexed)
Drafts a response grounded strictly in retrieved content, to guard against hallucination
Posts the reply via Zendesk API with a confidence score
Escalates anything below the confidence threshold to the human queue with a pre-filled context summary
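
The steps above can be sketched as one control-flow function. This is a minimal illustration, not the production code: `handle_ticket`, `Outcome`, and the injected `classify`, `retrieve`, and `draft` callables are hypothetical names, and the threshold is passed in by the caller rather than taken from the real system. Posting to Zendesk is left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Outcome:
    action: str            # "auto_reply" or "escalate"
    reply: Optional[str]   # drafted text, if one was produced
    confidence: float      # score attached to the draft
    summary: str           # context for the human queue (empty on auto-reply)


def handle_ticket(
    text: str,
    classify: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    draft: Callable[[str, List[str]], Tuple[str, float]],
    threshold: float,
) -> Outcome:
    # Step 1: intent classification (hypothetical labels from the list above).
    intent = classify(text)
    if intent != "can-answer":
        return Outcome("escalate", None, 0.0, f"intent={intent}; ticket={text[:200]}")

    # Step 2: retrieval. With no supporting docs there is nothing to ground on.
    chunks = retrieve(text)
    if not chunks:
        return Outcome("escalate", None, 0.0, f"no docs found; ticket={text[:200]}")

    # Step 3: draft a reply from the retrieved chunks only.
    reply, confidence = draft(text, chunks)

    # Steps 4 and 5: auto-reply above the threshold, otherwise escalate with the draft.
    if confidence < threshold:
        return Outcome("escalate", reply, confidence, f"low confidence; ticket={text[:200]}")
    return Outcome("auto_reply", reply, confidence, "")
```

Injecting the three callables keeps the control flow testable without a live LLM, vector store, or Zendesk account.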

Six languages supported via a detect-then-translate pipeline before and after retrieval.
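
A detect-then-translate wrapper might look like the sketch below. It assumes English is the pivot language the help-center docs are indexed in, which the write-up does not state, and the `detect`, `translate`, and `answer_in_pivot` callables are hypothetical stand-ins for whatever services are actually used.

```python
from typing import Callable


def answer_in_user_language(
    text: str,
    detect: Callable[[str], str],               # returns a language code, e.g. "de"
    translate: Callable[[str, str, str], str],  # (text, source_lang, target_lang)
    answer_in_pivot: Callable[[str], str],      # retrieval + drafting in the pivot language
    pivot: str = "en",                          # assumed pivot language
) -> str:
    lang = detect(text)
    if lang == pivot:
        # No translation round trip needed.
        return answer_in_pivot(text)

    # Translate before retrieval so the query matches the indexed docs.
    pivot_query = translate(text, lang, pivot)
    pivot_reply = answer_in_pivot(pivot_query)

    # Translate after drafting so the customer reads their own language.
    return translate(pivot_reply, pivot, lang)
```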

Architecture Decisions

Why Pinecone over Chroma?

Scale. 4M documents with sub-100ms retrieval at P99 required a managed vector store. Pinecone’s metadata filtering also let us scope searches to the customer’s specific plan tier.
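
Scoping a search to a plan tier could be expressed roughly as follows. This is a sketch under assumptions: the metadata field name `plan_tier` and the helper `search_for_plan` are hypothetical, and `index` is expected to be a Pinecone index object whose `query` method accepts `vector`, `top_k`, `filter`, and `include_metadata`.

```python
from typing import Any, List


def search_for_plan(index: Any, query_vector: List[float], plan_tier: str, top_k: int = 5) -> Any:
    # The filter restricts matches to chunks tagged with the customer's plan tier.
    # "plan_tier" is an assumed metadata key, not one confirmed by the source.
    return index.query(
        vector=query_vector,
        top_k=top_k,
        filter={"plan_tier": {"$eq": plan_tier}},
        include_metadata=True,
    )
```

Because the filter is applied inside the vector store, out-of-plan chunks are never candidates, so they cannot leak into a drafted reply.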

Hybrid search

Pure vector search missed exact-match queries (“what is my plan limit?”). I added BM25 keyword search (via Elasticsearch) alongside it and merged the two result lists with a reciprocal rank fusion step. Accuracy on the eval set jumped from 84% to 92%.
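
Reciprocal rank fusion itself is small enough to show in full. The sketch below scores each document as the sum of 1 / (k + rank) over the lists it appears in; `k=60` is a commonly used default and an assumption here, not necessarily the value used in this system.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["a", "b", "c"]   # hypothetical ids from the vector search
keyword_hits = ["c", "a", "d"]  # hypothetical ids from the BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only rank positions, so the cosine similarities and BM25 scores never need to be normalized onto a common scale.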

Confidence gating

The agent only auto-sends when the retrieval cosine similarity is above 0.87 AND the LLM’s self-reported confidence is “high”. Everything else lands as a draft in a human support agent’s queue for one-click approval. This kept the false-positive rate below 0.3%.
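
The gate reduces to a two-condition check. In this sketch the function name `should_auto_send` is hypothetical, and it assumes the similarity being tested is that of the top retrieved chunk, which the write-up does not specify.

```python
def should_auto_send(
    cosine_similarity: float,
    llm_confidence: str,
    min_similarity: float = 0.87,
) -> bool:
    # Both conditions must hold; the comparison is strict, matching "> 0.87".
    return cosine_similarity > min_similarity and llm_confidence.strip().lower() == "high"
```

Requiring both signals means a strong retrieval match with a hesitant model, or a confident model with weak retrieval, still goes to a human.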

Results

70% of Tier-1 volume deflected in week 3 (ramp-up period for eval tuning)
Median response time dropped from 4 hours to under 2 seconds
$240K/year in avoided headcount at their planned growth trajectory
CSAT held flat at 4.6/5, suggesting customers rated agent and human replies the same

What I’d Do Differently

The eval harness was built after the agent — it should have come first. We spent a week manually reviewing edge cases that a proper golden-set eval would have caught in hours.
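
A golden-set eval does not need to be elaborate to be useful. The sketch below is a hypothetical minimal harness: each case pairs a ticket with the action a reviewer expects, and `decide` stands in for whatever function wraps the agent's routing decision. The case format and names are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple


def run_golden_set(
    cases: List[Dict[str, str]],
    decide: Callable[[str], str],
) -> Tuple[float, List[Dict[str, str]]]:
    if not cases:
        raise ValueError("golden set is empty")

    failures = []
    for case in cases:
        actual = decide(case["ticket"])
        if actual != case["expected"]:
            # Keep the full case plus the actual result for manual review.
            failures.append({**case, "actual": actual})

    accuracy = 1 - len(failures) / len(cases)
    return accuracy, failures
```

Running a harness like this on every prompt or threshold change surfaces regressions as a short list of failing cases instead of a manual review pass.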