AI/ML
Live
ML Pipeline Orchestrator (May 2024)
A self-healing ML training and deployment pipeline that automatically retrains models on data drift, runs A/B experiments, and promotes winners to production — all without human intervention.
Python, Kubeflow, MLflow, Terraform, AWS
Overview
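Managing ML models in production is notoriously painful. Models degrade silently, retraining is manual, and deployment is risky. This pipeline automates all three: drift monitoring surfaces degradation, retraining starts without manual steps, and every new model has to win a shadow run and A/B test before it is promoted.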
How It Works
- Data drift monitoring — statistical tests (KS-test, PSI) run hourly on incoming data
- Automatic retraining trigger — when drift exceeds threshold, Kubeflow kicks off a training run
- Experiment tracking — MLflow records every hyperparameter, metric, and artifact
- Shadow deployment — new model runs alongside production, capturing predictions without serving them
- Automated promotion — if shadow model wins A/B test over 7 days, it gets promoted automatically
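The drift check and retraining trigger in the first two steps can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names, the equal-width binning, the PSI threshold of 0.2 (a common rule of thumb), and the 5% significance level for the KS test are all assumptions, since the write-up does not state which values the pipeline uses.

```python
import math
from bisect import bisect_right


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference range; current values that
    fall outside that range are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant data

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    ref_frac, cur_frac = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    return max(
        abs(bisect_right(ref, x) / len(ref) - bisect_right(cur, x) / len(cur))
        for x in ref + cur
    )


def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the two-sample KS critical value."""
    c = math.sqrt(-0.5 * math.log(alpha / 2))
    return c * math.sqrt((n + m) / (n * m))


def drift_detected(reference, current, psi_threshold=0.2, alpha=0.05):
    """Return True if either test flags drift (thresholds are assumed)."""
    ks = ks_statistic(reference, current)
    return (
        psi(reference, current) > psi_threshold
        or ks > ks_critical_value(len(reference), len(current), alpha)
    )


# Hypothetical feature values: a training-time sample and a shifted batch.
reference = [i / 100 for i in range(1000)]
shifted = [x + 3.0 for x in reference]

print(drift_detected(reference, reference))  # False
print(drift_detected(reference, shifted))    # True
```

In a setup like the one described, a `True` result from the hourly check would be the signal that starts a Kubeflow training run. A production version would more likely rely on a tested statistics library than on hand-rolled tests.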
Infrastructure
Everything is defined in Terraform and runs on AWS:
- EKS for Kubeflow and training workloads
- S3 for artifact storage
- RDS Aurora for experiment metadata
- SQS for async job queuing
Impact
Model degradation incidents dropped from 3 per month to 0 over a 6-month period. The team reclaimed ~20 engineer-hours per week previously spent on manual retraining cycles.