ML Pipeline Orchestrator

A self-healing ML training and deployment pipeline that automatically retrains models on data drift, runs A/B experiments, and promotes winners to production — all without human intervention.

Python, Kubeflow, MLflow, Terraform, AWS

Overview

Managing ML models in production is notoriously painful. Models degrade silently, retraining is manual, and deployment is risky. This pipeline eliminates all three pain points through automation.

How It Works

Data drift monitoring: statistical tests (KS test, PSI) run hourly on incoming data
Automatic retraining trigger: when drift exceeds a configured threshold, Kubeflow kicks off a training run
Experiment tracking: MLflow records every hyperparameter, metric, and artifact
Shadow deployment: the new model runs alongside production, capturing predictions without serving them
Automated promotion: if the shadow model wins the A/B test over 7 days, it is promoted automatically
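
The drift check in the first two steps can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the function names, the bin count, and both thresholds (0.2 for PSI, 0.1 for the KS statistic) are assumptions chosen for the example. A real deployment would more likely use a library routine such as scipy.stats.ks_2samp and compare a p-value rather than the raw statistic.

```python
import math
from bisect import bisect_right


def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the reference sample's range (an assumption;
    quantile bins are another common choice).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference sample

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def ks_statistic(reference, current):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    gap = 0.0
    for x in ref_sorted + cur_sorted:
        ref_cdf = bisect_right(ref_sorted, x) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, x) / len(cur_sorted)
        gap = max(gap, abs(ref_cdf - cur_cdf))
    return gap


def drift_detected(reference, current, psi_threshold=0.2, ks_threshold=0.1):
    """True when either statistic exceeds its threshold (both are assumed values)."""
    return (
        psi(reference, current) > psi_threshold
        or ks_statistic(reference, current) > ks_threshold
    )
```

Calling drift_detected(training_sample, last_hour_sample) once per feature would approximate the hourly check described above. What happens on a True result (here, starting a Kubeflow training run) is outside the scope of this sketch.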

Infrastructure

Everything is defined in Terraform and runs on AWS:

EKS for Kubeflow and training workloads
S3 for artifact storage
RDS Aurora for experiment metadata
SQS for async job queuing

Impact

Reduced model degradation incidents from 3/month to 0 over a 6-month period. The team reclaimed ~20 engineer-hours per week previously spent on manual retraining cycles.
