Production ML in practice: what matters after the model is trained

Published: 2026-01-19 · Tags: Production ML, Reliability, Systems

Training a model is usually the easy part. The difficult part is keeping ML useful and safe in production: consistent data, reliable serving, observability, and iteration loops that do not degrade silently.

The production gap

In many teams, “ML is done” is treated as a milestone: model trained, metrics look good, endpoint deployed. In practice, this is where the work begins. Production introduces constraints that training does not: latency budgets, failure modes, changing upstream data, partial availability, and evolving user behaviour.

The highest-leverage way to think about production ML is as a system with contracts and feedback loops. Below is a practical checklist I use to reason about that system.

1) Data contracts are the foundation

Most production ML incidents are data incidents. The model rarely “breaks”; the input distribution drifts, upstream fields change semantics, or pipelines degrade in subtle ways.

  • Schema discipline: version inputs and outputs; treat schema changes as controlled events.
  • Feature semantics: document what each feature means, units, and null-handling behaviour.
  • Training/serving parity: ensure identical transformations, with explicit versioning.
  • Backfills and replay: design pipelines to replay safely (idempotency matters).

A simple framing: if you cannot describe your model inputs as a contract, you cannot operate the system reliably.
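
To make that concrete, here is a minimal sketch of an input contract in Python. The field names, units, and version string are illustrative rather than taken from any real system; the point is that inputs are typed, versioned, and validated explicitly instead of being assumed.

    from dataclasses import dataclass
    from typing import Optional

    CONTRACT_VERSION = "v3"  # bump on any change to fields or their semantics

    @dataclass(frozen=True)
    class ScoringInput:
        account_age_days: int             # whole days since signup, >= 0
        avg_txn_amount: Optional[float]   # account currency; None means no history
        country_code: str                 # ISO 3166-1 alpha-2

        def validate(self) -> None:
            # Fail loudly on contract violations rather than silently coercing.
            if self.account_age_days < 0:
                raise ValueError("account_age_days must be non-negative")
            if self.avg_txn_amount is not None and self.avg_txn_amount < 0:
                raise ValueError("avg_txn_amount must be non-negative or None")
            if len(self.country_code) != 2:
                raise ValueError("country_code must be ISO 3166-1 alpha-2")

Sharing one definition like this between the training pipeline and the serving path is also a concrete way to enforce training/serving parity.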

2) Deployment strategy: treat models like software artefacts

A model is a deployable artefact with dependencies and compatibility constraints. Operationally, this means you want controlled rollout, fast rollback, and traceability.

  • Immutable model versions: hash or version every model; never “replace in place”.
  • Canary / gradual rollout: route a small percentage of traffic first.
  • Rollback path: rollback should be easier than roll-forward.
  • Reproducibility: capture training data window, code version, params, and evaluation summary.
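
As an illustration, a canary split can be as simple as deterministic hashing on a stable identifier. The model version tags and the 5% fraction below are placeholders; rolling back then amounts to setting the canary fraction to zero so traffic returns to the stable version.

    import hashlib

    STABLE_MODEL = "scorer:2026-01-10"   # illustrative immutable version tags
    CANARY_MODEL = "scorer:2026-01-18"
    CANARY_FRACTION = 0.05               # route 5% of traffic to the canary

    def pick_model_version(request_id: str) -> str:
        # Deterministic routing: the same request_id always hits the same version,
        # which keeps comparisons and debugging reproducible.
        bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 10_000
        return CANARY_MODEL if bucket < CANARY_FRACTION * 10_000 else STABLE_MODEL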

3) Observability: you cannot manage what you cannot see

Observability for ML has two halves: system health and model health. You need both.

System health

  • Latency percentiles, timeouts, error rates
  • Dependency availability (feature stores, upstream services, data pipelines)
  • Queue lag / backlog for asynchronous scoring

Model health

  • Input drift: basic stats on key features; compare to training reference windows.
  • Output drift: score distribution changes can be early-warning signals.
  • Data quality: null rates, range checks, unexpected growth in categorical cardinality.
  • Outcome feedback: where possible, measure real-world outcomes and calibration over time.

If you only measure offline AUC but not production drift and outcomes, you are operating blind.
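
One lightweight way to quantify input or output drift is the Population Stability Index (PSI), computed per feature or on the score distribution against a training-time reference window. The sketch below assumes numeric values and uses NumPy; the common rule of thumb of flagging PSI above roughly 0.2 is a convention, not a standard, and thresholds should be tuned per feature.

    import numpy as np

    def population_stability_index(reference: np.ndarray, current: np.ndarray,
                                   bins: int = 10) -> float:
        # Rough drift signal for one numeric feature (or for model scores).
        # Bin edges come from the training-time reference window.
        edges = np.linspace(np.min(reference), np.max(reference), bins + 1)
        ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
        cur_frac = np.histogram(np.clip(current, edges[0], edges[-1]),
                                bins=edges)[0] / len(current)
        # Clip to avoid log(0) when a bin is empty in either window.
        ref_frac = np.clip(ref_frac, 1e-6, None)
        cur_frac = np.clip(cur_frac, 1e-6, None)
        return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))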

4) Failure handling: decide what “safe degradation” means

Production systems fail. The question is how your ML component behaves when it does. You need explicit policies, not ad-hoc behaviour.

  • Fallback behaviour: what happens if features are missing or the model is unavailable?
  • Timeout budgets: do not allow ML scoring to collapse user-facing latency.
  • Defaulting and imputation: define consistent rules and log when they trigger.
  • Circuit breakers: protect dependencies and isolate failures.
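
A minimal sketch of the fallback and timeout pieces is below, assuming a model object with a predict method; the budget and the default score are placeholders. What matters is that the degraded path is explicit, logged, and exercised in tests.

    import concurrent.futures
    import logging

    logger = logging.getLogger("scoring")
    TIMEOUT_SECONDS = 0.15   # illustrative latency budget for the ML call
    FALLBACK_SCORE = 0.0     # illustrative conservative default

    _executor = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def score_with_fallback(model, features: dict) -> float:
        # Call the model inside an explicit time budget; on timeout or failure,
        # return a logged, conservative default rather than failing the request.
        future = _executor.submit(model.predict, features)
        try:
            return future.result(timeout=TIMEOUT_SECONDS)
        except concurrent.futures.TimeoutError:
            logger.warning("scoring timed out; using fallback")
            return FALLBACK_SCORE
        except Exception:
            logger.exception("scoring failed; using fallback")
            return FALLBACK_SCORE

A full circuit breaker would additionally track the failure rate and stop calling the dependency for a cool-off period; that is omitted here for brevity.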

5) Evaluation is not a one-off event

Offline evaluation is necessary, but production requires ongoing evaluation. In practice:

  • Gold datasets: maintain a small curated set that detects regressions quickly.
  • Shadow mode: run new models without affecting decisions; compare outputs.
  • A/B testing: where feasible, evaluate impact in the real system.
  • Decision review: if your ML affects users, create review and escalation mechanisms.
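
Shadow mode can be sketched as below, again assuming both models expose a predict method; in a real system the shadow call would usually run asynchronously so it cannot add user-facing latency.

    import logging

    logger = logging.getLogger("shadow")

    def score_request(features: dict, live_model, shadow_model) -> float:
        # Decisions come from the live model only; the candidate is scored in
        # shadow and both outputs are logged for offline comparison.
        live_score = live_model.predict(features)
        try:
            shadow_score = shadow_model.predict(features)
            logger.info("shadow_compare live=%.4f shadow=%.4f",
                        live_score, shadow_score)
        except Exception:
            # The shadow path must never break serving.
            logger.exception("shadow scoring failed")
        return live_score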

6) Build an iteration loop you can sustain

The goal is not to deploy a model; it is to run a sustainable loop: measure, learn, update, and redeploy safely.

  • Trigger conditions: define what causes retraining (time-based, drift-based, outcome-based).
  • Automation with guardrails: automate what is safe; require review for what is risky.
  • Auditability: keep a record of model versions and why they changed.
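
The trigger conditions can be made explicit and auditable with something as small as the sketch below. The thresholds and metric names are hypothetical; the useful property is that every retrain decision comes with a recorded reason.

    from dataclasses import dataclass

    # Illustrative thresholds; sensible values depend on the domain and the
    # cost of retraining.
    MAX_MODEL_AGE_DAYS = 30
    MAX_FEATURE_PSI = 0.2
    MIN_OUTCOME_AUC = 0.70

    @dataclass
    class HealthSnapshot:
        model_age_days: int
        worst_feature_psi: float
        recent_outcome_auc: float

    def should_retrain(snapshot: HealthSnapshot) -> tuple[bool, str]:
        # Combine time-, drift-, and outcome-based triggers into one recorded decision.
        if snapshot.model_age_days > MAX_MODEL_AGE_DAYS:
            return True, "model older than retraining window"
        if snapshot.worst_feature_psi > MAX_FEATURE_PSI:
            return True, "input drift above threshold"
        if snapshot.recent_outcome_auc < MIN_OUTCOME_AUC:
            return True, "outcome metric below floor"
        return False, "healthy"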

Practical minimum

If you want a minimum viable “production ML” setup, I would start with:

  1. Versioned inputs/outputs (data contract) and reproducible training
  2. Controlled rollout (canary) with rollback
  3. System metrics plus basic drift and data quality monitoring
  4. Explicit fallback behaviour

You can add sophistication later. Without these, you will repeatedly pay the incident tax.


If you would like to discuss production ML systems, you can reach me via Contact. For other work examples, see Work.