Model Registry Patterns That Hold in Production
A model registry is supposed to be the source of truth for what's deployed. Most implementations drift from that ideal within six months. Here's what breaks and how to prevent it.
The model registry problem sounds solved. MLflow has been around since 2018. W&B, Neptune, Comet, and the cloud-native registries (SageMaker, Vertex) all have mature implementations. Pick one, integrate it, move on.
Experience says otherwise. After working with six different registries across a range of organizations, the pattern is consistent: the registry starts clean, accumulates debt within a quarter, and within a year it’s a graveyard of unnamed versions that nobody trusts to answer the question “what’s actually running in production?”
This post covers the patterns that prevent registry rot — and the specific failure modes to design against.
Why registries drift
Three causes, in order of prevalence:
Incentive mismatch. The person who trains the model benefits from pushing it to production quickly. Registering it properly — with lineage, evaluation artifacts, dataset hashes, deployment conditions — takes 20 minutes and creates no immediate value for that person. The discipline requires organizational incentives, not just tooling.
Training pipeline integration is incomplete. Models get registered manually, after the fact, by an engineer who wasn’t the original author. Metadata is reconstructed rather than captured. Reconstructed metadata is wrong.
Environments multiply. One team calls their staging environment “staging.” Another calls it “pre-prod.” A third calls it “shadow.” The registry gets populated with transitions between environments that don’t map to a consistent promotion flow.
The patterns that work
1. Registry writes are part of the training job, not an afterthought
The only metadata that stays accurate is the metadata that the training job writes at run time. This means:
- The training code calls registry.log_model() with all the relevant metadata
- Dataset hash, training config, evaluation metrics, and the git commit SHA of the training code are captured automatically
- The CI/CD system prevents model promotion without a valid registry entry
Manual registration steps produce manual-quality metadata. Build the registration into the artifact.
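As a concrete sketch, here is what run-time registration can look like with MLflow; the train() helper and the registered model name are placeholders for your own code, and the tag names are conventions rather than MLflow built-ins:

```python
import hashlib
import subprocess

import mlflow


def dataset_hash(path: str) -> str:
    """Content hash of the training dataset, captured at run time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def git_sha() -> str:
    """Commit SHA of the training code."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def train_and_register(train_path: str, config: dict):
    with mlflow.start_run():
        # Lineage is written by the job itself, never reconstructed after the fact
        mlflow.set_tag("dataset_sha256", dataset_hash(train_path))
        mlflow.set_tag("git_commit", git_sha())
        mlflow.log_params(config)

        model, metrics = train(train_path, config)  # your training code (placeholder)
        mlflow.log_metrics(metrics)

        # Registration is part of the training job, not a later manual step
        mlflow.sklearn.log_model(
            model, "model", registered_model_name="churn-classifier"  # illustrative name
        )
```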
2. Treat model versions like software releases
The software engineering parallel is precise. A model version should have:
- A unique identifier (hash, not just a sequence number)
- Reproducibility: given the same training code + dataset hash, can you reproduce this artifact?
- A changelog: what changed from the previous version and why
- Test results: the eval suite run against this artifact, with links to the run artifacts
- A “blessed” status: staging, production, deprecated — with explicit transitions and approvers
MLflow’s stage system (None → Staging → Production → Archived) maps to this. Use it. The problem is that teams use it inconsistently. Define the promotion criteria in documentation that lives next to the registry integration, not in tribal knowledge.
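A minimal promotion helper along these lines, assuming MLflow's client API; the changelog convention and the approved_by tag are ours, not MLflow features:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()


def promote_to_staging(name: str, version: str, changelog: str, approver: str):
    # Record what changed and who approved it alongside the version itself
    client.update_model_version(name=name, version=version, description=changelog)
    client.set_model_version_tag(name, version, "approved_by", approver)
    client.transition_model_version_stage(name=name, version=version, stage="Staging")
```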
3. Production registrations require eval gates
The largest source of registry confusion is models that were promoted to production without clearing formal eval gates. The registry entry exists; the eval entry is blank or says “looks good.”
Implement this at the pipeline level: the promotion step reads the registry, checks whether an eval artifact exists for this version, and fails the pipeline if it doesn’t. You cannot mark a model as production without an eval artifact. This is a one-day implementation with years of dividends.
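One way to wire the gate, assuming the eval suite writes its results under an eval/ artifact path on the model's run; that path is a team convention, not an MLflow default:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()


def promote_to_production(name: str, version: str):
    mv = client.get_model_version(name, version)
    eval_artifacts = client.list_artifacts(mv.run_id, path="eval")
    if not eval_artifacts:
        # No eval artifact, no production promotion: fail the pipeline loudly
        raise RuntimeError(f"{name} v{version} has no eval artifacts; refusing to promote.")
    client.transition_model_version_stage(name, version, stage="Production")
```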
4. Shadow deployments register separately
Shadow mode (running a new model on production traffic but not serving its outputs) is a useful pre-production step, but it contaminates the registry if shadow models get the same lifecycle transitions as real production models.
The fix is separation: shadow versions get a distinct status, shadow-traffic metrics get stored as a separate artifact type, and shadow → production promotion requires a deliberate, documented step with its own approval flow.
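MLflow has no built-in shadow stage, so one option is to model shadow status as a version tag; the tag names below are illustrative:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()


def mark_shadow(name: str, version: str):
    # Shadow models never enter the Staging/Production stage flow
    client.set_model_version_tag(name, version, "deployment_mode", "shadow")


def promote_shadow_to_production(name: str, version: str, approver: str):
    # Deliberate, documented step with its own approval flow
    client.set_model_version_tag(name, version, "shadow_promotion_approved_by", approver)
    client.delete_model_version_tag(name, version, "deployment_mode")
    client.transition_model_version_stage(name, version, stage="Production")
```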
5. Deprecation is as important as registration
A registry that grows but never shrinks loses its value as source of truth. Define a deprecation policy:
- Models more than N versions old and not running in any environment get auto-archived
- Archived models are not deleted; they’re frozen and tagged
- Production models that haven’t served traffic in 30 days generate an alert for the owning team
Without this, the registry becomes an archaeology site rather than an operational tool.
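A sketch of the auto-archive job, again assuming MLflow; the retention window and the deployed_versions lookup are placeholders you would wire to your own deployment inventory:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
KEEP_LATEST_N = 5  # illustrative value for N


def archive_stale_versions(name: str, deployed_versions: set):
    versions = list(client.search_model_versions(f"name='{name}'"))
    versions.sort(key=lambda mv: int(mv.version), reverse=True)
    for mv in versions[KEEP_LATEST_N:]:
        if mv.version in deployed_versions or mv.current_stage == "Archived":
            continue
        # Frozen and tagged, never deleted
        client.set_model_version_tag(name, mv.version, "archived_reason", "stale")
        client.transition_model_version_stage(name, mv.version, stage="Archived")
```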
Registry tool choices
MLflow is the default for good reasons: it’s free, it integrates with most training frameworks, and the model registry is functional if you define clear conventions. The UI is acceptable. The API is stable. The weakness is that there’s no opinionated promotion workflow — you define your own, which means every team defines it differently.
W&B Model Registry has the best UI in the category and tight integration with W&B Runs. If your team already uses W&B for experiment tracking, extending to model registry is low-friction. The cost scales with seats and usage — model the economics before committing.
Vertex Model Registry and SageMaker Model Registry are correct choices if you’re committed to those cloud ecosystems and want to avoid managing another self-hosted service. Both have adequate feature sets; neither is best-in-class on its own merits.
The LLM registry problem
Everything above applies double for LLMs, with an added complication: LLM “versions” aren’t just model weights. They’re weights plus prompt templates, plus few-shot examples, plus the RAG retrieval configuration. A model version in the traditional sense captures less than half of what determines production behavior.
For LLM applications, the registry needs to capture the full inference configuration as an artifact: model identifier (including date-pinned version), system prompt, prompt template, retrieval config, and any post-processing logic. We cover this in more depth in LLMOps patterns at llmops.report.
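As an example of what that artifact can look like when logged alongside the model version (the field names and values are illustrative):

```python
import mlflow

inference_config = {
    "model": "gpt-4o-2024-08-06",  # date-pinned model identifier
    "system_prompt": "You are a support assistant...",
    "prompt_template": "Context:\n{context}\n\nQuestion: {question}",
    "retrieval": {"index": "support-docs-v3", "top_k": 8, "reranker": "cross-encoder"},
    "post_processing": "strip_pii_v2",
}

with mlflow.start_run():
    # The whole inference bundle becomes one versioned, diffable artifact
    mlflow.log_dict(inference_config, "inference_config.json")
```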
A checklist for registry implementation
Before you call your registry integration “done”:
- Training job writes to registry automatically, without manual steps
- Dataset hash is captured for every training run
- Eval artifacts are required for production promotion (enforced by pipeline)
- Staging/Production/Deprecated transitions have defined criteria and approvers
- Shadow deployment has a separate status
- Deprecation policy is documented and automated
- Registry serves as the authoritative answer to “what is running in production right now”
If you can’t answer the last point with a registry lookup, the registry isn’t doing its job.