Model Registry Patterns That Hold in Production
A model registry is supposed to be the source of truth for what's deployed. Most implementations drift from that ideal within six months. Here's what breaks and how to prevent it.
The model registry problem sounds solved. MLflow has been around since 2018. W&B, Neptune, Comet, and the cloud-native registries (SageMaker, Vertex) all have mature implementations. Pick one, integrate it, move on.
Experience says otherwise. After working with six different registries across a range of organizations, the pattern is consistent: the registry starts clean, accumulates debt within a quarter, and within a year it’s a graveyard of unnamed versions that nobody trusts to answer the question “what’s actually running in production?”
This post covers the patterns that prevent registry rot — and the specific failure modes to design against.
Why registries drift
Three causes, in order of prevalence:
Incentive mismatch. The person who trains the model benefits from pushing it to production quickly. Registering it properly — with lineage, evaluation artifacts, dataset hashes, deployment conditions — takes 20 minutes and creates no immediate value for that person. The discipline requires organizational incentives, not just tooling.
Training pipeline integration is incomplete. Models get registered manually, after the fact, by an engineer who wasn’t the original author. Metadata is reconstructed rather than captured. Reconstructed metadata is wrong.
Environments multiply. One team calls their staging environment “staging.” Another calls it “pre-prod.” A third calls it “shadow.” The registry gets populated with transitions between environments that don’t map to a consistent promotion flow.
The patterns that work
1. Registry writes are part of the training job, not an afterthought
The only metadata that stays accurate is the metadata that the training job writes at run time. This means:
- The training code calls registry.log_model() with all the relevant metadata
- Dataset hash, training config, evaluation metrics, and the git commit SHA of the training code are captured automatically
- The CI/CD system prevents model promotion without a valid registry entry
Manual registration steps produce manual-quality metadata. Build the registration into the artifact.
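As a concrete sketch, here is what run-time registration can look like with MLflow; the train() helper and the registered model name are placeholders for your own code, and the tag names are conventions rather than MLflow built-ins:

```python
import hashlib
import subprocess

import mlflow


def dataset_hash(path: str) -> str:
    """Content hash of the training dataset, captured at run time."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def git_sha() -> str:
    """Commit SHA of the training code."""
    return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()


def train_and_register(train_path: str, config: dict):
    with mlflow.start_run():
        # Lineage is written by the job itself, never reconstructed after the fact
        mlflow.set_tag("dataset_sha256", dataset_hash(train_path))
        mlflow.set_tag("git_commit", git_sha())
        mlflow.log_params(config)

        model, metrics = train(train_path, config)  # your training code (placeholder)
        mlflow.log_metrics(metrics)

        # Registration is part of the training job, not a later manual step
        mlflow.sklearn.log_model(
            model, "model", registered_model_name="churn-classifier"  # illustrative name
        )
```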
2. Treat model versions like software releases
The software engineering parallel is precise. A model version should have:
- A unique identifier (hash, not just a sequence number)
- Reproducibility: given the same training code + dataset hash, can you reproduce this artifact?
- A changelog: what changed from the previous version and why
- Test results: the eval suite run against this artifact, with links to the run artifacts
- A “blessed” status: staging, production, deprecated — with explicit transitions and approvers
MLflow’s stage system (None → Staging → Production → Archived) maps to this. Use it. The problem is that teams use it inconsistently. Define the promotion criteria in documentation that lives next to the registry integration, not in tribal knowledge.
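A minimal promotion helper along these lines, assuming MLflow's client API; the changelog convention and the approved_by tag are ours, not MLflow features:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()


def promote_to_staging(name: str, version: str, changelog: str, approver: str):
    # Record what changed and who approved it alongside the version itself
    client.update_model_version(name=name, version=version, description=changelog)
    client.set_model_version_tag(name, version, "approved_by", approver)
    client.transition_model_version_stage(name=name, version=version, stage="Staging")
```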
3. Production registrations require eval gates
The largest source of registry confusion is models that were promoted to production without clearing formal eval gates. The registry entry exists; the eval entry is blank or says “looks good.”
Implement this at the pipeline level: the promotion step reads the registry, checks whether an eval artifact exists for this version, and fails the pipeline if it doesn’t. You cannot mark a model as production without an eval artifact. This is a one-day implementation with years of dividends.
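One way to wire the gate, assuming the eval suite writes its results under an eval/ artifact path on the model's run; that path is a team convention, not an MLflow default:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()


def promote_to_production(name: str, version: str):
    mv = client.get_model_version(name, version)
    eval_artifacts = client.list_artifacts(mv.run_id, path="eval")
    if not eval_artifacts:
        # No eval artifact, no production promotion: fail the pipeline loudly
        raise RuntimeError(f"{name} v{version} has no eval artifacts; refusing to promote.")
    client.transition_model_version_stage(name, version, stage="Production")
```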
4. Shadow deployments register separately
Shadow mode (running a new model on production traffic but not serving its outputs) is a useful pre-production step, but it contaminates the registry if shadow models get the same lifecycle transitions as real production models.
The fix is separation: shadow versions get a distinct status, shadow-traffic metrics get stored as a separate artifact type, and shadow → production promotion requires a deliberate, documented step with its own approval flow.
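MLflow has no built-in shadow stage, so one option is to model shadow status as a version tag; the tag names below are illustrative:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()


def mark_shadow(name: str, version: str):
    # Shadow models never enter the Staging/Production stage flow
    client.set_model_version_tag(name, version, "deployment_mode", "shadow")


def promote_shadow_to_production(name: str, version: str, approver: str):
    # Deliberate, documented step with its own approval flow
    client.set_model_version_tag(name, version, "shadow_promotion_approved_by", approver)
    client.delete_model_version_tag(name, version, "deployment_mode")
    client.transition_model_version_stage(name, version, stage="Production")
```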
5. Deprecation is as important as registration
A registry that grows but never shrinks loses its value as source of truth. Define a deprecation policy:
- Models more than N versions old and not running in any environment get auto-archived
- Archived models are not deleted; they’re frozen and tagged
- Production models that haven’t served traffic in 30 days generate an alert for the owning team
Without this, the registry becomes an archaeology site rather than an operational tool.
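A sketch of the auto-archive job, again assuming MLflow; the retention window and the deployed_versions lookup are placeholders you would wire to your own deployment inventory:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient()
KEEP_LATEST_N = 5  # illustrative value for N


def archive_stale_versions(name: str, deployed_versions: set):
    versions = list(client.search_model_versions(f"name='{name}'"))
    versions.sort(key=lambda mv: int(mv.version), reverse=True)
    for mv in versions[KEEP_LATEST_N:]:
        if mv.version in deployed_versions or mv.current_stage == "Archived":
            continue
        # Frozen and tagged, never deleted
        client.set_model_version_tag(name, mv.version, "archived_reason", "stale")
        client.transition_model_version_stage(name, mv.version, stage="Archived")
```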
Registry tool choices
MLflow is the default for good reasons: it’s free, it integrates with most training frameworks, and the model registry is functional if you define clear conventions. The UI is acceptable. The API is stable. The weakness is that there’s no opinionated promotion workflow — you define your own, which means every team defines it differently.
W&B Model Registry has the best UI in the category and tight integration with W&B Runs. If your team already uses W&B for experiment tracking, extending to model registry is low-friction. The cost scales with seats and usage — model the economics before committing.
Vertex Model Registry and SageMaker Model Registry are correct choices if you’re committed to those cloud ecosystems and want to avoid managing another self-hosted service. Both have adequate feature sets; neither is best-in-class on its own merits.
The LLM registry problem
Everything above applies double for LLMs, with an added complication: LLM “versions” aren’t just model weights. They’re weights plus prompt templates, plus few-shot examples, plus the RAG retrieval configuration. A model version in the traditional sense captures less than half of what determines production behavior.
For LLM applications, the registry needs to capture the full inference configuration as an artifact: model identifier (including date-pinned version), system prompt, prompt template, retrieval config, and any post-processing logic. We cover this in more depth in LLMOps patterns at llmops.report.
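As an example of what that artifact can look like when logged alongside the model version (the field names and values are illustrative):

```python
import mlflow

inference_config = {
    "model": "gpt-4o-2024-08-06",  # date-pinned model identifier
    "system_prompt": "You are a support assistant...",
    "prompt_template": "Context:\n{context}\n\nQuestion: {question}",
    "retrieval": {"index": "support-docs-v3", "top_k": 8, "reranker": "cross-encoder"},
    "post_processing": "strip_pii_v2",
}

with mlflow.start_run():
    # The whole inference bundle becomes one versioned, diffable artifact
    mlflow.log_dict(inference_config, "inference_config.json")
```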
A checklist for registry implementation
Before you call your registry integration “done”:
- Training job writes to registry automatically, without manual steps
- Dataset hash is captured for every training run
- Eval artifacts are required for production promotion (enforced by pipeline)
- Staging/Production/Deprecated transitions have defined criteria and approvers
- Shadow deployment has a separate status
- Deprecation policy is documented and automated
- Registry serves as the authoritative answer to “what is running in production right now”
If you can’t answer the last point with a registry lookup, the registry isn’t doing its job.