MLOps Platforms
[Figure: model registry workflow diagram]

Model Registry Patterns That Hold in Production

A model registry is supposed to be the source of truth for what's deployed. Most implementations drift from that ideal within six months. Here's what breaks and how to prevent it.

By Priya Anand · 8 min read

The model registry problem sounds solved. MLflow has been around since 2018. W&B, Neptune, Comet, and the cloud-native registries (SageMaker, Vertex) all have mature implementations. Pick one, integrate it, move on.

This does not match my experience. After working with six different registries across a range of organizations, I've seen a consistent pattern: the registry starts clean, accumulates debt within a quarter, and within a year it's a graveyard of unnamed versions that nobody trusts to answer the question "what's actually running in production?"

This post covers the patterns that prevent registry rot — and the specific failure modes to design against.

Why registries drift

Three causes, in order of prevalence:

Incentive mismatch. The person who trains the model benefits from pushing it to production quickly. Registering it properly — with lineage, evaluation artifacts, dataset hashes, deployment conditions — takes 20 minutes and creates no immediate value for that person. The discipline requires organizational incentives, not just tooling.

Training pipeline integration is incomplete. Models get registered manually, after the fact, by an engineer who wasn’t the original author. Metadata is reconstructed rather than captured. Reconstructed metadata is wrong.

Environments multiply. One team calls their staging environment “staging.” Another calls it “pre-prod.” A third calls it “shadow.” The registry gets populated with transitions between environments that don’t map to a consistent promotion flow.
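One way to stop environment-name drift is to force every registry write through a single alias map. This is a hypothetical sketch — the alias names and the canonical set are examples, not a standard:

```python
# Hypothetical alias map: every team-local name resolves to one canonical
# environment before it touches the registry. Unknown names fail loudly.
CANONICAL = {
    "staging": "staging",
    "pre-prod": "staging",
    "preprod": "staging",
    "shadow": "shadow",
    "prod": "production",
    "production": "production",
}

def normalize_env(name: str) -> str:
    """Map a team-local environment name onto the canonical promotion flow."""
    key = name.strip().lower()
    if key not in CANONICAL:
        raise ValueError(f"unknown environment {name!r}; extend the map deliberately")
    return CANONICAL[key]
```

The point of the loud failure is that adding a fourth environment becomes a code review, not a silent new string in the registry.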

The patterns that work

1. Registry writes are part of the training job, not an afterthought

The only metadata that stays accurate is the metadata the training job writes at run time.

Manual registration steps produce manual-quality metadata. Build the registration into the artifact.
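To make that concrete, here is a minimal sketch of the record a training job might assemble at run time. The field names and the helper are assumptions for illustration, not any specific registry's schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def build_registration_record(model_name, dataset_bytes, git_sha, metrics):
    """Assemble registration metadata inside the training job itself.

    Hypothetical sketch: field names are assumptions, not a tool's API.
    The key property is that nothing here is reconstructed by a human
    after the fact -- it is all captured while the run is live.
    """
    return {
        "model": model_name,
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "git_sha": git_sha,
        "metrics": metrics,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        "registered_by": "training-job",  # never a person, by construction
    }

record = build_registration_record(
    "churn-model", b"<training data bytes>", "a1b2c3d", {"auc": 0.91}
)
print(json.dumps(record, indent=2))
```

Whatever registry client you use, the call that writes this record belongs in the same script that produced the weights.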

2. Treat model versions like software releases

The software engineering parallel is precise: a model version deserves the same lifecycle discipline as a software release.

MLflow’s stage system (None → Staging → Production → Archived) maps to this. Use it. The problem is that teams use it inconsistently. Define the promotion criteria in documentation that lives next to the registry integration, not in tribal knowledge.
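One way to move the promotion criteria out of tribal knowledge is to encode the allowed transitions as data next to the registry integration. The transition table below uses MLflow's stage names, but the table itself is our convention — an assumption, not something MLflow enforces:

```python
# Promotion flow as data. MLflow provides the stage names; the set of
# legal transitions is a team convention we are choosing to enforce.
ALLOWED = {
    "None": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

def validate_transition(current: str, target: str) -> str:
    """Reject any promotion that skips or reverses the agreed flow."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal promotion: {current} -> {target}")
    return target
```

A CI step that calls `validate_transition` before every stage change makes "every team defines it differently" a solved problem within one codebase.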

3. Production registrations require eval gates

The largest source of registry confusion is models that were promoted to production without clearing formal eval gates. The registry entry exists; the eval entry is blank or says “looks good.”

Implement this at the pipeline level: the promotion step reads the registry, checks whether an eval artifact exists for this version, and fails the pipeline if it doesn’t. You cannot mark a model as production without an eval artifact. This is a one-day implementation with years of dividends.
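The gate itself is small. This sketch uses a plain dict as a stand-in registry — swap in your registry client's lookup; the shape of the check is the point:

```python
def promote_to_production(registry, model_name, version):
    """Fail the pipeline unless an eval artifact exists for this version.

    `registry` is a stand-in dict keyed by (name, version); in practice
    this would be a call to your registry client.
    """
    entry = registry[(model_name, version)]
    if not entry.get("eval_artifacts"):
        raise RuntimeError(
            f"{model_name} v{version}: no eval artifact; refusing to promote"
        )
    entry["stage"] = "Production"
    return entry
```

Because the promotion step raises instead of warning, a missing eval stops the pipeline rather than producing another "looks good" registry entry.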

4. Shadow deployments register separately

Shadow mode (running a new model on production traffic but not serving its outputs) is a useful pre-production step, but it contaminates the registry if shadow models get the same lifecycle transitions as real production models.

The solution is separation: shadow versions get a distinct status, shadow traffic metrics get stored as a separate artifact type, and shadow → production promotion requires a deliberate, documented step with its own approval flow.

5. Deprecation is as important as registration

A registry that grows but never shrinks loses its value as a source of truth. Define a deprecation policy with explicit archival criteria.

Without this, the registry becomes an archaeology site rather than an operational tool.
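A deprecation policy is most useful when it runs as a scheduled job rather than a document. A minimal sketch, where the 180-day window is an example threshold, not a recommendation:

```python
from datetime import datetime, timedelta, timezone

def sweep(versions, max_age_days=180, now=None):
    """Archive versions that are old and not serving production traffic.

    `versions` is a list of registry entries (dicts here as a stand-in);
    the age threshold is whatever your deprecation policy says it is.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    for v in versions:
        if v["stage"] != "Production" and v["registered_at"] < cutoff:
            v["stage"] = "Archived"
    return versions
```

Run it weekly, log what it archived, and the registry stays an operational tool instead of an archaeology site.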

Registry tool choices

MLflow is the default for good reasons: it’s free, it integrates with most training frameworks, and the model registry is functional if you define clear conventions. The UI is acceptable. The API is stable. The weakness is that there’s no opinionated promotion workflow — you define your own, which means every team defines it differently.

W&B Model Registry has the best UI in the category and tight integration with W&B Runs. If your team already uses W&B for experiment tracking, extending to model registry is low-friction. The cost scales with seats and usage — model the economics before committing.

Vertex Model Registry and SageMaker Model Registry are correct choices if you’re committed to those cloud ecosystems and want to avoid managing another self-hosted service. Both have adequate feature sets; neither is best-in-class on its own merits.

The LLM registry problem

Everything above applies double for LLMs, with an added complication: LLM “versions” aren’t just model weights. They’re weights plus prompt templates, plus few-shot examples, plus the RAG retrieval configuration. A model version in the traditional sense captures less than half of what determines production behavior.

For LLM applications, the registry needs to capture the full inference configuration as an artifact: model identifier (including date-pinned version), system prompt, prompt template, retrieval config, and any post-processing logic. We cover this in more depth in LLMOps patterns at llmops.report.
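One way to make the full inference configuration a first-class registry artifact is to define it as a frozen structure and fingerprint it, so two deployments with identical weights but different prompts get different version identities. The fields below follow the list above; the class and its fingerprint scheme are illustrative assumptions:

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict

@dataclass(frozen=True)
class InferenceConfig:
    """Everything that determines LLM production behavior, not just weights."""
    model_id: str          # date-pinned, e.g. a provider's versioned model name
    system_prompt: str
    prompt_template: str
    retrieval: dict = field(default_factory=dict)  # RAG config, if any

    def fingerprint(self) -> str:
        """Stable short hash: any change to any field yields a new version."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()[:12]
```

Register the fingerprint alongside the model version, and "which prompt was live last Tuesday" becomes a registry lookup instead of a git archaeology exercise.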

A checklist for registry implementation

Before you call your registry integration "done", verify that you can answer the question "what's actually running in production?" with a single registry lookup. If you can't, the registry isn't doing its job.

Sources

  1. MLflow Model Registry Documentation
  2. Weights & Biases Model Registry
  3. Neptune AI Documentation
#model-registry #mlflow #mlops #deployment #versioning #governance