Data Versioning for Production ML: DVC, Delta Lake, and What Actually Works
Training data versioning sounds like an ML engineering nicety. In practice it's the prerequisite for reproducible models, auditable compliance, and debugging production failures.
The conversation about data versioning usually starts when something goes wrong. A model degrades in production, the team wants to reproduce the training run to investigate, and they discover the training dataset no longer exists in the form it was in when the model was trained. Columns were dropped, rows were filtered differently, a lookup table was updated in place.
Data versioning is the engineering discipline that prevents this failure mode. It’s also the prerequisite for compliant ML in regulated industries, where auditability of training data is a legal requirement, not an engineering aspiration.
The three flavors of data versioning
Data versioning means different things at different layers:
Schema versioning: Tracking changes to the shape of the data — column additions, type changes, removed fields. This is the easiest of the three to implement and the one whose absence causes the most silent failures.
Content versioning: Tracking changes to the data itself — new rows, updated values, deleted records. This is what allows you to ask “what did this dataset look like on March 15th?”
Lineage tracking: Tracking where data came from and what transformations were applied. This is the highest-value layer for debugging and compliance, and the hardest to implement completely.
Most teams need all three, implemented at different levels of rigor for different datasets.
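The schema layer is concrete enough to sketch directly. A minimal diff over column-name-to-type mappings (illustrative only; a real pipeline would read these from Parquet or warehouse metadata, and the column names here are made up):

```python
def schema_diff(old, new):
    """Compare two schemas (column name -> type string) and report changes."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    changed = {c: (old[c], new[c]) for c in old.keys() & new.keys() if old[c] != new[c]}
    return {"added": added, "removed": removed, "changed": changed}

v1 = {"user_id": "int64", "amount": "float64", "region": "string"}
v2 = {"user_id": "int64", "amount": "float32", "country": "string"}

diff = schema_diff(v1, v2)
# Flags the renamed column (as a remove + add) and the narrowed float type,
# both of which would otherwise fail silently downstream.
```

Even a check this small, run on every write, catches the dropped-column and type-change failures described above before they reach training.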
DVC: file-level versioning for ML datasets
DVC (Data Version Control) treats datasets like git treats code. You store the actual data in S3, GCS, or another remote; DVC tracks content hashes in a git-compatible metadata format.
```shell
# Initialize DVC in a git repo
dvc init

# Track a dataset
dvc add data/training/features.parquet

# This creates data/training/features.parquet.dvc — commit this to git
git add data/training/features.parquet.dvc .gitignore
git commit -m "track training dataset v1"

# Push data to remote
dvc push
```
When you retrain later with a different dataset version:
```shell
git checkout <older-commit>
dvc pull   # restores the exact dataset from that commit
```
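Under the hood, DVC identifies each tracked file by an MD5 content hash and stores it at a content-addressed path in its cache; the small `.dvc` file committed to git records only the hash. A rough sketch of that idea (the exact cache layout differs across DVC versions; this mirrors the classic `.dvc/cache/<first 2 hex chars>/<remaining chars>` scheme):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def dvc_cache_path(file_path, cache_dir=".dvc/cache"):
    """MD5-hash a file's contents (as DVC does) and return the digest plus
    the content-addressed cache location: <cache>/<first 2 hex chars>/<rest>."""
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            md5.update(chunk)
    digest = md5.hexdigest()
    return digest, Path(cache_dir) / digest[:2] / digest[2:]

# Demo on a throwaway file standing in for a dataset
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".parquet")
tmp.write(b"example dataset bytes")
tmp.close()
digest, cache_path = dvc_cache_path(tmp.name)
os.unlink(tmp.name)
```

Because the path is derived from content, an unchanged file re-added later costs nothing, and `dvc pull` at an old commit can always locate the exact bytes that commit's hash names.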
What DVC does well:
- Integrates naturally with git workflows
- Works with any cloud storage backend
- Supports pipelines (tracking what code + data produced what artifact)
- Free and open source
Where DVC falls short:
- Column-level schema awareness is not built in
- Large datasets with frequent updates are slow to hash and push
- The cache and file-linking behavior (copies vs. symlinks or reflinks) confuses new team members and needs explicit onboarding documentation
Who should use DVC: Teams with file-based datasets (Parquet, CSV, HDF5) who want dataset versioning integrated with their code versioning workflow.
Delta Lake: transactional versioning for tabular data
Delta Lake is a storage layer on top of Parquet that adds ACID transactions, schema enforcement, and time travel. If your training data lives in a data lake built on Parquet files, Delta Lake provides content versioning without a separate tool.
```python
from pyspark.sql import SparkSession

# Delta needs its SQL extension and catalog registered on the session
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write with Delta
df.write.format("delta").save("s3://my-bucket/training-data")

# Time travel: read data as it was on a specific date
df_historical = (
    spark.read
    .format("delta")
    .option("timestampAsOf", "2026-03-15")
    .load("s3://my-bucket/training-data")
)

# Or by version number
df_v3 = (
    spark.read
    .format("delta")
    .option("versionAsOf", 3)
    .load("s3://my-bucket/training-data")
)
```
Delta Lake’s transaction log is the versioning mechanism. Every write operation appends to the log. Time travel reads the log to reconstruct the dataset as of a given version.
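That replay can be sketched in a few lines. The toy log below models each commit as a list of add/remove actions over data files; the real Delta protocol also carries metadata, schema, and checkpoint entries, so this is an illustration of the mechanism only:

```python
def files_as_of(log, version):
    """Replay Delta-style transaction log entries up to `version`.
    Each commit is a list of add/remove actions; the live file set
    at a version is all adds minus all removes seen so far."""
    live = set()
    for v, actions in enumerate(log):
        if v > version:
            break
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

log = [
    [{"add": {"path": "part-0.parquet"}}],                 # version 0
    [{"add": {"path": "part-1.parquet"}}],                 # version 1
    [{"remove": {"path": "part-0.parquet"}},
     {"add": {"path": "part-2.parquet"}}],                 # version 2
]

snapshot_v1 = files_as_of(log, 1)  # both part-0 and part-1 still live
snapshot_v2 = files_as_of(log, 2)  # part-0 has been removed
```

Note that time travel only works while the underlying files survive: once VACUUM deletes a removed file, the versions that referenced it are unreadable.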
What Delta Lake does well:
- True content versioning with time travel
- Schema enforcement (writes with incompatible schemas fail)
- ACID transactions prevent partial writes
- Native Spark integration
Where Delta Lake falls short:
- Requires Spark (or the standalone Delta Rust library)
- Retention of old versions must be configured: by default VACUUM removes unreferenced data files older than 7 days (delta.deletedFileRetentionDuration), and the transaction log is kept for 30 days (delta.logRetentionDuration)
- Not a fit for teams without Spark or Databricks infrastructure
Who should use Delta Lake: Teams already using Databricks or Spark for data processing.
Mixing approaches
In practice, many teams use both:
- Delta Lake for the live data warehouse (raw → processed → feature tables)
- DVC for the specific snapshots used in training (point-in-time exports from Delta, tracked in DVC)
This gives you the query flexibility of Delta Lake for data engineering and the reproducibility of DVC for training artifacts.
What to capture in the model registry
Data versioning is only as useful as the linkage between training artifacts and data versions. The model registry entry for every model version should record:
- The DVC commit hash or Delta Lake version number for the training dataset
- The DVC commit hash for any validation/test datasets
- The schema version at training time
- Any data filtering conditions applied (e.g., “records from 2025-01-01 to 2026-03-31, US region only”)
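As a sketch, such a linkage record might look like the following (the class and field names are hypothetical, not any particular registry's API; most registries accept this kind of structure as tags or metadata on a model version):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class TrainingDataRecord:
    """Data-version linkage to store with each model version.
    Field names are illustrative; adapt them to your registry's metadata API."""
    train_dvc_rev: str            # git commit pinning the training .dvc files
    eval_dvc_rev: str             # same, for validation/test datasets
    delta_version: Optional[int]  # Delta table version, if sourced from Delta
    schema_version: str           # schema identifier at training time
    filters: tuple                # human-readable filtering conditions

record = TrainingDataRecord(
    train_dvc_rev="a1b2c3d",
    eval_dvc_rev="a1b2c3d",
    delta_version=3,
    schema_version="v7",
    filters=("2025-01-01 <= date <= 2026-03-31", "region == 'US'"),
)
metadata = asdict(record)  # plain dict, ready to attach as registry metadata
```

Making the record frozen is deliberate: the linkage should be written once at training time and never mutated afterward.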
Without this linkage, data versioning solves the storage problem but not the reproducibility problem.
Compliance implications
For regulated industries (financial services, healthcare), training data versioning is typically required by:
- Model risk management guidelines (SR 11-7 in banking)
- GDPR right-to-explanation (requires knowing what data trained a model)
- FDA software as a medical device guidance (for ML-based medical devices)
If compliance is a driver, design for auditability first: schema versioning + lineage tracking + a documented retention policy that satisfies your regulatory timeline. DVC + Delta Lake covers the technical requirements; the governance layer (who can access what, how long versions are retained, audit logging) requires additional tooling.
The intersection of data governance and ML monitoring — catching data quality issues before they become training issues — is covered extensively at mlobserve.com.