MLOps Platforms
[Figure: Data versioning lineage diagram]

Data Versioning for Production ML: DVC, Delta Lake, and What Actually Works

Training data versioning sounds like an ML engineering nicety. In practice it's the prerequisite for reproducible models, auditable compliance, and debugging production failures.

By Priya Anand · 8 min read

The conversation about data versioning usually starts when something goes wrong. A model degrades in production, the team wants to reproduce the training run to investigate, and they discover the training dataset no longer exists in the form it was in when the model was trained. Columns were dropped, rows were filtered differently, a lookup table was updated in place.

Data versioning is the engineering discipline that prevents this failure mode. It’s also the prerequisite for compliant ML in regulated industries, where auditability of training data is a legal requirement, not an engineering aspiration.

The three flavors of data versioning

Data versioning means different things at different layers:

Schema versioning: Tracking changes to the shape of the data — column additions, type changes, removed fields. This is the easiest of the three to implement and the one that causes the most silent failures when it’s missing (a minimal check is sketched after this list).

Content versioning: Tracking changes to the data itself — new rows, updated values, deleted records. This is what allows you to ask “what did this dataset look like on March 15th?”

Lineage tracking: Tracking where data came from and what transformations were applied. This is the highest-value of the three for debugging and compliance, and the hardest to implement completely.

Most teams need all three, implemented at different levels of rigor for different datasets.
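
To make the schema-versioning point concrete, here is a minimal sketch of a check that turns silent schema drift into a loud failure. The fingerprint helper and the stored EXPECTED value are illustrative, not part of any particular tool.

import hashlib

import pandas as pd

def schema_fingerprint(df: pd.DataFrame) -> str:
    # Hash of column names and dtypes -- changes whenever the schema changes
    spec = ",".join(f"{col}:{dtype}" for col, dtype in df.dtypes.items())
    return hashlib.sha256(spec.encode()).hexdigest()

# EXPECTED would be recorded when the model was trained (illustrative value)
EXPECTED = "3f6c..."

df = pd.read_parquet("data/training/features.parquet")
if schema_fingerprint(df) != EXPECTED:
    raise ValueError("training data schema no longer matches the recorded version")

Even this much, recorded alongside each model version, catches the dropped-column and changed-type failures described above before they reach a training run.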

DVC: file-level versioning for ML datasets

DVC (Data Version Control) treats datasets like git treats code. You store the actual data in S3, GCS, or another remote; DVC tracks content hashes in a git-compatible metadata format.

# Initialize DVC in a git repo
dvc init

# Track a dataset
dvc add data/training/features.parquet

# This creates data/training/features.parquet.dvc — commit this to git
git add data/training/features.parquet.dvc .gitignore
git commit -m "track training dataset v1"

# Configure a remote once per repo (bucket name illustrative), then push the data
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push

When you later need the dataset exactly as it was at an earlier commit (for example, to reproduce a training run):

git checkout <older-commit>
dvc pull  # restores the exact dataset from that commit
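
You can also pin a dataset version from inside training code. A small sketch using DVC's Python API, assuming the file is small enough to load into memory and that a git tag such as v1.0 marks the run you want:

import io

import dvc.api
import pandas as pd

# Read the file's content as it existed at a given git revision (tag, branch, or SHA)
data = dvc.api.read(
    "data/training/features.parquet",
    rev="v1.0",
    mode="rb",
)
features = pd.read_parquet(io.BytesIO(data))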

What DVC does well:

- Ties dataset versions to git commits, so checking out a commit recovers the matching code and data together
- Storage-agnostic: the data can live in S3, GCS, Azure, SSH, or a shared filesystem while git stays small
- Works with any file format, including images, text corpora, and model binaries, not just tabular data

Where DVC falls short:

- Versioning is at file granularity: a small change to a large file means re-hashing and re-uploading the whole file
- No query-level time travel or row-level diffs; you restore whole snapshots rather than asking what changed
- It relies on discipline: data changed outside of dvc add and dvc push is invisible to the version history

Who should use DVC: Teams with file-based datasets (Parquet, CSV, HDF5) who want dataset versioning integrated with their code versioning workflow.

Delta Lake: transactional versioning for tabular data

Delta Lake is a storage layer on top of Parquet that adds ACID transactions, schema enforcement, and time travel. If your training data lives in a data lake built on Parquet files, Delta Lake provides content versioning without a separate tool.

from pyspark.sql import SparkSession

# delta-spark 2.x: pull in the Delta jar and enable the Delta SQL extension
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write an existing DataFrame (df) as a Delta table
df.write.format("delta").save("s3://my-bucket/training-data")

# Time travel: read data as it was on a specific date
df_historical = (
    spark.read
    .format("delta")
    .option("timestampAsOf", "2026-03-15")
    .load("s3://my-bucket/training-data")
)

# Or by version number
df_v3 = (
    spark.read
    .format("delta")
    .option("versionAsOf", 3)
    .load("s3://my-bucket/training-data")
)

Delta Lake’s transaction log is the versioning mechanism. Every write operation appends to the log. Time travel reads the log to reconstruct the dataset as of a given version.
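
The log is also queryable, so you can list which versions exist before picking one. For example, using the DeltaTable API against the session and path from the snippet above:

from delta.tables import DeltaTable

# Each committed write shows up as one row: version, timestamp, operation, etc.
history = DeltaTable.forPath(spark, "s3://my-bucket/training-data").history()
history.select("version", "timestamp", "operation").show(truncate=False)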

What Delta Lake does well:

- Time travel by version or timestamp comes for free with every write; there is no separate versioning workflow to remember
- ACID transactions and schema enforcement catch bad writes before they corrupt the training table
- Scales to large tabular datasets because versioning happens in the transaction log rather than by copying files

Where Delta Lake falls short:

- It is tied to the Spark and Databricks ecosystem; standalone readers exist, but the full feature set assumes Spark
- Old versions only survive until VACUUM removes their underlying files, so the retention window must be configured deliberately
- It only versions tabular data; images, text files, and other training artifacts need a different mechanism

Who should use Delta Lake: Teams already using Databricks or Spark for data processing.

Mixing approaches

In practice, many teams use both:

- Delta Lake as the system of record for feature tables and upstream transformations, where engineers need transactional writes and queryable history
- DVC to pin the exact extracted training snapshot (features, labels, splits) alongside the training code in git

This gives you the query flexibility of Delta Lake for data engineering and the reproducibility of DVC for training artifacts.
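
One way the handoff can look, sketched with illustrative paths and a hypothetical snapshot directory: freeze a specific Delta version as a Parquet snapshot, then let DVC track that snapshot next to the training code.

# Freeze a specific Delta version as the training snapshot (paths are illustrative)
snapshot = (
    spark.read
    .format("delta")
    .option("versionAsOf", 3)
    .load("s3://my-bucket/training-data")
)
snapshot.write.mode("overwrite").parquet("data/training/snapshot_v3")

# Then, in the shell:
#   dvc add data/training/snapshot_v3
#   git add data/training/snapshot_v3.dvc
#   git commit -m "pin training data to Delta version 3"
#   dvc push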

What to capture in the model registry

Data versioning is only as useful as the linkage between training artifacts and data versions. The model registry entry for every model version should record:

- The dataset identifier and its exact version: the Delta table path plus versionAsOf number, or the git commit containing the .dvc files
- The code revision of the feature extraction and training pipeline that produced the run
- The query or filter used to derive the training set from the versioned source, if it isn't the whole table

Without this linkage, data versioning solves the storage problem but not the reproducibility problem.
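
As one concrete shape for that linkage, assuming MLflow is the registry: the training pipeline can tag the registered model version with the data pointers it consumed. The model name, version, and tag names below are illustrative conventions, not MLflow requirements.

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Tag the registered model version with pointers to its exact training data
# (model name, version number, and values below are illustrative)
client.set_model_version_tag("churn-model", "12", "delta_table", "s3://my-bucket/training-data")
client.set_model_version_tag("churn-model", "12", "delta_version", "3")
client.set_model_version_tag("churn-model", "12", "dvc_git_commit", "4f2a91c")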

Compliance implications

For regulated industries (financial services, healthcare), training data versioning is typically required by:

- Model risk management guidance in banking (for example, the Federal Reserve's SR 11-7), which expects documented, reproducible model development
- The EU AI Act's data governance and record-keeping requirements for high-risk AI systems
- Internal audit and model validation teams, who need to reconstruct exactly what a model was trained on

If compliance is a driver, design for auditability first: schema versioning + lineage tracking + a documented retention policy that satisfies your regulatory timeline. DVC + Delta Lake covers the technical requirements; the governance layer (who can access what, how long versions are retained, audit logging) requires additional tooling.

The intersection of data governance and ML monitoring — catching data quality issues before they become training issues — is covered extensively at mlobserve.com.

Sources

  1. DVC Documentation
  2. Delta Lake Documentation
  3. AWS Lake Formation and Data Governance
#data-versioning #dvc #delta-lake #reproducibility #mlops #data-engineering