Data Versioning for Production ML: DVC, Delta Lake, and What Actually Works
Training data versioning sounds like an ML engineering nicety. In practice it's the prerequisite for reproducible models, auditable compliance, and debugging production failures.
The conversation about data versioning usually starts when something goes wrong. A model degrades in production, the team wants to reproduce the training run to investigate, and they discover the training dataset no longer exists in the form it was in when the model was trained. Columns were dropped, rows were filtered differently, a lookup table was updated in place.
Data versioning is the engineering discipline that prevents this failure mode. It’s also the prerequisite for compliant ML in regulated industries, where auditability of training data is a legal requirement, not an engineering aspiration.
The three flavors of data versioning
Data versioning means different things at different layers:
Schema versioning: Tracking changes to the shape of the data — column additions, type changes, removed fields. This is the easiest of the three to implement and the one whose absence causes the most silent failures.
Content versioning: Tracking changes to the data itself — new rows, updated values, deleted records. This is what allows you to ask “what did this dataset look like on March 15th?”
Lineage tracking: Tracking where data came from and what transformations were applied. This is the highest-value layer for debugging and compliance, and the hardest to implement completely.
Most teams need all three, implemented at different levels of rigor for different datasets.
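The schema layer is concrete enough to sketch directly. A minimal diff over column-name-to-type mappings (illustrative only; a real pipeline would read these from Parquet or warehouse metadata, and the column names here are made up):

```python
def schema_diff(old, new):
    """Compare two schemas (column name -> type string) and report changes."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    changed = {c: (old[c], new[c]) for c in old.keys() & new.keys() if old[c] != new[c]}
    return {"added": added, "removed": removed, "changed": changed}

v1 = {"user_id": "int64", "amount": "float64", "region": "string"}
v2 = {"user_id": "int64", "amount": "float32", "country": "string"}

diff = schema_diff(v1, v2)
# Flags the renamed column (as a remove + add) and the narrowed float type,
# both of which would otherwise fail silently downstream.
```

Even a check this small, run on every write, catches the dropped-column and type-change failures described above before they reach training.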
DVC: file-level versioning for ML datasets
DVC (Data Version Control) treats datasets like git treats code. You store the actual data in S3, GCS, or another remote; DVC tracks content hashes in a git-compatible metadata format.
```shell
# Initialize DVC in a git repo
dvc init

# Track a dataset
dvc add data/training/features.parquet

# This creates data/training/features.parquet.dvc — commit this to git
git add data/training/features.parquet.dvc .gitignore
git commit -m "track training dataset v1"

# Push data to remote
dvc push
```
When you retrain later with a different dataset version:
```shell
git checkout <older-commit>
dvc pull   # restores the exact dataset from that commit
```
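Under the hood, DVC identifies each tracked file by an MD5 content hash and stores it at a content-addressed path in its cache; the small `.dvc` file committed to git records only the hash. A rough sketch of that idea (the exact cache layout differs across DVC versions; this mirrors the classic `.dvc/cache/<first 2 hex chars>/<remaining chars>` scheme):

```python
import hashlib
import os
import tempfile
from pathlib import Path

def dvc_cache_path(file_path, cache_dir=".dvc/cache"):
    """MD5-hash a file's contents (as DVC does) and return the digest plus
    the content-addressed cache location: <cache>/<first 2 hex chars>/<rest>."""
    md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB chunks
            md5.update(chunk)
    digest = md5.hexdigest()
    return digest, Path(cache_dir) / digest[:2] / digest[2:]

# Demo on a throwaway file standing in for a dataset
tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".parquet")
tmp.write(b"example dataset bytes")
tmp.close()
digest, cache_path = dvc_cache_path(tmp.name)
os.unlink(tmp.name)
```

Because the path is derived from content, an unchanged file re-added later costs nothing, and `dvc pull` at an old commit can always locate the exact bytes that commit's hash names.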
What DVC does well:
- Integrates naturally with git workflows
- Works with any cloud storage backend
- Supports pipelines (tracking what code + data produced what artifact)
- Free and open source
Where DVC falls short:
- Column-level schema awareness is not built in
- Large datasets with frequent updates are slow to hash and push
- The cache and file-linking behavior (copies vs. symlinks or reflinks) confuses new team members and needs explicit onboarding documentation
Who should use DVC: Teams with file-based datasets (Parquet, CSV, HDF5) who want dataset versioning integrated with their code versioning workflow.
Delta Lake: transactional versioning for tabular data
Delta Lake is a storage layer on top of Parquet that adds ACID transactions, schema enforcement, and time travel. If your training data lives in a data lake built on Parquet files, Delta Lake provides content versioning without a separate tool.
```python
from pyspark.sql import SparkSession

# Delta needs its SQL extension and catalog registered on the session
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.4.0")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write with Delta
df.write.format("delta").save("s3://my-bucket/training-data")

# Time travel: read data as it was on a specific date
df_historical = (
    spark.read
    .format("delta")
    .option("timestampAsOf", "2026-03-15")
    .load("s3://my-bucket/training-data")
)

# Or by version number
df_v3 = (
    spark.read
    .format("delta")
    .option("versionAsOf", 3)
    .load("s3://my-bucket/training-data")
)
```
Delta Lake’s transaction log is the versioning mechanism. Every write operation appends to the log. Time travel reads the log to reconstruct the dataset as of a given version.
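That replay can be sketched in a few lines. The toy log below models each commit as a list of add/remove actions over data files; the real Delta protocol also carries metadata, schema, and checkpoint entries, so this is an illustration of the mechanism only:

```python
def files_as_of(log, version):
    """Replay Delta-style transaction log entries up to `version`.
    Each commit is a list of add/remove actions; the live file set
    at a version is all adds minus all removes seen so far."""
    live = set()
    for v, actions in enumerate(log):
        if v > version:
            break
        for action in actions:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

log = [
    [{"add": {"path": "part-0.parquet"}}],                 # version 0
    [{"add": {"path": "part-1.parquet"}}],                 # version 1
    [{"remove": {"path": "part-0.parquet"}},
     {"add": {"path": "part-2.parquet"}}],                 # version 2
]

snapshot_v1 = files_as_of(log, 1)  # both part-0 and part-1 still live
snapshot_v2 = files_as_of(log, 2)  # part-0 has been removed
```

Note that time travel only works while the underlying files survive: once VACUUM deletes a removed file, the versions that referenced it are unreadable.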
What Delta Lake does well:
- True content versioning with time travel
- Schema enforcement (writes with incompatible schemas fail)
- ACID transactions prevent partial writes
- Native Spark integration
Where Delta Lake falls short:
- Requires Spark (or the standalone Delta Rust library)
- Retention of old versions must be configured: by default VACUUM removes unreferenced data files older than 7 days (delta.deletedFileRetentionDuration), and the transaction log is kept for 30 days (delta.logRetentionDuration)
- Not a fit for teams without Spark or Databricks infrastructure
Who should use Delta Lake: Teams already using Databricks or Spark for data processing.
Mixing approaches
In practice, many teams use both:
- Delta Lake for the live data warehouse (raw → processed → feature tables)
- DVC for the specific snapshots used in training (point-in-time exports from Delta, tracked in DVC)
This gives you the query flexibility of Delta Lake for data engineering and the reproducibility of DVC for training artifacts.
What to capture in the model registry
Data versioning is only as useful as the linkage between training artifacts and data versions. The model registry entry for every model version should record:
- The DVC commit hash or Delta Lake version number for the training dataset
- The DVC commit hash for any validation/test datasets
- The schema version at training time
- Any data filtering conditions applied (e.g., “records from 2025-01-01 to 2026-03-31, US region only”)
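As a sketch, such a linkage record might look like the following (the class and field names are hypothetical, not any particular registry's API; most registries accept this kind of structure as tags or metadata on a model version):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass(frozen=True)
class TrainingDataRecord:
    """Data-version linkage to store with each model version.
    Field names are illustrative; adapt them to your registry's metadata API."""
    train_dvc_rev: str            # git commit pinning the training .dvc files
    eval_dvc_rev: str             # same, for validation/test datasets
    delta_version: Optional[int]  # Delta table version, if sourced from Delta
    schema_version: str           # schema identifier at training time
    filters: tuple                # human-readable filtering conditions

record = TrainingDataRecord(
    train_dvc_rev="a1b2c3d",
    eval_dvc_rev="a1b2c3d",
    delta_version=3,
    schema_version="v7",
    filters=("2025-01-01 <= date <= 2026-03-31", "region == 'US'"),
)
metadata = asdict(record)  # plain dict, ready to attach as registry metadata
```

Making the record frozen is deliberate: the linkage should be written once at training time and never mutated afterward.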
Without this linkage, data versioning solves the storage problem but not the reproducibility problem.
Compliance implications
For regulated industries (financial services, healthcare), training data versioning is typically required by:
- Model risk management guidelines (SR 11-7 in banking)
- GDPR right-to-explanation (requires knowing what data trained a model)
- FDA software as a medical device guidance (for ML-based medical devices)
If compliance is a driver, design for auditability first: schema versioning + lineage tracking + a documented retention policy that satisfies your regulatory timeline. DVC + Delta Lake covers the technical requirements; the governance layer (who can access what, how long versions are retained, audit logging) requires additional tooling.
The intersection of data governance and ML monitoring — catching data quality issues before they become training issues — is covered extensively at mlobserve.com.