DVC

DVC, or Data Version Control, is an open-source tool for managing data assets. It is very useful, but it can also be quite overwhelming to use.

The main use cases I've found for DVC are:

  • Keeping large data assets (e.g. machine learning datasets) version controlled alongside code so that you always know which version of some data was used to train a particular model.
  • Making sure that all intermediate and final outputs from experiments are reproducible by:
    1. Keeping track of the hashes of the files used at each stage in a pipeline.
    2. Keeping a copy of the file content at each stage in the pipeline.
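The two mechanisms in the list above (hash tracking and content copies) can be illustrated with a small Python sketch. This is not DVC's implementation, only a toy model of the idea. The function names `file_hash`, `snapshot`, and `stage_is_stale`, the cache layout, and the choice of MD5 are all assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the MD5 hex digest of a file's content, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(path: Path, cache_dir: Path) -> str:
    """Copy the file into a content-addressed cache (hypothetical layout:
    one file per hash) and return the hash."""
    h = file_hash(path)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / h
    if not target.exists():  # identical content is stored only once
        shutil.copy2(path, target)
    return h


def stage_is_stale(inputs: list[Path], recorded: dict[str, str]) -> bool:
    """A stage needs rerunning if any input's current hash differs from
    the hash recorded the last time the stage ran."""
    return any(recorded.get(str(p)) != file_hash(p) for p in inputs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data.csv"
        data.write_bytes(b"a,b\n1,2\n")

        # Record the hash and keep a copy of the content.
        recorded = {str(data): snapshot(data, root / "cache")}
        print(stage_is_stale([data], recorded))

        # Changing the input makes the recorded hash out of date.
        data.write_bytes(b"a,b\n1,3\n")
        print(stage_is_stale([data], recorded))
```

Because the cached copy is named after its hash, the old version of the file remains recoverable after the input changes, which is what makes an earlier pipeline result reproducible.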

I like to use DVC to track data assets