DVC
DVC, or Data Version Control, is an open-source tool for managing data assets. It is very useful but can also be quite overwhelming to use.
The main use cases I've found for DVC are:
- Keeping large data assets (e.g. machine learning datasets) version controlled alongside code so that you always know where the latest version of some project-specific training data can be found.
- Making sure that all intermediate and final outputs from experiments are reproducible. This is always good to know when a client inevitably asks you "ok how did you get Y surprising result?" or "can you just confirm that you included X features in your model?" DVC helps by:
- Keeping track of file hashes used at each stage in a pipeline
- Keeping a copy of the file content at each stage in the pipeline.
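
To make the two points above concrete, here is a minimal Python sketch of the underlying idea: hash a file, then keep a copy of its content in a cache keyed by that hash. This is an illustration only, not DVC's actual implementation. The function names (`file_md5`, `snapshot`, `restore`) and the cache layout are made up for the example, although DVC does use MD5 hashes by default.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(path: Path, cache_dir: Path) -> str:
    """Copy `path` into `cache_dir` under its hash and return the hash.

    The two-character subdirectory is a hypothetical layout chosen for
    this sketch; it only keeps any one directory from growing too large.
    """
    file_hash = file_md5(path)
    target = cache_dir / file_hash[:2] / file_hash[2:]
    if not target.exists():  # identical content is stored only once
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return file_hash


def restore(file_hash: str, cache_dir: Path, dest: Path) -> None:
    """Write the cached content for `file_hash` back to `dest`."""
    shutil.copy2(cache_dir / file_hash[:2] / file_hash[2:], dest)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "train.csv"  # hypothetical dataset file
        cache = root / "cache"

        data.write_text("x,y\n1,2\n")
        v1 = snapshot(data, cache)  # hash recorded for "stage 1"

        data.write_text("x,y\n1,2\n3,4\n")
        v2 = snapshot(data, cache)  # content changed, so the hash changes

        print("hashes differ:", v1 != v2)
        restore(v1, cache, data)  # roll back to the earlier content
        print("restored hash matches v1:", file_md5(data) == v1)
```

Because the recorded hash is small, it can be committed to git next to the code, while the bulky file content lives in the cache. That split is roughly what lets you answer "which data produced this result?" later on.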
I like to use DVC to track data assets