Data Engineering and MLOps
DVC
DVC, or Data Version Control, is an open-source tool for managing data assets. It is very useful, but it can also be overwhelming at first.
The main use cases I've found for DVC are:
- Keeping large data assets (e.g. machine learning datasets) version-controlled alongside code, so that you always know where the latest version of a project's training data can be found.
- Making sure that all intermediate and final outputs from experiments are reproducible. This is invaluable when a client inevitably asks "OK, how did you get Y surprising result?" or "can you confirm that you included X features in your model?" DVC helps by:
- Keeping track of the hash of each file used at each stage in a pipeline
- Keeping a copy of the file content at each stage in the pipeline
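The pipeline tracking described above is driven by a `dvc.yaml` file. Here is a minimal sketch (the stage names, script paths, and file names are hypothetical):

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/clean.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py data/clean.csv model.pkl
    deps:
      - train.py
      - data/clean.csv
    outs:
      - model.pkl
```

Running `dvc repro` hashes each dependency, re-runs only the stages whose inputs have changed, and records the resulting hashes in `dvc.lock`.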
Model Registry
A model registry is a service that provides version-control-like behaviour for ML models. There are a number of open source and commercial model registries.
MLflow
MLflow is an open-source ML lifecycle platform whose features include experiment tracking, result storage, and a model registry with model version registration.
Snowpark Registry
Snowpark provides a container runtime environment inside Snowflake, and it offers Model Registry functionality, which is documented here.
Data Wrangling
DuckDB
DuckDB is a lightweight, in-process OLAP database system written in C++, designed for EDA-style workloads.
Their website has advice on when to use (and when not to use) DuckDB.
Polars
Polars is a Rust-based DataFrame library with Python bindings.
Here is a talk that Juan Luis gave about the library.
DBT
DBT is a data transformation tool, offered both as a SaaS platform and as an open-core command-line tool.
It is widely used to put the "T" in ELT.
Robin Moffatt has written a walkthrough/guide on how he used DBT with DuckDB.
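A DBT model is just a SQL file with Jinja templating. A minimal sketch (the model name and upstream table are hypothetical):

```sql
-- models/daily_orders.sql (hypothetical model and source names)
{{ config(materialized='table') }}

select
    order_date,
    count(*) as order_count
from {{ ref('raw_orders') }}
group by order_date
```

`dbt run` compiles the template, resolves `ref()` to the upstream relation, and materialises the result in the warehouse; `ref()` is also how DBT infers the dependency graph between models.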
Data Loading with Airbyte
Airbyte is a FOSS tool for bulk data import and export when working with common flavours of SQL and OLAP databases.