Data Quality and Preparation

Exploratory Data Analysis (EDA)

Tools such as Pandas Profiling and SweetViz make EDA fast and repeatable.

Pandas Profiling

Pandas Profiling (now published under the name ydata-profiling) is an automated EDA tool that generates rich HTML reports from pandas DataFrames. A report is a quick way to show a customer early progress during data engineering.
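A minimal sketch of generating such a report, assuming the current ydata-profiling package (the renamed successor of pandas-profiling) is installed. The function name write_profile_report, the report title and the default output path are made up for illustration.

```python
def write_profile_report(df, output_path="profile_report.html"):
    """Write an HTML profiling report for a pandas DataFrame."""
    # Imported lazily so this snippet still loads where the package is
    # not installed. Assumes the ydata-profiling distribution.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title="Data profile")  # title is illustrative
    report.to_file(output_path)
    return output_path
```

Calling write_profile_report(df) on any DataFrame should leave a standalone HTML file that can be shared as-is.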

SweetViz

SweetViz is a Python visualisation tool that generates side-by-side comparisons of data frames. Its primary use case is comparing train and test sets to check that they are similar, but it can also be used for other purposes, such as comparing annotated data from different sources.
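The train/test comparison might look like the sketch below, assuming the sweetviz package is installed. The function name write_train_test_comparison, the labels "Train" and "Test", and the output path are illustrative choices.

```python
def write_train_test_comparison(train_df, test_df,
                                output_path="train_vs_test.html"):
    """Write an HTML report comparing two pandas DataFrames."""
    # Imported lazily so this snippet still loads where sweetviz is absent.
    import sweetviz as sv

    # Each side is passed as [dataframe, display name].
    report = sv.compare([train_df, "Train"], [test_df, "Test"])
    # open_browser=False keeps this usable on a headless machine.
    report.show_html(filepath=output_path, open_browser=False)
    return output_path
```

The same call works for any two data frames with overlapping columns, for example annotations from two different sources.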

Assessing Data Quality

One of the biggest difficulties in ML is dealing with messy data. It is a common and recurring problem.

CleanLab

CleanLab is a product that attempts to use statistical methods to clean up data and labels. I need to read more about exactly how it works.

They have some tutorials on how to use their system to clean up text for processing.
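As a rough illustration of the kind of statistical check involved, here is a simplified sketch of the confident-learning idea that CleanLab is built around. It is not CleanLab's implementation, and the function name flag_label_issues and the toy data are made up. It assumes a model has already produced per-class probabilities for every example, and that every class has at least one labelled example.

```python
def flag_label_issues(labels, pred_probs):
    """Return indices of examples whose given label looks suspicious.

    labels: list of int class ids (the possibly noisy labels).
    pred_probs: one list of per-class probabilities per example,
        ideally predicted out-of-sample (e.g. via cross-validation).
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: the average probability the model gives to
    # class k over the examples that are labelled k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for y, p in zip(labels, pred_probs) if y == k]
        thresholds.append(sum(own) / len(own))

    issues = []
    for i, (y, p) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this example.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: p[k])
            if likely != y:
                issues.append(i)
    return issues


labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labelled 0, but the model strongly prefers class 1
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]
suspect_indices = flag_label_issues(labels, pred_probs)
print(suspect_indices)  # [2]
```

Flagged examples are candidates for review rather than confirmed errors; a person or a second check would normally decide whether to relabel or drop them.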

Variance

Variance measures how spread out your data is around its mean: it is the average squared distance of the values from the mean.

[Figure: image.png, two distributions drawn in red and blue]

In the diagram the red distribution has low variance and the blue distribution has high variance.
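The same contrast can be reproduced numerically. The sketch below uses two made-up samples with the same mean and compares their population variance, both by hand and with Python's statistics module.

```python
from statistics import mean, pvariance


def population_variance(values):
    """Average squared distance of the values from their mean."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)


# Two made-up samples with the same mean but different spread.
tight = [9.0, 10.0, 11.0, 10.0, 10.0]
wide = [0.0, 5.0, 10.0, 15.0, 20.0]

print(mean(tight), mean(wide))            # both means are 10.0
print(pvariance(tight), pvariance(wide))  # the second is far larger
print(population_variance(wide))          # matches pvariance(wide)
```

Here tight plays the role of the low-variance distribution and wide the high-variance one.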

In finance, a probability distribution with high variance is typically seen as riskier: historical outcomes have been widely spread, which implies that future outcomes are also likely to fall across a much wider range.

Effects of Variance on ML Training Data

Training data with high variance tends to lead to less sensitive models, and training data with low variance to more sensitive ones. This is nicely illustrated in this blog post (mirror).

Effects of Variance on Machine Learning Models

Models with high variance (often strongly tied to the number of parameters in the model) tend to fit the training data closely but struggle to generalise to test data. Models with low variance tend to fit the training data less closely but behave more consistently on unseen data.
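A small standard-library sketch of this trade-off, under made-up assumptions: the data follow a hypothetical noisy line y = 2x, the high-variance model simply memorises every training point (so it effectively has one parameter per example), and the low-variance model is a two-parameter least-squares line.

```python
import random

random.seed(0)


def noisy_line(xs):
    # Hypothetical ground truth: y = 2x plus Gaussian noise.
    return [2.0 * x + random.gauss(0.0, 3.0) for x in xs]


train_x = [float(i) for i in range(200)]
test_x = [i + 0.5 for i in range(200)]
train_y = noisy_line(train_x)
test_y = noisy_line(test_x)


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def memoriser(x):
    # High-variance model: return the target of the nearest training point.
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Low-variance model: a straight line fitted by ordinary least squares.
mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def line(x):
    return slope * x + intercept


memoriser_train_mse = mse(memoriser, train_x, train_y)
memoriser_test_mse = mse(memoriser, test_x, test_y)
line_train_mse = mse(line, train_x, train_y)
line_test_mse = mse(line, test_x, test_y)

print("memoriser train/test MSE:", memoriser_train_mse, memoriser_test_mse)
print("line      train/test MSE:", line_train_mse, line_test_mse)
```

The memoriser reproduces the training targets exactly, so its training error is zero, yet it carries the training noise into its test predictions. The fitted line never reaches zero training error, but its training and test errors would be expected to sit much closer together.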