Evaluating Model Performance

How to evaluate model performance and compare ground-truth annotations with model predictions.

Models are never perfect after the first training, and model predictions need to be compared with ground-truth annotations in order to iterate on the training process. This comparison often reveals incorrectly annotated data and sheds light on the types of data where the model fails to make the correct prediction.

This playbook demonstrates how to use Activeloop Deep Lake to:

  • Improve training data by finding data for which the model has poor performance

    • Train an object detection model using a Deep Lake dataset

    • Upload the training loss per image to a branch on the dataset designated for evaluating model performance

    • Sort the training dataset based on model loss and identify bad samples

    • Edit and clean the bad training data and commit the changes

  • Evaluate model performance on validation data and identify difficult data

    • Compute model predictions of object detections for a validation Deep Lake dataset

    • Upload the model predictions to the validation dataset, compare them to ground-truth annotations, and identify samples for which the model fails to make the correct predictions

Prerequisites

In addition to commonly used packages, this playbook requires installation of:

pip3 install deeplake
pip3 install albumentations
pip3 install opencv-python-headless==4.1.2.30 # Required for Albumentations to work properly

The required Python imports are:
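
The import block is not reproduced on this page; a plausible list, assuming the packages above plus PyTorch are installed, would be:

```python
# Hypothetical import list for this playbook - adjust to your environment
import deeplake
import numpy as np
import albumentations as A

import torch
import torchvision
```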

You should also register with Activeloop and create an API token in the UI.

Creating the Dataset

In this playbook we will use the svhn-train and -test datasets that are already hosted by Activeloop. Let's copy them to our own organization dl-corp in order to have write access:
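
A sketch of the copy step using deeplake.deepcopy; dataset paths follow the hub://&lt;org&gt;/&lt;name&gt; convention, and dl-corp is the example organization from this playbook:

```python
import deeplake

# Copy the hosted SVHN datasets into our own organization so we have
# write access ("dl-corp" is the example org - substitute your own)
ds_train = deeplake.deepcopy("hub://activeloop/svhn-train", "hub://dl-corp/svhn-train")
ds_test = deeplake.deepcopy("hub://activeloop/svhn-test", "hub://dl-corp/svhn-test")
```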

These are object detection datasets that localize address numbers on buildings:

Let's create a branch called training_run on both datasets for storing the model results.
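
Assuming the copied datasets are held in ds_train and ds_test, the branch can be created with checkout(..., create=True):

```python
# Create (and switch to) a branch for storing model results on both
# datasets; ds_train / ds_test hold the datasets copied above
for ds in [ds_train, ds_test]:
    ds.checkout("training_run", create=True)
```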

Since we will write the model results back to the Deep Lake datasets, let's create a group called model_evaluation in the datasets and add tensors that will store the model results.
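
One way to set this up is sketched below; the tensor names and htypes are illustrative choices, not the playbook's exact schema:

```python
# Add a "model_evaluation" group with tensors for the model results.
# The training dataset stores a per-image loss; the validation dataset
# stores predicted labels, boxes, and an IOU score per image.
with ds_train:
    ds_train.create_group("model_evaluation")
    ds_train.model_evaluation.create_tensor("loss")

with ds_test:
    ds_test.create_group("model_evaluation")
    ds_test.model_evaluation.create_tensor("iou")
    ds_test.model_evaluation.create_tensor("labels", htype="class_label")
    ds_test.model_evaluation.create_tensor("boxes", htype="bbox")
```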

Training an Object Detection Model

An object detection model can be trained using the same approach that is used for all Deep Lake datasets, with several examples in our tutorials. First, let's specify an augmentation pipeline, which mostly utilizes Albumentations. We also define several helper functions for resizing and converting the format of bounding boxes.
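
A minimal sketch of such a pipeline and one of the helpers, assuming bounding boxes are stored in COCO format (x, y, w, h); the resize target and the specific augmentations are assumptions:

```python
import numpy as np

def make_train_transform(width=128, height=64):
    """Albumentations training pipeline (a sketch; parameters are assumptions)."""
    import albumentations as A
    return A.Compose(
        [
            A.RandomSizedBBoxSafeCrop(width=width, height=height, erosion_rate=0.2),
            A.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
        ],
        bbox_params=A.BboxParams(format="pascal_voc", label_fields=["class_labels"]),
    )

def coco_2_pascal(boxes):
    """Convert boxes from COCO [x, y, w, h] to Pascal VOC [x0, y0, x1, y1]."""
    boxes = np.asarray(boxes, dtype=np.float32)
    return np.stack(
        (boxes[:, 0], boxes[:, 1], boxes[:, 0] + boxes[:, 2], boxes[:, 1] + boxes[:, 3]),
        axis=1,
    )
```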

We can now create a PyTorch dataloader that connects the Deep Lake dataset to the PyTorch model using the provided method ds.pytorch(). This method automatically applies the transformation function and takes care of random shuffling (if desired). The num_workers parameter can be used to parallelize data preprocessing, which is critical for ensuring that preprocessing does not bottleneck the overall training workflow.
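
A sketch of the dataloader call; the tensor names and the transform function are assumptions about the dataset's layout:

```python
# Connect the Deep Lake dataset to PyTorch
train_loader = ds_train.pytorch(
    num_workers=8,            # parallelize preprocessing
    shuffle=True,
    batch_size=16,
    tensors=["images", "labels", "boxes"],
    transform=transform_train,  # hypothetical function applying the pipeline above
    collate_fn=lambda batch: tuple(zip(*batch)),
)
```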

This playbook uses a pre-trained torchvision neural network from the torchvision.models module. We define helper functions for loading the model and for training 1 epoch.
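
These helpers might look as follows; Faster R-CNN is an assumption, and the head-replacement pattern is the standard torchvision fine-tuning recipe:

```python
import torch
import torchvision

def get_model_object_detection(num_classes):
    """Load a pre-trained detector and resize its head to num_classes."""
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = \
        torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes)
    return model

def train_one_epoch(model, optimizer, data_loader, device):
    """Run one epoch of training over the dataloader."""
    model.train()
    for images, labels, boxes in data_loader:
        images = [img.to(device) for img in images]
        targets = [
            {"labels": l.to(device), "boxes": b.to(device)}
            for l, b in zip(labels, boxes)
        ]
        loss_dict = model(images, targets)  # dict of losses for the batch
        loss = sum(loss_dict.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```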

Training is performed on a GPU if one is available; otherwise, it falls back to the CPU.

Let's initialize the model and optimizer:
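
A sketch of the setup, where get_model_object_detection is the hypothetical loading helper described above, and 11 classes (10 digits plus background) is an assumption for SVHN:

```python
import torch

# Use the GPU when available, otherwise the CPU
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

model = get_model_object_detection(num_classes=11)
model.to(device)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
```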

The model and data are ready for training 🚀!

Evaluating Model Performance on Training Data

Evaluating the performance of the model on a per-image basis can be a powerful tool for identifying bad or difficult data. First, we define a helper function that performs a forward pass through the model and computes the loss per image, without updating the weights. Since the model outputs the loss per batch, this function requires a batch size of 1.
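
Such a helper might look like this sketch (the target format mirrors torchvision's detection API):

```python
import torch

def evaluate_loss(model, images, targets, device):
    """Forward pass without weight updates, returning the summed loss.
    Batch size must be 1 so the batch loss equals the per-image loss."""
    model.train()  # detection models only return losses in train mode
    with torch.no_grad():
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)
        return sum(l.item() for l in loss_dict.values())
```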

Next, let's create another PyTorch dataloader on the training dataset that is not shuffled, has a batch size of 1, uses the evaluation transform, and returns the indices of the current batch using return_index=True:
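
A sketch of that dataloader, with the same assumed tensor names as before:

```python
# Evaluation dataloader: unshuffled, batch size 1, with sample indices
train_loader_eval = ds_train.pytorch(
    num_workers=8,
    shuffle=False,
    batch_size=1,
    tensors=["images", "labels", "boxes"],
    transform=transform_eval,  # hypothetical transform without augmentation
    collate_fn=lambda batch: tuple(zip(*batch)),
    return_index=True,
)
```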

Finally, we evaluate the loss for each image, write it back to the dataset, and add a commit to the training_run branch that we created at the start of this playbook:
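
A sketch of that loop; evaluate_loss is the hypothetical helper above, and because the loader is unshuffled with batch size 1, appending keeps losses aligned with sample order:

```python
# Compute and store the per-image loss, then commit the results
with ds_train:
    for images, labels, boxes, indices in train_loader_eval:
        targets = [{"labels": labels[0], "boxes": boxes[0]}]
        loss = evaluate_loss(model, images, targets, device)
        ds_train.model_evaluation.loss.append(loss)

ds_train.commit("Added per-image training loss on branch training_run")
```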

Cleanup and Reverting Mistakes in The Workflow

If you make a mistake you can use the following commands to start over or delete the new data:

  • Delete data in a tensor: ds.<tensor_name>.clear()

  • Delete the entire tensor and its data: ds.delete_tensor(<tensor_name>)

  • Reset all edits since the prior commit: ds.reset()

  • Delete the branch you just created: ds.delete_branch(<branch_name>)

    • Must be on another branch, and deleted branch must not have been merged to another.

Inspecting the Training Dataset based on Model Results

The dataset can be sorted based on loss in Activeloop Platform. An inspection of the high-loss images immediately reveals that many of them have poor quality or are incorrectly annotated.

We can edit some of the bad data by deleting the incorrect annotation of "1" at index 14997, and by removing the poor-quality samples at indices 2899 and 32467.

Lastly, we commit the edits in order to permanently store this snapshot of the data.
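
These edits might be performed as follows; the assumption that the digit "1" maps to class id 1, and the flat label layout, are illustrative:

```python
with ds_train:
    # Drop the incorrect "1" annotation at index 14997
    labels = ds_train.labels[14997].numpy()
    keep = labels != 1
    ds_train.labels[14997] = labels[keep]
    ds_train.boxes[14997] = ds_train.boxes[14997].numpy()[keep]

    # Remove the poor-quality samples entirely; pop the higher index
    # first so the lower index is still valid
    ds_train.pop(32467)
    ds_train.pop(2899)

ds_train.commit("Removed incorrect annotation and poor-quality samples")
```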

The next step would be to perform a more exhaustive inspection of the high-loss data and make further improvements to the dataset, after which the model should be re-trained.

Evaluating Model Performance on Validation Data

After iterating on the training data and re-training the model, a general assessment of model performance should be performed on validation data that was not used to train the model. We create a helper function for running inference of the model on the validation data that returns the model predictions and the average IOU (intersection-over-union) for each sample:
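
The IOU bookkeeping inside such a helper might look like this sketch; the greedy best-match scheme is an assumption, not necessarily the playbook's exact matching logic:

```python
import numpy as np

def box_iou(box_a, box_b):
    """IOU of two boxes in [x0, y0, x1, y1] format."""
    x0, y0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x1, y1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def average_best_iou(pred_boxes, gt_boxes):
    """Average best-match IOU over one sample's ground-truth boxes."""
    if len(gt_boxes) == 0:
        return 1.0 if len(pred_boxes) == 0 else 0.0
    best = [max((box_iou(p, gt) for p in pred_boxes), default=0.0) for gt in gt_boxes]
    return float(np.mean(best))

def evaluate_iou(model, images, gt_boxes, device):
    """Run the model on one sample and return (predictions, average IOU).
    Torch is imported lazily so the pure helpers above stay standalone."""
    import torch
    model.eval()
    with torch.no_grad():
        prediction = model([images[0].to(device)])[0]
    pred_boxes = prediction["boxes"].cpu().numpy()
    return prediction, average_best_iou(pred_boxes, gt_boxes)
```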

Let's create a PyTorch dataloader using the validation data and run the inference using evaluate_iou above.

Finally, we write the predictions back to the dataset and add a commit to the training_run branch that we created at the start of this playbook:
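
A sketch of the inference-and-write-back loop, assuming an unshuffled validation dataloader (test_loader_eval, batch size 1, return_index=True) built like the one for the training data, and the evaluate_iou helper described above:

```python
import numpy as np

with ds_test:
    for images, labels, boxes, indices in test_loader_eval:
        prediction, iou = evaluate_iou(model, images, boxes[0], device)

        ds_test.model_evaluation.boxes.append(prediction["boxes"].cpu().numpy())
        ds_test.model_evaluation.labels.append(
            prediction["labels"].cpu().numpy().astype(np.uint32)
        )
        ds_test.model_evaluation.iou.append(iou)

ds_test.commit("Added model predictions and IOU on branch training_run")
```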

Comparing Model Results to Ground-Truth Annotations

When sorting the model predictions based on IOU, we observe that the model successfully makes correct predictions for images that contain a single street number whose digits are large relative to the image. However, the model's predictions are very poor for data with small street numbers, and it misinterprets vertical objects, such as narrow windows, as the number "1".

Understanding the edge cases for which the model makes incorrect predictions is critical for improving the model performance. If the edge cases are irrelevant given the model's intended use, they should be eliminated from both the training and validation data. If they are applicable, more representative edge cases should be added to the training dataset, or the edge cases should be sampled more frequently while training.

Congratulations 🚀! You can now use Activeloop Deep Lake to evaluate the performance of your deep-learning models and compare their predictions to the ground truth!
