Evaluating Model Performance
How to compare ground-truth annotations with model predictions
Models are never perfect after the first training, and model predictions need to be compared with ground-truth annotations in order to iterate on the training process. This comparison often reveals incorrectly annotated data and sheds light on the types of data where the model fails to make the correct prediction.
This playbook demonstrates how to use Activeloop Deep Lake to:
Improve training data by finding data for which the model has poor performance
Train an object detection model using a Deep Lake dataset
Upload the training loss per image to a branch on the dataset designated for evaluating model performance
Sort the training dataset based on model loss and identify bad samples
Edit and clean the bad training data and commit the changes
Evaluate model performance on validation data and identify difficult data
Compute model predictions of object detections for a validation Deep Lake dataset
Upload the model predictions to the validation dataset, compare them to ground-truth annotations, and identify samples for which the model fails to make correct predictions.
Prerequisites
In addition to installation of commonly used packages, this playbook requires installation of:
```shell
pip3 install deeplake
pip3 install albumentations
pip3 install opencv-python-headless==4.1.2.30  # Required for Albumentations to work properly
```

The required python imports are:
You should also register with Activeloop and create an API token in the UI.
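Once created, the token can be supplied to Deep Lake either through an environment variable or as a parameter when loading a dataset. A minimal sketch (the token value is a placeholder):

```python
import os

# Placeholder value; paste the token you created in the Activeloop UI
os.environ["ACTIVELOOP_TOKEN"] = "<your_activeloop_token>"

# Alternatively, pass it explicitly when loading a dataset:
# ds = deeplake.load("hub://<org>/<dataset>", token=os.environ["ACTIVELOOP_TOKEN"])
```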
Creating the Dataset
In this playbook we will use the svhn-train and -test datasets that are already hosted by Activeloop. Let's copy them to our own organization dl-corp in order to have write access:
These are object detection datasets that localize address numbers on buildings:
Let's create a branch called training_run on both datasets for storing the model results.
Since we will write the model results back to the Deep Lake datasets, let's create a group called model_evaluation in the datasets and add tensors that will store the model results.
Putting the model results in a separate group will prevent the visualizer from confusing the predictions and ground-truth data.
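The branch and tensor creation might look like the following sketch; the tensor names and htypes are assumptions based on the description above:

```python
def prepare_results_tensors(ds_train, ds_test, branch="training_run"):
    # Create a branch on both datasets for storing the model results
    for ds in (ds_train, ds_test):
        ds.checkout(branch, create=True)

    # Per-image training loss on the training dataset
    ds_train.create_tensor("model_evaluation/loss", htype="generic", dtype="float32")

    # Predictions and per-image IOU on the validation dataset. Putting
    # these in a 'model_evaluation' group keeps the visualizer from
    # confusing them with the ground-truth tensors.
    ds_test.create_tensor("model_evaluation/labels", htype="class_label")
    ds_test.create_tensor("model_evaluation/boxes", htype="bbox")
    ds_test.create_tensor("model_evaluation/iou", htype="generic", dtype="float32")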
Training an Object Detection Model
An object detection model can be trained using the same approach that is used for all Deep Lake datasets, and several examples are available in our tutorials. First, let's specify an augmentation pipeline, which mostly utilizes Albumentations. We also define several helper functions for resizing and converting the format of bounding boxes.
We can now create a PyTorch dataloader that connects the Deep Lake dataset to the PyTorch model using the provided method ds.pytorch(). This method automatically applies the transformation function and takes care of random shuffling (if desired). The num_workers parameter can be used to parallelize data preprocessing, which is critical for ensuring that preprocessing does not bottleneck the overall training workflow.
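A sketch of the dataloader creation; the tensor names and the `transform` function are assumptions, and the custom `collate_fn` is needed because detection targets vary in size from image to image:

```python
def collate_fn(batch):
    # Detection models take lists of images and targets of varying
    # sizes, so keep the batch as tuples instead of stacking tensors.
    return tuple(zip(*batch))

def make_train_loader(ds, transform, batch_size=8, num_workers=4):
    # ds.pytorch() wraps the Deep Lake dataset in a PyTorch dataloader
    # and applies 'transform' to each sample.
    return ds.pytorch(
        transform=transform,
        tensors=["images", "boxes", "labels"],  # assumed tensor names
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        collate_fn=collate_fn,
    )
```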
This playbook uses a pre-trained neural network from the torchvision.models module. We define helper functions for loading the model and for training one epoch. Training is performed on a GPU if one is available; otherwise, it runs on the CPU.
Let's initialize the model and optimizer:
The model and data are ready for training!
Evaluating Model Performance on Training Data
Evaluating the performance of the model on a per-image basis can be a powerful tool for identifying bad or difficult data. First, we define a helper function that does a forward-pass through the model and computes the loss per image, without updating the weights. Since the model outputs the loss per batch, this function requires that the batch size is 1.
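A sketch of such a helper, assuming a torchvision-style detection model that returns a dict of losses in train mode:

```python
import torch

def evaluate_loss(model, images, targets, device):
    # Forward pass with the model in train mode (so torchvision returns
    # the losses) but without a backward pass, so no weights are updated.
    # The loss is reported per batch, so this requires batch_size == 1.
    model.train()
    with torch.no_grad():
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)
        return sum(loss.item() for loss in loss_dict.values())
```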
Next, let's create another PyTorch dataloader on the training dataset that is not shuffled, has a batch size of 1, uses the evaluation transform, and returns the index of each sample by passing return_index=True to the dataloader:
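A sketch of this loader; the tensor names, the `transform`, and the `collate_fn` are assumptions carried over from the training loader:

```python
def make_eval_loader(ds, transform, collate_fn, num_workers=4):
    # Unshuffled, batch size 1, and return_index=True so that each
    # batch also yields the dataset index of its sample.
    return ds.pytorch(
        transform=transform,
        tensors=["images", "boxes", "labels"],  # assumed tensor names
        batch_size=1,
        shuffle=False,
        num_workers=num_workers,
        collate_fn=collate_fn,
        return_index=True,
    )
```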
Finally, we evaluate the loss for each image, write it back to the dataset, and add a commit to the training_run branch that we created at the start of this playbook:
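The loop might look like the following sketch; it assumes an `evaluate_loss` helper as described above is in scope, and that the loader is unshuffled with batch size 1, so losses append in dataset order:

```python
def write_losses(ds, model, loader, device):
    # 'evaluate_loss' is the per-image loss helper described above
    # (assumed to be in scope).
    with ds:  # keeps the dataset open for faster consecutive writes
        for images, targets, indices in loader:
            ds.model_evaluation.loss.append(
                evaluate_loss(model, images, targets, device)
            )
    ds.commit("Added per-image training loss on branch training_run")
```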
Cleanup and Reverting Mistakes in The Workflow
If you make a mistake, you can use the following commands to start over or delete the new data:

Delete data in a tensor:

`ds.<tensor_name>.clear()`

Delete the entire tensor and its data:

`ds.delete_tensor(<tensor_name>)`

Reset all edits since the prior commit:

`ds.reset()`

Delete the branch you just created (you must be on another branch, and the deleted branch must not have been merged into another):

`ds.delete_branch(<branch_name>)`
Inspecting the Training Dataset based on Model Results
The dataset can be sorted based on loss in Activeloop Platform. An inspection of the high-loss images immediately reveals that many of them have poor quality or are incorrectly annotated.
The sort feature in the video below was removed. To sort, please run the query:
`select * order by "model_evaluation/loss" desc`
We can edit some of the bad data by deleting the incorrect annotation of "1" at index 14997, and by removing the poor-quality samples at indices 2899 and 32467.
Lastly, we commit the edits in order to permanently store this snapshot of the data.
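These edits can be made in the app; programmatically, a sketch might look like the following (the kept annotation indices are hypothetical, and the tensor names are assumptions):

```python
def clean_training_data(ds):
    # Remove the two poor-quality samples; pop the higher index first so
    # the earlier index is not shifted by the removal.
    for idx in sorted([2899, 32467], reverse=True):
        ds.pop(idx)

    # Hypothetical fix for the incorrect "1" annotation at index 14997:
    # keep only the valid boxes/labels (the kept indices are made up here).
    # keep = [0]
    # ds.boxes[14997] = ds.boxes[14997].numpy()[keep]
    # ds.labels[14997] = ds.labels[14997].numpy()[keep]

    # Permanently store this snapshot of the data
    ds.commit("Removed bad samples and cleaned annotations")
```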
The next step would be to perform a more exhaustive inspection of the high-loss data and make further improvements to the dataset, after which the model should be re-trained.
Evaluating Model Performance on Validation Data
After iterating on the training data and re-training the model, a general assessment of model performance should be performed on validation data that was not used to train the model. We create a helper function that runs inference of the model on the validation data and returns the model predictions and the average IOU (intersection-over-union) for each sample:
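A sketch of such a helper, assuming a torchvision-style model that returns `boxes`, `labels`, and `scores` in eval mode; the score threshold is an assumption:

```python
import numpy as np
import torch

def box_iou(a, b):
    """IOU of two boxes in [x_min, y_min, x_max, y_max] format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate_iou(model, images, targets, device, score_thresh=0.5):
    # Run the model in eval mode to get predictions, then match each
    # ground-truth box to its best-overlapping prediction and average.
    model.eval()
    with torch.no_grad():
        preds = model([img.to(device) for img in images])
    pred = preds[0]  # batch size is assumed to be 1
    keep = pred["scores"].cpu().numpy() >= score_thresh
    boxes = pred["boxes"].cpu().numpy()[keep]
    labels = pred["labels"].cpu().numpy()[keep]
    gt = targets[0]["boxes"].cpu().numpy()
    if len(gt) == 0 or len(boxes) == 0:
        return boxes, labels, 0.0
    ious = [max(box_iou(g, p) for p in boxes) for g in gt]
    return boxes, labels, float(np.mean(ious))
```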
Let's create a PyTorch dataloader using the validation data and run the inference using evaluate_iou above.
Finally, we write the predictions back to the dataset and add a commit to the training_run branch that we created at the start of this playbook:
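A sketch of the write-back step, assuming the `evaluate_iou` helper above is in scope and the `model_evaluation` tensors were created earlier:

```python
import numpy as np

def write_predictions(ds, model, loader, device):
    # 'evaluate_iou' is the inference helper described above
    # (assumed to be in scope).
    with ds:  # keeps the dataset open for faster consecutive writes
        for images, targets, indices in loader:
            boxes, labels, iou = evaluate_iou(model, images, targets, device)
            ds.model_evaluation.boxes.append(np.asarray(boxes, dtype="float32"))
            ds.model_evaluation.labels.append(np.asarray(labels, dtype="uint32"))
            ds.model_evaluation.iou.append(np.float32(iou))
    ds.commit("Added model predictions and per-image IOU")
```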
Comparing Model Results to Ground-Truth Annotations
When sorting the model predictions based on IOU, we observe that the model successfully makes correct predictions in images that contain a single street number and where the digits are large relative to the image. However, the model performs poorly on data with small street numbers, and it sometimes misinterprets vertical objects, such as narrow windows, as the number "1".
The sort feature in the video below was removed. To sort, please run the query:
`select * order by "model_evaluation/iou" asc`
Understanding the edge cases for which the model makes incorrect predictions is critical for improving the model performance. If the edge cases are irrelevant given the model's intended use, they should be eliminated from both the training and validation data. If they are applicable, more representative edge cases should be added to the training dataset, or the edge cases should be sampled more frequently while training.
Congratulations! You can now use Activeloop Deep Lake to evaluate the performance of your deep-learning models and compare their predictions to the ground-truth!