Querying, Training and Editing Datasets with Data Lineage

How to use queries and version control to train models with reproducible data lineage.

The road from raw data to a trainable deep-learning dataset can be treacherous, often involving multiple tools glued together with spaghetti code. Activeloop simplifies this journey so you can create high-quality datasets and train production-level deep-learning models.

This playbook demonstrates how to use Activeloop Deep Lake to:

  • Create a Deep Lake dataset from data stored in an S3 bucket

  • Visualize the data to gain insights about the underlying data challenges

  • Update, edit, and store different versions of the data with reproducibility

  • Query the data, save the query result, and materialize it for training a model

  • Train an object detection model while streaming data

Prerequisites

In addition to commonly used packages, this playbook requires installation of:
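The exact package list was not preserved here; a minimal set covering the steps in this playbook, assuming the deeplake, Albumentations, and PyTorch stacks, would be:

```shell
pip install deeplake albumentations torch torchvision boto3
```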

The required Python imports are:

You should also register with Activeloop and create an API token in the UI.

Creating the Dataset

Since many real-world datasets use the COCO annotation format, the COCO training dataset is used in this playbook. To avoid data duplication, linked tensors are used to store references in the Deep Lake dataset to the images in the S3 bucket containing the original data. For simplicity, only the bounding box annotations are copied to the Deep Lake dataset.

To convert the original dataset to Deep Lake format, let's establish a connection to the original data in S3.

Next, let's load the annotations so we can access them later:
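The original snippet is not shown; a sketch that parses the standard COCO instances JSON using only the standard library:

```python
import json
from collections import defaultdict

def load_coco_annotations(path):
    """Load a COCO instances file and index bounding boxes by image id."""
    with open(path) as f:
        coco = json.load(f)

    # category id -> human-readable name
    categories = {c["id"]: c["name"] for c in coco["categories"]}

    # image id -> list of (bbox, category id); COCO boxes are [left, top, width, height]
    anns_by_image = defaultdict(list)
    for ann in coco["annotations"]:
        anns_by_image[ann["image_id"]].append((ann["bbox"], ann["category_id"]))

    # image id -> file name, so each image can be matched to its S3 key
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    return categories, anns_by_image, file_names
```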

Moving on, let's create an empty Deep Lake dataset and pull managed credentials from Platform, so that we don't have to manually specify the credentials for accessing the S3 links every time we use this dataset. Since the Deep Lake dataset is stored in Deep Lake storage, we also provide an API token to identify the user.

The UI for managed credentials in Platform is shown below, and more details are available here.

Last but not least, let's create the Deep Lake dataset's tensors. In this example, we ignore the segmentations and keypoints from the COCO dataset, only uploading the bounding box annotations as well as their labels.
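The tensor definitions might look like the following; the tensor names and the LTWH box format are assumptions based on COCO's conventions:

```python
def create_tensors(ds, class_names):
    # Images are stored as links to the original S3 objects, so no image
    # bytes are duplicated into the Deep Lake dataset.
    ds.create_tensor("images", htype="link[image]", sample_compression="jpeg")
    # COCO bounding boxes are [left, top, width, height] in pixels.
    ds.create_tensor("boxes", htype="bbox", coords={"type": "pixel", "mode": "LTWH"})
    # Integer class labels, with the COCO category names attached.
    ds.create_tensor("categories", htype="class_label", class_names=class_names)
```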

Finally, let's iterate through the data and append it to our Deep Lake dataset. Note that when appending data, we directly pass the S3 URL and the managed credentials key for accessing that URL using deeplake.link(url, creds_key).

Note: if dataset creation speed is a priority, it can be accelerated in two ways:

  • By uploading the dataset in parallel. An example is available here.

  • By setting the optional parameters below to False. In this case, the upload machine will not load any of the linked data before creating the dataset, which can speed up the upload by up to 100X. These parameters default to True because they enable queries on image shapes and file metadata, and they verify the integrity of the data before uploading. More information is available here:
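For reference, a sketch of the relevant parameters on the linked image tensor:

```python
def create_fast_image_tensor(ds):
    # With these set to False, the upload machine never reads the linked
    # image bytes, which is where the large speedup comes from.
    ds.create_tensor(
        "images",
        htype="link[image]",
        sample_compression="jpeg",
        verify=False,                     # skip integrity checks on linked data
        create_shape_tensor=False,        # skip image-shape metadata
        create_sample_info_tensor=False,  # skip file metadata (format, EXIF, ...)
    )
```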

Inspecting the Dataset

In this example, we will train an object detection model for driving applications. Therefore, we are interested in images containing cars, buses, trucks, bicycles, motorcycles, traffic lights, and stop signs, which we can find by running a SQL query on the dataset in Platform. More details on the query syntax are available here.
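The same query can also be issued from code; a Tensor Query Language sketch, assuming the categories tensor name used when the dataset was created:

```python
QUERY = (
    "select * where contains(categories, 'car') "
    "or contains(categories, 'bus') or contains(categories, 'truck') "
    "or contains(categories, 'bicycle') or contains(categories, 'motorcycle') "
    "or contains(categories, 'traffic light') or contains(categories, 'stop sign')"
)

def query_driving_samples(ds):
    # Returns a dataset view over the matching samples.
    return ds.query(QUERY)
```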

A quick visual inspection of the dataset reveals several problems with the data including:

  • Sample 61 is a low-quality image in which it is very difficult to discern the features, and it is not clear whether the small object in the distance is an actual traffic light. Images like this do not positively contribute to model performance, so let's delete all the data in this sample.

  • In sample 8, a road sign is labeled as a stop sign even though the sign is facing away from the camera. It may indeed be a stop sign, but computer vision systems should positively identify the type of a road sign based on its visible text. Therefore, let's remove the stop sign label from this image.
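Both fixes can be sketched in code, with each change committed separately so it appears in the version control history (sample indices as in the inspection above; tensor names are assumptions from the schema earlier):

```python
def fix_problem_samples(ds):
    # Delete all data for the low-quality sample at index 61.
    ds.pop(61)
    ds.commit("Removed low-quality sample")

    # Remove the incorrect stop sign annotation from sample 8, keeping the rest.
    labels = ds.categories[8].numpy()
    boxes = ds.boxes[8].numpy()
    stop_sign = ds.categories.info.class_names.index("stop sign")
    keep = labels != stop_sign
    ds.categories[8] = labels[keep]
    ds.boxes[8] = boxes[keep]
    ds.commit("Removed incorrect stop sign label")
```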

Both changes are now evident in the visualizer, and they were both logged as separate commits in the version control history. A summary of this inspection workflow is shown below:

Optimizing the Dataset for Training

Now that the dataset has been improved, we save the query result containing the samples of interest and optimize the data for training. Since query results are associated with a particular commit, they are immutable and can be retrieved at any point in time.

First, let's re-run the query and save the result as a dataset view, which is uniquely identified by an id.
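A sketch of this step, assuming the query string from before; save_view returns the id tied to the current commit:

```python
def save_driving_view(ds, query_str):
    view = ds.query(query_str)
    # The returned id uniquely identifies this view and can be used later
    # to reload exactly the same samples at the same commit.
    view_id = view.save_view(message="Driving-related samples")
    return view_id
```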

The dataset is currently storing references to the images in S3, so the images are not rapidly streamable for training. Therefore, we materialize the query result (Dataset View) by copying and re-chunking the data for maximum performance:
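Materialization is a single call; optimize=True copies and re-chunks the linked data into Deep Lake storage (a sketch, with view_id coming from the saved query result):

```python
def materialize_view(ds, view_id):
    # num_workers parallelizes the copy and re-chunking work.
    return ds.load_view(view_id, optimize=True, num_workers=4)
```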

Once we're finished using the materialized dataset view, we may choose to delete it using:
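For example (view_id as returned when the view was saved):

```python
def delete_materialized_view(ds, view_id):
    # Removes the saved view and its optimized copy; the source dataset
    # and its version history are unaffected.
    ds.delete_view(view_id)
```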

Training an Object Detection Model

An object detection model can be trained using the same approach that is used for all Deep Lake datasets, with several examples in our tutorials. Typically, training occurs on another machine with more GPU power, so we start by loading the dataset and the corresponding dataset view:

When using subsets of datasets, it's advisable to remap the input classes for model training. In this example, the source dataset has 81 classes, but we are only interested in 7 classes (cars, buses, trucks, bicycles, motorcycles, traffic lights, and stop signs). Therefore, we remap the classes of interest to the values 0 through 6 before feeding them into the model for training. We also specify the resolution for resizing the data before training the model.
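The remapping itself is plain bookkeeping; a sketch, assuming the class names below match the source dataset's class_names list:

```python
import numpy as np

# The 7 driving-related classes, remapped to contiguous ids 0-6. Note that
# some detectors (e.g. torchvision's) reserve label 0 for background, in
# which case these ids should be shifted up by 1.
CLASSES_OF_INTEREST = [
    "car", "bus", "truck", "bicycle", "motorcycle", "traffic light", "stop sign",
]

def make_remap(source_class_names):
    """Map source class ids -> contiguous ids 0..6 for the classes of interest."""
    return {
        source_class_names.index(name): new_id
        for new_id, name in enumerate(CLASSES_OF_INTEREST)
    }

def remap_labels(labels, remap):
    # Labels outside the remap (the other COCO classes) are dropped.
    return np.array([remap[l] for l in labels if l in remap], dtype=np.int64)
```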

Next, let's specify an augmentation pipeline, which mostly utilizes Albumentations. We perform the remapping of the class labels inside the transformation function.

You can now create a PyTorch dataloader that connects the Deep Lake dataset to the PyTorch model using the provided method ds_view.pytorch(). This method automatically applies the transformation function and takes care of random shuffling (if desired). The num_workers parameter can be used to parallelize data preprocessing, which is critical for ensuring that preprocessing does not bottleneck the overall training workflow.
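A sketch of the dataloader creation; the batch size and worker count are illustrative, and ds_view, transform, and remap are the (assumed) names of the objects from the preceding steps:

```python
from functools import partial

def collate_fn(batch):
    # Detection targets vary in size per image, so keep lists rather than
    # stacking them into a single tensor.
    return tuple(zip(*batch))

def make_train_loader(ds_view, transform, remap):
    return ds_view.pytorch(
        num_workers=2,   # parallel decoding / preprocessing
        shuffle=True,
        batch_size=4,
        tensors=["images", "boxes", "categories"],
        transform=partial(transform, remap=remap),
        collate_fn=collate_fn,
    )
```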

This playbook uses a pre-trained torchvision neural network from the torchvision.models module. We define helper functions for loading the model and for training one epoch.

Training is performed on a GPU if one is available; otherwise, it falls back to the CPU.

Let's initialize the model and optimizer.

The model and data are ready for training 🚀!
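Putting the pieces together. The names get_model_object_detection and train_one_epoch are hypothetical stand-ins for the helper functions described above, and train_loader is the dataloader built from the dataset view:

```python
import torch

def run_training(train_loader, num_epochs=1):
    # Train on a GPU when available; otherwise fall back to the CPU.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # 7 driving classes plus the background class torchvision detectors reserve.
    model = get_model_object_detection(num_classes=8)
    model.to(device)

    # Optimize only the trainable parameters; hyperparameters are illustrative.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)

    for _ in range(num_epochs):
        train_one_epoch(model, optimizer, train_loader, device)
    return model
```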

Congratulations 🚀. You can now use Activeloop Deep Lake to edit and version control your datasets, as well as query datasets and train models on the results, all while maintaining data lineage!
