Data Processing Using Parallel Computing

Deep Lake offers built-in methods for parallelizing dataset computations in order to achieve faster data processing.

How to use deeplake.compute for parallelizing workflows

This tutorial is also available as a Colab Notebook.

Step 8 in the Getting Started Guide highlights how deeplake.compute can be used to rapidly upload datasets. This tutorial expands further and highlights the power of parallel computing for dataset processing.

Transformations on New Datasets

Computer vision applications often require users to process and transform their data. For example, you may perform perspective transforms, resize images, adjust their coloring, or apply many other operations. In this example, a flipped version of the MNIST dataset is created, which may be useful for training a model that identifies text in scenes where the camera orientation is unknown.

First, let's define a function that will flip the dataset images.

import deeplake
from PIL import Image
import numpy as np

@deeplake.compute
def flip_vertical(sample_in, sample_out):
    ## The first two arguments are always present:
    #     1st argument is an element of the input iterable (list, dataset, array, ...)
    #     2nd argument is a dataset sample

    # Append the label and the vertically flipped image to the output sample
    sample_out.append({'labels': sample_in.labels.numpy(),
                       'images': np.flip(sample_in.images.numpy(), axis = 0)})

    return sample_out
Next, the existing MNIST dataset is loaded, and deeplake.like is used to create an empty dataset with the same tensor structure.

ds_mnist = deeplake.load('deeplake://activeloop/mnist-train')

# We use overwrite=True to make this code re-runnable
ds_mnist_flipped = deeplake.like('./mnist_flipped', ds_mnist, overwrite = True)

Finally, the flipping operation is evaluated for the first 100 samples in the input dataset ds_mnist, and the result is automatically stored in ds_mnist_flipped.

flip_vertical().eval(ds_mnist[0:100], ds_mnist_flipped, num_workers = 2)
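
Beyond num_workers, .eval() also exposes a few execution options. A minimal sketch, assuming the Deep Lake 3.x API (confirm the exact parameters against your installed version):

flip_vertical().eval(
    ds_mnist[0:100],
    ds_mnist_flipped,
    num_workers=2,
    scheduler='threaded',  # scheduler used to run the workers
    progressbar=True,      # display progress during evaluation
)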

Let's check out the flipped images:

Image.fromarray(ds_mnist.images[0].numpy())
Image.fromarray(ds_mnist_flipped.images[0].numpy())
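
To confirm the result numerically rather than visually, a quick check (not part of the original tutorial) can compare the stored copy against np.flip applied on the fly:

# The stored flipped image should equal the original flipped along axis 0
assert np.array_equal(ds_mnist_flipped.images[0].numpy(),
                      np.flip(ds_mnist.images[0].numpy(), axis = 0))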

Transformations on Existing Datasets

In the previous example, a new dataset was created while performing a transformation. In this example, a transformation is used to modify an existing dataset.

First, download and unzip the small classification dataset below, called animals (animals.zip, 338 KB).

Next, use deeplake.ingest_classification to automatically convert this image classification dataset into Deep Lake format and save it in ./animals_deeplake.

ds = deeplake.ingest_classification('./animals', './animals_deeplake') # Creates the dataset
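
For reference, deeplake.ingest_classification infers the labels from the subfolder names, so the unzipped data is expected to follow a layout like the sketch below (the class names here are illustrative, not taken from the archive):

animals
├── class_1
│   ├── image_1.jpg
│   └── ...
└── class_2
    ├── image_1.jpg
    └── ...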

The first image in the dataset is a picture of a cat:

Image.fromarray(ds.images[0].numpy())

The images in the dataset can now be flipped by evaluating the flip_vertical() transformation function from the previous example. If a second dataset is not specified as an input to .eval(), the transformation is applied to the input dataset.

flip_vertical().eval(ds, num_workers = 2)

The picture of the cat is now flipped:

Image.fromarray(ds.images[0].numpy())

Dataset Processing Pipelines

In order to modularize your dataset processing, it is helpful to create functions for specific data processing tasks and combine them in pipelines. In this example, you can create a pipeline using the flip_vertical function from the first example and the resize function below.

@deeplake.compute
def resize(sample_in, sample_out, new_size):
    ## The first two arguments are always present:
    #     1st argument is an element of the input iterable (list, dataset, array, ...)
    #     2nd argument is a dataset sample
    ## The third argument is the required size for the output images

    # Append the label and the resized image to the output sample
    sample_out.labels.append(sample_in.labels.numpy())
    sample_out.images.append(np.array(Image.fromarray(sample_in.images.numpy()).resize(new_size)))

    return sample_out

Functions decorated using deeplake.compute can be combined into pipelines using deeplake.compose. Required arguments for the functions must be passed into the pipeline in this step:

pipeline = deeplake.compose([flip_vertical(), resize(new_size = (64,64))])

Just like in the single-function example above, the input and output datasets are created first, and the pipeline is evaluated for the first 100 samples in the input dataset ds_mnist. The result is automatically stored in ds_mnist_pipe.

# We use overwrite=True to make this code re-runnable
ds_mnist_pipe = deeplake.like('./mnist_pipeline', ds_mnist, overwrite = True)
pipeline.eval(ds_mnist[0:100], ds_mnist_pipe, num_workers = 2)
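
As a quick sanity check on the pipeline output (assuming the source MNIST images are 28x28 grayscale), the first image should now have the requested shape:

print(ds_mnist_pipe.images[0].numpy().shape)  # expected: (64, 64)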

Recovering From Errors

If an error occurs related to a specific sample_in, deeplake.compute will throw a TransformError, and the error-causing index or sample can be caught using:

from deeplake.util.exceptions import TransformError

try:
    compute_fn.eval(...)
except TransformError as e:
    failed_idx = e.index
    failed_sample = e.sample

The traceback also typically shows information such as the filename of the data that was causing issues. Once the problematic sample has been identified, it should be removed from the list of input samples, and the deeplake.compute function should be re-executed.
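
As a minimal recovery sketch (assuming the failing index was captured as failed_idx in the except block above, and that the dataset supports indexing with a list of indices):

# Hypothetical recovery: re-run the transform on the remaining samples
remaining = [i for i in range(100) if i != failed_idx]
flip_vertical().eval(ds_mnist[remaining], ds_mnist_flipped, num_workers=2)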

Congrats! You just learned how to parallelize your computations using Deep Lake! 🎉
