Shuffling in dataloaders

Understanding data shuffling in Deep Lake's PyTorch dataloader

It is important to understand the pseudo-random shuffling in Deep Lake's dataloaders because it may affect model performance in some cases.

How Shuffling Works in Deep Lake's PyTorch DataLoader

The Deep Lake shuffling algorithm is based on a shuffle buffer that preloads a specified amount of data (in MB), set by the buffer_size parameter, as in ds.pytorch(buffer_size = 2048). First, the dataloader randomly selects chunks from the applicable tensors until the shuffle buffer is full. Next, the indices in the shuffle buffer are randomly sampled to construct the batches that are returned by the dataloader. As the data in the shuffle buffer is consumed, new chunks are randomly selected and added to the buffer.

  • In the OSS dataloader, the shuffle buffer contains the decompressed, decoded, and transformed samples. When using the PyTorch dataloaders, this corresponds to torch tensors.

  • In the Performant dataloader, the shuffle buffer contains the non-decompressed data in the format in which it is stored. For images, this typically corresponds to compressed bytes in JPEG, PNG, or other compression formats.

    • Since compressed data is stored more efficiently than uncompressed data, there are typically more distinct samples of data in the Performant dataloader shuffle buffer compared to the OSS shuffle buffer.
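The buffer behavior described above can be sketched in plain Python. This is a minimal illustration and not Deep Lake's implementation: the function name chunk_buffer_batches is hypothetical, and the buffer capacity is counted in samples as a stand-in for the size in MB.

```python
import random


def chunk_buffer_batches(chunks, buffer_capacity, batch_size, seed=0):
    """Yield batches sampled from a buffer that is refilled one chunk at a time.

    chunks: list of lists, each inner list is one stored chunk of samples.
    buffer_capacity: maximum number of samples held in the buffer
        (an assumed stand-in for the buffer size in MB).
    """
    rng = random.Random(seed)
    pending = list(range(len(chunks)))
    rng.shuffle(pending)  # chunks are selected in random order
    buffer, batch = [], []
    while True:
        # Refill: load whole chunks while they fit. An empty buffer always
        # loads the next chunk so that the loop cannot stall.
        while pending and (
            not buffer
            or len(buffer) + len(chunks[pending[-1]]) <= buffer_capacity
        ):
            buffer.extend(chunks[pending.pop()])
        if not buffer:
            break
        # Draw one random sample from the buffer (swap with last, then pop).
        i = rng.randrange(len(buffer))
        buffer[i], buffer[-1] = buffer[-1], buffer[i]
        batch.append(buffer.pop())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# Eight chunks of ten samples each; sample x belongs to chunk x // 10.
chunks = [list(range(c * 10, (c + 1) * 10)) for c in range(8)]
batches = list(chunk_buffer_batches(chunks, buffer_capacity=30, batch_size=4))
flat = [x for b in batches for x in b]
print(batches[:3])
```

Every sample is returned exactly once, but the first samples can only come from the few chunks that fit in the buffer at the start, which is the source of the reduced randomness discussed on this page.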

If many chunks in the buffer contain data from the same class, which may occur if data was uploaded in non-random order, the shuffle buffer may contain fewer unique classes than if the samples were chosen fully randomly based on index. The most extreme case of reduced randomness occurs when datasets are much larger than the shuffle buffer, when they have many classes, and when those classes occur in sequence within the dataset indices.

One example dataset is Unshuffled ImageNet, which has 1000 classes, 1.2M images, 140GB of data, and approximately 140 images per 16MB chunk. When the images are uploaded in sequence, the plot below shows how many unique classes are returned by the loader vs the number of images that have been returned in total. It is evident that fully random sampling returns more unique classes than the Deep Lake dataloader.
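A rough back-of-the-envelope calculation illustrates the effect for these numbers. It is a sketch under stated assumptions: the 2048MB buffer is taken from the example call earlier on this page, the buffer is assumed to hold compressed chunks (as in the Performant dataloader), classes are assumed to be equally sized and stored strictly in order, and the helper name buffer_class_coverage is hypothetical.

```python
import math


def buffer_class_coverage(buffer_mb, chunk_mb, images_per_chunk,
                          num_images, num_classes):
    """Rough bounds on class diversity in a chunk-filled shuffle buffer.

    Assumes equally sized classes stored in class order and chunks that
    all hold the same number of images.
    """
    chunks_in_buffer = buffer_mb // chunk_mb
    images_in_buffer = chunks_in_buffer * images_per_chunk
    images_per_class = num_images // num_classes
    # A run of n consecutive images spans at most ceil((n - 1) / s) + 1
    # classes when every class has s images.
    max_classes_per_chunk = math.ceil((images_per_chunk - 1) / images_per_class) + 1
    max_classes_chunked = min(num_classes, chunks_in_buffer * max_classes_per_chunk)
    # Expected number of distinct classes if the same number of images were
    # drawn uniformly at random (with replacement, as an approximation).
    expected_classes_random = num_classes * (
        1 - (1 - 1 / num_classes) ** images_in_buffer
    )
    return (chunks_in_buffer, images_in_buffer,
            max_classes_chunked, expected_classes_random)


result = buffer_class_coverage(2048, 16, 140, 1_200_000, 1000)
print(result)
```

Under these assumptions the buffer holds 128 chunks (17,920 images) that can cover at most 256 of the 1000 classes, whereas the same number of fully random draws would be expected to cover essentially all of them.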

If reduced randomness has an impact on model performance in your workflows, the recommended countermeasures are:

  • Store the dataset in a shuffled fashion such that the data does not appear in order by class. This completely mitigates the randomness concerns at the output of the data loader.

  • Store the dataset with a smaller chunk size. This increases randomness because the shuffle buffer selects more discrete chunks before filling up. The current default size is 8MB, and reducing chunk size to 4MB significantly increases randomness (see plot above) with only a modest slowdown in data transfer speed.

  • Increase the size of the shuffle buffer. This mitigates the randomness concerns but may not completely alleviate them.
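As an illustration of the first countermeasure, the upload order can be shuffled before any data is written. The sketch below uses only the Python standard library. The file listing and the helper names shuffle_for_upload and classes_per_chunk are hypothetical, and the upload step itself is omitted.

```python
import random


def shuffle_for_upload(samples, seed=0):
    """Return a shuffled copy of (path, label) pairs; the input is left untouched."""
    order = list(samples)
    random.Random(seed).shuffle(order)
    return order


def classes_per_chunk(samples, samples_per_chunk):
    """Count distinct labels in each consecutive group of samples_per_chunk items."""
    return [
        len({label for _, label in samples[i:i + samples_per_chunk]})
        for i in range(0, len(samples), samples_per_chunk)
    ]


# Hypothetical class-sorted listing: 20 classes with 50 files each.
sorted_samples = [(f"class_{c}/img_{i}.jpg", c) for c in range(20) for i in range(50)]
upload_order = shuffle_for_upload(sorted_samples)

# Groups of 25 stand in for chunks: in class order each group holds a single
# class, while the shuffled order mixes many classes into every group.
print(classes_per_chunk(sorted_samples, 25)[:5])
print(classes_per_chunk(upload_order, 25)[:5])
```

Appending samples to the dataset in upload_order rather than in class order means that each stored chunk already contains a mix of classes, so chunk-level selection in the shuffle buffer no longer limits class diversity.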
