v3.1.1
Data Layout

Understanding the data layout in Deep Lake

Last updated 2 years ago

Understanding Deep Lake's Data Layout

Tensors

Hidden Tensors

When data is appended to Deep Lake, certain important information is extracted and duplicated in a separate hidden tensor, so that the information can be accessed and queried without loading all of the data. Examples include the shape of a sample (e.g. the width, height, and number of channels of an image) and the metadata from the headers of files that were passed to deeplake.read('filename').
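The idea can be illustrated with a minimal sketch in plain Python. It is a toy model, not Deep Lake's implementation: the lists `images` and `image_shapes` and the helpers `append_image` and `make_image` are hypothetical names introduced here only to show how a small side tensor lets a shape query run without reading the full data.

```python
# Toy model only: plain Python lists stand in for tensors.
# All names below are hypothetical and not part of the Deep Lake API.

images = []        # "visible" tensor: the full (potentially large) payloads
image_shapes = []  # "hidden" tensor: small per-sample metadata at the same indices


def make_image(height, width, channels=3):
    """Build a dummy image as nested lists (height x width x channels)."""
    return [[[0] * channels for _ in range(width)] for _ in range(height)]


def append_image(pixels):
    """Append an image and duplicate its shape into the side tensor."""
    height = len(pixels)
    width = len(pixels[0])
    channels = len(pixels[0][0])
    images.append(pixels)
    image_shapes.append((height, width, channels))


append_image(make_image(4, 6))
append_image(make_image(10, 10))
append_image(make_image(2, 8, channels=1))

# Query on shape by reading only the small side tensor; `images` is not touched.
wide_indices = [i for i, (h, w, c) in enumerate(image_shapes) if w > h]
print(wide_indices)
```

Because the shape list is tiny compared with the pixel data, a filter such as "images wider than they are tall" only has to scan the side tensor.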

Indexing and Samples

Deep Lake datasets and their tensors are indexed, and the data at a given index across multiple tensors are collectively referred to as a sample. Data at the same index are assumed to be related. For example, data in the bbox tensor at index 100 is assumed to be related to data in the image tensor at index 100.
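As a rough illustration of index alignment, the sketch below models a dataset as a dictionary of equally long columns. It is a toy model in plain Python; the `dataset` contents and the `get_sample` helper are hypothetical and do not reflect how Deep Lake stores or returns data.

```python
# Toy model only: each key is a "tensor", each list position is an index.
dataset = {
    "image": ["img_0.jpg", "img_1.jpg", "img_2.jpg"],
    "bbox": [[10, 20, 50, 60], [0, 0, 32, 32], [5, 5, 100, 80]],
    "label": ["cat", "dog", "cat"],
}


def get_sample(ds, index):
    """Collect the entry at `index` from every tensor into one sample."""
    return {name: column[index] for name, column in ds.items()}


sample = get_sample(dataset, 1)
print(sample)
```

The sample at index 1 is simply the entry at position 1 of every tensor, which is why entries at the same index are treated as belonging together.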

Chunking

Most data in Deep Lake format is stored in chunks, which are blobs of data of a pre-defined size. The purpose of chunking is to accelerate the streaming of data across networks by increasing the amount of data that is transferred per network request.

Each tensor has its own chunks, and the default chunk size is 8 MB. A single chunk consists of data from multiple indices when the individual data points (image, label, annotation, etc.) are smaller than the chunk size. Conversely, when individual data points are larger than the chunk size, the data is split among multiple chunks (tiling).

Video data is an exception to this chunking logic. Videos that are larger than the specified chunk size are not broken into smaller pieces, because Deep Lake uses efficient libraries to stream and access subsets of videos, which makes it unnecessary to split them apart.
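The packing and tiling behavior described above can be sketched with a simple greedy planner. This is an assumption-laden toy model, not Deep Lake's actual chunking algorithm: the function `plan_chunks`, its `tile_large` flag (used here to mimic the video exception), and the greedy fill strategy are all hypothetical. Only the 8 MB default comes from the text.

```python
MB = 1024 * 1024
CHUNK_SIZE = 8 * MB  # default chunk size stated in the text


def plan_chunks(sample_sizes, chunk_size=CHUNK_SIZE, tile_large=True):
    """Greedy toy planner (hypothetical, not Deep Lake's algorithm).

    Returns a list of chunks; each chunk is a list of (sample_index, n_bytes).
    Small samples share a chunk. A sample larger than `chunk_size` is tiled
    across several chunks, unless `tile_large` is False (video-like data),
    in which case it is kept whole in one oversized chunk.
    """
    chunks = []
    current, used = [], 0
    for index, size in enumerate(sample_sizes):
        if size > chunk_size:
            if current:  # close the partially filled chunk first
                chunks.append(current)
                current, used = [], 0
            if tile_large:
                remaining = size
                while remaining > 0:
                    piece = min(remaining, chunk_size)
                    chunks.append([(index, piece)])
                    remaining -= piece
            else:
                chunks.append([(index, size)])
            continue
        if current and used + size > chunk_size:
            chunks.append(current)
            current, used = [], 0
        current.append((index, size))
        used += size
    if current:
        chunks.append(current)
    return chunks


sizes = [3 * MB, 3 * MB, 3 * MB, 20 * MB, 1 * MB]
tiled = plan_chunks(sizes)
untiled = plan_chunks(sizes, tile_large=False)
print(len(tiled), len(untiled))
```

In this sketch the first two 3 MB samples share a chunk, the 20 MB sample is tiled into pieces of 8 MB, 8 MB, and 4 MB, and with `tile_large=False` the same sample stays in one piece, as the text describes for videos.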

Groups

Deep Lake uses a columnar storage architecture, and the columns in Deep Lake are referred to as tensors. Data in the tensors can be added or modified, and the data in different tensors are independent of each other.

Multiple tensors can be combined into groups. Groups do not fundamentally change the way data is stored, but they are useful for helping Activeloop Platform understand how different tensors are related.
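One way to picture a group is as a shared name prefix over otherwise independent columns. The sketch below is a toy model in plain Python: the `tensors` dictionary, the "/" separator, and the `group_members` helper are assumptions made for illustration, not a description of Deep Lake's storage format or API.

```python
# Toy model only: every tensor is still its own column; a group is just a
# shared name prefix (the "/" separator is an assumption of this sketch).
tensors = {
    "images": ["img_0.jpg", "img_1.jpg"],
    "annotations/bbox": [[10, 20, 50, 60], [0, 0, 32, 32]],
    "annotations/label": ["cat", "dog"],
}


def group_members(tensor_names, group):
    """List the tensors that belong to `group`, without the group prefix."""
    prefix = group + "/"
    return sorted(
        name[len(prefix):] for name in tensor_names if name.startswith(prefix)
    )


members = group_members(tensors, "annotations")
print(members)
```

Grouping changes nothing about the columns themselves in this model. It only records that `bbox` and `label` belong together, which is the kind of relationship information the text says groups convey.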