Step 2: Creating Deep Lake Datasets

Creating and storing Deep Lake Datasets.


How to Create Datasets in Deep Lake Format

This guide creates Deep Lake datasets locally. You can also create datasets in the Activeloop cloud by registering, creating an API token, and replacing the local paths below with your Deep Lake organization path hub://org_id/dataset_name.

You don't have to worry about uploading datasets after you've created them. They are automatically synchronized wherever they are stored.
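
As a minimal sketch of the cloud workflow (org_id is a placeholder for your organization name, and the token is assumed to have already been created in the Activeloop app), only the dataset path changes:

import os
import deeplake

os.environ['ACTIVELOOP_TOKEN'] = '<your_token>'  # placeholder token

# Identical to local creation; only the path points to your organization
ds = deeplake.empty('hub://org_id/animals')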

Manual Creation

Let's follow along with the example below to create our first dataset manually. First, download and unzip the small classification dataset called animals (animals.zip, 338KB).

The dataset has the following folder structure:

animals
|_cats
    |_image_1.jpg
    |_image_2.jpg
|_dogs
    |_image_3.jpg
    |_image_4.jpg

Now that you have the data, you can create a Deep Lake Dataset and initialize its tensors. Running the following code will create a Deep Lake dataset inside the ./animals_deeplake folder.

import deeplake
from PIL import Image
import numpy as np
import os

ds = deeplake.empty('./animals_deeplake') # Create the dataset locally

Next, let's inspect the folder structure for the source dataset './animals' to find the class names and the files that need to be uploaded to the Deep Lake dataset.

# Find the class_names and list of files that need to be uploaded
dataset_folder = './animals'

# Find the subfolders, filtering out extra files like .DS_Store that macOS adds
class_names = [item for item in os.listdir(dataset_folder) if os.path.isdir(os.path.join(dataset_folder, item))]

files_list = []
for dirpath, dirnames, filenames in os.walk(dataset_folder):
    for filename in filenames:
        files_list.append(os.path.join(dirpath, filename))

Next, let's create the dataset tensors and upload metadata. Check out Storage Synchronization and "with" Context for details about the with syntax below.

with ds:
    # Create the tensors with names of your choice.
    ds.create_tensor('images', htype = 'image', sample_compression = 'jpeg')
    ds.create_tensor('labels', htype = 'class_label', class_names = class_names)
    
    # Add arbitrary metadata - Optional
    ds.info.update(description = 'My first Deep Lake dataset')
    ds.images.info.update(camera_type = 'SLR')
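
Metadata added via info.update can be read back with dictionary-style access. For example, using the keys set above:

print(ds.info['description'])         # 'My first Deep Lake dataset'
print(ds.images.info['camera_type'])  # 'SLR'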

Specifying htype and dtype is not required, but it is highly recommended in order to optimize performance, especially for large datasets. Use dtype to specify the numeric type of the tensor data, and use htype to specify the underlying data structure.

Finally, let's populate the data in the tensors. The data is automatically uploaded to the dataset, regardless of whether it's local or in the cloud.

with ds:
    # Iterate through the files and append to Deep Lake dataset
    for file in files_list:
        label_text = os.path.basename(os.path.dirname(file))
        label_num = class_names.index(label_text)
        
        #Append data to the tensors
        ds.append({'images': deeplake.read(file), 'labels': np.uint32(label_num)})

In order to maintain proper indexing across tensors, ds.append({...}) requires that you append to all tensors in the dataset. If you wish to skip tensors during appending, use ds.append({...}, skip_ok = True) or append to a single tensor using ds.tensor_name.append(...).

Appending the object deeplake.read(path) is equivalent to appending the decompressed array np.array(Image.open(path)). However, deeplake.read() is significantly faster because it does not decompress and recompress the image when its compression matches the sample_compression for that tensor. Further details are available in Step 3: Understanding Compression.
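
For illustration only (appending to a single tensor like this would leave labels out of sync, as noted above), both of the following store the same image, but only the second pays the decompress/recompress round trip. Here file is assumed to be a path to a JPEG, matching the sample_compression above:

# Fast path: the JPEG bytes are stored as-is
ds.images.append(deeplake.read(file))

# Slow path: the image is decoded into an array, then re-encoded as JPEG
ds.images.append(np.array(Image.open(file)))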

Check out the first image from this dataset. More details about Accessing Data are available in Step 4: Accessing and Updating Data.

Image.fromarray(ds.images[0].numpy())

Dataset inspection

You can print a summary of the dataset structure using:

ds.summary()
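
Individual properties can also be inspected directly. A short sketch (the names follow the tensors created above):

print(len(ds))                        # number of samples
print(ds.images.shape)                # tensor shape; None marks variable dimensions
print(ds.labels.info['class_names'])  # the class names set during tensor creation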

For more detailed information regarding accessing datasets and their tensors, check out Step 4: Accessing and Updating Data.

Congrats! You just created your first dataset! 🎉

Automatic Creation

If your source data conforms to one of the formats below, you can ingest it directly with a single line of code.

  • Classifications
  • YOLO
  • COCO
  • Dataframe

For example, the above animals dataset can be converted to Deep Lake format using:

src = './animals'
dest = './animals_deeplake_auto'

ds = deeplake.ingest_classification(src, dest)
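
The ingested dataset should mirror the manually created one, with class names inferred from the subfolder names. A quick check:

ds.summary()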

Creating Tensor Hierarchies

Often it's important to create tensors hierarchically, because information between tensors may be inherently coupled, such as bounding boxes and their corresponding labels. Hierarchy can be created using tensor groups:

ds = deeplake.empty('./groups_test') # Creates the dataset

# Create tensor hierarchies
ds.create_group('my_group')
ds.my_group.create_tensor('my_tensor')

# Alternatively, a group can be created using create_tensor with '/'
ds.create_tensor('my_group_2/my_tensor') #Automatically creates the group 'my_group_2'

Tensors in groups are accessed via:

ds.my_group.my_tensor

#OR

ds['my_group/my_tensor']
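
Appending works the same way for grouped tensors as for top-level ones. A minimal sketch (the array contents are arbitrary):

import numpy as np

# Both access styles expose the usual tensor API
ds.my_group.my_tensor.append(np.arange(10))
ds['my_group_2/my_tensor'].append(np.arange(10))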

