Creating Time-Series Datasets

Deep Lake is a powerful tool for easily storing and sharing time-series datasets with your team.

How to use Deep Lake to store time-series data

This tutorial is also available as a Colab Notebook.

Deep Lake is an intuitive format for storing large time-series datasets, and it offers compression to reduce storage costs. This tutorial demonstrates how to convert time-series data to Deep Lake format and load it back for plotting.

Create the Deep Lake Dataset

The first step is to download the small dataset provided with this tutorial, sensor_data.zip (about 1 MB), and extract it locally.
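
A minimal extraction sketch is shown below. It assumes sensor_data.zip sits in the current working directory; adjust the target folder if needed so that ./sensor_data/subjects_info.csv exists after extraction.

import zipfile

# Extract the downloaded archive into ./sensor_data
with zipfile.ZipFile('sensor_data.zip', 'r') as zf:
    zf.extractall('./sensor_data')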

This is a subset of a dataset available on Kaggle, and it contains the iPhone x, y, and z acceleration for 24 users (subjects) under walking and jogging conditions. The dataset has the folder structure below. subjects_info.csv contains metadata such as height and weight for each subject, and the sub_n.csv files contain the time-series acceleration data for the nth subject.

data_dir
|_subjects_info.csv
|_motion_data
    |_walk
        |_sub_1.csv
        |_sub_2.csv
        ...
        ...
    |_jog
        |_sub_1.csv
        |_sub_2.csv
        ...
        ...

Now that you have the data, let's create a Deep Lake Dataset in the ./sensor_data_deeplake folder by running:

import deeplake
import pandas as pd
import os
from tqdm import tqdm
import numpy as np
import matplotlib.pyplot as plt

ds = deeplake.empty('./sensor_data_deeplake') # Create the dataset locally

Next, let's specify the folder path containing the existing dataset, load the subjects metadata to a Pandas DataFrame, and create a list of all of the time-series files that should be converted to Deep Lake format.

dataset_path = './sensor_data'

subjects_info = pd.read_csv(os.path.join(dataset_path, 'subjects_info.csv'))

fns_series = []
for dirpath, dirnames, filenames in os.walk(os.path.join(dataset_path, 'motion_data')):
    for filename in filenames:
        fns_series.append(os.path.join(dirpath, filename))

Next, let's create the tensors and add relevant metadata, such as the dataset source, the tensor units, and other information. We leverage groups to separate out the primary acceleration data from other user data such as the weight and height of the subjects.

with ds:
    #Update dataset metadata
    ds.info.update(source = 'https://www.kaggle.com/malekzadeh/motionsense-dataset', 
                   notes = 'This is a small subset of the data in the source link')

    #Create tensors. Setting chunk_compression is optional and it defaults to None
    ds.create_tensor('acceleration_x', chunk_compression = 'lz4') 
    ds.create_tensor('acceleration_y', chunk_compression = 'lz4')
    
    # Save the sampling rate as tensor metadata. Alternatively,
    # you could also create a 'time' tensor.
    ds.acceleration_x.info.update(sampling_rate_s = 0.1)
    ds.acceleration_y.info.update(sampling_rate_s = 0.1)
    
    # Encode activity as text
    ds.create_tensor('activity', htype = 'text')
    
    # Encode 'activity' as numeric labels and convert to text via class_names
    # ds.create_tensor('activity', htype = 'class_label', class_names = ['xyz'])
    
    ds.create_group('subjects_info')
    ds.subjects_info.create_tensor('age')
    ds.subjects_info.create_tensor('weight')
    ds.subjects_info.create_tensor('height')
    
    # Save the units of weight as tensor metadata
    ds.subjects_info.weight.info.update(units = 'kg')
    ds.subjects_info.height.info.update(units = 'cm')
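
The commented-out line above hints at an alternative encoding: storing activity as a class_label tensor instead of text. A minimal sketch of that approach, assuming the class names 'walk' and 'jog' from the folder structure above (use it instead of, not in addition to, the text tensor):

# Alternative: encode activity as numeric labels with text class names
ds.create_tensor('activity', htype = 'class_label', class_names = ['walk', 'jog'])

# Appending the activity string later maps it to its index in class_names
ds.activity.append('walk')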

Finally, let's iterate through all the time-series data and upload it to the Deep Lake dataset.

with ds:
    # Iterate through the time series and append data
    for fn in tqdm(fns_series):
        
        # Read the data in the time series
        df_data = pd.read_csv(fn)
        
        # Parse the 'activity' from the file name
        activity = os.path.basename(os.path.dirname(fn))
        
        # Parse the subject code from the filename and pull the subject info from 'subjects_info'
        subject_code = int(os.path.splitext(os.path.basename(fn))[0].split('_')[1])
        subject_info = subjects_info[subjects_info['code']==subject_code]
        
        # Append data to tensors
        ds.activity.append(activity)
        ds.subjects_info.age.append(subject_info['age'].values)
        ds.subjects_info.weight.append(subject_info['weight'].values)
        ds.subjects_info.height.append(subject_info['height'].values)
                
        ds.acceleration_x.append(df_data['userAcceleration.x'].values)
        ds.acceleration_y.append(df_data['userAcceleration.y'].values)
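
As a quick sanity check (not part of the original tutorial), you can confirm that every tensor received one sample per time-series file before moving on:

# Each tensor should have as many samples as there are time-series files
print(len(fns_series), len(ds.activity), len(ds.acceleration_x), len(ds.acceleration_y))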

Inspect the Deep Lake Dataset

Let's check out the first sample from this dataset and plot the acceleration time-series.

It is noteworthy that the Deep Lake dataset occupies 36% less storage than the original dataset, thanks to lz4 chunk compression of the acceleration tensors.
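
If you want to verify the savings on your own machine, a rough sketch for comparing the on-disk sizes of the two local folders (the exact figure depends on your copy of the data):

import os

def folder_size_mb(path):
    # Sum the sizes of all files under 'path'
    total = 0
    for dirpath, _, filenames in os.walk(path):
        for fn in filenames:
            total += os.path.getsize(os.path.join(dirpath, fn))
    return total / 1e6

print('Original:  {:.2f} MB'.format(folder_size_mb('./sensor_data')))
print('Deep Lake: {:.2f} MB'.format(folder_size_mb('./sensor_data_deeplake')))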

s_ind = 0 # Plot the first time series
t_ind = 100 # Plot the first 100 indices in the time series

#Plot the x acceleration
x_data = ds.acceleration_x[s_ind].numpy()[:t_ind]
sampling_rate_x = ds.acceleration_x.info.sampling_rate_s

plt.plot(np.arange(0, x_data.size)*sampling_rate_x, x_data, label='acceleration_x')

#Plot the y acceleration
y_data = ds.acceleration_y[s_ind].numpy()[:t_ind]
sampling_rate_y = ds.acceleration_y.info.sampling_rate_s

plt.plot(np.arange(0, y_data.size)*sampling_rate_y, y_data, label='acceleration_y')

plt.legend()
plt.xlabel('time [s]', fontweight = 'bold')
plt.ylabel('acceleration [g]', fontweight = 'bold')
plt.title('Weight: {} {}, Height: {} {}'.format(ds.subjects_info.weight[s_ind].numpy()[0],
                                               ds.subjects_info.weight.info.units,
                                               ds.subjects_info.height[s_ind].numpy()[0],
                                               ds.subjects_info.height.info.units),
         fontweight = 'bold')

plt.xlim([0, 10])
plt.grid()
plt.gcf().set_size_inches(8, 5)
plt.show()

Congrats! You just converted a time-series dataset to Deep Lake format! 🎉
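
To pick the dataset back up in a later session, it can be reloaded from the same path. A minimal sketch:

import deeplake

# Reload the local dataset created above and print an overview of its tensors
ds = deeplake.load('./sensor_data_deeplake')
ds.summary()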
