LlamaIndex Integration

Using Deep Lake as a Vector Store in LlamaIndex


How to Use Deep Lake as a Vector Store in LlamaIndex

Deep Lake can be used as a VectorStore in LlamaIndex for building apps that require filtering and vector search. In this tutorial, we show how to create a Deep Lake Vector Store in LlamaIndex and use it to build a Q&A app about the Twitter OSS recommendation algorithm. This tutorial requires installation of:

!pip3 install llama-index-vector-stores-deeplake
!pip3 install langchain llama-index deeplake

Downloading and Preprocessing the Data

First, let's import the necessary packages and make sure the Activeloop and OpenAI API keys are set in the environment variables ACTIVELOOP_TOKEN and OPENAI_API_KEY.

import os
import textwrap

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core import StorageContext
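If the keys are not already exported in your shell, they can be assigned from within Python before the Vector Store is created. A minimal sketch with placeholder values (substitute your real credentials):

```python
import os

# Placeholder values -- replace with your actual Activeloop and OpenAI keys.
os.environ["ACTIVELOOP_TOKEN"] = "<your-activeloop-token>"
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
```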

Next, let's clone the Twitter OSS recommendation algorithm:

!git clone https://github.com/twitter/the-algorithm

Next, let's specify a local path to the files and add a reader for processing and chunking them.

repo_path = 'the-algorithm'
documents = SimpleDirectoryReader(repo_path, recursive=True).load_data()

Creating the Deep Lake Vector Store

First, we create an empty Deep Lake Vector Store using a specified path:

dataset_path = 'hub://<org-id>/twitter_algorithm'
vector_store = DeepLakeVectorStore(dataset_path=dataset_path)

The Deep Lake Vector Store has 4 tensors: the text, embedding, id, and metadata, which includes the filename of the source text.

  tensor      htype     shape    dtype  compression
  -------    -------   -------  -------  ------- 
   text       text      (0,)      str     None   
 metadata     json      (0,)      str     None   
 embedding  embedding   (0,)    float32   None   
    id        text      (0,)      str     None  

Next, we create a LlamaIndex StorageContext and VectorStoreIndex, and use the from_documents() method to populate the Vector Store with data. This step takes several minutes because of the time to embed the text.

storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context,
)

We observe that the Vector Store has 8262 rows of data:

  tensor      htype       shape       dtype  compression
  -------    -------     -------     -------  ------- 
   text       text      (8262, 1)      str     None   
 metadata     json      (8262, 1)      str     None   
 embedding  embedding  (8262, 1536)  float32   None   
    id        text      (8262, 1)      str     None 

Use the Vector Store in a Q&A App

We can now use the Vector Store in a Q&A app, where the embeddings are used to retrieve relevant documents (texts) that are fed into an LLM in order to answer a question.

If we were on another machine, we would load the existing Vector Store without re-ingesting the data:

vector_store = DeepLakeVectorStore(dataset_path=dataset_path, read_only=True)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

Next, let's create the LlamaIndex query engine and run a query:

query_engine = index.as_query_engine()
response = query_engine.query("What programming language is most of the SimClusters written in?")
print(str(response))

Most of the SimClusters project is written in Scala.
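Since responses can be long, the textwrap module imported earlier is handy for formatting the printed output. A small sketch using a stand-in string in place of a real query-engine response:

```python
import textwrap

# Stand-in for a long response string returned by the query engine.
response_text = "Most of the SimClusters project is written in Scala. " * 5
wrapped = textwrap.fill(response_text, width=80)
print(wrapped)
```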

Congrats! You just used the Deep Lake Vector Store in LlamaIndex to create a Q&A app! 🎉
