v3.9.16

RAG Quickstart

A jump-start guide to using Deep Lake for Vector Search.

How to Get Started with Vector Search in Deep Lake in Under 5 Minutes

If you prefer higher-level wrappers, please check out our LangChain or LlamaIndex tutorials. This Quickstart focuses on vector storage and search rather than end-to-end LLM apps, and it offers more customization and search options than those wrappers.

Installing Deep Lake

Deep Lake can be installed using pip. By default, Deep Lake does not install dependencies for google-cloud, video support, and other features; details on all installation options are available in the Installation documentation. This quickstart also requires OpenAI.

!pip3 install deeplake
!pip3 install openai

Creating Your First Vector Store

Let's embed and store one of Paul Graham's essays in a Deep Lake Vector Store stored locally. First, we download the data file, paul_graham_essay.txt (73KB), to the working directory.

Next, let's import the required modules and set the OpenAI environment variable used for embeddings:

from deeplake.core.vectorstore import VectorStore
import openai
import os

os.environ['OPENAI_API_KEY'] = '<OPENAI_API_KEY>'  # replace with your OpenAI API key

Let's also read and chunk the essay text based on a constant number of characters.

source_text = 'paul_graham_essay.txt'
vector_store_path = 'pg_essay_deeplake'

with open(source_text, 'r') as f:
    text = f.read()

CHUNK_SIZE = 1000
chunked_text = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
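
The fixed-size chunking above can be wrapped in a small helper, which makes the resulting chunk count easier to reason about. This is a minimal sketch: the function name chunk_text and the sample string are illustrative and are not part of Deep Lake.

```python
import math


def chunk_text(text, chunk_size=1000):
    """Split text into consecutive chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# Illustrative input (not the essay): 2,500 characters
sample = "a" * 2500
chunks = chunk_text(sample, chunk_size=1000)

# The number of chunks is ceil(len(text) / chunk_size); only the last chunk can be shorter
assert len(chunks) == math.ceil(len(sample) / 1000)
# Concatenating the chunks reproduces the original text exactly
assert "".join(chunks) == sample
```

Because chunks are cut at fixed character offsets, a sentence can be split across two neighboring chunks. That is acceptable for a quickstart, but it is worth keeping in mind when inspecting search results.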

Next, let's define an embedding function using OpenAI. It must work for both a single string and a list of strings, so that it can be used to embed either a prompt or a batch of texts.

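def embedding_function(texts, model="text-embedding-ada-002"):
    # Accept a single string (a prompt) as well as a list of strings (a batch)
    if isinstance(texts, str):
        texts = [texts]

    # Replace newlines with spaces before embedding
    texts = [t.replace("\n", " ") for t in texts]

    response = openai.embeddings.create(input=texts, model=model)
    return [data.embedding for data in response.data]

vector_store = VectorStore(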
    path = vector_store_path,
)

vector_store.add(text = chunked_text, 
                 embedding_function = embedding_function, 
                 embedding_data = chunked_text, 
                 metadata = [{"source": source_text}]*len(chunked_text))
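
The contract required of the embedding function (a single string or a list of strings in, a list of vectors out) can be exercised without calling OpenAI. The sketch below uses a hypothetical stand-in, fake_embedding_function, which is not part of Deep Lake or OpenAI. Its hash-based vectors carry no semantic meaning, so it is only suitable for checking input and output shapes.

```python
import hashlib


def fake_embedding_function(texts, dim=8):
    """Offline stand-in for an embedding function (hypothetical, for testing only).

    Accepts a single string or a list of strings and always returns a list of vectors.
    """
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 (a SHA-256 digest has 32 bytes)")

    if isinstance(texts, str):
        texts = [texts]

    vectors = []
    for t in texts:
        digest = hashlib.sha256(t.encode("utf-8")).digest()
        # Map the first `dim` bytes of the digest to floats in [0, 1]
        vectors.append([b / 255.0 for b in digest[:dim]])
    return vectors


single = fake_embedding_function("a prompt")
batch = fake_embedding_function(["first chunk", "second chunk"])

assert len(single) == 1 and len(batch) == 2      # one vector per input text
assert all(len(v) == 8 for v in single + batch)  # every vector has `dim` entries
```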

The path parameter is bi-directional:

  • When a new path is specified, a new Vector Store is created

  • When an existing path is specified, the existing Vector Store is loaded

The Vector Store's data structure can be summarized using vector_store.summary(), which shows 4 tensors with 76 samples:

  tensor      htype      shape      dtype  compression
  -------    -------    -------    -------  ------- 
 embedding  embedding  (76, 1536)  float32   None   
    id        text      (76, 1)      str     None   
 metadata     json      (76, 1)      str     None   
   text       text      (76, 1)      str     None   

To create a vector store using pre-computed embeddings instead of embedding_data and embedding_function, you may run:

# vector_store.add(text = chunked_text, 
#                  embedding = <list_of_embeddings>, 
#                  metadata = [{"source": source_text}]*len(chunked_text))
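
One way to obtain the list of pre-computed embeddings is to call the embedding function over the chunks in batches. The following is a sketch under stated assumptions: the helper embed_in_batches, its batch_size parameter, and the toy embedding function are all illustrative and are not part of Deep Lake.

```python
def embed_in_batches(texts, embedding_function, batch_size=100):
    """Compute embeddings for texts in batches and return them in input order."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        embeddings.extend(embedding_function(batch))
    return embeddings


# Stand-in embedding function so the sketch runs offline: [length, number of spaces]
def toy_embedding_function(texts):
    return [[float(len(t)), float(t.count(" "))] for t in texts]


texts = ["first chunk", "second chunk of text", "third"]
embeddings = embed_in_batches(texts, toy_embedding_function, batch_size=2)

assert len(embeddings) == len(texts)  # one embedding per text, in the same order
assert embeddings[0] == [11.0, 1.0]   # "first chunk": 11 characters, 1 space
```

The resulting list could then be passed as the embedding argument in the commented call above, in place of embedding_function and embedding_data.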

Performing Vector Search

prompt = "What are the first programs he tried writing?"

search_results = vector_store.search(embedding_data=prompt, embedding_function=embedding_function)

search_results is a dictionary with the keys text, score, id, and metadata, with data ordered by score. Examining the first returned text using search_results['text'][0] shows that it appears to contain the answer to the prompt:

What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in
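
This Quickstart's default search is a cosine similarity search performed in Python on the client. To make that idea concrete, here is a toy sketch of such a search over a handful of made-up vectors. It is not Deep Lake's implementation: the names cosine_similarity and toy_search and the 2-dimensional embeddings are assumptions for illustration, and the returned dictionary only mimics the text and score keys described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_search(query_embedding, texts, embeddings, k):
    """Return the k most similar texts, ordered by descending score."""
    scored = [(cosine_similarity(query_embedding, e), t) for e, t in zip(embeddings, texts)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    return {"text": [t for _, t in top], "score": [s for s, _ in top]}


# Made-up 2-dimensional embeddings, for illustration only
texts = ["about programming", "about painting", "about startups"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

results = toy_search([1.0, 0.1], texts, embeddings, k=2)
print(results["text"][0])  # prints "about programming"
```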

Visualizing your Vector Store

Authentication

Environment Variable

Set the environment variable ACTIVELOOP_TOKEN to your API token. In Python, this can be done using:

os.environ['ACTIVELOOP_TOKEN'] = '<your_token>'  # replace with your API token

Pass the Token to Individual Methods

You can pass your API token to individual methods that require authentication such as:

ds = VectorStore('hub://org_name/dataset_name', token='<your_token>')  # replace with your API token
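
Rather than writing the token into source code, it can be read from the environment and checked before use. The helper below is a minimal sketch: get_activeloop_token is a hypothetical name, and the demonstration uses a separate, made-up variable (DEMO_ACTIVELOOP_TOKEN) so that it does not overwrite a real token.

```python
import os


def get_activeloop_token(env_var="ACTIVELOOP_TOKEN"):
    """Return the API token from the environment, or raise a clear error if it is unset."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your API token")
    return token


# Demonstration with a placeholder value in a made-up variable
os.environ["DEMO_ACTIVELOOP_TOKEN"] = "placeholder-token"
token = get_activeloop_token("DEMO_ACTIVELOOP_TOKEN")

# The token could then be passed explicitly, e.g. VectorStore(path, token=token)
```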

Creating Vector Stores in the Deep Lake Managed Tensor Database

# vector_store = VectorStore(
#     path = vector_store_path,
#     runtime = {"tensor_db": True}
# )

# vector_store.add(text = chunked_text, 
#                  embedding_function = embedding_function, 
#                  embedding_data = chunked_text, 
#                  metadata = [{"source": source_text}]*len(chunked_text))
                 
# search_results = vector_store.search(embedding_data = prompt, 
#                                      embedding_function = embedding_function)
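
The arguments for a managed Vector Store can be assembled in one place, as in the sketch below. It assumes that the store lives at a path of the form hub://org_name/dataset_name, as in the authentication example above. The helper name managed_store_kwargs is hypothetical, and the VectorStore call is left commented out because it requires deeplake and valid credentials.

```python
def managed_store_kwargs(org_id, dataset_name):
    """Build keyword arguments for a Vector Store in the Managed Tensor Database.

    Assumption: the store lives at a hub:// path in Deep Lake storage.
    """
    return {
        "path": f"hub://{org_id}/{dataset_name}",
        "runtime": {"tensor_db": True},
    }


kwargs = managed_store_kwargs("org_name", "dataset_name")
print(kwargs["path"])  # prints "hub://org_name/dataset_name"

# With deeplake installed and authentication set up, this could be used as:
# vector_store = VectorStore(**kwargs)
```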

Additional Notes

Paths and storage: The code above specifies paths for the source text (source_text) and the Deep Lake Vector Store (vector_store_path). Though we store the Vector Store locally, Deep Lake Vector Stores can also be created in memory, in the Deep Lake Managed Tensor Database, or in your cloud. Further details on storage options are available in the Storage Options documentation.

Tensor configuration: When creating the Deep Lake Vector Store and populating it with data, we use the default tensor configuration, which creates tensors for text (str), metadata (json), id (str, auto-populated), and embedding (float32). Tensor customizability is covered in more detail in the Deep Lake documentation.

Vector search: Deep Lake offers highly flexible vector search and hybrid search options, which are discussed in detail in the RAG tutorials. In this Quickstart, we show a simple example of vector search using default options, which performs cosine similarity search in Python on the client.

Visualization: Visualization is available for Vector Stores stored in or connected to Deep Lake. The Vector Store above is stored locally, so it cannot be visualized, but the online documentation includes an example of visualization for a representative Vector Store.

Authentication: To use Deep Lake features that require authentication (Deep Lake storage, Tensor Database storage, connecting your cloud dataset to the Deep Lake UI, etc.), you should register in the Deep Lake App and authenticate on the client using one of the methods described in the Authentication section.

Managed Tensor Database: Deep Lake provides a Managed Tensor Database that stores data and runs queries on Deep Lake infrastructure instead of on the client. To use this service, specify runtime = {"tensor_db": True} when creating the Vector Store.

Next Steps

Congratulations, you've created a Vector Store and performed vector search using Deep Lake!

Check out our Getting Started Guide for a comprehensive walk-through of Deep Lake Vector Stores. For scaling Deep Lake to production-level applications, check out our Managed Tensor Database and Support for Concurrent Writes.