Step 2: Creating Deep Lake Vector Stores

How to Create a Deep Lake Vector Store

Let's create a Vector Store in LangChain for storing and searching information about the Twitter OSS recommendation algorithmarrow-up-right.

Downloading and Preprocessing the Data

First, let's import necessary packages and make sure the Activeloop and OpenAI keys are in the environmental variables ACTIVELOOP_TOKEN, OPENAI_API_KEY.

from deeplake.core.vectorstore.deeplake_vectorstore import DeepLakeVectorStore
import openai
import os

Next, let's clone the Twitter OSS recommendation algorithm and define paths for for source data and the Vector Store.

!git clone https://github.com/twitter/the-algorithm
vector_store_path = '/vector_store_getting_started'
repo_path = '/the-algorithm'

Next, let's load all the files from the repo into list of data that will be added to the Vector Store (chunked_text and metadata). We use simple text chunking based on a constant number of characters.

CHUNK_SIZE = 1000

chunked_text = []
metadata = []
for dirpath, dirnames, filenames in os.walk(repo_path):
    for file in filenames:
        try: 
            full_path = os.path.join(dirpath,file)
            with open(full_path, 'r') as f:
               text = f.read()
            new_chunkned_text = [text[i:i+1000] for i in range(0,len(text), CHUNK_SIZE)]
            chunked_text += new_chunkned_text
            metadata += [{'filepath': full_path} for i in range(len(new_chunkned_text))]
        except Exception as e: 
            print(e)
            pass

Next, let's define an embedding function using OpenAI. It must work for a single string and a list of strings, so that it can both be used to embed a prompt and a batch of texts.

Finally, let's create the Deep Lake Vector Store and populate it with data. We use a default tensor configuration, which creates tensors with text (str), metadata(json), id (str, auto-populated), embedding (float32). Learn more about tensor customizability here.

The Vector Store's data structure can be summarized using vector_store.summary(), which shows 4 tensors with 21055 samples:

Last updated

Was this helpful?