Let's create a Vector Store in Deep Lake for storing and searching information about the Twitter OSS recommendation algorithm.
Downloading and Preprocessing the Data
First, let's import the necessary packages and make sure the Activeloop and OpenAI keys are set in the environment variables ACTIVELOOP_TOKEN and OPENAI_API_KEY.
from deeplake.core.vectorstore.deeplake_vectorstore import DeepLakeVectorStore
import openai
import os
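If the keys are not already exported in your shell, one way to set them from within Python is sketched below. The placeholder values are hypothetical; substitute your own keys.

# Hypothetical placeholders; substitute your actual keys, or export them in the shell instead.
os.environ['ACTIVELOOP_TOKEN'] = '<your_activeloop_token>'
os.environ['OPENAI_API_KEY'] = '<your_openai_api_key>'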
Next, let's clone the Twitter OSS recommendation algorithm repository and define paths for the source data and the Vector Store, as in the sketch below.
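The repository URL is the public Twitter algorithm repo; the local path names are assumptions and can be changed to suit your setup.

# In a notebook, the repository can be cloned with a shell command:
!git clone https://github.com/twitter/the-algorithm

repo_path = 'the-algorithm'                  # assumed location of the cloned source data
vector_store_path = 'twitter_algorithm_vs'   # assumed location for the Vector Store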
Next, let's load all the files from the repo into a list of data that will be added to the Vector Store (chunked_text and metadata). We use simple text chunking based on a constant number of characters.
CHUNK_SIZE = 1000

chunked_text = []
metadata = []
for dirpath, dirnames, filenames in os.walk(repo_path):
    for file in filenames:
        try:
            full_path = os.path.join(dirpath, file)
            with open(full_path, 'r') as f:
                text = f.read()
            # Split the file into fixed-size character chunks
            new_chunked_text = [text[i:i + CHUNK_SIZE] for i in range(0, len(text), CHUNK_SIZE)]
            chunked_text += new_chunked_text
            # Record the source filepath alongside each chunk
            metadata += [{'filepath': full_path} for _ in range(len(new_chunked_text))]
        except Exception as e:
            # Skip files that cannot be read as text (e.g. binaries)
            print(e)
Next, let's define an embedding function using OpenAI. It must work for both a single string and a list of strings, so that it can be used to embed a prompt as well as a batch of texts.
def embedding_function(texts, model="text-embedding-ada-002"):
    # Accept both a single string and a list of strings
    if isinstance(texts, str):
        texts = [texts]

    # OpenAI recommends replacing newlines with spaces before embedding
    texts = [t.replace("\n", " ") for t in texts]
    return [data['embedding'] for data in openai.Embedding.create(input=texts, model=model)['data']]
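As a quick sanity check, we can embed a single string; the 1536-dimensional output is a property of the text-embedding-ada-002 model.

embedding = embedding_function("Hello world")[0]
print(len(embedding))  # 1536 for text-embedding-ada-002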
Finally, let's create the Deep Lake Vector Store and populate it with data. We use the default tensor configuration, which creates tensors for text (str), metadata (json), id (str, auto-populated), and embedding (float32). Learn more about tensor customizability here.
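A minimal sketch of this step is shown below, assuming the vector_store_path defined earlier. The texts are embedded at ingestion time: embedding_function is applied to embedding_data, which here is the raw chunked text.

vector_store = DeepLakeVectorStore(
    path=vector_store_path,
)

vector_store.add(text=chunked_text,
                 embedding_function=embedding_function,
                 embedding_data=chunked_text,
                 metadata=metadata)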