# Deep Lake Vector Store in LangChain

## How to Use Deep Lake as a Vector Store in LangChain

Vector embeddings are a data representation commonly used to down-select the contextual data that is fed into a language model, since language models have a finite token limit. Deep Lake can be used as a [VectorStore](https://python.langchain.com/en/latest/reference/modules/vectorstores.html#langchain.vectorstores.DeepLake) in [LangChain](https://github.com/hwchase17/langchain) for building apps that require vector filtering and search. In this tutorial, we will show how to create a Deep Lake Vector Store in LangChain and use it to build a Q\&A App about the [Twitter OSS recommendation algorithm](https://github.com/twitter/the-algorithm).

### Downloading and Preprocessing the Data

First, let's import the necessary packages and **make sure the Activeloop and OpenAI keys are set in the environment variables `ACTIVELOOP_TOKEN` and `OPENAI_API_KEY`.**

```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA, ConversationalRetrievalChain
import os
```
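Before moving on, it can help to confirm that both keys are actually visible to the Python process, since a missing key tends to surface later as a less obvious authentication error. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` is our own and is not part of LangChain or Deep Lake.

```python
import os

# The two variable names come from the tutorial text above.
REQUIRED_VARS = ("ACTIVELOOP_TOKEN", "OPENAI_API_KEY")

def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

An empty list means both keys are set; the check says nothing about whether the keys are valid.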

Next, let's clone the Twitter OSS recommendation algorithm:

```python
!git clone https://github.com/twitter/the-algorithm
```

Next, let's load all the files from the repo into a list:

```python
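# Relative path: `git clone` above creates the folder in the current working directory
repo_path = 'the-algorithm'

docs = []
for dirpath, dirnames, filenames in os.walk(repo_path):
    for file in filenames:
        try:
            loader = TextLoader(os.path.join(dirpath, file), encoding='utf-8')
            docs.extend(loader.load_and_split())
        except Exception as e:
            # Skip files that cannot be read as UTF-8 text (e.g. binaries)
            print(e)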
```

#### A note on chunking text files:

Since some of the files are very large, we split them into chunks. In general, using more (and therefore smaller) chunks increases the relevancy of the data that is fed into the language model, since granular data can be selected with higher precision. However, since an embedding is created for each chunk, more chunks also increase the computational cost.

```python
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(docs)
```

{% hint style="warning" %}
Chunks in the above context should not be confused with Deep Lake chunks!
{% endhint %}
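To make the trade-off between chunk size and chunk count concrete, here is a simplified sketch that cuts a string into fixed-size character windows. It is not how `CharacterTextSplitter` works internally (that class splits on a separator first); the function `split_fixed` and the synthetic input are assumptions made purely for illustration.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

sample = "x" * 10_000  # stand-in for the contents of a large source file
for size in (2000, 1000, 500):
    # One embedding is computed per chunk, so this count tracks embedding cost.
    print(f"chunk_size={size}: {len(split_fixed(sample, size))} chunks")
```

Halving the chunk size doubles the number of chunks, and with it the number of embeddings that have to be computed and stored.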

### Creating the Deep Lake Vector Store

First, we specify a path for storing the Deep Lake dataset containing the embeddings and their metadata.

```python
dataset_path = 'hub://<org-id>/twitter_algorithm'
```

Next, we specify an OpenAI embedding model for creating the embeddings, and create the VectorStore. This process creates an embedding for each element in the `texts` list and stores it in Deep Lake format at the specified path.

```python
embeddings = OpenAIEmbeddings()
```

```python
db = DeepLake.from_documents(texts, embeddings, dataset_path=dataset_path)
```

The Deep Lake dataset serving as a VectorStore has 4 tensors: the `embedding`, its `ids`, the `metadata` (including the filename of the `text`), and the `text` itself.

```
  tensor     htype       shape       dtype  compression
  -------   -------     -------     -------  ------- 
 embedding  generic  (23156, 1536)  float32   None   
    ids      text     (23156, 1)      str     None   
 metadata    json     (23156, 1)      str     None   
   text      text     (23156, 1)      str     None   
```

### Use the Vector Store in a Q\&A App

We can now use the VectorStore in a Q\&A App, where the embeddings are used to retrieve the relevant documents (`texts`) that are fed into an LLM in order to answer a question.

If we were on another machine, we would load the existing Vector Store without recalculating the embeddings:

```python
db = DeepLake(dataset_path=dataset_path, read_only=True, embedding_function=embeddings)
```

We have to create a `retriever` object and specify the search parameters ([parameter details are available here](https://python.langchain.com/en/latest/reference/modules/vectorstore.html#langchain.vectorstores.DeepLake.search)).

```python
retriever = db.as_retriever()
retriever.search_kwargs['distance_metric'] = 'cos'
retriever.search_kwargs['k'] = 20
```
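To make the two search parameters concrete, here is a brute-force sketch of what a cosine-similarity search returning `k` results computes. It is a pure-Python illustration, not Deep Lake's implementation; the function names and the toy 2-dimensional vectors are made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, embeddings, k):
    """Return the indices of the k stored embeddings most similar to `query`."""
    scored = [(cosine_similarity(query, e), i) for i, e in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy stand-ins for the stored embedding vectors
stored = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], stored, k=2))  # indices of the two closest vectors
```

In the tutorial, `k = 20` means the 20 most similar text chunks are passed to the LLM as context.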

Finally, let's create a `RetrievalQA` chain in LangChain and run it:

```python
model = ChatOpenAI(model='gpt-4')  # or 'gpt-3.5-turbo'
qa = RetrievalQA.from_llm(model, retriever=retriever)
```

```python
qa.run('What programming language is most of the SimClusters written in?')
```

This returns:

`Most of the SimClusters code is written in Scala, as seen in the provided context with the file path [src/scala/com/twitter/simclusters_v2/scio/bq_generation](scio/bq_generation) and the package declarations that use the Scala package syntax.`

{% hint style="info" %}
We can tune `k` in the `retriever` so that the prompt stays within the model's token limit. A higher `k` can increase accuracy by including more data in the prompt.
{% endhint %}
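As a rough way to reason about the largest usable `k`, the sketch below estimates how many retrieved chunks fit into a prompt budget. The 4-characters-per-token ratio is only a rule of thumb, and the limit and reserved-token values are assumptions for illustration; an exact count requires the model's tokenizer.

```python
def max_chunks_within_budget(chunks, token_limit, reserved_tokens=500, chars_per_token=4):
    """Count how many leading chunks fit into the prompt budget.

    `reserved_tokens` leaves room for the question, the instructions and the answer.
    Token counts are estimated from character counts, which is an approximation.
    """
    budget = token_limit - reserved_tokens
    used = 0
    count = 0
    for chunk in chunks:
        estimated = -(-len(chunk) // chars_per_token)  # ceiling division
        if used + estimated > budget:
            break
        used += estimated
        count += 1
    return count

# 20 retrieved chunks of 1000 characters each, matching chunk_size and k above
retrieved = ["x" * 1000] * 20
print(max_chunks_within_budget(retrieved, token_limit=4096))  # assumed limit
```

If the estimate is lower than the configured `k`, either `k` or the chunk size would need to be reduced for the prompt to fit.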

### Adding Data to an Existing Vector Store

Data can be added to an existing Vector Store by loading it using its path and adding documents or texts.

```python
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)

# Don't run this here in order to avoid data duplication
# db.add_documents(texts)
```
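One way to guard against the duplication mentioned in the comment above is to filter out texts that were already stored before calling `add_documents`. The sketch below does this with content hashes on plain strings; it is a hypothetical helper of our own, not a Deep Lake or LangChain feature, and it assumes you keep track of the hashes of previously added texts yourself.

```python
import hashlib

def content_key(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def only_new(candidates, existing_keys):
    """Return candidates not yet stored, preserving order and dropping repeats."""
    seen = set(existing_keys)
    fresh = []
    for text in candidates:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            fresh.append(text)
    return fresh

already_stored = {content_key("chunk A")}
print(only_new(["chunk A", "chunk B", "chunk B", "chunk C"], already_stored))
```

Only the chunks returned by `only_new` would then be passed on to the Vector Store.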

### Adding Hybrid Search to the Vector Store

Since embedding search can be computationally expensive, you can narrow the search by filtering the data with an explicit condition that is applied on top of the embedding search. Suppose we want to answer a question related to the trust and safety models. We can filter on the filenames (`source`) in the `metadata` using a custom function that is added to the retriever:

```python
def filter_by_source(deeplake_sample):
    # Keep only samples whose source path mentions the trust and safety models
    return 'trust_and_safety_models' in deeplake_sample['metadata'].data()['value']['source']

retriever.search_kwargs['filter'] = filter_by_source
```

```python
qa.run("What do the trust and safety models do?")
```

This returns:

`"The Trust and Safety Models are designed to detect various types of content on Twitter that may be inappropriate, harmful, or against their terms of service. Here's a brief overview of each model:\n\n1. pNSFWMedia: This model detects tweets containing Not Safe For Work (NSFW) images, including adult and pornographic content.\n2. pNSFWText: This model identifies tweets with NSFW text or those discussing adult/sexual topics.\n3. pToxicity: This model detects toxic tweets, which may include marginal content like insults and certain types of harassment. Toxic content does not necessarily violate Twitter's terms of service.\n4. pAbuse: This model identifies abusive content that violates Twitter's terms of service, including hate speech, targeted harassment, and abusive behavior."`
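The filter-then-rank idea behind this hybrid search can be sketched on plain dictionaries, without Deep Lake. In the sketch below, the sample records, the file names, and the helper functions are all hypothetical, the ranking uses a simple dot product, and the metadata filter is assumed to run before the ranking step.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def source_contains(substring):
    """Build a predicate that checks a sample's metadata source path."""
    def predicate(sample):
        return substring in sample["metadata"]["source"]
    return predicate

def filtered_search(samples, query, predicate, k):
    """Apply the metadata filter first, then rank the survivors by dot product."""
    candidates = [s for s in samples if predicate(s)]
    candidates.sort(key=lambda s: dot(query, s["embedding"]), reverse=True)
    return candidates[:k]

# Hypothetical samples; the file names are illustrative only.
samples = [
    {"embedding": [0.9, 0.1], "metadata": {"source": "trust_and_safety_models/README.md"}},
    {"embedding": [1.0, 0.0], "metadata": {"source": "src/scala/Example.scala"}},
    {"embedding": [0.2, 0.8], "metadata": {"source": "trust_and_safety_models/toxicity/train.py"}},
]

hits = filtered_search(samples, [1.0, 0.0], source_contains("trust_and_safety_models"), k=2)
print([h["metadata"]["source"] for h in hits])
```

Note that the Scala sample has the highest similarity to the query but is excluded, because it fails the metadata filter.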

Congrats! You just used the Deep Lake VectorStore in LangChain to create a Q\&A App! 🎉
