> For the complete documentation index, see [llms.txt](https://docs-v3.activeloop.ai/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://docs-v3.activeloop.ai/v3.0.15/tutorials/querying-datasets.md).

# Querying Datasets

## How to query machine learning datasets using Activeloop's query engine

Querying datasets is a critical aspect of data science workflows that enables users to filter datasets and focus their work on the most relevant data at hand. Activeloop offers a highly-performant dataset query engine built in C++ and optimized for Deep Lake datasets.&#x20;

{% hint style="danger" %}
Starting from `deeplake 3.1.0`, querying your own datasets will require the dataset's parent organization to be on a Growth of Enterprise plan. Querying Activeloop datasets will be enabled for everyone.&#x20;
{% endhint %}

### Dataset Query Summary

#### Querying in the UI

{% embed url="<https://www.loom.com/share/40f8f10af5064f9a8baf3dfd37029700>" %}

#### Querying in the Python API

Queries can also be performed in the Python API using:

```python
view = ds.query('Query string')
```

#### Saving and utilizing dataset query results

The query results (`Dataset Views`) can be saved in the UI as shown above, or if the view is generated in Python, it can be saved using the Python API below. Full details are [available here](/v3.0.15/getting-started/dataset-filtering.md).

```python
ds_view.save_view(message = 'Samples with monarchs')
```

{% hint style="warning" %}
In order to maintain data lineage, `Dataset Views` are immutable and are connected to specific commits. Therefore, views can only be saved if the dataset has a commit and there are no uncommitted changes in the `HEAD`. You can check for this using `ds.has_head_changes`
{% endhint %}

`Dataset Views` can be loaded in the python API and they can passed to ML frameworks just like regular datasets:

```python
ds_view = ds.load_view(view_id, optimize = True, num_workers = 2)

for data in ds_view.pytorch():
    # Training loop here
```

{% hint style="warning" %}
The `optimize` parameter in `ds.load_view(...,`` `**`optimize = True`**`)` materializes the `Dataset View` into a new sub-dataset that is optimized for streaming. If the original dataset uses [linked tensors](broken://pages/PAxJHAoZTMrUQpRYjzXz), the data will be copied to Deep Lake format.

Optimizing the `Dataset View` is critical for achieving rapid streaming.
{% endhint %}

If the saved `Dataset View` is no longer needed, it can be deleted using:

```python
ds.delete_view(view_id)
```

### Dataset Query Syntax

#### CONTAINS and ==

```sql
# Exact match, which generally requires that the sample
# has 1 value, i.e. no lists or multi-dimensional arrays
select * where tensor_name == 'text_value'    # If value is numeric
select * where tensor_name == numeric_value  # If values is text

select * where contains(tensor_name, 'text_value')
```

{% hint style="warning" %}
Any special characters in tensor or group names should be wrapped with double-quotes:

```
select * where contains("tensor-name", 'text_value')

select * where "tensor_name/group_name" == numeric_value
```

{% endhint %}

#### SHAPE

```sql
select * where shape(tensor_name)[dimension_index] > numeric_value 
select * where shape(tensor_name)[1] > numeric_value # Second array dimension > value
```

#### LIMIT

```sql
select * where contains(tensor_name, 'text_value') limit num_samples
```

#### AND, OR, NOT

```sql
select * where contains(tensor_name, 'text_value') and NOT contains(tensor_name_2, numeric_value)
select * where contains(tensor_name, 'text_value') or tensor_name_2 == numeric_value

select * where (contains(tensor_name, 'text_value') and shape(tensor_name_2)[dimension_index]>numeric_value) or contains(tensor_name, 'text_value_2')
```

#### UNION and INTERSECT

```sql
(select * where contains(tensor_name, 'value')) intersect (select * where contains(tensor_name, 'value_2'))

(select * where contains(tensor_name, 'value') limit 100) union (select * where shape(tensor_name)[0] > numeric_value limit 100)
```

#### ORDER BY

<pre class="language-sql"><code class="lang-sql"><strong># Order by requires that sample is numeric and has 1 value, 
</strong># i.e. no lists or multi-dimensional arrays
select * where contains(tensor_name, 'text_value') order by tensor_name asc
</code></pre>

#### ANY, ALL, and ALL\_STRICT

<pre class="language-sql"><code class="lang-sql"><strong>select * where all_strict(tensor_name[:,2]>numeric_value)
</strong>
select * where any(tensor_name[0:6]>numeric_value)
</code></pre>

{% hint style="warning" %}
**`all`** adheres to NumPy and list logic where `all(empty_sample)` returns `True`

**`all_strict`** is more intuitive for queries so `all_strict(empty_sample)` returns `False`
{% endhint %}

**LOGICAL\_AND** and **LOGICAL\_OR**

```sql
select * where any(logical_and(tensor_name_1[:,3]>numeric_value, tensor_name_2 == 'text_value'))
```

#### SAMPLE BY

```sql
select * sample by weight_choice(expression_1: weight_1, expression_2: weight_2, ...)
        replace True limit N
```

* **`weight_choice`** resolves the weight that is used when multiple expressions evaluate to `True` for a given sample. Options are `max_weight, sum_weight`. For example, if `weight_choice` is `max_weight`, then the maximum weight will be chosen for that sample.
* **`replace`** determines whether samples should be drawn with replacement. It defaults to `True`.
* **`limit`** specifies the number of samples that should be returned. If unspecified, the sampler will return the number of samples corresponding to the length of the dataset


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://docs-v3.activeloop.ai/v3.0.15/tutorials/querying-datasets.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.