# Querying Datasets

## How to query machine learning datasets using Activeloop's query engine

Querying datasets is a critical aspect of data science workflows that enables users to filter datasets and focus their work on the most relevant data at hand. Activeloop offers a highly-performant dataset query engine built in C++ and optimized for Deep Lake datasets.&#x20;

### Dataset Query Summary

{% embed url="<https://www.loom.com/share/40f8f10af5064f9a8baf3dfd37029700>" %}

#### Saving and utilizing dataset query results

The query results (`Dataset Views`) can be saved in the UI as shown above, or if the view is generated in Python, it can be saved using the Python API below. Full details are [available here](/v3.0.0/getting-started/dataset-filtering.md).

```python
ds_view.save_view(message = 'Samples with monarchs')
```

{% hint style="warning" %}
In order to maintain data lineage, `Dataset Views` are immutable and are connected to specific commits. Therefore, views can only be saved if the dataset has a commit and there are no uncommitted changes in the `HEAD`.&#x20;
{% endhint %}

`Dataset Views` can be loaded in the python API and they can passed to ML frameworks just like regular datasets:

```python
ds_view = ds.load_view(view_id, optimize = True, num_workers = 2)

for data in ds_view.pytorch():
    # Training loop here
```

{% hint style="warning" %}
The `optimize` parameter in `ds.load_view(...,`` `**`optimize = True`**`)` materializes the `Dataset View` into a new sub-dataset that is optimized for streaming. If the original dataset uses [linked tensors](broken://pages/PAxJHAoZTMrUQpRYjzXz), the data will be copied to Deep Lake format.

Optimizing the `Dataset View` is critical for achieving rapid streaming.
{% endhint %}

If the saved `Dataset View` is no longer needed, it can be deleted using:

```python
ds.delete_view(view_id)
```

### Dataset Query Syntax

#### CONTAINS and ==

```sql
# Exact match, which generally requires that the sample
# has 1 value, i.e. no lists or multi-dimensional arrays
select * where tensor_name == 'text_value'    # If value is numeric
select * where tensor_name == numeric_value  # If values is text

select * where contains(tensor_name, 'text_value')
```

{% hint style="warning" %}
Any special characters in tensor or group names should be wrapped with double-quotes:

```
select * where contains("tensor-name", 'text_value')

select * where "tensor_name/group_name" == numeric_value
```

{% endhint %}

#### SHAPE

```sql
select * where shape(tensor_name)[dimension_index] > numeric_value 
select * where shape(tensor_name)[1] > numeric_value # Second array dimension > value
```

#### LIMIT

```sql
select * where contains(tensor_name, 'text_value') limit num_samples
```

#### AND, OR, NOT

```sql
select * where contains(tensor_name, 'text_value') and NOT contains(tensor_name_2, numeric_value)
select * where contains(tensor_name, 'text_value') or tensor_name_2 == numeric_value

select * where (contains(tensor_name, 'text_value') and shape(tensor_name_2)[dimension_index]>numeric_value) or contains(tensor_name, 'text_value_2')
```

#### UNION and INTERSECT

```sql
(select * where contains(tensor_name, 'value')) intersect (select * where contains(tensor_name, 'value_2'))

(select * where contains(tensor_name, 'value') limit 100) union (select * where shape(tensor_name)[0] > numeric_value limit 100)
```

#### ORDER BY

<pre class="language-sql"><code class="lang-sql"><strong># Order by requires that sample is numeric and has 1 value, 
</strong># i.e. no lists or multi-dimensional arrays
select * where contains(tensor_name, 'text_value') order by tensor_name asc
</code></pre>

#### ANY, ALL, and ALL\_STRICT

<pre class="language-sql"><code class="lang-sql"><strong>select * where all_strict(tensor_name[:,2]>numeric_value)
</strong>
select * where any(tensor_name[0:6]>numeric_value)
</code></pre>

{% hint style="warning" %}
**`all`** adheres to NumPy and list logic where `all(empty_sample)` returns `True`

**`all_strict`** is more intuitive for queries so `all_strict(empty_sample)` returns `False`
{% endhint %}

**LOGICAL\_AND** and **LOGICAL\_OR**

```sql
select * where any(logical_and(tensor_name_1[:,3]>numeric_value, tensor_name_2 == 'text_value'))
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs-v3.activeloop.ai/v3.0.0/tutorials/querying-datasets.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
