title	description
DagsHub Data Engine - Querying Data	Documentation on using Data Engine to query datasources to create subsets

Querying and saving subsets of your data

Data Engine lets you query a data source using its enrichments, whether they were generated automatically or added manually, so you can focus on the relevant data points and generate new subsets to train your model on.

Query syntax

Data Engine queries use a familiar pandas-like syntax.

A few query examples:

# Get all data points from episodes after 5
q1 = ds["episode"] > 5

# Get all data points from the first episode that also include baby Yoda in them 
q2 = (ds["episode"] == 1) & (ds["has_baby_yoda"] == True)

# Get data points that aren't between episodes 4 and 6
q3 = ~((ds["episode"] >= 4) & (ds["episode"] <= 6))

# Get data points that don't have an attached annotation
q4 = ds["annotation"].is_null()

Data Engine supports the following operators:

== (equal)
!= (not equal)
> (greater than)
>= (greater than or equal)
< (less than)
<= (less than or equal)
.contains()
.is_null()
.is_not_null()
Query composition:
- & (and)
- | (or)
- ~ (not)

The query composition operators (&, |, ~) are Python's bitwise operators, which bind more tightly than the comparison operators, so wrap each comparison in parentheses. For example:

# Supported
new_ds = (ds["episode"] > 5) & (ds["has_baby_yoda"] == True)

# Not supported
new_ds = ds["episode"] > 5 & ds["has_baby_yoda"] == True
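The precedence issue can be checked without Data Engine at all. The sketch below uses only Python's standard `ast` module to show how the two expressions above are parsed; `top_level_shape` is a hypothetical helper name introduced for illustration.

```python
import ast


def top_level_shape(expr):
    """Return the type name of the outermost node Python builds for `expr`."""
    return type(ast.parse(expr, mode="eval").body).__name__


with_parens = '(ds["episode"] > 5) & (ds["has_baby_yoda"] == True)'
without_parens = 'ds["episode"] > 5 & ds["has_baby_yoda"] == True'

# With parentheses: two comparisons joined by a single `&`.
print(top_level_shape(with_parens))     # BinOp

# Without parentheses: one chained comparison, because `&` is grouped first.
print(top_level_shape(without_parens))  # Compare

# The middle operand of that chained comparison is the `&` expression itself.
tree = ast.parse(without_parens, mode="eval").body
middle_operand = ast.unparse(tree.comparators[0])  # requires Python 3.9+
print(middle_operand)
```

In other words, the unparenthesized form asks Python to compare `ds["episode"]` against `5 & ds["has_baby_yoda"]`, which is not the intended query.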

!!! warning "Notes and limitations:"

1. Comparison is supported only against primitive values; comparison between columns is not supported yet.
1. Python's `in`, `and`, `or`, and `not` keywords are not supported. Use `.contains()`, `&`, `|`, and `~` instead. For example:
    ```python
    # Supported
    ds["col"].contains("aaa")
    ds = (ds["episode"] == 0) & (ds["has_baby_yoda"] == True)
    ds[~(ds["episode"] == 0)]

    # Not supported
    "aaa" in ds["col"]
    ds = (ds["episode"] == 0) and (ds["has_baby_yoda"] == True)
    ```
1. When re-querying, assign the result to a new variable and build the next query from that variable, so the earlier filter is not lost. For example:
    ```python
    # Supported
    filtered_ds = ds[ds["episode"] > 5]
    filtered_ds2 = filtered_ds[filtered_ds["has_baby_yoda"] == True]
    
    # Not supported 
    filtered_ds = ds[ds["episode"] > 5]
    filtered_ds2 = filtered_ds[ds["has_baby_yoda"] == True]
    ```
1. `.contains()` is supported only for string fields. For example:
    ```python
    # For given data:
    # path      animals
    # 001.jpg   "cat, squirrel, dog"
    # 002.jpg   "snake, dog"
    # 003.jpg   "cat"
    
    ds["animals"].contains("cat")
    # Will return datapoints [001.jpg, 003.jpg]
    ```
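The `.contains()` result in the last note can be reproduced with a plain-Python stand-in. This is not Data Engine code: it assumes `.contains()` behaves like a substring match, which is consistent with the example above but is not confirmed here. The `rows` dictionary and the `contains` helper are hypothetical names used only for illustration.

```python
# Plain-Python stand-in for the table in the note above (not Data Engine code).
rows = {
    "001.jpg": "cat, squirrel, dog",
    "002.jpg": "snake, dog",
    "003.jpg": "cat",
}


def contains(rows, needle):
    """Return the paths whose string value includes `needle` as a substring."""
    return [path for path, value in rows.items() if needle in value]


print(contains(rows, "cat"))  # ['001.jpg', '003.jpg']
print(contains(rows, "dog"))  # ['001.jpg', '002.jpg']
```

Under this substring assumption, a query for "cat" would also match a value such as "caterpillar", so keep that in mind when storing several labels in one string field.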

Creating DataFrames from query results

Use .dataframe to get a pandas DataFrame that contains the data points and their enrichments:

df = ds.head().dataframe

# You can also use it like this
ds[ds["episode"] > 5].all().dataframe

!!! note .dataframe provides a copy of the metadata as a DataFrame. Changes made to the DataFrame are not applied to the data source it was created from.

Saving query results as a new dataset

Query results can be saved and used later as a new dataset. To save your results as a new dataset, use the .save_dataset function:

# Filtered datasource
new_ds = ds[ds["episode"] > 5]

# Save the query as a dataset
new_ds.save_dataset("dataset-name")
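If you save subsets often, the two steps can be wrapped in a small helper. The following is a minimal sketch rather than part of the Data Engine API: `save_subset` is a hypothetical function name, and `ds` is assumed to be a data source you have already loaded. It uses only the query and `.save_dataset` calls shown above.

```python
def save_subset(ds, min_episode, dataset_name):
    """Filter `ds` to episodes after `min_episode` and save the result.

    `ds` is assumed to be an already-loaded Data Engine data source.
    """
    # Build the query from `ds` itself and keep the result in a new variable.
    subset = ds[ds["episode"] > min_episode]
    # Persist the filtered query under the given dataset name.
    subset.save_dataset(dataset_name)
    return subset
```

For example, `save_subset(ds, 5, "dataset-name")` would reproduce the snippet above and return the filtered data source for further use.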

After you save the new dataset, it is displayed in your repository under the Datasets tab.

To get a list of all the saved datasets in a repository, use the get_datasets function:

from dagshub.data_engine import datasets
ds_list = datasets.get_datasets("username/repoName")

View and use saved datasets

To use saved datasets, use the .get_dataset() function:

from dagshub.data_engine import datasets

ds = datasets.get_dataset("user/repo", "dataset-name")

Alternatively, navigate to the Datasets tab in your repository, click the Use this dataset button attached to the relevant dataset, and follow the instructions.

There you'll see a notebook full of copyable code snippets for using your dataset.
