Query & Filter a Dataset¶

Now that we have our enriched dataset, we can start filtering it and creating subsets. Querying and filtering can be used to remove unwanted samples or classes, but also to curate datasets for specific tasks, for example in a case where we'd want to train on data from a specific location or customer.

After creating our data subsets, we can download or stream the data directly to our training environment using the DagsHub client.

Let's see how to query our datasources through the UI and DagsHub client.

Step-by-Step Guide¶

UI Flow¶

In your datasource view, the query builder sits at the top of the screen. It has a basic and an advanced mode. For simple queries, click the "+" icon to add a condition.

Add Simple Condition
Now you'll need to choose which metadata field to apply the condition to, which condition to apply, and what value.

Condition Structure
After applying your condition, click the "Apply query" button to run the query and see your data subset.

Apply Query

The query will load, and you'll see the datapoints that the condition applies to.
To add additional conditions, simply click the "+" button again.
Now, you might want to save this query for use in your model training, or for sharing with your team. To do this, simply click "Save as new dataset".

Save Button
Then, add a name for your dataset, and click on save.

Save Modal

Advanced Queries¶

Sometimes a simple AND query isn't enough, and you need to have complex NOT and OR queries. By toggling the advanced query mode, you can create these complex queries easily.

In the advanced query builder, you can add query conditions, like before, but also add condition groups.

A condition group can be AND or OR, and defines the relationship between the conditions under that group. In the example above, we see a query where one of two conditions must be met. Each of these conditions is itself a group of conditions, and the nesting can be arbitrarily deep.
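
To make the nesting concrete, here is a minimal Python sketch of how a tree of AND/OR groups could be evaluated against one datapoint's metadata. The dictionary layout, the `OPS` table, the `matches` helper, and the field names are hypothetical and chosen for illustration only; this is not DagsHub's internal query format.

```python
import operator

# Hypothetical comparison operators for leaf conditions.
OPS = {
    "eq": operator.eq,
    "gt": operator.gt,
    "contains": lambda value, target: target in value,
}


def matches(node, metadata):
    """Return True if `metadata` satisfies the condition tree rooted at `node`."""
    if "group" in node:
        # A group combines its children with AND (all) or OR (any).
        results = (matches(child, metadata) for child in node["conditions"])
        return all(results) if node["group"] == "AND" else any(results)
    # A leaf compares one metadata field against a value.
    field_value = metadata.get(node["field"])
    if field_value is None:
        return False  # A missing field never satisfies a condition.
    return OPS[node["op"]](field_value, node["value"])


# An OR group whose two children are AND groups.
query = {
    "group": "OR",
    "conditions": [
        {"group": "AND", "conditions": [
            {"field": "categories", "op": "contains", "value": "cats"},
            {"field": "cute", "op": "eq", "value": True},
        ]},
        {"group": "AND", "conditions": [
            {"field": "categories", "op": "contains", "value": "dog"},
            {"field": "size", "op": "gt", "value": 60},
        ]},
    ],
}

cute_cat = {"categories": ["cats"], "cute": True, "size": 20}
small_dog = {"categories": ["dog"], "cute": True, "size": 30}

print(matches(query, cute_cat))   # first AND group is satisfied
print(matches(query, small_dog))  # neither AND group is satisfied
```

Because every group is evaluated recursively, a group can contain other groups to any depth, which mirrors how the advanced builder lets you nest condition groups.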

Python Client Flow¶

In many cases, you'll need to run queries from your development environment. The DagsHub client offers an easy interface for creating and running these queries. Below is a simple example, but there are many other query operators; see them all in the querying doc.

Start by installing the DagsHub client. Simply type in the following:
```
$ pip3 install dagshub
```

Retrieve the datasource you created with the following code:

```python
from dagshub.data_engine import datasources

ds = datasources.get_datasource(
    repo="<user_name>/<repo_name>",  # User name and repository name separated by a "/"
    name="<datasource_name>",  # Name of your datasource
)
```

Now we can define our query. DagsHub takes a pandas-like approach to querying in the client. To select which field to add the condition to, simply refer to it with square brackets (e.g. ds["categories"]). Then, add the operator, and the value.

```python
newQuery = ds["categories"].contains("cats")

res = newQuery.all()  # This will run the query on DagsHub

print(res.dataframe)  # Prints the results as a Pandas dataframe
```

You can, of course, create more complex queries. For example:

```python
# The outer parentheses let the expression continue onto a second line
newQuery = ((ds["categories"].contains("cats") & (ds["cute"] == True)) |
            (ds["categories"].contains("dog") & (ds["size"] > 60) & ~(ds["categories"].contains("hot dog"))))
```
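
The parentheses around each comparison matter because of Python's operator precedence: `&` and `|` bind more tightly than `==` and `>`, and `&` binds more tightly than `|`. The sketch below uses small stand-in classes (`Field` and `Cond` are hypothetical and not part of the DagsHub client) to show how pandas-style operators can build a query expression rather than evaluate to a boolean, and how the conditions end up grouped.

```python
class Cond:
    """A hypothetical query expression node that renders itself as text."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Cond(f"({self.text} AND {other.text})")

    def __or__(self, other):
        return Cond(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Cond(f"(NOT {self.text})")


class Field:
    """A hypothetical stand-in for a metadata field such as ds["size"]."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Comparison returns a condition object, not a bool.
        return Cond(f"{self.name} == {value!r}")

    def __gt__(self, value):
        return Cond(f"{self.name} > {value!r}")

    def contains(self, value):
        return Cond(f"{self.name} contains {value!r}")


categories, cute, size = Field("categories"), Field("cute"), Field("size")

query = (categories.contains("cats") & (cute == True)) | (
    categories.contains("dog") & (size > 60) & ~categories.contains("hot dog")
)
print(query.text)  # Shows the grouping that the operators produced
```

Without the inner parentheses, an expression such as `a & size > 60` is parsed by Python as `(a & size) > 60`, which is not the intended condition. Wrapping each comparison in its own parentheses avoids this.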

To save your dataset, simply run:

```python
newQuery.save_dataset("Cat Dataset")  # The argument is the dataset name
```

Next Steps¶

Now that we've queried and filtered our dataset, we might want to add annotations before we train our model and track our experiments.