Query & Filter a Dataset

Now that we have our enriched dataset, we can start filtering it and creating subsets. Querying and filtering can remove unwanted samples or classes, and can also curate datasets for specific tasks, for example when we want to train only on data from a specific location or customer.

After creating our data subsets, we can download or stream the data directly to our training environment using the DagsHub client.

Let's see how to query our datasources through the UI and the DagsHub client.

Step-by-Step Guide

UI Flow

  1. In your datasource view, the query builder sits at the top of the screen. It has a basic mode and an advanced mode. For simple queries, click the "+" icon and add a simple condition.

    Add Simple Condition

  2. Now you'll need to choose which metadata field to apply the condition to, which condition to apply, and what value.

    Condition Structure

  3. After applying your condition, click the "Apply query" button to run the query and see your data subset.

    Apply Query

    The query will load, and you'll see the datapoints that the condition applies to.

  4. To add additional conditions, simply click the "+" button again.

  5. Now, you might want to save this query for use in your model training, or for sharing with your team. To do this, simply click "Save as new dataset".

    Save Button

  6. Then, add a name for your dataset, and click on save.

    Save Modal

Advanced Queries

Sometimes a simple AND query isn't enough, and you need to combine conditions with NOT and OR. Toggling the advanced query mode lets you build these more complex queries.

Advanced Query Builder Toggle

In the advanced query builder, you can add query conditions, like before, but also add condition groups.

Advanced Query

A condition group can be AND or OR, and defines the relationship between the conditions inside that group. In the example above, we see a query where at least one of two conditions must be met. Each of these conditions is itself a group of conditions, and the nesting can be arbitrarily deep.
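
To make the nesting concrete, here is a minimal pure-Python sketch of how an OR group containing two AND groups selects datapoints. It is only an illustration of the logic, not DagsHub's implementation: the `Group` class, the sample records, and the field names (`categories`, `cute`, `size`) are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Union

Record = Dict[str, Any]
Condition = Callable[[Record], bool]


class Group:
    """A condition group: combines its members with AND or OR (hypothetical helper)."""

    def __init__(self, op: str, members: List[Union["Group", Condition]]):
        if op not in ("AND", "OR"):
            raise ValueError("op must be 'AND' or 'OR'")
        self.op = op
        self.members = members

    def matches(self, record: Record) -> bool:
        # A member is either a nested group or a single condition.
        results = (
            m.matches(record) if isinstance(m, Group) else m(record)
            for m in self.members
        )
        return all(results) if self.op == "AND" else any(results)


# OR group whose two members are AND groups, mirroring a nested advanced query.
query = Group("OR", [
    Group("AND", [lambda r: "cats" in r["categories"], lambda r: r["cute"] is True]),
    Group("AND", [lambda r: "dog" in r["categories"], lambda r: r["size"] > 60]),
])

# Made-up metadata records, for illustration only.
datapoints = [
    {"path": "a.jpg", "categories": ["cats"], "cute": True, "size": 20},
    {"path": "b.jpg", "categories": ["dog"], "cute": False, "size": 80},
    {"path": "c.jpg", "categories": ["dog"], "cute": True, "size": 30},
]

subset = [d["path"] for d in datapoints if query.matches(d)]
print(subset)  # ['a.jpg', 'b.jpg']
```

Because a group can hold other groups, the same structure extends to any depth without changing the evaluation rule.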

Python Client Flow

In many cases, you'll need to run queries from your development environment. The DagsHub client offers a simple interface for creating and running these queries. Below is a simple example; many more query operators are available and are listed in the querying doc.

  1. Start by installing the DagsHub client. Simply type in the following:

    $ pip3 install dagshub
    

  2. Retrieve the datasource you created with the following code:

    from dagshub.data_engine import datasources
    
    ds = datasources.get_datasource(
      repo="<user_name>/<repo_name>", # User name and repository name separated by a "/"
      name="<datasource_name>", # Name of your datasource
    ) 
    

  3. Now we can define our query. DagsHub takes a pandas-like approach to querying in the client. To select which field to add the condition to, simply refer to it with square brackets (e.g. ds["categories"]). Then, add the operator, and the value.

    newQuery = ds["categories"].contains("cats")
    
    res = newQuery.all() # This will run the query on DagsHub
    
    print(res.dataframe) # Prints the results as a Pandas dataframe
    

    You can, of course, create more complex queries. For example:

    # Wrapping the whole expression in parentheses lets it span multiple lines
    newQuery = (
        (ds["categories"].contains("cats") & (ds["cute"] == True))
        | (ds["categories"].contains("dog") & (ds["size"] > 60) & ~(ds["categories"].contains("hot dog")))
    )
    

  4. To save your dataset, simply run:

    newQuery.save_dataset("Cat Dataset") # The argument is the dataset name
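
The queries above wrap every comparison in parentheses. That matters because Python's `&`, `|` and `~` bind more tightly than `==` and `>`. The sketch below shows how this plays out in any pandas-like query syntax built on operator overloading. It is a toy model under stated assumptions, not the DagsHub client: `Field` and `Expr` are hypothetical classes that only build a readable string.

```python
class Expr:
    """A finished condition; supports &, | and ~ (hypothetical toy class)."""

    def __init__(self, text: str):
        self.text = text

    @staticmethod
    def _check(other):
        if not isinstance(other, Expr):
            raise TypeError("expected a condition; did you forget parentheses?")

    def __and__(self, other):
        self._check(other)
        return Expr(f"({self.text} AND {other.text})")

    def __or__(self, other):
        self._check(other)
        return Expr(f"({self.text} OR {other.text})")

    def __invert__(self):
        return Expr(f"(NOT {self.text})")


class Field:
    """A metadata field; comparisons on it produce conditions (hypothetical toy class)."""

    def __init__(self, name: str):
        self.name = name

    def __eq__(self, value):
        return Expr(f"{self.name} == {value!r}")

    def __gt__(self, value):
        return Expr(f"{self.name} > {value!r}")

    def contains(self, value):
        return Expr(f"{self.name} contains {value!r}")


categories, cute, size = Field("categories"), Field("cute"), Field("size")

# Same shape as the complex query above, with each comparison parenthesized.
query = (categories.contains("cats") & (cute == True)) | (
    categories.contains("dog") & (size > 60) & ~categories.contains("hot dog")
)
print(query.text)

# Without parentheses, Python parses this as (contains("dog") & size) > 60.
try:
    categories.contains("dog") & size > 60
    failed = False
except TypeError:
    failed = True
print(failed)  # True
```

How a real client reacts to a missing parenthesis may differ from this toy `TypeError`, but the parsing order is fixed by Python itself, so parenthesizing each comparison is the safe habit.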
    

Next Steps

Now that we've queried and filtered our dataset, we might want to add annotations before we head over to train our model and track our experiments.