Create a Dataset

After uploading, connecting, or versioning our data, the files are part of our DagsHub repository. That's a great first step, but in order to build models with them, we need to turn this heap of files into a proper dataset. The next step in that process is to create a DagsHub Dataset.

For the purpose of this quick start, we assume you've already uploaded or connected data, or added a data version to your project.

Step-by-Step Guide

UI Flow

  1. In the project containing your data, click on the "Datasets" tab.

    Datasets Tab

  2. Then click on "Add New Source".

    Add New Datasource

  3. Now, you'll need to select the location of your data files. This can be any folder in your project files, DagsHub Storage, or any connected storage. You can also give it a name - the default name is the folder name.

    Select Folder & Datasource Name

  4. You're done. You'll be redirected to your datasource, where an orange dot next to its name shows that it is still being scanned. Datapoints that have already been scanned are visible right away, so you can start working without waiting for scanning to finish. Once scanning is done, the dot turns green.

    Datasource Successfully Created

  5. Deleting a Datasource - To delete a datasource, go back to the "Datasets" tab, click on the three dots at the side of the datasource row, then click on "Delete this source". You will be asked to confirm, since deleting a datasource is irreversible: the underlying files are kept, but all of the datasource's enrichments and metadata are lost.

    Delete a Datasource

Python Client Flow

  1. Start by installing the DagsHub client:
    $ pip3 install dagshub
    
  2. Run the following code to create the datasource:
    from dagshub.data_engine import datasources
    
    ds = datasources.create_datasource(
      # User name and repository name, separated by a "/"
      repo="<user_name>/<repo_name>",
      # Name of your datasource
      name="awesome-datasource",
      # Path to your data folder. This example points inside DagsHub Storage;
      # for a different bucket or for files in the repo, use the appropriate path.
      path="s3://<repo_name>/<path_to_data_folder>",
    )
    
  3. Deleting the Datasource - To delete the datasource, simply run:
    ds.delete_source()
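
The client flow above can be wrapped in a few small helpers so that a malformed repository string is caught before any request is made. This is only a sketch: `split_repo`, `storage_path` and `create` are hypothetical helper names, not part of the DagsHub client, and the path layout simply follows the `s3://<repo_name>/<path_to_data_folder>` pattern from step 2, which applies to DagsHub Storage only.

```python
from typing import Tuple


def split_repo(repo: str) -> Tuple[str, str]:
    """Split "<user_name>/<repo_name>" into its two parts (hypothetical helper)."""
    parts = repo.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f'expected "<user_name>/<repo_name>", got {repo!r}')
    return parts[0], parts[1]


def storage_path(repo: str, folder: str) -> str:
    """Build a path of the form s3://<repo_name>/<path_to_data_folder>.

    Assumption: the data lives in DagsHub Storage, as in step 2 above.
    """
    _, repo_name = split_repo(repo)
    return f"s3://{repo_name}/{folder.strip('/')}"


def create(repo: str, name: str, folder: str):
    """Create the datasource. Needs network access and DagsHub credentials."""
    # Imported lazily so the helpers above work without the client installed.
    from dagshub.data_engine import datasources

    return datasources.create_datasource(
        repo=repo,
        name=name,
        path=storage_path(repo, folder),
    )
```

With these in place, `create("<user_name>/<repo_name>", "awesome-datasource", "<path_to_data_folder>")` would issue the same call as step 2, and the returned object can later be removed with `delete_source()` as in step 3.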
    

Next Steps

Now that you have your datasource ready, you can go ahead and add enrichments, or query it to curate your data and create datasets.