Create a Dataset
After uploading, connecting, or versioning data, the data files are part of our DagsHub repository. That's a great first step, but to build models with them, we need to turn this heap of files into a proper dataset. The next step in that process is to create a DagsHub Dataset.
This quick start assumes you've already uploaded or connected data, or added a data version to your project.
Step-by-Step Guide
UI Flow
- In the project containing your data, click the "Datasets" tab.
- Click "Add New Source".
- Select the location of your data files. This can be any folder in your project files, DagsHub Storage, or any connected storage. You can also name the datasource; the default name is the folder name.
- You're done. You'll be redirected to your datasource, which starts scanning (shown by the orange dot next to its name). Datapoints that have already been scanned are available right away, so you can start working without waiting for the scan to finish. Once scanning is done, the dot turns green.
- Deleting a Datasource - To delete a datasource, go back to the "Datasets" tab, click the three dots at the side of the datasource row, then click "Delete this source". You'll be asked to confirm, since deleting a datasource irreversibly deletes all of its metadata and enrichments. The underlying files are not deleted.
Python Client Flow
- Start by installing the DagsHub client. Simply type in the following:
$ pip3 install dagshub
- Run the following code to create the datasource:
from dagshub.data_engine import datasources

ds = datasources.create_datasource(
    # User name and repository name, separated by a "/"
    repo="<user_name>/<repo_name>",
    # Name of your datasource
    name="awesome-datasource",
    # Path to your data folder. This example points inside DagsHub Storage;
    # for a different bucket or for files in the repo, use the appropriate path.
    path="s3://<repo_name>/<path_to_data_folder>",
)
- Deleting the Datasource - To delete the datasource, simply run:
ds.delete_source()
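The create and delete steps above can be wrapped in a few small helpers. This is a minimal sketch, not part of the DagsHub client: `storage_path`, `create_storage_datasource`, and `delete_if_confirmed` are hypothetical names introduced here for illustration, and the path format is assumed to be the DagsHub Storage form shown above (`s3://<repo_name>/<path_to_data_folder>`). The only client calls used are `datasources.create_datasource` and `ds.delete_source()` from the steps above.

```python
def storage_path(repo_name: str, folder: str) -> str:
    """Build a path of the form s3://<repo_name>/<folder> (hypothetical helper).

    Assumes the data lives in DagsHub Storage; other buckets or files in the
    repo need a different path.
    """
    return f"s3://{repo_name}/{folder.strip('/')}"


def create_storage_datasource(user_name: str, repo_name: str, name: str, folder: str):
    """Create a datasource over a DagsHub Storage folder (hypothetical helper)."""
    # Imported lazily so the other helpers work without the client installed.
    from dagshub.data_engine import datasources

    return datasources.create_datasource(
        repo=f"{user_name}/{repo_name}",
        name=name,
        path=storage_path(repo_name, folder),
    )


def delete_if_confirmed(ds, confirmed: bool) -> bool:
    """Delete the datasource only when the caller explicitly confirms.

    Mirrors the confirmation step in the UI flow, since deleting a datasource
    irreversibly removes its metadata and enrichments.
    Returns True if the delete call was made, False otherwise.
    """
    if not confirmed:
        return False
    ds.delete_source()
    return True
```

For example, `ds = create_storage_datasource("<user_name>", "<repo_name>", "awesome-datasource", "<path_to_data_folder>")` would create the same datasource as the snippet above, and `delete_if_confirmed(ds, confirmed=True)` would remove it.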
Next Steps
Now that your datasource is ready, you can add enrichments, or query it to curate your data and create datasets.