Data Engine (Dataset Management)

Datasources

The main class used to interact with the Data Engine is Datasource.

Here are functions that you can use to get and create datasources on your repository:

dagshub.data_engine.datasources.create_datasource(repo: str, name: str, path: str, revision: str | None = None) → Datasource

Create a datasource from a path in the repo or a storage bucket URL. You can have multiple datasources pointing at the same path.

Parameters:
  • repo – Repo in <owner>/<reponame> format

  • name – Name of the datasource to be created. Must be unique across the repository’s datasources.

  • path – Either of:

    • a path to a directory inside the Git/DVC repo on DagsHub:

      path/to/dir

    • URL pointing to a storage bucket which is connected to the DagsHub repo:

      s3://bucketname/path/in/bucket

  • revision – Branch or revision the datasource should be used with. Only valid when using a Git/DVC path inside the DagsHub repo. The default repo branch is used if this is left blank.

Returns:

Created datasource

Return type:

Datasource

Raises:

DatasourceAlreadyExistsError – Datasource with this name already exists in repo.
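The parameter rules above can be sketched as a small wrapper. This is a minimal illustration, not part of the library: the helper names (is_bucket_url, build_create_kwargs, create_from_path) are hypothetical, and the client-side checks only mirror the documented constraints (repo in <owner>/<reponame> format, revision only for Git/DVC paths). The library may validate its arguments differently.

```python
from typing import Optional


def is_bucket_url(path: str) -> bool:
    """Return True for bucket URLs such as s3://bucketname/path/in/bucket."""
    return "://" in path


def build_create_kwargs(
    repo: str, name: str, path: str, revision: Optional[str] = None
) -> dict:
    """Check the documented constraints and build kwargs for create_datasource.

    Hypothetical helper: these checks are an assumption based on the
    parameter descriptions, not the library's own validation.
    """
    owner, sep, reponame = repo.partition("/")
    if not (owner and sep and reponame) or "/" in reponame:
        raise ValueError(f"repo must be in <owner>/<reponame> format, got {repo!r}")
    if revision is not None and is_bucket_url(path):
        raise ValueError("revision only applies to Git/DVC paths inside the repo")

    kwargs = {"repo": repo, "name": name, "path": path}
    if revision is not None:
        # Omit revision entirely so the default repo branch is used.
        kwargs["revision"] = revision
    return kwargs


def create_from_path(repo: str, name: str, path: str, revision: Optional[str] = None):
    """Create a datasource after running the checks above."""
    # Imported lazily so the helpers above work without the dagshub client installed.
    from dagshub.data_engine import datasources

    return datasources.create_datasource(**build_create_kwargs(repo, name, path, revision))
```

With this sketch, create_from_path("<owner>/<reponame>", "images", "path/to/dir", revision="main") would target a directory on the main branch, while a bucket URL is passed without a revision.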

dagshub.data_engine.datasources.create(*args, **kwargs) → Datasource

Alias for create_datasource()

dagshub.data_engine.datasources.get_datasource(repo: str, name: str | None = None, id: int | str | None = None, **kwargs) → Datasource

Get the datasource with the matching name or id in the repo.

Parameters:
  • repo – Repo in <owner>/<reponame> format

  • name – Name of the datasource

  • id – ID of the datasource

Kwargs:

revision – For repo datasources, the branch or revision to download from. If not specified, the default branch of the repo is used.

Returns:

The datasource with the supplied name and/or id

Return type:

Datasource

Raises:

DatasourceNotFoundError – The datasource with this id or name does not exist.

dagshub.data_engine.datasources.get_datasources(repo: str) → List[Datasource]

Get all datasources that exist on the repo

Parameters:

repo – Repo in <owner>/<reponame> format

Returns:

All datasources that exist for the repository

Return type:

list(Datasource)

dagshub.data_engine.datasources.get(*args, **kwargs) → Datasource

Alias for get_datasource()

dagshub.data_engine.datasources.get_or_create(repo: str, name: str, path: str, revision: str | None = None) → Datasource

First attempts to get the repo datasource with the given name, and only if that fails, invokes create_datasource with the given parameters. See the docs on create_datasource for more info.

Parameters:
  • repo (str) – Repo in <owner>/<reponame> format

  • name (str) – The name of the datasource to retrieve or create.

  • path (str) – The path to the datasource within the repository.

  • revision (Optional[str], optional) – The specific revision or version of the datasource to retrieve.

Returns:

The retrieved or newly created Datasource instance.

Return type:

Datasource

Raises:

DatasourceAlreadyExistsError – Datasource with this name already exists in repo.
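The get-then-create behavior described for get_or_create can be illustrated with an in-memory stand-in. This is a sketch of the described semantics only, not the library's implementation: FakeRegistry and NotFoundError are hypothetical names, and plain strings stand in for Datasource objects.

```python
from typing import Dict


class NotFoundError(Exception):
    """Hypothetical stand-in for the library's DatasourceNotFoundError."""


class FakeRegistry:
    """In-memory stand-in for a repo's datasources, keyed by name."""

    def __init__(self) -> None:
        self._paths: Dict[str, str] = {}
        self.create_calls = 0

    def get(self, name: str) -> str:
        try:
            return self._paths[name]
        except KeyError:
            raise NotFoundError(name) from None

    def create(self, name: str, path: str) -> str:
        self.create_calls += 1
        self._paths[name] = path
        return path

    def get_or_create(self, name: str, path: str) -> str:
        # Try the lookup first; create only if the lookup fails.
        try:
            return self.get(name)
        except NotFoundError:
            return self.create(name, path)


registry = FakeRegistry()
first = registry.get_or_create("images", "path/to/dir")       # not found, so created
second = registry.get_or_create("images", "some/other/path")  # found, path is not used
```

In this sketch the second call returns the existing entry and its path argument is ignored, which is what the description above implies for a datasource that already exists under the given name.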

dagshub.data_engine.datasources.get_from_mlflow(run: mlflow.entities.Run | str | None = None, artifact_name='datasource.dagshub.json') → Datasource

Load a datasource from an MLflow run.

To save a datasource to MLflow, use Datasource.log_to_mlflow().

Parameters:
  • run – MLflow Run or its ID to load the datasource from. If None, loads datasource from the current active run.

  • artifact_name – Name of the datasource artifact in the run.

Datasets

Datasets are “save states” of Datasources with a query already applied. They can be stored on DagsHub and retrieved later by you or anybody else.

To save a dataset, apply a query to a datasource then call save_dataset().
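A minimal sketch of that workflow follows. The helper save_filtered_dataset and the field name used in the usage comments are hypothetical; the sketch assumes that indexing a datasource by field name and comparing it with a value yields a queried datasource, and that save_dataset() takes the dataset name.

```python
def save_filtered_dataset(datasource, field: str, minimum, dataset_name: str):
    """Keep datapoints where `field` is greater than `minimum`, then save them as a dataset.

    `datasource` is expected to behave like a Data Engine Datasource
    (assumption: ds[field] > value returns a queried datasource).
    """
    query = datasource[field] > minimum  # apply the query
    query.save_dataset(dataset_name)     # store the "save state" on DagsHub
    return query


# Usage against a real repository (requires the dagshub client and network access):
#
#   from dagshub.data_engine import datasources, datasets
#   ds = datasources.get_datasource("<owner>/<reponame>", name="my-datasource")
#   save_filtered_dataset(ds, "size", 1024, "large-files")
#   later = datasets.get_dataset("<owner>/<reponame>", name="large-files")
```

The saved dataset can then be retrieved by name with get_dataset(), documented below.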

dagshub.data_engine.datasets.get_datasets(repo: str) → List[Datasource]

Get all datasets that exist on the repo

Parameters:

repo – Repo in <owner>/<reponame> format

Returns:

All datasets of the repo

Return type:

list(Datasource)

dagshub.data_engine.datasets.get_dataset(repo: str, name: str | None = None, id: int | str | None = None) → Datasource

Get specific dataset by name or id

Parameters:
  • repo – Repo in <owner>/<reponame> format

  • name – Name of the dataset

  • id – ID of the dataset

Returns:

Found dataset

Return type:

Datasource

Raises:

DatasetNotFoundError – No dataset found with this name or id

dagshub.data_engine.datasets.get_dataset_from_file(path: str) → Datasource

Load a dataset from a local file.

This is a copy of datasources.get_datasource_from_file()

Parameters:

path – Path to the .dagshub file with the relevant dataset

Returns:

The dataset that was saved to the file

dagshub.data_engine.datasets.get_from_mlflow(run=None, artifact_name='datasource.dagshub.json') → Datasource

Load a dataset from an MLflow run.

To save a datasource to MLflow, use Datasource.log_to_mlflow().

This is a copy of datasources.get_from_mlflow()

Parameters:
  • run – Run or ID of the MLflow run to load the datasource from. If None, gets it from the current active run.

  • artifact_name – Name of the artifact in the run.

Data Engine Structures