Data Engine (Dataset Management)

Datasources

The main class used to interact with the Data Engine is Datasource.

Use the following functions to get and create datasources in your repository:

dagshub.data_engine.datasources.create_datasource(repo: str, name: str, path: str, revision: str | None = None) → Datasource

Create a datasource from a path in the repo or a storage bucket URL. You can have multiple datasources pointing at the same path.

Parameters:

repo – Repo in <owner>/<reponame> format
name – Name of the datasource to be created. Name should be unique across the repository’s datasources
path –
Either of:
- a path to a directory inside the Git/DVC repo on DagsHub:
  path/to/dir
- URL pointing to a storage bucket which is connected to the DagsHub repo:
  s3://bucketname/path/in/bucket
revision – Branch or revision the datasource should be used with. Only valid when using a Git/DVC path inside the DagsHub repo. The default repo branch is used if this is left blank.

Returns:

Created datasource

Return type:

Datasource

Raises:

DatasourceAlreadyExistsError – Datasource with this name already exists in repo.
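As a rough illustration of the two kinds of path described above, the sketch below tells a bucket URL apart from a repo path and rejects a revision for bucket URLs before calling create_datasource. The helpers is_bucket_url and create_checked are hypothetical names, not part of the DagsHub client, and the pre-check is an assumption about sensible usage rather than documented client behavior.

```python
from typing import Optional
from urllib.parse import urlparse


def is_bucket_url(path: str) -> bool:
    """Return True if path looks like a storage bucket URL such as s3://bucket/dir."""
    parsed = urlparse(path)
    return bool(parsed.scheme and parsed.netloc)


def create_checked(repo: str, name: str, path: str, revision: Optional[str] = None):
    """Hypothetical wrapper: validate the arguments, then call create_datasource."""
    if is_bucket_url(path) and revision is not None:
        # Per the parameter docs, revision only applies to Git/DVC paths in the repo.
        raise ValueError("revision is only valid for paths inside the repo")
    # Imported lazily so the helper above works without the client installed.
    from dagshub.data_engine import datasources
    return datasources.create_datasource(repo, name, path, revision)
```

With the client installed and authenticated, a call such as create_checked("<owner>/<reponame>", "images", "path/to/dir") would be forwarded to create_datasource unchanged.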

dagshub.data_engine.datasources.create(*args, **kwargs) → Datasource

Alias for create_datasource()

dagshub.data_engine.datasources.get_datasource(repo: str, name: str | None = None, id: int | str | None = None, **kwargs) → Datasource

Get the datasource with the matching name or id for the repo

Parameters:

repo – Repo in <owner>/<reponame> format
name – Name of the datasource
id – ID of the datasource

Kwargs:

revision – For repo datasources, defines which branch/revision to download from. If not specified, the default branch of the repo is used.

Returns:

Datasource that has the supplied name and/or id

Return type:

Datasource

Raises:

DatasourceNotFoundError – The datasource with this id or name does not exist.
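A lookup by name or id, optionally pinned to a revision, might be wrapped as in the sketch below. build_lookup and fetch_datasource are hypothetical helpers. Requiring at least one of name or id is an assumption made for this sketch, and so is the import path used for DatasourceNotFoundError.

```python
from typing import Optional, Union


def build_lookup(name: Optional[str] = None,
                 id: Optional[Union[int, str]] = None,
                 revision: Optional[str] = None) -> dict:
    """Build keyword arguments for get_datasource, requiring a name or an id."""
    if name is None and id is None:
        raise ValueError("pass at least one of name or id")
    kwargs = {}
    if name is not None:
        kwargs["name"] = name
    if id is not None:
        kwargs["id"] = id
    if revision is not None:
        # revision is passed through **kwargs, as described under Kwargs.
        kwargs["revision"] = revision
    return kwargs


def fetch_datasource(repo: str, **lookup):
    """Return the matching datasource, or None if it does not exist."""
    from dagshub.data_engine import datasources
    # Assumed import path for the exception; check it against your installed version.
    from dagshub.data_engine.model.errors import DatasourceNotFoundError
    try:
        return datasources.get_datasource(repo, **build_lookup(**lookup))
    except DatasourceNotFoundError:
        return None
```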

dagshub.data_engine.datasources.get_datasources(repo: str) → List[Datasource]

Get all datasources that exist on the repo

Parameters:

repo – Repo in <owner>/<reponame> format

Returns:

All datasources that exist for the repository

Return type:

list(Datasource)
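Since every function in this section expects the repo in <owner>/<reponame> format, it can be convenient to check that format before making a request. The sketch below is one way to do so. split_repo and list_datasources are hypothetical helpers, and the exact validation rules are an assumption rather than what the client enforces.

```python
from typing import Tuple


def split_repo(repo: str) -> Tuple[str, str]:
    """Split '<owner>/<reponame>' into its two parts, rejecting other shapes."""
    parts = repo.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected <owner>/<reponame>, got {repo!r}")
    return parts[0], parts[1]


def list_datasources(repo: str):
    """Validate the repo string locally, then list its datasources."""
    split_repo(repo)  # fail fast before any network call
    # Imported lazily so split_repo works without the client installed.
    from dagshub.data_engine import datasources
    return datasources.get_datasources(repo)
```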

dagshub.data_engine.datasources.get(*args, **kwargs) → Datasource

Alias for get_datasource()

dagshub.data_engine.datasources.get_or_create(repo: str, name: str, path: str, revision: str | None = None) → Datasource

First attempts to get the repo datasource with the given name, and only if that fails, invokes create_datasource with the given parameters. See the docs on create_datasource for more info.

Parameters:

repo (str) – Repo in <owner>/<reponame> format
name (str) – The name of the datasource to retrieve or create.
path (str) – Path used if the datasource has to be created: a directory inside the repo or a storage bucket URL, as accepted by create_datasource.
revision (Optional[str], optional) – Branch or revision the datasource should be used with.

Returns:

The retrieved or newly created Datasource instance.

Return type:

Datasource

Raises:

DatasourceAlreadyExistsError – Datasource with this name already exists in repo.
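The get-then-create flow described above can be sketched without touching DagsHub at all. Everything in the block below is a stand-in: NotFound plays the role of DatasourceNotFoundError, and fake_get and fake_create mimic get_datasource and create_datasource against an in-memory dictionary. It illustrates the control flow only and is not the library's implementation.

```python
from typing import Dict, Optional, Tuple


class NotFound(Exception):
    """Stand-in for DatasourceNotFoundError, used only in this sketch."""


# In-memory stand-in for the datasources registered on DagsHub.
_registry: Dict[Tuple[str, str], dict] = {}


def fake_get(repo: str, name: str) -> dict:
    """Mimic get_datasource: return the entry or raise NotFound."""
    try:
        return _registry[(repo, name)]
    except KeyError:
        raise NotFound(f"{name!r} not found in {repo}") from None


def fake_create(repo: str, name: str, path: str,
                revision: Optional[str] = None) -> dict:
    """Mimic create_datasource: register and return a new entry."""
    entry = {"repo": repo, "name": name, "path": path, "revision": revision}
    _registry[(repo, name)] = entry
    return entry


def get_or_create_sketch(repo: str, name: str, path: str,
                         revision: Optional[str] = None) -> dict:
    """Try the lookup first; create only if the lookup fails."""
    try:
        return fake_get(repo, name)
    except NotFound:
        return fake_create(repo, name, path, revision)
```

In this sketch, a second call with the same name but a different path returns the existing entry unchanged, because the creation branch is never reached.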

dagshub.data_engine.datasources.get_from_mlflow(run: mlflow.entities.Run | str | None = None, artifact_name='datasource.dagshub.json') → Datasource

Load a datasource from an MLflow run.

To save a datasource to MLflow, use Datasource.log_to_mlflow().

Parameters:

run – MLflow Run or its ID to load the datasource from. If None, loads datasource from the current active run.
artifact_name – Name of the datasource artifact in the run.
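A round trip through MLflow could look like the sketch below: log a datasource inside a run, then load it back by run ID. log_and_reload is a hypothetical helper. It assumes that log_to_mlflow() called without arguments logs to the currently active run, and that MLflow tracking is already configured for the repo.

```python
def log_and_reload(repo: str, name: str):
    """Log a datasource to a new MLflow run, then load it back from that run."""
    # Imported lazily; both packages must be installed and configured.
    import mlflow
    from dagshub.data_engine import datasources

    ds = datasources.get_datasource(repo, name)
    with mlflow.start_run() as run:
        # Assumed to save the datasource as an artifact of the active run.
        ds.log_to_mlflow()
    # Reload by run ID; artifact_name keeps its default value.
    return datasources.get_from_mlflow(run.info.run_id)
```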

Datasets

Datasets are “save states” of Datasources with a query already applied. They can be stored on DagsHub and retrieved later by you or anybody else.

To save a dataset, apply a query to a datasource then call save_dataset().

dagshub.data_engine.datasets.get_datasets(repo: str) → List[Datasource]

Get all datasets that exist on the repo

Parameters:

repo – Repo in <owner>/<reponame> format

Returns:

All datasets of the repo

Return type:

list(Datasource)

dagshub.data_engine.datasets.get_dataset(repo: str, name: str | None = None, id: int | str | None = None) → Datasource

Get a specific dataset by name or id

Parameters:

repo – Repo in <owner>/<reponame> format
name – Name of the dataset
id – ID of the dataset

Returns:

Found dataset

Return type:

Datasource

Raises:

DatasetNotFoundError – No dataset found with this name or id
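Saving a dataset and fetching it back by name might be combined as in the sketch below. save_and_reload is a hypothetical helper. It expects a datasource that already has a query applied, and it assumes that save_dataset() takes the dataset name as its argument.

```python
def save_and_reload(queried_ds, repo: str, dataset_name: str):
    """Save a queried datasource as a dataset, then fetch it back by name."""
    # Imported lazily; requires the DagsHub client to be installed.
    from dagshub.data_engine import datasets

    # Assumption: save_dataset accepts the name under which to store the dataset.
    queried_ds.save_dataset(dataset_name)
    return datasets.get_dataset(repo, name=dataset_name)
```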

dagshub.data_engine.datasets.get_dataset_from_file(path: str) → Datasource

Load a dataset from a local file.

This is a copy of datasources.get_datasource_from_file()

Parameters:

path – Path to the .dagshub file with the relevant dataset

Returns:

Dataset that was logged to the file

dagshub.data_engine.datasets.get_from_mlflow(run=None, artifact_name='datasource.dagshub.json') → Datasource

Load a dataset from an MLflow run.

To save a datasource to MLflow, use Datasource.log_to_mlflow().

This is a copy of datasources.get_from_mlflow()

Parameters:

run – Run or ID of the MLflow run to load the datasource from. If None, gets it from the current active run.
artifact_name – Name of the artifact in the run.
