Data Engine (Dataset Management)
Datasources
The main class used to interact with the Data Engine is Datasource.
Here are functions that you can use to get and create datasources on your repository:
- dagshub.data_engine.datasources.create_datasource(repo: str, name: str, path: str, revision: str | None = None) → Datasource
Create a datasource from a path in the repo or a storage bucket URL. You can have multiple datasources pointing at the same path.
- Parameters:
repo – Repo in <owner>/<reponame> format.
name – Name of the datasource to be created. The name must be unique across the repository's datasources.
path – Either of:
- a path to a directory inside the Git/DVC repo on DagsHub: path/to/dir
- a URL pointing to a storage bucket that is connected to the DagsHub repo: s3://bucketname/path/in/bucket
revision – Branch or revision the datasource should be used with. Only valid when using a Git/DVC path inside the DagsHub repo. The default repo branch is used if this is left blank.
- Returns:
Created datasource
- Return type:
Datasource
- Raises:
DatasourceAlreadyExistsError – A datasource with this name already exists in the repo.
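The two kinds of path described above can be sketched as follows. This is a minimal illustration, not the library's own validation: build_create_kwargs is a hypothetical helper, the datasource names and the "main" branch are made up, and the rule that a bucket URL must not carry a revision is enforced here only because the parameter description says revision is valid for Git/DVC paths alone.

```python
from typing import Optional


def build_create_kwargs(repo: str, name: str, path: str,
                        revision: Optional[str] = None) -> dict:
    """Hypothetical helper (not part of dagshub): checks arguments locally."""
    owner, sep, repo_name = repo.partition("/")
    if not (owner and sep and repo_name) or "/" in repo_name:
        raise ValueError("repo must look like <owner>/<reponame>")
    # Assumption: anything containing "://" is treated as a bucket URL.
    is_bucket_url = "://" in path
    if is_bucket_url and revision is not None:
        raise ValueError("revision only applies to Git/DVC paths inside the repo")
    return {"repo": repo, "name": name, "path": path, "revision": revision}


def create_example_datasources(repo: str):
    """Creates two datasources; needs the dagshub client, credentials, and network."""
    # Imported lazily so the helper above can be used without dagshub installed.
    from dagshub.data_engine import datasources

    # Datasource backed by a directory in the Git/DVC repo, pinned to a branch.
    from_repo = datasources.create_datasource(
        **build_create_kwargs(repo, "images-main", "path/to/dir", revision="main"))
    # Datasource backed by a connected storage bucket (no revision).
    from_bucket = datasources.create_datasource(
        **build_create_kwargs(repo, "images-bucket", "s3://bucketname/path/in/bucket"))
    return from_repo, from_bucket
```

Calling create_example_datasources("<owner>/<reponame>") a second time would be expected to raise DatasourceAlreadyExistsError, since both names would already be taken.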
- dagshub.data_engine.datasources.create(*args, **kwargs) → Datasource
Alias for create_datasource().
- dagshub.data_engine.datasources.get_datasource(repo: str, name: str | None = None, id: int | str | None = None, **kwargs) → Datasource
Get the datasource with the matching name or id for the repo.
- Parameters:
repo – Repo in <owner>/<reponame> format.
name – Name of the datasource.
id – ID of the datasource.
- Kwargs:
revision – For repo datasources, defines which branch/revision to download from. If not specified, the default branch of the repo is used.
- Returns:
Datasource that has the supplied name and/or id
- Return type:
Datasource
- Raises:
DatasourceNotFoundError – The datasource with this id or name does not exist.
- dagshub.data_engine.datasources.get_datasources(repo: str) → List[Datasource]
Get all datasources that exist on the repo.
- Parameters:
repo – Repo in <owner>/<reponame> format.
- Returns:
All datasources that exist for the repository
- Return type:
list(Datasource)
- dagshub.data_engine.datasources.get(*args, **kwargs) → Datasource
Alias for get_datasource().
- dagshub.data_engine.datasources.get_or_create(repo: str, name: str, path: str, revision: str | None = None) → Datasource
First attempts to get the repo datasource with the given name and, only if that fails, invokes create_datasource() with the given parameters. See the docs on create_datasource() for more info.
- Parameters:
repo (str) – Repo in <owner>/<reponame> format.
name (str) – The name of the datasource to retrieve or create.
path (str) – The path to the datasource within the repository.
revision (Optional[str], optional) – The specific revision or version of the datasource to retrieve.
- Returns:
The retrieved or newly created Datasource instance.
- Return type:
Datasource
- Raises:
DatasourceAlreadyExistsError – A datasource with this name already exists in the repo.
- dagshub.data_engine.datasources.get_from_mlflow(run: mlflow.entities.Run | str | None = None, artifact_name='datasource.dagshub.json') → Datasource
Load a datasource from an MLflow run.
To save a datasource to MLflow, use Datasource.log_to_mlflow().
- Parameters:
run – MLflow Run or its ID to load the datasource from. If None, loads the datasource from the current active run.
artifact_name – Name of the datasource artifact in the run.
Datasets
Datasets are “save states” of Datasources with a query already applied. They can be stored on DagsHub and retrieved later by you or anybody else.
To save a dataset, apply a query to a datasource, then call save_dataset().
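A minimal sketch of that workflow is shown here. It assumes the dagshub client is installed and authenticated, that the datasource already exists, and that its metadata contains a numeric field called "size". The field name, the threshold, and the function name save_and_reload_dataset are illustrative assumptions, not part of the API.

```python
def save_and_reload_dataset(repo: str, datasource_name: str, dataset_name: str):
    """Apply a query to a datasource, save it as a dataset, then fetch it back.

    Needs the dagshub client, credentials, and network access when called.
    """
    # Imported lazily so that defining this function does not require dagshub.
    from dagshub.data_engine import datasets, datasources

    ds = datasources.get_datasource(repo, name=datasource_name)

    # Assumption: "size" is a metadata field on this datasource.
    # Comparing a field to a value yields a datasource with the query applied.
    filtered = ds["size"] > 1024

    # Store the queried datasource on DagsHub under the given dataset name.
    filtered.save_dataset(dataset_name)

    # Anyone with access to the repo can later retrieve it by name.
    return datasets.get_dataset(repo, name=dataset_name)
```

A call such as save_and_reload_dataset("<owner>/<reponame>", "images-main", "large-images") would then return the saved dataset, which is itself a Datasource object.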
- dagshub.data_engine.datasets.get_datasets(repo: str) → List[Datasource]
Get all datasets that exist on the repo.
- Parameters:
repo – Repo in <owner>/<reponame> format.
- Returns:
All datasets of the repo
- Return type:
list(Datasource)
- dagshub.data_engine.datasets.get_dataset(repo: str, name: str | None = None, id: int | str | None = None) → Datasource
Get a specific dataset by name or id.
- Parameters:
repo – Repo in <owner>/<reponame> format.
name – Name of the dataset.
id – ID of the dataset.
- Returns:
Found dataset
- Return type:
Datasource
- Raises:
DatasetNotFoundError – No dataset found with this name or id.
- dagshub.data_engine.datasets.get_dataset_from_file(path: str) → Datasource
Load a dataset from a local file.
This is a copy of datasources.get_datasource_from_file().
- Parameters:
path – Path to the .dagshub file with the relevant dataset.
- Returns:
Dataset that was logged to the file
- dagshub.data_engine.datasets.get_from_mlflow(run=None, artifact_name='datasource.dagshub.json') → Datasource
Load a dataset from an MLflow run.
To save a datasource to MLflow, use Datasource.log_to_mlflow().
This is a copy of datasources.get_from_mlflow().
- Parameters:
run – Run or ID of the MLflow run to load the datasource from. If None, gets it from the current active run.
artifact_name – Name of the artifact in the run.