QueryResult

class dagshub.data_engine.model.query_result.QueryResult(_entries, datasource, fields, query_data_time=None) → None

Result of executing a query on a Datasource.

You can iterate over this object to get the datapoints:

res = ds.head()
for dp in res:
    print(dp.path_in_repo)
property entries

Datapoints contained in this QueryResult

Type:

list(Datapoint)

property dataframe

Represent the contents of this QueryResult as a pandas.DataFrame.

The created dataframe has a copy of the QueryResult’s data.
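
For example, a minimal sketch (assuming a datasource object ds, as in the snippet above):

res = ds.head()
df = res.dataframe  # pandas.DataFrame holding a copy of the query result's data
print(df.columns)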

as_ml_dataset(flavor, **kwargs)

Convert the QueryResult into a dataset for a machine learning framework

Parameters:

flavor (str) – Either of "torch" or "tensorflow".

Keyword Arguments:
  • metadata_columns (list(str)) – which fields to use in the dataset.

  • strategy (str) –

    Datapoint file loading strategy. Possible values:

    • "preload" - Download all datapoints before returning the dataset.

    • "background" - Start downloading the datapoints in the background. If an undownloaded datapoint is accessed, it gets downloaded.

    • "lazy" (default) - Download each datapoint as it is being accessed by the dataset.

  • savedir (str|Path) – Where to store the datapoint files. Default is datasource's default location

  • processes (int) – number of parallel processes to download the datapoints with. Default is 8.

  • tensorizers

    How to transform the datapoint file/metadata into tensors. Possible values:

    • "auto" - try to guess the tensorizers for every field. For files the tensorizer will be the determined by the first file’s extension.

    • "image" | "audio" - tensorize all fields according to this type

    • List of tensorizers. First is the tensorizer for the datapoint file and then a tensorizer for each of the metadata fields. Tensorizer can either be strings "image", "audio", "numeric", or your own function that receives the metadata value of the field and turns it into a tensor.
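
For illustration, a minimal sketch of building a PyTorch-flavored dataset from a query result; the datasource object ds and the "label" metadata field are assumptions:

res = ds.head()
dataset = res.as_ml_dataset(
    "torch",
    metadata_columns=["label"],
    strategy="background",             # start downloading files in the background
    tensorizers=["image", "numeric"],  # file -> image tensor, "label" -> numeric tensor
)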

as_ml_dataloader(flavor, **kwargs)

Convert the QueryResult into a dataloader for a machine learning framework

Parameters:

flavor (str) – Either of "torch" or "tensorflow".

The keyword arguments are the same as for as_ml_dataset().
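
For example (a sketch under the same assumptions as the as_ml_dataset() example above):

loader = ds.head().as_ml_dataloader("torch", metadata_columns=["label"])
for batch in loader:
    pass  # feed each batch into your training loop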

as_hf_dataset(target_dir=None, download_datapoints=True, download_blobs=True)

Loads this QueryResult as a HuggingFace dataset.

The paths of the downloads are set to the local paths in the filesystem, so they can be used with a cast_column() function later.

Parameters:
  • target_dir (Union[PathLike, str, None]) – Where to download the datapoints. The metadata is still downloaded into the global cache.

  • download_datapoints – If set to True (default), downloads the datapoint files and sets the path column to the path of the datapoint in the filesystem

  • download_blobs – If set to True (default), downloads all blob fields and sets the respective column to the path of the file in the filesystem.
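
A minimal sketch of as_hf_dataset(); the "path" column name and the use of an Image feature are assumptions based on the description above:

from datasets import Image

hf_ds = ds.head().as_hf_dataset(target_dir="data/hf_export")
# Decode the downloaded files as images on access (column name assumed to be "path"):
hf_ds = hf_ds.cast_column("path", Image())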

get_blob_fields(*fields, load_into_memory=False, cache_on_disk=True, num_proc=32, path_format='path') → QueryResult

Downloads data from blob fields.

If load_into_memory is set to True, then we additionally convert special fields to new types:

  • Annotation fields are converted to MetadataAnnotations

  • Document fields are converted to strings

Parameters:
  • fields (str) – list of binary fields to download blobs for. If empty, download all blob fields.

  • load_into_memory

    Whether to load the blobs into the datapoints, or just store them on disk

    If True: the datapoints’ specified fields will contain the blob data

    If False: the datapoints’ specified fields will contain Path objects to the file of the downloaded blob

  • cache_on_disk – Whether to cache the blobs on disk (relevant only if load_into_memory is set to True). The cache location is ~/dagshub/datasets/<repo>/<datasource_id>/.metadata_blobs/

  • num_proc (int) – number of download threads

  • path_format (Literal['str', 'path']) – How the path to the file should be represented. "path" returns a Path object, and "str" returns the path as a string.

Return type:

QueryResult
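
For example, a sketch that loads annotation blobs into memory; the field name "annotation" is an assumption:

res = ds.head()
res = res.get_blob_fields("annotation", load_into_memory=True, num_proc=8)
# Each datapoint's "annotation" field now holds the loaded blob
# (annotation fields are converted to MetadataAnnotations).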

annotate_with_mlflow_model(repo, name, host=None, version='latest', pre_hook=<function identity_func>, post_hook=<function identity_func>, batch_size=1, log_to_field='annotation') → Dict[str, Any]

Fetch an MLflow model from a specific repository and use it to annotate the datapoints in this QueryResult. The resulting annotations are then stored in the field specified by log_to_field.

Any MLflow model that has a model.predict endpoint is supported. This includes, but is not limited to the following flavors:

  • torch

  • tensorflow

  • pyfunc

  • scikit-learn

Keep in mind that by default model.predict() will receive the list of downloaded datapoint paths as input, so additional “massaging” of the data might be required for prediction to work. Use the pre_hook function to do so.

Parameters:
  • repo (str) – repository to extract the model from

  • name (str) – name of the model in the repository’s MLflow registry.

  • host (Optional[str]) – address of the DagsHub instance with the repo to load the model from. Set it if the model is hosted on a different DagsHub instance than the datasource.

  • version (str) – version of the model in the mlflow registry.

  • pre_hook (Callable[[List[str]], Any]) – function that runs before datapoints are sent to model.predict(). The input argument is the list of paths to datapoint files in the current batch.

  • post_hook (Callable[[Any], Any]) – function that converts the model output to the desired format.

  • batch_size (int) – Size of the file batches that are sent to model.predict(). The default batch size is 1, but the input is still sent as a list for consistency.

  • log_to_field (str) – Field to store the resulting annotations in.

Return type:

Dict[str, Any]
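
A minimal sketch; the repository, model name, and the image-loading pre-hook are assumptions:

from PIL import Image

def load_images(paths):
    # model.predict() receives a list of file paths by default;
    # convert them into the inputs the model expects (assumed to be PIL images here)
    return [Image.open(p) for p in paths]

ds.head().annotate_with_mlflow_model(
    repo="my_user/my_repo",
    name="my-model",
    pre_hook=load_images,
    batch_size=8,
    log_to_field="annotation",
)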

predict_with_mlflow_model(repo, name, host=None, version='latest', pre_hook=<function identity_func>, post_hook=<function identity_func>, batch_size=1, log_to_field=None) → Dict[str, Any]

Fetch an MLflow model from a specific repository and use it to predict on the datapoints in this QueryResult.

Any MLflow model that has a model.predict endpoint is supported. This includes, but is not limited to the following flavors:

  • torch

  • tensorflow

  • pyfunc

  • scikit-learn

Keep in mind that by default model.predict() will receive the list of downloaded datapoint paths as input, so additional “massaging” of the data might be required for prediction to work. Use the pre_hook function to do so.

Parameters:
  • repo (str) – repository to extract the model from

  • name (str) – name of the model in the repository’s MLflow registry.

  • host (Optional[str]) – address of the DagsHub instance with the repo to load the model from. Set it if the model is hosted on a different DagsHub instance than the datasource.

  • version (str) – version of the model in the mlflow registry.

  • pre_hook (Callable[[List[str]], Any]) – function that runs before datapoints are sent to model.predict(). The input argument is the list of paths to datapoint files in the current batch.

  • post_hook (Callable[[Any], Any]) – function that converts the model output to the desired format.

  • batch_size (int) – Size of the file batches that are sent to model.predict(). The default batch size is 1, but the input is still sent as a list for consistency.

  • log_to_field (Optional[str]) – If set, writes prediction results to this metadata field in the datasource.

Return type:

Dict[str, Any]

get_annotations(**kwargs) → QueryResult

Loads all annotation fields using get_blob_fields().

All keyword arguments are passed to get_blob_fields().

Return type:

QueryResult

download_files(target_dir=None, keep_source_prefix=True, redownload=False, path_field=None) → PathLike

Downloads the datapoints to the target_dir directory

Parameters:
  • target_dir (Union[PathLike, str, None]) – Where to download the files. Defaults to datasource's default location

  • keep_source_prefix – If True, includes the prefix of the datasource in the download path.

  • redownload – Whether to redownload a file if it exists on the filesystem already. We don’t do any hashsum checks, so if it’s possible that the file has been updated, set to True

  • path_field (Optional[str]) – Set this to the name of the field with the file’s path if you want to download files from a field other than the datapoint’s path.

Note

For path_field, the path stored in the field still needs to be in the same repo and have the same format as the datapoint's path, including not having the prefix. For now, you can't download arbitrary paths/URLs.

Return type:

PathLike

Returns:

Path to the directory with the downloaded files
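
For example (the target directory is illustrative):

download_dir = ds.head().download_files(target_dir="data/raw", keep_source_prefix=False)
print(download_dir)  # path to the directory with the downloaded files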

export_as_yolo(download_dir=None, annotation_field=None, annotation_type=None, classes=None) → Path

Downloads the files and annotations in a way that can be used to train with YOLO immediately.

Parameters:
  • download_dir (Union[str, Path, None]) – Where to download the files. Defaults to ./dagshub_export

  • annotation_field (Optional[str]) – Field with the annotations. If None, uses the first alphabetical annotation field.

  • annotation_type (Optional[Literal['bbox', 'segmentation', 'pose']]) – Type of YOLO annotations to export. Possible values: “bbox”, “segmentation”, “pose”. If None, the type is inferred from the most common annotation type.

  • classes (Optional[Dict[int, str]]) – Classes and indices for the YOLO dataset. If None, the classes will be inferred from the annotations. Any class in the annotations that doesn’t exist in the dictionary will be added at the end of the classes. The dictionary has to be in the format {index: class_name}.

Return type:

Path

Returns:

The path to the YAML file with the metadata. Pass this path to YOLO.train() to train a model.
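
A minimal sketch of training on the export; the ultralytics package and the weights file name are assumptions:

from ultralytics import YOLO

yaml_path = ds.head().export_as_yolo(download_dir="yolo_export", annotation_type="bbox")
model = YOLO("yolov8n.pt")       # assumed pretrained weights
model.train(data=str(yaml_path))  # pass the exported YAML to YOLO.train()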

to_voxel51_dataset(**kwargs) → fiftyone.Dataset

Creates a voxel51 dataset that can be used with fo.launch_app() to visualize it.

Keyword Arguments:
  • name (str) – Name of the dataset. Default is the name of the datasource.

  • force_download (bool) – Download the dataset even if the size of the files is bigger than 100MB. Default is False

  • files_location (str|PathLike) – Path to the directory to download the files into. Default is the datasource's default location.

  • redownload (bool) – Redownload files, replacing the ones that might exist on the filesystem. Default is False.

  • voxel_annotations (List[str]) – List of fields from which to load voxel annotations serialized with to_json(). This will override the Label Studio annotations.

Return type:

fo.Dataset
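
For example (the dataset name is illustrative):

import fiftyone as fo

voxel_ds = ds.head().to_voxel51_dataset(name="my-datasource-sample")
session = fo.launch_app(voxel_ds)
session.wait(-1)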

visualize(visualizer='dagshub', **kwargs) → str | fiftyone.Session

Visualize this QueryResult either on DagsHub or with Voxel51.

If visualizer is dagshub, a webpage is opened on DagsHub with the query applied.

If visualizer is fiftyone, this function calls to_voxel51_dataset(), passing to it the keyword arguments, and launches a fiftyone session showing the dataset.

Additionally, this function adds a DagsHub plugin into Voxel51 that you can use for additional interactions with the datasource from within the voxel environment.

Returns the session object, which you can wait() on if you are using it outside a notebook and need the script not to exit immediately:

session = ds.fetch().visualize()
session.wait(-1)

Return type:

Union[str, Session]

generate_predictions(predict_fn, batch_size=1, log_to_field=None, is_prediction=False) → Dict[str, Tuple[str, float | None]]

Sends all the datapoints returned in this QueryResult as prediction targets to a generic prediction function.

Parameters:
  • predict_fn (Callable[[List[str]], List[Tuple[Any, Optional[float]]]]) – function that handles batched input and returns predictions with an optional prediction score.

  • batch_size (int) – (optional, default: 1) number of datapoints to run inference on simultaneously

  • log_to_field (Optional[str]) – (optional, default: None) metadata field in the data engine to write the prediction results to. If None, the predictions are just returned; if set, they are returned in addition to being logged to that field.

  • is_prediction (Optional[bool]) – (optional, default: False) whether we’re creating predictions or annotations.

Return type:

Dict[str, Tuple[str, Optional[float]]]
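
A minimal sketch with a placeholder prediction function; the constant label and score stand in for a real model call:

def predict_fn(paths):
    # Receives a batch of datapoint file paths; must return one
    # (prediction, optional score) tuple per path
    return [("cat", 0.95) for _ in paths]

results = ds.head().generate_predictions(predict_fn, batch_size=4, log_to_field="prediction")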

generate_annotations(predict_fn, batch_size=1, log_to_field='annotation')

Sends all the datapoints returned in this QueryResult as prediction targets to a generic prediction function.

Parameters:
  • predict_fn (Callable[[List[str]], List[Tuple[Any, Optional[float]]]]) – function that handles batched input and returns predictions with an optional prediction score.

  • batch_size (int) – (optional, default: 1) number of datapoints to run inference on simultaneously.

  • log_to_field (str) – (optional, default: ‘annotation’) write prediction results to this metadata field logged in the data engine.

annotate(open_project=True, ignore_warning=True, fields_to_embed=None, fields_to_exclude=None) → str | None

Sends all the datapoints returned in this QueryResult to be annotated in Label Studio on DagsHub. Alternatively, MLflow can be used to automatically label datapoints.

Parameters:
  • open_project (bool) – Automatically open the Label Studio project in the browser

  • ignore_warning (bool) – Suppress the prompt-warning if you try to annotate too many datapoints at once.

  • fields_to_embed (Optional[List[str]]) – list of metadata columns that will show up in the Label Studio UI. If not specified, all will be displayed.

  • fields_to_exclude (Optional[List[str]]) – list of metadata columns that will not show up in the Label Studio UI.

Return type:

Optional[str]

Returns:

The URL of the created Label Studio workspace
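
For example (the excluded field name is illustrative):

url = ds.head().annotate(open_project=False, fields_to_exclude=["embedding"])
print(url)  # URL of the created Label Studio workspace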

log_to_mlflow(run=None) → mlflow.entities.Run

Logs the query result information to MLflow as an artifact. The artifact will be saved at the root of the run with the name in the format of log_{datasource_name}_{query_time}_{random_chunk}.dagshub.dataset.json.

You can later load the dataset back from MLflow using dagshub.data_engine.datasources.get_from_mlflow().

Parameters:

run (Optional[Run]) – MLflow run to save to. If None, uses the active MLflow run or creates a new run.

Return type:

Run

Returns:

Run to which the artifact was logged.
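
For example:

logged_run = ds.head().log_to_mlflow()  # uses the active MLflow run, or creates a new one
print(logged_run.info.run_id)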