QueryResult

class dagshub.data_engine.model.query_result.QueryResult(_entries, datasource, fields, query_data_time=None) → None

Result of executing a query on a Datasource.

You can iterate over this object to get the datapoints:

res = ds.head()
for dp in res:
    print(dp.path_in_repo)
property entries

Datapoints contained in this QueryResult

Type:

list(Datapoint)

property dataframe

Represent the contents of this QueryResult as a pandas.DataFrame.

The created dataframe has a copy of the QueryResult’s data.
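
For example, a minimal sketch (assuming a datasource object ds, as in the snippet above):

res = ds.head()
df = res.dataframe  # pandas.DataFrame holding a copy of the query result's data
print(df.columns)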

as_ml_dataset(flavor, **kwargs)

Convert the QueryResult into a dataset for a machine learning framework

Parameters:

flavor (str) – Either of "torch" or "tensorflow".

Keyword Arguments:
  • metadata_columns (list(str)) – which fields to use in the dataset.

  • strategy (str) –

    Datapoint file loading strategy. Possible values:

    • "preload" - Download all datapoints before returning the dataset.

    • "background" - Start downloading the datapoints in the background. If an undownloaded datapoint is accessed, it gets downloaded.

    • "lazy" (default) - Download each datapoint as it is being accessed by the dataset.

  • savedir (str|Path) – Where to store the datapoint files. Default is datasource's default location

  • processes (int) – number of parallel processes to download the datapoints with. Default is 8.

  • tensorizers

    How to transform the datapoint file/metadata into tensors. Possible values:

    • "auto" - try to guess the tensorizers for every field. For files the tensorizer will be the determined by the first file’s extension.

    • "image" | "audio" - tensorize all fields according to this type

    • List of tensorizers. First is the tensorizer for the datapoint file and then a tensorizer for each of the metadata fields. Tensorizer can either be strings "image", "audio", "numeric", or your own function that receives the metadata value of the field and turns it into a tensor.
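
For illustration, a minimal sketch of building a PyTorch-flavored dataset from a query result; the datasource object ds and the "label" metadata field are assumptions:

res = ds.head()
dataset = res.as_ml_dataset(
    "torch",
    metadata_columns=["label"],
    strategy="background",             # start downloading files in the background
    tensorizers=["image", "numeric"],  # file -> image tensor, "label" -> numeric tensor
)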

as_ml_dataloader(flavor, **kwargs)

Convert the QueryResult into a dataloader for a machine learning framework

Parameters:

flavor (str) – Either of "torch" or "tensorflow".

The keyword arguments are the same as for as_ml_dataset().
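
For example (a sketch under the same assumptions as the as_ml_dataset() example above):

loader = ds.head().as_ml_dataloader("torch", metadata_columns=["label"])
for batch in loader:
    pass  # feed each batch into your training loop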

as_hf_dataset(target_dir=None, download_datapoints=True, download_blobs=True)

Loads this QueryResult as a HuggingFace dataset.

The paths of the downloads are set to the local paths in the filesystem, so they can be used with a cast_column() function later.

Parameters:
  • target_dir (Union[PathLike, str, None]) – Where to download the datapoints. The metadata is still downloaded into the global cache.

  • download_datapoints – If set to True (default), downloads the datapoint files and sets the path column to the path of the datapoint in the filesystem

  • download_blobs – If set to True (default), downloads all blob fields and sets the respective column to the path of the file in the filesystem.
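
A minimal sketch of as_hf_dataset(); the "path" column name and the use of an Image feature are assumptions based on the description above:

from datasets import Image

hf_ds = ds.head().as_hf_dataset(target_dir="data/hf_export")
# Decode the downloaded files as images on access (column name assumed to be "path"):
hf_ds = hf_ds.cast_column("path", Image())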

get_blob_fields(*fields, load_into_memory=False, cache_on_disk=True, num_proc=32, path_format='path') → QueryResult

Downloads data from blob fields.

If load_into_memory is set to True, then we additionally convert special fields to new types:

  • Annotation fields are converted to MetadataAnnotations

  • Document fields are converted to strings

Parameters:
  • fields (str) – list of binary fields to download blobs for. If empty, download all blob fields.

  • load_into_memory

    Whether to load the blobs into the datapoints, or just store them on disk

    If True: the datapoints’ specified fields will contain the blob data

    If False: the datapoints’ specified fields will contain Path objects to the file of the downloaded blob

  • cache_on_disk – Whether to cache the blobs on disk (relevant only if load_into_memory is set to True). The cache location is ~/dagshub/datasets/<repo>/<datasource_id>/.metadata_blobs/

  • num_proc (int) – number of download threads

  • path_format (Literal['str', 'path']) – How the path to the file should be represented. "path" returns a Path object, and "str" returns the path as a string.

Return type:

QueryResult
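
For example, a sketch that loads annotation blobs into memory; the field name "annotation" is an assumption:

res = ds.head()
res = res.get_blob_fields("annotation", load_into_memory=True, num_proc=8)
# Each datapoint's "annotation" field now holds the loaded blob
# (annotation fields are converted to MetadataAnnotations).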

annotate_with_mlflow_model(repo, name, host=None, version='latest', pre_hook=<function identity_func>, post_hook=<function identity_func>, batch_size=1, log_to_field='annotation') → Dict[str, Any]

Fetch an MLflow model from a specific repository and use it to annotate the datapoints in this QueryResult. The resulting annotations are then stored in the field specified by log_to_field.

Any MLflow model that has a model.predict endpoint is supported. This includes, but is not limited to the following flavors:

  • torch

  • tensorflow

  • pyfunc

  • scikit-learn

Keep in mind that by default model.predict() will receive the list of downloaded datapoint paths as input, so additional “massaging” of the data might be required for prediction to work. Use the pre_hook function to do so.

Parameters:
  • repo (str) – repository to extract the model from

  • name (str) – name of the model in the repository’s MLflow registry.

  • host (Optional[str]) – address of the DagsHub instance with the repo to load the model from. Set it if the model is hosted on a different DagsHub instance than the datasource.

  • version (str) – version of the model in the mlflow registry.

  • pre_hook (Callable[[List[str]], Any]) – function that runs before datapoints are sent to model.predict(). The input argument is the list of paths to datapoint files in the current batch.

  • post_hook (Callable[[Any], Any]) – function that converts the model output to the desired format.

  • batch_size (int) – Size of the file batches that are sent to model.predict(). The default batch size is 1, but the input is still sent as a list for consistency.

  • log_to_field (str) – Field to store the resulting annotations in.

Return type:

Dict[str, Any]
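
A minimal sketch; the repository, model name, and the image-loading pre-hook are assumptions:

from PIL import Image

def load_images(paths):
    # model.predict() receives a list of file paths by default;
    # convert them into the inputs the model expects (assumed to be PIL images here)
    return [Image.open(p) for p in paths]

ds.head().annotate_with_mlflow_model(
    repo="my_user/my_repo",
    name="my-model",
    pre_hook=load_images,
    batch_size=8,
    log_to_field="annotation",
)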

predict_with_mlflow_model(repo, name, host=None, version='latest', pre_hook=<function identity_func>, post_hook=<function identity_func>, batch_size=1, log_to_field=None) → Dict[str, Any]

Fetch an MLflow model from a specific repository and use it to predict on the datapoints in this QueryResult.

Any MLflow model that has a model.predict endpoint is supported. This includes, but is not limited to the following flavors:

  • torch

  • tensorflow

  • pyfunc

  • scikit-learn

Keep in mind that by default model.predict() will receive the list of downloaded datapoint paths as input, so additional “massaging” of the data might be required for prediction to work. Use the pre_hook function to do so.

Parameters:
  • repo (str) – repository to extract the model from

  • name (str) – name of the model in the repository’s MLflow registry.

  • host (Optional[str]) – address of the DagsHub instance with the repo to load the model from. Set it if the model is hosted on a different DagsHub instance than the datasource.

  • version (str) – version of the model in the mlflow registry.

  • pre_hook (Callable[[List[str]], Any]) – function that runs before datapoints are sent to model.predict(). The input argument is the list of paths to datapoint files in the current batch.

  • post_hook (Callable[[Any], Any]) – function that converts the model output to the desired format.

  • batch_size (int) – Size of the file batches that are sent to model.predict(). The default batch size is 1, but the input is still sent as a list for consistency.

  • log_to_field (Optional[str]) – If set, writes prediction results to this metadata field in the datasource.

Return type:

Dict[str, Any]

get_annotations(**kwargs) → QueryResult

Loads all annotation fields using get_blob_fields().

All keyword arguments are passed to get_blob_fields().

Return type:

QueryResult

download_files(target_dir=None, keep_source_prefix=True, redownload=False, path_field=None) → PathLike

Downloads the datapoints to the target_dir directory

Parameters:
  • target_dir (Union[PathLike, str, None]) – Where to download the files. Defaults to datasource's default location

  • keep_source_prefix – If True, includes the prefix of the datasource in the download path.

  • redownload – Whether to redownload a file if it exists on the filesystem already. We don’t do any hashsum checks, so if it’s possible that the file has been updated, set to True

  • path_field (Optional[str]) – Set this to the name of the field with the file’s path if you want to download files from a field other than the datapoint’s path.

Note

For path_field, the path stored in the field still needs to be in the same repo and have the same format as the datapoint's path, including not having the prefix. For now, you can't download arbitrary paths/URLs.

Return type:

PathLike

Returns:

Path to the directory with the downloaded files
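
For example (the target directory is illustrative):

download_dir = ds.head().download_files(target_dir="data/raw", keep_source_prefix=False)
print(download_dir)  # path to the directory with the downloaded files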

export_as_yolo(download_dir=None, annotation_field=None, annotation_type=None, classes=None) → Path

Downloads the files and annotations in a way that can be used to train with YOLO immediately.

Parameters:
  • download_dir (Union[str, Path, None]) – Where to download the files. Defaults to ./dagshub_export

  • annotation_field (Optional[str]) – Field with the annotations. If None, uses the first alphabetical annotation field.

  • annotation_type (Optional[Literal['bbox', 'segmentation', 'pose']]) – Type of YOLO annotations to export. Possible values: “bbox”, “segmentation”, “pose”. If None, the type is inferred from the most common annotation type.

  • classes (Optional[Dict[int, str]]) – Classes and indices for the YOLO dataset. If None, the classes will be inferred from the annotations. Any class in the annotations that doesn’t exist in the dictionary will be added at the end of the classes. The dictionary has to be in the format {index: class_name}.

Return type:

Path

Returns:

The path to the YAML file with the metadata. Pass this path to YOLO.train() to train a model.
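
A minimal sketch of training on the export; the ultralytics package and the weights file name are assumptions:

from ultralytics import YOLO

yaml_path = ds.head().export_as_yolo(download_dir="yolo_export", annotation_type="bbox")
model = YOLO("yolov8n.pt")       # assumed pretrained weights
model.train(data=str(yaml_path))  # pass the exported YAML to YOLO.train()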

to_voxel51_dataset(**kwargs) → fiftyone.Dataset

Creates a voxel51 dataset that can be used with fo.launch_app() to visualize it.

Keyword Arguments:
  • name (str) – Name of the dataset. Default is the name of the datasource.

  • force_download (bool) – Download the dataset even if the size of the files is bigger than 100MB. Default is False

  • files_location (str|PathLike) – Path to the directory to download the files into. Default is the datasource's default location.

  • redownload (bool) – Redownload files, replacing the ones that might exist on the filesystem. Default is False.

  • voxel_annotations (List[str]) – List of fields from which to load voxel annotations serialized with to_json(). This will override the Label Studio annotations.

Return type:

fo.Dataset
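
For example (the dataset name is illustrative):

import fiftyone as fo

voxel_ds = ds.head().to_voxel51_dataset(name="my-datasource-sample")
session = fo.launch_app(voxel_ds)
session.wait(-1)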

visualize(visualizer='dagshub', **kwargs) → str | fiftyone.Session

Visualize this QueryResult either on DagsHub or with Voxel51.

If visualizer is dagshub, a webpage is opened on DagsHub with the query applied.

If visualizer is fiftyone, this function calls to_voxel51_dataset(), passing to it the keyword arguments, and launches a fiftyone session showing the dataset.

Additionally, this function adds a DagsHub plugin into Voxel51 that you can use for additional interactions with the datasource from within the voxel environment.

Returns the session object, which you can wait() on if you are using it outside a notebook and need the script not to exit immediately:

session = ds.fetch().visualize()
session.wait(-1)

Return type:

Union[str, Session]

generate_predictions(predict_fn, batch_size=1, log_to_field=None, is_prediction=False) → Dict[str, Tuple[str, float | None]]

Sends all the datapoints returned in this QueryResult as prediction targets to a generic prediction function.

Parameters:
  • predict_fn (Callable[[List[str]], List[Tuple[Any, Optional[float]]]]) – function that handles batched input and returns predictions with an optional prediction score.

  • batch_size (int) – (optional, default: 1) number of datapoints to run inference on simultaneously

  • log_to_field (Optional[str]) – (optional, default: None) metadata field in the data engine to write the prediction results to. If None, the predictions are just returned; if set, they are returned in addition to being logged to that field.

  • is_prediction (Optional[bool]) – (optional, default: False) whether we’re creating predictions or annotations.

Return type:

Dict[str, Tuple[str, Optional[float]]]
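
A minimal sketch with a placeholder prediction function; the constant label and score stand in for a real model call:

def predict_fn(paths):
    # Receives a batch of datapoint file paths; must return one
    # (prediction, optional score) tuple per path
    return [("cat", 0.95) for _ in paths]

results = ds.head().generate_predictions(predict_fn, batch_size=4, log_to_field="prediction")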

generate_annotations(predict_fn, batch_size=1, log_to_field='annotation')

Sends all the datapoints returned in this QueryResult as prediction targets to a generic prediction function.

Parameters:
  • predict_fn (Callable[[List[str]], List[Tuple[Any, Optional[float]]]]) – function that handles batched input and returns predictions with an optional prediction score.

  • batch_size (int) – (optional, default: 1) number of datapoints to run inference on simultaneously.

  • log_to_field (str) – (optional, default: ‘annotation’) write prediction results to this metadata field logged in the data engine.

annotate(open_project=True, ignore_warning=True, fields_to_embed=None, fields_to_exclude=None) → str | None

Sends all the datapoints returned in this QueryResult to be annotated in Label Studio on DagsHub. Alternatively, MLflow can be used to automatically label datapoints.

Parameters:
  • open_project (bool) – Automatically open the Label Studio project in the browser

  • ignore_warning (bool) – Suppress the prompt-warning if you try to annotate too many datapoints at once.

  • fields_to_embed (Optional[List[str]]) – list of metadata columns that will show up in the Label Studio UI. If not specified, all will be displayed.

  • fields_to_exclude (Optional[List[str]]) – list of metadata columns that will not show up in the Label Studio UI.

Return type:

Optional[str]

Returns:

The URL of the created Label Studio workspace
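
For example (the excluded field name is illustrative):

url = ds.head().annotate(open_project=False, fields_to_exclude=["embedding"])
print(url)  # URL of the created Label Studio workspace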

log_to_mlflow(run=None) → mlflow.entities.Run

Logs the query result information to MLflow as an artifact. The artifact will be saved at the root of the run with the name in the format of log_{datasource_name}_{query_time}_{random_chunk}.dagshub.dataset.json.

You can later load the dataset back from MLflow using dagshub.data_engine.datasources.get_from_mlflow().

Parameters:

run (Optional[Run]) – MLflow run to save to. If None, uses the active MLflow run or creates a new run.

Return type:

Run

Returns:

Run to which the artifact was logged.
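
For example:

logged_run = ds.head().log_to_mlflow()  # uses the active MLflow run, or creates a new one
print(logged_run.info.run_id)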