QueryResult

class dagshub.data_engine.model.query_result.QueryResult(_entries: List[Datapoint], datasource: Datasource, fields: List[MetadataSelectFieldSchema], query_data_time: datetime | None = None)

Result of executing a query on a Datasource.

You can iterate over this object to get the datapoints:

res = ds.head()
for dp in res:
    print(dp.path_in_repo)

property entries

Datapoints contained in this QueryResult

Type:

list(Datapoint)

property dataframe

Represent the contents of this QueryResult as a pandas.DataFrame.

The created dataframe has a copy of the QueryResult’s data.
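
For example, to inspect the result with pandas:

res = ds.head()
df = res.dataframe
print(df.head())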

as_ml_dataset(flavor: str, **kwargs)

Convert the QueryResult into a dataset for a machine learning framework

Parameters:

flavor

Either of:

  • "torch" – a PyTorch dataset

  • "tensorflow" – a TensorFlow dataset

Keyword Arguments:
  • metadata_columns (list(str)) – which fields to use in the dataset.

  • strategy (str) –

    Datapoint file loading strategy. Possible values:

    • "preload" - Download all datapoints before returning the dataset.

    • "background" - Start downloading the datapoints in the background. If an undownloaded datapoint is accessed, it gets downloaded.

    • "lazy" (default) - Download each datapoint as it is being accessed by the dataset.

  • savedir (str|Path) – Where to store the datapoint files. Default is the datasource's default location

  • processes (int) – number of parallel processes to download the datapoints with. Default is 8.

  • tensorizers

    How to transform the datapoint file/metadata into tensors. Possible values:

    • "auto" - try to guess the tensorizers for every field. For files the tensorizer will be the determined by the first file’s extension.

    • "image" | "audio" - tensorize all fields according to this type

    • A list of tensorizers. The first is the tensorizer for the datapoint file, followed by one tensorizer per metadata field. Each tensorizer can be one of the strings "image", "audio", "numeric", or your own function that receives the field's metadata value and turns it into a tensor.
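
A minimal sketch, assuming the "torch" flavor and an image datasource with one numeric metadata field named "label" (both the flavor value and the field name are illustrative):

res = ds.all()
dataset = res.as_ml_dataset(
    "torch",
    metadata_columns=["label"],
    strategy="background",
    tensorizers=["image", "numeric"],  # datapoint file first, then one per metadata field
)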

as_ml_dataloader(flavor, **kwargs)

Convert the QueryResult into a dataloader for a machine learning framework

Parameters:

flavor

Either of:

  • "torch" – a PyTorch dataloader

  • "tensorflow" – a TensorFlow dataloader

Kwargs are the same as as_ml_dataset().
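
For example, under the same illustrative assumptions as above:

loader = res.as_ml_dataloader("torch", metadata_columns=["label"])
for batch in loader:
    ...  # feed the batch into your training loop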

as_hf_dataset(target_dir: PathLike | str | None = None, download_datapoints=True, download_blobs=True)

Loads this QueryResult as a HuggingFace dataset.

The path columns are set to the local paths of the downloaded files in the filesystem, so they can later be used with the cast_column() function.

Parameters:
  • target_dir – Where to download the datapoints. The metadata is still downloaded into the global cache.

  • download_datapoints – If set to True (default), downloads the datapoint files and sets the path column to the path of the datapoint in the filesystem

  • download_blobs – If set to True (default), downloads all blob fields and sets the respective column to the path of the file in the filesystem.
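
A sketch, assuming an image datasource and the HuggingFace datasets library; the "path" column name follows the description above:

import datasets

hf_ds = res.as_hf_dataset(target_dir="hf_data")
# Cast the downloaded file paths to an Image feature so they load as images
hf_ds = hf_ds.cast_column("path", datasets.Image())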

get_blob_fields(*fields: str, load_into_memory=False, cache_on_disk=True, num_proc: int = 32, path_format: Literal['str', 'path'] = 'path') QueryResult

Downloads data from blob fields.

If load_into_memory is set to True, then we additionally convert special fields to new types:

  • Annotation fields are converted to MetadataAnnotations

  • Document fields are converted to strings

Parameters:
  • fields – list of binary fields to download blobs for. If empty, download all blob fields.

  • load_into_memory

    Whether to load the blobs into the datapoints, or just store them on disk

    If True: the datapoints’ specified fields will contain the blob data

    If False: the datapoints’ specified fields will contain Path objects to the file of the downloaded blob

  • cache_on_disk – Whether to cache the blobs on disk (valid only if load_into_memory is set to True). The cache location is ~/dagshub/datasets/<repo>/<datasource_id>/.metadata_blobs/

  • num_proc – number of download threads

  • path_format – How the paths to the files should be represented: "path" returns a Path object, and "str" returns the path as a string.
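
For example, assuming a blob field named "annotation" (illustrative):

res = res.get_blob_fields("annotation", load_into_memory=True)
for dp in res:
    print(dp.metadata["annotation"])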

predict_with_mlflow_model(repo: str, name: str, host: str | None = None, version: str = 'latest', pre_hook: ~typing.Callable[[~typing.Any], ~typing.Any] = <function QueryResult.<lambda>>, post_hook: ~typing.Callable[[~typing.Any], ~typing.Any] = <function QueryResult.<lambda>>, batch_size: int = 1, log_to_field: str | None = None) list | None

Sends all the datapoints returned in this QueryResult as prediction targets for an MLFlow model registered on DagsHub.

Parameters:
  • repo – repository to extract the model from

  • name – name of the model in the mlflow registry

  • version – (optional, default: ‘latest’) version of the model in the mlflow registry

  • pre_hook – (optional, default: identity function) function that runs before the datapoint is sent to the model

  • post_hook – (optional, default: identity function) function that converts the mlflow model output to the desired format

  • batch_size – (optional, default: 1) how many datapoints are sent to the model at once

  • log_to_field – (optional, default: ‘prediction’) write prediction results to this metadata field in the Data Engine. If None, just returns the predictions; if the parameter is set, the predictions are returned in addition to being logged to the field.
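
A sketch; the repository, model name, and hooks are illustrative:

predictions = res.predict_with_mlflow_model(
    "user/repo",
    "my-model",
    post_hook=lambda output: output[0],  # e.g. unwrap the model's output
    batch_size=16,
    log_to_field="prediction",
)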

get_annotations(**kwargs) QueryResult

Loads all annotation fields using get_blob_fields().

All keyword arguments are passed to get_blob_fields().
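
For example:

res = res.get_annotations(load_into_memory=True)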

download_files(target_dir: PathLike | str | None = None, keep_source_prefix=True, redownload=False, path_field: str | None = None) PathLike

Downloads the datapoints to the target_dir directory

Parameters:
  • target_dir – Where to download the files. Defaults to the datasource's default location

  • keep_source_prefix – If True, includes the prefix of the datasource in the download path.

  • redownload – Whether to redownload a file if it exists on the filesystem already. We don’t do any hashsum checks, so if it’s possible that the file has been updated, set to True

  • path_field – Set this to the name of the field with the file’s path if you want to download files from a field other than the datapoint’s path.

Note

For path_field the path in the field still needs to be in the same repo and have the same format as the path of the datapoint, including not having the prefix. For now, you can’t download arbitrary paths/urls.

Returns:

Path to the directory with the downloaded files
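
For example:

local_dir = res.download_files(target_dir="data", keep_source_prefix=False)
print(local_dir)  # directory containing the downloaded files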

export_as_yolo(download_dir: str | Path | None = None, annotation_field: str | None = None, annotation_type: Literal['bbox', 'segmentation', 'pose'] | None = None) Path

Downloads the files and annotations in a way that can be used to train with YOLO immediately.

Parameters:
  • download_dir – Where to download the files. Defaults to ./dagshub_export

  • annotation_field – Field with the annotations. If None, uses the first alphabetical annotation field.

  • annotation_type – Type of YOLO annotations to export. Possible values: “bbox”, “segmentation”, “pose”. If None, the type is inferred from the most common annotation type in the field.

Returns:

The path to the YAML file with the metadata. Pass this path to YOLO.train() to train a model.
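
A sketch of the full flow; using the ultralytics package for the training step is an assumption:

from ultralytics import YOLO

yaml_path = res.export_as_yolo(annotation_type="bbox")
YOLO("yolov8n.pt").train(data=str(yaml_path))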

to_voxel51_dataset(**kwargs) fo.Dataset

Creates a voxel51 dataset that can be used with fo.launch_app() to visualize it.

Keyword Arguments:
  • name (str) – Name of the dataset. Default is the name of the datasource.

  • force_download (bool) – Download the dataset even if the size of the files is bigger than 100MB. Default is False

  • files_location (str|PathLike) – path to the location to download the files to. Default is the datasource's default location

  • redownload (bool) – Redownload files, replacing the ones that might exist on the filesystem. Default is False.

  • voxel_annotations (List[str]) – List of fields from which to load voxel annotations serialized with to_json(). These override the Label Studio annotations

Return type:

fo.Dataset
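
For example:

import fiftyone as fo

fo_ds = res.to_voxel51_dataset(name="my-dataset")
session = fo.launch_app(fo_ds)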

visualize(visualizer: Literal['dagshub', 'fiftyone'] = 'dagshub', **kwargs) str | fo.Session

Visualize this QueryResult either on DagsHub or with Voxel51.

If visualizer is dagshub, a webpage is opened on DagsHub with the query applied.

If visualizer is fiftyone, this function calls to_voxel51_dataset(), passing to it the keyword arguments, and launches a fiftyone session showing the dataset.

Additionally, this function adds a DagsHub plugin into Voxel51 that you can use for additional interactions with the datasource from within the voxel environment.

Returns the session object; call wait() on it if you are using it outside a notebook and need the script to keep running:

session = ds.all().visualize()
session.wait(-1)

annotate_with_mlflow_model(repo: str, name: str, post_hook: ~typing.Callable = <function QueryResult.<lambda>>, pre_hook: ~typing.Callable = <function QueryResult.<lambda>>, host: str | None = None, version: str = 'latest', batch_size: int = 1, log_to_field: str = 'annotation') str | None

Sends all the datapoints returned in this QueryResult to an MLFlow model which automatically labels datapoints.

Parameters:
  • repo – repository to extract the model from

  • name – name of the model in the mlflow registry

  • version – (optional, default: ‘latest’) version of the model in the mlflow registry

  • pre_hook – (optional, default: identity function) function that runs before the datapoint is sent to the model

  • post_hook – (optional, default: identity function) function that converts the mlflow model output to Label Studio format

  • batch_size – (optional, default: 1) number of datapoints annotated per batch
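
A sketch; the repository and model name are illustrative, and the post_hook must return annotations in Label Studio format:

res.annotate_with_mlflow_model(
    "user/repo",
    "my-model",
    post_hook=lambda output: output,  # convert the model output to Label Studio format here
    log_to_field="annotation",
)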

annotate(open_project: bool = True, ignore_warning: bool = True, fields_to_embed: List[str] | None = None, fields_to_exclude: List[str] | None = None) str | None

Sends all the datapoints returned in this QueryResult to be annotated in Label Studio on DagsHub. Alternatively, uses MLFlow to automatically label datapoints.

Parameters:
  • open_project – Automatically open the Label Studio project in the browser

  • ignore_warning – Suppress the prompt-warning if you try to annotate too many datapoints at once.

  • fields_to_embed – list of metadata columns that will show up in the Label Studio UI. If not specified, all are displayed.

  • fields_to_exclude – list of metadata columns that will not show up in the Label Studio UI

Returns:

The URL of the created Label Studio workspace
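
For example (the excluded field name is illustrative):

url = res.annotate(fields_to_exclude=["raw_data"])
print(url)  # URL of the created Label Studio workspace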