QueryResult¶
- class dagshub.data_engine.model.query_result.QueryResult(_entries: List[Datapoint], datasource: Datasource, fields: List[MetadataSelectFieldSchema], query_data_time: datetime | None = None)¶
Result of executing a query on a Datasource.
You can iterate over this object to get the datapoints:

    res = ds.head()
    for dp in res:
        print(dp.path_in_repo)
- property dataframe¶
Represents the contents of this QueryResult as a pandas.DataFrame.
The created dataframe has a copy of the QueryResult’s data.
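A minimal usage sketch (assuming ds is a Datasource loaded elsewhere, as in the iteration example above):

    res = ds.all()
    df = res.dataframe  # pandas.DataFrame holding a copy of the query result's metadata
    print(df.head())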
- as_ml_dataset(flavor: str, **kwargs)¶
Converts the QueryResult into a dataset for a machine learning framework.
- Parameters:
flavor – Either of:
  "torch": returns a torch.utils.data.Dataset
  "tensorflow": returns a tf.data.Dataset
- Keyword Arguments:
metadata_columns (list(str)) – which fields to use in the dataset.
strategy (str) – Datapoint file loading strategy. Possible values:
  "preload" - Download all datapoints before returning the dataset.
  "background" - Start downloading the datapoints in the background. If an undownloaded datapoint is accessed, it gets downloaded.
  "lazy" (default) - Download each datapoint as it is being accessed by the dataset.
savedir (str|Path) – Where to store the datapoint files. Default is the datasource’s default location.
processes (int) – number of parallel processes to download the datapoints with. Default is 8.
tensorizers – How to transform the datapoint file/metadata into tensors. Possible values:
  "auto" - try to guess the tensorizers for every field. For files, the tensorizer will be determined by the first file’s extension.
  "image" | "audio" - tensorize all fields according to this type.
  A list of tensorizers - the first is the tensorizer for the datapoint file, followed by a tensorizer for each of the metadata fields. A tensorizer can either be one of the strings "image", "audio", "numeric", or your own function that receives the metadata value of the field and turns it into a tensor.
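Example: a minimal sketch of loading the result as a PyTorch dataset (assuming ds is an existing Datasource; the "label" metadata field is illustrative):

    dataset = ds.all().as_ml_dataset(
        "torch",
        metadata_columns=["label"],  # hypothetical metadata field
        strategy="background",       # start downloading files in the background
    )
    sample = dataset[0]  # tensors for the file and the selected metadata fields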
- as_ml_dataloader(flavor, **kwargs)¶
Converts the QueryResult into a dataloader for a machine learning framework.
- Parameters:
flavor – Either of:
  "torch": returns a torch.utils.data.DataLoader
  "tensorflow": returns a tf.keras.utils.Sequence
Kwargs are the same as as_ml_dataset().
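Example: a minimal sketch (assuming ds is an existing Datasource; the exact batch structure depends on the chosen tensorizers):

    loader = ds.all().as_ml_dataloader("torch", strategy="preload")
    for batch in loader:
        ...  # feed each batch into your training loop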
- as_hf_dataset(target_dir: PathLike | str | None = None, download_datapoints=True, download_blobs=True)¶
Loads this QueryResult as a HuggingFace dataset.
The paths of the downloads are set to the local paths in the filesystem, so they can be used with a cast_column() function later.
- Parameters:
target_dir – Where to download the datapoints. The metadata is still downloaded into the global cache.
download_datapoints – If set to True (default), downloads the datapoint files and sets the path column to the path of the datapoint in the filesystem.
download_blobs – If set to True (default), downloads all blob fields and sets the respective column to the path of the file in the filesystem.
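Example: a minimal sketch (the target directory, the "path" column name, and the cast to an Image feature are illustrative):

    import datasets

    hf_ds = ds.all().as_hf_dataset(target_dir="hf_export")
    # Since the path column now holds local file paths, it can be cast later, for example:
    hf_ds = hf_ds.cast_column("path", datasets.Image())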
- get_blob_fields(*fields: str, load_into_memory=False, cache_on_disk=True, num_proc: int = 32, path_format: Literal['str', 'path'] = 'path') QueryResult ¶
Downloads data from blob fields.
If load_into_memory is set to True, then we additionally convert special fields to new types:
  Annotation fields are converted to MetadataAnnotations
  Document fields are converted to strings
- Parameters:
fields – list of binary fields to download blobs for. If empty, download all blob fields.
load_into_memory – Whether to load the blobs into the datapoints, or just store them on disk.
  If True: the datapoints’ specified fields will contain the blob data.
  If False: the datapoints’ specified fields will contain Path objects pointing to the file of the downloaded blob.
cache_on_disk – Whether to cache the blobs on disk or not (valid only if load_into_memory is set to True). Cache location is ~/dagshub/datasets/<repo>/<datasource_id>/.metadata_blobs/
num_proc – number of download threads
path_format – How the paths to the file should be represented. "path" returns a Path object, and "str" returns a string of this path.
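Example: a minimal sketch (the "annotation" field name is illustrative):

    # Download only the "annotation" blob field and load its contents into the datapoints
    res = ds.all().get_blob_fields("annotation", load_into_memory=True, num_proc=16)

With the default load_into_memory=False, the field instead holds a Path to the downloaded blob on disk.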
- predict_with_mlflow_model(repo: str, name: str, host: str | None = None, version: str = 'latest', pre_hook: ~typing.Callable[[~typing.Any], ~typing.Any] = <function QueryResult.<lambda>>, post_hook: ~typing.Callable[[~typing.Any], ~typing.Any] = <function QueryResult.<lambda>>, batch_size: int = 1, log_to_field: str | None = None) list | None ¶
Sends all the datapoints returned in this QueryResult as prediction targets for an MLFlow model registered on DagsHub.
- Parameters:
repo – repository to extract the model from
name – name of the model in the mlflow registry
version – (optional, default: ‘latest’) version of the model in the mlflow registry
pre_hook – (optional, default: identity function) function that runs before the datapoint is sent to the model
post_hook – (optional, default: identity function) function that converts the mlflow model output to the desired format
batch_size – (optional, default: 1) batch size used when sending datapoints to the model
log_to_field – (optional, default: ‘prediction’) metadata field in the data engine to log prediction results to. If None, just returns the predictions (they are returned in addition to being logged to a field, if that parameter is set).
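Example: a minimal sketch (the repository, model name, and hooks are illustrative):

    predictions = ds.all().predict_with_mlflow_model(
        repo="user/repo",                 # hypothetical repository hosting the model
        name="my-model",                  # hypothetical model name in the MLflow registry
        post_hook=lambda output: output,  # convert the model output to the desired format
        batch_size=8,
        log_to_field="prediction",        # write results back to this metadata field
    )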
- get_annotations(**kwargs) QueryResult ¶
Loads all annotation fields using get_blob_fields().
All keyword arguments are passed to get_blob_fields().
- download_files(target_dir: PathLike | str | None = None, keep_source_prefix=True, redownload=False, path_field: str | None = None) PathLike ¶
Downloads the datapoints to the target_dir directory.
- Parameters:
target_dir – Where to download the files. Defaults to the datasource’s default location.
keep_source_prefix – If True, includes the prefix of the datasource in the download path.
redownload – Whether to redownload a file if it exists on the filesystem already. We don’t do any hashsum checks, so if it’s possible that the file has been updated, set to True
path_field – Set this to the name of the field with the file’s path if you want to download files from a field other than the datapoint’s path.
Note
For path_field, the path in the field still needs to be in the same repo and have the same format as the path of the datapoint, including not having the prefix. For now, you can’t download arbitrary paths/urls.
- Returns:
Path to the directory with the downloaded files
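Example: a minimal sketch (the target directory is illustrative):

    local_dir = ds.all().download_files(target_dir="data/files", keep_source_prefix=False)
    print(local_dir)  # directory that now contains the downloaded files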
- export_as_yolo(download_dir: str | Path | None = None, annotation_field: str | None = None, annotation_type: Literal['bbox', 'segmentation', 'pose'] | None = None) Path ¶
Downloads the files and annotations in a way that can be used to train with YOLO immediately.
- Parameters:
download_dir – Where to download the files. Defaults to ./dagshub_export
annotation_field – Field with the annotations. If None, uses the first alphabetical annotation field.
annotation_type – Type of YOLO annotations to export. Possible values: “bbox”, “segmentation”, “pose”. If None, returns based on the most common annotation type.
- Returns:
The path to the YAML file with the metadata. Pass this path to YOLO.train() to train a model.
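Example: a minimal sketch (assuming the ultralytics package provides the YOLO class; the download directory and base weights are illustrative):

    from ultralytics import YOLO

    yaml_path = ds.all().export_as_yolo(download_dir="yolo_export", annotation_type="bbox")
    model = YOLO("yolov8n.pt")
    model.train(data=str(yaml_path))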
- to_voxel51_dataset(**kwargs) fo.Dataset ¶
Creates a voxel51 dataset that can be used with fo.launch_app() to visualize it.
- Keyword Arguments:
name (str) – Name of the dataset. Default is the name of the datasource.
force_download (bool) – Download the dataset even if the size of the files is bigger than 100MB. Default is False.
files_location (str|PathLike) – path to the location where to download the local files. Default is the datasource’s default location.
redownload (bool) – Redownload files, replacing the ones that might exist on the filesystem. Default is False.
voxel_annotations (List[str]) – List of fields from which to load voxel annotations serialized with to_json(). This will override the labelstudio annotations.
- Return type:
fo.Dataset
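Example: a minimal sketch (the dataset name is illustrative):

    import fiftyone as fo

    voxel_ds = ds.all().to_voxel51_dataset(name="my-datasource-view")
    session = fo.launch_app(voxel_ds)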
- visualize(visualizer: Literal['dagshub', 'fiftyone'] = 'dagshub', **kwargs) str | fo.Session ¶
Visualizes this QueryResult either on DagsHub or with Voxel51.
If visualizer is dagshub, a webpage is opened on DagsHub with the query applied.
If visualizer is fiftyone, this function calls to_voxel51_dataset(), passing it the keyword arguments, and launches a fiftyone session showing the dataset.
Additionally, this function adds a DagsHub plugin into Voxel51 that you can use for additional interactions with the datasource from within the voxel environment.
Returns the session object, on which you can call wait() if you are using it outside a notebook and need the script not to close immediately:

    session = ds.all().visualize()
    session.wait(-1)
- annotate_with_mlflow_model(repo: str, name: str, post_hook: ~typing.Callable = <function QueryResult.<lambda>>, pre_hook: ~typing.Callable = <function QueryResult.<lambda>>, host: str | None = None, version: str = 'latest', batch_size: int = 1, log_to_field: str = 'annotation') str | None ¶
Sends all the datapoints returned in this QueryResult to an MLFlow model which automatically labels datapoints.
- Parameters:
repo – repository to extract the model from
name – name of the model in the mlflow registry
version – (optional, default: ‘latest’) version of the model in the mlflow registry
pre_hook – (optional, default: identity function) function that runs before the datapoint is sent to the model
post_hook – (optional, default: identity function) function that converts the mlflow model output to labelstudio format
batch_size – (optional, default: 1) annotation batch size
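Example: a minimal sketch (the repository and model name are illustrative):

    ds.all().annotate_with_mlflow_model(
        repo="user/repo",           # hypothetical repository hosting the model
        name="my-annotator",        # hypothetical model name in the MLflow registry
        log_to_field="annotation",  # metadata field that receives the generated labels
        batch_size=4,
    )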
- annotate(open_project: bool = True, ignore_warning: bool = True, fields_to_embed: List[str] | None = None, fields_to_exclude: List[str] | None = None) str | None ¶
Sends all the datapoints returned in this QueryResult to be annotated in Label Studio on DagsHub. Alternatively, uses MLFlow to automatically label datapoints.
- Parameters:
open_project – Automatically open the Label Studio project in the browser
ignore_warning – Suppress the prompt-warning if you try to annotate too many datapoints at once.
fields_to_embed – list of meta-data columns that will show up in Label Studio UI. If not specified, all will be displayed.
fields_to_exclude – list of meta-data columns that will not show up in Label Studio UI
- Returns:
The URL of the created Label Studio workspace
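Example: a minimal sketch (the metadata column name is illustrative):

    url = ds.all().annotate(
        open_project=False,
        fields_to_embed=["caption"],  # hypothetical metadata column to show in the Label Studio UI
    )
    print(url)  # URL of the created Label Studio workspace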