Datasource

class dagshub.data_engine.model.datasource.Datasource(datasource: DatasourceState, query: DatasourceQuery | None = None, from_dataset: DatasetState | None = None)
clear_query(reset_to_dataset=True)

Clear the attached query.

Parameters:

reset_to_dataset – If True and this Datasource was saved as a dataset, reset to the query in the dataset, instead of clearing the query completely.

property annotation_fields: List[str]

Return all fields that have the annotation meta tag set

head(size=100, load_documents=True, load_annotations=True) QueryResult

Executes the query and returns a QueryResult object containing the first size datapoints

Parameters:
  • size – how many datapoints to get. Default is 100

  • load_documents – Automatically download all document blob fields

  • load_annotations – Automatically download all annotation blob fields

all(load_documents=True, load_annotations=True) QueryResult

Executes the query and returns a QueryResult object containing all datapoints

If there’s an active MLflow run, logs an artifact with information about the query to the run.

Parameters:
  • load_documents – Automatically download all document blob fields

  • load_annotations – Automatically download all annotation blob fields
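
Example (a minimal sketch, assuming ds is an existing Datasource object):

preview = ds.head(size=50, load_documents=False)  # first 50 datapoints, without downloading document blobs
result = ds.all()                                 # all datapoints matching the current query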

select(*selected: str | Field) Datasource

Select which fields should appear in the query result.

If you want to query older versions of metadata, use Field objects with as_of set to your desired time.

By default, only the defined fields are returned. If you want to return all existing fields plus whatever additional fields you define, add "*" into the arguments.

Parameters:

selected

Fields you want to select. Can be either of:

  • Name of the field to select: "field".

  • "*" to select all the fields in the datasource.

  • Field object.

Example:

t = datetime.now() - timedelta(hours=24)
q1 = ds.select("*", Field("size", as_of=t, alias="size_asof_24h_ago"))
q1.all()
as_of(time: float | datetime) Datasource

Get a snapshot of the datasource’s state as of time.

Parameters:

time – The point in time to get the data from. Either a UTC timestamp or a datetime object.

In the following example, you will get back datapoints that were created no later than yesterday AND had their size at this point bigger than 5 bytes:

t = datetime.now() - timedelta(hours=24)
q1 = (ds["size"] > 5).as_of(t)
q1.all()

Note

If used with select(), the as_of set on the fields takes precedence over the global query as_of set here.

with_time_zone(tz_val: str) Datasource

Set the time zone to use when querying datetime metadata fields.

Parameters:

tz_val – A time zone offset string in the form of “+HH:mm” or “-HH:mm”.

Metadata of type datetime is always stored in the DB as UTC. When a query is done on such a field, there are 3 options:

  • Metadata was saved with a timezone, in which case it will be used.

  • Metadata was saved without a timezone, in which case UTC will be used.

  • A time zone was specified via with_time_zone, in which case it overrides whatever is in the database.
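
Example (a sketch; "created_at" is a hypothetical datetime field):

tz_ds = ds.with_time_zone("+02:00")
q = tz_ds[tz_ds["created_at"].date_field_in_timeofday("09:00-17:00")]
q.all()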

order_by(*args: str | Tuple[str, bool | str]) Datasource

Sort the query result by the specified fields. Any previously set order will be overwritten.

Parameters:

args

Fields to sort by. Each argument can be either of:

  • Name of the field to sort by: "field".

  • A tuple of (field_name, ascending): ("field", True).

  • A tuple of (field_name, "asc"|"desc"): ("field", "asc").

Examples:

ds.order_by("size").all()                   # Order by ascending size
ds.order_by(("date", "desc"), "size).all()  # Order by descending date, then ascending size
metadata_field(field_name: str) MetadataFieldBuilder

Returns a builder for a metadata field. The builder can be used to change properties of a field or create a new field altogether. Note that fields get automatically created when you upload new metadata to the Data Engine, so it’s not necessary to create fields with this function.

Example of creating a new annotation field:

ds.metadata_field("annotation").set_type(dtypes.LabelStudioAnnotation).apply()

Note

New fields have to have their type defined using .set_type() before doing anything else

Example of marking an existing field as an annotation field:

ds.metadata_field("existing-field").set_annotation().apply()
Parameters:

field_name – Name of the field that you want to create/change

apply_field_changes(field_builders: List[MetadataFieldBuilder])

Applies one or multiple metadata field builders that can be constructed using the metadata_field() function.
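
Example (a sketch applying changes to two illustrative fields at once):

builders = [
    ds.metadata_field("quality_score").set_type(float),
    ds.metadata_field("reviewed").set_type(bool),
]
ds.apply_field_changes(builders)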

upload_metadata_of_implicit_context()

Commits metadata changes made in the dictionary assignment context.

metadata_context() ContextManager[MetadataContextManager]

Returns a metadata context through which you can upload metadata using its update_metadata() function. Once the context is exited, all metadata is uploaded in one batch:

with ds.metadata_context() as ctx:
    ctx.update_metadata("file1", {"key1": True, "key2": "value"})
upload_metadata_from_file(file_path, path_column: int | str | None = None, ingest_on_server: bool = False)

Upload metadata from a file.

Parameters:
  • file_path – Path to the file with metadata. Allowed formats are CSV, Parquet, ZIP, GZ.

  • path_column – Column with the datapoints’ paths. Can either be the name of the column, or its index. If not specified, the first column is used.

  • ingest_on_server – Set to True to process the metadata asynchronously. The file will be sent to our server and ingested into the datasource there. Default is False.
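
Example (a sketch, assuming metadata.csv has a "path" column with the datapoints’ paths):

ds.upload_metadata_from_file("metadata.csv", path_column="path")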

upload_metadata_from_dataframe(df: DataFrame, path_column: int | str | None = None, ingest_on_server: bool = False)

Upload metadata from a pandas dataframe.

All columns are uploaded as metadata, and the path of every datapoint is taken from path_column.

Parameters:
  • df (pandas.DataFrame) – DataFrame with metadata

  • path_column – Column with the datapoints’ paths. Can either be the name of the column, or its index. If not specified, the first column is used.

  • ingest_on_server – Set to True to process the metadata asynchronously. The file will be sent to our server and ingested into the datasource there. Default is False.
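
Example (a sketch with illustrative column names):

import pandas as pd

df = pd.DataFrame({
    "path": ["images/001.jpg", "images/002.jpg"],  # datapoint paths
    "label": ["cat", "dog"],                       # metadata column
})
ds.upload_metadata_from_dataframe(df, path_column="path")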

delete_source(force: bool = False)

Delete the record of this datasource along with all datapoints.

Warning

This is a destructive operation! If you delete the datasource, all the datapoints and metadata will be removed.

Parameters:

force – Skip the confirmation prompt

scan_source(options: List[ScanOption] | None = None)

This function fires a call to the backend to rescan the datapoints. Call this function whenever you have uploaded new files and want them to appear when querying the datasource, or when you have changed existing file contents and want their metadata to be updated.

DagsHub periodically rescans all datasources; this function is a way to make a scan happen as soon as possible.

Notes about automatically scanned metadata:
  1. Only new datapoints (files) will be added. If files were removed from the source, their metadata will still remain, and they will still be returned from queries on the datasource. An API to actively remove metadata will be available soon.

  2. Some metadata fields will be automatically scanned and updated by DagsHub based on this scan - the list of automatic metadata fields is growing frequently!

Parameters:

options – List of scanning options. If not sure, leave empty.
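
Example (after uploading new files to the underlying storage):

ds.scan_source()   # request a rescan so the new files show up in queries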

delete_metadata_from_datapoints(datapoints: List[Datapoint], fields: List[str])

Delete metadata from datapoints. The deleted values can still be accessed using a versioned query with its time set before the deletion.

Parameters:
  • datapoints – datapoints to delete metadata from

  • fields – fields to delete
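
Example (a sketch; "label" is an illustrative field, and it is assumed a QueryResult can be converted into a list of Datapoint objects):

big_files = (ds["size"] > 5).all()
ds.delete_metadata_from_datapoints(list(big_files), fields=["label"])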

delete_datapoints(datapoints: List[Datapoint], force: bool = False)

Delete datapoints.

  • These datapoints will no longer show up in queries.

  • Does not delete the datapoints’ files, only removes them from the datasource.

  • You can still query these datapoints and associated metadata with versioned queries whose time is before deletion time.

  • You can re-add these datapoints to the datasource by uploading new metadata to it with, for example, Datasource.metadata_context. This will create a new datapoint with new id and new metadata records.

  • Datasource scanning will not add these datapoints back.

Parameters:
  • datapoints – list of datapoints objects to delete

  • force – Skip the confirmation prompt
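
Example (a sketch, under the same assumption about converting a QueryResult to a list of Datapoint objects):

to_remove = (ds["size"] > 5).all()
ds.delete_datapoints(list(to_remove), force=True)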

save_dataset(name: str) Datasource

Save the dataset, which is a combination of datasource + query, on the backend. That way you can persist and share your queries. You can get the dataset back later by calling datasets.get_dataset()

Parameters:

name – Name of the dataset

Returns:

A datasource object with the dataset assigned to it
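
Example (the dataset name is illustrative):

q = ds["size"] > 5
big_files = q.save_dataset("size-bigger-than-5")
# The saved dataset can later be retrieved with datasets.get_dataset()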

log_to_mlflow(artifact_name='datasource.dagshub.json', run: mlflow.entities.Run | None = None, as_of: datetime | None = None) mlflow.entities.Run

Logs the current datasource state to MLflow as an artifact.

Parameters:
  • artifact_name – Name of the artifact that will be stored in the MLflow run.

  • run – MLflow run to save to. If None, uses the active MLflow run or creates a new run.

  • as_of – The querying time for which to save the artifact. Any time the datasource is recreated from the artifact, it will be queried as of this timestamp. If None, the current machine time will be used. If the artifact is autologged to MLflow (will happen if you have an active MLflow run), then the timestamp of the query will be used.

Returns:

Run to which the artifact was logged.
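
Example (a sketch; the artifact name is arbitrary):

import mlflow

with mlflow.start_run():
    ds.log_to_mlflow(artifact_name="train_datasource.dagshub.json")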

save_to_file(path: str | PathLike = '.') Path

Saves a JSON file representing the current state of datasource or dataset. Useful for connecting code versions to the datasource used for training.

Note

Does not save the actual contents of the datasource/dataset, only the query.

Parameters:

path – Where to save the file. If path is an existing folder, saves to <path>/<ds_name>.json.

Returns:

The path to the saved file
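
Example of a save/load round trip (a sketch; it assumes the saved JSON file can be read back with json.load and passed to load_from_serialized_state()):

import json
from dagshub.data_engine.model.datasource import Datasource

saved_path = ds.save_to_file()
with open(saved_path) as f:
    restored = Datasource.load_from_serialized_state(json.load(f))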

property is_query_different_from_dataset: bool | None

Whether the current query of the object differs from the one in the assigned dataset.

If no dataset is assigned, returns None.

static load_from_serialized_state(state_dict: Dict) Datasource

Load a Datasource that was saved with save_to_file()

Parameters:

state_dict – Serialized JSON object

to_voxel51_dataset(**kwargs) fo.Dataset

Refer to QueryResult.to_voxel51_dataset() for documentation.

property default_dataset_location: Path

Default location where datapoint files are stored.

On UNIX-likes the path is ~/dagshub/datasets/<repo_name>/<datasource_id>

On Windows the path is C:\Users\<user>\dagshub\datasets\<repo_name>\<datasource_id>

visualize(visualizer: Literal['dagshub', 'fiftyone'] = 'dagshub', **kwargs) str | fo.Session

Visualize the whole datasource using QueryResult.visualize().

Read the function docs for kwarg documentation.

async add_annotation_model_from_config(config, project_name, ngrok_authtoken, port=9090)

Initialize a Label Studio (LS) backend for ML annotation using a preset configuration.

Parameters:
  • config – dictionary containing information about the mlflow model, hooks and the LS label config. Recommended to use with get_config() from preconfigured_models in the orchestrator.

  • project_name – automatically adds the backend to this Label Studio project

  • ngrok_authtoken – if provided, ngrok is used to forward the local connection

  • port – (optional, default: 9090) port on which the orchestrator is hosted

async add_annotation_model(repo: str, name: str, version: str = 'latest', post_hook: ~typing.Callable[[~typing.Any], ~typing.Any] = <function Datasource.<lambda>>, pre_hook: ~typing.Callable[[~typing.Any], ~typing.Any] = <function Datasource.<lambda>>, port: int = 9090, project_name: str | None = None, ngrok_authtoken: str | None = None) None

Initialize a Label Studio (LS) backend for ML annotation.

Parameters:
  • repo – repository to extract the model from

  • name – name of the model in the mlflow registry

  • version – (optional, default: ‘latest’) version of the model in the mlflow registry

  • pre_hook – (optional, default: identity function) function that runs before datapoint is sent to the model

  • post_hook – (optional, default: identity function) function that converts the mlflow model output to the desired format

  • port – (optional, default: 9090) port on which orchestrator is hosted

  • project_name – (optional, default: None) automatically adds backend to project

  • ngrok_authtoken – (optional, default: None) uses ngrok to forward local connection

annotate(fields_to_embed=None, fields_to_exclude=None) str | None

Sends all datapoints in the datasource for annotation in Label Studio.

Parameters:
  • fields_to_embed – list of metadata columns that will show up in the Label Studio UI. If not specified, all will be displayed.

  • fields_to_exclude – list of metadata columns that will not show up in the Label Studio UI

Note

This will send ALL datapoints in the datasource for annotation. It’s recommended not to send a huge number of datapoints at once, to avoid overloading the Label Studio workspace. Use QueryResult.annotate() to annotate the result of a query with fewer datapoints. Alternatively, use the lower-level send_datapoints_to_annotation() function.

Returns:

Link to open Label Studio in the browser

send_datapoints_to_annotation(datapoints: List[Datapoint] | QueryResult | List[Dict], open_project=True, ignore_warning=False, fields_to_exclude=None, fields_to_embed=None) str | None

Sends datapoints for annotation in Label Studio.

Parameters:
  • datapoints

    Either of:

    • A QueryResult

    • List of Datapoint objects

    • List of dictionaries. Each dictionary should have fields id and download_url.

      id is the ID of the datapoint in the datasource.

  • open_project – Automatically open the created Label Studio project in the browser.

  • ignore_warning – Suppress the prompt-warning if you try to annotate too many datapoints at once.

  • fields_to_embed – list of metadata columns that will show up in the Label Studio UI. If not specified, all will be displayed.

  • fields_to_exclude – list of metadata columns that will not show up in the Label Studio UI

Returns:

Link to open Label Studio in the browser
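
Example (a sketch sending a small query result for annotation):

result = ds.head(size=20)
link = ds.send_datapoints_to_annotation(result, open_project=False)
print(link)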

wait_until_ready(max_wait_time=300, fail_on_timeout=True)

Blocks until the datasource preprocessing is complete.

Useful when you have just created a datasource and the initial scanning hasn’t finished yet.

Parameters:
  • max_wait_time – Maximum time to wait in seconds

  • fail_on_timeout – Whether to raise a RuntimeError or log a warning if the scan does not complete on time
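
Example (a sketch, right after creating a datasource):

ds.wait_until_ready(max_wait_time=600, fail_on_timeout=False)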

has_field(field_name: str) bool

Checks if a metadata field field_name exists in the datasource.
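
Example (a sketch; "label" is an illustrative field name):

if not ds.has_field("label"):
    ds.metadata_field("label").set_type(str).apply()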

date_field_in_years(*item: int)

Checks whether a datetime metadata field’s value falls in one of the given years.

Parameters:

item – List of years.

Examples:

datasource[datasource["y"].date_field_in_years(1979, 2003)]
date_field_in_months(*item: int)

Checks whether a datetime metadata field’s value falls in one of the given months.

Parameters:

item – List of months.

Examples:

datasource[datasource["y"].date_field_in_months(12, 2)]
date_field_in_days(*item: int)

Checks whether a datetime metadata field’s value falls on one of the given days.

Parameters:

item – List of days.

Examples:

datasource[datasource["y"].date_field_in_days(25, 2)]
date_field_in_timeofday(item: str)

Checks whether a datetime metadata field’s value falls within the given time-of-day range (on any day). The range is in the format “HH:mm-HH:mm” (or “HH:mm:ss-HH:mm:ss”), with the start time on the left. A range that starts on one day and ends on the next should be expressed as an OR of two range filters.

Parameters:

item – Time range string.

Examples:

datasource[datasource["y"].date_field_in_timeofday("11:30-12:30")]
import_annotations_from_files(annotation_type: Literal['yolo', 'cvat'], path: str | Path, field: str = 'imported_annotation', load_from: Literal['repo', 'disk'] | None = None, remapping_function: Callable[[str], str] | None = None, **kwargs)

Imports annotations into the datasource from files

The annotations will be downloaded and converted into Label Studio tasks, which are then uploaded into the specified field.

If the annotations are stored in a repo and not locally, they are downloaded to a temporary directory.

Caveats:
  • YOLO:
    • Images need to also be downloaded to get their dimensions.

    • The .yaml file needs to have the path argument set to the relative path to the data; it is used to download the files.

    • You have to specify the yolo_type kwarg with the type of annotation to import

Parameters:
  • annotation_type – Type of annotations to import. Possible values are yolo and cvat

  • path – If YOLO - path to the .yaml file, if CVAT - path to the .zip file. Can be either on disk or in repository

  • field – Which field to upload the annotations into. If it’s an existing field, it has to be a blob field, and it will have the annotations flag set afterwards.

  • load_from – Force specify where to get the files from. By default, we try to load files from disk first, then from the repository. If this is specified, that check is skipped and the files are loaded from the specified location.

  • remapping_function – Function that maps from the path of an annotation to the path of the datapoint. If None, we try to make a best guess based on the first imported annotation. This might fail if there is no matching datapoint in the datasource for some annotations, or if the paths are wildly different.

Keyword Arguments:

yolo_type – Type of YOLO annotations to import. Either bbox, segmentation or pose.

Example to import segmentation annotations into an imported_annotations field, using YOLO information from an annotations.yaml file (can be local, or in the repo):

ds.import_annotations_from_files(
    annotation_type="yolo",
    path="annotations.yaml",
    field="imported_annotations",
    yolo_type="segmentation"
)
class dagshub.data_engine.model.datasource.Field(field_name: str, as_of: float | datetime | None = None, alias: str | None = None)

Class used to define custom fields for use in Datasource.select() or in filtering.

Example of filtering on old data from a field:

t = datetime.now() - timedelta(days=2)
q = ds[Field("size", as_of=t)] > 500
q.all()
field_name: str

The database field where the values are stored. In other words, where to get the values from.

as_of: float | datetime | None = None

If defined, the data in this field will be shown as of this moment in time.

Accepts either a datetime object, or a UTC timestamp.

alias: str | None = None

How the returned custom data field should be named.

Useful when you’re comparing the same field at multiple points in time:

yesterday = datetime.now() - timedelta(days=1)

ds.select(
    Field("value", alias="value_today"),
    Field("value", as_of=yesterday, alias="value_yesterday")
).all()
class dagshub.data_engine.model.datasource.MetadataContextManager(datasource: Datasource)

Context manager for updating the metadata on a datasource. Batches the metadata changes so they are sent all at once.

update_metadata(datapoints: List[str] | str, metadata: Dict[str, Any])

Update metadata for the specified datapoints.

Note

If datapoints is a list, the same metadata is assigned to all the datapoints in the list. Call update_metadata() separately for each datapoint if you need to assign different metadata.

Parameters:
  • datapoints (Union[List[str], str]) – A list of datapoints or a single datapoint path to update metadata for.

  • metadata (Dict[str, Any]) – A dictionary containing metadata key-value pairs to update.

Example:

with ds.metadata_context() as ctx:
    metadata = {
        "episode": 5,
        "has_baby_yoda": True,
    }

    # Attach metadata to a single specific file in the datasource.
    # The first argument is the filepath to attach metadata to, **relative to the root of the datasource**.
    ctx.update_metadata("images/005.jpg", metadata)

    # Attach metadata to several files at once:
    ctx.update_metadata(["images/006.jpg","images/007.jpg"], metadata)
class dagshub.data_engine.model.metadata_field_builder.MetadataFieldBuilder(datasource: Datasource, field_name: str)

Builder class for changing properties of a metadata field in a datasource. It is also possible to create a new empty field with a predefined schema with this builder. All functions return the builder object to facilitate a builder pattern, for example:

builder.set_type(bytes).set_annotation().apply()
set_type(t: Type | DagshubDataType) MetadataFieldBuilder

Set the type of the field. The type can be either a Python primitive supported by the Data Engine (str, bool, int, float, bytes) or a DagshubDataType inheritor. The DataType inheritors can define additional tags on top of the basic backing type.

set_annotation(is_annotation: bool = True) MetadataFieldBuilder

Mark or unmark the field as an annotation field

set_thumbnail(thumbnail_type: Literal['video', 'audio', 'image', 'pdf', 'text'] | None = None, is_thumbnail: bool = True) MetadataFieldBuilder

Mark or unmark the field as a thumbnail field, with the specified thumbnail type
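
Example (a sketch; "preview" is an illustrative blob field name):

ds.metadata_field("preview").set_type(bytes).set_thumbnail("image").apply()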

apply()

Apply the outgoing changes to this metadata field.

If you need to apply changes to multiple fields at once, use Datasource.apply_field_changes instead.

class dagshub.data_engine.client.models.ScanOption

Enum of options that can be applied during the scanning process with scan_source()

FORCE_REGENERATE_AUTO_SCAN_VALUES = 'FORCE_REGENERATE_AUTO_SCAN_VALUES'

Regenerate all the autogenerated metadata values for the whole datasource
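
Example (a sketch combining it with scan_source()):

from dagshub.data_engine.client.models import ScanOption

ds.scan_source(options=[ScanOption.FORCE_REGENERATE_AUTO_SCAN_VALUES])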