File Downloading

Streaming files from the repository (dagshub.streaming)

class dagshub.streaming.DagsHubFilesystem(project_root: PathLike | str | None = None, repo_url: str | None = None, branch: str | None = None, username: str | None = None, password: str | None = None, token: str | None = None, timeout: int | None = None, exclude_globs: List[str] | str | None = None, frameworks: List[str] | None = None)

A DagsHub-repo aware filesystem class

Parameters:
  • project_root – Path to the local git repository. If None, the filesystem is traversed upward from the current directory until a git repository is found.

  • repo_url – URL of the DagsHub repository. If None, the URL is read from the git configuration.

  • branch – Explicitly sets the branch or commit revision to work with. If None, the branch is read from the git configuration.

  • token – DagsHub API token

  • username – DagsHub username (as an alternative to using the token)

  • password – DagsHub password (as an alternative to using the token)

  • timeout – Timeout in seconds for HTTP requests. Applies to all requests except file downloads, which have no timeout.

  • exclude_globs – One or more glob patterns for paths that should not be looked up on the server. This is useful when your framework tries to look up cached files on disk that might not be there. Example: YOLO and .npy files.

  • frameworks

    List of frameworks that need custom patched openers. The following are currently supported:

    • transformers - patches safetensors

install_hooks()

Install hooks to override default file and directory operations with DagsHub-aware functionality.

This method patches the standard Python I/O operations such as open, stat, listdir, scandir, and chdir with DagsHub-aware equivalents. Works inside a notebook and with Pathlib.

If install_hooks() has already been called, this method does nothing.

Example:

dagshub_fs = DagsHubFilesystem()
dagshub_fs.install_hooks()

with open("src/file_in_repo.txt") as f:
    print(f.read())

Call uninstall_hooks() to undo the monkey patching.

classmethod uninstall_hooks()

Reverses the changes made by install_hooks(), bringing back the builtin file I/O functions.

dagshub.streaming.install_hooks(project_root: PathLike | None = None, repo_url: str | None = None, branch: str | None = None, username: str | None = None, password: str | None = None, token: str | None = None, timeout: int | None = None, exclude_globs: List[str] | str | None = None, frameworks: List[str] | None = None)

Monkey patches builtin Python functions to make them DagsHub-repo aware. The patched functions are open(), os.listdir(), os.scandir(), os.stat(), and the pathlib functions that use them.

Calling this function is equivalent to creating a DagsHubFilesystem object and calling its install_hooks() method.

For documentation of the arguments, see DagsHubFilesystem.

Call uninstall_hooks() to undo the monkey patching.

dagshub.streaming.uninstall_hooks()

Reverses the changes made by install_hooks().
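Because install_hooks() patches process-wide builtins, it can be convenient to pair it with uninstall_hooks() so the patching is always undone. The following is a minimal sketch of one way to do that with a context manager. The name streaming_hooks is a hypothetical helper, not part of dagshub; it only wraps the two module-level functions documented above.

```python
from contextlib import contextmanager


@contextmanager
def streaming_hooks(**kwargs):
    """Install the DagsHub hooks for the duration of a with-block.

    Hypothetical helper: keyword arguments are forwarded unchanged to
    dagshub.streaming.install_hooks().
    """
    # Imported lazily so that defining this helper does not require dagshub.
    from dagshub.streaming import install_hooks, uninstall_hooks

    install_hooks(**kwargs)
    try:
        yield
    finally:
        # Runs even if the body raises, restoring the builtin I/O functions.
        uninstall_hooks()


# Usage (inside a clone of a DagsHub repository):
#
# with streaming_hooks():
#     with open("src/file_in_repo.txt") as f:
#         print(f.read())
```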

Direct download from connected buckets

These functions allow you to enable a client-downloader for a bucket you have connected to DagsHub.

When you download a file from a connected bucket, the request usually has to go through our server. These functions let you skip the middleman and download the file directly from the bucket. This can save time and money if the downloading machine is colocated with the bucket.

The functions that work with these downloaders are:

dagshub.common.download.enable_s3_bucket_downloader(client=None)

Enables downloading storage items using the AWS Boto3 client, instead of going through DagsHub’s server.

For custom clients use add_bucket_downloader() function.

Parameters:

client – a boto3.client. If client isn’t specified, the default parameterless constructor is used.
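As a sketch of passing a preconfigured client, the helper below creates a boto3 S3 client for a specific region and hands it to enable_s3_bucket_downloader(). The helper name enable_direct_s3 and the default region are assumptions for the example; credentials are expected to come from the usual boto3 configuration.

```python
def enable_direct_s3(region_name: str = "us-east-1"):
    """Enable direct S3 downloads with an explicitly configured boto3 client.

    Hypothetical helper: the name and the default region are illustrative.
    """
    # Imported lazily so that defining this helper needs neither package.
    import boto3
    from dagshub.common.download import enable_s3_bucket_downloader

    client = boto3.client("s3", region_name=region_name)
    enable_s3_bucket_downloader(client=client)
    return client
```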

dagshub.common.download.enable_gcs_bucket_downloader(client=None)

Enables downloading storage items using the Google Cloud Storage client, instead of going through DagsHub’s server.

For custom clients use add_bucket_downloader() function.

Parameters:

client – a google.cloud.storage.Client from the google-cloud-storage package. If client isn’t specified, the default parameterless constructor is used.

dagshub.common.download.enable_azure_container_downloader(account_url=None, client=None)

Enables downloading storage items using the Azure Blob Storage client, instead of going through DagsHub’s server.

For custom clients use add_bucket_downloader() function.

Parameters:
  • account_url – an Azure storage account URL, of the form https://<storage-account-name>.blob.core.windows.net

  • client – a preconfigured azure.storage.blob.BlobServiceClient. If client isn’t specified, the default parameterless constructor is used. If client is specified, it is used and account_url is disregarded.

dagshub.common.download.add_bucket_downloader(proto: Literal['gs', 's3', 'azure'], func: Callable[[str, str], bytes])

Add your own custom connected bucket downloader.

Parameters:
  • proto – Protocol for which you’re adding the downloader. This function will handle all download requests to this protocol.

  • func – Function that receives the name of the bucket and the path to the object and returns the object content in bytes.

Warning

The func function will be used in a ThreadPool, so it needs to be picklable.
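To make the expected shape of func concrete, here is a minimal sketch of a custom downloader that reads objects from a local directory mirroring the bucket layout. Everything about the mirror (the MIRROR_ROOT path, the <root>/<bucket>/<path> layout, and the names read_object, mirror_downloader, and register) is a hypothetical assumption for the example. The function is defined at module level and bound with functools.partial rather than a lambda or nested function, with the picklability warning above in mind.

```python
import functools
from pathlib import Path

# Hypothetical directory laid out as <root>/<bucket>/<object path>.
MIRROR_ROOT = "/mnt/bucket-mirror"


def read_object(root: str, bucket: str, path: str) -> bytes:
    """Return the content of <root>/<bucket>/<path> as bytes."""
    # Strip leading slashes so the object path cannot reset the join to "/".
    return (Path(root) / bucket / path.lstrip("/")).read_bytes()


# Callable[[str, str], bytes]: receives (bucket, path) and returns the content.
mirror_downloader = functools.partial(read_object, MIRROR_ROOT)


def register() -> None:
    """Route all s3 download requests through the mirror (requires dagshub)."""
    from dagshub.common.download import add_bucket_downloader

    add_bucket_downloader("s3", mirror_downloader)
```

Calling register() once at startup would hand every s3 download request to mirror_downloader, as described for the proto parameter.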