File Downloading
Streaming files from the repository (dagshub.streaming)
- class dagshub.streaming.DagsHubFilesystem(project_root: PathLike | str | None = None, repo_url: str | None = None, branch: str | None = None, username: str | None = None, password: str | None = None, token: str | None = None, timeout: int | None = None, exclude_globs: List[str] | str | None = None, frameworks: List[str] | None = None)
A DagsHub-repo aware filesystem class
- Parameters:
project_root – Path to the git repository with the repo. If None, traverse up the filesystem from the current dir until we find a git repo
repo_url – URL to the DagsHub repository. If None, URL is received from the git configuration
branch – Explicitly sets a branch/commit revision to work with. If None, branch is received from the git configuration
token – DagsHub API token
username – DagsHub username (as an alternative to using the token)
password – DagsHub password (as an alternative to using the token)
timeout – Timeout in seconds for HTTP requests. Influences all requests except for file download, which has no timeout
exclude_globs – One or more glob patterns to exclude from lookups on the server. This is useful when your framework tries to look up cached files on disk that might not be there. Example: YOLO and .npy files
frameworks – List of frameworks that need custom patched openers. Right now the following is supported:
- transformers: patches safetensors
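The parameters above can be combined as in the following minimal sketch. The helper names `filesystem_kwargs` and `make_filesystem`, the 30-second timeout, and the `*.npy` pattern are illustrative assumptions, not part of the dagshub API; only the keyword names come from the signature above.

```python
def filesystem_kwargs(repo_url, branch=None, token=None):
    """Build keyword arguments for DagsHubFilesystem, leaving out unset values."""
    candidate = {
        "repo_url": repo_url,
        "branch": branch,                # None: taken from the git configuration
        "token": token,
        "timeout": 30,                   # seconds; file downloads have no timeout
        "exclude_globs": ["*.npy"],      # never look these up on the server
        "frameworks": ["transformers"],  # request the patched safetensors opener
    }
    return {name: value for name, value in candidate.items() if value is not None}


def make_filesystem(**kwargs):
    """Create the filesystem object (hypothetical wrapper, requires dagshub)."""
    # Imported lazily so the helper above can be used without dagshub installed.
    from dagshub.streaming import DagsHubFilesystem

    return DagsHubFilesystem(**kwargs)
```

A call such as `make_filesystem(**filesystem_kwargs("https://dagshub.com/<user>/<repo>", branch="main"))` would then create the object with an explicit repository and branch instead of relying on the local git configuration.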
- install_hooks()
Install hooks to override default file and directory operations with DagsHub-aware functionality.
This method patches the standard Python I/O operations such as open, stat, listdir, scandir, and chdir with DagsHub-aware equivalents. Works inside a notebook and with pathlib.
If install_hooks() has already been called, this method does nothing.
Example:
from dagshub.streaming import DagsHubFilesystem

dagshub_fs = DagsHubFilesystem()
dagshub_fs.install_hooks()
with open("src/file_in_repo.txt") as f:
    print(f.read())
Call uninstall_hooks() to undo the monkey patching.
- classmethod uninstall_hooks()
Reverses the changes made by install_hooks(), bringing back the builtin file I/O functions.
- dagshub.streaming.install_hooks(project_root: PathLike | None = None, repo_url: str | None = None, branch: str | None = None, username: str | None = None, password: str | None = None, token: str | None = None, timeout: int | None = None, exclude_globs: List[str] | str | None = None, frameworks: List[str] | None = None)
Monkey patches builtin Python functions to make them DagsHub-repo aware. Patched functions are: open(), os.listdir(), os.scandir(), os.stat(), and pathlib’s functions that use them.
Calling this function is equivalent to creating a DagsHubFilesystem object and calling its install_hooks() method.
For argument documentation, read DagsHubFilesystem.
Call uninstall_hooks() to undo the monkey patching.
- dagshub.streaming.uninstall_hooks()
Reverses the changes made by install_hooks().
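Because the hooks change process-wide behavior, it can help to guarantee that they are removed again. The sketch below pairs the two calls in a context manager. The name `hooks_installed` is hypothetical, and the install and uninstall callables are passed in so that the helper does not depend on dagshub being importable.

```python
from contextlib import contextmanager


@contextmanager
def hooks_installed(install, uninstall, **install_kwargs):
    """Run install(**install_kwargs), and always run uninstall() on exit."""
    install(**install_kwargs)
    try:
        yield
    finally:
        # Runs even when the body raises, so the patches never leak out.
        uninstall()


# Intended usage, assuming dagshub is installed:
#
#     from dagshub.streaming import install_hooks, uninstall_hooks
#
#     with hooks_installed(install_hooks, uninstall_hooks, branch="main"):
#         with open("src/file_in_repo.txt") as f:
#             print(f.read())
```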
Direct download from connected buckets
These functions allow you to enable a client-side downloader for a bucket you have connected to DagsHub.
When you download a file from a connected bucket, the request usually has to go through our server. These functions allow you to skip the middleman and download the file directly from the bucket. This can save you time and money if the downloading machine is colocated with the bucket.
The functions that work with these downloaders are:
- dagshub.common.download.enable_s3_bucket_downloader(client=None)
Enables downloading storage items using the AWS Boto3 client, instead of going through DagsHub’s server.
For custom clients, use the add_bucket_downloader() function.
- Parameters:
client – a boto3.client. If client isn’t specified, the default parameterless constructor is used.
- dagshub.common.download.enable_gcs_bucket_downloader(client=None)
Enables downloading storage items using the Google Cloud Storage client, instead of going through DagsHub’s server.
For custom clients, use the add_bucket_downloader() function.
- Parameters:
client – a google.cloud.storage.Client from the google-cloud-storage package. If client isn’t specified, the default parameterless constructor is used.
- dagshub.common.download.enable_azure_container_downloader(account_url=None, client=None)
Enables downloading storage items using the Azure Blob Storage client, instead of going through DagsHub’s server.
For custom clients, use the add_bucket_downloader() function.
- Parameters:
account_url – an Azure storage account URL, of the form https://<storage-account-name>.blob.core.windows.net
client – preconfigured azure.storage.blob.BlobServiceClient. If client isn’t specified, the default parameterless constructor is used. If specified, account_url is disregarded, and the client is used.
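The enable functions above might be wired up roughly as follows. This is a sketch under assumptions: the wrapper names, the storage account name, and the region are placeholders, and the cloud SDKs are imported lazily so that the URL helper works even without them installed.

```python
def azure_account_url(storage_account_name: str) -> str:
    """Build an account URL of the form documented for account_url."""
    return f"https://{storage_account_name}.blob.core.windows.net"


def enable_azure_direct(storage_account_name: str) -> None:
    """Enable direct Azure downloads (hypothetical wrapper, requires dagshub)."""
    from dagshub.common.download import enable_azure_container_downloader

    enable_azure_container_downloader(
        account_url=azure_account_url(storage_account_name)
    )


def enable_s3_direct(region_name: str) -> None:
    """Enable direct S3 downloads with an explicitly configured boto3 client."""
    import boto3
    from dagshub.common.download import enable_s3_bucket_downloader

    # Passing a client overrides the default parameterless construction.
    client = boto3.client("s3", region_name=region_name)
    enable_s3_bucket_downloader(client=client)
```

Calling `enable_s3_direct("us-east-1")` once, before any downloads start, would route later downloads from S3-backed connected buckets through the supplied client.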
- dagshub.common.download.add_bucket_downloader(proto: Literal['gs', 's3', 'azure'], func: Callable[[str, str], bytes])
Add your own custom connected bucket downloader.
- Parameters:
proto – Protocol for which you’re adding the downloader. This function will handle all download requests to this protocol.
func – Function that receives the name of the bucket and the path to the object and returns the object content in bytes.
Warning
The func function will be used in a ThreadPool, so it needs to be picklable.
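A custom downloader that satisfies the picklability requirement could look like the sketch below. It assumes a hypothetical local mirror laid out as `<mirror_root>/<bucket>/<path>`, and all helper names are illustrative. The key point is that the downloader is a module-level function (here bound with `functools.partial`), because lambdas and nested functions cannot be pickled.

```python
import pickle
from functools import partial
from pathlib import Path


def read_from_mirror(mirror_root: str, bucket: str, path: str) -> bytes:
    """Return the object's bytes from a local mirror of the bucket."""
    return (Path(mirror_root) / bucket / path).read_bytes()


def is_picklable(func) -> bool:
    """Report whether pickle can serialize the given callable."""
    try:
        pickle.dumps(func)
        return True
    except Exception:
        return False


def register_mirror_downloader(mirror_root: str) -> None:
    """Register the mirror reader for s3 requests (requires dagshub)."""
    from dagshub.common.download import add_bucket_downloader

    # partial() fixes mirror_root, leaving the (bucket, path) -> bytes signature.
    add_bucket_downloader("s3", partial(read_from_mirror, mirror_root))
```

Checking a candidate with `is_picklable` before registering it is a cheap way to catch a lambda or closure that would fail later inside the pool.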