| title | description |
|---|---|
| Connect Your Remote Storage to DagsHub - ML Data in One Place | DagsHub connects to AWS S3, Google Cloud Storage, and S3-compatible storage, enhancing data accessibility and interaction with large files all in one place. |
DagsHub supports connecting external storage from AWS S3, Google Cloud Storage (GCS), and S3-compatible storage to DagsHub repositories, enabling you to access and interact with your data and large files without leaving the DagsHub platform.
In this section, we'll cover all the required steps to set up an external remote storage for your project. We assume that you have already created a DagsHub project and added a Git remote.
If you haven't already created a storage bucket, you should set it up now. Follow the instructions in one of these links:
=== "AWS – S3"
    Explanation on how to create an S3 bucket
=== "GCP – Google Storage"
    Explanation on how to create a Google Storage bucket
We need a minimal set of permissions in order to use the bucket we created as our remote. If you have admin access to your cloud account, you may be able to skip this step. Here we assume you start without permissions and set up the minimum permissions needed for the remote storage use case.
=== "AWS – S3"
Copy the following JSON permission file:
```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::<your-bucket-name>"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::<your-bucket-name>/*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::<your-bucket-name>/*"
        },
        {
            "Sid": "VisualEditor3",
            "Effect": "Allow",
            "Action": "s3:DeleteObject",
            "Resource": "arn:aws:s3:::<your-bucket-name>/*"
        },
        {
            "Sid": "VisualEditor4",
            "Effect": "Allow",
            "Action": "s3:ReplicateObject",
            "Resource": "arn:aws:s3:::<your-bucket-name>/*"
        }
    ]
}
```
Paste it into the policy editor in your AWS console (this requires logging in). After you have created the policy, make sure it is attached to the relevant IAM user(s); if you're not sure how to do this, follow these steps.
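If you prefer the command line, the same policy can be created and attached with the AWS CLI. This is a minimal sketch: the policy file name, policy name, IAM user name, and account ID below are all placeholders you'll need to replace with your own values.

```bash
# Create a managed policy from the JSON saved locally
# (dagshub-remote-policy.json is a placeholder file name)
aws iam create-policy \
    --policy-name DagsHubRemoteStorage \
    --policy-document file://dagshub-remote-policy.json

# Attach the new policy to the IAM user that DVC will authenticate as
aws iam attach-user-policy \
    --user-name <your-iam-user> \
    --policy-arn "arn:aws:iam::<your-account-id>:policy/DagsHubRemoteStorage"
```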
#### S3 public buckets
Making a bucket public is not enough; the bucket permissions must also grant global `s3:ListBucket` and `s3:GetObject` capabilities.
Copy the following JSON permission file:
```json
{
"Version": "2012-10-17",
"Id": "Policy1711548745790",
"Statement": [
{
"Sid": "VisualEditor5",
"Effect": "Allow",
"Principal": {"AWS": "*"},
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::my-bucket"
},
{
"Sid": "VisualEditor6",
"Effect": "Allow",
"Principal": {"AWS": "*"},
"Action": "s3:GetObject",
"Resource": "arn:aws:s3:::my-bucket/*"
}
]
}
```
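As a rough example (the bucket and file names are placeholders), the bucket policy above can be applied from the command line. Note that on buckets created with the default Block Public Access settings, you may first need to relax those settings for a public policy to be accepted:

```bash
# Allow public bucket policies on this bucket (needed on buckets
# created with the default Block Public Access settings)
aws s3api put-public-access-block \
    --bucket my-bucket \
    --public-access-block-configuration \
    "BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false"

# Apply the public-read policy saved as public-bucket-policy.json
aws s3api put-bucket-policy \
    --bucket my-bucket \
    --policy file://public-bucket-policy.json
```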
=== "GCP – Google Storage"
    In GCP, you can set the permissions while creating the GS bucket. Make sure you have at least Storage Object Creator access if you only want to push and pull, and Storage Object Admin for more advanced capabilities like garbage collection.
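For example, assuming you authenticate as a service account (the account name, project, and bucket below are placeholders), the role can be granted on the bucket with `gsutil`:

```bash
# Grant Storage Object Admin on the bucket to a service account
gsutil iam ch \
    serviceAccount:dvc-pusher@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
    gs://<your-bucket-name>
```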
The easiest way to allow you to push and pull to your remote storage is by installing the appropriate command line tools.
=== "AWS – S3"
    Follow the instructions to install the AWS CLI version 2. After installation has completed, follow the instructions to configure your AWS CLI. This saves your access key to `~/.aws/credentials`, which DVC knows how to access.
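For reference, the file written by `aws configure` looks roughly like this (the values below are placeholders, not real credentials):

```ini
# ~/.aws/credentials
[default]
aws_access_key_id = <your-access-key-id>
aws_secret_access_key = <your-secret-access-key>
```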
=== "GCP – Google Storage"
    Follow the quickstart guide to install and initialize the Google Cloud SDK on your system. Alternatively, you can install the Google Cloud SDK and then authorize it without going through the full configuration. This saves a configuration file in your home directory, which DVC knows how to access when working with Google Cloud Storage.
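Concretely, the two options above correspond to the following commands; the second creates Application Default Credentials, which the DVC Google Storage remote can pick up:

```bash
# Full interactive setup (project, region, account, etc.)
gcloud init

# Or: only create Application Default Credentials, skipping
# the rest of the configuration
gcloud auth application-default login
```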
Now that we have our bucket configured with the correct permissions, we can go to the guide on how to connect it to DagsHub.
DagsHub's external storage support also works with DVC, if you use one of the supported external storage types as a DVC remote. The following section will walk you through how to do this.
This step consists of two parts: installing the DVC extension and configuring the remote. If you still don't have DVC installed at this point, you should install it.
Type in the following command (according to the service you are using):
=== "AWS – S3"
    ```bash
    pip3 install 'dvc[s3]'
    ```
=== "GCP – Google Storage"
    ```bash
    pip3 install 'dvc[gs]'
    ```
=== "All Extensions"
    ```bash
    pip3 install 'dvc[all]'
    ```
After the installation, reopen the terminal window to make sure the changes have taken effect.
Define the DVC remote. We do this with one command (don't forget to replace the bucket name with your own bucket):
=== "AWS – S3"
    ```bash
    dvc remote add <remote-name> s3://<bucket-name>
    ```
=== "GCP – Google Storage"
    ```bash
    dvc remote add <remote-name> gs://<bucket-name>
    ```
??? info "Consider using --local"
    In our opinion, the configuration of the remote may vary between team members (working in various environments) and over time (if you switch between cloud providers), so it is prudent not to modify the `.dvc/config` file, which is tracked by Git. Instead, we prefer to use the local configuration. You can find it in `.dvc/config.local`, and confirm that it's ignored in `.dvc/.gitignore`. That way you don't couple the current environment configuration to the code history. This is the same best practice which naturally occurs when you run `git remote add` - the configuration is only local to your own working repo, and won't be pushed to any Git remote.
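For example, the same remote definition can be written to the local configuration simply by adding `--local` to the command (the remote and bucket names are placeholders):

```bash
# Write the remote definition to .dvc/config.local instead of .dvc/config
dvc remote add --local <remote-name> s3://<bucket-name>

# Inspect the local, git-ignored configuration file
cat .dvc/config.local
```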
Pushing your files is as simple as one command:
```bash
dvc push -r <remote-name>
```
This step might take a while, depending on the size of files you are pushing.
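Putting it together, a typical tracking-and-pushing sequence looks something like this (`data/` is a placeholder for your own data directory):

```bash
# Track a data directory with DVC; this creates data.dvc
dvc add data/

# Commit the small pointer file to Git, not the data itself
git add data.dvc .gitignore
git commit -m "Track data with DVC"

# Upload the actual data to the configured remote
dvc push -r <remote-name>
```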
Normally, we automatically detect your remote location. If you used the `--local` option when configuring your DVC remote, follow the instructions here:
??? info "Connecting DagsHub to a remote configured with --local"
    To reap the benefits of doing this while using DagsHub to host your repo, go to your repo settings and add the link to the bucket in the Advanced Settings "Local DVC cache URL" field.
With DagsHub, when you track files with DVC you'll see them both in your file viewer, and in your pipeline view.
We believe this is useful for several reasons.
!!! warning
    When downloading a file through a link in the graph, it might be saved with the DVC hash as its filename. You can safely change it to the intended filename, including the original extension, and it will work just fine.