Set Up Remote Storage for Data & Models

DAGsHub provides a central location to view and collaborate on data science projects. We make it easy to find the code, data and models you need, and get them into any machine you use.

To get these capabilities, all you need to do is set up a Git remote and a DVC remote. This guide covers the steps required to set up DVC remote storage for your project. We assume that you have already created a DAGsHub project and added a Git remote.

Create a Storage Bucket

If you haven't already created a storage bucket, you should set it up now. Follow the instructions in one of these links:

Explanation on how to create an S3 bucket

Explanation on how to create a Google Storage bucket

Making Sure You Have Permissions

We need a minimum set of permissions in order to use the bucket we created as our remote. If you have admin access to your cloud account, you might be able to skip this step. Here we assume you start without any permissions and set up the minimum permissions needed for the remote storage use case.

In AWS, copy the following JSON permission policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::<your-bucket-name>"
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::<your-bucket-name>/*"
    },
    {
      "Sid": "VisualEditor2",
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::<your-bucket-name>/*"
    },
    {
      "Sid": "VisualEditor3",
      "Effect": "Allow",
      "Action": "s3:DeleteObject",
      "Resource": "arn:aws:s3:::<your-bucket-name>/*"
    },
    {
      "Sid": "VisualEditor4",
      "Effect": "Allow",
      "Action": "s3:ReplicateObject",
      "Resource": "arn:aws:s3:::<your-bucket-name>/*"
    }
  ]
}

Paste it into the policy editor in your AWS console (this requires you to log in). After you have created the policy, make sure it is attached to the relevant IAM user or users (if you're not sure how to do this, follow these steps).
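If you manage several buckets, retyping the bucket name in every statement is error-prone. The following is a minimal sketch that generates the same policy for a given bucket name. The function name build_s3_policy and the example bucket name are hypothetical, not part of DAGsHub, DVC, or AWS tooling.

```python
import json


def build_s3_policy(bucket_name: str) -> dict:
    """Build the minimal policy from this guide for one bucket (illustrative helper)."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    # ListBucket applies to the bucket itself; the other actions apply to its objects.
    statements = [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": bucket_arn,
        }
    ]
    object_actions = [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ReplicateObject",
    ]
    for index, action in enumerate(object_actions, start=1):
        statements.append(
            {
                "Sid": f"VisualEditor{index}",
                "Effect": "Allow",
                "Action": action,
                "Resource": f"{bucket_arn}/*",
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Print the policy so it can be pasted into the policy editor.
print(json.dumps(build_s3_policy("my-example-bucket"), indent=2))
```

The printed JSON should match the policy above with the placeholder replaced by the bucket name you pass in.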

In GCP, you can set the permissions while creating the Google Storage bucket. Make sure you have at least Storage Object Creator access if you only want to push and pull, and Storage Object Admin for more advanced capabilities such as garbage collection.

Installing the Command Line Tools

The easiest way to push to and pull from your remote storage is to install the appropriate command line tools.

Follow the instructions to install the AWS CLI version 2. After installation has completed, follow the instructions to configure your AWS CLI.

This saves your access key to ~/.aws/credentials, which DVC knows to access.
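Before configuring DVC, it can help to confirm that the credentials file actually contains a key pair. The sketch below assumes the standard INI layout written by the AWS CLI, with aws_access_key_id and aws_secret_access_key under a profile section. The helper name has_aws_credentials is hypothetical.

```python
import configparser
from pathlib import Path


def has_aws_credentials(credentials_path: Path, profile: str = "default") -> bool:
    """Return True if the given profile has both key fields set (illustrative helper)."""
    parser = configparser.ConfigParser()
    # read() skips files that do not exist, so a missing file yields no sections.
    parser.read(credentials_path)
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return bool(section.get("aws_access_key_id")) and bool(
        section.get("aws_secret_access_key")
    )
```

Calling has_aws_credentials(Path.home() / ".aws" / "credentials") checks the default location. The function only checks that the fields are present, not that the key is valid.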

Follow the quickstart guide to install and initialize the Google Cloud SDK on your system. Alternatively, you can install the Google Cloud SDK and then authorize the SDK without going through the full configuration.

This saves a configuration file in your home directory, which DVC reads when accessing Google Cloud Storage.

Adding the DVC Remote Locally

This step consists of two parts: installing the DVC extension and configuring the remote. If you still don't have DVC installed at this point, install it first.

Installing the DVC extension

Type in the one command that matches the service you are using:

pip install 'dvc[s3]'    # Amazon S3
pip install 'dvc[gs]'    # Google Cloud Storage
pip install 'dvc[all]'   # all supported remote types

After the installation, reopen the terminal window to make sure the changes have taken effect.
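If you script your environment setup, the choice of extra can be derived from the bucket URL. This is a minimal sketch covering only the two services in this guide. The names EXTRA_BY_SCHEME and pip_requirement_for are hypothetical.

```python
from urllib.parse import urlparse

# Only the two storage services covered in this guide; other schemes are rejected.
EXTRA_BY_SCHEME = {"s3": "s3", "gs": "gs"}


def pip_requirement_for(remote_url: str) -> str:
    """Map a bucket URL such as s3://my-bucket to the matching pip requirement."""
    scheme = urlparse(remote_url).scheme
    extra = EXTRA_BY_SCHEME.get(scheme)
    if extra is None:
        raise ValueError(f"No extra known for URL scheme {scheme!r}")
    return f"dvc[{extra}]"
```

The returned string can be passed to pip install in place of the hand-picked command above.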

To configure the remote, add it with the command that matches your storage service:

dvc remote add <remote-name> s3://<bucket-name>
dvc remote add <remote-name> gs://<bucket-name>

Consider using --local

In our opinion, the remote configuration may vary between team members (who work in different environments) and over time (if you switch between cloud providers). It is therefore prudent not to modify the .dvc/config file, which is tracked by Git.

Instead, we prefer to use the local configuration, which you get by passing --local to dvc remote add. It is stored in .dvc/config.local, and you can confirm that this file is ignored in .dvc/.gitignore.

That way you don't couple the current environment configuration to the code history. This is the same best practice which naturally occurs when you run git remote add - the configuration is only local to your own working repo, and won't be pushed to any git remote.
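If you set up remotes from a script, the --local preference described above can be made the default. The sketch below only assembles and runs the dvc remote add command. The helper names remote_add_command and add_remote are hypothetical, and running add_remote assumes dvc is installed and that you are inside a DVC repository.

```python
import subprocess
from typing import List


def remote_add_command(name: str, url: str, local: bool = True) -> List[str]:
    """Build the argument list for `dvc remote add`, using --local by default."""
    command = ["dvc", "remote", "add"]
    if local:
        # Writes to .dvc/config.local instead of the Git-tracked .dvc/config.
        command.append("--local")
    command += [name, url]
    return command


def add_remote(name: str, url: str, local: bool = True) -> None:
    """Run the command; raises CalledProcessError if dvc exits with an error."""
    subprocess.run(remote_add_command(name, url, local), check=True)
```

For example, add_remote("origin", "s3://my-example-bucket") would run dvc remote add --local origin s3://my-example-bucket.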

Pushing Files to Your Remote Storage

Pushing is as simple as one command:

dvc push -r <remote-name>

This step might take a while, depending on the size of the files you are pushing.
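Because a push can run for a long time, you may want to launch it from a script and check the result afterwards. This is a minimal sketch, assuming dvc is on your PATH. The function name push_to_remote and its run parameter are hypothetical, and the parameter exists only so the function can be exercised without a real remote.

```python
import subprocess
from typing import Callable, List


def push_to_remote(remote_name: str, run: Callable = subprocess.run) -> int:
    """Run `dvc push -r <remote_name>` and return the process exit code."""
    command: List[str] = ["dvc", "push", "-r", remote_name]
    completed = run(command)
    # Exit code 0 means dvc reported success; anything else signals a failure.
    return completed.returncode
```

A non-zero return value is a signal to inspect the dvc output for permission or network errors.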

Connecting DAGsHub to Your Remote Storage

Normally, DAGsHub detects your remote location automatically. If you used the --local option when configuring your DVC remote, follow the instructions here:

Connecting DAGsHub to a remote configured with --local

To reap the benefits of using --local while hosting your repo on DAGsHub, go to your repo settings and add the link to the bucket under Advanced Settings, in the Local DVC cache URL field. In our case it looks something like this:

[Screenshot: Local DVC cache URL setting]

With DAGsHub, when you track files with DVC you'll see them both in your file viewer, and in your pipeline view.

[Screenshot: File viewer with blue files that are DVC tracked]

After you connect your remote storage to DAGsHub, these files have functioning links and are therefore available for viewing or download by anyone who wants them (provided they are authorized to access your bucket, of course).

[Screenshot: Path link change after adding remote]

We believe this is useful for several reasons:

  • If you want to let someone browse your data and trained models, you can just send them a link to your DAGsHub repo. They don't need to clone or run anything, or sift through undocumented directory structures to find the model they are looking for.
  • The files managed by DVC and pushed to the cloud are immutable - just like a specific version of a file which is saved in a Git commit, even if you continue working and the branch has moved on, you can always go back to some old branch or commit, and the download links will still point to the files as they were in the past.
  • By using DVC and DAGsHub, you can preserve your own sanity when running a lot of different experiments in multiple parallel branches. Don't remember where you saved that model which you trained a month ago? Just take a look at your repo, it's a click away. Let software do the grunt work of organizing files, just like those wonderfully lazy software developers do.

Warning

When you download a file through a link in the graph, it might be saved with the DVC hash as its filename. You can safely rename it to the intended filename, including the original extension, and it will work just fine.
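The rename can be scripted if you download many files. The sketch below is illustrative only: the function name restore_filename is hypothetical, and the optional integrity check assumes that the hash you pass in is the MD5 of the file content, as DVC records for single files in its .dvc files.

```python
import hashlib
from pathlib import Path
from typing import Optional


def restore_filename(
    downloaded: Path, intended_name: str, expected_md5: Optional[str] = None
) -> Path:
    """Rename a hash-named download to its intended name, optionally verifying MD5."""
    if expected_md5 is not None:
        actual = hashlib.md5(downloaded.read_bytes()).hexdigest()
        if actual != expected_md5:
            raise ValueError(f"MD5 mismatch: expected {expected_md5}, got {actual}")
    target = downloaded.with_name(intended_name)
    if target.exists():
        # Refuse to overwrite an existing file with the same name.
        raise FileExistsError(f"{target} already exists")
    downloaded.rename(target)
    return target
```

For example, restore_filename(Path("downloads/<hash>"), "model.pkl") keeps the file in the same directory and only changes its name.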