The first section of this tutorial covers setting up our project. This includes the following tasks:
- Creating an account and repo in DAGsHub
- Installing DVC as well as the optional installation needed to use remote caches for a project (we assume you have Git installed and are familiar with it to some extent).
- Project initialization in Git and DVC
- Creating a virtualenv and installing the needed requirements for the first part of the project
If you are familiar with these steps, you can skip to the next part which covers pipeline creation.
...is really easy. Just sign up.
Then, after logging in, create a new repo by clicking the plus sign and choosing "Create repository".
Repo creation button
This opens a dialog, which should look somewhat familiar, in which you can set the repository name, description and other options.
Repo creation dialog
For this tutorial, fill in the Name and description, and leave everything else in the default settings.
Note the Local DVC cache URL setting, which is unique to DAGsHub. We will get back to it later in the tutorial.
Done with repo creation. On to project initialization.
We need to make sure we have Git and DVC installed on our computer.
We won't cover Git installation here, but if you need it, here is a link to an installation guide.
For full documentation and all options for installation of DVC, see the official docs.
The fastest way to install DVC is using pip. Just open a terminal and type:
pip install dvc
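Before moving on, it can help to confirm that both tools are actually on your PATH. The following is a minimal sketch: the helper name check_tool is made up for this example, and it relies only on the shell's command -v builtin and each tool's --version flag.

```shell
#!/bin/sh
# check_tool is a hypothetical helper for this tutorial, not part of Git or DVC.
# It prints the tool's version if the tool is on PATH and returns non-zero otherwise.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        "$1" --version
    else
        echo "$1 not found on PATH" >&2
        return 1
    fi
}

check_tool git || echo "Install Git before continuing."
check_tool dvc || echo "Install DVC, for example with: pip install dvc"
```

Support for remote storage is installed through pip extras, for example pip install "dvc[s3]"; which extra you need depends on the storage you plan to use.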
Setting up our project
Create a directory named "dagshub_mnist" for the project somewhere on your computer. Open a terminal and input the following:
cd path/to/folder/dagshub_mnist
git init
dvc init
The last command initializes the project as a DVC project, which sets up the following directory structure:
.dvc/
.dvc/.gitignore
.dvc/cache/
.dvc/config
.dvc/updater
.dvc/updater.lock
This is very similar to the .git folder contained in every git repo, except some of its contents will be tracked using git.
- .dvc/config is similar to .git/config. By default it's empty. More on this later on.
- .dvc/.gitignore makes sure git ignores everything in the .dvc folder except the config file.
- .dvc/cache is where DVC will keep the different versions of our data files. It's very similar in principle to
- The other files & directories under .dvc are used internally by DVC, and shouldn't interest the average user.
Now, we will set the remote to our repo on dagshub.com. This can be done using the following command:
git remote add origin https://dagshub.com/<username>/<repo-name>.git
mkdir code data metrics
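To confirm that the remote was registered, git remote get-url prints the address stored for origin. The sketch below runs in a throwaway directory so it cannot disturb your real project. The URL is a placeholder (example-user and dagshub_mnist are hypothetical) and should be replaced with your own username and repo name.

```shell
#!/bin/sh
# Work in a scratch directory so the real project is left untouched.
scratch=$(mktemp -d)
cd "$scratch" || exit 1

git init -q

# Placeholder address: substitute your own DAGsHub username and repo name.
git remote add origin "https://dagshub.com/example-user/dagshub_mnist.git"

# Print the URL Git stored for the remote named "origin".
remote_url=$(git remote get-url origin)
echo "origin -> $remote_url"

# Create the same folders as the mkdir command above, then list them.
mkdir code data metrics
ls -d code data metrics
```

In the actual project folder, only the git remote get-url origin line is needed for the check.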
Creating a virtual environment and installing requirements
To create and activate our virtual environment type the following commands into your terminal (still in the project folder):
virtualenv .env
echo '.env/' >> .gitignore
echo '__pycache__/' >> .gitignore
source .env/bin/activate
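Running the echo lines above a second time appends duplicate entries to .gitignore. As a small optional sketch, the hypothetical helper ignore_once below (not a Git command) adds a pattern only if that exact line is missing.

```shell
#!/bin/sh
# ignore_once is a hypothetical helper, not part of Git.
# Usage: ignore_once PATTERN [FILE]   (FILE defaults to .gitignore)
ignore_once() {
    pattern=$1
    file=${2:-.gitignore}
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string.
    # "--" stops option parsing in case a pattern starts with a dash.
    if ! grep -qxF -- "$pattern" "$file" 2>/dev/null; then
        echo "$pattern" >> "$file"
    fi
}

# Demonstrated in a scratch directory; in the project you would run the
# ignore_once calls from the project folder without the second argument.
demo_dir=$(mktemp -d)
ignore_once '.env/' "$demo_dir/.gitignore"
ignore_once '__pycache__/' "$demo_dir/.gitignore"
ignore_once '.env/' "$demo_dir/.gitignore"   # repeated call adds nothing
cat "$demo_dir/.gitignore"
```

The whole-line, fixed-string match matters here: without -F the dot in .env/ would be treated as a regular-expression wildcard.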
To install the requirements for the first part of this project, simply download the requirements.txt into your project folder.
Alternatively, you can create a file called requirements.txt and copy the following into it:
numpy==1.15.4
pandas==0.23.4
python-dateutil==2.7.5
pytz==2018.7
scikit-learn==0.20.2
scipy==1.2.0
six==1.12.0
sklearn==0.0
Now, to install them type:
pip install -r requirements.txt
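To double-check that the virtual environment really contains the pinned versions, one option is to compare requirements.txt against the output of pip freeze. The sketch below is only an illustration: missing_pins is a hypothetical helper, and it assumes each requirement is written exactly as pip freeze prints it (name==version, one per line).

```shell
#!/bin/sh
# missing_pins is a hypothetical helper, not a pip command.
# It prints every line of the requirements file ($1) that has no exact
# match in a "pip freeze" listing ($2).
missing_pins() {
    # -v: invert the match, -x: whole-line match,
    # -F: fixed strings, -f: read the patterns from a file
    grep -vxF -f "$2" "$1"
}

# Only run the check when a requirements file and pip are both available.
if [ -f requirements.txt ] && command -v pip >/dev/null 2>&1; then
    frozen=$(mktemp)
    pip freeze > "$frozen"
    if missing_pins requirements.txt "$frozen"; then
        echo "The packages listed above are missing or at a different version."
    else
        echo "All pinned requirements are installed."
    fi
    rm -f "$frozen"
fi
```

Run it with the virtual environment activated, otherwise pip freeze reports the packages of your system Python instead.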
Committing progress to Git
Let us check the Git status of our project:
$ git status -s
A  .dvc/.gitignore
A  .dvc/config
?? .gitignore
?? requirements.txt
Now let us commit this to Git using the command line:
git add .
git commit -m "Initialized project"
git push -u origin master
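The push above assumes your local branch is called master. Depending on your Git version and its init.defaultBranch setting, git init may create a branch with a different name, such as main, in which case pushing master fails. A hedged sketch that looks up whichever branch is currently checked out, shown in a scratch repository so that it is safe to run anywhere:

```shell
#!/bin/sh
# Demonstrated in a scratch repository; in the project itself you would only
# need the symbolic-ref line and the push shown in the final comment.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
git init -q

# Read the name of the checked-out branch (works even before the first commit).
current_branch=$(git symbolic-ref --short HEAD)
echo "Current branch: $current_branch"

# In the real project, push that branch instead of hard-coding "master":
#   git push -u origin "$current_branch"
```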
Let's open up our repo on DAGsHub. It should look something like this:
Repo view after first commit
The top part should look familiar to anyone who has used a Git server before. However, notice the bottom window: it is a new addition in DAGsHub and will show our data pipeline later in this tutorial.
In the next section, we will use dvc to define our pipeline. That is where the magic happens.