title | description |
---|---|
DVC and DagsHub Tutorial - Classifying MNIST Handwritten Digits with ML Pipelines – Project Setup | Embark on a journey through machine learning basics with this DagsHub tutorial, where you'll learn to classify MNIST handwritten digits. Discover how to version your data pipeline with DVC and leverage DagsHub for project repository management and pipeline visualization, streamlining your ML workflows |
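The first section of this tutorial covers setting up our project. This includes creating a repository on DagsHub, installing Git and DVC, initializing the project, setting up a Python virtual environment, installing the requirements, and committing the result to Git.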
If you are familiar with these steps, you can skip to the next part which covers pipeline creation.
Joining DagsHub is really easy. Just sign up.
Then, after logging in, create a new repo by clicking the plus sign in the nav bar and choosing to create a repository.
!!! info
    Note the Local DVC cache URL setting, which is unique to DagsHub. We will get back to it later in the tutorial.
Done with repo creation. On to project initialization.
We need to make sure we have Git and DVC installed on our computer.
We won't cover Git installation here. If you need it, follow the official Git installation guide.
This tutorial assumes a Python 3 installation.
Make sure you aren't using Python 2 by mistake by running `python --version`.
For full documentation and all options for installation of DVC, see the official docs.
DVC is actually a Python package. You can run it from your OS environment or from your currently activated Python environment. As such, the fastest way to install DVC is using pip. Just open a terminal and type:

```bash
pip3 install dvc
```

And voilà. Done.

Note: DagsHub supports DVC 3 and its new hashing mechanism!
!!! warning
    This tutorial was last updated for DVC version 2.13.0. If you are using an older version, please update. If you are using a newer version, be aware that the behavior of some commands may have changed.
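
Since the behavior of some commands depends on the installed DVC version, it can help to check that version from a script. The following is a minimal sketch, not part of the original tutorial: the helper names `parse_version`, `is_at_least`, and `installed_dvc_version` are hypothetical, and it assumes that `dvc --version` prints a dotted version number. Versions are compared as tuples of integers because a plain string comparison would rank `2.9.5` above `2.13.0`.

```python
import re
import subprocess


def parse_version(text):
    """Return the first X.Y.Z number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(version, minimum):
    """Compare two dotted version strings numerically, not alphabetically."""
    return parse_version(version) >= parse_version(minimum)


def installed_dvc_version():
    """Return the output of `dvc --version`, or None if DVC cannot be run."""
    try:
        result = subprocess.run(
            ["dvc", "--version"], capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


if __name__ == "__main__":
    version = installed_dvc_version()
    try:
        if version is None:
            print("DVC was not found; install it with: pip3 install dvc")
        elif is_at_least(version, "2.13.0"):
            print(f"DVC {version} is at least 2.13.0")
        else:
            print(f"DVC {version} is older than 2.13.0; please update")
    except ValueError:
        print(f"Could not understand the DVC version output: {version!r}")
```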
Create a directory named "dagshub_mnist" for the project somewhere on your computer. Open a terminal and input the following:
```bash
cd path/to/folder/dagshub_mnist
git init
dvc init
```
The last command initializes the project as a DVC project, which sets up the following directory structure:
```
.
├── .dvc
│   ├── .gitignore
│   ├── config
│   ├── plots
│   │   ├── confusion.json
│   │   ├── default.json
│   │   ├── scatter.json
│   │   └── smooth.json
│   └── tmp
│       └── index
└── .git
    ┊
```
The `.dvc` folder is somewhat similar to the `.git` folder contained in every Git repo, except that some of its contents will be tracked using Git.

- `.dvc/config` is similar to `.git/config`. By default it's empty. More on this later on.
- `.dvc/.gitignore` makes sure Git ignores DVC internal files that shouldn't be tracked by Git.
- `.dvc/plots` contains predefined templates for plots you can generate using DVC. See the DVC documentation for more info.
- `.dvc/tmp` is used by DVC to store temporary files. This shouldn't interest the average user.
- `.dvc/cache` doesn't exist yet, but it is where DVC will keep the different versions of our data files. It's very similar in principle to `.git/objects`.

Now, we will set the Git remote to our repo on dagshub.com. This can be done using the following command:
```bash
git remote add origin https://dagshub.com/<username>/<repo-name>.git
```
Next, we will set up our DVC remote. This will allow us to interact with DagsHub's DVC storage. Replace `<username>`, `<repo-name>`, and `<token>` with your own values.
```bash
dvc remote add origin s3://dvc
dvc remote modify origin endpointurl https://dagshub.com/<username>/<repo-name>.s3
dvc remote modify origin --local access_key_id <token>
dvc remote modify origin --local secret_access_key <token>
```
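
The four commands above differ only in the placeholders, so they can also be generated from a script. The sketch below is one hedged way to do that and is not part of the original tutorial: the function names `dagshub_remote_commands` and `configure_remote`, and the sample values `my-user`, `my-repo`, and `not-a-real-token`, are hypothetical. It defaults to a dry run that prints the commands with the token masked. Note that the two `--local` commands write to `.dvc/config.local`, which DVC keeps out of Git, so the token is not committed.

```python
import subprocess


def dagshub_remote_commands(username, repo, token, remote="origin"):
    """Build the DVC remote commands from this tutorial as argument lists."""
    endpoint = f"https://dagshub.com/{username}/{repo}.s3"
    return [
        ["dvc", "remote", "add", remote, "s3://dvc"],
        ["dvc", "remote", "modify", remote, "endpointurl", endpoint],
        # --local stores the credentials outside the Git-tracked config file.
        ["dvc", "remote", "modify", remote, "--local", "access_key_id", token],
        ["dvc", "remote", "modify", remote, "--local", "secret_access_key", token],
    ]


def configure_remote(username, repo, token, dry_run=True):
    """Print the commands (token masked) or run them when dry_run is False."""
    for command in dagshub_remote_commands(username, repo, token):
        if dry_run:
            masked = ["<token>" if part == token else part for part in command]
            print(" ".join(masked))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical values; replace them with your own before a real run.
    configure_remote("my-user", "my-repo", "not-a-real-token")
```

Passing the arguments as a list avoids shell quoting issues with tokens that contain special characters.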
Finally, let's create three folders, one for each of our main project components.
```bash
mkdir code data metrics
```
This will add a clear structure for the project.
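
The earlier remark that `.dvc/cache` is similar in principle to `.git/objects` refers to content-addressed storage: a file is stored under a name derived from a hash of its contents, so identical data is stored only once. The toy store below illustrates that idea only. It is not DVC's or Git's actual implementation, the function names are hypothetical, and the two-character directory fan-out and the use of MD5 are simplifying assumptions.

```python
import hashlib
import tempfile
from pathlib import Path


def content_hash(data: bytes) -> str:
    """Hash the content itself; the file name plays no role."""
    return hashlib.md5(data).hexdigest()


def object_path(store: Path, digest: str) -> Path:
    """Place objects in a subdirectory named after the first two hex digits."""
    return store / digest[:2] / digest[2:]


def store_object(store: Path, data: bytes) -> Path:
    """Write data to the store unless identical content is already there."""
    path = object_path(store, content_hash(data))
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        first = store_object(store, b"version one of the data")
        again = store_object(store, b"version one of the data")
        second = store_object(store, b"version two of the data")
        print("same content, same object:", first == again)
        print("new content, new object:", first != second)
```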
We strongly recommend using a virtual environment to run the code in this tutorial. You may use the environment manager of your choosing. In this example we will use `virtualenv`.
In the project directory type the following commands:
=== "Linux/Mac"
    ```bash
    virtualenv venv
    echo -e 'venv/\n*.pyc' >> .gitignore
    source venv/bin/activate
    ```
=== "Windows"
    ```batch
    virtualenv venv
    echo venv/ >> .gitignore
    echo *.pyc >> .gitignore
    venv\Scripts\activate.bat
    ```
The first command creates our virtual environment in the `venv/` directory.
The second (and third, in the case of Windows) command builds our `.gitignore`. It ensures the virtual environment packages and Python bytecode are not tracked by Git.
The final command activates our virtual Python environment. This ensures that no Python package we use contaminates our global Python installation.
**The rest of this tutorial should be executed using this environment.**
If you exit the shell session, or want to create another, make sure to activate the virtual environment in that shell session first.
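
If you are unsure whether the current shell session has the environment activated, Python itself can tell you. The sketch below is a small illustration rather than part of the original tutorial, and the function name `in_virtual_environment` is hypothetical. It relies on `sys.prefix` differing from `sys.base_prefix` inside a virtual environment, and also checks `sys.real_prefix`, which older `virtualenv` releases set.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases expose the original interpreter as sys.real_prefix.
    if getattr(sys, "real_prefix", None) is not None:
        return True
    # venv and current virtualenv releases make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    if in_virtual_environment():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; activate venv first")
```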
To install the requirements for the first part of this project, simply download the requirements.txt into your project folder.
!!! tip
    Alternatively, you can create a file called `requirements.txt` and copy the following into it:

    ```text
    torch==1.12.0
    numpy==1.21.6
    pandas==1.3.5
    python-dateutil==2.8.2
    pytz==2022.1
    scikit-learn==1.0.2
    scipy==1.7.3
    six==1.16.0
    sklearn==0.0
    ```
Now, to install them type:
```bash
pip3 install -r requirements.txt
```
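
To confirm that the pinned versions actually ended up in the active environment, the installed versions can be compared with the pins. This is a hedged sketch, not part of the original tutorial: the helper names `parse_pins` and `find_mismatches` are hypothetical, it handles only simple `name==version` lines, and it assumes Python 3.8 or newer for `importlib.metadata`.

```python
from importlib import metadata
from pathlib import Path


def parse_pins(requirements_text):
    """Map package name to pinned version for each simple name==version line."""
    pins = {}
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins):
    """Return {name: (wanted, installed)} where installed is None if missing."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    requirements = Path("requirements.txt")
    if requirements.exists():
        problems = find_mismatches(parse_pins(requirements.read_text()))
        for name, (wanted, installed) in problems.items():
            print(f"{name}: wanted {wanted}, found {installed}")
        if not problems:
            print("All pinned packages are installed at the expected versions")
    else:
        print("requirements.txt not found in the current directory")
```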
Let's check the Git status of our project:
```bash
$ git status -s
A  .dvc/.gitignore
A  .dvc/config
A  .dvcignore
?? .gitignore
```
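
In the short status format, each line starts with two status characters, the first for the index (staging area) and the second for the working tree, followed by a space and the path. `A` in the first column means the file is already staged as a new file, and `??` marks an untracked file. The sketch below decodes such output. It is only an illustration: the function name `parse_short_status` is hypothetical, and it does not handle renames or quoted paths.

```python
def parse_short_status(output):
    """Split `git status -s` lines into (index_state, worktree_state, path)."""
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # too short to hold two status columns, a space, and a path
        entries.append((line[0], line[1], line[3:]))
    return entries


def describe(entry):
    """Give a short human-readable description of one status entry."""
    index_state, worktree_state, path = entry
    if index_state == "?" and worktree_state == "?":
        return f"{path}: untracked"
    if index_state == "A":
        return f"{path}: new file, staged"
    return f"{path}: index={index_state!r}, worktree={worktree_state!r}"


if __name__ == "__main__":
    sample = "A  .dvc/.gitignore\nA  .dvc/config\nA  .dvcignore\n?? .gitignore\n"
    for entry in parse_short_status(sample):
        print(describe(entry))
```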
Now let's commit this to Git using the command line:
```bash
git add .
git commit -m "Initialized project"
git push -u origin master
```
The last command might request your DagsHub.com username and token.
Let's open up our repo on DagsHub. It should look something like this:
In the next section, we will use dvc to define our pipeline. That is where the magic happens.