Version Code and Data

In the previous part of the Get Started section, we created and configured a DAGsHub repository. In this part, we will download and add a project to our local directory, track the files using DVC and Git, and push the files to the remotes.

Start From This Part

To start the project from this part, please follow the instructions below.

  • Fork the hello-world repository.
  • Clone the repository and work on the start-version-project branch using the following command:
    git clone -b start-version-project https://dagshub.com/<DAGsHub-user-name>/hello-world.git
    
  • Create and activate a virtual environment.
  • Install and initialize DVC.
  • Configure DVC locally and set DAGsHub storage as the remote.

Checkpoint

Check that the current DVC configuration matches the following:

Important

To avoid conflicts, work on the start-version-project branch for the rest of the tutorial.

Add a Project

At this point, we want to add the files required for our ML project to the local directory. We will use the dvc get command, which downloads files from a Git repository or DVC storage without tracking them.

  • Run the following commands from your CLI:

    dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
    dvc get https://dagshub.com/nirbarazida/hello-world-files src
    dvc get https://dagshub.com/nirbarazida/hello-world-files data/
    

Project Overview

This project is a simple 'Ham or Spam' classifier for emails using the Enron data set.

tree -I venv
    .
    ├── data
    │   └── enron.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data_preprocessing.py
        └── modeling.py

    2 directories, 5 files

  • src directory - Holds the data preprocessing, modeling, and constants files:
    • data_preprocessing.py - Processes the raw data, splits it into train and test sets, and saves the results to the data directory.
    • modeling.py - Simple Random Forest Regressor.
    • const.py - Holds the project's constants.
  • data directory - Contains the raw data - enron.csv.
  • requirements.txt - Python dependencies required to run the Python files.

Install Requirements

We will use the requirements.txt file to install the project's dependencies in the virtual environment. Make sure that the virtual environment is activated.

  • Run the following command from your CLI:

    pip3 install -r requirements.txt
    
Checkpoint

Check that the current status of your Git tracking matches the following:

git status -s
    ?? data/
    ?? requirements.txt
    ?? src/
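
A checkpoint like this one can also be verified from a script. The sketch below is a minimal illustration in Python and is not part of the tutorial's project: parse_short_status is a hypothetical helper that assumes the short format printed by git status -s (two status characters, a space, then the path) and ignores special cases such as renames.

```python
def parse_short_status(output):
    """Map each path in `git status -s` output to its two-character status code."""
    entries = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        # Short format: two status characters, one space, then the path.
        code, path = line[:2], line[3:]
        entries[path] = code
    return entries


# Expected state at this checkpoint: everything is still untracked ("??").
expected = {"data/": "??", "requirements.txt": "??", "src/": "??"}
sample = "?? data/\n?? requirements.txt\n?? src/\n"
print(parse_short_status(sample) == expected)
```

To check a real working tree, pass in the text captured from the command itself, for example subprocess.run(["git", "status", "-s"], capture_output=True, text=True).stdout.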

Track Files Using Git and DVC

At this point, we need to decide which files will be tracked by Git and which will be tracked by DVC. We will start with files tracked by DVC because this action will generate new files tracked by Git.

Track Files with DVC

The data directory contains the data sets for this project, which are quite big. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.

  • Add the data directory to DVC tracking

    dvc add data
        100% Add|███████████████████████████████████████████████████████████████|1/1 [00:01,  1.28s/file]
        To track the changes with git, run:
             git add data.dvc .gitignore
    

    As the output shows, DVC reports which files this action created or modified (data.dvc and .gitignore) and the Git command needed to track them.

  • Track the changes with Git

    git add data.dvc .gitignore
    git commit -m "Add the data directory to DVC tracking"
         2 files changed, 6 insertions(+)
         create mode 100644 data.dvc
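
To make the role of data.dvc more concrete: DVC records a content hash of the tracked data in that small file, so Git only has to version the pointer rather than the data itself. The sketch below illustrates the idea in plain Python under simplifying assumptions. The helper names (file_md5, directory_fingerprint) and the listing layout are hypothetical, and the result is not meant to reproduce the exact hash that DVC writes.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path):
    """Return the MD5 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_fingerprint(root):
    """Hash a sorted listing of (relative path, file hash) entries."""
    listing = [
        {"relpath": p.relative_to(root).as_posix(), "md5": file_md5(p)}
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    payload = json.dumps(listing, sort_keys=True).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    data.mkdir()
    (data / "enron.csv").write_text("label,text\nham,hello\n")
    before = directory_fingerprint(data)

    # Adding a file changes the fingerprint, so a pointer file that
    # stores it would change too and show up as a Git modification.
    (data / "X_train.csv").write_text("0,1\n")
    after = directory_fingerprint(data)

print(before != after)
```

The design point is that the fingerprint depends only on file contents and relative paths, so the same data always yields the same pointer, and any change to the data yields a different one.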
    

Track Files with Git

  • Check the current status of the files in the directory

    git status -s
        ?? requirements.txt
        ?? src/
    

    As we can see from the above, all the remaining files are either code or configuration files. Thus, we will track them using Git.

  • Track the files with Git

    git add requirements.txt src/
    git commit -m "Add requirements and src to Git tracking"
    

Push the Files to the Remotes

At this point, we would like to push the files tracked by Git and DVC to our DAGsHub remotes. Doing so makes the entire project available for sharing and collaboration on DAGsHub.

Push DVC tracked files

dvc push -r origin
    Enter a password for host <storage-provider> user <user-name>:
    1 file pushed

Note

You will be asked to enter the password of your user at the storage provider.

Note

To see the DVC tracked files in your DAGsHub repository, you will need to push the <filename>.dvc file, which is tracked by Git, to the remote repository. We will do this in the next step.

Push Git tracked files

git push

After completing these steps, our repository will look like this:

  • The main repository page:

Note

The DVC tracked files are marked with a blue background.

  • The data directory:

  • The data file itself:

DAGsHub displays the content of files (e.g. CSV, YAML, images) whether they are tracked by Git or by DVC. In this case, the CSV file, tracked by DVC, is displayed as a table that you can filter and compare across commits.

Process and Track Data Changes

  • Now, we would like to preprocess our data and track the results using DVC.
  • Let's run the data_preprocessing.py file from our CLI.

    python3 src/data_preprocessing.py
        [DEBUG] Preprocessing raw data
            [DEBUG] Loading raw data
            [DEBUG] Removing punctuation from Emails
            [DEBUG] Label encoding target column
            [DEBUG] Vectorizing the emails by words
            [DEBUG] Splitting data to train and test
            [DEBUG] Saving data to file
    

    This action generated four new files of processed data in the data directory.

    tree data
        data
        ├── X_test.csv
        ├── X_train.csv
        ├── enron.csv
        ├── y_test.csv
        └── y_train.csv
    
        0 directories, 5 files
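
The log lines above name the stages of the preprocessing script: punctuation removal, label encoding, word-level vectorization, and a train/test split. The contents of src/data_preprocessing.py are not shown in this tutorial, so the following is only a rough standard-library sketch of what such stages might look like. All function names, the sample emails, and the split ratio are assumptions made for illustration.

```python
import random
import string
from collections import Counter


def remove_punctuation(text):
    """Strip ASCII punctuation characters from a string."""
    return text.translate(str.maketrans("", "", string.punctuation))


def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def vectorize(texts):
    """Turn texts into word-count rows over a shared, sorted vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({word for words in tokenized for word in words})
    rows = []
    for words in tokenized:
        counts = Counter(words)
        rows.append([counts[word] for word in vocabulary])
    return rows, vocabulary


def split_train_test(rows, labels, test_ratio=0.25, seed=0):
    """Shuffle row indices reproducibly and split them into train and test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * (1 - test_ratio))
    train, test = indices[:cut], indices[cut:]
    return (
        [rows[i] for i in train],
        [rows[i] for i in test],
        [labels[i] for i in train],
        [labels[i] for i in test],
    )


# Hypothetical sample data standing in for the raw data set.
emails = ["Win money now!!!", "Meeting at noon.", "Cheap money, win big!", "Lunch at noon?"]
labels = ["spam", "ham", "spam", "ham"]

cleaned = [remove_punctuation(email) for email in emails]
encoded, mapping = encode_labels(labels)
features, vocabulary = vectorize(cleaned)
X_train, X_test, y_train, y_test = split_train_test(features, encoded)

print(mapping)
print(len(X_train), len(X_test))
```

The four returned lists correspond to the four generated files (X_train, X_test, y_train, y_test). The real script may well use third-party libraries for these steps.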
    

Checkpoint

Check that the current status of your Git and DVC tracking matches the following:

git status
    On branch master
    Your branch is up to date with 'origin/<branch-name>'.
    nothing to commit, working tree clean
dvc status
    data.dvc:
            changed outs:
                            modified: data

Git reports no changes because the data directory is listed in .gitignore and tracked by DVC instead. DVC, in contrast, reports that the data directory was modified.

  • Let's version the new status of the data directory with DVC:

    dvc add data
        100% Add|███████████████████████████████████████████████████████|1/1 [00:01,  1.18s/file]
        To track the changes with git, run:
                git add data.dvc
    git add data.dvc
    git commit -m "Process raw-data and save it to data directory"
    
  • Push our changes to the remote

    dvc push -r origin
        Enter a password for host <storage-provider> user <user-name>:
        5 files pushed
    git push
    
  • Let's see the new status of the data directory in DAGsHub.

In this section, we covered the basic workflow of DVC and Git. We added our project files to the repository and tracked them using Git and DVC. We generated preprocessed data files and learned how to add these changes to DVC as well. In the next sections, we will learn how to: