
Version Code and Data

In the previous part of the Get Started section, we created and configured a DAGsHub repository. In this part, we will download and add a project to our local directory, track the files using DVC and Git, and push the files to the remotes.

??? Example "Start From This Part"
    To start the project from this part, please follow the instructions below.

    - Fork the [hello-world](https://dagshub.com/nirbarazida/hello-world) repository.
    - Clone the repository and work on the start-version-project branch using the following command:<br/>
        ```bash
        git clone -b start-version-project https://dagshub.com/<DAGsHub-user-name>/hello-world.git
        ```
    - Create and activate a virtual environment.
    - Install and initialize DVC.
    - Configure DVC locally and set DAGsHub storage as the remote.

??? checkpoint "Checkpoint"

    Check that the current DVC configuration matches the following:

    === "Mac-os, Linux"
        ```bash
        cat .dvc/config.local
            ['remote "origin"']
                url = https://dagshub.com/<DAGsHub-user-name>/hello-world.dvc
                auth = basic
                user = <DAGsHub-user-name>
                ask_password = true
        ```
    === "Windows"
        ```bash
        type .dvc/config.local
            ['remote "origin"']
                url = https://dagshub.com/<DAGsHub-user-name>/hello-world.dvc
                auth = basic
                user = <DAGsHub-user-name>
                ask_password = true
        ```
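
The same check can be scripted. The following is a minimal sketch, assuming Python's standard `configparser` can read the INI-style layout shown in the checkpoint (DVC itself may parse the file differently). The helper `read_remote` and the `SAMPLE_CONFIG` text are illustrative names, not part of DVC.

```python
import configparser

# Sample text mirroring the checkpoint above; placeholders are kept as-is.
SAMPLE_CONFIG = """\
['remote "origin"']
    url = https://dagshub.com/<DAGsHub-user-name>/hello-world.dvc
    auth = basic
    user = <DAGsHub-user-name>
    ask_password = true
"""


def read_remote(config_text, name="origin"):
    """Return the settings of one remote as a plain dict (hypothetical helper)."""
    # Interpolation is disabled so a literal '%' in a URL is not misread.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    # The section header keeps its quotes: 'remote "origin"'
    section = f"'remote \"{name}\"'"
    return dict(parser[section])


if __name__ == "__main__":
    remote = read_remote(SAMPLE_CONFIG)
    print(remote["url"])
    print(remote["auth"], remote["ask_password"])
```

To check a real repository, the text could be read from `.dvc/config.local` instead of the sample string.
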
!!! Important
    To avoid conflicts, **work on the start-version-project branch** for the rest of the tutorial.

Add a Project

At this point, we want to add the required files for our ML project to the local directory. We will use the dvc get command, which downloads files from a Git repository or DVC storage without tracking them.

  • Run the following commands from your CLI:

    === "Mac-os, Linux, Windows"
        ```bash
        dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
        dvc get https://dagshub.com/nirbarazida/hello-world-files src
        dvc get https://dagshub.com/nirbarazida/hello-world-files data/
        ```

Project Overview

This project is a simple 'Ham or Spam' classifier for emails using the Enron data set.

```bash
tree -I venv
    .
    ├── data
    │   └── enron.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data_preprocessing.py
        └── modeling.py

    2 directories, 5 files
```
  • src directory - Holds the data preprocessing, modeling, and constants files:
    • data_preprocessing.py - Processes the raw data, splits it into train and test sets, and saves them to the data directory.
    • modeling.py - Simple Random Forest Regressor.
    • const.py - Holds the constants of the project.
  • data directory - Contains the raw data, enron.csv.
  • requirements.txt - Python dependencies required to run the Python files.
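
Before moving on, it can help to confirm that the download produced the layout listed above. The following is a minimal sketch using only the standard library. The helper `missing_paths` is a hypothetical name introduced for illustration; the path list is copied from the tree output.

```python
from pathlib import Path

# Paths listed in the project overview above.
EXPECTED_PATHS = [
    "data/enron.csv",
    "requirements.txt",
    "src/const.py",
    "src/data_preprocessing.py",
    "src/modeling.py",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under root (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths(".")
    if missing:
        print(f"Missing: {missing}")
    else:
        print("All project files are in place")
```

Running it from the project root reports any file that one of the dvc get commands failed to fetch.
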

Install Requirements

We will use the requirements.txt file to install the project's dependencies in the virtual environment. Make sure that the virtual environment is activated.

  • Run the following command from your CLI:

    === "Mac-os, Linux, Windows"
        ```bash
        pip3 install -r requirements.txt
        ```

??? checkpoint "Checkpoint"

    Check that the current status of your Git tracking matches the following:

    === "Mac-os, Linux, Windows"
        ```bash
        git status -s
            ?? data/
            ?? requirements.txt
            ?? src/
        ```
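
The short status format is easy to inspect in a script: each line holds a two-character status code, a space, and a path. The following is a minimal sketch that groups paths by status code. The function name `parse_short_status` and the sample text are assumptions made for illustration, and renamed entries (which contain `->`) are not handled.

```python
# Sample text matching the checkpoint above (without the display indentation).
SAMPLE_STATUS = "?? data/\n?? requirements.txt\n?? src/\n"


def parse_short_status(output):
    """Group paths from `git status -s` output by their two-character status code."""
    grouped = {}
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        # Columns 0-1 are the status code, column 2 is a space, the rest is the path.
        code, path = line[:2], line[3:]
        grouped.setdefault(code, []).append(path)
    return grouped


if __name__ == "__main__":
    for code, paths in parse_short_status(SAMPLE_STATUS).items():
        print(repr(code), paths)
```

Here `??` marks untracked paths, so all three entries are still waiting to be added to either Git or DVC.
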

Track Files Using Git and DVC

At this point, we need to decide which files will be tracked by Git and which will be tracked by DVC. We will start with files tracked by DVC because this action will generate new files tracked by Git.

Track Files with DVC

The data directory contains the data sets for this project, which are quite big. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.

  • Add the data directory to DVC tracking

    === "Mac-os, Linux, Windows"
        ```bash
        dvc add data
            100% Add|███████████████████████████████████████████████████████████████|1/1 [00:01, 1.28s/file]

            To track the changes with git, run:

                git add data.dvc .gitignore
        ```
    As we can see from the above, DVC tells us which files this action modified and what we should do with them.

  • Track the changes with Git

    === "Mac-os, Linux, Windows"
        ```bash
        git add data.dvc .gitignore
        git commit -m "Add the data directory to DVC tracking"
            2 files changed, 6 insertions(+)
            create mode 100644 data.dvc
        ```

Track Files with Git

  • Check the current status of the files in the directory

    === "Mac-os, Linux, Windows"
        ```bash
        git status -s
            ?? requirements.txt
            ?? src/
        ```
    As we can see from the above, all the remaining files are either code or configuration files. Thus, we will track them using Git.

  • Track the files with Git

    === "Mac-os, Linux, Windows"
        ```bash
        git add requirements.txt src/
        git commit -m "Add requirements and src to Git tracking"
        ```

Push the Files to the Remotes

At this point, we would like to push the files tracked by Git and DVC to our DAGsHub remotes. Performing this action will make the entire project available for sharing & collaborating on DAGsHub.

Push DVC tracked files

=== "Mac-os, Linux, Windows"
    ```bash
    dvc push -r origin
        Enter a password for host <storage-provider> user <user-name>:
        1 file pushed
    ```

!!! Note
    You will be asked to enter the password of your user at the storage provider.

!!! Note
    To see the DVC tracked files in your DAGsHub repository, you will need to push the <filename>.dvc file, which is tracked by Git, to the remote repository. We will do this in the next step.

Push Git tracked files

=== "Mac-os, Linux, Windows"
    ```bash
    git push
    ```

After completing these steps, our repository will look like this:

  • The main repository page: ![](assets/5-repo-stat-after-push-shadow.png)

!!! Note
    The DVC tracked files are marked with a blue background.

  • The data directory:

  • The data file itself:

As we can see in the image above, DAGsHub displays the content of files (e.g. CSV, YAML, images) tracked by both Git and DVC. In this case, the CSV file, tracked by DVC, is displayed as a table that you can filter and compare across commits.

Process and Track Data Changes

  • Now, we would like to preprocess our data and track the results using DVC.

  • Let's run the data_preprocessing.py file from our CLI.

    === "Mac-os, Linux, Windows"
        ```bash
        python3 src/data_preprocessing.py
            [DEBUG] Preprocessing raw data
            [DEBUG] Loading raw data
            [DEBUG] Removing punctuation from Emails
            [DEBUG] Label encoding target column
            [DEBUG] Vectorizing the emails by words
            [DEBUG] Splitting data to train and test
            [DEBUG] Saving data to file
        ```
    This action generated 4 new files of processed data in the data directory.

    ```bash
    tree data
        data
        ├── X_test.csv
        ├── X_train.csv
        ├── enron.csv
        ├── y_test.csv
        └── y_train.csv

        0 directories, 5 files
    ```
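
The debug log names four processing steps: removing punctuation, label encoding, vectorizing by words, and splitting into train and test sets. The following is a toy sketch of those steps using only the standard library. It is not the content of src/data_preprocessing.py, which is not shown here; every function name, the sample emails, and the split parameters are assumptions made for illustration.

```python
import random
import string
from collections import Counter


def remove_punctuation(text):
    """Drop ASCII punctuation characters from a string."""
    return text.translate(str.maketrans("", "", string.punctuation))


def encode_labels(labels):
    """Map each distinct label to an integer, in sorted label order."""
    mapping = {label: index for index, label in enumerate(sorted(set(labels)))}
    return [mapping[label] for label in labels], mapping


def vectorize(texts):
    """Turn texts into word-count vectors over a shared, sorted vocabulary."""
    counts = [Counter(text.lower().split()) for text in texts]
    vocabulary = sorted(set().union(*counts))
    return [[count[word] for word in vocabulary] for count in counts], vocabulary


def train_test_split(rows, labels, test_fraction=0.25, seed=42):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [rows[i] for i in train_idx]
    x_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


if __name__ == "__main__":
    # Made-up emails, not taken from the Enron data set.
    emails = ["Win money now!!!", "Meeting at noon.", "Cheap money, win big!", "Lunch at noon?"]
    targets = ["spam", "ham", "spam", "ham"]
    cleaned = [remove_punctuation(email) for email in emails]
    encoded, mapping = encode_labels(targets)
    vectors, vocabulary = vectorize(cleaned)
    x_train, x_test, y_train, y_test = train_test_split(vectors, encoded)
    print(mapping, vocabulary)
    print(len(x_train), "train rows,", len(x_test), "test rows")
```

The four outputs of the split correspond in role to the X_train, X_test, y_train, and y_test files listed above.
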
    

??? checkpoint "Checkpoint"

    Check that the current status of your Git and DVC tracking matches the following:

    === "Mac-os, Linux, Windows"
        ```bash
        git status
            On branch master
            Your branch is up to date with 'origin/<branch-name>'.
            nothing to commit, working tree clean
        dvc status
            data.dvc:
                    changed outs:
                                    modified: data
        ```
Nothing was changed in Git tracking because the data directory is being tracked by DVC.
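
The idea behind the "modified: data" report can be sketched as comparing a fingerprint recorded at `dvc add` time with one computed from the directory's current content. The following is a simplified illustration, not DVC's actual hashing scheme; `directory_fingerprint` and `has_changed` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def directory_fingerprint(root):
    """Hash every file's relative path and content into one hex digest."""
    root = Path(root)
    digest = hashlib.md5()
    # Sorting keeps the result independent of directory listing order.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


def has_changed(root, recorded_fingerprint):
    """Report whether the directory content differs from the recorded fingerprint."""
    return directory_fingerprint(root) != recorded_fingerprint


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp)
        (data / "enron.csv").write_text("raw data\n")
        recorded = directory_fingerprint(data)  # plays the role of the first add
        print(has_changed(data, recorded))
        (data / "X_train.csv").write_text("processed data\n")
        print(has_changed(data, recorded))
```

Adding the processed files changes the fingerprint, which is why the directory has to be added again before the new state is versioned.
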
  • Let's version the new status of the data directory with DVC:

    === "Mac-os, Linux, Windows"
        ```bash
        dvc add data
            100% Add|███████████████████████████████████████████████████████|1/1 [00:01, 1.18s/file]

            To track the changes with git, run:

                git add data.dvc
        git add data.dvc
        git commit -m "Process raw-data and save it to data directory"
        ```

  • Push our changes to the remote

    === "Mac-os, Linux, Windows"
        ```bash
        dvc push -r origin
            Enter a password for host <storage provider> user <username>:
            5 files pushed
        git push
        ```

  • Let's see the new status of the data directory in DAGsHub.

![](assets/8-data-dir-after-push-shadow.png)

In this section, we covered the basic workflow of DVC and Git. We added our project files to the repository and tracked them using Git and DVC. We generated preprocessed data files and learned how to add these changes to DVC as well. In the next sections, we will learn how to:
