
Version Data and Code

In the previous part of the Quick Start section, we created and configured a DagsHub repository on our local computer. In this part, we will download the content of an email classifier to our local computer and learn how to:

  • Version code and data using DVC and Git
  • Push the files to DagsHub remotes

!!! illustration "Video for this tutorial"
    Prefer to follow along with a video instead of reading? Check out the video for this section below:

<center>
<iframe width="400" height="225" src="https://www.youtube.com/embed/haLftyGzH3g" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
</center>

Add a Project

We will start this section by downloading the project's files (code, data, and dependencies) to our local directory. We will use the `dvc get` command, which downloads files from a Git repository or DVC storage without tracking them.

  • Run the following commands from your CLI:

    === "Mac, Linux, Windows"

            dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
            dvc get https://dagshub.com/nirbarazida/hello-world-files src
            dvc get https://dagshub.com/nirbarazida/hello-world-files data/

Project Overview

This project is a simple 'Ham or Spam' classifier for emails using the Enron data set with the following structure:

tree -I <venv-name> 
    .
    ├── data
    │   └── enron.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data_preprocessing.py
        └── modeling.py
    
    2 directories, 5 files

??? Info "Info about the files"

    - <u>src directory</u> - Holds the data-preprocessing, modeling, and const files:
        - `data_preprocessing.py` - Processes the raw data, splits it into train and test sets, and saves them to the data directory.
        - `modeling.py` - A simple Random Forest model.
        - `const.py` - Holds the project's constants.
    - <u>data directory</u> - Contains the raw data, `enron.csv`.
    - `requirements.txt` - The Python dependencies required to run the Python files.
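Before moving on, it can help to confirm that the download produced the layout shown above. The following is a minimal sketch, not part of the original project: `check_project_files` is a hypothetical helper that checks only for the five files listed in the tree and must be run from the project's root directory.

```bash
# Hypothetical helper: report any file from the tutorial's layout that is missing.
# Returns 0 when all five files exist, 1 otherwise.
check_project_files() {
    missing=0
    for f in data/enron.csv requirements.txt \
             src/const.py src/data_preprocessing.py src/modeling.py; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_project_files && echo "all files present"` prints the confirmation only when nothing is missing.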

Install Requirements

We will use the requirements.txt file to install the project's dependencies.

  • Run the following command from your CLI:

    === "Mac, Linux, Windows"

            pip install -r requirements.txt

??? checkpoint "Checkpoint"

    Check that the current status of your Git tracking matches the following:

    === "Mac, Linux, Windows"
        ```bash
        git status -s
            ?? data/
            ?? requirements.txt
            ?? src/
        ```

Track Files Using Git and DVC

At this point, we need to decide which files will be tracked by Git and which will be tracked by DVC. We will start with files tracked by DVC because this action will generate new files tracked by Git.

Track Files with DVC

The data directory contains the datasets for this project, which are quite big. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.
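File size is the criterion used here to split the project between DVC and Git. In a project where the split is less obvious, one rough way to spot DVC candidates is to list the files above a size threshold. The sketch below is an illustration only: `list_large_files` is a hypothetical helper, the 1024 KiB default is an arbitrary assumption rather than a DVC or DagsHub recommendation, and the `k` size suffix assumes GNU or BSD `find`.

```bash
# Hypothetical helper: list files larger than a threshold given in KiB,
# skipping the internal directories of Git and DVC.
list_large_files() {
    min_kib="${1:-1024}"   # assumed default threshold: 1 MiB
    find . -type f -size +"${min_kib}"k \
        ! -path './.git/*' ! -path './.dvc/*'
}
```

For example, `list_large_files 500` lists every file in the current directory tree that is larger than 500 KiB.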

  • Add the data directory to DVC tracking

    === "Mac, Linux, Windows"

            dvc add data
            100% Add|███████████████████████████████████████████████████████████████|1/1 [00:01, 1.28s/file]

            To track the changes with git, run:

                git add data.dvc .gitignore

    As we can see from the above, DVC provides us with information about the modified files due to this action and what we should do with them.

  • Track the changes with Git

    === "Mac, Linux, Windows"

            git add data.dvc .gitignore
            git commit -m "Add the data directory to DVC tracking"
            2 files changed, 6 insertions(+)
            create mode 100644 data.dvc
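The two files committed in this step are what link Git to the DVC-tracked data: `data.dvc` is a small pointer file that Git versions in place of the data, and the `.gitignore` entry keeps the data itself out of Git. The sketch below checks for both. `is_dvc_tracked` is a hypothetical helper, and it assumes that DVC wrote the ignore rule as `/data` in the `.gitignore` next to the tracked directory.

```bash
# Hypothetical helper: succeed only if "dvc add <dir>" left both of its artifacts behind.
is_dvc_tracked() {
    dir="$1"
    # 1. The pointer file that Git versions in place of the data.
    [ -f "$dir.dvc" ] || { echo "missing $dir.dvc" >&2; return 1; }
    # 2. The ignore rule that keeps the data itself out of Git.
    #    Assumption: the entry is written exactly as "/<dir>" in ./.gitignore.
    grep -qxF "/$dir" .gitignore 2>/dev/null \
        || { echo "$dir is not listed in .gitignore" >&2; return 1; }
}
```

Run it from the project's root directory, for example `is_dvc_tracked data && echo "data is tracked by DVC"`.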

Track Files with Git

  • Check the current status of the files in the directory

    === "Mac, Linux, Windows"

            git status -s
            ?? requirements.txt
            ?? src/

    As we can see from the above, all the remaining files are either code or configuration files. Thus, we will track them using Git.

  • Track the files with Git

    === "Mac, Linux, Windows"

            git add requirements.txt src/
            git commit -m "Add requirements and src to Git tracking"

Push the Files to the Remotes

At this point, we would like to push the files tracked by Git and DVC to our DagsHub remotes. Performing this action will make the entire project available for sharing & collaborating on DagsHub.

Push DVC tracked files

=== "Mac, Linux, Windows"

        dvc push -r origin
        2 files pushed

!!! Note
    To see the DVC tracked files in your DagsHub repository, you will need to push the `<filename>.dvc` file, which is tracked by Git, to the remote repository. We will do this in the next step.

Push Git tracked files

=== "Mac, Linux, Windows"

        git push
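The whole sequence from this page (track with DVC, commit the pointer file, commit the code, push to both remotes) can be collected into one function. This is a sketch under assumptions, not a DagsHub or DVC feature: `run`, `version_and_push`, and the `DRY_RUN` variable are hypothetical names introduced here, and the commands and commit messages are simply the ones used above.

```bash
# Hypothetical helper: print a command, and execute it unless DRY_RUN=1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# Hypothetical wrapper around the tutorial's steps.
# $1: directory to track with DVC (the tutorial uses "data").
version_and_push() {
    target="$1"
    run dvc add "$target" || return 1
    run git add "$target.dvc" .gitignore || return 1
    run git commit -m "Add the $target directory to DVC tracking" || return 1
    run git add requirements.txt src/ || return 1
    run git commit -m "Add requirements and src to Git tracking" || return 1
    run dvc push -r origin || return 1
    run git push
}
```

Running `DRY_RUN=1 version_and_push data` only prints the seven commands in order, which is a safe way to review them first. Without `DRY_RUN`, the function stops at the first command that fails.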

Results

After completing these steps, our repository will look like this:

  • The main repository page: [![repo-stat-after-push](assets/5-repo-stat-after-push.png){: style="padding-top:0.7em"}](assets/5-repo-stat-after-push.png){target=_blank}

!!! Note
    The DVC tracked files are marked with a blue background and have a DVC tag.

  • The data directory:

    [![data-dir-after-push](assets/6-data-dir-after-push.png){: style="padding-top:0.7em"}](assets/6-data-dir-after-push.png){target=_blank}
  • The data file itself:

    [![content-of-enron-file](assets/7-content-of-enron-file.png){: style="padding-top:0.7em"}](assets/7-content-of-enron-file.png){target=_blank}

    !!! Note "DagsHub's Data Catalog"

      As we can see in the image above, DagsHub displays the content of the files (e.g. CSV, YAML, image, etc.), 
      tracked by both Git and DVC. In this case, the CSV file, tracked by DVC, is displayed in a table that you can 
      filter and compare to different commits.
    

We hope that this tutorial was helpful and made the onboarding process easier for you. If you find an issue in the docs, please let us know or, better yet, help us fix it. If you have any questions, feel free to join our Discord channel and ask there. We can't wait to see what remarkable project you will create and share with the Data Science community!

Here for any help you need,
Team DagsHub.
