---
title: Data & Code Versioning Essentials with DagsHub | DagsHub - DagsHub Docs
description: Master the essentials of data and code versioning for machine learning projects with DagsHub. Ensure the reproducibility of your ML experiments with traceable and collaborative workflows. Ideal for data scientists seeking reliable version control.
---

# Version Data, Metadata and Code

Versioning data and code is a critical component of data science projects that ensures the reproducibility of ML experiments. It provides traceability and enables collaboration among team members with ease.

In this section, we'll learn how to version code using Git, data using DVC, and metadata using Data Engine, and host them all on DagsHub.

??? illustration "Video for this tutorial"
    Prefer to follow along with a video instead of reading? Check out the video for this section below:

    <center>
    <iframe width="400" height="225" src="https://www.youtube.com/embed/haLftyGzH3g" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    </center>

## Configure DagsHub

We'll start by creating a new project on DagsHub and configuring it to work with our machine.

### Set up the project

In this tutorial, we will work with the email classifier project, where we build a Random Forest regressor to detect spam emails.

??? Info "Project overview"

    This project is a simple 'Ham or Spam' classifier for emails using the Enron data set, with the following structure:

    ```bash
    tree -I <venv-name>
    .
    ├── data
    │   └── enron.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data_preprocessing.py
        └── modeling.py

    2 directories, 5 files
    ```

    - <u>src directory</u> - Holds the data-preprocessing, modeling, and const files:
        - `data_preprocessing.py` - Processes the raw data, splits it into train and test sets, and saves them to the data directory.
        - `modeling.py` - A simple Random Forest regressor.
        - `const.py` - Holds the project's constants.
    - <u>data directory</u> - Contains the raw data - `enron.csv`.
    - `requirements.txt` - The Python dependencies required to run the Python files.
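For intuition, the train/test split performed by the preprocessing step can be sketched as follows. This is a minimal, hypothetical illustration; the helper name `split_train_test`, the 80/20 ratio, and the seed are our assumptions, not the project's actual code:

```python
import random

def split_train_test(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows deterministically and split them into
    train and test sets (hypothetical stand-in for the project's
    preprocessing logic)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_ratio)
    return rows[n_test:], rows[:n_test]

# Toy stand-in for the Enron rows:
emails = [{"text": f"email {i}", "is_spam": i % 3 == 0} for i in range(100)]
train, test = split_train_test(emails)
print(len(train), len(test))  # 80 20
```

Seeding the shuffle makes the split reproducible, which matters once the resulting train/test files are versioned with DVC.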

### Download files

To begin, we will use the `dvc get` command to download the project's files (code, data, and dependencies) to our local directory.

??? Info "dvc get"
    The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.

- Run the following commands from your CLI:

=== "Mac, Linux, Windows"
    ```bash
    pip install dvc
    dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
    dvc get https://dagshub.com/nirbarazida/hello-world-files src
    dvc get https://dagshub.com/nirbarazida/hello-world-files data/
    ```

### Install requirements

- Run the following commands from your CLI:

=== "Mac, Linux, Windows"
    ```bash
    pip install -r requirements.txt
    ```

At this point, we're ready to version the files to mark the project's starting point. We will use Git to version the lightweight files, and DVC for the heavy ones, such as data and models.

## Version data

To version our data directory, we'll use the DagsHub Client, which leverages DVC under the hood. Its edge over vanilla DVC is that it uses DagsHub's backend to calculate the DVC hashes, without performing any computation locally or requiring the entire folder's content to be present on your machine.
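For intuition about what those hashes are: DVC identifies each file version by a content hash (an MD5 digest of the file's bytes). Here is a minimal sketch of that primitive; the function name `file_md5` is ours, for illustration only, and this says nothing about how DagsHub's backend implements it:

```python
import hashlib

def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 content hash of a file, read in chunks so
    large files don't need to fit in memory. This is the kind of
    content hash DVC uses to identify file versions."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Two files with identical content always hash to the same value, which is how content-addressed storage deduplicates data across versions.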

With DagsHub Client, you can version your data from the CLI or your Python scripts to enable integrations into your project pipeline.

- We will start by installing the DagsHub Client using pip:

=== "Mac, Linux, Windows"
    ```bash
    pip install dagshub
    ```

### Version data from CLI

=== "Mac, Linux, Windows"

```bash
dagshub upload --update --message "Add raw Data" "yonomitt/hello-world" "data/" "data/"
    ⠙ Uploading files... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--
    Uploading files (1) to "yonomitt/hello-world"...
    Upload finished successfully!
    Directory upload complete, uploaded 1 files
```

### Version data using Python

=== "Mac, Linux, Windows"

```python
import dagshub

repo = dagshub.upload.Repo("yonomitt", "hello-world")
repo.upload(local_path="data/", remote_path="data/",
            commit_message="Added Raw Data", versioning="dvc")
```

As a result of this action, the data directory is uploaded to DagsHub Storage, versioned by DVC, and the new `data.dvc` pointer file is committed to Git.
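For reference, a `.dvc` pointer file is a small YAML stub that records the directory's content hash rather than the data itself, which is why it is cheap to track with Git. The hash, size, and file count below are illustrative placeholders, not this project's actual values:

```yaml
# data.dvc - illustrative contents; values are placeholders
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  size: 1048576
  nfiles: 1
  path: data
```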

!!! Note
    To see the DVC-tracked files in your DagsHub repository, you will need to push the `<filename>.dvc` file, which is tracked by Git, to the remote repository. We will do this in the next step.

Now, we'd like to pull the latest commit from DagsHub. We'll do this with Git:

=== "Mac, Linux, Windows"

```bash
git pull
```

## Version metadata

When working with Data Engine, we can add metadata to our dataset. The dataset can then be queried by different versions of this metadata. For example:

```python
# `ds` is a Data Engine datasource; `Field` is part of the
# Data Engine query API.

# Adding metadata:
with ds.metadata_context() as ctx:
    metadata = {
        "is_dog": True,
    }

    ctx.update_metadata("datapoint1.png", metadata)

# Querying by the metadata's value at time t:
q = ds[Field("is_dog", as_of=t)] == True
```

For more on this, see the versioning metadata documentation.
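To build intuition for an `as_of` query, here is a toy, pure-Python model of time-versioned metadata. This is our own illustration, not the Data Engine implementation: each field keeps a history of `(timestamp, value)` records, and an "as of `t`" lookup returns the latest value recorded at or before `t`.

```python
import bisect

class VersionedField:
    """Toy model of a metadata field whose value history can be
    queried as of any point in time (illustrative only)."""

    def __init__(self):
        self._times = []   # sorted timestamps
        self._values = []  # value recorded at each timestamp

    def set(self, t, value):
        """Record a new value for this field at time t."""
        i = bisect.bisect_left(self._times, t)
        self._times.insert(i, t)
        self._values.insert(i, value)

    def as_of(self, t):
        """Return the latest value recorded at or before t,
        or None if no value existed yet at that time."""
        i = bisect.bisect_right(self._times, t)
        if i == 0:
            return None
        return self._values[i - 1]

f = VersionedField()
f.set(1, False)
f.set(5, True)
print(f.as_of(3), f.as_of(7))  # False True
```

The key property is that older queries keep returning the value that was current at that time, even after the metadata is updated later.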

## Version code

Let's check the current status of the project.

=== "Mac, Linux, Windows"
    ```bash
    git status -s
    ?? requirements.txt
    ?? src/
    ```

As we can see from the output above, all the remaining files are either code or configuration files. Thus, we will track them using Git.

- Track the files with Git:

=== "Mac, Linux, Windows"
    ```bash
    git add requirements.txt src/
    git commit -m "Add requirements and src to Git tracking"
    git push
    ```

## Results

After completing these steps, our repository will look like this:

- The main repository page: [![repo-stat-after-push](assets/5-repo-stat-after-push.png){: style="padding-top:0.7em"}](assets/5-repo-stat-after-push.png){target=_blank}

!!! Note
    The DVC-tracked files are marked with a blue background and have a DVC tag.

- The data directory:

    [![data-dir-after-push](assets/6-data-dir-after-push.png){: style="padding-top:0.7em"}](assets/6-data-dir-after-push.png){target=_blank}

- The data file itself:

    [![content-of-enron-file](assets/7-content-of-enron-file.png){: style="padding-top:0.7em"}](assets/7-content-of-enron-file.png){target=_blank}

    !!! Note "DagsHub's Data Catalog"

      As we can see in the image above, DagsHub displays the content of files (e.g., CSV, YAML, images)
      tracked by both Git and DVC. In this case, the CSV file, tracked by DVC, is displayed in a table that
      you can filter and compare across different commits.
    
See the project on DagsHub
