Versioning data and code is a critical component of data science projects that ensures the reproducibility of ML experiments. It provides traceability and enables collaboration among team members with ease.
In this section, we'll learn how to version code using Git, data using DVC, and metadata using Data Engine, and how to host them all on DagsHub.
??? illustration "Video for this tutorial"
    Prefer to follow along with a video instead of reading? Check out the video for this section below:

    <center>
    <iframe width="400" height="225" src="https://www.youtube.com/embed/haLftyGzH3g" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    </center>
We'll start by creating a new project on DagsHub and configuring it with our local machine.
In this tutorial, we will work with the email classifier project, where we build a Random Forest regressor to detect spam emails.
??? Info "Project overview"
    This project is a simple 'Ham or Spam' classifier for emails using the Enron data set with the following structure:

    ```bash
    tree -I <venv-name>
    .
    ├── data
    │   └── enron.csv
    ├── requirements.txt
    └── src
        ├── const.py
        ├── data_preprocessing.py
        └── modeling.py

    2 directories, 5 files
    ```

    - <u>src directory</u> - Holds the data-preprocessing, modeling and const files:
        - `data_preprocessing.py` - Processes the raw data, splits it into train and test sets, and saves it to the data directory.
        - `modeling.py` - Simple Random Forest Regressor.
        - `const.py` - Holds the constants of the project.
    - <u>data directory</u> - Contains the raw data - `enron.csv`.
    - `requirements.txt` - Python dependencies that are required to run the Python files.
To begin, we will use the `dvc get` command to download the project's files (code, data, and dependencies) to our local directory.

??? Info "dvc get"
    The `dvc get` command downloads files from a Git repository or DVC storage without tracking them.
Run the following commands from your CLI:

=== "Mac, Linux, Windows"
    ```bash
    pip install dvc
    dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
    dvc get https://dagshub.com/nirbarazida/hello-world-files src
    dvc get https://dagshub.com/nirbarazida/hello-world-files data/
    ```
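
If you prefer to script this step, the same `dvc get` calls can be issued from Python. The following is a minimal sketch that assumes `dvc` is already installed and available on your `PATH`. The helper names `build_dvc_get_commands` and `download_project_files` are hypothetical and not part of DVC.

```python
import subprocess

# Repository URL and paths are the ones used in the CLI commands above.
REPO_URL = "https://dagshub.com/nirbarazida/hello-world-files"
PATHS = ["requirements.txt", "src", "data/"]


def build_dvc_get_commands(repo_url, paths):
    """Return one `dvc get <repo_url> <path>` argument list per path."""
    return [["dvc", "get", repo_url, path] for path in paths]


def download_project_files(repo_url=REPO_URL, paths=PATHS):
    """Run each `dvc get` command in order; raise if any of them fails."""
    for command in build_dvc_get_commands(repo_url, paths):
        subprocess.run(command, check=True)
```

Calling `download_project_files()` from the project's root directory would then fetch the three paths in turn. Passing the arguments as a list avoids shell quoting issues.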
Next, install the project's Python dependencies from your CLI:

=== "Mac, Linux, Windows"
    ```bash
    pip install -r requirements.txt
    ```
At this point, we're ready to version the files to mark the project's starting point. We will use Git to version the lightweight files and DVC to version the larger ones, such as data and models.
To version our data directory, we'll use DagsHub Client, which leverages DVC under the hood. Its edge over vanilla DVC is that it uses the DagsHub backend to calculate the DVC hashes, so nothing has to be computed locally and the entire folder content does not need to be present on your machine.
With DagsHub Client, you can version your data from the CLI or your Python scripts to enable integrations into your project pipeline.
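
For intuition about the hashes mentioned above: DVC identifies each version of a file by a hash of its content, so identical bytes share one identifier and any change yields a new one. The sketch below computes an MD5 digest with Python's standard library. It only illustrates the idea; `file_md5` is a hypothetical helper, and the code does not reproduce how DVC or the DagsHub backend compute or store hashes.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `file_md5("data/enron.csv")` would return the same digest for as long as the file's content stays unchanged.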
We will start by installing DagsHub Client using `pip`:

=== "Mac, Linux, Windows"
    ```bash
    pip install dagshub
    ```
=== "Mac, Linux, Windows"
    ```bash
    dagshub upload --update --message "Add raw Data" "yonomitt/hello-world" "data/" "data/"
    ⠙ Uploading files... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
    Uploading files (1) to "yonomitt/hello-world"...
    Upload finished successfully!
    Directory upload complete, uploaded 1 files
    ```
=== "Mac, Linux, Windows"
    ```Python
    from dagshub.upload import Repo

    # Upload the local data/ directory and let DagsHub version it with DVC
    repo = Repo("yonomitt", "hello-world")
    repo.upload(local_path="data/", remote_path="data/",
                commit_message="Added Raw Data", versioning="dvc")
    ```
By conducting this action, the data directory is uploaded to DagsHub Storage, versioned by DVC, and the new `data.dvc` pointer file is committed to Git.

!!! Note
    To see the DVC tracked files in your DagsHub repository, you will need to push the `<filename>.dvc` file, which is tracked by Git, to the remote repository. We will do this in the next step.
Now, we'd like to pull the latest commit from DagsHub. We will do it with Git.
=== "Mac, Linux, Windows"
    ```bash
    git pull
    ```
When working with Data Engine, we can add metadata to our dataset. The dataset can then be queried by different versions of this metadata. For example:

```Python
# `ds` is an existing datasource and `t` a past point in time; both are defined elsewhere.

# Adding metadata:
with ds.metadata_context() as ctx:
    metadata = {
        "is_dog": True,
    }
    ctx.update_metadata("datapoint1.png", metadata)

# Querying by metadata at time t:
q = ds[Field("is_dog", as_of=t)] == True
```

For more on that, see versioning metadata.
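
To make the "as of time t" idea concrete, here is a small, self-contained sketch of a versioned metadata store. It is a toy model written for illustration, not Data Engine's implementation or API: the `VersionedMetadata` class and its methods are hypothetical, and timestamps are plain integers.

```python
from bisect import bisect_right


class VersionedMetadata:
    """Toy store that keeps every value a metadata field has had over time."""

    def __init__(self):
        # Each (datapoint, field) key maps to parallel lists sorted by timestamp.
        self._times = {}
        self._values = {}

    def update(self, datapoint, field, value, timestamp):
        """Record a new value without discarding earlier ones."""
        key = (datapoint, field)
        times = self._times.setdefault(key, [])
        values = self._values.setdefault(key, [])
        index = bisect_right(times, timestamp)
        times.insert(index, timestamp)
        values.insert(index, value)

    def get(self, datapoint, field, as_of):
        """Return the latest value recorded at or before `as_of`, else None."""
        key = (datapoint, field)
        index = bisect_right(self._times.get(key, []), as_of)
        return self._values[key][index - 1] if index else None


store = VersionedMetadata()
store.update("datapoint1.png", "is_dog", True, timestamp=10)
store.update("datapoint1.png", "is_dog", False, timestamp=20)
print(store.get("datapoint1.png", "is_dog", as_of=15))  # prints True
```

Because old values are never overwritten, a query for an earlier time still sees the metadata as it was then, which is the property that makes metadata versions reproducible.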
Let's check the current status of the project.
=== "Mac, Linux, Windows"
    ```bash
    git status -s
    ?? requirements.txt
    ?? src/
    ```
As we can see from the above, all the remaining files are either code or configuration files.
Thus, we will track them using Git.
=== "Mac, Linux, Windows"
    ```bash
    git add requirements.txt src/
    git commit -m "Add requirements and src to Git tracking"
    git push
    ```
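
In a scripted workflow, the untracked entries reported by `git status -s` (the lines starting with `??`) can be collected programmatically before staging them. The sketch below parses the short-format output as text. `untracked_paths` is a hypothetical helper, and it assumes simple paths, since Git quotes paths that contain special characters.

```python
def untracked_paths(status_output):
    """Return the paths that `git status -s` marks as untracked (prefix `??`)."""
    paths = []
    for line in status_output.splitlines():
        # Short format is a two-character status code, a space, then the path.
        if line.startswith("?? "):
            paths.append(line[3:])
    return paths


sample = "?? requirements.txt\n?? src/\n"
print(untracked_paths(sample))  # prints ['requirements.txt', 'src/']
```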
After completing these steps, our repository will look like this:

!!! Note
    The DVC tracked files are marked with a blue background and have a DVC tag.

The data directory:

[Screenshot: the data directory after the push](assets/6-data-dir-after-push.png){target=_blank}

The data file itself:

[Screenshot: content of the `enron.csv` file](assets/7-content-of-enron-file.png){target=_blank}

!!! Note "DagsHub's Data Catalog"
    As the screenshot above shows, DagsHub displays the content of files (e.g. CSV, YAML, images) tracked by both Git and DVC. In this case, the CSV file, tracked by DVC, is displayed in a table that you can filter and compare to different commits.