In the previous part of the Get Started section, we created and configured a DAGsHub repository. In this part, we will download and add a project to our local directory, track the files using DVC and Git, and push the files to the remotes.
??? Example "Start From This Part"
To start the project from this part, please follow the instructions below.
- Fork the [hello-world](https://dagshub.com/nirbarazida/hello-world) repository.
- Clone the repository and work on the start-version-project branch using the following command:
```bash
git clone -b start-version-project https://dagshub.com/<DAGsHub-user-name>/hello-world.git
```
- Create and activate a virtual environment.
- Install and initialize DVC.
- Configure DVC locally and set DAGsHub storage as the remote.
??? checkpoint "Checkpoint"
Check that the current DVC configuration matches the following:
=== "Mac-os, Linux"
```bash
cat .dvc/config.local
['remote "origin"']
url = https://dagshub.com/<DAGsHub-user-name>/hello-world.dvc
auth = basic
user = <DAGsHub-user-name>
ask_password = true
```
=== "Windows"
```bash
type .dvc/config.local
['remote "origin"']
url = https://dagshub.com/<DAGsHub-user-name>/hello-world.dvc
auth = basic
user = <DAGsHub-user-name>
ask_password = true
```
!!! Important
To avoid conflicts, **work on the start-version-project branch** for the rest of the tutorial.
At this point, we want to add the required files for our ML project to the local directory. We will use the `dvc get` command, which downloads files from a Git repository or DVC storage without tracking them.
Run the following commands from your CLI:
=== "Mac-os, Linux, Windows"
```bash
dvc get https://dagshub.com/nirbarazida/hello-world-files requirements.txt
dvc get https://dagshub.com/nirbarazida/hello-world-files src
dvc get https://dagshub.com/nirbarazida/hello-world-files data/
```
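If you prefer to script the download, the three commands above can be generated in a loop. The following is a minimal sketch and not part of the original project: the helper name `dvc_get_command` is hypothetical, and it only builds the argument lists for the `dvc get <url> <path>` form shown above without running them.

```python
# Hypothetical helper (not part of the hello-world project): builds the
# `dvc get <repo-url> <path>` commands used above without executing them.
REPO_URL = "https://dagshub.com/nirbarazida/hello-world-files"
PATHS = ["requirements.txt", "src", "data/"]


def dvc_get_command(repo_url, path):
    """Return the argument list for a single `dvc get` call."""
    return ["dvc", "get", repo_url, path]


commands = [dvc_get_command(REPO_URL, path) for path in PATHS]

for cmd in commands:
    # Only print each command here. To actually download the files, pass
    # `cmd` to subprocess.run(cmd, check=True); that assumes DVC is
    # installed and available on your PATH.
    print(" ".join(cmd))
```

Keeping the paths in one list makes it easy to add or remove files later without repeating the repository URL.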
This project is a simple 'Ham or Spam' classifier for emails using the Enron data set.
```bash
tree -I venv
.
├── data
│   └── enron.csv
├── requirements.txt
└── src
    ├── const.py
    ├── data_preprocessing.py
    └── modeling.py

2 directories, 5 files
```
- `data_preprocessing.py` - Processes the raw data, splits it into train and test sets, and saves the results to the data directory.
- `modeling.py` - Simple Random Forest Regressor.
- `const.py` - Holds the constants of the project.
- `enron.csv` - The raw Enron email data set.
- `requirements.txt` - Python dependencies that are required to run the Python files.

We will use the `requirements.txt` file to install the project's dependencies in the virtual environment. Make sure that the virtual environment is activated.
Run the following command from your CLI:
=== "Mac-os, Linux, Windows"
```bash
pip3 install -r requirements.txt
```
??? checkpoint "Checkpoint"
Check that the current status of your Git tracking matches the following:
=== "Mac-os, Linux, Windows"
```bash
git status -s
?? data/
?? requirements.txt
?? src/
```
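The same checkpoint can be verified from a script. The sketch below is illustrative only: `untracked_paths` is a hypothetical helper that parses the short-format output shown above, where untracked entries are prefixed with `??`. It works on a captured string, so it runs without a repository.

```python
def untracked_paths(status_output):
    """Return the paths that `git status -s` reports as untracked (`??`)."""
    paths = []
    for line in status_output.splitlines():
        if line.startswith("?? "):
            paths.append(line[3:].strip())
    return paths


# Output copied from the checkpoint above.
expected_status = """?? data/
?? requirements.txt
?? src/
"""

found = untracked_paths(expected_status)
print(found)
```

In a real repository, you could pass in the text returned by `subprocess.run(["git", "status", "-s"], capture_output=True, text=True).stdout` instead of the hard-coded string.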
At this point, we need to decide which files will be tracked by Git and which will be tracked by DVC. We will start with files tracked by DVC because this action will generate new files tracked by Git.
The data directory contains the data sets for this project, which are quite big. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.
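To make the size-based reasoning concrete, here is a minimal sketch that measures each top-level path and suggests a tracker. Everything in it is an assumption made for illustration: the 1 KB threshold, the helper names, and the stand-in file contents. Neither Git nor DVC prescribes a specific limit, so choose a threshold that suits your project.

```python
import os
import tempfile

# Assumed threshold for this illustration only; neither Git nor DVC
# prescribes a specific size limit.
SIZE_THRESHOLD_BYTES = 1024


def path_size(path):
    """Total size in bytes of a file, or of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def suggest_tracker(path, threshold=SIZE_THRESHOLD_BYTES):
    """Suggest 'dvc' for large paths and 'git' for small ones."""
    return "dvc" if path_size(path) > threshold else "git"


# Demo on a throwaway directory that mimics the project layout.
with tempfile.TemporaryDirectory() as project:
    os.makedirs(os.path.join(project, "data"))
    os.makedirs(os.path.join(project, "src"))
    with open(os.path.join(project, "data", "enron.csv"), "w") as f:
        f.write("x" * 5000)  # stand-in for a large data file
    with open(os.path.join(project, "src", "const.py"), "w") as f:
        f.write("SEED = 42\n")  # stand-in for a small code file

    suggestions = {
        name: suggest_tracker(os.path.join(project, name))
        for name in ("data", "src")
    }

print(suggestions)
```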
Add the data directory to DVC tracking
=== "Mac-os, Linux, Windows"
```bash
dvc add data
100% Add|███████████████████████████████████████████████████████████████|1/1 [00:01, 1.28s/file]

To track the changes with git, run:

    git add data.dvc .gitignore
```
As the output shows, DVC reports which files changed as a result of this action and tells us what to do with them.
Track the changes with Git
=== "Mac-os, Linux, Windows"
```bash
git add data.dvc .gitignore
git commit -m "Add the data directory to DVC tracking"
2 files changed, 6 insertions(+)
create mode 100644 data.dvc
```
Check the current status of the files in the directory
=== "Mac-os, Linux, Windows"
```bash
git status -s
?? requirements.txt
?? src/
```
As we can see from the above, all the remaining files are either code or configuration files.
Thus, we will track them using Git.
Track the files with Git
=== "Mac-os, Linux, Windows"
```bash
git add requirements.txt src/
git commit -m "Add requirements and src to Git tracking"
```
At this point, we would like to push the files tracked by Git and DVC to our DAGsHub remotes. Performing this action will make the entire project available for sharing & collaborating on DAGsHub.
=== "Mac-os, Linux, Windows"
```bash
dvc push -r origin
Enter a password for host <storage-provider> user <user-name>:
1 file pushed
```
!!! Note
You will be asked to enter the password of your user at the storage provider.
!!! Note
To see the DVC tracked files in your DAGsHub repository, you will need to push the `<filename>.dvc` file, which is tracked by Git, to the remote repository. We will do this in the next step.
=== "Mac-os, Linux, Windows"
```bash
git push
```
After completing these steps, our repository will look like this:
!!! Note
The DVC tracked files are marked with a blue background.
The data file itself:
As we can see in the image above, DAGsHub displays the content of files (e.g., CSV, YAML, images) tracked by both Git and DVC. In this case, the CSV file, tracked by DVC, is displayed as a table that you can filter and compare across commits.
Now, we would like to preprocess our data and track the results using DVC.
Let's run the `data_preprocessing.py` file from our CLI.
=== "Mac-os, Linux, Windows"
```bash
python3 src/data_preprocessing.py
[DEBUG] Preprocessing raw data
[DEBUG] Loading raw data
[DEBUG] Removing punctuation from Emails
[DEBUG] Label encoding target column
[DEBUG] Vectorizing the emails by words
[DEBUG] Splitting data to train and test
[DEBUG] Saving data to file
```
This action generated four new files of processed data in the data directory.
```bash
tree data
data
├── X_test.csv
├── X_train.csv
├── enron.csv
├── y_test.csv
└── y_train.csv

0 directories, 5 files
```
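To give a feel for what produces the four `X_*` and `y_*` outputs, here is a rough sketch of the steps named in the log above: punctuation removal, label encoding, word vectorization, and a train/test split. It is not the project's actual `data_preprocessing.py`. The sample emails, the 75/25 split ratio, and the fixed seed are assumptions made for illustration, and it uses only the standard library rather than whatever libraries the project relies on.

```python
import random
import string

# Tiny made-up sample; the real project reads data/enron.csv instead.
emails = [
    ("Win a free prize now!!!", "spam"),
    ("Meeting moved to 3pm, see you there.", "ham"),
    ("Free money, click here!", "spam"),
    ("Please review the attached report.", "ham"),
]


def remove_punctuation(text):
    """Strip ASCII punctuation and lowercase the text."""
    return text.translate(str.maketrans("", "", string.punctuation)).lower()


# Label-encode the target column: ham -> 0, spam -> 1.
label_to_id = {"ham": 0, "spam": 1}
texts = [remove_punctuation(text) for text, _ in emails]
y = [label_to_id[label] for _, label in emails]

# Vectorize by words: one count column per vocabulary word.
vocabulary = sorted({word for text in texts for word in text.split()})
X = [[text.split().count(word) for word in vocabulary] for text in texts]

# Split into train and test sets (assumed 75/25 split, fixed seed).
indices = list(range(len(X)))
random.Random(0).shuffle(indices)
split = int(len(indices) * 0.75)
train_idx, test_idx = indices[:split], indices[split:]

X_train = [X[i] for i in train_idx]
X_test = [X[i] for i in test_idx]
y_train = [y[i] for i in train_idx]
y_test = [y[i] for i in test_idx]

print(len(X_train), len(X_test), len(y_train), len(y_test))
```

Each of the four lists corresponds to one of the generated CSV files; writing them to disk is omitted to keep the sketch short.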
??? checkpoint "Checkpoint"
Check that the current status of your Git and DVC tracking matches the following:
=== "Mac-os, Linux, Windows"
```bash
git status
On branch <branch-name>
Your branch is up to date with 'origin/<branch-name>'.
nothing to commit, working tree clean

dvc status
data.dvc:
    changed outs:
        modified:           data
```
```
Nothing was changed in Git tracking because the data directory is being tracked by DVC.
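Git stays clean because it only tracks the small `data.dvc` file, while DVC compares the directory's current content with the hash recorded in that file. The sketch below illustrates the idea with a simplified fingerprint function. It is a stand-in written for this explanation, not DVC's actual hashing scheme, and the file contents are made up.

```python
import hashlib
import os
import tempfile


def directory_fingerprint(directory):
    """Hash the relative path and content of every file under `directory`.

    Simplified stand-in for illustration; DVC's real hashing differs.
    """
    digest = hashlib.md5()
    for root, dirs, files in os.walk(directory):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            path = os.path.join(root, name)
            digest.update(os.path.relpath(path, directory).encode())
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as data_dir:
    with open(os.path.join(data_dir, "enron.csv"), "w") as f:
        f.write("text,label\n")
    # Plays the role of the hash stored in data.dvc after `dvc add`.
    recorded = directory_fingerprint(data_dir)

    # Preprocessing adds new files to the directory.
    with open(os.path.join(data_dir, "X_train.csv"), "w") as f:
        f.write("0,1\n")
    current = directory_fingerprint(data_dir)

    # Recomputing without further changes gives the same fingerprint.
    unchanged = directory_fingerprint(data_dir) == current

print("modified" if current != recorded else "up to date")
```

Running `dvc add data` again, as in the next step, records the new hash so that the tracked state matches the directory once more.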
Let's version the new status of the data directory with DVC:
=== "Mac-os, Linux, Windows"
```bash
dvc add data
100% Add|███████████████████████████████████████████████████████|1/1 [00:01, 1.18s/file]

To track the changes with git, run:

    git add data.dvc

git add data.dvc
git commit -m "Process raw-data and save it to data directory"
```
Push our changes to the remote
=== "Mac-os, Linux, Windows"
```bash
dvc push -r origin
Enter a password for host <storage provider> user <username>:
5 files pushed

git push
```
Let's see the new status of the data directory in DAGsHub.
In this section, we covered the basic workflow of DVC and Git. We added our project files to the repository and tracked them using Git and DVC. We generated preprocessed data files and learned how to add these changes to DVC as well. In the next sections, we will learn how to: