In the previous part, we learned how to track the project's files using Git and DVC, and how to track experiments using DagsHub. This part covers the common practice of exploring a new hypothesis. We will learn how to examine a new approach to processing the data, compare the results, and save the project's best result.
!!! illustration "Video for this tutorial"
    Prefer to follow along with a video instead of reading? Check out the video for this section below:

    <center>
    <iframe width="400" height="225" src="https://www.youtube.com/embed/iiKbwQVkNl8" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
    </center>
??? Example "Start From This Part"
    To start the project from this part, please follow the instructions below.
- Fork the [hello-world](https://dagshub.com/nirbarazida/hello-world){target=_blank} repository.
- Clone the master branch from the repository using the following command (replace `<DagsHub-user-name>` with your user name):<br/>
```bash
git clone -b master https://dagshub.com/<DagsHub-user-name>/hello-world.git && cd hello-world
```
- Create and activate a virtual environment.
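For example, using Python's built-in `venv` module (any virtual environment tool will do):
```bash
python3 -m venv .venv
source .venv/bin/activate    # on Windows: .venv\Scripts\activate
```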
- Install the Python dependencies:
```bash
pip3 install -r requirements.txt
pip3 install dvc
pip3 install dagshub
```
- Configure DVC locally and set DagsHub storage as the remote.
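A sketch of what this setup typically looks like for a DagsHub repository; the remote URL and the token below are placeholders, and the exact values are shown on your repository page:
```bash
dvc remote add origin https://dagshub.com/<DagsHub-user-name>/hello-world.dvc
dvc remote modify origin --local auth basic
dvc remote modify origin --local user <DagsHub-user-name>
dvc remote modify origin --local password <your-DagsHub-token>
```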
- Download the files using the following command:<br/>
```bash
dvc get --rev processed-data https://dagshub.com/nirbarazida/hello-world-files data/
```
- Track the data directory using DVC and the `data.dvc` file using Git.
- Push the files to Git and DVC remotes.
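These last two steps might look like the following sketch (the commit message is illustrative; note that `dvc add` also updates `.gitignore`):
```bash
dvc add data
git add data.dvc .gitignore
git commit -m "Track the data directory with DVC"
git push origin master
dvc push -r origin
```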
The Data Science field is research-driven, and exploring different solutions to a problem is a core principle. When a project evolves or grows in complexity, we need to compare results and see which approaches are more promising than others. In this process, we need to make sure we don't lose track of the project's components or miss any information. Therefore, it is useful to have a well-defined workflow.
The common workflow for exploring a new approach is to create a new branch for it. In the branch, we will change the code, modify the data and models, and track them using Git and DVC. Then, we will compare the new model's performance with that of the current model. This comparison can be a hassle without the proper tools to track and visualize the results. Let's see how we can use DagsHub to overcome these challenges.
We will log the models' performances in readable formats and commit them to DagsHub. Using the Experiment Tab, we will easily compare the results and determine whether the new approach was effective. Based on our conclusions, we will either merge the code, data, and models into the main branch, or return to the main branch and retrieve the data and models from the remote storage to continue to the next experiment.
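For reference, if an experiment does turn out to be an improvement, the merge path is plain Git plus a DVC checkout. A minimal sketch, using the experiment branch created later in this part:
```bash
git checkout master
git merge data-with-punctuations
dvc checkout   # update the workspace data to match the merged .dvc pointers
```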
As mentioned in the previous sections, we are using the Enron data set, which contains emails stored in a CSV file and labeled as 'Ham' or 'Spam'. The current data processing method lower-cases the characters and removes the punctuation from each email. We will try to reduce the processing time by skipping the punctuation removal and see how it affects the model's performance.
We will first create a new branch, name it after the hypothesis, and switch to it.
=== "Mac, Linux, Windows"
```bash
git checkout -b data-with-punctuations
Switched to a new branch 'data-with-punctuations'
```
Change the `data_preprocessing.py` module to the following:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from const import *
import string
print(M_PRO_INIT, '\n' + M_PRO_LOAD_DATA)
data = pd.read_csv(RAW_DATA_PATH)
print(M_PRO_RMV_PUNC)
clean_text = data[TEXT_COL_NAME].map(lambda x: x.lower().replace('\n', ''))  # new method: lower-case and strip newlines only; punctuation is kept
print(M_PRO_LE)
y = data[TARGET_COL].map({CLASS_0: 0, CLASS_1: 1})
print(M_PRO_VEC)
# each column is a 1-2 word n-gram; the value counts its appearances in the email
email_text_list = clean_text.tolist()
vectorizer = CountVectorizer(encoding='utf-8', decode_error='ignore', stop_words='english',
analyzer='word', ngram_range=(1, 2), max_features=500)
X_sparse = vectorizer.fit_transform(email_text_list)
X = pd.DataFrame(X_sparse.toarray(), columns=vectorizer.get_feature_names())
print(M_PRO_SPLIT_DATA)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print(M_PRO_SAVE_DATA)
X_train.to_csv(X_TRAIN_PATH, index=False)
X_test.to_csv(X_TEST_PATH, index=False)
y_train.to_csv(Y_TRAIN_PATH, index=False)
y_test.to_csv(Y_TEST_PATH, index=False)
```

Track and commit the changes with Git.
=== "Mac, Linux, Windows"
```bash
git add src/data_preprocessing.py
git commit -m "Change data processing method - will not remove the string's punctuations"
```
Process the data using the new method by running the `data_preprocessing.py` script.
=== "Mac, Linux, Windows"
```bash
python3 src/data_preprocessing.py
[DEBUG] Preprocessing raw data
[DEBUG] Loading raw data
[DEBUG] Removing punctuation from Emails
[DEBUG] Label encoding target column
[DEBUG] vectorizing the emails by words
[DEBUG] Splitting data to train and test
[DEBUG] Saving data to file
```
??? checkpoint "Checkpoint"
Check that the current status of your Git and DVC tracking matches the following:
=== "Mac, Linux, Windows"
```bash
git status
On branch data-with-punctuations
nothing to commit, working tree clean
dvc status
data.dvc:
changed outs:
modified: data
```
As we can see from the above, only the DVC-tracked files were updated.
Track and commit the changes using DVC and Git.
=== "Mac, Linux, Windows"
```bash
dvc add data
To track the changes with git, run:

        git add data.dvc
git add data.dvc
git commit -m "Processed the data and tracked it using DVC and Git"
```
Push the code and data changes to the remotes.
=== "Mac, Linux, Windows"
```bash
git push origin data-with-punctuations
dvc push -r origin
Enter a password for host <storage provider> user <username>:
4 files pushed
```
We have everything set to run our second data science experiment! We will train a new model and log its performance using the DagsHub logger. Then, we will push the updated `metrics.csv` file to DagsHub and easily compare the results.
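For reference, the logging inside `src/modeling.py` follows the DagsHub logger pattern, which by default writes `metrics.csv` (and `params.yml`) in the repository root. A minimal sketch with illustrative values, not the script's exact code:
```python
from dagshub import dagshub_logger

# hyperparameters end up in params.yml, metrics in metrics.csv
with dagshub_logger() as logger:
    logger.log_hyperparams(model_class="RandomForestClassifier")  # illustrative value
    logger.log_metrics(roc_auc=0.927)  # the score the script prints below
```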
We will start by running the `modeling.py` script.
=== "Mac, Linux, Windows"
```bash
python3 src/modeling.py
[DEBUG] Initialize Modeling
[DEBUG] Loading data sets for modeling
[DEBUG] Runing Random Forest Classifier
[INFO] Finished modeling with AUC Score: 0.927
```
??? checkpoint "Checkpoint"
Check that the current status of your Git and DVC tracking matches the following:
=== "Mac, Linux, Windows"
```bash
git status -s
M metrics.csv
dvc status
Data and pipelines are up to date.
```
Because we didn't change the model's hyperparameters from the previous part, only the `metrics.csv` file was modified.
We will track the changes using Git and push them to the DagsHub repository.
=== "Mac, Linux, Windows"
```bash
git add metrics.csv
git commit -m "Punctuations experiment results - update metrics.csv file"
git push origin data-with-punctuations
```
This is where the magic happens - with DagsHub, we can easily compare the model's performance between the two experiments. We will simply open the Experiment Tab in the DagsHub repository and compare the models' ROC AUC scores.
Our experiment resulted in worse performance, and we want to retrieve the previous version.
Now, we can reap the benefits of our workflow. The best version of the project is always stored on the main branch.
When an experiment concludes without sufficient improvement, we simply need to check out the version we want, in this case the master branch, and pull the files from the remote storage based on the `.dvc` pointers.
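To make the mechanism concrete: a `.dvc` file is just a small YAML pointer that Git versions in place of the data itself, roughly of this shape (the hash below is illustrative):
```yaml
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir   # identifies the data version in remote storage
  path: data
```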
Check out the master branch using Git and pull the data files from the remote storage using DVC:
=== "Mac, Linux, Windows"
```bash
git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
dvc checkout
M       data/
```
Congratulations - You made it to the finish line!
In the Get Started section, we covered the fundamentals of DagsHub usage. We started by creating a repository and configuring Git and DVC. Then, we added project files to the repository using Git (for code and configuration files) and DVC (for data). We created our very first data science experiment, using the DagsHub logger to log metrics and parameters. Finally, we learned how to explore new approaches and retrieve another version's files.
We hope that this tutorial was helpful and made the onboarding process easier for you. If you found an issue in the docs, please let us know or, better yet, help us fix it. If you have any questions, feel free to join our Discord channel and ask there. We can't wait to see what remarkable projects you will create and share with the Data Science community!
Here for any help you need,
Team DagsHub.