Explore a New Hypothesis

In the previous part, we learned how to track the project's files using Git and DVC, and track the experiments using DagsHub. This part covers the common practice of exploring a new hypothesis. We will learn how to examine a new approach to processing the data, compare the results, and save the project's best result.

Start From This Part

To start the project from this part, please follow the instructions below.

  • Fork the hello-world repository.
  • Clone the master branch from the repository using the following command (change the user name):
    git clone -b master https://dagshub.com/<DagsHub-user-name>/hello-world.git && cd hello-world
    
  • Create and activate a virtual environment.
  • Install the python dependencies:
    pip3 install -r requirements.txt
    pip3 install dvc
    pip3 install dagshub
    
  • Configure DVC locally and set DagsHub storage as the remote.
  • Download the files using the following command:
    dvc get --rev processed-data https://dagshub.com/nirbarazida/hello-world-files data/
    
  • Track the data directory using DVC and the data.dvc file using Git.
  • Push the files to Git and DVC remotes.

Basic Theory

The Data Science field is research-driven and exploring different solutions to a problem is a core principle. When a project evolves or grows in complexity, we need to compare results and see what approaches are more promising than others. In this process, we need to make sure we don't lose track of the project's components or miss any information. Therefore, it is useful to have a well-defined workflow.

The common workflow for exploring a new approach is to create a new branch for it. In the branch, we will change the code, modify the data and models, and track them using Git and DVC. Then, we will compare the new model's performance with the current model's. This comparison can be a hassle without the proper tools to track and visualize the results. Let's see how we can use DagsHub to overcome these challenges.

We will log the models' performances in a readable format and commit them to DagsHub. Using the Experiment Tab, we will easily compare the results and determine whether the new approach was effective. Based on our conclusions, we will either merge the code, data, and models into the main branch, or return to the main branch and retrieve the data and models from the remote storage before moving on to the next experiment.
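As a minimal sketch of what "logging performances to a readable format" can look like, the snippet below writes one metric row to a metrics.csv file. The column layout here is an assumption for illustration; in this tutorial, metrics.csv is produced by the DagsHub logger.

```python
import csv
import time

# Hypothetical sketch: record one metric in a plain CSV file.
# The column names below are illustrative, not the exact schema
# the DagsHub logger produces.
with open("metrics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Name", "Value", "Timestamp", "Step"])
    writer.writerow(["roc_auc", 0.927, int(time.time() * 1000), 1])
```

Because the result is plain text, Git can diff it and DagsHub can render it in the Experiment Tab.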

Create a New Branch

As mentioned in the previous sections, we are using the Enron data set, which contains emails. The emails are stored in a CSV file and labeled as 'Ham' or 'Spam'. The current data processing method lower-cases the characters and removes the punctuation from each email. We will try to reduce the processing time by not removing punctuation and see how this affects the model's performance.
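To make the difference between the two cleaning steps concrete, here is a small sketch (the sample email text is invented, and newlines are replaced with spaces here for readability):

```python
import string

email = "Hello,\nWin a FREE prize!!!"

# Current method: lower-case, drop newlines, and strip punctuation
current = email.lower().replace('\n', ' ').translate(
    str.maketrans('', '', string.punctuation))

# New hypothesis: lower-case and drop newlines, but keep punctuation
proposed = email.lower().replace('\n', ' ')

print(current)   # hello win a free prize
print(proposed)  # hello, win a free prize!!!
```

The proposed method skips the per-character translation pass, which is where the hoped-for speedup would come from.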

  • We will first create a new branch, name it after the hypothesis, and switch to it.

    git checkout -b data-with-punctuations
        Switched to a new branch 'data-with-punctuations'
    

Update the Processing Method

  • Change the code in the data_preprocessing.py module to the following:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import train_test_split
    from const import *
    
    print(M_PRO_INIT, '\n' + M_PRO_LOAD_DATA)
    data = pd.read_csv(RAW_DATA_PATH)
    
    print(M_PRO_RMV_PUNC)
    clean_text = data[TEXT_COL_NAME].map(lambda x: x.lower().replace('\n', ''))
    
    print(M_PRO_LE)
    y = data[TARGET_COL].map({CLASS_0: 0, CLASS_1: 1})
    
    print(M_PRO_VEC)
    # every column is 1-2 words and the value is the number of appearance in Email
    email_text_list = clean_text.tolist()
    vectorizer = CountVectorizer(encoding='utf-8', decode_error='ignore', stop_words='english',
                                 analyzer='word', ngram_range=(1, 2), max_features=500)
    X_sparse = vectorizer.fit_transform(email_text_list)
    # get_feature_names() was removed in newer scikit-learn versions
    X = pd.DataFrame(X_sparse.toarray(), columns=vectorizer.get_feature_names_out())
    
    print(M_PRO_SPLIT_DATA)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    
    print(M_PRO_SAVE_DATA)
    X_train.to_csv(X_TRAIN_PATH, index=False)
    X_test.to_csv(X_TEST_PATH, index=False)
    y_train.to_csv(Y_TRAIN_PATH, index=False)
    y_test.to_csv(Y_TEST_PATH, index=False)
    

  • Track and commit the changes with Git.

    git add src/data_preprocessing.py
    git commit -m "Change data processing method - will not remove the string's punctuations"
    
  • Process the data using the new method by running the data_preprocessing.py script.

    python3 src/data_preprocessing.py
        [DEBUG] Preprocessing raw data
             [DEBUG] Loading raw data
             [DEBUG] Removing punctuation from Emails
             [DEBUG] Label encoding target column
             [DEBUG] vectorizing the emails by words
             [DEBUG] Splitting data to train and test
             [DEBUG] Saving data to file
    
Checkpoint

Check that the current status of your Git and DVC tracking matches the following.

git status
    On branch data-with-punctuations
    nothing to commit, working tree clean
dvc status
    data.dvc:
        changed outs:
            modified:           data

As we can see from the above, only the DVC-tracked files were updated.

  • Track and commit the changes using DVC and Git.

    dvc add data
        To track the changes with git, run:
            git add data.dvc
    git add data.dvc
    git commit -m "Processed the data and tracked it using DVC and Git"
    
  • Push the code and data changes to the remotes.

    git push origin data-with-punctuations
    dvc push -r origin
        Enter a password for host <storage provider> user <username>:
        4 files pushed
    

Run a New Experiment and Compare the Results

We have everything set to run our second Data Science Experiment! We will train a new model and log its performance using the DagsHub logger. Then, we will push the updated metrics.csv file to DagsHub and easily compare the results.

  • We will start by running the modeling.py script.

    python3 src/modeling.py
        [DEBUG] Initialize Modeling
             [DEBUG] Loading data sets for modeling
             [DEBUG] Runing Random Forest Classifier
             [INFO] Finished modeling with AUC Score: 0.927
    
Checkpoint

Check that the current status of your Git and DVC tracking matches the following.

git status -s
    M metrics.csv
dvc status
    Data and pipelines are up to date.

Because we didn't change the model's hyperparameters from the previous part, only the metrics.csv file was modified.

  • We will track the changes using Git and push them to the DagsHub repository.

    git add metrics.csv
    git commit -m "Punctuations experiment results - update metrics.csv file"
    git push origin data-with-punctuations
    
  • This is where the magic happens - with DagsHub, we can easily compare model performance between the two experiments. We will simply open the Experiment Tab in the DagsHub repository and compare the models' ROC AUC scores:

    As we can see in the image above, the new data processing method didn't provide better results; hence, we will not use it.

Retrieve Files

Our experiment resulted in worse performance, and we want to retrieve the previous version. Now we can reap the benefits of our workflow. The best version of the project is always stored on the main branch. When an experiment concludes without sufficient improvement, we simply check out the version we want (in this case, the master branch) and pull the remote storage files based on the .dvc pointers.

  • Check out the master branch using Git and pull the data files from the remote storage using DVC:

    git checkout master
        Switched to branch 'master'
        Your branch is up to date with 'origin/master'.
    dvc checkout
        M       data/
    

Congratulations - You made it to the finish line!

In the Get Started section, we covered the fundamentals of DagsHub usage. We started by creating a repository and configuring Git and DVC. Then, we added project files to the repository using Git (for code and configuration files) and DVC (for data). We created our very first data science experiment, using the DagsHub logger to log metrics and parameters. Finally, we learned how to explore new approaches and retrieve another version's files.

We hope that this tutorial was helpful and made the onboarding process easier for you. If you find an issue in the docs, please let us know or, better yet, help us fix it. If you have any questions, feel free to join our Discord channel and ask there. We can't wait to see what remarkable project you will create and share with the Data Science community!

Here for any help you need,
Team DagsHub.