Tutorial Overview

This tutorial covers the basics of using DVC and DAGsHub to create version-controlled data science projects and to track experiments with Git.

You will learn about data exploration, data and model versioning, tracking experiment parameters and metrics, and comparing experiments.

Detecting questions about Machine Learning

In this tutorial, we'll create a model to predict whether a question on the Cross Validated Stack Exchange concerns Machine Learning or not.

This kind of prediction can be useful, for example, for recommending that a user add the machine-learning tag to their question, which can make it more likely that they will get an answer.

This task is simple and clean enough for a tutorial but leaves room for experimentation with feature engineering, data enrichment, and model selection.
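To make the prediction target concrete, here is a minimal sketch of how a binary label could be derived from a question's tags. It assumes tags arrive as an angle-bracket string such as `<regression><machine-learning>`; the dataset used in this tutorial may store tags or labels differently. The function names and the sample rows are hypothetical and exist only for illustration.

```python
import re

ML_TAG = "machine-learning"  # the tag named in the tutorial text


def parse_tags(raw_tags: str) -> list[str]:
    """Split a tag string such as '<regression><machine-learning>' into tag names.

    The angle-bracket format is an assumption, not confirmed by the tutorial.
    """
    return re.findall(r"<([^<>]+)>", raw_tags)


def is_ml_question(raw_tags: str) -> int:
    """Return 1 if the question carries the machine-learning tag, else 0."""
    return int(ML_TAG in parse_tags(raw_tags))


# Hypothetical rows, made up for illustration only.
questions = [
    {"title": "How do I choose k in k-fold cross-validation?",
     "tags": "<machine-learning><cross-validation>"},
    {"title": "Interpreting a p-value from a t-test",
     "tags": "<hypothesis-testing><t-test>"},
]

labels = [is_ml_question(q["tags"]) for q in questions]
print(labels)  # → [1, 0]
```

Matching on whole tag names rather than substrings keeps a tag like `machine-learning-ops` from being counted as a positive example. A model trained for this task would then try to predict this label from the question text alone.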

The tutorial is divided into several "levels", each of which demonstrates another workflow improvement. It's designed so that you learn something useful at each "level", even if you choose to stop early because the next level is less to your liking.

The levels are:

  1. Data Exploration - Getting the data and trying to understand it, also known as exploratory data analysis (EDA).
  2. Setup - Creating a DAGsHub account and project.
  3. Data Versioning - Using DVC to keep track of data and model versions.
  4. Experimentation - Logging hyperparameters and metrics to DAGsHub to keep track of and compare different experiments.

[Screenshot: Delicious statistics 😋 (source: Cross Validated)]

Too slow for you?

Here is a link to the complete code repo. You can go over it or use the code as you wish.

The tutorial will guide you, step by step, through creating this repo.