Closing the data loop with DagsHub Annotations and Label Studio

Data Labeling Jan 06, 2022

TL;DR – DagsHub recently announced an integration with Label Studio. In this post, I'll show you how to apply Git Flow to the data labeling process and walk you through a seven-step getting-started tutorial for Label Studio and DagsHub Annotations.


DagsHub's latest integration with Label Studio makes the labeling process effortless. Each repository has a fully configured labeling workspace with access to all the project files. With the new commit button, you can version the state of your project using Git at any point in time, making the process fully reproducible and easy to manage. No need to move your data between platforms, suffer from synchronization issues, break the project structure, or perform tedious tasks.

A lot has been said about the importance of clean data with accurate and consistent labels. The entire data-centric paradigm relies heavily on making data labels more consistent to improve model performance. So why isn’t the data science community quickly adopting this approach? There might be many reasons, but a recurring claim is that labeling is too tedious a task, one that’s hard to iterate over, manage, and scale.

DagsHub integrates Label Studio

Labeling is such an important task, but it’s more complex than it should be. Knowing that, we decided that it's a barrier to entry DagsHub should help remove. We had many strong candidates for this integration, but the one that stood out was Label Studio - a powerful open-source tool that supports the labeling of many unstructured and structured data types with a strong and active community.

Supported data types:

  • Computer Vision - images and video
  • Audio & Speech Applications
  • NLP, Documents, Chatbots, Transcripts
  • Time series
  • Structured data – tabular, HTML, freeform
  • Multi-Domain Applications
From the official Label Studio website


Every repository on DagsHub comes with a fully configured Label Studio workspace. This workspace lets you annotate your data with access to all the project's files. It fetches data directly from DagsHub Storage, so you no longer need to move, copy, or pull it to a third-party platform. This removes a significant burden associated with labeling: managing and synchronizing data and labels.

Git Flow for Data Labeling

A labeling workflow is comparable to developing a new feature. It should be done in an isolated environment with the ability to compare, analyze, and merge changes, or to roll them back and restore previous versions. Because labeling is often outsourced, these capabilities become even more vital to its success.

Based on those needs and the challenges labelers face, DagsHub added a few toppings to Label Studio and created its unique flavor of the open-source version. It provides a Git experience, following the industry's best practices, to ensure full reproducibility, scalability, and efficient version control of the labels and data.

Git flow for labeling unstructured data

The workflow for DagsHub and Label Studio

When creating a new labeling project on DagsHub, you associate it with the tip of an active branch. That commit marks the project's starting point and makes all the files hosted on DagsHub Storage under it available for labeling. Once you reach a valuable result, you can version and commit the annotations using Git, directly to a remote branch. Once the task is complete, you can create a pull request on DagsHub, where a reviewer can see and comment on every annotation.

How to version a Label Studio project with DagsHub?

To version control any artifact, it needs to have a single source of truth. To provide this source of truth for annotations, we created the .labelstudio directory, which holds annotations for every task in open source formats. When creating a new labeling project, DagsHub parses the selected commit for this directory and loads the existing annotations to their associated tasks. This way, we can roll back to previous versions with a click of a button.
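As a rough illustration of treating that directory as the source of truth, the sketch below walks a local checkout and counts the annotations stored in each file. The internal layout of `.labelstudio` is not described in this post, so the code assumes (hypothetically) that it contains JSON files, each holding one task or a list of tasks with an `annotations` list, similar to Label Studio's JSON export. The function name `count_annotations` is made up for this example.

```python
import json
from pathlib import Path


def count_annotations(repo_root, dirname=".labelstudio"):
    """Return {relative file path: number of annotations found in that file}.

    Assumption: files under `dirname` are JSON documents holding one task
    (a dict) or several tasks (a list of dicts), each with an optional
    "annotations" list. Adjust this to the real layout of your repository.
    """
    root = Path(repo_root) / dirname
    counts = {}
    for path in sorted(root.rglob("*.json")):
        with path.open(encoding="utf-8") as fh:
            data = json.load(fh)
        # Normalize to a list so single-task and multi-task files look the same.
        tasks = data if isinstance(data, list) else [data]
        counts[path.relative_to(root).as_posix()] = sum(
            len(task.get("annotations", []))
            for task in tasks
            if isinstance(task, dict)
        )
    return counts
```

Running it on two different checkouts of the same repository gives a quick sense of how much labeling work separates two commits.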

Get started with Label Studio and DagsHub

In this section, I'll guide you step by step through using Label Studio and DagsHub Annotations while following the recommended Git Flow. The goal is to give you hands-on experience while you follow my lead. For that, I'll use my "Where's Elon" project, where I annotate images of Elon Musk. I'm assuming you already have a project on DagsHub with versioned data ready to be annotated.

Step 1: Create a Label Studio workspace

Navigate to the Annotations tab in your DagsHub repository and create a new workspace. This process can take 2-3 minutes as DagsHub spins up the Label Studio machine behind the scenes.

Step 2: Create a Label Studio project

In the new Annotation Project menu, choose the tip of a remote branch to associate the project with. It marks the project's starting point and will make all the files hosted on DagsHub Storage, under the selected commit, available for labeling. To work in an isolated environment, we will create a new branch for the labeling project. The default project name is based on the annotator who created it; however, you can change it as you wish.

Step 3: Choose the files to annotate

When launching the project for the first time, you'll need to choose the files to annotate (AKA tasks). You can choose a specific file or an entire directory by checking the box next to its name.

Note: you can annotate files hosted on both Git and DVC remotes. As a rule of thumb: "if you can see the file, you can annotate it."

Step 4: Configure Label Studio

You can configure Label Studio's labeling interface using one of its many great templates. If you need a custom template, you can create one using Label Studio's HTML-like tag syntax.

Note: If you choose to work with a template, you'll need to set the project's labels manually.
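To give a feel for what such a configuration looks like, here is a minimal sketch that generates a bounding-box config for an image project like "Where's Elon". The `View`, `Image`, `RectangleLabels`, and `Label` tags are Label Studio's own; the helper name `bounding_box_config`, the control names, and the `$image` variable (which assumes each task stores its file under an `image` key) are assumptions made for illustration and may differ in your project.

```python
import xml.etree.ElementTree as ET


def bounding_box_config(labels):
    """Build a Label Studio labeling config for drawing boxes on images."""
    view = ET.Element("View")
    # The object to annotate. "$image" assumes the task data has an "image" key.
    ET.SubElement(view, "Image", name="image", value="$image")
    # One rectangle tool attached to that image, with one <Label> per class.
    boxes = ET.SubElement(view, "RectangleLabels", name="label", toName="image")
    for label in labels:
        ET.SubElement(boxes, "Label", value=label)
    return ET.tostring(view, encoding="unicode")


# Print a config you could paste into the labeling interface's code editor.
print(bounding_box_config(["Elon"]))
```

Generating the config from a list keeps the label names in one place, which helps when the same classes are reused across several labeling projects.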

Step 5: Annotate the data

As simple as that, you can start annotating your data. No need to move the data to a different platform, change its structure or synchronize anything. You can start working on the tasks and save the annotations to DagsHub's database.

Step 6: Commit changes to Git

At any point in time, you can version the state of the project using Git, and commit the changes back to the branch you chose in step 2 or create a new branch and commit to it. The commit will include the special `.labelstudio` directory. You can add an annotations file in one of the commonly used formats (JSON, COCO, CSV, TSV, etc.) to the commit.
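If you ever need to convert between these formats yourself, the main thing to know is that Label Studio's JSON stores rectangle coordinates as percentages of the image size, while COCO expects `[x, y, width, height]` in pixels. The sketch below shows that conversion for a single result. The function name and the example values are hypothetical, and rotation is ignored.

```python
def rectangle_to_coco_bbox(result):
    """Convert one Label Studio rectangle result to a COCO [x, y, w, h] box.

    Label Studio stores x, y, width, and height as percentages (0-100) of
    the original image size; COCO uses absolute pixel values.
    """
    value = result["value"]
    img_w = result["original_width"]
    img_h = result["original_height"]
    return [
        value["x"] * img_w / 100,
        value["y"] * img_h / 100,
        value["width"] * img_w / 100,
        value["height"] * img_h / 100,
    ]


# A made-up result for an 800x600 image, shaped like Label Studio's JSON.
example = {
    "type": "rectanglelabels",
    "original_width": 800,
    "original_height": 600,
    "value": {
        "x": 25.0,
        "y": 50.0,
        "width": 10.0,
        "height": 20.0,
        "rectanglelabels": ["Elon"],
    },
}
print(rectangle_to_coco_bbox(example))  # [200.0, 300.0, 80.0, 120.0]
```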

Utilizing Git's capabilities, you can now seamlessly iterate over steps 5 and 6, compare the different versions, merge the results, or roll back the changes.
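Comparing versions does not have to happen only in the UI. As a small sketch, suppose you have already extracted a mapping from task (file) name to its set of labels for two commits; how you build those mappings depends on your annotation format and is not shown here. The hypothetical helper below reports what changed between them.

```python
def diff_labels(old, new):
    """Compare two {task name: set of labels} mappings from different commits.

    Returns only the tasks whose labels changed, with the labels that were
    added and removed between the old and the new version.
    """
    changes = {}
    for task in sorted(old.keys() | new.keys()):
        before = old.get(task, set())
        after = new.get(task, set())
        if before != after:
            changes[task] = {
                "added": sorted(after - before),
                "removed": sorted(before - after),
            }
    return changes


# Made-up file names and labels, for illustration only.
previous = {"img_001.jpg": {"Elon"}, "img_002.jpg": set()}
current = {"img_001.jpg": {"Elon"}, "img_002.jpg": {"Elon"}, "img_003.jpg": {"Elon"}}
for task, change in diff_labels(previous, current).items():
    print(task, change)
```

A summary like this can be handy in a pull request description, so the reviewer knows which tasks to look at first.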

Step 7: Create a pull request

When you're satisfied with the labels, meaning they're accurate and consistent, you can merge them into the main branch. With DagsHub, discussing the labels is part of the pull request, with no need to move to a third-party platform. The reviewer can comment on each label, and the entire process is logged and easy to manage. Once the task is complete, merging it into the project's main branch is one click away.

Summary

Labeling unstructured data comes with various challenges, many of which are a by-product of the workflow and unrelated to the annotation task itself. DagsHub Annotations and the Label Studio integration are designed to help overcome those challenges and create a smooth labeling workflow. They do the DevOps heavy lifting and provide you with the tools you need to manage and scale the labeling process.

If you have any questions, ideas, or thoughts about the integration, we'd love to hear them on our Discord channel! We can't wait to see your amazing projects become even greater with accurate and consistent labels.




Nir Barazida

MLOps Team Lead @ DagsHub
