Launching Data Engine – A toolset for rapid iteration on unstructured datasets

Data Engine Jul 24, 2023

Today, I’m absolutely thrilled to unveil Data Engine – A toolset that is built from the ground up to empower machine learning teams to handle unstructured data, iterate on it quickly and reliably, and use it to build better models for production.

TL;DR

Data Engine is a new component of DagsHub that provides machine learning teams with an end-to-end flow to effectively manage, curate, visualize, annotate, and serve unstructured datasets, using familiar interfaces and integrations with the popular tools that already come with the DagsHub platform, like MLflow, Label Studio, and DVC. Let's dive in and see how Data Engine will supercharge your data-driven projects! Check out the tutorial, or read the docs.

Data Engine – The next chapter in our mission

At DagsHub, our mission has always been to facilitate data science teamwork and foster community collaboration. We believe that when data scientists and ML engineers unite their powers, they can achieve extraordinary feats. With Data Engine, we're pushing the boundaries of data science by emphasizing the importance of data as the bedrock of real-world models.

Unstructured data in general, and the rapidly evolving landscape of GenAI & LLMs in particular, promises many exceptional AI capabilities. These capabilities rely heavily on one crucial factor – high-quality data. Without the right data, these groundbreaking models won't deliver the desired outcomes for your specific use cases.

Building better datasets with Data Engine

We all know the drill – academic projects often deal with static datasets, but when it comes to production, data is constantly streaming in, changing, and evolving. To stay ahead in the game, we need to level up our datasets and continually improve them to enhance model performance.

It’s so important, in fact, that our field even coined a term for it – Data-Centric AI.

However, improving datasets is easier said than done, especially when dealing with unstructured data. As machine learning engineers, we face challenges like identifying edge cases, managing subsets for specific tasks, and maintaining dataset quality over time. The struggle is real, and we recognized the need for a comprehensive solution to tackle these hurdles head-on.

Iterating on your unstructured dataset is an involved process

To improve your datasets, you need to iterate quickly on each of the following:

Analysis – Understand where your model is failing and why
Collection – Get more data
Validation & Review – Ensure the data is good enough to add to the training dataset
Annotation – Add ground truth labels to your data
Curation – Create subsets of your dataset for finetuning or testing
Serving – Get your datasets into a trainable format
Versioning & Lineage – Since your data and metadata (annotations, predictions) constantly change, tracking the relationship between models and the data they were trained on is critical

Try to do all of this with unstructured data, for example, images for object detection, text for LLM finetuning, or audio for speech-to-text, and you’ll find yourself facing a few huge challenges:

No end-to-end flow – No one tool covers the entire flow above. Most tools for this process we’ve seen cover a subset of the capabilities, requiring you to invest a lot of your (or your MLOps Engineer’s) time setting up infrastructure and writing glue code to connect the various parts of the flow.
No centralized context – Even after you do that, things are unlikely to work smoothly. When you're working in a team, data and information fall through the cracks, which slows down progress and increases the chance of human error. Often, multiple people are changing the data, annotations, and metadata. Coordinating these changes is challenging, and you’ll find it hard to know which version of your dataset is the latest, agreed-upon one.
Improving the model where it fails is hard – When you deploy your model to production, you’ll find it doesn’t work equally well for all cases. Maybe your object detection model doesn’t work as well at night, or your ChatGPT chatbot answers questions about specific features in your product poorly. You need to collect samples from these hard cases and integrate them back into your training dataset, or manage different subsets of your data for these edge cases. You might copy your data aside into different folders for each use case, but solutions like that are hacky, error-prone, and fragile.
You can’t close the loop – Often, the last part of the loop, collecting additional data and integrating it into the dataset, remains unsolved. You need to annotate the new data and review its quality before incorporating it into your existing datasets. This usually requires a lot of custom glue code, which can also result in things falling through the cracks.
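To make the "folders of copies" problem above concrete, here is a minimal sketch of the alternative: keeping one metadata index and defining each edge-case subset as a query over it. This is plain Python written for illustration, not the Data Engine API; the `Datapoint` class, the `SUBSETS` registry, the `select` helper, and the bucket paths and metadata fields are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Datapoint:
    """One entry in the index: a pointer to a file plus its metadata."""
    path: str  # where the file lives; the file itself is never copied
    metadata: dict[str, Any] = field(default_factory=dict)


# Hypothetical index; paths and metadata fields are made up for illustration.
INDEX = [
    Datapoint("s3://bucket/img_001.jpg", {"time_of_day": "night", "confidence": 0.41}),
    Datapoint("s3://bucket/img_002.jpg", {"time_of_day": "day", "confidence": 0.93}),
    Datapoint("s3://bucket/img_003.jpg", {"time_of_day": "night", "confidence": 0.88}),
    Datapoint("s3://bucket/img_004.jpg", {"time_of_day": "day", "confidence": 0.37}),
]

# A subset is a named predicate over metadata, not a folder of copied files.
SUBSETS: dict[str, Callable[[dict[str, Any]], bool]] = {
    "night": lambda m: m["time_of_day"] == "night",
    "low_confidence": lambda m: m["confidence"] < 0.5,
}


def select(index: list[Datapoint], subset_name: str) -> list[str]:
    """Return the paths of all datapoints matching the named subset."""
    predicate = SUBSETS[subset_name]
    return [dp.path for dp in index if predicate(dp.metadata)]


night_paths = select(INDEX, "night")
hard_paths = select(INDEX, "low_confidence")
print(night_paths)
print(hard_paths)
```

Note that `img_001.jpg` belongs to both subsets without existing twice anywhere. Adding a new edge case means adding one predicate, and every subset picks up new data as soon as it lands in the index.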

Introducing DagsHub Data Engine

To address the issues above, we’re excited to share a new toolset built especially for them: Data Engine.

Data Engine was built to help you tackle the challenges of unstructured data iteration. With Data Engine, you get an end-to-end flow, from data collection all the way to model training, with no glue code, no infrastructure setup, and version control and collaboration built into the foundation.

Data Engine Superpowers – A simplified view

Here’s what Data Engine does for you:

Zero-Copy Dataset Management: Collect data to your object storage or DVC-versioned directories, and use it as a source for your datasets. We separate your data from the metadata, annotations, and predictions, and enrich your metadata. Metadata is versioned so you can always go back to a previous point in time when you need to reproduce a result.
Infinitely Flexible Metadata Enrichment API: Use an intuitive Python API to add custom metadata, annotations, or predictions. Data Engine supports almost any metadata type, including numbers, strings, and booleans, but also arbitrary binary data, so you can add custom annotation formats, images, or pickle files as metadata too. This means you can associate the metadata you need with your data, without worrying you’ll need to manage it in a custom way for each unique column. It just works.
Pandas-like Data Querying: Our early users describe this as a “Pandas dataframe in the cloud”. After adding your metadata, you can query it to create subsets for specific team members, use cases, or tests. Since this is centralized, anyone can use your datasets without duplicating data. This makes creating task-oriented subsets of your data easy, enabling you to quickly solve new edge cases that arise in production.
Visualize Unstructured Datasets: Visualizing your datasets is one command or one click away. Often, there is no substitute for simply looking at your data, and with Data Engine you don’t need to write any special code to do so. It just works.
Annotation and Auto-Labeling: Annotating new data, or improving the annotations of existing data points, is critical for iterating on dataset quality. With Data Engine, you can send data to annotation with a click, and save it back to your dataset for training with another. Our seamless annotation flow supports connecting your own models for auto-labeling, so you can assist your labelers, or use bigger models to train smaller ones.
Dataloaders for data serving: When you’re ready to train, Data Engine makes it easy to convert your dataset to a PyTorch- or TensorFlow-compatible dataloader that fits your data types and downloads data as you need it, so your training run works smoothly. You can then use the DagsHub integration with MLflow to log your experiment and model and deploy it to production from our model registry.
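If you're wondering what "downloads data as you need it" looks like in practice, here is a rough sketch of the general idea: a generator that fetches files only when a batch is actually requested. It is not the Data Engine dataloader, and it doesn't use PyTorch or TensorFlow; `lazy_batches`, `fake_fetch`, and the paths are hypothetical stand-ins, with `fake_fetch` playing the role of a real download.

```python
from typing import Callable, Iterable, Iterator


def lazy_batches(
    paths: Iterable[str],
    fetch: Callable[[str], bytes],
    batch_size: int,
) -> Iterator[list[bytes]]:
    """Yield batches of file contents, fetching each file only when needed."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: list[bytes] = []
    for path in paths:
        batch.append(fetch(path))  # the download happens here, on demand
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


# Stand-in for a real download; records which paths were requested.
fetched: list[str] = []


def fake_fetch(path: str) -> bytes:
    fetched.append(path)
    return path.encode("utf-8")


# Hypothetical paths, made up for illustration.
PATHS = [f"s3://bucket/audio_{i}.wav" for i in range(5)]

loader = lazy_batches(PATHS, fake_fetch, batch_size=2)
first_batch = next(loader)
fetched_after_first = list(fetched)  # only the first two files so far
remaining = list(loader)             # consuming the rest fetches the rest
```

After the first batch, only two of the five files have been fetched. The training loop can start right away instead of waiting for the whole dataset to download.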

You can start improving your machine learning projects with DagsHub Data Engine right now. Sign up to DagsHub and experience the power of Data Engine firsthand.

What’s Next for Data Engine

As we bask in the excitement of Data Engine's launch, we're already gearing up for more remarkable capabilities. In the coming weeks, we plan to roll out some amazing features, including:

Dataset Versioning – Go back to any point in time to use your datasets as they were then. This is the feature closest to release, and it will enable full reproducibility of Data Engine experiments.
Cloud Dataset Visualization – Visualize your datasets effortlessly on the cloud, facilitating easier collaboration and exploration.
Vector Similarity Search – Experience the power of vector search capabilities, revolutionizing how you find and interact with your data.
More Automatic Metadata Calculation – Automate metadata calculation for even greater efficiency and reduced manual work.
Dedicated Visualizations for Additional Data Types – Enhance your data visualization experience with dedicated visualizations tailored for various data types.
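To give a feel for what "go back to any point in time" means for a dataset, here is a small sketch of one common approach: an append-only log of metadata changes that can be replayed up to a chosen version. This is an illustration under our own assumptions, not how Data Engine implements versioning; `MetadataChange`, `MetadataLog`, and the integer version counter (standing in for a timestamp) are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MetadataChange:
    version: int  # increasing counter standing in for a timestamp
    path: str
    key: str
    value: Any


class MetadataLog:
    """Append-only log of metadata changes with point-in-time snapshots."""

    def __init__(self) -> None:
        self._changes: list[MetadataChange] = []

    def update(self, path: str, key: str, value: Any) -> int:
        """Record a change and return the version it created."""
        version = len(self._changes) + 1
        self._changes.append(MetadataChange(version, path, key, value))
        return version

    def as_of(self, version: int) -> dict[str, dict[str, Any]]:
        """Replay changes up to and including `version`."""
        snapshot: dict[str, dict[str, Any]] = {}
        for change in self._changes:
            if change.version > version:
                break  # the log is ordered, so later changes can be skipped
            snapshot.setdefault(change.path, {})[change.key] = change.value
        return snapshot


log = MetadataLog()
v1 = log.update("img_001.jpg", "label", "cat")
v2 = log.update("img_002.jpg", "label", "dog")
v3 = log.update("img_001.jpg", "label", "lynx")  # a reviewer corrects the label

before_fix = log.as_of(v2)  # the dataset as an earlier experiment saw it
latest = log.as_of(v3)
```

Because nothing is ever overwritten, an experiment trained at `v2` can be reproduced later, even after the label for `img_001.jpg` was corrected.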

Stay tuned, folks! Lots to look forward to.

Join the Data Engine revolution!

Ready to supercharge your machine learning projects with Data Engine? Don't wait another moment! Sign up to DagsHub today and unleash the full potential of your data-driven endeavors. It's time to embrace the future of data science, and we're here to lead the way!

We can't wait to see what you'll achieve with Data Engine! As always, we're here to support and empower your data-driven journey. Let's build incredible models and get them into production, together!

Happy building! 🚀
