Create production-grade, training-ready datasets for machine learning

We provide an out-of-the-box solution with a clear display of your datasets, querying, annotations, and lineage, and ultimately a faster way to experiment and improve models.

>> Get early access <<

We cover every step of creating training-ready datasets

Features

Seamless connection to your existing storage

A simple interface for connecting your external storage, no DevOps needed.
We currently support S3, Google Cloud, and S3-compatible storage, with more to be added in the near future.

Dataset versioning and lineage

A clear, organized display of your datasets,
including visual lineage that connects datasets, models, experiments, labels, and predictions.
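
To make the idea of lineage concrete, here is a minimal sketch of it as a directed graph, where each edge points from a source artifact to something derived from it. The Lineage class and the artifact names such as "dataset:v2" are hypothetical and chosen for illustration only; they are not the Data Engine's actual data model or API.

```python
from collections import defaultdict


class Lineage:
    """Toy lineage graph: records which artifacts each artifact was derived from."""

    def __init__(self):
        # Maps a derived artifact to the set of its direct sources.
        self._parents = defaultdict(set)

    def link(self, source, derived):
        """Record that `derived` was produced from `source`."""
        self._parents[derived].add(source)

    def ancestors(self, node):
        """Return every artifact that `node` depends on, directly or indirectly."""
        seen, stack = set(), [node]
        while stack:
            for parent in self._parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical artifacts: a dataset version built from an earlier version
# plus a batch of labels, then used in an experiment that produced a model.
lineage = Lineage()
lineage.link("dataset:v1", "dataset:v2")
lineage.link("labels:batch-1", "dataset:v2")
lineage.link("dataset:v2", "experiment:run-7")
lineage.link("experiment:run-7", "model:v3")
lineage.link("model:v3", "predictions:v3")

upstream = lineage.ancestors("predictions:v3")
```

Walking the graph backwards from a set of predictions answers the question "which data, labels, and experiment produced this?", which is what a visual lineage view presents.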

Data querying

Pick the most relevant data points to improve a model where its performance is low. Filter, sort, and search for similar examples, then save the result as a new training-ready version of the dataset.
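
As a rough illustration of that workflow, the sketch below filters for low-confidence predictions, sorts them, and then looks up the most similar example by embedding. It is plain Python, not the Data Engine's query API; the DataPoint fields, file paths, confidence values, and embeddings are all made up for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class DataPoint:
    path: str
    label: str
    confidence: float  # the model's confidence in its prediction (assumed field)
    embedding: tuple   # feature vector used for similarity search (assumed field)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def low_confidence(points, threshold):
    """Filter to points below the threshold, sorted least confident first."""
    return sorted(
        (p for p in points if p.confidence < threshold),
        key=lambda p: p.confidence,
    )


def most_similar(points, query, k):
    """Return the k points whose embeddings are closest to the query's."""
    others = [p for p in points if p is not query]
    return sorted(
        others,
        key=lambda p: cosine_similarity(p.embedding, query.embedding),
        reverse=True,
    )[:k]


points = [
    DataPoint("img/001.jpg", "cat", 0.95, (1.0, 0.0)),
    DataPoint("img/002.jpg", "dog", 0.40, (0.0, 1.0)),
    DataPoint("img/003.jpg", "dog", 0.55, (0.1, 0.9)),
    DataPoint("img/004.jpg", "cat", 0.30, (0.9, 0.1)),
]

hard = low_confidence(points, threshold=0.6)
neighbors = most_similar(points, hard[0], k=1)

# The saved "new version" here is simply the set of selected file paths.
new_version = {p.path for p in hard + neighbors}
```

Starting from the weakest prediction and pulling in its nearest neighbors is one simple way to gather more examples of the cases a model struggles with.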

Annotations

Annotate relevant data points in one click with zero setup. Use existing models to automatically label your data, then fine-tune the labels manually.
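
One common way to combine automatic labeling with manual fine-tuning is to accept the model's confident predictions and queue the rest for a human. The sketch below assumes that pattern; auto_label, toy_model, and the 0.8 threshold are hypothetical stand-ins, not part of DagsHub.

```python
def auto_label(items, predict, min_confidence=0.8):
    """Split items into auto-accepted labels and a queue for manual review.

    `predict` is any callable that returns a (label, confidence) pair.
    """
    accepted, needs_review = {}, []
    for item in items:
        label, confidence = predict(item)
        if confidence >= min_confidence:
            accepted[item] = label
        else:
            needs_review.append(item)
    return accepted, needs_review


def toy_model(text):
    """Hypothetical stand-in for an existing model."""
    if "great" in text:
        return "positive", 0.90
    if "bad" in text:
        return "negative", 0.85
    return "positive", 0.50  # unsure: falls below the threshold


reviews = ["great product", "bad battery", "arrived on Tuesday"]
accepted, needs_review = auto_label(reviews, toy_model)
```

Only the items in needs_review require a person's attention, so manual effort goes to the examples the model is least sure about.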

Experiments and retraining

Use subsets of your data to experiment and retrain your model by streaming them directly to your pipeline, and track your experiments within DagsHub.

Why DagsHub Data Engine

We set up everything for you

We provide an out-of-the-box solution for data queries, visualizations, versioning, and annotations. No need to set up and maintain complicated infrastructure.

Collaborate with your teammates on subsets

The Data Engine will allow you to share your queries and results with your teammates so they can continue where you left off.

A single source of truth

Manage and display your datasets, code, labels, experiments, models, and queries in one platform.